SQL Zip Compression, RegEx and Random Functions

Brad Joss

4.00/5 (1 vote)

Dec 18, 2014

CPOL

7 min read

20602

305

SQL CLR Functions for Zip Compression and Regular Expression evaluations

Download source - 151.6 KB

Introduction

The attached code exposes the SharpZipLib zip compression library as a thread-safe Common Language Runtime (CLR) SQL functions. Additional functions included in the library (since they were already compiled into our production systems) are provided for Regular Expression (RegEx) evaluation within SQL. This is probably best introduced though a quick example:

-- Zip Example --
DECLARE @input nvarchar(4000) = REPLICATE(N'HELLO 1 HELLO 2 HELLO 3 HELLO', 100)
DECLARE @zipped nvarchar(1000) = dbo.ZipString(@input)
SELECT 
    @zipped, -- eJztyTENACAMADArswBoIN ...
    LEN(@input), -- 2900 bytes
    LEN(@zipped), -- 68 bytes
    100.0 - (100.0 * ((LEN(@zipped)*1.0)/(LEN(@input)*1.0))), -- 97.6% Reduction
    IIF(@input = dbo.UnzipString(@zipped), 'PASS', 'FAIL') -- Pass

-- RegEx Example --
select dbo.RegexMatchCount([Phone], '\d') as [DIGIT_COUNT]
from #Users

Background

Note: This code is only made possible through the use of SharpZipLib) which is an amazing project. They have my profound thanks.

While SQL Server natively supports storing data as compressed (See MSDN for details), with this library we are able to achieve goals that transcend any one application layer. Some practical uses of this code might include:

An N-Tiered application where the client end-point and the database need to share large chunks of highly compressible data (like XML or HTML) and SQL Server does not need to access that data very often. For example: client specific application settings (e.g. how to render out the UI components) stored as XML.
Large/Expensive to compute queries that are computed and stored in a batch job, but need to be delivered to a zip-enabled client in HTML/XML. For example: Customer sales reports that use data from several remote system, but needs to be readily available to the calling client UI. In this case, you could compute the results in the batch job FOR XML AUTO, zip the results at the time of compute and store in a table keyed by customer ID. When the UI needs the data, they send a simple query to get the zipped result by customer ID, and decompress the data on the client. Since the "hard" work of zipping and computing was all done in backend batches, the UI remains very responsive and is very low impact on SQL at client runtime.
The creation of SQL scripts that contain large (many MB) text string. SQL Server Management Studio seems to struggle to load scripts of this nature, so compressing the data sometimes is the only viable option. For example, in my other posting about transforming .NET DataTables into SQL scripts, it would be ideal to export the script data compressed.

The biggest advantage is that since the data is stored and delivered compressed, it is low impact on SQL (both Disk I/O and CPU) and low impact to the network to deliver the data to the client. This opposed to SQL native compression where SQL compresses on receive and decompresses on send, the network then recompresses while sending, then the client decompresses the network packet on receive.

Using the Code

To deploy the code, you can run the included "SQLCLRCommon.publish.sql" against your target database, or compile and deploy using Visual Studio. Once deployed, the basic usage is pretty straight forward: To compress a string value, call dbo.ZipString, to decompress, call dbo.UnzipString.

DECLARE @input nvarchar(4000) = N'HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO HELLO'
DECLARE @zipped nvarchar(1000) = dbo.ZipString(@input)
SELECT @zipped, dbo.UnzipString(@zipped)

There are a few other helpful functions included in the assembly:

[dbo].[CLRTestText](@text NVARCHAR (4000)) RETURNS NVARCHAR (4000)
Returns meta data details about the passed text. Used in debugging.
[dbo].[Random] (@min INT, @max INT, @seed INT) RETURNS INT
Returns a semi-random int between the @min (inclusive) and @max (exclusive) using the specific seed (or null for a time-based seed). Useful when using ROW_NUMBER() as the seed to generate large sets of random data.
[dbo].[RandomString] (@len INT, @includeLower BIT, @includeUpper BIT, @includeNumbers BIT, @includeExtended BIT, @seed INT) RETURNS NVARCHAR (MAX)
Returns a semi-random string, useful for generating pseudo data for testing systems early in the development process when the system has no "real" data yet, but you need the data to have a cardinality roughly similar to what you might see in production.
[dbo].[RegexIsMatch] (@text NVARCHAR (4000), @pattern NVARCHAR (4000)) RETURNS BIT
Test a text value against a regular expression returning 1 (true) if the expression matches at least once, else 0 (false).
[dbo].[RegexMatch] (@text NVARCHAR (4000), @pattern NVARCHAR (4000)) RETURNS INT
Return the first zero-based index position of text matching the passed regular expression or -1 when no match is found.
[dbo].[RegexMatchCount] (@text NVARCHAR (4000), @pattern NVARCHAR (4000)) RETURNS INT
Return a count of all matches matching the regular expression.
[dbo].[ToTitleCase] (@text NVARCHAR (4000)) RETURNS NVARCHAR (4000)
Format a string in Title Case, e.g. convert "star wars" to "Star Wars".

The C# unit tests and SQL unit test scripts ("Assembly Function Tests.sql", " ZipTest.sql") have more detailed examples of how the code can be used, but here are the basics:

SELECT [dbo].[Random] (1, 100, NULL) -- e.g.: 33

SELECT [dbo].[RandomString] (10, 1, 1, 1, 0, NULL) -- e.g.: 'Tj1LZPGLno'

SELECT [dbo].[RegexIsMatch] ('ABC1234', '\d{4}') -- e.g.: 1

SELECT [dbo].[RegexMatch] ('ABC1234', '\d{4}') -- e.g.: 3

SELECT [dbo].[RegexMatchCount] ('ABC1234', '\d') -- e.g.: 4

SELECT [dbo].[ToTitleCase] ('star wars') -- e.g.: 'Star Wars'

SELECT [dbo].[CLRTestText](N'Hello')
/*
-- Example Output --
[Text] = "Hello"
[LCID] = "1033"
[IgnoreCase] = "True"
[CompareInfo.Name] = "en-US"
[CultureInfo.DisplayName] = "English (United States)"
[CultureInfo.ThreeLetterISOLanguageName] = "eng"
[TextInfo.ANSICodePage] = "1252"
*/

Points of Interest

Zip Code Challenges

The only really challenging part was creating a thread-safe SharpZipLib. Out of the box, SharpZipLib is not known to be thread-safe, which is a requirement for SQL CLR functions. To solve this, I removed all SharpZipLib code not directly related to compressing streams (e.g. the file compression stuff), which is why that folder is named SharpZipLibLite. For the remaining public static writable properties, they were moved to two instance classes DeflaterHuffmanStatic and InflaterHuffmanTreeStatic, both of which are then bundled under another instance class called StaticClasses, which is passed into the SharpZipLib stream handlers:

StaticClasses sc = new StaticClasses();
using (var stream = new DeflaterOutputStream(memoryStream, ref sc)){…}

StaticClasses sc = new StaticClasses();
using (Stream s2 = new InflaterInputStream(new MemoryStream(byteInput), ref sc)){…}

For more details, review ZipUtils.cs/ZipBytes and UnzipBytes. If you want to use a newer version of SharpZipLib, you can follow this same pattern to alter that code as you best see fit.

There were several other smaller challenges, like how to best encode the compressed byte array as a string value? The solution was to use System.Convert.FromBase64String and ToBase64String for a hard-coded System.Text.Encoding of UTF-8. To use a different encoding, like ASCII, alter the SERIALIZATION_ENCODING in ZipUitls.cs, although I have not tested the code with anything other than UTF-8.

One of the challenges that remains is effectively using the System.IO.Streams to compress and decompress the data. At present, the code is taking the whole Stream, converting it to a string value which is then converted to SqlChars to be returned to SQL. This is in part so that we can use the same assembly and its base functions (ZipBytes and UnzipBytes) both on a .NET-enabled client and on SQL Server. If there is a better way to achieve this goal with more optimized code, feel free to suggest alternatives.

Random Challenges

The problem with Random was the seed. If you use the default seed (based on current time), the random distribution is too low since so many results get returned per millisecond. Likewise, we can't just use ROW_NUMBER() since then we will get the same random values every time for row 1, 2, 3 and so on, so it isn't very random. The "split the difference" solution is in the function SafeSeed that tries to get the passed value and use in conjunction with the current time within the year. Coders who truly love math challenges likely could come up with better ideas, so feel free to offer constructive feedback.

The other problem was with RandomString. The current code is somewhat optimized for large string values (say more than 255 characters) since it creates a pool of all possible random values to return and then pulls the values from the pool. If you are getting very small strings, this is relatively inefficient, so be advised.

RegEx Challenges

To avoid having to pass in a "Is Case Sensitive" flag, I tried to make it smart and parse the input text metadata. If the input string is of a case sensitive collation, the RegEx will evaluate as case sensitive, else case insensitive. This is in part why I exposed CLRTestText() to determine the case sensitivity of the text string.

Future Work

We have been using this library for a few years now and it has worked well for our team and our projects. However, while putting this article together, I see some opportunities to do some cool new work to extend this in different directions. Notably:

Exposing ZipBytes and UnzipBytes natively as SQL functions to allow them to be used on binary data (varbinary).
Exposing the full suite of SharpZipLib code, not just the stream compression stuff. I am not sure this would be possible, but it would be fun to try.
Exposing a RegExMatches function that returns a table of found matches for a particular expression. For example, RegExMatches('1234','\d{2}') would return a two row table with row one containing "12" and row two containing "34". Additional columns for match position and what not would be a bonus.

If you feel as though you have made a significant contribution to this code, send it my way so I can post it!

History

17 Dec 2014 - Initial public draft
23 Dec 2014 - Minor article updates, code is unchanged.