A T-SQL Regular Expression Library for SQL Server 2005

Steve Abraham

4.81/5 (38 votes)

Jul 6, 2007

CPOL

9 min read

266433

2320

This article shows the reader how to construct a library of scalar and table valued functions for SQL Server 2005 to perform regular expression analysis.

Introduction

With the advent of CLR integration into SQL Server 2005, it has become incredibly easy to extend the power of the T-SQL programming language. Two of the areas that can be improved upon by way of CLR integration are string matching and string manipulation.

Background

T-SQL has a handful of basic string matching functions (e.g. CHARINDEX, PATINDEX, SOUNDEX) and string matching operators (e.g. =, <>, <, >, LIKE). These are insufficient for tasks such as creating a check constraint on a column of email addresses where you want to ensure only valid email addresses exist. The most common way to validate strings like this is by using regular expressions. This is where CLR integration comes into play.

SQL Server 2005 now allows you to create user defined functions (among other things) using your .NET language of choice. This article will demonstrate how to develop a set of general purpose, user-defined, regular expression functions for consumption in T-SQL using C#.

Using the Code

General Approach

My objective here is to wrap some of the more commonly used static methods of the RegEx class in the .NET Framework into something useable in a T-SQL environment. I felt the best approach was to develop this wrapper as a set of user defined functions that closely mirror the inputs and outputs of these methods in the RegEx class.

Interface

All four of the functions listed in this article share the same first two parameters:

@Input NVARCHAR(MAX)

This is the string to be analyzed. You may pass either a literal string, or a column name. This is the string on which the regular expression will be executed.

@Pattern NVARCHAR(MAX)

This is the regular expression which will be executed against the @Input parameter.

In addition, all four functions share the same last parameter.

@IgnoreCase BIT

This is a boolean parameter which when set to "1" will instruct the regular expression engine to ignore cases when executing the regular expression against the input string. If this parameter is set to "0", a case-sensitive analysis will be performed.

Functions

ufn_RegExIsMatch

The purpose of this function is to mirror the functionality of the Regex.IsMatch method. In short, if the pattern specified in the @Pattern parameter is found within the string specified by @Input, the return value will be "1". Otherwise, the return value will be "0".

ufn_RegExMatches

While ufn_RegExIsMatch will tell you if there is a match of @Pattern within @Input, this function will tell you what the matches are, where they are located in the string and how long each match is. This function wraps the functionality of the Regex.Matches method in the .NET Framework. This function returns a table. The columns of this table are as follows:

Match NVARCHAR(MAX)
MatchIndex INT
MatchLength INT

ufn_RegExReplace

This function mirrors the functionality of Regex.Replace in the .NET Framework and closely resembles the functionality of the REPLACE function in T-SQL. The primary difference between ufn_RegExReplace and REPLACE is that the matching in REPLACE is done solely on literal string match comparison while ufn_RegExReplace matches based on the regular expression specified in the @Pattern parameter.

ufn_RegExReplace also takes an additional parameter that none of the other functions in this library contain which is the @Replacement parameter. This is another NVARCHAR(MAX) parameter which specifies the literal string used to replace matches identified in @Input when @Pattern is executed.

This function returns an NVARCHAR(MAX)which represents the input string with the specified replacements in place.

ufn_RegExSplit

Splitting a delimited string into its elements is a common task among T-SQL developers. The ufn_RegExSplit function does exactly that. This implementation uses the Regex.Split method from the .NET Framework. ufn_RegExSplit, much like the ufn_RegExMatches function returns a table as its output. This table however has only one column which is of the data type NVARCHAR(MAX). The column name is Match and it contains the elements of the string split out by the delimiter specified in @Pattern.

Note: I realize that for most implementations, simply delimiter splitting is sufficient and for such a situation, I would recommend a similar function that uses the String.Split method rather than regular expressions. However, since the focus of this article is regular expressions in T-SQL, I decided to stick with the regular expression implementation.

SQL Server Implementation

Now that the assembly has been created, we need to implement it into SQL Server. The steps below outline the process.

Determine the Database where the Assembly will Reside

Assemblies are database level objects – not server level objects. You will need to select the database that will contain the assembly:

--Set the database to which these functions will be installed
use AdventureWorks
GO

Enable CLR Integration

CLR integration is disabled by default. To enable it, you must run sp_configure and set the "clr enabled" property to 1. You must then issue the "RECONFIGURE" command for the setting to take effect. Otherwise, the setting will not take effect until SQL Server is restarted.

--Enable CLR Integration
exec sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

Create the Assembly

Once you've enabled CLR integration, the next step is to "create the assembly" by importing it from the actual compiled DLL file. SQL Server stores the assembly as a stream of bytes inside of the database. Once the assembly has been imported using the "CREATE ASSEMBLY" command, you no longer need the actual DLL file.

CREATE ASSEMBLY [SqlRegEx] FROM 'C:\SqlRegEx.dll' WITH PERMISSION_SET = SAFE

Create the Functions

Once the assembly has been created, you can create the actual functions. The syntax to do this is very similar to creating a T-SQL function. The primary difference is instead of writing out the function body, you simply specify "AS EXTERNAL NAME" followed by [AssemblyName].[Namespace.Class].[Method] where [AssemblyName] is the name of the assembly in SQL Server and Namespace, Class and Method all refer to the namespace, class and method inside of the assembly.

CREATE FUNCTION [dbo].[ufn_RegExIsMatch]
(@Input NVARCHAR(MAX), @Pattern NVARCHAR(MAX), @IgnoreCase BIT)
RETURNS BIT
AS EXTERNAL NAME SqlRegEx.[SqlClrTools.SqlRegEx].RegExIsMatch

Call the Functions

At this point, the functions are loaded and ready to go. To call them, you simply use the same syntax that you would to call any other T-SQL function.

SELECT dbo.ufn_RegExIsMatch('Hello World', 'w', 1) --Ignores Case
SELECT dbo.ufn_RegExIsMatch('Hello World', 'w', 0) --Case Sensitive

Points of Interest

PERMISSION_SET

When creating an assembly in SQL Server, you must specify the requested permission set for the assembly.

In most instances, you will select "SAFE" as this article does. This means that you are not directly accessing system resources such as the disk, network, etc. This is the most restrictive and safest set of permissions.

If you require access to the disk or network or other resources you may need to use the "EXTERNAL_ACCESS" permission set.

In the most rare of circumstances, you may need to use "UNSAFE". This permission set grants everything that EXTERNAL_ACCESS grants and also allows access to unmanaged code. Use this setting with great caution.

Unicode and NVARCHAR

All .NET strings are handled in Unicode. As such, all string data into and out of these functions is of the NVARCHAR data type. I chose to use the "MAX" length as we have no idea how large the strings are that are being passed to these functions. Of course, you may wish to limit this in your own implementation.

Attributes

You may notice that each of the methods included in the SqlRegEx class are decorated with certain attributes. The list below will outline the purpose of each.

DataAccess

The DataAccess attribute signals to SQL Server whether this method will access any user data on the current instance of SQL Server. In the case of these functions, none of them do so directly, so the value for this attribute is false for each.

IsDeterministic

To quote the SQL Server Books Online "Deterministic functions always return the same result any time they are called with a specific set of input values and given the same state of the database. Non-deterministic functions may return different results each time they are called with a specific set of input values even if the database state that they access remains the same." All of the functions listed in this article are deterministic and as such this property has been marked true.

IsPrecise

The IsPrecise attribute indicates if the result is precise or not. For example, a function calculating values using the FLOAT data type would not be precise as the FLOAT data type can lose information during calculations. The functions in this library are precise and as such, this attribute is marked as true for each function.

Name

The Name attribute contains a string indicating what this function should be called when registered in SQL Server. This field is not required and has no practical value in SQL Server, but I have set it for each of these functions for my own personal documentation.

SystemDataAccess

The SystemDataAccess attribute is much the same as the DataAccess attribute with the exception that it refers to system data rather than user data.

FillRowMethodName

The FillRowMethodName is only set for the functions that return tables. You will notice that for each function that returns a table, the actual return type in the .NET code is IEnumerable. That means that whatever you return from this function must implement the IEnumerable interface. The IEnumerable interface allows the .NET Framework to loop through your results and for each result in your result set, it will call the function listed in the FillRowMethodName attribute. If you look at the functions that are listed in the FillRowMethodName attribute, you will see that they each take one object input parameter which represents the current element in the above mentioned loop. The rest of the parameters are output parameters and they will represent the columns of table that is output to SQL Server.

One thing that bothers me about this particular setup is that it does not appear that you can use System.Collections.Generic.IEnumerable<T> for a return type from your function. Instead you must use System.Collections.IEnumerable. This means that you must accept an object for the first parameter in your fill row method instead of being able to specify the actual type. This of course means boxing and un-boxing operations each time you create a new row. Since generics are available in .NET 2.0, I think it would be appropriate to allow them here.

Deterministic vs. Non-Deterministic

It is worth noting that if you plan to use either of the scalar functions listed in this article in a computed column and you wish to persist that computed column, the function must be marked as "Deterministic". You cannot persist a computed column that contains anything that is not deterministic. If you cannot persist your column, you will not be able to apply any indexes to it and it will be re-computed every time it is called. If your function is truly deterministic, you would want to make sure that it is not recomputed every time since by definition deterministic functions always return the same information, you would waste CPU cycles.

Summary

The CLR is a powerful new tool available to SQL Server developers. It opens up nearly the entire .NET Framework to SQL Server and if used appropriately, can bring tremendous power and value to any SQL Server application.