What is the best way to create Regular Expressions?

Alex Perepletov

26 Sep 2008 | CPOL | 2 min read

A convenient way to document the intent of each part of a regex.

Regular Expressions are notorious for being confusing to read and understand. The longer the Regular Expression, the higher the chance of making a mistake in it, and the more difficult it is to debug or modify. Even if every Regular Expression were commented thoroughly, it would still suffer from being a single long line of characters.

Consider this Regular Expression that is found at http://regexlib.com/REDetails.aspx?regexp_id=731:

(?s)( class=\w+(?=([^<]*>)))|(<!--\[if.*?<!\[endif\]-->)|
  (<!\[if !\w+\]>)|(<!\[endif\]>)|(<o:p>[^<]*</o:p>)|
  (<span[^>]*>)|(</span>)|
  (font-family:[^>]*[;'])|(font-size:[^>]*[;'])(?-s)

There's nothing wrong with the expression itself. Unfortunately, no matter how thoroughly we document it, we cannot easily associate a comment, visually, with the part of the Regular Expression string that it describes.

The real problem is that a single long Regular Expression line does not allow a developer to show the intent of each significant part of it. Each part of a Regular Expression must scream its purpose. If a Regular Expression is several lines long, and it does not work properly, the developer will have a hard time locating the point that is responsible for the failure.

The solution is really simple. I have not seen a similar technique used anywhere, so this feels like a good example to share. Instead of entering a Regular Expression as a single long cryptic string, the string is built at run time by concatenating very short strings. Each short piece of the Regular Expression is commented separately.

For example, the following class creates a regex to validate a Canadian postal code:

using System.Text;

public class CanadianPostalCodeRegex
{
    /// <summary>
    /// Canadian postal code regular expression pattern.
    /// </summary>
    private string _strPattern;
    /// <summary>
    /// Singleton access.
    /// </summary>
    private static readonly CanadianPostalCodeRegex Instance = new CanadianPostalCodeRegex();


    private CanadianPostalCodeRegex()
    {
        StringBuilder patternBuilder = new StringBuilder();

        // Pattern description:
        // Start of string.
        patternBuilder.Append(@"^");
        // Start the FSA group
        patternBuilder.Append(@"(?<FSA>");
        // FSA group consists of ANA, where A is a letter and N is a digit
        patternBuilder.Append(@"\p{L}\d\p{L}");
        // End the FSA group
        patternBuilder.Append(@")");
        // An optional single white space
        patternBuilder.Append(@"\s?");
        // Start the LDU group
        patternBuilder.Append(@"(?<LDU>");
        // LDU group consists of NAN, where A is a letter and N is a digit
        patternBuilder.Append(@"\d\p{L}\d");
        // End the LDU group
        patternBuilder.Append(@")");
        // End of string.
        patternBuilder.Append(@"$");

        _strPattern = patternBuilder.ToString();
    }



    /// <summary>
    /// Gets the Canadian postal code regex pattern.
    /// </summary>
    public static string Pattern
    {
        get { return Instance._strPattern; }
    }
}
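The class above exposes only a pattern string, so the caller still has to create a Regex from it. The following is a minimal sketch of how that might look, not part of the original class. The names PostalCodeDemo and TryParse are hypothetical, and the pattern is inlined (it is the string the Append calls above add up to) so that the sketch compiles on its own. In a real project, one would pass CanadianPostalCodeRegex.Pattern instead.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate how the pattern might be consumed.
public static class PostalCodeDemo
{
    // Assumption: this is the string that CanadianPostalCodeRegex.Pattern returns,
    // inlined here so the sketch is self-contained.
    private const string Pattern = @"^(?<FSA>\p{L}\d\p{L})\s?(?<LDU>\d\p{L}\d)$";

    // Create the Regex once and reuse it for every call.
    private static readonly Regex PostalCode = new Regex(Pattern);

    // Returns true and fills in the two named groups when the input matches;
    // otherwise returns false and sets both outputs to null.
    public static bool TryParse(string input, out string fsa, out string ldu)
    {
        Match match = PostalCode.Match(input);
        fsa = match.Success ? match.Groups["FSA"].Value : null;
        ldu = match.Success ? match.Groups["LDU"].Value : null;
        return match.Success;
    }

    public static void Main()
    {
        string fsa, ldu;
        if (TryParse("K1A 0B1", out fsa, out ldu))
        {
            Console.WriteLine(fsa + " / " + ldu);
        }
        Console.WriteLine(TryParse("12345", out fsa, out ldu));
    }
}
```

Because the groups are named in the pattern, the calling code can read them by name rather than by position, which keeps it readable even if more groups are added later.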

A Regular Expression is created piece by piece. Each smallest meaningful unit is thoroughly commented. The intention of each part is crystal clear, which is a huge help when one needs to fix or modify the regex. At any given time, we only need to deal with a fairly small regex string, instead of an unwieldy cryptic monster.

This technique also promotes the syntactic correctness of the Regular Expression. For example, a group construct can be entered first, making sure the parentheses match:

// Start the LDU group
patternBuilder.Append(@"(?<LDU>");
// End the LDU group
patternBuilder.Append(@")");

Next, the group's pattern is entered.

// Start the LDU group
patternBuilder.Append(@"(?<LDU>");
// LDU group consists of NAN, where A is a letter and N is a digit
patternBuilder.Append(@"\d\p{L}\d");
// End the LDU group
patternBuilder.Append(@")");

Because the class is a Singleton, the pattern is built only once. The cost is a handful of StringBuilder appends, so there is virtually no performance penalty, while readability and maintainability improve significantly.
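The same approach can be applied to the long expression quoted near the start of the article. The sketch below is one possible breakdown, not the original author's. The class name QuotedPatternBuilder is hypothetical, the comments are my reading of each fragment (the expression appears to target markup produced by Microsoft Word), and it assumes the line breaks in the quoted listing are only wrapping and not part of the pattern.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical class: rebuilds the quoted expression from commented pieces.
public static class QuotedPatternBuilder
{
    // The quoted expression as one string, kept only to check the rebuilt version.
    public const string Original =
        @"(?s)( class=\w+(?=([^<]*>)))|(<!--\[if.*?<!\[endif\]-->)|" +
        @"(<!\[if !\w+\]>)|(<!\[endif\]>)|(<o:p>[^<]*</o:p>)|" +
        @"(<span[^>]*>)|(</span>)|" +
        @"(font-family:[^>]*[;'])|(font-size:[^>]*[;'])(?-s)";

    public static string Build()
    {
        StringBuilder patternBuilder = new StringBuilder();

        // Turn on single-line mode, so that "." also matches newlines
        patternBuilder.Append(@"(?s)");
        // A class attribute, only when the rest of the tag up to ">" follows
        patternBuilder.Append(@"( class=\w+(?=([^<]*>)))");
        // OR a conditional comment block: <!--[if ... <![endif]-->
        patternBuilder.Append(@"|(<!--\[if.*?<!\[endif\]-->)");
        // OR an opening conditional of the form <![if !name]>
        patternBuilder.Append(@"|(<!\[if !\w+\]>)");
        // OR a closing conditional: <![endif]>
        patternBuilder.Append(@"|(<!\[endif\]>)");
        // OR an <o:p> element that contains no nested tags
        patternBuilder.Append(@"|(<o:p>[^<]*</o:p>)");
        // OR an opening span tag with any attributes
        patternBuilder.Append(@"|(<span[^>]*>)");
        // OR a closing span tag
        patternBuilder.Append(@"|(</span>)");
        // OR a font-family declaration, up to a ";" or a "'"
        patternBuilder.Append(@"|(font-family:[^>]*[;'])");
        // OR a font-size declaration, up to a ";" or a "'"
        patternBuilder.Append(@"|(font-size:[^>]*[;'])");
        // Turn single-line mode off again
        patternBuilder.Append(@"(?-s)");

        return patternBuilder.ToString();
    }

    public static void Main()
    {
        // The rebuilt pattern should be character-for-character the same string.
        Console.WriteLine(Build() == Original);

        // Remove every match from a small sample of markup.
        string sample = "<p class=MsoNormal><span style='x'>Hi</span><o:p></o:p></p>";
        Console.WriteLine(Regex.Replace(sample, Build(), ""));
    }
}
```

With the pattern split this way, a failing alternative can be disabled by commenting out a single Append call, which is much harder to do safely in the one-line form.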

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Alex Perepletov

Software Developer (Senior)

Canada

Comments and Discussions

// FSA group
// Open a named capture group called FSA
const string FSA_GROUP_START = @"(?<FSA>";
// One letter, one digit, one letter
const string FSA_GROUP_DTL = @"\p{L}\d\p{L}";
// End the group
const string FSA_GROUP_END = ")";
// The complete FSA group
const string FSA_GROUP = FSA_GROUP_START + FSA_GROUP_DTL + FSA_GROUP_END;