What is the best way to create Regular Expressions?
Regular Expressions are notorious for being confusing to read and understand. The longer the Regular Expression,
the higher the chance of making a mistake in it, and the more difficult it is to debug or modify.
Even if every Regular Expression were commented thoroughly, it would still suffer from being a single long line of characters.
Consider this Regular Expression that is found at http://regexlib.com/REDetails.aspx?regexp_id=731:
(?s)( class=\w+(?=([^<]*>)))|(<!--\[if.*?<!\[endif\]-->)|
(<!\[if !\w+\]>)|(<!\[endif\]>)|(<o:p>[^<]*</o:p>)|
(<span[^>]*>)|(</span>)|
(font-family:[^>]*[;'])|(font-size:[^>]*[;'])(?-s)
There's nothing wrong with the expression itself. Unfortunately, no matter how thoroughly we document it, we cannot easily and visually associate
a comment with the part of the Regular Expression string that it describes.
The real problem is that a single long Regular Expression line does not allow a developer to show the intent of each significant part of it.
Each part of a Regular Expression must scream its purpose. If a Regular Expression is several lines long, and it does not work properly,
the developer will have a hard time locating the point that is responsible for the failure.
The solution is really simple. I have not seen a similar technique used anywhere, so this feels like a good example to share.
Instead of entering a Regular Expression as a single long cryptic string, the string is built dynamically by concatenating very short strings.
Each short piece of Regular Expression is commented separately.
For example, the following class creates a regex to validate a Canadian postal code:
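using System.Text;

public class CanadianPostalCodeRegex
{
    private string _strPattern;
    private static readonly CanadianPostalCodeRegex Instance = new CanadianPostalCodeRegex();

    private CanadianPostalCodeRegex()
    {
        StringBuilder patternBuilder = new StringBuilder();
        // Anchor the match at the start of the input.
        patternBuilder.Append(@"^");
        // Open the named group for the Forward Sortation Area (FSA).
        patternBuilder.Append(@"(?<FSA>");
        // FSA: letter, digit, letter.
        patternBuilder.Append(@"\p{L}\d\p{L}");
        // Close the FSA group.
        patternBuilder.Append(@")");
        // Allow one optional whitespace character between the two halves.
        patternBuilder.Append(@"\s?");
        // Open the named group for the Local Delivery Unit (LDU).
        patternBuilder.Append(@"(?<LDU>");
        // LDU: digit, letter, digit.
        patternBuilder.Append(@"\d\p{L}\d");
        // Close the LDU group.
        patternBuilder.Append(@")");
        // Anchor the match at the end of the input.
        patternBuilder.Append(@"$");
        _strPattern = patternBuilder.ToString();
    }

    public static string Pattern
    {
        get { return Instance._strPattern; }
    }
}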
A Regular Expression is created piece by piece. Each smallest meaningful unit is thoroughly commented. The intention of each part is crystal clear,
which is a huge help when one needs to fix or modify the regex. At all times, we need to deal with a fairly small regex string, instead of an unwieldy cryptic monster.
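As a rough usage sketch, the pattern can be handed to System.Text.RegularExpressions.Regex and the two named groups read back from the match. The PostalCodeDemo class and its TryParse helper are hypothetical names introduced only for illustration; the pattern is rebuilt inline so that the sample compiles on its own, whereas real code would presumably pass CanadianPostalCodeRegex.Pattern instead.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class PostalCodeDemo
{
    // Assumption: mirrors the pattern built by CanadianPostalCodeRegex,
    // repeated here only to keep the sample self-contained.
    private static string BuildPattern()
    {
        StringBuilder patternBuilder = new StringBuilder();
        patternBuilder.Append(@"^");             // start of input
        patternBuilder.Append(@"(?<FSA>");       // open named group FSA
        patternBuilder.Append(@"\p{L}\d\p{L}");  // letter, digit, letter
        patternBuilder.Append(@")");             // close FSA
        patternBuilder.Append(@"\s?");           // optional whitespace
        patternBuilder.Append(@"(?<LDU>");       // open named group LDU
        patternBuilder.Append(@"\d\p{L}\d");     // digit, letter, digit
        patternBuilder.Append(@")");             // close LDU
        patternBuilder.Append(@"$");             // end of input
        return patternBuilder.ToString();
    }

    // Hypothetical helper: returns true and fills in both halves when the
    // input matches; otherwise returns false with empty strings.
    public static bool TryParse(string input, out string fsa, out string ldu)
    {
        Match match = Regex.Match(input, BuildPattern());
        fsa = match.Success ? match.Groups["FSA"].Value : string.Empty;
        ldu = match.Success ? match.Groups["LDU"].Value : string.Empty;
        return match.Success;
    }
}
```

Calling PostalCodeDemo.TryParse("K1A 0B1", out fsa, out ldu) should return true with fsa set to "K1A" and ldu set to "0B1", while an input such as "12345" should return false. Because the groups are named, the calling code does not depend on the order in which the pieces were appended.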
This technique also promotes the syntactic correctness of the Regular Expression. For example, a group construct can be entered first, making sure the parentheses match:
patternBuilder.Append(@"(?<LDU>");
patternBuilder.Append(@")");
Next, the group's pattern is entered.
patternBuilder.Append(@"(?<LDU>");
patternBuilder.Append(@"\d\p{L}\d");
patternBuilder.Append(@")");
Because the class is a Singleton, the pattern string is built only once, so there is virtually no performance penalty. Readability and maintainability improve significantly.