Hello,

I cannot find a valid regular expression pattern for my needs.

I have a sample string like this:

I have four child of seven years each, [seven] years ago I had no child, because I was fourteen

Now I want to match and then substitute the words "four" and "[seven]".

So I have used a pattern like:

\bfour\b|\b\[seven\]\b

(It uses word boundaries to match whole words; the square brackets are escaped so that they match literally.)

but only "four" is matched and substituted.

If I change the pattern to:

four|\[seven\]


"four" and "[seven]" are both matched. But because I have removed the word boundary anchor "\b", partial-word matches can now happen ("four" inside "fourteen", for example), and this is not what I want.

It seems that "\b" is responsible for this strange behaviour, but I don't know why or how to solve it.

Any help is appreciated. Thanks.
Updated 1-Jul-11 6:13am
Comments
Manfred Rudolf Bihy 1-Jul-11 12:57pm    
I like your question! 5+
Please also see my answer for an explanation.
thatraja 1-Jul-11 13:15pm    
/*I like your question! 5+*/
Manfred, I think you clicked Vote 1 instead of Vote 5 :)
vlad781 1-Jul-11 17:07pm    
I clicked 5

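Let me elaborate a bit on what Catalin already said. \w is the class of word characters, roughly "[A-Za-z0-9_]" (in .NET it also covers Unicode letters and digits unless RegexOptions.ECMAScript is set). A word boundary \b can occur only directly next to one of these characters: it matches between a word character and a non-word character, or at the start or end of the string when a word character sits there. The code below illustrates this quite nicely: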

C#
using System;
using System.Text.RegularExpressions;

namespace TestSupportService
{
    class Program
    {
        static void Main(string[] args)
        {

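            String example = "I have four child of seven years each, [seven] years ago I had no child, because I was fourteen";
            // \b sits next to the word characters 's' and 'n', inside the escaped brackets.
            Regex rexWillDo = new Regex(@"\bfour\b|\[\bseven\b\]");
            // \b sits between a space and '[' (and between ']' and a space), which is never a
            // boundary, so the [seven] branch cannot match in this string.
            Regex rexWontDo = new Regex(@"\bfour\b|\b\[seven\]\b");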

            Console.WriteLine("Now you see it!");
            MatchCollection matches = rexWillDo.Matches(example);
            foreach (Match match in matches)
            {
                Console.WriteLine(match.Value);
            }

            Console.WriteLine("\nAnd now you don't!");
            matches = rexWontDo.Matches(example);
            foreach (Match match in matches)
            {
                Console.WriteLine(match.Value);
            }
            Console.ReadLine();

        }
    }
}


So by moving the word boundary anchors next to (real) word characters, the expression works. I do admit that I did not expect that kind of behavior either. Regular expressions usually work quite nicely for me, but once in a while MS's implementation rears its ugly head and bites us. :(
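To make the rule easier to see in isolation, here is a minimal sketch that probes a few short strings. The class name BoundaryDemo and the helper Hit are made up for this illustration; only Regex.IsMatch comes from the framework. It checks whether \b succeeds when "[" or "]" has a space or a letter as its neighbour.

```csharp
using System;
using System.Text.RegularExpressions;

static class BoundaryDemo
{
    // Hypothetical helper: true when the pattern matches somewhere in the input.
    public static bool Hit(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        // Space followed by '[': two non-word characters, so no boundary between them.
        Console.WriteLine(Hit(", [seven]", @"\b\[seven\]"));        // False

        // Letter followed by '[': word character next to non-word character, so \b succeeds.
        Console.WriteLine(Hit("x[seven]", @"\b\[seven\]"));         // True

        // ']' followed by a space: again two non-word characters, no boundary.
        Console.WriteLine(Hit("[seven] years", @"\[seven\]\b"));    // False

        // ']' followed by a letter: boundary found.
        Console.WriteLine(Hit("[seven]years", @"\[seven\]\b"));     // True

        // \b moved inside the brackets, next to 's' and 'n': boundary found on both sides.
        Console.WriteLine(Hit(", [seven] years", @"\[\bseven\b\]")); // True
    }
}
```

In the original sample, "[seven]" has a space on either side, which corresponds to the first and third cases above.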

Cheers!

—MRB
Comments
Nyarlatotep 1-Jul-11 13:03pm    
I did not expect it either. But it seems that the real \b behavior is what Catalin has indicated. I want to try the same pattern in other languages (PHP, incidentally) and see how it behaves, but I think it will be the same.
[seven] does not match the definition of a 'word': "[" and "]" do not belong to the word character class (\w), so \b does not match next to them here. Try to use \bfour\b|\[seven\]
See here for details

I recommend that you download Expresso and play with it.
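Since the question is about substitution, here is a small sketch of the suggested pattern used with Regex.Replace. The replacement values "4" and "7", the class name WholeTokenReplace and the helper ReplaceTokens are assumptions made for this illustration. The helper shows one possible alternative that is not part of the suggestion above: it uses the lookarounds (?<!\w) and (?!\w) instead of \b, so the same whole-token rule also applies to tokens that begin or end with a non-word character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WholeTokenReplace
{
    // Hypothetical helper: replaces every key of 'map' that occurs as a whole token.
    public static string ReplaceTokens(string input, IDictionary<string, string> map)
    {
        // Regex.Escape turns metacharacters such as '[' into literals.
        string alternation = string.Join("|", map.Keys.Select(k => Regex.Escape(k)));

        // The lookarounds require that no word character sits directly
        // before or after the token, whatever its first and last characters are.
        string pattern = @"(?<!\w)(?:" + alternation + @")(?!\w)";

        return Regex.Replace(input, pattern, m => map[m.Value]);
    }

    static void Main()
    {
        string example = "I have four child of seven years each, [seven] years ago I had no child, because I was fourteen";

        // The pattern suggested above: \b only around "four", none around "[seven]".
        string viaSuggestedPattern = Regex.Replace(
            example,
            @"\bfour\b|\[seven\]",
            m => m.Value == "four" ? "4" : "7");
        Console.WriteLine(viaSuggestedPattern);

        // The lookaround variant, driven by a token-to-replacement map.
        var map = new Dictionary<string, string> { { "four", "4" }, { "[seven]", "7" } };
        Console.WriteLine(ReplaceTokens(example, map));

        // Both lines print:
        // I have 4 child of seven years each, 7 years ago I had no child, because I was fourteen
    }
}
```

One difference to keep in mind: \[seven\] without any boundary check would also match inside something like "x[seven]y", while the lookaround variant would not. For the sample string both give the same result.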
Comments
Nyarlatotep 1-Jul-11 12:44pm    
Uhm yes, now that you have pointed it out, it seems clear that this is the cause.
Thanks, it seems I need to study the regular expression universe further :)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)