Is there some way to restrict matches to when a substring appears only ONCE in an item?

So, for instance, if I am interested in the substring "ark" and we have these items:

shark
aaaaarkdd
darkdayshark

I should only get the first two items. The substring "ark" appears twice in the third entry. Can this be done with straight regex? Or do you have to write a looping script?

I've tried the count quantifier {x,y}, but it only seems to apply to consecutive occurrences (something most tutorials don't make very clear, by the way).
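To illustrate that point, here is a minimal sketch in Python (the choice of Python and the extra test item "arkark" are assumptions for illustration, not part of the original question). It shows that a count quantifier on a group only counts back-to-back repetitions:

```python
import re

# "arkark" is a made-up extra item, added only to show what {2} does match.
items = ["shark", "aaaaarkdd", "darkdayshark", "arkark"]

# (ark){2} means "ark" immediately followed by "ark", i.e. the literal text "arkark".
# It does not mean "ark appears twice somewhere in the item".
consecutive = [s for s in items if re.search(r"(ark){2}", s)]

print(consecutive)
```

Note that "darkdayshark" is not picked up, even though it contains "ark" twice, because the two occurrences are not adjacent.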

I've tried using backreferences, but they don't seem built for negation, or at least I haven't found any examples. Besides, given the way characters are consumed, and because the substring could occur many times anywhere, I'm not sure they would work anyway.

I've tried the non-greedy quantifiers, but they don't really help. First of all, in the terminal on an iMac, the non-greedy quantifiers don't seem to work at all. Even on simple examples, I get no difference in results whether I use greedy or non-greedy quantifiers, even if I grep with -o so I can see what was actually matched. But even if they did work, I'm still not sure they would help. On the third item, for example, does it matter whether it matches up to dark, or all the way to the end of darkdayshark? It's still a match, so it returns the item, which is incorrect. It would only help if I were extracting what was matched. Here I want the complete item that meets the condition, which is not the same thing.
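The reasoning above can be checked with a small sketch, here using Python's re module rather than grep (an assumption for illustration; grep's behavior on a given system may differ). Greedy and non-greedy quantifiers select the same items and differ only in the text they capture:

```python
import re

items = ["shark", "aaaaarkdd", "darkdayshark"]

# Which items match at all: identical for the greedy and non-greedy forms.
greedy_items = [s for s in items if re.search(r".*ark", s)]
lazy_items = [s for s in items if re.search(r".*?ark", s)]

# What text is matched inside the third item: this is the only difference.
greedy_text = re.search(r".*ark", "darkdayshark").group()
lazy_text = re.search(r".*?ark", "darkdayshark").group()

print(greedy_items == lazy_items)
print(greedy_text, lazy_text)
```

So laziness changes how much of the item is consumed, not whether the item qualifies, which is why it cannot express "exactly once" by itself.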

I've looked at lookahead and lookbehind, but from what I've read, they are for looking at the immediately previous or next item. The substring could appear multiple times at any distance from the first occurrence.
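For what it's worth, a lookahead is anchored at the current position, but the pattern inside it can contain .* and so reach any distance ahead on the line. A minimal sketch in Python (the language and the sample string "xxxxxxxxxxxxshark" are assumptions for illustration):

```python
import re

# Positive lookahead at the start of the line: is there an "ark" anywhere further on?
far_hit = re.search(r"^(?=.*ark)", "xxxxxxxxxxxxshark") is not None

# Negative lookahead after a match: an "ark" with no other "ark" anywhere after it.
# In "darkdayshark" the first "ark" (index 1) is rejected and the one at index 9 is found.
last_ark = re.search(r"ark(?!.*ark)", "darkdayshark")

print(far_hit)
print(last_ark.start())
```

Here the lookahead only inspects the text; it consumes no characters, so the main pattern carries on from the same position afterwards.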

This doesn't seem like it ought to be that hard, but I haven't found anything that works. Except for the ^ inside square brackets, there's not much provided for negative evaluation.

1 solution

This seems to work (a regular expression of the .NET variety):
(\r\n|^)((?!.*ark.*ark.*)(.*))ark((?!.*ark.*)(.*))(\r\n|$)

That might need some tweaking (e.g., newlines and that second negative lookahead may not be required), but it seems to work for the most part.
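As a rough sketch of the same negative-lookahead idea, here is a trimmed-down variant in Python (not the .NET pattern above; the name ONCE, the list-based input, and the extra item "nomatch" are assumptions for illustration). It tests one item at a time, so the newline handling is dropped:

```python
import re

# Reject the item if "ark" can be found twice; otherwise require at least one "ark".
ONCE = re.compile(r"^(?!.*ark.*ark).*ark")

# "nomatch" is a made-up item containing no "ark" at all.
items = ["shark", "aaaaarkdd", "darkdayshark", "nomatch"]

matches = [s for s in items if ONCE.search(s)]

# Cross-check without regex: count the occurrences directly.
by_count = [s for s in items if s.count("ark") == 1]

print(matches)
print(by_count)
```

If a regex is not strictly required, the counting version is arguably easier to read. Note that str.count counts non-overlapping occurrences, which makes no difference for a substring like "ark" that cannot overlap itself.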
 
Comments
echoCon 19-Aug-10 4:44am    
Thanks! I'll take a look. I'm kind of new to regex. Even getting closer would be great, and hopefully I'll pick up some new approaches.
AspDotNetDev 19-Aug-10 12:35pm    
You are welcome. Remember, if this answer helped, vote on it and mark it as accepted. :-)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


