Each line in my file starts with a specific pattern:

1000000179|abcd.....
1000000180|wedwedw...
1000000181|wnewedwed...

Each line starts with 10 digits followed by a pipe.

How can I find (or replace) lines that do NOT follow this pattern? For example, the second line below is invalid:


1000000179|abcd.....
%d20000180|wedwedw...
1000000181|wnewedwed...

What you need is called negative lookahead.
The expression is: ^(?!\d{10}).{10}\|.*$ (with the Multiline option).
Note its logic: the first part is a negative lookahead that rejects any line starting with ten digits. The rest is a general pattern that accepts both good and bad strings with a pipe in the 11th position.
For more details about lookaround, read this article.
 
Comments
Brian A Stephens 8-Apr-13 21:31pm    
Ah, yes: negative lookahead; that's the elegant way to do it. However, the regex you provided doesn't match lines that fail the requirement of a pipe in the 11th position. With a slight modification, it will match those too: ^(?!\d{10}\|).*$
Zoltán Zörgő 9-Apr-13 2:18am    
You're right. I hadn't tested it with more input.

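
To make the lookahead approach concrete, here is a minimal sketch that runs the pattern from Brian's comment, ^(?!\d{10}\|).*$, over the sample data from the question. The class and method names (LookaheadDemo, FindInvalidLines) are made up for illustration, and joining the lines with '\n' is an assumption about the file's line endings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Pattern from the comment: a line is invalid unless it starts
    // with ten digits followed by a pipe.
    const string InvalidLinePattern = @"^(?!\d{10}\|).*$";

    // Returns every line of `text` that fails the "ten digits + pipe" rule.
    public static string[] FindInvalidLines(string text)
    {
        // Multiline makes ^ and $ anchor at each line, not only at the string ends.
        return Regex.Matches(text, InvalidLinePattern, RegexOptions.Multiline)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    static void Main()
    {
        // Sample data from the question, joined with '\n' (an assumption).
        string text = "1000000179|abcd.....\n%d20000180|wedwedw...\n1000000181|wnewedwed...";
        foreach (string line in FindInvalidLines(text))
            Console.WriteLine(line);   // prints: %d20000180|wedwedw...
    }
}
```

Keep in mind that a blank line also counts as invalid here, and that with Windows line endings the matched text will carry a trailing '\r', because '.' matches every character except '\n'.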
The challenge here, as Mohan implied, is finding the "negative" match. It's straightforward to find a line that starts with 10 digits and a pipe, but how do you find one that doesn't?

Here's a regex that will do it:
^([^|]*[^|\d][^|]*\||.{10}[^|])

It matches only the last 4 of these input lines:
1000000179|abcd.....|abc|
1000000180|wedwedw...|234|
1000000179|abcd.....
%d20000180|wedwedw...
3214a23642|abcd
123456789|whatever
1234567890_abcde


Breaking down the regex, it's looking for one of two conditions at the beginning of a line:
1) Any non-digit before the first pipe ( [^|]*[^|\d][^|]*\| )
2) Any non-pipe character in the 11th position ( .{10}[^|] )
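
If you want to check the claim about the seven sample lines yourself, something like the following sketch should do. It tests each line separately, so no Multiline option is needed; the names AlternationDemo and IsInvalid are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationDemo
{
    // Regex from the answer: flags a line that has a non-digit before the
    // first pipe, or whose 11th character is not a pipe.
    static readonly Regex Invalid = new Regex(@"^([^|]*[^|\d][^|]*\||.{10}[^|])");

    public static bool IsInvalid(string line) => Invalid.IsMatch(line);

    static void Main()
    {
        // The seven input lines listed in the answer.
        string[] lines =
        {
            "1000000179|abcd.....|abc|",
            "1000000180|wedwedw...|234|",
            "1000000179|abcd.....",
            "%d20000180|wedwedw...",
            "3214a23642|abcd",
            "123456789|whatever",
            "1234567890_abcde"
        };

        foreach (string line in lines)
            Console.WriteLine((IsInvalid(line) ? "invalid  " : "valid    ") + line);
    }
}
```

One caveat: a line shorter than 11 characters that has only digits before any pipe (for example 12345 or 123|x) slips past both branches, so this regex is best suited to data where every line is at least 11 characters long.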
 
The pattern to check for a valid line would be:

C#
string pattern = @"^\d{10}\|.*$";


This matches any line that begins with 10 digits, followed by a pipe, followed by any number of arbitrary characters.
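
Since the question asks for the lines that do NOT match, one way to use this pattern is to test each line and keep the ones where the match fails. The sketch below assumes the lines have already been read into a collection (for instance with File.ReadLines); ValidPatternDemo and InvalidLines are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ValidPatternDemo
{
    // Pattern for a VALID line, as given in the answer.
    const string ValidPattern = @"^\d{10}\|.*$";

    // Keeps only the lines that do not match the valid-line pattern.
    public static List<string> InvalidLines(IEnumerable<string> lines)
    {
        return lines.Where(line => !Regex.IsMatch(line, ValidPattern)).ToList();
    }

    static void Main()
    {
        // Sample data from the question.
        var lines = new[]
        {
            "1000000179|abcd.....",
            "%d20000180|wedwedw...",
            "1000000181|wnewedwed..."
        };

        foreach (string bad in InvalidLines(lines))
            Console.WriteLine(bad);   // prints: %d20000180|wedwedw...
    }
}
```

Negating a simple positive match keeps the regex easy to read, at the cost of doing the line splitting in code rather than in the pattern.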
 
Just test for non-numeric characters in the first 10 characters of each line.
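
A rough sketch of that idea without any regex might look like this. The extra check for a pipe in the 11th position is an assumption taken from the question rather than from this answer, and PrefixCheckDemo and IsValid are hypothetical names.

```csharp
using System;

static class PrefixCheckDemo
{
    // True if the line starts with ten ASCII digits followed by a pipe.
    public static bool IsValid(string line)
    {
        // Assumption from the question: the 11th character must be a pipe.
        if (line.Length < 11 || line[10] != '|')
            return false;

        // Reject the line as soon as a non-digit shows up in the first 10 characters.
        for (int i = 0; i < 10; i++)
        {
            char c = line[i];
            if (c < '0' || c > '9')
                return false;
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsValid("1000000179|abcd....."));   // True
        Console.WriteLine(IsValid("%d20000180|wedwedw..."));  // False
    }
}
```

The explicit range comparison accepts only the ASCII digits 0 to 9; char.IsDigit would also accept digits from other scripts.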
 
See regex for VALID line:
^\d{10}\|
 
It's easy to define the valid lines, but slightly more difficult to define the invalid ones.
So define (in pseudocode) line = valid or invalid.
Even if it sounds trivial, this covers all the lines.

Loop over all lines and process the matches where the invalid part is present.
E.g.
C#
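using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

string filePath = @"...";
string data = File.ReadAllText(filePath);
// First alternative matches a valid line; otherwise group 1 captures the whole (invalid) line.
string linePattern = @"^(?:\d{10}\|.*|(.*))$";
var invalidLines = from m in Regex.Matches(data, linePattern, RegexOptions.Multiline).Cast<Match>()
                   where m.Groups[1].Success                 // only the invalid alternative sets group 1
                   select m.Groups[1].Value.TrimEnd('\r');   // '.' also matches '\r' in CRLF files
foreach(string invalidLine in invalidLines)
{
   //...
}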


Because the regex engine tries the alternatives from left to right, it takes the first one that matches: either the valid one (\d{10}\|.*) or the invalid one (.*). The two are separated by the or operator (|).
To get access to the invalid data, it is enclosed in parentheses ((.*)), which makes it a capturing group.
To limit the pattern to one line each, the whole pattern is enclosed in ^...$ and grouped by a non-capturing group ((?:...)), i.e. a group that does not count in the Groups array of a match.

Putting it all together results in ^(?:\d{10}\|.*|(.*))$.
Note that the match option is set to Multiline to give ^ and $ the needed meaning: begin/end of each line (without it, ^ and $ would mean begin/end of the whole string).

Cheers
Andi
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)