I want to search for some regex patterns in text files (*.txt) inside a folder whose path I have entered in a text box. The folder contains sub-folders, with the text files at paths of the form 12345-2031-30201\2031\30201\txt\110.txt. If a pattern matches in even one file, a string is written to a log file that is created inside the folder whose path I entered in the text box; the search then moves on to the next regex, and so on.

The problem is that the log file shows only the first matched pattern and does not show whether the other patterns matched in any file. What seems to happen is that the program opens, say, the first file, searches for the first pattern and finds a match, so it writes the text associated with that pattern, "Check figure link", and that is all the log file contains. But the same file (and some other files) do match the second and third patterns, and the log does not show the texts associated with those patterns, such as "Check table link" and "Check section link".

Can anyone help?

What I have tried:

VB
Dim patterns = New List(Of String()) From {
({"Check figure link", "(?<!>)(figure|fig\.|figs\.|figures) (\d+)"}),
({"Check table link", "(?<!>)(table|tab\.|tabs\.|tables) (\d+)"}),
 ({"Check section link", "(?<!>)(section|sec\.|sect\.|section) (\d+)"}),
 ({"Check space", "</inline>w+"}),}

    Dim output = From pattern In patterns.AsParallel
                 Let regEx = New Regex(pattern(1), RegexOptions.Compiled)
                 From tFile In Directory.EnumerateFiles(TextBox1.Text, "*.txt", SearchOption.AllDirectories)
                 Where tFile Like "*\#*\#*\#*\txt\#*.txt" AndAlso regEx.IsMatch(File.ReadAllText(tFile))
                 Take 1
                 Select pattern(0)

    File.WriteAllLines(TextBox1.Text.TrimEnd("\"c) & "\Checklist.log", output)
    MsgBox("Process Complete")
Updated 2-Sep-16 17:37pm
Comments
Richard Deeming 2-Sep-16 12:29pm    
You've told us what you want your log file to contain, but you haven't told us what output you get when you run your code.
Member 12692000 2-Sep-16 12:36pm    
I basically want the list of patterns that matched at least once in at least one file. I have 200+ patterns to search for in the files, and I don't want to go through all of them; I just want to look up the matched patterns in my files and then modify them accordingly.
Richard Deeming 2-Sep-16 12:37pm    
That still doesn't tell us what output you get when you run your code.
Member 12692000 2-Sep-16 12:44pm    
The output is the log file, which gets created when a pattern is matched and then it writes the text associated with that pattern into the log file. Does that make it clear?
Richard Deeming 2-Sep-16 12:50pm    
No.

You've told us what you want to happen.

You've shown us the code you're currently running.

But you haven't told us what happens when you run that code.

1 solution

So it looks like you want to find which files have errors according to the patterns.
What you have is designed to give you up to 4 lines in the log, indicating which types of error exist somewhere in the file hierarchy.
This seems very uninformative!
Wouldn't it be better to know (at least) which files have errors?
Wouldn't it be better to know which errors are in each of those files?
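
As an aside on the symptom described in the question (only one label ever reaches the log): a likely cause, as far as I can tell, is that in a VB query expression `Take 1` applies to the whole flattened sequence of (pattern, file) pairs, not to each pattern separately. The sketch below is only an illustration of that behaviour; it uses in-memory strings instead of files, and the sample texts and the names `TakeDemo`, `texts`, `flattened` and `perPattern` are made up for the example.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module TakeDemo
    Sub Main()
        ' Two label/pattern pairs in the same shape as the question's list.
        Dim patterns = New List(Of String()) From {
            New String() {"Check figure link", "fig\. \d+"},
            New String() {"Check table link", "tab\. \d+"}}

        ' Hypothetical stand-ins for the contents of two files.
        Dim texts = New List(Of String) From {"see fig. 1 and tab. 2", "see tab. 3"}

        ' Same shape as the posted query: Take 1 cuts the combined
        ' (pattern, text) sequence down to a single element.
        Dim flattened = From pattern In patterns
                        Let re = New Regex(pattern(1))
                        From content In texts
                        Where re.IsMatch(content)
                        Take 1
                        Select pattern(0)

        ' One result per pattern: ask whether ANY text matches it.
        Dim perPattern = From pattern In patterns
                         Let re = New Regex(pattern(1))
                         Where texts.Any(Function(t) re.IsMatch(t))
                         Select pattern(0)

        Console.WriteLine(String.Join(", ", flattened))   ' prints: Check figure link
        Console.WriteLine(String.Join(", ", perPattern))  ' prints: Check figure link, Check table link
    End Sub
End Module
```

Applied to the posted query, the same idea would be to move the file enumeration inside the `Where` clause and test it with `.Any(...)`. I have not run that against the real folder layout, so treat it as a starting point only.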

Here's what I suggest.
(This may be totally off from your requirements, but what you have just feels like a lot of work for nearly zero information content.)

The way you have this structured, the parallelization is very inefficient: it reads each file once for each pattern, so with the four patterns shown each file is potentially read and scanned four times!
Parallelize across the filenames.
This then means that the Regex instances need to be created and compiled ahead of the execution loops.
You also read the whole file each time. The patterns that you've shown do not span lines, so match against one line at a time; reading can then stop as soon as every pattern has matched, which avoids a lot of unnecessary IO.
I've extracted the per-file checking to a function so it can eliminate redundant IO and Regex matching.
I also simplified the Regex patterns.
Something like (my VB is rusty):
VB
'Edited MTH: based on your comment...
Dim patterns = New List(Of String()) From {
     ({"Check figure link", "(?<!>)(?:figures?|figs?\.) \d+"}),
     ({"Check table link", "(?<!>)(?:tables?|tabs?\.) \d+"}),
     ({"Check section link", "(?<!>)(?:sections?|sect?\.) \d+"}),
     ({"Check space", "</inline>\w+"})}

Dim compiledPatterns = New Dictionary(Of Regex, String)
For Each pat As String() In patterns
    compiledPatterns.Add(New Regex(pat(1), RegexOptions.Compiled), pat(0))
Next

'Edit: Setup to "transpose" the information collected below.
'This would be tricky to do combined with parallelization.
'So it is done as a sequential pass through the collected info.
Dim pathsByMessage = New Dictionary(Of String, List(Of String))
For Each pat As String() In patterns
    pathsByMessage.Add(pat(0), New List(Of String))
Next

Dim filteredFilenames = From tFile In Directory.EnumerateFiles(TextBox1.Text, "*.txt", SearchOption.AllDirectories)
             Where tFile Like "*\#*\#*\#*\txt\#*.txt"

Dim output = From tFile In filteredFilenames.AsParallel
               Let checks = CheckFile(tFile, compiledPatterns)
               Where checks.Any
               Select Path = tFile, Messages = checks

'Edit: Now "transpose" this
'It's OK to process this *sequentially*, even if the query is still running parallelized!
For Each pm In output
    For Each msg In pm.Messages
        pathsByMessage(msg).Add(pm.Path)
    Next
Next

File.WriteAllLines(TextBox1.Text.TrimEnd("\"c) & "\Checklist.log",
                   From pbm In pathsByMessage
                     Where pbm.Value.Any
                     Select String.Format("""{0}""=={1}", pbm.Key, vbNewLine & String.Join(vbNewLine, pbm.Value)))
MsgBox("Process Complete")

And
VB
Function CheckFile(tFile As String, compiledPatterns As Dictionary(Of Regex, String)) As List(Of String)
    'structure this to eliminate redundant checking
    Dim messages As New List(Of String)
    Dim checks = New HashSet(Of Regex)(compiledPatterns.Keys)

    For Each line In File.ReadLines(tFile)
        If Not checks.Any Then
            Exit For
        End If
        For Each re In checks.ToList
            If re.IsMatch(line) Then
                messages.Add(compiledPatterns(re))
                checks.Remove(re)
            End If
        Next
    Next

    Return messages
End Function
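
For a quick sanity check of the simplified patterns above, something along these lines could be run as a small console program. It is a sketch only: the sample strings and the `PatternCheck` module name are invented for the illustration, and it exercises just the figure pattern.

```vb
Imports System
Imports System.Text.RegularExpressions

Module PatternCheck
    Sub Main()
        ' The simplified "figure" pattern from the answer, unchanged.
        Dim figure = New Regex("(?<!>)(?:figures?|figs?\.) \d+")

        ' Made-up sample lines, not taken from the asker's files.
        Dim samples = {"see figure 12", "see figs. 3", "<link>fig. 3</link>", "See Figure 12"}

        For Each sample In samples
            ' Expected: True, True, False (preceded by ">"), False (capital F).
            Console.WriteLine("{0,-22} -> {1}", sample, figure.IsMatch(sample))
        Next
    End Sub
End Module
```

The last sample is worth noting: as written, the patterns are case-sensitive, so a capitalised "Figure 12" is not reported. If those should be caught too, `RegexOptions.IgnoreCase` could be combined with `RegexOptions.Compiled` when the `Regex` is created; whether that is wanted depends on the files.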
 
Comments
Member 12692000 2-Sep-16 22:57pm    
@Matt Your solution seems to be a bit different than what I'm looking for. How do I make the log file show errors in the fashion below?

"Check figure link"==
D:\test\bk\1235-12-3053\230\124\txt\124.txt
D:\test\bk\1235-12-3053\230\131\txt\131.txt
"Check table link"==
D:\test\bk\1235-12-3053\230\124\txt\124.txt
D:\test\bk\1235-12-3053\230\205\txt\205.txt
and so on

And also, could you please explain your code (how it works), as I'm a beginner in VB.NET, or in any type of coding for that matter!
Member 12692000 5-Sep-16 22:17pm    
Thanks mate..
Maciej Los 6-Sep-16 2:51am    
5ed!
Matt T Heffron 6-Sep-16 12:58pm    
Thanks!

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


