I'm trying to search for a regular expression pattern and, if it matches, find whether the value of that pattern exists inside any tag of the form "<sec id="sec123">" in a file. If it does, I want to replace it with result1. I think it can be done with the MatchEvaluator function, but I can't figure out how to apply it.

I'm new to VB.NET (and programming in general) and really don't know what to do. This is what I've tried so far:

sample input:
HTML
<sec id="sec1">
<p>"You fig. 23 did?" I <xref ref-type="section" rid="sec12">section 12</a> asked, surprised.</p>
<p>"There are always better terms <xref ref-type="section" rid="sec6">section 6</a>, Richard!" my mom said sharply.</p>
<p>I <xref ref-type="section" rid="sec2">section 2</a> stood. I <xref ref-type="section" rid="sec2">section 2</a> had to hurry if I <xref ref-type="section" rid="sec1">section 1</a> was going to get to work on time.
<fig id="fig4">
<caption><p>I'm confused</p></caption>
</fig> 
</p>
<p>Turning to face her, I <xref ref-type="section" rid="sec2">section 2</a> walked backward. "I"ve seriously got to get ready. Why don"t we get together for lunch and talk more then?"</p>
<sec id="sec2">
<p>"You fig. 23 can"t be""</p>
<p>I <xref ref-type="section" rid="sec4">section 4</a> adored the Art Deco elegance of the Chrysler Building. I <xref ref-type="section" rid="sec2">section 2</a> could pinpoint my place on the island in relation to the posit table 9ion of the Empire State Building.</p>
<p>I <xref ref-type="section" rid="sec1">section 1</a> felt Gideon before I <xref ref-type="section" rid="sec1">section 1</a> saw him, my entire body humming wit table 9h awareness as he stepped out of the Bentley, which had pulled up behind the Benz.</p>
</sec>
</sec>


I want the program to find every rid="secX" attribute in the file and check whether a matching <sec id="secX"> tag exists anywhere in the file. If there is no match, the <xref ref-type="section" rid="secX">section X</a> element should be reduced to its text, section X. This should continue until no rid="secX" attribute is left to check.

What I have tried:

VB
Dim pattern As String="(?<=rid=\"sec)(\\d+)(?=\">)"
Dim r As Regex = New Regex(pattern)
Dim m As Match = r.Match(input)
If (m.Success) Then
    Dim x As String=" id=""sec"+ pattern +""""
    Dim r2 As Regex = New Regex(x)
    Dim m2 As Match = r2.Match(input)
    If (m2.Success) Then
        Dim tgPat AsString="<xref ref-type="section" rid=""sec + pattern +"">(\w+) (\d+)</a>"
        Dim tgRep As String= "$1 $2"
        Dim tgReg As New Regex(tgPat)
        Dim result1 As String = tgReg.Replace(input, tgRep)
    Else
    EndIf
EndIf
Next
Updated 19-Aug-16 5:54am
Comments
Patrice T 18-Aug-16 13:12pm    
You should show the text you are trying to match and the pattern you use to replace it.
Use Improve question to update your question.
Member 12692000 19-Aug-16 11:20am    
I've updated my question; hopefully the requirements of the program are much clearer now.

Something like this should work:
VB.NET
Dim xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</xref>")

Dim result As String = xref.Replace(input, Function(match)
    Dim sec As New Regex(" id=""" & match.Groups("id").Value & """")
    Return If(sec.IsMatch(input), match.Value, match.Groups("content").Value)
End Function)

However, you should double-check your input. It almost looks like HTML, except you have an opening <xref> tag closed with an </a> tag, which doesn't match.
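To illustrate the closing-tag point, here is a minimal sketch that runs the pattern above against one sentence from the sample as shown in the question. The `Tolerant` pattern and the module name are hypothetical additions for illustration only; they are not part of the answer's code.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ClosingTagCheck
    ' Pattern from the answer: requires a proper </xref> closing tag.
    Public ReadOnly Strict As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</xref>")

    ' Hypothetical variant: also accepts the mismatched </a> closer seen in the sample.
    Public ReadOnly Tolerant As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</(?:xref|a)>")

    Sub Main()
        Dim sample As String = "I <xref ref-type=""section"" rid=""sec12"">section 12</a> asked"
        Console.WriteLine(Strict.IsMatch(sample))   ' False: no </xref> to match
        Console.WriteLine(Tolerant.IsMatch(sample)) ' True
    End Sub
End Module
```

If the real files always close the tag with `</xref>`, the strict pattern is the better choice, since it will not silently accept malformed markup.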

If the input is HTML, you might have better luck using an HTML parser like AngleSharp to parse and modify the document.
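As a rough sketch of the parser-based route: the example below uses LINQ to XML (`System.Xml.Linq`) from the .NET Framework rather than AngleSharp, so it is a swapped-in technique and not the AngleSharp API. It assumes the input is well-formed XML with a single root element, which means the `</a>` closers would have to be corrected to `</xref>` first. `UnwrapDanglingXrefs` is a hypothetical helper name.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module XrefCleanup
    ' Hypothetical helper: replaces each <xref> whose rid has no matching
    ' <sec id="..."> with the xref's plain text content.
    Function UnwrapDanglingXrefs(xml As String) As String
        ' Assumption: well-formed XML with one root element.
        Dim doc As XDocument = XDocument.Parse(xml, LoadOptions.PreserveWhitespace)

        ' Collect every <sec id="..."> value once.
        Dim ids As New HashSet(Of String)(
            doc.Descendants("sec").Attributes("id").Select(Function(a) a.Value))

        ' ToList() takes a snapshot so the tree can be modified inside the loop.
        For Each x As XElement In doc.Descendants("xref").ToList()
            Dim rid As XAttribute = x.Attribute("rid")
            If rid IsNot Nothing AndAlso Not ids.Contains(rid.Value) Then
                x.ReplaceWith(New XText(x.Value))
            End If
        Next

        Return doc.ToString(SaveOptions.DisableFormatting)
    End Function
End Module
```

A parser compares attribute values directly, so it does not depend on attribute order or spacing inside the tags the way a regular expression does.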

Regular Expression Language - Quick Reference
AngleSharp - Home
 
Comments
Member 12692000 19-Aug-16 12:44pm    
@Richard Could you please explain your code in a bit more detail? I'm new to coding, so it would be much easier for me to understand what the process is actually doing. By the way, the closing tag will be "</xref>" and not "</a>"; my bad.
Richard Deeming 19-Aug-16 12:52pm    
1) Create a regular expression to find all <xref> tags with a rid attribute containing sec followed by some numbers. Capture the rid attribute value and the content of the tag.

2) Replace the matches using a match evaluator function:

2a) Create a regular expression to find an id attribute with the same content as the matched rid attribute;

2b) If the entire input contains a match for that expression, return the outer <xref> match unchanged;

2c) Otherwise, the rid points to a section that doesn't exist, so return the content of the <xref> tag;


The MSDN documentation explains how the match evaluator works.
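The steps above can be tried on a tiny input with a sketch like the following. The module name, the `RemoveDanglingXrefs` function name and the sample string are made up for illustration; the evaluator is written as a named `MatchEvaluator` delegate so that each step is visible.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MatchEvaluatorDemo
    Function RemoveDanglingXrefs(input As String) As String
        ' Step 1: find <xref> tags, capturing the rid value and the tag content.
        Dim xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</xref>")

        ' Step 2: the evaluator is called once per match and returns the replacement text.
        Dim evaluator As MatchEvaluator =
            Function(m As Match) As String
                ' Step 2a: build a pattern for an id attribute with the same value.
                Dim idPattern As String = " id=""" & Regex.Escape(m.Groups("id").Value) & """"
                ' Step 2b: the section exists, so keep the whole <xref> unchanged.
                If Regex.IsMatch(input, idPattern) Then
                    Return m.Value
                End If
                ' Step 2c: no such section, so keep only the text inside the tag.
                Return m.Groups("content").Value
            End Function

        Return xref.Replace(input, evaluator)
    End Function

    Sub Main()
        Dim sample As String =
            "<sec id=""sec1""><p>See <xref ref-type=""section"" rid=""sec1"">section 1</xref> and " &
            "<xref ref-type=""section"" rid=""sec12"">section 12</xref>.</p></sec>"
        ' Prints: <sec id="sec1"><p>See <xref ref-type="section" rid="sec1">section 1</xref> and section 12.</p></sec>
        Console.WriteLine(RemoveDanglingXrefs(sample))
    End Sub
End Module
```

The leading space in the id pattern matters: it keeps `rid="sec1"` from being mistaken for `id="sec1"`.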
Member 12692000 20-Aug-16 10:48am    
I have updated my full code according to your method, but it's not working. I may have made some stupid mistakes in the code, as I'm just a beginner. Can you help me out?


Imports System.IO
Imports System.Text.RegularExpressions
Public Class Form1

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        If FolderBrowserDialog1.ShowDialog = DialogResult.OK Then
            TextBox1.Text = FolderBrowserDialog1.SelectedPath
        End If

    End Sub

    Private Sub Button2_Click(sender As Object, e As EventArgs) Handles Button2.Click
        Dim targetDirectory As String
        targetDirectory = TextBox1.Text
        Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.txt")
        For Each txtFile In txtFilesArray
            Dim FileInfo As New FileInfo(txtFile)
            Dim FileLocation As String = FileInfo.FullName
            Dim input As String = File.ReadAllText(FileLocation)
            Dim pattern As String = "(?<=rid=\""sec)(\d+)(?=\"">)"
            Dim r As Regex = New Regex(pattern)
            Dim m As Match = r.Match(input)
            Dim xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)")
            Dim result As String = xref.Replace(input, Function(match)
                Dim sec As New Regex(" id=""" & m.Groups("id").Value & """")
                Return If(sec.IsMatch(input), match.Value, match.Groups("content").Value)
            End Function)
            input = result
            File.WriteAllText(FileLocation, input)
        Next
        MessageBox.Show("Process complete")
    End Sub
End Class
Richard Deeming 20-Aug-16 11:01am    
Well, you can simplify the code a bit:

Dim targetDirectory As String = TextBox1.Text
Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.txt")
For Each txtFile In txtFilesArray
   Dim input As String = File.ReadAllText(txtFile)
   Dim xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</xref>")
   Dim result As String = xref.Replace(input, Function(match)
      Dim sec As New Regex(" id=""" & match.Groups("id").Value & """")
      Return If(sec.IsMatch(input), match.Value, match.Groups("content").Value)
   End Function)
   File.WriteAllText(txtFile, result)
Next


You also missed the closing </xref> tag in the regular expression.

All I can tell you is that the code I posted worked on the sample input you provided. If it's not working on the files you're using, then they're not the same format as the sample input.
Member 12692000 20-Aug-16 12:22pm    
Well, I'm getting an error:

Error BC30518 Overload resolution failed because no accessible 'Replace' can be called with these arguments:
'Public Overloads Function Replace(input As String, replacement As String) As String': Lambda expression cannot be converted to 'String' because 'String' is not a delegate type.
'Public Overloads Function Replace(input As String, evaluator As MatchEvaluator) As String': 'm' is not declared. It may be inaccessible due to its protection level.


Dim targetDirectory As String = TextBox1.Text
Dim txtFilesArray As String() = Directory.GetFiles(targetDirectory, "*.txt")
For Each txtFile In txtFilesArray
   Dim input As String = File.ReadAllText(txtFile)
   Dim xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)")
   Dim result As String = xref.Replace(input, Function(match)
      Dim sec As New Regex(" id=""" & match.Groups("id").Value & """")
      Return If(sec.IsMatch(input), match.Value, match.Groups("content").Value)
   End Function)
   File.WriteAllText(txtFile, result)
Next
Cascading regexes the way you do is a bad idea; it is inefficient and error-prone.
If the third regex matches, that implies the second regex matched, which in turn implies the first regex matched. Make it one regex.
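One way to cut the repeated work, short of a true single pattern, is to collect all section ids in one pass and then run a single replace that looks them up. This is only a sketch under assumptions: the module and function names are hypothetical, and it assumes the `<sec id="secN">` and `</xref>` forms from the sample input.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SinglePassLookup
    ' Both patterns are compiled once and reused for every file.
    Private ReadOnly SecId As New Regex("<sec[^>]*\sid=""(?<id>sec\d+)""")
    Private ReadOnly Xref As New Regex("<xref[^>]+rid=""(?<id>sec\d+)""[^>]*>(?<content>[^<]+)</xref>")

    ' Hypothetical helper: one scan to gather ids, one replace to fix the xrefs.
    Function RemoveDanglingXrefsFast(input As String) As String
        Dim known As New HashSet(Of String)
        For Each m As Match In SecId.Matches(input)
            known.Add(m.Groups("id").Value)
        Next

        ' Keep the xref if its target exists; otherwise keep only its text.
        Return Xref.Replace(input,
            Function(m As Match) If(known.Contains(m.Groups("id").Value), m.Value, m.Groups("content").Value))
    End Function
End Module
```

The set lookup replaces a fresh scan of the whole input for every `<xref>` found, which matters mostly for large files with many references.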

Bug: your 3 patterns use 3 different writing conventions, and at least 2 of them are wrong.
You should show:
- an example of the text you want to match: not something that contains it, just the exact text.
- the pattern you use to match that text (without regard to VB rules).
- the same pattern written using the VB rules.
- the result you want from the replace.
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


