Hi Everyone,

Thanks a lot for answering my previous queries. :)

Nowadays I am working with huge XML files, and my job is to find the errors that the software creates while generating them.

All of these files are converted to XML by software.

During this conversion, the software can make the following errors:

1) Some tags can be left blank or empty.

2) The value of the designator's value attribute can be incorrect.

These two errors occur most often.

Now my questions are:


1) How do I create a C#.NET application that reads this XML file, finds the location of every empty element, and writes the results to Notepad or any text file? (The rule applies to every empty element.)

2) How do I check that the designator's value attribute matches the text given between its tags?

Below is a sample of the XML tagging:

XML
<lnci:content>ABC,123</lnci:content>
<heading>
<designator value="a">(a).</designator>
<title>This is Title.</title>
</heading>
<lawTextComponet>
<p>This is the Content.</p>
<p></p>
</lawTextComponet>
...


Please suggest a solution as soon as possible.

Regards
Mayur Alaspure
Updated 12-Sep-12 3:37am
Comments
Zoltán Zörgő 12-Sep-12 9:42am    
If these are the only tasks, I would use regular expressions.
Mayur2258 12-Sep-12 9:56am    
I tried regular expressions, but they are not working the way I want because the XML files are huge, about 120 MB or more, and contain thousands of tags.
Mayur2258 12-Sep-12 9:57am    
Can you provide the regular expressions?
Zoltán Zörgő 15-Sep-12 4:54am    
Any progress?

You can try XmlReaderSettings together with an XSD.

With this MSDN example:

- It never loads the entire document.
- The while (reader.Read()) loop just enumerates the entire file at the node level.
- Validation is enabled via the XmlReaderSettings.

To disallow empty strings, use minLength in your XSD (you can generate your XSD with XSD.exe).
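
As a rough sketch of the approach above, the following shows a streaming read with validation switched on through XmlReaderSettings. The inline schema, the `doc` and `p` element names, and the `StreamingValidation` class are made up for illustration; a real schema would have to describe the actual document. It assumes `minLength` of 1 is what you want for "not empty".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

static class StreamingValidation
{
    // Hypothetical schema for illustration only: <doc> holds any number of <p>
    // elements, and each <p> must contain at least one character.
    const string Xsd = @"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='doc'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='p' minOccurs='0' maxOccurs='unbounded'>
          <xs:simpleType>
            <xs:restriction base='xs:string'>
              <xs:minLength value='1'/>
            </xs:restriction>
          </xs:simpleType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Reads the XML node by node and collects one message per validation error.
    public static List<string> Validate(TextReader xml)
    {
        var problems = new List<string>();
        var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
        settings.Schemas.Add(null, XmlReader.Create(new StringReader(Xsd)));
        // With a handler attached, errors are reported here instead of being thrown.
        settings.ValidationEventHandler += (sender, e) =>
            problems.Add($"Line {e.Exception.LineNumber}, position {e.Exception.LinePosition}: {e.Message}");

        using (XmlReader reader = XmlReader.Create(xml, settings))
        {
            while (reader.Read()) { } // validation happens as a side effect of reading
        }
        return problems;
    }

    static void Main()
    {
        string sample = "<doc>\n<p>This is the Content.</p>\n<p></p>\n</doc>";
        foreach (string problem in Validate(new StringReader(sample)))
            Console.WriteLine(problem);
    }
}
```

For a file on disk you would pass a `StreamReader` (or use `XmlReader.Create(path, settings)`) and write the collected messages out with `File.WriteAllLines`.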


To match the designator values, you can use regular expressions.
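
One possible way to combine the two ideas is to stream the file with XmlReader and use a small regular expression only to strip the punctuation around the designator text. This is a sketch, not a definitive rule: it assumes the text is the attribute value wrapped in non-word characters (as in `value="a"` with text `(a).`), that the closing tag is spelled `designator`, and the `DesignatorCheck` class name is invented here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

static class DesignatorCheck
{
    // Assumption: strip leading/trailing non-word characters, so "(a)." becomes "a".
    static readonly Regex Wrapper = new Regex(@"^\W*(.*?)\W*$", RegexOptions.Singleline);

    public static List<string> FindMismatches(TextReader xml)
    {
        var problems = new List<string>();
        using (XmlReader reader = XmlReader.Create(xml))
        {
            var lineInfo = reader as IXmlLineInfo;
            string expected = null;          // non-null while inside a <designator>
            int line = 0;
            var text = new StringBuilder();

            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "designator")
                {
                    expected = reader.GetAttribute("value") ?? "";
                    line = lineInfo != null ? lineInfo.LineNumber : 0;
                    text.Clear();
                    if (reader.IsEmptyElement)   // <designator .../> has no end tag node
                    {
                        Check(expected, "", line, problems);
                        expected = null;
                    }
                }
                else if (expected != null && reader.NodeType == XmlNodeType.Text)
                {
                    text.Append(reader.Value);
                }
                else if (expected != null && reader.NodeType == XmlNodeType.EndElement
                         && reader.LocalName == "designator")
                {
                    Check(expected, text.ToString(), line, problems);
                    expected = null;
                }
            }
        }
        return problems;
    }

    static void Check(string expected, string text, int line, List<string> problems)
    {
        string core = Wrapper.Match(text).Groups[1].Value;
        if (core != expected)
            problems.Add($"Line {line}: designator value=\"{expected}\" but text is \"{text}\"");
    }

    static void Main()
    {
        string sample = "<root>\n<designator value=\"a\">(a).</designator>\n"
                      + "<designator value=\"b\">(c).</designator>\n</root>";
        foreach (string problem in FindMismatches(new StringReader(sample)))
            Console.WriteLine(problem);
    }
}
```

Because the reader never holds more than the current node, memory use should stay flat even for files of 120 MB or more.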
 
Create a valid XML file and generate an XSD from it using xsd.exe (comes with VS).

Edit your XSD to add string restrictions.

XML
<xs:element minOccurs="0" name="UserName">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:minLength value="5" />
      <xs:maxLength value="50" />
    </xs:restriction>
  </xs:simpleType>
</xs:element>



Now use it to validate your XML with XmlSchemaValidator; an example can be found at:
http://msdn.microsoft.com/en-us/library/system.xml.schema.xmlschemavalidator.aspx
 
I know this looks a little strange, but sometimes syntactic text manipulation performs better than semantic processing. Matching an XML file against a schema is not always the best approach. If I have understood correctly, the second "error" to be found is more of a semantic one, so schema validation would not be straightforward.

These two regular expressions could identify the problematic elements:
1) <(.*?)>\s*</\1>
2) <designator value="(.*?)">(?!(\1<)).*?</designator>
OK, these could be refined, but if the XML is well formed, they should be enough.
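
To show how the two patterns might be applied from C#, here is a small sketch that runs them over a string and reports the line of each match. The `RegexScan` class and the sample text are invented for illustration, and the closing tag is assumed to be spelled `designator`. Note that pattern 2 as written only accepts text that is exactly the attribute value, so a designator like `value="a"` with text `(a).` would also be reported unless the pattern is refined.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexScan
{
    // Pattern 1: an element whose content is empty or only whitespace.
    static readonly Regex EmptyElement = new Regex(@"<(.*?)>\s*</\1>");
    // Pattern 2: a designator whose text does not start with its value followed by '<'.
    static readonly Regex BadDesignator =
        new Regex(@"<designator value=""(.*?)"">(?!(\1<)).*?</designator>");

    public static List<string> Scan(string xml)
    {
        var findings = new List<string>();
        foreach (Regex pattern in new[] { EmptyElement, BadDesignator })
            foreach (Match m in pattern.Matches(xml))
                findings.Add($"Line {LineOf(xml, m.Index)}: {m.Value}");
        return findings;
    }

    // 1-based line number of a character offset (simple linear count, fine for a sketch).
    static int LineOf(string text, int index)
    {
        int line = 1;
        for (int i = 0; i < index; i++)
            if (text[i] == '\n') line++;
        return line;
    }

    static void Main()
    {
        string sample = "<root>\n<designator value=\"a\">a</designator>\n"
                      + "<designator value=\"b\">c</designator>\n<p>text</p>\n<p></p>\n</root>";
        foreach (string finding in Scan(sample))
            Console.WriteLine(finding);
    }
}
```

The findings can be written to a text file with `File.WriteAllLines`. Elements that carry attributes, such as `<p id="1"></p>`, are not caught by pattern 1 because the backreference includes the attributes.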

The only other thing to consider is the size of the file. I suppose it can be split into fragments on load if needed, since it has a structure. But a size of several hundred MiB does not seem extreme; of course, this depends on the machine. The framework will be able to process it, but might use more virtual memory.
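
If memory does become a concern, one simple way to fragment the input is to scan it line by line. This is only a sketch under a strong assumption: each element of interest starts and ends on the same line, so matches never span a line break. The `LineScan` class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class LineScan
{
    // Applies the pattern to one line at a time, so only a single line is held in memory.
    public static IEnumerable<string> ScanLines(TextReader reader, Regex pattern)
    {
        int lineNumber = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            foreach (Match m in pattern.Matches(line))
                yield return $"Line {lineNumber}: {m.Value}";
        }
    }

    static void Main()
    {
        var emptyElement = new Regex(@"<(.*?)>\s*</\1>");
        using (var reader = new StringReader("<p>This is the Content.</p>\n<p></p>"))
        {
            foreach (string finding in ScanLines(reader, emptyElement))
                Console.WriteLine(finding);
        }
    }
}
```

For a real file, pass a `StreamReader` opened on the path instead of the `StringReader`.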

By the way, if you decide to take this path, there are also implementations of regex over streams out there, like this one: http://www.developer.com/net/article.php/3719741/Building-a-Regular-Expression-Stream-Search-with-the-NET-Framework.htm

Good luck!
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


