Hi,

Is there a way to search for a particular value in an XML file without loading the file into memory? XmlDocument works fine for my requirement, but I want to process the file without loading it into memory, since the actual XML file may be 5 GB or larger.

XmlReader is the alternative I tried, using examples such as "How to Open Large XML files without Loading the XML Files?"

But I have not been able to work out how to traverse the XML nodes and search for a specific value.

The sample XML:
XML
<backup>
  <project>
    <issues>
       <issue>
          <fieldvalue id="fld1">1</fieldvalue>
          <fieldvalue id="fld2">test01</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
       <issue>
          <fieldvalue id="fld1">2</fieldvalue>
          <fieldvalue id="fld2">test02</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
       <issue>
          <fieldvalue id="fld1">3</fieldvalue>
          <fieldvalue id="fld2">test03</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
       <issue>
          <fieldvalue id="fld1">4</fieldvalue>
          <fieldvalue id="fld2">test04</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
    </issues>
  </project>
</backup>


Here "fld1" is the ID of the issue. I want to search by ID. If the ID exists in the XML, I want to take the entire <issue> node for further processing.

And the code snippet:
C#
        // Using XmlDocument
        protected void Button1_Click(object sender, EventArgs e)
        {
            XmlDocument xmlDoc = new XmlDocument();
            xmlDoc.Load(Server.MapPath(@"export_sample.xml"));
            XmlNodeList addlst = xmlDoc.SelectNodes("backup/project/issues/issue/fieldvalue[@id='fld1']");
            foreach (XmlNode issueNode in addlst)
            {
                if (issueNode.InnerText == IDTextBox.Text)
                {
                    Status.Text = issueNode.ParentNode.InnerXml;
                    break;
                }
                else
                {
                    Status.Text = "ID does not Exists";
                }
            }
        }
        // Using XmlTextReader
        protected void Button2_Click(object sender, EventArgs e)
        {
            XmlTextReader myTextReader = new XmlTextReader(Server.MapPath(@"export_sample.xml"));
            myTextReader.WhitespaceHandling = WhitespaceHandling.None;
            while (myTextReader.Read())
            {
                currentIssueNode = "";
                if (myTextReader.NodeType == XmlNodeType.Element &&
                    myTextReader.LocalName == "issue" 
                    && myTextReader.IsStartElement() == true)
                {
                    currentIssueNode = myTextReader.ReadOuterXml();
                    if (currentIssueNode.Contains("<fieldvalue id=\"fld1\">" + IDTextBox.Text + "</fieldvalue>"))
                    {
                        squishStatus.Text = "ID exist";
                        currentIssueNode = "";
                        myTextReader.Skip();
                    }
                    else
                    {
                        currentIssueNode = "";
                        squishStatus.Text = "ID does not exist";
                    }
                }
                myTextReader.MoveToContent();
            }
            myTextReader.Close();
        }
Updated 2-Apr-15 2:02am
Comments
Maciej Los 2-Apr-15 8:27am    
There's no way to read xml data without loading it ;(

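Please read my comment to the question.

I'd suggest using LINQ to XML, but I need to warn you: it loads the whole document, so if the data is huge, the performance of the code below might be unsatisfactory.

C#
// Assumes xDoc was loaded with XDocument.Load, which reads the entire file into memory.
var qry = xDoc.Element("backup")
            .Descendants("project")
            .Descendants("issues")
            .Descendants("issue")
            .Where(x => x.Elements("fieldvalue")
                         .Any(f => (string)f.Attribute("id") == "fld1" && f.Value == IDTextBox.Text));


The LINQ query above returns the <issue> nodes whose "fld1" value matches the ID entered.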
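
For files too large to load at once, LINQ to XML can still be used on one <issue> at a time by pairing it with a forward-only XmlReader. The following is only a minimal sketch under some assumptions: the names IssueStream, StreamIssues and FindIssue are hypothetical, the input is passed as a TextReader to keep the example self-contained (in the page code it would be a reader over Server.MapPath(@"export_sample.xml")), and <issue> elements are assumed not to be nested. Each <issue> is still materialized as a whole, so a single very large <issue> would still cost memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class IssueStream
{
    // Yields one <issue> element at a time; earlier elements become
    // eligible for garbage collection once the caller drops them.
    public static IEnumerable<XElement> StreamIssues(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "issue")
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node after </issue>, so no extra Read() here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    // Returns the first <issue> whose fld1 field equals id, or null.
    public static XElement FindIssue(TextReader input, string id)
    {
        return StreamIssues(input).FirstOrDefault(issue =>
            issue.Elements("fieldvalue")
                 .Any(f => (string)f.Attribute("id") == "fld1" && f.Value == id));
    }
}
```

A call such as IssueStream.FindIssue(File.OpenText(path), IDTextBox.Text) would then return the matching element, and ToString() on the result gives the entire <issue> markup.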
 
Comments
Sriram Mani 6-Apr-15 5:07am    
Thanks for the suggestion, Maciej. As Mario said in his comment, using LINQ might be a problem since the file is huge.
Don't use Maciej's approach; you will most likely get an OutOfMemoryException if you are expecting an XML file of 5 GB or more.
Your second approach, the one using XmlTextReader, is the right way to go. A forward-only reader is the way to read XML without loading the entire document at once, but you need to tweak it a bit.
First, note that the XmlReader.Create method is recommended over instantiating an XmlTextReader directly. Second, when you are reading an "issue" element, use ReadInnerXml instead of ReadOuterXml. Third, don't call Skip; instead, continue reading the sibling "issue" elements.

Try this:
C#
protected void Button2_Click(object sender, EventArgs e)
{
    string currentIssueNode = null;
    XmlReaderSettings settings = new XmlReaderSettings() { IgnoreWhitespace = true };
    using (var reader = XmlReader.Create(Server.MapPath(@"export_sample.xml"), settings))
    {
        string fieldvalue = string.Format("<fieldvalue id=\"fld1\">{0}</fieldvalue>", IDTextBox.Text);
        if (reader.ReadToFollowing("issue"))
        {
            do
            {
                currentIssueNode = reader.ReadInnerXml();
                if (currentIssueNode.Contains(fieldvalue))
                    break;
                else
                    currentIssueNode = null;
            } while (reader.ReadToNextSibling("issue"));
        }
    }
    if (!string.IsNullOrEmpty(currentIssueNode))
        Status.Text = currentIssueNode;
    else
        Status.Text = "ID does not Exists";
}
 
Comments
Sriram Mani 6-Apr-15 5:04am    
Thanks for the code, Mario. I tried the code you gave and it works, but the loop skips alternate issue nodes (mostly the even ones). For example, if I have issue nodes 1,2,3,4,5,6, it reads 1,3,5 and skips each even-numbered node. I think that at some point the loop reads twice, so that it consumes two consecutive issue nodes and ends up skipping the second one. I'm not sure if this is the case. Could you please share your thoughts on this? Thanks in advance.
Mario Z 7-Apr-15 3:54am    
First, I apologize for the late response; you know, holidays take all the free time...
Now back to your issue: unfortunately I'm unable to reproduce it with the above XML. I was able to target all four issues by changing the IDTextBox.Text value.
Can you temporarily upload your XML somewhere, or better yet a small VS test project that reproduces the issue, so that I can investigate it?

Also, just as a side note, you can try replacing the if (reader.ReadToFollowing("issue")) with while (reader.ReadToFollowing("issue")) and then removing the do - while (reader.ReadToNextSibling("issue")) loop.
But again, I'm curious why you are experiencing this issue and would like to debug it a bit.
Sriram Mani 13-Apr-15 6:40am    
Hi Mario,
Apologies for the delayed response. I figured out why it was skipping the nodes: the IgnoreWhitespace setting was set to true, which made the code skip some nodes. I set it to false and then it worked.
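
A likely explanation for why the whitespace setting matters, offered as a sketch and not as a confirmed diagnosis of this project: ReadInnerXml and ReadOuterXml both leave the reader on the node that follows the closing </issue> tag. When whitespace is ignored, that node is already the next <issue> start tag, so a following ReadToNextSibling or Read moves past it. When whitespace is kept, the reader lands on the whitespace between the two elements, which hides the problem for indented files only. A loop that calls Read only when the current node was not consumed should work with either setting. The names IssueScanner and FindIssueXml below are hypothetical, and the input is a TextReader just to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Xml;

static class IssueScanner
{
    // Returns the outer XML of the first <issue> containing the given fld1 value, or null.
    public static string FindIssueXml(TextReader input, string id)
    {
        string needle = "<fieldvalue id=\"fld1\">" + id + "</fieldvalue>";
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "issue")
                {
                    // ReadOuterXml already advances past </issue>,
                    // so the loop must not call Read() again here.
                    string issueXml = reader.ReadOuterXml();
                    if (issueXml.Contains(needle))
                        return issueXml;
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return null;
    }
}
```

The string comparison is kept from the original code; it assumes the attribute is written with double quotes and no extra attributes, exactly as in the sample XML.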

I also have another question. I was using VS2010 when I started writing the code. I used a big sample XML file of about 2.7 GB and it was traversed fine. I recently updated to VS2013 and tried the search again. It throws an out-of-memory exception at
currentIssueNode = reader.ReadInnerXml();
The server is IIS Express. Any ideas how we could overcome this problem?

Thanks in advance.
Mario Z 14-Apr-15 2:09am    
Hi, I must admit I wouldn't have guessed that IgnoreWhitespace was causing the issue; nevertheless, I'm glad you were able to figure that out.
Now, regarding the new issue: do you have exactly the same code accessing the same XML file, and does the issue reproduce in VS2013 but not in VS2010?
Would it be possible for you to temporarily upload a test project that reproduces the issue somewhere so that I can take a look at it?
Sriram Mani 14-Apr-15 5:23am    
The same code was working fine in VS2010. I haven't tested it again in VS2010 since I migrated the project to VS2013. In VS2013, even after deploying the site to IIS, I'm facing the same exception.

I can try to upload or mail the project as a ZIP with a sample XML file, but unfortunately not the actual file in which I'm facing this issue, since it contains client information and I'm not allowed to share it outside. The sample XML may not be useful, since the issue will not reproduce with it.

My guess is that the main reason for the problem is the presence of attachments (files) inside the issue nodes, which are written as huge binary strings. When an issue node contains attachments of several MB, reading the inner XML of that node runs into the memory problem.

Is there any alternative way of traversing the issue node instead of using the ReadInnerXml method? Something like a line-by-line read?
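
One possible alternative, sketched here under assumptions and not measured against a multi-gigabyte file: avoid ReadInnerXml entirely and walk each <issue> node by node. A first pass reads only the "fld1" field of each issue and lets the reader move past everything else; a second pass over a fresh reader copies just the matching issue to an XmlWriter with WriteNode, so the issue never has to exist as one string. The names LargeIssueSearch, FindIssueIndex and CopyIssue are hypothetical, and the sketch assumes <issue> elements are not nested inside one another.

```csharp
using System;
using System.IO;
using System.Xml;

static class LargeIssueSearch
{
    // Pass 1: returns the zero-based position of the <issue> whose fld1 equals id, or -1.
    public static int FindIssueIndex(XmlReader reader, string id)
    {
        int index = -1;
        while (reader.ReadToFollowing("issue"))
        {
            index++;
            // The subtree reader is limited to the current <issue>.
            using (XmlReader issue = reader.ReadSubtree())
            {
                while (issue.Read())
                {
                    if (issue.NodeType == XmlNodeType.Element
                        && issue.LocalName == "fieldvalue"
                        && issue.GetAttribute("id") == "fld1")
                    {
                        if (issue.ReadElementContentAsString() == id)
                            return index;
                        break; // wrong ID: the rest of this issue is not needed
                    }
                }
            } // closing the subtree reader moves the outer reader to </issue>
        }
        return -1;
    }

    // Pass 2: on a fresh reader, copies the issue at the given position to the writer.
    public static bool CopyIssue(XmlReader reader, int index, XmlWriter writer)
    {
        if (index < 0)
            return false;
        for (int i = 0; i <= index; i++)
        {
            if (!reader.ReadToFollowing("issue"))
                return false;
        }
        writer.WriteNode(reader, false); // copies the element node by node
        writer.Flush();
        return true;
    }
}
```

In the page code, both readers would come from XmlReader.Create(Server.MapPath(@"export_sample.xml")), and the writer could target a temporary file so that a large attachment is written to disk instead of being shown in Status.Text.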

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


