Search for a value in XML without loading it in memory

Question

Hi,

Is there a way to search for a particular value in an XML file without loading it into memory? XmlDocument works fine for my requirement, but I want to process the file without loading it into memory, since the actual XML file might be 5 GB or larger.

XmlReader is the alternative I tried, using examples such as "How to Open Large XML files without Loading the XML Files?"

But I am not able to work out the logic to traverse the XML nodes and search for a specific value.

The sample XML:

XML

<backup>
  <project>
    <issues>
       <issue>
          <fieldvalue id="fld1">1</fieldvalue>
          <fieldvalue id="fld2">test01</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
       <issue>
          <fieldvalue id="fld1">2</fieldvalue>
          <fieldvalue id="fld2">test02</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
       <issue>
          <fieldvalue id="fld1">3</fieldvalue>
          <fieldvalue id="fld2">test03</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
       <issue>
          <fieldvalue id="fld1">4</fieldvalue>
          <fieldvalue id="fld2">test04</fieldvalue>
          <fieldvalue id="fld3">some desc</fieldvalue>
       </issue>
    </issues>
  </project>
</backup>

Here "fld1" is the ID of the issue. I want to search by ID. If the ID exists in the XML, I want to take the entire <issue> node for further processing.

And the code snippet:

C#

// Using XmlDocument
protected void Button1_Click(object sender, EventArgs e)
{
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(Server.MapPath(@"export_sample.xml"));
    XmlNodeList addlst = xmlDoc.SelectNodes("backup/project/issues/issue/fieldvalue[@id='fld1']");
    foreach (XmlNode issueNode in addlst)
    {
        if (issueNode.InnerText == IDTextBox.Text)
        {
            Status.Text = issueNode.ParentNode.InnerXml;
            break;
        }
        else
        {
            Status.Text = "ID does not Exists";
        }
    }
}

// Using XmlTextReader
protected void Button2_Click(object sender, EventArgs e)
{
    XmlTextReader myTextReader = new XmlTextReader(Server.MapPath(@"export_sample.xml"));
    myTextReader.WhitespaceHandling = WhitespaceHandling.None;
    while (myTextReader.Read())
    {
        currentIssueNode = "";
        if (myTextReader.NodeType == XmlNodeType.Element &&
            myTextReader.LocalName == "issue" &&
            myTextReader.IsStartElement() == true)
        {
            currentIssueNode = myTextReader.ReadOuterXml();
            if (currentIssueNode.Contains("<fieldvalue id=\"fld1\">" + IDTextBox.Text + "</fieldvalue>"))
            {
                squishStatus.Text = "ID exist";
                currentIssueNode = "";
                myTextReader.Skip();
            }
            else
            {
                currentIssueNode = "";
                squishStatus.Text = "ID does not exist";
            }
        }
        myTextReader.MoveToContent();
    }
    myTextReader.Close();
}

Posted 2-Apr-15 1:53am

Sriram Mani

Updated 2-Apr-15 2:02am

v2


Comments

Maciej Los 2-Apr-15 8:27am

There's no way to read xml data without loading it ;(

2 solutions

Solution 1

Please read my comment on the question.

I'd suggest using LINQ to XML, but I need to warn you: if the amount of data is huge, the performance of the code below might be unsatisfactory.

C#

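// Note: XDocument.Load reads the whole file into memory.
XDocument xDoc = XDocument.Load(Server.MapPath(@"export_sample.xml"));
var qry = xDoc.Element("backup")
            .Descendants("project")
            .Descendants("issues")
            .Descendants("issue")
            // Keep only issues whose "fld1" field equals the ID being searched for.
            .Where(x => x.Elements("fieldvalue")
                .Any(f => (string)f.Attribute("id") == "fld1" && f.Value == IDTextBox.Text));

The above LINQ query returns the <issue> nodes whose "fld1" value matches the ID.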

Posted 2-Apr-15 2:32am

Maciej Los

Updated 2-Apr-15 2:43am

v2

Comments

Sriram Mani 6-Apr-15 5:07am

Thanks for the suggestion, Maciej. As Mario said in his comment, using LINQ might pose a problem since the file is huge.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Mario Z · Accepted Answer · 2015-04-03T02:57:00

Solution 2

Don't use Maciej's approach; with an XML file of 5 GB or more, loading the whole document will almost certainly throw an OutOfMemoryException.
Your second approach, the one using XmlTextReader, is the right way to go. That is the only way you can read XML without loading the entire document at once, but you need to tweak it a bit.
First, note that it is recommended to use the XmlReader.Create method instead of instantiating an XmlTextReader. Second, when you are reading an "issue" element, use ReadInnerXml instead of ReadOuterXml. Third, don't call Skip; instead, continue reading the sibling "issue" elements.

Try this:

C#

protected void Button2_Click(object sender, EventArgs e)
{
    string currentIssueNode = null;
    XmlReaderSettings settings = new XmlReaderSettings() { IgnoreWhitespace = true };
    using (var reader = XmlReader.Create(Server.MapPath(@"export_sample.xml"), settings))
    {
        string fieldvalue = string.Format("<fieldvalue id=\"fld1\">{0}</fieldvalue>", IDTextBox.Text);
        if (reader.ReadToFollowing("issue"))
        {
            do
            {
                currentIssueNode = reader.ReadInnerXml();
                if (currentIssueNode.Contains(fieldvalue))
                    break;
                else
                    currentIssueNode = null;
            } while (reader.ReadToNextSibling("issue"));
        }
    }
    if (!string.IsNullOrEmpty(currentIssueNode))
        Status.Text = currentIssueNode;
    else
        Status.Text = "ID does not Exists";
}

Posted 3-Apr-15 2:57am

Mario Z

Comments

Sriram Mani 6-Apr-15 5:04am

Thanks for the code, Mario. I tried the code you gave and it works, but the loop skips alternate issue nodes (mostly the even ones). For example, if I have issue nodes 1, 2, 3, 4, 5, 6, it reads 1, 3, 5 and skips each even-numbered node. I think that at some point the loop reads twice, so that it consumes two consecutive issue nodes and ends up skipping the second one. I'm not sure whether this is the case. Could you please share your comments on this? Thanks in advance.

Mario Z 7-Apr-15 3:54am

First, I apologize for the late response; you know, holidays take all the free time.
Now back to your issue. Unfortunately I'm unable to reproduce it with the above XML; I was able to target all four issues by changing the IDTextBox.Text value.
Can you temporarily upload your XML somewhere, or better yet a small VS test project that reproduces your issue, so that I can investigate it?

Also, just as a side note, you can try replacing the if (reader.ReadToFollowing("issue")) with while (reader.ReadToFollowing("issue")) and then removing the other do - while (reader.ReadToNextSibling("issue")).
But again, I'm curious why you are experiencing this issue and would like to debug it a bit.

Sriram Mani 13-Apr-15 6:40am

Hi Mario,
Apologies for the delayed response. I figured out why it was skipping the nodes. The XmlReaderSettings option IgnoreWhitespace was set to true, which made the code skip some nodes. I set it to false, and then it worked.

I also have another question. I was using VS2010 when I started writing the code. I used a big sample XML file of about 2.7 GB and it traversed the file without problems. I recently upgraded to VS2013 and tried the search again. It throws an OutOfMemoryException at
currentIssueNode = reader.ReadInnerXml();
The server is IIS Express. Any ideas on how we could overcome this problem?

Thanks in advance.
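
A note on the skipped nodes described above, offered as a hedged reading of XmlReader behaviour rather than something confirmed in this thread: ReadInnerXml leaves the reader positioned after the closing </issue> tag. With IgnoreWhitespace = true that position is already the next <issue> element, so the following ReadToNextSibling("issue") call moves past it. With whitespace kept, the reader lands on the whitespace node in between, and the next sibling is found. The minimal sketch below compares that loop with one that does not depend on the whitespace setting. The class name IssueScanDemo, its method names and the shortened sample XML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo class; not part of the code posted in this thread.
public static class IssueScanDemo
{
    // Shortened, hypothetical sample with four issues separated by whitespace.
    public const string SampleXml =
        "<issues>\n" +
        "  <issue><fieldvalue id=\"fld1\">1</fieldvalue></issue>\n" +
        "  <issue><fieldvalue id=\"fld1\">2</fieldvalue></issue>\n" +
        "  <issue><fieldvalue id=\"fld1\">3</fieldvalue></issue>\n" +
        "  <issue><fieldvalue id=\"fld1\">4</fieldvalue></issue>\n" +
        "</issues>";

    // Same loop shape as Solution 2: ReadInnerXml followed by ReadToNextSibling.
    public static List<string> ScanWithNextSibling(string xml, bool ignoreWhitespace)
    {
        var seen = new List<string>();
        var settings = new XmlReaderSettings { IgnoreWhitespace = ignoreWhitespace };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            if (reader.ReadToFollowing("issue"))
            {
                do
                {
                    // ReadInnerXml also advances the reader past </issue>.
                    seen.Add(reader.ReadInnerXml());
                } while (reader.ReadToNextSibling("issue"));
            }
        }
        return seen;
    }

    // Alternative loop: only call Read() when the current node was not consumed.
    public static List<string> ScanWithoutSkipping(string xml, bool ignoreWhitespace)
    {
        var seen = new List<string>();
        var settings = new XmlReaderSettings { IgnoreWhitespace = ignoreWhitespace };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "issue")
                    seen.Add(reader.ReadInnerXml()); // already moved to the next node
                else
                    reader.Read();
            }
        }
        return seen;
    }
}
```

With the sample above, ScanWithNextSibling should visit only issues 1 and 3 when ignoreWhitespace is true and all four when it is false, while ScanWithoutSkipping should visit all four either way.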

Mario Z 14-Apr-15 2:09am

Hi, I must admit I wouldn't have guessed that IgnoreWhitespace was causing the issue; nevertheless, I'm glad you were able to figure it out.
Now, regarding the new issue: do you have exactly the same code accessing the same XML file, and does the problem reproduce in VS2013 but not in VS2010?
Would it be possible for you to temporarily upload a test project that reproduces the issue so that I can take a look at it?

Sriram Mani 14-Apr-15 5:23am

The same code was working fine in VS2010. I haven't tested it again in VS2010 since I migrated the project to VS2013. In VS2013, even after deploying to IIS, I'm facing the same exception.

I can try to upload or mail the project as a ZIP with a sample XML file, but unfortunately not the actual file in which I'm facing this issue, since it contains client information and I'm not allowed to share it. The sample XML may not be useful, since the issue will not reproduce with it.

My guess is that the main cause of the problem is the presence of attachments (files) inside the issue nodes, which are written as huge binary strings. When several attachments of multiple MB each are inside an issue node, reading the inner XML of that node runs into the memory problem.

Is there any alternative way of traversing the issue node instead of using the ReadInnerXml method? Something like a line-by-line read?

Sriram Mani 14-Apr-15 11:30am

Hi Mario,

XElement was the answer.

XElement el = XNode.ReadFrom(reader) as XElement;
if (el.Element(searchName).Value == IDTextBox.Text)
    break;
else
    el = null;

This worked like magic, without any exceptions!
It also provides a good way of manipulating the elements inside the issue node with LINQ to XML.
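
Putting the pieces of this thread together, a fuller version of that XElement approach could look like the sketch below. It is only an illustration under assumptions: the IssueFinder class and its method names are hypothetical, and it matches the fieldvalue whose id attribute is "fld1", as in the sample XML from the question. The loop advances the reader only when the current node was not consumed, so no <issue> is skipped regardless of the whitespace setting. Only one <issue> element is materialized at a time, though a single issue with very large attachments still has to fit in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; names are illustrative, not from the original posts.
public static class IssueFinder
{
    // Lazily yields each <issue> element as an XElement, one at a time.
    public static IEnumerable<XElement> StreamIssues(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "issue")
            {
                // XNode.ReadFrom consumes the whole element and leaves the
                // reader on the node after </issue>, so no Read() is needed here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    // Returns the first issue whose "fld1" field equals id, or null if none matches.
    public static XElement FindIssueById(XmlReader reader, string id)
    {
        return StreamIssues(reader).FirstOrDefault(issue =>
            issue.Elements("fieldvalue").Any(f =>
                (string)f.Attribute("id") == "fld1" && f.Value == id));
    }
}
```

In the button handler this could be called with a reader from XmlReader.Create(Server.MapPath(@"export_sample.xml")) inside a using block, passing IDTextBox.Text as the id; a null result means the ID was not found. Because FirstOrDefault stops at the first match, the rest of the file is not read once the issue is found.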

Mario Z 15-Apr-15 5:05am

Sriram, that is a much better approach than the one I was preparing.
You're taking the best of both worlds (using both XmlReader and LINQ to XML).
I'm glad you were able to resolve your issues, and good luck with your future development.