Fast screen scraping with XPath over a modified XmlTextReader and SgmlReader

Scott Holodak

May 10, 2007

CPOL

Adding XPath-like position tracking to XmlTextReader and SgmlReader for lightweight, high-performance screen scraping.

Screenshot - FastXPathReaderScreenshot.png

Introduction

This is a lightweight modification to XmlTextReader and Microsoft's SgmlReader that adds position tracking in the form of an XPath expression. The objective is to extract data from a website without using regular expressions or building an XmlDocument in memory.

Background

I've been doing quite a bit of screen-scraping work over the years. Every time I started a new project, I would abandon my existing code and rewrite it. Over time, I've experimented with various tag balancers, DOM vs. pull parsers, etc. I've tracked the position in the document using everything from nested if statements and boolean variables to stacks. The resulting code was difficult to debug and painful to maintain (e.g., when the site template changes).

The solution I present here is based on XML pull parsing (i.e., Readers), uses a stack (sort of) to track position, and exposes that position in the form of an XPath expression. The XPath expression is correct; however, it is not namespace aware, it is not the minimal expression, it does not match attributes (once you find the element, you can get the attributes easily enough), and it may match more than one section of the document. The first is because I'm lazy, and the others are in the interest of speed and simplicity. For example:

While you might like to get this:
id('featured1ct')/span/h3/a

...or this
//html/body/div[1]/div[3]/div[2]/div[1]/div[1]/div[1]/div[2]/span[1]/span/h3/a

You get one or more of these instead
//html/body/div/div/div/div/div/div/div/div/span/span/h3/a

...quickly

If you only care about a particular occurrence of a matching document fragment, you will need to count the matches yourself and act on the one you want.
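Counting matches can be sketched as follows. This is only an illustration under a few assumptions: it uses a plain XmlReader with the path tracking written inline as a stand-in for the XPath property described in this article, the class and method names (OccurrenceCounter, ReadNthMatch) are hypothetical, and it assumes the matched element contains text only.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the article's download
public static class OccurrenceCounter {
  // Returns the text of the n-th (1-based) element whose simplified path
  // equals targetPath, or null if there are fewer than n matches.
  public static string ReadNthMatch(TextReader input, string targetPath, int n) {
    List<string> path = new List<string>();
    int seen = 0;
    using (XmlReader reader = XmlReader.Create(input)) {
      while (reader.Read()) {
        if (reader.NodeType != XmlNodeType.Element) continue;
        // Drop names left over from deeper or sibling elements,
        // then record the element the reader is on now
        if (path.Count > reader.Depth) {
          path.RemoveRange(reader.Depth, path.Count - reader.Depth);
        }
        path.Add(reader.Name);
        string current = "//" + string.Join("/", path.ToArray());
        // ++seen only runs when the path matches
        if (current == targetPath && ++seen == n) {
          // Assumes text-only content; throws if the element has child elements
          return reader.ReadElementContentAsString();
        }
      }
    }
    return null;
  }
}
```

With the made-up input `<topalbums><album><name>Dirt</name></album><album><name>Facelift</name></album></topalbums>` wrapped in a StringReader, calling `ReadNthMatch(input, "//topalbums/album/name", 2)` returns "Facelift". The same counter pattern applies inside a switch on the XPath property.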

I am aware that there is an XPathReader project out there, and I have played around with it. It offers a powerful solution for extracting data, but without even looking at the code, I can tell you that it's doing a lot more work behind the scenes than is necessary for most simple screen scraping jobs, and it's got a larger memory footprint. Given the one-way nature of Readers, you still have to be careful about ordering your expressions with respect to the structure of the document. And finally, unless I'm missing something, there's no easy way to attach two readers together. In other words, I couldn't find a way to get the SgmlReader to feed directly into the XPathReader without dumping the output into a MemoryStream first. If you really need to match one of the first few expressions, go with the XPathReader. Otherwise, you're in the right place.

The code

The FastSgmlXPathReader and FastXPathReader are intended for HTML and XML documents, respectively. The implementations differ slightly, but the end result is the same.

The FastSgmlXPathReader (for HTML)

using System;
using System.Collections.Generic;
using System.Text;
using Sgml;
using System.Xml;

namespace FastXPathReader {
  public class FastSgmlXPathReader : SgmlReader {
    // Not a stack b/c we need to view the entire list
    // to create the string representation

    private List<string> PositionTracker = new List<string>();

    // Used to build the string representation

    private StringBuilder XPathBuilder = new StringBuilder();

    // Override the Read() function to track changes to the XPath

    public override bool Read() {      
      bool Value = base.Read();
      if (Value && base.NodeType == XmlNodeType.Element) {
        while (PositionTracker.Count > this.Depth) {
          // Remove any elements beyond this depth

          PositionTracker.RemoveAt(PositionTracker.Count - 1);
        }
        if (this.Depth != PositionTracker.Count) {
          // Add a new element at this depth

          PositionTracker.Add(this.Name);
        } else {
          // Change the element at this depth

          PositionTracker[PositionTracker.Count - 1] = this.Name;
        }
      }
      return Value;
    }

    // Build an XPath expression from the current location.

    public string XPath {
      get {
        XPathBuilder.Length = 0;
        XPathBuilder.Append("/");
        for (int i = 0; i < PositionTracker.Count; i++) {
          XPathBuilder.Append("/" + PositionTracker[i]);
        }
        return XPathBuilder.ToString();
      }
    }

    // Call the base constructors

    public FastSgmlXPathReader() : base() { }
  }
}

The FastXPathReader (for XML)

using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.IO;

namespace FastXPathReader {
  public class FastXPathReader : XmlTextReader {
    // Not a stack b/c we need to view
    // the entire list to create the string representation

    private List<string> PositionTracker = new List<string>();
    
    // Used to build the string representation

    private StringBuilder XPathBuilder = new StringBuilder();

    // Override the Read() function to track changes to the XPath

    public override bool Read() {
      bool Value = base.Read();
      if (Value) {
        if (base.NodeType == XmlNodeType.Document || 
                   base.NodeType == XmlNodeType.Element) {
          if (PositionTracker.Count < this.Depth || 
                   this.Depth == 0 || PositionTracker.Count == 0) {
            // Add the item

            PositionTracker.Add(this.Name);
          } else {
            if (PositionTracker.Count == 0) { 
              // Don't change the root node.

            } else if (PositionTracker.Count > this.Depth) {
              // Change the item at this depth

              PositionTracker[PositionTracker.Count - 1] = this.Name;
            } else {
              // Add a new item for this depth

              PositionTracker.Add(this.Name);
            }
          }
        } else if (base.NodeType == XmlNodeType.EndElement) {
          // Strange bug fix/workaround, but don't remove the root element

          if (PositionTracker.Count > 1) {
            PositionTracker.RemoveAt(PositionTracker.Count - 1);
          }
        }   
      }
      return Value;
    }

    // Build an XPath expression from the current location.

    public string XPath {
      get {
        XPathBuilder.Length = 0;
        XPathBuilder.Append("/");
        for (int i = 0; i < PositionTracker.Count; i++) {
          XPathBuilder.Append("/" + PositionTracker[i]);
        }
        return XPathBuilder.ToString();
      }
    }

    // Call the base constructors

    public FastXPathReader(Stream input) : base(input) {}
    public FastXPathReader(string url) : base(url) {}
    public FastXPathReader(TextReader input) : base(input) { }
    protected FastXPathReader(XmlNameTable nt) : base(nt) { }
    public FastXPathReader(Stream input, XmlNameTable nt) : base(input, nt) { }
    public FastXPathReader(string url, Stream input) : base(url, input) { }
    public FastXPathReader(string url, TextReader input) : base(url, input) { }
    public FastXPathReader(string url, XmlNameTable nt) : base(url, nt) { }
    public FastXPathReader(TextReader input, XmlNameTable nt) : base(input, nt) { }
    public FastXPathReader(Stream xmlFragment, XmlNodeType fragType, 
                           XmlParserContext context) : 
                           base(xmlFragment, fragType, context) { }
    public FastXPathReader(string url, Stream input, 
                           XmlNameTable nt) : base(url, input, nt) { }
    public FastXPathReader(string url, TextReader input, 
                           XmlNameTable nt) : base(url, input, nt) { }
    public FastXPathReader(string xmlFragment, XmlNodeType fragType, 
                           XmlParserContext context) : 
                           base(xmlFragment, fragType, context) { }
  }
}

Using the code

I provide two samples. The first scrapes the headline and description of the top story on yahoo.com, and the second extracts the top three albums for a band (Alice in Chains in the example) from Last.fm's Audioscrobbler web service. The former demonstrates scraping an unbalanced HTML document, while the latter shows a fast and easy way to get data out of an XML document.

In both cases, you just loop through Read() statements on the Readers and do a 'switch' on the XPath property I've added to the Readers to find the XPath you're looking for.

The Yahoo Example (HTML)

// Create a request for the Yahoo! homepage

HttpWebRequest Request = (HttpWebRequest)
   WebRequest.Create("http://www.yahoo.com/");

// Pretend we're Firefox so we know what Yahoo! is serving up.

Request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0;" + 
                    " en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3";

// Get the response from the server

using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse()) {
  // Create a FastSgmlXPathReader SGML Reader

  using (FastSgmlXPathReader SgmlReader = new FastSgmlXPathReader()) {
    // Wrap the response stream in a StreamReader

    using (StreamReader InputStreamReader = 
           new StreamReader(Response.GetResponseStream())) {
      // Initialize the SgmlReader

      SgmlReader.InputStream = InputStreamReader;
      SgmlReader.DocType = "HTML";
      bool AllDone = false;
      while (!AllDone && SgmlReader.Read()) {
        if (SgmlReader.NodeType == XmlNodeType.Element) {
          switch (SgmlReader.XPath) {
            case "//html/body/div/div/div/div/div/div/div/div/span/span/h3/a":
              string Url = "http://www.yahoo.com/" + SgmlReader["href"];
              lnkHeadline.Text = SgmlReader.ReadInnerXml();
              lnkHeadline.Links.Add(0, lnkHeadline.Text.Length, Url);
              break;
            case "//html/body/div/div/div/div/div/div/div/div/span/span/p":
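              string Details = SgmlReader.ReadInnerXml();
              // Keep only the text before the first nested tag, if there is one
              int TagStart = Details.IndexOf('<');
              lblDetails.Text = TagStart >= 0 ? Details.Substring(0, TagStart) : Details;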
              AllDone = true;
              break;
          }
        }
      }
    }
  }
}

The Last.fm Example (XML)

Request = (HttpWebRequest)WebRequest.Create("http://ws.audioscrobbler.com" + 
                         "/1.0/artist/Alice+In+Chains/topalbums.xml");
Request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; " + 
                    "en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3";

// Get the response from the server

int i = 0;
using (HttpWebResponse Response = (HttpWebResponse)Request.GetResponse()) {
  using (FastXPathReader XPathReader = 
         new FastXPathReader(Response.GetResponseStream())) {
    bool AllDone = false;
    while (!AllDone && XPathReader.Read()) {
      if (XPathReader.NodeType == XmlNodeType.Element) {
        switch (XPathReader.XPath) {
          case "//topalbums":
            lblArtist.Text = XPathReader["artist"];
            break;
          case "//topalbums/album/name":
            i++;
            if (i == 1) {
              lbl1.Text = "1. " + XPathReader.ReadInnerXml();
            } else if (i == 2) {
              lbl2.Text = "2. " + XPathReader.ReadInnerXml();
            } else if (i == 3) {
              lbl3.Text = "3. " + XPathReader.ReadInnerXml();
            } else {
              AllDone = true;
            }
            break;
        }
      }
    }
  }
}

As you can see, it's fairly painless to get the data out, whether it's HTML or XML you're dealing with. The bulk of the work is just setting up the WebRequests and getting the response stream. Notice that you're reading XML nodes directly out of the HttpWebResponse on the fly, without any intermediate storage.

Points of interest

While the FastSgmlXPathReader should work fine on XML documents, you should use the FastXPathReader instead whenever you know the document is well-formed (in the interest of performance).

It was interesting to discover that the same code didn't work for the FastSgmlXPathReader and the FastXPathReader, seeing as they both essentially operate in the same way. I spent a few hours trying to figure out why they behave differently, and ultimately just decided to make them both work the same. I suspect it has something to do with the SgmlReader creating elements on-the-fly and how that does or does not affect the subsequent Read() operations. If anyone has any insight, I'm all ears.
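One way to investigate the difference is to log the Depth each reader reports for every element. The sketch below is a hypothetical diagnostic helper (DepthProbe is not part of the article's code) that accepts any XmlReader, so the same loop can be pointed at an XmlTextReader and at an SgmlReader over equivalent markup.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical diagnostic helper for comparing readers
public static class DepthProbe {
  // Returns "name@depth" for every element the reader reports
  public static List<string> ElementDepths(XmlReader reader) {
    List<string> result = new List<string>();
    while (reader.Read()) {
      if (reader.NodeType == XmlNodeType.Element) {
        result.Add(reader.Name + "@" + reader.Depth);
      }
    }
    return result;
  }
}
```

For an XmlTextReader over `<a><b/></a>`, this yields "a@0" and "b@1", since the root element sits at depth 0. If an SgmlReader reported different numbers for the same markup, that offset would be one possible (unverified) reason why the two Read() overrides cannot share the same logic.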

If you run into any problems using the SGML parser itself (e.g., entity issues), you'll have to do some digging. I don't have anything to do with that project.

Possible performance enhancements

I realize that the XPath property code could be more efficient. I had considered abandoning the stack (List<string>) altogether and instead trimming and appending to the StringBuilder directly, scanning for the position of the last / to find where to cut. I'm open to your suggestions.
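A minimal sketch of that StringBuilder-only idea might look like this. The PathBuffer class and its Push and Pop methods are hypothetical names introduced for illustration, and whether this is measurably faster than rebuilding the string from the list would need profiling.

```csharp
using System;
using System.Text;

// Hypothetical helper: keeps the current path in a single StringBuilder
public class PathBuffer {
  // Starts as "/" so the first Push produces "//name"
  private readonly StringBuilder buffer = new StringBuilder("/");

  // Append "/name" when descending into a child element
  public void Push(string name) {
    buffer.Append('/').Append(name);
  }

  // Cut everything from the last '/' onward when leaving an element.
  // Does nothing when the buffer holds only the leading "/".
  public void Pop() {
    int i = buffer.Length - 1;
    while (i > 1 && buffer[i] != '/') {
      i--;
    }
    if (i >= 1) {
      buffer.Length = i;
    }
  }

  public override string ToString() {
    return buffer.ToString();
  }
}
```

After Push("html") and Push("body"), the buffer reads "//html/body"; one Pop() trims it back to "//html". Moving to a sibling element is a Pop() followed by a Push(). Element names cannot contain a slash, so scanning backwards for it is safe.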

Resources

I'd link to the GotDotNet workspaces, but it looks like GotDotNet's on the way out.

History

May 10, 2007: Initial release.