Tip/Trick

C#: Website HTML Content Parsing, or How To Get Needed Info From Website

12 Jan 2015 · CPOL · 2 min read
How to get and parse website content.

Introduction

How can we get content from a website?

We can use one of three ways:

1. Open the website in a browser engine, i.e., the standard WebBrowser control or some third-party engine (here is an article about WebBrowser and third-party engines), and read the content of the DOM elements you need from the page.

2. Download the HTML content via System.Net.WebClient and then parse it with String.IndexOf()/Substring(), regular expressions, or the HtmlAgilityPack library.

3. Use the website's API (if one exists): send a query to the API and read the response, again using System.Net.WebClient or other System.Net classes.

Way 1 - Via browser engine

For example, suppose we have a website about the weather with the following HTML content:

HTML
<html>
<head><title>Weather</title></head>
<body>
  City: <div id="city">Monte-Carlo</div>
  Precipitation:
  <div id="precip">
    <img src="/rain.jpg" />
  </div>
  Day temperature: <div class="t">20 C</div>
  Night temperature: <div class="t">18 C</div>
</body>
</html>

Tip: if you don't have internet access or can't reach my site, you can create your own local *.html file with this HTML content and navigate to it instead.
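For instance, here is a minimal sketch of navigating to such a local file (assuming you saved the markup above as weather.html next to the executable):

C#
// a sketch: point the WebBrowser at a local copy of the page
// (assumes weather.html lies next to the application's .exe)
webBrowser1.Navigate(new Uri(System.IO.Path.GetFullPath("weather.html")));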

Let's get the city name (i.e., Monte-Carlo).

You create a WebBrowser (programmatically or in the form designer), navigate to the website, and when the page has loaded (in the DocumentCompleted event; make sure the page really is fully loaded), you get the DOM element (the first div) by its id "city" and read its inner text ("Monte-Carlo"):

C#
// getting city
var divCity = webBrowser1.Document.GetElementById("city"); // getting the first div (the one with id "city")
var city = divCity.InnerText;
label1.Text = "City: " + city;
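
For completeness, here is a minimal sketch of the navigation and event wiring that the snippets in this section assume (webBrowser1 and label1 are the usual designer names):

C#
// a sketch: navigate to the page and wait for DocumentCompleted
// before touching the DOM
webBrowser1.DocumentCompleted += (s, e) =>
{
    // DocumentCompleted can also fire for frames; react only to the main page
    if (e.Url != webBrowser1.Url) return;

    var divCity = webBrowser1.Document.GetElementById("city");
    label1.Text = "City: " + divCity.InnerText;
};
webBrowser1.Navigate("http://csharp-novichku.ucoz.org/pagetoparse.html");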

Next, let's get the precipitation image link and show it in a PictureBox:

C#
// getting precipitation
var divPrecip = webBrowser1.Document.GetElementById("precip");
var img = divPrecip.Children[0]; // first child element of precip, i.e. <img>
var imgSrc = img.GetAttribute("src"); // get src attribute of <img>
pictureBox1.ImageLocation = imgSrc;

Lastly, let's get the day and night temperatures:

C#
// The WebBrowser's Document has no GetElementsByClassName method, so we write one ourselves
private HtmlElement[] GetElementsByClassName(WebBrowser wb, string tagName, string className)
{
    var l = new List<HtmlElement>();

    var els = wb.Document.GetElementsByTagName(tagName); // all elements with the given tag
    foreach (HtmlElement el in els)
    {
        // getting the "class" attribute value...
        // but stop! it isn't "class", it is "className"! 0_o
        // el.GetAttribute("className") works, and el.GetAttribute("class") does not!
        // IE is so IE...
        if (el.GetAttribute("className") == className)
        {
            l.Add(el);
        }
    }

    return l.ToArray();
}

// ...

// getting day and night temperature
var divsTemp = GetElementsByClassName(webBrowser1, "div", "t");
// day
var divDayTemp = divsTemp[0]; // day temperature div
var dayTemp = divDayTemp.InnerText; // day temperature (i.e. 20 C)
label2.Text = "Day temperature: " + dayTemp;
// night
var divNightTemp = divsTemp[1]; // night temperature div
var nightTemp = divNightTemp.InnerText; // night temperature (i.e. 18 C)
label3.Text = "Night temperature: " + nightTemp;

Way 2 - Via WebClient And HtmlAgilityPack

You can download a website's full HTML content via System.Net.WebClient:

C#
using System.Net;
// ...
string HTML;
using (var wc = new WebClient()) // the "using" block disposes the WebClient once the download has completed
{
    HTML = wc.DownloadString("http://csharp-novichku.ucoz.org/pagetoparse.html");
}

Then you can parse it with the third-party HtmlAgilityPack library, much as with the browser engine:

C#
// create HtmlAgilityPack document object from HTML
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(HTML);

// parsing HTML
label1.Text = "City: " + doc.GetElementbyId("city").InnerText;

Note that HtmlAgilityPack does NOT support all of the WebBrowser engine's methods! For example, there is no GetElementsByTagName() method; you have to define such helpers yourself.
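
For instance, here is a rough sketch of rebuilding the GetElementsByClassName helper from Way 1 on top of HtmlAgilityPack's Descendants() and GetAttributeValue() methods (doc, label2 and label3 are reused from the snippets above):

C#
using System.Linq;
using HtmlAgilityPack;
// ...

// a sketch of a GetElementsByClassName-style helper on top of HtmlAgilityPack
private static HtmlNode[] GetElementsByClassName(HtmlDocument doc, string tagName, string className)
{
    return doc.DocumentNode
              .Descendants(tagName)                                      // all elements with the given tag
              .Where(n => n.GetAttributeValue("class", "") == className) // plain "class" works here, unlike IE
              .ToArray();
}

// usage: day and night temperature, as in Way 1
var divsTemp = GetElementsByClassName(doc, "div", "t");
label2.Text = "Day temperature: " + divsTemp[0].InnerText;
label3.Text = "Night temperature: " + divsTemp[1].InnerText;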

Way 3 - Via Website API

To be continued...
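
Until then, here is only a rough sketch of the idea from the introduction: send a query to an API endpoint and read the response with WebClient. The URL and the response format below are purely hypothetical:

C#
using System.Net;
// ...

// hypothetical example only: the endpoint and the JSON shape are made up
string json;
using (var wc = new WebClient())
{
    // suppose the API answers with {"city":"Monte-Carlo","day":"20 C","night":"18 C"}
    json = wc.DownloadString("http://api.example.com/weather?city=Monte-Carlo");
}
// parse the response with any JSON parser (e.g. Json.NET) instead of parsing HTML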

Which way is better?

A website API is usually the most convenient and lightweight way. But it usually comes with limitations, mainly for security reasons.

The WebBrowser way is the easiest one if the website has no API. It also simulates user actions very naturally and sometimes lets you bypass a website's anti-bot protection. It is the only way if the site's content is loaded solely by JavaScript, because JavaScript does not run in WebClient.

The WebClient way is faster and usually more robust than the WebBrowser way.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
Question: HttpAgilityPack or Appliedalgo
devvvy, 13-Jan-15 18:36

Question: A good start, but...
DaProgramma, 13-Jan-15 3:29
In my experience, only solution 1 is feasible for larger websites. They are loaded with JavaScript that does AJAX all the time. Those websites want a full browser application, and the browser must have screen space. After many hours of trying, I came to the conclusion that it is best to use an automation framework like Selenium or White and access the browser's DOM with that.
General: A helpful thing when you need to login first
Anton Shtern, 13-Jan-15 0:19
