
Multi-Threaded Web Scraping in C#

Beginner to Advanced - Multithreaded Web Scraping with Examples of WebBrowser, WebClient, HttpWebRequest/HttpWebResponse, Regex, and BackgroundWorker.

Download source

Suggestions have been incorporated. Kindly suggest, vote, and comment to help improve the article.

Image 1

Introduction    

*All the code examples are for learning purposes only. Any misuse is not encouraged.

* A project with the source code of most of the examples has been added.

Web scraping involves obtaining information of interest from web pages. This is a step-by-step guide, starting from the basics of web scraping using the WebBrowser control and moving on to slightly more advanced topics like performing a login and maintaining a session via HttpWebRequest. This is the first release of the article and there may be errors/mistakes; I welcome all suggestions and will try to include them as soon as possible.

The tutorial uses a step-by-step approach, and the web scraping work starts from the first line of the tutorial. I have taken an example/task-oriented method to keep it interesting: 2-3 examples are followed by 2-3 tasks to keep the learner motivated. I am assuming readers have basic knowledge of C# and the Visual Studio programming environment.

Contents 

The Contents I have covered are: 

  • WebBrowser 
    • WebBrowser Download Event
    • Navigating to OLX's First Page
    • Accessing All Ads Shown at ...
    • Yahoo Signin Form Filling & Submission
    • Modifying WebBrowser Headers
    • Saving All Images of a WebPage
    • Solving Captcha Using API ...
    • Setting Proxy For WebBrowser
  • Regular Expressions 
    • Finding a Number in Text
    • Regex Operators
    • Finding Words in a Sentence
    • Numbers of Format ddd-ddddd
    • Finding Email Addresses in Text 
    • Finding IP Addresses in Text
    • A Regex Utility
  • WebClient
    • Downloading HTML as String
    • Downloading & Saving an Image
    • Blocking Mode of WebClient
    • Non - Blocking Mode of WebClient
    • Read / Write Streams
    • Query String for WebClient
    • Uploading File to URL 
    • Little more about WebClient ... 
  • BackGroundWorker
    • Running Time Consuming Function
    • Work Completion Report
    • Updating the Progress
    • Stopping the Worker
    • Multi Threaded App to Download Images ... 
  • HttpWebRequest/HttpWebResponse 
    • HTTP Request Headers
    • How the Sessions Work
    • HTTP Response Headers
    • Mozilla Live HTTP Headers
    • User Agent Strings
    • Getting Facebook Login Page HTML ...
    • Performing Login by HTTP requests
    • Custom HTTPWebRequest for Login... 
    • Understanding HTML Form Get/POST 
    • Getting Form Hidden Fields 
    • Preparing HTTP POST Data
    • Picture Upload by HTTP to Facebook ...

And a lot of relevant tasks to keep the learner motivated and exploring.

WebBrowser Control 

This control provides a full built-in browser as a control. It lets the user navigate web pages inside your form.

Example: WebBrowser Download Event

  1. Add a WebBrowser control to the form and set it to Dock in Parent Container.

  2. Double-click the WebBrowser control to add the DocumentCompleted event.

  3. Image 2

  4. The Navigate function is used to navigate to the given address.

  5. The DocumentCompleted event is fired once the document has completely loaded.

  6. Now run the program.

  7. Image 3

  8. There are many ways to solve the above problem, such as counting iframes and then counting the number of times the DocumentCompleted event fires. That is fairly complex; the easiest approach is to maintain a history.

  9. Add a List<string> hist to the program, and modify the DocumentCompleted event as below (a sketch follows the image):

  10. Image 4
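The code in the screenshot follows this pattern; here is a minimal, hedged sketch of the history-based handler (the list name hist and the handler wiring are assumptions, not the exact project code):

//Keep a history so the scraping logic runs once per page,
//not once per frame/iframe DocumentCompleted event
List<string> hist = new List<string>();

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    string url = e.Url.ToString();
    if (hist.Contains(url))
        return;                 //this event was raised for a frame of an already handled page
    hist.Add(url);

    //...scraping work for the fully loaded page goes here...
}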

Example: Navigate to OLX's 1st Ad's Page

  1. Before making any web scraper, click bot, etc., an understanding of the website's layout is necessary. After that, the following are important:

    1. Finding fields of interest

    2. Narrowing down the text of interest

    3. Finding tags with IDs near the interesting tags

  2. First install Mozilla Firefox 15. Navigate to http://www.olx.com/cars-cat-378

  3. Right-click on the first ad link and click on Inspect Element.

  4. You will see something like the image below
     

  5. Image 5

  6. To visit the 1st ad, we need its link address.

  7. The anchor tag highlighted in the above picture has no ID, so if we use the GetElementsByTagName("a") function, we will get a list of all the anchor tags, which will include links to other pages of OLX, help, contact us, etc. (so it is not a good option).

  8. So, try to find the nearest tag which has an ID.

  9. On the tags bar, keep selecting tags towards the left until you find some tag with an ID.

  10. Image 6

  11. Once you reach the div tag with id the-list, you will see it is the container for all the ad links

  12. Image 7

  13. So all the anchor tags in div#the-list are links to the individual ad pages.

  14. Following is the code to get them programmatically.

  15. C#
    //Getting the ads block HtmlElement
    HtmlElement he = webBrowser1.Document.GetElementById("the-list");
    
    //Getting a collection of all the anchor tags in the ads block
    HtmlElementCollection hec = he.GetElementsByTagName("a");
  16. We want to navigate to the 1st ad's page

  17. C#
    //Navigating to the 1st ad page
    //obtaining the href value to get the page address
    
    webBrowser1.Navigate(hec[0].GetAttribute("href"));

Example: Navigate to All the Ads Shown on

http://www.olx.com/cars-cat-378

  1. You have seen how to navigate to the 1st ad.

  2. To navigate to all the pages, we need to store all the ad links in a list, so that later on we can visit those ads.

  3. To do this, make a list that stores the href values of all the ad links.

  4. C#
    List<string> urls = new List<string>();
  5. Modify the DocumentCompleted event to add all the links to the urls list

  6. C#
    HtmlElement he = webBrowser1.Document.GetElementById("the-list");
    HtmlElementCollection hec = he.GetElementsByTagName("a");
    
    foreach(HtmlElement a in hec)
    {
        string href = a.GetAttribute("href");
        if(href != "http://www.olx.com/cars-cat-378")
        {
           if(!urls.Contains(href))
             urls.Add(href); 
     
        }
    }
  7. Why are we checking href != "http://www.olx.com/cars-cat-378"? Because each individual ad block contains a link to the page on which it is shown (which means that to make an accurate scraper, you need to understand well what is on the page and where it is).

  8. All the links are stored in the urls list; now we need to make the browser automatically navigate to all of them (a combined sketch follows the code below).

  9. C#
    if(urls.Count > 0)
    {
        string u = urls[0];
        urls.RemoveAt(0);
        webBrowser1.Navigate(u);
        this.Text = "Links Remaining " + urls.Count.ToString();
    }
    else
    {
        MessageBox.Show("Complete");
    }
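Putting the pieces together, a minimal sketch of the whole DocumentCompleted handler might look like the following. The names (urls, the-list) follow the examples above; treat it as illustrative rather than the exact project code:

void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    //Collect ad links only when the category page (which contains div#the-list) is loaded
    HtmlElement he = webBrowser1.Document.GetElementById("the-list");
    if (he != null)
    {
        foreach (HtmlElement a in he.GetElementsByTagName("a"))
        {
            string href = a.GetAttribute("href");
            if (href != "http://www.olx.com/cars-cat-378" && !urls.Contains(href))
                urls.Add(href);
        }
    }

    //Then visit the stored ad links one by one
    if (urls.Count > 0)
    {
        string u = urls[0];
        urls.RemoveAt(0);
        webBrowser1.Navigate(u);
        this.Text = "Links Remaining " + urls.Count.ToString();
    }
    else
    {
        MessageBox.Show("Complete");
    }
}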

Task 1: Modify the above code to make it browse the next pages

Like: http://www.olx.com/cars-cat-378-p-2

http://www.olx.com/cars-cat-378-p-3

http://www.olx.com/cars-cat-378-p-4 and so on

Task 2: On each ad page, scrape the owner name and number (if given)

Task 3: Make an app which scrapes a specified number of individual ads from the given URL of an OLX category.

Example: Yahoo Signin Form Filling and Submission 

  1. Navigate to http://mail.yahoo.com/

  2. Check the ID of the username and password textboxes (Use Inspect Element)

  3. Make a Button Click Event in the app

  4. C#
    HtmlElement hu = webBrowser1.Document.GetElementById("username");
    hu.Focus();
    hu.SetAttribute("Value", "userName");
    
    HtmlElement hp = webBrowser1.Document.GetElementById("passwd");
    hp.Focus();
    hp.SetAttribute("Value", "password");
  5. For the Sign in button click, we actually need to submit the form, so find the sign-in form's ID, get its element, and invoke the submit function on it:

C#
HtmlElement hf = webBrowser1.Document.GetElementById("login_form");
hf.InvokeMember("submit");


Task 1: Find out how to select the value of a dropdown list, a checkbox, and a radio button. You can try filling in the Yahoo signup page.

Task 2: Perform Click on the Hyperlink

Example: In the WebBrowser control we can add/change the request headers. The most important headers are Referer and User-Agent.

  1. The User-Agent header tells the web server which browser sent the request.

  2. The Referer header tells the web server from which web page the user was sent to the current web page.

  3. To change the User-Agent header:

webBrowser1.Navigate("url", "_blank", null, "User-Agent: sample user agent\r\n");

Task 1: Browse to http://logme.mobi in the WebBrowser control (webBrowser1.Navigate("http://logme.mobi");) to see your HTTP headers, then try modifying your User-Agent and Referer.

You can get a Complete List of User Agent Strings at http://www.useragentstring.com/pages/useragentstring.php

Task 2: Visit www.google.com in the C# app with some Apple or Linux browser User-Agent. Get the non-JavaScript Google search results page.

Example: Saving All the Images of the Web Page 

  1. Add a reference to Microsoft.mshtml and add using mshtml; to your code.

  2. You can use the Yahoo signup page for practice.

C#
IHTMLDocument2 doc = (IHTMLDocument2)webBrowser1.Document.DomDocument;
IHTMLControlRange imgRange = (IHTMLControlRange)((HTMLBody)doc.body).createControlRange();

foreach (IHTMLImgElement img in doc.images)
{
    imgRange.add((IHTMLControlElement)img);
    imgRange.execCommand("Copy", false, null);

    try
    {
        using (Bitmap bmp = (Bitmap)Clipboard.GetDataObject().GetData(DataFormats.Bitmap))
        {
            bmp.Save(img.nameProp + ".jpg");
        }
    }
    catch (System.Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

The above code will save all the images of the web page in the current directory.

Task: Find a pattern in the captcha name, and modify the code to save only the captcha.

Example: Solving Captcha using DeathByCaptcha Api

  1. Add a reference to the DeathByCaptcha library and add using DeathByCaptcha; to your code.

  2. The following code solves the captcha.

  3. C#
    Client client = (Client)new SocketClient(capUser, capPwd);
    try
    {
        Captcha captcha = client.Decode(path + capName, 50);

        if (null != captcha)
        {
            //Captcha solved
            MessageBox.Show(captcha.Text);
        }
        else
        {
            //Captcha not solved, show an error message
        }
    }
    catch (DeathByCaptcha.Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
  4. Study how to report that Captcha.Text was wrong.

Example: Setting a Proxy for the WebBrowser

C#
using Microsoft.Win32;

RegistryKey reg = Registry.CurrentUser.OpenSubKey(
  "Software\\Microsoft\\Windows\\CurrentVersion\\Internet Settings", true);

reg.SetValue("ProxyEnable", 1);

reg.SetValue("ProxyServer", "192.168.1.1:9876");

Regex  

A concise and flexible means of matching string patterns in text.

In C#, the Regex, Match, and MatchCollection classes are used for finding string patterns. These classes are in the following namespace:

C#
using System.Text.RegularExpressions;

Example 1: Finding a Number in a text

Image 8
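The code for this example is in the screenshot; as a minimal, hedged sketch (the input string is made up purely for illustration), finding numbers in a text looks like this:

using System.Text.RegularExpressions;

string text = "Flight 370 departs at 9, gate 25.";
Regex r = new Regex("[0-9]+");              //one or more digits
MatchCollection mc = r.Matches(text);

foreach (Match m in mc)
    Console.WriteLine(m.Value);             //prints 370, 9, 25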

Following are the Regex Operators:

[xyz]

A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".

[^xyz]

A negative character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".

[a-z]

A range of characters. Matches any character in the specified range. For example, "[a-z]" matches any lowercase alphabetic character in the range "a" through "z".

[^m-z]

A negative range of characters. Matches any character not in the specified range. For example, "[^m-z]" matches any character not in the range "m" through "z".

*

Matches the preceding character zero or more times. For example, "zo*" matches either "z" or "zoo".

+

Matches the preceding character one or more times. For example, "zo+" matches "zoo" but not "z".

?

Matches the preceding character zero or one time. For example, "a?ve?" matches the "ve" in "never".

.

Matches any single character except a newline character.

Example 2: Finding Words in a sentence

Image 9

The last word, "sentence", is not in the match list, as it did not have a space after it.

{n}

n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob," but matches the first two o's in "foooood".

{n,}

n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" and matches all the o's in "foooood." "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".

{n,m}

m and n are non-negative integers. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood." "o{0,1}" is equivalent to "o?".

Example: Matching a telephone number of the format ddd-ddddd, where d means a digit

Image 10

Example: Finding Email Addresses in the Text

The regex for a normal email can be:

"\b[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9]{2,4}\b"

Here \b marks a word boundary; [A-Za-z0-9_]+ defines the username, which may be any repetition of A-Z, a-z, 0-9 and _ (underscore); @[A-Za-z0-9_]+ defines the host name, for example yahoo; \. defines the dot (.); and [A-Za-z0-9]{2,4} matches the top-level domain, like com, net, edu etc.
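As a quick sketch of using this pattern from C# (the backslashes must be escaped, or use a verbatim @ string; the sample text is made up):

string text = "Contact support@example.com or admin_1@test.org for help.";
Regex r = new Regex(@"\b[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9]{2,4}\b");

foreach (Match m in r.Matches(text))
    Console.WriteLine(m.Value);   //support@example.com and admin_1@test.org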

Example: Regex to find IP address in the text

A basic regex to find an IP address can be "\b[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\b". But this will also match something like 1925.68541.268.1 (that is, any number of digits separated by 3 dots, which is not a valid IP address).

Another can be:

"\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b"

Now this will not match a string which has more than 3 digits between dots, but it may still match 999.999.999.999, which is again an invalid address.

So a regex can be as complex as the following:

"\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\." + "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\." + "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\." + "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"

There is always a trade-off between complexity and accuracy, so depending upon your input text you may give one dimension more importance than the other. A sketch of the stricter pattern in C# follows.
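For reference, the stricter pattern written as a C# verbatim string might look like the following sketch (the sample input is illustrative only):

string ipPattern = @"\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\." +
                   @"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\." +
                   @"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\." +
                   @"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b";

foreach (Match m in Regex.Matches("Servers: 10.0.0.1, 999.999.999.999, 192.168.1.254", ipPattern))
    Console.WriteLine(m.Value);   //10.0.0.1 and 192.168.1.254 only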

Task 1: Anchor tags in HTML are of vital importance in scraping. The value of the link is placed in the href attribute as shown below. Write a regex to find the href value.

Image 11

Answer 1: The regex for this can be as simple as:

Regex r = new Regex("href=\"[^\"]+\"");

\" is used to match the double quote (") inside the pattern, since the double quote is a special character in a C# string and needs to be escaped with a backslash (\). Later on you can remove href=" and the trailing " from the value of the match, or capture the value directly with a group, as in the sketch below.
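A small hedged sketch of the capture-group variant (assuming html already holds the page source):

Regex r = new Regex("href=\"([^\"]+)\"");
foreach (Match m in r.Matches(html))
    Console.WriteLine(m.Groups[1].Value);   //just the link, without href=" and the closing quote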

Task 2: Pins at Pinterest.com have the following format of web addresses:

"/pin/125678645821924781/" "/pin/63894888434527857/"

"/pin/25825397833321410/"

Write a regex to find pin addresses in the HTML of www.pinterest.com

Task 3: Thumbnail images of the Ads are shown at http://www.olx.com/cars-cat-378

Image 12

Write a regex to match the image links

Task 4: Make Following Utility.

Image 13

  1. Load Input Text File 

  2. Press Load Button

  3. Write Regex

  4. Press Execute

  5. All the matches are shown in a multi-line textbox, with one match value per line

WebClient 

This class is found under using System.Net;. It provides various functions to download files from the internet. It can be used to download the HTML source of web pages as a string or as a file, and it supports downloading files as data bytes.

This class is very helpful in scraping, as it lets the coder download only the HTML file, whereas using the WebBrowser control for scraping is simple but not an efficient/speedy way.

Example: Downloading the Yahoo.com HTML source as a string

  1. Create a Button and a TextBox on the form.

  2. In the Button Click event, add the following code, and press the button at run time.

  3. C#
    WebClient wc = new WebClient();
    
    textBox1.Text = wc.DownloadString("http://www.yahoo.com");
  4. Here we are creating a WebClient object and then using its DownloadString method to download the HTML of the given URL.

  5. The downloaded HTML is shown in textBox1.

Image 14

Benefits of using WebClient

  1. It is easy to use.

  2. It supports various methods for string and file downloading.

  3. It is efficient and uses much less bandwidth compared to the WebBrowser control.

Once the HTML source is downloaded, you can use Regex or 3rd-party HTML parsers to get the required info from it, as in the sketch below.
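For example, a minimal sketch combining WebClient and Regex (the page and pattern are illustrative only, not part of the original project):

WebClient wc = new WebClient();
string html = wc.DownloadString("http://www.yahoo.com");

//Grab the page title from the raw HTML
Match m = Regex.Match(html, "<title>([^<]*)</title>", RegexOptions.IgnoreCase);
if (m.Success)
    textBox1.Text = m.Groups[1].Value;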

Example 2: Downloading and Saving an Image

  1. Add the following code to the Button Click event, and press the button at run time

WebClient wc = new WebClient();

wc.DownloadFile("http://www.dotnetperls.com/one.png", "one.png");
  1. The image one.png will be downloaded and stored in the current directory

Image 15

  1. The 1st argument is the URL of the image and the 2nd is the name under which to save it.

  2. In the same way, the WebClient class provides methods for string and file uploading, but we will use the HttpWebRequest class for that.

  3. The wc.DownloadData() method supports downloading the data as bytes. This is useful where a different encoding is used, like UTF-8.

Example: Blocking Mode of WebClient

  1. Add the following code to the Button Click event and press the button at run time.

  WebClient wc = new WebClient();

wc.DownloadData("http://www.olx.com");
  1. Just after pressing the button, try to move the form, and the form will go to Not Responding

Image 16

  1. Why is it so? Downloading a string, file or data from the internet is time-consuming, and the WebClient class is performing the download operation on the same thread as the UI. This causes the UI to go unresponsive.

  1. This means no other task can be performed by the app while WebClient is downloading. This is blocking mode. The solution to this problem is using WebClient in non-blocking mode.

Example: Non-Blocking Mode of WebClient

  1. wc.DownloadStringAsync() is used to perform the download operation on a separate thread. This keeps the UI responsive, and the app can do other tasks while the download is performed.

wc.DownloadStringAsync(new Uri("http://www.yahoo.com")); 
  1. This downloads a string from the resource without blocking the calling thread.

  2. To perform an asynchronous download operation, the user needs to handle a download-completed event, so that the calling thread is informed once downloading is complete.

  3. In the above case, we need to add the DownloadStringCompleted event.

  4. The following piece of code will make the WebClient download the string asynchronously and fire the DownloadStringCompleted event on completion:

WebClient wc = new WebClient(); 


wc.DownloadStringCompleted+=new DownloadStringCompletedEventHandler(wc_DownloadStringCompleted); 

wc.DownloadStringAsync(new Uri("http://www.yahoo.com"));
  1. The downloaded string is passed as an argument to the DownloadStringCompleted event and can be accessed in the following way:

void wc_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
      //Accessing the Downloaded String
        string html = e.Result;

     //Code to Use Downloaded String       

     textBox1.Text = html;
 }
  1. The DownloadStringCompleted event is fired on the calling thread, so you can easily access UI elements.

Example: Read / Write Streams

  1. The WebClient class provides various blocking and non-blocking methods to access the stream for direct read and write operations.

  2. The following piece of code obtains a read stream in blocking mode:

WebClient wc = new WebClient();

StreamReader sr = new StreamReader(wc.OpenRead("http://www.yahoo.com"));
//Here you can perform IO operations like Read, ReadLine,
//ReadBlock, ReadToEnd etc. supported by the StreamReader class
  1. In the same way, a write stream can be obtained for write-related IO operations.

Example: QueryString for WebClient

  1. Gets or sets a collection of query name/value pairs associated with the request.

  2. The query string is helpful for sending parameters to a URL.

  3. The search results page of Google has the following address format:

https://www.google.com.pk/search?q=search+phrase


  1. In the above URL, q is a parameter and search+phrase is its value.

  2. The following example shows how to use the query string to send parameters and their values to a URL:

string uriString = "http://www.google.com/search"; 

//Create a new WebClient instance.  

WebClient wc = new WebClient(); 

//Create a new NameValueCollection instance to hold the QueryString parameters and values.

NameValueCollection myQSC = new NameValueCollection();

//Add Parameters to the Collection      

myQSC.Add("q", "Search Phrase"); 

// Attach QueryString to the WebClient. 
     wc.QueryString = myQSC;  

//Download the search results Web page into 'searchresult.htm' 

wc.DownloadFile(uriString, "searchresult.htm");
  1. The NameValueCollection class is under System.Collections.Specialized

Example: Uploading a File to a URL

String uriString = "FileUploadPagePath";      

// Create a new WebClient instance.  

WebClient myWebClient = new WebClient();     

//Path to The File to Upload 
     string fileName = "File Path";      


// Upload the file to the URI.

//The 'UploadFile(uriString,fileName)' method

//implicitly uses HTTP POST method.      


byte[] responseArray = myWebClient.UploadFile(uriString, fileName);      

// Decode and display the response.      

textBox1.Text = "Response Received. " +  System.Text.Encoding.ASCII.GetString(responseArray);

Example: Additional Info for WebClient

  1. Setting a proxy

wc.Proxy = new WebProxy("ip:port");

  2. Adding custom headers

wc.Headers.Add(HttpRequestHeader.UserAgent, "user-agent");

  3. Obtaining response headers

WebHeaderCollection whc = wc.ResponseHeaders;

Task 1: Add a Referer header.

Task 2: Read the response code and status from the response headers

Task 3: What is the BaseAddress of the WebClient?

Task 4: Use WebClient.QueryString to do a search on Google

Task 5: Use WebClient.UploadFile to upload a file

BackGroundWorker

This class provides an easy way to run time-consuming operations on a background thread. The BackgroundWorker class enables you to check the state of the operation, and it lets you cancel the operation.

Example: Running a Time-Consuming Function on a BackgroundWorker

  1. For this example we are assuming the following function is time-consuming, and the user needs to run it several times, which causes the UI to go unresponsive.

private void HeavyFunction()
{
    System.Threading.Thread.Sleep(1000);
}
  1. Make a form as shown below with Start and Stop buttons and a Status label. Add a BackgroundWorker from the Components toolbox to the form.

Image 17

  1. Create an event handler for the background worker's DoWork event. The DoWork event handler is where you run the time-consuming operation on the background thread. You can create this event by double-clicking in the events pane of the BackgroundWorker.

Image 18

  1. Any values that are passed to the background operation are passed in the Argument property of the  DoWorkEventArgs object that is passed to the event handler.

Image 19

  1. Let's call HeavyFunction 5 times in the backgroundWorker1_DoWork event.

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    for (int i = 0; i < 5; i++)
        HeavyFunction();
}
  1. To start the BackgroundWorker, we need to call the RunWorkerAsync() function of backgroundWorker1. Call it in the Start button Click event.

private void Start_Click(object sender, EventArgs e)
{
    backgroundWorker1.RunWorkerAsync();
}
  1. Once the Start button is clicked, the BackgroundWorker will start working, but the UI will remain responsive.

  1. You have successfully learnt how to put time-consuming functions on an easily managed separate thread.

  1. The RunWorkerCompleted event is fired once the work is complete.

  2. The event is raised on the caller thread (the thread from which BackgroundWorker.RunWorkerAsync() was called). In our case, that is the UI thread.

  3. To be notified about BackgroundWorker completion, add the RunWorkerCompleted event.

Image 20

  1. The RunWorkerCompleted event will be fired on the UI thread, so we can easily access all the UI elements.

private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    Status.Text = "Work Complete";
}
  1. Now, 5 seconds after pressing the Start button, the Status label text will be set to Work Complete.

Image 21

  1. While performing a time-consuming function on the BackgroundWorker, we may want to report progress to the user. For example, in a scenario of downloading several files, we may want to update the UI to show how many files have been completed.

  2. To perform such an update, the ReportProgress function is called, which raises the ProgressChanged event on the caller thread.

  1. To call ReportProgress, first you need to add the ProgressChanged event and set the WorkerReportsProgress property to true.

Image 22 Image 23

  1. The ReportProgress method takes 2 arguments, int progressPercentage and object userState. Both are available in the ProgressChangedEventArgs ProgressPercentage and UserState properties.

Image 24

  1. To report progress, change the BackgroundWorker DoWork event as follows:

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    for (int i = 0; i < 100; i++)
    {
        HeavyFunction();
        backgroundWorker1.ReportProgress(i, " Heavy Function Done");
    }
}
  1. To update the UI in the ProgressChanged event, modify it as follows:

private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    Status.Text = e.ProgressPercentage.ToString() + (string)e.UserState;
}
  1. Now once you press the Start button, the status will be updated with the arguments passed to the ReportProgress method.

Image 25

  1. When the BackgroundWorker finishes working, the RunWorkerCompleted event will be fired, so the status will be updated to Work Complete.

Image 26

  1. To stop the BackgroundWorker during the work, we need to set the WorkerSupportsCancellation property to true.

Image 27

  1. At any time during the work, we can stop the BackgroundWorker by calling the CancelAsync() function. Modify the Stop button Click event as follows:

private void Stop_Click(object sender, EventArgs e)
{
    backgroundWorker1.CancelAsync();
}
  1. Modify the DoWork event as follows to stop if cancellation is pending:

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    for (int i = 0; i < 100; i++)
    {
        if (backgroundWorker1.CancellationPending)
        {
            e.Cancel = true;
            break;
        }
        HeavyFunction();
        backgroundWorker1.ReportProgress(i, "Heavy Function Done");
    }
}
  1. To update the UI with accurate info, you can modify the RunWorkerCompleted event as follows to show whether the BackgroundWorker was stopped or completed the work:

private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    if (e.Cancelled)
        Status.Text = "Work Stopped";
    else
        Status.Text = "Work Complete";
}
  1. If the BackgroundWorker is busy with some task and the user presses the Start button again, an exception will be thrown. The IsBusy property tells whether the worker is busy or not, so before calling the RunWorkerAsync() function, one must check whether the BackgroundWorker is busy. The following code does so:

private void Start_Click(object sender, EventArgs e)
{
    if (!backgroundWorker1.IsBusy)
        backgroundWorker1.RunWorkerAsync();
    else
        MessageBox.Show("Busy in Work - Press Stop");
}
  1. You can send non-UI objects as an argument to the RunWorkerAsync function and then access them in the DoWork event, as in the sketch below.
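A small hedged sketch of passing such an argument (here a URL string, chosen purely for illustration) and reading it back in DoWork:

//Caller (UI thread)
backgroundWorker1.RunWorkerAsync("http://www.olx.com/cars-cat-378");

//Worker thread
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    string url = (string)e.Argument;   //the value passed to RunWorkerAsync
    //...use url for the time-consuming work...
}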

Example: Make a Multi-Threaded App to Download Images from pinterest.com

  1. Design the UI as shown below.

Image 28

  1. Add one BackgroundWorker, name it backgroundWorker1, and add the DoWork, ProgressChanged and RunWorkerCompleted events. Set the WorkerReportsProgress and WorkerSupportsCancellation properties to true.

  2. Program logic: We are going to use backgroundWorker1 to download the HTML source of http://www.pinterest.com using WebClient, then we will use a regex to find the URLs of all the images and add them to a List<string> urls. At run time, BackgroundWorkers equal to the number of threads set by the user will be created; each of these BackgroundWorkers will take one URL from List<string> urls and download that image using WebClient.

  3. Add the following code for the backgroundWorker1 events:

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    WebClient wc = new WebClient();
    string html = wc.DownloadString("http://www.pinterest.com");

    Regex reg = new Regex("src=\"http://[^/]+/upload/[^\"]+");
    MatchCollection mc = reg.Matches(html);

    backgroundWorker1.ReportProgress(0, mc.Count.ToString() + " Images Found");

    System.Threading.Thread.Sleep(2000);
    lock (urls)
    {
        foreach (Match m in mc)
        {
            urls.Add(m.Value.Replace("src=\"", ""));
        }
    }
}

private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    Status.Text = (string)e.UserState;
}
  1. In DoWork, we have just downloaded the HTML, used a regex to get the image links, and added them to List<string> urls.

  2. Now we need to make workers for downloading the images. We will do this once the user presses the Start button. Add the following code to the Start button Click event:

private void Start_Click(object sender, EventArgs e)
{
    int maxThrds;
    if (!int.TryParse(NoOfThreads.Text, out maxThrds))
    {
        MessageBox.Show("Enter Correct Number of Threads");
        return;
    }
    if (maxThrds <= 0)
    {
        MessageBox.Show("Enter 1 or more Threads");
        return;
    }
    if (!backgroundWorker1.IsBusy)
    {
        for (int i = 0; i < maxThrds; i++)
        {
            BackgroundWorker bgw = new BackgroundWorker();
            bgw.WorkerReportsProgress = true;
            bgw.WorkerSupportsCancellation = true;
            bgw.DoWork += new DoWorkEventHandler(bgw_DoWork);
            bgw.ProgressChanged += new ProgressChangedEventHandler(bgw_ProgressChanged);
            bgw.RunWorkerCompleted += new RunWorkerCompletedEventHandler(bgw_RunWorkerCompleted);

            //Start the worker
            bgw.RunWorkerAsync();
        }

        backgroundWorker1.RunWorkerAsync();
    }
    else
    {
        MessageBox.Show("Busy in Work");
    }
}
  1. First we are checking for correct input, then we are making BackgroundWorkers at run time.

  2. Once all the properties of the run-time workers are set, we call bgw.RunWorkerAsync for each worker.

  1. Following is the code for the DoWork event of the run-time-created workers:
private void bgw_DoWork(object sender, DoWorkEventArgs e)
{
    BackgroundWorker bgw = (BackgroundWorker)sender;
    while (true)
    {
        string imgLink = "";
        lock (urls)
        {
            if (urls.Count > 0)
            {
                imgLink = urls[0];
                urls.RemoveAt(0);
                count++;
            }
            else
            {
                System.Threading.Thread.Sleep(500);
            }
        }
        if (imgLink != "")
        {
            string filename = imgLink.Substring(imgLink.LastIndexOf("/") + 1);

            WebClient wc = new WebClient();
            wc.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0.1");
            wc.DownloadFile(imgLink, filename);

            bgw.ReportProgress(0, count.ToString() + " Images Downloaded");
        }
    }
}
  1. In the 1st line, we cast sender to a BackgroundWorker object so that we can call ReportProgress on it. Then we have put all the code in a loop; in each iteration we remove one URL from the list and download it.

  2. If there is no link in List<string> urls, we put the thread to sleep for 500 ms.

  3. Since many threads will be accessing List<string> urls, we have put it inside a lock.

Task 1: How to stop the run-time-created workers

Task 2: Modify backgroundWorker1 to collect a user-defined number of images, for example 30, 100, 220 (for more than 50, you have to scrape pages 2, 3, 4 ...).

Hint for Task 1: The following options can be used:

  1. Option 1: You can maintain a list of the run-time-created workers and then call CancelAsync() for each worker in the list. Then modify the code of each run-time worker to break the loop if CancellationPending is set.

  2. Option 2: Declare a global variable int rnd, assign it some random value in the Start button Click event, and pass it to the BackgroundWorker DoWork event as an argument.

//Start Button Click Event
if (!backgroundWorker1.IsBusy)
{
    rnd = new Random().Next(0, 99999);

    for (int i = 0; i < maxThrds; i++)
    {
        BackgroundWorker bgw = new BackgroundWorker();
        bgw.WorkerReportsProgress = true;
        bgw.WorkerSupportsCancellation = true;
        bgw.DoWork += new DoWorkEventHandler(bgw_DoWork);
        bgw.ProgressChanged += new ProgressChangedEventHandler(bgw_ProgressChanged);
        bgw.RunWorkerCompleted += new RunWorkerCompletedEventHandler(bgw_RunWorkerCompleted);

        //Start the worker, pass rnd as argument
        bgw.RunWorkerAsync(rnd);
    }

    backgroundWorker1.RunWorkerAsync();
}
  1. Cast the rnd value to a local int variable, and modify the DoWork event to work only while rnd is unchanged:

void bgw_DoWork(object sender, DoWorkEventArgs e)
{
    int chk = (int)e.Argument;
    while (chk == rnd)
    {
        //Do the Task
    }
}
  1. In the Stop button Click event, assign a new value to rnd, which will cause all the run-time-created workers to break out of the loop:

private void Stop_Click(object sender, EventArgs e)
{
    rnd = new Random().Next(0, 9999);
}

Task 3: Think of some more options

HttpWebRequest / HttpWebResponse

  1. Before starting this, we need to understand a bit about the main HTTP headers and install a few add-ons, which help us determine a website's layout and packets and alter the packets.

  1. In any browser, go to http://logme.mobi. You will get something like the following:

Image 29

  1. This is the data of your HTTP headers, which your browser sent to the web server of http://logme.mobi. In this, the User-Agent (it defines which browser is used for browsing) and Connection headers are important.

  2. Now just refresh the page, and you will get something like the following:

Image 30

  1. This header defines the Cookie. What is a cookie? A cookie is a small piece of information stored as a text file on your computer that a web server uses when you browse certain web sites.

  2. To maintain sessions, the Cookie header is very important.

  3. How do sessions work? Once the user requests the login page, a few cookies are issued by the server; then the user submits the login info along with those cookies. In case of a successful login, the server issues a new set of cookies, which identifies the user as an authenticated user to the server. For further requests to the server, this newly issued set of cookies is used. This way a session is maintained. At any time, if you clear the Cookie header, you will be redirected to the login page.
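In HttpWebRequest terms (covered in detail below), keeping a session alive simply means resending the cookies the server has issued so far. A hedged sketch, with a made-up URL:

//One collection for the whole session
CookieCollection cookies = new CookieCollection();

HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://example.com/login");
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(cookies);                 //send whatever we already have

HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
cookies.Add(resp.Cookies);                        //keep whatever the server issued

//Every later request adds the same collection again, so the server
//keeps recognising us as the same (logged-in) user.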

A Typical Cookie Exchange

  1. Install Mozilla Firefox 15.0 and then install the Live HTTP Headers add-on for it. You can get it from here. Run Live HTTP Headers and refresh the page http://logme.mobi. You will see something like the following:
     

    Image 32

  1. The Live HTTP Headers add-on shows the HTTP headers of all the requests and responses once you do some browsing using Mozilla Firefox. This tool is helpful in determining a website's HTTP packet formats; especially, it helps in knowing what data is posted once some POST action is performed.

  2. Now install the add-on Tamper Data from here. This tool is helpful for modifying the content of HTTP headers while browsing the web; it is of great use in determining which fields and headers are compulsory for performing some HTTP POST request and which we can leave out of a particular POST request. Once you run it, it will look like the following:

Image 33


  1. Click on the Start Tamper button. In the address bar type useragentstring.com/ and press Enter. As soon as you press Enter, the following window will open, asking whether you want to Tamper with the Data, Submit the Request or Abort the Request:

Image 34

  1. Click on Tamper Data, then the following window will open up:

Image 35

  1. In the User Agent field, enter the following and press OK:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1 


  1. Now submit all the subsequent requests, and once the page is loaded you will see the website reporting your browser as Chrome, whereas you are actually using Firefox.

Image 36

  1. In the same way, you can alter the POST method parameters.

The HttpWebRequest class in .NET: This class is under the System.Net namespace, and it provides methods and properties to make HTTP requests to a web server.

Example 1: Downloading the HTML of the Facebook Login Page

  1. To make an object of this class, the WebRequest.Create function is used: HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);

  2. Open Live HTTP Headers in the browser, and browse to http://www.facebook.com

Image 37

  1. The picture above shows the HTTP request headers; let's build this in C#:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://www.facebook.com/");

request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/15.0";

request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

request.Headers.Add("Accept-Language: en-us,en;q=0.5");

request.Headers.Add("Accept-Encoding: gzip, deflate");

request.KeepAlive = true;
  1. The first line creates an HttpWebRequest object for the given URL; then we add the User-Agent and Accept headers to the HTTP packet using properties. Not all of the headers are accessible via properties, so you may need to add some by adding them to the Headers collection; this is how we add the Accept-Language and Accept-Encoding headers.
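One hedged note: because we add the Accept-Encoding: gzip, deflate header ourselves, the server may send a compressed body, which a plain StreamReader cannot read. If you run into garbled output, you can let HttpWebRequest decompress it automatically:

//Transparently un-gzip/inflate the response body
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;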

  1. The next important thing is adding a CookieContainer to the HttpWebRequest object, as we want to keep a record of the cookies sent by the server in response to the request. If no cookie container is added, then we cannot access the cookies in the response header.

request.CookieContainer = new CookieContainer();
  1. Declare a CookieCollection variable globally; all the received cookies will be added to this collection so that we can use them for subsequent requests. If you use a new cookie container each time, it is not possible to maintain the session.

  2. CookieCollection cookies = new CookieCollection();
  3. Now we are done making the required HttpWebRequest object. Before making the HttpWebResponse object, let's see what response we got in Live HTTP Headers for the request which we sent via the browser.

  4. Image 38

  5. The 1st line shows the code and status. Then we are interested in the cookies only. We got 4-5 cookies, which will be stored in the browser's cookies folder and sent with the next request. When we perform a login, these cookies will be sent with the HTTP request generated for the login, and on successful login the server will issue some additional cookies (those cookies contain info which makes us authenticated users for subsequent requests).

  6. Now let's make the HttpWebResponse object; it is very simple:

  7. C#
    HttpWebResponse response = (HttpWebResponse)request.GetResponse(); 
  8. Once the response is received, the next thing is adding the received cookies to the globally defined CookieCollection. But before that, let's see what cookies we received. Add the following code after the above line to see the received cookies:

  9. C#
    string txt = "Cookies Count=" + response.Cookies.Count.ToString() + "\n"; 
    
    foreach (Cookie c in response.Cookies)
    {  
       txt += c.ToString() +  "\n"; 
    } 
    MessageBox.Show(txt); 
    //Adding received cookies to the collection
    cookies.Add(response.Cookies);
  10. This will show your cookies in a MessageBox

Image 39

  1. Now the response is received. The next step can be downloading the data from the stream; it can be HTML source code, some other file, or maybe nothing at all, depending upon the URL to which you made the request. In our case it is the HTML source of the Facebook login page.

StreamReader loginPage = new StreamReader(response.GetResponseStream()); 

string html = loginPage.ReadToEnd();
  1. This HTML source can be used to get some info via Regex or a 3rd-party HTML parsing library, or it can be stored in an HTML file as an offline page.

Example 2: Performing Login to Facebook

  1. Before doing the login in C#, let's perform the login in Mozilla and analyze the HTTP headers with Live HTTP Headers. Start Live HTTP Headers, browse to http://www.facebook.com, enter the username and password, and click the Login button. The HTTP request sent by the browser will look something like this:

Image 40

  1. The first line is the URL to which your username and password are being sent (later, in Example 3, we will see how to find this URL). The second line tells the HTTP method and version used, which are POST and 1.1 respectively.

  2. Then all the fields are just like a normal HTTP header, as we saw in Example 1. The important stuff starts from the Cookie header. In Example 1, when we browsed to http://www.facebook.com, there was no Cookie header, whereas we received some cookies in the response header; now when we click on the Login button, that previously received set of cookies is sent in this Cookie header.

  3. The next header shows the content type. There are two major content types used to POST data, application/x-www-form-urlencoded and multipart/form-data. You can find more info about these here.

  4. The next header shows the content length, and the last line shows the content. You will see your email address and password in this line; it is the data which is being sent to the server by the HTTP POST method.

  5. There are several other values also; later, in Example 3, we will see what these values are and where to obtain them.

  6. Let's examine the response header for the above request.

Image 41

  1. The response header shows a lot of cookies. These are the cookies which are issued by the server on successful login; for any subsequent request, the browser will send these cookies to the server, and in this way the session is maintained.

  2. Go to Tools -> Clear Recent History and delete the cookies, then try to browse to your Facebook profile page, and you will see that you are redirected to the Facebook login page.

  1. Now let's create the same login request header as we saw in the above screenshot and test whether we are able to log in successfully or not.

string getUrl = "https://www.facebook.com/login.php?login_attempt=1"; 

string postData = "lsd=AVo_jqIy&email=YourEmailAddress&pass=YourPassword&default_persistent=0& charset_test=%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84&timezone=-300&lgnrnd=072342_0iYK&lgnjs=1348842228&locale=en_US";

HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create(getUrl);  

getRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Firefox 15.0.0.1";

getRequest.CookieContainer = new CookieContainer(); 

//Adding Previously Received Cookies 

getRequest.CookieContainer.Add(cookies); 

getRequest.Method = WebRequestMethods.Http.Post;

getRequest.ProtocolVersion = HttpVersion.Version11;

getRequest.AllowAutoRedirect = false;

getRequest.ContentType = "application/x-www-form-urlencoded"; 

getRequest.Referer = "https://www.facebook.com"; 

getRequest.KeepAlive = true; 
  1. getUrl is assigned the address to which the data will be posted; the postData variable is a copy of the content from the above HTTP request packet. Then we create an HttpWebRequest object and set its User-Agent header.

  2. The cookies which we received in response to the request for http://www.facebook.com are added to the HttpWebRequest object; if we don't add these cookies, then instead of entertaining our login request, the server will redirect us to the login page. Next we set the HTTP method to POST and the version to 1.1 (used for HTTPS).

  3. Setting the AllowAutoRedirect property to false for requests in which we try to log in is very important. If this property is set to true, the HttpWebRequest object will follow the redirection responses, and during the redirections you may lose access to the cookies which the server sent in response to the login request.

  4. Now let's send the login info to the server.

//Converting postData to an array of bytes
byte[] byteArray = Encoding.ASCII.GetBytes(postData);

//Setting the Content-Length header of the request
getRequest.ContentLength = byteArray.Length;

//Obtaining the stream to write data
Stream newStream = getRequest.GetRequestStream();

//Writing data to the stream
newStream.Write(byteArray, 0, byteArray.Length);

newStream.Close();
  1. The data is written to the stream; now let's get the response and see what cookies we receive:

HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse();

string txt = "Cookies Count=" + getResponse.Cookies.Count.ToString() + "\n"; 

foreach (Cookie c in getResponse.Cookies) { 

    txt += c.ToString() + "\n";
} 
MessageBox.Show(txt);

Image 42

  1. We successfully logged into the system and received 9 cookies. The snapshot above shows very little info about the received cookies; you can get more by accessing the properties of the cookies.

  1. Add the received cookies to the globally defined CookieCollection so that they can be used in subsequent requests.

  1. How do you check whether the login was successful or not? Normally the cookie count is an easy way to determine that the login succeeded; to be more sure, you can try getting the HTML of the home page, and if you are not redirected to the login page, you are successfully logged in. A rough sketch follows.
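A rough sketch of that check (the login_form marker and the cookie-count threshold are assumptions; the exact markers differ per site and per page version):

bool loggedIn = getResponse.Cookies.Count > 3;   //heuristic: a successful login issued extra cookies

//More reliable: request the home page with the stored cookies and make sure
//we are not bounced back to the login form
HttpWebRequest check = (HttpWebRequest)WebRequest.Create("https://www.facebook.com/");
check.CookieContainer = new CookieContainer();
check.CookieContainer.Add(cookies);
using (StreamReader sr = new StreamReader(((HttpWebResponse)check.GetResponse()).GetResponseStream()))
{
    string homeHtml = sr.ReadToEnd();
    loggedIn = !homeHtml.Contains("login_form");
}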

Example 3: Custom HTTPWebRequest for Login

  1. In the last example, we just replayed the HTTP packet which the Mozilla browser generated. Now let's see where the POST URL and the postData fields were obtained from. Log off from Facebook and open the login page. Right-click on the Email textbox and click on Inspect Element.

Image 43


  1. The following HTML pane will appear at the bottom. Click on the Form element.

Image 44


  1. Here you can see the action field in the highlighted area; this field tells the URL to which the data is to be posted.

  1. Below the highlighted area you can see a few input fields. In Example 2's postData you saw many fields other than the email and password; these fields were being sent to the server along with the email and password. They are part of the login form, and they must be sent to the server along with the login info. Facebook changes the values of these fields frequently, so you cannot hardcode them in the software/app.

  2. Now we will see how to obtain these values from the Facebook login page source code.

  1. You can use Regex, string manipulation, or a 3rd-party HTML parsing library to obtain these fields and their values. I am using the HTML Agility Pack to get the login form tag and all its child input tags, and finally preparing the postData.

  2. C#
    string email = "youremail";
    string passwd = "yourpassword";
    string postData = "";
    
    //Load the FB login page HTML
    A.HtmlDocument doc = new A.HtmlDocument();
    doc.LoadHtml(fb_html);
    
    //Get the login form tag
    A.HtmlNode node = doc.GetElementbyId("login_form");
    node = node.ParentNode;
    
    //Get all hidden input fields and prepare the post data
    int i = 0;
    foreach (A.HtmlNode h in node.Elements("input"))
    {
        if (i > 0)
        {
            postData += "&";
        }
        if (i == 1)
        {
            postData += "email=" + email + "&";
            postData += "pass=" + passwd + "&";
        }
        
        postData += (h.GetAttributeValue("name", "") + "=" +
            h.GetAttributeValue("value", ""));
        i++;
    }
  3. Now you can post the data in the same way as we did in Example 2. Once the successful-login cookies are received, add them to the globally defined CookieCollection; then for any subsequent request, send these cookies with the HttpWebRequest.

Example 4: Uploading a Picture to the Profile

  1. We are going to upload a picture to the profile using the mobile version of Facebook, and we leave the upload to normal Facebook as a task for the reader.

  2. To upload pictures/files, multipart/form-data is used as the content type.

  3. First let's examine the HTTP traffic for uploading a picture with Live HTTP Headers.

  4. Log in to http://m.facebook.com/, and upload a picture.

  5. You will see an HTTP request like the following in Live HTTP Headers:

Image 45

  1. By now you must be familiar with the above HTTP request headers; the only thing different is the way of posting the data: instead of application/x-www-form-urlencoded, we are using multipart/form-data. You can also observe the layout of the postData (just below the Content-Type).

  1. Now let's examine where all these fields like fb_dtsg, charset etc. come from. Right-click on the Upload Photo form on the upload page and select Inspect Element.

Image 46

  1. You can see all the fields are here under the form tag for the photo upload. Again, you can use Regex, string manipulation or the HTML Agility Pack to get the names and values of these fields. But first you need to get the HTML of this page.

  2. C#
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://m.facebook.com/upload.php");
    req.CookieContainer = new CookieContainer();
    req.CookieContainer.Add(cookies);  
    req.AllowAutoRedirect=true;  
    req.UserAgent = "Mozilla/2.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/5.0.874.121";
    HttpWebResponse resp = (HttpWebResponse)req.GetResponse(); 
    StreamReader sr = new StreamReader(resp.GetResponseStream()); 
    string uploadHTML = sr.ReadToEnd();
  3. Let's get all these fields and add them to a Dictionary collection. This class is available in System.Collections.Generic. Make a dictionary variable nvc.

  4. C#
    Dictionary<string, string> nvc = new Dictionary<string, string>();
  5. As you can see there is no ID for the photo upload form, so first use string manipulation to get the form tag HTML and then use the HTML Agility Pack to easily get all the input tags.

  6. C#
    uploadHTML = uploadHTML.Substring(uploadHTML.IndexOf("<form"));
    uploadHTML = uploadHTML.Replace("<form", "<form id=\"myform\" ");
    uploadHTML = uploadHTML.Remove(uploadHTML.IndexOf("/form>") + 6);
    A.HtmlDocument doc = new A.HtmlDocument();
    doc.LoadHtml(uploadHTML);
    A.HtmlNode node = doc.GetElementbyId("myform");
    node = node.ParentNode;
    foreach (A.HtmlNode h in node.Elements("input"))
    {
        string key = h.GetAttributeValue("name", "");
    
        if (key != "")
            nvc.Add(key, h.GetAttributeValue("value", ""));
    }
  7. We will use the following function to upload the photo:

  8. C#
    HttpUploadFile("http://upload.facebook.com/mobile_upload.php", 
       "file1", "filename", @"filePath", "image/jpeg", nvc);
  9. The details of the passed arguments are as follows:

  • Action URL of the Upload Form

  • Name of the input tag for the File to upload

  • Name of the file

  • Path to the file on your computer

  • File type, in this case its image with extension jpeg

  • A dictionary containing all the input tags name and values

  1. Following is the complete piece of code for the HttpUploadFile function:

  2. C#
    public void HttpUploadFile(string url, string paramName, string filename,
        string filepath, string contentType, Dictionary<string, string> nvc)
    {
        //Preparing the postData format
        string boundary = "---------------------------" + DateTime.Now.Ticks.ToString("x");
        byte[] boundarybytes = System.Text.Encoding.ASCII.GetBytes("\r\n--" + boundary + "\r\n");
    
        //Creating the request to the action URL
        HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
        wr.ContentType = "multipart/form-data; boundary=" + boundary;
        wr.KeepAlive = true;
        wr.CookieContainer = new CookieContainer();
    
        //Adding the cookies received at login
        wr.CookieContainer.Add(cookies);
        wr.Method = WebRequestMethods.Http.Post;
        wr.UserAgent = "Mozilla/2.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/5.0.874.121";
        wr.AllowWriteStreamBuffering = true;
        wr.ProtocolVersion = HttpVersion.Version11;
        wr.AllowAutoRedirect = true;
        wr.Referer = "http://m.facebook.com/upload.php";
    
        //Obtaining the stream to write data
        Stream rs = wr.GetRequestStream();
        string formdataTemplate = "Content-Disposition: form-data; name=\"{0}\"\r\n\r\n{1}";
        foreach (string key in nvc.Keys)
        {
            rs.Write(boundarybytes, 0, boundarybytes.Length);
            string formitem = string.Format(formdataTemplate, key, nvc[key]);
            byte[] formitembytes = System.Text.Encoding.UTF8.GetBytes(formitem);
            //Writing all the input tags' values
            rs.Write(formitembytes, 0, formitembytes.Length);
        }
    
        rs.Write(boundarybytes, 0, boundarybytes.Length);
    
        //Writing the file contents
        string headerTemplate = "Content-Disposition: form-data; " +
            "name=\"{0}\"; filename=\"{1}\"\r\nContent-Type: {2}\r\n\r\n";
        string header = string.Format(headerTemplate, paramName, filename, contentType);
        byte[] headerbytes = System.Text.Encoding.UTF8.GetBytes(header);
        rs.Write(headerbytes, 0, headerbytes.Length);
        FileStream fileStream = new FileStream(filepath, FileMode.Open, FileAccess.Read);
        byte[] buffer = new byte[4096];
        int bytesRead = 0;
    
        while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) != 0)
        {
            rs.Write(buffer, 0, bytesRead);
        }
        fileStream.Close();
    
        //Completing the data
        byte[] trailer = System.Text.Encoding.ASCII.GetBytes("\r\n--" + boundary + "--\r\n");
        rs.Write(trailer, 0, trailer.Length);
        rs.Close();
    
        //Receiving the response
        HttpWebResponse wresp = (HttpWebResponse)wr.GetResponse();
        cookies.Add(wresp.Cookies);
        StreamReader sr = new StreamReader(wresp.GetResponseStream());
        string sourceCode = sr.ReadToEnd();
        StreamWriter sw = new StreamWriter("upload.html");
        sw.Write(sourceCode);
        sw.Close();
    }

Task 1: Make Wall Posting Software for www.tagged.com

Task 2: Investigate some site which uses AJAX, to see how to use HttpWebRequest/HttpWebResponse for it

Task 3: Perform login at some sites; using the login cookies, view login-protected pages

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


