A Web Spider Library in C#

18 Sep 2007, CPOL
An article about a spider library to grab websites and store them locally

Sample Image - ZetaWebSpider.png

Don't fear, it's just a web spider ;-)

Introduction

Today, while looking through some older code, I came across a set of classes I wrote at the beginning of this year for a customer project.

The classes implement a basic web spider (also called "web robot" or "web crawler") to grab web pages (including resources like images and CSS), download them locally and adjust any resource hyperlinks to point to the locally downloaded resources.

This is not the full-featured article with detailed explanations that I usually like to write, but I still want to put the code online together with this short text. Perhaps some readers can take ideas from the code and use it as a starting point for their own projects.

Overview

The classes support both synchronous and asynchronous downloads of web pages. Options include the hyperlink depth to follow and the proxy settings.
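
To illustrate what a hyperlink depth limit means in practice, here is a minimal sketch of a breadth-first traversal that stops following links beyond a given depth. It is not the library's implementation: the class `DepthLimitedCrawl`, the method `Visit` and the `getLinks` callback (a stand-in for "download the page and extract its links") are hypothetical names chosen for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the WebSpider library.
public static class DepthLimitedCrawl
{
    // Returns the pages reached from 'start' when following links at most
    // 'maxDepth' levels deep. The start page itself has depth 0.
    public static List<Uri> Visit(
        Uri start, int maxDepth, Func<Uri, IEnumerable<Uri>> getLinks)
    {
        List<Uri> visited = new List<Uri>();
        HashSet<Uri> seen = new HashSet<Uri>();
        Queue<KeyValuePair<Uri, int>> queue = new Queue<KeyValuePair<Uri, int>>();

        seen.Add(start);
        queue.Enqueue(new KeyValuePair<Uri, int>(start, 0));

        while (queue.Count > 0)
        {
            KeyValuePair<Uri, int> item = queue.Dequeue();
            visited.Add(item.Key);

            // Pages at the maximum depth are visited, but their links are not followed.
            if (item.Value >= maxDepth)
            {
                continue;
            }

            foreach (Uri link in getLinks(item.Key))
            {
                // HashSet.Add returns false for URLs that were already queued,
                // which keeps cyclic links from causing an endless loop.
                if (seen.Add(link))
                {
                    queue.Enqueue(new KeyValuePair<Uri, int>(link, item.Value + 1));
                }
            }
        }

        return visited;
    }
}
```

With a depth of 0 only the start page is visited; with a depth of 1 the start page and every page it links to directly are visited, and so on.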

Each downloaded resource gets a new file name based on the hash code of its original URL. I did this to keep the implementation simple for me as the programmer.
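
A minimal sketch of such a naming scheme follows. The article does not show how the library computes the hash, so this example is an assumption throughout: it uses SHA-256 instead of `string.GetHashCode()`, because the latter is not guaranteed to return the same value across .NET versions or processes. The names `LocalFileNames` and `FromUrl` are hypothetical.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only; not part of the WebSpider library.
public static class LocalFileNames
{
    // Maps a URL to a short, flat file name that is the same on every run.
    public static string FromUrl(Uri url)
    {
        byte[] hash;
        using (SHA256 sha = SHA256.Create())
        {
            hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url.AbsoluteUri));
        }

        // The first 8 bytes of the hash, written as 16 lowercase hex characters.
        string name = BitConverter.ToString(hash, 0, 8)
            .Replace("-", "")
            .ToLowerInvariant();

        // Keep the original extension (if any) so that the file type stays recognisable.
        string extension = Path.GetExtension(url.AbsolutePath);

        return name + extension;
    }
}
```

A flat, hash-based name sidesteps problems with long paths, query strings and characters that are not allowed in file names, at the cost of names that are no longer human-readable.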

To parse a document, I am using the SGMLReader DLL from the GotDotNet website.
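
Once a page is available as an XML document, adjusting its resource links could look roughly like the sketch below. It is not the library's code: it assumes the page has already been converted to well-formed XML (the job SGMLReader does here) and works on a plain `XmlDocument`. The names `LinkRewriter`, `RewriteLinks` and the `toLocalName` callback are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration only; not part of the WebSpider library.
public static class LinkRewriter
{
    // Replaces every http/https "href" and "src" value in 'page' with the
    // local name that 'toLocalName' returns for the resolved absolute URL.
    public static void RewriteLinks(
        XmlDocument page, Uri pageUrl, Func<Uri, string> toLocalName)
    {
        // Copy the matches first so that the document is not modified
        // while the XPath result is still being enumerated.
        List<XmlNode> attributes = new List<XmlNode>();
        foreach (XmlNode node in page.SelectNodes("//@href | //@src"))
        {
            attributes.Add(node);
        }

        foreach (XmlNode attribute in attributes)
        {
            // Leave pure in-page anchors such as "#top" untouched.
            if (attribute.Value.StartsWith("#"))
            {
                continue;
            }

            // Resolve relative references against the URL of the page itself.
            Uri absolute;
            if (!Uri.TryCreate(pageUrl, attribute.Value, out absolute))
            {
                continue;
            }

            // Only web resources are downloaded; skip mailto:, javascript: and the like.
            if (absolute.Scheme == Uri.UriSchemeHttp ||
                absolute.Scheme == Uri.UriSchemeHttps)
            {
                attribute.Value = toLocalName(absolute);
            }
        }
    }
}
```

The callback keeps the sketch independent of any particular naming scheme: whatever maps a URL to its local file name can be passed in.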

Since the project I wrote it for did not need them, the library does not honor "robots.txt", does not throttle its requests, and lacks similar features.

Using the Code

The download for this article contains the library ("WebSpider") and a console application for testing ("WebSpiderTest"). The test application is short and should be easy to understand.

Basically, you create an instance of the WebSiteDownloaderOptions class, configure its parameters, create an instance of the WebSiteDownloader class, optionally connect event handlers, and then tell the instance to process the given URL either synchronously or asynchronously.

History

  • 2007-09-17: Fixed several issues
  • 2006-09-10: Initial release of the article

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Chief Technology Officer Zeta Software GmbH
Germany
Uwe has been programming since 1989, with experience in Assembler, C++, MFC and lots of web and database work, and now uses ASP.NET and C# extensively, too. He has also taught programming to students at the local university.

In his free time, he does climbing, running and mountain biking. In 2012 he became a father of a cute boy and in 2014 of an awesome girl.
