Suppose I enter the URL www.w3school.org. All the pages of w3school should then be saved in a folder, and a summary page index.htm should be created, so that when I open index.htm it looks like the w3school home page. I know I can use HTTrack, but doing it myself in C# would teach me more.

1 solution

To do this you can use System.Net.WebClient:

- use it to download the page as one big string
- save the string to a file on the hard disk
- parse the string, using regular expressions, for images and anything else you want (style sheets, scripts)
- download the images and whatever else you found
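
A minimal sketch of steps 1, 2 and 4, assuming a hypothetical helper class `PageSaver` and caller-supplied folder and file names (none of these names come from the framework). It only uses `WebClient.DownloadString`, `WebClient.DownloadFile` and `File.WriteAllText`. Note that recent .NET versions mark `WebClient` as obsolete in favour of `HttpClient`, so treat this as an illustration of the idea rather than the definitive way to do it.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, for illustration only.
static class PageSaver
{
    // Steps 1 and 2: download a page as one string and save it to disk.
    // Returns the HTML so the caller can parse it afterwards.
    public static string DownloadAndSave(string url, string outputFolder, string fileName)
    {
        Directory.CreateDirectory(outputFolder); // no-op if it already exists
        using (var client = new WebClient())
        {
            string html = client.DownloadString(url);
            File.WriteAllText(Path.Combine(outputFolder, fileName), html);
            return html;
        }
    }

    // Step 4: download a binary resource (an image, for example) straight to disk.
    public static void DownloadResource(string url, string outputFolder, string fileName)
    {
        Directory.CreateDirectory(outputFolder);
        using (var client = new WebClient())
        {
            client.DownloadFile(url, Path.Combine(outputFolder, fileName));
        }
    }
}
```

Calling `PageSaver.DownloadAndSave(url, folder, "index.htm")` would give you both the saved file and the string to parse in step 3.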

(If you want to get all the pages of the website, you will also need to parse for hyperlinks and recursively download all of those too.)
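
The recursion could look roughly like the sketch below. `Crawler`, `fetch` and `extractLinks` are hypothetical names: `fetch` stands for whatever downloads and saves a page, and `extractLinks` for whatever pulls absolute links out of the HTML. The `seen` set and the `maxPages` cap are assumptions added only so the sketch is guaranteed to stop.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical crawler skeleton, for illustration only.
static class Crawler
{
    // fetch:        downloads (and saves) one page, returns its HTML.
    // extractLinks: returns the absolute links found in that HTML.
    // seen:         URLs already handled, so no page is fetched twice.
    // maxPages:     hard cap on pages, which also bounds the recursion depth.
    public static void Crawl(
        Uri page,
        Func<Uri, string> fetch,
        Func<string, Uri, IEnumerable<Uri>> extractLinks,
        ISet<string> seen,
        int maxPages)
    {
        // Stop at the cap, or when this URL was already handled.
        if (seen.Count >= maxPages || !seen.Add(page.AbsoluteUri))
            return;

        string html = fetch(page);
        foreach (Uri link in extractLinks(html, page))
            Crawl(link, fetch, extractLinks, seen, maxPages);
    }
}
```

Passing the download and parsing steps in as delegates keeps the recursion itself easy to try out against a fake, in-memory "site" before pointing it at a real one.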

Please keep in mind that the home page has a link to the home page!
To avoid cycles like that, use a Dictionary to keep track of what you have and have not downloaded. A dictionary can contain another dictionary: you can save index.html in the first one, add a dictionary for asp.net, and in that second dictionary save index.html again.
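
One way to read that is a tree of dictionaries keyed by path segment. The sketch below is my own interpretation, not a fixed recipe: `SiteNode`, `SiteTree` and `TryMarkDownloaded` are hypothetical names, and treating the bare root URL as `index.html` is an assumption.

```csharp
using System;
using System.Collections.Generic;

// One node per path segment; Children is the "dictionary inside a dictionary".
class SiteNode
{
    public readonly Dictionary<string, SiteNode> Children = new Dictionary<string, SiteNode>();
    public bool Downloaded; // true once the page at this path has been fetched
}

class SiteTree
{
    public readonly SiteNode Root = new SiteNode();

    // Returns true the first time a URL is seen (download it now),
    // false if it was already recorded (skip it, which breaks the cycle).
    public bool TryMarkDownloaded(Uri url)
    {
        string[] segments = url.AbsolutePath.Split(
            new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        if (segments.Length == 0)
            segments = new[] { "index.html" }; // assumption: "/" means index.html

        SiteNode node = Root;
        foreach (string segment in segments)
        {
            SiteNode child;
            if (!node.Children.TryGetValue(segment, out child))
            {
                child = new SiteNode();
                node.Children[segment] = child;
            }
            node = child;
        }

        if (node.Downloaded)
            return false;
        node.Downloaded = true;
        return true;
    }
}
```

With this, `/index.html` and `/asp.net/index.html` end up in different dictionaries, and a second request for either one returns false.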

In the end you can then recursively print the dictionary to give you the sitemap.
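
As a rough sketch of that recursive print, assuming a hypothetical `SitemapNode` type that is literally a dictionary of dictionaries (the keys are sorted only so the output order is predictable):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// A dictionary whose values are dictionaries of the same kind (hypothetical type).
class SitemapNode : Dictionary<string, SitemapNode> { }

static class SitemapPrinter
{
    // Renders every key on its own line, indented two spaces per level.
    public static string Render(SitemapNode node, int depth = 0)
    {
        var text = new StringBuilder();
        foreach (string key in node.Keys.OrderBy(k => k, StringComparer.Ordinal))
        {
            text.Append(' ', depth * 2).Append(key).Append('\n');
            text.Append(Render(node[key], depth + 1)); // recurse into the nested dictionary
        }
        return text.ToString();
    }
}

static class SitemapDemo
{
    public static SitemapNode BuildSample()
    {
        return new SitemapNode
        {
            { "index.html", new SitemapNode() },
            { "asp.net", new SitemapNode { { "index.html", new SitemapNode() } } }
        };
    }
}
```

`SitemapPrinter.Render(SitemapDemo.BuildSample())` lists `asp.net` first, with its own `index.html` indented beneath it, followed by the top-level `index.html`. The same walk could write `<a href>` lines into your summary index.htm instead of plain text.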

Do keep in mind that this will generate a lot of traffic, and might not always be allowed by the website owners.
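
One simple way to keep the traffic down is to wait between requests. The `Throttle` class below is a hypothetical helper and the interval is whatever you choose to pass in. It does not check whether crawling is permitted at all; reading the site's terms and robots.txt is a separate step that is not shown here.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: enforces a minimum pause between consecutive requests.
class Throttle
{
    private readonly TimeSpan minInterval;
    private readonly Stopwatch sinceLast = new Stopwatch();

    public Throttle(TimeSpan minInterval)
    {
        this.minInterval = minInterval;
    }

    // Call this right before each download. The first call returns at once;
    // later calls sleep until minInterval has passed since the previous call.
    public void WaitTurn()
    {
        if (sinceLast.IsRunning)
        {
            TimeSpan remaining = minInterval - sinceLast.Elapsed;
            if (remaining > TimeSpan.Zero)
                Thread.Sleep(remaining);
        }
        sinceLast.Restart();
    }
}
```

You would create one `Throttle` for the whole crawl and call `WaitTurn()` before every page or image download.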

Hope this helps you on your way :-)
 
Comments
StackQ 5-Nov-12 7:39am    
Can you help me with some programming logic? That would be better for me.
Christiaan Rakowski 5-Nov-12 13:54pm    
If you look at the documentation pages for the WebClient and FileStream classes you will see the code needed for parts 1, 2 and 4. For part 3 you could use the Regex class; for an example regular expression, take a look at this post: http://stackoverflow.com/questions/5717312/regular-expression-for-url
Regex.Matches gives you a collection of Match objects; each one contains the next URL to parse (recursively) in its Value.
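
For part 3, something along these lines might do as a starting point. `LinkExtractor` is a hypothetical name, and the pattern is a deliberately simplified assumption of mine (quoted `href` and `src` values only), not the one from the linked post. A regular expression will miss or misread some real-world HTML, so treat it as a sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class LinkExtractor
{
    // Simplified pattern (an assumption): href="..." or src='...' attribute values.
    private static readonly Regex LinkPattern = new Regex(
        @"(?:href|src)\s*=\s*[""']([^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns absolute http/https URLs on the same host as pageUrl.
    public static List<Uri> Extract(string html, Uri pageUrl)
    {
        var links = new List<Uri>();
        foreach (Match match in LinkPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links such as "logo.png" against the current page.
            if (!Uri.TryCreate(pageUrl, match.Groups[1].Value, out absolute))
                continue;
            bool isWeb = absolute.Scheme == Uri.UriSchemeHttp
                      || absolute.Scheme == Uri.UriSchemeHttps;
            bool sameHost = string.Equals(absolute.Host, pageUrl.Host,
                                          StringComparison.OrdinalIgnoreCase);
            if (isWeb && sameHost)
                links.Add(new Uri(absolute.GetLeftPart(UriPartial.Query))); // drop any #fragment
        }
        return links;
    }
}
```

Restricting the result to the same host is a design choice to keep the download from wandering off to other websites; drop that check if you also want externally hosted images.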


http://msdn.microsoft.com/en-us/library/system.net.webclient(v=vs.100).aspx
http://msdn.microsoft.com/en-us/library/system.io.filestream(v=vs.100).aspx
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex(v=vs.100).aspx

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


