Is there any way in C# that I can get a list of all of the pages that are on a web site? For example, if I choose to use 'www.microsoft.com' then it should return:
'www.microsoft.com/shop'
'www.microsoft.com/products'
'www.microsoft.com/downloads'
'www.microsoft.com/support'
...and all the other pages for it...

In the end, what I want is the link to every image resource used on the site. I thought that if I can get the links to the pages, I could then download the images from each one. Pseudo code:
C#
// Pseudo code only: WebSite, WebPage and WebPageImage are hypothetical types.
foreach (WebPage wp in WebSite.GetWebPages("http://www.microsoft.com"))
{
    Console.WriteLine(wp.GetWebUrl.ToString());
    foreach (WebPageImage wpi in WebPage.GetImages(wp))
    {
        WebPageImage.DownloadImage(wpi.GetWebUrl.ToString());
        Console.WriteLine("Image Downloaded: " + wpi.GetWebUrl.ToString());
    }
}
// or something like that

I hope there is a way to do this. If so, how would I go about it? I hope it is clear what I want to do. Thank you.
Comments
[no name] 16-Dec-13 10:52am    
Recursive or not? Preferably not, but I am asking just to make that point clear :)
Henry Hunt 16-Dec-13 10:57am    
Sorry, what do you mean by that?
[no name] 16-Dec-13 10:58am    
If a page contains a link, do you also want the links found on the page it points to, and so on?
Henry Hunt 16-Dec-13 11:02am    
Ah I get you, yes it may be useful. All I really want is a list of URLs to all of the pages belonging to a website/server/domain
[no name] 16-Dec-13 11:21am    
Sorry, I replied at the wrong level before.

Are you aware of what this means? For example, a link on a "sub page" can point back to a page which you have already scanned (the simplest example: it links back to your starting page).

OK, it is not a big problem as long as you are aware of it.
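
To illustrate the point about sub pages linking back to pages already scanned, here is a minimal sketch of a crawl that remembers what it has seen. It assumes the links on a page are supplied by a callback; `CrawlSketch`, `Crawl`, `getLinks` and `maxPages` are names made up for this example and are not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Breadth-first walk over links that visits each URL at most once.
    // getLinks is a hypothetical callback returning the links found on a page.
    public static List<Uri> Crawl(Uri start, Func<Uri, IEnumerable<Uri>> getLinks, int maxPages = 1000)
    {
        var visited = new HashSet<Uri> { start };
        var queue = new Queue<Uri>();
        var result = new List<Uri>();
        queue.Enqueue(start);

        while (queue.Count > 0 && result.Count < maxPages)
        {
            Uri page = queue.Dequeue();
            result.Add(page);

            foreach (Uri link in getLinks(page))
            {
                // Stay on the starting host, and skip anything already seen.
                // HashSet.Add returns false for a URL that is already present,
                // which is what stops the crawl from looping forever.
                bool sameHost = string.Equals(link.Host, start.Host, StringComparison.OrdinalIgnoreCase);
                if (sameHost && visited.Add(link))
                {
                    queue.Enqueue(link);
                }
            }
        }
        return result;
    }
}
```

The `maxPages` cap is there because a site that generates pages on the fly can produce an effectively endless supply of distinct URLs, so the visited set alone does not guarantee that the crawl ends.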

"All I really want is a list of URLs to all of the pages belonging to a website/server/domain"

There isn't such a thing as far as I know, and it is very unlikely to exist if you think about it. Many sites serve up specific information from a DB by processing 404 errors (file or directory not found) and using the page or folder name details to access the DB and retrieve the info. These aren't "genuine" pages, but they are valid URLs which will result in a page of data going back to the user.

AFAIK, IIS looks for pages on a request-by-request basis: it doesn't keep a "cached" list of URLs and serve up only pages which exist.
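
Since the server will not hand over a list, what a client can do is read the links and image references out of each page's HTML. The following is only a rough sketch of that step, assuming simple, quoted attributes; `HtmlLinkSketch` and `Extract` are names invented for this example, and a regular expression is a shortcut here, so a real HTML parser would cope better with messy markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlLinkSketch
{
    // Returns absolute http/https URLs taken from one attribute of one tag,
    // e.g. ("a", "href") for page links or ("img", "src") for images.
    public static List<Uri> Extract(string html, Uri pageUrl, string tag, string attribute)
    {
        // Matches <tag ... attribute="value"> with single or double quotes.
        string pattern = "<" + Regex.Escape(tag) + @"\b[^>]*?\b" + Regex.Escape(attribute)
                       + @"\s*=\s*[""']([^""']+)[""']";
        var found = new List<Uri>();

        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            // Resolve relative paths such as "/shop" against the page's own URL,
            // and drop non-web schemes such as mailto: or javascript:.
            if (Uri.TryCreate(pageUrl, m.Groups[1].Value, out Uri absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps))
            {
                found.Add(absolute);
            }
        }
        return found;
    }
}
```

The HTML string could come from `HttpClient.GetStringAsync`. Calling `Extract(html, pageUrl, "a", "href")` gives candidate pages to visit next, and `Extract(html, pageUrl, "img", "src")` gives the image URLs to download. Pages that are never linked from anywhere will still be missed.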
 
 
Comments
[no name] 16-Dec-13 11:24am    
Maybe my English is failing me, I hope not, but I think something like this does exist. I used it about three years ago. Unfortunately I can't find the link again... I am searching for it right now :)
Take a DataGridView.

Whenever you write a Response.Redirect to another page, just add that page's name to the DataGridView.

It's easy.
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900