I am using a WebClient routine to capture a webpage. Previously all was well; however, on one website I am now suddenly retrieving a page that seems to come "before" the page I'm after.

In brief the routine looks like this:

C#
// ~~> Create a WebClient instance; the using block disposes it afterwards.
   string source;
   using (WebClient x = new WebClient())
   {
// ~~> Download the page at the URL in SearchTerm (DownloadStringAsync returns void, so use DownloadString).
      source = x.DownloadString(SearchTerm);
   }


Previously all was fine, but now one website is returning a page that begins like this:

HTML
<html><head><meta http-equiv="Pragma" content="no-cache">


This is only occurring on one website (rest are fine) and I assume that this must be a change to this particular website. Can someone point me towards what it is that has changed and how I get round this to load the "true" file?

Thanks.
Comments
Kornfeld Eliyahu Peter 8-Sep-15 9:41am    
What happens if you open that address in a browser?
Andrw_S 8-Sep-15 9:44am    
The result is exactly as I would expect - i.e. the "true" page is loaded.
Kornfeld Eliyahu Peter 8-Sep-15 10:00am    
Is there no secondary page (advertising)?
F-ES Sitecore 8-Sep-15 9:42am    
Ask the owners of the site you're downloading the html from.

1 solution

After much fiddling, I discovered that the page actually loaded twice. The first load was a default header file that now precedes all of that site's pages. The second was the true page.

Should you ever run into this problem, you can check for something similar using the WebBrowser DocumentCompleted event. Something like this:

C#
// ~~> Create a browser.
WebBrowser browser = new WebBrowser();

// ~~> Attach the DocumentCompleted handler.
browser.DocumentCompleted += browser_DocumentCompleted;

// ~~> Suppress script error dialogs.
browser.ScriptErrorsSuppressed = true;

// ~~> Start loading the page (SearchTerm holds the URL, as in the question).
browser.Navigate(SearchTerm);


Then check the content of the page through the browser's DocumentText property.

C#
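private void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // ~~> Get the source string from the browser that raised the event.
    // The event fires once per completed load, so the interim page may arrive first.
    string source = ((WebBrowser)sender).DocumentText;
}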


Because DocumentCompleted fires once for the interim page and again for the true page, wrap the handler in some sort of holding routine that keeps waiting until the string "source" contains the information you expect.
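
One way such a holding routine might look is sketched below. It is only an illustration, not the exact code I used: the names PageLoader, IsExpectedPage, LoadUntilExpectedAsync and the "marker" parameter are made up for this example, and it assumes the true page contains some text you can recognise and that the call is made from a Windows Forms UI thread (STA with a running message loop).

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class PageLoader
{
    // Hypothetical check: true when the HTML contains the text we expect
    // to find only on the "true" page (case-insensitive).
    public static bool IsExpectedPage(string html, string marker)
    {
        return !string.IsNullOrEmpty(html)
            && html.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Hypothetical holding routine: navigates, then ignores every
    // DocumentCompleted event until the loaded document passes the check.
    // Assumption: called on an STA thread with a running message loop.
    public static Task<string> LoadUntilExpectedAsync(
        WebBrowser browser, string url, string marker)
    {
        var tcs = new TaskCompletionSource<string>();
        WebBrowserDocumentCompletedEventHandler handler = null;

        handler = (sender, e) =>
        {
            string source = browser.DocumentText;

            // Interim page (e.g. the no-cache header page): keep waiting.
            if (!IsExpectedPage(source, marker))
            {
                return;
            }

            // True page: stop listening and hand back the source.
            browser.DocumentCompleted -= handler;
            tcs.TrySetResult(source);
        };

        browser.DocumentCompleted += handler;
        browser.ScriptErrorsSuppressed = true;
        browser.Navigate(url);
        return tcs.Task;
    }
}
```

From an async method on the form you could then write something like `string source = await PageLoader.LoadUntilExpectedAsync(browser, SearchTerm, "some text from the true page");`. Note that this sketch has no timeout, so if the expected text never appears the task never completes; you would probably want to add one.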

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)