I need to scrape a table of info from a site for which I have valid credentials - the owners of the site do not provide an API.

I performed a login and saved the traffic with Fiddler, and am trying to replicate the key steps. I'm going to show the steps I've done so far, and get to where I am stuck. Sorry for some of this being so elementary.

I am doing this using HttpWebRequest and HttpWebResponse.

I am guessing that the cookie data in the third call, below, is needed, and that it is set by a client-side script that gets collected between the 2nd and 3rd call - but I am new to this and unsure - and have no idea how to get a valid cookie without using a browser.

I can probably solve this by using a WebBrowser object, but that seems like a clumsy solution. Is there a less clumsy way to go? Are there other objects or libraries I should try? (RestSharp? Postman? WebRequest instead of HttpWebRequest?) Is there any object type that will run a script and allow me to grab the cookies?

What I have tried:

1 - I make a request to the base URL - call it https://www.abc.com. A cookie is set along with the response. My code looks like this:

C#
CookieContainer cookieJar = new CookieContainer();
request = (HttpWebRequest)WebRequest.Create(urlBase);
request.CookieContainer = cookieJar;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
newUrl = response.ResponseUri.ToString();

Note - when I look at the CookieContainer (cookieJar) it has a count of 1 after the call. Interestingly, the response object does not contain the cookie - but I think all is okay because I can use cookieJar.

2 - Now there is a 2nd call (I'm not yet at the page where the name and password are presented - that doesn't happen until the 4th call). My code looks like this:

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlBase + secondCallFolderAddition);

CookieCollection bakery = new CookieCollection();

request.KeepAlive = true;
request.Headers.Add("Upgrade-Insecure-Requests", @"1");
//request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.57";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate, br");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string newURL = response.ResponseUri.ToString();


So far so good - I get an OK status, and the response looks good compared to the original Fiddler traffic capture. In the original this 2nd call does not set a cookie, and no cookie is set here.

But here's where I get lost. For the third call the browser sent cookie data with three values (I've obfuscated):

C#
__utma=1.123456789.123456789.123456789.123456789.1
olfsk=olfsk12345678901234567890123456789
hblid=abCDl11ABCabXabc1aABv1FLFX1RE1OS


I don't know where those values get set. They seem to relate to Google Analytics (from articles I've found), but I don't know how to collect them so that I can attach them to the call I make. My call looks like this:

C#
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(newUrl); // url collected above

request.KeepAlive = true;
request.Headers.Add("Upgrade-Insecure-Requests", "1");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.57";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";

request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate, br");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");

//request.Headers.Set(HttpRequestHeader.Cookie, @"__utma=1.123456789.123456789.123456789.123456789.1; olfsk=olfsk12345678901234567890123456789; hblid=abCDl11ABCabXabc1aABv1FLFX1RE1OS");

HttpWebResponse response = (HttpWebResponse)request.GetResponse();

Uri newURL = response.ResponseUri;



Please note the commented out line with the cookie data - I've tried this with that line un-commented also.

What happens is that I never get a response to the third call.
Comments
Graeme_Grant 14-Aug-17 18:41pm    
Have you tried using a program called Fiddler to look at the normal flow when a web browser is used?
Ken-in-California 14-Aug-17 21:08pm    
Graeme,
Yes - Fiddler was my starting point. I logged in with a browser and navigated to the page where the data list is (grabbing the data is my goal).

As you know, Fiddler shows when cookies are set, and when they are sent as part of a call.

There is a cookie that is set during a response, and I grab that cookie and put it in a CookieContainer. That cookie is not the problem. The problem is that in the third call there is some cookie data being sent, but there is no "record" of it being set in the Fiddler sessions.

I have concluded that the cookie is being set by a JavaScript script in a page (the page received in response to the second call).

So what I am planning to try is to find the script, replicate in C# what it does, and then use the cookie data generated that way.

If all else fails I'll do this with a WebBrowser object - but that is clunky and I don't think I'll be able to manage multiple threads of HTTP traffic the way I was planning to later in the program.
Graeme_Grant 14-Aug-17 22:01pm    
It should still be visible... Check also your response "method" type matches. There could be header keys not being passed...
Ken-in-California 15-Aug-17 10:48am    
I promise you the first time they show up is in the request call that includes them, as I described above. And I've been meticulous about the headers. The cookies that appear are the ones noted above: __utma, olfsk, and hblid. The first appears to be created by a Google Analytics script. The second and third appear to be set by Olark (which is a live chat add-in). And they are all being set on the browser side.
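
If the browser-side cookies do turn out to matter, one way to send them without running any script is to copy the values into the CookieContainer by hand. The sketch below is only an illustration: `ClientCookieSeeder` and `Seed` are hypothetical names, and whether the server actually checks these analytics and chat cookies is not established anywhere in this thread.

```csharp
using System;
using System.Net;

public static class ClientCookieSeeder
{
    // Hypothetical helper: copies "name=value; name2=value2" pairs
    // (the format of a Cookie request header) into the jar for one site.
    public static void Seed(CookieContainer jar, Uri site, string cookieHeader)
    {
        foreach (string pair in cookieHeader.Split(';'))
        {
            int eq = pair.IndexOf('=');
            if (eq <= 0)
            {
                continue; // skip empty or malformed fragments
            }

            string name = pair.Substring(0, eq).Trim();
            string value = pair.Substring(eq + 1).Trim();

            // Add(Uri, Cookie) fills in the cookie's domain from the URI.
            jar.Add(site, new Cookie(name, value));
        }
    }
}
```

Calling `ClientCookieSeeder.Seed(cookieJar, new Uri("https://www.abc.com/"), "__utma=...; olfsk=...; hblid=...")` before the third request would put the three values into the same jar the request uses, so no manual `Cookie` header is needed.
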
Richard Deeming 15-Aug-17 14:17pm    
It looks like you're using a different CookieContainer for each request.

You need to use a single CookieContainer instance across all requests that should share the same set of cookies. That should take care of reading the cookies from the response, and passing them to the next request.
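
A minimal sketch of that suggestion, assuming the same HttpWebRequest approach as in the question. `SharedCookieSession`, `CreateRequest`, and `Fetch` are hypothetical names, not part of any library. The sketch also disposes each response. That may be relevant to the third call hanging, though the thread does not confirm it: on .NET Framework a client application is limited to two concurrent connections per host by default, and a response that is never closed keeps its connection in use.

```csharp
using System;
using System.IO;
using System.Net;

public static class SharedCookieSession
{
    // One jar for the whole session; every request below points at it,
    // so cookies set by one response are sent with the next request.
    public static readonly CookieContainer CookieJar = new CookieContainer();

    // Hypothetical helper: builds a request wired to the shared jar.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = CookieJar;
        return request;
    }

    // Hypothetical helper: sends a GET, reads the body, and disposes the
    // response so its connection is released before the next call.
    public static string Fetch(string url, out Uri finalUri)
    {
        HttpWebRequest request = CreateRequest(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            finalUri = response.ResponseUri; // URL after any redirects
            return reader.ReadToEnd();
        }
    }
}
```

With this in place, each of the calls in the question would go through `SharedCookieSession.Fetch(...)`, passing the `finalUri` from one call as the URL of the next where the original code used `response.ResponseUri`.
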

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)