I need to scrape a table of info from a site for which I have valid credentials - the owners of the site do not provide an API.
I performed a login and saved the traffic with Fiddler, and am trying to replicate the key steps. I'm going to show the steps I've done so far, and get to where I am stuck. Sorry for some of this being so elementary.
I am doing this using HttpWebRequest and HttpWebResponse.
I am guessing that the cookie data in the third call, below, is needed, and that it is set by a client-side script that gets collected between the 2nd and 3rd call - but I am new to this and unsure - and have no idea how to get a valid cookie without using a browser.
I can probably solve this by using a WebBrowser object, but that seems like a clumsy solution. Is there a less clumsy way to go? Are there other objects or libraries I should try? (RestSharp? Postman? A WebRequest object instead of HttpWebRequest?) Is there any object type that will run a script and allow me to grab the cookies?
What I have tried:
1 - I make a call to the base URL - call it https://www.abc.com. The response sets a cookie. My code looks like this:
CookieContainer cookieJar = new CookieContainer();
request = (HttpWebRequest)WebRequest.Create(urlBase);
request.CookieContainer = cookieJar;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
newUrl = response.ResponseUri.ToString();
Note - when I look at the CookieContainer (cookieJar), it has a count of 1 after the call. Interestingly, the response object does not contain the cookie, but I think all is okay because I can use cookieJar.
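To check what actually ended up in the container, something like the following minimal sketch could be used. It runs offline: the URL is the placeholder from above, and the "session" cookie is a made-up stand-in for whatever the server really sets. One possible explanation for the empty response.Cookies is that the cookie was set on an intermediate redirect response, in which case only the container would hold it; that is an assumption, not something verified for this site.

```csharp
using System;
using System.Net;

class CookieDump
{
    static void Main()
    {
        // Placeholder for urlBase; not the real site.
        Uri baseUri = new Uri("https://www.abc.com/");
        CookieContainer cookieJar = new CookieContainer();

        // Hypothetical cookie standing in for the one the server set in call 1.
        cookieJar.Add(baseUri, new Cookie("session", "abc123"));

        // GetCookies returns only the cookies that would be sent to baseUri.
        foreach (Cookie c in cookieJar.GetCookies(baseUri))
        {
            Console.WriteLine($"{c.Name}={c.Value}; domain={c.Domain}; path={c.Path}");
        }
    }
}
```

In the real code the Add call would be dropped and the loop run against the cookieJar that was attached to the request.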
2 - Now there is a 2nd call (I'm not yet at the page where the name and password are presented - that doesn't happen until the 4th call). My code looks like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlBase +
secondCallFolderAddition);
CookieCollection bakery = new CookieCollection();
request.KeepAlive = true;
request.Headers.Add("Upgrade-Insecure-Requests", @"1");
//request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.57";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate, br");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string newUrl = response.ResponseUri.ToString();
So far so good - I get an OK status, and the response looks good compared to the original Fiddler traffic capture. In the original this 2nd call does not set a cookie, and no cookie is set here.
But here's where I get lost. For the third call the browser sent cookie data with three values (I've obfuscated):
__utma=1.123456789.123456789.123456789.123456789.1
olfsk=olfsk12345678901234567890123456789
hblid=abCDl11ABCabXabc1aABv1FLFX1RE1OS
I don't know where those values get set. They seem to relate to Google Analytics (from articles I've found), but I don't know how to collect them so that I can attach them to the call I make. My call looks like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(newUrl); // collected above
request.KeepAlive = true;
request.Headers.Add("Upgrade-Insecure-Requests", "1");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.57";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate, br");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
//request.Headers.Set(HttpRequestHeader.Cookie, @"__utma=1.123456789.123456789.123456789.123456789.1; olfsk=olfsk12345678901234567890123456789; hblid=abCDl11ABCabXabc1aABv1FLFX1RE1OS");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Uri newURL = response.ResponseUri;
Please note the commented out line with the cookie data - I've tried this with that line un-commented also.
What happens is that I never get a response to the call.
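For reference, here is a minimal sketch of attaching the three values through a CookieContainer rather than a raw Cookie header. It assumes the cookies are scoped to the placeholder host https://www.abc.com/, which may not match the real site. It only prints the header that would be sent; it makes no network call and does not address where the values originally come from.

```csharp
using System;
using System.Net;

class AttachCookies
{
    static void Main()
    {
        // Placeholder host; the real cookie domain is an assumption here.
        Uri target = new Uri("https://www.abc.com/");
        CookieContainer cookieJar = new CookieContainer();

        // The obfuscated values from the captured third call.
        cookieJar.Add(target, new Cookie("__utma", "1.123456789.123456789.123456789.123456789.1"));
        cookieJar.Add(target, new Cookie("olfsk", "olfsk12345678901234567890123456789"));
        cookieJar.Add(target, new Cookie("hblid", "abCDl11ABCabXabc1aABv1FLFX1RE1OS"));

        // Shows the Cookie header the container would produce for this URI.
        Console.WriteLine(cookieJar.GetCookieHeader(target));
    }
}
```

Assigning the same container to request.CookieContainer on each call would let it carry both these values and any cookies the server sets along the way.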