Before we start, I want to say that I understand Google and others offer dedicated tools for HTTP parsing and extraction, but due to time constraints I have not had time to learn how to use them, and some cost money, which I don't have much of.

So I made a regex-based link collector built around a keyword taken from the HTML title of another website. I am simply trying to find all websites that talk about similar things to the article the title is taken from. I produced the code below, but it outputs 0 links. Can anybody help? Thanks, much appreciated.

Overview of how it works: a URL is set in urlAddress, and the program downloads the HTML of that page and extracts its title. I then append that title to urlAddressofW and repeat the download, but this time the URL is not a specific website. It is the search results page, the screen you see after pressing search and before opening a result, which links to all the matching websites. The code downloads the HTML of the first page of search results and runs it through the regex Findlinks (which, I note, was put together somewhat haphazardly, so that may be the problem). The regex should return everything between the two markers WebsiteExtractS and WebsiteExtractE, which is where a link sits. Finally, the code is supposed to print each link it found, but it prints none, and the counter NumofLinks stays at zero.
C#
string Titlestart = $"<title>";
string Titleend = $"</title>";
string WebsiteExtractS = $"iUh30 bc";
string WebsiteExtractE = $"&rsaquo";
string ADDING = "";
int NumofLinks = 0;
bool contains = false;


//Finding Website title
string urlAddress = "https://www.bbc.co.uk/news/uk-politics-50305284";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (response.CharacterSet == null)
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream,
        Encoding.GetEncoding(response.CharacterSet));
        readStream.ToString();
        // Console.WriteLine(readStream);
    }
    string data = readStream.ReadToEnd();
    string ImpureText = data;

    int titlestart = ImpureText.IndexOf(Titlestart) + Titlestart.Length;
    int titleend = ImpureText.IndexOf(Titleend) - 10; // drops the last 10 characters before </title>; the exact number of characters that need removing is unknown.
    string PureTexttitle = ImpureText.Substring(titlestart, titleend - titlestart);
    Console.WriteLine("Title-" + PureTexttitle);

    //Using title to find Links
    ADDING = PureTexttitle;


    string urlAddressofW = "http://google.com/search?q=" + ADDING;

    HttpWebRequest requestW = (HttpWebRequest)WebRequest.Create(urlAddressofW);
    HttpWebResponse responseW = (HttpWebResponse)requestW.GetResponse();
    if (responseW.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStreamW = responseW.GetResponseStream();
        StreamReader readStreamW = null;
        if (responseW.CharacterSet == null)
        {
            readStreamW = new StreamReader(receiveStreamW);
        }
        else
        {
            readStreamW = new StreamReader(receiveStream,
            Encoding.GetEncoding(responseW.CharacterSet));
            readStreamW.ToString();

            //Console.WriteLine(readStreamW);

        }
        string dataW = readStreamW.ReadToEnd();
        string ImpureTextW = dataW.ToString();

        if (ImpureTextW.Contains("iUh30 bc"))
        {
            contains = false;
            Console.WriteLine("YELP");
        }


        //this code is never reached
        MatchCollection Findlinks = Regex.Matches(ImpureTextW, "iUh30 bc(.*?) &rsaquo");

        //var Findlinks = Regex.Matches(ImpureTextW, "<p>(.*?)</p>");


        foreach (Match Link in Findlinks)
        {



            Console.WriteLine(Link.Value);

            NumofLinks += 1;




        }
        Console.WriteLine(NumofLinks);



    }
}
Console.ReadLine();
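
The `- 10` offset in the title step above is a guess at how long the site suffix is, so it breaks as soon as a page uses a different suffix. A minimal sketch of a more tolerant approach follows; `TitleSketch`, `ExtractTitle` and `StripSuffix` are hypothetical names that do not appear in the original post, and the suffix to strip is an assumption the caller has to supply.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleSketch
{
    // Hypothetical helper: returns the text between <title> and </title>,
    // or an empty string when the page has no title element.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success)
        {
            return string.Empty;
        }
        // Decode entities such as &amp; and trim surrounding whitespace.
        return WebUtility.HtmlDecode(m.Groups[1].Value).Trim();
    }

    // Hypothetical helper: removes a known site suffix (an assumption supplied
    // by the caller) instead of cutting a fixed number of characters.
    public static string StripSuffix(string title, string suffix)
    {
        return title.EndsWith(suffix, StringComparison.Ordinal)
            ? title.Substring(0, title.Length - suffix.Length).TrimEnd()
            : title;
    }
}
```

For example, `TitleSketch.StripSuffix(TitleSketch.ExtractTitle(data), " - BBC News")` would leave only the headline when the title ends with that suffix, and would return the title unchanged otherwise.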


What I have tried:

I tried editing the regex to see if that was the issue, and I think it is, because nothing in the downloaded page matches the regex.

I also added checks to see whether any links are being collected at all.
Updated 28-Nov-19 23:31pm

1 solution

At first glance, I see 3 errors:
1.
C#
//missing [s] in the URL scheme (http -> https)
string urlAddressofW = "https://google.com/search?q=" + ADDING;


2.
C#
//Google replaces spaces [" "] with [+]
ADDING = PureTexttitle.Replace(" ", "+");


3.
C#
//missing [W] in receiveStream
    if (responseW.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStreamW = responseW.GetResponseStream();
        StreamReader readStreamW = null;
        if (responseW.CharacterSet == null)
        {
            readStreamW = new StreamReader(receiveStreamW);
        }
        else
        {
            //here!!!
            readStreamW = new StreamReader(receiveStreamW,
            Encoding.GetEncoding(responseW.CharacterSet));
            readStreamW.ToString();

            //Console.WriteLine(readStreamW);

        }

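Points 1 and 2 can be combined in one small helper. This is only a sketch: `SearchUrlSketch` and `BuildSearchUrl` are hypothetical names, and it swaps the manual `Replace(" ", "+")` for `WebUtility.UrlEncode`, which also turns spaces into `+` but additionally escapes characters such as `&` that would otherwise cut the query short.

```csharp
using System;
using System.Net;

public static class SearchUrlSketch
{
    // Hypothetical helper: builds the search URL from a page title.
    // WebUtility.UrlEncode encodes spaces as '+' and percent-encodes
    // reserved characters, so the whole title stays inside the q parameter.
    public static string BuildSearchUrl(string title)
    {
        return "https://google.com/search?q=" + WebUtility.UrlEncode(title.Trim());
    }
}
```

With this in place, `string urlAddressofW = SearchUrlSketch.BuildSearchUrl(PureTexttitle);` would replace the manual concatenation.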

Note: Google returns a document like this:
HTML
<!doctype html><html lang="pl"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>General election 2019: Greens call for £100bn a year for climate action - 

<!-- skipped lines here -->

<script nonce="FdWZZ+ZWymCQ/pKGYdroJA==">(function(){var e='YfDgXcOLDZ-70PEPlJ2wmA4';(function(){var a=e,b=window.performance&&window.performance.navigation;b&&2==b.type&&window.ping("/gen_204?ct=backbutton&ei="+a);}).call(this);})();(function(){var b=[function(){google.tick&&google.tick("load","dcl")}];google.dclc=function(a){b.length?b.push(a):a()};function c(){for(var a;a=b.shift();)a()}window.addEventListener?(document.addEventListener("DOMContentLoaded",c,!1),window.addEventListener("load",c,!1)):window.attachEvent&&window.attachEvent("onload",c);}).call(this);(function(){(function(){google.csct={};google.csct.ps='AOvVaw3FMWI9xeFwYcVhHj3sXm62\x26ust\x3d1575109089244484';})();})();(function(){(function(){google.csct.rd=true;})();})();google.drty&&google.drty();</script></body></html>
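
Since the markup Google returns to a program may not contain the markers the regex expects (class names such as `iUh30 bc` are generated and can change), a marker-independent check is to list every absolute link in `href` attributes first and see what is actually there. The following is a rough sketch under that assumption; `LinkSketch` and `ExtractAbsoluteLinks` are hypothetical names, and a regex is only an approximation of real HTML parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Hypothetical helper: collects distinct http/https URLs found in
    // double-quoted href attributes, in order of first appearance.
    public static List<string> ExtractAbsoluteLinks(string html)
    {
        var links = new List<string>();
        MatchCollection matches = Regex.Matches(
            html, @"href\s*=\s*""(https?://[^""]+)""", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            string url = m.Groups[1].Value;
            if (!links.Contains(url))
            {
                links.Add(url); // skip duplicates
            }
        }
        return links;
    }
}
```

Printing `LinkSketch.ExtractAbsoluteLinks(ImpureTextW).Count` right after the download would show whether any links arrive at all before a more specific pattern is tried.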


Note #2: I'd avoid creating new HttpWebRequest and HttpWebResponse objects unless it is necessary to go through the results of the first request.
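
One way to keep the duplicated reading code (and the `receiveStream` / `receiveStreamW` mix-up from point 3) from coming back is to put the stream handling in a single method and call it for each response. This is a sketch only; `PageReaderSketch` and `ReadAll` are hypothetical names, and falling back to UTF-8 when the charset is missing or unknown is an assumption, not something the original code does.

```csharp
using System;
using System.IO;
using System.Text;

public static class PageReaderSketch
{
    // Hypothetical helper: reads a whole response body as text.
    // Assumption: use UTF-8 when no charset is given or it is not recognised.
    public static string ReadAll(Stream body, string charset)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 default.
            }
        }
        // Disposing the reader also closes the underlying stream.
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Both downloads would then read their body the same way, for example `string dataW = PageReaderSketch.ReadAll(responseW.GetResponseStream(), responseW.CharacterSet);`, so there is only one stream variable per call to get wrong.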
 
Comments
CPallini 29-Nov-19 5:39am    
5.
Maciej Los 29-Nov-19 8:00am    
Thank you, Carlo.
HamzaMcBob 30-Nov-19 7:48am    
Thanks for the help, I was stressing out about it to a certain extent. Is there a particular reason why I should do #2? Is it because we don't want to get in trouble with Google?
Maciej Los 30-Nov-19 8:33am    
No. To avoid errors and use less memory. :)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


