Collect links from websearch

Question

Before we start, I want to say that I understand Google and others have dedicated programs for HTTP parsing and extraction, but due to time constraints I have not had time to learn how to use them, and some cost money, which I don't have much of.

So I made a regex-based link collector built around a keyword taken from the HTML title of another website. I am simply trying to find all websites that talk about similar things to the article the title is taken from. I produced the code below, but it outputs 0 links. Can anybody help? Thanks, much appreciated.

Overview of how it works: a URL is put in urlAddress, and the program downloads all the HTML of that page. A bit of code then extracts the title, which I append to urlAddressofW. The process then repeats, but this time the URL is not a specific website; it is the search results page, the screen you see after pressing search and before opening a specific website, which links to all the websites.

The code downloads all the HTML of the first page of search results and feeds it through a regex, "Findlinks", which I admit is rather haphazardly put together, so that may be the problem. The regex should output every link found between the two markers WebsiteExtractS and WebsiteExtractE. Finally, the code is supposed to print each link it found, but it prints none, because the counter NumofLinks stays at zero.

C#

string Titlestart = $"<title>";
string Titleend = $"</title>";
string WebsiteExtractS = $"iUh30 bc";
string WebsiteExtractE = $"&rsaquo";
string ADDING = "";
int NumofLinks = 0;
bool contains = false;


//Finding Website title
string urlAddress = "https://www.bbc.co.uk/news/uk-politics-50305284";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (response.CharacterSet == null)
    {
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream,
        Encoding.GetEncoding(response.CharacterSet));
        readStream.ToString();
        // Console.WriteLine(readStream);
    }
    string data = readStream.ReadToEnd();
    string ImpureText = data;

    int titlestart = ImpureText.IndexOf(Titlestart) + Titlestart.Length;
    int titleend = ImpureText.IndexOf(Titleend) - 10; // this will take x amount of characters off but unknow how many characters are present that need taking off.
    string PureTexttitle = ImpureText.Substring(titlestart, titleend - titlestart);
    Console.WriteLine("Title-" + PureTexttitle);

    //Using title to find Links
    ADDING = PureTexttitle;


    string urlAddressofW = "http://google.com/search?q=" + ADDING;

    HttpWebRequest requestW = (HttpWebRequest)WebRequest.Create(urlAddressofW);
    HttpWebResponse responseW = (HttpWebResponse)requestW.GetResponse();
    if (responseW.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStreamW = responseW.GetResponseStream();
        StreamReader readStreamW = null;
        if (responseW.CharacterSet == null)
        {
            readStreamW = new StreamReader(receiveStreamW);
        }
        else
        {
            readStreamW = new StreamReader(receiveStream,
            Encoding.GetEncoding(responseW.CharacterSet));
            readStreamW.ToString();

            //Console.WriteLine(readStreamW);

        }
        string dataW = readStreamW.ReadToEnd();
        string ImpureTextW = dataW.ToString();

        if (ImpureTextW.Contains("iUh30 bc"))
        {
            contains = false;
            Console.WriteLine("YELP");
        }


        //this code is never reached
        MatchCollection Findlinks = Regex.Matches(ImpureTextW, "iUh30 bc(.*?) &rsaquo");

        //var Findlinks = Regex.Matches(ImpureTextW, "<p>(.*?)</p>");


        foreach (Match Link in Findlinks)
        {
            Console.WriteLine(Link.Value);
            NumofLinks += 1;
        }
        Console.WriteLine(NumofLinks);
    }
}
Console.ReadLine();

What I have tried:

Tried editing the regex to see if that was the issue, and I think it is, because the regex does not find any matches at all.

Added checks to see whether any links are even being collected.

Posted 27-Nov-19 10:36am

HamzaMcBob

Updated 28-Nov-19 23:31pm

GKP1992

v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Maciej Los · Accepted Answer · 2019-11-28T23:32:00

At first glance, I see three errors:
1.

C#

//missing [s] in URL address
string urlAddressofW = "https://google.com/search?q=" + ADDING;

2.

C#

//Google replaces spaces [" "] with [+]
ADDING = PureTexttitle.Replace(" ", "+");

3.

C#

//missing [W] in receiveStream
    if (responseW.StatusCode == HttpStatusCode.OK)
    {
        Stream receiveStreamW = responseW.GetResponseStream();
        StreamReader readStreamW = null;
        if (responseW.CharacterSet == null)
        {
            readStreamW = new StreamReader(receiveStreamW);
        }
        else
        {
            //here!!! (the original passed receiveStream, the first response's stream)
            readStreamW = new StreamReader(receiveStreamW,
            Encoding.GetEncoding(responseW.CharacterSet));
            readStreamW.ToString();

            //Console.WriteLine(readStreamW);

        }
        // ... rest of the block unchanged
    }
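
As a hedged variation on errors 1 and 2: instead of replacing only the spaces, the whole title could be URL-encoded, so that characters such as & or £ do not break the query string. The sketch below assumes a hypothetical helper named SearchUrl.Build (not part of the original code). It uses Uri.EscapeDataString, which encodes spaces as %20 rather than +; both forms are commonly accepted in a query string.

```csharp
using System;

public static class SearchUrl
{
    // Hypothetical helper (an assumption for illustration):
    // builds an https search URL from a page title.
    public static string Build(string title)
    {
        // Remove whitespace or line breaks that may surround the <title> text.
        string query = title.Trim();

        // Percent-encode the query so that spaces, '&', '#', '£' and similar
        // characters cannot change the meaning of the URL.
        return "https://google.com/search?q=" + Uri.EscapeDataString(query);
    }
}
```

It could then be used as `string urlAddressofW = SearchUrl.Build(PureTexttitle);`.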

Note: Google returns a document like this:

HTML

<!doctype html><html lang="pl"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>General election 2019: Greens call for £100bn a year for climate action - 

<!-- skipped lines here -->

<script nonce="FdWZZ+ZWymCQ/pKGYdroJA==">(function(){var e='YfDgXcOLDZ-70PEPlJ2wmA4';(function(){var a=e,b=window.performance&&window.performance.navigation;b&&2==b.type&&window.ping("/gen_204?ct=backbutton&ei="+a);}).call(this);})();(function(){var b=[function(){google.tick&&google.tick("load","dcl")}];google.dclc=function(a){b.length?b.push(a):a()};function c(){for(var a;a=b.shift();)a()}window.addEventListener?(document.addEventListener("DOMContentLoaded",c,!1),window.addEventListener("load",c,!1)):window.attachEvent&&window.attachEvent("onload",c);}).call(this);(function(){(function(){google.csct={};google.csct.ps='AOvVaw3FMWI9xeFwYcVhHj3sXm62\x26ust\x3d1575109089244484';})();})();(function(){(function(){google.csct.rd=true;})();})();google.drty&&google.drty();</script></body></html>
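
Markers such as "iUh30 bc" appear to depend on the markup the search page happens to return, so a pattern tied to them may simply never match. As an alternative technique (not the original poster's method), the title and the links can be pulled out with regex capture groups. This is only a minimal sketch: HtmlScrape, ExtractTitle and ExtractLinks are hypothetical names, the href pattern only catches absolute http(s) URLs in double-quoted attributes, and regex-based HTML parsing remains fragile, so the patterns would likely need adjusting to the actual document.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlScrape
{
    // Returns the trimmed text between <title> and </title>, or "" if absent.
    public static string ExtractTitle(string html)
    {
        // Singleline lets '.' match line breaks inside the title.
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }

    // Returns every absolute http(s) URL found in a double-quoted href attribute.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, @"href=""(https?://[^""]+)""",
            RegexOptions.IgnoreCase))
        {
            // Groups[1] holds only the captured URL;
            // m.Value would also include the surrounding href="...".
            links.Add(m.Groups[1].Value);
        }
        return links;
    }
}
```

Note that the question's loop prints Link.Value, which is the whole match including the markers; Link.Groups[1].Value would give only the text captured by (.*?).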

Note #2: I'd avoid creating new HttpWebRequest and HttpWebResponse objects unless it is necessary to go through the results of the first request.
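
Error 3 is a typical copy-paste slip: the stream-reading block exists twice and the second copy still refers to a variable from the first. One possible way to reduce that risk is to move the reading logic into a single method that both requests call. The sketch below assumes a hypothetical helper named ResponseReader.ReadAll and falls back to UTF-8 when no character set is reported; it is not taken from the original code.

```csharp
using System;
using System.IO;
using System.Text;

public static class ResponseReader
{
    // Hypothetical helper (an assumption for illustration): reads the whole
    // stream as text using the given charset name, or UTF-8 if none is given.
    public static string ReadAll(Stream stream, string charset)
    {
        Encoding encoding = string.IsNullOrEmpty(charset)
            ? Encoding.UTF8
            : Encoding.GetEncoding(charset);

        // The using block disposes the reader (and the stream) when done.
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Each response could then be read with one line, for example `string dataW = ResponseReader.ReadAll(responseW.GetResponseStream(), responseW.CharacterSet);`, so there is no second receiveStream variable to mix up.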