I have the following code for a link extractor, which extracts all internal links for a given URL:

C#
SearchEngines Search = SearchEngines.Google;
LinksExtractor extractor = new LinksExtractor("http://yahoo.com/", Search, 10);

// Print each extracted link, pausing for Enter after every one
for (int i = 0; i < extractor.Links.Count; i++)
{
    Console.Write(extractor.Links[i].Href.ToString());
    Console.ReadLine();
}


This code gives me all the links inside yahoo.com, like
yahoo.com/sports
yahoo.com/business
but it also gives unwanted links: if there is an advertisement on Yahoo for shadi.com, it returns shadi.com's link as well.
I don't want that.
Please help.
Updated 6-Aug-11 2:00am

Is it that hard to ignore the links you don't want? For instance, anything that doesn't start with "http://yahoo.com/"?
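A minimal sketch of that prefix check, assuming the hrefs are available as plain strings. The `PrefixFilter` class and `KeepWithPrefix` method are hypothetical names made up for illustration; they are not part of `LinksExtractor`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixFilter
{
    // Keeps only the hrefs that begin with the given prefix.
    // The comparison ignores case; null entries are dropped.
    public static List<string> KeepWithPrefix(IEnumerable<string> hrefs, string prefix)
    {
        return hrefs
            .Where(h => h != null && h.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

If `extractor.Links` can be enumerated with LINQ, something like `PrefixFilter.KeepWithPrefix(extractor.Links.Select(l => l.Href.ToString()), "http://yahoo.com/")` should do it. Note that a plain prefix check will also drop variants such as `https://yahoo.com/...` or `http://www.yahoo.com/...`, which may or may not be what you want.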
 
 
I wonder if you can make use of Google's advanced filtering capabilities when creating your WebRequest?

For example, this Google search shows you only pages within Yahoo.com, and only pages in English.

But perhaps you've already eliminated that as a strategy, so:

If extractor.Links is a collection of type IEnumerable<Link>, then you should be able to use a relatively simple LINQ filter operation (it needs "using System.Linq;") like:
string matchStr = "yahoo.com";

// Note: Contains also matches URLs that merely mention yahoo.com somewhere, e.g. in a query string
var filteredMatches = extractor.Links.Where(link => link.Href.ToString().Contains(matchStr)).ToList();
Disclaimer: this code fragment is off the top of my head, is not tested, and may be flawed or not work for you as is; it is intended only to suggest a strategy to you.
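A substring match on "yahoo.com" would also accept an advertiser URL that only mentions Yahoo, such as http://shadi.com/?ref=yahoo.com. A somewhat stricter variant is to parse each href with System.Uri and compare the host name instead. The sketch below is untested against the real LinksExtractor; `HostFilter`, `IsOnDomain` and `KeepOnDomain` are hypothetical names, and it assumes the hrefs are absolute URLs available as strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostFilter
{
    // True when href is an absolute http(s) URL whose host is `domain`
    // itself or a subdomain of it (e.g. sports.yahoo.com for yahoo.com).
    public static bool IsOnDomain(string href, string domain)
    {
        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri)) return false;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return false;

        string host = uri.Host;
        return host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }

    // Keeps only the hrefs that point at the given domain.
    public static List<string> KeepOnDomain(IEnumerable<string> hrefs, string domain)
    {
        return hrefs.Where(h => IsOnDomain(h, domain)).ToList();
    }
}
```

With the extractor from the question, a call might look like `HostFilter.KeepOnDomain(extractor.Links.Select(l => l.Href.ToString()), "yahoo.com")`. Relative links (for example "/sports") are rejected by this check, so they would need to be resolved against the page URL first if the extractor returns them.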
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


