I am building a web crawler desktop app in C# that needs to separate internal and external links: for example, <a href="about.html"> is internal, while <a href="http://www.xyz.com"> or <a href="https://www.xyz.com"> is external. I have tried many solutions. All of them find links well, but none of them separates a website's internal links from its external ones for crawling.
I'm using the following code to separate internal and external links, but it doesn't work the way I need. I have been working on it for two days without any improvement. Could you check it and guide me?
C#
List<string> inter = new List<string>();
List<string> dates = new List<string>();
int count = 0;
List<string> i2 = new List<string>();
WebClient web = new WebClient();
string html = web.DownloadString(textBox1.Text);
string n3 = "", s4 = "";
MatchCollection m0 = Regex.Matches(html, @"<a[^>]*?href[\s]?=[\s\""\']+(.*?)[\""\']+.*?>([^<]+|.*?)?<\/a>", RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

foreach (Match m in m0)
{
    // Group 1 of the anchor match is the captured href value.
    string city = m.Groups[1].Value;

    Match m2 = Regex.Match(city, "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))", RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    string city2 = m2.Groups[1].Value;
    dates.Add(city2);

    // Variants of the entered URL with its first 11, 12 and 7 characters removed
    // (the lengths of "http://www.", "https://www." and "http://").
    s4 = textBox1.Text;
    string n2 = s4.Remove(0, 11);
    n3 = s4.Remove(0, 12);
    string n4 = s4.Remove(0, 7);

    // Look for an absolute URL inside the href and compare it with the variants above.
    Match m3 = Regex.Match(city, @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])", RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    string s5 = m3.Groups[1].Value;
    if (m3.Groups[1].Value != s4 && m3.Groups[1].Value != n2 && m3.Groups[1].Value != n3 && m3.Groups[1].Value != n4)
    {
        i2.Add(city);
        inter.Add(s5);
        count = 1;
    }
}
if (count != 0)
{
    AllLinks.Items.Add(s4);
}
Hrefs.DataSource = i2;
//AllLinks.DataSource = dates;
inter.RemoveAll(string.IsNullOrWhiteSpace);
ExternalLinks.DataSource = inter;
Updated 3-Apr-15 0:24am
Comments
Sinisa Hajnal 1-Apr-15 3:08am    
How would you separate the links? Once you work out how you would do it manually, you can write an algorithm. Until then, you cannot do anything.
M Adeel Khalid 2-Apr-15 1:14am    
Thank you for your reply.
Sinisa Hajnal 2-Apr-15 2:02am    
Once you have all the links, why is it a problem to separate out those that start with http? Or even those that contain the home domain's path?
M Adeel Khalid 2-Apr-15 6:38am    
The problem is that whenever I try to fetch external links, I also get internal ones. I'm stuck and don't know how to differentiate them, or on what basis. It has really become a headache.
Sinisa Hajnal 2-Apr-15 7:27am    
But you just said it - you separate them by having http:// at the start. Why couldn't you use that?
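
One possible way to act on the suggestion above, sketched under a few assumptions. Testing only for http:// at the start also catches absolute links that point back to the same site, which may be why internal links keep showing up among the external ones. An alternative is to resolve each href against the page address with System.Uri and compare hosts. The names LinkClassifier, IsInternal and Split are hypothetical and not from the original post, and treating unparsable or non-HTTP links (mailto:, javascript:) as external is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the original post.
public static class LinkClassifier
{
    // Returns true when href, resolved against pageUri, is an http(s) link
    // on the same host as the page.
    public static bool IsInternal(Uri pageUri, string href)
    {
        Uri resolved;

        // Resolves relative hrefs such as "about.html" against the page address;
        // hrefs that are already absolute are parsed as they are.
        if (!Uri.TryCreate(pageUri, href, out resolved))
            return false; // assumption: an href that cannot be parsed counts as external

        // Assumption: only http and https links can be internal.
        if (resolved.Scheme != Uri.UriSchemeHttp && resolved.Scheme != Uri.UriSchemeHttps)
            return false;

        // Exact host comparison, ignoring case.
        return string.Equals(resolved.Host, pageUri.Host, StringComparison.OrdinalIgnoreCase);
    }

    // Sorts the given href values into the two lists, skipping empty entries.
    public static void Split(Uri pageUri, IEnumerable<string> hrefs,
                             List<string> internalLinks, List<string> externalLinks)
    {
        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
                continue;

            if (IsInternal(pageUri, href))
                internalLinks.Add(href);
            else
                externalLinks.Add(href);
        }
    }
}
```

With the code from the question, this might be called as LinkClassifier.Split(new Uri(textBox1.Text), hrefs, internalLinks, externalLinks), where hrefs holds the captured href values. The host comparison is exact, so www.xyz.com and xyz.com count as different sites here. Whether that is what you want depends on the crawler.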

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


