How to get only the company address block from a website's contact page

I have tried this:

C#
// Requires: using System; using System.IO; using System.Text; using HtmlAgilityPack;
public void Extract_all_text_from_webpage(string filename)
{
    HtmlDocument document = new HtmlDocument();
    document.Load(new MemoryStream(File.ReadAllBytes(filename)));

    // Extract the text once and reuse it for both controls.
    string text = ExtractViewableTextCleaned(document.DocumentNode);
    textBox1.Text += Environment.NewLine + text;

    // Duplicate check is currently disabled, so every page's text is added to the list.
    // if (_addressDictionaries.AddressDictDuplicates.Contains(text))
    {
        listBox1.Items.Add(Environment.NewLine + text);
    }
}

public static string ExtractViewableTextCleaned(HtmlNode node)
{
    string textWithLotsOfWhiteSpaces = ExtractViewableText(node);
    // Strip the &nbsp; and &copy; entities; the text nodes still contain them undecoded.
    return _removeRepeatedWhitespaceRegex.Replace(textWithLotsOfWhiteSpaces, " ").Replace("&nbsp;","").Replace("&copy;","");
}

public static string ExtractViewableText(HtmlNode node)
{
    StringBuilder sb = new StringBuilder();
    ExtractViewableTextHelper(sb, node);
    return sb.ToString();
}

private static void ExtractViewableTextHelper(StringBuilder sb, HtmlNode node)
{
    if (node.Name != "script" && node.Name != "style" && node.Name!="a")
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            AppendNodeText(sb, node);
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            ExtractViewableTextHelper(sb, child);
        }
    }
}

private static void AppendNodeText(StringBuilder sb, HtmlNode node)
{
    string text = ((HtmlTextNode)node).Text;
    if (!string.IsNullOrWhiteSpace(text))
    {
        sb.Append(Environment.NewLine + text);

        // If the last char isn't whitespace, add a space; otherwise words
        // separated only by tags would be run together.
        char last = text[text.Length - 1];
        if (last != '\t' && last != '\n' && last != ' ' && last != '\r')
        {
            sb.Append(" ");
        }
    }
}
Updated 21-Jan-13 21:00pm
Comments
[no name] 22-Jan-13 2:32am    
Please elaborate.
What type of controls are you using, and what have you tried?
ram salunke 22-Jan-13 2:38am    
I have updated my question.
Thanks..
Ask Dj 22-Jan-13 2:37am    
Is your goal to get the address from the contact page of a single website or of multiple websites?
ram salunke 22-Jan-13 2:38am    
Multiple websites.
Thanks..

1 solution

It is hard to say what is wrong, because this depends heavily on the content: the code is nearly hard-coded to it, and we don't see a sample of that content. Such code can be too fragile if something changes in the content, even if the change is purely decorative.

A less ad hoc approach would probably help. You would benefit a lot if you started from an HTML parser. If the HTML is well-formed XML, this is trivial, as .NET has more than enough XML libraries. But what if it is not? You may need a parser that can tolerate content that is not well-formed. I would advise looking at this one:
http://www.majestic12.co.uk/projects/html_parser.php
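
To illustrate what "less ad hoc" could mean once a parser has split the page into text blocks, here is a minimal sketch that scores each block for address-like features instead of relying on its position in the page. Everything in it is an assumption made for illustration and is not from the question: the class name `AddressBlockScorer`, the regular expressions, the keyword list, and the weights would all need tuning for the sites and countries you target.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: ranks text blocks by how "address-like" they look.
public static class AddressBlockScorer
{
    // Assumed patterns: US ZIP, six-digit PIN, or Canadian postal code.
    private static readonly Regex PostalCode =
        new Regex(@"\b(\d{5}(-\d{4})?|\d{6}|[A-Z]\d[A-Z] ?\d[A-Z]\d)\b");

    // Assumed pattern: a run of digits and separators that looks like a phone number.
    private static readonly Regex Phone =
        new Regex(@"\+?\d[\d\s().-]{7,}\d");

    // Assumed keyword list of words that often appear in street addresses.
    private static readonly Regex StreetWord =
        new Regex(@"\b(street|road|avenue|floor|suite|lane|boulevard)\b",
                  RegexOptions.IgnoreCase);

    // Returns a heuristic score; higher means more address-like.
    public static int Score(string block)
    {
        if (string.IsNullOrWhiteSpace(block)) return 0;

        int score = 0;
        if (PostalCode.IsMatch(block)) score += 2;
        if (StreetWord.IsMatch(block)) score += 2;
        if (Phone.IsMatch(block)) score += 1;

        // Address blocks tend to be short; penalize long paragraphs.
        if (block.Length > 300) score -= 2;
        return score;
    }

    // Returns the highest-scoring block, or null if no block scores above zero.
    public static string BestBlock(IEnumerable<string> blocks)
    {
        return blocks
            .Select(b => new { Text = b, Points = Score(b) })
            .Where(x => x.Points > 0)
            .OrderByDescending(x => x.Points)
            .Select(x => x.Text)
            .FirstOrDefault();
    }
}
```

With the text collected per block-level element rather than for the whole document, `AddressBlockScorer.BestBlock(blocks)` would pick the block that looks most like an address, and a decorative change to the page layout should matter less than it does with position-based extraction.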

—SA
 
 
Comments
[no name] 22-Jan-13 2:50am    
Sir is right.
Sergey Alexandrovich Kryukov 22-Jan-13 2:51am    
Thank you...
—SA
ram salunke 22-Jan-13 3:20am    
Thank you for helping me, sir.
Sergey Alexandrovich Kryukov 22-Jan-13 10:25am    
You are welcome.
Good luck, call again.
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


