Hi,

I want to extract only the text from an HTML document.
How can I do it? Please guide me.

1 solution

You can use the Html Agility Pack for this:
C#
using HtmlAgilityPack; // first add a reference to HtmlAgilityPack.dll (or install the HtmlAgilityPack NuGet package), then put this line at the top of your code file

C#
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<u>This is an</u> <strong>HTML</strong> <em>text</em>!"); // you can also load an HTML file using the Load() method
string textOnly = doc.DocumentNode.InnerText;
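On a full page, InnerText also returns the contents of script and style elements, and it generally leaves entities such as &amp; undecoded. A minimal sketch of one way to handle that with LINQ follows. It assumes the HtmlAgilityPack package is referenced; the class and method names (HapText, ExtractText) are made up for illustration and are not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HapText
{
    // Hypothetical helper, not part of Html Agility Pack itself.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the unwanted nodes first, then remove them,
        // so the tree is not modified while it is being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling HapText.ExtractText(html) then replaces the direct use of doc.DocumentNode.InnerText.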

You can also use a regular expression:
C#
string html = "This is a <strong id='anId'>HTML</strong> <em><u>text</u></em>.";
string textOnly = System.Text.RegularExpressions.Regex.Replace(html, "<[^>]+>", ""); // [^>] also matches line breaks, so tags that span several lines are removed too
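A single Regex.Replace call keeps the contents of script and style elements and leaves entities such as &amp; encoded. If that matters, something like the following sketch may work using only the .NET base class library. The names HtmlText and StripTags are hypothetical, and regular expressions remain a rough approximation that malformed markup or a ">" inside an attribute value can break.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper name; a best-effort approach, not a real HTML parser.
    public static string StripTags(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string noScripts = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove every remaining tag.
        string noTags = Regex.Replace(noScripts, @"<[^>]+>", "");

        // Decode entities such as &amp; back into characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse the runs of whitespace left behind by the removed markup.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "This is <strong id='anId'>HTML</strong> <em><u>text</u></em> &amp; more.<script>var x = 1;</script>";
        Console.WriteLine(StripTags(html));
    }
}
```

If the regex version still fails on a real page, the Html Agility Pack route is the more robust of the two.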

Hope this helps.
 
Comments
Yesudasan Moses 13-Oct-13 10:00am    
Do I really need such a big reference for this small task? :(
I expected some LINQ strip-off operations :'(
Thomas Daniels 13-Oct-13 10:01am    
It's not required to use this, but it's the easiest way. You can also use a regular expression.
Yesudasan Moses 13-Oct-13 10:04am    
Anyway, the solution is great <3
I would like to try it if the regex fails :)
Thomas Daniels 13-Oct-13 10:05am    
Thank you!
I updated my answer to add a regular expression.
Yesudasan Moses 13-Oct-13 10:07am    
Great, thanks so much! :)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


