How to strip off HTML content using C#

Question

C#

XML

HTML

Hi,

I want to extract only the text from an HTML document.
How can I do it? Please guide me.

Posted 13-Oct-13 3:19am

Yesudasan Moses

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Thomas Daniels · Accepted Answer · 2013-10-13T03:51:00

Solution 1

You can use the Html Agility Pack for this:

C#

// First add a reference to HtmlAgilityPack.dll, then put this line at the top of your code file.
using HtmlAgilityPack;

C#

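HtmlDocument doc = new HtmlDocument();
// You can also load an HTML file, using the Load() method.
doc.LoadHtml("<u>This is a</u> <strong>HTML</strong> <em>text</em>!");
// InnerText returns the text of all nodes with the tags removed.
// If the markup contains entities such as &amp;, pass the result through HtmlEntity.DeEntitize().
string textOnly = doc.DocumentNode.InnerText;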

You can also use a regular expression:

C#

string html = "This is a <strong id='anId'>HTML</strong> <em><u>text</u></em>.";
// [^>]+ also matches line breaks, so tags that span several lines are removed too.
string textOnly = System.Text.RegularExpressions.Regex.Replace(html, "<[^>]+>", "");
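
A one-line regular expression removes the tags but leaves entities such as &amp; encoded, and it keeps the contents of script and style blocks. Here is a minimal sketch of a slightly more thorough version that uses only the .NET base class library. The class name HtmlTextSketch and the method StripTags are made up for illustration, and the patterns are assumptions that cover common markup only; they are not a full HTML parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper, not part of any library.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Drop <script> and <style> elements together with their contents.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Remove the remaining tags.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Decode entities such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "This is a <strong id='anId'>HTML</strong> <em><u>text</u></em>.";
        Console.WriteLine(StripTags(html)); // prints: This is a HTML text.
    }
}
```

Because tags are replaced with an empty string, text from adjacent block elements (for example two paragraphs with no whitespace between them) runs together. For anything beyond simple markup, a real parser such as the Html Agility Pack is the safer choice.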

Hope this helps.

Posted 13-Oct-13 3:51am

Thomas Daniels

Updated 18-Oct-13 7:58am

v5

Comments

Yesudasan Moses 13-Oct-13 10:00am

Do I need such a big reference for this small task? :(
I expected some LINQ strip-off operations :'(

Thomas Daniels 13-Oct-13 10:01am

It's not required to use this, but it's the easiest way. You can also use a regular expression.
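
On the LINQ idea raised above: if the markup happens to be well-formed XML (XHTML), LINQ to XML can collect the text nodes without any extra reference. This is a sketch under that assumption only. The names LinqTextSketch and TextOnly are made up for illustration, and typical real-world HTML (unclosed tags such as <br>, entities such as &nbsp;) will make XElement.Parse throw.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class LinqTextSketch
{
    // Hypothetical helper; works only for well-formed XML/XHTML input.
    public static string TextOnly(string xhtmlFragment)
    {
        // Wrap the fragment in a dummy root so several top-level nodes still parse.
        // PreserveWhitespace keeps the spaces that sit between elements.
        XElement root = XElement.Parse("<root>" + xhtmlFragment + "</root>",
            LoadOptions.PreserveWhitespace);

        string[] skipped = { "script", "style" };

        // Concatenate every text node, skipping text inside script and style elements.
        return string.Concat(
            root.DescendantNodes()
                .OfType<XText>()
                .Where(t => t.Parent == null ||
                            !skipped.Contains(t.Parent.Name.LocalName, StringComparer.OrdinalIgnoreCase))
                .Select(t => t.Value));
    }

    public static void Main()
    {
        Console.WriteLine(TextOnly("<u>This is a</u> <strong>HTML</strong> <em>text</em>!"));
        // prints: This is a HTML text!
    }
}
```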

Yesudasan Moses 13-Oct-13 10:04am

Anyway, the solution is great <3
I would like to try it if the regex fails :)

Thomas Daniels 13-Oct-13 10:05am

Thank you!
I updated my answer to add a regular expression.

Yesudasan Moses 13-Oct-13 10:07am

Great, thanks so much... :)

Thomas Daniels 13-Oct-13 10:08am

You're welcome!