|
Hi, you can condense your tag stripping down to one simple function that works for any tag.
static string StripTag(string input,string tag)
{
StringBuilder sb = new StringBuilder();
sb.Append(@"<\s*");
sb.Append(tag);
sb.Append(@"([^>])*>.*?<\s*/\s*");
sb.Append(tag);
sb.Append(@"\s*>");
return Regex.Replace(input, sb.ToString(), string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
You don't have to remove the attributes etc. first. Your code then becomes:
result = StripTag(result, "head");
result = StripTag(result, "script");
...etc.
Note: you need RegexOptions.Singleline if there are any CR/LFs in your text. Also, the tag stripping won't work in every case: if the embedded text itself contains a tag that closes the pair (mainly in script blocks), the match will end early.
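To make the two notes above concrete, here is a minimal sketch of the same idea. The class name and the sample strings are made up for illustration, and the Regex.Escape call is an added precaution that is not in the function above.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagStripDemo
{
    // Same approach as StripTag above. Regex.Escape is an assumption added
    // here so that a tag name is never interpreted as regex syntax.
    public static string StripTag(string input, string tag)
    {
        string name = Regex.Escape(tag);
        string pattern = @"<\s*" + name + @"[^>]*>.*?<\s*/\s*" + name + @"\s*>";
        return Regex.Replace(input, pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Singleline lets "." cross the CR/LF pairs inside the script block.
        string html = "<p>before</p><SCRIPT type=\"text/javascript\">\r\nvar x = 1;\r\n</script><p>after</p>";
        Console.WriteLine(StripTag(html, "script"));   // <p>before</p><p>after</p>

        // The caveat: a closing tag inside the script text ends the match early.
        string tricky = "<script>document.write('</script>');</script>tail";
        Console.WriteLine(StripTag(tricky, "script")); // ');</script>tail
    }
}
```

The second call shows why this is only good enough for reasonably well-behaved pages: the leftover fragment is exactly the "tags which close the pair" case.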
cheers,
Rob
http://rob.runtothehills.org.
--
Rob Hill
Information Analyst
EDS Sheffield,
UK
-- modified at 8:35 Tuesday 30th January, 2007
|
|
|
|
|
Good point, but for some tags the useful text within the tags will go as well. With a slight correction this is a brilliant solution:
static string StripTag(string input, string tag)
{
string result;
result = Regex.Replace(input, @"<\s*" + tag + "([^>])*>", string.Empty, RegexOptions.IgnoreCase);
result = Regex.Replace(result, @"])*>", string.Empty, RegexOptions.IgnoreCase);
return result;
}
|
|
|
|
|
Oops, the code in the previous message does not display correctly. This should display fine:
static string StripTag(string input, string tag)
{
string result;
result = Regex.Replace(input, @"<\s*" + tag + "([^>])*>", string.Empty, RegexOptions.IgnoreCase);
result = Regex.Replace(result, @"</\s*" + tag + "([^>])*>", string.Empty, RegexOptions.IgnoreCase);
return result;
}
|
|
|
|
|
(minor points) -
I thought the idea was to strip the tag and its contents. For example, your modification, when used to strip a script tag, will leave all the JavaScript code intact. I don't think this is what you're after if all you want is the text from the page.
Also, since strings are immutable, we should take care to concatenate them as little as possible, which is why I used the StringBuilder.
If you're after stripping only the tag itself, then you can use an even simpler regex -
@"<[^>]*>"
and the majority of those other replaces can be compacted down to a loop, thus:
foreach (string regex in new string[] { @"(\t\t)+", @"<", @">", @">", @"<[^>]*>", })
Replace( regex, string.Empty);
You can then modify what the Replace function substitutes to handle the \n combinations too.
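As a rough sketch of that loop with a replacement per pattern, something like the following should do. The rule list, class name and sample input are hypothetical and only meant to show the shape; the real list would come from the article's replaces.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternLoopDemo
{
    // Hypothetical { pattern, replacement } pairs. Order matters, because
    // each pass works on the result of the previous one.
    static readonly string[][] Rules =
    {
        new[] { @"<\s*br\s*/?\s*>", "\r" },  // <br> becomes a line break
        new[] { @"<[^>]*>", "" },            // any remaining tag is dropped
        new[] { @"\t+", "" },                // runs of tabs are dropped
    };

    public static string ApplyRules(string input)
    {
        string result = input;
        foreach (string[] rule in Rules)
        {
            result = Regex.Replace(result, rule[0], rule[1], RegexOptions.IgnoreCase);
        }
        return result;
    }

    public static void Main()
    {
        string text = ApplyRules("line one<BR>line <b>two</b>\t\t");
        // Show the carriage return visibly instead of letting the console act on it.
        Console.WriteLine(text.Replace("\r", "\\r"));   // line one\rline two
    }
}
```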
I think you'll find this gives a large reduction in the amount of code you've posted here.
I'll be doing an article that uses these techniques to strip acronyms and definitions out of a web page (or any other text) and presents them in a funky interface using WPF.
Keep an eye out!
I should point out that I'm stripping out more than you'd want, such as punctuation that might stop me finding the acronyms, but the principles can be applied to what you're doing too!
Enjoy!
--
Rob Hill
Information Analyst
Agile Alliance Member
EDS Sheffield,
UK
|
|
|
|
|
It allows me to finish a task my boss assigned me in 30 minutes.
Keep up your great work!!!
|
|
|
|
|
Do these functions do the same as System.Web.HttpUtility.HtmlDecode(str)?
Cheers,
Les
|
|
|
|
|
No, HttpUtility lets you embed special characters into HTML, but it does not extract the plain text. For example, if you want to display the "<" character in your HTML, you can run HtmlEncode on the string containing it and it will output something like "<hn" instead of "<". Use HtmlEncode/HtmlDecode to make sure the HTML parser does not trip over special characters.
|
|
|
|
|
Here is the perfect example: I was trying to enter "& l t" (minus spaces), but the browser outputs "<" because "& l t" is the HTML-encoded "<" character.
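A small sketch of the difference, for anyone who wants to try it. It uses System.Net.WebUtility rather than System.Web.HttpUtility, which is an assumption made here so that the snippet needs no reference to System.Web; the class name and sample strings are made up.

```csharp
using System;
using System.Net;

static class EncodeDemo
{
    public static void Main()
    {
        // Encoding turns markup-significant characters into entities.
        string encoded = WebUtility.HtmlEncode("if (a < b) { }");
        Console.WriteLine(encoded);   // if (a &lt; b) { }

        // Decoding turns entities back into characters. It does not remove
        // tags, so it is not a substitute for stripping the HTML.
        string decoded = WebUtility.HtmlDecode("&lt;b&gt;bold&lt;/b&gt; &amp; more");
        Console.WriteLine(decoded);   // <b>bold</b> & more
    }
}
```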
|
|
|
|
|
This has been very helpful.
One ought to add link detection (writing out the href attribute to preserve links rather than the text of the link, or both).
I can give a regex that detects any URL in any text, and which is easily modified to detect only those in hrefs:
public static readonly Regex URLRegex = new Regex(@"\b(http\://|https\://|ftp\://|mailto\:|www\.)([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)?((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.[a-zA-Z]+)(\:[0-9]+)?([a-zA-Z0-9\.\,\?\'\\/\+&%\$#\=~_\-@]*)*\b", RegexOptions.Multiline|RegexOptions.CultureInvariant|RegexOptions.IgnoreCase|RegexOptions.Compiled);
Is it right to replace the lines that shorten the line breaks and tabs to groups of 2 and 4 with this more efficient code?
while (result.IndexOf("\r\r\r") >= 0)
result = result.Replace("\r\r\r", "\r\r");
while (result.IndexOf("\t\t\t\t\t") >= 0)
result = result.Replace("\t\t\t\t\t", "\t\t\t\t");
cheers,
matt
|
|
|
|
|
This regexp "(<script>).*(</script>)"
must be replaced by "(<script>).*?(</script>)".
If not, the engine will match the largest <TAG> </TAG> combination and could delete the whole HTML page.
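A quick way to see the difference between the greedy and the lazy form is a page with two script blocks. This is only a sketch; the class name, helper and sample markup are made up for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyLazyDemo
{
    // Removes every match of the given pattern from the input.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<script>a()</script><p>keep me</p><script>b()</script>";

        // Greedy: one match from the first <script> to the last </script>.
        Console.WriteLine("[" + Strip(html, "(<script>).*(</script>)") + "]");   // []

        // Lazy: each script block is matched on its own.
        Console.WriteLine("[" + Strip(html, "(<script>).*?(</script>)") + "]");  // [<p>keep me</p>]
    }
}
```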
|
|
|
|
|
Public Function StripHTML(ByVal Source As String) As String
Try
Dim result As String
' Remove HTML Development formatting
result = Source.Replace("\r", " ") ' Replace line breaks with space because browsers inserts space
result = result.Replace("\n", " ") ' Replace line breaks with space because browsers inserts space
result = result.Replace("\t", String.Empty) ' Remove step-formatting
result = System.Text.RegularExpressions.Regex.Replace(result, "( )+", " ") ' Remove repeating speces becuase browsers ignore them
' Remove the header (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*head([^>])*>", "<head>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(<( )*(/)( )*head( )*>)", "</head>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(<head>).*(</head>)", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' remove all scripts (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*script([^>])*>", "<script>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(<( )*(/)( )*script( )*>)", "</script>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
'result = System.Text.RegularExpressions.Regex.Replace(result, @"(<script>)([^(<script>\.</script>)])*(</script>)",string.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
result = System.Text.RegularExpressions.Regex.Replace(result, "(<script>).*(</script>)", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' remove all styles (prepare first by clearing attributes)
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*style([^>])*>", "<style>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(<( )*(/)( )*style( )*>)", "</style>", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(<style>).*(</style>)", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
'insert tabs in spaces of <td> tags
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*td([^>])*>", "\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' insert line breaks in places of <BR> and <LI> tags
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*br( )*>", "\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*li( )*>", "\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' insert line paragraphs (double line breaks) in place if <P>, <DIV> and <TR> tags
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*div([^>])*>", "\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*tr([^>])*>", "\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "<( )*p([^>])*>", "\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' Remove remaining tags like <a>, links, images, comments etc - anything thats enclosed inside < >
result = System.Text.RegularExpressions.Regex.Replace(result, "<[^>]*>", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' replace special characters:
result = System.Text.RegularExpressions.Regex.Replace(result, "&nbsp;", " ", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&bull;", " * ", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&lsaquo;", "<", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&rsaquo;", ">", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&trade;", "(tm)", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&frasl;", "/", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&lt;", "<", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&gt;", ">", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&copy;", "(c)", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "&reg;", "(r)", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' Remove all others. More can be added, see http://hotwired.lycos.com/webmonkey/reference/special_characters/
result = System.Text.RegularExpressions.Regex.Replace(result, "&(.{2,6});", String.Empty, System.Text.RegularExpressions.RegexOptions.IgnoreCase)
' make line breaking consistent
result = result.Replace("\n", "\r")
' Remove extra line breaks and tabs: replace over 2 breaks with 2 and over 4 tabs with 4.
' Prepare first to remove any whitespaces inbetween the escaped characters and remove redundant tabs inbetween linebreaks
result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)( )+(\r)", "\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(\t)( )+(\t)", "\t\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(\t)( )+(\r)", "\t\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)( )+(\t)", "\r\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase)
result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)(\t)+(\r)", "\r\r", System.Text.RegularExpressions.RegexOptions.IgnoreCase) ' Remove redundant tabs
result = System.Text.RegularExpressions.Regex.Replace(result, "(\r)(\t)+", "\r\t", System.Text.RegularExpressions.RegexOptions.IgnoreCase) ' Remove multible tabs followind a linebreak with just one tab
Dim breaks As String = "\r\r\r" ' Initial replacement target string for linebreaks
Dim tabs As String = "\t\t\t\t\t" ' Initial replacement target string for tabs
Dim int As Integer
For int = 0 To result.Length
result = result.Replace(breaks, "\r\r")
result = result.Replace(tabs, "\t\t\t\t")
breaks = breaks + "\r"
tabs = tabs + "\t"
Next
' Thats it.
Return result
Catch ex As Exception
SendMail("help@sexydepo.com", "debug@sexydepo.com", "ERROR: in clsUtils.StripHTML", ex.ToString)
End Try
End Function
Joshua
joshua@joshuaz.com
http://www.joshuaz.com
|
|
|
|
|
Joshua, I'm not sure your translation is complete. For example, consider the first few lines:
' Remove HTML Development formatting
result = Source.Replace("\r", " ") ' Replace line breaks with space because browsers inserts space
result = result.Replace("\n", " ") ' Replace line breaks with space because browsers inserts space
result = result.Replace("\t", String.Empty) ' Remove step-formatting
In C#, "\r" is an escape sequence which indicates a carriage return, "\n" is a line feed, and so on. VB doesn't support these escape sequences; it would instead look for the literal characters "\r", not a carriage return. So in VB you'd have to change this to something like:
result = Source.Replace(vbCr, " ")
result = result.Replace(vbLf, " ")
result = result.Replace(vbTab, " ")
There is a lot of this kind of stuff near the end of the function as well.
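For anyone more at home in C#, the effect can be reproduced with a verbatim string, which skips escape processing much as a VB string literal does. This is only an analogy, sketched with made-up names and sample text; it is not VB code.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    public static void Main()
    {
        string text = "a\rb";   // a, carriage return, b

        Console.WriteLine("\r".Length);    // 1: a single carriage return
        Console.WriteLine(@"\r".Length);   // 2: a backslash followed by 'r'

        // String.Replace looks for the two literal characters and finds nothing.
        Console.WriteLine(text.Replace(@"\r", "-") == text);         // True

        // As a regex pattern, backslash-r is interpreted by the regex engine
        // itself, so the pattern still matches a carriage return.
        Console.WriteLine(Regex.Replace(text, @"\r", "-") == "a-b"); // True
    }
}
```

So in a direct translation, the plain String.Replace calls and the replacement strings are the places that need vbCr, vbLf and vbTab, while "\r" inside a regex pattern keeps working.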
Not a criticism, just wanted to let you know.
-Todd Davis (toddhd@gmail.com)
|
|
|
|
|
Does anybody have a complete VB conversion?
Would be greatly appreciated!
David
|
|
|
|
|
This may be a little late, but I thought I should comment. First, to do this in VB, you can pretty much cut and paste the code paceman has supplied in his excellent article. Only a few minor changes are required.
Quick and dirty (untested), I would suggest changing only these parts:

'Remove HTML Development formatting
result = source.Replace(vbCr, " ")
result = result.Replace(vbLf, " ")
result = result.Replace(vbTab, " ")
result = Regex.Replace(result, "( )+", " ") 'Remove repeating spaces because browsers ignore them
...

'insert tabs in spaces of <td> tags
result = Regex.Replace(result, "<( )*td([^>])*>", vbTab, RegexOptions.IgnoreCase)

'insert line breaks in places of <BR> and <LI> tags
result = Regex.Replace(result, "<( )*br( )*>", vbCr, RegexOptions.IgnoreCase)
result = Regex.Replace(result, "<( )*li( )*>", vbCr, RegexOptions.IgnoreCase)

'insert line paragraphs (double line breaks) in place if <P>, <DIV> and <TR> tags
result = Regex.Replace(result, "<( )*div([^>])*>", vbCr + vbCr, RegexOptions.IgnoreCase)
result = Regex.Replace(result, "<( )*tr([^>])*>", vbCr + vbCr, RegexOptions.IgnoreCase)
result = Regex.Replace(result, "<( )*p([^>])*>", vbCr + vbCr, RegexOptions.IgnoreCase)
.....
That should take care of converting the first part to VB.
Then paceman adds a For-Next loop to clean up what would now be the surplus vbCr's in VB. If you are going to all the trouble of learning and using regex for your text handling, it is a real shame to resort to For-Next loops for something so trivial. Perhaps I am missing something, but I don't see why he is doing that. But if you have the computing power to waste, go ahead.
In my humble opinion, it would probably be better to just stick with the regex logic: convert all breaks to vbCr, remove duplicate vbCr's, and remove double white spaces.

'make line breaking consistent
result = result.Replace(vbCrLf, vbCr) 'worth remembering this little guy, too.
result = result.Replace(vbLf, vbCr)
'remove all breaks
result = Regex.Replace(result, vbCr, " ")
'remove double breaks
'result = Regex.Replace(result, vbCr + vbCr, vbCr)
result = Regex.Replace(result, vbTab, " ")
'remove double tabs
'result = Regex.Replace(result, vbTab + vbTab, vbTab)
'remove double-spaces
result = Regex.Replace(result, "(\s\s)", " ")
Return result
Of course, you need to decide how much formatting you are going to leave (hence the comments - play around with that). Good luck!
Jokva
|
|
|
|
|
I have to admit that using loops for removing line breaks is pretty silly. There is no reason to use loops instead of regexes. Thanks for noting.
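For what it's worth, a minimal sketch of the loop-free version using regex quantifiers, keeping the article's limits of 2 line breaks and 4 tabs. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CollapseDemo
{
    public static string CollapseRuns(string text)
    {
        // Three or more carriage returns in a row become exactly two.
        text = Regex.Replace(text, "\r{3,}", "\r\r");
        // Five or more tabs in a row become exactly four.
        text = Regex.Replace(text, "\t{5,}", "\t\t\t\t");
        return text;
    }

    public static void Main()
    {
        string collapsed = CollapseRuns("a\r\r\r\r\rb\t\t\t\t\t\t\tc");
        // Print the control characters visibly.
        Console.WriteLine(collapsed.Replace("\r", "\\r").Replace("\t", "\\t"));
        // a\r\rb\t\t\t\tc
    }
}
```

Each call handles a run of any length in a single pass, so neither the For loop nor the while loops are needed.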
|
|
|
|
|