|
This has made my day. There are 5 billion articles and samples about translating XML->HTML, etc. but no one wants to touch HTML->something freaking usable.
Thanks!
|
|
|
|
|
This is a good article. I'd give it 5 stars.
We can't stop asking "WHY!!"
|
|
|
|
|
The FilterIndex returned by the common file dialog is 1-based, yielding indices 1 and 2. The code using the FilterIndex checks for 0 or 1. The result is that XHTML is always exported.
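A minimal sketch of how the 1-based index could be mapped explicitly. The `ExportFormat` enum, the `ExportFormatSelector` class and the order of the filter entries (HTML first, XHTML second) are assumptions for illustration; they are not taken from the article's code.

```csharp
using System;

// Hypothetical names: the article's own code may model the export format differently.
public enum ExportFormat { Html, Xhtml }

public static class ExportFormatSelector
{
    // filterIndex is 1-based, as FileDialog.FilterIndex reports it.
    // Assumption: filter entry 1 is HTML and filter entry 2 is XHTML.
    public static ExportFormat FromFilterIndex(int filterIndex)
    {
        switch (filterIndex)
        {
            case 1: return ExportFormat.Html;
            case 2: return ExportFormat.Xhtml;
            default:
                // Fail loudly instead of silently falling through to one format.
                throw new ArgumentOutOfRangeException(nameof(filterIndex), "Expected 1 or 2.");
        }
    }
}
```

Throwing on an unexpected index makes an off-by-one like this visible immediately, rather than quietly exporting the wrong format.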
|
|
|
|
|
I found some code that confuses me.
I'm not sure whether this code is correct or not.
In HtmlParser.cs:
code line 272:
original: if( i + 4 < input.Length && input.Substring( i , 4 ).Equals( "<!--" ) )
suggestion: if( i + 3 < input.Length && input.Substring( i , 4 ).Equals( "<!--" ) )
code line 344:
original: if( i + 2 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
suggestion: if( i + 1 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
code line 352:
original: i += 3;
suggestion: i += 1;
code line 542:
original: if( i+2 < input.Length && input.Substring( i , 2 ).Equals( "<" ) )
suggestion: if( i+1 < input.Length && input.Substring( i , 2 ).Equals( "</" ) )
code line 588:
original: if( i+1<input.Length && input.Substring( i , 1 ).Equals( "/>" ) )
suggestion: if( i+1<input.Length && input.Substring( i , 2 ).Equals( "/>" ) )
code line 711:
In some cases, the value of an attribute could be "images/about_logo.gif", so we can't use "/" to decide the end of the value string. (This case can be found in http://www.google.com/intl/en/about.html)
original: while( i<input.Length && input.Substring( i , 1 ).IndexOfAny( " \r\n\t/>".ToCharArray() ) == -1 )
suggestion: while(i < input.Length && input.Substring(i , 1).IndexOfAny(" \r\n\t>".ToCharArray()) == -1 && i + 1 < input.Length && input.Substring(i , 2).Equals("/>") == false) ++ i;
I hope this information helps the project. =^_^=
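Most of the suggestions above are off-by-one corrections to a bounds check in front of a `Substring` call. A minimal sketch of one way to keep the bound and the token length in step is shown below; `TokenMatch.StartsWithAt` is a hypothetical helper, not something that exists in HtmlParser.cs.

```csharp
using System;

public static class TokenMatch
{
    // Hypothetical helper: true when 'token' occurs in 'input' exactly at index i.
    // The bound is derived from token.Length, so it cannot drift out of step
    // with the number of characters being compared.
    public static bool StartsWithAt(string input, int i, string token)
    {
        return i >= 0
            && i + token.Length <= input.Length
            && string.CompareOrdinal(input, i, token, 0, token.Length) == 0;
    }
}
```

With this, a check such as `StartsWithAt(input, i, "<!--")` also matches a token that ends exactly at the end of the input, which a test of the form `i + 4 < input.Length` would miss.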
|
|
|
|
|
I just downloaded the code and noticed that when parsing '', the parser thinks the node has two attributes. One attribute is named 'width' with '10' as its value; the other has a blank name and null as its value.
Vincent
|
|
|
|
|
if you have a page with the following meta tag for special characters
<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-1">
then later use them “like these slanted quotes”
the parser removes the slanted-quote characters completely
would be happy to fix, but am at a loss as to where to look - i don't see an obvious place in the parser code for this...
--S
|
|
|
|
|
my bad - problem was reading file, not parser.
to read an html/asp file using special characters where the file was not created as a unicode/utf-8 file [e.g. the file was created in Visual InterDev], use the Encoding.Default parameter when creating the StreamReader, e.g.
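// Encoding.Default is the system's ANSI code page on .NET Framework
using (FileStream fs = File.OpenRead(origFname))
using (StreamReader sro = new StreamReader(fs, System.Text.Encoding.Default))
{
    // the using blocks close the reader and the stream even if ReadToEnd throws
    textBox1.Text = sro.ReadToEnd();
}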
thanks to Andy for helping me track this down
--S
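A minimal sketch of the same idea with the encoding named explicitly rather than taken from the machine's default. `LegacyTextReader` is a hypothetical helper name, and the choice of ISO-8859-1 in the usage note is an assumption based on the charset in the meta tag quoted earlier.

```csharp
using System.IO;
using System.Text;

public static class LegacyTextReader
{
    // Hypothetical helper: decodes the whole stream with the encoding the caller names,
    // instead of relying on StreamReader's UTF-8 default.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

For example, `LegacyTextReader.ReadAll(File.OpenRead(path), Encoding.GetEncoding("ISO-8859-1"))` decodes every byte as one character, whereas reading the same bytes as UTF-8 replaces byte sequences that are not valid UTF-8 with a substitute character.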
|
|
|
|
|
very nice library; one minor issue - it doesn't seem to handle server-side includes or server script blocks correctly, e.g.
<!-- #INCLUDE FILE="..\Includes\stdvars.asp" -->
<%
Dim x, y
'blah
if (x > y) blah...
%>
%>
<html>
...etc.
becomes
<% Dim x, y 'blah if (x />
followed by a text node; the SSI is dropped completely
any chance of this being fixed soon, or should I attempt code surgery?
thanks!
--S
|
|
|
|
|
Thanks to Andy for a link to the current version, which handles #INCLUDEs already, and for instructions on how to modify the code for server-side script block handling.
there is an immediate workaround with the current version: replace "<%" and "%>" with "<!-- %" and "% -->" before parsing, and the parser pulls out the server-side script as a comment block, which is good enough for my purposes
am taking the liberty of posting the link to the current code: http://powney.demon.co.uk/milhtml.html
--S
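The workaround above could be sketched as a small pre-processing step. `ServerScriptWorkaround.HideServerScript` is a hypothetical name; the parser itself is not involved, since this only rewrites the text that is handed to it.

```csharp
public static class ServerScriptWorkaround
{
    // Hypothetical helper: turns <% ... %> blocks into HTML comments
    // so the parser reports them as comment nodes.
    public static string HideServerScript(string html)
    {
        return html.Replace("<%", "<!-- %").Replace("%>", "% -->");
    }
}
```

This is a plain text substitution, so a script block that itself contains "-->" would end the comment early; it is a stopgap, not a substitute for real script-block handling.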
|
|
|
|
|
Try parsing the HTML at www.dn.se: your parser fails to correctly identify the <html> tag that follows directly after the <!DOCTYPE> tag. I get "html>" as an HtmlText node.
|
|
|
|
|
There's a bug that results in a parse-error in some <a href=...>. In fact it will cause a lot of other things to fail also, but I discovered it by following a link from google.
Google's main page contains this link:
<a href=/advanced_search?hl=en>Advanced Search</a>
The parser gets this totally wrong, ending up thinking there are 3 attributes:
attrib-name "href" maps to attrib-value "";
attrib-name "" maps to attrib-value "advanced_search?hl";
attrib-name "hl" maps to attrib-value "en".
(yes, that middle attrib-name really is an empty string).
Based on my cursory examination of the code, I fear this might be hard to fix. I think the root problem is that the code assumes it can tokenize HTML independently of the parsing phase. This example (from one of the world's most popular web sites) shows, I believe, that attribute values must be tokenized differently from other things, which means that the tokenization is context sensitive. I hope I am wrong.
I wish you the best of success,
Marshall
PS: I have some ideas for how you might repair this. I'd be willing to spend a few minutes corresponding with you off-line if you are interested.
|
|
|
|
|
|
I had the same problem today, and I think I've found a quick fix: when searching for the end of the value, instead of searching for />, search just for >, and afterwards, if it was a />, roll back the index.
regards.
line 718 in htmlparser.cs
//***** original:
//while (i < input.Length && input.Substring(i, 1).IndexOfAny(" \r\n\t/>".ToCharArray()) == -1)
//{
// i++;
//}
//***** new:
//do not search for /, if it's a /> we'll fix it later
while (i < input.Length && input.Substring(i, 1).IndexOfAny(" \r\n\t>".ToCharArray()) == -1)
{
i++;
}
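// guard the bounds: i can equal input.Length if the value runs to the end of the input
if (i > 0 && i < input.Length && input.Substring(i - 1, 2) == "/>")
i--;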
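A self-contained sketch of the same fix, written as a function so it can be tried outside the parser. `UnquotedValueScanner` and its methods are hypothetical names and are not part of HtmlParser.cs.

```csharp
using System;

public static class UnquotedValueScanner
{
    // Hypothetical helper: returns the index just past an unquoted attribute value
    // that begins at 'start'. A '/' ends the value only when it is the last
    // character before '>', i.e. part of a self-closing "/>".
    public static int FindValueEnd(string input, int start)
    {
        int i = start;
        while (i < input.Length && " \r\n\t>".IndexOf(input[i]) == -1)
        {
            i++;
        }
        // If the scan stopped on the '>' of "/>", give the '/' back to the tag.
        if (i > start && i < input.Length && input[i] == '>' && input[i - 1] == '/')
        {
            i--;
        }
        return i;
    }

    public static string ReadValue(string input, int start)
    {
        return input.Substring(start, FindValueEnd(input, start) - start);
    }
}
```

For the Google link, calling `ReadValue` at the character after `href=` yields the whole of `/advanced_search?hl=en` as one value. One trade-off of this rule: an unquoted value that really ends in a slash directly before `>` is read as a self-closing tag.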
|
|
|
|
|
Do you know if there is a VB/VB.NET version of the HTML parser?
Regards,
unruledboy@hotmail.com
|
|
|
|
|
Reviewing the code of HtmlEncoder.cs: you can use a shortcut to avoid parsing each entity literal (&xxx;), existing or future, by hand.
Replacing lines 828 to 1603 (a big "if" block!) with the following code*:
string encodedLiteral = System.Web.HttpUtility.HtmlDecode (token.ToString ());
output.Append (encodedLiteral);
will avoid the "manual" parsing.
I think it is better to replace the functions DecodeValue and EncodeValue entirely with the equivalent HtmlDecode and HtmlEncode functions of the HttpUtility class.
Check out the HttpUtility class; it has a lot of nice functions!
Congratulations on the code...
PS:
*(you need to add a reference to System.Web to the project for the HttpUtility class)
Giralt
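A minimal sketch of delegating entity handling to the framework. It uses `System.Net.WebUtility` rather than the `HttpUtility` class named above, which is a substitution made here so the example needs no reference to System.Web; `EntityCodec` is a hypothetical wrapper name.

```csharp
using System.Net;

public static class EntityCodec
{
    // Hypothetical wrappers: stand-ins for hand-written DecodeValue / EncodeValue routines.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }

    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }
}
```

For instance, `EntityCodec.Decode("&lt;b&gt;a &amp; b&lt;/b&gt;")` returns `<b>a & b</b>`, and decoding an encoded string returns the original text.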
|
|
|
|
|
This is a great article. Your tip makes it perfect.
Thanks to all you guys.
|
|
|
|
|
Better to use your own version, because the HttpUtility class has some bugs with full-width/half-width characters, which show up when you use a character encoding other than UTF-8.
|
|
|
|
|
It does not properly parse double-byte characters such as Simplified Chinese.
It simply encodes these characters as something like "кϢ"
|
|
|
|
|
Nice piece of work.
I was about to make one myself, but this is great.;)
Is there a license on this code?
Aaron Eldreth
TheCollective4.com
|
|
|
|
|
I recently published a very lightweight HTML parser written in Java that also specialises in analysing badly formed HTML and reproducing it verbatim with any desired changes.
I wanted to run it through the JLCA to produce a .NET version but found I need VS.NET 2003 (to convert the Java collection classes), which I am only getting in a couple of weeks' time. If anyone is interested in this, you can monitor the package on the SourceForge project page to receive announcements. I'm not sure how efficient the code that JLCA produces is, but a simple library like this is a good candidate for finding out.
http://sourceforge.net/projects/jerichohtml/
The approach differs from this one in that it does NOT produce a DOM object in memory. Each method call analyses the source text directly (but using internal caching for efficiency), which allows you to see an exact representation of the document, even with overlapping elements.
|
|
|
|
|
As Jonathan points out in his posting, SGMLReader posted by Microsoft on www.gotdotnet.com creates well formed HTML. It also does it without generating a DOM. Because it inherits from the Reader class, you can use it wherever you would use a reader.
Regards
Bill Seddon
|
|
|
|
|
Have you tried running it through ikvmc?
|
|
|
|
|
I've been waiting for someone to try this for a while. Good work! It's especially difficult to parse information that's not well-formed.
With Microsoft and others being committed to the future of Virtual Execution Systems (i.e., .NET), there will be a strong need for software written purely on the CLI. Yup, that's right folks. All those wonderful C/C++ libraries will need to be rewritten. In the not too distant future, the .NET runtime will not sit on top of the Windows/Linux/Mac operating system, it will BE the operating system.
.NET programmers shouldn't be afraid to "reinvent the wheel" because the wheel still needs a lot of work.
|
|
|
|
|
You are not forced to follow anything MS proposes.
If it really goes the way you are predicting here, there is always the possibility to switch to ReactOS.
This is pure WIN32 and will provide compatibility with Windows NT if MS really does drop it (however, I strongly doubt this).
Martin Fuchs
martin-fuchs@gmx.net
|
|
|
|
|
Whoa... I'm not trying to start any debates about legacy windows code versus the .Net initiative, Microsoft’s market share, or Windows versus Linux versus Mac. I love each platform and I’m not biased in any way. I’m simply speaking as an experienced programmer and someone who knows the economics of computers, not just the hobby of them.
Obviously I’m enthusiastic about the platform. In a more appropriate forum I would be happy to explain why most of the industry sees Virtual Execution Systems as the future, but that is not appropriate for this thread.
I’m just giving kudos to the man’s hard work on this project and reassuring all .Net programmers that their hard work will be noticed.
|
|
|
|