|
Hmm, I'm not too sure here, but I don't think that malformed, sloppy, redundant, badly written, newbie HTML code is their fault. Any parser can handle 'good' code. A parser ought to cope with code of this quality without throwing an error.
|
|
|
|
|
You are correct, this is clearly a bug. I found it in the GetTokens method within the HtmlParser.cs file. I have a fix if anyone needs it.
I changed the code starting at line # 711 to read as follows:
<code>
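// Scan to the end of the unquoted attribute value. '/' is no longer a
// terminator, so paths such as href=/a/b.html are kept whole.
while ((i < input.Length) && (" \r\n\t>".IndexOf(input[i]) == -1))
{
    i++;
}
int dataLength = (i - value_start);
// In a self-closing tag the trailing '/' belongs to the tag, not the value.
// The bounds checks avoid an exception when the input ends inside the value
// or when the value is empty.
if ((i < input.Length) && (i > value_start) &&
    (input[i] == '>') && (input[i - 1] == '/'))
{
    dataLength--;
}
tokens.Add(input.Substring(value_start, dataLength));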
</code>
The original search list, " \r\n\t/>", triggered a false end of data on the forward slashes in the href path. I considered removing the \r\n too, because I didn't think those characters should end the data, but I wasn't sure, so I left them in for now.
modified on Thursday, July 16, 2009 6:35 PM
|
|
|
|
|
Natural Cause wrote: You don't wrap any of the values in double quotes
Actually that's perfectly valid HTML. It's invalid in XHTML, and although I prefer quoting, it's still valid.
|
|
|
|
|
I have successfully used your library in a project to query books by ISBN and collect catalogue data on them from various sites. Your HTML parser was by far the best of the five or so parser libraries that I tried, but I still missed some features in the API. I made some changes to my copy of the source; perhaps you would be willing to consider them?
My changes were:
1- change abstract class HtmlEncoder from internal to public, so that I can decode any html text fragment myself.
2- bugfix in decoding the &#xNN; hexadecimal HTML escape in HtmlEncoder.cs (see message earlier today)
3- Introduce extended matching methods for attribute values:
public enum SearchMethod
{
    ExactMatch,      // default
    ValueBeginsWith, // uses .StartsWith to match beginning of attribute value
    ValueContains    // uses .IndexOf to match any part of a value
}
I made an extra overload to FindByAttributeNameValue that has a searchMethod parameter to incorporate this.
Usage example: Consider an amazon.com "product overview" for a book. Authors are contained in A elements where the href attribute contains the substring "&field-author=". Having the SearchMethod parameter allows me to directly find only the nodes that I need:
HtmlNodeCollection nc = htmlDoc.FindByAttributeNameValue("href", "&field-author=", true, SearchMethod.ValueContains);
4- added an extra method FindByNameAttributeNameValue to match both node name and an attribute name/value pair. The example above can be made more efficient by also specifying the node name a:
HtmlNodeCollection nc = htmlDoc.FindByNameAttributeNameValue("a", "href", "&field-author=", true, SearchMethod.ValueContains);
This will return the same collection, but significantly faster, because it no longer has to iterate through every attribute of each node in the HTML document, only through the small subset of a nodes.
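The three match modes described above can be sketched as a single comparison helper. This is only an illustration: the AttributeMatcher class, its Matches method, and the ignoreCase parameter are hypothetical names and are not part of the library or of the modified source described in this post.

```csharp
using System;

public enum SearchMethod
{
    ExactMatch,      // default
    ValueBeginsWith, // uses StartsWith to match the beginning of the value
    ValueContains    // uses IndexOf to match any part of the value
}

// Hypothetical helper, not part of the MIL HTML Parser API.
public static class AttributeMatcher
{
    // Returns true when attributeValue matches searchValue under the given method.
    // ignoreCase is an assumption made for this sketch only.
    public static bool Matches(string attributeValue, string searchValue,
                               bool ignoreCase, SearchMethod method)
    {
        if (attributeValue == null || searchValue == null)
        {
            return false;
        }

        StringComparison comparison = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        switch (method)
        {
            case SearchMethod.ValueBeginsWith:
                return attributeValue.StartsWith(searchValue, comparison);
            case SearchMethod.ValueContains:
                return attributeValue.IndexOf(searchValue, comparison) >= 0;
            default:
                return string.Equals(attributeValue, searchValue, comparison);
        }
    }
}
```

A search method would call such a helper once per candidate attribute, so filtering on the node name first keeps the number of calls small.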
Best regards,
Berend Engelbrecht
|
|
|
|
|
I was using HtmlDocument.Create(...) against HTML returned from the MSN search site and kept getting a FormatException. I managed to trace it to this call:
int v = int.Parse( token.ToString().Substring(2,token.Length-3) );
This is line 831 in HtmlEncoder. The expression token.ToString().Substring(2, token.Length-3) returned "xB7" because the page uses a hexadecimal character entity (&#xB7;, which renders as "·"). I think some logic needs to be added to check for hex entities as opposed to decimal ones.
Thanks,
Mike
|
|
|
|
|
Since I had to parse a web site that used &#xA0; for non-breaking spaces everywhere, I took the liberty of fixing it in my copy. I would welcome having my fix (or similar code) included in the standard version:
if (token[1] == '#')
{
    try
    {
        if (token[2] == 'x')
        {
            int v = int.Parse(token.ToString().Substring(3).Split(';')[0], System.Globalization.NumberStyles.HexNumber);
            output.Append((char)v);
        }
        else
        {
            int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
            output.Append((char)v);
        }
    }
    catch (Exception ex)
    {
        Trace.Write(ex);
    }
}
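As a variation on the fix above, here is a minimal sketch of the same hex/decimal distinction written as a standalone helper that uses TryParse instead of catching exceptions. The NumericEntity class and its TryDecode method are hypothetical names for illustration and do not exist in HtmlEncoder.cs; the sketch also assumes the token is the complete reference, including "&#" and the trailing ";".

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of HtmlEncoder.cs.
public static class NumericEntity
{
    // Decodes a numeric character reference such as "&#183;" or "&#xB7;".
    // Returns false (and sets decoded to null) when the token is not valid.
    public static bool TryDecode(string token, out string decoded)
    {
        decoded = null;
        if (token == null || token.Length < 4 ||
            !token.StartsWith("&#", StringComparison.Ordinal) ||
            token[token.Length - 1] != ';')
        {
            return false;
        }

        bool isHex = token[2] == 'x' || token[2] == 'X';
        // Strip "&#" (or "&#x") from the front and ";" from the end.
        string digits = token.Substring(isHex ? 3 : 2, token.Length - (isHex ? 4 : 3));

        int codePoint;
        bool parsed = isHex
            ? int.TryParse(digits, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out codePoint)
            : int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out codePoint);

        // Reject values outside the Unicode range and lone surrogates.
        if (!parsed || codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            return false;
        }

        decoded = char.ConvertFromUtf32(codePoint);
        return true;
    }
}
```

Returning a string rather than a single char means code points above U+FFFF are decoded as a surrogate pair instead of being truncated by a (char) cast.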
|
|
|
|
|
Hi all
I'm implementing a WinForms app around an HTML parser.
In my app, the user enters a URL (such as www.amazon.com) and the app shows the page in a web browser control.
I want to let users select an area on that page and have a label control show all the text in the selected area. How can I do that?
In other words: how can I determine which HTML tags in that page enclose the selected text?
EX:
HTML:
Page:
selected text
none selected text
When I drag the mouse to enclose "selected text", I want to determine that the table with id=1 is selected, and "selected text" should be shown in a label control.
Please share your ideas.
Thanks in advance.
mns
|
|
|
|
|
Can anyone solve my problem?
I have developed a web application in which I parse the contents of a web page using the
MIL HTML parser.
I now have the document in HTML format.
I need to use the parser's types, such as
htmldocument
htmlelement
htmlnode
htmlattributes
I am really new to the .NET environment, and now I need to know
how to find the tags with
I need to separate the input tags first and then find their attributes, like type="submit,hidden" name="" etc.
Has anybody done this before, or can anybody give me an idea of how to write a recursive function to separate the input tags from the document?
Please help, I am running short of time.
thanks
Rama
|
|
|
|
|
Maybe the following program, written with DOL HTML Parser
(http://www.codeproject.com/useritems/DOL_HTML_Parser.asp), meets your requirement.
Good Luck
// Open HTML file "xxx.htm"
DHtmlGeneralParser parser = new DHtmlGeneralParser();
DHtmlDocument htmlDoc = new DHtmlDocument(parser);
htmlDoc.Load(@"..\xxx.htm");
DHtmlNodeCollection result = new DHtmlNodeCollection();
// Find all tags matching this pattern in the whole HTML document
// function: void FindByNameAttribute
// (
//     DHtmlNodeCollection result, // a collection to collect the results
//     string name,                // tag name to find
//     string attributeName,       // attribute name to find
//     bool searchChildren         // whether to search children recursively
// )
htmlDoc.Nodes.FindByNameAttribute(result, "input", "type", true);
|
|
|
|
|
The MIL HTML Parser is a useful library for me, but the project is no longer maintained.
I created a project, "DOLS HTML Parser", based on MIL HTML Parser on CodeProject, and I hope it can help everyone.
It is a parser for non-well-formed HTML and a CSS resolver.
The URL: http://www.codeproject.com/useritems/DOL_HTML_Parser.asp
|
|
|
|
|
Great code! For the most part it works, but more often than not it does not pick up on an IMG node I am looking for. I have tweaked the source HTML docs a bit and usually get it to work, but haven't nailed down the cause.
Are there any requirements or restrictions on the source HTML/XHTML?
thanks!
|
|
|
|
|
I just thought of asking, why use a String instead of a Stream?
|
|
|
|
|
I don't understand why you find that so hard when it's so easy!
Try this:
Dim mDocument As MIL.Html.HtmlDocument
Dim html As String = "Your HTML thingies here, instead of a StreamReading result"
mDocument = HtmlDocument.Create(html, False)
Then just do whatever you want with mDocument; it only takes a bit of tinkering with the demo project...
Though I find it a rather stupid question for a Microsoft Partner, since the streams that have been read are actually Strings...
So instead of:
Dim html As String = Stream.ReadToEnd
you just change Stream.ReadToEnd to your string...
Or am I misunderstanding the question here?
You mess with the best, you die like the rest... well... kinda???
|
|
|
|
|
Read Streams are *not* strings. The ReadToEnd method is not the best way to use streams, especially if you are parsing...
Strings are immutable, meaning that every time I do an operation on them, a new one is created. So if I do 25 operations on a 60 KB string, that can allocate some 1.5 MB! And that's just garbage, waiting to be GCed...
In my app I process about 1 MB of HTML a second (it's a scraper), so a string-based solution would just not scale...
Streams scale: if you do a Read with a 25-byte buffer, you only allocate 25 bytes, so it scales...
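To illustrate the buffer argument, here is a minimal sketch that scans HTML through a TextReader with one reusable buffer, so the whole document never has to exist as a single string. The ChunkedReader class and its CountTagOpeners method are made-up names for this example; counting '<' characters simply stands in for real parsing work.

```csharp
using System;
using System.IO;

// Hypothetical example class, not part of any parser library.
public static class ChunkedReader
{
    // Counts '<' characters by reading the input in fixed-size chunks.
    // Only one buffer of bufferSize chars is allocated, whatever the input size.
    public static int CountTagOpeners(TextReader reader, int bufferSize)
    {
        char[] buffer = new char[bufferSize];
        int count = 0;
        int read;

        // Read returns the number of chars actually placed in the buffer,
        // and 0 once the end of the input is reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int k = 0; k < read; k++)
            {
                if (buffer[k] == '<')
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The same loop works with a StreamReader over a network or file stream, since both derive from TextReader.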
Yes, building a stream-based parser is harder, but I've already found one: HTMLAgilityPack, which is quite fast and stream-based...
Thanks anyway...
|
|
|
|
|
I strongly suggest moving your project somewhere with more exposure. I googled hard to find it, and at this point it's much more useful than Microsoft's wrapper of the IE control. Such libraries are important precisely because the cost of keeping them up to date for a small project or for research purposes (my case) is prohibitive.
I suggest GotDotNet workspaces or the shiny new CodePlex.
Btw, great work.
|
|
|
|
|
I think it is very good, but it can't fix some errors in HTML code the way IE does. For example, I have this HTML code:
Hi
Hello
Does it fix this error like IE does, so that the text "Hello" is a subnode of ?
Has anyone fixed this error?
Please help me.
sfdafsda
|
|
|
|
|
Spaces between words get removed if a formatting tag is contained between the words. For example:
Dear Ally
gets displayed as:
DearAlly
|
|
|
|
|
HtmlDocument.Create() has an overload with a "wantSpaces" parameter.
Set that to true and your spaces will be preserved.
|
|
|
|
|
I found that it eats some CJK (Chinese, Japanese, Korean) words.
It took me a long time to find the reason.
Finally I found that it is caused by line 315:
Dim sr As System.IO.StreamReader = System.IO.File.OpenText(OpenHtmlFileDialog.FileName)
It works after changing that line to:
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(OpenHtmlFileDialog.FileName, System.Text.Encoding.Default)
Thanks a lot for your work.
It has been a great help to me.
|
|
|
|
|
Hello!
I have found a minor error in the weeding out of the element. If there is something meaningful immediately after the section, then its first two characters are removed.
The error lies in the RemoveSGMLComments function in the file HtmlParser.cs. The repaired version of the while loop (at the beginning of this function) is at the end of this post.
Greetings,
jmas8109
Source code section:
--------------------
while( i < input.Length )
{
    if( i + 2 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
    {
        i = input.IndexOf( ">" , i );
        if( i == -1 )
        {
            break;
        }
        i += 1; // originally there was i += 3 (which is a bug)
jmas8109
|
|
|
|
|
This has made my day. There are 5 billion articles and samples about translating XML->HTML, etc. but no one wants to touch HTML->something freaking usable.
Thanks!
|
|
|
|
|
This is a good article. I give it 5 stars.
We can't stop asking "WHY!!"
|
|
|
|
|
The FilterIndex returned by the common file dialog is 1-based, yielding indices 1 and 2. The code using the FilterIndex checks for 0 or 1. The result is that XHTML is always exported.
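A minimal sketch of a 1-based mapping follows. The ExportFormat enum and the ExportFormatSelector class are hypothetical names, and the filter order (HTML first, XHTML second) is an assumption inferred from the behaviour described above, not taken from the demo source.

```csharp
using System;

// Hypothetical names for illustration; not taken from the demo project.
public enum ExportFormat
{
    Html,
    Xhtml
}

public static class ExportFormatSelector
{
    // filterIndex is the value of FileDialog.FilterIndex, which starts at 1.
    // Assumption: the dialog lists the HTML filter first and XHTML second.
    public static ExportFormat FromFilterIndex(int filterIndex)
    {
        switch (filterIndex)
        {
            case 1:
                return ExportFormat.Html;
            case 2:
                return ExportFormat.Xhtml;
            default:
                throw new ArgumentOutOfRangeException("filterIndex");
        }
    }
}
```

Throwing on an unexpected index makes an off-by-one mistake fail loudly instead of silently falling through to one format.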
|
|
|
|
|
I found some code that confuses me.
I'm not sure whether it is correct.
In HtmlParser.cs:
code line 272:
original: if( i + 4 < input.Length && input.Substring( i , 4 ).Equals( "<!--" ) )
suggestion: if( i + 3 < input.Length && input.Substring( i , 4 ).Equals( "<!--" ) )
code line 344:
original: if( i + 2 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
suggestion: if( i + 1 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
code line 352:
original: i += 3;
suggestion: i += 1;
code line 542:
original: if( i+2 < input.Length && input.Substring( i , 2 ).Equals( "<" ) )
suggestion: if( i+1 < input.Length && input.Substring( i , 2 ).Equals( "</" ) )
code line 588:
original: if( i+1<input.Length && input.Substring( i , 1 ).Equals( "/>" ) )
suggestion: if( i+1<input.Length && input.Substring( i , 2 ).Equals( "/>" ) )
code line 711:
In some cases the value of an attribute can be "images/about_logo.gif", so we can't use "/" to decide the end of the value string. (This case can be found in http://www.google.com/intl/en/about.html)
original: while( i<input.Length && input.Substring( i , 1 ).IndexOfAny( " \r\n\t/>".ToCharArray() ) == -1 )
suggestion: while( i < input.Length && input.Substring( i , 1 ).IndexOfAny( " \r\n\t>".ToCharArray() ) == -1 && !( i + 1 < input.Length && input.Substring( i , 2 ).Equals( "/>" ) ) ) ++i;
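The off-by-one bounds in the checks above come from pairing a hand-written length test with Substring. One way to avoid that, sketched here under the assumption that a small helper is acceptable, is to do the bounds check and the comparison in one place. TokenMatch and MatchesAt are hypothetical names and are not part of HtmlParser.cs.

```csharp
using System;

// Hypothetical helper, not part of HtmlParser.cs.
public static class TokenMatch
{
    // Returns true when token occurs in input starting exactly at index.
    // A token that ends on the last character of input still matches,
    // which the original "i + 4 < input.Length" style checks would miss.
    public static bool MatchesAt(string input, int index, string token)
    {
        return index >= 0
            && index <= input.Length - token.Length
            && string.CompareOrdinal(input, index, token, 0, token.Length) == 0;
    }
}
```

With this helper, a test such as the one on line 272 would read MatchesAt(input, i, "<!--"), with no separate length arithmetic and no temporary substring.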
I hope this information helps the project. =^_^=
|
|
|
|
|
I just downloaded the code and noticed that when parsing '', the parser thinks the node has two attributes. One attribute is named 'width' and has the value '10'; the other has a blank name and a null value.
Vincent
|
|
|
|