
HTML Parser C++ (Demo Project)

4 Oct 2013 | CPOL | 1 min read
This is a sample project for "HTML Reader C++ Class Library"

Introduction

This is a sample project built on a tiny HTML parser library (the "HTML Reader C++ Class Library"). Its main purpose is to show how to use that library, although I have also added a few features to it. The project supports UNICODE builds. The code wraps the HTML tags in a tree model and exposes a function to retrieve a specific HTML element.

An HTML element is an individual component of an HTML document or "web page", once this has been parsed into the Document Object Model.

In the HTML syntax, most elements are written with a start tag and an end tag, with the content in between. An HTML tag is composed of the name of the element, surrounded by angle brackets. An end tag also has a slash after the opening angle bracket, to distinguish it from the start tag.

<p>In the HTML syntax, most elements are written ...</p>
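The tag syntax described above can be sketched in a few lines of standard C++. This is only an illustration: `TagInfo` and `ParseTag` are hypothetical names and are not part of the library, and the sketch ignores attributes, comments and malformed markup.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type for illustration; not part of the library.
struct TagInfo {
    std::string name;  // element name, lower-cased
    bool isEndTag;     // true for </name>, false for <name ...>
};

// Parses a single tag such as "<p>", "</p>" or "<a href='x'>".
// Returns false when the text is not shaped like a tag.
bool ParseTag(const std::string& text, TagInfo& out) {
    if (text.size() < 3 || text.front() != '<' || text.back() != '>')
        return false;
    std::size_t pos = 1;
    bool isEnd = false;
    if (text[pos] == '/') {  // a slash right after '<' marks an end tag
        isEnd = true;
        ++pos;
    }
    assert(pos < text.size());  // holds because size >= 3 and pos <= 2
    std::string name;
    // Collect the element name; stop at the closing '>' or any non-alphanumeric.
    while (pos + 1 < text.size() &&
           std::isalnum(static_cast<unsigned char>(text[pos]))) {
        name += static_cast<char>(
            std::tolower(static_cast<unsigned char>(text[pos])));
        ++pos;
    }
    if (name.empty())
        return false;
    out.name = name;
    out.isEndTag = isEnd;
    return true;
}
```

Calling `ParseTag("</p>", info)` under these assumptions fills `info.name` with `p` and sets `info.isEndTag` to true.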

Any number of other tags may appear between a start tag and its end tag. This project offers a way to search for a specific tag, optionally restricted to a given attribute and value, and then to extract the content of the matching element. It is a lightweight alternative to Microsoft's MSHTML parser, which in my experience is full of leaks.
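To make the idea of searching a tag tree concrete, here is a minimal sketch in standard C++. It is not the library's implementation: the `Element` struct and `FindElements` function are hypothetical names introduced for illustration, and the sketch assumes the document has already been parsed into a tree.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical tree node for illustration; the library's own classes differ.
struct Element {
    std::string tag;                                // element name, e.g. "style"
    std::map<std::string, std::string> attributes;  // attribute name -> value
    std::string text;                               // text directly inside the element
    std::vector<Element> children;                  // nested elements, in document order
};

// Depth-first search: collects every element whose tag matches and whose
// attribute `attr` has exactly the value `value`.
void FindElements(const Element& node, const std::string& tag,
                  const std::string& attr, const std::string& value,
                  std::vector<const Element*>& out) {
    assert(!tag.empty());  // searching for an empty tag name is a caller error
    if (node.tag == tag) {
        auto it = node.attributes.find(attr);
        if (it != node.attributes.end() && it->second == value)
            out.push_back(&node);
    }
    for (const Element& child : node.children)
        FindElements(child, tag, attr, value, out);
}
```

The result holds pointers into the tree, so the tree must outlive the result vector. A real parser would also have to cope with unclosed tags and case-insensitive names, which this sketch leaves out.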


Using the Code

Add the files from the AClass directory to your project.

Include the headers you need, for example:

C++
#include "AClass/LiteHTMLReader.h"  
#include "AClass/HtmlElementCollection.h"


Instantiate the reader which will parse the HTML string.

C++
CLiteHTMLReader theReader;
CHtmlElementCollection theElementCollectionHandler;
theReader.setEventHandler(&theElementCollectionHandler);  

If you want only the tags that have a specific attribute value, use:

C++
theElementCollectionHandler.InitWantedTag(_T("style"), _T("id"),_T("sss"));

Call the parser function. When it returns, theElementCollectionHandler is filled with the parsed structure.

C++
theReader.Read(m_szHtmlPage);

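Now retrieve the text of each matching element into a CString variable.

C++
CString szTxt;  // receives the outer HTML of the current element
for (int i = 0; i < theElementCollectionHandler.GetNumElementsFiltered(); i++) {
    theElementCollectionHandler.GetOuterHtml(i, szTxt, 1);
    // use szTxt here; it is overwritten on the next iteration
}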

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Romania

Comments and Discussions

 
Question: How to delete a html tag from all the HtmlTree-tree tags?
jpkfox, 4-Aug-16 1:23

General: My vote of 5
nvect, 29-Jan-14 19:35

Question: It's really great
nvect, 29-Jan-14 19:26

Answer: Re: It's really great
dchris_med, 29-Jan-14 20:23

General: Re: It's really great
nvect, 31-Jan-14 12:33

General: Re: It's really great
dchris_med, 31-Jan-14 12:54

General: Re: It's really great
nvect, 1-Feb-14 1:20

Question: I do not know the cause of the failure symptoms
shint, 10-Nov-13 3:51
http://www.devpia.com/Maeul/Contents/Detail.aspx?BoardID=50&MAEULNO=20&no=923363&ref=923363&page=1

http://www.codeproject.com/Articles/663186/HTML-Parser-Cplusplus-Demo-Project

----------------------------------------------
I do not know the cause of the failure
----------------------------------------------
Possible causes
----------------------------------------------
- The loop runs a great many times. It may be a problem with the depth of the tree control, or with some limit value.
- The call stack runs short because of recursion.
  I increased the stack size in the project settings and also tried a pragma push of 1. Neither had any effect.
- An array index goes out of bounds: in TRACE() or in a buffer, the index must be greater than -1 and less than the size of the array.
- Parsing is applied before the HTML file has been downloaded completely.
----------------------------------------------

First, to narrow the problem down, I turned Unicode off and tested with a multi-byte build.


--------------------------------------------------------------
Symptoms
--------------------------------------------------------------
HtmlParser
Encountered an improper argument.
--------------------------------------------------------------
First-chance exception at 0x7c812aeb (kernel32.dll) in HtmlParser.exe: Microsoft C++ exception: CInvalidArgException at memory location 0x0692eaa8.
Warning: Uncaught exception in WindowProc (returning 1).
The program '[1560] HtmlParser.exe: Native' has exited with code 2 (0x2).
--------------------------------------------------------------



--------------------------------------------------------------
Temporary solution
--------------------------------------------------------------
The problem occurs when connecting to sites such as www.naver.com and www.yahoo.co.jp.

1. Depth value: for Naver, comment out this part. (As a result the depth value becomes smaller, and the problem seems to be resolved.)
LiteHTMLAttributes.h
inline UINT CLiteHTMLAttributes::parseFromStr(LPCTSTR lpszString)
if (!(nTemp = oElemAttr.parseFromStr(&lpszString[nRetVal])))

With this part commented out, the error message disappears.
2. Array bounds, or the depth value increasing
UINT CLiteHTMLReader::parseDocument(void)
case _T('&'):
{
    // UngetChar();   <- commented out
    lTemp = CLiteHTMLEntityResolver::resolveEntity(&m_lpszBuffer[m_dwBufPos], ch);
    ch = ReadChar();
    ++dwCharDataLen;
}

3. Change the browser (user agent) information to Internet Explorer. Everything then works, although there is no guarantee it will keep working.
// call init
// Netscape 5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101
// grab.Initialise(_T("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"), NULL);
grab.Initialise(_T("Microsoft Internet Explorer 4.0 (compatible; MSIE 6.0))"), NULL);
--------------------------------------------------------------


4. After the file has been downloaded, I added a Sleep(2000) at the end and ran a TRACE() loop. The error disappears temporarily.
This was done after the download, in:
BOOL CWebGrab::GetFile(LPCTSTR szURL, CString& szBuffer, LPCTSTR szAgentName /* = NULL */, CWnd* pWnd /* = NULL */)
This is the wrong way to do it, but it is easy to test.

The right place is exactly where the INTERNET_STATUS_REQUEST_COMPLETE message is received. Parsing must run only once the request has completed.
void CWebGrabSession::OnStatusCallback(DWORD dwContext,
                                       DWORD dwInternetStatus,  // == INTERNET_STATUS_REQUEST_COMPLETE
                                       LPVOID lpvStatusInformation,
                                       DWORD dwStatusInformationLength)

--------------------------------------------------------------
5. Look at ReadChar() and UngetChar() in LiteHTMLReader.h.
   It is hard to tell whether the ++ and -- are working properly.

The entire source would have to be checked.
Doing so could make the parser faster and more stable.
Still, considering the effort already put in, it is reasonably stable and looks well finished.

TCHAR ReadChar(void)
{
    ASSERT(m_lpszBuffer != NULL);
    if (m_dwBufPos >= m_dwBufLen)
        return (NULL);
    return (m_lpszBuffer[m_dwBufPos++]);
}

TCHAR UngetChar(void)
{
    ASSERT(m_lpszBuffer != NULL);
    ASSERT(m_dwBufPos);
    return (m_lpszBuffer[--m_dwBufPos]);
}

modified 10-Nov-13 10:03am.
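
The concern about ReadChar() and UngetChar() comes down to how the buffer position is guarded: ASSERT is compiled out of release builds, so an unget at position 0 would go unchecked there. Below is a minimal standalone sketch of a reader that checks both ends at run time. `CharReader` is a hypothetical class written for illustration with std::string; it is not the library's code and has not been tested against the failing pages.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical standalone reader; the library's CLiteHTMLReader differs.
class CharReader {
public:
    explicit CharReader(const std::string& buffer)
        : m_buffer(buffer), m_pos(0) {}

    // Returns the next character and advances, or '\0' at the end of the buffer.
    char ReadChar() {
        if (m_pos >= m_buffer.size())
            return '\0';
        return m_buffer[m_pos++];
    }

    // Steps back one character and returns it, or '\0' if already at the start.
    // The check also runs in release builds, unlike an ASSERT.
    char UngetChar() {
        if (m_pos == 0)
            return '\0';
        --m_pos;
        assert(m_pos < m_buffer.size());  // invariant: position stays inside the buffer
        return m_buffer[m_pos];
    }

    std::size_t Position() const { return m_pos; }

private:
    std::string m_buffer;  // copy of the text being read
    std::size_t m_pos;     // index of the next character to read
};
```

Returning '\0' for both "end of buffer" and "nothing to unget" keeps the interface close to the functions quoted above; a caller that needs to tell the two apart can compare Position() before and after the call.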

Answer: Re: I do not know the cause of the failure symptoms
dchris_med, 10-Nov-13 8:41
