HTML Parser C++ (Demo Project)

4 Oct 2013, CPOL
This is a sample project for "HTML Reader C++ Class Library"

Introduction

This is a sample project built on the tiny "HTML Reader C++ Class Library"; its main purpose is to show how to use that library. I have also added some features to the library, and the project supports UNICODE builds. The code wraps the HTML tags in a tree model and exposes a function to retrieve a specific HTML element.

An HTML element is an individual component of an HTML document or "web page", once this has been parsed into the Document Object Model.

In the HTML syntax, most elements are written with a start tag and an end tag, with the content in between. An HTML tag is composed of the name of the element, surrounded by angle brackets. An end tag also has a slash after the opening angle bracket, to distinguish it from the start tag.

<p>In the HTML syntax, most elements are written ...</p>
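The start-tag/end-tag rule described above can be sketched in a few lines of standard C++. This is only an illustration under simplifying assumptions: `TagKind`, `TagInfo`, `classifyTag` and `demoClassifyTag` are hypothetical names that are not part of the parser library, and the sketch treats a self-closing token such as `<br/>` as a start tag.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper types for illustration; not part of the library.
enum class TagKind { Start, End, NotATag };

struct TagInfo {
    TagKind kind;
    std::string name;  // element name, lower-cased; empty when kind == NotATag
};

// Classifies a single token such as "<p>" or "</p>".
TagInfo classifyTag(const std::string& token) {
    const TagInfo notATag{TagKind::NotATag, ""};
    if (token.size() < 3 || token.front() != '<' || token.back() != '>')
        return notATag;

    std::size_t pos = 1;
    TagKind kind = TagKind::Start;
    if (token[pos] == '/') {  // a slash right after '<' marks an end tag
        kind = TagKind::End;
        ++pos;
    }

    // The element name runs until whitespace, '/' or the closing '>'.
    std::string name;
    while (pos < token.size() - 1) {
        const unsigned char c = static_cast<unsigned char>(token[pos]);
        if (std::isspace(c) || c == '/') break;
        name += static_cast<char>(std::tolower(c));
        ++pos;
    }

    // Reject empty names and things like comments ("<!-- ... -->").
    if (name.empty() || !std::isalpha(static_cast<unsigned char>(name[0])))
        return notATag;
    return TagInfo{kind, name};
}

// Small usage example; the assertions document the expected behaviour.
void demoClassifyTag() {
    assert(classifyTag("<p>").kind == TagKind::Start);
    assert(classifyTag("</p>").kind == TagKind::End);
    assert(classifyTag("<A href=\"x\">").name == "a");
    assert(classifyTag("plain text").kind == TagKind::NotATag);
}
```

A real parser also has to handle attributes, comments and malformed markup, which is what the library takes care of.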

Any number of other tags may appear between the start and end tags. This project offers a way to search for a specific tag, specify an attribute and value that the tag must have, and then extract the content of the matching element. It's a cheap alternative to Microsoft's MSHTML parser (which is prone to memory leaks).
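To make the idea of searching a tag tree by tag name, attribute and value concrete, here is a minimal sketch in standard C++. It is not the library's implementation: `Node`, `findElements`, `buildSampleTree` and `demoFindElements` are hypothetical names invented for this illustration, and the tree is built by hand rather than parsed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical tree node for illustration only; the library's own types differ.
struct Node {
    std::string tag;                                // element name, e.g. "div"
    std::map<std::string, std::string> attributes;  // attribute name -> value
    std::string text;                               // text directly inside the element
    std::vector<Node> children;                     // nested elements
};

// Depth-first search that collects every node whose tag matches `tag`
// and whose attribute `attr` equals `value`.
void findElements(const Node& node, const std::string& tag,
                  const std::string& attr, const std::string& value,
                  std::vector<const Node*>& out) {
    if (node.tag == tag) {
        auto it = node.attributes.find(attr);
        if (it != node.attributes.end() && it->second == value)
            out.push_back(&node);
    }
    for (const Node& child : node.children)
        findElements(child, tag, attr, value, out);
}

// Builds a tiny tree equivalent to:
// <html><body><div id="a">first</div><div id="b">second</div></body></html>
Node buildSampleTree() {
    Node first{"div", {{"id", "a"}}, "first", {}};
    Node second{"div", {{"id", "b"}}, "second", {}};
    Node body{"body", {}, "", {first, second}};
    Node html{"html", {}, "", {body}};
    return html;
}

// Usage example: find the <div> whose id is "b" and read its content.
void demoFindElements() {
    const Node root = buildSampleTree();
    std::vector<const Node*> hits;
    findElements(root, "div", "id", "b", hits);
    assert(hits.size() == 1);
    assert(hits[0]->text == "second");
}
```

The results are pointers into the tree, so the tree must outlive the result vector.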

Using the Code

Add the files from the AClass directory to your project.

Include the headers you need, for example:

C++
#include "AClass/LiteHTMLReader.h"  
#include "AClass/HtmlElementCollection.h"


Instantiate the reader which will parse the HTML string.

C++
CLiteHTMLReader theReader;
CHtmlElementCollection theElementCollectionHandler;
theReader.setEventHandler(&theElementCollectionHandler);  

If you want to get only the tags that have a specific attribute value, use:

C++
theElementCollectionHandler.InitWantedTag(_T("style"), _T("id"),_T("sss"));

Call the parser function. When it returns, theElementCollectionHandler will be filled with the parsed structure.

C++
theReader.Read(m_szHtmlPage);

Now retrieve each element's text into a CString variable.

C++
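CString szTxt;  // receives the HTML of the current element
for (int i = 0; i < theElementCollectionHandler.GetNumElementsFiltered(); i++) {
    // szTxt is overwritten on each pass; use it here before the next iteration
    theElementCollectionHandler.GetOuterHtml(i, szTxt, 1);
}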

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Software Developer
Romania

Comments and Discussions

 
Question: How to delete a html tag from all the HtmlTree-tree tags?
jpkfox, 4-Aug-16 1:23

General: My vote of 5
nvect, 29-Jan-14 19:35

Question: It's really great
nvect, 29-Jan-14 19:26

Answer: Re: It's really great
dchris_med, 29-Jan-14 20:23

General: Re: It's really great
nvect, 31-Jan-14 12:33

General: Re: It's really great
dchris_med, 31-Jan-14 12:54

General: Re: It's really great
nvect, 1-Feb-14 1:20

Question: I do not know the cause of the failure symptoms
shint, 10-Nov-13 3:51

Answer: Re: I do not know the cause of the failure symptoms
dchris_med, 10-Nov-13 8:41
What's the URL that gives you problems?
