
The Most Minimized XML Parser

11 May 2004
A small XML parser to use in place of the large MSXML.

Introduction

XML is an essential data format when you build an object model; without it, hierarchical data is awkward to represent. MSXML, however, is a large DLL, and its version keeps changing, which complicates installation. All I need is to parse a small document in a small application, yet I would have to distribute a big DLL. That made me want to reinvent the wheel.

After I finished my draft, I found that there were already several successful XML-parsing projects, such as TinyXml, the simple STL-based XML parser by David Hubbard, and XMLite: simple XML parser by Cho, Kyung-min. Each has its own features, but my requirements are different: my code does not depend on MFC or STL, and it does not need to run on other operating systems.

Using the Code

C++
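#pragma once
#include <tchar.h>        // _T, _tcsstr
#include <atlbase.h>      // CComBSTR, ATLASSERT
#include <atlsimpcoll.h>  // CSimpleArray, CSimpleMap
#define MAX_TEXT_LEN 0x400
#define MAX_NAME_LEN 0x100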

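class CAxNode
{
public:
  CAxNode() { parent = NULL; }
  CAxNode(CAxNode* p):parent(p){}
  ~CAxNode()
  {
    // a node owns its children and deletes them recursively
    for(int i=0; i<childNodes.GetSize();i++)
      delete childNodes[i];
  }

  CComBSTR  elementType;                  // tag name
  CSimpleArray<CComBSTR> arrText;         // text runs between child elements
  CSimpleMap<CComBSTR,CComBSTR> attrMap;  // attribute name -> value
  CSimpleArray<CAxNode*> childNodes;      // owned child elements
  CAxNode*   parent;                      // not owned; NULL for the root
};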

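inline void FillSTR(LPTSTR& s,CComBSTR &x)
{
  // Copies a name (element or attribute) into x and leaves s on the
  // delimiter that ended it. Names longer than the buffer are truncated.
  TCHAR name[MAX_NAME_LEN];
  long j=0;
  while(*s && j<MAX_NAME_LEN-1)
  {
    name[j++]=*s++;
    // should be char or dig, and not dig head
    if(*s==_T('\r')||*s==_T('\n')||
      *s==_T('\t')||*s==_T(' ')||*s==_T('>')||
      *s==_T('=')||*s==_T('<')||*s==_T('/'))
      break;
  }

  name[j] =_T('\0');
  //CharLower(name);
  x = name;
}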

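inline void SkipFormatChar(LPTSTR& s)
{
  // Steps over the current character, then over any whitespace after it.
  // The parameter is LPTSTR& so that the callers' LPTSTR can bind to it.
  if(*s==_T('\0'))
    return;  // never step past the terminator
  while(*++s)
    if(*s!=_T('\r')&&*s!=_T('\n')&&*s!=_T('\t')&&*s!=_T(' '))
      break;
}
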
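inline void ParseNode(CAxNode* Node, LPTSTR& s)
{
  // advance to just past the next '<', then read the element name
  while(*s && *s++!=_T('<'));
  FillSTR(s,Node->elementType);

  TCHAR szText[MAX_TEXT_LEN];
  if(*s!=_T('>'))
  {
    // now spawn attr map
    if(*s && *s!=_T('/'))  // stay on '/' so that <name/> is recognized
      SkipFormatChar(s);
    while(*s)
    {
      if(*s==_T('/'))
      {
        while(*s && *s++!=_T('>'));
        return;  // end of a self-closed tag
      }

      if(*s==_T('>'))
        break;  // now process innertext

      CComBSTR attr;
      FillSTR(s,attr);
      while(*s && *s++!=_T('"'));  // move past the opening quote

      long j = 0;
      while(*s && *s!=_T('"') && j<MAX_TEXT_LEN-1)
        szText[j++] = *s++;
      szText[j] = _T('\0');
      while(*s && *s!=_T('"'))  // drop the part that did not fit
        s++;

      // keep only the first occurrence of a duplicated attribute
      if(Node->attrMap.FindKey(attr)<0)
        Node->attrMap.Add(attr,szText);

      if(*s)
        SkipFormatChar(s);  // past the closing quote and whitespace
    }
  }

  // processing inner text and child nodes
  while(*s)
  {
    if(*s==_T('>'))
      s++;

    long j = 0;
    while(*s && *s!=_T('<') && j<MAX_TEXT_LEN-1)
      szText[j++] = *s++;
    szText[j] = _T('\0');
    while(*s && *s!=_T('<'))  // drop the part that did not fit
      s++;
    Node->arrText.Add(szText);

    if(*s==_T('\0'))
      return;  // input ended before the closing tag

    SkipFormatChar(s);
    if(*s==_T('/'))
    {
      // closing tag: its name should match the one that opened this node
      SkipFormatChar(s);
      CComBSTR elementType;
      FillSTR(s,elementType);
      ATLASSERT(Node->elementType==elementType);
      return;
    }
    s--;  // step back onto '<' for the recursive call

    CAxNode* child = new CAxNode(Node);
    Node->childNodes.Add(child);
    ParseNode(child,s);
  }
  return;
}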

#pragma warning(push)
#pragma warning(disable: 4244) 
inline void RemoveBlock(LPTSTR &s,LPCTSTR szleft,LPCTSTR szright)
{
  // Overwrites every szleft...szright block in s with spaces, in place.
  long i1,i2;
  LPTSTR s1,s2;
  while(1)
  {
    s1 = _tcsstr(s,szleft);
    if(s1==NULL)
      break;
    i1 = s1-s;
    // look for the terminator only after the opener
    s2 = _tcsstr(s1+lstrlen(szleft),szright);
    if(s2==NULL)
      break;  // unterminated block: leave the rest untouched
    i2 = s2-s+lstrlen(szright);

    ATLASSERT(i2>i1);
    for(long i=i1;i<i2;i++)
      s[i] = _T(' ');
  }
  return;
}
#pragma warning(pop)

This is all of the code, 135 lines in total. To use it, you only need atlsimpcoll.h for the collection classes, plus the ATL header that declares CComBSTR. If you don't have atlsimpcoll.h, you can copy it from another machine; the file has little dependency on other ATL headers.
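
The block-blanking idea behind RemoveBlock can be tried without ATL. The following is a minimal sketch in standard C++, not the article's code: BlankBlocks is a hypothetical name, and std::string stands in for the TCHAR buffer. It overwrites each block with spaces rather than erasing it, so the buffer length and every character offset stay the same.

```cpp
#include <string>

// Portable stand-in for RemoveBlock (hypothetical name): overwrite every
// left...right block with spaces so that the buffer keeps its length.
inline void BlankBlocks(std::string& text,
                        const std::string& left,
                        const std::string& right)
{
  std::string::size_type from = 0;
  for (;;)
  {
    std::string::size_type begin = text.find(left, from);
    if (begin == std::string::npos)
      break;
    // Search for the terminator only after the opener.
    std::string::size_type end = text.find(right, begin + left.size());
    if (end == std::string::npos)
      break;  // unterminated block: leave the remainder as it is
    end += right.size();
    for (std::string::size_type i = begin; i < end; ++i)
      text[i] = ' ';
    from = end;
  }
}
```

Calling BlankBlocks(text, "<?", "?>") and then BlankBlocks(text, "<!--", "-->") mirrors the two RemoveBlock calls made before parsing.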

C++
HANDLE hFile = ::CreateFile(testfile,GENERIC_READ,
  FILE_SHARE_READ,NULL,OPEN_EXISTING,FILE_ATTRIBUTE_NORMAL,NULL);
if (hFile == INVALID_HANDLE_VALUE)
  return -1;

ULARGE_INTEGER liFileSize;
liFileSize.LowPart = ::GetFileSize(hFile, &liFileSize.HighPart);
if (liFileSize.LowPart == INVALID_FILE_SIZE && ::GetLastError() != NO_ERROR)
{
  ::CloseHandle(hFile);
  return -1;
}

// The bytes are used as TCHARs unchanged, so the file's encoding must
// match the build (ANSI/MBCS or UTF-16).
LPTSTR lpXML =
 new TCHAR[((size_t)liFileSize.QuadPart)/sizeof(TCHAR)+1];

DWORD dwRead = 0;
BOOL b = ::ReadFile(hFile, lpXML,
  (DWORD)liFileSize.QuadPart,&dwRead,NULL);
::CloseHandle(hFile);  // the handle is not needed once the data is in memory
if (!b)
{
  delete [] lpXML;
  return -1;
}
lpXML[dwRead/sizeof(TCHAR)]= _T('\0');

RemoveBlock(lpXML,_T("<?"),_T("?>"));
RemoveBlock(lpXML,_T("<!--"),_T("-->"));

CAxNode* pNode = new CAxNode(NULL);

LPTSTR xxx = lpXML;  // ParseNode advances this copy, not lpXML
DWORD dw1 = GetTickCount();
ParseNode(pNode,xxx);
DWORD dw2 = GetTickCount();  // dw2 - dw1 is the parse time in milliseconds

delete [] lpXML;

// remove child at 0,0
//CAxNode* child = pNode->childNodes[0]->childNodes[0];
//delete child;
//pNode->childNodes[0]->childNodes.RemoveAt(0);

delete pNode;
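
Once ParseNode returns, the tree is walked through elementType, attrMap, arrText, and childNodes. The sketch below illustrates such a walk under stated assumptions: Node is a hypothetical stand-in that mirrors CAxNode's fields with standard containers so that it builds without ATL (a vector of the enclosing type formally needs C++17), and Dump and DumpToString are illustrative names, not part of the article's code.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in that mirrors CAxNode's public fields.
struct Node
{
  std::string elementType;
  std::vector<std::string> arrText;
  std::map<std::string, std::string> attrMap;
  std::vector<Node> childNodes;
};

// Writes one line per element: indentation, name, attributes, then any
// non-empty text fragments in quotes. Children follow, indented further.
inline void Dump(const Node& node, int depth, std::ostream& out)
{
  out << std::string(depth * 2, ' ') << node.elementType;
  for (const auto& attr : node.attrMap)
    out << ' ' << attr.first << '=' << attr.second;
  for (const auto& text : node.arrText)
    if (!text.empty())
      out << " \"" << text << '"';
  out << '\n';
  for (const auto& child : node.childNodes)
    Dump(child, depth + 1, out);
}

// Convenience wrapper that collects the dump into a string.
inline std::string DumpToString(const Node& root)
{
  std::ostringstream out;
  Dump(root, 0, out);
  return out.str();
}
```

With the real CAxNode, the same recursion would use childNodes.GetSize() and childNodes[i], and attrMap.GetKeyAt(i) and attrMap.GetValueAt(i) for the attributes.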

History

If you read the previous version, you will see that I have changed a lot. For example, LPTSTR has been replaced by CComBSTR, which lets me drop the custom simple array and equal helper. A new constructor has been added for my own use; you can simply remove it. The previous version added inner text only when its length was greater than 0; that was not reasonable, so I changed it.

The performance still needs a lot of work. If you do not need to modify nodes at run time, you can replace CComBSTR with LPCTSTR; this uses less memory and improves performance accordingly.

One last word, and obviously the most important one: if you use my code in a commercial product, you should certainly give me part of your profit! I am serious, sir! Or I will call the police!

Anyway, comments are really appreciated, so that I can make it better for all of us. I am waiting for you right here.
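
One way to read the suggestion of storing pointers instead of CComBSTR copies is to keep non-owning slices of the XML buffer. The sketch below is only an illustration of that idea, assuming the buffer outlives the nodes and is never modified: Slice and ReadName are hypothetical names, char is used instead of TCHAR for portability, and, unlike FillSTR, ReadName stops immediately if the first character is a delimiter.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical non-owning view of a run of characters in the XML buffer.
// It is valid only while that buffer is alive and unmodified.
struct Slice
{
  const char* data;
  std::size_t size;

  bool Equals(const char* text) const
  {
    return std::strlen(text) == size && std::memcmp(data, text, size) == 0;
  }
  std::string ToString() const { return std::string(data, size); }
};

// Returns the name that starts at s and advances s to the delimiter that
// ends it, without copying any characters.
inline Slice ReadName(const char*& s)
{
  const char* begin = s;
  while (*s && std::strchr("\r\n\t >=</", *s) == nullptr)
    ++s;
  Slice name = { begin, static_cast<std::size_t>(s - begin) };
  return name;
}
```

The trade-off is that the buffer can no longer be deleted right after parsing, as the usage code above does; it has to live as long as the tree.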

License

This article has no explicit license attached to it, but may contain usage terms in the article text or the download files themselves. If in doubt, please contact the author via the discussion board below.


Written By
Software Developer (Senior)
China

Comments and Discussions

 
Re: Compared to PugXML
ChauJohnthan, 12-May-04 17:14
"PugXML doesn't allocate memory for the strings (it parses in place), so there's little or no need for the ATL memory classes that your method uses"

If you read it carefully, you will find that this one also parses in place ;-)
That is the real change in this version.

As for the memory class you mention, CComBSTR (a dependency I admittedly did not describe accurately) is just a BSTR wrapper, and a BSTR is just an LPWSTR. I use it to avoid calling the ::SysAllocXXX APIs directly, which keeps the code tidy, compatible with ATL, and easy to understand.

I think maintaining the PugXML or TinyXML projects would be a great pain for me because they are full-featured. I only use XML as an internal data structure, so they are obviously overkill for me.

If I were parsing arbitrary XML from users, then choosing PugXML, TinyXML, or MSXML would be a must.