Introduction
There are many cases where a software project needs a small, fast and portable XML parser that is free of platform dependencies such as the COM interface of MSXML. I recently designed a fast cross-platform XML parser called MiniXML that quickly parses XML data into a document tree and provides an intuitive interface for accessing the data held in that tree.
Design consideration
The main requirement of this project is to build the document tree as fast as possible. I had the following considerations in mind to achieve this goal:
Obviously, the above rule set is much smaller than the official XML definition (EBNF for XML). MiniXML parses configuration files using this succinct rule set in order to minimize CPU and memory usage.
General design
The following diagram (Figure 1) shows the MiniXML class hierarchy:
MiniXML implements two major tasks:
Parse the input XML data and build a document tree
Each node of the document tree is a CElement object. Each object has a CElement* m_pFistChild member and a pair of m_pPrevSibling and m_pNextSibling members. m_pFistChild points to the first child node, while the sibling nodes form a doubly linked list maintained through the m_pPrevSibling and m_pNextSibling members of each CElement object. This hand-built doubly linked list (instead of an STL container) simplifies the iteration performed by CElementIterator.
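To make the linkage concrete, here is a minimal sketch of a tree node that uses one first-child pointer plus a doubly linked sibling list. The names TreeNode, AppendChild and CountChildren are hypothetical and chosen only for illustration. They are not part of MiniXML, and the real CElement holds much more state.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for CElement: only the linkage members are modelled.
struct TreeNode {
    std::string name;
    TreeNode* parent;
    TreeNode* firstChild;
    TreeNode* prevSibling;
    TreeNode* nextSibling;

    explicit TreeNode(const std::string& n)
        : name(n), parent(nullptr), firstChild(nullptr),
          prevSibling(nullptr), nextSibling(nullptr) {}
};

// Appends child at the end of parent's sibling list.
void AppendChild(TreeNode* parent, TreeNode* child) {
    assert(parent && child && child->parent == nullptr);
    child->parent = parent;
    if (!parent->firstChild) {
        parent->firstChild = child;      // first child of this parent
        return;
    }
    TreeNode* last = parent->firstChild;
    while (last->nextSibling) last = last->nextSibling;
    last->nextSibling = child;           // link both directions
    child->prevSibling = last;
}

// Counts the direct children by walking the sibling list.
std::size_t CountChildren(const TreeNode* parent) {
    std::size_t n = 0;
    for (const TreeNode* c = parent->firstChild; c; c = c->nextSibling) ++n;
    return n;
}
```

Because the links live inside the nodes themselves, walking forward or backward among siblings needs no separate container object, which is presumably what keeps an iterator over them simple.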
The constructor of CXmlConf initiates the creation of the document tree. It creates the root CElement object and a CScanner object, and calls CElement's Parse function to build the document tree:
m_pScanner= new CScanner(pBuffer,pBuffer+buffersize);
m_pRoot=new CElement(NULL);
...
m_pRoot->Parse(m_pScanner);
...
The CElement::Parse function dispatches the parsing tasks to the associated CBaseParser objects according to the input token and the BNF rules defined in form 1.
bool CElement::Parse(CScanner* pScan)
{
    CStagParser StagParser(this);
    CEtagParser EtagParser(this);

    // Every element starts with a start tag.
    if (!StagParser.Parse(pScan)) return false;

    // Empty-element tags, processing instructions and comments
    // have no content and no end tag.
    if (StagParser.IsEmptyElementTag() ||
        StagParser.IsPITag() ||
        StagParser.IsCommentTag())
    {
        m_StringValue = StagParser.GetNameObj();
        if (m_pParent) m_pParent->AddChildElement(this);
        m_bValid = true;
        return true;
    }

    // Otherwise parse the content, then the end tag.
    CContent contentParser(this);
    if (!contentParser.Parse(pScan)) return false;
    if (!EtagParser.Parse(pScan)) return false;

    // The start-tag and end-tag names must match.
    if (StagParser.GetNameObj() == EtagParser.GetNameObj())
    {
        m_StringValue = StagParser.GetNameObj();
        if (m_pParent) m_pParent->AddChildElement(this);
        m_bValid = true;
    }
    return m_bValid;
}
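The start tag, content, end tag sequence used by Parse can be tried out in isolation. The sketch below is not MiniXML code. It assumes a much-reduced grammar (no attributes, comments or processing instructions), and Cursor, ReadName, ParseElement and IsWellFormed are hypothetical names. It only checks that tags nest and match; it does not build a tree.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical cursor over the input buffer (plays the role of a scanner).
struct Cursor {
    const char* p;
    const char* end;
};

// Reads a tag name made of letters and digits; returns false if it is empty.
bool ReadName(Cursor& c, std::string& name) {
    name.clear();
    while (c.p < c.end && std::isalnum(static_cast<unsigned char>(*c.p)))
        name += *c.p++;
    return !name.empty();
}

// element ::= '<' Name '/>' | '<' Name '>' content '</' Name '>'
// content ::= (element | character data)*
bool ParseElement(Cursor& c) {
    assert(c.p <= c.end);                       // cursor never runs past the buffer
    std::string startName, endName;
    if (c.p >= c.end || *c.p != '<') return false;
    ++c.p;
    if (!ReadName(c, startName)) return false;
    if (c.end - c.p >= 2 && c.p[0] == '/' && c.p[1] == '>') {
        c.p += 2;                               // empty-element tag: done
        return true;
    }
    if (c.p >= c.end || *c.p != '>') return false;
    ++c.p;
    for (;;) {
        while (c.p < c.end && *c.p != '<') ++c.p;   // skip character data
        if (c.end - c.p < 2) return false;          // input ended before the end tag
        if (c.p[1] == '/') break;                   // reached an end tag
        if (!ParseElement(c)) return false;         // nested child element
    }
    c.p += 2;                                   // consume "</"
    if (!ReadName(c, endName)) return false;
    if (c.p >= c.end || *c.p != '>') return false;
    ++c.p;
    return startName == endName;                // names must match
}

// True if the whole string is exactly one well-nested element.
bool IsWellFormed(const std::string& xml) {
    Cursor c = { xml.data(), xml.data() + xml.size() };
    return ParseElement(c) && c.p == c.end;
}
```

Under these assumptions, an input such as `<a><b/>text</a>` is accepted, while `<a></b>` is rejected by the final name comparison, which mirrors the name check at the end of CElement::Parse.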
Access the XML data from the document tree
The MiniXML access interface consists of the following three classes:
CXmlConf, which acquires the XML data and initiates the parsing process to build a document tree.
CElement, which is the core class for accessing all the XML data of an element.
CElementIterator, which is an iterator class for accessing the sibling nodes of a given CElement* pointer.
For demonstration purposes, I wrote a CElement* Clone(CElement* pObj) function (in ElementClone.cpp) to show how to use the public member functions of the CElement and CElementIterator classes. The function returns a pointer to a new CElement object that copies all the members and the sub-node tree structure of the CElement node pointed to by pObj. This function is not a practical way to do real cloning, given its poor performance and memory usage. However, it is a helpful example of how to use MiniXML's interface classes.
CElement* Clone(CElement* p)
{
    if (!p || !p->IsValid()) return NULL;

    // Create the copy with the same element name.
    vector<char> ElementName;
    if (!p->GetElementName(ElementName)) return NULL;
    CElement* retRoot = CElement::CreateNewElement(ElementName.begin());

    // Copy the attributes one by one.
    int AttrCount = p->GetAttributeCount();
    for (int i = 0; i < AttrCount; i++)
    {
        vector<char> AttrName, AttrValue;
        if (!p->GetAttributePairByIndex(i, AttrName, AttrValue))
        {
            retRoot->Delete();
            return NULL;
        }
        retRoot->AddAttributePair(AttrName.begin(), AttrValue.begin());
    }

    // Copy the character data, if any.
    vector<char> charData;
    if (p->GetCharData(charData)) retRoot->SetCharData(charData);

    // Recursively clone each child and attach it to the copy.
    CElementIterator iter(p->GetFirstChild());
    while (iter.IsValid())
    {
        CElement* child = Clone(iter.GetElementPtr());
        if (child) retRoot->AddChildElement(child);
        ++iter;
    }
    return retRoot;
}
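The iteration pattern in Clone (construct from the first child, test IsValid, advance with ++) can be reproduced on a bare sibling list. The following is a sketch under assumptions: IterNode, SiblingIterator and CountNodes are hypothetical names, and only the three iterator operations seen in the Clone example are mirrored.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal node; the real CElement carries much more state.
struct IterNode {
    IterNode* firstChild = nullptr;
    IterNode* nextSibling = nullptr;
};

// Mirrors the operations used with CElementIterator in the Clone example:
// IsValid(), GetElementPtr() and pre-increment.
class SiblingIterator {
public:
    explicit SiblingIterator(IterNode* start) : m_pCurrent(start) {}
    bool IsValid() const { return m_pCurrent != nullptr; }
    IterNode* GetElementPtr() const { return m_pCurrent; }
    SiblingIterator& operator++() {
        assert(IsValid());               // advancing an invalid iterator is a bug
        m_pCurrent = m_pCurrent->nextSibling;
        return *this;
    }
private:
    IterNode* m_pCurrent;
};

// Counts a node plus all of its descendants, depth first.
std::size_t CountNodes(IterNode* node) {
    if (!node) return 0;
    std::size_t total = 1;
    for (SiblingIterator it(node->firstChild); it.IsValid(); ++it)
        total += CountNodes(it.GetElementPtr());
    return total;
}
```

The traversal has the same shape as Clone: visit the node itself, then recurse into each child reached through the iterator.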
Using the code
The best practice for using MiniXML is to create a CXmlConf object from an XML file or string and then use the member functions of CElement and CElementIterator to walk through the resulting document tree. The ParseAndCloneTest function defined in Test.cpp shows the usage:
void ParseAndCloneTest()
{
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        CElement* pRoot = xmlConf.Clone();
        cout << *pRoot;
        pRoot->Delete();
    }
    else cout << "ParseAndCloneTest Failed.";
}
In certain cases, users may want to read or write a specific element of the document tree. MiniXML provides the function CXmlConf::GetRootElement to get a pointer to the first CElement that matches a given element name sequence. For example, for the XML input:
<?xml version="1.0" encoding="UTF-8"?>
<Element1 attr="haha" attr2="haha2" attr3="hahah3">
    <SubElement1 Attr="Book" Attr2="Pen" Attr3="keyboard"/>
    <SubElement1 Attr="Book2" Attr2="Pen" Attr3="keyboard">
        <SubElement2>
            <SubElement2Sub attr="Beijing" attr2="ShangHai"> </SubElement2Sub>
            <SubElement2Sub attr="XiAn" attr2="NanJing"> </SubElement2Sub>
        </SubElement2>
    </SubElement1>
</Element1>
Users can call CXmlConf::GetRootElement("Element1.SubElement1.SubElement2.SubElement2Sub") to get a pointer to the first SubElement2Sub element (the one whose attr attribute is "Beijing"). The following example, ElementModifyTest, gets a pointer to the second SubElement2Sub element, modifies it, and finally writes the document tree to the file "OutputXML.xml".
void ElementModifyTest()
{
    CXmlConf xmlConf("sampleXML.xml");
    if (xmlConf)
    {
        // Locate the first SubElement2Sub element.
        CElementIterator iter(xmlConf.GetRootElement("Element1."
            "SubElement1.SubElement2.SubElement2Sub"));
        if (iter.IsValid())
        {
            // Move on to the second SubElement2Sub element.
            ++iter;
            if (iter.IsValid())
            {
                iter.GetElementPtr()->ModifyAttribute("attr", "HuNan");
                CElement* p = CElement::CreateNewElement("NewElement");
                p->AddAttributePair("Attribute", "hahahaha");
                p->AddAttributePair("Attribute2", "hahahaha2");
                iter.GetElementPtr()->AddChildElement(p);
            }
        }
        // Write the modified document tree to a file.
        ofstream ofs("OutputXML.xml");
        ofs << xmlConf;
    }
}
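A dotted-name lookup of the kind GetRootElement performs could be sketched as follows. This is an illustration, not the MiniXML implementation: PathNode, FindByPath and PathLookupExample are hypothetical names, and the choice to try later siblings when an earlier match has no matching descendants is an assumption (the sample XML needs it, since the first SubElement1 has no children).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for CElement with just a name and tree links.
struct PathNode {
    std::string name;
    PathNode* firstChild = nullptr;
    PathNode* nextSibling = nullptr;
    explicit PathNode(const std::string& n) : name(n) {}
};

// Searches the sibling list starting at `first` for a node that matches the
// first segment of `path` and whose subtree matches the remaining segments.
PathNode* FindByPath(PathNode* first, const std::string& path) {
    const std::string::size_type dot = path.find('.');
    const std::string head = path.substr(0, dot);   // whole path if no dot
    for (PathNode* n = first; n; n = n->nextSibling) {
        if (n->name != head) continue;
        if (dot == std::string::npos) return n;      // last segment matched
        PathNode* hit = FindByPath(n->firstChild, path.substr(dot + 1));
        if (hit) return hit;                         // otherwise try the next sibling
    }
    return nullptr;
}

// Usage sketch on a two-level tree.
void PathLookupExample() {
    PathNode root("Element1"), child("SubElement1");
    root.firstChild = &child;
    assert(FindByPath(&root, "Element1.SubElement1") == &child);
    assert(FindByPath(&root, "Element1.Missing") == nullptr);
}
```

Returning the first match fits the usage in ElementModifyTest, where the caller wraps the result in an iterator and steps to later siblings itself.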
Points of interest
The CStringValue class is used throughout the MiniXML classes to hold string information such as element names, element character data, attribute names and attribute values. CStringValue offers two different ways to keep the string information. During parsing, CStringValue does not copy the string from the input XML data into its internal buffer. Instead, it stores the start and end addresses of the string in its members m_pBegin and m_pEnd. This avoids unnecessary buffer allocation and string copying during parsing. On the other hand, when users modify the document tree, for example by changing an attribute value, CStringValue behaves like a regular string class and uses its internal buffer to hold the new string. The CStringValue class is defined in MiniParser.h.
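The two storage modes can be illustrated with a small sketch. StringRef and all of its member functions are hypothetical names, not the actual CStringValue interface; only m_pBegin and m_pEnd are taken from the description above. The sketch tracks the active mode with a flag so that copying an object never leaves pointers aimed at another object's buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: a string that either refers to a range inside the
// parse buffer (no copy) or owns a private copy once it has been modified.
class StringRef {
public:
    // Referring mode: remember where the text lives, copy nothing.
    StringRef(const char* begin, const char* end)
        : m_pBegin(begin), m_pEnd(end), m_owns(false) {
        assert(begin <= end);
    }

    // Owning mode: from now on the text lives in the internal buffer.
    void Assign(const std::string& text) {
        m_owned = text;
        m_owns = true;
    }

    bool OwnsBuffer() const { return m_owns; }
    std::size_t Length() const {
        return m_owns ? m_owned.size()
                      : static_cast<std::size_t>(m_pEnd - m_pBegin);
    }
    const char* Begin() const { return m_owns ? m_owned.data() : m_pBegin; }
    const char* End() const { return Begin() + Length(); }
    std::string ToString() const { return std::string(Begin(), End()); }

    // Compares contents, regardless of which mode each side is in.
    bool operator==(const StringRef& rhs) const {
        return Length() == rhs.Length() &&
               std::equal(Begin(), End(), rhs.Begin());
    }

private:
    const char* m_pBegin;   // start of the text inside the parse buffer
    const char* m_pEnd;     // one past the end of that text
    bool m_owns;            // true once Assign() has been called
    std::string m_owned;    // internal buffer used in owning mode
};
```

In referring mode, the object is only valid while the parse buffer stays alive, so a design like this requires the parser to keep the input buffer for the lifetime of the tree.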
Richard Lin is a senior software engineer in Silicon Valley.
Richard Lin was born in Beijing and came to the US in the fall of 1995. He began his software career in the Bay Area of California in 1997. He has worked on many interesting projects, including manufacturing testing systems, wireless AP firmware and applications, email anti-virus systems and personal firewalls. He loves playing Go (WeiQi in Chinese) and soccer in his spare time. He has a beautiful wife and a cute daughter and enjoys his life in San Jose, California.