SequelMax: C++ XML SAX Parser

12 Apr 2016, CPOL, 5 min read
A new C++ SAX library to simplify parsing

Introduction

SAX (Simple API for XML) is an event-based sequential access parser API developed by the XML-DEV mailing list for XML documents. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM). Where the DOM operates on the document as a whole, SAX parsers operate on each piece of the XML document sequentially.
--From Wikipedia, Simple API for XML

SAX works well for simple XML. For anything more complex, the programmer has to write an event dispatcher for every node from which data is extracted, and maintaining state across those event handlers is an unavoidable chore. With C++11 standardized and compilers implementing its features since 2011, it is time to give SAX a little C++11 love. By building the new library around C++11 lambdas, we take the pain out of SAX: user code simply registers a handler for each event of interest, and there is no need to write a custom event dispatcher. Note: the library now also supports C++98 through Boost function; you can switch between Boost and standard function in the config file.
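
To make the state-keeping chore concrete, here is a minimal sketch of a hand-written handler in the classic SAX style. SaxHandler and NameCollector are hypothetical names invented for this illustration; they do not belong to SequelMax or to any real SAX library, and no parser is included.

```cpp
#include <string>
#include <vector>

// Hypothetical minimal SAX-style callback interface (not a real library's API).
struct SaxHandler
{
    virtual ~SaxHandler() {}
    virtual void StartElement(const std::string& name) = 0;
    virtual void EndElement(const std::string& name) = 0;
    virtual void Characters(const std::string& text) = 0;
};

// Collects the text of every <Name> found inside an <Employee>.
// The handler has to remember by itself where it is in the document.
class NameCollector : public SaxHandler
{
public:
    NameCollector() : inEmployee(false), inName(false) {}

    void StartElement(const std::string& name)
    {
        if (name == "Employee") inEmployee = true;
        else if (name == "Name" && inEmployee) inName = true;
    }
    void EndElement(const std::string& name)
    {
        if (name == "Name") inName = false;
        else if (name == "Employee") inEmployee = false;
    }
    void Characters(const std::string& text)
    {
        if (inName) names.push_back(text);
    }

    std::vector<std::string> names;

private:
    bool inEmployee; // state flag: currently inside <Employee>
    bool inName;     // state flag: currently inside <Name> within <Employee>
};
```

Every additional element of interest needs another flag and another branch in each callback, which is the bookkeeping that registering one functor per element path avoids.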

Writing XML

Before showing how to use SequelMax to read XML, we need an XML file, so we first populate some data and save it as XML. We will use XMLWriter for this task; it writes straight to XML without building a DOM. This is the Employee structure used for the demo.

C++
struct Employee
{
    int EmployeeID;
    int SupervisorID;
    std::string Name;
    std::string Gender;
    double Salary;
    std::string Comment;
};

We fill the vector with three Employee objects.

C++
void PopulateData(std::vector<Employee>& vec)
{
    Employee emp1;
    emp1.EmployeeID = 1286;
    emp1.SupervisorID = 666;
    emp1.Name = "Amanda Dion";
    emp1.Salary = 2200.0;
    emp1.Gender = "Female";
    emp1.Comment = "Hardworking employee!";

    Employee emp2;
    emp2.EmployeeID = 1287;
    emp2.SupervisorID = 666;
    emp2.Name = "John Smith";
    emp2.Salary = 3200.0;
    emp2.Gender = "Male";
    emp2.Comment = "Hardly working employee!";

    Employee emp3;
    emp3.EmployeeID = 1288;
    emp3.SupervisorID = 666;
    emp3.Name = "Sheldon Cohn";
    emp3.Salary = 5600.0;
    emp3.Gender = "Male";

    vec.clear();
    vec.push_back(emp1);
    vec.push_back(emp2);
    vec.push_back(emp3);
}

Notice that only emp1 and emp2 have comments. We save to XML using the code below.

C++
bool WriteDoc(const std::string& file, std::vector<Employee>& vec)
{
    using namespace SequelMax;
    XMLWriter w;
    if(w.Open(file, FT_UTF8, NEW, "    "))
    {
        w.WriteProcessingInstruction("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        w.WriteStartElem("Employees");
    
        for(size_t i=0; i<vec.size(); ++i)
        {
            w.WriteStartElem("Employee");
                // writing attributes
                w.WriteAttr("EmployeeID", vec[i].EmployeeID);
                w.WriteAttr("SupervisorID", vec[i].SupervisorID);
                
                // writing elements
                w.WriteElement("Name", vec[i].Name);
                w.WriteElement("Salary", vec[i].Salary);
                w.WriteElement("Gender", vec[i].Gender);

                // writing comment if any
                if(vec[i].Comment.empty()==false)
                    w.WriteComment(vec[i].Comment);
            w.WriteEndElem();
        }
        w.WriteEndElem();
    }
    else
        return false;
        
    return true;
}

We request pretty-printed XML by passing an indentation string of four spaces to the XMLWriter::Open function. WriteDoc has 19 lines of code while, as we will see later, ReadDoc has 16 lines. Each XML-writing call is explained below and illustrated with example output.

w.WriteStartElem("Employees"); writes the start element

XML
Example: <Employees>

w.WriteEndElem(); writes the end element if the element has children. WriteEndElem knows which element name to write because all names are kept in a LIFO stack: a start element pushes its name onto the stack and an end element pops it.

XML
Example: </Employees>

The same w.WriteEndElem(); call closes the start element as an empty element if it does not have any children.

XML
Example: <Employees/>

w.WriteAttr("EmployeeID", vec[i].EmployeeID); writes an attribute, provided the start element has not been closed yet.

XML
Example: EmployeeID="..."
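
The interplay between the start element, attributes, the end element and the name stack can be imitated in a few lines. The sketch below is a toy, assuming nothing about SequelMax internals: MiniWriter and its tagOpen flag are hypothetical, and indentation and character escaping are left out.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical toy writer; it only imitates the behaviour described above
// and is not SequelMax's XMLWriter implementation.
class MiniWriter
{
public:
    MiniWriter() : tagOpen(false) {}

    void WriteStartElem(const std::string& name)
    {
        CloseStartTag();          // finish the parent's start tag, if pending
        out << "<" << name;       // leave the tag open so attributes can follow
        names.push_back(name);    // push the name for the matching end element
        tagOpen = true;
    }
    void WriteAttr(const std::string& name, const std::string& value)
    {
        if (tagOpen)              // attributes can only go into an open start tag
            out << " " << name << "=\"" << value << "\"";
    }
    void WriteEndElem()
    {
        if (names.empty()) return;
        if (tagOpen)              // no children were written: self-close
            out << "/>";
        else
            out << "</" << names.back() << ">";
        names.pop_back();         // pop the name pushed by WriteStartElem
        tagOpen = false;
    }
    std::string Str() const { return out.str(); }

private:
    void CloseStartTag()
    {
        if (tagOpen) { out << ">"; tagOpen = false; }
    }

    std::ostringstream out;
    std::vector<std::string> names; // used as a LIFO stack of element names
    bool tagOpen;                   // true while the last start tag lacks its '>'
};
```

Calling WriteStartElem("Employees") followed directly by WriteEndElem() yields the self-closed form, while writing a child in between makes the same WriteEndElem() call emit a separate end tag.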

w.WriteElement("Name", vec[i].Name); writes an element with text

XML
Example: <Name>George Solomon</Name>

Actually, w.WriteElement("Name", vec[i].Name); is a shortcut for the more verbose version below:

C++
w.WriteStartElem("Name");
w.WriteElemText(vec[i].Name);
w.WriteEndElem();

XMLWriter::WriteAttr, XMLWriter::WriteElement and XMLWriter::WriteElemText are template functions. You can overload the << operator for SequelMax::ostream to write arbitrary data types through these three template functions. The XML produced by WriteDoc is shown below.

XML
<?xml version="1.0" encoding="UTF-8"?>
<Employees>
    <Employee EmployeeID="1286" SupervisorID="666">
        <Name>Amanda Dion</Name>
        <Salary>2200</Salary>
        <Gender>Female</Gender>
        <!--Hardworking employee!-->
    </Employee>
    <Employee EmployeeID="1287" SupervisorID="666">
        <Name>John Smith</Name>
        <Salary>3200</Salary>
        <Gender>Male</Gender>
        <!--Hardly working employee!-->
    </Employee>
    <Employee EmployeeID="1288" SupervisorID="666">
        <Name>Sheldon Cohn</Name>
        <Salary>5600</Salary>
        <Gender>Male</Gender>
    </Employee>
</Employees>
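
As mentioned earlier, the writer's template functions accept any type for which a << operator exists. The sketch below shows the general pattern only. It uses std::ostream as a stand-in for SequelMax::ostream, which is an assumption of this example, and Date and FormatAttr are hypothetical names that are not part of the library.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical user-defined type we would like to store as an attribute value.
struct Date
{
    int Year;
    int Month;
    int Day;
};

// Stream insertion operator for Date; std::ostream stands in for the
// library's own stream class here (an assumption of this sketch).
std::ostream& operator<<(std::ostream& os, const Date& d)
{
    return os << d.Year << '-' << d.Month << '-' << d.Day;
}

// Hypothetical helper in the spirit of a templated WriteAttr: any type with
// a suitable operator<< can be turned into attribute text.
template <typename T>
std::string FormatAttr(const std::string& name, const T& value)
{
    std::ostringstream oss;
    oss << name << "=\"" << value << "\"";
    return oss.str();
}
```

With these definitions, FormatAttr("HireDate", d) works for a Date just as FormatAttr("EmployeeID", 1286) works for an int, because the template leaves the formatting to whichever << overload matches.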

Reading XML

In this section, we focus on how to use SequelMax to read XML. Here is the C++11 code first, followed by an explanation.

C++
bool ReadDoc(const std::string& file, std::vector<Employee>& vec)
{
    using namespace SequelMax;
    Document doc;

    doc.RegisterStartElementFunctor("Employees|Employee", [&vec](Element& elem)->void {
        Employee emp;
        emp.EmployeeID = elem.GetAttrInt32("EmployeeID", 0);
        emp.SupervisorID = elem.GetAttrInt32("SupervisorID", 0);
        vec.push_back(emp);
    });
    doc.RegisterEndElementFunctor("Employees|Employee|Name", [&vec](const std::string& text)->void {
        vec.back().Name = text;
    });
    doc.RegisterEndElementFunctor("Employees|Employee|Gender",[&vec](const std::string& text)->void {
        vec.back().Gender = text;
    });
    doc.RegisterEndElementFunctor("Employees|Employee|Salary",[&vec](const std::string& text)->void {
        // Salary is a double, so convert to double rather than int
        vec.back().Salary = boost::lexical_cast<double>(text);
    });
    doc.RegisterCommentFunctor("Employees|Employee", [&vec](const std::string& text)->void {
        vec.back().Comment = text;
    });

    return doc.Open(file);
}

The Document class reads the XML and invokes the event handlers we register. Like the XMLWriter class, the Document class keeps a LIFO stack of element names as it parses the XML. RegisterStartElementFunctor registers a functor for a start element event; the key lists the elements in the order they are encountered: Employees first, then Employee. We capture vec by reference in the lambda. The lambda takes an Element parameter, which holds the attributes. For end elements, we typically call RegisterEndElementFunctor when we want the text in the element or need to know that the end of the element has been reached. RegisterCommentFunctor and RegisterCDataFunctor are similar in that their only parameter is a string. The C++98 code to read the XML using Boost function and bind is below.

C++
void ReadEmployee(SequelMax::Element& elem, std::vector<Employee>& vec)
{
    Employee emp;
    emp.EmployeeID = elem.GetAttrInt32("EmployeeID", 0);
    emp.SupervisorID = elem.GetAttrInt32("SupervisorID", 0);
    vec.push_back(emp);
}
void ReadName(const std::string& text, std::vector<Employee>& vec)
{
    vec.back().Name = text;
}
void ReadGender(const std::string& text, std::vector<Employee>& vec)
{
    vec.back().Gender = text;
}
void ReadSalary(const std::string& text, std::vector<Employee>& vec)
{
    // Salary is a double, so convert to double rather than int
    vec.back().Salary = boost::lexical_cast<double>(text);
}
void ReadComment(const std::string& text, std::vector<Employee>& vec)
{
    vec.back().Comment = text;
}
bool ReadDoc(const std::string& file, std::vector<Employee>& vec)
{
    using namespace SequelMax;
    Document doc;

    doc.RegisterStartElementFunctor("Employees|Employee", boost::bind(ReadEmployee, _1, 
        boost::ref(vec)));
    doc.RegisterEndElementFunctor("Employees|Employee|Name", boost::bind(ReadName, _1, 
        boost::ref(vec)));
    doc.RegisterEndElementFunctor("Employees|Employee|Gender", boost::bind(ReadGender, _1, 
        boost::ref(vec)));
    doc.RegisterEndElementFunctor("Employees|Employee|Salary", boost::bind(ReadSalary, _1, 
        boost::ref(vec)));
    doc.RegisterCommentFunctor("Employees|Employee", boost::bind(ReadComment, _1, 
        boost::ref(vec)));

    return doc.Open(file);
}
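
In the C++98 version above, every bind call wraps vec in boost::ref. The sketch below shows the effect of leaving that wrapper out. To compile without Boost, it uses std::bind and std::ref from the C++11 <functional> header as stand-ins for boost::bind and boost::ref. AppendName, BindByCopy and BindByRef are hypothetical helpers written for this illustration.

```cpp
#include <functional>
#include <string>
#include <vector>

// Free function handler in the same style as ReadName above.
void AppendName(const std::string& text, std::vector<std::string>& names)
{
    names.push_back(text);
}

// Binds the vector by value: the functor owns a private copy.
std::function<void(const std::string&)> BindByCopy(std::vector<std::string>& names)
{
    using namespace std::placeholders;
    return std::bind(AppendName, _1, names);
}

// Binds the vector through std::ref: the functor updates the caller's vector.
std::function<void(const std::string&)> BindByRef(std::vector<std::string>& names)
{
    using namespace std::placeholders;
    return std::bind(AppendName, _1, std::ref(names));
}
```

Invoking the functor from BindByCopy leaves the caller's vector untouched because the name is appended to the stored copy, whereas the functor from BindByRef appends to the original vector.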

We have to use boost::ref to pass our vector by reference; otherwise boost::bind stores a copy of the vector. Below is the display function.

C++
void DisplayDoc(const std::vector<Employee>& vec)
{
    for(size_t i=0; i<vec.size(); ++i)
    {
        std::cout << "Name: " << vec[i].Name << std::endl;
        std::cout << "EmployeeID: " << vec[i].EmployeeID << std::endl;
        std::cout << "SupervisorID: " << vec[i].SupervisorID << std::endl;
        std::cout << "Gender: " << vec[i].Gender << std::endl;
        std::cout << "Salary: " << vec[i].Salary << std::endl;
        if(vec[i].Comment.empty()==false)
            std::cout << "Comment: " << vec[i].Comment << std::endl;
        
        std::cout << std::endl;
    }
}

This is what is displayed. Notice that the third employee has no comment.

Name: Amanda Dion
EmployeeID: 1286
SupervisorID: 666
Gender: Female
Salary: 2200
Comment: Hardworking employee!

Name: John Smith
EmployeeID: 1287
SupervisorID: 666
Gender: Male
Salary: 3200
Comment: Hardly working employee!

Name: Sheldon Cohn
EmployeeID: 1288
SupervisorID: 666
Gender: Male
Salary: 5600

You can also register a functor for processing instructions. The functor has two string parameters: the key and the value.

C++
doc.RegisterProcessingInstructionFunctor([](const std::string& key, const std::string& val)->void {
    std::cout<< "ProcessingInstruction: "<< key << "=" <<val << std::endl;
});

The processing instruction key and value are shown below.

ProcessingInstruction: encoding=UTF-8
ProcessingInstruction: version=1.0

When we register an event handler, the element string and the handler are stored in an STL map of strings to std::function objects. Only functors whose string key is found in the map are invoked. Note: namespaces are not supported. Use the names as they appear in the XML; if a name is prefixed with a namespace in the file, include the prefix in the key as well, for example, "Company:Employees|Employee". The parser engine is adapted from the Portable Elmax DOM parser engine. However, the SequelMax engine does not construct a DOM tree during parsing. A coming version will provide an option to specify wildcards so as to parse some (limited) free-form XML. SequelMax is hosted together with Portable Elmax at Sourceforge. You can read the Portable Elmax article here.
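
The lookup described above can be sketched as a map from path strings to std::function objects, combined with the stack of open element names. This is a toy under stated assumptions, not the SequelMax engine: MiniDispatcher and its member names are hypothetical, the parser that would call OnStartElement and OnEndElement is not shown, and only end element handlers are covered.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy dispatcher; it illustrates the lookup idea only and is
// not the SequelMax engine.
class MiniDispatcher
{
public:
    typedef std::function<void(const std::string&)> TextHandler;

    void RegisterEndElement(const std::string& path, TextHandler handler)
    {
        handlers[path] = handler;
    }

    // Called by a (not shown) parser when a start tag is read.
    void OnStartElement(const std::string& name)
    {
        stack.push_back(name);
    }

    // Called by a (not shown) parser when an end tag is read.
    void OnEndElement(const std::string& text)
    {
        if (stack.empty()) return;
        std::map<std::string, TextHandler>::const_iterator it =
            handlers.find(CurrentPath());
        if (it != handlers.end())
            it->second(text);   // only registered paths trigger a call
        stack.pop_back();
    }

private:
    // Joins the stacked names with '|', e.g. "Employees|Employee|Name".
    std::string CurrentPath() const
    {
        std::string path;
        for (std::size_t i = 0; i < stack.size(); ++i)
        {
            if (i > 0) path += '|';
            path += stack[i];
        }
        return path;
    }

    std::map<std::string, TextHandler> handlers;
    std::vector<std::string> stack; // LIFO stack of open element names
};
```

Because the key is the full path, a handler registered for "Employees|Employee|Name" is not triggered by a Name element elsewhere in the document, and elements without a registered key are skipped after a single map lookup.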

Note: The TestSequelMax.cpp included in the project of the same name uses TSTR for strings, which makes the sample code verbose to read. To fix this, exclude that cpp and include TestSequelMaxAscii.cpp in the project instead.

Conclusion

Elmax and SequelMax make it easy to parse an XML file as long as the user knows which elements to expect. DOM and SAX, on the other hand, can parse XML of unknown structure, but most of the time we are reading our own XML file format, not some unknown XML, and this is where Elmax and SequelMax come in. Each technology has its use in different scenarios. On memory-constrained systems or with big files, SAX or SequelMax makes it possible to parse the file without consuming as much memory as DOM does. Usually, a DOM's memory requirement is 6 to 10 times the file size. Choosing the right tool can simplify the programmer's work, so choose your XML technology wisely. Thank you for reading!

History

  • 2015-06-14: 0.9.5 Beta. Removed Attribute class
  • 2014-05-16: Added VS2005/08 support through Boost function
  • 2013-12-08: Initial Release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Singapore Singapore
Shao Voon is from Singapore. His interest lies primarily in computer graphics, software optimization, concurrency, security, and Agile methodologies.

In recent years, he shifted focus to software safety research. His hobby is writing a free C++ DirectX photo slideshow application which can be viewed here.
