
Some Ideas for Parsing Text Data Formats

8 Aug 2018 | MIT license | 5 min read
Recipes for parsing XML and JSON data files

Introduction

There are basically three ways of parsing data interchange formats such as JSON or XML. The first is deserialization into a data object model. In this approach, the structure of the JSON or XML document is represented by classes, which a programmer can write by hand or generate with a tool. This method gives you the advantages of object-oriented programming and strong typing. Its biggest disadvantage is that deserializing large files takes a long time. I presented some thoughts on this topic in this article.

The second way relies on traversing the document tree, either by manually iterating through nodes or by using a query language such as XPath. In this approach, you can select only the required parts of the file (which improves performance), but you have to convert the data to the proper types yourself. The code is usually longer and less clean than in the first approach.
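As a minimal sketch of this second approach, the snippet below selects nodes with XPath and converts their text to int by hand. The class name, the method name and the sample document shape (clef elements with sign and line children) are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper, not part of any library mentioned in this article.
public static class TreeTraversalExample
{
    // Returns the <line> value of every clef whose <sign> is "G".
    public static List<int> ReadGClefLines(string xml)
    {
        var document = XDocument.Parse(xml);

        return document
            // XPath picks only the nodes we care about...
            .XPathSelectElements("//clef[sign='G']/line")
            // ...but the conversion from text to int is our responsibility.
            .Select(e => int.Parse(e.Value, CultureInfo.InvariantCulture))
            .ToList();
    }
}
```

Note that int.Parse throws on malformed input, so real code would probably prefer int.TryParse and decide what to do with invalid values.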

The third way is necessary for extremely large files (like this one). It relies on reading directly from a stream with a data reader, because loading the whole tree into memory would result in an OutOfMemoryException.

These three methods can be combined to achieve the best compromise between performance and code clarity. For example, you can read specific nodes with XmlReader and deserialize each node into an object without the need to deserialize the whole document at once.
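One possible way to combine the methods is sketched below: XmlReader walks the stream, and XmlSerializer deserializes one node at a time. The Record class, its members and the "record" element name are assumptions made for this illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical data class describing a single <record id="..."><title>...</title></record> node.
[XmlRoot("record")]
public class Record
{
    [XmlAttribute("id")]
    public int Id { get; set; }

    [XmlElement("title")]
    public string Title { get; set; }
}

public static class CombinedParsingExample
{
    // Lazily yields one deserialized Record per <record> element in the input.
    public static IEnumerable<Record> ReadRecords(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(Record));
        using (var reader = XmlReader.Create(input))
        {
            // ReadToFollowing advances the reader until it stands on the next <record>.
            while (reader.ReadToFollowing("record"))
            {
                // ReadSubtree exposes only the current element to the serializer.
                using (var subtree = reader.ReadSubtree())
                {
                    yield return (Record)serializer.Deserialize(subtree);
                }
            }
        }
    }
}
```

Because the method is an iterator, only one record is materialized at a time, and the caller can stop enumerating early without reading the rest of the file.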

In this article, I am going to present some ideas that improve the readability of code that parses XML files. These techniques are used in the open source music notation library Manufaktura.Controls (https://www.codeproject.com/Articles/1252423/Music-Notation-in-NET) and in another, now abandoned, music project. Note that none of the code examples in this article support XML namespaces. I didn’t need namespace support when I wrote that code, but it can be easily implemented if needed.
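For readers who do need namespaces, the sketch below shows two common ways of looking up a child element in LINQ to XML: by local name only, or by fully qualified name. The helper names are hypothetical; this is not how Manufaktura.Controls handles the problem, only an illustration of what namespace support could build on.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helpers for namespace-aware lookups.
public static class NamespaceExample
{
    // Finds a direct child by local name, ignoring whatever namespace it is in.
    public static XElement ChildByLocalName(XElement parent, string localName)
    {
        return parent.Elements().FirstOrDefault(e => e.Name.LocalName == localName);
    }

    // Finds a direct child by namespace URI plus local name.
    public static XElement ChildByQualifiedName(XElement parent, string namespaceUri, string localName)
    {
        XNamespace ns = namespaceUri;   // implicit conversion from string
        return parent.Element(ns + localName);
    }
}
```

With a document that declares a default namespace, a plain parent.Element("sign") call returns null, which is why one of the two lookups above is needed.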

Read by a Data Reader, Parse Only Specific Nodes

In the following example, I read a large XML file with XmlReader and parse only specific tags:

C#
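// Requires: using System.IO; using System.Xml; using System.Xml.Linq;
using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
{
    using (var reader = XmlReader.Create(fs, new XmlReaderSettings { IgnoreWhitespace = true }))
    {
        // Skip ahead to the first <record> element. ReadToFollowing stops exactly on it,
        // so the first record is not skipped even if a comment precedes it.
        reader.ReadToFollowing("record");

        while (true)
        {
            // ReadOuterXml returns the whole element as a string and advances the reader
            // past it. It returns an empty string on a non-element node, for example
            // the closing tag of the parent element or the end of the file.
            var record = reader.ReadOuterXml();
            if (string.IsNullOrWhiteSpace(record)) break;

            // Only this single record is materialized as an XElement.
            var recordElement = XElement.Parse(record);
            ParseRecord(recordElement, db);
        }
    }
}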

The main advantage of this approach is that the XML file is read directly from the file stream, so memory is conserved. The second advantage is that the nodes of interest are read in full as XElements, so they are easier to manipulate and convert into specific objects.

Parser Classes

It is a good idea to implement the strategy pattern for parsing specific types of nodes.

C#
public abstract class MusicXmlParsingStrategy
{
        private static readonly IEnumerable<MusicXmlParsingStrategy> _strategies;

        public abstract string ElementName { get; }

        static MusicXmlParsingStrategy()
        {
            var strategyTypes = typeof(MusicXmlParsingStrategy).GetTypeInfo().Assembly.DefinedTypes
                .Where(t => t.IsSubclassOf(typeof(MusicXmlParsingStrategy)) && !t.IsAbstract);
            List<MusicXmlParsingStrategy> strategies = new List<MusicXmlParsingStrategy>();

            foreach (var type in strategyTypes)
            {
                strategies.Add(Activator.CreateInstance(type.AsType()) as MusicXmlParsingStrategy);
            }
            _strategies = strategies.ToArray();
        }

        public abstract void ParseElement(MusicXmlParserState state, Staff staff, XElement element);

        public static MusicXmlParsingStrategy GetProperStrategy(XElement element)
        {
            return _strategies.FirstOrDefault(s => s.ElementName == element.Name);
        }
 }

In the above example, a specific strategy is selected by matching the node name against the ElementName property. The ParseElement method is then called; it reads the content of the node and adds specific elements such as notes, clefs, barlines, etc. to the staff. Strategies are instantiated by a simple reflection-based mechanism which assumes that every strategy has a parameterless constructor. You can modify the code to use an IoC container of your choice.
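To show how such strategies might be driven, here is a self-contained, simplified sketch of the same idea. All names in it (ElementStrategy, ClefStrategy, BarlineStrategy, ParseMeasure) are hypothetical stand-ins, the output is a plain list of strings instead of a Staff, and the strategies are indexed in a dictionary rather than searched with FirstOrDefault.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Xml.Linq;

// Simplified stand-in for a parsing strategy base class.
public abstract class ElementStrategy
{
    public abstract string ElementName { get; }
    public abstract void Parse(XElement element, List<string> output);

    // Discover all concrete strategies once and index them by element name.
    // Assumes every strategy has a public parameterless constructor.
    private static readonly Dictionary<string, ElementStrategy> Strategies =
        typeof(ElementStrategy).GetTypeInfo().Assembly.DefinedTypes
            .Where(t => t.IsSubclassOf(typeof(ElementStrategy)) && !t.IsAbstract)
            .Select(t => (ElementStrategy)Activator.CreateInstance(t.AsType()))
            .ToDictionary(s => s.ElementName);

    // Returns the strategy registered for the element, or null if there is none.
    public static ElementStrategy Find(XElement element)
    {
        ElementStrategy strategy;
        return Strategies.TryGetValue(element.Name.LocalName, out strategy) ? strategy : null;
    }
}

public class ClefStrategy : ElementStrategy
{
    public override string ElementName { get { return "clef"; } }

    public override void Parse(XElement element, List<string> output)
    {
        output.Add("clef " + (string)element.Element("sign"));
    }
}

public class BarlineStrategy : ElementStrategy
{
    public override string ElementName { get { return "barline"; } }

    public override void Parse(XElement element, List<string> output)
    {
        output.Add("barline");
    }
}

public static class StrategyDispatchExample
{
    // Walks the children of a node and lets the matching strategy handle each one.
    public static List<string> ParseMeasure(XElement measure)
    {
        var output = new List<string>();
        foreach (var child in measure.Elements())
        {
            var strategy = ElementStrategy.Find(child);
            if (strategy != null) strategy.Parse(child, output); // unknown elements are skipped
        }
        return output;
    }
}
```

The dispatch loop never needs to change when a new element type is supported: adding another subclass is enough, because the reflection scan picks it up.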

This is a sample implementation of MusicXmlParsingStrategy:

C#
internal class ClefParsingStrategy : MusicXmlParsingStrategy
{
        public override string ElementName
        {
            get { return "clef"; }
        }

        public override void ParseElement(MusicXmlParserState state, Staff staff, XElement element)
        {

            ClefType typeOfClef = ClefType.GClef;
            int line = 1;

            element.IfElement("sign").HasValue(new Dictionary<string, ClefType> {
                {"G", ClefType.GClef},
                {"C", ClefType.CClef},
                {"F", ClefType.FClef},
                {"percussion", ClefType.Percussion}
            }).Then(v => typeOfClef = v);

            element.IfElement("line").HasValue<int>().Then(v => line = v).Otherwise(s =>
            {
                if (typeOfClef == ClefType.Percussion) line = 2;
            });
                  

            var clef = new Clef(typeOfClef, line);
            element.IfAttribute("number").HasValue<int>().Then(v => 
                            clef.Staff = staff.Part.Staves.ElementAt(v - 1));
            element.IfElement("clef-octave-change").HasValue<int>().Then(c => clef.OctaveChange = c);

            var correctStaff = clef.Staff ?? staff;
            correctStaff.Elements.Add(clef);
        }
}

The implementation of the ParseElement method may look a bit strange because it uses an experimental API that I will discuss now.

Fluent API

The extensions provided in Manufaktura.Core.Xml allow you to parse the contents of an XElement in an intuitive way. The data is processed by queries whose principles are presented in the following diagram:

Image 1

This is a typical usage:

C#
var b = new Barline();
element.IfAttribute("location").HasValue("left")
    .Then(() => b.Location = HorizontalPlacement.Left)
    .Otherwise(r => b.Location = HorizontalPlacement.Right);

element.IfElement("bar-style").HasValue("light-heavy").Then(() => b.Style = BarlineStyle.LightHeavy);
element.IfElement("bar-style").HasValue("none").Then(() => b.Style = BarlineStyle.None);
element.IfElement("bar-style").HasValue("dashed").Then(() => b.Style = BarlineStyle.Dashed);

There are three main extension methods for XElement:

  • IfAttribute – creates an XAttributeHelper that enables you to query for an attribute value,
  • IfElement – creates an XElementHelper that enables you to query for a child element value (there is also an IfDescendant method which performs the query on all descendants, not only first-level children),
  • IfHasElement – returns an IXHelperResult that indicates whether the element exists.

Generally speaking, these methods return either an IXHelper, which enables you to query for a value, or an IXHelperResult, which acts as a container for the returned value.

This mechanism may seem unclear at first glance, so it's best to explain it with examples.

Query for Attribute Value (string type)

C#
element.IfAttribute("location").HasValue("left")
                .Then(() => b.Location = HorizontalPlacement.Left)
                .Otherwise(r => b.Location = HorizontalPlacement.Right);

This code checks whether the element has an attribute named “location”. If the attribute doesn’t exist, the code does nothing. If the attribute exists and has the value “left”, b.Location is set to HorizontalPlacement.Left. Otherwise, b.Location is set to HorizontalPlacement.Right.
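To make these semantics concrete, here is a minimal sketch of how a fluent query with this behavior could be built. It is not the actual Manufaktura.Core.Xml implementation: the AttributeQuery and AttributeQueryResult classes are hypothetical, and passing the raw attribute value to Otherwise is an assumption.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical result container: remembers whether the attribute existed and matched.
public sealed class AttributeQueryResult
{
    private readonly bool _exists;
    private readonly bool _matched;
    private readonly string _rawValue;

    public AttributeQueryResult(bool exists, bool matched, string rawValue)
    {
        _exists = exists;
        _matched = matched;
        _rawValue = rawValue;
    }

    // Runs only when the attribute exists and has the expected value.
    public AttributeQueryResult Then(Action action)
    {
        if (_exists && _matched) action();
        return this;
    }

    // Runs only when the attribute exists but has a different value.
    public AttributeQueryResult Otherwise(Action<string> action)
    {
        if (_exists && !_matched) action(_rawValue);
        return this;
    }
}

// Hypothetical query object created by IfAttribute.
public sealed class AttributeQuery
{
    private readonly XAttribute _attribute;

    public AttributeQuery(XAttribute attribute)
    {
        _attribute = attribute;
    }

    public AttributeQueryResult HasValue(string expected)
    {
        // A missing attribute produces a result on which neither branch fires.
        if (_attribute == null) return new AttributeQueryResult(false, false, null);
        return new AttributeQueryResult(true, _attribute.Value == expected, _attribute.Value);
    }
}

public static class FluentQuerySketch
{
    public static AttributeQuery IfAttribute(this XElement element, string name)
    {
        return new AttributeQuery(element.Attribute(name));
    }
}
```

Each method returns an object, so the calls chain, and the three outcomes (missing, matching, not matching) are decided once in HasValue rather than in every branch.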

Query for Child Element Value using Dictionary of Values

C#
ClefType typeOfClef = ClefType.GClef;
int line = 1;

element.IfElement("sign").HasValue(new Dictionary<string, ClefType> {
                {"G", ClefType.GClef},
                {"C", ClefType.CClef},
                {"F", ClefType.FClef},
                {"percussion", ClefType.Percussion}
            }).Then(v => typeOfClef = v);

This code checks whether there is a child element named “sign”. If the element doesn’t exist, no action is performed. If it does exist, its value is mapped to an enum value using the provided dictionary. If the mapping succeeds, the Then method receives the mapped value and assigns it to the typeOfClef variable.
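For comparison, the same lookup can be written with plain LINQ to XML. The sketch below is a hypothetical helper, not part of the library; the ClefKind enum stands in for ClefType, and trimming the element text before the lookup is an assumption.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Stand-in enum for this illustration.
public enum ClefKind { G, C, F, Percussion }

public static class ValueMappingExample
{
    // Returns true and the mapped value when the child exists and its text is a key in the map.
    public static bool TryMapChild<T>(
        XElement parent, string childName, IReadOnlyDictionary<string, T> map, out T value)
    {
        var child = parent.Element(childName);
        if (child != null && map.TryGetValue(child.Value.Trim(), out value))
        {
            return true;
        }

        // Missing element or unknown text: report failure and leave the caller's default in place.
        value = default(T);
        return false;
    }
}
```

A caller would write something like `if (ValueMappingExample.TryMapChild(element, "sign", map, out kind)) typeOfClef = kind;`, which is the imperative counterpart of the Then call above.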

Dictionary of values can also be used when parsing Boolean values:

C#
element.IfAttribute("print-object").HasValue(new Dictionary<string, bool> {
                {"yes", true}, {"no", false}}).Then(m => builder.IsVisible = m);

Query for Child Element with Integer Value

C#
element.IfElement("line").HasValue<int>().Then(v => line = v).Otherwise(s =>
            {
                if (typeOfClef == ClefType.Percussion) line = 2;
            });

This code checks whether the element has a child element named “line”. If the element doesn’t exist, no action is taken. If the element has a value that can be parsed as an int, the local variable line is set to the parsed value. Otherwise, line is set to 2, but only if typeOfClef == ClefType.Percussion.

Check if Child Element Exists

C#
notationsNode.IfElement("fermata").Exists().Then(() => builder.HasFermataSign = true);

If notationsNode has a “fermata” element, builder.HasFermataSign is set to true. If the element doesn’t exist, no action is taken.

Nesting Queries

Queries can be nested as in this example:

C#
notationsNode.IfHasElement("dynamics").Then(d =>
{
    var dir = new Direction();
    d.IfAttribute("default-y").HasValue<int>().Then(v =>
    {
        dir.DefaultYPosition = v;
        dir.Placement = DirectionPlacementType.Custom;
    });

    d.IfAttribute("placement").HasValue(new Dictionary<string, DirectionPlacementType>
    {
        {"above", DirectionPlacementType.Above},
        {"below", DirectionPlacementType.Below}
    }).Then(v =>
    {
        if (dir.Placement != DirectionPlacementType.Custom) dir.Placement = v;
    });

    foreach (XElement dynamicsType in d.Elements())
    {
        dir.Text = dynamicsType.Name.LocalName;
    }
    staff.Elements.Add(dir);
});

Extracting Values from Containers

By default, IXHelper allows us to use parsed values in Then and Otherwise methods, but we can also return the value with AndReturnResult and ThenReturnResult methods:

C#
var invMordentNode = ornamentsNode
    .IfElement("inverted-mordent")
    .Exists()
    .Then(e => builder.Mordent = new Mordent() { IsInverted = true })
    .AndReturnResult();

invMordentNode.IfAttribute("placement").HasValue
              (new Dictionary<string, VerticalPlacement> {
    {"above", VerticalPlacement.Above},
    {"below", VerticalPlacement.Below}
}).Then(v => builder.Mordent.Placement = v);

In the above example, the Exists() method returns an XHelperExistsResult, which contains a value of type XElement, so AndReturnResult returns the XML node. The HasValue method returns an XHelperHasValueResult, which contains the value of the node.

Conclusion

The last method of parsing XML documents is certainly not the most efficient, because the same nodes can be parsed multiple times, but it can make the code read almost like natural language. In my opinion, it can be very useful in unit testing, prototyping and simple business logic. It is also useful if your data model has a different structure than the XML file. It can also achieve decent performance when combined with the other methods.

License

This article, along with any associated source code and files, is licensed under The MIT License


Written By
Poland
I graduated from Adam Mickiewicz University in Poznań, where I completed an MA degree in computer science (MA thesis: Analysis of Sound of Viola da Gamba and Human Voice and an Attempt of Comparison of Their Timbres Using Various Techniques of Digital Signal Analysis) and a bachelor's degree in musicology (BA thesis: Continuity and Transitions in European Music Theory Illustrated by the Example of 3rd part of Zarlino's Institutioni Harmoniche and Bernhard's Tractatus Compositionis Augmentatus). I also graduated from a solo singing class at the Fryderyk Chopin Musical School in Poznań. I'm a self-taught composer and a member of the informal international group Vox Saeculorum, which gathers composers whose common goal is to revive the old (mainly baroque) styles and composing traditions in contemporary written music. I'm an annual participant in the International Summer School of Early Music in Lidzbark Warmiński.
