Using JsonTextReader to Stream Huge JSON

honey the codewitch

Sep 10, 2019

CPOL

How to process large JSON data using a streaming reader

Introduction

Note: This covers one aspect of my Json library. For more, please see my main Json article.

Loading JSON into an object tree is a convenient abstraction, but it breaks down with large amounts of data, because the whole document has to fit in memory. This is where the easy, JSON path selectable trees stop being practical.

Ideally, you want to move to the sections of the document you care about, load just those into an object tree, work on that small subset, and then continue. That way, you're never loading the entire document into memory at one time.

Enter JsonTextReader.

Background

Like all my parser libraries, this one exposes a streaming pull-parser interface to JSON that works quite a bit like System.Xml.XmlReader, with a few extra features. It's suitable for streams that cannot seek and for streams that are extremely large. It supports forward-only navigation through the document.

The general idea is to call Read() in a loop, and then check the NodeType property, and work with the Value property or the RawValue property. The latter just returns the string data directly as it came from the stream, while the former "cooks" it by turning it into its corresponding .NET type.

For example, here's the code for printing out every node in the document as a list.

using (var reader = JsonTextReader.CreateFrom(@"..\..\data.json"))
{
    while(reader.Read())
    {
        Console.Write(reader.NodeType);
        // only keys and values have a Value
        if (JsonNodeType.Value == reader.NodeType 
            || JsonNodeType.Key==reader.NodeType)
            Console.Write(" " + reader.Value);
        Console.WriteLine();
    }
}

If you're familiar with XmlReader/XmlTextReader, the code above should look familiar.

Using the Code

The above isn't very useful, but it illustrates the basic concept. Now let's make it more realistic.

We'll be using the data from http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb.

To load from the URL, you'd do:

var url = "http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb";
using(var reader=JsonTextReader.CreateFromUrl(url))
{
    // stuff here
}

Aside from that, reading is exactly the same as from a file. You can also read from a string using Create(), but don't load huge data into a string. Disposing the reader is important if it was opened on a file or a URL. On a string, it doesn't matter, but it's still good practice.

Remember that the reader is forward only, which complicates things. You cannot move to the parent of a node. If a search fails, the reader has already advanced to the end of the region it was searching, and you can't go back and try again. Because of this, with huge documents you need to know what you're looking for - you can't simply run random access queries against them.
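To make the forward-only consequence concrete, here is a minimal sketch. It does not use JsonTextReader; it swaps in .NET's built-in System.Text.Json Utf8JsonReader, which is also a forward-only pull reader, so that the example is self-contained. The SkipToProperty helper and the tiny inline document are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text;
using System.Text.Json;

static class ForwardOnlyDemo
{
    // Hypothetical helper (not part of any library): advance until the
    // current token is a property name equal to `name`.
    static bool SkipToProperty(ref Utf8JsonReader reader, string name)
    {
        while (reader.Read())
        {
            if (reader.TokenType == JsonTokenType.PropertyName
                && reader.ValueTextEquals(name))
                return true;
        }
        return false; // the input is exhausted
    }

    public static (bool First, bool Second) Run()
    {
        // made-up sample data, small enough to read at a glance
        byte[] json = Encoding.UTF8.GetBytes("{\"id\":2129,\"name\":\"Example\"}");
        var reader = new Utf8JsonReader(json);

        // searching for a field that doesn't exist consumes the whole document
        bool first = SkipToProperty(ref reader, "no_such_field");

        // "name" is in the document, but the reader is already past it
        bool second = SkipToProperty(ref reader, "name");
        return (first, second);
    }

    static void Main()
    {
        var (first, second) = Run();
        Console.WriteLine("no_such_field found: " + first);
        Console.WriteLine("name found afterwards: " + second);
    }
}
```

Both searches should fail here: the first because the field is absent, the second because the failed search left nothing to read. The same reasoning applies to any forward-only reader, which is why the order of your queries matters.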

Now let's do some selecting.

var url = "http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb";
using(var reader=JsonTextReader.CreateFromUrl(url))
{
    if (reader.SkipToField("created_by"))
    {
        // we're currently on the *key*/field name
        // we have to move to the value if we want 
        // just that.
        // finally, we parse the subtree into a
        // tree of objects, and write that to the
        // console, which pretty prints it
        if (reader.Read())  
            Console.WriteLine(reader.ParseSubtree());
        else // below should never execute
            Console.WriteLine("Sanity check failed, key has no value");
    } 
    else
        Console.WriteLine("Not found");
}

We landed on an array. What if we want the second value?

using(var reader=JsonTextReader.CreateFromUrl
("http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb"))
{
    if (reader.SkipToField("created_by"))
    {
        // we're currently on the *key*/field name
        // we have to move to the value if we want 
        // just that.
        if (reader.Read())
        {
            // now, skip to the index we want
            // underneath where we are.
            // finally, we parse the subtree into a
            // tree of objects, and write that to the
            // console, which pretty prints it
            if (reader.SkipToIndex(1))
                Console.WriteLine(reader.ParseSubtree());
            else
                Console.WriteLine("Couldn't find the index.");
        }
        else // below should never execute
            Console.WriteLine("Sanity check failed, key has no value");
    } 
    else
        Console.WriteLine("Not found");
}

Okay, admittedly, even if we remove the comments, that's a lot of code for skipping to two places.

Fortunately, we can shorten it by skipping entire paths:

using(var reader=JsonTextReader.CreateFromUrl
("http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb"))
{
    // skip to "$.created_by[1]" <-- JSON path syntax
    if (reader.SkipTo("created_by",1))
        Console.WriteLine(reader.ParseSubtree()); 
    else
        Console.WriteLine("Not found");
}

Above, we skipped to "created_by" and then to index 1 in a single call. You can chain as many field names and indices as you need, but be careful: if the call fails, it is hard to tell which part of the path it couldn't find.

Finally, if we just wanted the name from that second entry, we'd do:

var url="http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb";
using(var reader=JsonTextReader.CreateFromUrl(url))
{
    // skip to "$.created_by[1].name" <-- JSON path syntax
    if (reader.SkipTo("created_by", 1, "name"))
    {
        // we're currently on the *key*/field name
        // we have to move to the value if we want 
        // just that.
        if (reader.Read())
            Console.WriteLine(reader.ParseSubtree());
        else // below should never execute
            Console.WriteLine("Sanity check failed, key has no value");
    }
    else
        Console.WriteLine("Not found");
}

Occasionally, you won't know the field beforehand; it could be any one of several possible fields. I only have a contrived scenario for this with the dataset we're using, but it looks something like this:

var url = "http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb";
using (var reader = JsonTextReader.CreateFromUrl(url))
{
    // accept either "seasons" or "production_companies" - whichever comes first
    if (reader.SkipToAnyOfFields("seasons", "production_companies"))
    {
        // we're currently on the *key*/field name
        // we have to move to the value if we want 
        // just that.
        if (reader.Read())
            Console.WriteLine(reader.ParseSubtree());
        else // below should never execute
            Console.WriteLine("Sanity check failed, key has no value");
    }
    else
        Console.WriteLine("Not found");
}

If you want to run multiple queries against a document, things get a bit more complicated. With a pull reader, after you've found the first result, you'll be somewhere in the inner branches of the tree, and you need to keep calling Read() to pull yourself back out before you can parse the next section. JsonTextReader provides two helper methods for this, SkipToEndObject() and SkipToEndArray(), but you still need to know when to call them, which means knowing where you ended up in the first place.

var url = "http://api.themoviedb.org/3/tv/2129?api_key=c83a68923b7fe1d18733e8776bba59bb";
using (var reader = JsonTextReader.CreateFromUrl(url))
{
    // skip to "$.created_by[0].name" <-- JSON path syntax
    if (reader.SkipTo("created_by", 0, "name"))
    {
        // we're currently on the *key*/field name
        // we have to move to the value if we want 
        // just that.
        if (reader.Read())
            Console.WriteLine(reader.ParseSubtree());
        else // below should never execute
            Console.WriteLine("Sanity check failed, key has no value");
        // we need to move outward in the tree 
        // so we can read the next array element
        // so we skip the rest of this object
        reader.SkipToEndObject();
        if (reader.Read()) // read past the end of the object
        {
            // we're on the next object so get the name
            reader.SkipToField("name");
            // we're currently on the *key*/field name
            if (reader.Read())
                Console.WriteLine(reader.ParseSubtree());
            else // below should never execute
                Console.WriteLine("Sanity check failed, key has no value");
        }
        else // below should never execute, we didn't expect to reach the end
            Console.WriteLine("Sanity check failed, unexpected end of document");
                    
    }
    else
        Console.WriteLine("Not found");
}

As I said, things get a bit more complicated, because you have to use SkipToEndObject()/SkipToEndArray() to move back outward in the tree before you can run the next query. It can be hard to keep track of where you are in the document, and I find that it sometimes takes experimentation to figure it out. Remember that you can call ParseSubtree() at any point and pretty print the result, which should give you a good read on where you are, although this won't work if the subtree is very large. In that case, you'll have to step through and log as you go, or use the watch window in the debugger. Such is the nature of forward-only streaming over a nested document structure.

As you can see, streaming JSON is much more complicated than simply parsing it, but with JsonTextReader it's not unmanageable. I'd love to support an (extremely restricted) subset of JSON path in the future, but that isn't feasible yet. Fortunately, SkipTo() gives you about 70% of that functionality.
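The "step through and log it" approach can be sketched as a trace that prints every token indented by its nesting depth. This is only an illustration of the idea: it uses .NET's built-in System.Text.Json Utf8JsonReader rather than JsonTextReader, because that reader exposes a CurrentDepth property and keeps the example self-contained. The Trace helper and the inline sample document are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

static class DepthTrace
{
    // Hypothetical helper: returns one line per token, indented by depth,
    // so you can see where a forward-only reader is at each step.
    public static List<string> Trace(string json)
    {
        var lines = new List<string>();
        var reader = new Utf8JsonReader(Encoding.UTF8.GetBytes(json));
        while (reader.Read())
        {
            // only property names and strings are printed with their text
            string text = "";
            if (reader.TokenType == JsonTokenType.PropertyName
                || reader.TokenType == JsonTokenType.String)
                text = " " + reader.GetString();

            string indent = new string(' ', reader.CurrentDepth * 2);
            lines.Add(indent + reader.TokenType + text);
        }
        return lines;
    }

    static void Main()
    {
        // made-up sample shaped like the "created_by" array used above
        string json = "{\"created_by\":[{\"name\":\"A\"},{\"name\":\"B\"}]}";
        foreach (var line in Trace(json))
            Console.WriteLine(line);
    }
}
```

A trace like this makes it easier to see how many object and array ends you must pass before the next query can start, which is the same bookkeeping SkipToEndObject() and SkipToEndArray() help with.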

History

10th September, 2019 - Initial submission