
Big JSON on the Arduino: Parse Huge Data on a Tiny Device

11 Dec 2020 · MIT · 11 min read
Streaming your data and parsing on the go with tiny gadgets
Sometimes, you might need to process a lot of data a little at a time. We do that here with a streaming JSON reader built over my LexContext, ported to the Arduino software platform for 32-bit devices and targeted and tested on an ESP32.

Introduction

JSON is everywhere these days. IoT gadgets are slowly but surely becoming similarly ubiquitous. Naturally, the two are going to collide. There is an Arduino library for processing JSON out there but it doesn't stream very well because it's not a pull parser, which is what this little library gives you. This library is a port of part of the JsonTextReader in my Json library which allows you to read JSON data of virtually any size by examining just a little bit of it at a time.

Furthermore, this article is intended to give you a technique for simplifying lexing and parsing on the fly over a streaming source. To that end, I have also ported LexContext to this platform and backported the JsonReader offering to use it.

You might think it strange to port something from C# to C++, but these two classes were perfect candidates to port to the Arduino SDK because they were already simple, small and fast. They just needed a little bit of witchcraft to transform them to C++ and make them function on a tiny device.

With these handy widgets, you will now be able to scan and parse big JSON freely, or build your own efficient parsers over a streaming source.

Update: Small code cleanup, large addition to article exploring the inner workings

Update 2: Improved error handling, bug fixes. Note that the code in the article does not reflect these changes. The error handling significantly clutters the code, so I decided to leave it out here. Use lastError() to get the error code and value() to get the error text.

Update 3: Bugfix with reporting incorrect error message during some out of memory conditions

Update 4: Bugfix with skipping (partial parsing) - removal of non-canonical skip since it wasn't needed. Changed name of Key to Field for consistency. Updated article code.

Conceptualizing this Mess

Pull parsing works by grabbing a little bit of a structured document at a time and doing enough bookkeeping to know what you're on top of and what your next move is. This usually involves building a finite state machine to do the heavy lifting. That's fancy language for a switch/case over an integer member variable that we update as we go:

C++
int _state = -1; // initial state

bool read() {
   switch(_state) {
      case -1:   // initial
         _state = 0; 
         // fall through
      case 0:    // thing A
         // do thing A
         _state = 1;
         return true;
      case 1:    // repeat thing B or end
         // do thing b or end
         if(textUnderCursorIsB()) {
            // do thing B
            _state = 1;
            return true;
         } 
         // done
         _state = -2;
         return false;
      default:   // finished (-2) or unknown state
         return false;
   }
}

It's spaghetti. It's not very friendly to humans but CPUs eat this stuff up. We're working pretty close to the metal on these little devices so complications like state machines are justifiable if they are efficient.

The upside, though, is that we can now call it iteratively, simply by doing this:

C++
while(parser.read()) { ... }

Each call to read() executes a step of the parse. The general idea is we set up the state machine so that read() will return true until there is no more data to be read.

That's parsing, but before we even get that far, we need a way to manage a streaming cursor. Previously, LexContext wrapped things like TextReader or an IEnumerable<char> source. Now, however, we simply wrap the Arduino Stream class, since classes like File and WiFiClient derive from it.

The LexContext provides facilities for basic operations over a streaming source: capturing the current character under the cursor, advancing along the input, reading and skipping whitespace, or reading or skipping until a certain character is encountered. When we create one, we instantiate it with a fixed-length buffer to hold captured data. This buffer should be as long as the maximum amount of data you anticipate working with at once. For JSON data, this might be the length of a field name or scalar value.

A parser can then use this to manage its input while parsing.
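
Since anything derived from Stream will do, the same LexContext can lex a file off an SD card or a network socket with no other changes. Here's a minimal sketch of that idea; the WiFiClient named client is hypothetical and not part of the demo:

C++
LexContext<256> lc;

File file = SD.open("/data.json", FILE_READ);
lc.begin(file);              // lex over a file on the SD card...

// WiFiClient client = ...;  // a hypothetical, already-connected socket
// lc.begin(client);         // ...or over a network connection instead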

Coding this Mess

To set this up, wire an SD card reader into your 32-bit device's primary SPI port and primary CS/SS. Copy the included data.json file to the root directory of an SD card formatted as FAT32, and then you can run the demo.

LexContext

LexContext works such that current() always retrieves the current character under the cursor, while advance() moves to the next position and returns that character. capture() captures a character to the buffer and captureBuffer() gives us a string from that buffer. line(), column() and position() track our cursor's location. We start by declaring it with a capture buffer size:

C++
LexContext<1024> lc; // allocate 1kB for the capture

Here's an example of using it to read from the serial port until a non-digit is encountered:

C++
lc.begin(Serial); // Serial is our input source
while (LexContext<1024>::EndOfInput != lc.advance() && isdigit((char)lc.current())) {
  Serial.print((char)lc.current());
}
Serial.println();

Above, we initialize it with Serial as the input source. We then advance, and check the character to see whether it's the end of input marker and then whether it's a digit. If it's a digit, we print it and keep going. Keep in mind that we advance before doing anything else: the cursor must be primed by advancing once after it is initialized. If your routine needs to ensure this has happened but can't know whether it already did, you can call ensureStarted().

The API is essentially the same as the C# API, with the casing style changed to fit. Please see that article for more details. The main difference is that you declare it with a capture buffer size and then call begin() with the input source. Keep in mind that unlike the C# API, there is no corresponding close() mechanism here. The input source must be closed from outside this class when you're finished with it, due to differences in the underlying architecture.
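
Here's a small sketch of capture() and captureBuffer() in use (not part of the demo): it collects a run of digits from Serial into the capture buffer and prints them as a single string.

C++
LexContext<64> lc;
lc.begin(Serial);
lc.ensureStarted();                  // prime the cursor if that hasn't happened yet
while (LexContext<64>::EndOfInput != lc.current() && isdigit((char)lc.current())) {
  if (!lc.capture())                 // append the current character to the buffer
    break;                           // buffer is full
  lc.advance();
}
Serial.println(lc.captureBuffer());  // the captured digits as a C string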

JsonReader

Now that we've seen our cursor management, let's come back around to the JSON parsing. Using pull parsers is very efficient but takes some getting used to. We saw above that we read partial parses in a loop until there's no more data. Well inside that loop is where the interesting things happen.

First, we probably want to examine the nodeType() to see what kind of node we are on. This can be Initial, Value, Field, Array, EndArray, Object, EndObject, EndDocument or Error. These constants are accessed off the JsonReader class itself. If it's a Value node, we might want to examine the valueType() as well which can be String, Boolean, Number, or Null. If it's a String, most likely you'll call undecorate() to remove the quotes and translate the escapes to real characters. Note that after you do this, future calls to valueType() on this same node are unreliable. Finally, you can call value() to get the value as a char*, numericValue() to get it as a double, or booleanValue() to get it as a bool. Note that undecorate() makes calls to these unreliable as well. The reason undecorate() affects all these functions is that it modifies the string value in place to save space.

Let's take a look at the provided ino file:

C++
#include <SD.h>
#include <Json.h>

// for easier access
typedef JsonReader<2048> JsonReader2k;

// our reader. if we declared this locally, it would require slightly over 2k of stack
JsonReader2k jsonReader;

void dumpJsonFile() {
  // open data.json off the SD reader. You wired in an SD reader, right?
  File file = SD.open("/data.json", FILE_READ);
  if (!file) {
    Serial.println(F("/data.json not found or could not be opened"));
    while (true); // halt - no, no you did not. or you didn't insert the card
  }
  // initialize the reader with our file
  jsonReader.begin(file);
  // pull parsers return portions of the parse which you retrieve
  // by calling their parse/read method in a loop.
  while (jsonReader.read()) {
    // what kind of JSON element are we on?
    switch (jsonReader.nodeType()) {
      case JsonReader2k::Value: // we're on a scalar value
        Serial.print("Value ");
        switch (jsonReader.valueType()) { // what type of value?
          case JsonReader2k::String: // a string!
            Serial.print("String: ");
            jsonReader.undecorate(); // remove all the nonsense
            Serial.println(jsonReader.value()); // print it
            break;
          case JsonReader2k::Number: // a number!
            Serial.print("Number: ");
            Serial.println(jsonReader.numericValue()); // print it
            break;
          case JsonReader2k::Boolean: // a boolean!
            Serial.print("Boolean: ");
            Serial.println(jsonReader.booleanValue()); // print it!
            break;
          case JsonReader2k::Null: // a null!
            Serial.print("Null: ");
            Serial.println("null"); // print it!
            break;
        }
        break;
      case JsonReader2k::Field: // this is a field
        Serial.print("Field ");
        Serial.println(jsonReader.value());
        break;
      case JsonReader2k::Object: // an object start {
        Serial.println("Object");
        break;
      case JsonReader2k::EndObject: // an object end }
        Serial.println("End Object");
        break;
      case JsonReader2k::Array: // an array start [
        Serial.println("Array");
        break;
      case JsonReader2k::EndArray: // an array end ]
        Serial.println("End Array");
        break;
      case JsonReader2k::Error: // a bad thing
        // maybe we ran out of memory, or the document was poorly formed
        Serial.print("Error: (");
        Serial.print(jsonReader.lastError());
        Serial.print(") ");
        Serial.println(jsonReader.value());
        break;
    }
  }
  // don't forget this
  file.close();
}
void dumpId(bool recurse) {
  // open the file
  File file = SD.open("/data.json", FILE_READ);
  if (!file) {
    Serial.println(F("/data.json not found or could not be opened"));
    while (true); // halt
  }
  jsonReader.begin(file);
  // find the next field in the document named id. 
  // If recurse is specified, then subelements will be searched as well.
  // Otherwise, only objects on this level of the hierarchy are considered.
  // Once we find it, read what comes right after it to get the value
  if (jsonReader.skipToField("id", recurse) && jsonReader.read()) {
    Serial.println((int32_t)jsonReader.numericValue(), DEC); // get the value as a number
  }
  // always close the file
  file.close();
}
void dumpShowName() {
  File file = SD.open("/data.json", FILE_READ);
  if (!file) {
    Serial.println(F("/data.json not found or could not be opened"));
    while (true); // halt
  }
  jsonReader.begin(file);
  // look for the field named "name" on this level of the hierarchy.
  // then read what comes right after it.
  if (jsonReader.skipToField("name") && jsonReader.read()) {
    jsonReader.undecorate(); // deescape the string
    Serial.println(jsonReader.value());
  }
  // close the file
  file.close();
}
void setup() {
  Serial.begin(115200);
  // initialize SD on default SPI bus and CS/SS
  if (!SD.begin()) {
    Serial.println(F("SD card mount failed"));
    while (true); // halt
  }

  dumpJsonFile();
  //dumpShowName();
  //dumpId(true);
  //dumpId(false);  
}

void loop() {
  if (Serial.available()) {
    LexContext<1024> lc;
    lc.begin(Serial);
    while (LexContext<1024>::EndOfInput != lc.advance() && isdigit((char)lc.current())) {
      Serial.print((char)lc.current());
    }
    Serial.println();
  }
}

We've got a lot going on here. Of particular interest are dumpShowName(), dumpId() and dumpJsonFile().

In dumpJsonFile(), we go through the motions of reading data out of the file. Note that we can get type information for our fields and retrieve typed values from them by calling the appropriate methods. All numbers currently resolve to a double. If you want an int, you will have to use atoi() on value().
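
A hedged sketch of both routes, reusing the demo's JsonReader2k typedef and assuming the reader is currently sitting on a Number value:

C++
if (JsonReader2k::Value == jsonReader.nodeType() &&
    JsonReader2k::Number == jsonReader.valueType()) {
  int id = atoi(jsonReader.value());                   // parse the raw text as an int
  // int32_t id32 = (int32_t)jsonReader.numericValue(); // or truncate the double
  Serial.println(id);
}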

The other routines show you how to move around the document. They don't demonstrate all of the features for this, but they do demonstrate using an important one, skipToField(). This method finds the next field in the document with the given name, optionally searching subelements. Usually once you find it, you'll want to read() once to get to the next element - the field's value. We do that in the ino above.

There's also skipSubtree() which skips over the entire subtree you're on, skipToEndArray() and skipToEndObject() which skip to the closing array or object marker on the same level, respectively.
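
As a minimal sketch (not in the demo, and assuming skipToEndObject() takes no arguments) of how these fit together: suppose the cursor is inside an object that contains an id field among others. We can read the one field we care about and then skip the rest of the enclosing object.

C++
if (jsonReader.skipToField("id") && jsonReader.read()) {
  Serial.println((int32_t)jsonReader.numericValue(), DEC);
  jsonReader.skipToEndObject();   // ignore the object's remaining fields
}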

One of the things I'd like to add is an in-memory tree representation that integrates with this, or perhaps integration with ArduinoJson. I may first create a JSONPath to C++ code generator that can generate code that uses this library to fulfill JSONPath queries. All in time.

Cool, But How Does It Work?

Like I said at the beginning, the whole mess is basically a state machine over a LexContext which manages a cursor over streaming input. The state machine starts in the Initial state. Each state corresponds directly to a value returned from nodeType().

C++
bool read() {
  int16_t qc;
  int16_t ch;
  switch (_state) {
    case JsonReader<S>::Error:
    case JsonReader<S>::EndDocument:
      return false;
    case JsonReader<S>::Initial:
      _lc.ensureStarted();
      _state = Value;
    // fall through
    case JsonReader<S>::Value:
value_case:
      _lc.clearCapture();
      switch (_lc.current()) {
        case LexContext<S>::EndOfInput:
          _state = EndDocument;
          return false;
        case ']':
          _lc.advance();
          _lc.trySkipWhiteSpace();
          _lc.clearCapture();
          _state = EndArray;
          return true;
        case '}':
          _lc.advance();
          _lc.trySkipWhiteSpace();
          _lc.clearCapture();
          _state = EndObject;
          return true;
        case ',':
          _lc.advance();
          _lc.trySkipWhiteSpace();
          if (!read()) { // read the next value
            _lastError = JSON_ERROR_UNTERMINATED_ARRAY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_ARRAY_MSG,S-1);
            _state = Error;
          }
          return true;
        case '[':
          _lc.advance();
          _lc.trySkipWhiteSpace();
          _state = Array;
          return true;
        case '{':
          _lc.advance();
          _lc.trySkipWhiteSpace();
          _state = Object;
          return true;
        case '-':
        case '.':
        case '0':
        case '1':
        case '2':
        case '3':
        case '4':
        case '5':
        case '6':
        case '7':
        case '8':
        case '9':
          qc = _lc.current();
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          while (LexContext<S>::EndOfInput != _lc.advance() &&
                 ('E' == _lc.current() ||
                  'e' == _lc.current() ||
                  '+' == _lc.current() ||
                  '.' == _lc.current() ||
                  isdigit((char)_lc.current()))) {
            if (!_lc.capture()) {
              _lastError = JSON_ERROR_OUT_OF_MEMORY;
              strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
              _state = Error;
              return true;
            }
          }
          _lc.trySkipWhiteSpace();
          return true;
        case '\"':
          _lc.capture();
          _lc.advance();
          if(!_lc.tryReadUntil('\"', '\\', true)) {
            if(LexContext<S>::EndOfInput==_lc.current()) {
              _lastError = JSON_ERROR_UNTERMINATED_STRING;
              strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_STRING_MSG,S-1);
              _state = Error;
              return true;

            } else {
              _lastError = JSON_ERROR_OUT_OF_MEMORY;
              strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
              _state = Error;
              return true;
            }
          }
          _lc.trySkipWhiteSpace();
          if (':' == _lc.current())
          {
            _lc.advance();
            _lc.trySkipWhiteSpace();
            if (LexContext<S>::EndOfInput == _lc.current()) {
              _lastError = JSON_ERROR_FIELD_NO_VALUE;
              strncpy_P(_lc.captureBuffer(),JSON_ERROR_FIELD_NO_VALUE_MSG,S-1);
              _state = Error;
              return true;
            }
            _state = Field;
          }
          return true;
        case 't':
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('r' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('u' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('e' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          _lc.advance();
          _lc.trySkipWhiteSpace();
          ch = _lc.current();
          if (',' != ch && ']' != ch && '}' != ch && LexContext<S>::EndOfInput != ch) {
            _lastError = JSON_ERROR_UNEXPECTED_VALUE;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
            _state = Error;
          }
          return true;
        case 'f':
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('a' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('l' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('s' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('e' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          _lc.advance();
          _lc.trySkipWhiteSpace();
          ch = _lc.current();
          if (',' != ch && ']' != ch && '}' != ch && LexContext<S>::EndOfInput != ch) {
            _lastError = JSON_ERROR_UNEXPECTED_VALUE;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
            _state = Error;
          }
          return true;
        case 'n':
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('u' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('l' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if ('l' != _lc.advance()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          if (!_lc.capture()) {
            _lastError = JSON_ERROR_OUT_OF_MEMORY;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
            _state = Error;
            return true;
          }
          _lc.advance();
          _lc.trySkipWhiteSpace();
          ch = _lc.current();
          if (',' != ch && ']' != ch && '}' != ch && LexContext<S>::EndOfInput != ch) {
            _lastError = JSON_ERROR_UNEXPECTED_VALUE;
            strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
            _state = Error;
          }
          return true;
        default:
          // Serial.printf("Line %d, Column %d, Position %d\r\n",
          //               _lc.line(), _lc.column(), (int32_t)_lc.position());
          _lastError = JSON_ERROR_UNEXPECTED_VALUE;
          strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
          _state = Error;
          return true;

      }
    default:
      _state = Value;
      goto value_case;
  }
}

Each part here examines the current state, and the character under the cursor to determine what to do next. This form of parsing is similar to LL(1). Actually, since JSON is even simpler to parse than that, it's not much more difficult than matching regular expressions. Since we don't have a separate lexer, our parser handles the lexing as well, except for the low level lexing like trySkipWhiteSpace() which it delegates to LexContext. The tedious bits are the parts that determine we're on a number and the parts that determine whether we encountered true, false, or null. Other than that, it's pretty straightforward.

As an optimization, this parser supports partial parsing while skipping over parts of the document. This does just enough parsing to determine if the document is well formed but otherwise normalizes nothing, speeding up the operation.

We have two routines for skipping over nested objects and arrays. They delegate to each other recursively when arrays are nested in objects and vice versa. Since they are nearly identical, we'll explore one:

C++
void skipObjectPart()
{
  int depth = 1;
  while (Error!=_state && LexContext<S>::EndOfInput != _lc.current())
  {
    switch (_lc.current())
    {
      case '[':
        if(LexContext<S>::EndOfInput==_lc.advance()) {
          _lastError = JSON_ERROR_UNTERMINATED_ARRAY;
          strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_ARRAY_MSG,S-1);              
          _state = Error;
          return;
        }
        skipArrayPart();
        break;
        
      case '{':
        ++depth;
        _lc.advance();
        if(LexContext<S>::EndOfInput==_lc.current()) {
          _lastError = JSON_ERROR_UNTERMINATED_OBJECT;
          strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_OBJECT_MSG,S-1);
          _state = Error;
        }
        break;
      case '\"':
        skipString();
        break;
      case '}':
        --depth;
        _lc.advance();
        if (depth == 0)
        {
          _lc.trySkipWhiteSpace();
          return;
        }
        if(LexContext<S>::EndOfInput==_lc.current()) {
          _lastError = JSON_ERROR_UNTERMINATED_OBJECT;
          strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_OBJECT_MSG,S-1);              
          _state = Error;
        }
        break;
      default:
        _lc.advance();
        break;
    }
  }
}

These in turn, are used by skipSubtree():

C++
bool skipSubtree()
{
  switch (_state)
  {
    case JsonReader<S>::Error:
      return false;
    case JsonReader<S>::EndDocument: // eos
      return false;
    case JsonReader<S>::Initial: // initial
      if (read())
        return skipSubtree();
      return false;
    case JsonReader<S>::Value: // value
      return true;
    case JsonReader<S>::Field: // field
      if (!read())
        return false;
      return skipSubtree();
    case JsonReader<S>::Array:// begin array
      skipArrayPart();
      _lc.trySkipWhiteSpace();
      _state = EndArray; // end array
      return true;
    case JsonReader<S>::EndArray: // end array
      return true;
    case JsonReader<S>::Object:// begin object
      skipObjectPart();
      _lc.trySkipWhiteSpace();
      _state = EndObject; // end object
      return true;
    case JsonReader<S>::EndObject: // end object
      return true;
    default:
      _lastError = JSON_ERROR_UNKNOWN_STATE;
      strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNKNOWN_STATE_MSG,S-1);
      _state = Error;
      return true;
  }
}

Here, we skip the next subtree depending on where we are. If we're on the initial node, the subtree is the entire document. If we're on a value, it's already skipped by the next read() call. If we're on a field, we read to the next element - the field's value - and skip that. If we're on an array or object, we use the nested skip routines outlined just above.

To search, we provide methods like skipToIndex() and skipToField(). These allow you to move through the document by querying for field names, or for indices within arrays:

C++
bool skipToIndex(int index) {
  if (Initial==_state || Field == _state) // initial or field
    if (!read())
      return false;
  if (Array==_state) { // array start
    if (0 == index) {
      if (!read())
        return false;
    }
    else {
      for (int i = 0; i < index; ++i) {
        if (EndArray == _state) // end array
          return false;
        if (!read())
          return false;
        if (!skipSubtree())
          return false;
      }
      if ((EndObject==_state || EndArray==_state) && !read())
        return false;
    }
    return true;
  }
  return false;
}

Note that the above is for arrays. The following is for objects:

C++
bool skipToField(const char* field, bool searchDescendants = false) {
  if (searchDescendants) {
    while (read()) {
      if (Field == _state) { // field 
        undecorate();
        if (!strcmp(field , value()))
          return true;
      }
    }
    return false;
  }
  switch (_state)
  {
    case JsonReader<S>::Initial:
      if (read())
        return skipToField(field);
      return false;
    case JsonReader<S>::Object:
      while (read() && Field == _state) { // first read will move 
                                          // to the child field of the root
        undecorate();
        if (strcmp(field,value()))
          skipSubtree(); // this field isn't the target, so just skip over the rest of it
        else
          break;
      }
      return Field == _state;
    case JsonReader<S>::Field: // we're already on a field
      undecorate();
      if (!strcmp(field,value()))
        return true;
      else if (!skipSubtree())
        return false;

      while (read() && Field == _state) { // first read will move to the child field of the root
        undecorate();
        if (strcmp(field , value()))
          skipSubtree(); // if this field isn't the target just skip over the rest of it
        else
          break;
      }
      return Field == _state;
    default:
      return false;
  }
}

This is considerably more involved than skipToIndex() simply because there are so many corner cases to deal with. Also, unlike the previous method, this one needs to be able to search or skip over descendants. It's odd until you think about it, but it's actually easier to "recursively" search for a field because you don't have to skip subtrees to stay on the same level of the hierarchy.
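
To make the distinction concrete, here is a hedged sketch against a hypothetical document { "a": { "id": 2 }, "id": 1 }, with a freshly begun reader in each case:

C++
// Non-recursive: stays on this level, skipping the whole "a" subtree,
// and stops on the top-level id field (the one whose value is 1).
jsonReader.skipToField("id");

// Recursive: stops on the first id field in document order,
// which here is the nested one (the one whose value is 2).
jsonReader.skipToField("id", true);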

As far as storing and retrieving element scalar values and field names goes, we use the LexContext's captureBuffer() for that. Doing so saves precious RAM versus copying the data out of the buffer. One additional thing we do to save RAM is stipulate that any conversion of the data must be done in place where possible. This means we have an undecorate() function which removes the quotes from a string and translates its escapes into real characters. It does so in place, knowing that the resulting string will always be shorter than the input string, because a translated character is never longer than its escape sequence and because the quotes are always stripped:

C++
void undecorate() {
  char *src = _lc.captureBuffer();
  char *dst = src;
  char ch = *src;
  if ('\"' != ch)
    return;
  ++src;
  uint16_t uu;
  while ((ch = *src) && ch != '\"') {
    switch (ch) {
      case '\\':
        ch = *(++src);

        switch (ch) {
          case '\'':
          case '\"':
          case '\\':
          case '/':
            *(dst++) = ch;
            ++src;
            break;
          case 'r':
            *(dst++) = '\r';
            ++src;
            break;
          case 'n':
            *(dst++) = '\n';
            ++src;
            break;
          case 't':
            *(dst++) = '\t';
            ++src;
            break;
          case 'b':
            *(dst++) = '\b';
            ++src;
            break;
          case 'u':
            uu = 0;
            ch = *(++src);
            if (isHexChar(ch)) {
              uu = fromHexChar(ch);
              ch = *(++src);
              uu *= 16;
              if (isHexChar(ch)) {
                uu |= fromHexChar(ch);
                ch = *(++src);
                uu *= 16;
                if (isHexChar(ch)) {
                  uu |= fromHexChar(ch);
                  ch = *(++src);
                  uu *= 16;
                  if (isHexChar(ch)) {
                    uu |= fromHexChar(ch);
                    ch = *(++src);
                  }
                }
              }
            }
            if (0 < uu) {
              // no unicode
              if (256 > uu) {
                *(dst++) = (char)uu;
              } else
                *(dst++) = '?';
            }
        }
        break;
      default:
        *dst = ch;
        ++dst;
        ++src;
    }
  }
  *dst = 0;
}

That's not very nice. What it's doing is this: it's managing two cursors over the same buffer. The destination cursor dst trails the source cursor src by at least one character because of the leading quote. Basically, we just copy characters from source to destination until we hit an escape, in which case we translate it. When we find the final quote, we're done, and we re-terminate the string at its new end.
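
As a worked illustration, suppose the reader has just parsed a string value and the capture buffer still holds the 14 characters "hello\nworld", quotes and the two-character \n escape included:

C++
// Before undecorate(): 14 chars -  "hello\nworld"
//   (leading quote, 5 letters, backslash, n, 5 letters, trailing quote)
jsonReader.undecorate();
// After undecorate(): 11 chars - hello, one real newline, world
//   (quotes stripped, escape collapsed, string re-terminated at its new end)
Serial.println(jsonReader.value());   // prints "hello" and "world" on separate lines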

Another interesting function is valueType() which tells us what sort of JSON value we're looking at - note that it should not be called after undecorate():

C++
int8_t valueType() {
  char *sz = _lc.captureBuffer();
  char ch = *sz;
  if('\"'==ch)
    return String;
  if('t'==ch || 'f'==ch)
    return Boolean;
  if('n'==ch)
    return Null;
  return Number;
}

We take a number of liberties here. All we ever do is examine the first character, and for a number we don't even do that; we just get to it by process of elimination. This is only reliable because we already validated these values while parsing. For example, if the value starts with t, we know it's going to be true, simply because nothing else is allowed to start with t unless it's surrounded by quotes. We already know it's not tree because the parser would have errored out earlier. Now you can see the damage undecorate() does if it's called before this!

We've now covered the meat of the entire library, and where you go from here is up to you. I hope you enjoy this contribution and that your code is lean, pretty and bug resistant.

History

  • 9th December, 2020 - Initial submission
  • 9th December, 2020 - Update: added "How It Works" section
  • 10th December, 2020 - Update 2: added better error handling and bug fixes
  • 10th December, 2020 - Update 3: fixed bug with incorrect error message during some out of memory conditions
  • 11th December, 2020 - Update 4: fixed bug with skipping, changed Key to Field, removed non-canonical skipping and updated article code

License

This article, along with any associated source code and files, is licensed under The MIT License

