Multi-Machine Parsing #2: Literature Review

Peter Belcak

4.00/5 (2 votes)

Jan 10, 2021

CPOL

5 min read

5006

A short review of the literature on the subject of multi-machine parsing

Introduction

The previous article in this series has presented the case for using multiple parsing methods within one parser, listing a number of issues that otherwise arise (especially when using a parser generator). These issues were:

blindness to desirable essential ambiguity
creation of non-essential ambiguity
prohibitive parser size
sub-optimal parser performance

This part of the series gives a brief review of some of the literature on the subject of multi-machine parsing, looking at 3 combinations of well-known parser types that have historically been proposed.

While, initially, this article is likely not to be complete, it very much aspires to be so at some point in the future, and comments referencing work that has been missed are thus welcome.

It needs to be said that almost by definition, parallel parsing is in essence about using multiple parsing machines in one parser, or at least using one machine in several contexts (pseudo-)simultaneously. The motivations are, however, quite different, and we shall therefore not go into much detail. We resort just to noting that a process-configuration parser with inhomogenous agents could achieve behaviour similar to what we wish to do with multi-machine parsing, although no mention of such an attempt is known to me at the time of writing of this article.

The Review

LR-LR Parsing: The Korenjak's Method

Perhaps the first successful attempt to deal with the sheer size (see Issue #3) of LR(k) parsers was the Korenjak's Method, which splits the source grammar into multiple parts. To quote the author himself,

Quote:

This procedure involves partitioning the given grammar into a number of smaller parts. If an LR(k) processor can be constructed for each part (using Knuth's algorithm) and if certain conditions relating these individual processors are satisfied, then an LR(k) processor for the entire grammar can be constructed for them.

The resulting combination of LR(k) parsing tables is much smaller than the one large table for the entire grammar would be, effectively attacking the main limitation of the otherwise quite powerful algorithm.

The Korenjak's Method appears not to have caught on in the end. In the volume Parsing Techniques by Grune and Jacobs, which may well be regarded as a "reservation for endangered parsers" (a great read!), the authors conclude

Quote:

This [Korenjak's Method] helped, but not much, in view of the added complexity.

Nevertheless, Korenjak's 1969 paper remains to be an important milestone, as it appears to be the first documented attempt to modify a parser generation algorithm in a way that involves using multiple independent parsers.

LL-FA Parsing

PCCTS, Astir, and ANTLR up to the version using the LL(*) parsing strategy were all initially designed to produce "practical" LL/LR parsers. While many of the optimisations used by these parser generators can be broadly characterized under the label of "partitioned LL/LR parsing", their presentations (available here and here) were done quite differently.

In essence, an LL(k) parser can be combined with a finite automaton (finite state machine) to produce a parser that can handle some types of LL-regular grammars in linear time. The class of LL-regular grammars is much bigger than the class of LL(k) grammars, so such parsing is of course of natural interest (in relation to Issue #2).

The major drawback is that identifying the regular partition of an LL-regular grammar can be shown at least as hard as the Post Correspondence Problem, which is itself intractable. Hence, a combination of an LL parser with finite automata can only be constructed in situations where the regular partition is easily recognizable in the grammar.

LL(*) parsing solves this problem by backtracking when things don't go right, arguing that such backtracking is in practice rare, and only rarely leads to the worst-possible exponential runtime.

Astir's LL(finite) parsing makes this the parser designer's responsibility. Instead of implementing default but potentially sub-optimal behaviour, Astir's grammar specification language (as of v1.0) permits the separation of the source grammar into parts with the additional specification of which of them should be handled by finite automata, and which by LL parsers.

LL-LR Parsing

LL-LR parsing (originally named "LLLR") is formed around the realization that while LL(k) top-down parsers are generally small and simple to write or design, LL(k) grammars are nowhere near as powerful as their LR(k) counterparts. From the abstract of the paper that introduced the method:

Quote:

An LLLR parser uses an LL parser as its backbone and parses as much of its input string using LL parsing as possible. To resolve LL conflicts it triggers small embedded LR parsers. An embedded LR parser starts parsing the remaining input and once the LL conflict is resolved, the LR parser produces the left parse of the substring it has just parsed and passes the control back to the backbone LL parser.

Further:

An LLLR(k) parser is appropriate for grammars where the LL(k) conflicting nonterminals either appear relatively close to the bottom of the derivation trees or produce short substrings. In such cases an LLLR parser can perform a significantly better error recovery than an LR parser since the most part of the input string is parsed with the backbone LL parser.

LL-LR parsing is similar to LL-FA parsing in that it uses another parser type (i.e., different from the main, "backbone", LL(k) parser) to resolve conflicts. In this respect, a hypothetical parser generator producing LL-LR parsers would be similar to ANTLR in that it has a default, more complex behaviour that it opts for when a conflict is identified. But like Astir, such a parser generator would not enable the output parser to resort to backtracking.

The Takeaway

The idea of generating multiple independent parsers that communicate with each other for a single source grammar dates back to at least 1969.

The three most familiar types of parsing machines, namely the finite automata, LL(k) parsers, and LR(k) parsers have all been proposed to be included in combinations in their default form to produce a parser that is in some way more powerful or efficient than a single parser alone would be.

Of all the methods proposed, only the LL-FA combination has been utilized in parser generators (PCCTS, Astir, ANTLR), and that to extend the class of grammars for which parsers can be generated to a strict superset of LL(k) grammars. In the next article in this series, we will look at the particular problem of resolving an LL(k) conflict (as a manifestation of Issue #2) and show how the LL-FA parser combination can be used in this context.

History

10^th January, 2021: Initial version