Click here to Skip to main content
15,868,030 members
Articles / Programming Languages / C#
Tip/Trick

LookAheadEnumerator: Implement Backtracking in Your Parsers

Rate me:
Please Sign up or sign in to vote.
5.00/5 (13 votes)
23 Dec 2019MIT3 min read 15.9K   123   3   21
Easily implement efficient backtracking capabilities over any enumeration

Introduction

Complex parsers often need to support backtracking, which is a way to revisit items you've already encountered. The trick with this is twofold, doing it efficiently, and doing it transparently. The LookAheadEnumerator<T> class provides both.

Update: Bug fix, more robust, doc comments, and 90% slangified the code (removed switch/nameof and the like) so that Slang can cook it and I can use it in my generated parser code.

Update: LookAheadEnumerator is now used extensively in my Parsley parser generator which can parse C# (an example of parsing a large subset is at the link)

Background

It's often desirable in parser code to use some sort of "streaming" interface for its input like a TextReader class or an class implementing IEnumerator<T>. I prefer enumerators because of their ubiquity and simplicity. However, it can be difficult to backtrack on streaming sources without preloading it into memory before parsing. This is fine for small text, but not reams of say, bulk JSON.

Normally with an enumerator, all you can do is MoveNext() and sometimes Reset() if you're lucky. There is no way to seek back to a particular previous position, and even if there was, it probably wouldn't work of a true streaming source, like an HTTP response, or console input.

A backtracking parser on the other hand, needs to "bookmark" its current position, before trying several alternatives until it finds one that parses. That means revisiting the same sequence of text several times.

Backtracking parsers are inherently less efficient but far more flexible than non-backtracking parsers. I've made a fair effort to optimize this class to make it as efficient as possible for this purpose.

Conceptualizing This Mess

I've embedded an array backed queue into this class, which it uses to back the the lookahead buffer. The queue starts with 16 elements and grows as needed (almost doubling in size each time to avoid too many reallocations - heap in .NET is cheaper than CPU) depending on how much lookahead is needed. When a LookAheadEnumeratorEnumerator<T> (the lookahead cursor) is advanced, it often requires the primary class to read more data into the queue in order to satisfy it. When the main cursor is advanced, it will discard items in the queue (simply incrementing _queueHead which is really fast) . It's not a good idea to advance or reset the main cursor while using the lookahead cursor. The results in this case, are undefined, as I haven't implemented versioning in these enumerators.

Using This Mess

You use the code like a standard IEnumerator<T> with an additional property - LookAhead that allows you to foreach from your current position without advancing the cursor. There's also Peek() and TryPeek() which look ahead a specified number of positions, and DiscardLookAhead() which simply moves the cursor to the physical position and clears the buffer.

C#
var text = "fubarfoobaz";
var la = new LookAheadEnumerator<char>(text.GetEnumerator());
la.MoveNext(); // can't look ahead until we're already over the position 
               // we want to start look at.
foreach (var ch in la.LookAhead)
    Console.Write(ch);
Console.WriteLine();
while (la.MoveNext())
{
    foreach (var ch in la.LookAhead)
        Console.Write(ch);
    Console.WriteLine();
}

This would print the following to the console:

fubarfoobaz
ubarfoobaz
barfoobaz
arfoobaz
rfoobaz
foobaz
oobaz
obaz
baz
az
z

As you can see, we're incrementing the primary cursor by one in each iteration, and then we're enumerating over LookAhead from there. Enumerating over LookAhead does not affect the primary cursor*.

* The underlying physical read cursor is advanced, as it must be, but a facade is presented using a queue that buffers the already read input for you, presenting it as the next input.

Typically in a parser, you'll use it over tokens, like LookAheadEnumerator<Token> and then each time you need to do backtracking, you parse along LookAhead instead of the primary cursor. When you find a match, you'll have to discard all the tokens you matched, either by reparsing along the primary cursor or by counting and advancing if you know how many tokens you parsed. If you're only parsing one alternative to the main parse, you can simply use DiscardLookAhead() once you've matched the alternate.

That's about all. This is an extremely specialized class, but when you need it, you really need it.

History

  • 19th December, 2019: Initial submission
  • 23rd December, 2019: Update

License

This article, along with any associated source code and files, is licensed under The MIT License


Written By
United States United States
Just a shiny lil monster. Casts spells in C++. Mostly harmless.

Comments and Discussions

 
PraiseSuch fun and good memories of Lisp Pin
asiwel26-Dec-19 8:33
professionalasiwel26-Dec-19 8:33 
GeneralRe: Such fun and good memories of Lisp Pin
honey the codewitch26-Dec-19 8:35
mvahoney the codewitch26-Dec-19 8:35 
SuggestionGood Idea ! (may be including something else?) Pin
AndyHo26-Dec-19 5:35
professionalAndyHo26-Dec-19 5:35 
GeneralRe: Good Idea ! (may be including something else?) Pin
honey the codewitch26-Dec-19 5:39
mvahoney the codewitch26-Dec-19 5:39 
GeneralRe: Good Idea ! (may be including something else?) Pin
AndyHo26-Dec-19 13:25
professionalAndyHo26-Dec-19 13:25 
GeneralMy vote of 5 Pin
BillWoodruff21-Dec-19 18:58
professionalBillWoodruff21-Dec-19 18:58 
GeneralRe: My vote of 5 Pin
honey the codewitch21-Dec-19 19:23
mvahoney the codewitch21-Dec-19 19:23 
GeneralMy vote of 5 Pin
User 1106097921-Dec-19 3:23
User 1106097921-Dec-19 3:23 
GeneralRe: My vote of 5 Pin
honey the codewitch21-Dec-19 3:31
mvahoney the codewitch21-Dec-19 3:31 
QuestionI think I you're a special one Pin
Sacha Barber20-Dec-19 17:05
Sacha Barber20-Dec-19 17:05 
AnswerRe: I think I you're a special one Pin
honey the codewitch20-Dec-19 17:11
mvahoney the codewitch20-Dec-19 17:11 
GeneralRe: I think I you're a special one Pin
Sacha Barber21-Dec-19 0:29
Sacha Barber21-Dec-19 0:29 
GeneralRe: I think I you're a special one Pin
honey the codewitch21-Dec-19 1:46
mvahoney the codewitch21-Dec-19 1:46 
GeneralRe: I think I you're a special one Pin
Sacha Barber21-Dec-19 5:36
Sacha Barber21-Dec-19 5:36 
GeneralRe: I think I you're a special one Pin
honey the codewitch21-Dec-19 5:38
mvahoney the codewitch21-Dec-19 5:38 
GeneralRe: I think I you're a special one Pin
mldisibio21-Dec-19 7:30
mldisibio21-Dec-19 7:30 
GeneralRe: I think I you're a special one Pin
honey the codewitch21-Dec-19 19:22
mvahoney the codewitch21-Dec-19 19:22 
GeneralRe: I think I you're a special one Pin
Sacha Barber22-Dec-19 6:08
Sacha Barber22-Dec-19 6:08 
GeneralRe: I think I you're a special one Pin
honey the codewitch23-Dec-19 22:46
mvahoney the codewitch23-Dec-19 22:46 
QuestionCode? Pin
Marc Clifton19-Dec-19 14:49
mvaMarc Clifton19-Dec-19 14:49 
AnswerRe: Code? Pin
honey the codewitch19-Dec-19 15:24
mvahoney the codewitch19-Dec-19 15:24 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.