Ambiguity? Let NLT parse it for you

greenoldman

Rate me:

4.83/5 (4 votes)

11 Nov 2013MIT8 min read

20.5K

Lexer+parser suite for .Net world

Introduction

When you can handle string/text processing with just regular expressions that is fine (and fast), but when you get more complex task to do having more serious machinery — lexer and parser — comes handy. There are already some parsers for C# written, but I feel the place for old-school YACC-like parser is not occupied. Well, until now.

Background

The “Naive Language Tools” suite which I describe below is actually a side-effect of translating COOL framework to C#.

About a year and a half I enrolled in “Compilers” course by Prof. Alex Aiken (highly recommended) because I always was curious how compilers work and how to write one. The entire course was more fulfilling than I imagined and I wanted to add something from me (as repay, one could say) — at that time course staff provided C++ and Java frameworks, so why not make a framework in my favorite C#?

There was just one problem — at the time of writing C# framework I couldn’t find reliable, YACC/CUP-like parser for C# (similarity to those tools was a must to let the students switch from one framework to another).

Well, what the heck, I just finished the course, I wrote my own COOL compiler, I should be able to pull off writing a lexer and a parser as well. And indeed I was — that is how NLT was born. At first, with features required to get COOL compiler built, but in time I added more and more features.

The lexer got basic context scanning, the parser became a LALR(k) forking parser, and since I use NLT for my own purposes, it is more and more polished every day. OK, OK, more like every week.

Let’s see some action

The code of this article is included in NLT package as “05. Patterns and forking” example.

Please allow me to introduce some rather crazy calculator to show how NLT can handle the language ambiguity. Let’s build calculator with addition and subtraction as usual, but also with bit shifts (“a < b”, “a > b”) and bit indexing (“a<b>” — it gives b-th bit of “a”).

The priority and associativity of the operators — from lowest to highest priority:

summation and subtraction — left to right,
bit shifts — left to right,
bit indexing.

Lexer — chopping up the input

One can define all the actions directly in C# code, but also write them as NLT grammar and then let the NLT generator create lexer and parser. Here we use the latter approach for the sake of readability.

Let’s start from boring stuff:

lexer LexerFactory;
token SymbolEnum;
states StatesEnum
  *INIT
end

There is a definition of the class name of the lexer factory, the type (enum) name of tokens and states. Since our calculator is quite basic we will need just one state (asterisk says it will be the default one).

Chopping the input is done as follows:

scanning // start of lexer rules
  "+" -> PLUS;
  "-" -> MINUS;
  "<" -> LANGLE;
  ">" -> RANGLE;

We match the text using string pattern and we return a terminal (uppercase is used for making a distinction from non-terminal and special token — “Error”). Next we use regular expression pattern to grab the numbers:

// regex as pattern
/[0-9]+/ -> NUM, Convert.ToInt32($text);

“$text” is one of the pseudo-variables. By default the matched text is assigned to the value of the token, but better to convert text to number here than later in the parser.

We have to add a little clean-up:

  " "     { };    // ignore spaces
  /./  -> Error;  // "anything else" rule
  %EOF -> EOF;
end // end of lexer rules

Since one cannot match end of file (EOF) using string or regex pattern, thus the special entry “%EOF”. The EOF on the right is a predefined terminal.

Parser — finding out the structure

Take off

Again, we start with boring stuff:

parser ParserFactory;
types
  NUM int;
  expr Tuple<int,string>;
end

The “type” section allows us to define the types for all symbols — this way we won’t need to cast variables in parser actions. Here we have a definition of NUM terminal as “int”, and definition of the expression as a pair of values.

We would like to keep things simple here — the outcome of the calculation as in every calculator is desired of course — thus the “int” as the first element of the pair, but since the parsing will be all the fun here, we add a track of parse layout (flat parse tree) — single “string” as the second element will suffice.

With all things set up we can finally deal with parsing productions:

parsing // start of parser rules
comp -> e:expr { e };

In NLT the symbol used on left hand side (LHS) of the first production is recognized as a start symbol and it can appear only once. Here we basically say that entire computation is an expression.

expr -> e1:expr MINUS e2:expr
        { pair(e1.Item1 - e2.Item1, "("+e1.Item2+" - "+e2.Item2+")") }
      | e1:expr PLUS e2:expr 
        { pair(e1.Item1 + e2.Item1, "("+e1.Item2+" + "+e2.Item2+")") }
      | e1:expr LANGLE e2:expr
        { pair(e1.Item1 << e2.Item1, "("+e1.Item2+" < "+e2.Item2+")") }
      | e1:expr RANGLE e2:expr
        { pair(e1.Item1 >> e2.Item1, "("+e1.Item2+" > "+e2.Item2+")") }
      | e1:expr LANGLE e2:expr RANGLE 
        { pair((e1.Item1 & (1 << e2.Item1)) >> e2.Item1, "("+e1.Item2+"["+e2.Item2+"])") }
      | n:NUM 
        { pair(n, n) }
      ;
end // end of parser rules

Going from top to bottom we define an expression as a subtraction, sum, bit shifts or bit indexing of expressions. The last rule states that an expression is a single number as well.

Is this enough? When we run NLT generator (use “refresh” script to do it) over this grammar we will quickly discover we miss some pieces — the NLT report starts with:

Reduce/shift conflict on symbol MINUS. Affected items: 3.1, 3.0
3.0)  expr := expr MINUS expr .
3.1)  expr := expr . MINUS expr

The actual report is a bit richer in information.

It means that with the given rules NLT parser is unable to decide what “1-2-3” makes — should it be read as “(1-2)-3” or “1-(2-3)”? The same problems is with PLUS symbol.

To solve this we add precedence section:

precedence
  op left MINUS PLUS;
end

This is simple precedence rule for “operator” symbols — here MINUS and PLUS — which tells to bind operators from left to right (or in other words the production should be reduced as in (3.0)).

Let’s run generator again:

Reduce/shift conflict on symbol LANGLE. Affected items: 3.3, 3.5, 3.0
3.0)  expr := expr MINUS expr .
3.3)  expr := expr . LANGLE expr
3.5)  expr := expr . LANGLE expr RANGLE

That’s easy — we already said we want addition and subtraction to have the lowest priority. So all it takes is adding the second rule in precedences:

op left LANGLE RANGLE;

We use “left” because we want xANGLEs to have also left to right binding.
Life is beautiful, no conflicts this time so we can run our parser to try out some calculations:

$ 1-2-3 
The outcome: -4
The parse layout: ((1 - 2) - 3)

$ 1<2<3
The outcome: 32
The parse layout: ((1 < 2) < 3)

Let’s check bit indexing:

$ 1<2>
There were errors while parsing.

The bigger picture

NLT adds to the above error also a comment saying:

No action defined at node 10 for input "EOF" with stack "expr RANGLE".

The complain has its merit — the parser after seeing “1<2”, reduced it (we told it to do so) to an expression and later it didn’t know what to do with dangling right angle. It appears NLT warned us about this before, but we read too small portion of the conflicts report.

Here is the culprit:

Reduce/shift conflict on symbol LANGLE. Affected items: 7.3, 7.5, 7.0
7.0)  expr := expr LANGLE expr .
7.3)  expr := expr . LANGLE expr
7.5)  expr := expr . LANGLE expr RANGLE

This conflict translates to such scenario — we have “1 < 2 < …” and we have to decide if “1<2” can make a valid expression or should we wait for more input. Looking at (7.0) against (7.3) we would say “reduce” (left associativity), but then — looking at (7.0) vs (7.5) we would say shift (bit indexing has higher priority). There is no (reasonable) way we can tell for sure what it will be.

Dead end? No — why not try both ways and see what we will get.

Because now we delete the line

op left LANGLE RANGLE;

first we have to define LANGLE/RANGLE with higher priority than PLUS or MINUS:

precedence
  op left MINUS PLUS;
  rs shift LANGLE expr(PLUS MINUS) expr;
  rs shift RANGLE expr(PLUS MINUS) expr;
end

The order is priority, and since we want to solve problem exactly with xANGLE vs. PLUS/MINUS we have to give more precise definition. Those two added lines mean:

on reduce/shift conflict
perform shift
if the incoming symbol is LANGLE (RANGLE in the last line)
and the expression which is about to be reduced contains PLUS or MINUS operator
and it conflicts with expression which is about to be shifted.

OK, so we know how to precisely select conflicts to give them exact orders. The second half is similar:

rs try LANGLE expr(LANGLE RANGLE) expr;
rs try RANGLE expr(LANGLE RANGLE) expr;

only here we “try” (fork). Since we rely on input, for such cases:

$ 7<2
The outcome: 28
The parse layout: (7 < 2)

$ 7<2>
The outcome: 1
The parse layout: (7[2])

it works. But sometimes it doesn’t.

$ 7<2>1
There were errors while parsing.
Ambiguous parsing.

Ambiguity bites again

What happened? By adding “try” directive we didn’t really solved ambiguity but pushed solution from the grammar to the input. As the effect we got two, not one, parsing trees. Some languages have constraints which come with the input, but it is not our case — we have to add constraints manually.

Consider the last example — “7<2>1”. It can be parsed as “(7<2)>1” or (incorrectly) as “7<(2>1)”. Speaking in terms of constraints this means we don’t want to see a bit shift in the right branch.

Previously we had such rules for bit shifts:

e1:expr LANGLE e2:expr
e1:expr RANGLE e2:expr

after adding constraints:

e1:expr LANGLE e2:expr #LANGLE #RANGLE 
e1:expr RANGLE e2:expr #LANGLE #RANGLE

With hash character (#) we forbid given operator to appear directly at given place (here: right expression). Since the “operator” term is unknown to NLT (it is not a symbol per se — it can reuse the same name though) we have to mark (decorate) the expressions:

e1:expr ^LANGLE e2:expr #LANGLE #RANGLE 
e1:expr ^RANGLE e2:expr #LANGLE #RANGLE

Accent character (^) means “use this symbol as operator/decoration”.

In similar way we go through both expressions of bit indexing — “expr<expr>” — to finally get:

expr -> e1:expr MINUS e2:expr 
        { pair(e1.Item1 - e2.Item1, "("+e1.Item2+" - "+e2.Item2+")") }
      | e1:expr PLUS e2:expr 
        { pair(e1.Item1 + e2.Item1, "("+e1.Item2+" + "+e2.Item2+")") }
      | e1:expr ^LANGLE e2:expr #LANGLE #RANGLE 
        { pair(e1.Item1 << e2.Item1, "("+e1.Item2+" < "+e2.Item2+")") }
      | e1:expr ^RANGLE e2:expr #LANGLE #RANGLE 
        { pair(e1.Item1 >> e2.Item1, "("+e1.Item2+" > "+e2.Item2+")") }
      | e1:expr #LANGLE #RANGLE LANGLE e2:expr #LANGLE RANGLE 
        { pair((e1.Item1 & (1 << e2.Item1)) >> e2.Item1, "("+e1.Item2+"["+e2.Item2+"])") }
      | n:NUM 
        { pair(n, n) }
      ;

Run it and:

$ 1<2>3
The outcome: 0
The parse layout: ((1 < 2) > 3)

$ 1<7<2>
The outcome: 2
The parse layout: (1 < (7[2]))

Done!

Using NLT in your projects

Do you see anything you like? I cannot guarantee NLT is a cure for your problems with parsing, it is more an experimental project (I add features on-the-go as I need them), but give it a try — maybe it will serve your needs as well. I would be pleased to see that.

Why reinvent the wheel?

Did I google enough? I put it this way — even if I missed some really solid, YACC-like parser for C# out there (I would be surprised) I learned a lot. Not only I can use those tools, but I have good feel how they work starting from the lexer internals, so it was not wasted time. The only missing piece now is building my own compiler, but that is another story…

References

Coursera “Compilers” on-line course by Prof. Alex Aiken,
“Parsing Techniques” by Dick Grune, and Ceriel J.H. Jacobs (1990),
“The Theory of Parsing, Translation, and Compiling (vol.1)” by Alfred V. Aho, and Jeffrey D. Ullman (1972).

Resources

http://skila.pl/ — blog partially related to NLT development,
https://sourceforge.net/projects/naivelangtools/ — NLT repository.

License

This article, along with any associated source code and files, is licensed under The MIT License

Written By

greenoldman

Poland

I am interested in such topics:

* usability and accessibility of user interfaces,
* design of programming languages,
* natural language processing,
* machine translation,
* data mining.

Currently I am developing compiler and programming language Skila.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.