## Introduction

When you can handle string/text processing with
just regular expressions that is fine (and fast), but when you get
more complex task to do having more serious machinery — lexer and parser
— comes handy. There are already some parsers for C# written, but I
feel the place for old-school YACC-like parser is not occupied. Well,
until now.

## Background

The “Naive Language Tools” suite which I describe below is actually a
side-effect of translating COOL framework to C#.

About a year and a half I enrolled in “Compilers”
course by Prof. Alex Aiken (highly recommended) because I always was
curious how compilers work and how to write one. The entire course
was more fulfilling than I imagined and I wanted to add something
from me (as repay, one could say) — at that time course staff
provided C++ and Java frameworks, so why not make a framework in my
favorite C#?

There was just one problem — at the time of writing
C# framework I couldn’t find reliable, YACC/CUP-like parser for C#
(similarity to those tools was a must to let the students switch from
one framework to another).

Well, what the heck, I just finished the course, I
wrote my own COOL compiler, I should be able to pull off writing a
lexer and a parser as well. And indeed I was — that is how NLT was
born. At first, with features required to get COOL compiler built,
but in time I added more and more features.

The lexer got basic context scanning, the parser
became a LALR(k) forking parser, and since I use NLT for my own
purposes, it is more and more polished every day. OK, OK, more like
every week.

## Let’s see some action

*The code of this article is included in NLT package as “05. Patterns and forking” example.*

Please allow me to introduce some rather crazy calculator to show how NLT can handle the language ambiguity. Let’s build calculator with addition and subtraction as usual, but also with bit shifts (“`a`

< `b`

”, “`a`

> `b`

”) and bit indexing (“`a`

<`b`

>” — it gives `b`

-th bit of “`a`

”).

The priority and associativity of the operators — from lowest to highest priority:

- summation and subtraction — left to right,
- bit shifts — left to right,
- bit indexing.

## Lexer — chopping up the input

One can define all the actions directly in C# code, but also write them as NLT grammar and then let the NLT generator create lexer and parser. Here we use the latter approach for the sake of readability.

Let’s start from boring stuff:

lexer LexerFactory;
token SymbolEnum;
states StatesEnum
*INIT
end

There is a definition of the class name of the lexer factory, the type (enum) name of tokens and states. Since our calculator is quite basic we will need just one state (asterisk says it will be the default one).

Chopping the input is done as follows:

scanning // start of lexer rules
"+" -> PLUS;
"-" -> MINUS;
"<" -> LANGLE;
">" -> RANGLE;

We match the text using string pattern and we return a terminal (uppercase is used for making a distinction from non-terminal and special token — “`Error`

”). Next we use regular expression pattern to grab the numbers:

// regex as pattern
/[0-9]+/ -> NUM, Convert.ToInt32($text);

“`$text`

” is one of the pseudo-variables. By default the matched text is assigned to the value of the token, but better to convert text to number here than later in the parser.

We have to add a little clean-up:

" " { }; // ignore spaces
/./ -> Error; // "anything else" rule
%EOF -> EOF;
end // end of lexer rules

Since one cannot match end of file (EOF)

using string or regex pattern, thus the special entry “`%EOF`

”. The `EOF`

on the right is a predefined terminal.

## Parser — finding out the structure

### Take off

Again, we start with boring stuff:

parser ParserFactory;
types
NUM int;
expr Tuple<int,string>;
end

The “type” section allows us to define the types for all symbols — this way we won’t need to cast variables in parser actions. Here we have a definition of `NUM`

terminal as “int”, and definition of the expression as a pair of values.

We would like to keep things simple here — the outcome of the calculation as in every calculator is desired of course — thus the “int” as the first element of the pair, but since the parsing will be all the fun here, we add a track of parse layout (flat parse tree) — single “string” as the second element will suffice.

With all things set up we can finally deal with parsing productions:

parsing // start of parser rules
comp -> e:expr { e };

In NLT the symbol used on left hand side (LHS) of the first production is recognized as a start symbol and it can appear only once. Here we basically say that entire computation is an expression.

expr -> e1:expr MINUS e2:expr
{ pair(e1.Item1 - e2.Item1, "("+e1.Item2+" - "+e2.Item2+")") }
| e1:expr PLUS e2:expr
{ pair(e1.Item1 + e2.Item1, "("+e1.Item2+" + "+e2.Item2+")") }
| e1:expr LANGLE e2:expr
{ pair(e1.Item1 << e2.Item1, "("+e1.Item2+" < "+e2.Item2+")") }
| e1:expr RANGLE e2:expr
{ pair(e1.Item1 >> e2.Item1, "("+e1.Item2+" > "+e2.Item2+")") }
| e1:expr LANGLE e2:expr RANGLE
{ pair((e1.Item1 & (1 << e2.Item1)) >> e2.Item1, "("+e1.Item2+"["+e2.Item2+"])") }
| n:NUM
{ pair(n, n) }
;
end // end of parser rules

Going from top to bottom we define an expression as a subtraction, sum, bit shifts or bit indexing of expressions. The last rule states that an expression is a single number as well.

Is this enough? When we run NLT generator (use “refresh” script to do it) over this grammar we will quickly discover we miss some pieces — the NLT report starts with:

Reduce/shift conflict on symbol MINUS. Affected items: 3.1, 3.0
3.0) expr := expr MINUS expr .
3.1) expr := expr . MINUS expr

*The actual report is a bit richer in information.*

It means that with the given rules NLT parser is unable to decide what “1-2-3” makes — should it be read as “(1-2)-3” or “1-(2-3)”? The same problems is with `PLUS`

symbol.

To solve this we add precedence section:

precedence
op left MINUS PLUS;
end

This is simple precedence rule for “operator” symbols — here `MINUS`

and `PLUS`

— which tells to bind operators from left to right (or in other words the production should be reduced as in (3.0)).

Let’s run generator again:

Reduce/shift conflict on symbol LANGLE. Affected items: 3.3, 3.5, 3.0
3.0) expr := expr MINUS expr .
3.3) expr := expr . LANGLE expr
3.5) expr := expr . LANGLE expr RANGLE

That’s easy — we already said we want addition and subtraction to have the lowest priority. So all it takes is adding the second rule in precedences:

op left LANGLE RANGLE;

We use “left” because we want `xANGLEs`

to have also left to right binding.

Life is beautiful, no conflicts this time so we can run our parser to try out some calculations:

**$** 1-2-3
The outcome: -4
The parse layout: ((1 - 2) - 3)

**$** 1<2<3
The outcome: 32
The parse layout: ((1 < 2) < 3)

Let’s check bit indexing:

**$** 1<2>
There were errors while parsing.

### The bigger picture

NLT adds to the above error also a comment saying:

No action defined at node 10 for input "EOF" with stack "expr RANGLE".

The complain has its merit — the parser after seeing “1<2”, reduced it (we told it to do so) to an expression and later it didn’t know what to do with dangling right angle. It appears NLT warned us about this before, but we read too small portion of the conflicts report.

Here is the culprit:

Reduce/shift conflict on symbol LANGLE. Affected items: 7.3, 7.5, 7.0
7.0) expr := expr LANGLE expr .
7.3) expr := expr . LANGLE expr
7.5) expr := expr . LANGLE expr RANGLE

This conflict translates to such scenario — we have “1 < 2 < …” and we have to decide if “1<2” can make a valid expression or should we wait for more input. Looking at (7.0) against (7.3) we would say “reduce” (left associativity), but then — looking at (7.0) vs (7.5) we would say shift (bit indexing has higher priority). There is no (reasonable) way we can tell for sure what it will be.

Dead end? No — why not try both ways and see what we will get.

Because now we delete the line

op left LANGLE RANGLE;

first we have to define `LANGLE`

/`RANGLE`

with higher priority than `PLUS`

or `MINUS`

:

precedence
op left MINUS PLUS;
rs shift LANGLE expr(PLUS MINUS) expr;
rs shift RANGLE expr(PLUS MINUS) expr;
end

The order **is** priority, and since we want to solve problem exactly with `xANGLE`

vs. `PLUS`

/`MINUS`

we have to give more precise definition. Those two added lines mean:

- on reduce/shift conflict
- perform shift
- if the incoming symbol is
`LANGLE`

(`RANGLE`

in the last line) - and the expression which is about to be reduced contains
`PLUS`

or `MINUS`

operator - and it conflicts with expression which is about to be shifted.

OK, so we know how to precisely select conflicts to give them exact orders. The second half is similar:

rs try LANGLE expr(LANGLE RANGLE) expr;
rs try RANGLE expr(LANGLE RANGLE) expr;

only here we “try” (fork). Since we rely on input, for such cases:

**$** 7<2
The outcome: 28
The parse layout: (7 < 2)

**$** 7<2>
The outcome: 1
The parse layout: (7[2])

it works. But sometimes it doesn’t.

**$** 7<2>1
There were errors while parsing.
Ambiguous parsing.

### Ambiguity bites again

What happened? By adding “try” directive we didn’t really solved ambiguity but pushed solution from the grammar to the input. As the effect we got two, not one, parsing trees. Some languages have constraints which come with the input, but it is not our case — we have to add constraints manually.

Consider the last example — “7<2>1”. It can be parsed as “(7<2)>1” or (incorrectly) as “7<(2>1)”. Speaking in terms of constraints this means we don’t want to see a bit shift in the right branch.

Previously we had such rules for bit shifts:

e1:expr LANGLE e2:expr
e1:expr RANGLE e2:expr

after adding constraints:

e1:expr LANGLE e2:expr #LANGLE #RANGLE
e1:expr RANGLE e2:expr #LANGLE #RANGLE

With hash character (#) we forbid given operator to appear directly at given place (here: right expression). Since the “operator” term is unknown to NLT (it is not a symbol per se — it can reuse the same name though) we have to mark (decorate) the expressions:

e1:expr ^LANGLE e2:expr #LANGLE #RANGLE
e1:expr ^RANGLE e2:expr #LANGLE #RANGLE

Accent character (^) means “use this symbol as operator/decoration”.

In similar way we go through both expressions of bit indexing — “`expr`

<`expr`

>” — to finally get:

expr -> e1:expr MINUS e2:expr
** **{ pair(e1.Item1 - e2.Item1, "("+e1.Item2+" - "+e2.Item2+")") }
| e1:expr PLUS e2:expr
{ pair(e1.Item1 + e2.Item1, "("+e1.Item2+" + "+e2.Item2+")") }
| e1:expr ^LANGLE e2:expr #LANGLE #RANGLE
{ pair(e1.Item1 << e2.Item1, "("+e1.Item2+" < "+e2.Item2+")") }
| e1:expr ^RANGLE e2:expr #LANGLE #RANGLE
{ pair(e1.Item1 >> e2.Item1, "("+e1.Item2+" > "+e2.Item2+")") }
| e1:expr #LANGLE #RANGLE LANGLE e2:expr #LANGLE RANGLE
{ pair((e1.Item1 & (1 << e2.Item1)) >> e2.Item1, "("+e1.Item2+"["+e2.Item2+"])") }
| n:NUM
{ pair(n, n) }
;

Run it and:

**$** 1<2>3
The outcome: 0
The parse layout: ((1 < 2) > 3)

**$** 1<7<2>
The outcome: 2
The parse layout: (1 < (7[2]))

Done!

## Using NLT in your projects

Do you see anything
you like? I cannot guarantee NLT is a cure for your problems with
parsing, it is more an experimental project (I add features on-the-go as
I need them), but give it a try — maybe it will serve your needs as
well. I would be pleased to see that.

## Why reinvent the wheel?

Did I google enough? I put it this way — even
if I missed some really solid, YACC-like parser for C# out there (I
would be surprised) I learned a lot. Not only I can use those tools,
but I have good feel how they work starting from the lexer internals,
so it was not wasted time. The only missing piece now is building my
**own**
compiler, but that is another story…

## References

- Coursera “Compilers” on-line course by Prof. Alex Aiken,
- “Parsing Techniques” by Dick Grune, and Ceriel J.H. Jacobs (1990),
- “The Theory of Parsing, Translation, and Compiling (vol.1)” by Alfred V. Aho, and Jeffrey D. Ullman (1972).

## Resources