Generating a High-Speed Parser, Part 1: re2c

Matt Scarpino

5.00/5 (5 votes)

Nov 21, 2015

CPOL

10 min read

21048

601

This article explains how to generate a high-performance text scanner using re2c.

Download demo - 2.5 KB

Parsing text is a tedious, error-prone process, so rather than code a parser from scratch, I prefer to use tools that generate parsers. There are many options available, including ANTLR, bison, and yacc. But for my current project, I need a compact C/C++ parser that executes quickly. Therefore, I've chosen two tools: re2c to generate the lexer and lemon to generate the parser.

It's important to understand the difference between them. re2c's job is to generate a lexer, which analyzes incoming text and divides it into meaningful elements called tokens. lemon's job is to generate a parser, which accepts tokens from the lexer and constructs a model of the text's hierarchical structure.

The good news is that re2c and lemon are freely available and both generate compact, high-speed C code. The bad news is that they both have significant learning curves. In writing these articles, my goal is to lessen the pain associated with learning re2c and lemon.

I'm not an expert with either tool, and I won't pretend that these articles provide a thorough treatment. But I have successfully used re2c and lemon to generate quality lexers/parsers, and I hope these articles will help you do the same. The first article focuses on re2c and the second article discusses lemon.

1. Preliminary Requirements

To explain how re2c can be used to generate a lexer, this article adopts a hands-on approach. I don't expect you to know the theory of lexical analysis, but I do expect that you can program in C.

The example code presented in this article has been compiled with gcc. However, it should be usable with any C/C++ build system.

2. Lexer Overview

This article focuses on using re2c to generate a lexer capable of scanning C code. This lexer can't recognize all of C's syntax, but it can scan the following code:

/* This is a simple test 
for the re2c-generated lexer */
int main() {
  printf("Hello World");
  return 0;
}

The lexer reads each character and uses regular expression matching to return a sequence of tokens. For example, a left parenthesis corresponds to the LPAREN token and the return keyword corresponds to the RETURN token. The tokens.h file associates each token name with a number.

The lexer discussed in this article splits the text into tokens and then prints the type, line number, and position of each token. For the above code, this output is given as follows:

INT_TYPE found on Line 3 at Position 0
NAME ('main') found on Line 3 at Position 4
LPAREN found on Line 3 at Position 8
RPAREN found on Line 3 at Position 9
LBRACE found on Line 3 at Position 11
NAME ('printf') found on Line 4 at Position 2
LPAREN found on Line 4 at Position 8
STRING_LITERAL ('"Hello World"') found on Line 4 at Position 9
RPAREN found on Line 4 at Position 22
SEMICOLON found on Line 4 at Position 23
RETURN_KW found on Line 5 at Position 2
INT_LITERAL ('0') found on Line 5 at Position 9
SEMICOLON found on Line 5 at Position 10
RBRACE found on Line 6 at Position 0

The lexer doesn't create a token for the initial multi-line comment, but it adds an offset to the line numbers following the comment.

3. Writing the Token File

The name re2c stands for regular expressions to C and that's exactly what it accomplishes. It generates code that checks text against a series of regular expressions. When it finds a match, it executes C code associated with the expression. For example, the following text tells re2c to return the predefined INT_TYPE token whenever it encounters the int string in code:

"int"       { return(INT_TYPE); }

The associations between text and tokens is defined in a file called a token file (*.re). The hard part of using re2c is writing this file so that it can be converted into lexer code. For the lexer discussed in this article, the token file is simple_c.re. The process of writing this file consists of four steps:

Define a Scanner structure
Provide code to execute the scanner
Associate regular expressions with code
Assign values to re2c-specific macros

This section describes how each of these steps can be accomplished for a simple lexer. Where possible, I'll keep to the conventions set by the example code in the re2c distribution.

3.1 Define the scanner

For the sake of convenience, the example code stores re2c-specific data in a structure called Scanner. This contains the current line number in the text (line) and four pointers that reference memory locations in the text. The following code shows what this structure looks like:

typedef struct Scanner {
  char *top, *cur, *ptr, *pos;
  int line;
} Scanner;

Table 1 lists each pointer and the purpose it serves.

Table 1: Eight Pointers of the Scanner Structure
Pointer	Purpose
`top`	Points to the start of the current token
`cur`	Points to the character being examined
`ptr`	Used by re2c for back-tracking
`pos`	Points to the start of the line

In the example code, each pointer references the position of a char. This tells the lexer that each input symbol occupies one byte. Depending on the size of the character set, this data type could also be set to short, int, or an unsigned integer type.

3.2 Provide code to execute the scanner

The token file needs a function to start the scanning process. If the program's only purpose is to scan text, this would be the main function. Whatever the name is, this function has three responsibilities:

Create the Scanner structure
Provide the scanner with data
Create a scan loop to process the incoming text

The first two steps are easy. The following code reads input text from test.dat into a buffer. Then it sets three of the Scanner's pointers to point to the start of the buffer.

// Access the text file
fp = fopen("test.dat", "r");
fread(buff, 1, size, fp); 

// Create and initialize the Scanner
Scanner scanner;
scanner.top = buff;
scanner.cur = buff;
scanner.pos = buff;  
scanner.line = 1;

The third step is more involved. The scan loop cycles through the buffer's content, performing one iteration for each token it finds. The following code gives an idea of how this works:

int token = 0;
buff_end = (char*) (((char*)buff) + size);
while(token = scan(&scanner, buff_end)) {
  ... display output for the returned token ...
}

The real work of re2c is performed in the scan function, which returns an integer that uniquely identifies the token. In general, the scan function contains a little customization code and a lot of re2c configuration.

If each input symbol is a char, the scan function might look like this:

int scan(Scanner* s, char *buff_end) {

regular:
  if (s->cur >= buff_end) {
    return END_TOKEN;
  }
  s->top = s->cur;

/*!re2c
  ... analyze regular code ...
*/

comment:
/*!re2c
  ... analyze comment ...
*/
}

The scan function starts by checking if the current character (s->cur) is at the end of the text buffer (buff_end). If so, the function returns END_TOKEN. If not, the start of the token (s->top) is set equal to the current character (s->cur).

This code has two labels: regular and comment. They're needed because the goto statements in the lexer code repeat processing starting from one of the two labels. Like all proper programmers, I abhor the use of goto statements, but in lexical analysis, there's no getting around them.

Both comments in the code start with /*!re2c. This tells the re2c executable that the comment block contains instructions related to generating the lexer code. The following discussion explains how this works.

3.3 Associate regular expressions with code

In the scan function, instructions to re2c are provided inside comment blocks that start with /*!re2c. These instructions can take several forms, and the three most common statements are:

name = regexp; - associates an identifier with a regular expression (called a named definition)
regexp {code} - associates a regular expression with C code to be executed (called a rule)
re2c:name = value; - configures re2c's name parameter (called an inplace configuration)

Named definitions clarify the use of regular expressions. For example, instead of using [a-zA-Z] in a rule, a named definition such as 'letter = [a-zA-Z]' allows the name letter to be used in its place.

re2c recognizes a variety of regular expressions. Table 2 lists each of them with a description.

Table 2: Accepted Regular Expression Formats
Expression Format	Description
`"foo"`	Literal string (case-sensitive)
`'foo'`	Literal string (case-insensitive)
`[xyz]`	Character class (matches 'x', 'y', or 'z')
`[^class]`	Inversion of character class
`a-z`	Range (any character from 'a' to 'z')
`r \ s`	Match any character class `r` that isn't `s`
`r*`	Zero or more `r` expressions
`r+`	One or more `r` expressions
`r?`	Zero or one `r` expressions
`name`	The named definition
`(r)`	Overrides precedence for `r`
`r s`	Expression `r` followed by `s`
`r\|s`	Either `r` or `s`
`r/s`	`r` only if followed by `s`
`r{n}`	Matches `r` exactly `n` times
`r{n,}`	Matches `r` at least `n` times
`r{n, m}`	Matches `r` at least `n` times, but no more than `m` times
`.`	Any character except a newline

The first two expressions are particularly important. The following statement tells re2c to return a RETURN token when the lexer encounters the return string:

"return"        { return RETURN; }

If the double-quotes are replaced with single-quotes, the pattern matching will be case-insensitive. That is, the token will be returned when the lexer encounters return or RETURN.

To make the rules easier to read and write, simple_c.re uses the following named definitions:

dig = [0-9];
let = [a-zA-Z_];

These names are used in the pattern for matching identifiers. An identifier can contain numbers, but the first character must be a letter or underscore. Therefore, the rule for identifiers is given as follows:

let (let|dig)*  { return NAME; }

Whitespace consists of one or more spaces, \t, \v, or \f characters. This definition is given as:

whitespace = [ \t\v\f]+;

When the lexer encounters whitespace in regular code, it should ignore the characters and restart processing from the regular label. This is given in the following rule:

whitespace      { goto regular; }

In addition to rules and named definitions, the re2c-specific statements can include inplace configuration statements. These associate an re2c property with a value. For example, the re2c:yyfill:enable property identifies whether the lexer should refill the buffer. This isn't necessary in simple_c.re because the buffer contains all of the text. Therefore, the following configuration statement is used:

re2c:yyfill:enable = 0;

re2c has many other properties that can be set. The full list is provided by the re2c manual, located here.

3.4 Assign values to macros

The Scanner structure clarifies memory operations for developers, but the generated lexer code doesn't know anything about it. It interacts with the application using macros whose names start with YY. Table 3 lists three important macros:

Table 3: Important re2c Macros
Macro	Description	Usual Value
`YYCTYPE`	Data type of each input symbol	`char`, `short`, `int`, or an unsigned type
`YYCURSOR`	The current cursor location	A `Scanner` pointer (`s->cur`)
`YYMARKER`	Used by re2c to store the position for back-tracking	A `Scanner` pointer (`s->ptr`)

In simple_c.re, the following code assigns values to these macros:

#define   YYCTYPE     char
#define   YYCURSOR    s->cur
#define   YYMARKER    s->ptr

When re2c generates code for the lexer, it accesses the input text using macros like YYCTYPE and YYCURSOR. With these #define statements in place, the compiler will associate re2c's macros with the developer's Scanner structure.

4. Generating the Lexer

The token file consists of C code but it's not meant to be directly compiled. Before compiling, it should be processed with the re2c executable, which can be downloaded here. This command-line utility accepts a number of helpful flags, including the following:

-d - dumps information about the lexer's current position and state
-u - supports the full Unicode character set
-o - sets the name of the file to contain the lexer code
-D - produce a Graphviz dot view of the deterministic finite automata (DFA) results

By default, re2c prints its results to standard output. But the following command tells re2c to convert simple_c.re into a lexer source file called simple_c.c:

re2c -o simple_c.c simple_c.re

If the conversion is successful, the simple_c.c file can be compiled into a lexer with a regular build command such as the following:

gcc -O3 -o simple_c simple_c.c

As simple_c executes, it reads text from test.dat into a buffer and calls the scan function in a loop. Each iteration of scan returns a token, and in the case of simple_c, each iteration prints the type of token and the line and position where it was encountered.

5. Using the code

The code for this article generates a simple application that reads simple C code. It's provided in an archive called re2c_demo.zip, which contains four files:

simple_c.re - lexer definition file, re2c can use this to generate a lexer
tokens.h - associates each token with an integer value
test.dat - Contains C code to be scanned
Makefile - Contains instructions for generating the lexer and building the application

If you have GNU's make and gcc utilities, you can execute the make command. This reads the Makefile instructions and automatically builds the application.

If you want to build the application manually, you need to perform two steps:

Use re2c to generate the lexer code (re2c -o simple_c.c simple_c.re)
Compile the lexer code into an executable (gcc -o simple_c simple_c.c)

As the application runs, it reads characters from test.dat and prints information about each token it recognizes.

History

11/21/2015 - The article was completed and released for the first time