Formula Compiler using C++ BNF-like EDSL

Alexander Semjonov

4.94/5 (8 votes)

Nov 11, 2017

GPL3

8 min read

18673

143

The paper presents simplest byte-code formula compiler using C++ BNF-like embedded domain specific language (EDSL). It introduces BNFLite template library and implements byte-code for parallel computations.

Download formula compiler - 20.8 KB

Preface

Is there an area where a new programming language developed from scratch can be applied? For the first glance, there are no requirements which cannot be covered by the existing languages. However, such areas exist, e.g., systems where parallel calculation is required. It is remarkable that the computer society has been looking at a language to support grammar parsers itself! BNF (Backus Naur Form) notation and its derivatives (ABND, EBNF, etc.) are widely used for formal specifications of programming languages.

Although such specifications are really good for both requirements and design, they do not help for parser programming. Another approach is to generate code for parsers. The most famous tool is yacc (Yet Another Compiler Compiler) the precompile routine which can generate C parser code from syntactic definition in BNF-like form. This solution is considered as heavy and the software community has been looking for something easier.

For C++, the approach is obvious: a lite solution should be presented as embedded domain specific language (EDSL). It means BNF-like statements have to be mapped to both C++ keywords and overloaded operators. One known example is boost::spirit template library. This paper introduces the BNFLite C++ template library and demonstrates how to create the simplest byte-code formula compiler using this tool.

Introduction to BNFLite

The BNFLite implements embedded domain specific language approach for grammar specifications when all "production rules" are constructed as instances of C++ classes concatenated by means of overloaded arithmetic operators. For example, the integer number in BNF specs can be described as:

    <digit> ::= <0> | <1> | <2> | <3> | <4> | <5> | <6> | <7> | <8> | <9>
    <number> ::= <digit> | <digit> <number>

This representation can be written in C++ using BNFlite approach:

    #include "bnflite.h" // include source code lib
    using namespace bnf;
    Lexem Digit = Token("0") | "1"  | "2" | "4" | "5" | "6" | "7" | "8" | "9";
    Lexem Number;
    Number = Digit | Digit + Number;

Execution of this C++ code produces an internal date structure for further parsing procedures.

Requirements for Formula Compiler

Let's specify base requirements to the language of the formula compiler to be developed.

Byte-code compiler shall be founded on expression like language:
1. The following base types shall be supported: <INTEGER> | <FLOAT> | <STRING>
2. The compiler shall support unary negative operation and basic binary arithmetic operations
3. The compiler shall be able to generate code to call C++ functions defined in body program

This is enough for design of simple programming language, which can be described in Backus-Naur Form.

Formal BNF term is called "production rule". Each rule except "terminal" is a conjunction of a series of more concrete rules or terminals: production_rule ::= <rule_1>...<rule_n> | <rule_n_1>...<rule_m>. So, the designed expression-like language can be presented by the following terms:

<operand> ::= <INTEGER> | <FLOAT> | <STRING>
<operator> ::= "+" | "-" | "*" | "/"
<arguments> ::= <EMPTY> | <expression> | < arguments> "," <expression>
<function> ::= <IDENTIFIER> "(" < arguments > ")"
<expression> ::= <operand> | <function> | "-" 
<expression> | <expression> <operator> <expression> | "(" <expression> ")"

Such decryption is formally correct but it has two known issues. Firstly, binary operator precedence is not presented. Secondly, it contains redundant number of recursions which can be substituted by repetitions.

Augmented BNF specifications introduce constructions like <a>*<b><element> to support repetition where <a> and <b> imply at least <a> and at most <b> occurrences of the element. For example, 3*3<element> allows exactly three and 1*2<element> allows one or two. *<element> simplified construction allows any number (from 0 to infinity). Alternatively, 1*<element> encloses an optional element (at least one) and it can be optionally presented as [element].

Full specifications of the formula compiler language in ABNF form are not so complex:

ALPHA  =  'A'-'Z' | 'a'-'z'
DIGIT  =  '0'-'9'
digit1_9 =  '1'-'9'
string  = '\"' *( ALPHA  | DIGIT | '_' ) '\"'
e = 'e' | 'E'            
exp = e ['-' | '+'] 1*DIGIT
frac = '.' 1*DIGIT
int = '0' | ( digit1_9 *DIGIT )
number = [ '-' ] int [ frac ] [ exp ]
operand = int | number | string
operator = '+' | '-' | '*' | '/'
identifier =  (ALPHA | '_') *( ALPHA  | DIGIT | '_' )
arguments ::=  expression *( ',' expression )
function ::= identifier '(' [arguments]  ')'
elementary ::= operand | function | ('-' elementary) | ('(' expression ')')
primary = elementary *(('/' elementary) | ('*' elementary))
expression = primary *(('-' primary) | ('+' primary))

These specs are more computer friendly than previous ones.

Short Design Notes

The simple byte-code compiler is a minor extension of the expression calculator. For example, the expression 2 + 3 * 4 can be converted to tree.

└──+
   ├── 2
   └──*
      ├── 3
      └── 4

It can be written on C manner “add(2, mul(3, 4))” form. Let's write it reversely: ” (2,(3,4)mul)add”. This form is called as reverse polish notation (RPN) and byte-codes can be easily generated:

Int(2) Int(3), Int(4), Mul<I,I>, Addl<I,I>

Int(number) operation pushes the number to the stack. Mul/Add operation pops two parameters from the stack and pushes the result.

Practical benefit of the formula compiler comes only when functions are used. Let's consider “liner map” 2 + 3 *GetX(). The byte-code will be:

Int(2) Int(3), Call<0>, Mul<I,I>, Addl<I,I>

For example, this functionally can be cyclically applied to X database column to obtain Y column (Moreover, the provided demo does four calculations simultaneously, but it is rather the scope of another paper).

All byte-codes consist of two fields: type and union for integer, float and pointer to string.

struct byte_code
{
    OpCode type;
    union {
        int val_i;
        float val_f;
        const char* val_s;
    };
    byte_code(): type(opNop), val_i(0) {};
    /* … */

The union represents the operand for the simple “immediate addressing mode”. The type is the operation code based upon the following enumerator:

enum OpCode
{
    opFatal = -1, opNop = 0,
    opInt = 1,  opFloat = 2,  opStr = 3,  opMaskType = 0x03,
    opError = 1, opNeg = 2, opPos = 3, opCall = 4,
    opToInt = 5,  opToFloat = 6, opToStr = 7,
    opAdd = 2,  opSub = 3,  opMul = 4,  opDiv = 5,
};

The real operation code is a little more complex. It contains types for stack operands.

BNFLite Grammar classes (Token, Lexem and Rule)

The BNFLite description is similar to EBNF specifications above.

Number

Token digit1_9('1', '9');
Token DIGIT("0123456789");
Lexem i_digit = 1*DIGIT;
Lexem frac_ = "." + i_digit;
Lexem int_ = "0" | digit1_9  + *DIGIT;
Lexem exp_ = "Ee" + !Token("+-") + i_digit;
Lexem number_ = !Token("-") + int_ + !frac_ + !exp_;
Rule number = number_;

The Token class represents terminals, which are symbols. The Lexem constructs strings of symbols. Parsing is a process of analyzing tokens and lexemes of the input stream. The unary C operator `*` means the construction can be repeated from 0 to infinity. The binary C operator `*` means the construction like 1*DIGIT can be repeated from 1 to infinity. The unary C operator `!` means the construction is optional.

Practically, Rule is similar to Lexem except for callbacks and spaces sensitivity. The construction Rule number_ = !Token("-") + int_ + !frac_ + !exp_; allows incorrect spaces, e.g., between integer and fractional parts.

Strings and identifiers

    Token az_("_"); az_.Add('A', 'Z'); az_.Add('a', 'z');
    Token az01_(az_); az01_.Add('0', '9');
    Token all(1,255); all.Remove("\"");
    Lexem identifier = az_  + *(az01_);
    Lexem quotedstring = "\"" + *all + "\"";

The Token all represents all symbols except quote;

Major parsing rules

    Rule expression;
    Rule unary;
    Rule function = identifier + "(" + !(expression + *("," + expression)) +  ")";
    Rule elementary = AcceptFirst()
            | "(" + expression + ")"
            | function
            | number
            | quotedstring + printErr
            | unary;
    unary = Token("-") + elementary;
    Rule primary = elementary + *("*%/" + elementary);
    /* Rule */ expression = primary + *("+-" + primary);

Expression and unary rules are recursive. They need to be declared earlier. The AcceptFirst() statement changes parser behavior from the default mode to “accept best” for this particular Rule. After that, the parser chooses a first appropriate construction instead of the most appropriate one.

Parser Calls and Callbacks

Recursive descent parser implies tree composition. The user needs a means to track the tree traversal recursion. First of all, the interface object structure to interact with the parser should be specified.

typedef Interface< std::list<byte_code> > Gen;
/* ... */
const char* tail = 0;
Gen result;
int tst = Analyze(expression, expr.c_str(), &tail, result);

Parsing starts by the Analyze call with the following arguments:

expression – root Rule object of BNFlite gramma
expr.c_str() – string of the expression to be compiled to byte-code
tail – pointer to end of text in case of successful parsing or to place where parsing is stopped due to error
result – final object from users callbacks
tst – contains result of parsing implemented as bitwise flags, error if negative

The Callback function accepts vector of byte-code lists formed early. It returns the list of new formed byte-codes. For example, Gen DoNumber(std::vector<Gen>& res) accepts the one element vector representing a number. It has to return the result Gen object with filled user's data (either Int<> or Float<>).

In common case, the callback Gen DoBinary(std::vector<Gen>& res) accepts the left byte-code vector, the sign of operation(+**/), and the right byte-code vector. The callback just joins left and right byte-code vectors and generates the tail byte-code according to the sign.

Comparison with Boost::Spirit Implementation

This is a grammar part implemented by means of boost::spirit template library:

namespace q = boost::spirit::qi;

typedef boost::variant<
    int, float, std::string,
    boost::recursive_wrapper<struct binary_node>,
    boost::recursive_wrapper<struct unary_node>,
    boost::recursive_wrapper<struct call_node>
> branch_t;

template <typename Iterator>
struct SParser: q::grammar<Iterator, branch_t, q::space_type>
{
    q::rule<Iterator, branch_t, q::space_type> expression, primary, elementary, unary;
    q::rule<Iterator, std::string()> quotedstring;
    q::rule<Iterator, std::string()> identifier;
    q::rule<Iterator, std::vector<branch_t>, q::space_type> arglist;
    q::rule<Iterator, call_node, q::space_type> function;

    SParser() : SParser::base_type(expression)
    {
    using boost::phoenix::construct;
        
    expression = primary[q::_val = q::_1]
        >> *('+'  > primary[q::_val = construct<binary_node>(q::_val, q::_1, '+')]
            | '-' > primary[q::_val = construct<binary_node>(q::_val, q::_1, '-')] );
    primary =   elementary[q::_val = q::_1]
        >> *('*'  > elementary[q::_val = construct<binary_node>(q::_val, q::_1, '*')]
            | '/' > elementary[q::_val = construct<binary_node>(q::_val, q::_1, '/')] );
    unary = '-' > elementary[q::_val = construct<unary_node>(q::_1, '-')]; 

    elementary = q::real_parser<float, q::strict_real_policies<float>>()[q::_val = q::_1]
        | q::int_[q::_val = q::_1]
        | quotedstring[q::_val = q::_1]
        | ('(' > expression > ')')[q::_val = q::_1]
        | function[q::_val = q::_1]
        | unary[q::_val = q::_1];

    quotedstring = '"' > *(q::char_ - '"') > '"';
    identifier = (q::alpha | q::char_('_')) >> *(q::alnum | q::char_('_'));

    arglist = '(' > (expression % ',')[q::_val = q::_1] > ')';
    function = (identifier > arglist)[q::_val = construct<call_node>(q::_1, q::_2)];

    on_error(expression, std::cout << boost::phoenix::val("Error "));
    }
};

Please, refer to `boost::spirit` documentation for details. The example is provided for demonstration similarity of `boost::spirit` grammar to BNFlite one.

Boost::spirit is a more mature tool and does not use callbacks. For each “rule” it calls constructors of classes which should be member of boost::variant. Therefore, boost::spirit cannot be used for one-pass compilers. However, boost::variant can be effective if something more complex is required (e.g. byte-code optimizers).

Problems of boost::spirit lie in another area. Any inclusion of boost::spirit considerably increases project compiler time. Another significant drawback of boost::spirit is related to run-time debugging of the grammar - it is not possible at all!

Debugging of BNFLite Grammar

Writing grammar by EDSL is unusual and the user does not have full understanding about the parser. If the Analyze call returns an error for the correct text, then the user always should take into consideration the possibility of grammar bugs.

Return code

Return code from Analyze call can contain flags related to the grammar. For example, eBadRule, eBadLexem flags mean the tree of rules is not properly built.

Names and breakpoints

The user can assign a name to the Rule. It can help to track recursive descent parser using Rule::_parse function. Debugger stack (history of function calls) can inform which Rule was applied and when. The user just needs to watch the this.name variable. It is not as difficult as it seems at first glance.

Grammar subsets

Analyze function can be applied as unit test to any Rule representing subset of grammar.

Tracing

Function with prototype “bool foo(const char* lexem, size_t len)” can be used in BNFLite expressions for both reasons: to obtain temporary results and to inform about predicted errors.

This function will print the parsed number:

static bool DebugNumber(const char* lexem, size_t len)
{    printf("The number is: %.*s;\n", len, lexem);    return true; }
    /* … */
Rule number = number_ + DebugNumber;

The function should return true if result is correct.

Let's assume the numbers with leading ‘+’ are not allowed.

static bool ErrorPlusNumber(const char* lexem, size_t len)
{printf("The number %.*s with plus is not allowed\n", len, lexem); return false;}
    /* … */
Lexem number_p = !Token("+") + int_ + !frac_ + !exp_ + ErrorPlusNumber;
Rule number = number_ | number_p;

The function should return false to inform the parser the result is not fit. C++11 constructions like below are also possible:

Rule number = number_ | [](const char* lexem, size_t len)
{ return !printf("The number %.*s with plus is not allowed\n", len, lexem); }

Building and Run

The attached simplest formula compiler package has the following C++ files:

main.cpp - starter of byte-code compiler and interpreter
parser.cpp - BNFLite parser with grammar section and callbacks
code_gen.cpp - byte-code generator
code_lib.cpp - several examples of built-in functions (e.g POW(2,3) - power: 2*2*2)
code_run.cpp - byte-code interpreter (used SSE2 for parallel calculation of 4 formulas)

The package has several prebuilt projects but generally it can be built as:

>$ g++ -O2 -march=pentium4 -std=c++14 -I.. code_gen.cpp  parser.cpp  code_lib.cpp  main.cpp code_run.cpp

Function GetX() returns four values: 0, 1, 2, 3. It can be used in expression to be compiled and calculated:

> $ a.exe "2 + 3 *GetX()"  
5 byte-codes in: 2+3*GetX()
Byte-code: Int(2),Int(3),opCall<I>,opMul<I,I>,opAdd<I,I>;
result = 2, 5, 8, 11;

Conclusion

The presented formula compiler is the realization of several ideas in order to demonstrate its feasibility. First of all, it is parallel computation according to compiled formula. Secondly, it introduces BNFLite header library to spread existing concept of applicability of BNF forms. Now a fast parser implementation can be easily created for specific customer requirements where BNF-like language is used.

References

[1] BNFLite with some examples, https://github.com/r35382/bnflite