This article describes a MATLAB-to-C code converter written using the GNU Bison parser generator and the GNU Flex lexical analyzer generator. The article explains the basic features of these utilities and shows how to generate the output code from the parser results. The converter generates code that is compiled and linked with a runtime support library. An introduction to the usage of this library is also provided.
The source code parser and the target code generator
The MATLAB code parser is generated (almost :-) automatically from the language syntax description by means of the GNU Bison and GNU Flex utilities. These utilities process the syntax definition file (TmcParParser.Y) and the lexer definition file (TmcParLexer.L) and generate the C/C++ code of the parser state machine.
The resulting parser state-machine is then called as a function:
tmc_parse();
Input file for Bison: the description of the source language syntax
The Bison utility is invoked as follows:
win_bison -o TmcParParser.tab.cpp -d -v TmcParParser.Y
Bison processes the syntax definition file TmcParParser.Y and creates the C++ files TmcParParser.tab.cpp and TmcParParser.tab.hpp that contain the parser code. Let's examine the structure of the syntax definition file.
In order to create a language parser, the source language syntax should be defined in some notation. One useful notation for programming languages is the Backus–Naur form (BNF): a BNF transcription of a language is a set of derivation rules for the language symbols. The syntax definition file (TmcParParser.Y) contains this transcription. Consider an example fragment of the file:
%{
/// File: TmcParParser.Y
#include "tmcpar_tree.h"
%}
%name-prefix="tmcpar_"
%union
{
char *str;
double num;
LSP_IDENT ident;
struct tree_const_val *lsp_const_val;
class T_ident *lsp_ident_val;
class T_const *lsp_constant_val;
class T_expr_bin *lsp_binary_expression_val;
class T_expr_gen *lsp_expression_val;
}
%token <ident> IDENT
%token <num> NUMBER
%token <num> NUMBER_IM
%token <str> STRING
%token PLUS
%token MUL
%left PLUS
%left MUL
%token END_OF_INPUT
%type <lsp_ident_val> ident
%type <lsp_constant_val> const
%type <lsp_const_val> variant_val
%type <lsp_expression_val> expression
%type <lsp_binary_expression_val> bin_expr
%start module
%%
module
: file_hdr list_function END_OF_INPUT
{
$$=$2;
tmcpar_parsing_module = $$;
YYACCEPT;
}
;
expression : identifier
{
$$=(T_expr_gen*)($1);
}
| const
{
$$=(T_expr_gen*)($1);
}
| bin_expr
{
$$=(T_expr_gen*)($1);
}
;
identifier : IDENT
{
$$ = create_identifier($1,(int)tmcpar_lineno,(int)tmcpar_colno);
}
;
const : variant_val
{
$$ = create_constant((enum CONST_VAL_TYPES)0,$1,(int)tmcpar_lineno,(int)tmcpar_colno);
}
;
variant_val : NUMBER
{
$$ = make_number($1,0,const_val_real);
}
| NUMBER_IM
{
$$ = make_number(0,$1,const_val_complex);
}
| STRING
{
$$ = make_string($1);
}
;
bin_expr : expression PLUS expression
{
$$ = create_binary_expression("+",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
}
| expression MUL expression
{
$$ = create_binary_expression("*",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
}
;
%%
// C++ code here
...
This file consists of four sections, delimited by the %{, %} and %% separators. The first section (between %{ and %}) contains C/C++ code; it may include the headers that define the classes and variables used. The second section contains the type definitions. The third section contains the syntax rules and the semantic actions to be performed on each rule match. The parser state machine runs iteratively: at each state a new matching rule is applied; otherwise the parser internally calls the lexical analyzer function tmcpar_lex, which reads the next lexeme from the input stream. Each rule has input arguments ($1, $2, etc.) and an output ($$). An input argument may be the result of a child rule or a lexeme returned by the lexical analyzer. Internally, these arguments are represented in the state machine by a union whose fields are defined by the %union command; a field may hold the data type returned by a lexeme or by another rule. The types of arguments that are rules are declared by the %type command, and the types of arguments that are lexeme data are declared by the %token command. Along with the value of a lexeme, the lexical analyzer also returns its token code.
For example, the definition
%union
{
double num;
...
}
%token <num> NUMBER
means that when the lexer returns a number, the returned token is NUMBER and its data type is double.
The definition
%union
{
class T_expr_gen *lsp_expression_val;
...
}
%type <lsp_expression_val> expression
means that when the parser rule returns an expression, its data type is class T_expr_gen *.
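In effect, %union declares a tagged union whose active member is implied by the %token and %type declarations. A hand-written C++ sketch of what this amounts to (hypothetical names, for illustration only — Bison generates the real storage itself):

```cpp
#include <cassert>
#include <string>

// Sketch of the semantic-value storage that %union produces: one slot
// whose active member is known from the token/rule type declarations.
enum SemanticTag { TAG_NUM, TAG_STR };

struct SemanticValue
{
    SemanticTag tag;       // which member is currently active
    union {
        double num;        // corresponds to %token <num> NUMBER
        const char *str;   // corresponds to %token <str> STRING
    };
};

inline SemanticValue make_number_value(double d)
{
    SemanticValue v; v.tag = TAG_NUM; v.num = d; return v;
}

inline SemanticValue make_string_value(const char *s)
{
    SemanticValue v; v.tag = TAG_STR; v.str = s; return v;
}
```

Bison never stores the tag explicitly; the grammar's %token/%type declarations play that role at generation time.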
The rule
expression : identifier
{
$$=(T_expr_gen*)($1);
}
| const
{
$$=(T_expr_gen*)($1);
}
| bin_expr
{
$$=(T_expr_gen*)($1);
}
;
means that an expression may be an identifier, a const or a bin_expr.
The rule
bin_expr : expression PLUS expression
{
$$ = create_binary_expression("+",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
}
| expression MUL expression
{
$$ = create_binary_expression("*",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
}
;
means that bin_expr is either the sequence of symbols expression, PLUS and expression, or the sequence expression, MUL and expression.
A rule definition also contains the semantic actions, i.e. the C++ code that will be performed on the rule match. E.g.
bin_expr : expression PLUS expression
{
$$ = create_binary_expression("+",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
}
...
means that when the rule expression PLUS expression is matched, a node of type class T_expr_bin * is created by the function create_binary_expression() and returned by the rule as the symbol bin_expr.
The tree nodes returned on each rule match are connected into a parse tree. The parsing procedure finishes when the main symbol module, defined by the %start module command, is returned. In the presented example the nodes are connected with one another into a tree or a list.
Input file for Flex: lexical analyzer logic
The syntax parser described in the previous section accepts lexemes as its input. The syntax description is actually a context-free language description, but the parser relies on the lexical analyzer to resolve the dependency on the context. The lexical analyzer in this example is generated by means of the GNU Flex utility, which is invoked as follows:
win_flex -olex.tmcpar_.cpp TmcParLexer.L
This utility produces the lexical analyzer from the lexer definition file TmcParLexer.L. The output is the C++ file lex.tmcpar_.cpp. This file contains the implementation of the lexer function tmcpar_lex(), which is called internally by the parser produced by Bison. The lexer definition file uses the token definitions provided by the parser definition file; since the lexer must return these tokens, Bison should be run first.
Here is a fragment of the lexer file:
/// File: TmcParLexer.L
%option prefix = "tmcpar_"
%option yylineno
%option never-interactive
%{
#include "tmcpar_tree.h"
#include "TmcParParser.tab.hpp"
%}
%Start ExpectQuoteAsCTRANSPOSE
IDENTIFIER [a-zA-Z][_a-zA-Z0-9]*
%%
{IDENTIFIER} {
get_identifier();
return ReturnLspToken( IDENT);
}
%%
char* get_identifier()
{
//yyleng is length
//yytext is src
strcpy(tmcpar_lval.ident[0],yytext);
return tmcpar_lval.ident[0];
}
Like the Bison definition file, the Flex definition file consists of sections separated by %{, %} and %%. The first section contains options and definitions of regular expressions. The second section contains the actions, i.e. the code that is executed when a regular expression is matched. An action may be matched conditionally as, for example, in the following code:
<INITIAL>'[^'\r\f\n]*' {
register int i, size;
char* modified;
const int length = yyleng-2;
for (size = 0, i = 1;i <= length; size++, i++)
if (*(yytext+i) == '\'')
i++;
modified = alloc_string(size+1);
*(modified+size) = '\0';
for (size = 0, i = 1;i <= length; size++, i++)
{
*(modified+size) = *(yytext+i);
if (*(yytext+i) == '\'')
i++;
}
tmcpar_lval.str = modified;
yyinsert_comma_in_input(STRING);
return ReturnLspToken( STRING);
}
recognizes a literal string and returns the STRING lexeme; it is executed only in the state INITIAL. The following code
<ExpectQuoteAsCTRANSPOSE>' {
tmcpar_lval.int_value = CTRANSPOSE;
return ReturnLspToken( CTRANSPOSE ); }
recognizes the transpose operator (') and returns the CTRANSPOSE lexeme; it is executed only in the state ExpectQuoteAsCTRANSPOSE. This state is declared by the option %Start ExpectQuoteAsCTRANSPOSE, and the switch back to the INITIAL state is performed by the macro BEGIN(INITIAL). The lexical analyzer tries to apply all the rules in the list in order. If a rule succeeds, its code is executed and may return the lexeme code to the syntax analyzer. The lexeme value is returned in the union tmcpar_lval.
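The quote-collapsing loop in the string rule above can be isolated into a small standalone function. This is an illustrative re-implementation, not the converter's actual code: the input still carries its surrounding quotes (as yytext does), and a doubled '' inside the literal collapses to a single quote, exactly as in MATLAB string syntax.

```cpp
#include <cassert>
#include <string>

// Mirrors the flex action: copy the characters between the outer
// quotes, skipping the second quote of every '' pair.
inline std::string unquote_matlab_string(const std::string &yytext_like)
{
    std::string out;
    const size_t length = yytext_like.size() - 2;   // drop the two outer quotes
    for (size_t i = 1; i <= length; ++i)
    {
        out += yytext_like[i];
        if (yytext_like[i] == '\'')   // '' in the source is one quote
            ++i;
    }
    return out;
}
```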
Output code generator
When the parsing is finished, a parse tree has been created. Each node of the tree is a C++ class that represents a syntax construction. For example, a constant is represented by the class T_const:
class T_const : public T_expr_gen
{
...
public:
struct tag_tmsMatrix * const val ;
T_const(CONST_VAL_TYPES type,struct tree_const_val* v,int l,int c);
...
virtual void generate_rtl_node(CInstrList *ilist);};
The class constructor is called from the parser and stores the constant value in the member T_const::val. Later, during code generation, the virtual method T_const::generate_rtl_node creates the intermediate instructions and adds them to the list ilist, from which the target C code will be generated. For example, the following C code is generated for the constant "1":
tmcScalar(reg[11],1.000000000000000e+000);
This is a call to the run-time library function tmcScalar that initializes a temporary "register" variable with the node execution result, the constant "1.0". The intermediate instruction class CInstr contains the information about the instruction type, the output "register" number and the constant value:
class CInstr
{
public:
enum instr_types
{
instr_create_matrix_empty,
instr_complex,
instr_scalar,
instr_create_cell_empty,
instr_create_string_empty,
instr_create_string,
instr_create_magic_colon,
...
};
instr_types m_inst_type;
CIReg outreg; double m_Real;
double m_Imag;
...
}
The method T_const::generate_rtl_node allocates a new "register" number for the instruction output CInstr::outreg, creates a single intermediate instruction with the constant values, and adds the instruction to the list CInstrList. The creation of the intermediate code is recursive, and it is important that the output of a child expression is stored in CInstr::outreg and used in the parent expression.
When the full intermediate code list CInstrList for a function is finished, the target code is simply printed by the CInstrList::print_rtl method. This is done by the CInstr::gen_icode method, which prints the C code, or by the CInstr::gen_asmcode method, which prints the equivalent assembler code, e.g.:
...
case instr_scalar: sprintf(buff,"%s,%.15e",get_reg_name(outreg).c_str(),m_Real);
retval=std::string("tmcScalar(").append(buff).append(");");
break;
...
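A reduced sketch of this printing step (simplified types and hypothetical names that only mirror the article's CInstr fields) shows how one instr_scalar instruction becomes one line of target C code:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Toy stand-in for the instruction record; outreg is a plain int here
// instead of the converter's CIReg type.
struct MiniInstr
{
    int outreg;
    double m_Real;
};

// One-instruction version of the instr_scalar case shown above: format
// the register name and the constant, then wrap them in the call text.
inline std::string gen_icode_scalar(const MiniInstr &in)
{
    char buff[64];
    std::snprintf(buff, sizeof buff, "reg[%d],%.15e", in.outreg, in.m_Real);
    return std::string("tmcScalar(").append(buff).append(");");
}
```

(The exponent width of %.15e differs slightly between C runtimes, which is why the generated constant may read e+000 on one platform and e+00 on another.)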
A more complex example is the code of the for statement. It is implemented by the class T_ctrl_cmd_for:
class T_ctrl_cmd_for : public T_cmd_gen
{
private:
T_expr_gen * const lhs;
T_expr_gen * const expr;
L_stmnt_gen * const list;
...
CILabel m_end_ilabel; CILabel m_exit_ilabel;...
public:
T_ctrl_cmd_for (T_expr_gen *le, T_expr_gen *re,
L_stmnt_gen *lst,
int l = -1, int c = -1)
: T_cmd_gen (l, c), lhs (le), expr (re), list (lst)
{
...
}
...
public:
virtual void generate_rtl_node(CInstrList *ilist);}
As an example, consider the code
for k=0:N-1
..
end
Here the variable k is parsed into T_ctrl_cmd_for::lhs, the expression 0:N-1 into expr and the body code into list.
The method T_ctrl_cmd_for::generate_rtl_node registers the "k" variable stored in T_ctrl_cmd_for::lhs in the symbol table, calls expr->generate_rtl to generate the "0:N-1" expression code recursively, and calls list->generate_rtl_list to generate the body code instructions. This method produces a number of intermediate instructions (instr_for_init, instr_for_next, instr_label, instr_goto) that contain "goto". The label numbers for these instructions are allocated and stored in T_ctrl_cmd_for::m_end_ilabel and T_ctrl_cmd_for::m_exit_ilabel.
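Written out by hand, the goto skeleton that these instructions express for "for k=0:N-1" looks roughly like the following sketch (the label names are illustrative counterparts of m_end_ilabel and m_exit_ilabel, and the body here just sums the loop variable):

```cpp
#include <cassert>

// Hand-expanded counterpart of the generated instruction sequence:
// instr_for_init / instr_for_next / instr_label / instr_goto.
inline int sum_0_to_N_minus_1(int N)
{
    int k, sum = 0;
    k = 0;                            // instr_for_init: first value of the range
label_body:
    if (k > N - 1) goto label_exit;   // range exhausted -> exit label
    sum += k;                         // the loop body statements
    k = k + 1;                        // instr_for_next: step to the next value
    goto label_body;                  // instr_goto back to the loop head
label_exit:
    return sum;
}
```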
Finally, the code for the whole m-file is created by the generate_rtl_list function:
void generate_rtl_list()
{
CInstrList InstList; class T_func_hdr *tfd;
extern L_stmnt_gen *tmcpar_parsing_module; int NumLocalFuncs=0;
if (tmcpar_parsing_module)
{ for (L_stmnt_gen::iterator p = tmcpar_parsing_module->begin();
p != tmcpar_parsing_module->end();p++)
{
tfd = (T_func_hdr *)(*p);
Compiler.indFunc = NumLocalFuncs;
Compiler.indFunc_RTL = NumLocalFuncs;
InstList.store_function_ind(NumLocalFuncs);
tfd->generate_rtl_node(&InstList); InstList.print_rtl(NumLocalFuncs); InstList.reset_new_local_function();
NumLocalFuncs++;
}
}
}
String hashing
In order to optimize calculations with literal strings and field names, a string hash table is used. Structure field names are accessed by hash codes. The structure field assignment
S.x = 100;
is converted to
tmcScalar(reg[3],1.000000000000000e+002);
tmcGetRefByFieldHcode(pRefHelper,S,0x00780000);
tmcAssign(pRefHelper,reg[3]);
The field name string "x" is accessed by hash-code 0x00780000
. The hash-table is pre-calculated during the conversion and is loaded at the initialization of the run-time library. Each string known at the conversion time has its unique hash code and is accessed by this code. This hash table contains the original strings and their hash-code, e.g:
const struct CInitHashData InitHashData[]={
#include "TestO.hash_init.dat"
};
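The article does not show the hash function itself, so the following toy sketch only illustrates the lookup idea. Its packing of the first character into the high bytes merely mimics the 'x' -> 0x00780000 example (0x78 is the ASCII code of 'x') and is not the converter's real algorithm; real field names would of course need a stronger function.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy hash: shift the first character into the third byte, which happens
// to reproduce the 'x' -> 0x00780000 example from the text.
inline unsigned toy_field_hash(const std::string &name)
{
    return name.empty() ? 0u : (unsigned)(unsigned char)name[0] << 16;
}

// The precomputed table maps codes back to the original strings, in the
// spirit of the generated TestO.hash_init.dat data.
inline std::map<unsigned, std::string> build_toy_hash_table()
{
    std::map<unsigned, std::string> t;
    t[toy_field_hash("x")] = "x";
    return t;
}
```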
Symbol table
The symbol table is a fundamental element of every compiler and code generator. It contains the names of variables and functions together with their prototypes. Here it is implemented by the class symbol_table. In this example the symbol table was not optimized for access speed; it is based on the C++ list template.
The source files database is implemented by the class CTmcFileList (instance TmcFileList). It manages the list of actually used source files found in the search path. The symbol table is initialized after the first pass of the converter, when it is already possible to distinguish between variables and functions and to obtain the function prototypes.
Run-time library
The converter is built on the same principles as a compiler but, unlike compilers that generate low-level assembler or C code (e.g. MATISSE, http://specs.fe.up.pt/tools/matisse/), it generates calls to functions of the supplied run-time library. The reason is the absence of any type information about the MATLAB variables. Each such variable, as well as the intermediate variables (reg[]), is implemented by the single type tmsMatrix.
The run-time library implements elementary operations on matrices and the other built-in functions needed for the code execution. The linear-algebra operations such as matrix division are implemented using the LAPACK package. The run-time library should be initialized in order to create the global variables and to load the string hash table; the needed initialization code is generated by the converter. Finally, the library should be uninitialized.
How to run the converter
The converter accepts as parameters a list of search directories for the external functions, a table of built-in functions with the information about the number of input and output parameters, and the name of the root function to be converted. The parser is called in two passes.
To perform the first pass, the TMCCO converter is called, e.g., as
TMCCO -P -L -w TestO -r ./Stubs/TestO.m -h ..\..\include/ -i ./MatSRC/ -o ./OutL/
It starts by parsing the main function in the file TestO.m. If an unassigned symbol is found, it is assumed to be an external function and its source file is searched for in the directories passed by the -i switch. These source files are processed recursively until all the dependent files are parsed. As a result, a list of actually used m-files (TestO.filelist.txt), a table with all the function prototypes (TestO.sym.dat) and the include file with the function prototypes (stdtmc.h) are produced. The file TestO.sym.dat also includes the prototypes of all built-in functions.
To perform the second pass, the TMCCO converter is called as in the following example:
TMCCO -c -C -g2 -w TestO -r ./Stubs/TestO.m -i ./MatSRC/ -s ./MatSRC/ -s ./Stubs/ -o ./OutC/
The processing is performed the same way as in the first pass, but now the parser can distinguish between functions and variables and generate the output code. The parser state machine creates a tree representation of the code, and the code generation procedure then walks the tree and prints the output code. In addition to the target C files, the run-time initialization files are created (TestO.globals.c, TestO.globals.h, TestO._init_hash_data.c, TestO.hash_init.dat).
In this example three types of output code are produced: C code (the most useful), LISP code (for code analysis) and ASM code (for demo purposes; it may be processed by the NASM assembler and converted directly into a Win32 executable!).
The latest version of the converter, examples, documentation and additional tools may be found at the TMC Compiler downloads site (https://sourceforge.net/projects/tmc-m-files-to-c-compiler).
Points of Interest
If the task of writing a syntax analyzer is tricky, implementing the run-time library that performs the basic operations is even more so. The main challenges were to avoid any memory leaks, to optimize the execution time and to support different platforms and compilers. The memory usage was debugged using MSVC debugging mechanisms such as the _malloc_dbg function, the _CrtSetDbgFlag macro and precise counting of allocated memory.
Once the C code was generated, it was a challenge to debug it, since the MATLAB variables are stored in dynamically allocated objects of complex structure. A simple debugger was developed that displays all the variables created in a given context. For this purpose some additional "debug" code is generated (using the -d switch in the parser call) and the process memory is read by the ReadProcessMemory function. In the GNU GCC environment the simplest method was to use the GDB debugger and to directly call the run-time function tmcdisp that prints the contents of a given variable.
It was interesting to produce an executable directly from the m-code. It turned out to be very simple, thanks to the article "Tiny PE. Creating the smallest possible PE executable". Generating calls to the run-time library in C and in assembler proved to be of approximately the same complexity, so the assembler code was generated by the converter, a PE executable header was attached to it, and the resulting file was assembled by the NASM assembler. This exercise was done only for the x86 format; for x64 the assembler code generator would have to be adapted to the new calling convention, and those who are interested may try it for themselves :)
The following book is recommended for a deeper understanding of GNU Bison/Flex:
John R. Levine (2009). flex & bison. O'Reilly Media Inc.
Acknowledgements
I owe special credit to Vladimir Borisenko of Moscow State University for his outstanding course on compilers.
Thanks to Michael Medved for his hard work on the review of this article.
History
First version.