Writing MATLAB to C converter: Parser and Code Generator created with Bison/Flex

Shmuel Safonov

5.00/5 (10 votes)

Nov 21, 2017

GPL3

11 min read

13660

306

Using Bison/Flex for creation of the code convertor from subset of MATLAB language to C code. The converter is used for building native applications and libraries from MATLAB code.

Download source - 2.5 MB

Introduction

This article describes a MATLAB-to-C code converter written using the GNU Bison parser generator and the GNU Flex lexical analyzer generator. The article explains the basic features of these utilities and shows how to generate the output code from the parser results. This converter generates code that is compiled and linked with runtime support library. An introduction into the usage of the library is also provided.

The source code parser and the target code generator

The MATLAB code parser is generated (almost :-) automatically from the language syntax description by means of the GNU Bison and GNU Flex utilities. These utilities process the syntax definition file (TmcParParser.Y) and lexer definition file (TmcParLexer.L) and generate C/C++ code of the parser state-machine.

The resulting parser state-machine is then called as a function:

 tmc_parse();

Input file for Bison: the description of the source language syntax

The Bison utility is used as following:

win_bison -o TmcParParser.tab.cpp -d -v TmcParParser.Y

Bison processes the syntax definition file TmcParParser.Y and creates C++ files TmcParParser.tab.cpp and TmcParParser.tab.hpp that contain the parser code. Let's understand the syntax definition file structure.

In order to create a language parser, the source language syntax should be defined by some notation technique. One of useful notation techniques for the programming languages is the Backus–Naur form (BNF). The BNF language transcription is a set of derivation rules for the language symbols. The syntax definition file (TmcParParser.Y) contains this language transcription. Consider an example file fragment: file:

%{
/// File: TmcParParser.Y
#include "tmcpar_tree.h"
%}
%name-prefix="tmcpar_"

%union
{
	char   *str;
	double num;
	LSP_IDENT   ident;
	
	struct tree_const_val *lsp_const_val;
	class T_ident *lsp_ident_val;
	class T_const   *lsp_constant_val;
	class T_expr_bin *lsp_binary_expression_val;
	class T_expr_gen *lsp_expression_val;
}

%token	<ident> IDENT
%token  <num>   NUMBER
%token  <num>   NUMBER_IM
%token  <str>   STRING

%token   PLUS            
%token   MUL             

%left   PLUS
%left   MUL
           
%token  END_OF_INPUT

%type <lsp_ident_val> ident
%type <lsp_constant_val> const
%type <lsp_const_val> variant_val
%type <lsp_expression_val> expression
%type <lsp_binary_expression_val> bin_expr

%start module

%%

module
		: file_hdr list_function	END_OF_INPUT
		{
			$$=$2;
			tmcpar_parsing_module = $$;
			YYACCEPT;
		}
		;
expression		:  identifier
					{
					$$=(T_expr_gen*)($1);
					}
				|  const
					{
					$$=(T_expr_gen*)($1);
					}
                | bin_expr
					{
					$$=(T_expr_gen*)($1);
					}

				;
identifier              : IDENT
					{
					 $$ = create_identifier($1,(int)tmcpar_lineno,(int)tmcpar_colno);
					 }
					 ;
const              : variant_val
					{
					 $$ = create_constant((enum CONST_VAL_TYPES)0,$1,(int)tmcpar_lineno,(int)tmcpar_colno);
					}
					;
variant_val        : NUMBER
					{
					 $$ = make_number($1,0,const_val_real);
					}
                   | NUMBER_IM
                   {
                     $$ = make_number(0,$1,const_val_complex);
                   }

                   | STRING
                   {
					 $$ = make_string($1);
                   }
                   ;

bin_expr : 	expression  PLUS expression
		  {
			$$ = create_binary_expression("+",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
		  }
		 |	expression  MUL expression
		  {
			$$ = create_binary_expression("*",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
		  }
		  ;

%%
// C++ code here
...

This file consists of 4 sections, separated by %{, %} and %% separators. The 1st section (separated by %{, %}) contains C/C++ code, it may include the files that defines the classes and variables used. The 2nd section second contains the type definitions. The 3rd section contains the syntax rules and semantical actions that should be done on each rule matching. The parser state machine is running iteratively. At each state a new matching rule is applyed otherwise the parser calls internally the lexical analyzer function tmcpar_lex that gets from the input stream next lexema. Each rule has input arguments ($1,$2 etc.) and the output ($$). An input argument can be the result of a children rule or a lexema returned by the lexical analizer. These arguments are internally represented in the state-machine by a union with fields defined by %union command that may be data type returned by lexema or by another rule. The type of the arguments that are rules are defined by the %type command. The type of the arguments that are lexema data are defined by the %token command. The lexical analyzer together with the value of lexema also returns the token code value.

For example, the definition

 %union
 {
	double num;
	...
 }
 %token  <num>   NUMBER

means that when the lexer returns a number, its returned token value is NUMBER and the data type is double.

The definition

 %union
 {
	class T_expr_gen *lsp_expression_val;
	...
 }
 %type <lsp_expression_val> expression

means that when the parser rule returns an expression, its data type is class T_expr_gen *.

The rule

expression		:  identifier
					{
					$$=(T_expr_gen*)($1);
					}
				|  const
					{
					$$=(T_expr_gen*)($1);
					}
                | bin_expr
					{
					$$=(T_expr_gen*)($1);
					}
				;

means that expression may be ident, const or bin_expr.

The rule

bin_expr : 	expression  PLUS expression
		{
			$$ = create_binary_expression("+",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
		}
		|	expression  MUL expression
		{
			$$ = create_binary_expression("*",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
		}
		;

means that bin_expr is a sequence of symbols expression, PLUS and expresion or expression, MUL and expresion.

A rule definition also contains the semantic actions, i.e. the C++ code that will be performed on the rule match. E.g.

bin_expr : 	expression  PLUS expression
	{
		$$ = create_binary_expression("+",$1,$3,(int)tmcpar_lineno,(int)tmcpar_colno);
	}
...

means that when rule expression PLUS expression is matched the node of type class T_expr_bin * will be created by the function create_binary_expression() and returned by the rule as symbol bin_expr.

These tree nodes returned on each rule matching are connected into a parse tree. The parsing procedure is finished when the main symbol module defined by %start module command is returned. In the presented example the nodes are connected each with another into a tree or as a list.

Input file for Flex: lexical analyzer logics

The syntax parser described in the previous section accepts lexemas as its input. The syntax description is actually a context-free language description but the parser is based on the lexical analyzer that should resolve the dependency on the context. The lexical analyzer in this example is generated by means of the GNU Flex utility. The Flex utility is used as following:

win_flex -olex.tmcpar_.cpp TmcParLexer.L

This utility produces the lexical analyzer from the lexer definition file TmcParLexer.L. The output is C++ file is lex.tmcpar_.cpp. This file contains the implementation of lexer function tmcpar_lex() that is called internally by the parser that was produced by Bison. The lexer definition file uses the token definitions that are provided by the parser definition file. Since the lexer should return the token definitions, Bison should be run first.

Here is a fragment of the lexer file:

/// File: TmcParLexer.L

%option prefix = "tmcpar_"
%option yylineno
%option never-interactive

%{
#include "tmcpar_tree.h"
#include "TmcParParser.tab.hpp"
%}

%Start   ExpectQuoteAsCTRANSPOSE
IDENTIFIER [a-zA-Z][_a-zA-Z0-9]*

%%

{IDENTIFIER} {
 get_identifier();
  return ReturnLspToken( IDENT);
             }

%%

char* get_identifier()
{
  //yyleng is length
  //yytext is src
  strcpy(tmcpar_lval.ident[0],yytext);
  return tmcpar_lval.ident[0];
}

Like in the Bison definition file, the Flex definition file consists of sections separated by %{ , %} and %%. The first section contains options and definitions for regular expressions. The second section contains the actions, i.e. the code that is executed when a regular expression is matched. An action may be conditionally matched as, for example, in the following code:

<INITIAL>'[^'\r\f\n]*' {
		register int i, size;
		char* modified;
		const int length = yyleng-2;
		for (size = 0, i = 1;i <= length; size++, i++)
		if (*(yytext+i) == '\'')
					i++;
		modified = alloc_string(size+1);
		*(modified+size) = '\0';
		for (size = 0, i = 1;i <= length; size++, i++)
		{
		*(modified+size) = *(yytext+i);
		if (*(yytext+i) == '\'')
					i++;
		}
		tmcpar_lval.str = modified;
		yyinsert_comma_in_input(STRING);
		return ReturnLspToken( STRING);
}

recognizes the literal string and returns the STRING lexema, it is executed only in state INITIAL. The following code

<ExpectQuoteAsCTRANSPOSE>'   {
		tmcpar_lval.int_value = CTRANSPOSE;
		return ReturnLspToken( CTRANSPOSE );  }

recognizes the transpose action (') and returns the CTRANSPOSE lexema, it is executed only in the state ExpectQuoteAsCTRANSPOSE. The initial state is defined by the option

 %Start   ExpectQuoteAsCTRANSPOSE

The switch to INITIAL state is performed by macro

 BEGIN(INITIAL)

The lexical analyzer tries to apply all the rules in the list according to its order. If it succeds the code of the rule is executed that may return the lexema code for to the syntax analyzer. The lexema value is returned in the union tmcpar_lval.

Output code generator

When the parsing is finished a parsing tree is created. Each node of the tree is a C++ class that represents a syntax construction. For example, a constant is represented by class T_const:

class T_const : public T_expr_gen
{
...
public:
	// The actual value
    struct tag_tmsMatrix  * const val ; 
	T_const(CONST_VAL_TYPES type,struct tree_const_val* v,int l,int c);
...
	virtual void generate_rtl_node(CInstrList *ilist);// constant
};

The class constructor is called from the parser and stores the constant value in the member T_const::val. After that during the code generation the virtual method T_const::generate_rtl_node creates the intermediate instructions and add them to list ilist from which the target C-code will be generated. For example, the following C-code will be generated for the constant "1":

tmcScalar(reg[11],1.000000000000000e+000);

This is a call to run-time library function tmcScalar that initializes the temporary "register" variable with the node execution result, the constant "1.0". The intermediate instruction class CInstr contains the infromation about the instruction type, output "register" number and the constant value:

class CInstr 
{
public:
	//! instruction operation types
	enum instr_types
	{
	instr_create_matrix_empty,
	instr_complex,
	instr_scalar,
	instr_create_cell_empty,
	instr_create_string_empty,
	instr_create_string,
	instr_create_magic_colon,
	...
	};
	instr_types m_inst_type;
	CIReg outreg; // output register
	double m_Real;
	double m_Imag;
	...
}

The method T_const::generate_rtl_node allocates a new "register" number for the instruction output CInstr::outreg, then creates a single intermediate instruction with the constant values and then adds the instruction to list CInstrList. The process of the intermediate code creation is recursive and it is important that the output of a child expression is stored in CInstr::outreg and is used in the parent expression.

When the full list of intermediate code CInstrList for a function is finished, the target code is simply printed by the CInstrList::print_rtl method. This is done by the CInstr::gen_icode method the prints the C-code or by CInstr::gen_asmcode method the prints the equivalent assembler code, e.g:

...
 		case instr_scalar: // make a scalar matrix
			sprintf(buff,"%s,%.15e",get_reg_name(outreg).c_str(),m_Real);
			retval=std::string("tmcScalar(").append(buff).append(");");
			break;
			...

More complex example is the code of for statement. It is implemented by class T_ctrl_cmd_for:

class T_ctrl_cmd_for : public T_cmd_gen
{
private:
  //! Expression to modify.
  T_expr_gen * const lhs;
  //! Expression to evaluate.
  T_expr_gen * const expr;
  //! List of commands to execute.
  L_stmnt_gen * const list;
...
  // while command's end and exit labels
  CILabel m_end_ilabel;//!  end label (for CONTINUE)
  CILabel m_exit_ilabel;//! exit label (for BREAK or finish)
...
public:  
T_ctrl_cmd_for (T_expr_gen *le, T_expr_gen *re,
			   L_stmnt_gen *lst,
			   int l = -1, int c = -1)
    : T_cmd_gen (l, c), lhs (le), expr (re), list (lst)
	{
	...
	}
  ...
public:
	virtual void generate_rtl_node(CInstrList *ilist);//for 
}

As an example, consider the code

	for k=0:N-1
	..
	end

Here the variable k is parsed into T_ctrl_cmd_for::lhs, the expression 0:N-1 in expr and the body code in list.

The method T_ctrl_cmd_for::generate_rtl_node registers "k" variable stored in T_ctrl_cmd_for::lhs in the symbol table, calls the expr->generate_rtl to generate recursively the expression "0:N-1" code and calls list->generate_rtl_list to generate the body code instructions. This method produces a number of intermediate instructions ( instr_for_init, instr_for_next, instr_label,instr_goto) that constain "goto". The label numbers for these instructions are allocated and stored in T_ctrl_cmd_for::m_end_ilabel and T_ctrl_cmd_for::m_exit_ilabel.

Finally, the whole m-file code is created by generate_rtl_list function:

void generate_rtl_list()
// generate list of instructions from the list pointed by tmcpar_parsing_module
{
	CInstrList InstList; // intermediate instructions list 
	class T_func_hdr *tfd;
	extern L_stmnt_gen *tmcpar_parsing_module;// global pointer to parser tree, initialized by tmc_parse()
	int NumLocalFuncs=0;
	if (tmcpar_parsing_module)
	{   /// Go through list of functions
		for (L_stmnt_gen::iterator p = tmcpar_parsing_module->begin();
		p != tmcpar_parsing_module->end();p++)
		{
			tfd = (T_func_hdr *)(*p);
			Compiler.indFunc = NumLocalFuncs; 
			Compiler.indFunc_RTL = NumLocalFuncs;
			InstList.store_function_ind(NumLocalFuncs);
			tfd->generate_rtl_node(&InstList);/// Generate intermediate instructions
			InstList.print_rtl(NumLocalFuncs);/// Convert instructions to the target code 
			InstList.reset_new_local_function();
			NumLocalFuncs++;
		}
	}
}

Strings hashing

In order to optimize the calculations with literal strings and field names string hash table is used. Structure field names are accessed by hash-codes. The code structure field assignment

		S.x = 100;

is converted to

		tmcScalar(reg[3],1.000000000000000e+002);
		tmcGetRefByFieldHcode(pRefHelper,S,0x00780000);/* x */
		tmcAssign(pRefHelper,reg[3]);

The field name string "x" is accessed by hash-code 0x00780000. The hash-table is pre-calculated during the conversion and is loaded at the initialization of the run-time library. Each string known at the conversion time has its unique hash code and is accessed by this code. This hash table contains the original strings and their hash-code, e.g:

const struct CInitHashData InitHashData[]={
#include "TestO.hash_init.dat"
/** File TestO.hash_init.dat:
"?error",0x21290000,
"x",0x00780000,
*/
};

Symbols table

The symbol table is the fundamental element of every compiler and code generator. The symbol table contains the names of variables and functions with their prototypes. It is implemeted by class symbol_table. In this example the symbol table was not optimized for access speed; it is based on C++ list template.

The source files database is implemented by class CTmcFileList TmcFileList. It manages the list of actually used source files found in the search path. The symbol table is initialized after the first pass of the converter when it is possible already to distinguish between variables and functions and to get the function prototypes.

Run-time library

The converter is built on the same principles as a compiler, but unlike the compilers that generate low-level assmbler or C-code (e.g MATISSE http://specs.fe.up.pt/tools/matisse/), it generate calls to functions of supplied run-time library. The reason for it is the absence of any type information about the MATLAB variables. Each such variable and the intermediate variables (reg[]) are implemented by single type tmsMatrix.

The run-time library implements elementary operations on matrixes and other build-in functions needed for the code execution. The linear-algebra operations such as matrix division are implemented using LAPACK package. The run-time library should be initialized in order to create global variables and to load the strings hesh-table. The needed initialization code is generated by the converter. Finally the library should be uninitialized.

How to run the converter

The converter accepts a list of search directories for the external functions as parameters, a table of build-in functions with the information about number of input and output parameters and the name of root function that is converted. The parser is called in two passes.

To perform the first pass TMCCO converter is called e.g. as

TMCCO -P -L -w TestO -r ./Stubs/TestO.m -h ..\..\include/ -i ./MatSRC/ -o ./OutL/

It starts to parse the main function in file TestO.m. If an unassigned symbol is found, it is assumed to be an external function and its source file is searched in directories passed by the -i switch. These source files are processed recursively until all the dependent files are parsed. As the result, a list of actually used m-files (TestO.filelist.txt), a table with all the functions prototypes (TestO.sym.dat) and the include file with the functions prototypes (stdtmc.h) are produced. The file TestO.sym.dat also includes the prototypes of all build-in functions.

To perform the second pass TMCCO converter is called as in the following example:

TMCCO -c -C -g2 -w TestO -r ./Stubs/TestO.m -i ./MatSRC/ -s ./MatSRC/ -s ./Stubs/ -o ./OutC/

The processing is performed the same way as in the 1st pass, but the parser already may distinguish between the functions and variables and to generate the output code. The parser state-machine creates a tree presentation of the code. Then the code creation procedure walks through the tree and prints the output code. In additions to the target C files, the run-time initialization files are created (TestO.globals.c,TestO.globals.h,TestO._init_hash_data.c,TestO.hash_init.dat).

In this example three types of output code are presented: C-code (the most useful), LISP-code (for the code analysis) and ASM-code (for demo purposes, may be processed by NASM Assembler and converted directly to Win32 executable!).

The latest version of the converter, examples, documentation and additional tools may be found at TMC Compiler downloads site ( https://sourceforge.net/projects/tmc-m-files-to-c-compiler).

Points of Interest

If the task of writing syntax analyzer is very tricky, the implementing of run-time library that performs the basic operations is even more so. The main challenge was to avoid any memory leak, to optimize the execution time and to support different platforms and compilers. The memory usage was debugged using MSVC debugging mechanisms such like _malloc_dbg function, _CrtSetDbgFlag macros and precise counting of allocated memory.

When the C code was generated, it was a challenge to debug it since the MATLAB variables were stored in dynamically allocated objects of complex structure. A simple debugger was developed that displays all the variables created in given context. For this purpose some additional "debug" code is generated (using -d switch in the parser call) and the process memory is read by ReadProcessMemory function. In GNU GCC environment the simplest method was to use GDB debugger and to directly call run-time function tmcdisp that prints the contents of a given variable.

It was interesting to produce an executable directly from the m-code. It turned out to be very simple, thanks to the article "Tiny PE. Creating the smallest possible PE executable". To generate calls to the run-time library in C and in assembler was aprroximately of the same complexity. Thus the assembler code was generated by the converter, a PE executable header was attached to it and the result file assembled by NASM assembler. This exercise was done only for x86 format; for x64 the assembler code generator should be adapted to the new calling conversion and those who are interested in it may try it by himself :)

The following book may be recommended for deeper understanding of the GNU Bison/Flex:

John R. Levine (2009). flex & bison. O'Reily Media Inc.

Acknowledgements

I should give a special credit to Vladimir Borisenko from the Moscow State University for his outstanding course on compilers.

Thanks to Michael Medved for his hard work on the review of this article.

History

First version.