A scanner and scanner generator






Apr 10, 2001
Supports both common approaches to scanners in one object.
Introduction
A scanner breaks a stream of characters into a sequence of tokens. This is comparable to a human reader who groups characters into words, numbers and punctuation, thereby reaching a higher level of abstraction. For example, the text:
0-201-05866-9, cancelled, "Parallel Program Design"
could be translated into the tokens:
T_ISBN T_COMMA T_CANCELLED T_COMMA T_TITLE
where T_ISBN, T_COMMA and so on are integer constants. There are two approaches to implementing general scanners.
- The scanner is an object of a class.
The tokens to be recognized are specified through calls to member functions.
- The scanner is automatically generated from regular expressions.
This is a two-phase approach. First, you specify the scanner, and then you run the generator, which outputs "C" source code.
The first approach is better suited to a project with frequent changes. The second approach gives superior performance, but has the disadvantage that the generated "C" code is nearly unreadable for humans and therefore should not be edited by hand.
The scanner and scanner generator presented in this article combine both approaches and provide one interface for both implementation strategies.
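To make the token idea from the start of this introduction concrete, here is a minimal sketch of the first approach (rules given at run time). It uses std::regex from the standard library, not REXI, and the function name Tokenize, the patterns and the token values are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Token ids: the names follow the introduction, the values are arbitrary.
enum Token { T_ERR, T_ISBN, T_COMMA, T_CANCELLED, T_TITLE };

// Breaks 'text' into tokens by trying each rule at the current position.
// Hypothetical helper built on std::regex; it is not part of REXI.
std::vector<Token> Tokenize(const std::string& text)
{
    using std::regex;
    // (pattern, token id) pairs, tried in order.
    static const std::vector<std::pair<regex, Token>> rules = {
        {regex(R"re(\d+-\d+-\d+-\d+)re"), T_ISBN},
        {regex(","), T_COMMA},
        {regex("cancelled", regex::ECMAScript | regex::icase), T_CANCELLED},
        {regex(R"re("[^"\n]*")re"), T_TITLE},
    };
    static const regex skip(R"re([ \t\r\v]+)re");
    // match_continuous anchors every match at the current position.
    const auto anchored = std::regex_constants::match_continuous;

    std::vector<Token> out;
    auto pos = text.cbegin();
    const auto end = text.cend();
    while (pos != end) {
        std::smatch m;
        if (std::regex_search(pos, end, m, skip, anchored)) {
            pos = m[0].second;              // skip whitespace between tokens
            continue;
        }
        bool matched = false;
        for (const auto& rule : rules) {
            if (std::regex_search(pos, end, m, rule.first, anchored)) {
                out.push_back(rule.second);
                pos = m[0].second;          // continue after the matched token
                matched = true;
                break;
            }
        }
        if (!matched) {                     // no rule fits: report one illegal character
            out.push_back(T_ERR);
            ++pos;
        }
    }
    return out;
}
```

Called with the example text from the introduction, this sketch yields the five tokens listed there. It only returns token ids; a real scanner would also keep the matched text.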
Interface for the scanner class
The scanner REXI_Scan is based on the regular expression facility already presented in the article 'Fast regular expressions'. To use it, you
- specify a regular expression for each token to recognize.
- set the source string.
- call Scan repeatedly until it returns REXI_Scan::eEos.
class REXI_Scan : public REXI_Base
{
public:
    REXI_Scan(char cLineBreak= '\n');  // related function 'GetNofLines'

    /* initialize scanner with symbol definitions      1.STEP */
    REXI_DefErr AddSymbolDef    (string strRegExp, int nIdAnswer);
    REXI_DefErr AddHelperRegDef (string strName, string strRegExp);
    REXI_DefErr SetToSkipRegExp (string strRegExp= "[ \r\n\v\t]*");

    /* set source                                      2.STEP */
    inline void SetSource (const char* pszSource);

    /* read next token, then return symbolId
       ('nIdAnswer' from 'AddSymbolDef')               3.STEP */
    int Scan ();

    /* retrieve, set information after a call to 'Scan' */
    inline string GetTokenString () const;
    inline void   SkipChars      (int nOfChars=1);
    inline int    GetLastSymbol  () const;
    inline int    GetNofLines    () const;
};
Example Usage
#include <cassert>
#include <iostream>
#include <string>
#include <vector>
// Also include the header that declares REXI_Scan and REXI_DefErr
// (its name is not given in this article).
using namespace std;

enum ESymbol{T_ERR,T_AVAILABLE,T_CANCELLED,
             T_COMMA,T_LINEBREAK,T_ISBN,T_TITLE};

struct Info{
    Info():m_eKey(T_ERR){}
    string  m_sISBN;
    ESymbol m_eKey;
    string  m_sTitle;
};

int main(int argc,char* argv[])
{
    const int ncOk= REXI_DefErr::eNoErr;
    const char szTestSrc[]=
        "3-8272-5737-9,AVAILABLE, \"XML praxis und referenz\"\r\n"
        "0-201-05866-9,cancelled, \"Parallel Program Design\"\r\n";

    REXI_Scan   scanner;
    REXI_DefErr err;

    /* STEP 1: initialize scanner with symbol definitions */
    err= scanner.AddSymbolDef   ("(AVAILABLE)\\i",T_AVAILABLE);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef   ("(CANCELLED)\\i",T_CANCELLED);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef   (",",T_COMMA);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef   ("\\n",T_LINEBREAK);
    assert(err.eErrCode==ncOk);
    err= scanner.AddHelperRegDef("$Int_","[0-9]+\\-");
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef   ("$Int_ $Int_ $Int_ [0-9]+", T_ISBN);
    assert(err.eErrCode==ncOk);
    err= scanner.AddSymbolDef   (" \"( [^\"\\n] | \\\"] )* \"", T_TITLE);
    assert(err.eErrCode==ncOk);
    err= scanner.SetToSkipRegExp("[ \\t\\v\\r]*");
    assert(err.eErrCode==ncOk);

    /* STEP 2: set source */
    scanner.SetSource(szTestSrc);

    int nNofLines=0;
    int nRes;
    Info info;
    vector<Info> vecInfos;

    /* STEP 3: read until eos */
    while( (nRes=scanner.Scan())!=REXI_Scan::eEos ){
        switch(nRes){
        case T_AVAILABLE:
        case T_CANCELLED:
            info.m_eKey= (enum ESymbol)nRes;
            break;
        case T_TITLE:
            info.m_sTitle= scanner.GetTokenString();
            break;
        case T_ISBN:
            info.m_sISBN= scanner.GetTokenString();
            break;
        case T_LINEBREAK:
            /* a line break completes one record */
            vecInfos.push_back(info);
            info= Info();
            break;
        case REXI_Scan::eIllegal:
            cout << "Illegal:" << scanner.GetTokenString() << endl;
            /* skip the rest of the faulty line */
            while(  (nRes=scanner.Scan())!=REXI_Scan::eEos
                  && nRes!= T_LINEBREAK);
            info= Info();
            break;
        }
    }
    cout << "Number of correct read records: " << vecInfos.size() << endl;
    char c; cin >> c;  /* wait for input before the console closes */
    return 0;
}
Interface for the scanner generator
The scanner generator is a very simple GUI program. It lets you specify and run a test scanner and finally generates the source code for the specified scanner. The generated code uses a scanner derived from REXI_Scan and provides two alternative code paths. The conditional directive #ifdef REXI_STATIC_SCANNER selects either an efficient hard-coded scanner or a scanner that works like the one described above.
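The following sketch illustrates the general pattern of selecting one of two implementations at compile time. It is not output of the generator: the macro SKETCH_STATIC_SCANNER, the function ScanOne and the token values are hypothetical, and the regex-driven path uses std::regex instead of REXI.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical token ids for this sketch.
enum { eIllegal = -1, T_if = 1, T_Int = 2 };

// Recognizes the token at the start of 's', stores its length in 'len'
// and returns its id. Both code paths are meant to give the same results.
int ScanOne(const std::string& s, std::size_t& len)
{
#ifdef SKETCH_STATIC_SCANNER
    // Hard-coded path: plain character tests, no regular expression engine.
    if (s.compare(0, 2, "if") == 0) { len = 2; return T_if; }
    std::size_t i = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
    if (i > 0) { len = i; return T_Int; }
#else
    // Regex-driven path: easy to change, but slower.
    static const std::regex reIf("if");
    static const std::regex reInt("[0-9]+");
    const auto anchored = std::regex_constants::match_continuous;
    std::smatch m;
    if (std::regex_search(s, m, reIf, anchored)) {
        len = static_cast<std::size_t>(m.length(0));
        return T_if;
    }
    if (std::regex_search(s, m, reInt, anchored)) {
        len = static_cast<std::size_t>(m.length(0));
        return T_Int;
    }
#endif
    len = s.empty() ? 0 : 1;   // consume one character of illegal input
    return eIllegal;
}
```

Compiling with -DSKETCH_STATIC_SCANNER switches to the hard-coded path without touching the calling code, which is the point of offering both variants behind one interface.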
The specification for the scanner to be generated uses regular expressions and supports 4 different ways to specify a token, which are shown below.
if #T_Quote= '[^']' $Int= [0-9]+ ##T_FLOAT= $Int (\. $Int)?
It is important to separate the token definitions with tabs. Now, let's see what the 4 definitions above mean.
1. Token if
The scanner searches for exactly the word 'if' and automatically creates a constant T_if for the token.
2. Token #T_Quote= '[^']'
The leading # means: the next identifier up to the = is the name of the token constant, then the token definition follows.
3. Helper $Int= [0-9]+
Defines a helper definition, which can be used later.
4. Token ##T_FLOAT= $Int (\. $Int)?
The leading ## means: same as #, but do postprocessing after recognizing this token.
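The four forms can be told apart by their leading characters alone. The sketch below shows one way to classify them. It is an assumption about how such a specification could be read, not the generator's actual parser; the names Kind, SpecEntry, ParseEntry and ParseSpecLine are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

enum class Kind { PlainToken, NamedToken, PostprocessedToken, Helper };

struct SpecEntry {
    Kind        kind;
    std::string name;    // token constant (e.g. T_if) or helper name (e.g. $Int)
    std::string regexp;  // definition text
};

// Classifies a single definition by its leading characters.
SpecEntry ParseEntry(const std::string& entry)
{
    Kind kind = Kind::PlainToken;
    std::size_t skip = 0;                       // marker characters before the name
    if (entry.compare(0, 2, "##") == 0)     { kind = Kind::PostprocessedToken; skip = 2; }
    else if (entry.compare(0, 1, "#") == 0) { kind = Kind::NamedToken; skip = 1; }
    else if (entry.compare(0, 1, "$") == 0) { kind = Kind::Helper; }  // '$' stays in the name

    if (kind == Kind::PlainToken)
        return {kind, "T_" + entry, entry};     // 'if' becomes the constant T_if

    const std::size_t eq = entry.find('=');
    if (eq == std::string::npos)
        return {kind, entry.substr(skip), ""};  // no definition given

    // The definition starts at the first non-blank character after '='.
    const std::size_t start = entry.find_first_not_of(' ', eq + 1);
    const std::string regexp =
        (start == std::string::npos) ? std::string() : entry.substr(start);
    return {kind, entry.substr(skip, eq - skip), regexp};
}

// Splits a specification line at tabs and classifies every definition.
std::vector<SpecEntry> ParseSpecLine(const std::string& line)
{
    std::vector<SpecEntry> out;
    std::istringstream in(line);
    std::string entry;
    while (std::getline(in, entry, '\t'))
        if (!entry.empty()) out.push_back(ParseEntry(entry));
    return out;
}
```

Fed with the specification line shown above, ParseSpecLine returns four entries: T_if, T_Quote, the helper $Int and T_FLOAT, the last one marked for postprocessing.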
Fragment of a generated scanner
int Simple::Scan()
{
#ifdef REXI_STATIC_SCANNER
    int nRes= FastScan();
#else
    int nRes= REXI_Scan::Scan();
#endif
    switch(nRes){
    case eIllegal:{
        m_sIllegal= GetTokenString();
        return nRes;
    }
    case T_PRICE:{
        // add your postprocessing code here
        return nRes;
    }
    default:
        return nRes;
    }
}
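What goes into the postprocessing branch is up to you. As a sketch, the class below mirrors the switch of the generated fragment and converts the text of a T_PRICE token into a number. The class name, the member m_dPrice and the token values are hypothetical; in the generated code the token text would come from GetTokenString().

```cpp
#include <cassert>
#include <string>

// Hypothetical token ids; the real values come from the generated code.
enum { eIllegal = -1, T_PRICE = 1 };

struct PostprocessSketch {
    std::string m_sIllegal;      // text of the last illegal token
    double      m_dPrice = 0.0;  // hypothetical: value of the last T_PRICE token

    // nRes is the raw scan result, sToken the matched token text.
    int Postprocess(int nRes, const std::string& sToken)
    {
        switch (nRes) {
        case eIllegal:
            m_sIllegal = sToken;
            return nRes;
        case T_PRICE:
            // Assumes the token already matched a numeric pattern,
            // so std::stod finds a valid number to convert.
            m_dPrice = std::stod(sToken);
            return nRes;
        default:
            return nRes;
        }
    }
};
```

Keeping the conversion inside the scanner means the caller receives a ready-to-use value along with the token id, regardless of which scanner variant produced the token.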
Intended Use
Scanning a comma-separated file, implementing a pretty printer for C++ source code or building a scanner for an interpreter are potential application areas. There are also quite a lot of freely available scanner generators (lex and flex, for example) out there, but as far as I know, none of them generates scanners with such a neat interface as this one.