Slang Part 1: Parsing a C# Subset into the CodeDOM

honey the codewitch

Dec 4, 2019

MIT

An initial look at a tool to dramatically simplify language agnostic code generation using the CodeDOM

Download source - 501.4 KB

Introduction

This is the first part in a series in which we'll build a tool to generate language-independent source code from a subset of C# source code. We won't quite be there by the end of this article, but we've got to start somewhere. We're going to start with the front end of the tool: the parsing.

Background

We'll be using Microsoft's CodeDOM to represent the parse tree we get from parsing the document. The CodeDOM is a boon for code generation tools, providing formatted output of generated code in any .NET language for which a CodeDOM provider exists. The "stock" distribution of .NET ships providers for C# and VB, and there are NuGet packages for other languages.

Unfortunately, using it is a headache we really don't need. The object model is verbose, poorly documented, and just clunky. For example, it doesn't even use generic collections, and the "typed" collections it offers are haphazard. You need code like:

var expr = new CodeFieldReferenceExpression(new CodeThisReferenceExpression(),"_state");

just to render this code (in C#):

this._state

That is just its own little nightmare.

We'd much rather simply use:

var expr=SlangParser.ParseExpression("this._state");

To get the same thing, no? This article is a huge step in that direction.
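For reference, here is roughly what rendering that hand-built expression looks like with nothing but the stock CodeDOM API. This is a minimal sketch, not Slang code: the RenderDemo class and its Render() helper are names I made up for illustration, and it assumes either the .NET Framework or the System.CodeDom package is available.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;

static class RenderDemo
{
    // Hypothetical helper: renders one CodeDOM expression in the given
    // language ("cs", "vb", or any other registered provider name).
    public static string Render(CodeExpression expr, string language)
    {
        using (var provider = CodeDomProvider.CreateProvider(language))
        using (var writer = new StringWriter())
        {
            provider.GenerateCodeFromExpression(expr, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }

    static void Main()
    {
        // The verbose, hand-built form from above
        var expr = new CodeFieldReferenceExpression(
            new CodeThisReferenceExpression(), "_state");
        Console.WriteLine(Render(expr, "cs")); // this._state
        Console.WriteLine(Render(expr, "vb")); // Me._state
    }
}
```

The same tree renders in either language, which is the whole appeal. The pain is in building the tree, not in rendering it.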

Using this Mess

This bit is easy. The demo program takes its own program file and converts it to VB. What's harder is learning which subset of C# is supported. I haven't written a grammar for that yet. The rule of thumb is that if there isn't a CodeDOM object for it, it can't be represented using Slang. This means a lot of operators like ++ and += are off the table, as are using type aliases, nested namespaces, readonly modifiers on fields, etc. Still, even with all these limitations, this is a big improvement over using the CodeDOM directly. Also, comments are currently stripped from the output, and line pragmas cannot be parsed, as this is a preliminary release. There may be bugs as well. YMMV.

using System;
using System.Collections.Generic;
using Slang;
namespace SlangDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine(CodeDomUtility.ToString
             (SlangParser.ReadCompileUnitFrom("..\\..\\Program.cs"),"vb"));
        }
    }
}

Outputs:

'------------------------------------------------------------------------------
' <auto-generated>
'     This code was generated by a tool.
'     Runtime Version:4.0.30319.42000
'
'     Changes to this file may cause incorrect behavior and will be lost if
'     the code is regenerated.
' </auto-generated>
'------------------------------------------------------------------------------

Option Strict Off
Option Explicit On

Imports Slang
Imports System
Imports System.Collections.Generic

Namespace SlangDemo
    Friend Class Program
        Public Shared Sub Main()
            Console.WriteLine(CodeDomUtility.ToString_
             (SlangParser.ReadCompileUnitFrom("..\..\Program.cs"), "vb"))
        End Sub
    End Class
End Namespace

All it's doing is using CodeDomUtility to render the CodeDOM objects it got back from ReadCompileUnitFrom().

In addition to the aforementioned function on SlangParser, we also have ParseXXXX() and ReadXXXXFromUrl() methods to parse the various constructs from various sources. Usually, we'll be parsing whole compile units.

Conceptualizing this Mess

What we're doing is using a backtracking recursive descent parser. Like most recursive descent parsers, this one is written by hand rather than generated using a tool. However, we're currently using a tokenizer/lexer to break up our raw text into lexemes, and that is generated by a tool. The tool is called Rolex, and I posted an article on what it is and how to code it here. In the future, the lexer itself may be handwritten to overcome some limitations of the current implementation, but for now it is serviceable. If you want to include the Rolex binaries and set up the custom build step for the tokenizer, the link above contains the source code to build those binaries, and SlangTokenizer.rl contains instructions for setting up the build step.

The parser uses these lexemes, represented as tokens, to decide what to parse next:

static CodeExpression _ParseTerm(_PC pc)
{
    var lhs = _ParseFactor(pc);
    while (true)
    {
        var op = default(CodeBinaryOperatorType);
        _SkipComments(pc);
        switch (pc.SymbolId)
        {
            case ST.add: // +
                op = CodeBinaryOperatorType.Add;
                break;
            case ST.sub: // -
                op = CodeBinaryOperatorType.Subtract;
                break;
            default:
                return lhs;
        }
        pc.Advance();
        var rhs = _ParseFactor(pc);
        lhs = new CodeBinaryOperatorExpression(lhs, op, rhs);
    }
}

This is fairly standard operator precedence parsing. Some of the parsing, however, is not so straightforward.
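To see how the layering produces precedence, here is a small self-contained sketch of the same idea. It is not Slang's code: the tokenizer, the ref int cursor, and the PrecedenceDemo names are all stand-ins I invented so the example runs without Rolex or _PC. The term level handles + and - and calls down into the factor level, which handles * and /, so the tighter-binding operators end up deeper in the tree.

```csharp
using System;
using System.CodeDom;
using System.Collections.Generic;

static class PrecedenceDemo
{
    // Hypothetical mini-tokenizer: integers and single-character operators only
    static List<string> Tokenize(string input)
    {
        var result = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { ++i; continue; }
            if (char.IsDigit(input[i]))
            {
                int start = i;
                while (i < input.Length && char.IsDigit(input[i])) ++i;
                result.Add(input.Substring(start, i - start));
                continue;
            }
            result.Add(input[i].ToString());
            ++i;
        }
        return result;
    }

    public static CodeExpression Parse(string input)
    {
        var tokens = Tokenize(input);
        int pos = 0;
        var expr = ParseTerm(tokens, ref pos);
        if (pos < tokens.Count)
            throw new ArgumentException("Unexpected token " + tokens[pos], "input");
        return expr;
    }

    // term := factor (('+' | '-') factor)*
    static CodeExpression ParseTerm(List<string> t, ref int pos)
    {
        var lhs = ParseFactor(t, ref pos);
        while (pos < t.Count && (t[pos] == "+" || t[pos] == "-"))
        {
            var op = t[pos] == "+" ? CodeBinaryOperatorType.Add : CodeBinaryOperatorType.Subtract;
            ++pos;
            lhs = new CodeBinaryOperatorExpression(lhs, op, ParseFactor(t, ref pos));
        }
        return lhs;
    }

    // factor := primary (('*' | '/') primary)*
    static CodeExpression ParseFactor(List<string> t, ref int pos)
    {
        var lhs = ParsePrimary(t, ref pos);
        while (pos < t.Count && (t[pos] == "*" || t[pos] == "/"))
        {
            var op = t[pos] == "*" ? CodeBinaryOperatorType.Multiply : CodeBinaryOperatorType.Divide;
            ++pos;
            lhs = new CodeBinaryOperatorExpression(lhs, op, ParsePrimary(t, ref pos));
        }
        return lhs;
    }

    // primary := integer literal
    static CodeExpression ParsePrimary(List<string> t, ref int pos)
    {
        int value;
        if (pos >= t.Count || !int.TryParse(t[pos], out value))
            throw new ArgumentException("Expected an integer literal", "input");
        ++pos;
        return new CodePrimitiveExpression(value);
    }

    static void Main()
    {
        var root = (CodeBinaryOperatorExpression)Parse("1 + 2 * 3");
        Console.WriteLine(root.Operator);                                       // Add
        Console.WriteLine(((CodeBinaryOperatorExpression)root.Right).Operator); // Multiply
    }
}
```

Each additional precedence level is just another function of the same shape, slotted in between its neighbors.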

Consider a cast:

(long)1

We may find a ( in the input but we can't know whether that's a parenthesized subexpression or a cast until we parse further. If we parse looking for a cast and we're wrong, then we're in trouble, and the same thing goes if we parse looking for an expression and we're wrong. Consequently, we backtrack:

// possibly a cast, or possibly a subexpression
// we can't know for sure so this gets complicated
// basically we need to backtrack.
CodeExpression expr = null;
Exception ex=null;
var pc2 = pc.GetLookAhead();
pc2.EnsureStarted();
try
{
    expr = _ParseCast(pc2);
}
catch(Exception eex) { ex = eex; }
if(null!=expr)
{
    // now advance our actual pc
    // TODO: see if we can't add a dump feature
    // to the lookahead so we don't have to 
    // parse again. Minor, but sloppy.
    return _ParseCast(pc);

} else
{
    try
    {
        if (!pc.Advance())
            throw new ArgumentException("Unterminated cast or subexpression", "input");
        expr=_ParseExpression(pc);
        _SkipComments(pc);
        if(ST.rparen!=pc.SymbolId)
            throw new ArgumentException("Invalid cast or subexpression", "input");
        pc.Advance();
        return expr;
    }
    catch
    {
        if (null == ex)
            throw;
        throw ex;
    }
}

You can see this _PC (pc) object (our parse context, which we'll get to) is used to create something called look-ahead. We then parse along that look-ahead, like normal, before discarding it. What's happening is it's running an attempted parse. The lookahead cursor (our parse context pc2) does not advance its source parse context (pc) position so we can parse as much as we like along a look-ahead without having to worry about advancing the "real" cursor. This way, if we fail at a parse, we can simply go back to where we were and try something else, until we find what works.
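Stripped of the Slang specifics, the speculate-then-commit pattern can be sketched like this. Everything here is an assumption made for illustration: the Cursor class is a toy stand-in for _PC that walks a pre-split token array, only int and long count as type names, and the grammar is cut down to literals, casts, and parentheses.

```csharp
using System;
using System.CodeDom;
using System.Collections.Generic;

static class BacktrackDemo
{
    // Toy stand-in for the parse context: an index into a token list
    sealed class Cursor
    {
        readonly IList<string> _tokens;
        int _position;
        public Cursor(IList<string> tokens, int position = 0) { _tokens = tokens; _position = position; }
        public string Current { get { return _position < _tokens.Count ? _tokens[_position] : null; } }
        public void Advance() { ++_position; }
        // An independent cursor at the same spot; advancing it leaves this one alone
        public Cursor GetLookAhead() { return new Cursor(_tokens, _position); }
    }

    static void Expect(Cursor c, string token)
    {
        if (c.Current != token) throw new ArgumentException("Expected " + token, "input");
        c.Advance();
    }

    static CodeExpression ParseLiteral(Cursor c)
    {
        int value;
        if (c.Current == null || !int.TryParse(c.Current, out value))
            throw new ArgumentException("Expected an integer literal", "input");
        c.Advance();
        return new CodePrimitiveExpression(value);
    }

    // cast := '(' typeName ')' unary
    static CodeExpression ParseCast(Cursor c)
    {
        Expect(c, "(");
        var name = c.Current;
        if (name != "int" && name != "long")
            throw new ArgumentException("Expected a type name", "input");
        c.Advance();
        Expect(c, ")");
        return new CodeCastExpression(name == "int" ? typeof(int) : typeof(long), ParseUnary(c));
    }

    static CodeExpression ParseUnary(Cursor c)
    {
        if (c.Current != "(") return ParseLiteral(c);
        try
        {
            // Speculate on a throwaway cursor; the real one does not move
            ParseCast(c.GetLookAhead());
        }
        catch (ArgumentException)
        {
            // Not a cast, so parse a parenthesized subexpression instead
            Expect(c, "(");
            var inner = ParseUnary(c);
            Expect(c, ")");
            return inner;
        }
        // The speculation succeeded, so parse again for real
        return ParseCast(c);
    }

    public static CodeExpression Parse(params string[] tokens)
    {
        return ParseUnary(new Cursor(tokens));
    }

    static void Main()
    {
        Console.WriteLine(Parse("(", "long", ")", "1").GetType().Name); // CodeCastExpression
        Console.WriteLine(Parse("(", "1", ")").GetType().Name);         // CodePrimitiveExpression
    }
}
```

Like the real code, this parses a successful cast twice: once to find out whether it is one, and once more to actually consume it.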

Our little _PC class, which manages a running cursor along an IEnumerator<Token>, is how we get our current token and how we advance the input. It uses LookAheadEnumerator<T> to enable the look-ahead, which uses a Queue<T> underneath as a look-ahead buffer, so when we look ahead, we really are advancing the actual cursor, but we expose a facade using a buffer via the Queue<T> to mask that. The parse context also has to prepend its current token to the look-ahead, so we use a ConcatEnumerator<Token> to accomplish that. We could have used LINQ, but I had this handy. Otherwise, _PC is pretty straightforward.
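The buffering trick is easier to see in isolation. The sketch below is not LookAheadEnumerator<T> itself, whose source isn't shown here. PeekableCursor<T> and its members are hypothetical names for a cut-down version of the same idea: pull items from the underlying enumerator into a Queue<T> when peeking, and serve reads from that queue first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a queue-backed look-ahead
sealed class PeekableCursor<T>
{
    readonly IEnumerator<T> _inner;
    readonly Queue<T> _buffer = new Queue<T>();

    public PeekableCursor(IEnumerable<T> source) { _inner = source.GetEnumerator(); }

    // Pulls from the source until 'count' items are buffered.
    // Returns false if the source runs dry first.
    bool Fill(int count)
    {
        while (_buffer.Count < count)
        {
            if (!_inner.MoveNext()) return false;
            _buffer.Enqueue(_inner.Current);
        }
        return true;
    }

    // Consumes the next item, taking it from the buffer if one is waiting
    public bool TryRead(out T item)
    {
        if (!Fill(1)) { item = default(T); return false; }
        item = _buffer.Dequeue();
        return true;
    }

    // Yields upcoming items without consuming them. The source really does
    // advance, but everything it hands over stays queued for later reads.
    public IEnumerable<T> LookAhead()
    {
        for (int i = 0; Fill(i + 1); ++i)
            yield return _buffer.ElementAt(i);
    }
}

static class PeekDemo
{
    static void Main()
    {
        var cursor = new PeekableCursor<string>(new[] { "(", "long", ")", "1" });
        foreach (var token in cursor.LookAhead())
            Console.Write(token);
        Console.WriteLine();      // the line printed so far is (long)1
        string first;
        cursor.TryRead(out first);
        Console.WriteLine(first); // (
    }
}
```

ElementAt() on a Queue<T> walks from the front each time, so this is quadratic in the look-ahead depth. That is fine for a sketch, but a real implementation would want an indexable buffer.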

The main thing to note about the CodeDOM tree is that, despite the sample code generating valid VB, the tree is not correct internally. The CodeDOM requires us to use CodePropertyReferenceExpression objects to reference properties and CodeFieldReferenceExpression objects to reference fields. It also makes assumptions about what is a variable and (sometimes) what is a type. As a result, all of our member references happen to be reported as field references. Worse yet, our method invocations are actually considered delegate invocations, so Console.WriteLine(...) is interpreted as a delegate invoke of the delegate field WriteLine! Now with VB and C#, this doesn't matter, but with other languages, it very well might.

The reason our CodeDOM tree is like this is that we do not have type information during the parse. Without it, we cannot query a type to find out what is a property, a method, a field, or an event, and we can't get it yet because not all of our types have even been parsed. Therefore, we tag the CodeDOM tree's UserData entries with slang:unresolved to mark them as needing more information.
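Here is a sketch of what that tagging might look like, and how a later pass could use it. Only the slang:unresolved key comes from Slang. The value stored under it (true), the UnresolvedDemo class, and the Resolve() pass with its isProperty flag are all assumptions standing in for whatever the real resolver ends up doing.

```csharp
using System;
using System.CodeDom;

static class UnresolvedDemo
{
    // The key is from the article; storing 'true' under it is an assumption
    public const string Key = "slang:unresolved";

    // At parse time we can't tell fields from properties,
    // so emit a field reference and flag it for later
    public static CodeFieldReferenceExpression MakeUnresolved(CodeExpression target, string name)
    {
        var result = new CodeFieldReferenceExpression(target, name);
        result.UserData[Key] = true;
        return result;
    }

    // Hypothetical later pass: once type information says what the member
    // really is, swap in the right node or just clear the flag
    public static CodeExpression Resolve(CodeFieldReferenceExpression node, bool isProperty)
    {
        if (!node.UserData.Contains(Key))
            return node; // already resolved, leave it alone
        if (isProperty)
            return new CodePropertyReferenceExpression(node.TargetObject, node.FieldName);
        node.UserData.Remove(Key);
        return node;
    }

    static void Main()
    {
        var node = MakeUnresolved(new CodeThisReferenceExpression(), "Count");
        Console.WriteLine(node.UserData.Contains(Key));        // True
        Console.WriteLine(Resolve(node, true).GetType().Name); // CodePropertyReferenceExpression
    }
}
```

Since every CodeObject carries a UserData dictionary, the flag rides along with the tree and costs nothing when rendering.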

Update:

I have fixed some of the parsing bugs, and added a T4 style preprocessor to Slang, so now, with something like this from Test.tt in the demo project:

using System;
public class Test {
    public static void HelloWorld() {
<#for(var i =0;i<3;++i) {
#>        Console.WriteLine("Hello World! #<#=i+1#>");
<#}#>
    }
}

and using the following bit of code:

var sw = new StringWriter();
using (var w = new StreamReader(@"..\..\Test.tt"))
    SlangPreprocessor.Preprocess(w, sw);
Console.WriteLine(CodeDomUtility.ToString(SlangParser.ParseCompileUnit(sw.ToString()),"vb"));
return;

You can output this:

Option Strict Off
Option Explicit On

Imports System

Public Class Test
    Public Shared Sub HelloWorld()
        Console.WriteLine("Hello World! #1")
        Console.WriteLine("Hello World! #2")
        Console.WriteLine("Hello World! #3")
    End Sub
End Class

So now you can build your CodeDOM trees using T4 text templating syntax. It has no dependencies on Microsoft's T4 stuff, but it also doesn't support attributes, custom assembly references, or anything fancy yet.

The main class to make this work is SlangPreprocessor and the meat of that is one method, a couple of parsing support functions notwithstanding:

public static void Preprocess(TextReader input,TextWriter output,string lang="cs")
{
    // TODO: Add error handling, even though output codegen errors shouldn't occur with this
    var method = new CodeMemberMethod();
    method.Attributes = MemberAttributes.Public |MemberAttributes.Static;
    method.Name = "Preprocess";
    method.Parameters.Add(new CodeParameterDeclarationExpression(typeof(TextWriter), "w"));
    int cur;
    var more = true;
    while(more)
    {
        var text = _ReadUntilStartContext(input);
        if(0<text.Length)
        {
            method.Statements.Add(new CodeMethodInvokeExpression(
                new CodeArgumentReferenceExpression("w"),
                "Write",
                new CodePrimitiveExpression(text)));
        }
        cur = input.Read();
        switch(cur)
        {
            case -1:
                more = false;
                break;
            case '=':
                method.Statements.Add(new CodeMethodInvokeExpression(
                    new CodeArgumentReferenceExpression("w"),
                    "Write",
                    new CodeSnippetExpression(_ReadUntilEndContext(-1, input))));
                break;
            default:
                method.Statements.Add(new CodeSnippetStatement(_ReadUntilEndContext(cur, input)));
                break;
        }
    }
    method.Statements.Add(new CodeMethodInvokeExpression(new CodeArgumentReferenceExpression("w"), "Flush"));
    var cls = new CodeTypeDeclaration("Preprocessor");
    cls.TypeAttributes = TypeAttributes.Public;
    cls.IsClass = true;
    cls.Members.Add(method);
    var ns = new CodeNamespace();
    ns.Types.Add(cls);
    var cu = new CodeCompileUnit();
    cu.Namespaces.Add(ns);
    var prov = CodeDomProvider.CreateProvider(lang);
    var opts = new CompilerParameters();
    var outp= prov.CompileAssemblyFromDom( opts,cu);
    var m = outp.CompiledAssembly.GetType("Preprocessor").GetMember("Preprocess")[0] as MethodInfo;
    m.Invoke(null, new object[] { output });
}

That's 80% of the T4 processing right there. The rest is just a couple of remedial parsing functions. It uses the old ASP/ASP.NET trick of turning the literal text between context switches (delimited by those <# #> tags) into Write() calls, then it uses the CodeDOM to compile and load the resulting assembly, before using reflection to run the single method exposed from that assembly. It really is quite simple despite being kinda fancy. Eventually I will shore this up, and maybe even add includes and other neat things.
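If the trick is unfamiliar, it may help to see the intermediate code as text. The sketch below does not build a CodeDOM tree or compile anything the way Preprocess() does. TemplateDemo and ToWriteCalls() are hypothetical names for a string-only version that prints the method body such a template would turn into, assuming the writer argument is called w as it is above.

```csharp
using System;
using System.Text;

static class TemplateDemo
{
    // Escapes text as a regular C# string literal
    static string Quote(string text)
    {
        var sb = new StringBuilder("\"");
        foreach (char ch in text)
        {
            switch (ch)
            {
                case '\\': sb.Append("\\\\"); break;
                case '"': sb.Append("\\\""); break;
                case '\r': sb.Append("\\r"); break;
                case '\n': sb.Append("\\n"); break;
                case '\t': sb.Append("\\t"); break;
                default: sb.Append(ch); break;
            }
        }
        return sb.Append('"').ToString();
    }

    // Literal text becomes w.Write("..."), <#= expr #> becomes w.Write(expr),
    // and <# code #> is copied through as a statement
    public static string ToWriteCalls(string template)
    {
        var code = new StringBuilder();
        int pos = 0;
        while (pos < template.Length)
        {
            int open = template.IndexOf("<#", pos, StringComparison.Ordinal);
            string literal = open < 0 ? template.Substring(pos) : template.Substring(pos, open - pos);
            if (literal.Length > 0)
                code.Append("w.Write(").Append(Quote(literal)).Append(");\n");
            if (open < 0) break;
            int close = template.IndexOf("#>", open + 2, StringComparison.Ordinal);
            if (close < 0)
                throw new ArgumentException("Unterminated <# block", "template");
            string body = template.Substring(open + 2, close - open - 2);
            if (body.StartsWith("=", StringComparison.Ordinal))
                code.Append("w.Write(").Append(body.Substring(1).Trim()).Append(");\n");
            else
                code.Append(body.Trim()).Append('\n');
            pos = close + 2;
        }
        return code.ToString();
    }

    static void Main()
    {
        Console.Write(ToWriteCalls("<#for(var i=0;i<3;++i){#>Hello #<#=i+1#>\n<#}#>"));
        // for(var i=0;i<3;++i){
        // w.Write("Hello #");
        // w.Write(i+1);
        // w.Write("\n");
        // }
    }
}
```

Compile that body into a method taking a TextWriter, run it, and whatever it writes is the preprocessed output. The for loop from the template is now a real loop wrapped around the Write() calls.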

Points of Interest

C#'s grammar is one of the most deceptively simple grammars I've ever seen. I thought C was a tiger, but C#'s ambiguity was a challenge to parse. It looks so easy but it's really not. It requires GLR parsing or hand rolled parsers to parse it.

History

4th December, 2019 - Initial submission
5th December, 2019 - Update