Slang Part 2: Scope and Type Resolution in the CodeDOM

honey the codewitch

Dec 7, 2019

MIT

Getting an accurate CodeDOM back from a C# source subset

Introduction

In our last article, we parsed a subset of C# rather badly into the CodeDOM due to a lack of type information during the parse. I told you we'd follow up and correct those unresolved nodes, and here we are. For those of you just tuning in, we're turning a subset of C# source code into an abstract syntax tree represented by the CodeDOM, which can then be rendered in any target language for which an adequate CodeDOM provider exists. Since VB and C# providers ship with .NET, Slang allows you to parse a C# subset and render C# or VB code from it "out of the box." There are additional providers for languages like F# and TypeScript, but I haven't tried them yet.

The reason for this endeavor, again, is to produce a library that makes it really easy to add language-agnostic code generation capabilities to your projects. This is useful if your project itself is a tool to generate code for other developers.

A note about dependencies: This project relies on Microsoft's CodeDOM NuGet package for .NET Standard and Core. If this source code is transferred to a .NET Framework class library, that dependency will not be required. I did it this way for maximum portability though.

Also, I can't stress enough that this tool is not production ready. As it stands, it may or may not work for your projects, but it's provided as an article of interest about this ongoing effort of mine. I didn't want to make everyone wait a month or more while I sorted out all of the stability issues.

Where We Are Now

As of the last article, we were dealing with a somewhat buggy proof-of-concept parser and the awkward CodeDOM tree we could get back from it. The parser has been shored up a little bit but still has a long way to go. The parser itself may change to something ANTLR or even Elk driven down the road if I have to, because the error reporting is dodgy and I just haven't had enough time to test it on a variety of source files yet. Watch this space.

I also added a preprocessor, so now you can use built-in T4 preprocessing on your Slang source files. This allows for dynamism in our code generation:

using System;
public class Test {
    public static void Main() {
    var j = int.MinValue;
    Console.WriteLine(10 * 2f);
<#for(var i =0;i<3;++i) {
#>        Console.WriteLine("Hello World! #<#=i+1#>");
        
<#}#>
    }
}

This generates the following Slang code:

using System;
public class Test {
        public static void Main() {
        var j = int.MinValue;
        Console.WriteLine(10 * 2f);
                Console.WriteLine("Hello World! #1");

                Console.WriteLine("Hello World! #2");

                Console.WriteLine("Hello World! #3");
        }
}

As you can see, that's just (poorly formatted, but wait for it!) C# code. It is a C# subset, not full C#. However, this in turn can generate the following in "real" C#:

using System;

public class Test {
    public static void Main() {
        int j = int.MinValue;
        System.Console.WriteLine((10 * 2F));
        System.Console.WriteLine("Hello World! #1");
        System.Console.WriteLine("Hello World! #2");
        System.Console.WriteLine("Hello World! #3");
    }
}

Or this in VB.NET:

Option Strict Off
Option Explicit On

Imports System

Public Class Test
    Public Shared Sub Main()
        Dim j As Integer = Integer.MinValue
        System.Console.WriteLine((10 * 2!))
        System.Console.WriteLine("Hello World! #1")
        System.Console.WriteLine("Hello World! #2")
        System.Console.WriteLine("Hello World! #3")
    End Sub
End Class

Note that the Option Strict Off setting is put there by Microsoft, not our code. We don't control the rendering of the output language. The third-party CodeDOM providers take care of that. Here, Microsoft's VBCodeProvider is responsible for this.

Note how our type references got fully qualified and our var declaration got turned into an explicitly typed variable declaration. This is to ensure unambiguous, proper output regardless of the target language.

Using this Mess

We can get the above using the following code, assuming our input template is in our project directory and named test.tt:

// Holds the output of our preprocessing:
var sw = new StringWriter();

// First preprocess our template - runs the T4 processing
// output is Slang source
using (var r = new StreamReader("..\\..\\test.tt"))
    SlangPreprocessor.Preprocess(r, sw);

// Now we parse our Slang source into our initial CodeDOM
// parse tree
var code = SlangParser.ParseCompileUnit(sw.ToString());

// We need one of these lil guys to resolve our codedom
// types and members and external types and members
var res = new CodeDomResolver();

// Give it the code we just parsed
res.CompileUnits.Add(code);

// Now we can tell Slang to fix up our tree with the type info
SlangRegenerator.Patch(res.CompileUnits);

// Finally, our tree is good. We can render it
Console.WriteLine(CodeDomUtility.ToString(code, "vb"));

We've got several steps here. The first one is to preprocess. We need to run our T4 over the input to get our Slang. Then we take our Slang and parse it into a CodeCompileUnit. We also instantiate a CodeDomResolver which takes our code and adds tags to it so it can resolve scopes and types, which we'll get into.

Before we can use the CodeDOM objects we got back, we need to patch them because what we got back from the parse was incomplete. There simply isn't information enough in the source code alone to parse C#. You need type information. So our parsing was just the "initial pass" to get our basic structure. Now that we have done that, we go back through and apply type information, "correcting" our parse. For example, our parser sees Console.WriteLine("Hello"); as a delegate invocation of the field WriteLine off of the variable Console. That's not right at all! However, the parser simply doesn't have enough information at that point to know better. Patch() handles this using CodeDomVisitor which I wrote about here.
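To make that mismatch concrete, here's a rough sketch that builds both shapes by hand using only the stock System.CodeDom types (which, as noted earlier, means the CodeDOM NuGet package on .NET Standard and Core). The class and method names are made up for illustration, and this is not what Slang's parser or Patch() literally emit. It just shows the "before" and "after" trees for the same line of source.

```csharp
using System;
using System.CodeDom;

// Hypothetical demo class; not part of Slang.
static class MisparseDemo
{
    // Roughly what a type-blind parse of Console.WriteLine("Hello") yields:
    // a delegate invocation of the field WriteLine off of a variable named Console.
    public static CodeExpression BuildUnresolved()
    {
        var target = new CodeFieldReferenceExpression(
            new CodeVariableReferenceExpression("Console"), "WriteLine");
        return new CodeDelegateInvokeExpression(
            target, new CodePrimitiveExpression("Hello"));
    }

    // What we want once we know Console is really the type System.Console:
    // a method invocation on a type reference.
    public static CodeExpression BuildResolved()
    {
        return new CodeMethodInvokeExpression(
            new CodeTypeReferenceExpression(typeof(Console)),
            "WriteLine",
            new CodePrimitiveExpression("Hello"));
    }

    static void Main()
    {
        Console.WriteLine(BuildUnresolved().GetType().Name);
        Console.WriteLine(BuildResolved().GetType().Name);
    }
}
```

The two trees share the same argument but nothing else, which is why the fix-up has to replace nodes rather than just tweak a property or two.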

Finally, after the patch, our CodeDOM tree has been folded, mutilated and spindled into something serviceable, so we just pass it to CodeDomUtility's ToString() method with our desired language. You can find more on CodeDomUtility here.

You can use CodeDomResolver by itself if you like. Just hand it your CodeCompileUnit instances, and call Refresh() and then you're ready to roll. You can use it as a standalone system to get type and scope information for your code constructs, which we'll get into now.

Conceptualizing this Mess

CodeDomResolver is somewhat involved magic. It's basically doing something similar to what the middle tier of a compiler does: it resolves type and scope information in our code. First, it uses UserData on the CodeDOM objects to weak-reference each element to its parent. This is handled by Refresh(). This matters because it lets us walk back up the tree from wherever we are, gathering the variables, arguments, members and types in scope, which is exactly what _FillScope() does:

CodeDomResolverScope _FillScope(CodeDomResolverScope result)
{
    object p;
    if(null==result.Expression)
    {
        if (null != result.TypeRef)
        {
            p = result.TypeRef;
            while (null != (p = _GetRef(p, _parentKey)))
            {
                var expr = p as CodeExpression;
                if (null != expr)
                {
                    result.Expression = expr;
                    break;
                }
            }
        }
    }
    if(null==result.Statement)
    {
        if(null!=result.Expression)
        {
            p = result.Expression;
            while(null!=(p=_GetRef(p,_parentKey)))
            {
                var stmt = p as CodeStatement;
                if (null != stmt)
                {
                    result.Statement = stmt;
                    break;
                }
            }
        } else if(null!=result.TypeRef)
        {
            p = result.TypeRef;
            while (null != (p = _GetRef(p, _parentKey)))
            {
                var stmt = p as CodeStatement;
                if (null != stmt)
                {
                    result.Statement = stmt;
                    break;
                }
            }
        }
    }
    if(null!=result.Statement)
    {
        _PopulateStatementScopeInfo(result);
    }
    if(null==result.Member)
    {
        p = null;
        if (null != result.Statement)
        {
            p = result.Statement;
        }
        else if (null != result.Expression)
            p = result.Expression;
        if(null!=p)
        { 
            while (null != (p = _GetRef(p, _parentKey)))
            {
                var mbr = p as CodeTypeMember;
                if (null != mbr)
                {
                    result.Member = mbr;
                    break;
                }
            }
        }
    }
...
}

This is a rather long method, but all it's doing is walking from wherever it is, up to the parent that _GetRef() retrieves until it finds something of interest. Then it stops and collects the data to populate the appropriate scope information before continuing.
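If the parent-walking idea is easier to see in isolation, here's a minimal sketch of it using only System.CodeDom. Everything named here (ParentLinks, ParentKey, SetParent, GetParent, FindAncestor) is hypothetical and stands in for the resolver's private plumbing; I'm assuming a weak reference stored in UserData under a private key, as described above, but the real Refresh() and _GetRef() may differ in the details.

```csharp
using System;
using System.CodeDom;

// Hypothetical helper; a stand-in for what the resolver does internally.
static class ParentLinks
{
    // Assumed key object; the real resolver keeps its own private key.
    static readonly object ParentKey = new object();

    // Record a weak back-reference from a child node to its parent.
    public static void SetParent(CodeObject child, CodeObject parent)
    {
        child.UserData[ParentKey] = new WeakReference<CodeObject>(parent);
    }

    // Fetch the parent, or null if none was recorded or it was collected.
    public static CodeObject GetParent(CodeObject node)
    {
        var wr = node.UserData[ParentKey] as WeakReference<CodeObject>;
        CodeObject parent;
        return (null != wr && wr.TryGetTarget(out parent)) ? parent : null;
    }

    // Walk up the parent chain until we hit a node of type T.
    public static T FindAncestor<T>(CodeObject node) where T : CodeObject
    {
        for (var p = GetParent(node); null != p; p = GetParent(p))
        {
            var hit = p as T;
            if (null != hit) return hit;
        }
        return null;
    }

    static void Main()
    {
        var arg = new CodePrimitiveExpression("Hello");
        var call = new CodeMethodInvokeExpression(
            new CodeTypeReferenceExpression(typeof(Console)), "WriteLine", arg);
        var stmt = new CodeExpressionStatement(call);
        var method = new CodeMemberMethod { Name = "Main" };
        method.Statements.Add(stmt);

        // Link each node to its parent by hand for the demo.
        SetParent(arg, call);
        SetParent(call, stmt);
        SetParent(stmt, method);

        // From the innermost expression, find the enclosing member.
        Console.WriteLine(FindAncestor<CodeTypeMember>(arg).Name);

        // The links are weak, so keep the root alive while we use them.
        GC.KeepAlive(method);
    }
}
```

Weak references keep these back-links from pinning the tree in memory, at the cost that the caller has to hold on to the root for as long as it wants to walk upward.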

This allows for some serious spellcraft where we can invoke GetScope(code) from anywhere and get all of the variables, arguments, members and types available to us. We rely on this to get our type information so we can turn those Console "variables" we ran into toward the beginning of the article into what they actually are - type references!

We also have things like GetTypeOfExpression() which can retrieve a CodeTypeReference for (almost) any expression we pass to it, and FillMembersOfType() which gets the members of a CodeTypeReference object or a Type object. It retrieves both the declared and reflected members, including base type members. That's not so easy!

We use this information in a big ugly method called _Patch() in SlangRegenerator, which runs the CodeDomVisitor's Visit() method in a loop, looking for anything that still has "slang:unresolved" on it. For now, there's also a hard limit of 1000 iterations imposed while I finish the code, so it never hangs indefinitely even if not everything is resolved. I'm still in the process of verifying that I'm covering all the elements in this preliminary deliverable.

Inside _Patch() is basically an anonymous method with a bunch of type checks for the various code constructs. This is where C#'s pattern-matching type switch (available since C# 7) would help, but I didn't use it here. Every time we find something, we attempt to use the information we have to correct it. This can get ugly, particularly in the case of the field and type references we find, but we manage. Here, for example, is the segment of code in _Patch() that fills in our var declarations with an explicit type:
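The loop-with-a-cap idea on its own looks something like the sketch below. It's deliberately generic: FixedPoint and RunUntilResolved are hypothetical names, and the pass delegate is a stand-in for one full visit over the tree that reports whether anything is still unresolved. The 1000 default just mirrors the limit mentioned above.

```csharp
using System;

// Hypothetical helper illustrating "repeat passes until nothing changes, with a cap".
static class FixedPoint
{
    // Runs 'pass' until it reports no remaining work or maxIterations is reached.
    // 'pass' returns true when something is still unresolved after that pass.
    // Returns true if everything resolved, false if we gave up at the cap.
    public static bool RunUntilResolved(Func<bool> pass, int maxIterations = 1000)
    {
        for (var i = 0; i < maxIterations; ++i)
        {
            if (!pass())
                return true;
        }
        return false;
    }

    static void Main()
    {
        // Pretend three passes are needed before everything resolves.
        var remaining = 3;
        var done = RunUntilResolved(() => --remaining > 0);
        Console.WriteLine(done);
    }
}
```

Multiple passes are needed because resolving one node (say, a variable's type) can be what makes another node resolvable on the next trip through.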

var vd = co as CodeVariableDeclarationStatement;
if (null != vd)
{
    if (null == vd.Type || "System.Void" == vd.Type.BaseType)
    {
        var scope = res.GetScope(vd);
        var e = res.GetTypeOfExpression(vd.InitExpression, scope);
        if (null == e || e.BaseType == "System.Void")
            more = true;
        else
        {
            vd.Type = e;
            vd.UserData.Remove("slang:unresolved");
        }
    }
}

This code badly needs factoring, but I want to get it fully working before I start, in case I have to redesign.
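For what it's worth, here's a rough sketch of how that same check might read with a pattern-matching switch. It is not the code in SlangRegenerator: PatchSketch and PatchNode are hypothetical, and the inferType delegate is a stand-in I'm assuming in place of the resolver's GetScope() and GetTypeOfExpression() calls so the example compiles against System.CodeDom alone.

```csharp
using System;
using System.CodeDom;

// Hypothetical refactoring sketch; not Slang's actual implementation.
static class PatchSketch
{
    // Returns true if the node still needs another pass.
    public static bool PatchNode(
        CodeObject co, Func<CodeExpression, CodeTypeReference> inferType)
    {
        switch (co)
        {
            // A var declaration shows up with a missing or System.Void type.
            case CodeVariableDeclarationStatement vd
                when null == vd.Type || "System.Void" == vd.Type.BaseType:
                var t = inferType(vd.InitExpression);
                if (null == t || "System.Void" == t.BaseType)
                    return true; // can't tell yet, try again next pass
                vd.Type = t;
                vd.UserData.Remove("slang:unresolved");
                return false;
            default:
                return false;
        }
    }

    static void Main()
    {
        var vd = new CodeVariableDeclarationStatement
        {
            Name = "j",
            InitExpression = new CodePrimitiveExpression(1)
        };
        vd.UserData["slang:unresolved"] = true;

        // Toy inference: primitives take the type of their value.
        PatchNode(vd, e => new CodeTypeReference(
            ((CodePrimitiveExpression)e).Value.GetType()));
        Console.WriteLine(vd.Type.BaseType);
    }
}
```

Each additional construct would become another case, which keeps the per-node logic separate without changing what the loop around it does.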

Where We Are Going

I expect I'll replace the tokenizer and parser with another technique. The current incarnation simply has no decent error handling, and I'm not working from a grammar, but rather from what I can cobble together from Microsoft's C# specs. It's dodgy, to say the least. I'd like to work from a grammar, so whatever I use depends largely on which grammar I settle on. Even continuing with the backtracking recursive descent parser would be fine, with a suitable grammar. However, we may move to something like GLR parsing if something like ANTLR or a hand-rolled parser doesn't cut it. I'd rather avoid that, as it may force me to write part of the project in C++.

The CodeDomResolver needs refactoring, testing and shoring up as does SlangRegenerator but the foundation is stable now. I know how it needs to work, and it's basically doing that, which gives me solid footing to move forward.

As I alluded to before, this code is still a playground for me to design and figure all this out, so it hasn't been factored very well, but the concept is there. I just wanted you to have something you could get your hot little hands on before Christmas.

History

7th December, 2019 - Initial submission