Parse Anything: Advanced Parsley, Part 2

honey the codewitch

3.33/5 (6 votes)

Jan 6, 2020

MIT

11 min read

8290

217

Using Parsley to parse a C# subset into the CodeDOM

Introduction

In my previous article, I covered using Parsley's more advanced features to parse a complicated grammar - in this case Slang, a C# subset I use in my code generation projects. This article is simply the logical follow up - taking that parsed tree and turning it into something useful.

Background

Just as a matter of course, this article will raise the question of whether it's better to hand roll or generate your parser. You'll find the work involved in turning a parse tree into something useful is very much like parsing in the first place, so where does it save effort?

This is a good question, and one that bears answering. The short answer is, it doesn't. A parser generator gives you consistency in error handling and parsing to reduce bugs and improve the robustness of your code. It doesn't necessarily save you effort. The parse tree you get has already been (largely) checked for syntax and structure so you already know what it's going to look like, meaning you don't need another layer of error handling while converting the parse tree into something more fun. Still, you have to write that code to convert the tree anyway. That's what we'll be doing here. As for me personally, I tend to prefer the hand written approach, but Slang was too big for me to handle all the error handling and to be sure I got the grammar right. I think after a point, a generated parser can really help.

Conceptualizing this Mess

In the last article, we were left with a somewhat servicable Slang grammar in need of testing and a hint of what we would be doing next. We need to take that mess and turn it into a CodeDOM graph.

Typically, we use the => { ... } syntax directed action features of Parsley to accomplish that but I didn't do that here, largely because of the /fast option which the above doesn't support (resolution always happens for actions so it will slow down the parser generation process), but also because I didn't want the end method to be named Evaluate(), since that name doesn't make sense in this case.

Instead, I simply created a partial class aliasing SlangParser and added the methods myself. I also added a few helper extensions mostly to save my fingers. Unfortunately, the end code isn't all that intelligible because it's just indexing into parse trees. I'd print out the parse tree, count into it to find my indices, and then code from there. I don't have constants for the indices because that would quickly get overwhelming anyway. There are just too many. There's no magic way to make this easier unfortunately, although if you code your grammar to where your parse tree always has the same "shape" - that is, avoiding optional expressions and such that will change the length of the children can make it easier. I've done that in some cases. Particularly in my virtuals, I take care to make the parse tree shape the same regardless of how its filled, since I have full control to do so there.

Just to give you an overview of what the code looks like, here's code for creating a CodeVariableDeclarationStatement from the parse tree.

static CodeStatement _BuildVariableDeclarationStatement(ParseNode node)
{
    CodeTypeReference vt = null;
    if (StatementParser.varType != node.C(0).SymbolId)
        vt = _BuildType(node.C(0));
    var vd = new CodeVariableDeclarationStatement().SetLoc(node);
    if (null != vt)
        vd.Type = vt;
    vd.Name = node.C(1).Value;
    if (3 < node.CL())
    {
        vd.InitExpression = _BuildExpression(node.C(3));
    }
    if (null != vd.InitExpression)
        vd.Mark(node, null == vd.Type);
    return vd;
}

C() and CL() are just aliases for node.Children, and node.Children.Length, respectively. Mark() and SetLoc() mark the UserData of a CodeDOM object with various bits of information we'll use later. Mark() marks an object for visitation, and possible resolution by SlangPatcher (part of Code DOM Go Kit) while SetLoc() simply copies the node's location information to the object which we can use later for line pragmas and error reporting. The whole thing is a bit like hand parsing code would be in the first place, except it's against syntax checked data structures we already have the basic shape of, due to our grammar.

Coding this Mess

Building this Mess

Remember that Parsley is part of my Build Pack, and you'll have to build a couple of times in Release mode to get it to run. This is in order to build the binaries used in the pre-build steps. If you don't do this first, your project will not build. You'll get errors the first time you build. Those are file locking errors due to a necessary circular dependency in the projects. Simply build again and the errors should go away.

In the process of coding this, I have updated the grammar and codebase since the last article with bugfixes as I encountered them. I'm going to give you an overview of the steps I took to write this code so that you will have an idea of the process involved in doing this for your own grammars. It's not my favorite chore, but it's the final leg of the parsing journey.

As I said before, I look at the parse tree and count nodes to find out what's what. You can dump a parse tree of any file Slang can parse with:

try
{
    stm = File.OpenRead("mycode.cs");
    var tokenizer = new SlangTokenizer(stm);
    var pn = SlangParser.Parse(tokenizer);
    Console.WriteLine(pn.ToString("t"));
}
finally
{
    if (null != stm)
        stm.Close();
}

I've bolded the primary bit of interest here, which is ToString("t") on ParseNode. Passing "t" to ToString() will give you the node as a tree. This feature is available to Parsley generated parsers as of this release.

Doing it from a string of text is even easier:

var text = "using System;" +
    "class Program {" +
        "static void Main() {" +
            "Console.WriteLine(\"Hello World!\");" +
        "}" +
    "}";

var tokenizer = new SlangTokenizer(text);
var pn = SlangParser.Parse(tokenizer);
Console.WriteLine(pn.ToString("t"));

This will spit out a big long parse tree here:

+- CompileUnit
   +- UsingDirective
   |  +- usingKeyword using
   |  +- NamespaceName
   |  |  +- identifier System
   |  +- semi ;
   +- CustomAttributeGroups
   +- Class
      +- Comments
      +- CustomAttributeGroups
      +- TypeAttributes
      +- Partial
      +- classKeyword class
      +- identifier Program
      +- TypeParams
      +- BaseTypes
      +- WhereClauses
      +- Members
         +- Member
            +- Method
               +- Comments
               +- CustomAttributeGroups
               +- MemberAttributes
               |  +- staticKeyword static
               +- voidType void
               +- PrivateImplementationType
               +- identifier Main
               +- lparen (
               +- MethodParamList
               +- rparen )
               +- StatementBlock
                  +- lbrace {
                  +- Statement
                  |  +- Comments
                  |  +- ExpressionStatement
 ...

Note that I've omitted the bit below the ellipses as it's quite "wide" and will wrap.

One of the things to note here is we have a bunch of empty nodes like Partial and PrivateImplementationType. These work such that they are not considered unless they have child nodes, but they themselves are always present in the tree. The reason I designed the grammar this way was to make it much easier to find your indexes in the tree, since the "height" of the tree won't change based on the data. In other words, this is so the count of nodes from the top of Class to Members is always the same. It's a good idea to design your grammars this way. This is how you do it, instead of this:

MyProduction= [ MyOptionalElement ] foo;
MyOptionalElement= bar;
foo="foo";
bar="bar";

which would make MyOptionalElement not appear in the parse tree under MyProduction unless it's specified, you'd do:

MyProduction= MyOptionalElement foo;
MyOptionalElement= bar|; // or MyOptionalElement= [ bar ];
foo="foo";
bar="bar";

Note how we've moved the optional nature of MyOptionalElement to the production itself, making it so it can be empty. This is how you do it. The same applies to list nodes. Simply replace { bar }+ with { bar } under the production itself, similar to the above, and then make the production itself non-optional within its parent, just like the above. This might sometimes give rise to conflicts in which case it might not work for you, and you'll simply have to put up with a variable length parse tree or use automatic backtracking to override the conflict where possible, although this approach is not recommended in this case, as it's a perf hit for little benefit, and trial and error backtracking will always be less reliable than not backtracking.

Now that the grammar's tree is factored to be as easy to traverse as possible, we can get on with it. In general, the code you'll have to write increases linearly in complexity and size more nodes your subtree has, so for basic elements of the grammar, like the variable declaration code I showed well above it's pretty simple, whereas for things like a type declaration, things get complicated quickly regardless of how you've factored your grammar, although reusing elements of your grammar in as many places as possible means reusing the code you have to write in as many places is possible, which saves you work. Reduce, reuse recycle! Let's look at one of our more complicated procedures - the type declaration I mentioned above:

var result = new CodeTypeDeclaration().Mark(node);
var ca = _BuildCustomAttributeGroups(node.C(1));
_AddCustomAttributes(ca, "", result.CustomAttributes);
ca.Remove("");
// make sure there's no attributes that are targeted to something other than the default:
if (0 < ca.Count)
    throw new SyntaxException
          ("Invalid attribute target", node.Line, node.Column, node.Position);
result.Attributes = 0;
if (TypeDeclParser.MemberAttributes == node.C(2).SymbolId)
{
    result.Attributes = _BuildMemberAttributes(node.C(2));
    result.TypeAttributes = _BuildMemberTypeAttributes(node.C(2));
}
else
    result.TypeAttributes = _BuildTypeAttributes(node.C(2));
// only partial if Partial has child nodes
result.IsPartial = 0<node.C(3).CL();
var name = node.C(5).Value;
if (verbatimIdentifier == node.C(5).SymbolId)
    name = name.Substring(1);
result.Name = name;
switch (node.SymbolId)
{
    case TypeDeclParser.Enum:
        result.IsEnum = true;
        break;
    case TypeDeclParser.Class:
        result.IsClass = true;
        break;
    case TypeDeclParser.Struct:
        result.IsStruct = true;
        break;
    case TypeDeclParser.Interface:
        result.IsInterface = true;
        break;
}
result.TypeParameters.AddRange(_BuildTypeParams(node.C(6)));
result.BaseTypes.AddRange(_BuildBaseTypes(node.C(7)));
var wn = node.C(8);
for(var i = 0;i<wn.Children.Length-1;i++)
{
    ++i;
    CodeTypeParameter p = null;
    var tpn = wn.C(i).C(0).Value;
    if (verbatimIdentifier == wn.C(i).C(0).SymbolId)
        tpn = tpn.Substring(1);
    for (int jc= result.TypeParameters.Count,j = 0;j<jc;++j)
    {
        var tp = result.TypeParameters[j];
        if(0==string.Compare(tp.Name,tpn,StringComparison.InvariantCulture))
        {
            p = tp;
            break;
        }
    }
    if (null == p)
        throw new SyntaxException("Where clause constraint on unspecified type parameter", 
                                   wn.Line, wn.Column, wn.Position);
    var wcpn = wn.C(i).C(2);
    for(var j = 0;j<wcpn.Children.Length;j++)
    {
        var wcc = wcpn.C(j);
        if(TypeDeclParser.newKeyword==wcc.C(0).SymbolId)
            p.HasConstructorConstraint = true;
        else
            p.Constraints.Add(_BuildType(wcc.C(0)));
                    
        ++j;
    }
}
var mn = node.C(9);
for(var i = 0;i<mn.Children.Length;i++)
    result.Members.Add(_BuildMember(mn.C(i)));
return result;

Decoding the "where" clauses is what occupies much of this procedure. The rest is just basic indexing into the parse tree and delegation to the various other _BuildXXXX() methods, which is what perhaps 80% or more of what the code in the _BuildXXXX() methods do. The "where" clauses are complicated grammar-wise, and further complicated by the way the CodeDOM maps them, which requires us to look up the type parameter for each constraint using an unindexed search. Fortunately, there's almost never more than a dozen type params for all but the most complicated generics, the exception being the Tuple<> generics, so this isn't much of a performance hit, and it only applies when "where" is in play in any case.

Note on grammar design with respect to errors:

You can see this is tedious, but notice how we don't do error handling in this routine, which would make it much more tedious still. You can do error handling in these routines, but I don't necessarily recommend it, as the parse tree has already been checked and any errors would have been reported upstream during the parse, which is kind of the point of generating your parser in the first place. If you add error handling here, you'd be nearly doubling the code you needed to write, perhaps ironically introducing bugs as a result, and aside from that you'd be duplicating work already performed during the parse itself. At best, you might want to do error handling in the public methods that expose these routines, before you pass the parse node down, but you only need to verify the root node, and only if it was passed in from an external source instead of you parsing it directly in the routine. Any error handling I do in these routines is just to validate the few areas where we couldn't do validation via the grammar itself. Try to build your grammars to do as much validation as you can. It reduces bugs and the code you'll have to write during this step. Basically, you want to make your grammar as specific and precise as it will allow for. Sometimes, such as I did in a few areas in the Slang grammar, you may have to resort to virtual productions to take over the parse and introduce complicated error handling like we do in _ParseMember(). You don't strictly have to do it there, but if you don't, you'll have to do your validation later, during this parse tree examination step. That's not great, as good error handling means reporting as soon as possible. You'll want to go through the effort to report errors during the parse phase as much as possible. That's all about how you design your grammar, so good design is key here. Like any code, it can take a few iterations to get it right. Remember that one of the key benefits of generated parsers - if not the primary benefit - is error handling, so I can't overstate the importance of careful design here. Take the time and refine it up front.

Even after all of this code, we still don't have a completely valid CodeDOM tree. The reason is that the CodeDOM requires us to distinguish between things like references to fields versus references to properties, while both look exactly the same in C#. Furthermore, all method calls are resolved initially here as delegate invoke expressions, so myObj.ToString() would be regarded as invoking the delegate referred to by the ToString "field!" Like above, method invocations and delegate invocations look exactly the same in C#. There are many other areas that require resolution such as var variable declarations and the like. That being said, most code will actually render properly in C# (except for var variables) from a CodeDOM tree like this, incorrect as it is, precisely because these unresolved field references and delegate invocations look just like the "proper" code when rendered in C#. This is not applicable to other languages. In those cases, we must patch the CodeDOM we get back.

In general, we'll patch the tree before we operate on it, but that's for a future article. Since this is a specialized phase that's not applicable to most parsers, it won't be covered here. Instead, we've focused simply on converting a parse tree into an abstract syntax tree, which is what the CodeDOM is. We've created our code tree from the parse tree, and marked every possible node in the resulting AST tree with the source line, column and position information of where it came from in the source document. This is almost necessary for decent error handling, and is the achilles heel of the current incarnation of Slang - part of what this new code here seeks to rectify.

Note on designing your final data structures:

Probably most frequently with your own parsing code, you won't be putting it into 3^rd party data structures like the CodeDOM, but rather your own data structures. When you do so, remember to hold information about the source document where the tree came from - like line and column information. You'll regret it if you don't design for it up front. Fortunately, the CodeDOM objects have a user defined dictionary where we can plug arbitrary information but for your own structures, you'll want to design them so that they have member fields for this data.

The latest codebase is slightly ahead of our article, as I've already included some of the patching code. We'll cover that code in a follow up article, but it needs to be optimized and a proper demo built around it first, and I'd rather put together an article focusing on it specifically.

History

6^th January, 2020 - Initial submission