Introduction
This article is about creating code models, or “CodeDOMs”,
that model the semantics of programming languages. This is Part 1 of a series on codeDOMs which
will include lots of source code, but this first article is a necessary
background discussion of what is being attempted and why it’s needed.
Most Computer Languages Today are Lacking Something Important
Software developers must not only create applications which
meet requirements and are as robust and bug-free as possible, but they must
also do their best to build software which is as easy as possible to maintain
and extend – not only by themselves, but by others. This often involves the use of
object-oriented techniques to create an easy-to-understand object model which
can also be easily extended. Such
techniques are not always used in these days of HTML and scripting languages,
but it’s probably safe to say that the majority of senior developers would
agree that object-oriented analysis, design, and programming techniques are a
“best practice” for large, complex applications.
Therefore, it’s somewhat ironic that the tools used to build
such applications rarely expose an object model and usually are not extensible. This problem starts with the most important
tools of all: the language compilers. Developers
will often find the language they are using to be somewhat lacking in
capability or features for their specific needs, but will have no recourse but
to work within the confines of the language, usually waiting many years for new
features to arrive. Experienced
developers working on complex systems will often end up effectively implementing
a “Domain-Specific Language” (DSL) as a natural way of simplifying core logic
in the system. This “language” might consist
only of a set of helpful methods or types (the developer might not even realize
that they’ve effectively created a DSL), or it might be a true scripting or
compiled language (whether homegrown or third-party). However, they will not be able to extend the primary
language to accommodate the DSL, limiting the power of this technique (consider
the addition of “inline SQL” with the LINQ feature of C# as one example of what
could be accomplished by allowing for such extensions).
More importantly, the closed nature of language compilers
puts a huge burden on third-party developers of related tools, such as
code analysis tools, code difference comparers, etc. Vendors must implement their own parsing and
reference resolving, which requires huge amounts of time and is exceedingly
complex. This duplication of effort
between the compiler, editor, and other tools results in poor quality due to inconsistencies,
bugs, and performance and memory usage issues.
It also creates a large barrier to entry for such tools, limiting
choices for developers. No matter how
good a developer is, good tools will make them better. A good static analysis tool will find issues
in anyone’s code. But poor tool
quality can drive developers away from using tools that would otherwise improve
their code.
In summary, the qualities that software developers strive to
provide in software that they create do not effectively exist in the core
language tools that they use to write their code: they do not have easily
extensible object models (at least, not publicly accessible ones). And, this causes serious difficulty for all
higher level tools used in the development process. In other words, “The cobbler’s children have
no shoes”. The majority of developers
probably don’t even realize this, or understand just how much of a limitation
this is to them in their daily work – it’s just the way it’s always been. Languages are closed.
Why are Computer Languages Generally “Closed”?
Computer languages are almost always defined in terms of
text, and they are almost always closed to extension by users. I would argue that this closed nature is due more
to tradition than good reason. The
tradition when creating a computer language is to create a “grammar” that
defines its text representation. This
grammar is then fed into a code generator that creates fearsomely convoluted
code that “parses” the language and builds an Abstract Syntax Tree (AST) which
can then be analyzed for correctness and used to generate executable code (or
pseudo instructions which are converted to real opcodes later). Any change to the language requires a change
to the grammar, re-generation of the parsing code, and re-building of the entire
compiler – everything about this architecture is a barrier to changes by anyone
who isn’t a member of the compiler team.
And, even if you’re a member of that team (perhaps you’ve created your
own DSL), you will probably dread having to debug a problem and step into that ugly
generated parsing logic.
Computer languages are text-based. That’s how they started, and it’s never
changed (ignoring a few rare exceptions that have never really caught on). Developers are taught in school that this is
just how it’s done. Most of them have
probably never really thought very hard about why it’s done that way, much less
about possible alternatives and the world of benefits that they might open up. Most of them are also probably scared away
from ever creating their own languages after that one compiler theory class
that they had (grammars, compiler-compilers, lexers, ASTs, semantic analysis,
EBNF, LL(k), LALR – ack!). It’s the traditional
text nature of computer languages combined with the traditional methods of
implementing them that results in their closed nature. Once upon a time, perhaps this all made
plenty of sense, but at least since the advent of graphical IDEs, it has become
somewhat archaic in my opinion.
It’s time to think outside the (text) box. It’s time to swallow the red pill.
Object Models for Code: CodeDOMs
What is a “computer language”, exactly? Why not define computer languages in terms of
objects rather than text? After all,
text-based languages are generally converted to objects by modern compilers in
order to be processed. Text is actually
a terrible format for a computer to digest – it’s used for the benefit of
humans, not machines. Using text is an
ancient tradition with computer languages, because it allows developers to
easily write code using any text editor.
However, for decades now most developers have used IDEs with what are
essentially graphical editors (with colors, fonts, pop-up menus, tooltips,
IntelliSense, collapsible sections, etc.).
They get the feel that they’re working with text, and their code is
stored in text files, but the IDE is very far from being a simple text editor –
it’s using a hidden object model to represent the code internally.
So, why not make the huge leap of designing languages directly as an object model that
represents the semantics of the language and forget about text and all of the
limitations that come with it? Well, maybe
we can’t completely forget about text, but can we at least make the text
representation second-class to the object model instead of the other way
around? This isn’t actually a completely
new idea – Smalltalk provided an object model for code, and various visual
programming environments have effectively done it. The fact that such attempts have not
succeeded wildly doesn’t mean that the general idea isn’t a good one – just
that various drawbacks existed which prevented widespread adoption. There’s no question that the potential
benefits are huge and numerous, but a successful design will require almost
zero drawbacks. Drag-and-drop
programming may have its uses, but preventing developers from typing away
madly, writing temporary pseudo-code, or anything else that they do today, would
make most of them very unhappy. When it
comes to editing code, the user must have at least the option of an
experience very similar to what they have with text languages. Also, although storing programs as objects in
a database would seem to make a great deal of sense (after all, that’s what is
generally done with any other complex data), it also makes sense to retain the
option of storing as text for backwards compatibility.
The logical conclusion of this line of thinking is: We
should start designing computer languages primarily as an object model that
represents the semantics of the language, but also with an alternative text
representation and with easy conversion between the two. This might not sound all that different from
what basically exists today, but the focus on making an object model the
first-class implementation will have a huge impact: all tools will be
consistent and much easier to create instead of everybody constantly
replicating effort by creating their own incompatible object models. Also, the language will be completely open
and extensible like any good object-oriented design. The number and quality of language tools will
both increase dramatically. Sometimes, a
relatively minor change in viewpoint can make a tremendous difference in
outcome.
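As a rough illustration of "easy conversion between the two", here is a minimal sketch using Python's standard `ast` module (Python 3.9 or later for `ast.unparse`). This is only a stand-in: `ast` is a syntax-level tree that discards comments and formatting, not a codeDOM as described in this article, and the variable names are made up for the example. It does show the basic round trip: text becomes objects, a tool edits the objects, and the objects become text again.

```python
import ast

# Text -> objects: parse a one-line program into a tree of node objects.
source = "total = price * quantity"
tree = ast.parse(source)


class RenamePrice(ast.NodeTransformer):
    """A tiny 'tool' that edits the object tree instead of the text."""

    def visit_Name(self, node):
        if node.id == "price":
            node.id = "unit_price"
        return node


# Manipulate the objects directly; no string searching or regexes.
tree = RenamePrice().visit(tree)

# Objects -> text: regenerate source from the modified tree.
result = ast.unparse(tree)
print(result)  # total = unit_price * quantity
```

The rename touches only identifier nodes, so a comment or string that happened to contain the word "price" could never be changed by accident, which is the kind of safety that text-based tools have to work hard to achieve.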
Let’s call such object models for code “CodeDOMs”. The “DOM” stands for “Document Object
Model”. It’s not perfect, but it’s
concise and there is a history of prior use for this term (although perhaps not
with exactly the same definition, which will be addressed later). Also, what language should be used to create
the object model? In most cases, the same language being modeled –
that might sound a bit strange at first, but it actually makes a lot of sense
(it’s known as “bootstrapping”). The
codeDOM could also be provided in other languages if for some reason they might
be used to manipulate the primary language.
How Do You Examine and Edit Code Without Using Text?
The codeDOM objects would be displayed very much like text
is displayed in an IDE: in what is really a graphical display, but
one that still shows plenty of text in colored fonts. After all, it will still make sense for code
objects such as types and methods to have names, statements to use keywords, and
of course to have comments throughout the code.
It should always be possible to get a virtually identical display as you
would with text, but it will also be possible to get a more graphical display
if desired (or use a different “skin”), since it will be objects that are
rendered instead of plain text. For
example, background colors and enclosing lines might be used to represent
objects that are children of others, comments that are associated with specific
code objects, etc. Sub-expressions that
“wrap” onto more than one line might be vertically centered within the parent
expression, meaning that all text might not line up exactly into specific rows
(or columns) on the screen. Most IDEs
already use a proprietary object model internally for display that is mapped to
the text – the idea here is to provide a public object model for the language
instead. It’s also a suggestion that
IDEs move towards a more graphical display, dropping the idea of the
represented code appearing almost exactly as the lines and columns of a text
file.
Actually, once you start to think of code being displayed
truly graphically instead of as text, many new things become possible. Any statement could have its body optionally
collapsed. Many syntax characters lose
their importance, and things such as braces, semicolons, or even statement and
method parentheses might be optionally hidden.
Comments could be optionally displayed in a proportional font or
hidden. Documentation comments could be
displayed in a WYSIWYG format instead of as XML. Real mathematical symbols could be used to
display some operators in place of the ASCII characters that traditionally fill
in for them. You could even customize
existing UI controls for code objects, or create your own – such as a mapping
table that looks like a spreadsheet dropped into your program that maps one
column of values directly onto another (implemented with a hidden ‘switch’
statement, or even with custom code generation). You might choose to see a tree-like
representation of code objects, such as to make the evaluation order of a
complex expression more obvious. The
possibilities are basically unlimited – and best of all – each user could
customize their own view of the code as they desire (no more concerns or
arguments about formatting). You could
even provide the option to map keywords and library names to the (human)
language of the programmer, instead of forcing everybody to deal with English.
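To make the idea of per-user views a little more concrete, here is a minimal sketch in which one small tree of code objects is rendered through two different "skins". Every name in it (`Skin`, `render`, the node classes, and the German keyword mapping) is hypothetical and invented for this illustration; a real display layer would draw graphics rather than return strings.

```python
from dataclasses import dataclass


@dataclass
class Name:
    identifier: str


@dataclass
class Compare:
    left: object
    op: str  # canonical operator id such as "le", never a display symbol
    right: object


@dataclass
class Return:
    value: object


@dataclass
class If:
    condition: object
    body: list


@dataclass
class Skin:
    """Per-user display preferences; the code objects never change."""
    keywords: dict
    operators: dict
    show_braces: bool = True
    terminator: str = ";"


def render(node, skin, indent=0):
    pad = "    " * indent
    if isinstance(node, Name):
        return node.identifier
    if isinstance(node, Compare):
        left = render(node.left, skin)
        right = render(node.right, skin)
        return f"{left} {skin.operators[node.op]} {right}"
    if isinstance(node, Return):
        keyword = skin.keywords["return"]
        return f"{pad}{keyword} {render(node.value, skin)}{skin.terminator}"
    if isinstance(node, If):
        lines = [f"{pad}{skin.keywords['if']} ({render(node.condition, skin)})"]
        if skin.show_braces:
            lines.append(pad + "{")
        lines.extend(render(s, skin, indent + 1) for s in node.body)
        if skin.show_braces:
            lines.append(pad + "}")
        return "\n".join(lines)
    raise TypeError(f"no renderer for {type(node).__name__}")


classic = Skin(
    keywords={"if": "if", "return": "return"},
    operators={"le": "<=", "ne": "!="},
)
localized = Skin(
    keywords={"if": "wenn", "return": "zurück"},
    operators={"le": "≤", "ne": "≠"},
    show_braces=False,
    terminator="",
)

tree = If(Compare(Name("count"), "le", Name("limit")), [Return(Name("count"))])
classic_view = render(tree, classic)
localized_view = render(tree, localized)
print(classic_view)
print(localized_view)
```

Both views come from the same objects, so two developers could look at the same code with braces or without, with ASCII or mathematical operators, and neither choice would ever show up as a change in version control.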
As far as editing, the GUI would probably have more
graphical editing options than a standard IDE, such as the use of drop-down
selections, drag-and-drop, etc. The use
of a more graphical display could make it quicker and easier to select proper
code fragments than when using a text-based display. However, doing editing “right” would mean
allowing the user to just type away normally, parsing the code on the fly into
code objects, while also easily allowing for code fragments or pseudo-code that
isn’t quite valid yet. This is an area
where previous attempts at this sort of thing have often fallen short, but
there is theoretically no reason that such text-like editing couldn’t still be
supported for a tree of objects.
The average user of a language with a standardized codeDOM
wouldn’t necessarily need to learn or use the codeDOM. They could learn the language much as they do
with text-based languages today. They
would benefit from the codeDOM through the increased number and quality of language
tools that it would bring about, but they wouldn’t need to use it directly themselves. Most likely, though, the day would come that
they’d find themselves using the codeDOM to create a tool, extend the language,
add code analysis rules, or generate code.
A codeDOM also provides “reflection” and “expression tree” support (used
in modern, managed languages such as C#).
How Much Does a CodeDOM Really Buy Us?
I’ve already talked about quite a few benefits, but here’s a
recap plus some additional ideas:
- Better consistency between
tools, less memory usage, and better performance.
- A much bigger selection of
tools, with much better overall quality.
- A much higher level of
customization for all tools, starting with the language itself.
- Much better support for
DSLs, and tight integration with the primary language.
- A more graphical display
and manipulation of code in addition to text-like editing.
- Highly customizable
display of code by each individual user – making it easier to read and
understand code, and finally putting an end to formatting style
disagreements.
- Better and more easily
customized code analysis with better performance.
- Much better version
control based upon actual code object changes instead of text.
- The ability to store code
in a database instead of text files, increasing performance by eliminating
the need for parsing, and providing better management of large codebases.
- Search, analyze, and
refactor code using powerful SQL queries.
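The last two points can be sketched with Python's built-in `sqlite3` module. The schema below (a single `code_object` table with `parent_id` and `target_id` columns) and the sample class and method names are assumptions made up for this example, not a proposed storage format. Calls are stored as direct references to the declaration they invoke rather than as names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE code_object (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES code_object(id),
        kind      TEXT NOT NULL,
        name      TEXT,
        target_id INTEGER REFERENCES code_object(id)  -- resolved reference
    )
    """
)
conn.executemany(
    "INSERT INTO code_object VALUES (?, ?, ?, ?, ?)",
    [
        (1, None, "class", "Invoice", None),
        (2, 1, "method", "Total", None),
        (3, 1, "method", "Print", None),
        (4, None, "class", "MathUtil", None),
        (5, 4, "method", "Round", None),
        (6, 2, "call", None, 5),  # Total calls Round
        (7, 3, "call", None, 5),  # Print calls Round
    ],
)

# Search: which methods contain a call to Round?
callers = [
    row[0]
    for row in conn.execute(
        """
        SELECT m.name
        FROM code_object AS c
        JOIN code_object AS m ON c.parent_id = m.id
        JOIN code_object AS t ON c.target_id = t.id
        WHERE c.kind = 'call' AND t.name = 'Round'
        ORDER BY m.name
        """
    )
]
print(callers)  # ['Print', 'Total']

# Refactor: rename the declaration once; every call follows automatically
# because calls point at the declaration instead of repeating its name.
conn.execute("UPDATE code_object SET name = 'RoundHalfEven' WHERE id = 5")
called = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT t.name
        FROM code_object AS c
        JOIN code_object AS t ON c.target_id = t.id
        WHERE c.kind = 'call'
        """
    )
]
print(called)  # ['RoundHalfEven']
```

No parsing happens at query time, and the rename is a single-row update rather than a search-and-replace across files.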
I certainly hope that most readers are starting to buy into
the whole codeDOM idea by now, and that many of you are more than a little
excited about the possibilities.
Honestly, I’ve been thinking about them myself since the 1990s, and
frankly I’m quite disappointed in the entire industry that it’s taking so long
to implement something with such obvious huge benefits! As software developers, we not only need this sort of thing, we needed it decades ago! On that note, I’ll risk being a bit grandiose
and present a “manifesto” of sorts (without actually using that overused word).
The Software Developer’s Ultimatum
As software developers, we are often asked to perform
monumental tasks in ridiculously short periods of time, making few mistakes
along the way, and resulting in something that is easy to understand and extend
by those who come along after us. Sadly,
we often fall far short of meeting these expectations.
We accept part of the blame, acknowledging that we are only
human. However, let it be known that the
tools which are provided for us to accomplish our work are woefully inadequate
for the job. To create better software,
we need better tools. Object-oriented
techniques, managed code, and agile methodologies have been big
advances, but they’re not enough. Our core tools are lacking in many respects,
and until we are provided with better ones, the quality of our work will suffer
accordingly.
Specifically, a modern computer language platform should
include:
1) A publicly accessible and extensible set of
classes that model the semantics of the language, implemented in the language
itself, and in addition to the text representation with easy conversion between
the two. We need to escape the restrictions
of text, and the need for language tools to constantly re-parse it.
2) A publicly accessible and extensible graphical
editor that allows for direct manipulation of code objects while also providing
most commonly used text-oriented editing features. We need better code editing and refactoring
tools, and we need to be able to customize them ourselves.
3) Integrated and easily customizable support for
code analysis that can be improved over time to be far better than what is
currently available. We need better code
analysis without performance issues, and we need to be able to customize it
ourselves.
4) Integrated and automatic version control that
determines differences at the code object level instead of at the text level,
and provides excellent branching and merging capabilities. We need to know exactly what has changed, not
a rough guess, and we need simple & error-free merging.
5) Optional storage of code objects directly in a
database instead of text files, allowing for faster, more powerful, and
parallel searching and analysis operations on very large codebases. We need the proper tools for analyzing
mountains of code, not scanning thousands of text files.
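Item 4 is perhaps the easiest to picture with a small example. The following is a deliberately naive sketch of a difference computed on code objects rather than on lines of text. The `Node` class and `diff` function are hypothetical, and children are paired purely by position, which a real version control tool could not get away with (it would need stable object identities to track moves and reordering).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    name: str = ""
    children: list = field(default_factory=list)


def diff(old, new, path=""):
    """Report changes between two trees as statements about code objects."""
    here = f"{path}/{old.kind}:{old.name}"
    if old.kind != new.kind:
        return [f"{here}: replaced by {new.kind}:{new.name}"]
    changes = []
    if old.name != new.name:
        changes.append(f"{here}: renamed to {new.name}")
    # Naive assumption: the i-th child of old corresponds to the i-th of new.
    for a, b in zip(old.children, new.children):
        changes += diff(a, b, here)
    for extra in old.children[len(new.children):]:
        changes.append(f"{here}: removed {extra.kind}:{extra.name}")
    for extra in new.children[len(old.children):]:
        changes.append(f"{here}: added {extra.kind}:{extra.name}")
    return changes


old = Node("class", "Invoice", [
    Node("method", "Total", [Node("return")]),
    Node("method", "Print"),
])
new = Node("class", "Invoice", [
    Node("method", "GrandTotal", [Node("return")]),
    Node("method", "Print"),
    Node("method", "Save"),
])

changes = diff(old, new)
for change in changes:
    print(change)
```

The report says that one method was renamed and one was added. Whitespace, brace placement, and line wrapping cannot appear in it at all, because the trees being compared do not contain them.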
When provided with such a development platform, our productivity
and the quality of our code will increase significantly. We will finally be free to concentrate mainly
on the problems we are trying to solve instead of working through a fog of
limitations with our tools.
As of Fall 2012 – despite the creation of many new managed
and scripting languages in the last 20 years – there really aren’t any
widely-used computer languages available to us that provide any of these
features, much less all of them.
Design Goals for a CodeDOM
I’m going to lay out some primary design goals for a codeDOM
as I’ve been defining it. My idea of a codeDOM
is a set of classes that can be used to create a tree of objects that represent
the semantics (meaning) of code in a particular language (a new language or an
existing one). These “code objects” will
represent the code in a form that is most easily manipulated by other code,
making it as easy as possible for a programmer to write code that analyzes and
modifies the code, thus greatly facilitating the creation of language
tools. I think the most important design
goals for such a codeDOM are:
- Clearly
named classes that model the actual semantics of the language. All statements, operators, comments,
etc. should have their own classes, they should be in a logical hierarchy
with a single common base class, and preferably should be implemented in
the language being modeled (although implementations for other languages
could also exist).
- Child objects should have
a reference to their parent. It
should be possible to associate comments with a particular code object in
addition to having standalone comments.
- Conservative use of
memory. Large codebases will have
millions of objects, so no extra fields or objects for non-essential
syntax, tokens, or formatting that really apply only for text.
- Easy modifications of the
code object tree. Rename an object
simply by changing its Name property.
Assigning an object to a new parent should automatically update its
parent reference.
- Support for formatting
code objects as text (for export, debugging, etc.), but implemented in an
unobtrusive way that defaults to standard formatting if not explicitly
specified.
- Support for parsing existing
text code into code objects in addition to creating them manually.
- Support for resolving of
symbolic references into direct references to other code objects.
It’s impossible to overstate the importance of the first design goal. The names of the classes, their hierarchy,
and their members basically define the language in programmatic form, and they
should match the text representation as closely as possible.
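Several of these goals can be sketched in a few dozen lines. The classes below are hypothetical and written in Python purely for illustration; they are not the C# classes from this series. The sketch covers a single common base class, parent references that update automatically on re-parenting, renaming through a name property, comments attached to specific code objects, default text formatting, and a reference that points directly at a declaration instead of holding a symbolic name.

```python
class CodeObject:
    """Common base class for all code objects."""

    def __init__(self):
        self._parent = None
        self._children = []
        self.comment = None  # optional comment attached to this object

    @property
    def parent(self):
        return self._parent

    @parent.setter
    def parent(self, new_parent):
        # Re-parenting keeps both sides of the relationship consistent.
        if self._parent is not None:
            self._parent._children.remove(self)
        self._parent = new_parent
        if new_parent is not None:
            new_parent._children.append(self)

    @property
    def children(self):
        return tuple(self._children)

    def as_text(self, indent=0):
        raise NotImplementedError


class VariableDecl(CodeObject):
    def __init__(self, type_name, name):
        super().__init__()
        self.type_name = type_name
        self.name = name

    def as_text(self, indent=0):
        return "    " * indent + f"{self.type_name} {self.name};"


class VariableRef(CodeObject):
    """A resolved reference: it holds the declaration, not a name string."""

    def __init__(self, target):
        super().__init__()
        self.target = target

    def as_text(self, indent=0):
        return self.target.name


class Return(CodeObject):
    def __init__(self, expression):
        super().__init__()
        expression.parent = self

    def as_text(self, indent=0):
        return "    " * indent + f"return {self.children[0].as_text()};"


class Block(CodeObject):
    def __init__(self, *statements):
        super().__init__()
        for statement in statements:
            statement.parent = self

    def as_text(self, indent=0):
        pad = "    " * indent
        lines = [pad + "{"]
        for statement in self.children:
            if statement.comment:
                lines.append("    " * (indent + 1) + "// " + statement.comment)
            lines.append(statement.as_text(indent + 1))
        lines.append(pad + "}")
        return "\n".join(lines)


total = VariableDecl("int", "total")
total.comment = "running sum"
body = Block(total, Return(VariableRef(total)))
text_before = body.as_text()

total.name = "sum"  # rename by changing one property
text_after = body.as_text()

other = Block()
total.parent = other  # moves the declaration and updates both parents
print(text_before)
print(text_after)
```

Because `VariableRef` stores the declaration itself, the rename shows up at every use with no search step. Note that the objects store no braces, semicolons, or indentation; those exist only in `as_text`, in keeping with the goal of not spending memory on text-only details.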
Doesn’t Something Like This Already Exist?
As I’ve already mentioned, this isn’t exactly a revolutionary
idea – at least, not the basic idea of an object model that represents
code. But, the specific viewpoint that
we should start designing languages as an open object model first with a text
representation more of an afterthought is perhaps somewhat radical. And, it seems to me that few of the design
goals laid out above have really been met by any existing tool to date. However, in .NET, there are code modeling
classes in the System.CodeDom and System.Linq.Expressions
namespaces, and then there’s the Roslyn project. I’ll address these further in my next
article.
Enough Talk, Already – How About Some Code?
I’ve been writing code for over 30 years, and I’m tired of
waiting for some big company to finally create a development environment that
gets to the promising, gleaming future where code is stored as objects instead
of text. Forget about flying cars – I
want codeDOMs!
The good news is that I’ve actually been working on exactly
that for a long time now, and I’m prepared to hand over a lot of my sources to
the public domain in the hopes of increasing interest in codeDOMs, and also
just to share useful code. In this
series of articles, I will share a codeDOM for C#, displaying a codeDOM with
WPF, an object-oriented parsing technique, codeDOM classes for reading/writing
VS Solution and Project files, examples of using Reflection and Mono Cecil to
load metadata from assemblies, how to resolve symbolic references in a codeDOM,
calculating metrics from and searching a codeDOM, an analysis of what tools
such as Roslyn give us, and more.
In my next article, I will jump into creating a codeDOM
based upon C#, with source included. I’ll
try to keep it simple, clean, and well organized, but we’ll have about 45,000
lines of code spread across about 300 types for a start, and several times that
much by the end of this series.