Over the last week, I essentially rebuilt the Director engine’s Lingo compiler from scratch. In doing so, I’ve restructured it to eliminate many longstanding issues and greatly improved its long-term maintainability.
The Lingo compiler, in short, parses Lingo source code and turns it into bytecode, which is then interpreted by the Lingo runtime. Previously, the code generator and the parser were inextricably linked, both defined within a Bison grammar file. This made things really complicated for two major reasons:
1) Bison is designed to parse context-free languages – languages whose syntax can be described by a context-free grammar. Lingo is context-free in terms of syntax, so it can be parsed by a context-free grammar. However, Lingo in terms of semantics, Lingo is context-sensitive. In other words, the same thing can have different meanings based on the context it’s used in. One of the most obvious examples of this is variables. When there’s a global loop
statement somewhere in the script, the variable loop
refers to a global variable. When there’s a property loop
statement somewhere in the script, the variable loop
refers to a property variable. In most other cases, it refers to a local variable, but it can get much weirder. When used in a go
command, the variable loop
doesn’t refer to a variable at all, but to the symbol #loop
. To deal with context-sensitive meaning within the context-free parser, we were basically keeping track of context with actions, and this context could then affect the lexer. This made the parser unnecessarily complicated.
2) Due to the complexities of doing code generation within the grammar, we had plenty of midrule actions, often alternating between symbols and actions several times in one rule. This, exacerbated by the context hacks, made the code hard to follow and quite brittle. Any tiny change brought the possibility of major breakage. Instead of properly defining a grammar that included all of Lingo’s often inconsistent syntax, we sometimes resorted to preprocessing Lingo source code into something the existing grammar could handle.
To fix this, I rebuilt the grammar from the ground up, leaving out anything from the old grammar that dealt with the language’s context-dependent semantics. What was left was a still rather ugly but straightforward, context-free grammar. And now, instead of outputting bytecode, the parser outputs an abstract syntax tree, which is then passed as input to a separate code generator.
An abstract syntax tree, if you’re not familiar with it, is basically just source code in the form of a tree data structure, making it easier to manipulate than raw text. For example, the Lingo source code:
on printStuff arg1
put "foo"
put 123
put arg1
end
can be represented as a tree something like:
HandlerNode {
name: "printStuff",
args: "arg1"
stmts: [
CmdNode {
name: "put",
args: [
StringNode {
val: "foo"
}
]
},
CmdNode {
name: "put",
args: [
IntNode {
val: 123
}
]
},
CmdNode {
name: "put",
args: [
VarNode {
name: "arg1"
}
]
}
]
}
Now, to compile this function to bytecode, we just need to traverse this tree, generating instructions for each node:
c_stringpush "foo"
c_callcmd "put", 1
c_intpush 123
c_callcmd "put", 1
c_localpush "arg1"
c_callcmd "put", 1
In addition, having an AST makes it extremely easy to gather semantic information like variable types and adjust code generation as necessary.
With this refactoring, I was able to clean up the compiler quite a lot. Since parsing is separated from code generation, the grammar is now a lot simpler, and I’ve been able to eliminate almost all of the preprocessing and context hacks we used. Now, it’ll be trivial to make the compiler changes I need to finish up of chunk expressions. Perhaps more importantly, it’ll now be much, much easier to add support for later Lingo versions to the compiler, enabling more exciting stuff to come.
Leave a Reply