In this chapter,
we’ve been building the abstract syntax trees of
Simple
programs manually—writing longhand Ruby expressions likeAssign.new(:x, Add.new(Variable.new(:x), Number.new(1)))
—rather
than beginning with raw
Simple
source code like'x = x + 1'
and using a parser to automatically turn it into a
syntax tree.
Implementing a
Simple
parser
entirely from scratch would involve a lot of detail and take us on a long
diversion from our discussion of formal semantics. Hacking on toy
programming languages is fun, though, and thanks to the existence of
parsing tools and libraries it’s not especially difficult to construct a
parser by relying on other people’s work, so here’s a brief outline of how
to do it.
One of the best parsing tools available for Ruby is
Treetop
, a
domain-specific language for describing syntax in a way that
allows a parser to be automatically generated. A Treetop description of a
language’s syntax is written as
a
parsing expression grammar
(PEG), a
collection of simple, regular-expression-like rules that are easy to write
and to understand. Best of all, these rules can be annotated with method
definitions so that the Ruby objects generated by the parsing process can
be given their own behavior. This ability to define both a syntactic
structure and a collection of Ruby code that operates on that structure
makes Treetop ideal for sketching out the syntax of a language and giving
it an executable semantics.
To give us a taste of how this works, here’s a cut-down version of
the Treetop grammar for
Simple
,
containing only the rules needed to parse the string'while (x < 5) { x = x * 3 }'
:
grammar
Simple
rule
statement
while
/
assign
end
rule
while
'while ('
condition
:
expression
') { '
body
:
statement
' }'
{
def
to_ast
While
.
new
(
condition
.
to_ast
,
body
.
to_ast
)
end
}
end
rule
assign
name
:
[a-z]
+
' = '
expression
{
def
to_ast
Assign
.
new
(
name
.
text_value
.
to_sym
,
expression
.
to_ast
)
end
}
end
rule
expression
less_than
end
rule
less_than
left
:
multiply
' < '
right
:
less_than
{
def
to_ast
LessThan
.
new
(
left
.
to_ast
,
right
.
to_ast
)
end
}
/
multiply
end
rule
multiply
left
:
term
' * '
right
:
multiply
{
def
to_ast
Multiply
.
new
(
left
.
to_ast
,
right
.
to_ast
)
end
}
/
term
end
rule
term
number
/
variable
end
rule
number
[0-9]
+
{
def
to_ast
Number
.
new
(
text_value
.
to_i
)
end
}
end
rule
variable
[a-z]
+
{
def
to_ast
Variable
.
new
(
text_value
.
to_sym
)
end
}
end
end
This language looks a little like Ruby, but the similarity is only
superficial; grammars are written in the special Treetop language. Therule
keyword introduces a new rule for
parsing a particular kind of syntax, and the expressions inside each rule
describe the structure of the strings it will recognize. Rules can
recursively call other rules—thewhile
rule calls theexpression
andstatement
rules, for instance—and parsing begins
at the first rule, which isstatement
in this grammar.
The order in which the expression-syntax rules call each other
reflects the precedence of
Simple
’s
operators. Theexpression
rule callsless_than
, which then immediately callsmultiply
to give it a chance to match
the*
operator somewhere in the string
beforeless_than
gets a chance to match
the lower-precedence<
operator.
This makes sure that'1 * 2 < 3'
is
parsed as «(1 * 2) < 3
» and not
«1 * (2 < 3)
».
To keep things simple, this grammar makes no attempt to constrain
what kinds of expression can appear inside other expressions, which
means the parser will accept some programs that are obviously
wrong.
For example, we have two rules for binary expressions—less_than
andmultiply
—but the only reason for having
separate rules is to enforce operator precedence, so each rule only
requires that a higher precedence rule matches its left operand and a
same-or-higher-precedence one matches its right. This creates the
situation where a string like'1 < 2 <
will be parsed successfully, even though the semantics of
3'
Simple
won’t be able to give the
resulting expression a meaning.
Some of these problems can be resolved by tweaking the grammar,
but there will always be other incorrect cases that the parser can’t
spot. We’ll separate the two concerns by keeping the parser as liberal
as possible and using a different technique to detect invalid programs
in
Chapter 9
.
Most of the rules in the grammar are annotated with Ruby code inside curly brackets. In
each case, this code defines a method called#to_ast
, which
will be available on the corresponding syntax objects built by Treetop when it parses a
Simple
program.
If we save this grammar into a file called
simple.treetop
, we can
load it with Treetop to generate aSimpleParser
class. This
parser allows us to turn a string of
Simple
source code into
a representation built out of Treetop’sSyntaxNode
objects:
>>
require
'treetop'
=> true
>>
Treetop
.
load
(
'simple'
)
=> SimpleParser
>>
parse_tree
=
SimpleParser
.
new
.
parse
(
'while (x < 5) { x = x * 3 }'
)
=> SyntaxNode+While1+While0 offset=0, "…5) { x = x * 3 }" (to_ast,condition,body):
SyntaxNode offset=0, "while ("
SyntaxNode+LessThan1+LessThan0 offset=7, "x < 5" (to_ast,left,right):
SyntaxNode+Variable0 offset=7, "x" (to_ast):
SyntaxNode offset=7, "x"
SyntaxNode offset=8, " < "
SyntaxNode+Number0 offset=11, "5" (to_ast):
SyntaxNode offset=11, "5"
SyntaxNode offset=12, ") { "
SyntaxNode+Assign1+Assign0 offset=16, "x = x * 3" (to_ast,name,expression):
SyntaxNode offset=16, "x":
SyntaxNode offset=16, "x"
SyntaxNode offset=17, " = "
SyntaxNode+Multiply1+Multiply0 offset=20, "x * 3" (to_ast,left,right):
SyntaxNode+Variable0 offset=20, "x" (to_ast):
SyntaxNode offset=20, "x"
SyntaxNode offset=21, " * "
SyntaxNode+Number0 offset=24, "3" (to_ast):
SyntaxNode offset=24, "3"
SyntaxNode offset=25, " }"
ThisSyntaxNode
structure is a
concrete syntax tree
: it’s designed specifically
for manipulation by the Treetop parser and contains a lot of extraneous
information about how its nodes are related to the raw source code that
produced them. Here’s what the
Treetop
documentation
has to say about it:
Please don’t try to walk down the syntax tree yourself, and please
don’t use the tree as your own convenient data structure. It contains
many more nodes than your application needs, often even more than one
for every character of input.Instead, add methods to the root rule that return the information
you require in a sensible form. Each rule can call its sub-rules, and
this method of walking the syntax tree is much preferable to attempting
to walk it from the outside.
And that’s what we’ve done. Rather than manipulate this messy tree
directly, we’ve used annotations in the grammar to define a#to_ast
method on each of its nodes. If we call
that method on the root node, it’ll build an abstract syntax tree made
from
Simple
syntax objects:
>>
statement
=
parse_tree
.
to_ast
=> «while (x < 5) { x = x * 3 }»
So we’ve automatically converted source code to an abstract syntax
tree, and now we can use that tree to explore the meaning of the program
in the usual ways:
>>
statement
.
evaluate
({
x
:
Number
.
new
(
1
)
})
=> {:x=>«9»}
>>
statement
.
to_ruby
=> "-> e { while (-> e { (-> e { e[:x] }).call(e) < (-> e { 5 }).call(e) }).call(e);
e = (-> e { e.merge({ :x => (-> e { (-> e { e[:x] }).call(e) * (-> e { 3 }).call(e)
}).call(e) }) }).call(e); end; e }"
Another drawback of this parser, and of Treetop in general, is that it generates a
right-associative
concrete syntax tree. This means that the string'1 * 2 * 3 * 4'
is parsed as if it had been written'1 * (2 * (3 * 4))'
:
>>
expression
=
SimpleParser
.
new
.
parse
(
'1 * 2 * 3 * 4'
,
root
:
:expression
)
.
to_ast
=> «1 * 2 * 3 * 4»
>>
expression
.
left
=> «1»
>>
expression
.
right
=> «2 * 3 * 4»
But multiplication is conventionally
left-associative
: when we
write'1 * 2 * 3 * 4'
we actually mean'((1 * 2) * 3) * 4'
, with the numbers grouped together starting
at the lefthand end of the expression, not the right. That doesn’t matter much for
multiplication—both ways produce the same result when evaluated—but for operations like
subtraction and division, it creates a problem, because «((1 - 2) -
» does
3) - 4
not
evaluate to the same value as «1 - (2 - (3 - 4))
».
To fix this, we’d have to make the rules and#to_ast
implementations more complicated. See
Parsing
for a
Treetop grammar that builds a left-associative AST.