CS 430: Lecture 3 - Radish Interpreter
Dear students:
Given how lab went last week, I think we need more practice with lexing, parsing, and grammars. So today we're going to write out our own little calculator language that can be used to add numbers together and shift them left or right—but there's a twist. The numbers can be in different bases.
Radish
We'll call the language Radish since it derives from the Latin word radix. The word itself means root. Here are some sample utterances in Radish:
56
56_16
1001001_2 > 10
1001_2 + 1 > 2
32 << 1
(100 >> 2) + 3
56 56_16 1001001_2 > 10 1001_2 + 1 > 2 32 << 1 (100 >> 2) + 3
The language supports three operators: <<
, >>
, and +
. It supports integer literals. By default these are in base 10, but an explicit alternative base can be specified with a _
suffix. The final result can be displayed in a particular base using the >
output designator.
Lexing
We have these token types in Radish:
- integer literals with optional base
-
>>
-
<<
-
+
-
>
-
(
-
)
Only one of these is very interesting. We'll draw a state machine for it, and then write a lexer.
Grammar
The grammar is a useful tool that will help us think about our language. It determines the precedence and associativity of our operators. It does this by pushing higher-precedence operations farther down the parser. To make that happen we'll need the operators to be broken up into levels. The lowest precedence level will be represented by the non-terminal expr0
.
A “program” in Radish will consist of a single statement that has one of two forms:
statement -> expr0 > integer_literal
| expr0
statement -> expr0 > integer_literal | expr0
The left- and right-shift operators are the lowest-precedence operators, so they are produced by expr0
. We have to be careful with these productions, because it's easy to make a flawed grammar. We have the following constrains:
-
The grammar should support recursive formations like
512 >> 1 >> 1
. That means the productions should include a self-reference somewhere. -
If we have a production like
expr0 >> expr0
, we allow recursion, but we also introduce ambiguity. We could parse512 >> 1 >> 1
in two different ways. -
We want to limit the recursion to one operand. Because we want the leftmost occurrence of
>>
to have higher precedence than any>>
to its right, we make the left operand recursive. That leads to the productionexpr0 >> expr1
.
All told, we form these productions:
expr0 -> expr0 << expr1
| expr0 >> expr1
| expr1
expr0 -> expr0 << expr1 | expr0 >> expr1 | expr1
The next precedence level follows a similar structure:
expr1 -> expr1 + operand
| operand
expr1 -> expr1 + operand | operand
We've only got these two precedence levels. Between the operators we have the operands:
operand -> INTEGER
| ( expr0 )
operand -> INTEGER | ( expr0 )
Writing the parser is a lot easier if we have first sketched a grammar.
Abstract Syntax Tree
But before we write a parser, we'll write these classes to represent our program structures:
-
ExpressionInteger
-
ExpressionAdd
-
ExpressionLeftShift
-
ExpressionRightShift
-
Statement
Every program structure will have an evaluate
method that we can use to execute the program. Using these structures, we could create a program without even writing Radish code:
e = ExpressionAdd.new(ExpressionInteger.new(10), ExpressionInteger.new(5))
e = ExpressionAdd.new(ExpressionInteger.new(10), ExpressionInteger.new(5))
But we don't want to assemble these structures manually. Assembling them is the job of the parser.
Parsing
The parser takes the tokens from the lexer and assembles an abstract syntax tree. A useful strategy is to turn the non-terminals of the grammar into the parser's methods:
class Parser
def initialize
end
def statement
end
def expr0
end
def expr1
end
def operand
end
end
class Parser def initialize end def statement end def expr0 end def expr1 end def operand end end
We'll implement these methods from the bottom up. The productions will guide us. Whenever we have multiple productions, we'll need some sort of conditional statement that arbitrates between the choices by peeking at the next tokens.
The trickiest productions to implement will be the ones that include self-reference, like this one:
expr1 -> expr1 + operand
| operand
expr1 -> expr1 + operand | operand
This would be a bad way to implement the expr1
method:
def expr1
left = expr1
if has(:plus)
consume
right = operand
end
...
end
def expr1 left = expr1 if has(:plus) consume right = operand end ... end
The first line leads to infinite recursion. A better way is to turn the recursion into iteration. Ultimately, the recursion has to bottom out at an operand, so we call operand
first and then collect up any +
operations into a new structure:
def expr1
left = operand
while has(:plus)
consume
right = operand
left = ExpressionAdd.new(left, right)
end
left
end
def expr1 left = operand while has(:plus) consume right = operand left = ExpressionAdd.new(left, right) end left end
Something similar happens with the shift operators, but with an extra conditional statement to decide which AST node to instantiate.
TODO
Here's your list of things to do before we meet next:
See you next time.
Sincerely,
P.S. It's time for a haiku!