CS 430: Lecture 3 - Radish Interpreter

Dear students:

Given how lab went last week, I think we need more practice with lexing, parsing, and grammars. So today we're going to write out our own little calculator language that can be used to add numbers together and shift them left or right—but there's a twist. The numbers can be in different bases.

Radish

We'll call the language Radish since it derives from the Latin word radix. The word itself means root. Here are some sample utterances in Radish:

56
56_16
1001001_2 > 10
1001_2 + 1 > 2
32 << 1
(100 >> 2) + 3

56
56_16
1001001_2 > 10
1001_2 + 1 > 2
32 << 1
(100 >> 2) + 3

The language supports three operators: <<, >>, and +. It supports integer literals. By default these are in base 10, but an explicit alternative base can be specified with a _ suffix. The final result can be displayed in a particular base using the > output designator.

Lexing

We have these token types in Radish:

integer literals with optional base
>>
<<
+
>
(
)

Only one of these is very interesting. We'll draw a state machine for it, and then write a lexer.

Grammar

The grammar is a useful tool that will help us think about our language. It determines the precedence and associativity of our operators. It does this by pushing higher-precedence operations farther down the parser. To make that happen we'll need the operators to be broken up into levels. The lowest precedence level will be represented by the non-terminal expr0.

A “program” in Radish will consist of a single statement that has one of two forms:

statement -> expr0 > integer_literal
           | expr0

statement -> expr0 > integer_literal
           | expr0

The left- and right-shift operators are the lowest-precedence operators, so they are produced by expr0. We have to be careful with these productions, because it's easy to make a flawed grammar. We have the following constrains:

The grammar should support recursive formations like 512 >> 1 >> 1. That means the productions should include a self-reference somewhere.
If we have a production like expr0 >> expr0, we allow recursion, but we also introduce ambiguity. We could parse 512 >> 1 >> 1 in two different ways.
We want to limit the recursion to one operand. Because we want the leftmost occurrence of >> to have higher precedence than any >> to its right, we make the left operand recursive. That leads to the production expr0 >> expr1.

All told, we form these productions:

expr0 -> expr0 << expr1
       | expr0 >> expr1
       | expr1

expr0 -> expr0 << expr1
       | expr0 >> expr1
       | expr1

The next precedence level follows a similar structure:

expr1 -> expr1 + operand
       | operand

expr1 -> expr1 + operand
       | operand

We've only got these two precedence levels. Between the operators we have the operands:

operand -> INTEGER
         | ( expr0 )

operand -> INTEGER
         | ( expr0 )

Writing the parser is a lot easier if we have first sketched a grammar.

Abstract Syntax Tree

But before we write a parser, we'll write these classes to represent our program structures:

ExpressionInteger
ExpressionAdd
ExpressionLeftShift
ExpressionRightShift
Statement

Every program structure will have an evaluate method that we can use to execute the program. Using these structures, we could create a program without even writing Radish code:

e = ExpressionAdd.new(ExpressionInteger.new(10), ExpressionInteger.new(5))

e = ExpressionAdd.new(ExpressionInteger.new(10), ExpressionInteger.new(5))

But we don't want to assemble these structures manually. Assembling them is the job of the parser.

Parsing

The parser takes the tokens from the lexer and assembles an abstract syntax tree. A useful strategy is to turn the non-terminals of the grammar into the parser's methods:

class Parser
  def initialize
  end

  def statement
  end

  def expr0
  end

  def expr1
  end

  def operand
  end
end

class Parser
  def initialize
  end

  def statement
  end

  def expr0
  end

  def expr1
  end

  def operand
  end
end

We'll implement these methods from the bottom up. The productions will guide us. Whenever we have multiple productions, we'll need some sort of conditional statement that arbitrates between the choices by peeking at the next tokens.

The trickiest productions to implement will be the ones that include self-reference, like this one:

expr1 -> expr1 + operand
       | operand

expr1 -> expr1 + operand
       | operand

This would be a bad way to implement the expr1 method:

def expr1
  left = expr1
  if has(:plus)
    consume
    right = operand
  end
  ...
end

def expr1
  left = expr1
  if has(:plus)
    consume
    right = operand
  end
  ...
end

The first line leads to infinite recursion. A better way is to turn the recursion into iteration. Ultimately, the recursion has to bottom out at an operand, so we call operand first and then collect up any + operations into a new structure:

def expr1
  left = operand
  while has(:plus)
    consume
    right = operand
    left = ExpressionAdd.new(left, right)
  end
  left
end

def expr1
  left = operand
  while has(:plus)
    consume
    right = operand
    left = ExpressionAdd.new(left, right)
  end
  left
end

Something similar happens with the shift operators, but with an extra conditional statement to decide which AST node to instantiate.

TODO

Here's your list of things to do before we meet next:

I'm recruiting one volunteer from each section to help organize a program challenge for the first midterm. If you are interested, talk to me. You will get full credit on the exam for your participation.

Make sure you have access to your private repository on GitHub. Send me a DM on Discord if you don't see it.

Make progress on project 1. You are expected to spend 3–5 hours per week on your project, and you must submit a progress report to a private channel that I will make for you on Discord. Your channel will include a few of your classmates, and your job is to cheer each other on and keep each other accountable.

See you next time.

Sincerely,

P.S. It's time for a haiku!

I made a language We didn't need a new one Nor the ones we had