CS 330: Lecture 6 – Lexing
Dear students,
Okay, with regexes and the notion of state machines tucked away in our brains, we’re now on a quest to design our own programming language. Usually you’re supposed to start this kind of learning by writing a calculator. I couldn’t bring us to do that. Instead we’re going to design a language for drawing art based on primitive shapes.
One shouldn’t design a language until one has already programmed in it. We’ll look at a few examples and try to agree upon a syntax. Then we’ll start to tackle interpreting programs written in this language. We break interpreting up into two steps:
- Lexing, in which we decompose the source code into its “atoms.” The input is the source code. The output is a flat sequence of tokens.
- Parsing, in which we construct a model of our program. The input is the stream of tokens. The output is a meaningful representation of the program. If we store this representation of our program to disk, we have a compiler. If we execute it immediately, we have an interpreter.
But today, we don’t deal with meaning. We just want to rip our program into a thousand tiny pieces. Meaningless pieces. But not individual characters. That’s too fine-grained. We want the tokens. What are tokens? Literals, identifiers, keywords, operators, and separators.
Consider this chunk of code:
if i % 2 == 0
print 'He loves me.'
else
print 'I love him.'
end
The code looks like this from a lexer’s point of view:
if i % 2 == 0\n\tprint 'He loves me.'\nelse\n\tprint 'I love him.'\nend
The lexer’s job is to run through and produce a collection like this:
IF if IDENTIFIER i OPERATOR % INTEGER 2 OPERATOR == IDENTIFIER print STRING 'He loves me.' ELSE else IDENTIFIER print STRING 'I love him.' END end
Here’s your TODO for next time:
- Watch Stanford professor Alex Aiken’s The Economy of Programming Languages (20 minutes).
- On a quarter sheet, make a prediction about you think the language landscape will look like in ten years. Use reason instead of arbitrary bias, and incorporate the ideas that Aiken discusses.
![](http://www.twodee.org/images/signature_first.png)
P.S. Here’s the code we wrote together:
dt.rb
#!/usr/bin/env ruby
def jjeach list
i = 0
while i < list.size
yield list[i]
i += 1
end
end
list = ['andy', 'crutches', 'why']
jjeach(list) do |item|
puts item
end
# s = 'foo54-65capitalizm'
# s.gsub!(/(\d+)-(\d+)/) do |first, second|
# puts "first: #{first}"
# puts "second: #{second}"
# end
numbers
255 2556 100 + 89 * -32 1 2 3 4 5 6 7 20020 dog blue42
binbin.rb
#!/usr/bin/env ruby
src = File.read('numbers')
src.gsub!(/(?<!-)\b(\d|\d\d|[01]\d\d|2[0-4]\d|25[0-5])\b/) do
$1.to_i.to_s(2) + 'b'
end
puts src