CS 330: Lecture 6 – Lexing
Dear students,
Okay, with regexes and the notion of state machines tucked away in our brains, we’re now on a quest to design our own programming language. Usually you’re supposed to start this kind of learning by writing a calculator. I couldn’t bring us to do that. Instead we’re going to design a language for drawing art based on primitive shapes.
One shouldn’t design a language until one has already programmed in it. We’ll look at a few examples and try to agree upon a syntax. Then we’ll start to tackle interpreting programs written in this language. We break interpreting up into two steps:
- Lexing, in which we decompose the source code into its “atoms.” The input is the source code. The output is a flat sequence of tokens.
- Parsing, in which we construct a model of our program. The input is the stream of tokens. The output is a meaningful representation of the program. If we store this representation of our program to disk, we have a compiler. If we execute it immediately, we have an interpreter.
But today, we don’t deal with meaning. We just want to rip our program into a thousand tiny pieces. Meaningless pieces. But not individual characters. That’s too fine-grained. We want the tokens. What are tokens? Literals, identifiers, keywords, operators, and separators.
Consider this chunk of code:
if i % 2 == 0
print 'He loves me.'
else
print 'I love him.'
end
The code looks like this from a lexer’s point of view:
if i % 2 == 0\n\tprint 'He loves me.'\nelse\n\tprint 'I love him.'\nend
The lexer’s job is to run through and produce a collection like this:
IF if IDENTIFIER i OPERATOR % INTEGER 2 OPERATOR == IDENTIFIER print STRING 'He loves me.' ELSE else IDENTIFIER print STRING 'I love him.' END end
Here’s your TODO for next time:
- Watch Stanford professor Alex Aiken’s The Economy of Programming Languages (20 minutes).
- On a quarter sheet, make a prediction about you think the language landscape will look like in ten years. Use reason instead of arbitrary bias, and incorporate the ideas that Aiken discusses.
P.S. Here’s the code we wrote together:
dt.rb
#!/usr/bin/env ruby
def jjeach list
i = 0
while i < list.size
yield list[i]
i += 1
end
end
list = ['andy', 'crutches', 'why']
jjeach(list) do |item|
puts item
end
# s = 'foo54-65capitalizm'
# s.gsub!(/(\d+)-(\d+)/) do |first, second|
# puts "first: #{first}"
# puts "second: #{second}"
# end
numbers
255 2556 100 + 89 * -32 1 2 3 4 5 6 7 20020 dog blue42
binbin.rb
#!/usr/bin/env ruby
src = File.read('numbers')
src.gsub!(/(?<!-)\b(\d|\d\d|[01]\d\d|2[0-4]\d|25[0-5])\b/) do
$1.to_i.to_s(2) + 'b'
end
puts src