CS 330: Lecture 7 – Lexing, Really
Dear students,
Okay, with regexes and the notion of state machines tucked away in our brains, we’re now on a quest to design our own programming language. Usually you’re supposed to start this kind of learning by writing a calculator. I couldn’t bring myself to do that. Instead we’re going to design a language for drawing art based on primitive shapes.
One shouldn’t design a language until one has already programmed in it. We’ll look at a few examples and try to agree upon a syntax. Then we’ll start to tackle interpreting programs written in this language.
The lexer’s job is to rip our program into a thousand tiny pieces. Meaningless pieces. But not individual characters. That’s too fine-grained. We want the tokens. What are tokens? Literals, identifiers, keywords, operators, and separators.
Consider this chunk of code:
if i % 2 == 0
	print 'He loves me.'
else
	print 'I love him.'
end
The code looks like this from a lexer’s point of view:
if i % 2 == 0\n\tprint 'He loves me.'\nelse\n\tprint 'I love him.'\nend
The lexer’s job is to run through and produce a collection like this:
IF if
IDENTIFIER i
OPERATOR %
INTEGER 2
OPERATOR ==
INTEGER 0
IDENTIFIER print
STRING 'He loves me.'
ELSE else
IDENTIFIER print
STRING 'I love him.'
END end
We’ll draw a DFA for capturing tokens in our language. It may or may not look something like this:
You’re probably thinking, “Yes, neat, a picture. But what about the code?” Well, it turns out that we can model our code around this DFA, and it’ll feel like we’re doing it right.
We’ll build our interpreter in JavaScript, because it is a language that we cannot escape. We start with a function lex, which turns a source string into an array of tokens:
function lex(source) {
...
}
Clients of lex will use it like so:
var source = document.getElementById('editor').value;
var tokens = lex(source);
tokens.forEach(function(token) {
console.log(token);
});
We’ll also create a Token class to represent the pieces of our program:
function Token(type, src) {
this.type = type;
this.src = src;
}
That first parameter represents what kind of a token we have: a string literal, a plus sign, a left parenthesis, and what have you. What should be the type of type? An enum, probably. Sadly, JavaScript doesn’t support enums. We’ll just create a set of strings that we vow to never change:
var INTEGER = 'INTEGER';
var IDENTIFIER = 'IDENTIFIER';
var EOF = 'EOF';
var LEFT_PARENTHESIS = 'LEFT_PARENTHESIS';
var RIGHT_PARENTHESIS = 'RIGHT_PARENTHESIS';
var PLUS = 'PLUS';
var ASTERISK = 'ASTERISK';
var EQUALS = 'EQUALS';
var MINUS = 'MINUS';
var FORWARD_SLASH = 'FORWARD_SLASH';
var RECTANGLE = 'RECTANGLE';
var CIRCLE = 'CIRCLE';
Okay, now we get to the heart of our algorithm. It will model the machinery of our DFA. This will be a little different than our DFAs from the other day, however, because our goal here is to gobble up many tokens—not just one. So, once it lands in an accepting state (and is unable to reach another by processing more characters), it will emit the token that it has seen so far. Then it’ll start back over in the starting state with the remaining characters in our source code.
In pseudocode we compose our algorithm as follows:
i = 0
while not all characters are consumed
  switch on character i
    case (
      consume
      emit LEFT_PARENTHESIS token
    case )
      consume
      emit RIGHT_PARENTHESIS token
    case +
      consume
      emit PLUS token
    case letter
      call identifier
    case number
      call integer
    case -
      call minus
    ...
emit EOF token
To manage complexity, any time we have a “wild” state, one with multiple transitions in or out, we’ll create a helper function that corresponds to that state.
Let’s tackle the implementation in this order:
- Write a has method that asserts the next character matches a regular expression
- Write a consume function that advances one character and adds it to the token that’s currently being matched
- Write an emitToken function that appends a token to our list
- Handle single-character tokens
- Handle integers
- Handle identifiers
- Handle whitespace
- Handle negative numbers and subtraction
- Handle comments
Here’s what we’ll use for has:
function has(regex) {
  return source.charAt(iToRead).match(regex);
}
We’ll need a bookmark into our source code pointing to the first unconsumed character. We can declare it as a variable right inside lex:
var iToRead = 0;
The consume function will throw the character we’re looking at onto the token we’ve assembled so far and push us on to the next:
function consume() {
  tokenSoFar += source.charAt(iToRead);
  ++iToRead;
}
We’ll need to track the token being built up so far:
var tokenSoFar = '';
Once we reach an accepting state, we’ll need to package up the symbols into a token:
var tokens = [];
function emitToken(type) {
tokens.push(new Token(type, tokenSoFar));
tokenSoFar = ''; // reset the DFA
}
With these helpers in place, we can start to write lex to handle the single-character tokens:
while (iToRead < source.length) {
if (has(/\+/)) {
consume();
emitToken(PLUS);
} else if (has(/\*/)) {
consume();
emitToken(ASTERISK);
} else if (has(/\(/)) {
consume();
emitToken(LEFT_PARENTHESIS);
} else if (has(/\)/)) {
consume();
emitToken(RIGHT_PARENTHESIS);
} else if (has(/=/)) {
consume();
emitToken(EQUALS);
} else {
throw 'You fell into the tomb of the unknown token: ' + source.charAt(iToRead);
}
}
return tokens;
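Before wiring this up to a page, we can sanity-check the loop headlessly (in node, say). Here is a self-contained sketch of everything so far, with a tiny driver at the bottom; the helper names mirror ours, and the '(+)' input is just an example of mine:

```javascript
var PLUS = 'PLUS';
var LEFT_PARENTHESIS = 'LEFT_PARENTHESIS';
var RIGHT_PARENTHESIS = 'RIGHT_PARENTHESIS';

function Token(type, src) {
  this.type = type;
  this.src = src;
}

function lex(source) {
  var iToRead = 0;
  var tokenSoFar = '';
  var tokens = [];

  // True if the next unconsumed character matches the regex.
  function has(regex) {
    return iToRead < source.length && source.charAt(iToRead).match(regex);
  }

  // Move the current character onto the token being built.
  function consume() {
    tokenSoFar += source.charAt(iToRead);
    ++iToRead;
  }

  // Package up the consumed characters and reset the DFA.
  function emitToken(type) {
    tokens.push(new Token(type, tokenSoFar));
    tokenSoFar = '';
  }

  while (iToRead < source.length) {
    if (has(/\+/)) {
      consume();
      emitToken(PLUS);
    } else if (has(/\(/)) {
      consume();
      emitToken(LEFT_PARENTHESIS);
    } else if (has(/\)/)) {
      consume();
      emitToken(RIGHT_PARENTHESIS);
    } else {
      throw 'You fell into the tomb of the unknown token: ' + source.charAt(iToRead);
    }
  }

  return tokens;
}

var tokens = lex('(+)');
// Three tokens: LEFT_PARENTHESIS, PLUS, RIGHT_PARENTHESIS.
```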
We can test this with a little HTML:
<!DOCTYPE html>
<html>
<head>
<title>My Little Language</title>
<script src="lexer.js"></script>
</head>
<body>
<textarea id="editor" rows="10" cols="50" style="font-size: 24pt;"></textarea><br>
</body>
</html>
And a listener defined in JavaScript:
var editor = document.getElementById('editor');
editor.onchange = function() {
var tokens = lex(editor.value);
tokens.forEach(function(token) { console.log(token); });
}
Does it work? I hope so!
Now let’s augment our if-ladder to handle integer literals:
else if (has(/\d/)) {
integer();
}
And we’ll add a function for this more complex state:
function integer() {
while (has(/\d/)) {
consume();
}
emitToken(INTEGER);
};
After writing this, we should probably have has do a bounds-check:
function has(regex) {
return iToRead >= 0 && iToRead < source.length && source.charAt(iToRead).match(regex);
}
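The greedy loop inside integer is the “maximal munch” rule in miniature: keep consuming while the next character keeps us in the accepting state. Here’s just that behavior stripped down to a standalone sketch (scanDigits is an illustrative name of mine, not part of our lexer):

```javascript
// Consume the longest run of digits starting at index i, maximal-munch style.
// Returns the matched lexeme and the index where lexing should resume.
function scanDigits(source, i) {
  var lexeme = '';
  while (i < source.length && source.charAt(i).match(/\d/)) {
    lexeme += source.charAt(i);
    ++i;
  }
  return { lexeme: lexeme, next: i };
}

var result = scanDigits('123+4', 0);
// result.lexeme is '123'; result.next is 3, the index of the '+'.
```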
Let’s handle identifiers similarly:
else if (has(/[A-Za-z]/)) {
identifier();
}
However, what constitutes a legal identifier in most programming languages? Can we name a variable for? Or if? No. These are keywords. Instead of tweaking our DFA to have special states for keywords, it’s common practice to special-case them after you’ve lexed out an identifier:
function identifier() {
while (has(/\w/)) {
consume();
}
if (tokenSoFar == 'rectangle') {
emitToken(RECTANGLE);
} else if (tokenSoFar == 'circle') {
emitToken(CIRCLE);
} else {
emitToken(IDENTIFIER);
}
};
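The check at the end of identifier boils down to a small classification step. Pulled out on its own (classifyWord is an illustrative name, not from our lexer), it looks like this:

```javascript
var RECTANGLE = 'RECTANGLE';
var CIRCLE = 'CIRCLE';
var IDENTIFIER = 'IDENTIFIER';

// Decide whether a freshly lexed word is a keyword or an ordinary identifier.
function classifyWord(word) {
  if (word === 'rectangle') {
    return RECTANGLE;
  } else if (word === 'circle') {
    return CIRCLE;
  } else {
    return IDENTIFIER;
  }
}

// classifyWord('circle') is CIRCLE, but classifyWord('circles') is IDENTIFIER,
// because keyword matching is exact, not prefix-based.
```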
We should be able to lex integer literals and identifiers now. Let’s try a program that has both!
56 foo
It fails on the whitespace. Our lexer must be ready for that too! But hey, whitespace isn’t significant in many languages, so there’s really no reason to emit a token for it. We just consume it but restart the DFA:
else if (has(/\s/)) {
consume();
tokenSoFar = '';
}
How about -? This operator is the bane of lexing. Sometimes it’s a unary operator, negating its operand. Other times it’s a binary operator. What if it appears in front of an integer literal?
-17
Should we include it in the literal? Probably not. Consider this situation:
18 -17
The programmer almost certainly meant to subtract 17 from 18 instead of list two integer literals next to each other. Because of situations like this, we will never include - as part of an integer literal. We’ll emit it as its own token:
else if (has(/-/)) {
consume();
emitToken(MINUS);
}
The parser will impose binary or unary operator semantics on it.
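Under that rule, a lexer that knows only digits, minus, and spaces turns 18 -17 into three tokens, never two. A self-contained sketch (helper names mirror ours; the driver input is mine):

```javascript
var INTEGER = 'INTEGER';
var MINUS = 'MINUS';

function lex(source) {
  var iToRead = 0;
  var tokenSoFar = '';
  var tokens = [];

  function has(regex) {
    return iToRead < source.length && source.charAt(iToRead).match(regex);
  }

  function consume() {
    tokenSoFar += source.charAt(iToRead);
    ++iToRead;
  }

  function emitToken(type) {
    tokens.push({ type: type, src: tokenSoFar });
    tokenSoFar = '';
  }

  while (iToRead < source.length) {
    if (has(/\d/)) {
      // Maximal munch on digits; the minus sign is never pulled in.
      while (has(/\d/)) {
        consume();
      }
      emitToken(INTEGER);
    } else if (has(/-/)) {
      consume();
      emitToken(MINUS);
    } else if (has(/ /)) {
      consume();
      tokenSoFar = '';
    } else {
      throw 'unknown character: ' + source.charAt(iToRead);
    }
  }

  return tokens;
}

var tokens = lex('18 -17');
// Three tokens: INTEGER 18, MINUS, INTEGER 17.
```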
How about division? Well, if we see two slashes, maybe we’re looking at a comment? We’ll need to solve this similarly. First, our condition in lex:
else if (has(/\//)) {
forwardSlash();
}
The helper function will peek ahead to distinguish between the mathematical operator and a comment. Comments, like whitespace, we can discard, as they will have no impact on the interpretation of our program.
function forwardSlash() {
consume();
if (has(/\//)) {
consume();
while (has(/[^\n]/)) {
consume();
}
tokenSoFar = '';
} else {
emitToken(FORWARD_SLASH);
}
}
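The comment-skipping half of forwardSlash can be isolated too: once we know we’re in a comment, we just scan forward to the end of the line. A standalone sketch (skipLineComment is an illustrative name of mine):

```javascript
// Given a source string and the index of the character right after '//',
// return the index of the newline (or end of string) that ends the comment.
function skipLineComment(source, i) {
  while (i < source.length && source.charAt(i) !== '\n') {
    ++i;
  }
  return i;
}

var source = '6 // half\nx';
var after = skipLineComment(source, 4);
// after is 9, the index of the '\n'; lexing resumes there, and the
// comment's characters never become a token.
```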
One last thing. Once we’re out of characters, let’s emit a fake token to mark the end of the tokens:
while (iToRead < source.length) {
...
}
emitToken(EOF);
return tokens;
That rounds out our lexer, the first stage of getting our computer to understand us. We have broken the program up into byte-size pieces. Next time we’ll look at parsing, which will figure out what we’re actually trying to say with these utterances. If we have time today, we’ll start to consider the grammar of our language.
Here’s your TODO:
- Consider attending the next Chippewa Valley Developer’s Group meeting on Tuesday evening. The talk is on building an emulator for gaming hardware from the 1970s. They meet at the Lazy Monk Brewery. Food is free, but RSVP. (Drinks are not free.) For Wednesday, write down a few sentences about the talk and the community on a quarter sheet for an extra credit participation point.
See you next time!
P.S. It’s time for a haiku!
To solve a problem
Break it up and sort the steps
Easy to easy
P.P.S. Here’s the code we wrote together:
lexer.js
var IDENTIFIER = 'IDENTIFIER';
var PLUS = 'PLUS';
var LEFT_PARENTHESIS = 'LEFT_PARENTHESIS';
var RIGHT_PARENTHESIS = 'RIGHT_PARENTHESIS';
var DIVIDE = 'DIVIDE';
var MINUS = 'MINUS';
var ASTERISK = 'ASTERISK';
var COMMA = 'COMMA';
var MOD = 'MOD';
var INTEGER = 'INTEGER';

function lex(source) {
  var tokens = [];
  var tokenSoFar = '';
  var i = 0;

  function has(regex) {
    return source.charAt(i).match(regex);
  }

  function devour() {
    tokenSoFar += source.charAt(i);
    i += 1;
  }

  function emit(type) {
    tokens.push({type: type, source: tokenSoFar});
    tokenSoFar = '';
  }

  while (i < source.length) {
    if (has(/\+/)) {
      devour();
      emit(PLUS);
    } else if (has(/%/)) {
      devour();
      emit(MOD);
    } else if (has(/-/)) {
      devour();
      emit(MINUS);
    } else if (has(/\(/)) {
      devour();
      emit(LEFT_PARENTHESIS);
    } else if (has(/\)/)) {
      devour();
      emit(RIGHT_PARENTHESIS);
    } else if (has(/\*/)) {
      devour();
      emit(ASTERISK);
    } else if (has(/\//)) {
      devour();
      if (has(/\//)) {
        while (has(/[^\n]/)) {
          devour();
        }
        tokenSoFar = '';
      } else {
        emit(DIVIDE);
      }
    } else if (has(/\d/)) {
      while (has(/\d/)) {
        devour();
      }
      emit(INTEGER);
    } else if (has(/[a-zA-Z]/)) {
      while (has(/[a-zA-Z0-9]/)) {
        devour();
      }
      emit(IDENTIFIER);
    } else if (has(/\s/)) {
      devour();
      tokenSoFar = '';
    } else {
      throw 'ack! [' + tokenSoFar + ']';
    }
  }

  return tokens;
}
index.html
<!DOCTYPE html>
<html>
<head>
<title>...</title>
<script src="lexer.js"></script>
</head>
<body>
<textarea id="ide" rows="8"></textarea>
<input type="button" id="run" value="Run">
<script>
var ide = document.getElementById('ide');
var run = document.getElementById('run');
run.onclick = function() {
  var source = ide.value;
  var tokens = lex(source);
  console.log(tokens);
};
</script>
</body>
</html>