teaching machines

CS 330: Lecture 7 – Lexing, Really

February 12, 2018. Filed under cs330, lectures, spring 2018.

Dear students,

Okay, with regexes and the notion of state machines tucked away in our brains, we’re now on a quest to design our own programming language. Usually you’re supposed to start this kind of learning by writing a calculator. I couldn’t bring us to do that. Instead we’re going to design a language for drawing art based on primitive shapes.

One shouldn’t design a language until one has already programmed in it. We’ll look at a few examples and try to agree upon a syntax. Then we’ll start to tackle interpreting programs written in this language.
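
For instance (and this is purely hypothetical; we haven't agreed on a syntax yet), a program in our shape language might look something like this:

rectangle 10 20 100 50
circle 60 45 30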

The lexer’s job is to rip our program into a thousand tiny pieces. Meaningless pieces. But not individual characters. That’s too fine-grained. We want the tokens. What are tokens? Literals, identifiers, keywords, operators, and separators.

Consider this chunk of code:

if i % 2 == 0
  print 'He loves me.'
else
  print 'I love him.'
end

The code looks like this from a lexer’s point of view:

if i % 2 == 0\n\tprint 'He loves me.'\nelse\n\tprint 'I love him.'\nend

The lexer’s job is to run through and produce a collection like this:

IF if
IDENTIFIER i
OPERATOR %
INTEGER 2
OPERATOR ==
INTEGER 0
IDENTIFIER print
STRING 'He loves me.'
ELSE else
IDENTIFIER print
STRING 'I love him.'
END end

We’ll draw a DFA for capturing tokens in our language. It may or may not look something like this:
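
(The diagram isn't reproduced here, but roughly speaking it has a start state with one-hop transitions for the single-character tokens like ( ) + * =, a state that loops on digits for integer literals, a state that loops on letters for identifiers, and a slash state that either emits a division operator or swallows a comment.)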

You’re probably thinking, “Yes, neat, a picture. But what about the code?” Well, it turns out that we can model our code around this DFA, and it’ll feel like we’re doing it right.

We’ll build our interpreter in JavaScript, because it is a language that we cannot escape. We start with a function lex, which turns a source string into an array of tokens:

function lex(source) {
  ...
}

Clients of lex will use it like so:

var source = document.getElementById('editor').value;
var tokens = lex(source);
tokens.forEach(function(token) {
  console.log(token);
});

We’ll also create a Token class to represent the pieces of our program:

function Token(type, src) {
  this.type = type;
  this.src = src; 
}

That first parameter represents what kind of a token we have. A string literal, a plus sign, a left parenthesis, and what have you. What should be the type of type? An enum, probably. Sadly, JavaScript doesn’t support enums. We’ll just create a set of strings that we vow to never change:

var INTEGER = 'INTEGER';
var IDENTIFIER = 'IDENTIFIER';
var EOF = 'EOF';
var LEFT_PARENTHESIS = 'LEFT_PARENTHESIS';
var RIGHT_PARENTHESIS = 'RIGHT_PARENTHESIS';
var PLUS = 'PLUS';
var ASTERISK = 'ASTERISK';
var EQUALS = 'EQUALS';
var MINUS = 'MINUS';
var FORWARD_SLASH = 'FORWARD_SLASH';
var RECTANGLE = 'RECTANGLE';
var CIRCLE = 'CIRCLE';
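
Just to see these pieces in action, constructing and inspecting a token might look like this (a throwaway example, not part of the lexer itself):

var t = new Token(INTEGER, '42');
console.log(t.type); // INTEGER
console.log(t.src);  // 42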

Okay, now we get to the heart of our algorithm. It will model the machinery of our DFA. This will be a little different than our DFAs from the other day, however, because our goal here is to gobble up many tokens—not just one. So, once it lands in an accepting state (and is unable to reach another by processing more characters), it will emit the token that it has seen so far. Then it’ll start back over in the starting state with the remaining characters in our source code.

In pseudocode we compose our algorithm as follows:

i = 0
while not all characters are consumed
  switch on character i
    case (
      consume
      emit LEFT_PARENTHESIS token
    case )
      consume
      emit RIGHT_PARENTHESIS token
    case +
      consume
      emit PLUS token
    case letter
      call identifier
    case number
      call integer
    case -
      call minus
    ...
emit EOF token

To manage complexity, any time we have a “wild” state, one with multiple transitions in or out, we’ll create a helper function that corresponds to that state.

Let’s tackle the implementation one piece at a time. We start with a little helper named has, which checks whether the next character matches a pattern:

function has(regex) {
  return source.charAt(iToRead).match(regex);
}

We’ll need a bookmark into our source code pointing to the first unconsumed character. We can declare it as a variable right inside lex:

var iToRead = 0;

The consume function will throw the character we’re looking at onto the token we’ve assembled so far and push us on to the next:

function consume() {
  tokenSoFar += source.charAt(iToRead);
  ++iToRead;
}

We’ll need to track the token being built up so far:

var tokenSoFar = '';

Once we reach an accepting state, we’ll need to package up the symbols into a token:

var tokens = [];

function emitToken(type) {
  tokens.push(new Token(type, tokenSoFar));
  tokenSoFar = ''; // reset the DFA
}

With these helpers in place, we can start to write lex to handle the single-character tokens:

while (iToRead < source.length) {
  if (has(/\+/)) {
    consume();
    emitToken(PLUS);
  } else if (has(/\*/)) {
    consume();
    emitToken(ASTERISK);
  } else if (has(/\)/)) {
    consume();
    emitToken(RIGHT_PARENTHESIS);
  } else if (has(/\(/)) {
    consume();
    emitToken(LEFT_PARENTHESIS);
  } else if (has(/=/)) {
    consume();
    emitToken(EQUALS);
  } else {
    throw 'You fell into the tomb of the unknown token: ' + source.charAt(iToRead);
  }
}

return tokens;

We can test this with a little HTML:

<!DOCTYPE html>
<html>
  <head>
    <title>My Little Language</title>
    <script src="lexer.js"></script>
  </head>
  <body>
    <textarea id="editor" rows="10" cols="50" style="font-size: 24pt;"></textarea><br>
  </body>
</html>

And a listener defined in JavaScript:

var editor = document.getElementById('editor');
editor.onchange = function() {
  var tokens = lex(editor.value);
  tokens.forEach(token => console.log(token));
}

Does it work? I hope so!
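
If all goes well, typing something like (+)= into the textarea and then clicking elsewhere (so the change event fires) should log four tokens, roughly along these lines, though the exact console formatting varies by browser:

Token {type: "LEFT_PARENTHESIS", src: "("}
Token {type: "PLUS", src: "+"}
Token {type: "RIGHT_PARENTHESIS", src: ")"}
Token {type: "EQUALS", src: "="}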

Now let’s augment our if-ladder to handle integer literals:

else if (has(/\d/)) {
  integer();
}

And we’ll add a function for this more complex state:

function integer() {
  while (has(/\d/)) {
    consume();
  }
  emitToken(INTEGER);
};

After writing this, we should probably have has do a bounds-check:

function has(regex) {
  return iToRead >= 0 && iToRead < source.length && source.charAt(iToRead).match(regex);
}

Let’s handle identifiers similarly:

else if (has(/[A-Za-z]/)) {
  identifier();
}

However, what constitutes a legal identifier in most programming languages? Can we name a variable for? Or if? No. These are keywords. Instead of tweaking our DFA to have special states for keywords, it’s common practice to special-case them after you’ve lexed out an identifier:

function identifier() {
  while (has(/\w/)) {
    consume();
  }

  if (tokenSoFar == 'rectangle') {
    emitToken(RECTANGLE);
  } else if (tokenSoFar == 'circle') {
    emitToken(CIRCLE);
  } else {
    emitToken(IDENTIFIER);
  }
};
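
For example, lexing just the word rectangle now yields a single token:

RECTANGLE rectangle

while a non-keyword like rect falls through to the general case:

IDENTIFIER rect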

We should be able to lex integer literals and identifiers now. Let’s try a program that has both!

56 foo

It fails on the whitespace. Our lexer must be ready for that too! But hey, whitespace isn’t significant in many languages, so there’s really no reason to emit a token for it. We just consume it and restart the DFA:

else if (has(/[ \t]/)) {
  consume();
  tokenSoFar = '';
}
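
With that in place, our 56 foo program lexes cleanly into:

INTEGER 56
IDENTIFIER foo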

How about -? This operator is the bane of lexing. Sometimes it’s a unary operator, negating its operand. Other times it’s a binary operator. What if it appears in front of an integer literal?

-17

Should we include it in the literal? Probably not. Consider this situation:

18 -17

The programmer almost certainly meant to subtract 17 from 18 rather than list two integer literals next to each other. Because of situations like this, we will never include - as part of an integer literal. We’ll emit it as its own token:

else if (has(/-/)) {
  consume();
  emitToken(MINUS);
}

The parser will impose binary or unary operator semantics on it.
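
So the earlier 18 -17 becomes three separate tokens:

INTEGER 18
MINUS -
INTEGER 17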

How about division? Well, if we see two slashes, maybe we’re looking at a comment? We’ll need to solve this similarly. First, our condition in lex:

else if (has(/\//)) {
  forwardSlash();
}

The helper function will peek ahead to distinguish between the mathematical operator and a comment. Comments, like whitespace, we can discard, as they will have no impact on the interpretation of our program.

function forwardSlash() {
  consume();
  if (has(/\//)) {
    consume();
    while (has(/[^\n]/)) {
      consume();
    }
    tokenSoFar = '';
  } else {
    emitToken(FORWARD_SLASH);
  }
}
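
With comments discarded, a line like circle 50 50 10 // center and radius lexes into just the shape command and its numbers:

CIRCLE circle
INTEGER 50
INTEGER 50
INTEGER 10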

One last thing. Once we’re out of characters, let’s emit a fake token to mark the end of the tokens:

while (iToRead < source.length) {
  ...
}
emitToken(EOF);

return tokens;

That rounds out our lexer, the first stage of getting our computer to understand us. We have broken the program up into byte-size pieces. Next time we’ll look at parsing, which will figure out what we’re actually trying to say with these utterances. If we have time today, we’ll start to consider the grammar of our language.

Here’s your TODO:

See you next time!

Sincerely,

P.S. It’s time for a haiku!

To solve a problem
Break it up and sort the steps
Easy to easy

P.P.S. Here’s the code we wrote together:

lexer.js

var IDENTIFIER = 'IDENTIFIER';
var PLUS = 'PLUS';
var LEFT_PARENTHESIS = 'LEFT_PARENTHESIS';
var RIGHT_PARENTHESIS = 'RIGHT_PARENTHESIS';
var DIVIDE = 'DIVIDE';
var MINUS = 'MINUS';
var ASTERISK = 'ASTERISK';
var COMMA = 'COMMA';
var MOD = 'MOD';
var INTEGER = 'INTEGER';

function lex(source) {
  var tokens = [];
  var tokenSoFar = '';
  var i = 0;

  function has(regex) {
    return source.charAt(i).match(regex);
  }

  function devour() {
    tokenSoFar += source.charAt(i);
    i += 1;
  }

  function emit(type) {
    tokens.push({type: type, source: tokenSoFar});
    tokenSoFar = '';
  }

  while (i < source.length) {
    if (has(/\+/)) {
      devour();
      emit(PLUS);
    } else if (has(/%/)) {
      devour();
      emit(MOD);
    } else if (has(/-/)) {
      devour();
      emit(MINUS);
    } else if (has(/\(/)) {
      devour();
      emit(LEFT_PARENTHESIS);
    } else if (has(/\)/)) {
      devour();
      emit(RIGHT_PARENTHESIS);
    } else if (has(/\*/)) {
      devour();
      emit(ASTERISK);
    } else if (has(/\//)) {
      devour();
      if (has(/\//)) {
        while (has(/[^\n]/)) {
          devour();
        }
        tokenSoFar = '';
      } else {
        emit(DIVIDE);
      }
    } else if (has(/\d/)) {
      while (has(/\d/)) {
        devour();
      }
      emit(INTEGER);
    } else if (has(/[a-zA-Z]/)) {
      while (has(/[a-zA-Z0-9]/)) {
        devour();
      }
      emit(IDENTIFIER);
    } else if (has(/\s/)) {
      devour();
      tokenSoFar = '';
    } else {
      throw 'ack! [' + tokenSoFar + ']';
    }
  }

  return tokens;
}

index.html

<!DOCTYPE html>
<html>
<head>
  <title>...</title>
  <script src="lexer.js"></script>
</head>
<body>
  <textarea id="ide" rows="8"></textarea>
  <input type="button" id="run" value="Run">
  
  <script>
var ide = document.getElementById('ide');
var run = document.getElementById('run');

run.onclick = function() {
  var source = ide.value;
  var tokens = lex(source);
  console.log(tokens);
};

  </script>
</body>
</html>