teaching machines

CS 330 Lecture 33 – State Machines

April 26, 2017. Filed under cs330, lectures, spring 2017.

Dear students,

Okay, pruning out recursive calls whose results you’ve already computed is one way to tame recursion. Let’s look at another. We mentioned earlier that the piling up of stack frames is what gets recursion into trouble. Consider this definition of sum' in Haskell:

sum' [] = 0
sum' (first:rest) = first + sum' rest

When we say sum' [1..5], this sort of thing stacks up in RAM:

invocation     computation
sum' []        0
sum' [5..5]    5 + sum' []
sum' [4..5]    4 + sum' [5..5]
sum' [3..5]    3 + sum' [4..5]
sum' [2..5]    2 + sum' [3..5]
sum' [1..5]    1 + sum' [2..5]

No single instance of the function call can finish until we reach the base case. The general cases sit idly, consuming memory. Well, what if we could make it so that the general cases had nothing left to do? What if we sent a general case’s work along to the recursive call? We can do this by adding another parameter—called an accumulator because it accumulates the result:

sum' accumulator [] = accumulator
sum' accumulator (first:rest) = sum' (first + accumulator) rest

The recursive call is said to be in the tail position of the function. It’s the very last thing left to do in the function’s body. With this arrangement, our call stack becomes this:

invocation        computation
sum' 15 []        15
sum' 10 [5..5]    sum' (5 + 10) []
sum' 6 [4..5]     sum' (4 + 6) [5..5]
sum' 3 [3..5]     sum' (3 + 3) [4..5]
sum' 1 [2..5]     sum' (2 + 1) [3..5]
sum' 0 [1..5]     sum' (1 + 0) [2..5]

Notice that there’s no work left for the general cases to complete after the recursive call finishes. We might as well just reuse the current call’s stack frame instead of pushing on a new one. This reuse would make recursion cost no more than conventional iteration.

This practice of a compiler recognizing that a recursive call is the only remaining work for a function to do, and having the recursive call take over the caller’s stack frame, is called tail call optimization. Not all compilers support it, but Clang, GCC, and GHC do under certain conditions. If your recursion is getting out of hand, try using an accumulator to get the recursive call into the tail position.
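One Haskell-specific caveat worth knowing: because Haskell is lazy, the accumulator in our sum' quietly builds up a chain of unevaluated thunks unless we force it. A bang pattern (or the standard library’s strict left fold, foldl') makes the accumulator strict, so the tail call really does run in constant space. A minimal sketch—the names sumStrict and sumViaFoldl' are mine, not from the lecture:

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Strict accumulator: the bang forces acc at each step,
-- so the tail-recursive loop runs in constant space.
sumStrict :: Num a => a -> [a] -> a
sumStrict !acc []           = acc
sumStrict !acc (first:rest) = sumStrict (acc + first) rest

-- The same idea, packaged up by the standard library.
sumViaFoldl' :: Num a => [a] -> a
sumViaFoldl' = foldl' (+) 0
```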

Now we make an abrupt change. In the next few lectures we will discuss building tools that interpret source code. We’ll discuss lexing a source file into its “atoms,” or tokens, and parsing those tokens according to the grammar of the language to form an executable representation of our program. But I felt that before we get into that stuff, we should just spend a day discussing state machines, which underlie both lexing and parsing.

Instead of writing code today, we’re going to consider a model of computation called a deterministic finite automaton (DFA). A DFA is used to match text that falls into the “language” of the machine. For example, we might build a machine that accepts only strings of all capital letters, only lists of numbers separated by commas, or only four-digit PINs.

We tell if a string is in a DFA’s language by feeding it the string. It bites off a character at a time. Each character takes the machine into a new state. If the string winds up in a final or accepting state when all the characters have been consumed, that string is said to be in the DFA’s language.
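This description translates almost directly into code. As a sketch (the representation here is my own, not something we’ll use later in the course), a DFA is a start state, a set of accepting states, and a transition function; feeding it a string is just a left fold over the characters:

```haskell
-- A DFA over Char: states are Ints, delta is the transition function.
data DFA = DFA
  { start     :: Int
  , accepting :: [Int]
  , delta     :: Int -> Char -> Int
  }

-- Bite off one character at a time, moving to a new state on each.
-- Accept if we land in an accepting state after consuming the string.
accepts :: DFA -> String -> Bool
accepts m s = foldl (delta m) (start m) s `elem` accepting m
```

Running a machine this way makes the “one character, one transition” behavior explicit: there is no backtracking and no memory beyond the current state.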

We’ll create a few of these DFAs in small groups. But first, some examples:

  1. Build a state machine that matches strings over the alphabet [c] that have even length. c looks a little like a flower petal. This machine accepts he-loves-me flowers. It rejects he-loves-me-not flowers.
  2. Build a state machine that matches double-quoted or single-quoted string literals.
  3. Build a state machine that recognizes floating point literals.
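The first example can be written out directly as two states—“even number of petals so far” and “odd number so far”—toggling on each c. A sketch of my own, with the state represented as a Bool:

```haskell
-- Two states: True means we've seen an even number of c's so far.
-- We start in the even state, and the even state is also accepting.
evenPetals :: String -> Bool
evenPetals = go True
  where
    go state []       = state             -- accept iff we end in the even state
    go state ('c':cs) = go (not state) cs -- each petal toggles even/odd
    go _     _        = False             -- any character outside the alphabet: reject
```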

Now, let’s break up into groups. Together, build a state machine that recognizes only…

  1. Strings with an even number of 0s and an even number of 1s.
  2. Strings that are a multiline Java comment.
  3. Strings like dog, doog, dooog, and so on.
  4. Strings that contain neither ba nor ab.
  5. Hexadecimal digits, optionally prefixed by 0x.
  6. Strings in which any 0 is preceded by at least two 1s.
  7. Strings that contain at least three commas.
  8. Strings that don’t contain foo.
  9. Strings in which any 0 appears before any 1.
  10. Strings over the alphabet [12345] whose digits are monotonically increasing—with each digit greater than or equal to its predecessor. For example: 1335.
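To show what a worked answer might look like, here is one possible machine for the first group exercise. It needs four states, which I encode as a pair of parities: (even number of 0s so far, even number of 1s so far). The encoding and names are my own:

```haskell
-- Four states, encoded as a pair of Bools: each component is True
-- when we've seen an even count of that digit so far.
evenZerosOnes :: String -> Bool
evenZerosOnes = go (True, True)            -- start: zero 0s and zero 1s, both even
  where
    go state  []       = state == (True, True) -- accept only with both counts even
    go (z, o) ('0':cs) = go (not z, o) cs      -- a 0 flips the 0-parity
    go (z, o) ('1':cs) = go (z, not o) cs      -- a 1 flips the 1-parity
    go _      _        = False                 -- characters outside {0,1}: reject
```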

Now let’s try another one. Let’s build a machine that recognizes the strings of balanced parentheses. Like (()) and ()()().

We’ll find that we can’t do this with a DFA. A state machine by itself lacks something very vital to a lot of computations: memory. In a state machine, we only know our current state, but we have no recollection of how we got there. HTTP is similar. When a server serves up a web page, it has no idea what other pages you’ve visited before this one. Only by adding memory (cookies, session tracking) can the server follow your flow through a site.
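A recognizer for balanced parentheses needs exactly the memory a DFA lacks: a counter, with no upper bound, of how many parentheses are currently open. Since a DFA has only finitely many states, it can’t track an unbounded count. A sketch of my own making the counter explicit:

```haskell
-- The depth counter can grow without bound, so this is not a DFA:
-- no finite set of states can remember arbitrarily many open parens.
balanced :: String -> Bool
balanced = go (0 :: Int)
  where
    go depth []       = depth == 0                      -- accept iff everything closed
    go depth ('(':cs) = go (depth + 1) cs               -- open: one more to close
    go depth (')':cs) = depth > 0 && go (depth - 1) cs  -- close, never below zero
    go _     _        = False                           -- other characters: reject
```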

Languages that can be recognized by a DFA are called regular. Generally, any language that has recursive elements, nesting, or other complex historical constraints is not regular. Regular expressions are a textual notation for exactly the languages that DFAs can recognize, and that is why they are called regular.

Are our programming languages regular? No. We will still build state machines to process our source code, but we will be adding a lot of memory and lookahead to make machines that recognize our programs.

Here’s your TODO list:

See you next time, when we discuss state machines and regular languages!