teaching machines

CS 330: Lecture 1 – Asserting Patterns with Regex

January 29, 2018 by . Filed under cs330, lectures, spring 2018.

Dear students,

Welcome to CS 330: Programming Languages! What’s this class about? Well, imagine you are a biology student taking a class on mammals and every lecture, lab, and homework is on cows. You wouldn’t really be a biologist at the end of this class. You’d be a cowist.

A cow gives you one picture of how an organism can be. But to be a biologist, you need lots of pictures. Similarly, Java gives you one picture of how a programmer talks to a machine, but to be a computer scientist, you need lots of pictures. Each picture is a window into some deeper concept that lies at the center of all them and is not fully manifested in any single one of them.

Accordingly, this class is about all the tools—the languages—we use to teach machines. We don’t really address the interesting applications of these tools. We focus on the tools themselves. That sounds kind of disembodied. But when you consider that these tools form the bridge between a programmer and a machine, this class feels much more human. We want to design and use tools that take away the pain and time-wasting from developing software. In many computer science classes, we focus on making an end-product for an end-user. In this class, we focus on making a mid-product for you, a developer of software. You are just as important as your users, and it’s okay to have a class about you.

Some of the features of our tools that we’ll discuss include:

The languages through which we view these features include C, C++, Haskell, Java, and Ruby.

We’ll also write our own language. Take that, biologists. When’s the last time you made your own mammal?

That’s what the class is about. What am I about?

Now, what are you about? On a quarter sheet of paper, tell me the following:

  1. What’s your name?
  2. What do you wish you had done over break but didn’t?
  3. What would you write a book on?
  4. What have you made that you are most proud of?

Okay, for the rest of our time today, let’s write our own programming language. Okay, let’s just look at a small piece of it. A major part of writing a language is being able to recognize what the programmer is saying. We will use regular expressions to do this. Regex are descriptions of the patterns you expect to find in a string of text. In this next week, we’ll use them to accomplish the following tasks:

Today we focus on validating user input. We want to assert that user input conforms to an expected format. You already know how to match text exactly. How do we do that?

In Java, we use equals. In C++, we use ==. In PHP, we use ===. In C, we use strcmp.

In practice, we often don’t know exactly what we’re looking for, but we know the overall shape. Enter regular expressions, which allow us to loosely describe the shape of a string. Let’s explore them in the context of felt needs. What are some things whose format you might want to assert? Hopefully you can think of a few. I’ve got some just in case you fail us:

We’re going to explore regular expressions in Ruby, because they feel cleanest to me in this language (and in Perl, which popularized them). Javascript does a pretty good job. But most other languages use string literals to express them, which requires too much escaping.

Here’s a Ruby script that gets some input from the user and inspects the result:

#!/usr/bin/env ruby

print '> '
line = gets
print line.inspect

Here’s the output when we enter foo:

> foo
"foo\n"

What do you notice? The newline is still here. Let’s dispose of it with chomp:

#!/usr/bin/env ruby

print '> '
line = gets.chomp
print line.inspect

To assert that a line matches a regular expression, we often write code like this:

#!/usr/bin/env ruby

print '> '
line = gets.chomp
if line =~ /some regex/
  print 'You did it!'
else
  print 'Bad user!'
end

We will spend the rest of our time today learning the parts and pieces of regular expressions as we try to match text for the problems you and I enumerate. My own understanding of regular expressions improved greatly when I learned how to organize all the parts and pieces. I used a table like this:

what to match
atoms
how many to match
quantifiers
where to match
anchors

These are the most common atoms that appear in regular expressions:

symbol what to match
abc literal text abc
. any single character
\w any single alphanumeric character or underscore
\d any single digit
\s any single whitespace
[abc] any single character that is a, b, or c
[^abc] any single character that is not a, b, or c
[A-Z] any uppercase letter
[a-z] any lowercase letter
[A-Za-z] any letter
a|b a or b
\W any single non-alphanumeric character
\D any single non-digit character
\S any single non-whitespace character

We can quantify how many times an atom repeats with these quantifiers:

symbol how many to match
? 0 or 1
* 0 or more, as many as possible
+ 1 or more, as many as possible
*? 0 or more, as few as possible
+? 1 or more, as few as possible
{m} exactly m instances
{m,} at least m instances
{m,n} between m and n instances
{,n} no more than n instances

We prescribe where matches should happen within a string with these anchors:

symbol where to match
^ at start of string or line
$ at end of string or line
\b at word boundary
(?=abc) before abc
(?<=abc) after abc
(?!abc) not before abc
(?<!abc) not after abc

Here’s your TODO list for next time:

Sincerely,

P.S. It’s time for a haiku!

My boss just fired me
I checked in code with regex
He feared the backslash

P.P.S. Here’s the code we wrote together:

birthdate.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

# if line =~ /\d\d?[-\/]\d\d?[-\/]\d\d/
if line =~ /^(\d\d?-\d\d?-\d\d|\d\d?\/\d\d?\/\d\d)$/
  puts 'Go you!'
else
  puts 'Bad user...'
end

email.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

# if line =~ /\d\d?[-\/]\d\d?[-\/]\d\d/
if line =~ /^ *\w+@uwec.edu *$/
  puts 'Go you!'
else
  puts 'Bad user...'
end