» CS 330: Lecture 1 – Asserting Patterns with Regex

CS 330: Lecture 1 – Asserting Patterns with Regex

January 29, 2018 by Chris Johnson. Filed under cs330, lectures, spring 2018.

Dear students,

Welcome to CS 330: Programming Languages! What’s this class about? Well, imagine you are a biology student taking a class on mammals and every lecture, lab, and homework is on cows. You wouldn’t really be a biologist at the end of this class. You’d be a cowist.

A cow gives you one picture of how an organism can be. But to be a biologist, you need lots of pictures. Similarly, Java gives you one picture of how a programmer talks to a machine, but to be a computer scientist, you need lots of pictures. Each picture is a window into some deeper concept that lies at the center of all them and is not fully manifested in any single one of them.

Accordingly, this class is about all the tools—the languages—we use to teach machines. We don’t really address the interesting applications of these tools. We focus on the tools themselves. That sounds kind of disembodied. But when you consider that these tools form the bridge between a programmer and a machine, this class feels much more human. We want to design and use tools that take away the pain and time-wasting from developing software. In many computer science classes, we focus on making an end-product for an end-user. In this class, we focus on making a mid-product for you, a developer of software. You are just as important as your users, and it’s okay to have a class about you.

Some of the features of our tools that we’ll discuss include:

type systems
static vs. dynamic operations (compile-time vs. run-time)
code reuse (polymorphism)
procedural abstraction (higher-order functions, closures)
hardware abstraction (memory management, statelessness)

The languages through which we view these features include C, C++, Haskell, Java, and Ruby.

We’ll also write our own language. Take that, biologists. When’s the last time you made your own mammal?

That’s what the class is about. What am I about?

5’9″.
36 years old. I’m at that age where mid-life crises start to happen. I’ll try not to have any during lecture.
My family, who needs me more than I want them to, but I suppose my feelings will be reversed some day. One thing you should know: I organize my day so that I can spend time with my kids. I get to campus very, very early, and I leave at 3 PM everyday to pick them up from school. Then we play Breath of the Wild, which will take us until they graduate from high school to finish.
This class. I love this one more than all the others.
A full life. I spend too much time reading and trying to be a better computer scientist. I signed up for too many projects with students. I mistakenly agreed to visit two schools each week and teach elementary and middle schoolers how to code with 3D modeling. In a moment of weakness, I asked to audit Japanese 102 in hopes that I can someday visit Japan. I expend far too much time writing bad haiku to close out lectures. I feel like there is too much good in the world and that I’m drowning in my own curiosity. Stupid internet. Stupid library. Stupid Humble Bundle.

Now, what are you about? On a quarter sheet of paper, tell me the following:

What’s your name?
What do you wish you had done over break but didn’t?
What would you write a book on?
What have you made that you are most proud of?

Okay, for the rest of our time today, let’s write our own programming language. Okay, let’s just look at a small piece of it. A major part of writing a language is being able to recognize what the programmer is saying. We will use regular expressions to do this. Regex are descriptions of the patterns you expect to find in a string of text. In this next week, we’ll use them to accomplish the following tasks:

validate user input (=~)
locate matches in a large body of text (scan)
programmatically transform text (gsub)

Today we focus on validating user input. We want to assert that user input conforms to an expected format. You already know how to match text exactly. How do we do that?

In Java, we use equals. In C++, we use ==. In PHP, we use ===. In C, we use strcmp.

In practice, we often don’t know exactly what we’re looking for, but we know the overall shape. Enter regular expressions, which allow us to loosely describe the shape of a string. Let’s explore them in the context of felt needs. What are some things whose format you might want to assert? Hopefully you can think of a few. I’ve got some just in case you fail us:

a birthdate
a UWEC email address
a comma-separated list of five non-empty words
a phone number
English and British spellings of “color”
a string without spaces
an integer literal
rectangular dimensions (e.g., “5×6”, “2 x 4”, “2 by 2”)
strings that can’t be interpreted as hexadecimal literals (no “deadbeef”)
words with a silent E and a long vowel
a properly-quoted string literal, with internal quotes escaped
strings that have a word starting with a capital letter

We’re going to explore regular expressions in Ruby, because they feel cleanest to me in this language (and in Perl, which popularized them). Javascript does a pretty good job. But most other languages use string literals to express them, which requires too much escaping.

Here’s a Ruby script that gets some input from the user and inspects the result:

#!/usr/bin/env ruby

print '> '
line = gets
print line.inspect

Here’s the output when we enter foo:

> foo
"foo\n"

What do you notice? The newline is still here. Let’s dispose of it with chomp:

#!/usr/bin/env ruby

print '> '
line = gets.chomp
print line.inspect

To assert that a line matches a regular expression, we often write code like this:

#!/usr/bin/env ruby

print '> '
line = gets.chomp
if line =~ /some regex/
  print 'You did it!'
else
  print 'Bad user!'
end

We will spend the rest of our time today learning the parts and pieces of regular expressions as we try to match text for the problems you and I enumerate. My own understanding of regular expressions improved greatly when I learned how to organize all the parts and pieces. I used a table like this:

what to match atoms	how many to match quantifiers	where to match anchors
…	…	…

These are the most common atoms that appear in regular expressions:

symbol	what to match
`abc`	literal text `abc`
`.`	any single character
`\w`	any single alphanumeric character or underscore
`\d`	any single digit
`\s`	any single whitespace
`[abc]`	any single character that is `a`, `b`, or `c`
`[^abc]`	any single character that is not `a`, `b`, or `c`
`[A-Z]`	any uppercase letter
`[a-z]`	any lowercase letter
`[A-Za-z]`	any letter
`a\|b`	`a` or `b`
`\W`	any single non-alphanumeric character
`\D`	any single non-digit character
`\S`	any single non-whitespace character

We can quantify how many times an atom repeats with these quantifiers:

symbol	how many to match
`?`	0 or 1
`*`	0 or more, as many as possible
`+`	1 or more, as many as possible
`*?`	0 or more, as few as possible
`+?`	1 or more, as few as possible
`{m}`	exactly m instances
`{m,}`	at least m instances
`{m,n}`	between m and n instances
`{,n}`	no more than n instances

We prescribe where matches should happen within a string with these anchors:

symbol	where to match
`^`	at start of string or line
`$`	at end of string or line
`\b`	at word boundary
`(?=abc)`	before `abc`
`(?<=abc)`	after `abc`
`(?!abc)`	not before `abc`
`(?<!abc)`	not after `abc`

Here’s your TODO list for next time:

Read the syllabus.
Complete the three parts of homework 0 as soon as possible. Follow the steps in part 2 to fork the template repository before Wednesday for a participation point.
Complete the regex tutorial at RegexOne. For a participation point, compose two problems of your own, listing the strings to match and the strings to not match. Turn these in on a quarter sheet of paper at the beginning of next lecture. Late quarter sheets are not accepted. You undermine their purpose when you do them five minutes before class starts.

Sincerely,

P.S. It’s time for a haiku!

My boss just fired me
I checked in code with regex
He feared the backslash

P.P.S. Here’s the code we wrote together:

birthdate.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

# if line =~ /\d\d?[-\/]\d\d?[-\/]\d\d/
if line =~ /^(\d\d?-\d\d?-\d\d|\d\d?\/\d\d?\/\d\d)$/
  puts 'Go you!'
else
  puts 'Bad user...'
end

email.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

# if line =~ /\d\d?[-\/]\d\d?[-\/]\d\d/
if line =~ /^ *\w+@uwec.edu *$/
  puts 'Go you!'
else
  puts 'Bad user...'
end