teaching machines

CS 330: Lecture 2 – Regex Capturing and Find-All

January 31, 2018 by . Filed under cs330, lectures, spring 2018.

Dear students,

Last time we started looking a regular expressions. There’s something I didn’t share because I was afraid your minds would explode. It’s this:

Regex is a language for recognizing a language.

We use regex to create little recognition machines. We feed a string into such a machine, and it either belches approvingly or projectile-vomits it across the room.

As with all languages, they take a lot of time to learn, and not much happens in your brain long-term if you don’t speak it. So, let’s tackle some problems. With a neighbor, pick two of the following problems and write just the regex. Pick ones that aren’t near each other so we get a good distribution. If you finish early, do more.

  1. a comma-separated list of five non-empty words
  2. a phone number
  3. English and British spellings of “color”
  4. a string without spaces
  5. an integer literal
  6. rectangular dimensions (e.g., “5×6”, “2 x 4”, “2 by 2”)
  7. strings that can’t be interpreted as hexadecimal literals (no “deadbeef”)
  8. words with a silent E and a long vowel
  9. a properly-quoted string literal, with internal quotes escaped
  10. strings that have a word starting with a capital letter

Regex are not just boolean machines, however. Sometimes we want to extract a portion of the matching text. If we parenthesize the pattern that matches the text we care about, the matching text will be saved in a variable whose identifier is the parenthesized group’s 1-based serial number. For example:

expression =~ /(\d+)([-+*\/])(\d+)/
a = $1
operator = $2
b = $3

Parentheses are a little overloaded in regex. They are used for association (to help | and the quantifiers do the right thing) and capturing. If you want parentheses that associate but do not capture, use ?:. For example, we might search of paths inside backtics with this pattern:

/`((?:\/\w+))`/

If the ?: was gone, we’d get a serial variable for each component of the path. With the ?:, the outer parentheses capture the whole path in $1.

Let’s also start looking at how to find all matches of a regex within a large corpus of text. We often get large corpuses from files, so let’s examine how to read a file in Ruby. Here goes:

#!/usr/bin/env ruby

text = File.read(ARGV[0])

We can locate all matches within text using String.scan:

#!/usr/bin/env ruby

text = File.read(ARGV[0])
words = text.scan(/\w+/)

The result words is an array of all the matches in the text. We often process arrays with loops, but loops have fallen out of favor in modern languages. Instead, we’ll use a method called each and pass it the code to process each word:

#!/usr/bin/env ruby

text = File.read(ARGV[0])
words = text.scan(/\w+/)
words.each do |word|
  puts word
end

If we have capturing in the regular expression, we use this form:

#!/usr/bin/env ruby

text = File.read(ARGV[0])
words = text.scan(/(...)(...)(...)/)
words.each do |capture1, capture2, capture3|
  puts capture1
  puts capture2
  puts capture3
end

Let’s solve a few problems using scan:

Our tables from last time are still relevant, so I conjure them back up. There are three kinds of entities in regex:

what to match
atoms
how many to match
quantifiers
where to match
anchors

These are the most common atoms that appear in regular expressions:

symbol what to match
abc literal text abc
. any single character
\w any single alphanumeric character or underscore
\d any single digit
\s any single whitespace
[abc] any single character that is a, b, or c
[^abc] any single character that is not a, b, or c
[A-Z] any uppercase letter
[a-z] any lowercase letter
[A-Za-z] any letter
a|b a or b
\W any single non-alphanumeric character
\D any single non-digit character
\S any single non-whitespace character

We can quantify how many times an atom repeats with these quantifiers:

symbol how many to match
? 0 or 1
* 0 or more, as many as possible
+ 1 or more, as many as possible
*? 0 or more, as few as possible
+? 1 or more, as few as possible
{m} exactly m instances
{m,} at least m instances
{m,n} between m and n instances
{,n} no more than n instances

We prescribe where matches should happen within a string with these anchors:

symbol where to match
^ at start of string or line
$ at end of string or line
\b at word boundary
(?=abc) before abc
(?<=abc) after abc
(?!abc) not before abc
(?<!abc) not after abc

Here’s your TODO list for next time:

Sincerely,

P.S. It’s time for a haiku!

Any marble slab
Inside there is a David
If you can find it

P.P.S. Here’s the code we wrote together:

nohex.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

if line =~ /[^0-9A-Fa-f]/
  puts 'Go you!'
else
  puts 'Bad user...'
end

if line !~ /^[0-9A-Fa-f]+$/
  puts "is hexy"
else
  puts "isn't hexy"
end

silente.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

if line =~ /[aeiou][^aeiou]e$/
  puts 'Go you!'
else
  puts 'Bad user...'
end

dims.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

if line =~ /\d+\s*(x|\*|by)\s*\d+/
  puts 'Go you!'
else
  puts 'Bad user...'
end

sliteral.rb

#!/usr/bin/env ruby

print '> '
line = gets.chomp

if line =~ /^"([^"]|\\")*"$/
  puts 'Go you!'
else
  puts 'Bad user...'
end

scan.rb

#!/usr/bin/env ruby

text = File.read(ARGV[0])

# matches = text.scan(/.*ology$/)
# puts matches

text.scan(/.*ology$/) do |match|
  puts match 
end