CS 330: Lecture 2 – Regex Capturing and Find-All
Dear students,
Last time we started looking a regular expressions. There’s something I didn’t share because I was afraid your minds would explode. It’s this:
Regex is a language for recognizing a language.
We use regex to create little recognition machines. We feed a string into such a machine, and it either belches approvingly or projectile-vomits it across the room.
As with all languages, they take a lot of time to learn, and not much happens in your brain long-term if you don’t speak it. So, let’s tackle some problems. With a neighbor, pick two of the following problems and write just the regex. Pick ones that aren’t near each other so we get a good distribution. If you finish early, do more.
- a comma-separated list of five non-empty words
- a phone number
- English and British spellings of “color”
- a string without spaces
- an integer literal
- rectangular dimensions (e.g., “5×6”, “2 x 4”, “2 by 2”)
- strings that can’t be interpreted as hexadecimal literals (no “deadbeef”)
- words with a silent E and a long vowel
- a properly-quoted string literal, with internal quotes escaped
- strings that have a word starting with a capital letter
Regex are not just boolean machines, however. Sometimes we want to extract a portion of the matching text. If we parenthesize the pattern that matches the text we care about, the matching text will be saved in a variable whose identifier is the parenthesized group’s 1-based serial number. For example:
expression =~ /(\d+)([-+*\/])(\d+)/
a = $1
operator = $2
b = $3
Parentheses are a little overloaded in regex. They are used for association (to help |
and the quantifiers do the right thing) and capturing. If you want parentheses that associate but do not capture, use ?:
. For example, we might search of paths inside backtics with this pattern:
/`((?:\/\w+))`/
If the ?:
was gone, we’d get a serial variable for each component of the path. With the ?:
, the outer parentheses capture the whole path in $1.
Let’s also start looking at how to find all matches of a regex within a large corpus of text. We often get large corpuses from files, so let’s examine how to read a file in Ruby. Here goes:
#!/usr/bin/env ruby
text = File.read(ARGV[0])
We can locate all matches within text
using String.scan
:
#!/usr/bin/env ruby
text = File.read(ARGV[0])
words = text.scan(/\w+/)
The result words
is an array of all the matches in the text. We often process arrays with loops, but loops have fallen out of favor in modern languages. Instead, we’ll use a method called each
and pass it the code to process each word:
#!/usr/bin/env ruby
text = File.read(ARGV[0])
words = text.scan(/\w+/)
words.each do |word|
puts word
end
If we have capturing in the regular expression, we use this form:
#!/usr/bin/env ruby
text = File.read(ARGV[0])
words = text.scan(/(...)(...)(...)/)
words.each do |capture1, capture2, capture3|
puts capture1
puts capture2
puts capture3
end
Let’s solve a few problems using scan
:
- Extract the URLs from all
img
elements. - Find all the -ologies and -ographies in the
/usr/share/bin/dict
. - List all the HTML tags in a file.
Our tables from last time are still relevant, so I conjure them back up. There are three kinds of entities in regex:
what to match atoms |
how many to match quantifiers |
where to match anchors |
---|---|---|
… | … | … |
These are the most common atoms that appear in regular expressions:
symbol | what to match |
---|---|
abc
|
literal text abc
|
.
|
any single character |
\w
|
any single alphanumeric character or underscore |
\d
|
any single digit |
\s
|
any single whitespace |
[abc]
|
any single character that is a , b , or c
|
[^abc]
|
any single character that is not a , b , or c
|
[A-Z]
|
any uppercase letter |
[a-z]
|
any lowercase letter |
[A-Za-z]
|
any letter |
a|b
|
a or b
|
\W
|
any single non-alphanumeric character |
\D
|
any single non-digit character |
\S
|
any single non-whitespace character |
We can quantify how many times an atom repeats with these quantifiers:
symbol | how many to match |
---|---|
?
|
0 or 1 |
*
|
0 or more, as many as possible |
+
|
1 or more, as many as possible |
*?
|
0 or more, as few as possible |
+?
|
1 or more, as few as possible |
{m}
|
exactly m instances |
{m,}
|
at least m instances |
{m,n}
|
between m and n instances |
{,n}
|
no more than n instances |
We prescribe where matches should happen within a string with these anchors:
symbol | where to match |
---|---|
^
|
at start of string or line |
$
|
at end of string or line |
\b
|
at word boundary |
(?=abc)
|
before abc
|
(?<=abc)
|
after abc
|
(?!abc)
|
not before abc
|
(?<!abc)
|
not after abc
|
Here’s your TODO list for next time:
- Solve this puzzle on LinkedIn. You may wish to practice at regexcrossword.com. Write your solution on a quarter sheet to be turned in at the beginning of next class.
- For an extra credit participation point, create your own non-trivial 5×5 regex crossword on regexcrossword.com and share a link on Piazza.
P.S. It’s time for a haiku!
Any marble slab
Inside there is a David
If you can find it
P.P.S. Here’s the code we wrote together:
nohex.rb
#!/usr/bin/env ruby
print '> '
line = gets.chomp
if line =~ /[^0-9A-Fa-f]/
puts 'Go you!'
else
puts 'Bad user...'
end
if line !~ /^[0-9A-Fa-f]+$/
puts "is hexy"
else
puts "isn't hexy"
end
silente.rb
#!/usr/bin/env ruby
print '> '
line = gets.chomp
if line =~ /[aeiou][^aeiou]e$/
puts 'Go you!'
else
puts 'Bad user...'
end
dims.rb
#!/usr/bin/env ruby
print '> '
line = gets.chomp
if line =~ /\d+\s*(x|\*|by)\s*\d+/
puts 'Go you!'
else
puts 'Bad user...'
end
sliteral.rb
#!/usr/bin/env ruby
print '> '
line = gets.chomp
if line =~ /^"([^"]|\\")*"$/
puts 'Go you!'
else
puts 'Bad user...'
end
scan.rb
#!/usr/bin/env ruby
text = File.read(ARGV[0])
# matches = text.scan(/.*ology$/)
# puts matches
text.scan(/.*ology$/) do |match|
puts match
end