teaching machines

CS 330 Lecture 8 – Substitution Blocks, Number Ranges, Lookarounds

February 8, 2017. Filed under cs330, lectures, spring 2017.

Dear students,

At the end of class, we’re going to play some Regex Bingo. While you’re waiting, find a partner and make a 4×4 grid of randomly generated strings. Include uppercase and lowercase letters, numbers, whitespace, and punctuation. Keep the strings short. There’s no free space.

Our discussion of gsub was cut short last time. Let’s start today with a few more examples:

  1. Fix missing quotation marks around attributes in HTML.
  2. Toggle case of SQL keywords.
  3. Evaluate embedded mathematical expressions in a report.
  4. Replace all numbers in [0, 255] with a special byte literal syntax: 127b.
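The fourth task is the least obvious of the bunch, because a regex sees digits, not magnitudes. Here is a minimal sketch of one way to carve [0, 255] into digit-shaped cases and check the boundaries. The alternation mirrors the one in byte.rb below; the names BYTE and tag_bytes are made up for illustration.

```ruby
# BYTE and tag_bytes are hypothetical names used only for this sketch.
# Each branch of the alternation covers one slice of the range:
#   \d        0-9
#   [1-9]\d   10-99
#   1\d\d     100-199
#   2[0-4]\d  200-249
#   25[0-5]   250-255
BYTE = /\b(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\b/

def tag_bytes(text)
  # The block form of gsub builds each replacement from the match itself.
  text.gsub(BYTE) { "#{$1}b" }
end

puts tag_bytes("0 9 10 99 100 199 200 249 250 255 256 1000")
# prints: 0b 9b 10b 99b 100b 199b 200b 249b 250b 255b 256 1000
```

The word boundaries do the heavy lifting here: without them, 256 would match as 25 followed by a stray 6. They are not perfect, though. A decimal like 3.14 comes out as 3b.14b, since the period counts as a boundary.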

Occasionally in our patterns, we find ourselves wanting to match some text X adjacent to some other text Y. We don’t really want to do anything with text Y, but we need it in our pattern to serve as an anchor. We end up capturing text Y only to reinsert it, unmodified. For example, in this snippet from last time, we insert a space between a lowercase letter and the uppercase letter that follows it:

id.gsub!(/([a-z])([A-Z])/, '\1 \2')

We didn’t really do anything with the letters but put them back in. It’d be great if we could match just the interstitial space between the two letters. We can with lookaround assertions:

id.gsub!(/(?<=[a-z])(?=[A-Z])/, ' ')

Lookaround assertions allow us to mark elements as anchor points, dictating where a match occurs, but not actually including the anchoring text as part of the match. I use them a lot in my text editor to position my cursor after some anchoring text. For example, running the search /\(<div class="foo">\)\@<= in Vim puts my cursor just inside divs with class foo.
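To see that the anchors really are left out of the match, here is a small sketch that runs both versions of the substitution side by side and then inspects what the lookaround pattern actually matched. The helper names and the sample identifier are made up for illustration.

```ruby
# humanize_with_captures and humanize are hypothetical helper names.

def humanize_with_captures(id)
  # The letters are consumed by the match, so we have to write them back.
  id.gsub(/([a-z])([A-Z])/, '\1 \2')
end

def humanize(id)
  # The lookbehind and lookahead only assert; the match itself is empty.
  id.gsub(/(?<=[a-z])(?=[A-Z])/, ' ')
end

puts humanize_with_captures("maxHealthPoints")  # prints: max Health Points
puts humanize("maxHealthPoints")                # prints: max Health Points

# Each match is the empty string sitting between the two letters.
p "maxHealthPoints".scan(/(?<=[a-z])(?=[A-Z])/)  # prints: ["", ""]

# The anchoring dollar sign positions the match but is not part of it.
p "price: $42"[/(?<=\$)\d+/]                     # prints: "42"
```

Both substitutions produce the same string. The lookaround version just never touches the letters on either side.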

 0: \b[A-Z]
 1: \s\w\s
 2: ^\d
 3: ^\w{3}$
 4: \.\d
 5: \d\w+\d
 6: [$?!#]{2}
 7: ^$
 8: ^(\d).*\1$
 9: \s\d
10: \(.*\)
11: ^[^abc]
12: \$\w+
13: ^\d+$
14: [02468][A-Z]
15: (^\.|\.$)
16: ^.{4}$
17: (.)\1\1
18: [a-m].?[n-z]
19: [A-Z][a-z]
20: l[ao]
21: .\D{2,4}.
22: ^[A-Z][a-z]*$
23: \d[^A-Za-z0-9]
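If you want to double-check a card against a called pattern, a few lines of Ruby will do it. This is only a sketch: the card below is one I made up, hits is a hypothetical helper, and I've included just two of the patterns from the list above.

```ruby
# A made-up 4x4 card. Yours will differ.
CARD = [
  ["Hi!",  "42", "a b c", "$cash"],
  ["(ok)", ".5", "zz9",   "Tom"],
  ["7up7", "??", "x",     "3 Z"],
  ["aaa",  "No.", "1,2",  "B4"]
]

# Two of the numbered patterns from the list above.
PATTERNS = {
  10 => /\(.*\)/,
  13 => /^\d+$/
}

# hits is a hypothetical helper: it returns every cell that the pattern matches.
def hits(card, pattern)
  card.flatten.select { |cell| cell =~ pattern }
end

PATTERNS.each do |number, pattern|
  puts "#{number}: #{hits(CARD, pattern).inspect}"
end
# prints:
# 10: ["(ok)"]
# 13: ["42"]
```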

With this we close out our formal discussion of regular expressions. I include them at this point in the semester because they are a practical tool that I think will give you power over your text, and there’s no point in delaying your acceptance of this superpower. They will come back later in the semester when we write our own language.

Here’s your TODO list for next time:

See you then!

Sincerely,

lines.rb

#!/usr/bin/env ruby

lines = File.readlines('post.gray')
lines.each_with_index do |line, i|
  puts "#{i}. #{line}"
end

multiline = <<EOF
Kathy
has
two
dogs
EOF

multiline.lines.each_with_index do |line, i|
  puts "#{i}. #{line}"
end

bad.html

<!DOCTYPE html>
<html>
<head>
  <title>...</title>
</head>
<body>
  <img src=http://www.google.jpg width=56 height="79">
  <div id=madeup></div>
  <a href=foo></a>
</body>
</html>

fix.rb

#!/usr/bin/env ruby

html = File.read(ARGV[0])

html.gsub!(/(=)([^"]\S*?)( |\/?>)/, '\1"\2"\3')

puts html

humanid.rb

#!/usr/bin/env ruby

id = ARGV[0].dup

id.gsub!(/(?<=[a-z])(?=[A-Z])/, ' ')
id[0] = id[0].upcase

puts id

report

The tesseract has a dichoral angle of {{{ 2 * 5 * 3 * 3 }}} degrees, which is {{{ 90 * 3.14159 / 180 }}} radians. My favorite power of two is {{{ 2 ** 16 }}}.

evalreport.rb

#!/usr/bin/env ruby

src = File.read('report')

# eval runs arbitrary Ruby, so only feed it reports you trust.
src.gsub!(/\{\{\{(.*?)\}\}\}/) do
  eval($1).to_s
end

puts src

# puts eval "2 * 8"

foo.sql

SELECT * FROM bulbapedia INNER JOIN mylist WHERE atk > 50;

sqler.rb

#!/usr/bin/env ruby

src = File.read('foo.sql')

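# Set to true to uppercase the keywords, false to lowercase them.
is_upper = false

# The word boundaries keep us from rewriting keywords buried inside identifiers.
src.gsub!(/\b(?:from|select|where|join|inner|left|update|values)\b/i) do
  if is_upper
    $&.upcase
  else
    $&.downcase
  end
end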

puts src

byte.rb

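#!/usr/bin/env ruby

# Assumption: input comes from files named on the command line, or from stdin.
src = ARGF.read

src.gsub!(/\b(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\b/) do
  "#{$1}b"
end

puts src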