CS 330 Lecture 8 – Substitution Blocks, Number Ranges, Lookarounds
Dear students,
At the end of class, we’re going to play some Regex Bingo. While you’re waiting, find a partner and make a 4×4 grid of randomly generated strings. Include uppercase and lowercase letters, numbers, whitespace, and punctuation. Keep the strings short. There’s no free space.
Our discussion of gsub was cut short last time. Let’s start today with a few more examples:
- Fix missing quotation marks around attributes in HTML.
- Toggle case of SQL keywords.
- Evaluate embedded mathematical expressions in a report.
- Replace all numbers in [0, 255] with a special byte literal syntax: 127b.
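Each of these jobs needs a replacement that is computed from the matched text, which is what the block form of gsub is for. Here is a minimal sketch of the idea. The helper name double_numbers and the sample string are made up for illustration:

```ruby
# Hypothetical helper: double every integer found in a string.
# The block runs once per match, and its return value replaces the match.
def double_numbers(text)
  text.gsub(/\d+/) { |match| (match.to_i * 2).to_s }
end

puts double_numbers("3 apples and 12 pears")  # prints "6 apples and 24 pears"
```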
Occasionally in our patterns, we want to match some text X that sits next to some other text Y. We don’t want to do anything with Y, but we need it in the pattern as an anchor, so we end up capturing Y only to reinsert it unmodified. For example, in this snippet from last time, we insert a space wherever a lowercase letter is followed by an uppercase letter:
id.gsub!(/([a-z])([A-Z])/, '\1 \2')
We didn’t really do anything with the letters but put them back in. It’d be great if we could match just the interstitial space between the two letters. We can with lookaround assertions:
id.gsub!(/(?<=[a-z])(?=[A-Z])/, ' ')
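To see that the two forms agree, here is a small side-by-side sketch. The sample identifier is made up, and I use the non-mutating gsub so both results can be compared:

```ruby
# Two ways to split a camelCase identifier (sample input is made up).
id = "numberOfActiveUsers"

# Capture both letters and put them back around the space.
with_captures = id.gsub(/([a-z])([A-Z])/, '\1 \2')

# Match the empty string between the letters; no characters are consumed.
with_lookarounds = id.gsub(/(?<=[a-z])(?=[A-Z])/, ' ')

puts with_captures     # prints "number Of Active Users"
puts with_lookarounds  # prints "number Of Active Users"
```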
Lookaround assertions allow us to mark elements as anchor points, dictating where a match occurs, but not actually including the anchoring text as part of the match. I use them a lot in my text editor to position my cursor after some anchoring text. For example, to get my cursor just inside divs with class foo, I’d run the following search in Vim: /\(<div class="foo">\)\@<=.
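The same positioning trick can be sketched in Ruby, where the lookbehind is written (?<=...) instead of Vim’s \@<=. The HTML string here is a made-up example:

```ruby
html = '<body><div class="foo">hello</div></body>'

# The lookbehind requires the opening tag to sit just before the match,
# but the match itself is the empty string.
m = html.match(/(?<=<div class="foo">)/)

puts m[0].inspect   # prints ""
puts m.begin(0)     # prints 23, the index just past the opening tag
puts m.post_match   # prints "hello</div></body>"
```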
0: \b[A-Z]
1: \s\w\s
2: ^\d
3: ^\w{3}$
4: \.\d
5: \d\w+\d
6: [$?!#]{2}
7: ^$
8: ^(\d).*\1$
9: \s\d
10: \(.*\)
11: ^[^abc]
12: \$\w+
13: ^\d+$
14: [02468][A-Z]
15: (^\.|\.$)
16: ^.{4}$
17: (.)\1\1
18: [a-m].?[n-z]
19: [A-Z][a-z]
20: l[ao]
21: .\D{2,4}.
22: ^[A-Z][a-z]*$
23: \d[^A-Za-z0-9]
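If you want to check a grid against these patterns by machine, something like the following sketch would do. Only four of the patterns are included, and the helper name hits and the grid strings are invented for illustration:

```ruby
# A few of the bingo patterns, keyed by their numbers.
PATTERNS = {
  2  => /^\d/,
  13 => /^\d+$/,
  17 => /(.)\1\1/,
  20 => /l[ao]/
}

# Hypothetical helper: return the grid strings that the numbered pattern matches.
def hits(number, grid)
  grid.grep(PATTERNS[number])
end

grid = ["42", "aaa!", "Hi there", "7up", "hello"]  # made-up grid cells
p hits(2, grid)   # prints ["42", "7up"]
p hits(13, grid)  # prints ["42"]
p hits(17, grid)  # prints ["aaa!"]
p hits(20, grid)  # prints ["hello"]
```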
With this we close out our formal discussion of regular expressions. I include them at this point in the semester because they are a practical tool that I think will give you power over your text, and there’s no point in delaying your acceptance of this superpower. They will come back later in the semester when we write our own language.
Here’s your TODO list for next time:
- Read The Descent to C by Simon Tatham, developer of PuTTY. On a quarter sheet, write down 2-3 questions or observations inspired by your reading.
See you then!
lines.rb
#!/usr/bin/env ruby

lines = File.readlines('post.gray')
lines.each_with_index do |line, i|
  puts "#{i}. #{line}"
end

multiline = <<EOF
Kathy has
two dogs
EOF
multiline.lines.each_with_index do |line, i|
  puts "#{i}. #{line}"
end
bad.html
<!DOCTYPE html>
<html>
<head>
  <title>...</title>
</head>
<body>
  <img src=http://www.google.jpg width=56 height="79">
  <div id=madeup></div>
  <a href=foo></a>
</body>
</html>
fix.rb
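#!/usr/bin/env ruby

html = File.read(ARGV[0])
# Wrap each unquoted attribute value in double quotes. The value runs from
# the = up to the next space or the end of the tag.
html.gsub!(/(=)([^"]\S*?)( |\/?>)/, '\1"\2"\3')
puts html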
humanid.rb
#!/usr/bin/env ruby

id = ARGV[0].dup
id.gsub!(/(?<=[a-z])(?=[A-Z])/, ' ')
id[0] = id[0].upcase
puts id
report
The tesseract has a dichoral angle of {{{ 2 * 5 * 3 * 3 }}} degrees, which is {{{ 90 * 3.14159 / 180 }}} radians. My favorite power of two is {{{ 2 ** 16 }}}.
evalreport.rb
#!/usr/bin/env ruby

src = File.read('report')
src.gsub!(/\{\{\{(.*?)\}\}\}/) do
  eval($1)
end
puts src

# puts eval "2 * 8"
foo.sql
SELECT * FROM bulbapedia INNER JOIN mylist WHERE atk > 50;
sqler.rb
#!/usr/bin/env ruby

src = File.read('foo.sql')
is_upper = false
src.gsub!(/from|select|where|join|inner|left|update|values/i) do
  if is_upper
    $&.upcase
  else
    $&.downcase
  end
end
puts src
byte.rb
#!/usr/bin/env ruby

# Read the file named on the command line, as fix.rb does.
src = File.read(ARGV[0])
src.gsub!(/\b(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\b/) do
  "#{$1}b"
end
puts src
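A number-range pattern like this one is easy to get subtly wrong, so it is worth testing by brute force. The sketch below uses the same alternation as byte.rb, but anchored with \A and \z so it must match the whole string. The constant name BYTE is my own:

```ruby
# Same alternation as byte.rb, anchored to the whole string.
BYTE = /\A(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\z/

# Try every integer in a range wider than [0, 255] and keep the matches.
matched = (0..999).select { |n| BYTE.match?(n.to_s) }

puts matched == (0..255).to_a  # prints true
puts BYTE.match?("007")        # prints false: leading zeros are not covered
```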