teaching machines

CS 330: Lecture 5 – Lookaround Assertions and Numeric Ranges

Dear students,

To start things off, we’re going to play some Regex Bingo. Find a partner and make a 4×4 grid of randomly generated strings. Include uppercase and lowercase letters, numbers, whitespace, and punctuation. Keep the strings short. There’s no free space.

I will call out the following regex, one at a time. Cross out all strings that match.

 0: \b[A-Z]
 1: \s\w\s
 2: ^\d
 3: ^\w{3}$
 4: \.\d
 5: \d\w+\d
 6: [$?!#]{2}
 7: ^$
 8: ^(\d).*\1$
 9: \s\d
10: \(.*\)
11: ^[^abc]
12: \$\w+
13: ^\d+$
14: [02468][A-Z]
15: (^\.|\.$)
16: ^.{4}$
17: (.)\1\1
18: [a-m].?[n-z]
19: [A-Z][a-z]
20: l[ao]
21: .\D{2,4}.
22: ^[A-Z][a-z]*$
23: \d[^A-Za-z0-9]

Now let’s look at a few more problems:

  • Find all the statements in a Javascript source file that don’t end in a semi-colon. (As a proxy, we’ll say all lines that have code on them that don’t end in a curly brace, comma, or semi-colon.)
  • Evaluate embedded mathematical expressions in a report.
  • Fix missing quotation marks around single-token attributes in HTML.
  • Replace all numbers in [0, 255] with their binary representation with b suffix. For example 6 turns into 110b.

Occasionally in our patterns, we find ourselves wanting to match some text X adjacent to some other text Y. We don’t really want to do anything with text Y, but we need it in our pattern to serve as an anchor. We end up capturing text Y only to reinsert it, unmodified. For example, in the missing quotation marks problem, we capture the attribute but reinsert it as is:

id.gsub!(/(\w+=)([^" ]+)/, '\1"\2"')

We didn’t really do anything with the attribute. It’d be great if we could just replace the content that we do care about. We can—with lookaround assertions:

id.gsub!(/(?<=\w+=)([^" ]+)/, '"\2"')

Lookaround assertions allow us to mark elements as anchor points, dictating where a match occurs, but not actually including the anchoring text as part of the match. I use them a lot in my text editor to position my cursor after some anchoring text. For example, to get my cursor just inside divs with class foo, I’d run the following search in Vim: /\(<div class="foo">\)\@<=.

Additionally, sometimes we want to search for numbers that fall into a restricted numeric range. Regex may not be the best tool for this, but it can be made to work. Start by drawing out a table of the range you wish to capture. For the problem above, we want to find byte literals in [0, 255], so here’s our table:

0 1 2 3 4 5 6 7 8 9
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
100 101 102 103 104 105 106 107 108 109
110 111 112 113 114 115 116 117 118 119
130 131 132 133 134 135 136 137 138 139
140 141 142 143 144 145 146 147 148 149
150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169
170 171 172 173 174 175 176 177 178 179
180 181 182 183 184 185 186 187 188 189
190 191 192 193 194 195 196 197 198 199
200 201 202 203 204 205 206 207 208 209
210 211 212 213 214 215 216 217 218 219
220 221 222 223 224 225 226 227 228 229
230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249
250 251 252 253 254 255

Now we decompose this table into a series of rectangular groups, with members of each group having something in common. For example, this group has only single digits:

0 1 2 3 4 5 6 7 8 9

This group has only two digits:

00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99

This group has three digits with a leading 1:

100 101 102 103 104 105 106 107 108 109
110 111 112 113 114 115 116 117 118 119
130 131 132 133 134 135 136 137 138 139
140 141 142 143 144 145 146 147 148 149
150 151 152 153 154 155 156 157 158 159
160 161 162 163 164 165 166 167 168 169
170 171 172 173 174 175 176 177 178 179
180 181 182 183 184 185 186 187 188 189
190 191 192 193 194 195 196 197 198 199

This group has three digits with a leading 2 and a tens-digits in [0, 4]:

200 201 202 203 204 205 206 207 208 209
210 211 212 213 214 215 216 217 218 219
220 221 222 223 224 225 226 227 228 229
230 231 232 233 234 235 236 237 238 239
240 241 242 243 244 245 246 247 248 249

And this group is what’s leftover. These numbers have a leading 25 and a ones-digit in [0, 5]:

250 251 252 253 254 255

These rectangles can be individually matched with these expressions:

  • \d
  • \d\d
  • 1\d\d
  • 2[0-4]\d
  • 25[0-5]

To match the whole range, we union the rectangles together with |.

Here’s your TODO list for next time:

  • Examine regexes in a language, editor, or tool that you use regularly. On a quarter sheet, describe and write a short program that accomplishes some regex pattern matching task in that language. Aim for some task that involves capturing.
Sincerely,

P.S. It’s time for a haiku!

Where regex will fail
Matching primes, parsing web pages
Making friends, wooing

P.P.S. Here’s the code we wrote together:

dayne.js

function f() {
  console.log("Goodbye, Brendan!")
}

var a = 7;

var div = document.getElementById('junk')

var names = []

function g() {
  f = {names: []}
  var o = new Object()
}

missing_semi.rb

#!/usr/bin/env ruby

File.readlines('dayne.js').each_with_index do |line, i|
  line.chomp!
  if line =~ /[^;}{]$/
    puts "Line #{i}: #{line}"
  end
end

report

On day 1, {{{2 ** (1 - 1)}}} person was freshly infected.
On day 2, {{{2 ** (2 - 1)}}} person was freshly infected.
On day 3, {{{2 ** (3 - 1)}}} person was freshly infected.
On day 4, {{{2 ** (4 - 1)}}} person was freshly infected.
On day 5, {{{2 ** (5 - 1)}}} person was freshly infected.
On day 6, {{{2 ** (6 - 1)}}} person was freshly infected.
On day 7, {{{2 ** (7 - 1)}}} person was freshly infected.

evalreport.rb

#!/usr/bin/env ruby

src = File.read('report')

src.gsub!(/\{\{\{(.*)\}\}\}/) do
  eval($1)
end

puts src

fixquotes.rb

#!/usr/bin/env ruby

tag = '<img src=foo.png width=700 alt="Foobag">'

tag.gsub!(/(?<==)([^"].*?)(?=\s|>)/, '"\1"')

puts tag

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *