CS 330: Lecture 5 – Lookaround Assertions and Numeric Ranges
Dear students,
To start things off, we’re going to play some Regex Bingo. Find a partner and make a 4×4 grid of randomly generated strings. Include uppercase and lowercase letters, numbers, whitespace, and punctuation. Keep the strings short. There’s no free space.
I will call out the following regex, one at a time. Cross out all strings that match.
0: \b[A-Z] 1: \s\w\s 2: ^\d 3: ^\w{3}$ 4: \.\d 5: \d\w+\d 6: [$?!#]{2} 7: ^$ 8: ^(\d).*\1$ 9: \s\d 10: \(.*\) 11: ^[^abc] 12: \$\w+ 13: ^\d+$ 14: [02468][A-Z] 15: (^\.|\.$) 16: ^.{4}$ 17: (.)\1\1 18: [a-m].?[n-z] 19: [A-Z][a-z] 20: l[ao] 21: .\D{2,4}. 22: ^[A-Z][a-z]*$ 23: \d[^A-Za-z0-9]
Now let’s look at a few more problems:
- Find all the statements in a Javascript source file that don’t end in a semi-colon. (As a proxy, we’ll say all lines that have code on them that don’t end in a curly brace, comma, or semi-colon.)
- Evaluate embedded mathematical expressions in a report.
- Fix missing quotation marks around single-token attributes in HTML.
- Replace all numbers in [0, 255] with their binary representation with
b
suffix. For example6
turns into110b
.
Occasionally in our patterns, we find ourselves wanting to match some text X adjacent to some other text Y. We don’t really want to do anything with text Y, but we need it in our pattern to serve as an anchor. We end up capturing text Y only to reinsert it, unmodified. For example, in the missing quotation marks problem, we capture the attribute but reinsert it as is:
id.gsub!(/(\w+=)([^" ]+)/, '\1"\2"')
We didn’t really do anything with the attribute. It’d be great if we could just replace the content that we do care about. We can—with lookaround assertions:
id.gsub!(/(?<=\w+=)([^" ]+)/, '"\2"')
Lookaround assertions allow us to mark elements as anchor points, dictating where a match occurs, but not actually including the anchoring text as part of the match. I use them a lot in my text editor to position my cursor after some anchoring text. For example, to get my cursor just inside div
s with class foo
, I’d run the following search in Vim: /\(<div class="foo">\)\@<=
.
Additionally, sometimes we want to search for numbers that fall into a restricted numeric range. Regex may not be the best tool for this, but it can be made to work. Start by drawing out a table of the range you wish to capture. For the problem above, we want to find byte literals in [0, 255], so here’s our table:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 |
70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 |
80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 |
90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |
110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 |
140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 |
150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 |
160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 |
170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 |
180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 |
190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 |
210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 |
220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 |
230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 |
240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 |
250 | 251 | 252 | 253 | 254 | 255 |
Now we decompose this table into a series of rectangular groups, with members of each group having something in common. For example, this group has only single digits:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
This group has only two digits:
00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 |
10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 |
70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 |
80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 |
90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
This group has three digits with a leading 1:
100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 |
110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 |
140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 |
150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 |
160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 |
170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 |
180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 188 | 189 |
190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 |
This group has three digits with a leading 2 and a tens-digits in [0, 4]:
200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 |
210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 |
220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 |
230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 | 238 | 239 |
240 | 241 | 242 | 243 | 244 | 245 | 246 | 247 | 248 | 249 |
And this group is what’s leftover. These numbers have a leading 25 and a ones-digit in [0, 5]:
250 | 251 | 252 | 253 | 254 | 255 |
These rectangles can be individually matched with these expressions:
\d
\d\d
1\d\d
2[0-4]\d
25[0-5]
To match the whole range, we union the rectangles together with |
.
Here’s your TODO list for next time:
- Examine regexes in a language, editor, or tool that you use regularly. On a quarter sheet, describe and write a short program that accomplishes some regex pattern matching task in that language. Aim for some task that involves capturing.
P.S. It’s time for a haiku!
Where regex will fail
Matching primes, parsing web pages
Making friends, wooing
P.P.S. Here’s the code we wrote together:
dayne.js
function f() { console.log("Goodbye, Brendan!") } var a = 7; var div = document.getElementById('junk') var names = [] function g() { f = {names: []} var o = new Object() }
missing_semi.rb
#!/usr/bin/env ruby
File.readlines('dayne.js').each_with_index do |line, i|
line.chomp!
if line =~ /[^;}{]$/
puts "Line #{i}: #{line}"
end
end
report
On day 1, {{{2 ** (1 - 1)}}} person was freshly infected. On day 2, {{{2 ** (2 - 1)}}} person was freshly infected. On day 3, {{{2 ** (3 - 1)}}} person was freshly infected. On day 4, {{{2 ** (4 - 1)}}} person was freshly infected. On day 5, {{{2 ** (5 - 1)}}} person was freshly infected. On day 6, {{{2 ** (6 - 1)}}} person was freshly infected. On day 7, {{{2 ** (7 - 1)}}} person was freshly infected.
evalreport.rb
#!/usr/bin/env ruby
src = File.read('report')
src.gsub!(/\{\{\{(.*)\}\}\}/) do
eval($1)
end
puts src
fixquotes.rb
#!/usr/bin/env ruby
tag = '<img src=foo.png width=700 alt="Foobag">'
tag.gsub!(/(?<==)([^"].*?)(?=\s|>)/, '"\1"')
puts tag