» CS 330: Lecture 5 – Lookaround Assertions and Numeric Ranges

CS 330: Lecture 5 – Lookaround Assertions and Numeric Ranges

February 7, 2018 by Chris Johnson. Filed under cs330, lectures, spring 2018.

Dear students,

To start things off, we’re going to play some Regex Bingo. Find a partner and make a 4×4 grid of randomly generated strings. Include uppercase and lowercase letters, numbers, whitespace, and punctuation. Keep the strings short. There’s no free space.

I will call out the following regex, one at a time. Cross out all strings that match.

 0: \b[A-Z]...
 1: \s\w\s...
 2: ^\d...
 3: ^\w{3}$...
 4: \.\d...
 5: \d\w+\d...
 6: [$?!#]{2}...
 7: ^$...
 8: ^(\d).*\1$...
 9: \s\d...
10: \(.*\)...
11: ^[^abc]...
12: \$\w+...
13: ^\d+$...
14: [02468][A-Z]...
15: (^\.|\.$)...
16: ^.{4}$...
17: (.)\1\1...
18: [a-m].?[n-z]...
19: [A-Z][a-z]...
20: l[ao]...
21: .\D{2,4}....
22: ^[A-Z][a-z]*$...
23: \d[^A-Za-z0-9]...

Now let’s look at a few more problems:

Find all the statements in a Javascript source file that don’t end in a semi-colon. (As a proxy, we’ll say all lines that have code on them that don’t end in a curly brace, comma, or semi-colon.)
Evaluate embedded mathematical expressions in a report.
Fix missing quotation marks around single-token attributes in HTML.
Replace all numbers in [0, 255] with their binary representation with b suffix. For example 6 turns into 110b.

Occasionally in our patterns, we find ourselves wanting to match some text X adjacent to some other text Y. We don’t really want to do anything with text Y, but we need it in our pattern to serve as an anchor. We end up capturing text Y only to reinsert it, unmodified. For example, in the missing quotation marks problem, we capture the attribute but reinsert it as is:

id.gsub!(/(\w+=)([^" ]+)/, '\1"\2"')

We didn’t really do anything with the attribute. It’d be great if we could just replace the content that we do care about. We can—with lookaround assertions:

id.gsub!(/(?<=\w+=)([^" ]+)/, '"\2"')

Lookaround assertions allow us to mark elements as anchor points, dictating where a match occurs, but not actually including the anchoring text as part of the match. I use them a lot in my text editor to position my cursor after some anchoring text. For example, to get my cursor just inside divs with class foo, I’d run the following search in Vim: /\(<div class="foo">\)\@<=.

Additionally, sometimes we want to search for numbers that fall into a restricted numeric range. Regex may not be the best tool for this, but it can be made to work. Start by drawing out a table of the range you wish to capture. For the problem above, we want to find byte literals in [0, 255], so here’s our table:

0	1	2	3	4	5	6	7	8	9
00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99
100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199
200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249
250	251	252	253	254	255

Now we decompose this table into a series of rectangular groups, with members of each group having something in common. For example, this group has only single digits:

This group has only two digits:

00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99

This group has three digits with a leading 1:

100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199

This group has three digits with a leading 2 and a tens-digits in [0, 4]:

200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249

And this group is what’s leftover. These numbers have a leading 25 and a ones-digit in [0, 5]:

250

251

252

253

254

255

These rectangles can be individually matched with these expressions:

\d
\d\d
1\d\d
2[0-4]\d
25[0-5]

To match the whole range, we union the rectangles together with |.

Here’s your TODO list for next time:

Examine regexes in a language, editor, or tool that you use regularly. On a quarter sheet, describe and write a short program that accomplishes some regex pattern matching task in that language. Aim for some task that involves capturing.

Sincerely,

P.S. It’s time for a haiku!

Where regex will fail
Matching primes, parsing web pages
Making friends, wooing

P.P.S. Here’s the code we wrote together:

dayne.js

function f() {
  console.log("Goodbye, Brendan!")
}

var a = 7;

var div = document.getElementById('junk')

var names = []

function g() {
  f = {names: []}
  var o = new Object()
}

missing_semi.rb

#!/usr/bin/env ruby

File.readlines('dayne.js').each_with_index do |line, i|
  line.chomp!
  if line =~ /[^;}{]$/
    puts "Line #{i}: #{line}"
  end
end

report

On day 1, {{{2 ** (1 - 1)}}} person was freshly infected.
On day 2, {{{2 ** (2 - 1)}}} person was freshly infected.
On day 3, {{{2 ** (3 - 1)}}} person was freshly infected.
On day 4, {{{2 ** (4 - 1)}}} person was freshly infected.
On day 5, {{{2 ** (5 - 1)}}} person was freshly infected.
On day 6, {{{2 ** (6 - 1)}}} person was freshly infected.
On day 7, {{{2 ** (7 - 1)}}} person was freshly infected.

evalreport.rb

#!/usr/bin/env ruby

src = File.read('report')

src.gsub!(/\{\{\{(.*)\}\}\}/) do
  eval($1)
end

puts src

fixquotes.rb

#!/usr/bin/env ruby

tag = '<img src=foo.png width=700 alt="Foobag">'

tag.gsub!(/(?<==)([^"].*?)(?=\s|>)/, '"\1"')

puts tag

0	1	2	3	4	5	6	7	8	9
00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99
100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199
200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249
250	251	252	253	254	255

00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99

100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199

200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249

0	1	2	3	4	5	6	7	8	9
00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99
100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199
200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249
250	251	252	253	254	255

00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99

100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199

200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249

0	1	2	3	4	5	6	7	8	9
00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99
100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199
200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249
250	251	252	253	254	255

00	01	02	03	04	05	06	07	08	09
10	11	12	13	14	15	16	17	18	19
20	21	22	23	24	25	26	27	28	29
30	31	32	33	34	35	36	37	38	39
40	41	42	43	44	45	46	47	48	49
50	51	52	53	54	55	56	57	58	59
60	61	62	63	64	65	66	67	68	69
70	71	72	73	74	75	76	77	78	79
80	81	82	83	84	85	86	87	88	89
90	91	92	93	94	95	96	97	98	99

100	101	102	103	104	105	106	107	108	109
110	111	112	113	114	115	116	117	118	119
130	131	132	133	134	135	136	137	138	139
140	141	142	143	144	145	146	147	148	149
150	151	152	153	154	155	156	157	158	159
160	161	162	163	164	165	166	167	168	169
170	171	172	173	174	175	176	177	178	179
180	181	182	183	184	185	186	187	188	189
190	191	192	193	194	195	196	197	198	199

200	201	202	203	204	205	206	207	208	209
210	211	212	213	214	215	216	217	218	219
220	221	222	223	224	225	226	227	228	229
230	231	232	233	234	235	236	237	238	239
240	241	242	243	244	245	246	247	248	249