CS 330 Lecture 7 – Find and Replace
Dear students,
We will focus on the final two of the three common operations for which we will use regular expressions:
- Asserting that text matches a pattern.
- Finding all matches of a pattern in a document.
- Replacing all matches of a pattern with some other text.
Finding all matches of a regular expression is done with String.scan. We may want the results as an array to be processed later:
matches = text.scan(/pattern/)
If there are any capturing groups in the pattern, the results will be an array of arrays. Element 0 will be [$1, $2, ...] for the first match, element 1 will hold the captures for the second match, and so on.
Or, we can process the array immediately by passing scan a block:
# For a pattern without capturing groups
text.scan(/pattern/) do
  process match
end

# For a pattern with capturing groups
text.scan(/pattern/) do |group1, group2, ...|
  process match
end
The block that we give to scan is expected to be a doer, a void function. It is not expected to return anything.
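To make the two calling styles concrete, here is a small sketch on a made-up string. The sample text and the method names times_in and split_times are invented for illustration and are not part of the lecture files.

```ruby
# Hypothetical sample text, not from the lecture files.
SAMPLE = 'Lecture at 10:30, lab at 14:05.'

# Without capturing groups, scan returns an array of matched strings.
def times_in(text)
  text.scan(/\d+:\d+/)
end

# With capturing groups, scan returns an array of arrays,
# one inner array of captures per match.
def split_times(text)
  text.scan(/(\d+):(\d+)/)
end

p times_in(SAMPLE)     # prints ["10:30", "14:05"]
p split_times(SAMPLE)  # prints [["10", "30"], ["14", "05"]]

# Block form: each capture becomes a block parameter.
SAMPLE.scan(/(\d+):(\d+)/) do |hour, minute|
  puts "hour=#{hour} minute=#{minute}"
end
```

Whatever the block evaluates to is ignored, which is why we treat it as a doer.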
Suppose we want to replace the text that we match. For that we can use String.gsub (for a global substitution) or String.sub (for a single substitution). gsub and sub return new strings, while gsub! and sub! modify the invoking strings. The substitution can be expressed several ways:
text.gsub!(/pattern/, 'replacement text')
text.gsub!(/pattern/, 'replacement \1 with captures \2')
text.gsub!(/pattern/, "replacement \\1 with captures \\2 and \n double quotes")
text.gsub!(/pattern/) do
  compute the replacement text, using $1, $2, ...
end
The block that we give to gsub is expected to be a returner, giving back the string that we want swapped in. It can contain arbitrary Ruby code that processes the matching text. This form of gsub is the most powerful because it is the most sensitive, which is how true power works.
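As a rough sketch of these forms side by side, consider the following. The method names and sample strings are made up for illustration.

```ruby
# String replacement: every run of digits becomes a '#'.
def mask_numbers(text)
  text.gsub(/\d+/, '#')
end

# Backreferences: \1 and \2 refer to the captured groups.
def swap_pair(text)
  text.gsub(/(\w+)=(\w+)/, '\2=\1')
end

# Block form: the block's return value is swapped in for each match.
def double_numbers(text)
  text.gsub(/\d+/) { |match| (match.to_i * 2).to_s }
end

puts mask_numbers('3 apples, 12 pears')    # prints: # apples, # pears
puts swap_pair('a=1 b=2')                  # prints: 1=a 2=b
puts double_numbers('3 apples, 12 pears')  # prints: 6 apples, 24 pears

# sub replaces only the first match and leaves the receiver alone;
# gsub! changes the receiver in place.
line = 'aaa'.dup
puts line.sub(/a/, 'b')  # prints: baa
line.gsub!(/a/, 'b')
puts line                # prints: bbb
```

The block form is the only one of the three that can compute something from the match, such as the arithmetic in double_numbers.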
Let’s write regex to do the following:
- List and number all the image URLs from img elements.
- List all the lines in a file that match a regex.
- Identify all the fields of study listed in a dictionary: the -ology, -nomy, and -nomics words.
- Locate all the string literals in a source file.
- Humanize identifiers, turning isUnderSiege to Is Under Siege.
- Fix missing quotation marks around attributes in HTML.
- Evaluate embedded mathematical expressions in a report.
imgripper.rb
#!/usr/bin/env ruby

html = IO.read('onion.html')
html.scan(/<img.*?src="(.*?)"/).each_with_index do |groups, i|
  url = groups[0]
  puts "#{i}. #{url}"
  # system("wget #{url}")
end
studies.rb
#!/usr/bin/env ruby

dictionary = IO.read('/usr/share/dict/words')

# dictionary.scan(/.*(.)\1\1.*/) do
#   puts $&
# end
# exit 0

dictionary.scan(/.+(ology|nomy|nomics|graphy)$/) do
  # puts $`
  puts $&
  # puts $'
end
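The commented-out lines in studies.rb print $` and $', which Ruby sets alongside $& after a match: the text before the match, the match itself, and the text after it. A minimal sketch, using a hypothetical helper named split_around and a made-up sample string:

```ruby
# Returns [text before the match, the match, text after the match],
# or nil when the pattern does not match.
def split_around(text, pattern)
  return nil unless text =~ pattern
  [$`, $&, $']
end

p split_around('see biology notes', /\w+ology/)
# prints ["see ", "biology", " notes"]
```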
foo.src
this is some code
here's a string literal: "hey, foobag!" and another on the same "line"
here's one with a backslash: "I think \"presidents\" should wear bodycams."
another: "asdf34543 32423 dfggd!!!"
literals.rb
#!/usr/bin/env ruby

src = IO.read('foo.src')
src.scan(/"(\\"|[^"])*"/) do
  puts $&
end
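literals.rb prints $& because the pattern contains a capturing group, so scan would otherwise hand back only the captures rather than the whole literal. One possible variation, sketched here with a hypothetical constant LITERAL and method string_literals, uses a non-capturing group so that scan returns the literals directly:

```ruby
# Hypothetical variation on the pattern in literals.rb:
#   (?: ... )  groups without capturing, so scan returns whole matches
#   \\.        matches a backslash plus the character it escapes
#   [^"\\]     matches any character that is neither a quote nor a backslash
LITERAL = /"(?:\\.|[^"\\])*"/

def string_literals(source)
  source.scan(LITERAL)
end

# Single quotes keep the backslashes in the sample as written.
sample = 'say "hi" then "a \"quoted\" word"'
puts string_literals(sample)
```

This is a sketch under the assumption that literals are double-quoted and never span lines containing unbalanced quotes. It does not know about comments or single-quoted strings.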
humanid.rb
#!/usr/bin/env ruby

id = ARGV[0].dup
id.gsub!(/([a-z])([A-Z])/, '\1 \2')
id[0] = id[0].upcase
puts id
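If we want to try the substitution without going through the command line, we could wrap it in a method. The name humanize is made up for this sketch, and the guard for an empty string is an addition that humanid.rb does not have.

```ruby
# Hypothetical helper wrapping the same substitution as humanid.rb.
def humanize(identifier)
  # Insert a space wherever a lowercase letter is followed by an uppercase one.
  words = identifier.gsub(/([a-z])([A-Z])/, '\1 \2')
  # Capitalize the first character, skipping the empty string.
  words[0] = words[0].upcase unless words.empty?
  words
end

puts humanize('isUnderSiege')  # prints: Is Under Siege
```

The pattern only splits where a lowercase letter meets an uppercase one, so a run of capitals stays together: humanize('parseHTMLPage') gives Parse HTMLPage.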