CS 330: Lecture 4 – Find and Replace
Dear students,
Last week we started examining regular expressions, a language for recognizing languages. We examined their syntax and theoretical background. I want to spend two more days discussing them. Today we look at several applications inspired by real-life needs that I’ve encountered:
- Extract the URLs from all img elements. If you alter this to search for anchor tag hrefs instead and throw in some recursion, you’d have yourself your very own webcrawler.
- List all the HTML tags in a file.
- Find all the statements that don’t end in a semi-colon. (As a proxy, we’ll say all lines that have code on them that don’t end in a curly brace, comma, or semi-colon.)
- Find all the public method names in a Java source file.
In solving these challenges, we’re going to see a bit more of the Ruby language. We’ll see a few different methods for processing files and arrays.
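As a taste of where we are headed, here is a minimal sketch of the tag-listing challenge. It assumes String#scan with a single capture group, and the helper name tag_names and the sample string are made up for illustration. It is one possible approach, not necessarily the one we will settle on in class.

```ruby
# Hypothetical helper: collect the distinct tag names found in an HTML string.
def tag_names(html)
  # Match '<', an optional '/', then capture the tag name itself.
  # scan returns one array of captures per match, so flatten the result.
  html.scan(%r{</?([A-Za-z][A-Za-z0-9]*)}).flatten.map(&:downcase).uniq
end

sample = '<html><body><p>Hi <a href="#">there</a></p><IMG src="x.png"></body></html>'
puts tag_names(sample)
```

Comments and doctype declarations start with <! and are skipped because the name must begin with a letter.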
Loading a file as an array of lines can be done using File.readlines:
lines = File.readlines('file.txt')
lines.each do |line|
  # process line
end
Or you can load the file into one string and call String.lines to break it up:
all = File.read('file.txt')
all.lines.each do |line|
  # process line
end
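One detail worth checking for yourself: both approaches keep the trailing newline on each line. The sketch below writes a throwaway file with Tempfile (my choice, only so the example is self-contained) and compares the two.

```ruby
require 'tempfile'

# Write a small throwaway file so the example does not depend on file.txt.
file = Tempfile.new('demo')
file.write("alpha\nbeta\n")
file.close

from_readlines = File.readlines(file.path)
from_read = File.read(file.path).lines

p from_readlines                 # each element still ends in "\n"
p from_readlines == from_read    # the two approaches agree
p from_readlines.map(&:chomp)    # chomp strips the line terminator

file.unlink
```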
If you need line numbers, you can call each_with_index and add a parameter to your block:
lines.each_with_index do |line, i|
  # process line at index i
end
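Line numbers are exactly what the missing-semicolon challenge needs. Here is a rough sketch of the proxy rule described earlier (a line with code on it that doesn't end in a curly brace, comma, or semi-colon). The helper name suspicious_lines and the sample source are hypothetical.

```ruby
# Hypothetical helper: return [line_number, text] pairs for suspicious lines.
def suspicious_lines(lines)
  hits = []
  lines.each_with_index do |line, i|
    stripped = line.strip
    next if stripped.empty?           # no code on this line
    next if stripped =~ /[{},;]\z/    # ends in a brace, comma, or semi-colon
    hits << [i + 1, stripped]         # report 1-based line numbers
  end
  hits
end

source = <<~JAVA
  int x = 5;
  int y = 6

  if (x < y) {
    x++
  }
JAVA

suspicious_lines(source.lines).each do |number, text|
  puts "#{number}: #{text}"
end
```

The proxy is crude. Comment lines and annotations get flagged too, so treat the output as a list of places to look rather than a list of errors.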
Suppose instead of just finding text, we want to replace the text that we match. For that we can use String.gsub (for a global substitution) or String.sub (for a single substitution). gsub and sub return new strings, while gsub! and sub! modify the invoking strings. The substitution can be expressed several ways:
text.gsub!(/pattern/, 'replacement text')
text.gsub!(/pattern/, 'replacement \1 with captures \2')
text.gsub!(/pattern/, "replacement \\1 with captures \\2 and \n double quotes")
text.gsub!(/pattern/) do
  # compute the replacement text, using $1, $2, ...
end
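To make these forms concrete, here is a small runnable sketch. The sample string and variable names are invented for illustration.

```ruby
text = 'John Smith, Jane Doe'
pattern = /(\w+) (\w+)/

# gsub replaces every match and returns a new string.
swapped = text.gsub(pattern, '\2, \1')
puts swapped

# sub replaces only the first match.
first_only = text.sub(pattern, '\2, \1')
puts first_only

# In double quotes the backslashes themselves must be escaped.
double_quoted = text.gsub(pattern, "\\2, \\1")
puts double_quoted

# The block form computes each replacement from $1, $2, ...
initials = text.gsub(pattern) { "#{$1[0]}. #{$2}" }
puts initials

# The bang version modifies its receiver, so work on a copy.
copy = text.dup
copy.gsub!(pattern, 'X')
puts copy
puts text
```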
Let’s examine gsub by solving these challenges:
- Humanize identifiers, turning isUnderSiege to Is Under Siege.
- Fix missing quotation marks around single-token attributes in HTML.
- Evaluate embedded mathematical expressions in a report.
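Here is one way the last two challenges might be sketched with gsub. The helper names are hypothetical, and I'm assuming the embedded expressions are simple integer arithmetic of the form a op b, since the format of the report isn't pinned down here.

```ruby
# Hypothetical helper: wrap bare (single-token) attribute values in double quotes.
def quote_attributes(html)
  # The value must not start with a quote, so already-quoted attributes are skipped.
  html.gsub(/(\w+)=([^\s"'>]+)/, '\1="\2"')
end

# Hypothetical helper: replace "a op b" integer arithmetic with its value.
def evaluate_arithmetic(report)
  report.gsub(/(\d+)\s*([-+*\/])\s*(\d+)/) do
    # $2 holds the operator, which is also the name of an Integer method.
    $1.to_i.public_send($2, $3.to_i).to_s
  end
end

puts quote_attributes('<img src=cat.png width=100 alt="a cat">')
puts evaluate_arithmetic('We sold 12 * 4 widgets and 7 + 5 gadgets.')
```

Both patterns are deliberately naive. The first will also rewrite name=value text that appears outside a tag or inside a quoted value, and the second will happily "evaluate" a phone number like 555-1234, so a real report would need a stricter way of marking expressions.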
Here’s your TODO list for next time:
- Start working on the Regexercise homework. It is due before February 19. This means that I will grade very early on February 19.
P.S. It’s time for a haiku!
President Y’s plan
gsub X’s policies
With this: 'not \1'
P.P.S. Here’s the code we wrote together:
imgripper.rb
#!/usr/bin/env ruby
html = File.read(ARGV[0])
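# [^>]*? keeps the match inside one tag, so several img elements on one line are all found.
html.scan(/<img\s+[^>]*?src\s*=\s*"([^"]*)"/) do
  url = $1
  # Protocol-relative URLs (//host/path) need an explicit scheme for curl.
  if url =~ %r{^//}
    url = "https:#{url}"
  end
  # Pass the arguments separately so the shell never interprets the URL.
  system('curl', '-O', url)
end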
methods.rb
#!/usr/bin/env ruby
path = '/Users/johnch/checkouts/speccheck/src/org/twodee/speccheck/SpecChecker.java'
java = File.read(path)
# java.scan(/public\s+.*?(\w+)\(/) do
#   puts $1
# end
# scan returns one array of captures per match, so destructure it with (name).
java.scan(/public\s+.*?(\w+)\(/).each_with_index do |(name), i|
  puts "#{i}. #{name}"
end
unity.rb
#!/usr/bin/env ruby
id = 'isUnderSiege'
id = 'isFalse'
id = 'isOneWord'
# id.gsub(pattern, replacement)
first = id[0].upcase
rest = id[1..-1]
newID = first + rest.gsub(/([A-Z])/, ' \1')
puts newID