CS 330: Lecture 4 – Find and Replace

February 5, 2018 by Chris Johnson. Filed under cs330, lectures, spring 2018.

Dear students,

Last week we started examining regular expressions (regex), a language for recognizing languages. We examined their syntax and theoretical background. I want to spend two more days discussing them. Today we look at several applications inspired by real-life needs that I’ve encountered:

Extract the URLs from all img elements. If you alter this to search for anchor tag hrefs instead and throw in some recursion, you’d have yourself your very own webcrawler.
List all the HTML tags in a file.
Find all the statements that don’t end in a semi-colon. (As a proxy, we’ll say all lines that have code on them that don’t end in a curly brace, comma, or semi-colon.)
Find all the public method names in a Java source file.
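To give a feel for how small these solutions can be, here is a minimal sketch of the second challenge, listing the HTML tags in a document. The sample markup and the variable names are made up for illustration, and the pattern is only a rough approximation of what counts as a tag name.

```ruby
# Hypothetical sample document; a real script would use File.read on a path.
html = <<~HTML
  <html>
    <body class="main">
      <h1>Title</h1>
      <p>Some <em>text</em>.</p>
      <img src="a.png">
    </body>
  </html>
HTML

# Match '<' followed by a name that starts with a letter. Closing tags
# start with '</', so they are skipped. scan returns one array of captures
# per match, which flatten turns into a flat list of names.
tag_names = html.scan(/<([a-zA-Z][a-zA-Z0-9]*)/).flatten.uniq

puts tag_names
```

Calling uniq keeps each tag name once, in order of first appearance. Drop it if you want every occurrence.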

In solving these challenges, we’re going to see a bit more of the Ruby language. We’ll see a few different methods for processing files and arrays.

Loading a file as an array of lines can be done using File.readlines:

lines = File.readlines('file.txt')

lines.each do |line|
  # process line
end

Or you can load the file into one string and call String#lines to break it up:

all = File.read('file.txt')

all.lines.each do |line|
  # process line
end

If you need line numbers, you can call each_with_index and add a parameter to your block:

lines.each_with_index do |line, i|
  # process line at index i
end
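As one possible use of each_with_index, here is a sketch of the missing-semicolon challenge from the list above. The Java snippet is invented for the example, and the test is only the proxy described earlier: a line is suspicious if its last non-whitespace character is not a curly brace, comma, or semicolon. Comments and multi-line statements will trip it up.

```ruby
# Hypothetical source text; File.readlines would supply real lines.
source = <<~JAVA
  int x = 5;
  int y = 6

  if (x < y) {
    System.out.println(x)
  }
JAVA

suspects = []
source.lines.each_with_index do |line, i|
  # The character class demands a non-whitespace character that is not
  # one of { } , ; and that is followed only by whitespace up to the end
  # of the line. Blank lines have no such character, so they never match.
  if line =~ /[^{},;\s]\s*$/
    suspects << i + 1 # indices start at 0, line numbers at 1
  end
end

suspects.each { |n| puts "line #{n} may be missing a semicolon" }
```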

Suppose instead of just finding text, we want to replace the text that we match. For that we can use String#gsub (for a global substitution) or String#sub (for a single substitution). gsub and sub return new strings, while gsub! and sub! modify the invoking strings in place. The substitution can be expressed several ways:

text.gsub!(/pattern/, 'replacement text')
text.gsub!(/pattern/, 'replacement \1 with captures \2')
text.gsub!(/pattern/, "replacement \\1 with captures \\2 and \n double quotes")
text.gsub!(/pattern/) do
  # compute and return the replacement text, using $1, $2, ...
end
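Those forms are easier to compare on a concrete string. The following is a small sketch with made-up text and variable names. It uses the non-destructive gsub and sub so that each result can be inspected separately.

```ruby
text = 'width=10 height=20'

# Plain replacement: every match becomes the same literal text.
plain = text.gsub(/\d+/, 'N')

# Single-quoted replacement: \1 and \2 refer to the captured groups.
swapped = text.gsub(/(\w+)=(\d+)/, '\2=\1')

# Double-quoted replacement: the backslashes themselves must be escaped.
arrow = text.gsub(/(\w+)=(\d+)/, "\\1 -> \\2")

# Block form: the block's value replaces each match, and $1 is the capture.
doubled = text.gsub(/(\d+)/) { ($1.to_i * 2).to_s }

# sub stops after the first match.
first_only = text.sub(/\d+/, 'N')

# The bang variants change the receiver itself.
copy = text.dup
copy.gsub!(/=/, ': ')

puts plain, swapped, arrow, doubled, first_only, copy
```

One thing to watch for: gsub! and sub! return nil when nothing matched, so chaining another call onto their result can fail.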

Let’s examine gsub by solving these challenges:

Humanize identifiers, turning isUnderSiege to Is Under Siege.
Fix missing quotation marks around single-token attributes in HTML.
Evaluate embedded mathematical expressions in a report.
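Here is a rough sketch of how the last two challenges might go. Both inputs are invented, and the curly-brace delimiter around expressions is an assumption made for this example, not a standard report format. The expression pattern only handles two integers joined by a single operator.

```ruby
# Challenge: quote single-token attribute values.
html = '<img src=a.png width=10 alt="x">'

# A name, an equals sign, then a value with no whitespace, quotes, or '>'.
# Values that are already quoted start with a quote, so they do not match.
quoted = html.gsub(/(\w+)=([^\s"'>]+)/, '\1="\2"')
puts quoted

# Challenge: evaluate embedded expressions such as {12 * 3}.
report = 'Sold {12 * 3} units across {4 + 5} stores.'

evaluated = report.gsub(/\{(\d+)\s*([-+*\/])\s*(\d+)\}/) do
  left = $1.to_i
  operator = $2
  right = $3.to_i
  # Invoke the operator as a method on the left operand: 12.public_send('*', 3).
  # Division here is integer division.
  left.public_send(operator, right).to_s
end
puts evaluated
```

Restricting the pattern to digits and four operators means no arbitrary text is ever executed, which is why this sketch avoids eval. The attribute pattern is similarly cautious but not airtight: a name=value pair sitting inside an already-quoted value would still be rewritten.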

Here’s your TODO list for next time:

Start working on the Regexercise homework. It is due before February 19. This means that I will grade very early on February 19.

Sincerely,

P.S. It’s time for a haiku!

President Y’s plan
gsub X’s policies
With this: 'not \1'

P.P.S. Here’s the code we wrote together:

imgripper.rb

#!/usr/bin/env ruby

html = File.read(ARGV[0])

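# [^>]*? keeps the match inside one img tag, so each src belongs to its own tag.
html.scan(/<img\s+[^>]*?src\s*=\s*"([^"]*)"/) do
  url = $1
  # Protocol-relative URLs (//host/path) need a scheme before curl can fetch them.
  if url =~ %r{^//}
    url = "https:#{url}"
  end
  # Pass the arguments separately so the shell never interprets the URL.
  system('curl', '-O', url)
end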

methods.rb

#!/usr/bin/env ruby

path = '/Users/johnch/checkouts/speccheck/src/org/twodee/speccheck/SpecChecker.java'
java = File.read(path)

# java.scan(/public\s+.*?(\w+)\(/) do
  # puts $1
# end

# scan yields one array of captures per match, so destructure it to get the name.
java.scan(/public\s+.*?(\w+)\(/).each_with_index do |(name), i|
  puts "#{i}. #{name}"
end

unity.rb

#!/usr/bin/env ruby

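# Sample identifiers tried in turn; only the last assignment takes effect.
id = 'isUnderSiege'
id = 'isFalse'
id = 'isOneWord'

# id.gsub(pattern, replacement)

# Capitalize the first letter separately so no space is inserted before it.
first = id[0].upcase
rest = id[1..-1]

# Insert a space before each remaining capital letter.
new_id = first + rest.gsub(/([A-Z])/, ' \1')
puts new_id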