CS 330 Lecture 35 – Roogle, a Poor Man’s Search Engine
Agenda
- what ?s
- program this
- a simple search engine
- scrappy Ruby
- top-level functions
- imperative and object-oriented with some functional
- globals vs. locals
- Hash, Array, and Set
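The index built in roogle.rb below pairs two of these types: a Hash whose values are Sets of URL strings. A tiny sketch of that pattern (the URL here is just illustrative):

```ruby
require 'set'

# A Hash whose values are Sets -- the shape of the word-to-URLs index below.
index = Hash.new
word = 'ruby'
index[word] = Set.new unless index.include?(word)
index[word].add('www.example.com/a.html')
index[word].add('www.example.com/a.html')  # Sets ignore duplicates
puts index[word].size  # => 1
```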
Program This
- Write a pseudocode/Ruby function get_words that accepts some HTML as a parameter. Return an array of all non-tag words in the HTML body. Lowercase all such words.
- Write a pseudocode/Ruby function filter_links that accepts as parameters some HTML and a prefix URL. The prefix is the part of the URL that doesn’t include the page, e.g., http://prefix/index.php. Return a list of the links embedded in the HTML. Bonus features: lowercase all URLs, prepend the prefix onto any URL that doesn’t start with http, and strip off any http:// or https://.
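One possible filter_links, in the same regex-scraping style as get_words below. The href-matching regex and the assumption that relative links may or may not carry a leading slash are mine, not part of the exercise:

```ruby
# Pull the href targets out of anchor tags, lowercase them, glue the prefix
# onto relative URLs, and strip any leading http:// or https://.
def filter_links(html, prefix)
  urls = html.scan(/<a\s[^>]*href=["']?([^"'\s>]+)/i).flatten
  urls.map do |url|
    url = url.downcase
    # Relative link? Prepend the prefix (dropping any leading slash first).
    url = "#{prefix}/#{url.sub(/^\//, '')}" unless url.start_with?('http')
    url.sub(/^https?:\/\//, '')
  end
end
```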
TODO
- Make some progress on http://tryruby.org. Quarter sheet.
Code
Indexing Pseudocode
to crawl url
  for each word in HTML
    add url to index[word]
  for each link in HTML
    crawl link
Searching Pseudocode
to search for word
  urls = index[word]
  print urls
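The searching pseudocode translates almost line for line into Ruby. In this sketch, index is assumed to be any Hash from words to Sets of URLs, like the $word_to_urls hash that roogle.rb builds:

```ruby
require 'set'

# Look up a word in the index and print the URLs of the pages containing it.
# Unknown words yield an empty Set rather than nil.
def search(index, word)
  urls = index.fetch(word.downcase, Set.new)
  puts urls.to_a.join(' ')
  urls
end
```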
roogle.rb
#!/usr/bin/env ruby

require 'net/http'
require 'set'

# Pull all the non-tag words out of the HTML body, lowercased.
def get_words(html)
  html =~ /<body.*?>(.*)<\/body>/m
  body = $1
  body.gsub!(/<.*?>/, ' ')    # drop tags
  body.gsub!(/&.*?;/, ' ')    # drop entities like &amp;
  body.gsub!(/'s/, 's')       # keep possessives as single words
  body.gsub!(/\W/, ' ')       # turn remaining non-word characters into spaces
  words = body.scan(/\w+/).map do |word|
    word.downcase
  end
  words
end

# Maps each word to the Set of URLs of pages containing it.
$word_to_urls = Hash.new

def crawl(url)
  # Split the URL into host and page path.
  url =~ /^(.*?)(\/.*)$/
  host, page = $1, $2
  Net::HTTP.start(host, 80) do |http|
    html = http.get(page).body

    # Index all them thar words.
    words = get_words(html)
    words.each do |word|
      if not $word_to_urls.include? word
        $word_to_urls[word] = Set.new
      end
      $word_to_urls[word].add(url)
    end
  end
end

crawl('www.cs.uwec.edu/index.html')

$word_to_urls.each do |word, urls|
  puts "#{word} -> #{urls.to_a.join(' ')}"
end
Haiku
Tortoise and Hare raced
Afterward they fell in love
Soon Ruby was born