teaching machines

CS 330 Lecture 35 – Roogle, a Poor Man’s Search Engine

May 1, 2013. Filed under cs330, lectures, spring 2013.

Agenda

Program This

  1. Write a pseudocode/Ruby function get_words that accepts some HTML as a parameter. Return an array of all non-tag words in the HTML body. Lowercase all such words.
  2. Write a pseudocode/Ruby function filter_links that accepts as parameters some HTML and a prefix URL. The prefix is the part of the URL that doesn’t include the page, e.g., http://prefix in http://prefix/index.php. Return a list of the links embedded in the HTML. Bonus features: lowercase all URLs, prepend the prefix onto any URL that doesn’t start with http, and strip off any http:// or https://. (One possible solution is sketched after this list.)
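
Here’s one possible solution to filter_links. The regex and the exact prefix handling are assumptions on my part, not the official answer:

def filter_links(html, prefix)
  # Pull the href targets out of all the anchor tags.
  links = html.scan(/<a\s[^>]*href=["'](.*?)["']/im).flatten

  links.map do |link|
    link = link.downcase
    # Glue the prefix onto relative links.
    link = "#{prefix}/#{link}" unless link.start_with?('http')
    # Strip off any scheme.
    link.sub(/^https?:\/\//, '')
  end
end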

TODO

Code

Indexing Pseudocode

to crawl url
  for each word in HTML
    add url to index[word]
  for each link in HTML
    crawl link

Searching Pseudocode

to search for word
  urls = index[word]
  print urls
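
In Ruby, searching is little more than a hash lookup. A minimal sketch, assuming the $word_to_urls index that roogle.rb builds below:

def search(word)
  urls = $word_to_urls[word.downcase]
  if urls.nil?
    puts "no pages contain #{word}"
  else
    puts urls.to_a.join("\n")
  end
end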

roogle.rb

#!/usr/bin/env ruby

require 'net/http'
require 'set'

def get_words(html)
  # Grab just the body of the page.
  html =~ /<body.*?>(.*)<\/body>/m
  body = $1
  return [] if body.nil?

  # Strip out tags, entities, and punctuation, folding
  # possessives like dog's into dogs first.
  body.gsub!(/<.*?>/, ' ')
  body.gsub!(/&.*?;/, ' ')
  body.gsub!(/'s/, 's')
  body.gsub!(/\W/, ' ')

  # Collect the surviving words, lowercased.
  words = body.scan(/\w+/).map do |word|
    word.downcase
  end

  words
end

# Map each word to the set of URLs whose pages contain it.
$word_to_urls = Hash.new

def crawl(url)
  # Split the URL into host and page, e.g.,
  # www.cs.uwec.edu and /index.html.
  url =~ /^(.*?)(\/.*)$/
  host, page = $1, $2

  Net::HTTP.start(host, 80) do |http|
    html = http.get(page).body

    # Index all them thar words.
    words = get_words(html)
    words.each do |word|
      if not $word_to_urls.include? word
        $word_to_urls[word] = Set.new
      end
      $word_to_urls[word].add(url)
    end
  end
end

crawl('www.cs.uwec.edu/index.html')

$word_to_urls.each do |word, urls|
  puts "#{word} -> #{urls.to_a.join(' ')}"
end
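
The crawl above indexes just one page, but the indexing pseudocode also follows links. Here’s a sketch of that extension, assuming filter_links from the exercise above is written. A visited set keeps cycles of links from looping forever:

$visited = Set.new

def crawl_all(url)
  # Don't index a page twice.
  return if $visited.include?(url)
  $visited.add(url)

  url =~ /^(.*?)(\/.*)$/
  host, page = $1, $2

  Net::HTTP.start(host, 80) do |http|
    html = http.get(page).body

    # Index all them thar words, just like crawl.
    get_words(html).each do |word|
      ($word_to_urls[word] ||= Set.new).add(url)
    end

    # Follow each embedded link. filter_links is assumed
    # to return scheme-less URLs like the one crawl expects.
    filter_links(html, host).each do |link|
      crawl_all(link)
    end
  end
end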

Haiku

Tortoise and Hare raced
Afterward they fell in love
Soon Ruby was born