CS 330 Lecture 35 – Roogle, a Poor Man’s Search Engine
Agenda
- what ?s
- program this
- a simple search engine
- scrappy Ruby
- top-level functions
- imperative and object-oriented with some functional
- globals vs. locals
- Hash, Array, and Set
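The index built in roogle.rb below pairs two of these types: a Hash whose values are Sets of URL strings. A tiny sketch of that pattern (the URL here is just illustrative):

```ruby
require 'set'

# A Hash whose values are Sets -- the shape of the word-to-URLs index below.
index = Hash.new
word = 'ruby'
index[word] = Set.new unless index.include?(word)
index[word].add('www.example.com/a.html')
index[word].add('www.example.com/a.html')  # Sets ignore duplicates
puts index[word].size  # => 1
```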
Program This
- Write a pseudocode/Ruby function get_words that accepts some HTML as a parameter. Return an array of all non-tag words in the HTML body. Lowercase all such words.
- Write a pseudocode/Ruby function filter_links that accepts as parameters some HTML and a prefix URL. The prefix is the part of the URL that doesn’t include the page, e.g., http://prefix/index.php. Return a list of the links embedded in the HTML. Bonus features: lowercase all URLs, prepend the prefix onto any URL that doesn’t start with http, and strip off any http:// or https://.
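One possible filter_links, in the same regex-scraping style as get_words below. The href-matching regex and the assumption that relative links may or may not carry a leading slash are mine, not part of the exercise:

```ruby
# Pull the href targets out of anchor tags, lowercase them, glue the prefix
# onto relative URLs, and strip any leading http:// or https://.
def filter_links(html, prefix)
  urls = html.scan(/<a\s[^>]*href=["']?([^"'\s>]+)/i).flatten
  urls.map do |url|
    url = url.downcase
    # Relative link? Prepend the prefix (dropping any leading slash first).
    url = "#{prefix}/#{url.sub(/^\//, '')}" unless url.start_with?('http')
    url.sub(/^https?:\/\//, '')
  end
end
```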
TODO
- Make some progress on http://tryruby.org. Quarter sheet.
Code
Indexing Pseudocode
to crawl url
  for each word in HTML
    add url to index[word]
  for each link in HTML
    crawl link
Searching Pseudocode
to search for word
  urls = index[word]
  print urls
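The searching pseudocode translates almost line for line into Ruby. In this sketch, index is assumed to be any Hash from words to Sets of URLs, like the $word_to_urls hash that roogle.rb builds:

```ruby
require 'set'

# Look up a word in the index and print the URLs of the pages containing it.
# Unknown words yield an empty Set rather than nil.
def search(index, word)
  urls = index.fetch(word.downcase, Set.new)
  puts urls.to_a.join(' ')
  urls
end
```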
roogle.rb
#!/usr/bin/env ruby

require 'net/http'
require 'set'

# Pull all the non-tag words out of the HTML body, lowercased.
def get_words(html)
  html =~ /<body.*?>(.*)<\/body>/m
  body = $1
  body.gsub!(/<.*?>/, ' ')    # drop tags
  body.gsub!(/&.*?;/, ' ')    # drop entities like &amp;
  body.gsub!(/'s/, 's')       # keep possessives as single words
  body.gsub!(/\W/, ' ')       # turn remaining non-word characters into spaces
  words = body.scan(/\w+/).map do |word|
    word.downcase
  end
  words
end

# Maps each word to the Set of URLs of pages containing it.
$word_to_urls = Hash.new

def crawl(url)
  # Split the URL into host and page path.
  url =~ /^(.*?)(\/.*)$/
  host, page = $1, $2
  Net::HTTP.start(host, 80) do |http|
    html = http.get(page).body

    # Index all them thar words.
    words = get_words(html)
    words.each do |word|
      if not $word_to_urls.include? word
        $word_to_urls[word] = Set.new
      end
      $word_to_urls[word].add(url)
    end
  end
end

crawl('www.cs.uwec.edu/index.html')

$word_to_urls.each do |word, urls|
  puts "#{word} -> #{urls.to_a.join(' ')}"
end
Haiku
Tortoise and Hare raced
Afterward they fell in love
Soon Ruby was born