teaching machines

CS 240: Lecture 33 – Hashing, Part 2

November 10, 2021 by . Filed under cs3, fall-2021, lectures.

Dear students:

Last time we introduced hashing as a vehicle for mapping arbitrary objects to locations in an array. One of the most popular applications of hashing is to implement a dictionary, which makes it easy to look up a value based on some key. Dictionaries are used to implement real dictionaries, email and phone directories, the domain name service, indices, and so on.

We could implement a naive dictionary in the following manner:

class Dictionary<K, V>
  constructor
    entries = new array[somePrimeNumber]

  put(K key, V value)
    i = (key.hash & 0x7FFFFFFF) % entries.size
    entries[i] = [key, value]

  get(K key)
    i = (key.hash & 0x7FFFFFFF) % entries.size
    return entries[i]

The key is the element whose hash code is used to compute the elements index. We can also use hashing to implement a hash set. In that case, we wouldn’t have a key-value pair. We’d have just the item itself, and we’d use its own hash code to place it in the array. For the remainder of discussion of hashing, we’ll assume that it’s being used in a dictionary.

Load Factor

Some of the behaviors of a hashed collection examine the collection’s load factor, which has this definition:

$$\text{load factor} = \frac{\text{element count}}{\text{array size}}$$

A load factor of 0.5 means that 50% of the cells are occupied. A new entry is more likely to collide than land in an empty cell. For this reason, we want to avoid large load factors. We are not trying to be efficient with space but rather lookup time. What can we do to ensure that the load factor stays small? If the load factor goes above 0.5, we create a bigger array and move all the elements over, much liked we did with an array list.

Collisions

In a hash table, we can’t totally eliminate collisions; we can only hope to minimize them. So, what happens when we do have a collision? We have a few options:

Separate Chaining

In the worst case, what will the cost of insertion and search be with separate chaining? The collection will degenerate to a linked list, so these operations will run in linear time. Achieving good performance in a collection is only possible if the hash function evenly distributes the items.

Because of these collisions, looking up a value is not just a matter of finding the right cell and giving back the item we find there. We must inspect each item in the chain and find the one whose key matches. That means the table must store both the keys and values.

The OpenJDK Hashtable uses separate chaining.

Closed Hashing

With closed hashing, the entries don’t leave the hash table. If there’s a collision, we follow a plan for finding a new spot for the entry. The plan is dictated by a probe function. The probe function considers the key and the number of collisions experienced so far during the hunt for an empty cell. It yields the number of cells to advance forward from the original index. To move forward one cell on every step of the hunt, we’d have this probe function:

$$p(\text{key}, \text{ncollisions}) = \text{ncollisions}$$

Such linear probing easily leads to pileups of colliding entries. The colliding item just gloms on to the mass of previous colliding items. Some of these collisions can be eliminated by spreading out the steps of the hunt for an empty cell. With quadratic probing, we grow the step size by squaring the number of collisions:

$$p(\text{key}, \text{ncollisions}) = \text{ncollisions}^2$$

On the first collision, we move forward 1 cell. On the second, 4 cells. On the third, 9 cells. And so on.

Quadratic probing reduces the number of incidental collisions that happen when entire neighborhoods of cells become a mass of colliding entries, but it doesn’t fix the long hunts that happen when entries map to the same index. To keep the hunts short, we need to vary the step size. We can make the probe function consider the key itself to disperse colliding elements:

$$p(\text{key}, \text{ncollisions}) = (1 + \text{key} \mathbin{\%} 7) \cdot \text{ncollisions}$$

Closed hashing has a few limitations. First, when the load factor reaches 1, there’s no more room. The table must be expanded to accommodate more entries. Separate chaining doesn’t have this same restriction. Second, when you remove an item, you can’t just clear the cell. Otherwise if you later go to look up an item that had previously collided with the removed item, you will assume that the item doesn’t exist because the cell is empty. Instead, you must mark that the cell was occupied somehow and that you should keep probing to resolve the collision. The marker is affectionately called a tombstone because it marks the grave of a removed item.

Exercises

For the remainder of our time, we will work through some exercises on hashing.

TODO

You have some work to do before the next class:

See you next time.

Sincerely,

P.S. It’s time for a haiku!

I got you something
It’s a fast dictionary
It’s less than half full