teaching machines

CS 240: Lecture 32 – Hashing, Part 1

November 8, 2021 by . Filed under cs3, fall-2021, lectures.

Dear students:

This course is really a marathon of one-upping. We introduce a data structure only to knock it down with the next one. And then we knock that one down when we introduce a third. Let’s do it again.

Consider the problem of searching for something in a collection. With a plain array, we find an item using sequential search, which runs in linear time. Linear time is okay, but not great. We get better performance by sorting the array and using binary search, which runs in logarithmic time. That’s really good, except adding items to a sorted array is a linear time operation. A self-balancing binary search tree one-ups the sorted array, as it supports both searching and insertion in logarithmic time.

Can we do better? Yes, but we must abandon the idea of sorting the collection. Instead, we will use a plain old array whose contents are not in any discernable order. But this array is paired with a magical function that derives a key’s index from the key’s state. That function is called a hash function. It gets its name from its job: it stirs together the state of an element into a soup of bits called a hash code. In many programming languages, that soupy hash code is a 32-bit integer.

If we have a hash function, we could insert an element into an array with code like this:

void put(T item) {
  int hash = item.hashCode();
  elements[hash] = item;
}

However, there are several problems with this. First, how many different hash codes might be generated? About 4 billion. Only 2 billion of those are positive and therefore legal array indices. The array would have to be very large to support all those possible indices. We can get by with smaller arrays if we let hash codes that exceed the bounds wrap back to the beginning of the array. That’s a job for the mod operator.

Second, what if two different keys produce the same hash code and therefore map to the same cell of the array? There are several ways we might fix this, and we’ll talk about them in a moment.

If we can get this hashing business to work, here’s the headline for tomorrow’s newspaper: “SEARCH AND INSERT IN CONSTANT TIME!” Because that’s how fast this code is, assuming that the hash code itself can be computed in constant time. A collection whose performance doesn’t degrade as its size increases is a breakthrough.

Hashing

The hash function’s job is to turn an arbitrary object into an integer. It must achieve these goals:

Before we can use the hash code as an index, it must be modified to follow this additional rule:

But it’s not the job of the hash function to perform this compression. Our hashing collection will do that. It will have to ensure that the hash code is positive and that it’s a valid index. The OpenJDK implementation of Java transforms it like so:

int index = (hash & 0x7FFFFFFF) % table.length;
table[index] = item;

The bit masking shaves off the 1 that may make the number negative.

There are many schemes we could use to map an object to an integer. In OpenJDK, the Integer class computes a hash code like this:

public class Integer {
  // ...

  public int hashCode() {
    return Integer.hashCode(value);
  }

  public static int hashCode(int value) {
    return value;
  }
}

The hash code is just the integer itself. The other, smaller integer types probably work similarly. That leads me to wonder how Boolean does its job. Here’s the OpenJDK implementation:

public class Boolean {
  // ...

  public int hashCode() {
    return Boolean.hashCode(value);
  }

  public static int hashCode(boolean value) {
    return value ? 1231 : 1237;
  }
}

That’s not what I was expecting.

Suppose our objects are strings. We could implement our hash code function by just finding the ordinal value of the first letter:

class String {
  // ...

  public int hashCode() {
    return characters[0];
  }
}

The string “empathy” would map to “e”. The string “authenticity” would map to “a”. The string “rhythmic” would map to “r”. How do you feel about this scheme? Is it a good one? Not really. There are many possible strings, but far fewer hash codes. If all of the entries began with lowercase letters, they would all collide at the same 26 cells of the array.

The String class implements hash code with more sophistication than just the first letter:

public class String {
  // ...

  public int hashCode() {
    int h = hash;

    if (h == 0 && value.length > 0) {
      char[] val = value;

      for (int i = 0; i < value.length; i++) {
        h = 31 * h + val[i];
      }

      hash = h;
    }

    return h;
  }
}

This code requires some interpretation. The assignments involving hash reveal that the hash code is cached. Instead of computing the hash code each time the function is called, it is stored in an instance variable. This is a wise strategy in Java because strings are immutable. The characters won’t change and thus invalidate the hash code.

The hash code isn’t computed until it’s first needed. It starts off as 0. This technique is called lazy initialization.

The summation is easier to understand if we write it in a mathematical notation:

$$\textrm{hash} = s[0] \cdot 31^{n – 1} + s[1] \cdot 31^{n – 2} + s[2] \cdot 31^{n-3} + \ldots + s[n – 1] \cdot 31^{0}$$

This hashing function is engineered to prevent strings with similar prefixes or with the same characters in a different order from having hash codes that cluster together.

The hash code for ArrayList is defined similarly:

public abstract class AbstractList {
  // ...

  public int hashCode() {
    int hashCode = 1;
    for (E e : this)
      hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
  }
}

When you create your own objects, you are encouraged to implement your own hashCode function. It’s not a bad idea to mimic the the strategy of computing a weighted sum of the instance variables’ hash codes:

public class SomeClass {
  // ...

  public int hashCode() {
    int hashCode = 7;
    hashCode = 31 * hashCode + instanceVariableA.hashCode();
    hashCode = 31 * hashCode + instanceVariableB.hashCode();
    hashCode = 31 * hashCode + instanceVariableC.hashCode();
    // ...
    return hashCode;
  }
}

The Object class does provide a default implementation of hashCode, but it doesn’t uphold an important property: all objects that are considered equal should produce the same hash code. This is an important property because if you want to look something up later, you want the key that you provide to match the key that’s associated with the element. The default hashCode builds the hash out of some unique identifier that is different for every instance of an object, even if it has the same state.

Collision

We’ve seen how the hash gets turned into an index in the array. What size should this array have? There’s no an easy answer, but there are certain sizes that you should avoid. Suppose you have an array of size 10 and someone inserts items with hash codes 5, 10, 15, 20, 25, 30, 35, … These will map to just two cells in the array.

This pattern of collisions will happen whenever the data and the array size have common factors. An array of size 12 will suffer when the data comes in multiples of 2, 3, 4, or 6. An array of size 48 will suffer when the data comes in multiples of 2, 3, 4, 6, 8, 12, 16, and 24. Without knowing the distribution of keys ahead of time, the only surefire way of prevent this clustering is to make the table have a size that is a prime number. It won’t share any factors with the sequence of data except for the length itself.

In our earlier example, what number yields the first collision if we pick an array size of 11? We’ll solve this as a peer instruction exercise.

We can’t totally eliminate collisions; we can only hope to minimize them. So, what happens when we do have a collision? We have a few options:

With open hashing, where do we place the new item in the linked list? There’s no right answer. Putting it at the end would preserve the insertion order, if that matters.

In the worst case, what will the cost of insertion and search be? The collection will degenerate to linked list, so these operations will run in linear time. Achieving good performance in a collection is only possible if the hash function evenly distributes the items.

Dictionary

The examples above treat each item in the array as a standalone object. We can insert new items and see if the collection contains an item. The hash code would be derived from the item itself. If we were to code this behavior up, we’d have a HashSet class. It’d be a very fast set implementation.

More commonly, we use hashing to implement a dictionary. Each item is a key-value pair. The key is used to look up the value, and it is the key’s hash code that we use to compute the pair’s index. We might implement this dictionary, which is often called a hash table, like this:

class Hashtable<K, V> {
  private Entry<K, V> entries;

  // ...

  public void put(K key, V value) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % entries.length;
    entries[index] = new Entry<>(key, value);
  }

  public boolean contains(K key) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % entries.length;
    Entry<> entry = entries[index];
    return entry != null && entry.key.equals(key);
  } 
}

There’s no open hashing in this rough draft. Note that we need to keep the key in the collection because later we may need it to resolve collisions.

TODO

You have some work to do before the next class:

See you next time.

Sincerely,

P.S. It’s time for a haiku!

You want the window
But the guy says, “That’s my seat”
Ask him his hash code