teaching machines

CS 240: Lecture 16 – Binary Search

September 29, 2021 by . Filed under cs3, fall-2021, lectures.

Dear students:

Even before computers, we humans collected massive amounts of data. We made dictionaries and phonebooks, for example. Finding an entry in our records can be difficult. One option is the sequential search, which looks like this:

to search(target)
  for each record
    if record matches target
      return record
  return null 

In the worst case, sequential search is $\Theta(n)$. That’s not terrible, but we can do better.

Binary Search

We can find a target faster if we sort the records first. If we pick a record at random from the list and compare it to the target, we know how to proceed. One of the three things will have happened:

We throw away the portion of the list in which we know the target doesn’t appear. That throwing away big chunks of the list is what makes binary search faster than sequential search.

How shall we pick a random element? To minimize the number of steps in the worst case, we pick the element in the exact middle of the list. That means we will throw away half of the list on every check. We won’t actually change the structure of the list, as that would be expensive. Instead, we’ll maintain a pair of indices that form a window within the list in which the target might appear.

This contains method examines an array to see if a certain number is within it:

public static boolean contains(ints[] xs, int target, int lo, int hi) {
  if (lo > hi) return false;
 
  int mid = (lo + hi) / 2;
  if (target < mid) {
    return contains(xs, target, 0, mid - 1);
  } else if (target > mid) {
    return contains(xs, target, mid + 1, hi);
  } else {
    return true;
  }
}

There are two base cases in this method. If the target does not appear in the list, then the window will eventually become invalid, in which case we return false. Or we find the target, in which case we return true. Otherwise, we defer to a recursive call to search one half of the list.

Analysis

Given our earlier discussion of algorithm analysis, you probably have a visceral feel for the complexity of binary search. We want to develop that visceral feel into something that we can defend mathematically.

Suppose we have a list of length 1. How many calls will we make before finding the target? Just 1 if the target is in the list.

Suppose we have a list of length 2. How many calls will we make before finding the target? In the worst case, 2.

Suppose we have a list of length 3. How many calls will we make before finding the target? In the worst case, 2.

Suppose we have a list of length 4. How many calls will we make before finding the target? In the worst case, 3.

Suppose we have a list of length 5. How many calls will we make before finding the target? In the worst case, 3.

Suppose we have a list of length 8. How many calls will we make before finding the target? In the worst case, 4.

Suppose we have a list of length 16. How many calls will we make before finding the target? In the worst case, 5. After just one check, we have reduced the problem down to a list of 8, which we already know has a cost of 4 calls.

Building a table will help us see the relationship between the size of the list and the number of calls made to contains:

size number of calls
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
n $\lfloor\log_2 n\rfloor + 1$

The base-2 logarithm tells us how many times a number can be divided in half until we reach 1. Since splitting the list in half is exactly what the binary search does, its complexity is $\Theta(\log_2 n)$.

Recurrence

Building a table and generalizing is one approach we can take to determining the complexity. That’s relationship between the input size and the cost is not always easy to determine, however. Another approach is to develop a recurrence relation. For binary search, we have these cases:

$$\begin{align}T(0) &= 1 \\T(1) &= 1 \\T(n) &= 1 + T(\tfrac{n}{2}) \\\end{align}$$

If someone calls our contains method on an array of size 8, we can use this recurrence relation to determine its cost:

$$\begin{align}T(8) &= 1 + T(\tfrac{8}{2}) \\ &= 1 + T(4) \\ &= 1 + 1 + T(2) \\ &= 1 + 1 + 1 + T(1) \\ &= 1 + 1 + 1 + 1 \\ &= 4 \\\end{align}$$

This is the same result as we had in our table, so that’s a relief.

The recurrence relation is correct but impractical. That self-reference is annoying. We want to find a closed-form solution that contains no self-reference.

$$\begin{align}T(n) &= 1 + T(\tfrac{n}{2}) \\ &= 1 + 1 + T(\tfrac{\tfrac{n}{2}}{2}) \\ &= 1 + 1 + T(\tfrac{n}{4}) \\ &= 1 + 1 + 1 + T(\tfrac{n}{8}) \\ &= 1 + 1 + 1 + 1 + T(\tfrac{n}{16}) \\\end{align}$$

We could keep expanding forever. After $i$ expansions, we have this expansion:

$$\begin{align}T(n) &= \underbrace{1 + \ldots + 1}_i + T(\tfrac{n}{2^i}) \\ &= i + T(\tfrac{n}{2^i})\end{align}$$

But we know the expansion will stop when we hit an initial condition. That will happen when the size of the input is reduced to 1. We can solve for the number of steps it takes to get there:

$$\begin{align}\frac{n}{2^i} &= 1 \\n &= 2^i \\\log_2 n &= \log_2 2^i \\\log_2 n &= i \cdot \log_2 2 \\\log_2 n &= i \cdot 1 \\\log_2 n &= i \\\end{align}$$

We drop this value of $i$ back into our generalized expansion and then simplify a bit:

$$\begin{align}T(n) &= \log_2 n + T(\tfrac{n}{2^{\log_2 n}}) \\ &= \log_2 n + T(\tfrac{n}{n}) \\ &= \log_2 n + T(1) \\ &= \log_2 n + 1 \\\end{align}$$

Hey, look at that. There’s no self-reference anymore. We have solved the recurrence. We can emphatically say that the worst case cost of binary search is $\log_2 n + 1$. And the argument rests on a mathematical foundation.

TODO

You have some work to do before the next class:

See you next time.

Sincerely,

P.S. It’s time for a haiku!

The prince sorted them
The slipper fit the fifth foot
“What magic!” they thought