CS 240: Lecture 3 - Searching

Dear students:

Last lecture we encountered some algorithms whose execution took longer than we'd like. Today, we formalize this notion of evaluating an algorithm's cost, use it to judge a couple of search algorithms, and then drop the costs into a fixed set of buckets called complexity classes.

Measuring Algorithms

We have an arsenal of tools at our disposal, and each tool has its strengths and weaknesses. To be able to choose which tool to use, we need a way of judging them. Two possible measures are an algorithm's execution time and its number of steps.

With both the list experiment and the biggest difference experiment from last lecture, we used execution time to measure time efficiency. Execution time turns out not to be the best measurement, chiefly because it depends on the machine, its current load, and the language implementation rather than on the algorithm itself.

A better measure is the number of steps. A step is loosely defined as a single operation. An algorithm's number of operations is expressed as a function \(T\) of the input size \(n\):

$$T(n) = \mathrm{?}$$

Consider this algorithm:

// Sum the elements of numbers.
int total = 0;
for (int i = 0; i < numbers.length; ++i) {
  total += numbers[i];
}

How many steps are taken? The += happens n times. So does the array indexing. So does ++i. So does the comparison. We could define \(T\) for this algorithm in this way:

$$T(n) = 4n$$

But don't forget the assignment that happens before the loop:

$$T(n) = 4n + 1$$

Even then, this isn't the code that ultimately gets executed. Rather, the code gets compiled down to a bunch of machine instructions. The array index might expand into five machine instructions. The comparison might expand into three. So, \(T\) might more realistically be defined this way:

$$T(n) = 26n + 1$$

While this may be more accurate, the definition of \(T\) now depends on the machine again, which will make comparing algorithms on different machines difficult. For this reason, we define \(T\) more loosely in terms of some constant coefficients:

$$T(n) = cn + d$$

How should we define \(T\) for this algorithm?

// For each of the n values of i, add numbers[i] to total n times.
int total = 0;
for (int i = 0; i < numbers.length; ++i) {
  for (int j = 0; j < numbers.length; ++j) {
    total += numbers[i];
  }
}

The outer loop executes n times. On each iteration, the inner loop executes n times. That gives us this loose definition of \(T\):

$$T(n) = cn^2 + d$$

All algorithms whose number of steps can be expressed this way are said to have quadratic time. Would you rather have a linear time algorithm or a quadratic time algorithm?
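To feel the difference, suppose both constants are 1 and \(n\) is 1000 (numbers chosen purely for illustration):

$$\begin{align} T_{\text{linear}}(1000) &= 1000 \\ T_{\text{quadratic}}(1000) &= 1000^2 = 1{,}000{,}000 \end{align}$$

The quadratic algorithm does a thousand times more work on the same input.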

How should we define \(T\) for this algorithm?

public static <T extends Comparable<T>> T bigger(T a, T b) {
  if (a.compareTo(b) >= 0) {
    return a;
  } else {
    return b;
  }
}

This method has neither a loop nor a collection, so there's no \(n\). Its cost therefore doesn't vary with the input; it's always some constant:

$$T(n) = c$$
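For instance, thanks to autoboxing we can hand this method a couple of int literals; this throwaway example is mine, not part of the method's contract:

System.out.println(bigger(3, 5));              // prints 5
System.out.println(bigger("apple", "banana")); // prints banana

No matter which arguments we pass, the same handful of operations executes.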

Now let's examine the cost function of two classic search algorithms.

Searching

Even before computers, we humans collected massive amounts of data. We made dictionaries and phonebooks, for example. Finding an entry in our records can be difficult. One option is the linear search, which looks like this:

to search(target)
  for each record
    if record matches target
      return record
  return null 

In the worst case, the target is in the last position or not present at all, so every record gets examined: \(T(n) = cn\).
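As a sketch, here's one way that pseudocode might look in Java, assuming the records are ints in an array; the method name and parameters are just for illustration:

// Returns the index of target in xs, or -1 if it's absent.
public static int linearSearch(int[] xs, int target) {
  for (int i = 0; i < xs.length; ++i) {
    if (xs[i] == target) {
      return i;
    }
  }
  return -1;
}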

We can find a target faster if we sort the records first. If we pick a record at random from the list and compare it to the target, we know how to proceed. One of three things will have happened: the record matches the target, and we're done; the record precedes the target, so the target can only be in the portion after the record; or the record follows the target, so the target can only be in the portion before the record.

We throw away the portion of the list in which we know the target doesn't appear. Throwing away big chunks of the list is what makes binary search faster than linear search.

How shall we pick a random element? To minimize the number of steps in the worst case, we pick the element in the exact middle of the list. That means we will eliminate half of the list on every check. We don't physically halve the list, as that would be expensive. Instead, we'll maintain a pair of indices that mark a window within the list in which the target might appear.

This contains method uses a binary search to see if a certain number is within an array:

public static boolean contains(int[] xs, int target) {
  int lo = 0;
  int hi = xs.length - 1;
  while (lo <= hi) {
    // Examine the element in the middle of the current window.
    int mid = (lo + hi) / 2;
    if (target < xs[mid]) {
      // The target can only be in the left half of the window.
      hi = mid - 1;
    } else if (target > xs[mid]) {
      // The target can only be in the right half of the window.
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}

What happens if you pass an unsorted array to this method? Try it.
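For example, with a couple of made-up arrays:

int[] sorted = {2, 3, 5, 7, 11, 13};
System.out.println(contains(sorted, 7));    // true
System.out.println(contains(sorted, 8));    // false

int[] unsorted = {13, 2, 11, 3, 7, 5};
System.out.println(contains(unsorted, 7));  // false, even though 7 is present

On the unsorted array, the very first comparison throws away the half that actually holds the target. Binary search's guarantees hold only for sorted input.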

The cost function of the binary search is proportional to the number of iterations of the loop. Each iteration halves the window. How many times can \(n\) be halved before we're down to a single element? The flip of this problem is how many times 1 must be doubled until it reaches \(n\):

$$2^? = n$$

The answer is the base-2 logarithm. The binary search then has this cost function:

$$T(n) = c\log_2 n$$
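For example, a window of 16 elements shrinks like this:

$$16 \rightarrow 8 \rightarrow 4 \rightarrow 2 \rightarrow 1$$

That's four halvings, and \(\log_2 16 = 4\). Even a million elements need only about 20 checks, since \(2^{20} \approx 1{,}000{,}000\).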

Compare these two functions in Desmos. Binary search is a clear winner. It scales to very large collections. However, remember that the list must be sorted before you can achieve these gains.

Big-O and Big-Theta

As we encounter algorithms and data structures, we will judge them by their cost functions. To ease their comparison, we will place them on the Ladder of Complexity. This ladder categorizes the cost functions into a small number of families. These families are formed by considering only the most significant term and dropping its constant coefficient. Here are a few of those families: constant (\(1\)), logarithmic (\(\log_2 n\)), linear (\(n\)), quadratic (\(n^2\)), exponential (\(2^n\)), and factorial (\(n!\)).

We'll draw this ladder with high costs at the top and low costs at the bottom. Each family is called a complexity class.

To say that an algorithm or data structure is within some family on this ladder, we use a notation called Big-Theta. These cost functions are Big-Theta of these families:

$$\begin{align} 3n^2 &\in \Theta(n^2) \\ 2n + 10 &\in \Theta(n) \\ 5 &\in \Theta(1) \\ \end{align}$$

To say that an algorithm or data structure is within or below some family on this ladder, we use a notation called Big-O. These cost functions are Big-O of these families:

$$\begin{align} 3n^2 &\in O(n^2) \\ 3n^2 &\in O(n^3) \\ 3n^2 &\in O(2^n) \\ 2n + 10 &\in O(n) \\ 2n + 10 &\in O(n^2) \\ 2n + 10 &\in O(n^3) \\ 5 &\in O(1) \\ 5 &\in O(\log_2 n) \\ 5 &\in O(n!) \\ \end{align}$$

Big-O is a weaker description of the cost function's placement on the ladder: it only gives an upper bound. Big-Theta puts the cost function in one specific family. We use both notations because sometimes we want to be specific and sometimes we don't. Big-O is the language of a kids' menu for ages 10 and under, or a ride that only permits riders under 75 pounds. Big-Theta is the language of socks for shoe sizes 6–12 and jars of salsa. A salsa labeled spicy is exactly spicy; the label doesn't mean "spicy, medium, or mild."
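If you ever want the precise versions of these claims, the standard definitions look something like this; we won't lean on the formalism in this course:

$$\begin{align} f \in O(g) &\iff \text{there exist } c > 0 \text{ and } n_0 \text{ such that } f(n) \le c \cdot g(n) \text{ for all } n \ge n_0 \\ f \in \Theta(g) &\iff \text{there exist } c_1, c_2 > 0 \text{ and } n_0 \text{ such that } c_1 g(n) \le f(n) \le c_2 g(n) \text{ for all } n \ge n_0 \end{align}$$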

TODO

You've got a few things to do before we meet again:

The first quiz will be posted Thursday afternoon and will be due before class on Monday. You can consult notes, the book, and the code we've written in class, but you must do your own work.
Get all your repository hiccups worked out before Friday's lab.

See you next time!

Sincerely,

P.S. It's time for a haiku!

The prince sorted them
The slipper fit the fifth foot
"What magic!" they thought