teaching machines

CS 240: Lecture 1 – Introduction

August 25, 2021 by . Filed under cs3, fall-2021, lectures.

Welcome to CS 240: Algorithms and Data Structures. I’m Chris, and I’ve never felt less prepared for a semester than I do right now. Part of it’s because the summer was busy. I put on a few summer camps. Part of it’s because I’m still new to JMU. This is my second year here, and there’s a lot to learn. But mostly it’s because I feel frozen. When the world is burning, it’s hard to plan ahead. But let’s not let that stop us from plowing ahead into computer science!

I had a good friend at another school who felt that every course needed a story. A course’s story explained why the faculty thought it belonged in a computer science major, how all the content tied together, and what the students would do with the knowledge afterward. What would you say is the one-sentence story of CS 149? CS 159? Here are my suggestions:

What’s the story of this course? Let’s complete an exercise before we answer that question.

ArrayList

Consider this code, which adds half a million numbers to a list and then removes them all:

LinkedList<Integer> numbers = new LinkedList<>();

for (int i = 0; i < 500000; ++i) {
  numbers.add(0, i);
}

while (!numbers.isEmpty()) {
  numbers.remove(0);
}

The code uses LinkedList, which you may not be familiar with. In fact, suppose you found this code on the first day of a new job. You decide to replace LinkedList with ArrayList, because you are more familiar with it. How do you think the ArrayList version will compare to LinkedList?

This semester I’m using a tool called Socrative to ask you questions during lecture. You’ll need a computer or mobile device to answer. Visit socrative.com and answer the question. For most of these questions, we’ll do two rounds. The first round will be you thinking and answering on your own. Then you’ll group up with your nearest neighbors and discuss your answers. You will try to persuade your neighbors why your answer is the right one. Then you’ll answer the question a second time. This second time your participation will be graded.

To time our two versions, we’ll use this Stopwatch utility, which measures the current thread’s CPU usage:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class Stopwatch {
  private ThreadMXBean bean;
  private long startNanos;

  public Stopwatch() {
    bean = ManagementFactory.getThreadMXBean();
  }

  public void start() {
    startNanos = bean.getCurrentThreadCpuTime();
  }

  public double stop() {
    long endNanos = bean.getCurrentThreadCpuTime();
    return (endNanos - startNanos) / 1e9;
  }
}

For informal testing, one could use System.currentTimeMillis instead of ThreadMXBean. I don’t use that method because it would also include the time spent in other threads.

On my MacBook Pro, the LinkedList version runs in ~0.11 seconds. The ArrayList version runs in ~23 seconds. This is a drastic difference. Users don’t have much patience for a task that takes 23 seconds to complete. Why is there such a big difference?

Contiguous Memory

There are two basic approaches to organizing data in a computer’s memory (RAM). One is to allocate a contiguous chunk of bytes. Your program reaches into this chunk using offsets from its start. You probably know this contiguous chunk as an array. Arrays are known for being fast because of a practice called caching. When you read in one element of an array, it moves from RAM into the much faster cache memory located directly on the processor. As you read and write into that element, the CPU touches the cache instead of RAM.

What makes an array fast is that not only does the requested element get pulled into cache, but so do its neighboring bytes. If you are looping through the array, the CPU will often find the element it needs to already be in cache memory. In this case, however, the array backing the ArrayList is not fast. That’s because we’re shifting elements around. When we insert a number, we insert it at the beginning of the list, which causes all the subsequent elements to change their memory cells. That’s a basic problem with arrays: structural changes are expensive.

A related problem is that arrays are fixed in size. You generally can’t just extend the end of the array past its current memory cell, because those bytes might be reserved for other data. Instead, you must know how much memory you need for the array at the time you make the array. Or you use an abstraction like ArrayList, which guesses a size for you, and if you fill it up, it automatically creates a bigger array and copies everything over.

Linked Memory

The alternative to contiguous memory is non-contiguous memory. A non-contiguous list stores in RAM two things per element: the datum and the address of the next element. The next element need not be adjacent to its predecessor; it can live anywhere in memory. A collection that is pieced together like this is linked.

There are some advantages to a linked data structure. Imagine inserting a new element at the beginning of the list. You allocate room for the new element, and then point its next link to the node that used to be the head of the list. And that’s it. No shuffling of elements is necessary. To remove an element, you just wire up its predecessor to its successor.

In this situation, a linked list is much faster than an array. That leads to a possible story for this course: the straightforward solution fails under the stress of large data.

That leads us to another question. Suppose you have a linked list named list. Which of the following calls will be the slowest, if any?

list.get(0)
list.get(size() / 2)
list.get(size() - 1)
list.get(size())

In a straightforward implementation of a linked list, the last valid element will be the slowest to access because it requires a traversal of the links from the very beginning. However, a more sophisticated implementation will also keep a pointer to the end of the list and each node will have a link to its predecessor. Such a list is called a doubly-linked list.

Data Structures

LinkedList and ArrayList are two kinds of lists. One could dream up other kinds, like a list that stores elements in contiguous chunks of 5 before linking to the next chunk. All of these different lists are implementations of an abstract data type (ADT). An ADT is a description of data that describes the possible values of the type and the operations that the type supports. A list is an indexed sequence of data that supports these operations and maybe others:

One of the most important things you will take away from this course is a catalog of ADTs that you will consider when coming up with an algorithm. There are several other ADTs that are in this catalog and that we will discuss in this course:

There are many others.

Each of these abstractions allow many possible concrete implementations, or data structures. For example, an ADT doesn’t dictate whether arrays or linked structures are used. The data structures themselves make those decisions. This leads to the second important takeaway from this course: how we implement an ADT has a profound impact on its suitability for performing certain tasks. We can’t just float around at the abstract level and still write good software.

Story

Let’s frame the story of this course in the positive: computer scientists have an arsenal of tools at their disposal, and they must select the right one for the job. Allow me to be honest with you. This arsenal is large, and we will be walking through it faster than your brain can probably organize it. But that’s okay. You will keep learning the arsenal after the class is over, and hopefully that learning will be easier because of this course.

TODO

See you next time.

Sincerely,

P.S. It’s time for a haiku!

There are many lists
Some loop fast, others grow fast
Each a specialist