CS 240: Lecture 1 – Introduction
Welcome to CS 240: Algorithms and Data Structures. I’m Chris, and I’ve never felt less prepared for a semester than I do right now. Part of it’s because the summer was busy. I put on a few summer camps. Part of it’s because I’m still new to JMU. This is my second year here, and there’s a lot to learn. But mostly it’s because I feel frozen. When the world is burning, it’s hard to plan ahead. But let’s not let that stop us from plowing ahead into computer science!
I had a good friend at another school who felt that every course needed a story. A course’s story explained why the faculty thought it belonged in a computer science major, how all the content tied together, and what the students would do with the knowledge afterward. What would you say is the one-sentence story of CS 149? CS 159? Here are my suggestions:
- CS 149 is a course in which humans learn how to teach machines.
- CS 159 is a course that reveals the immensity of software and the smallness of developers.
What’s the story of this course? Let’s complete an exercise before we answer that question.
ArrayList
Consider this code, which adds half a million numbers to a list and then removes them all:
LinkedList<Integer> numbers = new LinkedList<>();
for (int i = 0; i < 500000; ++i) {
numbers.add(0, i);
}
while (!numbers.isEmpty()) {
numbers.remove(0);
}
The code uses LinkedList
, which you may not be familiar with. In fact, suppose you found this code on the first day of a new job. You decide to replace LinkedList
with ArrayList
, because you are more familiar with it. How do you think the ArrayList
version will compare to LinkedList
?
This semester I’m using a tool called Socrative to ask you questions during lecture. You’ll need a computer or mobile device to answer. Visit socrative.com and answer the question. For most of these questions, we’ll do two rounds. The first round will be you thinking and answering on your own. Then you’ll group up with your nearest neighbors and discuss your answers. You will try to persuade your neighbors why your answer is the right one. Then you’ll answer the question a second time. This second time your participation will be graded.
To time our two versions, we’ll use this Stopwatch
utility, which measures the current thread’s CPU usage:
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
public class Stopwatch {
private ThreadMXBean bean;
private long startNanos;
public Stopwatch() {
bean = ManagementFactory.getThreadMXBean();
}
public void start() {
startNanos = bean.getCurrentThreadCpuTime();
}
public double stop() {
long endNanos = bean.getCurrentThreadCpuTime();
return (endNanos - startNanos) / 1e9;
}
}
For informal testing, one could use System.currentTimeMillis
instead of ThreadMXBean
. I don’t use that method because it would also include the time spent in other threads.
On my MacBook Pro, the LinkedList
version runs in ~0.11 seconds. The ArrayList
version runs in ~23 seconds. This is a drastic difference. Users don’t have much patience for a task that takes 23 seconds to complete. Why is there such a big difference?
Contiguous Memory
There are two basic approaches to organizing data in a computer’s memory (RAM). One is to allocate a contiguous chunk of bytes. Your program reaches into this chunk using offsets from its start. You probably know this contiguous chunk as an array. Arrays are known for being fast because of a practice called caching. When you read in one element of an array, it moves from RAM into the much faster cache memory located directly on the processor. As you read and write into that element, the CPU touches the cache instead of RAM.
What makes an array fast is that not only does the requested element get pulled into cache, but so do its neighboring bytes. If you are looping through the array, the CPU will often find the element it needs to already be in cache memory. In this case, however, the array backing the ArrayList
is not fast. That’s because we’re shifting elements around. When we insert a number, we insert it at the beginning of the list, which causes all the subsequent elements to change their memory cells. That’s a basic problem with arrays: structural changes are expensive.
A related problem is that arrays are fixed in size. You generally can’t just extend the end of the array past its current memory cell, because those bytes might be reserved for other data. Instead, you must know how much memory you need for the array at the time you make the array. Or you use an abstraction like ArrayList
, which guesses a size for you, and if you fill it up, it automatically creates a bigger array and copies everything over.
Linked Memory
The alternative to contiguous memory is non-contiguous memory. A non-contiguous list stores in RAM two things per element: the datum and the address of the next element. The next element need not be adjacent to its predecessor; it can live anywhere in memory. A collection that is pieced together like this is linked.
There are some advantages to a linked data structure. Imagine inserting a new element at the beginning of the list. You allocate room for the new element, and then point its next link to the node that used to be the head of the list. And that’s it. No shuffling of elements is necessary. To remove an element, you just wire up its predecessor to its successor.
In this situation, a linked list is much faster than an array. That leads to a possible story for this course: the straightforward solution fails under the stress of large data.
That leads us to another question. Suppose you have a linked list named list
. Which of the following calls will be the slowest, if any?
list.get(0)
list.get(size() / 2)
list.get(size() - 1)
list.get(size())
In a straightforward implementation of a linked list, the last valid element will be the slowest to access because it requires a traversal of the links from the very beginning. However, a more sophisticated implementation will also keep a pointer to the end of the list and each node will have a link to its predecessor. Such a list is called a doubly-linked list.
Data Structures
LinkedList
and ArrayList
are two kinds of lists. One could dream up other kinds, like a list that stores elements in contiguous chunks of 5 before linking to the next chunk. All of these different lists are implementations of an abstract data type (ADT). An ADT is a description of data that describes the possible values of the type and the operations that the type supports. A list is an indexed sequence of data that supports these operations and maybe others:
- size
- get an element at some index
- remove an element at some index
- insert an element
- check whether an element is in the list
One of the most important things you will take away from this course is a catalog of ADTs that you will consider when coming up with an algorithm. There are several other ADTs that are in this catalog and that we will discuss in this course:
- map to manage a collection of key-value pairs
- set to track membership of elements
- stack to manage a changing collection from newest to oldest
- queue to manage a changing collection from oldest to newest
- tree to manage a branching structure
There are many others.
Each of these abstractions allow many possible concrete implementations, or data structures. For example, an ADT doesn’t dictate whether arrays or linked structures are used. The data structures themselves make those decisions. This leads to the second important takeaway from this course: how we implement an ADT has a profound impact on its suitability for performing certain tasks. We can’t just float around at the abstract level and still write good software.
Story
Let’s frame the story of this course in the positive: computer scientists have an arsenal of tools at their disposal, and they must select the right one for the job. Allow me to be honest with you. This arsenal is large, and we will be walking through it faster than your brain can probably organize it. But that’s okay. You will keep learning the arsenal after the class is over, and hopefully that learning will be easier because of this course.
TODO
- Install Java 16.
- Install IntelliJ IDEA. Grab the free Community edition, not the commercial Ultimate edition.
- Create a new project and confirm that you can run code. Also test a JUnit unit test.
- Bring a configured laptop on Friday if you have one.
- Join the course Slack.
- Complete the course survey on Canvas.
- Complete the first two OpenDSA readings.
- Read the generics tutorial.
- Complete the generics quiz.
See you next time.
P.S. It’s time for a haiku!
There are many lists
Some loop fast, others grow fast
Each a specialist