CS 352 Lecture 37 – Hazard MitigationDear students,
Last time we introduced the idea of pipelining, which allowed multiple instructions to progress through the processor at a time. But we saw some dangers that crept in. If this were politics, we would tell pipelining to apologize and resign. But we’re computer scientists, and we fix things rather than give up on them. Let’s walk through the issues we saw last time and talk about how to avoid them or lessen their impact.
First we saw data hazards. We looked at read-after-write, in which a later instruction relies on the result of an earlier instruction. If the earlier instruction hasn’t written back the result to the register file, the later instruction will read the wrong value. What can we do about it?
- Stalls. We could simply delay the later instruction. This could be done by inserting intervening
nop(no operation) instructions that pad out the stages or by letting the later instruction dwell a bit longer in some stages. Such delays are often called stalls or bubbles. Though they provide correctness of our computation, they undermine the performance advantages we hoped pipelining would provide.
- Forwarding. A smarter approach would be to run some circuitry between hardware components. After the ALU computes our register’s new value, we can forward the result along to the other stages that want to read it. We don’t need to wait for it to be committed to the register file.
- Out-of-order execution. Probably the most drastic approach is to delay a read-after-write instruction not by
nops but by other, independent instructions. There might be useful work that can be done while we are waiting. For this to work, the hardware will need to collect up a handful of instructions so it can have a few on hand when a data hazard appears. Consider this example:
add r1, r2, #1 lsl r3, r1, #2 mov r5, #57 mvn r4, r4 and r6, r6, r7That
r1, but the three instructions after it don’t have any interdepencies, so we might as well execute the program like this:
add r1, r2, #1 mov r5, #57 mvn r4, r4 and r6, r6, r7 lsl r3, r1, #2
We also saw control hazards. When we have a branch instruction, there are two possible next instructions. (If we have
mov pc, ?, there are a lot more than two choices!) Since we don’t know exactly what the program counter is going to be, we can’t schedule the next instruction in the pipeline. Is this a big deal? Yes, given that about 20% of instructions are branches! What do we do? Well, we have a few choices:
- Stall. We could stall until we get enough information to determine the new program counter. But once again, there go the performance gains.
- Loop unrolling. We could eliminate the branches altogether by unrolling our loops. This is something we usually let the compiler do. But the compiler has to be able to know at compile time how many times the loop will iterate. We’ll look at an example of how C++ templates are really good at supporting this.
- Assume branch taken/not taken. Think of a loop in the canonical do/while structure. How often is the branch taken? On all but the last iteration. It’d be reasonable then to guess that the branch will be taken, and load the labeled instruction into the pipeline. Of course, if we discover that we were wrong, we’ll need to flush out the predicted instructions from the processor.
- 1-bit branch history. Instead of making mindless decisions, we could keep a memory of whether a branch was taken or not in the past. Our memory can just be a collection of bits keyed on the low-order bits of the branch instruction address. We can consult this memory to predict what the next value of the program counter should be. Consider this code:
for y to 240 for x to 320 puts (x, y)On the inner loop, how many times is the prediction wrong? It’s wrong on the first iteration, because the last time that loop was run, the branch was not taken. It’s also wrong on the last iteration, because on the previous iteration, the branch was taken.
- 2-bit branch history. To overcome the brittleness of a 1-bit prediction table, we can extend to two bits. The bits encode a state in the following state machine: When our prediction is correct, we move into a stronger prediction state. When our prediction is incorrect, we move into a weaker prediction state. But it takes two wrongs before we alter our prediction. How will this handle our inner loop? Once the loop is primed, we’ll be in the strong predict taken state. On the last iteration, we’ll be wrong and move into the weak predict taken state. When the outer loop comes around again, we’ll still predict the branch as taken. That’s one less incorrect prediction than we had before!
Next class we have a couple folks from Superion visiting to talk about FPGAs. So, this will be my last time at the front of the room. I just want to revisit what we’ve done this semester:
- We’ve looked at how we can tame electricity and switching technology to compute logical and arithmetic operations.
- We’ve examined how we can mix in time and memory to our devices so they execute not just one operation but a whole program.
- We’ve looked briefly at how we can connect up devices to our machine so they can pull input from the user and push output back.
- We’ve discussed mechanisms for making sure that this monster that we’ve built performs as fast as possible.
Why did we talk about all this stuff, which seems fairly removed from the activities of our internships and other classes? After all, critics of the computer-science-for-all initiatives argue that one does not need to understand how an engine works in order to drive across town. They may be right, but you are not just a driver, a user of a computer. You are making it do things it didn’t know how to do before you came along. For that to happen to its fullest, we have to crack open the hood.
Here’s your TODO list for next time:
- Read The Free Lunch is Over. On a quarter sheet, answer a few questions. What impact do Sutters’ statements have on you as you consider your future career in technology? Abstractions have hidden many low-level aspects of technology away from software developers. Can they do the same for concurrency?
- For an extra credit participation, find and read/watch an introduction to FPGAs. On a separate quarter sheet, jot down a few things that you learn.
See you next class!
P.S. Here’s the code we wrote together…