SLIDE 1 Slides for Lecture 16
ENCM 501: Principles of Computer Architecture Winter 2014 Term Steve Norman, PhD, PEng
Electrical & Computer Engineering Schulich School of Engineering University of Calgary
11 March, 2014
SLIDE 2 ENCM 501 W14 Slides for Lecture 16
slide 2/26
Previous Lecture
◮ context switches and effects on memory latency
◮ memory system summary
◮ introduction to ILP (instruction-level parallelism)
◮ review of simple pipelining
SLIDE 3
Today’s Lecture
◮ pipeline hazards
◮ solutions to pipeline hazards
Related reading in Hennessy & Patterson: Sections C.2–C.3
SLIDE 4
A rough sketch of the 5-stage pipeline
This sketch was presented at the end of the previous lecture:
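[Figure: sketch of the 5-stage pipeline. Stages IF, ID, EX, MEM and WB are separated by the clocked pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Labelled parts: a clocked PC, an adder, a block marked “?”, I-mem, instr. decode, GPRs, ALU and D-mem.]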
SLIDE 5
Pipeline Hazards
If a certain sequence of instructions prevents the usual throughput of one instruction per clock cycle in a simple pipeline, the situation is called a pipeline hazard. Hazards can be categorized into three main types: structural hazards, data hazards, and control hazards.
SLIDE 6
Structural hazards
These occur when two instructions “want” to use the same physical resource at the same time, in incompatible ways. For example, if the simple 5-stage pipeline had a single memory unit, instead of split instruction and data memories, MEM of an LW or SW instruction would interfere with IF of a later instruction. Why is access to three GPRs by two different instructions, one in WB and a later one in ID, not a structural hazard?
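As an illustration of the single-memory example above, here is a minimal Python sketch (not part of the original slides). It assumes stall-free timing in which instruction i, counting from 0, does IF in cycle i+1 and MEM in cycle i+4. The function name `memory_conflicts` and the opcode list are hypothetical.

```python
MEMORY_OPS = {"LW", "SW"}  # opcodes that access memory in the MEM stage

def memory_conflicts(opcodes):
    """Return (cycle, mem_index, fetch_index) for every cycle in which one
    instruction's MEM stage and a later instruction's IF stage would both
    need a single shared memory, assuming no stalls."""
    conflicts = []
    for i, op in enumerate(opcodes):
        if op not in MEMORY_OPS:
            continue
        mem_cycle = i + 4      # IF is in cycle i+1, so MEM is in cycle i+4
        fetch_index = i + 3    # the instruction whose IF is also in cycle i+4
        if fetch_index < len(opcodes):
            conflicts.append((mem_cycle, i, fetch_index))
    return conflicts

example = ["LW", "ADD", "SUB", "OR", "SW"]
print(memory_conflicts(example))
```

In this hypothetical sequence the LW's MEM stage collides with the fetch of the OR three instructions later. The final SW causes no conflict only because nothing is fetched after it.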
SLIDE 7
Structural hazards: solutions
The best solution is to design hardware to avoid structural hazards wherever possible. For example:
◮ in the simple, 5-stage pipeline, use separate instruction and data memories;
◮ in real pipelines, have separate I-TLBs and D-TLBs, and separate L1 I-caches and D-caches.
For complex pipelines, it may be practically impossible to avoid all structural hazards, so stalls may be required: if two instructions are contending for a resource, one or the other will be delayed one or more clock cycles.
SLIDE 8
Data hazards
(We’ll use MIPS32 instructions as examples, because instructions like ADD and SUB are easier to deal with than DADD and DSUB.)
The most common kind of data hazard is called a RAW hazard: RAW stands for Read-After-Write.
    ADD R8, R9, R10
    SUB R11, R12, R8
For correct processing, SUB must work as if R8 is read by SUB after R8 is written by ADD. (This is where the term RAW comes from.)
Let’s draw a “pipeline diagram” to get a precise understanding
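One way to produce such a pipeline diagram is sketched below in Python (not part of the original slides). It assumes the ideal, stall-free timing of the simple 5-stage pipeline, with GPRs read in ID and written in WB. The helper names `stage_cycles` and `diagram` are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycles(index):
    """Map each stage to the 1-based clock cycle in which instruction
    number `index` (0-based) occupies it, assuming no stalls."""
    return {stage: index + 1 + k for k, stage in enumerate(STAGES)}

def diagram(instrs):
    """Return the text rows of an ideal (stall-free) pipeline diagram."""
    n_cycles = len(instrs) + len(STAGES) - 1
    rows = []
    for i, text in enumerate(instrs):
        cells = [""] * n_cycles
        for stage, cycle in stage_cycles(i).items():
            cells[cycle - 1] = stage
        rows.append(f"{text:<18}" + "".join(f"{c:<5}" for c in cells))
    return rows

program = ["ADD R8, R9, R10", "SUB R11, R12, R8"]
for row in diagram(program):
    print(row)

add_wb = stage_cycles(0)["WB"]  # cycle in which ADD writes R8
sub_id = stage_cycles(1)["ID"]  # cycle in which SUB would read R8
print(f"ADD writes R8 in cycle {add_wb}; SUB reads R8 in cycle {sub_id}")
```

Under these assumptions SUB would read R8 two cycles before ADD writes it, which is exactly the RAW problem.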
SLIDE 9
More examples of RAW hazards
For the simple 5-stage pipeline, let’s find all the RAW hazards in this sequence . . .
    LW R8, 0(R4)
    AND R9, R8, R5
    OR R10, R6, R8
    SLT R11, R8, R7
Remark: The deeper a pipeline is (the more stages it has), the greater will be the number and complexity of potential RAW hazards.
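The search for RAW hazards can be mechanized. The Python sketch below is not part of the original slides and assumes a pipeline with no forwarding, in which producer i writes its result in WB (cycle i+5) and consumer j reads in ID (cycle j+2). Whether a write and a read of the same GPR in the same cycle work correctly is a design assumption, exposed here as the hypothetical flag `write_before_read`. The tuple encoding of instructions is also hypothetical.

```python
def raw_hazards(program, write_before_read=True):
    """program: list of (text, dest, sources) tuples.
    Return (producer_index, consumer_index, register) triples for reads
    that would happen before the producer's write, with no forwarding."""
    # The read is too early when j+2 < i+5, or when j+2 == i+5 and the
    # register file cannot write and then read within one cycle.
    window = 2 if write_before_read else 3
    found = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == "R0":  # R0 is hard-wired to zero in MIPS
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, later_dest, sources = program[j]
            if dest in sources:
                found.append((i, j, dest))
            if later_dest == dest:  # a newer write hides the older one
                break
    return found

program = [
    ("LW  R8, 0(R4)",   "R8",  ["R4"]),
    ("AND R9, R8, R5",  "R9",  ["R8", "R5"]),
    ("OR  R10, R6, R8", "R10", ["R6", "R8"]),
    ("SLT R11, R8, R7", "R11", ["R8", "R7"]),
]
for i, j, reg in raw_hazards(program):
    print(f"{program[j][0]} reads {reg} too early after {program[i][0]}")
```

Calling `raw_hazards(program, write_before_read=False)` shows how the answer changes if the register file cannot write and read the same register in one cycle.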
SLIDE 10
Forwarding
Forwarding is the name given to a technique that can often solve RAW data hazards without loss of clock cycles to stalls. (Another name for forwarding is bypassing.)
The essential idea is that if Instruction B depends on the result of Instruction A, Instruction B should not wait for Instruction A to write that result to its destination, but instead grab that result as soon as it is available.
Let’s look at how forwarding helps with this sequence . . .
    LW R8, 0(R4)
    AND R9, R8, R5
    OR R10, R6, R8
    SLT R11, R8, R7
SLIDE 11
Sketch of forwarding hardware for 5-stage MIPS32
Here is an incomplete schematic for the EX stage . . .
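[Figure: incomplete EX-stage schematic. Labels: the ID/EX pipeline register (clocked by CLK) holding two GPR values; an ALU with inputs A and B; multiplexers with inputs labelled 00, 01 and 10, selected by the 2-bit signals FwdA and FwdB from the “forward control” block; forwarded sources “ALU result from EX/MEM reg.” and “LW or ALU result from MEM/WB reg.”; a signal labelled LW/SW; and an output labelled “data for SW”.]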
SLIDE 12
Q1: What should the values of the “forward control” outputs be in the case where no forwarding is needed?
Consider this sequence:
    LW R8, 0(R4)
    AND R9, R10, R11
    SUB R12, R8, R9
Q2: What should the values of the “forward control” outputs be when SUB is in the EX stage?
Q3: What are the inputs to “forward control” and how does the forwarding logic work? (We’ll give an example or two, not completely specify the logic!)
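A possible shape for the forwarding logic is sketched below in Python (not part of the original slides). The select encoding is an assumption: 00 selects the GPR value from ID/EX, 10 the ALU result from EX/MEM, and 01 the result from MEM/WB. It is also assumed that FwdA serves the first source register and FwdB the second. The function `forward_select` and its argument format are hypothetical.

```python
# Assumed mux select encoding (the schematic's actual encoding may differ).
FROM_ID_EX, FROM_MEM_WB, FROM_EX_MEM = 0b00, 0b01, 0b10

def forward_select(src_reg, ex_mem, mem_wb):
    """Choose the mux select for one ALU operand.
    ex_mem and mem_wb are (writes_gpr, dest_reg_number) pairs describing
    the instructions whose results sit in those pipeline registers."""
    ex_writes, ex_dest = ex_mem
    wb_writes, wb_dest = mem_wb
    # The newer result (EX/MEM) takes priority over the older one (MEM/WB).
    if ex_writes and ex_dest != 0 and ex_dest == src_reg:
        return FROM_EX_MEM
    if wb_writes and wb_dest != 0 and wb_dest == src_reg:
        return FROM_MEM_WB
    return FROM_ID_EX

# SUB R12, R8, R9 is in EX; AND R9, R10, R11 is in MEM; LW R8, 0(R4) is in WB.
ex_mem = (True, 9)   # AND writes R9
mem_wb = (True, 8)   # LW writes R8
fwd_a = forward_select(8, ex_mem, mem_wb)  # first source of SUB
fwd_b = forward_select(9, ex_mem, mem_wb)  # second source of SUB
print(f"FwdA={fwd_a:02b} FwdB={fwd_b:02b}")
```

This sketch ignores the case where the instruction in MEM is a load: the EX/MEM register then holds an address, not the loaded data, so forwarding from it would be wrong.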
SLIDE 13
Can forwarding solve all RAW hazards?
Consider this sequence:
    LW R15, 0(R14)
    ADD R16, R17, R15
Is it possible to solve the hazard by forwarding? If not, what is the most time-efficient way to solve the hazard?
Let’s make some general remarks about optimal solutions of RAW data hazards.
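The stall counts for RAW hazards in the simple 5-stage pipeline can be summarized in a short Python sketch (not part of the original slides). It assumes that ALU results are available at the end of EX, that load data is available at the end of MEM, and that the register file can write and then read a register within one cycle. The function `stalls_needed` is hypothetical; `distance` is 1 when the consumer immediately follows the producer.

```python
def stalls_needed(producer_is_load, distance, forwarding=True):
    """Stall cycles needed so that a consumer `distance` instructions
    after a producer sees the producer's result (5-stage pipeline)."""
    if forwarding:
        # Only a load followed immediately by its consumer needs a bubble:
        # the data leaves D-mem at the end of MEM, one cycle too late for EX.
        return 1 if (producer_is_load and distance == 1) else 0
    # No forwarding: the consumer's ID must not come before the producer's WB.
    return max(0, 3 - distance)

print(stalls_needed(producer_is_load=True, distance=1))                    # LW then ADD
print(stalls_needed(producer_is_load=False, distance=1))                   # ADD then SUB
print(stalls_needed(producer_is_load=True, distance=1, forwarding=False))  # no forwarding
```

For the LW/ADD pair above, this model gives a one-cycle stall combined with forwarding from the MEM/WB register.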
SLIDE 14
Control hazards: Introduction
In a simple pipeline, a control hazard is a difficulty in determining the address to use for the next Instruction Fetch.
Look at this example, and assume a version of MIPS32 in which the delay slot instruction is not supposed to be completed if the branch is taken:
L1: LW R9, 0(R5)
    (instructions in loop body)
    BEQ R8, R0, L1
    OR R16, R10, R0
In the clock cycle after IF for the BEQ instruction, why is doing IF difficult? (There is more than one reason.)
SLIDE 15
Control hazards: Not just for conditional branches!
In a conditional branch, there is an obvious motivation to wait for the decision about whether or not to take the branch. But consider the following unconditional updates to the PC:
◮ jump within a procedure;
◮ procedure call;
◮ procedure return.
Why do these kinds of instructions generate control hazards? How many cycles might be lost due to such a hazard in a 5-stage pipeline like the one we’ve been looking at?
SLIDE 16
“Old school” solutions to control hazards (1)
Stall as long as necessary to ensure that instruction results are correct. This obviously makes CPI worse (higher) if programs have lots of conditional branches and unconditional jumps.
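The effect on CPI can be estimated with a one-line model, sketched below in Python (not part of the original slides). It assumes every branch or jump costs a fixed number of stall cycles and that no other stalls occur. The 20% frequency and 2-cycle penalty are hypothetical numbers chosen only for illustration.

```python
def cpi_with_control_stalls(control_fraction, stall_cycles, base_cpi=1.0):
    """Effective CPI when a fraction of instructions are branches or jumps
    and each one costs a fixed number of stall cycles."""
    return base_cpi + control_fraction * stall_cycles

# Hypothetical workload: 20% branches and jumps, 2 stall cycles each.
example = cpi_with_control_stalls(0.20, 2)
print(f"CPI = {example:.2f}")
```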
SLIDE 17
“Old school” solutions to control hazards (2)
Delayed jumps and branches. Because it is very difficult to do IF properly in the cycle immediately following a jump or a taken branch, many ISA designs decreed that the successor to a jump or branch would always be completed before the jump or branch target instruction . . .
     BEQ R12, R0, L99
     ADD R13, R14, R15   # successor
     (more instructions)
L99: SUB R8, R9, R10     # branch target
     OR R16, R8, R0
Real MIPS ISAs (as opposed to some hypothetical MIPS-like ISAs in textbooks and lecture slides) have delayed branches and jumps.
SLIDE 18
Dynamic branch prediction
Dynamic branch prediction is the most important current technology for management of control hazards.
A branch prediction circuit is a memory array comparable in size to an L1 I-cache, and somewhat more complex.
A branch prediction circuit records the locations of thousands of recently-encountered branches and jumps, along with the addresses of their targets.
For each conditional branch, a branch prediction circuit maintains a few bits of information that can be used to predict whether the branch will be taken or untaken.
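One common way to use “a few bits” per branch is a 2-bit saturating counter. The Python sketch below (not part of the original slides) models that scheme, which is an assumption here and not necessarily the scheme of any particular processor. The class name, the dictionary-based storage, and the initial counter value of 1 are all hypothetical. Real predictors use fixed-size hardware tables indexed by address bits.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters plus remembered target addresses.
    Counter values 0 and 1 predict untaken; 2 and 3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in 0..3
        self.targets = {}   # branch address -> last taken target address

    def predict(self, pc):
        """Return True if the branch at `pc` is predicted taken."""
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken, target=None):
        """Record the actual outcome once the branch has been resolved."""
        count = self.counters.get(pc, 1)
        self.counters[pc] = min(3, count + 1) if taken else max(0, count - 1)
        if taken and target is not None:
            self.targets[pc] = target

predictor = TwoBitPredictor()
for outcome in [True, True, False, True]:
    print(predictor.predict(0x400100), outcome)
    predictor.update(0x400100, outcome, target=0x400080)
```

Because the counter saturates, a single untaken outcome in a run of taken ones does not flip the prediction.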
SLIDE 19
Branch prediction code example
p and past_last are of type int*. count is an int.
    do {
        if (*p < 0)
            count++;
        p++;
    } while (p != past_last);
p walks through an array of int elements, and count records how many of those elements are negative.
SLIDE 20
Branch prediction code example, continued
Assembly language for a MIPS32-like ISA that does not have delayed branches . . .
L1: LW    R8, 0(R4)
    SLT   R9, R8, R0     # R9 = (*p < 0)
    BEQ   R9, R0, L2     # branch if !(*p < 0)
    ADDIU R25, R25, 1    # count++
L2: ADDIU R4, R4, 4      # p++
    BNE   R4, R24, L1    # branch if p != past_last
Let’s suppose that there are a lot of array elements, and most of them are negative. As the processor runs the loop, what predictions will it learn to make about the BEQ and BNE instructions?
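To explore the question, the Python sketch below (not part of the original slides) replays the loop's two branches through 2-bit saturating counters. That predictor scheme is an assumption, as are the initial counter value of 1 and the test array, in which 9 of every 10 elements are negative. The function `run_loop` is hypothetical.

```python
def run_loop(data):
    """Count mispredictions of the loop's BEQ and BNE for the given array,
    using one 2-bit saturating counter per branch."""
    counters = {"BEQ": 1, "BNE": 1}  # 0..3; values 2 and 3 predict taken
    misses = {"BEQ": 0, "BNE": 0}

    def branch(name, taken):
        predicted_taken = counters[name] >= 2
        if predicted_taken != taken:
            misses[name] += 1
        step = 1 if taken else -1
        counters[name] = max(0, min(3, counters[name] + step))

    for i, value in enumerate(data):
        branch("BEQ", not (value < 0))      # BEQ skips count++ when !(*p < 0)
        branch("BNE", i != len(data) - 1)   # BNE loops back except at the end
    return misses

data = ([-1] * 9 + [5]) * 10  # 100 elements, 90 of them negative
print(run_loop(data))
```

With this data the sketch reports 10 BEQ mispredictions and 2 BNE mispredictions. The predictor settles on untaken for BEQ, missing only the non-negative elements. It settles on taken for BNE, missing only the first and last iterations.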
SLIDE 21
Scalar versus Superscalar
It seems like the right moment to introduce these terms. A scalar processor core starts no more than one instruction per clock cycle. In some cycles it can’t start an instruction, due to a stall caused by a pipeline hazard. All of the pipeline examples so far have been for scalar cores. A superscalar processor core tries to start two or more instructions per clock cycle. When I start talking about superscalar cores, I will let you know.
SLIDE 22
A 5-stage pipeline with dynamic branch prediction
Let’s review our previous sketch of the 5-stage pipeline, then show how it would be modified to support dynamic branch prediction. An instruction fetch unit encapsulates a PC, an L1 I-cache, and a branch prediction circuit. Both sketches are for scalar systems.
SLIDE 23
A rough sketch of the 5-stage pipeline
These are the pieces we saw previously . . .
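[Figure: sketch of the 5-stage pipeline. Stages IF, ID, EX, MEM and WB are separated by the clocked pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Labelled parts: a clocked PC, an adder, a block marked “?”, I-mem, instr. decode, GPRs, ALU and D-mem.]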
SLIDE 24
5-stage pipeline with dynamic branch prediction
Note that a monster has moved into the IF stage . . .
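[Figure: sketch of the 5-stage pipeline with a clocked “instruction fetch unit” block in the IF stage; the PC, adder and I-mem labels no longer appear. Stages IF, ID, EX, MEM and WB are separated by the clocked pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB. Other labelled parts: a block marked “?”, instr. decode, GPRs, ALU and D-mem.]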
SLIDE 25
Scalar performance with dynamic branch prediction
If the branch predictor does a good job, CPI will be very close to 1. What are two reasons why, for most programs, CPI will be somewhat greater than 1?
SLIDE 26
Upcoming Topics
◮ Practical considerations for scalar pipelines.
◮ Processing instructions with parallel pipelines.
Related reading in Hennessy & Patterson: Sections C.5–C.6, 3.4–3.5