 
Single Cycle, Multiple Cycle, vs. Pipeline
° Single Cycle Implementation (Cycle 1, Cycle 2): Load fills its cycle; Store finishes early and the rest of its cycle is wasted ("Waste")
° Multiple Cycle Implementation (Cycle 1 to Cycle 10):
  • Load: Ifetch, Reg, Exec, Mem, Wr
  • Store: Ifetch, Reg, Exec, Mem
  • R-type: Ifetch (continues past Cycle 10)
° Pipeline Implementation (one instruction started per cycle):
  • Load: Ifetch, Reg, Exec, Mem, Wr
  • Store: Ifetch, Reg, Exec, Mem, Wr
  • R-type: Ifetch, Reg, Exec, Mem, Wr
Why Pipeline?
° Suppose we execute 100 instructions
° Single Cycle Machine
  • 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
° Multicycle Machine
  • 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
° Ideal pipelined machine
  • 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
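The three totals on this slide can be reproduced with a short sketch. The function names are illustrative and not from the slides; the cycle times and CPI values are the slide's own, and the 4-cycle drain is taken to be (stages - 1) for an assumed five-stage pipeline.

```python
def single_cycle_ns(n_inst, cycle_ns=45.0):
    # One long cycle per instruction, CPI = 1.
    return cycle_ns * 1 * n_inst


def multicycle_ns(n_inst, cycle_ns=10.0, cpi=4.6):
    # Short cycle, but several cycles per instruction on average.
    return cycle_ns * cpi * n_inst


def ideal_pipelined_ns(n_inst, cycle_ns=10.0, stages=5):
    # One instruction completes per cycle, plus (stages - 1) cycles to drain.
    return cycle_ns * (n_inst + (stages - 1))


if __name__ == "__main__":
    for name, fn in [("single cycle", single_cycle_ns),
                     ("multicycle", multicycle_ns),
                     ("ideal pipeline", ideal_pipelined_ns)]:
        print(f"{name}: {fn(100):.0f} ns")
```

With these numbers the pipelined machine is a little over four times faster than either alternative for 100 instructions.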
Why Pipeline? Because the resources are there!
[Figure: five instructions (Inst 0 to Inst 4) in program order against time in clock cycles, each passing through Im, Reg, ALU, Dm, Reg, with one instruction starting per cycle.]
Resource usage over the nine clock cycles of the figure:
° MemInst: busy in cycles 1-5, idle in cycles 6-9
° RegRead: idle in cycle 1, busy in cycles 2-6, idle in cycles 7-9
° ALU: idle in cycles 1-2, busy in cycles 3-7, idle in cycles 8-9
° MemData: idle in cycles 1-3, busy in cycles 4-8, idle in cycle 9
° RegWrite: idle in cycles 1-4, busy in cycles 5-9
Can pipelining get us into trouble?
° Yes: Pipeline Hazards
  • structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., a combined washer/dryer would be a structural hazard, or the folder being busy doing something else (watching TV)
  • data hazards: attempt to use an item before it is ready
    - E.g., one sock of a pair is in the dryer and one is in the washer; can’t fold until the sock in the washer gets through the dryer
    - instruction depends on the result of a prior instruction still in the pipeline
  • control hazards: attempt to make a decision before the condition is evaluated
    - E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
    - branch instructions
° Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards
Single Memory (Inst & Data) is a Structural Hazard
structural hazards: attempt to use the same resource two different ways at the same time
° Previous example: Separate InstMem and DataMem
[Figure: Load, Instr 1, Instr 2, Instr 3 in program order, each using Mem, Reg, ALU, Mem, Reg in successive clock cycles. For Reg, a highlighted right half means read and a highlighted left half means write. A busy/idle table per clock cycle is given for Mem (Inst & Data), RegRead, RegWrite, and ALU.]
° Detection is easy in this case!
Single Memory (Inst & Data) is a Structural Hazard
structural hazards: attempt to use the same resource two different ways at the same time
° By changing the architecture from a Harvard memory (separate instruction and data memories) to a von Neumann memory (one shared memory), we actually created a structural hazard!
° Structural hazards can be avoided by changing
  • hardware: design of the architecture (splitting resources)
  • software: re-order the instruction sequence
  • software: delay
Pipelining
° Improve performance by increasing instruction throughput
[Figure: program execution order against time, each instruction passing through Instruction fetch, Reg, ALU, Data access, Reg.
 Non-pipelined (time axis marked 2 to 18): lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), with a new instruction starting every 8 ns.
 Pipelined (time axis marked 2 to 14): the same three loads, with a new instruction starting every 2 ns.]
° Ideal speedup is number of stages in the pipeline. Do we achieve this?
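A rough way to explore the closing question is to compute the speedup from the figure's own numbers. The sketch below assumes 8 ns per unpipelined instruction, a 2 ns pipelined clock, and five stages; the function names are illustrative and not from the slides.

```python
def unpipelined_ns(n_inst, inst_ns=8.0):
    # Each instruction runs to completion before the next one starts.
    return n_inst * inst_ns


def total_pipelined_ns(n_inst, stage_ns=2.0, stages=5):
    # The first instruction needs all stages; each later one adds one stage time.
    return (stages + n_inst - 1) * stage_ns


def speedup(n_inst):
    return unpipelined_ns(n_inst) / total_pipelined_ns(n_inst)


if __name__ == "__main__":
    for n in (3, 1000, 1_000_000):
        print(n, round(speedup(n), 3))
```

Under these assumptions the three loads take 14 ns instead of 24 ns, and for long programs the speedup approaches 8 ns / 2 ns = 4, not 5: the clock is set by the slowest stage, so unbalanced stages keep the pipeline below the ideal.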
Stall on Branch
[Figure 6.4: program execution order against time (axis marked 2 to 16). add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg. beq starts 2 ns after add; lw starts 4 ns after beq.]
Predicting branches
[Figure 6.5: two diagrams of program execution order against time (axis marked 2 to 14), each instruction passing through Instruction fetch, Reg, ALU, Data access, Reg.
 Top: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0), each starting 2 ns after the previous instruction.
 Bottom: add $4, $5, $6; beq $1, $2, 40; a row of bubbles; or $7, $8, $9, which starts 4 ns after beq.]
Delayed branch
[Figure 6.6: program execution order against time (axis marked 2 to 14). beq $1, $2, 40; add $4, $5, $6 (delayed branch slot); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg and starting 2 ns after the previous instruction.]
Instruction pipeline
[Figure 6.7: add $s0, $t0, $t1 passing through IF, ID, EX, MEM, WB (time axis marked 2 to 10).]
° Pipeline stages
  • IF: instruction fetch (read)
  • ID: instruction decode and register read (read)
  • EX: execute alu operation
  • MEM: data memory (read or write)
  • WB: write back to register
° Resources
  • Mem: instr. & data memory
  • RegRead1: register read port #1
  • RegRead2: register read port #2
  • RegWrite: register write
  • ALU: alu operation
Forwarding
[Figure 6.8: program execution order against time (axis marked 2 to 10). add $s0, $t0, $t1: IF, ID, EX, MEM, WB; sub $t2, $s0, $t3: IF, ID, EX, MEM, WB, starting one stage later.]
Load Forwarding
[Figure 6.9: program execution order against time (axis marked 2 to 14). lw $s0, 20($t1): IF, ID, EX, MEM, WB; a row of bubbles; sub $t2, $s0, $t3: IF, ID, EX, MEM, WB.]
Reordering
Original order:
  lw $t0, 0($t1)    $t0=Memory[0+$t1]
  lw $t2, 4($t1)    $t2=Memory[4+$t1]
  sw $t2, 0($t1)    Memory[0+$t1]=$t2
  sw $t0, 4($t1)    Memory[4+$t1]=$t0
Reordered:
  lw $t2, 4($t1)
  lw $t0, 0($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Figure 6.9
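The effect of the reordering can be checked with a tiny interpreter. This is a sketch under stated assumptions: instructions are modelled as (op, rt, offset, base) tuples, a load followed immediately by an instruction that reads the loaded register counts as one load-use stall, and the starting register and memory values are made up for illustration.

```python
# Each instruction is (op, rt, offset, base); the encoding is illustrative.
ORIGINAL = [
    ("lw", "$t0", 0, "$t1"),
    ("lw", "$t2", 4, "$t1"),
    ("sw", "$t2", 0, "$t1"),
    ("sw", "$t0", 4, "$t1"),
]
REORDERED = [
    ("lw", "$t2", 4, "$t1"),
    ("lw", "$t0", 0, "$t1"),
    ("sw", "$t2", 0, "$t1"),
    ("sw", "$t0", 4, "$t1"),
]


def sources(instr):
    """Registers an instruction reads: base always, plus rt for a store."""
    op, rt, _offset, base = instr
    return {base, rt} if op == "sw" else {base}


def load_use_pairs(program):
    """Count loads whose result is read by the very next instruction."""
    count = 0
    for prev, nxt in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in sources(nxt):
            count += 1
    return count


def run(program, regs, memory):
    """Execute the program sequentially and return the final memory."""
    regs, memory = dict(regs), dict(memory)
    for op, rt, offset, base in program:
        addr = regs[base] + offset
        if op == "lw":
            regs[rt] = memory[addr]
        else:
            memory[addr] = regs[rt]
    return memory


# Hypothetical starting state, chosen only for the demonstration.
START_REGS = {"$t0": 0, "$t1": 100, "$t2": 0}
START_MEM = {100: 11, 104: 22}

if __name__ == "__main__":
    print(load_use_pairs(ORIGINAL), load_use_pairs(REORDERED))
    print(run(ORIGINAL, START_REGS, START_MEM) == run(REORDERED, START_REGS, START_MEM))
```

Both orders swap the two memory words, but only the original order places sw $t2 directly after the load of $t2.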
Basic Idea: split the datapath
[Figure: the single-cycle datapath (PC, instruction memory, registers, sign extend, shift left 2, adders, ALU, data memory, multiplexers) divided into five stages: IF: Instruction fetch; ID: Instruction decode/register file read; EX: Execute/address calculation; MEM: Memory access; WB: Write back.]
° What do we need to add to actually split the datapath into stages?
Graphically Representing Pipelines
[Figure: time in clock cycles (CC 1 to CC 6) against program execution order. lw $10, 20($1): IM, Reg, ALU, DM, Reg in CC 1 to CC 5; sub $11, $2, $3: IM, Reg, ALU, DM, Reg in CC 2 to CC 6.]
° Can help with answering questions like:
  • how many cycles does it take to execute this code?
  • what is the ALU doing during cycle 4?
  • use this representation to help understand datapaths
Pipeline datapath with registers
[Figure 6.12: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB placed between the stages.]
Load instruction fetch and decode
[Figure 6.13: the pipelined datapath shown twice, with lw in the Instruction fetch stage (top) and in the Instruction decode stage (bottom).]
Load instruction execution
[Figure 6.14: the pipelined datapath with lw in the Execution stage.]
Load instruction memory and write back
[Figure 6.15: the pipelined datapath shown twice, with lw in the Memory stage (top) and in the Write back stage (bottom).]
Store instruction execution
[Figure 6.16: the pipelined datapath with sw in the Execution stage.]
Store instruction memory and write back
[Figure 6.17: the pipelined datapath shown twice, with sw in the Memory stage (top) and in the Write back stage (bottom).]
Load instruction: corrected datapath
[Figure 6.18: the pipelined datapath with registers IF/ID, ID/EX, EX/MEM, and MEM/WB, corrected for the load instruction.]
Load instruction: overall usage
[Figure 6.19: the pipelined datapath showing the portions used by a load instruction across all five stages.]
Multi-clock-cycle pipeline diagram
[Figures 6.20-21: time in clock cycles (CC 1 to CC 6) against program execution order, drawn two ways.
 By resource: lw $10, 20($1): IM, Reg, ALU, DM, Reg; sub $11, $2, $3: IM, Reg, ALU, DM, Reg, one cycle later.
 By stage name: lw $10, 20($1): Instruction fetch, Instruction decode, Execution, Data access, Write back; sub $11, $2, $3: the same five stages, one cycle later.]
Single-cycle #1-2
[Figure 6.22: the pipelined datapath at two clocks.
 Clock 1: lw $10, 20($1) in Instruction fetch.
 Clock 2: sub $11, $2, $3 in Instruction fetch; lw $10, 20($1) in Instruction decode.]
Single-cycle #3-4
[Figure 6.23: the pipelined datapath at two clocks.
 Clock 3: sub $11, $2, $3 in Instruction decode; lw $10, 20($1) in Execution.
 Clock 4: sub $11, $2, $3 in Execution; lw $10, 20($1) in Memory.]
Single-cycle #5-6
[Figure 6.24: the pipelined datapath at two clocks.
 Clock 5: sub $11, $2, $3 in Memory; lw $10, 20($1) in Write back.
 Clock 6: sub $11, $2, $3 in Write back.]
Conventional Pipelined Execution Representation
[Figure: time runs horizontally and program flow vertically. Six instructions each pass through IFetch, Dcd, Exec, Mem, WB, each starting one cycle after the previous one.]
Structural Hazards limit performance
° Example: if 1.3 memory accesses per instruction and only one memory access per cycle then
  • average CPI ≥ 1.3
  • otherwise resource is more than 100% utilized
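The bound follows from dividing demand by capacity. A minimal sketch, with an illustrative function name and the added assumption that an ideal pipeline cannot go below a CPI of 1:

```python
def min_average_cpi(mem_accesses_per_inst, mem_accesses_per_cycle=1.0):
    # The memory must serve every access, so each instruction needs at least
    # accesses / capacity cycles; 1.0 is the assumed floor for an ideal pipeline.
    return max(1.0, mem_accesses_per_inst / mem_accesses_per_cycle)


if __name__ == "__main__":
    print(min_average_cpi(1.3))        # one access per cycle
    print(min_average_cpi(1.3, 2.0))   # hypothetical memory with two accesses per cycle
```

Doubling the memory bandwidth in this model removes the structural limit, which corresponds to splitting instruction and data memory.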
Control Hazard Solutions
° Stall: wait until decision is clear
  • It’s possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
[Figure: Add, Beq, Load in program order against time in clock cycles, each using Mem, Reg, ALU, Mem, Reg; Load is delayed behind Beq.]
° Impact: 2 clock cycles per branch instruction => slow
Control Hazard Solutions
° Predict: guess one direction then back up if wrong
  • Predict not taken
[Figure: Add, Beq, Load in program order against time in clock cycles, each using Mem, Reg, ALU, Mem, Reg.]
° Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)
° More dynamic scheme: history of 1 branch (≈ 90%)
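The average cost of a branch under prediction is a weighted sum of the two cases. The sketch below uses the slide's 1-cycle and 2-cycle costs and reads the 50% and 90% figures as prediction accuracies, which is an interpretation; the function name is illustrative.

```python
def expected_branch_cycles(p_correct, cycles_if_right=1, cycles_if_wrong=2):
    # Weighted average of the right-guess and wrong-guess costs.
    return p_correct * cycles_if_right + (1.0 - p_correct) * cycles_if_wrong


if __name__ == "__main__":
    for accuracy in (0.5, 0.9):
        print(accuracy, round(expected_branch_cycles(accuracy), 2))
```

Under this reading, predict-not-taken averages 1.5 cycles per branch and the 1-branch history scheme about 1.1, compared with 2 cycles for always stalling.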
Control Hazard Solutions
° Redefine branch behavior (takes place after next instruction): “delayed branch”
[Figure: Add, Beq, Misc, Load in program order against time in clock cycles, each using Mem, Reg, ALU, Mem, Reg.]
° Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time)
° As more instructions are launched per clock cycle, this becomes less useful
Data Hazard on r1
  add r1, r2, r3
  sub r4, r1, r3
  and r6, r1, r7
  or  r8, r1, r9
  xor r10, r1, r11
Data Hazard on r1:
• Dependencies backwards in time are hazards
[Figure: stages IF, ID/RF, EX, MEM, WB against time in clock cycles. add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in program order, each using Im, Reg, ALU, Dm, Reg and starting one cycle after the previous one.]
Data Hazard Solution:
• “Forward” result from one stage to another
[Figure: stages IF, ID/RF, EX, MEM, WB against time in clock cycles. add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in program order, each using Im, Reg, ALU, Dm, Reg.]
• “or” is OK if register read/write is defined properly
Forwarding (or Bypassing): What about Loads
• Dependencies backwards in time are hazards
[Figure: stages IF, ID/RF, EX, MEM, WB against time in clock cycles. lw r1,0(r2) followed by sub r4,r1,r3, each using Im, Reg, ALU, Dm, Reg.]
• Can’t solve with forwarding:
• Must delay/stall instruction dependent on loads
Pipelining the Load Instruction
[Figure: clock cycles 1 to 7. 1st lw: Ifetch, Reg/Dec, Exec, Mem, Wr in cycles 1-5; 2nd lw in cycles 2-6; 3rd lw in cycles 3-7.]
° The five independent functional units in the pipeline datapath are:
  • Instruction Memory for the Ifetch stage
  • Register File’s Read ports (bus A and bus B) for the Reg/Dec stage
  • ALU for the Exec stage
  • Data Memory for the Mem stage
  • Register File’s Write port (bus W) for the Wr stage
The Four Stages of R-type
R-type: Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec, Cycle 4 Wr
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec:
  • ALU operates on the two register operands
  • Update PC
° Wr: Write the ALU output back to the register file
Pipelining the R-type and Load Instruction
[Figure: clock cycles 1 to 9. R-type, R-type, Load, R-type, R-type issued in consecutive cycles. R-type: Ifetch, Reg/Dec, Exec, Wr; Load: Ifetch, Reg/Dec, Exec, Mem, Wr. Oops! We have a problem!]
° We have a pipeline conflict or structural hazard:
  • Two instructions try to write to the register file at the same time!
  • Only one write port
Important Observation
° Each functional unit can only be used once per instruction
° Each functional unit must be used at the same stage for all instructions:
  • Load uses Register File’s Write Port during its 5th stage
    Load: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
  • R-type uses Register File’s Write Port during its 4th stage
    R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr
° 2 ways to solve this pipeline hazard.
Solution 1: Insert “Bubble” into the Pipeline
[Figure: clock cycles 1 to 9. Load and R-type instructions overlapped, with a pipeline bubble inserted so that no two instructions reach Wr in the same cycle.]
° Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
  • The control logic can be complex.
  • Lose instruction fetch and issue opportunity.
° No instruction is started in Cycle 6!
Solution 2: Delay R-type’s Write by One Cycle
° Delay R-type’s register write by one cycle:
  • Now R-type instructions also use Reg File’s write port at Stage 5
  • Mem stage is a NOOP stage: nothing is being done.
    R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
[Figure: clock cycles 1 to 9. R-type, R-type, Load, R-type, R-type issued in consecutive cycles, each passing through Ifetch, Reg/Dec, Exec, Mem, Wr.]
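The write-port conflict and its removal can be reproduced by computing the cycle in which each instruction reaches its write stage. This is a sketch only: it assumes one instruction issues per cycle starting in cycle 1, and the names are illustrative.

```python
from collections import defaultdict


def write_port_conflicts(program, write_stage):
    """Map each clock cycle with more than one register write to the
    positions (0-based) of the instructions writing in that cycle."""
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        start_cycle = index + 1                      # one issue per cycle
        write_cycle = start_cycle + write_stage[kind] - 1
        writers[write_cycle].append(index)
    return {cycle: idxs for cycle, idxs in writers.items() if len(idxs) > 1}


PROGRAM = ["R-type", "R-type", "Load", "R-type", "R-type"]
FOUR_STAGE_R = {"R-type": 4, "Load": 5}   # R-type writes in its 4th stage
FIVE_STAGE_R = {"R-type": 5, "Load": 5}   # Solution 2: write delayed to stage 5

if __name__ == "__main__":
    print(write_port_conflicts(PROGRAM, FOUR_STAGE_R))
    print(write_port_conflicts(PROGRAM, FIVE_STAGE_R))
```

With the 4-stage R-type, the Load and the R-type issued right after it both write in cycle 7; once every instruction writes in stage 5, the writes land in cycles 5 through 9 and never collide.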
The Four Stages of Store
Store: Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec, Cycle 4 Mem
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: Calculate the memory address
° Mem: Write the data into the Data Memory
The Three Stages of Beq
Beq: Cycle 1 Ifetch, Cycle 2 Reg/Dec, Cycle 3 Exec
° Ifetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
° Reg/Dec:
  • Registers Fetch and Instruction Decode
° Exec:
  • compare the two register operands
  • select the correct branch target address
  • latch it into the PC
Summary: Pipelining
° What makes it easy
  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores
° What makes it hard?
  • structural hazards: suppose we had only one memory
  • control hazards: need to worry about branch instructions
  • data hazards: an instruction depends on a previous instruction
° We’ll build a simple pipeline and look at these issues
° We’ll talk about modern processors and what really makes it hard:
  • exception handling
  • trying to improve performance with out-of-order execution, etc.
Summary
° Pipelining is a fundamental concept
  • multiple steps using distinct resources
° Utilize capabilities of the Datapath by pipelined instruction processing
  • start next instruction while working on the current one
  • limited by length of longest stage (plus fill/flush)
  • detect and resolve hazards
Pipeline Control: Controlpath Register bits
[Figure 6.29: the Control unit decodes the Instruction held in IF/ID into EX, M, and WB control fields. ID/EX carries 9 control bits (WB, M, EX), EX/MEM carries 5 control bits (WB, M), and MEM/WB carries 2 control bits (WB).]
Pipeline Control: Controlpath table

Figure 5.20, Single Cycle
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

Figure 6.28
             ID/EX control lines             | EX/MEM control lines      | MEM/WB control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  | Branch  MemRead  MemWrite | RegWrite  MemtoReg
R-format     1       1       0       0       | 0       0        0        | 1         0
lw           0       0       0       1       | 0       1        0        | 1         1
sw           X       0       0       1       | 0       0        1        | 0         X
beq          X       0       1       0       | 1       0        0        | 0         X
Pipeline Hazards
° Pipeline hazards
  • Solution #1 always works (for non-realtime applications): stall, delay & procrastinate!
° Structural Hazards (e.g., fetching from the same memory bank)
  • Solution #2: partition architecture
° Control Hazards (e.g., branching)
  • Solution #1: stall! but decreases throughput
  • Solution #2: guess and back-track
  • Solution #3: delayed decision: delay branch & fill slot
° Data Hazards (e.g., register dependencies)
  • Worst case situation
  • Solution #2: re-order instructions
  • Solution #3: forwarding or bypassing: delayed load
Pipeline Datapath and Controlpath
[Figure: the pipelined datapath with control. The Control unit produces EX, M, and WB fields that travel through ID/EX, EX/MEM, and MEM/WB. Control signals shown: PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, MemtoReg, ALUOp, RegDst. Instruction [15–0] feeds the sign extend unit and ALU control; Instruction [20–16] and Instruction [15–11] feed the RegDst multiplexer.]
load inst.
[Figure: the pipelined datapath with control, annotated with the values latched for lw $10, 20($1) over Clock 0 to Clock 3.
 Fetch: PC=0, IR=lw $10,20($1); PC=4 afterwards.
 ID/EX: WB=11, M=010, EX=0001, PC=4, A=C$1, B=X, S=20, T=$10, D=0.
 EX/MEM: WB=11, M=010, PC=4+20<<2, ALU=20+C$1, D=$10.
 MEM/WB: WB=11, MDR=Mem[20+C$1], Aluout, D=$10.]
Pipeline single stepping
Contents of Register 1 = C$1 = 3; C$2=4; C$3=4; C$4=6; C$5=7; C$10=8; … Memory[23]=9
Formats: add $rd,$rs=A,$rt=B; lw $rt=B,@($rs=A)
Pipeline register fields: IF/ID = <PC, IR>; ID/EX = <PC, A, B, S, Rt, Rd>; EX/MEM = <PC, Z, ALU, B, R>; MEM/WB = <MDR, ALU, R>
Clock 0: IF/ID <0,?>; ID/EX <?,?,?,?,?,?>; EX/MEM <?,?,?,?,?>; MEM/WB <?,?,?>
Clock 1: IF/ID <4,lw $10,20($1)>; ID/EX <0,?,?,?,?,?>; EX/MEM <?,?,?,?,?>; MEM/WB <?,?,?>
Clock 2: IF/ID <8,sub $11,$2,$3>; ID/EX <4,C$1 → 3,C$10 → 8,20,$10,0>; EX/MEM <0,?,?,?,?>; MEM/WB <?,?,?>
Clock 3: IF/ID <12,and $12,$4,$5>; ID/EX <8,C$2 → 4,C$3 → 4,X,$3,$11>; EX/MEM <4+20<<2 → 84,0,20+3 → 23,8,$10>; MEM/WB <?,?,?>
Clock 4: IF/ID <16,or $13,$6,$7>; ID/EX <12,C$4 → 6,C$5 → 7,X,$5,$12>; EX/MEM <X,1,4-4=0,4,$11>; MEM/WB <Mem[23] → 9,23,$10>
Clock 5: IF/ID <20,add $14,$8,$9>; ID/EX <16,C$6,C$7,X,$7,$13>; EX/MEM <X,0,1,7,$12>; MEM/WB <X,0,$11>
Clock 1: Figure 6.31a
IF: lw $10, 20($1)   ID: before<1>   EX: before<2>   MEM: before<3>   WB: before<4>
[Figure: the pipelined datapath with control. Fetch: PC=0, IR=lw $10,20($1); PC=4 afterwards. All control fields in ID/EX, EX/MEM, and MEM/WB are 0.]
Clock 2: Figure 6.31b
IF: sub $11, $2, $3   ID: lw $10, 20($1)   EX: before<1>   MEM: before<2>   WB: before<3>
[Figure: the pipelined datapath with control. IF/ID: IR=sub $11, $2, $3. ID/EX latches the lw control fields WB=11, M=010, EX=0001 and the values PC=4, A=C$1, B=X, S=20, T=$10, D=0. Control fields in EX/MEM and MEM/WB are 0.]
Clock 3
IF: and $12, $4, $5   ID: sub $11, $2, $3   EX: lw $10, . . .   MEM: before<1>   WB: before<2>
[Figure: the pipelined datapath with control. IF/ID: IR=and $12,$4,$5. ID/EX latches the sub control fields WB=10, M=000, EX=1100 and the values A=C$2, B=C$3, S=X, T=$3, D=$11. EX/MEM holds the lw fields WB=11, M=010 and the values PC=4+20<<2, ALU=20+C$1, D=$10. Control fields in MEM/WB are 0.]
Clock 4: Figure 6.32b
IF: or $13, $6, $7   ID: and $12, $4, $5   EX: sub $11, . . .   MEM: lw $10, . . .   WB: before<1>
[Figure: the pipelined datapath with control. IF/ID: IR=or $13,$6,$7. ID/EX latches the and control fields WB=10, M=000, EX=1100 and the values A=C$4, B=C$5, S=X, D=$12. EX/MEM holds the sub fields WB=10, M=000 and the values ALU=C$2-C$3, D=$11. MEM/WB holds the lw field WB=11 and the values MDR=Mem[20+C$1], ALU=20+C$1, D=$10.]
Data Dependencies: that can be resolved by forwarding
[Figure: time in clock cycles (CC 1 to CC 9) against program execution order; each instruction uses IM, Reg, ALU, DM, Reg.]
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9
  sub $2, $1, $3
  and $12, $2, $5     (data hazard: resolved by forwarding)
  or  $13, $6, $2     (data hazard: resolved by forwarding)
  add $14, $2, $2     (at same time: not a hazard)
  sw  $15, 100($2)    (forward in time: not a hazard)
Data Hazards: arithmetic
[Figure: time in clock cycles (CC 1 to CC 9) against program execution order; each instruction uses IM, Reg, ALU, DM, Reg.]
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9
Value of EX/MEM: -20 in CC 4, X otherwise
Value of MEM/WB: -20 in CC 5, X otherwise
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
° Forwards in time: Can be resolved
° At same time: Not a hazard
Data Dependencies: no forwarding
[Figure: clocks 1 to 8. sub $2,$1,$3: IF, ID, EX, M, WB; and $12,$2,$5: IF, ID, ID, ID, EX, M, WB (two stalls in ID). The register file is written in the 1st half of a clock and read in the 2nd half.]
° Suppose every instruction is dependent: 1 + 2 stalls = 3 clocks
° MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS
Data Dependencies: no forwarding
° A dependent instruction will take 1 + 2 stalls = 3 clocks
° An independent instruction will take 1 + 0 stalls = 1 clock
° Suppose 10% of the time the instructions are dependent?
  • Average instruction time = 10%*3 + 90%*1 = 0.10*3 + 0.90*1 = 1.2 clocks
° MIPS = Clock / CPI = 500 MHz / 1.2 = 417 MIPS (10% dependency)
° MIPS = Clock / CPI = 500 MHz / 3 = 167 MIPS (100% dependency)
° MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS (0% dependency)
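The three MIPS figures follow from one formula. A minimal sketch, using the slide's 500 MHz clock and 2 stalls per dependent instruction; the function names are illustrative.

```python
def average_cpi(dependent_fraction, stalls_per_dependency=2):
    # Every instruction takes 1 clock; dependent ones add the stall clocks.
    return 1.0 + dependent_fraction * stalls_per_dependency


def mips(clock_mhz, cpi):
    # Million instructions per second = clock rate (MHz) / clocks per instruction.
    return clock_mhz / cpi


if __name__ == "__main__":
    for fraction in (0.10, 1.0, 0.0):
        cpi = average_cpi(fraction)
        print(f"{fraction:.0%} dependent: CPI={cpi:.1f}, {round(mips(500, cpi))} MIPS")
```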
Data Dependencies: with forwarding
[Figure: clocks 1 to 6. sub $2,$1,$3: IF, ID, EX, M, WB; and $12,$2,$5: IF, ID, EX, M, WB, one clock later. Detected Data Hazard 1a: ID/EX.$rs = EX/M.$rd]
° Suppose every instruction is dependent: 1 + 0 stalls = 1 clock
° MIPS = Clock / CPI = 500 MHz / 1 = 500 MIPS
Data Dependencies: Hazard Conditions
° A data hazard occurs whenever an instruction needs, as a data source, a result that a previous instruction has not yet written to its data destination.
° Example
  sub $2, $1, $3     sub $rd, $rs, $rt
  and $12, $2, $5    and $rd, $rs, $rt
° Data hazard detection always compares a destination with a source.
  Destination         Source       Hazard Type
  EX/MEM.$rdest   =   ID/EX.$rs    1a
  EX/MEM.$rdest   =   ID/EX.$rt    1b
  MEM/WB.$rdest   =   ID/EX.$rs    2a
  MEM/WB.$rdest   =   ID/EX.$rt    2b
Data Dependencies: Hazard Conditions
° 1a Data Hazard: EX/MEM.$rd = ID/EX.$rs
  sub $2, $1, $3     sub $rd, $rs, $rt
  and $12, $2, $5    and $rd, $rs, $rt
° 1b Data Hazard: EX/MEM.$rd = ID/EX.$rt
  sub $2, $1, $3     sub $rd, $rs, $rt
  and $12, $1, $2    and $rd, $rs, $rt
° 2a Data Hazard: MEM/WB.$rd = ID/EX.$rs
  sub $2, $1, $3     sub $rd, $rs, $rt
  and $12, $1, $5    and $rd, $rs, $rt
  or  $13, $2, $1    or  $rd, $rs, $rt
° 2b Data Hazard: MEM/WB.$rd = ID/EX.$rt
  sub $2, $1, $3     sub $rd, $rs, $rt
  and $12, $1, $5    and $rd, $rs, $rt
  or  $13, $6, $2    or  $rd, $rs, $rt
Data Dependencies: Worst case
° Data Hazard:
  sub $2, $1, $3     sub $rd, $rs, $rt
  and $12, $2, $2    and $rd, $rs, $rt
  or  $13, $2, $2    or  $rd, $rs, $rt
° Data Hazard 1a: EX/MEM.$rd = ID/EX.$rs
° Data Hazard 1b: EX/MEM.$rd = ID/EX.$rt
° Data Hazard 2a: MEM/WB.$rd = ID/EX.$rs
° Data Hazard 2b: MEM/WB.$rd = ID/EX.$rt
Data Dependencies: Hazard Conditions
  Hazard Type   Source          Destination
  1a            ID/EX.$rs   =   EX/MEM.$rdest
  1b            ID/EX.$rt   =   EX/MEM.$rdest
  2a            ID/EX.$rs   =   MEM/WB.$rdest
  2b            ID/EX.$rt   =   MEM/WB.$rdest
Pipeline Registers
  ID/EX: $rs, $rt, $rd
  EX/MEM: $rd
  MEM/WB: $rd
[Figure: a. No forwarding: ID/EX, EX/MEM, and MEM/WB around Registers, ALU, Data memory, and the write-back multiplexer. b. With forwarding: multiplexers controlled by ForwardA and ForwardB select each ALU input; the Forwarding unit takes Rs and Rt from ID/EX, EX/MEM.RegisterRd, and MEM/WB.RegisterRd.]
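The four hazard conditions (1a, 1b, 2a, 2b) map onto the two forwarding multiplexers roughly as sketched below. This is not the slides' own logic: the RegWrite checks, the exclusion of register 0, and giving EX/MEM priority over MEM/WB (so the most recent result wins) are additions taken from the usual textbook treatment, and the function name and return labels are illustrative.

```python
def forward_select(id_ex_rs, id_ex_rt, ex_mem_rd, mem_wb_rd,
                   ex_mem_reg_write=True, mem_wb_reg_write=True):
    """Return (ForwardA, ForwardB): where each ALU input should come from.

    Registers are given by number; pass None when a stage has no destination.
    "REG" means the value read from the register file is used unchanged.
    """
    def pick(source_reg):
        # Hazards 1a/1b: the newest result sits in EX/MEM.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == source_reg:
            return "EX/MEM"
        # Hazards 2a/2b: an older result sits in MEM/WB.
        if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == source_reg:
            return "MEM/WB"
        return "REG"

    return pick(id_ex_rs), pick(id_ex_rt)


if __name__ == "__main__":
    # and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM (hazard 1a).
    print(forward_select(2, 5, ex_mem_rd=2, mem_wb_rd=None))
    # or $13, $6, $2 in EX, and $12 in MEM, sub $2 in WB (hazard 2b).
    print(forward_select(6, 2, ex_mem_rd=12, mem_wb_rd=2))
```

In the worst case sequence, `and $12, $2, $2` takes both operands from EX/MEM and `or $13, $2, $2` takes both from MEM/WB.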
Data Hazards: Loads
° [Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order (in instructions)]
• lw $2, 20($1)
• and $4, $2, $5
• or $8, $2, $6
• add $9, $4, $2
• slt $1, $6, $7
° Dependences on the loaded $2
• lw to and: backwards in time; cannot be resolved by forwarding
• lw to or: forwards in time; can be resolved by forwarding
• lw to add: at same time; not a hazard
Data Hazards: load stalling
° [Figure: pipeline diagram, time in clock cycles CC 1 to CC 10, program execution order (in instructions)]
• lw $2, 20($1)
• and $4, $2, $5
• or $8, $2, $6
• add $9, $4, $2
• slt $1, $6, $7
° Stall: a bubble is inserted after the load, so the dependent "and" and the instructions behind it are delayed by one cycle
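The cycle counts in the two load diagrams (CC 9 without the stall, CC 10 with it) can be checked with a short Python sketch. It assumes a 5-stage pipeline with full forwarding, so the only stall is one bubble when a load is immediately followed by a user of its result. The `PipeInstr` type and `total_cycles` are hypothetical names.

```python
from typing import List, NamedTuple, Tuple

class PipeInstr(NamedTuple):
    """Opcode, destination register, and source registers (hypothetical type)."""
    op: str
    dest: int
    sources: Tuple[int, ...]

def total_cycles(program: List[PipeInstr], stages: int = 5) -> int:
    """Cycles to run the program, counting one bubble per load-use pair."""
    if not program:
        return 0
    stalls = sum(
        1
        for prev, cur in zip(program, program[1:])
        if prev.op == "lw" and prev.dest in cur.sources
    )
    # One instruction completes per cycle once the pipeline is full,
    # plus (stages - 1) cycles to fill it, plus the bubbles.
    return len(program) + (stages - 1) + stalls

program = [
    PipeInstr("lw", 2, (1,)),      # lw  $2, 20($1)
    PipeInstr("and", 4, (2, 5)),   # and $4, $2, $5  (uses $2 right after the load)
    PipeInstr("or", 8, (2, 6)),    # or  $8, $2, $6
    PipeInstr("add", 9, (4, 2)),   # add $9, $4, $2
    PipeInstr("slt", 1, (6, 7)),   # slt $1, $6, $7
]
print(total_cycles(program))  # 10
```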
Data Hazards: Hazard detection unit (page 490)
° Stall condition (source = destination)
• (IF/ID.$rs = ID/EX.$rt OR IF/ID.$rt = ID/EX.$rt) AND ID/EX.MemRead = 1
° Stall example
• lw $2, 20($1) (lw $rt, addr($rs))
• and $4, $2, $5 (and $rd, $rs, $rt)
° No stall example (only need to look at next instruction)
• lw $2, 20($1) (lw $rt, addr($rs))
• and $4, $1, $5 (and $rd, $rs, $rt)
• or $8, $2, $6 (or $rd, $rs, $rt)
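The stall condition translates almost directly into code. The Python sketch below uses a hypothetical function name, `must_stall`, and takes the relevant pipeline-register fields as plain arguments.

```python
def must_stall(id_ex_mem_read: bool, id_ex_rt: int,
               if_id_rs: int, if_id_rt: int) -> bool:
    """True when the instruction in IF/ID reads the register a load in ID/EX will write."""
    return id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)

# Stall example: lw $2, 20($1) followed by and $4, $2, $5.
print(must_stall(True, id_ex_rt=2, if_id_rs=2, if_id_rt=5))
# No stall example: lw $2, 20($1) followed by and $4, $1, $5.
print(must_stall(True, id_ex_rt=2, if_id_rs=1, if_id_rt=5))
```

Only the instruction immediately after the load is checked; the later `or $8, $2, $6` in the no-stall example is left to forwarding.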
Data Hazards: Hazard detection unit (page 490)
° Example load: assume half of the load instructions are immediately followed by an instruction that uses the loaded value. What is the average number of clocks for the load?
• Load instruction time: 50% x (1 clock) + 50% x (2 clocks) = 1.5 clocks
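The same weighted average can be written as a small Python function so that other load-use fractions can be tried. The name `average_load_clocks` and its parameters are illustrative, not from the slides.

```python
def average_load_clocks(p_use_next: float, stall_cycles: int = 1) -> float:
    """Expected clocks per load when a fraction p_use_next incurs stall_cycles extra."""
    return (1 - p_use_next) * 1 + p_use_next * (1 + stall_cycles)

# Slide assumption: half of the loads are followed by a dependent instruction.
print(average_load_clocks(0.5))  # 1.5
```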
Hazard Detection Unit: when to stall
° [Figure: pipelined datapath with the hazard detection unit and the forwarding unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt; it drives PCWrite, IF/IDWrite, and the mux that replaces the control signals with 0. The forwarding unit reads Rs and Rt from ID/EX together with EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
Data Dependency Units
° Forwarding condition (source = destination)
• ID/EX.$rs = EX/MEM.$rd
• ID/EX.$rt = EX/MEM.$rd
• ID/EX.$rs = MEM/WB.$rd
• ID/EX.$rt = MEM/WB.$rd
° Stall condition (source = destination)
• (IF/ID.$rs = ID/EX.$rt OR IF/ID.$rt = ID/EX.$rt) AND ID/EX.MemRead = 1
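Data Dependency Units
° Pipeline registers and the register fields they carry
• IF/ID: $rs, $rt, $rd
• ID/EX: $rs, $rt, $rd
• EX/MEM: $rd
• MEM/WB: $rd
° Stalling comparisons: IF/ID sources against the ID/EX destination
° Forwarding comparisons: ID/EX sources against the EX/MEM and MEM/WB destinations
° Stall condition (source = destination)
• (IF/ID.$rs = ID/EX.$rt OR IF/ID.$rt = ID/EX.$rt) AND ID/EX.MemRead = 1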
Branch Hazards: Soln #1, Stall until Decision made (fig. 6.4)
° Code
• @3C: add $4, $5, $6
• @40: beq $1, $3, 7
• @44: and $12, $2, $5
• @48: or $13, $6, $2
• @4C: add $14, $2, $2
• @50: lw $4, 50($7)
° Soln #1: Stall until decision is made
° [Figure: pipeline timing, time axis 2 to 16 ns, program execution order (in instructions): add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg. Interval labels: 2 ns, 4 ns, 2 ns.]
° Stall; decision made in ID stage: do load
Branch Hazards: Soln #2, Predict until Decision made
° Pipeline timing (clock axis 1 to 8); each instruction passes through IF ID EX M WB
• beq $1,$3,7
• and $12, $2, $5
• lw $4, 50($7)
° Predict false branch (branch not taken)
° Decision made in ID stage: discard & branch
• discard "and $12,$2,$5" instruction
Branch Hazards: Soln #3, Delayed Decision
° Pipeline timing (clock axis 1 to 8); each instruction passes through IF ID EX M WB
• beq $1,$3,7
• add $4,$6,$6
• lw $4, 50($7)
° Move the instruction from before the branch into the slot after it
° Do not need to discard instruction
° Decision made in ID stage: branch
Branch Hazards: Soln #3, Delayed Decision
° Pipeline timing (clock axis 1 to 8); each instruction passes through IF ID EX M WB
• beq $1,$3,7
• and $12, $2, $5
• lw $4, 50($7)
° Decision made in ID stage: do branch
Branch Hazards: Decision made in the ID stage (figure 6.4)
° Pipeline timing (clock axis 1 to 8); each instruction passes through IF ID EX M WB
• beq $1,$3,7
• nop
• lw $4, 50($7)
° No decision yet: insert a nop
° Decision: do load
Branch Hazards: Soln #2, Predict until Decision made
° Branch decision made in MEM stage: discard values when wrong prediction
° Predict false branch (branch not taken)
° [Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order (in instructions)]
• 40 beq $1, $3, 7
• 44 and $12, $2, $5
• 48 or $13, $6, $2
• 52 add $14, $2, $2
• 72 lw $4, 50($7)
° Same effect as 3 stalls: the three instructions fetched after the branch (44, 48, 52) are discarded when the branch is taken
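The addresses on this slide are consistent with the usual MIPS rule that the branch target is PC + 4 plus four times the word offset (40 + 4 + 4 x 7 = 72). The Python sketch below checks that, and also counts how many younger instructions are already in the pipeline when the branch decision is made in a given stage. The function names and the `STAGES` list are assumptions made for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def branch_target(pc: int, offset_words: int) -> int:
    """Target address of a taken MIPS branch at address pc."""
    return pc + 4 + 4 * offset_words

def flushed_on_mispredict(decision_stage: str) -> int:
    """Instructions fetched behind the branch by the time it reaches decision_stage."""
    return STAGES.index(decision_stage)

print(branch_target(40, 7))          # 72
print(flushed_on_mispredict("MEM"))  # 3
print(flushed_on_mispredict("ID"))   # 1
```

Moving the decision from MEM to ID cuts the cost of a wrong prediction from three discarded instructions to one, which is the motivation for the early branch comparison.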
Figure 6.51
° [Figure: pipelined datapath with early branch comparison. The register outputs are compared ("=") in the ID stage; the branch target is formed from the sign-extended, shifted-left-2 offset; IF.Flush clears the fetched instruction; the hazard detection unit and forwarding unit are also shown.]
° Early branch comparison
° Flush: if wrong prediction, add nops