Appendix A
Pipelining: Basic and Intermediate Concepts
Overview
- Basics of Pipelining
- Pipeline Hazards
- Pipeline Implementation
- Pipelining + Exceptions
- Pipeline to handle Multicycle Operations
[Figure: non-pipelined execution of ld r1, 100(r4); ld r2, 200(r5); ld r3, 300(r6). Horizontal axis: Time (ticks 2 to 18); vertical axis: Program execution order (in instructions). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg and takes 8 ns.]
ALU operation = 2 nsec, Register file access = 1 nsec;
Each instruction needs 4 clock cycles (i.e. 8 nsec) to execute.
Three instructions need 12 clock cycles (i.e. 24 nsec). CPI = 12 cycles / 3 instructions = 4 cycles/instruction.
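The CPI arithmetic above can be checked with a few lines of Python. This is only a sketch: the 2 nsec clock period is an assumption inferred from "4 clock cycles = 8 nsec", and the function name is made up for illustration.

```python
def unpipelined_cpi(num_instructions, cycles_per_instruction):
    """Return (total_cycles, CPI) when instructions run strictly one after another."""
    total_cycles = num_instructions * cycles_per_instruction
    return total_cycles, total_cycles / num_instructions


CLOCK_NS = 2  # assumed clock period: 8 nsec / 4 cycles

total_cycles, cpi = unpipelined_cpi(num_instructions=3, cycles_per_instruction=4)
total_ns = total_cycles * CLOCK_NS
print(total_cycles, total_ns, cpi)
```

Without pipelining, the CPI simply equals the number of cycles each instruction takes, whatever the instruction count.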
[Figures: Task Order versus Time]
Stage latencies: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns

I1: IF ID EX MEM WB, L(I1) = 28 ns
I2: IF ID EX MEM WB, L(I2) = 33 ns
I3: IF ID EX MEM WB, L(I3) = 38 ns
I4: IF ID EX MEM WB, L(I4) = 43 ns

We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
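The growing latencies can be reproduced with a small simulation. This is a sketch under stated assumptions, not the slide author's model: the stage times are taken in the order IF, ID, EX, MEM, WB; each stage holds one instruction at a time; a new instruction is fetched as soon as IF is free; and an instruction may wait between stages. The function name is hypothetical.

```python
def instruction_latencies(stage_ns, n_instr):
    """Latency (fetch start to write-back end) of each instruction, in ns."""
    stage_free = [0] * len(stage_ns)  # time at which each stage becomes free
    latencies = []
    for _ in range(n_instr):
        start = stage_free[0]  # fetch begins as soon as IF is free
        t = start
        for s, duration in enumerate(stage_ns):
            # Wait for the stage to be free, then occupy it for its duration.
            t = max(t, stage_free[s]) + duration
            stage_free[s] = t
        latencies.append(t - start)
    return latencies


unbalanced = instruction_latencies([5, 4, 5, 10, 4], 4)
balanced = instruction_latencies([10, 10, 10, 10, 10], 4)
print(unbalanced)  # [28, 33, 38, 43]
print(balanced)    # [50, 50, 50, 50]
```

With every stage stretched to 10 ns, each instruction takes longer (50 ns), but the latency is constant and one instruction completes every 10 ns.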
Depth of the pipeline
[Figure: successive instructions overlapped in the pipeline, each passing through IF, ID, EX, MEM, WB]
– Capacitance-charge-discharge rates
– Repeaters used to drive current, handle fan-out problems
– Time to charge/discharge adds to delay
– Dominant problem in old integration densities.
– Problem with this approach is power requirements go up
– Power dissipation becomes a problem.
– Speed-of-light propagation delays
much lower.
consume a large part of the clock cycle)
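$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$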
[Figure: pipeline with a multicycle FP Multiply unit: IF ID, then either EX or M1 M2 M3 M4 M5 (FP Multiply), then MEM WB]
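Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate

$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$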
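The comparison of SpeedUpA and SpeedUpB can be checked numerically. The sketch below uses the speedup formula with ideal CPI = 1; the function name and the depth of 5 are assumptions made only for illustration (the depth cancels in the ratio).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined clock cycle / pipelined clock cycle).

    Assumes an ideal CPI of 1.
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


DEPTH = 5  # hypothetical pipeline depth; any positive value gives the same ratio

speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)
ratio = speedup_a / speedup_b
print(round(speedup_b / DEPTH, 2), round(ratio, 2))  # 0.75 1.33
```

Machine A comes out about 1.33 times faster even though machine B has the faster clock, because the stall term in the denominator outweighs the 5% clock advantage.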
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

[Figure: pipeline timing of both sequences, each instruction passing through IF ID EX MEM WB]
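Why the reordered sequence is faster can be sketched by counting load-use stalls. The model below is an assumption, not something stated on the slides: a five-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the load immediately before it. The tuple encoding and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """Count one stall for every instruction that reads the destination
    register of the load immediately preceding it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# (opcode, destination register, source registers); memory operands omitted.
slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

slow_stalls = count_load_use_stalls(slow)
fast_stalls = count_load_use_stalls(fast)
print(slow_stalls, fast_stalls)  # 2 0
```

Under this model the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf), while the reordered sequence never uses a loaded value in the very next instruction.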
Branch                IF    ID    EX    MEM   WB
Branch successor            IF    stall stall IF    ID    EX    MEM   WB
Branch successor+1                                  IF    ID    EX    MEM   WB
Branch successor+2                                        IF    ID    EX    MEM   WB
Branch successor+3                                              IF    ID    EX    MEM
Branch successor+4                                                    IF    ID    EX
Branch delay of length n
Untaken Branch    IF    ID    EX    MEM   WB
Instruction i+1         IF    ID    EX    MEM   WB
Instruction i+2               IF    ID    EX    MEM   WB
Instruction i+3                     IF    ID    EX    MEM   WB

Taken Branch      IF    ID    EX    MEM   WB
Instruction i+1         IF    stall stall stall stall (clear the IF/ID register)
Branch target                 IF    ID    EX    MEM   WB
Branch target+1                     IF    ID    EX    MEM   WB
Branch target+2                           IF    ID    EX    MEM   WB
Compiler organizes code so that the most frequent path is the not-taken one
Untaken Branch    IF    ID    EX    MEM   WB
Instruction i+1         IF    stall stall stall stall (clear the IF/ID register)
Instruction i+2               IF    ID    EX    MEM   WB
Instruction i+3                     IF    ID    EX    MEM   WB
Instruction i+4                           IF    ID    EX    MEM   WB

Taken Branch      IF    ID    EX    MEM   WB
Instruction i+1         IF    ID    EX    MEM   WB
Branch target                 IF    ID    EX    MEM   WB
Branch target+1                     IF    ID    EX    MEM   WB
Branch target+2                           IF    ID    EX    MEM   WB
[Figure: scheduling the branch delay slot: (a) From before, (b) From target, (c) From fall through]

a) From before        Branch must not depend on delayed instruction                      Always
b) From target        Must be OK to execute delayed instruction if branch is not taken   When branch is taken
c) From fall through  Must be OK to execute delayed instruction if branch is taken       When branch is not taken
If branch is almost always taken
If branch is almost never taken
[Figure: instruction formats with bit-field boundaries at 5|6, 10|11, 15|16, 20|21, and 31]
DADD R5, R6, R7
DSUB R8, R6, R7
OR R9, R6, R7

DADD R5, R1, R7
DSUB R8, R6, R7
OR R9, R6, R7

DADD R5, R6, R7
DSUB R8, R1, R7
OR R9, R6, R7

DADD R5, R6, R7
DSUB R8, R6, R7
OR R9, R1, R7
[Figure: datapath view (IM, Reg, ALU, DM, Reg) of LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9]
LW R1, 0(R2)      IF    ID    EX    MEM   WB
SUB R4, R1, R5          IF    ID    stall EX    MEM   WB
AND R6, R1, R7                IF    stall ID    EX    MEM   WB
OR R8, R1, R9                       stall IF    ID    EX    MEM   WB
ID/EX.IR 0..5   IF/ID.IR 0..5                  Comparison
Load            r-r ALU                        ID/EX.IR[RT] == IF/ID.IR[RS]
Load            r-r ALU                        ID/EX.IR[RT] == IF/ID.IR[RT]
Load            Load, Store, r-i ALU, branch   ID/EX.IR[RT] == IF/ID.IR[RS]
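The three comparisons in the table can be written as a small predicate. This is an illustrative sketch only: the `Instr` record, the instruction-class strings, and the function name are hypothetical stand-ins for the opcode and register fields held in the ID/EX and IF/ID pipeline registers.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    kind: str  # "load", "store", "rr_alu", "ri_alu", or "branch"
    rs: int    # RS register field
    rt: int    # RT register field (the destination of a load)


def needs_load_interlock(id_ex, if_id):
    """Return True if the instruction in IF/ID must stall behind a load in ID/EX."""
    if id_ex.kind != "load":
        return False
    if if_id.kind == "rr_alu":
        # Rows 1 and 2: a register-register ALU op reads both RS and RT.
        return id_ex.rt in (if_id.rs, if_id.rt)
    if if_id.kind in ("load", "store", "ri_alu", "branch"):
        # Row 3: only RS is compared for these instruction classes.
        return id_ex.rt == if_id.rs
    return False


lw = Instr("load", rs=2, rt=1)       # LW R1, 0(R2)
sub = Instr("rr_alu", rs=1, rt=5)    # SUB R4, R1, R5
dadd = Instr("rr_alu", rs=6, rt=7)   # DADD R5, R6, R7
print(needs_load_interlock(lw, sub), needs_load_interlock(lw, dadd))  # True False
```

SUB reads R1 right after the load writes it, so it stalls; DADD reads only R6 and R7 and does not.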
[Figure: exception handling in the pipeline (IF ID EX WB). Labels: CPU, Cache, Memory, Disk, Complete, Suspend Execution, Trap addr, Exception handling procedure]
[Figure: overlapped instructions (IF ID EX M WB) with an Exception Status Vector; label: Check exceptions here]
Autoincrement addressing modes
IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
IF ID A1 A2 A3 A4 Mem WB
IF ID EX Mem WB
IF ID EX Mem WB

LD F4, 0(R2)      IF    ID    EX    Mem   WB
MULTD F0, F4, F6        IF    ID    stall M1    M2    M3    M4    M5    M6    M7    Mem   WB
ADD F2, F0, F8                IF    stall ID    stall stall stall stall stall stall A1    A2    A3    A4    Mem   WB
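The stall counts for the LD / MULTD / ADD sequence can be estimated with a simple in-order model. This is a sketch with several assumptions that are not stated on the slides: full forwarding; a dependent instruction can start executing 2 cycles after a load starts executing, 7 cycles after a multiply (M1 to M7), and 4 cycles after an add (A1 to A4); and instructions start executing in program order. All names are hypothetical.

```python
# Assumed cycles from the start of a producer's execution until a dependent
# instruction may start executing.
RESULT_DELAY = {"LD": 2, "MULTD": 7, "ADD": 4}


def stall_cycles(program):
    """program: list of (opcode, dest, sources).

    Returns, per instruction, how many cycles its first execute stage is
    delayed relative to a stall-free pipeline.
    """
    ready_at = {}      # register -> first cycle a consumer may start executing
    prev_start = None  # execute-start cycle of the previous instruction
    stalls = []
    for k, (op, dest, sources) in enumerate(program):
        ideal = 3 + k  # IF in cycle k+1, ID in k+2, first execute stage in k+3
        start = ideal if prev_start is None else max(ideal, prev_start + 1)
        for reg in sources:
            if reg in ready_at:
                start = max(start, ready_at[reg])
        ready_at[dest] = start + RESULT_DELAY[op]
        stalls.append(start - ideal)
        prev_start = start
    return stalls


program = [
    ("LD", "F4", ["R2"]),
    ("MULTD", "F0", ["F4", "F6"]),
    ("ADD", "F2", ["F0", "F8"]),
]
per_instruction = stall_cycles(program)
print(per_instruction, sum(per_instruction))  # [0, 1, 7] 8
```

Under these assumptions MULTD waits one cycle for the load, and ADD is delayed seven cycles in total (one inherited from MULTD plus six waiting for the multiply result), for eight stall slots overall.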