Processor Design - Pipelined Processor
Hung-Wei Tseng
Drawbacks of a single-cycle processor
The cycle time is determined by the longest instruction.
The cycle can be very long: think about fetching data from DRAM.
The hardware is mostly idle.
pipeline stages
pipeline registers (latches)
[Figure: a 10 ns datapath divided into five 2 ns pipeline stages separated by latches; the row of stages is repeated, one row per instruction.]
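The critical path is the longest possible delay between two registers in a design.
The cycle time must be long enough for a signal to traverse the critical path.
Lengthening or shortening non-critical paths does not change performance.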
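As a rough illustration of why the slowest stage sets the cycle time, the sketch below compares a single-cycle design with a pipelined one. The five 2 ns stage latencies are taken from the figure; the function names are hypothetical, and latch overhead is assumed to be zero.

```python
# Sketch: cycle time and total run time, single-cycle vs. pipelined.
# Assumption: latch (pipeline register) overhead is ignored, and the
# pipeline never stalls.

STAGE_LATENCIES_NS = [2, 2, 2, 2, 2]  # five stages, 10 ns end to end


def single_cycle_time(stage_latencies):
    """A single-cycle design must fit the whole instruction in one cycle."""
    return sum(stage_latencies)


def pipelined_cycle_time(stage_latencies):
    """The slowest stage is the critical path between two latches."""
    return max(stage_latencies)


def single_cycle_run_time(stage_latencies, n_instructions):
    return n_instructions * single_cycle_time(stage_latencies)


def pipelined_run_time(stage_latencies, n_instructions):
    # The first instruction needs one cycle per stage; every later
    # instruction finishes one cycle after its predecessor.
    cycles = len(stage_latencies) + n_instructions - 1
    return cycles * pipelined_cycle_time(stage_latencies)


if __name__ == "__main__":
    print(single_cycle_run_time(STAGE_LATENCIES_NS, 5), "ns single-cycle")
    print(pipelined_run_time(STAGE_LATENCIES_NS, 5), "ns pipelined")
```

Under these assumptions, unbalanced stages hurt directly: changing one stage to 3 ns raises the pipelined cycle time to 3 ns even though the other stages stay at 2 ns or less.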
Instruction Fetch (IF), Instruction Decode (ID), Execution (EXE), Memory Access (MEM), Write Back (WB)
[Figure: pipelined datapath. The PC, instruction memory, register file, sign-extend unit, ALU, and data memory are separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers into the stages Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write Back. Control signals shown: RegDst, ALUSrc, ALUop, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc, where PCSrc = Branch & Zero.]
[Figure: the same pipelined datapath, stepped through cycle by cycle as the following instructions enter the pipeline.]
add $1, $2, $3
lw $4, 0($5)
sub $6, $7, $8
sub $9, $10, $11
sw $1, 0($12)
[Figure: pipelined datapath with a Control unit. The control signals are grouped into EX, ME, and WB fields that travel with the instruction: ID/EX carries WB, ME, and EX; EX/MEM carries WB and ME; MEM/WB carries WB. The inst[15:11] field and RegWrite are routed through the pipeline registers as well.]
Pipeline diagram for the instruction stream, with the abbreviations for pipeline stages:

cycle               1    2    3    4    5    6    7    8    9
add $1, $2, $3      IF   ID   EXE  MEM  WB
lw $4, 0($5)             IF   ID   EXE  MEM  WB
sub $6, $7, $8                IF   ID   EXE  MEM  WB
sub $9, $10, $11                   IF   ID   EXE  MEM  WB
sw $1, 0($12)                           IF   ID   EXE  MEM  WB
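The staggered schedule for this instruction stream can be generated mechanically. The following is a minimal sketch, assuming an ideal pipeline with no hazards or stalls; `ideal_schedule` and `render` are hypothetical helper names.

```python
# Sketch: build a pipeline diagram for an ideal five-stage pipeline.
# Assumption: one instruction enters IF every cycle and nothing stalls.

STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

PROGRAM = [
    "add $1, $2, $3",
    "lw $4, 0($5)",
    "sub $6, $7, $8",
    "sub $9, $10, $11",
    "sw $1, 0($12)",
]


def ideal_schedule(instructions):
    """Return one dict per instruction, mapping cycle number to stage."""
    rows = []
    for index, _ in enumerate(instructions):
        first_cycle = index + 1  # instruction i is fetched in cycle i + 1
        rows.append({first_cycle + k: stage for k, stage in enumerate(STAGES)})
    return rows


def render(instructions, rows):
    """Format the schedule as a plain-text pipeline diagram."""
    last_cycle = max(max(row) for row in rows)
    width = max(len(text) for text in instructions) + 2
    cycles = range(1, last_cycle + 1)
    header = "cycle".ljust(width) + "".join(str(c).ljust(5) for c in cycles)
    lines = [header.rstrip()]
    for text, row in zip(instructions, rows):
        cells = "".join(row.get(c, "").ljust(5) for c in cycles)
        lines.append((text.ljust(width) + cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(PROGRAM, ideal_schedule(PROGRAM)))
```

With five instructions and five stages, the last instruction leaves WB in cycle 9, so the stream takes 9 cycles instead of the 25 stage-times a non-overlapped execution would need.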
still hard to achieve CPI == 1.
instruction in the pipeline
instructions that we want to execute in the same cycle
two instructions competing for the same register.
before the end of the clock cycle.
cycle               1    2    3    4    5    6    7    8    9
add $1, $2, $3      IF   ID   EXE  MEM  WB
lw $4, 0($5)             IF   ID   EXE  MEM  WB
sub $6, $7, $8                IF   ID   EXE  MEM  WB
sub $9, $10, $1                    IF   ID   EXE  MEM  WB
sw $1, 0($12)                           IF   ID   EXE  MEM  WB
that is not available
consumes the result is still in the pipeline
stall the pipeline
propagate through the pipeline like a nop instruction
updates
cycle               1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
add $1, $2, $3      IF   ID   EXE  MEM  WB
lw $4, 0($1)             IF   ID   ID   ID   EXE  MEM  WB
sub $5, $2, $4                IF   IF   IF   ID   ID   ID   EXE  MEM  WB
sub $1, $3, $1                               IF   IF   IF   ID   EXE  MEM  WB
sw $1, 0($5)                                                IF   ID   ID   ID   EXE  MEM  WB
but publicized in WB!
away!
The result of add $1, $2, $3 is computed in its EXE stage (cycle 3). We obtain the result here!

cycle               1    2    3    4    5    6    7    8    9    10
add $1, $2, $3      IF   ID   EXE  MEM  WB
lw $4, 0($1)             IF   ID   EXE  MEM  WB
sub $5, $2, $4                IF   ID   ID   EXE  MEM  WB
sub $1, $3, $1                     IF   IF   ID   EXE  MEM  WB
sw $1, 0($5)                                 IF   ID   EXE  MEM  WB
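The stall counts for this instruction stream, with and without forwarding, can be checked with a small timing model. This is a sketch under stated assumptions, not the slides' own method: the register file is assumed to write in the first half of a cycle and read in the second half, a store is assumed to need both operands at EXE, and the `Instr` tuple and `schedule` function are hypothetical names.

```python
# Sketch: first cycle of each stage for an in-order five-stage pipeline.
# Assumptions: register file writes before it reads within one cycle;
# with forwarding, ALU results are usable one cycle after EXE and load
# results one cycle after MEM.
from collections import namedtuple

Instr = namedtuple("Instr", "text dest sources is_load")

PROGRAM = [
    Instr("add $1, $2, $3", 1, (2, 3), False),
    Instr("lw $4, 0($1)", 4, (1,), True),
    Instr("sub $5, $2, $4", 5, (2, 4), False),
    Instr("sub $1, $3, $1", 1, (3, 1), False),
    Instr("sw $1, 0($5)", None, (1, 5), False),
]


def schedule(program, forwarding):
    """Return one dict per instruction with the first cycle of each stage."""
    timing = []
    last_writer = {}  # register number -> index of the latest earlier writer
    for i, ins in enumerate(program):
        if i == 0:
            if_start, id_start = 1, 2
        else:
            prev = timing[i - 1]
            if_start = prev["ID"]                      # IF frees when prev enters ID
            id_start = max(if_start + 1, prev["EXE"])  # ID frees when prev enters EXE
        exe = id_start + 1
        for reg in ins.sources:
            if reg not in last_writer:
                continue
            producer_index = last_writer[reg]
            producer = timing[producer_index]
            if not forwarding:
                exe = max(exe, producer["WB"] + 1)   # last ID cycle overlaps WB
            elif program[producer_index].is_load:
                exe = max(exe, producer["MEM"] + 1)  # load data arrives after MEM
            else:
                exe = max(exe, producer["EXE"] + 1)  # ALU result arrives after EXE
        timing.append({"IF": if_start, "ID": id_start, "EXE": exe,
                       "MEM": exe + 1, "WB": exe + 2})
        if ins.dest is not None:
            last_writer[ins.dest] = i
    return timing


if __name__ == "__main__":
    for mode in (False, True):
        print("forwarding" if mode else "no forwarding",
              "finishes in cycle", schedule(PROGRAM, mode)[-1]["WB"])
```

In this model the only stall left with forwarding is the load-use dependence between lw $4 and sub $5, $2, $4.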
result from a previous instruction that is entering MEM stage or WB stage
register file
[Figure: pipelined datapath with a forwarding unit. The forwarding unit drives ForwardA and ForwardB, the select signals of the muxes in front of the ALU inputs. Its inputs are the destination (Rd) of Ins#1, Rs and Rt of Ins#2, the ALU or MEM result of Ins#1, and the control signals of Ins#1 and Ins#2.]
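The selection made by the forwarding unit can be sketched as a pair of comparisons per ALU operand. This is a minimal sketch in the style of the usual textbook five-stage MIPS design; the function names, the dictionary layout, and the string labels for the mux choices are assumptions, and the check against register 0 reflects that MIPS register $0 always reads as zero.

```python
# Sketch: forwarding unit for the two ALU operands of Ins#2 (in EXE).
# ex_mem / mem_wb describe the older instructions (Ins#1) currently held
# in the EX/MEM and MEM/WB pipeline registers.


def forward_select(src_reg, ex_mem, mem_wb):
    """Pick the source of one ALU operand."""
    # The newest producer wins, so EX/MEM is checked before MEM/WB.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return the (ForwardA, ForwardB) selections."""
    forward_a = forward_select(id_ex_rs, ex_mem, mem_wb)
    forward_b = forward_select(id_ex_rt, ex_mem, mem_wb)
    return forward_a, forward_b


if __name__ == "__main__":
    # lw $4, 0($1) in EXE right after add $1, $2, $3: Rs = 1, Rt = 4.
    add_in_mem = {"reg_write": True, "rd": 1}
    nothing = {"reg_write": False, "rd": 0}
    print(forwarding_unit(1, 4, add_in_mem, nothing))
```

Checking EX/MEM first matters when two in-flight instructions write the same register: the consumer must see the most recent value.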
lw generates its result in the MEM stage, so we have to stall
instruction that has not finished its MEM stage yet, we have to stall!
We need to know the following:
[Figure: pipelined datapath with the forwarding unit and a hazard detection unit. The hazard detection unit takes ID/EX.MemRead as an input and drives PCWrite, IF/IDWrite, and an additional mux.]
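The load-use check performed by the hazard detection unit can be sketched as follows. The signal names ID/EX.MemRead, PCWrite, and IF/IDWrite come from the figure; the function name, the argument layout, and the `insert_nop` flag (standing in for the mux that injects a nop) are assumptions made for illustration.

```python
# Sketch: load-use hazard detection.
# A stall is needed when the instruction in EXE is a load (ID/EX.MemRead)
# whose destination (its Rt field) is a source of the instruction in ID.


def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,      # hold the PC so the same fetch repeats
        "IF/IDWrite": not stall,   # hold the instruction waiting in ID
        "insert_nop": stall,       # send a nop down the pipeline instead
    }


if __name__ == "__main__":
    # lw $4, 0($1) is in EXE; sub $5, $2, $4 is in ID (Rs = 2, Rt = 4).
    print(hazard_detection(True, 4, 2, 4))
```

When the stall is asserted, the PC and the IF/ID register keep their values for one cycle, and the bubble propagates through the pipeline like a nop instruction.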
fetch
LOOP: lw   $t3, 0($s0)
      addi $t0, $t0, 1
      add  $v0, $v0, $t3
      addi $s0, $s0, 4
      bne  $t1, $t0, LOOP
      lw   $t3, 0($s0)
[Pipeline diagram: each of the six instructions passes through IF, ID, EXE, MEM, and WB once; the lw that follows bne is delayed.]
stall
always executed
Original loop:
LOOP: lw   $t3, 0($s0)
      addi $t0, $t0, 1
      add  $v0, $v0, $t3
      addi $s0, $s0, 4
      bne  $t1, $t0, LOOP
      (branch delay slot)

Loop with addi $s0, $s0, 4 moved into the branch delay slot:
LOOP: lw   $t3, 0($s0)
      addi $t0, $t0, 1
      add  $v0, $v0, $t3
      bne  $t1, $t0, LOOP
      addi $s0, $s0, 4
      lw   $t3, 0($s0)
[Pipeline diagram: each of the six instructions passes through IF, ID, EXE, MEM, and WB once.]
stall
LOOP: lw   $t3, 0($s0)
      addi $t0, $t0, 1
      add  $v0, $v0, $t3
      addi $s0, $s0, 4
      bne  $t1, $t0, LOOP
      sw   $v0, 0($s1)
      add  $t4, $t3, $t5
      lw   $t3, 0($s0)
[Pipeline diagram: sw and add are fetched after bne, and their remaining stages are replaced by nops; lw $t3, 0($s0) is fetched afterwards and completes normally.]
Flush the instructions fetched incorrectly.
[Figure: pipelined datapath with the forwarding unit, the hazard detection unit, the shift-left-2 unit and adder for the branch target, and a Branch Target Buffer.]
Consult the BTB in the fetch stage.
Branch Target Buffer: looked up with the PC; each entry holds a branch PC and its target address or target instruction.
LOOP: lw   $t3, 0($s0)
      addi $t0, $t0, 1
      add  $v0, $v0, $t3
      addi $s0, $s0, 4
      bne  $t1, $t0, LOOP
      lw   $t3, 0($s0)
      addi $t0, $t0, 1
      add  $v0, $v0, $t3
[Pipeline diagram: lw, addi, and add of the next iteration follow bne into the pipeline; each of the eight instructions occupies IF, ID, EXE, MEM, and WB exactly once.]
result of the last time this branch executed
Branch Target Buffer
branch PC    target address    prediction bit
0x400420     0x8048324         1
0x400464     0x8048392         1
0x400578     0x804850a
0x41000C     0x8049624         1
PC = 0x400420: Taken!
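A fetch-stage lookup of this kind can be sketched with a dictionary keyed by branch PC. The addresses and bits below are the ones listed for the Branch Target Buffer; the entry for 0x400578 is left out because its prediction bit is not legible in the source. The function names are hypothetical, and the fall-through address PC + 4 assumes 4-byte instructions as in the datapath figures.

```python
# Sketch: consulting a Branch Target Buffer in the fetch stage, with a
# 1-bit prediction that records the result of the last execution.

BTB = {
    # branch PC: (target address, prediction bit)
    0x400420: (0x8048324, 1),
    0x400464: (0x8048392, 1),
    0x41000C: (0x8049624, 1),
}


def next_fetch_pc(pc, btb):
    """Return the address to fetch after the instruction at pc."""
    entry = btb.get(pc)
    if entry is not None:
        target, predict_taken = entry
        if predict_taken:
            return target
    return pc + 4  # not a known branch, or predicted not taken


def update(btb, branch_pc, target, taken):
    """Record the outcome once the branch has been resolved."""
    btb[branch_pc] = (target, 1 if taken else 0)


if __name__ == "__main__":
    print(hex(next_fetch_pc(0x400420, BTB)))
```

Because the buffer is indexed by the PC alone, the prediction is available before the instruction has even been decoded.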
[State diagram: 2-bit predictor with four states, Taken (11), Taken (10), Not Taken (01), and Not Taken (00); each state has one outgoing edge labeled taken and one labeled not taken.]
Branch Target Buffer
branch PC    target address    2-bit state
0x400420     0x8048324         11
0x400464     0x8048392         10
0x400578     0x804850a         00
0x41000C     0x8049624         01
PC = 0x400420: Taken!

for (i = 0; i < 10; i++) {
    sum += a[i];
}
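The 2-bit state machine behaves like a saturating counter, and its benefit over a 1-bit "last outcome" predictor shows up on loop branches. The sketch below is illustrative only: it assumes the loop above compiles to a backward branch that is taken 9 times and not taken once per execution, that the loop is executed three times, and that each predictor starts in its strongest taken state. The class and function names are hypothetical.

```python
# Sketch: n-bit saturating counter predictor (n = 1 or 2 here).


class SaturatingCounter:
    def __init__(self, bits=2, state=None):
        self.max_state = (1 << bits) - 1    # 3 (binary 11) for 2 bits
        self.threshold = 1 << (bits - 1)    # states >= threshold predict taken
        self.state = self.max_state if state is None else state

    def predict(self):
        return self.state >= self.threshold

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, self.max_state)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(outcomes, bits):
    counter = SaturatingCounter(bits)
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Assumed outcome stream: three executions of a 10-iteration loop.
LOOP_BRANCH = ([True] * 9 + [False]) * 3

if __name__ == "__main__":
    print("1-bit:", count_mispredictions(LOOP_BRANCH, 1))
    print("2-bit:", count_mispredictions(LOOP_BRANCH, 2))
```

Under these assumptions the 1-bit predictor misses twice per loop execution after the first (at the exit and again on re-entry), while the 2-bit predictor misses only at the exit.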
Branch, and branch resolved in EX stage, average CPI?
i = 0;
do {
    if (i % 3 != 0)       // Branch Y, taken if i % 3 == 0
        a[i] *= 2;
    a[i] += i;
} while (++i < 100);      // Branch X
i    branch    result
0    Y         T
0    X         T
1    Y         NT
1    X         T
2    Y         NT
2    X         T
3    Y         T
3    X         T
4    Y         NT
4    X         T
5    Y         NT
5    X         T
6    Y         T
6    X         T
7    Y         NT
Can we capture the pattern?
a bit vector (global history register, GHR) made up of the previous branch outcomes.
[Figure: an n-bit GHR indexes a history table with 2^n entries of 2-bit counters. The example table holds 01, 11, 10, 11, 00, 11, 11, 10. GHR = 101 (T, NT, T): Taken!]
i     branch    GHR     BHT    prediction    actual    new BHT
0     Y         0000    10     T             T         11
0     X         0001    10     T             T         11
1     Y         0011    10     T             NT        01
1     X         0110    10     T             T         11
2     Y         1101    10     T             NT        01
2     X         1010    10     T             T         11
3     Y         0101    10     T             T         11
3     X         1011    10     T             T         11
4     Y         0111    10     T             NT        01
4     X         1110    10     T             T         11
5     Y         1101    01     NT            NT        00
5     X         1010    11     T             T         11
6     Y         0101    11     T             T         11
6     X         1011    11     T             T         11
7     Y         0111    01     NT            NT        00
7     X         1110    11     T             T         11
8     Y         1101    00     NT            NT        00
8     X         1010    11     T             T         11
9     Y         0101    11     T             T         11
9     X         1011    11     T             T         11
10    Y         0111    00     NT            NT        00
Assume that we start with a 4-bit GHR = 0 and all counters at 10. Prediction is nearly perfect after this.
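The trace can be replayed with a short simulation. This is a sketch under the stated starting conditions (4-bit GHR = 0, all counters at 10); the function names and the tuple layout of the log are hypothetical, and the history table is indexed by the GHR alone.

```python
# Sketch: global-history predictor with a table of 2-bit counters.


def branch_outcomes(iterations):
    """Outcomes of Branch Y and Branch X, in program order."""
    outcomes = []
    for i in range(iterations):
        outcomes.append(i % 3 == 0)   # Branch Y: taken if i % 3 == 0
        outcomes.append(i + 1 < 100)  # Branch X: taken while ++i < 100
    return outcomes


def run_global_predictor(outcomes, ghr_bits=4, initial_counter=0b10):
    """Return (GHR, counter, prediction, actual, new counter) per branch."""
    mask = (1 << ghr_bits) - 1
    table = [initial_counter] * (1 << ghr_bits)  # 2^n two-bit counters
    ghr = 0
    log = []
    for taken in outcomes:
        counter = table[ghr]
        predicted = counter >= 0b10               # 10 and 11 predict taken
        new_counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        table[ghr] = new_counter
        log.append((ghr, counter, predicted, taken, new_counter))
        ghr = ((ghr << 1) | int(taken)) & mask    # shift in the newest outcome
    return log


if __name__ == "__main__":
    for ghr, counter, predicted, taken, new in run_global_predictor(branch_outcomes(11)):
        print(f"{ghr:04b} {counter:02b} "
              f"{'T' if predicted else 'NT':2} {'T' if taken else 'NT':2} {new:02b}")
```

In this replay the only mispredictions among the first 21 branches are Branch Y at i = 1, 2, and 4; once each recurring history pattern has trained its counter, every later prediction in the steady state of the loop is correct.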