Chapter 7 <1>
Digital Design and Computer Architecture, 2nd Edition
David Money Harris and Sarah L. Harris
Chapter 7 <2>
Chapter 7 :: Topics
- Introduction
- Performance Analysis
- Single-Cycle Processor
- Multicycle Processor
- Pipelined Processor
- Exceptions
- Advanced Microarchitecture
Chapter 7 <3>
- Microarchitecture: how to implement an architecture in hardware
- Processor:
  – Datapath: functional blocks
  – Control: control signals
[Figure: levels of abstraction, from physics (electrons), devices (transistors, diodes), analog circuits (amplifiers, filters), digital circuits (AND gates, NOT gates), logic (adders, memories), microarchitecture (datapaths, controllers), and architecture (instructions, registers) up to operating systems (device drivers) and application software (programs)]
Introduction
Chapter 7 <4>
- Multiple implementations for a single architecture:
  – Single-cycle: Each instruction executes in a single cycle
  – Multicycle: Each instruction is broken into a series of shorter steps
  – Pipelined: Each instruction is broken up into a series of steps, and multiple instructions execute at once
Microarchitecture
Chapter 7 <5>
- Program execution time
  Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
- Definitions:
  – CPI: cycles/instruction
  – Clock period: seconds/cycle
  – IPC: instructions/cycle = 1/CPI
- Challenge is to satisfy constraints of:
  – Cost
  – Power
  – Performance
Processor Performance
Chapter 7 <6>
- Consider a subset of MIPS instructions:
  – R-type instructions: and, or, add, sub, slt
  – Memory instructions: lw, sw
  – Branch instructions: beq
MIPS Processor
Chapter 7 <7>
- Determines everything about a processor:
  – PC
  – 32 registers
  – Memory
Architectural State
Chapter 7 <8>
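[Figure: the MIPS state elements: program counter (PC' in, PC out), instruction memory (A, RD), register file (A1, A2, A3, WD3, WE3, RD1, RD2), and data memory (A, RD, WD, WE); data buses are 32 bits wide and register addresses are 5 bits wide]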
MIPS State Elements
Chapter 7 <9>
- Datapath
- Control
Single-Cycle MIPS Processor
Chapter 7 <10>
STEP 1: Fetch instruction
[Figure: single-cycle datapath state elements; the PC supplies the instruction memory address and the word read is labeled Instr]
Single-Cycle Datapath: lw fetch
Chapter 7 <11>
STEP 2: Read source operands from RF
[Figure: datapath for lw register read; Instr[25:21] drives register file address input A1]
Single-Cycle Datapath: lw Register Read
Chapter 7 <12>
STEP 3: Sign-extend the immediate
[Figure: datapath for lw immediate; Instr[15:0] passes through the Sign Extend unit to produce SignImm]
Single-Cycle Datapath: lw Immediate
Chapter 7 <13>
STEP 4: Compute the memory address
[Figure: datapath for lw address computation; the ALU takes SrcA and SrcB and produces ALUResult and Zero, with ALUControl2:0 = 010 (add)]
Single-Cycle Datapath: lw address
Chapter 7 <14>
STEP 5: Read data from memory and write it back to register file
[Figure: datapath for lw memory read; ALUResult addresses the data memory, ReadData returns to register file port WD3, Instr[20:16] selects the destination on A3, RegWrite = 1, and ALUControl2:0 = 010]
Single-Cycle Datapath: lw Memory Read
Chapter 7 <15>
STEP 6: Determine address of next instruction
[Figure: datapath for lw PC increment; an adder computes PCPlus4 = PC + 4, which becomes PC']
Single-Cycle Datapath: lw PC Increment
Chapter 7 <16>
Write data in rt to memory
[Figure: datapath for sw; Instr[20:16] drives register file address A2, the value read is sent to the data memory as WriteData, MemWrite = 1, and ALUControl2:0 = 010 (add)]
Single-Cycle Datapath: sw
Chapter 7 <17>
- Read from rs and rt
- Write ALUResult to register file
- Write to rd (instead of rt)
[Figure: datapath for R-type instructions; multiplexers controlled by RegDst, ALUSrc, and MemtoReg are added, WriteReg4:0 is chosen from Instr[20:16] or Instr[15:11], RegWrite = 1, and ALUControl2:0 varies with the instruction]
Single-Cycle Datapath: R-Type
Chapter 7 <18>
- Determine whether values in rs and rt are equal
- Calculate branch target address:
BTA = (sign-extended immediate << 2) + (PC+4)
[Figure: datapath for beq; SignImm is shifted left by 2 and added to PCPlus4 to form PCBranch, a multiplexer controlled by PCSrc selects PC', Branch = 1, and ALUControl2:0 = 110 (subtract)]
Single-Cycle Datapath: beq
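The branch target address formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `sign_extend16` and `branch_target` are hypothetical, and values are wrapped to 32 bits to mimic the hardware adder.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc: int, imm16: int) -> int:
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    return ((sign_extend16(imm16) << 2) + (pc + 4)) & 0xFFFFFFFF


# Forward branch by 10 instructions, and a backward branch by 1 instruction.
forward = branch_target(0x00400000, 0x000A)
backward = branch_target(0x00400010, 0xFFFF)
```

The shift by 2 reflects that the immediate counts instructions (words) while the PC counts bytes.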
Chapter 7 <19>
[Figure: complete single-cycle processor; the Control Unit receives Op (Instr[31:26]) and Funct (Instr[5:0]) and drives MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite]
Single-Cycle Processor
Chapter 7 <20>
[Figure: single-cycle control unit; the Main Decoder takes Opcode5:0 and produces MemtoReg, MemWrite, Branch, ALUSrc, RegDst, RegWrite, and ALUOp1:0; the ALU Decoder takes ALUOp1:0 and Funct5:0 and produces ALUControl2:0]
Single-Cycle Control
Chapter 7 <21>
[Figure: ALU symbol with N-bit inputs A and B, 3-bit control input F, and N-bit output Y]

F2:0  Function
000   A & B
001   A | B
010   A + B
011   not used
100   A & ~B
101   A | ~B
110   A - B
111   SLT
Review: ALU
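As a behavioral sketch of the function table above, the following Python model maps each F2:0 code to its operation on 32-bit values. The names `alu` and `to_signed` are hypothetical, and SLT is modeled here as a plain signed comparison, which is an assumption about the intended behavior rather than a description of the gate-level circuit.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(a: int, b: int, f: int) -> tuple[int, int]:
    """Return (Y, Zero) for the 3-bit function code f."""
    a &= MASK32
    b &= MASK32
    if f == 0b000:
        y = a & b
    elif f == 0b001:
        y = a | b
    elif f == 0b010:
        y = a + b
    elif f == 0b100:
        y = a & ~b
    elif f == 0b101:
        y = a | ~b
    elif f == 0b110:
        y = a - b
    elif f == 0b111:
        y = 1 if to_signed(a) < to_signed(b) else 0  # SLT (assumed signed compare)
    else:
        raise ValueError("F = 011 is not used")
    y &= MASK32
    return y, int(y == 0)
```

The Zero output is what beq uses: subtracting equal operands (F = 110) yields Y = 0 and Zero = 1.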
Chapter 7 <22>
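[Figure: N-bit ALU implementation with inputs A and B, an adder with carry out Cout, a Zero Extend unit fed by the sign bit S[N-1], control bits F2 and F1:0, and output Y]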
Review: ALU
Chapter 7 <23>
ALUOp1:0  Meaning
00        Add
01        Subtract
10        Look at Funct
11        Not Used

ALUOp1:0  Funct         ALUControl2:0
00        X             010 (Add)
X1        X             110 (Subtract)
1X        100000 (add)  010 (Add)
1X        100010 (sub)  110 (Subtract)
1X        100100 (and)  000 (And)
1X        100101 (or)   001 (Or)
1X        101010 (slt)  111 (SLT)
Control Unit: ALU Decoder
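The ALU decoder truth table above could be expressed as a small lookup. This is a sketch only; `alu_decoder` and `FUNCT_TO_ALUCONTROL` are hypothetical names, and the unused ALUOp value 11 is treated like X1 (subtract) simply because that row is checked first.

```python
# Funct field -> ALUControl, used only when ALUOp = 1X ("look at Funct").
FUNCT_TO_ALUCONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_decoder(aluop: int, funct: int) -> int:
    """Return ALUControl2:0 for a 2-bit ALUOp and 6-bit Funct."""
    if aluop == 0b00:
        return 0b010  # add (lw, sw address calculation)
    if aluop & 0b01:
        return 0b110  # X1: subtract (beq comparison)
    # 1X: R-type, decode the Funct field; unknown Funct raises KeyError.
    return FUNCT_TO_ALUCONTROL[funct]
```

Splitting the decode this way keeps the main decoder small: it only has to emit two ALUOp bits per opcode.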
Chapter 7 <24>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000
lw           100011
sw           101011
beq          000100

[Figure: complete single-cycle processor with Control Unit, for filling in the table]
Control Unit Main Decoder
Chapter 7 <25>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
Control Unit: Main Decoder
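The main decoder is a pure lookup from opcode to control signals, so a dictionary is a natural way to sketch it. The values below restore the 0 entries that the flattened table appears to have dropped, which is an assumption; `Controls`, `MAIN_DECODER`, and `pc_src` are hypothetical names, `None` stands for a don't-care (X), and PCSrc = Branch AND Zero is assumed from the beq datapath.

```python
from typing import NamedTuple, Optional


class Controls(NamedTuple):
    reg_write: int
    reg_dst: Optional[int]     # None = don't care (X)
    alu_src: int
    branch: int
    mem_write: int
    mem_to_reg: Optional[int]  # None = don't care (X)
    alu_op: int


MAIN_DECODER = {
    0b000000: Controls(1, 1, 0, 0, 0, 0, 0b10),        # R-type
    0b100011: Controls(1, 0, 1, 0, 0, 1, 0b00),        # lw
    0b101011: Controls(0, None, 1, 0, 1, None, 0b00),  # sw
    0b000100: Controls(0, None, 0, 1, 0, None, 0b01),  # beq
}


def pc_src(opcode: int, zero: int) -> int:
    """Assumed branch decision: take the branch when Branch and Zero are both 1."""
    return MAIN_DECODER[opcode].branch & zero
```

Only instructions that write a register need RegDst and MemtoReg, which is why sw and beq leave them as don't-cares.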
Chapter 7 <26>
[Figure: single-cycle processor executing or; ALUControl2:0 = 001, RegDst = 1, and RegWrite = 1]
Single-Cycle Datapath: or
Chapter 7 <27>
[Figure: complete single-cycle processor]
No change to datapath
Extended Functionality: addi
Chapter 7 <28>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000
Control Unit: addi
Chapter 7 <29>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0
R-type       000000  1         1       0       0       0         0         10
lw           100011  1         0       1       0       0         1         00
sw           101011  0         X       1       0       1         X         00
beq          000100  0         X       0       1       0         X         01
addi         001000  1         0       1       0       0         0         00
Control Unit: addi
Chapter 7 <30>
[Figure: single-cycle processor extended for j; Instr[25:0] is shifted left by 2 to form bits 27:0 of PCJump, bits 31:28 complete the address, and a multiplexer controlled by Jump selects PCJump as PC']
Extended Functionality: j
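The jump address wiring in the figure (Instr[25:0] shifted left by 2 into bits 27:0, with bits 31:28 supplied separately) can be sketched as below. This assumes the standard MIPS convention that the upper four bits come from PC + 4; `jump_target` is a hypothetical name.

```python
def jump_target(pc: int, addr26: int) -> int:
    """PCJump = {(PC + 4)[31:28], addr26, 00}, assuming standard MIPS j semantics."""
    pc_plus4 = (pc + 4) & 0xFFFFFFFF
    upper = pc_plus4 & 0xF0000000          # bits 31:28
    lower = (addr26 & 0x03FFFFFF) << 2     # bits 27:0
    return upper | lower


target = jump_target(0x00400000, 0x0100028)
```

Unlike beq, no adder is needed: the target is formed by concatenating fields.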
Chapter 7 <31>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
R-type       000000  1         1       0       0       0         0         10        0
lw           100011  1         0       1       0       0         1         00        0
sw           101011  0         X       1       0       1         X         00        0
beq          000100  0         X       0       1       0         X         01        0
j            000010
Control Unit: Main Decoder
Chapter 7 <32>
Instruction  Op5:0   RegWrite  RegDst  ALUSrc  Branch  MemWrite  MemtoReg  ALUOp1:0  Jump
R-type       000000  1         1       0       0       0         0         10        0
lw           100011  1         0       1       0       0         1         00        0
sw           101011  0         X       1       0       1         X         00        0
beq          000100  0         X       0       1       0         X         01        0
j            000010  0         X       X       X       0         X         XX        1
Control Unit: Main Decoder
Chapter 7 <33>
Program Execution Time
= (# instructions)(cycles/instruction)(seconds/cycle)
= # instructions × CPI × Tc
Review: Processor Performance
Chapter 7 <34>
[Figure: single-cycle processor with the lw critical path highlighted; ALUControl2:0 = 010]
Tc limited by critical path (lw)
Single-Cycle Performance
Chapter 7 <35>
- Single-cycle critical path:
  Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
- Typically, limiting paths are:
  – memory, ALU, register file
  – Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
Single-Cycle Performance
Chapter 7 <36>
Element              Parameter  Delay (ps)
Register clock-to-Q  tpcq_PC    30
Register setup       tsetup     20
Multiplexer          tmux       25
ALU                  tALU       200
Memory read          tmem       250
Register file read   tRFread    150
Register file setup  tRFsetup   20

Tc = ?
Single-Cycle Performance Example
Chapter 7 <37>
Element              Parameter  Delay (ps)
Register clock-to-Q  tpcq_PC    30
Register setup       tsetup     20
Multiplexer          tmux       25
ALU                  tALU       200
Memory read          tmem       250
Register file read   tRFread    150
Register file setup  tRFsetup   20

Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
   = [30 + 2(250) + 150 + 25 + 200 + 20] ps
   = 925 ps
Single-Cycle Performance Example
Chapter 7 <38>
Program with 100 billion instructions:
Execution Time = # instructions × CPI × Tc
= (100 × 10^9)(1)(925 × 10^-12 s)
= 92.5 seconds
Single-Cycle Performance Example
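The single-cycle example can be reproduced numerically. The sketch below uses the delay values from the table and the simplified critical-path formula; the names `DELAYS_PS`, `single_cycle_tc_ps`, and `execution_time_s` are illustrative and not from the slides.

```python
# Element delays in picoseconds, taken from the example table.
DELAYS_PS = {
    "tpcq_PC": 30,
    "tsetup": 20,
    "tmux": 25,
    "tALU": 200,
    "tmem": 250,
    "tRFread": 150,
    "tRFsetup": 20,
}


def single_cycle_tc_ps(d: dict) -> int:
    """Tc = tpcq_PC + 2*tmem + tRFread + tmux + tALU + tRFsetup (in ps)."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tmux"] + d["tALU"] + d["tRFsetup"])


def execution_time_s(n_instructions: float, cpi: float, tc_ps: float) -> float:
    """Execution time = # instructions x CPI x Tc, with Tc converted to seconds."""
    return n_instructions * cpi * tc_ps * 1e-12


tc = single_cycle_tc_ps(DELAYS_PS)
seconds = execution_time_s(100e9, 1, tc)
```

Memory appears twice in the path (instruction fetch and data read), which is why tmem dominates the cycle time.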
Chapter 7 <39>
- Single-cycle:
  + simple
  – cycle time limited by longest instruction (lw)
  – 2 adders/ALUs & 2 memories
- Multicycle:
  + higher clock speed
  + simpler instructions run faster
  + reuse expensive hardware on multiple cycles
  – sequencing overhead paid many times
- Same design steps: datapath & control
Multicycle MIPS Processor
Chapter 7 <40>
[Figure: multicycle state elements: PC register with enable (EN), combined Instr / Data Memory (A, RD, WD, WE), and register file]
- Replace Instruction and Data memories with a single unified memory – more realistic
Multicycle State Elements
Chapter 7 <41>
[Figure: multicycle datapath for instruction fetch; the PC addresses the combined Instr / Data Memory and the fetched word is stored in a register enabled by IRWrite, producing Instr]
STEP 1: Fetch instruction
Multicycle Datapath: Instruction Fetch
Chapter 7 <42>
[Figure: multicycle datapath for lw register read; Instr[25:21] drives register file address A1 and the value read is stored in register A]
Multicycle Datapath: lw Register Read
STEP 2a: Read source operands from RF
Chapter 7 <43>
[Figure: multicycle datapath for lw immediate; Instr[15:0] passes through Sign Extend to produce SignImm]
Multicycle Datapath: lw Immediate
STEP 2b: Sign-extend the immediate
Chapter 7 <44>
[Figure: multicycle datapath for lw address; the ALU takes SrcA and SrcB, and ALUResult is stored in register ALUOut]
Multicycle Datapath: lw Address
STEP 3: Compute the memory address
Chapter 7 <45>
[Figure: multicycle datapath for lw memory read; a multiplexer controlled by IorD selects the memory address Adr, and the word read is stored in register Data]
Multicycle Datapath: lw Memory Read
STEP 4: Read data from memory
Chapter 7 <46>
[Figure: multicycle datapath for lw write register; Instr[20:16] selects the destination register on A3, and RegWrite enables the write]
Multicycle Datapath: lw Write Register
STEP 5: Write data back to register file
Chapter 7 <47>
[Figure: multicycle datapath for PC increment; ALUSrcA selects the ALU's first operand, ALUSrcB1:0 selects the second operand from a four-input multiplexer (00, 01, 10, 11) that includes the constant 4, and PCWrite enables the PC register]
Multicycle Datapath: Increment PC
STEP 6: Increment PC
Chapter 7 <48>
[Figure: multicycle datapath for sw; Instr[20:16] drives register file address A2, the value read is stored in register B and sent to the memory write data port WD, and MemWrite enables the write]
Write data in rt to memory
Multicycle Datapath: sw
Chapter 7 <49>
[Figure: multicycle datapath for R-type instructions; multiplexers controlled by RegDst and MemtoReg are added, and Instr[15:11] can select the destination register]
- Read from rs and rt
- Write ALUResult to register file
- Write to rd (instead of rt)
Multicycle Datapath: R-Type
Chapter 7 <50>
[Figure: multicycle datapath for beq; SignImm shifted left by 2 feeds the ALUSrcB multiplexer, a multiplexer controlled by PCSrc selects PC', and PCEn is derived from PCWrite, Branch, and Zero]
- rs == rt?
- BTA = (sign-extended immediate << 2) + (PC+4)
Multicycle Datapath: beq
Chapter 7 <51>
[Figure: complete multicycle processor; the Control Unit receives Op (Instr[31:26]) and Funct (Instr[5:0]) and drives IorD, MemWrite, IRWrite, PCWrite, Branch, PCSrc, ALUControl2:0, ALUSrcB1:0, ALUSrcA, RegWrite, MemtoReg, and RegDst]
Multicycle Processor
Chapter 7 <52>
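[Figure: multicycle control unit; the Main Controller (FSM) takes Opcode5:0 and produces the multiplexer selects (MemtoReg, RegDst, IorD, PCSrc, ALUSrcB1:0, ALUSrcA), the register enables (IRWrite, MemWrite, PCWrite, Branch, RegWrite), and ALUOp1:0; the ALU Decoder takes ALUOp1:0 and Funct5:0 and produces ALUControl2:0]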
Multicycle Control
Chapter 7 <53>
[Figure: complete multicycle processor with the control signal values for the Fetch state]
FSM so far:
S0: Fetch (entered on Reset)
Main Controller FSM: Fetch
Chapter 7 <54>
[Figure: complete multicycle processor with the control signal values for the Fetch state]
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
Main Controller FSM: Fetch
Chapter 7 <55>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0)
[Figure: complete multicycle processor during the Decode state]
Main Controller FSM: Decode
Chapter 7 <56>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0)
S2: MemAdr (from S1 when Op = LW or Op = SW)
[Figure: complete multicycle processor with the control signal values for the MemAdr state]
Main Controller FSM: Address
Chapter 7 <57>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0)
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
[Figure: complete multicycle processor with the control signal values for the MemAdr state]
Main Controller FSM: Address
Chapter 7 <58>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0)
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
Main Controller FSM: lw
Chapter 7 <59>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0)
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
Main Controller FSM: sw
Chapter 7 <60>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0)
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
Main Controller FSM: R-Type
Chapter 7 <61>
FSM so far:
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (from S1 when Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch
Main Controller FSM: beq
Chapter 7 <62>
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (from S1 when Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch
Multicycle Controller FSM
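To make the controller's sequencing concrete, here is a small Python sketch that walks the state transitions for lw, sw, R-type, and beq and counts cycles per instruction. It assumes that S4, S5, S7, and S8 all return to S0 (Fetch), which the flattened diagram does not show explicitly; `next_state` and `cycles` are hypothetical names.

```python
def next_state(state: str, op: str) -> str:
    """Next FSM state for the current state and instruction class."""
    if state == "S0":                       # Fetch
        return "S1"
    if state == "S1":                       # Decode: dispatch on opcode
        return {"lw": "S2", "sw": "S2", "rtype": "S6", "beq": "S8"}[op]
    if state == "S2":                       # MemAdr
        return "S3" if op == "lw" else "S5"
    if state == "S3":                       # MemRead
        return "S4"
    if state == "S6":                       # Execute
        return "S7"
    return "S0"                             # S4, S5, S7, S8 (assumed to return to Fetch)


def cycles(op: str) -> int:
    """Count clock cycles from Fetch until the FSM is back in Fetch."""
    state, n = "S0", 0
    while True:
        state = next_state(state, op)
        n += 1
        if state == "S0":
            return n


counts = {op: cycles(op) for op in ("lw", "sw", "rtype", "beq")}
```

Each state occupies one clock cycle, so the path length through the FSM is the instruction's cycle count.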
Chapter 7 <63>
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (from S1 when Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch
S9: ADDI Execute (from S1 when Op = ADDI)
S10: ADDI Writeback (follows S9)
Extended Functionality: addi
Chapter 7 <64>
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 0, IRWrite, PCWrite
S1: Decode (follows S0): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (from S1 when Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 1, Branch
S9: ADDI Execute (from S1 when Op = ADDI): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S10: ADDI Writeback (follows S9): RegDst = 0, MemtoReg = 0, RegWrite
Main Controller FSM: addi
Chapter 7 <65>
[Figure: multicycle datapath extended for j; PCSrc1:0 now selects among three PC' sources (00, 01, 10), one of which is PCJump, formed from the jump field Instr[25:0] shifted left by 2 (bits 27:0) together with bits 31:28]
Extended Functionality: j
Chapter 7 <66>
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 00, IRWrite, PCWrite
S1: Decode (follows S0): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (from S1 when Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 01, Branch
S9: ADDI Execute (from S1 when Op = ADDI): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S10: ADDI Writeback (follows S9): RegDst = 0, MemtoReg = 0, RegWrite
S11: Jump (from S1 when Op = J)
Main Controller FSM: j
Chapter 7 <67>
S0: Fetch (entered on Reset): IorD = 0, ALUSrcA = 0, ALUSrcB = 01, ALUOp = 00, PCSrc = 00, IRWrite, PCWrite
S1: Decode (follows S0): ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
S2: MemAdr (from S1 when Op = LW or Op = SW): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S3: MemRead (from S2 when Op = LW): IorD = 1
S4: Mem Writeback (follows S3): RegDst = 0, MemtoReg = 1, RegWrite
S5: MemWrite (from S2 when Op = SW): IorD = 1, MemWrite
S6: Execute (from S1 when Op = R-type): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
S7: ALU Writeback (follows S6): RegDst = 1, MemtoReg = 0, RegWrite
S8: Branch (from S1 when Op = BEQ): ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCSrc = 01, Branch
S9: ADDI Execute (from S1 when Op = ADDI): ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
S10: ADDI Writeback (follows S9): RegDst = 0, MemtoReg = 0, RegWrite
S11: Jump (from S1 when Op = J): PCSrc = 10, PCWrite
Main Controller FSM: j
Chapter 7 <68>
- Instructions take different numbers of cycles:
  – 3 cycles: beq, j
  – 4 cycles: R-type, sw, addi
  – 5 cycles: lw
- CPI is a weighted average
- SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 11% branches
  – 2% jumps
  – 52% R-type
Average CPI = (0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
Multicycle Processor Performance
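The weighted-average CPI above can be checked with a short calculation. The instruction mix and cycle counts come from the slide; the names `MIX`, `CYCLES`, and `average_cpi` are illustrative.

```python
# Fraction of each instruction class in the SPECINT2000 mix quoted above.
MIX = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "rtype": 0.52}

# Cycles each class takes on the multicycle processor.
CYCLES = {"lw": 5, "sw": 4, "beq": 3, "j": 3, "rtype": 4}


def average_cpi(mix: dict, cycles: dict) -> float:
    """Weighted average: sum over classes of (fraction x cycles)."""
    return sum(mix[name] * cycles[name] for name in mix)


cpi = average_cpi(MIX, CYCLES)
```

Because the mix is dominated by 4-cycle R-type instructions, the average lands close to 4 even though branches take only 3 cycles.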
Chapter 7 <69>
[Figure: complete multicycle processor]
Multicycle critical path:
Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup
Multicycle Processor Performance
Chapter 7 <70>
Element              Parameter  Delay (ps)
Register clock-to-Q  tpcq_PC    30
Register setup       tsetup     20
Multiplexer          tmux       25
ALU                  tALU       200
Memory read          tmem       250
Register file read   tRFread    150
Register file setup  tRFsetup   20

Tc = ?
Multicycle Performance Example
Chapter 7 <71>
Element              Parameter  Delay (ps)
Register clock-to-Q  tpcq_PC    30
Register setup       tsetup     20
Multiplexer          tmux       25
ALU                  tALU       200
Memory read          tmem       250
Register file read   tRFread    150
Register file setup  tRFsetup   20

Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup
   = tpcq_PC + tmux + tmem + tsetup
   = [30 + 25 + 250 + 20] ps
   = 325 ps
Multicycle Performance Example
Chapter 7 <72>
Program with 100 billion instructions
Execution Time = ?
Multicycle Performance Example
Chapter 7 <73>
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 10^9)(4.12)(325 × 10^-12 s)
= 133.9 seconds
This is slower than the single-cycle processor (92.5 seconds). Why?
Multicycle Performance Example
Chapter 7 <74>
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 10^9)(4.12)(325 × 10^-12 s)
= 133.9 seconds
This is slower than the single-cycle processor (92.5 seconds). Why?
– Not all steps are the same length
– Sequencing overhead for each step (tpcq + tsetup = 50 ps)
Multicycle Performance Example
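A short calculation makes the comparison concrete by looking at the average time per instruction, CPI × Tc, for both designs. The numbers are the ones used in the examples; `multicycle_tc_ps` and the other names are hypothetical helpers for this sketch.

```python
def multicycle_tc_ps(tpcq: int, tmux: int, talu: int, tmem: int, tsetup: int) -> int:
    """Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup (in ps)."""
    return tpcq + tmux + max(talu + tmux, tmem) + tsetup


tc_multi = multicycle_tc_ps(tpcq=30, tmux=25, talu=200, tmem=250, tsetup=20)

# Average time per instruction in picoseconds: CPI x Tc.
per_instr_multi_ps = 4.12 * tc_multi   # multicycle, CPI = 4.12
per_instr_single_ps = 1 * 925          # single-cycle, CPI = 1, Tc = 925 ps

# Total time for 100 billion instructions, in seconds.
total_multi_s = 100e9 * per_instr_multi_ps * 1e-12
total_single_s = 100e9 * per_instr_single_ps * 1e-12

# Sequencing overhead (tpcq + tsetup) paid per instruction on average.
overhead_multi_ps = 4.12 * (30 + 20)
```

The multicycle clock is faster, but every cycle is as long as the slowest step and pays the register overhead again, so the average instruction takes about 1339 ps instead of 925 ps.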
Chapter 7 <75>
[Figure: complete single-cycle processor with jump support (PCJump, Jump)]
Review: Single-Cycle Processor
Chapter 7 <76>
[Figure: complete multicycle processor with jump support (PCJump)]
Review: Multicycle Processor
Chapter 7 <77>
- Temporal parallelism
- Divide single-cycle processor into 5 stages:
  – Fetch
  – Decode
  – Execute
  – Memory
  – Writeback
- Add pipeline registers between stages
Pipelined MIPS Processor
Chapter 7 <78>
[Figure: timing diagrams for the single-cycle and pipelined processors, each showing instructions passing through the steps Fetch Instruction, Decode / Read Reg, Execute ALU, Memory Read / Write, and Write Reg; time axis in ps from 0 to 1900]
Single-Cycle vs. Pipelined
Chapter 7 <79>
lw  $s2, 40($0)
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw  $s6, 20($s1)
or  $s7, $t3, $t4
[Figure: pipeline diagram over cycles 1 to 10; each instruction passes through IM, RF (register read), the ALU operation, DM, and RF (register write)]
Pipelined Processor Abstraction
Chapter 7 <80>
[Figure: single-cycle datapath and pipelined datapath, divided into Fetch, Decode, Execute, Memory, and Writeback stages; pipelined signal names carry a stage suffix, e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW]
Single-Cycle & Pipelined Datapath
Chapter 7 <81>
[Figure: corrected pipelined datapath with stages Fetch, Decode, Execute, Memory, and Writeback; WriteReg is carried through the pipeline as WriteRegE4:0, WriteRegM4:0, and WriteRegW4:0]
WriteReg must arrive at same time as Result
Corrected Pipelined Datapath
Chapter 7 <82>
[Figure: pipelined processor with control; the Control Unit decodes Op (Instr[31:26]) and Funct (Instr[5:0]) in the Decode stage, and the control signals travel through the pipeline registers with stage suffixes, e.g. RegWriteD, RegWriteE, RegWriteM, RegWriteW]
- Same control unit as single-cycle processor
- Control delayed to proper pipeline stage
Pipelined Processor Control
Chapter 7 <83>
- When an instruction depends on a result from an instruction that hasn’t completed
- Types:
  – Data hazard: register value not yet written back to register file
  – Control hazard: next instruction not decided yet (caused by branches)
Pipeline Hazards
Chapter 7 <84>
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram over cycles 1 to 8; the and, or, and sub instructions all read $s0, which add writes]
Data Hazard
Chapter 7 <85>
- Insert nops in code at compile time
- Rearrange code at compile time
- Forward data at run time
- Stall the processor at run time
Handling Data Hazards
Chapter 7 <86>
add $s0, $s2, $s3
nop
nop
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram over cycles 1 to 10 with two nops inserted after add]
- Insert enough nops for result to be ready
- Or move independent useful instructions forward
Compile-Time Hazard Elimination
Chapter 7 <87>
add $s0, $s2, $s3
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram over cycles 1 to 8; the result of add is forwarded to the later instructions that read $s0]
Data Forwarding
Chapter 7 <88>
[Figure: pipelined processor with forwarding; the Hazard Unit receives RsE, RtE, WriteRegM, WriteRegW, RegWriteM, and RegWriteW and drives ForwardAE and ForwardBE, which control three-input multiplexers (00, 01, 10) on the ALU operands]
Data Forwarding
Chapter 7 <89>
- Forward to Execute stage from either:
  – Memory stage or
  – Writeback stage
- Forwarding logic for ForwardAE:
  if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
  else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
  else ForwardAE = 00
- Forwarding logic for ForwardBE is the same, but replace rsE with rtE
Data Forwarding
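The forwarding logic above translates almost directly into code. In this sketch one function serves both ForwardAE and ForwardBE: pass rsE for the first and rtE for the second. The name `forward_select` and the snake_case parameter names are hypothetical.

```python
def forward_select(src_reg_e: int, write_reg_m: int, reg_write_m: bool,
                   write_reg_w: int, reg_write_w: bool) -> int:
    """Return the 2-bit forwarding select for one ALU operand.

    10 = forward from the Memory stage, 01 = forward from the Writeback
    stage, 00 = use the value read from the register file.
    """
    if src_reg_e != 0 and src_reg_e == write_reg_m and reg_write_m:
        return 0b10
    if src_reg_e != 0 and src_reg_e == write_reg_w and reg_write_w:
        return 0b01
    return 0b00


# add $s0, ... is in Memory while and $t0, $s0, $s1 is in Execute ($s0 = register 16).
forward_ae = forward_select(16, 16, True, 0, False)
```

The Memory stage is checked first so that the most recent result wins when both later stages write the same register, and register 0 is excluded because it always reads as zero.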
Chapter 7 <90>
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram over cycles 1 to 8; the and instruction needs $s0 in its Execute stage before lw has read the value from data memory]
Trouble!
Stalling
Chapter 7 <91>
lw  $s0, 40($0)
and $t0, $s0, $s1
or  $t1, $s4, $s0
sub $t2, $s0, $s5
[Figure: pipeline diagram over cycles 1 to 9; and repeats its register read stage and or repeats its instruction fetch stage for one cycle (Stall) so that the lw result is available]
Stalling
Chapter 7 <92>
[Figure: pipelined processor with stalling hardware; the Hazard Unit also receives RsD, RtD, and MemtoRegE and drives StallF, StallD, and FlushE, which connect to register enable (EN) and clear (CLR) inputs]
Stalling Hardware
Chapter 7 <93>
lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
Stalling Logic
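The stall condition is small enough to model directly. The sketch below is an illustrative Python model, assuming boolean control signals and integer register numbers; `lw_stall` and `hazard_outputs` are hypothetical names, not part of the slides.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True when the instruction in Decode reads the register a lw in Execute will write."""
    return bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)


def hazard_outputs(rs_d, rt_d, rt_e, mem_to_reg_e):
    """All three control outputs follow the same lwstall signal."""
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


# lw $s0, 40($0) is in Execute (rtE = 16); and $t0, $s0, $s1 is in Decode.
outputs = hazard_outputs(rs_d=16, rt_d=17, rt_e=16, mem_to_reg_e=True)
print(outputs)
```

MemtoRegE is what identifies the Execute-stage instruction as a load; without it, a matching register number alone does not force a stall because forwarding handles that case.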
Chapter 7 <94>
- beq:
  – Branch not determined until the 4th stage of the pipeline
  – Instructions after the branch are fetched before the branch is resolved
  – These instructions must be flushed if the branch is taken
- Branch misprediction penalty
  – Number of instructions flushed when the branch is taken
  – May be reduced by determining the branch earlier
Control Hazards
Chapter 7 <95>
[Figure: pipelined datapath before early branch resolution. The branch decision PCSrcM is made in the Memory stage from BranchM and ZeroM, and the branch target is PCBranchM.]
Control Hazards: Original Pipeline
Chapter 7 <96>
[Figure: pipeline timing diagram, cycles 1-9. The three instructions fetched after the beq are flushed when the branch is taken.]
20  beq $t1, $t2, 40
24  and $t0, $s0, $s1    (flushed)
28  or  $t1, $s4, $s0    (flushed)
2C  sub $t2, $s0, $s5    (flushed)
30  ...
64  slt $t3, $s2, $s3
Flush these instructions
Control Hazards
Chapter 7 <97>
[Figure: pipelined datapath with early branch resolution. An equality comparator in the Decode stage produces EqualD, so PCSrcD and PCBranchD are computed in Decode; a second pipeline register gains a clear input (CLR).]
Introduced another data hazard in Decode stage
Early Branch Resolution
Chapter 7 <98>
[Figure: pipeline timing diagram, cycles 1-9, with the branch resolved in Decode. Only the one instruction fetched after the beq is flushed when the branch is taken.]
20  beq $t1, $t2, 40
24  and $t0, $s0, $s1    (flushed)
28  or  $t1, $s4, $s0
2C  sub $t2, $s0, $s5
30  ...
64  slt $t3, $s2, $s3
Flush this instruction
Early Branch Resolution
Chapter 7 <99>
[Figure: pipelined datapath handling data and control hazards. The Hazard Unit additionally takes BranchD and RegWriteE and drives ForwardAD and ForwardBD, which select the operands of the Decode-stage equality comparator.]
Handling Data & Control Hazards
Chapter 7 <100>
- Forwarding logic:
  ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
  ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM
- Stalling logic:
  branchstall = BranchD AND
      [(RegWriteE AND ((WriteRegE == rsD) OR (WriteRegE == rtD))) OR
       (MemtoRegM AND ((WriteRegM == rsD) OR (WriteRegM == rtD)))]
  StallF = StallD = FlushE = lwstall OR branchstall
Control Forwarding & Stalling Logic
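As a cross-check of the Decode-stage forwarding and branch stall equations, here is a minimal Python sketch. It assumes boolean control signals and integer register numbers; `branch_forward` and `branch_stall` are hypothetical helper names that mirror the slide's signals.

```python
def branch_forward(src_d, write_reg_m, reg_write_m):
    """ForwardAD (src_d = rsD) or ForwardBD (src_d = rtD)."""
    return bool(src_d != 0 and src_d == write_reg_m and reg_write_m)


def branch_stall(branch_d, rs_d, rt_d,
                 write_reg_e, reg_write_e,
                 write_reg_m, mem_to_reg_m):
    """True when a branch in Decode must wait for one of its source operands."""
    # An instruction in Execute has not yet computed the register the branch reads.
    result_in_execute = reg_write_e and write_reg_e in (rs_d, rt_d)
    # A load in Memory has not yet read the register the branch reads.
    load_in_memory = mem_to_reg_m and write_reg_m in (rs_d, rt_d)
    return bool(branch_d and (result_in_execute or load_in_memory))


# beq $t1, $t2, ... in Decode ($t1 = 9, $t2 = 10) right after an instruction that writes $t1.
stall = branch_stall(True, 9, 10, write_reg_e=9, reg_write_e=True,
                     write_reg_m=0, mem_to_reg_m=False)
print(stall)
```

Both stall terms are gated by BranchD, so non-branch instructions in Decode never trigger this stall.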
Chapter 7 <101>
- Guess whether branch will be taken
  – Backward branches are usually taken (loops)
  – Consider history to improve guess
- Good prediction reduces fraction of branches requiring a flush
Branch Prediction
Chapter 7 <102>
- SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 11% branches
  – 2% jumps
  – 52% R-type
- Suppose:
  – 40% of loads used by next instruction
  – 25% of branches mispredicted
  – All jumps flush next instruction
- What is the average CPI?
Pipelined Performance Example
Chapter 7 <103>
- SPECINT2000 benchmark:
  – 25% loads
  – 10% stores
  – 11% branches
  – 2% jumps
  – 52% R-type
- Suppose:
  – 40% of loads used by next instruction
  – 25% of branches mispredicted
  – All jumps flush next instruction
- What is the average CPI?
  – Load/Branch CPI = 1 when no stalling, 2 when stalling
  – CPIlw = 1(0.6) + 2(0.4) = 1.4
  – CPIbeq = 1(0.75) + 2(0.25) = 1.25
  – Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
Pipelined Performance Example
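The weighted average above can be reproduced in a few lines. This is a small Python sketch using only the percentages given on the slide; the dictionary names are illustrative.

```python
# Instruction mix from the SPECINT2000 figures on the slide.
mix = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}

# A load or branch costs 1 cycle without a stall or flush and 2 cycles with one.
cpi_load = 1 * 0.60 + 2 * 0.40      # 40% of loads are used by the next instruction
cpi_branch = 1 * 0.75 + 2 * 0.25    # 25% of branches are mispredicted

# Every jump flushes the next instruction, so a jump always costs 2 cycles.
cpi = {"load": cpi_load, "store": 1, "branch": cpi_branch, "jump": 2, "rtype": 1}

average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(round(average_cpi, 2))
```

The unrounded result is 1.1475, which the slide rounds to 1.15.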
Chapter 7 <104>
- Pipelined processor critical path:
  Tc = max {
    tpcq + tmem + tsetup,                              (Fetch)
    2(tRFread + tmux + teq + tAND + tmux + tsetup),    (Decode)
    tpcq + tmux + tmux + tALU + tsetup,                (Execute)
    tpcq + tmemwrite + tsetup,                         (Memory)
    2(tpcq + tmux + tRFwrite)                          (Writeback)
  }
Pipelined Performance
Chapter 7 <105>
Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20
Equality comparator   teq         40
AND gate              tAND        15
Memory write          tmemwrite   220
Register file write   tRFwrite    100

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
   = 2[150 + 25 + 40 + 15 + 25 + 20] ps
   = 550 ps
Pipelined Performance Example
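To see why the Decode term sets the cycle time, all five terms of the critical-path expression can be evaluated with the delays from the table. This Python sketch is illustrative; the stage labels attached to each term are an assumption based on the usual five-stage ordering (Fetch, Decode, Execute, Memory, Writeback).

```python
# Delays in picoseconds, taken from the table on this slide.
t = {
    "pcq": 30, "setup": 20, "mux": 25, "ALU": 200, "mem": 250,
    "RFread": 150, "eq": 40, "AND": 15, "memwrite": 220, "RFwrite": 100,
}

# One entry per term of the max{...} expression (stage labels are assumed).
stage_paths = {
    "Fetch": t["pcq"] + t["mem"] + t["setup"],
    "Decode": 2 * (t["RFread"] + t["mux"] + t["eq"] + t["AND"] + t["mux"] + t["setup"]),
    "Execute": t["pcq"] + t["mux"] + t["mux"] + t["ALU"] + t["setup"],
    "Memory": t["pcq"] + t["memwrite"] + t["setup"],
    "Writeback": 2 * (t["pcq"] + t["mux"] + t["RFwrite"]),
}

critical_stage = max(stage_paths, key=stage_paths.get)
tc_ps = stage_paths[critical_stage]
print(critical_stage, tc_ps)
```

With these delays the other four terms fall between 270 ps and 310 ps, so the doubled Decode path dominates by a wide margin.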
Chapter 7 <106>
Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(1.15)(550 × 10^-12)
               = 63 seconds
Pipelined Performance Example
Chapter 7 <107>
Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   92.5                       1
Multicycle     133                        0.70
Pipelined      63                         1.47
Processor Performance Comparison
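The pipelined execution time and the speedup column follow from simple arithmetic. The Python sketch below recomputes them from the figures on the slides; variable names are illustrative.

```python
# Pipelined execution time = (# instructions) x CPI x Tc
instructions = 100e9
cpi = 1.15
tc_seconds = 550e-12
pipelined_time = instructions * cpi * tc_seconds

# Execution times in seconds, as listed in the comparison table.
times = {"Single-cycle": 92.5, "Multicycle": 133.0, "Pipelined": 63.0}

# Speedup relative to the single-cycle processor.
speedup = {name: times["Single-cycle"] / seconds for name, seconds in times.items()}

print(round(pipelined_time, 2))
for name, value in speedup.items():
    print(name, round(value, 2))
```

A speedup below 1, as for the multicycle processor, means that design is slower than the single-cycle baseline on this workload.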
Chapter 7 <108>
- Unscheduled function call to exception handler
- Caused by:
  – Hardware, also called an interrupt, e.g., keyboard
  – Software, also called a trap, e.g., undefined instruction
- When exception occurs, the processor:
  – Records cause of exception (Cause register)
  – Jumps to exception handler (0x80000180)
  – Returns to program (EPC register)
Review: Exceptions
Chapter 7 <109>
Example Exception
Chapter 7 <110>
- Not part of register file
  – Cause
    - Records cause of exception
    - Coprocessor 0 register 13
  – EPC (Exception PC)
    - Records PC where exception occurred
    - Coprocessor 0 register 14
- Move from Coprocessor 0
  – mfc0 $t0, Cause
  – Moves contents of Cause into $t0

mfc0 encoding:
  Bits    31:26    25:21   20:16     15:11        10:0
  Field   010000   00000   $t0 (8)   Cause (13)   00000000000
Exception Registers
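The field layout above can be assembled into a 32-bit machine word. This is a minimal Python sketch; `encode_mfc0` is a hypothetical helper name, and it covers only the field values shown on the slide.

```python
def encode_mfc0(rt, rd):
    """Pack the mfc0 fields: bits 31:26 = 010000, 25:21 = 00000, 20:16 = rt, 15:11 = rd, 10:0 = 0."""
    assert 0 <= rt < 32 and 0 <= rd < 32
    return (0b010000 << 26) | (0b00000 << 21) | (rt << 16) | (rd << 11)


# mfc0 $t0, Cause: $t0 is general-purpose register 8, Cause is Coprocessor 0 register 13.
word = encode_mfc0(rt=8, rd=13)
print(hex(word))
```

Reading EPC instead of Cause only changes the 15:11 field from 13 to 14.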
Chapter 7 <111>
Exception                  Cause
Hardware Interrupt         0x00000000
System Call                0x00000020
Breakpoint / Divide by 0   0x00000024
Undefined Instruction      0x00000028
Arithmetic Overflow        0x00000030

Extend multicycle MIPS processor to handle last two types of exceptions
Exception Causes
Chapter 7 <112>
[Figure: multicycle datapath with exception hardware. Adds an EPC register (enabled by EPCWrite) and a Cause register (enabled by CauseWrite); IntCause selects the cause value 0x30 or 0x28; the PCSrc multiplexer gains the exception handler address 0x80000180 as an input; the ALU provides an Overflow output.]
Exception Hardware: EPC & Cause
Chapter 7 <113>
[Figure: multicycle control FSM extended with exceptions. States: S0 Fetch, S1 Decode, S2 MemAdr, S3 MemRead, S4 Mem Writeback, S5 MemWrite, S6 Execute, S7 ALU Writeback, S8 Branch, S9 ADDI Execute, S10 ADDI Writeback, S11 Jump, S12 Undefined, S13 Overflow, S14 MFC0.]
New states:
- S12: Undefined (Op = others): PCSrc = 11, PCWrite, IntCause = 1, CauseWrite, EPCWrite
- S13: Overflow (entered when Overflow is asserted): PCSrc = 11, PCWrite, IntCause = 0, CauseWrite, EPCWrite
- S14: MFC0 (Op = mfc0): RegDst = 0, MemtoReg = 10, RegWrite
Control FSM with Exceptions
Chapter 7 <114>
[Figure: multicycle datapath with mfc0 support. MemtoReg widens to two bits (MemtoReg1:0); input 10 of the register write-data multiplexer carries C0, the Coprocessor 0 register selected by Instr15:11 (01101 = Cause, 01110 = EPC).]
Exception Hardware: mfc0
Chapter 7 <115>
- Deep Pipelining
- Branch Prediction
- Superscalar Processors
- Out of Order Processors
- Register Renaming
- SIMD
- Multithreading
- Multiprocessors
Advanced Microarchitecture
Chapter 7 <116>
- 10-20 stages typical
- Number of stages limited by:
  – Pipeline hazards
  – Sequencing overhead
  – Power
  – Cost
Deep Pipelining
Chapter 7 <117>
- Ideal pipelined processor: CPI = 1
- Branch misprediction increases CPI
- Static branch prediction:
  – Check direction of branch (forward or backward)
  – If backward, predict taken
  – Else, predict not taken
- Dynamic branch prediction:
  – Keep history of last (several hundred) branches in branch target buffer, record:
    - Branch destination
    - Whether branch was taken
Branch Prediction
Chapter 7 <118>
      add  $s1, $0, $0      # sum = 0
      add  $s0, $0, $0      # i = 0
      addi $t0, $0, 10      # $t0 = 10
for:  beq  $s0, $t0, done   # if i == 10, branch
      add  $s1, $s1, $s0    # sum = sum + i
      addi $s0, $s0, 1      # increment i
      j    for
done:
Branch Prediction Example
Chapter 7 <119>
- Remembers whether branch was taken the last time and does the same thing
- Mispredicts first and last branch of loop
1-Bit Branch Predictor
Chapter 7 <120>
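Only mispredicts last branch of loop
[Figure: four-state diagram. States: strongly taken (predict taken), weakly taken (predict taken), weakly not taken (predict not taken), strongly not taken (predict not taken). A taken branch moves one state toward strongly taken; a not-taken branch moves one state toward strongly not taken.]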
2-Bit Branch Predictor
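The misprediction counts for the two predictors can be checked by simulating the beq in the example loop, which is not taken for i = 0 through 9 and taken once when i reaches 10. This Python sketch is a simplified model; the function names, the integer state encoding (0 = strongly not taken through 3 = strongly taken), and the initial predictor states are assumptions for illustration.

```python
def one_bit(outcomes, last_taken):
    """Predict the same direction as last time; return (mispredictions, final state)."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses, last_taken


def two_bit(outcomes, state):
    """Saturating 2-bit counter: states 2 and 3 predict taken, 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses, state


# One pass through the loop: beq falls through 10 times, then branches to done.
loop_pass = [False] * 10 + [True]

# Run the loop once to warm up each predictor, then count a second pass.
_, warm_1bit = one_bit(loop_pass, last_taken=False)
_, warm_2bit = two_bit(loop_pass, state=0)
misses_1bit, _ = one_bit(loop_pass, warm_1bit)
misses_2bit, _ = two_bit(loop_pass, warm_2bit)
print(misses_1bit, misses_2bit)
```

The second pass is the interesting one: the 1-bit predictor remembers the exit branch from the previous pass and so misses the first iteration as well as the last, while the 2-bit counter needs two wrong outcomes in a row before it changes its prediction.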
Chapter 7 <121>
- Multiple copies of datapath execute multiple instructions at once
- Dependencies make it tricky to issue multiple instructions at once
[Figure: two-way superscalar datapath. The instruction memory supplies two instructions per cycle to a six-port register file (four read ports, two write ports), two ALUs, and a two-port data memory.]
Superscalar
Chapter 7 <122>
Ideal IPC: 2
Actual IPC: 2
lw  $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or  $t4, $s1, $s5
sw  $s5, 80($s0)
[Figure: pipeline timing diagram, cycles 1-8. The six instructions issue in pairs: lw with add, sub with and, or with sw.]
Superscalar Example
Chapter 7 <123>
Ideal IPC: 2
Actual IPC: 6/5 = 1.2
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
[Figure: pipeline timing diagram, cycles 1-9. Dependencies force stalls, so the six instructions take five issue cycles instead of three.]
Superscalar with Dependencies
Chapter 7 <124>
- Looks ahead across multiple instructions
- Issues as many instructions as possible at once
- Issues instructions out of order (as long as no dependencies)
- Dependencies:
  – RAW (read after write): one instruction writes, later instruction reads a register
  – WAR (write after read): one instruction reads, later instruction writes a register
  – WAW (write after write): one instruction writes, later instruction writes a register
Out of Order Processor
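The three dependency types can be detected mechanically from each instruction's destination and source registers. This Python sketch is illustrative only: the `(dest, sources)` tuple representation and the function name `dependencies` are assumptions, and the sample instructions are taken from the superscalar example in this chapter.

```python
def dependencies(earlier, later):
    """Classify register dependencies between an earlier and a later instruction.

    Each instruction is a (dest, sources) pair; dest is None if nothing is written.
    """
    dest_1, sources_1 = earlier
    dest_2, sources_2 = later
    found = set()
    if dest_1 is not None and dest_1 in sources_2:
        found.add("RAW")  # later reads what earlier writes
    if dest_2 is not None and dest_2 in sources_1:
        found.add("WAR")  # later writes what earlier reads
    if dest_1 is not None and dest_1 == dest_2:
        found.add("WAW")  # both write the same register
    return found


lw = ("$t0", ["$s0"])            # lw  $t0, 40($s0)
add = ("$t1", ["$t0", "$s1"])    # add $t1, $t0, $s1
sub = ("$t0", ["$s2", "$s3"])    # sub $t0, $s2, $s3
sw = (None, ["$s7", "$t3"])      # sw  $s7, 80($t3)

print(dependencies(lw, add), dependencies(add, sub), dependencies(lw, sub))
```

Only RAW reflects a true flow of data; WAR and WAW arise from reusing a register name, which is why register renaming can remove them.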
Chapter 7 <125>
- Instruction level parallelism (ILP): number of instructions that can be issued simultaneously (average < 3)
- Scoreboard: table that keeps track of:
  – Instructions waiting to issue
  – Available functional units
  – Dependencies
Out of Order Processor
Chapter 7 <126>
Ideal IPC: 2
Actual IPC: 6/4 = 1.5
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
[Figure: pipeline timing diagram, cycles 1-8, with instructions issued out of order. Annotations: two-cycle latency between load and use of $t0 (RAW, lw to add); WAR on $t0 (add to sub); RAW on $t0 (sub to and); RAW on $t3 (or to sw).]
Out of Order Processor Example
Chapter 7 <127>
Ideal IPC: 2
Actual IPC: 6/3 = 2
Original program:
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
After renaming $t0 to $r0 in sub and and:
lw  $t0, 40($s0)
add $t1, $t0, $s1
sub $r0, $s2, $s3
and $t2, $s4, $r0
or  $t3, $s5, $s6
sw  $s7, 80($t3)
[Figure: pipeline timing diagram, cycles 1-7, for the renamed program. Annotations: 2-cycle RAW (lw to add on $t0) and RAW dependencies (sub to and on $r0, or to sw on $t3).]
Register Renaming
Chapter 7 <128>
- Single Instruction Multiple Data (SIMD)
  – Single instruction acts on multiple pieces of data at once
  – Common application: graphics
  – Perform short arithmetic operations (also called packed arithmetic)
- For example, add four 8-bit elements
  padd8 $s2, $s0, $s1
[Figure: $s0 holds a3 a2 a1 a0 and $s1 holds b3 b2 b1 b0 as four 8-bit fields of a 32-bit register, with element 0 in the least significant byte; $s2 receives a3 + b3, a2 + b2, a1 + b1, a0 + b0.]
SIMD
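Packed addition can be emulated in software by treating a 32-bit word as four independent 8-bit lanes. This is a minimal Python sketch of the idea; the function name `padd8` follows the slide's mnemonic, and the choice to wrap each lane on overflow (rather than saturate) is an assumption, since the slide does not specify overflow behavior.

```python
def padd8(a, b):
    """Add four packed 8-bit elements; carries do not cross lane boundaries."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        # Assumed behavior: each lane wraps modulo 256 on overflow.
        result |= (lane_sum & 0xFF) << shift
    return result


s0 = 0x01020304   # a3 = 0x01, a2 = 0x02, a1 = 0x03, a0 = 0x04
s1 = 0x10203040   # b3 = 0x10, b2 = 0x20, b1 = 0x30, b0 = 0x40
s2 = padd8(s0, s1)
print(hex(s2))
```

The difference from an ordinary 32-bit add shows up only when a lane overflows: a normal add would carry into the neighboring lane, while the packed add discards that carry.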
Chapter 7 <129>
- Multithreading
– Word processor: threads for typing, spell checking, printing
- Multiprocessors
– Multiple processors (cores) on a single chip
Advanced Architecture Techniques
Chapter 7 <130>
- Process: program running on a computer
– Multiple processes can run at once: e.g., surfing the Web, playing music, writing a paper
- Thread: part of a program
– A process may have multiple threads: e.g., a word processor may have threads for typing, spell checking, printing
Threading: Definitions
Chapter 7 <131>
- One thread runs at once
- When one thread stalls (for example, waiting for memory):
  – Architectural state of that thread is stored
  – Architectural state of a waiting thread is loaded into the processor and it runs
  – Called context switching
- Appears to the user as if all threads are running simultaneously
Threads in Conventional Processor
Chapter 7 <132>
- Multiple copies of architectural state
- Multiple threads active at once:
  – When one thread stalls, another runs immediately
  – If one thread can’t keep all execution units busy, another thread can use them
- Does not increase instruction-level parallelism (ILP) of single thread, but increases throughput
Intel calls this “hyperthreading”
Multithreading
Chapter 7 <133>
- Multiple processors (cores) with a method of communication between them
- Types:
  – Homogeneous: multiple cores with shared memory
  – Heterogeneous: separate cores for different tasks (for example, DSP and CPU in cell phone)
  – Clusters: each core has own memory system
Multiprocessors
Chapter 7 <134>
- Hennessy & Patterson: Computer Architecture: A Quantitative Approach
- Conferences: