CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 1
CS104 Computer Organization and Design: Datapaths
Admin
- Homework
- Homework 4 out tonight
- Due Monday March 26th
- Download/check your submissions
- Reading:
- Chapter 4
- (Maybe review 1.4)
- Midterm 2
- March 28
What did we do last week?
- Who can remind us what we did last week?
- Ski
- Go to the beach
- Sleep in
- Read a book
- …
- Ok, but seriously?
When last I saw you all..
- Last time I was here (Feb 27/29)
- Learned basics of logic design
- Gates (And, Or, Nor, …)
- Put gates together to make
- Muxes
- Adders
- Latches
- Flip-flops
- …
While I was at HPCA..
- Prof. Lebeck started teaching you all about datapaths
- Putting logic together to execute instructions
- Started on single-cycle datapath
- We’ll review/continue with single cycle
- Then jump into more things!
Datapath for MIPS ISA
- Consider only the following instructions
add $1,$2,$3
addi $1,2,$3
lw $1,4($3)
sw $1,4($3)
beq $1,$2,PC_relative_target
j absolute_target
- Why only these?
- Most other instructions are the same from datapath viewpoint
- The ones that aren’t are left for you to figure out
Start With Fetch
- PC and instruction memory
- A +4 incrementer computes default next instruction PC
[Figure: fetch datapath with PC, Insn Mem, and +4 incrementer]
First Instruction: add
- Add register file and ALU
[Figure: datapath with PC, Insn Mem, +4 incrementer, and Register File (ports s1, s2, d)]
Second Instruction: addi
- Destination register can now be either Rd or Rt
- Add sign extension unit and mux into second ALU input
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), and sign extension unit (SX)]
Third Instruction: lw
- Add data memory, address is ALU output
- Add register write data mux to select memory output or ALU output
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), SX, and Data Mem (ports a, d)]
Fourth Instruction: sw
- Add path from second input register to data memory data input
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), SX, and Data Mem (ports a, d)]
Fifth Instruction: beq
- Add left shift unit and adder to compute PC-relative branch target
- Add PC input mux to select PC+4 or branch target
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), SX, Data Mem (a, d), << 2 shifter, and ALU zero output (z)]
Sixth Instruction: j
- Add shifter to compute left shift of 26-bit immediate
- Add additional PC input mux for jump target
[Figure: datapath with PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), SX, Data Mem (a, d), and two << 2 shifters]
“Continuous Read” Datapath Timing
- Works because writes (PC, RegFile, DMem) are independent
- And because no read logically follows any write
[Figure: datapath timing, in order: Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC]
What Is Control?
- 9 signals control flow of data through this datapath
- MUX selectors, or register/memory write enable signals
- A real datapath has 300-500 control signals
[Figure: datapath annotated with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
Example: Control for add
[Figure: datapath annotated with control values for add]
BR=0 JP=0 Rwd=0 DMwe=0 ALUop=0 ALUinB=0 Rdst=1 Rwe=1
Example: Control for sw
- Difference between sw and add is 5 signals
- 3 if you don’t count the X (don’t care) signals
[Figure: datapath annotated with control values for sw]
Rwe=0 ALUinB=1 DMwe=1 JP=0 ALUop=0 BR=0 Rwd=X Rdst=X
Example: Control for beq
- Difference between sw and beq is only 4 signals
[Figure: datapath annotated with control values for beq]
Rwe=0 ALUinB=0 DMwe=0 JP=0 ALUop=1 BR=1 Rwd=X Rdst=X
You all figure LW
- How would these control signals be set for LW?
[Figure: datapath with unset control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
Example: Control for LW
[Figure: datapath annotated with control values for lw]
BR=0 JP=0 Rwd=1 DMwe=0 ALUop=0 ALUinB=1 Rdst=1 Rwe=1
How Is Control Implemented?
[Figure: datapath with a "Control?" block driving Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
Implementing Control
- Each insn has a unique set of control signals
- Most are function of opcode
- Some may be encoded in the instruction itself
- E.g., the ALUop signal is some portion of the MIPS Func field
+ Simplifies controller implementation
- Requires careful ISA design
Control Implementation: ROM
- ROM (read only memory): think rows of bits
- Bits in data words are control signals
- Lines indexed by opcode
- Example: ROM control for 6-insn MIPS datapath
- X is “don’t care”
Columns (rows indexed by opcode): BR JP ALUinB ALUop DMwe Rwe Rdst Rwd
add: 1
addi: 1 1 1
lw: 1 1 1 1
sw: 1 1 X X
beq: 1 1 X X
j: 1 X X
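The ROM idea can be sketched as a lookup table keyed by opcode. This is a minimal Python sketch, assuming only the four control words spelled out on the "Control for add / sw / beq / LW" slides; addi and j are left out because their full rows are not given there. The names `CONTROL_ROM` and `differing_signals` are illustrative, and `None` stands in for X (don't care).

```python
# Control words copied from the "Example: Control for ..." slides.
# None models an X (don't care) entry.
SIGNALS = ("BR", "JP", "ALUinB", "ALUop", "DMwe", "Rwe", "Rdst", "Rwd")

CONTROL_ROM = {
    "add": dict(BR=0, JP=0, ALUinB=0, ALUop=0, DMwe=0, Rwe=1, Rdst=1, Rwd=0),
    "lw":  dict(BR=0, JP=0, ALUinB=1, ALUop=0, DMwe=0, Rwe=1, Rdst=1, Rwd=1),
    "sw":  dict(BR=0, JP=0, ALUinB=1, ALUop=0, DMwe=1, Rwe=0, Rdst=None, Rwd=None),
    "beq": dict(BR=1, JP=0, ALUinB=0, ALUop=1, DMwe=0, Rwe=0, Rdst=None, Rwd=None),
}

def differing_signals(a, b, count_dont_cares=True):
    """Return the signals whose values differ between opcodes a and b."""
    diff = []
    for name in SIGNALS:
        va, vb = CONTROL_ROM[a][name], CONTROL_ROM[b][name]
        if va is None or vb is None:
            # An X only counts as a difference if asked for and the other side is not X.
            if count_dont_cares and va != vb:
                diff.append(name)
        elif va != vb:
            diff.append(name)
    return diff

print(differing_signals("add", "sw"))
print(differing_signals("add", "sw", count_dont_cares=False))
print(differing_signals("sw", "beq"))
```

Counting the list lengths reproduces the slide claims: sw differs from add in 5 signals (3 without the X entries), and beq differs from sw in 4.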
Control Implementation: Random Logic
- Real machines have 100+ insns, 300+ control signals
- 30,000+ control bits (~4KB)
– Not huge, but hard to make faster than datapath (important!)
- Alternative: random logic (random = ‘non-repeating’)
- Exploits the observation: many signals have few 1s or few 0s
- Example: random logic control for 6-insn MIPS datapath
[Figure: random logic control: the opcode is decoded into add, addi, lw, sw, beq, j lines, which are combined to drive BR, JP, ALUinB, DMwe, Rwd, Rdst, ALUop, Rwe]
Datapath and Control Timing
[Figure: datapath with control ROM/random logic; timing in order: Read IMem, Read Registers (Read Control ROM), Read DMEM, Write DMEM, Write Registers, Write PC]
Single-Cycle Datapath Performance
- Goes against make common case fast (MCCF) principle
+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn
[Figure: single-cycle datapath with control ROM/random logic]
Interlude: Performance
- Previous slide alludes to something new: Performance
- Don’t just want it to work…
- But want it to go fast!
- Three components to performance:
Number of instructions x Cycles per instruction (CPI) x Clock Period (1 / Clock frequency)
$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}}$$
- Insns/Program: determined by compiler + ISA
- Generally assume a fixed program when doing micro-architecture
Micro-architectural factors
- Micro-architecture:
- The details of how the ISA is implemented
- Affects CPI and Clock frequency
- Often will look at fixed program, and consider MIPS
- Million Instructions Per Second
- MIPS = IPC * Frequency (in MHz)
- IPC = Instructions Per Cycle (1 / CPI)
- Gives “Bigger is better” number
$$\underbrace{\frac{\text{Instructions}}{\text{Cycle}}}_{\text{IPC}} \times \underbrace{\frac{\text{Cycles}}{\text{Second}}}_{\text{Frequency}} = \underbrace{\frac{\text{Instructions}}{\text{Second}}}_{\text{Throughput}}$$
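The two performance formulas can be checked against each other with a few lines of Python. This is only a sketch: the function names and the workload numbers (2 million insns, CPI of 1.5, 500 MHz clock) are made up for illustration.

```python
def seconds_per_program(insn_count, cpi, clock_period_s):
    """Execution time = instructions x CPI x clock period."""
    return insn_count * cpi * clock_period_s

def mips(ipc, freq_mhz):
    """Throughput in millions of instructions per second = IPC x frequency (MHz)."""
    return ipc * freq_mhz

# Hypothetical workload: 2 million insns, CPI = 1.5, 500 MHz clock (2 ns period).
time_s = seconds_per_program(2_000_000, 1.5, 2e-9)
throughput = mips(ipc=1 / 1.5, freq_mhz=500)
print(f"{time_s * 1e3:.1f} ms, {throughput:.1f} MIPS")
```

Both views agree: 2 million insns at roughly 333 MIPS take about 6 ms.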
“Best” IPC
- For now, best we can do: IPC = 1 (CPI = 1)
- Do 1 instruction every cycle
- Later:
- Real processors can do multiple instructions at once!
- Potentially: IPC > 1!
- Best possible IPC depends on design
Performance vs ….
- 1990s: Performance at all cost
- Actually more “clock frequency” at all cost…
- Now: Care about other things
- Energy (electric bill, battery life)
- Power (cooling, also affects energy)
- Area (chip cost)
- Reliability (tolerance of transient faults: e.g., charged particle strikes)
- …
- Important metric these days: “Performance / Watt”
- Throughput divided by power consumption
- Why?
Performance Modeling and Analysis
- Speaking of performance
- Making a processor takes time (years) and money (millions)
- Want to know it will perform well before you finish
- If it’s wrong, doing it all over is painful…
- Performance can be simulated in software
- Estimate what IPC will be
- Guide design
- This is my other job by the way…
Alternative: Multi-Cycle Datapath
- Multi-cycle datapath: attacks high clock period
- Cut datapath into multiple stages (5 here), isolate using FFs
- FSM control “walks” insns thru stages (by staging control signals)
+ Insns can bypass stages and exit early
[Figure: multi-cycle datapath (PC, Insn Mem, Register File, SX, Data Mem, +4, << 2) cut into stages by registers IR, A, B, O, D; control signals staged as s3, s4, s5]
Finite State Machine (FSM)
- FSM = States + Transitions
- Next state: function of current state + inputs
- Outputs: function of current state + inputs
- Canonical Example: Combination Lock
- Must enter 3 8 4 to unlock
- P.S. Useful in software too
Finite State Machines: Example
- Combination Lock Example:
- Need to enter 3 8 4 to unlock
- Initial State: no valid piece of combo seen
- Input of 3: transition to new state (state 1)
- Any other input: stay in same state
- State 1:
- Input = 8? Goto state 2
- State 2:
- Input = 4? Goto state 3
- State 3:
- Unlock!
[Figure: state diagram with states Start, 1, 2, 3 and edge labels 3; 0-2,4-9; 8; 0-2,5-9; 3; 0-2,4-7,9; 3; 4]
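The combination lock can be sketched directly as a state machine in software. This is a minimal Python sketch, not the slides' implementation: the fall-back transitions (a stray 3 returns to state 1, any other wrong digit returns to Start) are an assumption read off the edge labels of the state diagram, and `lock_fsm` is a hypothetical name.

```python
def lock_fsm(digits):
    """Return True if the digit stream drives the lock into state 3 (unlock)."""
    state = 0  # 0 = Start, 1 = saw 3, 2 = saw 3 8, 3 = saw 3 8 4
    for d in digits:
        if state == 0:
            state = 1 if d == 3 else 0
        elif state == 1:
            # Assumed: another 3 keeps us in state 1, anything else restarts.
            state = 2 if d == 8 else (1 if d == 3 else 0)
        elif state == 2:
            state = 3 if d == 4 else (1 if d == 3 else 0)
        if state == 3:
            break  # unlocked; ignore the rest of the input
    return state == 3

print(lock_fsm([3, 8, 4]))
print(lock_fsm([3, 8, 5, 4]))
```

The next state depends only on the current state and the current input, which is exactly the FSM definition given earlier.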
FSM in Hardware
- Flip-flop(s) to hold state
- Combinational logic to determine next state/output
- (Assumes FF enable on input_valid)
FSM Hardware Example
[Figures not recoverable: step-by-step build of the FSM hardware]
FSM Implementation: ROM
- Just saw: FSM implemented with sum-of-products
- Remind us what that is?
- Can also be implemented with a ROM
[Figure: FSM as ROM: the K-bit input and the N-bit state register together index a 2^(N+K)-entry ROM, which produces the N-bit next state and the M-bit output]
FSM ROM Implementation Example
- Combination Lock (3 8 4) Example
- 4-bit input
- 2-bit state
- 64-entry ROM (indexed with S1 S0 I3 I2 I1 I0)
- Each entry needs 3 bits (S1 S0 U)
- 2 for next state
- 1 for unlock signal
- Example entries in ROM
- 0x00 = 000
- 0x03 = 010
- 0x18 = 100
- 0x13 = 010
- 0x3_ = 001
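The ROM contents can be generated from the transition rules and checked against the example entries. This is a sketch under assumptions: the index layout (S1 S0 I3 I2 I1 I0) and entry layout (S1 S0 U) come from the slide, while the handling of inputs 10-15 (treated as wrong digits) and the fall-back transitions are guesses, and all names are hypothetical.

```python
UNLOCK_CODE = (3, 8, 4)

def next_state_and_unlock(state, digit):
    """Return (next_state, unlock) for the combination lock FSM."""
    if state == 3:                    # combination complete: raise unlock, restart
        return 0, 1
    if digit == UNLOCK_CODE[state]:   # correct next digit: advance
        return state + 1, 0
    if digit == UNLOCK_CODE[0]:       # a stray 3 still counts as the first digit
        return 1, 0
    return 0, 0                       # anything else: back to Start

def build_rom():
    """64 entries, indexed by (state << 4) | input, each holding (next << 1) | unlock."""
    rom = [0] * 64
    for state in range(4):
        for digit in range(16):
            nxt, unlock = next_state_and_unlock(state, digit)
            rom[(state << 4) | digit] = (nxt << 1) | unlock
    return rom

ROM = build_rom()
for addr in (0x00, 0x03, 0x18, 0x13, 0x30):
    print(f"0x{addr:02x} = {ROM[addr]:03b}")
```

The printed entries match the five example entries listed on the slide.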
Multi-cycle Datapath FSM
- First state: Get a New Instruction
- Output signals to fetch (e.g., read enable IMEM)
- Next State: Always Decode
- Second State: Decode
- Output signals to decode instruction (RdEn RegFile)
- Go to Next Insn if NOP
- Otherwise Execute
- Execute State
- Execute Insn (varies by insn type)
- Next State: Also depends on insn type
- Branches: Next Insn
- ALU op: write register (Writeback)
- Load: Read Memory (Read DMEM)
- Store: Write Memory (Write DMEM)
- Read DMEM State
- Control signals enable DMEM Read
- Next state is writeback
- Writeback state
- Control signals enable regfile write
- Next state: Next Insn
- Write DMEM state
- Control signals enable memory write
- Next state: Next Insn
[Figure: FSM state diagram with states Next Insn, Decode Insn, Execute Insn, Writeback, Read DMEM, Write DMEM and edges labeled NOP, Branch, ALU, Load, Store]
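The control FSM above can be written as a next-state function, and walking it once per instruction gives the cycle count for each instruction class. This is a minimal Python sketch; the state names follow the diagram labels, while the `kind` strings and function names are hypothetical.

```python
def next_state(state, kind):
    """Next control state for an insn of the given kind (nop/branch/alu/load/store)."""
    if state == "Next Insn":
        return "Decode Insn"
    if state == "Decode Insn":
        return "Next Insn" if kind == "nop" else "Execute Insn"
    if state == "Execute Insn":
        return {"branch": "Next Insn", "alu": "Writeback",
                "load": "Read DMEM", "store": "Write DMEM"}[kind]
    if state == "Read DMEM":
        return "Writeback"
    if state in ("Writeback", "Write DMEM"):
        return "Next Insn"
    raise ValueError(f"unknown state: {state}")

def cycles(kind):
    """Count the states visited before the FSM returns to Next Insn."""
    state, n = "Next Insn", 0
    while True:
        n += 1
        state = next_state(state, kind)
        if state == "Next Insn":
            return n

for kind in ("branch", "alu", "store", "load"):
    print(kind, cycles(kind))
```

Each state takes one clock cycle, so the counts are the per-class cycle costs of the multi-cycle datapath.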
Multi-Cycle Datapath Example: Add
[Figure: multi-cycle datapath (PC, Insn Mem, Register File, SX, Data Mem, +4, << 2) with stage registers IR, A, B, O, D]
- Example: Add
- Cycle 1: Read IMEM
- Cycle 2: Decode + Read RF
- Cycle 3: ALU
- Cycle 4: Writeback + Increment PC
Multi-Cycle Datapath Performance
- Opposite performance split of single-cycle datapath
+ Short clock period
– High CPI
[Figure: multi-cycle datapath with stage registers IR, A, B, O, D]
Multi-cycle Data-path CPI
- CPI depends on instructions
- Branches / Jumps: 3 cycles
- ALU: 4 cycles
- Stores: 4 cycles
- Loads: 5 cycles
- Overall CPI is weighted average
- Example:
- 20% loads, 15% stores, 20% branches, 45% ALU
CPI = 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
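The weighted average is easy to recompute for other instruction mixes. This is a small Python sketch using the mix from the example; `weighted_cpi` and the dictionary keys are illustrative names.

```python
def weighted_cpi(mix):
    """mix maps insn class -> (fraction of insns, cycles per insn)."""
    total = sum(fraction for fraction, _ in mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix.values())

mix = {
    "load":   (0.20, 5),
    "store":  (0.15, 4),
    "branch": (0.20, 3),
    "alu":    (0.45, 4),
}
print(round(weighted_cpi(mix), 2))
```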
Multi-cycle Datapath Performance
- Single-cycle
- Clock period = 50ns, CPI = 1
- Performance = 50 ns/insn
- Multi-cycle
- Clock period = 10ns
- CPI = (0.2*3+0.2*5+0.6*4) = 4
- Performance = 40 ns/insn
- But wait…
Multi-Cycle Datapath Performance
- Did not just cut up existing logic into 5 pieces
- Also added logic (flip flops)
- So clock period not 1/5 of single cycle, but slightly longer
[Figure: multi-cycle datapath with the added stage registers IR, A, B, O, D]
Multi-cycle Datapath Performance
- Single-cycle
- Clock period = 50ns, CPI = 1
- Performance = 50 ns/insn
- Multi-cycle
- Clock period = 12ns
- CPI = (0.2*3+0.2*5+0.6*4) = 4
- Performance = 48 ns/insn
- Better, but not as exciting…
- Can we do better still?
- Have our cake (low CPI) and eat it too (high clock frequency)?
Clock Period and CPI
- Single-cycle datapath
+ Low CPI: 1
– Long clock period: to accommodate slowest insn
- Multi-cycle datapath
+ Short clock period
– High CPI
- Can we have both low CPI and short clock period?
– No good way to make a single insn go faster
+ Insn latency doesn’t matter anyway … insn throughput matters
- Key: exploit inter-insn parallelism
[Figure: timelines for insn0 and insn1. Single-cycle: insn0.fetch, dec, exec, then insn1.fetch, dec, exec. Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec]
Pipelining
- Pipelining: important performance technique
- Improves insn throughput rather than insn latency
- Exploits parallelism at insn-stage level to do so
- Begin with multi-cycle design
- When insn advances from stage 1 to 2, next insn enters stage 1
- Individual insns take same number of stages
+ But insns enter and leave at a much faster rate
- Physically breaks “atomic” von Neumann (VN) loop ... but must maintain illusion
- Automotive assembly line analogy
[Figure: multi-cycle vs. pipelined timelines for insn0 and insn1 (fetch, dec, exec); in the pipelined timeline insn1 enters a stage as insn0 leaves it]
5 Stage Multi-Cycle Datapath
[Figure: 5-stage multi-cycle datapath (PC, Insn Mem, Register File, SX, Data Mem, +4, << 2) with stage registers IR, A, B, O, D]
5 Stage Pipelined Datapath
- Temporary values (PC,IR,A,B,O,D) re-latched every stage
- Why? 5 insns may be in pipeline at once, they share a single PC?
- Notice, PC not latched after ALU stage (why not?)
[Figure: 5-stage pipelined datapath; the latches between stages hold, in order: PC | PC, IR | PC, A, B, IR | O, B, IR | O, D, IR]
Pipeline Terminology
- Stages: Fetch, Decode, eXecute, Memory, Writeback
- Latches (pipeline registers): PC, F/D, D/X, X/M, M/W
[Figure: 5-stage pipelined datapath with latches labeled PC, F/D, D/X, X/M, M/W]
Some More Terminology
- Scalar pipeline: one insn per stage per cycle
- Alternative: “superscalar” (next unit)
- In-order pipeline: insns enter execute stage in VN order
- Alternative: “out-of-order” (not covered in CSE 371)
- Pipeline depth: number of pipeline stages
- Nothing magical about five
- Trend has been to deeper pipelines
Pipeline Example: Cycles 1-7
- 3 instructions
[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W), one snapshot per cycle]
- Cycle 1: add $3,$2,$1 in F
- Cycle 2: lw $4,0($5) in F, add in D
- Cycle 3: sw $6,4($7) in F, lw in D, add in X
- Cycle 4: sw in D, lw in X, add in M
- Cycle 5: sw in X, lw in M, add in W
- Cycle 6: sw in M, lw in W
- Cycle 7: sw in W
Pipeline Diagram
- Pipeline diagram: shorthand for what we just saw
- Across: cycles
- Down: insns
- Convention: X means lw $4,0($5) finishes execute stage and writes into X/M latch at end of cycle 4
              1 2 3 4 5 6 7 8 9
add $3,$2,$1  F D X M W
lw $4,0($5)     F D X M W
sw $6,4($7)       F D X M W
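A stall-free pipeline diagram follows a simple rule: insn i (counting from 0) is in stage s during cycle i + s + 1. This is a short Python sketch that prints such a diagram; the function names are hypothetical and no hazards are modeled.

```python
STAGES = "FDXMW"

def pipeline_rows(insns):
    """Return [(insn, per-cycle cells)] for a stall-free scalar in-order pipeline."""
    n_cycles = len(insns) + len(STAGES) - 1
    rows = []
    for i, insn in enumerate(insns):
        cells = [" "] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # insn i enters F in cycle i + 1
        rows.append((insn, cells))
    return rows

def render(insns):
    """Format the diagram with cycles across and insns down."""
    rows = pipeline_rows(insns)
    width = max(len(insn) for insn in insns)
    n_cycles = len(insns) + len(STAGES) - 1
    lines = [" " * width + " " + " ".join(str(c) for c in range(1, n_cycles + 1))]
    for insn, cells in rows:
        lines.append(insn.ljust(width) + " " + " ".join(cells))
    return "\n".join(lines)

program = ["add $3,$2,$1", "lw $4,0($5)", "sw $6,4($7)"]
print(render(program))
```

For three instructions the last one leaves W in cycle 7, matching the cycle-by-cycle walkthrough.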
What About Pipelined Control?
- Should it be like single-cycle control?
- But individual insn signals must be staged
- Should it be like multi-cycle control?
- But all stages are simultaneously active
- How many different controllers are we going to need?
- One for each insn in pipeline?
- Solution: use simple single-cycle control, but pipeline it
- Single controller
Pipelined Control
[Figure: pipelined datapath with a single CTRL block in decode; control signals travel with the insn through the latches: xC, mC, wC in D/X; mC, wC in X/M; wC in M/W]
Pipeline Performance Calculation
- Single-cycle
- Clock period = 50ns, CPI = 1
- Performance = 50ns/insn
- Multi-cycle
- Branch: 20% (3 cycles), load: 20% (5 cycles), other: 60% (4 cycles)
- Clock period = 12ns, CPI = (0.2*3+0.2*5+0.6*4) = 4
- Remember: latching overhead makes it 12, not 10
- Performance = 48ns/insn
- Pipelined
- Clock period = 12ns
- CPI = 1.5 (on average insn completes every 1.5 cycles)
- Performance = 18ns/insn
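The three designs can be compared with one line of arithmetic each: time per insn = clock period x CPI. This is a small Python sketch using the numbers above; `ns_per_insn` is an illustrative name.

```python
def ns_per_insn(clock_period_ns, cpi):
    """Average time per instruction = clock period x CPI."""
    return clock_period_ns * cpi

designs = {
    "single-cycle": ns_per_insn(50, 1),
    "multi-cycle": ns_per_insn(12, 0.2 * 3 + 0.2 * 5 + 0.6 * 4),
    "pipelined": ns_per_insn(12, 1.5),
}
for name, t in designs.items():
    speedup = designs["single-cycle"] / t
    print(f"{name}: {t:.0f} ns/insn, {speedup:.2f}x vs single-cycle")
```

The pipelined design keeps the short multi-cycle clock while bringing CPI back down close to 1, which is where its advantage comes from.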
Q1: Why Is Pipeline Clock Period …
- … > delay thru datapath / number of pipeline stages?
- Latches (FFs) add delay
- Pipeline stages have different delays, clock period is max delay
- Both factors have implications for ideal number of pipeline stages
Q2: Why Is Pipeline CPI…
- … > 1?
- CPI for scalar in-order pipeline is 1 + stall penalties
- Stalls used to resolve hazards
- Hazard: condition that jeopardizes VN illusion
- Stall: artificial pipeline delay introduced to restore VN illusion
- Calculating pipeline CPI
- Frequency of stall * stall cycles
- Penalties add (stalls generally don’t overlap in in-order pipelines)
- 1 + stall-freq1*stall-cyc1 + stall-freq2*stall-cyc2 + …
- Correctness/performance/MCCF
- Long penalties OK if they happen rarely, e.g., 1 + 0.01 * 10 = 1.1
- Stalls also have implications for ideal number of pipeline stages
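The CPI formula above translates directly into code. This is a minimal Python sketch, assuming (as the slide does) that stalls do not overlap; the second stall mix is a made-up illustration and `pipeline_cpi` is a hypothetical name.

```python
def pipeline_cpi(stalls):
    """stalls: iterable of (frequency, stall_cycles) pairs, assumed non-overlapping."""
    return 1 + sum(freq * cycles for freq, cycles in stalls)

# Rare but long penalty, as in the slide: 1% of insns stall for 10 cycles.
rare_long = pipeline_cpi([(0.01, 10)])
# Hypothetical mix: 20% of insns stall 1 cycle, 5% stall 2 cycles.
common_short = pipeline_cpi([(0.20, 1), (0.05, 2)])
print(round(rare_long, 2), round(common_short, 2))
```

Frequent short stalls cost more here than a rare long one, which is the make-common-case-fast point.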
Dependences and Hazards
- Dependence: relationship between two insns
- Data: two insns use same storage location
- Control: one insn affects whether another executes at all
- Not a bad thing, programs would be boring without them
- Enforced by making older insn go before younger one
- Happens naturally in single-/multi-cycle designs
- But not in a pipeline
- Hazard: dependence & possibility of wrong insn order
- Effects of wrong insn order cannot be externally visible
- Stall: enforce order by keeping younger insn in same stage
- Hazards are a bad thing: stalls reduce performance
Why Does Every Insn Take 5 Cycles?
- Could/should we allow add to skip M and go to W? No
– It wouldn’t help: peak fetch still only 1 insn per cycle
– Structural hazards: imagine add follows lw
[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W) with add $3,$2,$1 and lw $4,0($5) in flight]
Structural Hazards
- Structural hazards
- Two insns trying to use same circuit at same time
- E.g., structural hazard on regfile write port
- To fix structural hazards: proper ISA/pipeline design
- Each insn uses every structure exactly once
- For at most one cycle
- Always at same stage relative to F
Data Hazards
- Let’s forget about branches and the control for a while
- The three insn sequence we saw earlier executed fine…
- But it wasn’t a real program
- Real programs have data dependences
- They pass values via registers and memory
[Figure: pipeline from F/D through M/W (Register File, SX, Data Mem) with add $3,$2,$1, lw $4,0($5), sw $6,0($7) in flight]
Data Hazards
- Would this “program” execute correctly on this pipeline?
- Which insns would execute with correct inputs?
- add is writing its result into $3 in current cycle
– lw read $3 2 cycles ago → got wrong value
– addi read $3 1 cycle ago → got wrong value
- sw is reading $3 this cycle → OK (regfile timing: write first half)
[Figure: pipeline snapshot with add $3,$2,$1, lw $4,0($3), addi $6,1,$3, sw $3,0($7) in flight]
Memory Data Hazards
- What about data hazards through memory? No
- lw following sw to same address in next cycle, gets right value
- Why? DMem read/write take place in same stage
- Data hazards through registers? Yes (previous slide)
- Occur because register write is 3 stages after register read
- Can only read a register value 3 cycles after writing it
[Figure: pipeline snapshot with sw $5,0($1) followed by lw $4,0($1)]
Fixing Register Data Hazards
- Can only read register value 3 cycles after writing it
- One way to enforce this: make sure programs don’t do it
- Compiler puts two independent insns between write/read insn pair
- If they aren’t there already
- Independent means: “do not interfere with register in question”
- Do not write it: otherwise meaning of program changes
- Do not read it: otherwise create new data hazard
- Code scheduling: compiler moves around existing insns to do this
- If none can be found, must use nops
- This is called software interlocks
- MIPS: Microprocessor without Interlocked Pipeline Stages
Software Interlock Example
add $3,$2,$1
lw $4,0($3)
sw $7,0($3)
add $6,$2,$8
addi $3,$5,4
- Can any of last three insns be scheduled between first two?
- sw $7,0($3)? No, creates hazard with add $3,$2,$1
- add $6,$2,$8? OK
- addi $3,$5,4? No, lw would read $3 from it
- Still need one more insn, use nop
add $3,$2,$1
add $6,$2,$8
nop
lw $4,0($3)
sw $7,0($3)
addi $3,$5,4
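The nop-insertion half of a software interlock can be sketched in a few lines. This is a minimal Python sketch, not a real compiler pass: it does no scheduling (reordering), only pads with nops so that a reader sits at least 3 slots after the writer. The tuple encoding `(text, dest, sources)` and the name `insert_nops` are hypothetical.

```python
NOP = ("nop", None, ())

def insert_nops(program, distance=3):
    """program: list of (text, dest_reg_or_None, source_regs). Returns insn texts."""
    out = []
    for text, dest, srcs in program:
        # Only the last (distance - 1) slots can hold a writer that is too close.
        window = out[-(distance - 1):] if distance > 1 else []
        needed = 0
        for back, (_, older_dest, _) in enumerate(reversed(window), start=1):
            if older_dest is not None and older_dest in srcs:
                needed = max(needed, distance - back)
        out.extend([NOP] * needed)
        out.append((text, dest, srcs))
    return [text for text, _, _ in out]

# The slide's code after the compiler has moved add $6,$2,$8 up.
scheduled = [
    ("add $3,$2,$1", 3, (2, 1)),
    ("add $6,$2,$8", 6, (2, 8)),
    ("lw $4,0($3)", 4, (3,)),
    ("sw $7,0($3)", None, (7, 3)),
    ("addi $3,$5,4", 3, (5,)),
]
print(insert_nops(scheduled))
```

On the already scheduled sequence only one nop is needed, as on the slide; on the original order the same function would insert two nops before lw.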
Software Interlock Performance
- Same deal
- Branch: 20%, load: 20%, store: 10%, other: 50%
- Software interlocks
- 20% of insns require insertion of 1 nop
- 5% of insns require insertion of 2 nops
- CPI is still 1 technically
- But now there are more insns
- #insns = 1 + 0.20*1 + 0.05*2 = 1.3
– 30% more insns (30% slowdown) due to data hazards
Hardware Interlocks
- Problem with software interlocks? Not compatible
- Where does 3 in “read register 3 cycles after writing” come from?
- From structure (depth) of pipeline
- What if next MIPS version uses a 7 stage pipeline?
- Programs compiled assuming 5 stage pipeline will break
- A better (more compatible) way: hardware interlocks
- Processor detects data hazards and fixes them
- Two aspects to this
- Detecting hazards
- Fixing hazards
Detecting Data Hazards
- Compare F/D insn input register names with output register names of older insns in pipeline
Hazard =
(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD)
[Figure: comparators on the F/D, D/X, and X/M latches (Register File, SX, Data Mem shown) producing the hazard signal]
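The hazard expression can be modeled in software to try out different latch contents. This is a Python sketch under assumptions: the `Insn` class is hypothetical, and using `None` for "no such register" (so a nop never matches) is an addition that the four-comparison expression on the slide does not spell out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Insn:
    rs1: Optional[int] = None  # first source register, None if unused
    rs2: Optional[int] = None  # second source register, None if unused
    rd: Optional[int] = None   # destination register, None for nop/sw/branch

def hazard(fd, dx, xm):
    """True if the insn in F/D reads a register that D/X or X/M will still write."""
    sources = {r for r in (fd.rs1, fd.rs2) if r is not None}
    return any(older.rd is not None and older.rd in sources for older in (dx, xm))

add = Insn(rs1=2, rs2=1, rd=3)   # add $3,$2,$1
lw = Insn(rs1=3, rd=4)           # lw $4,0($3)
nop = Insn()

print(hazard(lw, add, nop), hazard(lw, nop, add), hazard(lw, nop, nop))
```

The three calls correspond to add sitting in D/X, then in X/M, then past X/M while lw waits in F/D.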
Fixing Data Hazards
- Prevent F/D insn from reading (advancing) this cycle
- Write nop into D/X.IR (effectively, insert nop in hardware)
- Also reset (clear) the datapath control signals
- Disable F/D latch and PC write enables (why?)
- Re-evaluate situation next cycle
[Figure: pipeline from F/D through M/W; the hazard signal selects a nop for D/X.IR]
Aside: Insert NOP/Reset Register
- Earlier: registers support separate clock, write enable
- Useful for writes into register file
- Also useful for implementing stalls
- Registers should also support synchronous reset (clear)
- Useful for implementing stalls
- Implement as additional hardwired 0 input to FF data mux
- Resetting pipeline registers equivalent to inserting a NOP
- If NOP is all zeros
- If zero means “don’t write” for all write-enable control signals
- Design ISA/control signals to make sure this is the case
[Figure: flip-flop (D, Q) with write enable WE, next to a flip-flop whose data mux is selected by the 2-bit [RST:WE]]
Hardware Interlock Example: cycle 1
(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD) = 1
[Figure: pipelined datapath with hazard logic and nop insertion; insns shown: add $3,$2,$1 and lw $4,0($3)]
Hardware Interlock Example: cycle 2
(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD) = 1
[Figure: pipelined datapath with hazard logic and nop insertion; insns shown: add $3,$2,$1 and lw $4,0($3)]
Hardware Interlock Example: cycle 3
(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD) = 0
[Figure: pipelined datapath with hazard logic and nop insertion; insns shown: add $3,$2,$1 and lw $4,0($3)]
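The three cycles of this example can be replayed with a small simulation. This is a sketch under assumptions: insns are modelled as `(rs1, rs2, rd)` tuples, the starting state has lw in F/D and add in D/X, and fetch of younger insns is not modelled.

```python
NOP = (None, None, None)   # (rs1, rs2, rd); None = field unused (assumption)
ADD = (2, 1, 3)            # add $3,$2,$1
LW = (3, None, 4)          # lw $4,0($3)


def hazard(fd, dx, xm):
    """Hazard signal: does the F/D insn read a register written from D/X or X/M?"""
    srcs = {r for r in fd[:2] if r is not None}
    dsts = {r for r in (dx[2], xm[2]) if r is not None}
    return bool(srcs & dsts)


def run(fd, dx, xm, cycles):
    """Return the hazard signal observed in each cycle."""
    trace = []
    for _ in range(cycles):
        stall = hazard(fd, dx, xm)
        trace.append(int(stall))
        xm = dx                    # older insns always advance
        if stall:
            dx = NOP               # bubble into D/X; F/D is held
        else:
            dx, fd = fd, NOP       # F/D insn advances into D/X
    return trace


print(run(LW, ADD, NOP, 3))  # [1, 1, 0]
```

The two 1s are the two stall cycles; in the third cycle add has left X/M, so lw may advance.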
Pipeline Control Terminology
- Hardware interlock maneuver is called stall or bubble
- Mechanism is called stall logic
- Part of more general pipeline control mechanism
- Controls advancement of insns through pipeline
- Distinguish from pipelined datapath control
- Controls datapath at each stage
- Pipeline control controls advancement of datapath control
Pipeline Diagram with Data Hazards
- Data hazard stall indicated with d*
- Stall propagates to younger insns
- This is not good (why?)
                  1  2  3  4  5  6  7  8  9
add $3,$2,$1      F  D  X  M  W
lw $4,0($3)          F  d* d* D  X  M  W
sw $6,4($7)                   F  D  X  M  W
Hardware Interlock Performance
- Same deal
- Branch: 20%, load: 20%, store: 10%, other: 50%
- Hardware interlocks: same as software interlocks
- 20% of insns require 1 cycle stall (i.e., insertion of 1 nop)
- 5% of insns require 2 cycle stall (i.e., insertion of 2 nops)
- CPI = 1 + 0.20*1 + 0.05*2 = 1.3
- So, either CPI stays at 1 and #insns increases 30% (software)
- Or, #insns stays at 1 (relative) and CPI increases 30% (hardware)
- Same difference
- Anyway, we can do better
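The CPI arithmetic for this slide, as a small sketch (the function name and argument layout are made up for illustration): base CPI of 1, plus each stall class's frequency times its stall length.

```python
def cpi_with_stalls(stall_mix, base_cpi=1.0):
    """stall_mix: iterable of (fraction_of_insns, stall_cycles) pairs."""
    return base_cpi + sum(frac * cycles for frac, cycles in stall_mix)


# 20% of insns stall 1 cycle, 5% stall 2 cycles
cpi = cpi_with_stalls([(0.20, 1), (0.05, 2)])
print(round(cpi, 2))  # 1.3
```

With software interlocks the same 0.30 shows up as extra nop insns instead of extra cycles per insn, which is why the two approaches cost the same.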
Observe
- Technically, this situation is broken
- lw $4,0($3) has already read $3 from regfile
- add $3,$2,$1 hasn’t yet written $3 to regfile
- But fundamentally, everything is OK
- lw $4,0($3) hasn’t actually used $3 yet
- add $3,$2,$1 has already computed $3
[Figure: pipelined datapath; insns shown: add $3,$2,$1 and lw $4,0($3)]
Bypassing
- Bypassing
- Reading a value from an intermediate (µarchitectural) source
- Not waiting until it is available from primary source
- Here, we are bypassing the register file
- Also called forwarding
[Figure: pipelined datapath with a bypass path around the register file; insns shown: add $3,$2,$1 and lw $4,0($3)]
WX Bypassing
- What about this combination?
- Add another bypass path and MUX input
- First one was an MX bypass
- This one is a WX bypass
[Figure: pipelined datapath with MX and WX bypass paths into the ALU input MUX; insns shown: add $3,$2,$1 and lw $4,0($3)]
ALUinB Bypassing
- Can also bypass to ALU input B
[Figure: pipelined datapath with bypass paths to ALU input B; insns shown: add $3,$2,$1 and add $4,$2,$3]
WM Bypassing?
- Does WM bypassing make sense?
- Not to the address input (why not?)
- But to the store data input, yes
[Figure: pipelined datapath with a WM bypass to the Data Mem data input; insns shown: lw $3,0($2) and sw $3,0($4)]
Bypass Logic
- Each MUX has its own bypass logic; here it is for MUX ALUinA
(D/X.IR.RS1 == X/M.IR.RD) => 0
(D/X.IR.RS1 == M/W.IR.RD) => 1
Else => 2
[Figure: pipelined datapath with bypass logic driving the ALUinA MUX]
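The ALUinA select logic, written out as a sketch. The select encoding (0, 1, 2) is from the slide; reading 2 as "use the value already read from the register file" and treating `None` as "no register" are assumptions of this example.

```python
def alu_in_a_select(dx_rs1, xm_rd, mw_rd):
    """Return the ALUinA MUX select for the insn currently in D/X."""
    if dx_rs1 is not None and dx_rs1 == xm_rd:
        return 0   # MX bypass: take the ALU output held in X/M
    if dx_rs1 is not None and dx_rs1 == mw_rd:
        return 1   # WX bypass: take the value held in M/W
    return 2       # no bypass needed


# lw $4,0($3) in D/X, add $3,$2,$1 one stage ahead in X/M
print(alu_in_a_select(dx_rs1=3, xm_rd=3, mw_rd=None))  # 0
```

Checking X/M before M/W matters when both older insns write the same register: the insn in X/M is the younger producer, so its value is the one the reader should see.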
Bypass and Stall Logic
- Two separate things
- Stall logic controls pipeline registers
- Bypass logic controls MUXs
- But complementary
- For a given data hazard: if can’t bypass, must stall
- Slide #43 shows full bypassing: all bypasses possible
- Is stall logic still necessary?
Yes, Load Output to ALU Input
Stall = (D/X.IR.OP == LOAD) &&
        ((F/D.IR.RS1 == D/X.IR.RD) ||
         ((F/D.IR.RS2 == D/X.IR.RD) && (F/D.IR.OP != STORE)))
[Figure: pipelined datapath with stall logic inserting a nop; insns shown: lw $3,0($2) and add $4,$2,$3]
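The load-use stall condition as a sketch. Opcode classes are modelled as strings ("LOAD", "STORE", "ALU"), which is an assumption of this example; following the formula, RS2 of a store is taken to be the store data, which the WM bypass can supply without stalling.

```python
def load_use_stall(fd_op, fd_rs1, fd_rs2, dx_op, dx_rd):
    """True if the insn in F/D must wait one cycle for the load in D/X."""
    if dx_op != "LOAD":
        return False
    needs_rs1 = fd_rs1 == dx_rd
    # A store's data operand can be bypassed later (WM), so it does not stall.
    needs_rs2 = fd_rs2 == dx_rd and fd_op != "STORE"
    return needs_rs1 or needs_rs2


# lw $3,0($2) followed by add $4,$2,$3
print(load_use_stall("ALU", 2, 3, "LOAD", 3))    # True
# lw $3,0($2) followed by sw $3,0($4): store data only
print(load_use_stall("STORE", 4, 3, "LOAD", 3))  # False
```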
Pipeline Diagram With Bypassing
- Use compiler scheduling to reduce load-use stall frequency
- Like software interlocks, but for performance not correctness
Load-use stall:
                  1  2  3  4  5  6  7  8  9
add $3,$2,$1      F  D  X  M  W
lw $4,0($3)          F  D  X  M  W
addi $6,$4,1            F  d* D  X  M  W
Independent insn scheduled after the load:
                  1  2  3  4  5  6  7  8  9
add $3,$2,$1      F  D  X  M  W
lw $4,0($3)          F  D  X  M  W
sub $8,$3,$1            F  D  X  M  W
addi $6,$4,1               F  D  X  M  W
Control Hazards
- Control hazards
- Must fetch post branch insns before branch outcome is known
- Default: assume “not-taken” (at fetch, can’t tell it’s a branch)
[Figure: pipelined datapath front end (PC, Insn Mem, +4, <<2, Register File; latches F/D, D/X, X/M)]
Branch Recovery
- Branch recovery: what to do when branch is actually taken
- Insns that will be written into F/D and D/X are wrong
- Flush them, i.e., replace them with nops
+ They haven’t written permanent state yet (regfile, DMem)
[Figure: pipelined datapath front end with nops written into F/D and D/X on a taken branch]
Branch Recovery Pipeline Diagram
- Convention: don’t fill in flushed insns
- Taken branch penalty is 2 cycles
With flushed insns shown:
                     1  2  3  4  5  6  7  8  9
addi $3,$0,1         F  D  X  M  W
bnez $3,targ            F  D  X  M  W
sw $6,4($7)                F  D
targ: addi $8,$7,1            F
targ: addi $8,$7,1               F  D  X  M  W
By convention (flushed insns omitted):
                     1  2  3  4  5  6  7  8  9
addi $3,$0,1         F  D  X  M  W
bnez $3,targ            F  D  X  M  W
targ: addi $8,$7,1               F  D  X  M  W
Branch Performance
- Back of the envelope calculation
- Branch: 20%, load: 20%, store: 10%, other: 50%
- 75% of branches are taken
- CPI = 1 + 0.20*0.75*2 = 1.3
– Branches cause 30% slowdown
- How do we reduce this penalty?
Fast Branch
- Fast branch: can decide at D, not X
- Test must be comparison to zero or equality, no time for ALU
+ New taken branch penalty is 1
– Additional insns (slt) for more complex tests, must bypass to D too
- 25% of branches have complex tests that require extra insn
- CPI = 1 + 0.20*0.75*1(branch) + 0.20*0.25*1(extra insn) = 1.2
[Figure: pipelined datapath with the branch comparison (<>) moved into the decode stage]
Speculative Execution
- Speculation: “risky transactions on chance of profit”
- Speculative execution
- Execute before all parameters known with certainty
- Correct speculation
+ Avoid stall, improve performance
- Incorrect speculation (mis-speculation)
– Must abort/flush/squash incorrect insns
– Must undo incorrect changes (recover pre-speculation state)
- The “game”: [%correct * gain] – [(1–%correct) * penalty]
- Control speculation: speculation aimed at control hazards
- Unknown parameter: are these the correct insns to execute next?
Control Speculation Mechanics
- Guess branch target, start fetching at guessed position
- Doing nothing is implicitly guessing target is PC+4
- Can actively guess other targets: dynamic branch prediction
- Execute branch to verify (check) guess
- Correct speculation? keep going
- Mis-speculation? Flush mis-speculated insns
- Hopefully haven’t modified permanent state (Regfile, DMem)
+ Happens naturally in in-order 5-stage pipeline
- “Game” for in-order 5 stage pipeline
- %correct = ?
- Gain = 2 cycles
+ Penalty = 0 cycles → mis-speculation no worse than stalling
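The "game" formula as a sketch, measured in cycles saved per branch relative to always stalling. The 25% figure below assumes the earlier statistic that 75% of branches are taken, so an implicit not-taken guess is right 25% of the time.

```python
def speculation_payoff(p_correct, gain, penalty):
    """Expected cycles saved per speculation: [%correct * gain] - [(1 - %correct) * penalty]."""
    return p_correct * gain - (1.0 - p_correct) * penalty


# In-order 5-stage pipeline: gain = 2 cycles, penalty = 0 cycles
print(speculation_payoff(0.25, gain=2, penalty=0))  # 0.5
```

Because the penalty term is zero here, the payoff is never negative, whatever %correct turns out to be.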
Dynamic Branch Prediction
- Dynamic branch prediction: guess outcome
- Start fetching from guessed address
- Flush on mis-prediction (notice new recovery circuit)
[Figure: pipelined datapath with branch predictor (BP) at fetch, predicted target (TG) carried in F/D and D/X, and a recovery circuit that writes nops into F/D and D/X]
Branch Prediction: Short Summary
- Key principle of micro-architecture:
- Programs do the same thing over and over (why?)
- Exploit for performance:
- Learn what a program did before
- Guess that it will do the same thing again
- Details of branch prediction: later (~1 month)
- For now, just know it can be done and is important to performance
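To make "guess that it will do the same thing again" concrete, here is a toy last-outcome predictor. It is an illustrative assumption, not the predictor design covered later in the course: one remembered outcome per branch PC, defaulting to not-taken.

```python
class LastOutcomePredictor:
    """Predict that each branch repeats its previous outcome (toy model)."""

    def __init__(self):
        self.table = {}                     # branch PC -> last outcome

    def predict(self, pc):
        return self.table.get(pc, False)    # unseen branch: guess not-taken

    def update(self, pc, taken):
        self.table[pc] = taken


def accuracy(outcomes, pc=0x40):
    """Fraction of correct predictions for one branch with the given outcome history."""
    bp = LastOutcomePredictor()
    correct = 0
    for taken in outcomes:
        correct += bp.predict(pc) == taken
        bp.update(pc, taken)
    return correct / len(outcomes)


loop_branch = [True] * 9 + [False]   # loop back-edge: taken 9 times, then falls through
print(accuracy(loop_branch))         # 0.8
```

The two misses are the first iteration (nothing learned yet) and the loop exit.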
Branch Prediction Performance
- Dynamic branch prediction
- Simple predictor: branches predicted with 75% accuracy
- CPI = 1 + 0.20*0.25*2 = 1.1
- More advanced predictor: 95% accuracy
- CPI = 1 + 0.20*0.05*2 = 1.02
- Branch mis-predictions still a big problem though
- Pipelines are long: typical mis-prediction penalty is 10+ cycles
- Pipelines have full bypassing: compiler schedules the rest
- Pipelines are superscalar (later)
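A sketch of the CPI arithmetic above, extended to the longer mis-prediction penalty mentioned in the last bullets. The 20% branch frequency is the mix used throughout; applying the 10-cycle penalty to the same mix is an illustrative assumption.

```python
def cpi_with_predictor(accuracy, mispredict_penalty=2, branch_frac=0.20):
    """CPI = 1 + branch fraction * mis-prediction rate * penalty."""
    return 1.0 + branch_frac * (1.0 - accuracy) * mispredict_penalty


for penalty in (2, 10):
    for acc in (0.75, 0.95):
        print(penalty, acc, round(cpi_with_predictor(acc, penalty), 2))
# 2 0.75 1.1
# 2 0.95 1.02
# 10 0.75 1.5
# 10 0.95 1.1
```

With a 10-cycle penalty, even the 95% predictor only gets back to the CPI that the simple predictor achieved on the 5-stage pipeline.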
Pipelining And Exceptions
- Pipelining makes exceptions nasty
- 5 insns in pipeline at once
- Exception happens, how do you know which insn caused it?
- Exceptions propagate along pipeline in latches
- Two exceptions happen, how do you know which one to take first?
- One belonging to oldest insn
- When handling exception, have to flush younger insns
- Piggy-back on branch mis-prediction machinery to do this
- What about multi-cycle operations?
- Just FYI
Pipeline Depth
- No magic about 5 stages, trend had been to deeper pipelines
- 486: 5 stages (50+ gate delays / clock)
- Pentium: 7 stages
- Pentium II/III: 12 stages
- Pentium 4: 22 stages (~10 gate delays / clock) “super-pipelining”
- Core1/2: 14 stages
- Increasing pipeline depth
+ Increases clock frequency (reduces period)
– But decreases IPC (increases CPI)
- Branch mis-prediction penalty becomes longer
- Non-bypassed data hazard stalls become longer
- At some point, CPI losses offset clock gains, question is when?
- 1 GHz Pentium 4 was slower than 800 MHz Pentium III
- What was the point? People buy frequency, not frequency * IPC
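Since performance is frequency * IPC, the clock comparison alone fixes the break-even point. A sketch (no measured IPC values are assumed; the function names are made up for illustration):

```python
def insns_per_second(freq_hz, ipc):
    """Throughput = clock frequency * insns per cycle."""
    return freq_hz * ipc


def break_even_ipc_ratio(freq_deep_hz, freq_shallow_hz):
    """IPC_deep / IPC_shallow below which the deeper, faster-clocked pipeline is slower."""
    return freq_shallow_hz / freq_deep_hz


print(break_even_ipc_ratio(1.0e9, 800e6))  # 0.8
```

So the 1 GHz part being slower than the 800 MHz part means its IPC had dropped below 80% of the shallower pipeline's IPC.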