SLIDE 1

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 1

CS104 Computer Organization and Design

Datapaths

SLIDE 2

Admin

Homework
Homework 4 out tonight
Due Monday March 26th
Download/check your submissions
Reading:
Chapter 4
(Maybe review 1.4)
Midterm 2
March 28

SLIDE 3

What did we do last week?

Who can remind us what we did last week?

SLIDE 4

What did we do last week?

Who can remind us what we did last week?
Ski
Go to the beach
Sleep in
Read a book
…
Ok, but seriously?

SLIDE 5

When last I saw you all..

Last time I was here (Feb 27/29)
Learned basics of logic design
Gates (And, Or, Nor, …)
Put gates together to make
Muxes
Adders
Latches
Flip-flops
…

SLIDE 6

While I was at HPCA..

Prof. Lebeck started teaching you all about datapaths
Putting logic together to execute instructions
Started on single-cycle datapath
We’ll review/continue with single cycle
Then jump into more things!

SLIDE 7

Datapath for MIPS ISA

Consider only the following instructions

add $1,$2,$3
addi $1,$3,2
lw $1,4($3)
sw $1,4($3)
beq $1,$2,PC_relative_target
j absolute_target

Why only these?
Most other instructions are the same from datapath viewpoint
The ones that aren’t are left for you to figure out

SLIDE 8

Start With Fetch

PC and instruction memory
A +4 incrementer computes default next instruction PC

[Figure: PC, Insn Mem, +4 incrementer]

SLIDE 9

First Instruction: add

Add register file and ALU

[Figure: PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), ALU]

SLIDE 10

Second Instruction: addi

Destination register can now be either Rd or Rt
Add sign extension unit and mux into second ALU input

[Figure: PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), sign extension (SX), ALU]

SLIDE 11

Third Instruction: lw

Add data memory, address is ALU output
Add register write data mux to select memory output or ALU output

[Figure: PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), sign extension (SX), ALU, Data Mem (a, d)]

SLIDE 12

Fourth Instruction: sw

Add path from second input register to data memory data input

[Figure: PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), sign extension (SX), ALU, Data Mem (a, d)]

SLIDE 13

Fifth Instruction: beq

Add left shift unit and adder to compute PC-relative branch target
Add PC input mux to select PC+4 or branch target
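
A minimal Python sketch of the next-PC selection described above. It assumes the usual MIPS convention that the 16-bit immediate is sign-extended, shifted left by 2, and added to PC+4 (the slide only says the target is PC-relative). The function names are hypothetical.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits as a signed two's-complement value (the SX unit)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, imm16: int, branch_taken: bool) -> int:
    """Model the PC input mux: choose PC+4 or the PC-relative branch target."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # Assumption: the target is relative to PC+4, as in standard MIPS.
    target = (pc_plus_4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return target if branch_taken else pc_plus_4


if __name__ == "__main__":
    print(hex(next_pc(0x1000, 3, True)))       # forward branch by 3 insns
    print(hex(next_pc(0x1000, 0xFFFF, True)))  # backward branch by 1 insn
    print(hex(next_pc(0x1000, 3, False)))      # not taken: PC+4
```

In this sketch `branch_taken` stands in for the BR control signal combined with the ALU zero output.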

[Figure: PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), sign extension (SX), ALU with zero output (z), Data Mem (a, d), << 2 shifter]

SLIDE 14

Sixth Instruction: j

Add shifter to compute left shift of 26-bit immediate
Add additional PC input mux for jump target

[Figure: PC, Insn Mem, +4 incrementer, Register File (s1, s2, d), sign extension (SX), ALU, Data Mem (a, d), two << 2 shifters]

SLIDE 15

“Continuous Read” Datapath Timing

Works because writes (PC, RegFile, DMem) are independent
And because no read logically follows any write

[Figure: single-cycle datapath; timing order: Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC]

SLIDE 16

What Is Control?

9 signals control flow of data through this datapath
MUX selectors, or register/memory write enable signals
A real datapath has 300-500 control signals

[Figure: single-cycle datapath annotated with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]

SLIDE 17

Example: Control for add

[Figure: single-cycle datapath with control signals set for add]

BR=0 JP=0 Rwd=0 DMwe=0 ALUop=0 ALUinB=0 Rdst=0 Rwe=1

SLIDE 18

Example: Control for sw

Difference between sw and add is 5 signals
3 if you don’t count the X (don’t care) signals

[Figure: single-cycle datapath with control signals set for sw]

Rwe=0 ALUinB=1 DMwe=1 JP=0 ALUop=0 BR=0 Rwd=X Rdst=X

SLIDE 19

Example: Control for beq

Difference between sw and beq is only 4 signals

[Figure: single-cycle datapath with control signals set for beq]

Rwe=0 ALUinB=0 DMwe=0 JP=0 ALUop=1 BR=1 Rwd=X Rdst=X

SLIDE 20

You all figure out LW

How would these control signals be set for LW?

[Figure: single-cycle datapath with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst left blank]

SLIDE 21

Example: Control for LW

[Figure: single-cycle datapath with control signals set for lw]

BR=0 JP=0 Rwd=1 DMwe=0 ALUop=0 ALUinB=1 Rdst=1 Rwe=1

SLIDE 22

How Is Control Implemented?

[Figure: single-cycle datapath; control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst come from a block labeled "Control?"]

SLIDE 23

Implementing Control

Each insn has a unique set of control signals
Most are function of opcode
Some may be encoded in the instruction itself
E.g., the ALUop signal is some portion of the MIPS Func field

+ Simplifies controller implementation
– Requires careful ISA design

SLIDE 24

Control Implementation: ROM

ROM (read only memory): think rows of bits
Bits in data words are control signals
Lines indexed by opcode
Example: ROM control for 6-insn MIPS datapath
X is “don’t care”

opcode  BR  JP  ALUinB  ALUop  DMwe  Rwe  Rdst  Rwd
add     0   0   0       0      0     1    0     0
addi    0   0   1       0      0     1    1     0
lw      0   0   1       0      0     1    1     1
sw      0   0   1       0      1     0    X     X
beq     1   0   0       1      0     0    X     X
j       0   1   0       0      0     0    X     X
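
A minimal Python sketch of a ROM-style controller for the six instructions: the opcode indexes a row, and the bits in that row are the control signals. Filling the table's blank cells with 0 is an assumption, and the helper names `control` and `differing_signals` are hypothetical.

```python
SIGNALS = ("BR", "JP", "ALUinB", "ALUop", "DMwe", "Rwe", "Rdst", "Rwd")
X = None  # "don't care"

# One ROM row per opcode. Blank cells in the slide's table are assumed to be 0.
CONTROL_ROM = {
    "add":  (0, 0, 0, 0, 0, 1, 0, 0),
    "addi": (0, 0, 1, 0, 0, 1, 1, 0),
    "lw":   (0, 0, 1, 0, 0, 1, 1, 1),
    "sw":   (0, 0, 1, 0, 1, 0, X, X),
    "beq":  (1, 0, 0, 1, 0, 0, X, X),
    "j":    (0, 1, 0, 0, 0, 0, X, X),
}


def control(opcode: str) -> dict:
    """Look up the ROM row for an opcode and name each bit."""
    return dict(zip(SIGNALS, CONTROL_ROM[opcode]))


def differing_signals(op_a: str, op_b: str, count_dont_cares: bool = True) -> list:
    """List the signals whose settings differ between two instructions."""
    a, b = control(op_a), control(op_b)
    diff = []
    for name in SIGNALS:
        if a[name] == b[name]:
            continue
        if not count_dont_cares and (a[name] is X or b[name] is X):
            continue
        diff.append(name)
    return diff


if __name__ == "__main__":
    print(control("lw"))
    print(differing_signals("add", "sw"))
    print(differing_signals("sw", "beq"))
```

With these rows, add and sw differ in 5 signals (3 without the don't-cares) and sw and beq differ in 4, matching the counts quoted on the earlier control slides.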

SLIDE 25

Control Implementation: Random Logic

Real machines have 100+ insns and 300+ control signals
30,000+ control bits (~4KB)

– Not huge, but hard to make faster than datapath (important!)

Alternative: random logic (random = ‘non-repeating’)
Exploits the observation: many signals have few 1s or few 0s
Example: random logic control for 6-insn MIPS datapath

[Figure: random logic deriving BR, JP, ALUinB, ALUop, DMwe, Rwe, Rdst, Rwd from decoded opcode lines add, addi, lw, sw, beq, j]

SLIDE 26

Datapath and Control Timing

[Figure: single-cycle datapath with Control ROM/random logic; timing order: Read IMem, Read Registers (Read Control ROM), Read DMEM, Write DMEM, Write Registers, Write PC]

SLIDE 27

Single-Cycle Datapath Performance

Goes against make common case fast (MCCF) principle

+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn

[Figure: single-cycle datapath with Control ROM/random logic]

SLIDE 28

Interlude: Performance

Previous slide alludes to something new: Performance
Don’t just want it to work…
But want it to go fast!
Three components to performance:

Number of instructions
x Cycles per instruction (CPI)
x Clock Period (1 / Clock frequency)

\[ \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}} \]

SLIDE 29

Interlude: Performance

Three components to performance:

Number of instructions <- Compiler’s Job
x Cycles per instruction (CPI)
x Clock Period (1 / Clock frequency)

\[ \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}} \]

Insns/Program: determined by compiler + ISA
Generally assume a fixed program when doing micro-architecture

SLIDE 30

Micro-architectural factors

Micro-architecture:
The details of how the ISA is implemented
Affects CPI and Clock frequency
Often will look at fixed program, and consider MIPS
Million Instructions Per Second
MIPS = IPC * Frequency (in MHz)
IPC = Instruction Per Cycle (1 / CPI)
Gives “Bigger is better” number

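\[ \underbrace{\frac{\text{Instructions}}{\text{Cycle}}}_{\text{IPC}} \times \underbrace{\frac{\text{Cycles}}{\text{Second}}}_{\text{Frequency}} = \underbrace{\frac{\text{Instructions}}{\text{Second}}}_{\text{Throughput}} \]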
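
A small Python sketch of the two formulas above: execution time as instructions x CPI x clock period, and MIPS as IPC x frequency in MHz. The function names and the sample numbers (one million instructions, a 50 ns clock) are illustrative assumptions.

```python
def execution_time_s(instructions: int, cpi: float, clock_period_s: float) -> float:
    """Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cpi * clock_period_s


def mips(ipc: float, frequency_mhz: float) -> float:
    """Million Instructions Per Second = IPC x Frequency (in MHz)."""
    return ipc * frequency_mhz


if __name__ == "__main__":
    # Hypothetical machine: CPI = 1 with a 50 ns clock period, i.e. 20 MHz.
    print(execution_time_s(1_000_000, cpi=1.0, clock_period_s=50e-9))
    print(mips(ipc=1.0, frequency_mhz=20.0))
```

Because IPC = 1 / CPI, a lower CPI or a higher frequency both raise the MIPS figure, which is why it works as a "bigger is better" number for a fixed program.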

SLIDE 31

“Best” IPC

For now, best we can do: IPC = 1 (CPI = 1)
Do 1 instruction every cycle
Later:
Real processors can do multiple instructions at once!
Potentially: IPC > 1!
Best possible IPC depends on design

SLIDE 32

Performance vs ….

1990s: Performance at all cost
Actually more “clock frequency” at all cost…
Now: Care about other things
Energy (electric bill, battery life)
Power (cooling, also affects energy)
Area (chip cost)
Reliability (tolerance of transient faults: e.g., charged-particle strikes)
…
Important metric these days: “Performance / Watt”
Throughput divided by power consumption
Why?

SLIDE 33

Performance Modeling and Analysis

Speaking of performance
Making a processor takes time (years) and money (millions)
Want to know it will perform well before you finish
If it’s wrong, doing it all over is painful…
Performance can be simulated in software
Estimate what IPC will be
Guide design
This is my other job by the way…

SLIDE 34

Single-Cycle Datapath Performance

Goes against make common case fast (MCCF) principle

+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn

[Figure: single-cycle datapath with Control ROM/random logic]

SLIDE 35

Alternative: Multi-Cycle Datapath

Multi-cycle datapath: attacks high clock period
Cut datapath into multiple stages (5 here), isolate using FFs
FSM control “walks” insns thru stages (by staging control signals)

+ Insns can bypass stages and exit early

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D; control signals labeled by stage (s3, s4, s5)]

SLIDE 36

Finite State Machine (FSM)

FSM = States + Transitions
Next state: function of current state + inputs
Outputs: function of current state + inputs
Canonical Example: Combination Lock
Must enter 3 8 4 to unlock
P.S. Useful in software too

SLIDE 37

Finite State Machines: Example

Combination Lock Example:
Need to enter 3 8 4 to unlock
Initial State: no valid piece of combo seen

[State diagram: Start]

SLIDE 38

Finite State Machines: Example

Combination Lock Example:
Need to enter 3 8 4 to unlock
Input of 3: transition to new state
Any other input: stay in same state

[State diagram: Start -> 1 on 3; Start -> Start on 0-2,4-9]

SLIDE 39

Finite State Machines: Example

Combination Lock Example:
Need to enter 3 8 4 to unlock
State 1:
Input = 8? Goto state 2

[State diagram: Start -> 1 on 3; Start -> Start on 0-2,4-9; 1 -> 2 on 8; 1 -> 1 on 3; 1 -> Start on 0-2,4-7,9]

SLIDE 40

Finite State Machines: Example

Combination Lock Example:
Need to enter 3 8 4 to unlock
State 2:
Input = 4? Goto state 3

[State diagram: Start -> 1 on 3; Start -> Start on 0-2,4-9; 1 -> 2 on 8; 1 -> 1 on 3; 1 -> Start on 0-2,4-7,9; 2 -> 3 on 4; 2 -> 1 on 3; 2 -> Start on 0-2,5-9]

SLIDE 41

Finite State Machines: Example

Combination Lock Example:
Need to enter 3 8 4 to unlock
State 3:

Unlock!

[State diagram: Start -> 1 on 3; Start -> Start on 0-2,4-9; 1 -> 2 on 8; 1 -> 1 on 3; 1 -> Start on 0-2,4-7,9; 2 -> 3 on 4; 2 -> 1 on 3; 2 -> Start on 0-2,5-9; state 3 unlocks]
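
As the earlier slide notes, FSMs are useful in software too. Below is a minimal Python sketch of the 3 8 4 combination lock as a next-state function. The state constants and function names are hypothetical, and returning to Start after unlocking is an assumption (the diagram does not show what follows state 3).

```python
START, SAW_3, SAW_38, UNLOCKED = 0, 1, 2, 3


def next_state(state: int, digit: int) -> int:
    """Next state is a function of the current state and the input digit."""
    if state == UNLOCKED:
        return START  # assumption: the lock resets after unlocking
    if digit == 3:
        return SAW_3  # a 3 always (re)starts the combination
    if state == SAW_3 and digit == 8:
        return SAW_38
    if state == SAW_38 and digit == 4:
        return UNLOCKED
    return START  # any other input: no valid piece of the combo seen


def unlocks(digits) -> bool:
    """Return True if the digit sequence ever reaches the unlocked state."""
    state = START
    for digit in digits:
        state = next_state(state, digit)
        if state == UNLOCKED:
            return True
    return False


if __name__ == "__main__":
    print(unlocks([3, 8, 4]))        # the combination itself
    print(unlocks([3, 8, 3, 8, 4]))  # a stray 3 restarts the combo
    print(unlocks([3, 8, 5, 4]))     # a wrong digit falls back to Start
```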

SLIDE 42

FSM in Hardware

Flip-flop(s) to hold state(s)
Combinational logic to determine next state/output
(Assumes FF enable on input_valid)

SLIDE 43

FSM Hardware Example

SLIDE 44

FSM Hardware Example

SLIDE 45

FSM Hardware Example

SLIDE 46

FSM Hardware Example

SLIDE 47

FSM Hardware Example

SLIDE 48

FSM Hardware Example

SLIDE 49

FSM Hardware Example

SLIDE 50

FSM Implementation: ROM

Just saw: FSM implemented with sum-of-products
Remind us what that is?
Can also be implemented with a ROM

[Figure: 2^(N+K)-entry ROM indexed by the N-bit state register and the K-bit input (N + K address bits); each entry supplies the N-bit next state and the M-bit output]

SLIDE 51

FSM ROM Implementation Example

Combination Lock (3 8 4) Example
4-bit input
2-bit state
64-entry ROM (indexed with S1 S0 I3 I2 I1 I0)
Each entry needs 3 bits (S1 S0 U)
2 for next state
1 for unlock signal
Example entries in ROM
0x00 = 000
0x03 = 010
0x18 = 100
0x13 = 010
0x3_ = 001
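
A minimal Python sketch that builds the 64-entry ROM for the lock and reproduces the example entries listed above. The address is S1 S0 I3 I2 I1 I0 and each entry is S1 S0 U. Treating the unused input codes 10-15 like any other wrong digit is an assumption, and `build_lock_rom` is a hypothetical name.

```python
def build_lock_rom() -> list:
    """Return a 64-entry ROM: index = (state << 4) | input, value = (next_state << 1) | unlock."""
    rom = [0] * 64
    for state in range(4):
        for digit in range(16):  # 4-bit input; codes 10-15 assumed to act as wrong digits
            if state == 3:
                nxt, unlock = 0, 1      # unlocked: assert U and return to Start
            elif digit == 3:
                nxt, unlock = 1, 0      # a 3 always leads to state 1
            elif state == 1 and digit == 8:
                nxt, unlock = 2, 0
            elif state == 2 and digit == 4:
                nxt, unlock = 3, 0
            else:
                nxt, unlock = 0, 0
            rom[(state << 4) | digit] = (nxt << 1) | unlock
    return rom


if __name__ == "__main__":
    rom = build_lock_rom()
    for addr in (0x00, 0x03, 0x18, 0x13, 0x30):
        print(f"0x{addr:02x} = {rom[addr]:03b}")
```

One FSM step is then a single ROM read: the current state and input form the address, and the entry's upper two bits are written back into the state register.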

SLIDE 52

Multi-cycle Datapath FSM

First state: Get a New Instruction
Output signals to fetch (e.g., read enable IMEM)
Next State: Always Decode

[State diagram: Next Insn -> Decode Insn]

SLIDE 53

Multi-cycle Datapath FSM

Second State: Decode
Output signals to decode instruction (RdEn RegFile)
Go to Next Insn if NOP
Otherwise Execute

[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn]

SLIDE 54

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
Branches: Next Insn

[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch)]

SLIDE 55

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
ALU op: write register

[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch); Execute Insn -> Writeback (ALU)]

SLIDE 56

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
Load: Read Memory

[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch); Execute Insn -> Writeback (ALU); Execute Insn -> Read DMEM (Load)]

SLIDE 57

Multi-cycle Datapath FSM

Execute State
Execute Insn (varies by insn type)
Next State: Also depends on insn type
Store: Write Memory

[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch); Execute Insn -> Writeback (ALU); Execute Insn -> Read DMEM (Load); Execute Insn -> Write DMEM (Store)]

SLIDE 58

Multi-cycle Datapath FSM

Read DMEM State
Control signals enable DMEM Read
Next state is writeback

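[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch); Execute Insn -> Writeback (ALU); Execute Insn -> Read DMEM (Load); Execute Insn -> Write DMEM (Store); Read DMEM -> Writeback]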

SLIDE 59

Multi-cycle Datapath FSM

Writeback state
Control signals enable regfile write
Next state: Next Insn

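[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch); Execute Insn -> Writeback (ALU); Execute Insn -> Read DMEM (Load); Execute Insn -> Write DMEM (Store); Read DMEM -> Writeback; Writeback -> Next Insn]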

SLIDE 60

Multi-cycle Datapath FSM

Write DMEM state
Control signals enable memory write
Next state: Next Insn

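[State diagram: Next Insn -> Decode Insn; Decode Insn -> Next Insn (NOP); Decode Insn -> Execute Insn; Execute Insn -> Next Insn (Branch); Execute Insn -> Writeback (ALU); Execute Insn -> Read DMEM (Load); Execute Insn -> Write DMEM (Store); Read DMEM -> Writeback; Writeback -> Next Insn; Write DMEM -> Next Insn]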
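
The control FSM built up over the last few slides can be sketched in Python as a next-state function. State names follow the slides; the instruction-kind strings ("nop", "branch", "alu", "load", "store") and the helper `cycles_for` are hypothetical labels for this sketch.

```python
def next_state(state: str, kind: str) -> str:
    """Next state of the multi-cycle control FSM for an instruction of the given kind."""
    if state == "Next Insn":
        return "Decode Insn"  # fetch is always followed by decode
    if state == "Decode Insn":
        return "Next Insn" if kind == "nop" else "Execute Insn"
    if state == "Execute Insn":
        return {
            "branch": "Next Insn",
            "alu": "Writeback",
            "load": "Read DMEM",
            "store": "Write DMEM",
        }[kind]
    if state == "Read DMEM":
        return "Writeback"
    if state in ("Writeback", "Write DMEM"):
        return "Next Insn"
    raise ValueError(f"unknown state: {state}")


def cycles_for(kind: str) -> int:
    """Count the states an instruction visits before the FSM returns to Next Insn."""
    state, cycles = "Next Insn", 0
    while True:
        cycles += 1
        state = next_state(state, kind)
        if state == "Next Insn":
            return cycles


if __name__ == "__main__":
    for kind in ("branch", "alu", "store", "load"):
        print(kind, cycles_for(kind))
```

Walking the FSM this way gives 3 cycles for branches, 4 for ALU operations and stores, and 5 for loads.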

SLIDE 61

Multi-Cycle Datapath Example: Add

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

Example: Add
Cycle 1: Read IMEM

SLIDE 62

Multi-Cycle Datapath Example: Add

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF

SLIDE 63

Multi-Cycle Datapath Example: Add

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF
Cycle 3: ALU

SLIDE 64

Multi-Cycle Datapath Example: Add

Example: Add
Cycle 1: Read IMEM
Cycle 2: Decode + Read RF
Cycle 3: ALU
Cycle 4: Writeback + Increment PC

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

SLIDE 65

Multi-Cycle Datapath Performance

Opposite performance split of single-cycle datapath

+ Short clock period
– High CPI

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

SLIDE 66

Multi-cycle Data-path CPI

CPI depends on instructions
Branches / Jumps: 3 cycles
ALU: 4 cycles
Stores: 4 cycles
Loads: 5 cycles
Overall CPI is weighted average
Example:
20% loads, 15% stores, 20% branches, 45% ALU

SLIDE 67

Multi-cycle Data-path CPI

CPI depends on instructions
Branches / Jumps: 3 cycles
ALU: 4 cycles
Stores: 4 cycles
Loads: 5 cycles
Overall CPI is weighted average
Example:
20% loads, 15% stores, 20% branches, 45% ALU

CPI= 0.20 * 5 +

SLIDE 68

Multi-cycle Data-path CPI

CPI depends on instructions
Branches / Jumps: 3 cycles
ALU: 4 cycles
Stores: 4 cycles
Loads: 5 cycles
Overall CPI is weighted average
Example:
20% loads, 15% stores, 20% branches, 45% ALU

CPI= 0.20 * 5 + 0.15 * 4 +

SLIDE 69

Multi-cycle Data-path CPI

CPI depends on instructions
Branches / Jumps: 3 cycles
ALU: 4 cycles
Stores: 4 cycles
Loads: 5 cycles
Overall CPI is weighted average
Example:
20% loads, 15% stores, 20% branches, 45% ALU

CPI= 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
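
The weighted average above can be checked with a few lines of Python. The cycle counts and instruction mix come from the slide; the names `CYCLES` and `average_cpi` are hypothetical.

```python
# Cycles per instruction class on the multi-cycle datapath (from the slide).
CYCLES = {"load": 5, "store": 4, "branch": 3, "alu": 4}


def average_cpi(mix: dict) -> float:
    """Overall CPI = sum over classes of (fraction of insns) x (cycles for that class)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * CYCLES[kind] for kind, fraction in mix.items())


if __name__ == "__main__":
    mix = {"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45}
    print(average_cpi(mix))
```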

SLIDE 70

Multi-cycle Datapath Performance

Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50 ns/insn
Multi-cycle
Clock period = 10ns
CPI = (0.2*3+0.2*5+0.6*4) = 4
Performance = 40 ns/insn
But wait…

SLIDE 71

Multi-Cycle Datapath Performance

Did not just cut up existing logic into 5 pieces
Also added logic (flip flops)
So clock period not 1/5 of single cycle, but slightly longer

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

SLIDE 72

Multi-cycle Datapath Performance

Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50 ns/insn
Multi-cycle
Clock period = 12ns
CPI = (0.2*3+0.2*5+0.6*4) = 4
Performance = 48 ns/insn
Better, but not as exciting…
Can we do better still?
Have our cake (low CPI) and eat it too (high clock frequency)?

SLIDE 73

Clock Period and CPI

Single-cycle datapath

+ Low CPI: 1
– Long clock period: to accommodate slowest insn

Multi-cycle datapath

+ Short clock period
– High CPI

Can we have both low CPI and short clock period?

– No good way to make a single insn go faster
+ Insn latency doesn’t matter anyway … insn throughput matters

Key: exploit inter-insn parallelism

[Figure: timelines of insn0 and insn1 through fetch, dec, exec]

SLIDE 74

Pipelining

Pipelining: important performance technique
Improves insn throughput rather than insn latency
Exploits parallelism at insn-stage level to do so
Begin with multi-cycle design
When insn advances from stage 1 to 2, next insn enters stage 1
Individual insns take same number of stages

+ But insns enter and leave at a much faster rate

Physically breaks “atomic” VN loop ... but must maintain illusion
Automotive assembly line analogy

[Figure: timelines of insn0 and insn1 through fetch, dec, exec]

SLIDE 75

5 Stage Multi-Cycle Datapath

[Figure: multi-cycle datapath with intermediate registers IR, A, B, O, D]

SLIDE 76

5 Stage Pipelined Datapath

Temporary values (PC,IR,A,B,O,D) re-latched every stage
Why? 5 insns may be in pipeline at once, they share a single PC?
Notice, PC not latched after ALU stage (why not?)

[Figure: pipelined datapath; values latched at each stage boundary: PC; PC, IR; PC, A, B, IR; O, B, IR; O, D, IR]

SLIDE 77

Pipeline Terminology

Stages: Fetch, Decode, eXecute, Memory, Writeback
Latches (pipeline registers): PC, F/D, D/X, X/M, M/W

[Figure: pipelined datapath with latches PC, F/D (PC, IR), D/X (PC, A, B, IR), X/M (O, B, IR), M/W (O, D, IR)]

SLIDE 78

Some More Terminology

Scalar pipeline: one insn per stage per cycle
Alternative: “superscalar” (next unit)
In-order pipeline: insns enter execute stage in VN order
Alternative: “out-of-order” (not covered in CSE 371)
Pipeline depth: number of pipeline stages
Nothing magical about five
Trend has been to deeper pipelines

SLIDE 79

Pipeline Example: Cycle 1

3 instructions

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); add $3,$2,$1 in Fetch]

SLIDE 80

Pipeline Example: Cycle 2

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); lw $4,0($5) in Fetch, add $3,$2,$1 in Decode]

SLIDE 81

Pipeline Example: Cycle 3

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); sw $6,4($7) in Fetch, lw $4,0($5) in Decode, add $3,$2,$1 in eXecute]

SLIDE 82

Pipeline Example: Cycle 4

3 instructions

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); sw $6,4($7) in Decode, lw $4,0($5) in eXecute, add $3,$2,$1 in Memory]

SLIDE 83

Pipeline Example: Cycle 5

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); sw $6,4($7) in eXecute, lw $4,0($5) in Memory, add in Writeback]

SLIDE 84

Pipeline Example: Cycle 6

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); sw $6,4($7) in Memory, lw in Writeback]

SLIDE 85

Pipeline Example: Cycle 7

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W); sw in Writeback]

SLIDE 86

Pipeline Diagram

Pipeline diagram: shorthand for what we just saw
Across: cycles
Down: insns
Convention: X means lw $4,0($5) finishes execute stage and writes into X/M latch at end of cycle 4

               1  2  3  4  5  6  7  8  9
add $3,$2,$1   F  D  X  M  W
lw $4,0($5)       F  D  X  M  W
sw $6,4($7)          F  D  X  M  W
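
A short Python sketch that generates this kind of pipeline diagram. It assumes an ideal 5-stage pipeline with no stalls, so instruction i enters Fetch in cycle i + 1; the function names are hypothetical.

```python
STAGES = "FDXMW"  # Fetch, Decode, eXecute, Memory, Writeback


def stage_in_cycle(insn_index: int, cycle: int):
    """Stage occupied by the insn_index-th instruction (0-based) in a 1-based cycle, or None."""
    position = cycle - 1 - insn_index  # assumption: no stalls, one insn fetched per cycle
    return STAGES[position] if 0 <= position < len(STAGES) else None


def render(insns) -> str:
    """Render the diagram: cycles across, instructions down."""
    last_cycle = len(insns) + len(STAGES) - 1
    width = max(len(text) for text in insns)
    header = " " * width + " " + " ".join(str(c) for c in range(1, last_cycle + 1))
    lines = [header]
    for i, text in enumerate(insns):
        cells = [stage_in_cycle(i, c) or " " for c in range(1, last_cycle + 1)]
        lines.append(text.ljust(width) + " " + " ".join(cells).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(["add $3,$2,$1", "lw $4,0($5)", "sw $6,4($7)"]))
```

For the three-instruction example, `stage_in_cycle(1, 4)` returns "X": the lw is in its execute stage in cycle 4, which is the convention stated above.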

SLIDE 87

What About Pipelined Control?

Should it be like single-cycle control?
But individual insn signals must be staged
Should it be like multi-cycle control?
But all stages are simultaneously active
How many different controllers are we going to need?
One for each insn in pipeline?
Solution: use simple single-cycle control, but pipeline it
Single controller

SLIDE 88

Pipelined Control

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W) with a single controller (CTRL); control signal groups xC, mC, wC are latched into D/X, then mC, wC into X/M, then wC into M/W]

SLIDE 89

Pipeline Performance Calculation

Single-cycle
Clock period = 50ns, CPI = 1
Performance = 50ns/insn
Multi-cycle
Branch: 20% (3 cycles), load: 20% (5 cycles), other: 60% (4 cycles)

Clock period = 12ns, CPI = (0.2*3+0.2*5+0.6*4) = 4
Remember: latching overhead makes it 12, not 10
Performance = 48ns/insn
Pipelined
Clock period = 12ns
CPI = 1.5 (on average insn completes every 1.5 cycles)
Performance = 18ns/insn
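
The three designs can be compared with a small Python calculation. The clock periods, instruction mix, and pipelined CPI of 1.5 are the slide's numbers; the function name `ns_per_insn` is hypothetical.

```python
def ns_per_insn(clock_period_ns: float, cpi: float) -> float:
    """Average time per instruction = clock period x cycles per instruction."""
    return clock_period_ns * cpi


# Multi-cycle CPI: 20% branches (3 cycles), 20% loads (5 cycles), 60% other (4 cycles).
MULTI_CYCLE_CPI = 0.2 * 3 + 0.2 * 5 + 0.6 * 4

DESIGNS = {
    "single-cycle": ns_per_insn(50, 1.0),
    "multi-cycle": ns_per_insn(12, MULTI_CYCLE_CPI),
    "pipelined": ns_per_insn(12, 1.5),
}

if __name__ == "__main__":
    for name, latency in DESIGNS.items():
        speedup = DESIGNS["single-cycle"] / latency
        print(f"{name}: {latency:.1f} ns/insn, {speedup:.2f}x vs single-cycle")
```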

SLIDE 90

Q1: Why Is Pipeline Clock Period …

… > delay thru datapath / number of pipeline stages?
Latches (FFs) add delay
Pipeline stages have different delays, clock period is max delay
Both factors have implications for ideal number of pipeline stages

SLIDE 91

Q2: Why Is Pipeline CPI…

… > 1?
CPI for scalar in-order pipeline is 1 + stall penalties
Stalls used to resolve hazards
Hazard: condition that jeopardizes VN illusion
Stall: artificial pipeline delay introduced to restore VN illusion
Calculating pipeline CPI
Frequency of stall * stall cycles
Penalties add (stalls generally don’t overlap in in-order pipelines)
1 + stall-freq1*stall-cyc1 + stall-freq2*stall-cyc2 + …
Correctness/performance/MCCF
Long penalties OK if they happen rarely, e.g., 1 + 0.01 * 10 = 1.1
Stalls also have implications for ideal number of pipeline stages
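
A minimal Python sketch of the pipeline CPI formula above: 1 plus the sum of stall frequency times stall cycles, assuming stalls do not overlap. The function name `pipeline_cpi` is hypothetical.

```python
def pipeline_cpi(stalls) -> float:
    """CPI = 1 + sum(stall_frequency * stall_cycles), assuming stalls do not overlap."""
    return 1.0 + sum(frequency * cycles for frequency, cycles in stalls)


if __name__ == "__main__":
    # The slide's example: a 10-cycle penalty that hits 1% of instructions.
    print(pipeline_cpi([(0.01, 10)]))
    # No stalls at all: the ideal scalar in-order CPI.
    print(pipeline_cpi([]))
```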

SLIDE 92

Dependences and Hazards

Dependence: relationship between two insns
Data: two insns use same storage location
Control: one insn affects whether another executes at all
Not a bad thing, programs would be boring without them
Enforced by making older insn go before younger one
Happens naturally in single-/multi-cycle designs
But not in a pipeline
Hazard: dependence & possibility of wrong insn order
Effects of wrong insn order cannot be externally visible
Stall: enforce order by keeping younger insn in same stage
Hazards are a bad thing: stalls reduce performance

SLIDE 93

Why Does Every Insn Take 5 Cycles?

Could/should we allow add to skip M and go to W? No
– It wouldn’t help: peak fetch still only 1 insn per cycle
– Structural hazards: imagine add follows lw

[Figure: pipelined datapath (latches PC, F/D, D/X, X/M, M/W) with lw $4,0($5) followed by add $3,$2,$1]

SLIDE 94

Structural Hazards

Structural hazards
Two insns trying to use same circuit at same time
E.g., structural hazard on regfile write port
To fix structural hazards: proper ISA/pipeline design
Each insn uses every structure exactly once
For at most one cycle
Always at same stage relative to F

SLIDE 95

Data Hazards

Let’s forget about branches and the control for a while
The three insn sequence we saw earlier executed fine…
But it wasn’t a real program
Real programs have data dependences
They pass values via registers and memory

[Figure: Decode through Writeback of the pipeline (Register File, SX, ALU, Data Mem; latches F/D, D/X, X/M, M/W) holding add $3,$2,$1, lw $4,0($5), sw $6,0($7)]

SLIDE 96

Data Hazards

Would this “program” execute correctly on this pipeline?
Which insns would execute with correct inputs?
add is writing its result into $3 in current cycle

– lw read $3 2 cycles ago → got wrong value
– addi read $3 1 cycle ago → got wrong value

sw is reading $3 this cycle → OK (regfile timing: write first half)

add $3,$2,$1
lw $4,0($3)
addi $6,$3,1
sw $3,0($7)

[Figure: Decode through Writeback of the pipeline (Register File, SX, ALU, Data Mem; latches F/D, D/X, X/M, M/W)]

SLIDE 97

Memory Data Hazards

What about data hazards through memory? No
lw following sw to same address in next cycle, gets right value
Why? DMem read/write take place in same stage
Data hazards through registers? Yes (previous slide)
Occur because register write is 3 stages after register read
Can only read a register value 3 cycles after writing it

sw $5,0($1) lw $4,0($1)

[Figure: Decode through Writeback of the pipeline (Register File, SX, ALU, Data Mem; latches F/D, D/X, X/M, M/W)]

SLIDE 98

Fixing Register Data Hazards

Can only read register value 3 cycles after writing it
One way to enforce this: make sure programs don’t do it
Compiler puts two independent insns between write/read insn pair
If they aren’t there already
Independent means: “do not interfere with register in question”
Do not write it: otherwise meaning of program changes
Do not read it: otherwise create new data hazard
Code scheduling: compiler moves around existing insns to do this
If none can be found, must use nops
This is called software interlocks
MIPS: Microprocessor w/out Interlocking Pipeline Stages

SLIDE 99

Software Interlock Example

add $3,$2,$1
lw $4,0($3)
sw $7,0($3)
add $6,$2,$8
addi $3,$5,4

Can any of last three insns be scheduled between first two?
sw $7,0($3)? No, creates hazard with add $3,$2,$1
add $6,$2,$8? OK
addi $3,$5,4? No, lw would read $3 from it
Still need one more insn, use nop

add $3,$2,$1
add $6,$2,$8
nop
lw $4,0($3)
sw $7,0($3)
addi $3,$5,4
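
A minimal Python sketch of the nop-insertion half of software interlocks: it pads a sequence so that no instruction reads a register written by either of the two instructions just before it. It does not reorder instructions (that scheduling step is done by hand above). The `Insn` record and `insert_nops` are hypothetical names, and register operands are entered by hand rather than parsed.

```python
from typing import NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    text: str
    dest: Optional[int]    # register written, if any
    srcs: Tuple[int, ...]  # registers read


NOP = Insn("nop", None, ())


def insert_nops(program, gap: int = 2):
    """Pad with nops so at least `gap` insns separate a register write from a read of it."""
    out = []
    for insn in program:
        needed = 0
        for back in range(1, min(gap, len(out)) + 1):
            prev = out[-back]
            if prev.dest is not None and prev.dest in insn.srcs:
                needed = max(needed, gap - back + 1)
        out.extend([NOP] * needed)
        out.append(insn)
    return out


ADD3 = Insn("add $3,$2,$1", 3, (2, 1))
LW = Insn("lw $4,0($3)", 4, (3,))
SW = Insn("sw $7,0($3)", None, (7, 3))
ADD6 = Insn("add $6,$2,$8", 6, (2, 8))
ADDI = Insn("addi $3,$5,4", 3, (5,))

if __name__ == "__main__":
    # Scheduled order from the slide: one nop is still required.
    for insn in insert_nops([ADD3, ADD6, LW, SW, ADDI]):
        print(insn.text)
```

Run on the original order (add, lw, sw, add, addi), the same function inserts two nops, which shows what the compiler's code scheduling saves.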

SLIDE 100

Software Interlock Performance

Same deal
Branch: 20%, load: 20%, store: 10%, other: 50%
Software interlocks
20% of insns require insertion of 1 nop
5% of insns require insertion of 2 nops
CPI is still 1 technically
But now there are more insns
#insns = 1 + 0.20*1 + 0.05*2 = 1.3

– 30% more insns (30% slowdown) due to data hazards

SLIDE 101

Hardware Interlocks

Problem with software interlocks? Not compatible
Where does 3 in “read register 3 cycles after writing” come from?
From structure (depth) of pipeline
What if next MIPS version uses a 7 stage pipeline?
Programs compiled assuming 5 stage pipeline will break
A better (more compatible) way: hardware interlocks
Processor detects data hazards and fixes them
Two aspects to this
Detecting hazards
Fixing hazards

SLIDE 102

Detecting Data Hazards

Compare F/D insn input register names with output register names of older insns in pipeline

Hazard = (F/D.IR.RS1 == D/X.IR.RD) ||
         (F/D.IR.RS2 == D/X.IR.RD) ||
         (F/D.IR.RS1 == X/M.IR.RD) ||
         (F/D.IR.RS2 == X/M.IR.RD)

[Figure: Decode through Writeback of the pipeline (latches F/D, D/X, X/M, M/W) with a hazard signal computed from the IR fields]
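
The hazard condition can be sketched in Python as a comparison of register names held in the pipeline latches. The `IR` record and the use of `None` for "no register" (so that nops and unused operands never match) are assumptions of this sketch; the slide's expression compares the raw fields.

```python
from typing import NamedTuple, Optional


class IR(NamedTuple):
    rs1: Optional[int]  # first source register, if any
    rs2: Optional[int]  # second source register, if any
    rd: Optional[int]   # destination register, if any


NOP = IR(None, None, None)


def _matches(src: Optional[int], dst: Optional[int]) -> bool:
    # Assumption: a missing register never causes a hazard.
    return src is not None and dst is not None and src == dst


def hazard(fd: IR, dx: IR, xm: IR) -> bool:
    """True if the insn in F/D reads a register that an older insn in D/X or X/M will write."""
    return (_matches(fd.rs1, dx.rd) or _matches(fd.rs2, dx.rd)
            or _matches(fd.rs1, xm.rd) or _matches(fd.rs2, xm.rd))


if __name__ == "__main__":
    add = IR(rs1=2, rs2=1, rd=3)    # add $3,$2,$1
    lw = IR(rs1=3, rs2=None, rd=4)  # lw $4,0($3)
    print(hazard(lw, add, NOP))  # add in D/X: hazard
    print(hazard(lw, NOP, add))  # add in X/M: still a hazard
    print(hazard(lw, NOP, NOP))  # add has moved past X/M: no hazard
```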

SLIDE 103

Fixing Data Hazards

Prevent F/D insn from reading (advancing) this cycle
Write nop into D/X.IR (effectively, insert nop in hardware)
Also reset (clear) the datapath control signals
Disable F/D latch and PC write enables (why?)
Re-evaluate situation next cycle

[Figure: Decode through Writeback of the pipeline (latches F/D, D/X, X/M, M/W) with the hazard signal and a nop input to D/X]

SLIDE 104

Aside: Insert NOP/Reset Register

Earlier: registers support separate clock, write enable
Useful for writes into register file
Also useful for implementing stalls
Registers should also support synchronous reset (clear)
Useful for implementing stalls
Implement as additional hardwired 0 input to FF data mux
Resetting pipeline registers equivalent to inserting a NOP
If NOP is all zeros
If zero means “don’t write” for all write-enable control signals
Design ISA/control signals to make sure this is the case

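[Figure: flip-flop (FF, input D, output Q) whose data mux is controlled by the 2-bit signal [RST:WE], next to a flip-flop with WE only]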
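
A behavioral sketch of such a register in Python. The class name `Register` and the priority of reset over write enable are assumptions for illustration; the slide only says that reset is an extra hardwired-0 input to the data mux.

```python
class Register:
    """Clocked register with write enable (we) and synchronous reset (rst)."""

    def __init__(self, value=0):
        self.q = value

    def clock(self, d, we=1, rst=0):
        # Data mux in front of the flip-flops. Assumed priority:
        # reset first, then write enable, otherwise hold the old value.
        if rst:
            self.q = 0
        elif we:
            self.q = d
        return self.q


ir = Register()
ir.clock(0x8C640000)           # normal cycle: latch a new insn
ir.clock(0x12345678, we=0)     # stall: write disabled, old value is kept
print(hex(ir.q))
ir.clock(0x12345678, rst=1)    # reset: all zeros, i.e. a nop if nop is all zeros
print(hex(ir.q))
```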

SLIDE 105

Hardware Interlock Example: cycle 1

(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD)
= 1

[Figure: pipelined datapath with hazard logic and nop insertion; add $3,$2,$1 and lw $4,0($3) in flight]

SLIDE 106

Hardware Interlock Example: cycle 2

(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD)
= 1

[Figure: pipelined datapath with hazard logic and nop insertion; add $3,$2,$1 and lw $4,0($3) in flight]

SLIDE 107

Hardware Interlock Example: cycle 3

(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD)
= 0

[Figure: pipelined datapath with hazard logic and nop insertion; add $3,$2,$1 and lw $4,0($3) in flight]
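
The three example cycles can be reproduced with a small latch-level simulation in Python. This is a sketch under simplifying assumptions: only the F/D, D/X and X/M instruction fields are modeled, the `Insn` record and `run` function are hypothetical names, and unused register fields are `None`.

```python
from collections import namedtuple

Insn = namedtuple("Insn", "text rd rs1 rs2")
NOP = Insn("nop", None, None, None)


def hazard(fd, dx, xm):
    sources = [r for r in (fd.rs1, fd.rs2) if r is not None]
    dests = [i.rd for i in (dx, xm) if i.rd is not None]
    return any(s in dests for s in sources)


def run(program, cycles):
    """Return, for each cycle, (insn held in F/D, hazard signal)."""
    pc, fd, dx, xm = 0, NOP, NOP, NOP
    trace = []
    for _ in range(cycles):
        stall = hazard(fd, dx, xm)
        trace.append((fd.text, stall))
        xm = dx                      # older insns always advance
        if stall:
            dx = NOP                 # write nop into D/X; F/D and PC are held
        else:
            dx = fd
            fd = program[pc] if pc < len(program) else NOP
            pc += 1
    return trace


program = [
    Insn("add $3,$2,$1", 3, 2, 1),
    Insn("lw $4,0($3)", 4, 3, None),
    Insn("sw $6,4($7)", None, 7, 6),
]
for text, stall in run(program, 6):
    print(f"{text:14s} hazard={int(stall)}")
```

In this model lw $4,0($3) stays in F/D for three cycles, with the hazard signal at 1, 1, then 0, after which it advances.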

SLIDE 108

Pipeline Control Terminology

Hardware interlock maneuver is called stall or bubble
Mechanism is called stall logic
Part of more general pipeline control mechanism
Controls advancement of insns through pipeline
Distinguish from pipelined datapath control
Controls datapath at each stage
Pipeline control controls advancement of datapath control

SLIDE 109

Pipeline Diagram with Data Hazards

Data hazard stall indicated with d*
Stall propagates to younger insns

              1  2  3  4  5  6  7  8  9
add $3,$2,$1  F  D  X  M  W
lw $4,0($3)      F  d* d* D  X  M  W
sw $6,4($7)               F  D  X  M  W

This is not good (why?)

              1  2  3  4  5  6  7  8  9
add $3,$2,$1  F  D  X  M  W
lw $4,0($3)      F  d* d* D  X  M  W
sw $6,4($7)         F  D  X  M  W

SLIDE 110

Hardware Interlock Performance

Same deal
Branch: 20%, load: 20%, store: 10%, other: 50%
Hardware interlocks: same as software interlocks
20% of insns require 1 cycle stall (i.e., insertion of 1 nop)
5% of insns require 2 cycle stall (i.e., insertion of 2 nops)
CPI = 1 + 0.20*1 + 0.05*2 = 1.3
So, either CPI stays at 1 and #insns increases 30% (software)
Or, #insns stays at 1 (relative) and CPI increases 30% (hardware)
Same difference
Anyway, we can do better
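
The "same difference" point can be checked with a few lines of Python. This is a sketch assuming a fixed clock period, so that execution time is proportional to #insns * CPI; the function name `relative_time` is made up for illustration.

```python
def relative_time(insn_count, cpi):
    # At a fixed clock period, execution time is proportional to #insns * CPI.
    return insn_count * cpi


one_nop, two_nops = 0.20, 0.05
extra = one_nop * 1 + two_nops * 2      # extra slots per original insn

software = relative_time(insn_count=1 + extra, cpi=1.0)   # more insns, CPI = 1
hardware = relative_time(insn_count=1.0, cpi=1 + extra)   # same insns, higher CPI
print(round(software, 2), round(hardware, 2))
```

Both products come out the same: the stall slots cost one cycle each whether the compiler or the hardware inserts them.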

SLIDE 111

Observe

Technically, this situation is broken
lw $4,0($3) has already read $3 from regfile
add $3,$2,$1 hasn’t yet written $3 to regfile
But fundamentally, everything is OK
lw $4,0($3) hasn’t actually used $3 yet
add $3,$2,$1 has already computed $3

[Figure: pipelined datapath (Register File, Data Mem; latches F/D, D/X, X/M, M/W) with add $3,$2,$1 and lw $4,0($3) in flight]

SLIDE 112

Bypassing

Bypassing
Reading a value from an intermediate (µarchitectural) source
Not waiting until it is available from primary source
Here, we are bypassing the register file
Also called forwarding

[Figure: pipelined datapath with a bypass path around the register file; add $3,$2,$1 and lw $4,0($3) in flight]

SLIDE 113

WX Bypassing

What about this combination?
Add another bypass path and MUX input
First one was an MX bypass
This one is a WX bypass

[Figure: pipelined datapath with a second (WX) bypass path and MUX input; add $3,$2,$1 and lw $4,0($3) in flight]

SLIDE 114

ALUinB Bypassing

Can also bypass to ALU input B

[Figure: pipelined datapath with bypass paths to ALU input B; add $3,$2,$1 and add $4,$2,$3 in flight]

SLIDE 115

WM Bypassing?

Does WM bypassing make sense?
Not to the address input (why not?)
But to the store data input, yes

[Figure: pipelined datapath with a WM bypass path to the Data Mem store data input; lw $3,0($2) and sw $3,0($4) in flight]

SLIDE 116

Bypass Logic

Each MUX has its own, here it is for MUX ALUinA

(D/X.IR.RS1 == X/M.IR.RD) => 0
(D/X.IR.RS1 == M/W.IR.RD) => 1
Else => 2

[Figure: pipelined datapath with bypass logic driving the ALUinA MUX]
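
The ALUinA select logic above, written as a Python function. This is a sketch: the function and parameter names are hypothetical, and a fuller version would also check that the older insn actually writes a register, which the slide's equation leaves out.

```python
def alu_in_a_select(dx_rs1, xm_rd, mw_rd):
    """Return the ALUinA mux select:
    0 = bypass from X/M, 1 = bypass from M/W, 2 = value read from the register file."""
    if dx_rs1 == xm_rd:
        return 0    # checked first: X/M holds the most recent write to that register
    if dx_rs1 == mw_rd:
        return 1
    return 2


print(alu_in_a_select(dx_rs1=3, xm_rd=3, mw_rd=7))   # producer is one insn ahead
print(alu_in_a_select(dx_rs1=3, xm_rd=5, mw_rd=3))   # producer is two insns ahead
print(alu_in_a_select(dx_rs1=3, xm_rd=5, mw_rd=7))   # no match: use regfile value
```

Checking X/M before M/W matters when both older insns write the same register: the younger of the two holds the value the program expects.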

SLIDE 117

Bypass and Stall Logic

Two separate things
Stall logic controls pipeline registers
Bypass logic controls MUXs
But complementary
For a given data hazard: if can’t bypass, must stall
Slide #43 shows full bypassing: all bypasses possible
Is stall logic still necessary?

SLIDE 118

Yes, Load Output to ALU Input

Stall = (D/X.IR.OP == LOAD) &&
((F/D.IR.RS1 == D/X.IR.RD) ||
((F/D.IR.RS2 == D/X.IR.RD) && (F/D.IR.OP != STORE)))

[Figure: pipelined datapath with stall logic inserting a nop; lw $3,0($2) followed by add $4,$2,$3]
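
The load-use stall condition as a Python predicate. This is a sketch with hypothetical parameter names and string opcodes; like the slide's equation, it assumes RS2 of a store is the store-data register, which can be supplied later by WM bypassing.

```python
def load_use_stall(dx_op, dx_rd, fd_op, fd_rs1, fd_rs2):
    """True if the insn in F/D must wait one cycle for the load in D/X."""
    return dx_op == "LOAD" and (
        fd_rs1 == dx_rd
        or (fd_rs2 == dx_rd and fd_op != "STORE")
    )


# lw $3,0($2) followed by add $4,$2,$3: the add needs $3 at the ALU input.
print(load_use_stall("LOAD", 3, "ALU", fd_rs1=2, fd_rs2=3))

# lw $3,0($2) followed by sw $3,0($4): $3 is only the store data.
print(load_use_stall("LOAD", 3, "STORE", fd_rs1=4, fd_rs2=3))
```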

SLIDE 119

Pipeline Diagram With Bypassing

Use compiler scheduling to reduce load-use stall frequency
Like software interlocks, but for performance not correctness

              1  2  3  4  5  6  7  8  9
add $3,$2,$1  F  D  X  M  W
lw $4,0($3)      F  D  X  M  W
addi $6,$4,1        F  d* D  X  M  W

              1  2  3  4  5  6  7  8  9
add $3,$2,$1  F  D  X  M  W
lw $4,0($3)      F  D  X  M  W
sub $8,$3,$1        F  D  X  M  W
addi $6,$4,1           F  D  X  M  W

SLIDE 120

Control Hazards

Control hazards
Must fetch post branch insns before branch outcome is known
Default: assume “not-taken” (at fetch, can’t tell it’s a branch)

[Figure: pipelined datapath front end (PC, Insn Mem, +4 incrementer, Register File, sign extend, << 2 shifter; latches F/D, D/X, X/M)]

SLIDE 121

Branch Recovery

Branch recovery: what to do when branch is actually taken
Insns that will be written into F/D and D/X are wrong
Flush them, i.e., replace them with nops

+ They haven’t written permanent state yet (regfile, DMem)

[Figure: pipelined datapath front end (PC, Insn Mem, +4 incrementer, Register File, sign extend, << 2 shifter); nops are written into F/D and D/X on a taken branch]

SLIDE 122

Branch Recovery Pipeline Diagram

Convention: don’t fill in flushed insns
Taken branch penalty is 2 cycles

                    1  2  3  4  5  6  7  8  9
addi $3,$0,1        F  D  X  M  W
bnez $3,targ           F  D  X  M  W
sw $6,4($7)               F  D
targ: addi $8,$7,1           F
targ: addi $8,$7,1              F  D  X  M  W

                    1  2  3  4  5  6  7  8  9
addi $3,$0,1        F  D  X  M  W
bnez $3,targ           F  D  X  M  W
targ: addi $8,$7,1              F  D  X  M  W

SLIDE 123

Branch Performance

Back of the envelope calculation
Branch: 20%, load: 20%, store: 10%, other: 50%
75% of branches are taken
CPI = 1 + 0.20*0.75*2 = 1.3

– Branches cause 30% slowdown
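
The back-of-the-envelope calculation as a Python function, so the inputs can be varied. This is a sketch; the function and parameter names are made up for illustration, and it assumes every taken branch pays the full penalty and not-taken branches pay nothing.

```python
def cpi_with_branches(branch_frac, taken_frac, taken_penalty, base_cpi=1.0):
    # Each taken branch adds `taken_penalty` flushed cycles on top of base CPI.
    return base_cpi + branch_frac * taken_frac * taken_penalty


print(round(cpi_with_branches(branch_frac=0.20, taken_frac=0.75, taken_penalty=2), 2))
```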

How do we reduce this penalty?

SLIDE 124

Fast Branch

Fast branch: can decide at D, not X
Test must be comparison to zero or equality, no time for ALU

+ New taken branch penalty is 1
– Additional insns (slt) for more complex tests, must bypass to D too

25% of branches have complex tests that require extra insn
CPI = 1 + 0.20*0.75*1(branch) + 0.20*0.25*1(extra insn) = 1.2

[Figure: pipelined datapath with a comparator (<>) on the register file outputs so the branch is decided in the decode stage]

SLIDE 125

Speculative Execution

Speculation: “risky transactions on chance of profit”
Speculative execution
Execute before all parameters known with certainty
Correct speculation

+ Avoid stall, improve performance

Incorrect speculation (mis-speculation)

– Must abort/flush/squash incorrect insns
– Must undo incorrect changes (recover pre-speculation state)

The “game”: [%correct * gain] – [(1–%correct) * penalty]
Control speculation: speculation aimed at control hazards
Unknown parameter: are these the correct insns to execute next?

SLIDE 126

Control Speculation Mechanics

Guess branch target, start fetching at guessed position
Doing nothing is implicitly guessing target is PC+4
Can actively guess other targets: dynamic branch prediction
Execute branch to verify (check) guess
Correct speculation? keep going
Mis-speculation? Flush mis-speculated insns
Hopefully haven’t modified permanent state (Regfile, DMem)

+ Happens naturally in in-order 5-stage pipeline

“Game” for in-order 5 stage pipeline
%correct = ?
Gain = 2 cycles

+ Penalty = 0 cycles → mis-speculation no worse than stalling
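
The speculation "game" formula, [%correct * gain] – [(1–%correct) * penalty], evaluated in Python for the in-order 5-stage case. This is a sketch; `expected_gain` is a made-up name and the accuracy values in the loop are arbitrary sample points, not measurements.

```python
def expected_gain(p_correct, gain, penalty):
    # Average cycles saved per speculation, relative to stalling.
    return p_correct * gain - (1 - p_correct) * penalty


# In-order 5-stage pipeline from the slide: gain = 2 cycles, penalty = 0 cycles.
for p in (0.0, 0.25, 0.75, 1.0):
    print(p, expected_gain(p, gain=2, penalty=0))
```

With a penalty of 0 the result is never negative, which is the sense in which mis-speculation is no worse than stalling.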

SLIDE 127

Dynamic Branch Prediction

Dynamic branch prediction: guess outcome
Start fetching from guessed address
Flush on mis-prediction (notice new recovery circuit)

[Figure: pipelined datapath with a branch predictor (BP) at fetch, a predicted target (TG) carried in the F/D and D/X latches, a comparator (<>), and recovery logic that writes nops into F/D and D/X]

SLIDE 128

Branch Prediction: Short Summary

Key principle of micro-architecture:
Programs do the same thing over and over (why?)
Exploit for performance:
Learn what a program did before
Guess that it will do the same thing again
Details of branch prediction: later (~1 month)
For now, just know it can be done and is important to performance
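
One way to picture "learn what a program did before, guess it will do the same again" is a last-outcome predictor. The Python below is only an illustrative sketch: the class, the choice of a per-branch table keyed by PC, the not-taken default, and the example branch history are all assumptions, not the predictor design covered later in the course.

```python
class LastOutcomePredictor:
    def __init__(self):
        self.last = {}   # branch PC -> last observed outcome (True = taken)

    def predict(self, pc):
        return self.last.get(pc, False)   # never seen before: guess not-taken

    def update(self, pc, taken):
        self.last[pc] = taken


def accuracy(predictor, outcomes, pc=0x400):
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(outcomes)


# Hypothetical loop back-edge: taken 9 times, then falls through once.
loop_branch = [True] * 9 + [False]
print(accuracy(LastOutcomePredictor(), loop_branch))
```

The predictor is wrong only on the first iteration and on the loop exit, which is why repetitive behavior makes guessing worthwhile.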

SLIDE 129

Branch Prediction Performance

Dynamic branch prediction
Simple predictor: branches predicted with 75% accuracy
CPI = 1 + 0.20*0.25*2 = 1.1
More advanced predictor: 95% accuracy
CPI = 1 + 0.20*0.05*2 = 1.02
Branch mis-predictions still a big problem though
Pipelines are long: typical mis-prediction penalty is 10+ cycles
Pipelines have full bypassing: compiler schedules the rest
Pipelines are superscalar (later)
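
The CPI figures above, and the effect of a longer mis-prediction penalty, can be recomputed with a short Python sketch. The function name `cpi` is made up, and the 10-cycle case simply plugs the slide's "10+ cycles" lower bound into the same formula.

```python
def cpi(branch_frac, accuracy, penalty, base_cpi=1.0):
    # Only mis-predicted branches pay the flush penalty.
    return base_cpi + branch_frac * (1 - accuracy) * penalty


for accuracy in (0.75, 0.95):
    for penalty in (2, 10):
        print(f"accuracy={accuracy:.2f} penalty={penalty:2d} "
              f"CPI={cpi(0.20, accuracy, penalty):.2f}")
```

At a 10-cycle penalty, even the 95% predictor gives back the CPI that the 75% predictor had at a 2-cycle penalty.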

SLIDE 130

Pipelining And Exceptions

Pipelining makes exceptions nasty
5 insns in pipeline at once
Exception happens, how do you know which insn caused it?
Exceptions propagate along pipeline in latches
Two exceptions happen, how do you know which one to take first?
One belonging to oldest insn
When handling exception, have to flush younger insns
Piggy-back on branch mis-prediction machinery to do this
What about multi-cycle operations?
Just FYI

SLIDE 131

Pipeline Depth

No magic about 5 stages, trend had been to deeper pipelines
486: 5 stages (50+ gate delays / clock)
Pentium: 7 stages
Pentium II/III: 12 stages
Pentium 4: 22 stages (~10 gate delays / clock) “super-pipelining”
Core1/2: 14 stages
Increasing pipeline depth

+ Increases clock frequency (reduces period)
– But decreases IPC (increases CPI)

Branch mis-prediction penalty becomes longer
Non-bypassed data hazard stalls become longer
At some point, CPI losses offset clock gains, question is when?
1 GHz Pentium 4 was slower than 800 MHz Pentium III
What was the point? People buy frequency, not frequency * IPC
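
The frequency * IPC trade-off in a few lines of Python. This is a sketch: the IPC values are hypothetical, chosen only to show how a higher-clocked design can lose; they are not measurements of any real processor.

```python
def relative_performance(freq_mhz, ipc):
    # Millions of insns per second = clock frequency (MHz) * IPC.
    return freq_mhz * ipc


# Hypothetical IPC values for illustration only.
deep = relative_performance(freq_mhz=1000, ipc=0.7)      # deeper pipeline, lower IPC
shallow = relative_performance(freq_mhz=800, ipc=1.0)    # shallower pipeline, higher IPC
print(deep, shallow)

# Break-even: the 1000 MHz design must keep at least this fraction of the
# 800 MHz design's IPC to be no slower.
print(800 / 1000)
```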