CS104 Computer Organization and Design: Datapaths (Hilton)



slide-1
SLIDE 1

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 1

CS104 Computer Organization and Design

Datapaths

slide-2
SLIDE 2

Admin

  • Homework
  • Homework 4 out tonight
  • Due Monday March 26th
  • Download/check your submissions
  • Reading:
  • Chapter 4
  • (Maybe review 1.4)
  • Midterm 2
  • March 28

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 2

slide-3
SLIDE 3

What did we do last week?

  • Who can remind us what we did last week?

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 3

slide-4
SLIDE 4

What did we do last week?

  • Who can remind us what we did last week?
  • Ski
  • Go to the beach
  • Sleep in
  • Read a book
  • Ok, but seriously?

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 4

slide-5
SLIDE 5

When last I saw you all..

  • Last time I was here (Feb 27/29)
  • Learned basics of logic design
  • Gates (And, Or, Nor, …)
  • Put gates together to make
  • Muxes
  • Adders
  • Latches
  • Flip-flops

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 5

slide-6
SLIDE 6

While I was at HPCA..

  • Prof. Lebeck started teaching you all about datapaths
  • Putting logic together to execute instructions
  • Started on single-cycle datapath
  • We’ll review/continue with single cycle
  • Then jump into more things!

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 6

slide-7
SLIDE 7

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 7

Datapath for MIPS ISA

  • Consider only the following instructions

add $1,$2,$3
addi $1,$3,2
lw $1,4($3)
sw $1,4($3)
beq $1,$2,PC_relative_target
j absolute_target

  • Why only these?
  • Most other instructions are the same from datapath viewpoint
  • The ones that aren’t are left for you to figure out
slide-8
SLIDE 8

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 8

Start With Fetch

  • PC and instruction memory
  • A +4 incrementer computes default next instruction PC

[Figure: PC feeding Insn Mem, with a +4 incrementer]

slide-9
SLIDE 9

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 9

First Instruction: add

  • Add register file and ALU

[Figure: datapath with PC, Insn Mem, Register File (ports s1, s2, d), and +4 incrementer]

slide-10
SLIDE 10

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 10

Second Instruction: addi

  • Destination register can now be either Rd or Rt
  • Add sign extension unit and mux into second ALU input

[Figure: datapath with PC, Insn Mem, Register File (ports s1, s2, d), sign extender (SX), and +4 incrementer]

slide-11
SLIDE 11

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 11

Third Instruction: lw

  • Add data memory, address is ALU output
  • Add register write data mux to select memory output or ALU output

[Figure: datapath with PC, Insn Mem, Register File (ports s1, s2, d), sign extender (SX), Data Mem (ports a, d), and +4 incrementer]

slide-12
SLIDE 12

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 12

Fourth Instruction: sw

  • Add path from second input register to data memory data input

[Figure: datapath with PC, Insn Mem, Register File (ports s1, s2, d), sign extender (SX), Data Mem (ports a, d), and +4 incrementer]

slide-13
SLIDE 13

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 13

Fifth Instruction: beq

  • Add left shift unit and adder to compute PC-relative branch target
  • Add PC input mux to select PC+4 or branch target

[Figure: datapath with PC, Insn Mem, Register File (ports s1, s2, d), sign extender (SX), Data Mem (ports a, d), +4 incrementer, << 2 shifter, and ALU z output]

slide-14
SLIDE 14

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 14

Sixth Instruction: j

  • Add shifter to compute left shift of 26-bit immediate
  • Add additional PC input mux for jump target

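[Figure: datapath with PC, Insn Mem, Register File (ports s1, s2, d), sign extender (SX), Data Mem (ports a, d), +4 incrementer, and two << 2 shifters (branch and jump targets)]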
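The three next-PC choices built up so far (PC+4, the beq target, and the j target) can be sketched in software. This is a minimal Python sketch under stated assumptions, not a reference model for the course: the function names are made up for illustration, and taking the upper four bits of the jump target from PC+4 is the usual MIPS convention rather than something these slides state.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value (the SX unit)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, branch_taken: bool = False, imm16: int = 0,
            jump: bool = False, target26: int = 0) -> int:
    """Hypothetical model of the PC input muxes for the 6-insn datapath."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF  # the "+4" incrementer
    if jump:
        # j: 26-bit immediate shifted left by 2; upper 4 bits from PC+4 (assumed)
        return (pc_plus_4 & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)
    if branch_taken:
        # beq: sign-extended immediate shifted left by 2, added to PC+4
        return (pc_plus_4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return pc_plus_4  # default: fall through to the next instruction
```

For example, `next_pc(0x1000, branch_taken=True, imm16=3)` adds 3 words to PC+4, and a negative immediate such as `0xFFFF` branches backward.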

slide-15
SLIDE 15

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 15

“Continuous Read” Datapath Timing

  • Works because writes (PC, RegFile, DMem) are independent
  • And because no read logically follows any write

[Figure: single-cycle datapath annotated with the order of events within a cycle: Read IMem, Read Registers, Read DMEM, Write DMEM, Write Registers, Write PC]

slide-16
SLIDE 16

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 16

What Is Control?

  • 8 signals control flow of data through this datapath
  • MUX selectors, or register/memory write enable signals
  • A real datapath has 300-500 control signals

[Figure: single-cycle datapath annotated with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]

slide-17
SLIDE 17

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 17

Example: Control for add

[Figure: single-cycle datapath with control signals set for add]

BR=0 JP=0 Rwd=0 DMwe=0 ALUop=0 ALUinB=0 Rdst=1 Rwe=1

slide-18
SLIDE 18

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 18

Example: Control for sw

  • Difference between sw and add is 5 signals
  • 3 if you don’t count the X (don’t care) signals

[Figure: single-cycle datapath with control signals set for sw]

Rwe=0 ALUinB=1 DMwe=1 JP=0 ALUop=0 BR=0 Rwd=X Rdst=X

slide-19
SLIDE 19

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 19

Example: Control for beq

  • Difference between sw and beq is only 4 signals

[Figure: single-cycle datapath with control signals set for beq]

Rwe=0 ALUinB=0 DMwe=0 JP=0 ALUop=1 BR=1 Rwd=X Rdst=X
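To check the signal counts quoted on the sw and beq slides, the sketch below records the control values shown for add, sw, and beq and counts where they differ. The dictionary layout and the `differing` helper are illustrative names only, and "X" stands for a don't-care.

```python
# Control values as listed on the add, sw, and beq slides ("X" = don't care).
CONTROL = {
    "add": dict(BR=0, JP=0, Rwd=0, DMwe=0, ALUop=0, ALUinB=0, Rdst=1, Rwe=1),
    "sw": dict(BR=0, JP=0, Rwd="X", DMwe=1, ALUop=0, ALUinB=1, Rdst="X", Rwe=0),
    "beq": dict(BR=1, JP=0, Rwd="X", DMwe=0, ALUop=1, ALUinB=0, Rdst="X", Rwe=0),
}


def differing(a: str, b: str, count_dont_care: bool = True) -> list:
    """Return the sorted names of signals whose values differ between insns a and b."""
    diffs = []
    for sig in CONTROL[a]:
        va, vb = CONTROL[a][sig], CONTROL[b][sig]
        if va == vb:
            continue
        if not count_dont_care and "X" in (va, vb):
            continue  # a don't-care can be set to match the other insn
        diffs.append(sig)
    return sorted(diffs)


print(len(differing("sw", "add")))                         # counting X signals
print(len(differing("sw", "add", count_dont_care=False)))  # ignoring X signals
print(len(differing("sw", "beq")))
```

The three counts correspond to the "5 signals", "3 if you don't count the X signals", and "only 4 signals" statements on the slides.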

slide-20
SLIDE 20

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 20

You all figure LW

  • How would these control signals be set for LW?

[Figure: single-cycle datapath with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst left blank]

slide-21
SLIDE 21

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 21

Example: Control for LW

[Figure: single-cycle datapath with control signals set for lw]

BR=0 JP=0 Rwd=1 DMwe=0 ALUop=0 ALUinB=1 Rdst=1 Rwe=1

slide-22
SLIDE 22

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 22

How Is Control Implemented?

[Figure: single-cycle datapath with control signals Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst driven by an unspecified "Control?" block]

slide-23
SLIDE 23

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 23

Implementing Control

  • Each insn has a unique set of control signals
  • Most are function of opcode
  • Some may be encoded in the instruction itself
  • E.g., the ALUop signal is some portion of the MIPS Func field

+ Simplifies controller implementation

  • Requires careful ISA design
slide-24
SLIDE 24

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 24

Control Implementation: ROM

  • ROM (read only memory): think rows of bits
  • Bits in data words are control signals
  • Lines indexed by opcode
  • Example: ROM control for 6-insn MIPS datapath
  • X is “don’t care”

BR JP ALUinB ALUop DMwe Rwe Rdst Rwd add 1 addi 1 1 1 lw 1 1 1 1 sw 1 1 X X beq 1 1 X X j 1 X X

slide-25
SLIDE 25

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 25

Control Implementation: Random Logic

  • Real machines have 100+ insns and 300+ control signals
  • 30,000+ control bits (~4KB)

– Not huge, but hard to make faster than datapath (important!)

  • Alternative: random logic (random = ‘non-repeating’)
  • Exploits the observation: many signals have few 1s or few 0s
  • Example: random logic control for 6-insn MIPS datapath

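[Figure: the opcode is decoded into one line per insn (add, addi, lw, sw, beq, j); simple gates combine those lines to produce BR, JP, DMwe, Rwd, Rdst, ALUop, Rwe, and ALUinB]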
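The random-logic idea can be sketched as a one-hot decode followed by an OR per signal. The values for add, lw, sw, and beq follow the example slides; the addi and j rows are assumptions (addi treated like add but with the immediate selected, j asserting only JP), don't-cares are driven to 0, and Rdst is left out because its encoding is not pinned down here. Function names are illustrative.

```python
OPCODES = ("add", "addi", "lw", "sw", "beq", "j")


def decode(opcode: str) -> dict:
    """One-hot decode: exactly one insn line is 1."""
    if opcode not in OPCODES:
        raise ValueError(f"unsupported opcode: {opcode}")
    return {name: int(name == opcode) for name in OPCODES}


def control(opcode: str) -> dict:
    """Each signal is an OR of the few insn lines that assert it."""
    d = decode(opcode)
    return {
        "BR": d["beq"],
        "JP": d["j"],                              # assumed for j
        "DMwe": d["sw"],
        "Rwd": d["lw"],
        "ALUop": d["beq"],
        "ALUinB": d["addi"] | d["lw"] | d["sw"],   # addi term is assumed
        "Rwe": d["add"] | d["addi"] | d["lw"],     # addi term is assumed
    }
```

Because most signals are 1 for only one or two insns, each OR has very few inputs, which is the observation the slide says random logic exploits.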

slide-26
SLIDE 26

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 26

Datapath and Control Timing

[Figure: single-cycle datapath with Control ROM/random logic, annotated with the order of events within a cycle: Read IMem, Read Registers (Read Control ROM), Read DMEM, Write DMEM, Write Registers, Write PC]

slide-27
SLIDE 27

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 27

Single-Cycle Datapath Performance

  • Goes against make common case fast (MCCF) principle

+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn

[Figure: single-cycle datapath with Control ROM/random logic]

slide-28
SLIDE 28

Interlude: Performance

  • Previous slide alludes to something new: Performance
  • Don’t just want it to work…
  • But want it to go fast!
  • Three components to performance:

Number of instructions x Cycles per instruction (CPI) x Clock period (1 / clock frequency)

$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}}$$

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 28

slide-29
SLIDE 29

Interlude: Performance

  • Three components to performance:

Number of instructions (compiler’s job) x Cycles per instruction (CPI) x Clock period (1 / clock frequency)

$$\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} = \frac{\text{Seconds}}{\text{Program}}$$

  • Insns/Program: determined by compiler + ISA
  • Generally assume a fixed program when doing micro-architecture

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 29

slide-30
SLIDE 30

Micro-architectural factors

  • Micro-architecture:
  • The details of how the ISA is implemented
  • Affects CPI and Clock frequency
  • Often will look at fixed program, and consider MIPS
  • Million Instructions Per Second
  • MIPS = IPC * Frequency (in MHz)
  • IPC = Instructions Per Cycle (1 / CPI)
  • Gives “Bigger is better” number

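$$\underbrace{\frac{\text{Instructions}}{\text{Cycle}}}_{\text{IPC}} \times \underbrace{\frac{\text{Cycles}}{\text{Second}}}_{\text{Frequency}} = \underbrace{\frac{\text{Instructions}}{\text{Second}}}_{\text{Throughput}}$$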
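The two performance formulas can be turned into a pair of one-line helpers. This is only a sketch: the function names and the sample numbers in the usage lines are hypothetical, not figures from the lecture.

```python
def seconds_per_program(insns: int, cpi: float, clock_period_s: float) -> float:
    """Execution time = instruction count x cycles per instruction x seconds per cycle."""
    return insns * cpi * clock_period_s


def mips(ipc: float, freq_mhz: float) -> float:
    """Millions of instructions per second = IPC x clock frequency in MHz."""
    return ipc * freq_mhz


# Hypothetical machine: 1,000,000 insns, CPI of 2, 10 ns clock (100 MHz).
print(seconds_per_program(1_000_000, 2.0, 10e-9))
print(mips(ipc=1 / 2.0, freq_mhz=100.0))
```

Note that the two views agree: halving CPI (doubling IPC) halves the execution time and doubles MIPS for the same program and clock.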

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 30

slide-31
SLIDE 31

“Best” IPC

  • For now, best we can do: IPC = 1 (CPI = 1)
  • Do 1 instruction every cycle
  • Later:
  • Real processors can do multiple instructions at once!
  • Potentially: IPC > 1!
  • Best possible IPC depends on design

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 31

slide-32
SLIDE 32

Performance vs ….

  • 1990s: Performance at all cost
  • Actually more “clock frequency” at all cost…
  • Now: Care about other things
  • Energy (electric bill, battery life)
  • Power (cooling, also affects energy)
  • Area (chip cost)
  • Reliability (tolerance of transient faults: e.g., charged particle strikes)
  • Important metric these days: “Performance / Watt”
  • Throughput divided by power consumption
  • Why?

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 32

slide-33
SLIDE 33

Performance Modeling and Analysis

  • Speaking of performance
  • Making a processor takes time (years) and money (millions)
  • Want to know it will perform well before you finish
  • If it’s wrong, doing it all over is painful…
  • Performance can be simulated in software
  • Estimate what IPC will be
  • Guide design
  • This is my other job by the way…

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 33

slide-34
SLIDE 34

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 34

Single-Cycle Datapath Performance

  • Goes against make common case fast (MCCF) principle

+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn

[Figure: single-cycle datapath with Control ROM/random logic]

slide-35
SLIDE 35

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 35

Alternative: Multi-Cycle Datapath

  • Multi-cycle datapath: attacks high clock period
  • Cut datapath into multiple stages (5 here), isolate using FFs
  • FSM control “walks” insns thru stages (by staging control signals)

+ Insns can bypass stages and exit early

[Figure: multi-cycle datapath; flip-flops labeled IR, A, B, O, and D separate the stages]

slide-36
SLIDE 36

Finite State Machine (FSM)

  • FSM = States + Transitions
  • Next state: function of current state + inputs
  • Outputs: function of current state + inputs
  • Canonical Example: Combination Lock
  • Must enter 3 8 4 to unlock
  • P.S. Useful in software too

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 36

slide-37
SLIDE 37

Finite State Machines: Example

  • Combination Lock Example:
  • Need to enter 3 8 4 to unlock
  • Initial State: no valid piece of combo seen

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 37

Start

slide-38
SLIDE 38

Finite State Machines: Example

  • Combination Lock Example:
  • Need to enter 3 8 4 to unlock
  • Input of 3: transition to new state
  • Any other input: stay in same state

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 38

[State diagram: Start goes to state 1 on input 3 and stays in Start on 0-2,4-9]

slide-39
SLIDE 39

Finite State Machines: Example

  • Combination Lock Example:
  • Need to enter 3 8 4 to unlock
  • State 1:
  • Input = 8? Goto state 2

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 39

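[State diagram: Start goes to state 1 on input 3 and stays in Start on 0-2,4-9; state 1 goes to state 2 on 8, stays in state 1 on 3, and returns to Start on 0-2,4-7,9]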

slide-40
SLIDE 40

Finite State Machines: Example

  • Combination Lock Example:
  • Need to enter 3 8 4 to unlock
  • State 2:
  • Input = 4? Goto state 3

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 40

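[State diagram: Start goes to state 1 on input 3 and stays in Start on 0-2,4-9; state 1 goes to state 2 on 8, stays in state 1 on 3, and returns to Start on 0-2,4-7,9; state 2 goes to state 3 on 4, to state 1 on 3, and to Start on 0-2,5-9]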

slide-41
SLIDE 41

Finite State Machines: Example

  • Combination Lock Example:
  • Need to enter 3 8 4 to unlock
  • State 3:

Unlock!

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 41

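[State diagram: Start goes to state 1 on input 3 and stays in Start on 0-2,4-9; state 1 goes to state 2 on 8, stays in state 1 on 3, and returns to Start on 0-2,4-7,9; state 2 goes to state 3 (unlock) on 4, to state 1 on 3, and to Start on 0-2,5-9]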
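Since FSMs are useful in software too, the combination-lock diagram can be written as a small transition function. This is a sketch with illustrative names; the diagram does not show what happens after state 3, so returning to Start on the next input is an assumption made here.

```python
def next_state(state: int, digit: int) -> int:
    """Transition function for the 3-8-4 lock (state 0 is Start)."""
    if state == 3:
        return 0   # assumption: relock on the next input
    if digit == 3:
        return 1   # a 3 always (re)starts the combination
    if state == 1 and digit == 8:
        return 2
    if state == 2 and digit == 4:
        return 3
    return 0       # any other digit returns to Start


def unlocks(digits) -> bool:
    """Return True if the digit sequence ever reaches state 3 (Unlock)."""
    state = 0
    for digit in digits:
        state = next_state(state, digit)
        if state == 3:
            return True
    return False


print(unlocks([3, 8, 4]))        # the combination itself
print(unlocks([3, 8, 3, 8, 4]))  # a stray 3 restarts the combination
print(unlocks([3, 8, 5, 4]))     # a wrong digit returns to Start
```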

slide-42
SLIDE 42

FSM in Hardware

  • Flip-flop(s) to hold state(s)
  • Combinational logic to determine next state/output
  • (Assumes FF enable on input_valid)

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 42

slide-43
SLIDE 43

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 43

slide-44
SLIDE 44

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 44

slide-45
SLIDE 45

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 45

slide-46
SLIDE 46

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 46

slide-47
SLIDE 47

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 47

slide-48
SLIDE 48

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 48

slide-49
SLIDE 49

FSM Hardware Example

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 49

slide-50
SLIDE 50

FSM Implementation: ROM

  • Just saw: FSM implemented with sum-of-products
  • Remind us what that is?
  • Can also be implemented with a ROM

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 50

[Figure: a 2^(N+K)-entry ROM indexed by the N-bit state and the K-bit input; each entry holds the N-bit next state, which feeds back through an N-bit register, and the M-bit output]

slide-51
SLIDE 51

FSM ROM Implementation Example

  • Combination Lock (3 8 4) Example
  • 4-bit input
  • 2-bit state
  • 64-entry ROM (indexed with S1 S0 I3 I2 I1 I0)
  • Each entry needs 3 bits (S1 S0 U)
  • 2 for next state
  • 1 for unlock signal
  • Example entries in ROM
  • 0x00 = 000
  • 0x03 = 010
  • 0x18 = 100
  • 0x13 = 010
  • 0x3_ = 001
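The ROM contents for the lock can be generated rather than written by hand. The sketch below fills all 64 entries using the index layout (S1 S0 I3 I2 I1 I0) and entry layout (S1 S0 U) given above. Treating the unused input codes 10 to 15 like any other wrong digit is an assumption, and `build_rom` is an illustrative name.

```python
def build_rom() -> list:
    """Build the 64-entry ROM for the 3-8-4 lock; each entry is (next_state << 1) | unlock."""
    rom = [0] * 64
    for state in range(4):
        for digit in range(16):        # 4-bit input; codes 10-15 assumed to act as wrong digits
            index = (state << 4) | digit
            if state == 3:
                nxt, unlock = 0, 1     # state 3: assert unlock, return to Start
            elif digit == 3:
                nxt, unlock = 1, 0
            elif state == 1 and digit == 8:
                nxt, unlock = 2, 0
            elif state == 2 and digit == 4:
                nxt, unlock = 3, 0
            else:
                nxt, unlock = 0, 0
            rom[index] = (nxt << 1) | unlock
    return rom


ROM = build_rom()
for index in (0x00, 0x03, 0x18, 0x13, 0x30):
    print(f"0x{index:02x} = {ROM[index]:03b}")
```

The printed entries can be compared against the example entries listed on the slide.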

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 51

slide-52
SLIDE 52

Multi-cycle Datapath FSM

  • First state: Get a New Instruction
  • Output signals to fetch (e.g., read enable IMEM)
  • Next State: Always Decode

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 52

Next Insn Decode Insn

slide-53
SLIDE 53

Multi-cycle Datapath FSM

  • Second State: Decode
  • Output signals to decode instruction (RdEn RegFile)
  • Go to Next Insn if NOP
  • Otherwise Execute

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 53

Next Insn Decode Insn Execute Insn NOP

slide-54
SLIDE 54

Multi-cycle Datapath FSM

  • Execute State
  • Execute Insn (varies by insn type)
  • Next State: Also depends on insn type
  • Branches: Next Insn

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 54

Next Insn Decode Insn Execute Insn NOP Branch

slide-55
SLIDE 55

Multi-cycle Datapath FSM

  • Execute State
  • Execute Insn (varies by insn type)
  • Next State: Also depends on insn type
  • ALU op: write register

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 55

Next Insn Decode Insn Execute Insn NOP Branch Writeback ALU

slide-56
SLIDE 56

Multi-cycle Datapath FSM

  • Execute State
  • Execute Insn (varies by insn type)
  • Next State: Also depends on insn type
  • Load: Read Memory

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 56

Next Insn Decode Insn Execute Insn NOP Branch Writeback ALU Read DMEM Load

slide-57
SLIDE 57

Multi-cycle Datapath FSM

  • Execute State
  • Execute Insn (varies by insn type)
  • Next State: Also depends on insn type
  • Store: Write Memory

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 57

Next Insn Decode Insn Execute Insn NOP Branch Writeback ALU Read DMEM Load Write DMEM Store

slide-58
SLIDE 58

Multi-cycle Datapath FSM

  • Read DMEM State
  • Control signals enable DMEM Read
  • Next state is writeback

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 58

Next Insn Decode Insn Execute Insn NOP Branch Writeback ALU Read DMEM Load Write DMEM Store

slide-59
SLIDE 59

Multi-cycle Datapath FSM

  • Writeback state
  • Control signals enable regfile write
  • Next state: Next Insn

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 59

Next Insn Decode Insn Execute Insn NOP Branch Writeback ALU Read DMEM Load Write DMEM Store

slide-60
SLIDE 60

Multi-cycle Datapath FSM

  • Write DMEM state
  • Control signals enable memory write
  • Next state: Next Insn

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 60

Next Insn Decode Insn Execute Insn NOP Branch Writeback ALU Read DMEM Load Write DMEM Store
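The multi-cycle control FSM built up over the last several slides can be written as a transition function and walked to count cycles per instruction type. State names follow the diagram; the instruction "kinds" and function names are labels invented for this sketch.

```python
def next_state(state: str, kind: str) -> str:
    """Transition function for the multi-cycle control FSM; kind is the insn type."""
    if state == "Next Insn":
        return "Decode Insn"
    if state == "Decode Insn":
        return "Next Insn" if kind == "nop" else "Execute Insn"
    if state == "Execute Insn":
        return {"branch": "Next Insn", "alu": "Writeback",
                "load": "Read DMEM", "store": "Write DMEM"}[kind]
    if state == "Read DMEM":
        return "Writeback"
    if state in ("Writeback", "Write DMEM"):
        return "Next Insn"
    raise ValueError(f"unknown state: {state}")


def cycles(kind: str) -> int:
    """Count the states visited (one per cycle) before the FSM returns to Next Insn."""
    state, n = "Next Insn", 0
    while True:
        n += 1
        state = next_state(state, kind)
        if state == "Next Insn":
            return n


for kind in ("branch", "alu", "store", "load"):
    print(kind, cycles(kind))
```

Each instruction type takes as many cycles as the number of states on its path through the diagram.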

slide-64
SLIDE 64

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 64

Multi-Cycle Datapath Example: Add

  • Example: Add
  • Cycle 1: Read IMEM
  • Cycle 2: Decode + Read RF
  • Cycle 3: ALU
  • Cycle 4: Writeback + Increment PC

[Figure: multi-cycle datapath with latches IR, A, B, O, and D, highlighting the path used by add]

slide-65
SLIDE 65

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 65

Multi-Cycle Datapath Performance

  • Opposite performance split of single-cycle datapath

+ Short clock period
– High CPI

[Figure: multi-cycle datapath with latches IR, A, B, O, and D]

slide-69
SLIDE 69

Multi-cycle Data-path CPI

  • CPI depends on instructions
  • Branches / Jumps: 3 cycles
  • ALU: 4 cycles
  • Stores: 4 cycles
  • Loads: 5 cycles
  • Overall CPI is weighted average
  • Example:
  • 20% loads, 15% stores, 20% branches, 45% ALU

CPI= 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
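The weighted-average calculation is easy to reproduce. The sketch below uses the cycle counts and instruction mix from this example; `average_cpi` is an illustrative helper name.

```python
CYCLES = {"load": 5, "store": 4, "branch": 3, "alu": 4}


def average_cpi(mix: dict, cycles: dict = CYCLES) -> float:
    """Weighted average of per-type cycle counts; mix maps insn type to its fraction."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


mix = {"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45}
print(round(average_cpi(mix), 2))
```

Changing the mix shows how sensitive the overall CPI is to the fraction of 5-cycle loads.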

CS104 (Hilton) : Datapaths [Adapted from slides by A. Roth] 69

slide-70
SLIDE 70

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 70

Multi-cycle Datapath Performance

  • Single-cycle
  • Clock period = 50ns, CPI = 1
  • Performance = 50 ns/insn
  • Multi-cycle
  • Clock period = 10ns
  • CPI = (0.2*3+0.2*5+0.6*4) = 4
  • Performance = 40 ns/insn
  • But wait…
slide-71
SLIDE 71

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 71

Multi-Cycle Datapath Performance

  • Did not just cut up existing logic into 5 pieces
  • Also added logic (flip flops)
  • So clock period not 1/5 of single cycle, but slightly longer

[Figure: multi-cycle datapath with latches IR, A, B, O, and D]

slide-72
SLIDE 72

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 72

Multi-cycle Datapath Performance

  • Single-cycle
  • Clock period = 50ns, CPI = 1
  • Performance = 50 ns/insn
  • Multi-cycle
  • Clock period = 12ns
  • CPI = (0.2*3+0.2*5+0.6*4) = 4
  • Performance = 48 ns/insn
  • Better, but not as exciting…
  • Can we do better still?
  • Have our cake (low CPI) and eat it too (high clock frequency)?
slide-73
SLIDE 73

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 73

Clock Period and CPI

  • Single-cycle datapath

+ Low CPI: 1
– Long clock period: to accommodate slowest insn

  • Multi-cycle datapath

+ Short clock period
– High CPI

  • Can we have both low CPI and short clock period?

– No good way to make a single insn go faster
+ Insn latency doesn’t matter anyway … insn throughput matters

  • Key: exploit inter-insn parallelism

[Figure: single-cycle execution (insn0.fetch, dec, exec followed by insn1.fetch, dec, exec), multi-cycle execution (insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec), and overlapped execution in which insn1 starts before insn0 finishes]

slide-74
SLIDE 74

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 74

Pipelining

  • Pipelining: important performance technique
  • Improves insn throughput rather than insn latency
  • Exploits parallelism at insn-stage level to do so
  • Begin with multi-cycle design
  • When insn advances from stage 1 to 2, next insn enters stage 1
  • Individual insns take same number of stages

+ But insns enter and leave at a much faster rate

  • Physically breaks “atomic” VN loop ... but must maintain illusion
  • Automotive assembly line analogy

[Figure: multi-cycle execution (insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec) compared with pipelined execution, where insn1.fetch overlaps insn0.dec and insn1.dec overlaps insn0.exec]

slide-75
SLIDE 75

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 75

5 Stage Multi-Cycle Datapath

[Figure: 5-stage multi-cycle datapath with latches IR, A, B, O, and D]

slide-76
SLIDE 76

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 76

5 Stage Pipelined Datapath

  • Temporary values (PC,IR,A,B,O,D) re-latched every stage
  • Why? 5 insns may be in pipeline at once, they share a single PC?
  • Notice, PC not latched after ALU stage (why not?)

[Figure: 5-stage pipelined datapath. The pipeline latches hold PC and IR after fetch; PC, A, B, and IR after decode; O, B, and IR after execute; O, D, and IR after memory]

slide-77
SLIDE 77

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 77

Pipeline Terminology

  • Stages: Fetch, Decode, eXecute, Memory, Writeback
  • Latches (pipeline registers): PC, F/D, D/X, X/M, M/W

[Figure: 5-stage pipelined datapath with latches labeled PC, F/D (PC, IR), D/X (PC, A, B, IR), X/M (O, B, IR), and M/W (O, D, IR)]

slide-78
SLIDE 78

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 78

Some More Terminology

  • Scalar pipeline: one insn per stage per cycle
  • Alternative: “superscalar” (next unit)
  • In-order pipeline: insns enter execute stage in VN order
  • Alternative: “out-of-order” (not covered in CSE 371)
  • Pipeline depth: number of pipeline stages
  • Nothing magical about five
  • Trend has been to deeper pipelines
slide-79
SLIDE 79

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 79

Pipeline Example: Cycle 1

  • 3 instructions

[Figure: pipelined datapath in cycle 1: add $3,$2,$1 in F]

slide-80
SLIDE 80

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 80

Pipeline Example: Cycle 2

[Figure: pipelined datapath in cycle 2: lw $4,0($5) in F, add $3,$2,$1 in D]

slide-81
SLIDE 81

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 81

Pipeline Example: Cycle 3

[Figure: pipelined datapath in cycle 3: sw $6,4($7) in F, lw $4,0($5) in D, add $3,$2,$1 in X]

slide-82
SLIDE 82

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 82

Pipeline Example: Cycle 4

  • 3 instructions

[Figure: pipelined datapath in cycle 4: sw $6,4($7) in D, lw $4,0($5) in X, add $3,$2,$1 in M]

slide-83
SLIDE 83

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 83

Pipeline Example: Cycle 5

[Figure: pipelined datapath in cycle 5: sw $6,4($7) in X, lw $4,0($5) in M, add in W]

slide-84
SLIDE 84

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 84

Pipeline Example: Cycle 6

[Figure: pipelined datapath in cycle 6: sw $6,4($7) in M, lw in W]

slide-85
SLIDE 85

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 85

Pipeline Example: Cycle 7

[Figure: pipelined datapath in cycle 7: sw in W]

slide-86
SLIDE 86

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 86

Pipeline Diagram

  • Pipeline diagram: shorthand for what we just saw
  • Across: cycles
  • Down: insns
  • Convention: X means lw $4,0($5) finishes execute stage and writes into X/M latch at end of cycle 4

              1 2 3 4 5 6 7 8 9
add $3,$2,$1  F D X M W
lw $4,0($5)     F D X M W
sw $6,4($7)       F D X M W
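A pipeline diagram like this one can be generated mechanically when there are no stalls: insn i (counting from 0) occupies stage s in cycle i + s + 1. The sketch below assumes that stall-free case; `schedule` and `render` are illustrative names.

```python
STAGES = "FDXMW"


def schedule(insns: list) -> list:
    """For each insn, map cycle number to stage letter, assuming no stalls."""
    return [{i + s + 1: STAGES[s] for s in range(len(STAGES))}
            for i in range(len(insns))]


def render(insns: list) -> str:
    """Render the diagram as text: cycles across, insns down."""
    table = schedule(insns)
    n_cycles = len(insns) + len(STAGES) - 1
    width = max(len(text) for text in insns)
    lines = [" " * width + " " + " ".join(f"{c:>2}" for c in range(1, n_cycles + 1))]
    for text, row in zip(insns, table):
        cells = " ".join(f"{row.get(c, ''):>2}" for c in range(1, n_cycles + 1))
        lines.append(f"{text:<{width}} {cells}".rstrip())
    return "\n".join(lines)


program = ["add $3,$2,$1", "lw $4,0($5)", "sw $6,4($7)"]
print(render(program))
```

With hazards, some rows would repeat a stage letter for the stalled cycles, which this stall-free sketch does not model.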

slide-87
SLIDE 87

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 87

What About Pipelined Control?

  • Should it be like single-cycle control?
  • But individual insn signals must be staged
  • Should it be like multi-cycle control?
  • But all stages are simultaneously active
  • How many different controllers are we going to need?
  • One for each insn in pipeline?
  • Solution: use simple single-cycle control, but pipeline it
  • Single controller
slide-88
SLIDE 88

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 88

Pipelined Control

[Figure: pipelined datapath with a single controller (CTRL) at decode. Its control signals travel with the insn: xC, mC, and wC are latched in D/X; mC and wC in X/M; wC in M/W]

slide-89
SLIDE 89

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 89

Pipeline Performance Calculation

  • Single-cycle
  • Clock period = 50ns, CPI = 1
  • Performance = 50ns/insn
  • Multi-cycle
  • Branch: 20% (3 cycles), load: 20% (5 cycles), other: 60% (4 cycles)

  • Clock period = 12ns, CPI = (0.2*3+0.2*5+0.6*4) = 4
  • Remember: latching overhead makes it 12, not 10
  • Performance = 48ns/insn
  • Pipelined
  • Clock period = 12ns
  • CPI = 1.5 (on average insn completes every 1.5 cycles)
  • Performance = 18ns/insn
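The three results above all come from the same product, clock period x CPI. The sketch below recomputes them from the numbers on this slide; `ns_per_insn` is an illustrative helper name.

```python
def ns_per_insn(clock_ns: float, cpi: float) -> float:
    """Average time per instruction = clock period x cycles per instruction."""
    return clock_ns * cpi


# Multi-cycle CPI from the mix: 20% branches, 20% loads, 60% other.
multi_cpi = 0.2 * 3 + 0.2 * 5 + 0.6 * 4

designs = {
    "single-cycle": ns_per_insn(50, 1),
    "multi-cycle": ns_per_insn(12, multi_cpi),
    "pipelined": ns_per_insn(12, 1.5),
}
for name, ns in designs.items():
    print(f"{name}: {ns:.0f} ns/insn")
```

The pipelined design keeps the short multi-cycle clock while bringing CPI back close to 1, which is where its advantage comes from.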
slide-90
SLIDE 90

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 90

Q1: Why Is Pipeline Clock Period …

  • … > delay thru datapath / number of pipeline stages?
  • Latches (FFs) add delay
  • Pipeline stages have different delays, clock period is max delay
  • Both factors have implications for ideal number pipeline stages
slide-91
SLIDE 91

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 91

Q2: Why Is Pipeline CPI…

  • … > 1?
  • CPI for scalar in-order pipeline is 1 + stall penalties
  • Stalls used to resolve hazards
  • Hazard: condition that jeopardizes VN illusion
  • Stall: artificial pipeline delay introduced to restore VN illusion
  • Calculating pipeline CPI
  • Frequency of stall * stall cycles
  • Penalties add (stalls generally don’t overlap in in-order pipelines)
  • 1 + stall-freq1*stall-cyc1 + stall-freq2*stall-cyc2 + …
  • Correctness/performance/MCCF
  • Long penalties OK if they happen rarely, e.g., 1 + 0.01 * 10 = 1.1
  • Stalls also have implications for ideal number of pipeline stages
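The pipeline CPI formula is a base of 1 plus the sum of frequency x penalty over all stall sources. A minimal sketch follows; the helper name is illustrative, and the second set of stall numbers is hypothetical.

```python
def pipeline_cpi(stalls) -> float:
    """CPI = 1 + sum(stall frequency x stall cycles) for a scalar in-order pipeline."""
    return 1 + sum(freq * cycles for freq, cycles in stalls)


# The slide's example: a 10-cycle penalty that occurs 1% of the time.
print(round(pipeline_cpi([(0.01, 10)]), 2))

# Hypothetical mix of two stall sources: 20% of insns stall 1 cycle, 5% stall 2.
print(round(pipeline_cpi([(0.20, 1), (0.05, 2)]), 2))
```

Adding the terms is valid under the slide's assumption that stalls do not overlap in an in-order pipeline.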
slide-92
SLIDE 92

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 92

Dependences and Hazards

  • Dependence: relationship between two insns
  • Data: two insns use same storage location
  • Control: one insn affects whether another executes at all
  • Not a bad thing, programs would be boring without them
  • Enforced by making older insn go before younger one
  • Happens naturally in single-/multi-cycle designs
  • But not in a pipeline
  • Hazard: dependence & possibility of wrong insn order
  • Effects of wrong insn order cannot be externally visible
  • Stall: enforce order by keeping younger insn in same stage
  • Hazards are a bad thing: stalls reduce performance
slide-93
SLIDE 93

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 93

Why Does Every Insn Take 5 Cycles?

  • Could/should we allow add to skip M and go to W? No

– It wouldn’t help: peak fetch still only 1 insn per cycle
– Structural hazards: imagine add follows lw

[Figure: pipelined datapath with add $3,$2,$1 one stage behind lw $4,0($5)]

slide-94
SLIDE 94

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 94

Structural Hazards

  • Structural hazards
  • Two insns trying to use same circuit at same time
  • E.g., structural hazard on regfile write port
  • To fix structural hazards: proper ISA/pipeline design
  • Each insn uses every structure exactly once
  • For at most one cycle
  • Always at same stage relative to F
slide-95
SLIDE 95

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 95

Data Hazards

  • Let’s forget about branches and the control for a while
  • The three insn sequence we saw earlier executed fine…
  • But it wasn’t a real program
  • Real programs have data dependences
  • They pass values via registers and memory

[Figure: decode through writeback portion of the pipeline (Register File, F/D, D/X, X/M, Data Mem, M/W) holding add $3,$2,$1, lw $4,0($5), and sw $6,0($7)]

slide-96
SLIDE 96

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 96

Data Hazards

  • Would this “program” execute correctly on this pipeline?
  • Which insns would execute with correct inputs?
  • add is writing its result into $3 in current cycle

– lw read $3 2 cycles ago → got wrong value
– addi read $3 1 cycle ago → got wrong value

  • sw is reading $3 this cycle → OK (regfile timing: write first half)

add $3,$2,$1
lw $4,0($3)
addi $6,$3,1
sw $3,0($7)

[Figure: decode through writeback portion of the pipeline (Register File, F/D, D/X, X/M, Data Mem, M/W) with add in W, lw in M, addi in X, and sw in D]

slide-97
SLIDE 97

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 97

Memory Data Hazards

  • What about data hazards through memory? No
  • lw following sw to same address in next cycle, gets right value
  • Why? DMem read/write take place in same stage
  • Data hazards through registers? Yes (previous slide)
  • Occur because register write is 3 stages after register read
  • Can only read a register value 3 cycles after writing it

sw $5,0($1)
lw $4,0($1)

[Figure: decode through writeback portion of the pipeline (Register File, F/D, D/X, X/M, Data Mem, M/W) with lw one stage behind sw]

slide-98
SLIDE 98

CS104 (Hilton): Datapaths [Slides adapted from A. Roth’s] 98

Fixing Register Data Hazards

  • Can only read register value 3 cycles after writing it
  • One way to enforce this: make sure programs don’t do it
  • Compiler puts two independent insns between write/read insn pair
  • If they aren’t there already
  • Independent means: “do not interfere with register in question”
  • Do not write it: otherwise meaning of program changes
  • Do not read it: otherwise create new data hazard
  • Code scheduling: compiler moves around existing insns to do this
  • If none can be found, must use nops
  • This is called software interlocks
  • MIPS: Microprocessor without Interlocked Pipeline Stages
slide-99
SLIDE 99

Software Interlock Example

add $3,$2,$1
lw $4,0($3)
sw $7,0($3)
add $6,$2,$8
addi $3,$5,4

  • Can any of last three insns be scheduled between first two?
  • sw $7,0($3)? No, creates hazard with add $3,$2,$1
  • add $6,$2,$8? OK
  • addi $3,$5,4? No, lw would read $3 from it
  • Still need one more insn, use nop

add $3,$2,$1
add $6,$2,$8
nop
lw $4,0($3)
sw $7,0($3)
addi $3,$5,4
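The nop-padding half of software interlocks can be sketched in a few lines of Python. This is only an illustration, not a real compiler pass: the `Insn` tuple, `insert_nops`, and `MIN_DISTANCE` are hypothetical names, and the sketch assumes the rule from the previous slide (a reader must come at least 3 insns after the writer). It does no code scheduling; it only pads an already-scheduled sequence.

```python
from typing import NamedTuple, Optional, Tuple

class Insn(NamedTuple):
    text: str
    dest: Optional[int]        # register number written, or None
    srcs: Tuple[int, ...]      # register numbers read

NOP = Insn("nop", None, ())
MIN_DISTANCE = 3  # assumed: a reader must sit at least 3 insns after the writer

def insert_nops(program):
    """Pad program with nops so no insn reads a register too early."""
    out = []
    for insn in program:
        needed = 0
        # Check the MIN_DISTANCE - 1 most recently emitted insns, nearest first.
        recent = reversed(out[-(MIN_DISTANCE - 1):])
        for back, prev in enumerate(recent, start=1):
            if prev.dest is not None and prev.dest in insn.srcs:
                needed = max(needed, MIN_DISTANCE - back)
        out.extend([NOP] * needed)
        out.append(insn)
    return out

# The sequence from this slide after the compiler moved add $6,$2,$8 up.
scheduled = [
    Insn("add $3,$2,$1", 3, (2, 1)),
    Insn("add $6,$2,$8", 6, (2, 8)),
    Insn("lw $4,0($3)", 4, (3,)),
    Insn("sw $7,0($3)", None, (3, 7)),
    Insn("addi $3,$5,4", 3, (5,)),
]
padded = [i.text for i in insert_nops(scheduled)]
print("\n".join(padded))
```

On the scheduled sequence the sketch adds the single nop shown above; with no independent insn between add and lw it would add two.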

slide-100
SLIDE 100

Software Interlock Performance

  • Same deal
  • Branch: 20%, load: 20%, store: 10%, other: 50%
  • Software interlocks
  • 20% of insns require insertion of 1 nop
  • 5% of insns require insertion of 2 nops
  • CPI is still 1 technically
  • But now there are more insns
  • #insns = 1 + 0.20*1 + 0.05*2 = 1.3

– 30% more insns (30% slowdown) due to data hazards

slide-101
SLIDE 101

Hardware Interlocks

  • Problem with software interlocks? Not compatible
  • Where does 3 in “read register 3 cycles after writing” come from?
  • From structure (depth) of pipeline
  • What if next MIPS version uses a 7 stage pipeline?
  • Programs compiled assuming 5 stage pipeline will break
  • A better (more compatible) way: hardware interlocks
  • Processor detects data hazards and fixes them
  • Two aspects to this
  • Detecting hazards
  • Fixing hazards
slide-102
SLIDE 102

Detecting Data Hazards

  • Compare F/D insn input register names with output register names of older insns in pipeline

Hazard = (F/D.IR.RS1 == D/X.IR.RD) ||
         (F/D.IR.RS2 == D/X.IR.RD) ||
         (F/D.IR.RS1 == X/M.IR.RD) ||
         (F/D.IR.RS2 == X/M.IR.RD)

[Figure: five-stage pipelined datapath (Register File, Data Mem; F/D, D/X, X/M, M/W latches) with hazard logic comparing F/D source registers against D/X and X/M destination registers]
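The hazard equation above can be modelled directly in Python. This is a minimal sketch, not the course's hardware: the `IR` tuple and the `hazard` function are hypothetical names, and treating a missing register field as `None` (never matching) is an added assumption, since the slide's equation compares raw register fields.

```python
from typing import NamedTuple, Optional

class IR(NamedTuple):
    rs1: Optional[int] = None   # first source register, if any
    rs2: Optional[int] = None   # second source register, if any
    rd: Optional[int] = None    # destination register, if any

def hazard(fd: IR, dx: IR, xm: IR) -> bool:
    """True if the insn in F/D reads a register an older insn has yet to write."""
    def conflict(src, older):
        # Assumption: a field of None means "not used" and never matches.
        return src is not None and older.rd is not None and src == older.rd
    return (conflict(fd.rs1, dx) or conflict(fd.rs2, dx) or
            conflict(fd.rs1, xm) or conflict(fd.rs2, xm))

lw = IR(rs1=3, rd=4)             # lw $4,0($3)
add = IR(rs1=2, rs2=1, rd=3)     # add $3,$2,$1
nop = IR()

print(hazard(fd=lw, dx=add, xm=nop))   # add is one stage ahead of lw
print(hazard(fd=lw, dx=nop, xm=add))   # add is two stages ahead of lw
print(hazard(fd=lw, dx=nop, xm=nop))   # add has moved on to writeback
```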

slide-103
SLIDE 103

Fixing Data Hazards

  • Prevent F/D insn from reading (advancing) this cycle
  • Write nop into D/X.IR (effectively, insert nop in hardware)
  • Also reset (clear) the datapath control signals
  • Disable F/D latch and PC write enables (why?)
  • Re-evaluate situation next cycle

[Figure: five-stage pipelined datapath (Register File, Data Mem; F/D, D/X, X/M, M/W latches) with hazard logic and a nop input to D/X]

slide-104
SLIDE 104

Aside: Insert NOP/Reset Register

  • Earlier: registers support separate clock, write enable
  • Useful for writes into register file
  • Also useful for implementing stalls
  • Registers should also support synchronous reset (clear)
  • Useful for implementing stalls
  • Implement as additional hardwired 0 input to FF data mux
  • Resetting pipeline registers equivalent to inserting a NOP
  • If NOP is all zeros
  • If zero means “don’t write” for all write-enable control signals
  • Design ISA/control signals to make sure this is the case

[Figure: flip-flop (FF, D, Q) with write enable WE, and flip-flop with reset whose data mux takes a 2-bit [RST:WE] select]
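A behavioural sketch of such a register, in Python, may make the stall and flush uses concrete. The class name `PipelineRegister` and its `tick` method are hypothetical, and giving reset priority over write enable is an assumption; the slide only says both signals select the flip-flop's data input.

```python
class PipelineRegister:
    """Clocked register with write enable and synchronous reset (clear)."""

    def __init__(self, mask=0xFFFFFFFF):
        self.q = 0          # current output
        self.mask = mask    # register width, 32 bits by default

    def tick(self, d, we=True, rst=False):
        """Model one clock edge and return the new output."""
        if rst:
            self.q = 0                  # hardwired 0 input: acts as a nop
        elif we:
            self.q = d & self.mask      # normal write
        # else: hold the old value (this is what a stall does)
        return self.q

reg = PipelineRegister()
reg.tick(0x8C640000)             # normal advance: latch a new insn word
reg.tick(0x00000020, we=False)   # stall: write disabled, old value kept
held = reg.q
reg.tick(0x00000020, rst=True)   # flush / insert nop: cleared to all zeros
```

Clearing to zero only behaves as a nop under the conditions listed above: the all-zeros encoding is a nop and zero means "don't write" for every write-enable control signal.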

slide-105
SLIDE 105

Hardware Interlock Example: cycle 1

(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD) = 1

[Figure: five-stage pipelined datapath with hazard logic and a nop input to D/X. Insns shown: add $3,$2,$1, lw $4,0($3)]

slide-106
SLIDE 106

Hardware Interlock Example: cycle 2

(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD) = 1

[Figure: five-stage pipelined datapath with hazard logic and a nop input to D/X. Insns shown: add $3,$2,$1, lw $4,0($3)]

slide-107
SLIDE 107

Hardware Interlock Example: cycle 3

(F/D.IR.RS1 == D/X.IR.RD) ||
(F/D.IR.RS2 == D/X.IR.RD) ||
(F/D.IR.RS1 == X/M.IR.RD) ||
(F/D.IR.RS2 == X/M.IR.RD) = 0

[Figure: five-stage pipelined datapath with hazard logic and a nop input to D/X. Insns shown: add $3,$2,$1, lw $4,0($3)]

slide-108
SLIDE 108

Pipeline Control Terminology

  • Hardware interlock maneuver is called stall or bubble
  • Mechanism is called stall logic
  • Part of more general pipeline control mechanism
  • Controls advancement of insns through pipeline
  • Distinguish from pipelined datapath control
  • Controls datapath at each stage
  • Pipeline control controls advancement of datapath control
slide-109
SLIDE 109

Pipeline Diagram with Data Hazards

  • Data hazard stall indicated with d*
  • Stall propagates to younger insns
  • This is not good (why?)

               1   2   3   4   5   6   7   8   9
add $3,$2,$1   F   D   X   M   W
lw $4,0($3)        F   d*  d*  D   X   M   W
sw $6,4($7)                    F   D   X   M   W
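A diagram like this can be generated mechanically. The Python sketch below assumes a five-stage pipeline with no bypassing, hardware interlocks, and a register file that writes in the first half of the cycle, so a reader may decode in the same cycle its producer is in W. The function names `schedule` and `render` are hypothetical, and each stage letter marks the last cycle an insn spends in that stage, which is why a younger insn's F appears just before its D.

```python
def schedule(program):
    """program: list of (text, dest, srcs). Returns [(text, {cycle: label})]."""
    ready = {}      # register -> cycle in which its latest writer is in W
    rows = []
    prev_d = 0      # cycle in which the previous insn finished decode
    for text, dest, srcs in program:
        f = max(prev_d, 1)      # insn leaves F once the older insn leaves D
        d = max([f + 1] + [ready.get(r, 0) for r in srcs])
        cells = {f: "F"}
        for c in range(f + 1, d):
            cells[c] = "d*"     # waiting in decode for a source register
        for offset, stage in enumerate("DXMW"):
            cells[d + offset] = stage
        if dest is not None:
            ready[dest] = d + 3     # W is three cycles after D
        rows.append((text, cells))
        prev_d = d
    return rows

def render(rows, ncycles=9):
    """Format the schedule as a plain-text pipeline diagram."""
    width = max(len(text) for text, _ in rows)
    cycles = range(1, ncycles + 1)
    lines = [" " * width + "".join(f"{c:>4}" for c in cycles)]
    for text, cells in rows:
        lines.append(text.ljust(width) +
                     "".join(f"{cells.get(c, ''):>4}" for c in cycles))
    return "\n".join(lines)

rows = schedule([
    ("add $3,$2,$1", 3, (2, 1)),
    ("lw $4,0($3)", 4, (3,)),
    ("sw $6,4($7)", None, (7, 6)),
])
print(render(rows))
```

The sw row makes the cost visible: sw has no data hazard of its own, yet it cannot leave fetch until lw leaves decode.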

slide-110
SLIDE 110

Hardware Interlock Performance

  • Same deal
  • Branch: 20%, load: 20%, store: 10%, other: 50%
  • Hardware interlocks: same as software interlocks
  • 20% of insns require 1 cycle stall (i.e., insertion of 1 nop)
  • 5% of insns require 2 cycle stall (i.e., insertion of 2 nops)
  • CPI = 1 + 0.20*1 + 0.05*2 = 1.3
  • So, either CPI stays at 1 and #insns increases 30% (software)
  • Or, #insns stays at 1 (relative) and CPI increases 30% (hardware)
  • Same difference
  • Anyway, we can do better
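The "same difference" claim can be checked with a few lines of Python, using only the stall mix stated on this slide. The helper name `slowdown` is hypothetical; execution time is taken as proportional to #insns * CPI at a fixed clock.

```python
def slowdown(stall_mix):
    """stall_mix: iterable of (fraction_of_insns, extra_cycles_or_nops)."""
    return 1.0 + sum(frac * extra for frac, extra in stall_mix)

# From the slide: 20% of insns need 1 nop/stall cycle, 5% need 2.
factor = slowdown([(0.20, 1), (0.05, 2)])

# Software interlocks: more insns, CPI stays at 1.
software_time = factor * 1.0
# Hardware interlocks: same insn count, CPI grows instead.
hardware_time = 1.0 * factor

print(f"software: {software_time:.2f}  hardware: {hardware_time:.2f}")
```

Both products come out to 1.3 times the hazard-free execution time.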
slide-111
SLIDE 111

Observe

  • Technically, this situation is broken
  • lw $4,0($3) has already read $3 from regfile
  • add $3,$2,$1 hasn’t yet written $3 to regfile
  • But fundamentally, everything is OK
  • lw $4,0($3) hasn’t actually used $3 yet
  • add $3,$2,$1 has already computed $3

[Figure: five-stage pipelined datapath (Register File, Data Mem; F/D, D/X, X/M, M/W latches). Insns shown: add $3,$2,$1, lw $4,0($3)]

slide-112
SLIDE 112

Bypassing

  • Bypassing
  • Reading a value from an intermediate (µarchitectural) source
  • Not waiting until it is available from primary source
  • Here, we are bypassing the register file
  • Also called forwarding

[Figure: five-stage pipelined datapath with a bypass path into the ALU input. Insns shown: add $3,$2,$1, lw $4,0($3)]

slide-113
SLIDE 113

WX Bypassing

  • What about this combination?
  • Add another bypass path and MUX input
  • First one was an MX bypass
  • This one is a WX bypass

[Figure: five-stage pipelined datapath with MX and WX bypass paths into the ALU input MUX. Insns shown: add $3,$2,$1, lw $4,0($3)]

slide-114
SLIDE 114

ALUinB Bypassing

  • Can also bypass to ALU input B

[Figure: five-stage pipelined datapath with bypass paths into ALU input B. Insns shown: add $3,$2,$1, add $4,$2,$3]

slide-115
SLIDE 115

WM Bypassing?

  • Does WM bypassing make sense?
  • Not to the address input (why not?)
  • But to the store data input, yes

[Figure: five-stage pipelined datapath with a WM bypass path into the Data Mem store data input. Insns shown: lw $3,0($2), sw $3,0($4)]

slide-116
SLIDE 116

Bypass Logic

  • Each MUX has its own, here it is for MUX ALUinA

(D/X.IR.RS1 == X/M.IR.RD) => 0
(D/X.IR.RS1 == M/W.IR.RD) => 1
Else => 2

[Figure: five-stage pipelined datapath with bypass logic driving the ALUinA MUX select]
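The select logic for the ALUinA MUX can be written as a small Python function. This is a sketch with a hypothetical function name; the 0/1/2 encoding follows the slide, and treating a destination of `None` as "insn writes no register" is an added assumption. The order of the tests matters: when both older insns write the register, the X/M value is the more recent one.

```python
def alu_in_a_select(dx_rs1, xm_rd, mw_rd):
    """Return the ALUinA MUX select: 0 = MX bypass, 1 = WX bypass, 2 = regfile."""
    if xm_rd is not None and dx_rs1 == xm_rd:
        return 0    # take X/M.O: result of the insn one ahead
    if mw_rd is not None and dx_rs1 == mw_rd:
        return 1    # take the M/W value: result of the insn two ahead
    return 2        # no hazard: use the value read from the register file

# lw $4,0($3) in X directly behind add $3,$2,$1 (now in M): MX bypass.
print(alu_in_a_select(dx_rs1=3, xm_rd=3, mw_rd=None))
# Same pair with one unrelated insn in between: WX bypass.
print(alu_in_a_select(dx_rs1=3, xm_rd=6, mw_rd=3))
```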

slide-117
SLIDE 117

Bypass and Stall Logic

  • Two separate things
  • Stall logic controls pipeline registers
  • Bypass logic controls MUXs
  • But complementary
  • For a given data hazard: if can’t bypass, must stall
  • Slide #43 shows full bypassing: all bypasses possible
  • Is stall logic still necessary?
slide-118
SLIDE 118

Yes, Load Output to ALU Input

Stall = (D/X.IR.OP == LOAD) &&
        ((F/D.IR.RS1 == D/X.IR.RD) ||
         ((F/D.IR.RS2 == D/X.IR.RD) && (F/D.IR.OP != STORE)))

[Figure: five-stage pipelined datapath with full bypassing, stall logic, and a nop input to D/X. Insns shown: lw $3,0($2), add $4,$2,$3]
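The load-use stall condition translates almost literally into Python. The function name and the string opcodes are hypothetical; the sketch also assumes that for a store RS1 is the address base and RS2 is the data register, which is what lets a store's data arrive through the WM bypass instead of stalling.

```python
def load_use_stall(dx_op, dx_rd, fd_op, fd_rs1, fd_rs2):
    """True if the insn in F/D must wait one cycle for a load in D/X."""
    return dx_op == "LOAD" and (
        fd_rs1 == dx_rd or
        (fd_rs2 == dx_rd and fd_op != "STORE")   # store data can use WM bypass
    )

# lw $3,0($2) followed by add $4,$2,$3: load output feeds an ALU input.
print(load_use_stall("LOAD", 3, "ALU", 2, 3))
# lw $3,0($2) followed by sw $3,0($4): only the store data depends on the load.
print(load_use_stall("LOAD", 3, "STORE", 4, 3))
```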

slide-119
SLIDE 119

Pipeline Diagram With Bypassing

  • Use compiler scheduling to reduce load-use stall frequency
  • Like software interlocks, but for performance not correctness

               1   2   3   4   5   6   7   8   9
add $3,$2,$1   F   D   X   M   W
lw $4,0($3)        F   D   X   M   W
addi $6,$4,1           F   d*  D   X   M   W

               1   2   3   4   5   6   7   8   9
add $3,$2,$1   F   D   X   M   W
lw $4,0($3)        F   D   X   M   W
sub $8,$3,$1           F   D   X   M   W
addi $6,$4,1               F   D   X   M   W

slide-120
SLIDE 120

Control Hazards

  • Control hazards
  • Must fetch post branch insns before branch outcome is known
  • Default: assume “not-taken” (at fetch, can’t tell it’s a branch)

[Figure: front of the pipelined datapath showing branch hardware: PC, Insn Mem, +4 adder, << 2 shifter, Register File, and the F/D, D/X, X/M latches]

slide-121
SLIDE 121

Branch Recovery

  • Branch recovery: what to do when branch is actually taken
  • Insns that will be written into F/D and D/X are wrong
  • Flush them, i.e., replace them with nops

+ They haven’t written permanent state yet (regfile, DMem)

[Figure: front of the pipelined datapath (PC, Insn Mem, +4 adder, << 2 shifter, Register File) with nop inputs to the F/D and D/X latches]

slide-122
SLIDE 122

Branch Recovery Pipeline Diagram

  • Convention: don’t fill in flushed insns
  • Taken branch penalty is 2 cycles

                     1   2   3   4   5   6   7   8   9
addi $3,$0,1         F   D   X   M   W
bnez $3,targ             F   D   X   M   W
sw $6,4($7)                  F   D
targ: addi $8,$7,1               F
targ: addi $8,$7,1                   F   D   X   M   W

                     1   2   3   4   5   6   7   8   9
addi $3,$0,1         F   D   X   M   W
bnez $3,targ             F   D   X   M   W
targ: addi $8,$7,1                   F   D   X   M   W

slide-123
SLIDE 123

Branch Performance

  • Back of the envelope calculation
  • Branch: 20%, load: 20%, store: 10%, other: 50%
  • 75% of branches are taken
  • CPI = 1 + 0.20*0.75*2 = 1.3

– Branches cause 30% slowdown

  • How do we reduce this penalty?
slide-124
SLIDE 124

Fast Branch

  • Fast branch: can decide at D, not X
  • Test must be comparison to zero or equality, no time for ALU

+ New taken branch penalty is 1
– Additional insns (slt) for more complex tests, must bypass to D too

  • 25% of branches have complex tests that require extra insn
  • CPI = 1 + 0.20*0.75*1(branch) + 0.20*0.25*1(extra insn) = 1.2

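[Figure: front of the pipelined datapath with the branch comparison (<>) and target computation moved into the D stage: PC, Insn Mem, +4 adder, << 2 shifter, Register File, and the F/D, D/X, X/M latches]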

slide-125
SLIDE 125

Speculative Execution

  • Speculation: “risky transactions on chance of profit”
  • Speculative execution
  • Execute before all parameters known with certainty
  • Correct speculation

+ Avoid stall, improve performance

  • Incorrect speculation (mis-speculation)

– Must abort/flush/squash incorrect insns
– Must undo incorrect changes (recover pre-speculation state)

  • The “game”: [%correct * gain] – [(1–%correct) * penalty]
  • Control speculation: speculation aimed at control hazards
  • Unknown parameter: are these the correct insns to execute next?
slide-126
SLIDE 126

Control Speculation Mechanics

  • Guess branch target, start fetching at guessed position
  • Doing nothing is implicitly guessing target is PC+4
  • Can actively guess other targets: dynamic branch prediction
  • Execute branch to verify (check) guess
  • Correct speculation? keep going
  • Mis-speculation? Flush mis-speculated insns
  • Hopefully haven’t modified permanent state (Regfile, DMem)

+ Happens naturally in in-order 5-stage pipeline

  • “Game” for in-order 5 stage pipeline
  • %correct = ?
  • Gain = 2 cycles

+ Penalty = 0 cycles → mis-speculation no worse than stalling
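The "game" formula from the Speculative Execution slide is easy to evaluate in Python. The function name is hypothetical. The 25% figure used in the example is an inference, not a number from this slide: it assumes the implicit PC+4 guess is right exactly when a branch is not taken, and takes the 75% taken rate from the Branch Performance slide.

```python
def expected_gain(p_correct, gain, penalty):
    """[%correct * gain] - [(1 - %correct) * penalty], in cycles per branch."""
    return p_correct * gain - (1.0 - p_correct) * penalty

# In-order 5-stage pipeline: gain = 2 cycles, penalty = 0 cycles.
always_not_taken = expected_gain(p_correct=0.25, gain=2, penalty=0)
never_right = expected_gain(p_correct=0.0, gain=2, penalty=0)

print(always_not_taken, never_right)
```

With a penalty of 0 the result can never go negative, which is the sense in which mis-speculation is no worse than stalling.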

slide-127
SLIDE 127

Dynamic Branch Prediction

  • Dynamic branch prediction: guess outcome
  • Start fetching from guessed address
  • Flush on mis-prediction (notice new recovery circuit)

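[Figure: pipelined datapath with a branch predictor (BP) supplying the fetch PC, the predicted target (TG) carried through the F/D and D/X latches, a comparison (<>) against the actual target, and nop inputs for flushing]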

slide-128
SLIDE 128

Branch Prediction: Short Summary

  • Key principle of micro-architecture:
  • Programs do the same thing over and over (why?)
  • Exploit for performance:
  • Learn what a program did before
  • Guess that it will do the same thing again
  • Details of branch prediction: later (~1 month)
  • For now, just know it can be done and is important to performance
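As a taste of "learn what a program did before, guess it will do the same thing again", here is a last-outcome predictor in Python. It is an illustrative sketch only, not necessarily one of the predictors covered later in the course; the class name, the per-PC dictionary, and the not-taken default are all assumptions.

```python
class LastOutcomePredictor:
    """Predict that each branch repeats whatever it did last time."""

    def __init__(self):
        self.last = {}      # branch PC -> last observed outcome (True = taken)

    def predict(self, pc):
        return self.last.get(pc, False)     # unseen branch: guess not-taken

    def update(self, pc, taken):
        self.last[pc] = taken

def accuracy(predictor, trace):
    """trace: list of (pc, taken). Returns the fraction predicted correctly."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)

# A loop branch at a made-up PC: taken 9 times, then falls through once.
loop_trace = [(0x40, True)] * 9 + [(0x40, False)]
loop_accuracy = accuracy(LastOutcomePredictor(), loop_trace)
print(loop_accuracy)
```

The predictor misses only the first iteration and the loop exit, because the loop branch does the same thing over and over.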

slide-129
SLIDE 129

Branch Prediction Performance

  • Dynamic branch prediction
  • Simple predictor: branches predicted with 75% accuracy
  • CPI = 1 + 0.20*0.25*2 = 1.1
  • More advanced predictor: 95% accuracy
  • CPI = 1 + 0.20*0.05*2 = 1.02
  • Branch mis-predictions still a big problem though
  • Pipelines are long: typical mis-prediction penalty is 10+ cycles
  • Pipelines have full bypassing: compiler schedules the rest
  • Pipelines are superscalar (later)
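The branch CPI figures from this and the earlier branch slides all follow one pattern, sketched here in Python. The function and parameter names are hypothetical; the inputs are the percentages stated on the slides (20% branches, 75% taken, 2-cycle or 1-cycle penalty, 25% of branches needing an extra insn with fast branches).

```python
def branch_cpi(branch_frac, redirect_frac, penalty, extra_insn_frac=0.0):
    """CPI = 1 + branches that redirect fetch * penalty + extra insns per branch."""
    return (1.0
            + branch_frac * redirect_frac * penalty
            + branch_frac * extra_insn_frac)

cases = {
    "assume not-taken, resolve in X": branch_cpi(0.20, 0.75, 2),
    "fast branch, resolve in D":      branch_cpi(0.20, 0.75, 1, extra_insn_frac=0.25),
    "75% accurate predictor":         branch_cpi(0.20, 0.25, 2),
    "95% accurate predictor":         branch_cpi(0.20, 0.05, 2),
}
for name, cpi in cases.items():
    print(f"{name}: CPI = {cpi:.2f}")
```

Raising `penalty` to 10 or more, as in the long pipelines mentioned above, shows why mis-predictions remain a problem even at 95% accuracy.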
slide-130
SLIDE 130

Pipelining And Exceptions

  • Pipelining makes exceptions nasty
  • 5 insns in pipeline at once
  • Exception happens, how do you know which insn caused it?
  • Exceptions propagate along pipeline in latches
  • Two exceptions happen, how do you know which one to take first?
  • One belonging to oldest insn
  • When handling exception, have to flush younger insns
  • Piggy-back on branch mis-prediction machinery to do this
  • What about multi-cycle operations?
  • Just FYI
slide-131
SLIDE 131

Pipeline Depth

  • No magic about 5 stages, trend had been to deeper pipelines
  • 486: 5 stages (50+ gate delays / clock)
  • Pentium: 7 stages
  • Pentium II/III: 12 stages
  • Pentium 4: 22 stages (~10 gate delays / clock) “super-pipelining”
  • Core1/2: 14 stages
  • Increasing pipeline depth

+ Increases clock frequency (reduces period)
– But decreases IPC (increases CPI)

  • Branch mis-prediction penalty becomes longer
  • Non-bypassed data hazard stalls become longer
  • At some point, CPI losses offset clock gains, question is when?
  • 1 GHz Pentium 4 was slower than 800 MHz Pentium III
  • What was the point? People buy frequency, not frequency * IPC