Pipelining Performance Measurements Cycle Time: Time in between - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Pipelining

slide-2
SLIDE 2

Performance Measurements

  • Cycle Time: Time in between clock ticks
  • Latency: Time to finish a complete job, start to finish
  • Throughput: Average jobs completed per unit time
  • CyclesPerJob: Number of cycles between finishing jobs

slide-3
SLIDE 3

Goals

  • Faster clock rate
  • Use machine more efficiently
  • No longer execute only one instruction at a time

slide-4
SLIDE 4

Laundry

  • Laundry-o-matic washes, dries & folds
  • Wash: 30 min
  • Dry: 40 min
  • Fold: 20 min
  • It switches them internally with no delay
  • How long to complete 1 load? ______

slide-5
SLIDE 5

Laundry

  • Laundry-o-matic washes, dries & folds
  • Wash: 30 min
  • Dry: 40 min
  • Fold: 20 min
  • It switches them internally with no delay
  • How long to complete 1 load? 90 min

slide-6
SLIDE 6

Laundry-o-Matic - SingleCycle

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

slide-7
SLIDE 7

Laundry-o-Matic - SingleCycle

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

W F D

slide-8
SLIDE 8

Laundry-o-Matic - SingleCycle

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

W F W D F D

slide-9
SLIDE 9

Laundry-o-Matic - SingleCycle

[Timeline figure: minutes axis 0 to 270 in steps of 30; loads 1, 2, and 3 run back to back, each going through W, D, and F before the next load starts.]

slide-10
SLIDE 10

Laundry-o-Matic

  • Cycle Time: Clothing is switched every ____ minutes
  • Latency: A single load takes a total of ______ minutes
  • Throughput: A load completes every ______ minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-11
SLIDE 11

Laundry-o-Matic

  • Cycle Time: Clothing is switched every 90 minutes
  • Latency: A single load takes a total of ______ minutes
  • Throughput: A load completes every ______ minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-12
SLIDE 12

Laundry-o-Matic

  • Cycle Time: Clothing is switched every 90 minutes
  • Latency: A single load takes a total of 90 minutes
  • Throughput: A load completes every ______ minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-13
SLIDE 13

Laundry-o-Matic

  • Cycle Time: Clothing is switched every 90 minutes
  • Latency: A single load takes a total of 90 minutes
  • Throughput: A load completes every 90 minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-14
SLIDE 14

Laundry-o-Matic

  • Cycle Time: Clothing is switched every 90 minutes
  • Latency: A single load takes a total of 90 minutes
  • Throughput: A load completes every 90 minutes
  • CyclesPerLoad: Every 1 cycle, a load completes
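The single-cycle answers above all follow from one sum. A minimal sketch, assuming the stage times from the Laundry slide; the function and key names are made up for illustration:

```python
# Stage times in minutes, copied from the Laundry slide.
STAGE_MINUTES = {"wash": 30, "dry": 40, "fold": 20}


def single_cycle_metrics(stage_minutes):
    """Metrics for a machine that runs every stage of a load in one cycle."""
    cycle_time = sum(stage_minutes.values())  # one cycle must cover all stages
    return {
        "cycle_time_min": cycle_time,    # clothing is switched once per cycle
        "latency_min": cycle_time,       # a load starts and finishes in one cycle
        "minutes_per_load": cycle_time,  # one load completes per cycle
        "cycles_per_load": 1,
    }


metrics = single_cycle_metrics(STAGE_MINUTES)
print(metrics)
```

In the single-cycle machine, cycle time, latency, and time per completed load are the same number, which is why all three blanks get 90.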

slide-15
SLIDE 15

Pipelined Laundry

  • Split the laundry-o-matic into a washer, dryer, and folder (what a concept)
  • Moving the laundry from one unit to another takes 6 minutes

slide-16
SLIDE 16

Pipelined Laundry

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

W F D

We have to include time to switch stages

slide-17
SLIDE 17

Pipelined Laundry

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

W F D W F D

slide-18
SLIDE 18

Pipelined Laundry

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

W F D W F D

Two loads cannot be in the dryer at the same time.

slide-19
SLIDE 19

Pipelined Laundry

Minutes Load 1 2 3

0 30 60 90 120 150 180 210 240 270

W F W

Switch all loads at the same time

D D F

slide-20
SLIDE 20

Pipelined Laundry

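[Timeline figure: minutes axis 0 to 270 in steps of 30; loads 1, 2, and 3 overlap, each entering W one cycle after the previous load and then moving through D and F, with all loads switching at the same time.]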

slide-21
SLIDE 21

Pipelined Laundry

  • Cycle Time: Clothing is switched every ____ minutes
  • Latency: A single load takes a total of ______ minutes
  • Throughput: A load completes every ______ minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-22
SLIDE 22

Pipelined Laundry

  • Cycle Time: Clothing is switched every 46 minutes
  • Latency: A single load takes a total of ______ minutes
  • Throughput: A load completes every ______ minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-23
SLIDE 23

Pipelined Laundry

  • Cycle Time: Clothing is switched every 46 minutes
  • Latency: A single load takes a total of 138 minutes
  • Throughput: A load completes every ______ minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-24
SLIDE 24

Pipelined Laundry

  • Cycle Time: Clothing is switched every 46 minutes
  • Latency: A single load takes a total of 138 minutes
  • Throughput: A load completes every 46 minutes
  • CyclesPerLoad: Every ____ cycles, a load completes

slide-25
SLIDE 25

Pipelined Laundry

  • Cycle Time: Clothing is switched every 46 minutes
  • Latency: A single load takes a total of 138 minutes
  • Throughput: A load completes every 46 minutes
  • CyclesPerLoad: Every 1 cycle, a load completes
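The pipelined numbers can be reproduced the same way. This is a sketch under the slides' assumptions: all loads switch at the same time, so the slowest unit plus the 6-minute move sets the cycle, and a load spends one full cycle in each unit. The function name is hypothetical.

```python
STAGE_MINUTES = [30, 40, 20]  # wash, dry, fold
SWITCH_MINUTES = 6            # moving a load from one unit to the next


def pipelined_laundry_metrics(stage_minutes, switch_minutes):
    # The slowest unit plus the move sets the cycle for every unit.
    cycle_time = max(stage_minutes) + switch_minutes
    # A load spends one whole cycle in each unit.
    latency = cycle_time * len(stage_minutes)
    return {
        "cycle_time_min": cycle_time,
        "latency_min": latency,
        "minutes_per_load": cycle_time,  # once full, one load finishes per cycle
        "cycles_per_load": 1,
    }


metrics = pipelined_laundry_metrics(STAGE_MINUTES, SWITCH_MINUTES)
print(metrics)
```

Latency goes up (138 vs. 90 minutes) because every stage is padded to the length of the dryer stage, but a load now finishes every 46 minutes instead of every 90.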

slide-26
SLIDE 26

Single-Cycle vs Pipelined

  • _________ has the higher cycle time
  • _________ has the higher clock rate
  • _________ has the higher single-load latency
  • _________ has the higher throughput
  • _________ has the higher CPL (Cycles per Load)
  • More stages make for a _______ clock rate

slide-27
SLIDE 27

Single-Cycle vs Pipelined

  • Single has the higher cycle time
  • _________ has the higher clock rate
  • _________ has the higher single-load latency
  • _________ has the higher throughput
  • _________ has the higher CPL (Cycles per Load)
  • More stages make for a _______ clock rate

slide-28
SLIDE 28

Single-Cycle vs Pipelined

  • Single has the higher cycle time
  • Pipelined has the higher clock rate
  • _________ has the higher single-load latency
  • _________ has the higher throughput
  • _________ has the higher CPL (Cycles per Load)
  • More stages make for a _______ clock rate

slide-29
SLIDE 29

Single-Cycle vs Pipelined

  • Single has the higher cycle time
  • Pipelined has the higher clock rate
  • Pipelined has the higher single-load latency
  • _________ has the higher throughput
  • _________ has the higher CPL (Cycles per Load)
  • More stages make for a _______ clock rate

slide-30
SLIDE 30

Single-Cycle vs Pipelined

  • Single has the higher cycle time
  • Pipelined has the higher clock rate
  • Pipelined has the higher single-load latency
  • Pipelined has the higher throughput
  • _________ has the higher CPL (Cycles per Load)
  • More stages make for a _______ clock rate

slide-31
SLIDE 31

Single-Cycle vs Pipelined

  • Single has the higher cycle time
  • Pipelined has the higher clock rate
  • Pipelined has the higher single-load latency
  • Pipelined has the higher throughput
  • Neither has the higher CPL (Cycles per Load)
  • More stages make for a _______ clock rate

slide-32
SLIDE 32

Single-Cycle vs Pipelined

  • Single has the higher cycle time
  • Pipelined has the higher clock rate
  • Pipelined has the higher single-load latency
  • Pipelined has the higher throughput
  • Neither has the higher CPL (Cycles per Load)
  • More stages make for a higher clock rate
slide-33
SLIDE 33

Obstacles to speedup in Pipelining

  • 1.
  • 2.
  • Ideal cycle time w/out above limitations with n stage pipeline:

slide-34
SLIDE 34

Obstacles to speedup in Pipelining

  • 1. Uneven Stages
  • 2.
  • Ideal cycle time w/out above limitations with n stage pipeline:

slide-35
SLIDE 35

Obstacles to speedup in Pipelining

  • 1. Uneven Stages
  • 2. Pipeline Register Delay
  • Ideal cycle time w/out above limitations with n stage pipeline:

slide-36
SLIDE 36

Obstacles to speedup in Pipelining

  • 1. Uneven Stages
  • 2. Pipeline Register Delay
  • Ideal cycle time w/out above limitations with n stage pipeline:
      • OldCycleTime / n

slide-37
SLIDE 37

Example

  • Washing = 45
  • Drying = 120
  • Folding = 15
  • Switching = 5
  • What is the latency for one load of laundry?
  • What is the latency for three loads?
slide-38
SLIDE 38

Example

  • Washing = 45
  • Drying = 120
  • Folding = 15
  • Switching = 5
  • What is the latency for one load of laundry? 375
  • What is the latency for three loads? 625
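Both answers follow from the cycle time (slowest stage plus switching) and the number of cycles needed to drain the pipeline. A minimal sketch, assuming loads enter back to back with no stalls; the function name is hypothetical:

```python
def pipelined_finish_time(stage_times, switch_time, loads):
    """Time until the last of `loads` back-to-back loads leaves the pipeline."""
    cycle_time = max(stage_times) + switch_time  # slowest stage plus switching
    cycles = len(stage_times) + loads - 1        # fill the pipe, then one load per cycle
    return cycles * cycle_time


STAGES = [45, 120, 15]  # washing, drying, folding
SWITCH = 5

one_load = pipelined_finish_time(STAGES, SWITCH, loads=1)
three_loads = pipelined_finish_time(STAGES, SWITCH, loads=3)
print(one_load, three_loads)
```

With a 125-minute cycle, one load takes 3 cycles and three loads take 5 cycles, which matches the 375 and 625 on the slide.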
slide-39
SLIDE 39

Creating Stages

  • Fetch – get instruction
  • Decode – read registers
  • Execute – use ALU
  • Memory – access memory
  • WriteBack – write registers

[Figure: the five stages in order, Fetch (IF), Decode (ID), Execute (EX), Memory (MEM), WriteBack (WB).]

slide-40
SLIDE 40

Pipelined Machine

[Figure: pipelined datapath. PC and instruction memory (Fetch); register file with src1, src2, and destreg ports, plus sign extension of the 16-bit immediate to 32 bits (Decode); Execute; data memory (Memory); Writeback. Pipeline registers separate the stages.]

slide-41
SLIDE 41

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF
lw $s1, 0($t0)
sw $s2, 0($t1)
or $s3, $s4, $t3

slide-42
SLIDE 42

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID
lw $s1, 0($t0)         IF
sw $s2, 0($t1)
or $s3, $s4, $t3

slide-43
SLIDE 43

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX
lw $s1, 0($t0)         IF   ID
sw $s2, 0($t1)              IF
or $s3, $s4, $t3

slide-44
SLIDE 44

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM
lw $s1, 0($t0)         IF   ID   EX
sw $s2, 0($t1)              IF   ID
or $s3, $s4, $t3                 IF

slide-45
SLIDE 45

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM
sw $s2, 0($t1)              IF   ID   EX
or $s3, $s4, $t3                 IF   ID

slide-46
SLIDE 46

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM
or $s3, $s4, $t3                 IF   ID   EX

slide-47
SLIDE 47

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM

slide-48
SLIDE 48

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB

slide-49
SLIDE 49

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB

The machine in cycle 4

slide-50
SLIDE 50

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB

The machine in cycle 5

slide-51
SLIDE 51

In what cycle was $s1 written?
In what cycle was $s4 read?
In what cycle was the Add executed?

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB

slide-52
SLIDE 52

In what cycle was $s1 written? 6
In what cycle was $s4 read?
In what cycle was the Add executed?

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB

slide-53
SLIDE 53

In what cycle was $s1 written? 6
In what cycle was $s4 read? 5
In what cycle was the Add executed?

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB

slide-54
SLIDE 54

In what cycle was $s1 written? 6
In what cycle was $s4 read? 5
In what cycle was the Add executed? 3

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
lw $s1, 0($t0)         IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
or $s3, $s4, $t3                 IF   ID   EX   MEM  WB
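The three answers can be read off a simple rule: with no stalls, instruction i (counting from 0) is in stage s (counting from 0) during cycle i + s + 1. The sketch below assumes the five-stage order IF, ID, EX, MEM, WB and an ideal pipeline; the helper names are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
PROGRAM = [
    "add $s0, $0, $0",
    "lw $s1, 0($t0)",
    "sw $s2, 0($t1)",
    "or $s3, $s4, $t3",
]


def stage_cycle(instr_index, stage):
    """1-based cycle in which instruction `instr_index` (0-based) is in `stage`."""
    return instr_index + STAGES.index(stage) + 1


def timing_diagram(program):
    """Text timing diagram: one row per instruction, one 5-char column per cycle."""
    rows = []
    for i, text in enumerate(program):
        cells = ["     "] * i + [f"{s:<5}" for s in STAGES]
        rows.append((f"{text:<18}" + "".join(cells)).rstrip())
    return "\n".join(rows)


print(timing_diagram(PROGRAM))

s1_written = stage_cycle(1, "WB")    # lw writes $s1 in its WB stage
s4_read = stage_cycle(3, "ID")       # or reads $s4 in its ID stage
add_executed = stage_cycle(0, "EX")  # add uses the ALU in its EX stage
print(s1_written, s4_read, add_executed)
```

The rule only holds while nothing stalls; the data hazard slides later in the deck show where it breaks.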

slide-55
SLIDE 55

Performance Analysis

  • Measurements related to our machine
  • Job = single instruction
  • Latency: Time to finish a complete _______________, start to finish.
  • Throughput: Average ______________ completed per unit time.
  • Which is more important for reducing program execution time?

slide-56
SLIDE 56

Performance Analysis

  • Measurements related to our machine
  • Job = single instruction
  • Latency: Time to finish a complete instruction, start to finish.
  • Throughput: Average ______________ completed per unit time.
  • Which is more important for reducing program execution time?

slide-57
SLIDE 57

Performance Analysis

  • Measurements related to our machine
  • Job = single instruction
  • Latency: Time to finish a complete instruction, start to finish.
  • Throughput: Average number of instructions completed per unit time.
  • Which is more important for reducing program execution time?

slide-58
SLIDE 58

Pipelined Machine

[Figure: pipelined datapath. PC and instruction memory (Fetch); register file with src1, src2, and destreg ports, plus sign extension of the 16-bit immediate to 32 bits (Decode); Execute; data memory (Memory); Writeback. Pipeline registers separate the stages.]

slide-59
SLIDE 59

Pipeline Registers

  • IF/ID
      • 32b instruction
      • 32b nPC
  • ID/EX
      • 32b register
      • 32b register
      • 32b immediate field
      • 32b nPC
  • EX/MEM
      • Zero
      • 32b ALU result
      • 32b nPC
      • 32b register value
  • MEM/WB
      • 32b ALU result
      • 32b memory value
  • Named for two stages they separate
  • Store all data corresponding to lines that go through them

slide-60
SLIDE 60

Register File

  • Only takes half of a cycle to read or write to register file
  • Convention:
      • Read 2nd half of cycle
      • Write 1st half of cycle

slide-61
SLIDE 61

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: _____ ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

Pipelined Implementation
  • Clock cycle time: _____ ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

slide-62
SLIDE 62

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: 8 ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

Pipelined Implementation
  • Clock cycle time: _____ ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

slide-63
SLIDE 63

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: 8 ns
  • Latency of a single instruction: 8 ns
  • Throughput for machine: _____ inst/ns

Pipelined Implementation
  • Clock cycle time: _____ ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

slide-64
SLIDE 64

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: 8 ns
  • Latency of a single instruction: 8 ns
  • Throughput for machine: 1/8 inst/ns

Pipelined Implementation
  • Clock cycle time: _____ ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

slide-65
SLIDE 65

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: 8 ns
  • Latency of a single instruction: 8 ns
  • Throughput for machine: 1/8 inst/ns

Pipelined Implementation
  • Clock cycle time: 2.1 ns
  • Latency of a single instruction: _____ ns
  • Throughput for machine: _____ inst/ns

slide-66
SLIDE 66

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: 8 ns
  • Latency of a single instruction: 8 ns
  • Throughput for machine: 1/8 inst/ns

Pipelined Implementation
  • Clock cycle time: 2.1 ns
  • Latency of a single instruction: 2.1*5=10.5 ns
  • Throughput for machine: _____ inst/ns

slide-67
SLIDE 67

Machine Comparison

Fetch   Decode   Execute   Memory   WriteBack
2 ns    1 ns     2 ns      2 ns     1 ns
0.1 ns pipeline register delay

Single-Cycle Implementation
  • Clock cycle time: 8 ns
  • Latency of a single instruction: 8 ns
  • Throughput for machine: 1/8 inst/ns

Pipelined Implementation
  • Clock cycle time: 2.1 ns
  • Latency of a single instruction: 2.1*5=10.5 ns
  • Throughput for machine: 1 / 2.1 inst/ns
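The six answers come from two small formulas: the single-cycle clock must cover the sum of all stages, while the pipelined clock must cover the slowest stage plus the pipeline register delay. A sketch assuming no stalls and a full pipeline; the function and key names are hypothetical:

```python
def single_cycle(stage_ns):
    """Single-cycle machine: one clock covers every stage."""
    cycle = sum(stage_ns)
    return {
        "cycle_ns": cycle,
        "latency_ns": cycle,
        "throughput_inst_per_ns": 1 / cycle,
    }


def pipelined(stage_ns, register_delay_ns):
    """Pipelined machine: the clock covers the slowest stage plus register delay."""
    cycle = max(stage_ns) + register_delay_ns
    return {
        "cycle_ns": cycle,
        "latency_ns": cycle * len(stage_ns),  # one cycle per stage
        "throughput_inst_per_ns": 1 / cycle,  # one instruction per cycle when full
    }


STAGE_NS = [2, 1, 2, 2, 1]  # Fetch, Decode, Execute, Memory, WriteBack
sc = single_cycle(STAGE_NS)
pl = pipelined(STAGE_NS, register_delay_ns=0.1)
print(sc)
print(pl)
```

Comparing the two dictionaries shows the trade the slides are making: the pipelined latency is worse (10.5 ns vs. 8 ns), but the throughput is almost four times higher.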

slide-68
SLIDE 68

Example 2 – How do we speed up pipelined machine?

Fetch   Decode   Execute   Memory   Writeback
6 ns    4 ns     8 ns      10 ns    4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / ____ inst/ns
  • Pipelined: 1 / ____ inst/ns

slide-69
SLIDE 69

Example 2 – How do we speed up pipelined machine?

Fetch   Decode   Execute   Memory   Writeback
6 ns    4 ns     8 ns      10 ns    4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / 32 inst/ns
  • Pipelined: 1 / 10.1 inst/ns

slide-70
SLIDE 70

Example 2 – Split more stages

Fetch   Decode   Execute   Memory   Writeback
6 ns    4 ns     8 ns      10 ns    4 ns
0.1 ns pipeline register delay

Which stage(s) should we split? _________ and _________

slide-71
SLIDE 71

Example 2 – Split more stages

Fetch   Decode   Execute   Memory   Writeback
6 ns    4 ns     8 ns      10 ns    4 ns
0.1 ns pipeline register delay

Which stage(s) should we split? Memory and _________

slide-72
SLIDE 72

Example 2 – Split more stages

Fetch   Decode   Execute   Memory   Writeback
6 ns    4 ns     8 ns      10 ns    4 ns
0.1 ns pipeline register delay

Which stage(s) should we split? Memory and Execute

slide-73
SLIDE 73

Example 2 – After Split

F       D       X1      X2      M1      M2      WB
___ns   ___ns   ___ns   ___ns   ___ns   ___ns   ___ns
0.1 ns pipeline register delay

  • Single cycle: 1 / ____ inst/ns
  • Pipelined: 1 / ____ inst/ns

slide-74
SLIDE 74

Example 2 – After Split

F       D       X1      X2      M1      M2      WB
6 ns    4 ns    ___ns   ___ns   ___ns   ___ns   4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / ____ inst/ns
  • Pipelined: 1 / ____ inst/ns

slide-75
SLIDE 75

Example 2 – After Split

F       D       X1      X2      M1      M2      WB
6 ns    4 ns    4 ns    4 ns    ___ns   ___ns   4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / ____ inst/ns
  • Pipelined: 1 / ____ inst/ns

slide-76
SLIDE 76

Example 2 – After Split

F       D       X1      X2      M1      M2      WB
6 ns    4 ns    4 ns    4 ns    5 ns    5 ns    4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / ____ inst/ns
  • Pipelined: 1 / ____ inst/ns

slide-77
SLIDE 77

Example 2 – After Split

F       D       X1      X2      M1      M2      WB
6 ns    4 ns    4 ns    4 ns    5 ns    5 ns    4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / 32 inst/ns
  • Pipelined: 1 / ____ inst/ns

slide-78
SLIDE 78

Example 2 – After Split

F       D       X1      X2      M1      M2      WB
6 ns    4 ns    4 ns    4 ns    5 ns    5 ns    4 ns
0.1 ns pipeline register delay

  • Single cycle: 1 / 32 inst/ns
  • Pipelined: 1 / 6.1 inst/ns
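The effect of splitting can be checked numerically. This sketch assumes a stage splits into equal halves, as the slide's numbers do; real stages rarely divide this evenly. The helper names are hypothetical.

```python
def pipelined_cycle_ns(stage_ns, register_delay_ns=0.1):
    """Pipelined clock: slowest stage plus pipeline register delay."""
    return max(stage_ns) + register_delay_ns


def split_stage(stage_ns, index, parts=2):
    """Replace the stage at `index` with `parts` equal sub-stages (an idealization)."""
    piece = stage_ns[index] / parts
    return stage_ns[:index] + [piece] * parts + stage_ns[index + 1:]


before = [6, 4, 8, 10, 4]               # Fetch, Decode, Execute, Memory, Writeback
after_memory = split_stage(before, 3)   # Memory -> M1, M2
after = split_stage(after_memory, 2)    # Execute -> X1, X2

print(after)                            # F, D, X1, X2, M1, M2, WB
print(pipelined_cycle_ns(before), pipelined_cycle_ns(after))
print(sum(before), sum(after))          # single-cycle time is unchanged
```

After both splits Fetch (6 ns) is the slowest stage, so splitting Memory or Execute further would not shorten the clock unless Fetch were split as well.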

slide-79
SLIDE 79

Easy Right? Not so fast.

In what cycle does the add write $s0?
In what cycle does the or read $s0?

Time ->           1    2    3    4    5    6    7    8
add $s0, $0, $0   IF   ID   EX   MEM  WB
or $s3, $s0, $t3       IF   ID   EX   MEM  WB
sw $s2, 0($t1)              IF   ID   EX   MEM  WB
and $s6, $s4, $t3                IF   ID   EX   MEM  WB

Incorrect Execution

slide-80
SLIDE 80

Time-> add $s0, $0, $0

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

Easy Right? Not so fast.

In what cycle does the add write $s0? 1st half of cycle 5 In what cycle does the or read $s0?

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

slide-81
SLIDE 81

Time-> add $s0, $0, $0

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

Easy Right? Not so fast.

In what cycle does the add write $s0 1st half of cycle 5 In what cycle does the or read $s0? 2nd half of cycle 3

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB WB

slide-82
SLIDE 82

Time-> add $s0, $0, $0

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

Easy Right? Not so fast.

In what cycle does the add write $s0? 1st half of cycle 5 In what cycle does the or read $s0? 2nd half of cycle 3

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Ahhhh! Values cannot pass backwards in time

WB

slide-83
SLIDE 83

Time-> add $s0, $0, $0

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

Easy Right? Not so fast.

In what cycle does the add write $s0? 1st half of cycle 5 In what cycle does the or read $s0? 2nd half of cycle 5 Stall - wasted cycles

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

IF IF

Correct, Slow Execution

slide-84
SLIDE 84

Time-> add $s0, $0, $0

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

Easy Right? Not so fast.

In what cycle does the add write $s0? 1st half of cycle 5 In what cycle does the or read $s0? 2nd half of cycle 5 Stall - wasted cycles

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

IF IF

Correct, Slow Execution

Only the register file reads and writes in half a cycle. All other stages take a full cycle; this is because of shared hardware.
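The number of wasted cycles can be worked out from the register file convention (write in the 1st half of a cycle, read in the 2nd half): the consumer's ID stage may share a cycle with the producer's WB stage, but may not come earlier. A sketch for a machine without forwarding; the function name and the "issue cycle" parameter (the cycle of the IF stage) are made up for illustration.

```python
STAGE_OFFSET = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def stalls_without_forwarding(producer_issue_cycle, consumer_issue_cycle):
    """Stall cycles needed so the consumer reads a register after it is written.

    The register file writes in the first half of a cycle and reads in the
    second half, so WB and ID in the same cycle is fine.
    """
    producer_wb = producer_issue_cycle + STAGE_OFFSET["WB"]
    consumer_id = consumer_issue_cycle + STAGE_OFFSET["ID"]
    return max(0, producer_wb - consumer_id)


# add $s0, $0, $0 is fetched in cycle 1, or $s3, $s0, $t3 in cycle 2.
stalls = stalls_without_forwarding(producer_issue_cycle=1, consumer_issue_cycle=2)
print(stalls)
```

With two stall cycles the or reads $s0 in the 2nd half of cycle 5, right after the add writes it in the 1st half of cycle 5, which matches the "Correct, Slow Execution" slide.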

slide-85
SLIDE 85

Barriers to pipelined performance

  • Uneven stages
  • Pipeline register delays
slide-86
SLIDE 86

Barriers to pipelined performance

  • Uneven stages
  • Pipeline register delays
  • Data Hazards
slide-87
SLIDE 87

Barriers to pipeline performance

  • Uneven stages
  • Pipeline register delays
  • Data Hazards

      • An instruction depends on the result of a previous instruction still in the pipeline

slide-88
SLIDE 88

Solutions?

  • What can we try to reduce data hazards or their effect?

slide-89
SLIDE 89

Time-> add $s0, $0, $0

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

Easy Right? Not so fast.

In what cycle does the add write $s0? 1st half of cycle 5 In what cycle does the or read $s0? 2nd half of cycle 5 Stall - wasted cycles

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

IF IF

Default (do nothing): Stall

slide-90
SLIDE 90

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB

In what cycle is $s0 calculated in the machine? In what cycle is $s0 used in the machine?

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 1: Data Forwarding

lw $s0, 0($t4)

WB

ID

slide-91
SLIDE 91

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB

In what cycle is $s0 calculated in the machine? End of cycle 4 In what cycle is $s0 used?

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 1: Data Forwarding

lw $s0, 0($t4)

WB

ID

slide-92
SLIDE 92

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB

In what cycle is $s0 calculated in the machine? End of cycle 4 In what cycle is $s0 used? beginning of cycle 4

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 1: Data Forwarding

lw $s0, 0($t4)

WB

ID

slide-93
SLIDE 93

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB

In what cycle is $s0 calculated in the machine? end of cycle 4 In what cycle is $s0 used? beginning of cycle 5

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 1: Data Forwarding

lw $s0, 0($t4) ID IF

WB

ID

slide-94
SLIDE 94

Data-Forwarding Where are those wires?

[Figure: pipelined datapath. PC and instruction memory (Fetch); register file with src1, src2, and destreg ports, plus sign extension of the 16-bit immediate to 32 bits (Decode); Execute; data memory (Memory); Writeback. Pipeline registers separate the stages.]

slide-95
SLIDE 95

Data-Forwarding Where are those wires?

[Figure: pipelined datapath. PC and instruction memory (Fetch); register file with src1, src2, and destreg ports, plus sign extension of the 16-bit immediate to 32 bits (Decode); Execute; data memory (Memory); Writeback. Pipeline registers separate the stages.]

slide-96
SLIDE 96

Data Forwarding Example 2

  • Draw the timing diagram with data forwarding
  • Draw arrows to indicate data passing through forwarding

[Timing diagram exercise, cycles 1 to 12, for the instructions lw $t0, 0($s0); add $s2, $s2, $t0; sw $s2, 0($s0); addi $t0, $t0, 1.]

slide-97
SLIDE 97

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB

IF IF Stall - wasted cycles

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 2: Instruction Reordering (Before reordering)

lw $s0, 0($t4)

WB

ID

slide-98
SLIDE 98

Time->

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

sw $s2, 0($t1) and $s6, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 2: Instruction Reordering (After Reordering)

lw $s0, 0($t4)

ID

WB

slide-99
SLIDE 99

Who reorders instructions?

  • Static scheduling
      • Compiler
      • Simpler, but does not know when caches miss or loads/stores are to the same locations
  • Dynamic scheduling
      • Hardware
      • More complicated, but has all knowledge

slide-100
SLIDE 100

Time->

or $s3, $s0, $t3

IF ID IF ID

MEM

1 2 3 4 5 6 7 8

MEM WB WB

sw $s3, 0($t1) and $s0, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 2: Instruction Reordering

lw $s0, 0($t4)

ID

WB

slide-101
SLIDE 101

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB WB

sw $s3, 0($t1) and $s0, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 2: Instruction Reordering

lw $s0, 0($t4)

ID

WB

Is this the same execution?!?

slide-102
SLIDE 102

Time->

or $s3, $s0, $t3

IF ID IF

MEM

1 2 3 4 5 6 7 8

MEM WB WB

sw $s3, 0($t1) and $s0, $s4, $t3

IF ID IF ID

WB MEM MEM WB

Solution 2: Instruction Reordering

lw $s0, 0($t4)

ID

WB

Is this the same execution?!?

slide-103
SLIDE 103

How Are Data Hazards Detected

A.K.A. The Official Lawson Johnson Aside

slide-104
SLIDE 104

First: Pipelined Control

slide-105
SLIDE 105

Recall: Pipeline Registers

  • IF/ID
      • 32b instruction
      • 32b nPC
  • ID/EX
      • 32b register
      • 32b register
      • 32b immediate field
      • 32b nPC
  • EX/MEM
      • Zero
      • 32b ALU result
      • 32b nPC
      • 32b register value
  • MEM/WB
      • 32b ALU result
      • 32b memory value
  • Named for two stages they separate
  • Store all data corresponding to lines that go through them

slide-106
SLIDE 106

Pipelined Control

  • PC is written on each clock cycle
  • No need for write signals to pipeline registers, since they are written on each clock cycle
  • Observation: Each control line is associated with a component active in only one stage of the pipeline
      • So we can divide control lines into five groups, according to pipeline stage

slide-107
SLIDE 107

Pipelined Control

slide-108
SLIDE 108

The Hazard Detection Pseudocode

  • “Code” operates during ID stage
  • Looks something like this
slide-109
SLIDE 109

The Hazard Detection Pseudocode

  • In English: if the instruction in the execute phase is a load (the only instruction that reads memory), and if the register being written by the load is either of the source registers for the following instruction, stall the pipeline
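The English description above can be written as a small predicate. This is only a sketch of that description, not the slide's own pseudocode (which did not survive extraction); the class and field names (`IDEX`, `IFID`, `mem_read`, `rs`, `rt`) are assumptions chosen to mirror the pipeline register names used earlier in the deck.

```python
from dataclasses import dataclass


@dataclass
class IDEX:
    """Hypothetical view of the ID/EX register: the instruction now in Execute."""
    mem_read: bool  # True only for a load
    rt: int         # register number the load will write


@dataclass
class IFID:
    """Hypothetical view of the IF/ID register: the instruction now in Decode."""
    rs: int         # first source register number
    rt: int         # second source register number


def must_stall(id_ex, if_id):
    """Stall if the instruction in EX is a load whose destination is a source of the next one."""
    return id_ex.mem_read and id_ex.rt in (if_id.rs, if_id.rt)


# MIPS register numbers: $s0 = 16, $t3 = 11.
# lw $s0, 0($t4) followed by or $s3, $s0, $t3.
load_then_use = must_stall(IDEX(mem_read=True, rt=16), IFID(rs=16, rt=11))
# A non-load in EX never triggers this check, even with a matching register.
alu_then_use = must_stall(IDEX(mem_read=False, rt=16), IFID(rs=16, rt=11))
print(load_then_use, alu_then_use)
```

In hardware this check is combinational logic evaluated during the ID stage; the Python form is only meant to make the condition explicit.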

slide-110
SLIDE 110

How is the Pipeline Stalled?

  • If ID stage is stalled, IF stage must also be stalled
      • Otherwise we lose the instruction that was to be fetched next
      • In concrete terms: PC and IF/ID pipeline registers are prevented from changing
          • (Basically, the same instruction is repeatedly fetched and loaded into the IF/ID pipeline register.)

slide-111
SLIDE 111

How is the Pipeline Stalled?

  • And “back half” (EX/M/WB) phases must be doing “nothing”
      • Like restarting the wash but letting dryer continue to tumble empty
      • Accomplished with NOP instruction(s)
      • Turns out: deasserting all control signals into EX/M/WB stages creates a NOP instruction in those stages
      • Note these are percolated forward (doing nothing), which is what we want
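The two halves of a stall (freeze the front, send a bubble down the back) can be sketched as one clock-edge update. This is an illustration only: the control signal names and the function are hypothetical, and a real control unit drives more signals than the three shown.

```python
# Hypothetical control signals; all deasserted means the instruction does nothing.
NOP_CONTROL = {"reg_write": False, "mem_read": False, "mem_write": False}


def advance_front_end(pc, if_id, fetched, decoded_control, stall):
    """Return (next_pc, next_if_id, control_sent_to_EX) for one clock edge.

    pc              -- current program counter
    if_id           -- instruction currently held in the IF/ID register
    fetched         -- instruction just read from instruction memory
    decoded_control -- control signals decoded for the instruction in ID
    stall           -- True when the hazard check asks for a stall
    """
    if stall:
        # PC and IF/ID keep their values, so the same instruction is fetched
        # and decoded again; EX/M/WB receive a NOP (all control deasserted).
        return pc, if_id, dict(NOP_CONTROL)
    return pc + 4, fetched, decoded_control


decoded = {"reg_write": True, "mem_read": False, "mem_write": False}
stalled = advance_front_end(100, "or $s3, $s0, $t3", "sw $s2, 0($t1)", decoded, stall=True)
normal = advance_front_end(100, "or $s3, $s0, $t3", "sw $s2, 0($t1)", decoded, stall=False)
print(stalled)
print(normal)
```

The bubble created on a stall simply moves forward one stage per cycle like any other instruction, which is the "percolated forward" behavior the slide describes.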