

SLIDE 1

CS 3330: Pipelining

6 October 2016

SLIDE 2

Human pipeline: laundry

[Diagram: loads of whites, colors, and sheets move through the Washer, Dryer, and Folding Table stages between 11:00 and 14:00; a new load enters the washer while earlier loads are still drying and folding.]


SLIDE 4

Waste (1)

[Diagram: the same laundry timeline, with "wasted time!" labels marking hours when a stage sits idle.]


SLIDE 6

Waste (2)

[Diagram: the laundry timeline again, highlighting more idle stage time.]

SLIDE 7

Latency — Time for One

[Diagram: the colors load takes 2.1 h start to finish in the pipeline ("pipelined latency (2.1 h)") versus 1.8 h run on its own ("normal latency (1.8 h)").]


SLIDE 10

Throughput — Rate of Many

[Diagram: the time between starts and the time between finishes are both 0.83 h, so throughput is 1 load / 0.83 h ≈ 1.2 loads/hour.]



SLIDE 15

times three circuit

[Circuit: input A feeds one ADD computing A + A = 2 × A, which feeds a second ADD computing 2A + A = 3 × A; with input 7, the values 7 → 14 → 21 appear at 0 ps, 50 ps, and 100 ps.]

100 ps latency ⇒ 10 results/ns throughput
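
As a loose software analogy, here is a minimal C sketch (mine, not the course's) of the combinational circuit above: each call is one pass through both adders, so a new input can only be applied after the whole 100 ps path settles. The inputs are the values from the slides.

    #include <stdio.h>

    /* Combinational times-three: two 50 ps adders in series,
       so the full path takes ~100 ps. */
    static int times_three(int a) {
        int two_a   = a + a;        /* first ADD:  A + A  -> 2*A (50 ps) */
        int three_a = two_a + a;    /* second ADD: 2A + A -> 3*A (50 ps) */
        return three_a;
    }

    int main(void) {
        int inputs[] = {7, 17, 4, 1, 23};   /* values from the slides */
        for (int i = 0; i < 5; i++)
            printf("3 x %d = %d\n", inputs[i], times_three(inputs[i]));
        return 0;
    }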


SLIDE 17

times three and repeat

[Waveform: the same circuit reused on inputs 7, 17, 4, 1, 23 over 0–500 ps; the intermediate 2 × A values are 14, 34, 8, 2, 46 and the 3 × A outputs are 21, 51, 12, 3, 69, one result per 100 ps.]


SLIDE 19

pipelined times three

[Circuit: a pipeline register sits between the two ADDs. While the second ADD finishes 3 × A(t + 0), the register holds 2 × A(t + 1) and A(t + 1), and the first ADD has already started on A(t + 2): for example, output 21 emerges while 17 and the next input are in flight.]
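
The pipeline register can be modeled the same way. In this C sketch (again mine, a model rather than hardware), a struct stands in for the pipeline register; each loop iteration is one clock cycle, the second ADD consumes what was latched last cycle, and the first ADD works on the next input in parallel.

    #include <stdio.h>

    /* Two-stage pipelined times-three: the register between the adders
       holds 2*A(t+1) and A(t+1) while the second adder finishes 3*A(t). */
    struct pipe_reg { int two_a; int a; int valid; };

    int main(void) {
        int inputs[] = {7, 17, 4, 1, 23};
        struct pipe_reg r = {0, 0, 0};
        for (int cycle = 0; cycle < 7; cycle++) {
            /* stage 2: second ADD uses what the register latched last cycle */
            if (r.valid)
                printf("cycle %d: 3 x %2d = %3d\n", cycle, r.a, r.two_a + r.a);
            /* stage 1: first ADD works on the next input in parallel */
            struct pipe_reg next = {0, 0, 0};
            if (cycle < 5) {
                next.two_a = inputs[cycle] + inputs[cycle];
                next.a     = inputs[cycle];
                next.valid = 1;
            }
            r = next;   /* clock edge: register captures stage-1 outputs */
        }
        return 0;
    }

Run, it prints 21, 51, 12, 3, 69, one per cycle: the same results as before, but a new input is accepted every cycle instead of every two adder delays.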

SLIDE 20

register tolerances

[Timing diagram: some time after a clock edge the register output changes; around the edge the register input must not change. The delay from the edge to the settled output is the register delay.]



SLIDE 24

times three pipeline timing

[Circuit timing: 10 ps register, 50 ps ADD, 10 ps register, 50 ps ADD, 10 ps register.]

throughput: 1 / (60 ps) ≈ 16 G operations/sec


SLIDE 27

deeper pipeline

[Circuit: each ADD is split into two 25 ps halves, with 10 ps registers alternating (10, 25, 10, 25, 10, 25, 10, 25, 10 ps); the middle registers carry 2 × A and 3 × A partial results.]

throughput: 1 / (35 ps) ≈ 28 G ops/sec

Problem: How much faster can we get?
Problem: Can we even do this?


SLIDE 30

diminishing returns: register delays

logic (all):            100 ps + 10 ps register = 110 ps per cycle
logic (1/2, 2/2):        50 ps + 10 ps each     =  60 ps per cycle
logic (1/3, 2/3, 3/3):   33 ps + 10 ps each     =  43 ps per cycle
. . .
logic (1 ps slices):      1 ps + 10 ps each     =  11 ps per cycle
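
The pattern in this table is cycle time = (logic delay / number of stages) + register delay. A short C sketch of that formula, using the slide's 100 ps of logic and 10 ps registers, reproduces the numbers above (the 100-stage case is my reading of the final "1 ps" row):

    #include <stdio.h>

    /* cycle time = logic delay / stages + register delay,
       with 100 ps of logic and 10 ps registers (from the slide) */
    int main(void) {
        const double logic_ps = 100.0, reg_ps = 10.0;
        int stages[] = {1, 2, 3, 100};
        for (int i = 0; i < 4; i++) {
            double cycle = logic_ps / stages[i] + reg_ps;
            printf("%3d stage(s): %5.1f ps per cycle\n", stages[i], cycle);
        }
        return 0;
    }

Output: 110.0, 60.0, 43.3, and 11.0 ps per cycle. The register delay never shrinks, so past a point adding stages barely helps.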


SLIDE 33

diminishing returns: register delays

[Graph: time per completion (20–120 ps) versus number of stages (2–14); the curve flattens toward a floor set by the register delay. Labels mark a 1.83x speedup early on and only a 1.02x speedup later.]


SLIDE 35

diminishing returns: register delays

[Graph: throughput (20–100 ops/ns) versus number of stages (2–14), with labels for a 1.83x and a 1.02x throughput gain; throughput approaches the maximum rate of register updates.]


SLIDE 37

diminishing returns: uneven split

Can we split up some logic (e.g. an adder) arbitrarily? Probably not...

logic (all):            100 ps              + 10 ps register = 110 ps per cycle
logic (1/2, 2/2):        60 ps, 45 ps       + 10 ps each     =  70 ps per cycle
logic (1/3, 2/3, 3/3):   40 ps, 35 ps, 30 ps + 10 ps each    =  50 ps per cycle
. . .
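
With an uneven split the clock must fit the slowest stage, so cycle time = max(stage delays) + register delay. A small C sketch with the slide's splits:

    #include <stdio.h>

    /* the clock period must cover the slowest stage plus a register */
    static double cycle_time(const double *stage, int n) {
        double worst = 0.0;
        for (int i = 0; i < n; i++)
            if (stage[i] > worst) worst = stage[i];
        return worst + 10.0;   /* 10 ps register delay, from the slide */
    }

    int main(void) {
        double one[]   = {100.0};               /* unsplit            */
        double two[]   = {60.0, 45.0};          /* uneven 2-way split */
        double three[] = {40.0, 35.0, 30.0};    /* uneven 3-way split */
        printf("1 stage : %.0f ps/cycle\n", cycle_time(one, 1));    /* 110 */
        printf("2 stages: %.0f ps/cycle\n", cycle_time(two, 2));    /*  70 */
        printf("3 stages: %.0f ps/cycle\n", cycle_time(three, 3));  /*  50 */
        return 0;
    }

An even 2-way split would give 60 ps per cycle; the uneven 60/45 split pays 70 ps, because the fast stage just waits.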

SLIDE 38

addq processor

[Datapath: the PC feeds Instr. Mem.; the fetched instruction is split into srcA, srcB, dstM, dstE for the register file, which produces R[srcA] and R[srcB] and accepts next R[dstM] and next R[dstE]; 0xF means "no register". One ADD computes the sum; a second ADD ("add 2") updates the PC. Stages: fetch and PC update, decode, execute, writeback; one signal skips two stages.]


SLIDE 42

pipelined addq processor

[The same datapath with pipeline registers inserted: fetch/fetch (the PC register), fetch/decode, decode/execute, and execute/writeback.]


SLIDE 45

addq execution

addq %r8, %r9   // (1)
addq %r10, %r11 // (2)

[Snapshot of the pipeline registers: the fetch/fetch register holds the address of (2); the fetch/decode register holds reg #s 8, 9 from (1), then reg #s 10, 11 from (2); the decode/execute register holds reg # 9 and the values for (1).]


SLIDE 49

addq processor timing

// initially %r8 = 800, %r9 = 900, etc.
addq %r8, %r9
addq %r10, %r11
addq %r12, %r13
addq %r9, %r8

Pipeline register contents by cycle (fetch | fetch/decode | decode/execute | execute/writeback):

cycle | PC  | rA rB | R[srcA] R[srcB] dstE | next R[dstE] dstE
  1   | 0x0 |       |                      |
  2   | 0x2 | 8  9  |                      |
  3   | 0x4 | 10 11 |  800  900  9         |
  4   | 0x6 | 12 13 | 1000 1100 11         | 1700  9
  5   |     |  9  8 | 1200 1300 13         | 2100 11
  6   |     |       | 1700  800  8         | 2500 13
  7   |     |       |                      | 2500  8

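The table can be reproduced with a small cycle-level simulator. This C sketch (my reconstruction, not course code) keeps one variable per pipeline register and advances them once per loop iteration; note that the fourth addq reads the updated %r9 = 1700 because it sits three instructions after the write, the same spacing the later slides achieve with nops.

    #include <stdio.h>

    /* Four-stage addq pipeline: fetch, decode, execute, writeback.
       addq src, dst computes R[dst] = R[src] + R[dst]. */
    struct insn { int src, dst; };

    int main(void) {
        struct insn prog[] = {{8, 9}, {10, 11}, {12, 13}, {9, 8}};
        long regs[16];
        for (int r = 0; r < 16; r++) regs[r] = r * 100;   /* %r8 = 800, ... */

        int fd = -1, de = -1, ew = -1;   /* pipeline regs: instr indexes */
        long de_a = 0, de_b = 0, ew_val = 0;
        int pc = 0;

        for (int cycle = 1; cycle <= 7; cycle++) {
            if (ew >= 0) {                         /* writeback */
                regs[prog[ew].dst] = ew_val;
                printf("cycle %d: %%r%d <- %ld\n", cycle, prog[ew].dst, ew_val);
            }
            ew = de; ew_val = de_a + de_b;         /* execute */
            de = fd;                               /* decode: register read */
            if (fd >= 0) { de_a = regs[prog[fd].src]; de_b = regs[prog[fd].dst]; }
            fd = (pc < 4) ? pc++ : -1;             /* fetch */
        }
        return 0;
    }

It prints the four writebacks from the table: %r9 <- 1700 (cycle 4), %r11 <- 2100, %r13 <- 2500, and %r8 <- 2500 (cycle 7).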

SLIDE 54

addq processor performance

[Same datapath as slide 38.] Example delays:

path                 time
add 2 (PC update)     80 ps
instruction memory   200 ps
register file read   150 ps
add                  100 ps
register file write  150 ps

no pipelining: 1 instruction per 600 ps
(add up everything but add 2, which is off the slowest path)

pipelining: 1 instruction per 200 ps + register delay
(slowest path through a stage + register delay)

latency: 800 ps + register delay (4 cycles)
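
The slide's arithmetic as a C sketch; the 20 ps pipeline-register delay is an assumed number purely for illustration (the slide leaves it symbolic):

    #include <stdio.h>

    int main(void) {
        double imem = 200, rf_read = 150, add = 100, rf_write = 150;
        double reg_delay = 20;                    /* assumed, not from slide */

        /* no pipelining: everything but the 80 ps add 2,
           which runs in parallel with the slowest path */
        double unpipelined = imem + rf_read + add + rf_write;     /* 600 ps */

        /* pipelining: slowest stage (instruction memory) + register */
        double cycle = imem + reg_delay;                          /* 220 ps */

        printf("no pipelining: 1 instruction per %.0f ps\n", unpipelined);
        printf("pipelining:    1 instruction per %.0f ps\n", cycle);
        printf("latency:       %.0f ps (4 cycles)\n", 4 * cycle);
        return 0;
    }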

SLIDE 55

OPq processor

[The same pipelined datapath with the execute-stage ADD generalized to an ALU controlled by an ifunc signal carried down the pipeline. Registers: fetch/fetch, fetch/decode, decode/execute, execute/writeback.]


SLIDE 59

addq processor: data hazard

// initially %r8 = 800, %r9 = 900, etc.
addq %r8, %r9
addq %r9, %r8
addq ...
addq ...

cycle | PC  | rA rB | R[srcA] R[srcB] dstE | next R[dstE] dstE
  1   | 0x0 |       |                      |
  2   | 0x2 | 8  9  |                      |
  3   |     | 9  8  |  800  900  9         |
  4   |     |       |  900  800  8         | 1700  9
  5   |     |       |                      | 1700  8

The R[srcA] = 900 in cycle 4 should be 1700: the second addq reads %r9 before the first one has written it.

SLIDE 60

data hazard

addq %r8, %r9 // (1)
addq %r9, %r8 // (2)

step # | pipeline implementation | ISA specification
  1    | read r8, r9 for (1)     | read r8, r9 for (1)
  2    | read r9, r8 for (2)     | write r9 for (1)
  3    | write r9 for (1)        | read r9, r8 for (2)
  4    | write r8 for (2)        | write r8 for (2)

the pipeline reads the older value... instead of the value the ISA says was just written
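
The two orderings in the table can be played out directly in C. This sketch performs the four steps in pipeline order; the stale read of %r9 makes the second addq produce 1700 where the ISA demands 2500:

    #include <stdio.h>

    int main(void) {
        long r8 = 800, r9 = 900;
        long a1 = r8, b1 = r9;   /* step 1: read r8, r9 for (1)            */
        long a2 = r9, b2 = r8;   /* step 2: read r9, r8 for (2): stale %r9! */
        r9 = a1 + b1;            /* step 3: write r9 for (1) => 1700       */
        r8 = a2 + b2;            /* step 4: write r8 for (2) => 1700,
                                    where the ISA order would give 2500    */
        printf("r8 = %ld, r9 = %ld\n", r8, r9);
        return 0;
    }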

SLIDE 61

data hazard compiler solution

addq %r8, %r9
nop
nop
addq %r9, %r8

one solution: change the ISA so that all addqs take effect three instructions later, and make it the compiler's job to insert nops

usually not acceptable

SLIDE 62

data hazard hardware solution

addq %r8, %r9
// hardware inserts: nop
// hardware inserts: nop
addq %r9, %r8

how about the hardware adding the nops? this is called stalling

extra logic: sometimes don't change the PC; sometimes put do-nothing values in the pipeline registers

SLIDE 63

addq processor: data hazard stall

// initially %r8 = 800, %r9 = 900, etc.
addq %r8, %r9
// hardware stalls twice
addq %r9, %r8

cycle | PC   | rA rB | R[srcA] R[srcB] dstE | next R[dstE] dstE
  1   | 0x0  |       |                      |
  2   | 0x2* | 8  9  |                      |
  3   | 0x2* | F  F  |  800  900  9         |
  4   | 0x2  | F  F  |   —    —   F         | 1700  9
  5   |      | 9  8  |   —    —   F         |  —    F
  6   |      |       | 1700  800  8         |  —    F
  7   |      |       |                      | 2500  8

(* = PC held while stalling; F = 0xF = no register)

R[9] written during cycle 3; read during cycle 4


SLIDE 66

control hazard

addq %r8, %r9
je 0xFFFF
addq %r10, %r11

cycle | PC  | SF/ZF | rA rB | R[srcA] R[srcB] dstE
  1   | 0x0 | 0/1   |       |
  2   | 0x2 | 0/1   | 8  9  |
  3   | ??? | 0/1   | F  F  | 800 900 9

next PC: 0xFFFF if R[8] = R[9]; 0x12 otherwise

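The next-PC choice that fetch cannot yet make can be sketched as a one-line mux in C (a hypothetical helper, using the target and fall-through addresses from the slide):

    #include <stdio.h>

    /* Next-PC mux for je.  The condition input comes from the addq
       still in the pipeline, which is exactly why fetch has to wait. */
    static unsigned next_pc(int zf, unsigned target, unsigned fall_through) {
        return zf ? target : fall_through;    /* je: taken iff ZF is set */
    }

    int main(void) {
        int zf = 0;   /* addq %r8, %r9 produced 1700 (nonzero), so ZF = 0 */
        printf("next PC = 0x%X\n", next_pc(zf, 0xFFFF, 0x12));   /* 0x12 */
        return 0;
    }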

SLIDE 68

control hazard: stall

addq %r8, %r9
// insert two nops
je 0xFFFF
addq %r10, %r11

cycle | PC   | SF/ZF | rA rB | R[srcA] R[srcB] dstE | next R[dstE] dstE
  1   | 0x0  | 0/1   |       |                      |
  2   | 0x2* | 0/1   | 8  9  |                      |
  3   | 0x2* | 0/1   | F  F  |  800  900  9         |
  4   | 0x2  | 0/0   | F  F  |   —    —   F         | 1700  9
  5   | 0x10 | 0/0   | F  F  |   —    —   F         |  —    F
  6   |      |       | 10 11 |   —    —   F         |  —    F
  7   |      |       |       | 1000 1100 11         |  —    F

wait two cycles for the addq to update SF/ZF, then execute the je (using SF/ZF)


SLIDE 71

pipelined Y86 CPU

five stages — fetch+PC update / decode / execute / memory / writeback — one per cycle

need: pipeline registers between stages
need: a way of dealing with control hazards
need: a way of dealing with data hazards

stalling, plus two techniques we'll talk about next week

SLIDE 72

pipelining summary

an assembly line for math: divide the work into pieces, and run each piece in parallel for different instructions

increases throughput, but also increases latency
limited by uneven division of work
limited by dependencies ("hazards")
limited by register delays

SLIDE 73

register operations when stalling

"stall" — write disable; keep the old value
"bubble" — write a default (no-operation) value instead of the input

HCL2D provides these directly; if it didn't, you would put a MUX in front of the register input.
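
A C sketch of that mux, for one pipeline register (the 0xF do-nothing value follows the earlier slides' "no register" convention):

    #include <stdio.h>

    /* One pipeline register with stall/bubble control, written as the
       MUX the slide describes. */
    struct pipe_reg { int value; };

    static void clock_edge(struct pipe_reg *r, int input,
                           int stall, int bubble, int nop_value) {
        if (stall) return;                      /* write disable: keep old value */
        r->value = bubble ? nop_value : input;  /* bubble: load no-op default    */
    }

    int main(void) {
        struct pipe_reg fd = { 0 };
        clock_edge(&fd, 42, 0, 0, 0xF); printf("normal: %d\n", fd.value); /* 42 */
        clock_edge(&fd, 77, 1, 0, 0xF); printf("stall:  %d\n", fd.value); /* 42 */
        clock_edge(&fd, 77, 0, 1, 0xF); printf("bubble: %d\n", fd.value); /* 15, i.e. 0xF */
        return 0;
    }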