Pipelining (part 1)


SLIDE 1

Pipelining (part 1)

SLIDE 2

Human pipeline: laundry

[Diagram: two Washer / Dryer / Folding Table timelines from 11:00 to 14:00, showing loads of whites, colors, and sheets moving through the three stations.]

SLIDE 4

Waste (1)

[Diagram: Washer / Dryer / Folding Table timeline, 11:00 to 14:00, with loads of whites, colors, and sheets done one at a time; the gaps where stations sit idle are marked "wasted time!".]

SLIDE 6

Waste (2)

[Diagram: the same Washer / Dryer / Folding Table timeline, 11:00 to 14:00, with loads of whites, colors, and sheets.]

SLIDE 7

Latency — Time for One

[Diagram: Washer / Dryer / Folding Table timeline, 11:00 to 14:00, with loads of whites, colors, and sheets; the colors load is marked with a pipelined latency of 2.1 h versus a normal latency of 1.8 h.]

SLIDE 10

Throughput — Rate of Many

[Diagram: pipelined Washer / Dryer / Folding Table timeline, 11:00 to 14:00; the time between starts and the time between finishes are both 0.83 h.]

1 load / 0.83 h ≈ 1.2 loads/hour
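The throughput figure on this slide is just the reciprocal of the stage time; a quick sanity check in Python, using the 0.83 h value from the slide:

```python
# Laundry pipeline throughput: one load finishes every stage time.
stage_time_h = 0.83                      # time between finishes, from the slide
throughput_loads_per_h = 1 / stage_time_h
print(round(throughput_loads_per_h, 1))  # 1.2 loads/hour
```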

SLIDE 13

times three circuit

[Diagram: input A feeds one ADD computing A + A = 2 × A, which feeds a second ADD computing 2A + A = 3 × A; example input 7 produces 14 and then 21; timeline from 0 ps to 100 ps, 50 ps per ADD.]

100 ps latency ⇒ 10 results/ns throughput
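The latency-to-throughput conversion on this slide is unit arithmetic: with no pipelining, one result finishes per full latency. A minimal check in Python:

```python
# Unpipelined circuit: one result per full 100 ps latency.
latency_ps = 100
results_per_ns = 1000 / latency_ps   # 1000 ps in one ns
print(results_per_ns)  # 10.0 results/ns
```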

SLIDE 16

times three and repeat

[Diagram: the same two-ADD circuit run back-to-back, one input every 100 ps; inputs 7, 17, 4, 1, 23 produce intermediate values 14, 34, 8, 2, 46 and results 21, 51, 12, 3, 69 over a 0 to 500 ps timeline.]

SLIDE 18

pipelined times three

[Diagram: the two-ADD circuit with pipeline registers inserted between the stages; at one instant the circuit holds A (t + 2) at the input, A (t + 1) and 2 × A (t + 1) between the adders, and 3 × A (t + 0) at the output; example values 7, 14, 21 and 17, 34 are in flight at once.]

SLIDE 20

register tolerances

[Diagram: register timing — after the clock edge, the register output changes (after the register delay); around the clock edge, the register input must not change.]

SLIDE 23

times three pipeline timing

[Diagram: pipelined two-ADD circuit; delays along the path are 10 ps (register), 50 ps (ADD), 10 ps (register), 50 ps (ADD), 10 ps (register), so each stage takes 60 ps.]

throughput: 1 / 60 ps ≈ 16 G operations/sec
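The cycle time here is one register delay plus one stage's logic; a small Python check of the slide's 16 G figure:

```python
# Pipelined circuit: one result per cycle; cycle = register delay + ADD delay.
register_delay_ps = 10
add_delay_ps = 50
cycle_ps = register_delay_ps + add_delay_ps   # 60 ps
ops_per_sec = 1 / (cycle_ps * 1e-12)
print(f"{ops_per_sec / 1e9:.1f} G ops/sec")   # 16.7, i.e. ~16 G as on the slide
```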

SLIDE 25

deeper pipeline

[Diagram: each 50 ps ADD split into two 25 ps stages, giving a four-stage pipeline; A (t + 4) at the input, A (t + 3) and partial results for 2 × A behind it, 2 × A (t + 2) and A (t + 2) mid-pipeline, partial results for 3 × A, and 3 × A (t + 0) at the output; delays alternate 10 ps (register) and 25 ps (logic).]

exercise: throughput now?
throughput: 1 / 35 ps ≈ 28 G ops/sec

exercise: throughput now if the second add is split unevenly into 30 ps and 20 ps stages?

Problem: How much faster can we get? Problem: Can we even do this?
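Splitting each ADD in half halves the logic per stage but not the register delay; checking the slide's 28 G answer:

```python
# Deeper pipeline: each 50 ps ADD split into two 25 ps stages.
register_delay_ps = 10
stage_logic_ps = 25
cycle_ps = register_delay_ps + stage_logic_ps   # 35 ps
print(f"{1 / (cycle_ps * 1e-12) / 1e9:.1f} G ops/sec")  # 28.6
```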

SLIDE 30

diminishing returns: register delays

logic (all): 100 ps logic + 10 ps register delay → 110 ps per cycle
logic (1/2), (2/2): 50 ps logic each + 10 ps register delay → 60 ps per cycle
logic (1/3), (2/3), (3/3): 33 ps logic each + 10 ps register delay → 43 ps per cycle
. . .
logic split into 1 ps pieces: 1 ps logic each + 10 ps register delay → 11 ps per cycle
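The pattern in these numbers is cycle time = total logic / stages + register delay: the logic divides, the register delay does not. A sketch reproducing the slide's figures:

```python
# Cycle time as the pipeline deepens: logic divides, register delay does not.
total_logic_ps = 100
register_delay_ps = 10
for stages in (1, 2, 3, 100):
    cycle_ps = total_logic_ps / stages + register_delay_ps
    print(stages, round(cycle_ps))   # 110, 60, 43, 11 ps per cycle
```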

SLIDE 31

diminishing returns: register delays

[Plot: time per completion (ps, 20 to 120) vs. number of stages (2 to 14); the curve flattens toward the register delay; the first split gives a 1.83x speedup, later splits only about 1.02x.]

SLIDE 34

diminishing returns: register delays

[Plot: throughput (ops/ns, up to 100) vs. number of stages; 1.83x throughput from the first split, only about 1.02x from later ones; throughput approaches a ceiling — the max. rate of register updates.]

SLIDE 36

deeper pipeline

[Diagram: the four-stage pipeline again, but with the second add split unevenly into 30 ps and 20 ps stages.]

exercise: throughput now? (didn't split the second add evenly)
throughput: 1 / 40 ps ≈ 25 G ops/sec

Problem: How much faster can we get? Problem: Can we even do this?
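With an uneven split, the clock must accommodate the slowest stage, not the average; checking the 25 G answer:

```python
# Uneven split: cycle time set by the slowest (30 ps) stage, not the average.
register_delay_ps = 10
slowest_stage_ps = 30
cycle_ps = register_delay_ps + slowest_stage_ps   # 40 ps
print(f"{1 / (cycle_ps * 1e-12) / 1e9:.0f} G ops/sec")  # 25
```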

SLIDE 39

diminishing returns: uneven split

Can we split up some logic (e.g. an adder) arbitrarily? Probably not...

logic (all): 100 ps + 10 ps register delay → 110 ps per cycle
logic (1/2): 60 ps, logic (2/2): 45 ps → slowest stage 60 ps + 10 ps → 70 ps per cycle
logic (1/3): 40 ps, logic (2/3): 40 ps, logic (3/3): 30 ps → slowest stage 40 ps + 10 ps → 50 ps per cycle
. . .
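Each of the splits on this slide follows the same rule: cycle time = largest piece + register delay. A sketch with the slide's stage sizes:

```python
# Uneven logic splits: the largest piece sets the cycle time.
register_delay_ps = 10
for pieces in ([100], [60, 45], [40, 40, 30]):
    print(max(pieces) + register_delay_ps)   # 110, then 70, then 50
```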

SLIDE 42

textbook SEQ 'stages'

conceptual order only:
Fetch: read instruction memory
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file
PC Update: write PC register

writes happen at the end of the cycle; reads happen "magically", like combinatorial logic, as values become available

SLIDE 45

textbook stages

conceptual order only — pipeline stages:
Fetch/PC Update: read instruction memory; compute next PC
Decode: read register file
Execute: arithmetic (ALU)
Memory: read/write data memory
Writeback: write register file

5 stages
  • one instruction in each
  • compute the next PC so the next fetch can start immediately

SLIDE 47

addq CPU

[Diagram: addq datapath — PC → Instr. Mem. → split into srcA / srcB / dstE fields → register file (reads R[srcA], R[srcB]; writes next R[dstE], next R[dstM]; 0xF means no register) → ADD → back to the register file; a second ADD computes PC + 2 (add 2) for the next PC. Stage labels: fetch and PC update, decode, execute, writeback; one signal skips two stages.]

SLIDE 51

pipelined addq processor

[Diagram: the same addq datapath with pipeline registers inserted — fetch/fetch (for the PC), fetch/decode, decode/execute, and execute/writeback — dividing the datapath into fetch and PC update, decode, execute, and writeback stages.]

SLIDE 55

addq execution

addq %r8, %r9   // (1)
addq %r10, %r11 // (2)

[Diagram: the two instructions flowing through the pipeline — the address of (2) in the fetch/fetch register; reg #s 8, 9 from (1) and then reg #s 10, 11 from (2) passing through fetch/decode; reg # 9 and the values for (1) moving down the later pipeline registers.]

SLIDE 59

addq processor timing

// initially %r8 = 800,
// %r9 = 900, etc.
addq %r8, %r9
addq %r10, %r11
addq %r12, %r13
addq %r9, %r8

cycle | fetch: PC | decode: rA, rB | execute: R[srcA], R[srcB] → dstE | writeback: next R[dstE] → dstE
  1   |   0x0     |                |                                  |
  2   |   0x2     |   8, 9         |                                  |
  3   |   0x4     |   10, 11       |   800, 900 → 9                   |
  4   |   0x6     |   12, 13       |   1000, 1100 → 11                |   1700 → 9
  5   |           |   9, 8         |   1200, 1300 → 13                |   2100 → 11
  6   |           |                |   1700, 800 → 8                  |   2500 → 13
  7   |           |                |                                  |   2500 → 8

SLIDE 64

addq processor performance

example delays:

path                  time
add 2                  80 ps
instruction memory    200 ps
register file read    125 ps
add                   100 ps
register file write   125 ps

no pipelining: 1 instruction per 550 ps
(add up everything except add 2 — the critical (slowest) path)

pipelining: 1 instruction per 200 ps + pipeline register delays
(slowest path through a stage + pipeline register delays)
latency: 800 ps + pipeline register delays (4 cycles)
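The 550 ps, 200 ps, and 800 ps figures follow from the slide's delay table; a quick Python check (add 2 is excluded, since the slide notes it is off the critical path):

```python
# addq processor performance, using the slide's example delays.
stage_delays_ps = {
    "instruction memory": 200,   # fetch
    "register file read": 125,   # decode
    "add": 100,                  # execute
    "register file write": 125,  # writeback
}
unpipelined_ps = sum(stage_delays_ps.values())   # 550 ps per instruction
cycle_ps = max(stage_delays_ps.values())         # 200 ps (+ register delays)
latency_ps = 4 * cycle_ps                        # 800 ps over 4 cycles
print(unpipelined_ps, cycle_ps, latency_ps)      # 550 200 800
```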