
CS 3330: Pipelining, 6 October 2016



  1. CS 3330: Pipelining (6 October 2016)

  2. Human pipeline: laundry. [Figure: timelines from 11:00 to 14:00 in which loads of whites, colors, and sheets pass through the washer, the dryer, and the folding table.]

  4. Waste (1). [Figure: laundry timeline from 11:00 to 14:00 (washer, dryer, folding table) with two gaps labeled "wasted time!".]

  6. Waste (2). [Figure: laundry timeline from 11:00 to 14:00 with loads of whites, colors, and sheets in the washer, the dryer, and at the folding table.]

  7. Latency: Time for One. Normal latency: 1.8 h. Pipelined latency: 2.1 h. [Figure: laundry timeline from 11:00 to 14:00 with both latencies marked.]

  10. Throughput: Rate of Many. Time between starts: 0.83 h. Time between finishes: 0.83 h. Throughput: 1 load / 0.83 h ≈ 1.2 loads/h. [Figure: laundry timeline from 11:00 to 14:00 with both intervals marked.]
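
The latency and throughput figures on the laundry slides can be checked with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, the inputs are only the numbers quoted on the slides (1.8 h, 2.1 h, 0.83 h), and the one-load-at-a-time comparison is an assumption added for contrast.

```python
# Numbers quoted on the laundry slides.
normal_latency_h = 1.8           # one load with nothing else in the way
pipelined_latency_h = 2.1        # one load when loads overlap
time_between_finishes_h = 0.83   # gap between consecutive finished loads

# Throughput is the rate at which loads finish, not the inverse of latency.
throughput_loads_per_h = 1 / time_between_finishes_h

# Assumption for comparison: run each load start to finish before the next.
one_at_a_time_loads_per_h = 1 / normal_latency_h

print(f"pipelined throughput:     {throughput_loads_per_h:.1f} loads/h")
print(f"one-at-a-time throughput: {one_at_a_time_loads_per_h:.2f} loads/h")
print(f"extra latency per load:   {pipelined_latency_h - normal_latency_h:.1f} h")
```

Pipelining makes each individual load slightly slower here, but loads finish more often.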

  13. times three circuit. Two chained adders compute A + A = 2 × A and then 2 × A + A = 3 × A (example values: 7, 14, 21). The timeline runs 0 ps, 50 ps, 100 ps: 100 ps latency ⇒ 10 results/ns throughput.

  16. times three and repeat. [Figure: timelines from 0 ps to 500 ps in which the two-adder circuit (A + A = 2 × A, then 2 × A + A = 3 × A) is reused for successive inputs; values shown include 7, 14, 21; 17, 34, 51; 4, 8, 12; and 23, 46, 69.]

  18. pipelined times three. [Figure: the two adders separated by registers, with signals labeled A(t+2), A(t+1), 2 × A(t+1), and 3 × A(t+0); example values 7, 14, 21 and 17, 34, 51.]
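
The behavior of the pipelined times-three circuit can be sketched as a small cycle-by-cycle simulation. This is an illustration under assumptions, not the slide's circuit: the function name, the register names, and the use of `None` for an empty register are all hypothetical. The middle register carries A alongside 2 × A so that the second adder still has A available one cycle later.

```python
def pipelined_times_three(inputs):
    """Simulate two adders separated by registers; return the 3*A results in order."""
    reg_in = None    # holds A waiting to enter the first adder
    reg_mid = None   # holds (A, 2*A) between the two adders
    results = []     # stands in for the output register
    # Feed every input, then two empty cycles so the pipeline drains.
    for new_a in list(inputs) + [None, None]:
        # Second adder: finish the older value, 2*A + A.
        if reg_mid is not None:
            a, two_a = reg_mid
            results.append(two_a + a)
        # First adder: start the newer value, A + A.
        next_mid = None if reg_in is None else (reg_in, reg_in + reg_in)
        # Clock edge: all registers take their new values together.
        reg_mid, reg_in = next_mid, new_a
    return results


print(pipelined_times_three([7, 17, 4, 23]))
```

In every cycle both adders are busy with different inputs, which is where the extra throughput comes from.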

  20. register tolerances. [Figure: timing diagram of a register's input and output, marking the window in which the input must not change, the point at which the output changes, and the register delay.]

  23. times three pipeline timing. Delays along the pipeline: 10 ps (register), 50 ps (ADD), 10 ps (register), 50 ps (ADD), 10 ps (register). Throughput: 1 / 60 ps ≈ 16 G operations/sec. [Figure: pipelined adders with signals A(t+2), A(t+1), 2 × A(t+1), and 3 × A(t+0).]
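
The throughput figure follows from the cycle time: one register delay plus one adder delay. Below is a minimal sketch of that arithmetic. The function and constant names are illustrative; the delays are the ones on the slide. Computed exactly, 1 / 60 ps is about 16.7 G operations/sec, which the slide gives as ≈ 16.

```python
REGISTER_DELAY_PS = 10   # from the slide
ADD_DELAY_PS = 50        # from the slide


def throughput_gops(cycle_time_ps):
    """Billions of operations per second when one result completes per cycle."""
    return 1000.0 / cycle_time_ps   # 1 / (t ps) equals (1000 / t) G ops/sec


# Unpipelined: both adders in sequence, 100 ps per result (registers not counted).
unpipelined_ps = 2 * ADD_DELAY_PS
# Pipelined: each cycle covers one register and one adder.
pipelined_ps = REGISTER_DELAY_PS + ADD_DELAY_PS

print(f"unpipelined: {throughput_gops(unpipelined_ps):.1f} G ops/sec")
print(f"pipelined:   {throughput_gops(pipelined_ps):.1f} G ops/sec")
```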

  25. deeper pipeline. Throughput: 1 / 35 ps ≈ 28 G ops/sec. Problem: Can we even do this? Problem: How much faster can we get? [Figure: each adder split into two stages that pass partial results, so 10 ps registers alternate with four 25 ps logic stages; signals labeled A(t+2), A(t+1), 2 × A(t+1), and 3 × A(t+0).]

  30. diminishing returns: register delays. Each stage pays a 10 ps register delay on top of its share of the logic: logic (all), 100 ps, 110 ps per cycle; two stages of 50 ps, 60 ps per cycle; three stages of 33 ps, 43 ps per cycle; …; stages of 1 ps, 11 ps per cycle.

  31. diminishing returns: register delays. [Plot: time per completion (ps, 0 to 120) against number of stages (up to 14), with the register delay marked; annotations "1.83x speedup" and "1.02x speedup".]

  34. diminishing returns: register delays. [Plot: throughput (ops/ns, 0 to 100) against number of stages (up to 14), with the max. rate of register updates marked; annotations "1.83x throughput" and "1.02x throughput".]
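
The shape of these curves can be reproduced from the slide's two numbers, 100 ps of logic and a 10 ps register delay. The sketch below assumes the logic splits perfectly evenly, which is an idealization; the function names are illustrative. Going from one stage to two gives 110 / 60 ≈ 1.83x, matching the first annotation, and the cycle time can never drop below the 10 ps register delay, so throughput stays under 100 ops/ns.

```python
LOGIC_PS = 100.0     # total logic delay, from the slide
REGISTER_PS = 10.0   # delay added by each pipeline register, from the slide


def cycle_time_ps(stages):
    """Time per completion if the logic divides evenly into `stages` pieces."""
    return LOGIC_PS / stages + REGISTER_PS


def speedup(from_stages, to_stages):
    """How much faster results complete after changing the stage count."""
    return cycle_time_ps(from_stages) / cycle_time_ps(to_stages)


for n in (1, 2, 3, 4, 100):
    t = cycle_time_ps(n)
    # 1 / (t ps) equals (1000 / t) operations per nanosecond.
    print(f"{n:3d} stages: {t:6.1f} ps per cycle, {1000 / t:5.1f} ops/ns")

print(f"1 -> 2 stages: {speedup(1, 2):.2f}x")
```

Each added stage removes less logic from the cycle than the one before it, while the register delay stays fixed.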

  36. deeper pipeline (shown again). Problem: Can we even do this? Problem: How much faster can we get? [Figure: four pipeline stages passing partial results, with 10 ps registers alternating with 25 ps logic stages.]

  37. diminishing returns: uneven split. Can we split up some logic (e.g. adder) arbitrarily? Probably not... With a 10 ps register delay: logic (all), 100 ps, 110 ps per cycle; two stages of 60 ps and 45 ps, 70 ps per cycle; three stages of 40 ps, 35 ps, and 30 ps, 50 ps per cycle.
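
With an uneven split, the slowest stage sets the clock. The sketch below applies that rule to the stage delays on this slide. The pairing of delays with stages is read from garbled slide labels and should be treated as an assumption, as should the function name.

```python
REGISTER_PS = 10   # register delay, from the slide


def cycle_time_ps(stage_logic_ps):
    """The clock period must cover the slowest stage plus one register delay."""
    return max(stage_logic_ps) + REGISTER_PS


# Stage delays as read from the slide (assumed pairing).
splits = {
    "1 stage": [100],
    "2 stages": [60, 45],
    "3 stages": [40, 35, 30],
}

for name, stages in splits.items():
    print(f"{name}: {cycle_time_ps(stages)} ps per cycle")
```

The faster stages sit idle for part of every cycle, so an uneven split gains less than an even one with the same number of stages.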

  38. addq processor. [Figure: addq datapath divided into fetch, decode, execute, writeback, and PC update. Labels: PC, Instr. Mem., split, register file (srcA, srcB, dstE, dstM, R[srcA], R[srcB], next R[dstE], next R[dstM]), two ADD units, add 2, and 0xF. Annotation: signal skips two stages.]

  42. pipelined addq processor. [Figure: addq datapath with pipeline registers labeled fetch/fetch, fetch/decode, decode/execute, and execute/writeback. Labels: PC, Instr. Mem., split, register file (srcA, srcB, dstE, dstM, R[srcA], R[srcB], next R[dstE], next R[dstM]), two ADD units, add 2, and 0xF.]
