Origami: Folding Warps for Energy Efficient GPUs



  1. Origami: Folding Warps for Energy Efficient GPUs
 Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*
 * University of Southern California
 † University of California, Riverside
 ‡ Stanford University

  2. Outline
 • GPU overview
 • Motivation and related work
 • Warp Folding
 • Origami Scheduler
 • Evaluation

  3. GPGPU Overview (GTX480)
 [Figure: block diagram of a streaming multiprocessor (SM): instruction cache, fetch and decode, 2-level warp scheduler, 128KB register file, execution units (cores labeled C, each with operands, an INT unit, an FP unit, and a result queue), SFUs, LD/ST units, and 64KB shared memory/L1 cache]

  5. GPGPU Power Break-Down
 • DRAM: 17.8%
 • EXE: 20.1% (highlighted on the slide)
 • L2: 4.5%
 • RF: 13.4%
 • MC: 4.8%
 • Other: 7.2%
 • Pipeline: 11.4%
 • NOC: 9.5%
 • Constant: 11.2%
 Source: GPUWattch, ISCA 2013

  7. GPU Scaling Trend
 GPU               Fermi GTX 480   Kepler GTX 680   Maxwell GTX 980
 Cores (SMs)       16              8                16
 Execution Units   512             1536             2048
 RF size           128KB/SM        256KB/SM         256KB/SM
 #transistors      3 billion       3.5 billion      5.2 billion

  12. Technology Scaling
 • As technology scales, leakage power will increase
 – Accounts for 50% of the execution units' power
 • Power gating can be used to reduce the leakage power
 – Needs long idle periods to be effective
 Source: Warped-Gates, MICRO 2013

  13. Power Gating Challenges in GPGPUs
 • Integer unit idle period length distribution for hotspot
 – Assumes a 5-cycle idle-detect window and a 14-cycle break-even time (BET)
 [Figure: histogram of idle period lengths, with regions labeled Lost Opportunity, Energy Loss or Neutral, and Energy Savings]
 • Need to increase idle period length
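
The three regions of the idle-period histogram can be made concrete with a minimal sketch. It assumes a common power-gating model: a unit is gated only after `idle_detect` consecutive idle cycles, and gating saves energy only if the unit then stays gated for longer than `bet` cycles. The function names, the exact region boundaries, and the sample idle periods are illustrative assumptions, not data from the presentation.

```python
def classify_idle_period(length, idle_detect=5, bet=14):
    """Classify one idle period (in cycles) under a simple power-gating model.

    Assumed boundaries (not stated explicitly on the slide):
      - length < idle_detect: the unit is never gated -> lost opportunity
      - idle_detect <= length <= idle_detect + bet: the unit is gated but
        woken at or before break-even -> energy loss or neutral
      - length > idle_detect + bet: gated past break-even -> energy savings
    """
    if length < idle_detect:
        return "lost opportunity"
    if length <= idle_detect + bet:
        return "energy loss or neutral"
    return "energy savings"


def summarize(idle_periods, idle_detect=5, bet=14):
    """Return the fraction of idle periods that falls in each region."""
    counts = {
        "lost opportunity": 0,
        "energy loss or neutral": 0,
        "energy savings": 0,
    }
    for length in idle_periods:
        counts[classify_idle_period(length, idle_detect, bet)] += 1
    total = len(idle_periods)
    if total == 0:
        return counts
    return {region: n / total for region, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical idle-period lengths in cycles, for illustration only.
    sample = [1, 2, 2, 3, 4, 6, 9, 12, 20, 40]
    print(summarize(sample))
```

Under this model, lengthening idle periods moves mass from the first two regions into the third, which is the motivation stated on the slide.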

  19. Warp Scheduler Effect on Power Gating
 [Figure: ready warps queue with interleaved INT and FP warps issued to the INT and FP pipelines; the cores of each pipeline alternate between busy and idle cycles]
 • Idle periods are interrupted by instructions that are greedily scheduled
 • Need to coalesce warp issues by resource type
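
To illustrate why coalescing warp issues by resource type helps, the following minimal sketch measures the idle periods seen by one pipeline under two issue orders of the same instructions. It assumes one instruction issues per cycle and occupies its pipeline for exactly that cycle; the function name and both instruction streams are hypothetical.

```python
def idle_periods(issue_order, unit):
    """Return lengths of consecutive cycles in which `unit` gets no instruction.

    Simplifying assumption: one instruction issues per cycle and keeps its
    unit busy for exactly that cycle.
    """
    periods = []
    run = 0
    for instr_type in issue_order:
        if instr_type == unit:
            if run:
                periods.append(run)
            run = 0
        else:
            run += 1
    if run:
        periods.append(run)
    return periods


# Hypothetical stream: the same eight instructions in two different orders.
interleaved = ["INT", "FP", "INT", "FP", "INT", "FP", "INT", "FP"]
coalesced = ["INT"] * 4 + ["FP"] * 4  # grouped by resource type

if __name__ == "__main__":
    print("interleaved, INT idle periods:", idle_periods(interleaved, "INT"))
    print("coalesced,   INT idle periods:", idle_periods(coalesced, "INT"))
```

Both orders leave the INT pipeline idle for four cycles in total, but the coalesced order produces one four-cycle idle period instead of four one-cycle periods. Longer contiguous idle periods are what power gating needs.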

  22. Related Work/Warped-Gates*
 • Schedule instructions based on their type
 • Force power gated units to stay in the power gating state for at least the break-even time
 [Figure: idle period length distribution; x-axis: idle period length (0 to 25), y-axis: frequency (0% to 30%); region annotations: 54.3%, 0.0%, 45.7%]
 *Warped-Gates, MICRO 2013

  24. Fine grain idleness
 • Temporal idleness
 – Infrequent issues to the same pipeline
 – Finely interspersed, leading to limited power gating opportunities
 [Figure: the scheduler issues from an interleaved stream of INT and FP warps; the SP0 timeline shows Cycle X: active mask 1111 1111, Cycle X+1: bubble, Cycle X+2: active mask 1111 1111, Cycle X+3: bubble]

  33. Fine grain idleness
 • Spatial idleness
 – Lanes have different activity
 • Branch divergence
 • Insufficient parallelism

  34. Warp Folding
 ➢ Improve the power gating potential by coalescing the pipeline bubbles

  35. Warp Folding
 [Figure: the scheduler issues a warp from the ready warps queue with active mask 1 1 1 1 1 1 1 1, followed by a bubble; folding splits the warp into two sub-warps that occupy both issue slots]
 Sub_Warp0: 1 1 1 1 0 0 0 0
 Sub_Warp1: 1 1 1 1 0 0 0 0
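
The folding step shown in the figure can be sketched as follows. This is a minimal illustration, assuming the warp's active mask is split into equal contiguous groups of threads and each group is issued in its own cycle on the lowest-numbered lanes. `fold_warp` and its parameters are hypothetical names, and the actual hardware mapping of threads to lanes may differ.

```python
def fold_warp(active_mask, fold_factor=2):
    """Split a warp's active mask into `fold_factor` sub-warps.

    Each sub-warp carries one contiguous group of threads, placed on the
    lowest lanes; the remaining lanes are padded with 0 (idle).
    Assumes the warp width is divisible by fold_factor.
    """
    width = len(active_mask)
    if width % fold_factor != 0:
        raise ValueError("warp width must be divisible by fold_factor")
    group = width // fold_factor
    sub_warps = []
    for i in range(fold_factor):
        threads = list(active_mask[i * group:(i + 1) * group])
        sub_warps.append(threads + [0] * (width - group))
    return sub_warps


if __name__ == "__main__":
    # 8-lane example from the slide: a fully active warp followed by a bubble.
    for i, mask in enumerate(fold_warp([1, 1, 1, 1, 1, 1, 1, 1])):
        print(f"Sub_Warp{i}:", *mask)
```

In this example the upper four lanes stay idle for two consecutive cycles, rather than all eight lanes being idle for a single bubble cycle, which is how folding coalesces bubbles into longer idle periods on part of the pipeline.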
