Origami: Folding Warps for Energy Efficient GPUs
Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*
* University of Southern California
† University of California, Riverside
‡ Stanford University
GPGPU Overview (GTX480)
[Figure: SM block diagram with 32 cores (C), 16 LD/ST units and 4 SFUs. Each core contains an INT unit and an FP unit, fed by operands and writing to a result queue.]
SM components: instruction cache, fetch and decode, warp scheduler (2-level), register file (128KB), execution units, 64KB shared memory/L1 cache
Power breakdown by component (fraction of total), from GPUWattch, ISCA 2013:
EXE       0.201
DRAM      0.178
RF        0.134
Pipeline  0.114
Constant  0.112
NOC       0.095
Other     0.072
MC        0.048
L2        0.045
GPU               Fermi GTX 480   Kepler GTX 680   Maxwell GTX 980
Cores (SMs)       16              8                16
Execution units   512             1536             2048
RF size           128KB/SM        256KB/SM         256KB/SM
#transistors      3 billion       3.5 billion      5.2 billion
– Accounts for 50% of the execution units' power
– Needs long idle periods to be effective
Warped-Gates, MICRO 2013
– Assume a 5-cycle idle detect and a 14-cycle break-even time (BET)
Regions: Lost Opportunity | Energy Loss or Neutral | Energy Savings
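To make the three regions concrete, here is a minimal sketch that classifies one idle period using the 5-cycle idle detect and 14-cycle BET assumed on the slide. The function name and the exact boundary handling (whether a period of exactly 5 or exactly 19 cycles falls on one side or the other) are assumptions made for illustration, not taken from the slides.

```python
IDLE_DETECT = 5   # idle cycles observed before the unit is gated (slide value)
BET = 14          # break-even time in cycles (slide value)

def classify_idle_period(length, idle_detect=IDLE_DETECT, bet=BET):
    """Classify an idle period of `length` cycles (hypothetical helper).

    Assumed model: the unit is gated only after `idle_detect` idle cycles,
    so it spends `length - idle_detect` cycles gated. Gating pays off only
    when the gated time exceeds the break-even time.
    """
    gated_cycles = length - idle_detect
    if gated_cycles <= 0:
        return "lost opportunity"        # too short to ever be gated
    if gated_cycles <= bet:
        return "energy loss or neutral"  # gated, but not past break-even
    return "energy savings"              # gated longer than break-even

for cycles in (3, 6, 19, 25):
    print(cycles, classify_idle_period(cycles))
```

Under these assumptions, an idle period must exceed 5 + 14 = 19 cycles before gating saves energy.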
[Figure: the ready warps queue holds a mix of INT and FP instructions; the scheduler issues them to the cores (C), whose INT and FP units are marked busy or idle.]
– Idle periods are interrupted by instructions that are greedily scheduled
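[Figure: histogram of idle period lengths. Axes: frequency (0% to 30%) against idle period length (ticks at 6, 13, 19, 25). Region annotations, in slide order: 54.3%, 0.0%, 45.7%.]
*Warped-Gates, MICRO 2013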
– Infrequent issues to the same pipeline
– Finely interspersed, leading to limited power gating
Scheduler
INT INT INT INT FP FP FP FP
SP0
Cycle X:    1111 1111
Cycle X+1:  Bubble
Cycle X+2:  Bubble
Cycle X+3:  1111 1111
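The SP0 timeline above can be read as a per-cycle busy/bubble trace. The sketch below extracts the lengths of consecutive idle runs from such a trace; the function name and the boolean encoding of the trace are assumptions made for illustration.

```python
def idle_period_lengths(busy):
    """Return the lengths of consecutive idle runs in a per-cycle trace.

    `busy` is a sequence of booleans: True when the pipeline issues a warp
    in that cycle, False when the cycle is a bubble.
    """
    lengths = []
    run = 0
    for is_busy in busy:
        if is_busy:
            if run:
                lengths.append(run)  # an idle run just ended
            run = 0
        else:
            run += 1
    if run:
        lengths.append(run)          # trace ended while idle
    return lengths

# SP0 trace from the slide: busy at X, bubbles at X+1 and X+2, busy at X+3.
sp0_trace = [True, False, False, True]
print(idle_period_lengths(sp0_trace))  # [2]
```

A 2-cycle idle run is shorter than the 5-cycle idle detect assumed earlier in the deck, so these bubbles would never trigger power gating on their own.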
– Lanes have different activity
➢Improve the power gating potential by coalescing the pipeline bubbles
[Figure: the scheduler issues warps from the ready warps queue to eight cores (C). An issue slot holding a bubble sits next to a warp with active mask 1 1 1 1 1 1 1 1; the warp is split into Sub_Warp0 and Sub_Warp1.]
[Figure: a warp with active mask 1 1 1 1 1 1 1 1, issued to eight cores (C), is folded into two sub-warps. Two splits are shown.]

Split into contiguous halves:
Sub_Warp0: 1 1 1 1 0 0 0 0
Sub_Warp1: 0 0 0 0 1 1 1 1
+ Simple

Split into interleaved lane pairs:
Sub_Warp0: 1 1 0 0 1 1 0 0
Sub_Warp1: 0 0 1 1 0 0 1 1
+ Simple
+ Low wiring overhead
+ Small delay
+ Support for lane shuffling
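The two sub-warp masks shown on the slides can be reproduced with a short sketch. It models only the active-mask bookkeeping on the 8-lane example; the function names and the `group` parameter are hypothetical and not part of the original design.

```python
def fold_contiguous(mask):
    """Split a warp mask into a low half and a high half (two sub-warps)."""
    half = len(mask) // 2
    sub_warp0 = mask[:half] + [0] * (len(mask) - half)
    sub_warp1 = [0] * half + mask[half:]
    return sub_warp0, sub_warp1

def fold_interleaved(mask, group=2):
    """Split a warp mask by alternating groups of `group` adjacent lanes."""
    sub_warp0 = [bit if (lane // group) % 2 == 0 else 0
                 for lane, bit in enumerate(mask)]
    sub_warp1 = [bit if (lane // group) % 2 == 1 else 0
                 for lane, bit in enumerate(mask)]
    return sub_warp0, sub_warp1

full_warp = [1] * 8  # 8-lane example used on the slides
print(fold_contiguous(full_warp))
print(fold_interleaved(full_warp))
```

In both splits every active thread lands in exactly one sub-warp, so issuing the two sub-warps back to back covers the whole warp.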
[Figure: SP pipe. A warp with active mask 1 1 1 1 1 1 1 1 occupies all eight cores (C) and writes to the result collector.]
Selective Write
[Figure: SP pipe with shifting logic ahead of the cores (C) and re-shifting logic ahead of the result collector.]
First sub-warp:  mask 1 1 0 0 1 1 0 0 passes through the shifting logic unchanged and reaches the result collector as 1 1 0 0 1 1 0 0.
Second sub-warp: mask 0 0 1 1 0 0 1 1 is shifted to 1 1 0 0 1 1 0 0 for execution and re-shifted to 0 0 1 1 0 0 1 1 at the result collector.
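The shift and re-shift steps can be sketched as a lane permutation applied before execution and undone before write-back. This is a software model under stated assumptions: the shift is modeled as swapping adjacent lane pairs (lane index XOR 2), which reproduces the masks on the slide but may not match the actual hardware wiring, and all names are hypothetical.

```python
def swap_lane_pairs(lanes, pair=2):
    """Swap adjacent groups of `pair` lanes (lane i <-> lane i XOR pair).

    Assumes `pair` is a power of two and len(lanes) is a multiple of
    2 * pair. Applying the function twice restores the original order,
    so the same routine models both shifting and re-shifting.
    """
    return [lanes[i ^ pair] for i in range(len(lanes))]

def execute_sub_warp(operands, mask, op, shifted):
    """Run `op` on the active lanes of one sub-warp.

    When `shifted` is True, operands and mask are moved onto the lanes used
    by the other sub-warp before execution, and results are moved back.
    """
    if shifted:
        operands, mask = swap_lane_pairs(operands), swap_lane_pairs(mask)
    results = [op(x) if m else None for x, m in zip(operands, mask)]
    if shifted:
        results = swap_lane_pairs(results)  # selective write to original lanes
    return results

operands = [10, 11, 12, 13, 14, 15, 16, 17]
sub_warp1 = [0, 0, 1, 1, 0, 0, 1, 1]
print(swap_lane_pairs(sub_warp1))  # [1, 1, 0, 0, 1, 1, 0, 0]
print(execute_sub_warp(operands, sub_warp1, lambda x: x + 1, shifted=True))
```

With this model both sub-warps execute on lanes 0, 1, 4 and 5, while lanes 2, 3, 6 and 7 stay idle across both issue cycles.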
Origami scheduler
➢Improve the power gating potential by coalescing warps based on:
➢Thread utilization
➢Instruction type
[Figure: active threads per lane over six issue cycles (8 lanes shown). Warps are placed in an "equal to 32" group or a "less than 32" group according to their number of active threads.]

Original issue order:
Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 0 1 0 1 0 0
Cycle x+1:  0 1 1 1 0 1 0 0
Cycle x+2:  1 1 1 1 1 1 1 1
Cycle x+3:  0 0 1 1 0 1 0 1
Cycle x+4:  0 1 1 1 0 1 1 0
Cycle x+5:  1 1 1 1 1 1 1 1

Grouped by thread utilization:
Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 0 1 0 1 0 0
Cycle x+1:  0 1 1 1 0 1 0 0
Cycle x+2:  0 0 1 1 0 1 0 1
Cycle x+3:  0 1 1 1 0 1 1 0
Cycle x+4:  1 1 1 1 1 1 1 1
Cycle x+5:  1 1 1 1 1 1 1 1

Grouped, with active threads shifted toward the low lanes of each 4-lane group:
Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 1 0 1 0 0 0
Cycle x+1:  1 1 1 0 1 0 0 0
Cycle x+2:  1 1 0 0 1 1 0 0
Cycle x+3:  1 1 1 0 1 1 0 0
Cycle x+4:  1 1 1 1 1 1 1 1
Cycle x+5:  1 1 1 1 1 1 1 1
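The reordering and lane shifting in this example can be reproduced with a short sketch. It only reorders and packs the active masks from the slide; the function names, the 4-lane cluster size, and the "partial warps first" order are inferred from the example rather than stated in the deck, and the real scheduler presumably also considers instruction type.

```python
def group_by_utilization(masks):
    """Issue partially active warps first, then fully active warps.

    Order within each group is preserved (a stable partition).
    """
    partial = [m for m in masks if not all(m)]
    full = [m for m in masks if all(m)]
    return partial + full

def pack_lanes(mask, cluster=4):
    """Move active threads to the lowest lanes of each `cluster`-lane group."""
    packed = []
    for start in range(0, len(mask), cluster):
        chunk = mask[start:start + cluster]
        active = sum(chunk)
        packed.extend([1] * active + [0] * (len(chunk) - active))
    return packed

original = [
    [1, 1, 0, 1, 0, 1, 0, 0],  # cycle x
    [0, 1, 1, 1, 0, 1, 0, 0],  # cycle x+1
    [1, 1, 1, 1, 1, 1, 1, 1],  # cycle x+2
    [0, 0, 1, 1, 0, 1, 0, 1],  # cycle x+3
    [0, 1, 1, 1, 0, 1, 1, 0],  # cycle x+4
    [1, 1, 1, 1, 1, 1, 1, 1],  # cycle x+5
]

for row in (pack_lanes(m) for m in group_by_utilization(original)):
    print(*row)
```

After grouping and packing, lanes 3, 6 and 7 stay idle for four consecutive cycles in this example, whereas in the original order no lane is idle for more than two cycles in a row.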
– Folds warps long enough to guarantee savings
– Adaptive folding
– Nvidia GTX480
– Wakeup delay: 3 cycles
– Break-even time: 14 cycles
– Idle detect: 5 cycles
– Adaptively fold warps to coalesce bubbles
– Schedule warps based on thread activity and instruction type
Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*
abdelmaj@usc.edu, dwong@ece.ucr.edu, annavara@usc.edu
* University of Southern California
† University of California, Riverside
‡ Stanford University