Origami: Folding Warps for Energy Efficient GPUs



slide-1
SLIDE 1

Origami:
 Folding Warps for Energy Efficient GPUs

Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*

* University of Southern California
† University of California, Riverside
‡ Stanford University

slide-2
SLIDE 2

Outline

  • GPU overview
  • Motivation and related work
  • Warp Folding
  • Origami Scheduler
  • Evaluation


slide-3
SLIDE 3

GPGPU Overview (GTX480)


[Figure: block diagram of one streaming multiprocessor (SM). Execution units: 32 cores (C), 16 LD/ST units, 4 SFUs. Each core contains an INT unit and an FP unit, fed by operands and draining into a result queue. Other SM components: instruction cache, fetch and decode, warp scheduler (2-level), 128KB register file, 64KB shared memory/L1 cache.]

slide-6
SLIDE 6

GPGPU Power Break-Down

Component   Share of power
EXE         20.1%
DRAM        17.8%
RF          13.4%
Pipeline    11.4%
Constant    11.2%
NOC          9.5%
Other        7.2%
MC           4.8%
L2           4.5%

Source: GPUWattch, ISCA 2013

slide-7
SLIDE 7

GPU Scaling Trend

GPU               Fermi GTX 480   Kepler GTX 680   Maxwell GTX 980
Cores (SMs)       16              8                16
Execution units   512             1536             2048
RF size           128KB/SM        256KB/SM         256KB/SM
#transistors      3 billion       3.5 billion      5.2 billion

slide-12
SLIDE 12

Technology Scaling

  • As technology scales, leakage power will increase
– Accounts for 50% of the execution units' power
  • Power gating can be used to reduce leakage power
– Needs long idle periods to be effective

Warped-Gates, MICRO 2013

slide-18
SLIDE 18

Power Gating Challenges in GPGPUs

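  • INT unit idle period length distribution for hotspot
– Assumes a 5-cycle idle-detect window and a 14-cycle breakeven time (BET)

[Figure: histogram of idle period lengths, divided into three regions: Lost Opportunity, Energy Loss or Neutral, Energy Savings]

Need to increase idle period length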
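The three regions of the figure can be read as a threshold rule. The sketch below is one plausible reading, not the authors' model. It assumes an idle period that ends before the idle-detect window expires is never gated, one that is gated for no more than the breakeven time cannot pay back the gating overhead, and anything longer saves energy. The function name and the exact boundary handling are assumptions; the defaults reuse the 5-cycle idle detect and 14-cycle BET quoted on this slide.

```python
def classify_idle_period(length, idle_detect=5, breakeven=14):
    """Place an idle period (in cycles) into one of the three figure regions.

    Assumption: gating starts only after `idle_detect` idle cycles, and the
    unit must stay gated for more than `breakeven` cycles to save energy.
    """
    gated_cycles = length - idle_detect
    if gated_cycles <= 0:
        # The unit became busy again before gating was triggered.
        return "lost opportunity"
    if gated_cycles <= breakeven:
        # Gated, but not long enough to recover the gating overhead.
        return "energy loss or neutral"
    return "energy savings"


for cycles in (3, 10, 19, 25):
    print(cycles, classify_idle_period(cycles))
```

Under these assumptions only idle periods longer than 19 cycles fall in the savings region, which is why the slide calls for longer idle periods.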

slide-21
SLIDE 21
Warp Scheduler Effect on Power Gating

  • Idle periods are interrupted by instructions that are greedily scheduled

[Figure: ready warps (INT and FP) issued to the cores; per-cycle timeline in which each core is marked Busy or Idle with the instruction type (INT or FP) it executes]

Need to coalesce warp issues by resource type
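A small sketch of why coalescing issues by resource type helps. It is a simplified model with one instruction issued per cycle, which is an assumption and not the GTX480 issue logic; `idle_runs` and the two issue orders are made up for illustration. It measures how long a unit stays idle between instructions of its own type.

```python
def idle_runs(issue_order, unit):
    """Lengths of consecutive cycles in which `unit` receives no instruction."""
    runs, current = [], 0
    for issued in issue_order:
        if issued == unit:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


interleaved = ["INT", "FP"] * 8      # greedy issue: types alternate
grouped = ["INT"] * 8 + ["FP"] * 8   # issues coalesced by resource type

print(idle_runs(interleaved, "INT"))  # eight 1-cycle idle periods
print(idle_runs(grouped, "INT"))      # one 8-cycle idle period
```

The total idle time of the INT unit is the same in both orders (8 cycles); only the grouped order delivers it as one period long enough to pass an idle-detect window.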

slide-22
SLIDE 22

Related Work/Warped-Gates*

  • Schedule instructions based on their type
  • Force power-gated units to stay in the power-gated state for at least the breakeven time

[Figure: idle period length distribution (x-axis: idle period length; y-axis: frequency, 0% to 30%). Region shares shown: 54.3%, 0.0%, 45.7%.]

*Warped-Gates, MICRO 2013

slide-32
SLIDE 32

Fine grain idleness

  • Temporal idleness
– Infrequent issues to the same pipeline
– Finely interspersed, leading to limited power gating opportunities

Scheduler: INT INT INT INT FP FP FP FP

SP0:
Cycle X:    1111 1111
Cycle X+1:  Bubble
Cycle X+2:  Bubble
Cycle X+3:  1111 1111

slide-33
SLIDE 33

Fine grain idleness

  • Spatial idleness
– Lanes have different activity
    • Branch divergence
    • Insufficient parallelism

slide-34
SLIDE 34

Warp Folding

➢Improve the power gating potential by coalescing the pipeline bubbles

12

slide-44
SLIDE 44

Warp Folding

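[Figure: the scheduler takes warps from the ready warps queue and issues them to 8 cores (C). Issued active masks shown: 0 0 0 0 0 0 0 0 (a pipeline bubble) and 1 1 1 1 1 1 1 1, with the full warp split into Sub_Warp0 and Sub_Warp1.]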

slide-49
SLIDE 49

Folding Granularity

Active Mask: 1 1 1 1 1 1 1 1
Sub_Warp0:   1 1 1 1 0 0 0 0
Sub_Warp1:   0 0 0 0 1 1 1 1

+ Simple
- High wiring overhead
- Delay
slide-56
SLIDE 56

Folding Granularity

Active Mask: 1 1 1 1 1 1 1 1
Sub_Warp0:   1 1 0 0 1 1 0 0
Sub_Warp1:   0 0 1 1 0 0 1 1

+ Simple
+ Low wiring overhead
+ Small delay
+ Support for lane shuffling
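The two folding granularities can be sketched as two ways of splitting one active mask into a pair of sub-warp masks. This is an illustrative reconstruction from the masks on these slides, not the authors' hardware description. The function names are hypothetical, and the 4-lane cluster size is an assumption inferred from the 1100 1100 / 0011 0011 pattern.

```python
def fold_contiguous(mask):
    """Split a mask into two sub-warps by contiguous halves of the warp."""
    half = len(mask) // 2
    sub0 = mask[:half] + [0] * half
    sub1 = [0] * half + mask[half:]
    return sub0, sub1


def fold_clustered(mask, cluster=4):
    """Split a mask into two sub-warps inside every cluster of lanes."""
    half = cluster // 2
    sub0, sub1 = [], []
    for start in range(0, len(mask), cluster):
        lanes = mask[start:start + cluster]
        sub0 += lanes[:half] + [0] * half   # lower half of the cluster
        sub1 += [0] * half + lanes[half:]   # upper half of the cluster
    return sub0, sub1


full = [1] * 8
print(fold_contiguous(full))  # 1111 0000 and 0000 1111
print(fold_clustered(full))   # 1100 1100 and 0011 0011
```

In the clustered variant each thread only ever needs to move within its own cluster, which fits the low wiring overhead and small delay listed for it.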

slide-59
SLIDE 59

Warp Folding Pipeline

[Figure: warp folding pipeline. Shifting logic sits in front of the SP pipe (8 cores, C), re-shifting logic sits between the SP pipe and the result collector, and results are committed with a selective write. Active mask shown: 1 1 1 1 1 1 1 1.]

slide-69
SLIDE 69

Example

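Warp active mask 1111 1111 is folded into two sub-warps with masks 1100 1100 and 0011 0011.

Sub-warp 0 (Shifting Logic, SP Pipe, Re-Shifting Logic, Result Collector):
  Mask entering the shifting logic:  1 1 0 0 1 1 0 0
  Mask after the shifting logic:     1 1 0 0 1 1 0 0
  Mask at the result collector:      1 1 0 0 1 1 0 0

Sub-warp 1 (Shifting Logic, SP Pipe, Re-Shifting Logic, Result Collector):
  Mask entering the shifting logic:  0 0 1 1 0 0 1 1
  Mask after the shifting logic:     1 1 0 0 1 1 0 0
  Mask at the result collector:      0 0 1 1 0 0 1 1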
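The example can be mimicked in software to check that folding preserves results while touching only half of the lanes. This is a behavioural sketch under assumptions, not the hardware design: it assumes a fully active warp and 4-lane clusters, and `execute_folded` is a hypothetical name. Each sub-warp is shifted onto the two low lanes of its cluster, executed, and its results are written back at the original thread positions.

```python
def execute_folded(operands, op, cluster=4):
    """Apply `op` to every thread using only the lower half of each cluster.

    Returns the per-thread results and the sorted physical lanes that were used.
    """
    half = cluster // 2
    results = [None] * len(operands)
    used_lanes = set()
    for sub_warp in (0, 1):
        offset = sub_warp * half  # sub-warp 1 sits in the upper half of each cluster
        for start in range(0, len(operands), cluster):
            for i in range(half):
                source_lane = start + offset + i  # where the thread originally sits
                physical_lane = start + i         # shifting: execute on a low lane
                used_lanes.add(physical_lane)
                # Re-shifting: the result is written back at the thread's own position.
                results[source_lane] = op(operands[source_lane])
    return results, sorted(used_lanes)


results, lanes = execute_folded(list(range(8)), lambda x: x * x)
print(results)  # same values as an unfolded warp would produce
print(lanes)    # only lanes 0, 1, 4 and 5 were active
```

Lanes 2, 3, 6 and 7 never execute in this model, so they are the ones that would stay idle for the whole folded instruction.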

slide-70
SLIDE 70

Origami scheduler

➢Improve the power gating potential by coalescing warps based on:
  ➢Thread utilization
  ➢Instruction type

slide-72
SLIDE 72

Origami scheduler

  • Group the warps based on their active masks
  • One group holds the warps whose active masks have fewer than 32 active threads
  • The other group holds the warps whose active masks have all 32 threads active

Issue order before grouping (8 lanes shown for illustration):

Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 0 1 0 1 0 0
Cycle x+1:  0 1 1 1 0 1 0 0
Cycle x+2:  1 1 1 1 1 1 1 1
Cycle x+3:  0 0 1 1 0 1 0 1
Cycle x+4:  0 1 1 1 0 1 1 0
Cycle x+5:  1 1 1 1 1 1 1 1

slide-76
SLIDE 76

Origami scheduler

Issue order after grouping by active mask:

Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 0 1 0 1 0 0   (less than 32 group)
Cycle x+1:  0 1 1 1 0 1 0 0   (less than 32 group)
Cycle x+2:  0 0 1 1 0 1 0 1   (less than 32 group)
Cycle x+3:  0 1 1 1 0 1 1 0   (less than 32 group)
Cycle x+4:  1 1 1 1 1 1 1 1   (equal to 32 group)
Cycle x+5:  1 1 1 1 1 1 1 1   (equal to 32 group)

Active masks are not aligned!
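The grouping step amounts to a stable partition of the ready warps by whether their active mask is full. The sketch below is an illustration only. It uses 8-lane masks written as strings, as the slide figure does in place of 32 lanes, and `group_by_utilization` is a hypothetical name. Putting the partially active group first follows the order shown in the figure.

```python
def group_by_utilization(masks):
    """Stable partition: partially active warps first, fully active warps last."""
    partial = [m for m in masks if "0" in m]
    full = [m for m in masks if "0" not in m]
    return partial + full


ready = ["11010100", "01110100", "11111111", "00110101", "01110110", "11111111"]
for mask in group_by_utilization(ready):
    print(mask)
```

After grouping, the partially active masks still leave different lanes idle from cycle to cycle, which is the misalignment the slide points out.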

slide-77
SLIDE 77

Lane Shifting

  • Shift the threads to the lower-order SIMT lanes
  • Done at the cluster level to reduce overhead

Before shifting:

Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 0 1 0 1 0 0   (less than 32 group)
Cycle x+1:  0 1 1 1 0 1 0 0   (less than 32 group)
Cycle x+2:  0 0 1 1 0 1 0 1   (less than 32 group)
Cycle x+3:  0 1 1 1 0 1 1 0   (less than 32 group)
Cycle x+4:  1 1 1 1 1 1 1 1   (equal to 32 group)
Cycle x+5:  1 1 1 1 1 1 1 1   (equal to 32 group)

slide-78
SLIDE 78

Lane Shifting

After shifting:

Lane#:      0 1 2 3 4 5 6 7
Cycle x:    1 1 1 0 1 0 0 0   (less than 32 group)
Cycle x+1:  1 1 1 0 1 0 0 0   (less than 32 group)
Cycle x+2:  1 1 0 0 1 1 0 0   (less than 32 group)
Cycle x+3:  1 1 1 0 1 1 0 0   (less than 32 group)
Cycle x+4:  1 1 1 1 1 1 1 1   (equal to 32 group)
Cycle x+5:  1 1 1 1 1 1 1 1   (equal to 32 group)
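Lane shifting at the cluster level can be sketched as compacting the active threads of each cluster toward its lowest lanes. The 4-lane cluster size is an assumption, chosen because it reproduces the before and after masks shown on these slides. `shift_lanes` is a hypothetical name, and masks are written as strings with lane 0 on the left.

```python
def shift_lanes(mask, cluster=4):
    """Compact the active lanes of each cluster toward its lowest lanes."""
    shifted = ""
    for start in range(0, len(mask), cluster):
        lanes = mask[start:start + cluster]
        active = lanes.count("1")
        shifted += "1" * active + "0" * (len(lanes) - active)
    return shifted


for mask in ["11010100", "01110100", "00110101", "01110110", "11111111"]:
    print(mask, "->", shift_lanes(mask))
```

With the masks aligned this way, the higher lanes of each cluster are the ones that stay idle across consecutive cycles. Threads never cross a cluster boundary, which keeps the shifting network small.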
slide-79
SLIDE 79

Origami Scheduling

  • Runtime Warp Folding Algorithm
– Folds warps long enough to guarantee savings
– Adaptive folding
    • Aggressive folding for warps with lower instruction count
    • Conservative folding for warps with higher instruction count
    • Change folding frequency based on application utilization
  • See paper for more detail!

$N_{\text{phase}} = N_{\text{pipeline flush}} + N_{\text{idle detect}} + N_{\text{breakeven time}}$
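As a worked instance of the formula, the sketch below reads N_phase as the minimum number of cycles a folding phase must last for gating to pay off. That reading is an assumption based on the "long enough to guarantee savings" bullet. The idle-detect and breakeven defaults are the 5 and 14 cycles used elsewhere in the slides. The pipeline-flush value of 4 cycles is purely hypothetical, since the slides do not give one.

```python
def folding_phase_length(pipeline_flush, idle_detect=5, breakeven=14):
    """N_phase = N_pipelineflush + N_idledetect + N_breakeventime (in cycles)."""
    return pipeline_flush + idle_detect + breakeven


# Hypothetical 4-cycle pipeline flush; the real value is not stated in the slides.
print(folding_phase_length(pipeline_flush=4))  # 4 + 5 + 14 = 23 cycles
```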

slide-80
SLIDE 80

EVALUATION

21

slide-81
SLIDE 81

Evaluation Methodology

  • GPGPU-Sim v3.0.2
– Nvidia GTX480
  • GPUWattch and McPAT for energy and area estimation
  • Benchmarks from ISPASS, Rodinia and Parboil
  • Power gating parameters
– Wakeup delay: 3 cycles
– Breakeven time: 14 cycles
– Idle detect: 5 cycles

slide-82
SLIDE 82

Folding Ratio

23

  • Folding frequency is application dependent
slide-85
SLIDE 85

Energy Savings/INT

24

  • Eliminates negative energy savings
  • The Origami scheduler amplifies the folding benefits
  • Origami saves 49% of the integer units' leakage energy
slide-87
SLIDE 87

Energy Savings/FP

25

  • Eliminates negative energy savings
  • The Origami scheduler amplifies the folding benefits
  • Origami saves 46% of the floating-point units' leakage energy
slide-89
SLIDE 89
Performance

  • Origami reduces the performance overhead significantly compared with Warped-Gates
  • The Origami scheduler has a positive impact on performance for some workloads

slide-91
SLIDE 91

Conclusion

  • Execution unit energy efficiency is critical
  • Take advantage of spatial and temporal idleness to improve the power gating potential
  • Warp folding
– Adaptively folds warps to coalesce bubbles
  • Origami scheduler
– Schedules warps based on thread activity and instruction type
  • Able to save 49% (INT) and 46% (FP) of the execution units' leakage energy
  • Negligible performance overhead

slide-92
SLIDE 92

Origami:
 Folding Warps for Energy Efficient GPUs

Mohammad Abdel-Majeed*, Daniel Wong†, Justin Huang‡ and Murali Annavaram*
abdelmaj@usc.edu, dwong@ece.ucr.edu, annavara@usc.edu

* University of Southern California
† University of California, Riverside
‡ Stanford University

Questions?

slide-93
SLIDE 93

THANK YOU!

29