Streamlining GPU Applications On the Fly: Thread Divergence Elimination - PowerPoint PPT Presentation


SLIDE 1

eddy@cs.wm.edu

Streamlining GPU Applications On the Fly

Thread Divergence Elimination through Runtime Thread-Data Remapping

Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Xipeng Shen Department of Computer Science, College of William and Mary

SLIDE 2

GPU Divergence

  • GPU Features

– Streaming multiprocessors

  • SIMD
  • Single instruction issue per SM

– Warp / Half Warp

  • SIMD execution unit
  • Divergence

– Threads in a warp take different execution paths
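The cost of divergence can be pictured with a small CPU-side model (a Python sketch added for illustration, not part of the original slides): a SIMD warp executes each distinct path its threads take in a separate serialized pass.

```python
# CPU-side model of SIMD serialization (illustrative only, not GPU code):
# a warp runs one pass per distinct branch outcome among its threads,
# masking off the non-participating threads in each pass.

def warp_passes(branch_outcomes):
    """Serialized passes a warp needs for one branch."""
    return len(set(branch_outcomes))

convergent = [True] * 32                      # all threads agree: 1 pass
divergent = [t % 2 == 0 for t in range(32)]   # alternating paths: 2 passes
print(warp_passes(convergent), warp_passes(divergent))  # 1 2
```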

SLIDE 3

Example of GPU Divergence

[Figure: a warp executing instructions A, B, C under a control flow; divergent threads serialize, with each path executed in a separate pass over time]

SLIDE 4

Impact of GPU Divergence

  • Degrading GPU Throughput

– E.g., up to 15/16 throughput degradation on Tesla 1060

  • Impairing GPU Usage

– Esp. when kernels have non-trivial condition statements

SLIDE 5

Related Work

  • Stream packing and unpacking [Popa: U. Waterloo C.S. master thesis '04]

– Simulates hardware packing on the CPU

  • Dynamic warp formation & scheduling [Fung+: MICRO'07]

– A hardware solution

  • Control structure splitting [Carrillo+: CF'09]

– Reduces register pressure but removes no divergence
SLIDE 6

Basic Idea of Our Solution

Swapping Jobs of Threads through Thread-Data Remapping

Thread-Data Remapping

if ( A[tid] ) {        // green
    C[tid] += 1;
} else {               // red
    C[tid] -= 1;
}

[Figure: elements of A[] colored green or red by branch outcome and mapped to threads of warp 1, warp 2, warp 3]
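The effect of swapping thread jobs can be sketched in a few lines of Python (a CPU-side simulation with a made-up warp size and input, not the paper's implementation): grouping data elements that take the same branch confines divergence to at most one boundary warp.

```python
WARP = 4  # illustrative warp size; real warps are 32 threads (half warp: 16)

def divergent_warps(A):
    """Count warps whose threads disagree on the branch `if (A[tid])`."""
    warps = [A[i:i + WARP] for i in range(0, len(A), WARP)]
    return sum(1 for w in warps if len({bool(x) for x in w}) > 1)

A = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0]  # interleaved: all 3 warps diverge
remapped = sorted(A, key=bool)            # group 'red' and 'green' jobs
print(divergent_warps(A), divergent_warps(remapped))  # 3 1
```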

SLIDE 7

Challenges

  • How to determine a desirable mapping?

– Complexities

  • irregular accesses, complex indexing expressions, side effects
  • memory reference patterns, ...

  • How to realize the new mapping?

– Move data, or redirect threads' data references

  • limitations, effectiveness, and safety

  • How to do it on the fly?

– Large overhead vs. the need for runtime remapping

  • dependence on runtime data values; minimizing and hiding overhead
SLIDE 8

Outline

  • Thread-data Remapping

– Concept & mechanisms

  • Transformation on the Fly

– CPU-GPU pipelining & LAM

  • Evaluation
  • Conclusion
SLIDE 9

GPU Divergence Causes

  • Control Flows in Code

– E.g., if, do, for, while, switch

  • Input Data Dependence

– Input data set --> execution path
– Thread-data mapping --> amount of thread divergence

SLIDE 10

Define Divergence

  • Control Flow Path Vector for One Thread

Def: Pvector[tid] = <b1, b2, b3, ..., bn>

Condition statements:

if ( A[tid] % 2 ) {...};
if ( A[tid] < 10 ) {...};

Path vector example:

tid   A[tid]   Pvector
 1      11     <1,0>
 2      14     <0,0>
 ...     2     <0,1>
 ...    ...    ...
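The path-vector computation for the two condition statements is straightforward; this Python sketch (added for illustration) reproduces the example rows:

```python
def pvector(a):
    """Path vector <b1, b2> for: if (a % 2) {...}; if (a < 10) {...};"""
    return (int(a % 2 == 1), int(a < 10))

print(pvector(11))  # (1, 0)
print(pvector(14))  # (0, 0)
print(pvector(2))   # (0, 1)
```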

SLIDE 11

SLIDE 12

Regroup Threads

  • To Satisfy the Convergence Condition:

– Sort Pvector[0], Pvector[1], Pvector[2], ... for all threads
– E.g., after sorting, the grouping of threads:

Thread index: 0 12 8 11 | 9 4 6 7 | 2 5 10 3 | 1 13 14 15   (one warp per group)
Path vectors: <0,0>, <0,1>, <1,1> -- threads with equal Pvectors fall into the same warp
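The regrouping amounts to sorting thread ids by path vector; a Python sketch (with a hypothetical 16-thread grid and warp size 4 for readability, not the paper's implementation):

```python
WARP = 4  # illustrative; real warps hold 32 threads

# Hypothetical, worst-case interleaved path vectors for 16 threads.
pvec = [(t % 2, (t // 2) % 2) for t in range(16)]

order = sorted(range(16), key=lambda t: pvec[t])        # regrouped thread ids
warps = [order[i:i + WARP] for i in range(0, 16, WARP)]

# Every warp now holds a single distinct Pvector, i.e. no divergence.
assert all(len({pvec[t] for t in w}) == 1 for w in warps)
print(warps[0])  # [0, 4, 8, 12] -- the four threads with Pvector (0, 0)
```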

SLIDE 13

Example of GPU Thread Divergence

[Figure: identity mapping Thread[i] --> Data[j] with i == j; threads 1-8 span WARP 1 and WARP 2, and path vectors <0,0> and <1,0> are mixed within each warp, so both warps diverge]

SLIDE 14

Remapping by Reference Redirection

[Figure: mapping Thread[i] --> Data[j] with j == IND[i]; threads 1-8 in WARP 1 and WARP 2 are redirected through an index array so that each warp touches only data with one path vector (<0,0> or <1,0>); the data stay in place]
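Reference redirection can be sketched as building an index array IND and letting thread i access A[IND[i]] instead of A[i] (a Python simulation with made-up data; the real kernels do this on the GPU):

```python
A = [3, 0, 5, 0, 7, 0, 9, 0]   # alternating branch outcomes: every warp diverges

# Stable sort of data indices by branch outcome: 'taken' elements first.
IND = sorted(range(len(A)), key=lambda j: not A[j])

# Thread i now reads A[IND[i]]; the data themselves never move.
print(IND)                  # [0, 2, 4, 6, 1, 3, 5, 7]
print([A[j] for j in IND])  # [3, 5, 7, 9, 0, 0, 0, 0]
```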

SLIDE 15

Remapping by Data Transformation

[Figure: threads 1-8 in WARP 1 and WARP 2 with data indices 1-8; data with path vectors <0,0> and <1,0> are still interleaved before the transformation]

SLIDE 16

Remapping by Data Transformation

[Figure: data elements are swapped in memory so that, under the identity mapping Thread[i] --> Data[j], i == j, each warp accesses only data with one path vector (<0,0> or <1,0>)]
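Data transformation achieves the same grouping by physically permuting the array so that the unchanged identity mapping (i == j) becomes convergent; a minimal Python sketch with made-up data:

```python
A = [3, 0, 5, 0, 7, 0, 9, 0]                          # interleaved branch outcomes

perm = sorted(range(len(A)), key=lambda j: not A[j])  # 'taken' elements first
A_new = [A[j] for j in perm]                          # the data actually move

print(A_new)  # [3, 5, 7, 9, 0, 0, 0, 0] -- thread i simply reads A_new[i]
```

Unlike reference redirection, results computed on the permuted layout may need the inverse permutation applied before they are written back in the original order.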

SLIDE 17

Outline

  • Thread-data Remapping

– Concept & mechanisms

  • Transformation on the Fly

– CPU-GPU pipelining & LAM

  • Evaluation
  • Conclusion
SLIDE 18

Overview of CPU-GPU Synergy

CPU <--> GPU

  • Collect branch info (path vectors)
  • Compute best mapping
  • Send remapping info
  • Realize desired thread-data mapping
  • Feedback control

The remapping support is independent, protected, and pipelined.

SLIDE 19

CPU-GPU Pipeline Scheme

  • Without Pipelining Scheme

– Tdiv : (Tno-div + Tremap) may be >= 1 or < 1

  • With Pipelining Scheme

– Tdiv : (Tno-div + Tremap) is effectively >= 1 -- No Slow-down!
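The ratio above can be read as a simple profitability test; a toy Python helper (a simplified cost model added for illustration, not from the paper) makes the two cases explicit:

```python
def remap_profitable(t_div, t_no_div, t_remap, pipelined=False):
    """Is runtime remapping worthwhile?

    Without pipelining, remapping pays off only when the divergent run time
    covers both the converged run time and the remapping overhead.  With
    pipelining, the remapping runs on the CPU alongside the GPU kernel, so
    its cost is hidden.
    """
    if pipelined:
        return t_div >= t_no_div
    return t_div >= t_no_div + t_remap

print(remap_profitable(10, 6, 5))                  # False: overhead dominates
print(remap_profitable(10, 6, 5, pipelined=True))  # True: overhead hidden
```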

SLIDE 20

CPU-GPU Pipeline Example I

[Figure: timeline showing GPU kernel invocations overlapped with a CPU remapping thread; the CPU remap function for the next invocation runs while the GPU kernel function executes]

SLIDE 21

CPU-GPU Pipeline Example II

[Figure: timeline showing several CPU pipeline threads overlapping successive GPU kernel invocations]

  • Controllable Threading
  • Adaptive to Available Resources
SLIDE 22

Applicable Scenarios

  • Loops

– Multiple invocations of the same kernel function

  • Input Data Partitioning for a Kernel Function

– Creates multiple iterations

  • Across Different Kernels

– When idle CPU resources are available

SLIDE 23

LAM: Reduce Data Movements

  • For Data Layout Transformation: the LAM Scheme

– 3 steps: Label, Assign & Move (LAM)
– Label -- classify path vectors into multiple classes

  • based on similarity

– Assign -- assign warps to the different classes

  • based on occupation ratio

– Move -- determine each element's destination

A tunable number of classes increases the number of elements that need no move.
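The Label and Assign steps can be sketched as follows (a Python illustration; class formation and destination choice are simplified here to a per-warp majority vote):

```python
from collections import Counter

WARP = 4  # illustrative warp size

def lam_moves(labels):
    """Indices whose elements must move, given per-thread class labels."""
    moves = []
    for base in range(0, len(labels), WARP):
        warp = labels[base:base + WARP]
        majority = Counter(warp).most_common(1)[0][0]   # Assign step
        moves += [base + k for k, c in enumerate(warp) if c != majority]
    return moves

labels = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'a']       # Label step output
print(lam_moves(labels))  # [3, 7] -- only the two minority elements move
```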

SLIDE 24

Outline

  • Thread-data Remapping

– Concept & mechanisms

  • Transformation on the fly

– CPU-GPU pipelining & LAM

  • Evaluation
  • Conclusion
SLIDE 25

Experiment Settings

  • Host Machine

– Dual-socket quad-core Intel Xeon E5540

  • GPU Device

– NVIDIA Tesla 1060

  • Runtime Library

– Reference redirection & data transformation
– Pipeline thread management

SLIDE 26

Benchmarks

Program         Comments               Div. Source          Percent of Div. Warps   Potential Div. Reduction
Blackscholes    option pricing         if statement         0%                      0%
Reduction       parallel sum           if statement         100%                    50%
Marching Cubes  graphics algorithm     if statement         100%                    99%
GAFORT          genetic algorithm      if statement & loop  100%                    44%
3D-LBM          LBM based PDE solver   if statement         50-100%                 50-100%

SLIDE 27

SLIDE 28

Evaluation: Data Transformation

  • GAFORT

– Divergence source: mutation probabilities
– Regular memory access
– Kernel: select_cld
– Remap scheme: data layout transformation
– Efficiency control: LAM & pipelining

Performance:

             Before   After
Div. Ratio   100%     56%
Time         67225    51325

[Figure: speedup bars for Baseline, No-LAM, +LAM, +Pipeline]

Divergence reduced by 44%; 1.31x speedup.

SLIDE 29

Evaluation: Reference Redirect.

  • MarchingCubes

– Divergence source: number of vertices that intersect the isosurface
– Random memory access
– Kernel: generateTriangles2
– Remap scheme: reference redirection
– Efficiency control: pipelining

Performance (time in microseconds):

BlockSize     32      64      256
Org. Time     17414   16707   16673
Opt. Time     12666   12371   12425
Div. Reduct.  99%     99%     99%
SpeedUp       1.37    1.35    1.34

SLIDE 30

Evaluation: All Benchmarks

[Figure: speedup for 3D-LBM, GAFORT, MARCH, REDUCT, BLACK; bars: Org, Opt without efficiency control, Opt with efficiency control]

No performance loss with efficiency control.

SLIDE 31

Conclusion

  • An Efficient Software Solution for GPU Div.

– Mechanism

  • On-the-fly thread-data remapping

– Overhead control

  • CPU-GPU pipelining: whole-system synergy
  • LAM scheme: balance between benefit & overhead

– Effectiveness

  • Up to 1.4x speedup
  • Efficiency protection: no slowdown
SLIDE 32

Acknowledgement

  • Ye Zhao
  • Xiaoming Li
  • NVIDIA

– Donation of GPU Device

  • NSF Funds & IBM CAS Fellowship
  • Anonymous Reviewers
SLIDE 33

Questions?