Streamlining GPU Applications On the Fly: Thread Divergence Elimination - PowerPoint PPT Presentation


SLIDE 1

eddy@cs.wm.edu

Streamlining GPU Applications On the Fly

Thread Divergence Elimination through Runtime Thread-Data Remapping

Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Xipeng Shen Department of Computer Science, College of William and Mary

SLIDE 2

GPU Divergence

  • GPU Features

– Streaming multiprocessors

  • SIMD
  • Single instruction issue per SM

– Warp / Half Warp

  • SIMD execution unit
  • Divergence

– Threads in a warp take different execution paths
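The cost of divergence can be pictured with a small CPU-side model (a Python sketch added for illustration, not part of the original slides): a SIMD warp executes each distinct path its threads take in a separate serialized pass.

```python
# CPU-side model of SIMD serialization (illustrative only, not GPU code):
# a warp runs one pass per distinct branch outcome among its threads,
# masking off the non-participating threads in each pass.

def warp_passes(branch_outcomes):
    """Serialized passes a warp needs for one branch."""
    return len(set(branch_outcomes))

convergent = [True] * 32                      # all threads agree: 1 pass
divergent = [t % 2 == 0 for t in range(32)]   # alternating paths: 2 passes
print(warp_passes(convergent), warp_passes(divergent))  # 1 2
```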

SLIDE 3

Example of GPU Divergence

[Figure: a warp executing instructions A, B, C under a control flow; divergent threads serialize, with each path executed in a separate pass over time]

SLIDE 4

Impact of GPU Divergence

  • Degrading GPU Throughput

– E.g., up to 15/16 throughput degradation on Tesla 1060

  • Impairing GPU Usage

– Esp. when kernels have non-trivial condition statements

SLIDE 5

Related Work

  • Stream packing and unpacking [Popa: U. Waterloo C.S. master thesis '04]

– Simulates hardware packing on the CPU

  • Dynamic warp formation & scheduling [Fung+: MICRO'07]

– A hardware solution

  • Control structure splitting [Carrillo+: CF'09]

– Reduces register pressure but removes no divergence
SLIDE 6

Basic Idea of Our Solution

Swapping Jobs of Threads through Thread-Data Remapping

Thread-Data Remapping

if ( A[tid] ) {        // green
    C[tid] += 1;
} else {               // red
    C[tid] -= 1;
}

[Figure: elements of A[] colored green or red by branch outcome and mapped to threads of warp 1, warp 2, warp 3]
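The effect of swapping thread jobs can be sketched in a few lines of Python (a CPU-side simulation with a made-up warp size and input, not the paper's implementation): grouping data elements that take the same branch confines divergence to at most one boundary warp.

```python
WARP = 4  # illustrative warp size; real warps are 32 threads (half warp: 16)

def divergent_warps(A):
    """Count warps whose threads disagree on the branch `if (A[tid])`."""
    warps = [A[i:i + WARP] for i in range(0, len(A), WARP)]
    return sum(1 for w in warps if len({bool(x) for x in w}) > 1)

A = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0]  # interleaved: all 3 warps diverge
remapped = sorted(A, key=bool)            # group 'red' and 'green' jobs
print(divergent_warps(A), divergent_warps(remapped))  # 3 1
```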

SLIDE 7

Challenges

  • How to determine a desirable mapping?

– Complexities

  • irregular accesses, complex indexing expressions, side effects
  • memory reference patterns, ...

  • How to realize the new mapping?

– Move data, or redirect threads' data references

  • limitations, effectiveness, and safety

  • How to do it on the fly?

– Large overhead vs. the need for runtime remapping

  • dependence on runtime data values; minimizing and hiding overhead
SLIDE 8

Outline

  • Thread-data Remapping

– Concept & mechanisms

  • Transformation on the Fly

– CPU-GPU pipelining & LAM

  • Evaluation
  • Conclusion
SLIDE 9

GPU Divergence Causes

  • Control Flows in Code

– E.g., if, do, for, while, switch

  • Input Data Dependence

– Input data set --> execution path
– Thread-data mapping --> amount of thread divergence

SLIDE 10

Define Divergence

  • Control Flow Path Vector for One Thread

Def: Pvector[tid] = <b1, b2, b3, ..., bn>

Condition statements:

if ( A[tid] % 2 ) {...};
if ( A[tid] < 10 ) {...};

Path vector example:

tid   A[tid]   Pvector
 1      11     <1,0>
 2      14     <0,0>
 ...     2     <0,1>
 ...    ...    ...
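The path-vector computation for the two condition statements is straightforward; this Python sketch (added for illustration) reproduces the example rows:

```python
def pvector(a):
    """Path vector <b1, b2> for: if (a % 2) {...}; if (a < 10) {...};"""
    return (int(a % 2 == 1), int(a < 10))

print(pvector(11))  # (1, 0)
print(pvector(14))  # (0, 0)
print(pvector(2))   # (0, 1)
```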

SLIDE 11

SLIDE 12

Regroup Threads

  • To Satisfy the Convergence Condition:

– Sort Pvector[0], Pvector[1], Pvector[2], ... for all threads
– E.g., after sorting, the grouping of threads:

Thread index: 0 12 8 11 | 9 4 6 7 | 2 5 10 3 | 1 13 14 15   (one warp per group)
Path vectors: <0,0>, <0,1>, <1,1> -- threads with equal Pvectors fall into the same warp
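The regrouping amounts to sorting thread ids by path vector; a Python sketch (with a hypothetical 16-thread grid and warp size 4 for readability, not the paper's implementation):

```python
WARP = 4  # illustrative; real warps hold 32 threads

# Hypothetical, worst-case interleaved path vectors for 16 threads.
pvec = [(t % 2, (t // 2) % 2) for t in range(16)]

order = sorted(range(16), key=lambda t: pvec[t])        # regrouped thread ids
warps = [order[i:i + WARP] for i in range(0, 16, WARP)]

# Every warp now holds a single distinct Pvector, i.e. no divergence.
assert all(len({pvec[t] for t in w}) == 1 for w in warps)
print(warps[0])  # [0, 4, 8, 12] -- the four threads with Pvector (0, 0)
```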

SLIDE 13

Example of GPU Thread Divergence

[Figure: identity mapping Thread[i] --> Data[j] with i == j; threads 1-8 span WARP 1 and WARP 2, and path vectors <0,0> and <1,0> are mixed within each warp, so both warps diverge]

SLIDE 14

Remapping by Reference Redirection

[Figure: mapping Thread[i] --> Data[j] with j == IND[i]; threads 1-8 in WARP 1 and WARP 2 are redirected through an index array so that each warp touches only data with one path vector (<0,0> or <1,0>); the data stay in place]
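Reference redirection can be sketched as building an index array IND and letting thread i access A[IND[i]] instead of A[i] (a Python simulation with made-up data; the real kernels do this on the GPU):

```python
A = [3, 0, 5, 0, 7, 0, 9, 0]   # alternating branch outcomes: every warp diverges

# Stable sort of data indices by branch outcome: 'taken' elements first.
IND = sorted(range(len(A)), key=lambda j: not A[j])

# Thread i now reads A[IND[i]]; the data themselves never move.
print(IND)                  # [0, 2, 4, 6, 1, 3, 5, 7]
print([A[j] for j in IND])  # [3, 5, 7, 9, 0, 0, 0, 0]
```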

SLIDE 15

Remapping by Data Transformation

[Figure: threads 1-8 in WARP 1 and WARP 2 with data indices 1-8; data with path vectors <0,0> and <1,0> are still interleaved before the transformation]

SLIDE 16

Remapping by Data Transformation

[Figure: data elements are swapped in memory so that, under the identity mapping Thread[i] --> Data[j], i == j, each warp accesses only data with one path vector (<0,0> or <1,0>)]
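Data transformation achieves the same grouping by physically permuting the array so that the unchanged identity mapping (i == j) becomes convergent; a minimal Python sketch with made-up data:

```python
A = [3, 0, 5, 0, 7, 0, 9, 0]                          # interleaved branch outcomes

perm = sorted(range(len(A)), key=lambda j: not A[j])  # 'taken' elements first
A_new = [A[j] for j in perm]                          # the data actually move

print(A_new)  # [3, 5, 7, 9, 0, 0, 0, 0] -- thread i simply reads A_new[i]
```

Unlike reference redirection, results computed on the permuted layout may need the inverse permutation applied before they are written back in the original order.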

SLIDE 17

Outline

  • Thread-data Remapping

– Concept & mechanisms

  • Transformation on the Fly

– CPU-GPU pipelining & LAM

  • Evaluation
  • Conclusion
SLIDE 18

Overview of CPU-GPU Synergy

CPU <--> GPU

  • Collect branch info (path vectors)
  • Compute best mapping
  • Send remapping info
  • Realize desired thread-data mapping
  • Feedback control

The remapping support is independent, protected, and pipelined.

SLIDE 19

CPU-GPU Pipeline Scheme

  • Without Pipelining Scheme

– Tdiv : (Tno-div + Tremap) may be >= 1 or < 1

  • With Pipelining Scheme

– Tdiv : (Tno-div + Tremap) is effectively >= 1 -- No Slow-down!
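The ratio above can be read as a simple profitability test; a toy Python helper (a simplified cost model added for illustration, not from the paper) makes the two cases explicit:

```python
def remap_profitable(t_div, t_no_div, t_remap, pipelined=False):
    """Is runtime remapping worthwhile?

    Without pipelining, remapping pays off only when the divergent run time
    covers both the converged run time and the remapping overhead.  With
    pipelining, the remapping runs on the CPU alongside the GPU kernel, so
    its cost is hidden.
    """
    if pipelined:
        return t_div >= t_no_div
    return t_div >= t_no_div + t_remap

print(remap_profitable(10, 6, 5))                  # False: overhead dominates
print(remap_profitable(10, 6, 5, pipelined=True))  # True: overhead hidden
```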

SLIDE 20

CPU-GPU Pipeline Example I

[Figure: timeline showing GPU kernel invocations overlapped with a CPU remapping thread; the CPU remap function for the next invocation runs while the GPU kernel function executes]

SLIDE 21

CPU-GPU Pipeline Example II

[Figure: timeline showing several CPU pipeline threads overlapping successive GPU kernel invocations]

  • Controllable Threading
  • Adaptive to Available Resources
SLIDE 22

Applicable Scenarios

  • Loops

– Multiple invocations of the same kernel function

  • Input Data Partitioning for a Kernel Function

– Creates multiple iterations

  • Across Different Kernels

– When idle CPU resources are available

SLIDE 23

LAM: Reduce Data Movements

  • For Data Layout Transformation: the LAM Scheme

– 3 steps: Label, Assign & Move (LAM)
– Label -- classify path vectors into multiple classes

  • based on similarity

– Assign -- assign warps to the different classes

  • based on occupation ratio

– Move -- determine each element's destination

A tunable number of classes increases the number of elements that need no move.
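The Label and Assign steps can be sketched as follows (a Python illustration; class formation and destination choice are simplified here to a per-warp majority vote):

```python
from collections import Counter

WARP = 4  # illustrative warp size

def lam_moves(labels):
    """Indices whose elements must move, given per-thread class labels."""
    moves = []
    for base in range(0, len(labels), WARP):
        warp = labels[base:base + WARP]
        majority = Counter(warp).most_common(1)[0][0]   # Assign step
        moves += [base + k for k, c in enumerate(warp) if c != majority]
    return moves

labels = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'a']       # Label step output
print(lam_moves(labels))  # [3, 7] -- only the two minority elements move
```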

SLIDE 24

Outline

  • Thread-data Remapping

– Concept & mechanisms

  • Transformation on the fly

– CPU-GPU pipelining & LAM

  • Evaluation
  • Conclusion
SLIDE 25

Experiment Settings

  • Host Machine

– Dual-socket quad-core Intel Xeon E5540

  • GPU Device

– NVIDIA Tesla 1060

  • Runtime Library

– Reference redirection & data transformation
– Pipeline thread management

SLIDE 26

Benchmarks

Program         Comments               Div. Source          Percent of Div. Warps   Potential Div. Reduction
Blackscholes    option pricing         if statement         0%                      0%
Reduction       parallel sum           if statement         100%                    50%
Marching Cubes  graphics algorithm     if statement         100%                    99%
GAFORT          genetic algorithm      if statement & loop  100%                    44%
3D-LBM          LBM based PDE solver   if statement         50-100%                 50-100%

SLIDE 27

SLIDE 28

Evaluation: Data Transformation

  • GAFORT

– Divergence source: mutation probabilities
– Regular memory access
– Kernel: select_cld
– Remap scheme: data layout transformation
– Efficiency control: LAM & pipelining

Performance:

             Before   After
Div. Ratio   100%     56%
Time         67225    51325

[Figure: speedup bars for Baseline, No-LAM, +LAM, +Pipeline]

Divergence reduced by 44%; 1.31x speedup.

SLIDE 29

Evaluation: Reference Redirect.

  • MarchingCubes

– Divergence source: number of vertices that intersect the isosurface
– Random memory access
– Kernel: generateTriangles2
– Remap scheme: reference redirection
– Efficiency control: pipelining

Performance (time in microseconds):

BlockSize     32      64      256
Org. Time     17414   16707   16673
Opt. Time     12666   12371   12425
Div. Reduct.  99%     99%     99%
SpeedUp       1.37    1.35    1.34

SLIDE 30

Evaluation: All Benchmarks

[Figure: speedup for 3D-LBM, GAFORT, MARCH, REDUCT, BLACK; bars: Org, Opt without efficiency control, Opt with efficiency control]

No performance loss with efficiency control.

SLIDE 31

Conclusion

  • An Efficient Software Solution for GPU Div.

– Mechanism

  • On-the-fly thread-data remapping

– Overhead control

  • CPU-GPU pipelining: whole-system synergy
  • LAM scheme: balance between benefit & overhead

– Effectiveness

  • Up to 1.4x speedup
  • Efficiency protection: no slowdown
SLIDE 32

Acknowledgement

  • Ye Zhao
  • Xiaoming Li
  • NVIDIA

– Donation of GPU Device

  • NSF Funds & IBM CAS Fellowship
  • Anonymous Reviewers
SLIDE 33

Questions?