SLIDE 1

Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU systems

Sundaresan Venkatasubramanian
Prof. Richard Vuduc
Georgia Tech

Int’l. Conference on Supercomputing, June 10, 2009

SLIDE 2

Motivation and Goals

• Regular, memory bandwidth-bound kernel
• GPU with high memory bandwidth
• Seems simple, but…
  • Tuning: constraints on parallelism and data movement
  • Real systems mix CPU-like/GPU-like processors
• Goal of paper: solve this mapping


SLIDE 3

Key ideas and results

• Hand-tuned to understand how to map
  • 98% of empirical streaming bandwidth
• Model-driven hybrid CPU/GPU implementation
• Asynchronous algorithms to avoid global syncs [Chazan & Miranker ’69]
  • 1.2 to 2.5x speedup even while performing up to 1.7x the flops!


SLIDE 4

Problem


To solve Poisson’s equation in 2-D on a square grid: a centered finite-difference approximation on an (N+2)×(N+2) grid with step size h:
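In symbols (a standard reconstruction consistent with this slide and with the update rule on the next one):

\[
-\nabla^2 u = f \quad\text{on the unit square},
\qquad
\frac{4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}}{h^2} = f_{i,j},
\quad 1 \le i,j \le N,
\]

with $h = 1/(N+1)$; rearranged, $u_{i,j} = \tfrac14\,(u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} + h^2 f_{i,j})$, of which the next slide’s update is the $f \equiv 0$ case.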

SLIDE 5

Algorithm


for t = 1, 2, 3, …, T do
  for all unknowns (i, j) in the grid do    (embarrassingly parallel)
    $U^{t+1}_{i,j} = \tfrac14\,\bigl(U^{t}_{i+1,j} + U^{t}_{i-1,j} + U^{t}_{i,j-1} + U^{t}_{i,j+1}\bigr)$
  end for
end for

• 2 copies of the grid required
• Implicit global sync
• Used as a subroutine in other, more complex algorithms (e.g. multigrid)
• Memory bandwidth-bound kernel (see the CUDA sketch below)
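A minimal CUDA sketch of one such sweep (my own illustrative code; jacobi_step and the data layout are assumptions, not the paper’s kernel):

__global__ void jacobi_step(const float *u_old, float *u_new, int N)
{
    // Interior unknowns are (i, j) in 1..N; the row stride is N+2
    // because the grid carries a one-cell boundary on each side.
    int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int s = N + 2;
    if (i <= N && j <= N)
        u_new[i * s + j] = 0.25f * (u_old[(i + 1) * s + j] + u_old[(i - 1) * s + j] +
                                    u_old[i * s + j - 1]   + u_old[i * s + j + 1]);
}
// Host side: launch T times, swapping the two grid pointers between
// launches; the kernel-launch boundary is the implicit global sync.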

SLIDE 6

CPU and GPU Baselines


SLIDE 7

Tuned CPU Baseline*

• SIMD vectorization
• Parallelized using pthreads
• Thread binding on NUMA architectures (see the sketch below)


*See paper for details
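A hedged sketch of the threading skeleton such a baseline might use (illustrative only; the paper’s actual code is not reproduced here):

#include <pthread.h>

// One thread sweeps one contiguous slab of rows [row_lo, row_hi).
typedef struct { const float *u_old; float *u_new; int N, row_lo, row_hi; } slab_t;

static void *sweep_rows(void *arg)
{
    slab_t *s = (slab_t *)arg;
    int st = s->N + 2;                       // padded row stride
    for (int i = s->row_lo; i < s->row_hi; ++i)
        for (int j = 1; j <= s->N; ++j)      // unit-stride inner loop: SIMD-vectorizable
            s->u_new[i * st + j] = 0.25f *
                (s->u_old[(i + 1) * st + j] + s->u_old[(i - 1) * st + j] +
                 s->u_old[i * st + j - 1]   + s->u_old[i * st + j + 1]);
    return NULL;
}
// NUMA binding: pin each thread to a core near the memory holding its slab
// (e.g. via pthread_setaffinity_np) so sweeps touch only local memory.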

SLIDE 8

Tuned GPU Baseline*

• Exploit shared memory (see the sketch below)
• Bank conflicts – access pattern
• Non-coalesced accesses – padding
• Loop unrolling
• Occupancy – proper block size


*See paper for details
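A sketch of how the first three bullets combine (illustrative: TILE, the +1 pad, and the halo staging are my assumptions; the tuned kernel adds unrolling and block-size selection on top):

#define TILE 16

__global__ void jacobi_smem(const float *u_old, float *u_new, int N)
{
    // The +1 pad skews rows across shared-memory banks to avoid conflicts.
    __shared__ float tile[TILE + 2][TILE + 2 + 1];

    int s  = N + 2;                                // padded row stride
    int i  = blockIdx.y * TILE + threadIdx.y + 1;  // global row
    int j  = blockIdx.x * TILE + threadIdx.x + 1;  // global column
    int ti = threadIdx.y + 1, tj = threadIdx.x + 1;

    // Stage the tile; edge threads also stage the halo (coalesced row loads).
    tile[ti][tj] = u_old[i * s + j];
    if (ti == 1)    tile[0][tj]        = u_old[(i - 1) * s + j];
    if (ti == TILE) tile[TILE + 1][tj] = u_old[(i + 1) * s + j];
    if (tj == 1)    tile[ti][0]        = u_old[i * s + j - 1];
    if (tj == TILE) tile[ti][TILE + 1] = u_old[i * s + j + 1];
    __syncthreads();

    u_new[i * s + j] = 0.25f * (tile[ti + 1][tj] + tile[ti - 1][tj] +
                                tile[ti][tj - 1] + tile[ti][tj + 1]);
}
// Assumes N is a multiple of TILE so no bounds checks are needed.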

SLIDE 9

Experimental set-up


SLIDE 10

Results


• Device: NVIDIA Tesla C870
• Grid size: 4096
• No. of iterations: 32
SLIDE 11

Results


SLIDE 12


• 98% of empirical streaming bandwidth
• 66% of true peak
• Approx. 37 GFLOPS

SLIDE 13


• Half the work of single
• Approx. 17 GFLOPS

SLIDE 14

CPU/GPU and Multi-GPU Methods


SLIDE 15

CPU-GPU implementation


SLIDE 16

CPU-GPU implementation


Need to exchange ghost cells across the CPU/GPU boundary.

SLIDE 17

Algorithm

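A minimal sketch of the hybrid time loop, under my own assumptions (row-wise split with the GPU owning rows 1..G, G a multiple of 16, both sides initialized with full copies of the grid; jacobi_step is the kernel sketched on Slide 5, cpu_sweep a pthread-parallel wrapper like the Slide 7 sketch):

// Hypothetical host driver for the CPU-GPU split (illustrative names).
extern void cpu_sweep(const float *u_old, float *u_new,
                      int row_lo, int row_hi, int N);  // sweeps rows [row_lo, row_hi)

void hybrid_run(float *d_old, float *d_new,   // device buffers (full size)
                float *h_old, float *h_new,   // host buffers (full size)
                int G, int N, int T)
{
    size_t row = (size_t)(N + 2) * sizeof(float);
    dim3 block(16, 16), grid(N / 16, G / 16); // GPU covers rows 1..G only

    for (int t = 0; t < T; ++t) {
        jacobi_step<<<grid, block>>>(d_old, d_new, N); // async launch,
        cpu_sweep(h_old, h_new, G + 1, N + 1, N);      // overlaps the CPU part
        cudaDeviceSynchronize();

        // Exchange the two rows that straddle the seam: GPU row G becomes
        // the CPU's halo, CPU row G+1 becomes the GPU's halo.
        cudaMemcpy(h_new + G * (N + 2), d_new + G * (N + 2), row,
                   cudaMemcpyDeviceToHost);
        cudaMemcpy(d_new + (G + 1) * (N + 2), h_new + (G + 1) * (N + 2), row,
                   cudaMemcpyHostToDevice);

        float *tmp;                                    // swap time levels
        tmp = d_old; d_old = d_new; d_new = tmp;
        tmp = h_old; h_old = h_new; h_new = tmp;
    }
}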

SLIDE 18


CPU-GPU Model graph

[Figure: x-axis is the fraction of the grid assigned to the CPU (0% to 100%); y-axis is the fraction of baseline CPU-only time. Curves for the CPU part, GPU part, exchange time, and the hybrid total, with baseline CPU-only and baseline GPU-only references; the hybrid minimum marks the optimal fraction.]
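In symbols, a model consistent with this graph (my notation, not necessarily the paper’s exact formulation):

\[
T_{\text{hybrid}}(f) \;=\; \max\!\bigl(f\,T_{\text{cpu}},\; (1-f)\,T_{\text{gpu}}\bigr) \;+\; T_{\text{exch}},
\]

where $f$ is the fraction assigned to the CPU. With a negligible exchange term, the minimum sits where both parts finish together, $f^{*} = T_{\text{gpu}}/(T_{\text{cpu}} + T_{\text{gpu}})$; the exchange cost then decides whether the hybrid beats the pure-GPU baseline.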

SLIDE 19


CPU-GPU – Slower CPU

[Figure: same axes and curves as the previous model graph, for a slower CPU.]

Hybrid never beats the pure GPU version.

SLIDE 20


~1.1x speedup

SLIDE 21


SLIDE 22


SLIDE 23


~1.8x speedup

SLIDE 24

Asynchronous algorithms


SLIDE 25

TunedSync - Review


[Diagram: T iterations; each iteration fetches a tile from one global grid into shared memory, computes, and writes the result to the other global grid, with a global synchronization and a global-memory round trip between iterations.]

SLIDE 26

TunedSync - Review


[Diagram: one iteration in detail: fetch from global grid 1 into shared memory, compute, write to global grid 2.]

SLIDE 27

Async0


[Diagram: fetch a tile from global grid 1 into shared memory, then compute and write locally, repeating α/2 times (α local iterations in total) before writing back to global grid 2; T′ such passes overall.]

• Effective number of iterations = T′ · α
• Greater than T?
• More iterations -> more FLOPS (see the sketch below)
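A sketch of Async0’s structure (illustrative, reusing TILE and the padded tiles from the baseline sketch; assumes N is a multiple of TILE and α is even):

__global__ void async0_block(const float *g_in, float *g_out, int N, int alpha)
{
    // Two local grids, padded like the tuned-baseline tile.
    __shared__ float s1[TILE + 2][TILE + 2 + 1];
    __shared__ float s2[TILE + 2][TILE + 2 + 1];

    int s  = N + 2;
    int i  = blockIdx.y * TILE + threadIdx.y + 1;
    int j  = blockIdx.x * TILE + threadIdx.x + 1;
    int ti = threadIdx.y + 1, tj = threadIdx.x + 1;

    // Stage the tile and its halo ONCE, into both local grids; the halo
    // then stays frozen (grows stale) across all alpha local iterations.
    s1[ti][tj] = g_in[i * s + j];
    if (ti == 1)    s1[0][tj]        = s2[0][tj]        = g_in[(i - 1) * s + j];
    if (ti == TILE) s1[TILE + 1][tj] = s2[TILE + 1][tj] = g_in[(i + 1) * s + j];
    if (tj == 1)    s1[ti][0]        = s2[ti][0]        = g_in[i * s + j - 1];
    if (tj == TILE) s1[ti][TILE + 1] = s2[ti][TILE + 1] = g_in[i * s + j + 1];
    __syncthreads();

    for (int k = 0; k < alpha / 2; ++k) {   // two half-sweeps per trip
        s2[ti][tj] = 0.25f * (s1[ti + 1][tj] + s1[ti - 1][tj] +
                              s1[ti][tj - 1] + s1[ti][tj + 1]);
        __syncthreads();                    // block-local sync only
        s1[ti][tj] = 0.25f * (s2[ti + 1][tj] + s2[ti - 1][tj] +
                              s2[ti][tj - 1] + s2[ti][tj + 1]);
        __syncthreads();
    }
    g_out[i * s + j] = s1[ti][tj];          // single write-back per pass
}
// Launched T' times (swapping g_in/g_out) for T' * alpha effective sweeps.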

SLIDE 28

How is Async0 different from TunedSync?

• Reduces the number of global memory accesses
• Fewer global synchronizations
• Expect T_eff ≥ T, by a little (but can’t be less!); see the symbols below
• Uses 2 shared memory grids

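In symbols (my notation, not the paper’s): let $t_{\text{sync}}$ be the cost of one synchronous sweep and $t_{\text{pass}}$ the cost of one asynchronous pass (one global round trip plus $\alpha$ local sweeps). Then Async0 wins whenever

\[
T'\, t_{\text{pass}} \;<\; T\, t_{\text{sync}},
\qquad\text{with } T_{\text{eff}} = T'\alpha \;\ge\; T,
\]

which is how it can afford extra flops and still finish sooner.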

SLIDE 29

Motivation for Async1


[Diagram: Async0 data flow, as on the previous slides.]

Do you need so many local syncs?

SLIDE 30

Async1


[Diagram: same structure as Async0 (fetch, α local iterations in shared memory, write back, T′ passes), but with the local synchronizations removed; see the loop sketch below.]
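Relative to the Async0 sketch, the change is only the removal of the barriers in the local loop (sketch; same names as above):

for (int k = 0; k < alpha / 2; ++k) {
    s2[ti][tj] = 0.25f * (s1[ti + 1][tj] + s1[ti - 1][tj] +
                          s1[ti][tj - 1] + s1[ti][tj + 1]);
    // No __syncthreads(): threads may read a mix of old and new neighbor
    // values, i.e. chaotic relaxation in the Chazan-Miranker sense.
    s1[ti][tj] = 0.25f * (s2[ti + 1][tj] + s2[ti - 1][tj] +
                          s2[ti][tj - 1] + s2[ti][tj + 1]);
}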

SLIDE 31

Motivation for Async2


[Diagram: Async1 data flow, as on the previous slide.]

The exchange of ghost cells is not assured anyway. Why not get rid of it?

SLIDE 32

Async2


[Diagram: same structure as Async1, but without the ghost-cell exchange between passes.]

SLIDE 33

Motivation for Async3


[Diagram: Async2 data flow, as on the previous slide.]

Why not have a single shared memory grid instead of two?
SLIDE 34

Async3


[Diagram: fetch into a single shared-memory grid and compute and write in place for α local iterations; T′ passes overall; see the sketch below.]
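The local loop then collapses to in-place updates on a single tile (sketch; same names as the Async0 sketch):

for (int k = 0; k < alpha; ++k)
    // In-place update: reads race with writes inside the block, and the
    // halo is stale; this is the fully chaotic end of the design space.
    s1[ti][tj] = 0.25f * (s1[ti + 1][tj] + s1[ti - 1][tj] +
                          s1[ti][tj - 1] + s1[ti][tj + 1]);

Halving the shared-memory footprint can also raise occupancy.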

SLIDE 35


SLIDE 36


SLIDE 37


Conclusion

• Need extensive (automatable) tuning to achieve near-peak performance, even for a simple kernel
• Simple performance models can guide CPU-GPU and Multi-GPU designs
• “Fast and loose” asynchronous algorithms yield non-trivial speedups on the GPU

SLIDE 38

Future work

• Extension of the chaotic relaxation technique to other domains
• Extending the Multi-GPU study to GPU clusters
• Systems that decide “on-the-go” whether CPU-GPU or Multi-GPU execution will pay off, based on performance models
• Automatic “asynchronous” code generator for “arbitrary” iterative methods


SLIDE 39


Thank you!