Tuned and Wildly Asynchronous Stencil Kernels for Heterogeneous CPU/GPU systems
Sundaresan Venkatasubramanian
Prof. Richard Vuduc
Georgia Tech Int’l. Conference on Supercomputing June 10, 2009
To solve Poisson’s equation ∇²u = f in 2-D on a square grid: the centered finite-difference approximation on an (N+2)×(N+2) grid with step size h gives

u_{i,j} = ¼ · ( u_{i+1,j} + u_{i−1,j} + u_{i,j−1} + u_{i,j+1} − h²·f_{i,j} )
for t = 1, 2, 3, …, T do
  for all unknowns (i, j) in the grid do    (embarrassingly parallel)
    U^{t+1}_{i,j} = ¼ · ( U^t_{i+1,j} + U^t_{i−1,j} + U^t_{i,j−1} + U^t_{i,j+1} )
  end for
end for
- Two copies of the grid required
- Implicit global synchronization between iterations
- Used as a subroutine in more complex algorithms, e.g., Multigrid
- Memory-bandwidth-bound kernel
CPU optimizations:
- SIMD vectorization
- Parallelized using pthreads
- Binding on NUMA architectures
*See paper for details
Device: NVIDIA Tesla C870
Grid size: 4096
- 98% of empirical streaming bandwidth
- 66% of true peak
- Approx. 37 GFLOP/s
- Half the work of single
- Approx. 17 GFLOP/s
Need to exchange ghost (boundary) cells between the CPU and GPU partitions
[Figure: hybrid performance model. X-axis: fraction of the grid assigned to the CPU (0%–100%); y-axis: fraction of baseline CPU-only time. Curves: CPU part, GPU part, exchange time, and their combination (Hybrid); reference lines: baseline GPU-only and baseline CPU-only; the optimal fraction is marked where the hybrid curve is lowest.]
[Figure: the same performance model when the exchange cost dominates: the hybrid curve never drops below the baseline GPU-only line.]
Hybrid never beats the pure GPU version
~1.1x speedup
~ 1.8x speedup
[Diagram: baseline GPU scheme. Each of the T iterations fetches a tile from one global grid into shared memory, computes & writes the result to the other global grid, and ends with a global synchronization; every iteration moves data between global memory and shared memory.]
[Diagram: asynchronous scheme. Fetch a tile from global memory into shared memory once; then compute & write twice per pass within shared memory, repeating α/2 times (α local iterations in total), before writing back to global memory. Over T′ outer iterations the total iteration count is T′·α, so the FLOPs performed per global-memory transfer grow by a factor of α.]
Do you need so many local syncs?
The exchange of ghost cells is not assured; can we get rid of it?
Why not have a single shared-memory grid instead?
[Diagram: final scheme. A single shared-memory grid: fetch once, then compute & write in place, repeating α/2 times (α local iterations in total), over T′ outer iterations.]
Future directions:
- Extension of the chaotic relaxation technique to …
- Extending the multi-GPU study to GPU clusters
- Systems that will decide “on-the-go” if CPU-GPU …
- Automatic “asynchronous” code generator for …