Parallelization Strategies
ASD Accelerator HPC Workshop
Computer Systems Group, Research School of Computer Science
Australian National University, Canberra, Australia
May 01, 2019
Issues to Consider When Parallelizing a Problem
Given a serial solution to the problem:
  which loops are data parallel?
  how is the data laid out in memory?
  what are the data dependencies?
  how would thread block sizes and shape affect performance?
When the first parallel solution has been made:
  what are the main issues limiting performance?
  how can these be mitigated?
Computer Systems (ANU) Parallelization Strategies 01 May 2019 2 / 26
Dynamic Parallelism – Mandelbrot Set Revisited
Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem
Adaptive Parallelism
(code and graphics in this section are from https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism by Andy Adinets)

many computations (e.g. AMR) require more work in certain areas
CUDA 5.0 introduced Dynamic Parallelism to support this: 'coarse-grain' kernels can invoke 'finer-grain' kernels
Naive GPU Solution to the Mandelbrot Set
the ‘Escape Time’ algorithm is based on computing the dwell, the number of iterations of z ← z² + c at each pixel (x, y) in a w × h image; cmin (cmax) are the bottom-left (top-right) image corners
__host__ __device__ int pixel_dwell(int w, int h,
                                    complex cmin, complex cmax, int x, int y) {
  complex dc = cmax - cmin;
  float fx = (float)x / w, fy = (float)y / h;
  complex c = cmin + complex(fx * dc.re, fy * dc.im);
  complex z = c;
  int dwell = 0;
  while (dwell < MAX_DWELL && abs2(z) < 2 * 2) {
    z = z * z + c;
    dwell++;
  }
  return dwell;
}
Naive GPU Solution to the Mandelbrot Set (II)
the kernel can be simply expressed with the help of pixel_dwell():
__global__ void mandelbrot_k(int *dwells, int w, int h,
                             complex cmin, complex cmax) {
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  if (x < w && y < h)
    dwells[y * w + x] = pixel_dwell(w, h, cmin, cmax, x, y);
}

... // kernel launch
int w = 4096, h = 4096;
dim3 bs(64, 4), grid(divup(w, bs.x), divup(h, bs.y));
mandelbrot_k<<<grid, bs>>>(d_dwells, w, h, complex(-1.5, -1), complex(0.5, 1));
although embarrassingly parallel, it suffers from load imbalance within a thread block, as pixels may have differing dwell values (also, there are large areas of constant dwell which need not be computed at every point)
The Mariani-Silver Algorithm
solves the above problems by using recursive sub-division in areas of non-constant dwell
mariani_silver(rectangle)
  if (border(rectangle) has common dwell)
    fill rectangle with common dwell
  else if (rectangle size < threshold)
    per-pixel evaluation of the rectangle
  else
    for each sub_rectangle in subdivide(rectangle)
      mariani_silver(sub_rectangle)
CUDA Implementation of Mariani-Silver Algorithm
__global__ void mandelbrot_block_k(int *dwells,
                                   int w, int h, complex cmin, complex cmax,
                                   int x0, int y0, int d, int depth) {
  x0 += d * blockIdx.x, y0 += d * blockIdx.y;
  int common_dwell = border_dwell(w, h, cmin, cmax, x0, y0, d);
  if (threadIdx.x == 0 && threadIdx.y == 0) {
    if (common_dwell != DIFF_DWELL) {  // uniform dwell, just fill
      dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
      dwell_fill<<<grid, bs>>>(dwells, w, x0, y0, d, common_dwell);
    } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
      dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
      mandelbrot_block_k<<<grid, bs>>>
        (dwells, w, h, cmin, cmax, x0, y0, d / SUBDIV, depth + 1);
    } else {  // leaf, per-pixel kernel
      dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
      mandelbrot_pixel_k<<<grid, bs>>>
        (dwells, w, h, cmin, cmax, x0, y0, d);
    }
  }
}

...
int width = 8192, height = 8192;
mandelbrot_block_k<<<dim3(I_SUBDIV, I_SUBDIV), dim3(BSX, BSY)>>>
  (dwells, width, height, complex(-1.5, -1), complex(0.5, 1),
   0, 0, width / I_SUBDIV, 1);
Dynamic Parallelism - Closing Remarks
the Mariani-Silver algorithm performed 1.3× to almost 6× faster than the naive kernel (depending on image size)
a dynamic kernel launch can fail (lack of resources); one should perform a cucheck_dev(cudaGetLastError()) after the launch
#define cucheck_dev(call)                                     \
{                                                             \
  cudaError_t cucheck_err = (call);                           \
  if (cucheck_err != cudaSuccess) {                           \
    const char *err_str = cudaGetErrorString(cucheck_err);    \
    printf("%s (%d): %s\n", __FILE__, __LINE__, err_str);     \
    assert(0);                                                \
  }                                                           \
}
kernel launch is asynchronous; a successful launch only means the kernel is queued
must compile for Compute Capability 3.5 or higher (-arch=sm_35)
dynamic parallelism is generally useful for recursive algorithms, including tree-based algorithms (e.g. quad-tree re-ordering, tree traversal)
Stencil Computations
Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem
Overview: Stencil Computations
degrees of synchronization
synchronous example: Heat Diffusion
  serial and parallel code
  comparison of block and strip thread block shapes
  optimization 1: iterating over contiguous indices
  optimization 2: use of shared data
Degrees of Synchronization
from fully to loosely synchronous: the more synchronous your computation, the more potential overhead
SIMD: synchronized at the instruction level
  provides ease of programming (one program)
  well suited for data decomposition
  applicable to many numerical problems
the forall statement was introduced to specify data parallel operations:

forall (i = 0; i < n; i++) {
  data parallel work
}

e.g. the Jacobi iteration, which solves a system of linear equations (Ax = b) iteratively: x^(t+1) = (b − (A − diag(A)) x^t) / diag(A)
e.g. an s-point stencil computation: A^(t+1)_(i,j) = f(A^t_(i+c1,j+d1), ..., A^t_(i+cs,j+ds)); occurs in many physical problems (e.g. advection), image processing etc.
Locally Synchronous Example: Heat Diffusion
Consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle – find the temperatures in the middle.

finite difference approximation to the Laplace equation:

  ∂²T(x,y)/∂x² + ∂²T(x,y)/∂y² = 0

  (T(x+δx,y) − 2T(x,y) + T(x−δx,y))/δx² + (T(x,y+δy) − 2T(x,y) + T(x,y−δy))/δy² = 0

assuming an even grid (i.e. δx = δy) of n × n points (denoted h_(i,j)), the temperature at any point is the average of the surrounding points:

  h_(i,j) = (h_(i−1,j) + h_(i+1,j) + h_(i,j−1) + h_(i,j+1)) / 4

the problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its NSEW neighbours
Array Ordering
[figure: the n × n grid stored as a 1-D array with row stride k; point x_i has horizontal neighbours x_(i−1) and x_(i+1), and vertical neighbours x_(i−k) and x_(i+k)]

we will solve iteratively: x_i = (x_(i−1) + x_(i+1) + x_(i−k) + x_(i+k)) / 4

but this problem may also be written as a system of linear equations: x_(i−k) + x_(i−1) − 4x_i + x_(i+1) + x_(i+k) = 0
Heat Equation: Sequential Code
assume a fixed number of iterations and a square mesh beware of what happens at the edges!
for (iter = 0; iter < max_iter; iter++) {
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                      h[i][j-1] + h[i][j+1]);
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      h[i][j] = g[i][j];
}
Heat Equation: Parallel Code
(bIdx = blockIdx, bDim = blockDim)
__global__ void heat_diffuse(int n, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj)
      g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                      h[i][j-1] + h[i][j+1]);
}

__global__ void heat_copy(int n, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj)
      h[i][j] = g[i][j];
}

...
for (iter = 0; iter < max_iter; iter++) {
  heat_diffuse<<<grid,block>>>(n, g, h);
  heat_copy<<<grid,block>>>(n, g, h);
}
Heat Equation: Issues with the Parallel Solution
1. low amount of work per kernel: significant kernel invocation overhead unless n is very large
2. high ratio of memory accesses (4 loads and 1 store) to floating point operations (4 FLOPs)

What can we do to mitigate these? Note that the heat diffusion problem is a worst-case example: other stencils are more complex and hence richer in floating point operations.

For issue 1, we can replace the call to heat_copy<<<grid,block>>>(n, g, h) with heat_diffuse<<<grid,block>>>(n, h, g) (and iter++ with iter += 2).
Heat Equation: Partitioning
normally more than one point per thread block
option of either block or strip partitioning, in terms of the shape of the block

[figure: a square grid partitioned among thread blocks B0 ... B(p−1), as square sub-blocks (block partitioning) and as full-width strips (strip partitioning)]
Block/Strip Memory Access Comparison
block partitioning: b × b blocks, 4 boundary edges needed: n_data = 4b + b²
strip partitioning: 1 × b² blocks, 2 boundary edges needed: n_data = 3b²

[figure: boundary (halo) elements accessed under block vs strip partitioning]
Iterating over Contiguous Indices
idea: if threads iterate contiguously over j in ‘blocks’ of width w, the load of h[i][j+1] can be re-used for h[i][(j+2)-1]
__global__ void heat_c(int n, int w, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int jw = j0*w; jw < n; jw += dj*w) {
      double hij = h[i][jw], hijm1 = h[i][jw-1];
      for (int j = jw; j < jw + w; j++) {
        double hijp1 = h[i][j+1];
        g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                        hijm1 + hijp1);
        hijm1 = hij, hij = hijp1;
      }
    }
}

...
for (iter = 0; iter < max_iter; iter++) {
  heat_c<<<grid,dim3(1, block.y)>>>(n, block.y, g, h);
  heat_copy<<<grid,block>>>(n, g, h);
}
Load Reduction by Use of Shared Memory
idea: load blocks of h used by a thread block into shared memory (but what about the edges?)
const int BX = 128, BY = 128;

__global__ void heat_s(int n, double g[][], double h[][]) {
  __shared__ double hs[BY+2][BX+2];
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  int ti = threadIdx.y+1, tj = threadIdx.x+1;
  int ei = (ti == 1)? -1: (ti == bDim.y)? +1: 0;
  int ej = (tj == 1)? -1: (tj == bDim.x)? +1: 0;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj) {
      hs[ti][tj] = h[i][j];                  // 3 loads per element
      if (ei) hs[ti+ei][tj] = h[i+ei][j];    // top, bottom
      if (ej) hs[ti][tj+ej] = h[i][j+ej];    // left, right
      __syncthreads();
      g[i][j] = 0.25*(hs[ti-1][tj] + hs[ti+1][tj] +
                      hs[ti][tj-1] + hs[ti][tj+1]);
    }
}
Heat Diffusion: Further Optimizations
so far, both optimizations only reduced 4 loads to 3 on the heat stencil
on a 3 × 3 stencil, they would reduce 9 loads to 3 (contiguous indices) or 4 (shared memory)
could the edges in the shared memory example be loaded by a single load operation?
alternately, to operate on blocks of size b × b, we could have (b + 2) × (b + 2) thread blocks to load all elements in one load operation
  the way we iterate over blocks is now more complex
  the threads on the edges do not update any element, resulting in a wastage factor of 4/b
Dynamic Programming – The Knapsack Problem
Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem
The 0-1 Knapsack Problem
Select the set of items to maximize total value but not exceed the knapsack’s capacity:
(courtesy Software School of XiDian University)
The 0-1 Knapsack Problem: Optimal Sub-structure
let there be n items of weight w_i and value v_i, and let the knapsack capacity be W
let c[i, j] be the (maximal) total value that can be taken from the first i items, 1 ≤ i ≤ n, for a sack of capacity j, 1 ≤ j ≤ W
optimal sub-structure property: if an optimal solution contains (does not contain) item n, the remainder must be an optimal solution to items 1, 2, ..., n − 1 with capacity W − w_n (capacity W), leading to the recurrence:

  c[i, j] = 0                                            if i = 0 or j = 0
  c[i, j] = c[i − 1, j]                                  if w_i > j
  c[i, j] = max(c[i − 1, j − w_i] + v_i, c[i − 1, j])    otherwise

this can be solved using dynamic programming in O(nW) time
in general, dynamic programming problems have dependencies in one dimension (i above), but might not in others (j above)
Dynamic Programming Solution to Knapsack
consider the problem with n = 5, C = 11, v = [1, 4, 5, 7, 11], w = [1, 2, 2, 4, 6]; the following table is filled from the top to solve this problem:

  wi  vi  i | capacity: 1  2  3  4  5  6  7  8  9 10 11
   1   1  1 |           1  1  1  1  1  1  1  1  1  1  1
   2   4  2 |           1  4  5  5  5  5  5  5  5  5  5
   2   5  3 |           1  5  6  9 10 10 10 10 10 10 10
   4   7  4 |           1  5  6  9 10 12 13 16 17 17 17
   6  11  5 |           1  5  6  9 10 12 13 16 17 20 21

c[5, 11] = 21 = c[4, 11 − w5] + v5 = c[4, 5] + 11 = 10 + 11
c[3, 5] = 10 = c[2, 5 − w3] + v3 = c[2, 3] + 5 = 5 + 5
c[2, 3] = 5 = c[1, 3 − w2] + v2 = c[1, 1] + 4 = 1 + 4
c[1, 1] = 1 = c[0, 1 − w1] + v1 = c[0, 0] + 1 = 0 + 1