

SLIDE 1

Parallelization Strategies ASD Accelerator HPC Workshop

Computer Systems Group

Research School of Computer Science Australian National University Canberra, Australia

May 01, 2019

SLIDE 2

Issues to Consider When Parallelizing a Problem

Given a serial solution to the problem:
  - which loops are data parallel?
  - how is the data laid out in memory?
  - what are the data dependencies?
  - how would thread block sizes and shapes affect performance?

When the first parallel solution has been made:
  - what are the main issues limiting performance?
  - how can these be mitigated?

Computer Systems (ANU) Parallelization Strategies 01 May 2019 2 / 26

SLIDE 3

Dynamic Parallelism – Mandelbrot Set Revisited

Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem


SLIDE 4

Dynamic Parallelism – Mandelbrot Set Revisited

Adaptive Parallelism

(code and graphics in this section are from https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism by Andy Adinets)

many computations (e.g. adaptive mesh refinement) require more work in certain areas
CUDA 5.0 introduced Dynamic Parallelism to support this: ‘coarse-grain’ kernels can invoke ‘finer-grain’ kernels


SLIDE 5

Dynamic Parallelism – Mandelbrot Set Revisited

Naive GPU Solution to the Mandelbrot Set

the ‘Escape Time’ algorithm is based on computing the dwell, the number of iterations of z ← z² + c at each pixel (x, y) in a w × h image; cmin (cmax) are the bottom-left (top-right) image corners

__host__ __device__ int pixel_dwell(int w, int h,
                                    complex cmin, complex cmax,
                                    int x, int y) {
  complex dc = cmax - cmin;
  float fx = (float)x / w, fy = (float)y / h;
  complex c = cmin + complex(fx * dc.re, fy * dc.im);
  complex z = c;
  int dwell = 0;
  while (dwell < MAX_DWELL && abs2(z) < 2 * 2) {
    z = z * z + c;
    dwell++;
  }
  return dwell;
}


SLIDE 6

Dynamic Parallelism – Mandelbrot Set Revisited

Naive GPU Solution to the Mandelbrot Set (II)

the kernel can be simply expressed with the help of pixel_dwell():

__global__ void mandelbrot_k(int *dwells, int w, int h,
                             complex cmin, complex cmax) {
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  if (x < w && y < h)
    dwells[y * w + x] = pixel_dwell(w, h, cmin, cmax, x, y);
}

...
// kernel launch
int w = 4096, h = 4096;
dim3 bs(64, 4), grid(divup(w, bs.x), divup(h, bs.y));
mandelbrot_k<<<grid, bs>>>(d_dwells, w, h,
                           complex(-1.5, -1), complex(0.5, 1));

although embarrassingly parallel, it suffers from load imbalance within a thread block, as pixels may have differing dwell values (also, there are large areas of constant dwell which need not be computed for every point)


SLIDE 7

Dynamic Parallelism – Mandelbrot Set Revisited

The Mariani-Silver Algorithm

solves the above problems by using recursive sub-division in areas of non-constant dwell

mariani_silver(rectangle)
  if (border(rectangle) has common dwell)
    fill rectangle with common dwell
  else if (rectangle size < threshold)
    per-pixel evaluation of the rectangle
  else
    for each sub_rectangle in subdivide(rectangle)
      mariani_silver(sub_rectangle)


SLIDE 8

Dynamic Parallelism – Mandelbrot Set Revisited

CUDA Implementation of Mariani-Silver Algorithm

__global__ void mandelbrot_block_k(int *dwells,
    int w, int h, complex cmin, complex cmax,
    int x0, int y0, int d, int depth) {
  x0 += d * blockIdx.x, y0 += d * blockIdx.y;
  int common_dwell = border_dwell(w, h, cmin, cmax, x0, y0, d);
  if (threadIdx.x == 0 && threadIdx.y == 0) {
    if (common_dwell != DIFF_DWELL) { // uniform dwell, just fill
      dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
      dwell_fill<<<grid, bs>>>(dwells, w, x0, y0, d, common_dwell);
    } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
      dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
      mandelbrot_block_k<<<grid, bs>>>
        (dwells, w, h, cmin, cmax, x0, y0, d / SUBDIV, depth + 1);
    } else { // leaf, per-pixel kernel
      dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
      mandelbrot_pixel_k<<<grid, bs>>>
        (dwells, w, h, cmin, cmax, x0, y0, d);
    }
  }
}

...
int width = 8192, height = 8192;
mandelbrot_block_k<<<dim3(I_SUBDIV, I_SUBDIV), dim3(BSX, BSY)>>>
  (dwells, width, height, complex(-1.5, -1), complex(0.5, 1),
   0, 0, width / I_SUBDIV, 1);


SLIDE 9

Dynamic Parallelism – Mandelbrot Set Revisited

Dynamic Parallelism - Closing Remarks

the Mariani-Silver algorithm performed 1.3× to almost 6× faster than the naive version (depending on image size)
a dynamic kernel launch can fail (lack of resources); one should perform a cucheck_dev(cudaGetLastError()) after the launch

#define cucheck_dev(call)                                      \
{                                                              \
  cudaError_t cucheck_err = (call);                            \
  if (cucheck_err != cudaSuccess) {                            \
    const char *err_str = cudaGetErrorString(cucheck_err);     \
    printf("%s (%d): %s\n", __FILE__, __LINE__, err_str);      \
    assert(0);                                                 \
  }                                                            \
}

kernel launch is asynchronous; a successful launch only means the kernel is queued
must compile for Compute Capability 3.5 or higher (-arch=sm_35)
dynamic parallelism is generally useful for recursive algorithms, including tree-based algorithms (e.g. quad-tree re-ordering, tree traversal)


SLIDE 10

Stencil Computations

Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem


SLIDE 11

Stencil Computations

Overview: Stencil Computations

degrees of synchronization
synchronous example: Heat Diffusion
  - serial and parallel code
  - comparison of block and strip thread block shapes
  - optimization 1: iterating over contiguous indices
  - optimization 2: use of shared data


SLIDE 12

Stencil Computations

Degrees of Synchronization

from fully to loosely synchronous: the more synchronous your computation, the more potential overhead

SIMD: synchronized at the instruction level
  - provides ease of programming (one program)
  - well suited for data decomposition
  - applicable to many numerical problems

the forall statement was introduced to specify data parallel operations:

forall (i = 0; i < n; i++) {
  data parallel work
}

e.g. the Jacobi iteration, which solves a system of linear equations Ax = b iteratively:

  x^(t+1) = (b − (A − diag(A)) x^t) / diag(A)

e.g. an s-point stencil computation:

  A^(t+1)_{i,j} = f(A^t_{i+c1,j+d1}, ..., A^t_{i+cs,j+ds})

occurs in many physical problems (e.g. advection), image processing, etc.


SLIDE 13

Stencil Computations

Locally Synchronous Example: Heat Diffusion

Consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle – find the temperatures in the middle.

finite difference approximation to the Laplace equation:

  ∂²T(x,y)/∂x² + ∂²T(x,y)/∂y² = 0

  (T(x+δx,y) − 2T(x,y) + T(x−δx,y))/δx² + (T(x,y+δy) − 2T(x,y) + T(x,y−δy))/δy² ≈ 0

assuming an even grid (i.e. δx = δy) of n × n points (denoted h_{i,j}), the temperature at any point is the average of the surrounding points:

  h_{i,j} = (h_{i−1,j} + h_{i+1,j} + h_{i,j−1} + h_{i,j+1}) / 4

the problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its NSEW neighbours


SLIDE 14

Stencil Computations

Array Ordering

[figure: an n × n grid stored in row-major order with k points per row; point x_i has horizontal neighbours x_{i−1} and x_{i+1}, and vertical neighbours x_{i−k} and x_{i+k}]

we will solve iteratively:

  x_i = (x_{i−1} + x_{i+1} + x_{i−k} + x_{i+k}) / 4

but this problem may also be written as a system of linear equations:

  x_{i−k} + x_{i−1} − 4x_i + x_{i+1} + x_{i+k} = 0


SLIDE 15

Stencil Computations

Heat Equation: Sequential Code

assume a fixed number of iterations and a square mesh beware of what happens at the edges!

for (iter = 0; iter < max_iter; iter++) {
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                      h[i][j-1] + h[i][j+1]);
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      h[i][j] = g[i][j];
}

SLIDE 16

Stencil Computations

Heat Equation: Parallel Code

(bIdx = blockIdx, bDim = blockDim, abbreviated to fit)

__global__ heat_diffuse(int n, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj)
      g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                      h[i][j-1] + h[i][j+1]);
}

__global__ heat_copy(int n, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj)
      h[i][j] = g[i][j];
}

...
for (iter = 0; iter < max_iter; iter++) {
  heat_diffuse<<<grid,block>>>(n, g, h);
  heat_copy<<<grid,block>>>(n, g, h);
}


SLIDE 17

Stencil Computations

Heat Equation: Issues with the Parallel Solution

1. low amount of work per kernel: significant kernel invocation overhead unless n is very large
2. high ratio of memory accesses (4 loads and 1 store) to floating point operations (4 FLOPs)

What can we do to mitigate these? Note: the heat diffusion problem is a worst-case example; other stencils are more complex and hence richer in floating point operations.

For issue 1, we can replace the call heat_copy<<<grid,block>>>(n, g, h) with heat_diffuse<<<grid,block>>>(n, h, g) (and iter++ with iter += 2), eliminating the copy kernel entirely.


SLIDE 18

Stencil Computations

Heat Equation: Partitioning

normally more than one point per thread block
option of either block or strip partitioning, in terms of the shape of the block

[figure: block partitioning divides the grid into square blocks B0 ... B(p−1); strip partitioning divides it into full-width strips B0 ... B(p−1)]


SLIDE 19

Stencil Computations

Block/Strip Memory Access Comparison

block partitioning: b × b blocks; halo data from all 4 boundary edges is needed, so n_data = 4b + b²
strip partitioning: 1 × b² strips; halo data from only 2 edges, but each of length b², so n_data = 3b²

[figure: communication patterns for a b × b block and a 1 × b² strip]

SLIDE 20

Stencil Computations

Iterating over Contiguous Indices

idea: if threads iterate contiguously over j in ‘blocks’ of width w, the load of h[i][j+1] can be re-used for h[i][(j+2)-1]

__global__ heat_c(int n, int w, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int jw = j0*w; jw < n; jw += dj*w) {
      double hij = h[i][jw], hijm1 = h[i][jw-1];
      for (int j = jw; j < jw + w; j++) {
        double hijp1 = h[i][j+1];
        g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                        hijm1 + hijp1);
        hijm1 = hij, hij = hijp1;
      }
    }
}

...
for (iter = 0; iter < max_iter; iter++) {
  heat_c<<<grid,dim3(1, block.y)>>>(n, block.y, g, h);
  heat_copy<<<grid,block>>>(n, g, h);
}


SLIDE 21

Stencil Computations

Load Reduction by Use of Shared Memory

idea: load blocks of h used by a thread block into shared memory (but what about the edges?)

const int BX = 128, BY = 128;

__global__ heat_s(int n, double g[][], double h[][]) {
  __shared__ double hs[BY+2][BX+2]; // shared arrays are declared inside the kernel
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  int ti = threadIdx.y+1, tj = threadIdx.x+1;
  int ei = (ti == 1)? -1: (ti == bDim.y)? +1: 0;
  int ej = (tj == 1)? -1: (tj == bDim.x)? +1: 0;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj) {
      hs[ti][tj] = h[i][j];                 // at most 3 loads per element
      if (ei) hs[ti+ei][tj] = h[i+ei][j];   // top, bottom edges
      if (ej) hs[ti][tj+ej] = h[i][j+ej];   // left, right edges
      __syncthreads();
      g[i][j] = 0.25*(hs[ti-1][tj] + hs[ti+1][tj] +
                      hs[ti][tj-1] + hs[ti][tj+1]);
    }
}


SLIDE 22

Stencil Computations

Heat Diffusion: Further Optimizations

so far, both optimizations only reduced 4 loads to 3 on the heat stencil
on a 3 × 3 stencil, they would reduce 9 loads to 3 (contiguous indices) or 4 (shared memory)

could the edges in the shared memory example be loaded by a single load operation?
alternately, to operate on blocks of size b × b, we could have (b + 2) × (b + 2) thread blocks to load all elements in one load operation
  - the way we iterate over blocks is now more complex
  - the threads on the edges do not update any element, resulting in a wastage factor of 4/b


SLIDE 23

Dynamic Programming – The Knapsack Problem

Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem


SLIDE 24

Dynamic Programming – The Knapsack Problem

The 0-1 Knapsack Problem

Select the set of items that maximizes total value without exceeding the knapsack's capacity.

(illustration courtesy Software School of XiDian University)

SLIDE 25

Dynamic Programming – The Knapsack Problem

The 0-1 Knapsack Problem: Optimal Sub-structure

let there be n items of weight wi and value vi, and let the knapsack capacity be W
let c[i, j] be the (maximal) total value that can be taken from the first i items, 1 ≤ i ≤ n, for a sack of capacity j, 1 ≤ j ≤ W

optimal sub-structure property: if an optimal solution contains (does not contain) item n, the remainder must be an optimal solution to items 1, 2, ..., n − 1 with capacity W − wn (capacity W)

leading to the recurrence:

  c[i, j] = 0                                           if i = 0 or j = 0
            c[i − 1, j]                                 if wi > j
            max(c[i − 1, j − wi] + vi, c[i − 1, j])     otherwise

this can be solved using dynamic programming in O(nW) time

in general, dynamic programming problems have dependencies in one dimension (i above), but might not in others (j above)


SLIDE 26

Dynamic Programming – The Knapsack Problem

Dynamic Programming Solution to Knapsack

consider the problem with n = 5, C = 11, v = [1, 4, 5, 7, 11], w = [1, 2, 2, 4, 6]
the following table is filled from the top row down to solve this problem:

                    capacity j:  1  2  3  4  5  6  7  8  9 10 11
  i=1 (w1=1, v1= 1):             1  1  1  1  1  1  1  1  1  1  1
  i=2 (w2=2, v2= 4):             1  4  5  5  5  5  5  5  5  5  5
  i=3 (w3=2, v3= 5):             1  5  6  9 10 10 10 10 10 10 10
  i=4 (w4=4, v4= 7):             1  5  6  9 10 12 13 16 17 17 17
  i=5 (w5=6, v5=11):             1  5  6  9 10 12 13 16 17 20 21

tracing back the chosen items:
  c[5, 11] = 21 = c[4, 11 − w5] + v5 = c[4, 5] + 11 = 10 + 11
  c[3, 5]  = 10 = c[2, 5 − w3] + v3 = c[2, 3] + 5 = 5 + 5
  c[2, 3]  =  5 = c[1, 3 − w2] + v2 = c[1, 1] + 4 = 1 + 4
  c[1, 1]  =  1 = c[0, 1 − w1] + v1 = c[0, 0] + 1 = 0 + 1
