

SLIDE 1

Parallelization Strategies ASD Accelerator HPC Workshop

Computer Systems Group

Research School of Computer Science Australian National University Canberra, Australia

May 01, 2019

SLIDE 2

Issues to Consider When Parallelizing a Problem

Given a serial solution to the problem:
  - which loops are data parallel?
  - how is the data laid out in memory?
  - what are the data dependencies?
  - how would thread block sizes and shapes affect performance?

When the first parallel solution has been made:
  - what are the main issues limiting performance?
  - how can these be mitigated?

Computer Systems (ANU) Parallelization Strategies 01 May 2019 2 / 26

SLIDE 3

Dynamic Parallelism – Mandelbrot Set Revisited

Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem


SLIDE 4

Dynamic Parallelism – Mandelbrot Set Revisited

Adaptive Parallelism

(code and graphics in this section are from https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism by Andy Adinets)

many computations (e.g. adaptive mesh refinement) require more work in certain areas
CUDA 5.0 introduced Dynamic Parallelism to support this: ‘coarse-grain’ kernels can invoke ‘finer-grain’ kernels


SLIDE 5

Dynamic Parallelism – Mandelbrot Set Revisited

Naive GPU Solution to the Mandelbrot Set

the ‘Escape Time’ algorithm is based on computing the dwell, the number of iterations of z ← z² + c at each pixel (x, y) in a w × h image; cmin (cmax) are the bottom-left (top-right) image corners

__host__ __device__ int pixel_dwell(int w, int h,
                                    complex cmin, complex cmax,
                                    int x, int y) {
  complex dc = cmax - cmin;
  float fx = (float)x / w, fy = (float)y / h;
  complex c = cmin + complex(fx * dc.re, fy * dc.im);
  complex z = c;
  int dwell = 0;
  while (dwell < MAX_DWELL && abs2(z) < 2 * 2) {
    z = z * z + c;
    dwell++;
  }
  return dwell;
}


SLIDE 6

Dynamic Parallelism – Mandelbrot Set Revisited

Naive GPU Solution to the Mandelbrot Set (II)

the kernel can be simply expressed with the help of pixel_dwell():

__global__ void mandelbrot_k(int *dwells, int w, int h,
                             complex cmin, complex cmax) {
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  if (x < w && y < h)
    dwells[y * w + x] = pixel_dwell(w, h, cmin, cmax, x, y);
}

...
// kernel launch
int w = 4096, h = 4096;
dim3 bs(64, 4), grid(divup(w, bs.x), divup(h, bs.y));
mandelbrot_k<<<grid, bs>>>(d_dwells, w, h,
                           complex(-1.5, -1), complex(0.5, 1));

although embarrassingly parallel, it suffers from load imbalance within a thread block, as pixels may have differing dwell values (also, there are large areas of constant dwell which need not be computed for every point)


SLIDE 7

Dynamic Parallelism – Mandelbrot Set Revisited

The Mariani-Silver Algorithm

solves the above problems by using recursive sub-division in areas of non-constant dwell

mariani_silver(rectangle)
  if (border(rectangle) has common dwell)
    fill rectangle with common dwell
  else if (rectangle size < threshold)
    per-pixel evaluation of the rectangle
  else
    for each sub_rectangle in subdivide(rectangle)
      mariani_silver(sub_rectangle)


SLIDE 8

Dynamic Parallelism – Mandelbrot Set Revisited

CUDA Implementation of Mariani-Silver Algorithm

__global__ void mandelbrot_block_k(int *dwells,
    int w, int h, complex cmin, complex cmax,
    int x0, int y0, int d, int depth) {
  x0 += d * blockIdx.x, y0 += d * blockIdx.y;
  int common_dwell = border_dwell(w, h, cmin, cmax, x0, y0, d);
  if (threadIdx.x == 0 && threadIdx.y == 0) {
    if (common_dwell != DIFF_DWELL) { // uniform dwell, just fill
      dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
      dwell_fill<<<grid, bs>>>(dwells, w, x0, y0, d, common_dwell);
    } else if (depth + 1 < MAX_DEPTH && d / SUBDIV > MIN_SIZE) {
      dim3 bs(blockDim.x, blockDim.y), grid(SUBDIV, SUBDIV);
      mandelbrot_block_k<<<grid, bs>>>
        (dwells, w, h, cmin, cmax, x0, y0, d / SUBDIV, depth + 1);
    } else { // leaf, per-pixel kernel
      dim3 bs(BSX, BSY), grid(divup(d, BSX), divup(d, BSY));
      mandelbrot_pixel_k<<<grid, bs>>>
        (dwells, w, h, cmin, cmax, x0, y0, d);
    }
  }
}

...
int width = 8192, height = 8192;
mandelbrot_block_k<<<dim3(I_SUBDIV, I_SUBDIV), dim3(BSX, BSY)>>>
  (dwells, width, height, complex(-1.5, -1), complex(0.5, 1),
   0, 0, width / I_SUBDIV, 1);


SLIDE 9

Dynamic Parallelism – Mandelbrot Set Revisited

Dynamic Parallelism - Closing Remarks

the Mariani-Silver algorithm performed 1.3× to almost 6× faster than the naive version (depending on image size)
a dynamic kernel launch can fail (lack of resources); one should perform a cucheck_dev(cudaGetLastError()) after the launch

#define cucheck_dev(call)                                      \
{                                                              \
  cudaError_t cucheck_err = (call);                            \
  if (cucheck_err != cudaSuccess) {                            \
    const char *err_str = cudaGetErrorString(cucheck_err);     \
    printf("%s (%d): %s\n", __FILE__, __LINE__, err_str);      \
    assert(0);                                                 \
  }                                                            \
}

kernel launch is asynchronous; a successful launch only means the kernel is queued
must compile for Compute Capability 3.5 or higher (-arch=sm_35)
dynamic parallelism is generally useful for recursive algorithms, including tree-based algorithms (e.g. quad-tree re-ordering, tree traversal)


SLIDE 10

Stencil Computations

Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem


SLIDE 11

Stencil Computations

Overview: Stencil Computations

degrees of synchronization
synchronous example: Heat Diffusion
  - serial and parallel code
  - comparison of block and strip thread block shapes
  - optimization 1: iterating over contiguous indices
  - optimization 2: use of shared data


SLIDE 12

Stencil Computations

Degrees of Synchronization

from fully to loosely synchronous: the more synchronous your computation, the more potential overhead

SIMD: synchronized at the instruction level
  - provides ease of programming (one program)
  - well suited for data decomposition
  - applicable to many numerical problems

the forall statement was introduced to specify data parallel operations:

forall (i = 0; i < n; i++) {
  data parallel work
}

e.g. the Jacobi iteration, which solves a system of linear equations Ax = b iteratively:

  x^(t+1) = (b − (A − diag(A)) x^t) / diag(A)

e.g. an s-point stencil computation:

  A^(t+1)_{i,j} = f(A^t_{i+c1,j+d1}, ..., A^t_{i+cs,j+ds})

occurs in many physical problems (e.g. advection), image processing, etc.


SLIDE 13

Stencil Computations

Locally Synchronous Example: Heat Diffusion

Consider a metal sheet with a fixed temperature along the sides but unknown temperatures in the middle – find the temperatures in the middle.

finite difference approximation to the Laplace equation:

  ∂²T(x,y)/∂x² + ∂²T(x,y)/∂y² = 0

  (T(x+δx,y) − 2T(x,y) + T(x−δx,y))/δx² + (T(x,y+δy) − 2T(x,y) + T(x,y−δy))/δy² ≈ 0

assuming an even grid (i.e. δx = δy) of n × n points (denoted h_{i,j}), the temperature at any point is the average of the surrounding points:

  h_{i,j} = (h_{i−1,j} + h_{i+1,j} + h_{i,j−1} + h_{i,j+1}) / 4

the problem is very similar to the Game of Life, i.e. what happens in a cell depends upon its NSEW neighbours


SLIDE 14

Stencil Computations

Array Ordering

[figure: an n × n grid stored in row-major order with k points per row; point x_i has horizontal neighbours x_{i−1} and x_{i+1}, and vertical neighbours x_{i−k} and x_{i+k}]

we will solve iteratively:

  x_i = (x_{i−1} + x_{i+1} + x_{i−k} + x_{i+k}) / 4

but this problem may also be written as a system of linear equations:

  x_{i−k} + x_{i−1} − 4x_i + x_{i+1} + x_{i+k} = 0


SLIDE 15

Stencil Computations

Heat Equation: Sequential Code

assume a fixed number of iterations and a square mesh beware of what happens at the edges!

for (iter = 0; iter < max_iter; iter++) {
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                      h[i][j-1] + h[i][j+1]);
  for (i = 1; i < n; i++)
    for (j = 1; j < n; j++)
      h[i][j] = g[i][j];
}

SLIDE 16

Stencil Computations

Heat Equation: Parallel Code

(bIdx = blockIdx, bDim = blockDim, abbreviated to fit)

__global__ heat_diffuse(int n, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj)
      g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                      h[i][j-1] + h[i][j+1]);
}

__global__ heat_copy(int n, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj)
      h[i][j] = g[i][j];
}

...
for (iter = 0; iter < max_iter; iter++) {
  heat_diffuse<<<grid,block>>>(n, g, h);
  heat_copy<<<grid,block>>>(n, g, h);
}


SLIDE 17

Stencil Computations

Heat Equation: Issues with the Parallel Solution

1. low amount of work per kernel: significant kernel invocation overhead unless n is very large
2. high ratio of memory accesses (4 loads and 1 store) to floating point operations (4 FLOPs)

What can we do to mitigate these? Note: the heat diffusion problem is a worst-case example; other stencils are more complex and hence richer in floating point operations.

For issue 1, we can replace the call heat_copy<<<grid,block>>>(n, g, h) with heat_diffuse<<<grid,block>>>(n, h, g) (and iter++ with iter += 2), eliminating the copy kernel entirely.


SLIDE 18

Stencil Computations

Heat Equation: Partitioning

normally more than one point per thread block
option of either block or strip partitioning, in terms of the shape of the block

[figure: block partitioning divides the grid into square blocks B0 ... B(p−1); strip partitioning divides it into full-width strips B0 ... B(p−1)]


SLIDE 19

Stencil Computations

Block/Strip Memory Access Comparison

block partitioning: b × b blocks; halo data from all 4 boundary edges is needed, so n_data = 4b + b²
strip partitioning: 1 × b² strips; halo data from only 2 edges, but each of length b², so n_data = 3b²

[figure: communication patterns for a b × b block and a 1 × b² strip]

SLIDE 20

Stencil Computations

Iterating over Contiguous Indices

idea: if threads iterate contiguously over j in ‘blocks’ of width w, the load of h[i][j+1] can be re-used for h[i][(j+2)-1]

__global__ heat_c(int n, int w, double g[][], double h[][]) {
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  for (int i = i0; i < n; i += di)
    for (int jw = j0*w; jw < n; jw += dj*w) {
      double hij = h[i][jw], hijm1 = h[i][jw-1];
      for (int j = jw; j < jw + w; j++) {
        double hijp1 = h[i][j+1];
        g[i][j] = 0.25*(h[i-1][j] + h[i+1][j] +
                        hijm1 + hijp1);
        hijm1 = hij, hij = hijp1;
      }
    }
}

...
for (iter = 0; iter < max_iter; iter++) {
  heat_c<<<grid,dim3(1, block.y)>>>(n, block.y, g, h);
  heat_copy<<<grid,block>>>(n, g, h);
}


SLIDE 21

Stencil Computations

Load Reduction by Use of Shared Memory

idea: load blocks of h used by a thread block into shared memory (but what about the edges?)

const int BX = 128, BY = 128;

__global__ heat_s(int n, double g[][], double h[][]) {
  __shared__ double hs[BY+2][BX+2]; // shared arrays are declared inside the kernel
  int i0 = bIdx.y*bDim.y + threadIdx.y, di = bDim.y*gridDim.y;
  int j0 = bIdx.x*bDim.x + threadIdx.x, dj = bDim.x*gridDim.x;
  int ti = threadIdx.y+1, tj = threadIdx.x+1;
  int ei = (ti == 1)? -1: (ti == bDim.y)? +1: 0;
  int ej = (tj == 1)? -1: (tj == bDim.x)? +1: 0;
  for (int i = i0; i < n; i += di)
    for (int j = j0; j < n; j += dj) {
      hs[ti][tj] = h[i][j];                 // at most 3 loads per element
      if (ei) hs[ti+ei][tj] = h[i+ei][j];   // top, bottom edges
      if (ej) hs[ti][tj+ej] = h[i][j+ej];   // left, right edges
      __syncthreads();
      g[i][j] = 0.25*(hs[ti-1][tj] + hs[ti+1][tj] +
                      hs[ti][tj-1] + hs[ti][tj+1]);
    }
}


SLIDE 22

Stencil Computations

Heat Diffusion: Further Optimizations

so far, both optimizations only reduced 4 loads to 3 on the heat stencil
on a 3 × 3 stencil, they would reduce 9 loads to 3 (contiguous indices) or 4 (shared memory)

could the edges in the shared memory example be loaded by a single load operation?
alternately, to operate on blocks of size b × b, we could have (b + 2) × (b + 2) thread blocks to load all elements in one load operation
  - the way we iterate over blocks is now more complex
  - the threads on the edges do not update any element, resulting in a wastage factor of 4/b


SLIDE 23

Dynamic Programming – The Knapsack Problem

Outline

1. Dynamic Parallelism – Mandelbrot Set Revisited
2. Stencil Computations
3. Dynamic Programming – The Knapsack Problem


SLIDE 24

Dynamic Programming – The Knapsack Problem

The 0-1 Knapsack Problem

Select the set of items that maximizes total value without exceeding the knapsack's capacity.

(illustration courtesy Software School of XiDian University)

SLIDE 25

Dynamic Programming – The Knapsack Problem

The 0-1 Knapsack Problem: Optimal Sub-structure

let there be n items of weight wi and value vi, and let the knapsack capacity be W
let c[i, j] be the (maximal) total value that can be taken from the first i items, 1 ≤ i ≤ n, for a sack of capacity j, 1 ≤ j ≤ W

optimal sub-structure property: if an optimal solution contains (does not contain) item n, the remainder must be an optimal solution to items 1, 2, ..., n − 1 with capacity W − wn (capacity W)

leading to the recurrence:

  c[i, j] = 0                                           if i = 0 or j = 0
            c[i − 1, j]                                 if wi > j
            max(c[i − 1, j − wi] + vi, c[i − 1, j])     otherwise

this can be solved using dynamic programming in O(nW) time

in general, dynamic programming problems have dependencies in one dimension (i above), but might not in others (j above)


SLIDE 26

Dynamic Programming – The Knapsack Problem

Dynamic Programming Solution to Knapsack

consider the problem with n = 5, C = 11, v = [1, 4, 5, 7, 11], w = [1, 2, 2, 4, 6]
the following table is filled from the top row down to solve this problem:

                    capacity j:  1  2  3  4  5  6  7  8  9 10 11
  i=1 (w1=1, v1= 1):             1  1  1  1  1  1  1  1  1  1  1
  i=2 (w2=2, v2= 4):             1  4  5  5  5  5  5  5  5  5  5
  i=3 (w3=2, v3= 5):             1  5  6  9 10 10 10 10 10 10 10
  i=4 (w4=4, v4= 7):             1  5  6  9 10 12 13 16 17 17 17
  i=5 (w5=6, v5=11):             1  5  6  9 10 12 13 16 17 20 21

tracing back the chosen items:
  c[5, 11] = 21 = c[4, 11 − w5] + v5 = c[4, 5] + 11 = 10 + 11
  c[3, 5]  = 10 = c[2, 5 − w3] + v3 = c[2, 3] + 5 = 5 + 5
  c[2, 3]  =  5 = c[1, 3 − w2] + v2 = c[1, 1] + 4 = 1 + 4
  c[1, 1]  =  1 = c[0, 1 − w1] + v1 = c[0, 0] + 1 = 0 + 1
