Parallelization Strategies ASD Distributed Memory HPC Workshop - PowerPoint PPT Presentation

Parallelization Strategies ASD Distributed Memory HPC Workshop Computer Systems Group Research School of Computer Science Australian National University Canberra, Australia November 01, 2017

Day 3 – Schedule Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 2 / 84

Embarrassingly Parallel Problems Outline Embarrassingly Parallel Problems 1 Parallelisation via Data Partitioning 2 Synchronous Computations 3 Parallel Matrix Algorithms 4 Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 3 / 84

Embarrassingly Parallel Problems Outline: Embarrassingly Parallel Problems what they are Mandelbrot Set computation cost considerations static parallelization dynamic parallelizations and its analysis Monte Carlo Methods parallel random number generation Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 4 / 84

Embarrassingly Parallel Problems Embarrassingly Parallel Problems computation can be divided into completely independent parts for execution by separate processors (correspond to totally disconnected computational graphs) infrastructure: Blocks of Independent Computations (BOINC) project SETI@home and Folding@Home are projects solving very large such problems part of an application may be embarrassingly parallel distribution and collection of data are the key issues (can be non-trivial and/or costly) frequently uses the master/slave approach ( p − 1 speedup) Send data Slaves Master Collect Results Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 5 / 84

Embarrassingly Parallel Problems Example#1: Computation of the Mandelbrot Set Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 6 / 84

Embarrassingly Parallel Problems The Mandelbrot Set set of points in complex plane that are quasi-stable computed by iterating the function z k +1 = z 2 k + c z and c are complex numbers ( z = a + bi ) z initially zero c gives the position of the point in the complex plane iterations continue until | z | > 2 or some arbitrary iteration limit is reached � a 2 + b 2 | z | = enclosed by a circle centered at (0,0) of radius 2 Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 7 / 84

Embarrassingly Parallel Problems Evaluating 1 Point 1 typedef struct complex{float real , imag ;} complex; const int MaxIter = 256; 3 int calc_pixel (complex c){ int count = 0; 5 complex z = {0.0 , 0.0}; float temp , lengthsq; 7 do { temp = z.real * z.real - z.imag * z.imag + c.real 9 z.imag = 2 * z.real * z.imag + c.imag; z.real = temp; 11 lengthsq = z.real * z.real + z.imag * z.imag; count ++; 13 } while (lengthsq < 4.0 && count < MaxIter); return count; 15 } Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 8 / 84

Embarrassingly Parallel Problems Building the Full Image Define: min. and max. values for c (usually -2 to 2) number of horizontal and vertical pixels for (x = 0; x < width; x++) for (y = 0; y < height; y++){ 2 c.real = min.real + (( float) x * (max.real - min.real)/width); c.imag = min.imag + (( float) y * (max.imag - min.imag)/height); 4 color = calc_pixel (c); display(x, y, color); 6 } Summary: width × height totally independent tasks each task can be of different length Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 9 / 84

Embarrassingly Parallel Problems Cost Considerations on NCI’s Raijin 10 flops per iteration maximum 256 iterations per point approximate time on one Raijin core: 10 × 256 / (8 × 2 . 6 × 10 9 ) ≈ 0 . 12 usec between two nodes the time to communicate single point to slave and receive result ≈ 2 × 2 usec (latency limited) conclusion: cannot parallelize over individual points also must allow time for master to send to all slaves before it can return to any given process Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 10 / 84

Embarrassingly Parallel Problems Parallelisation: Static Process Width Map Height Process Square Region Distribution Map Row Distribution Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 11 / 84

Embarrassingly Parallel Problems Static Implementation Master: for (slave = 1, row = 0; slave < nproc; slave ++) { send (&row , slave); 2 row = row + height/nproc; 4 } for (npixel = 0; npixel < (width * height); npixel ++) { recv (&x, &y, &color , any_processor ); 6 display(x, y, color); 8 } Slave: const int master = 0; // proc. id 2 recv (& firstrow , master); lastrow = MIN(firstrow + height/nproc , height); 4 for (x = 0; x < width; x++) for (y = firstrow; y < lastrow; y++) { c.real = min.real + (( float) x * (max.real - min.real)/width); 6 c.imag = min.imag + (( float) y * (max.imag - min.imag)/height); color = calc_pixel (c); 8 send (&x, &y, &color , master); } 10 Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 12 / 84

Embarrassingly Parallel Problems Dynamic Task Assignment discussion point: why would we expect static assignment to be sub-optimal for the Mandelbrot set calculation? Would any regular static decomposition be significantly better (or worse)? usa a pool of over-decomposed tasks that are dynamically assigned to next requesting process: Work Pool (x5,y5) (x2,y2) (x4,y4) (x7,y7) (x1,y1) (x6,y6) (x3,y3) Task Result Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 13 / 84

Embarrassingly Parallel Problems Processor Farm: Master count = 0; 2 row = 0; for (slave = 1; slave < nproc; k++){ send (&row , slave , data_tag); 4 count ++; row ++; 6 } 8 do { recv (&slave , &r, &color , any_proc , result_tag ); count --; 10 if (row < height) { send (&row , slave , data_tag); 12 row ++; count ++; 14 } else send (&row , slave , terminator_tag ); 16 display_vector (r, color); 18 } while (count > 0); Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 14 / 84

Embarrassingly Parallel Problems Processor Farm: Slave const int master = 0; // proc id. 2 recv (&y, master , any_tag , source_tag ); while ( source_tag == data_tag) { c.imag = min.imag + (( float) y * (max.imag - min.imag)/height); 4 for (x = 0; x < width; x++) { c.real = min.real + (( float) x * (max.real - min.real)/width); 6 color[x] = calc_pixel (c); } 8 send (&myid , &y, color , master , result_tag ); recv (&y, master , source_tag); 10 } Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 15 / 84

Embarrassingly Parallel Problems Analysis Let p , m , n , I denote nproc, height, width, MaxIter : sequential time: ( t f denotes floating point operation time) t seq ≤ I · mn · t f = O ( mn ) parallel communication 1: (neglect t h term, assume message length of 1 word) t com1 = 2( p − 1)( t s + t w ) parallel computation: t comp ≤ I · mn p − 1 t f parallel communication 2: m t com2 = p − 1 ( t s + t w ) overall: t par ≤ I · mn m p − 1 t f + ( p − 1 + p − 1 )( t s + t w ) Discussion point: What assumptions have we been making here? Are there any situations where we might still have poor performance, and how could we mitigate this? Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 16 / 84

Embarrassingly Parallel Problems Example#2: Monte Carlo Methods use random numbers to solve numerical/physical problems evaluation of π by determining if random points in the unit square fall inside a circle area of square = π (1) 2 2 × 2 = π area of circle 4 Total Area = 4 Area = 2 π Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 17 / 84

Embarrassingly Parallel Problems Monte Carlo Integration evaluation of integral ( x 1 ≤ x i ≤ x 2 ) � x 2 N 1 � area = f ( x ) dx = lim f ( x i )( x 2 − x 1 ) N N →∞ x 1 i =1 example � x 2 ( x 2 − 3 x ) dx I = x 1 sum = 0; 2 for (i = 0; i < N; i++) { xr = rand_v(x1 , x2); sum += xr * xr - 3 * xr; 4 } 6 area = sum * (x2 - x1) / N; where rand_v(x1, x2) computes a pseudo-random number between x1 and x2 Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 18 / 84

Embarrassingly Parallel Problems Parallelization only problem is ensuring each process uses a different random number and that there is no correlation one solution is to have a unique process (maybe the master) issuing random numbers to the slaves Master Partial sum Request Slaves Random number Random number process Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 19 / 84

Embarrassingly Parallel Problems Parallel Code: Integration Master (process 0): Slave: for (i = 0; i < N/n; i++) { for (j = 0; j < n; j++) 2 xr[j] = rand_v(x1 , x2); const int master = 0; // proc id. recv(any_proc , req_tag , &p_src) 2 sum = 0; 4 ; send(master , req_tag); send(xr , n, p_src , comp_tag); 4 recv(xr , &n, master , tag); 6 } while (tag == comp_tag) { for (i=1; i < nproc; i++) { for (i = 0; i < n; i++) 6 recv(i, req_tag); sum += xr[i]*xr[i] - 3*xr[i]; 8 send(i, stop_tag); send(0, req_tag); 8 10 } recv(xr , n, master , &tag); sum = 0; 10 } 12 reduce_add (&sum , p_group); reduce_add (&sum , p_group); Question: performance problems with this code? Computer Systems (ANU) Parallelization Strategies 01 Nov 2017 20 / 84

Parallelization Strategies ASD Distributed Memory HPC Workshop - PowerPoint PPT Presentation

Parallelization Strategies ASD Distributed Memory HPC Workshop Computer Systems Group Research School of Computer Science Australian National University Canberra, Australia November 01, 2017 Day 3 Schedule Computer Systems (ANU)

MDT FE Power Consumption M. Fras, 06 June 2019 ASD Power Depending on Voltage ASD Supply [V]

Distributed HPC Systems ASD Distributed Memory HPC Workshop Computer Systems Group Research

HPC @ SAO S.G. Korzennik - SAO HPC Analyst hpc@cfa February 2013 SGK ( hpc@cfa ) HPC @ SAO

Parallelization Strategies ASD Accelerator HPC Workshop Computer Systems Group Research School

Uni.lu HPC School 2020 PS6: HPC Containers: Singularity Uni.lu High Performance Computing (HPC)

Autism Spectrum Disorder: A Fresh Look ASD in Females Andrea Fourie Speech Therapist ASD:

The HPC Skill Tree A Brief Overview Kai Himstedt On Behalf of the HPC-CF Board BoF:

Code Parallelization Fabrice Schlegel Introduction Goal: Efficient parallelization and memory

Upcoming changes to autism spectrum disorder: evaluating DSM-5 What is ASD? ASD disease

BLUEPRINT FOR A NATIONAL ASD STRATEGY All Canadians with ASD and their families have full and

Memory II. Memory improvement III. Problems with memory 3 systems/stages of Memory: memory

Advanced Message Passing ASD Distributed Memory HPC Workshop Computer Systems Group Research

Partitioned Global Address Space Paradigm ASD Distributed Memory HPC Workshop Computer Systems

Whats new in HPC? Gregory Bauer To keep up-to-date on HPC HPC Guru -

UL HPC School 2017[bis] PS1: Getting Started on the UL HPC platform UL High Performance

UL HPC School 2017 PS5: Advanced Scheduling with SLURM and OAR on UL HPC clusters UL High

Lecture 11: HW3, Rest of Parallel Patterns, Load Balancing G63.2011.002/G22.2945.001 November

A generic data structure for representing discrete paths on regular grids e and Alexandre

access to a function f . The tester has to accept with probability at least 2 / 3 if f belongs to

Physical Operators Scanning, sorting, merging, hashing 193 Physical Operators Execution Query

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data Xingbo Wu , Yuehai Xu ,

Main Memory Adaptive Indexing for Multi-core Systems Felix Martin Schuhknecht Victor Alvarez

Dynamic Memory Management Allocating memory: The Interface Buddy System

St. Patricks Day Tips Standards of Behavior High-risk drinking behaviors (e.g., underage