Partitioned Global Address Space Paradigm
ASD Distributed Memory HPC Workshop
Computer Systems Group, Research School of Computer Science
Australian National University, Canberra, Australia
November 02, 2017
Day 4 – Schedule
Introduction to the PGAS Paradigm and Chapel
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
Partitioned Global Address Space
- recall the shared memory model: multiple threads with pointers to a global address space
- in the partitioned global address space (PGAS) model: multiple threads, each with affinity to some portion of the global address space
- SPMD or fork-join thread creation
- remote pointers to access data in other partitions
- the model maps to a cluster with remote memory access; it can also map to NUMA domains
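A minimal Chapel sketch of the idea (assumes the program runs on at least two locales; the variable names are illustrative):

var x: int = 42;        // x is allocated with affinity to locale 0
on Locales[1] {         // run this block on locale 1
  var y = x + 1;        // reads x through the global address space (a remote get)
}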
Chapel: Design Principles
- Cray high performance language
- originally developed under the DARPA High Productivity Computing Systems program
- targeted at massively parallel computers
- object-oriented (Java-like syntax, but influenced by ZPL & HPF)
- supports exploratory programming: implicit (statically-inferable) types, run-time settable parameters (config), implicit main and module wrappings
- multiresolution design: build higher-level concepts in terms of lower-level ones
- fork-join, not SPMD
Chapel: Language Primitives
Task parallelism:
- concurrent loops and blocks (cobegin, coforall)

Data parallelism:
- concurrent map operations (forall)
- concurrent fold operations (scan, reduce)

Synchronization:
- task synchronization, sync variables, atomic sections

Locality:
- locales (UMA places to hold data and run tasks)
- (index) domains, used to specify arrays and iteration ranges
- distributions (mappings of domains to locales)

Chapel can drastically reduce code size compared to MPI+X; more info on the Chapel home page.
Chapel: Compile Chain
the chpl compiler generates standard C code, or can use an LLVM backend
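For example, invoking the compiler might look like this (a hedged sketch; the --llvm flag name is assumed from chpl releases of this era):

$ chpl -o myProgram myProgram.chpl          # default: generate C, then call the C compiler
$ chpl --llvm -o myProgram myProgram.chpl   # use the LLVM backend instead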
(Image: Cray Inc.)
Chapel: Base Language
variables, constants, parameters:
var timestep: int;
param pi: real = 3.14159265;
config const epsilon = 0.05;  // $ ./myProgram --epsilon=0.01
records:
record Vector3D {
  var x, y, z: real;
}
var pos = new Vector3D(0.0, 1.0, -1.5);
pos.x = 2.0;
var copy = pos;  // copied by value
classes:
class Person {
  var firstName, surname: string;
  var age: int;
}
var patsy = new Person("Patricia", "Stone", 39);
Chapel: Base Language (2)
procedures, type inference, generic methods:
proc square(n) {
  return n * n;
}

var x = 2;
var x2 = square(x);
writeln(x2, ": ", x2.type:string);  // 4: int(64)

var y = 0.5;
var y2 = square(y);
writeln(y2, ": ", y2.type:string);  // 0.25: real(64)
Chapel: Base Language (3)
iterators:
iter triangle(n) {
  var current = 0;
  for i in 1..n {
    current += i;
    yield current;
  }
}
tuples, zippered iteration:
config const n = 10;
for (i, t) in zip(0..#n, triangle(n)) do
  writeln("triangle number ", i, " is ", t);
Chapel: Task Parallelism
task creation:
begin doStuff();  // spawn task and don't wait
cobegin {
  doStuff1();
  doStuff2();
}  // wait for completion of all statements in the block
synchronisation variables:
var a$: sync int;
begin a$ = foo();
var c = 2 * a$;  // suspend until a$ is assigned
Chapel: Synchronization Variables
single variables can only be written once; sync variables are reset to empty when read.
var item$: sync int;
proc produce() {
  for i in 0..#N do
    item$ = i;
}
proc consume() {
  for i in 0..#N {
    var x = item$;
    writeln(x);
  }
}

begin produce();
begin consume();

var latch$: single bool;
proc await() {
  latch$;
}
proc release() {
  latch$ = true;
}

begin await();
begin release();
Chapel: Task Parallelism Example
Fibonacci numbers:
proc fib(n): int {
  if n <= 2 then
    return 1;
  var t1$: single int;
  var t2: int;
  begin t1$ = fib(n-1);
  t2 = fib(n-2);
  // wait for t1$
  return t1$ + t2;
}

proc fib(n): int {
  if n <= 2 then
    return 1;
  var t1$, t2$: single int;
  cobegin {
    t1$ = fib(n-1);
    t2$ = fib(n-2);
  }
  // wait for t1$ and t2$
  return t1$ + t2$;
}
Chapel: Data Parallelism
ranges:
var r1 = 0..3;         // 0, 1, 2, 3
var r2 = 0..#10 by 2;  // 0, 2, 4, 6, 8
arrays, data parallel loops:
var A, B: [0..#N] real;
forall i in 0..#N do  // cf. coforall
  A(i) = A(i) + B(i);
scalar promotion:
A = A + B;
Chapel: Data Parallelism (2)
example: DAXPY
config const alpha = 3.0;
const MyRange = 0..#N;
proc daxpy(x: [MyRange] real, y: [MyRange] real) {
  forall i in MyRange do
    y(i) = alpha * x(i) + y(i);
}
Alternatively, via promotion, the forall loop can be replaced by:
y = alpha * x + y;
reductions and scans:
var mx = (max reduce A);
A = (+ scan A);  // prefix sum of A - parallel?
the target of data parallelism could be SIMD, GPU or normal threads (currently no way to express this)
Chapel: forall vs. coforall
Use forall when iterations may be executed in parallel.
Use coforall when iterations must be executed in parallel.
What's wrong with this code?
var a$: [0..#N] single int;
forall i in {0..#N} {
  if i < (N-1) then
    a$[i] = a$[i+1] - 1;
  else
    a$[i] = N;
  var result = a$[i];
  writeln(result);
}
Chapel: Task Intents
constant (default):
config const N = 10;
var race: int;
coforall i in 0..#N do
  race += 1;  // illegal!
reference:
var deliberateRace: int;
coforall i in 0..#N with (ref deliberateRace) do
  deliberateRace += 1;
reduce:
var sum: int;
coforall i in 0..#N with (+ reduce sum) do
  sum += i;
Chapel: Domains
domain: an index set, which can be used to declare arrays
dense (rectangular): a tensor product of ranges, e.g.
config const M = 5, N = 7;
const D: domain(2) = {0..#M, 0..#N};
strided:
const D1 = {0..#M by 4, 0..#N by 2};
Chapel: Domains (2)
sparse:
const SparseD: sparse subdomain(D)
  = ((0,0), (1,2), (3,2), (4,4));
associative:
var Colours: domain(string) = {"Black", "Yellow", "Red"};
Chapel: Locales
locale: a unit of the target architecture: processing elements with (uniform) local memory
const Locales: [0..#numLocales] locale = ...;  // built-in
on Locales[1] do
  foo();

coforall (loc, id) in zip(Locales, 1..) do
  on loc do  // migrates this task to loc
    coforall tid in 0..#numTasks do
      writeln("Task ", id, " thread ", tid, " on ", loc);
Chapel: Domain Maps
use domain maps to map indices in a domain to locales:
use CyclicDist;
const Dist = new dmap(
  new Cyclic(startIdx = 1, targetLocales = Locales[0..1]));
const D = {0..#N} dmapped Dist;
var x, y: [D] real;
Chapel: Domain Maps (2)
block:
use BlockDist;
const space1D = {0..#N};
const B = space1D dmapped Block(boundingBox = space1D);
Hands-on Exercise: Locales in Chapel
Chapel Programming Strategies for Distributed Memory
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
Chapel: Programming Strategies
Think globally, compute locally.

Define key data structures:
- arrays
- domains

Specify distribution and layout:
- domain maps

Exploit parallelism over the available hardware:
- (co-)forall
- (co-)begin
Chapel: Matrix Multiplication
start with sequential matrix multiplication:
proc matMul(const ref A, const ref B, C) {
  for (m,n) in C.domain {
    var c = 0.0;
    for k in A.domain.dim(2) do
      c += A[m,k] * B[k,n];
    C[m,n] = c;
  }
}

config const M = 4, K = 4, N = 4;
var A: [0..#M, 0..#K] real;
var B: [0..#K, 0..#N] real;
var C: [0..#M, 0..#N] real;
matMul(A, B, C);
Chapel: Performance Timing
one way to measure elapsed time:
use Time;  // the Timer type lives in the Time module
var timer: Timer;
timer.start();
matMul(A, B, C);
timer.stop();
var timeMillis = timer.elapsed(TimeUnits.milliseconds);
writef("Serial Multiply M=%i,N=%i,K=%i took %7.3dr ms (%7.3dr GFLOP/s)\n",
       M, N, K, timeMillis, 2*M*K*N/1e6/timeMillis);
Chapel: Matrix Multiplication
parallel, single locale:
proc parMatMul(const ref A, const ref B, C) {
  forall (m,n) in C.domain {
    var c = 0.0;
    for k in A.domain.dim(2) do
      c += A[m,k] * B[k,n];
    C[m,n] = c;
  }
}
Chapel: Matrix Multiplication
parallel, distributed:
const rows = reshape(Locales, {0..#numLocales, 0..0});
const cols = reshape(Locales, {0..0, 0..#numLocales});
const spaceA = {0..#M, 0..#K};
const dA: domain(2) dmapped Block(spaceA, rows) = spaceA;
const spaceB = {0..#K, 0..#N};
const dB: domain(2) dmapped Block(spaceB, cols) = spaceB;
const spaceC = {0..#M, 0..#N};
const dC: domain(2) dmapped Block(spaceC, rows) = spaceC;
var blockA: [dA] real;
var blockB: [dB] real;
var blockC: [dC] real;
parMatMul(blockA, blockB, blockC);
Chapel: Programming Strategies (continued)
Batch communications - avoid fine-grained remote accesses:
- array slicing (see the sketch below)
- specialized distributions, e.g. StencilDist

Overlap computation and communication:
- tasks
- sync variables
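For example, a slice assignment lets the runtime aggregate remote elements into bulk transfers instead of element-at-a-time accesses (a hedged sketch over a block-distributed array; N is illustrative):

use BlockDist;
config const N = 1000;
const D = {0..#N} dmapped Block(boundingBox = {0..#N});
var A, B: [D] real;
// a few batched transfers at locale boundaries,
// rather than N fine-grained remote gets
B[1..N-1] = A[0..N-2];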
Chapel: Further Reading
Chapel Web page: http://chapel.cray.com
Chapel tutorials: http://chapel.cray.com/tutorials.html
Hands-on Exercise: 2D Stencil via Templates
Runtime Support for PGAS
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
MPI One-Sided Communications
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
Programming Models
Each process exposes a part of its memory to the other processes.
This allows data movement without direct involvement of the process that holds the data.
Comparison with Two-sided
(Image: http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture34.pdf)
It’s all about Memory Consistency
Remember this from the shared memory course? Memory consistency concerns how memory behaves with respect to read and write operations from multiple processors.

Sequential consistency: "the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
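As a concrete illustration (a classic two-thread example, not from the original slides), let x and y both start at 0:

// Thread 1:            // Thread 2:
x = 1;                  y = 1;
r1 = y;                 r2 = x;

Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible: whichever write comes first in the global order must be visible to the other thread's later read.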
See: "Shared Memory Consistency Models: A Tutorial", Sarita V. Adve and Kourosh Gharachorloo.
Reference Material
Subsequent slides will draw heavily on the following material.

Overviews of MPI-3:
- William D. Gropp: New Features of MPI-3
- Fabio Affinito: MPI3

Two detailed lectures on one-sided MPI:
- William Gropp: One-sided Communication in MPI
- William Gropp: More on One Sided Communication

Tutorial on MPI 2.2 and 3.0 by Torsten Hoefler:
- Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial

Detailed paper on remote memory access programming in MPI-3:
- T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 (March 2013)

Cornell Virtual Workshop on one-sided communication methods:
- https://cvw.cac.cornell.edu/MPIoneSided/default
RMA advantages and Issues
Advantages:
- multiple transfers with a single synchronization
- bypasses tag matching
- can be faster by exploiting underlying hardware support
- better able to handle problems where the communication pattern is unknown or irregular

Issues:
- how to create remotely accessible memory
- reading, writing and updating remote memory
- data synchronisation
- the memory model
Window Creation
Regions of memory that we want to expose to RMA operations are called windows; they can be created in four ways (see the sketch below).
(Image: Fabio Affinito, MPI3)
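Since the figure is not reproduced here, a minimal sketch of the four MPI-3 window creation routines (standard MPI-3 calls; buf, size and comm are illustrative):

MPI_Win win;
int *buf;
MPI_Aint size = 1024 * sizeof(int);

/* 1. expose memory the user already allocated */
MPI_Win_create(buf, size, sizeof(int), MPI_INFO_NULL, comm, &win);
/* 2. let MPI allocate the memory (often enables a faster path) */
MPI_Win_allocate(size, sizeof(int), MPI_INFO_NULL, comm, &buf, &win);
/* 3. allocate memory that on-node processes can also load/store directly */
MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL, comm, &buf, &win);
/* 4. create a window with no memory, then attach regions dynamically */
MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win);
MPI_Win_attach(win, buf, size);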
Simple Window Creation
(Image: Fabio Affinito, MPI3)
Data Movement
MPI provides operations to read, write and atomically modify remote data
- MPI_Get
- MPI_Put
- MPI_Accumulate
- MPI_Get_accumulate
- MPI_Compare_and_swap
- MPI_Fetch_and_op
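For instance, an atomic remote update (a hedged fragment; win, target and localval are assumed to have been set up as on the surrounding slides):

int localval = 1;
/* atomically add localval into the target's window at displacement 0 */
MPI_Accumulate(&localval, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);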
(Image: Fabio Affinito, MPI3)
The Memory Consistency Issue
(Image: Fabio Affinito, MPI3)

Three synchronization models:
- fence (active target)
- post-start-complete-wait (generalized active target)
- lock/unlock (passive target)

Data accesses occur within epochs.
Three Synchronization Models
(Image: Fabio Affinito, MPI3)
Passive Target Synchronization
(Image: William Gropp, More on One Sided Communication)
Completion Model
Relaxed memory model: acquire and release.
- immediate data movement
- delayed data movement
Which is best when?

(Image: William Gropp, More on One Sided Communication)
Memory Models
The unified memory model is new in MPI-3 - what are its advantages?

(Image: Torsten Hoefler, Advanced MPI 2.2 and 3.0 Tutorial)
Separate Semantics
A corresponding table exists for the unified semantics.

(Image: Torsten Hoefler, Advanced MPI 2.2 and 3.0 Tutorial)
MPI-3 Communication Options
(Image: T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1)
Example Codes
- Fence synchronization
- Post-Start-Complete-Wait synchronization
- Lock-Unlock synchronization
Fence Synchronization
// Start up MPI ...
MPI_Win win;
if (rank == 0) {
    /* Everyone will retrieve from a buffer on root */
    int soi = sizeof(int);
    MPI_Win_create(buf, soi*20, soi, MPI_INFO_NULL, comm, &win);
} else {
    /* Others only retrieve, so these windows can be size 0 */
    MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, comm, &win);
}

/* No local operations prior to this epoch, so give an assertion */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
if (rank != 0) {
    /* Inside the fence, make RMA calls to GET from rank 0 */
    MPI_Get(buf, 20, MPI_INT, 0, 0, 20, MPI_INT, win);
}

/* Complete the epoch - this will block until MPI_Get is complete */
MPI_Win_fence(0, win);
/* All done with the window - tell MPI there are no more epochs */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
/* Free up our window */
MPI_Win_free(&win);
// shut down ...

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/fence
Post-Start-Complete-Wait Synchronization
// Start up MPI ...
MPI_Group comm_group, group;

for (i = 0; i < 3; i++) {
    ranks[i] = i;  /* For forming groups, later */
}
MPI_Comm_group(MPI_COMM_WORLD, &comm_group);

/* Create new window for this comm */
if (rank == 0) {
    MPI_Win_create(buf, sizeof(int)*3, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
} else {
    /* Rank 1 or 2 */
    MPI_Win_create(NULL, 0, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
}

/* ----> continues on next slide ----> */

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/pscw
Post-Start-Complete-Wait Synchronization (2)
/* Now do the communication epochs */
if (rank == 0) {
    /* Origin group consists of ranks 1 and 2 */
    MPI_Group_incl(comm_group, 2, ranks+1, &group);
    /* Begin the exposure epoch */
    MPI_Win_post(group, 0, win);
    /* Wait for epoch to end */
    MPI_Win_wait(win);
} else {
    /* Target group consists of rank 0 */
    MPI_Group_incl(comm_group, 1, ranks, &group);
    /* Begin the access epoch */
    MPI_Win_start(group, 0, win);
    /* Put into rank==0 according to my rank */
    MPI_Put(buf, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    /* Terminate the access epoch */
    MPI_Win_complete(win);
}

/* Free window and groups */
MPI_Win_free(&win);
MPI_Group_free(&group);
MPI_Group_free(&comm_group);

// Shut down ...

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/pscw
Lock-Unlock Synchronization
// Start up MPI ...
MPI_Win win;

if (rank == 0) {
    /* Rank 0 will be the caller, so null window */
    MPI_Win_create(NULL, 0, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* Request lock of process 1 */
    MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
    MPI_Put(buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    /* Block until put succeeds */
    MPI_Win_unlock(1, win);
    /* Free the window */
    MPI_Win_free(&win);
} else {
    /* Rank 1 is the target process */
    MPI_Win_create(buf, 2*sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* No sync calls on the target process! */
    MPI_Win_free(&win);
}

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/lul
Case Studies
(Image: T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1)
Hands-on Exercise: The 3 Synchronization Methods
Fault Tolerance
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
HPC Systems: Fast, Complex and Error Prone
Sunway TaihuLight: the fastest supercomputer today (peak 125.4359 Pflop/s)
(Image: Top500)
[1] Dongarra, Jack. "Report on the Sunway TaihuLight System." Tech Report UT-EECS-16-742 (2016).
The Reliability Challenge in HPC
Reliability terms:
- MTTI: Mean Time To Interrupt
- MTTR: Mean Time To Repair
- MTBF: Mean Time Between Failures = MTTI + MTTR

Reliability figures for terascale systems:

System                    CPUs      Reliability       Src
LANL ASCI Q               8,192     MTTI: 6.5 hours   [2]
LLNL ASCI White (2003)    8,192     MTBF: 40 hours    [2]
PSC Lemieux               3,016     MTTI: 9.7 hours   [2]
LLNL BlueGene/L           106,496   MTTI: 7-10 days   [3]
[2] Feng, Wu-chun. "The importance of being low power in high performance computing." (2005).
[3] Bronevetsky, Greg, and Adam Moody. "Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O." (2009).
A Statistical Study of Failures on HPC Systems
Conclusions: "First, the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes."
The Reliability Challenge in HPC
Prediction of MTTI with three rates of growth in cores: doubling every 18, 24 and 30 months.
Fault Tolerance
As HPC systems grow in size, the MTTI shrinks, and long-running applications are at higher risk of encountering faults. Faults are generally classified into:
- hard faults: inhibit process execution and result in data loss
- soft faults: undetected bit flips silently corrupting data in disk, memory, or registers

Fault tolerance is the ability to contain faults and reduce their impact.
Fault Tolerance Techniques
Rollback recovery:
- returns the application to an old consistent state
- recomputes previously reached states before the failure
- common technique: checkpoint/restart

Forward recovery:
- computation proceeds after a failure without rollback
- requires a fault-aware runtime system (i.e. a runtime system that does not crash upon a failure)
- common techniques: replication, master-worker, ABFT (Algorithmic-Based Fault Tolerance)

Or composite techniques, e.g. replication-enhanced checkpointing [4].
[4] Ni, Xiang, et al. "ACR: Automatic checkpoint/restart for soft and hard error protection." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
Rollback Recovery
Checkpoint/Restart
- the most widely used fault tolerance mechanism in HPC systems
- requires saving the application state periodically to reliable storage
- upon a failure, the application restarts from the last consistent checkpoint
Checkpointing Classifications
Coordinated:
- collective checkpointing
- all processes restart
- suitable for synchronized computations

Uncoordinated:
- processes checkpoint independently
- only the failed process restarts
- suitable for loosely coupled processes
- often requires message logging
- vulnerable to the domino effect
[5] Elnozahy, Elmootazbellah Nabil, et al. "A survey of rollback-recovery protocols in message-passing systems." ACM Computing Surveys (CSUR) 34.3 (2002): 375-408.
Checkpointing Classifications (Cont.)
Disk-based:
- I/O intensive
- applicable to all runtime systems

Diskless:
- replaces disk with in-memory replication
- fault-aware systems only
- more replicas → more reliability, but higher failure-free overhead
Checkpointing MPI Applications
Coordinated disk-based checkpointing is the common mechanism for fault tolerance on HPC platforms.
- provided transparently by some MPI implementations (e.g. Intel MPI: mpirun -chkpoint-interval 100sec -np 100 ./MyApp)
- provided outside of MPI by tools that dump the process image to disk, such as:
  - BLCR: Berkeley Lab Checkpoint/Restart for Linux
  - DMTCP: Distributed MultiThreaded CheckPointing
- or done manually by programmers using file system APIs (see the sketch below)

Diskless checkpointing is only applicable to fault-aware MPI implementations (like MPI-ULFM, which we cover in the last part of this lecture).
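A minimal sketch of the manual approach mentioned above (illustrative names: state, n, fname; each rank writes its own file after a coordinating barrier):

MPI_Barrier(comm);                    /* reach a globally consistent point */
char fname[64];
snprintf(fname, sizeof fname, "ckpt_rank%d.dat", rank);
FILE *f = fopen(fname, "wb");
fwrite(state, sizeof(double), n, f);  /* dump this rank's state */
fclose(f);
MPI_Barrier(comm);                    /* checkpoint complete on all ranks */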
Checkpoint Interval
The checkpoint interval has a crucial impact on performance:
- a long interval → fewer checkpoints, but more lost work upon a failure
- a short interval → more checkpoints, but less lost work upon a failure
Young's formula [6] is often used to compute the optimal checkpoint interval i as

    i = sqrt(2 * t * MTTI)

where t is the checkpointing time. The effective application utilization u of a system can be computed as [7]:

    u = 1 - (lost utilization for recovery + lost utilization for checkpointing)
      = 1 - (i/2 * 1/MTTI + t * 1/i)
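A worked example (numbers chosen purely for illustration): with checkpointing time t = 10 minutes and MTTI = 24 hours = 1440 minutes,

    i = sqrt(2 * 10 * 1440) ≈ 170 minutes
    u ≈ 1 - (170/2 * 1/1440 + 10/170) ≈ 1 - (0.059 + 0.059) ≈ 0.88

i.e. roughly 12% of the machine time is lost to recovery and checkpointing even at the optimal interval.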
[6] Young, John W. "A first order approximation to the optimum checkpoint interval." Communications of the ACM 17.9 (1974).
[7] Schroeder, Bianca, and Garth A. Gibson. "Understanding failures in petascale computers." Journal of Physics 78.1 (2007).
Projected System Utilization with C/R
Effective application utilization over time.
Forward Recovery
Replication
- executes one or more replicas of each process on independent nodes
- when a replica is lost, another replica takes over without rollback

Replication in message passing systems: the message ordering challenge.

- used as a detection and correction mechanism for silent data corruption errors
- despite its expensive resource requirements, recent studies [8,9] suggest that replication can be a viable alternative to checkpointing on extreme-scale systems with short MTTI

[8] Ropars, Thomas, et al. "Efficient Process Replication for MPI Applications: Sharing Work between Replicas." IPDPS 2015.
[9] Ferreira, Kurt, et al. "Evaluating the viability of process replication reliability for exascale systems." SC 2011.
Forward Recovery
Master-Worker
Worker failure: can be tolerated without rollback by assigning the tasks of the failed worker to another worker.
Master failure: can be tolerated using replication or checkpointing. Because the probability of master failure is constant (it does not depend on the scale of the application), it is often more efficient to treat a master failure as a fatal error.
Forward Recovery
Algorithmic-Based Fault Tolerance
- the design of custom recovery mechanisms based on expert knowledge of special algorithm properties (e.g. available data redundancy, the ability to approximate lost data from remaining data, ...)
- for example: using redundant data to recover lost sub-grids in PDE solvers that use the Sparse Grid Combination Technique (SGCT) [10]:
(Image: Ali, et al. 2016)
[10] Ali, Md Mohsin, et al. "Complex scientific applications made fault-tolerant with the sparse grid combination technique." IJHPCA 30.3 (2016): 335-359.
MPI and Fault Tolerance
The MPI standard does not specify the behaviour of MPI when ranks fail, and most implementations terminate the application as a result of a rank failure. Users rely on coordinated disk-based checkpointing because it does not require fault tolerance support from MPI. However, the time to checkpoint an application with a large memory footprint can exceed the MTTI on large systems, making coordinated disk-based checkpointing inapplicable at large scale. User-level fault tolerance techniques can deliver better performance; however, they require fault tolerance support from MPI.
MPI and Fault Tolerance (Cont.)
MPI User Level Failure Mitigation
- a proposal by the MPI Forum's Fault Tolerance Working Group to add fault tolerance semantics to MPI
- under assessment by the MPI Forum to be part of the coming MPI-4 standard
- a reference implementation of ULFM is available, based on Open MPI 1.7
MPI User Level Failure Mitigation
In the following, we cover these aspects of MPI-ULFM:
- error handling
- failure notification
- failure mitigation
Error Handling (1/2)
In standard MPI:
- most MPI interfaces return an error code (e.g. 0 = MPI_SUCCESS, 1 = MPI_ERR_BUFFER, ...)
- we can set an error handler on a communicator using MPI_Comm_set_errhandler
- there are two predefined error handlers:
  - MPI_ERRORS_ARE_FATAL: terminates MPI (the default)
  - MPI_ERRORS_RETURN: returns an error code to the caller
The user can also define a customized error handler, as follows:
/* User's error handling function */
void errorCallback(MPI_Comm *comm, int *errCode, ...) { }

/* Changing the communicator's error handler */
MPI_Errhandler handler;
MPI_Comm_create_errhandler(errorCallback, &handler);
MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
Error Handling (2/2)
ULFM uses the same error handling mechanism as standard MPI. It adds new error codes to report process failure events:
- 54 = MPI_ERR_PROC_FAILED
- 55 = MPI_ERR_PROC_FAILED_PENDING
- 56 = MPI_ERR_REVOKED

The default error handler MPI_ERRORS_ARE_FATAL must not be used.
Failure Notification (1/3)
Process failure errors are raised only in MPI operations that involve a failed rank.
Point-to-point operations:
- using a named rank
- using MPI_ANY_SOURCE
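The examples on the original slide are images; here is a minimal hedged sketch of how the error codes surface (assuming a communicator whose error handler returns, and the MPI_-prefixed names used on these slides - the ULFM reference implementation uses an MPIX_ prefix):

int buf, rc;
/* named rank: if rank 3 has failed, the receive raises an error */
rc = MPI_Recv(&buf, 1, MPI_INT, 3, 0, comm, MPI_STATUS_IGNORE);
if (rc == MPI_ERR_PROC_FAILED) { /* rank 3 is dead: no match is possible */ }

/* MPI_ANY_SOURCE: a failed potential sender leaves the receive pending */
MPI_Request req;
MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &req);
rc = MPI_Wait(&req, MPI_STATUS_IGNORE);
if (rc == MPI_ERR_PROC_FAILED_PENDING) {
  MPI_Comm_failure_ack(comm);              /* acknowledge known failures */
  rc = MPI_Wait(&req, MPI_STATUS_IGNORE);  /* may now complete normally */
}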
Failure Notification (2/3)
Process failure errors are raised only in MPI operations that involve a failed rank.
Collective operations: some live processes may raise an error, while others return successfully.
Failure Notification (3/3)
Process failure errors are raised only in MPI operations that involve a failed rank.
Non-blocking operations: error reporting is postponed to the corresponding completion function (e.g. MPI Wait, MPI Test).
Failure Mitigation Interfaces (1/2)
MPI_Comm_failure_ack(comm)
- a local operation that acknowledges all detected failures on the communicator
- its purpose is to silence process failure errors in future MPI_ANY_SOURCE calls that involve an acknowledged process failure

MPI_Comm_failure_get_acked(comm, failedgrp)
- returns the group of failed ranks that were already acknowledged
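A small hedged usage sketch (slide-style MPI_ names again; comm is assumed to be set up):

MPI_Group failed;
int nfailed;
MPI_Comm_failure_ack(comm);                 /* acknowledge everything seen so far */
MPI_Comm_failure_get_acked(comm, &failed);  /* which ranks are known dead? */
MPI_Group_size(failed, &nfailed);
printf("%d rank(s) known to have failed\n", nfailed);
MPI_Group_free(&failed);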
Failure Mitigation Interfaces (2/2)
MPI_Comm_revoke(comm)
- a local operation that invalidates the communicator
- any future communication on a revoked communicator fails with error MPI_ERR_REVOKED
- live ranks must collectively create a new communicator

MPI_Comm_shrink(oldcomm, newcomm)
- a collective operation that creates a new communicator excluding the dead ranks in the old communicator
- like other collectives, it may succeed at some ranks and fail at others

MPI_Comm_agree(oldcomm, flag)
- a collective operation for participants to agree on some value
- the only collective operation that returns the same result to all participants
Resilient Iterative Application Skeleton
#define CKPT_INTERVAL 10  /* the checkpointing interval */
#define MAX_ITER 100      /* maximum no. of iterations */

MPI_Comm world;   /* the working communicator */
int nprocs;       /* communicator size */
int rank;         /* my rank */
bool restart;     /* restart flag */

void compute();      /* executes the iterative computation,
                      * and orchestrates checkpoint/restart */

int runIter(int i);  /* runs a single iteration,
                      * and returns the MPI error code
                      * of the last MPI call */

void shrinkWorld();  /* shrinks a failed communicator,
                      * and sets the new rank and nprocs */

void errorCallback(MPI_Comm* comm, int* rc, ...);
                     /* the communicator's error handler */

void writeCkpt();    /* creates a new checkpoint */

int readCkpt();      /* loads the last checkpoint, and
                      * returns the corresponding iteration */
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /* the initial world state */
  world = MPI_COMM_WORLD;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &nprocs);

  /* setting the error handler */
  MPI_Errhandler errHandler;
  MPI_Comm_create_errhandler(errorCallback, &errHandler);
  MPI_Comm_set_errhandler(world, errHandler);

  compute();

  MPI_Finalize();
  return 0;
}
/* orchestrates the iterative processing and C/R */
void compute() {
  int rc;          /* holds MPI return codes */
  restart = false; /* set to true only in errorCallback() */
  int i = 0;       /* current iteration number */
  do {
    if (restart) {
      i = readCkpt();
      rc = MPI_Comm_agree(world, &i);
      if (rc != MPI_SUCCESS)
        continue;
      restart = false;
    }
    while (i < MAX_ITER) {
      rc = runIter(i);
      if (rc != MPI_SUCCESS)
        break; /* jump to the outer loop to restart */

      if (i % CKPT_INTERVAL == 0)
        writeCkpt();

      i++;
    }
  } while (restart || i < MAX_ITER);
}
/* a callback function to handle MPI errors */
void errorCallback(MPI_Comm *comm, int *errCode, ...) {
  if (*errCode != MPI_ERR_PROC_FAILED &&
      *errCode != MPI_ERR_PROC_FAILED_PENDING &&
      *errCode != MPI_ERR_REVOKED) {
    /* we only tolerate process failure errors */
    MPI_Abort(*comm, -1);
  }

  /* acknowledge the detected failures */
  MPI_Comm_failure_ack(*comm);

  if (*errCode != MPI_ERR_REVOKED) {
    /* propagate the failure to other ranks */
    MPI_Comm_revoke(*comm);
  }

  /* all live ranks must reach this point
   * to collectively shrink the communicator */
  shrinkWorld();

  restart = true;
}
/* creates a new communicator for the application,
 * excluding dead ranks in the old (revoked) communicator */
void shrinkWorld() {
  int rc; /* shrink return code */
  MPI_Comm newComm;
  do {
    rc = MPI_Comm_shrink(world, &newComm);
    MPI_Comm_agree(newComm, &rc);
  } while (rc != MPI_SUCCESS);

  /* update the communicator */
  world = newComm;

  /* update my rank and nprocs */
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &nprocs);
}
Fault Tolerance: Summary
Topics covered today:
- the decreasing reliability of HPC systems as they grow larger
- fault tolerance techniques (C/R, replication, master-worker, ABFT)
- the MPI-ULFM proposal for adding fault tolerance support to MPI

Acknowledgement:
The fault tolerance part of today's lecture is influenced by materials from the SC'16 tutorial "Fault Tolerance for HPC: Theory and Practice".
Hands-on Exercise: Checkpointing and ULFM