Partitioned Global Address Space Paradigm
ASD Distributed Memory HPC Workshop
Computer Systems Group, Research School of Computer Science
Australian National University, Canberra, Australia
November 02, 2017
Day 4 – Schedule
Introduction to the PGAS Paradigm and Chapel
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
Partitioned Global Address Space
- recall the shared memory model: multiple threads with pointers to a global address space
- in the partitioned global address space (PGAS) model: multiple threads, each with affinity to some portion of the global address space
- SPMD or fork-join thread creation
- remote pointers to access data in other partitions
- the model maps to a cluster with remote memory access; it can also map to NUMA domains
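A minimal Chapel sketch of the idea (assumes the program runs on at least two locales; the variable names are illustrative):

var x: int = 42;        // x is allocated with affinity to locale 0
on Locales[1] {         // run this block on locale 1
  var y = x + 1;        // reads x through the global address space (a remote get)
}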
Chapel: Design Principles
- Cray high performance language
- originally developed under the DARPA High Productivity Computing Systems program
- targeted at massively parallel computers
- object-oriented (Java-like syntax, but influenced by ZPL & HPF)
- supports exploratory programming: implicit (statically-inferable) types, run-time settable parameters (config), implicit main and module wrappings
- multiresolution design: build higher-level concepts in terms of lower-level ones
- fork-join, not SPMD
Chapel: Language Primitives
Task parallelism:
- concurrent loops and blocks (cobegin, coforall)

Data parallelism:
- concurrent map operations (forall)
- concurrent fold operations (scan, reduce)

Synchronization:
- task synchronization, sync variables, atomic sections

Locality:
- locales (UMA places to hold data and run tasks)
- (index) domains, used to specify arrays and iteration ranges
- distributions (mappings of domains to locales)

Chapel can drastically reduce code size compared to MPI+X; more info on the Chapel home page.
Chapel: Compile Chain
the chpl compiler generates standard C code, or can use an LLVM backend
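For example, invoking the compiler might look like this (a hedged sketch; the --llvm flag name is assumed from chpl releases of this era):

$ chpl -o myProgram myProgram.chpl          # default: generate C, then call the C compiler
$ chpl --llvm -o myProgram myProgram.chpl   # use the LLVM backend instead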
(Image: Cray Inc.)
Chapel: Base Language
variables, constants, parameters:
var timestep: int;
param pi: real = 3.14159265;
config const epsilon = 0.05;  // $ ./myProgram --epsilon=0.01
records:
record Vector3D {
  var x, y, z: real;
}
var pos = new Vector3D(0.0, 1.0, -1.5);
pos.x = 2.0;
var copy = pos;  // copied by value
classes:
class Person {
  var firstName, surname: string;
  var age: int;
}
var patsy = new Person("Patricia", "Stone", 39);
Chapel: Base Language (2)
procedures, type inference, generic methods:
proc square(n) {
  return n * n;
}

var x = 2;
var x2 = square(x);
writeln(x2, ": ", x2.type:string);  // 4: int(64)

var y = 0.5;
var y2 = square(y);
writeln(y2, ": ", y2.type:string);  // 0.25: real(64)
Chapel: Base Language (3)
iterators:
iter triangle(n) {
  var current = 0;
  for i in 1..n {
    current += i;
    yield current;
  }
}
tuples, zippered iteration:
config const n = 10;
for (i, t) in zip(0..#n, triangle(n)) do
  writeln("triangle number ", i, " is ", t);
Chapel: Task Parallelism
task creation:
begin doStuff();  // spawn task and don't wait
cobegin {
  doStuff1();
  doStuff2();
}  // wait for completion of all statements in the block
synchronisation variables:
var a$: sync int;
begin a$ = foo();
var c = 2 * a$;  // suspend until a$ is assigned
Chapel: Synchronization Variables
single variables can only be written once; sync variables are reset to empty when read.
var item$: sync int;
proc produce() {
  for i in 0..#N do
    item$ = i;
}
proc consume() {
  for i in 0..#N {
    var x = item$;
    writeln(x);
  }
}

begin produce();
begin consume();

var latch$: single bool;
proc await() {
  latch$;
}
proc release() {
  latch$ = true;
}

begin await();
begin release();
Chapel: Task Parallelism Example
Fibonacci numbers:
proc fib(n): int {
  if n <= 2 then
    return 1;
  var t1$: single int;
  var t2: int;
  begin t1$ = fib(n-1);
  t2 = fib(n-2);
  // wait for t1$
  return t1$ + t2;
}

proc fib(n): int {
  if n <= 2 then
    return 1;
  var t1$, t2$: single int;
  cobegin {
    t1$ = fib(n-1);
    t2$ = fib(n-2);
  }
  // wait for t1$ and t2$
  return t1$ + t2$;
}
Chapel: Data Parallelism
ranges:
var r1 = 0..3;         // 0, 1, 2, 3
var r2 = 0..#10 by 2;  // 0, 2, 4, 6, 8
arrays, data parallel loops:
var A, B: [0..#N] real;
forall i in 0..#N do  // cf. coforall
  A(i) = A(i) + B(i);
scalar promotion:
A = A + B;
Chapel: Data Parallelism (2)
example: DAXPY
config const alpha = 3.0;
const MyRange = 0..#N;
proc daxpy(x: [MyRange] real, y: [MyRange] real) {
  forall i in MyRange do
    y(i) = alpha * x(i) + y(i);
}
Alternatively, via promotion, the forall loop can be replaced by:
y = alpha * x + y;
reductions and scans:
var mx = (max reduce A);
A = (+ scan A);  // prefix sum of A - parallel?
the target of data parallelism could be SIMD, GPU or normal threads (currently no way to express this)
Chapel: forall vs. coforall
Use forall when iterations may be executed in parallel.
Use coforall when iterations must be executed in parallel.
What's wrong with this code?
var a$: [0..#N] single int;
forall i in {0..#N} {
  if i < (N-1) then
    a$[i] = a$[i+1] - 1;
  else
    a$[i] = N;
  var result = a$[i];
  writeln(result);
}
Chapel: Task Intents
constant (default):
config const N = 10;
var race: int;
coforall i in 0..#N do
  race += 1;  // illegal!
reference:
var deliberateRace: int;
coforall i in 0..#N with (ref deliberateRace) do
  deliberateRace += 1;
reduce:
var sum: int;
coforall i in 0..#N with (+ reduce sum) do
  sum += i;
Chapel: Domains
domain: an index set, which can be used to declare arrays
dense (rectangular): a tensor product of ranges, e.g.
config const M = 5, N = 7;
const D: domain(2) = {0..#M, 0..#N};
strided:
const D1 = {0..#M by 4, 0..#N by 2};
Chapel: Domains (2)
sparse:
const SparseD: sparse subdomain(D)
  = ((0,0), (1,2), (3,2), (4,4));
associative:
var Colours: domain(string) = {"Black", "Yellow", "Red"};
Chapel: Locales
locale: a unit of the target architecture: processing elements with (uniform) local memory
const Locales: [0..#numLocales] locale = ...;  // built-in
on Locales[1] do
  foo();

coforall (loc, id) in zip(Locales, 1..) do
  on loc do  // migrates this task to loc
    coforall tid in 0..#numTasks do
      writeln("Task ", id, " thread ", tid, " on ", loc);
Chapel: Domain Maps
use domain maps to map indices in a domain to locales:
use CyclicDist;
const Dist = new dmap(
  new Cyclic(startIdx = 1, targetLocales = Locales[0..1]));
const D = {0..#N} dmapped Dist;
var x, y: [D] real;
Chapel: Domain Maps (2)
block:
use BlockDist;
const space1D = {0..#N};
const B = space1D dmapped Block(boundingBox = space1D);
Hands-on Exercise: Locales in Chapel
Chapel Programming Strategies for Distributed Memory
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
Chapel: Programming Strategies
Think globally, compute locally.

Define key data structures:
- arrays
- domains

Specify distribution and layout:
- domain maps

Exploit parallelism over the available hardware:
- (co-)forall
- (co-)begin
Chapel: Matrix Multiplication
start with sequential matrix multiplication:
proc matMul(const ref A, const ref B, C) {
  for (m,n) in C.domain {
    var c = 0.0;
    for k in A.domain.dim(2) do
      c += A[m,k] * B[k,n];
    C[m,n] = c;
  }
}

config const M = 4, K = 4, N = 4;
var A: [0..#M, 0..#K] real;
var B: [0..#K, 0..#N] real;
var C: [0..#M, 0..#N] real;
matMul(A, B, C);
Chapel: Performance Timing
one way to measure elapsed time:
use Time;  // the Timer type lives in the Time module
var timer: Timer;
timer.start();
matMul(A, B, C);
timer.stop();
var timeMillis = timer.elapsed(TimeUnits.milliseconds);
writef("Serial Multiply M=%i,N=%i,K=%i took %7.3dr ms (%7.3dr GFLOP/s)\n",
       M, N, K, timeMillis, 2*M*K*N/1e6/timeMillis);
Chapel: Matrix Multiplication
parallel, single locale:
proc parMatMul(const ref A, const ref B, C) {
  forall (m,n) in C.domain {
    var c = 0.0;
    for k in A.domain.dim(2) do
      c += A[m,k] * B[k,n];
    C[m,n] = c;
  }
}
Chapel: Matrix Multiplication
parallel, distributed:
const rows = reshape(Locales, {0..#numLocales, 0..0});
const cols = reshape(Locales, {0..0, 0..#numLocales});
const spaceA = {0..#M, 0..#K};
const dA: domain(2) dmapped Block(spaceA, rows) = spaceA;
const spaceB = {0..#K, 0..#N};
const dB: domain(2) dmapped Block(spaceB, cols) = spaceB;
const spaceC = {0..#M, 0..#N};
const dC: domain(2) dmapped Block(spaceC, rows) = spaceC;
var blockA: [dA] real;
var blockB: [dB] real;
var blockC: [dC] real;
parMatMul(blockA, blockB, blockC);
Chapel: Programming Strategies (continued)
Batch communications - avoid fine-grained remote accesses:
- array slicing (see the sketch below)
- specialized distributions, e.g. StencilDist

Overlap computation and communication:
- tasks
- sync variables
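For example, a slice assignment lets the runtime aggregate remote elements into bulk transfers instead of element-at-a-time accesses (a hedged sketch over a block-distributed array; N is illustrative):

use BlockDist;
config const N = 1000;
const D = {0..#N} dmapped Block(boundingBox = {0..#N});
var A, B: [D] real;
// a few batched transfers at locale boundaries,
// rather than N fine-grained remote gets
B[1..N-1] = A[0..N-2];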
Chapel: Further Reading
Chapel Web page: http://chapel.cray.com
Chapel tutorials: http://chapel.cray.com/tutorials.html
Hands-on Exercise: 2D Stencil via Templates
Runtime Support for PGAS
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
MPI One-Sided Communications
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
Programming Models
Each process exposes a part of its memory to the other processes.
This allows data movement without direct involvement of the process that holds the data.
Comparison with Two-sided
(Image: http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture34.pdf)
It’s all about Memory Consistency
Remember this from the shared memory course? Memory consistency concerns how memory behaves with respect to read and write operations from multiple processors.

Sequential consistency: "the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
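As a concrete illustration (a classic two-thread example, not from the original slides), let x and y both start at 0:

// Thread 1:            // Thread 2:
x = 1;                  y = 1;
r1 = y;                 r2 = x;

Under sequential consistency the outcome r1 == 0 and r2 == 0 is impossible: whichever write comes first in the global order must be visible to the other thread's later read.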
See: "Shared Memory Consistency Models: A Tutorial", Sarita V. Adve and Kourosh Gharachorloo.
Reference Material
Subsequent slides will draw heavily on the following material.

Overviews of MPI-3:
- William D. Gropp: New Features of MPI-3
- Fabio Affinito: MPI3

Two detailed lectures on one-sided MPI:
- William Gropp: One-sided Communication in MPI
- William Gropp: More on One Sided Communication

Tutorial on MPI 2.2 and 3.0 by Torsten Hoefler:
- Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial

Detailed paper on remote memory access programming in MPI-3:
- T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 (March 2013)

Cornell Virtual Workshop on one-sided communication methods:
- https://cvw.cac.cornell.edu/MPIoneSided/default
RMA advantages and Issues
Advantages:
- multiple transfers with a single synchronization
- bypasses tag matching
- can be faster by exploiting underlying hardware support
- better able to handle problems where the communication pattern is unknown or irregular

Issues:
- how to create remotely accessible memory
- reading, writing and updating remote memory
- data synchronisation
- the memory model
Window Creation
Regions of memory that we want to expose to RMA operations are called windows; they can be created in four ways (see the sketch below).
(Image: Fabio Affinito, MPI3)
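Since the figure is not reproduced here, a minimal sketch of the four MPI-3 window creation routines (standard MPI-3 calls; buf, size and comm are illustrative):

MPI_Win win;
int *buf;
MPI_Aint size = 1024 * sizeof(int);

/* 1. expose memory the user already allocated */
MPI_Win_create(buf, size, sizeof(int), MPI_INFO_NULL, comm, &win);
/* 2. let MPI allocate the memory (often enables a faster path) */
MPI_Win_allocate(size, sizeof(int), MPI_INFO_NULL, comm, &buf, &win);
/* 3. allocate memory that on-node processes can also load/store directly */
MPI_Win_allocate_shared(size, sizeof(int), MPI_INFO_NULL, comm, &buf, &win);
/* 4. create a window with no memory, then attach regions dynamically */
MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win);
MPI_Win_attach(win, buf, size);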
Simple Window Creation
(Image: Fabio Affinito, MPI3)
Data Movement
MPI provides operations to read, write and atomically modify remote data
- MPI_Get
- MPI_Put
- MPI_Accumulate
- MPI_Get_accumulate
- MPI_Compare_and_swap
- MPI_Fetch_and_op
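For instance, an atomic remote update (a hedged fragment; win, target and localval are assumed to have been set up as on the surrounding slides):

int localval = 1;
/* atomically add localval into the target's window at displacement 0 */
MPI_Accumulate(&localval, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_SUM, win);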
(Image: Fabio Affinito, MPI3)
The Memory Consistency Issue
(Image: Fabio Affinito, MPI3)

Three synchronization models:
- fence (active target)
- post-start-complete-wait (generalized active target)
- lock/unlock (passive target)

Data accesses occur within epochs.
Three Synchronization Models
(Image: Fabio Affinito, MPI3)
Passive Target Synchronization
(Image: William Gropp, More on One Sided Communication)
Completion Model
Relaxed memory model: acquire and release.
- immediate data movement
- delayed data movement
Which is best when?

(Image: William Gropp, More on One Sided Communication)
Memory Models
The unified memory model is new in MPI-3 - what are its advantages?

(Image: Torsten Hoefler, Advanced MPI 2.2 and 3.0 Tutorial)
Separate Semantics
A corresponding table exists for the unified semantics.

(Image: Torsten Hoefler, Advanced MPI 2.2 and 3.0 Tutorial)
MPI-3 Communication Options
(Image: T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1)
Example Codes
- Fence synchronization
- Post-Start-Complete-Wait synchronization
- Lock-Unlock synchronization
Fence Synchronization
// Start up MPI ...
MPI_Win win;
if (rank == 0) {
    /* Everyone will retrieve from a buffer on root */
    int soi = sizeof(int);
    MPI_Win_create(buf, soi*20, soi, MPI_INFO_NULL, comm, &win);
} else {
    /* Others only retrieve, so these windows can be size 0 */
    MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, comm, &win);
}

/* No local operations prior to this epoch, so give an assertion */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
if (rank != 0) {
    /* Inside the fence, make RMA calls to GET from rank 0 */
    MPI_Get(buf, 20, MPI_INT, 0, 0, 20, MPI_INT, win);
}

/* Complete the epoch - this will block until MPI_Get is complete */
MPI_Win_fence(0, win);
/* All done with the window - tell MPI there are no more epochs */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
/* Free up our window */
MPI_Win_free(&win);
// shut down ...

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/fence
Post-Start-Complete-Wait Synchronization
// Start up MPI ...
MPI_Group comm_group, group;

for (i = 0; i < 3; i++) {
    ranks[i] = i;  /* For forming groups, later */
}
MPI_Comm_group(MPI_COMM_WORLD, &comm_group);

/* Create new window for this comm */
if (rank == 0) {
    MPI_Win_create(buf, sizeof(int)*3, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
} else {
    /* Rank 1 or 2 */
    MPI_Win_create(NULL, 0, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
}

/* ----> continues on next slide ----> */

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/pscw
Post-Start-Complete-Wait Synchronization (2)
/* Now do the communication epochs */
if (rank == 0) {
    /* Origin group consists of ranks 1 and 2 */
    MPI_Group_incl(comm_group, 2, ranks+1, &group);
    /* Begin the exposure epoch */
    MPI_Win_post(group, 0, win);
    /* Wait for epoch to end */
    MPI_Win_wait(win);
} else {
    /* Target group consists of rank 0 */
    MPI_Group_incl(comm_group, 1, ranks, &group);
    /* Begin the access epoch */
    MPI_Win_start(group, 0, win);
    /* Put into rank==0 according to my rank */
    MPI_Put(buf, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    /* Terminate the access epoch */
    MPI_Win_complete(win);
}

/* Free window and groups */
MPI_Win_free(&win);
MPI_Group_free(&group);
MPI_Group_free(&comm_group);

// Shut down ...

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/pscw
Lock-Unlock Synchronization
// Start up MPI ...
MPI_Win win;

if (rank == 0) {
    /* Rank 0 will be the caller, so null window */
    MPI_Win_create(NULL, 0, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* Request lock of process 1 */
    MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
    MPI_Put(buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    /* Block until put succeeds */
    MPI_Win_unlock(1, win);
    /* Free the window */
    MPI_Win_free(&win);
} else {
    /* Rank 1 is the target process */
    MPI_Win_create(buf, 2*sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* No sync calls on the target process! */
    MPI_Win_free(&win);
}

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/lul
Case Studies
(Image: T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1)
Hands-on Exercise: The 3 Synchronization Methods
Fault Tolerance
Outline
1. Introduction to the PGAS Paradigm and Chapel
2. Chapel Programming Strategies for Distributed Memory
3. Runtime Support for PGAS
4. MPI One-Sided Communications
5. Fault Tolerance
HPC Systems: Fast, Complex and Error Prone
Sunway TaihuLight: the fastest supercomputer today (peak 125.4359 Pflop/s)
(Image: Top500)
[1] Dongarra, Jack. "Report on the Sunway TaihuLight System." Tech Report UT-EECS-16-742 (2016).
The Reliability Challenge in HPC
Reliability terms:
- MTTI: Mean Time To Interrupt
- MTTR: Mean Time To Repair
- MTBF: Mean Time Between Failures = MTTI + MTTR

Reliability figures for terascale systems:

System                    CPUs      Reliability       Src
LANL ASCI Q               8,192     MTTI: 6.5 hours   [2]
LLNL ASCI White (2003)    8,192     MTBF: 40 hours    [2]
PSC Lemieux               3,016     MTTI: 9.7 hours   [2]
LLNL BlueGene/L           106,496   MTTI: 7-10 days   [3]
[2] Feng, Wu-chun. "The importance of being low power in high performance computing." (2005).
[3] Bronevetsky, Greg, and Adam Moody. "Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O." (2009).
A Statistical Study of Failures on HPC Systems
Conclusions: "First, the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes."
The Reliability Challenge in HPC
Prediction of MTTI with three rates of growth in cores: doubling every 18, 24 and 30 months.
Fault Tolerance
As HPC systems grow in size, the MTTI shrinks, and long-running applications are at higher risk of encountering faults. Faults are generally classified into:
- hard faults: inhibit process execution and result in data loss
- soft faults: undetected bit flips silently corrupting data in disk, memory, or registers

Fault tolerance is the ability to contain faults and reduce their impact.
Fault Tolerance Techniques
Rollback recovery:
- returns the application to an old consistent state
- recomputes previously reached states before the failure
- common technique: checkpoint/restart

Forward recovery:
- computation proceeds after a failure without rollback
- requires a fault-aware runtime system (i.e. a runtime system that does not crash upon a failure)
- common techniques: replication, master-worker, ABFT (Algorithmic-Based Fault Tolerance)

Or composite techniques, e.g. replication-enhanced checkpointing [4].
[4] Ni, Xiang, et al. "ACR: Automatic checkpoint/restart for soft and hard error protection." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.
Rollback Recovery
Checkpoint/Restart
- the most widely used fault tolerance mechanism in HPC systems
- requires saving the application state periodically to reliable storage
- upon a failure, the application restarts from the last consistent checkpoint
Checkpointing Classifications
Coordinated:
- collective checkpointing
- all processes restart
- suitable for synchronized computations

Uncoordinated:
- processes checkpoint independently
- only the failed process restarts
- suitable for loosely coupled processes
- often requires message logging
- vulnerable to the domino effect
[5] Elnozahy, Elmootazbellah Nabil, et al. "A survey of rollback-recovery protocols in message-passing systems." ACM Computing Surveys (CSUR) 34.3 (2002): 375-408.
Checkpointing Classifications (Cont.)
Disk-based:
- I/O intensive
- applicable to all runtime systems

Diskless:
- replaces disk with in-memory replication
- fault-aware systems only
- more replicas → more reliability, but higher failure-free overhead
Checkpointing MPI Applications
Coordinated disk-based checkpointing is the common mechanism for fault tolerance on HPC platforms.
- provided transparently by some MPI implementations (e.g. Intel MPI: mpirun -chkpoint-interval 100sec -np 100 ./MyApp)
- provided outside of MPI by tools that dump the process image to disk, such as:
  - BLCR: Berkeley Lab Checkpoint/Restart for Linux
  - DMTCP: Distributed MultiThreaded CheckPointing
- or done manually by programmers using file system APIs (see the sketch below)

Diskless checkpointing is only applicable to fault-aware MPI implementations (like MPI-ULFM, which we cover in the last part of this lecture).
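A minimal sketch of the manual approach mentioned above (illustrative names: state, n, fname; each rank writes its own file after a coordinating barrier):

MPI_Barrier(comm);                    /* reach a globally consistent point */
char fname[64];
snprintf(fname, sizeof fname, "ckpt_rank%d.dat", rank);
FILE *f = fopen(fname, "wb");
fwrite(state, sizeof(double), n, f);  /* dump this rank's state */
fclose(f);
MPI_Barrier(comm);                    /* checkpoint complete on all ranks */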
Checkpoint Interval
The checkpoint interval has a crucial impact on performance:
- a long interval → fewer checkpoints, but more lost work upon a failure
- a short interval → more checkpoints, but less lost work upon a failure
Young's formula [6] is often used to compute the optimal checkpoint interval i as

    i = sqrt(2 * t * MTTI)

where t is the checkpointing time. The effective application utilization u of a system can be computed as [7]:

    u = 1 - (lost utilization for recovery + lost utilization for checkpointing)
      = 1 - (i/2 * 1/MTTI + t * 1/i)
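A worked example (numbers chosen purely for illustration): with checkpointing time t = 10 minutes and MTTI = 24 hours = 1440 minutes,

    i = sqrt(2 * 10 * 1440) ≈ 170 minutes
    u ≈ 1 - (170/2 * 1/1440 + 10/170) ≈ 1 - (0.059 + 0.059) ≈ 0.88

i.e. roughly 12% of the machine time is lost to recovery and checkpointing even at the optimal interval.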
[6] Young, John W. "A first order approximation to the optimum checkpoint interval." Communications of the ACM 17.9 (1974).
[7] Schroeder, Bianca, and Garth A. Gibson. "Understanding failures in petascale computers." Journal of Physics 78.1 (2007).
Projected System Utilization with C/R
Effective application utilization over time.
Forward Recovery
Replication
- executes one or more replicas of each process on independent nodes
- when a replica is lost, another replica takes over without rollback

Replication in message passing systems: the message ordering challenge.

- used as a detection and correction mechanism for silent data corruption errors
- despite its expensive resource requirements, recent studies [8,9] suggest that replication can be a viable alternative to checkpointing on extreme-scale systems with short MTTI

[8] Ropars, Thomas, et al. "Efficient Process Replication for MPI Applications: Sharing Work between Replicas." IPDPS 2015.
[9] Ferreira, Kurt, et al. "Evaluating the viability of process replication reliability for exascale systems." SC 2011.
Forward Recovery
Master-Worker
Worker failure: can be tolerated without rollback by assigning the tasks of the failed worker to another worker.
Master failure: can be tolerated using replication or checkpointing. Because the probability of master failure is constant (it does not depend on the scale of the application), it is often more efficient to treat a master failure as a fatal error.
Forward Recovery
Algorithmic-Based Fault Tolerance
- the design of custom recovery mechanisms based on expert knowledge of special algorithm properties (e.g. available data redundancy, the ability to approximate lost data from remaining data, ...)
- for example: using redundant data to recover lost sub-grids in PDE solvers that use the Sparse Grid Combination Technique (SGCT) [10]:
(Image: Ali, et al. 2016)
[10] Ali, Md Mohsin, et al. "Complex scientific applications made fault-tolerant with the sparse grid combination technique." IJHPCA 30.3 (2016): 335-359.
MPI and Fault Tolerance
The MPI standard does not specify the behaviour of MPI when ranks fail, and most implementations terminate the application as a result of a rank failure. Users rely on coordinated disk-based checkpointing because it does not require fault tolerance support from MPI. However, the time to checkpoint an application with a large memory footprint can exceed the MTTI on large systems, making coordinated disk-based checkpointing inapplicable at large scale. User-level fault tolerance techniques can deliver better performance; however, they require fault tolerance support from MPI.
MPI and Fault Tolerance (Cont.)
MPI User Level Failure Mitigation
- a proposal by the MPI Forum's Fault Tolerance Working Group to add fault tolerance semantics to MPI
- under assessment by the MPI Forum to be part of the coming MPI-4 standard
- a reference implementation of ULFM is available, based on Open MPI 1.7
MPI User Level Failure Mitigation
In the following, we cover these aspects of MPI-ULFM:
- error handling
- failure notification
- failure mitigation
Error Handling (1/2)
In standard MPI:
- most MPI interfaces return an error code (e.g. 0 = MPI_SUCCESS, 1 = MPI_ERR_BUFFER, ...)
- we can set an error handler on a communicator using MPI_Comm_set_errhandler
- there are two predefined error handlers:
  - MPI_ERRORS_ARE_FATAL: terminates MPI (the default)
  - MPI_ERRORS_RETURN: returns an error code to the caller
The user can also define a customized error handler, as follows:
/* User's error handling function */
void errorCallback(MPI_Comm *comm, int *errCode, ...) { }

/* Changing the communicator's error handler */
MPI_Errhandler handler;
MPI_Comm_create_errhandler(errorCallback, &handler);
MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
Error Handling (2/2)
ULFM uses the same error handling mechanism as standard MPI. It adds new error codes to report process failure events:
- 54 = MPI_ERR_PROC_FAILED
- 55 = MPI_ERR_PROC_FAILED_PENDING
- 56 = MPI_ERR_REVOKED

The default error handler MPI_ERRORS_ARE_FATAL must not be used.
Failure Notification (1/3)
Process failure errors are raised only in MPI operations that involve a failed rank.
Point-to-point operations:
- using a named rank
- using MPI_ANY_SOURCE
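The examples on the original slide are images; here is a minimal hedged sketch of how the error codes surface (assuming a communicator whose error handler returns, and the MPI_-prefixed names used on these slides - the ULFM reference implementation uses an MPIX_ prefix):

int buf, rc;
/* named rank: if rank 3 has failed, the receive raises an error */
rc = MPI_Recv(&buf, 1, MPI_INT, 3, 0, comm, MPI_STATUS_IGNORE);
if (rc == MPI_ERR_PROC_FAILED) { /* rank 3 is dead: no match is possible */ }

/* MPI_ANY_SOURCE: a failed potential sender leaves the receive pending */
MPI_Request req;
MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &req);
rc = MPI_Wait(&req, MPI_STATUS_IGNORE);
if (rc == MPI_ERR_PROC_FAILED_PENDING) {
  MPI_Comm_failure_ack(comm);              /* acknowledge known failures */
  rc = MPI_Wait(&req, MPI_STATUS_IGNORE);  /* may now complete normally */
}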
Failure Notification (2/3)
Process failure errors are raised only in MPI operations that involve a failed rank.
Collective operations: some live processes may raise an error, while others return successfully.
Failure Notification (3/3)
Process failure errors are raised only in MPI operations that involve a failed rank.
Non-blocking operations: error reporting is postponed to the corresponding completion function (e.g. MPI Wait, MPI Test).
Failure Mitigation Interfaces (1/2)
MPI_Comm_failure_ack(comm)
- a local operation that acknowledges all detected failures on the communicator
- its purpose is to silence process failure errors in future MPI_ANY_SOURCE calls that involve an acknowledged process failure

MPI_Comm_failure_get_acked(comm, failedgrp)
- returns the group of failed ranks that were already acknowledged
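A small hedged usage sketch (slide-style MPI_ names again; comm is assumed to be set up):

MPI_Group failed;
int nfailed;
MPI_Comm_failure_ack(comm);                 /* acknowledge everything seen so far */
MPI_Comm_failure_get_acked(comm, &failed);  /* which ranks are known dead? */
MPI_Group_size(failed, &nfailed);
printf("%d rank(s) known to have failed\n", nfailed);
MPI_Group_free(&failed);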
Failure Mitigation Interfaces (2/2)
MPI_Comm_revoke(comm)
- a local operation that invalidates the communicator
- any future communication on a revoked communicator fails with error MPI_ERR_REVOKED
- live ranks must collectively create a new communicator

MPI_Comm_shrink(oldcomm, newcomm)
- a collective operation that creates a new communicator excluding the dead ranks in the old communicator
- like other collectives, it may succeed at some ranks and fail at others

MPI_Comm_agree(oldcomm, flag)
- a collective operation for participants to agree on some value
- the only collective operation that returns the same result to all participants
Resilient Iterative Application Skeleton
#define CKPT_INTERVAL 10  /* the checkpointing interval */
#define MAX_ITER 100      /* maximum no. of iterations */

MPI_Comm world;   /* the working communicator */
int nprocs;       /* communicator size */
int rank;         /* my rank */
bool restart;     /* restart flag */

void compute();      /* executes the iterative computation,
                      * and orchestrates checkpoint/restart */

int runIter(int i);  /* runs a single iteration,
                      * and returns the MPI error code
                      * of the last MPI call */

void shrinkWorld();  /* shrinks a failed communicator,
                      * and sets the new rank and nprocs */

void errorCallback(MPI_Comm* comm, int* rc, ...);
                     /* the communicator's error handler */

void writeCkpt();    /* creates a new checkpoint */

int readCkpt();      /* loads the last checkpoint, and
                      * returns the corresponding iteration */
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  /* the initial world state */
  world = MPI_COMM_WORLD;
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &nprocs);

  /* setting the error handler */
  MPI_Errhandler errHandler;
  MPI_Comm_create_errhandler(errorCallback, &errHandler);
  MPI_Comm_set_errhandler(world, errHandler);

  compute();

  MPI_Finalize();
  return 0;
}
/* orchestrates the iterative processing and C/R */
void compute() {
  int rc;          /* holds MPI return codes */
  restart = false; /* set to true only in errorCallback() */
  int i = 0;       /* current iteration number */
  do {
    if (restart) {
      i = readCkpt();
      rc = MPI_Comm_agree(world, &i);
      if (rc != MPI_SUCCESS)
        continue;
      restart = false;
    }
    while (i < MAX_ITER) {
      rc = runIter(i);
      if (rc != MPI_SUCCESS)
        break; /* jump to the outer loop to restart */

      if (i % CKPT_INTERVAL == 0)
        writeCkpt();

      i++;
    }
  } while (restart || i < MAX_ITER);
}
/* a callback function to handle MPI errors */
void errorCallback(MPI_Comm *comm, int *errCode, ...) {
  if (*errCode != MPI_ERR_PROC_FAILED &&
      *errCode != MPI_ERR_PROC_FAILED_PENDING &&
      *errCode != MPI_ERR_REVOKED) {
    /* we only tolerate process failure errors */
    MPI_Abort(*comm, -1);
  }

  /* acknowledge the detected failures */
  MPI_Comm_failure_ack(*comm);

  if (*errCode != MPI_ERR_REVOKED) {
    /* propagate the failure to other ranks */
    MPI_Comm_revoke(*comm);
  }

  /* all live ranks must reach this point
   * to collectively shrink the communicator */
  shrinkWorld();

  restart = true;
}
/* creates a new communicator for the application,
 * excluding dead ranks in the old (revoked) communicator */
void shrinkWorld() {
  int rc; /* shrink return code */
  MPI_Comm newComm;
  do {
    rc = MPI_Comm_shrink(world, &newComm);
    MPI_Comm_agree(newComm, &rc);
  } while (rc != MPI_SUCCESS);

  /* update the communicator */
  world = newComm;

  /* update my rank and nprocs */
  MPI_Comm_rank(world, &rank);
  MPI_Comm_size(world, &nprocs);
}
Fault Tolerance: Summary
Topics covered today:
- the decreasing reliability of HPC systems as they grow larger
- fault tolerance techniques (C/R, replication, master-worker, ABFT)
- the MPI-ULFM proposal for adding fault tolerance support to MPI

Acknowledgement:
The fault tolerance part of today's lecture is influenced by materials from the SC'16 tutorial "Fault Tolerance for HPC: Theory and Practice".
Hands-on Exercise: Checkpointing and ULFM