

slide-1
SLIDE 1

Partitioned Global Address Space Paradigm ASD Distributed Memory HPC Workshop

Computer Systems Group

Research School of Computer Science Australian National University Canberra, Australia

November 02, 2017

slide-2
SLIDE 2

Day 4 – Schedule

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 2 / 90

slide-3
SLIDE 3

Introduction to the PGAS Paradigm and Chapel

Outline

1

Introduction to the PGAS Paradigm and Chapel

2

Chapel Programming Strategies for Distributed Memory

3

Runtime Support for PGAS

4

MPI One-Sided Communications

5

Fault Tolerance

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 3 / 90

slide-4
SLIDE 4

Introduction to the PGAS Paradigm and Chapel

Partitioned Global Address Space

recall the shared memory model: multiple threads with pointers to a global address space

in the partitioned global address space (PGAS) model:

multiple threads, each with affinity to some portion of the global address space
SPMD or fork-join thread creation
remote pointers to access data in other partitions

the model maps to a cluster with remote memory access; it can also map to NUMA domains

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 4 / 90

slide-5
SLIDE 5

Introduction to the PGAS Paradigm and Chapel

Chapel: Design Principles

Cray High Performance Language

Originally developed under the DARPA High Productivity Computing Systems program; targeted at massively parallel computers

Object-oriented (Java-like syntax, but influenced by ZPL & HPF)

Supports exploratory programming: implicit (statically-inferable) types, run-time settable parameters (config), implicit main and module wrappings

Multiresolution design: build higher-level concepts in terms of lower-level ones

Fork-join, not SPMD

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 5 / 90

slide-6
SLIDE 6

Introduction to the PGAS Paradigm and Chapel

Chapel: Language Primitives

Task parallelism:

concurrent loops and blocks (cobegin, coforall)

Data parallelism:

Concurrent map operations (forall)
Concurrent fold operations (scan, reduce)

Synchronization:

Task synchronization, sync variables, atomic sections

Locality:

locales (UMA places to hold data and run tasks)
(index) domains, used to specify arrays and iteration ranges
distributions (mappings of domains to locales)

Can drastically reduce code size compared to MPI+X; more information on the Chapel home page.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 6 / 90

slide-7
SLIDE 7

Introduction to the PGAS Paradigm and Chapel

Chapel: Compile Chain

chpl compiler generates standard C code, or uses LLVM backend

(Image: Cray Inc.) Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 7 / 90

slide-8
SLIDE 8

Introduction to the PGAS Paradigm and Chapel

Chapel: Base Language

variables, constants, parameters:

  var timestep: int;
  param pi: real = 3.14159265;
  config const epsilon = 0.05;    // $ ./myProgram --epsilon=0.01

records:

  record Vector3D {
    var x, y, z: real;
  }
  var pos = new Vector3D(0.0, 1.0, 1.5);
  pos.x = 2.0;
  var copy = pos;   // copied by value

classes:

  class Person {
    var firstName, surname: string;
    var age: int;
  }
  var patsy = new Person("Patricia", "Stone", 39);

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 8 / 90

slide-9
SLIDE 9

Introduction to the PGAS Paradigm and Chapel

Chapel: Base Language (2)

procedures, type inference, generic methods:

  proc square(n) {
    return n * n;
  }

  var x = 2;
  var x2 = square(x);
  writeln(x2, ": ", x2.type:string);   // 4: int(64)

  var y = 0.5;
  var y2 = square(y);
  writeln(y2, ": ", y2.type:string);   // 0.25: real(64)

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 9 / 90

slide-10
SLIDE 10

Introduction to the PGAS Paradigm and Chapel

Chapel: Base Language (3)

iterators:

  iter triangle(n) {
    var current = 0;
    for i in 1..n {
      current += i;
      yield current;
    }
  }

tuples, zippered iteration:

  config const n = 10;
  for (i,t) in zip(0..#n, triangle(n)) do
    writeln("triangle number ", i, " is ", t);

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 10 / 90

slide-11
SLIDE 11

Introduction to the PGAS Paradigm and Chapel

Chapel: Task Parallelism

task creation:

  begin doStuff();      // spawn task and don't wait
  cobegin {
    doStuff1();
    doStuff2();
  }   // wait for completion of all statements in the block

synchronisation variables:

  var a$: sync int;
  begin a$ = foo();
  var c = 2 * a$;   // suspend until a$ is assigned

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 11 / 90

slide-12
SLIDE 12

Introduction to the PGAS Paradigm and Chapel

Chapel: Synchronization Variables

single variables can only be written once; sync variables are reset to empty when read.

  var item$: sync int;
  proc produce() {
    for i in 0..#N do
      item$ = i;
  }
  proc consume() {
    for i in 0..#N {
      var x = item$;
      writeln(x);
    }
  }

  begin produce();
  begin consume();

  var latch$: single bool;
  proc await() {
    latch$;
  }
  proc release() {
    latch$ = true;
  }

  begin await();
  begin release();

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 12 / 90

slide-13
SLIDE 13

Introduction to the PGAS Paradigm and Chapel

Chapel: Task Parallelism Example

Fibonacci numbers:

  proc fib(n): int {
    if n <= 2 then
      return 1;
    var t1$: single int;
    var t2: int;
    begin t1$ = fib(n-1);
    t2 = fib(n-2);
    // wait for t1$
    return t1$ + t2;
  }

  proc fib(n): int {
    if n <= 2 then
      return 1;
    var t1$, t2$: single int;
    cobegin {
      t1$ = fib(n-1);
      t2$ = fib(n-2);
    }
    // wait for t1$ and t2$
    return t1$ + t2$;
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 13 / 90

slide-14
SLIDE 14

Introduction to the PGAS Paradigm and Chapel

Chapel: Data Parallelism

ranges:

  var r1 = 0..3;          // 0, 1, 2, 3
  var r2 = 0..#10 by 2;   // 0, 2, 4, 6, 8

arrays, data parallel loops:

  var A, B: [0..#N] real;
  forall i in 0..#N do    // cf. coforall
    A(i) = A(i) + B(i);

scalar promotion:

  A = A + B;

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 14 / 90

slide-15
SLIDE 15

Introduction to the PGAS Paradigm and Chapel

Chapel: Data Parallelism (2)

example: DAXPY

  config const alpha = 3.0;
  const MyRange = 0..#N;
  proc daxpy(x: [MyRange] real, y: [MyRange] real) {
    forall i in MyRange do
      y(i) = alpha * x(i) + y(i);
  }

Alternatively, via promotion, the forall loop can be replaced by:

y = alpha * x + y;

reductions and scans:

  var mx = (max reduce A);
  A = (+ scan A);   // prefix sum of A - parallel?

the target of data parallelism could be SIMD, GPU or normal threads (currently no way to express this)

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 15 / 90

slide-16
SLIDE 16

Introduction to the PGAS Paradigm and Chapel

Chapel: forall vs. coforall

Use forall when iterations may be executed in parallel.
Use coforall when iterations must be executed in parallel.
What's wrong with this code?

  var a$: [0..#N] single int;
  forall i in {0..#N} {
    if i < (N-1) then
      a$[i] = a$[i+1] - 1;
    else
      a$[i] = N;
    var result = a$[i];
    writeln(result);
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 16 / 90

slide-17
SLIDE 17

Introduction to the PGAS Paradigm and Chapel

Chapel: Task Intents

constant (default):

  config const N = 10;
  var race: int;
  coforall i in 0..#N do
    race += 1;   // illegal!

reference:

  var deliberateRace: int;
  coforall i in 0..#N with (ref deliberateRace) do
    deliberateRace += 1;

reduce:

  var sum: int;
  coforall i in 0..#N with (+ reduce sum) do
    sum += i;

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 17 / 90

slide-18
SLIDE 18

Introduction to the PGAS Paradigm and Chapel

Chapel: Domains

domain: an index set, which can be used to declare arrays

dense (rectangular): a tensor product of ranges, e.g.

  config const M = 5, N = 7;
  const D: domain(2) = {0..#M, 0..#N};

strided:

  const D1 = {0..#M by 4, 0..#N by 2};

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 18 / 90

slide-19
SLIDE 19

Introduction to the PGAS Paradigm and Chapel

Chapel: Domains (2)

sparse:

  const SparseD: sparse subdomain(D)
    = ((0,0), (1,2), (3,2), (4,4));

associative:

  var Colours: domain(string) = {"Black", "Yellow", "Red"};

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 19 / 90

slide-20
SLIDE 20

Introduction to the PGAS Paradigm and Chapel

Chapel: Locales

locale: a unit of the target architecture: processing elements with (uniform) local memory

  const Locales: [0..#numLocales] locale = ... ;   // built-in
  on Locales[1] do
    foo();

  coforall (loc, id) in zip(Locales, 1..) do
    on loc do   // migrates this task to loc
      coforall tid in 0..#numTasks do
        writeln("Task ", id, " thread ", tid, " on ", loc);
Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 20 / 90

slide-21
SLIDE 21

Introduction to the PGAS Paradigm and Chapel

Chapel: Domain Maps

use domain maps to map indices in a domain to locales:

  use CyclicDist;
  const Dist = new dmap(
    new Cyclic(startIdx = 1, targetLocales = Locales[0..1]));
  const D = {0..#N} dmapped Dist;
  var x, y: [D] real;

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 21 / 90

slide-22
SLIDE 22

Introduction to the PGAS Paradigm and Chapel

Chapel: Domain Maps (2)

block:

  use BlockDist;
  const space1D = {0..#N};
  const B = space1D dmapped Block(boundingBox = space1D);
Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 22 / 90

slide-23
SLIDE 23

Introduction to the PGAS Paradigm and Chapel

Hands-on Exercise: Locales in Chapel

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 23 / 90

slide-24
SLIDE 24

Chapel Programming Strategies for Distributed Memory

Outline

1

Introduction to the PGAS Paradigm and Chapel

2

Chapel Programming Strategies for Distributed Memory

3

Runtime Support for PGAS

4

MPI One-Sided Communications

5

Fault Tolerance

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 24 / 90

slide-25
SLIDE 25

Chapel Programming Strategies for Distributed Memory

Chapel: Programming Strategies

Think globally, compute locally

Define key data structures

arrays, domains

Specify distribution and layout

domain maps

Exploit parallelism over the available hardware

(co-)forall, (co-)begin

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 25 / 90

slide-26
SLIDE 26

Chapel Programming Strategies for Distributed Memory

Chapel: Matrix Multiplication

start with sequential matrix multiplication:

  proc matMul(const ref A, const ref B, C) {
    for (m,n) in C.domain {
      var c = 0.0;
      for k in A.domain.dim(2) do
        c += A[m,k] * B[k,n];
      C[m,n] = c;
    }
  }

  config const M = 4, K = 4, N = 4;
  var A: [0..#M, 0..#K] real;
  var B: [0..#K, 0..#N] real;
  var C: [0..#M, 0..#N] real;
  matMul(A, B, C);

(Figure: index roles of the matrices A, B and C in the multiplication)

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 26 / 90

slide-27
SLIDE 27

Chapel Programming Strategies for Distributed Memory

Chapel: Performance Timing

One way to measure elapsed time:

  var timer: Timer;
  timer.start();
  matMul(A, B, C);
  timer.stop();
  var timeMillis = timer.elapsed(TimeUnits.milliseconds);
  writef("Serial Multiply M=%i,N=%i,K=%i took %7.3dr ms (%7.3dr GFLOP/s)\n",
         M, N, K, timeMillis, 2*M*K*N/1e6/timeMillis);

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 27 / 90

slide-28
SLIDE 28

Chapel Programming Strategies for Distributed Memory

Chapel: Matrix Multiplication

parallel, single locale:

  proc parMatMul(const ref A, const ref B, C) {
    forall (m,n) in C.domain {
      var c = 0.0;
      for k in A.domain.dim(2) do
        c += A[m,k] * B[k,n];
      C[m,n] = c;
    }
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 28 / 90

slide-29
SLIDE 29

Chapel Programming Strategies for Distributed Memory

Chapel: Matrix Multiplication

parallel, distributed:

  const rows = reshape(Locales, {0..#numLocales, 0..0});
  const cols = reshape(Locales, {0..0, 0..#numLocales});
  const spaceA = {0..#M, 0..#K};
  const dA: domain(2) dmapped Block(spaceA, rows) = spaceA;
  const spaceB = {0..#K, 0..#N};
  const dB: domain(2) dmapped Block(spaceB, cols) = spaceB;
  const spaceC = {0..#M, 0..#N};
  const dC: domain(2) dmapped Block(spaceC, rows) = spaceC;
  var blockA: [dA] real;
  var blockB: [dB] real;
  var blockC: [dC] real;
  parMatMul(blockA, blockB, blockC);

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 29 / 90

slide-30
SLIDE 30

Chapel Programming Strategies for Distributed Memory

Chapel: Programming Strategies (continued)

Batch communications - avoid fine-grained remote accesses

array slicing; specialized distributions, e.g. StencilDist

Overlap computation and communication

tasks, sync variables

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 30 / 90

slide-31
SLIDE 31

Chapel Programming Strategies for Distributed Memory

Chapel: Further Reading

Chapel Web page: http://chapel.cray.com Chapel tutorials: http://chapel.cray.com/tutorials.html

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 31 / 90

slide-32
SLIDE 32

Chapel Programming Strategies for Distributed Memory

Hands-on Exercise: 2D Stencil via Templates

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 32 / 90

slide-33
SLIDE 33

Runtime Support for PGAS

Outline

1

Introduction to the PGAS Paradigm and Chapel

2

Chapel Programming Strategies for Distributed Memory

3

Runtime Support for PGAS

4

MPI One-Sided Communications

5

Fault Tolerance

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 33 / 90

slide-34
SLIDE 34

MPI One-Sided Communications

Outline

1

Introduction to the PGAS Paradigm and Chapel

2

Chapel Programming Strategies for Distributed Memory

3

Runtime Support for PGAS

4

MPI One-Sided Communications

5

Fault Tolerance

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 34 / 90

slide-35
SLIDE 35

MPI One-Sided Communications

Programming Models

Each process exposes a part of its memory to the other processes Allow data movement without direct involvement of process that holds the data

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 35 / 90

slide-36
SLIDE 36

MPI One-Sided Communications

Comparison with Two-sided

http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture34.pdf Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 36 / 90

slide-37
SLIDE 37

MPI One-Sided Communications

It’s all about Memory Consistency

Remember this from the shared memory course? Memory consistency concerns how memory behaves with respect to read and write operations from multiple processors.

Sequential consistency:

"the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."

See: Shared Memory Consistency Models: A Tutorial, Sarita V. Adve and Kourosh Gharachorloo

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 37 / 90

slide-38
SLIDE 38

MPI One-Sided Communications

Reference Material

Subsequent slides will draw heavily on the following material:

Overviews of MPI-3

William D. Gropp: New Features of MPI-3
Fabio Affinito: MPI3

Two detailed lectures on one-sided MPI

William Gropp: One-sided Communication in MPI
William Gropp: More on One Sided Communication

Tutorial on MPI 2.2 and 3.0 by Torsten Hoefler

Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial

Detailed paper on remote memory access programming in MPI-3

T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 (March 2013)

Cornell Virtual Workshop on one-sided communication methods

https://cvw.cac.cornell.edu/MPIoneSided/default

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 38 / 90

slide-39
SLIDE 39

MPI One-Sided Communications

RMA advantages and Issues

Advantages

Multiple transfers with a single synchronization
Bypass tag matching
Can be faster by exploiting underlying hardware support
Better able to handle problems where the communication pattern is unknown or irregular

Issues

How to create remotely accessible memory
Reading, writing and updating remote memory
Data synchronisation
Memory model

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 39 / 90

slide-40
SLIDE 40

MPI One-Sided Communications

Window Creation

Regions of memory that we want to expose to RMA operations are called windows; they can be created in four ways (sketched below).
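As a supplementary sketch (not taken from the original figure; the helper function, buffer sizes and info arguments are illustrative assumptions), the four MPI-3 window creation routines look roughly like this:

  #include <mpi.h>
  #include <stdlib.h>

  /* Sketch only: four ways to create an RMA window in MPI-3. */
  void window_creation_examples(MPI_Comm comm) {
      const int count = 1000;
      MPI_Win win1, win2, win3, win4;

      /* 1. Expose an existing, user-allocated buffer. */
      int *buf = malloc(count * sizeof(int));
      MPI_Win_create(buf, count * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, comm, &win1);

      /* 2. Let MPI allocate the memory and expose it (often faster). */
      int *abuf;
      MPI_Win_allocate(count * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, comm, &abuf, &win2);

      /* 3. Allocate memory shared among processes on the same node
       *    (comm must contain only such processes, e.g. one obtained
       *    from MPI_Comm_split_type with MPI_COMM_TYPE_SHARED). */
      int *sbuf;
      MPI_Win_allocate_shared(count * sizeof(int), sizeof(int),
                              MPI_INFO_NULL, comm, &sbuf, &win3);

      /* 4. Create a window with no memory; attach/detach regions later. */
      MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win4);
      MPI_Win_attach(win4, buf, count * sizeof(int));
      MPI_Win_detach(win4, buf);

      MPI_Win_free(&win4); MPI_Win_free(&win3);
      MPI_Win_free(&win2); MPI_Win_free(&win1);
      free(buf);
  }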

(Image: Fabio Affinito, MPI3) Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 40 / 90

slide-41
SLIDE 41

MPI One-Sided Communications

Simple Window Creation

(Image: Fabio Affinito, MPI3) Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 41 / 90

slide-42
SLIDE 42

MPI One-Sided Communications

Data Movement

MPI provides operations to read, write and atomically modify remote data (a brief usage sketch follows the list):

MPI_Get
MPI_Put
MPI_Accumulate
MPI_Get_accumulate
MPI_Compare_and_swap
MPI_Fetch_and_op
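A minimal, hedged sketch of how some of these calls might be used inside a passive-target epoch (the window, target rank, offsets and values are assumptions made for illustration):

  #include <mpi.h>

  /* Sketch only: one origin process updating a window on target rank 1
   * inside a passive-target (lock/unlock) epoch. */
  void rma_ops_example(MPI_Win win) {
      int val = 42, old, sum = 7, result;

      MPI_Win_lock(MPI_LOCK_SHARED, /* target = */ 1, 0, win);

      /* write val into element 0 of rank 1's window */
      MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);

      /* read element 1 of rank 1's window into result */
      MPI_Get(&result, 1, MPI_INT, 1, 1, 1, MPI_INT, win);

      /* atomically add sum into element 2 */
      MPI_Accumulate(&sum, 1, MPI_INT, 1, 2, 1, MPI_INT, MPI_SUM, win);

      /* atomically fetch element 3 and add val to it */
      MPI_Fetch_and_op(&val, &old, MPI_INT, 1, 3, MPI_SUM, win);

      MPI_Win_unlock(1, win);   /* operations complete at the target here */
  }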

(Source: Fabio Affinito, MPI3) Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 42 / 90

slide-43
SLIDE 43

MPI One-Sided Communications

The Memory Consistency Issue

Fabio Affinito: MPI3

Three Synchronization models

Fence (active target)
Post-start-complete-wait (generalized active target)
Lock/Unlock (passive target)

Data accesses occur within epochs

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 43 / 90

slide-44
SLIDE 44

MPI One-Sided Communications

Three Synchronization Models

Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 44 / 90

slide-45
SLIDE 45

MPI One-Sided Communications

Passive Target Synchronization
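The original slide shows a figure from William Gropp's lecture. As a supplementary sketch under stated assumptions (the window layout and counter use are invented for illustration), passive-target synchronisation with the MPI-3 lock_all/flush interface might look like this; the target processes never issue matching synchronisation calls:

  #include <mpi.h>

  /* Sketch only: passive-target RMA with MPI_Win_lock_all. */
  void passive_target_example(MPI_Win win, int nprocs, int myrank) {
      int one = 1;

      /* start a shared access epoch to every process in the window */
      MPI_Win_lock_all(0, win);

      /* atomically increment a counter in element 0 of every other rank */
      for (int r = 0; r < nprocs; r++) {
          if (r == myrank) continue;
          MPI_Accumulate(&one, 1, MPI_INT, r, 0, 1, MPI_INT, MPI_SUM, win);
      }

      /* force completion of all outstanding operations at their targets */
      MPI_Win_flush_all(win);

      /* ... further RMA operations could go here ... */

      MPI_Win_unlock_all(win);
  }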

(Image: William Gropp, More on One Sided Communication) Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 45 / 90

slide-46
SLIDE 46

MPI One-Sided Communications

Completion Model

Relaxed memory model, acquire and release

Immediate Data Movement
Delayed Data Movement
Which is best when?

William Gropp: More on One Sided Communication Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 46 / 90

slide-47
SLIDE 47

MPI One-Sided Communications

Memory Models

The unified memory model is new in MPI-3; what are its advantages?
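One practical consequence, sketched here as an illustration (the function name is an assumption), is that a program can query which model a given window provides via the MPI_WIN_MODEL attribute:

  #include <mpi.h>
  #include <stdio.h>

  /* Sketch only: query whether a window uses the MPI-3 unified or the
   * separate memory model. */
  void print_memory_model(MPI_Win win) {
      int *model, flag;
      MPI_Win_get_attr(win, MPI_WIN_MODEL, &model, &flag);
      if (flag && *model == MPI_WIN_UNIFIED)
          printf("unified: public and private window copies are identical\n");
      else
          printf("separate: explicit synchronisation reconciles the copies\n");
  }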

(Image: Torsten Hoefler, Advanced MPI 2.2 and 3.0 Tutorial) Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 47 / 90

slide-48
SLIDE 48

MPI One-Sided Communications

Separate Semantics

Another table for unified semantics

Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 48 / 90

slide-49
SLIDE 49

MPI One-Sided Communications

MPI-3 Communication Options

T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 49 / 90

slide-50
SLIDE 50

MPI One-Sided Communications

Example Codes

Fence Synchronization Post-Start-Complete-Wait Synchronization Lock-Unlock Synchronization

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 50 / 90

slide-51
SLIDE 51

MPI One-Sided Communications

Fence Synchronization

  // Start up MPI ...
  MPI_Win win;
  if (rank == 0) {
      /* Everyone will retrieve from a buffer on root */
      int soi = sizeof(int);
      MPI_Win_create(buf, soi*20, soi, MPI_INFO_NULL, comm, &win);
  } else {
      /* Others only retrieve, so these windows can be size 0 */
      MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, comm, &win);
  }

  /* No local operations prior to this epoch, so give an assertion */
  MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
  if (rank != 0) {
      /* Inside the fence, make RMA calls to GET from rank 0 */
      MPI_Get(buf, 20, MPI_INT, 0, 0, 20, MPI_INT, win);
  }

  /* Complete the epoch - this will block until MPI_Get is complete */
  MPI_Win_fence(0, win);
  /* All done with the window - tell MPI there are no more epochs */
  MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
  /* Free up our window */
  MPI_Win_free(&win);
  // shut down ...

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/fence

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 51 / 90

slide-52
SLIDE 52

MPI One-Sided Communications

Post-Start-Complete-Wait Synchronization

  // Start up MPI ...
  MPI_Group comm_group, group;

  for (i = 0; i < 3; i++) {
      ranks[i] = i;   /* For forming groups, later */
  }
  MPI_Comm_group(MPI_COMM_WORLD, &comm_group);

  /* Create new window for this comm */
  if (rank == 0) {
      MPI_Win_create(buf, sizeof(int)*3, sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);
  } else {
      /* Rank 1 or 2 */
      MPI_Win_create(NULL, 0, sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);
  }

  /* ----> continues on next slide ----> */

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/pscw

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 52 / 90

slide-53
SLIDE 53

MPI One-Sided Communications

Post-Start-Complete-Wait Synchronization (2)

  /* Now do the communication epochs */
  if (rank == 0) {
      /* Origin group consists of ranks 1 and 2 */
      MPI_Group_incl(comm_group, 2, ranks+1, &group);
      /* Begin the exposure epoch */
      MPI_Win_post(group, 0, win);
      /* Wait for epoch to end */
      MPI_Win_wait(win);
  } else {
      /* Target group consists of rank 0 */
      MPI_Group_incl(comm_group, 1, ranks, &group);
      /* Begin the access epoch */
      MPI_Win_start(group, 0, win);
      /* Put into rank==0 according to my rank */
      MPI_Put(buf, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
      /* Terminate the access epoch */
      MPI_Win_complete(win);
  }

  /* Free window and groups */
  MPI_Win_free(&win);
  MPI_Group_free(&group);
  MPI_Group_free(&comm_group);

  // Shut down ...

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/pscw

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 53 / 90

slide-54
SLIDE 54

MPI One-Sided Communications

Lock-Unlock Synchronization

  // Start up MPI ...
  MPI_Win win;

  if (rank == 0) {
      /* Rank 0 will be the caller, so null window */
      MPI_Win_create(NULL, 0, 1,
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);
      /* Request lock of process 1 */
      MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
      MPI_Put(buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
      /* Block until put succeeds */
      MPI_Win_unlock(1, win);
      /* Free the window */
      MPI_Win_free(&win);
  } else {
      /* Rank 1 is the target process */
      MPI_Win_create(buf, 2*sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);
      /* No sync calls on the target process! */
      MPI_Win_free(&win);
  }

Source: Cornell Virtual Workshop: https://cvw.cac.cornell.edu/MPIoneSided/lul

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 54 / 90

slide-55
SLIDE 55

MPI One-Sided Communications

Case Studies

T. Hoefler et al., 2013. Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 55 / 90

slide-56
SLIDE 56

MPI One-Sided Communications

Hands-on Exercise: The 3 Synchronization Methods

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 56 / 90

slide-57
SLIDE 57

Fault Tolerance

Outline

1

Introduction to the PGAS Paradigm and Chapel

2

Chapel Programming Strategies for Distributed Memory

3

Runtime Support for PGAS

4

MPI One-Sided Communications

5

Fault Tolerance

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 57 / 90

slide-58
SLIDE 58

Fault Tolerance

HPC Systems: Fast, Complex and Error Prone

Sunway TaihuLight: the fastest supercomputer today (peak 125.4359 Pflop/s)

(Image: Top500)

[1] Dongarra, Jack. "Report on the Sunway TaihuLight System". Tech Report UT-EECS-16-742 (2016)

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 58 / 90

slide-59
SLIDE 59

Fault Tolerance

The Reliability Challenge in HPC

Reliability Terms

MTTI: Mean Time To Interrupt
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures = MTTI + MTTR

Reliability Figures for Terascale Systems:

  System                    CPUs      Reliability        Src
  LANL ASCI Q               8,192     MTTI: 6.5 hours    [2]
  LLNL ASCI White (2003)    8,192     MTBF: 40 hours     [2]
  PSC Lemieux               3,016     MTTI: 9.7 hours    [2]
  LLNL BlueGene/L           106,496   MTTI: 7-10 days    [3]

[2] Feng, Wu-chun. "The importance of being low power in high performance computing." (2005)
[3] Bronevetsky, Greg, and Adam Moody. Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O. (2009)

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 59 / 90

slide-60
SLIDE 60

Fault Tolerance

A Statistical Study of Failures on HPC Systems

Conclusions: "First, the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes."

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 60 / 90

slide-61
SLIDE 61

Fault Tolerance

The Reliability Challenge in HPC

Prediction of MTTI with three rates of growth in cores: doubling every 18, 24 and 30 months Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 61 / 90

slide-62
SLIDE 62

Fault Tolerance

Fault Tolerance

As HPC systems grow in size, the MTTI shrinks and long-running applications are at higher risk of encountering faults. Faults are generally classified into:

hard faults: inhibit process execution and result in data loss.
soft faults: undetected bit flips silently corrupting data in disk, memory, or registers.

Fault tolerance is the ability to contain faults and reduce their impact.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 62 / 90

slide-63
SLIDE 63

Fault Tolerance

Fault Tolerance Techniques

Rollback recovery:

Returns the application to an old consistent state.
Recomputes previously reached states before the failure.
Common technique:

Checkpoint/restart

Forward recovery:

Computation proceeds after a failure without rollback.
Requires a fault-aware runtime system (i.e. a runtime system that does not crash upon a failure).
Common techniques:

Replication
Master-Worker
ABFT (Algorithmic-Based Fault Tolerance)

Or composite techniques

e.g. Replication-enhanced checkpointing [4]

[4] Ni, Xiang, et al. "ACR: Automatic checkpoint/restart for soft and hard error protection." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 63 / 90

slide-64
SLIDE 64

Fault Tolerance

Rollback Recovery

Checkpoint/Restart

The most widely used fault tolerance mechanism in HPC systems.
Requires saving the application state periodically on reliable storage.
Upon a failure, the application restarts from the last consistent checkpoint.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 64 / 90

slide-65
SLIDE 65

Fault Tolerance

Checkpointing Classifications

Coordinated:

Collective checkpointing
All processes restart
Suitable for synchronized computations

Uncoordinated:

Processes checkpoint independently
Only the failed process restarts
Suitable for loosely coupled processes
Often requires message logging
Vulnerable to the domino effect

[5] Elnozahy, Elmootazbellah Nabil, et al. "A survey of rollback-recovery protocols in message-passing systems." ACM Computing Surveys (CSUR) 34.3 (2002): 375-408.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 65 / 90

slide-66
SLIDE 66

Fault Tolerance

Checkpointing Classifications (Cont.)

Disk-based:

I/O intensive
Applicable to all runtime systems

Diskless:

Replaces disk with in-memory replication
Fault-aware systems only
More replicas → more reliability, and higher failure-free overhead

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 66 / 90

slide-67
SLIDE 67

Fault Tolerance

Checkpointing MPI Applications

Coordinated disk-based checkpointing is the common mechanism for fault tolerance on HPC platforms.

Provided transparently in some MPI implementations (e.g. Intel-MPI: mpirun -chkpoint-interval 100sec -np 100 ./MyApp)

Provided outside of MPI by tools that dump the process image to disk, like:

BLCR: Berkeley Lab Checkpoint/Restart for Linux
DMTCP: Distributed MultiThreaded CheckPointing

Or done manually by programmers using file system APIs.
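A minimal sketch of the manual approach, assuming a per-rank binary file and a simple iteration-plus-array state (the file names, layout and function names are illustrative, not from the slides):

  #include <mpi.h>
  #include <stdio.h>

  /* Sketch only: per-rank checkpoint of a local array to disk. */
  void write_ckpt(int rank, int iter, const double *state, int n) {
      char fname[64];
      snprintf(fname, sizeof(fname), "ckpt_rank%d.dat", rank);
      FILE *f = fopen(fname, "wb");
      fwrite(&iter, sizeof(int), 1, f);      /* which iteration we saved */
      fwrite(state, sizeof(double), n, f);   /* the local data itself */
      fclose(f);
  }

  /* Returns the saved iteration, or 0 if no checkpoint exists. */
  int read_ckpt(int rank, double *state, int n) {
      char fname[64];
      int iter = 0;
      snprintf(fname, sizeof(fname), "ckpt_rank%d.dat", rank);
      FILE *f = fopen(fname, "rb");
      if (f == NULL) return 0;
      if (fread(&iter, sizeof(int), 1, f) != 1) iter = 0;
      fread(state, sizeof(double), n, f);
      fclose(f);
      return iter;
  }

A coordinated scheme would additionally synchronise the ranks (e.g. with a barrier, or MPI-IO collectives) so that all files describe the same iteration.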

Diskless checkpointing is only applicable to fault-aware MPI implementations (like MPI-ULFM, which we cover in the last part of this lecture)

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 67 / 90

slide-68
SLIDE 68

Fault Tolerance

Checkpoint Interval

The checkpoint interval has a crucial impact on performance:

A long interval → fewer checkpoints, and more lost work upon a failure.
A short interval → more checkpoints, and less lost work upon a failure.

Young's formula [6] is often used to compute the optimal checkpoint interval i as sqrt(2 * t * MTTI), where t is the checkpointing time.

The effective application utilization (u) of a system can be computed as [7]:

  u = 1 - (lost utilization for recovery + lost utilization for checkpointing)
  u = 1 - ( (i/2) * (1/MTTI) + t * (1/i) )
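A small numeric sketch of the two formulas above; the MTTI and checkpoint-cost values are made-up inputs chosen only to show the arithmetic:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double mtti = 24.0 * 3600.0;   /* assumed MTTI: 24 hours, in seconds */
      double t    = 10.0 * 60.0;     /* assumed checkpoint cost: 10 minutes */

      /* Young's formula for the optimal checkpoint interval */
      double i = sqrt(2.0 * t * mtti);

      /* effective utilization: time not lost to recovery or checkpointing */
      double u = 1.0 - ((i / 2.0) * (1.0 / mtti) + t * (1.0 / i));

      printf("optimal interval: %.0f s (%.1f min)\n", i, i / 60.0);
      printf("effective utilization: %.1f%%\n", 100.0 * u);
      return 0;
  }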

[6] Young, John W. "A first order approximation to the optimum checkpoint interval." Communications of the ACM 17.9 (1974)
[7] Schroeder, Bianca, and Garth A. Gibson. "Understanding failures in petascale computers." Journal of Physics. 78.1 (2007).

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 68 / 90

slide-69
SLIDE 69

Fault Tolerance

Projected System Utilization with C/R

Effective application utilization over time Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 69 / 90

slide-70
SLIDE 70

Fault Tolerance

Forward Recovery

Replication

Executes one or more replicas of each process on independent nodes. When a replica is lost, another replica takes over without rollback.

Replication in message passing systems: the message ordering challenge

Used as a detection and correction mechanism for silent data corruption errors.

Despite its expensive resource requirements, recent studies [8,9] suggest that replication can be a viable alternative to checkpointing on extreme-scale systems with short MTTI.

[8] Ropars, Thomas, et al. "Efficient Process Replication for MPI Applications: Sharing Work between Replicas." IPDPS. 2015.
[9] Ferreira, Kurt, et al. "Evaluating the viability of process replication reliability for exascale systems." SC. 2011.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 70 / 90

slide-71
SLIDE 71

Fault Tolerance

Forward Recovery

Master-Worker

Worker failure: can be tolerated without rollback by assigning the tasks of the failed worker to another worker.

Master failure: can be tolerated using replication or checkpointing. Because the probability of master failure is constant (it does not depend on the scale of the application), it is often more efficient to treat a master failure as a fatal error.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 71 / 90

slide-72
SLIDE 72

Fault Tolerance

Forward Recovery

Algorithmic-Based Fault Tolerance

The design of custom recovery mechanisms based on expert knowledge of special algorithm properties (e.g. available data redundancy, the ability to approximate lost data from remaining data, ...).

For example: using redundant data to recover lost sub-grids in PDE solvers that use the Sparse Grid Combination Technique (SGCT) [10]:

(Image: Ali, et al. 2016)

[10] Ali, Md Mohsin, et al. "Complex scientific applications made fault-tolerant with the sparse grid combination technique." IJHPCA 30.3 (2016): 335-359.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 72 / 90

slide-73
SLIDE 73

Fault Tolerance

MPI and Fault Tolerance

The MPI standard does not specify the behaviour of MPI when ranks fail. Most implementations terminate the application as a result of a rank failure.

Users rely on coordinated disk-based checkpointing because it does not require fault tolerance support from MPI. However, the time to checkpoint an application with a large memory footprint can exceed the MTTI on large systems, making coordinated disk-based checkpointing inapplicable at large scale.

User-level fault tolerance techniques can deliver better performance; however, they require fault tolerance support from MPI.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 73 / 90

slide-74
SLIDE 74

Fault Tolerance

MPI and Fault Tolerance (Cont.)

MPI User Level Failure Mitigation

A proposal by the MPI Forum's Fault Tolerance Working Group to add fault tolerance semantics to MPI.
Under assessment by the MPI Forum to be part of the coming MPI-4 standard.
A reference implementation of ULFM is available, based on Open MPI 1.7.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 74 / 90

slide-75
SLIDE 75

Fault Tolerance

MPI User Level Failure Mitigation

We cover the following aspects of MPI-ULFM:

Error Handling
Failure Notification
Failure Mitigation

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 75 / 90

slide-76
SLIDE 76

Fault Tolerance

Error Handling (1/2)

In standard MPI:

most MPI interfaces return an error code (e.g. 0=MPI_SUCCESS, 1=MPI_ERR_BUFFER, ...)
we can set an error handler on a communicator using MPI_Comm_set_errhandler
there are two predefined error handlers:

MPI_ERRORS_ARE_FATAL: terminates MPI (the default).
MPI_ERRORS_RETURN: returns an error code to the caller.

The user can also define a customized error handler, as follows:

  /* User's error handling function */
  void errorCallback(MPI_Comm* comm, int* errCode, ...) { }

  /* Changing the communicator's error handler */
  MPI_Errhandler handler;
  MPI_Comm_create_errhandler(errorCallback, &handler);
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 76 / 90

slide-77
SLIDE 77

Fault Tolerance

Error Handling (2/2)

ULFM uses the same error handling mechanism as standard MPI. It adds new error codes to report process failure events:

54=MPI_ERR_PROC_FAILED
55=MPI_ERR_PROC_FAILED_PENDING
56=MPI_ERR_REVOKED

The default error handler MPI_ERRORS_ARE_FATAL must not be used.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 77 / 90

slide-78
SLIDE 78

Fault Tolerance

Failure Notification (1/3)

Process failure errors are raised only in MPI operations that involve a failed rank.

Point-to-point operations

Using a named rank:
Using MPI_ANY_SOURCE:
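A hedged sketch of the two cases (the communicator, tag and buffers are assumptions; ULFM names follow this lecture's spelling, while real implementations typically prefix them with MPIX_, and the error codes are only returned if the communicator's error handler does not abort):

  #include <mpi.h>

  /* Sketch only: how point-to-point calls report a peer failure under ULFM. */
  void p2p_failure_notification(MPI_Comm comm) {
      int data, rc;
      MPI_Status status;

      /* Using a named rank: the receive fails only if rank 3 is dead. */
      rc = MPI_Recv(&data, 1, MPI_INT, 3, /* tag = */ 0, comm, &status);
      if (rc == MPI_ERR_PROC_FAILED) {
          /* rank 3 has failed; mitigate (e.g. revoke/shrink) */
      }

      /* Using MPI_ANY_SOURCE: any failure in comm may be reported,
       * because any rank could have been the matching sender. */
      MPI_Request req;
      MPI_Irecv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &req);
      rc = MPI_Wait(&req, &status);
      if (rc == MPI_ERR_PROC_FAILED_PENDING) {
          /* a potential sender failed; the request is still pending and
           * can be waited on again after acknowledging the failure */
          MPI_Comm_failure_ack(comm);
          rc = MPI_Wait(&req, &status);
      }
  }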

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 78 / 90

slide-79
SLIDE 79

Fault Tolerance

Failure Notification (2/3)

Process failure errors are raised only in MPI operations that involve a failed rank.

Collective operations: some live processes may raise an error, while

  • thers return successfully.

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 79 / 90

slide-80
SLIDE 80

Fault Tolerance

Failure Notification (3/3)

Process failure errors are raised only in MPI operations that involve a failed rank.

Non-blocking operations: error reporting is postponed to the corresponding completion function (e.g. MPI Wait, MPI Test).

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 80 / 90

slide-81
SLIDE 81

Fault Tolerance

Failure Mitigation Interfaces (1/2)

MPI_Comm_failure_ack( comm )

a local operation that acknowledges all detected failures on the communicator. Its purpose is to silence process failure errors in future MPI_ANY_SOURCE calls that involve an acknowledged process failure.

MPI_Comm_failure_get_acked( comm, failedgrp )

returns the group of failed ranks that were already acknowledged
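A brief sketch of how the two calls might be combined (again using this lecture's unprefixed names; implementations typically use an MPIX_ prefix, and the helper function is an assumption):

  #include <mpi.h>
  #include <stdio.h>

  /* Sketch only: acknowledge failures and list the dead ranks. */
  void report_acknowledged_failures(MPI_Comm comm) {
      MPI_Group failed;
      int nfailed;

      /* acknowledge everything detected so far, so MPI_ANY_SOURCE
       * operations stop raising errors for these failures */
      MPI_Comm_failure_ack(comm);

      /* retrieve the group of failures we just acknowledged */
      MPI_Comm_failure_get_acked(comm, &failed);
      MPI_Group_size(failed, &nfailed);
      printf("%d process(es) known to have failed\n", nfailed);

      MPI_Group_free(&failed);
  }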

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 81 / 90

slide-82
SLIDE 82

Fault Tolerance

Failure Mitigation Interfaces (2/2)

MPI_Comm_revoke( comm )

a local operation that invalidates the communicator
any future communication on a revoked communicator fails with error MPI_ERR_REVOKED
live ranks must collectively create a new communicator

MPI_Comm_shrink( oldcomm, newcomm )

a collective operation that creates a new communicator that excludes the dead ranks in the old communicator
like other collectives, it may succeed at some ranks and fail at others

MPI_Comm_agree( oldcomm, flag )

a collective operation for participants to agree on some value
the only collective operation that returns the same result to all participants

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 82 / 90

slide-83
SLIDE 83

Fault Tolerance

Resilient Iterative Application Skeleton

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 83 / 90

slide-84
SLIDE 84

Fault Tolerance

  #define CKPT_INTERVAL 10   /* the checkpointing interval */
  #define MAX_ITER 100       /* maximum no. of iters */

  MPI_Comm world;            /* the working communicator */
  int nprocs;                /* communicator size */
  int rank;                  /* my rank */
  bool restart;              /* restart flag */

  void compute();            /* executes the iterative computation,
                              * and orchestrates checkpoint/restart */

  int runIter(int i);        /* runs a single iteration,
                              * and returns the MPI error code
                              * of the last MPI call */

  void shrinkWorld();        /* shrinks a failed communicator,
                              * and sets the new rank and nprocs */

  void errorCallback(MPI_Comm* comm, int* rc, ...);
                             /* the communicator's error handler */

  void writeCkpt();          /* creates a new checkpoint */

  int readCkpt();            /* loads the last checkpoint, and
                              * returns the corresponding iteration */
Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 84 / 90

slide-85
SLIDE 85

Fault Tolerance

  void main(int argc, char* argv[]) {
      MPI_Init(&argc, &argv);

      /* the initial world state */
      world = MPI_COMM_WORLD;
      MPI_Comm_rank(world, &rank);
      MPI_Comm_size(world, &nprocs);

      /* setting the error handler */
      MPI_Errhandler errHandler;
      MPI_Comm_create_errhandler(errorCallback, &errHandler);
      MPI_Comm_set_errhandler(world, errHandler);

      compute();

      MPI_Finalize();
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 85 / 90

slide-86
SLIDE 86

Fault Tolerance

  /* orchestrates the iterative processing and C/R */
  void compute() {
      int rc;              /* holds MPI return codes */
      restart = false;     /* set to true only in errorCallback() */
      int i = 0;           /* current iteration number */
      do {
          if (restart) {
              i = readCkpt();
              rc = MPI_Comm_agree(world, &i);
              if (rc != MPI_SUCCESS)
                  continue;
              restart = false;
          }
          while (i < MAX_ITER) {
              rc = runIter(i);
              if (rc != MPI_SUCCESS)
                  break;   /* jump to the outer loop to restart */

              if (i % CKPT_INTERVAL == 0)
                  writeCkpt();

              i++;
          }
      } while (restart || i < MAX_ITER);
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 86 / 90

slide-87
SLIDE 87

Fault Tolerance

  /* a callback function to handle MPI errors */
  void errorCallback(MPI_Comm* comm, int* errCode, ...) {
      if (*errCode != MPI_ERR_PROC_FAILED &&
          *errCode != MPI_ERR_PROC_FAILED_PENDING &&
          *errCode != MPI_ERR_COMM_REVOKED) {
          /* We only tolerate process failure errors */
          MPI_Abort(*comm, -1);
      }

      /* acknowledge the detected failures */
      MPI_Comm_failure_ack(*comm);

      if (*errCode != MPI_ERR_COMM_REVOKED) {
          /* propagate the failure to other ranks */
          MPI_Comm_revoke(*comm);
      }

      /* all live ranks must reach this point
         to collectively shrink the communicator */
      shrinkWorld();

      restart = true;
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 87 / 90

slide-88
SLIDE 88

Fault Tolerance

  /* Creates a new communicator for the application,
     excluding dead ranks in the old (revoked) communicator */
  void shrinkWorld() {
      int rc;              /* shrink return code */
      MPI_Comm newComm;
      do {
          rc = MPI_Comm_shrink(world, &newComm);
          MPI_Comm_agree(newComm, &rc);
      } while (rc != MPI_SUCCESS);

      /* update the communicator */
      world = newComm;

      /* update my rank and nprocs */
      MPI_Comm_rank(world, &rank);
      MPI_Comm_size(world, &nprocs);
  }

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 88 / 90

slide-89
SLIDE 89

Fault Tolerance

Fault Tolerance: Summary

Topics covered today:

the reducing reliability of HPC systems as they grow larger
fault tolerance techniques (C/R, Replication, Master-Worker, ABFT)
the MPI-ULFM proposal for adding fault tolerance support to MPI

Acknowledgement:

The fault tolerance part of today’s lecture is influenced by materials from SC’16 Tutorial ’Fault Tolerance for HPC: Theory and Practice’

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 89 / 90

slide-90
SLIDE 90

Fault Tolerance

Hands-on Exercise: Checkpointing and ULFM

Computer Systems (ANU) PGAS Paradigm 02 Nov 2017 90 / 90