
Daniel Torres Fault-tolerant matrix factorisation: a formal model and proof 1/26

Fault-tolerant matrix factorisation: a formal model and proof

Camille Coti, Laure Petrucci, Daniel Alberto Torres González

Laboratoire d’Informatique de Paris Nord, CNRS UMR 7030, Université Paris 13, Sorbonne Paris Cité
99, avenue Jean-Baptiste Clément, F-93430 Villetaneuse, FRANCE
camille.coti@lipn.univ-paris13.fr
laure.petrucci@lipn.univ-paris13.fr
daniel.torres@lipn.univ-paris13.fr

April 6, 2019


Content

1 Motivation
2 Introduction
  • High Performance Computing
  • Fault Tolerance
  • Formal Models
3 The FT-TSQR algorithm
  • TSQR
  • FT-TSQR
4 Model
5 Properties
6 Conclusion and perspectives


Motivation

Matrix operations
  • Addition, transposition, matrix multiplication
  • Row operations, submatrix
  • Diagonal matrix, triangular matrix, identity matrix, orthogonal matrix
  • Determinant, eigenvalues, eigenvectors
  • Decompositions: QR, LU, Cholesky

TSQR: iterative methods use it
  • Linear systems with multiple right-hand sides
  • Block iterative eigensolvers
  • s-step Krylov methods


Fault tolerance in HPC

System-level
  • Transparent for the application
  • Specific middleware ensures a coherent state of the application

Application-level
  • The application itself handles the failures and adapts to them
  • The middleware must be robust enough to provide primitives


High Performance Systems

Platforms at large scale have their own technical challenges
  • The total number of hardware and software components grows exponentially
  • Platforms are needed to manage and handle complex computational problems
  • Hardware or software failures may occur at any time during the execution of highly parallel applications

System reliability, availability and scalability are factors to deal with
  • Failures may result in long execution times and high cost


Top500.org

Top500
  • A statistical list with ranks and details of the world’s 500 most powerful supercomputers
  • It shows that performance has almost doubled each year

Rank  System             Cores       Rmax (TFlop/s)  Rpeak (TFlop/s)  Power (kW)
1     Summit             2,397,824   143,500.0       200,794.9        9,783
2     Sierra             1,572,480   94,640.0        125,712.0        7,438
3     Sunway TaihuLight  10,649,600  93,014.6        125,435.9        15,371
4     Tianhe-2A          4,981,760   61,444.5        100,678.7        18,482
5     Piz Daint          387,872     21,230.0        27,154.3         2,384

System details
  • Summit: IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband; IBM; DOE/SC/Oak Ridge National Laboratory, United States
  • Sierra: IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband; IBM / NVIDIA / Mellanox; DOE/NNSA/LLNL, United States
  • Sunway TaihuLight: Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway; NRCPC; National Supercomputing Center in Wuxi, China
  • Tianhe-2A: TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000; NUDT; National Super Computer Center in Guangzhou, China
  • Piz Daint: Cray XC50, Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, NVIDIA Tesla P100; Cray Inc.; Swiss National Supercomputing Centre (CSCS), Switzerland


#1 Cores Number

[Figure: number of cores of the #1 system, by year (2000 to 2017). x-axis: Year; y-axis: # Cores, from 2×10^6 to 1.2×10^7.]


Failures

Failures in High Performance Systems
  • Node increase in HPC ⇒ platforms more subject to failures

Mean Time Between Failures (MTBF): a measure of system reliability
  • Reliability is defined as the probability that the system performs without deviations from agreed-upon behavior for a specific period of time
  • For a system of n components, where MTBF_i is the MTBF of component i:

\[ \mathrm{MTBF}_T = \left( \sum_{i=0}^{n-1} \frac{1}{\mathrm{MTBF}_i} \right)^{-1} \]

Failures can arise at any time; a failure either
  • stops the execution partially or totally (crash-type failures), or
  • provides incorrect results (bit errors)

With an increase in the number of components, the system will experience a component failure every few hours or even minutes
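
The MTBF formula for the whole system can be evaluated directly. The sketch below is illustrative only: the function name `system_mtbf` and the component values (identical components with an MTBF of 100 000 hours each) are assumptions chosen for the example, not figures from the talk.

```python
def system_mtbf(component_mtbfs):
    """Return MTBF_T = (sum_i 1 / MTBF_i)^-1, in the same unit as the inputs."""
    if not component_mtbfs:
        raise ValueError("at least one component is required")
    return 1.0 / sum(1.0 / m for m in component_mtbfs)


# Illustrative values only: n identical components, each with an MTBF of 100 000 hours.
for n in (1000, 5000):
    print(n, round(system_mtbf([100_000.0] * n), 6))
```

With n identical components the formula reduces to MTBF/n, so under these assumed values the system MTBF drops to about 100 hours for 1000 components and about 20 hours for 5000.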


MTBF

[Figure: Mean Time Between Failures of the system (hours, logarithmic axis from 1 to 1e+06) against the number of components in the system (1000 to 5000); curves labelled 10 000 H, 100 000 H, 1 000 000 H and 10 000 000 H.]


Fault tolerance

Challenges in HPC
  • HPC algorithms should be designed to:
      • expect failures: it is very difficult to predict all possible failures
      • take suitable actions: ensure that intensive applications run smoothly with reduced overhead

Fault tolerant solutions are being incorporated
  • Have the ability to contain failures
  • Minimize the impact of failures

Provide a fault tolerant environment
  • Enhance the utilization of the system at high scale
  • Ensure the failure-free execution of critical algorithms

Hard to describe and verify the system’s properties: how to simplify it?


Formal models

Coloured Petri Nets (CPN)
  • Better understanding of the system
  • Ensure mathematically that the system is correct
  • Modelling, validating properties and synchronizing communications of parallel and distributed algorithms
  • Allow for better readability and understandability

FT-TSQR Formal Model
  • Formal model and associated verifications
  • Proves that the algorithm tolerates failures
  • Guarantees that the final results are correct
      • Data flow is correct
      • Each process calculates and shares its results
      • At the end, all the processes have the same result


Los Tres Amigos

QR factorization: A = QR

  • R upper triangular
  • Q orthogonal

LU decomposition: A = LU

  • L lower triangular
  • U upper triangular

Cholesky factorization: A = LL^T

  • A symmetric, positive definite
  • L lower triangular

Tall and Skinny QR

Tall and Skinny QR (TSQR) factorisation
  • It calculates the QR factorisation of a tall and skinny matrix A, i.e. a matrix with m rows and n columns, m ≫ n
  • Linear algebra applications depend on the algorithm:
      • Ax = b
      • numerically stable: eigenvalue computation is sensitive to the accuracy of the orthogonalization
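
The idea behind TSQR can be checked numerically: factor each row block, stack the resulting R factors, and factor again until a single R remains. The sketch below is a minimal pure-Python illustration, not the implementation discussed in the talk. The helper names `qr_r_factor` and `tsqr_r_factor` are hypothetical, modified Gram-Schmidt is used only because it is short (production codes typically rely on Householder-based routines), and the pairwise reduction assumes the number of blocks is a power of two.

```python
import math
import random


def qr_r_factor(rows):
    """Return the upper-triangular R of a thin QR (modified Gram-Schmidt).

    `rows` is a list of rows; the diagonal of R is positive, which makes R
    unique for a full-rank matrix.
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]  # work column-wise
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q_j = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q_j, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q_j, cols[k])]
    return r


def tsqr_r_factor(blocks):
    """Reduce the R factors of the row blocks pairwise until one R remains."""
    rs = [qr_r_factor(block) for block in blocks]
    while len(rs) > 1:
        # Stack two n-by-n R factors into a 2n-by-n matrix and factor it again.
        # This pairing assumes the number of blocks is a power of two.
        rs = [qr_r_factor(rs[i] + rs[i + 1]) for i in range(0, len(rs), 2)]
    return rs[0]


random.seed(0)
m, n, t = 64, 3, 4  # tall and skinny: m >> n, split over t row blocks
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
blocks = [a[i * m // t:(i + 1) * m // t] for i in range(t)]

r_direct = qr_r_factor(a)
r_tsqr = tsqr_r_factor(blocks)
max_diff = max(abs(r_direct[i][j] - r_tsqr[i][j]) for i in range(n) for j in range(n))
print(max_diff)
```

Because both R factors have a positive diagonal and satisfy R^T R = A^T A, they should agree up to rounding error, so the printed difference is expected to be tiny.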


Fault-Tolerant TSQR

t processes

          P0            P1            P2            P3
          A0            A1            A2            A3
QR        Q0 R0         Q1 R1         Q2 R2         Q3 R3
          R0 R1         R0 R1         R2 R3         R2 R3
QR        Q01 R01       Q01 R01       Q23 R23       Q23 R23
          R01 R23       R01 R23       R01 R23       R01 R23
QR        Q0123 R0123   Q0123 R0123   Q0123 R0123   Q0123 R0123

Data redundancy: if a process fails, there exists another process
  • doing the same operation
  • holding the same data
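
The redundancy shown in the diagram can be reproduced symbolically by tracking which input blocks each process has combined. This is a sketch under stated assumptions: the function name `redundancy_per_step` is hypothetical, t is taken to be a power of two, and the partner of process q at step s is assumed to be q XOR 2^s (a butterfly exchange), which is one pairing consistent with the four-process diagram but is not stated in the slides.

```python
def redundancy_per_step(t):
    """Track which input blocks each of t processes has combined so far.

    Assumes t is a power of two and that the partner of q at step s is
    q XOR 2**s (a butterfly exchange; a hypothetical choice of partner rule).
    Returns, for each step, the number of processes holding each distinct result.
    """
    held = [frozenset([q]) for q in range(t)]  # process q starts with block A_q
    history = []
    s = 0
    while (1 << s) < t:
        partner = [q ^ (1 << s) for q in range(t)]
        # Exchange the R factors, then factor the concatenation locally.
        held = [held[q] | held[partner[q]] for q in range(t)]
        copies = {}
        for blocks in held:
            copies[blocks] = copies.get(blocks, 0) + 1
        history.append(sorted(copies.values()))
        s += 1
    return history


print(redundancy_per_step(4))  # prints [[2, 2], [4]]
```

For four processes, each intermediate result exists in two copies after the first exchange and in four copies after the second, which is the redundancy the algorithm relies on.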


Fault-Tolerant TSQR: failure

          P0            P1            P2            P3
          A0            A1            A2            A3
QR        Q0 R0         Q1 R1         Q2 R2         Q3 R3
          R0 R1         R0 R1         R2 R3         R2 R3
QR        Q01 R01       Q01 R01       Q23 R23       Q23 R23
          R01 R23       R01 R23       R01 R23       R01 R23
X         one process holding R01 R23 fails
QR        the three remaining processes each hold Q0123 R0123


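Fault-Tolerant TSQR: failure

          P0            P1            P2            P3
          A0            A1            A2            A3
QR        Q0 R0         Q1 R1         Q2 R2         Q3 R3
          R0 R1         R0 R1         R2 R3         R2 R3
X         one of the two processes holding R2 R3 fails
QR        the remaining processes hold Q01 R01, Q01 R01 and Q23 R23
X         one process holding Q01 R01 leaves the computation
          the two remaining processes exchange R01 and R23
QR        both hold Q0123 R0123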


Fault-Tolerant TSQR: failure

          P0            P1            P2            P3
          A0            A1            A2 X          A3
QR        Q0 R0         Q1 R1                       Q3 R3 X
          R0 R1         R0 R1
QR        Q01 R01 X     Q01 R01 X

X marks a process that leaves the computation; here no process reaches the final step.


The algorithm

Algorithm 1: FT-TSQR

Data: Submatrix A

Q̃, R̃ = QR(A);
s = 0;
while !done() do
    p_i = myPartner(s);
    f = sendRecv(R̃, R̃′, p_i);
    if FAIL == f then
        return;
    A = concatenate(R̃, R̃′);
    Q̃, R̃ = QR(A);
    s = s + 1;
/* All the surviving processes reach this point and own the final R̃ */
return R̃;
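
The control flow of Algorithm 1 can be exercised with a small symbolic simulation. This is a sketch under assumptions, not the authors' code: `ft_tsqr_simulation` and its `crashes` argument are hypothetical, R factors are represented only by the set of block indices they combine, t is a power of two, crashes happen just before an exchange, and `myPartner(s)` is assumed to return q XOR 2^s.

```python
def ft_tsqr_simulation(t, crashes):
    """Simulate Algorithm 1 symbolically for t processes (t a power of two).

    `crashes` maps a step number to the set of processes that crash just
    before the exchange of that step. R factors are represented by the set of
    block indices they combine. The partner rule q XOR 2**s is an assumption.
    Returns {process: blocks} for the processes that reach the final return.
    """
    r = {q: frozenset([q]) for q in range(t)}  # after the local QR(A_q)
    s = 0
    while (1 << s) < t:  # plays the role of "while !done()"
        for q in crashes.get(s, ()):
            r.pop(q, None)  # crash-type failure: the process disappears
        alive = set(r)
        nxt = {}
        for q in alive:
            p = q ^ (1 << s)  # assumed myPartner(s)
            if p not in alive:
                continue  # sendRecv returned FAIL: this process returns early
            nxt[q] = r[q] | r[p]  # concatenate the two R factors, then QR
        r = nxt
        s += 1
    return r


everything = frozenset(range(4))
print(ft_tsqr_simulation(4, {}) == {q: everything for q in range(4)})
print(sorted(ft_tsqr_simulation(4, {1: {2}})))  # P2 crashes before the second exchange
print(ft_tsqr_simulation(4, {0: {2}}))          # P2 crashes before the first exchange
```

In this model a crash of P2 before the second exchange still leaves two processes with the final result, while a crash of P2 before the first exchange leaves none, because no other process yet holds a copy of its data.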


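The model

Place Processes, of type PROC×INT×PROC
  • initial marking: \( \sum_{0 \le q < t} (q, 0, q) \)
  • contains triples (q, s, k)
      • q: a process number
      • s: the current step
      • k: index of the R̃_i matrix

Transition compute
  • guard: \( [\, q + 2^{s} - q \bmod 2^{s-1} \le q' < q + 2^{s+1} - q \bmod 2^{s-1} \;\wedge\; k'' = \min(k, k') \,] \)
  • finds a partner process q′
  • executes a step of the algorithm
  • arc inscriptions: (q, s, k) and (q′, s, k′) on the input arcs; (q, s + 1, k′′) and (q′, s + 1, k′′) on the output arcs

Place MaxFail, of type INT×INT
  • initial marking: \( \sum_{0 \le s \le \lceil \log_2 t \rceil} (s, 2^{s} - 1) \)
  • contains pairs (s, f)
      • s: the current step
      • f: number of failures still allowed at step s
  • limits the number of occurrences of transition failure

Transition failure
  • guard: \( [\, f > 0 \,] \)
  • decreases the number of allowed failures
  • arc inscriptions: (q, s, k) and (s, f) on the input arcs; (s, f − 1) on the output arc

Transition nop
  • guard: \( [\, q + 2^{s} \ge t \,] \)
  • moves a process to the next step
  • arc inscriptions: (q, s, k) on the input arc; (q, s + 1, k) on the output arc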
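
Part of the model can be mimicked in a few lines to see how the MaxFail place limits the failure transition. This is only an illustrative sketch: the function names are hypothetical, the Processes marking is stored as a multiset of triples, MaxFail is stored as one budget per step (assuming a single (s, f) token per step), and transition compute is left out because its guard depends on the partner relation.

```python
import math
from collections import Counter


def initial_marking(t):
    """Initial tokens: (q, 0, q) in Processes, (s, 2**s - 1) in MaxFail."""
    steps = math.ceil(math.log2(t))
    processes = Counter((q, 0, q) for q in range(t))
    max_fail = {s: 2 ** s - 1 for s in range(steps + 1)}  # pairs (s, f)
    return processes, max_fail


def fire_failure(processes, max_fail, token):
    """Transition failure, guard [f > 0]: consume a process token and one unit of budget."""
    _, s, _ = token
    if processes[token] == 0 or max_fail.get(s, 0) <= 0:
        return False  # not enabled
    processes[token] -= 1
    max_fail[s] -= 1
    return True


def fire_nop(processes, t, token):
    """Transition nop, guard [q + 2**s >= t]: move the process to step s + 1."""
    q, s, k = token
    if processes[token] == 0 or q + 2 ** s < t:
        return False  # not enabled
    processes[token] -= 1
    processes[(q, s + 1, k)] += 1
    return True


processes, max_fail = initial_marking(4)
print(max_fail)                                      # failure budget per step
print(fire_failure(processes, max_fail, (2, 0, 2)))  # budget at step 0 is 2**0 - 1 = 0
```

With the initial marking as given, the budget at step 0 is zero, so the failure transition is not enabled there; the budget grows with the step number as more copies of each intermediate result exist.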


Properties

  • The system can reach the end of the computation (prop. 1)
  • The final result is unique and is the expected one (prop. 2)

Projection functions
  • \( \Pi_x \): selects the x-th element of a token which has a tuple value
  • \( \Pi_{x,y} \): selects the x-th and y-th elements to form a pair
  • \( \Pi^{s}_{x} \) denotes the value of \( \Pi_x \) when the step number is s


Property 1
At every step s > 0, the system can tolerate at most \( 2^{s} - 1 \) failures

Proof.
  • When s = 0, each process performs a local computation
  • When s > 0, either transition compute takes the R̃ and R̃′ from two processes q and q′ and produces R̃′′ on both q and q′, or transition failure consumes a process
  • By induction, at each step s > 0, \( \forall M \in [M_0\rangle \), the following invariant holds:

\[ |\Pi^{s}_{3}(M(\mathit{Processes}))| + \Pi^{s}_{2}(M(\mathit{MaxFail})) = 2^{s} \]

  • At each step, the number of processes with the same information increases by a factor of 2
  • The guard on transition failure ensures:

\[ 0 \le \Pi^{s}_{2}(M(\mathit{MaxFail})) \le 2^{s} - 1 \]

  • Hence \( 1 \le |\Pi^{s}_{3}(M(\mathit{Processes}))| \le 2^{s} \): at least one process holds each intermediate R̃


Property 2
At the end of the computation, if the system did not suffer too many failures, at least one process holds the final R

Proof.
  • From property 1, when s > 0: \( |\Pi^{s}_{3}(M(\mathit{Processes}))| \ge 1 \)
  • For each R̃:

\[ |\Pi^{s}_{3}(M(\mathit{Processes}))| + \Pi^{s}_{2}(M(\mathit{MaxFail})) = 2^{s}, \qquad 0 \le \Pi^{s}_{2}(M(\mathit{MaxFail})) \le 2^{s} - 1 \]

  • When \( s = \log_2 t \), the algorithm reaches the final step:

\[ |\Pi^{s}_{3}(M(\mathit{Processes}))| + \Pi^{s}_{2}(M(\mathit{MaxFail})) = t \]

  • Then, all non-failed processes hold the same R̃ ⇒ R̃ is unique and is the final R


Conclusion and perspectives

Conclusions
  • A formal model for a fault tolerant algorithm
      • How the failures are modelled
      • Design of proofs of fault tolerance properties
  • Number of processes, size of the matrix: parameters of the model
      • The proof holds for any value of the parameters

Perspectives
  • How to derive a general modelling and verification approach for:
      • fault tolerant algorithms?
      • square matrices?
      • the Trailing Matrix Update?
  • Maximum number of errors allowed?
  • Improvements
      • choosing among more partners?
      • recovery after failure: handled by a spawned node, an inactive one?
