Fault Tolerant Linear Algebra: Goals and Methods. Julien Langou, University of Colorado Denver. (PowerPoint presentation transcript.)


SLIDE 1

Fault Tolerant Linear Algebra: Goals and Methods.

Julien Langou, University of Colorado Denver


SLIDE 2

0 - GOALS
  0.1 - ERASURE OR ERROR?
1 - METHODS
  1.1 - ERASURE: DISKLESS CHECKPOINTING AND ROLLBACK
  1.2 - ERASURE & ERROR: ABFT: ALGORITHM-BASED FAULT TOLERANCE
  1.3 - OTHERS
2 - ERROR: DETECTING/LOCATING/CORRECTING IN FLOATING-POINT ARITHMETIC
3 - NOVEL ABFT ALGORITHMS (GEMM, LU, QR, ETC.) (ERASURE OR ERROR)
4 - ABFT-BLAS LIBRARY
5 - ABFT-BLAS EXPERIMENTS (ERASURE)

Fault-tolerant Linear Algebra: Goals and Methods.


SLIDE 3

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 4

Goals

Perform reliable and efficient computation with unreliable units.

• Unreliable units: process crash, hardware failure, erroneous communication, erroneous computation, …
• Our method: at the algorithm level.
• Motivation: cost effective, large unit count.

SLIDE 5

Erasure Problem / Error Problem: 4 processors available; P1, P2, P3, P4 compute 1+1, 2+2, 3+3, 4+4, yielding 2, 4, 6, 8.


SLIDE 6

4 processors available (P1..P4 compute 1+1, 2+2, 3+3, 4+4).
Erasure Problem: lost processor 2.
Error Problem: processor 2 returns an incorrect result (5 instead of 4).


SLIDE 7

Erasure Problem (lost processor 2):
- we know whether there is an erasure or not.
Error Problem (processor 2 returns an incorrect result, 5 instead of 4):
- we do not know if there is an error.


SLIDE 8

Erasure Problem (lost processor 2):
- we know whether there is an erasure or not,
- we know where the erasure is.
Error Problem (processor 2 returns an incorrect result):
- we do not know if there is an error,
- even assuming we know that an error occurred, we do not know where it is.


SLIDE 9

Erasure Problem (lost processor 2):
- we know whether there is an erasure or not,
- we know where the erasure is,
- so we only need to recover.
Error Problem (processor 2 returns an incorrect result):
- we do not know if there is an error,
- even assuming we know that an error occurred, we do not know where it is,
- and we also need to recover.


SLIDE 10

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 11

Diskless checkpointing: 4 processors available (P1, P2, P3, P4).


SLIDE 12

Diskless checkpointing:
1. 4 processors available (P1..P4).
2. Add a 5th one (Pc) and perform a checksum (MPI_Reduce).
3. Ready for the computations. …


SLIDE 13

Diskless checkpointing:
1. 4 processors available (P1..P4).
2. Add a 5th one (Pc) and perform a checksum (MPI_Reduce).
3. Ready for the computations. …
4. Lost a processor.


SLIDE 14

Diskless checkpointing:
1. 4 processors available (P1..P4).
2. Add a 5th one (Pc) and perform a checksum (MPI_Reduce).
3. Ready for the computations. …
4. Lost a processor (P2).
5. Recover the processor (FT-MPI).
6. Recover the data (MPI_Reduce: the lost data is the checksum minus the surviving data, hence the '−' in the diagram).
7. Ready for the computations.


SLIDE 15

Diskless checkpointing (remarks)

• You can use either floating-point arithmetic or binary arithmetic for the checksum.
• Multiple failures/errors are supported through the Reed-Solomon algorithm, an optimal algorithm in the sense that, to support p simultaneous failures/errors, one only needs to add p processes.
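The checkpoint/recovery cycle of Slides 12-14 can be sketched in a few lines. This is a minimal single-process sketch of the arithmetic only (no MPI; the dictionary and all names are illustrative), with a plain sum playing the role of the checksum processor Pc:

```python
# Minimal sketch of diskless checkpointing with a floating-point checksum.
# Each "processor" holds a list of numbers; Pc holds their elementwise sum.

data = {
    "P1": [1.0, 5.0],
    "P2": [2.0, 6.0],
    "P3": [3.0, 7.0],
    "P4": [4.0, 8.0],
}

# Checkpoint: the checksum processor stores the elementwise sum
# (this is what the MPI_Reduce in the slides computes).
Pc = [sum(vals) for vals in zip(*data.values())]

# Erasure: processor P2 is lost.
lost = data.pop("P2")

# Recovery: lost data = checksum minus the surviving data
# (a second reduction in the real, distributed setting).
recovered = [c - sum(vals) for c, vals in zip(Pc, zip(*data.values()))]
assert recovered == [2.0, 6.0]
data["P2"] = recovered
```

Because the erasure is located (we know it was P2), a single subtraction restores the data; no rollback of the computation itself is needed.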


SLIDE 16

Time for an MPI_Reduce (using MVAPICH) on InfiniBand on jacquard.nersc.gov.
(Plot: time (sec), 0 to 2.5, versus number of processors, 64 to 256, for message sizes 7.6 MB, 30.5 MB, 68.7 MB, and 122.0 MB.)


SLIDE 17

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 18

ABFT = Algorithm-Based Fault Tolerance.

• K. Huang and J. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. on Computers (Special Issue on Reliable & Fault-Tolerant Computing), C-33, 1984, pp. 518-528.
• If checkpoints are performed in floating-point arithmetic, then we can exploit the linearity of the mathematical relations on the object to maintain the checksums.


SLIDE 19

ABFT concept in an example: we want to perform z = λx + μy.
(Diagram: Procs 1-4 hold x1..x4 and y1..y4 and will compute zi = λxi + μyi.)


SLIDE 20

We want to perform z = λx + μy.
(Diagram: the entries of x and of y held on Procs 1-4 are summed into checksums.)


SLIDE 21

We want to perform z = λx + μy.
(Diagram: a checksum processor, Proc c, now holds Xc = checkX and Yc = checkY.)


SLIDE 22

We want to perform z = λx + μy.
(Diagram: data x1..x4, y1..y4 on Procs 1-4; checksums Xc, Yc on Proc c.)


SLIDE 23

We want to perform z = λx + μy.
(Diagram: Procs 1-4 each compute zi = λxi + μyi.)


SLIDE 24

We want to perform z = λx + μy.
(Diagram: Proc c applies the same update to the checksums, Zc = λXc + μYc, yielding checkZ.)


SLIDE 25

We want to perform z = λx + μy.
(Diagram: data and checksums all updated by the same AXPY.)

No overhead to compute the checksum of Z.
Property used: (λx1+μy1) + (λx2+μy2) = λ(x1+x2) + μ(y1+y2); distributivity of external multiplication over internal addition, and associativity of internal addition.
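The "no overhead" property is exactly this linearity. A small sketch (plain Python, illustrative values) checks that the checksum processor's zc = λ·xc + μ·yc coincides with the checksum of the freshly computed z:

```python
# ABFT on z = lambda*x + mu*y: the checksum entry obeys the same AXPY
# update as the data entries, so Proc c stays consistent without ever
# touching the data held on Procs 1-4.
lam, mu = 2.0, -3.0
x = [1.0, 2.0, 3.0, 4.0]      # held by Procs 1..4
y = [5.0, 6.0, 7.0, 8.0]

xc, yc = sum(x), sum(y)        # checkX, checkY held by Proc c

z = [lam * xi + mu * yi for xi, yi in zip(x, y)]   # Procs 1..4
zc = lam * xc + mu * yc                            # Proc c: same AXPY

# Property used: sum_i (lam*xi + mu*yi) = lam*sum(x) + mu*sum(y).
assert abs(zc - sum(z)) < 1e-12
```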


SLIDE 26

ABFT summary.

• Relies on floating-point arithmetic checksums.
• Exploits the checksum processors.
• Algorithms exist for any linear operations:
  - AXPY, SCAL (BLAS1)
  - GEMV (BLAS2)
  - GEMM (BLAS3)
  - LU, QR, Cholesky (LAPACK)
  - FFT


SLIDE 27

Our contribution (1)

• Lack of generalization in the previous approach, which has restricted the number of algorithms used: Cholesky, QR through Gram-Schmidt, LU without pivoting.
• With an ABFT BLAS and the LAPACK algorithms, we have developed:
  - QR with Householder reflections,
  - LU with pivoting,
  - Hessenberg reduction.



SLIDE 28

Our contribution (2)

• If there is no error, then ABFT guarantees that the checksums of L and U are consistent at the end of the LU factorization.
• If there is an error, you can then detect it.
• However, you cannot correct it in the case where the error propagates.
• (Then why even use ABFT?)
• We have a light-weight mechanism ( O(n^2) ) to detect errors before they propagate.


SLIDE 29

Our contribution (3)

• Error correcting codes are known to be unstable in floating-point arithmetic.
• We have developed a stable error correcting code (although naïve and not optimal, it works for us and is efficient enough).


SLIDE 30

Our contributions

1. Generalize ABFT to ``all'' LAPACK algorithms.
2. Avoid error propagation.
3. A stable error correcting code in floating-point arithmetic.

=> Maybe ABFT might work after all!


SLIDE 31

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 32

Error DETECTION: residual checking

• To detect an error in

    C ← αA*B + βCin      (1)

• 1. Save Cin.
• 2. Perform C ← αA*B + βCin.
• 3. Take a random (vector) x and check that

    || Cx − ( α(A*(Bx)) + βCin x ) || < ε.

• 4. If the check is not good, start again from step 2.
• Works with almost anything (e.g. A = VDV^T).



 
 
 



SLIDE 33

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 34

Encoding: the vector x.
• Error detection.

SLIDE 35

Encoding: y = Gx.
• Error detection.

SLIDE 36

Encoding: the (possibly corrupted) vector x_, and the stored encoding y.


SLIDE 37

Encoding: y_ = G x_, compared against the stored y.
• Error detection.
• Note that G could have been 1-by-n (no need for 2-by-n).
• Note that correction is possible if the location is known.


SLIDE 38

Ge = s,  where s = y − y_.

• Solve Ge = s under the constraint that e has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.


SLIDE 39

Ge = s.

• Solve Ge = s under the constraint that e = x − x_ has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.


SLIDE 40

G = [ 3 1 3 1 0 1 7 9 ]
    [ 2 5 8 2 1 1 1 3 ],    Ge = s = [ 3 ; 3 ].

• Solve Ge = s under the constraint that e has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.

SLIDE 41

Same G and s: the solution is e = [ 0 0 0 0 0 3 0 0 ]^T. Column 6 of G is (1, 1)^T, collinear to s = (3, 3)^T with factor 3: location problem solved.
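The collinearity search can be written directly (numbers are the ones from this slide; for the location to be unambiguous, no two columns of G may be collinear, which is what the Reed-Solomon construction on the next slides guarantees):

```python
# Locating a single error: find the column of G collinear to the syndrome s.
# Here e has a single nonzero entry, 3, in position 6 (1-based), and
# column 6 of G is (1, 1), so s = (3, 3).

G = [[3, 1, 3, 1, 0, 1, 7, 9],
     [2, 5, 8, 2, 1, 1, 1, 3]]
s = [3, 3]

def locate(G, s):
    """Return (0-based column index, error magnitude), or None."""
    for j in range(len(G[0])):
        g = (G[0][j], G[1][j])
        # collinear iff the 2x2 cross product vanishes
        if g[0] * s[1] - g[1] * s[0] == 0:
            mag = s[0] / g[0] if g[0] else s[1] / g[1]
            return j, mag
    return None

j, mag = locate(G, s)
assert (j, mag) == (5, 3.0)   # position 6 (1-based), error value 3
```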


SLIDE 42

Reed-Solomon

G = [ 1 1 1 1 1 1 1 1 ]
    [ 1 2 3 4 5 6 7 8 ],    Ge = s = [ 2 ; 8 ].

• Solve Ge = s under the constraint that e has one nonzero entry.
• In other words, find the column of G collinear to s: location problem solved.


SLIDE 43

Reed-Solomon

Same G and s: the solution is e = [ 0 0 0 2 0 0 0 0 ]^T. Column 4 of G is (1, 4)^T, collinear to s = (2, 8)^T with factor 2: location problem solved.


SLIDE 44

G = [ 3 1 3 1 0 1 7 9 ]
    [ 2 5 8 2 1 1 1 3 ],    Ge = s, with two simultaneous errors.

• Solve Ge = s under the constraint that e has two nonzero entries.
• In other words, find the unique pair of columns of G that generates s. Complexity: choose(n, 2) = choose(8, 2) = 28 pairs to try.
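The pairwise search can be sketched as follows. Note one assumption: with only 2 checksum rows, many column pairs can reproduce a 2-entry syndrome, so this sketch uses a 4-row Vandermonde-style G (my illustrative choice, not the slide's exact matrix); with it, any four columns are independent and the explaining pair is unique.

```python
from itertools import combinations

# Two simultaneous errors: search the choose(8, 2) = 28 column pairs of a
# 4-row Vandermonde-style G for the unique pair that generates s = G e.
n = 8
G = [[j ** p for j in range(1, n + 1)] for p in range(4)]

e = [0.0] * n
e[3], e[6] = 2.0, -1.0                    # injected errors (illustrative)
s = [sum(G[i][j] * e[j] for j in range(n)) for i in range(4)]

def locate2(G, s):
    """Solve a 2x2 system for each column pair; keep the pair whose
    solution also matches the remaining checksum rows."""
    for j, k in combinations(range(len(G[0])), 2):
        a, b, c, d = G[0][j], G[0][k], G[1][j], G[1][k]
        det = a * d - b * c
        if det == 0:
            continue
        ej = (s[0] * d - s[1] * b) / det   # solve rows 0-1 for this pair
        ek = (s[1] * a - s[0] * c) / det
        # accept only if rows 2-3 agree as well
        if all(abs(G[i][j] * ej + G[i][k] * ek - s[i]) < 1e-9 for i in (2, 3)):
            return j, k, ej, ek
    return None

assert locate2(G, s) == (3, 6, 2.0, -1.0)
```

This brute-force pairwise search is exactly the choose(n, 2) cost the slide quotes; Reed-Solomon decoding avoids it, at the price of the stability issues discussed next.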


SLIDE 45

                 Stable   Recovery cost   Extra memory
Reed-Solomon     ✗        ✓               nberr    ✓
Random           ✓        ✗               nberr    ✓


SLIDE 46

Encoding: the vector x = [ 1 2 3 4 5 6 7 8 ].


SLIDE 47

Encoding. (Diagram: the vector x = [ 1 2 3 4 5 6 7 8 ], shown twice.)


SLIDE 48

Encoding. (Diagram: the vector x = [ 1 2 3 4 5 6 7 8 ] and its encoding y.)


SLIDE 49

                 Stable   Recovery cost   Extra memory
Reed-Solomon     ✗        ✓               nberr    ✓
Random           ✓        ✗               nberr    ✓
Coordinate       ✓        ✓               sqrt(n)  ✗


SLIDE 50

Timing for recovery. (Plot: time in sec versus n, the size of x.)


SLIDE 51

• Accuracy comparison after recovery.
• maxerr = 3
• nberr = 3
(Plot: max( |xi − yi| / |xi| ) versus n, the size of x.)


SLIDE 52

• Accuracy comparison after recovery.
• maxerr = 4
• nberr = 4
(Plot: max( |xi − yi| / |xi| ) versus n, the size of x.)


SLIDE 53

• maxerr = 10
• nberr = 10
(Plots: max( |xi − yi| / |xi| ) and time in sec, versus n, the size of x.)


SLIDE 54

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 55

LU factorization: starting from the encoding.

(Diagram: from A, build the row-checksum matrix Ar = A [ I  Br ], the column-checksum matrix Ac = [ I ; Bc^T ] A, and the fully augmented matrix Af = [ I ; Bc^T ] A [ I  Br ].)


SLIDE 56

(Diagram: factoring the encoded matrix. With A = LU: (1) Ar = A [ I  Br ] = L (U [ I  Br ]) = L Ur, so Ur carries the row checksums of U; (2) Ac = [ I ; Bc^T ] A = ([ I ; Bc^T ] L) U = Lc U, so Lc carries the column checksums of L; (3) with pivoting, PA = LU and the same identities hold with P folded in.)


SLIDE 57

(Diagram: the blocked version. For j = 1:nb:n, each panel factorization and trailing-matrix update, steps (1)-(5), is applied to the augmented matrix Af as well, so Lc and Ur remain consistent with L and U throughout the factorization.)

SLIDE 58

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 59

Experiments on MM

• Goal:
  - Write an FT-PDGEMM (Fault-Tolerant parallel matrix-matrix multiply).
• Testing:
  - Perform FT-PDGEMM in a loop and check the results with residual checking;
  - on top of this, add an automatic process killer.

(Diagram: Cout ← αA*B + βCin, checked via a random x: Cout·x − [ α A (B x) + β Cin x ].)


SLIDE 60

ABFT-BLAS: a parallel fault-tolerant BLAS library based on ABFT techniques

• Constructed on top of FT-MPI.
• Provides users a fault-tolerant environment:
  - detects failures,
  - recovers data automatically,
  - enables the user to stack computational routines one on top of the other (two general problems),
  - goal: a research library for conducting experiments on fault tolerance.
• Provides developers with an automatic process killer.


SLIDE 61

EXAMPLE CODE

int rc;
struct Vector v;
struct Matrix a;
struct Dataworld worldmpi;
struct Global_ddata normv;
struct Global_idata nbr_iter;
…
rc = MPI_Init(&argc, &argv);
rc = init_world(&worldmpi, p, q, rc);
rc = get_info_on_grid(&worldmpi, &me, &myrow, &mycol, &nprow, &npcol);
…
rc = allocate_vector(&v, POS_ROW, 0, nb_n, &worldmpi, "v");
rc = allocate_matrix(&a, m, n, nb_m, nb_n, &worldmpi, "a");
rc = allocate_dglobal(&normv, 1, &worldmpi);
rc = allocate_iglobal(&nbr_iter, 1, &worldmpi);
…
if (!worldmpi.recovering)
{
    /* ... here goes the user code to initialize objects ... */
    rc = make_checksum_matrix(&a, &worldmpi);
    rc = make_checksum_vector(&v, &worldmpi);
}
if (worldmpi.user_state == 0)
{
    rc = gdnrm2(&worldmpi, &v, normv.data);
    worldmpi.user_state = 1;
}
if (worldmpi.user_state == 1)
{
    /* ... here goes any call to the ABFT-BLAS numerical routines ... */
    worldmpi.user_state = 2;
}
free_vector(&v);
free_matrix(&a);
free_dglobal(&normv);
free_iglobal(&nbr_iter);
exit(0);


SLIDE 62

(Outline slide, repeated; contents identical to Slide 2.)


SLIDE 63

PDGEMM-SUMMA

PDGEMM: for k = 1:nb:n, C ← A*B + C (rank-nb updates). (Diagram.)


SLIDE 64

PDGEMM-SUMMA → ABFT-PDGEMM-SUMMA

ABFT-PDGEMM.

• The algorithm maintains the consistency of the checkpoints of the matrix C naturally.

(Diagram: for k = 1:nb:n, C ← A*B + C, now on the checksum-augmented matrices.)
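Why the checkpoints of C stay consistent "naturally": each SUMMA step is a rank-nb update, and applying it to the checksum-augmented factors updates data and checksums together. A serial, toy-sized sketch (no MPI, nb = 1; matrices are illustrative) checks the invariant after every step:

```python
# SUMMA-style loop of rank-1 updates Cf += Ac[:, k] * Br[k, :] on
# checksum-augmented factors: Cf stays fully checksummed after EVERY
# step, so C never needs separate re-checkpointing.
n = 3
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]

# Ac: column checksums appended as a row; Br: row checksums as a column.
Ac = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
Br = [row + [sum(row)] for row in B]

Cf = [[0.0] * (n + 1) for _ in range(n + 1)]

def consistent(Cf):
    rows_ok = all(abs(sum(row[:n]) - row[n]) < 1e-9 for row in Cf[:n])
    cols = list(zip(*Cf))
    cols_ok = all(abs(sum(col[:n]) - col[n]) < 1e-9 for col in cols[:n])
    return rows_ok and cols_ok

for k in range(n):                      # the "for k = 1:nb:n" loop (nb = 1)
    for i in range(n + 1):
        for j in range(n + 1):
            Cf[i][j] += Ac[i][k] * Br[k][j]
    assert consistent(Cf)               # invariant holds after each update

# The data part of Cf is exactly C = A*B.
C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]
assert all(abs(Cf[i][j] - C[i][j]) < 1e-9 for i in range(n) for j in range(n))
```

Because the invariant holds between steps, a process lost mid-multiply can be rebuilt from the surviving rows/columns and the checksums, then the loop resumes.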


SLIDE 65

jacquard.nersc.gov

• Processor type: Opteron, 2.2 GHz
• Processor theoretical peak: 4.4 GFlops/sec
• Number of application processors: 712
• System theoretical peak (computational nodes): 3.13 TFlops/sec
• Number of shared-memory application nodes: 356
• Processors per node: 2
• Physical memory per node: 6 GBytes
• Usable memory per node: 3-5 GBytes
• Switch interconnect: InfiniBand
• Switch MPI unidirectional latency: 4.5 μsec
• Switch MPI unidirectional bandwidth (peak): 620 MB/s
• Global shared disk: GPFS, usable disk space 30 TBytes
• Batch system: PBS Pro

SLIDE 66

MVAPICH vs FT-MPI.
(Plots: GFLOPs/sec/proc, 0 to 4, versus number of processors, 64 to 454, for matrix sizes 1000, 2000, 3000, 4000.)


SLIDE 67

FT-PDGEMM, nloc = 4,000.
(Plot: GFLOPs/sec/proc versus number of processors, 64 to 484, for FT off, FT on, and FT on with 1 fault.)


SLIDE 68

FT-PDGEMM, nloc = 4,000.
(Plots: GFLOPs/sec/proc and aggregate GFLOPs/sec, versus number of processors, 64 to 484, for FT off, FT on, and FT on with 1 fault.)


SLIDE 69

Performance modeling.
(Plot: GFLOPs/sec/proc versus number of processors, 100 to 1024: Model SUMMA, Model ABFT, Measured SUMMA, Measured ABFT.)


SLIDE 70

Strong scalability.
(Plot: GFLOPs/sec/proc versus number of processors, 4 to 1024, for SUMMA at nloc = 10,000 to 60,000.)


SLIDE 71

Strong scalability.
(Plot: GFLOPs/sec/proc versus number of processors, 4 to 1024, comparing SUMMA and ABFT at nloc = 10,000 to 60,000.)

ABFT represents the only known alternative to address fault tolerance in strong scalability.


SLIDE 72

Strong scalability. (Same plot as Slide 71.)


SLIDE 73

ABFT advantages over diskless checkpointing

• Independent of the surface (n^2) / volume (n^3) ratio.
  - Important for n/n operations (e.g. FFT).
  - Important for MM with small n.
• Independent of the failure rate.
  - No need to guess parameters.
• Fits nicely in the algorithm.
  - No need for explicit synchronization, for example.

ABFT disadvantage over diskless checkpointing

• Relies on floating-point arithmetic checksums.
  - Cancellation, ill-conditioned matrices, etc.