Which verification for soft error detection?
Leonardo Bautista-Gomez (1), Anne Benoit (2), Aurélien Cavelan (2), Saurabh K. Raina (3), Yves Robert (2,4) and Hongyang Sun (2)
1. Argonne National Laboratory, USA
2. ENS Lyon & INRIA, France
3. Jaypee Institute of Information Technology, India
4. University of Tennessee Knoxville, USA
Anne.Benoit@ens-lyon.fr
Dagstuhl Seminar #15281: Algorithms and Scheduling Techniques to Manage Resilience and Power Consumption in Distributed Systems
July 6, 2015, Schloss Dagstuhl, Germany

Computing at Exascale
- Exascale platform: $10^5$ or $10^6$ nodes, each equipped with $10^2$ or $10^3$ cores
- Shorter Mean Time Between Failures (MTBF) $\mu$
- Theorem: $\mu_p = \frac{\mu_{\text{ind}}}{p}$ for arbitrary distributions
  MTBF (individual node)            1 year    10 years   120 years
  MTBF (platform of $10^6$ nodes)   30 sec    5 min      1 h
- Need more reliable components!
- Need more resilient techniques!
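The table rows can be reproduced directly from the theorem. The following is a minimal sketch, assuming a 365-day year and taking $p$ to be the number of nodes; the function name `platform_mtbf` is illustrative and does not come from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def platform_mtbf(mu_ind_seconds: float, p: int) -> float:
    """Platform MTBF mu_p = mu_ind / p for a platform of p nodes."""
    return mu_ind_seconds / p


# Individual-node MTBF values from the slide, on a platform of 10^6 nodes.
for years in (1, 10, 120):
    mu_p = platform_mtbf(years * SECONDS_PER_YEAR, 10**6)
    print(f"node MTBF {years:>3} years -> platform MTBF {mu_p:8.1f} s")
```

The three results are roughly 31.5 s, 315 s and 3784 s, which the slide rounds to 30 sec, 5 min and 1 h.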

General-purpose approach
- Periodic checkpoint, rollback and recovery
  [Figure: timeline of work segments W separated by checkpoints C; labels: Error, Corrupt, Detect]
- Fail-stop errors: instantaneous error detection, e.g., resource crash
- Silent errors (aka silent data corruptions): e.g., soft faults in L1 cache, ALU, double bit flip
- A silent error is detected only when the corrupted data is activated, which could happen long after its occurrence
- Detection latency is problematic ⇒ risk of saving a corrupted checkpoint!

Coping with silent errors
- Couple checkpointing with verification
  [Figure: timeline in which each work segment W is followed by a verification V* and a checkpoint C; labels: Error, Detect]
- Before each checkpoint, run some verification mechanism (checksum, ECC, coherence tests, TMR, etc.)
- Silent error is detected by verification ⇒ checkpoint always valid
- Optimal period (Young/Daly):
              Fail-stop (classical)     Silent errors
  Pattern     $T = W + C$               $T = W + V^* + C$
  Optimal     $W^* = \sqrt{2C\mu}$      $W^* = \sqrt{(C + V^*)\mu}$
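To get a feel for the two formulas in the table, here is a small sketch that evaluates both optimal periods. The numeric values (C = 60 s, V* = 30 s, µ = 3600 s) are made-up illustration values, and the function names are hypothetical.

```python
import math


def fail_stop_period(c: float, mu: float) -> float:
    """Young/Daly optimal work length for fail-stop errors: sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * c * mu)


def silent_error_period(c: float, v_star: float, mu: float) -> float:
    """Optimal work length with a verified checkpoint: sqrt((C + V*) * mu)."""
    return math.sqrt((c + v_star) * mu)


# Illustration values only (seconds): checkpoint, perfect verification, platform MTBF.
C, V_STAR, MU = 60.0, 30.0, 3600.0
print(f"fail-stop     W* = {fail_stop_period(C, MU):.1f} s")
print(f"silent errors W* = {silent_error_period(C, V_STAR, MU):.1f} s")
```

With these values the silent-error period is shorter than the fail-stop one, even though the per-pattern cost C + V* is larger than C, because the factor 2 is absent under the square root.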

One step further
- Perform several verifications before each checkpoint
  [Figure: timeline with several verifications V* inside each pattern, the last one followed by a checkpoint C; labels: Error, Detect]
- Pro: silent error is detected earlier in the pattern
- Con: additional overhead in error-free executions
- How many intermediate verifications should be used, and at which positions?

Partial verification
- Guaranteed/perfect verifications ($V^*$) can be very expensive!
- Partial verifications ($V$) are available for many HPC applications!
  - Lower accuracy: recall $r = \frac{\#\text{detected errors}}{\#\text{total errors}} < 1$
  - Much lower cost, i.e., $V < V^*$
  [Figure: timeline with partial verifications V_1 and V_2 placed between perfect verifications V* and checkpoints C; labels: Error, "Detect?" at a partial verification, "Detect!" at the perfect one]
- Which verification(s) to use? How many? Positions?

Model and objective
- Silent errors
  - Poisson process: arrival rate $\lambda = 1/\mu$, where $\mu$ is the platform MTBF
  - Strike only computations; checkpointing, recovery, and verifications are protected
- Resilience parameters
  - Cost of checkpointing $C$, cost of recovery $R$
  - $k$ types of partial detectors and a perfect detector: $\left(D^{(1)}, D^{(2)}, \dots, D^{(k)}, D^*\right)$
    - $D^{(i)}$: cost $V^{(i)}$ and recall $r^{(i)} < 1$
    - $D^*$: cost $V^*$ and recall $r^* = 1$
- Design an optimal periodic computing pattern that minimizes the execution time (or makespan) of the application

Pattern
Formally, a pattern $\text{Pattern}(W, n, \alpha, D)$ is defined by
- $W$: pattern work length (or period)
- $n$: number of work segments, of lengths $w_i$ (with $\sum_{i=1}^{n} w_i = W$)
- $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_n]$: work fraction of each segment ($\alpha_i = w_i / W$ and $\sum_{i=1}^{n} \alpha_i = 1$)
- $D = [D_1, D_2, \dots, D_{n-1}, D^*]$: detectors used at the end of each segment ($D_i = D^{(j)}$ for some type $j$)
  [Figure: one pattern on a time axis, with segments $w_1, w_2, w_3, \dots, w_n$; segment $i$ ends with detector $D_i$, and the last segment ends with $D^*$ followed by a checkpoint C]
- Last detector is perfect to avoid saving corrupted checkpoints
- The same detector type $D^{(j)}$ could be used at the end of several segments
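One possible way to hold this definition in code is sketched below. The class names `Detector` and `Pattern`, their fields, and the validation tolerances are all assumptions made for illustration; the slides only define the mathematical object.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detector:
    name: str      # hypothetical label for the detector type
    cost: float    # V for a partial detector, V* for the perfect one
    recall: float  # r < 1 for a partial detector, r* = 1 for the perfect one


@dataclass
class Pattern:
    work: float                # W: pattern work length
    alpha: List[float]         # work fraction of each of the n segments
    detectors: List[Detector]  # D_1, ..., D_{n-1}, D*

    def __post_init__(self) -> None:
        if not self.alpha or len(self.alpha) != len(self.detectors):
            raise ValueError("need one detector per segment and n >= 1")
        if abs(sum(self.alpha) - 1.0) > 1e-9:
            raise ValueError("work fractions must sum to 1")
        if self.detectors[-1].recall != 1.0:
            raise ValueError("last detector must be perfect")
        if any(not 0.0 < d.recall < 1.0 for d in self.detectors[:-1]):
            raise ValueError("intermediate detectors must be partial")

    @property
    def n(self) -> int:
        return len(self.alpha)

    def segment_lengths(self) -> List[float]:
        """Lengths w_i = alpha_i * W."""
        return [a * self.work for a in self.alpha]


partial = Detector("partial", cost=1.0, recall=0.8)
perfect = Detector("perfect", cost=10.0, recall=1.0)
pattern = Pattern(100.0, [0.25, 0.5, 0.25], [partial, partial, perfect])
print(pattern.n, pattern.segment_lengths())
```

The constructor checks mirror the two remarks on the slide: the final detector has recall 1, and the same partial detector type may appear at the end of several segments.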

Summary of results
In a nutshell:
- Given a pattern $\text{Pattern}(W, n, \alpha, D)$,
  - We show how to compute the expected execution time
  - We are able to characterize its optimal length
  - We can compute the optimal positions of the partial verifications
- However, we prove that finding the optimal pattern is NP-hard
- We design an FPTAS (Fully Polynomial-Time Approximation Scheme) that gives a makespan within $(1 + \epsilon)$ times the optimal, with running time polynomial in the input size and $1/\epsilon$
- We show a simple greedy algorithm that works well in practice

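Summary of results
Algorithm to determine a pattern $\text{Pattern}(W, n, \alpha, D)$:
- Use FPTAS or Greedy (or even brute force for small instances) to find the (optimal) number $n$ of segments and the set $D$ of used detectors
- Arrange the $n - 1$ partial detectors in any order
- Compute
  $$W^* = \sqrt{\frac{o_{\text{ff}}}{\lambda f_{\text{re}}}} \quad \text{and} \quad \alpha_i^* = \frac{1}{U_n} \cdot \frac{1 - g_{i-1} g_i}{(1 + g_{i-1})(1 + g_i)} \quad \text{for } 1 \le i \le n,$$
  where
  $$o_{\text{ff}} = \sum_{i=1}^{n-1} V_i + V^* + C \quad \text{and} \quad f_{\text{re}} = \frac{1}{2}\left(1 + \frac{1}{U_n}\right),$$
  with $g_i = 1 - r_i$ (and the conventions $g_0 = g_n = 0$) and
  $$U_n = 1 + \sum_{i=1}^{n-1} \frac{1 - g_i}{1 + g_i}$$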
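The last step of this algorithm is a direct formula evaluation. The sketch below covers only that step, assuming the detectors have already been chosen (by FPTAS, Greedy or brute force, none of which is shown here) and assuming the boundary values $g_0 = g_n = 0$. The function name and argument layout are illustrative.

```python
import math
from typing import List, Tuple


def optimal_pattern(recalls: List[float], costs: List[float],
                    v_star: float, c: float, lam: float
                    ) -> Tuple[float, List[float], float]:
    """Return (W*, alpha*, f_re) for n-1 chosen partial detectors.

    recalls[i], costs[i] describe the (i+1)-th partial detector in pattern order.
    """
    n = len(recalls) + 1
    # g[0..n] with g_i = 1 - r_i and the assumed boundary values g_0 = g_n = 0.
    g = [0.0] + [1.0 - r for r in recalls] + [0.0]
    u_n = 1.0 + sum((1.0 - gi) / (1.0 + gi) for gi in g[1:n])
    alpha = [
        (1.0 - g[i - 1] * g[i]) / ((1.0 + g[i - 1]) * (1.0 + g[i])) / u_n
        for i in range(1, n + 1)
    ]
    o_ff = sum(costs) + v_star + c        # fault-free overhead
    f_re = 0.5 * (1.0 + 1.0 / u_n)        # re-execution fraction
    w_star = math.sqrt(o_ff / (lam * f_re))
    return w_star, alpha, f_re


# Illustration values only: two partial detectors with recall 0.5 and cost 1 s.
w_star, alpha, f_re = optimal_pattern([0.5, 0.5], [1.0, 1.0], 10.0, 20.0, 1e-4)
print(f"W* = {w_star:.1f}, alpha* = {alpha}, f_re = {f_re:.4f}")
```

Two sanity checks are worth noting. The fractions $\alpha_i^*$ sum to 1, and with no partial detector at all ($n = 1$) the result reduces to $f_{\text{re}} = 1$ and $W^* = \sqrt{(C + V^*)\mu}$, the silent-error period given earlier in the deck.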

Expected execution time of a pattern
Proposition. The expected time to execute a pattern $\text{Pattern}(W, n, \alpha, D)$ is
$$E(W) = W + \sum_{i=1}^{n-1} V_i + V^* + C + \lambda W \left(R + W \alpha^T A \alpha + d^T \alpha\right) + o(\lambda),$$
where $A$ is a symmetric matrix defined by $A_{ij} = \frac{1}{2}\left(1 + \prod_{k=i}^{j-1} g_k\right)$ for $i \le j$, and $d$ is a vector defined by $d_i = \sum_{j=i}^{n} \left(\prod_{k=i}^{j-1} g_k\right) V_i$ for $1 \le i \le n$.
- First-order approximation (as in Young/Daly's classic formula)
- Matrix $A$ is essential to the analysis. For instance, when $n = 4$ we have:
$$A = \frac{1}{2} \begin{bmatrix} 2 & 1 + g_1 & 1 + g_1 g_2 & 1 + g_1 g_2 g_3 \\ 1 + g_1 & 2 & 1 + g_2 & 1 + g_2 g_3 \\ 1 + g_1 g_2 & 1 + g_2 & 2 & 1 + g_3 \\ 1 + g_1 g_2 g_3 & 1 + g_2 g_3 & 1 + g_3 & 2 \end{bmatrix}$$
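The matrix $A$ can be built for any $n$ from the definition of $A_{ij}$. The following sketch does so in plain Python and evaluates the quadratic form $\alpha^T A \alpha$; the helper names `build_a` and `quadratic_form` and the sample values of $g$ are assumptions for illustration.

```python
from typing import List


def build_a(g: List[float]) -> List[List[float]]:
    """Build the n x n matrix A from g = [g_1, ..., g_{n-1}].

    A_ij = (1 + g_i * g_{i+1} * ... * g_{j-1}) / 2 for i <= j, and A is symmetric.
    """
    n = len(g) + 1
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        prod = 1.0  # empty product when j == i
        for j in range(i, n):
            a[i][j] = a[j][i] = 0.5 * (1.0 + prod)
            if j < n - 1:
                prod *= g[j]  # extend the product by the next miss probability
    return a


def quadratic_form(alpha: List[float], a: List[List[float]]) -> float:
    """Compute alpha^T A alpha."""
    n = len(alpha)
    return sum(alpha[i] * a[i][j] * alpha[j] for i in range(n) for j in range(n))


# Illustration values only: n = 4 with three partial detectors.
g = [0.5, 0.25, 0.1]
A = build_a(g)
for row in A:
    print(["%.5f" % x for x in row])
print("alpha^T A alpha with equal segments:", quadratic_form([0.25] * 4, A))
```

The diagonal entries are all 1 (that is, 2/2), and the top-right entry is $(1 + g_1 g_2 g_3)/2$, matching the $n = 4$ matrix above.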

Minimizing makespan
- For an application with total work $W_{\text{base}}$, the makespan is
  $$W_{\text{final}} \approx \frac{E(W)}{W} \times W_{\text{base}} = W_{\text{base}} + H(W) \times W_{\text{base}},$$
  where $H(W) = \frac{E(W)}{W} - 1$ is the execution overhead
- For instance, if $W_{\text{base}} = 100$ and $W_{\text{final}} = 120$, we have $H(W) = 20\%$
- Minimizing makespan is equivalent to minimizing overhead!
  $$H(W) = \frac{o_{\text{ff}}}{W} + \lambda f_{\text{re}} W + \lambda \left(R + d^T \alpha\right) + o(\lambda)$$
  - fault-free overhead: $o_{\text{ff}} = \sum_{i=1}^{n-1} V_i + V^* + C$
  - re-execution fraction: $f_{\text{re}} = \alpha^T A \alpha$
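The relation between overhead and makespan is simple enough to check numerically. Below is a minimal sketch of the first-order overhead (dropping the $o(\lambda)$ remainder) and the resulting makespan. The function names are illustrative, and `d_alpha` stands for the scalar $d^T \alpha$, which the caller is assumed to have computed.

```python
def overhead(w: float, o_ff: float, f_re: float, lam: float,
             r: float = 0.0, d_alpha: float = 0.0) -> float:
    """First-order H(W) = o_ff / W + lam * f_re * W + lam * (R + d^T alpha)."""
    return o_ff / w + lam * f_re * w + lam * (r + d_alpha)


def makespan(w_base: float, h: float) -> float:
    """W_final = W_base + H * W_base."""
    return w_base * (1.0 + h)


# The slide's example: a 20% overhead turns 100 units of work into 120.
print(makespan(100.0, 0.20))

# Illustration values only: o_ff = 90 s, f_re = 1, lambda = 1/3600 per second.
print(overhead(600.0, 90.0, 1.0, 1.0 / 3600.0))
```

The first term of `overhead` decreases as the pattern gets longer while the second grows, which is why an intermediate optimal length exists.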

Optimal pattern length to minimize overhead
Proposition. The execution overhead of a pattern $\text{Pattern}(W, n, \alpha, D)$ is minimized when its length is
$$W^* = \sqrt{\frac{o_{\text{ff}}}{\lambda f_{\text{re}}}}.$$
The optimal overhead is
$$H(W^*) = 2\sqrt{\lambda\, o_{\text{ff}}\, f_{\text{re}}} + o(\sqrt{\lambda}).$$
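A short derivation sketch shows where both expressions come from. It starts from the first-order overhead $H(W) = \frac{o_{\text{ff}}}{W} + \lambda f_{\text{re}} W + \lambda (R + d^T \alpha) + o(\lambda)$ and treats $o_{\text{ff}}$, $f_{\text{re}}$, $R$ and $d^T \alpha$ as constants with respect to $W$. This is the standard Young/Daly-style argument and is not taken verbatim from the slides.

```latex
\begin{align*}
\frac{\partial H}{\partial W}
  &= -\frac{o_{\text{ff}}}{W^2} + \lambda f_{\text{re}} = 0
  \quad\Longrightarrow\quad
  W^* = \sqrt{\frac{o_{\text{ff}}}{\lambda f_{\text{re}}}}, \\
\frac{\partial^2 H}{\partial W^2}
  &= \frac{2\, o_{\text{ff}}}{W^3} > 0
  \quad\text{(so $W^*$ is a minimum)}, \\
\frac{o_{\text{ff}}}{W^*}
  &= \lambda f_{\text{re}} W^*
   = \sqrt{\lambda\, o_{\text{ff}}\, f_{\text{re}}}, \\
H(W^*)
  &= 2\sqrt{\lambda\, o_{\text{ff}}\, f_{\text{re}}}
     + \underbrace{\lambda \left(R + d^T \alpha\right)}_{O(\lambda)\, =\, o(\sqrt{\lambda})}
     + o(\lambda)
   = 2\sqrt{\lambda\, o_{\text{ff}}\, f_{\text{re}}} + o(\sqrt{\lambda}).
\end{align*}
```

At the optimum the two $W$-dependent terms are equal, so the fault-free overhead and the re-execution overhead each contribute half of the total.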
