Which verification for soft error detection? Leonardo Bautista-Gomez - PowerPoint PPT Presentation

Problem statement Theoretical analysis Performance evaluation Conclusion Which verification for soft error detection? Leonardo Bautista-Gomez 1 , Anne Benoit 2 , Aur´ elien Cavelan 2 , Saurabh K. Raina 3 , Yves Robert 2 , 4 and Hongyang Sun 2 1 . Argonne National Laboratory, USA 2 . ENS Lyon & INRIA, France 3 . Jaypee Institute of Information Technology, India 4 . University of Tennessee Knoxville, USA Anne.Benoit@ens-lyon.fr December 17, HiPC’2015, Bengaluru, India 1/25

Problem statement Theoretical analysis Performance evaluation Conclusion Computing at Exascale Exascale platform: 10 5 or 10 6 nodes, each equipped with 10 2 or 10 3 cores Shorter Mean Time Between Failures (MTBF) µ Theorem: µ p = µ ind for arbitrary distributions p MTBF (individual node) 1 year 10 years 120 years MTBF (platform of 10 6 nodes) 30 sec 5 mn 1 h 2/25

Problem statement Theoretical analysis Performance evaluation Conclusion Computing at Exascale Exascale platform: 10 5 or 10 6 nodes, each equipped with 10 2 or 10 3 cores Shorter Mean Time Between Failures (MTBF) µ Theorem: µ p = µ ind for arbitrary distributions p MTBF (individual node) 1 year 10 years 120 years MTBF (platform of 10 6 nodes) 30 sec 5 mn 1 h Need more reliable components!! Need more resilient techniques!!! 2/25

Problem statement Theoretical analysis Performance evaluation Conclusion General-purpose approach Periodic checkpoint, rollback and recovery: Error C C C Time W W Fail-stop errors: instantaneous error detection, e.g., resource crash 3/25

Problem statement Theoretical analysis Performance evaluation Conclusion General-purpose approach Periodic checkpoint, rollback and recovery: Error C C C Time W W Fail-stop errors: instantaneous error detection, e.g., resource crash Silent errors (aka silent data corruptions): e.g., soft faults in L1 cache, ALU, double bit flip 3/25

Problem statement Theoretical analysis Performance evaluation Conclusion General-purpose approach Periodic checkpoint, rollback and recovery: Error Corrupt Detect C C C Time W W Fail-stop errors: instantaneous error detection, e.g., resource crash Silent errors (aka silent data corruptions): e.g., soft faults in L1 cache, ALU, double bit flip Silent error is detected only when corrupted data is activated, which could happen long after its occurrence Detection latency is problematic ⇒ risk of saving corrupted checkpoint! 3/25

Problem statement Theoretical analysis Performance evaluation Conclusion Coping with silent errors Couple checkpointing with verification: Detect Error V ∗ C V ∗ C V ∗ C Time W W Before each checkpoint, run some verification mechanism (checksum, ECC, coherence tests, TMR, etc) Silent error is detected by verification ⇒ checkpoint always valid � 4/25

Problem statement Theoretical analysis Performance evaluation Conclusion Coping with silent errors Couple checkpointing with verification: Detect Error V ∗ C V ∗ C V ∗ C Time W W Before each checkpoint, run some verification mechanism (checksum, ECC, coherence tests, TMR, etc) Silent error is detected by verification ⇒ checkpoint always valid � Optimal period (Young/Daly): Fail-stop (classical) Silent errors T = W + V ∗ + C Pattern T = W + C W ∗ = √ 2 C µ W ∗ = � Optimal ( C + V ∗ ) µ 4/25

Problem statement Theoretical analysis Performance evaluation Conclusion One step further Perform several verifications before each checkpoint: Detect Error V ∗ C V ∗ V ∗ V ∗ C V ∗ V ∗ V ∗ C Time Pro: silent error is detected earlier in the pattern � Con: additional overhead in error-free executions � 5/25

Problem statement Theoretical analysis Performance evaluation Conclusion One step further Perform several verifications before each checkpoint: Detect Error V ∗ C V ∗ V ∗ V ∗ C V ∗ V ∗ V ∗ C Time Pro: silent error is detected earlier in the pattern � Con: additional overhead in error-free executions � How many intermediate verifications to use and the positions? 5/25

Problem statement Theoretical analysis Performance evaluation Conclusion Partial verification Guaranteed/perfect verifications ( V ∗ ) can be very expensive! Partial verifications ( V ) are available for many HPC applications! Lower accuracy: recall r = #detected errors < 1 � #total errors Much lower cost, i.e., V < V ∗ � 6/25

Problem statement Theoretical analysis Performance evaluation Conclusion Partial verification Guaranteed/perfect verifications ( V ∗ ) can be very expensive! Partial verifications ( V ) are available for many HPC applications! Lower accuracy: recall r = #detected errors < 1 � #total errors Much lower cost, i.e., V < V ∗ � Detect! Detect? Error V ∗ C V 1 V 2 V ∗ C V 1 V 2 V ∗ C Time 6/25

Problem statement Theoretical analysis Performance evaluation Conclusion Partial verification Guaranteed/perfect verifications ( V ∗ ) can be very expensive! Partial verifications ( V ) are available for many HPC applications! Lower accuracy: recall r = #detected errors < 1 � #total errors Much lower cost, i.e., V < V ∗ � Detect! Detect? Error V ∗ C V 1 V 2 V ∗ C V 1 V 2 V ∗ C Time Which verification(s) to use? How many? Positions? 6/25

Problem statement Theoretical analysis Performance evaluation Conclusion Outline Problem statement 1 Theoretical analysis 2 Performance evaluation 3 Conclusion 4 7/25

Problem statement Theoretical analysis Performance evaluation Conclusion Model and objective Silent errors Poisson process: arrival rate λ = 1 /µ , where µ is platform MTBF Strike only computations; checkpointing, recovery, and verifications are protected Resilience parameters Cost of checkpointing C , cost of recovery R k types of partial detectors and a perfect detector � D (1) , D (2) , . . . , D ( k ) , D ∗ � D ( i ) : cost V ( i ) and recall r ( i ) < 1 D ∗ : cost V ∗ and recall r ∗ = 1 Design an optimal periodic computing pattern that minimizes execution time (or makespan) of the application 8/25

Problem statement Theoretical analysis Performance evaluation Conclusion Pattern Formally, a pattern Pattern ( W , n , α , D ) is defined by W : pattern work length (or period) n : number of work segments, of lengths w i (with � n i =1 w i = W ) α = [ α 1 , α 2 , . . . , α n ]: work fraction of each segment ( α i = w i / W and � n i =1 α i = 1) D = [ D 1 , D 2 , . . . , D n − 1 , D ∗ ]: detectors used at the end of each segment ( D i = D ( j ) for some type j ) D n − 1 D ∗ C D 1 D 2 D 3 D ∗ C · · · Time w 1 w 2 w 3 w n · · · 9/25

Problem statement Theoretical analysis Performance evaluation Conclusion Pattern Formally, a pattern Pattern ( W , n , α , D ) is defined by W : pattern work length (or period) n : number of work segments, of lengths w i (with � n i =1 w i = W ) α = [ α 1 , α 2 , . . . , α n ]: work fraction of each segment ( α i = w i / W and � n i =1 α i = 1) D = [ D 1 , D 2 , . . . , D n − 1 , D ∗ ]: detectors used at the end of each segment ( D i = D ( j ) for some type j ) D n − 1 D ∗ C D 1 D 2 D 3 D ∗ C · · · Time w 1 w 2 w 3 w n · · · - Last detector is perfect to avoid saving corrupted checkpoints - The same detector type D ( j ) could be used at the end of several segments 9/25

Problem statement Theoretical analysis Performance evaluation Conclusion Outline Problem statement 1 Theoretical analysis 2 Performance evaluation 3 Conclusion 4 10/25

Problem statement Theoretical analysis Performance evaluation Conclusion Summary of results In a nutshell: Given a pattern Pattern ( W , n , α , D ), We show how to compute the expected execution time We are able to characterize its optimal length We can compute the optimal positions of the partial verifications 11/25

Problem statement Theoretical analysis Performance evaluation Conclusion Summary of results In a nutshell: Given a pattern Pattern ( W , n , α , D ), We show how to compute the expected execution time We are able to characterize its optimal length We can compute the optimal positions of the partial verifications However, we prove that finding the optimal pattern is NP-hard We design an FPTAS (Fully Polynomial-Time Approximation Scheme) that gives a makespan within (1 + ǫ ) times the optimal with running time polynomial in the input size and 1 /ǫ We show a simple greedy algorithm that works well in practice 11/25

Problem statement Theoretical analysis Performance evaluation Conclusion Summary of results Algorithm to determine a pattern Pattern ( W , n , α , D ): Use FPTAS or Greedy (or even brute force for small instances) to find (optimal) number n of segments and set D of used detectors Arrange the n − 1 partial detectors in any order Compute W ∗ = � 1 − g i − 1 g i o ff 1 λ f re and α ∗ i = U n · (1+ g i − 1 )(1+ g i ) for 1 ≤ i ≤ n , n − 1 V i + V ∗ + C and f re = 1 1 + 1 � � � where o ff = 2 U n i =1 n − 1 1 − g i � with g i = 1 − r i and U n = 1 + 1 + g i i =1 12/25

Which verification for soft error detection? Leonardo Bautista-Gomez - PowerPoint PPT Presentation

Problem statement Theoretical analysis Performance evaluation Conclusion Which verification for soft error detection? Leonardo Bautista-Gomez 1 , Anne Benoit 2 , Aur elien Cavelan 2 , Saurabh K. Raina 3 , Yves Robert 2 , 4 and Hongyang Sun 2 1

ERROR DETECTON & CORRECTION Error Detection EDC= Error Detection and Correction bits

Error Detection Codes Error Detection Two types Nave scheme Error Detection Codes

WALES SOFT POWER BAROMETER 2018 Measuring soft power beyond the nation-state April 2018 01 WHAT

Error Detection Two types Error Detection Codes (e.g. CRC, Parity, Checksums) Error

Chapter 11: The R.M.S. Error for Regression Errors: A has a large positive error B has a large

Which verification for soft error detection? Leonardo Bautista-Gomez 1 , Anne Benoit 2 , Aur

On Fuzzy Soft Rings Banu Pazar Varol and Halis Ayg un Department of Mathematics, Kocaeli

Introduction 1 Turbo Principle 2 Coding and uncoding SISO (Soft Input Soft Output) 3

Detection of neutral particles detection of neutrons detection of neutrinons detection of low

DIVS DL/ID Verification Systems Verification of Legal Status DIVS Passport Verification

> SOFT EDGE < By Iskos-Be rlin > SOFT EDGE < Soft Edge chair series is based on the

Kvadrat Soft Cells Acoustic excellence. Sustainable design. Where it all began. Kvadrat Soft

Soft body physics and fracture generation Erich Jagomgis What is a soft body? What is not a

Importance of Soft Tissue Modeling Importance of Soft Tissue Modeling Most medical procedures

Soft Soft Soft LArSoft coord, Oct 10 th , 2017 G. Petrillo (FNAL) Proxies for data products 1

Machine Learning for NLP SVMs for semantic error detection Aurlie Herbelot 2018 Centre for

Error Detection and Correction in Communication Networks Chong Shangguan Joint work with Itzhak

Measurement of Timing Error Detection Performance of Software-based Error Detection Mechanisms

Computer Organization & Assembly Language Programming (CSE 2312) Lecture 26: Overflow

CIRM - Dynamic Error Detection Peter Pirkelbauer Center for Applied Scientific Computing (CASC)

Cloud Simulations and Retrieved Surface Temperature Biases Evan Fishbein Michael Gunson F.

2.5 OLS: Precision and Diagnostics ECON 480 Econometrics Fall 2020 Ryan Safner

Hindley-Milner elaboration in applicative style Fran cois Pottier This pearl presents This

Laser Based H - Beam Diagnostics Yun Liu for Beam Instrumentation Team Research Accelerator

Which verification for soft error detection? Leonardo Bautista-Gomez - PowerPoint PPT Presentation

Problem statement Theoretical analysis Performance evaluation Conclusion Which verification for soft error detection? Leonardo Bautista-Gomez 1 , Anne Benoit 2 , Aur elien Cavelan 2 , Saurabh K. Raina 3 , Yves Robert 2 , 4 and Hongyang Sun 2 1

ERROR DETECTON &amp; CORRECTION Error Detection EDC= Error Detection and Correction bits

Error Detection Codes Error Detection Two types Nave scheme Error Detection Codes

WALES SOFT POWER BAROMETER 2018 Measuring soft power beyond the nation-state April 2018 01 WHAT

Error Detection Two types Error Detection Codes (e.g. CRC, Parity, Checksums) Error

Chapter 11: The R.M.S. Error for Regression Errors: A has a large positive error B has a large

Which verification for soft error detection? Leonardo Bautista-Gomez 1 , Anne Benoit 2 , Aur

On Fuzzy Soft Rings Banu Pazar Varol and Halis Ayg un Department of Mathematics, Kocaeli

Introduction 1 Turbo Principle 2 Coding and uncoding SISO (Soft Input Soft Output) 3

Detection of neutral particles detection of neutrons detection of neutrinons detection of low

DIVS DL/ID Verification Systems Verification of Legal Status DIVS Passport Verification

&gt; SOFT EDGE &lt; By Iskos-Be rlin &gt; SOFT EDGE &lt; Soft Edge chair series is based on the

Kvadrat Soft Cells Acoustic excellence. Sustainable design. Where it all began. Kvadrat Soft

Soft body physics and fracture generation Erich Jagomgis What is a soft body? What is not a

Importance of Soft Tissue Modeling Importance of Soft Tissue Modeling Most medical procedures

Soft Soft Soft LArSoft coord, Oct 10 th , 2017 G. Petrillo (FNAL) Proxies for data products 1

Machine Learning for NLP SVMs for semantic error detection Aurlie Herbelot 2018 Centre for

Error Detection and Correction in Communication Networks Chong Shangguan Joint work with Itzhak

Measurement of Timing Error Detection Performance of Software-based Error Detection Mechanisms

Computer Organization &amp; Assembly Language Programming (CSE 2312) Lecture 26: Overflow

CIRM - Dynamic Error Detection Peter Pirkelbauer Center for Applied Scientific Computing (CASC)

Cloud Simulations and Retrieved Surface Temperature Biases Evan Fishbein Michael Gunson F.

2.5 OLS: Precision and Diagnostics ECON 480 Econometrics Fall 2020 Ryan Safner

Hindley-Milner elaboration in applicative style Fran cois Pottier This pearl presents This

Laser Based H - Beam Diagnostics Yun Liu for Beam Instrumentation Team Research Accelerator

ERROR DETECTON & CORRECTION Error Detection EDC= Error Detection and Correction bits

> SOFT EDGE < By Iskos-Be rlin > SOFT EDGE < Soft Edge chair series is based on the

Computer Organization & Assembly Language Programming (CSE 2312) Lecture 26: Overflow