Accurate Prediction of Soft Error Vulnerability of Scientific Applications (PowerPoint presentation transcript)


  1. Accurate Prediction of Soft Error Vulnerability of Scientific Applications • Greg Bronevetsky, Post-doctoral Fellow, Lawrence Livermore National Lab

  2. Soft error: one-time corruption of system state • Examples: memory bit-flips, erroneous computations • Caused by – chip variability – charged particles passing through transistors (decay of packaging materials: Lead-208, Boron-10; fission due to cosmic neutrons) – temperature and power fluctuations

  3. Soft errors are a critical reliability challenge for supercomputers • Real Machines: – ASCI Q: 26 radiation-induced errors/week – Similar-size Cray XD1: 109 errors/week (estimated) – BlueGene/L: 3-4 L1 cache bit flips/day • Problem grows worse with time – Larger machines ⇒ larger error probability – SRAMs growing exponentially more vulnerable per chip

  4. We must understand the impact of soft errors on applications • Soft errors corrupt application state • May lead to crashes or corrupt output • Need to detect/tolerate soft errors – State of the art: checkers/correctors for individual algorithms – No general solution • Must first understand how errors affect applications – Identify the problem – Focus efforts

  5. Prior work says very little about most applications • Prior fault analysis work focuses on injecting errors into individual applications – [Lu and Reed, SC04]: Linux + MPICH + Cactus, NAMD, CAM – [Messer et al, ICSDN00]: Linux + Apache and Linux + Java (Jess, DB, Javac, Jack) – [Some et al, AC02]: Lynx + Mars texture segmentation application … • Where’s my application?

  6. Extending vulnerability characterization to more applications • Goal: general-purpose vulnerability characterization – Same accuracy as per-application fault injection – Much cheaper • Initial steps – Fault injection into iterative linear algebra methods – Library-based fault vulnerability analysis

  7. Step 1: Analyzing fault vulnerability of iterative methods • Target domain: solvers for sparse linear problems Ax=b • Goal: understand the error vulnerability of a class of algorithms – Raw error rates – Effectiveness of potential solutions • Error model: memory bit-flips

  8. Possible run outcomes • Success: <10% error • Silent Data Corruption (SDC): ≥ 10% error • Hang: method doesn’t reach target tolerance • Abort: SegFault or failed SparseLib check
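The four run outcomes above can be captured by a small classification harness around a single solver run. A minimal Python sketch, where `solve` is a hypothetical stand-in for any SparseLib-style iterative solver (all names here are illustrative, not the talk's actual code):

```python
import numpy as np

def classify_run(solve, A, b, x_true, max_iters=10000):
    """Classify one (possibly fault-injected) solver run into the four
    outcome categories: Success, SDC, Hang, or Abort."""
    try:
        x, converged = solve(A, b, max_iters=max_iters)
    except Exception:
        # Analogue of a SegFault or a failed SparseLib sanity check
        return "Abort"
    if not converged:
        return "Hang"  # never reached the target tolerance
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    return "Success" if rel_err < 0.10 else "SDC"  # <10% error = success
```

An injection campaign would call this once per fault location and tally the four categories.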

  9. Errors cause SDCs, Hangs, and Aborts in roughly 8-10% of runs each

  10. Large-scale applications are vulnerable to silent data corruptions • Scaled to a 1-day, 1,000-processor run of an application that only calls the iterative method • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)

  11. Larger-scale applications are even more vulnerable to silent data corruptions • Scaled to a 10-day, 100,000-processor run of an application that only calls the iterative method • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)

  12. Error Detectors [chart: outcome rates for the base configuration]

  13. Convergence detectors reduce SDC at <20% overhead [chart: outcome rates vs. base]


  15. Native detectors have little effect at little cost [chart: outcome rates vs. base]

  16. Encoding-based detectors significantly reduce SDC at high cost [chart: outcome rates vs. base]


  18. First general analysis of the error vulnerability of an algorithm class • Vulnerability analysis for a class of common subroutines • Described raw error vulnerability • Analyzed various detection/tolerance techniques – No clear winner, but useful rules of thumb

  19. Step 2: Vulnerability analysis of library-based applications • Many applications are mostly composed of calls to library routines • If an error hits some routine, its output will be corrupted • Later routines: corrupted inputs ⇒ corrupted outputs • (Work in progress)

  20. Idea: predict application vulnerability from routine profiles • Library implementors provide a vulnerability profile for each routine: – Error pattern in the routine's output after errors strike the routine – Function that maps input error patterns to output error patterns

  21. Idea: predict application vulnerability from routine profiles • Given the application's dependence graph – Simulate the effect of an error in each routine – Average over all error locations to produce the error pattern at the outputs
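The simulate-and-average idea above can be sketched over a topologically ordered dependence graph. The node/profile encoding here (a `"self"` error pattern for faults striking the routine, and a `"prop"` transfer function for propagating input errors) is an assumption for illustration, not the talk's actual data structures:

```python
import numpy as np

def simulate_injection(nodes, fault_node):
    """nodes: list of (name, deps, profile) in topological order.
    profile["self"]: output error pattern when the fault hits this node.
    profile["prop"]: maps a list of input error patterns to an output pattern.
    Returns the error pattern at every node's output."""
    pattern = {}
    for name, deps, prof in nodes:
        if name == fault_node:
            pattern[name] = prof["self"]          # fault strikes here
        elif deps:
            pattern[name] = prof["prop"]([pattern[d] for d in deps])
        else:
            pattern[name] = np.zeros_like(prof["self"])  # error-free source
    return pattern

def average_over_faults(nodes, sink):
    """Average the sink's output error pattern over all fault locations."""
    pats = [simulate_injection(nodes, name)[sink] for name, _, _ in nodes]
    return np.mean(pats, axis=0)
```

For a two-routine chain where the second routine doubles any incoming error, the average output pattern is the mean of the propagated and directly injected patterns.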

  22. Examined applications that use BLAS and LAPACK • 12 routines of cost ≥ O(n^2), double-precision real numbers – Matrix-vector multiplication: DGEMV – Matrix-matrix multiplication: DGEMM – Rank-1 update: DGER – Linear least squares: DGESV, DGELS – SVD factorization: DGESVD, DGGSVD, DGESDD – Eigenvectors: DGEEV, DGGEV, DGEES, DGGES

  23. Examined applications that use BLAS and LAPACK • 12 routines of cost ≥ O(n^2), double-precision real numbers • Executed on randomly-generated n×n matrices (n = 62, 125, 250, 500) • BLAS/LAPACK from Intel's Math Kernel Library on Opteron (MKL 10) and Itanium 2 (MKL 8) – Same results on both • Error model: memory bit-flips
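The memory bit-flip error model can be mimicked by flipping a random bit of a random IEEE-754 double in a matrix. A hypothetical helper for illustration, not the study's actual injection tool:

```python
import random
import struct

import numpy as np

def flip_bit(x: float, bit: int) -> float:
    """Flip one of the 64 bits of an IEEE-754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    bits ^= 1 << bit
    (y,) = struct.unpack("<d", struct.pack("<Q", bits))
    return y

def inject_bit_flip(M, rng=random):
    """Flip a random bit of a random entry of matrix M, in place."""
    i = rng.randrange(M.shape[0])
    j = rng.randrange(M.shape[1])
    M[i, j] = flip_bit(M[i, j], rng.randrange(64))
```

Flipping the exponent or sign bits produces large multiplicative errors, while low mantissa bits produce tiny ones, which is what makes the multiplicative-error histograms on the following slides informative.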

  24. Error patterns: multiplicative error histograms [chart: histogram for DGEMM]

  25. Output error patterns fall into a few major categories [charts: error histograms for DGGES/DGESV output beta (62×1) and output L (62×62); DGGES/DGEMM output vsr (62×62) and output C (62×62)]

  26. Error patterns may vary with matrix size [charts: DGGSVD output beta and output V for n = 62, 125, 250, 500]

  27. Input-output error transition functions: trained predictors – Linear Least Squares – Support Vector Machines (linear, 2nd-degree polynomial, and rbf kernels) – Artificial Neural Nets (3, 10, 100 hidden layers; linear, Gaussian, symmetric Gaussian, and sigmoid transfer functions)

  28. Trained on multiple input error patterns • DataInj: single-bit errors • DataInj-R: output errors of routines with DataInj inputs • UniInj: uniform multiplicative errors ∈ [-100,100] • UniInj-R: output errors of routines with UniInj inputs • Inj-R: output errors of error-injected routines

  29. Input-output error transition functions: trained predictors (Linear Least Squares, Support Vector Machines, Artificial Neural Nets) • Trained on sample input error patterns – DataInj: single-bit errors – DataInj-R: outputs of routines with DataInj inputs – UniInj: uniform multiplicative errors ∈ [-100,100] – UniInj-R: outputs of routines with UniInj inputs – Inj-R: outputs of error-injected routines
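As an illustration of how such a transition function might be trained, here is a minimal linear-least-squares fit from input error histograms to output error histograms. The histogram-vector encoding and the function names are assumptions; the talk does not specify its feature representation:

```python
import numpy as np

def fit_transition(X, Y):
    """Least-squares linear map W with X @ W ≈ Y.
    Rows of X: input error histograms from a training set (e.g. DataInj);
    rows of Y: the matching observed output error histograms."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict_output_pattern(W, x):
    """Predict the output error pattern for one input error pattern."""
    return x @ W
```

The SVM and neural-net predictors from the slide would replace the linear map with a nonlinear regressor trained on the same (X, Y) pairs.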

  30. Output errors depend on input errors • Equivalence classes – DataInj, DataInj-R | Inj-R – DataUni, DataUni-R

  31. Evaluated accuracy of all predictors on all training sets • Error metric: probability of error ≥ δ, for δ ∈ {1e-14, 1e-13, …, 2, 10, 100} [chart: recorded vs. predicted error distributions]
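The error metric above, the probability that the observed error meets or exceeds each threshold δ, can be computed directly from a sample of errors; a small sketch (the function name is illustrative):

```python
import numpy as np

def exceedance_curve(errors, deltas):
    """P(|error| >= delta) for each threshold delta, from an error sample."""
    errors = np.abs(np.asarray(errors, dtype=float))
    return np.array([(errors >= d).mean() for d in deltas])
```

Comparing the recorded and predicted curves threshold by threshold is what the accuracy charts on the following slides visualize.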

  32. Evaluated accuracy of all predictors on all training sets [charts: recorded vs. predicted error distributions]

  33. Linear Least Squares has best accuracy, Neural Nets worst • Evaluation set: union of all training sets


  35. Accuracy varies among predictors [chart: DGEES, output wr]

  36. Linear Least Squares has best accuracy, Neural Nets worst [chart: accuracy by error threshold δ across training sets]

  37. Linear Least Squares has best accuracy, Neural Nets worst [chart: accuracy by training set: Inj-R, DataInj, DataUni, DataUni-R, DataInj-R]

  38. Evaluated predictors on randomly-generated applications • Application has a constant number of levels • Constant number of operations per level • Operations take as input data from prior level(s)
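A random layered application of the kind described can be generated as follows; the choice of two dependencies per operation and the naming scheme are assumptions for illustration:

```python
import random

def random_app(levels, ops_per_level, routines, rng=random):
    """Generate a random layered application: a list of
    (name, routine, deps) where each operation at level > 0 reads
    the outputs of operations chosen from earlier levels."""
    app = []
    for lvl in range(levels):
        for k in range(ops_per_level):
            name = f"op_{lvl}_{k}"
            if lvl == 0:
                deps = []  # level 0 reads the application inputs directly
            else:
                prior = [f"op_{l}_{j}" for l in range(lvl)
                         for j in range(ops_per_level)]
                # Assumption: each operation reads two earlier outputs
                deps = rng.sample(prior, min(2, len(prior)))
            app.append((name, rng.choice(routines), deps))
    return app
```

The resulting dependence graph can then be fed to the simulate-and-average prediction described on slide 21.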

  39. Neural Nets: poor accuracy for application vulnerability prediction [chart: recorded vs. predicted; sigmoid transfer function, 3 hidden layers]

  40. Linear Least Squares: good accuracy, restricted [chart: recorded vs. predicted]

  41. SVMs: good accuracy, general [chart: recorded vs. predicted; rbf kernel, gamma = 1.0]

  42. Work is still in progress • Correlating the accuracy of input/output predictors with the accuracy of application prediction • More detailed fault injection • Applications with loops • Real applications
