Accurate Prediction of Soft Error Vulnerability of Scientific Applications (PowerPoint presentation transcript)


  1. Accurate Prediction of Soft Error Vulnerability of Scientific Applications • Greg Bronevetsky, Post-doctoral Fellow, Lawrence Livermore National Lab

  2. Soft error: one-time corruption of system state • Examples: memory bit-flips, erroneous computations • Caused by – chip variability – charged particles passing through transistors (decay of packaging materials: Lead-208, Boron-10; fission due to cosmic neutrons) – temperature and power fluctuations

  3. Soft errors are a critical reliability challenge for supercomputers • Real Machines: – ASCI Q: 26 radiation-induced errors/week – Similar-size Cray XD1: 109 errors/week (estimated) – BlueGene/L: 3-4 L1 cache bit flips/day • Problem grows worse with time – Larger machines ⇒ larger error probability – SRAMs growing exponentially more vulnerable per chip

  4. We must understand the impact of soft errors on applications • Soft errors corrupt application state • May lead to crashes or corrupt output • Need to detect/tolerate soft errors – State of the art: checkers/correctors for individual algorithms – No general solution • Must first understand how errors affect applications – Identify the problem – Focus efforts

  5. Prior work says very little about most applications • Prior fault analysis work focuses on injecting errors into individual applications – [Lu and Reed, SC04]: Linux + MPICH + Cactus, NAMD, CAM – [Messer et al, ICSDN00]: Linux + Apache and Linux + Java (Jess, DB, Javac, Jack) – [Some et al, AC02]: Lynx + Mars texture segmentation application … • Where’s my application?

  6. Extending vulnerability characterization to more applications • Goal: general-purpose vulnerability characterization – Same accuracy as per-application fault injection – Much cheaper • Initial steps – Fault injection into iterative linear algebra methods – Library-based fault vulnerability analysis

  7. Step 1: Analyzing fault vulnerability of iterative methods • Target domain: solvers for sparse linear problems Ax=b • Goal: understand the error vulnerability of a class of algorithms – Raw error rates – Effectiveness of potential solutions • Error model: memory bit-flips

  8. Possible run outcomes • Success: <10% error • Silent Data Corruption (SDC): ≥ 10% error • Hang: method doesn’t reach target tolerance • Abort: SegFault or failed SparseLib check
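The four run outcomes above can be captured by a small classification harness around a single solver run. A minimal Python sketch, where `solve` is a hypothetical stand-in for any SparseLib-style iterative solver (all names here are illustrative, not the talk's actual code):

```python
import numpy as np

def classify_run(solve, A, b, x_true, max_iters=10000):
    """Classify one (possibly fault-injected) solver run into the four
    outcome categories: Success, SDC, Hang, or Abort."""
    try:
        x, converged = solve(A, b, max_iters=max_iters)
    except Exception:
        # Analogue of a SegFault or a failed SparseLib sanity check
        return "Abort"
    if not converged:
        return "Hang"  # never reached the target tolerance
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    return "Success" if rel_err < 0.10 else "SDC"  # <10% error = success
```

An injection campaign would call this once per fault location and tally the four categories.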

  9. Errors cause SDCs, Hangs, and Aborts in roughly 8-10% of runs each

  10. Large-scale applications are vulnerable to silent data corruptions • Scaled to a 1-day, 1,000-processor run of an application that only calls the iterative method • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)

  11. Larger-scale applications are even more vulnerable to silent data corruptions • Scaled to a 10-day, 100,000-processor run of an application that only calls the iterative method • 10 FIT/MB DRAM (1,000-5,000 raw FIT/MB, 90%-98% effective error correction)

  12. Error Detectors [chart: outcome rates for the base configuration]

  13. Convergence detectors reduce SDC at <20% overhead [chart: outcome rates vs. base]


  15. Native detectors have little effect at little cost [chart: outcome rates vs. base]

  16. Encoding-based detectors significantly reduce SDC at high cost [chart: outcome rates vs. base]


  18. First general analysis of the error vulnerability of an algorithm class • Vulnerability analysis for a class of common subroutines • Described raw error vulnerability • Analyzed various detection/tolerance techniques – No clear winner, but useful rules of thumb

  19. Step 2: Vulnerability analysis of library-based applications • Many applications are mostly composed of calls to library routines • If an error hits some routine, its output will be corrupted • Later routines: corrupted inputs ⇒ corrupted outputs • (Work in progress)

  20. Idea: predict application vulnerability from routine profiles • Library implementors provide a vulnerability profile for each routine: – Error pattern in the routine's output after errors strike the routine – Function that maps input error patterns to output error patterns

  21. Idea: predict application vulnerability from routine profiles • Given the application's dependence graph – Simulate the effect of an error in each routine – Average over all error locations to produce the error pattern at the outputs
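The simulate-and-average idea above can be sketched over a topologically ordered dependence graph. The node/profile encoding here (a `"self"` error pattern for faults striking the routine, and a `"prop"` transfer function for propagating input errors) is an assumption for illustration, not the talk's actual data structures:

```python
import numpy as np

def simulate_injection(nodes, fault_node):
    """nodes: list of (name, deps, profile) in topological order.
    profile["self"]: output error pattern when the fault hits this node.
    profile["prop"]: maps a list of input error patterns to an output pattern.
    Returns the error pattern at every node's output."""
    pattern = {}
    for name, deps, prof in nodes:
        if name == fault_node:
            pattern[name] = prof["self"]          # fault strikes here
        elif deps:
            pattern[name] = prof["prop"]([pattern[d] for d in deps])
        else:
            pattern[name] = np.zeros_like(prof["self"])  # error-free source
    return pattern

def average_over_faults(nodes, sink):
    """Average the sink's output error pattern over all fault locations."""
    pats = [simulate_injection(nodes, name)[sink] for name, _, _ in nodes]
    return np.mean(pats, axis=0)
```

For a two-routine chain where the second routine doubles any incoming error, the average output pattern is the mean of the propagated and directly injected patterns.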

  22. Examined applications that use BLAS and LAPACK • 12 routines of cost ≥ O(n^2), double-precision real numbers – Matrix-vector multiplication: DGEMV – Matrix-matrix multiplication: DGEMM – Rank-1 update: DGER – Linear least squares: DGESV, DGELS – SVD factorization: DGESVD, DGGSVD, DGESDD – Eigenvectors: DGEEV, DGGEV, DGEES, DGGES

  23. Examined applications that use BLAS and LAPACK • 12 routines of cost ≥ O(n^2), double-precision real numbers • Executed on randomly-generated n×n matrices (n = 62, 125, 250, 500) • BLAS/LAPACK from Intel's Math Kernel Library on Opteron (MKL 10) and Itanium 2 (MKL 8) – Same results on both • Error model: memory bit-flips
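The memory bit-flip error model can be mimicked by flipping a random bit of a random IEEE-754 double in a matrix. A hypothetical helper for illustration, not the study's actual injection tool:

```python
import random
import struct

import numpy as np

def flip_bit(x: float, bit: int) -> float:
    """Flip one of the 64 bits of an IEEE-754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    bits ^= 1 << bit
    (y,) = struct.unpack("<d", struct.pack("<Q", bits))
    return y

def inject_bit_flip(M, rng=random):
    """Flip a random bit of a random entry of matrix M, in place."""
    i = rng.randrange(M.shape[0])
    j = rng.randrange(M.shape[1])
    M[i, j] = flip_bit(M[i, j], rng.randrange(64))
```

Flipping the exponent or sign bits produces large multiplicative errors, while low mantissa bits produce tiny ones, which is what makes the multiplicative-error histograms on the following slides informative.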

  24. Error patterns: multiplicative error histograms [chart: histogram for DGEMM]

  25. Output error patterns fall into a few major categories [charts: error histograms for DGGES/DGESV output beta (62×1) and output L (62×62); DGGES/DGEMM output vsr (62×62) and output C (62×62)]

  26. Error patterns may vary with matrix size [charts: DGGSVD output beta and output V for n = 62, 125, 250, 500]

  27. Input-output error transition functions: trained predictors – Linear Least Squares – Support Vector Machines (linear, 2nd-degree polynomial, and rbf kernels) – Artificial Neural Nets (3, 10, 100 hidden layers; linear, Gaussian, symmetric Gaussian, and sigmoid transfer functions)

  28. Trained on multiple input error patterns • DataInj: single-bit errors • DataInj-R: output errors of routines with DataInj inputs • UniInj: uniform multiplicative errors ∈ [-100,100] • UniInj-R: output errors of routines with UniInj inputs • Inj-R: output errors of error-injected routines

  29. Input-output error transition functions: trained predictors (Linear Least Squares, Support Vector Machines, Artificial Neural Nets) • Trained on sample input error patterns – DataInj: single-bit errors – DataInj-R: outputs of routines with DataInj inputs – UniInj: uniform multiplicative errors ∈ [-100,100] – UniInj-R: outputs of routines with UniInj inputs – Inj-R: outputs of error-injected routines
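As an illustration of how such a transition function might be trained, here is a minimal linear-least-squares fit from input error histograms to output error histograms. The histogram-vector encoding and the function names are assumptions; the talk does not specify its feature representation:

```python
import numpy as np

def fit_transition(X, Y):
    """Least-squares linear map W with X @ W ≈ Y.
    Rows of X: input error histograms from a training set (e.g. DataInj);
    rows of Y: the matching observed output error histograms."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict_output_pattern(W, x):
    """Predict the output error pattern for one input error pattern."""
    return x @ W
```

The SVM and neural-net predictors from the slide would replace the linear map with a nonlinear regressor trained on the same (X, Y) pairs.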

  30. Output errors depend on input errors • Equivalence classes – DataInj, DataInj-R | Inj-R – DataUni, DataUni-R

  31. Evaluated accuracy of all predictors on all training sets • Error metric: probability of error ≥ δ, for δ ∈ {1e-14, 1e-13, …, 2, 10, 100} [chart: recorded vs. predicted error distributions]
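The error metric above, the probability that the observed error meets or exceeds each threshold δ, can be computed directly from a sample of errors; a small sketch (the function name is illustrative):

```python
import numpy as np

def exceedance_curve(errors, deltas):
    """P(|error| >= delta) for each threshold delta, from an error sample."""
    errors = np.abs(np.asarray(errors, dtype=float))
    return np.array([(errors >= d).mean() for d in deltas])
```

Comparing the recorded and predicted curves threshold by threshold is what the accuracy charts on the following slides visualize.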

  32. Evaluated accuracy of all predictors on all training sets [charts: recorded vs. predicted error distributions]

  33. Linear Least Squares has best accuracy, Neural Nets worst • Evaluation set: union of all training sets


  35. Accuracy varies among predictors [chart: DGEES, output wr]

  36. Linear Least Squares has best accuracy, Neural Nets worst [chart: accuracy by error threshold δ across training sets]

  37. Linear Least Squares has best accuracy, Neural Nets worst [chart: accuracy by training set: Inj-R, DataInj, DataUni, DataUni-R, DataInj-R]

  38. Evaluated predictors on randomly-generated applications • Application has a constant number of levels • Constant number of operations per level • Operations take as input data from prior level(s)
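A random layered application of the kind described can be generated as follows; the choice of two dependencies per operation and the naming scheme are assumptions for illustration:

```python
import random

def random_app(levels, ops_per_level, routines, rng=random):
    """Generate a random layered application: a list of
    (name, routine, deps) where each operation at level > 0 reads
    the outputs of operations chosen from earlier levels."""
    app = []
    for lvl in range(levels):
        for k in range(ops_per_level):
            name = f"op_{lvl}_{k}"
            if lvl == 0:
                deps = []  # level 0 reads the application inputs directly
            else:
                prior = [f"op_{l}_{j}" for l in range(lvl)
                         for j in range(ops_per_level)]
                # Assumption: each operation reads two earlier outputs
                deps = rng.sample(prior, min(2, len(prior)))
            app.append((name, rng.choice(routines), deps))
    return app
```

The resulting dependence graph can then be fed to the simulate-and-average prediction described on slide 21.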

  39. Neural Nets: poor accuracy for application vulnerability prediction [chart: recorded vs. predicted; sigmoid transfer function, 3 hidden layers]

  40. Linear Least Squares: good accuracy, restricted [chart: recorded vs. predicted]

  41. SVMs: good accuracy, general [chart: recorded vs. predicted; rbf kernel, gamma = 1.0]

  42. Work is still in progress • Correlating the accuracy of input/output predictors with the accuracy of application prediction • More detailed fault injection • Applications with loops • Real applications
