SLIDE 1 1/17
AsHES 2016
Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
Yulu Jia, Piotr Luszczek (presenting), Jack Dongarra
University of Tennessee, Knoxville, Innovative Computing Laboratory
The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 23, 2016
SLIDE 2 2/17
From a Single Accelerated Node to Exascale
- MTTF of the entire machine depends on reliability of each node
- The MTTF of the entire machine can be statistically computed
based on single-node reliability for a number of distributions
– Exp(1/100)
– Weibull(0.7, 1/100)
– Weibull(0.5, 1/100)
- See Yves Robert's work for detailed analysis
One node (~10^3 cores):          MTTF = 1 year   MTTF = 10 years   MTTF = 120 years
Exascale machine (~10^6 nodes):  30 seconds      5 minutes         1 hour
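For the exponential row of the table above, the scaling can be checked directly: with independent node failures, the minimum of N i.i.d. exponential lifetimes is again exponential, so the machine MTTF is the node MTTF divided by the node count. A minimal sketch (the Weibull rows need the fuller order-statistics analysis cited above):

```python
# Sketch: with independent, exponentially distributed node lifetimes, the
# minimum of N i.i.d. Exp(lambda) lifetimes is Exp(N*lambda), so the machine
# MTTF is the node MTTF divided by the node count.

SECONDS_PER_YEAR = 365 * 24 * 3600
NODES = 10**6                      # exascale machine, ~10^6 nodes

def machine_mttf_exponential(node_mttf_seconds, n_nodes):
    """Machine MTTF for n_nodes independent exponential node lifetimes."""
    return node_mttf_seconds / n_nodes

for years in (1, 10, 120):
    s = machine_mttf_exponential(years * SECONDS_PER_YEAR, NODES)
    print(f"node MTTF {years:3d} y -> machine MTTF {s:9.1f} s")
# node MTTF of 1 year gives ~31.5 s, matching the "30 seconds" entry above
```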
SLIDE 3 3/17
Field Data on Resilience
– Transient (soft) errors are caused by: cosmic rays (alpha particles, high-energy
and/or thermal neutrons)
– Transient errors occur in practice:
- Commercial study in 2000 by Sun Microsystems
- ASC Q supercomputer at Los Alamos in 2003
- Jaguar (Cray XT5) at ORNL
– Nearly 225k cores
– 1253 separate node crashes during 537 days (Aug 2008 - Feb 2010)
– Or 2.33 failures per day
– Or less than 10 hours of failure-free operation
- … and any non-ECC machine
- Accelerators are common
– In many shared-memory systems
– Supercomputers
- Tianhe-1A, Titan (Cray XK7, 560k cores), Tianhe-2 (3M+ cores)
- And at Exascale ~1 billion threads and MTTF < 1 day!
SLIDE 4 4/17
Hessenberg Reduction (HRD) and Its Applications
- HRD = Hessenberg Reduction
– General (non-symmetric) eigenvalue problem
– Generalized eigenvalue problem
– Structural mechanics
– Spectral graph analysis
– Control theory
– …
– Cost: (10/3)n³ + O(n²) floating-point operations
– Both compute bound and memory bound
SLIDE 5 5/17
Numerical Eigenvalue Algorithm Recap
Ax = λx
Ax = λBx
- We transform matrix A into Hessenberg matrix H with the same
eigenvalues:
H1 = Q1^T A Q1
H2 = Q2^T H1 Q2
…
Hn = Qn^T Hn-1 Qn ≡ H
- H is in Hessenberg form:
- Iterative algorithm is used to find eigenvalues of H
* * * * * *
* * * * * *
  * * * * *
    * * * *
      * * *
        * *
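The transformation above can be sketched with an unblocked Householder reduction; this is illustrative only, since the paper's GPU implementation uses the blocked panel-update variant shown on the next slides:

```python
# Illustrative sketch (NumPy): unblocked Householder reduction to upper
# Hessenberg form, H = Q^T A Q. The paper's GPU code uses the blocked
# panel-update variant (DLAHRD / DGEMM / DLARFB); this is the textbook loop.
import numpy as np

def hessenberg_reduce(A):
    H = np.array(A, dtype=float)      # working copy, reduced in place
    n = H.shape[0]
    Q = np.eye(n)                     # accumulates Q = P_0 P_1 ...
    for k in range(n - 2):
        x = H[k+1:, k].copy()
        alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
        v = x
        v[0] -= alpha                 # Householder vector v = x - alpha*e1
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue                  # column already in Hessenberg form
        v /= nv
        # Similarity transform H <- P H P with P = I - 2 v v^T
        H[k+1:, k:] -= 2.0 * np.outer(v, v @ H[k+1:, k:])
        H[:, k+1:] -= 2.0 * np.outer(H[:, k+1:] @ v, v)
        Q[:, k+1:] -= 2.0 * np.outer(Q[:, k+1:] @ v, v)
    return H, Q

A = np.random.default_rng(0).standard_normal((6, 6))
H, Q = hessenberg_reduce(A)
assert np.allclose(np.tril(H, -2), 0.0)   # zeros below first subdiagonal
assert np.allclose(Q @ H @ Q.T, A)        # similarity: same eigenvalues as A
```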
SLIDE 6 6/17
Panel-Update Approach
[Figure: panel-update pipeline: Begin iteration 1 → Panel 2 → Right Update 2 → Left Update 2 → Panel 3]
SLIDE 7 7/17
Propagation of Error During Hessenberg Reduction
SLIDE 8 8/17
Techniques for Error Protection and Failure Recovery
- Algorithm-Based Fault Tolerance
– Kuang-Hua Huang, Jacob Abraham, ABFT for Matrix Operations
- Implementation on systolic arrays
– Takes advantage of additional mathematical relationship(s)
- Already present in algorithm
- Introduced (cheaply, if possible) by ABFT – usually weighted sums
- Diskless checkpointing
– Additional (small) data is kept in live processes or extra memory
– No need for full I/O checkpointing
SLIDE 9 9/17
From Huang and Abraham: Checksum Mat-Mat-Mul
[Figure: column-checksum matrix times row-checksum matrix equals full-checksum product Cf]
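The encoding in the figure above can be sketched directly: append a row of column sums to A and a column of row sums to B; the product then carries checksums that detect, locate, and correct a single corrupted entry in the data block (a minimal single-error sketch, not the paper's full scheme):

```python
# Sketch of the Huang-Abraham checksum scheme: A carries an extra row of
# column sums, B an extra column of row sums; their product Cf then carries
# checksums that detect, locate, and correct a single corrupted entry.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum A
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum B
Cf = Ac @ Br                                        # full-checksum product

Cf[1, 2] += 5.0          # inject a transient error into the data block

# Residuals against the checksums locate the faulty row and column
row_res = Cf[:-1, :-1].sum(axis=1) - Cf[:-1, -1]    # one entry per row
col_res = Cf[:-1, :-1].sum(axis=0) - Cf[-1, :-1]    # one entry per column
i = int(np.argmax(np.abs(row_res)))
j = int(np.argmax(np.abs(col_res)))

Cf[i, j] -= row_res[i]   # single-error correction
assert (i, j) == (1, 2)
assert np.allclose(Cf[:-1, :-1], A @ B)
```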
SLIDE 10 10/17
Fault Tolerant Hessenberg Reduction
[Figure: Begin iteration → Factorize panel (DLAHRD) → Right update (DGEMM) → Left update (DGEMM) → Trailing update (DLARFB) → End iteration]
SLIDE 11 11/17
HRD: Extra Computation for Single-Error Protection
[Figure: extra checksum computations implemented with two additional DGEMV calls]
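The slide's exact DGEMV kernels are in the figure, but the general idea can be illustrated: a checksum vector can be carried through a panel's rank-1 update with a cheap fixup instead of a full recomputation (an illustration only, not the paper's exact kernels):

```python
# Illustration (not the paper's exact kernels): a column-checksum vector
# c = ones^T A can be carried through a rank-1 panel update A <- A - v w^T
# with an O(n) fixup, c <- c - (ones^T v) w, instead of recomputing sums;
# the paper realizes analogous checksum updates with extra DGEMV calls.
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
c = A.sum(axis=0)            # column checksums of A

v = rng.standard_normal(n)
w = rng.standard_normal(n)

A -= np.outer(v, w)          # rank-1 update, as in a panel step
c -= v.sum() * w             # cheap checksum fixup

assert np.allclose(c, A.sum(axis=0))
```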
SLIDE 12 12/17
HRD: Extra Computation for Two-Error Protection
[Figure: two-error recovery solves a small linear system via DGETRF]
SLIDE 13 13/17
Performance with Error Protection (Error in Panel)
Platform: Intel Sandy Bridge Xeon E5-2670 2.6 GHz + MKL 11.2, NVIDIA Kepler K40c + cuBLAS 7
SLIDE 14 14/17
Performance with Error Protection (Error in Upper)
SLIDE 15 15/17
Performance with Error Protection (Error in Trailing)
SLIDE 16 16/17
Numerical Accuracy
- Numerical accuracy can be measured in various forms
– Scaled residual (backward error)
– Orthogonality of Q's
- Accuracy can be different depending on:
– Location of error: panel, upper, trailing
– Time of error: beginning, middle, end of HRD
- Summary of numerical results for N=1k,…,10k
– Errors in non-fault-tolerant code: 10^-18 to 10^-17
– Errors in upper and trailing on the order of 10^-18 to 10^-17
– Errors in panel on the order of 10^-15 to 10^-14
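The two metrics named above can be written out explicitly for a reduction A ≈ Q H Q^T (a sketch; `accuracy_metrics` is an illustrative helper, not a routine from the paper):

```python
# Sketch of the two accuracy measures: the scaled residual
# ||A - Q H Q^T|| / (n ||A|| eps) and the orthogonality of Q,
# ||I - Q^T Q|| / (n eps). Both stay O(1)-O(10) for a backward-stable run.
import numpy as np

def accuracy_metrics(A, H, Q):
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps
    residual = np.linalg.norm(A - Q @ H @ Q.T) / (n * np.linalg.norm(A) * eps)
    orthogonality = np.linalg.norm(np.eye(n) - Q.T @ Q) / (n * eps)
    return residual, orthogonality

# Demo with an exact similarity transform: both metrics remain small
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
r, o = accuracy_metrics(A, Q.T @ A @ Q, Q)
assert r < 1e3 and o < 1e3
```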
SLIDE 17 17/17
Conclusions and Future Work
- Conclusions:
– Presented design and analysis of fault-tolerant Hessenberg reduction
– Methods used: ABFT, diskless checkpointing
– Hardware used: GPU accelerator
– Minimal overhead in performance
– About a 2-digit loss for some scenarios, but still accurate in working precision
- Future work:
– Address all two-sided factorizations within a single framework
– Support for upcoming accelerators:
- Intel KNL and Skylake
- NVIDIA Pascal, Tegra, Jetson, Denver2
- AMD Polaris, Zen
- Google TPU