Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures




SLIDE 1

AsHES 2016

Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures

Yulu Jia, Piotr Luszczek (presenting), Jack Dongarra University of Tennessee, Knoxville Innovative Computing Laboratory

The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), May 23, 2016

SLIDE 2

From a Single Accelerated Node to Exascale

  • MTTF of the entire machine depends on reliability of each node
  • The MTTF of the entire machine can be statistically computed based on single-node reliability for a number of distributions

– Exp(1/100)
– Weibull(0.7, 1/100)
– Weibull(0.5, 1/100)

  • See Yves Robert's work for detailed analysis

One node (~10^3 cores)            MTTF = 1 year   MTTF = 10 years   MTTF = 120 years
                                  ↓               ↓                 ↓
Exascale machine (~10^6 nodes)    30 seconds      5 minutes         1 hour
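The machine-level figures above are consistent with the simplest model: if node lifetimes are independent and exponentially distributed, the first failure among N nodes arrives N times sooner on average, so the machine MTTF is the node MTTF divided by N. The sketch below assumes that model only (the Weibull cases on the slide are not covered), and the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def system_mttf_seconds(node_mttf_years, num_nodes):
    """Machine MTTF in seconds, assuming independent exponential node lifetimes.

    The minimum of N independent exponential lifetimes with mean m is itself
    exponential with mean m / N.
    """
    return node_mttf_years * SECONDS_PER_YEAR / num_nodes

# Node MTTFs from the slide, scaled to a machine with ~10^6 nodes.
for years in (1, 10, 120):
    seconds = system_mttf_seconds(years, 10**6)
    print(f"node MTTF {years:>3} years -> machine MTTF {seconds:8.1f} s")
```

Under this assumption the three cases come out near 30 seconds, 5 minutes, and 1 hour, matching the rounded values on the slide.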

SLIDE 3

Field Data on Resilience

  • Soft errors...

– are caused by: cosmic rays (alpha particles, high energy and/or thermal neutrons)

– occur in practice

  • Commercial study in 2000 by Sun Microsystems
  • ASC Q supercomputer at Los Alamos in 2003
  • Jaguar (Cray XT5) at ORNL

– Nearly 225k cores
– 1253 separate node crashes during 537 days (Aug 2008-Feb 2010)
– Or 2.33 failures per day
– Or about 10 hours of failure-free operation

  • … and any non-ECC machine
  • Accelerators are common

– In many shared-memory systems
– Supercomputers

  • Tianhe-1A, Titan (Cray XK7, 560k cores), Tianhe-2 (3M+ cores)
  • And at Exascale ~1 billion threads and MTTF < 1 day!
SLIDE 4

Hessenberg Reduction (HRD) and Its Applications

  • HRD = Hessenberg Reduction

– General (non-symmetric) eigenvalue problem
– Generalized eigenvalue problem

  • Applications

– Structural mechanics
– Spectral graph analysis
– Control theory
– …

  • Complexity

10/3 n^3 + O(n^2)

– Both compute bound and memory bound
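To get a feel for the cubic term, the leading-order flop count can be evaluated directly. This is a rough sketch that ignores the O(n^2) term; the sustained rate of 100 Gflop/s in the usage line is a made-up figure for illustration, not a measurement from the talk.

```python
def hrd_flops(n):
    """Leading-order flop count of Hessenberg reduction: (10/3) n^3."""
    return 10.0 * n**3 / 3.0

def estimated_seconds(n, gflops_per_second):
    """Run-time estimate at a given sustained rate (hypothetical input)."""
    return hrd_flops(n) / (gflops_per_second * 1e9)

# Doubling the matrix size multiplies the work by 8.
print(hrd_flops(2000) / hrd_flops(1000))
# Time for n = 10,000 at an assumed 100 Gflop/s sustained rate.
print(estimated_seconds(10_000, 100.0))
```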

SLIDE 5

Numerical Eigenvalue Algorithm Recap

  • To solve

Ax = λx

or

Ax = λBx

  • We transform matrix A into Hessenberg matrix H with the same eigenvalues:

H_1 = Q_1^T A Q_1
H_2 = Q_2^T H_1 Q_2
…
H_n = Q_n^T H_{n-1} Q_n ≡ H

  • H is in Hessenberg form:

* * * * * *
* * * * * *
  * * * * *
    * * * *
      * * *
        * *

  • Iterative algorithm is used to find eigenvalues of H
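The sequence of similarity transformations can be sketched with Householder reflectors, one per column. This is the unblocked textbook variant in plain Python, written only to illustrate H = Q^T A Q; it is not the blocked panel-update GPU algorithm from the talk, and the helper names are illustrative.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def hessenberg(A):
    """Return (H, Q) with H = Q^T A Q upper Hessenberg and Q orthogonal."""
    n = len(A)
    H = [row[:] for row in A]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        # Householder vector v whose reflector zeroes H[k+2:, k].
        x = [H[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already in Hessenberg form
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        vtv = sum(t * t for t in v)
        m = len(v)
        # Left update: H <- P H with P = I - 2 v v^T / (v^T v), rows k+1..n-1.
        for j in range(n):
            s = 2.0 * sum(v[i] * H[k + 1 + i][j] for i in range(m)) / vtv
            for i in range(m):
                H[k + 1 + i][j] -= s * v[i]
        # Right update: H <- H P, columns k+1..n-1.
        for i in range(n):
            s = 2.0 * sum(H[i][k + 1 + j] * v[j] for j in range(m)) / vtv
            for j in range(m):
                H[i][k + 1 + j] -= s * v[j]
        # Accumulate the orthogonal factor: Q <- Q P.
        for i in range(n):
            s = 2.0 * sum(Q[i][k + 1 + j] * v[j] for j in range(m)) / vtv
            for j in range(m):
                Q[i][k + 1 + j] -= s * v[j]
    return H, Q

A = [[4.0, 1.0, 2.0, 3.0],
     [1.0, 3.0, 0.0, 1.0],
     [2.0, 0.0, 2.0, 5.0],
     [3.0, 1.0, 5.0, 1.0]]
H, Q = hessenberg(A)
```

Because each step is a similarity transformation, H keeps the eigenvalues of A; a cheap sanity check is that the trace (the sum of the eigenvalues) is unchanged and that entries below the first subdiagonal are zero up to rounding.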

SLIDE 6

Panel-Update Approach

[Figure: panel-update approach; labels: Begin iteration 1, Panel 2, Right Update 2, Left Update 2, Panel 3]

SLIDE 7

Propagation of Error During Hessenberg Reduction

SLIDE 8

Techniques for Error Protection and Failure Recovery

  • Algorithm-Based Fault Tolerance

– Kuang-Hua Huang, Jacob Abraham, ABFT for Matrix Operations

  • Implementation on systolic arrays

– Takes advantage of additional mathematical relationship(s)

  • Already present in algorithm
  • Introduced (cheaply, if possible) by ABFT – usually weighted sums
  • Diskless checkpointing

– Additional (small) data is kept in live processes or extra memory
– No need for full I/O checkpointing
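The "weighted sums" idea can be illustrated on a single vector: a plain sum detects an error, and a second sum with weights 1..n reveals where it is, because the ratio of the two checksum discrepancies equals the (1-based) position of the corrupted entry. This is a generic sketch of the technique, assuming exactly one corrupted entry and small integer-valued data so the arithmetic is exact; the function names are illustrative and not taken from the talk.

```python
def encode(x):
    """Return (plain sum, weighted sum with weights 1..n) of vector x."""
    s1 = sum(x)
    s2 = sum((i + 1) * xi for i, xi in enumerate(x))
    return s1, s2

def locate_and_correct(x, checksums, tol=1e-9):
    """Fix a single corrupted entry of x in place; return its index or None."""
    s1, s2 = checksums
    t1, t2 = encode(x)
    d1, d2 = t1 - s1, t2 - s2      # discrepancies: delta and (j + 1) * delta
    if abs(d1) <= tol:
        return None                # no error detected
    idx = round(d2 / d1) - 1       # position of the corrupted entry
    x[idx] -= d1                   # subtract the injected error
    return idx

x = [3.0, 1.0, 4.0, 1.0, 5.0]
checksums = encode(x)              # small extra data kept alongside x
x[2] += 8.0                        # simulated transient error
idx = locate_and_correct(x, checksums)
```

In floating point the ratio is only approximately an integer, which is why the sketch rounds it; a real implementation would also need a tolerance tied to the data's magnitude.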

SLIDE 9

From Huang and Abraham: Checksum Mat-Mat-Mul

[Figure: matrix with a checksum row × matrix with a checksum column = full-checksum matrix Cf carrying both a checksum row and a checksum column]
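The checksum product can be sketched in a few lines: append column sums to A as an extra row, append row sums to B as an extra column, and the ordinary product is then a full-checksum matrix whose last row and column still hold the sums of the result. A single corrupted entry shows up as exactly one inconsistent row and one inconsistent column. The code below is a plain-Python illustration of that scheme with illustrative helper names, assuming at most one error and that it lies in the data part rather than in the checksums themselves.

```python
def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def with_checksum_row(A):
    """Append a row holding the column sums of A."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def with_checksum_column(B):
    """Append a column holding the row sums of B."""
    return [row[:] + [sum(row)] for row in B]

def correct_single_error(Cf, tol=1e-9):
    """Repair one corrupted data entry of Cf in place; return (i, j) or None."""
    m, n = len(Cf) - 1, len(Cf[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(Cf[i][:n]) - Cf[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(Cf[i][j] for i in range(m)) - Cf[m][j]) > tol]
    if not bad_rows and not bad_cols:
        return None                      # all checksums are consistent
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single error in the data part")
    i, j = bad_rows[0], bad_cols[0]
    delta = sum(Cf[i][:n]) - Cf[i][n]    # size of the injected error
    Cf[i][j] -= delta
    return i, j

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
Cf = matmul(with_checksum_row(A), with_checksum_column(B))
Cf[1][0] += 7.0                          # simulated transient error
location = correct_single_error(Cf)
```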

SLIDE 10

Fault Tolerant Hessenberg Reduction

[Figure: flowchart of one iteration; steps: Begin iteration, Factorize panel, Right update, Left update, Trailing update, End iteration; routines labeled: DLAHRD, DGEMM, DGEMM, DLARFB]

SLIDE 11

HRD: Extra Computation for Single-Error Protection

[Figure labels: DGEMV, DGEMV]

SLIDE 12

HRD: Extra Computation for Two-Error Protection

[Figure label: DGETRF (solve)]

SLIDE 13

Performance with Error Protection (Error in Panel)

Intel Sandy Bridge Xeon E5-2670 2.6 GHz + MKL 11.2
NVIDIA Kepler K40c + cuBLAS 7

SLIDE 14

Performance with Error Protection (Error in Upper)

Intel Sandy Bridge Xeon E5-2670 2.6 GHz + MKL 11.2
NVIDIA Kepler K40c + cuBLAS 7

SLIDE 15

Performance with Error Protection (Error in Trailing)

Intel Sandy Bridge Xeon E5-2670 2.6 GHz + MKL 11.2
NVIDIA Kepler K40c + cuBLAS 7

SLIDE 16

Numerical Accuracy

  • Numerical accuracy can be measured in various forms

– Scaled residual (backward error)
– Orthogonality of Q's

  • Accuracy can be different depending on:

– Location of error: panel, upper, trailing
– Time of error: beginning, middle, end of HRD

  • Summary of numerical results for N=1k,…,10k

– Errors in non-fault tolerant code: 10^-18 to 10^-17
– Errors in upper and trailing on the order of 10^-18 to 10^-17
– Errors in panel on the order of 10^-15 to 10^-14
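The two accuracy measures can be written down concretely. The slides do not state the exact scaling, so the formulas below are an assumption: a scaled residual ||A - Q H Q^T||_F / (n ||A||_F) and an orthogonality error ||I - Q^T Q||_F / n, both in the Frobenius norm. The helper names and the small 2x2 example are illustrative only.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def fro_norm(X):
    return math.sqrt(sum(v * v for row in X for v in row))

def scaled_residual(A, Q, H):
    """Assumed form: ||A - Q H Q^T||_F / (n ||A||_F)."""
    n = len(A)
    QHQt = matmul(matmul(Q, H), transpose(Q))
    diff = [[A[i][j] - QHQt[i][j] for j in range(n)] for i in range(n)]
    return fro_norm(diff) / (n * fro_norm(A))

def orthogonality_error(Q):
    """Assumed form: ||I - Q^T Q||_F / n."""
    n = len(Q)
    QtQ = matmul(transpose(Q), Q)
    diff = [[(1.0 if i == j else 0.0) - QtQ[i][j] for j in range(n)]
            for i in range(n)]
    return fro_norm(diff) / n

# A plane rotation is exactly orthogonal up to rounding.
c, s = math.cos(0.3), math.sin(0.3)
A = [[2.0, 1.0], [1.0, 3.0]]
Q = [[c, -s], [s, c]]
H = matmul(transpose(Q), matmul(A, Q))
clean = scaled_residual(A, Q, H)

# An uncorrected error in H shows up directly in the residual.
H_bad = [row[:] for row in H]
H_bad[0][1] += 1e-3
corrupted = scaled_residual(A, Q, H_bad)
```

With these definitions the clean residual sits at rounding level, while the injected error of 1e-3 raises it by many orders of magnitude, which is the kind of gap the accuracy summary above is reporting.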

SLIDE 17

Conclusions and Future Work

  • Summary

– Presented design and analysis of fault-tolerant Hessenberg reduction
– The methods used: ABFT, diskless checkpointing
– Hardware used: GPU accelerator
– Minimal overhead in performance
– About 2-digit loss for some scenarios but still accurate in working precision

  • Future directions

– Address all two-sided factorizations within a single framework
– Support for upcoming accelerators:

  • Intel KNL and Skylake
  • NVIDIA Pascal, Tegra, Jetson, Denver2
  • AMD Polaris, Zen
  • Google TPU