Radiation Reliability Issues in Current and Future Supercomputers – PowerPoint PPT Presentation


SLIDE 1

Radiation Reliability Issues in Current and Future Supercomputers

September 26th 2017 – Grenoble, France

PAOLO RECH

SLIDE 2

Sponsors

SLIDE 3

HPC reliability importance

SLIDES 4–6

Available Accelerators

Modern parallel accelerators offer:

  • Low cost
  • Flexible platform
  • High efficiency (low per-thread consumption)
  • High computational power and frequency
  • Huge amount of resources
  • Reliability? (what is the error rate?)

[Photos: NVIDIA Kepler K40 and Intel Xeon Phi.]

SLIDE 7

Titan

Titan (Oak Ridge National Lab) has 18,688 GPUs, so the probability that some GPU gets corrupted is high: Titan's MTBF for detected uncorrectable errors is ~44 h*.

*(field and experimental data from HPCA'15)
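To put the ~44 h figure in perspective, here is a minimal sketch of the arithmetic, assuming identical and independent GPUs (so failure rates simply add):

    # With n independent devices, failure rates add: MTBF_system = MTBF_device / n
    n_gpus = 18_688
    mtbf_system_h = 44.0                      # Titan-wide MTBF (slide above)
    mtbf_per_gpu_h = mtbf_system_h * n_gpus   # ~822,000 h
    print(f"per-GPU MTBF = {mtbf_per_gpu_h:,.0f} h = {mtbf_per_gpu_h / 8766:.0f} years")

A single card alone looks perfectly reliable (~94 years between failures); only the scale of the machine makes the problem visible.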

SLIDE 8

HPC bad stories

Big Mac (Virginia Tech's Advanced Computing facility, 2003):

  • 1,100 Apple Power Mac G5 machines
  • Could not boot because of the failure rate
  • The Power Mac G5 did not have error-correcting code (ECC) memory
  • Big Mac was broken apart and sold online

Jaguar (#1 in the 2009 Top500 list):

  • 360 terabytes of main memory
  • 350 ECC errors per minute

ASCI Q (#2 in the 2002 Top500 list):

  • Built with AlphaServers
  • 7 teraflops
  • Could not run for more than 1 h without crashing
  • After adding metal side panels it could last 6 h before crashing
  • The address buses on the microprocessors were unprotected (causing the crashes)

SLIDE 9

Outline

The origins of the issue:
  § Radiation Effects Essentials
  § Error Criticality in HPC
Understand the issue:
  § Experimental Procedure
  § K40 vs Xeon Phi
Toward the solution of the issue:
  § ECC – ABFT – Duplication
  § Selective Hardening
What's the Plan?


SLIDES 11–12

Terrestrial Radiation Environment

Cosmic rays can be energetic enough to pass through the Van Allen belts. Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, and neutrons.

13 n/(cm²·h) @ sea level*

*JEDEC JESD89A Standard

SLIDES 13–14

Altitude and Radiation

Maximum ionization occurs at ~13 km above sea level.

[Plot: neutron flux vs altitude, with LANL marked at high altitude.]

SLIDE 15

Radiation Effects – Soft Errors

Soft errors: the device is not permanently damaged, but an ionizing particle may generate:

  • One or more bit-flips: a Single Event Upset (SEU) or a Multiple Bit Upset (MBU)
  • A transient voltage pulse in the logic: a Single Event Transient (SET)

[Diagram: an ionizing particle striking a flip-flop (SEU/MBU) and combinational logic feeding a FF (SET).]

SLIDE 16

Silent Data Corruption vs Crash

Soft errors in:
  • data cache
  • register files
  • logic gates (ALU)
  • scheduler
→ Silent Data Corruption (SDC)

Soft errors in:
  • instruction cache
  • scheduler / dispatcher
  • PCI-e bus controller
→ DUE (crash)

SLIDE 17

Radiation Effects on Parallel Accelerators

[Diagram: a CUDA GPU with DRAM, a blocks scheduler and dispatcher, an L2 cache, and many Streaming Multiprocessors (SMs); each SM holds an instruction cache, warp schedulers and dispatch units, a register file, cores, and shared memory / L1 cache. A single strike (X) can corrupt several cores and SMs at once.]

SLIDE 18

Output Correctness in HPC

A single fault can propagate to several parallel threads: multiple corrupted elements.

SLIDES 19–20

Output Correctness in HPC

A single fault can propagate to several parallel threads: multiple corrupted elements.

Not all SDCs are critical for HPC applications:

  • the error can lie within the intrinsic variance of floating-point computation
  • values in a given range are accepted as correct in physical simulations
  • imprecise computing is being applied to HPC

Goal: quantify and qualify SDCs in NVIDIA and Intel architectures.

SLIDE 21

Outline (shown again; next section: understanding the issue)

SLIDE 22

Radiation Test Facilities

[Photos: irradiation of chips and electronics.]

SLIDE 23

Experimental Setup

SLIDE 24

Radiation Tests are NOT for dummies

What can go (and actually went) wrong:

  • Ethernet cable failures
  • BIOS checksum errors
  • HDD failures
  • Linux GRUB failure
  • power plug failure (wow, this was risky)
  • board boot failure
  • GPU fell off the bus (this was funny)
  • mic is lost
  • etc… etc… etc…
  • Heather/Sean, can you add something to the list?

SLIDE 25

GPU Radiation Test Setup

[Photos: devices under test include microcontrollers, FPGAs, SoCs, flash memories, GPUs, and APUs.]

SLIDE 26

GPU Radiation Test Setup

The GPU power control circuitry is kept out of the beam.

[Photos: NVIDIA K40, Intel Xeon Phi, desktop PCs, AMD APU.]

SLIDES 27–28

Neutron Spectrum

@LANSCE: 1.8×10^6 n/(cm²·h); @NYC: 13 n/(cm²·h)

We test each architecture for 800 h, simulating 9.2×10^8 h of natural radiation (~91,000 years).

All the collected SDCs are publicly available:

https://github.com/UFRGS-CAROL/HPCA2017-log-data
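A hedged sketch of the standard accelerated-test bookkeeping behind these numbers (JESD89A-style; the error count below is a hypothetical placeholder, and with the nominal fluxes alone the equivalent exposure comes out near 1.1×10^8 h, so the 9.2×10^8 h figure presumably reflects the actual delivered fluence across the campaigns):

    # Acceleration factor and FIT extraction from a beam test (sketch).
    flux_beam = 1.8e6      # n/(cm^2 h) at LANSCE (slide)
    flux_nyc = 13.0        # n/(cm^2 h) reference terrestrial flux (JESD89A)
    hours_in_beam = 800.0

    acceleration = flux_beam / flux_nyc            # ~1.4e5
    natural_hours = hours_in_beam * acceleration   # equivalent natural exposure

    errors_observed = 120.0                        # hypothetical count
    fluence = flux_beam * hours_in_beam            # n/cm^2 delivered
    sigma = errors_observed / fluence              # device cross-section, cm^2
    fit = sigma * flux_nyc * 1e9                   # failures per 10^9 device-hours
    print(f"{natural_hours:.2e} natural hours, FIT = {fit:.1f}")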

SLIDE 29

Selected Algorithms

We select a set of benchmarks that:

  • stimulate different resources
  • are representative of HPC applications
  • minimize error masking (high AVF)

  • DGEMM: matrix multiplication
  • lavaMD: particle interactions
  • Hotspot: heat simulation
  • Needleman–Wunsch: biology
  • CLAMR: DOE workload
  • Quick-, Merge-, and Radix-Sort
  • Matrix Transpose: memory
  • Gaussian

SLIDE 30

Xeon Phi vs K40 SDC Rate

[Chart: SDC relative FIT [a.u.], 1–1000 log scale, for Hotspot, CLAMR (N/A on the Xeon Phi), lavaMD (2^15, 2^19, 2^23 inputs), and DGEMM (2^10, 2^11, 2^12 inputs), Xeon Phi vs K40.]

The Xeon Phi error rate seems lower than Kepler's, but:

  • the Xeon Phi is built in 3D tri-gate, Kepler in planar CMOS
  • the Xeon Phi and the K40 have different throughput

SLIDE 31

Parallelism Management Reliability

[Charts: relative FIT [a.u.] vs input size for lavaMD (2^15, 2^19, 2^23) and DGEMM (2^10, 2^11, 2^12), K40 vs Xeon Phi.]

~95% of processor resources are already used with the smallest input. Increasing the input size increases the number of threads:

  • the Xeon Phi error rate remains constant (<20% variation)
  • the K40 SDC error rate increases with input size

SLIDE 32

Parallelism Management Reliability

K40: FIT increases with input size. The hardware scheduler is prone to corruption, and the data of the 2,048 active threads is kept in the register file.

Xeon Phi: constant FIT rate, so the embedded OS is doing fine: only 4 threads/core are maintained on chip; the other threads' data stays in main memory (not exposed).

SLIDE 33

Parallelism Management Reliability

[Chart: DGEMM GFlops vs matrix size (2^9×2^9 up to 2^13×2^13): Xeon Phi GFlops are almost constant; K40 GFlops rapidly increase.]

K40 throughput increases with input size. The reliability vs performance trade-off should be considered.

SLIDES 34–36

Mean Workload Between Failures

Raising the number of parallel threads raises both the error rate and the throughput. Which architecture produces a higher amount of data before experiencing a failure? Is there a sweet spot?
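One way to make the sweet-spot question concrete is the MWBF bookkeeping sketched below (a hedged reading of the metric: data produced per unit time divided by errors per unit time; all numbers are hypothetical placeholders, not measurements from the talk):

    # MWBF: how much data is produced, on average, before one failure.
    def mwbf(bytes_per_exec: float, exec_time_h: float, errors_per_hour: float) -> float:
        data_rate = bytes_per_exec / exec_time_h   # bytes produced per hour
        return data_rate / errors_per_hour         # bytes produced per failure

    # Larger inputs raise throughput but (on the K40) also the error rate;
    # MWBF captures which effect wins.
    out_bytes = 8 * (2 ** 12) ** 2                 # one 2^12 x 2^12 double matrix
    print(mwbf(out_bytes, exec_time_h=0.001, errors_per_hour=1e-3))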

SLIDE 37

DGEMM MWBF

The Xeon Phi MWBF decreases significantly with input size. Even though it is more prone to corruption, Kepler produces more correct data (if its parallelism is exploited).

SLIDES 38–39

Quantify and Qualify SDCs

We characterize each SDC by three metrics:

  • number of incorrect elements
  • relative error: how different the error is from the expected value
  • spatial locality of the corrupted elements: line, square, or random

[Diagram: corrupted elements (x) arranged along a line, clustered in a square, or scattered at random.]
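The three metrics are easy to state precisely; a minimal sketch follows, assuming outputs are 2-D numpy arrays (the locality rules here are illustrative thresholds, not the talk's exact classifier):

    import numpy as np

    def sdc_metrics(expected: np.ndarray, observed: np.ndarray):
        wrong = expected != observed
        n_wrong = int(wrong.sum())            # 1) number of incorrect elements
        with np.errstate(divide="ignore", invalid="ignore"):
            rel = np.abs((observed - expected) / expected)
        max_rel = float(np.nanmax(np.where(wrong, rel, 0.0))) if n_wrong else 0.0
        rows, cols = np.nonzero(wrong)        # 3) spatial locality of corruption
        if n_wrong <= 1:
            locality = "single"
        elif len(set(rows)) == 1 or len(set(cols)) == 1:
            locality = "line"                 # all errors share a row or column
        elif (rows.ptp() + 1) * (cols.ptp() + 1) <= 4 * n_wrong:
            locality = "square"               # errors form a tight block
        else:
            locality = "random"
        return n_wrong, max_rel, locality     # 2) max_rel is the relative error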

SLIDES 40–46

Number of Incorrect Elements vs Relative Error

[Scatter plots: number of corrupted elements vs relative error, for DGEMM and lavaMD, K40 vs Xeon Phi. Farther right means a greater difference from the expected value; farther up means more corrupted elements. BAD: a high number of corrupted elements that are very different from the expected output.]

DGEMM: the K40 shows few corrupted elements, with values similar to the expected ones; the Xeon Phi shows a lot of corrupted elements, very different from the expected values.

lavaMD: both the K40 and the Xeon Phi have few corrupted elements, but the K40 corruptions are very different from the expected values.

Purely arithmetic operations are more reliable (and faster) on the K40 (GPUs have shorter and faster pipelines). The Xeon Phi is more reliable for finite-difference methods (lavaMD), which are based on transcendental functions (exp).

SLIDE 47

Outline (shown again; next section: toward the solution of the issue)

SLIDE 48

Experimental Results (ECC OFF)

[Chart: K20 FIT, 1–10000 log scale, crashes and SDCs, for MxM, MTrans, FFT, NW, lavaMD, and Hotspot. Data from Oliveira et al., IEEE Trans. on Computers, 2016.]

The ECC is a Single Error Correction, Double Error Detection (SECDED) code: a word with a single bit-flip is corrected (OK); a word with a double bit-flip is only detected (X).
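The correct/detect behavior is easiest to see on a tiny code; below is a hedged sketch using an extended Hamming(8,4) SECDED code (real GPU ECC protects much wider words, but the logic is the same):

    # SECDED with extended Hamming(8,4): correct 1 flip, detect 2 flips.
    def encode(d):                       # d: four data bits
        p1, p2, p3 = d[0]^d[1]^d[3], d[0]^d[2]^d[3], d[1]^d[2]^d[3]
        word = [p1, p2, d[0], p3, d[1], d[2], d[3]]
        return word + [sum(word) % 2]    # overall parity enables double detection

    def decode(w):
        syndrome = 0
        for pos in range(1, 8):          # XOR of 1-indexed set-bit positions
            if w[pos - 1]:
                syndrome ^= pos
        parity_ok = sum(w) % 2 == 0
        if syndrome == 0:
            return "OK" if parity_ok else "corrected (overall parity bit)"
        if not parity_ok:                # single error: syndrome gives its position
            w[syndrome - 1] ^= 1
            return "corrected"
        return "double error detected: uncorrectable (DUE)"

    w = encode([1, 0, 1, 1]); w[2] ^= 1
    print(decode(w))                     # one flip  -> corrected
    w = encode([1, 0, 1, 1]); w[2] ^= 1; w[5] ^= 1
    print(decode(w))                     # two flips -> detected only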

SLIDE 49

ECC ON – SDC

[Chart: K20 SDC FIT, 1–10000 log scale, ECC OFF vs ECC ON, for MxM, FFT, NW, lavaMD, and Hotspot.]

ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence).

SLIDE 50

ECC ON – Crash

[Chart: K20 crash FIT, 1–10000 log scale, ECC OFF vs ECC ON, for MxM, FFT, NW, lavaMD, and Hotspot.]

ECC increases the crash FIT by about 50% (there is almost no code dependence): double bit errors cause a crash, and the scheduler is not protected.

SLIDE 51

ECC ON – SDC vs Crashes

[Chart: K20 FIT, 1–10000 log scale, SDC vs crash, for MxM, FFT, NW, lavaMD, and Hotspot.]

When the ECC is ON, crashes are more likely to occur than SDCs (this is GOOD for HPC centers!).

SLIDE 52

Algorithm-Based Fault Tolerance (ABFT)

ABFT is a technique designed specifically for an algorithm. It requires input coding, algorithm modification, and output decoding with error detection/correction.

[Diagram: for A × B = M, a column-checksum row is appended to A and a row-checksum column to B; the col-checks and row-checks on M flag corrupted elements (X). Freivalds '79; Huang and Abraham '84; Rech et al., TNS '13.]
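For matrix multiplication the scheme is compact; a hedged numpy sketch of the Huang-Abraham checksum idea follows (detection only; single-error correction uses the row/column intersection):

    import numpy as np

    def abft_matmul(A, B):
        Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum row
        Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum column
        M = Ac @ Br                       # checksums propagate through the GEMM
        C = M[:-1, :-1]                   # the actual product
        bad_rows = ~np.isclose(M[:-1, -1], C.sum(axis=1))   # row-checks
        bad_cols = ~np.isclose(M[-1, :-1], C.sum(axis=0))   # col-checks
        return C, bad_rows, bad_cols

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    C, br, bc = abft_matmul(A, B)
    assert not br.any() and not bc.any()  # clean run: all checks pass
    # A single corrupted C[i, j] trips exactly bad_rows[i] and bad_cols[j]:
    # the intersection locates the element, and the checksum gap corrects it.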

SLIDE 53

FFT Hardening Idea*

[Diagram: for a 64-point FFT, the inputs x0 … xN−1 are folded into a weighted checksum with weights 1/(2 + w^(-k)); comparing it against the corresponding combination of the N outputs flags an error.]

*J. Y. Jou and Abraham '88
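A hedged, simplified relative of that check fits in a few lines using DFT linearity: the sum of all FFT outputs equals N·x[0], so one extra reduction flags a corrupted output (Jou and Abraham's weights refine the same idea for numerical robustness):

    import numpy as np

    def checked_fft(x, rtol=1e-9):
        X = np.fft.fft(x)
        # Linearity: sum_k X[k] = N * x[0]; a cheap post-condition on the output.
        if not np.isclose(X.sum(), len(x) * x[0], rtol=rtol):
            raise RuntimeError("FFT sum-check failed: output corrupted")
        return X

    x = np.random.rand(64)          # a 64-point FFT, as on the slide
    X = checked_fft(x)              # passes on a clean run
    X[10] += 1.0                    # simulate an SDC in one output element
    assert not np.isclose(X.sum(), 64 * x[0])   # the check now fails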

SLIDE 54

ECC vs ABFT

[Chart: FIT, 1–10000 log scale, SDC and crash, for MxM and FFT: unhardened vs ECC vs ABFT.]

ECC reduces the SDC FIT by ~10×; ABFT by ~56×! ECC increases crashes by 50%; ABFT by only 10%!

SLIDE 55

Duplication With Comparison

  • Spatial DWC: blocks i and i+N are duplicated
  • E-O spatial DWC: blocks i and i+1 are duplicated
  • Time DWC: a thread executes the operations twice

[Diagram: blocks a, b, c, d and their copies a', b', c', d' scheduled over time on SM0/SM1 for each of the three schemes.]
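Time DWC is the simplest to sketch; below is a hedged, function-level stand-in for the CUDA kernels (re-execute in place and compare, so no data copy is required, which is why its overhead is ~2× rather than 2.5×):

    # Time duplication-with-comparison at function granularity (sketch).
    def dwc_time(kernel, *args):
        first = kernel(*args)
        second = kernel(*args)       # the same thread executes the work twice
        if first != second:          # a mismatch exposes a silent data corruption
            raise RuntimeError("DWC mismatch: SDC detected")
        return first

    result = dwc_time(lambda a, b: a * b + 1, 6, 7)   # 43 on a clean run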

SLIDE 56

Hotspot – DWC Results*

[Chart: FIT, 1–1000 log scale, SDC and crash: unhardened, ECC, spatial DWC, E-O spatial DWC, time DWC.]

Spatial DWC detects all SDCs; spatial E-O detects 80% of SDCs; time DWC detects 90% of SDCs. Only time DWC reduces crashes (no additional block scheduling is required). DWC is promising: it is generic, easily implemented, and effective… BUT the execution-time overhead is 2.5× for spatial DWC and spatial E-O, and 2× for time DWC (data is not copied).

*details in Oliveira et al., IEEE Trans. Nucl. Sci., 2014

SLIDE 57

Duplicate only what REALLY matters

What's next? Selective hardening!

Analyze SDC criticality: are there "acceptable" SDCs?

Example from CLAMR (DOE workload) experimental results: one SDC causes a single-pixel error, while another causes a huge error.

SLIDES 58–61

Tolerable SDCs

[Chart: fraction of tolerable SDCs vs acceptable output difference, for the Xeon Phi, K40 with ECC, K40, and Titan X. At the left, the output must match the expected output exactly (0% tolerance); moving right increases the acceptable difference at the output.]

If we accept a 2.5% variance from the expected value, more than 60% of SDCs can be tolerated.
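The tolerance analysis itself is a one-liner per output; a minimal sketch, assuming expected/observed outputs as numpy arrays (sweeping the threshold is what produces curves like the ones plotted here):

    import numpy as np

    def tolerable(expected: np.ndarray, observed: np.ndarray, tol: float) -> bool:
        # An SDC is tolerable when every element stays within `tol` relative error.
        rel = np.abs(observed - expected) / np.maximum(np.abs(expected), 1e-30)
        return bool((rel <= tol).all())

    exp = np.ones(100)
    obs = exp.copy(); obs[3] = 1.02                    # one element 2% off
    print(tolerable(exp, obs, 0.0))                    # False: exact match required
    print(tolerable(exp, obs, 0.025))                  # True: within 2.5% variance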

SLIDES 62–63

Tolerable SDCs

[Chart: tolerable SDCs for Gaussian, DGEMM, lavaMD, and Hotspot on the K40, K40 with ECC, and Titan X.]

Hotspot: with a 0.1% tolerance, the error rate is reduced by 90%!

SLIDES 64–65

Duplicate only what REALLY matters

What's next? Selective hardening!

1. Analyze SDC criticality: are there "acceptable" SDCs?
2. Detect the SW/HW causes of critical SDCs, via code analysis and fault injection (NVIDIA SASSIFI and UFRGS CAROL-FI)
3. Harden selected portions of the code
4. Evaluate the enhanced reliability and performance

SLIDE 66

SASSI-FI and CAROL-FI

SASSI-FI: NVIDIA architectural-level fault injector.

SLIDE 67

SASSI-FI and CAROL-FI

SASSI-FI: NVIDIA architectural-level fault injector.

CAROL-FI: UFRGS high-level fault injector for the Xeon Phi and any x86-based processor. It modifies the content of currently allocated memory. Fault-injector requirements:

  – GDB with Python support
  – OS interruption signals
  – the source code compiled in debug mode
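The core injection step can be pictured with a short GDB-Python sketch (assumed, not CAROL-FI's actual source; it is meant to be sourced inside a gdb session attached to the paused target, and the symbol name `buffer` is a hypothetical variable of that program):

    # Minimal single-bit-flip injection via GDB's Python API (sketch).
    import random
    import gdb   # available only inside gdb's embedded Python interpreter

    inferior = gdb.selected_inferior()
    addr = int(gdb.parse_and_eval("&buffer"))         # address of a live variable
    size = int(gdb.parse_and_eval("sizeof(buffer)"))

    target = addr + random.randrange(size)            # pick a random byte
    byte = bytearray(inferior.read_memory(target, 1))
    byte[0] ^= 1 << random.randrange(8)               # flip one random bit
    inferior.write_memory(target, bytes(byte))
    gdb.execute("continue")                           # let the program run on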

SLIDE 68

CAROL-FI

The fault model can be adapted; here we only inject single bit-flips. Overhead: ~5×.

SLIDE 69

Radiation Data vs CAROL-FI

[Charts: outcome breakdowns from radiation tests vs fault injection.] Radiation and fault injection give very different information.

SLIDE 70

CAROL-FI Results

We have injected more than 67,000 faults.

SLIDE 71

Results – DGEMM

95% of adverse outcomes come from the matrices and from loop control variables, which differ in memory occupation and in their chance of causing an SDC or a DUE.

SLIDE 72

Results – CLAMR

Most adverse outcomes come from three mesh components: sort, the k-d tree, and other mesh operations. Faults in sort and in the k-d tree are equally harmful.

SLIDE 73

Results – Hotspot

Most harmful faults come from constants and control variables. A small portion of memory causes most of the harm: easy to protect.

SLIDE 74

Results – lavaMD

Most harmful faults come from the input arrays (charge and distance). A big portion of memory causes most of the harm: hard to protect.

SLIDE 75

Results – LUD

SDCs are generated by faults in the matrices; DUEs are generated by faults in control variables.

SLIDE 76

Results – NW

SDCs and DUEs are generated by faults in the matrices (with an equal chance). A big portion of memory causes most of the harm: hard to protect.

SLIDE 77

Results

CAROL-FI insights:

  – Selective hardening will be effective for DGEMM and Hotspot (a small portion of memory causes the harm)
  – Selective hardening may not be effective for lavaMD and NW (a big portion of memory causes the harm)
  – CLAMR: specific operations should be hardened (sort and the k-d tree)

SLIDES 78–81

What's The Plan?

Exascale = 55× Titan. Can we afford a 55× error rate? Probably not.

  • We can show how SDCs appear at the output, to ease detection
  • Understand SDC criticality: not all errors significantly affect the output; there are "acceptable" SDCs
  • Use fault injection to better understand error propagation (SASSIFI: NVIDIA architectural-level fault injector; CAROL-FI: UFRGS fault injector for the Xeon Phi and x86)
  • Propose selective-hardening solutions (duplicate only what matters, what REALLY matters)

SLIDE 82

Acknowledgments

Caio Lunardi, Caroline Aguiar, Daniel Oliveira, Fernando Santos, Laercio Pilla, Vinicius Frattin, Philippe Navaux, Luigi Carro, Chris Frost, Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender, Timothy Tsai, Siva Hari, Steve Keckler, David Kaeli, the NUCAR group, Matteo Sonza Reorda, Luca Sterpone