SLIDE 1

How to Deal with Radiation: Evaluation and Mitigation of GPU Soft Errors

April 6th 2016 – San José, CA

Paolo Rech

SLIDE 2–4

Motivation: Automotive Applications

Pedestrian Detection System: embedded GPUs improve car safety.

Observed error: a corrupted detection output.

“The insurance does not cover those accidents caused by: […] exposure to ionizing radiation”*

*Paolo’s car insurance

SLIDE 5–6

Motivation: HPC Industry

Titan (Oak Ridge National Lab) has 18,688 GPUs, so the probability of having a corrupted GPU is high: Titan’s MTBF is ~44 h*

*(field data from Tiwari et al., HPCA’15)

The field data considers only Crashes/Hangs, because the correct output is unknown. We perform radiation experiments to also measure Silent Data Corruption (SDC) rates.

SLIDE 7–8

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 9–10

Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere and produce a shower of energetic particles: muons, pions, protons, gamma rays, neutrons.

13 n/(cm²·h) at sea level; the neutron flux increases exponentially with altitude.

SLIDE 11–13

Radiation Effects - Soft Errors

Soft Errors: the device is not permanently damaged, but the ionizing particle may generate:

  • One or more bit-flips: Single Event Upset (SEU), Multiple Bit Upset (MBU)
  • A transient voltage pulse in the logic, which can be latched by a flip-flop (FF): Single Event Transient (SET)
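Bit-flips of this kind can also be emulated in software for fault-injection studies (a direction mentioned at the end of the talk). The sketch below is a minimal, hypothetical illustration rather than the methodology used here: it flips one randomly chosen bit of a 32-bit word with an XOR mask, the way an SEU would.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>

// Emulate a Single Event Upset: flip one randomly selected bit of a 32-bit word.
uint32_t inject_seu(uint32_t word, std::mt19937 &rng) {
    std::uniform_int_distribution<int> bit(0, 31);
    return word ^ (1u << bit(rng));              // the XOR mask flips exactly one bit
}

int main() {
    std::mt19937 rng(42);
    float value = 1.0f;                          // victim datum, e.g. a register value
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);     // reinterpret the float's bits
    bits = inject_seu(bits, rng);
    float corrupted;
    std::memcpy(&corrupted, &bits, sizeof corrupted);
    std::printf("original = %g, corrupted = %g\n", value, corrupted);
    return 0;
}
```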

SLIDE 14–20

Radiation Effects on GPUs

[Figure: block diagram of a CUDA GPU — DRAM, L2 cache, blocks scheduler and dispatcher, and an array of Streaming Multiprocessors (SMs); each SM contains an instruction cache, warp schedulers, dispatch units, a register file, cores, and shared memory / L1 cache. The animation marks struck units with an X: a single particle strike in one core can end up corrupting multiple cores and several SMs.]

SLIDE 21–22

Silent Data Corruption vs Crash&Hang

Silent Data Corruption — errors in:

  • data cache
  • register files
  • logic gates (ALU)
  • scheduler

Crash & Hang — errors in:

  • instruction cache
  • scheduler / dispatcher
  • PCI-e bus controller
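In the beam experiments described next, an SDC is flagged by comparing every GPU output with a golden copy computed in advance, while crashes and hangs are caught by the host. The check itself is simple; the sketch below is a generic host-side version with illustrative names, not the exact test harness.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Compare the GPU output against a golden reference computed beforehand.
// Returns the number of mismatching elements (> 0 means an SDC occurred).
size_t count_sdc(const std::vector<float> &output,
                 const std::vector<float> &golden,
                 float tolerance = 0.0f) {
    size_t mismatches = 0;
    for (size_t i = 0; i < output.size(); ++i)
        if (std::fabs(output[i] - golden[i]) > tolerance)
            ++mismatches;
    return mismatches;
}

int main() {
    std::vector<float> golden = {1.0f, 2.0f, 3.0f};
    std::vector<float> output = {1.0f, 2.5f, 3.0f};   // pretend one element got corrupted
    size_t errors = count_sdc(output, golden);
    if (errors > 0)
        std::printf("SDC detected: %zu corrupted elements\n", errors);
    return 0;
}
```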

SLIDE 23

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 24

Radiation Test Facilities

Weapons Neutron Research (WNR) facility at LANSCE (Los Alamos)

SLIDE 25–26

Neutron Spectrum

Flux @LANSCE: 1.8×10⁹ n/(cm²·h)    Flux @NYC: 13 n/(cm²·h)

cross section [cm²] = (errors/s) / (beam flux in n/(cm²·s))

The cross section is the probability for one neutron to generate an output error.

Error Rate = cross section × flux (13 n/(cm²·h) at NYC)
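As a worked example of how these quantities combine, the sketch below starts from made-up beam counts (only the LANSCE and NYC fluxes come from the slide), derives the cross section, scales it to the natural flux to get the error rate and FIT (failures per 10⁹ device-hours), and then scales the resulting MTBF to an 18,688-GPU machine such as Titan.

```cpp
#include <cstdio>

int main() {
    // --- beam experiment (hypothetical counts, for illustration only) ---
    double observed_errors = 100.0;               // errors counted during the run
    double beam_time_s     = 3600.0;              // effective beam time [s]
    double beam_flux       = 1.8e9 / 3600.0;      // LANSCE flux [n/(cm^2 s)], from 1.8e9 n/(cm^2 h)

    // cross section [cm^2] = (errors per second) / (beam flux)
    double cross_section = (observed_errors / beam_time_s) / beam_flux;

    // --- scale to the natural environment ---
    double nyc_flux_h   = 13.0;                        // n/(cm^2 h) at sea level (NYC)
    double error_rate_h = cross_section * nyc_flux_h;  // errors per device-hour
    double fit          = error_rate_h * 1e9;          // Failures In Time (per 1e9 device-hours)
    double mtbf_h       = 1.0 / error_rate_h;          // single-device MTBF [h]

    // --- scale to a large system (Titan: 18,688 GPUs) ---
    double titan_mtbf_h = mtbf_h / 18688.0;

    std::printf("cross section = %.3e cm^2\n", cross_section);
    std::printf("FIT = %.1f, device MTBF = %.0f h, Titan MTBF = %.1f h\n",
                fit, mtbf_h, titan_mtbf_h);
    return 0;
}
```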

SLIDE 27–28

GPU Radiation Test Setup

Devices under test on the beam line: microcontrollers, FPGA SoCs, Flash, GPU, APU.

Tested devices: AMD APU, NVIDIA K20, Intel Xeon Phi, desktop PCs.

The GPU power control circuitry is kept out of the beam.

SLIDE 29

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 30

Tested Parallel Codes

  • Matrix Multiplication (linear algebra)
  • Matrix Transpose (memory)
  • FFT (signal processing)
  • Needleman–Wunsch (biology)
  • lavaMD (physical simulations)
  • Hotspot (physical simulations)
  • HOG (pedestrian detection)

The selected algorithms are heterogeneous and representative


SLIDE 31–33

Experimental Results (ECC OFF)

[Chart: Failure In Time @NYC, log scale 1–10000, for MxM, MTrans, FFT, NW, lavaMD, and Hotspot; Crashes vs SDC.]

The SDC rate varies by ~3 orders of magnitude across codes (details in Oliveira et al., Trans. Comp. 2015). The differences track how each code uses the GPU: whether execution is dominated by memory latencies, how heavily registers are employed, and the number of executed instructions.

Matrix Multiplication: 6.46×10² FIT, i.e. 1 error every 15 years on one GPU; on Titan that is 18,688 errors every 15 years (1 error every 7.3 h).

SLIDE 34

Error Correction Code - SDC

[Chart: SDC Failure In Time @NYC, log scale 1–10000, for MxM, FFT, NW, lavaMD, Hotspot; ECC OFF vs ECC ON.]

ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence).

SLIDE 35

Error Correction Code - Crash

[Chart: Crash Failure In Time @NYC, log scale 1–10000, for MxM, FFT, NW, lavaMD, Hotspot; ECC OFF vs ECC ON.]

ECC increases the Crash FIT by about 50% (there is almost no code dependence): Double Bit Errors cause a crash, and the scheduler is not protected.

SLIDE 36

ECC ON – SDC vs Crashes

[Chart: Failure In Time @NYC, log scale 1–10000, for MxM, FFT, NW, lavaMD, Hotspot with ECC ON; Crash vs SDC.]

When ECC is ON, Crashes are more likely to occur than SDCs (this is GOOD for HPC centers!).

SLIDE 37

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 38

Algorithm Based Fault Tolerance (ABFT)

ABFT is a hardening technique designed specifically for an algorithm. It requires: input coding, algorithm modification, and output decoding with error detection/correction.

[Figure: A × B = M with row and column checksums (∑); the col-check and row-check computed from M are compared against the col-sum and row-sum derived from the inputs, and the mismatching row/column pair (X) locates the error.]

Freivalds ’79; Huang and Abraham ’84; Rech et al., TNS ’13
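A minimal sketch of the row/column-checksum idea behind ABFT for matrix multiplication (in the spirit of Huang and Abraham ’84), written as plain host-side C++ for readability rather than the CUDA implementation evaluated in the talk; the matrices, tolerance, and error-handling policy are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Plain matrix multiplication C = A * B (the computation we want to protect).
Matrix multiply(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = B[0].size(), k = B.size();
    Matrix C(n, std::vector<double>(m, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            for (size_t p = 0; p < k; ++p)
                C[i][j] += A[i][p] * B[p][j];
    return C;
}

// ABFT check: row sums of C must equal A * (row-sum vector of B), and
// column sums of C must equal (column-sum vector of A) * B. A single
// corrupted element is located by the (row, column) pair that fails.
bool abft_check(const Matrix &A, const Matrix &B, const Matrix &C, double tol = 1e-9) {
    size_t n = A.size(), m = B[0].size(), k = B.size();
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {                  // row checksums
        double expected = 0.0, actual = 0.0;
        for (size_t p = 0; p < k; ++p) {
            double brow = 0.0;
            for (size_t j = 0; j < m; ++j) brow += B[p][j];
            expected += A[i][p] * brow;
        }
        for (size_t j = 0; j < m; ++j) actual += C[i][j];
        if (std::fabs(expected - actual) > tol) { std::printf("row %zu corrupted\n", i); ok = false; }
    }
    for (size_t j = 0; j < m; ++j) {                  // column checksums
        double expected = 0.0, actual = 0.0;
        for (size_t p = 0; p < k; ++p) {
            double acol = 0.0;
            for (size_t i = 0; i < n; ++i) acol += A[i][p];
            expected += acol * B[p][j];
        }
        for (size_t i = 0; i < n; ++i) actual += C[i][j];
        if (std::fabs(expected - actual) > tol) { std::printf("column %zu corrupted\n", j); ok = false; }
    }
    return ok;
}

int main() {
    Matrix A = {{1, 2}, {3, 4}}, B = {{5, 6}, {7, 8}};
    Matrix C = multiply(A, B);
    C[0][1] += 0.5;                                    // inject an SDC into one element
    if (!abft_check(A, B, C))
        std::printf("ABFT detected (and located) the error\n");
    return 0;
}
```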

SLIDE 39

FFT Hardening Idea

input coding → unhardened FFT → output decoding → error detection

(J.Y. Jou and Abraham ’88; Pilla et al., TNS ’13)
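The weighted-checksum scheme of Jou and Abraham is specific to the FFT dataflow; as a simpler, assumption-laden stand-in for the same input-coding / output-decoding / error-detection pipeline, the sketch below checks two DFT identities (the DC bin equals the sum of the inputs, and Parseval’s energy relation) to decide whether a transform result can be trusted. It is a toy check, not the scheme evaluated in the talk.

```cpp
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

// Naive O(N^2) DFT, standing in for the GPU FFT kernel being protected.
std::vector<cd> dft(const std::vector<cd> &x) {
    size_t N = x.size();
    std::vector<cd> X(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * PI * double(k * n) / double(N));
    return X;
}

int main() {
    std::vector<cd> x = {1, 2, 3, 4, 5, 6, 7, 8};

    // Input coding: checksums of the inputs, computed before the transform.
    cd input_sum = 0;
    double energy_in = 0;
    for (const cd &v : x) { input_sum += v; energy_in += std::norm(v); }

    std::vector<cd> X = dft(x);
    X[3] += cd(0.1, 0.0);                        // inject an SDC into one output bin

    // Output decoding / error detection: X[0] must equal the input sum, and by
    // Parseval the input energy must equal (1/N) * sum |X[k]|^2.
    double energy_out = 0;
    for (const cd &v : X) energy_out += std::norm(v) / double(x.size());
    bool ok = std::abs(X[0] - input_sum) < 1e-6 &&
              std::fabs(energy_in - energy_out) < 1e-6;
    std::printf(ok ? "FFT output accepted\n" : "SDC detected in FFT output\n");
    return 0;
}
```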

SLIDE 40–41

ECC vs ABFT

[Chart: FIT, log scale 1–10000, for MxM and FFT, SDC and crash, comparing Unhardened, ECC, and ABFT.]

ECC reduces the FIT by ~10×, ABFT by ~56×!

ECC increases Crashes by ~50%, ABFT by only ~10%!
SLIDE 42

ECC vs ABFT

[Chart: normalized execution time, 0.2–1.6, for MxM and FFT: Unhardened, ECC, ABFT.]

ECC overhead for MxM is 10%, for FFT 50%! ABFT overhead is less than 20%.

SLIDE 43–45

Duplication With Comparison (DWC)

  • Spatial: blocks i and i+N are duplicated (each block and its copy run on different SMs)
  • E-O Spatial: blocks i and i+1 are duplicated (each block and its copy run back-to-back on the same SM)
  • Time: each thread executes its operations twice

[Figure: scheduling timelines on SM0/SM1 showing where the original blocks (a, b, c, d) and their duplicates (a', b', c', d') execute under each strategy.]
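A minimal CUDA-style sketch of the spatial DWC idea under stated assumptions: a hypothetical element-wise kernel is launched with twice the blocks, block i and block i+N compute the same chunk into two buffers, and the host compares them afterwards. The real implementations (and the on-GPU comparison variants) are the ones described in Oliveira et al., TNS 2014; the E-O and Time variants correspond to duplicating block i+1 instead, or repeating the operation inside each thread.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel hardened with spatial DWC: the grid has
// 2*blocks_per_copy blocks; blocks [0, N) write the primary output and
// blocks [N, 2N) recompute the same elements into a shadow buffer.
__global__ void scale_dwc(int n, int blocks_per_copy, float a,
                          const float *x, float *out_primary, float *out_shadow) {
    int copy  = blockIdx.x / blocks_per_copy;   // 0 = block i, 1 = its duplicate i+N
    int block = blockIdx.x % blocks_per_copy;
    int i     = block * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = a * x[i];                     // the computation being protected
        (copy == 0 ? out_primary : out_shadow)[i] = r;
    }
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks_per_copy = (n + threads - 1) / threads;

    float *x, *out_p, *out_s;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&out_p, n * sizeof(float));
    cudaMallocManaged(&out_s, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = float(i);

    // Launch with twice the blocks so every chunk is computed twice.
    scale_dwc<<<2 * blocks_per_copy, threads>>>(n, blocks_per_copy, 2.0f, x, out_p, out_s);
    cudaDeviceSynchronize();

    // Detection: any mismatch between the two copies flags an SDC.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (out_p[i] != out_s[i]) ++mismatches;
    if (mismatches) std::printf("SDC detected: %d mismatching elements\n", mismatches);
    else            std::printf("outputs match\n");

    cudaFree(x); cudaFree(out_p); cudaFree(out_s);
    return 0;
}
```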

SLIDE 46–48

Hotspot - DWC results*

[Chart: FIT, log scale 1–1000, SDC and crash, for Unhardened, ECC, Spatial DWC, E-O Spatial DWC, and Time DWC.]

  • Spatial DWC detects all SDCs, Spatial E-O detects 80% of SDCs, Time DWC detects 90% of SDCs
  • Only Time DWC reduces Crashes (no additional block scheduling is required)
  • DWC is promising: it is generic, easily implemented, and effective… BUT the execution time overhead for Spatial DWC and Spatial E-O is 2.5× and for Time DWC is 2× (data is not copied)

Duplicate only the code’s critical portions.

*details in Oliveira et al., Trans. Nucl. Sci., 2014

SLIDE 49

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 50

Code Optimizations (just baked!)

Novel and incremental algorithm implementations are continuously developed [Rodinia suite].

Do code optimizations impact GPU reliability?

Three case studies (naïve vs optimized): Matrix Multiplication, FFT, Needleman–Wunsch, with different input sizes (on GPUs, the effect of an optimization depends on the workload).
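To make “naïve vs optimized” concrete, here is a hedged CUDA sketch contrasting the two flavors usually compared in such studies: a naïve kernel that fetches every operand from global memory versus a shared-memory-tiled kernel that reuses staged sub-blocks (and therefore has a much higher on-chip hit rate). These are generic textbook kernels, not necessarily the implementations irradiated in these experiments; n is assumed to be a multiple of TILE.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

// Naive MxM: every multiply-add fetches its operands from global memory.
__global__ void mxm_naive(int n, const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Optimized MxM: tiles of A and B are staged in shared memory and reused,
// raising the on-chip hit rate (and, per these results, the FIT).
__global__ void mxm_tiled(int n, const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {          // assumes n % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 256;                            // multiple of TILE
    const size_t bytes = size_t(n) * n * sizeof(float);
    float *A, *B, *C1, *C2;
    cudaMallocManaged(&A, bytes);  cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C1, bytes); cudaMallocManaged(&C2, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    mxm_naive<<<grid, block>>>(n, A, B, C1);
    mxm_tiled<<<grid, block>>>(n, A, B, C2);
    cudaDeviceSynchronize();
    std::printf("C_naive[0] = %.0f, C_tiled[0] = %.0f\n", C1[0], C2[0]);  // both 512
    cudaFree(A); cudaFree(B); cudaFree(C1); cudaFree(C2);
    return 0;
}
```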

SLIDE 51–52

Experimental Results – MxM

[Chart: normalized FIT (a.u.) vs input size (1024, 2048, 4096, 8192) for Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash.]

Opt-MxM FIT is higher. Errors in obsolete data are NOT critical, so a higher hit rate in the caches means a higher FIT.

The FIT increases by ~20% with input size, caused by the additional threads instantiated.

SLIDE 53–54

Mean Workload Between Failures

Cross section and FIT are not the whole story: we also need to consider execution time and throughput. The optimized code has a higher cross section and FIT, but its shorter execution time means fewer neutrons hit the GPU during a run.

[Chart: GFLOPS vs input size (1024, 2048, 4096, 8192) for MxM-naive and MxM-opt.]

Mean Workload Between Failures (MWBF): the amount of data produced before a failure occurs.
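A small sketch of how MWBF combines the measured error rate with execution time and the amount of data produced per run, assuming MWBF = data produced per execution divided by the expected number of errors during that execution; every numeric value below is a placeholder, not measured data.

```cpp
#include <cstdio>

int main() {
    // Placeholder inputs (illustrative only, not measured values).
    double fit           = 500.0;            // errors per 1e9 device-hours at NYC
    double exec_time_s   = 2.0;              // duration of one kernel execution [s]
    double data_per_exec = 8192.0 * 8192.0;  // output elements produced per execution

    double error_rate_h    = fit / 1e9;                        // errors per device-hour
    double errors_per_exec = error_rate_h * exec_time_s / 3600.0;

    // Mean Workload Between Failures: data produced before a failure occurs.
    double mwbf = data_per_exec / errors_per_exec;

    std::printf("MWBF = %.3e output elements\n", mwbf);
    return 0;
}
```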

SLIDE 55–56

MxM - MWBF

[Chart: MWBF (data elements processed) vs input size (1024, 2048, 4096, 8192) for Naive-SDC and Opt-SDC.]

Opt-MxM produces more correct data than Naïve-MxM, and its efficiency increases with input size: when the code is optimized, the throughput increases more than the error rate.

SLIDE 57

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 58–60

What’s The Plan?

Exascale = 55× Titan: can we afford a 55× error rate? Probably not. Self-driving cars: reliability is a major concern! How we can help:

  • Understand SDC criticality. Not all errors significantly affect the output: are there “acceptable” SDCs?
  • Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
  • Understand how algorithm/code/compiler optimizations will impact future machines’ error rates
  • Use fault injection to better understand error propagation

SLIDE 61

Acknowledgments

Caio Lunardi, Caroline Aguiar, Laercio Pilla, Daniel Oliveira, Vinicius Frattin, Philippe Navaux, Luigi Carro, Chris Frost, Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender, Timothy Tsai, Siva Hari, Steve Keckler, David Kaeli, and the NUCAR group