How to Deal with Radiation: Evaluation and Mitigation of GPUs Soft-Errors
April 6th 2015 – San José, CA
Paolo Rech
Motivation: Automotive Applications
Pedestrian Detection System: embedded GPUs increase car safety
Paolo Rech – GTC2016, San José, CA
The insurance does not cover those accidents caused by: […] exposure to ionizing radiation*
*Paolo’s car insurance
*(field data from Tiwari et al. HPCA’15)
Galactic cosmic rays interact with the atmosphere, producing a shower of energetic particles: muons, pions, protons, gamma rays, neutrons
IONIZING PARTICLE
[Figure: GPU architecture. A Blocks Scheduler and Dispatcher and an L2 Cache serve an array of Streaming Multiprocessors (SM). Each SM contains an Instruction Cache, Warp Schedulers, Dispatch Units, a Register File, cores, and Shared Memory / L1 Cache.]
probability for 1 neutron to generate an output error
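In radiation testing this probability is usually reported as a cross section: the number of observed errors divided by the neutron fluence the device received. The sketch below assumes that definition and scales it to a Failure In Time (FIT) rate, assuming a reference sea-level flux of about 13 neutrons/(cm²·h) for New York City. The function names and the beam numbers are hypothetical and are not taken from the talk.

```python
# Assumed reference neutron flux at sea level in NYC, in n/(cm^2 * h).
NYC_FLUX_N_PER_CM2_H = 13.0


def cross_section(errors, fluence_n_per_cm2):
    """Cross section in cm^2: observed errors per unit of neutron fluence."""
    return errors / fluence_n_per_cm2


def fit_rate(sigma_cm2, flux_n_per_cm2_h=NYC_FLUX_N_PER_CM2_H):
    """FIT: expected failures per 10^9 hours of operation at the given flux."""
    return sigma_cm2 * flux_n_per_cm2_h * 1e9


# Hypothetical beam experiment: 100 output errors after 1e11 n/cm^2.
sigma = cross_section(100, 1e11)
fit = fit_rate(sigma)
```

Because the cross section is normalized by fluence, results from an accelerated beam can be rescaled to any natural environment by changing only the flux argument.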
GPU power control circuitry is out of beam
[Figure: Failure In Time (FIT) @NYC, log scale from 1 to 10000, for MxM, MTrans, FFT, NW, lavaMD, and Hotspot; series: Crashes, SDC]
execution dominated by memory latencies
codes that heavily employ registers
higher #instructions
Matrix Multiplication: 6.46102 FIT, 1 error every 15 years
Titan: 18,688 errors every 15 years (1 error every 7.3h)
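The step from a single device to a full machine is plain scaling: if each device fails independently at the same rate, the system-level interval between errors is the per-device interval divided by the number of devices. A minimal sketch of that arithmetic follows, using the rounded figures from the slide (15 years, 18,688 devices). The function name is hypothetical. With the rounded 15 years the result is about 7.0 hours; the 7.3h on the slide presumably comes from the unrounded per-device rate.

```python
HOURS_PER_YEAR = 365 * 24


def system_hours_between_errors(years_per_error_per_device, devices):
    """Hours between errors for `devices` identical, independent devices."""
    return years_per_error_per_device * HOURS_PER_YEAR / devices


titan_hours = system_hours_between_errors(15, 18_688)
print(f"{titan_hours:.1f} hours between errors")  # about 7.0 hours
```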
[Figure: Failure In Time (FIT) @NYC, log scale from 1 to 10000, for MxM, FFT, NW, lavaMD, and Hotspot; series: ECC OFF, ECC ON]
ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence)
…
[Figure: Failure In Time (FIT) @NYC, log scale from 1 to 10000, for MxM, FFT, NW, lavaMD, and Hotspot; series: ECC OFF, ECC ON]
Double Bit Errors cause a crash
scheduler is not protected
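Why a double bit error ends in a crash rather than a correction can be illustrated with a toy code. The sketch below assumes the ECC behaves like a single-error-correct, double-error-detect (SECDED) code and uses an extended Hamming(8,4) code on 4 data bits. This is an illustration only, not the code used in GPU memories, and all names are hypothetical.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2, 4 hold the Hamming
    parity bits; indices 3, 5, 6, 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data4
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data4); status is 'ok', 'corrected' or 'double_error'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_parity_broken = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_parity_broken:
        status = "ok"
    elif overall_parity_broken:
        # Odd number of flips: treat as a single flip and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        # The error is detected but cannot be located, so no data is returned.
        return "double_error", None
    return status, [fixed[3], fixed[5], fixed[6], fixed[7]]


word = encode([1, 0, 1, 1])

single = list(word)
single[5] ^= 1            # one upset: corrected transparently
double = list(word)
double[5] ^= 1
double[6] ^= 1            # two upsets: detected, not correctable
```

In this toy model the only safe reaction to `double_error` is to stop, which mirrors the crash behaviour noted above.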
checksum checksum
col-check row-check
Freivalds ’79
col-sum row-sum
Huang and Abraham ’84; Rech et al., TNS ’13
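The column-sum and row-sum idea can be sketched for matrix multiplication as follows, in the spirit of the checksum scheme of Huang and Abraham: append a column-checksum row to A and a row-checksum column to B, multiply, and verify that the checksums of the product still hold. This is a plain-Python illustration with integer data, not the GPU implementation evaluated in the talk, and all function names are hypothetical. With floating-point data the equality checks would need a tolerance.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def check_and_correct(c_full):
    """Verify a full-checksum product; repair one corrupted data element.

    Returns 'ok', 'corrected' (single data error fixed in place) or
    'detected' (inconsistency that this sketch cannot repair).
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return "ok"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row checksum tells how far off the faulty element is.
        c_full[i][j] += c_full[i][m] - sum(c_full[i][:m])
        return "corrected"
    return "detected"


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
c_full[0][1] = 99                    # inject a single corrupted element
status = check_and_correct(c_full)   # locates row 0, column 1 and repairs it
```

The intersection of the failing row check and the failing column check locates the faulty element, which is what makes single-error correction possible without recomputing the product.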
J.Y. Jou and Abraham ’88; Pilla et al., TNS ’13
[Figure: log scale from 1 to 10000, for Unhardened, ECC, and ABFT; series: SDC, crash]
[Figure: normalized execution time (axis from 0.2 to 1.6) for Unhardened, ECC, and ABFT]
[Figure: three timing diagrams, each showing SM0 and SM1 over time]
[Figure: log scale from 1 to 1000, for Unhardened, ECC, Spatial DWC, E-O Spatial DWC, and Time DWC; series: SDC, crash]
Spatial DWC detects all SDC
Spatial E-O detects 80% of SDC
Time DWC detects 90% of SDC
Only Time DWC reduces Crashes (no additional Blocks scheduling required)
DWC is promising: it is generic, easily implemented, and effective… BUT the execution time overhead is 2.5x for Spatial DWC and Spatial E-O, and 2x for Time DWC (data is not copied)
*details in Oliveira et al.
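The core of duplication with comparison (DWC) fits in a few lines. The sketch below models the time-redundant variant on the host: the same computation runs twice on the same input and the outputs are compared. It is a conceptual illustration, not the GPU implementation from the talk; `dwc`, `square_all`, and the `FaultyOnce` fault injector are hypothetical names.

```python
def dwc(kernel, data):
    """Run `kernel` twice on the same input and compare the two outputs."""
    first = kernel(data)
    second = kernel(data)
    if first != second:
        raise RuntimeError("DWC mismatch: possible silent data corruption")
    return first


def square_all(values):
    """Stand-in for a compute kernel."""
    return [v * v for v in values]


class FaultyOnce:
    """Hypothetical fault injector: flips one bit in the first call's output."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.calls = 0

    def __call__(self, data):
        out = self.kernel(data)
        self.calls += 1
        if self.calls == 1:
            out[0] ^= 1  # emulate a single bit flip in the result
        return out


clean = dwc(square_all, [1, 2, 3])      # both runs agree
try:
    dwc(FaultyOnce(square_all), [1, 2, 3])
except RuntimeError as err:
    detected = str(err)                 # the corrupted run is caught
```

Running everything twice is where the roughly 2x execution time overhead comes from; the scheme only detects a mismatch and does not say which of the two results is correct.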
[Figure: normalized FIT [a.u.] (axis from 1 to 26) vs. input size (1024, 2048, 4096, 8192); series: Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash]
~20% FIT increase with input size, caused by the additional threads instantiated
Opt-MxM FIT is higher. Errors in obsolete data are NOT critical: higher hit rate in the caches = higher FIT
[Figure: GFLOPs (axis from 100 to 600) vs. input size (1024, 2048, 4096, 8192); series: MxM-naive, MxM-opt]
[Figure: MWBF [data elaborated] (axis up to 4.00E+13) vs. input size (1024, 2048, 4096, 8192); series: Naive-SDC, Opt-SDC]
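The slide does not spell out how MWBF is computed. The sketch below assumes it means the amount of data processed between two failures, obtained as data per execution times the number of executions that fit in the mean time between failures. The function names and every number are hypothetical and are not the talk's measurements; the point is only to show how a faster code with a higher FIT can still process more data between failures.

```python
def mtbf_hours(fit):
    """Mean time between failures in hours for a given FIT rate."""
    return 1e9 / fit


def mwbf(data_per_run, run_time_s, fit):
    """Assumed definition: data processed between two failures."""
    runs_between_failures = mtbf_hours(fit) * 3600 / run_time_s
    return data_per_run * runs_between_failures


# Hypothetical comparison: the optimized code runs 4x faster
# but has twice the FIT of the naive one.
naive = mwbf(data_per_run=1e6, run_time_s=2.0, fit=100)
opt = mwbf(data_per_run=1e6, run_time_s=0.5, fit=200)
ratio = opt / naive  # greater than 1 when the speed-up outweighs the FIT increase
```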
Caio Lunardi, Caroline Aguiar, Laercio Pilla, Daniel Oliveira, Vinicius Frattin, Philippe Navaux, Luigi Carro, Chris Frost, Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender, Timothy Tsai, Siva Hari, Steve Keckler, David Kaeli, NUCAR group