SLIDE 1

How to Deal with Radiation: Evaluation and Mitigation of GPU Soft Errors

April 6th 2016 – San José, CA

Paolo Rech

SLIDE 2–4

Motivation: Automotive Applications

Pedestrian Detection System: embedded GPUs improve car safety.

Observed error: a corrupted detection output.

“The insurance does not cover those accidents caused by: […] exposure to ionizing radiation”*

*Paolo’s car insurance

SLIDE 5–6

Motivation: HPC Industry

Titan (Oak Ridge National Lab) has 18,688 GPUs, so the probability of having a corrupted GPU is high: Titan’s MTBF is ~44 h*

*(field data from Tiwari et al., HPCA’15)

The field data considers only Crashes/Hangs, because the correct output is unknown. We perform radiation experiments to also measure Silent Data Corruption (SDC) rates.

SLIDE 7–8

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 9–10

Terrestrial Radiation Environment

Galactic cosmic rays interact with the atmosphere and produce a shower of energetic particles: muons, pions, protons, gamma rays, neutrons.

13 n/(cm²·h) at sea level; the neutron flux increases exponentially with altitude.

SLIDE 11–13

Radiation Effects - Soft Errors

Soft Errors: the device is not permanently damaged, but the ionizing particle may generate:

  • One or more bit-flips: Single Event Upset (SEU), Multiple Bit Upset (MBU)
  • A transient voltage pulse in the logic, which can be latched by a flip-flop (FF): Single Event Transient (SET)
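Bit-flips of this kind can also be emulated in software for fault-injection studies (a direction mentioned at the end of the talk). The sketch below is a minimal, hypothetical illustration rather than the methodology used here: it flips one randomly chosen bit of a 32-bit word with an XOR mask, the way an SEU would.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>

// Emulate a Single Event Upset: flip one randomly selected bit of a 32-bit word.
uint32_t inject_seu(uint32_t word, std::mt19937 &rng) {
    std::uniform_int_distribution<int> bit(0, 31);
    return word ^ (1u << bit(rng));              // the XOR mask flips exactly one bit
}

int main() {
    std::mt19937 rng(42);
    float value = 1.0f;                          // victim datum, e.g. a register value
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);     // reinterpret the float's bits
    bits = inject_seu(bits, rng);
    float corrupted;
    std::memcpy(&corrupted, &bits, sizeof corrupted);
    std::printf("original = %g, corrupted = %g\n", value, corrupted);
    return 0;
}
```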

SLIDE 14–20

Radiation Effects on GPUs

[Figure: block diagram of a CUDA GPU — DRAM, L2 cache, blocks scheduler and dispatcher, and an array of Streaming Multiprocessors (SMs); each SM contains an instruction cache, warp schedulers, dispatch units, a register file, cores, and shared memory / L1 cache. The animation marks struck units with an X: a single particle strike in one core can end up corrupting multiple cores and several SMs.]

SLIDE 21–22

Silent Data Corruption vs Crash&Hang

Silent Data Corruption — errors in:

  • data cache
  • register files
  • logic gates (ALU)
  • scheduler

Crash & Hang — errors in:

  • instruction cache
  • scheduler / dispatcher
  • PCI-e bus controller
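In the beam experiments described next, an SDC is flagged by comparing every GPU output with a golden copy computed in advance, while crashes and hangs are caught by the host. The check itself is simple; the sketch below is a generic host-side version with illustrative names, not the exact test harness.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Compare the GPU output against a golden reference computed beforehand.
// Returns the number of mismatching elements (> 0 means an SDC occurred).
size_t count_sdc(const std::vector<float> &output,
                 const std::vector<float> &golden,
                 float tolerance = 0.0f) {
    size_t mismatches = 0;
    for (size_t i = 0; i < output.size(); ++i)
        if (std::fabs(output[i] - golden[i]) > tolerance)
            ++mismatches;
    return mismatches;
}

int main() {
    std::vector<float> golden = {1.0f, 2.0f, 3.0f};
    std::vector<float> output = {1.0f, 2.5f, 3.0f};   // pretend one element got corrupted
    size_t errors = count_sdc(output, golden);
    if (errors > 0)
        std::printf("SDC detected: %zu corrupted elements\n", errors);
    return 0;
}
```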

SLIDE 23

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 24

Radiation Test Facilities

Weapons Neutron Research (WNR) facility at LANSCE (Los Alamos)

SLIDE 25–26

Neutron Spectrum

Flux @LANSCE: 1.8×10⁹ n/(cm²·h)    Flux @NYC: 13 n/(cm²·h)

cross section [cm²] = (errors/s) / (beam flux in n/(cm²·s))

The cross section is the probability for one neutron to generate an output error.

Error Rate = cross section × flux (13 n/(cm²·h) at NYC)
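As a worked example of how these quantities combine, the sketch below starts from made-up beam counts (only the LANSCE and NYC fluxes come from the slide), derives the cross section, scales it to the natural flux to get the error rate and FIT (failures per 10⁹ device-hours), and then scales the resulting MTBF to an 18,688-GPU machine such as Titan.

```cpp
#include <cstdio>

int main() {
    // --- beam experiment (hypothetical counts, for illustration only) ---
    double observed_errors = 100.0;               // errors counted during the run
    double beam_time_s     = 3600.0;              // effective beam time [s]
    double beam_flux       = 1.8e9 / 3600.0;      // LANSCE flux [n/(cm^2 s)], from 1.8e9 n/(cm^2 h)

    // cross section [cm^2] = (errors per second) / (beam flux)
    double cross_section = (observed_errors / beam_time_s) / beam_flux;

    // --- scale to the natural environment ---
    double nyc_flux_h   = 13.0;                        // n/(cm^2 h) at sea level (NYC)
    double error_rate_h = cross_section * nyc_flux_h;  // errors per device-hour
    double fit          = error_rate_h * 1e9;          // Failures In Time (per 1e9 device-hours)
    double mtbf_h       = 1.0 / error_rate_h;          // single-device MTBF [h]

    // --- scale to a large system (Titan: 18,688 GPUs) ---
    double titan_mtbf_h = mtbf_h / 18688.0;

    std::printf("cross section = %.3e cm^2\n", cross_section);
    std::printf("FIT = %.1f, device MTBF = %.0f h, Titan MTBF = %.1f h\n",
                fit, mtbf_h, titan_mtbf_h);
    return 0;
}
```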

SLIDE 27–28

GPU Radiation Test Setup

Devices under test on the beam line: microcontrollers, FPGA SoCs, Flash, GPU, APU.

Tested devices: AMD APU, NVIDIA K20, Intel Xeon Phi, desktop PCs.

The GPU power control circuitry is kept out of the beam.

SLIDE 29

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 30

Tested Parallel Codes

  • Matrix Multiplication (linear algebra)
  • Matrix Transpose (memory)
  • FFT (signal processing)
  • Needleman–Wunsch (biology)
  • lavaMD (physical simulations)
  • Hotspot (physical simulations)
  • HOG (pedestrian detection)

The selected algorithms are heterogeneous and representative


SLIDE 31–33

Experimental Results (ECC OFF)

[Chart: Failure In Time @NYC, log scale 1–10000, for MxM, MTrans, FFT, NW, lavaMD, and Hotspot; Crashes vs SDC.]

The SDC rate varies by ~3 orders of magnitude across codes (details in Oliveira et al., Trans. Comp. 2015). The differences track how each code uses the GPU: whether execution is dominated by memory latencies, how heavily registers are employed, and the number of executed instructions.

Matrix Multiplication: 6.46×10² FIT, i.e. 1 error every 15 years on one GPU; on Titan that is 18,688 errors every 15 years (1 error every 7.3 h).

SLIDE 34

Error Correction Code - SDC

[Chart: SDC Failure In Time @NYC, log scale 1–10000, for MxM, FFT, NW, lavaMD, Hotspot; ECC OFF vs ECC ON.]

ECC reduces the SDC FIT by ~1 order of magnitude (there is almost no code dependence).

SLIDE 35

Error Correction Code - Crash

[Chart: Crash Failure In Time @NYC, log scale 1–10000, for MxM, FFT, NW, lavaMD, Hotspot; ECC OFF vs ECC ON.]

ECC increases the Crash FIT by about 50% (there is almost no code dependence): Double Bit Errors cause a crash, and the scheduler is not protected.

SLIDE 36

ECC ON – SDC vs Crashes

[Chart: Failure In Time @NYC, log scale 1–10000, for MxM, FFT, NW, lavaMD, Hotspot with ECC ON; Crash vs SDC.]

When ECC is ON, Crashes are more likely to occur than SDCs (this is GOOD for HPC centers!).

SLIDE 37

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 38

Algorithm Based Fault Tolerance (ABFT)

ABFT is a hardening technique designed specifically for an algorithm. It requires: input coding, algorithm modification, and output decoding with error detection/correction.

[Figure: A × B = M with row and column checksums (∑); the col-check and row-check computed from M are compared against the col-sum and row-sum derived from the inputs, and the mismatching row/column pair (X) locates the error.]

Freivalds ’79; Huang and Abraham ’84; Rech et al., TNS ’13
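A minimal sketch of the row/column-checksum idea behind ABFT for matrix multiplication (in the spirit of Huang and Abraham ’84), written as plain host-side C++ for readability rather than the CUDA implementation evaluated in the talk; the matrices, tolerance, and error-handling policy are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Plain matrix multiplication C = A * B (the computation we want to protect).
Matrix multiply(const Matrix &A, const Matrix &B) {
    size_t n = A.size(), m = B[0].size(), k = B.size();
    Matrix C(n, std::vector<double>(m, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            for (size_t p = 0; p < k; ++p)
                C[i][j] += A[i][p] * B[p][j];
    return C;
}

// ABFT check: row sums of C must equal A * (row-sum vector of B), and
// column sums of C must equal (column-sum vector of A) * B. A single
// corrupted element is located by the (row, column) pair that fails.
bool abft_check(const Matrix &A, const Matrix &B, const Matrix &C, double tol = 1e-9) {
    size_t n = A.size(), m = B[0].size(), k = B.size();
    bool ok = true;
    for (size_t i = 0; i < n; ++i) {                  // row checksums
        double expected = 0.0, actual = 0.0;
        for (size_t p = 0; p < k; ++p) {
            double brow = 0.0;
            for (size_t j = 0; j < m; ++j) brow += B[p][j];
            expected += A[i][p] * brow;
        }
        for (size_t j = 0; j < m; ++j) actual += C[i][j];
        if (std::fabs(expected - actual) > tol) { std::printf("row %zu corrupted\n", i); ok = false; }
    }
    for (size_t j = 0; j < m; ++j) {                  // column checksums
        double expected = 0.0, actual = 0.0;
        for (size_t p = 0; p < k; ++p) {
            double acol = 0.0;
            for (size_t i = 0; i < n; ++i) acol += A[i][p];
            expected += acol * B[p][j];
        }
        for (size_t i = 0; i < n; ++i) actual += C[i][j];
        if (std::fabs(expected - actual) > tol) { std::printf("column %zu corrupted\n", j); ok = false; }
    }
    return ok;
}

int main() {
    Matrix A = {{1, 2}, {3, 4}}, B = {{5, 6}, {7, 8}};
    Matrix C = multiply(A, B);
    C[0][1] += 0.5;                                    // inject an SDC into one element
    if (!abft_check(A, B, C))
        std::printf("ABFT detected (and located) the error\n");
    return 0;
}
```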

SLIDE 39

FFT Hardening Idea

input coding → unhardened FFT → output decoding → error detection

(J.Y. Jou and Abraham ’88; Pilla et al., TNS ’13)
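The weighted-checksum scheme of Jou and Abraham is specific to the FFT dataflow; as a simpler, assumption-laden stand-in for the same input-coding / output-decoding / error-detection pipeline, the sketch below checks two DFT identities (the DC bin equals the sum of the inputs, and Parseval’s energy relation) to decide whether a transform result can be trusted. It is a toy check, not the scheme evaluated in the talk.

```cpp
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

// Naive O(N^2) DFT, standing in for the GPU FFT kernel being protected.
std::vector<cd> dft(const std::vector<cd> &x) {
    size_t N = x.size();
    std::vector<cd> X(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * PI * double(k * n) / double(N));
    return X;
}

int main() {
    std::vector<cd> x = {1, 2, 3, 4, 5, 6, 7, 8};

    // Input coding: checksums of the inputs, computed before the transform.
    cd input_sum = 0;
    double energy_in = 0;
    for (const cd &v : x) { input_sum += v; energy_in += std::norm(v); }

    std::vector<cd> X = dft(x);
    X[3] += cd(0.1, 0.0);                        // inject an SDC into one output bin

    // Output decoding / error detection: X[0] must equal the input sum, and by
    // Parseval the input energy must equal (1/N) * sum |X[k]|^2.
    double energy_out = 0;
    for (const cd &v : X) energy_out += std::norm(v) / double(x.size());
    bool ok = std::abs(X[0] - input_sum) < 1e-6 &&
              std::fabs(energy_in - energy_out) < 1e-6;
    std::printf(ok ? "FFT output accepted\n" : "SDC detected in FFT output\n");
    return 0;
}
```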

SLIDE 40–41

ECC vs ABFT

[Chart: FIT, log scale 1–10000, for MxM and FFT, SDC and crash, comparing Unhardened, ECC, and ABFT.]

ECC reduces the FIT by ~10×, ABFT by ~56×!

ECC increases Crashes by ~50%, ABFT by only ~10%!
SLIDE 42

ECC vs ABFT

[Chart: normalized execution time, 0.2–1.6, for MxM and FFT: Unhardened, ECC, ABFT.]

ECC overhead for MxM is 10%, for FFT 50%! ABFT overhead is less than 20%.

SLIDE 43–45

Duplication With Comparison (DWC)

  • Spatial: blocks i and i+N are duplicated (each block and its copy run on different SMs)
  • E-O Spatial: blocks i and i+1 are duplicated (each block and its copy run back-to-back on the same SM)
  • Time: each thread executes its operations twice

[Figure: scheduling timelines on SM0/SM1 showing where the original blocks (a, b, c, d) and their duplicates (a', b', c', d') execute under each strategy.]
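A minimal CUDA-style sketch of the spatial DWC idea under stated assumptions: a hypothetical element-wise kernel is launched with twice the blocks, block i and block i+N compute the same chunk into two buffers, and the host compares them afterwards. The real implementations (and the on-GPU comparison variants) are the ones described in Oliveira et al., TNS 2014; the E-O and Time variants correspond to duplicating block i+1 instead, or repeating the operation inside each thread.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel hardened with spatial DWC: the grid has
// 2*blocks_per_copy blocks; blocks [0, N) write the primary output and
// blocks [N, 2N) recompute the same elements into a shadow buffer.
__global__ void scale_dwc(int n, int blocks_per_copy, float a,
                          const float *x, float *out_primary, float *out_shadow) {
    int copy  = blockIdx.x / blocks_per_copy;   // 0 = block i, 1 = its duplicate i+N
    int block = blockIdx.x % blocks_per_copy;
    int i     = block * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = a * x[i];                     // the computation being protected
        (copy == 0 ? out_primary : out_shadow)[i] = r;
    }
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks_per_copy = (n + threads - 1) / threads;

    float *x, *out_p, *out_s;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&out_p, n * sizeof(float));
    cudaMallocManaged(&out_s, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = float(i);

    // Launch with twice the blocks so every chunk is computed twice.
    scale_dwc<<<2 * blocks_per_copy, threads>>>(n, blocks_per_copy, 2.0f, x, out_p, out_s);
    cudaDeviceSynchronize();

    // Detection: any mismatch between the two copies flags an SDC.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (out_p[i] != out_s[i]) ++mismatches;
    if (mismatches) std::printf("SDC detected: %d mismatching elements\n", mismatches);
    else            std::printf("outputs match\n");

    cudaFree(x); cudaFree(out_p); cudaFree(out_s);
    return 0;
}
```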

SLIDE 46–48

Hotspot - DWC results*

[Chart: FIT, log scale 1–1000, SDC and crash, for Unhardened, ECC, Spatial DWC, E-O Spatial DWC, and Time DWC.]

  • Spatial DWC detects all SDCs, Spatial E-O detects 80% of SDCs, Time DWC detects 90% of SDCs
  • Only Time DWC reduces Crashes (no additional block scheduling is required)
  • DWC is promising: it is generic, easily implemented, and effective… BUT the execution time overhead for Spatial DWC and Spatial E-O is 2.5× and for Time DWC is 2× (data is not copied)

Duplicate only the code’s critical portions.

*details in Oliveira et al., Trans. Nucl. Sci., 2014

SLIDE 49

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 50

Code Optimizations (just baked!)

Novel and incremental algorithm implementations are continuously developed [Rodinia suite].

Do code optimizations impact GPU reliability?

Three case studies (naïve vs optimized): Matrix Multiplication, FFT, Needleman–Wunsch, with different input sizes (on GPUs, the effect of an optimization depends on the workload).
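To make “naïve vs optimized” concrete, here is a hedged CUDA sketch contrasting the two flavors usually compared in such studies: a naïve kernel that fetches every operand from global memory versus a shared-memory-tiled kernel that reuses staged sub-blocks (and therefore has a much higher on-chip hit rate). These are generic textbook kernels, not necessarily the implementations irradiated in these experiments; n is assumed to be a multiple of TILE.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

// Naive MxM: every multiply-add fetches its operands from global memory.
__global__ void mxm_naive(int n, const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Optimized MxM: tiles of A and B are staged in shared memory and reused,
// raising the on-chip hit rate (and, per these results, the FIT).
__global__ void mxm_tiled(int n, const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {          // assumes n % TILE == 0
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 256;                            // multiple of TILE
    const size_t bytes = size_t(n) * n * sizeof(float);
    float *A, *B, *C1, *C2;
    cudaMallocManaged(&A, bytes);  cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C1, bytes); cudaMallocManaged(&C2, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    mxm_naive<<<grid, block>>>(n, A, B, C1);
    mxm_tiled<<<grid, block>>>(n, A, B, C2);
    cudaDeviceSynchronize();
    std::printf("C_naive[0] = %.0f, C_tiled[0] = %.0f\n", C1[0], C2[0]);  // both 512
    cudaFree(A); cudaFree(B); cudaFree(C1); cudaFree(C2);
    return 0;
}
```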

SLIDE 51–52

Experimental Results – MxM

[Chart: normalized FIT (a.u.) vs input size (1024, 2048, 4096, 8192) for Naive-SDC, Naive-Crash, Opt-SDC, Opt-Crash.]

Opt-MxM FIT is higher. Errors in obsolete data are NOT critical, so a higher hit rate in the caches means a higher FIT.

The FIT increases by ~20% with input size, caused by the additional threads instantiated.

SLIDE 53–54

Mean Workload Between Failures

Cross section and FIT are not the whole story: we also need to consider execution time and throughput. The optimized code has a higher cross section and FIT, but its shorter execution time means fewer neutrons hit the GPU during a run.

[Chart: GFLOPS vs input size (1024, 2048, 4096, 8192) for MxM-naive and MxM-opt.]

Mean Workload Between Failures (MWBF): the amount of data produced before a failure occurs.
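A small sketch of how MWBF combines the measured error rate with execution time and the amount of data produced per run, assuming MWBF = data produced per execution divided by the expected number of errors during that execution; every numeric value below is a placeholder, not measured data.

```cpp
#include <cstdio>

int main() {
    // Placeholder inputs (illustrative only, not measured values).
    double fit           = 500.0;            // errors per 1e9 device-hours at NYC
    double exec_time_s   = 2.0;              // duration of one kernel execution [s]
    double data_per_exec = 8192.0 * 8192.0;  // output elements produced per execution

    double error_rate_h    = fit / 1e9;                        // errors per device-hour
    double errors_per_exec = error_rate_h * exec_time_s / 3600.0;

    // Mean Workload Between Failures: data produced before a failure occurs.
    double mwbf = data_per_exec / errors_per_exec;

    std::printf("MWBF = %.3e output elements\n", mwbf);
    return 0;
}
```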

SLIDE 55–56

MxM - MWBF

[Chart: MWBF (data elements processed) vs input size (1024, 2048, 4096, 8192) for Naive-SDC and Opt-SDC.]

Opt-MxM produces more correct data than Naïve-MxM, and its efficiency increases with input size: when the code is optimized, the throughput increases more than the error rate.

SLIDE 57

Outline

  • Radiation Effects Essentials
  • Evaluation of GPU Radiation Sensitivity
  • Experimental Setup
  • Parallel Algorithms Error Rates
  • Hardening Solution Efficiency
  • Code Optimizations Effects on HPC Reliability
  • What’s the Plan?
SLIDE 58–60

What’s The Plan?

Exascale = 55× Titan: can we afford a 55× error rate? Probably not. Self-driving cars: reliability is a major concern! How we can help:

  • Understand SDC criticality. Not all errors significantly affect the output: are there “acceptable” SDCs?
  • Propose selective-hardening solutions for GPUs (duplicate only what matters, what REALLY matters)
  • Understand how algorithm/code/compiler optimizations will impact future machines’ error rates
  • Use fault injection to better understand error propagation

SLIDE 61

Acknowledgments

Caio Lunardi, Caroline Aguiar, Laercio Pilla, Daniel Oliveira, Vinicius Frattin, Philippe Navaux, Luigi Carro, Chris Frost, Nathan DeBardeleben, Sean Blanchard, Heather Quinn, Thomas Fairbanks, Steve Wender, Timothy Tsai, Siva Hari, Steve Keckler, David Kaeli, and the NUCAR group