Siva Hari, Timothy Tsai, Mark Stephenson, Stephen W. Keckler, Joel Emer

MOTIVATION
- Automotive and HPC systems need high resilience
- Need to evaluate resilience of applications
  - Silent Data Corruption (SDC), Detected Unrecoverable Error (DUE) probabilities
  - Identify vulnerable program sections – key for developing low-cost mitigation schemes
NVIDIA CONFIDENTIAL

CHALLENGES
Application-level evaluation can be slow
- Application-level resilience evaluation is challenging
  - Traditional low-level error injection experiments are slow
  - Low visibility into application behavior
- Need quicker GPU application resilience evaluation scheme
[Figure: system abstraction layers: Application, System software, Architecture, Gate, Circuit]

APPROACH
Architecture-level Error Injections
- Inject errors at the architecture level
  - Fast, with visibility into the application
  - Leverage a low-level assembly-language instrumentation tool (SASSI)
- Advantages:
  - Analyze and study SDCs in detail: magnitude of SDCs and which errors produce SDCs
  - Ability to correlate program properties with program vulnerability
    - Key to developing low-cost error mitigation schemes
  - Ability to quantify application-level error derating factors
[Figure: system abstraction layers: Application, System software, Architecture, Gate, Circuit]

CONTRIBUTIONS
SASSIFI: Architecture-level GPU fault injection tool
- Developed SASSIFI tool
  - Flexible options to inject many types of errors
  - Examples: single, multiple bit flips in register values; address vs. value errors
- Demonstrated by conducting four types of resilience studies
- Released SASSIFI for public usage
  - GitHub: https://github.com/NVlabs/sassifi

OUTLINE
- Background: SASSI
- SASSIFI tool
  - Error injection methodology
  - Use cases: Error models
- Results

OVERVIEW OF SASSI
Background
- SASSI is a compiler-based instrumentation framework that allows us to inject code before or after specific points in a program
- Example: Identify all SASS memory ops and inject the code needed to pass each op's address to a user-defined function

Original code:
    .L_8:
    ISCADD R7, R5, R3, 0x2;
    STS [R7], R2;
    BAR.SYNC 0x0;
    MOV R0, c[0x0][0x28];
    SHF.R R0, R0, 0x1, RZ;
    ISETP.EQ.AND P0, PT, ...
    @P0 BRA `(.L_12);

Instrumented code:
    .L_8:
    ISCADD R7, R5, R3, 0x2;
    IADD R1, R1, -0x4;          1. Create extra stack space
    STL [R1], R4;               2. Save live registers
    IADD R4, R7, 0x0;           3. Pass parameters of interest to user-defined function
    JCAL `(_users_function);    4. Call user-defined function
    LDL R4, [R1];               5. Restore live registers
    IADD R1, R1, 0x4;           6. Restore stack
    STS [R7], R2;               7. Execute instrumented instruction
    BAR.SYNC 0x0;

- User writes a handler function, _users_function, in CUDA

"Flexible Software Profiling of GPU Architectures," Mark Stephenson, Siva Hari, Yunsup Lee, Eiman Ebrahimi, Daniel Johnson, Dave Nellans, Mike O'Connor, and Steve Keckler, ISCA 2015

SASSIFI: SASSI BASED FAULT INJECTOR
- Leveraged SASSI for error injections
- Instrumented kernels for profiling and error injections

SASSIFI METHODOLOGY
Steps
- Profile: Identify possible injection sites
  - Instrumented kernels execute on the GPU
- Statistically select injection sites based on the error model
- Injection runs: inject one error at a time
  - Instrument before/after instructions and collect reg/mem info
  - Start application, inject error at the selected site
  - Continue execution until a crash or the output
[Figure: application diagram (CPU Code, GPU Kernels) with labels Output and Golden Output]
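The site-selection step can be illustrated with a short Python sketch. It is not taken from the SASSIFI sources: the profile format (a mapping from kernel name to the number of dynamic instructions that match the error model) and the function name `select_injection_site` are assumptions made for illustration. The idea shown is only that every candidate dynamic instruction is equally likely to be picked.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def select_injection_site(profile, rng):
    """Pick (kernel, instruction index) uniformly over all profiled sites.

    `profile` is a hypothetical format: kernel name -> number of dynamic
    instructions that are candidate injection sites for the error model.
    Assumes at least one candidate site exists.
    """
    kernels = list(profile)
    cumulative = list(accumulate(profile[k] for k in kernels))
    target = rng.randrange(cumulative[-1])   # uniform over every candidate
    idx = bisect_right(cumulative, target)   # kernel whose range holds target
    start = cumulative[idx] - profile[kernels[idx]]
    return kernels[idx], target - start


# Hypothetical profile of two kernels.
site = select_injection_site({"kernel_a": 1000, "kernel_b": 3000},
                             random.Random(0))
```

With this weighting, a kernel that executes three times as many candidate instructions receives roughly three times as many injections.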

OUTCOME CATEGORIES
Category: Explanation
- DUE:
  - Application exits with non-zero exit status
  - Application does not terminate within allocated time (3x fault-free runtime)
- Potential DUEs:
  - Kernel exit status is not cudaSuccess
  - Error messages in stdout/stderr (e.g., Error: misaligned address)
- SDC: Program output file or stdout is different
- Masked: Application output is the same as the error-free output, without any error symptoms
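The categories above can be expressed as a small decision function. This is a sketch under assumptions: the argument names are hypothetical, and the order in which the checks are applied (DUE first, then potential DUE, then SDC) is an assumption, since the slide lists the categories without stating a precedence.

```python
def timeout_limit(fault_free_runtime):
    """Allocated time per the slide: 3x the fault-free runtime."""
    return 3 * fault_free_runtime


def classify_outcome(exit_status, timed_out, kernel_status_ok,
                     error_in_logs, output_matches_golden):
    """Map the observations from one injection run to an outcome category."""
    if timed_out or exit_status != 0:
        return "DUE"
    if not kernel_status_ok or error_in_logs:
        return "Potential DUE"
    if not output_matches_golden:
        return "SDC"
    return "Masked"
```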

SASSIFI USE CASES
Many uses of SASSIFI
- What is the probability that a particle strike in the register file produces an SDC?
- What is the probability that a bit-flip in the destination register of an executing instruction will result in an SDC?
- What instruction types are likely to produce more SDCs when subjected to errors in destination registers?
- How do SDC probabilities change when different architecture-level states (addresses vs. values) are subjected to errors?
- How do the results change if we inject different bit-flip patterns (single vs. double bit-flips)?
SASSIFI can be used to address all these questions

ERROR MODELS
SASSIFI can inject many types of errors

Injection modes and instruction groups
- Randomly selected register (RF): All instructions
- Instruction output value (IOV): GPR, CC, PR, ST, I/F/D ADD-MUL, I/F/D FMA, SETP, LDS, LD
- Instruction output address (IOA): GPR, ST

Bit-flip models
- Single thread: Single-bit flip, Double-bit flip, Random value, Zero value
- All threads in one warp: Single-bit flip, Double-bit flip, Random value, Zero value

Use cases
- Use case 1: Register file injections for AVF analysis
- Use case 2: Injecting into a GPR destination register of a randomly selected instruction
- Use case 3: Identifying instruction types that produce more SDCs
- Use case 4: Injecting into different architecture states
- Use case 5: Injecting different bit-flip patterns

Easy to extend to include other models
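The four single-thread bit-flip models can be sketched in Python on a 32-bit register value. This is an illustration, not SASSIFI code: the function name `corrupt` and the model strings are hypothetical, and the slide does not say how the two bits of a double-bit flip are chosen, so the sketch assumes two distinct random positions. The warp-wide variants are omitted.

```python
import random

REG_BITS = 32  # assumed register width


def corrupt(value, model, rng):
    """Return `value` corrupted according to one of the four models."""
    if model == "single_bit":
        return value ^ (1 << rng.randrange(REG_BITS))
    if model == "double_bit":
        # Assumption: two distinct, randomly chosen bit positions.
        b0, b1 = rng.sample(range(REG_BITS), 2)
        return value ^ (1 << b0) ^ (1 << b1)
    if model == "random_value":
        # May coincide with the original value by chance.
        return rng.getrandbits(REG_BITS)
    if model == "zero_value":
        return 0
    raise ValueError(f"unknown bit-flip model: {model}")


rng = random.Random(0)
flipped = corrupt(0x0000FFFF, "single_bit", rng)
```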

IMPLEMENTING DIFFERENT ERROR MODES
Each instrumented instruction (Opcode Dest, Src1, Src2) is wrapped by sassi_before_handler() and sassi_after_handler().

RF mode
- sassi_before_handler(): At the selected instruction, inject the error if the selected register is a source. If not, monitor subsequent instructions and inject when it is found as a source.
- sassi_after_handler(): Empty handler

IOV mode
- sassi_before_handler(): Empty handler
- sassi_after_handler(): Inject the error at the selected instruction count according to the selected bit-flip model

IOA mode
- sassi_before_handler(): Record register/memory content at the selected instruction count
- sassi_after_handler(): At the injection instruction:
  - Read values from the correct address and write them to the corrupted address
  - Revert content at the correct address with the recorded values
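The record, copy, and revert sequence described for store-address errors can be mimicked on a toy memory model. The sketch below is an assumption-laden illustration in Python, not the CUDA handlers themselves: memory is a dictionary, unwritten locations are assumed to read as 0, and the function name and the `bit` parameter are hypothetical.

```python
def store_with_corrupted_address(memory, addr, value, bit):
    """Emulate a store whose address has one flipped bit.

    The net effect is as if `value` had been stored at the corrupted
    address while the correct address keeps its old content.
    """
    recorded = memory.get(addr, 0)      # before handler: record old content
    memory[addr] = value                # the instrumented store executes
    corrupted = addr ^ (1 << bit)       # single-bit flip in the address
    memory[corrupted] = memory[addr]    # after handler: copy to corrupted address
    memory[addr] = recorded             # after handler: revert correct address
    return corrupted


mem = {4: 11, 6: 22}
target = store_with_corrupted_address(mem, addr=4, value=99, bit=1)
# target is 6; mem is now {4: 11, 6: 99}
```

On real hardware a corrupted address may be invalid or misaligned, which this toy model does not represent.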

RESULTS: USE CASE 1
Register File AVF
[Figure: Error injection results. Percentage of injections (0% to 100%) by outcome (Masked, DUEs, Potential DUEs, SDCs) for single-bit and double-bit flips in CoMD and Lulesh]
- SDC AVF = SDC probability from injections in occupied registers * RF occupancy
- 0.075 and 0.07 for CoMD and Lulesh, respectively, for single-bit flips
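The SDC AVF formula can be worked through with a few lines of Python. The counts and the occupancy below are made-up inputs chosen only to show the arithmetic; they are not the measurements behind the CoMD or Lulesh figures, and the function name `sdc_avf` is hypothetical.

```python
def sdc_avf(sdc_count, injection_count, rf_occupancy):
    """SDC AVF = P(SDC | injection in an occupied register) * RF occupancy."""
    sdc_probability = sdc_count / injection_count
    return sdc_probability * rf_occupancy


# Hypothetical campaign: 250 SDCs in 1000 injections, 50% RF occupancy.
example = sdc_avf(sdc_count=250, injection_count=1000, rf_occupancy=0.5)
# example == 0.125
```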
