

SLIDE 1

Siva Hari, Timothy Tsai, Mark Stephenson, Stephen W. Keckler, Joel Emer

SLIDE 2

NVIDIA CONFIDENTIAL

MOTIVATION

  • Need to evaluate the resilience of applications: Silent Data Corruption (SDC) and Detected Unrecoverable Error (DUE) probabilities
  • Identify vulnerable program sections, which is key for developing low-cost mitigation schemes

Automotive and HPC systems need high resilience

SLIDE 3

CHALLENGES

Application-level resilience evaluation is challenging

  • Traditional low-level error injection experiments are slow
  • Low visibility into application behavior

Need a quicker GPU application resilience evaluation scheme

Application-level evaluation can be slow

[Figure: abstraction layers: Application, System software, Architecture, Gate, Circuit]

SLIDE 4

APPROACH

Inject errors at the architecture level

  • Fast, with visibility into the application
  • Leverage a low-level assembly-language instrumentation tool (SASSI)

Advantages:

  • Analyze and study SDCs in detail: the magnitude of SDCs and which errors produce SDCs
  • Ability to correlate program properties with program vulnerability, which is key to developing low-cost error mitigation schemes
  • Ability to quantify application-level error derating factors

Architecture-level Error Injections

[Figure: abstraction layers: Application, System software, Architecture, Gate, Circuit]

SLIDE 5

CONTRIBUTIONS

  • Developed the SASSIFI tool
  • Flexible options to inject many types of errors
      • Examples: single and multiple bit flips in register values; address vs. value errors
  • Demonstrated by conducting four types of resilience studies
  • Released SASSIFI for public use
      • GitHub: https://github.com/NVlabs/sassifi

SASSIFI: Architecture-level GPU fault injection tool

SLIDE 6

OUTLINE

  • Background: SASSI
  • SASSIFI tool
  • Error injection methodology
  • Use cases: Error models
  • Results

SLIDE 7

OVERVIEW OF SASSI

Background

Original code:

    .L_8:
      ISCADD R7, R5, R3, 0x2;
      STS [R7], R2;
      BAR.SYNC 0x0;
      MOV R0, c[0x0][0x28];
      SHF.R R0, R0, 0x1, RZ;
      ISETP.EQ.AND P0, PT, ...
      @P0 BRA `(.L_12);

Instrumented code:

    .L_8:
      ISCADD R7, R5, R3, 0x2;
      IADD R1, R1, -0x4;
      STL [R1], R4;
      IADD R4, R7, 0x0;
      JCAL `(_users_function);
      LDL R4, [R1];
      IADD R1, R1, 0x4;
      STS [R7], R2;
      BAR.SYNC 0x0;

  • SASSI is a compiler-based instrumentation framework that allows us to inject code before or after specific points in a program
  • Example: Identify all SASS memory ops and inject the code needed to pass each op's address to a user-defined function
  • User writes a handler function, _users_function, in CUDA

Instrumentation steps:

  1. Create extra stack space
  2. Save live registers
  3. Pass parameters of interest to user-defined function
  4. Call user-defined function
  5. Restore live registers
  6. Restore stack
  7. Execute instrumented instruction

“Flexible Software Profiling of GPU Architectures,” Mark Stephenson, Siva Hari, Yunsup Lee, Eiman Ebrahimi, Daniel Johnson, Dave Nellans, Mike O’Connor, and Steve Keckler, ISCA 2015

SLIDE 8

SASSIFI: SASSI BASED FAULT INJECTOR

  • Leveraged SASSI for error injections
  • Instrumented kernels for profiling and error injections

SLIDE 9

SASSIFI METHODOLOGY

Steps

  • Profile: Identify possible injection sites
  • Instrumented kernels execute on the GPU

[Figure: GPU Kernels, CPU Code, Output]

SLIDE 10

SASSIFI METHODOLOGY

Steps

  • Profile: Identify possible injection sites
  • Instrumented kernels execute on the GPU
  • Statistically select injection sites based on the error model

[Figure: GPU Kernels, CPU Code, Output]

SLIDE 11

SASSIFI METHODOLOGY

Steps

  • Profile: Identify possible injection sites
  • Instrumented kernels execute on the GPU
  • Statistically select injection sites based on the error model
  • Injection runs: inject one error at a time
  • Instrument before/after instructions and collect register/memory info
  • Start the application and inject the error at the selected site
  • Continue execution until a crash or until the output is produced

[Figure: GPU Kernels, CPU Code, Output, Golden Output]
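
The site-selection step can be illustrated with a minimal Python sketch. It assumes the profile is a list of (kernel name, dynamic instruction count) pairs and that sites are drawn uniformly over all dynamic instructions; both the data layout and the uniform sampling are assumptions made for illustration, not a description of SASSIFI's actual scripts. The name `select_injection_sites` is hypothetical.

```python
import bisect
import random
from itertools import accumulate


def select_injection_sites(profile, num_sites, seed=0):
    """Pick injection sites uniformly over all profiled dynamic instructions.

    profile: list of (kernel_name, dynamic_instruction_count) pairs
             (hypothetical layout of the profiling output).
    Returns a list of (kernel_name, instruction_index_within_kernel).
    """
    rng = random.Random(seed)
    counts = [count for _, count in profile]
    cumulative = list(accumulate(counts))
    total = cumulative[-1] if cumulative else 0
    if total == 0:
        raise ValueError("profile contains no dynamic instructions")

    sites = []
    for _ in range(num_sites):
        n = rng.randrange(total)                # global dynamic instruction index
        k = bisect.bisect_right(cumulative, n)  # first kernel whose range contains n
        start = cumulative[k] - counts[k]       # global index of that kernel's first instruction
        sites.append((profile[k][0], n - start))
    return sites


if __name__ == "__main__":
    example_profile = [("kernel_a", 1000), ("kernel_b", 250)]  # made-up counts
    for site in select_injection_sites(example_profile, num_sites=3):
        print(site)
```

Sampling over the cumulative counts means a kernel that executes more instructions receives proportionally more injections, and a kernel with zero profiled instructions is never selected.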

SLIDE 12

OUTCOME CATEGORIES

Categories and explanations:

  • DUE: the application exits with a non-zero exit status, or the application does not terminate within the allocated time (3x fault-free runtime)
  • Potential DUEs: the kernel exit status is not cudaSuccess, or there are error messages in stdout/stderr (e.g., Error: misaligned address)
  • SDC: the program output file or stdout is different
  • Masked: the application output is the same as the error-free output, without any error symptoms
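
The categories above can be expressed as a small classifier. This is a sketch under stated assumptions: the `RunResult` fields are hypothetical stand-ins for whatever an injection run records, and the order in which the checks are applied (DUE first, then potential DUE, then SDC) simply follows the order of the table rather than anything the slides state.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Observations from one injection run (hypothetical field names)."""
    exit_status: int             # process exit status
    runtime_s: float             # wall-clock runtime of the injection run
    kernels_ok: bool             # True if every kernel returned cudaSuccess
    error_messages: bool         # True if stdout/stderr contains error text
    output_matches_golden: bool  # True if output file and stdout match the golden run


def classify(run, fault_free_runtime_s, timeout_factor=3.0):
    """Map one run to DUE, Potential DUE, SDC, or Masked."""
    timed_out = run.runtime_s > timeout_factor * fault_free_runtime_s
    if run.exit_status != 0 or timed_out:
        return "DUE"
    if not run.kernels_ok or run.error_messages:
        return "Potential DUE"
    if not run.output_matches_golden:
        return "SDC"
    return "Masked"


if __name__ == "__main__":
    run = RunResult(exit_status=0, runtime_s=1.2, kernels_ok=True,
                    error_messages=False, output_matches_golden=False)
    print(classify(run, fault_free_runtime_s=1.0))
```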

SLIDE 13

SASSIFI USE CASES

  • What is the probability that a particle strike in the register file produces an SDC?
  • What is the probability that a bit-flip in the destination register of an executing instruction will result in an SDC?
  • What instruction types are likely to produce more SDCs when subjected to errors in destination registers?
  • How do SDC probabilities change when different architecture-level states (addresses vs. values) are subjected to errors?
  • How do the results change if we inject different bit-flip patterns (single vs. double bit-flips)?

Many uses of SASSIFI

SLIDE 14

SASSIFI USE CASES

  • What is the probability that a particle strike in the register file produces an SDC?
  • What is the probability that a bit-flip in the destination register of an executing instruction will result in an SDC?
  • What instruction types are likely to produce more SDCs when subjected to errors in destination registers?
  • How do SDC probabilities change when different architecture-level states (addresses vs. values) are subjected to errors?
  • How do the results change if we inject different bit-flip patterns (single vs. double bit-flips)?

Many uses of SASSIFI

SASSIFI can be used to address all these questions

SLIDE 15

ERROR MODELS

SASSIFI can inject many types of errors

Injection modes and instruction groups:

  • Randomly selected register (RF): All instructions

Bit-flip models:

  • Single thread: Single-bit flip, Double-bit flip

Use case 1: Register file injections for AVF analysis

SLIDE 16

ERROR MODELS

SASSIFI can inject many types of errors

Injection modes and instruction groups:

  • Randomly selected register (RF): All instructions
  • Instruction output value (IOV): GPR

Bit-flip models:

  • Single thread: Single-bit flip, Double-bit flip

Use case 2: Injecting into a destination register of a randomly selected instruction

SLIDE 17

ERROR MODELS

SASSIFI can inject many types of errors

Injection modes and instruction groups:

  • Randomly selected register (RF): All instructions
  • Instruction output value (IOV): GPR, I/F/D ADD-MUL, I/F/D FMA, LDS, LD, ST, CC, PR, SETP

Bit-flip models:

  • Single thread: Single-bit flip, Double-bit flip

Use case 3: Identifying instruction types that produce more SDCs

SLIDE 18

ERROR MODELS

SASSIFI can inject many types of errors

Injection modes and instruction groups:

  • Randomly selected register (RF): All instructions
  • Instruction output value (IOV): GPR, I/F/D ADD-MUL, I/F/D FMA, LDS, LD, ST, CC, PR, SETP
  • Instruction output address (IOA): GPR, ST

Bit-flip models:

  • Single thread: Single-bit flip, Double-bit flip

Use case 4: Injecting into different architecture states

SLIDE 19

ERROR MODELS

SASSIFI can inject many types of errors

Injection modes and instruction groups:

  • Randomly selected register (RF): All instructions
  • Instruction output value (IOV): GPR, I/F/D ADD-MUL, I/F/D FMA, LDS, LD, ST, CC, PR, SETP
  • Instruction output address (IOA): GPR, ST

Bit-flip models:

  • Single thread: Single-bit flip, Double-bit flip, Random value, Zero value
  • All threads in one warp: Single-bit flip, Double-bit flip, Random value, Zero value

Use case 5: Injecting different bit-flip patterns

SLIDE 20

ERROR MODELS

SASSIFI can inject many types of errors

Injection modes and instruction groups:

  • Randomly selected register (RF): All instructions
  • Instruction output value (IOV): GPR, I/F/D ADD-MUL, I/F/D FMA, LDS, LD, ST, CC, PR, SETP
  • Instruction output address (IOA): GPR, ST

Bit-flip models:

  • Single thread: Single-bit flip, Double-bit flip, Random value, Zero value
  • All threads in one warp: Single-bit flip, Double-bit flip, Random value, Zero value

Easy to extend to include other models
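
The four single-thread bit-flip models can be illustrated on one register value. The sketch below assumes a 32-bit register and, for the double-bit flip, two distinct bit positions chosen at random; the tool may choose the positions differently (for example, adjacent bits), so treat that choice and all function names as illustrative assumptions.

```python
import random

REG_BITS = 32  # assumed width of a general-purpose register


def single_bit_flip(value, rng):
    """Flip one randomly chosen bit of the register value."""
    return value ^ (1 << rng.randrange(REG_BITS))


def double_bit_flip(value, rng):
    """Flip two distinct randomly chosen bits (position choice is an assumption)."""
    first, second = rng.sample(range(REG_BITS), 2)
    return value ^ (1 << first) ^ (1 << second)


def random_value(value, rng):
    """Replace the register value with a random 32-bit value."""
    return rng.getrandbits(REG_BITS)


def zero_value(value, rng):
    """Replace the register value with zero."""
    return 0


BIT_FLIP_MODELS = {
    "Single-bit flip": single_bit_flip,
    "Double-bit flip": double_bit_flip,
    "Random value": random_value,
    "Zero value": zero_value,
}

if __name__ == "__main__":
    rng = random.Random(0)
    original = 0x12345678
    for name, model in BIT_FLIP_MODELS.items():
        print(f"{name}: 0x{model(original, rng):08x}")
```

Keeping the models in a table keyed by name mirrors the slide's point that further models are easy to add: a new entry is one more function with the same signature.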

SLIDE 21

IMPLEMENTING DIFFERENT ERROR MODES

Instrumented instruction:

    .
    .
    sassi_before_handler()
    Opcode Dest, Src1, Src2
    sassi_after_handler()
    .
    .

IOV mode

  • sassi_before_handler(): Empty handler
  • sassi_after_handler(): Inject error at the selected instruction count according to the selected bit-flip model

IOA mode

  • sassi_before_handler(): Record register/memory content at the selected instruction count
  • sassi_after_handler(): At the injection instruction, read values from the correct address and write them to the corrupted address, then revert the content at the correct address with the recorded values

RF mode

  • sassi_before_handler(): At the selected instruction, inject error if the selected register is a source. If not, monitor subsequent instructions and inject when found as a source.
  • sassi_after_handler(): Empty handler
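
The IOA-mode sequence (record, let the instruction run, copy to the corrupted address, revert) can be walked through on a toy memory. This sketch models memory as a Python list and the instrumented instruction as a store; the function name and the flat-list memory are assumptions for illustration only, and it assumes the corrupted address differs from the correct one.

```python
def store_with_address_error(memory, correct_addr, corrupted_addr, value):
    """Emulate a store whose address was corrupted, following the IOA steps.

    memory: list of words standing in for GPU memory (toy model).
    """
    # Before handler: record the content at the correct address.
    recorded = memory[correct_addr]

    # The instrumented store executes normally and writes to the correct address.
    memory[correct_addr] = value

    # After handler, step 1: read the value from the correct address
    # and write it to the corrupted address.
    memory[corrupted_addr] = memory[correct_addr]

    # After handler, step 2: revert the correct address to the recorded value,
    # so it looks as if the store never reached it.
    memory[correct_addr] = recorded
    return memory


if __name__ == "__main__":
    mem = [10, 20, 30, 40]
    print(store_with_address_error(mem, correct_addr=1, corrupted_addr=3, value=99))
```

The net effect is the one an address error would have: the stored value lands at the wrong location and the intended location keeps its old content.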

SLIDE 22

RESULTS: USE CASE 1

  • SDC AVF = SDC probability from injections in occupied registers * RF occupancy
  • 0.075 and 0.07 for CoMD and Lulesh, respectively, for single-bit flips

Register File AVF

[Figure: Error injection results. % of injections by outcome (Masked, DUEs, Potential DUEs, SDCs) for single-bit and double-bit flips in CoMD and Lulesh]
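
The SDC AVF formula on this slide can be written as a short calculation. The inputs in the usage example are made-up numbers chosen only to show the arithmetic; they are not the measurements behind the CoMD or Lulesh results, and the function name is hypothetical.

```python
def sdc_avf(sdc_count, injection_count, rf_occupancy):
    """SDC AVF = SDC probability (injections into occupied registers) * RF occupancy."""
    if injection_count <= 0:
        raise ValueError("injection_count must be positive")
    if not 0.0 <= rf_occupancy <= 1.0:
        raise ValueError("rf_occupancy must be a fraction between 0 and 1")
    sdc_probability = sdc_count / injection_count
    return sdc_probability * rf_occupancy


if __name__ == "__main__":
    # Hypothetical inputs: 200 SDCs out of 1000 injections, 40% of the RF occupied.
    print(round(sdc_avf(sdc_count=200, injection_count=1000, rf_occupancy=0.4), 3))
```

Scaling by occupancy accounts for the fact that injections are made only into occupied registers, while a particle strike can also hit an unoccupied register, where it cannot affect the program.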

SLIDE 23

RESULTS: USE CASE 2

Program-level manifestations of errors that propagate to instruction outputs are application dependent

Injecting into a destination register of a randomly selected instruction

[Figure: % of injections by outcome (Masked, DUEs, Potential DUEs, SDCs) for CoMD, Lulesh, b+tree, backprop, bfs, gaussian, heartwall, hotspot, kmeans, lavaMD, lud, mummergpu, nn, nw, pathfinder, srad_v1, srad_v2, streamcluster]

SLIDE 24

RESULTS: USE CASE 3

  • Double-precision instructions are less susceptible than integer instructions for CoMD
  • Results for the remaining use cases can be found in the paper

Identifying instruction types that produce more SDCs

[Figure: Outcome probabilities (Masked, DUEs, Potential DUEs, SDCs) weighted by the % of dynamic instructions, for single-bit flip and random value injections into the instruction groups IADD-IMAD (also labeled IADD-IMUL), DADD-DMUL, MAD, DFMA, LDS, LD]

SLIDE 25

SLOWDOWNS

  • 1.02x to 166x slowdowns at the application level (depends on host vs. GPU runtime)
  • Kernel-level slowdowns were higher, ranging from 5.2x to 488x
  • Orders of magnitude faster than lower-level simulators

Modest application-level slowdowns

Performance in Million Warp-Instructions Per Second (MWIPS):

                        CoMD    Lulesh    Rodinia (Geomean)
  SASSIFI RF Mode       94      85        57
  SASSIFI IOV Mode      81      81        57
  SASSIFI IOA Mode      55      46        64
  GPGPU Sim             <0.1
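
The "orders of magnitude" claim can be checked against the table with a few lines of arithmetic. The sketch below copies the MWIPS figures from the table and treats the reported "<0.1" for GPGPU Sim as an upper bound of 0.1, so the computed ratio is a lower bound on the speedup; the variable and function names are illustrative.

```python
import math

# Throughput in MWIPS, copied from the table above.
SASSIFI_MWIPS = {
    "RF Mode": {"CoMD": 94, "Lulesh": 85, "Rodinia (Geomean)": 57},
    "IOV Mode": {"CoMD": 81, "Lulesh": 81, "Rodinia (Geomean)": 57},
    "IOA Mode": {"CoMD": 55, "Lulesh": 46, "Rodinia (Geomean)": 64},
}
GPGPU_SIM_UPPER_BOUND_MWIPS = 0.1  # the table reports "<0.1"


def min_speedup(sassifi_mwips):
    """Lower bound on the speedup over GPGPU Sim for one table entry."""
    return sassifi_mwips / GPGPU_SIM_UPPER_BOUND_MWIPS


# Use the slowest SASSIFI entry so the bound holds for every mode and workload.
slowest = min(v for mode in SASSIFI_MWIPS.values() for v in mode.values())
orders_of_magnitude = math.floor(math.log10(min_speedup(slowest)))

if __name__ == "__main__":
    print(f"slowest SASSIFI entry: {slowest} MWIPS")
    print(f"at least {orders_of_magnitude} orders of magnitude faster than GPGPU Sim")
```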

SLIDE 26

CONCLUSIONS

  • Developed an error-injection-based tool for GPU application resilience evaluation
  • Fast in-silicon error injections
  • Flexible to inject many types of errors
  • Demonstrated by conducting various types of resilience studies
  • Released the code for public use
      • GitHub: https://github.com/NVlabs/sassifi

SASSIFI tool for GPU application resilience evaluations