FUNCTIONAL SAFETY AND THE GPU Richard Bramley, 5/11/2017 How good - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Richard Bramley, 5/11/2017

FUNCTIONAL SAFETY AND THE GPU

slide-2
SLIDE 2

2

AGENDA

  • How good is good enough
  • What is functional safety
  • Functional safety and the GPU
  • Safety support in Nvidia GPU
  • Conclusions

slide-3
SLIDE 3

3

HOW GOOD IS GOOD ENOUGH ?

slide-4
SLIDE 4

4

  • N. Saxena

ACCIDENT STATISTICS – US (1)

Description                           2013 Statistics      2015 Statistics
Fatal Crashes                         30,057               35,092
Non-Fatal Crashes                     5,657,000            6,263,834
Number of Registered Vehicles         269,294,000          281,312,446
Licensed Drivers                      212,160,000          218,084,465
Vehicle Miles Travelled               2,988,000,000,000    3,095,373,000,000
Fatal Crash Rate in FITs (2,3)        250 – 500            283 – 566
Non-Fatal Crash Rate in FITs (2,3)    46K – 92K            51K – 102K

What is an appropriate target?

Google Non-Fatal Crash FIT Rate = 150K

(1) Source: Traffic Safety Facts 2013/2015, NHTSA document reference DOT HS 812384
(2) Derived from NHTSA data on driver related fatal crashes
(3) Assumes an average speed of 50 MPH
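The FIT figures in the table can be approximated from the raw counts. The sketch below assumes the usual definition of FIT (one failure per 10^9 hours) and the slide's stated average speed of 50 MPH; the function name `crash_rate_fit` is illustrative. It lands near the upper end of the 2015 ranges. How the lower end of each range was derived is not stated on the slide and is not reproduced here.

```python
HOURS_PER_FIT_UNIT = 1e9  # FIT = failures per 10^9 hours (standard definition)


def crash_rate_fit(crashes, vehicle_miles, avg_speed_mph=50.0):
    """Convert a yearly crash count into a rate in FIT.

    Driving hours are estimated as miles / average speed, which is the
    assumption stated in the slide footnote (50 MPH).
    """
    driving_hours = vehicle_miles / avg_speed_mph
    return crashes / driving_hours * HOURS_PER_FIT_UNIT


# 2015 figures taken from the table above
miles_2015 = 3_095_373_000_000
fatal_2015 = crash_rate_fit(35_092, miles_2015)
non_fatal_2015 = crash_rate_fit(6_263_834, miles_2015)

print(f"Fatal crash rate 2015:     {fatal_2015:,.0f} FIT")
print(f"Non-fatal crash rate 2015: {non_fatal_2015:,.0f} FIT")
```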

slide-5
SLIDE 5

5

TARGET FAILURE RATES

Description                                          Statistics
Acceptable risk (no further improvement required)    1:1,000,000 (1)
US population (2015)                                 >321,000,000
Traffic deaths                                       35,092
“Acceptable” deaths as per guidelines                321
Required improvement                                 x100

(1) Derived following data from UK Health and Safety Executive publications

  • Wide variety of targets in industry
  • Target risk reduction of 2x to 100x compared to human driver
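The "x100" figure follows from simple arithmetic on the table values. A minimal sketch, using only the numbers on the slide (variable names are illustrative):

```python
# Values from the slide
acceptable_risk = 1 / 1_000_000   # acceptable risk of death per person
us_population_2015 = 321_000_000  # slide states ">321,000,000"
traffic_deaths_2015 = 35_092

# Deaths that would be "acceptable" at the guideline risk level
acceptable_deaths = us_population_2015 * acceptable_risk

# Factor by which actual deaths exceed the acceptable level
required_improvement = traffic_deaths_2015 / acceptable_deaths

print(f"Acceptable deaths:    {acceptable_deaths:.0f}")
print(f"Required improvement: x{required_improvement:.0f}")
```

The exact ratio is a little over 109, which the slide rounds to x100.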

slide-6
SLIDE 6

6

SAFETY AND AUTONOMOUS VEHICLES

  • Safety during intended operation: Safety of the intended function (SOTIF, ISO/PAS 21448, in development)
  • Safety in presence of a fault: Functional Safety (ISO 26262)
  • Algorithms, Software, Hardware

slide-7
SLIDE 7

7

FUNCTIONAL SAFETY BASICS

slide-8
SLIDE 8

8

DEFINITION PER STANDARDS

“Absence of unreasonable risk due to hazards caused by malfunctioning behavior of electrical/electronic systems” – ISO 26262-1:2011; 1.51

“Part of the overall safety relating to the equipment under control (EUC) and the EUC control system that depends on the correct functioning of the electrical/electronic/programmable electronic safety-related systems and other risk reduction measures” – IEC 61508-4:2010; 3.1.12

slide-9
SLIDE 9

9

CLASSIC EXAMPLE

  • Consider a motor winding which may overheat and cause a hazard.
  • Reliability engineering approach might design the winding to be more resilient to over-temperature conditions.
  • Functional safety engineering approach might add a temp sensor to detect the over-temperature condition and switch off the motor.

IEC 61508-0:2005; 3.1

https://upload.wikimedia.org/wikipedia/commons/0/0f/Stator_Winding_of_a_BLDC_Motor.jpg
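The functional safety approach in the motor example can be sketched as a monitor that detects the over-temperature condition and drives the system to its safe state (motor off). The class names and the 130 °C limit are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Motor:
    """Stand-in for the equipment under control."""
    running: bool = True

    def switch_off(self) -> None:
        self.running = False


class OverTemperatureMonitor:
    """Safety mechanism: detect over-temperature and switch the motor off."""

    def __init__(self, motor: Motor, limit_c: float) -> None:
        self.motor = motor
        self.limit_c = limit_c  # hypothetical limit, not from the slide

    def check(self, winding_temp_c: float) -> bool:
        """Return True if a fault was detected and the safe state entered."""
        if winding_temp_c > self.limit_c:
            self.motor.switch_off()  # transition to the safe state
            return True
        return False


motor = Motor()
monitor = OverTemperatureMonitor(motor, limit_c=130.0)
normal_reading_tripped = monitor.check(90.0)   # normal reading: motor keeps running
hot_reading_tripped = monitor.check(150.0)     # over-temperature: motor is switched off
print(motor.running)
```

Note that the monitor does not make the winding more robust; it only limits the consequence of the fault, which is the distinction the slide draws against the reliability engineering approach.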

slide-10
SLIDE 10

10

ACHIEVING FUNCTIONAL SAFETY

Systematic and random faults must be considered

Systematic faults are mitigated by:
  • Following compliant process at all stages of development
  • Monitoring of the complete product lifecycle

Random faults are mitigated by:
  • Failure mode analysis to understand the fault behavior of the system
  • Application of diagnostic measures to detect the failure modes
  • Transition to the safe state on failure mode detection

slide-11
SLIDE 11

11

FAIL SAFE

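[State diagram: Good State, Safe State and Failed State. Detected failures lead from the Good State to the Safe State; undetected failures lead to the Failed State.]

Legend: m = mission, b = backup, (x) = m or b is in repair mode.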

slide-12
SLIDE 12

12

FAIL OPERATIONAL

[State diagram: Good State, Backup, Repair, Final safe state and Failed State, with transitions labeled Detected Failures and Undetected Failures.]

Legend: m = mission, b = backup, (x) = m or b is in repair mode.

For full autonomy the initial “safe state” can be a transition to the backup system
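The fail-operational behavior can be sketched as a small state machine. The transition table below is an assumption reconstructed from the state and edge labels of the diagram (the arrows themselves did not survive in this transcript): a detected failure in the good state switches to the backup, repair returns to the good state, a second detected failure reaches the final safe state, and any undetected failure ends in the failed state.

```python
from enum import Enum, auto


class State(Enum):
    GOOD = auto()
    BACKUP = auto()
    FINAL_SAFE = auto()
    FAILED = auto()


class Event(Enum):
    DETECTED_FAILURE = auto()
    UNDETECTED_FAILURE = auto()
    REPAIR = auto()


# Assumed transitions; the slide's exact arrows are not recoverable.
TRANSITIONS = {
    (State.GOOD, Event.DETECTED_FAILURE): State.BACKUP,
    (State.GOOD, Event.UNDETECTED_FAILURE): State.FAILED,
    (State.BACKUP, Event.REPAIR): State.GOOD,
    (State.BACKUP, Event.DETECTED_FAILURE): State.FINAL_SAFE,
    (State.BACKUP, Event.UNDETECTED_FAILURE): State.FAILED,
}


def step(state: State, event: Event) -> State:
    """Apply one event; events with no listed transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.GOOD) -> State:
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state


print(run([Event.DETECTED_FAILURE, Event.REPAIR]))
print(run([Event.DETECTED_FAILURE, Event.DETECTED_FAILURE]))
```

A fail-safe system corresponds to the same table without the BACKUP state: a detected failure goes directly to the safe state.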

slide-13
SLIDE 13

13

FAULT CLASSIFICATIONS

ISO 26262-10; B.1

All Faults λ
  • Non-Safety Related Element λNSR
      • Safe λS
  • Safety Related Element λSR
      • Safe λS
      • Single Point λSPF
      • Residual λRF
      • Multi-Point Latent λMPF,L
      • Multi-Point Detected λMPF,D
      • Multi-Point Perceived λMPF,P

slide-14
SLIDE 14

14

SINGLE POINT FAULT METRIC (SPFM)

Shows the percentage of overall single point faults which are:
  • Safety related AND
  • Safe OR dangerous but detected

λS is the safe fault failure rate. It can also be expressed as a percentage (Fsafe): the ratio of overall possible faults which are safe.

slide-15
SLIDE 15

15

LATENT FAULT METRIC (LFM)

Shows the percentage of overall multiple point faults which are:
  • Safety related AND
  • Safe OR dangerous but detected OR dangerous but perceived

Customarily limited to scenarios considering 2 point independent faults

Primary consideration is fault in mission logic combined with fault in safety mechanism
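The two architectural metrics can be computed from per-element failure rates. The sketch below uses the formulas as they are commonly stated from ISO 26262-5 (the slide's own equations did not survive in this transcript): SPFM = 1 − Σ(λSPF + λRF) / Σλ and LFM = 1 − Σ λMPF,L / Σ(λ − λSPF − λRF), summed over safety-related hardware elements. The `ElementRates` type and the example rates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ElementRates:
    """Failure rates (any consistent unit, e.g. FIT) of one safety-related element."""
    total: float                # λ
    single_point: float = 0.0   # λSPF
    residual: float = 0.0       # λRF
    latent: float = 0.0         # λMPF,L


def spfm(elements):
    """Single point fault metric: share of faults that are not single point or residual."""
    total = sum(e.total for e in elements)
    uncovered = sum(e.single_point + e.residual for e in elements)
    return 1.0 - uncovered / total


def lfm(elements):
    """Latent fault metric: share of the remaining faults that are not latent."""
    remaining = sum(e.total - e.single_point - e.residual for e in elements)
    latent = sum(e.latent for e in elements)
    return 1.0 - latent / remaining


# Hypothetical example data, not measurements from the presentation
elements = [
    ElementRates(total=100.0, residual=1.0, latent=5.0),
    ElementRates(total=50.0, single_point=0.5, latent=2.0),
]
print(f"SPFM = {spfm(elements):.2%}")
print(f"LFM  = {lfm(elements):.2%}")
```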

slide-16
SLIDE 16

16

ARCHITECTURAL METRIC TARGETS

        ASIL A    ASIL B    ASIL C    ASIL D
SPFM    N/A       >=90%     >=97%     >=99%
LFM     N/A       >=60%     >=80%     >=90%

All targets are recommendations. Developers can set their own targets based on appropriate argumentation.
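A measured SPFM/LFM pair can be checked against these recommended targets with a simple lookup. This is a sketch only: the function name is illustrative, and since the targets are recommendations, a real safety case may argue for different values.

```python
# Recommended (SPFM, LFM) targets from the table above; ASIL A has no target.
TARGETS = {
    "ASIL B": (0.90, 0.60),
    "ASIL C": (0.97, 0.80),
    "ASIL D": (0.99, 0.90),
}


def highest_asil_met(spfm: float, lfm: float) -> str:
    """Return the highest ASIL whose recommended SPFM and LFM targets are both met."""
    met = "ASIL A"  # no architectural metric target applies at ASIL A
    for level in ("ASIL B", "ASIL C", "ASIL D"):
        spfm_target, lfm_target = TARGETS[level]
        if spfm >= spfm_target and lfm >= lfm_target:
            met = level
        else:
            break  # targets only get stricter at higher levels
    return met


print(highest_asil_met(0.99, 0.95))
print(highest_asil_met(0.98, 0.85))
```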

slide-17
SLIDE 17

17

PROBABILISTIC METRICS

Probabilistic Metric for (Random) Hardware Failure (PMHF)

Examines the residual probability of violation of safety goal after application of diagnostics, in a given time of operation.

Some pushback in market due to inconsistency between methods used by different vendors.

ISO 26262-10:2011; 8.3.3

NOTE: Multiple versions of equation possible depending on conditional probability of failures. Simplest form shown
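The equation shown on the slide is not recoverable from this transcript. As an illustration only, the sketch below uses one simplified form that is often quoted for PMHF: the sum of the single point and residual fault rates plus a dual-point term, λDPF,det × λDPF,latent × T_lifetime. Treat the formula choice, the function name and all example numbers as assumptions.

```python
PER_HOUR_PER_FIT = 1e-9  # 1 FIT = 1e-9 failures per hour


def pmhf_fit(spf_fit, rf_fit, dpf_detected_fit, dpf_latent_fit, lifetime_hours):
    """Simplified PMHF in FIT (assumed form, see lead-in).

    The dual-point term multiplies two rates and a time, so the rates are
    converted to per-hour values first and the result converted back to FIT.
    """
    dual_point_per_hour = (
        (dpf_detected_fit * PER_HOUR_PER_FIT)
        * (dpf_latent_fit * PER_HOUR_PER_FIT)
        * lifetime_hours
    )
    return spf_fit + rf_fit + dual_point_per_hour / PER_HOUR_PER_FIT


# Hypothetical example values, not from the presentation
pmhf = pmhf_fit(
    spf_fit=1.0,
    rf_fit=4.0,
    dpf_detected_fit=500.0,
    dpf_latent_fit=100.0,
    lifetime_hours=10_000,
)
print(f"PMHF = {pmhf:.2f} FIT")
```

With these inputs the dual-point term contributes 0.5 FIT, which shows why the single point and residual rates usually dominate the result.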

slide-18
SLIDE 18

18

PMHF TARGETS

        ASIL A    ASIL B     ASIL C     ASIL D
PMHF    N/A       100 FIT    100 FIT    10 FIT

All targets are recommendations. Developers can set their own targets based on appropriate argumentation.

slide-19
SLIDE 19

19

RELEVANCE TO GPU

slide-20
SLIDE 20

20

EXAMPLES OF SAFETY CRITICAL OPERATION ON GPU

TRADITIONAL CV

  • Normalize gamma and color
  • Compute gradients
  • Weighted voting
  • Contrast and normalize
  • Collect HOGs
  • Traditional classification (pattern and template matching)

MACHINE LEARNING*

  • CNN (Convolutional neural network)
  • MLP (Multi-layer perceptron)
  • SVM (Support vector machine)

*Focus is inferencing; training is handled analogously to validation and calibration of a traditional safety related algorithm.

slide-21
SLIDE 21

21

GPU MEASUREMENT METHODOLOGIES

Measurement methods:
  • Silicon-based fault injection
  • Design simulation fault injection
  • Beam testing
  • C-models / RAM liveness

Measured on representative workloads:
  • Architectural safeness
  • Diagnostic coverage (SPFM, LFM)
  • SRAM “liveness”

Much of the measurement is done on representative kernels, as the final applications are not available at design time.

Further Safety Analysis

slide-22
SLIDE 22

22

MEASURING SAFE FAULTS IN RAMS “LIVENESS”

  • RAMs are sensitive to particle radiation (4x larger failure rate per bit than flops)
  • RAM contents may not be sensitive to faults (pixels)
  • RAM contents may be very sensitive to faults (instructions)
  • An important indicator is RAM liveness

[Chart: occupancy (%) per RAM and workload, axis 0 to 120. RAMs: rfdp, L2, L1, icc, ifb, gcc, tail. Workloads: k27376, k27382, k27398, k27422, sparse, cudnn1, cudnn2, cudnn3, HOG, winograd, harris.]

[Timing diagram: words W1 to W4 over the execution time Texe, with write and read events and the intervals tr1, tr2, tr3 = 0, tr4 and tr4’.]

The occupancy can be computed: (tr1 + tr2 + tr3 + tr4 + tr4’) / (4 x Texe).

The majority of RAMs in this GPU have less than 10% occupancy, so Fsafe > 90%.
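The occupancy calculation can be sketched from a trace of write and read events. This assumes that a word is "live" from each write until the last read before the next write (so a word that is written but never read contributes 0, as with tr3 in the diagram), and that a fault outside a live interval is safe. The function names and the trace are hypothetical.

```python
def live_time(events):
    """Total live time of one RAM word.

    events: time-ordered list of (time, kind) with kind "write" or "read".
    A word is counted as live from a write until the last read that
    happens before the next write (assumption, see lead-in).
    """
    live = 0.0
    last_write = None
    last_read = None
    for time, kind in events:
        if kind == "write":
            if last_write is not None and last_read is not None:
                live += last_read - last_write
            last_write, last_read = time, None
        elif kind == "read" and last_write is not None:
            last_read = time
    if last_write is not None and last_read is not None:
        live += last_read - last_write
    return live


def occupancy(words, t_exe):
    """Sum of live times divided by (number of words x execution time)."""
    return sum(live_time(events) for events in words) / (len(words) * t_exe)


# Hypothetical trace for four words over an execution time of 100 units
trace = [
    [(0, "write"), (10, "read")],                               # live for 10
    [(20, "write"), (25, "read"), (40, "read")],                # live for 20
    [(50, "write")],                                            # never read: 0
    [(10, "write"), (15, "read"), (60, "write"), (70, "read")], # live for 5 + 10
]
occ = occupancy(trace, t_exe=100)
print(f"Occupancy = {occ:.2%}, Fsafe estimate = {1 - occ:.2%}")
```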

slide-23
SLIDE 23

23

TESTING REPRESENTATIVE KERNELS

  • Parameter measurement is very sensitive to kernel definition
  • Traditional CV has a wide diversity of operations
      • Difficult to define representative kernels
  • Machine learning has a smaller set of repeated operations
      • Enabling a more complete definition of kernels for measurements
      • More accurate and reliable measurements

slide-24
SLIDE 24

24

DEEP LEARNING APPLICATION SAFENESS

[Chart: "Output 1000 Class, GoogLeNet, total_run=5000". X-axis: kernel ID sorted by launch order. Y-axes: ratio (0% to 100%) and #Runs (200 to 1200). Series: FAIL counts, residual fault ratio, diagnosed fault ratio.]

GIE GoogLeNet

  • 67 kernels in GoogLeNet inference
  • Faults in later kernels have a higher possibility to cause errors
  • #FAIL counts represents the proportion of faults for which the application predicted the wrong final class
  • Weighted average safeness is >99%
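A weighted average safeness can be sketched as follows. The slide does not say what the weights are; this example assumes each kernel is weighted by its number of fault-injection runs, and treats a run as safe when the final predicted class is still correct. All numbers are hypothetical.

```python
def weighted_safeness(per_kernel):
    """Run-weighted average of per-kernel safeness.

    per_kernel: list of (runs, fails), where fails is the number of
    injected faults that changed the final predicted class.
    Weighting by run count is an assumption, not stated on the slide.
    """
    total_runs = sum(runs for runs, _ in per_kernel)
    weighted = sum(runs * (1.0 - fails / runs) for runs, fails in per_kernel)
    return weighted / total_runs


# Hypothetical campaign: (runs, fails) for three kernels in launch order
campaign = [(1000, 2), (800, 4), (200, 6)]
safeness = weighted_safeness(campaign)
print(f"Weighted average safeness = {safeness:.2%}")
```

In this made-up campaign the last kernel is the least safe on its own (97%), yet the weighted result stays above 99% because it accounts for few of the runs.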

slide-25
SLIDE 25

26

SAFETY SUPPORT IN NVIDIA GPUS

slide-26
SLIDE 26

27

SYSTEMATIC DEVELOPMENT OF GPU HARDWARE

Selected GPU cores targeted for automotive usage are developed with a process for ISO 26262 compliance

slide-27
SLIDE 27

28

LAYERED SAFETY MECHANISMS

  • HW plausibility checks enabling multiple execution checks throughout the GPU
  • Protection of large safety related memories
  • Dependent failure mitigation; mainly caches and shared structures

  • Parity/ECC protection of key structures
  • HW machine checks
  • Redundant execution

slide-28
SLIDE 28

29

FLEXIBLE REDUNDANCY MODEL

[Block diagram: GPU Context with Channel 0, Channel 1 ... Channel N; Work Distribution; SM0, SM1, SM2, SM3; Memory Access.]

  • GPU flexible execution model
  • Built-in HW and SW diagnostics: machine checks, parity/ECC, common cause failure mitigation
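The general idea of redundant execution can be sketched in a few lines: the same work is run twice and the results are compared, so a fault that corrupts one copy is detected as a mismatch. This is a generic illustration, not a description of how the GPU distributes work across channels or SMs; all names are hypothetical.

```python
class RedundancyMismatch(Exception):
    """Raised when the two redundant results disagree."""


def run_redundant(kernel, inputs, inject_fault=None):
    """Run the kernel twice and compare the results.

    inject_fault is an optional function applied to the first result only,
    standing in for a random hardware fault in one execution.
    """
    primary = kernel(inputs)
    shadow = kernel(inputs)
    if inject_fault is not None:
        primary = inject_fault(primary)
    if primary != shadow:
        raise RedundancyMismatch(f"primary={primary!r} shadow={shadow!r}")
    return primary


def sum_of_squares(values):
    """Stand-in for a GPU kernel."""
    return sum(v * v for v in values)


def flip_lowest_bit(value):
    """Simulated single-bit corruption of an integer result."""
    return value ^ 1


result = run_redundant(sum_of_squares, [1, 2, 3])
print(result)

try:
    run_redundant(sum_of_squares, [1, 2, 3], inject_fault=flip_lowest_bit)
    fault_detected = False
except RedundancyMismatch:
    fault_detected = True
print(fault_detected)
```

Comparison alone does not protect against faults that affect both copies in the same way, which is why the slide lists common cause failure mitigation alongside it.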

slide-30
SLIDE 30

31

SYSTEMATIC CONSIDERATIONS

  • Software in the runtime is under development for ISO 26262 compliance
  • Software used in development (training) considered as off-line tools per ISO 26262

Software and tools

TensorRT

slide-31
SLIDE 31

32

GPU FAULT MITIGATION

slide-32
SLIDE 32

33

CONCLUSIONS

  • Nvidia is developing selected GPUs for compliance to ISO 26262
  • Nvidia has multiple unique capabilities to analyze safety-related performance of GPUs
  • Analysis to date indicates DNNs have a high degree of internal redundancy that results in high ratio of safe faults
  • Selected GPUs are being built with additional hardware and software diagnostic mechanisms
  • Nvidia is developing software and tools needed to support safety related development
