
Invited talk, NODES Winter seminar, Turku, Finland February 3, 2012 Johan Karlsson Chalmers University of Technology, Göteborg, Sweden

Fault Injection-based Assessment of Software Techniques for Hardware Fault Tolerance

Johan Karlsson (work with Ruben Alexandersson, Daniel Skarin, Raul Barbosa, Peter Öhman, Domenico Di Leo, Behrooz Sangchoolie, Fatemeh Ayat)

Department of Computer Science and Engineering Chalmers University of Technology Göteborg, Sweden

Transistor reliability trends

Shekhar Borkar, Intel Corp: “As technology scales, variability in transistor performance will continue to increase, making transistors less and less reliable. …. Finding solutions to these challenges will require a concerted effort on the part of all the players in a system design.”

Borkar, S.; "Designing reliable systems from unreliable components: the challenges of transistor variability and degradation," IEEE Micro, December 2005.


Outline

  • Hardware reliability trends
  • Layered fault tolerance
  • Fault injection
  • Target application: Brake-by-wire controller
  • Low-cost software techniques
  • High-cost software techniques
  • Tool and experimental set-up
  • Summary
  • Future work

Main sources of transistor faults

  • Process variations
    – Random variations related to lithography, etching, dopant count
    – Voltage and temperature variations
  • Wear out effects (degradation)
    – NBTI - negative bias temperature instability
    – HCI - hot carrier injection
    – Gate oxide breakdown
    – Electromigration
    – …
  • Soft errors
    – Bit-flips in latches, flip-flops and memory cells
    – Mainly caused by cosmic-ray-induced high-energy neutrons (cosmic neutrons)
    – Soft errors cause no permanent damage to hardware


Trends in the bathtub curve

[Figure: bathtub curve of failure rate versus time with three phases: infant mortality, constant failure rate, and wear out. Time-axis labels: 1 – 20 weeks, 3 – 10 years.]

  • Infant mortality: Increasing manufacturing defects
  • Constant failure rate: Increasing rate of transient, intermittent and permanent faults
  • Wearout: Acceleration of aging phenomena

Source: Vikas Chandra, ARM R&D, “Dependable Design in Nanoscale CMOS Technologies: Challenges and Solutions,” keynote address, WDSN, Estoril, Portugal, June 29, 2009

Soft error rate trend for SRAM

(Radiation test data from Sun Microsystems)

Source: A. Dixit, R. Heald, and A. Wood, “Trends from Ten Years of Soft Error Experimentation,” SELSE'09, Stanford, CA, USA.

1 FIT = 10^-9 faults per hour

Raw soft error rate trend for microprocessors

(Data from Sun Microsystems)

Technology   Year         Relative SEU rate   Mbits per   Relative uncorrected SEU
node (nm)    introduced   (FITs/kbit)         processor   rate per microprocessor
250          1998         3.2                 1.52        5.0
180          1999         3.0                 1.52        4.3
130          2000         2.4                 3.28        7.9
90           2002         1.0                 33.6        33.6
65           2006         0.7                 44.3        30.5
40           2008         0.94                71          67

Source: A. Dixit, R. Heald, and A. Wood, “The Impact of New Technology on Soft Error Rates,” SELSE-6, Stanford, CA, USA, 2010

1 FIT = 10^-9 faults per hour
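
The per-processor figures in the table above look close to the product of the per-kbit rate and the amount of memory per processor. The sketch below checks that reading of the numbers and converts a FIT value into a mean time between faults. The function names are illustrative, and the product relationship is an observation about the printed values, not something the source states.

```python
# Rows copied from the table above:
# (node in nm, relative FITs/kbit, Mbits/processor, printed per-processor rate)
ROWS = [
    (250, 3.2, 1.52, 5.0),
    (180, 3.0, 1.52, 4.3),
    (130, 2.4, 3.28, 7.9),
    (90, 1.0, 33.6, 33.6),
    (65, 0.7, 44.3, 30.5),
    (40, 0.94, 71.0, 67.0),
]


def relative_processor_rate(rate_per_kbit, mbits):
    """Assumed relationship: per-kbit rate scaled by the memory size."""
    return rate_per_kbit * mbits


def fit_to_hours_between_faults(fit):
    """1 FIT = 1e-9 faults per hour, so the mean time is 1e9 / FIT hours."""
    return 1e9 / fit


# Relative deviation between the product and the printed value, per node.
deviations = {
    node: abs(relative_processor_rate(rate, mbits) - printed) / printed
    for node, rate, mbits, printed in ROWS
}

for node, deviation in deviations.items():
    print(f"{node} nm: product deviates by {100 * deviation:.1f}% from the table")
```

Under this reading, the per-bit rate falls with each technology node while the per-processor rate rises, because the amount of on-chip memory grows faster than the per-bit rate shrinks.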


Layered fault tolerance

[Figure: layered fault-tolerance model with three lines of defense: hardware mechanisms (1st line of defense), software mechanisms (2nd line of defense) and system mechanisms (3rd line of defense). Fault sources shown: SW design faults, HW design faults, physical faults. Processor failure modes shown: error corrected, detected error, undetected error. Further failure-mode labels: fail silent, fail signal, timing failure, value failure, bounded failure. System failure modes shown: critical failure, benign failure, safe shutdown. One part of the figure is marked “Focus of my talk”.]

Error handling in hardware: some examples

  • Duplication and comparison
    – E.g., lock-stepped processors
    – High cost, high energy consumption and high failure rate
  • Error correction code (ECC) and parity bits
    – Commonly used to protect caches and other memory arrays
  • Instruction retry
    – Re-execution of machine instruction after ECC or parity error
  • Reloading of untouched data from main memory when uncorrectable errors occur in the cache
  • Etc …


Fault Injection

  • Fault injection is a technique for verification and validation of fault and error handling mechanisms
  • Exposes a system, subsystem or component to artificial faults
  • Sometimes called FMET – Failure Mode Effects Testing (cf. FMEA)
  • Main benefit: improves our understanding of how a system behaves in the presence of faults and errors


Uses of fault injection

  • Fault forecasting
    – E.g., estimation of error detection coverage
  • Fault removal
    – To find bugs in fault and error handling mechanisms
  • Benchmarking
    – Comparison of alternative design solutions
    – Identify weaknesses
  • Evaluation-driven design
    – Iterative process of design, evaluation and improvement

Error model

  • We use single bit-flip errors to benchmark the error sensitivity of executable programs with respect to transistor faults in microprocessors
  • The single bit-flip model is an engineering approximation
  • Bit-flips are injected in CPU registers and the data segment of main memory
  • Injection is done just before the register or memory word is read by a machine instruction. This ensures injection of errors in live data
  • We use pre-injection analysis of a fault-free execution trace to avoid injection in registers or memory that hold dead data
  • There is no guarantee against injecting errors in data items that are transitively dead
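
The error model above can be sketched in a few lines. The block below is a minimal illustration, assuming a 32-bit word and a hypothetical trace format (`Access` records with a time, a location and an access kind); it is not the format used by the tool in the talk. It flips one bit of a word and lists injection points only for locations that are about to be read in a fault-free trace.

```python
from dataclasses import dataclass

WORD_BITS = 32  # assumed word size for this sketch


def flip_bit(word, bit):
    """Return `word` with one bit inverted (single bit-flip error model)."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit index out of range")
    return (word ^ (1 << bit)) & ((1 << WORD_BITS) - 1)


@dataclass(frozen=True)
class Access:
    time: int      # instruction index in the fault-free trace
    location: str  # register name or memory address (hypothetical labels)
    kind: str      # "read" or "write"


def injection_points(trace):
    """Yield (time, location, bit) for each bit of each location that is
    about to be read, so that errors are only injected into live data."""
    for access in trace:
        if access.kind == "read":
            for bit in range(WORD_BITS):
                yield (access.time, access.location, bit)


trace = [
    Access(0, "r3", "write"),
    Access(1, "r3", "read"),
    Access(2, "r4", "write"),  # r4 is written but never read: dead data
    Access(3, "r3", "read"),
]
points = list(injection_points(trace))
print(len(points), "injection points, all in r3")
```

This filter only removes locally dead data. A value that is read but whose result is never used afterwards (transitively dead) still produces injection points, which matches the limitation stated in the last bullet.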


Failure mode distributions for three programs from the MiBench suite

                                           No. of            No       Detected by          Program   Value failure (non-detected
Program                                    injected errors   effect   hardware exception   hang      erroneous output)
CRC-32 (32-bit cyclic redundancy check)    224999            56.3%    31.5%                5.6%      6.6%
SHA (secure hash algorithm)                225000            14.6%    39.7%                1.5%      44.2%
Quicksort (recursive sorting algorithm)    175000            30.7%    46.7%                3.7%      18.9%

Injected errors: Single bit-flips in CPU registers and volatile main memory.
The failure mode distribution varies for different programs!


Brake system emulator

[Figure: brake system emulator. Visible label: Release request.]

Workload: Brake-by-wire control loop

[Figure: brake-by-wire control loop. Parts of the program subjected to error injection are encircled.]

Bounded failure vs unbounded failure for the brake-by-wire controller

[Figure: plots of wheel speed vs. time for a benign failure (bounded failure) and a critical failure (unbounded failure), in which the wheel locks (speed = 0). Blue curves: correct behavior. Red curves: erroneous behavior due to single bit-flip errors in CPU registers.]


Low-cost software techniques

  • Goal: To reduce the likelihood of critical failures
  • Two simple integrity checks with recovery
    – Integrity check for integrator state
    – Integrity check for stackpointer
  • Run-time overhead < 4 %
  • Integrity checks proposed based on results from fault injection experiments with a controller without software-based error handling mechanisms

Simple integrity checks

  • Check of integrator state variable (floating point variable)
    – Difference in the value between two samples must be within a given bound
    – Value is not “Not a Number”
    – Value is not “Infinity”
    – Best effort recovery: Rollback to value of integrator state from previous sampling point
  • Stackpointer check
    – Main program stores copy of the stackpointer before a function call
    – Check that the stackpointer is equal to the copy when execution returns from function
    – Recovery: soft reset of brake-controller program
  • Hardware exceptions
    – Recovery: soft reset of brake-controller program
    – No recovery for exceptions used by debugger
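
The integrator-state check described above can be sketched as follows. This is a minimal illustration, not the controller's code: the function names and the bound `max_delta` are assumptions, and the state is modeled as a 32-bit float so that a single bit flip can be simulated with the standard `struct` module.

```python
import math
import struct


def flip_float32_bit(value, bit):
    """Flip one bit in the IEEE 754 single-precision encoding of `value`."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def check_integrator(current, previous, max_delta):
    """Return the integrator state to use for this sample.

    Best-effort recovery: fall back to the value from the previous
    sampling point if the current value is NaN, infinite, or has changed
    by more than `max_delta` (a hypothetical bound) since the last sample.
    """
    if math.isnan(current) or math.isinf(current):
        return previous
    if abs(current - previous) > max_delta:
        return previous
    return current


previous = 1.0
corrupted_high = flip_float32_bit(1.0, 30)  # exponent bit: 1.0 becomes +inf
corrupted_low = flip_float32_bit(1.0, 0)    # lowest mantissa bit: tiny change

print(check_integrator(corrupted_high, previous, 0.1))  # rolled back
print(check_integrator(corrupted_low, previous, 0.1))   # accepted as is
```

A flip in a high-order bit is caught and rolled back, while a flip in a low-order mantissa bit stays within the bound and passes through unchanged. Either way the recovered state is only approximately right, which is one reason the recovery is described as best effort.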


Evaluation of Simple Integrity Checks

  • Two versions of the brake control program
    – Basic version: Without integrity checks
    – Fault tolerant version: With integrity checks
  • Workload
    – Panic braking from a speed of 30 m/s
  • Injections of all possible single bit errors during three iterations of the ABS control loop
    – Approximately 90,000 single bit errors injected for each program version
  • Behavior of the control program was recorded for 1900 control cycles after an error was injected

Low-cost software techniques: summary of results

  • Almost two orders of magnitude reduction of critical failures
    – From 1.2% (1063 of 88265 bit flips) to 0.04% (39 of 93171 bit flips)
  • Exhaustive testing of three control loop executions
  • Low execution time overhead: < 4%
  • Drawback: Inexact recovery
  • Limitations of the evaluation
    – Only three control loop executions in ABS mode investigated
    – Not all parts of the brake controller program subjected to error injection


Basic version vs. Fault tolerant version

                                              Total    No       Detected,     Program   Benign    Critical
                                                       effect   no recovery   hang      failure   failure
Without integrity checks (basic version)
  Percentage                                  100%     33.2%    45.6%         1.7%      18.2%     1.2%
  No. of errors                               88265    29312    40263         1539      16088     1063
Integrity checks with recovery (FT version)
  Percentage                                  100%     36.1%    7.3%          0.7%      55.9%     0.04%
  No. of errors                               93171    33627    6776          664       52065     39

Injected errors: Single bit-flips in CPU registers and volatile main memory.
Exhaustive testing of all possible bit-flips in three control loops.
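
The percentages in the table above follow directly from the error counts. The sketch below recomputes them and the reduction factor for critical failures; the dictionary keys are illustrative names for the table columns.

```python
# Error counts copied from the table above.
BASIC = {
    "no_effect": 29312,
    "detected_no_recovery": 40263,
    "program_hang": 1539,
    "benign_failure": 16088,
    "critical_failure": 1063,
}
FAULT_TOLERANT = {
    "no_effect": 33627,
    "detected_no_recovery": 6776,
    "program_hang": 664,
    "benign_failure": 52065,
    "critical_failure": 39,
}


def shares(counts):
    """Return each outcome as a percentage of all injected errors."""
    total = sum(counts.values())
    return {name: 100.0 * count / total for name, count in counts.items()}


basic_shares = shares(BASIC)
ft_shares = shares(FAULT_TOLERANT)

# How many times less likely a critical failure is in the FT version.
reduction = basic_shares["critical_failure"] / ft_shares["critical_failure"]

print(f"basic: {basic_shares['critical_failure']:.2f}% critical failures")
print(f"FT:    {ft_shares['critical_failure']:.3f}% critical failures")
print(f"reduction factor: {reduction:.1f}")
```

The counts sum to the stated totals (88265 and 93171), and the critical-failure share drops by a factor of roughly 29.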

The fault tolerant version executes more machine instructions than the basic (non-fault-tolerant) version, so there are more target bits in the FT version.

A fairly high proportion of the injected errors had no impact (33.2% for the basic version, 36.1% for the FT version). Many bits are locally live but transitively dead.

Significantly lower proportion of detected errors without recovery for the FT version (7.3% vs. 45.6%). Reason: the FT version attempts to recover from errors, which leads to benign failures.

Significantly lower proportion of program hangs for the FT version (0.7% vs. 1.7%). We assume that program hangs are detected by a watchdog timer.

Significantly higher proportion of benign failures for the FT version (55.9% vs. 18.2%). Recovery is not perfect, which leads to benign failures (acceptable system behavior).

Almost two orders of magnitude reduction of critical failures: from 1.2% to 0.04% (about a factor of 29). 36 of the 39 critical failures of the FT version were caused by errors in the program counter.


High-cost software techniques

  • Triple time-redundant execution with forward recovery (TTR-FR)
    – 297% < run-time overhead < 440%
  • Double time-redundant execution + 5 other SW mechanisms (TRAM)
    – 181% < run-time overhead < 204%
  • Two implementation techniques
    – Aspect-Oriented Programming vs. manual programming in C
  • Low versus high compiler optimization

Aspect-oriented programming

[Figure: an aspect, consisting of error handling source code and weaving directives, is combined with the target program source code to produce source code with error handling.]

Triple time redundant execution with forward recovery (TTR-FR)

Purpose: Error masking and error detection

  • Executes each control loop three times
  • Errors masked by majority voting
  • Three copies of program state
  • Forward recovery: program state of erroneous copy replaced with program state of correct copy
  • Error signaled if no majority result is found
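
The voting and forward-recovery steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the TTR-FR implementation from the talk: the `step` interface, the `integrate` toy control step and the exception name are hypothetical, and the fault is simulated by corrupting one state copy before the loop runs.

```python
class NoMajorityError(Exception):
    """Raised when all three executions disagree."""


def ttr_fr_step(step, states, inputs):
    """Execute `step` once per state copy, vote, and repair the copies.

    `step(state, inputs)` returns (output, new_state). `states` holds the
    three copies of the program state.
    """
    results = [step(dict(state), inputs) for state in states]
    for i in range(3):
        j, k = (i + 1) % 3, (i + 2) % 3
        if results[j] == results[k]:
            output, good_state = results[j]
            # Forward recovery: all copies continue from the majority state.
            return output, [dict(good_state) for _ in range(3)]
    raise NoMajorityError("no two executions agree")


def integrate(state, sample):
    """Toy control step (hypothetical): accumulate the input sample."""
    acc = state["acc"] + sample
    return acc, {"acc": acc}


states = [{"acc": 1.0}, {"acc": 1.0}, {"acc": 1.0}]
states[1]["acc"] = 9.0  # simulate a bit flip in one copy of the state
output, states = ttr_fr_step(integrate, states, 0.5)
print(output, states)
```

The sketch only covers errors in data. An error that disturbs the control flow of the voter itself falls outside what this scheme can mask.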

Time redundancy and more (TRAM)

Purpose: error detection

  • Six checking mechanisms
    – Double time redundant execution and result comparison
    – Stack pointer and stack frame pointer integrity checks
    – Check that writes are made to correct data set
    – Two mechanisms for counter-based control flow checking
    – Check for fake resets
  • Developed through evaluation-driven design
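
Two of the mechanisms listed above can be illustrated with a short sketch: double execution with result comparison, and a generic counter-based control-flow check. The source does not describe how the TRAM mechanisms are built, so everything here (function names, the checkpoint count, the `skip_block` fault simulation) is an assumption made for illustration.

```python
class ErrorDetected(Exception):
    """Raised when a check fails. Detection only, no recovery."""


EXPECTED_CHECKPOINTS = 3  # hypothetical number of blocks in the control loop


def control_loop(state, inputs, skip_block=None):
    """Toy control loop with a checkpoint counter.

    `skip_block` simulates a control-flow error that jumps over one block.
    """
    counter = 0
    acc = state["acc"]
    for block in range(EXPECTED_CHECKPOINTS):
        if block == skip_block:
            continue
        acc += inputs
        counter += 1
    if counter != EXPECTED_CHECKPOINTS:
        raise ErrorDetected("control-flow counter mismatch")
    return acc


def double_execute(step, state, inputs):
    """Run `step` twice on copies of the state and compare the results."""
    first = step(dict(state), inputs)
    second = step(dict(state), inputs)
    if first != second:
        raise ErrorDetected("results of the two executions differ")
    return first


def make_flaky(step):
    """Wrap `step` so that only its first call returns a corrupted result."""
    calls = {"n": 0}

    def wrapped(state, inputs):
        calls["n"] += 1
        result = step(state, inputs)
        return result + 1.0 if calls["n"] == 1 else result

    return wrapped


print(double_execute(control_loop, {"acc": 0.0}, 1.0))
```

Unlike triple execution with voting, this scheme cannot tell which of the two results is correct, which is consistent with the stated purpose of error detection rather than masking.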


High-cost techniques: summary

  • The results for TTR-FR were disappointing: it achieved only about 96% coverage
  • Lack of coverage mainly due to “tricky” control flow errors
  • TRAM achieved 100% error detection coverage in non-exhaustive tests based on 10,000 bit flips
  • Coverage was similar for aspect-oriented and manually programmed implementations
  • Compiler optimization had little impact on the coverage despite large differences in the machine programs

Compiler optimization levels

  • Low compiler optimization
    – GCC … -finline
  • High compiler optimization
    – GCC … -O3 -fno-strict-aliasing

Comparison of overhead for TTR-FR

                  Low compiler optimization         High compiler optimization
                  No. of instructions  % overhead   No. of instructions  % overhead
Without TTR-FR    635                  0%           245                  0%
Manual C          2647                 317%         943                  285%
AspectC++Opt      3428                 440%         973                  297%

No. of instructions = Number of machine instructions executed in one control loop

Target program: Brake-by-wire controller
Fault tolerance technique: Triple time redundant execution and voting with forward recovery (TTR-FR)
Implementation techniques: Manual programming in C, Aspect-oriented programming using the optimized weaver

Comparison of overhead for TRAM

                  Low compiler optimization         High compiler optimization
                  No. of instructions  % overhead   No. of instructions  % overhead
Without TRAM      635                  0%           245                  0%
Manual C          1824                 187%         689                  181%
AspectC++Opt      2358                 271%         746                  204%

No. of instructions = Number of machine instructions executed in one control loop

Target program: Brake-by-wire controller
Fault tolerance technique: Double time redundant execution + 5 other error detection mechanisms
Implementation techniques: Manual programming in C, Aspect-oriented programming using an optimized weaver
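
The overhead percentages in the two overhead tables above are consistent with the relative increase in executed instructions over the unprotected program at the same optimization level. The sketch below recomputes them from the instruction counts; the dictionary layout is illustrative.

```python
# Instructions per control loop for the unprotected program.
BASELINE = {"low": 635, "high": 245}

# Instructions per control loop for each technique and implementation.
INSTRUCTIONS = {
    ("TTR-FR", "Manual C"): {"low": 2647, "high": 943},
    ("TTR-FR", "AspectC++Opt"): {"low": 3428, "high": 973},
    ("TRAM", "Manual C"): {"low": 1824, "high": 689},
    ("TRAM", "AspectC++Opt"): {"low": 2358, "high": 746},
}


def overhead_percent(instructions, baseline):
    """Relative increase in executed instructions, in percent."""
    return 100.0 * (instructions - baseline) / baseline


overheads = {
    key: {opt: round(overhead_percent(n, BASELINE[opt])) for opt, n in counts.items()}
    for key, counts in INSTRUCTIONS.items()
}

for (technique, implementation), values in overheads.items():
    print(f"{technique:7} {implementation:13} low {values['low']}%  high {values['high']}%")
```

All eight recomputed values match the tables after rounding to whole percent.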


Error coverage – TTR-FR
(Triple Time Redundant execution with Voting and Forward Recovery)

                             Over-   No       Corrected     Detected      Detected by    Program   Total
                             head    effect   by software   by software   HW exception   hang      coverage
Low compiler optimization
  Manual C                   317%    34.5%    15.2%         0.9%          45.6%          0.2%      96.4%
  AspectC++Opt               440%    33.2%    17.1%         0.5%          45.3%          0.3%      96.5%
High compiler optimization
  Manual C                   285%    34.2%    18.4%         1.3%          41.9%          0.1%      95.9%
  AspectC++Opt               297%    32.6%    20.7%         1.7%          40.4%          0.2%      95.6%

Injected errors: Single bit-flips in CPU registers and volatile main memory
No. of injected errors for each program: 10,000 (random sampling of error space)
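
In the TTR-FR coverage table above, the total coverage appears to be the sum of the five outcome columns (no effect, corrected by software, detected by software, detected by HW exception, program hang). The sketch below checks that reading and derives the uncovered remainder; treating the remainder as undetected erroneous behavior is an assumption, since the slides do not label it.

```python
# (no effect, corrected by SW, detected by SW, detected by HW exception,
#  program hang), printed total coverage; all values in percent.
ROWS = {
    ("low", "Manual C"): ((34.5, 15.2, 0.9, 45.6, 0.2), 96.4),
    ("low", "AspectC++Opt"): ((33.2, 17.1, 0.5, 45.3, 0.3), 96.5),
    ("high", "Manual C"): ((34.2, 18.4, 1.3, 41.9, 0.1), 95.9),
    ("high", "AspectC++Opt"): ((32.6, 20.7, 1.7, 40.4, 0.2), 95.6),
}


def total_coverage(parts):
    """Assumed definition: coverage is the sum of all handled outcomes."""
    return sum(parts)


# Share of injected errors not accounted for by any handled outcome.
uncovered = {key: round(100.0 - total, 1) for key, (_, total) in ROWS.items()}

for key, (parts, total) in ROWS.items():
    print(key, f"sum {total_coverage(parts):.1f}%", f"printed {total}%",
          f"uncovered {uncovered[key]}%")
```

The sums agree with the printed totals to within rounding (0.1 percentage points), leaving roughly 3.5% to 4.4% of the injected errors uncovered.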

TTR-FR does not provide perfect coverage! There are only small variations in coverage among the implementations.

The overwriting effect (OE), i.e. the proportion of errors with no effect, is similar among programs (32.6% to 34.5%). It is also similar to the OE in the experiments with simple integrity checks (36.1%).

Slightly higher coverage of SW-based mechanisms for programs with a high level of compiler optimization.

Slightly lower coverage of HW exceptions for programs with a high level of compiler optimization.

Differences in coverage between HW and SW mechanisms have little impact on total coverage.

Error coverage – TRAM
(Double time redundant execution + 5 other mechanisms)

                             Over-   No       Corrected     Detected      Detected by    Program   Total
                             head    effect   by software   by software   HW exception   hang      coverage
Low compiler optimization
  Manual C                   187%    33.3%    0%            21.5%         44.8%          0.3%      100%
  AspectC++Opt               271%    29.6%    0%            22.9%         47.4%          0.1%      100%
High compiler optimization
  Manual C                   181%    34.2%    0%            24.2%         40.4%          0.1%      100%
  AspectC++Opt               204%    30.6%    0%            30.9%         38.4%          0.2%      100%

Injected errors: Single bit-flips in CPU registers and volatile main memory
No. of injected errors for each program: 10,000 (non-exhaustive experiments)

Very high error coverage! However, the evaluation is non-exhaustive, so the true error coverage can be less than 100%.
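
How far below 100% the true coverage could be can be bounded with a standard binomial argument. The sketch below assumes that the 10,000 injections are independent, uniformly random samples of the error space and that none of them went undetected; under those assumptions it computes an upper confidence bound on the probability of an uncovered error. The function name and the 95% confidence level are choices made for this example.

```python
def upper_bound_uncovered(n, confidence=0.95):
    """Upper confidence bound on the probability of an uncovered error
    when zero uncovered errors were seen in n independent random injections.

    With zero failures, (1 - p)**n >= 1 - confidence gives
    p <= 1 - (1 - confidence)**(1 / n).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


bound = upper_bound_uncovered(10_000)
print(f"uncovered fraction is below {100 * bound:.4f}% at 95% confidence")
print(f"coverage is at least {100 * (1 - bound):.2f}%")
```

For 10,000 injections the bound is close to the familiar "rule of three" approximation 3/n, so the experiments support a coverage of at least about 99.97% rather than exactly 100%.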


Experimental set-up

  • Programs executed on a Freescale MPC565 PowerPC microcontroller
  • Nexus-based fault injection
    – Injection via debug port
    – No need to change target program
  • Experiments conducted with the GOOFI-2 tool

Overview of experimental set-up

[Figure: overview of the experimental set-up]

Overview of GOOFI-2
Generic Object-Oriented Fault Injection tool

[Figure: overview of GOOFI-2. Nexus was used for the experiments reported in this talk.]


Summary

  • Sensitivity to single bit-flip errors in the ISA registers varies for different programs
  • Control applications seem to be rather robust to such errors
  • Software-implemented time-redundant execution alone cannot achieve 100% error coverage. It needs to be complemented with various other mechanisms
  • Bit errors in the stack pointer, stack frame pointer and the program counter are likely to cause critical failures
  • We propose evaluation-driven design of software-based error handling mechanisms
  • The TRAM mechanism appears to achieve very high error detection coverage


Future research

  • Assess the validity of the single-bit flip approximation
  • Investigate ways to predict error coverage by symbolic program execution and static program analysis
    – Fault injection is expensive!
  • Experiments with other target programs
  • Investigate impact of multiple-bit errors
  • Develop pre-injection analysis techniques for identifying transitively dead registers/memory words
  • And many more …

Questions?

List of papers

  • D. Skarin, R. Barbosa and J. Karlsson, “GOOFI-2: A Tool for Experimental Dependability Assessment,” in Proceedings of the 40th International Conference on Dependable Systems and Networks (DSN 2010), Chicago, Illinois, USA, June/July 2010
  • D. Skarin and J. Karlsson, “Software Mechanisms for Tolerating Soft Errors in an Automotive Brake-Controller,” in Supplemental Volume of the 39th International Conference on Dependable Systems and Networks (DSN 2009), Estoril, Lisbon, Portugal, June/July 2009
  • D. Skarin and J. Karlsson, “Evaluation of Low-Cost Detection and Recovery of Soft Errors in an ABS controller,” in Proceedings of the 2009 IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE 5), Palo Alto, California, USA, March 2009
  • R. Alexandersson, P. Öhman and J. Karlsson, “Aspect-oriented implementation of fault tolerance: an assessment of overhead,” in Proceedings of the 24th International Conference on Computer Safety, Reliability and Security (SAFECOMP 2010), Vienna, Austria, 2010
  • R. Alexandersson and J. Karlsson, “Fault injection-based assessment of aspect-oriented implementation of fault tolerance,” in Proceedings of the 41st International Conference on Dependable Systems and Networks (DSN 2011), Hong Kong, China, June 2011

Acknowledgements

  • Daniel Skarin (now with the SP Technical Research Institute of Sweden)
    – Designed the GOOFI-2 plug-in based software architecture
    – Developed the experimental set-up for the brake-by-wire application
    – Designed and evaluated the simple integrity checks
  • Ruben Alexandersson (now with Volvo Cars)
    – Designed and implemented the aspect weavers for AOP
    – Inventor/implementor of the TTR-FR, DS-CFC and TRAM mechanisms
  • Raul Barbosa (now ass. professor, University of Coimbra, Portugal)
    – Implemented the pre-injection analysis module for GOOFI-2
    – Developed support for instrumentation-based and exception-based fault injection
  • Martin Sanfridson (Volvo Technology)
    – Developed the brake-by-wire application and the associated environment simulator
  • Peter Öhman (now Head of Test Site Sweden AB)
    – Co-advisor to Ruben Alexandersson