Software-based Fault Tolerance – Mission (Im)possible? Peter Ulbrich


Software-based Fault Tolerance – Mission (Im)possible? Peter Ulbrich. The 29th CREST Open Workshop on Software Redundancy, November 18, 2013. System Software Group.


SLIDE 1

System Software Group

Software-based Fault Tolerance – Mission (Im)possible?

http://www4.cs.fau.de

Peter Ulbrich, The 29th CREST Open Workshop on Software Redundancy, November 18, 2013

SLIDE 5

Soft Errors – A Growing Problem

■ Soft errors (transient hardware faults)
  • Induced by, e.g., radiation, glitches, or insufficient signal integrity
  • Affecting microcontroller logic
■ Future hardware designs: more performance and parallelism
  → At the price of being less and less reliable [3]
■ Toyota acceleration case
  • Electronic throttle control system (2005 Camry)
  • Unintended acceleration potentially involving 261 deaths (US News, Mar 17, 2010)
  • Experts identified soft errors as a possible cause (US News, Mar 17, 2010)
  • “Toyota claimed the 2005 Camry's main CPU had error detecting and correcting RAM. It didn't.” (Investigation Report, EDN Network, Oct 28, 2013)

Peter Ulbrich – ulbrich@cs.fau.de

SLIDE 7

Software-Based Fault Tolerance

■ Software-based redundancy
  • Triple Modular Redundancy (e.g., recommended by ISO 26262)
  • Selective and adaptive
  • Resource efficient
■ Single points of failure
  • Interface and Majority Voter
  • Allowing for Silent Data Corruptions (SDC)
  → Replication is impossible!

[Figure: safety-critical system with sensors, interface, replicas 1–3 inside the sphere of redundancy (SOR) and isolation domain, majority voter, and actuators; interface and majority voter are marked as vulnerable (↯)]

SLIDE 10

Threats to Applicability – Mission failed?

■ Triple modular redundancy reliability
  $R_{TMR} = R_{Voter} \cdot R_{2\text{-of-}3}$
■ Voting on unreliable hardware?
  • Very small residual error probability?
  • Risk analysis inherently complex (no random error distribution! [4])
  → Dealbreaker for software-based TMR
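
The reliability product above can be made concrete with a small sketch. It assumes the textbook model of three independent replicas with identical reliability r, so that R_2-of-3 = 3r² − 2r³; this model and the function names are illustrative assumptions and are not taken from the slides.

```cpp
// Sketch only: TMR reliability under the textbook assumption of three
// independent replicas with identical reliability r (not from the slides).

// Probability that at least two of three replicas are correct: 3r^2 - 2r^3.
constexpr double reliability_2_of_3(double r) {
    return 3.0 * r * r - 2.0 * r * r * r;
}

// R_TMR = R_Voter * R_2-of-3, as stated on the slide.
constexpr double reliability_tmr(double r_voter, double r_replica) {
    return r_voter * reliability_2_of_3(r_replica);
}
```

Under these assumptions, replicas with r = 0.9 reach 0.972 behind a perfect voter, but a voter that is itself only 0.9 reliable drags the system down to about 0.875, which is worse than a single replica. This is why the voter must not remain a single point of failure.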

SLIDE 14

Research Aims

■ Eliminate single points of failure
■ Constrain residual error probability
■ Dependability as a resource efficient option

[Figure: safety-critical system with sensors, interface, replicas 1–3 inside the sphere of redundancy (SOR) and isolation domain, majority voter, and actuators; target: $R_V = 1$ and $R_I = 1$]

SLIDE 15

Agenda

■ Introduction
■ The Combined Redundancy approach (CoRed)
  • Holistic protection – eliminating single points of failure
  • Arithmetic coding
  • Dependable voting
■ Constraining residual error probability
  • From coding theory to application – lessons learned
  • Finding appropriate parameters
  • Circumvent implementation pitfalls
■ Evaluation
  • Use case
  • Experimental setup
  • Fault-injection results
■ Conclusion

SLIDE 19

CoRed Overview – Holistic Protection Approach

■ The Combined Redundancy approach (CoRed)
  • Data-flow encoding
  • Dependable voters
■ Holistic protection approach for control applications
  • Input to output protection: (1) reading inputs, (2) processing, (3) distributing outputs

[Figure: CoRed = TMR + encoded operation; sphere of redundancy (SOR) and isolation domain, with stages 1, 2 and 3 marked]

SLIDE 21

Eliminating Input and Output Vulnerabilities

■ Arithmetic codes: ANBD code, $v' = A \cdot v + B + D$
  • Based on VCP [5]
  • Data integrity: key (A)
  • Address integrity: per-variable signature (B)
  • Outdated data: timestamp (D)
■ Set of arithmetic operators (+, -, *, =, …)
■ Checksum vs. arithmetic code (AN code)
  • AN code: encoded data operations
  • Enabler for dependable voter

[Figure: values X and Y are encoded when entering the SOR; Z' is computed from X' and Y' on encoded values and decoded to Z]
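
The encoding rule v' = A·v + B + D can be sketched in a few lines. This is a minimal illustration, assuming 64-bit signed arithmetic, 0 < B + D < A, and a simplified encoded addition that subtracts one timestamp so the result again carries a single D; the function names and the signature values used in the checks are hypothetical and not the CoRed implementation.

```cpp
#include <cstdint>

// Sketch only: A is one of the keys listed in the deck; everything else
// (names, 64-bit width, the addition scheme) is an illustrative assumption.
constexpr std::int64_t A = 58659;

// v' = A*v + B + D  (B: per-variable signature, D: timestamp)
constexpr std::int64_t encode(std::int64_t v, std::int64_t b, std::int64_t d) {
    return A * v + b + d;
}

// A code word is valid if removing B and D leaves a multiple of A.
constexpr bool check(std::int64_t vc, std::int64_t b, std::int64_t d) {
    return (vc - b - d) % A == 0;
}

// Decoding is only meaningful after check() succeeded.
constexpr std::int64_t decode(std::int64_t vc, std::int64_t b, std::int64_t d) {
    return (vc - b - d) / A;
}

// Encoded addition without decoding: the result carries signature Bx + By
// and, after subtracting one D, the same timestamp D as its operands.
constexpr std::int64_t add_encoded(std::int64_t xc, std::int64_t yc, std::int64_t d) {
    return xc + yc - d;
}
```

A flipped bit, a wrong signature (address error) or a stale timestamp all leave a remainder that is not a multiple of A, so check() rejects the value.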

SLIDE 22

CoRed Dependable Voter – Basics

■ CoRed Dependable Voter
  • Input: variants (X', Y', Z')
  • Output: equality set (E) and encoded winner (W)
  • No decoding necessary
■ Control-flow signatures
  • Static signature (expected value): compile-time; used as return value E
  • Dynamic signature (actual value): runtime, computed from variants; applied to winner W
  • Validation: subsequent check (decode)

[Figure: provider (replicas 1–3 encode X, Y, Z into X', Y', Z') → encoded voter returns {E, W} → consumer checks (decodes) the winner; e.g., X' is the winner]
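
The voter idea above can be sketched as follows. This is a simplified, hypothetical illustration and not the CoRed voter itself: it assumes AN codes with per-variant signatures (no timestamp), compares variants without decoding via x' − y' = Bx − By, returns a branch-specific constant as E, and adds the runtime difference of the compared variants to the winner as its dynamic signature. All names and constants except A are made up for the example.

```cpp
#include <cstdint>

// Sketch only: hypothetical constants and names, not the CoRed implementation.
constexpr std::int64_t A  = 58659;                 // key, as listed in the deck
constexpr std::int64_t BX = 11, BY = 23, BZ = 37;  // assumed signatures, 0 < B < A

enum class Equality { None, XY, XZ, YZ };          // static signature E

struct Vote { Equality e; std::int64_t w; };       // {E, W}
struct Checked { bool ok; std::int64_t value; };

constexpr std::int64_t encode(std::int64_t v, std::int64_t b) { return A * v + b; }

// Encoded voter: equal values satisfy x' - y' == Bx - By, so no decoding is
// needed. The runtime difference is added to the winner (dynamic signature).
constexpr Vote vote(std::int64_t xc, std::int64_t yc, std::int64_t zc) {
    if (xc - yc == BX - BY) return {Equality::XY, xc + (xc - yc)};
    if (xc - zc == BX - BZ) return {Equality::XZ, xc + (xc - zc)};
    if (yc - zc == BY - BZ) return {Equality::YZ, yc + (yc - zc)};
    return {Equality::None, 0};
}

// Consumer: derive the expected signature from E, then check and decode W.
constexpr Checked check_and_decode(Vote r) {
    std::int64_t sig = 0;
    switch (r.e) {
        case Equality::XY: sig = BX + (BX - BY); break;
        case Equality::XZ: sig = BX + (BX - BZ); break;
        case Equality::YZ: sig = BY + (BY - BZ); break;
        default: return {false, 0};
    }
    const std::int64_t d = r.w - sig;
    if (d % A != 0) return {false, 0};             // E and W do not match
    return {true, d / A};
}
```

If a fault makes the voter report an equality that the data does not support, or skips the step that applies the dynamic signature, W no longer carries the signature that E announces and the consumer's check fails instead of accepting a corrupted winner.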


SLIDE 24

From Coding Theory to Application

■ Arithmetic coding operations: Mathematics, C / C++, Assembler
  • Know your compiler & architecture
  • Think binary

Decoded_Static() {
  TAssert(_B > 0);
  assert(check());
  return (vc - _B - D) / _A;
};

[Figure: safety-critical system with sensors, CoRed interface, replicas 1–3 inside the sphere of redundancy (SOR) and isolation domain, CoRed voter, and actuators; open question: $R_V = 1$, $R_I = 1$?]

SLIDE 27

Constraining residual error probability

■ Coding theory
  • Data word + redundant information = code word
  • Fault detection: distance between code words
■ Residual error probability
  • Chance for code-to-code word mutation
  • Fundamental property for fault tolerance mathematics

$v' = A \cdot v + B + D$

$p_{sdc} = \frac{\text{valid code words}}{\text{possible code words}} \approx \frac{1}{A}$

[Plot: predicted residual error probability $p_{pred}$ (1/A) over values of A (16-bit constant key, 2 to 61,440); y-axis: $p_{sdc}$ (residual error probability), $10^{-6}$ to $10^{-3}$]
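
The 1/A estimate follows from a simple counting argument. The derivation below is a sketch under two assumptions that are not spelled out on the slide: data values have k bits, and a fault turns a code word into any word of the occupied range with equal probability. The value of A is one of the keys listed in the deck.

```latex
\begin{align*}
\#\,\text{valid code words}    &= 2^{k} \\
\#\,\text{possible code words} &\approx A \cdot 2^{k} \\
p_{sdc} &= \frac{2^{k}}{A \cdot 2^{k}} = \frac{1}{A} \\
A = 58{,}659 \;&\Rightarrow\; p_{sdc} \approx 1.7 \times 10^{-5}
\end{align*}
```

This is the coding-theory prediction only; it says nothing about how bit errors are actually distributed on real hardware.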

SLIDE 28

Choosing Keys and Signatures

■ Mathematics: prime numbers
  • Intuitively plausible
  • Literature: little help to find suitable As
■ Practitioner's approach: min. Hamming distance
  • Distance (d) between code words (# unequal bits)
  • d−1 bit error detection capabilities
■ Brute force
  • 1.4×10^14 experiments for all 16-bit As

A         d_min   # errors detectable
58,368    2       1
58,831    3       2
58,659    6       5

→ “The bigger the better” is misleading!
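
The brute-force search for the minimum Hamming distance can be sketched as below. This is a minimal illustration for plain AN codes (no signature or timestamp) and small data widths; the function names are assumptions, and the exhaustive pairwise loop is not meant to reproduce the 16-bit campaign from the slide.

```cpp
#include <cstdint>

// Number of set bits in x.
constexpr unsigned popcount64(std::uint64_t x) {
    unsigned n = 0;
    while (x != 0) { x &= x - 1; ++n; }
    return n;
}

// Minimum Hamming distance between any two distinct code words a*v with
// v in [0, 2^data_bits). Exhaustive, so only practical for small data_bits.
constexpr unsigned min_hamming_distance(std::uint64_t a, unsigned data_bits) {
    const std::uint64_t count = std::uint64_t{1} << data_bits;
    unsigned best = 64;
    for (std::uint64_t v1 = 0; v1 < count; ++v1) {
        for (std::uint64_t v2 = v1 + 1; v2 < count; ++v2) {
            const unsigned d = popcount64((a * v1) ^ (a * v2));
            if (d < best) best = d;
        }
    }
    return best;
}
```

Even this toy setting shows that a larger key is not automatically better: with 2-bit data, A = 3 yields a minimum distance of 2, while the larger A = 4 only yields 1.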

SLIDE 29

Consistency with Coding Theory – Mission Failed?

■ Fault simulation: entire fault space
  • Each and every A, v and fault pattern
  • 6.5×10^16 experiments for 16-bit As and 1–8 bit soft errors
  → Excess of predicted residual error probability
  → Violation of predicted fault-detection capabilities

[Plot: $p_{sdc}$ (residual error probability, $10^{-6}$ to $10^{-3}$) over values of A (16-bit constant key), showing $p_{pred}$ (1/A) and $p_{brd}$ (borderline bit errors)]

SLIDE 30

Think Binary

■ Binary representation of code words
  • Coding theory is unaware of machine word sizes
  → Dangerous over- and underflow conditions
■ Extended AN code (EAN) implementation
  → Compliance with coding theory!
■ Improved code reliability (A = 251)
  • Predicted: 3×10^-3
  • Common implementation [4]: ≈ 1.3×10^-2
  • EAN implementation: ≈ 1.5×10^-5
  → Improvement by orders of magnitude!
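
The over- and underflow hazard can be illustrated with a small sketch. It assumes 32-bit code words and shows explicit range checks on encoding and decoding; the names and the chosen widths are assumptions for illustration and this is not the EAN implementation from the talk.

```cpp
#include <cstdint>
#include <limits>

struct Encoded { bool ok; std::uint32_t code; };
struct Decoded { bool ok; std::uint32_t value; };

// Encode v as a*v + b, but refuse results that do not fit the 32-bit code
// word. The product is formed in 64 bits so that an overflow stays visible.
constexpr Encoded encode_checked(std::uint32_t a, std::uint32_t v, std::uint32_t b) {
    const std::uint64_t wide = std::uint64_t{a} * v + b;
    if (wide > std::numeric_limits<std::uint32_t>::max()) return {false, 0};
    return {true, static_cast<std::uint32_t>(wide)};
}

// Decode with explicit checks: no underflow when removing the signature,
// divisibility by a, and a decoded value within the expected data range.
constexpr Decoded decode_checked(std::uint32_t a, std::uint32_t code,
                                 std::uint32_t b, std::uint32_t max_value) {
    if (code < b) return {false, 0};               // would underflow
    const std::uint32_t d = code - b;
    if (d % a != 0) return {false, 0};             // not a code word
    const std::uint32_t v = d / a;
    if (v > max_value) return {false, 0};          // outside the data range
    return {true, v};
}
```

Without such checks, unsigned arithmetic silently wraps modulo 2^32, and the wrapped result is then judged by the divisibility test alone, which the coding-theory prediction does not account for.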

SLIDE 31

Know your Compiler and Architecture

■ On-target fault injection: entire fault space
  • Each and every register, flag, instruction and execution path
  • FAIL* fault injection framework [6]
  → Violation of predicted fault-detection capabilities
■ Architecture specifics
  • Absence of compound test-and-branch (e.g., IA32 architecture)
  • Control-flow information is stored in a single bit
  → Redundancy is lost
  → Additional range checks
■ Undefined execution environment
  • Zombie values leaking from caller to voter function
  • Compiler laziness leaves encoded values in registers
  → Isolation assumptions violated
  → Cleaning local storage restores isolation

→ Tight feedback loop with fault-injection experiments!


SLIDE 33

Evaluation – Experimental Setup

[Figure: system under test running the flight-control application. Sensor system: sensors 1–3, each followed by EAN encode; replicas 1–3, each with EAN decode and EAN encode; CoRed encoded tolerance voter; CoRed encoded (exact) voter; network interface with EAN decode; actuator on a remote node. Host computer: hardware debugger, FAIL* campaign manager, fault DB, results DB]

■ Outcome: 401,592 experiments
■ Effective: 67,617 errors
■ Categories: Fail Silent, Masked, Hardware Detected, EAN-Code, Control-Flow, Silent Data Corruption

SLIDE 37

Evaluation – Experimental Results (1)

■ Redundant execution campaign (Interface)
  • Total: ~45,000 errors
  • Unprotected: suffers from 3,622 corruptions
  • TMR: suffers from 71 corruptions
  • CoRed: remaining corruptions are covered, 0 corruptions

[Chart: distribution of effective faults (0 % to 90 %) for Unprotected, Plain TMR and CoRed TMR, split into data and address faults; categories: Masked, Hardware Detected, EAN-Code Detected, Silent Data Corruptions]

SLIDE 41

Evaluation – Experimental Results (2)

■ Voter campaign

Voter          Total      Masked   Retry    Corruptions
Plain voter    ~11,000    2,465    7,245    1,223
CoRed voter    ~26,000    1,228    24,682   0

[Chart: distribution of effective faults (0 % to 90 %) for the plain voter and the CoRed encoded voter, split into data and address faults; categories: Masked, Control-flow Monitoring (CFM), Hardware Detected, EAN-Code Detected, Silent Data Corruptions]

Evaluation – Overhead

■ Overhead analysis
  • I4Copter flight control: 7.1% overhead (compared to plain TMR)
■ Selectivity
  • I4Copter system CPU utilisation: 41%
  • Full replication impossible, CPU: 120%
  • Mission-critical replication of flight control possible with CoRed, CPU: 60%

SLIDE 45

Conclusion

■ Eliminate single points of failure [1]
  • TMR + encoding: Combined Redundancy approach
  • Key feature: CoRed Dependable Voter
■ Constrain residual error probability [2]
  • Parameterisation guidelines: choosing the right A
  • Binary-aware implementation: complying with coding theory
  • Factor 1000 improvement
■ Dependability as a resource efficient option
  • Only 7.1% overhead (flight control example)

→ Bullet-proof software-based fault tolerance is possible!

[Figure: safety-critical system before and after. Before: sensors, interface, replicas 1–3, majority voter, actuators. After: EAN coding with encode and decode around replicas 1–3 and the CoRed voter]

SLIDE 46

Thank you!

(1) Ulbrich, Peter; Hoffmann, Martin; Kapitza, Rüdiger; Lohmann, Daniel; Schmid, Reiner; Schröder-Preikschat, Wolfgang: “Eliminating Single Points of Failure in Software-Based Redundancy”, Proceedings of the 9th European Dependable Computing Conference (EDCC '12), 2012.

(2) Hoffmann, Martin; Ulbrich, Peter; Dietrich, Christian; Schirmeier, Horst; Lohmann, Daniel; Schröder-Preikschat, Wolfgang: “A Practitioner's Guide to Software-based Soft-Error Mitigation Using AN-Codes”, Proceedings of the 15th IEEE International Symposium on High Assurance Systems Engineering (HASE '14), 2014.

http://www4.cs.fau.de/Research/CoRed

SLIDE 47

References

(3) Shivakumar, P.; Kistler, M.; Keckler, S. W.; Burger, D.; Alvisi, L.: “Modelling the effect of technology trends on the soft error rate of combinational logic”, Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN '02), 2002.

(4) Nightingale, Edmund B.; Douceur, John R.; Orgovan, Vince: “Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs”, Proceedings of EuroSys 2011, 2011.

(5) Forin: “Vital coded microprocessor principles and application for various transit systems”, 1989.

(6) Schirmeier, Horst; Hoffmann, Martin; Kapitza, Rüdiger; Lohmann, Daniel; Spinczyk, Olaf: “FAIL*: Towards a Versatile Fault-Injection Experiment Framework”, 25th International Conference on Architecture of Computing Systems, 2012.