Design and Analysis of Safety Critical Systems, by Peter Seiler and Bin Hu (PowerPoint presentation transcript)

SLIDE 1

Design and Analysis of Safety Critical Systems

Peter Seiler and Bin Hu

Department of Aerospace Engineering & Mechanics University of Minnesota September 30, 2013

SLIDE 2

Uninhabited Aerial Systems (UAS)

  • Agricultural Monitoring
  • Emergency Response (NASA/JPL)
  • Public Safety (AeroVironment)
  • Flight Research (UMN UAV Lab): http://www.uav.aem.umn.edu/

SLIDE 3

Design Challenges for Low-Cost UAS

  • Modeling/System Identification
  • Guidance and Controls
  • Human Factors
  • Safety Critical Software
  • Navigation

SLIDE 4

Design Challenges for Low-Cost UAS

Systems Design and Reliability

SLIDE 5

Recent Policy Changes

Increased reliability needed to integrate UAS into the national airspace

SLIDE 6

Outline

  • Existing design techniques in commercial aviation
    • Analytical redundancy is rarely used
    • Certification issues
  • Tools for Systems Design and Certification
    • Motivation for model-based fault detection and isolation (FDI)
    • Extended fault trees
    • Stochastic false alarm and missed detection analysis
  • Conclusions and future work
SLIDE 8

Commercial Fly-by-Wire

Boeing 787-8 Dreamliner

  • 210-250 seats
  • Length = 56.7 m, Wingspan = 60.0 m
  • Range up to 15,200 km, Speed up to Mach 0.89
  • First composite airliner
  • Honeywell flight control electronics

Boeing 777-200

  • 301-440 seats
  • Length = 63.7 m, Wingspan = 60.9 m
  • Range up to 17,370 km, Speed up to Mach 0.89
  • Boeing's first fly-by-wire aircraft
  • Ref: Y.C. Yeh, "Triple-triple redundant 777 primary flight computer," 1996.

SLIDE 9

777 Primary Flight Control Surfaces [Yeh, 96]

  • Advantages of fly-by-wire: increased performance (e.g. reduced drag from a smaller rudder), increased functionality (e.g. "soft" envelope protection), reduced weight, lower recurring costs, and the possibility of sidesticks.
  • Issues: strict reliability requirements
    • < 10⁻⁹ catastrophic failures per hour
    • No single point of failure
SLIDE 10

Classical Feedback Diagram

[Diagram: pilot inputs and sensor measurements feed the Primary Flight Computer, which commands the actuators.]

Reliable implementation of this classical feedback loop adds many layers of complexity.

SLIDE 11

Triplex Control System Architecture

[Diagram: pilot inputs (column) and sensors feed the Primary Flight Computers; Actuator Control Electronics drive the actuators.]

  • Each PFC votes on redundant sensor/pilot inputs
  • Each ACE votes on redundant actuator commands
  • All data is communicated on redundant data buses
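The voting described above is commonly implemented as mid-value selection. A minimal sketch (the function name and the signal values are illustrative, not taken from the 777 design):

```python
def mid_value_select(a, b, c):
    """Mid-value (median) voter for three redundant channels.

    A single channel that drifts, sticks, or fails hard-over is
    outvoted as long as the other two channels remain healthy.
    """
    return sorted((a, b, c))[1]

# Healthy case: all three channels agree to within sensor noise.
pitch_cmd = mid_value_select(10.1, 10.0, 9.9)          # selects 10.0

# Faulty case: one channel is hard-over; the median ignores it.
pitch_cmd_faulty = mid_value_select(10.1, 10.0, 500.0)  # selects 10.1
```

The same voter applies whether the redundant inputs are sensor readings, pilot commands, or actuator commands.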

SLIDE 12

777 Triple-Triple Architecture [Yeh, 96]

[Diagram: Sensors ×3 → Databuses ×3 → Triple-Triple Primary Flight Computers → Actuator Electronics ×4]

SLIDE 13

777 Triple-Triple Architecture [Yeh, 96]

[Diagram: as on the previous slide, with the Left PFC expanded to show three dissimilar processor lanes: Intel, AMD, and Motorola.]

SLIDE 14

Redundancy Management

  • Main Design Requirements:
    • < 10⁻⁹ catastrophic failures per hour
    • No single point of failure
    • Must protect against random and common-mode failures
  • Basic Design Techniques:
    • Hardware redundancy to protect against random failures
    • Dissimilar hardware/software to protect against common-mode failures
    • Voting: to choose between redundant sensor/actuator signals
    • Encryption: to prevent data corruption by failed components
    • Monitoring: software/hardware monitoring and testing to detect latent faults
    • Operating modes: degraded modes to deal with failures
    • Equalization to handle unstable / marginally unstable control laws
    • Model-based design and implementation for software
SLIDE 16

Outline

  • Existing design techniques in commercial aviation
    • Analytical redundancy is rarely used
    • Certification issues
  • Tools for Systems Design and Certification
    • Motivation for model-based fault detection and isolation (FDI)
    • Extended fault trees
    • Stochastic false alarm and missed detection analysis
  • Conclusions and future work
SLIDE 17

Analytical Redundancy

Small UASs cannot support the weight associated with physical redundancy. Approach: use model-based or data-driven techniques to detect faults.

[Diagram: parity-equation architecture (Willsky)]
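As a rough illustration of the parity-equation idea: the residual compares a measurement against a model prediction and is thresholded to detect an additive fault. The signals, noise level, and threshold below are invented for the example, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
y_model = np.sin(0.1 * t)                            # model-predicted output
y_meas = y_model + 0.01 * rng.standard_normal(100)   # healthy measurement
y_meas[60:] += 0.5                                   # additive sensor fault at k = 60

r = y_meas - y_model         # parity equation: measurement minus prediction
alarm = np.abs(r) > 0.1      # simple threshold on the residual
first_alarm = int(np.argmax(alarm))   # index of the first alarm
```

Before the fault the residual is just the small measurement noise, so no alarms fire; at k = 60 the 0.5 bias immediately exceeds the threshold.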

SLIDE 18

Analytical Redundancy

Small UASs cannot support the weight associated with physical redundancy. Approach: use model-based or data-driven techniques to detect faults.

Research Objectives:

  • Hardware, models, data (Freeman, Balas)
  • Advanced filter design
  • Tools for systems design, analysis, and certification

[Diagram: parity-equation architecture (Willsky)]

SLIDE 20

Tools for Systems Design and Certification

Diagram Reference: R. Isermann. Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance. Springer-Verlag, 2006.

SLIDE 21

Tools for Systems Design and Certification

Why are new tools required? Example: fault tree analysis.

Diagram Reference: R. Isermann. Fault-Diagnosis Systems: An Introduction from Fault Detection to Fault Tolerance. Springer-Verlag, 2006.

SLIDE 22

Fault Tree Analysis

SLIDE 23

Fault Tree Analysis

Probability of hardware component failure can be estimated from field data.

SLIDE 24

Fault Tree Analysis

Probability of hardware component failure can be estimated from field data. Model-based fault detection introduces new failure modes (false alarms, missed detections, etc.).
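A fault tree's top-event probability follows from AND/OR gate algebra over (assumed independent) basic events. A minimal sketch, with made-up per-hour probabilities, extended with the FDI-induced events the slide mentions:

```python
def or_gate(*probs):
    """P(at least one event) for independent basic events."""
    p_none = 1.0
    for p in probs:
        p_none *= 1.0 - p
    return 1.0 - p_none

def and_gate(*probs):
    """P(all events) for independent basic events."""
    out = 1.0
    for p in probs:
        out *= p
    return out

# Illustrative per-hour basic-event probabilities (made up for the sketch):
p_primary, p_backup = 1e-3, 1e-3   # sensor failures (from field data)
p_missed, p_false = 1e-2, 1e-4     # FDI missed detection / false alarm

# Top event: (primary fails AND detection missed) OR
#            (false alarm switches to backup AND backup fails).
p_top = or_gate(and_gate(p_primary, p_missed),
                and_gate(p_false, p_backup))
```

The point of the extension is visible in the tree itself: the FDI events (missed detection, false alarm) appear as basic events alongside the hardware failures.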

SLIDE 25

Extended Fault Tree Analysis

References

  • 1. Aslund, Biteus, Frisk, Krysander, and Nielsen. Safety analysis of autonomous systems by extended fault tree analysis. IJACSP, 2007.
  • 2. Hu and Seiler. A Probabilistic Method for Certification of Analytically Redundant Systems. SysTol Conference, 2013.

Incorporate failure modes due to false alarms and missed detections (per hour).

(Enumerate time-correlated failures and apply the law of total probability.)

SLIDE 26

Example: Dual-Redundant Architecture

Objective: Compute the reliability of the system assuming the sensors have a mean time between failures of 1000 hours.

[Block diagram: primary sensor m1(k) and back-up sensor m2(k) feed the fault detection logic (FDI), whose decision d(k) drives a switch s(k) that selects the output m̂(k).]
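The switching logic in this diagram can be sketched with a toy trace; the latching behavior (the switch holds the back-up sensor once a fault is declared) is an assumption about the diagram, not stated on the slide:

```python
def select_output(m1, m2, d):
    """Switch: pass the primary measurement m1 through until the FDI
    decision d flags a fault, then latch onto the back-up sensor m2."""
    out, use_backup = [], False
    for m1_k, m2_k, d_k in zip(m1, m2, d):
        use_backup = use_backup or bool(d_k)   # latch on first detection
        out.append(m2_k if use_backup else m1_k)
    return out

# Primary sticks at 0 at k = 3; the FDI detects it one step later,
# so the system emits one step of bad data before switching over.
m1 = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
m2 = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
d  = [0, 0, 0, 0, 1, 0]
m_hat = select_output(m1, m2, d)
```

That one-step window of bad data is exactly what the later N0-consecutive-steps failure criterion counts.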

SLIDE 27

Failure Modes

[Timeline diagrams for the failure modes in an N-step window: Missed Detection MN, False Alarm FN, Proper Detection DN, and Early False Alarm EN.

  • Missed Detection: primary fails at T1; no detection by T1+N0.
  • False Alarm: false alarm at TS; backup fails at T2; system failure at T2+N0.
  • Proper/Early Detection: primary fails at T1; failure detected at TS; backup fails at T2; system failure at T2+N0.]

SLIDE 28

System Failure Rate

  • Notation (from image labels): sensor failure per hour, false alarm per hour, detection per failure
  • Approximate system failure probability: [equation shown as an image]

SLIDE 29

System Failure Rate

  • Notation (from image labels): sensor failure per hour, false alarm per hour, detection per failure
  • Approximate system failure probability: [equation shown as an image; its terms are labeled: primary sensor fails + missed detection, false alarm + backup sensor fails, failure detected + backup sensor fails]

SLIDE 30

System Failure Rate

  • Notation (from image labels): sensor failure per hour, false alarm per hour, detection per failure
  • Approximate system failure probability: [equation shown as an image; its terms are labeled: primary sensor fails + missed detection, false alarm + backup sensor fails, failure detected + backup sensor fails]

Question: How can we compute these probabilities?
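The slide's equation is rendered as an image, so the decomposition below is a plausible reconstruction from its three labeled terms, not the authors' exact formula. Rates are per hour and the numbers are illustrative (MTBF = 1000 hr gives a failure rate of 10⁻³ per hour):

```python
def system_failure_per_hour(lam, p_fa, p_d):
    """Sum of the three labeled failure paths (a reconstruction):
      - primary fails and the failure is missed,
      - a false alarm switches to the backup, which then fails,
      - the failure is correctly detected, but the backup also fails."""
    missed = lam * (1.0 - p_d)
    false_alarm_then_backup = p_fa * lam
    detected_then_backup = lam * p_d * lam
    return missed + false_alarm_then_backup + detected_then_backup

# lam = sensor failure per hour, p_fa = false alarm per hour,
# p_d = detection probability per failure (all values illustrative).
p_sys = system_failure_per_hour(lam=1e-3, p_fa=1e-6, p_d=0.999)
```

Even with 99.9% detection, the missed-detection path and the detected-but-backup-fails path each contribute about 10⁻⁶ per hour here, so the dual-redundant system sits near 2 × 10⁻⁶ per hour.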

SLIDE 31

False Alarm Analysis

What is the conditional probability of an alarm given that no fault has occurred? Abstraction: a discrete-time uncertain linear system driven by noise.

SLIDE 32

Problem Formulation

(Healthy) dynamics for the residual; simple thresholding. Objective: Assume n_k is a stationary Gaussian process and the dynamic model for the residuals is known. Compute the probability P_N that |r_k| > T for some k in {1,…,N}.

SLIDE 33

Problem Formulation

(Healthy) dynamics for the residual; simple thresholding.

References

  • 1. Glaz and Johnson. Probability inequalities for multivariate distributions with dependence structures. JASA, 1984.
  • 2. Hu and Seiler. Probability Bounds for False Alarm Analysis of Fault Detection Systems. Allerton, 2013.

Theorem: There exist bounds γ_k (k = 1,…,N) such that

  • 1. γ_k ≥ P_N
  • 2. γ_k are monotonically non-increasing in k
  • 3. γ_k requires the evaluation of k-dimensional Gaussian integrals
SLIDE 34

Results: Effects of Correlation

False alarm probabilities and bounds for N = 360,000. For each (a, T), P_1 = 10⁻¹¹, which gives N·P_1 = 3.6 × 10⁻⁶. Neglecting correlations is accurate for small a, but not for a near 1.

Residual generation: r_{k+1} = a·r_k + n_k + f_k
Decision logic: d_k = 1 if |r_k| > T, 0 otherwise

SLIDE 35

Worst-case False Alarm Probability

Issue: The model depends on unknown (uncertain) parameters ∆. Objective: Compute the worst-case false alarm probability P_N*. Main result: Robust H2 analysis can be used to compute the worst-case residual variance. This yields bounds on P_N*.

Reference: Hu and Seiler. Worst-Case False Alarm Analysis of Aerospace Fault Detection Systems. Submitted to ACC, 2014.

SLIDE 36

Conclusions

  • Commercial aircraft achieve high levels of reliability.
  • Analytical redundancy is rarely used (certification issues).
  • Model-based fault detection methods are an alternative that enables size, weight, power, and cost to be reduced.
  • Tools for Systems Design and Certification:
    • Extended fault trees
    • Stochastic false alarm and missed detection analysis
    • Methods to validate the analysis using flight test data (Hu and Seiler, 2014 AIAA)

SLIDE 37

Acknowledgments

  • NASA Langley NRA NNX12AM55A: "Analytical Validation Tools for Safety Critical Systems Under Loss-of-Control Conditions," Technical Monitor: Dr. Christine Belcastro
  • Air Force Office of Scientific Research: Grant No. FA9550-12-0339, "A Merged IQC/SOS Theory for Analysis of Nonlinear Control Systems," Technical Monitor: Dr. Fariba Fahroo
  • NSF Cyber-Physical Systems: Grant No. 0931931, "Embedded Fault Detection for Low-Cost, Safety-Critical Systems," Program Manager: Theodore Baker

SLIDE 38

Backup Slides

SLIDE 39

Dual-Redundant Architecture

Objective: Efficiently compute the probability P_{S,N} that the system generates "bad" data for N0 consecutive steps in an N-step window.

[Block diagram: primary sensor m1(k) and back-up sensor m2(k) feed the fault detection logic (FDI), whose decision d(k) drives a switch s(k) that selects the output m̂(k).]

SLIDE 40

Assumptions

  • 1. Knowledge of probabilistic performance:
    a. Sensor failures: P[T_i = k], where T_i := failure time of sensor i
    b. FDI false alarm: P[T_S ≤ N | T_1 = N+1]
    c. FDI missed detection: P[T_S ≥ k + N0 | T_1 = k]
  • 2. Neglect intermittent failures
  • 3. Neglect intermittent switching logic
  • 4. Sensor failures and FDI logic decisions are independent
    • Sensors have no common failure modes.

SLIDE 41

Failure Modes

[Timeline diagrams for the failure modes in an N-step window: Missed Detection MN, False Alarm FN, Proper Detection DN, and Early False Alarm EN.

  • Missed Detection: primary fails at T1; no detection by T1+N0.
  • False Alarm: false alarm at TS; backup fails at T2; system failure at T2+N0.
  • Proper/Early Detection: primary fails at T1; failure detected at TS; backup fails at T2; system failure at T2+N0.]

SLIDE 42

System Failure Probability

  • Apply basic probability theory: [equation shown as an image]

SLIDE 43

System Failure Probability

  • Apply basic probability theory: [equation shown as an image]
  • Knowledge of probabilistic performance:
    a. Sensor failures: P[T_i = k], where T_i := failure time of sensor i

SLIDE 44

System Failure Probability

  • Apply basic probability theory: [equation shown as an image]
  • Knowledge of probabilistic performance:
    a. Sensor failures: P[T_i = k], where T_i := failure time of sensor i
    b. FDI false alarm: P[T_S ≤ N | T_1 = N+1]

SLIDE 45

System Failure Probability

  • Apply basic probability theory: [equation shown as an image]
  • Knowledge of probabilistic performance:
    a. Sensor failures: P[T_i = k], where T_i := failure time of sensor i
    b. FDI false alarm: P[T_S ≤ N | T_1 = N+1]
    c. FDI missed detection: P[T_S ≥ k + N0 | T_1 = k]

SLIDE 46

Example

  • Sensor failures: geometric distribution with parameter q
  • Residual-based threshold logic:

    Residual: r(k+1) = n(k) + f(k)
    Decision logic: d(k) = 1 if |r(k)| > T, 0 otherwise

    where f is an additive fault and n is IID Gaussian noise with variance σ², and T is the threshold.

[Diagram: measurement m1(k) enters a fault detection filter producing y(k), the residual r(k), and the decision d(k).]
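The geometric failure model can be checked numerically: with per-frame failure probability q, the mean failure time is 1/q, so an MTBF of 1000 hours corresponds to q = 10⁻³ if one frame is one hour (the unit of a "frame" here is an assumption for the sketch):

```python
def geometric_pmf(q, k):
    """P[Ti = k] for a geometric failure time: the sensor survives
    k - 1 frames and then fails at frame k."""
    return q * (1.0 - q) ** (k - 1)

q = 1e-3   # per-frame failure probability (MTBF = 1000 frames)

# Truncated mean; the tail beyond 200,000 frames is negligible
# since (1 - q)^200000 ~ e^-200.
mean_failure_time = sum(k * geometric_pmf(q, k) for k in range(1, 200_000))
```

The computed mean comes out at essentially 1000 frames, confirming the MTBF-to-q conversion used in the reliability numbers.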

SLIDE 47

Example

  • Per-frame false alarm probability can be easily computed. For each k, r(k) is N(0, σ²):

    P_F = Pr[ d(k) = 1 | No Fault ] = 1 − ∫_{−T}^{T} p(r) dr = 1 − erf( T / (√2 σ) )

  • Approximate per-hour false alarm probability:

    P[ T_S ≤ N | T_1 = N+1 ] = 1 − (1 − P_F)^N ≈ N·P_F

[Plot: P_FA(N) versus time window N; P_FA(30) = 0.0019 for σ = 0.25.]

Per-frame detection probability P_D can be similarly computed.
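Both formulas are easy to check numerically. Reproducing the slide's P_FA(30) = 0.0019 for σ = 0.25 requires a threshold value; T = 1 is an assumption here, since the slide does not state it:

```python
import math

def per_frame_false_alarm(T, sigma):
    """P_F = Pr[|r(k)| > T | no fault] for r(k) ~ N(0, sigma^2):
    P_F = 1 - erf(T / (sqrt(2) * sigma))."""
    return 1.0 - math.erf(T / (math.sqrt(2.0) * sigma))

def window_false_alarm(p_f, N):
    """Probability of at least one alarm in an N-frame window,
    assuming independent frames: 1 - (1 - P_F)^N, approximately N*P_F."""
    return 1.0 - (1.0 - p_f) ** N

p_f = per_frame_false_alarm(T=1.0, sigma=0.25)   # T = 1 is an assumed threshold
p_fa_30 = window_false_alarm(p_f, 30)            # ~0.0019, matching the slide
```

The linear approximation N·P_F is accurate here because P_F is tiny; the second-order correction is of size (N·P_F)²/2.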

SLIDE 49

System Failure Rate

  • Notation (from image labels): sensor failure per hour, false alarm per hour, detection per failure
  • Approximate system failure probability: [equation shown as an image; its terms are labeled: primary sensor fails + missed detection, false alarm + backup sensor fails, failure detected + backup sensor fails]

SLIDE 50

System Failure Rate

[Plot: system failure probability P_{S,N} (log scale, 10⁻⁶ to 10⁻³) versus normalized threshold T/σ (5 to 20), for fault sizes f/σ = 1, 6, and 10.]

Sensor mean time between failures = 1000 hr and N = 360,000 (= 1 hour at a 100 Hz rate).

SLIDE 51

Correlated Residuals

  • The example analysis assumed IID fault detection logic.
  • Many fault-detection algorithms use dynamical models and filters that introduce correlations in the residuals.
  • Question: How can we compute the FDI performance metrics when the residuals are correlated in time?
    • FDI false alarm: P[T_S ≤ N | T_1 = N+1]
    • FDI missed detection: P[T_S ≥ k + N0 | T_1 = k]

SLIDE 52

False Alarm Analysis with Correlated Residuals

  • Problem: Analyze the per-hour false alarm probability for a simple first-order fault detection system:

    Residual generation (0 < a < 1): r_{k+1} = a·r_k + n_k + f_k
    Decision logic: d_k = 1 if |r_k| > T, 0 otherwise

    where f is an additive fault and n is IID Gaussian noise with variance 1. The residuals are correlated in time due to the filtering. There are N = 360,000 samples per hour for a 100 Hz system.

  • The N-step false alarm probability P_N is the conditional probability that d_k = 1 for some 1 ≤ k ≤ N given the absence of a fault:

    P_N = 1 − ∫_{−T}^{T} ⋯ ∫_{−T}^{T} p_R(r_1, …, r_N) dr_1 ⋯ dr_N
SLIDE 53

False Alarm Analysis

  • The residuals r_{k+1} = a·r_k + n_k + f_k satisfy the Markov property:

    p(r_{k+1} | r_1, …, r_k) = p(r_{k+1} | r_k),
    so p_R(r_1, …, r_k) = p(r_k | r_{k−1}) ⋯ p(r_2 | r_1) · p(r_1)

  • P_N can be expressed as an N-step iteration of 1-dimensional integrals:

    f_N(r_N) = 1
    f_k(r_k) = ∫_{−T}^{T} f_{k+1}(r_{k+1}) p(r_{k+1} | r_k) dr_{k+1}, k = N−1, …, 1
    P_N = 1 − ∫_{−T}^{T} f_1(r_1) p(r_1) dr_1

    This has the appearance of a power iteration A^N x.

SLIDE 54

False Alarm Probability

  • Theorem: Let λ1 be the maximum eigenvalue and ψ1 the corresponding eigenfunction of

    λ1 ψ1(x) = ∫_{−T}^{T} p(y | x) ψ1(y) dy.

    Then P_N ≈ 1 − c λ1^{N−1}, where c = ⟨1, ψ1⟩.

  • Proof:
    • This is a generalization of the matrix power iteration.
    • The convergence proof relies on the Krein-Rutman theorem, which is a generalization of the Perron-Frobenius theorem.
    • For a = 0.999 and N = 360,000, the approximation error is 10⁻¹⁵⁶.

Ref: B. Hu and P. Seiler. False Alarm Analysis of Fault Detection Systems with Correlated Residuals. Submitted to IEEE TAC, 2012.
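The theorem can be checked on a discretized version of the integral operator. For the fault-free first-order residual with unit-variance noise, p(y | x) is the N(a·x, 1) density; the grid size and the a, T, N values below are illustrative choices, not from the paper:

```python
import numpy as np

def no_alarm_prob_and_eig(a, T, N, n_grid=400):
    """Discretize (K f)(x) = int_{-T}^{T} p(y|x) f(y) dy on a grid,
    with p(y|x) the N(a*x, 1) density. Returns the N-step no-alarm
    probability 1 - P_N and the dominant eigenvalue lambda_1 of K."""
    x = np.linspace(-T, T, n_grid)
    w = np.full(n_grid, 2.0 * T / (n_grid - 1))   # trapezoid weights
    w[0] = w[-1] = T / (n_grid - 1)
    # K[i, j] approximates p(x_j | x_i) * w_j
    K = np.exp(-0.5 * (x[None, :] - a * x[:, None]) ** 2) / np.sqrt(2 * np.pi) * w

    # Backward recursion f_N = 1, f_k = K f_{k+1}; then integrate
    # against the density of r_1 ~ N(0, 1) (taking r_0 = 0).
    p1 = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi) * w
    f = np.ones(n_grid)
    for _ in range(N - 1):
        f = K @ f
    no_alarm = float(p1 @ f)

    # Dominant eigenvalue of K via power iteration.
    v = np.ones(n_grid)
    lam = 0.0
    for _ in range(200):
        v = K @ v
        lam = v.max()
        v /= lam
    return no_alarm, lam

na_50, lam1 = no_alarm_prob_and_eig(a=0.5, T=3.0, N=50)
na_51, _ = no_alarm_prob_and_eig(a=0.5, T=3.0, N=51)
ratio = na_51 / na_50   # geometric decay at rate lambda_1, as the theorem states
```

Each extra step multiplies the no-alarm probability by almost exactly λ1, which is the content of the approximation 1 − P_N ≈ c λ1^{N−1}.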