Building Reliable and Safe Systems - Lessons Learned (Scott Torborg)

slide-1
SLIDE 1

Building Reliable and Safe Systems - Lessons Learned

Scott Torborg storborg@mit.edu April 2009

slide-2
SLIDE 2

The “Right” Way

  • Failure Modes and Effects Analysis (FMEA)
  • Root Cause Analysis (RCA)
  • MTBF, FIT, etc.

...yeah, yeah

Learn that at engineering school.

slide-3
SLIDE 3

Design

slide-4
SLIDE 4

Standards

http://www.flickr.com/photos/lickyoats/2290383219/

  • Might matter
  • EN 61508 = 10^9 hours MTBF for safety-critical systems
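
For scale: MTBF and FIT (failures per 10^9 device-hours) are two ways of stating the same number, as long as the failure rate is assumed constant. A minimal sketch of the conversion; the function names are illustrative, not from the talk.

```python
HOURS_PER_BILLION = 1e9  # FIT is defined as failures per 10^9 device-hours


def mtbf_to_fit(mtbf_hours: float) -> float:
    """Convert MTBF (hours) to FIT, assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_mtbf(fit: float) -> float:
    """Convert FIT back to MTBF (hours), same constant-rate assumption."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_BILLION / fit


if __name__ == "__main__":
    print(mtbf_to_fit(1e9))  # 1.0    -> 10^9 hours MTBF is 1 FIT
    print(mtbf_to_fit(1e6))  # 1000.0 -> 10^6 hours MTBF is 1000 FIT
```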

slide-5
SLIDE 5

Redundancy

http://www.flickr.com/photos/lickyoats/2290383219/

slide-6
SLIDE 6

Failure Isolation

http://www.flickr.com/photos/metrix_feet/357018809/

Redundancy is no good if failures affect everything at once.
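
A rough worked example of why isolation matters, using the common "beta-factor" approximation (an added illustration, not from the talk): a fraction `beta` of each unit's failures is assumed to take out both units together. The numbers are made up.

```python
def pair_failure_probability(p: float, beta: float) -> float:
    """Approximate probability that both units of a redundant pair fail.

    p    -- probability that a single unit fails during the mission
    beta -- assumed fraction of failures that are common-cause (0..1)
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("p and beta must be within [0, 1]")
    independent = ((1.0 - beta) * p) ** 2  # both fail for unrelated reasons
    common = beta * p                      # one shared event fails both
    return independent + common


if __name__ == "__main__":
    p = 0.001
    print(pair_failure_probability(p, beta=0.0))  # fully isolated pair
    print(pair_failure_probability(p, beta=0.1))  # 10% common-cause failures
```

With these assumed numbers, 10% common-cause failures make the pair about a hundred times more likely to fail than a fully isolated pair.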

slide-7
SLIDE 7

Heterogeneous Redundancy

(this is just adding redundancy in design)

Architecture of the space shuttle primary avionics software system - http://portal.acm.org/citation.cfm?id=358258

...this is impractical, because design is expensive

The space shuttle has 5x redundant computers with different configurations. Akamai championed the “keep the systems heterogeneous” mentality, but that’s easier with open source/other platforms.
slide-8
SLIDE 8

Graceful Degradation

slide-9
SLIDE 9

Failures aren’t uniform or random

...don’t treat them like they are

http://en.wikipedia.org/wiki/Bathtub_curve

Don’t apply MTBF without considering product lifetime.
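
The bathtub curve is often approximated with Weibull hazard rates: shape < 1 gives a falling failure rate (infant mortality), shape = 1 a constant one (the only region where a single MTBF figure describes the part), shape > 1 a rising one (wear-out). A sketch with illustrative parameters, not data from any real part:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k/s) * (t/s)**(k-1) of a Weibull model."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


if __name__ == "__main__":
    phases = (("infant mortality", 0.5), ("useful life", 1.0), ("wear-out", 3.0))
    for label, shape in phases:
        # Failure rate at 10, 100 and 1000 hours for an assumed 1000-hour scale
        rates = [weibull_hazard(t, shape, scale=1000.0) for t in (10.0, 100.0, 1000.0)]
        print(label, rates)
```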

slide-10
SLIDE 10

Manufacturers Mislead You

  • “Typical”? Yeah, right.
  • Sometimes they just lie, or don’t know
slide-11
SLIDE 11

Humans

  • Least reliable part of most systems
  • Political challenges v. Technical challenges
  • Interfaces and feedback
  • ...get stupid when in immediate danger
slide-12
SLIDE 12

Look at the whole picture

  • Reliability doesn’t stop at the product
  • Training
  • Maintenance
  • Support

“Despite these efforts, the F-22A continues to operate below its expected reliability rates. A key reliability requirement for the F-22A is a 3-hour mean time between maintenance intervals... Currently, the mean time between maintenance is less than 1 hour.” March 2008 GAO Congressional Report

slide-13
SLIDE 13

Testing

slide-14
SLIDE 14

Test it!

  • Do it yourself: putting your life on the line makes you very focused

  • Seeing field failures yourself helps
  • Have an answer for every “what if?”

There’s a limit to this, because testing is often destructive... focus on common use scenarios and the most risky situations.

slide-15
SLIDE 15

Know what happens

  • Reliable == Deterministic
  • Test everything
  • Talk to users as much as possible
  • If there’s an incident (death or injury), everyone stop everything

slide-16
SLIDE 16

Some environments to test

  • Temp / humidity extremes
  • Rapid changes in temp / humidity
  • High vibration
  • EMI / ESD
  • Oxidation risk (high-O2 or corrosive env.)
slide-17
SLIDE 17

Build Awesome Fixtures

(3000 feet!)

slide-18
SLIDE 18

Burn-in

Automated burn-in testing reduces infant mortality

slide-19
SLIDE 19

Maintenance

  • Record everything
  • Infrastructure helps: make it easy, identify trends

  • You wouldn’t try to write software without a bug tracker.
  • The better the tools you have for this, the more data you’ll get.
  • Keep the feedback loop between maintenance and design engineers tight.
slide-20
SLIDE 20

Tricks

slide-21
SLIDE 21

Generally

  • Use simpler, more reliable devices to supplement more complex devices

  • Voltage supervisors
  • Watchdog timers
  • Diagnostic sensors
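
A software analogue of a watchdog timer, as a sketch only: the point of the slide is that the real thing should be a separate, simpler hardware device. The class and method names are hypothetical.

```python
import time


class Watchdog:
    """Expires unless kick() is called at least once per timeout period."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        if timeout_s <= 0:
            raise ValueError("timeout must be positive")
        self._timeout_s = timeout_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the supervised code to prove it is still running."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True once the supervised code has been silent for too long."""
        return self._clock() - self._last_kick > self._timeout_s
```

A supervisor would poll `expired()` and force a reset or safe state when it returns True.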
slide-22
SLIDE 22

Logic is your friend

[Diagram: the FAULT lines of systems A and B feed a logic gate that drives one combined FAULT line]

Check Faults

Logic gates are small and very reliable, e.g. TI “Little Logic”

slide-23
SLIDE 23

Logic is your friend

[Diagram: the OK lines of systems A and B feed a logic gate that drives one combined OK line]

Check Status
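
In software terms, the two checks come down to an OR of fault lines and an AND of status lines. This sketch assumes active-high signals and that OR/AND are the gates used; both are assumptions, and the function names are illustrative.

```python
def combined_fault(*fault_lines: bool) -> bool:
    """OR: assert the shared FAULT line if any subsystem reports a fault."""
    if not fault_lines:
        raise ValueError("need at least one fault line")
    return any(fault_lines)


def combined_ok(*ok_lines: bool) -> bool:
    """AND: report OK only if every subsystem reports OK."""
    if not ok_lines:
        raise ValueError("need at least one status line")
    return all(ok_lines)


if __name__ == "__main__":
    print(combined_fault(False, True))  # True: B faulted, so the system faults
    print(combined_ok(True, False))     # False: B is not OK, so neither is the system
```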

slide-24
SLIDE 24

Logic is your friend

[Diagram: the CTL outputs of systems A and B are combined through logic into one shared CTL line]

Share Outputs

Each independent system can override the other. Needs careful control algorithms!

slide-25
SLIDE 25

Do it with power too

[Diagram: supplies A and B are combined into a single POWER rail]

Great chips for this, e.g. Linear PowerPath controllers

  • Can also be done with FETs, so don’t fret about power consumption.
slide-26
SLIDE 26

Voting Logic

[Diagram: Flaky Sensors A, B and C feed a voting Controller, which drives OUTPUT]

3, 5, 7... inputs

  • Don’t use just voting logic, because it can make a bad problem really bad.
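
A minimal sketch of a voter for an odd number of analog readings, using the median so one bad sensor cannot drag the result. Because voting alone can hide a growing problem, it also reports whether the sensors disagree. The names and the tolerance parameter are hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple


def vote(readings: Sequence[float], tolerance: float) -> Tuple[float, bool]:
    """Return (voted value, all_agree) for an odd number of redundant readings."""
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of readings, at least 3")
    voted = median(readings)  # middle value; a single outlier is ignored
    all_agree = all(abs(r - voted) <= tolerance for r in readings)
    return voted, all_agree


if __name__ == "__main__":
    print(vote([1.0, 1.1, 9.0], tolerance=0.2))  # (1.1, False): C is out of family
```

The `all_agree` flag is what should be logged or escalated; the voted value only keeps the system running in the meantime.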
slide-27
SLIDE 27

Detect Failures with Internal Models

  • Sensor value can’t change faster than 10mV/sec
  • Limit switch A can’t be tripped at the same time as limit switch B

Pick the simplest constraints (least amount of state required) and go up from there.
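
The two constraints above as a sketch. Each check needs at most one previous sample of state; the function names and the millivolt/second units are illustrative assumptions.

```python
def rate_ok(prev_mv: float, curr_mv: float, dt_s: float,
            max_rate_mv_per_s: float = 10.0) -> bool:
    """False if the sensor moved faster than physically plausible (10 mV/sec)."""
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    return abs(curr_mv - prev_mv) / dt_s <= max_rate_mv_per_s


def limits_ok(limit_a_tripped: bool, limit_b_tripped: bool) -> bool:
    """False if both limit switches claim to be tripped at the same time."""
    return not (limit_a_tripped and limit_b_tripped)


if __name__ == "__main__":
    print(rate_ok(100.0, 101.0, dt_s=1.0))  # True: 1 mV/sec is plausible
    print(rate_ok(100.0, 150.0, dt_s=1.0))  # False: 50 mV/sec means a bad sensor
    print(limits_ok(True, True))            # False: impossible state, flag a failure
```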

slide-28
SLIDE 28

I/O is like sex

Use protection!

[Diagram: protection components between the microcontroller and the OUTSIDE (bad: ESD, EMI)]

  • ESD/TVS diodes clamp over/under voltage
  • Small low-ESR caps (ceramic) absorb power spikes, ESD
  • Ferrite beads damp HF noise
  • Resistors isolate short conditions / component failures, reduce max currents

Don’t use all of these! Just some.

  • Be mindful of slew rates, extra capacitance, etc.
  • Excess capacitance or resistance can increase power consumption, exacerbate loads, and make things worse.
  • Ceramic caps work best for absorbing pulses, and can be a cheap substitute for an ESD diode.

  • Especially protect things like reset, fault, shutdown lines.
slide-29
SLIDE 29

Mechanical

  • Don’t overconstrain or stress the board
  • Vibration is bad
  • Potting helps
  • Piezoceramic effects
  • Beware of pressure effects with soft potting compounds at altitude and pressure
slide-30
SLIDE 30

Board Mounting

Loosen up, it’s not going anywhere

slide-31
SLIDE 31

Components

Large components are more vulnerable

slide-32
SLIDE 32

Piezoceramics

(e.g. ceramic capacitor) = power supply noise

slide-33
SLIDE 33

Piezoceramics

sometimes intentional

slide-34
SLIDE 34

Some things that suck

slide-35
SLIDE 35

Most capacitors

  • Tantalum especially
  • Electrolytic bad long-term because of leaking

slide-36
SLIDE 36

Electromechanical Devices

  • Mechanical Relays ➔ Solid-state Relays
  • Tilt Switches ➔ Accelerometers
  • Mechanical Switches ➔ Piezo, FETs
  • Connectors

The least bad connectors are optical, or process control/instrumentation connectors (e.g. M8, M12).

slide-37
SLIDE 37

Tin Whiskers

Until they’re dealt with, get an RoHS exemption

http://nepp.nasa.gov/WHISKER/index.html

Can also occur with other metals, e.g. zinc. The solution is to use leaded solder.

slide-38
SLIDE 38

Flux Residue

http://glacier.lbl.gov/%7Egtp/DOM/MB/V5.0/206_ps4.JPG

Clean boards after assembly!

Often overlooked reliability issue, particularly for low-voltage analog circuits.

slide-39
SLIDE 39

ESD

  • Take it seriously!
  • Especially while potting and testing
slide-40
SLIDE 40

Some things that don’t suck

Notice a trend? These things that don’t suck apply the same principles discussed earlier.

slide-41
SLIDE 41

PPTCs

  • Polymer Positive Temp Coefficient
  • Like a fuse, but resets
slide-42
SLIDE 42

TDK Capacitors

  • Affects MLCC (multilayer ceramic caps)
  • “Open Mode” - Fail open instead of fail short
  • Much more conservative ratings
slide-43
SLIDE 43

Hi-Rel

  • Only part of solution
  • Flight-grade, mil-spec, etc.
  • $$$ Expensive
  • Don’t go overboard
slide-44
SLIDE 44

Envirogel

  • Makes potting practical
  • Watch out for rapid pressure changes, behaves like lipid tissue

http://www.kellerstudio.de/repairfaq/sam/ya234p1.jpg

slide-45
SLIDE 45

CAN Bus

  • Deterministic
  • Robust
  • Fault Tolerant
slide-46
SLIDE 46

Process Control Connectors

  • E.g. M8, M12
  • Affordable and easy
  • Turck, Phoenix, Woodhead, Binder, Tyco

slide-47
SLIDE 47

In short...

  • Be paranoid
  • Test thoroughly
  • Analyze everything

...thanks!