Building Reliable and Safe Systems - Lessons Learned
Scott Torborg storborg@mit.edu April 2009
The Right Way
Failure Modes and Effects Analysis (FMEA)
Root Cause Analysis (RCA)
MTBF, FIT, etc.
...yeah, yeah
Learn that at engineering school.
http://www.flickr.com/photos/lickyoats/2290383219/
http://www.flickr.com/photos/metrix_feet/357018809/
Redundancy is no good if failures affect everything at once.
(this is just adding redundancy in design)
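A rough way to see why shared failures defeat redundancy is the beta-factor model, a common textbook approximation that is not from the original slides. The sketch below assumes a hypothetical parameter beta: the fraction of each unit's failure probability that comes from a cause hitting every unit at once (shared power, shared software bug, same environment).

```python
def p_all_fail(p_unit: float, n_units: int, beta: float = 0.0) -> float:
    """Probability that all n redundant units are down at the same time.

    p_unit  -- failure probability of a single unit (illustrative value)
    n_units -- number of redundant units
    beta    -- assumed fraction of p_unit caused by a shared event
    """
    if not (0.0 <= p_unit <= 1.0 and 0.0 <= beta <= 1.0) or n_units < 1:
        raise ValueError("p_unit and beta must be in [0, 1], n_units >= 1")
    p_common = beta * p_unit               # one event takes out every unit
    p_independent = (1.0 - beta) * p_unit  # each unit fails on its own
    # The system is lost if the common cause strikes, or if it does not
    # and every unit happens to fail independently.
    return p_common + (1.0 - p_common) * p_independent ** n_units


if __name__ == "__main__":
    print(p_all_fail(0.01, 2))            # fully independent units
    print(p_all_fail(0.01, 2, beta=0.1))  # 10% of failures are shared
```

With a 1% unit failure probability and two units, fully independent failures give a 1 in 10,000 chance of losing both. If only 10% of failures are shared, that rises to roughly 1 in 1,000, and adding more identical units barely helps.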
Architecture of the space shuttle primary avionics software system - http://portal.acm.org/citation.cfm?id=358258
Space shuttle has 5x redundant computers, with different configurations. Akamai championed the “keep the systems heterogeneous” mentality, but that’s easier with
http://en.wikipedia.org/wiki/Bathtub_curve
Don’t apply MTBF without considering product lifetime.
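As a worked illustration of that point, here is a minimal sketch that turns an MTBF (or FIT) figure into an expected fraction of failed units over a service life. It assumes a constant failure rate, which only holds in the flat middle of the bathtub curve, so it ignores infant mortality and wear-out. The function names and the example numbers are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def fit_to_mtbf_hours(fit: float) -> float:
    """Convert FIT (failures per 10^9 device-hours) to MTBF in hours."""
    return 1e9 / fit


def fraction_failed(mtbf_hours: float, service_hours: float) -> float:
    """Expected fraction of units failed after service_hours.

    Uses the exponential model R(t) = exp(-t / MTBF), valid only while
    the failure rate is constant.
    """
    return 1.0 - math.exp(-service_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = fit_to_mtbf_hours(1000)   # 1000 FIT -> 1,000,000 hours
    life = 5 * HOURS_PER_YEAR        # five years of continuous operation
    print(f"MTBF: {mtbf:.0f} h, failed after 5 years: "
          f"{fraction_failed(mtbf, life):.1%}")
```

A 1,000,000-hour MTBF does not mean a unit lasts 114 years. Under this model it means about 4% of units fail within five years of continuous use, and the real number is worse once the product enters the wear-out end of the curve.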
“Despite these efforts, the F-22A continues to operate below its expected reliability rates. A key reliability requirement for the F-22A is a 3-hour mean time between maintenance intervals... Currently, the mean time between maintenance is less than 1 hour.” March 2008 GAO Congressional Report
There’s a limit to this, because testing is often destructive... focus on common use scenarios and the most risky situations.
[Diagram: systems A and B; signals labeled FAULT]
Logic gates are small and very reliable, e.g. TI “Little Logic”
[Diagram: systems A and B; signals labeled OK]
[Diagram: systems A and B; signals labeled CTL]
Each independent system can override the other. Needs careful control algorithms!
[Diagram: A and B both connected to POWER]
Great chips for this, e.g. Linear PowerPath controllers
[Diagram: Flaky Sensor A, Flaky Sensor B, and Flaky Sensor C feed a Controller, which drives OUTPUT]
3, 5, 7... inputs
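One simple way to combine an odd number of flaky sensors is a median vote, sketched below. This is an illustrative choice rather than the method shown on the slide. The function name `vote` and the sample readings are hypothetical.

```python
from statistics import median
from typing import Sequence


def vote(readings: Sequence[float]) -> float:
    """Return the median of an odd number (3, 5, 7...) of sensor readings.

    With n inputs, the median stays within the range of the good sensors
    as long as fewer than half of the inputs are wrong.
    """
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of readings, at least 3")
    return median(readings)


if __name__ == "__main__":
    # Sensor C has failed high; the voted output ignores it.
    print(vote([20.1, 19.9, 85.0]))
    # The same function works as a majority vote on 0/1 inputs.
    print(vote([1, 0, 1]))
```

An odd input count avoids ties, which is why the options are 3, 5, 7 and so on. The voter itself holds no state, so it stays small enough to be trusted more than the sensors it protects against.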
Pick the simplest constraints (least amount of state required) and go up from there.
microcontroller
OUTSIDE (bad ESD, EMI) ESD/TVS diodes clamp
small low-ESR caps (ceramic) absorb power spikes, ESD
ferrite beads damp HF noise
resistors isolate short conditions / component failures, reduce max currents
Don’t use all of these! Just some.
make things worse.
diode.
The least bad connectors are optical, or process control/instrumentation connectors (e.g. M8, M12).
http://nepp.nasa.gov/WHISKER/index.html
Can also occur with other metals, e.g. zinc. The solution is to use leaded solder.
http://glacier.lbl.gov/%7Egtp/DOM/MB/V5.0/206_ps4.JPG
Often overlooked reliability issue, particularly for low-voltage analog circuits.
Notice a trend? These things that don’t suck apply the same principles discussed earlier.
http://www.kellerstudio.de/repairfaq/sam/ya234p1.jpg