
Building Reliable and Safe Systems - Lessons Learned, Scott Torborg



  1. Building Reliable and Safe Systems - Lessons Learned Scott Torborg storborg@mit.edu April 2009

  2. The “Right” Way • Failure Modes and Effects Analysis (FMEA) • Root Cause Analysis (RCA) • MTBF, FIT, etc. ...yeah, yeah. Learn that at engineering school.

  3. Design

  4. Standards • Might matter • EN 61508 = 10^9 hours MTBF for safety-critical systems http://www.flickr.com/photos/lickyoats/2290383219/
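To get a feel for what a figure like 10^9 hours means, here is a minimal sketch that converts an MTBF into a FIT rate and into a probability of failure over a mission time. It assumes a constant failure rate (exponential lifetime model), which real hardware rarely follows. The function names are illustrative and not from the talk.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def fit_rate(mtbf_hours: float) -> float:
    """Failures In Time: expected failures per 10^9 device-hours."""
    return 1e9 / mtbf_hours


def prob_failure(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of at least one failure during the mission.

    Assumes a constant failure rate of 1 / MTBF (exponential model).
    """
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1e9  # hours, the figure quoted on the slide
    print("FIT:", fit_rate(mtbf))
    print("P(failure in one year):", prob_failure(mtbf, HOURS_PER_YEAR))
```

Under this model an MTBF of 10^9 hours is 1 FIT, and the chance of a failure in one year of continuous operation is on the order of 10^-5.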

  5. Redundancy http://www.flickr.com/photos/lickyoats/2290383219/

  6. Failure Isolation http://www.flickr.com/photos/metrix_feet/357018809/ Redundancy is no good if failures affect everything at once.
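A rough way to see why isolation matters: the sketch below uses a simplified beta-factor style model, in which a fraction `beta` of each channel's failure probability is a common cause that takes out every channel at once. The model choice and all numbers are assumptions for illustration, not something the talk specifies.

```python
def system_failure_prob(p: float, n: int, beta: float) -> float:
    """Failure probability of n redundant channels with common-cause failures.

    p    : failure probability of a single channel
    n    : number of redundant channels (system fails only if all fail)
    beta : fraction of p that is common-cause (0 = fully isolated)
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0 and n >= 1):
        raise ValueError("need 0 <= p <= 1, 0 <= beta <= 1, n >= 1")
    common = beta * p                        # one event kills all channels
    independent = ((1.0 - beta) * p) ** n    # every channel fails on its own
    # System survives only if neither of the two things happens.
    return 1.0 - (1.0 - common) * (1.0 - independent)


if __name__ == "__main__":
    for beta in (0.0, 0.01, 0.1):
        print(beta, system_failure_prob(0.01, 2, beta))
```

With these example numbers, even a 10% common-cause share makes the duplicated system roughly ten times more likely to fail than the fully isolated one, so the common-cause term dominates long before the channels themselves do.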

  7. Heterogeneous Redundancy (this is just adding redundancy in design) ...this is impractical, because design is expensive. Architecture of the space shuttle primary avionics software system - http://portal.acm.org/citation.cfm?id=358258 Space shuttle has 5x redundant computers, with different configurations. Akamai championed the “keep the systems heterogeneous” mentality, but that’s easier with open source/other platforms.

  8. Graceful Degradation

  9. Failures aren’t uniform or random ...don’t treat them like they are http://en.wikipedia.org/wiki/Bathtub_curve Don’t apply MTBF without considering product lifetime.
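The bathtub curve can be sketched as the sum of three Weibull hazard rates: a decreasing one (infant mortality), a constant one (random failures), and an increasing one (wear-out). The shape and scale parameters below are made-up values chosen only to produce the familiar shape.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1)."""
    if t_hours <= 0:
        raise ValueError("t_hours must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Sum of three hazards; all parameters are hypothetical."""
    infant = weibull_hazard(t_hours, shape=0.5, scale_hours=2e5)   # decreasing
    random_ = weibull_hazard(t_hours, shape=1.0, scale_hours=1e5)  # constant
    wearout = weibull_hazard(t_hours, shape=4.0, scale_hours=5e4)  # increasing
    return infant + random_ + wearout


if __name__ == "__main__":
    for t in (10, 1_000, 10_000, 60_000):
        print(f"{t:>6} h: {bathtub_hazard(t):.2e} failures/hour")
```

A single MTBF number corresponds only to the flat middle of this curve, which is why it says little about a product that is used mostly in its first hours or kept past its wear-out point.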

  10. Manufacturers Mislead You • “Typical”? Yeah, right. • Sometimes they just lie, or don’t know

  11. Humans • Least reliable part of most systems • Political challenges v. Technical challenges • Interfaces and feedback • ...get stupid when in immediate danger

  12. Look at the whole picture • Reliability doesn’t stop at the product • Training • Maintenance • Support “Despite these efforts, the F-22A continues to operate below its expected reliability rates. A key reliability requirement for the F-22A is a 3-hour mean time between maintenance intervals... Currently, the mean time between maintenance is less than 1 hour.” March 2008 GAO Congressional Report

  13. Testing

  14. Test it! • Do it yourself: putting your life on the line makes you very focused • Seeing field failures yourself helps • Have an answer for every “what if?” There’s a limit to this, because testing is often destructive... focus on common use scenarios and the most risky situations.

  15. Know what happens • Reliable == Deterministic • Test everything • Talk to users as much as possible • If there’s an incident (death or injury) everyone stop everything

  16. Some environments to test • Temp / humidity extremes • Rapid changes in temp / humidity • High vibration • EMI / ESD • Oxidation risk (high-O2 or corrosive env.)

  17. Build Awesome Fixtures (3000 feet!)

  18. Burn-in Automated burn-in testing reduces infant mortality

  19. Maintenance • Record everything • Infrastructure helps make it easy, identify trends - You wouldn’t try to write software without a bug tracker. - The better the tools you have for this, the more data you’ll get. - Keep the feedback loop between maintenance and design engineers tight.

  20. Tricks

  21. Generally • Use simpler, more reliable devices to supplement more complex devices • Voltage supervisors • Watchdog timers • Diagnostic sensors
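A watchdog timer is normally a separate hardware device, but its behavior is easy to model. The sketch below is a software stand-in under that assumption: the main loop must "kick" the watchdog within a timeout, otherwise the watchdog reports expiry, at which point a real device would reset the system. The class and method names are hypothetical.

```python
class Watchdog:
    """Minimal model of a watchdog timer with an injectable clock."""

    def __init__(self, timeout_s: float, now_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick_s = now_s

    def kick(self, now_s: float) -> None:
        """Called by healthy application code to prove it is still running."""
        self.last_kick_s = now_s

    def expired(self, now_s: float) -> bool:
        """True if the application failed to kick within the timeout."""
        return (now_s - self.last_kick_s) > self.timeout_s


if __name__ == "__main__":
    wd = Watchdog(timeout_s=1.0, now_s=0.0)
    wd.kick(now_s=0.9)
    print("expired at 1.5 s:", wd.expired(now_s=1.5))
    print("expired at 2.0 s:", wd.expired(now_s=2.0))
```

Passing the time in explicitly keeps the model deterministic and easy to test, and it keeps the watchdog logic far simpler than the code it supervises.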

  22. Logic is your friend Check Faults [Diagram: FAULT A and FAULT B combine into a single FAULT signal] Logic gates are small and very reliable, e.g. TI “Little Logic”

  23. Logic is your friend Check Status [Diagram: OK A and OK B combine into a single OK signal]
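One plausible reading of the two "Check Faults" and "Check Status" slides above, written out as truth functions: faults are OR-ed (any fault raises the combined fault) and OK signals are AND-ed (every subsystem must report OK). The gate types are an assumption, since the slide text does not name them, and in hardware this would be a single gate rather than code.

```python
from typing import Iterable


def any_fault(fault_lines: Iterable[bool]) -> bool:
    """Combined FAULT: asserted if any individual fault line is asserted (OR)."""
    return any(fault_lines)


def all_ok(ok_lines: Iterable[bool]) -> bool:
    """Combined OK: asserted only if every OK line is asserted (AND).

    An empty input is treated as not OK, so a missing signal fails safe.
    """
    lines = list(ok_lines)
    return bool(lines) and all(lines)


if __name__ == "__main__":
    print("FAULT:", any_fault([False, True]))
    print("OK:", all_ok([True, True]))
```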

  24. Logic is your friend Share Outputs [Diagram: CTL outputs of A and B combine into one shared CTL line] Each independent system can override the other. Needs careful control algorithms!

  25. Do it with power too [Diagram: supplies A and B combine into one POWER rail] Great chips for this, e.g. Linear PowerPath controllers - Can also be done with FETs, so don’t fret about power consumption.

  26. Voting Logic [Diagram: Flaky Sensors A, B, and C feed a Controller, which produces OUTPUT] Use 3, 5, 7... inputs - Don’t use just voting logic, because it can make a bad problem really bad.
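A minimal sketch of a voter for an odd number of analog sensors, assuming a median vote plus a disagreement report. Reporting the outliers is one way to avoid relying on voting alone, since a silently outvoted sensor is a failure nobody hears about. The tolerance value and function name are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def vote(readings: Sequence[float], tolerance: float) -> Tuple[float, List[int]]:
    """Return (voted value, indices of sensors that disagree with it).

    Requires an odd number of readings (3, 5, 7, ...), so the median is
    always one of the actual readings.
    """
    if len(readings) < 3 or len(readings) % 2 == 0:
        raise ValueError("need an odd number of readings, at least 3")
    voted = statistics.median(readings)
    outliers = [i for i, r in enumerate(readings) if abs(r - voted) > tolerance]
    return voted, outliers


if __name__ == "__main__":
    value, suspects = vote([1.00, 1.02, 5.0], tolerance=0.1)
    print("voted value:", value, "suspect sensors:", suspects)
```

A controller using this would act on the voted value but also log or raise a fault for each suspect sensor, so that a second failure does not arrive unannounced.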

  27. Detect Failures with Internal Models • Sensor value can’t change faster than 10mV/sec • Limit switch A can’t be tripped at the same time as limit switch B Pick the simplest constraints (least amount of state required) and go up from there.
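The two constraints on the slide above can be sketched as stateless or near-stateless plausibility checks. The 10 mV/sec limit comes from the slide; the function names and the sample values are illustrative assumptions.

```python
MAX_SLEW_MV_PER_S = 10.0  # from the slide: value can't change faster than this


def slew_ok(prev_mv: float, curr_mv: float, dt_s: float) -> bool:
    """True if the sensor changed no faster than the allowed rate.

    Needs only one item of state: the previous reading.
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    return abs(curr_mv - prev_mv) <= MAX_SLEW_MV_PER_S * dt_s


def limits_ok(limit_a_tripped: bool, limit_b_tripped: bool) -> bool:
    """True unless both limit switches claim to be tripped at once.

    Needs no state at all.
    """
    return not (limit_a_tripped and limit_b_tripped)


if __name__ == "__main__":
    print("small step plausible:", slew_ok(100.0, 100.5, dt_s=0.1))
    print("big jump plausible:", slew_ok(100.0, 105.0, dt_s=0.1))
    print("both limits tripped plausible:", limits_ok(True, True))
```

A failed check does not say which component broke, only that the readings contradict the model, which is usually enough to trigger a safe state.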

  28. I/O is like sex Use protection! [Diagram: protection components placed between OUTSIDE (bad ESD, EMI) and the microcontroller] • Resistors: isolate short conditions / component failures, reduce max currents • Ferrite beads: damp HF noise • Small low-ESR caps (ceramic): absorb power spikes, ESD • ESD/TVS diodes: clamp over/under voltage Don’t use all of these! Just some. - Be mindful of slew rates, extra capacitance, etc. - Excess capacitance or resistance can increase power consumption, exacerbate loads, and make things worse. - Ceramic caps work best for absorbing pulses, and can be a cheap substitute for an ESD diode. - Especially protect things like reset, fault, shutdown lines.

  29. Mechanical • Don’t overconstrain or stress the board • Vibration is bad • Potting helps • Piezoceramic effects - Beware of pressure effects with soft potting compounds at altitude and pressure

  30. Board Mounting Loosen up, it’s not going anywhere

  31. Components Large components are more vulnerable

  32. Piezoceramics (e.g. ceramic capacitor) = power supply noise

  33. Piezoceramics sometimes intentional

  34. Some things that suck

  35. Most capacitors • Tantalum especially • Electrolytic bad long-term because of leaking

  36. Electromechanical Devices • Mechanical Relays ➔ Solid-state Relays • Tilt Switches ➔ Accelerometers • Mechanical Switches ➔ Piezo, FETs • Connectors The least bad connectors are optical, or process control/instrumentation connectors (e.g. M8, M12).

  37. Tin Whiskers Until they’re dealt with, get an RoHS exemption http://nepp.nasa.gov/WHISKER/index.html Can also occur with other metals, e.g. zinc. The solution is to use leaded solder.

  38. Flux Residue Clean boards after assembly! http://glacier.lbl.gov/%7Egtp/DOM/MB/V5.0/206_ps4.JPG Often overlooked reliability issue, particularly for low-voltage analog circuits.

  39. ESD • Take it seriously! • Especially while potting and testing

  40. Some things that don’t suck Notice a trend? These things that don’t suck apply the same principles discussed earlier.

  41. PPTCs • Polymer Positive Temp Coefficient • Like a fuse, but resets

  42. TDK Capacitors - Affects MLCC (multilayer ceramic caps) - “Open Mode” - Fail open instead of fail short - Much more conservative ratings

  43. Hi-Rel • Only part of solution • Flight-grade, mil-spec, etc. • $$$ Expensive • Don’t go overboard

  44. Envirogel • Makes potting practical • Watch out for rapid pressure changes, behaves like lipid tissue http://www.kellerstudio.de/repairfaq/sam/ya234p1.jpg

  45. CAN Bus • Deterministic • Robust • Fault Tolerant

  46. Process Control Connectors • E.g. M8, M12 • Affordable and easy • Turck, Phoenix, Woodhead, Binder, Tyco

  47. In short... • Be paranoid • Test thoroughly • Analyze everything ...thanks!
