What can nuclear engineering teach us about software?
Todd Lewis & Eduardo Bellani
tlewis@brickabode.com, emb@brickabode.com
24 April 2017
Read every word of Lamport
- Leslie Lamport (1977): "Proving the Correctness of Multiprocess Programs"
- This paper is amazing
- Leslie Lamport is amazing
- He did "Time, Clocks, and the Ordering of Events in a Distributed System" only a year later
- (Has there ever been a computer science decade as great as the 1970s?)
System properties come in two kinds!
- Computing is great at liveness: lots of features!
- Benefit of features often outweighs cost of failure, so “Move Fast & Break Things”
- However, we often do safety so badly that there is opportunity there; lots of low-hanging fruit
| | Liveness | Safety |
|---|---|---|
| When | Sometimes | Always |
| Where | Somewhere | Everywhere |
| Nature | Good thing | Bad thing |
| Action | Happens | Does not happen |
| Means | Feature | Control |
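The two kinds of property can be checked mechanically once a system's behavior is written down as a sequence of states. The sketch below is only an illustration, not anything from Lamport's paper: it assumes a finite, recorded trace (real liveness is a claim about unbounded executions), and the state names and the helpers `holds_safety` and `holds_liveness` are hypothetical.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


def holds_safety(trace: Iterable[S], bad: Callable[[S], bool]) -> bool:
    """Safety: the bad thing does not happen in any state of the trace."""
    return not any(bad(state) for state in trace)


def holds_liveness(trace: Iterable[S], good: Callable[[S], bool]) -> bool:
    """Liveness (finite approximation): the good thing happens in some state."""
    return any(good(state) for state in trace)


# Hypothetical trace of one request moving through a service.
trace = ["idle", "requested", "working", "done"]

never_crashes = holds_safety(trace, lambda s: s == "crashed")
eventually_done = holds_liveness(trace, lambda s: s == "done")
```

Note the asymmetry: a single bad state is enough to refute safety, while a finite trace can only confirm liveness, never refute it, because the good thing might still happen later.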
Let’s talk about saving lives
- Starting in the 1970s, human factors analysis started happening in aviation
- What used to be called “pilot error” is now recognized as “bad interface design”
- Hundreds of thousands of people are alive today who would otherwise be dead because of this advance
Compare and contrast
Let’s design a nuclear plant!
- We are putting a nuclear plant next to the ocean
- Your mother lives next door
- What failures would you want the designers to care about?
Multi-system failures (Oceanic edition)
| Bad outcome | Cause | Control |
|---|---|---|
| Multi-system failure | Tsunami | Put critical infrastructure up high |
| Multi-system failure | Corrosion | Annual inspections |
| Multi-system failure | Flooding | Sea wall and drainage |
| Multi-system failure | Loss of coolant (biomass clogs pipes) | Inspect & clean pipes |
| Multi-system failure | Sea-borne attack | Sea walls |
| Multi-system failure | Erosion kills plant | Sea walls |
| Multi-system failure | Sedimentation blocks coolant | Inspect & dredge |
We can do this systematically
1) What failures matter? (“Bad business outcome” is a useful criterion)
2) For each failure, what can cause it?
3) How do you address each cause?
- Gives you a finite list of hazards handled
- Gives you a clear model to give to your operators: here are the risks we manage, and how
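The failure, cause, control breakdown maps naturally onto a small table in code, which makes gaps easy to spot. This is a rough sketch under assumptions: the `Hazard` record and both helper functions are hypothetical names, and the rows are a subset of the oceanic example, with one control deliberately left blank to show what the gap check reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    failure: str  # the bad outcome we care about
    cause: str    # one thing that can bring it about
    control: str  # how we address that cause; empty means "not yet handled"


hazards = [
    Hazard("Multi-system failure", "Tsunami", "Put critical infrastructure up high"),
    Hazard("Multi-system failure", "Flooding", "Sea wall and drainage"),
    # Control blanked on purpose for this illustration;
    # the slide's table lists "Inspect & dredge" for this cause.
    Hazard("Multi-system failure", "Sedimentation blocks coolant", ""),
]


def uncontrolled(rows):
    """Return the hazards whose cause has no control written down."""
    return [row for row in rows if not row.control.strip()]


def causes_by_failure(rows):
    """Group causes under each failure: the model to hand to operators."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row.failure, []).append(row.cause)
    return grouped


gaps = [row.cause for row in uncontrolled(hazards)]
```

Keeping the list as plain data means the same rows can drive a write-up for operators and a check that no cause is left without a control.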
Pro tip: Create a Red Team
- It is psychologically difficult to look at your own designs critically
- You need distance in order to tease out assumptions and blind spots
- Bring an outsider into your analysis, and encourage them to ask “dumb questions”
How to do this
1) Get a few hours of whiteboard time: your team, plus a smart outsider
2) Failures → Causes → Controls
3) Write it up
4) Start sharing it with others: here are some new options to improve our system
Where to find more
- Engineering a Safer World, by Nancy Leveson
- Resilience Engineering, by Hollnagel, Woods and Leveson
- Drift into Failure, by Sidney Dekker
“Correct, On-Time, On-Budget”
- Do you like building systems that work?
- Are you a Haskell, ML, or Lisp programmer?
- Meet us after the talk!
- jobs@brickabode.com