

SLIDE 1

Avoiding Accidents - A Mission Impossible?

Michael Dorner

Chair for Network Architectures and Services Department for Computer Science Technical University of Munich

11.10.2013

Michael Dorner: Avoiding Accidents - A Mission Impossible? 1

SLIDE 2

Outline

1. Motivation for a science based on accidents
2. Normal Accident Theory and High Reliability Theory
   • Normal Accident Theory
   • High Reliability Organization
   • NAT and/vs. HRO?
3. Accidents and Computer Systems
4. Conclusion

SLIDE 3

Motivation

• Accidents happen to all of us
• The reasons for accidents are not as simple as we sometimes think
• Most accident investigations stop after assigning blame, most often to the operator
• What should we do if accidents must not happen, e.g. because the risk is a nuclear catastrophe?

SLIDE 4

Accident at Three Mile Island

• Three Mile Island (TMI) was, and still is, a nuclear power plant
• In 1979, there was a partial meltdown in reactor 2
• It was the most serious accident in nuclear energy in the US

SLIDE 5

Basic Layout of an NPP

SLIDE 6

What happened?

• The plant's feedwater pumps failed, so the reactor was not cooled properly
• Increased pressure forced an emergency relief
• The relief valve got stuck open, but its indicator showed it was closed
• The operators failed to realize what was happening
• More coolant escaped, and part of the fuel rods was no longer covered
• The nuclear fuel rods began to melt down

SLIDE 7

What’s next

1. Motivation for a science based on accidents
2. Normal Accident Theory and High Reliability Theory
   • Normal Accident Theory
   • High Reliability Organization
   • NAT and/vs. HRO?
3. Accidents and Computer Systems
4. Conclusion

SLIDE 8

Normal Accident Theory - NAT

SLIDE 9

What is NAT?

• For the next 20 minutes, it is not Network Address Translation!
• NAT is an accident theory that tries to explain accidents
• It pays special attention to systems that are tightly coupled and have highly complex interactions (HC2-systems), such as nuclear power plants
• According to NAT, these systems suffer "normal accidents" due to unanticipated interactions of component failures, which cannot be avoided

SLIDE 10

Interactive Complexity Criteria

• Isolation of errors and easy fixes are not possible (e.g. replacing A with B)
• Common-mode failures: a failure in one system causes failures in multiple systems
• Uncertainty about exact processes, internal feedback, and the exact system state
• Local proximity or interconnection of subsystems
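The common-mode criterion can be made concrete with a small, hypothetical sketch (the class and service names below are invented for illustration and do not come from the talk): several seemingly independent services share one component, so a single fault disables all of them at once.

```python
# Hypothetical common-mode failure sketch: three seemingly independent
# services all depend on the same shared component (here: a power supply),
# so a single fault in that component takes all of them down at once.

class SharedPowerSupply:
    def __init__(self):
        self.healthy = True

class Service:
    def __init__(self, name, supply):
        self.name = name
        self.supply = supply

    def is_up(self):
        # the service has no fault of its own; it only
        # inherits the state of the shared component
        return self.supply.healthy

supply = SharedPowerSupply()
services = [Service(n, supply) for n in ("pump-control", "valve-monitor", "alarm")]

print([s.is_up() for s in services])  # -> [True, True, True]
supply.healthy = False                # one fault in the shared component ...
print([s.is_up() for s in services])  # -> [False, False, False]: all fail together
```

Note that redundancy at the service level would not help here; only removing the shared dependency would.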

SLIDE 11

Coupling Criteria

• Processes follow an immutable order and cannot be delayed or interrupted
• Only one path leads to success
• Buffers and redundancy must be designed into the system from the very start
• The system has little slack

SLIDE 12

What NAT does not do

• It does not say anything about how accidents can be avoided
• It does not offer a metric for interactive complexity or coupling
• It does not treat intentions, group interests, and other human factors as relevant

SLIDE 13

High Reliability Organization - HRO

SLIDE 14

What is HRO?

• HRO is an organizational strategy to prevent accidents in HC2-systems, i.e. to provide high reliability
• It encourages redundancy as the central technical means to prevent accidents
• It focuses on an organizational "culture of reliability", which is put in place by centralized control and executed in a decentralized manner
• An HRO values reliability over everything else, even performance and cost

SLIDE 15

Culture of Reliability

• Preoccupation with failure: could this minor fault result in a big catastrophe next time?
• Don't simplify: stay aware of the complex processes
• Sensitivity to operations
• Commitment to resilience: improve crisis management
• Value experience over hierarchies

SLIDE 16

What HRO does not do

• HRO does not offer an alternative accident model; in fact, it claims to be based on NAT
• It does not consider system design (except redundancy), but operation and organization
• It also does not offer a metric for interactive complexity and coupling

SLIDE 17

NAT vs. HRO? NAT and HRO?

SLIDE 18

Why they conflict

• Both claim to be applicable to HC2-systems
• NAT claims "normal accidents" cannot be avoided, while HRO claims to provide an accident avoidance strategy
• NAT advocates claim that HRO is a theory of its own (HRT) → NAT vs. HRT
• HRO considers itself an organizational strategy for those systems → NAT and HRO
• Both theories assume different decision-making models (sensemaking vs. garbage can)

SLIDE 19

Why they don’t conflict

• Both say they apply to HC2-systems, but neither of them actually has a reproducible metric to categorize systems
• Both sides categorize systems solely based on their own subjective impression
• Their definitions of HC2-systems do not seem to match, so they seem to talk about two different things
• The practical consequences of both theories usually don't conflict

SLIDE 20

HRO and NAT

• NAT explains the impact of certain design factors on complex, unpredictable accidents
• There are likely some rare "normal" accidents, and they do depend on the coupling and the complexity of the interactions
• HRO offers a promising culture for operating risky systems, but not a way to prevent these "normal" accidents (which is not contrary to its original claim)
• Forcing multidimensional properties into a four-fold table with a non-reproducible metric causes categorization problems, which make both sides think they are talking about the same thing
• The more properties of tight coupling and complex interaction a system has, the more likely normal accidents become

SLIDE 21

What’s next

1. Motivation for a science based on accidents
2. Normal Accident Theory and High Reliability Theory
   • Normal Accident Theory
   • High Reliability Organization
   • NAT and/vs. HRO?
3. Accidents and Computer Systems
4. Conclusion

SLIDE 22

Computers in Classical Systems

• Using computers in a system will add interactive complexity, but will likely not increase coupling
• Computers are often black boxes to system operators, making sensitive operation of the overall system harder
• It is to be expected that computers will make "normal" accidents more likely and HRO harder to apply

SLIDE 23

NAT in Computer Systems

• NAT should be useful during system design, because it explains which criteria increase the likelihood of unpredictable failure
• Computers as they are now are not tightly coupled; e.g. execution order is not guaranteed at any level unless explicitly synchronized
• Computers do have a certain level of complexity, but are not on par with nuclear power plants
• Many trends in computer science may increase the risk of "normal accidents", e.g. managed services and cloud computing (→ common mode!)
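The loose-coupling point can be sketched in a few lines of Python (a hypothetical illustration, not from the talk; `threading.Event` from the standard library provides the explicit synchronization): without it, two threads may run their steps in either order; with it, step B is explicitly coupled to step A's completion.

```python
import threading

def run_coupled():
    """Force a deterministic order between two threads via explicit sync.

    Without the Event, the scheduler may interleave the two steps in
    either order - execution order is not guaranteed by itself.
    """
    log = []
    a_done = threading.Event()

    def step_a():
        log.append("A")
        a_done.set()        # signal that A has finished

    def step_b():
        a_done.wait()       # explicit synchronization: block until A is done
        log.append("B")

    tb = threading.Thread(target=step_b)
    ta = threading.Thread(target=step_a)
    tb.start()              # B is even started first on purpose ...
    ta.start()
    tb.join()
    ta.join()
    return log              # ... yet the order is always ["A", "B"]

print(run_coupled())        # -> ['A', 'B']
```

Removing the `a_done.wait()` call makes the result nondeterministic, which is exactly the kind of loose coupling meant above.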

SLIDE 24

HRO in Computer Systems

• HRO culture is not applicable to software itself, but it is to administration
• Systems with distributed organization are a weak spot of HRO, because HRO culture is imposed by centralized leadership
• Thus HRO is not useful for the Internet as a whole, but may be for providers of centralized services
• As more critical processes embrace computer technology, there may very well be an even bigger market for highly reliable services

SLIDE 25

We’re done! - almost

1. Motivation for a science based on accidents
2. Normal Accident Theory and High Reliability Theory
   • Normal Accident Theory
   • High Reliability Organization
   • NAT and/vs. HRO?
3. Accidents and Computer Systems
4. Conclusion

SLIDE 26

Conclusion

• Some accidents may be impossible to prevent; NAT explains why, and we can conclude how to reduce their probability
• A lot of accidents can be avoided; HRO offers an organizational model to achieve that
• Both theories apply to computer systems with some limitations, e.g. decentralization
• Both theories can contribute to a more complete approach to accidents

SLIDE 27

Questions?

SLIDE 28

Full List of Interactive Complexity Criteria

• Local proximity
• Common-mode connections
• Interconnected subsystems
• Limited substitution of materials
• Unknown/unfamiliar feedback loops
• Multiple and interacting controls
• Indirect information sources
• Limited understanding of processes

SLIDE 29

Full List of Coupling Criteria

• Processing cannot be delayed
• Fixed order of sequences
• Only a single method leads to success
• Little slack in resources
• Buffers and redundancy have to be present by design
• Substitution of resources has to be designed in

SLIDE 30

Decision Making in NAT and HRO

• NAT uses a Garbage Can Model (GCM), i.e. it assumes decisions to be based on stochastic processes involving changing factors (streams), such as policy, politics, and problems
• HRO, on the other hand, assumes the involved actors to consciously make decisions based on personal and organizational values (sensemaking)
• Neither side can fully convince with the justification for its decision model
• If NAT were right and a GCM applied, that might weaken the impact of HRO, but certainly not erase it, since it would probably still be reflected heavily in the streams
