An Overview of Human Error
Drawn from J. Reason, Human Error, Cambridge, 1990
Aaron Brown
CS 294-4 ROC Seminar
Slide 2
Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error
Slide 3
Dependability and human error
- Industry data shows that human error is the largest contributor to reduced dependability
– HP HA labs: human error is #1 cause of failures (2001)
– Oracle: half of DB failures due to human error (1999)
– Gray/Tandem: 42% of failures from human administrator errors (1986)
– Murphy/Gent study of VAX systems (1993):
[Chart: causes of system crashes over time, 1985–1993 (% of system crashes): system management 53%, software failure 18%, hardware failure 18%, other 10%]
Slide 4
Learning from other fields: PSTN
- FCC-collected data on outages in the US public-switched telephone network
– metric: breakdown of customer calls blocked by system outages (excluding natural disasters), Jan–June 2001
[Pie chart: blocked calls by cause — human-company and human-external (56% combined), hardware failure, software failure, overload, vandalism; individual slices 47%, 22%, 17%, 9%, 5%]
Human error accounts for 56% of all blocked calls
– comparison with 1992–94 data shows that human error is the only factor that is not improving over time
Slide 5
Learning from other fields: PSTN
- PSTN trends: 1992–94 vs. 2001 (millions of customer minutes/month)

Cause                   1992–94   2001
Overload                  314       60
Vandalism                   5        3
Software                   15       12
Hardware                   49       49
Human error: external     100       75
Human error: company       98      176
Slide 6
Learning from experiments
- Human error rates during maintenance of software RAID system
– participants attempt to repair RAID disk failures
» by replacing broken disk and reconstructing data
– each participant repeated task several times
– data aggregated across 5 participants
Error types tracked (per OS: Linux / Solaris / Windows):
- User Error – User Recovered
- User Error – Intervention Required
- System ignored fatal input
- Unsuccessful Repair
- Fatal Data Loss
Total number of trials: Linux 31, Solaris 33, Windows 35
[per-error-type counts not recoverable]
Slide 7
Learning from experiments
- Errors occur despite experience:
[Chart: number of errors vs. iteration (1–9) for Windows, Solaris, and Linux]
- Training and familiarity don't eliminate errors
– types of errors change: mistakes vs. slips/lapses
- System design affects error-susceptibility
Slide 8
Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error
Slide 9
A theory of human error
(distilled from J. Reason, Human Error, 1990)
- Preliminaries: the three stages of cognitive processing for tasks
1) planning
» a goal is identified and a sequence of actions is selected to reach the goal
2) storage
» the selected plan is stored in memory until it is appropriate to carry it out
3) execution
» the plan is implemented by the process of carrying out the actions specified by the plan
Slide 10
A theory of human error (2)
- Each cognitive stage has an associated form of error
– slips: execution stage
» incorrect execution of a planned action
» example: miskeyed command
– lapses: storage stage
» incorrect omission of a stored, planned action
» examples: skipping a step on a checklist, forgetting to restore normal valve settings after maintenance
– mistakes: planning stage
» the plan is not suitable for achieving the desired goal
» example: TMI operators prematurely disabling HPI pumps
Slide 11
Origins of error: the GEMS model
- GEMS: Generic Error-Modeling System
– an attempt to understand the origins of human error
- GEMS identifies three levels of cognitive task processing
– skill-based: familiar, automatic procedural tasks
» usually low-level, like knowing to type "ls" to list files
– rule-based: tasks approached by pattern-matching from a set of internal problem-solving rules
» "observed symptoms X mean system is in state Y"
» "if system state is Y, I should probably do Z to fix it"
– knowledge-based: tasks approached by reasoning from first principles
» when rules and experience don't apply
Slide 12
GEMS and errors
- Errors can occur at each level
– skill-based: slips and lapses
» usually errors of inattention or misplaced attention
– rule-based: mistakes
» usually a result of picking an inappropriate rule
» caused by misconstrued view of state, over-zealous pattern matching, frequency gambling, deficient rules
– knowledge-based: mistakes
» due to incomplete/inaccurate understanding of system, confirmation bias, overconfidence, cognitive strain, ...
- Errors can result from operating at wrong level
– humans are reluctant to move from RB to KB level even if rules aren't working
Slide 13
Error frequencies
- In raw frequencies, SB >> RB > KB
– 61% of errors are at skill-based level
– 27% of errors are at rule-based level
– 11% of errors are at knowledge-based level
- But if we look at opportunities for error, the order reverses
– humans perform vastly more SB tasks than RB, and vastly more RB than KB
» so a given KB task is more likely to result in error than a given RB or SB task
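The reversal can be illustrated with a toy calculation. The 61/27/11% error shares come from the slide; the daily task counts are invented assumptions purely for illustration:

```python
# Toy illustration of the raw-vs-per-opportunity reversal.
# error_share values are from the slide; tasks_per_day is assumed.
error_share = {"SB": 0.61, "RB": 0.27, "KB": 0.11}
tasks_per_day = {"SB": 10_000, "RB": 500, "KB": 20}  # assumed counts

total_errors = 100  # assume 100 observed errors in the sample period
per_opportunity = {
    level: error_share[level] * total_errors / tasks_per_day[level]
    for level in error_share
}
for level, rate in sorted(per_opportunity.items(), key=lambda kv: kv[1]):
    print(f"{level}: {rate:.4f} errors per task")
# KB ends up with the highest per-task rate despite the fewest raw errors
```

Any plausible counts with SB >> RB >> KB opportunities produce the same reversal.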
Slide 14
Error detection and correction
- Basic detection mechanism is self-monitoring
– periodic attentional checks, measurement of progress toward goal, discovery of surprise inconsistencies, ...
- Effectiveness of self-detection of errors
– SB errors: 75–95% detected, avg 86%
» but some lapse-type errors were resistant to detection
– RB errors: 50–90% detected, avg 73%
– KB errors: 50–80% detected, avg 70%
- Including correction tells a different story:
– SB: ~70% of all errors detected and corrected
– RB: ~50% detected and corrected
– KB: ~25% detected and corrected
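Combining the averages above shows where the levels really diverge: detection rates are similar across levels, but correction given detection collapses at the KB level. A quick calculation using only the slide's numbers:

```python
# Combine the slide's averages: P(corrected) = P(detected) * P(corrected | detected),
# so P(corrected | detected) = P(corrected) / P(detected).
detected = {"SB": 0.86, "RB": 0.73, "KB": 0.70}
corrected = {"SB": 0.70, "RB": 0.50, "KB": 0.25}

for level in ("SB", "RB", "KB"):
    given_detected = corrected[level] / detected[level]
    print(f"{level}: P(corrected | detected) ~ {given_detected:.0%}")
# → SB ~81%, RB ~68%, KB ~36%
```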
Slide 15
Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error
Slide 16
Human error and accident theory
- Major systems accidents (“normal accidents”)
start with an accumulation of latent errors
– most of those latent errors are human errors
» latent slips/lapses, particularly in maintenance
- example: misconfigured valves in TMI
» latent mistakes in system design, organization, and planning, particularly of emergency procedures
- example: flowcharts that omit unforeseen paths
– invisible latent errors change system reality without altering operator's models
» seemingly-correct actions can then trigger accidents
Slide 17
Accident theory (2)
- Accidents are exacerbated by human errors made during operator response
– RB errors made due to lack of experience with system in failure states
» training is rarely sufficient to develop a rule base that captures system response outside of normal bounds
– KB reasoning is hindered by system complexity and cognitive strain
» system complexity prohibits mental modeling
» stress of an emergency encourages RB approaches and diminishes KB effectiveness
– system visibility limited by automation and "defense in depth"
» results in improper rule choices and KB reasoning
Slide 18
Outline
- Human error and computer system failures
- A theory of human error
- Human error and accident theory
- Addressing human error
– general guidelines
– the ROC approach: system-level undo
Slide 19
Addressing human error
- Challenges
– humans are inherently fallible and errors are inevitable
– hard-to-detect latent errors can be more troublesome than front-line errors
– human psychology must not be ignored
» especially the SB/RB/KB distinction and human behavior at each level
- General approach: error-tolerance rather than error-avoidance
"It is now widely held among human reliability specialists that the most productive strategy for dealing with active errors is to focus upon controlling their consequences rather than upon striving for their elimination." (Reason, p. 246)
Slide 20
The Automation Irony
- Automation is not the cure for human error
– automation addresses the easy SB/RB tasks, leaving the complex KB tasks for the human
» humans are ill-suited to KB tasks, especially under stress
– automation hinders understanding and mental modeling
» decreases system visibility and increases complexity
» operators don't get hands-on control experience
» rule-set for RB tasks and models for KB tasks are weak
– automation shifts the error source from operator errors to design errors
» harder to detect/tolerate/fix design errors
Slide 21
Building robustness to human error
- Discover and correct latent errors
– must overcome human nature to wait until emergency to respond
- Increase system visibility
– don't hide complexity behind automated mechanisms
- Take errors into account in operator training
– include error scenarios
– promote exploratory trial & error approaches
– emphasize positive side of errors: learning from mistakes
Slide 22
Building robustness to human error
- Reduce opportunities for error (Don Norman):
– get good conceptual model to user by consistent design
– design tasks to match human limits: working memory, problem-solving abilities
– make visible what the options are, and what the consequences of actions are
– exploit natural mappings: between intentions and possible actions, actual state and what is perceived, ...
– use constraints to guide user to next action/decision
– design for errors: assume their occurrence, plan for error recovery, make it easy to reverse actions and hard to perform irreversible ones
– when all else fails, standardize: ease of use is more important, so standardize only as a last resort
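Norman's "design for errors" guideline can be sketched in code. The `SafeDelete` class below is a hypothetical illustration (not from the slides) of making a destructive action reversible by staging it instead of applying it immediately:

```python
# Sketch of "design for errors": stage destructive actions so a slip
# (deleting the wrong file) can be undone. Class and method names are
# invented for illustration.
import shutil
import tempfile
from pathlib import Path

class SafeDelete:
    """Move files to a trash directory instead of unlinking them."""

    def __init__(self):
        self.trash = Path(tempfile.mkdtemp(prefix="trash-"))
        self.history = []  # list of (original path, trashed path)

    def delete(self, path):
        path = Path(path)
        target = self.trash / f"{len(self.history)}-{path.name}"
        shutil.move(str(path), str(target))  # reversible move, not unlink
        self.history.append((path, target))

    def undo(self):
        path, target = self.history.pop()    # most recent deletion
        shutil.move(str(target), str(path))
        return path
```

The irreversible step (emptying the trash) would be a separate, deliberate action, matching the guideline "make it easy to reverse actions and hard to perform irreversible ones."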
Slide 23
Building robustness to human error
- Acknowledge human behavior in system design:
– interfaces should allow user to explore via experimentation
– to help at KB level, provide tools to do experiments/test hypotheses without having to do them on a high-risk irreversible plant, or make system state always reversible
– provide feedback to increase error observability (RB level)
– at RB level, provide symbolic cues and confidence measures
– for RB, try to give more elaborate, integrated cues to avoid "strong-but-wrong" RB errors
– provide overview displays at edge of periphery to avoid attentional capture at SB level
– simultaneously present data in forms useful for SB/RB/KB
– provide external memory aids to help at KB level, including externalized representation of different options/schemas
Slide 24
Human error: the ROC approach
- ROC is focusing on system-level techniques for human error tolerance
– complementary to UI innovations
- Goal: provide forgiving operator environment
– expect human error and tolerate it
– allow operator to experiment safely, test hypotheses
– make it possible to detect and fix latent errors
- Approach: undo f or system administration
Slide 25
Repairing the Past with Undo
- The Three R's: undo meets time travel
– Rewind: roll system state backwards in time
– Repair: fix latent or active error
» automatically or via human intervention
– Redo: roll system state forward, replaying user interactions lost during rewind
- This is not your ordinary word-processor undo!
– allows sysadmin to go back in time to fix latent errors after they're manifested
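A toy sketch of the Three R's on a key-value "system state" (illustrative only; the actual ROC work operates on whole-system state, not a dictionary):

```python
# Minimal Rewind/Repair/Redo sketch: user interactions are logged so
# state can be rolled back, fixed, and replayed. Names are invented.

class ThreeRStore:
    def __init__(self):
        self.state = {}
        self.log = []  # every user interaction, in order

    def apply(self, key, value):
        self.log.append((key, value))
        self.state[key] = value

    def rewind(self, n):
        """Roll back the last n interactions; return them for redo."""
        undone = self.log[-n:]
        self.log = self.log[:-n]
        self.state = {}
        for key, value in self.log:  # rebuild state from remaining log
            self.state[key] = value
        return undone

    def redo(self, interactions, repair=lambda op: op):
        """Replay interactions, letting a repair function fix or drop each."""
        for op in interactions:
            fixed = repair(op)
            if fixed is not None:    # a repair may drop a bad interaction
                self.apply(*fixed)
```

For example, rewinding past a bad write, dropping it during repair, and redoing the rest preserves the later, legitimate interactions — the property that distinguishes the 3 R's from plain checkpoint rollback.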
Slide 26
Undo details
- Examples where Undo would help:
– reverse the effects of a mistyped command (rm -rf *)
– roll back a software upgrade without losing user data
– retroactively install virus filter on email server; effects of virus are squashed on redo
- The 3 R's vs. checkpointing, reboot, logging
– checkpointing gives Rewind only
– reboot may give Repair, but only for "Heisenbugs"
– logging can give all 3 R's
» but need more than RDBMS logging, since system state changes are interdependent and non-transactional
» 3R-logging requires careful dependency tracking, and attention to state granularity and externalized events
Slide 27
Summary
- Humans are critical to system dependability
– human error is the single largest cause of failures
- Human error is inescapable: "to err is human"
– yet we blame the operator instead of fixing systems
- Human error comes in many forms
– mistakes, slips, lapses at KB/RB/SB levels of operation
– but is nearly always detectable
- Best way to address human error is tolerance