Event Analysis and Trends (PowerPoint presentation by James Merlo, Associate Director of Human Performance)

SLIDE 1

Event Analysis and Trends

James Merlo, Associate Director of Human Performance
Reliability Issues Steering Committee
January 24, 2013

SLIDE 2

Topics

  • Events Analysis (EA) Process
  • Update on Reported Events to Date
  • Cause Coding Process
  • Initial Trends and Clusters
  • Preliminary Analysis of Energy Management System (EMS) failures
  • Questions

RELIABILITY | ACCOUNTABILITY

SLIDE 3

Event Analysis Program

  • Events in 2012
      • (Cat 1 ‐ Cat 5) Events = 111
      • Cat 1 = 73 events
      • Cat 2 = 33 events
      • Cat 3 = 3 events
      • Cat 4 and 5 = 1 each
  • Total Events (October 2010 – December 2012)
      • (Cat 1 ‐ Cat 5) Events = 255
      • Events closed = 236 (92.5 percent)
      • Closed events Cause Coded = 202 (85.6 percent)
  • 2013 cause coding collaboratively with Regions
  • EA report quality improving
  • Providing analysis to NERC Committees
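
The percentages on this slide follow directly from the counts. A minimal Python sketch of that arithmetic is shown here; the variable and function names are illustrative and are not part of the NERC process.

```python
# Counts as stated on the slide (October 2010 to December 2012).
total_events = 255
closed_events = 236
cause_coded_events = 202

# 2012 events by category, as listed on the slide.
events_2012 = {"Cat 1": 73, "Cat 2": 33, "Cat 3": 3, "Cat 4": 1, "Cat 5": 1}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of all reported events that are closed.
closed_pct = percent(closed_events, total_events)
# Share of *closed* events that have been cause coded.
coded_pct = percent(cause_coded_events, closed_events)
# The per-category counts should add up to the 2012 total.
total_2012 = sum(events_2012.values())

print(total_2012, closed_pct, coded_pct)
```

The cause-coded percentage is taken against the 236 closed events rather than all 255 events, which is what reproduces the 85.6 percent on the slide.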

SLIDE 4

Event Characteristics

  • There are 80 different Characteristics of Events that NERC tracks in 9 Major Categories:
  • Natural Events (lightning, hurricanes, etc.)
  • Entity Operations (Switching, Maintenance, etc.)
  • Controls/Communication (EMS, SCADA, ICCP Data, etc.)
  • Industry Alerts (public appeals, EEA, etc.)
  • System Tools (Load Management Tools, etc.)
  • Infrastructure Security (Vandalism, Sabotage, Theft, etc.)
  • Failed Equipment (Relays, Splice, Transformer, etc.)
  • System Conditions (Transmission, Generation, Loss of load, etc.)
  • Miscellaneous (Software, Vendors, Mis‐operations, etc.)

SLIDE 5

Generation and Transmission Events

[Venn diagram: Generation and Transmission events]

  • Events involving Generation: 165
  • Events involving Transmission: 221
  • Both Generation and Transmission (T & G): 133 events
  • Generation only: 32 events
  • Transmission only: 88 events
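
The 165 and 221 totals on this slide can be recovered from the three disjoint regions of the diagram. The sketch below checks that with simple set arithmetic; the variable names are illustrative, and matching 165 to generation and 221 to transmission is inferred from the sums rather than stated explicitly on the slide.

```python
# Disjoint regions of the Venn diagram, as stated on the slide.
both = 133               # events involving generation and transmission
generation_only = 32     # events involving just generation
transmission_only = 88   # events involving just transmission

# Each circle is its "only" region plus the overlap.
involving_generation = both + generation_only        # 165
involving_transmission = both + transmission_only    # 221

# Distinct events covered by the diagram (each region counted once).
distinct_events = both + generation_only + transmission_only

# Inclusion-exclusion gives the same count from the two circle totals.
assert distinct_events == involving_generation + involving_transmission - both

print(involving_generation, involving_transmission, distinct_events)
```

Adding the two circle totals directly would count the 133 overlapping events twice, which is why the overlap is subtracted once.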

SLIDE 6

Mis-Operations and Transmission Events

[Pie chart: Transmission Loss with Mis‐Ops versus Transmission Loss without Mis‐Ops; slices of 88 events (40%) and 133 events (60%)]

SLIDE 7

Root Cause Determinations

[Chart: root cause determinations; recoverable label: A2 ‐ Equipment/Material Problem]

SLIDE 8

Deeper Dive into Organizational Issues (Based on Root Cause)

A4 – Management Challenges

  • B3C08 ‐ job scoping did not identify special circumstances or conditions
  • B5C04 ‐ risks/consequences associated with change not adequately reviewed
  • B1C03 ‐ direction created insufficient awareness of impact of actions on safety/reliability
  • B1C04 ‐ follow‐up did not identify problems
  • B1C05 ‐ assessment did not determine cause of previous event or known problem

SLIDE 9

All Causes for Management Issues/Challenges

A4 – Management Challenges

[Bar chart; vertical axis ticks from 2 to 16]

  • B1C05 ‐ assessment did not determine cause of previous event or known problem
  • B3C08 ‐ job scoping did not identify special circumstances or conditions
  • B5C03 ‐ inadequate vendor support of change
  • B5C04 ‐ risks/consequences associated with change not adequately reviewed
  • B1C08 ‐ corrective action responses to a known or repetitive problem were untimely
  • B5C05 ‐ system interactions not considered
  • B1C04 ‐ follow‐up did not identify problems

SLIDE 10

All Causes for Management Issues/Challenges

A4 – Management Challenges

[Bar chart; vertical axis ticks from 2 to 16; horizontal-axis labels: A4B1C05, A4B3C08, A4B5C03, A4B5C04, A4B1C08, A4B5C05, A4B1C04]

  • B1C05 ‐ assessment did not determine cause of previous event or known problem
  • B3C08 ‐ job scoping did not identify special circumstances or conditions
  • B5C03 ‐ inadequate vendor support of change
  • B5C04 ‐ risks/consequences associated with change not adequately reviewed
  • B1C08 ‐ corrective action responses to a known or repetitive problem were untimely
  • B5C05 ‐ system interactions not considered
  • B1C04 ‐ follow‐up did not identify problems
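
The cause codes on these slides (for example A4B1C05) appear to be built from an A part, an optional B part, and an optional C part. The sketch below splits a code into those parts and groups the Management Challenges codes by their B part. The pattern, the `CauseCode` type, and the function name are assumptions made for illustration; they are not taken from the NERC cause coding documentation.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional


class CauseCode(NamedTuple):
    """Hypothetical three-part view of a code such as 'A4B1C05'."""
    a: str
    b: Optional[str]
    c: Optional[str]


# Assumed shape: 'A' + one character, then optionally 'B' + digits,
# then optionally 'C' + digits (e.g. 'AZ', 'A4B1', 'A4B1C05').
_PATTERN = re.compile(r"(A[0-9A-Z])(?:(B\d+)(?:(C\d+))?)?")


def parse_cause_code(code: str) -> CauseCode:
    """Split a cause code into its A, B and C parts."""
    match = _PATTERN.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"unrecognized cause code: {code!r}")
    return CauseCode(*match.groups())


# Codes shown on the horizontal axis of the slide.
codes = ["A4B1C05", "A4B3C08", "A4B5C03", "A4B5C04",
         "A4B1C08", "A4B5C05", "A4B1C04"]

# Group the C parts under their B part to see where causes cluster.
by_b = defaultdict(list)
for code in codes:
    parsed = parse_cause_code(code)
    by_b[parsed.b].append(parsed.c)

print(dict(by_b))
```

Grouping this way shows that the seven codes on the slide fall under only three B parts (B1, B3 and B5), each with three, one and three entries respectively.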

SLIDE 11

Summary

  • Less than adequate Job Scoping is a threat to reliability
  • Less than adequate EA can result in real threats not being remediated
  • Event Analysis process has potential to provide high quality reliability information
  • QA/QC of submitted EA reports

SLIDE 12

Preliminary Analysis of EMS Outages

General:

  • Category 2b Event
  • 46 events (October 26, 2010 – June 27, 2012)
  • 35 entities reporting with nine (9) entities experiencing multiple outages
  • Complete outages range: 32 to 253 minutes
  • Partial outages range: 23 to 242 minutes

SLIDE 13

Analysis of Outage Times

[Chart: outage times in minutes; horizontal axis numbered 1 to 43; vertical axis ticks from 30 to 300. Series: Full EMS Outage, Partial EMS Outage, Mean Full Outage Time, Mean Partial Outage Time, Mean Total Outage Time]

  • Mean Full EMS Outage: 59 Minutes
  • Mean Partial EMS Outage: 36 Minutes
  • Mean Total EMS Outage: 95 Minutes
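
The three means on this slide are consistent with the total outage time for each event being its full outage time plus its partial outage time (59 + 36 = 95). That reading is an assumption, and the per-event numbers in this sketch are hypothetical placeholders, not the NERC data.

```python
import math
from statistics import mean

# Hypothetical (full_minutes, partial_minutes) pairs, one per event.
# These values are made up for illustration only.
events = [(40, 0), (0, 25), (120, 30), (60, 90)]

full = [f for f, _ in events]
partial = [p for _, p in events]
# Assumed definition: total outage = full outage + partial outage.
total = [f + p for f, p in events]

mean_full = mean(full)
mean_partial = mean(partial)
mean_total = mean(total)

# Because the mean is linear, the mean of the per-event sums equals
# the sum of the two means whenever all events are averaged together.
assert math.isclose(mean_total, mean_full + mean_partial)

print(mean_full, mean_partial, mean_total)
```

Under this assumption an event with no full outage contributes zero to the full-outage mean, which would explain a mean that sits close to the lower end of the reported range for complete outages.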

SLIDE 14

Outage Times by Date

October 2010 – June 2012

[Chart: outage time in minutes, with year markers 2010, 2011 and 2012; horizontal axis numbered 1 to 43; vertical axis ticks from 30 to 300. Series: Full EMS Outage, Partial EMS Outage]

SLIDE 15

Root Cause

Top Root Causes

[Bar chart; vertical axis ticks from 1 to 8]

  • A2B6C07 – Software Failure
  • A1B4C02 – Testing of Design/Installation LTA
  • AZ – Information to determine cause LTA
  • A4B5C04 – Inadequate risk assessment of change
  • A2B3C03 – Post modification testing LTA
  • A4B3C08 – Insufficient Job scoping

SLIDE 16

Contributing Causes

Top Contributing Causes

[Bar chart: contributing causes by cause code; vertical axis ticks from 5 to 25]

  • A2B6C07 – Software Failure
  • A1B2C01 – Design output scope LTA
  • A4B5C03 – Inadequate vendor support of change
  • A1B4C02 – Testing of Design/Installation LTA
  • A2B6C01 – Defective or failed part
  • A4B5C05 – System Interactions not considered
  • A4B5C04 – Inadequate risk assessment of change
  • A2B3C02 – Inspection/Testing LTA
  • A2B3C03 – Post Modification Testing LTA
  • A3B3C01 – Attention given to wrong issues
  • A4B1C08 – Untimely corrective actions to known issue
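
A ranking like the one on this slide can be produced by counting how often each contributing-cause code is assigned across events. The sketch below shows one way to do that tally. The event lists are hypothetical and only reuse code names from the slide; they are not the actual cause assignments.

```python
from collections import Counter

# Hypothetical input: the contributing-cause codes assigned to each event.
contributing_causes_by_event = [
    ["A2B6C07", "A1B2C01"],
    ["A2B6C07", "A4B5C03", "A1B4C02"],
    ["A2B6C07", "A1B2C01", "A4B5C05"],
]

# Count every code occurrence across all events.
counts = Counter(
    code for codes in contributing_causes_by_event for code in codes
)

# Most frequent codes first; ties keep first-seen order.
ranked = counts.most_common()

for code, n in ranked:
    print(f"{code}: {n}")
```

An event can carry several contributing causes, so the counts can add up to more than the number of events.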

SLIDE 17

Common Themes

  • 1. Software Failures
  • 2. Software Configuration/Installation/Maintenance
  • 3. Hardware Failures
  • 4. Hardware Configuration/Installation/Maintenance
  • 5. Failover Testing Weaknesses
  • 6. Testing Inadequacies
  • 7. Less than Adequate Situational Awareness

SLIDE 18

Going Forward

Actions to date:

  • 1. Preventable EMS and SCADA Events Alert – April 10, 2012
  • 2. Brief Event Analysis Subcommittee Leadership ‐ December 3, 2012
  • 3. Brief EAS at Operating Committee (OC) Meeting ‐ December 11, 2012
  • 4. Collaboration with EAS EMS Task Force – January 10, 2013

Next steps:

  • 1. More focused analysis
  • 2. Quantifying Risk of EMS outages
  • 3. Update to EAS, OC and Reliability Issues Steering Committee
  • 4. Develop interventions or remediation strategies