Incident Management: Making sure things go right when they inevitably go wrong


Gareth Eason, HEAnet, for TF-NOC, Zürich, 2011-06-29

SLIDE 1

Incident Management

“Making sure things go right when they inevitably go wrong.”

Gareth Eason, HEAnet

for TF-NOC, Zürich, 2011-06-29

SLIDE 2

Agenda

HEAnet background: What do we do?
Why manage incidents?
How does HEAnet manage incidents?
Implementation of a new incident management system

Lessons learned

SLIDE 3

Who are HEAnet?

HEAnet is Ireland's research and education network (NREN)
Set up in 1983 as a collaborative body by the seven Irish universities and the Higher Education Authority
Became a non-profit, limited company in 1997
Approximately 50 staff today

SLIDE 4

Network members

[Figure: network members; surviving label: schools]

SLIDE 5

Affiliations & Representations

National

IBEC – TIF / Telecoms Internet Federation
INEX / Internet Neutral Exchange
ISPAI / Internet Service Provider Association of Ireland

International

EU funded Framework Projects
RIPE Network Co-ordination Centre (NCC)
DANTE / TERENA (37 countries)
GÉANT / NREN Consortium Policy Committee
JANET (UK) and JANET-CERT
MoU with Internet2 / NGI

SLIDE 6

What do we do?

Provide high quality Internet services to our members
Enable research and learning through leading edge shared services
Act as a representative body for the ICT education & research community

Facilitate innovation and collaboration
Ensure value for money

SLIDE 7

Network Trends 1991-

SLIDE 8

Milestones

2008: First 10 Gbit/s Client Connections
2009: Resilience, Wireless Strategy
2010: Schools 100 Mbit/s Connections
2011 - 2013: Data Storage, National Data Centre, Next Generation Network, Cloud Computing, Wireless

SLIDE 9

What is an incident?

An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
Typically, something has gone wrong
Sources:
– Automated alerts
– Customers
– NOC observations
– Suppliers
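The definition and the four sources above could be captured in a small record type. This is a minimal sketch only: the class and field names (`Incident`, `Source`, `service`, `summary`) are hypothetical and do not come from HEAnet's system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Source(Enum):
    """The four incident sources listed on the slide."""
    AUTOMATED_ALERT = "automated alert"
    CUSTOMER = "customer"
    NOC_OBSERVATION = "NOC observation"
    SUPPLIER = "supplier"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    service: str            # the affected IT service
    summary: str            # what went wrong
    source: Source          # who or what reported it
    opened_at: datetime = field(default_factory=_now)
    resolved_at: Optional[datetime] = None

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None

    def resolve(self, when: Optional[datetime] = None) -> None:
        """Mark the incident resolved, defaulting to the current time."""
        self.resolved_at = when or _now()
```

Recording the source on every incident is what later makes it possible to report on how many incidents were caught by monitoring before a customer noticed.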

SLIDE 10

Why manage incidents?

Top 3 reasons to manage incidents:

1. Keep customers happy
2. Keep customers happy
3. Keep customers happy

Distant 4th reason:

4. Continuous Service Improvement

SLIDE 11

Why manage incidents?

“You can't manage what you don't measure”

SLIDE 12

Why manage incidents?

“You can't manage what you don't measure”

Measure, manage and continually improve service

SLIDE 13

How does HEAnet manage?

Fundamentally process driven
Supported by tools
Managed by NOC staff
People are the most critical

[Diagram labels: Tools, Process, NOC personnel]

SLIDE 14

Implementation

Good people
– Experienced and know what they are doing
Good processes
– Tried, tested and continually improved
Poor tool support
– Custom; built for a need 7 years ago
– No support
– Inflexible; not practical to extend
– Not all incidents captured

SLIDE 15

A new tool

Evaluate available tools

– Remedy, OTRS, RT, ...

Propose replacement tool
Map existing processes to new tool
Amend tool / processes to match
Plan migration to new tool
Decommission old tool

SLIDE 16

Requirements

No external facing change
Federated auth, with bypass
Integration with existing datasets
Integration with monitoring systems
Standalone capable
Resilient
DR plan (#2 item for reinstatement)
Scalable, supportable, maintainable

SLIDE 17

Requirements

Automation & Aggregation
– Automate what we can
– Facilitate everything else
Ensure clear, well understood, robust procedures are
– in place and
– will be followed / enabled
Leverage upgrades in core RT
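One way to read "Automation & Aggregation" is that repeated alerts about the same circuit should end up in a single incident rather than many tickets. The sketch below illustrates that idea under stated assumptions: the `(timestamp, circuit, message)` tuple shape, the function name `aggregate_alerts` and the 10-minute window are all hypothetical and are not taken from the slides.

```python
from datetime import datetime, timedelta


def aggregate_alerts(alerts, window=timedelta(minutes=10)):
    """Group (timestamp, circuit, message) alerts into incidents.

    An alert joins the most recent incident for its circuit if it arrives
    within `window` of that incident's last alert; otherwise it opens a
    new incident. The window length is an illustrative assumption.
    """
    incidents = []
    latest = {}  # circuit -> most recent incident for that circuit
    for ts, circuit, message in sorted(alerts):
        current = latest.get(circuit)
        if current is not None and ts - current["last_seen"] <= window:
            current["messages"].append(message)
            current["last_seen"] = ts
        else:
            current = {
                "circuit": circuit,
                "first_seen": ts,
                "last_seen": ts,
                "messages": [message],
            }
            incidents.append(current)
            latest[circuit] = current
    return incidents
```

Anything the grouping rule cannot decide automatically would still go to NOC staff, in line with "facilitate everything else".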

SLIDE 18

Design

Two separate data centres
API for integration

[Diagram labels: two stacks, each with DB, Middleware, API and UI; Failover Management]

SLIDE 19

Design

[Diagram labels: DB, Middleware, API, UI, Failover, RT Ticketing; API link to Client info]

SLIDE 20

Design

[Diagram labels: DB, Middleware, API, UI, Failover, RT Ticketing; API links to Client info, Supplier info, and Service & Circuit info]

SLIDE 21

Design

[Diagram labels: DB, Middleware, API, UI, Failover, RT Ticketing, e-mail, alerting; API links to Client info, Supplier info, and Service & Circuit info]

SLIDE 22

Design

[Diagram labels: DB, Middleware, API, UI, Failover, RT Ticketing, RT Cache, e-mail, alerting; API links to Client info, Supplier info, and Service & Circuit info]
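The slides do not say how failover between the two data centres is managed. As one possible illustration, a caller could try the primary site first and fall back to the secondary when the connection fails. Everything here is an assumption for the sake of the sketch: the function name `call_with_failover`, the `(name, handler)` pairs and the use of `ConnectionError` as the failure signal.

```python
def call_with_failover(backends, request):
    """Try each (name, handler) backend in order.

    Returns (name, result) from the first backend that answers. A backend
    signals that it is unreachable by raising ConnectionError; any other
    exception is treated as a real error and propagates to the caller.
    """
    errors = []
    for name, handler in backends:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all backends failed: " + "; ".join(errors))


# Usage example with stand-in handlers instead of real network calls.
def primary(request):
    raise ConnectionError("primary data centre unreachable")


def secondary(request):
    return {"ticket": request, "served_by": "secondary"}


site, reply = call_with_failover(
    [("primary", primary), ("secondary", secondary)], "open-ticket"
)
```

Passing the handlers in as plain callables keeps the sketch testable without a network, which also fits the later advice to break things deliberately and test the DR plan.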

SLIDE 23

Buy in

Management buy-in
– Reporting
– Better customer service
NOC buy-in
– Easier to track incidents
– Better integration makes life easier
Client buy-in
– Looks the same, but better service

SLIDE 24

Buy in

NOC involved from day #1
Suggestions tracked
– FogBugz
3-month migration from old to new
– 5th April 2011 (go-live)
– 1st July 2011 (turn off mousetrap)

SLIDE 25

Continuous improvement

E-mail filters
RT interface
– Agile methodology
– Multiple releases since 5th April
AssetDB launched 28th June 2011
– Plan for integration

SLIDE 26

Platform

[Diagram labels: Development, Staging, Production; s/w dev team; Sysadmin & NOC; Primary; Secondary / Failover]

SLIDE 27

Outcomes

[Chart: Network Operation Centre tickets opened, by quarter (Q1 to Q4), 2006 to 2011; vertical axis 500 to 4500]

Much better issue tracking
More tickets
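A per-quarter ticket count like the one charted here can be derived from ticket opening dates alone. The following is a minimal sketch, assuming the dates have already been exported from the ticketing system; the function name `tickets_per_quarter` is hypothetical and the sample dates are made up for illustration, not HEAnet data.

```python
from collections import Counter
from datetime import date


def tickets_per_quarter(opened_dates):
    """Count tickets per (year, quarter), sorted chronologically."""
    counts = Counter()
    for opened in opened_dates:
        quarter = (opened.month - 1) // 3 + 1  # Jan-Mar -> 1, ..., Oct-Dec -> 4
        counts[(opened.year, quarter)] += 1
    return dict(sorted(counts.items()))


# Illustrative input only.
sample = [date(2011, 4, 5), date(2011, 6, 28), date(2011, 7, 1), date(2010, 12, 31)]
summary = tickets_per_quarter(sample)
```

A rising count after go-live is consistent with the slide's point: more tickets here reflects more incidents being captured, not necessarily more things going wrong.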

SLIDE 28

Outcomes

Much better reporting

SLIDE 29

Outcomes

Much better reporting

SLIDE 30

Outcomes

Much better reporting

SLIDE 31

Lessons learned

Good incident management => Good customer service

Good process is key
Tool must support the process
Integration is key
Automation is great
Reporting is vital

SLIDE 32

Lessons learned

Have a DR plan (Disaster Recovery)
Test it
Break stuff, and test it again
Test it some more
Test it again

How do you manage incidents if they break the tool?

SLIDE 33

Lessons Learned

Support the process
Integrate
Automate
Report
Leverage community development
Have a DR plan
Test, test, test some more!


