Incident Management: Making sure things go right when they inevitably go wrong



slide-1
SLIDE 1

Incident Management

“Making sure things go right when they inevitably go wrong.”

Gareth Eason, HEAnet

for TF-NOC, Zürich, 2011-06-29

slide-2
SLIDE 2

Agenda

  • HEAnet background: What do we do?
  • Why manage incidents?
  • How does HEAnet manage incidents?
  • Implementation of a new incident management system
  • Lessons learned
slide-3
SLIDE 3

Who are HEAnet?

  • HEAnet is Ireland's research and education network (NREN)
  • Set up in 1983 as a collaborative body by the seven Irish universities and the Higher Education Authority
  • Became a non-profit, limited company in 1997
  • Approximately 50 staff today
slide-4
SLIDE 4

Network members

  • 7 Universities & DIT
  • 13 Institutes of Technology
  • 16 3rd level colleges and VECs
  • 24 non-profit and research organisations
  • Government & Administrative bodies
  • In excess of 180,000 end users
  • 4,000 primary and post-primary schools

slide-5
SLIDE 5

Affiliations & Representations

National

  • IBEC – TIF/Telecoms Internet Federation
  • INEX / Internet Neutral Exchange
  • ISPAI / Internet Service Provider Association of Ireland

International

  • EU-funded Framework Projects
  • RIPE Network Co-ordination Centre (NCC)
  • DANTE/TERENA (37 countries)
  • GÉANT/NREN Consortium Policy Committee
  • JANET (UK) and JANET-CERT
  • MoU with Internet 2 / NGI

slide-6
SLIDE 6

What do we do?

  • Provide high-quality Internet services to our members
  • Enable research and learning through leading-edge shared services
  • Act as a representative body for the ICT education & research community
  • Facilitate innovation and collaboration
  • Ensure value for money
slide-7
SLIDE 7

Network Trends 1991-

slide-8
SLIDE 8

Milestones

  • 2008: First 10 Gbps client connections
  • 2009: Resilience, Wireless Strategy
  • 2010: Schools 100 Mbit/s connections
  • 2011 - 2013: Data Storage, National Data Centre, Next Generation Network, Cloud Computing, Wireless

slide-9
SLIDE 9

What is an incident?

  • An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.
  • Typically, something has gone wrong
  • Sources:
      – Automated alerts
      – Customers
      – NOC observations
      – Suppliers
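The definition and the four sources above could be captured as a simple record. This is a minimal sketch in Python, not HEAnet's data model: the names `Source` and `Incident`, and every field name, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Source(Enum):
    """Where an incident report came from (the four sources on the slide)."""
    AUTOMATED_ALERT = "automated alert"
    CUSTOMER = "customer"
    NOC_OBSERVATION = "NOC observation"
    SUPPLIER = "supplier"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Incident:
    """Hypothetical record for an unplanned interruption or loss of quality."""
    service: str                       # the affected IT service
    summary: str                       # short description of what went wrong
    source: Source                     # how the NOC learned about it
    opened_at: datetime = field(default_factory=_now)
    resolved_at: Optional[datetime] = None

    @property
    def is_open(self) -> bool:
        return self.resolved_at is None

    def resolve(self, when: Optional[datetime] = None) -> None:
        """Mark the incident resolved, defaulting to the current time."""
        self.resolved_at = when or _now()
```

Recording the source and both timestamps on every incident is what later makes it possible to report on where incidents come from and how long they stay open.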

slide-10
SLIDE 10

Why manage incidents?

Top 3 reasons to manage incidents:

  • 1. Keep customers happy
  • 2. Keep customers happy
  • 3. Keep customers happy

Distant 4th reason:

  • 4. Continuous Service Improvement
slide-11
SLIDE 11

Why manage incidents?

“You can't manage what you don't measure”

slide-12
SLIDE 12

Why manage incidents?

“You can't manage what you don't measure”

Measure, manage and continually improve service

slide-13
SLIDE 13

How does HEAnet manage?

  • Fundamentally process driven
  • Supported by tools
  • Managed by NOC staff
  • People are the most critical

[Diagram: Tools, Process, NOC personnel]

slide-14
SLIDE 14

Implementation

  • Good people
      – Experienced and know what they are doing
  • Good processes
      – Tried, tested and continually improved
  • Poor tool support
      – Custom; built for a need 7 years ago
      – No support
      – Inflexible; not practical to extend
      – Not all incidents captured

slide-15
SLIDE 15

A new tool

  • Evaluate available tools

– Remedy, OTRS, RT, ...

  • Propose replacement tool
  • Map existing processes to new tool
  • Amend tool / processes to match
  • Plan migration to new tool
  • Decommission old tool
slide-16
SLIDE 16

Requirements

  • No external facing change
  • Federated auth, with bypass
  • Integration with existing datasets
  • Integration with monitoring systems
  • Standalone capable
  • Resilient
  • DR plan (#2 item for reinstatement)
  • Scalable, supportable, maintainable
slide-17
SLIDE 17

Requirements

  • Automation & Aggregation
      – Automate what we can
      – Facilitate everything else
  • Ensure clear, well understood, robust procedures are
      – in place and
      – will be followed / enabled
  • Leverage upgrades in core RT
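One way to read "Automation & Aggregation" is that repeated alerts about the same service should end up on one ticket rather than many. The sketch below assumes that reading. The `Alert` and `Ticket` types, the `aggregate` function and the 10-minute window are all hypothetical, not details from the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


@dataclass
class Alert:
    service: str
    message: str
    at: datetime


@dataclass
class Ticket:
    service: str
    alerts: List[Alert] = field(default_factory=list)


def aggregate(alerts: Iterable[Alert],
              window: timedelta = timedelta(minutes=10)) -> List[Ticket]:
    """Group alerts for the same service onto one ticket when each alert
    arrives within `window` of the previous alert on that ticket."""
    tickets: List[Ticket] = []
    latest: Dict[str, Ticket] = {}  # service -> most recent ticket for it
    for alert in sorted(alerts, key=lambda a: a.at):
        ticket = latest.get(alert.service)
        if ticket is not None and alert.at - ticket.alerts[-1].at <= window:
            # Same service, still within the window: same incident.
            ticket.alerts.append(alert)
        else:
            # New service, or the previous incident has gone quiet.
            ticket = Ticket(service=alert.service, alerts=[alert])
            tickets.append(ticket)
            latest[alert.service] = ticket
    return tickets
```

Anything the grouping rule cannot decide is still left to NOC staff, which matches the "facilitate everything else" half of the requirement.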
slide-18
SLIDE 18

Design

  • Two separate data centres
  • API for integration

[Diagram: two stacks, each with DB, Middleware, API and UI; Failover Management]

slide-19
SLIDE 19

Design

[Diagram labels: DB, Middleware, API, UI, Failover; RT Ticketing; API: Client info]

slide-20
SLIDE 20

Design

[Diagram labels: DB, Middleware, API, UI, Failover; RT Ticketing; API: Client info; API: Supplier info; API: Service & Circuit info]

slide-21
SLIDE 21

Design

[Diagram labels: DB, Middleware, API, UI, Failover; RT Ticketing; API: Client info; API: Supplier info; API: Service & Circuit info; e-mail; alerting]

slide-22
SLIDE 22

Design

[Diagram labels: DB, Middleware, API, UI, Failover; RT Ticketing; API: Client info; API: Supplier info; API: Service & Circuit info; e-mail; alerting; RT Cache]
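The design slides show ticketing fed by separate APIs for client, supplier, and service and circuit information. The following is a minimal sketch of that integration idea: each data source is a pluggable lookup, and the last good answer is kept so ticketing can carry on if a source is unavailable. The names `CachedLookup` and `enrich`, and the `circuit_id` key, are hypothetical. The caching here is an illustration and is not a description of what the "RT Cache" box in the diagram does.

```python
from typing import Callable, Dict, Optional

# A lookup takes a key (here, a circuit identifier) and returns a dict.
Lookup = Callable[[str], dict]


class CachedLookup:
    """Wrap a lookup so the last good answer is reused if the source fails."""

    def __init__(self, fetch: Lookup) -> None:
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}

    def get(self, key: str) -> Optional[dict]:
        try:
            value = self._fetch(key)
        except Exception:
            # Source unavailable: fall back to the last known value, if any.
            return self._cache.get(key)
        self._cache[key] = value
        return value


def enrich(ticket: dict, lookups: Dict[str, CachedLookup]) -> dict:
    """Return a copy of the ticket with one extra field per data source."""
    enriched = dict(ticket)
    for name, lookup in lookups.items():
        enriched[name] = lookup.get(ticket["circuit_id"])
    return enriched
```

Keeping each source behind its own small interface means a new dataset can be added as one more entry in `lookups` without changing the ticketing code.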

slide-23
SLIDE 23

Buy in

  • Management buy-in
      – Reporting
      – Better customer service
  • NOC buy-in
      – Easier to track incidents
      – Better integration makes life easier
  • Client buy-in
      – Looks the same, but better service

slide-24
SLIDE 24

Buy in

  • NOC involved from day #1
  • Suggestions tracked
      – FogBugz
  • 3-month migration from old to new
      – 5th April 2011 (go-live)
      – 1st July 2011 (turn off mousetrap)

slide-25
SLIDE 25

Continuous improvement

  • E-mail filters
  • RT interface
      – Agile methodology
      – Multiple releases since 5th April
  • AssetDB launched 28th June 2011
      – Plan for integration

slide-26
SLIDE 26

Platform

[Diagram: Development, Staging and Production environments, shown for both the Primary and the Secondary / Failover platform; team labels: s/w dev team, Sysadmin & NOC]

slide-27
SLIDE 27

Outcomes

[Chart: Network Operation Centre Tickets, tickets opened per quarter (Q1 to Q4), 2006 to 2011; vertical axis 500 to 4500]

  • Much better issue tracking
  • More tickets
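A tickets-opened-per-quarter view like the chart above can be produced from nothing more than each ticket's opening date. This is a minimal sketch; the function names `quarter` and `tickets_per_quarter` are hypothetical, and the slides give no figures, so no data is reproduced here.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple


def quarter(d: date) -> int:
    """Calendar quarter (1 to 4) that a date falls in."""
    return (d.month - 1) // 3 + 1


def tickets_per_quarter(opened: Iterable[date]) -> "Counter[Tuple[int, int]]":
    """Count tickets by (year, quarter) of the date they were opened."""
    return Counter((d.year, quarter(d)) for d in opened)
```

Given a list of opening dates, `tickets_per_quarter(dates)[(2011, 2)]` would be the number of tickets opened in Q2 2011.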

slide-28
SLIDE 28

Outcomes

  • Much better reporting
slide-29
SLIDE 29

Outcomes

  • Much better reporting
slide-30
SLIDE 30

Outcomes

  • Much better reporting
slide-31
SLIDE 31

Lessons learned

  • Good incident management => Good customer service
  • Good process is key
  • Tool must support the process
  • Integration is key
  • Automation is great
  • Reporting is vital
slide-32
SLIDE 32

Lessons learned

  • Have a DR plan (Disaster Recovery)
  • Test it
  • Break stuff, and test it again
  • Test it some more
  • Test it again

How do you manage incidents if they break the tool?
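The slide leaves that question open. One possible answer, sketched here as an assumption rather than as HEAnet's DR plan, is to try each place an incident can be recorded in order (primary, then secondary, then a local fallback) and to note which one accepted it. The name `record_incident` and the backend names in the usage note are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A backend is a (name, submit function) pair; submit raises on failure.
Backend = Tuple[str, Callable[[dict], None]]


def record_incident(incident: dict, backends: Sequence[Backend]) -> str:
    """Try each backend in order; return the name of the first that accepts."""
    errors: List[str] = []
    for name, submit in backends:
        try:
            submit(incident)
            return name
        except Exception as exc:
            # Remember why this backend failed and move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend accepted the incident: " + "; ".join(errors))
```

With backends ordered as primary, secondary and a local spool (for example a list's `append` method), a failure of both ticketing instances still leaves the incident recorded somewhere. Testing the DR plan then includes checking that the last fallback really does work.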

slide-33
SLIDE 33

Lessons learned

  • Support the process
  • Integrate
  • Automate
  • Report
  • Leverage community development
  • Have a DR plan
  • Test, test, test some more!