
Incident Management: Making sure things go right when they inevitably go wrong



  1. Incident Management: “Making sure things go right when they inevitably go wrong.” Gareth Eason, HEAnet, for TF-NOC, Zürich, 2011-06-29

  2. Agenda • HEAnet background: What do we do? • Why manage incidents? • How does HEAnet manage incidents? • Implementation of a new incident management system • Lessons learned

  3. Who are HEAnet? • HEAnet is Ireland's research and education network (NREN) • Set up in 1983 as a collaborative body by the seven Irish universities and the Higher Education Authority • Became a non-profit, limited company in 1997 • Approximately 50 staff today

  4. Network members • 7 Universities & DIT • 13 Institutes of Technology • 16 3rd-level colleges and VECs • 24 non-profit and research organisations • Government & Administrative bodies • In excess of 180,000 end users • 4,000 primary and post-primary schools

  5. Affiliations & Representations National: • IBEC – TIF / Telecoms Internet Federation • INEX / Internet Neutral Exchange • ISPAI / Internet Service Provider Association of Ireland International: • EU funded Framework Projects • RIPE Network Co-ordination Centre (NCC) • DANTE / TERENA (37 countries) • GÉANT / NREN Consortium Policy Committee • JANET (UK) and JANET-CERT • MoU with Internet2 / NGI

  6. What do we do? • Provide high quality Internet services to our members • Enable research and learning through leading edge shared services • Act as a representative body for the ICT education & research community • Facilitate innovation and collaboration • Ensure value for money

  7. Network Trends 1991-

  8. Milestones 2008–2010: • Schools 100 Mbit/s connections • First 10 Gbps client connections • Resilience, wireless Strategy 2011–2013: • Data Storage • National Data Centre • Next Generation Network • Cloud Computing • Wireless

  9. What is an incident? • An unplanned interruption to an IT Service or a reduction in the Quality of an IT Service. • Typically, something has gone wrong • Sources: – Automated alerts – Customers – NOC observations – Suppliers
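The definition and the four sources above map naturally onto a small record type. The following is a minimal sketch only: the class and field names (`Incident`, `Source`, `service`) are illustrative assumptions and are not taken from HEAnet's system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Source(Enum):
    # The four incident sources listed on the slide.
    AUTOMATED_ALERT = "automated alert"
    CUSTOMER = "customer"
    NOC_OBSERVATION = "NOC observation"
    SUPPLIER = "supplier"


@dataclass
class Incident:
    summary: str
    source: Source
    service: str  # hypothetical field: the affected IT service
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    resolved_at: Optional[datetime] = None

    @property
    def is_open(self) -> bool:
        # An incident stays open until a resolution time is recorded.
        return self.resolved_at is None

    def resolve(self, when: Optional[datetime] = None) -> None:
        # Record the resolution time, defaulting to "now" in UTC.
        self.resolved_at = when or datetime.now(timezone.utc)
```

Recording the source on every incident is what later makes it possible to report on where incidents come from.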

  10. Why manage incidents? Top 3 reasons to manage incidents: 1. Keep customers happy 2. Keep customers happy 3. Keep customers happy Distant 4th reason: 4. Continuous Service Improvement

  11. Why manage incidents? “You can't manage what you don't measure”

  12. Why manage incidents? “You can't manage what you don't measure” Measure, manage and continually improve service

  13. How does HEAnet manage? • Fundamentally process driven • Supported by NOC tools • Managed by NOC staff • People are the most critical [Diagram labels: Process, Tools, Personnel]

  14. Implementation • Good people – Experienced and know what they are doing • Good processes – Tried, tested and continually improved • Poor tool support – Custom; built for a need 7 years ago – No support – Inflexible; not practical to extend – Not all incidents captured

  15. A new tool • Evaluate available tools – Remedy, OTRS, RT, ... • Propose replacement tool • Map existing processes to new tool • Amend tool / processes to match • Plan migration to new tool • Decommission old tool

  16. Requirements • No external facing change • Federated auth, with bypass • Integration with existing datasets • Integration with monitoring systems • Standalone capable • Resilient • DR plan (#2 item for reinstatement) • Scalable, supportable, maintainable

  17. Requirements • Automation & Aggregation – Automate what we can – Facilitate everything else • Ensure clear, well understood, robust procedures are – in place and – will be followed / enabled • Leverage Upgrades in Core RT
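One way to read "Automation & Aggregation" is that repeated alerts about the same resource should collapse into a single incident instead of one ticket each. The sketch below illustrates that idea under stated assumptions: alerts are plain dicts with hypothetical `resource` and `time` (seconds) keys, and the five-minute window is arbitrary. The slides do not describe how HEAnet actually aggregates alerts.

```python
def aggregate_alerts(alerts, window_seconds=300):
    """Group alerts for the same resource that arrive within
    window_seconds of the first alert in their group."""
    groups = []
    current = {}  # resource -> most recent group for that resource
    for alert in sorted(alerts, key=lambda a: a["time"]):
        group = current.get(alert["resource"])
        if group and alert["time"] - group["first_seen"] <= window_seconds:
            # Same resource, still inside the window: aggregate.
            group["alerts"].append(alert)
        else:
            # First alert for this resource, or the window has expired.
            group = {
                "resource": alert["resource"],
                "first_seen": alert["time"],
                "alerts": [alert],
            }
            current[alert["resource"]] = group
            groups.append(group)
    return groups
```

Each returned group would then correspond to one candidate ticket, with the individual alerts attached as supporting detail.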

  18. Design • Two separate data centres • API for integration [Diagram: Failover Management spanning two identical stacks, each with UI, API, Middleware, DB]
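With the same API exposed from two data centres, a caller can try one site and fall back to the other. The slides do not say how failover is actually managed, so the following is only a sketch of the client-side idea: `call_with_failover` and the handler callables are hypothetical stand-ins for real API calls.

```python
def call_with_failover(endpoints, request):
    """Try each (name, handler) pair in order and return the first
    successful result together with the name of the site that served it."""
    errors = []
    for name, handler in endpoints:
        try:
            return name, handler(request)
        except ConnectionError as exc:
            # This site is unreachable: remember why, then try the next.
            errors.append((name, str(exc)))
    raise RuntimeError(f"all endpoints failed: {errors}")
```

Usage, with stand-in handlers in place of real network calls:

```python
def primary(request):
    raise ConnectionError("primary data centre unreachable")

def secondary(request):
    return {"status": "ok", "echo": request}

site, reply = call_with_failover(
    [("primary", primary), ("secondary", secondary)], "ping"
)
```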

  19. Design [Diagram: RT Ticketing, Failover, UI, Client info API, API, Middleware, DB]

  20. Design [Diagram, adding: Service & Circuit info API, Supplier info API]

  21. Design [Diagram, adding: alerting, e-mail]

  22. Design [Diagram, adding: RT Cache]
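The design slides place a middleware layer between the ticketing system and the client, service & circuit, and supplier information APIs. A minimal sketch of that integration pattern follows. Everything named here is an assumption for illustration: `TicketMiddleware`, the `*_id` ticket fields, and the lookup callables are hypothetical, and the in-memory dict is a simple memo that is not claimed to be the "RT Cache" shown in the diagram.

```python
class TicketMiddleware:
    """Enrich a ticket with data from separate information sources."""

    def __init__(self, client_api, service_api, supplier_api):
        # Each argument is a callable taking an id and returning a dict.
        self._lookups = {
            "client": client_api,
            "service": service_api,
            "supplier": supplier_api,
        }
        self._cache = {}  # (kind, id) -> previously fetched record

    def _lookup(self, kind, key):
        # Fetch once per (kind, id), then serve repeats from memory.
        if (kind, key) not in self._cache:
            self._cache[(kind, key)] = self._lookups[kind](key)
        return self._cache[(kind, key)]

    def enrich(self, ticket):
        # Return a copy so the caller's ticket is left untouched.
        enriched = dict(ticket)
        for kind in self._lookups:
            key = ticket.get(f"{kind}_id")
            if key is not None:
                enriched[kind] = self._lookup(kind, key)
        return enriched
```

Keeping the lookups behind one layer means the ticketing UI only needs to talk to the middleware, and each dataset can change independently.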

  23. Buy in • Management buy-in – Reporting – Better customer service • NOC buy-in – Easier to track incidents – Better integration makes life easier • Client buy-in – Looks the same, but better service

  24. Buy in • NOC involved from day #1 • Suggestions tracked – FogBugz • 3-month migration from old to new – 5th April 2011 (go-live) – 1st July 2011 (turn off mousetrap)

  25. Continuous improvement • E-mail filters • RT interface – Agile methodology – Multiple releases since 5th April • AssetDB launched 28th June 2011 – Plan for integration

  26. Platform • Two sites: Primary and Secondary / Failover • Each site runs Production, Staging and Development environments • Teams shown: Sysadmin & NOC; s/w dev team

  27. Outcomes • Much better issue tracking • More tickets [Chart: Network Operation Centre Tickets; tickets opened per quarter (Q1–Q4), 2006–2011]
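A per-quarter count like the one charted here can be produced from nothing more than ticket opening dates. This is a small sketch, assuming the dates have already been exported from the ticketing system as `datetime.date` values. The function name is illustrative, and no real ticket counts are reproduced.

```python
from collections import Counter
from datetime import date


def tickets_per_quarter(opened_dates):
    """Count tickets by (year, quarter), in chronological order."""
    counts = Counter()
    for opened in opened_dates:
        quarter = (opened.month - 1) // 3 + 1  # Jan-Mar -> 1, ..., Oct-Dec -> 4
        counts[(opened.year, quarter)] += 1
    return dict(sorted(counts.items()))
```

For example, three tickets opened on 2010-12-31, 2011-04-05 and 2011-06-28 yield one ticket in (2010, 4) and two in (2011, 2).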

  28. Outcomes • Much better reporting

  29. Outcomes • Much better reporting

  30. Outcomes • Much better reporting

  31. Lessons learned • Good incident management => Good customer service • Good process is key • Tool must support the process • Integration is key • Automation is great • Reporting is vital

  32. Lessons learned • Have a DR plan (Disaster Recovery) • Test it • Break stuff, and test it again • Test it some more • Test it again How do you manage incidents if they break the tool?

  33. Lessons Learned • Support the process • Integrate • Automate • Report • Leverage community development • Have a DR plan • Test, test, test some more!
