Predicting intermittent network device failures based on network metrics from multiple data sources
Authors: H.P.M. van Doorn, C.H.J. Kuipers
University of Amsterdam
Supervisors:
- P. Boers
- M. Kaat
SURFnet
Tuesday 3 July, RP91
Collected Data over 2 years
Relevance
Research question
To what extent is it possible to predict intermittent network device failures based on network metrics from multiple data sources?
Sub questions
Fault vs Failure
Source: Salfner et al. “A Survey of Online Failure Prediction Methods”.
Identifying outages
Starting point: big outages in the past 2 years
Big: multiple customers losing connectivity
Based on:
Categorizing outages
(Spontaneous reboots)
(Line-card malfunctioning)
Switch chassis metrics
Metrics per interface:
Metrics at hand
Device metrics:
Device events:
Overview Device Data
Line-card failure
Some Charts
Packet CRC errors at core router [BOR]
[Chart; y-axis: number of interface input errors]
Interface input errors, 11-09-2017 [TRUUS]
[Chart; y-axis: number of interface input errors]
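The input-error charts plot error counts per interface over time. If the underlying metric is polled as a cumulative counter (an assumption; the slides do not say how it is collected), per-interval error counts can be derived from successive samples. A minimal Python sketch, where `error_deltas` and its reset rule are illustrative and not the authors' method:

```python
def error_deltas(samples):
    """Turn cumulative counter samples into per-interval error counts.

    Assumption: a sample lower than its predecessor means the counter
    was reset (for example after a reboot), so the new value is taken
    as the number of errors since the reset.
    """
    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            deltas.append(cur - prev)
        else:
            # Counter went backwards: treat as a restart from zero.
            deltas.append(cur)
    return deltas


if __name__ == "__main__":
    # Hypothetical polled values, not data from the presentation.
    print(error_deltas([10, 10, 25, 3, 8]))
```

A sudden run of non-zero deltas on one interface would then be a candidate precursor worth comparing against the known outage dates.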
Spontaneous throughput loss (1)
Spontaneous throughput loss (2)
2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2
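To work with events like the one above in bulk, each line has to be split into fields. The Python sketch below parses exactly the layout of this sample line. The field names (`facility`, `severity`, `mnemonic`) are an assumed reading of the `DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD` tag, not documented vendor terminology, and month names are parsed assuming an English locale.

```python
import re
from datetime import datetime

# Layout assumed from the single sample line on the slide:
# "<YYYY Mon DD HH:MM:SS> <host> <FACILITY>-<severity>-<MNEMONIC>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # Collapse repeated spaces so single-digit days parse as well.
    ts_text = " ".join(m.group("ts").split())
    return {
        "timestamp": datetime.strptime(ts_text, "%Y %b %d %H:%M:%S"),
        "host": m.group("host"),
        "facility": m.group("facility"),
        "severity": int(m.group("severity")),
        "mnemonic": m.group("mnemonic"),
        "message": m.group("message"),
    }


SAMPLE = (
    "2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net "
    "DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): "
    ":Flood Containment Threshold Event Container LIMIT_2"
)

if __name__ == "__main__":
    print(parse_event(SAMPLE))
```

Lines that do not fit the pattern return `None` rather than raising, so a mixed log can be filtered without stopping on the first unknown format.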
Spontaneous throughput loss (3)
Roughly 21,000 events for this switch alone
[Chart: events over time, annotated "May holidays?"]
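With this many events for a single switch, a per-day count is a simple first view of the data. The following is a minimal Python sketch, assuming the event timestamps are already available as `datetime` objects; `events_per_day` is a hypothetical helper and the timestamps in the usage example are made up.

```python
from collections import Counter
from datetime import datetime


def events_per_day(timestamps):
    """Count events per calendar day.

    Returns a Counter mapping datetime.date to the number of events.
    """
    return Counter(ts.date() for ts in timestamps)


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    stamps = [
        datetime(2018, 5, 24, 9, 50, 33),
        datetime(2018, 5, 24, 9, 50, 34),
        datetime(2018, 5, 25, 13, 0, 0),
    ]
    for day, count in sorted(events_per_day(stamps).items()):
        print(day, count)
```

Days whose counts differ strongly from their neighbours, such as those around public holidays, can then be inspected by hand.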
Validating our hypothesis
[Chart: PDU/sec over time (s); series: transmitted PDUs, received PDUs, opposite link]
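One way to read a chart of transmitted versus received PDUs is to look for intervals in which the receiving side sees noticeably fewer PDUs than the sending side. The Python sketch below illustrates that comparison. It assumes both series are sampled at the same instants, and the 10% threshold and the name `loss_intervals` are illustrative choices, not values or methods from the presentation.

```python
def loss_intervals(tx, rx, threshold=0.1):
    """Find sample ranges where received PDUs fall short of transmitted.

    tx, rx: equally long lists of PDU/sec samples for the two ends of
    a link. Returns (start, end) index pairs, end exclusive, where the
    loss fraction (tx - rx) / tx exceeds `threshold`.
    """
    if len(tx) != len(rx):
        raise ValueError("tx and rx must have the same length")
    intervals = []
    start = None
    for i, (t, r) in enumerate(zip(tx, rx)):
        lossy = t > 0 and (t - r) / t > threshold
        if lossy and start is None:
            start = i  # a lossy run begins
        elif not lossy and start is not None:
            intervals.append((start, i))  # the run ended before sample i
            start = None
    if start is not None:
        intervals.append((start, len(tx)))  # run lasts until the end
    return intervals


if __name__ == "__main__":
    # Made-up samples: a drop in received PDUs at indices 2 and 3.
    tx = [100, 100, 100, 100, 100, 100]
    rx = [100, 100, 40, 30, 100, 100]
    print(loss_intervals(tx, rx))
```

Samples with zero transmitted PDUs are never marked as lossy, which avoids a division by zero on idle links.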
Identified:
Challenges:
Spontaneous throughput loss
Spontaneous reboots