Predicting intermittent network device failures based on network metrics from multiple data sources



SLIDE 1

Predicting intermittent network device failures based on network metrics from multiple data sources

Authors:

  • H.P.M. van Doorn
  • C.H.J. Kuipers

University of Amsterdam

Supervisors:

  • P. Boers
  • M. Kaat

SURFnet

Tuesday 3 July, RP91

SLIDE 2

Introduction

Collected data over 2 years

~690 million device events
~163 billion device metrics

SLIDE 3

Introduction

Relevance

Failures impacting connectivity

SLIDE 4

Introduction

Research question

To what extent is it possible to predict intermittent network device failures based on network metrics from multiple data sources?

SLIDE 5

Introduction

Sub questions

  • Which metrics are relevant?
  • Patterns between failures?
  • Correlation between data sources?

SLIDE 6

Introduction

Fault vs Failure

Source: Salfner et al. “A Survey of Online Failure Prediction Methods”.

SLIDE 7

Methodology

Identifying outages

Starting point: big outages in the past 2 years

Big: multiple customers losing connectivity

Based on:

  • Ticketing system
  • Network operators

SLIDE 8

Methodology

Categorizing outages

  • Intermittent failure (spontaneous reboots)
  • Permanent failure (line-card malfunctioning)

SLIDE 9

Methodology

Metrics at hand

Switch chassis metrics:

  • CPU and memory utilization
  • Temperature
  • Uptime

Metrics per interface:

  • Throughput
  • Unicast packets
  • Multicast packets
  • Broadcast packets
  • In/out errors
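
One way the uptime metric listed here could surface spontaneous reboots is to look for samples whose uptime is lower than the previous sample's. The following is a minimal sketch under stated assumptions, not the authors' method: uptime is assumed to arrive as (timestamp, uptime in seconds) pairs in time order, and the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def find_reboots(samples: List[Tuple[int, int]]) -> List[int]:
    """Return the timestamps of samples whose uptime dropped below the previous sample's.

    `samples` is an assumed format: (timestamp, uptime_seconds) pairs in time order.
    """
    reboots = []
    for (_, prev_uptime), (ts, uptime) in zip(samples, samples[1:]):
        # Uptime normally only grows; a decrease suggests the device restarted.
        if uptime < prev_uptime:
            reboots.append(ts)
    return reboots


# Hypothetical samples, for illustration only (polled every 300 seconds).
samples = [(0, 1000), (300, 1300), (600, 120), (900, 420)]
print(find_reboots(samples))
```

An uptime counter that wraps around would produce the same signature as a reboot, so any hit would still need to be checked against device events.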

SLIDE 10

Data Sources

Overview Device Data

Device Metrics:

Device Events:

SLIDE 11

SLIDE 12

Methodology

Line-card failure

  • Line card at BOR malfunctioning

SLIDE 13

Findings

Line-card fault

SLIDE 14

Results

Some Charts

Packet CRC errors at core router [BOR]

[Chart: # interface input errors]

SLIDE 15

Findings

Interface input errors, 11-09-2017 [TRUUS]

[Chart: # interface input errors]

SLIDE 16

Findings

Loss of throughput

SLIDE 17

Findings

Spontaneous throughput loss (1)

SLIDE 18

Findings

Spontaneous throughput loss (2)

  • Syslog event

2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED
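
To work with events like this one in bulk, each line first has to be split into fields. The following is a minimal sketch of one way to do that, not the tooling used in the project. It assumes the wrapped tail of the event reads `on l2-ucast EXCEEDED`, and it reads the tag as FACILITY-SEVERITY-MNEMONIC purely from its shape; the regular expression and the `parse_event` name are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<YYYY Mon DD HH:MM:SS> <host> <FACILITY>-<SEVERITY>-<MNEMONIC>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+): "
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Split one syslog line into fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # %b matches English month abbreviations under the default C locale.
    fields["timestamp"] = datetime.strptime(fields.pop("ts"), "%Y %b %d %H:%M:%S")
    fields["severity"] = int(fields["severity"])
    return fields


line = (
    "2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net "
    "DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): "
    ":Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED"
)
event = parse_event(line)
print(event["timestamp"], event["host"], event["mnemonic"])
```

Returning `None` for non-matching lines keeps the parser usable on mixed logs, where most lines will not be flood containment events.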

SLIDE 19

Findings

Spontaneous throughput loss (3)

  • So is this a real problem?

Roughly 21,000 events for this switch alone

[Chart: Events; annotation: "May holidays?"]
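
Whether a dip in events lines up with a holiday period can be checked by counting events per calendar day. The following is a minimal sketch, assuming event timestamps are already available as `datetime` objects; the function name and the three sample timestamps are hypothetical and do not come from the dataset.

```python
from collections import Counter
from datetime import datetime


def events_per_day(timestamps):
    """Count how many events fall on each calendar day."""
    return Counter(ts.date() for ts in timestamps)


# Hypothetical timestamps, for illustration only.
timestamps = [
    datetime(2018, 5, 23, 14, 2, 10),
    datetime(2018, 5, 24, 9, 50, 33),
    datetime(2018, 5, 24, 9, 51, 7),
]
for day, count in sorted(events_per_day(timestamps).items()):
    print(day, count)
```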

SLIDE 20

Findings

Spontaneous throughput loss (2)

  • Syslog event

2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

SLIDE 21

Findings

Validating our hypothesis

[Chart: PDU/sec versus time (s), axis values 15 and 45 shown; series: transmitted PDUs, received PDUs, opposite link]
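
A comparison of transmitted and received PDU rates like the one charted here can be reduced to a per-sample loss fraction. The following is a minimal sketch of that arithmetic only; the slides do not describe the test procedure, so the function name, the aligned per-second sampling, and the sample values are all assumptions.

```python
def loss_ratio(tx, rx):
    """Return, per sample, the fraction of transmitted PDUs not seen at the receiver.

    `tx` and `rx` are assumed to be PDU/sec series sampled at the same instants.
    """
    if len(tx) != len(rx):
        raise ValueError("tx and rx must have the same number of samples")
    # Clamp at 0 so that a receiver reading slightly above tx is not reported as negative loss.
    return [0.0 if t == 0 else max(0.0, (t - r) / t) for t, r in zip(tx, rx)]


# Hypothetical rates, for illustration only.
tx = [1000, 1000, 1000, 1000]
rx = [1000, 500, 500, 1000]
print(loss_ratio(tx, rx))
```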

SLIDE 22

Discussion

Identified:

  • 2 cases of permanent line-card faults
  • Thousands of flood containment events

Challenges:

  • Data inconsistencies
  • Measurement errors
  • No labeled dataset

SLIDE 23

Conclusion

  • Dataset not (yet) suitable for automated predictions
  • No data that could indicate failure beforehand
  • Proved link between two datasets
  • Validated hypothesis

SLIDE 24

Future Work

  • Normalize datasets
  • Create a labeled dataset
  • Other areas:
      • Capacity Management
      • Service Level Specification

SLIDE 25

Questions?

SLIDE 26

Backup slides

SLIDE 27

Bonus

Spontaneous throughput loss

SLIDE 28

Bonus

Spontaneous throughput loss

  • So is this a real problem?

SLIDE 29

Bonus

Spontaneous throughput loss

SLIDE 30

Bonus

Spontaneous reboots