Predicting intermittent network device failures based on network metrics from multiple data sources


  1. Predicting intermittent network device failures based on network metrics from multiple data sources. Supervisors: P. Boers, M. Kaat (SURFnet). Authors: H.P.M. van Doorn, C.H.J. Kuipers (University of Amsterdam). Tuesday 3 July, RP91.

  2. Introduction. Collected data over 2 years: ~690 million device events, ~163 billion device metrics.

  3. Introduction. Relevance: failures impacting connectivity.

  4. Introduction. Research question: To what extent is it possible to predict intermittent network device failures based on network metrics from multiple data sources?

  5. Introduction. Sub-questions: Which metrics are relevant? Patterns between failures? Correlation between data sources?

  6. Introduction. Fault vs. failure. Source: Salfner et al., "A Survey of Online Failure Prediction Methods".

  7. Methodology. Identifying outages. Starting point: big outages in the past 2 years (big: multiple customers losing connectivity). Based on: ticketing system, network operators.

  8. Methodology. Categorizing outages: intermittent failure (spontaneous reboots); permanent failure (line-card malfunctioning).

  9. Methodology. Metrics at hand. Switch chassis metrics: CPU and memory utilization, temperature, uptime. Metrics per interface: throughput, unicast packets, multicast packets, broadcast packets, in/out errors.

  10. Data sources. Overview. Device data: device metrics, device events.

  11. [No text on this slide]

  12. Methodology. Line-card failure: line card malfunctioning at BOR.

  13. Findings. Line-card fault.

  14. Results. Packet CRC errors at core router [BOR]. Chart: number of interface input errors.

  15. Findings. Interface input errors, 11-09-2017 [TRUUS]. Chart: number of interface input errors.

  16. Findings. Loss of throughput.

  17. Findings. Spontaneous throughput loss (1).

  18. Findings. Spontaneous throughput loss (2). Syslog event: 2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

  19. Findings. Spontaneous throughput loss (3). So is this a real problem? Chart: events over time, annotated "May holidays?". Roughly 21,000 events for this switch alone.

  20. Findings. Spontaneous throughput loss (2). Syslog event: 2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

  21. Findings. Validating our hypothesis. Chart: transmitted and received PDUs (PDU/sec) over time (s), labeled "Opposite link".

  22. Discussion. Identified: 2 cases of permanent line-card faults; thousands of flood containment events. Challenges: data inconsistencies; measurement errors; no labeled dataset.

  23. Conclusion. Dataset not (yet) suitable for automated predictions. No data that could indicate failure beforehand. Proved link between two datasets. Validated hypothesis.

  24. Future work. Normalizing datasets. Creating a labeled dataset. Other areas: capacity management, service level specification.

  25. Questions?

  26. Backup slides

  27. Bonus. Spontaneous throughput loss.

  28. Bonus. Spontaneous throughput loss. So is this a real problem?

  29. Bonus. Spontaneous throughput loss.

  30. Bonus. Spontaneous reboots.
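
The flood containment syslog event quoted on slide 18 lends itself to a small illustration of how such events could be counted per device and per day, which is the kind of tally behind the "roughly 21,000 events" figure. The following is a minimal sketch and not the authors' tooling: the line layout (timestamp, host, then a FACILITY-SEVERITY-MNEMONIC tag) is inferred from that single example, and the names `SYSLOG_RE`, `parse_event`, and `count_per_day` are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# The one event line quoted in the slides, used here as sample input.
SAMPLE = (
    "2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net "
    "DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): "
    ":Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED"
)

# Assumed layout: "<YYYY Mon DD HH:MM:SS> <host> <FACILITY>-<N>-<MNEMONIC>: <message>".
# Reading the digit as a severity level is an assumption, not stated in the slides.
SYSLOG_RE = re.compile(
    r"^(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+): "
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Return a dict describing one event, or None if the line does not match."""
    match = SYSLOG_RE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # %b expects English month abbreviations under the default C locale.
    event["ts"] = datetime.strptime(event["ts"], "%Y %b %d %H:%M:%S")
    event["severity"] = int(event["severity"])
    return event


def count_per_day(lines, mnemonic="FLOOD_CONTAINMENT_THRESHOLD"):
    """Count matching events per (host, calendar day); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event is not None and event["mnemonic"] == mnemonic:
            counts[(event["host"], event["ts"].date())] += 1
    return counts


if __name__ == "__main__":
    for (host, day), n in count_per_day([SAMPLE, SAMPLE, "not a syslog line"]).items():
        print(host, day, n)
```

Grouping by host and day keeps the output small enough to compare against throughput graphs for the same device; a finer time bucket would be needed to line events up with individual throughput drops.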
