

slide-1
SLIDE 1

Detecting network outages using different sources of data

TMA Experts Summit, Paris, France

Cristel Pelsser University of Strasbourg / ICube June, 2019

1 / 44

slide-2
SLIDE 2

Some perspective on:

  • From unsolicited traffic

Detecting Outages using Internet Background Radiation. Andréas Guillot (U. Strasbourg), Romain Fontugne (IIJ), Philipp Winter (CAIDA), Pascal Mérindol (U. Strasbourg), Alistair King (CAIDA), Alberto Dainotti (CAIDA), Cristel Pelsser (U. Strasbourg). TMA 2019.

  • From highly distributed permanent TCP connections

Disco: Fast, Good, and Cheap Outage Detection. Anant Shah (Colorado State U.), Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (University of Strasbourg), Randy Bush (IIJ, Arrcus). TMA 2017.

  • From large-scale traceroute measurements

Pinpointing Anomalies in Large-Scale Traceroute Measurements. Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (University of Strasbourg), Randy Bush (IIJ, Arrcus). IMC 2017.

2 / 44

slide-3
SLIDE 3

Understanding Internet health? (Motivation)

  • To speed up failure identification and thus recovery
  • To identify weak areas and thus guide network design

3 / 44

slide-4
SLIDE 4

Understanding Internet health? (Problem 1)

Manual observations and operations

  • Traceroute / Ping / Operators’ group mailing lists
  • Time consuming
  • Slow process
  • Small visibility

→ Our goal: Automatically pinpoint network disruptions (i.e., congestion and network disconnections)

4 / 44

slide-5
SLIDE 5

Understanding Internet health? (Problem 2)

A single viewpoint is not enough

→ Our goal: mine results from deployed platforms
→ Cooperative and distributed approach
→ Using existing data, no added burden to the network

5 / 44

slide-6
SLIDE 6

Outage detection from unsolicited traffic

slide-7
SLIDE 7

Dataset: Internet Background Radiation

Internet / P1

  • P1 is advertised to the Internet

7 / 44

slide-8
SLIDE 8

Dataset: Internet Background Radiation

Internet / P1

  • P1 is advertised to the Internet
  • Scans, responses to spoofed traffic

7 / 44

slide-9
SLIDE 9

Dataset: Internet Background Radiation

Spoofed traffic

Internet / P1

  • P1 is advertised to the Internet
  • Scans, responses to spoofed traffic
  • Sends traffic with source in P1

7 / 44

slide-10
SLIDE 10

Dataset: Internet Background Radiation

Spoofed traffic

Internet / P1

  • P1 is advertised to the Internet
  • Scans, responses to spoofed traffic
  • Sends traffic with source in P1
  • Responds to spoofed traffic

7 / 44

slide-11
SLIDE 11

Dataset: IP count time-series (per country or AS)

Use cases: attacks, censorship, local outage detection

[Time series of the number of unique source IPs, 2011-01-14 to 2011-02-07; a sharp drop marks the outage]

Figure 1: Egyptian revolution

⇒ More than 60,000 time series in the CAIDA telescope data. We use drops in the time series as indicators of an outage.

8 / 44

slide-12
SLIDE 12

Current methodology used by IODA

Detecting outages using fixed thresholds

9 / 44

slide-13
SLIDE 13

Our goal

Detecting outages using dynamic thresholds

10 / 44

slide-14
SLIDE 14

Outage detection process

[Original time series with training, validation, and test periods marked]

11 / 44

slide-15
SLIDE 15

Outage detection process

[Original and predicted time series with training, calibration, and test periods marked]

Prediction and confidence interval

11 / 44

slide-16
SLIDE 16

Outage detection process

[Original and predicted time series with training, validation, and test periods marked]

  • When the real data falls outside the prediction interval, we raise an alarm.
  • We want a prediction model that is robust to the seasonality and noise in the data → we use the SARIMA model1.

1More details on the methodology on Wednesday.

11 / 44
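The alarm logic on this slide can be sketched in a few lines. This is a simplified stand-in, not the paper's method: it uses a seasonal-naive predictor with a standard-deviation error band instead of a fitted SARIMA model, and all names and numbers are illustrative.

```python
from statistics import stdev

def detect_outages(series, season=7, k=3):
    """Raise an alarm when a value drops below a dynamic threshold.

    Prediction: seasonal-naive (the value one season earlier), a
    simplified stand-in for SARIMA. Threshold: prediction minus
    k standard deviations of past prediction errors.
    """
    alarms, errors = [], []
    for t in range(season, len(series)):
        predicted = series[t - season]
        if len(errors) >= 2:
            band = k * stdev(errors)
            if series[t] < predicted - band:  # drops signal outages
                alarms.append(t)
        errors.append(series[t] - predicted)
    return alarms

# Three noisy "weeks" of unique-source-IP counts, with an outage at t=18.
ips = [500, 510, 490, 505, 495, 500, 508,
       502, 508, 493, 503, 497, 501, 506,
       499, 512, 488, 506,  60, 499, 507]
print(detect_outages(ips))  # → [18]
```

The threshold adapts to the series' own variability, which is the key difference from IODA's fixed thresholds on the previous slides.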

slide-17
SLIDE 17

Validation: ground truth

Characteristics

  • 130 known outages
  • Multiple spatial scales
  • Countries
  • Regions
  • Autonomous Systems
  • Multiple durations (from an hour to a week)
  • Multiple causes (intentional or unintentional)

12 / 44

slide-18
SLIDE 18

Evaluating our solution

Objectives

  • Identifying the minimal number of IP addresses
  • Identifying a good threshold
  • TPR of 90% and FPR of 2%

[ROC curves for all time series, time series with < 20 IPs, and > 20 IPs, with thresholds at 2σ (95%), 3σ (99.5%), and 5σ (99.99%)]

Figure 2: ROC curve

13 / 44

slide-19
SLIDE 19

Comparing our proposal (Chocolatine) to CAIDA’s tools

  • More events detected than the simplistic thresholding technique (DN)
  • Higher overlap with other detection techniques
  • Not a complete overlap

→ differences in dataset coverage
→ different sensitivities to outages

[Venn diagrams: overlap of events detected by DN and by Chocolatine with BGP-based and active-probing (AP) detection]

14 / 44

slide-20
SLIDE 20

Outage detection from highly distributed permanent TCP connections

slide-21
SLIDE 21

Proposed Approach

Disco:

  • Monitor long-running TCP connections and synchronous disconnections from a related network/area
  • We apply Disco to RIPE Atlas data, where probes are widely distributed at the edge and behind NATs/CGNs, providing visibility Trinocular may not have

→ Outage = synchronous disconnections from the same topological/geographical area

16 / 44

slide-22
SLIDE 22

Assumptions / Design Choices

Rely on TCP disconnects

  • Hence the granularity of detection depends on TCP timeouts

Bursts of disconnections are indicators of an interesting outage

  • While there might be non-bursty outages that are interesting, Disco is designed to detect large synchronous disconnections

17 / 44

slide-23
SLIDE 23

Proposed System: Disco & Atlas

RIPE Atlas platform

  • 10k probes worldwide
  • Persistent connections with RIPE controllers
  • Continuous traceroute measurements (see outages from the inside)

→ Dataset: stream of probe connections/disconnections (from 2011 to 2016)

18 / 44

slide-24
SLIDE 24

Disco Overview

  • 1. Split the disconnection stream into sub-streams (AS, country, geo-proximate 50 km radius)
  • 2. Burst modeling and outage detection
  • 3. Aggregation and outage reporting

19 / 44
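Step 1 above can be sketched as a simple grouping pass over the disconnection stream. The event fields below are illustrative; real Atlas disconnect events carry richer metadata, and the geo-proximate (50 km) grouping is omitted here since it needs probe coordinates.

```python
from collections import defaultdict

# Each event: (timestamp, probe_id, asn, country_code). Fields and
# values are made up for illustration.
events = [
    (100, 1, "AS3320", "DE"),
    (101, 2, "AS3320", "DE"),
    (102, 3, "AS1299", "SE"),
    (103, 4, "AS3320", "DE"),
]

def split_streams(events):
    """Split one disconnection stream into per-AS and per-country
    sub-streams (step 1 of Disco)."""
    by_as, by_country = defaultdict(list), defaultdict(list)
    for ts, probe, asn, cc in events:
        by_as[asn].append((ts, probe))
        by_country[cc].append((ts, probe))
    return by_as, by_country

by_as, by_country = split_streams(events)
print(len(by_as["AS3320"]))  # → 3
```

Each sub-stream is then fed independently to the burst-modeling step on the next slides.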

slide-25
SLIDE 25

Why Burst Modeling?

Goal: How to find synchronous disconnections?

  • Time series conceal temporal characteristics
  • The burst model estimates the disconnection arrival rate at any time

Implementation: Kleinberg burst model2

2J. Kleinberg. “Bursty and hierarchical structure in streams”, Data Mining and Knowledge Discovery, 2003.

20 / 44
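A minimal two-state version of Kleinberg's automaton can be sketched as a Viterbi pass over inter-arrival gaps: a "burst" state with an elevated arrival rate competes with the base state, and entering the burst state costs a penalty. This is a simplification of the infinite-state model in the paper (and of Disco's actual implementation); the parameters s and gamma are illustrative.

```python
import math

def kleinberg_bursts(timestamps, s=2.0, gamma=1.0):
    """Two-state simplification of Kleinberg's burst model.
    Returns one 0/1 label per inter-arrival gap: 1 = bursty."""
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    n = len(gaps)
    base = n / sum(gaps)               # overall arrival rate
    rates = [base, s * base]           # state 0: normal, state 1: burst
    trans = gamma * math.log(n)        # cost of entering the burst state

    def emit(state, gap):              # -log density of Exp(rate)
        a = rates[state]
        return -(math.log(a) - a * gap)

    # Viterbi over the two states, minimizing total cost.
    cost = [emit(0, gaps[0]), trans + emit(1, gaps[0])]
    back = []
    for g in gaps[1:]:
        prev, new = [], []
        for j in (0, 1):
            c0 = cost[0] + (trans if j == 1 else 0)  # from normal
            c1 = cost[1]                             # from burst (free)
            prev.append(0 if c0 <= c1 else 1)
            new.append(min(c0, c1) + emit(j, g))
        back.append(prev)
        cost = new
    # Backtrack the cheapest state sequence.
    state = 0 if cost[0] <= cost[1] else 1
    labels = [state]
    for p in reversed(back):
        state = p[state]
        labels.append(state)
    return labels[::-1]

# Sparse disconnections, then a run of near-simultaneous ones.
ts = [0, 60, 130, 190, 250, 252, 254, 256, 258, 260, 330]
print(kleinberg_bursts(ts))  # → [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
```

The five near-simultaneous disconnections are flagged as one burst, while the isolated ones are not, which is exactly the "synchronous disconnections" signal Disco looks for.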

slide-26
SLIDE 26

Burst modeling: Example

  • A monkey caused a blackout in Kenya at 8:30 UTC on June 7th, 2016
  • The same day, RIPE rebooted controllers

21 / 44

slide-27
SLIDE 27

Results

Outage detection:

  • Atlas probe disconnections from 2011 to 2016
  • Disco found 443 significant outages

Outage characterization and validation:

  • Traceroute results from probes (buffered if no connectivity)
  • Outage detection results from Trinocular

22 / 44

slide-28
SLIDE 28

Validation (Traceroute)

Comparison to traceroutes:

  • Can probes in detected outages reach the traceroute destinations?

→ Velocity ratio: proportion of completed traceroutes in a given time

[Probability mass function of the average velocity ratio R, normal periods vs. detected outages]

→ Velocity ratio ≤ 0.5 for 95% of detected outages

23 / 44
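As a rough sketch of the metric, assuming the velocity ratio is simply completed-over-scheduled traceroutes per window, and that R normalizes the outage-window average by the probe's normal velocity (a simplified reading of the paper's definition; all numbers are made up):

```python
def velocity_ratio(completed, expected):
    """Fraction of scheduled traceroutes that completed in a window."""
    return completed / expected

def average_velocity_ratio(window_ratios, baseline_ratio):
    """R: mean velocity during a suspected outage, normalized by the
    probe's normal velocity. R near 1 = business as usual; the slide
    reports R <= 0.5 for 95% of detected outages."""
    return (sum(window_ratios) / len(window_ratios)) / baseline_ratio

normal = velocity_ratio(28, 30)  # quiet-period baseline, ~0.93
during_outage = [velocity_ratio(c, 30) for c in (4, 2, 6)]
print(round(average_velocity_ratio(during_outage, normal), 2))  # → 0.14
```

A low R during a detected outage confirms that probes really were cut off, which is how the traceroute data validates Disco's alarms.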

slide-29
SLIDE 29

Validation (Trinocular)

Comparison to Trinocular (2015):

  • Disco found 53 outages in 2015
  • Corresponding to 851 /24s (only 43% are responsive to ICMP)

Results for /24s reported by Disco and pinged by Trinocular:

  • 33/53 are also found by Trinocular
  • 9/53 are missed by Trinocular (average outage duration < 1 hr)
  • Other outages are partially detected by Trinocular

23 outages found by Trinocular are missed by Disco

  • Disconnections are not very bursty in these cases

→ Disco’s precision: 95%, recall: 67%

24 / 44

slide-30
SLIDE 30

Outage detection from large-scale traceroute measurements

slide-31
SLIDE 31

Dataset: RIPE Atlas traceroutes

Two repetitive large-scale measurements

  • Builtin: traceroute every 30 minutes to all DNS root servers (≈ 500 server instances)
  • Anchoring: traceroute every 15 minutes to 189 collaborative servers

Analyzed dataset

  • May to December 2015
  • 2.8 billion IPv4 traceroutes
  • 1.2 billion IPv6 traceroutes

26 / 44

slide-32
SLIDE 32

Monitor delays with traceroute?

Traceroute to “www.target.com”

Round Trip Time (RTT) between B and C?
Report abnormal RTT between B and C?

27 / 44

slide-33
SLIDE 33

Monitor delays with traceroute?

Challenges:

  • Noisy data

[Scatter plot: RTT (ms) vs. number of hops]

Traceroutes from CZ to BD

28 / 44

slide-34
SLIDE 34

Monitor delays with traceroute?

Challenges:

  • Noisy data
  • Traffic

asymmetry

[Scatter plot: RTT (ms) vs. number of hops]

Traceroutes from CZ to BD

28 / 44

slide-35
SLIDE 35

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

29 / 44

slide-36
SLIDE 36

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

  • No!
  • Traffic is asymmetric
  • RTT_B and RTT_C take different return paths!

30 / 44

slide-37
SLIDE 37

What is the RTT between B and C?

RTT_C − RTT_B = RTT_CB?

  • No!
  • Traffic is asymmetric
  • RTT_B and RTT_C take different return paths!
  • Differential RTT: ∆_CB = RTT_C − RTT_B = d_BC + e_p

30 / 44

slide-38
SLIDE 38

Problem with differential RTT

Monitoring ∆_CB over time:

[Time series of ∆RTT, roughly 10–30 ms]

→ Delay change on BC? CD? DA? BA???

31 / 44

slide-39
SLIDE 39

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = x_0

32 / 44

slide-40
SLIDE 40

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x_0, x_1}

32 / 44

slide-41
SLIDE 41

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x_0, x_1, x_2, x_3, x_4}

32 / 44

slide-42
SLIDE 42

Proposed Approach: Use probes with different return paths

Differential RTT: ∆_CB = {x_0, x_1, x_2, x_3, x_4}

Median ∆_CB:

  • Stable if only a few return-path delays change
  • Fluctuates if the delay on BC changes

32 / 44
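The aggregation above can be sketched as follows: each probe contributes one differential RTT sample, and the median across probes absorbs the per-probe return-path error e_p. The probe RTT values are made up for illustration.

```python
from statistics import median

def differential_rtts(samples):
    """Per-probe differential RTT: Delta_CB = RTT_C - RTT_B.
    Each probe's return-path error e_p differs, so no single value is
    trustworthy; the median across probes is the stable estimator."""
    return [rtt_c - rtt_b for rtt_b, rtt_c in samples]

# (RTT_B, RTT_C) pairs from five probes with different return paths.
probes = [(40.0, 45.2), (31.5, 36.9), (55.1, 60.0),
          (28.3, 33.9), (47.7, 52.4)]
deltas = differential_rtts(probes)
print(round(median(deltas), 1))  # → 5.2
```

A delay change on BC shifts every probe's sample and hence the median, while a change on one probe's return path moves only that probe's sample and leaves the median essentially untouched.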

slide-43
SLIDE 43

Median Diff. RTT: Tier1 link, 2 weeks of data, 95 probes

[Raw differential RTT values (spread over ±400 ms) and the median differential RTT (stable around 4.8–5.6 ms) with its normal reference, for 130.117.0.250 (Cogent, Zurich) - 154.54.38.50 (Cogent, Munich), June 2-14, 2015]

  • Stable despite noisy RTTs (not true for the average)
  • Normally distributed

33 / 44

slide-44
SLIDE 44

Detecting congestion

[Median differential RTT with its normal reference and detected anomalies, for 72.52.92.14 (HE, Frankfurt) - 80.81.192.154 (DE-CIX (RIPE)), Nov 26 - Dec 1, 2015]

Significant RTT changes: Confidence interval not overlapping with the normal reference

34 / 44
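The overlap test on this slide can be sketched as below, using a bootstrap confidence interval for the median as a simple stand-in for the interval estimator actually used; the RTT values are illustrative.

```python
import random
from statistics import median

def median_ci(values, confidence=0.95, n_boot=2000, seed=7):
    """Bootstrap confidence interval for the median (a simple stand-in
    for the slide's interval estimator)."""
    rng = random.Random(seed)
    meds = sorted(median(rng.choices(values, k=len(values)))
                  for _ in range(n_boot))
    lo = meds[int((1 - confidence) / 2 * n_boot)]
    hi = meds[int((1 + confidence) / 2 * n_boot) - 1]
    return lo, hi

def is_anomalous(window, reference_ci):
    """Alarm when the window's CI does not overlap the normal reference."""
    lo, hi = median_ci(window)
    ref_lo, ref_hi = reference_ci
    return hi < ref_lo or lo > ref_hi

# Normal reference built from quiet-period median differential RTTs (ms).
reference = median_ci([5.0, 5.2, 4.9, 5.1, 5.3, 5.0, 4.8, 5.2])
congested = [24.7, 25.3, 26.1, 24.9, 25.8, 25.2, 24.6, 25.5]
print(is_anomalous(congested, reference))  # → True
```

Requiring the whole interval to clear the reference, rather than a single point, keeps one noisy measurement from raising an alarm.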

slide-45
SLIDE 45

Results

Analyzed dataset

  • Atlas builtin/anchoring measurements
  • From May to Dec. 2015
  • 2.8 billion IPv4 traceroutes
  • 1.2 billion IPv6 traceroutes
  • Observed 262k IPv4 and 42k IPv6 links (core links)

We found a lot of congested links! Let’s look at one example

35 / 44

slide-46
SLIDE 46

Study case: Telekom Malaysia BGP leak

36 / 44

slide-47
SLIDE 47

Study case: Telekom Malaysia BGP leak

37 / 44

slide-48
SLIDE 48

Study case: Telekom Malaysia BGP leak

37 / 44

slide-49
SLIDE 49

Study case: Telekom Malaysia BGP leak

37 / 44

slide-50
SLIDE 50

Study case: Telekom Malaysia BGP leak

Not only with Google... but about 170k prefixes!

37 / 44

slide-51
SLIDE 51

Congestion in Level3

Rerouted traffic congested Level3 (120 reported links)

  • Example: a 229 ms increase between two routers in London!

[Median differential RTT with its normal reference and detected anomalies, for 67.16.133.130 - 67.17.106.150, June 8-13, 2015]

38 / 44

slide-52
SLIDE 52

Congestion in Level3

Reported links in London:

[Map legend: delay increase; delay & packet loss]

→ Traffic staying within UK/Europe may also be altered

39 / 44

slide-53
SLIDE 53

But why did we look at that?

Per-AS alarm for delay

40 / 44

slide-54
SLIDE 54

Conclusions and perspectives (1)

We proposed 3 different techniques to detect outages for 3 different sources of data

  • Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, AS disconnections

41 / 44

slide-55
SLIDE 55

Conclusions and perspectives (1)

We proposed 3 different techniques to detect outages for 3 different sources of data

  • Each source of data has its own coverage
  • Core links (congestion and failures)
  • Prefix, country, region, AS disconnections
  • Each source of data has its own noise, properties
  • Identifying the suitable model is a challenge

41 / 44

slide-56
SLIDE 56

Conclusions and perspectives (2)

There is no substantial, state-of-the-art ground truth to validate the results. We resort to

  • the comparison of different techniques with different coverages
  • evaluations on the basis of partial ground truth
  • characterizations of the detected outages based on the detection algorithm used

42 / 44

slide-57
SLIDE 57

Turn this

43 / 44

slide-58
SLIDE 58

Into this

http://ihr.iijlab.net

44 / 44