Detecting network outages using different sources of data

TMA Experts Summit, Paris, France
Cristel Pelsser, University of Strasbourg / ICube
June 2019

1 / 44
Some perspective on:

- From unsolicited traffic
  Detecting Outages using Internet Background Radiation. Andréas Guillot (U. Strasbourg), Romain Fontugne (IIJ), Philipp Winter (CAIDA), Pascal Mérindol (U. Strasbourg), Alistair King (CAIDA), Alberto Dainotti (CAIDA), Cristel Pelsser (U. Strasbourg). TMA 2019.
- From highly distributed permanent TCP connections
  Disco: Fast, Good, and Cheap Outage Detection. Anant Shah (Colorado State U.), Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (University of Strasbourg), Randy Bush (IIJ, Arrcus). TMA 2017.
- From large-scale traceroute measurements
  Pinpointing Anomalies in Large-Scale Traceroute Measurements. Romain Fontugne (IIJ), Emile Aben (RIPE NCC), Cristel Pelsser (University of Strasbourg), Randy Bush (IIJ, Arrcus). IMC 2017.
2 / 44
Understanding Internet health? (Motivation)
- To speed up failure identification and thus recovery
- To identify weak areas and thus guide network design
3 / 44
Understanding Internet health? (Problem 1)
Manual observations and operations
- Traceroute / Ping / Operators’ group mailing lists
- Time-consuming
- Slow process
- Limited visibility
→ Our goal: Automatically pinpoint network disruptions (i.e., congestion and network disconnections)
4 / 44
Understanding Internet health? (Problem 2)
A single viewpoint is not enough
→ Our goal: mine results from deployed platforms
→ Cooperative and distributed approach
→ Using existing data, no added burden on the network
5 / 44
Outage detection from unsolicited traffic
Dataset: Internet Background Radiation

[Diagram: a telescope prefix P1 is advertised to the Internet. The telescope passively receives unsolicited traffic: scans, and backscatter from hosts that respond to traffic spoofed with a source address in P1.]

7 / 44
Dataset: IP count time-series (per country or AS)
Use cases: attacks, censorship, local outage detection
[Figure 1: Egyptian revolution. Number of unique source IPs over time, January 14 to February 7, 2011.]
⇒ More than 60 000 time series in the CAIDA telescope data. We use drops in the time series as indicators of an outage.
8 / 44
Current methodology used by IODA
Detecting outages using fixed thresholds
9 / 44
Our goal
Detecting outages using dynamic thresholds
10 / 44
Outage detection process

[Figure: original time series (number of unique source IPs) split into training, calibration, and test periods, with the predicted time series and its confidence interval overlaid.]

- When the real data falls outside the prediction interval, we raise an alarm.
- We want a prediction model that is robust to the seasonality and noise in the data → we use the SARIMA model¹ (a sketch follows this slide).

¹More details on the methodology on Wednesday.

11 / 44
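To make the alarm rule concrete, here is a minimal sketch of SARIMA-based prediction-interval detection. The model orders, the assumed hourly bins with a daily season (s=24), and the helper name are illustrative choices, not Chocolatine's actual configuration:

```python
# Hedged sketch: fit SARIMA on a training window, forecast the test window,
# and flag bins where the observed IP count falls below the lower bound of
# the prediction interval.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def detect_drops(series: pd.Series, train_size: int, alpha: float = 0.01):
    """Return timestamps where the IP count drops below the SARIMA
    prediction interval (illustrative orders, not tuned values)."""
    train = series.iloc[:train_size]
    test = series.iloc[train_size:]
    model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
    fit = model.fit(disp=False)
    forecast = fit.get_forecast(steps=len(test))
    bounds = forecast.conf_int(alpha=alpha)   # (1 - alpha) interval
    lower = bounds.iloc[:, 0].to_numpy()
    return test.index[test.to_numpy() < lower]
```

A drop that persists across consecutive bins would then be reported as an outage for the corresponding country or AS.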
Validation: ground truth
Characteristics
- 130 known outages
- Multiple spatial scales
- Countries
- Regions
- Autonomous Systems
- Multiple durations (from an hour to a week)
- Multiple causes (intentional or unintentional)
12 / 44
Evaluating our solution
Objectives
- Identifying the minimal number of IP addresses
- Identifying a good threshold
- TPR of 90% and FPR of 2%

[Figure 2: ROC curve for all time series, time series with < 20 IPs, and time series with > 20 IPs, at 2σ (95%), 3σ (99.5%), and 5σ (99.99%) thresholds.]
13 / 44
Comparing our proposal (Chocolatine) to CAIDA’s tools
- More events detected than the simplistic thresholding technique (DN)
- Higher overlap with the other detection techniques
- Not a complete overlap
  → differences in dataset coverage
  → different sensitivities to outages

[Figure: Venn diagrams of the outages detected by DN and by Chocolatine versus BGP- and active-probing (AP)-based detection.]
14 / 44
Outage detection from highly distributed permanent TCP connections
Proposed Approach
Disco:
- Monitor long-running TCP connections and synchronous disconnections from a related network/area
- We apply Disco to RIPE Atlas data, where probes are widely distributed at the edge and behind NATs/CGNs, providing visibility Trinocular may not have

→ Outage = synchronous disconnections from the same topological/geographical area
16 / 44
Assumptions / Design Choices
Rely on TCP disconnects
- Hence the granularity of detection depends on TCP timeouts

Bursts of disconnections are indicators of interesting outages
- While there might be non-bursty outages that are interesting, Disco is designed to detect large synchronous disconnections
17 / 44
Proposed System: Disco & Atlas
RIPE Atlas platform
- 10k probes worldwide
- Persistent connections with RIPE controllers
- Continuous traceroute measurements (see outages from the inside)

→ Dataset: stream of probe connections/disconnections (from 2011 to 2016)
18 / 44
Disco Overview
- 1. Split the disconnection stream into sub-streams (AS, country, geo-proximate 50 km radius); a rough sketch follows this slide
- 2. Burst modeling and outage detection
- 3. Aggregation and outage reporting
19 / 44
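For step 1, a possible reading of the sub-stream splitting. The event field names, the grid-cell approximation of the 50 km radius, and the function itself are my assumptions, not Disco's code:

```python
# Hedged sketch: bucket disconnection events into sub-streams by AS,
# by country, and by geographic proximity.
from collections import defaultdict

def split_streams(events):
    """events: iterable of dicts with keys 'probe_asn', 'country',
    'lat', 'lon', 'time'. Returns {stream_key: [timestamps]}."""
    streams = defaultdict(list)
    for e in events:
        streams[("as", e["probe_asn"])].append(e["time"])
        streams[("country", e["country"])].append(e["time"])
        # Crude geo bucketing: ~50 km grid cells (0.45 degrees of latitude).
        cell = (round(e["lat"] / 0.45), round(e["lon"] / 0.45))
        streams[("geo", cell)].append(e["time"])
    return streams
```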
Why Burst Modeling?
Goal: How to find synchronous disconnections?
- Time series conceal temporal characteristics
- A burst model estimates the disconnection arrival rate at any time

Implementation: Kleinberg burst model² (sketched after this slide)

²J. Kleinberg. “Bursty and hierarchical structure in streams”, Data Mining and Knowledge Discovery, 2003.
20 / 44
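As an illustration of the burst-modeling step, below is a simplified two-state sketch in the spirit of Kleinberg's model. The parameters s and gamma, the exponential gap costs, and all names are illustrative assumptions; Disco's actual implementation may differ:

```python
# Hedged sketch of a two-state Kleinberg-style burst detector: a Viterbi
# pass over inter-arrival gaps labels each gap as baseline (0) or burst (1).
import math

def label_bursts(timestamps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap; assumes >= 3 increasing timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    n = len(gaps)
    mean_gap = sum(gaps) / n
    rates = [1.0 / mean_gap, s / mean_gap]  # burst state fires s times faster
    up_cost = gamma * math.log(n)           # penalty for entering burst state
    INF = float("inf")
    cost, back = [0.0, INF], []             # start in the baseline state
    for g in gaps:
        # -log density of an exponential inter-arrival with rate r
        emit = [-math.log(r) + r * g for r in rates]
        new_cost, choice = [INF, INF], [0, 0]
        for to in (0, 1):
            for frm in (0, 1):
                c = cost[frm] + (up_cost if to > frm else 0.0)
                if c < new_cost[to]:
                    new_cost[to], choice[to] = c, frm
            new_cost[to] += emit[to]
        back.append(choice)
        cost = new_cost
    # Backtrack the cheapest state sequence.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for choice in reversed(back):
        state = choice[state]
        states.append(state)
    states.reverse()
    return states[1:]  # states[i] labels gaps[i]; 1 = burst
```

In Disco's terms, a sub-stream whose disconnection gaps enter the burst state (with enough probes involved) would be reported as an outage.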
Burst modeling: Example
- A monkey caused a blackout in Kenya at 8:30 UTC on June 7th, 2016
- The same day, RIPE rebooted its controllers
21 / 44
Results
Outage detection:
- Atlas probe disconnections from 2011 to 2016
- Disco found 443 significant outages
Outage characterization and validation:
- Traceroute results from probes (buffered if no connectivity)
- Outage detection results from Trinocular
22 / 44
Validation (Traceroute)
Comparison to traceroutes:
- Can probes involved in a detected outage still reach the traceroute destinations?
→ Velocity ratio: proportion of completed traceroutes in a given time window (a toy computation follows this slide)

[Figure: probability mass function of the average velocity ratio R for normal periods vs. detected outages.]

→ Velocity ratio ≤ 0.5 for 95% of detected outages
23 / 44
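A toy computation of the velocity ratio as defined above. This is my reading of the slide; the normalization against the probe's usual completion rate is an assumption, not necessarily Disco's exact definition:

```python
# Hedged sketch: ratio of traceroutes completed in a window to the number
# the probe would normally complete in a window of that length.
def velocity_ratio(completed: int, window_s: float, usual_rate: float) -> float:
    """usual_rate is the probe's normal completions per second."""
    expected = usual_rate * window_s
    return completed / expected if expected > 0 else 0.0

# Example: a probe that normally completes one traceroute every 900 s
# finishes only one during a 3600 s window around a detected outage.
print(velocity_ratio(1, 3600.0, 1 / 900.0))  # 0.25, consistent with an outage
```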
Validation (Trinocular)
Comparison to Trinocular (2015):
- Disco found 53 outages in 2015
- Corresponding to 851 /24s (only 43% are responsive to ICMP)

Results for /24s reported by Disco and pinged by Trinocular:
- 33/53 are also found by Trinocular
- 9/53 are missed by Trinocular (average outage duration < 1 hour)
- Other outages are partially detected by Trinocular

23 outages found by Trinocular are missed by Disco
- Disconnections are not very bursty in these cases

→ Disco’s precision: 95%, recall: 67%
24 / 44
Outage detection from large-scale traceroute measurements
Dataset: RIPE Atlas traceroutes
Two repetitive large-scale measurements
- Builtin: traceroute every 30 minutes to all DNS root servers (≈ 500 server instances)
- Anchoring: traceroute every 15 minutes to 189 collaborative servers

Analyzed dataset
- May to December 2015
- 2.8 billion IPv4 traceroutes
- 1.2 billion IPv6 traceroutes
26 / 44
Monitor delays with traceroute?
[Diagram: traceroute to “www.target.com”. What is the Round-Trip Time (RTT) between routers B and C? Can we report an abnormal RTT between B and C?]
27 / 44
Monitor delays with traceroute?

Challenges:
- Noisy data
- Traffic asymmetry

[Figure: RTT vs. number of hops for traceroutes from CZ to BD.]

28 / 44
What is the RTT between B and C?
RTTC - RTTB = RTTCB?
29 / 44
What is the RTT between B and C?

RTTC − RTTB = RTTCB?
- No!
- Traffic is asymmetric
- RTTB and RTTC take different return paths!
- Differential RTT: ∆CB = RTTC − RTTB = dBC + εp, where dBC is the BC link delay and εp the difference between the two return-path delays (derivation sketched after this slide)

30 / 44
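To see where the error term comes from, here is the derivation I infer from the definitions above; the forward/return-delay notation is mine, and p denotes the probe:

```latex
% Probe p traceroutes through consecutive hops B and C, so the forward
% path to C is the forward path to B plus the link from B to C:
%   RTT_B = fwd(p,B) + ret(B,p)
%   RTT_C = fwd(p,B) + d_{BC} + ret(C,p)
\[
  \Delta_{CB} = RTT_C - RTT_B
              = d_{BC} + \underbrace{ret(C,p) - ret(B,p)}_{\varepsilon_p}
\]
```

With a single probe, εp is unobservable; the next slides use many probes with different return paths so that the median of ∆CB cancels most of the εp terms.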
Problem with differential RTT
Monitoring ∆CB over time:
[Figure: ∆RTT over time.]
→ Delay change on BC? CD? DA? BA???
31 / 44
Proposed Approach: Use probes with different return paths

Differential RTT: ∆CB = {x0, x1, x2, x3, x4}

Median ∆CB:
- Stable if only a few return-path delays change
- Fluctuates if the delay on BC changes

32 / 44
Median Diff. RTT: Tier 1 link, 2 weeks of data, 95 probes

[Figure: differential RTT between 130.117.0.250 (Cogent, Zurich) and 154.54.38.50 (Cogent, Munich), June 2 to 14, 2015: raw values, and the median differential RTT with its normal reference (within roughly 4.8 to 5.6 ms).]

- Stable despite noisy RTTs (not true for the average)
- Normally distributed

33 / 44
Detecting congestion

[Figure: median differential RTT between 72.52.92.14 (HE, Frankfurt) and 80.81.192.154 (DE-CIX (RIPE)), November 26 to December 1, 2015, with the normal reference and detected anomalies.]

Significant RTT changes: confidence interval not overlapping with the normal reference (a sketch of this rule follows this slide)

34 / 44
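A minimal sketch of this rule, assuming a standard nonparametric confidence interval for the median from order statistics; the paper's exact interval construction may differ:

```python
# Hedged sketch: compute the median differential RTT of one time bin with a
# ~95% CI from order statistics, and raise an alarm when that CI does not
# overlap the normal-reference interval learned from past bins.
import math

def median_ci(samples, z=1.96):
    """Median and an approximate CI from binomial order statistics."""
    xs = sorted(samples)
    n = len(xs)
    med = xs[n // 2] if n % 2 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])
    half = z * math.sqrt(n) / 2.0
    lo = max(int(math.floor(n / 2.0 - half)), 0)
    hi = min(int(math.ceil(n / 2.0 + half)), n - 1)
    return xs[lo], med, xs[hi]

def is_anomalous(bin_samples, ref_lo, ref_hi) -> bool:
    """True when the bin's CI and the reference interval do not overlap."""
    lo, _, hi = median_ci(bin_samples)
    return hi < ref_lo or lo > ref_hi
```

Requiring non-overlap of intervals, rather than a single-point threshold, is what keeps the detector robust to the noisy raw RTT samples shown above.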
Results
Analyzed dataset
- Atlas builtin/anchoring measurements
- From May to Dec. 2015
- 2.8 billion IPv4 traceroutes
- 1.2 billion IPv6 traceroutes
- Observed 262k IPv4 and 42k IPv6 links (core links)
We found a lot of congested links! Let’s look at one example.
35 / 44
Study case: Telekom Malaysia BGP leak
36 / 44
Study case: Telekom Malaysia BGP leak
Not only with Google... but about 170k prefixes!
37 / 44
Congestion in Level3
Rerouted traffic congested Level3 (120 reported links)
- Example: a 229 ms increase between two routers in London!

[Figure: differential RTT between 67.16.133.130 and 67.17.106.150, June 8 to 13, 2015: median, normal reference, and detected anomalies.]
38 / 44
Congestion in Level3
Reported links in London:

[Map: reported links in London, distinguishing delay increases from delay & packet loss.]
→ Traffic staying within the UK/Europe may also be affected
39 / 44
But why did we look at that?
Per-AS alarm for delay
40 / 44
Conclusions and perspectives (1)
We proposed 3 different techniques to detect outages for 3 different sources of data
- Each source of data has its own coverage
- Core links (congestion and failures)
- Prefix, country, region, AS disconnections
- Each source of data has its own noise and properties
- Identifying the suitable model is a challenge
41 / 44
Conclusions and perspectives (2)
There is no substantial, state-of-the-art ground truth to validate the results. We resort to
- the comparison of different techniques with different coverages
- evaluations on the basis of partial ground truth
- characterizations of the detected outages based on the