Detecting Peering Infrastructure Outages in the Wild
Vasileios Giotsas †∗, Christoph Dietzel † §, Georgios Smaragdakis ‡ †, Anja Feldmann †, Arthur Berger ¶ ‡, Emile Aben #
†TU Berlin ∗CAIDA §DE-CIX ‡MIT ¶Akamai #RIPE NCC
○ Largest IXPs support > 100K peerings, > 5 Tbps peak traffic
○ Typical SLA 99.99% (~52 min. downtime/year) [1]
○ Largest facilities support > 170K interconnections
○ Typical SLA 99.999% (~5 min. downtime/year) [2]
[1] https://ams-ix.net/services-pricing/service-level-agreement
[2] http://www.telehouse.net/london-colocation/
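The downtime figures quoted next to the SLA percentages follow from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not taken from the slides:

```python
# Minutes in a non-leap year (assumption: 365 days).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(sla_percent):
    """Return the yearly downtime budget, in minutes, implied by an SLA."""
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR


for sla in (99.99, 99.999):
    print(f"{sla}% -> {allowed_downtime_minutes(sla):.1f} min/year")
```

This gives roughly 52.6 and 5.3 minutes per year, matching the rounded values on the slide.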
https://www.franceix.net/en/technical/infrastructure/
Global footprint of AMS-IX
https://ams-ix.net/connect-to-ams-ix/peering-around-the-globe
○ Timely, at the finest granularity possible
○ Distinguish cascading effects from outage source
○ Determine duration, shifts in routing paths, geographic spread
[Figure sequence: paths observed from vantage points (VP) before and during the actual incident]
○ AS path does not change!
○ IXP or Facility 2 failed
○ IXP is still active
○ Depending on the VP: no hop changes, or the initial hops changed
France-IX topology
[Figure sequence: BGP and traceroute measurements across the France-IX topology, with Djibouti Telecom and Telkom Indonesia labeled; hop addresses shown: 149.6.154.142, 37.49.237.126]
IP-to-Facility [3,4] and IP-to-IXP [5] mapping possible but expensive!
[3] Giotsas, Vasileios, et al. "Mapping peering interconnections to a facility". CoNEXT 2015.
[4] Motamedi, Reza, et al. "On the Geography of X-Connects". Technical Report CIS-TR-2014-02, University of Oregon, 2014.
[5] Nomikos, George, et al. "traIXroute: Detecting IXPs in traceroute paths". PAM 2016.
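The IP-to-IXP step can be pictured as a prefix lookup on each traceroute hop. This is a minimal sketch, not the method of the cited tools: the prefix table is hypothetical, the France-IX peering LAN value is an assumption made for illustration, and the first hop address is assumed to read 149.6.154.142.

```python
import ipaddress

# Hypothetical table: IXP name -> peering LAN prefixes.
# The prefix below is an assumed value for illustration only.
IXP_PREFIXES = {
    "France-IX": [ipaddress.ip_network("37.49.236.0/22")],
}


def ixp_of_hop(hop_ip):
    """Return the IXP whose peering LAN contains hop_ip, or None."""
    addr = ipaddress.ip_address(hop_ip)
    for name, networks in IXP_PREFIXES.items():
        if any(addr in net for net in networks):
            return name
    return None


# Hop addresses as read from the slide.
hops = ["149.6.154.142", "37.49.237.126"]
for hop in hops:
    print(hop, "->", ixp_of_hop(hop))
```

A lookup like this only covers the IXP side. Mapping a hop to a facility needs additional data and measurements, which is part of why the slide calls the approach expensive.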
PREFIX: 1.0.0.0/24 ASPATH: 2 1 0 COMMUNITY: 2:200
BGP Communities: metadata
○ Top 16 bits: ASN that sets the community.
○ Bottom 16 bits: numerical value that encodes the actual meaning.
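The 16/16-bit split can be illustrated with plain bit operations. A minimal sketch; the helper names are illustrative:

```python
def encode_community(asn, value):
    """Pack an ASN and a 16-bit value into one 32-bit community."""
    return (asn << 16) | value


def decode_community(raw):
    """Split a 32-bit community into (ASN, value)."""
    return raw >> 16, raw & 0xFFFF


raw = encode_community(2, 200)   # the community written as 2:200
asn, value = decode_community(raw)
print(f"{raw} -> {asn}:{value}")
```

The conventional `ASN:value` notation is just this pair printed with a colon.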
The BGP Community 2:200 is used to tag routes received at Facility 2
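Annotating a route with its ingress location then amounts to a dictionary lookup on its communities. A minimal sketch, assuming a dictionary keyed by (ASN, value); the dictionary layout and function names are hypothetical, and only the 2:200 entry comes from the slide:

```python
# Hypothetical community dictionary: (ASN, value) -> (kind, location).
COMMUNITY_DICTIONARY = {
    (2, 200): ("facility", "Facility 2"),
}


def parse_community(text):
    """Turn 'ASN:value' into a tuple of integers."""
    asn, value = text.split(":")
    return int(asn), int(value)


def locations_for(communities, dictionary=COMMUNITY_DICTIONARY):
    """Return the dictionary entries matching a route's communities."""
    found = []
    for text in communities:
        entry = dictionary.get(parse_community(text))
        if entry is not None:
            found.append(entry)
    return found


route = {"prefix": "1.0.0.0/24", "as_path": [2, 1, 0], "communities": ["2:200"]}
print(route["prefix"], locations_for(route["communities"]))
```

Communities missing from the dictionary are simply ignored, so a route may end up with no location annotation at all.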
PREFIX: 3.3.3.3/24 ASPATH: 4 3 COMMUNITY: 4:8714 4:400
PREFIX: 2.2.2.2/24 ASPATH: 4 2 COMMUNITY: 4:8714 4:400
PREFIX: 1.0.0.0/24 ASPATH: 2 1 0 COMMUNITY: 2:200
Multiple communities can tag different types
PREFIX: 3.3.3.3/24 ASPATH: 4 3 COMMUNITY: 4:400
PREFIX: 2.2.2.2/24 ASPATH: 4 2 COMMUNITY: 4:8714 4:400
PREFIX: 1.0.0.0/24 ASPATH: 2 1 0 COMMUNITY: 2:100
When a route changes its ingress point, the community values are updated to reflect the change.
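An ingress change can be spotted by comparing the location tags of a prefix before and after an update. A minimal sketch using the community values from the slides; apart from 2:200 (Facility 2), the meanings assigned to the tags are assumptions made for illustration:

```python
# Location tags. Only 2:200 -> Facility 2 is stated on the slides;
# the other meanings are assumed for this example.
LOCATION_TAGS = {
    "2:200": "Facility 2",
    "2:100": "Facility 1",  # assumed
    "4:400": "Facility 4",  # assumed
    "4:8714": "IXP",        # assumed
}


def locations(communities):
    """Map a set of communities to the set of known locations."""
    return {LOCATION_TAGS[c] for c in communities if c in LOCATION_TAGS}


def ingress_changes(before, after):
    """Return {prefix: (old_locations, new_locations)} for changed prefixes."""
    changes = {}
    for prefix, old in before.items():
        new = after.get(prefix, set())
        if locations(old) != locations(new):
            changes[prefix] = (locations(old), locations(new))
    return changes


before = {
    "3.3.3.3/24": {"4:8714", "4:400"},
    "2.2.2.2/24": {"4:8714", "4:400"},
    "1.0.0.0/24": {"2:200"},
}
after = {
    "3.3.3.3/24": {"4:400"},
    "2.2.2.2/24": {"4:8714", "4:400"},
    "1.0.0.0/24": {"2:100"},
}
for prefix, (old, new) in sorted(ingress_changes(before, after).items()):
    print(prefix, sorted(old), "->", sorted(new))
```

Note that the AS paths are identical before and after; only the communities reveal the change.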
○ WHOIS, NOCs websites
paths annotated with at least one Community in our dictionary.
98% of the facilities with at least 20 members.
[Figure sequence: paths tagged with location communities, plotted over time]
AS_PATH: 1 x COMM: 1:FAC2
AS_PATH: 2 1 0 COMM: 2:FAC2
AS_PATH: 4 x COMM: 4:FAC2
The tags then change:
AS_PATH: 2 1 0 COMM: 2:FAC1
AS_PATH: 1 x COMM: 1:FAC1
AS_PATH: 4 x COMM: 4:FAC4 4:IXP
Partial outage? De-peering of large ASes? Major routing policy change?
Paths later carry the original tags again:
AS_PATH: 1 x COMM: 1:FAC2
AS_PATH: 2 1 0 COMM: 2:FAC2
End of outage inferred when the majority
The aggregated activity of BGP messages (updates, withdrawals, states) provides no outage indication. The BGP activity filtered using communities provides a strong outage indication.
[Figure: number of BGP messages (log scale) over time, and fraction of infrastructure paths over time]
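The "fraction of infrastructure paths" signal can be sketched as the share of baseline paths that still carry a facility's community in each time bin. This is a simplified illustration rather than the detection algorithm from the talk; the threshold value and all names are hypothetical:

```python
def path_fractions(baseline_paths, bins):
    """Fraction of baseline paths still tagged with the facility, per time bin."""
    return [len(baseline_paths & observed) / len(baseline_paths) for observed in bins]


def outage_bins(fractions, threshold=0.5):
    """Indices of bins where the fraction falls below the (assumed) threshold."""
    return [i for i, fraction in enumerate(fractions) if fraction < threshold]


# Paths tagged with the facility's community before the incident.
baseline = {"1 x", "2 1 0", "4 x"}

# Paths still carrying that community in consecutive time bins.
bins = [
    {"1 x", "2 1 0", "4 x"},
    {"1 x", "4 x"},
    set(),
    {"1 x", "2 1 0"},
]

fractions = path_fractions(baseline, bins)
print([round(f, 2) for f in fractions])
print(outage_bins(fractions))
```

A drop in this fraction is only a hint. A partial drop could equally come from de-peering or a routing policy change, so it needs further checks before being called an outage.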
○ ASes may be interconnected over multiple intermediate infrastructures
○ Failures in intermediate infrastructures may affect the near-end infrastructure paths
[Figure sequence: paths per facility over time]
○ Outage in Facility 2 causes drop in the paths of Facility 4!
○ Outage in Facility 3 causes drop in the paths of Facility 4!
○ AS to Facilities, AS to IXPs, IXPs to Facilities
○ Sources: PeeringDB, DataCenterMap, operator websites
[Figure sequence: affected paths over time, grouped by far-end AS colocation]
○ Far-end ASes colocated in Facility 2
○ Far-end ASes colocated in Facility 3
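One way to picture how a colocation map separates the outage source from cascading effects is to ask which facility all affected far-end ASes have in common. This is a simplified heuristic for illustration, not the method from the talk; the colocation data and AS names are hypothetical:

```python
from collections import Counter

# Hypothetical colocation map: AS -> facilities where it is present.
COLOCATION = {
    "AS1": {"Facility 2", "Facility 4"},
    "AS2": {"Facility 2"},
    "AS3": {"Facility 3", "Facility 4"},
    "AS5": {"Facility 2", "Facility 3"},
}


def candidate_sources(affected_far_end, colocation=COLOCATION):
    """Facilities in which every affected far-end AS is colocated."""
    counts = Counter(
        facility
        for asn in affected_far_end
        for facility in colocation.get(asn, ())
    )
    total = len(affected_far_end)
    return sorted(f for f, count in counts.items() if count == total)


# Far-end ASes of the paths that dropped.
affected = ["AS1", "AS2", "AS5"]
print(candidate_sources(affected))
```

In this toy input the paths seen at Facility 4 drop as well, but only Facility 2 hosts every affected far-end AS, which points to it as the likely source.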
[Figure: detected outages over time, labeled London Telecity HE8/9 outage and London Telehouse North outage]
○ 76% of the outages not reported in popular mailing lists/websites
○ 90% accuracy, 93% precision (for trackable PoPs)
○ > 56% of the affected links are in a different country, > 20% in a different continent!
○ Median RTT rises by > 100 ms for rerouted paths during the AMS-IX outage
[Figure: number of affected links (log scale) and CDF versus distance from outage source (km); fraction of paths versus RTT (ms)]
passive BGP monitoring
impact of local incidents
transparency and resilience strategies