The Impact of Router Outages on the AS-Level Internet



SLIDE 1

The Impact of Router Outages on the AS-Level Internet

Matthew Luckie* - University of Waikato
Robert Beverly - Naval Postgraduate School
*work started while at CAIDA, UC San Diego

SIGCOMM 2017, August 24th 2017

SLIDE 2

Internet Resilience

Where are the Single Points of Failure?

(Figure: two example topologies, Example #A and Example #B, built from Customer Edge (CE) and Provider Edge (PE) routers.)

SLIDE 3

Internet Resilience

Where are the Single Points of Failure?

If the CE router fails, the network is disconnected, so the CE router is a Single Point of Failure (SPoF).

(Figure: Example #A, with Customer Edge (CE) and Provider Edge (PE) routers.)

SLIDE 4

Internet Resilience

Where are the Single Points of Failure?

If the CE router fails, the network has an alternate path available, so the CE router is NOT a Single Point of Failure (SPoF).

(Figure: Example #B, with Customer Edge (CE) and Provider Edge (PE) routers.)

SLIDE 5

Internet Resilience

Where are the Single Points of Failure?

If the PE router fails, the customer network is disconnected, so the PE router is a Single Point of Failure (SPoF).

(Figure: Example #B, with Customer Edge (CE) and Provider Edge (PE) routers.)

SLIDE 6

Challenges in topology analysis

  • Prior approaches analyzed static AS-level and router-level topology graphs, e.g. Nature 2000
  • Important AS-level and router-level topology might be invisible to measurement, such as backup paths, e.g. INFOCOM 2002
  • A router that appears to be central to a network’s connectivity might not be, e.g. AMS 2009

SLIDE 7

What we did

Large-scale (Internet-wide), longitudinal (2.5 years) measurement study to characterize the prevalence of Single Points of Failure (SPoF):

1. Efficiently inferred IPv6 router outage time windows
2. Associated routers with IPv6 BGP prefixes
3. Correlated router outages with the BGP control plane
4. Correlated router outages with the data plane
5. Validated inferences of SPoF with network operators

SLIDE 8

What we did

Identified IPv6 router interfaces from traceroute: 83K to 2.4M interfaces from CAIDA’s Archipelago traceroute measurements.

SLIDE 9

What we did

Probed router interfaces to infer outage windows. We used a single vantage point located at CAIDA, UC San Diego for the duration of this study.

SLIDE 10

What we did

(Animation, slides 10-19: successive probes of a router whose central counter advances 9290, 9291, 9292, 9293, 9294, … Each probe records the returned value: T1: 9290, T2: 9291, T3: 9292, T4: 9293, T5: 9294. The router then reboots, resetting the counter to 1, and subsequent probes record T6: 1, T7: 2, T8: 3.)

SLIDE 20

What we did

Probed router interfaces to infer outage windows using IPID. We infer a reboot when the time series of values returned from a router is discontinuous, indicating the router was restarted: T1: 9290, T2: 9291, T3: 9292, T4: 9293, T5: 9294, [Outage Window], T6: 1, T7: 2, T8: 3.
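The discontinuity rule can be sketched in a few lines. This is an illustration of the idea, not the authors' scamper implementation; `reset_ceiling` is an assumed threshold for deciding that a backwards jump means the counter restarted near zero rather than wrapped.

```python
# Sketch: a router assigning IPv6 fragment IDs from a central counter
# returns a monotonically increasing series; a sample that drops back
# toward zero indicates the counter was reset by a reboot.

def infer_outage_windows(samples, reset_ceiling=10_000):
    """samples: list of (timestamp, ipid) in probe order.
    Returns (t_last_before, t_first_after) windows bracketing inferred reboots.
    reset_ceiling (assumed parameter): how far the counter could plausibly
    have advanced since a restart for the drop to be called a reset."""
    windows = []
    for (t_prev, id_prev), (t_cur, id_cur) in zip(samples, samples[1:]):
        # Discontinuity: the value went backwards AND restarted near zero.
        if id_cur < id_prev and id_cur < reset_ceiling:
            windows.append((t_prev, t_cur))
    return windows

series = [(1, 9290), (2, 9291), (3, 9292), (4, 9293), (5, 9294),
          (6, 1), (7, 2), (8, 3)]
print(infer_outage_windows(series))  # [(5, 6)]
```

The returned window brackets the reboot between the last pre-outage probe and the first post-outage probe, matching the [Outage Window] marker in the series above.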

SLIDE 21

Why IPv6 fragment IDs?

  • IPv4 Fragment IDs: 16 bits, bursty velocity: every packet requires a unique ID
  • At 100 Mbps and 1500-byte packets, the Nyquist rate dictates a 4-second probing interval
  • IPv6 Fragment IDs: 32 bits, low velocity: IPv6 routers rarely send fragments
  • We average a 15-minute probing interval
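The 4-second figure for IPv4 follows from back-of-the-envelope arithmetic, which can be reproduced directly (a sketch of the calculation only; real probing must also tolerate loss and bursts):

```python
# How fast must we probe to observe every wrap of an ID counter
# at a given line rate?

def max_probe_interval(link_bps, pkt_bytes, id_bits):
    pkts_per_sec = link_bps / (pkt_bytes * 8)    # packets the link can carry
    wrap_period = (2 ** id_bits) / pkts_per_sec  # seconds to exhaust ID space
    return wrap_period / 2                       # Nyquist: sample twice per wrap

print(round(max_probe_interval(100e6, 1500, 16), 1))  # 3.9 (seconds, IPv4)
```

The same link would take over 143 hours to exhaust a 32-bit IPv6 counter, which is why a ~15-minute probing interval suffices.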

SLIDE 22

What we did

Correlated routers with prefixes using traceroute paths.

SLIDE 23

What we did

Correlated routers with prefixes using traceroute paths: 50-60 Ark VPs traceroute every routed IPv6 prefix every day.

(Figure: Ark VPs tracing toward prefixes 2001:db8:1::/48 and 2001:db8:2::/48.)
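A minimal sketch of the association step, assuming only a traceroute path and the routed prefix being traced (the study used CAIDA Ark data; the addresses below are hypothetical documentation-space examples):

```python
# Associate router interfaces seen on a traceroute path with the routed
# prefix being traced, noting which interfaces fall inside the prefix.
import ipaddress

def routers_for_prefix(trace_path, prefix):
    """trace_path: ordered router interface addresses from a traceroute.
    Returns (interface, inside_prefix) for each hop on the path."""
    net = ipaddress.ip_network(prefix)
    return [(hop, ipaddress.ip_address(hop) in net) for hop in trace_path]

path = ["2001:db8:ffff::1", "2001:db8:1::1"]  # hypothetical hops
print(routers_for_prefix(path, "2001:db8:1::/48"))
# [('2001:db8:ffff::1', False), ('2001:db8:1::1', True)]
```

Hops outside the prefix are still relevant: an upstream router can be a SPoF for a prefix it does not itself number.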

SLIDE 24

(Duplicate of Slide 23.)

SLIDE 25

What we did

Computed the distance of each router from the AS announcing the network.

(Figure: path from an Ark VP toward 2001:db8:1::/48, with the CE router labeled 1 and the PE router labeled 2 hops from the announcing AS. CE: Customer Edge; PE: Provider Edge.)
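One way to compute such a distance, assuming a traceroute path already annotated with the AS that owns each router interface. The sign convention here (first router inside the destination AS at distance 1, routers before the border counting down through 0) is an assumption for illustration, not necessarily the paper's exact convention:

```python
# Hop distance of each router on a path from the AS announcing the prefix.

def hop_distances(path_asns, dest_asn):
    """path_asns: AS number owning each router interface, in path order."""
    try:
        border = next(i for i, a in enumerate(path_asns) if a == dest_asn)
    except StopIteration:
        return None  # destination AS never appears on the path
    return [i - border + 1 for i in range(len(path_asns))]

# Hypothetical path: two hops in provider AS 64500, then destination AS 64501.
print(hop_distances([64500, 64500, 64501, 64501], 64501))  # [-1, 0, 1, 2]
```

Under this convention, the PE router of the figure sits at a non-positive distance (outside the destination AS) and the CE router at distance 1.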

SLIDE 26

What we did

Correlated router outage windows with the BGP control plane.

(Animation, slides 26-28: for prefix 2001:db8:2::/48, the router's IPID series, T1: 9290 through T5: 9294, [Outage Window], T6: 1, T7: 2, T8: 3, is aligned with BGP events from RouteViews: withdrawals (W) from Peer-1 and Peer-2 at T5.2 and from Peer-3 and Peer-4 at T5.3, then announcements (A) from all four peers at T5.8.)

SLIDE 29

What we did

Classified impact on BGP according to observed activity overlapping with the inferred outage:

  • Complete Withdrawal: all peers simultaneously withdrew the route for at least 70 seconds; the router is a Single Point of Failure (SPoF)
  • Partial Withdrawal: at least one peer withdrew the route for at least 70 seconds, but not all did
  • Churn: BGP activity for the prefix
  • No Impact: no observed BGP activity for the prefix
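These rules can be sketched as a small classifier. This is a simplification (an illustration, not the authors' pipeline): each peer contributes either `None` (it never withdrew the prefix during the window) or a single assumed (withdraw_time, reannounce_time) interval in seconds.

```python
MIN_WITHDRAWN = 70  # seconds, per the slide's threshold

def classify(intervals, churn_seen):
    """intervals: per-peer (withdraw, reannounce) interval or None.
    churn_seen: whether any other BGP activity was observed for the prefix."""
    held = [iv for iv in intervals if iv is not None]
    if held and len(held) == len(intervals):
        # overlap common to every peer's withdrawal interval
        start = max(w for w, a in held)
        end = min(a for w, a in held)
        if end - start >= MIN_WITHDRAWN:
            return "Complete Withdrawal (SPoF)"
    if any(a - w >= MIN_WITHDRAWN for w, a in held):
        return "Partial Withdrawal"
    return "Churn" if churn_seen else "No Impact"

print(classify([(100, 400), (110, 390), (105, 395)], True))  # Complete Withdrawal (SPoF)
print(classify([(100, 400), None], True))                    # Partial Withdrawal
```

Note the "simultaneously" requirement: the classifier intersects all peers' withdrawal intervals and demands the common overlap itself last 70 seconds.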

SLIDE 30

Data Collection Summary

What we did

  • Probed IPv6 routers at ~15 minute intervals from 18 Jan 2015 to 30 May 2017 (approx. 2.5 years)
  • 149,560 routers allowed reboots to be detected
  • We inferred 59,175 (40%) rebooted at least once, 750K reboots in total

(Figure: CDF of the number of outages per router.)

SLIDE 31

What we found

  • 2,385 (4%) of routers that rebooted (59K) we inferred to be a SPoF for at least one IPv6 prefix in BGP
  • Of SPoF routers, we inferred 59% to be customer edge routers; 8% provider edge; 29% within the destination AS
  • No covering prefix for 70% of withdrawn prefixes
  • During a one-week sample, covering prefix presence during withdrawal did not imply data plane reachability
  • IPv6 router reboots correlated with IPv4 BGP control plane activity

SLIDE 32

Limitations

  • Applicability to IPv4 depends on the router being dual-stack
  • Requires the IPID to be assigned from a counter
  • Cisco, Huawei, Vyatta, MikroTik, HP assign from a counter
  • 27.1% responsive for 14 days assigned from a counter
  • A router outage might end before all peers withdraw the route
  • Path exploration + Minimum Route Advertisement Interval (MRAI) + Route Flap Dampening (RFD)
  • Complex events: multiple router outages but one detected
  • We observed some complex events and filtered them out

SLIDE 33

Validation

Network              Reboots (✔/✘/?)   SPoF (✔/✘/?)
US University        7 / 0 / 8         7 / 0 / 8
US R&E backbone #1   2 / 0 / 3         3 / 2 / 0
US R&E backbone #2   3 / 0 / 1         0 / 0 / 4
NZ R&E backbone      11 / 0 / 22       4 / 2 / 27
Total                23 / 0 / 34       14 / 4 / 39

✔ = Validated Inference   ✘ = Incorrect Inference   ? = Not Validated

SLIDE 34

Validation

(Same table as Slide 33.)

Challenging to get validation data: operators often could only tell us about the last reboot.

SLIDE 35

Validation

(Same table as Slide 33.)

No falsely inferred reboots: we correctly observed the last known reboot of each router.

SLIDE 36

Validation

(Same table as Slide 33.)

We did not detect some SPoFs.

SLIDE 37

Data Collection Summary

(Figure: number of interfaces probed over time, Jan ’15 to Jan ’17, all vs. incrementing, with probing phases (a), (b), (c) marked.)

      PPS   List             Unresponsive
(a)   100   Static, 83K      12-24 hours
(b)   225   Static, 1.1M     12-24 hours
(c)   200   Dynamic, ~2.4M   7-14 days

SLIDE 38

Correlating BGP/router outages

Control: six hours prior to inferred outages, Feb 2015

(Figure: fraction of reboot/prefix pairs classified as Complete Withdrawal, Partial Withdrawal, or Churn, by distance of the router from the destination AS in IP hops, covering routers both outside and inside the destination AS.)

SLIDE 39

Correlating BGP/router outages

During the inferred outages, Feb 2015

(Figure: fraction of reboot/prefix pairs classified as Complete Withdrawal, Partial Withdrawal, or Churn, by distance of the router from the destination AS in IP hops.)

SLIDE 40

BGP Prefix Withdrawals: SPoF

(Figure: CDF of complete withdrawal duration, from under 1 minute to 16+ hours.)

44% less than 5 minutes, suggestive of router maintenance or router crash.

SLIDE 41

SPoF prefixes mostly single homed

(Figure: fraction of the population whose prefix was announced through a single upstream vs. multiple upstreams, by router hop distance across the PE/CE boundary.)

Especially SPoFs outside the destination AS, as expected.

SLIDE 42

Impact on IPv4 prefixes in BGP

(Figure: cumulative fraction of router outages vs. withdrawn peers / advertising peers, for a control period before the outage and during the outage.)

We examined IPv4 prefixes for a 5% sample of reboots. 19% of correlated IPv4 prefixes were withdrawn by at least 90% of peers during the router outage window.

SLIDE 43

Summary

  • Step towards root-cause analysis of inter-domain routing outages and events
  • Explore applicability of the method to measurement of other critical Internet infrastructure: DNS, Web, Email
  • In our 2.5 year sample of 59K routers that rebooted:
  • 4% (2.3K) were SPoF
  • SPoF were mostly confined to the edge: 59% customer edge
  • We released our code as part of scamper

https://www.caida.org/tools/measurement/scamper/

SLIDE 44

Backup Slides

SLIDE 45

Impact on IPv4 Services

We examined IPv4 prefixes for a 5% sample of reboots where at least 90% of peers withdrew during the router outage window.

Active Hosts: 39,107
Web: HTTP 25,592; HTTPS 16,321
SSH: 11,277
DNS: 7,922
Email: SMTP 7,383; IMAP 5,127

(Source: censys.io, April 2017)

SLIDE 46

Partial Withdrawals

(Figure: CDF of the fraction of peers withdrawing the route.)

50% of pairs had 1-2 peers withdraw the prefix; 10% of pairs had nearly all peers withdraw the prefix.

SLIDE 47

Degrees of ASes monitored

(Figure: cumulative fraction of rebooting ASes vs. AS degree, for Single Points of Failure and the monitored population.)

ASes that were inferred to have a SPoF were disproportionately low-degree ASes.

SLIDE 48

Activity for IPv4 prefixes in BGP

(Figure: cumulative fraction of router outages vs. peers sending updates / total peers, before and during the outage.)

At least 70% of peers reported BGP activity on IPv4 prefixes for 50% of the inferred router outages.

SLIDE 49

Reboot Window Durations

(Figure: CDF of reboot window duration, from under 1 minute to 16+ hours.)

Half the maximum reboot lengths were less than 30 minutes (~two probing rounds).

SLIDE 50

Router + BGP outage correlation

Four patterns of BGP (W)ithdraw and (A)nnounce events relative to the inferred outage window in the router's IP-ID sequence (… 10, 11, 12, [outage], 1, 2, 3 …):

  • Withdraw-Contained: W and A both fall inside the outage window
  • Outage-Contained: the outage window falls between W and A
  • Withdraw-Before: W precedes the outage window
  • Announce-After: A follows the outage window
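The temporal alignment can be sketched as a comparison of the (w, a) event pair against the window (o1, o2). The four names come from the slide; the exact boundary handling (<= vs <) here is an assumption for illustration:

```python
# Classify how a withdraw/announce pair aligns with an inferred outage window.

def correlate(w, a, o1, o2):
    if o1 <= w and a <= o2:
        return "Withdraw-Contained"  # both BGP events inside the window
    if w <= o1 and o2 <= a:
        return "Outage-Contained"    # outage window inside the W..A span
    if w < o1 <= a <= o2:
        return "Withdraw-Before"     # withdrawal preceded the window
    if o1 <= w <= o2 < a:
        return "Announce-After"      # re-announcement followed the window
    return "Uncorrelated"

print(correlate(3, 7, 2, 8))  # Withdraw-Contained
print(correlate(1, 9, 2, 8))  # Outage-Contained
```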

SLIDE 51

Data processing pipeline

(Figure: CAIDA IPv6 Topology supplies router targets to the Uptime Prober, which stores <ip, time, ipid> tuples in Cassandra; Route Views supplies <peer, time, prefix> tuples; inferred reboots, BGP correlation, and AS border distance combine to identify single points of failure.)

SLIDE 52

Inferring router position

(Figure: routers R1-R5 spanning AS X and AS Y, with interfaces x1-x3 and y1-y2. Case (a): interface addresses routed by Y appear in traceroute. Case (b): no interface addresses routed by Y appear in traceroute. The cases distinguish whether the border router is the Customer Edge (CE) or the Provider Edge (PE) router.)

SLIDE 53

Data Collection Summary

Period         (a) 18 Jan ’15 - 18 Oct ’16   (b) 18 Oct ’16 - 24 Feb ’17   (c) 24 Feb ’17 - 30 May ’17
Probing rate   100 pps                       225 pps                       200 pps
Interfaces     83K, seen Dec ’14             1.1M, seen Jun to Oct ’16     Dynamic, 2.4M in May ’17
Responsive     every round (~15 mins)        every round (~15 mins)        every round (~15 mins)
Unresponsive   12-24 hours                   12-24 hours                   7-14 days

SLIDE 54

Why IPv6 fragment IDs?

IPv4 ID values are 16 bits with bursty velocity, as every packet requires a unique value. At 100 Mbps and 1500-byte packets, the Nyquist rate dictates a 4-second probing interval.

(Figure: IPv4 header layout, highlighting the 16-bit identification field.)

SLIDE 55

Why IPv6 fragment IDs?

IPv6 ID values are 32 bits with low velocity, as systems rarely send fragmented packets.

(Figure: IPv6 header and Fragment extension header layout, highlighting the 32-bit identification field.)

SLIDE 56

Soliciting IPv6 Fragment IDs

(Packet exchange: we send a 1300-byte echo request and receive a 1300-byte echo reply; we then send a Packet Too Big message advertising an MTU of 1280; subsequent 1300-byte echo requests elicit fragmented echo replies (first fragment 1280 bytes) carrying a Fragment ID, e.g. 12345.)
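The solicited ID arrives in the IPv6 Fragment extension header, whose 8-byte layout RFC 8200 fixes as: next header (1 byte), reserved (1), fragment offset plus flags (2), identification (4). A minimal stdlib parser for that header (the sample bytes below are a hypothetical first fragment carrying the slide's example ID):

```python
# Parse the 8-byte IPv6 Fragment extension header (RFC 8200).
import struct

def parse_frag_header(hdr8):
    nxt, _res, off_flags, ident = struct.unpack("!BBHI", hdr8)
    return {"next_header": nxt,
            "offset": off_flags >> 3,       # fragment offset, 8-byte units
            "more_fragments": off_flags & 1,
            "id": ident}

# Hypothetical header: ICMPv6 payload (next header 58), first fragment
# with the M (more fragments) flag set, identification 12345.
hdr = struct.pack("!BBHI", 58, 0, 1, 12345)
print(parse_frag_header(hdr))
# {'next_header': 58, 'offset': 0, 'more_fragments': 1, 'id': 12345}
```

Collecting this identification field from successive fragmented echo replies yields exactly the per-probe counter samples used in the reboot inference.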