A Measurement Study of BGP Misconfiguration Ratul Mahajan, David - - PowerPoint PPT Presentation



SLIDE 1

A Measurement Study of BGP Misconfiguration

Ratul Mahajan, David Wetherall, and Tom Anderson University of Washington

SLIDE 2

djw // UW-CSE

2

Motivation

  • Routing protocols are robust against failures

– Meaning “fail-stop” link and node failures

  • But what about when nodes just don’t behave?

– Misconfigurations, implementation bugs, malicious attacks

  • We need to understand this to make availability guarantees

– Many colorful anecdotes, few systematic studies

  • BGP is rich ground for a study of misconfigurations

– Thousands of ISPs, many implementations, complex to configure

SLIDE 3

This talk

  • Peek at an in-progress BGP measurement study based on the RouteViews server

– Public routing table snapshots, taken every 2 hours, from ~50 different ISPs

  • Our goals:

– Identify the common types of misconfigurations
– Determine how frequently they occur
– Assess their impact on the Internet as a whole

  • Current focus is the analysis of origin changes (hijacks) and partial connectivity

SLIDE 4

Methodology

  • Define a model of acceptable BGP usage

– Deviations from the model are “misconfigurations”

  • Measure the occurrence of misconfigurations

– Use heuristics to attribute to the likely causes

  • Measure the impact of misconfigurations

– On other, well-defined, quantities of interest

  • Validate against actual ISP experiences

– Via an email survey

SLIDE 5

BGP in a nutshell

  • BGP is the routing protocol used in the Internet core, which is a graph of Autonomous Systems (ASes) or ISPs
  • Each AS announces paths to other ASes that it can use to reach given prefixes (blocks of IP addresses)
  • Announcements are aggregated where possible, e.g., one for many customers, rather than one per customer
  • Imagine paths growing from origins subject to policies (transit versus peering); packets follow the reverse direction

SLIDE 6

BGP in a nutshell (2)

  • 2 provides transit for 7; 7 reaches and is reached via 2
  • 4 and 5 peer; they exchange their customer traffic

[Figure: example AS graph with nodes 1 to 8, annotated with the AS paths used to reach ASes 4, 5 and 7.]
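The transit and peering relationships above can be sketched as an export rule. This is a minimal illustration that assumes the common convention in which routes learned from customers are announced to every neighbor, while routes learned from peers or providers are announced only to customers. The convention, the relationship labels, and the function name `should_export` are assumptions for illustration and do not come from the talk.

```python
# Hypothetical relationship labels, seen from the exporting AS's point of view.
CUSTOMER, PEER, PROVIDER = "customer", "peer", "provider"


def should_export(learned_from: str, export_to: str) -> bool:
    """Return True if a route learned from one kind of neighbor may be
    announced to another kind of neighbor under the assumed convention."""
    if learned_from == CUSTOMER:
        # Customer routes are announced to everyone (this is transit).
        return True
    # Peer and provider routes are only passed down to customers.
    return export_to == CUSTOMER


# AS 2 provides transit for AS 7, so it announces 7's routes to all neighbors.
print(should_export(CUSTOMER, PROVIDER))  # True
# AS 4 and AS 5 peer, so 4 does not pass routes learned from 5 to other peers.
print(should_export(PEER, PEER))          # False
```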

SLIDE 7

Why we need a usage model

  • BGP is defined by local operational practices, not global standards
  • A contrived example: botched pre-pending
  • Pre-pending by an AS is a hack used to make paths less attractive to others. It is not considered to be a loop.

– e.g., AS1 AS77 AS4 becomes AS1 AS77 AS77 AS77 AS4

  • What if AS77 announces AS1 AS77 AS66 AS77 AS4?
  • Is this a mistake, or a hack for enforcing policy?
SLIDE 8

A model of BGP usage

  • Private identifiers are not to be leaked in public
  • The origin AS owns the address space it announces
  • The advertised AS path matches the forwarding path
  • Announcements are aggregated where possible
  • AS paths obey policy constraints
  • Providers are connected to the entire Internet
  • Deviations are defined to be “misconfigurations”
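One rule of the model, that private identifiers should not leak, lends itself to a mechanical check. The sketch below assumes that "private identifiers" includes private-use AS numbers and uses the ranges reserved for private use today (64512 to 65534 for 16-bit ASNs, 4200000000 to 4294967294 for 32-bit ASNs). The 32-bit range postdates the talk, and the function names are hypothetical.

```python
# AS number ranges reserved for private use (assumed to be what the
# model means by "private identifiers"; private addresses are not covered).
PRIVATE_ASN_RANGES = [(64512, 65534), (4200000000, 4294967294)]


def is_private_asn(asn: int) -> bool:
    """True if the AS number falls in a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)


def leaked_private_asns(as_path):
    """Return the private-use AS numbers that appear in an AS path."""
    return [asn for asn in as_path if is_private_asn(asn)]


# 64496 and 64500 are documentation ASNs; 65001 is private-use.
print(leaked_private_asns([64496, 65001, 64500]))  # [65001]
```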
SLIDE 9

Impacts of misconfiguration

  • Alteration of selected paths

– Not what you preferred

  • Increased routing load

– More routing announcements to process

  • Loss of connectivity

– No paths at some/all locations that reach a prefix

  • The last is most serious and visible to users
  • The two deviations we focus on can affect connectivity
SLIDE 10

Measuring routes with incorrect origins

  • Are there easy ways to detect misconfigured origins?

– Multiple origins for a prefix; increasingly common practice
– Internet Routing Registries (IRRs); found to be inaccurate

  • We observe that origins tend to change on human timescales, except for failures and misconfigurations

– We analyze changes in the RouteViews BGP snapshots
– We divide them by duration (short vs. long-lived)
– Then we attribute probable causes to changes
– Finally we assess their impact on reachability
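The first two analysis steps can be sketched as follows. This is an illustration under stated assumptions, not the study's actual tooling: each snapshot is assumed to be a mapping from prefix to the set of origin ASes seen for it, and the one-day threshold is the talk's definition of a long-lived change. The function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta

# The talk defines long-lived changes as lasting more than one day.
LONG_LIVED = timedelta(days=1)


def origin_changes(before, after):
    """Return prefixes whose set of origin ASes differs between snapshots.

    Each snapshot maps prefix -> set of origin ASes. A prefix missing
    from a snapshot is treated as having no origins.
    """
    changed = {}
    for prefix in before.keys() | after.keys():
        old, new = before.get(prefix, set()), after.get(prefix, set())
        if old != new:
            changed[prefix] = (old, new)
    return changed


def classify_duration(first_seen, last_seen):
    """Label a change by how long the new state persisted."""
    return "long-lived" if last_seen - first_seen > LONG_LIVED else "short-lived"


before = {"192.0.2.0/24": {64496}, "198.51.100.0/24": {64497}}
after = {"192.0.2.0/24": {64496, 64499}, "198.51.100.0/24": {64497}}

print(sorted(origin_changes(before, after)))  # ['192.0.2.0/24']
print(classify_duration(datetime(2001, 9, 1, 0), datetime(2001, 9, 1, 6)))  # short-lived
```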

SLIDE 11

IRRs: do they detect incorrect origins?

BGP Table Snapshot: Sep 28, 2001

                       Total Prefixes   Registered Origins   Consistent Origin(s)   Inconsistent Origin(s)
Single Origin AS       115228           101952               70458 (69%)            31494 (31%)
Multiple Origin AS’s   1720             1523                 293 (19%)              1230 (81%)

SLIDE 12

Causes of origin changes

  • Long-lived changes last more than one day

Long-lived               Fluctuating              Conflicting
More Specific Added      Self Deaggregation       AS-Path Stripping
More Specific Deleted    Failures (unreachable)   Strip Deaggregation
Origin Added             Backups                  Extra Last Hop
Origin Deleted                                    Foreign Deaggregation
Origin Changed                                    Other
New Address Space
Address Space Deleted

SLIDE 13

Definitions of short-lived changes

                        Stable Announcements   Short-lived Announcements
Self Deaggregation      a.b.0.0/16 X-Y-Z       a.b.c1.0/24 X'-Y'-Z, a.b.c2.0/24 X'-Y'-Z
AS-Path Stripping       a.b.c.d/s X-Y-Z        a.b.c.d/s X'-Y
Strip Deaggregation     a.b.0.0/16 X-Y-Z       a.b.c1.0/24 X'-Y, a.b.c2.0/24 X'-Y
Extra Last Hop          a.b.0.0/16 X-Y-Z       a.b.c1.0/24 X'-Y'-Z-O, a.b.c2.0/24 X'-Y'-Z-O
Foreign Deaggregation   a.b.0.0/16 X-Y-Z       a.b.c1.0/24 X'-Y'-O, a.b.c2.0/24 X'-Y'-O
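These definitions can be read as a small decision procedure. The sketch below is one possible reading, assuming AS paths are lists with the origin last (so X-Y-Z has origin Z and upstream Y) and that each short-lived announcement is compared against a single stable one. The function name, the returned labels, and the order of the checks are assumptions for illustration, and the 10.1.x.x prefixes are placeholder values.

```python
import ipaddress


def classify_short_lived(stable_prefix, stable_path, new_prefix, new_path):
    """Name the pattern linking a stable announcement to a short-lived one.

    Paths are lists of AS identifiers with the origin AS last.
    """
    stable_net = ipaddress.ip_network(stable_prefix)
    new_net = ipaddress.ip_network(new_prefix)
    more_specific = new_net != stable_net and new_net.subnet_of(stable_net)

    stable_origin = stable_path[-1]
    stable_upstream = stable_path[-2] if len(stable_path) > 1 else None
    new_origin = new_path[-1]

    if new_origin == stable_origin:
        # Same origin announcing pieces of its own block.
        return "self-deagg" if more_specific else "no-origin-change"
    if new_origin == stable_upstream:
        # The stable origin has been stripped off the end of the path.
        return "strip-deagg" if more_specific else "as-path-strip"
    if len(new_path) > 1 and new_path[-2] == stable_origin:
        # A new AS appears beyond the stable origin.
        return "extra-last-hop"
    return "foreign-deagg" if more_specific else "other"


stable = ("10.1.0.0/16", ["X", "Y", "Z"])
print(classify_short_lived(*stable, "10.1.2.0/24", ["X2", "Y2", "Z"]))       # self-deagg
print(classify_short_lived(*stable, "10.1.0.0/16", ["X2", "Y"]))             # as-path-strip
print(classify_short_lived(*stable, "10.1.2.0/24", ["X2", "Y2", "Z", "O"]))  # extra-last-hop
```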

SLIDE 14

Distribution of Origin Changes

  1. More than 2% of the prefixes experience a change
  2. Less than a third of changes are long-lived
  3. Weekly pattern in the number of changes seen

[Figure: number of prefixes with origin changes over time, 8/1/01 to 9/26/01, by category: Conflicting (403), Fluctuating (1455), Long-lived (745); weekends are marked.]

SLIDE 15

Breakdown of Long-Lived Changes

[Figure: number of prefixes over time, 8/1/01 to 9/26/01, by type: More Specific Added (313), More Specific Deleted (260), Origin Added (35), Origin Deleted (32), Origin Change (31), Address Space Added (42), Address Space Deleted (29).]

SLIDE 16

Breakdown of Fluctuating Changes

[Figure: number of prefixes over time, 8/1/01 to 9/26/01, by type: Backups (4), Unreachable Failures (523), Self Deaggregation (928).]

SLIDE 17

Breakdown of Conflicting Changes

[Figure: number of prefixes over time, 8/1/01 to 9/26/01, by type: Other (52), Strip Deaggregation (20), AS-Path Stripping (18), Foreign Deaggregation (81), Extra Last Hop (233).]

SLIDE 18

Consulting the IRR when you see conflicts does not help

IRR suggests Conflicting cases contain misconfigs

[Figure: number of prefixes over time, 8/1/01 to 9/26/01, with two series: Conflicting and IRR.]

SLIDE 19

Validation via an email survey

  • 30% of emails bounce outright
  • More find their way to /dev/null

– “Your support request has been accepted by our team, a case has been opened with reference 12345 …”

  • Surprise and lack of a clue

– “Thanks for alerting us … I am a bit surprised …”
– “Ratul, … can you help us?”, “No idea really …”
– “I believe research has shown routes appear and disappear every day”

  • Defensiveness

– “Yes, we leaked … but took pre-emptive action right away …”
– “The information you are requesting is covered by NDA …”

  • Hard information and encouragement

– “You caught us. This is what happened …”
– “I enjoyed your NANOG talk …”

  • Interesting exercise in its own right …
SLIDE 20

Validation results

  • Caveat: these stats are for prefixes, not incidents.

Cause            Total   Replies   Misconfig    Connect?   False +ve
extra-last-hop   111     38        31 (82%)     7 (18%)    7 (18%)
as-path-strip    760     730       723 (99%)    2 (0%)     7 (1%)
self-deagg       1222    243       180 (73%)    42 (17%)   63 (26%)
other            91      36        24 (67%)     12 (33%)   12 (33%)
strip-deagg      150     85        82 (96%)     5 (6%)     3 (4%)
foreign-deagg    188     45        41 (91%)     18 (40%)   4 (10%)
all              2522    1177      1081 (92%)   86 (7%)    96 (8%)
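The counts in the validation table are internally consistent, which a few lines of arithmetic can confirm. The sketch below copies the prefix counts from the table and assumes the column order Total, Replies, Misconfig, Connect?, False +ve; the variable names are hypothetical. It checks that the per-cause rows add up to the "all" row and that misconfigurations plus false positives account for every reply.

```python
# (total, replies, misconfig, connect, false_positive) per cause,
# copied from the validation table; the column order is an assumption.
ROWS = {
    "extra-last-hop": (111, 38, 31, 7, 7),
    "as-path-strip": (760, 730, 723, 2, 7),
    "self-deagg": (1222, 243, 180, 42, 63),
    "other": (91, 36, 24, 12, 12),
    "strip-deagg": (150, 85, 82, 5, 3),
    "foreign-deagg": (188, 45, 41, 18, 4),
}
ALL = (2522, 1177, 1081, 86, 96)

# Sum each column across causes and compare with the "all" row.
column_sums = tuple(sum(column) for column in zip(*ROWS.values()))
print(column_sums == ALL)  # True

# Every reply is classed as either a misconfiguration or a false positive.
replies_accounted_for = all(
    misconfig + false_positive == replies
    for _, replies, misconfig, _, false_positive in ROWS.values()
)
print(replies_accounted_for)  # True
```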

SLIDE 21

Causes of origin changes

Real misconfigurations:

  • Buggy ACLs/route-maps
  • Relying on upstream
  • Forgot auto-summary
  • Redistribution
  • Over-aggregating
  • Hijacking
  • Old routers …

False positives:

  • Just testing
  • Failures
  • Temp. load balancing
  • Migration
  • Re-numbering
SLIDE 22

Speculation

  • Complexity of configuration is a root cause of error

– Scope for greater “type-checking”

  • Operational practices are diverse

– Makes systematic identification of errors difficult

  • Authoritative databases will be inaccurate

– Use for automatic blocks is problematic

  • ISPs depend on one another to a significant degree

– “I thought you’d handle that”

  • Connectivity can persist despite many misconfigs

– Route leaks, redistribution, de-aggregation, …

SLIDE 23

Also: Measuring partial connectivity

  • Advertised address space is not reachable from all places in the Internet!

  • Causes:

– Convergence delays
– Route flap damping
– Policy (filtering on prefix length, or commercial relationships)

  • Failures do not lead to partial connectivity
  • We can distinguish the above causes by timescale
SLIDE 24

Partial connectivity analysis

  • Identify partially connected address space (!= prefix) from the BGP table
  • Consult BGP snapshots 15 minutes before and after to identify partial connectivity due to convergence delays
  • Correlate against partial connectivity across days to differentiate between route flap damping and filtering-based partial connectivity
  • Verify using public looking glasses to guard against restrictive export policies and default pointing
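The timescale-based attribution can be sketched as a simple decision function. This is one interpretation of the steps above, not the study's actual procedure: it assumes that partial connectivity absent from a neighboring snapshot points to convergence delay, that partial connectivity recurring across days points to filtering, and that the remainder is attributed to route flap damping. The function and parameter names are hypothetical.

```python
def classify_partial_connectivity(partial_15min_before,
                                  partial_15min_after,
                                  partial_on_other_days):
    """Attribute one block of partially connected address space to a cause.

    Each argument is a boolean saying whether the same address space was
    also partially connected at that other point in time.
    """
    if not (partial_15min_before and partial_15min_after):
        # Gone within minutes: assumed to be a transient of convergence.
        return "convergence delay"
    if partial_on_other_days:
        # Persists across days: assumed to be policy (filtering).
        return "filtering"
    # Lasts longer than convergence but not across days.
    return "route flap damping"


print(classify_partial_connectivity(False, True, False))  # convergence delay
print(classify_partial_connectivity(True, True, False))   # route flap damping
print(classify_partial_connectivity(True, True, True))    # filtering
```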

SLIDE 25

Partial connectivity: results

  • Expressed as a percentage of advertised address space
  • Convergence: 0.005-0.02%
  • Route flap damping: 0.1-0.8%
  • Filtering: 0.7%
SLIDE 26

Most partially connected prefixes are /24’s
Most partially connected address space is due to /16’s

[Figure: Prefix Length Distribution of Partially Connected Address Space. x-axis: prefix length (16 to 32); y-axis: fraction; two series: Address Space and Prefixes.]
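The gap between counting prefixes and counting address space follows from simple arithmetic on IPv4 prefix lengths, sketched below. A single /16 covers as many addresses as 256 separate /24s, so a few /16s can dominate the address space even when /24s dominate the prefix count. The helper name `addresses_in` and the 10.1.0.0/16 prefix are illustrative placeholders.

```python
import ipaddress


def addresses_in(prefix_length: int) -> int:
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_length)


print(addresses_in(16))                       # 65536
print(addresses_in(24))                       # 256
print(addresses_in(16) // addresses_in(24))   # 256

# The standard library agrees with the formula.
print(ipaddress.ip_network("10.1.0.0/16").num_addresses)  # 65536
```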

SLIDE 27

Tentative conclusions

  • There is considerable churn in prefix origins

– More than 2% of the prefixes are affected every day
– 1/3 to 1/2 of this churn is due to misconfigurations

  • The causes of misconfigurations are diverse
  • Connectivity is surprisingly robust

– ~ 3 in 4 incidents do not cause reachability to be lost

  • The address space is not fully connected

– ~1% persistently partially connected at any time

  • Many thanks to the ISP community for its support
  • Feedback: http://www.cs.washington.edu/homes/ratul/bgp/