LIFEGUARD: Practical Repair of Persistent Route Failures



slide-1
SLIDE 1

LIFEGUARD: Practical Repair of Persistent Route Failures

Ethan Katz-Bassett (USC)

Colin Scott, David Choffnes, Italo Cunha, Valas Valancius, Nick Feamster, Harsha Madhyastha, Tom Anderson, Arvind Krishnamurthy

This work is generously funded in part by Google, Cisco and the NSF.


slide-5
SLIDE 5

Long Outages Cause Most Unavailability

  • Monitor outages from Amazon’s EC2
  • Fraction of outages of duration ≤ X?
  • Fraction of unavailability due to outages of duration ≤ X?

86% of outages last less than 5 minutes.
But longer outages account for 90% of the unavailability.
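The two questions on this slide can be worked through on a small example. The sketch below is only illustrative: the function name `outage_fractions` and the sample durations are assumptions, not the EC2 measurements from the talk. It shows how most outages can be short while most downtime comes from the few long ones.

```python
def outage_fractions(durations_min, threshold_min):
    """Return (fraction of outages, fraction of total downtime) contributed
    by outages lasting at most threshold_min minutes."""
    if not durations_min:
        raise ValueError("need at least one outage")
    short = [d for d in durations_min if d <= threshold_min]
    frac_outages = len(short) / len(durations_min)
    frac_downtime = sum(short) / sum(durations_min)
    return frac_outages, frac_downtime


# Hypothetical outage durations in minutes (made up for illustration).
sample = [1, 2, 2, 3, 3, 4, 120, 300]
frac_outages, frac_downtime = outage_fractions(sample, 5)
print(f"{frac_outages:.0%} of outages last at most 5 minutes")
print(f"they account for {frac_downtime:.1%} of the downtime")
```

With these made-up numbers, three quarters of the outages are short, yet they contribute only a few percent of the total downtime.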

slide-9
SLIDE 9

Operators Struggle to Locate Failures

“Traffic attempting to pass through Level3’s network in the Washington, DC area is getting lost in the abyss. Here's a trace from Verizon residential to Level3.”
Outages mailing list, Dec. 2010

Mailing List User 1
  1 Home router
  2 Verizon in Baltimore
  3 Verizon in Philly
  4 Alter.net in DC
  5 Level3 in DC
  6 * * *
  7 * * *

Mailing List User 2
  1 Home router
  2 Verizon in DC
  3 Alter.net in DC
  4 Level3 in DC
  5 Level3 in Chicago
  6 Level3 in Denver
  7 * * *
  8 * * *

slide-12
SLIDE 12

Reasons for Long-Lasting Outages

Long-term outages are:

  • Repaired over slow, human timescales
  • Not well understood
  • Caused by routers advertising paths that do not work
      • E.g., corrupted memory on line card causes black hole
      • E.g., bad cross-layer interactions cause failed MPLS tunnel
  • Complicated by lack of visibility into or control over routes in other ISPs

slide-13
SLIDE 13

Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically

  • Locate the ISP / link causing the problem
      • Building blocks
      • Example
      • Description of technique
  • Suggest that other ISPs reroute around the problem

slide-15
SLIDE 15

Building blocks for failure isolation

LIFEGUARD can use:

  • Ping to test reachability
  • Traceroute to measure forward path
  • Distributed vantage points (VPs)
      • PlanetLab for our experiments
      • Some can source spoof
  • Reverse traceroute to measure reverse path (NSDI ’10)
  • Atlas of historical forward/reverse paths between VPs and targets

slide-16
SLIDE 16

How does LIFEGUARD locate a failure?

Before outage:

[Figure: paths between Source: GMU and Target: Smartkom. Successive builds add Level3, Telia, TransTelecom, and ZSTTK, then Rostelecom and NTT.]

  • Historical atlas enables reasoning about changes
  • Traceroute yields only path from GMU to target
  • Reverse traceroute reveals path asymmetry

slide-19
SLIDE 19

How does LIFEGUARD locate a failure?

During outage:

[Figure: Source: GMU, Level3, Telia, ZSTTK, Rostelecom, NTT, TransTelecom, Target: Smartkom. Successive builds add: "?"; "Problem with ZSTTK?"; a vantage point (VP); "Ping? Fr:VP"; "Ping! To:VP".]

  • Forward path works

slide-30
SLIDE 30

How does LIFEGUARD locate a failure?

During outage:

[Figure: Source: GMU, Level3, Telia, ZSTTK, Rostelecom, NTT, TransTelecom, Target: Smartkom. Successive builds add: "NTT:Ping? Fr:GMU"; "GMU:Ping! Fr:NTT".]

  • Forward path works

slide-37
SLIDE 37

How does LIFEGUARD locate a failure?

During outage:

[Figure: Source: GMU, Level3, Telia, ZSTTK, Rostelecom, NTT, TransTelecom, Target: Smartkom. The first build adds: "Rostele: Ping? Fr:GMU".]

  • Forward path works
  • Rostelecom is not forwarding traffic towards GMU

slide-42
SLIDE 42

How LIFEGUARD Locates Failures

LIFEGUARD:

  1. Maintains background historical atlas
  2. Isolates direction of failure, measures working direction
  3. Tests historical paths in failing direction in order to prune candidate failure locations
  4. Locates failure as being at the horizon of reachability
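Steps 3 and 4 can be sketched as a walk along one historical path in the failing direction. This is a minimal illustration, not LIFEGUARD's implementation: the function `locate_horizon`, the hop ordering, and the probe results are all assumptions made for the example. The probe callback stands in for "does a ping reply from this hop reach the source?".

```python
from typing import Callable, Optional, Sequence, Tuple


def locate_horizon(
    path_toward_source: Sequence[str],
    reaches_source: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (farthest hop that still reaches the source, suspected hop).

    path_toward_source lists the hops of a historical path in the failing
    direction, ordered from the far end toward the source.
    """
    farthest_working: Optional[str] = None
    # Probe outward, starting from the hop nearest the source.
    for hop in reversed(path_toward_source):
        if not reaches_source(hop):
            return farthest_working, hop  # first silent hop is the suspect
        farthest_working = hop
    return farthest_working, None  # every hop answered: no horizon found


# Hypothetical hop order and probe results, for illustration only.
historical_reverse_path = ["Smartkom", "ZSTTK", "Rostelecom", "NTT"]
answers_gmu = {"NTT"}  # hops whose replies reach GMU in this example
working, suspect = locate_horizon(
    historical_reverse_path, lambda hop: hop in answers_gmu
)
print(working, suspect)
```

The returned pair marks the horizon of reachability: the last hop that still answers and the first one that does not.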

slide-43
SLIDE 43

Our Approach and Outline

LIFEGUARD: Locating Internet Failures Effectively and Generating Usable Alternate Routes Dynamically

  • Locate the ISP / link causing the problem
  • Suggest that other ISPs reroute around the problem
      • What would we like to add to BGP to enable this?
      • What can we deploy today, using only available protocols and router support?

slide-45
SLIDE 45

Our Goal for Failure Avoidance

  • Enable content / service providers to repair persistent routing problems affecting them, regardless of which ISP is causing them

Setting

  • Assume we can locate problem
  • Assume we are multi-homed / have multiple data centers
  • Assume we speak BGP
  • We use BGP-Mux to speak BGP to the real Internet: 5 US universities as providers

slide-46
SLIDE 46

Self-Repair of Forward Paths

Straightforward: Choose a path that avoids the problem.

slide-50
SLIDE 50

A Mechanism for Failure Avoidance

Forward path: Choose route that avoids ISP or ISP-ISP link
Reverse path: Want others to choose paths to my prefix P that avoid ISP or ISP-ISP link X

  • Want a BGP announcement AVOID(X,P):
      • Any ISP with a route to P that avoids X uses such a route
      • Any ISP not using X need only pass on the announcement
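The intended effect of AVOID(X,P) on one ISP's route choice can be sketched as a filter over its candidate AS paths. AVOID is a hypothetical primitive in the talk, so everything here is an assumption for illustration: the function name `route_under_avoid`, the shortest-path tie-break, returning `None` when no compliant route exists, and modeling X as an ISP only (not an ISP-ISP link).

```python
from typing import List, Optional, Sequence


def route_under_avoid(
    candidates: Sequence[List[str]], avoid: str
) -> Optional[List[str]]:
    """Pick a route to the prefix that does not traverse `avoid`."""
    compliant = [path for path in candidates if avoid not in path]
    if not compliant:
        return None  # assumption: no fallback policy is modeled here
    return min(compliant, key=len)  # assumption: shortest AS path wins


# Illustrative candidate AS paths that UW might hold toward WS.
uw_candidates = [["L3", "ATT", "WS"], ["Sprint", "Qwest", "WS"]]
chosen = route_under_avoid(uw_candidates, "L3")
print(chosen)
```

An ISP whose current route already avoids X keeps it, since that route passes the filter unchanged.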

slide-51
SLIDE 51

Ideal Self-Repair of Reverse Paths

[Figure: network diagram. Successive builds add AVOID(L3,WS) labels (one, then two, then three).]

slide-56
SLIDE 56

Do paths exist that AVOID problem?

LIFEGUARD repairs outages by instructing others to avoid particular routes.

Q: Do alternative routes exist?
A: Alternate policy-compliant paths exist in 90% of simulated AVOID(X,P) announcements.

  • Simulated 10 million AVOIDs on actual measured routes.

slide-57
SLIDE 57

Practical Self-Repair of Reverse Paths

[Figure: network diagram of BGP paths toward WS. Path labels added across builds:]

  WS
  ATT → WS
  Qwest → WS
  L3 → ATT → WS
  Sprint → Qwest → WS
  AISP → Qwest → WS
  UW → L3 → ATT → WS

slide-64
SLIDE 64

Practical Self-Repair of Reverse Paths

BGP loop prevention encourages switch to working path.

[Figure: network diagram of BGP paths toward WS. The AVOID(L3,WS) label in the first build is replaced by the announcement WS → L3 → WS in the next.]

Path labels carried over from before the poisoned announcement:

  WS
  ATT → WS
  Qwest → WS
  L3 → ATT → WS
  Sprint → Qwest → WS
  AISP → Qwest → WS
  UW → L3 → ATT → WS

Path labels added across builds:

  WS → L3 → WS
  Qwest → WS → L3 → WS
  AISP → Qwest → WS → L3 → WS
  Sprint → Qwest → WS → L3 → WS
  ATT → WS → L3 → WS
  ? (appears when the L3 → ATT → WS label disappears)
  UW → Sprint → Qwest → WS → L3 → WS
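The mechanism behind the WS → L3 → WS announcement can be sketched in a few lines: an AS drops any route whose AS path already contains its own AS number, so inserting L3 into the path makes L3 reject it while other ASes accept and propagate it. The helper names `poison`, `accepts`, and `propagate` are assumptions for this sketch, and real BGP applies many more checks and policies.

```python
from typing import List, Optional


def poison(origin: str, target: str) -> List[str]:
    """AS path announced by `origin` to poison `target`: origin, target, origin."""
    return [origin, target, origin]


def accepts(asn: str, as_path: List[str]) -> bool:
    """BGP loop prevention: reject a path that already contains our own AS."""
    return asn not in as_path


def propagate(asn: str, as_path: List[str]) -> Optional[List[str]]:
    """Return the path `asn` would pass on, or None if it rejects the route."""
    if not accepts(asn, as_path):
        return None
    return [asn] + as_path


announcement = poison("WS", "L3")
via_att = propagate("ATT", announcement)
via_l3 = propagate("L3", announcement)
print(announcement, via_att, via_l3)
```

Here ATT passes the route on with itself prepended, while L3 sees its own AS in the path and discards it, which is what pushes L3's customers toward other routes.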

slide-73
SLIDE 73

Stuff I Don’t Have Time to Talk About

Results from real poisonings
  • Poisoning in the wild / poisoning anomalies
  • Case study of restoring connectivity

Making poisoning flexible
  • Monitoring broken path while it is disabled
  • Allowing ISPs w/o alternatives to use disabled route

LIFEGUARD’s scalability
  • Overhead and speed of failure location
  • Router update load if many ISPs deploy our approach

Alternatives to poisoning
  • Compatibility with secure routing (BGPSEC, etc.)
  • Comparing to other route control mechanisms

slide-74
SLIDE 74

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.

Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) Under certain circumstances, we can disable a link without disabling the full ISP.
(b) We can speed BGP convergence by carefully crafting announcements.

slide-75
SLIDE 75

What if some routes in an ISP still work?

[Figure: topology with nodes O, A, B1, B2, C1, C2, C3, C4, D1, and D2. Legend: network link, transitive link, original (pre-poisoning) path, new (post-poisoning) path.]

  • We only want C3 to change its route, to avoid A-B2
  • Forward direction is easy: choose a different route
  • Poisoning seems blunt, disabling an entire ISP
      [Figure labels: O-O-O and O-A-O; "?" marks in the last build]
  • Selective advertising via just D1 is also blunt
      [Figure labels: "?" marks in the last build]
  • If D1 and D2 (transitively) connect to different PoPs of A, selectively poison via D2 and not D1
      [Figure labels: O-O-O and O-A-O]
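Selective poisoning amounts to sending different AS paths to different providers. The sketch below is an illustration under assumptions: the function `selective_announcements` is hypothetical, and pairing the O-O-O label with D1 and the O-A-O label with D2 is inferred from the bullet above rather than stated on the slide.

```python
from typing import Dict, Iterable, List, Set


def selective_announcements(
    origin: str,
    poisoned: str,
    providers: Iterable[str],
    poison_via: Set[str],
) -> Dict[str, List[str]]:
    """Build one AS path per provider, poisoning only via `poison_via`."""
    announcements: Dict[str, List[str]] = {}
    for provider in providers:
        # Unpoisoned paths use the origin as filler so both have length 3.
        middle = poisoned if provider in poison_via else origin
        announcements[provider] = [origin, middle, origin]
    return announcements


plan = selective_announcements("O", "A", ["D1", "D2"], {"D2"})
for provider, path in plan.items():
    print(provider, "-".join(path))
```

Routes learned through D1 carry no poison, so the parts of A reached that way can keep using them, while the parts of A that hear the announcement through D2 reject it.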

slide-91
SLIDE 91

Can poisoning approximate AVOID effects?

LIFEGUARD’s poisoning repairs outages by disabling routes to induce route exploration.

Q: Does poisoning disrupt working routes?
A: No. As I will describe:
(a) “Selective poisoning” can avoid 73% of links without disabling entire AS.
  • Real-world results from 5 provider BGP-Mux testbed
(b) We can speed BGP convergence by carefully crafting announcements.

slide-92
SLIDE 92

Naive Poisoning Causes Transient Loss

  • Some ISPs may have working paths that avoid problem ISP X
  • Naively, poisoning causes path exploration even for these ISPs
  • Path exploration causes transient loss

[Figure: topology with nodes O, A, B, C, D, E, and F, labeled AVOID(X,P).
Paths before poisoning: O, A-O, B-A-O, D-A-O, E-D-A-O, F-B-A-O.
O then announces O-X-O. Intermediate builds mix old paths (such as E-D-A-O and F-B-A-O) with new ones (such as A-O-X-O, B-A-O-X-O, and D-A-O-X-O).
Paths in the last build: O-X-O, A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, F-B-A-O-X-O.]

slide-100
SLIDE 100

Prepend to Reduce Path Exploration

  • Most routing decisions based on: (1) next hop ISP (2) path length
  • Keep these fixed to speed convergence
  • Prepending prepares ISPs for later poison

[Figure: topology with nodes O, A, B, C, D, E, and F, labeled AVOID(X,P).
Paths with prepending, before poisoning: O-O-O, A-O-O-O, B-A-O-O-O, D-A-O-O-O, E-D-A-O-O-O, F-B-A-O-O-O.
O then announces O-X-O. Intermediate builds mix prepended paths (such as E-D-A-O-O-O and F-B-A-O-O-O) with poisoned ones (such as A-O-X-O, B-A-O-X-O, and D-A-O-X-O).
Paths in the last build: O-X-O, A-O-X-O, B-A-O-X-O, D-A-O-X-O, E-D-A-O-X-O, F-B-A-O-X-O.]
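The point of prepending can be checked directly: if the baseline announcement is already O-O-O, then switching to the poisoned O-X-O changes neither the path length nor the next hop that a neighbor sees. The helper names below are assumptions for this sketch, and the comparison covers only the two criteria named on the slide.

```python
from typing import List


def baseline_path(origin: str) -> List[str]:
    """Prepended baseline: the origin repeated three times (O-O-O)."""
    return [origin] * 3


def poisoned_path(origin: str, problem: str) -> List[str]:
    """Poisoned announcement of the same length (O-X-O)."""
    return [origin, problem, origin]


def as_seen_by(neighbor: str, announcement: List[str]) -> List[str]:
    """Path a neighbor stores after prepending its own AS."""
    return [neighbor] + announcement


before = as_seen_by("A", baseline_path("O"))
after = as_seen_by("A", poisoned_path("O", "X"))
same_length = len(before) == len(after)
same_next_hop = before[1] == after[1]
print(before, after, same_length, same_next_hop)

# Without prepending, the baseline would be just [O], so poisoning would
# lengthen the path from 1 to 3 and could trigger path exploration.
naive_growth = len(poisoned_path("O", "X")) - len(["O"])
```

Because both criteria stay fixed for ISPs that never used X, they have little reason to explore other routes when the poison arrives.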

slide-105
SLIDE 105

Prepending Speeds Convergence

[Figure: CDF of peer convergence time. x-axis: Peer Convergence Time (minutes), 1 to 8. y-axis: Cumulative Fraction of Convergences (CDF), 0.65 to 0.9999. Series: "Prepend, no change" and "No prepend, no change".]

  • With no prepend, only 65% of unaffected ISPs converge instantly
  • With prepending, 95% of unaffected ISPs re-converge instantly, and 98% in under half a minute
  • Also speeds convergence to new paths for affected peers

slide-106
SLIDE 106

Conclusion

  • We increasingly depend on the Internet, but availability lags
  • Much of Internet unavailability due to long-lasting outages
  • LIFEGUARD: Let edge networks reroute around failures
  • Location challenge: Find problem, given unidirectional failures and tools that depend on connectivity
      • Use reverse traceroute, isolate directions, use historical view
  • Avoidance challenge: Reroute without participation of transit networks
      • BGP poisoning gives control to the destination
      • Well-crafted announcements ease concerns