CS 557 BGP Convergence Improved BGP Convergence via Ghost Flushing - - PowerPoint PPT Presentation



slide-1
SLIDE 1

CS 557 BGP Convergence

Improved BGP Convergence via Ghost Flushing Bremler-Barr, Afek, Schwarz, 2003 BGP-RCN: Improving Convergence Through Root Cause Notification Pei, Azuma, Massey, Zhang, 2005

Spring 2013

slide-2
SLIDE 2

BGP Path Exploration

[Figure: topology with nodes A (dest.), B, C, D, E, F, G, H, I, Z; path labels (B A), (C B A), (E D B A), (H G F E A); six empty-path ( ) messages]

  • Obsolete paths (C B A) and (E D B A) explored before converging on valid path (I H G F A)

Z’s candidate paths, step by step:

  1. ( ) (C B A) (E D B A) (I H G F A)
  2. ( ) ( ) (E D B A) (I H G F A)
  3. ( ) ( ) ( ) (I H G F A)

slide-3
SLIDE 3

Potential to Explore N! Paths

[Figure: nodes A, B, C, D, S; link C,S fails]

Paths explored by A:

  • A,C,S (link C,S fails)
  • A,B,C,S
  • A,D,C,S
  • A,B,D,C,S
  • A,D,B,C,S
  • ….
  • No route

Theoretically can explore N! paths before no route
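To make the list of explored paths concrete, here is a minimal sketch that enumerates every loop-free path from A to S. It assumes (the slide does not state the topology explicitly) that A, B, C, D are fully meshed and that S attaches only to C; `TOPOLOGY` and `simple_paths` are illustrative names.

```python
# Assumed topology: A, B, C, D fully meshed; S reachable only through C.
TOPOLOGY = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D", "S"],
    "D": ["A", "B", "C"],
    "S": ["C"],
}


def simple_paths(graph, src, dst, prefix=None):
    """Return every loop-free path from src to dst as a list of node lists."""
    prefix = (prefix or []) + [src]
    if src == dst:
        return [prefix]
    paths = []
    for nxt in graph[src]:
        if nxt not in prefix:  # path-vector loop check: never revisit a node
            paths.extend(simple_paths(graph, nxt, dst, prefix))
    return paths


# Shortest paths first, the order a shortest-path policy would try them in.
paths_from_a = sorted(simple_paths(TOPOLOGY, "A", "S"), key=lambda p: (len(p), p))
for p in paths_from_a:
    print(",".join(p))
```

Under these assumptions A has five candidate paths, all of which go through the failed link C,S. Each extra meshed node multiplies the number of orderings A can try, which is where the factorial worst case comes from.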

slide-4
SLIDE 4

Some Routing Terminology

  • Tup = route to a previously unreachable prefix is announced
  • Tdown = route to a currently reachable prefix is withdrawn and no replacement exists
  • Tshort = route to a currently reachable prefix switches to a shorter path
  • Tlong = route to a currently reachable prefix switches to a longer path
  • Other terminology

– Tdown = fail-down
– Tlong = fail-over
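The four event types can be told apart by comparing the path before and after a change. The sketch below is one way to express that; `classify_event` is a hypothetical helper, a path is a tuple of AS names, and `None` stands for "no route".

```python
def classify_event(old_path, new_path):
    """Classify a route change using the slide's terminology.

    A path is a tuple of AS names; None means the prefix is unreachable.
    Returns "Tup", "Tdown", "Tshort", "Tlong", or None if no term applies.
    """
    if old_path is None and new_path is None:
        return None  # nothing changed
    if old_path is None:
        return "Tup"  # previously unreachable prefix is announced
    if new_path is None:
        return "Tdown"  # withdrawn, no replacement (fail-down)
    if len(new_path) < len(old_path):
        return "Tshort"  # switched to a shorter path
    if len(new_path) > len(old_path):
        return "Tlong"  # switched to a longer path (fail-over)
    return None  # same length: not covered by the four terms


print(classify_event(("B", "A"), ("I", "H", "G", "F", "A")))
```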

slide-5
SLIDE 5

BGP MRAI Time and Convergence

n Minimum Route Advertisement Interval (MRAI) timer:

n Within M=30 seconds, at most one announcement from A to B n not for the first announcement, not for the withdrawal

P1 P2 P3 P4 P5 w

A’s path changes: Msgs from A to B: P1 time=0 time=30 time=60

P4 P5 w

  • b. delay convergence
  • a. suppress transient changes

n Impact:

slide-6
SLIDE 6

[BAS03] Improving BGP Convergence

  • Objective:

– Improve convergence time after a legitimate route change.

  • Approach:

– Flush out ghost information that is blocked by the MRAI timer

– P4 on the previous slide is ghost information

  • Contributions:

– Simple, easily deployed, and clever approach to improve convergence
– Theoretical understanding of convergence behavior
– Improves on 2002 result from Pei et al.

slide-7
SLIDE 7

Basic Model

  • Each AS is treated as one node

– Though not strictly required in ghost flushing

  • Routers use shortest path routing policy

– Helps with analysis, but not strictly required

  • SPVP (simple path vector protocol) approximates BGP
  • MRAI timer between updates

– Minimum Route Advertisement Interval
– Two consecutive updates must be at least MRAI time apart.

slide-8
SLIDE 8

Ghost Information

  • Obsolete path information stored at node

– Could be preferred route or backup route stored at a node.

  • MRAI timer can block removal of ghost information

– Router cannot announce its current choice of paths because it recently announced a different path.
– Typical MRAI value is 30 seconds
– Can lead to increased convergence time and increased chance of selecting ghost paths.

slide-9
SLIDE 9

Ghost Flushing

  • Very Simple Rule for BGP Routers

When the route to P is updated to a worse path and the MRAI timer is delaying the path announcement, send withdraw(P) (no route to P).
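The rule can be sketched as a decision made each time a router's best path changes. This is an illustration, not the authors' implementation: `on_route_change` and the message tuples are hypothetical, and "worse" is taken to mean "longer AS path", which matches the shortest-path policy assumed in the basic model.

```python
def on_route_change(old_path, new_path, now, next_allowed):
    """Return the messages to send immediately after a best-path change.

    old_path / new_path: tuples of AS names, or None for "no route".
    now: current time; next_allowed: time at which the MRAI timer expires.
    """
    if new_path is None:
        return [("withdraw", None)]  # withdrawals are never delayed
    if now >= next_allowed:
        return [("announce", new_path)]  # timer is not blocking
    # The MRAI timer is delaying the announcement of new_path.
    worse = old_path is not None and len(new_path) > len(old_path)
    if worse:
        # Ghost flushing: tell the neighbor right away that the old path is
        # gone; the new path is announced later, when the timer expires.
        return [("withdraw", None)]
    return []  # a better path simply waits for the timer


print(on_route_change(("B", "A"), ("C", "B", "A"), now=5, next_allowed=30))
```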

slide-10
SLIDE 10

Path Length and Time

  • Assume Tdown Event
  • Let H = message passing time
  • Claim: at time K*H, every message or node has ASPath length > K

– By induction. True at time H since neighboring routers received withdraw
– Assume true at time K*H: all paths are longer than K.
– Suppose a path of length K+1 or less exists at time (K+1)*H

  • It must have come from some peer P with a path of length K or less.
  • That path must have been removed at P by time K*H

– Withdraw or longer path announced by time K*H
– Must be received by time (K+1)*H (contradiction)

slide-11
SLIDE 11

Implications of Time/Length

  • Shown that at time K*H, every message or node has ASPath length > K
  • Implications:

– Longest possible path has length N
– At time N*H, all paths are longer than longest possible path
– By time N*H, all routers know that path is withdrawn

  • Convergence time is O(N*H)

– Reduced from N*MRAI

slide-12
SLIDE 12

Message Complexity

  • Claim: at most 2 messages sent during each MRAI timer interval
  • Resulting complexity

– Number of MRAI rounds is NH/(MRAI)
– Updates per round is 2E

Complexity is O(2ENH/MRAI) (BGP complexity is EN)
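To get a feel for the Tdown expressions, the sketch below plugs numbers into them. The function name and the sample values for N, E, and H are assumptions for illustration; the results are the raw expressions from the slides (order-of-growth terms), not measured figures.

```python
def ghost_flushing_tdown(n, e, h, mrai=30.0):
    """Evaluate the Tdown expressions from the slides.

    n: N, length of the longest possible path; e: E, number of links;
    h: H, message passing time in seconds; mrai: MRAI value in seconds.
    """
    return {
        "time_ghost_flushing": n * h,  # convergence time N*H
        "time_bgp": n * mrai,  # reduced from N*MRAI
        "mrai_rounds": n * h / mrai,  # NH/MRAI
        "msgs_ghost_flushing": 2 * e * n * h / mrai,  # 2E updates per round
        "msgs_bgp": e * n,  # BGP complexity EN
    }


# Assumed sample values: N=100, E=300, H=0.5 s, MRAI=30 s.
print(ghost_flushing_tdown(n=100, e=300, h=0.5))
```

The ratio between the two time terms is simply MRAI/H, so the gain grows as message passing gets faster relative to the timer.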

slide-13
SLIDE 13

Tlong (fail-over) Complexity

  • Expect good results, but no theoretical results presented here

– Simulations show solid improvement
– Other simulations (not shown here) show some surprises…

  • Theoretical results later determined by Pei et al.

– Covered next week….

slide-14
SLIDE 14

[PA+05] Improving BGP Convergence

  • Objective:

– Improve convergence time after a legitimate route change.

  • Approach:

– Signal the cause of the path failure

  • Contributions:

– Dramatic reduction in convergence time plus ability to improve other parts of BGP
– Theoretical understanding of convergence behavior

slide-15
SLIDE 15

BGP Path Exploration Revisited

[Figure: topology with nodes A (dest.), B, C, D, E, F, G, H, I, Z; path labels (B A), (C B A), (E D B A), (H G F E A); six empty-path ( ) messages]

  • Observation: if Z knew that [B A] failed, it could have avoided the obsolete paths

Z’s candidate paths, step by step:

  1. ( ) (C B A) (E D B A) (I H G F A)
  2. ( ) ( ) (E D B A) (I H G F A)
  3. ( ) ( ) ( ) (I H G F A)

slide-16
SLIDE 16

Root Cause Notification

  • The node that detects the failure attaches the root cause to its msg
  • Other nodes copy the root cause to outgoing messages

[Figure: topology with nodes A, B, C, D, E, F, G, H, I, Z; path labels (B A), (C B A), (E D B A), (H G F E A); three messages, each "( ), [B A] failure"]

Z’s candidate paths: ( ) (C B A) (E D B A) (I H G F A)

  • The first msg is enough for Z to remove all the obsolete paths
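A minimal sketch of how Z might use the root cause: on receiving "[B A] failure", it drops every candidate path that traverses link [B A]. `uses_link` and `prune_candidates` are hypothetical helper names, and paths are plain tuples of node names.

```python
def uses_link(path, link):
    """True if link (a pair of adjacent nodes) appears in path, either direction."""
    a, b = link
    hops = list(zip(path, path[1:]))  # consecutive node pairs along the path
    return (a, b) in hops or (b, a) in hops


def prune_candidates(candidates, failed_link):
    """Keep only the candidate paths that do not cross the failed link."""
    return [p for p in candidates if not uses_link(p, failed_link)]


# Z's candidate paths from the slide.
candidates = [
    ("B", "A"),
    ("C", "B", "A"),
    ("E", "D", "B", "A"),
    ("I", "H", "G", "F", "A"),
]
remaining = prune_candidates(candidates, ("B", "A"))
print(remaining)
```

Only (I H G F A) survives, so a single message carrying the root cause replaces the three separate withdrawals Z would otherwise wait for.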

slide-17
SLIDE 17

Overlapping Events

[Figure: nodes A (dest.), B, D, E, Z; two "[B A] failure" messages in flight]

  • Another topology change happens before the previous change’s convergence finishes.
  • Propagation along lower path is slower than upper path

slide-18
SLIDE 18

Overlapping Events

[Figure: nodes A (dest.), B, D, E, Z; messages "[B A] failure", "[B A] recovery", "[B A] recovery"; Path: (B A)]

slide-19
SLIDE 19

Overlapping Events

[Figure: nodes A (dest.), B, D, E, Z; messages "[B A] failure" and "[B A] recovery"; Path: (B A); result marked "Wrong!"]

  • Observation: need to order the relative timing of the root causes

slide-20
SLIDE 20

Solution: adding sequence number

[Figure: nodes A (dest.), B, D, E, Z; two messages "[B A] failure, seqnum=1"]

  • Node B maintains a sequence number for link [B A]
  • Incremented each time the link status changes

slide-21
SLIDE 21

Solution: adding sequence number

[Figure: nodes A (dest.), B, D, E, Z; messages "[B A] failure, seqnum=1" and "(B A), [B A] recovery, seqnum=2"; Path: (C B A), seqnum of [B A]=2]

slide-22
SLIDE 22

Solution: adding sequence number

[Figure: nodes A (dest.), B, D, E, Z; message "[B A] failure, seqnum=1"; Path: (B A), seqnum of [B A]=2]

  • Sequence number orders the relative timing of the root causes
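The ordering idea can be sketched as a per-link table that accepts a root cause only if its sequence number is newer than the one already recorded. `LinkStatusTable` and its methods are hypothetical names for illustration, not part of any BGP implementation.

```python
class LinkStatusTable:
    """Remember the newest known status of each link, ordered by seqnum."""

    def __init__(self):
        self._latest = {}  # link -> (seqnum, is_up)

    def update(self, link, seqnum, is_up):
        """Apply a root cause; return False if it is stale and was ignored."""
        known = self._latest.get(link)
        if known is not None and known[0] >= seqnum:
            return False  # an equal or newer event for this link was seen
        self._latest[link] = (seqnum, is_up)
        return True

    def is_up(self, link):
        """Links with no recorded root cause are assumed to be up."""
        known = self._latest.get(link)
        return True if known is None else known[1]


table = LinkStatusTable()
# The recovery (seqnum=2) arrives first over the fast path ...
table.update(("B", "A"), seqnum=2, is_up=True)
# ... then the older failure (seqnum=1) arrives late and is ignored.
accepted = table.update(("B", "A"), seqnum=1, is_up=False)
print(accepted, table.is_up(("B", "A")))
```

Without the sequence number, the late failure message would overwrite the recovery and Z would wrongly discard path (B A).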

slide-23
SLIDE 23

Evaluation: analysis and simulation

n Two types of topology changes:

H B Z F A dest. I G

n Fail-over: nodes

switch to worse paths

n Fail-down: destination becomes unreachable

A dest.

slide-24
SLIDE 24

Fail-down convergence delay (worst case) bound

  • RCN: d * h
  • BGP: (N-1) * (h+M)

– Length of the longest possible path (N)
– M: MRAI value
– h: nodal processing delay, h seconds per hop
– d: network diameter

  • Along shortest path: it takes at most d*h seconds
  • Withdrawals are not delayed by MRAI!
  • d << N-1 and h << M

[Figure: chain A (dest.), B, C, Z; withdrawals (w) forwarded hop by hop, h seconds per hop]
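A quick numeric comparison of the two worst-case bounds, as a sketch: `fail_down_bounds` is a hypothetical helper, and the sample values of N, d, and h are assumptions picked only to show the effect of d << N-1 and h << M.

```python
def fail_down_bounds(n, d, h, m=30.0):
    """Worst-case fail-down delay bounds from the slide, in seconds.

    n: N in the BGP formula; d: network diameter;
    h: per-hop delay in seconds; m: MRAI value in seconds.
    """
    return {
        "bgp": (n - 1) * (h + m),  # every step may wait for the MRAI timer
        "rcn": d * h,  # withdrawals travel the shortest path undelayed
    }


# Assumed sample values: N=110, d=10, h=0.5 s, M=30 s.
bounds = fail_down_bounds(n=110, d=10, h=0.5)
print(bounds)
```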

slide-25
SLIDE 25

Fail-down simulation results

[Plot: Convergence Time. x-axis: Number of nodes (14, 28, 56, 112); y-axis: Seconds (log scale, 1 to 1000); series: BGP, RCN]

  • 2-3 orders of magnitude reduction

slide-26
SLIDE 26

Border nodes in fail-over convergence

[Figure: destination and nodes A, B, C, D, H, I, J, Z, split into unaffected nodes and affected nodes]

Border node Z:

  • connected to an unaffected node H
  • its eventual path is through H

Z’s eventual path has always been available

slide-27
SLIDE 27

RCN’s fail-over delay bound

  • RCN: (M + 2*h) * d_affected

– d_affected: diameter of the sub-graph of affected nodes

  • Node D’s convergence:

– Phase 1: Z receives the root cause: h * d_affected
– Phase 2: Z’s path is propagated to D (MRAI delay applies in this phase): (M+h) * d_affected

  • First message is not delayed by MRAI!

[Figure: destination and nodes A, B, C, D, H, Z, split into unaffected nodes and affected nodes]
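The RCN bound is just the sum of the two phase delays. Written out, with T_RCN as an assumed name for the fail-over delay and d_affected for the diameter of the affected sub-graph:

```latex
\begin{align*}
T_{\text{RCN}}
  &\le \underbrace{h \cdot d_{\text{affected}}}_{\text{Phase 1}}
     + \underbrace{(M + h) \cdot d_{\text{affected}}}_{\text{Phase 2}} \\
  &= (M + 2h) \cdot d_{\text{affected}}
\end{align*}
```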

slide-28
SLIDE 28

BGP’s fail-over delay bound

  • BGP: (M+h) * min{d’ - J, |V_affected| + d_affected - 1}
  • Node D’s convergence:

– Phase 1: Z explores paths shorter than Z’s eventual path
– Phase 2: the same as in RCN: (M+h) * d_affected

[Figure: destination and nodes A, B, C, D, H, Z, split into unaffected nodes and affected nodes]

slide-29
SLIDE 29

Fail-over simulation results

[Plot: x-axis: Number of nodes (14, 28, 56, 112); y-axis: Seconds (5 to 25); series: BGP, RCN]

  • BGP does fine: < (M+h) * d’
  • d’: 2~6

– d’: length of the longest path from any affected node to the destination
– Constructed topologies with large d’: RCN has much more pronounced improvement

slide-30
SLIDE 30

n Transmission & storage of a path: doubled

path:seqnum (Z C B A):(3 2 2 1)

n Storage overhead in the routing table:

n doubled

n Transmission overhead reduced

n 1~2 orders of magnitudes

reduction in msg counts

RCN Overhead

slide-31
SLIDE 31

n Reducing negative impact of MRAI:

n [Griffin:ICNP01], Ghost-Flushing [Bremler-Barr:Infocom03]

n don’t deal with path exploration

n Reducing path exploration

n Consistency Assertion [Pei:Infocom02] n path exploration still exists

n Explicitly signaling failure

n RCO [Luo:Globecom02], BGP-CT [Wattenhofer:talkslides03]:

may result in wrong routing decision

n EPIC [Chandrasheka:Infocom05]: encoding difference

Related Work