OUTLINE California Fault Lines: Understanding the Causes and Impact - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Internet Measurement

Huaiyu Zhu, Rim Kaddah CS538 Fall 2011

slide-2
SLIDE 2

OUTLINE

  • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
  • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

slide-3
SLIDE 3

OUTLINE

  • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
  • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

slide-4
SLIDE 4

Why Study Failure

  • Failure is a reality for large networks
  • Achieving high availability requires engineering the network to be robust to failure
  • Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures

slide-5
SLIDE 5

CENIC Network

  • Serves California educational institutions
  • Over 200 routers
  • 5 years of data
  • Three types of components:
      • The Digital California (DC) network
      • The High-Performance Research (HPR) network
      • Customer-premises equipment (CPE)

slide-6
SLIDE 6

Contribution

  • Methodology to reconstruct historical failure events of the CENIC network
  • Uses only commonly available data; no additional instrumentation is needed
  • Analysis of the network based on the failure measurements
slide-7
SLIDE 7

Reconstruction

What data are available to reconstruct a failure 4 years later?

  • Syslog
      • Describes interface state changes
  • Router configuration files
      • Map interfaces to links
  • Operational announcements on the mailing list

Data are not intended for failure reconstruction!
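The reconstruction idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tooling: the log line format, the router and interface names, the timestamps, and the `INTERFACE_TO_LINK` map are all hypothetical stand-ins for what syslog and the router configuration files would provide.

```python
from datetime import datetime

# Hypothetical map derived from router configuration files:
# (router, interface) -> link name. Not CENIC's real naming.
INTERFACE_TO_LINK = {
    ("rtr-a", "ge-0/0/1"): "rtr-a<->rtr-b",
    ("rtr-b", "ge-1/0/3"): "rtr-a<->rtr-b",
}


def parse_line(line):
    """Parse an assumed 'ISO-timestamp router interface UP|DOWN' log line."""
    stamp, router, interface, state = line.split()
    return datetime.fromisoformat(stamp), router, interface, state


def reconstruct_failures(lines):
    """Pair each link's first DOWN with its next UP into (link, start, end)."""
    down_since = {}  # link -> time the current failure started
    failures = []
    for when, router, interface, state in sorted(parse_line(l) for l in lines):
        link = INTERFACE_TO_LINK.get((router, interface))
        if link is None:
            continue  # interface is not mapped to any known link
        if state == "DOWN":
            # Both ends of a link may report the event; keep the earliest DOWN.
            down_since.setdefault(link, when)
        elif state == "UP" and link in down_since:
            failures.append((link, down_since.pop(link), when))
    return failures


log = [
    "2008-03-01T10:00:00 rtr-a ge-0/0/1 DOWN",
    "2008-03-01T10:00:01 rtr-b ge-1/0/3 DOWN",
    "2008-03-01T10:05:00 rtr-a ge-0/0/1 UP",
    "2008-03-01T10:05:02 rtr-b ge-1/0/3 UP",
]
failures = reconstruct_failures(log)
```

In this made-up log the two routers report the same outage, and the sketch merges their messages into a single failure event on one link.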

slide-8
SLIDE 8

Validation

  • Internal consistency
      • Use the administrator announcements to validate the reconstructed event history.
  • External consistency
      • CAIDA Skitter project (now Ark) to validate UP states.
      • Route Views project to validate DOWN states.
slide-9
SLIDE 9

Overview of Link Failures

slide-10
SLIDE 10

Overview of Link Failures

slide-11
SLIDE 11

Overview of Link Failures

  • Vertical banding
      • V1: a network-wide IS-IS configuration change requiring a router restart
      • V2: a network-wide software upgrade
      • V3: a network-wide configuration change in preparation for IPv6
  • Horizontal banding
      • H1: a series of failures on a link between a core router and a County of Education office (hardware)
      • H2: this link experienced over 33,000 short-duration failures (fiber cut)

slide-12
SLIDE 12

CDFs of Individual Failure Events

slide-13
SLIDE 13

Various Link Hardware Types

slide-14
SLIDE 14

Cause of Failure

slide-15
SLIDE 15

Failure Events

slide-16
SLIDE 16

Summary

  • Engineering for failure requires real data
  • Data has historically been difficult to obtain
  • Methodology to perform historical failure analysis with low-quality data sources
  • Shared our findings in the CENIC network
      • Reliability of individual components
      • Causes of failures
      • Impact of failure
slide-17
SLIDE 17

OUTLINE

  • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
  • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

slide-18
SLIDE 18

Key Questions

  • How can routing events degrade end-to-end path performance?
  • How do topological properties and routing policies affect performance degradation?

slide-19
SLIDE 19

Approach

  • Study end-to-end performance under realistic topologies.
  • Investigate several metrics to characterize end-to-end loss, delay, and out-of-order packets.
  • Characterize the kinds of routing changes that impact end-to-end path performance.
  • Analyze the impact of topology, routing policies, the MRAI timer, and iBGP configurations on end-to-end path performance.

slide-20
SLIDE 20

Experiment Methodology

  • A multi-homed prefix
      • BGP Beacon prefix: 192.83.230.0/24
  • Controlled routing changes
      • Failover events: the Beacon changes from being connected to both providers to being connected to a single provider.
      • Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers.

[Figure: the Beacon and its providers ISP 1 and ISP 2, shown for a failover event and a recovery event]
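The two event types are defined purely by the set of providers the Beacon is attached to before and after a change. A minimal sketch of that definition follows; the function name and the provider labels are illustrative assumptions, not part of the original study.

```python
def classify_event(before, after):
    """Classify a Beacon connectivity change by its provider sets."""
    before, after = set(before), set(after)
    if len(before) == 2 and len(after) == 1 and after < before:
        return "failover"   # both providers -> a single provider
    if len(before) == 1 and len(after) == 2 and before < after:
        return "recovery"   # a single provider -> both providers
    return "other"          # e.g. a reset of the Beacon connectivity


event = classify_event({"ISP1", "ISP2"}, {"ISP1"})
```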

slide-21
SLIDE 21

Controlled Routing Changes

Time schedule (GMT) for BGP Beacon routing transitions

  • 12 routing events every day
      • 8 Beacon events: failover events and recovery events
      • 4 events for resetting the Beacon connectivity

slide-22
SLIDE 22

Active Probing

[Figure: hosts A, B, and C probing across the Internet to the Beacon host, which connects through ISP 1 and ISP 2]

  • Goal: capture the impact of routing changes on end-to-end performance.
  • Probes are sent from 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix).
  • Three probing methods:
      • Back-to-back traceroutes
      • Back-to-back pings
      • UDP probing (50 ms interval)

Data-plane performance metrics from active probing (traceroute, ping, UDP probing): packet loss, delay, and out-of-order packets.

slide-23
SLIDE 23

Packet Loss

Loss burst: consecutive UDP probing packets lost during a routing change event.

[Figure panels: Failover, Recovery]
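As a rough illustration of the loss-burst metric, the sketch below computes burst lengths from the sequence numbers of the probes that arrived. It assumes probes are numbered 0 to `sent_count - 1` in send order, which is a simplification introduced here rather than a detail taken from the study.

```python
def loss_bursts(sent_count, received_seqs):
    """Return the lengths of runs of consecutive lost probes, in send order."""
    received = set(received_seqs)
    bursts, run = [], 0
    for seq in range(sent_count):
        if seq in received:
            if run:
                bursts.append(run)  # a received probe ends the current burst
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)  # burst still open when probing stopped
    return bursts


# Probes 3, 4, 5 and probe 8 were lost.
bursts = loss_bursts(10, [0, 1, 2, 6, 7, 9])
```

With probes sent every 50 ms, a burst of three lost probes corresponds to roughly 150 ms without a delivered probe.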

slide-24
SLIDE 24

Packet Delay

Round-trip delays from the probe host to the Beacon host (one-way delays would suffer from clock skew between hosts).

[Figure panels: Failover, Recovery]

slide-25
SLIDE 25

Out-of-order Packets

  • Number of reorderings (number of packets that arrive out of order)
  • Reordering offset

[Figure panels: Failover, Recovery]
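One way to compute these two metrics from the arrival order of sequence-numbered probes is sketched below. The definitions are assumptions made for illustration: a packet counts as reordered if at least one packet with a higher sequence number arrived before it, and its offset is the number of such packets. The slides do not spell out the exact definitions, so the study's own may differ.

```python
def reordering_stats(arrival_seqs):
    """Return (number of reordered packets, list of their offsets)."""
    offsets = []
    for pos, seq in enumerate(arrival_seqs):
        # Count packets sent later (higher sequence number) that arrived earlier.
        overtaken_by = sum(1 for earlier in arrival_seqs[:pos] if earlier > seq)
        if overtaken_by:
            offsets.append(overtaken_by)
    return len(offsets), offsets


# Packet 2 arrives after packets 3 and 4.
count, offsets = reordering_stats([0, 1, 3, 4, 2, 5])
```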

slide-26
SLIDE 26

How Do Routing Failures Occur? (Failover)

Prefer-customer routing policy: routes received from a provider’s customers are always preferred over those received from its peers.

[Figure: topology with the Beacon in AS 0, routers R1 to R6, Provider 1 and Provider 2, customer and peer links, and AS paths (2 0) and (1 0)]
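The prefer-customer policy can be sketched as a ranking applied before AS-path length. In the sketch below, only the customer-over-peer ordering comes from the slide; ranking provider routes last, the tie-break on path length, and the route dictionaries are assumptions added for illustration.

```python
# Lower rank means more preferred. Customer-over-peer is from the slide;
# ranking provider routes last is an added assumption.
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}


def best_route(routes):
    """Pick a route by neighbor relationship first, then by AS-path length."""
    return min(
        routes,
        key=lambda r: (PREFERENCE[r["learned_from"]], len(r["as_path"])),
    )


routes = [
    {"learned_from": "peer", "as_path": [1, 0]},
    {"learned_from": "customer", "as_path": [3, 2, 0]},  # longer, still wins
]
chosen = best_route(routes)
```

The customer route is chosen even though its AS path is longer, which is the behavior the policy statement describes.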

slide-27
SLIDE 27

How Do Routing Failures Occur? (Failover, contd.)

[Figure: topology with the Beacon in AS 0, routers R1 to R9, Provider 1, Provider 2, and Provider 3, peer links, and AS paths (2 0) and (1 0)]

No-valley routing policy: peers do not transit traffic from one peer to another.
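The no-valley policy is usually expressed as an export rule. A minimal sketch, assuming the common three-way relationship model (customer, peer, provider) plus a `"self"` label for locally originated routes; these labels and the function name are illustrative, not taken from the paper.

```python
def may_export(learned_from, export_to):
    """No-valley export rule for a route, given neighbor relationships."""
    if export_to == "customer":
        return True  # customers may receive every route
    # Peers and providers only receive customer routes and our own routes.
    return learned_from in ("customer", "self")


peer_to_peer = may_export("peer", "peer")
```

Under this rule, a router that falls back from a customer route to a peer route has to withdraw the route from its other peers, which is one way reachability can be lost temporarily during a failover.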

slide-28
SLIDE 28

How Do Routing Failures Occur? (Recovery)

[Figure: routers R1 to R4 in AS 0 between the Beacon, Provider 1, and Provider 2, showing path (0) announcements and a withdrawal of path (2 0)]

  • 1. Path (0) is recovered at R3.
  • 2. R3 sends the path to R2.
  • 3. R2 sends a withdrawal to R1.
  • 4. R3 sends the recovered path to R1.
  • 5. R1 regains its connection to the Beacon.

iBGP constraint: a route learned from one iBGP router cannot be re-advertised to another iBGP router.
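Step 3 follows from the iBGP constraint: once a router's best route is one it learned over iBGP, it has nothing it is allowed to offer its other iBGP peers. A minimal sketch of that decision, assuming a full-mesh iBGP setup without route reflectors and assuming the router had previously announced a route to the peer; the function name and string labels are hypothetical.

```python
def message_to_ibgp_peer(best_route_learned_via):
    """Decide what to send an iBGP peer after the best route changes.

    Assumes a route was previously announced to that peer.
    """
    if best_route_learned_via == "ibgp":
        # The new best route came from an iBGP peer, so it cannot be passed on;
        # the old announcement is withdrawn instead.
        return "withdraw"
    return "announce"


# A router switches from its eBGP route to a route learned over iBGP.
message = message_to_ibgp_peer("ibgp")
```

Between receiving this withdrawal and receiving the recovered path directly from its originator, the peer may have no route at all, which matches the gap between steps 3 and 4.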

slide-29
SLIDE 29

Summary

  • During failover and recovery events
      • Routing events impact packet loss significantly.
      • Routing failures contribute to end-to-end packet loss significantly.
      • Routing events can lead to long packet round-trip delays and reordering.
  • Routing policies and iBGP configuration play a major role in causing packet loss during routing events.

slide-30
SLIDE 30

Discussion

  • How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides?
  • How could we exploit the previous results to improve end-to-end performance?
  • How realistic is the topology considered in the second paper?

slide-31
SLIDE 31

References

  • Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006.
  • Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006.
  • Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines. SIGCOMM 2010.
  • Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.