OUTLINE California Fault Lines: Understanding the Causes and Impact - - PowerPoint PPT Presentation
Internet Measurement
Huaiyu Zhu, Rim Kaddah
CS538 Fall 2011
OUTLINE
- California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
- A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.
Why Study Failure
- Failure is a reality for large networks
- Achieving high availability requires engineering the network to be robust to failure
- Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures
CENIC Network
- Serving California educational institutions
- Over 200 routers
- 5 years of data
- Three Types of Components:
- The Digital California (DC) network
- The High-Performance Research (HPR) network
- Customer-premises equipment (CPE)
Contribution
- Methodology to reconstruct historical failure events of the CENIC network
- Uses only commonly available data; no additional instrumentation is needed
- Analysis of the network based on the failure measurements
Reconstruction
What data are available to reconstruct a failure 4 years later?
- Syslog: describes interface state changes
- Router configuration files: map interfaces to links
- Operational announcements on the mailing list
Data are not intended for failure reconstruction!
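The reconstruction idea can be sketched as follows. This is only an illustration, assuming a simplified, hypothetical event format (timestamp, router, interface, state) rather than real syslog output, and a hypothetical `iface_to_link` map standing in for what would be parsed from the router configuration files. It is not the authors' actual pipeline.

```python
from collections import namedtuple

Failure = namedtuple("Failure", "link start end")

def reconstruct_failures(events, iface_to_link):
    """Pair each 'down' message with the next 'up' message on the same link.

    events: iterable of (timestamp, router, interface, state) tuples, where
            state is 'down' or 'up' (hypothetical simplified format).
    iface_to_link: dict mapping (router, interface) to a link name,
            standing in for information taken from router configs.
    """
    down_since = {}   # link -> timestamp of the first unmatched 'down'
    failures = []
    for ts, router, iface, state in sorted(events):
        link = iface_to_link.get((router, iface))
        if link is None:
            continue  # interface does not belong to a known link
        if state == "down":
            # Both ends of a link may report; keep the earliest report.
            down_since.setdefault(link, ts)
        elif state == "up" and link in down_since:
            failures.append(Failure(link, down_since.pop(link), ts))
    return failures

# Hypothetical example: both routers on link r1-r2 log the same outage.
events = [
    (100, "r1", "ge-0/0", "down"),
    (101, "r2", "ge-1/0", "down"),
    (160, "r1", "ge-0/0", "up"),
    (161, "r2", "ge-1/0", "up"),
]
iface_to_link = {("r1", "ge-0/0"): "r1-r2", ("r2", "ge-1/0"): "r1-r2"}
failures = reconstruct_failures(events, iface_to_link)
```

Here the two reports from either end of the link collapse into a single failure interval. Real logs would also need handling for lost or duplicated messages, which this sketch ignores.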
Validation
- Internal consistency: using the administrator announcements to validate the reconstructed event history.
- External consistency: the CAIDA Skitter project (now Ark) validates UP states; the Route Views project validates DOWN states.
Overview of Link Failures
- Vertical banding
- V1: a network-wide IS-IS configuration change requiring a router restart
- V2: a network-wide software upgrade
- V3: a network-wide configuration change in preparation for IPv6
- Horizontal banding
- H1: a series of failures on a link between a core router and a County of Education office (hardware)
- H2: this link experienced over 33,000 short-duration failures (fiber cut)
CDFs of Individual Failure Events
Various Link Hardware Types
Cause of Failure
Failure Events
Summary
- Engineering for failure requires real data
- Data has historically been difficult to obtain
- Methodology to perform historical failure analysis with low-quality data sources
- Shared our findings in the CENIC network
- Reliability of individual components
- Causes of failures
- Impact of failure
OUTLINE
- California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
- A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.
Key Questions
- How can routing events degrade end-to-end path performance?
- How do topological properties and routing policies affect performance degradation?
Approach
- Study end-to-end performance under realistic topologies.
- Investigate several metrics to characterize end-to-end loss, delay, and out-of-order packets.
- Characterize the kinds of routing changes that impact end-to-end path performance.
- Analyze the impact of topology, routing policies, the MRAI timer, and iBGP configurations on end-to-end path performance.
Experiment Methodology
- A multi-homed prefix
- BGP Beacon prefix: 192.83.230.0/24
- Controlled Routing Changes
- Failover events: the Beacon changes from being connected to both providers to being connected to a single provider.
- Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers.
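The two event types above can be expressed as a small classifier over the Beacon's connectivity before and after a transition. This is a minimal sketch; the provider labels `"ISP1"` and `"ISP2"` and the function name are illustrative assumptions, and any other transition (for example a connectivity reset) is simply labeled `"other"`.

```python
BOTH = frozenset({"ISP1", "ISP2"})  # hypothetical provider labels

def classify_transition(before, after):
    """Classify a Beacon connectivity change as failover, recovery, or other."""
    before, after = frozenset(before), frozenset(after)
    single_before = len(before) == 1 and before < BOTH
    single_after = len(after) == 1 and after < BOTH
    if before == BOTH and single_after:
        return "failover"   # two providers -> one provider
    if single_before and after == BOTH:
        return "recovery"   # one provider -> two providers
    return "other"

kind = classify_transition({"ISP1", "ISP2"}, {"ISP2"})
```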
[Figure: Beacon connectivity to ISP 1 and ISP 2 during a failover event and a recovery event]
Controlled Routing Changes
Time schedule (GMT) for BGP Beacon routing transitions
- 12 routing events every day
- 8 beacon events: failover events and recovery events
- 4 for resetting the Beacon connectivity
Active Probing
[Figure: hosts A, B, and C probing the Beacon host across the Internet via ISP 1 and ISP 2]
- Goal: capture the impact of routing changes on end-to-end performance.
- Probes are sent from 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix).
- Three probing methods:
- Back-to-back traceroutes
- Back-to-back pings
- UDP probing (50 ms interval)
Data Plane Performance Metrics
- Active probing methods: traceroute, ping, UDP probing
- Metrics: packet loss, delay, out-of-order packets
Packet Loss
Loss burst: consecutive UDP probing packets lost during a routing change event.
[Figures: failover and recovery panels]
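As a rough illustration of the loss-burst definition, the sketch below finds maximal runs of consecutive lost probes. It assumes probes carry sequence numbers 0 to `sent_count - 1`; the function name and this numbering are assumptions made for the example, not details from the paper.

```python
def loss_bursts(sent_count, received_seqs):
    """Return the lengths of maximal runs of consecutive lost probes."""
    received = set(received_seqs)
    bursts = []
    run = 0  # length of the current run of lost probes
    for seq in range(sent_count):
        if seq in received:
            if run:
                bursts.append(run)
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)  # a burst that lasts until the final probe
    return bursts

# Probes 2-3 and 7-8 were lost out of 10 sent.
bursts = loss_bursts(10, [0, 1, 4, 5, 6, 9])
```

With probes sent every 50 ms, a burst of n lost probes corresponds to an outage of roughly n × 50 ms, which is only an estimate at the granularity of the probing interval.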
Packet Delay
Round-trip delays from the probe host to the Beacon host are used, because one-way delays suffer from clock skew.
[Figures: failover and recovery panels]
Out-of-order Packets
- Number of reordered packets (packets received out of order)
- Reordering offset
[Figures: failover and recovery panels]
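The slides do not spell out how the two reordering metrics are computed, so the following is one plausible formulation, stated as an assumption. A packet counts as reordered if it arrives after a packet with a higher sequence number, and its offset is the gap to the highest sequence number seen so far.

```python
def reordering_stats(arrival_seqs):
    """Count reordered packets and their offsets.

    arrival_seqs: sequence numbers in order of arrival.
    Assumed definition: a packet is reordered if its sequence number is
    lower than the highest one already received; its offset is that gap.
    """
    highest = -1
    offsets = []
    for seq in arrival_seqs:
        if seq < highest:
            offsets.append(highest - seq)
        else:
            highest = seq
    return len(offsets), offsets

count, offsets = reordering_stats([0, 1, 3, 2, 4, 7, 5, 6])
```

In this example packets 2, 5, and 6 arrive late, giving three reordered packets with offsets 1, 2, and 1.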
How Do Routing Failures Occur? (Failover)
Prefer-customer routing policy: routes received from a provider’s customers are always preferred over those received from its peers.
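A minimal sketch of prefer-customer route selection follows. The slide only states that customer routes beat peer routes; ranking provider routes last and breaking ties on AS-path length follow common BGP practice and are assumptions here, as are the names `PREFERENCE` and `best_route`.

```python
# Lower rank is preferred. Only customer-over-peer comes from the slide;
# placing provider routes last is a common convention, assumed here.
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}

def best_route(routes):
    """Pick the best route from (relationship, as_path) tuples.

    The relationship of the neighbor that announced the route is compared
    first; AS-path length is only a tie-breaker.
    """
    return min(routes, key=lambda r: (PREFERENCE[r[0]], len(r[1])))

routes = [
    ("peer", (2, 0)),          # shorter path, learned from a peer
    ("customer", (3, 1, 0)),   # longer path, learned from a customer
]
chosen = best_route(routes)
```

The customer route wins even though its AS path is longer, which is why a router can keep using a path through a customer during failover rather than switching to a shorter peer path.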
[Figure: topology with routers R1 to R6, Provider 1, Provider 2, and the Beacon (AS 0), showing customer links, a peer link, and AS paths (2 0) and (1 0)]
How Do Routing Failures Occur? (Failover, contd.)
[Figure: the same topology extended with Provider 3 and routers R7 to R9, showing peer links and AS paths (2 0) and (1 0)]
No-valley routing policy: peers do not transit traffic from one peer to another.
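The no-valley rule can be written as an export filter. The slide states only the peer-to-peer case; treating provider-learned routes the same way is the usual valley-free formulation and is an assumption in this sketch, as is the function name `should_export`.

```python
def should_export(learned_from, send_to):
    """Decide whether a route may be announced to a neighbor.

    learned_from / send_to: 'customer', 'peer', or 'provider'.
    Routes learned from customers go to everyone; routes learned from
    peers or providers go only to customers.
    """
    return learned_from == "customer" or send_to == "customer"

peer_to_peer = should_export("peer", "peer")
```

Because a peer-learned route is never passed to another peer, a router whose only remaining path runs through a peer cannot offer it to its other peers, which limits the alternatives available during failover.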
How Do Routing Failures Occur? (Recovery)
[Figure: routers R1 to R4 in Provider 1 and Provider 2, with the Beacon in AS 0, showing path (0) and the withdrawal of path (2 0)]
- 1. Path (0) is restored at R3.
- 2. R3 sends the path to R2.
- 3. R2 sends a withdrawal to R1.
- 4. R3 sends the recovered path to R1.
- 5. R1 regains its connection to the Beacon.
iBGP constraint: a route learned from one iBGP router is not re-advertised to another iBGP router.
Summary
- During failover and recovery events
- Routing events impact packet loss significantly.
- Routing failures contribute to end-to-end packet loss significantly.
- Routing events can lead to long packet round-trip delays and reordering.
- Routing policies and iBGP configuration play a major role in causing packet loss during routing events.
Discussion
- How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides?
- How could we exploit these results to improve end-to-end performance?
- How realistic is the topology in the second paper?
References
- Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006.
- Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006.
- Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM 2010.
- Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.