
OUTLINE California Fault Lines: Understanding the Causes and Impact - PowerPoint PPT Presentation



  1. Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011

  2. OUTLINE • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

  3. OUTLINE • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

  4. Why Study Failure • Failure is a reality for large networks • Achieving high availability requires engineering the network to be robust to failure • Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures

  5. CENIC Network • Serving California educational institutions • Over 200 routers • 5 years of data • Three Types of Components: ◦ The Digital California (DC) network ◦ The High-Performance Research (HPR) network ◦ Customer-premises equipment (CPE)

  6. Contribution • Methodology to reconstruct historical failure events in the CENIC network • Uses only commonly available data, with no need for additional instrumentation • Analysis of the network based on the failure measurements

  7. Reconstruction • What data are available to reconstruct a failure 4 years later? ◦ Syslog: describes interface state changes ◦ Router configuration files: map interfaces to links ◦ Operational announcements on the mailing list • These data were not intended for failure reconstruction!
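The reconstruction idea on this slide can be illustrated with a minimal sketch. It assumes syslog lines have already been parsed into `(timestamp, router, interface, state)` tuples and that an interface-to-link map has been extracted from the router configuration files. Both formats, and the function name `reconstruct_failures`, are hypothetical and are not taken from the paper. The sketch also simplifies by treating a link as failed from the first DOWN report until the first UP report from either end.

```python
from collections import defaultdict

# Hypothetical inputs: (timestamp_seconds, router, interface, state) tuples
# parsed from syslog, and an interface-to-link map derived from router
# configuration files. Neither format comes from the paper.
def reconstruct_failures(events, iface_to_link):
    """Pair each DOWN transition with the next UP on the same link."""
    down_since = {}                 # link -> time the link first went down
    failures = defaultdict(list)    # link -> [(start, end), ...]
    for ts, router, iface, state in sorted(events):
        link = iface_to_link.get((router, iface))
        if link is None:
            continue                # interface does not belong to a known link
        if state == "down":
            down_since.setdefault(link, ts)   # keep the earliest DOWN report
        elif state == "up" and link in down_since:
            failures[link].append((down_since.pop(link), ts))
    return dict(failures)


events = [
    (10, "r1", "ge-0", "down"),
    (12, "r2", "ge-1", "down"),   # other end of the same link
    (40, "r1", "ge-0", "up"),
    (41, "r2", "ge-1", "up"),
]
iface_to_link = {("r1", "ge-0"): "r1-r2", ("r2", "ge-1"): "r1-r2"}
print(reconstruct_failures(events, iface_to_link))
```

Because both ends of a link usually log the same transition, collapsing interface events onto links is what turns raw syslog messages into one failure interval per outage.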

  8. Validation • Internal consistency ◦ Administrator announcements are used to validate the reconstructed event history. • External consistency ◦ CAIDA Skitter project (now Ark) validates UP. ◦ Route Views project validates DOWN.

  9. Overview of Link Failures

  10. Overview of Link Failures

  11. Overview of Link Failures • Vertical banding ◦ V1: a network-wide IS-IS configuration change requiring a router restart ◦ V2: a network-wide software upgrade ◦ V3: a network-wide configuration change in preparation for IPv6 • Horizontal banding ◦ H1: a series of failures on a link between a core router and a County of Education office (hardware) ◦ H2: this link experienced over 33,000 short-duration failures (fiber cut)

  12. CDFs of Individual Failure Events

  13. Various Link Hardware Types

  14. Cause of Failure

  15. Failure Events

  16. Summary • Engineering for failure requires real data - Data has historically been difficult to obtain • Methodology to perform historical failure analysis with low-quality data sources • Shared our findings in the CENIC network - Reliability of individual components - Causes of failures - Impact of failure

  17. OUTLINE • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

  18. Key Questions • How can routing events degrade end-to-end path performance? • How do topological properties and routing policies affect performance degradation?

  19. Approach • Study end-to-end performance under realistic topologies. • Investigate several metrics to characterize the end-to-end loss, delay, and out-of-order packets. • Characterize the kinds of routing changes that impact end-to-end path performance. • Analyze the impact of topology, routing policies, MRAI timer and iBGP configurations on end-to-end path performance.

  20. Experiment Methodology • A multi-homed prefix ◦ BGP Beacon prefix: 192.83.230.0/24 • Controlled routing changes ◦ Failover events: the Beacon changes from being connected to both providers to being connected to a single provider. ◦ Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers. [Figure: the Beacon and its links to ISP 1 and ISP 2 across a failover event and a recovery event]
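The two event definitions above can be written as a small classifier. This is only an illustrative sketch: the function name `classify_transition` and the provider labels are assumptions, and any transition that matches neither definition (such as the connectivity resets) is reported as `"other"`.

```python
def classify_transition(before, after):
    """Classify a change in the set of providers the Beacon is connected to."""
    before, after = frozenset(before), frozenset(after)
    # Failover: connected to both providers, then to only one of them.
    if len(before) == 2 and len(after) == 1 and after < before:
        return "failover"
    # Recovery: connected to one provider, then to both.
    if len(before) == 1 and len(after) == 2 and before < after:
        return "recovery"
    return "other"


print(classify_transition({"ISP1", "ISP2"}, {"ISP1"}))   # failover
print(classify_transition({"ISP2"}, {"ISP1", "ISP2"}))   # recovery
```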

  21. Controlled Routing Changes • 12 routing events every day ◦ 8 Beacon events (failover events and recovery events) ◦ 4 for resetting the Beacon connectivity [Table: time schedule (GMT) for BGP Beacon routing transitions]

  22. Active Probing • Goal: capture the impact of routing changes on end-to-end performance. • From 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix). • Three probing methods: ◦ Back-to-back traceroutes ◦ Back-to-back pings ◦ UDP probing (50 msec interval) • Data-plane performance metrics: packet loss, delay, out-of-order [Figure: hosts A, B, and C probing across the Internet, through ISP 1 and ISP 2, to the Beacon host] [Table: data-plane performance metrics against active probing methods (traceroute, ping, UDP probing)]

  23. Packet Loss • Loss burst: consecutive UDP probing packets lost during a routing change event. [Figures: Failover, Recovery]
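A loss burst as defined above can be computed from the sequence numbers of the probes that arrived. The sketch below assumes probes are numbered `0` to `total_sent - 1`. That numbering and the function name `loss_bursts` are illustrative assumptions, not details from the paper.

```python
def loss_bursts(received_seqs, total_sent):
    """Return the lengths of runs of consecutive lost probes."""
    received = set(received_seqs)
    bursts, run = [], 0
    for seq in range(total_sent):
        if seq in received:
            if run:
                bursts.append(run)   # a received probe ends the current burst
            run = 0
        else:
            run += 1                 # another consecutive loss
    if run:
        bursts.append(run)           # burst still open at the end of the trace
    return bursts


PROBE_INTERVAL_MS = 50               # UDP probing interval stated in the slides
bursts = loss_bursts([0, 1, 5, 6, 8], total_sent=10)
print(bursts)
print([b * PROBE_INTERVAL_MS for b in bursts])   # approximate burst durations in ms
```

With probes sent every 50 msec, multiplying a burst length by the interval gives a rough estimate of how long the path was unusable.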

  24. Packet Delay • Round-trip delays from the probe host to the Beacon host (one-way delays suffer from clock skew). [Figures: Failover, Recovery]

  25. Out-of-order Packets • Number of reorderings (number of packets out of order) • Reordering offset [Figures: Failover, Recovery]
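The slide names the two reordering metrics without defining them, so the sketch below uses one plausible definition as an assumption: a packet is out of order if some packet with a higher sequence number arrived before it, and its offset is the number of such earlier arrivals. The function name `reordering_stats` is hypothetical, and the paper's exact definition may differ.

```python
def reordering_stats(arrival_seqs):
    """Return (number of out-of-order packets, list of their offsets).

    Assumed definition: a packet's offset is the number of packets with a
    higher sequence number that arrived before it.
    """
    seen = []
    offsets = []
    for seq in arrival_seqs:
        overtaken_by = sum(1 for s in seen if s > seq)
        if overtaken_by:
            offsets.append(overtaken_by)
        seen.append(seq)
    return len(offsets), offsets


# Packet 3 arrives after packets 4 and 5, so it is late by two positions.
print(reordering_stats([1, 2, 4, 5, 3, 6]))
```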

  26. How Routing Failures Occur? (Failover) • Prefer-customer routing policy: routes received from a provider's customers are always preferred over those received from its peers. [Figure: routers R1 to R6 in Provider 1 and Provider 2, annotated with AS paths such as (0), (1 0), and (2 0); labels: peer link, customer link, Beacon AS 0]
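The prefer-customer policy can be sketched as a route-selection function. The slide states only that customer routes beat peer routes. Ranking provider routes last and breaking ties on AS-path length are common conventions that are added here as assumptions, and the names `REL_PREFERENCE` and `best_route` are hypothetical.

```python
# Lower value = more preferred. Only "customer before peer" comes from the
# slide; placing "provider" last is an assumed, conventional ordering.
REL_PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}


def best_route(routes):
    """Pick the best route from (learned_from_relationship, as_path) pairs.

    The relationship is compared first; AS-path length is only a tie-breaker
    (an assumption in this sketch).
    """
    return min(routes, key=lambda r: (REL_PREFERENCE[r[0]], len(r[1])))


routes = [
    ("peer", (2, 0)),          # shorter path, but learned from a peer
    ("customer", (1, 5, 0)),   # longer path, learned from a customer
]
print(best_route(routes))      # the customer route wins despite its length
```

Under this policy a router keeps using its customer route until that route is withdrawn, and only then falls back to a peer-learned route.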

  27. How Routing Failures Occur? (Failover, contd.) • No-valley routing policy: peers do not transit traffic from one peer to another. [Figure: routers R1 to R9 in Provider 1, Provider 2, and Provider 3, annotated with AS paths such as (0), (1 0), and (2 0); labels: peer links, Beacon AS 0]
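The no-valley policy is often expressed as an export rule. The slide states only the peer-to-peer case. The sketch below extends it with the conventional rule that routes learned from peers or providers are exported only to customers, which is an assumption here, as is the function name `may_export`.

```python
def may_export(learned_from, export_to):
    """Decide whether a route may be exported under a no-valley policy.

    learned_from / export_to are one of "customer", "peer", "provider".
    Customer routes go to every neighbor; routes learned from a peer or a
    provider go only to customers (the conventional rule, assumed here).
    """
    if learned_from == "customer":
        return True
    return export_to == "customer"


print(may_export("peer", "peer"))       # False: no transit between peers
print(may_export("customer", "peer"))   # True
```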

  28. How Routing Failures Occur? (Recovery) • iBGP constraint: a route received from an iBGP router cannot be passed on to another iBGP router. 1. Path 0 ⇒ R3 recovers. 2. R3 sends the path to R2. 3. R2 sends a withdrawal to R1. 4. R3 sends the recovery path to R1. 5. R1 regains its connection to the Beacon. [Figure: routers R1 to R4, Provider 1, Provider 2, and Beacon AS 0; labels: Withdraw (2 0), Path (0)]
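The iBGP constraint can be sketched as a rule for choosing which neighbors a router may advertise a route to. The sketch assumes a plain iBGP full mesh without route reflectors. The function name `advertise_targets` and the router names in the example are illustrative only.

```python
def advertise_targets(learned_via, ibgp_peers, ebgp_peers):
    """Return the neighbors a router may advertise a route to.

    learned_via is "ibgp" or "ebgp". A route learned over iBGP is not passed
    on to other iBGP routers; a route learned over eBGP may go to everyone.
    Assumes a full iBGP mesh with no route reflectors.
    """
    targets = list(ebgp_peers)
    if learned_via == "ebgp":
        targets += list(ibgp_peers)
    return targets


# R2 hears the recovered path from R3 over iBGP, so it cannot relay it to R1.
print(advertise_targets("ibgp", ibgp_peers=["R1"], ebgp_peers=[]))
# R3 learned the path over eBGP, so it may announce it to both R1 and R2.
print(advertise_targets("ebgp", ibgp_peers=["R1", "R2"], ebgp_peers=[]))
```

This appears to be why, in the steps above, R1 loses reachability after R2's withdrawal and recovers only once R3's own announcement reaches it.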

  29. Summary • During failover and recovery events: ◦ Routing events impact packet loss significantly. ◦ Routing failures contribute significantly to end-to-end packet loss. • Routing events can lead to long packet round-trip delays and reordering. • Routing policies and iBGP configuration play a major role in causing packet loss during routing events.

  30. Discussion • How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides? • How could we exploit the previous results to improve end-to-end performance? • How realistic is the topology in the second paper?

  31. References • Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006. • Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006. • Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM 2010. • Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.
