 
              CompSci 514: Computer Networks Lecture 10: BGP problems Xiaowei Yang 1
Today • Known problems of BGP – Multi-homing – Instability – Delayed convergence • Slow failover • Discussing fixes – Root cause, ghost flushing etc. 2
Background on the paper
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 6, NO. 5, OCTOBER 1998 515
Internet Routing Instability
Craig Labovitz, Student Member, IEEE, G. Robert Malan, Student Member, IEEE, and Farnam Jahanian, Member, IEEE
Abstract: This paper examines the network interdomain routing information exchanged between backbone service providers at the major U.S. public Internet exchange points. Internet routing instability, or the rapid fluctuation of network reachability information, is an important problem currently facing the Internet engineering community. High levels of network instability can lead to packet loss, increased network latency and time to convergence. At the extreme, high levels of routing instability have led to the loss of internal connectivity in wide-area, national networks. In this paper, we describe several unexpected trends in routing instability, and examine a number of anomalies and pathologies observed in the exchange of inter-domain routing information. The analysis in this paper is based on data collected from BGP routing messages generated by border routers at five of the Internet core’s public exchange points during a nine month period. We show that the volume of these routing updates is several orders of magnitude more than expected and that the majority of this routing information is redundant, or pathological. Furthermore, our analysis reveals several unexpected trends and ill-behaved systematic properties in Internet routing. We finally posit a number of explanations for these anomalies and evaluate their potential impact on the Internet infrastructure.
Introduction (excerpt): [...] flaps have led to the transient loss of connectivity for large portions of the Internet. Overall, instability has three primary effects: increased packet loss, delays in the time for network convergence, and additional resource overhead (memory, CPU, etc.) within the Internet infrastructure. The Internet is comprised of a large number of interconnected regional and national backbones. The large public exchange points are often considered the “core” of the Internet, where backbone service providers peer, or exchange traffic and routing information with one another. Backbone service providers participating in the Internet core must maintain a complete map, or default-free routing table, of all globally visible network-layer addresses reachable throughout the Internet. The Internet is divided into a large number of different regions of administrative control commonly called autonomous systems. These autonomous systems (AS’s) usually have distinct routing policies and connect to one or more remote AS’s at private or public exchange points. AS’s are traditionally composed of network service providers or large organizational [...]
• Another ACM SIGCOMM test of time award paper • A first large scale study of BGP traffic • Motivated much improvement to BGP 3
Failover • BGP is designed for scaling more than fast failover – Many mechanisms favor this balance – Route flap damping, for example. • If excess routing changes ("flapping"), ignore for some time. • Has unexpected effects on convergence times. – Route advertisement/withdrawal timers in the 30 second range – Effect: tens of seconds to many minutes to recover from "simple" failures. – 15-30 minute outages not uncommon. 4
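To make the "ignore for some time" rule concrete, here is a minimal sketch of penalty-based flap damping with exponential decay. The class name `FlapDamper` and all numeric parameters (penalty per flap, suppress and reuse thresholds, 15-minute half-life) are illustrative assumptions, not values taken from the slides.

```python
class FlapDamper:
    """Per-route damping state; all thresholds are assumed example values."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # The penalty halves every half_life_s seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update = now

    def flap(self, now):
        """Record one route change (withdrawal or re-announcement)."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def usable(self, now):
        """Return True if the route may be used and advertised at time now."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return not self.suppressed


damper = FlapDamper()
for t in (0.0, 10.0, 20.0):      # three flaps within 20 seconds
    damper.flap(t)
print(damper.usable(30.0))       # still suppressed shortly after the flaps
print(damper.usable(1820.0))     # penalty has decayed for two half-lives
```

With these assumed parameters, three quick flaps keep a route suppressed for roughly half an hour, which is one way a short burst of instability can turn into a long outage.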
Multi-homing • Connect to multiple providers – Goal: Higher availability, more capacity • Problems: – Provider-based addressing breaks – Everyone needs their own address space 5
Multi-homing increases routing table size [Figure: Multi-home.com (204.1.0.0/16) attached to more than one of the providers ISP1, ISP2, ISP3; the aggregates 204.0.0.0/8 and 128.0.0.0/8 are announced alongside the more specific 204.1.0.0/16, which takes its own routing table entry.] 6
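A small sketch of why the more specific prefix cannot be folded into a provider aggregate: routers forward on the longest matching prefix, so the multi-homed /16 has to be carried as a separate entry. The next-hop assignments below (which ISP owns which /8, and that the /16 is learned via ISP2) are assumptions chosen for illustration.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the next hop of the most specific prefix covering address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Assumed table at a third-party ISP before the customer multi-homes.
aggregated = {
    ipaddress.ip_network("204.0.0.0/8"): "ISP1",
    ipaddress.ip_network("128.0.0.0/8"): "ISP2",
}

# After multi-homing, the customer's /16 is also announced via a second
# provider and must be stored as its own entry.
multihomed = dict(aggregated)
multihomed[ipaddress.ip_network("204.1.0.0/16")] = "ISP2"

print(len(aggregated), longest_prefix_match(aggregated, "204.1.2.3"))
print(len(multihomed), longest_prefix_match(multihomed, "204.1.2.3"))
```

Each multi-homed site adds at least one such entry to every default-free table, which is the growth the slide title refers to.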
Global routing tables continue to grow Source: http://bgp.potaroo.net/as6447/ 7
Other BGP problems • Convergence: BGP may explore many routes before finding the right new one. – Labovitz et al., SIGCOMM 2000 • Correctness: routes may not be valid, visible, or loop-free. • Security: There is none! – Some providers filter what announcements their customers can make. Not all do. – See paper discussion site for pointers 8
Measurement studies • Two papers (measurement) – End-to-end traffic – Routing messages • Experimental techniques • Results 9
Internet Routing Instability • Goal: measure how often BGP sends updates to change routes • Methodology: – Analyze BGP logs collected at five public exchange points over a nine-month period 10
Terms • WADiff: withdrawal → announcement of a different route • AADiff: announcement → announcement of a different route • WADup: same route withdrawal → announcement • AADup: same route announcement → announcement • WWDup: same route withdrawal → withdrawal 10/2/18 CPS 214 11
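The taxonomy above can be sketched as a small classifier over the update stream for one prefix from one peer. The function name, the `("A", route)` / `("W", None)` tuple encoding, and the choice to leave the first update unlabeled are assumptions made for this example.

```python
def classify_updates(updates):
    """Label each update for ONE (peer, prefix) pair, in arrival order.

    updates: list of (kind, route); kind is "A" (announce) or "W" (withdraw),
    route is a hashable description of the path attributes, or None for "W".
    """
    labels = []
    last_kind = None
    last_route = None  # most recently announced route, kept across withdrawals
    for kind, route in updates:
        label = None
        if kind == "W":
            if last_kind == "W":
                label = "WWDup"
        else:
            if last_kind == "W" and last_route is not None:
                label = "WADup" if route == last_route else "WADiff"
            elif last_kind == "A":
                label = "AADup" if route == last_route else "AADiff"
            last_route = route
        labels.append(label)
        last_kind = kind
    return labels


r1 = ("AS1 AS7", "10.0.0.1")   # hypothetical (AS path, next hop)
r2 = ("AS2 AS7", "10.0.0.2")
stream = [("A", r1), ("A", r1), ("A", r2), ("W", None),
          ("W", None), ("A", r2), ("W", None), ("A", r1)]
print(classify_updates(stream))
```

An update labeled `None` here is an ordinary first announcement or a first withdrawal of a reachable route, so it falls outside the five categories.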
Observed pathologies • Repeated WWDup, WADup, AADup • Why are they pathologies? 12
• Majority of BGP updates are WWDup • Most WWDups are sent by ASes that never announced the withdrawn prefixes • Why? – Stateless BGP implementations do not remember what they have sent to peers – So they send withdrawals to all peers 13
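A toy comparison of the two behaviours, assuming a hypothetical `Router` class: the stateless variant floods every withdrawal to all peers, while the stateful one only withdraws from peers it actually announced the prefix to. Peer names and the prefix are placeholders.

```python
class Router:
    def __init__(self, peers, stateful):
        self.peers = list(peers)
        self.stateful = stateful
        self.sent = {}  # prefix -> peers we announced it to (stateful only)

    def announce(self, prefix, to_peers):
        if self.stateful:
            self.sent.setdefault(prefix, set()).update(to_peers)
        return [(peer, "A", prefix) for peer in to_peers]

    def withdraw(self, prefix):
        if self.stateful:
            # Only peers that heard the announcement need the withdrawal.
            announced = self.sent.pop(prefix, set())
            targets = [peer for peer in self.peers if peer in announced]
        else:
            # No memory of past announcements: withdraw from everyone.
            targets = self.peers
        return [(peer, "W", prefix) for peer in targets]


for stateful in (False, True):
    router = Router(["AS1", "AS2", "AS3"], stateful)
    router.announce("192.0.2.0/24", ["AS1"])
    first = router.withdraw("192.0.2.0/24")
    second = router.withdraw("192.0.2.0/24")   # repeated withdrawal
    print(stateful, len(first), len(second))
```

In the stateless case AS2 and AS3 receive withdrawals for a prefix they never heard announced, and the repeated withdrawal is sent again to all three peers, which is the WWDup pattern.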
Possible origins of instability • Stateless BGP • Physical link errors • Unjittered timers • IGP, BGP interactions • Conflicting routing policies 14
Data analysis techniques • Time series analysis • Frequency analysis – Fast Fourier transform – Maximum entropy spectral estimation • Different estimation methods, but both find significant periodicities at seven days and 24 hours 15
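As a rough illustration of the frequency analysis, the sketch below recovers daily and weekly periods from a synthetic hourly series of update counts. It uses a direct discrete Fourier transform in pure Python rather than an FFT library, and the series itself is made up for the example; it is not data from the paper.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each nonzero frequency bin of the mean-removed series."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    magnitudes = {}
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(centered))
        magnitudes[k] = abs(acc)
    return magnitudes


def dominant_periods(samples, top=2):
    """Periods (in samples) of the strongest frequency components."""
    n = len(samples)
    magnitudes = dft_magnitudes(samples)
    strongest = sorted(magnitudes, key=magnitudes.get, reverse=True)[:top]
    return [n / k for k in strongest]


HOURS = 24 * 7 * 4  # four weeks of hourly samples
updates_per_hour = [
    100 + 30 * math.sin(2 * math.pi * t / 24)     # assumed daily cycle
        + 20 * math.sin(2 * math.pi * t / 168)    # assumed weekly cycle
    for t in range(HOURS)
]
periods = dominant_periods(updates_per_hour)
print(periods)  # periods in hours
```

The two strongest components sit at 24 hours and 168 hours (seven days), matching the cycles built into the synthetic series.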
Main results • Many more updates than expected – 99% are pathological. Impressive! – A taxonomy to analyze pathologies • Speculation about causes – Configuration errors, router bugs – Correlation with traffic load, perhaps due to router architectures – Open research question: root cause of updates • Motivated much follow-up work 16
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 9, NO. 3, JUNE 2001 293
Delayed Internet Routing Convergence
Craig Labovitz, Member, IEEE, Abha Ahuja, Member, IEEE, Abhijit Bose, and Farnam Jahanian
Abstract: This paper examines the latency in Internet path failure, failover, and repair due to the convergence properties of interdomain routing. Unlike circuit-switched paths which exhibit failover on the order of milliseconds, our experimental measurements show that interdomain routers in the packet-switched Internet may take tens of minutes to reach a consistent view of the network topology after a fault. These delays stem from temporary routing table fluctuations formed during the operation of the Border Gateway Protocol (BGP) path selection process on Internet backbone routers. During these periods of delayed convergence, we show that end-to-end Internet paths will experience intermittent loss of connectivity, as well as increased packet loss and latency. We present a two-year study of Internet routing convergence through the experimental instrumentation of key portions of the Internet infrastructure, including both passive data collection and fault-injection machines at major Internet exchange points. Based on data from the injection and measurement of several [...]
Introduction (excerpt): [...] For example, transient disruptions in backbone networks that previously impacted a handful of scientists may now cause enormous financial loss and disrupt hundreds of thousands of end users. Since its commercial inception in 1995, the Internet has lagged behind the public switched telephone network (PSTN) in availability, reliability, and quality of service (QoS). Factors contributing to these differences between the commercial Internet infrastructure and the PSTN have been discussed in various literature [26], [18]. Although recent advances in the IETF’s Differentiated Services working group promise to improve the performance of application-level services within some networks, across the wide-area Internet these QoS algorithms are usually predicated on the existence of a stable [...]
17
Delayed Internet Convergence • Methodologies – Problem discovery – Modeling & analysis – Measurement – Improvement 18
Experiment setup • Actively inject BGP faults – How is a fault injected? • Passively listen at peering sessions, and use NTP-synchronized machines to calculate the convergence time • Actively send probe packets to observe end-to-end packet loss and latency • Much later BGP work uses similar measurement techniques. 19
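One way to turn the passive logs into a convergence time is to take the span from the injected fault to the last related update before the peers go quiet. This is only a sketch: the function name, the `(timestamp, peer)` record format, and the 180-second quiet period are assumptions, and timestamps are taken to be seconds on NTP-synchronized clocks.

```python
def convergence_time(fault_time, updates, quiet_period=180.0):
    """Seconds from fault injection to the last update of the burst.

    updates: list of (timestamp_s, peer) records for the injected prefix.
    The burst ends at the first gap longer than quiet_period.
    """
    last = fault_time
    for timestamp in sorted(t for t, _ in updates if t >= fault_time):
        if timestamp - last > quiet_period:
            break
        last = timestamp
    return last - fault_time


observed = [(101.0, "peerA"), (130.0, "peerB"), (160.0, "peerA"),
            (175.0, "peerC"), (900.0, "peerB")]
print(convergence_time(100.0, observed))
```

The update at 900 seconds is treated as unrelated to the injected fault because it arrives long after the burst has ended.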
Results show delayed convergence • Bad news travels slow. 20
Slow routing convergence results in poor end-to-end performance 21
What causes the delayed routing convergence? [Figure: convergence example with nodes 0, 1, 2 and destination R. Each node's candidate paths are written as a triple such as (*0R, 1R, ∞), where * marks the selected path and ∞ a withdrawn one; the nodes exchange paths such as 01R, 10R, and 20R until every entry is ∞.] • A simple BGP convergence model reveals that in the worst case, all possible paths are explored before a prefix is withdrawn. • No minimum advertisement timer: synchronized network, global message queue 22
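The path exploration can be reproduced with a toy simulation. This is a sketch under stated assumptions, not the paper's exact model: a full mesh of n ASes that all lose their direct route to R at the same moment, shortest-path selection with ties broken by lowest neighbor id, no advertisement timer, and a single global FIFO message queue.

```python
from collections import deque


def simulate_withdrawal(n):
    """Return (messages processed, backup paths tried per AS, final routes)."""
    # rib_in[i][j]: the path neighbor j last announced to i (None = withdrawn).
    rib_in = [{j: (j, "R") for j in range(n) if j != i} for i in range(n)]
    best = [(i, "R") for i in range(n)]   # every AS starts on its direct route
    queue = deque()                        # single global FIFO message queue
    explored = [[] for _ in range(n)]
    messages = 0

    def reselect(i):
        # Shortest loop-free path wins; ties go to the lowest neighbor id.
        candidates = [(len(path), j, path) for j, path in rib_in[i].items()
                      if path is not None and i not in path]
        new_best = (i,) + min(candidates)[2] if candidates else None
        if new_best != best[i]:
            best[i] = new_best
            if new_best is not None:
                explored[i].append(new_best)
            for j in rib_in[i]:
                queue.append((i, j, new_best))  # announcement or withdrawal

    for i in range(n):        # R fails: direct routes vanish everywhere
        reselect(i)
    while queue:
        sender, receiver, path = queue.popleft()
        messages += 1
        rib_in[receiver][sender] = path
        reselect(receiver)
    return messages, explored, best


for size in (2, 3, 4):
    count, tried, final = simulate_withdrawal(size)
    print(size, count, max(len(paths) for paths in tried), final)
```

Every AS ends with no route, but only after trying stale backup paths through its neighbors, and both the message count and the number of paths tried grow with the size of the mesh.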