SLIDE 1 IETF 104, March 2019
Edge Fabric:
Delivering Oceans of Content to the World
Brandon Schlinker1,2, Hyojeong Kim1, Timothy Cui1, Ethan Katz-Bassett2,3, Harsha V. Madhyastha4, Italo Cunha5,
James Quinn1, Saif Hasan1, Petr Lapukhov1, James Hongyi Zeng1
1 Facebook, 2 University of Southern California, 3 Columbia University, 4 University of Michigan, 5 Universidade Federal de Minas Gerais
SLIDE 2
Facebook's Global Network
points of presence around the world interconnect with thousands of networks
SLIDE 3 Benefits of Rich Interconnection
short, direct paths bypass transit providers
multiple, diverse paths: substantial path diversity
[Diagram: short, direct paths and multiple, diverse paths to end-user networks, alongside a path through a Tier 1]
SLIDE 5
Basics of Interconnection
1 Establish physical circuits
2 Exchange reachability information via BGP
3 BGP at router selects which route to use
[Diagram: an Edge Router holds BGP sessions with Network A and Network B. Both advertise 203.0.113.0/24, giving the router Route 1 and Route 2 to the same prefix. Selected route: Route 1]
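Step 3 can be illustrated with a heavily simplified sketch of a BGP-style decision. It compares only local preference and AS-path length; real BGP applies further tie-breakers (origin, MED, router ID, and so on). The `Route` class, the neighbor labels, and all attribute values are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    neighbor: str      # hypothetical label for the advertising network
    local_pref: int    # higher is preferred
    as_path_len: int   # shorter is preferred


def select_best(routes):
    """Pick one route per prefix: highest local_pref, then shortest AS path."""
    best = {}
    for r in routes:
        cur = best.get(r.prefix)
        if cur is None or (r.local_pref, -r.as_path_len) > (cur.local_pref, -cur.as_path_len):
            best[r.prefix] = r
    return best


routes = [
    Route("203.0.113.0/24", "Network A", local_pref=100, as_path_len=1),
    Route("203.0.113.0/24", "Network B", local_pref=100, as_path_len=2),
]
selected = select_best(routes)["203.0.113.0/24"]
print(selected.neighbor)  # Network A
```

Note that nothing in the comparison refers to traffic volume, circuit capacity, or measured performance.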
SLIDE 19 Challenges to Using Our Connectivity
deliver traffic with the best performance possible
SLIDE 20 Challenges to Using Our Connectivity
deliver traffic with the best performance possible
challenge
BGP does not consider demand, capacity or performance
SLIDE 21 BGP Does Not Consider Demand and Capacity
[Diagram: Router with Route A and Route B, via a Tier 1 and an ISP; circuit capacities 100 Gbps and 10 Gbps]
With 5 Gbps demand, the route selected by BGP carries 5 Gbps load
With 12 Gbps demand, the route selected by BGP carries 12 Gbps load (overloaded)
Cannot configure BGP to adapt to demand/capacity in real time
Not possible to express with BGP policy terms
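The overload follows from a selection rule that never looks at load. A minimal sketch, assuming a fixed preference order stands in for BGP policy: the function name and data layout are illustrative, and the assumption that the 10 Gbps circuit is the preferred one is made explicit in the code. The capacities and demands are the figures shown on the slide.

```python
def place_by_static_preference(demand_gbps, circuits):
    """Send all demand to the most preferred circuit, ignoring capacity.

    `circuits` is ordered from most to least preferred, mimicking a fixed
    policy. Returns (circuit name, load in Gbps, utilization).
    """
    name, capacity_gbps = circuits[0]
    return name, demand_gbps, demand_gbps / capacity_gbps


# Assumption: the 10 Gbps circuit is the preferred one.
circuits = [("preferred", 10.0), ("alternate", 100.0)]

for demand in (5.0, 12.0):
    name, load, util = place_by_static_preference(demand, circuits)
    status = "overloaded" if util > 1.0 else "ok"
    print(f"{demand} Gbps demand -> {name}: {util:.0%} utilized ({status})")
```

The alternate circuit's spare capacity is never consulted, which is the gap a controller with demand and capacity inputs can close.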
SLIDE 24 BGP Does Not Consider Performance
[Diagram: Router with Route A and Route B, via a Tier 1 and an ISP; 5 Gbps demand. One route has the best performance, the other poor performance (+50 ms, 2% loss); BGP selects a route without considering this difference]
Cannot configure BGP to adapt to performance in real time
Not possible to express with BGP policy terms
SLIDE 26
BGP is fundamental to interconnection and it's not going away
SLIDE 27 Sidestepping BGP's Limitations
deliver traffic with the best performance possible
challenge
BGP does not consider demand, capacity or performance
approach
shift control from BGP at routers to a software controller
SLIDE 28
Outline
1 Overview
2 Facebook's Connectivity and Challenges
3 Sidestepping BGP's Limitations with Edge Fabric
4 Results from Edge Fabric's Behavior in Production
5 Evolution and Related Work
SLIDE 33 Connectivity at a Point of Presence (PoP)
Transit Providers: deliver traffic to entire Internet; two or more per PoP; interconnection via private circuit
Peers: end-user ISPs, mobile providers
Private Peers: tens per PoP; interconnection via private circuit
IXP Peers: hundreds per PoP; interconnection via shared fabric (via Internet Exchange Point)
SLIDE 36
We prefer routes from private peers > IXP peers > transits
SLIDE 37
We prefer routes from private peers > IXP peers > transits
peers > transits
peers provide short, direct paths to end users
SLIDE 38
We prefer routes from private peers > IXP peers > transits
private > IXP peers
prefer circuits dedicated to Facebook and peer
peers > transits
peers provide short, direct paths to end users
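The preference order can be sketched as a ranking over interconnection types. This is only an illustration of the ordering stated on the slide: the enum values and route labels are hypothetical, and the sketch does not show how the preference is configured on routers.

```python
from enum import IntEnum


class PeerType(IntEnum):
    # Higher value = more preferred; the numeric values are illustrative.
    TRANSIT = 1
    IXP_PEER = 2
    PRIVATE_PEER = 3


def rank_routes(routes):
    """Order candidate routes for a prefix from most to least preferred."""
    return sorted(routes, key=lambda r: r[1], reverse=True)


candidates = [
    ("via transit", PeerType.TRANSIT),
    ("via private peer", PeerType.PRIVATE_PEER),
    ("via IXP peer", PeerType.IXP_PEER),
]
ranked = rank_routes(candidates)
print([name for name, _ in ranked])
```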
SLIDE 39 Connectivity at a Point of Presence (PoP)
Transit Providers: deliver traffic to entire Internet; two or more per PoP; interconnection via private circuit
Peers: end-user ISPs, mobile providers
Private Peers: tens per PoP; interconnection via private circuit; majority of traffic
IXP Peers: hundreds per PoP; interconnection via shared fabric (via Internet Exchange Point)
SLIDE 40
We cannot acquire sufficient capacity with private peers to satisfy demand
SLIDE 41 We cannot acquire sufficient capacity with private peers to satisfy demand
10 Gbps capacity
Router 12 Gbps load (overloaded)
| selected by BGP
Private Peer 12 Gbps demand
SLIDE 42
Why not just acquire more peering capacity?
Peers often cannot provision capacity: technical constraints, business constraints
Even when peers agree to add capacity: provisioning can be slow (months), little headroom for traffic bursts or circuit failures
SLIDE 45
How bad is the problem?
Why not just acquire more peering capacity?
SLIDE 46
Capacity Constraints in Production
Over a two-day study of 20 PoPs (a subset of production), we identified circuits predicted to have demand > capacity
17 out of 20 PoPs had at least one such circuit
18% of all circuits were predicted to have demand > capacity
SLIDE 49 Capacity Constraints in Production
[Figure: CDF of circuits where demand > capacity. x-axis: circuit's peak demand relative to capacity (0.5 to 5); y-axis: CDF (0.2 to 1). Covers circuits predicted to have demand > capacity at least once]
50% of circuits had peak demand ≥ 1.19x capacity
10% of circuits had peak demand ≥ 2x capacity
SLIDE 52
Recall: BGP Does Not Consider Demand and Capacity
10% of overloaded circuits had peak demand ≥ 2x capacity
We need a better solution than BGP
so we built Edge Fabric
SLIDE 54
Outline
1 Overview
2 Facebook's Connectivity and Challenges
3 Sidestepping BGP's Limitations with Edge Fabric
4 Results from Edge Fabric's Behavior in Production
5 Evolution and Related Work
SLIDE 55 Sidestepping BGP's Limitations
deliver traffic with the best performance possible
challenge
BGP does not consider demand, capacity or performance
approach
shift control from BGP at routers to a software controller
SLIDE 56
Design Priorities
Operational simplicity: minimize change and system complexity
Ease of deployment: interoperate with existing infrastructure and tooling
SLIDE 58 Responsibility for Routing
Traditional routers: route per destination from BGP
Host-based routing: route per packet dictated by hosts
Edge Fabric's approach: controller overrides BGP's decisions at router; hosts provide hints on packet priority
Design priorities: operational simplicity, ease of deployment
SLIDE 60 Edge Fabric's Approach to Control
1 Router selects routes using BGP
2 Edge Fabric selects ideal routes using BGP routes + other inputs
3 If router and Edge Fabric choose different routes, override router
Inputs to Edge Fabric: BGP routes (from router), circuit capacities, prefix traffic rates, route performance measurements, advanced policy
[Diagram: the router, with its BGP sessions, passes BGP routes to Edge Fabric, which also receives the additional inputs (example values shown: 40 Gbps, 1 Gbps). BGP at the router selects Route A; Edge Fabric selects Route B; Edge Fabric overrides the router: use Route B]
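The third step amounts to a comparison between two decision tables. A minimal sketch, assuming both the router's choices and the controller's choices are available as prefix-to-route mappings; the function name, the dictionary representation, and the second prefix are illustrative and not taken from the slides.

```python
def compute_overrides(router_choice, controller_choice):
    """Return prefixes where the controller disagrees with the router.

    Both arguments map prefix -> route name. Prefixes the controller has
    no opinion on keep the router's own BGP decision.
    """
    overrides = {}
    for prefix, wanted in controller_choice.items():
        if router_choice.get(prefix) != wanted:
            overrides[prefix] = wanted
    return overrides


router_choice = {"203.0.113.0/24": "Route A", "198.51.100.0/24": "Route A"}
controller_choice = {"203.0.113.0/24": "Route B", "198.51.100.0/24": "Route A"}
overrides = compute_overrides(router_choice, controller_choice)
print(overrides)  # {'203.0.113.0/24': 'Route B'}
```

Emitting only the differences keeps BGP's own decision in force wherever the controller agrees with it.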
SLIDE 66
Types of Edge Fabric Overrides
Edge Fabric can override BGP's decision in order to...
Move traffic for set of end-users: override per <destination>
[Diagram: traffic to 203.0.113.0/24 on peering before, on transit after]
Move class of end-user traffic: override per <destination, traffic class>
[Diagram: low priority traffic on peering before, on transit after]
(see paper for details)
SLIDE 69 Example Override: Preventing Congestion
[Diagram: Router with Route A and Route B, via a Tier 1 and an ISP; 100 Gbps capacity and 10 Gbps capacity]
12 Gbps demand, composed of two prefixes:
198.51.100.0/24 | 9.5 Gbps
203.0.113.0/24 | 2.5 Gbps
BGP's decision: 12 Gbps load on one route, 0 Gbps load on the other
Edge Fabric shifts a prefix's traffic to an alternate link
Edge Fabric shifts 203.0.113.0/24 (destination-based override): 9.5 Gbps load stays on the original route, +2.5 Gbps load moves to the alternate route
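The example can be reproduced with a small greedy sketch: detour whole prefixes until the remaining load fits the circuit. Choosing the smallest prefix first is an arbitrary rule adopted for this illustration and is not claimed to be Edge Fabric's selection logic; the function name is hypothetical. The prefixes, rates, and 10 Gbps capacity are the figures from the slide.

```python
def shift_to_relieve(prefix_rates, capacity_gbps):
    """Greedy sketch: detour the smallest prefixes until load fits capacity.

    Returns (kept, shifted) dicts of prefix -> Gbps.
    """
    kept = dict(prefix_rates)
    shifted = {}
    for prefix in sorted(prefix_rates, key=prefix_rates.get):
        if sum(kept.values()) <= capacity_gbps:
            break
        shifted[prefix] = kept.pop(prefix)
    return kept, shifted


demand = {"198.51.100.0/24": 9.5, "203.0.113.0/24": 2.5}
kept, shifted = shift_to_relieve(demand, capacity_gbps=10.0)
print(kept)     # {'198.51.100.0/24': 9.5}
print(shifted)  # {'203.0.113.0/24': 2.5}
```

Moving whole prefixes matches the destination-based override granularity described above.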
SLIDE 74 Enacting Overrides at Routers
1 Edge Fabric injects override route via BGP
2 BGP at routers prefers routes from Edge Fabric
[Diagram: Edge Fabric injects a route for 203.0.113.0/24 into the edge router via BGP; the router's BGP engine compares BGP's selected route with the Edge Fabric injection. Labels: Transit Route, Injected Route, selected route]
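One way a router can be made to prefer injected routes is a higher BGP local preference on routes learned from the controller's session. The slides state only that routers prefer routes from Edge Fabric, so the attribute used, the numeric values, and the dictionary layout below are assumptions for illustration.

```python
def router_selects(candidates):
    """Return the candidate route with the highest local preference."""
    return max(candidates, key=lambda route: route["local_pref"])


learned = {"prefix": "203.0.113.0/24", "origin": "external BGP session", "local_pref": 100}
injected = {"prefix": "203.0.113.0/24", "origin": "Edge Fabric injection", "local_pref": 1000}

before = router_selects([learned])
after = router_selects([learned, injected])
print(before["origin"], "->", after["origin"])

# Withdrawing the injection falls back to the route BGP chose on its own.
fallback = router_selects([learned])
```

Because the override is just another BGP route, removing it restores the router's original decision without any other state to clean up.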
SLIDE 76
Enacting Overrides at Routers
Edge Fabric monitors BGP's decisions and overrides them as needed
We gain centralized control over the distributed BGP process without removing BGP from our routers
SLIDE 77 Edge Fabric is Flexible
Inputs: circuit capacity and traffic rates, route performance measurements, BGP routes, policy
Path per <destination>, path per <destination, traffic class>
Edge Fabric supports sophisticated traffic engineering policies
SLIDE 78
Edge Fabric Meets Our Design Priorities
Operational simplicity
Can fall back to BGP at routers
Allows operators to continue to use existing tools
Synchronization is only required between Edge Fabric and routers
Ease of deployment
BGP sessions with external peers remain at routers
Uses BGP protocol for injections
Uses other industry standards for route and traffic info (BMP/IPFIX/sFlow)
SLIDE 80
Outline
1 Overview
2 Facebook's Connectivity and Challenges
3 Sidestepping BGP's Limitations with Edge Fabric
4 Results from Edge Fabric's Behavior in Production
5 Evolution and Related Work
SLIDE 81
Edge Fabric entered production in 2013
Objective: Prevent circuit congestion
SLIDE 82 Edge Fabric in Production
[Diagram: edge routers send BGP routes (via BMP) and traffic rates (via IPFIX/sFlow) to Edge Fabric; Edge Fabric communicates with edge routers via BGP]
Runs per PoP, executes every 30 seconds
Controls 100% of Facebook's egress traffic
(see paper for implementation details)
SLIDE 83
Target Circuit Utilization To Avoid Congestion
How much traffic should Edge Fabric remove?
Circuit utilization:
110%: if all traffic was placed onto its most preferred path
100%: packet loss during bursts
~95%: high utilization with tolerance for bursts in traffic
50%: poor utilization
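The question of how much traffic to remove reduces to simple arithmetic once a target utilization is fixed. A minimal sketch, assuming the roughly 95% target from the slide and a hypothetical 10 Gbps circuit whose projected load is 110% of capacity; the function name and the example circuit size are illustrative.

```python
def traffic_to_remove(projected_gbps, capacity_gbps, threshold=0.95):
    """Gbps to shift away so the circuit sits at the target utilization.

    Returns 0 when the projected load is already at or below the target.
    """
    return max(0.0, projected_gbps - threshold * capacity_gbps)


capacity = 10.0           # hypothetical circuit size
projected = 11.0          # 110% of capacity on the most preferred path
excess = traffic_to_remove(projected, capacity)
print(f"shift {excess:.1f} Gbps to alternate routes")
```

A lower threshold leaves more headroom for bursts at the cost of utilization; a threshold of 100% leaves none.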
SLIDE 87
Evaluating Congestion Avoidance
Key questions:
Does Edge Fabric prevent circuit congestion and packet drops?
Does Edge Fabric keep circuit utilization at prescribed threshold?
SLIDE 88
Evaluating Congestion Avoidance
Does Edge Fabric prevent circuit congestion and packet drops?
During measurement period:
When Edge Fabric was shifting traffic away: 99.9% of the time, no packet drops
When Edge Fabric was not active: no packet drops
Edge Fabric intervened when needed and prevented circuit congestion
SLIDE 91
Evaluating Congestion Avoidance
Can we keep utilization at the threshold?
[Figure: histogram of (circuit utilization - threshold), sampled every 30 seconds for circuits where demand > capacity. x-axis: circuit utilization - threshold, with ticks at 0% (ideal value, the threshold), 1%, 2%, 3%, 4%, and regions marked "utilization lower than threshold" and "utilization higher than threshold"; y-axis: % of samples (10, 20, 30). Annotation: within 2%]
SLIDE 96
Does Edge Fabric prevent circuit congestion and packet drops? Yes.
Does Edge Fabric keep circuit utilization at prescribed threshold? Yes.
Edge Fabric prevents packet loss while keeping circuit utilization high
SLIDE 97
Outline
1 Overview
2 Facebook's Connectivity and Challenges
3 Sidestepping BGP's Limitations with Edge Fabric
4 Results from Edge Fabric's Behavior in Production
5 Evolution and Related Work
SLIDE 98 Evolution: Enacting Decisions
Initially: Host-based routing
Overrides enacted by hosts
Hosts signal egress path per packet via MPLS/DSCP/GRE
[Diagram: Edge Fabric sends decisions to servers; each packet carries "send via circuit X" to the routers]
Today: Edge-based routing
Overrides enacted by routers at edge
Hosts signal priority per packet via DSCP
[Diagram: Edge Fabric sends decisions to routers; each packet from servers carries a priority marking such as "video traffic"]
SLIDE 100
Evolution: Enacting Decisions
Before: Host-based routing
Today: Edge-based routing
Both provide the capabilities we want today
Preventing congestion, incorporating advanced policy, application-specific and performance-aware routing
Edge-based is best aligned with our design priorities
Operational simplicity, ease of deployment
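In the edge-based design, a host's only per-packet signal is a priority carried in the DSCP field. A minimal sketch of host-side marking with the standard sockets API follows; the class-to-DSCP mapping is hypothetical (the slides do not give code points), and the socket helper assumes a platform that exposes `IP_TOS`.

```python
import socket


def dscp_to_tos(dscp):
    """DSCP occupies the upper six bits of the IP TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp << 2


def open_marked_udp_socket(dscp):
    """Create an IPv4 UDP socket whose outgoing packets carry `dscp`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


# Hypothetical mapping from traffic class to DSCP value.
CLASS_TO_DSCP = {"default": 0, "video": 34}
print(dscp_to_tos(CLASS_TO_DSCP["video"]))  # 136
```

Routers at the edge can then match on the marking to apply a per <destination, traffic class> override, while hosts stay unaware of which circuit is used.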
SLIDE 102
Edge Fabric and Google's Espresso
Both systems use BGP to exchange routes with peers
Both systems centralize control and incorporate additional inputs
attribute | Facebook's Edge Fabric | Google's Espresso
design priorities | Operational simplicity, Ease of deployment | Maximum flexibility, Cost savings
edge device | router | MPLS switch
enacts decisions via | BGP injections to routers | host-based overrides
role of hosts | mark packet's priority | select packet's route
decision granularity | <destination, priority/class> | packet
routing options | per-PoP | global
SLIDE 112
BGP does not consider demand, capacity or performance
Problem has been around for a decade.
SLIDE 113
BGP does not consider demand, capacity or performance
Problem has been around for a decade.
Scale of connectivity, traffic, and QoS demands brings new challenges and opportunities
SLIDE 114
Conclusion
Benefits of Rich Interconnection
SLIDE 115 Conclusion
deliver traffic with the best performance possible
SLIDE 116 Conclusion
challenge
BGP does not consider demand, capacity or performance
deliver traffic with the best performance possible
SLIDE 117 Conclusion
deliver traffic with the best performance possible
challenge
BGP does not consider demand, capacity or performance
With Edge Fabric, we sidestep BGP's limitations
by shifting control from routers to software
result
more efficient network, better performance for our users