RouteBricks
Exploiting Parallelism To Scale Software Routers
Paweł Bedyński 12 January 2011
About the paper
Published:
October 2009, 14 pages
People:
Mihai Dobrescu and Norbert Egi (interns at Intel Labs Berkeley), Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
Institutions:
EPFL, Lausanne, Switzerland; Lancaster University, Lancaster, UK; Intel Research Labs, Berkeley, CA
„While certainly an improvement, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues (…)”
A little bit of history:
Network equipment has focused primarily on performance
Limited forms of packet processing
New functionality and services have renewed interest in programmable and extensible network equipment
Main issue:
High-end routers are difficult to extend
„Software routers” – easily programmable, but so far suitable only for low-packet-rate environments
Goal:
Individual link speed: 10 Gbps is already widespread
Carrier-grade routers: 10 Gbps up to 92 Tbps
Software routers have struggled to exceed 1-5 Gbps.
RouteBricks: parallelization across servers and across processing cores within each server
Requirements:
Variables:
N ports, each port full-duplex, line rate R bps
Router functionality:
(1) packet processing (route lookup, classification, ...)
(2) packet switching from input to output ports
Existing solutions (N – 10-10k, R – 1-40 Gbps):
Hardware routers:
Packet processing happens in the linecards (each serving one or a few ports); each linecard must process at rate cR.
Packet switching goes through a switch fabric with a centralized scheduler, which must therefore run at rate NR.
Software routers:
Both switching and packet processing at NR rate.
NR is unrealistic for a single-server solution – it is 2-3 orders of magnitude beyond current server performance.
Even cR (lowest c = 2) is too much for a single server if we do not exploit the multicore architecture (only 1-4 Gbps is reachable then).
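To make the gap concrete, here is a tiny back-of-the-envelope calculation; the values N = 32, R = 10 Gbps and c = 2 are illustrative assumptions, not figures from the paper.

```c
/* Back-of-the-envelope only: aggregate rates a router must sustain, with
 * assumed example values (N = 32 ports, R = 10 Gbps, c = 2), not figures
 * from the paper. */
#include <stdio.h>

int main(void) {
    int    N = 32;     /* number of ports (assumed example)        */
    double R = 10.0;   /* per-port line rate, Gbps                 */
    double c = 2.0;    /* small per-node overprovisioning constant */

    /* A single-server software router would have to switch and process NR. */
    printf("single server must handle NR = %.0f Gbps\n", N * R);

    /* A linecard, or one server in a cluster, only needs to handle cR. */
    printf("one linecard/server must handle cR = %.0f Gbps\n", c * R);
    return 0;
}
```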
Drawbacks, tradeoffs:
Packet reordering, increased latency, more „relaxed” performance guarantees
Switching guarantees:
(1) 100% throughput – all output ports can run at the full line rate R bps, if the input traffic demands it
(2) fairness – each input port gets its fair share of the capacity of any output port
(3) avoids reordering packets
Constraints of commodity servers:
Limited internal link rates – internal links cannot run at a rate higher than the external line rate R
Limited per-node processing rate – a single server's rate is no higher than cR, for a small constant c > 1
Limited per-node fanout – the number of physical connections from each server is constant and independent of the number of servers
Routing algorithms:
Static single-path – requires link „speedups”, which violates our constraints
Adaptive single-path – needs a centralized scheduler, which would have to run at rate NR
Load-balanced routing – VLB (Valiant Load Balancing)
Benefits of VLB:
Guarantees 100% throughput and fairness without centralized scheduling
Does not require link speedups – traffic is uniformly split across the cluster's internal links
Adds only +R (for intermediate traffic) to the per-server traffic rate compared to a solution without VLB (R for traffic coming in from the external line, R for traffic the server should send out) – see the sketch below
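A minimal sketch of the two VLB phases, assuming a hypothetical 4-server cluster and a simple port-to-server mapping (this is not the RouteBricks code, just the idea): the ingress server sends each packet to a uniformly random intermediate server, which then forwards it to the server owning the output port.

```c
/* Minimal sketch of the two Valiant Load Balancing phases (hypothetical
 * helper names and a fixed 4-server cluster; not the RouteBricks code). */
#include <stdio.h>
#include <stdlib.h>

#define NUM_SERVERS 4

/* Phase 1: the ingress server picks a uniformly random intermediate server,
 * so incoming traffic is spread evenly over the cluster's internal links. */
static int vlb_pick_intermediate(void) {
    return rand() % NUM_SERVERS;
}

/* Phase 2: the intermediate server forwards the packet to the server that
 * owns the external output port (assumed simple port-to-server mapping). */
static int vlb_pick_egress(int output_port) {
    return output_port % NUM_SERVERS;
}

int main(void) {
    int output_port = 2;                     /* example destination port */
    int mid    = vlb_pick_intermediate();
    int egress = vlb_pick_egress(output_port);
    printf("ingress -> server %d -> server %d (external port %d)\n",
           mid, egress, output_port);
    return 0;
}
```

This is where the extra +R comes from: each server handles at most R arriving from its own external line, R leaving on its own external line, and up to R of phase-1 (intermediate) traffic.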
Problems:
Packet reordering
Limited fanout – prevents us from using a full-mesh topology when N exceeds the server's fanout
Configurations:
Current servers: each server can handle one router port and accommodate 5 NICs
More NICs: 1 RP (router port), 20 NICs
Faster servers & more NICs: 2 RP, 20 NICs
Number of servers required to build an N-port, R = 10 Gbps/port router, for four different server configurations
Each network queue should be accessed by a single core
Locking is expensive
Separate threads for polling and writing
Threads statically assigned to cores (see the sketch below)
Each packet should be handled by a single core
The parallel approach outperforms the pipeline approach
3-fold performance improvement
Increased latency
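A rough sketch of the „one thread per queue, pinned to a core” rule using Linux pthread affinity; the polling loop body is a placeholder, not the actual Click/driver code, and it assumes at least as many cores as queues.

```c
/* Sketch: one polling thread per NIC queue, statically pinned to its own
 * core so that no queue is ever touched by two cores (no locking needed).
 * Linux-specific; poll_queue is a placeholder, not real driver code, and
 * the machine is assumed to have at least NUM_QUEUES cores. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_QUEUES 4

static void *poll_queue(void *arg) {
    long q = (long)arg;
    /* In the real system this loop would poll queue q for packet batches
     * and run the full forwarding path here ("parallel" approach: each
     * packet is handled start-to-finish by this one core). */
    printf("thread for queue %ld running on core %d\n", q, sched_getcpu());
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_QUEUES];

    for (long q = 0; q < NUM_QUEUES; q++) {
        /* Static 1:1 queue-to-core assignment, fixed before the thread starts. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)q, &set);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&threads[q], &attr, poll_queue, (void *)q);
        pthread_attr_destroy(&attr);
    }
    for (int q = 0; q < NUM_QUEUES; q++)
        pthread_join(threads[q], NULL);
    return 0;
}
```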
Workloads:
Minimal forwarding – traffic arriving at port i is just forwarded to port j (no routing-table lookup etc.)
IP routing – full routing with checksum calculation, updating headers etc.
IPsec packet encryption
Specification:
4 Nehalem servers, full-mesh topology, Direct-VLB routing, each server assigned a single 10 Gbps external line
Minimizing packet processing:
By encoding the output node in the MAC address (route lookup done only once, at the ingress server) – see the sketch below
Only works if each „internal” port has as many receive queues as there are external ports
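A hypothetical illustration of the MAC trick (the field layout and helper names are assumptions, not the paper's actual format): the ingress server does the route lookup once and stamps the egress-server id into the destination MAC, so intermediate servers can pick the right transmit queue without any further lookup.

```c
/* Hypothetical layout (not the paper's actual format): stash the
 * egress-server id in the last byte of the destination MAC on internal
 * links, so intermediate servers need no routing-table lookup. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct eth_hdr {
    uint8_t  dst[6];
    uint8_t  src[6];
    uint16_t ethertype;
};

/* Done once, at the ingress server, right after the routing lookup. */
static void stamp_egress_server(struct eth_hdr *h, uint8_t egress_id) {
    h->dst[5] = egress_id;
}

/* At an intermediate server: pick the transmit queue toward the egress
 * server straight from the header. */
static unsigned tx_queue_for(const struct eth_hdr *h) {
    return h->dst[5];
}

int main(void) {
    struct eth_hdr h;
    memset(&h, 0, sizeof h);
    stamp_egress_server(&h, 3);
    printf("forward via internal TX queue %u\n", tx_queue_for(&h));
    return 0;
}
```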
Avoiding reordering:
A standard VLB cluster allows reordering (multiple cores, load balancing)
Perfectly synchronized clocks would be a solution, but require custom hardware
Sequence-number tags – the CPU becomes a bottleneck
Solution – avoid reordering only within a TCP/UDP flow (see the sketch below):
Same-flow packets are assigned to the same queue
Sets of same-flow packets (within δ msec) are sent through the same intermediate node
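One way to read the two rules above as code (the hash function, 5-tuple layout and windowing are illustrative assumptions): hash the TCP/UDP 5-tuple so that all packets of a flow land in the same receive queue, and so that same-flow packets seen within the same δ-msec window use the same intermediate node.

```c
/* Sketch of per-flow hashing for reordering avoidance; the hash, 5-tuple
 * layout and windowing are illustrative, not the paper's exact scheme. */
#include <stdint.h>
#include <stdio.h>

#define NUM_QUEUES  8
#define NUM_SERVERS 4

struct flow_key {                 /* TCP/UDP 5-tuple */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a-style mix over the 5-tuple fields (illustrative choice). */
static uint32_t flow_hash(const struct flow_key *k) {
    uint32_t fields[5] = { k->src_ip, k->dst_ip, k->src_port, k->dst_port, k->proto };
    uint32_t h = 2166136261u;
    for (int i = 0; i < 5; i++) {
        h ^= fields[i];
        h *= 16777619u;
    }
    return h;
}

/* Rule 1: all packets of a flow go to the same receive queue (same core). */
static unsigned rx_queue_for(const struct flow_key *k) {
    return flow_hash(k) % NUM_QUEUES;
}

/* Rule 2: same-flow packets seen within the same delta-msec window are
 * sent through the same intermediate node. */
static unsigned intermediate_for(const struct flow_key *k,
                                 uint64_t now_ms, uint64_t delta_ms) {
    uint64_t window = now_ms / delta_ms;
    return (flow_hash(k) ^ (uint32_t)window) % NUM_SERVERS;
}

int main(void) {
    struct flow_key k = { 0x0a000001, 0x0a000002, 12345, 80, 6 };
    printf("rx queue %u, intermediate node %u\n",
           rx_queue_for(&k), intermediate_for(&k, 1000, 100));
    return 0;
}
```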
Forwarding performance:
Reordering (Abilene trace – single input, single output):
<p1,p2,p3,p4,p5> <p1,p4,p2,p3,p5> - one reordered sequence
Measure reordering as fraction of same-flow packet sequences that were reordered
RB4 – 0.15% when using reordering avoidance, 5.5% when using Direct-VLB without it
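A small sketch of how such a reordering metric could be computed from per-flow packet-id sequences; the counting convention is an assumption based on the example above (any out-of-order id marks the whole sequence as reordered).

```c
/* Sketch of the reordering metric: the fraction of same-flow packet
 * sequences that arrived out of order. Counting convention assumed from
 * the example above: any out-of-order id marks the whole sequence. */
#include <stdio.h>

static int is_reordered(const int *ids, int n) {
    for (int i = 1; i < n; i++)
        if (ids[i] < ids[i - 1])
            return 1;
    return 0;
}

int main(void) {
    int flow_a[] = {1, 2, 3, 4, 5};          /* arrived in order       */
    int flow_b[] = {1, 4, 2, 3, 5};          /* one reordered sequence */
    const int *flows[] = { flow_a, flow_b };
    int lens[] = {5, 5};
    int total = 2, reordered = 0;

    for (int f = 0; f < total; f++)
        reordered += is_reordered(flows[f], lens[f]);

    printf("reordered fraction: %.2f%%\n", 100.0 * reordered / total);
    return 0;
}
```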
Latency:
Per-server packet latency: 24 μs. Breakdown: DMA transfers (two back-and-forth transfers between NIC and memory: packet and descriptor) – 2.56 μs; routing – 0.8 μs; batching – 12.8 μs
Traversal through RB4 involves 2-3 hops; hence the estimated latency is 47.6-66.4 μs
Cisco 6500 Series router – 26.3 μs (packet-processing latency)
Workload | RB4 | Expected | Explanation
64B packets | 12 Gbps | 12.7-19.4 Gbps | extra overhead caused by the reordering-avoidance algorithm
Abilene trace | 35 Gbps | 33-49 Gbps | limited number of PCIe slots on the prototype server
Not only performance-driven: how to measure programmability?
Space, power, cost
Ethernet controllers directly on the motherboard (as done in laptops) – but the idea was not to change the hardware
Result: a 400 mm motherboard could accommodate 6 controllers driving 2x10 Gbps and 30x1 Gbps interfaces. With this we could connect 30-40 servers, giving a 300-400 Gbps router (30-40 servers x one 10 Gbps external port each) that occupies 30U (rack units).
in a 21U form-factor
The power rating of a popular mid-range router loaded for 40 Gbps is 1.6 kW
Power consumption could be reduced by slowing down components not stressed by the workload
„Raw cost”: a Cisco 7603 router costs about $70,000.