

  1. RouteBricks: Exploiting Parallelism to Scale Software Routers. Paweł Bedyński, 12 January 2011

  2. About the paper
     Published: October 2009, 14 pages
     People:
     - Mihai Dobrescu and Norbert Egi (interns at Intel Labs Berkeley)
     - Katerina Argyraki
     - Byung-Gon Chun
     - Kevin Fall
     - Gianluca Iannaccone
     - Allan Knies
     - Maziar Manesh
     - Sylvia Ratnasamy
     Institutions:
     - EPFL, Lausanne, Switzerland
     - Lancaster University, Lancaster, UK
     - Intel Research Labs, Berkeley, CA
     From the paper: "While certainly an improvement, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues (…)"

  3. What for?
     A little bit of history:
     - Network equipment has focused primarily on performance
     - Only limited forms of packet processing were supported
     - New functionality and services have renewed interest in programmable and extensible network equipment
     Main issue:
     - High-end routers are difficult to extend
     - "Software routers" are easily programmable, but have so far been suitable only for low packet-rate environments
     Goal:
     - Individual link speeds of 10 Gbps are already widespread
     - Carrier-grade routers range from 10 Gbps up to 92 Tbps
     - Software routers have had problems exceeding 1-5 Gbps
     - RouteBricks: parallelization across servers, and across tasks within each server

  4. Design Principles
     Requirements and variables:
     - N ports, each port full-duplex
     - Line rate R bps per port
     Router functionality:
     - (1) packet processing: route lookup, classification, etc.
     - (2) packet switching from input to output ports
     Existing solutions (N = 10-10,000 ports, R = 1-40 Gbps):
     - Hardware routers: packet processing happens in the linecard (serving one or a few ports), so each linecard must process at rate cR; packet switching goes through a switch fabric with a centralized scheduler, which must run at rate NR
     - Software routers: both switching and packet processing happen in a single machine, at rate NR

  5. Design Principles
     1. Router functionality should be parallelized across multiple servers.
        NR is unrealistic in a single-server solution; it is 2-3 orders of magnitude beyond current server performance (see the check below).
     2. Router functionality should be parallelized across multiple processing paths within each server.
        Even cR (with c = 2 at best) is too much for a single server unless we exploit the potential of multicore architectures (otherwise only 1-4 Gbps is reachable).
     Drawbacks and tradeoffs: packet reordering, increased latency, more "relaxed" performance guarantees.
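
To make the orders-of-magnitude claim concrete, here is a back-of-the-envelope check; the port count N = 100 is an illustrative assumption, not a figure from the slides:

```latex
% Aggregate rate a single-server router would have to sustain,
% with an illustrative N = 100 ports at R = 10 Gbps each:
NR = 100 \times 10\,\text{Gbps} = 1\,\text{Tbps}
% vs. the 1--5 Gbps a 2009-era software router actually forwards:
% roughly 10^2 to 10^3 times too slow, i.e. 2--3 orders of magnitude.
```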

  6. Parallelizing across servers
     Switching guarantees:
     - (1) 100% throughput: all output ports can run at the full line rate R bps, if the input traffic demands it
     - (2) fairness: each input port gets its fair share of the capacity of any output port
     - (3) avoids reordering packets
     Constraints of commodity servers:
     - Limited internal link rates: internal links cannot run at a rate higher than the external line rate R
     - Limited per-node processing rate: a single server cannot process at a rate higher than cR for a small constant c > 1
     - Limited per-node fanout: the number of physical connections from each server is constant and independent of the number of servers

  7. Parallelizing across servers
     Routing algorithms:
     - Static single-path: requires high "speedups", which violates our link-rate constraint
     - Adaptive single-path: needs centralized scheduling, which would have to run at rate NR
     - Load-balanced routing: VLB (Valiant Load Balancing)
     Benefits of VLB:
     - Guarantees 100% throughput and fairness without centralized scheduling
     - Does not require link speedups: traffic is uniformly split across the cluster's internal links
     - Adds only +R (for intermediate traffic) to the per-server traffic rate compared to a solution without VLB (R for traffic coming in from the external line, R for traffic the server should send out)
     Problems:
     - Packet reordering
     - Limited fanout: prevents us from using a full-mesh topology when N exceeds a server's fanout
     A minimal sketch of the two-phase VLB forwarding decision follows below.
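
The following C sketch illustrates the two-phase VLB idea. It is a hedged toy, not the RouteBricks implementation (which is built on Click); the cluster size and all names are illustrative assumptions:

```c
#include <stdio.h>
#include <stdlib.h>

#define N_SERVERS 4   /* illustrative; matches RB4's four servers */

/* Phase 1: the input server sends each packet to a uniformly random
 * intermediate server, regardless of destination. This is what spreads
 * load evenly over all internal links with no central scheduler. */
static int vlb_phase1_next_hop(void) {
    return rand() % N_SERVERS;
}

/* Phase 2: the intermediate server forwards the packet to the server
 * that owns the packet's external output port. */
static int vlb_phase2_next_hop(int output_server) {
    return output_server;
}

int main(void) {
    int out = 2;  /* hypothetical output server for some packet */
    int mid = vlb_phase1_next_hop();
    printf("packet: input -> server %d -> server %d (output)\n",
           mid, vlb_phase2_next_hop(out));
    /* Per-server load: R in from the external line, up to R of relayed
     * intermediate traffic, R out to the external line, hence the cR
     * bound with c = 3; Direct VLB skips the relay when traffic is
     * already uniform, giving c = 2. */
    return 0;
}
```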

  8. Parallelizing across servers
     Configurations:
     - Current servers: each server can handle one router port and accommodate 5 NICs
     - More NICs: 1 RP (router port), 20 NICs
     - Faster servers & more NICs: 2 RPs, 20 NICs
     (Figure: number of servers required to build an N-port, R = 10 Gbps/port router, for four different server configurations.)

  9. Parallelizing within servers

  10. Parallelizing within servers
     - Each network queue should be accessed by a single core
       - Locking is expensive
       - Separate threads for polling and writing
       - Threads are statically assigned to cores
     - Each packet should be handled by a single core
       - The pipeline approach is outperformed by the parallel approach
     A sketch of this queue-to-core assignment follows below.
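
A minimal C sketch of the lock-free rule "one queue, one core": each receive queue is polled by exactly one thread, pinned to its own core, so the queue is single-reader by construction and needs no lock. The queue structure and all names are illustrative assumptions (it runs forever; Linux-only due to the affinity call):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define N_CORES 4   /* illustrative */

/* Illustrative stand-in for a NIC receive queue. */
struct rx_queue { int id; /* ... descriptor ring, etc. ... */ };

static struct rx_queue queues[N_CORES];

/* Each thread polls exactly one queue, so no lock is ever taken. */
static void *poll_loop(void *arg) {
    struct rx_queue *q = arg;
    for (;;) {
        /* Poll q and process each packet to completion on this core
         * (the "parallel" approach), never handing it to another
         * core (the slower "pipeline" approach). */
        (void)q;
    }
    return NULL;
}

int main(void) {
    pthread_t tid[N_CORES];
    for (int i = 0; i < N_CORES; i++) {
        queues[i].id = i;
        pthread_create(&tid[i], NULL, poll_loop, &queues[i]);
        /* Statically pin thread i to core i. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(tid[i], sizeof(set), &set);
    }
    for (int i = 0; i < N_CORES; i++) pthread_join(tid[i], NULL);
    return 0;
}
```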

  11. Parallelizing within servers
     - "Batch" processing (NIC-driven and poll-driven)
       - 3-fold performance improvement
       - Increased latency
     A sketch of poll-driven batching follows below.
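
A minimal sketch of poll-driven batching: dequeue up to a whole batch of packets per poll so per-call overhead is amortized. The batch size, the stubbed driver primitive, and all names are illustrative assumptions:

```c
#include <stddef.h>
#include <stdio.h>

#define BATCH_SIZE 16  /* illustrative batch size */

struct packet   { int id; };
struct rx_queue { int id; };

/* Stub for a driver primitive that dequeues up to n packets in one
 * call; a real driver would drain the NIC's descriptor ring here. */
static size_t rx_burst(struct rx_queue *q, struct packet **pkts, size_t n) {
    (void)q; (void)pkts; (void)n;
    return 0;  /* stub: no packets pending */
}

static void process_packet(struct packet *p) { (void)p; }

/* Poll-driven batching: fetch up to BATCH_SIZE packets per poll, so
 * per-call costs (PCIe transactions, bookkeeping) are paid once per
 * batch rather than once per packet. The slides report a ~3x
 * throughput gain, at the cost of latency, since a packet may wait
 * while its batch accumulates. */
static void poll_once(struct rx_queue *q) {
    struct packet *batch[BATCH_SIZE];
    size_t n = rx_burst(q, batch, BATCH_SIZE);
    for (size_t i = 0; i < n; i++)
        process_packet(batch[i]);
}

int main(void) {
    struct rx_queue q = { 0 };
    poll_once(&q);
    puts("polled one (empty) batch");
    return 0;
}
```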

  12. Evaluation: server parallelism
     Workloads:
     - Minimal forwarding: traffic arriving at port i is simply forwarded to port j (no routing-table lookup, etc.)
     - IP routing: full routing with checksum calculation, header updates, etc.
     - IPsec packet encryption

  13. The RB4 Parallel Router
     Specification:
     - 4 Nehalem servers
     - Full-mesh topology
     - Direct-VLB routing
     - Each server is assigned a single 10 Gbps external line

  14. RB4 - implementation
     Minimizing packet processing:
     - By encoding the output node in the MAC address (so the lookup is done only once)
     - Only works if each "internal" port has as many receive queues as there are external ports
     Avoiding reordering:
     - A standard VLB cluster allows reordering (multiple cores, load balancing)
     - Perfectly synchronized clocks would solve it, but require custom operating systems and hardware
     - Sequence-number tags: the CPU becomes a bottleneck
     - Adopted solution: avoid reordering within a TCP/UDP flow
       - Same-flow packets are assigned to the same queue
       - Bursts of same-flow packets (within a δ-msec window) are sent through the same intermediate node
     A sketch of this flow-sticky scheme follows below.
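
A minimal C sketch of the reordering-avoidance idea: hash the flow 5-tuple so same-flow packets land in the same queue, and reuse the same intermediate node for a flow as long as its packets keep arriving within a δ window. The hash function, table size, value of δ, and all names are illustrative assumptions, not RB4's actual parameters:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N_QUEUES 8            /* illustrative */
#define N_SERVERS 4           /* RB4 has four servers */
#define FLOW_TABLE_SIZE 1024  /* illustrative */
#define DELTA_USEC 100        /* illustrative δ; the slides leave it unspecified */

/* The TCP/UDP 5-tuple identifying a flow. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Toy hash; a real NIC would use e.g. Toeplitz (RSS) hashing. */
static uint32_t flow_hash(const struct flow_key *k) {
    return (k->src_ip ^ k->dst_ip) * 2654435761u
         ^ ((uint32_t)k->src_port << 16 | k->dst_port) ^ k->proto;
}

/* Rule 1: same-flow packets always map to the same receive queue,
 * so one core sees the whole flow, in order. */
static int queue_for_flow(const struct flow_key *k) {
    return flow_hash(k) % N_QUEUES;
}

/* Rule 2: within a δ window, same-flow packets reuse the same
 * intermediate node; once the flow goes quiet for longer than δ,
 * it can be re-balanced without risking reordering. */
struct flow_slot { uint64_t last_seen_usec; int intermediate; };
static struct flow_slot table[FLOW_TABLE_SIZE];

static int intermediate_for_flow(const struct flow_key *k, uint64_t now_usec) {
    struct flow_slot *s = &table[flow_hash(k) % FLOW_TABLE_SIZE];
    if (now_usec - s->last_seen_usec > DELTA_USEC)
        s->intermediate = rand() % N_SERVERS;  /* fresh VLB choice */
    s->last_seen_usec = now_usec;
    return s->intermediate;
}

int main(void) {
    struct flow_key k = { 0x0a000001, 0x0a000002, 1234, 80, 6 };
    printf("queue %d, intermediate %d\n",
           queue_for_flow(&k), intermediate_for_flow(&k, 1000));
    return 0;
}
```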

  15. RB4 - performance
     Forwarding performance:
       Workload | RB4     | Expected  | Explanation
       64B      | 12 Gbps | 12.7-19.4 | Extra overhead from the reordering-avoidance algorithm
       Abilene  | 35 Gbps | 33-49     | Limited number of PCIe slots on the prototype server
     Reordering (Abilene trace, single input, single output):
     - <p1,p2,p3,p4,p5> arriving as <p1,p4,p2,p3,p5> counts as one reordered sequence
     - Reordering is measured as the fraction of same-flow packet sequences that were reordered
     - RB4: 0.15% with reordering avoidance, 5.5% with plain Direct VLB
     Latency:
     - Per-server packet latency is 24 μs: DMA transfers (two back-and-forth transfers between NIC and memory, for the packet and its descriptor) at 2.56 μs each, routing 0.8 μs, batching 12.8 μs
     - A traversal through RB4 includes 2-3 hops; hence the estimated latency is 47.6-66.4 μs
     - For reference, a Cisco 6500 Series router has a packet-processing latency of 26.3 μs
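
As a sanity check of the per-server figure, here is my own arithmetic on the slide's numbers, under the assumption that "two back-and-forth transfers" means four DMA operations:

```latex
% Per-server latency from the slide's components:
4 \times 2.56\,\mu\text{s} \;(\text{DMA}) + 0.8\,\mu\text{s} \;(\text{routing})
  + 12.8\,\mu\text{s} \;(\text{batching}) \approx 23.8\,\mu\text{s} \approx 24\,\mu\text{s}
% Two hops then give 2 \times 23.8 \approx 47.6 us, matching the lower
% end of the quoted 47.6--66.4 us range for a 2--3 hop traversal.
```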

  16. Discussion & Conclusions
     Not only performance-driven:
     Space:
     - Limit space in RB4 by integrating Ethernet controllers directly on the motherboard (as done in laptops); but the idea was not to change hardware
     - Estimate made by extrapolation: a 400 mm motherboard could accommodate 6 controllers driving 2x10 Gbps and 30x1 Gbps interfaces. With this we could connect 30-40 servers, yielding a 300-400 Gbps router that occupies 30U (rack units)
     - For reference: the Cisco 7600 Series offers 360 Gbps in a 21U form factor
     Power:
     - RB4 consumes 2.6 KW
     - For reference: the nominal power rating of a popular mid-range 40 Gbps router loaded for 40 Gbps is 1.6 KW
     - RB4 can reduce power consumption by slowing down components not stressed by the workload
     Cost:
     - RB4 prototype: $14,000 "raw cost"
     - For reference: a Cisco 7603 router costs $70,000
     How to measure programmability?

  17. Q&A
