SLIDE 1

RouteBricks

Exploiting Parallelism To Scale Software Routers

Paweł Bedyński 12 January 2011

SLIDE 2

• Published: October 2009, 14 pages
• People: Mihai Dobrescu and Norbert Egi (interns at Intel Labs Berkeley), Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy
• Institutions: EPFL, Lausanne, Switzerland; Lancaster University, Lancaster, UK; Intel Research Labs, Berkeley, CA

About the paper

"While certainly an improvement, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues (…)"

SLIDE 3

A little bit of history:

Network equipment has focused primarily on performance

Limited forms of packet processing

New functionality and services have renewed interest in programmable and extensible network equipment

Main issue:

High-end routers are difficult to extend

"Software routers" are easily programmable, but so far have been suitable only for low packet-rate environments

Goal:

Individual link speed: 10 Gbps is already widespread

Carrier-grade routers: 10 Gbps up to 92 Tbps

Software routers have had problems exceeding 1-5 Gbps

RouteBricks: parallelization across servers and across tasks within each server

What for?

SLIDE 4

• Requirements and variables:
  • N ports, each port full-duplex
  • Line rate R bps
• Router functionality:
  • (1) packet processing: route lookup or classification
  • (2) packet switching from input to output ports
• Existing solutions (N: 10-10,000 ports, R: 1-40 Gbps):
  • Hardware routers:
    • Packet processing happens in the linecard (serving one or a few ports). Each linecard must process at rate cR.
    • Packet switching goes through a switch fabric with a centralized scheduler, hence the fabric runs at rate NR.
  • Software routers:
    • Both switching and packet processing run at rate NR.

Design Principles
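A quick worked example, with illustrative numbers chosen from the ranges above: for N = 40 ports at R = 10 Gbps, each hardware linecard must sustain cR = 20-30 Gbps (for a small c of 2-3), while the switch fabric, or a hypothetical single-server software router, would have to sustain NR = 400 Gbps.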

SLIDE 5
• 1. Router functionality should be parallelized across multiple servers

NR is unrealistic in a single-server solution: it is 2-3 orders of magnitude beyond current server performance.

• 2. Router functionality should be parallelized across multiple processing paths within each server

Even cR (with the lowest c = 2) is too much for a single server if we don't exploit the potential of multicore architectures (only 1-4 Gbps is reachable otherwise).

Drawbacks and tradeoffs:

Packet reordering, increased latency, more "relaxed" performance guarantees

Design Principles

SLIDE 6

• Switching guarantees:
  • (1) 100% throughput – all output ports can run at the full line rate R bps, if the input traffic demands it
  • (2) fairness – each input port gets its fair share of the capacity of any output port
  • (3) no packet reordering
• Constraints of commodity servers:
  • Limited internal link rates – internal links cannot run at a rate higher than the external line rate R
  • Limited per-node processing rate – a single server's rate is no higher than cR, for a small constant c > 1
  • Limited per-node fanout – the number of physical connections from each server is constant and independent of the number of servers

Parallelizing across servers

SLIDE 7

• Routing algorithms:
  • Static single-path – requires high "speedups", which violates our constraints
  • Adaptive single-path – needs centralized scheduling, which would have to run at rate NR
  • Load-balanced routing – VLB (Valiant Load Balancing); see the sketch after this slide
• Benefits of VLB:
  • Guarantees 100% throughput and fairness without centralized scheduling
  • Doesn't require link speedups – traffic is uniformly split across the cluster's internal links
  • Adds only +R (for intermediate traffic) to the per-server traffic rate compared to a solution without VLB (R for traffic coming in from the external line, R for traffic the server should send out)
• Problems:
  • Packet reordering
  • Limited fanout: prevents us from using a full-mesh topology when N exceeds the server's fanout

Parallelizing across servers
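To make the two VLB phases concrete, here is a minimal C++ sketch of the forwarding decision, assuming a full-mesh cluster. Packet, Send, and the hashing scheme are illustrative placeholders of mine, not the RouteBricks implementation:

```cpp
#include <cstdint>
#include <functional>

// Hypothetical packet descriptor: just enough fields to pick a path.
struct Packet {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    int output_server;  // index of the server owning the external output port
};

using Send = std::function<void(const Packet&, int /*server*/)>;

// Spread traffic uniformly over all servers, independent of destination.
// Hashing the flow (rather than flipping a coin per packet) keeps one flow on
// one path, which the reordering-avoidance scheme later builds on.
int pick_intermediate(const Packet& p, int num_servers) {
    uint64_t key = (uint64_t(p.src_ip) << 32) ^ p.dst_ip;
    key ^= (uint64_t(p.src_port) << 16) ^ p.dst_port;
    return int(std::hash<uint64_t>{}(key) % num_servers);
}

// Phase 1: a packet entering from the external line is sent to an
// intermediate server chosen independently of its destination.
void on_external_ingress(const Packet& p, int num_servers, const Send& send) {
    send(p, pick_intermediate(p, num_servers));
}

// Phase 2: the intermediate server forwards to the output server. Each server
// thus carries at most R (external in) + R (intermediate) + R (external out)
// = 3R, with no centralized scheduler.
void on_internal_ingress(const Packet& p, const Send& send) {
    send(p, p.output_server);
}

int main() {
    Send send = [](const Packet&, int /*server*/) { /* enqueue on internal link */ };
    Packet p{0x0A000001, 0x0A000002, 1234, 80, /*output_server=*/3};
    on_external_ingress(p, /*num_servers=*/4, send);
}
```

RB4 itself uses Direct VLB (slide 13), a refinement that forwards directly to the output server when the traffic matrix allows it, cutting per-server load back below 3R.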

SLIDE 8

• Configurations:
  • Current servers: each server can handle one router port and accommodate 5 NICs
  • More NICs: 1 RP (router port), 20 NICs
  • Faster servers & more NICs: 2 RPs, 20 NICs

Parallelizing across servers

Figure: Number of servers required to build an N-port, R = 10 Gbps/port router, for four different server configurations

SLIDE 9

Parallelizing within servers

SLIDE 10

Parallelizing within servers

• Each network queue should be accessed by a single core
  • Locking is expensive
  • Separate threads for polling and for writing
  • Threads are statically assigned to cores (see the sketch below)
• Each packet should be handled by a single core
  • The pipeline approach is outperformed by the parallel approach
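A minimal sketch of the "one queue, one core" rule, assuming Linux pthread affinity; the queue count, the 1:1 core mapping, and the (elided) poll call are illustrative assumptions of mine, not the RouteBricks code:

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to one core; after this the scheduler never migrates
// it, so the queue it owns is only ever touched by that core – no locks needed.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// One polling thread per RX queue, statically assigned.
void poll_loop(int queue_id, int core) {
    pin_to_core(core);
    // A real worker would spin here forever: poll queue_id, process a batch,
    // transmit. Left empty so this sketch terminates.
    (void)queue_id;
}

int main() {
    const int kQueues = 4;  // illustrative: one RX queue per core
    std::vector<std::thread> workers;
    for (int q = 0; q < kQueues; ++q)
        workers.emplace_back(poll_loop, q, /*core=*/q);  // static 1:1 mapping
    for (auto& t : workers) t.join();
}
```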

SLIDE 11

• "Batch" processing (NIC-driven, poll-driven); see the sketch below
  • 3-fold performance improvement
  • Increased latency

Parallelizing within servers
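A minimal sketch of a poll-driven batched forwarding loop. rx_burst, tx_burst, and process are hypothetical stand-ins for a batch-oriented driver interface (stubbed here so the sketch compiles), not an API from the paper:

```cpp
#include <array>
#include <cstddef>

struct Pkt {};  // opaque packet descriptor

// Hypothetical driver interface: move up to n packets in one call. Stubbed so
// the sketch is self-contained; a real driver would touch the NIC rings here.
std::size_t rx_burst(int /*queue*/, Pkt** /*out*/, std::size_t /*n*/) { return 0; }
void tx_burst(int /*queue*/, Pkt** /*pkts*/, std::size_t /*n*/) {}
Pkt* process(Pkt* p) { return p; }  // route lookup, header rewrite, ...

void batched_forward(int rx_q, int tx_q) {
    std::array<Pkt*, 32> batch;  // batch size trades throughput for latency
    for (;;) {
        // One poll pulls up to 32 packets: one driver transition and one
        // descriptor sweep are amortized over the whole batch.
        std::size_t n = rx_burst(rx_q, batch.data(), batch.size());
        if (n == 0) break;  // sketch only: a real poll loop would keep spinning
        for (std::size_t i = 0; i < n; ++i)
            batch[i] = process(batch[i]);
        tx_burst(tx_q, batch.data(), n);  // send the whole batch in one call
    }
}

int main() { batched_forward(0, 0); }
```

The amortization of per-packet book-keeping is where the roughly 3-fold improvement comes from; the latency cost appears because a packet may wait in the NIC until its batch is polled.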

SLIDE 12

• Workloads:
  • Minimal forwarding – traffic arriving at port i is just forwarded to port j (no routing-table lookup, etc.)
  • IP routing – full routing with checksum calculation, header updates, etc.
  • IPsec packet encryption

Evaluation: server parallelism

SLIDE 13

• Specification:
  • 4 Nehalem servers
  • Full-mesh topology
  • Direct-VLB routing
  • Each server assigned a single 10 Gbps external line

The RB4 Parallel Router

SLIDE 14

• Minimizing packet processing:
  • Encode the output node in the MAC address (computed once)
  • Only works if each "internal" port has as many receive queues as there are external ports
• Avoiding reordering:
  • A standard VLB cluster allows reordering (multiple cores or load balancing)
  • Perfectly synchronized clocks would solve it, but require custom operating systems and hardware
  • Sequence-number tags: the CPU becomes a bottleneck
  • Solution: avoid reordering within a TCP/UDP flow (sketched below)
    • Same-flow packets are assigned to the same queue
    • A set of same-flow packets (within a δ-msec window) is sent through the same intermediate node

RB4 - implementation
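A minimal sketch of the two flow-consistency rules above; the 5-tuple hash and the δ-windowed intermediate choice are my illustration of the idea (FlowKey, flow_hash, and the window arithmetic are assumptions), not the RB4 code:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>

struct FlowKey {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

uint64_t flow_hash(const FlowKey& k) {
    uint64_t h = (uint64_t(k.src_ip) << 32) | k.dst_ip;
    h ^= (uint64_t(k.src_port) << 24) ^ (uint64_t(k.dst_port) << 8) ^ k.proto;
    return std::hash<uint64_t>{}(h);
}

// Rule 1: same flow -> same RX queue -> same core, so per-flow processing
// order is preserved inside a server.
int rx_queue_for(const FlowKey& k, int num_queues) {
    return int(flow_hash(k) % num_queues);
}

// Rule 2: all same-flow packets inside one delta-msec window pick the same
// intermediate node. Folding the window index into the hash re-randomizes the
// choice over time, so the cluster's internal links stay balanced.
int intermediate_for(const FlowKey& k, int num_servers, int delta_msec) {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    uint64_t window = uint64_t(
        std::chrono::duration_cast<std::chrono::milliseconds>(now).count()) / delta_msec;
    return int(std::hash<uint64_t>{}(flow_hash(k) ^ window) % num_servers);
}

int main() {
    FlowKey k{0x0A000001, 0x0A000002, 1234, 80, 6};  // an illustrative TCP flow
    std::printf("queue=%d intermediate=%d\n",
                rx_queue_for(k, 8), intermediate_for(k, 4, /*delta_msec=*/100));
}
```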

SLIDE 15

• Forwarding performance: see the table below
• Reordering (Abilene trace – single input, single output):

<p1,p2,p3,p4,p5> → <p1,p4,p2,p3,p5> – one reordered sequence

Reordering is measured as the fraction of same-flow packet sequences that were reordered (see the sketch below).

RB4: 0.15% with reordering avoidance, 5.5% with plain Direct VLB
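A minimal sketch of this metric: a same-flow sequence counts as reordered if any packet arrives after a packet with a higher sequence number. The slide's <p1,p4,p2,p3,p5> example comes out as one reordered sequence:

```cpp
#include <cstdio>
#include <vector>

// True if any packet arrives after one with a higher sequence number.
bool is_reordered(const std::vector<int>& arrival_order) {
    int max_seen = 0;
    for (int seq : arrival_order) {
        if (seq < max_seen) return true;  // an earlier packet overtook this one
        max_seen = seq;
    }
    return false;
}

int main() {
    std::printf("%d\n", is_reordered({1, 2, 3, 4, 5}));  // 0: in order
    std::printf("%d\n", is_reordered({1, 4, 2, 3, 5}));  // 1: one reordered sequence
    // The reported figure is the fraction of same-flow sequences for which
    // is_reordered() returns true.
}
```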

• Latency:

Per-server packet latency: 24 μs, of which DMA transfers (two back-and-forth transfers between NIC and memory: packet and descriptor) account for 2.56 μs, routing for 0.8 μs, and batching for 12.8 μs.

A traversal through RB4 takes 2-3 hops, hence the estimated end-to-end latency is 47.6-66.4 μs.

For reference, a Cisco 6500 Series router has a packet-processing latency of 26.3 μs.

RB4 - performance

Workload | RB4     | Expected       | Explanation
64B      | 12 Gbps | 12.7-19.4 Gbps | Extra overhead caused by the reordering-avoidance algorithm
Abilene  | 35 Gbps | 33-49 Gbps     | Limited number of PCIe slots on the prototype server

SLIDE 16

• Not only performance-driven: how do we measure programmability?

Discussion & Conclusions

Space:

• Space in RB4 could be reduced by integrating Ethernet controllers directly on the motherboard (as is done in laptops) – but the idea was not to change the hardware
• Estimates made by extrapolating the results: a 400 mm motherboard could accommodate 6 controllers, driving 2x10 Gbps and 30x1 Gbps interfaces. With this we could connect 30-40 servers, resulting in a 300-400 Gbps router that occupies 30U (rack units).
• For reference: a Cisco 7600 Series offers 360 Gbps in a 21U form factor

Power:

• RB4 consumes 2.6 kW
• For reference: the nominal power rating of a popular mid-range router loaded for 40 Gbps is 1.6 kW
• RB4 can reduce power consumption by slowing down components not stressed by the workload

Cost:

• RB4 prototype: $14,000 "raw cost"
• For reference: a 40 Gbps Cisco 7603 router costs $70,000

SLIDE 17

Q&A