SLIDE 1

RouteBricks: Exploiting Parallelism to Scale Software Routers
Mihai Dobrescu et al., SOSP 2009
Presented by Shuyi Chen


SLIDE 2

Motivation

  • Router design
    – Performance
    – Extensibility
    – They are competing goals

  • Hardware approach
    – Supports limited APIs
    – Poor programmability
    – Need to deal with low-level issues


SLIDE 3

Motivation

  • Software approach
    – Low performance
    – Easy to program and upgrade

  • Challenges in building a software router
    – Performance
    – Power
    – Space

  • RouteBricks as the solution to close the divide

SLIDE 4

RouteBricks

  • RouteBricks is a router architecture that parallelizes router functionality across multiple servers and across multiple cores within a single server


SLIDE 5

Design Principles

  • Goal: a “router” with N ports, each working at R bps
  • Traditional router functionality
    – Packet switching (NR bps in the scheduler)
    – Packet processing (R bps in each linecard)

  • Principle 1: router functionality should be parallelized across multiple servers
  • Principle 2: router functionality should be parallelized across multiple processing paths within each server


SLIDE 6

Parallelizing across servers

  • A switching solution
    – Provides a physical path
    – Determines how to relay packets

  • It should guarantee
    – 100% throughput
    – Fairness
    – Avoidance of packet reordering

  • Constraints of commodity servers
    – Limited internal link rate
    – Limited per-node processing rate
    – Limited per-node fanout


SLIDE 7

Parallelizing across servers

  • To satisfy the requirements
    – Routing algorithm
    – Topology



SLIDE 8

Routing Algorithms

  • Options
    – Static single-path routing
    – Adaptive single-path routing

  • Valiant Load Balancing (VLB)
    – Full mesh
    – 2 phases (see the sketch below)
    – Benefits
    – Drawbacks
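The two VLB phases are easy to see in a small sketch. This is a minimal Python illustration with my own names, not the paper's code: in phase 1 the ingress server spreads each packet to a uniformly chosen intermediate server; in phase 2 the intermediate forwards it to the server that owns the destination's external port. Because every packet crosses the internal fabric twice, plain VLB costs each server up to 3R of processing, which the direct VLB variant on the next slide reduces to 2R.

```python
import random

def vlb_phase1_pick_intermediate(servers):
    """Phase 1: the ingress server spreads traffic over all servers,
    picking an intermediate uniformly at random for each packet."""
    return random.choice(servers)

def vlb_phase2_destination(packet, port_to_server):
    """Phase 2: the intermediate server forwards the packet to the
    server that owns the packet's external output port."""
    return port_to_server[packet["out_port"]]

# Toy usage with 4 servers, each owning one external port (an assumption
# made for illustration; it matches the RB4 setup later in the deck).
servers = ["s0", "s1", "s2", "s3"]
port_to_server = {port: srv for port, srv in enumerate(servers)}

packet = {"out_port": 2}
hop1 = vlb_phase1_pick_intermediate(servers)           # ingress -> intermediate
hop2 = vlb_phase2_destination(packet, port_to_server)  # intermediate -> egress
print(hop1, "->", hop2)
```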


SLIDE 9

Routing Algorithms

  • Direct VLB (see the sketch below)
    – When the traffic matrix is close to uniform
    – Each input node S routes up to R/N of the traffic addressed to output node D directly to D, and load-balances the rest across the remaining nodes
    – Reduces the per-server processing requirement from 3R to 2R

  • Issues
    – Packet reordering
    – N might exceed the node fanout
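A rough sketch of the direct-VLB split (my own illustration with invented bookkeeping such as `direct_budget`; the paper's mechanism may differ in its details): an ingress server sends traffic for destination D straight to D up to its R/N budget, and falls back to classic two-phase VLB through a random intermediate for the overflow.

```python
import random

R = 10_000_000_000  # external line rate in bits per second (10 Gbps)
N = 4               # number of servers / external ports

# Per-destination direct budget for one accounting interval (R/N bits).
direct_budget = {d: R // N for d in range(N)}

def direct_vlb_next_hop(src, dst, pkt_bits, sent_direct):
    """Pick the first internal hop for a packet entering at server `src`
    and destined to the external port on server `dst`.

    `sent_direct[dst]` tracks how many bits were already sent directly
    to `dst` in the current interval (illustrative bookkeeping only)."""
    if sent_direct[dst] + pkt_bits <= direct_budget[dst]:
        sent_direct[dst] += pkt_bits
        return dst                       # direct: both VLB phases collapse into one hop
    # Over budget: fall back to classic VLB through a random intermediate.
    others = [s for s in range(N) if s not in (src, dst)]
    return random.choice(others)

sent_direct = {d: 0 for d in range(N)}
print(direct_vlb_next_hop(src=0, dst=2, pkt_bits=12_000, sent_direct=sent_direct))
```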


SLIDE 10

Topology

  • If N is less than the node fanout
    – Use a full mesh

  • Otherwise
    – Use a k-ary n-fly network (n = log_k N; see the sizing sketch below)

[Figure: number of servers required vs. number of external router ports, comparing 48-port switches with servers having one external port and 5 PCIe slots, one external port and 20 PCIe slots, or two external ports and 20 PCIe slots; the topology transitions from a mesh to an n-fly once the port count exceeds the server fanout.]
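A back-of-the-envelope sketch of the mesh-to-n-fly transition (my own illustration; the `fanout` value and the per-stage node count are simplifying assumptions, not the paper's exact cost model): a full mesh is feasible only while a server's fanout can reach all N - 1 other servers; beyond that, a k-ary n-fly with n = log_k N stages keeps per-node fanout bounded at the cost of extra switching nodes.

```python
import math

def nfly_stages(N, k):
    """Stages n of a k-ary n-fly connecting N inputs to N outputs:
    the smallest n with k**n >= N (n = log_k N when N is a power of k)."""
    n = 1
    while k ** n < N:
        n += 1
    return n

def topology_choice(N, fanout, k=4):
    """Rough sizing sketch: a full mesh works while each node can reach
    the other N-1 nodes directly; otherwise use a k-ary n-fly with
    roughly N/k switching nodes per stage."""
    if N - 1 <= fanout:
        return {"topology": "full mesh", "switch_nodes": 0}
    n = nfly_stages(N, k)
    return {"topology": f"{k}-ary {n}-fly",
            "switch_nodes": n * math.ceil(N / k)}

for ports in (4, 8, 64, 1024):
    print(ports, topology_choice(ports, fanout=5))
```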

SLIDE 11

Parallelizing within servers

  • A line rate of R = 10 Gbps requires each server to be able to process packets at a rate of at least 2R = 20 Gbps

  • Meeting this requirement is daunting
  • Exploit packet-processing parallelism within a server
    – Memory access parallelism
    – Parallelism in NICs
    – Batch processing


SLIDE 12

Memory Access Parallelism

[Figure 5: A traditional shared-bus architecture.]
[Figure 4: A server architecture based on point-to-point inter-socket links and integrated memory controllers.]

  • Xeon
    – Shared FSB
    – Single memory controller

  • Streaming workloads require high bandwidth between the CPUs and the other subsystems

  • Nehalem
    – Point-to-point links
    – Multiple memory controllers


SLIDE 13

Parallelism in NICs

  • How to assign packets to cores
    – Rule 1: each network queue is accessed by a single core
    – Rule 2: each packet is handled by a single core

  • However, if a port has only one network queue, it is hard to enforce both rules simultaneously



SLIDE 14

Parallelism in NICs

  • Fortunately, modern NICs have multiple receive and transmit queues

  • These can be used to enforce both rules (see the sketch below)
    – One core per packet
    – One core per queue
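A minimal sketch of how multiple hardware queues let both rules hold at once (my own illustration; the hash below stands in for the NIC's receive-side flow hashing and is not the actual hardware function): the NIC hashes each packet's flow identifier to one receive queue, and each queue is polled by exactly one core, so every queue has a single reader and every packet is touched by a single core.

```python
import hashlib

NUM_QUEUES = 4                                              # one receive queue per core (assumption)
queue_owner = {q: f"core{q}" for q in range(NUM_QUEUES)}    # Rule 1: one core per queue

def rx_queue_for(packet):
    """Stand-in for the NIC's flow hash: packets of the same flow always
    land in the same queue, so one core handles the whole flow
    (Rule 2: one core per packet)."""
    flow = (packet["src_ip"], packet["dst_ip"],
            packet["src_port"], packet["dst_port"], packet["proto"])
    digest = hashlib.sha1(repr(flow).encode()).digest()
    return digest[0] % NUM_QUEUES

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
       "src_port": 1234, "dst_port": 80, "proto": "tcp"}
q = rx_queue_for(pkt)
print(f"packet -> queue {q}, handled by {queue_owner[q]}")
```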


SLIDE 15

Batch processing

  • Avoid bookkeeping overhead when forwarding packets (see the sketch below)
    – Incur it once every several packets
    – Modify Click to receive a batch of packets per poll operation
    – Modify the NIC driver to relay packet descriptors in batches
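A hedged sketch of the batching idea, in plain Python rather than Click or a real driver (the `MockNic` class and its methods are invented for illustration): the forwarding loop pulls up to a batch of descriptors per poll and pays the per-poll bookkeeping once for the whole batch instead of once per packet.

```python
from collections import deque

BATCH_SIZE = 32   # descriptors fetched per poll (an assumption)

class MockNic:
    """Tiny stand-in for a NIC driver; real drivers expose descriptor
    rings, not Python deques. It exists only so the loop below runs."""
    def __init__(self, packets):
        self.rx = deque(packets)
        self.tx = []
    def poll(self, budget):
        return [self.rx.popleft() for _ in range(min(budget, len(self.rx)))]
    def transmit_batch(self, batch):
        self.tx.extend(batch)

def forward_all(nic, route):
    polls = 0
    while True:
        batch = nic.poll(BATCH_SIZE)                      # one poll, up to BATCH_SIZE packets
        if not batch:
            break
        polls += 1                                        # per-poll bookkeeping paid once per batch
        nic.transmit_batch([(route(p), p) for p in batch])
    return polls

nic = MockNic(range(100))
print(forward_all(nic, route=lambda p: p % 4), "polls for 100 packets")
```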


SLIDE 16

Resulting performance

[Figure: forwarding rate in Mpps for four configurations: Nehalem with multiple queues and batching, Nehalem with a single queue and batching, Nehalem with a single queue and no batching, and Xeon with a single queue and no batching.]

  • “Toy experiments”: simply forward packets deterministically, without header processing or routing lookups


SLIDE 17

Evaluation: Server Parallelism

  • Workloads
    – Distribution of packet sizes
      • Fixed-size packets
      • “Abilene” packet trace
    – Application
      • Minimal forwarding (memory, I/O)
      • IP routing (references a large data structure)
      • IPsec packet encryption (CPU)

SLIDE 18

Results for server parallelism

[Figure: forwarding rate in Mpps and throughput in Gbps vs. packet size (64B to 1024B and the Abilene trace), and per-application results (minimal forwarding, IP routing, IPsec) for the 64B and Abilene workloads.]

SLIDE 19

Scaling the System Performance

[Figure: per-packet CPU load (cycles/packet), memory load (bytes/packet), I/O load (bytes/packet), PCIe load (bytes/packet), and inter-socket load (bytes/packet) as a function of packet rate (Mpps), for minimal forwarding, IP routing, and IPsec, compared against the cycles available and a nominal benchmark.]

  • CPU is the bottleneck

SLIDE 20

RB4 Router

  • 4 Nehalem servers
    – 2 NICs per server, each with 2 x 10 Gbps ports
    – 1 port used for the external link and 3 ports used for internal links
    – Direct VLB over a full mesh

  • Implementation
    – Minimize packet processing to one core
    – Avoid reordering by grouping same-flow packets (see the sketch below)
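A small sketch of the reordering-avoidance idea (my own illustration, and only one possible reading of "grouping same-flow packets"; the actual RB4 code may group traffic differently): instead of spraying packets over the internal links packet by packet, the ingress server hashes the flow identifier and pins all packets of a flow to the same internal next hop, so packets of one flow cannot overtake each other inside the cluster.

```python
import zlib

INTERNAL_NEXT_HOPS = ["s1", "s2", "s3"]   # the three internal links of one RB4 server

def internal_hop_for_flow(flow_tuple):
    """Pin every packet of a flow to one internal next hop, trading a
    little load-balancing granularity for in-order delivery per flow."""
    h = zlib.crc32(repr(flow_tuple).encode())
    return INTERNAL_NEXT_HOPS[h % len(INTERNAL_NEXT_HOPS)]

flow = ("10.1.2.3", "10.4.5.6", 4321, 80, "tcp")
print(internal_hop_for_flow(flow))   # the same flow always takes the same internal path
```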


SLIDE 21

Performance

  • 64B-packet workload
    – 12 Gbps

  • Abilene workload
    – 35 Gbps

  • Reordering avoidance
    – Reduces reordered packets from 5.5% to 0.15%

  • Latency
    – 47.6-66.4 μs in RB4
    – 26.3 μs for a Cisco 6500 router


SLIDE 22

Conclusion

  • A high-performance software router
    – Parallelism across servers
    – Parallelism within servers


SLIDE 23

Discussion

  • Similar situation in other fields of the computer industry
    – GPUs

  • Power consumption/cooling
  • Space consumption

SLIDE 24

K-ary n-fly network topology

  • N = k^n sources and k^n destinations
  • n stages

SLIDE 25

Adding an extra stage