SLIDE 1

The Case for a Flexible Low-Level Backend for Software Data Planes

Sean Choi¹, Xiang Long², Muhammad Shahbaz³, Skip Booth⁴, Andy Keep⁴, John Marshall⁴, Changhoon Kim⁵

SLIDE 2

Why software data planes?

[Figure: VMs connected through virtual ports to a software switch, which attaches to a physical port]

  • VM hypervisors
  • Cost savings with commodity general-purpose processing units, where the desired throughput is below ~100 Gbps
  • Prototyping protocol designs
  • Prototyping hardware data-plane architectures
SLIDE 3

PISCES[1]

[Figure: a P4 program compiled down to a software switch]

[1] PISCES: A Programmable, Protocol-Independent Software Switch. ACM SIGCOMM 2016.

SLIDE 4

Software switch DSLs

  • High-level, close to the protocol
  • Abstract forwarding model

SLIDE 5

Nice for programmers…

  • Familiar and logical model in mind when programming, e.g. match/action pipelines
  • Can specify packet data without worrying about implementation
  • Portable code across platforms
SLIDE 6

Not so nice for compilers

  • Abstract forwarding model not designed for e.g. CPU-based architectures
  • Limited in expressiveness
  • Insulated from underlying low-level APIs
  • Result: difficult to realize the full performance potential of the underlying hardware

SLIDE 7

Hypothesis

If software switches exposed more low-level characteristics to the data-plane compiler, improvements in both performance and features would be possible.

SLIDE 8

Our contribution

  • Identify a software switch that can be programmed at a low level w.r.t. the hardware architecture
  • Create a compiler targeting that switch, allowing it to support high-level data-plane programs
  • Compare performance
SLIDE 9

Target Switch: Vector Packet Processor (VPP)

  • Open-sourced by Cisco
  • Can be programmed at a low level
  • Part of the FD.io project
SLIDE 10

Vector Packet Processing (VPP) Platform

  • Modular packet-processing node graph abstraction

[Node graph figure: dpdk-input, llc-input, ip4-input, ip6-input, ip6-lookup, ip6-rewrite-transmit, dpdk-output]

SLIDE 11

Vector Packet Processing (VPP) Platform

  • Each node can execute almost arbitrary C code
  • Each node processes vectors of n packets

[Node graph figure, as on SLIDE 10]
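As a sketch of the node-graph idea, nodes can be modeled as functions that each consume a whole vector of packets and name their successor. This is not VPP's actual API (which registers nodes and passes frames of buffer indices); all names and types here are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy packet: an IP version field and a counter of nodes visited. */
typedef struct { int ip_version; int hops; } pkt_t;

/* A "node" processes a whole vector of packets at once, then
 * returns the next node to run -- mirroring VPP's graph dispatch. */
typedef struct node node_t;
struct node {
    const char *name;
    node_t *(*process)(pkt_t *pkts, size_t n);
};

static node_t ip4_input, ip6_input, output;

static node_t *process_input(pkt_t *pkts, size_t n) {
    /* Real VPP classifies per packet and may split the vector;
     * here we simply branch on the first packet. */
    for (size_t i = 0; i < n; i++) pkts[i].hops++;
    return pkts[0].ip_version == 6 ? &ip6_input : &ip4_input;
}
static node_t *process_ip(pkt_t *pkts, size_t n) {
    for (size_t i = 0; i < n; i++) pkts[i].hops++;
    return &output;
}
static node_t *process_output(pkt_t *pkts, size_t n) {
    for (size_t i = 0; i < n; i++) pkts[i].hops++;
    return NULL; /* end of graph */
}

static node_t input     = { "dpdk-input",  process_input  };
static node_t ip4_input = { "ip4-input",   process_ip     };
static node_t ip6_input = { "ip6-input",   process_ip     };
static node_t output    = { "dpdk-output", process_output };

/* Walk the graph: each node handles the full vector before the
 * next node runs, keeping its code hot in the i-cache. */
void run_graph(pkt_t *pkts, size_t n) {
    for (node_t *cur = &input; cur != NULL; )
        cur = cur->process(pkts, n);
}
```

The point of the vector-at-a-time loop is the next slide's i-cache argument: each node's code runs over many packets before the instruction stream changes.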

SLIDE 12

Vector Packet Processing (VPP) Platform

  • Code is divided into nodes to optimize for i- and d-cache locality …

[Node graph figure, as on SLIDE 10]

SLIDE 13

Vector Packet Processing (VPP) Platform

[Figure: a packet vector flowing through the standard VPP nodes (dpdk-input … dpdk-output) alongside a custom plugin with its own input node and nodes 1 … k]

  • Extensible packet processing through first-class plugins
SLIDE 14

Vector Packet Processing (VPP) Platform

  • Proven performance[1]
  • Multiple Mpps from a single x86_64 core
  • > 100 Gbps full-duplex on a single physical host
  • Outperforms Open vSwitch in various scenarios

IPv4 in+out forwarding: 9.0 Mpps on 1 core, 13.4 Mpps on 2 cores, 20.0 Mpps on 4 cores

[1] https://wiki.fd.io/view/VPP/What_is_VPP%3F

SLIDE 15

Vector Packet Processing (VPP) Platform

  • Disadvantage: large burden on the programmer
  • Requires knowledge from different fields: protocols, operating systems, processor architecture, C compiler optimization, …
  • Some magic required for good performance
SLIDE 16

Some Magic Required

[Code figure: a VPP node loop that manually prefetches two packets ahead]

  • Manually fetching two packets ahead is a consequence of being low-level
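A minimal, self-contained sketch of that trick. VPP's real loops use its own prefetch helpers and process two or four buffers per iteration; the buffer layout here is invented, and `__builtin_prefetch` is a GCC/Clang extension:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  data[64];   /* toy packet payload */
    uint32_t checksum;
} buf_t;

/* Sum payload bytes into each buffer's checksum, prefetching two
 * buffers ahead so the memory access is already in flight by the
 * time the loop body reaches those buffers. */
void process_vector(buf_t **bufs, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i + 2 < n)
            __builtin_prefetch(bufs[i + 2], 0 /* read */, 3 /* high locality */);

        uint32_t sum = 0;
        for (size_t j = 0; j < sizeof bufs[i]->data; j++)
            sum += bufs[i]->data[j];
        bufs[i]->checksum = sum;
    }
}
```

The prefetch changes no results, only timing; that is exactly why a high-level DSL cannot express it and a low-level backend can.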

SLIDE 17

Ease of programmability is sacrificed for performance at the low level. Can a high-level DSL compiler help?

P4 + VPP = Programmable Vector Packet Processor (PVPP)

SLIDE 18

PVPP structure

[Pipeline figure: P4 Program → Reference P4 Compiler (P4C): BMv2 front-end compiler → BMv2 mid-end compiler → BMv2 back-end compiler → JSON → JSON-VPP compiler (using VPP plugin Cog templates) → C files → VPP plugin directory]

Standard compiler optimizations are also applied, e.g. redundant-table removal.

SLIDE 19

Experimental Setup

[Topology figure: machine M2 runs PVPP over DPDK; machines M1 and M3 run MoonGen senders/receivers, each connected to M2 by 3 x 10G links]

  • CPU: Intel Xeon E5-2640 v3, 2.6 GHz
  • Memory: 32 GB RDIMM, 2133 MT/s, dual rank
  • NICs: Intel X710 DP/QP DA SFP+ cards
  • HDD: 1 TB 7.2K RPM NL-SAS, 6 Gbps

SLIDE 20

Benchmark Application

[Pipeline figure: Parse Ethernet/IPv4 → IPv4_match (match: ip.dstAddr; action: Set_nhop or drop) → Destination MAC (match: ip.dstAddr; action: Set_dmac or drop) → Source MAC (match: egress_port; action: Set_dmac or drop)]
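Under the toy assumption of single-entry exact-match tables (the real PVPP backend generates VPP match/action code from the P4 program; every name and value below is invented), the three-stage benchmark pipeline can be sketched as:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t dst_ip;
    uint8_t  dmac[6], smac[6];
    uint16_t egress_port;
    bool     dropped;
} pkt_t;

/* One exact-match entry per table; real P4 tables hold many
 * entries with LPM or exact matching -- this is deliberately minimal. */
typedef struct { uint32_t key; uint16_t nhop_port; } route_entry_t;
typedef struct { uint32_t key; uint8_t mac[6]; } mac_entry_t;

static const route_entry_t routes[] = { { 0x0a000001u, 7 } };
static const mac_entry_t   dmacs[]  = { { 0x0a000001u, {1, 2, 3, 4, 5, 6} } };
static const mac_entry_t   smacs[]  = { { 7,           {9, 9, 9, 9, 9, 9} } };

/* Match/action stages in order; a miss in any table drops the packet. */
void pipeline(pkt_t *p) {
    /* Stage 1: IPv4_match -- set next hop (egress port) from dst IP. */
    if (p->dst_ip != routes[0].key) { p->dropped = true; return; }
    p->egress_port = routes[0].nhop_port;

    /* Stage 2: Destination MAC -- rewrite dmac from dst IP. */
    if (p->dst_ip != dmacs[0].key) { p->dropped = true; return; }
    for (int i = 0; i < 6; i++) p->dmac[i] = dmacs[0].mac[i];

    /* Stage 3: Source MAC -- rewrite the MAC from the egress port. */
    if (p->egress_port != smacs[0].key) { p->dropped = true; return; }
    for (int i = 0; i < 6; i++) p->smac[i] = smacs[0].mac[i];
}
```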

SLIDE 21

Baseline Performance

[Chart: throughput (Mpps) vs. packet size (bytes); at 64-byte packets, single-node compilation reaches 7.86 Mpps and multiple-node compilation 7.05 Mpps]

64-byte packets, single 10G port

SLIDE 22

Vector Packet Processing (VPP) Platform

  • Each node can execute almost arbitrary C code
  • Each node processes vectors of n packets

[Node graph figure, as on SLIDE 10]

SLIDE 23

Optimized Performance

[Chart: throughput after cumulative optimizations, single node / multiple node, in Mpps]
  • Baseline: 7.86 / 7.05
  • Removing redundant tables: 9.25 / 8.38
  • Reducing metadata access: 9.51 / 8.50
  • Loop unrolling: 9.51 / 8.80
  • Bypassing redundant nodes: 9.58 / 8.89
  • Reducing pointer dereferences: 10.01 / 9.02
  • Caching logical HW interface: 10.21 / 9.20

64-byte packets, single 10G port

SLIDE 24

Scalability

[Chart: throughput (Mpps) vs. number of CPU cores (1-6); single node: 8.52, 17.03, 26.40, 35.83, 44.23, 53.11; multiple node: 8.14, 16.57, 24.14, 33.41, 40.69, 49.34]

64-byte packets across 3 x 10G ports

SLIDE 25

Performance Comparison

[Chart: throughput (Mpps) vs. packet size (64/128/192/256 bytes); PVPP: 59.53, 49.31, 34.71, 26.78; PISCES with microflow cache: 63.49, 47.23, 34.72, 26.78; PISCES without microflow cache: 30.22, 30.22, 30.20, 26.78]

SLIDE 26

Future work

  • Microbenchmarking VPP to inform VPP-specific optimizations
  • P4 compiler annotations for low-level constructs
  • Exploring when multi-node compilation is beneficial for PVPP
  • Demonstrating use cases where the OVS microflow cache is defeated, to show that PVPP is just as programmable without resorting to a separated fast/slow path

SLIDE 27

Summary

  • High-level DSLs are great for programmers of software switches, but lack the expressiveness needed for optimizations.
  • Low-level software switches such as VPP are performant but hard to program.
  • We propose that the best of both is possible with PVPP.
  • Performance comparable to the state of the art has been achieved, but this is still a work in progress.