Software Data Planes: You Can’t Always Spin to Win
Hossein Golestani, Amirhossein Mirhosseini, Thomas F. Wenisch, University of Michigan
ACM Symposium on Cloud Computing (SoCC), November 22, 2019
adacenter.org


SLIDE 1

Software Data Planes: You Can’t Always Spin to Win

Hossein Golestani, Amirhossein Mirhosseini, Thomas F. Wenisch
University of Michigan

ACM Symposium on Cloud Computing (SoCC), November 22, 2019

This work is supported by the Semiconductor Research Corporation (SRC) and DARPA

@ADA_Center adacenter.org

SLIDE 2

What’s Up in the Cloud?

  • Virtual μs-scale computing era
  • Service objectives
  • High throughput
  • Low average/tail latency

[Diagram labels: High-speed I/O*; Microservices; Network function virtualization; I/O virtualization; Server #1: Address Translation, Routing; Server #2: Firewall, Load Balancing; VM #1 … VM #n]

* Image credits: Mellanox, Intel

SLIDE 3

Software Stacks: Under Revision

  • Then vs. now
  • Kernel-bypass architectures (just a handful)

Andromeda [NSDI’18], Arrakis [OSDI’14], IX [OSDI’14], mTCP [NSDI’14], ReFlex [ASPLOS’17], Shenango [NSDI’19], Shinjuku [NSDI’19], Snap [SOSP’19], ZygOS [SOSP’17]

[Diagram labels, then vs. now: User app; Kernel; CPU …; I/O …]

SLIDE 4

Software Data Planes

  • Key mechanisms
  • User-level shared queues
  • Spin-polling cores
  • Fast notification by cache coherence write signals
  • Widely adopted in industry

[Logo and diagram labels: SPDK (Storage Performance Development Kit); I/O]
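To make the three mechanisms concrete, here is a minimal single-threaded sketch of a user-level queue that a consumer spin-polls. It is an illustration only, not code from the talk or from DPDK/SPDK: `SpscRing` and `spin_poll` are hypothetical names, and the "notification" is simply the producer's write to `tail`, which a real polling core would observe through cache coherence.

```python
class SpscRing:
    """Fixed-size single-producer/single-consumer ring (hypothetical sketch)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer writes

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring full; a real producer would back off or drop
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1  # publishing the new tail acts as the notification
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # empty: this poll did no useful work
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item


def spin_poll(ring, iterations):
    """Poll the ring a fixed number of times; return (items, empty_polls)."""
    items, empty_polls = [], 0
    for _ in range(iterations):
        item = ring.pop()
        if item is None:
            empty_polls += 1
        else:
            items.append(item)
    return items, empty_polls
```

The consumer never blocks or takes an interrupt, which is what keeps notification latency low, but every call to `pop` on an empty ring still costs instructions.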

SLIDE 5

Spin-polling: Not a Panacea

  • An easy-to-use and fast model for communication and signaling
  • But far from ideal, especially when scaled
  • We show that spin-based data planes:
  • Perform more work when there is less traffic
  • Are not scalable to many cores
  • Are not scalable to many queues
  • Are not well-suited for shared queues

SLIDE 6

Outline

  • Introduction to Software Data Planes
  • Methodology
  • Characterization of Software Data Plane Challenges
  • Solution Directions
  • Conclusion

SLIDE 7

Methodology

  • Setup
  • DPDK-based applications
  • Skylake cores
  • 100GbE Mellanox NIC
  • Experiments
  • (1) Inefficiencies of spin-polling
  • (2) Lack of queue scalability
  • (3) Impracticality of queue sharing

SLIDE 8

Inefficiencies of Spin-polling

  • Polling “tax”
  • Body of poll loop
  • Useless polling on idle queues (possibly causing cache misses)
  • Affects throughput scalability with cores

While forever:
    For each RX queue:
        Read packets from RX queue;
        If there are any packets:
            Route packets using LPM*;
            Send packets to TX queue(s);

* LPM: Longest Prefix Match

[Diagram labels: Core; NIC Port 1 …; NIC Port 2 …]

Polling tax can be 20-28% of total CPU cycles even at 100% load
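The poll loop on this slide can be sketched as runnable code. This is a hedged illustration, not the DPDK application measured in the talk: the routing table, the `burst` size, and the names `lpm_lookup` and `poll_loop` are assumptions, and counting empty polls is only a rough proxy for the cycle-level polling tax.

```python
import ipaddress
from collections import deque

# Hypothetical routing table: (prefix, TX queue index).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), 0),
    (ipaddress.ip_network("10.1.0.0/16"), 1),
    (ipaddress.ip_network("0.0.0.0/0"), 2),  # default route, always matches
]


def lpm_lookup(dst):
    """Return the TX queue of the longest prefix that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, port) for net, port in ROUTES if addr in net]
    return max(matches)[1]  # largest prefix length wins


def poll_loop(rx_queues, tx_queues, rounds, burst=32):
    """Run the poll loop for a bounded number of rounds.

    Returns (polls, empty_polls) so the share of useless polls is visible.
    """
    polls = empty_polls = 0
    for _ in range(rounds):                  # "While forever", bounded here
        for rx in rx_queues:                 # For each RX queue
            polls += 1
            batch = [rx.popleft() for _ in range(min(burst, len(rx)))]
            if not batch:                    # idle queue: pure overhead
                empty_polls += 1
                continue
            for dst in batch:                # Route using LPM, then transmit
                tx_queues[lpm_lookup(dst)].append(dst)
    return polls, empty_polls
```

With four RX queues of which two hold packets, two rounds of this loop spend six of eight polls on empty queues, which is the "useless polling on idle queues" item above in miniature.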

SLIDE 9

IPC != Useful Work

  • IPC (Instructions Per Cycle) of routing core at varying loads

[Figure: IPC of routing core vs. routing throughput (Mpps), for 1, 4, and 8 queues]

IPC decreases as load increases: the core executes the most instructions when it has the least useful work, resulting in energy inefficiency, fast aging, and severe co-runner interference

SLIDE 10

Effect on SMT Co-runner

  • More (useless) instructions executed in lighter traffic
  • Co-running:
  • Matrix mult
  • Spin-based routing (0-100% load)
  • Executed on:
  • SMT cores of a physical CPU
  • Different physical CPUs

[Figure: IPC of matrix mult (bar chart). Bars: Not collocated; Collocated Routing-0%; Collocated Routing-100%. Bar labels: 2.24, 1.56, 1.54]

Useless spinning wastes execution resources of an SMT co-runner

SLIDE 11

Lack of Queue Scalability

  • Traffic flows spread among multiple queues
  • Limited size of CPU caches: a performance antagonist
  • Experiment
  • Forwarding packets by a single core
  • Scaling up the number of queues

[Diagram labels: Core; NIC Port 1 …; NIC Port 2 …]

SLIDE 12

Effect on Latency

  • Round-trip latency of packet forwarding
  • Light traffic (minimal queuing delay)

[Figure: Average latency (μs) vs. number of queues (64 to 512)]

Latency is severely affected as queue heads fall out of L1/L2 caches
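Why round-robin polling is so sensitive to cache capacity can be seen in a toy model. The sketch below is an assumption-laden illustration, not the experiment from the talk: it treats the cache as fully associative LRU, gives each queue head exactly one cache line, and ignores prefetching; `miss_rate` and its parameters are hypothetical.

```python
from collections import OrderedDict


def miss_rate(num_queues, cache_lines, rounds=10):
    """Fraction of queue-head reads that miss in a tiny LRU cache model."""
    cache = OrderedDict()  # keys are queue ids, ordered from LRU to MRU
    misses = accesses = 0
    for _ in range(rounds):
        for q in range(num_queues):          # round-robin poll over all queues
            accesses += 1
            if q in cache:
                cache.move_to_end(q)         # hit: mark as most recently used
            else:
                misses += 1
                cache[q] = True
                if len(cache) > cache_lines:
                    cache.popitem(last=False)  # evict least recently used
    return misses / accesses
```

In this model, as long as the queue heads fit, only the first pass misses. Once the number of queues exceeds the capacity, cyclic access under LRU evicts each head just before it is polled again, so every poll misses.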

SLIDE 13

Effect on Peak Throughput

  • Balanced traffic: Passing through all queues
  • Unbalanced traffic: Passing through only one queue

[Figure: Throughput (Mpps) vs. total number of queues (64 to 512), for balanced and unbalanced traffic]

Cache misses not interleaved with transmits severely hurt peak throughput in unbalanced traffic

SLIDE 14

Scale-up Queuing Is Impractical

  • (a) Scale-out vs. (b) Scale-up queuing (shared queue)
  • Scale-up queuing
  • Strong theoretical merits
  • Synchronization disadvantage

[Diagram: (a) scale-out, with Core 1 … Core n; (b) scale-up, with a shared queue feeding Core 1 … Core n]

SLIDE 15

Scale-out vs. Scale-up

  • Processing hiccups cause head-of-line (HoL) blocking in scale-out
  • Round-trip latency with 10 parallel cores

(a) No hiccups (b) 1μs processing hiccup with 1% probability

[Figure, panels (a) and (b): Average latency (μs) vs. throughput (Mpps), for scale-out and scale-up]

Although effective in avoiding HoL blocking, spin-polling in scale-up queuing saturates at lower loads
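The queuing-theory side of this comparison can be reproduced with a small simulation. It is a sketch under stated assumptions, not the authors' experiment: the 0.1 μs base service time, Poisson arrivals, random core assignment, and the names `make_jobs` and `simulate` are all hypothetical, while the 1 μs hiccup with 1% probability and the 10 cores follow the slide. The model deliberately ignores synchronization cost on the shared queue, so it shows only the theoretical merit of scale-up and cannot show the earlier saturation reported above.

```python
import heapq
import random


def make_jobs(n, num_cores, load, base=0.1, hiccup=1.0, p_hiccup=0.01, seed=1):
    """Generate (arrival, service, assigned_core) tuples, times in μs."""
    rng = random.Random(seed)
    mean_service = base + p_hiccup * hiccup
    rate = load * num_cores / mean_service   # aggregate arrival rate
    now, jobs = 0.0, []
    for _ in range(n):
        now += rng.expovariate(rate)
        service = base + (hiccup if rng.random() < p_hiccup else 0.0)
        jobs.append((now, service, rng.randrange(num_cores)))
    return jobs


def simulate(num_cores, jobs, shared):
    """Return the mean latency (waiting plus service) over all jobs."""
    free_at = [0.0] * num_cores              # time at which each core is idle
    total = 0.0
    for arrival, service, core in jobs:
        if shared:
            # Scale-up: one FIFO queue, job goes to the earliest-free core.
            start = max(arrival, heapq.heappop(free_at))
            heapq.heappush(free_at, start + service)
        else:
            # Scale-out: job waits behind everything queued on its own core.
            start = max(arrival, free_at[core])
            free_at[core] = start + service
        total += start + service - arrival
    return total / len(jobs)
```

In the scale-out case a single hiccup delays every packet already queued behind it on that core; with a shared queue the other nine cores keep draining work, which is the HoL-blocking effect named in the first bullet.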

SLIDE 16

Future Data Planes

SLIDE 17

Solution Direction(s)

  • QWAIT, a multi-address monitoring scheme
  • Inspired by x86 MWAIT
  • Avoids polling tax, useless polling, and disruption to SMT co-runners
  • Needs hardware support
  • Programming model similar to select-case in Go

QWAIT (queue_set):
    case queue_1: process_queue_1();
    …
    case queue_n: process_queue_n();
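The programming model can be approximated in software to show its shape. The sketch below is only an emulation under loud assumptions: QWAIT as proposed needs hardware support, whereas this version blocks on a condition variable; `QueueSet`, `qwait`, and `serve` are hypothetical names, not an interface defined in the talk.

```python
import threading
from collections import deque


class QueueSet:
    """Software stand-in for a monitored set of queues (hypothetical API)."""

    def __init__(self, n):
        self.queues = [deque() for _ in range(n)]
        self.cond = threading.Condition()

    def push(self, index, item):
        with self.cond:
            self.queues[index].append(item)
            self.cond.notify()           # stands in for the hardware wake-up

    def qwait(self):
        """Block until some queue is non-empty; return (index, item)."""
        with self.cond:
            while True:
                for index, q in enumerate(self.queues):
                    if q:
                        return index, q.popleft()
                self.cond.wait()         # sleep instead of spinning


def serve(queue_set, handlers, count):
    """Dispatch `count` items, one handler per queue, like select-case."""
    for _ in range(count):
        index, item = queue_set.qwait()
        handlers[index](item)            # "case queue_i: process_queue_i()"
```

The waiting core does no work while all queues are empty, which is the property the slide attributes to QWAIT. This emulation always scans from queue 0, so it favors low-numbered queues; how a hardware scheme would arbitrate among ready queues is not stated here.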

SLIDE 18

Conclusion

  • Key mechanisms of software data planes
  • User-level shared queues
  • Spin-polling cores
  • Although easy-to-use and low-latency, software data planes have deficiencies, especially when scaled
  • Using DPDK, we quantified these deficiencies:
  • Incurring polling overhead and useless work
  • Not scalable to many cores/queues
  • Not well-suited for scale-up queuing

SLIDE 19

Q & A

Thank you!