

SLIDE 1

Protocol stacks and multicore scalability

The evolving hardware-software interface

or Why we love and hate offload

MSN 2010
Robert N. M. Watson, University of Cambridge

Portions of this work supported by Juniper Networks, Inc.

SLIDE 2

[Diagram: two end hosts, each just "network stack goodness" plus a NIC, joined by the "somebody else's problem" cloud, where magic happens.]

Idealised network for an OS developer

SLIDE 3

Things are getting a bit sticky at the end host*


* … and end host-like middle nodes: proxies, application firewalls, anti-spam, anti-virus, …

SLIDE 4

Packets-per-second (PPS) scales with bandwidth, but per-core limits reached

➮ Transition to multicore

Even today’s bandwidth achieved only with protocol offload to the NIC

➮ But just specific protocols, workloads
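To put the packet-rate pressure in concrete terms, a rough back-of-the-envelope calculation (illustrative figures only, not from the talk; framing overhead is ignored):

```c
#include <stdio.h>

/* Approximate packets per second needed to fill a link with fixed-size
 * frames.  Ignores preamble/inter-frame gap/CRC; illustrative only. */
static double pps(double gbits_per_sec, double frame_bytes)
{
    return gbits_per_sec * 1e9 / (frame_bytes * 8);
}

int main(void)
{
    printf("1 Gb/s,  1500B frames: %9.0f pps\n", pps(1.0, 1500));  /* ~83K   */
    printf("10 Gb/s, 1500B frames: %9.0f pps\n", pps(10.0, 1500)); /* ~833K  */
    printf("10 Gb/s,   64B frames: %9.0f pps\n", pps(10.0, 64));   /* ~19.5M */
    return 0;
}
```

Even with full-sized frames, 10Gb/s implies hundreds of thousands of packets per second; minimum-sized frames push that into the tens of millions, well past what one core can process per-packet.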


SLIDE 5

Contemporary network stack scalability themes


SLIDE 6

  • Counting instructions ➞ cache misses
  • Lock contention ➞ cache line contention
  • Locking ➞ finding parallelism opportunities
  • Work ordering, classification, distribution
  • NIC offload of even more protocol layers
  • Vertically integrated work distribution/affinity
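As a small illustration of the second theme above (cache line contention replacing lock contention as the bottleneck), a minimal sketch of per-CPU, cache-line-padded counters of the sort stacks use so that hot-path statistics updates never bounce a shared line between cores. The structure and function names are hypothetical, not from any particular stack:

```c
#include <stdint.h>

#define MAX_CPUS        64
#define CACHE_LINE_SIZE 64

/* One statistics slot per CPU, padded and aligned so two CPUs never
 * share a cache line: fast-path updates stay core-local. */
struct pkt_stats {
    uint64_t packets;
    uint64_t bytes;
    char     pad[CACHE_LINE_SIZE - 2 * sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct pkt_stats stats[MAX_CPUS];

/* Fast path: touch only this CPU's line; no lock, no line bouncing. */
static inline void stats_update(unsigned cpu, uint64_t len)
{
    stats[cpu].packets++;
    stats[cpu].bytes += len;
}

/* Slow path (e.g., a statistics tool): sum across all CPUs. */
static uint64_t stats_total_packets(void)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < MAX_CPUS; i++)
        total += stats[i].packets;
    return total;
}
```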
SLIDE 7

Why we love offload

Better performance, no protocol changes*


* It sounds good so it must be true!

SLIDE 8

[Diagram: NIC offload features accumulating as link speeds rise from 100Mb/s through 1Gb/s to 10Gb/s]

  • PIO ➞ DMA rings
  • Interrupt moderation
  • Checksum offload, VLAN en/decap
  • IP fragmentation/TSO/LRO
  • MultiQ: RSS, CAMs, MIPS, …
  • Full TCP, iSCSI, RDMA, … offload
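One of the earliest items above, checksum offload, removes the classic RFC 1071 ones'-complement checksum loop from the host's per-packet work. A minimal software version of what the NIC takes over:

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071-style Internet checksum: 16-bit ones'-complement sum over the
 * buffer.  This per-packet loop is what checksum offload moves to the NIC. */
uint16_t in_cksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                    /* odd trailing byte */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```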

SLIDE 9

Reducing effective PPS with offload


SLIDE 10

TCP Segmentation Offload (TSO)


[Diagram: transmit path from userspace through the kernel to hardware. The application writes a data stream to a socket; the kernel copies it into mbufs + clusters (2k, 4k, 9k, 16k); TCP and IP headers are encapsulated; the link layer + driver encapsulate the Ethernet frame and insert it into the descriptor ring; the device checksums and transmits MSS-sized segments.]

Move TCP segmentation from the TCP layer to hardware

Reduce effective PPS to improve OS performance
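A schematic sketch (not driver code) of the per-segment work that TSO shifts off the host: without TSO the stack runs this loop once per MSS, building and transmitting a separate frame each time; with TSO it hands the device one large buffer plus a template header and the NIC replicates the loop in hardware.

```c
#include <stdio.h>
#include <string.h>

#define MSS 1448   /* typical MSS for a 1500-byte MTU with TCP timestamps */

/* Schematic segmentation loop: one header + one transmit per MSS. */
static void segment_and_send(const char *payload, size_t len, unsigned seq)
{
    size_t off = 0;
    while (off < len) {
        size_t chunk = (len - off > MSS) ? MSS : len - off;
        /* Per segment: clone the TCP/IP header template, set the sequence
         * number, fix lengths and checksums, then transmit one frame. */
        printf("segment: seq=%u len=%zu\n", seq + (unsigned)off, chunk);
        off += chunk;
    }
}

int main(void)
{
    static char buf[65536];
    memset(buf, 0, sizeof(buf));
    /* A single 64KB socket write becomes ~45 wire segments; with TSO the
     * host does this per-segment work zero times, the NIC ~45 times. */
    segment_and_send(buf, sizeof(buf), 1000);
    return 0;
}
```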

SLIDE 11

Large Receive Offload (LRO)*


[Diagram: receive path from hardware through the kernel to userspace. The device receives the frame and validates the Ethernet, IP, and TCP checksums; the link layer + driver interpret and strip the link-layer header; IP strips the IP header; TCP strips the TCP header, reassembles segments, and looks up and delivers to the socket; the kernel copies data out of mbufs + clusters into the application's data stream.]

Move TCP segment reassembly from the network protocol to the device driver

* Interestingly, LRO is often done in software
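Since LRO is often implemented in the driver, here is a sketch of the merge test such an implementation might apply before appending a newly received segment to an in-progress aggregate. The structure and field names are hypothetical; real implementations also check TCP flags, options, and ACK/window fields before coalescing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal view of an in-progress LRO aggregate (hypothetical fields). */
struct lro_entry {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t next_seq;        /* sequence number expected next */
    uint32_t appended_len;    /* payload bytes already coalesced */
};

/* May this segment be appended?  Same 4-tuple, exactly the next in-order
 * sequence number, and the aggregate must not outgrow what the stack will
 * accept (64KB here); otherwise flush the aggregate and start a new one. */
static bool lro_can_merge(const struct lro_entry *e,
                          uint32_t src_ip, uint32_t dst_ip,
                          uint16_t src_port, uint16_t dst_port,
                          uint32_t seq, uint32_t payload_len)
{
    if (e->src_ip != src_ip || e->dst_ip != dst_ip ||
        e->src_port != src_port || e->dst_port != dst_port)
        return false;                     /* different flow */
    if (seq != e->next_seq)
        return false;                     /* out of order: flush instead */
    if (e->appended_len + payload_len > 65535)
        return false;                     /* aggregate would grow too large */
    return true;
}
```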

SLIDE 12

Varying TSO and LRO − bandwidth

[Chart: net bandwidth in Gb/s vs. number of processes (1-8), three configurations: 1 − LRO+TSO, 2 − LRO, 3 − vanilla]

TSO and LRO off from now on

SLIDE 13

What about the wire protocol?

  • Packet format remains the same
  • Transmit/receive code essentially identical
  • Just shifted segmentation/reassembly
  • Effective ACK behaviour has changed!
  • ACK every 6-8 segments instead of every 2 segments!

SLIDE 14

Managing contention and the search for parallelism*


* Again, try not to change the protocol…

SLIDE 15

Lock contention


!" #" $" %" &" '" (" )" *" +" #!" #",-./011" $",-./01101" %",-./01101" &",-./01101" 2.31." 31."

!"# $!"# %!"# &!"# '!"# (!"# )!"# *!"# +!"# ,!"# $!!"#

  • .

/ . 1 $ 2 3 . 4 5 # / . 1 $ 2 3 . 4 5 #

  • .

/ . 1 % 2 3 . 4 5 5 # / . 1 % 2 3 . 4 5 5 #

  • .

/ . 1 & 2 3 . 4 5 5 # / . 1 & 2 3 . 4 5 5 #

  • .

/ . 1 ' 2 3 . 4 5 5 # / . 1 ' 2 3 . 4 5 5 # "6785# "6-/3# "090#

SLIDE 16

Varying locking strategy − bandwidth

[Chart: net bandwidth in Gb/s vs. number of processes (1-8), four configurations: 1 − multi queue read locking, 2 − multi queue exclusive locking, 3 − single queue read locking, 4 − single queue exclusive locking]
SLIDE 17

TCP input path


[Diagram: TCP input path from hardware through the kernel to userspace. The device receives the frame and validates the checksum; the link layer + driver interpret and strip the link-layer header; IP validates its checksum and strips the IP header; TCP validates its checksum, strips the TCP header, looks up the socket, reassembles segments, and delivers to the socket; the kernel copies data out of mbufs + clusters into the application's data stream.]

Potential dispatch points: the ithread, the netisr software ithread, and the user thread

SLIDE 18

Work distribution

  • Parallelism implies work distribution
  • Must keep work ordered
  • Establish flow-CPU affinity
  • Microsoft Receive-Side Scaling (RSS)
  • More fine-grained solutions (CAMs, etc)

⚠ MTCP watch out! ⚠ The Toeplitz catastrophe
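For reference, a bit-by-bit sketch of the Toeplitz hash that RSS-class NICs compute over the flow tuple (e.g., source/destination addresses and ports) to choose a receive queue. The queue-selection step is simplified; real RSS uses the low-order bits of the hash to index an indirection table.

```c
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash: for every set bit of the input, XOR in the current
 * 32-bit window of the secret key, then slide the window one bit. */
uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                       const uint8_t *input, size_t inlen)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];

    for (size_t i = 0; i < inlen; i++) {
        for (int b = 7; b >= 0; b--) {
            if (input[i] & (1u << b))
                hash ^= window;
            window <<= 1;
            if (i + 4 < keylen && (key[i + 4] & (1u << b)))
                window |= 1;              /* shift in the next key bit */
        }
    }
    return hash;
}

/* Simplified queue choice; hardware indexes an indirection table instead. */
unsigned rss_pick_queue(uint32_t hash, unsigned nqueues)
{
    return hash % nqueues;
}
```

Because the result depends on the chosen key and on which tuple fields the NIC feeds in, a poor key or an asymmetric field selection can concentrate flows onto a few queues, which is one way the distribution can go badly wrong.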

SLIDE 19

Varying dispatch strategy − bandwidth

[Chart: net bandwidth in Gb/s vs. number of processes (1-8), three dispatch strategies: 1 − multi, 2 − single_link_proto, 3 − single]
SLIDE 20

Why we hate offload


SLIDE 21

“Layering violations” are not invisible

  • Hardware bugs harder to work around
  • Instrumentation below socket layer affected
  • BPF, firewalls, traffic management, etc.
  • Interface migration more difficult
  • All your protocols were not created equal
  • Not all TOEs equal: SYN, TIMEWAIT, etc.


SLIDE 22

Protocol implications

  • Unsupported protocols and workloads see:
  • Internet-wide PMTU applied to PCI
  • Limited or no checksum offload
  • Ineffectual NIC-side load balancing
  • Another nail in the "deploy a new protocol" coffin? (e.g., SCTP, even multi-path TCP)

  • Ideas about improving protocol design?


SLIDE 23

Structural problems

  • Replicated implementation and maintenance responsibility

  • Difficult field upgrade
  • Host vs. NIC interop problems
  • Composability problem for virtualisation
  • Encodes flow affinity policies in hardware


SLIDE 24

The vertical affinity problem


SLIDE 25

[Diagram: NIC queues 0-7 feed ithreads 0-7 in the network stack; sockets 0-7 and application threads 0-7 run across cores 0-7.]

Hardware-only RSS

Awkwardly random distribution

SLIDE 26

[Diagram: the same queue/ithread/socket/thread/core layout as the previous slide.]

OS-aligned RSS

Is this better?

SLIDE 27

  • Applications can express execution affinity
  • How to align with network stack and network interface affinity?
  • Sockets API inadequate; easy to imagine simple extensions, but are they sufficient?
  • How to deal with hardware vs. software policy mismatches?
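As one concrete example of the kind of "simple extension" meant above: Linux later added the SO_INCOMING_CPU socket option, which reports the CPU on which the kernel has been processing a socket's packets, so an application can co-locate its consumer thread. A minimal, Linux-specific sketch (requires a reasonably recent kernel and libc):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/socket.h>

/* Ask the kernel which CPU last handled this socket's packets
 * (SO_INCOMING_CPU) and pin the calling thread to that CPU, so the NIC
 * queue, stack processing, and application consumer share a core. */
int colocate_with_socket(int fd)
{
    int cpu = -1;
    socklen_t len = sizeof(cpu);

    if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0 || cpu < 0)
        return -1;                 /* unsupported kernel, or no traffic yet */

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;

    printf("bound consumer thread to CPU %d\n", cpu);
    return 0;
}
```

This only aligns the application with where the stack happens to be running; it does not by itself resolve mismatches between the NIC's hardware steering policy and OS/application placement, which is the harder problem raised above.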

SLIDE 28

[Diagram: two end hosts, each with an application, network stack goodness, and a NIC, connected through switches/routers and the "somebody else's problem" cloud.]

Quite a lot less magic

Reality for an OS developer

SLIDE 29

Key research areas

  • Explore programmability, debuggability, and traceability of the heterogeneous network stack
  • Security implications of intelligent devices, diverse/new execution substrates, and a single intermediate format
  • Protocol impact: “end-to-end” endpoints shifting even further


SLIDE 30

Q&A
