SLIDE 1

Fast QUIC sockets with vector packet processing

Aloys Augustin, Nathan Skrzypczak, Mathias Raoul, Dave Wallace

SLIDE 2

What is QUIC?

SLIDE 4

The stack

[Diagram: the two protocol stacks side by side]

    HTTP/2        HTTP/3
    TLS           QUIC
    TCP           UDP
            IP

SLIDE 5

Nice properties

  • Encryption by default ~ TLS 1.3 handshake
  • No ossification
  • Built-in multiplexing
    ○ Very common application requirement
    ○ Independent streams in each connection
    ○ Addresses head-of-line blocking
    ○ Stream prioritization support
  • Supports mobility
    ○ 5-tuple may change without breaking the connection

SLIDE 6

Conns & streams

[Diagram: Client and Server linked by Connection #1 and Connection #2, each carrying Stream #1 and Stream #2]

SLIDE 7

Why QUIC - pros & cons

Pros

  • Runs on UDP, can be implemented out of the kernel
  • Addresses head-of-line blocking
  • 5-tuple mobility
  • Encryption by default

Cons

  • Implementation complexity
  • No standard northbound API (for now)
  • Still evolving relatively fast, not an IETF standard yet

SLIDE 8

A quick dive into the code

SLIDE 9

Building blocks

[Diagram: Wire ↔ L4/UDP ↔ QUIC implementation ↔ Socket API ↔ Client app]

SLIDE 10

Building blocks

[Diagram: the two building blocks, glued together by the VPP QUIC plugin and a client library]

  • vpp: vectorization, fast L2-3-4, pluggable sessions
  • quicly: few assumptions, very modular - https://github.com/h2o/quicly

SLIDE 11

What is VPP?

  • Fast userspace networking dataplane - https://fd.io/
  • Open-source: https://gerrit.fd.io/r/q/project:vpp
  • Extensible through plugins
  • Multi-architecture (x86, ARM, ...), runs in baremetal / VM / container
  • Highly optimized for performance (vector instructions, cache efficiency, DPDK, native crypto, native drivers)

  • Feature-rich L2-3-4 networking (switching, routing, IPsec, …)
  • Includes a host stack with L4 (TCP, UDP) protocols

→ Great platform for a fast userspace QUIC stack

SLIDE 12

VPP Host stack (1/2)

  • Generic session layer exposing L4 protocols

○ Socket-like APIs

  • Fifos used to pass data between apps and protocols
  • Internal API for VPP plugins
  • Similar external API for independent processes, available through a message queue
  • Designed for high performance
    ○ Saturates a 40G link with 1 TCP flow or 1 UDP flow
    ○ Performance scales linearly with the number of threads
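To make the fifo mechanism concrete, here is a minimal sketch of an internal (built-in) application's receive callback. The names follow VPP's session_cb_vft_t convention, but exact signatures vary between VPP releases, so treat this as an approximation and check src/vnet/session/ in the VPP tree.

```c
#include <vnet/session/session.h>
#include <vnet/session/application_interface.h>

/* Sketch: rx callback for a built-in app. The session layer invokes it
 * when the protocol has enqueued data into the session's rx fifo. */
static int
echo_rx_callback (session_t *s)
{
  u8 buf[4096];
  /* Drain whatever is available from the shared-memory rx fifo. */
  int n = svm_fifo_dequeue (s->rx_fifo, sizeof (buf), buf);
  if (n <= 0)
    return 0;

  /* Echo it back through the tx fifo and tell the session layer there
   * is data to send; the protocol then drains the fifo onto the wire. */
  svm_fifo_enqueue (s->tx_fifo, n, buf);
  session_send_io_evt_to_thread (s->tx_fifo, SESSION_IO_EVT_TX);
  return 0;
}

/* Callback table registered when the plugin attaches as an application. */
static session_cb_vft_t echo_cb_vft = {
  .builtin_app_rx_callback = echo_rx_callback,
  /* accept / connected / disconnect callbacks omitted for brevity */
};
```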

SLIDE 13

VPP Host stack (2/2)

[Diagram: an external application attaches through VCL and a message queue carrying control events, while an internal application (VPP plugin) uses the in-process API; both exchange data with TCP/UDP sessions through rx/tx fifos, on top of L2/L3]

SLIDE 14

QUIC App Requirements

Three types of objects:

  • Listeners
  • Connections
  • Streams

[Diagram: Client and Server; the server owns Listener #1; Connection #1 and Connection #2 each carry Stream #1 and Stream #2]

SLIDE 15

Socket-like API for QUIC

[Diagram: the server calls listen(8000), then accept(:8000) → Connection a1 and accept(a1) → Streams a1-1, a1-2; the client calls connect(server, 8000) → Connection c1, then connect(c1) → Streams c1-1, c1-2]

Three types of sockets, for listeners, connections and streams. Connection sockets are only used to connect and accept streams; they cannot send or receive data.
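As an illustration, here is a minimal sketch of the client side of this flow through VPP's VCL (VPPCOM) external API. The VPPCOM_PROTO_QUIC constant and the vppcom_session_stream_connect call are taken from vcl/vppcom.h as of this work, but verify them against your VPP version; error handling is omitted.

```c
#include <vcl/vppcom.h>

/* Sketch: one QUIC connection carrying one stream, client side. */
int
quic_client_example (vppcom_endpt_t *server_ep)
{
  vppcom_app_create ("quic-client-example");

  /* Connection-level socket: handshakes, but never carries data. */
  int conn = vppcom_session_create (VPPCOM_PROTO_QUIC, 0 /* blocking */);
  vppcom_session_connect (conn, server_ep);

  /* Stream-level socket: "connect" into the connection to open a stream. */
  int stream = vppcom_session_create (VPPCOM_PROTO_QUIC, 0);
  vppcom_session_stream_connect (stream, conn);

  /* Data only ever flows on stream sockets. */
  int rv = vppcom_session_write (stream, (void *) "hello", 5);

  vppcom_session_close (stream);
  vppcom_session_close (conn);
  vppcom_app_destroy ();
  return rv < 0 ? -1 : 0;
}
```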

SLIDE 16

Building a QUIC stack in VPP

[Diagram: baseline host stack - the application, through VCL, exchanges data with VPP TCP/UDP sessions via rx/tx fifos, with a message queue for control events, on top of L2/L3]

SLIDE 17

Building a QUIC stack in VPP

[Diagram: the same picture with QUIC inserted - the application's rx/tx fifos now belong to QUIC sessions, and the QUIC layer itself consumes VPP UDP sessions through their own rx/tx fifos; the message queue still carries control events]

SLIDE 18

Zooming in

[Diagram: the QUIC plugin wraps quicly, which relies on picotls; a northbound interface exposes QUIC in the VPP session layer, and a southbound interface lets quicly use the VPP UDP stack - both edges are wired with callbacks]
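For the northbound edge, here is a minimal sketch of how an application wires quicly's callbacks, following quicly's public API (quicly_spec_context, stream_open, quicly_stream_callbacks_t); the structures differ slightly between quicly versions, so take the details as approximate.

```c
#include <quicly.h>

/* Stream-level callback: quicly's northbound edge toward the app. */
static void
on_receive (quicly_stream_t *stream, size_t off, const void *src, size_t len)
{
  /* In the VPP plugin, this is roughly where decrypted stream data is
   * copied into the QUIC stream session's rx fifo. */
}

static const quicly_stream_callbacks_t stream_callbacks = {
  .on_receive = on_receive,
  /* on_destroy, on_send_shift, on_send_emit, ... omitted in this sketch */
};

/* Connection-level callback, fired when the peer opens a stream. */
static int
on_stream_open (quicly_stream_open_t *self, quicly_stream_t *stream)
{
  stream->callbacks = &stream_callbacks; /* attach the table above */
  return 0;
}

static quicly_stream_open_t stream_open = { on_stream_open };

void
setup_quicly_context (quicly_context_t *ctx, ptls_context_t *tls)
{
  *ctx = quicly_spec_context; /* start from quicly's spec defaults */
  ctx->tls = tls;             /* picotls provides the TLS 1.3 handshake */
  ctx->stream_open = &stream_open;
}
```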

SLIDE 19

QUIC Consumption models in VPP

  • The VPP QUIC stack offers 3 consumption models:

    ○ External apps: independent processes, using the external (socket) API
    ○ Internal apps: shared libraries loaded by VPP, using the internal API
    ○ Apps can also use the quicly northbound API directly → as long as they follow the VPP threading model

[Diagram: quicly surrounded by its northbound, southbound and external (mq) interfaces]

The northbound interface in the host stack is optional!

SLIDE 20

Connection matching

RX path

[Flow: VPP UDP rx node → (buffer copy) → UDP session rx fifo → (session event) → quicly initial packet decoding → quicly packet decryption → (decrypted payload copy) → QUIC stream session rx fifo]

From the stream rx fifo, the data reaches the application in one of three ways:

  • Internal app: a session event makes the stream data available to the internal app
  • External app: an MQ event on the app message queue notifies the external app
  • quicly app: a callback delivers the stream data to the quicly app
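In code, the per-datagram work sketched above looks roughly as follows; the quicly call signatures have changed across versions, so treat them as approximate.

```c
#include <quicly.h>

/* Sketch: decode and process one UDP datagram, which may hold several
 * coalesced QUIC packets. Connection lookup is elided. */
static void
quic_rx_datagram (quicly_context_t *ctx, quicly_conn_t *conn,
                  struct sockaddr *src, const uint8_t *dgram, size_t len)
{
  size_t off = 0;
  while (off < len)
    {
      quicly_decoded_packet_t packet;
      /* Initial decoding: parse the header, locate the connection id. */
      if (quicly_decode_packet (ctx, &packet, dgram, len, &off) == SIZE_MAX)
        break; /* undecodable datagram: drop the rest */
      /* Decryption + protocol processing; on success quicly fires the
       * stream on_receive callback, which copies the payload into the
       * stream session rx fifo and raises the app notification. */
      quicly_receive (conn, NULL, src, &packet);
    }
}
```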

SLIDE 21

Memory management and ACKs

  • VPP fifos are fixed size. What if a sender sends more data than the fifo size?
    ○ Before a packet is decrypted, we have no way to know which stream(s) it contains data for → we cannot check the available space in the receiving fifo
    ○ Once a packet is decrypted, quicly does not allow us to drop it, otherwise it would never be retransmitted
    ○ Fortunately, QUIC has a connection parameter called max_stream_data, which limits the in-flight (un-acked) data per stream sent by the peer
    ○ Setting this parameter to the fifo size solves the problem, as long as we ACK data only when it is removed from the fifo
  • QUIC has several other connection-level settings to control memory usage:
    ○ Max number of streams
    ○ Total un-acked data for the connection
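Concretely, with quicly this amounts to setting the transport parameters when the context is built. A minimal sketch, assuming quicly's transport_params field names; the connection-wide and stream-count caps shown are illustrative values, not the plugin's actual ones.

```c
#include <quicly.h>

/* Sketch: bound per-stream in-flight data to the rx fifo size so that
 * any decrypted packet is guaranteed to fit in the receiving fifo. */
void
quic_set_fifo_limits (quicly_context_t *ctx, uint32_t fifo_size)
{
  ctx->transport_params.max_stream_data.bidi_local = fifo_size;
  ctx->transport_params.max_stream_data.bidi_remote = fifo_size;
  ctx->transport_params.max_stream_data.uni = fifo_size;

  /* The other connection-level memory knobs (illustrative values): */
  ctx->transport_params.max_data = (uint64_t) fifo_size * 16;
  ctx->transport_params.max_streams_bidi = 16;
}
```

The other half of the scheme is ACK timing: the receive window is only re-opened as data is dequeued from the fifo, so the peer can never have more un-acked data in flight than the fifo can hold.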

SLIDE 22

TX path

[Flow: stream data is pushed by the app - an internal app via a session event, an external app via an MQ event on the app message queue, or a quicly app directly - then: (payload copy) → QUIC stream session tx fifo → (session event) → quicly packet generation → quicly packet encryption → (UDP payload copy) → UDP session tx fifo → (session event) → VPP session node]
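The quicly side of this path is a send loop along these lines; the quicly_send signature below is the iovec-based variant and has varied across versions, so treat it as approximate.

```c
#include <sys/uio.h>
#include <quicly.h>
#include <svm/svm_fifo.h>

/* Sketch: drain quicly's tx side into the UDP session tx fifo. */
static void
quic_tx_flush (quicly_conn_t *conn, svm_fifo_t *udp_tx_fifo)
{
  quicly_address_t dest, src;
  struct iovec dgrams[16];
  uint8_t buf[16 * 1500];
  size_t num = sizeof (dgrams) / sizeof (dgrams[0]);

  /* quicly pulls stream tx data (on_send_emit), generates and encrypts
   * packets, and hands back ready-to-send UDP datagrams. */
  if (quicly_send (conn, &dest, &src, dgrams, &num, buf, sizeof (buf)) != 0)
    return;

  for (size_t i = 0; i < num; i++)
    /* One copy per datagram into the UDP tx fifo; VPP drains it onto
     * the wire. Stop early if the fifo fills (backpressure, next slide). */
    svm_fifo_enqueue (udp_tx_fifo, dgrams[i].iov_len, dgrams[i].iov_base);
}
```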

SLIDE 23

Backpressure

  • UDP backpressure: we limit the number of packets generated by quicly so as not to overflow the UDP tx fifo
  • How does an app know it should wait before sending more data?
    ○ When quicly cannot send data as fast as the app provides it, it stops reading from the QUIC streams' tx fifos
    ○ The app needs to check the amount of space available in the fifo before sending data
    ○ The app can subscribe to notifications when data is dequeued from its fifos
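An app-side send path that respects this could look like the sketch below; the fifo calls follow VPP's svm fifo API (svm_fifo_max_enqueue, svm_fifo_add_want_deq_ntf), but verify the names against svm/svm_fifo.h for your VPP version.

```c
#include <svm/svm_fifo.h>

/* Sketch: try to enqueue; otherwise arm a dequeue notification and
 * retry later instead of busy-polling the fifo. */
static int
app_try_send (svm_fifo_t *tx_fifo, const u8 *data, u32 len)
{
  if (svm_fifo_max_enqueue (tx_fifo) < len)
    {
      /* Not enough room: ask to be notified once the protocol has
       * dequeued data from this fifo, then come back. */
      svm_fifo_add_want_deq_ntf (tx_fifo, SVM_FIFO_WANT_DEQ_NOTIF);
      return -1;
    }
  return svm_fifo_enqueue (tx_fifo, len, data);
}
```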

SLIDE 24

Threading model

  • VPP runs either with one thread, or one main thread + n worker threads
  • UDP packet assignment to threads depends on RSS
    ○ The receiving thread is unknown when the first packet is sent
    ○ UDP connections start on one thread and migrate when the first reply is received
    ○ The VPP host stack sends notifications when this happens
  • QUIC sessions are opened only when the handshake completes, and thus do not migrate (as long as there are no mobility events - not yet supported)

  • All QUIC streams are placed in the thread where their connection exists

SLIDE 25

How quick is it?

SLIDE 26

Performance: evaluation

There is no canonical QUIC performance assessment tool yet, so we use a custom iperf-like client/server benchmark tool:

  • Opens N connections
  • Then opens n streams in each connection
  • Client sends d bytes of data per stream
  • Server closes the streams, then the connection

Typical setup: N=10, n=10, d=1 GB

SLIDE 27

Performance: test setup

[Diagram: test server and test client, each running VPP + quicly and the test app on the AVF driver, connected with XL710 NICs over a 40Gbps link]

  • Core pinning, VPP and test apps on same NUMA node
  • 1500 bytes MTU
  • 2x Intel Xeon Gold 6146 3.2GHz CPUs

SLIDE 28

Performance: initial results

Simultaneous connections × streams   Workers     Throughput
10 × 10                              1 worker    3.5 Gbit/s
100 × 10                             4 workers   13.7 Gbit/s

  • Scales up to 100k streams / core
  • Handshake rate ~1500 / s / core

SLIDE 29

Performance: optimisations

  • Crypto

    ○ quicly uses picotls by default for the TLS handshake and packet encryption / decryption
    ○ picotls has a pluggable encryption API, which uses OpenSSL by default
    ○ Using the VPP native crypto API yielded better results
    ○ Further improvements were obtained by batching crypto operations, using the quicly offload API (see the sketch after this list):
      ■ N packets are received and decoded
      ■ These N packets are decrypted at the same time
      ■ The decrypted packets are passed to quicly for protocol processing
      ■ The same idea is applied on the TX path as well

  • Congestion control

    ○ quicly's default congestion control (Reno) doesn't reach very high throughputs
    ○ Fortunately, it is pluggable as well :)
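The RX half of the batching idea, as a sketch; quic_crypto_decrypt_batch is a hypothetical helper standing in for the VPP-native batched decryption, not an actual plugin symbol.

```c
#include <quicly.h>

/* Hypothetical helper: batched AEAD decryption via VPP native crypto. */
void quic_crypto_decrypt_batch (quicly_decoded_packet_t *pkts, size_t n);

/* Sketch of batched RX processing: headers for a whole batch are decoded
 * first (by the caller), the batch is decrypted in one pass, and only
 * then does quicly run protocol processing on each packet. */
static void
quic_rx_process_batch (quicly_conn_t *conn, quicly_decoded_packet_t *pkts,
                       size_t n)
{
  /* One pass over the batch lets the crypto engine (VPP native crypto,
   * AES-NI) pipeline the AEAD operations instead of going packet by
   * packet. */
  quic_crypto_decrypt_batch (pkts, n);

  /* The packets are already decrypted, so quicly only does protocol
   * work here (via its crypto offload support). */
  for (size_t i = 0; i < n; i++)
    quicly_receive (conn, NULL, NULL, &pkts[i]);
}
```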

SLIDE 30

Performance: new results

Configuration                          Workers    Throughput
10 × 10, pre-optimization              1 worker   3.5 Gbit/s
10 × 10, w/ batching & native crypto   1 worker   4.5 Gbit/s (+28%)


For now, most of the CPU time is spent doing crypto. Intel Ice Lake CPUs will accelerate AES and may move the bottleneck more towards protocol processing.

SLIDE 31

What’s next

SLIDE 32

Next steps

  • Performance optimisation
  • Mobility support
  • Continuous benchmarking - soon on https://docs.fd.io/csit/master/trending/index.html

If you want to get involved: https://gerrit.fd.io/r/q/project:vpp - the code is in src/plugins/quic/
If you want to try it, check out the example code in src/plugins/hs_apps/ (host stack apps)

SLIDE 33

Use cases

  • Scalable HTTP/3 servers
  • Scalable gRPC over QUIC servers
  • QUIC VPN

    ○ Better than an SSL VPN: mobility support, and using one stream per flow eliminates head-of-line blocking
    ○ As easy to deploy as an SSL VPN: only a certificate is needed on the server, plus an authentication mechanism for clients

  • QUIC VPN with transparent proxying

    ○ Transparently terminating the TCP connections at the VPN gateway and sending only the TCP payloads in QUIC streams eliminates the nested congestion control issues

SLIDE 34

Takeaways


  • Great experience with quicly
  • VPP now provides an easy API to use QUIC
  • The host stack proved extensible to new protocols
  • The VPP framework gave QUIC a good performance boost
    ○ Native crypto + vector processing
    ○ Still some effort required to reach maximum performance

SLIDE 35

Thanks for listening

Any questions?

SLIDE 36

Backup slides

SLIDE 37

Building a QUIC stack - a QUIC dive

[Diagram: the QUIC plugin between two pairs of rx/tx fifos - quicly (with picotls) raises accept, data, in/out-of-order packet and close events on one side, while connect, close and read operations arrive through the message queue on the other]