Network stack specialization for performance
Ilias Marinos§, Robert N.M. Watson§, Mark Handley*
§ University of Cambridge, * University College London
goo.gl/1la2u6
Motivation
Content providers are scaling out rapidly. Key aspects:
- Are general-purpose stacks the right solution for that kind of role?
- Conventional stacks cope well with long flows, but what about short ones?
[Figure: conventional stack serving HTTP objects. X-axis: HTTP object size (KB), 8-128; left Y-axis: network throughput (Gbps), 2-10; right Y-axis: CPU utilization (%), 40-200. Annotations: NIC saturation at low CPU usage, but the throughput/CPU ratio is low. Short-lived HTTP flows are a problem!]
Distribution based on traces from the Yahoo! CDN [Al-Fares et al. 2011]:
- 90% of the requested HTTP object sizes are ≤ 25K
- 95% of the requested HTTP object sizes are ≤ 50K
Design goals
Design a network stack that:
- Tightly couples with the application: stack internals are exposed to the application and vice versa
- Is optimized for performance (memory locality, lock- and sharing-free operation, CPU-affinity)
- Amortizes repetitive processing costs (e.g. TCP segmentation of static content)
Sandstorm prototype
Prototyped on top of FreeBSD's netmap framework:
- libnmio: netmap-related I/O
- libeth: Ethernet layer
- libtcpip: TCP/IP layer
- webserver: a specialized server that serves static content
[Architecture diagram: the NIC's TX/RX buffer rings and DMA memory are mapped into user space via netmap ioctls (zero copy); only the syscall path and the device driver remain in kernel space. In user space the webserver links against libnmio.so, libeth.so, and libtcpip.so. Call chain: web_recv()/web_write() ↔ tcpip_recv()/tcpip_write()/tcpip_fsm() ↔ tcpip_input()/tcpip_output() ↔ eth_input()/eth_output() ↔ netmap_input()/netmap_output().]
Key decisions (some of them):
- Stack and application share a single address space
- Static content is pre-packetized and a-priori loaded to DRAM
- Incoming packets are processed directly from the RX rings, w/o memory copying/buffering
[Animated slide sequence, consolidated: the packet path through Sandstorm. The NIC driver (kernel space) fills the shared ix0:RX ring by DMA; the user-space process, layered as nmio / eth / tcpip / app, polls the ring. On POLLIN the request climbs the stack: netmap_input() → ether_input() → tcpip_input() → TCP FSM → websrv_accept()/websrv_receive(). The response descends: tcpip_output() → ether_output() → netmap_output(), which places pre-packetized content buffers on the ix0:TX ring; on POLLOUT the NIC driver transmits them.]
[Figure: throughput over 6 NICs (Gbps, 10-60) vs. HTTP object size (4KB-1024KB) for nginx+FreeBSD, nginx+Linux, and Sandstorm. Annotated speedups of ~9.8x, ~3.6x, and ~1.8x at increasing object sizes; the curves start converging for sizes ≥ 256K.]
/* Get source and destination slots */
struct netmap_slot *bf = &ppool->slot[slotindex];
struct netmap_slot *tx = &txring->slot[cur];

/* Zero-copy variant: swap the prepared buffer's index into the TX slot */
tx->buf_idx = bf->buf_idx;
tx->len = bf->len;
tx->flags = NS_BUF_CHANGED;

/* memcpy variant: copy the payload into the TX slot's own buffer */
char *srcp = NETMAP_BUF(ppool, bf->buf_idx);
char *dstp = NETMAP_BUF(txring, tx->buf_idx);
memcpy(dstp, srcp, bf->len);
tx->len = bf->len;
[Figure: throughput (Gbps, 2-10) of Sandstorm "zerocopy" vs. Sandstorm "memcpy" while serving a 24KB HTTP object, on an Intel Core 2 (2006) system and on an Intel Sandybridge (2013) system.]
[Architecture diagrams: on the older system, cores with L2 caches sit on the front-side bus behind the Memory Controller Hub; the NIC's DMA engine on PCIe must take an extra detour to RAM for packet data, the FSB/memory path is the bottleneck, and the NIC raises an interrupt when done. On the newer system with an integrated memory controller (MC), DMA is served from the last-level cache, and packet buffers only reach RAM on eventual eviction from the LLC.]
Should the hardware remain a "black box" for networked systems development?
[Figure: memory read throughput over 6 NICs (Gbps, 20-120) vs. object size (16KB-1024KB) for Sandstorm "zerocopy" and Sandstorm "memcpy". Lower is better.]
Discussion
Natural fit for:
- Serving static content over HTTP, including HTTP streaming (e.g., Apple HLS)
Limitations:
- Requires porting applications to the specialized stack
- Only accepts incoming connections (not initiating them)
- Assumes small, single-packet requests (typical HTTP requests)

Conclusions
General-purpose stacks:
- Sacrifice substantial performance (especially for small-sized objects!)
- Providers already specialize at the deployment level (application-specific clusters)
Specialized stacks:
- Specialized network stacks are not only viable, but necessary!
[Figure: performance (y-axis ticks 200-1600) vs. # concurrent connections (4-80) when serving a 24KB object: Sandstorm vs. Linux+nginx vs. FreeBSD+nginx.]
Summary
Problems with general-purpose stacks:
- Lock contention (e.g., PCB locks)
- Per-packet overheads (e.g., mbuf alloc./copying)
Solutions with specialized stacks:
- Shared-nothing design, per-core state
- Pre-packetized, cache-friendly content with incr. cksum
- No memory copying/buffering