IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Adam Belay, George Prekas, Samuel Grossman, Ana Klimovic, Christos Kozyrakis, Edouard Bugnion
HW is fast, but SW is a Bottleneck

64-byte TCP Echo:
[Figure: round-trip latency (µs) and requests per second (millions) for the HW limit, Linux, and IX; Linux trails the HW limit by a 4.8x gap in latency and an 8.8x gap in throughput]
IX Closes the SW Performance Gap

64-byte TCP Echo:
[Figure: the same latency and throughput chart; IX approaches the HW limit on both axes]
Two Contributions

#1: Protection and direct HW access through virtualization

#2: Execution model for low latency and high throughput
Why is SW Slow?

[Figure: the Linux kernel networking flow, created by Arnout Vandecappelle, http://www.linuxfoundation.org/collaborate/workgroups/networking/kernel_flow — a complex interface, with code paths convoluted by interrupts and scheduling]
Problem: 1980s Software Architecture

- Berkeley sockets, designed for CPU time sharing
- Today's large-scale datacenter workloads:
  Hardware: dense multicore + 10 GbE (soon 40)
  – API scalability is critical!
  – Gap between compute and RAM -> cache behavior matters
  – Packet inter-arrival times of 50 ns
  Scale-out access patterns
  – Fan-in -> large connection counts, high request rates
  – Fan-out -> tail latency matters!
Conventional Wisdom

- Bypass the kernel
  – Move TCP to user space (Onload, mTCP, Sandstorm)
  – Move TCP to hardware (TOE)
- Avoid the connection scalability bottleneck
  – Use datagrams instead of connections (DIY congestion management)
  – Use proxies at the expense of latency
- Replace classic Ethernet
  – Use a lossless fabric (InfiniBand)
  – Offload memory access (RDMA)
- Common thread: give up on systems software
Our Approach

- Rather than give up on systems software, tackle the problem head on…
  – Robust protection between app and netstack
  – Connection scalability
  – Commodity 10Gb Ethernet
Separation of Control and Data Plane

[Figure, built up across several slides: the Linux kernel plus the IX control plane (IX CP) run at host ring 0 via Dune; each IX dataplane (IX DP) runs at guest ring 0 with dedicated cores (C) and a per-core RX/TX queue pair; applications such as Memcached and HTTPd run at ring 3 in userspace and link against libIX]
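The per-core RX/TX pairs in the figure depend on the NIC steering every packet of a flow to the same queue, and hence the same core. A minimal sketch of that invariant, with a toy FNV-1a hash standing in for the NIC's RSS/Toeplitz hardware (struct flow4 and pick_queue are illustrative names, not IX's code):

#include <stdint.h>
#include <stdio.h>

/* Illustrative 4-tuple; IX relies on NIC hardware for steering, and
 * this toy version only shows the flow -> queue invariant. */
struct flow4 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Hash every packet of a flow to the same value (FNV-1a here,
 * a stand-in for the NIC's Toeplitz hash). */
static uint32_t flow_hash(const struct flow4 *f)
{
    const uint8_t *p = (const uint8_t *)f;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*f); i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Same flow -> same RX queue -> same core: per-flow TCP state is
 * touched by one core only, so no locks and no cache-line bouncing. */
static unsigned pick_queue(const struct flow4 *f, unsigned nqueues)
{
    return flow_hash(f) % nqueues;
}

int main(void)
{
    struct flow4 f = { 0x0a000001, 0x0a000002, 12345, 80 };
    printf("flow -> queue %u of 4\n", pick_queue(&f, 4));
    return 0;
}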
The IX Execution Pipeline

[Figure: the numbered pipeline — (1) packets are polled from the RX queue, (2) processed by the guest ring 0 TCP/IP stack, (3) queued as event conditions in the RX FIFO, (4) consumed by the ring 3 event-driven app through libIX, (5) answered with batched syscalls, and (6) completed by TCP/IP transmit and timer processing]
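The batched syscalls in step (5) amortize the ring 3 to guest ring 0 transition over many operations: the app fills descriptors in shared memory, then crosses the protection boundary once. A minimal sketch of the idea under assumed names (bsys_desc, BSYS_TCP_SEND, and dataplane_run_batch are hypothetical, not IX's real ABI):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical batched-syscall descriptor: the app fills an array of
 * these at ring 3, then makes ONE transition into the dataplane,
 * which executes all of them before returning. */
enum bsys_op { BSYS_TCP_SEND, BSYS_TCP_CLOSE };

struct bsys_desc {
    enum bsys_op op;
    long arg0, arg1;     /* e.g., connection id, buffer length */
    long ret;
};

/* Stand-in for the dataplane side (really runs at guest ring 0). */
static void dataplane_run_batch(struct bsys_desc *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (d[i].op) {
        case BSYS_TCP_SEND:
            d[i].ret = d[i].arg1;   /* pretend the bytes were queued */
            break;
        case BSYS_TCP_CLOSE:
            d[i].ret = 0;
            break;
        }
    }
}

int main(void)
{
    /* Three operations, one boundary crossing instead of three. */
    struct bsys_desc batch[] = {
        { BSYS_TCP_SEND,  1, 512, 0 },
        { BSYS_TCP_SEND,  2, 128, 0 },
        { BSYS_TCP_CLOSE, 1,   0, 0 },
    };
    dataplane_run_batch(batch, 3);
    printf("desc0 ret=%ld\n", batch[0].ret);
    return 0;
}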
Design (1): Run to Completion

[Figure: the same execution pipeline, run start to finish on one core]
- Improves data-cache locality
- Removes scheduling unpredictability
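A minimal, compilable sketch of a run-to-completion loop, with stub helpers standing in for the NIC poll, the TCP/IP stack, and the application handler (all names are illustrative): every packet is processed fully, while its data is cache-hot, before the next one is touched, and no scheduler intervenes.

#include <stdio.h>
#include <stddef.h>

/* Toy packet and stubs (in real IX this loop is the guest ring 0
 * dataplane; the names here are illustrative). */
struct pkt { int id; };

static struct pkt inflight[3] = { {1}, {2}, {3} };
static size_t delivered;

static size_t poll_rx(struct pkt **out, size_t max)   /* non-blocking */
{
    size_t n = 0;
    while (delivered < 3 && n < max)
        out[n++] = &inflight[delivered++];
    return n;
}

static void app_handle(struct pkt *p) { printf("handled pkt %d\n", p->id); }
static void tcp_output_and_timers(void) { /* flush TX ring, run timers */ }

int main(void)
{
    struct pkt *pkts[64];
    size_t n;

    /* Run-to-completion: each packet goes through protocol processing
     * and the app handler while cache-hot, with no interrupts or
     * context switches in between; TX and timers run after the batch. */
    while ((n = poll_rx(pkts, 64)) > 0) {
        for (size_t i = 0; i < n; i++)
            app_handle(pkts[i]);          /* tcp_input step elided */
        tcp_output_and_timers();
    }
    return 0;
}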
Design (2): Adaptive Batching

[Figure: the same execution pipeline, plus an adaptive batch calculation]
- Improves instruction-cache locality and prefetching
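A minimal sketch of the adaptive rule, assuming a hypothetical rx_pending() counter: never wait for packets to accumulate, but under congestion cap the batch at an upper bound B, so an idle system keeps single-packet latency while a loaded one gets the instruction-cache and prefetching wins of large batches.

#include <stdio.h>
#include <stddef.h>

#define MAX_BATCH 64   /* upper bound B, applied only under congestion */

/* Hypothetical: how many packets the NIC has already queued. */
static size_t rx_pending(void) { return 200; }  /* pretend we're congested */

/* Adaptive batching: take everything that has ALREADY arrived, up to B.
 * No timer, no waiting to fill a batch. */
static size_t next_batch_size(void)
{
    size_t pending = rx_pending();
    return pending < MAX_BATCH ? pending : MAX_BATCH;
}

int main(void)
{
    printf("batch of %zu packets\n", next_batch_size());
    return 0;
}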
See the Paper for more Details

- Design (3): Flow-consistent hashing
  – Synchronization- and coherence-free operation
- Design (4): Native zero-copy API
  – Flow control exposed to the application
- libIX: libevent-like event-based programming (see the sketch after this list)
- IX prototype implementation
  – Dune, DPDK, lwIP, ~40K SLOC of kernel code
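For flavor, a sketch of what libevent-like, zero-copy event handling might look like against a hypothetical libIX-style interface (ix_event, IX_EV_RECV, and echo_sendv are made-up names, not the real libIX API): the app drains a batch of event conditions and reacts to each, reading payloads in place rather than copying them.

#include <stdio.h>
#include <stddef.h>

/* Illustrative event condition, loosely after libevent: the dataplane
 * delivers these in batches instead of per-call return values. */
struct ix_event {
    enum { IX_EV_RECV, IX_EV_SENT, IX_EV_DEAD } type;
    long handle;          /* connection handle */
    void *buf;            /* zero-copy payload: app reads it in place */
    size_t len;
};

static void echo_sendv(long h, void *buf, size_t len)
{
    /* Would queue a batched, zero-copy send; the buffer stays owned
     * by the netstack until a later IX_EV_SENT completion. */
    printf("echo %zu bytes on conn %ld\n", len, h);
}

int main(void)
{
    char payload[] = "ping";
    struct ix_event evs[] = {
        { IX_EV_RECV, 7, payload, sizeof(payload) },
    };

    /* Event loop body: drain the batch of conditions, react, repeat. */
    for (size_t i = 0; i < sizeof(evs) / sizeof(evs[0]); i++) {
        if (evs[i].type == IX_EV_RECV)
            echo_sendv(evs[i].handle, evs[i].buf, evs[i].len);
    }
    return 0;
}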
Evaluation

- Comparison of IX to Linux and mTCP [NSDI '14]
- TCP microbenchmarks and Memcached

[Figure: testbed — a 10GbE switch connecting ~25 Linux hosts to two IX servers, one with 1x10GbE and one with 4x10GbE using an L3+L4 bond]
TCP Netpipe

[Figure: goodput (Gbps) vs. message size (KB) for mTCP-mTCP, Linux-Linux, and IX-IX; annotations: ½ bandwidth @ 20 KB (IX) vs. ½ bandwidth @ 135 KB, and a 5.7 µs ½ RTT]
TCP Echo: Multicore Scalability for Short Connections

[Figure: messages/sec (x10^6) vs. number of CPU cores (1-8) for IX 10GbE, IX 4x10GbE, Linux 10GbE, Linux 4x10GbE, and mTCP 10GbE; IX saturates 1x10GbE]
Connection Scalability

[Figure: messages/sec (x10^6) vs. connection count (log scale, 10 to 100,000) for Linux and IX at 10Gbps and 40Gbps; beyond ~10,000 connections, throughput becomes limited by L3]
Memcached over TCP

[Figure: average and 99th-percentile latency (µs) vs. USR throughput (RPS x10^3) for IX and Linux, with an SLA line; IX achieves 3.6x more RPS and 2x lower tail latency, and 6x lower tail latency with IX clients]
IX Conclusion

- A protected dataplane OS for datacenter applications with an event-driven model and demanding connection scalability requirements
- Efficient access to HW, without sacrificing security, through virtualization
- High throughput and low latency, enabled by a dataplane execution model