IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Adam Belay, George Prekas, Samuel Grossman, Ana Klimovic, Christos Kozyrakis, Edouard Bugnion
HW is fast, but SW is a Bottleneck

64-byte TCP Echo:
[Figure: round-trip latency (µs) and requests per second (millions) for the HW limit, Linux, and IX; Linux trails the HW limit by a 4.8x gap in latency and an 8.8x gap in throughput]
IX Closes the SW Performance Gap

64-byte TCP Echo:
[Figure: the same latency and throughput chart; IX approaches the HW limit on both axes]
Two Contributions

#1: Protection and direct HW access through virtualization

#2: Execution model for low latency and high throughput
Why is SW Slow?

[Figure: the Linux kernel networking flow, created by Arnout Vandecappelle, http://www.linuxfoundation.org/collaborate/workgroups/networking/kernel_flow — a complex interface, with code paths convoluted by interrupts and scheduling]
Problem: 1980s Software Architecture

- Berkeley sockets, designed for CPU time sharing
- Today's large-scale datacenter workloads:
  Hardware: dense multicore + 10 GbE (soon 40)
  – API scalability is critical!
  – Gap between compute and RAM -> cache behavior matters
  – Packet inter-arrival times of 50 ns
  Scale-out access patterns
  – Fan-in -> large connection counts, high request rates
  – Fan-out -> tail latency matters!
Conventional Wisdom

- Bypass the kernel
  – Move TCP to user space (Onload, mTCP, Sandstorm)
  – Move TCP to hardware (TOE)
- Avoid the connection scalability bottleneck
  – Use datagrams instead of connections (DIY congestion management)
  – Use proxies at the expense of latency
- Replace classic Ethernet
  – Use a lossless fabric (InfiniBand)
  – Offload memory access (RDMA)
- Common thread: give up on systems software
Our Approach

- Rather than give up on systems software, tackle the problem head on…
  – Robust protection between app and netstack
  – Connection scalability
  – Commodity 10Gb Ethernet
Separation of Control and Data Plane

[Figure, built up across several slides: the Linux kernel plus the IX control plane (IX CP) run at host ring 0 via Dune; each IX dataplane (IX DP) runs at guest ring 0 with dedicated cores (C) and a per-core RX/TX queue pair; applications such as Memcached and HTTPd run at ring 3 in userspace and link against libIX]
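The per-core RX/TX pairs in the figure depend on the NIC steering every packet of a flow to the same queue, and hence the same core. A minimal sketch of that invariant, with a toy FNV-1a hash standing in for the NIC's RSS/Toeplitz hardware (struct flow4 and pick_queue are illustrative names, not IX's code):

#include <stdint.h>
#include <stdio.h>

/* Illustrative 4-tuple; IX relies on NIC hardware for steering, and
 * this toy version only shows the flow -> queue invariant. */
struct flow4 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Hash every packet of a flow to the same value (FNV-1a here,
 * a stand-in for the NIC's Toeplitz hash). */
static uint32_t flow_hash(const struct flow4 *f)
{
    const uint8_t *p = (const uint8_t *)f;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*f); i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Same flow -> same RX queue -> same core: per-flow TCP state is
 * touched by one core only, so no locks and no cache-line bouncing. */
static unsigned pick_queue(const struct flow4 *f, unsigned nqueues)
{
    return flow_hash(f) % nqueues;
}

int main(void)
{
    struct flow4 f = { 0x0a000001, 0x0a000002, 12345, 80 };
    printf("flow -> queue %u of 4\n", pick_queue(&f, 4));
    return 0;
}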
The IX Execution Pipeline

[Figure: the numbered pipeline — (1) packets are polled from the RX queue, (2) processed by the guest ring 0 TCP/IP stack, (3) queued as event conditions in the RX FIFO, (4) consumed by the ring 3 event-driven app through libIX, (5) answered with batched syscalls, and (6) completed by TCP/IP transmit and timer processing]
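The batched syscalls in step (5) amortize the ring 3 to guest ring 0 transition over many operations: the app fills descriptors in shared memory, then crosses the protection boundary once. A minimal sketch of the idea under assumed names (bsys_desc, BSYS_TCP_SEND, and dataplane_run_batch are hypothetical, not IX's real ABI):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical batched-syscall descriptor: the app fills an array of
 * these at ring 3, then makes ONE transition into the dataplane,
 * which executes all of them before returning. */
enum bsys_op { BSYS_TCP_SEND, BSYS_TCP_CLOSE };

struct bsys_desc {
    enum bsys_op op;
    long arg0, arg1;     /* e.g., connection id, buffer length */
    long ret;
};

/* Stand-in for the dataplane side (really runs at guest ring 0). */
static void dataplane_run_batch(struct bsys_desc *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        switch (d[i].op) {
        case BSYS_TCP_SEND:
            d[i].ret = d[i].arg1;   /* pretend the bytes were queued */
            break;
        case BSYS_TCP_CLOSE:
            d[i].ret = 0;
            break;
        }
    }
}

int main(void)
{
    /* Three operations, one boundary crossing instead of three. */
    struct bsys_desc batch[] = {
        { BSYS_TCP_SEND,  1, 512, 0 },
        { BSYS_TCP_SEND,  2, 128, 0 },
        { BSYS_TCP_CLOSE, 1,   0, 0 },
    };
    dataplane_run_batch(batch, 3);
    printf("desc0 ret=%ld\n", batch[0].ret);
    return 0;
}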
Design (1): Run to Completion

[Figure: the same execution pipeline, run start to finish on one core]
- Improves data-cache locality
- Removes scheduling unpredictability
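A minimal, compilable sketch of a run-to-completion loop, with stub helpers standing in for the NIC poll, the TCP/IP stack, and the application handler (all names are illustrative): every packet is processed fully, while its data is cache-hot, before the next one is touched, and no scheduler intervenes.

#include <stdio.h>
#include <stddef.h>

/* Toy packet and stubs (in real IX this loop is the guest ring 0
 * dataplane; the names here are illustrative). */
struct pkt { int id; };

static struct pkt inflight[3] = { {1}, {2}, {3} };
static size_t delivered;

static size_t poll_rx(struct pkt **out, size_t max)   /* non-blocking */
{
    size_t n = 0;
    while (delivered < 3 && n < max)
        out[n++] = &inflight[delivered++];
    return n;
}

static void app_handle(struct pkt *p) { printf("handled pkt %d\n", p->id); }
static void tcp_output_and_timers(void) { /* flush TX ring, run timers */ }

int main(void)
{
    struct pkt *pkts[64];
    size_t n;

    /* Run-to-completion: each packet goes through protocol processing
     * and the app handler while cache-hot, with no interrupts or
     * context switches in between; TX and timers run after the batch. */
    while ((n = poll_rx(pkts, 64)) > 0) {
        for (size_t i = 0; i < n; i++)
            app_handle(pkts[i]);          /* tcp_input step elided */
        tcp_output_and_timers();
    }
    return 0;
}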
Design (2): Adaptive Batching

[Figure: the same execution pipeline, plus an adaptive batch calculation]
- Improves instruction-cache locality and prefetching
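A minimal sketch of the adaptive rule, assuming a hypothetical rx_pending() counter: never wait for packets to accumulate, but under congestion cap the batch at an upper bound B, so an idle system keeps single-packet latency while a loaded one gets the instruction-cache and prefetching wins of large batches.

#include <stdio.h>
#include <stddef.h>

#define MAX_BATCH 64   /* upper bound B, applied only under congestion */

/* Hypothetical: how many packets the NIC has already queued. */
static size_t rx_pending(void) { return 200; }  /* pretend we're congested */

/* Adaptive batching: take everything that has ALREADY arrived, up to B.
 * No timer, no waiting to fill a batch. */
static size_t next_batch_size(void)
{
    size_t pending = rx_pending();
    return pending < MAX_BATCH ? pending : MAX_BATCH;
}

int main(void)
{
    printf("batch of %zu packets\n", next_batch_size());
    return 0;
}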
See the Paper for more Details

- Design (3): Flow-consistent hashing
  – Synchronization- and coherence-free operation
- Design (4): Native zero-copy API
  – Flow control exposed to the application
- libIX: libevent-like event-based programming (see the sketch after this list)
- IX prototype implementation
  – Dune, DPDK, lwIP, ~40K SLOC of kernel code
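For flavor, a sketch of what libevent-like, zero-copy event handling might look like against a hypothetical libIX-style interface (ix_event, IX_EV_RECV, and echo_sendv are made-up names, not the real libIX API): the app drains a batch of event conditions and reacts to each, reading payloads in place rather than copying them.

#include <stdio.h>
#include <stddef.h>

/* Illustrative event condition, loosely after libevent: the dataplane
 * delivers these in batches instead of per-call return values. */
struct ix_event {
    enum { IX_EV_RECV, IX_EV_SENT, IX_EV_DEAD } type;
    long handle;          /* connection handle */
    void *buf;            /* zero-copy payload: app reads it in place */
    size_t len;
};

static void echo_sendv(long h, void *buf, size_t len)
{
    /* Would queue a batched, zero-copy send; the buffer stays owned
     * by the netstack until a later IX_EV_SENT completion. */
    printf("echo %zu bytes on conn %ld\n", len, h);
}

int main(void)
{
    char payload[] = "ping";
    struct ix_event evs[] = {
        { IX_EV_RECV, 7, payload, sizeof(payload) },
    };

    /* Event loop body: drain the batch of conditions, react, repeat. */
    for (size_t i = 0; i < sizeof(evs) / sizeof(evs[0]); i++) {
        if (evs[i].type == IX_EV_RECV)
            echo_sendv(evs[i].handle, evs[i].buf, evs[i].len);
    }
    return 0;
}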
Evaluation

- Comparison of IX to Linux and mTCP [NSDI '14]
- TCP microbenchmarks and Memcached

[Figure: testbed — a 10GbE switch connecting ~25 Linux hosts to two IX servers, one with 1x10GbE and one with 4x10GbE using an L3+L4 bond]
TCP Netpipe

[Figure: goodput (Gbps) vs. message size (KB) for mTCP-mTCP, Linux-Linux, and IX-IX; annotations: ½ bandwidth @ 20 KB (IX) vs. ½ bandwidth @ 135 KB, and a 5.7 µs ½ RTT]
TCP Echo: Multicore Scalability for Short Connections

[Figure: messages/sec (x10^6) vs. number of CPU cores (1-8) for IX 10GbE, IX 4x10GbE, Linux 10GbE, Linux 4x10GbE, and mTCP 10GbE; IX saturates 1x10GbE]
Connection Scalability

[Figure: messages/sec (x10^6) vs. connection count (log scale, 10 to 100,000) for Linux and IX at 10Gbps and 40Gbps; beyond ~10,000 connections, throughput becomes limited by L3]
Memcached over TCP

[Figure: average and 99th-percentile latency (µs) vs. USR throughput (RPS x10^3) for IX and Linux, with an SLA line; IX achieves 3.6x more RPS and 2x lower tail latency, and 6x lower tail latency with IX clients]
IX Conclusion

- A protected dataplane OS for datacenter applications with an event-driven model and demanding connection scalability requirements
- Efficient access to HW, without sacrificing security, through virtualization
- High throughput and low latency, enabled by a dataplane execution model