IX: A Protected Dataplane Operating System for High Throughput and Low Latency - PowerPoint PPT Presentation



SLIDE 1

IX: A Protected Dataplane Operating System for High Throughput and Low Latency

Adam Belay, George Prekas, Samuel Grossman, Ana Klimovic, Christos Kozyrakis, Edouard Bugnion

SLIDE 2

HW is fast, but SW is a Bottleneck

64-byte TCP Echo:

[Chart: latency in microseconds (0-60) and requests per second in millions (0-10) for HW limit, Linux, and IX]

SLIDE 3

HW is fast, but SW is a Bottleneck

64-byte TCP Echo:

[Chart: latency in microseconds and requests per second in millions for HW limit, Linux, and IX; annotations: 4.8x gap, 8.8x gap]

SLIDE 4

IX Closes the SW Performance Gap

64-byte TCP Echo:

[Chart: latency in microseconds and requests per second in millions for HW limit, Linux, and IX]

SLIDE 5

Two Contributions

#1: Protection and direct HW access through virtualization

#2: Execution model for low latency and high throughput

[Chart: latency in microseconds and requests per second in millions for HW limit, Linux, and IX]

SLIDE 6

Why is SW Slow?

Created by: Arnout Vandecappelle http://www.linuxfoundation.org/collaborate/workgroups/networking/kernel_flow

[Figure: Linux kernel network stack flow. Callouts: Complex Interface; Code Paths Convoluted by Interrupts and Scheduling]

SLIDE 7

Problem: 1980s Software Architecture

  • Berkeley sockets, designed for CPU time sharing
  • Today’s large-scale datacenter workloads:

Hardware: Dense Multicore + 10 GbE (soon 40)
    – API scalability critical!
    – Gap between compute and RAM -> cache behavior matters
    – Packet inter-arrival times of 50 ns

Scale-out access patterns
    – Fan-in -> large connection counts, high request rates
    – Fan-out -> tail latency matters!

SLIDE 8

Conventional Wisdom

  • Bypass the kernel
    – Move TCP to user-space (Onload, mTCP, Sandstorm)
    – Move TCP to hardware (TOE)
  • Avoid the connection scalability bottleneck
    – Use datagrams instead of connections (DIY congestion management)
    – Use proxies at the expense of latency
  • Replace classic Ethernet
    – Use a lossless fabric (InfiniBand)
    – Offload memory access (RDMA)
  • Common thread: give up on systems software


SLIDE 9

Our Approach

  • Bypass the kernel
    – Move TCP to user-space (Onload, mTCP, Sandstorm)
    – Move TCP to hardware (TOE)
  • Avoid the connection scalability bottleneck
    – Use datagrams instead of connections (DIY congestion management)
    – Use proxies at the expense of latency
  • Replace classic Ethernet
    – Use a lossless fabric (InfiniBand)
    – Offload memory access (RDMA)
  • Tackle the problem head on…

Robust Protection Between App and Netstack
Connection Scalability
Commodity 10Gb Ethernet

SLIDE 10

Separation of Control and Data Plane

[Diagram: userspace dataplanes (DP), each with application cores (C), above a control plane (CP) in the host kernel]

SLIDE 11

Separation of Control and Data Plane

[Diagram: userspace dataplanes (DP) with application cores (C) and per-core RX/TX queue pairs; control plane (CP) in the host kernel]

SLIDE 12

Separation of Control and Data Plane

[Diagram: IX dataplanes (IX DP) with cores (C) and per-core RX/TX queue pairs; IX control plane (IX CP) in the host kernel; privilege levels: ring 3, guest ring 0, host ring 0]

SLIDE 13

Separation of Control and Data Plane

[Diagram: IX dataplanes (IX DP) with cores (C) and per-core RX/TX queue pairs over the Linux kernel via Dune; IX control plane (IX CP); privilege levels: ring 3, guest ring 0, host ring 0]

SLIDE 14

Separation of Control and Data Plane

[Diagram: Memcached and HTTPd applications (ring 3), each on its own IX dataplane with libIX (guest ring 0) and per-core RX/TX queue pairs; IX CP on the Linux kernel via Dune (host ring 0)]

SLIDE 15

The IX Execution Pipeline

[Diagram: the IX execution pipeline, steps 1-6: RX → TCP/IP → RX FIFO of event conditions → event-driven app with libIX (ring 3) → batched syscalls → TCP/IP and timer → TX (guest ring 0)]
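The guest/ring-3 hand-off in the pipeline can be sketched as follows. This is a minimal illustrative model, not the real libIX API: the event-condition and syscall types and the echo-style handler are invented for the example. The dataplane delivers an array of event conditions; the application consumes them and queues responses that are flushed back as a single batched syscall instead of one kernel crossing per event.

```c
#include <stddef.h>

/* Hypothetical event/syscall types; names are invented for illustration. */
enum ev_type { EV_RECV, EV_SENT };

struct event_cond { enum ev_type type; int conn; };
struct syscall_req { int conn; };

/* Consume a batch of event conditions and queue one response per
 * received message; the caller flushes `out` as one batched syscall. */
static size_t dispatch_events(const struct event_cond *evs, size_t n,
                              struct syscall_req *out)
{
    size_t queued = 0;
    for (size_t i = 0; i < n; i++)
        if (evs[i].type == EV_RECV)              /* echo-style handler */
            out[queued++] = (struct syscall_req){ evs[i].conn };
    return queued;                               /* flushed as one batch */
}
```

The point of the shape is amortization: one array of events in, one array of syscalls out, regardless of how many connections fired.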

SLIDE 16

Design (1): Run to Completion

[Diagram: the IX execution pipeline, as on the previous slide]

Improves Data-Cache Locality
Removes Scheduling Unpredictability
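The idea can be sketched in a few lines. This is an illustrative model, not the actual IX code; the stand-in functions are invented. Each received packet is carried through protocol processing, the application handler, and transmit before the next packet is touched, so the packet's data stays warm in cache and no scheduler can interleave unrelated work between the stages.

```c
#include <stddef.h>

struct pkt { int len; };

/* Stand-ins for the real stages (invented for the sketch). */
static int tcp_input(struct pkt *p)  { return p->len; }      /* RX -> TCP/IP */
static int app_handler(int payload)  { return payload * 2; } /* app logic   */
static int tcp_output(int reply)     { return reply; }       /* TCP/IP -> TX */

/* Run each packet to completion before starting the next one. */
static int run_to_completion(struct pkt *batch, size_t n)
{
    int sent = 0;
    for (size_t i = 0; i < n; i++) {
        int payload = tcp_input(&batch[i]); /* protocol processing */
        int reply = app_handler(payload);   /* app runs immediately */
        sent += tcp_output(reply);          /* TX before next packet */
    }
    return sent;
}
```

Contrast with the conventional design, where each stage has its own queue and scheduler invocation, so a packet's data is usually cold again by the time the next stage runs.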

SLIDE 17

Design (2): Adaptive Batching

[Diagram: the IX execution pipeline, as on the previous slide]

Improves Instruction-Cache Locality and Prefetching

Adaptive Batch Calculation
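A minimal sketch of the adaptive part, with an invented cap value and function name (not the real IX calculation): take whatever has already arrived, up to a fixed maximum, and never busy-wait to fill a batch. Batches grow under load, amortizing per-packet costs, but collapse to a single packet when the system is idle, so batching never adds latency.

```c
#include <stddef.h>

#define MAX_BATCH 64  /* illustrative cap, not IX's actual value */

/* Batch size = however many packets are already queued, capped.
 * Crucially, we never wait for more arrivals to fill the batch. */
static size_t adaptive_batch_size(size_t queued)
{
    if (queued == 0)
        return 0;
    return queued < MAX_BATCH ? queued : MAX_BATCH;
}
```

The cap bounds worst-case latency; the "take what is there" rule is what makes the batching adaptive rather than fixed.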

SLIDE 18

See the Paper for more Details

  • Design (3): Flow-consistent hashing
    – Synchronization- and coherence-free operation
  • Design (4): Native zero-copy API
    – Flow control exposed to the application
  • libIX: libevent-like event-based programming
  • IX prototype implementation
    – Dune, DPDK, lwIP, ~40K SLOC of kernel code
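Flow-consistent hashing can be illustrated with a toy version. This is a sketch, not IX's implementation: real NICs compute a Toeplitz hash in hardware (RSS); FNV-1a is used here only for brevity, and the struct and function names are invented. Because every packet of a 4-tuple flow lands on the same core, per-flow TCP state needs no locks and never bounces between caches.

```c
#include <stddef.h>
#include <stdint.h>

/* A TCP flow identified by its 4-tuple (illustrative layout). */
struct flow4 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Map a flow to a core. Deterministic: the same 4-tuple always
 * picks the same core, so per-flow state is core-local.
 * (FNV-1a here; hardware RSS uses a Toeplitz hash.) */
static unsigned flow_to_core(const struct flow4 *f, unsigned ncores)
{
    uint64_t h = 1469598103934665603ull;
    const uint8_t *p = (const uint8_t *)f;
    for (size_t i = 0; i < sizeof(*f); i++) {
        h ^= p[i];
        h *= 1099511628211ull;
    }
    return (unsigned)(h % ncores);
}
```

The property that matters is not the hash function but the determinism: steering is fixed per flow, so cores share nothing and run coherence-free.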


SLIDE 19

Evaluation

  • Comparison of IX to Linux and mTCP [NSDI ’14]
  • TCP microbenchmarks and Memcached

[Diagram: testbed: a 10GbE switch connects ~25 Linux hosts to IX servers via 1x10GbE and 4x10GbE with an L3+L4 bond]

SLIDE 20

TCP Netpipe

[Chart: goodput (Gbps) vs. message size (KB) for mTCP-mTCP, Linux-Linux, and IX-IX; annotations: ½ bandwidth at 20 KB, ½ bandwidth at 135 KB, 5.7 µs ½ RTT]

SLIDE 21

TCP Echo: Multicore Scalability for Short Connections

[Chart: messages/sec (x10^6) vs. number of CPU cores (1-8) for IX 10GbE, IX 4x10GbE, Linux 10GbE, Linux 4x10GbE, and mTCP 10GbE; annotation: saturates 1x10GbE]

SLIDE 22

Connection Scalability

[Chart: messages/sec (x10^6) vs. connection count (log scale, 10 to 100,000) for Linux-10Gbps, Linux-40Gbps, IX-10Gbps, and IX-40Gbps; annotations: ~10,000 connections, limited by L3]

SLIDE 23

Memcached over TCP

[Chart: latency (µs) vs. USR throughput (RPS x10^3) against an SLA line, for IX (p99), IX (avg), Linux (p99), and Linux (avg); annotations: 3.6x more RPS, 2x less tail latency, 6x less tail latency with IX clients]

SLIDE 24

IX Conclusion

  • A protected dataplane OS for datacenter applications with an event-driven model and demanding connection scalability requirements
  • Efficient access to HW, without sacrificing security, through virtualization
  • High throughput and low latency, enabled by a dataplane execution model
