6.888: Lecture 4 Data Center Load Balancing - Mohammad Alizadeh - PowerPoint PPT Presentation


SLIDE 1

6.888: Lecture 4 Data Center Load Balancing

Mohammad Alizadeh

Spring 2016


SLIDE 2

Motivation

DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.)

Single-rooted tree [Core, Agg, Access; 1000s of server ports]
➢ High oversubscription

Multi-rooted tree [Fat-tree, Leaf-Spine, …; Spine, Leaf; 1000s of server ports]
➢ Full bisection bandwidth, achieved via multipathing

SLIDE 3

Motivation

DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.)

Multi-rooted tree [Fat-tree, Leaf-Spine, …; Spine, Leaf; 1000s of server ports]
➢ Full bisection bandwidth, achieved via multipathing

SLIDE 4

Multi-rooted != Ideal DC Network

Ideal DC network: big output-queued switch (1000s of server ports)
➢ No internal bottlenecks → predictable
➢ Simplifies BW management [BW guarantees, QoS, …]

Can’t build it :( A multi-rooted tree (1000s of server ports) ≈ the big switch, but with possible bottlenecks inside the fabric.

Need efficient load balancing

SLIDE 5

Today: ECMP Load Balancing

Pick among equal-cost paths by a hash of the 5-tuple
➢ Randomized load balancing
➢ Preserves packet order

Problems:
  • Hash collisions (coarse granularity)
  • Local & stateless (bad with asymmetry; e.g., due to link failures)

[Figure: H(f) % 3 selects one of three equal-cost uplinks for flow f]
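To make the hash-based selection concrete, here is a minimal sketch of ECMP next-hop choice (illustrative only, not any particular switch's implementation; the CRC32 hash and the path names are stand-ins):

    import zlib

    def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, paths):
        # Hash the flow's 5-tuple and pick one of the equal-cost paths.
        # Every packet of a flow hashes the same way, so ordering is
        # preserved, but two large flows can collide onto the same path.
        key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
        return paths[zlib.crc32(key) % len(paths)]   # H(f) % number of paths

    # Example: three equal-cost uplinks, as in the H(f) % 3 figure above.
    print(ecmp_next_hop("10.0.0.1", "10.0.1.2", 6, 12345, 80,
                        ["spine0", "spine1", "spine2"]))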

SLIDE 6

Solution Landscape

Centralized [Hedera, Planck, Fastpass, …]

Distributed

In-Network
  • Cong. Oblivious [ECMP, WCMP, packet-spray, …]
  • Cong. Aware [Flare, TeXCP, CONGA, DeTail, HULA, …]

Host-Based
  • Cong. Oblivious [Presto]
  • Cong. Aware [MPTCP, FlowBender, …]

SLIDE 7

MPTCP

❖ Slides by Damon Wischik (with minor modifications)

SLIDE 8

What problem is MPTCP trying to solve? Multipath ‘pools’ links.

Two separate links = a pool of links

TCP controls how a link is shared. How should a pool be shared?

SLIDE 9

Application: Multihomed web server

Two 100Mb/s links: 2 TCPs @ 50Mb/s on one link, 4 TCPs @ 25Mb/s on the other

SLIDE 10

Application: Multihomed web server

Two 100Mb/s links: 2 TCPs @ 33Mb/s, 1 MPTCP @ 33Mb/s, 4 TCPs @ 25Mb/s

SLIDE 11

Application: Multihomed web server

Two 100Mb/s links: 2 TCPs @ 25Mb/s, 2 MPTCPs @ 25Mb/s, 4 TCPs @ 25Mb/s

The total capacity, 200Mb/s, is shared out evenly between all 8 flows.

SLIDE 12

Application: Multihomed web server

Two 100Mb/s links: 2 TCPs @ 22Mb/s, 3 MPTCPs @ 22Mb/s, 4 TCPs @ 22Mb/s

The total capacity, 200Mb/s, is shared out evenly between all 9 flows. It’s as if they were all sharing a single 200Mb/s link. The two links can be said to form a 200Mb/s pool.

SLIDE 13

Application: Multihomed web server

Two 100Mb/s links: 2 TCPs @ 20Mb/s, 4 MPTCPs @ 20Mb/s, 4 TCPs @ 20Mb/s

The total capacity, 200Mb/s, is shared out evenly between all 10 flows. It’s as if they were all sharing a single 200Mb/s link. The two links can be said to form a 200Mb/s pool.
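The per-flow rates on the last few slides are just this pooled capacity divided by the flow count: with the two 100Mb/s links acting as one 200Mb/s pool, n flows each get 200/n Mb/s, i.e. 200/8 = 25, 200/9 ≈ 22, and 200/10 = 20 Mb/s.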

SLIDE 14

Application: WiFi & cellular together

How should your phone balance its traffic across very different paths?

WiFi path: high loss, small RTT
3G path: low loss, high RTT

SLIDE 15

Application: Datacenters

Can we make the network behave like a large pool of capacity?

SLIDE 16

MPTCP is a general-purpose multipath replacement for TCP.

SLIDE 17

What is the MPTCP protocol?

MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.

[Figure: user space / socket API / MPTCP / IP in the host stack; the sender has addr1 and addr2, the receiver has a single addr]

The sender stripes packets across paths. The receiver puts the packets in the correct order.
SLIDE 18

What is the MPTCP protocol?

MPTCP is a replacement for TCP which lets you use multiple paths simultaneously.

[Figure: user space / socket API / MPTCP / IP in the host stack; sender and receiver each have a single addr, connected through a switch with port-based routing on ports p1 and p2]

The sender stripes packets across paths. The receiver puts the packets in the correct order.

SLIDE 19

Design goal 1: Multipath TCP should be fair to regular TCP at shared bottlenecks

To be fair, Multipath TCP should take as much capacity as TCP at a bottleneck link, no matter how many paths it is using.

Strawman solution: run “½ TCP” on each path

[Figure: a multipath TCP flow with two subflows sharing a bottleneck with a regular TCP flow]

SLIDE 20

Design goal 2: MPTCP should use efficient paths

[Figure: triangle of three 12Mb/s links, with one flow between each pair of nodes]

Each flow has a choice of a 1-hop and a 2-hop path. How should it split its traffic?

SLIDE 21

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic 1:1 ... each flow gets 8Mb/s (links are 12Mb/s each)

SLIDE 22

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic 2:1 ... each flow gets 9Mb/s (links are 12Mb/s each)

SLIDE 23

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic 4:1 ... each flow gets 10Mb/s (links are 12Mb/s each)

SLIDE 24

Design goal 2: MPTCP should use efficient paths

If each flow split its traffic ∞:1 ... each flow gets 12Mb/s (each 12Mb/s link fully used by its 1-hop flow)
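The per-flow rates on slides 21-24 follow from a short symmetry argument (assuming the figure is the usual triangle of three 12Mb/s links with one flow between each pair of nodes). If every flow sends $x$ on its 1-hop path and $y$ on its 2-hop path, each link carries the 1-hop traffic of one flow plus the 2-hop traffic of the two others, so $x + 2y \le 12$ Mb/s per link and the per-flow throughput is $x + y$:

  • 1:1 split ($x = y$): $3y \le 12$, so each flow gets 8 Mb/s
  • 2:1 split ($x = 2y$): $4y \le 12$, so each flow gets 9 Mb/s
  • 4:1 split ($x = 4y$): $6y \le 12$, so each flow gets 10 Mb/s
  • ∞:1 split ($y = 0$): each flow gets the full 12 Mb/s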

SLIDE 25

Design goal 2: MPTCP should use efficient paths

Theoretical solution (Kelly+Voice 2005; Han, Towsley et al. 2006): MPTCP should send all its traffic on its least-congested paths.

  • Theorem. This will lead to the most efficient allocation possible, given a network topology and a set of available paths.

SLIDE 26

Design goal 3: MPTCP should be fair compared to TCP

Design Goal 2 says to send all your traffic on the least congested path, in this case 3G. But this has high RTT, hence it will give low throughput.

[Figure: WiFi path: high loss, small RTT; 3G path: low loss, high RTT]

Goal 3a. A Multipath TCP user should get at least as much throughput as a single-path TCP would on the best of the available paths.
Goal 3b. A Multipath TCP flow should take no more capacity on any link than a single-path TCP would.

SLIDE 27

Design goals

Goal 1. Be fair to TCP at bottleneck links
Goal 2. Use efficient paths ...
Goal 3. ... as much as we can, while being fair to TCP
Goal 4. Adapt quickly when congestion changes
Goal 5. Don’t oscillate

How does MPTCP achieve all this?

Read: “Design, implementation, and evaluation of congestion control for multipath TCP”, NSDI 2011

SLIDE 28

How does TCP congestion control work?

Maintain a congestion window w.

  • Increase w for each ACK, by 1/w
  • Decrease w for each drop, by w/2
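As a minimal sketch of the two rules above (congestion avoidance only; slow start, timeouts, and fast retransmit are omitted):

    class TcpCwnd:
        def __init__(self, init_window=10.0):
            self.w = init_window             # congestion window, in packets

        def on_ack(self):
            self.w += 1.0 / self.w           # +1/w per ACK, i.e. +1 packet per RTT

        def on_drop(self):
            self.w = max(self.w / 2.0, 1.0)  # halve the window on a drop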


SLIDE 29

How does MPTCP congestion control work?

Maintain a congestion window wr, one window for each path, where r ∊ R ranges over the set of available paths.

  • Increase wr for each ACK on path r, by a coupled amount (see the sketch below)
  • Decrease wr for each drop on path r, by wr/2
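The coupled ("linked increases") rule comes from the NSDI 2011 paper cited on slide 27: per ACK on path r, increase wr by min(alpha/w_total, 1/wr), so no subflow is more aggressive than plain TCP, while the alpha term couples the subflows. The sketch below is an illustration of that rule under those assumptions, not the kernel implementation:

    class Subflow:
        def __init__(self, rtt, init_window=10.0):
            self.w = init_window      # congestion window of this subflow (packets)
            self.rtt = rtt            # round-trip time of this subflow (seconds)

    class MptcpCoupled:
        def __init__(self, subflows):
            self.subflows = subflows  # one Subflow per available path r

        def _alpha(self):
            # alpha = w_total * max_r(w_r / rtt_r^2) / (sum_r w_r / rtt_r)^2
            w_total = sum(s.w for s in self.subflows)
            best = max(s.w / s.rtt ** 2 for s in self.subflows)
            return w_total * best / sum(s.w / s.rtt for s in self.subflows) ** 2

        def on_ack(self, r):
            # Increase w_r per ACK by min(alpha / w_total, 1 / w_r).
            w_total = sum(s.w for s in self.subflows)
            r.w += min(self._alpha() / w_total, 1.0 / r.w)

        def on_drop(self, r):
            # Decrease w_r per drop by w_r / 2, as in single-path TCP.
            r.w = max(r.w / 2.0, 1.0)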

SLIDE 30

Discussion

SLIDE 31

What You Said

Ravi: “An interesting point in the MPTCP paper is that they target a 'sweet spot' where there is a fair amount of traffic but the core is neither overloaded nor underloaded.”

SLIDE 32

What You Said

Hongzi: “The paper talked a bit of `probing’ to see if a link has high load and pick some other links, and there are some specific ways of assigning randomized assignment of subflows on links. I was wondering does the `power of 2 choices’ have some roles to play here?”


SLIDE 33

MPTCP discovers available capacity, and it doesn’t need much path choice.

If each node-pair balances its traffic over 8 paths, chosen at random, then utilization is around 90% of optimal.

[Figure: throughput (% of optimal) vs. num. paths, for FatTree with 128 nodes and 8192 nodes]

Simulations of FatTree, 100Mb/s links, permutation traffic matrix, one flow per host, TCP+ECMP versus MPTCP.
SLIDE 34

MPTCP discovers available capacity, and it shares it out more fairly than TCP+ECMP.

[Figure: throughput (% of optimal) vs. flow rank, for FatTree with 128 nodes and 8192 nodes]

Simulations of FatTree, 100Mb/s links, permutation traffic matrix, one flow per host, TCP+ECMP versus MPTCP.
SLIDE 35

MPTCP can make good path choices, as good as a very fast centralized scheduler.

Simulation of FatTree with 128 hosts:
  • Permutation traffic matrix
  • Closed-loop flow arrivals (one flow finishes, another starts)
  • Flow size distributions from VL2 dataset

[Figure: throughput (% of optimal), Hedera first-fit heuristic vs. MPTCP]

SLIDE 36

MPTCP permits flexible topologies

Because an MPTCP flow shifts its traffic onto its least congested paths, congestion hotspots are made to “diffuse” throughout the network. Non-adaptive congestion control, on the other hand, does not cope well with non-homogeneous topologies.

[Figure: average throughput (% of optimal) vs. rank of flow; simulation of 128-node FatTree when one of the 1Gb/s core links is cut to 100Mb/s]

SLIDE 37

MPTCP permits flexible topologies

  • At low loads, there are few collisions, and NICs are saturated, so TCP ≈ MPTCP
  • At high loads, the core is severely congested, and TCP can fully exploit all the core links, so TCP ≈ MPTCP
  • When the core is “right-provisioned”, i.e. just saturated, MPTCP > TCP

[Figure: ratio of throughputs, MPTCP/TCP, vs. connections per host]

Simulation of a FatTree-like topology with 512 nodes, but with 4 hosts for every up-link from a top-of-rack switch, i.e. the core is oversubscribed 4:1.
  • Permutation TM: each host sends to one other, each host receives from one other
  • Random TM: each host sends to one other, each host may receive from any number

SLIDE 38

MPTCP permits flexible topologies

If only 50% of hosts are active, you’d like each host to be able to send at 2Gb/s, faster than one NIC can support.

[Figure: FatTree (5 ports per host in total, 1Gb/s bisection bandwidth) vs. dual-homed FatTree (5 ports per host in total, 1Gb/s bisection bandwidth), 1Gb/s links]

SLIDE 39

Presto

❖ Adapted from slides by Keqiang He (Wisconsin)

SLIDE 40

Solution Landscape

Centralized [Hedera, Planck, Fastpass, …]

Distributed

In-Network
  • Cong. Oblivious [ECMP, WCMP, packet-spray, …]
  • Cong. Aware [Flare, TeXCP, CONGA, HULA, DeTail, …]

Host-Based
  • Cong. Oblivious [Presto]
  • Cong. Aware [MPTCP, FlowBender, …]

Is congestion-aware load balancing overkill for datacenters?

SLIDE 41

We’ve already seen a congestion-oblivious load balancing scheme: ECMP

Presto is fine-grained congestion-oblivious load balancing. Key challenge is making this practical.

SLIDE 42

Key Design Decisions

Use software edge
  • No changes to transport (e.g., inside VMs) or switches

LB granularity: flowcells [e.g. 64KB of data]
  • Works with TSO hardware offload
  • No reordering for mice
  • Makes dealing with reordering simpler (Why?)

“Fix” reordering at GRO layer
  • Avoid high per-packet processing (esp. at 10Gb/s and above)

End-to-end path control

SLIDE 43

Presto at a High Level

[Figure: sender TCP/IP → vSwitch → NIC, across a Leaf/Spine network, to receiver NIC → vSwitch → TCP/IP]

Near uniform-sized data units

SLIDE 44

Presto at a High Level

[Figure: same sender/receiver path across the Leaf/Spine network]

Near uniform-sized data units, proactively distributed evenly over the symmetric network by the vSwitch sender

SLIDE 45

Presto at a High Level

[Figure: same sender/receiver path across the Leaf/Spine network]

Near uniform-sized data units, proactively distributed evenly over the symmetric network by the vSwitch sender

SLIDE 46

Presto at a High Level

[Figure: same sender/receiver path across the Leaf/Spine network]

Near uniform-sized data units, proactively distributed evenly over the symmetric network by the vSwitch sender. Receiver masks packet reordering due to multipathing below the transport layer.

SLIDE 47

Discussion

SLIDE 48

What You Said

Arman: “From an (information) theoretic perspective, order should not be such a troubling phenomenon, yet in real networks ordering is so important. How practical are “rateless codes” (network coding, raptor codes, etc.) in alleviating this problem?”

Amy: “The main complexities in the paper stem from the requirement that servers use TSO and GRO to achieve high throughput. It is surprising to me that people still rely so heavily on TSO and GRO. Why doesn't someone build a multi-core TCP stack that can process individual packets at line rate in software?”

SLIDE 49

Presto LB Granularity

Presto: load-balance on flowcells. What is a flowcell?
  – A set of TCP segments with bounded byte count
  – Bound is maximal TCP Segmentation Offload (TSO) size
      • Maximize the benefit of TSO for high speed
      • 64KB in implementation

What’s TSO?

[Figure: TCP/IP hands a large segment to the NIC; the NIC does segmentation & checksum offload into MTU-sized Ethernet frames]

SLIDE 50

Presto LB Granularity

Presto: load-balance on flowcells. What is a flowcell?
  – A set of TCP segments with bounded byte count
  – Bound is maximal TCP Segmentation Offload (TSO) size
      • Maximize the benefit of TSO for high speed
      • 64KB in implementation

Example: starting from the first TCP segment, segments of 25KB and 30KB are grouped into one 55KB flowcell; the next 30KB segment starts a new flowcell.
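A sketch of this flowcell grouping (illustrative, not the Presto vSwitch code; the 64KB bound is the one the slides give): segments accumulate into the current flowcell until the next segment would exceed the bound, which then starts a new cell.

    FLOWCELL_LIMIT = 64 * 1024          # bytes: maximal TSO size, per the slides

    def group_into_flowcells(segment_sizes, limit=FLOWCELL_LIMIT):
        # segment_sizes: TCP segment sizes in bytes, in send order.
        # Returns a list of flowcells, each a list of segment sizes.
        cells, current, used = [], [], 0
        for size in segment_sizes:
            if current and used + size > limit:   # next segment would exceed the bound
                cells.append(current)             # close the current flowcell
                current, used = [], 0
            current.append(size)
            used += size
        if current:
            cells.append(current)
        return cells

    # The slide's example: 25KB + 30KB form one 55KB flowcell; the next 30KB
    # segment starts a new flowcell.
    print(group_into_flowcells([25 * 1024, 30 * 1024, 30 * 1024]))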

SLIDE 51

Intro to GRO

Generic Receive Offload (GRO)
  – The reverse process of TSO

SLIDE 52

Intro to GRO

[Figure: receive path NIC (hardware) → GRO → TCP/IP (OS)]

SLIDE 53

Intro to GRO

[Figure: NIC delivers MTU-sized packets to GRO; P1 is at the queue head, with P2 P3 P4 P5 behind it]

SLIDE 54

Intro to GRO

[Figure: GRO begins merging at the queue head (P1), with P2 P3 P4 P5 queued behind]

SLIDE 55

Intro to GRO

[Figure: GRO continues merging at the queue head (P1), with P2 P3 P4 P5 queued behind]

SLIDE 56

Intro to GRO

[Figure: GRO has merged P1 – P2; P3 P4 P5 still queued]

SLIDE 57

Intro to GRO

[Figure: GRO has merged P1 – P3; P4 P5 still queued]

SLIDE 58

Intro to GRO

[Figure: GRO has merged P1 – P4; P5 still queued]

SLIDE 59

Intro to GRO

[Figure: GRO has merged P1 – P5 into one large segment]

Push-up: large TCP segments are pushed up at the end of a batched IO event (i.e., a polling event)

SLIDE 60

Intro to GRO

[Figure: the merged segment P1 – P5 is pushed up to TCP/IP]

Merging packets in GRO creates fewer segments & avoids using substantially more cycles at TCP/IP and above [Menon, ATC’08]. If GRO is disabled, ~6Gbps with 100% CPU usage of one core.
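A simplified sketch of the merging shown on the previous slides (illustrative; the real Linux GRO also matches on the 5-tuple, checks TCP flags, handles timeouts, etc.): consecutive in-order MTU-sized packets of a flow are coalesced into one large segment and pushed up at the end of the polling batch.

    MTU = 1500

    def gro_batch(packets):
        # packets: list of (seq, length) in arrival order for one flow.
        # Returns the large segments pushed up to TCP/IP as (start_seq, total_len).
        segments, start, length = [], None, 0
        for seq, plen in packets:
            if start is not None and seq == start + length:
                length += plen                    # in order: merge into the current segment
            else:
                if start is not None:
                    segments.append((start, length))
                start, length = seq, plen         # start a new segment
        if start is not None:
            segments.append((start, length))      # end of the batched IO event: push up
        return segments

    # In-order arrival of P1..P5 merges into a single large segment:
    print(gro_batch([(i * MTU, MTU) for i in range(5)]))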

SLIDE 61

Reordering Challenges

[Figure: NIC delivers out-of-order packets P1 P2 P3 P6 P4 P7 P5 P8 P9 to GRO]

SLIDE 62

Reordering Challenges

[Figure: same out-of-order arrival, P1 P2 P3 P6 P4 P7 P5 P8 P9]

SLIDE 63

Reordering Challenges

[Figure: GRO merges P1 – P2; remaining queue P3 P6 P4 P7 P5 P8 P9]

SLIDE 64

Reordering Challenges

[Figure: GRO merges P1 – P3; remaining queue P6 P4 P7 P5 P8 P9]

SLIDE 65

Reordering Challenges

[Figure: GRO has merged P1 – P3; the next packet is P6, a gap in sequence numbers]

GRO is designed to be fast and simple; it pushes up the existing segment immediately when 1) there is a gap in sequence number, 2) MSS is reached, or 3) a timeout fires
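Applying those three push-up rules to this example (an illustrative sketch, not the Linux code; rule 3 is approximated by flushing at the end of the batch) shows why reordering defeats GRO: every sequence gap flushes the segment being built, so mostly one-packet segments reach TCP/IP.

    MTU = 1500
    MSS_LIMIT = 64 * 1024               # rule 2: flush when the merged segment reaches this size

    def gro_with_gaps(arrival_order):
        # arrival_order: packet indices (1-based) in arrival order, each MTU bytes.
        # Returns the segments pushed up to TCP/IP, as lists of packet indices.
        pushed, current = [], []
        for p in arrival_order:
            in_order = bool(current) and p == current[-1] + 1
            if current and (not in_order or len(current) * MTU >= MSS_LIMIT):
                pushed.append(current)  # rule 1 (sequence gap) or rule 2 (MSS reached)
                current = []
            current.append(p)
        if current:
            pushed.append(current)      # stand-in for rule 3 (timeout / end of batch)
        return pushed

    # The slide's arrival order: instead of one large segment, many tiny ones
    # are pushed up, so GRO is effectively disabled.
    print(gro_with_gaps([1, 2, 3, 6, 4, 7, 5, 8, 9]))
    # -> [[1, 2, 3], [6], [4], [7], [5], [8, 9]]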

SLIDE 66

Reordering Challenges

[Figure: queue state P1 – P3, P6, P4, P7, P5, P8, P9]

SLIDE 67

Reordering Challenges

[Figure: queue state P1 – P3, P6, P4, P7, P5, P8, P9]

SLIDE 68

Reordering Challenges

[Figure: queue state P1 – P3, P6, P4, P7, P5, P8, P9]

SLIDE 69

Reordering Challenges

[Figure: queue state P1 – P3, P6, P4, P7, P5, P8, P9]

SLIDE 70

Reordering Challenges

[Figure: queue state P1 – P3, P6, P4, P7, P5, P8, P9]

SLIDE 71

Reordering Challenges

[Figure: queue state P1 – P3, P6, P4, P7, P5, P8 – P9]

SLIDE 72

Reordering Challenges

[Figure: segments P1 – P3, P6, P4, P7, P5, P8 – P9 pushed up to TCP/IP]

SLIDE 73

Reordering Challenges

GRO is effectively disabled: lots of small packets are pushed up to TCP/IP

Huge CPU processing overhead. Poor TCP performance due to massive reordering.

SLIDE 74

Handling Asymmetry

[Figure: symmetric topology, all links 40G]

Handling asymmetry optimally needs traffic awareness

SLIDE 75

Handling Asymmetry

[Figure: one link degraded to 30G while the rest are 40G; offered load 30G (UDP) and 40G (TCP); labels show a 35G / 5G split]

Handling asymmetry optimally needs traffic awareness

SLIDE 76