NUMFabric: Fast and Flexible Bandwidth Allocation in Datacenters


NUMFabric: Fast and Flexible Bandwidth Allocation in Datacenters. Kanthi Nagaraj (Stanford), Dinesh Bharadia (M.I.T.), Mohammad Alizadeh (M.I.T.), Hongzi Mao (M.I.T.), Sandeep Chinchali (Stanford) and Sachin Katti (Stanford). SIGCOMM 2016.


slide-1
SLIDE 1

NUMFabric: Fast and Flexible Bandwidth Allocation in Datacenters

Kanthi Nagaraj (Stanford), Dinesh Bharadia (M.I.T.), Mohammad Alizadeh (M.I.T.), Hongzi Mao (M.I.T.), Sandeep Chinchali (Stanford) and Sachin Katti (Stanford)

SIGCOMM 2016

slide-2
SLIDE 2

Datacenter fabric proposals

slide-3
SLIDE 3

Which one does the operator pick?

slide-4
SLIDE 4

Is there a single fabric that provides flexible and fast bandwidth allocation control?

Yes! NUMFabric provides a flexible fabric that is also fast.

slide-5
SLIDE 5

Flexible and Fast

Flexible

  • Supports a wide variety of bandwidth allocation objectives

Fast

  • Flows converge to correct rates before the datacenter workload changes

slide-6
SLIDE 6

NUMFabric: Flexibility

Application-level objective → translate to utility functions → send utility functions to hosts

  • Minimize average flow completion time: maximize $\sum_i x_i / s_i$
  • Weighted proportional fairness: maximize $\sum_i w_i \log(x_i)$
  • Resource pooling: maximize $\sum_i w_i \log(y_i)$, where $y_i$ = aggregate rate of flow $i$ across all subpaths

Each summand is flow $i$'s utility at rate $x_i$.

$x_i$ ← rate of flow $i$, $s_i$ ← size of flow $i$, $w_i$ ← weight of flow $i$
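As a rough illustration of how each objective becomes a sum of per-flow utility terms, the sketch below evaluates the three formulas from this slide for a given allocation. The function names and the example numbers are hypothetical and not taken from the talk.

```python
import math

# Hypothetical helper names; the slide only gives the formulas.

def fct_objective(rates, sizes):
    """Minimize average FCT: sum_i x_i / s_i."""
    return sum(x / s for x, s in zip(rates, sizes))

def weighted_prop_fair_objective(rates, weights):
    """Weighted proportional fairness: sum_i w_i * log(x_i)."""
    return sum(w * math.log(x) for x, w in zip(rates, weights))

def resource_pooling_objective(subpath_rates, weights):
    """Resource pooling: sum_i w_i * log(y_i), y_i = total rate over subpaths."""
    return sum(w * math.log(sum(paths)) for paths, w in zip(subpath_rates, weights))

# Two flows sharing 10 units of capacity; flow 0 is short (size 1), flow 1 is long (size 4).
sizes = [1.0, 4.0]
favor_short = [9.0, 1.0]
favor_long = [1.0, 9.0]

fct_short = fct_objective(favor_short, sizes)   # 9/1 + 1/4
fct_long = fct_objective(favor_long, sizes)     # 1/1 + 9/4

# With equal weights, proportional fairness prefers the even split.
pf_even = weighted_prop_fair_objective([5.0, 5.0], [1.0, 1.0])
pf_skewed = weighted_prop_fair_objective(favor_short, [1.0, 1.0])

# Resource pooling only sees each flow's aggregate rate over its subpaths.
rp_split = resource_pooling_objective([[2.0, 3.0], [5.0]], [1.0, 1.0])
```

In this toy case the FCT objective scores the allocation that favors the short flow higher, while the proportional-fairness objective scores the even split higher.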

slide-7
SLIDE 7

Network Utility Maximization in general

maximize $U(X) = \sum_i U_i(x_i)$

subject to $AX \le C$, $X \ge 0$

Problem? Existing NUM solutions are slow and unsuitable for datacenter workloads.
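To make the formulation concrete, here is a minimal sketch of one NUM instance. The three-flow, two-link topology, the unit capacities, and the choice of $U_i = \log$ are all assumptions made for illustration; the slide states only the general problem.

```python
import math

# Routing matrix A: A[l][i] = 1 if flow i uses link l (hypothetical topology).
# Flow 0 crosses both links; flows 1 and 2 each use one link.
A = [[1, 1, 0],
     [1, 0, 1]]
C = [1.0, 1.0]  # link capacities (assumed values)

def link_loads(A, X):
    """Compute AX, the total rate on each link."""
    return [sum(a * x for a, x in zip(row, X)) for row in A]

def is_feasible(A, X, C, tol=1e-9):
    """Check AX <= C and X >= 0."""
    return all(x >= 0 for x in X) and all(
        load <= c + tol for load, c in zip(link_loads(A, X), C))

def total_utility(X, utility=math.log):
    """U(X) = sum_i U_i(x_i); every flow uses the same U_i here."""
    return sum(utility(x) for x in X)

equal_split = [0.5, 0.5, 0.5]
prop_fair = [1 / 3, 2 / 3, 2 / 3]   # optimum for log utilities on this topology
infeasible = [0.6, 0.6, 0.6]        # overloads both links
```

Both `equal_split` and `prop_fair` satisfy the constraints, but `prop_fair` attains the higher total utility.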

slide-8
SLIDE 8

Existing distributed NUM solutions

  • Each source iteratively adjusts its rate, following its own gradient towards the optimum
  • The sum of the rates moves towards the global optimum

[Diagram: hosts H1 to H9 connected through the network. Sources send traffic; the network sends congestion signals; each source sets its rate based on the gradient of its utility function and the network feedback.]

[Plot: flow rates vs. iterations]

slide-9
SLIDE 9

Gradient based methods

[Plots: normalized rates vs. iterations, with link capacity marked. Two panels: "Larger steps to optimal" and "Smaller steps to optimal".]

Overshooting might cause bloated queues and packet drops.
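The step-size trade-off can be reproduced with a toy model. The sketch below assumes a single link of capacity 1 shared by two flows with $U_i(x) = \log(x)$ and a price-based gradient iteration. The step sizes, the starting price, and the 1% settling tolerance are arbitrary choices, not values from the talk.

```python
def total_rate_trace(alpha, n_flows=2, capacity=1.0, p0=4.0, iters=60):
    """Price-based gradient iteration on one link with U_i(x) = log(x).

    Returns the total sending rate at every iteration.
    """
    price = p0
    totals = []
    for _ in range(iters):
        rate = 1.0 / price              # each source solves U_i'(x) = price
        total = n_flows * rate
        totals.append(total)
        # Gradient step on the price; the floor keeps the price positive.
        price = max(price + alpha * (total - capacity), 1e-6)
    return totals

def iterations_to_settle(totals, capacity=1.0, tol=0.01):
    """First iteration from which the total rate stays within tol of capacity."""
    for k in range(len(totals)):
        if all(abs(t - capacity) <= tol * capacity for t in totals[k:]):
            return k
    return None

large = total_rate_trace(alpha=3.0)   # large steps
small = total_rate_trace(alpha=0.5)   # small steps
```

With the large step the total rate settles in fewer iterations but exceeds capacity on the way. With the small step it never exceeds capacity but takes longer to settle.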

slide-10
SLIDE 10

How can we fix this?

Can we take larger steps to the optimum without overshooting or under-utilization? Overshooting might cause packet drops and bloated queues.

[Plots: normalized rates vs. iterations, with link capacity marked. Both panels are titled "Larger steps to optimal".]

Use weights instead of rates! Setting the weights of the flows and letting the fabric allocate rates proportional to those weights enables exactly this.

slide-11
SLIDE 11

NUMFabric key idea

  • In NUMFabric, sources give up direct control over rates
  • The sources specify "weights", and the Weighted Max-Min fabric allocates relative rates proportional to the weights of all flows

[Plot: setting weights to control rates, with link capacity marked]
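For reference, a weighted max-min allocation can be computed centrally by progressive filling. The sketch below uses that textbook procedure and is not necessarily how the fabric in the talk realizes it; the function name and the example topologies are hypothetical.

```python
def weighted_max_min(routes, weights, capacity):
    """Weighted max-min rates by progressive filling.

    routes[i]   -- list of link ids used by flow i (each flow uses >= 1 link)
    weights[i]  -- weight of flow i
    capacity[l] -- capacity of link l
    """
    remaining = list(capacity)
    rates = [0.0] * len(routes)
    active = set(range(len(routes)))
    while active:
        # Capacity per unit of weight on every link that still has active flows.
        share = {}
        for l, cap in enumerate(remaining):
            wsum = sum(weights[i] for i in active if l in routes[i])
            if wsum > 0:
                share[l] = cap / wsum
        bottleneck = min(share, key=share.get)
        # Freeze the flows on the tightest link at weight * share.
        for i in [i for i in active if bottleneck in routes[i]]:
            rates[i] = weights[i] * share[bottleneck]
            for l in routes[i]:
                remaining[l] -= rates[i]
            active.remove(i)
    return rates

# One link of capacity 1 shared by flows with weights 1 and 3.
single_link = weighted_max_min([[0], [0]], [1.0, 3.0], [1.0])

# Flow 0 crosses links 0 and 1; flows 1 and 2 use one link each.
two_links = weighted_max_min([[0, 1], [0], [1]], [1.0, 2.0, 2.0], [1.0, 1.0])
```

On the single link the rates come out in the ratio 1:3 and the link is fully used, whatever the absolute values of the weights.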

slide-12
SLIDE 12

Application-level objective → translate to utility functions

  • Flexible: layer that sets the weights of flows based on network feedback
  • Fast: layer that realizes rates proportional to the weights of the flows (Weighted Max-Min rate allocation according to the weights)

Exchanged between the two layers: weights and network feedback.

slide-13
SLIDE 13

Weight inference

slide-14
SLIDE 14

Distributed NUM mechanism

NUM objective: maximize $\sum_i U_i(x_i)$

KKT conditions: equations that must necessarily hold at the optimal solution

  • $U_i'(x_i) = \sum_{l \in L(i)} p_l$: at the optimum, the marginal utility of the source equals the sum of the prices along the path of the flow
  • $p_l \left( \sum_{i \in S(l)} x_i - c_l \right) = 0$: at the optimum, either the link is fully utilized or the price of the link is zero

Here $L(i)$ is the set of links on the path of flow $i$, $S(l)$ is the set of flows crossing link $l$, and $c_l$ is the capacity of link $l$.

Price of a link ($p_l$): variable that indicates the congestion level at the switch.
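A small numerical check can make the two conditions tangible. The sketch below assumes $U_i(x) = \log(x)$ and a hypothetical three-flow, two-link topology with unit capacities. The rates and prices it checks were worked out by hand for that example and are not results from the talk.

```python
def kkt_residuals(routes, rates, prices, capacity, marginal_utility):
    """Return (stationarity, slackness) residuals; both are all-zero at the optimum.

    stationarity[i] = U_i'(x_i) - sum of prices on flow i's path
    slackness[l]    = p_l * (load on link l - c_l)
    """
    stationarity = [
        marginal_utility(x) - sum(prices[l] for l in routes[i])
        for i, x in enumerate(rates)
    ]
    slackness = []
    for l, cap in enumerate(capacity):
        load = sum(x for i, x in enumerate(rates) if l in routes[i])
        slackness.append(prices[l] * (load - cap))
    return stationarity, slackness

# Hypothetical example: flow 0 crosses links 0 and 1; flows 1 and 2 use one link each.
routes = [[0, 1], [0], [1]]
capacity = [1.0, 1.0]

def log_marginal(x):
    """U_i(x) = log(x), so U_i'(x) = 1/x."""
    return 1.0 / x

optimal = kkt_residuals(routes, [1/3, 2/3, 2/3], [1.5, 1.5], capacity, log_marginal)
not_optimal = kkt_residuals(routes, [0.5, 0.5, 0.5], [1.5, 1.5], capacity, log_marginal)
```

At the hand-computed optimum every residual vanishes. The equal split keeps both links full, so the slackness residuals are zero, but it violates the marginal-utility condition.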

slide-15
SLIDE 15

Distributed NUM mechanism

KKT conditions:

  • $U_i'(x_i) = \sum_{l \in L(i)} p_l$
  • $p_l \left( \sum_{i \in S(l)} x_i - c_l \right) = 0$

Solving each condition gives one update rule:

  • Sources set the rates of the flows using price feedback: $x_i = (U_i')^{-1}\left( \sum_{l \in L(i)} p_l \right)$, the inverse of $U_i'$ applied to the path price
  • Switches set their prices by measuring congestion: $p_l \leftarrow p_l + \alpha \left( \sum_{i \in S(l)} x_i - c_l \right)$, with step size $\alpha$

[Diagram: hosts H1 to H9. The network sends congestion signals $\sum_{l \in L(i)} p_l$; sources set and adapt the rates of their flows.]
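The two update rules above can be sketched as a loop. This is a minimal illustration that assumes $U_i(x) = \log(x)$ for every flow, a hypothetical three-flow, two-link topology, and an arbitrary step size. The positive floor on prices is a guard added here and does not appear on the slide.

```python
def dual_gradient_descent(routes, capacity, alpha=0.5, iters=200):
    """Price-based rate control with U_i(x) = log(x) for every flow."""
    prices = [1.0] * len(capacity)          # arbitrary positive start
    rates = [0.0] * len(routes)
    for _ in range(iters):
        # Sources: x_i = (U_i')^{-1}(path price) = 1 / path price for log utility.
        rates = [1.0 / sum(prices[l] for l in routes[i])
                 for i in range(len(routes))]
        # Switches: p_l <- p_l + alpha * (load_l - c_l), floored to stay positive.
        for l, cap in enumerate(capacity):
            load = sum(x for i, x in enumerate(rates) if l in routes[i])
            prices[l] = max(prices[l] + alpha * (load - cap), 1e-6)
    return rates, prices

# Flow 0 crosses links 0 and 1; flows 1 and 2 use one link each; unit capacities.
rates, prices = dual_gradient_descent([[0, 1], [0], [1]], [1.0, 1.0])
```

For this example the loop approaches rates of 1/3, 2/3 and 2/3 with both prices at 1.5, which satisfy both KKT conditions.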

slide-16
SLIDE 16

NUMFabric iterations

KKT conditions:

  • $U_i'(x_i) = \sum_{l \in L(i)} p_l$
  • $p_l \left( \sum_{i \in S(l)} x_i - c_l \right) = 0$

  • ✖ Controlling rates directly causes the brittleness in the existing solutions
  • Sources set weights instead, $w_i = (U_i')^{-1}\left( \sum_{l \in L(i)} p_l \right)$, and the WMM layer converts these weights to rates
  • ✔ The WMM layer always achieves 100% link utilization

Switches adapt prices at every iteration so that the flow rates move closer to the optimum.

[Diagram: hosts H1 to H9]

slide-17
SLIDE 17

NUMFabric iterations

  • ✖ As we know, controlling rates directly causes the brittleness in the existing solutions
  • ✔ Switches adapt prices at every iteration so that the flow rates move closer to the optimum:

$\text{residue}_i = U_i'(x_i) - \sum_{l \in L(i)} p_l$

$p_l \leftarrow p_l + \min_{i \in S(l)} \dfrac{\text{residue}_i}{\text{hops traversed by flow } i}$

[Diagram: hosts H1 to H9, with residues labeled on the paths]

KKT conditions:

  • ✔ $p_l \left( \sum_{i \in S(l)} x_i - c_l \right) = 0$
  • $U_i'(x_i) = \sum_{l \in L(i)} p_l$
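Putting the pieces of this slide together, one iteration sets weights from prices, lets a weighted max-min layer turn them into rates, and then moves each price by the smallest per-hop residue of the flows on that link. The sketch below is a simplified reading of the slide, not the authors' implementation. It assumes $U_i(x) = \log(x)$, takes the min over the flows crossing each link, uses a centralized progressive-filling routine as a stand-in for the WMM layer, and adds a positive floor on prices as a guard.

```python
def weighted_max_min(routes, weights, capacity):
    """Weighted max-min rates by progressive filling (stand-in for the WMM layer)."""
    remaining = list(capacity)
    rates = [0.0] * len(routes)
    active = set(range(len(routes)))
    while active:
        share = {}
        for l, cap in enumerate(remaining):
            wsum = sum(weights[i] for i in active if l in routes[i])
            if wsum > 0:
                share[l] = cap / wsum
        bottleneck = min(share, key=share.get)
        for i in [i for i in active if bottleneck in routes[i]]:
            rates[i] = weights[i] * share[bottleneck]
            for l in routes[i]:
                remaining[l] -= rates[i]
            active.remove(i)
    return rates

def numfabric_iterations(routes, capacity, iters=200):
    """Weight/price iteration sketched from the slide, with U_i(x) = log(x)."""
    prices = [1.0] * len(capacity)           # arbitrary positive start
    rates = []
    for _ in range(iters):
        path_price = [sum(prices[l] for l in r) for r in routes]
        # Hosts: w_i = (U_i')^{-1}(path price) = 1 / path price for log utility.
        weights = [1.0 / pp for pp in path_price]
        # WMM layer: weights -> feasible rates.
        rates = weighted_max_min(routes, weights, capacity)
        # Residue per hop: (U_i'(x_i) - path price) / hops traversed by flow i.
        per_hop = [(1.0 / x - pp) / len(r)
                   for x, pp, r in zip(rates, path_price, routes)]
        # Switches: add the smallest per-hop residue among flows on the link.
        for l in range(len(capacity)):
            smallest = min(per_hop[i] for i, r in enumerate(routes) if l in r)
            prices[l] = max(prices[l] + smallest, 1e-6)   # floor added as a guard
    return rates, prices

# Flow 0 crosses links 0 and 1; flows 1 and 2 use one link each; capacities 1 and 2.
rates, prices = numfabric_iterations([[0, 1], [0], [1]], [1.0, 2.0])
```

For this example the proportional-fair optimum can be derived by hand as $x_0 = 1 - 1/\sqrt{3}$, $x_1 = 1/\sqrt{3}$, $x_2 = 1 + 1/\sqrt{3}$, and the iteration approaches it while both links stay fully utilized at every step.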

slide-18
SLIDE 18

Operation summary

  • Weighted Max-Min: feasible and stable rates for all flows based on weights (input: flow weights; output: rates)
  • Weight adaptation at hosts (inputs: path prices, rates; outputs: flow weights, residues)
  • Price adaptation at switches (input: residues; output: prices)

slide-19
SLIDE 19

Evaluation

slide-20
SLIDE 20

Evaluation setup

  • ns3 simulations: 128-port leaf-spine fabric
  • 8 racks, 10 Gbps edge links, 40 Gbps fabric links
  • RTT ≈ 16 µs
  • Evaluate speed of convergence
  • Evaluate flexibility
  • Compare the bandwidth allocations of NUMFabric under different utility functions against point solutions for different objectives (pFabric, MPTCP, etc.)

slide-21
SLIDE 21

Fast convergence

  • 100 flows start/stop at every "event"
  • We let the system converge before triggering another event
  • The median convergence time of NUMFabric (335 µs) is 2.3X better than that of the other algorithms

DGD: Dual Gradient Descent algorithm
RCP*: Alpha-Fair RCP

slide-22
SLIDE 22

Flexibility: minimize flow completion times

maximize $\sum_i x_i / s_i$

$x_i$ → rate of the flow
$s_i$ → size of the flow

slide-23
SLIDE 23

Flexibility: resource pooling

maximize $\sum_i \log(y_i)$

  • where $y_i$ = aggregate rate of flow $i$ across all sub-paths

slide-24
SLIDE 24

Conclusions

  • NUMFabric enables operators to flexibly optimize the network's bandwidth allocation for different objectives
  • NUMFabric uses weights as knobs to influence rates and thus decouples the objectives of finding optimal rates and stable rates. This makes it 2-3X faster than existing mechanisms.
  • Using NUMFabric with objective functions on co-flows, VM-level and tenant-level aggregates is the focus of our current and future work.

slide-25
SLIDE 25

Thank you