

SLIDE 1

Surviving Failures in Bandwidth-Constrained Datacenters

Peter Bodík2, Ishai Menache2, Mosharaf Chowdhury3, Pradeepkumar Mani1, Dave Maltz1, Ion Stoica3

Microsoft1, Microsoft Research2, UC Berkeley3

SLIDE 2

How to allocate services to physical machines?

Three important metrics, considered together:

– FT: service fault tolerance
– BW: bandwidth usage
– #M: number of machine moves to reach the target allocation

[Figure: services 1–3 allocated on a tree topology of network core, agg switches, and racks]

SLIDE 3

FT: Improving fault tolerance of software services

Complex fault domains: networking, power, cooling

Worst-case survival = fraction of the service still available during the single worst-case failure

– corresponds to service throughput during the failure

[Figure: fault domains across the network core, switches, racks, containers, and power distribution]

SLIDE 4

FT: Service allocation impacts worst-case survival

Worst-case survival:

– red service: 0% -- all machines in the same container and power domain
– green service: 67% -- machines in different containers and power domains

[Figure: the two allocations on the topology of network core, switches, racks, containers, and power distribution]
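To make the metric concrete, here is a minimal sketch of computing worst-case survival, assuming a fault domain is simply the set of machines that fail together (the helper name and the example data are made up for illustration):

```python
# Sketch: worst-case survival of a service, assuming each fault domain
# is the set of machines that fail together when that domain fails.
def worst_case_survival(service_machines, fault_domains):
    """Fraction of the service still running after the single worst failure."""
    total = len(service_machines)
    worst_loss = max(len(service_machines & d) for d in fault_domains)
    return (total - worst_loss) / total

# Example: a 3-machine service spread across three containers.
machines = {"m1", "m2", "m3"}
domains = [{"m1"}, {"m2"}, {"m3", "m9"}]  # m9 belongs to some other service
print(worst_case_survival(machines, domains))  # ~0.67, like the green service
```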

SLIDE 5

BW: Reduce bandwidth usage on constrained links

BW = bandwidth usage in the network core

Goal:

– reduce the cost of the infrastructure
– consider other constraints on service location
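As a rough illustration of the BW metric, here is a sketch under stated assumptions (a simple tree topology and a pairwise traffic matrix; not necessarily the paper's exact model):

```python
# Sketch: core BW = total traffic between machines that sit under
# different aggregation subtrees, so their flows must cross the core.
def core_bandwidth(traffic, subtree_of):
    """traffic: {(m1, m2): rate}; subtree_of: machine -> agg subtree id."""
    return sum(rate for (a, b), rate in traffic.items()
               if subtree_of[a] != subtree_of[b])

traffic = {("m1", "m2"): 10.0, ("m1", "m3"): 1.0}
subtree_of = {"m1": 0, "m2": 0, "m3": 1}
print(core_bandwidth(traffic, subtree_of))  # 1.0: only m1-m3 crosses the core
```

Placing heavy talkers under the same subtree keeps their traffic out of the core, which is why BW-only optimization tends to pack communicating services together.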

SLIDE 6

#M: Need incremental allocation algorithms

High cost of a machine move:

– potentially terabytes of data to deploy
– caches need to warm up
– can take tens of minutes and itself loads the network

SLIDE 7

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 8

Trade-off between bandwidth usage and fault tolerance

[Figure: two allocations on a core/agg/racks topology. Optimizing for bandwidth: core utilization LOW, fault tolerance (for agg switches) LOW, worst-case survival 0. Optimizing for fault tolerance: core utilization HIGH, fault tolerance HIGH, worst-case survival 0.5.]

SLIDE 9

Optimizing for one metric degrades the other

Results from 6 Microsoft datacenters

[Figure: reduction in BW usage (x-axis) vs. change in average worst-case survival (y-axis), showing the initial allocation, allocations optimizing only worst-case survival, allocations optimizing only core bandwidth, and the goal region that improves both]

SLIDE 10

FT-only and BW-only are both NP-hard and hard to approximate

FT reduces to maximum independent set; BW reduces to min-cut in a graph

– considered previously in [Meng et al., INFOCOM'10]

Most algorithms are not incremental and ignore #M

SLIDE 11

Key insights

Improve FT using convex optimization

– local optimization leads to good solutions

Symmetry in the optimization space

– machines, racks, and containers are interchangeable

Communication pattern is very skewed

– low-talkers can be spread without affecting BW

SLIDE 12

Results preview

[Figure: same axes as Slide 9 (reduction in BW usage vs. change in average worst-case survival), with the goal region highlighted]

SLIDE 13

Results preview

[Figure: same plot as Slide 12]

SLIDE 14

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 15

Service communication matrix is very sparse and skewed

[Figure: communication matrix for a subset of ~1000 services; a set of services forming an application and a cluster-manager service stand out]

Only 2% of service pairs communicate; 1% of services generate 64% of traffic

(lots more in the paper)

SLIDE 16

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 17

Optimizing FT and #M

Spread machines across all fault domains

– fault tolerance cost (FTC) is negatively correlated with worst-case survival

Convex optimization: FTC combines, per service and fault domain, a service weight, a fault-domain weight, and the number of machines of service s in domain f (see the sketch below)

Advantages of a convex cost function:

– local actions improve the global metric
– directly considers #M
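A minimal sketch of such a cost, assuming a quadratic penalty as the convex function (the exact convex function and the weights used in the paper may differ):

```python
# Sketch: fault tolerance cost (FTC), assuming a quadratic convex penalty.
# Concentrating a service in one fault domain inflates the cost, so a
# lower FTC tends to mean a higher worst-case survival.
def ftc(n, service_weight, domain_weight):
    """n[s][f]: number of machines of service s in fault domain f."""
    return sum(service_weight[s] * domain_weight[f] * count ** 2
               for s, per_domain in n.items()
               for f, count in per_domain.items())

# A red/green style example (made-up machine counts, unit weights):
n = {"red": {"c1": 2}, "green": {"c1": 1, "c2": 1, "c3": 1}}
w_s = {"red": 1.0, "green": 1.0}          # service weights (made up)
w_f = {"c1": 1.0, "c2": 1.0, "c3": 1.0}   # fault-domain weights (made up)
print(ftc(n, w_s, w_f))  # 7.0: red contributes 4, green only 3; spreading is cheaper
```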

SLIDE 18

Machine swap as a basic move

Keeps the current allocation feasible

– doesn't change the number of machines per service

Steepest-descent swap = the swap with the largest reduction in cost

Only evaluate a small, random set of swaps (see the sketch below)

– symmetry => many "good" swaps exist
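A minimal sketch of this move under the same assumptions; `cost` is any allocation cost (FTC here, FTC + α·BW later), and the sample size is a made-up parameter:

```python
import random

# Sketch: try a few random machine swaps and keep the one that lowers the
# cost the most; swapping two machines' services never changes how many
# machines each service has, so the allocation stays feasible.
def best_swap(alloc, machines, cost, num_candidates=100):
    """alloc: machine -> service (dict); machines: list of machine ids."""
    best_cost, best_pair = cost(alloc), None
    for _ in range(num_candidates):
        a, b = random.sample(machines, 2)
        alloc[a], alloc[b] = alloc[b], alloc[a]   # try the swap
        c = cost(alloc)
        if c < best_cost:
            best_cost, best_pair = c, (a, b)
        alloc[a], alloc[b] = alloc[b], alloc[a]   # undo it
    return best_cost, best_pair
```

Because machines with identical fault domains are interchangeable, many of the sampled swaps are near-ties with the true steepest-descent swap, which is why a small random sample suffices.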

SLIDE 19

[Figure: path of steepest descent on FT, shown as FT improvement vs. BW reduction]

SLIDE 20

Optimizing FT, BW, and #M (the FT+BW algorithm)

Steepest descent on FTC + α·BW (see the sketch below)

– non-convex
– no guarantees on reaching the optimum

α determines the FT-BW trade-off
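A sketch of the descent loop, reusing the hypothetical `best_swap`, `ftc`, and `core_bandwidth` helpers from the earlier sketches:

```python
# Sketch: steepest descent on FTC + alpha * BW; each committed swap is one
# improving move, so capping the number of iterations directly limits #M.
def optimize(alloc, machines, alpha, ftc_of, bw_of, max_moves=1000):
    def cost(a):                       # alpha sets the FT-BW trade-off
        return ftc_of(a) + alpha * bw_of(a)
    for _ in range(max_moves):
        _, pair = best_swap(alloc, machines, cost)
        if pair is None:               # no sampled swap improves the cost
            break
        a, b = pair
        alloc[a], alloc[b] = alloc[b], alloc[a]   # commit the best swap
    return alloc
```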

SLIDE 21

[Figure: paths of steepest descent on FT+BW for α = 1 and α = 10, shown as FT improvement vs. BW reduction]

SLIDE 22

Benchmark algorithm: Cut+FT+BW

k-way minimum cut of the machine communication graph (see the sketch below)

– optimizes BW only
– ignores #M

followed by steepest descent on FT+BW
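For the cut step, an off-the-shelf graph partitioner can stand in; here is a sketch using pymetis (an assumption about tooling: the talk does not name a library), which splits a graph into k parts while minimizing the edge cut:

```python
import pymetis  # METIS bindings for k-way graph partitioning

# Sketch: 4 machines on a path 0-1-2-3 in the communication graph
# (unweighted here; a real traffic matrix would supply edge weights).
adjacency = [[1], [0, 2], [1, 3], [2]]
num_cut_edges, part_of = pymetis.part_graph(2, adjacency=adjacency)
print(num_cut_edges, part_of)  # e.g. 1 cut edge; part_of maps machine -> part
```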

SLIDE 23

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 24

Evaluation setup

Simulations based on 4 production clusters

– services and machine counts
– network topology
– fault domains
– network trace from a pre-production cluster

Metrics are relative to the initial allocation

– the actual optimum is not known

Choosing the next swap takes seconds to a minute

SLIDE 25

Evaluation

[Figure: ΔFT vs. core BW reduction for the FT-only algorithm]

SLIDE 26

Evaluation

[Figure: ΔFT vs. core BW reduction for the FT, Cut, and Cut+FT+BW algorithms; spreading low-talkers improves FT with little impact on BW; the boundary traces different values of α]

SLIDE 27

Evaluation

[Figure: adds FT+BW with 2.3% of machines moved to the previous plot]

SLIDE 28

Evaluation

[Figure: FT+BW with 2.3%, 9%, and 29% of machines moved, alongside FT, Cut, and Cut+FT+BW]

SLIDE 29

α changes the FT-BW trade-off

[Figure: the same ΔFT vs. core BW reduction plot; varying α slides the FT+BW allocations along the trade-off boundary]

SLIDE 30

Summary

Trade-off between fault tolerance and bandwidth

– algorithm that achieves improvement in both

Improvements (across 4 production datacenters)

– FT: 40% – 120%
– BW: 20% – 50%
– partially deployed in Bing

Key insights

– approximate an NP-hard problem using convex optimization
– lots of symmetry in the search space
– sparse and skewed communication matrix


SLIDE 33

Extensions

Hard constraints on FT, BW, #M

– e.g., pick a few services and require FT > 80%

Hierarchical BW optimization on agg switches

Applies to fat-tree networks

SLIDE 34

Main observations

Most traffic is generated by a few services (and service pairs)

 spread low-talkers to improve fault tolerance

Complex, overlapping fault domains

– hierarchical network fault domains
– power fault domains not aligned with network fault domains

 cell: set of machines with identical fault domains

SLIDE 35

Evaluation

[Figure: comparison of moving most of the machines vs. moving only a fraction of them]

SLIDE 36

Our optimization framework

Cost function considers FT and BW

– both problems are NP-hard and hard to approximate
– non-convex

Cut+FT+BW:

1. minimum k-way cut of the communication graph

– reshuffles all machines

2. gradient-descent moves using machine swaps

FT+BW:

1. machine swaps only

– moves only a small fraction of machines

SLIDE 37

Conclusion

Study of the communication patterns of Bing.com

– sparse communication matrix
– very skewed communication pattern

Principled optimization of both BW and FT

– exploits communication patterns
– can handle arbitrary fault domains

Reduction in BW: 20 – 50%
Improvement in FT: 40 – 120%

SLIDE 38

Evaluation (1 datacenter)

[Figure: ΔFT vs. ΔBW for the initial allocation and for allocations optimizing just FT or just BW]

SLIDE 39

Evaluation

[Figure: adds Cut+FT+BW (moves all servers) to the ΔFT vs. ΔBW plot]

SLIDE 40

Evaluation

[Figure: adds FT+BW+#M with 2.3% of servers moved, alongside FT+BW]

SLIDE 41

Evaluation

[Figure: FT+BW+#M with 2.3%, 9%, and 29% of servers moved, alongside FT+BW]

SLIDE 42

k-way min graph cut

[Figure: network topology and the machine communication graph, partitioned by a k-way graph cut to minimize BW]

– ignores #M: reshuffles almost all machines
– ignores FT: can't be easily extended

SLIDE 43

[Figure: trajectory of the k-way graph cut (BW) baseline, shown as improved FT vs. reduced BW]

SLIDE 44

Scaling algorithms to large datacenters

Only evaluate a small, random set of swaps

– symmetry => many "good" swaps exist

Cell = set of machines with the same fault domains (see the sketch below)

Reduce the size of the communication graph for the cut
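A minimal sketch of grouping machines into cells (the fault-domain map below is made up); the optimizer can then treat machines within one cell as interchangeable:

```python
from collections import defaultdict

# Sketch: a cell groups machines that share exactly the same fault domains;
# swapping two machines inside one cell changes nothing for FT, which
# shrinks the search space.
def cells(domains_of):
    """domains_of: machine -> iterable of fault-domain ids."""
    groups = defaultdict(set)
    for machine, domains in domains_of.items():
        groups[frozenset(domains)].add(machine)
    return list(groups.values())

domains_of = {"m1": {"rack1", "pdu1"},
              "m2": {"rack1", "pdu1"},
              "m3": {"rack2", "pdu1"}}
print(cells(domains_of))  # [{'m1', 'm2'}, {'m3'}]
```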

SLIDE 45

Cut + steepest descent

Step 1: min-cut

– optimizes BW

Step 2: steepest descent on FTC + α·BW

– non-convex
– no guarantees on reaching the optimum
– α determines the trade-off

Reshuffles all machines

SLIDE 46

[Figure: cut + steepest descent trajectories for α = 1 and α = 10, shown as improved FT vs. reduced BW]

SLIDE 47

Properties of allocation algorithms

[Table comparing the FT, FT+BW, and FT+BW+#M algorithms on the BW and #M metrics]

SLIDE 48

Service communication matrix is very sparse and skewed

[Figure: communication matrix of the services; a set of services forming an application and a cluster-manager service stand out]

Only 2% of service pairs communicate; 1% of services generate 64% of traffic

(lots more in the paper)

SLIDE 49

Which metrics matter?

FT: fault tolerance

– services should survive infrastructure failures
– failures happen despite redundancy

BW: bandwidth usage

– reduce usage on constrained links
– lower the cost of the infrastructure

#M: number of moves

– moving some servers is expensive
– want incremental allocation