Surviving Failures in Bandwidth-Constrained Datacenters
Peter Bodík², Ishai Menache², Mosharaf Chowdhury³, Pradeepkumar Mani¹, Dave Maltz¹, Ion Stoica³
¹Microsoft, ²Microsoft Research, ³UC Berkeley
How to allocate services to physical
[Figure: service 1, service 2, and service 3 placed on a topology of network core, agg switches, and racks]
[Figure, repeated over several slides: network core, switches, racks, containers, power distribution]
[Figure: example allocations on a topology of network core, agg switches, and racks, with panels labeled "bandwidth" and "fault tolerance"]
BW: utilization in core
FT: fault tolerance (for agg switches), shown as worst-case survival
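The two metrics named on this slide can be made concrete with a small sketch. It assumes (this is a reading of the slide labels, not a definition quoted from the deck) that a fault domain is an agg switch, that worst-case survival is the fraction of a service's machines left after the single most damaging fault-domain failure, and that core utilization is the traffic between machines under different agg switches. The names `placement`, `traffic`, `worst_case_survival`, and `core_traffic` are hypothetical.

```python
from collections import Counter

def worst_case_survival(placement):
    """placement maps (service, replica) -> fault domain (assumed: an agg switch).
    Returns service -> fraction of its machines that survive the worst single failure."""
    total = Counter(svc for svc, _ in placement)
    per_domain = Counter((svc, dom) for (svc, _), dom in placement.items())
    worst_loss = Counter()
    for (svc, _dom), n in per_domain.items():
        worst_loss[svc] = max(worst_loss[svc], n)
    return {svc: 1 - worst_loss[svc] / total[svc] for svc in total}

def core_traffic(placement, traffic):
    """traffic maps (replica_a, replica_b) -> rate.
    Only pairs placed in different fault domains are assumed to cross the core."""
    return sum(rate for (a, b), rate in traffic.items() if placement[a] != placement[b])

# Illustrative data only: two services with two machines each.
traffic = {(("web", 0), ("web", 1)): 10, (("db", 0), ("db", 1)): 10}
packed = {("web", 0): "agg1", ("web", 1): "agg1", ("db", 0): "agg2", ("db", 1): "agg2"}
spread = {("web", 0): "agg1", ("web", 1): "agg2", ("db", 0): "agg1", ("db", 1): "agg2"}
```

In this toy data the packed placement sends nothing through the core but loses a whole service when one agg switch fails, while the spread placement keeps half of each service alive at the price of core traffic. That tension is what the slide contrasts.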
[Plot, repeated over several slides: reduction in BW usage against change in average worst-case survival. Marked: the initial allocation, two groups labeled "allocations optimizing" (the optimized metric did not survive extraction), and a "GOAL!" marker.]
[Figure: (subset of) ~1000 services; callouts: a set of services forming an application, the cluster manager service]
(a lot more in the paper)
[Formula callouts: service weight, fault domain weight, number of machines]
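The formula these callouts annotate did not survive extraction. As an illustration only, the sketch below assumes the three quantities combine as a weighted sum of squared machine counts, a form chosen here because squaring penalizes concentrating a service in one fault domain. The function `ft_cost` and its argument names are hypothetical, and the slides do not confirm this exact expression.

```python
def ft_cost(machines, service_weight, domain_weight):
    """machines[(service, domain)] = number of the service's machines in that fault domain.
    Assumed form: sum of service_weight * domain_weight * count**2 (lower is better)."""
    return sum(
        service_weight[svc] * domain_weight[dom] * n ** 2
        for (svc, dom), n in machines.items()
    )

# Illustrative data only: four "web" machines, all weights set to 1.
weights_s = {"web": 1}
weights_d = {"agg1": 1, "agg2": 1}
packed_cost = ft_cost({("web", "agg1"): 4}, weights_s, weights_d)
spread_cost = ft_cost({("web", "agg1"): 2, ("web", "agg2"): 2}, weights_s, weights_d)
```

Under this assumed form, splitting the four machines evenly halves the cost relative to packing them under one agg switch.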
[Figure labels: FT improvement, BW reduction]
[Figure: machine communication graph, partitioned by a k-way min cut]
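To illustrate what cutting the machine communication graph into k parts means, here is a minimal sketch. It is not the partitioner used in the talk: it swaps in a simple pairwise-swap local search that keeps part sizes fixed and stops at a local optimum, so it does not guarantee a minimum cut. The names `cut_weight`, `refine_by_swaps`, `edges`, and `part` are hypothetical.

```python
from itertools import combinations

def cut_weight(part, edges):
    """Total weight of edges whose endpoints are assigned to different parts."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])

def refine_by_swaps(part, edges):
    """Swap pairs of nodes across parts while a swap lowers the cut.
    Part sizes never change because every step is an exchange."""
    part = dict(part)
    improved = True
    while improved:
        improved = False
        best = cut_weight(part, edges)
        for u, v in combinations(sorted(part), 2):
            if part[u] == part[v]:
                continue
            part[u], part[v] = part[v], part[u]
            new = cut_weight(part, edges)
            if new < best:
                best, improved = new, True
            else:
                part[u], part[v] = part[v], part[u]  # undo a swap that did not help
    return part

# Illustrative data only: heavy a-b and c-d traffic, light a-c traffic, k = 2.
edges = {("a", "b"): 5, ("c", "d"): 5, ("a", "c"): 1}
start = {"a": 0, "c": 0, "b": 1, "d": 1}
refined = refine_by_swaps(start, edges)
```

The starting assignment separates both heavy pairs; the search ends with each heavy pair in the same part, leaving only the light edge in the cut.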
[Plot, built up over several slides. Series: FT, FT cut, FT+BW cut, and FT+BW with 2.3%, 9%, and 29% moved.]
[Plot, built up over several slides. Series: initial allocation, just FT, just BW, Cut+FT+BW (moves all servers), FT+BW, and FT+BW+#M with 2.3%, 9%, and 29% of servers moved.]
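The "#M" series suggests an optimization that caps how many servers may be moved. A minimal sketch of that idea follows, assuming a greedy search that applies the most beneficial swap until the move budget would be exceeded. This is an illustrative stand-in, not the algorithm from the talk; `improve_with_budget`, `cost`, and `max_moved` are hypothetical names, and the cost function here measures only cross-domain traffic.

```python
from itertools import combinations

def improve_with_budget(placement, cost, max_moved):
    """Greedy swaps: apply the swap that lowers cost(placement) the most,
    never touching more than max_moved distinct servers in total."""
    current = dict(placement)
    moved = set()  # every server touched by an applied swap counts against the budget
    while True:
        base = cost(current)
        best_gain, best_pair = 0, None
        for a, b in combinations(sorted(current), 2):
            if current[a] == current[b] or len(moved | {a, b}) > max_moved:
                continue
            current[a], current[b] = current[b], current[a]
            gain = base - cost(current)
            current[a], current[b] = current[b], current[a]  # restore before next trial
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return current, len(moved)
        a, b = best_pair
        current[a], current[b] = current[b], current[a]
        moved |= {a, b}

# Illustrative data only: eight servers in four fault domains, four communicating pairs.
edges = {("a", "b"): 5, ("c", "d"): 5, ("e", "f"): 3, ("g", "h"): 3}
start = {"a": 0, "c": 0, "b": 1, "d": 1, "e": 2, "g": 2, "f": 3, "h": 3}
cross = lambda p: sum(w for (u, v), w in edges.items() if p[u] != p[v])
small, moved_small = improve_with_budget(start, cross, max_moved=2)
large, moved_large = improve_with_budget(start, cross, max_moved=4)
```

With a budget of two servers the search fixes only the heavier pair of flows; with four it removes all cross-domain traffic in this toy instance, mirroring the slide's point that a small fraction of moves can capture much of the benefit.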
[Figure: network topology and machine communication graph; min cut and k-way min cut]
[Figure labels: improved FT, reduced BW]
[Figure labels: BW, #M, and FT, shown in several combinations]