Scheduling Mix-flows in Commodity Datacenters with Karuna
Li Chen, Kai Chen, Wei Bai, Mohammad Alizadeh (MIT) SING Group, CSE Department Hong Kong University of Science and Technology
Mix-flow Scheduling
We investigate a practical yet neglected problem: the coexistence of deadline and non-deadline flows.
Shortest Job First (SJF) Scheduling – pFabric, PASE, PIAS, PDQ
Flow Priority = Remaining size
[Figure: deadline flows and non-deadline flows sent between two end-hosts over time t.]
[Plot: deadline miss rate (fraction) vs. percentage of non-deadline flows smaller than deadline flows.]
Scheduling only by size hurts deadline flows.
Problem: unawareness of deadlines.
Earliest Deadline First Scheduling – pFabric, PASE, PIAS, PDQ
Flow Priority = Time till Deadline
[Figure: deadline flows and non-deadline flows sent between two end-hosts over time t, with deadlines marked.]
Prioritizing deadline flows hurts non-deadline flows, especially short ones.
Problem: Existing transports for deadline flows unnecessarily take all the bandwidth.
[Plot: 99th percentile FCT (ms) of non-deadline flows (overall and size < 10KB) vs. percentage of deadline flows in overall traffic.]
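The two failure modes above can be reproduced with a toy single-link model. This is only an illustrative sketch, not the presenters' simulator: the flow names, sizes, and the "one size unit per ms" link are made-up assumptions, and both flows are assumed to arrive at time 0 so that remaining size equals size and time till deadline equals the deadline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    name: str
    size: float                       # in units the link can send per ms (hypothetical)
    deadline: Optional[float] = None  # ms; None means a non-deadline flow


def run_in_order(flows, key):
    """Serve flows one at a time in `key` order; return finish times in ms."""
    now = 0.0
    finish = {}
    for flow in sorted(flows, key=key):
        now += flow.size
        finish[flow.name] = now
    return finish


def sjf_key(flow):
    # Flow priority = remaining size (smaller goes first).
    return flow.size


def edf_key(flow):
    # Flow priority = time till deadline; non-deadline flows go last.
    return flow.deadline if flow.deadline is not None else float("inf")


flows = [Flow("D", size=8, deadline=10), Flow("N", size=4)]
sjf = run_in_order(flows, sjf_key)  # N goes first, D finishes after its deadline
edf = run_in_order(flows, edf_key)  # D goes first, short flow N waits behind it
```

With these numbers, size-only scheduling finishes D at 12 ms (deadline 10 ms), while deadline-first scheduling triples the completion time of the short non-deadline flow N.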
[Diagram: Karuna overview. End-host: deadline flows use MCP; non-deadline flows use a work-conserving transport. Network fabric: priority queues from Highest Priority through Priority 2, Priority 3, ..., Priority K.]
Key Insight: Deadline flows should minimally impact non-deadline flows.
Completing deadlines with minimal bandwidth
Minimal-impact Congestion control Protocol
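As a worked example of "minimal bandwidth": the lowest constant rate that still meets a deadline is the remaining size divided by the time left. The sketch below shows only this arithmetic, not MCP's actual window update; the 1 Gbps link and the flow numbers are hypothetical.

```python
def min_rate_bps(remaining_bytes: int, time_left_ms: float) -> float:
    """Lowest constant rate (bits/s) that still finishes by the deadline."""
    if time_left_ms <= 0:
        raise ValueError("deadline has already passed")
    return remaining_bytes * 8 * 1000 / time_left_ms


LINK_BPS = 1_000_000_000            # hypothetical 1 Gbps link
need = min_rate_bps(1_000_000, 10)  # 1 MB left, 10 ms until the deadline
spare = LINK_BPS - need             # bandwidth left for non-deadline flows
```

Here the deadline flow needs 800 Mbps, which leaves 200 Mbps for non-deadline traffic instead of none.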
Deadline flows Non-deadline flows Implementation Evaluation
Stochastic Optimization → (Lyapunov Optimization Framework [1]) → Convex Optimization → (Primal Solution) → Per-flow congestion window update function
[1] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems, Morgan & Claypool, 2010.
→ Near-deadline completion
[Figure: two rate-vs-time sketches, marking the link capacity and the deadline.]
Mimicking SJF: non-deadline flows with or without known sizes
Deadline flows Non-deadline flows Implementation Evaluation
[2] Wei Bai et al. Information-Agnostic Flow Scheduling for Commodity Data Centers, USENIX NSDI 2015.
Flows are demoted through the priorities as they send more bytes:
Highest Priority: send packets tagged with the highest priority until α_1 bytes are sent.
2nd Highest Priority: send packets tagged with the 2nd highest priority until α_2 bytes are sent.
Lowest Priority: send packets tagged with the lowest priority.
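The demotion rule above amounts to a threshold lookup on bytes sent. A minimal sketch, assuming priority index 0 is the highest and using made-up threshold values (the real thresholds are derived by optimization, not chosen by hand):

```python
from bisect import bisect_right


def pias_priority(bytes_sent: int, thresholds: list) -> int:
    """Return the priority index for the next packet (0 = highest).

    `thresholds` must be sorted ascending; a flow is demoted one level
    each time its bytes-sent count reaches the next threshold.
    """
    return bisect_right(thresholds, bytes_sent)


ALPHAS = [100_000, 1_000_000]  # hypothetical alpha_1, alpha_2 in bytes

first_packet = pias_priority(0, ALPHAS)            # still in the highest priority
mid_flow = pias_priority(100_000, ALPHAS)          # alpha_1 bytes sent: demoted once
long_flow = pias_priority(5_000_000, ALPHAS)       # past alpha_2: lowest priority
```

Short flows finish before reaching the first threshold, so they stay in the highest priority without the sender knowing their size in advance.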
PIAS: Sum of Linear Ratios Problem. Demotion thresholds: {β_i}.
Reformulation to include flows with known sizes:
Karuna: Quadratic Sum of Ratios Problem. Demotion thresholds: {β_i}; splitting thresholds: {γ_i}.
Flows with known sizes are split across priorities by size:
Priority 2: Size ≤ γ_1
Priority 3: γ_1 < Size ≤ γ_2
...
Priority K: γ_{K−1} < Size
Flows without known sizes are demoted through the same priorities as in PIAS.
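A sketch of how the two kinds of non-deadline flows could be mapped to priorities, assuming Priority 1 is reserved for deadline flows. The function name and threshold values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def non_deadline_priority(gammas, betas, size=None, bytes_sent=0):
    """Return a priority number starting at 2 (1 is assumed reserved).

    Known size: pick the bucket by the splitting thresholds `gammas`
    (a size equal to a threshold stays in the higher-priority bucket).
    Unknown size: demote by bytes sent using the demotion thresholds `betas`.
    """
    if size is not None:
        return 2 + bisect_left(gammas, size)
    return 2 + bisect_right(betas, bytes_sent)


GAMMAS = [10_000, 1_000_000]  # hypothetical splitting thresholds (bytes)
BETAS = [50_000, 2_000_000]   # hypothetical demotion thresholds (bytes)

small_known = non_deadline_priority(GAMMAS, BETAS, size=10_000)
large_known = non_deadline_priority(GAMMAS, BETAS, size=5_000_000)
new_unknown = non_deadline_priority(GAMMAS, BETAS, bytes_sent=0)
```

A flow with a known size keeps one priority for its whole lifetime, while a flow with an unknown size starts at Priority 2 and moves down as it sends more.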
Deadline flows Non-deadline flows Implementation Evaluation
Information passing: pass flow information (deadline, size) from the socket to the kernel using setsockopt() with SO_MARK.
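A sketch of the application side of this step. SO_MARK carries a single 32-bit value, so deadline and size have to share it; the bit layout below (11 bits of deadline in ms, 20 bits of size in KB) is purely a hypothetical encoding, not the one used in the implementation. Setting SO_MARK on Linux requires CAP_NET_ADMIN, so `mark_socket` is defined but not called here.

```python
import socket

# socket.SO_MARK exists on Linux; 36 is its value on most architectures.
SO_MARK = getattr(socket, "SO_MARK", 36)

SIZE_BITS = 20      # hypothetical: low 20 bits hold the flow size in KB
DEADLINE_BITS = 11  # hypothetical: next 11 bits hold the deadline in ms


def pack_mark(deadline_ms: int, size_kb: int) -> int:
    """Pack (deadline, size) into one non-negative 31-bit mark value."""
    if not 0 <= deadline_ms < (1 << DEADLINE_BITS):
        raise ValueError("deadline out of range")
    if not 0 <= size_kb < (1 << SIZE_BITS):
        raise ValueError("size out of range")
    return (deadline_ms << SIZE_BITS) | size_kb


def unpack_mark(mark: int):
    """Inverse of pack_mark, as a kernel-side module would decode it."""
    return mark >> SIZE_BITS, mark & ((1 << SIZE_BITS) - 1)


def mark_socket(sock: socket.socket, deadline_ms: int, size_kb: int) -> None:
    """Attach flow information to a socket (needs CAP_NET_ADMIN on Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MARK, pack_mark(deadline_ms, size_kb))


mark = pack_mark(deadline_ms=20, size_kb=14_400)
```

Passing deadline 0 could stand for "no deadline" and size 0 for "size unknown", but that convention is also an assumption.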
Packet tagging: TC module at the sender side. Tag DSCP fields in packet headers based on thresholds.
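Tagging a packet means rewriting the upper six bits of the IP TOS / Traffic Class byte while leaving the two ECN bits untouched, since the switches also rely on ECN. A minimal sketch of that bit manipulation; the one-to-one mapping from priority number to DSCP value is a hypothetical choice.

```python
def dscp_for_priority(priority: int) -> int:
    """Hypothetical mapping: priority k is carried as DSCP value k."""
    if not 0 <= priority < 64:
        raise ValueError("DSCP is a 6-bit field")
    return priority


def retag_tos(old_tos: int, dscp: int) -> int:
    """Write `dscp` into bits 7..2 of the TOS byte, keeping the ECN bits."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit field")
    return (dscp << 2) | (old_tos & 0b11)


# A packet with ECN bits 0b10 (ECT(0)) tagged for priority 3.
new_tos = retag_tos(0b00000010, dscp_for_priority(3))
```

The switch then only needs a static table from DSCP values to its priority queues.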
Rate control: done in the TC module. Non-deadline flows use DCTCP. For deadline flows, the module modulates the congestion window using MCP.
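For reference, the DCTCP window adjustment used by non-deadline flows can be sketched as follows, based on the published DCTCP algorithm [3]. MCP's own update function for deadline flows is not given in these slides and is deliberately not reproduced here; function names are illustrative.

```python
G = 1 / 16  # estimation gain suggested in the DCTCP paper


def update_alpha(alpha: float, marked: int, acked: int, g: float = G) -> float:
    """Moving average of the fraction of ECN-marked packets per window."""
    fraction = marked / acked if acked else 0.0
    return (1 - g) * alpha + g * fraction


def on_window_end(cwnd: float, alpha: float, any_marked: bool) -> float:
    """Cut the window in proportion to congestion, else grow by one packet."""
    if any_marked:
        return max(1.0, cwnd * (1 - alpha / 2))
    return cwnd + 1.0


alpha = update_alpha(0.0, marked=8, acked=16)      # half the window was marked
cwnd = on_window_end(100.0, alpha, any_marked=True)
```

Because the cut scales with alpha, a lightly marked window shrinks only slightly, unlike TCP's fixed halving.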
Switch configuration: ECN marking. Strict priority queueing (priorities mapped to DSCP fields).
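The switch behavior can be modeled in a few lines. This is a toy model of one output port, not a switch configuration; the per-queue ECN threshold is an assumption (the slides do not say whether marking is per queue or per port), and packets are plain dictionaries.

```python
from collections import deque


class StrictPriorityPort:
    """One output port with strict priority queues (index 0 = highest)."""

    def __init__(self, num_queues: int, ecn_threshold: int):
        self.queues = [deque() for _ in range(num_queues)]
        self.ecn_threshold = ecn_threshold  # assumed per-queue, in packets

    def enqueue(self, priority: int, packet: dict) -> None:
        queue = self.queues[priority]
        if len(queue) >= self.ecn_threshold:
            # Mark instead of dropping; the sender reacts via DCTCP or MCP.
            packet = dict(packet, ecn_ce=True)
        queue.append(packet)

    def dequeue(self):
        # Always serve the highest-priority non-empty queue first.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None


port = StrictPriorityPort(num_queues=3, ecn_threshold=2)
port.enqueue(2, {"id": "a"})
port.enqueue(0, {"id": "b"})
order = [port.dequeue()["id"], port.dequeue()["id"]]
```

Packet "b" leaves first even though it arrived later, because its queue has higher priority.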
Testbed Experiments Simulations
Deadline flows Non-deadline flows Implementation Evaluation
[3] M. Alizadeh et al. Data Center TCP (DCTCP), ACM SIGCOMM Computer Communication Review, Vol. 40, No. 4, ACM, 2010.
[4] A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network, ACM SIGCOMM Computer Communication Review, Vol. 39, No. 4, ACM, 2009.
Flow  Size    Deadline  Start Time
1     14.4MB  20ms      0ms
2     48MB    120ms     0ms
3     3MB     5ms       50ms
4     0.5MB   10ms      80ms

[Plot: DCTCP. Throughput (Mbps) vs. time (ms) for Flows 1 to 4. Deadline missed for Flow 1.]
[Plot: pFabric (Earliest Deadline First). Throughput (Mbps) vs. time (ms) for Flows 1 to 4, with the deadline of each flow marked.]
[Plot: Karuna. Throughput (Mbps) vs. time (ms) for Flows 1 to 4.]
Karuna completes each deadline flow just before its deadline, leaving bandwidth for non-deadline flows.
[Plots: throughput over time under Karuna, pFabric (Earliest Deadline First), and DCTCP.]
Mimics shortest job first scheduling for non-deadline flows.
[Charts: FCT (ms) by flow size for Karuna, DCTCP, and TCP.]

Flow Size          Karuna   DCTCP    TCP
0-100KB (Avg)      0.916    1.716    8.04
0-100KB (99th)     2.721    5.554    28.725
100KB-10MB (Avg)   81.729   104.629  120.286
>10MB (Avg)        851.72   718.369  608.04
Overall            61.133   67.019   73.634
[Plot: deadline miss rate (%) vs. deadline traffic load (0.8 to 0.95) for D3, D2TCP, pFabric (EDF), and Karuna.]
[Plot: 95th percentile FCT (ms) vs. deadline traffic load (0.8 to 0.95) for D3, D2TCP, pFabric, and Karuna.]
[Plot: number of completed non-deadline flows vs. deadline traffic load (0.8 to 0.95) for D3, D2TCP, pFabric, and Karuna.]
Reducing completion times of non-deadline flows while completing deadline flows.
>100x
based on size.