Fast and Cautious:
Leveraging Multi-path Diversity for Transport Loss Recovery in Data Centers
Guo Chen
Yuanwei Lu, Yuan Meng, Bojie Li, Kun Tan, Dan Pei, Peng Cheng, Layong (Larry) Luo, Yongqiang Xiong, Xiaoliang Wang, and Youjian Zhao
- Services care about the tail flow completion time (tail FCT)
  - Each operation generates a large number of flows
  - Overall performance is governed by the last flows to complete
16/6/25 2
[Figure: a large-scale web application hosted in a Data Center Network (DCN), spread across many app-logic servers]
- But packet loss hurts tail FCT
  - Real case in a Microsoft Azure DCN: a spine switch with a 2% random drop rate increased the 99th-percentile latency of all users
[Figure: DCN tail latency visualization; (a) Normal, (b) Spine failure. Source: Pingmesh (SIGCOMM’15)]
- Motivation
- Packet Loss in DCN
- Impact of Packet Loss
- Challenge for Loss Recovery
- FUSO Design
- Evaluation
- Summary
- Loss characteristics
  - Measured in a Microsoft production DCN during Dec. 1st-5th, 2015
[Figure: loss rate and location distribution of lossy links (loss rate > 1%)]
  - Mean loss rate: 4%
  - 78% located above the ToR
  - Similar across the 5 days
- Reasons causing loss
  - Congestion loss: bursty; transient
    - Uneven load balance
    - Incast
    - Greatly mitigated (e.g., 1% -> 0.01%) [Jupiter Rising SIGCOMM’15]
  - Failure loss: complex; hard to detect
    - Silent random drop
    - Packet black-hole
    - Common, with huge impact [Pingmesh SIGCOMM’15]
Impact of Packet Loss
  - Why does loss hurt the tail?
  - How badly does loss hurt?
- Fast recovery
  - Waits for a certain number of duplicate ACKs (DACKs) to detect the loss and retransmit
[Figure: sender/receiver timeline. Packets 1-2 are ACKed; packets 3-6 are sent; DupAck 3 returns and packet 3 is retransmitted (time axis in RTTs)]
- Timeout (RTO)
  - If not enough DACKs return, retransmits after a timeout
[Figure: sender/receiver timeline. Packets 1-2 are ACKed; packets 3-6 are sent; packet 3 is retransmitted only after a timeout]
[Pingmesh (SIGCOMM’15), DCTCP (SIGCOMM’10)]
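The two recovery paths above can be sketched as a small Python function that decides how a sender learns about a lost packet. This is a simplified illustration, not any real TCP stack: the function name is hypothetical, the threshold of 3 duplicate ACKs is the common TCP default (the slides only say "a certain number"), and the model assumes a single loss with every later packet in the window arriving.

```python
DUPACK_THRESHOLD = 3  # assumed value: the common TCP default


def recovery_mode(window, lost_index, dupack_threshold=DUPACK_THRESHOLD):
    """Return how the loss of window[lost_index] gets repaired.

    Simplified model: only one packet is lost, and every packet sent
    after it in the same window arrives and triggers one duplicate ACK.
    """
    dupacks = len(window) - lost_index - 1
    if dupacks >= dupack_threshold:
        return "fast recovery"
    return "timeout"


# Packets 3-6 in flight, packet 3 lost: packets 4, 5, 6 yield 3 DACKs.
print(recovery_mode([3, 4, 5, 6], lost_index=0))  # fast recovery
# Packet 5 lost instead: only packet 6 follows, so 1 DACK is not enough.
print(recovery_mode([3, 4, 5, 6], lost_index=2))  # timeout
```

The second call shows why short flows and tail losses are the problem case: too few packets follow the loss to generate the required DACKs.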
[Figure: timeout probability of flows of different sizes crossing a path with different packet loss rates; series: 10KB (testbed), 100KB (testbed), 10KB (analysis), 100KB (analysis)]
- The timeout ratio grows sharply when the loss rate exceeds 1%
- 3% loss -> ~10% timeout; 99th-percentile FCT > RTO
- A little loss causes enough timeouts to hurt the tail FCT
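The link between the timeout ratio and the tail can be made concrete with a little arithmetic: once more than 1% of flows suffer a timeout, the 99th-percentile FCT must include an RTO. The sketch below assumes a hypothetical 1 ms loss-free FCT and uses the 5 ms min_RTO listed in the evaluation settings; the function name and the nearest-rank percentile are illustrative choices.

```python
def p99_fct_ms(base_fct_ms, rto_ms, timeout_ratio, flows=10_000):
    """99th-percentile FCT when `timeout_ratio` of flows each wait one RTO."""
    timed_out = round(flows * timeout_ratio)
    fcts = [base_fct_ms] * (flows - timed_out) + [base_fct_ms + rto_ms] * timed_out
    fcts.sort()
    rank = (99 * flows + 99) // 100  # nearest-rank percentile, integer ceiling
    return fcts[rank - 1]


# ~10% of flows time out (the 3% loss case): the tail pays the full RTO.
print(p99_fct_ms(1.0, 5.0, timeout_ratio=0.10))   # 6.0
# Only 0.5% of flows time out: the 99th percentile is unaffected.
print(p99_fct_ms(1.0, 5.0, timeout_ratio=0.005))  # 1.0
```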
Challenge for Loss Recovery
- Prior works add aggressiveness to congestion control to recover loss before a timeout (RTO)
  - Tail Loss Probe (TLP) [SIGCOMM’13, RFC 5827]
    - Transmits one probe packet after 2 RTTs
  - Instant Recovery (TCP-IR) [SIGCOMM’13, RFC 5827]
    - Generates an FEC packet for every group of packets (up to 16)
    - FEC packets also act as probes and are delayed by 1/4 RTT before being sent
  - Proactive/RepFlow [SIGCOMM’13, INFOCOM’14]
    - Duplicates every packet/flow
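To illustrate how one FEC packet per group can repair a loss without a retransmission, here is a minimal sketch using XOR parity. XOR parity is an assumption chosen for simplicity (the slides do not say which code TCP-IR uses), the function names are hypothetical, and the receiver is assumed to know the length of the missing packet.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of all packets in a group."""
    parity = bytearray(max(len(p) for p in packets))
    for p in packets:
        for i, byte in enumerate(p):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity, length):
    """Rebuild the single missing packet of a group from the rest plus parity."""
    # XOR-ing the parity with every received packet cancels them out,
    # leaving only the missing packet (zero-padded to the parity length).
    missing = xor_parity(list(received) + [parity])
    return missing[:length]


group = [b"pkt-1", b"pkt-22", b"pkt-333"]
parity = xor_parity(group)
# Packet 2 is lost; the receiver still holds packets 1 and 3 plus the parity.
assert recover([group[0], group[2]], parity, len(group[1])) == group[1]
```

One parity packet can repair only one loss per group, which is why such schemes help with isolated drops but not with bursts of consecutive losses.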
- How long to wait before sending recovery packets?
  - For congestion loss
    - Should delay long enough to avoid worsening the congestion
    - Bursty: leads to multiple consecutive losses [Incast (WREN’09), DCTCP (SIGCOMM’10)]
  - For failure loss such as random drop
    - Should recover as fast as possible; any delay already increases the FCT [TLP SIGCOMM’13, RFC 5827]
- Loss easily incurs timeouts, which hurt the tail
- To prevent timeouts, prior works add fixed aggressiveness to recover loss before a timeout
- A fixed aggressiveness is hard to adapt to various loss conditions
  - Should be fast for failure loss
  - Should be cautious for congestion loss
FUSO Design
- Utilize the “good” paths to proactively conduct loss recovery for the “bad” paths
  - Leverages path diversity (multiple paths, of which only a few encounter loss)
- Fast and Cautious
  - Fast
    - Proactive (immediate) recovery for potential packet loss, utilizing spare transmission opportunities
  - Cautious
    - Strictly follows congestion control without adding aggressiveness
[Figure: multi-path transport between a sender and a receiver. Data distribution spreads packets over sub-flows SF1-SF3, which map implicitly or explicitly to physical paths; multi-path congestion control keeps one window per sub-flow (CWND1, CWND2, CWND3) within CWNDtotal.]
[Animation: packets P1-P5 are distributed over the sub-flows and P2 is lost. ACKs return for P3, P1, and P4 & P5. The sender then has spare CWND and no new data, so it takes the un-ACKed P2 from the “worst” sub-flow and sends a recovery copy of P2 over the “best” sub-flow. Done! The final frame annotates the lost P2 with “Retransmit after an RTO”.]
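The implicit mapping of sub-flows to physical paths can be pictured with ECMP-style hashing (ECMP routing is the setting used in the evaluation): a switch hashes each packet's five-tuple and picks one of the equal-cost next hops, so sub-flows that differ only in source port may land on different paths. The sketch below is illustrative only; real switches use vendor-specific hash functions, and the CRC32 hash, addresses, ports, path names, and function name are all assumptions.

```python
import zlib


def ecmp_next_hop(five_tuple, next_hops):
    """Pick a next hop by hashing the flow's five-tuple (CRC32 is a stand-in)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


paths = ["spine-1", "spine-2", "spine-3", "spine-4"]
# Three sub-flows of one connection, differing only in source port.
for src_port in (50001, 50002, 50003):
    flow = ("10.0.0.1", "10.0.1.1", src_port, 80, "tcp")
    print(src_port, ecmp_next_hop(flow, paths))
```

Because the hash is deterministic, all packets of one sub-flow stay on one path, which is what lets a sub-flow's loss history say something about that path.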
[Figure: sender with sub-flows SF1-SF3 (CWND1, CWND2, CWND3 within CWNDtotal); P2 is lost on one sub-flow. Sub-flows are ordered by their possibility of encountering loss.]
- “Worst” sub-flow
  - Has un-ACKed data
  - Most likely to have loss
- “Best” sub-flow
  - Has spare CWND
  - Least likely to have loss
- If (spare CWND) && (no new data)
  - Utilize the transmission opportunity to proactively recover
  - Use “good” paths to help “bad” paths
- Multi-path diversity offers many transmission opportunities
  - “Good” paths have spare window
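The selection rule above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the kernel implementation: the `SubFlow` class, the `loss_score` field (a stand-in for however loss likelihood is actually estimated, which the slides do not specify), and the choice to never use the worst sub-flow as its own helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubFlow:
    name: str
    cwnd: int                                    # congestion window, in packets
    unacked: list = field(default_factory=list)  # packets sent but not yet ACKed
    loss_score: float = 0.0                      # hypothetical loss-likelihood estimate

    def spare_cwnd(self):
        return self.cwnd - len(self.unacked)


def try_proactive_recovery(subflows, has_new_data):
    """Return (packet, worst, best) if a recovery copy was sent, else None."""
    if has_new_data:
        return None  # recovery only uses opportunities that new data does not need
    candidates = [sf for sf in subflows if sf.unacked]
    if not candidates:
        return None
    worst = max(candidates, key=lambda sf: sf.loss_score)
    helpers = [sf for sf in subflows if sf is not worst and sf.spare_cwnd() > 0]
    if not helpers:
        return None  # cautious: never send beyond the congestion window
    best = min(helpers, key=lambda sf: sf.loss_score)
    packet = worst.unacked[0]
    best.unacked.append(packet)  # the copy consumes the best sub-flow's window
    return packet, worst.name, best.name


flows = [
    SubFlow("SF1", cwnd=1, unacked=["P2"], loss_score=0.9),
    SubFlow("SF2", cwnd=2, loss_score=0.1),
    SubFlow("SF3", cwnd=2, loss_score=0.2),
]
print(try_proactive_recovery(flows, has_new_data=False))  # ('P2', 'SF1', 'SF2')
```

The recovery copy is charged against the best sub-flow's window, so the sketch never sends more than congestion control allows; a fuller version would also track which packets have already been copied to avoid recovering the same one twice.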
[Figure: FUSO architecture. At the sender, app data passes through multipath congestion control onto Sub-Flow 1 to Sub-Flow N; un-ACKed data of one sub-flow is sent as recovery packets to the best sub-flow, which has spare window. At the receiver, recovery packets arriving on any sub-flow are used to recover the app data.]
- Implemented in the Linux kernel; ~900 lines of code
  - https://github.com/1989chenguo/FUSO
Evaluation
- Network
  - 1Gbps fabric & 1Gbps hosts; ECMP routing; ECN enabled
- TCP
  - Init_cwnd=16; min_RTO=5ms
- Failure loss
  - Random drop; loss rate: 0.125%-4%
[Figure: 99th-percentile FCT and % of flows encountering timeout, for latency-sensitive flows]
- Fast: reduces the 99th-percentile FCT by up to ~82.3% and the timeout flows by up to 100%
- Congestion loss
  - Incast
[Figure: incast results; axis: concurrent responses]
- Cautious: performs the best
- Failure loss & congestion loss
  - From failure-loss-dominated to congestion-loss-dominated
  - Loss rate: 2%
[Figure: results for latency-sensitive flows and background long flows]
- Adapts to various loss conditions
- Simulation settings
  - NS2 simulator; 3-layer, 4-port FatTree fabric
  - 40Gbps fabric, 10Gbps hosts; 64 hosts, 20 switches
  - Empirical failure generation
[Figure: results for latency-sensitive flows and background long flows under random failure]
- Reduces the average FCT by up to ~60.3% and the 99th-percentile FCT by up to ~87.4%
Summary
- Loss hurts tail latency
  - Loss is not uncommon
  - A little loss leads to enough timeouts to hurt the tail
- Challenge for loss recovery
  - How to accelerate loss recovery under various loss conditions without causing congestion?
- Philosophy of FUSO
  - Being fast and being cautious are equally important
  - Fast: proactive loss recovery utilizing spare transmission opportunities, leveraging multi-path diversity
  - Cautious: strictly follows congestion control without adding aggressiveness
Q&A?