

SLIDE 1

Reducing Web Latency: The Virtue of Gentle Aggression

Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Yuchung Cheng, Neal Cardwell, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan USC & Google August 14, 2013

SLIDE 2

We can improve Google’s response time by 23%

Across billions of client requests, we improved the mean response time by 23%. We achieved this by speeding up only 6% of the transfers, all of which experienced packet loss. The improvement is in the tail: we halved latency at the 99th percentile. For latency-sensitive services, faster transfers mean a better user experience.

SLIDE 3

Ways to Reduce Latency: The State of the Art

[Diagram: direct client-to-server connection: high loss, high delay]


SLIDE 4

Ways to Reduce Latency: The State of the Art

  • Improve the proximity of services to the user
  • Leverage multi-stage connections

[Diagram: client ↔ frontend (high loss, shorter delay) ↔ backend (low loss, multiplexed)]


SLIDE 5

Evaluating TCP Performance

[Diagram: client ↔ frontend (high loss, shorter delay) ↔ backend (low loss, multiplexed)]

Analyzed billions of flows carrying Web traffic between Google and clients

SLIDE 6

Transfers With Loss Are Too Slow

Loss makes Web latency 5 times higher

  • Delays are caused by TCP loss detection and recovery
  • 6% of transfers between Google and clients are lossy

[Delay graph]

SLIDE 7

Retransmission Timeouts Are Expensive

77% of losses are recovered by retransmission timeouts

Retransmission timeouts can be 200 times larger than the RTT, caused by high RTT variance or a lack of RTT samples.
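The gap between the RTO and the RTT follows from how TCP estimates the timeout. A minimal sketch of the standard estimator (per RFC 6298; the smoothing constants are from the RFC, while the 200 ms clamp is an illustrative value many stacks use) shows how RTT variance, not the mean RTT, drives the RTO upward:

```python
# Sketch of the standard RTO computation (RFC 6298); illustrates how
# high RTT variance inflates the retransmission timeout well beyond
# the typical RTT of the path.

ALPHA, BETA = 1 / 8, 1 / 4  # smoothing factors from RFC 6298
MIN_RTO = 0.2               # illustrative clamp (~200 ms) used by many stacks

def update_rto(srtt, rttvar, sample):
    """Fold one RTT sample into the smoothed estimators; return new state."""
    if srtt is None:                      # first sample initializes the state
        srtt, rttvar = sample, sample / 2
    else:
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
        srtt = (1 - ALPHA) * srtt + ALPHA * sample
    rto = max(MIN_RTO, srtt + 4 * rttvar)
    return srtt, rttvar, rto

# A jittery path: mean RTT around 50-100 ms, but the variance term keeps
# the RTO several times larger than the smoothed RTT.
srtt = rttvar = rto = None
for sample in [0.05, 0.30, 0.04, 0.25, 0.05]:
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
```

On a path with stable RTTs the variance term shrinks and the RTO converges toward the RTT; with jitter (or too few samples to smooth it out), the `4 * rttvar` term dominates.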

SLIDE 8

Tail Drops Are Expensive

(Single) tail packet drop is very common

  • Tail packets are twice as likely to be dropped as packets early in a burst
  • 35% of lossy bursts observe only one packet loss

SLIDE 9

Our Motivation and Goal

Our goal: approach the ideal of loss detection and recovery without delay, without making the protocol too aggressive.

Loss significantly slows down transfers, due to frequent recovery via slow RTOs, often caused by tail loss.

SLIDE 10

Design Space

Phase                 | Decreased slightly | Increased slightly | Increased greatly
Startup / Short flows |                    | IW 10              |
Steady state          | TCP Vegas          | CUBIC              | Relentless / Decongestion, DDoS Defense by Offense
Loss recovery         | Timeout moderation |                    |

SLIDE 11

Setting

[Diagram: client ↔ (public network) ↔ frontend server ↔ (private network) ↔ backend server]

Public network (client ↔ frontend): controlling the server only; preference for solutions without client changes and with middlebox compatibility.
Private network (frontend ↔ backend): controlling both client and server; latency-sensitive traffic is a small portion of the traffic mix.

SLIDE 12

Setting

[Diagram: client ↔ (public network) ↔ frontend server ↔ (private network) ↔ backend server]

Reactive

Trigger fast retransmit by retransmitting the tail packet early

Proactive

Avoid retransmissions through packet duplication

Corrective

Add redundancy to enable recovery without retransmission, or trigger fast retransmit
SLIDE 14

Reactive

[Timeline: packets 1-3 sent, the tail packet is lost; the sender waits until the RTO fires]

The receiver does not know about the loss and therefore cannot send signals back.

SLIDE 15

Reactive

[Timeline: after two RTTs without an ACK, the tail packet 3 is retransmitted as a probe]

Retransmit a new packet or the previous (tail) packet after two RTTs. The probe can trigger a selective acknowledgement indicating the loss, which in turn triggers fast retransmit.

Speeds up loss detection
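A minimal sketch of the probe-timer idea described above (a toy model, not the kernel implementation; the function name is hypothetical and the two-RTT constant follows the slide):

```python
# Sketch of the Reactive idea: if nothing has been ACKed ~2 RTTs after
# the last transmission, send a probe (retransmit the tail segment)
# instead of waiting for the much larger RTO.

def next_timer(rtt, rto, probe_sent):
    """Pick the next timeout to arm: the 2*RTT probe timer first, then the RTO."""
    if not probe_sent:
        return ("PROBE", 2 * rtt)   # fire the tail probe early
    return ("RTO", rto)             # fall back to the regular timeout

# With RTT = 50 ms and RTO = 1 s, the probe fires long before the RTO would.
kind, delay = next_timer(rtt=0.05, rto=1.0, probe_sent=False)
kind2, delay2 = next_timer(rtt=0.05, rto=1.0, probe_sent=True)
```

The point of the design is visible in the two delays: the probe converts a would-be RTO (hundreds of milliseconds or more) into a fast-retransmit opportunity a couple of RTTs after the loss.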

SLIDE 16

Reactive: Detecting Masked Losses

We cannot ignore the case where a packet loss is masked because the Reactive probe recovered it: count the ACKs, and reduce the congestion window if only one ACK for the tail packet is received.

SLIDE 17

Reactive: Detecting Masked Losses

[Diagram: two outcomes after probing the tail packet]

  • One ACK for the tail packet only: loss → reduce the congestion window
  • Two ACKs (original and probe): no loss
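The ACK-counting rule can be sketched as follows (a toy model, not the real TCP state machine; `probe_outcome` and the cwnd-halving policy are illustrative):

```python
# Toy model of the masked-loss rule: after probing the tail segment,
# count the ACKs it elicits. A single ACK means only the probe got
# through, so the original tail packet was lost and the congestion
# window must still be reduced; two ACKs mean nothing was lost.

def probe_outcome(acks_for_tail, cwnd):
    """Return (loss_detected, new_cwnd) after a Reactive tail probe."""
    if acks_for_tail == 1:              # only the probe was ACKed
        return True, max(1, cwnd // 2)  # treat as loss: halve cwnd
    return False, cwnd                  # original + probe ACKed: no loss

lost, cwnd_after_loss = probe_outcome(acks_for_tail=1, cwnd=10)
ok, cwnd_no_loss = probe_outcome(acks_for_tail=2, cwnd=10)
```

This is what keeps the probe from making TCP more aggressive than it should be: a masked loss still pays the congestion-control penalty.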

SLIDE 18

Setting

[Diagram: client ↔ (public network) ↔ frontend server ↔ (private network) ↔ backend server]

Reactive

Trigger fast retransmit by retransmitting the tail packet early

Proactive

Avoid retransmissions through packet duplication

Corrective

Add redundancy to enable recovery without retransmission, or trigger fast retransmit
SLIDE 20

Proactive

[Timeline: packets 1-3 sent, the tail packet is lost; without duplication the sender would wait until the RTO fires]

Avoid almost all retransmissions through packet duplication.

SLIDE 21

Proactive

[Timeline: each of packets 1-3 is sent together with a duplicate; a duplicate is used if the original transmission was lost]

Avoids loss detection and recovery
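The duplication scheme can be sketched as a toy send/receive pipeline (illustrative only; the real Proactive duplicates TCP segments on the wire rather than Python values):

```python
# Toy pipeline for Proactive: every segment goes out twice, and the
# receiver keeps the first copy that arrives, so any single loss is
# absorbed without triggering loss detection (at 100% bandwidth overhead).

def transmit(segments):
    """Put each segment on the wire twice."""
    wire = []
    for seg in segments:
        wire.extend([seg, seg])
    return wire

def receive(wire, drop_slots):
    """Deliver deduplicated segments; drop_slots are lost wire positions."""
    delivered = []
    for slot, seg in enumerate(wire):
        if slot in drop_slots or seg in delivered:
            continue
        delivered.append(seg)
    return delivered

wire = transmit([1, 2, 3])           # doubles every segment on the wire
got = receive(wire, drop_slots={4})  # the original copy of segment 3 is lost
```

Even with a drop, the receiver still delivers all three segments, which is why Proactive sidesteps recovery entirely; the cost is the doubled transmission, which is why it is only used on the over-provisioned backend connection.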

SLIDE 22

A/B Experiment Setup

[Diagram: frontend and backend connections, experiment vs. control buckets]

Experimented in a production environment serving billions of queries (millions of queries are sampled): Reactive on the frontend connection and Proactive on the backend connection, each compared against default TCP.

SLIDE 23

Impact of Reactive and Proactive

15-day experiment, 2.6 million queries sampled: mean response time reduced by 23%; 99th-percentile response time reduced by 47%.

Impact of Proactive: retransmission rates on the backend connection dropped from 0.99% to 0.09%.
Impact of Reactive: almost 50% of retransmission timeouts on the frontend connection are converted into fast retransmits.

SLIDE 24

Corrective: The Middle Way

  • Reactive speeds up loss detection, but still requires recovery
  • Proactive avoids loss detection and recovery, but has 100% overhead

Corrective


SLIDE 25

Corrective: Forward Error Correction in TCP

[Timeline: packets 1-3 sent, the tail packet is lost; the sender waits until the RTO fires]

Add redundancy to enable recovery without retransmission.

SLIDE 26

Corrective: Forward Error Correction in TCP

[Timeline: packets 1-3 followed by an ENCODED segment]

  • Encodes previously transmitted segments into a few coded segments
  • XOR coding can recover a single packet loss at the receiver; no loss detection is required
  • The recovery status is signaled back to the sender to enforce congestion control or trigger fast retransmit

Speeds up loss detection and recovery
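The XOR coding is easy to demonstrate concretely. A toy sketch (equal-length payloads assumed for simplicity; the real Corrective design handles framing and signaling in TCP options):

```python
# Toy demonstration of single-loss recovery with an XOR parity segment:
# XOR a window of equal-length payloads into one coded payload; any one
# missing payload is rebuilt by XOR-ing the parity with the survivors.

from functools import reduce

def xor_encode(payloads):
    """XOR equal-length byte strings into one parity payload."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), payloads)

def recover(survivors, parity):
    """Reconstruct the single missing payload from the survivors + parity."""
    return xor_encode(survivors + [parity])

window = [b"seg1", b"seg2", b"seg3"]
parity = xor_encode(window)
# Suppose seg2 is lost in transit: the receiver rebuilds it locally,
# without any retransmission.
rebuilt = recover([window[0], window[2]], parity)
```

Because XOR is its own inverse, one parity segment per window recovers exactly one lost segment, which matches the dominant loss pattern from slide 8 (35% of lossy bursts lose a single packet).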

SLIDE 27

Evaluation: Corrective

  • Network emulator
  • Synthetic workloads (fixed-size single queries)
  • Web page downloads (complex multi-resource queries)

SLIDE 28

Loading nytimes.com with Corrective

Tail latency reduced by more than 20%. But: performance is slightly worse on loss-free connections.

SLIDE 29

Dealing with Middleboxes

Protocol changes need to account for middlebox interference. We designed our modules for middlebox compatibility, or graceful fallback to standard TCP.

SLIDE 30

Dealing with Middleboxes

Interference                                 | Mitigation
Unknown option in a data packet is stripped  | Require the option in all packets
ACK number is rewritten for unseen sequences | Resend the lost segment to update middlebox state
Modified retransmission payload is rejected  | Detect tampering through a checksum
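The checksum-based tamper check in the last row can be sketched like this (illustrative; CRC32 merely stands in for whichever checksum the actual design carries in its TCP option):

```python
# Toy version of the tamper check: attach a checksum of the payload
# (CRC32 here, purely illustrative) so the receiver can reject a
# retransmission whose payload a middlebox has rewritten.

import zlib

def make_option(payload):
    """Checksum the sender carries alongside the (re)transmitted payload."""
    return zlib.crc32(payload)

def accept(payload, option_checksum):
    """Receiver-side check: True only if the payload arrived unmodified."""
    return zlib.crc32(payload) == option_checksum

original = b"coded-segment"
option = make_option(original)
tampered = b"CODED-SEGMENT"  # a middlebox rewrote the payload
```

If the check fails, the receiver can fall back to standard TCP behavior rather than feeding corrupted data into the decoder.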

SLIDE 31

Conclusion

In a measurement study analyzing billions of flows in Google's production environment, we found that lossy transfers are far slower, largely because recovery relies on slow retransmission timeouts triggered by tail losses. Analysis of these loss patterns motivated three designs to improve latency: Reactive, Proactive, and Corrective. Reactive and Proactive improved Google's mean response time by 23%. Reactive and Corrective are IETF Internet-Drafts; Reactive is implemented and enabled by default in Linux 3.10.

SLIDE 32

Reducing Web Latency: The Virtue of Gentle Aggression

Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Yuchung Cheng, Neal Cardwell, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan USC & Google August 14, 2013

SLIDE 33

Additional Slides

SLIDE 34

Why aren’t you just using a more aggressive RTO value?

  • Increases the risk of spurious retransmissions
  • Severely impacts TCP performance, due to a potentially larger number of unnecessary retransmissions and reductions of the congestion window
  • Delays can also be the result of delayed ACKs

SLIDE 35

Why are you doing Corrective on the Transport Layer?

Application layer:
  • Applications can selectively protect important data parts
  • But a reliable transport protocol would recover the redundant data anyway
  • The application does not know which packets are prone to loss

Transport layer:
  • Has the necessary data to configure and tune Corrective (e.g. packets with a higher loss probability, congestion window size, loss rate, RTT)
  • Adds protocol complexity

SLIDE 36

Design Space

Phase                 | Decreased slightly | Increased slightly   | Increased greatly
Startup / Short flows |                    | IW 10                |
Steady state          | TCP Vegas          | CUBIC                | Relentless / Decongestion, DDoS Defense by Offense
Loss recovery         | Timeout moderation | Reactive, Corrective | Proactive