SLIDE 1

DeTail

Reducing the Tail of Flow Completion Times in Datacenter Networks

David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, Randy Katz

SLIDE 2

A Typical Facebook Page

Modern pages have many components

SLIDE 3

Creating a Page

[Diagram: a page request arrives from the Internet at a Front End server, which issues data retrievals across the Datacenter Network to back-end services such as News Feed, Search, Ads, and Chat]

SLIDE 4

What’s Required?

  • Servers must perform 100’s of data retrievals*

– Many of which must be performed serially

  • While meeting a deadline of 200-300ms**

– SLA measured at the 99.9th percentile**

  • Only have 2-3ms per data retrieval

– Including communication and computation

*The Case for RAMClouds [SIGOPS’09]  **Better Never than Late [SIGCOMM’11]
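
A quick back-of-the-envelope check of the per-retrieval budget, assuming roughly 100 serial data retrievals on the critical path (the slide only says 100’s, so the exact count is an assumption):

    # Hedged sketch: how the 2-3ms figure follows from the slide's numbers
    page_deadline_ms = (200, 300)       # SLA window for building the page
    serial_retrievals = 100             # assumed count of serial retrievals on the critical path
    budget_ms = tuple(d / serial_retrievals for d in page_deadline_ms)
    print(budget_ms)                    # (2.0, 3.0) ms each, covering communication and computation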

SLIDE 5

What is the Network’s Role?

  • Analyzed distribution of RTT measurements:
  • Median RTT takes 334μs, but 6% take over 2ms
– Delays can be as high as 14ms

Source: Data Center TCP (DCTCP) [SIGCOMM’10]

Network delays alone can consume the data retrieval’s time budget

SLIDE 6

Why the Tail Matters

  • Recall: 100’s of data retrievals per page creation
  • The unlikely event of a data retrieval taking too long is likely to happen on every page creation (a quick calculation below makes this concrete)

– Data retrieval dependencies can magnify impact
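
To make “the unlikely event is likely to happen” concrete, here is a quick calculation using the 6% figure from the previous slide and the 150-retrieval count from the next one (treating retrievals as independent is an assumption):

    # Probability that at least one of 150 data retrievals hits the >2ms tail,
    # assuming each retrieval independently has a 6% chance of doing so
    p_slow = 0.06
    retrievals = 150
    p_page_hits_tail = 1 - (1 - p_slow) ** retrievals
    print(f"{p_page_hits_tail:.4%}")    # ~99.99% of page creations see at least one slow retrieval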

SLIDE 7

Impact on Page Creation

  • Under the RTT distribution, 150 data retrievals take 200ms (ignoring computation time)

As Facebook is already at 130 data retrievals per page, network delays need to be addressed

SLIDE 8

App-Level Mitigation

  • Use timeouts & retries for critical data retrievals

– Inefficient because of high network variance
– Must choose between conservative timeouts with long delays, or tight timeouts with increased server load

  • Hide the problem from the user

– By caching and serving stale data
– By rendering pages incrementally
– The user often notices and becomes annoyed / frustrated

Need to focus on the root cause

SLIDE 9

Outline

  • Causes of long data retrieval times
  • Cutting the tail with DeTail
  • Evaluation

SLIDE 10

Causes of Long Data Retrieval Times

  • Data retrievals are short, highly variable flows

– Typically under 20KB in size, with many under 2KB*

  • Short flows provide insufficient information for transport to agilely respond to packet drops

  • Variable flow sizes decrease the efficacy of network-layer load balancers

*Data Center TCP (DCTCP) [SIGCOMM’10]

SLIDE 11

Transport Layer Response

[Figure: a packet drop in a short flow leads to a retransmission timeout]

Transport does not have sufficient information to respond agilely

SLIDE 12

Network Layer Load Balancers

  • Expected to support single-path assumption
  • Common approach: hash flows to paths

– Does not consider flow size or sending rate

  • Results in uneven load spreading

– Leads to hotspots and increased queuing delays

The single-path assumption restricts the ability to agilely balance load
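
A minimal sketch of the flow-hashing approach described above (illustrative only; real switches hash in hardware, and the helper below is not any particular switch’s code):

    import hashlib

    def pick_path(five_tuple, paths):
        # The same flow always hashes to the same path (single-path assumption),
        # regardless of the flow's size or sending rate.
        digest = hashlib.sha1(repr(five_tuple).encode()).hexdigest()
        return paths[int(digest, 16) % len(paths)]

    # Two large flows that hash to the same path create a hotspot and queuing
    # delays, while other equal-cost paths may sit idle.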

SLIDE 13

Recent Proposals

  • Reduce packet drops

– By cross-flow learning [DCTCP] or explicit flow scheduling [D3]
– Maintain the single-path assumption

  • Adaptively move traffic

– By creating subflows [MPTCP] or periodically remapping flows [Hedera]
– Not sufficiently agile to support short flows

SLIDE 14

Outline

  • Causes of long data retrieval times
  • Cutting the tail with DeTail
  • Evaluation

SLIDE 15

DeTail Stack

  • Use in-network mechanisms to maximize agility
  • Remove restrictions that hinder performance
  • Well-suited for datacenters

– Single administrative domain
– Reduced backward compatibility requirements

SLIDE 16

Hop-by-hop Push-back

  • Agile link-layer response to prevent packet drops
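
A minimal sketch of what hop-by-hop push-back can look like, in the spirit of priority flow control; the thresholds and helper names (send_pause, send_resume) are assumptions, not DeTail’s actual switch logic:

    PAUSE_THRESHOLD = 16000     # bytes buffered before pausing the upstream hop (assumed value)
    RESUME_THRESHOLD = 8000     # bytes remaining before letting it resume (assumed value)

    def on_enqueue(queue, pkt):
        queue.buffered += pkt.size
        if queue.buffered > PAUSE_THRESHOLD and not queue.paused:
            queue.send_pause()      # hypothetical helper: tell the upstream hop to stop sending
            queue.paused = True

    def on_dequeue(queue, pkt):
        queue.buffered -= pkt.size
        if queue.buffered < RESUME_THRESHOLD and queue.paused:
            queue.send_resume()     # hypothetical helper: let the upstream hop send again
            queue.paused = False

Because packets wait upstream instead of being dropped, the head-of-line blocking question below follows naturally.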

What about head-of-line blocking?

SLIDE 17

Adaptive Load Balancing

  • Agile network-layer approach for balancing load

Synergistic relationship: local output queues indicate downstream congestion because of push-back

SLIDE 18

Load Balancing Efficiently

  • DC flows have varying timeliness requirements*

– How to efficiently consider packet priority?

  • Compare queue occupancies for every decision

– How to efficiently compare many of them?

*Data Center TCP (DCTCP) [SIGCOMM’10]

SLIDE 19

Priority in Load Balancing

[Figure: an arriving packet can be placed in Output Queue 1 or Output Queue 2, each holding High Priority and Low Priority packets; the ideal choice is based on queue occupancy]

How to enqueue the packet so it is sent soonest?

SLIDE 20

Priority in Load Balancing

  • Approach: track how many bytes will be sent before the new packet

  • Use per-priority counters

– Update on each packet enqueue/dequeue
– Compare counters to find the least occupied port (see the sketch below)
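
A minimal sketch of these per-priority counters, assuming priority 0 is highest and a packet waits behind everything at its own priority or higher (illustrative, not the DeTail sources):

    class OutputPort:
        def __init__(self, num_priorities):
            self.bytes_queued = [0] * num_priorities    # per-priority byte counters

        def enqueue(self, pkt):
            self.bytes_queued[pkt.priority] += pkt.size

        def dequeue(self, pkt):
            self.bytes_queued[pkt.priority] -= pkt.size

        def bytes_before(self, priority):
            # Bytes that would be sent before a newly arriving packet of this priority
            return sum(self.bytes_queued[:priority + 1])

    def least_occupied(ports, pkt):
        return min(ports, key=lambda p: p.bytes_before(pkt.priority))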

SLIDE 21

Comparing Queue Occupancies

  • Many counter comparisons required for every forwarding decision

  • Want to efficiently pick the least occupied port

– Pre-computation is hard, as the solution is destination- and time-dependent

SLIDE 22

Use Per-Counter Thresholding

  • Pick a good port, instead of the best one

[Figure: for the packet’s priority, ports whose queues are below threshold T form a Favored Ports bitmap (1011); the forwarding entry for the destination address gives an Acceptable Ports bitmap (0101); ANDing the two yields the Selected Port (0001)]
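
A minimal sketch of the thresholding step pictured above, reusing bytes_before from the earlier counter sketch; the threshold value and the fall-back policy when no port is under threshold are assumptions:

    THRESHOLD = 4000    # bytes; a queue below this counts as "good enough" (assumed value)

    def select_port(ports, priority, acceptable_mask):
        # Favored ports: bit i is set if port i's counter for this priority is under the threshold
        favored_mask = 0
        for i, port in enumerate(ports):
            if port.bytes_before(priority) < THRESHOLD:
                favored_mask |= 1 << i
        candidates = favored_mask & acceptable_mask     # acceptable_mask comes from the forwarding entry
        if candidates == 0:
            candidates = acceptable_mask                # nothing under threshold: fall back (assumed policy)
        return (candidates & -candidates).bit_length() - 1   # index of one selected port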

SLIDE 23

Reorder-Resistant Transport

  • Handle packet reordering due to load balancing

– Disable TCP’s fast recovery and fast retransmission

  • Respond to congestion (no more packet drops)

– Monitor output queues and use ECN to throttle flows
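
A minimal sketch of the queue-monitoring side, in the spirit of DCTCP-style ECN marking (the threshold is an assumed value, and this is not DeTail’s actual implementation):

    ECN_MARK_THRESHOLD = 30000      # bytes of queued data before marking (assumed value)

    def maybe_mark(port, pkt):
        # With push-back preventing drops, congestion is signaled by marking instead:
        # the receiver echoes the mark and the sender slows down.
        if sum(port.bytes_queued) > ECN_MARK_THRESHOLD:
            pkt.ecn_ce = True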

SLIDE 24

DeTail Stack

Layer      | Component                    | Function
Transport  | Reorder-Resistant Transport  | Support lower layers
Network    | Adaptive Load Balancing      | Evenly balance load
Link       | Hop-by-hop Push-back         | Prevent packet drops

(Application and Physical layers are unchanged)

SLIDE 25

Outline

  • Causes of long data retrieval times
  • Cutting the tail with DeTail
  • Evaluation

SLIDE 26

Simulation and Implementation

  • NS-3 simulation
  • Click implementation

– Drivers and NICs buffer hundreds of packets
– Must rate-limit Click to underflow buffers

SLIDE 27

Topology

  • FatTree: 128-server (NS-3) / 16-server (Click)
  • Oversubscription factor of 4x

[Figure: three-tier FatTree topology with Core, Aggregation (Agg), and Top-of-Rack (ToR) switch tiers]

Reproduced From: A Scalable Commodity Datacenter Network Architecture [SIGCOMM’08]
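
For reference, the 128- and 16-server figures match a standard k-ary FatTree with k = 8 and k = 4 respectively (k^3/4 hosts, per the cited paper); whether the evaluated topology used exactly these parameters is an assumption:

    def fattree_hosts(k):
        # A k-ary FatTree supports k^3 / 4 hosts (formula from the cited FatTree paper)
        return k ** 3 // 4

    print(fattree_hosts(8))     # 128 -> NS-3 topology
    print(fattree_hosts(4))     # 16  -> Click testbed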

SLIDE 28

Setup

  • Baseline

– TCP NewReno
– Flow hashing based on IP headers
– Prioritization of data retrievals vs. background

  • Metric

– Reduction in 99.9th percentile completion time

SLIDE 29

Page Creation Workload

  • Retrieval size: 2, 4, 8, 16, 32 KB*
  • Background traffic: 1MB flows

*Covers range of query traffic sizes reported by DCTCP

DeTail reduces 99.9th percentile page creation time by over 50%

SLIDE 30

Is the Whole Stack Necessary?

  • Evaluated push-back w/o adaptive load balancing

– Performs worse than baseline

DeTail’s mechanisms work together, overcoming their individual limitations

SLIDE 31

What About Link Failures?

  • 10s of link failures occur per day*

– Creates permanent network imbalance

  • Example

– Core-Agg link degrades from 1Gbps to 100Mbps
– DeTail achieves 91% reduction in the 99.9th percentile

DeTail effectively moves traffic away from failures, appropriately balancing load

*Understanding Network Failures in Data Centers [SIGCOMM’11]

SLIDE 32

What About Long Background Flows?

  • Background Traffic: 1, 16, 64MB flows*
  • Light data retrieval traffic

DeTail’s adaptive load balancing also helps long flows

*Covers range of update flow sizes reported by DCTCP

SLIDE 33

Conclusion

  • Long tail harms page creation

– The extreme case becomes the common case
– Limits the number of data retrievals per page

  • The DeTail stack improves long tail performance

– Can reduce the 99.9th percentile by more than 50%
