

SLIDE 1

6.888 Lecture 5: Flow Scheduling

Mohammad Alizadeh

Spring 2016


SLIDE 2

Datacenter Transport

Goal: Complete flows quickly / meet deadlines


Short flows (e.g., query, coordination) → Low Latency

Large flows (e.g., data update, backup) → High Throughput

SLIDE 3

Low Latency Congestion Control

(DCTCP, RCP, XCP, …)

Keep network queues small (at high throughput)

Implicitly prioritizes mice

Can we do better?

SLIDE 4

The Opportunity

Many DC apps/platforms know flow size or deadlines in advance

  • Key/value stores
  • Data processing
  • Web search

[Figure: partition/aggregate application — front-end server → aggregators → workers]

SLIDE 5

What You Said

Amy: “Many papers that propose new network protocols for datacenter networks (such as PDQ and pFabric) argue that these will improve "user experience for web services". However, none seem to evaluate the impact of their proposed scheme on user experience… I remain skeptical that small protocol changes really have drastic effects on end-to-end metrics such as page load times, which are typically measured in seconds rather than in microseconds.”


SLIDE 6

[Figure: hosts H1–H9 (TX) connected to hosts H1–H9 (RX) — the datacenter fabric abstracted as one big switch]

SLIDE 7

[Figure: hosts H1–H9 (TX) to hosts H1–H9 (RX) across the fabric]

Objective?

  • Minimize avg FCT
  • Minimize missed deadlines

DC transport = flow scheduling on a giant switch with ingress & egress capacity constraints

SLIDE 8

Example: Minimize Avg FCT

Flow A: size 1; Flow B: size 2; Flow C: size 3

All flows arrive at the same time and share the same bottleneck link.

[Timeline: A B B C C C]

Adapted from slide by Chi-Yao Hong (UIUC)

SLIDE 9

Example: Minimize Avg FCT

[Figure: throughput vs. time under the two schedules]

Fair sharing: flows complete at times 3, 5, 6 → mean FCT 4.67

Shortest flow first: flows complete at times 1, 3, 6 → mean FCT 3.33

Adapted from slide by Chi-Yao Hong (UIUC)
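A minimal sketch (added here, not from the deck) that reproduces these numbers for flows that arrive together on a unit-rate bottleneck; sizes are in units of transmission time:

```python
# Mean FCT under fair sharing vs. shortest flow first on one unit-rate link.

def fct_fair_sharing(sizes):
    """Processor sharing: every active flow gets an equal share of the link."""
    remaining = sorted(sizes)
    t, fcts = 0.0, []
    while remaining:
        n = len(remaining)
        t += remaining[0] * n          # smallest flow finishes first at rate 1/n
        remaining = [r - remaining[0] for r in remaining[1:]]
        fcts.append(t)
    return fcts

def fct_sjf(sizes):
    """Shortest flow first: run flows back-to-back in size order."""
    t, fcts = 0.0, []
    for s in sorted(sizes):
        t += s
        fcts.append(t)
    return fcts

sizes = [1, 2, 3]                      # flows A, B, C from the slide
print(fct_fair_sharing(sizes))         # [3.0, 5.0, 6.0] -> mean 4.67
print(fct_sjf(sizes))                  # [1.0, 3.0, 6.0] -> mean 3.33
```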

SLIDE 10

Optimal Flow Scheduling for Avg FCT

NP-hard for multi-link networks [Bar-Noy et al.]

– Shortest Flow First: 2-approximation

[Figure: example multi-link network with flows 1, 2, 3]

SLIDE 11

How can we schedule flows based on flow criticality in a distributed way?

Some transmission order

SLIDE 12

PDQ

Several slides based on presentation by Chi-Yao Hong (UIUC)

SLIDE 13

PDQ: Distributed Explicit Rate Control

[Figure: sender → switch → switch → receiver; the packet header carries the flow's criticality and a rate field that switches can lower (e.g., 10 → 5)]

Switch preferentially allocates bandwidth to critical flows

Contrast: traditional explicit rate control targets fair sharing (e.g., XCP, RCP)
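A hedged sketch of the allocation idea (a simplification, not the paper's full algorithm; the field names are illustrative): the switch fills the most critical flows first and pauses the rest.

```python
# Simplified PDQ-style preferential allocation at one switch. Lower
# criticality value = more critical (e.g., earlier deadline or smaller
# remaining size). Flows that don't fit are paused (rate 0).

def pdq_allocate(flows, capacity):
    """flows: list of (flow_id, criticality, demand). Returns {flow_id: rate}."""
    rates, remaining = {}, capacity
    for flow_id, _crit, demand in sorted(flows, key=lambda f: f[1]):
        rate = min(demand, remaining)   # most critical flow is filled first
        rates[flow_id] = rate
        remaining -= rate
    return rates

# Two flows contend for a 10 Gbps link: the more critical one runs at
# line rate, the other is paused until it finishes.
print(pdq_allocate([("A", 2, 10), ("B", 7, 10)], capacity=10))
# {'A': 10, 'B': 0}
```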

SLIDE 14

Contrast with Traditional Explicit Rate Control

Traditional schemes (e.g., RCP, XCP) target fair sharing

[Figure: sender → switch → switch → receiver; the packet header carries a rate field that switches can lower (e.g., 10 → 5)]

  • Each switch determines a “fair share” rate based on local congestion: R ← R − k × (congestion measure)
  • Sources use the smallest rate advertised on their path
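A toy sketch of this style of control (the constant k and the congestion measure are assumptions; real RCP/XCP updates also use RTT and queue terms):

```python
# Toy RCP/XCP-style explicit rate control, not the exact published
# equations. Each switch periodically adjusts its advertised fair-share
# rate R from local congestion; each source sends at the minimum rate
# stamped along its path.

def update_fair_rate(R, arrival_rate, capacity, k=0.5):
    """R <- R - k * congestion-measure, here (arrival_rate - capacity):
    decrease R when the link is overloaded, increase it when underused."""
    return max(R - k * (arrival_rate - capacity), 0.0)

def source_rate(advertised):
    """A source is limited by the most congested switch on its path."""
    return min(advertised)

R = update_fair_rate(R=10.0, arrival_rate=12.0, capacity=10.0)
print(R)                              # 9.0: overloaded link lowers its rate
print(source_rate([9.0, 10.0, 7.5]))  # 7.5: the bottleneck hop wins
```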

SLIDE 15

Challenges

  • PDQ switches need to agree on rate decisions
  • Low utilization during flow switching
  • Congestion and queue buildup
  • Paused flows need to know when to start

SLIDE 16

Challenge: Switches need to agree on rate decisions

[Figure: sender → switch → switch → receiver; the packet header carries criticality, rate = 10, pauseby = X]

  • What can go wrong without consensus?
  • How do PDQ switches reach consensus?
  • Why is “pauseby” needed?

SLIDE 17

What You Said

Austin: “It is an interesting departure from AQM in that, with the concept of paused queues, PDQ seems to leverage senders as queue memory.”

SLIDE 18

Challenge: Low utilization during flow switching

[Figure: goal — flows B, A, C hand off the link back-to-back; in practice, the link sits idle for 1–2 RTTs between flows]

How does PDQ avoid this?

SLIDE 19

Early Start: Seamless flow switching

[Figure: throughput vs. time — without Early Start, throughput dips for ~2 RTTs when the next set of flows starts]

SLIDE 20

Early Start: Seamless flow switching

[Figure: throughput vs. time — with Early Start there is no dip, but the queue grows while flows overlap]

Solution: rate controller at switches [XCP/TeXCP/D3]

SLIDE 21

Discussion


SLIDE 22

Mean FCT

[Figure: mean flow completion time for TCP, RCP, PDQ w/o Early Start, and PDQ, normalized to a lower bound: an omniscient scheduler with zero control feedback delay]

SLIDE 23

What if flow size is not known?

Why does flow size estimation (criticality = bytes sent) work better for Pareto flow-size distributions?

SLIDE 24

Other questions

  • Fairness: can long flows starve?
  • Resilience to error: what if a packet gets lost or flow information is inaccurate?
  • Multipath: does PDQ benefit from multipath?

99% of jobs complete faster under SJF than under fair sharing
[Bansal, Harchol-Balter; SIGMETRICS’01]

Assumption: heavy-tailed flow size distribution
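To make the flavor of that result concrete, here is a toy discrete-time simulation (my construction, not Bansal and Harchol-Balter's analysis; all parameters are arbitrary assumptions): jobs with heavy-tailed Pareto sizes arrive at one unit-rate server, and we count how many finish at least as fast under SRPT (the preemptive queueing analogue of SJF) as under processor sharing.

```python
# Toy simulation: Poisson arrivals, Pareto job sizes, one unit-rate server.
# Compare per-job completion times under SRPT vs. processor sharing.

import random

def simulate(arrivals, sizes, policy, dt=0.02):
    rem, done, t = list(sizes), [None] * len(sizes), 0.0
    while any(d is None for d in done):
        active = [i for i, d in enumerate(done) if d is None and arrivals[i] <= t]
        if active:
            if policy == "srpt":
                served = [min(active, key=lambda i: rem[i])]  # one job, full rate
            else:
                served = active                               # equal shares
            for i in served:
                rem[i] -= dt / len(served)
                if rem[i] <= 0:
                    done[i] = t
        t += dt
    return done

random.seed(1)
n, lam = 100, 0.25                     # load ~ lam * E[size] = 0.25 * 3
arrivals, t = [], 0.0
for _ in range(n):
    t += random.expovariate(lam)
    arrivals.append(t)
sizes = [random.paretovariate(1.5) for _ in range(n)]  # heavy-tailed, mean 3

srpt = simulate(arrivals, sizes, "srpt")
ps = simulate(arrivals, sizes, "ps")
frac = sum(a <= b for a, b in zip(srpt, ps)) / n
print(f"{100 * frac:.0f}% of jobs finish at least as fast under SRPT")
```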

SLIDE 25

pFabric


SLIDE 26

pFabric in 1 Slide

Packets carry a single priority #

  • e.g., prio = remaining flow size

pFabric Switches

  • Send highest priority / drop lowest priority pkts
  • Very small buffers (20-30KB for 10Gbps fabric)

pFabric Hosts

  • Send/retransmit aggressively
  • Minimal rate control: just prevent congestion collapse


Main Idea: Decouple scheduling from rate control
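A hedged sketch of the switch behavior (a simplification: real pFabric dequeues the earliest packet of the flow with the highest-priority packet, to keep packets of a flow in order; here priorities are plain integers, lower = more important):

```python
# Simplified pFabric switch queue: dequeue the highest-priority packet;
# when the small buffer is full, evict the lowest-priority packet to
# admit a more important arrival.

class PFabricQueue:
    def __init__(self, capacity_pkts):
        self.capacity = capacity_pkts
        self.buf = []  # list of (priority, packet); lower priority = better

    def enqueue(self, priority, packet):
        if len(self.buf) < self.capacity:
            self.buf.append((priority, packet))
            return True
        worst = max(range(len(self.buf)), key=lambda i: self.buf[i][0])
        if priority < self.buf[worst][0]:
            self.buf[worst] = (priority, packet)  # drop lowest-priority packet
            return True
        return False  # the arrival itself is least important: drop it

    def dequeue(self):
        if not self.buf:
            return None
        best = min(range(len(self.buf)), key=lambda i: self.buf[i][0])
        return self.buf.pop(best)[1]

q = PFabricQueue(capacity_pkts=2)
q.enqueue(9, "bulk"); q.enqueue(3, "query-a"); q.enqueue(1, "query-b")
print(q.dequeue(), q.dequeue())  # query-b query-a  ("bulk" was evicted)
```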

SLIDE 27

pFabric Switch

Priority scheduling and dropping boil down to a sort

– Essentially unlimited priorities
– Thought to be difficult in hardware: existing switches only support 4–16 priority classes

Feasible because pFabric queues are very small

  • 51.2ns (one 64B packet time at 10Gbps) to find the min/max of ~600 numbers
  • Binary comparator tree: ~10 clock cycles
  • Current ASICs: clock ~ 1ns
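A small sketch of that feasibility arithmetic (the tree below mimics the hardware reduction in software; the 600-entry queue size is the slide's figure):

```python
# A binary comparator tree finds the min of N values in ceil(log2(N))
# comparator stages; at ~1ns per stage that is ~10ns for ~600 values,
# well inside the 51.2ns it takes one 64B packet to arrive at 10Gbps.

import math

def comparator_tree_min(values):
    """Pairwise-compare values level by level, as a hardware tree would."""
    level = list(values)
    while len(level) > 1:
        level = [min(level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

print(math.ceil(math.log2(600)))             # 10 comparator stages ~ 10ns
print(64 * 8 / 10e9 * 1e9)                   # 51.2 ns per 64B packet at 10Gbps
print(comparator_tree_min([7, 3, 9, 1, 5]))  # 1
```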

SLIDE 28

pFabric Rate Control

Minimal version of TCP algorithm

  • 1. Start at line rate
    – Initial window larger than BDP
  • 2. No retransmission timeout estimation
    – Fixed RTO at a small multiple of the round-trip time
  • 3. Reduce window size upon packet drops
    – Window increase same as TCP (slow start, congestion avoidance, …)
  • 4. After multiple consecutive timeouts, enter “probe mode”
    – Probe mode sends minimum-size packets until the first ACK

What about queue buildup? Why window control?
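A hedged sketch of these four rules as a sender-side state machine (the RTO multiple and the timeout threshold are assumptions, not the paper's exact constants):

```python
# Minimal pFabric-style rate control; window is in packets.

class MinimalRateControl:
    def __init__(self, bdp_pkts, rtt):
        self.cwnd = bdp_pkts + 1      # 1. start at line rate: window > BDP
        self.rto = 3 * rtt            # 2. fixed RTO, no estimation
        self.consec_timeouts = 0
        self.probe_mode = False

    def on_ack(self):
        self.consec_timeouts = 0
        if self.probe_mode:
            self.probe_mode = False   # 4. first ACK ends probe mode...
            self.cwnd = 1             #    ...restart cautiously
        else:
            self.cwnd += 1            # TCP-like window increase (simplified)

    def on_timeout(self):
        self.consec_timeouts += 1
        if self.consec_timeouts >= 5:
            self.probe_mode = True    # 4. probe with minimum-size packets
        else:
            self.cwnd = max(1, self.cwnd // 2)  # 3. reduce window on drops
```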

SLIDE 29

Why does pFabric work?

Key invariant:

At any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch.

Priority scheduling

  • High-priority packets traverse the fabric as quickly as possible

What about dropped packets?

  • Lowest priority → not needed till all other packets depart
  • Buffer > BDP → enough time (> RTT) to retransmit

SLIDE 30

Discussion


SLIDE 31

Overall Mean FCT

[Figure: mean FCT (normalized to optimal in an idle fabric) vs. load (0.1–0.8) for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]

SLIDE 32

Mice FCT (<100KB)

[Figure: normalized FCT vs. load (0.1–0.8), two panels — Average and 99th Percentile — for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]

SLIDE 33

Elephant FCT (>10MB)

[Figure: average normalized FCT vs. load (0.2–0.8) for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]

Why the gap?

SLIDE 34

Loss Rate vs Packet Priority

(at 80% load)

* Loss rate at other hops is negligible

Almost all packet loss is for large (latency-insensitive) flows

SLIDE 35

Next Time: Multi-Tenant Performance Isolation
