

SLIDE 1

6.888 Lecture 5: Flow Scheduling

Mohammad Alizadeh

Spring 2016


SLIDE 2

Datacenter Transport

Goal: Complete flows quickly / meet deadlines


Short flows (e.g., query, coordination) → Low Latency

Large flows (e.g., data update, backup) → High Throughput

SLIDE 3

Low Latency Congestion Control

(DCTCP, RCP, XCP, …)

Keep network queues small (at high throughput)

Implicitly prioritizes mice

Can we do better?

SLIDE 4

The Opportunity

Many DC apps/platforms know flow size or deadlines in advance

  • Key/value stores
  • Data processing
  • Web search

[Figure: partition/aggregate application — front-end server → aggregators → workers]

SLIDE 5

What You Said

Amy: “Many papers that propose new network protocols for datacenter networks (such as PDQ and pFabric) argue that these will improve "user experience for web services". However, none seem to evaluate the impact of their proposed scheme on user experience… I remain skeptical that small protocol changes really have drastic effects on end-to-end metrics such as page load times, which are typically measured in seconds rather than in microseconds.”


SLIDE 6

[Figure: hosts H1–H9 (TX) connected to hosts H1–H9 (RX) — the datacenter fabric abstracted as one big switch]

SLIDE 7

[Figure: hosts H1–H9 (TX) to hosts H1–H9 (RX) across the fabric]

Objective?

  • Minimize avg FCT
  • Minimize missed deadlines

DC transport = flow scheduling on a giant switch with ingress & egress capacity constraints

SLIDE 8

Example: Minimize Avg FCT

Flow A: size 1; Flow B: size 2; Flow C: size 3

All flows arrive at the same time and share the same bottleneck link.

[Timeline: A B B C C C]

Adapted from slide by Chi-Yao Hong (UIUC)

SLIDE 9

Example: Minimize Avg FCT

[Figure: throughput vs. time under the two schedules]

Fair sharing: flows complete at times 3, 5, 6 → mean FCT 4.67

Shortest flow first: flows complete at times 1, 3, 6 → mean FCT 3.33

Adapted from slide by Chi-Yao Hong (UIUC)
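A minimal sketch (added here, not from the deck) that reproduces these numbers for flows that arrive together on a unit-rate bottleneck; sizes are in units of transmission time:

```python
# Mean FCT under fair sharing vs. shortest flow first on one unit-rate link.

def fct_fair_sharing(sizes):
    """Processor sharing: every active flow gets an equal share of the link."""
    remaining = sorted(sizes)
    t, fcts = 0.0, []
    while remaining:
        n = len(remaining)
        t += remaining[0] * n          # smallest flow finishes first at rate 1/n
        remaining = [r - remaining[0] for r in remaining[1:]]
        fcts.append(t)
    return fcts

def fct_sjf(sizes):
    """Shortest flow first: run flows back-to-back in size order."""
    t, fcts = 0.0, []
    for s in sorted(sizes):
        t += s
        fcts.append(t)
    return fcts

sizes = [1, 2, 3]                      # flows A, B, C from the slide
print(fct_fair_sharing(sizes))         # [3.0, 5.0, 6.0] -> mean 4.67
print(fct_sjf(sizes))                  # [1.0, 3.0, 6.0] -> mean 3.33
```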

SLIDE 10

Optimal Flow Scheduling for Avg FCT

NP-hard for multi-link networks [Bar-Noy et al.]

– Shortest Flow First: 2-approximation

[Figure: example multi-link network with flows 1, 2, 3]

SLIDE 11

How can we schedule flows based on flow criticality in a distributed way?

Some transmission order

SLIDE 12

PDQ

Several slides based on presentation by Chi-Yao Hong (UIUC)

SLIDE 13

PDQ: Distributed Explicit Rate Control

[Figure: sender → switch → switch → receiver; the packet header carries the flow's criticality and a rate field that switches can lower (e.g., 10 → 5)]

Switch preferentially allocates bandwidth to critical flows

Contrast: traditional explicit rate control targets fair sharing (e.g., XCP, RCP)
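A hedged sketch of the allocation idea (a simplification, not the paper's full algorithm; the field names are illustrative): the switch fills the most critical flows first and pauses the rest.

```python
# Simplified PDQ-style preferential allocation at one switch. Lower
# criticality value = more critical (e.g., earlier deadline or smaller
# remaining size). Flows that don't fit are paused (rate 0).

def pdq_allocate(flows, capacity):
    """flows: list of (flow_id, criticality, demand). Returns {flow_id: rate}."""
    rates, remaining = {}, capacity
    for flow_id, _crit, demand in sorted(flows, key=lambda f: f[1]):
        rate = min(demand, remaining)   # most critical flow is filled first
        rates[flow_id] = rate
        remaining -= rate
    return rates

# Two flows contend for a 10 Gbps link: the more critical one runs at
# line rate, the other is paused until it finishes.
print(pdq_allocate([("A", 2, 10), ("B", 7, 10)], capacity=10))
# {'A': 10, 'B': 0}
```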

SLIDE 14

Contrast with Traditional Explicit Rate Control

Traditional schemes (e.g., RCP, XCP) target fair sharing

[Figure: sender → switch → switch → receiver; the packet header carries a rate field that switches can lower (e.g., 10 → 5)]

  • Each switch determines a “fair share” rate based on local congestion: R ← R − k × (congestion measure)
  • Sources use the smallest rate advertised on their path
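A toy sketch of this style of control (the constant k and the congestion measure are assumptions; real RCP/XCP updates also use RTT and queue terms):

```python
# Toy RCP/XCP-style explicit rate control, not the exact published
# equations. Each switch periodically adjusts its advertised fair-share
# rate R from local congestion; each source sends at the minimum rate
# stamped along its path.

def update_fair_rate(R, arrival_rate, capacity, k=0.5):
    """R <- R - k * congestion-measure, here (arrival_rate - capacity):
    decrease R when the link is overloaded, increase it when underused."""
    return max(R - k * (arrival_rate - capacity), 0.0)

def source_rate(advertised):
    """A source is limited by the most congested switch on its path."""
    return min(advertised)

R = update_fair_rate(R=10.0, arrival_rate=12.0, capacity=10.0)
print(R)                              # 9.0: overloaded link lowers its rate
print(source_rate([9.0, 10.0, 7.5]))  # 7.5: the bottleneck hop wins
```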

SLIDE 15

Challenges

  • PDQ switches need to agree on rate decisions
  • Low utilization during flow switching
  • Congestion and queue buildup
  • Paused flows need to know when to start

SLIDE 16

Challenge: Switches need to agree on rate decisions

[Figure: sender → switch → switch → receiver; the packet header carries criticality, rate = 10, pauseby = X]

  • What can go wrong without consensus?
  • How do PDQ switches reach consensus?
  • Why is “pauseby” needed?

SLIDE 17

What You Said

Austin: “It is an interesting departure from AQM in that, with the concept of paused queues, PDQ seems to leverage senders as queue memory.”

SLIDE 18

Challenge: Low utilization during flow switching

[Figure: goal — flows B, A, C hand off the link back-to-back; in practice, the link sits idle for 1–2 RTTs between flows]

How does PDQ avoid this?

SLIDE 19

Early Start: Seamless flow switching

[Figure: throughput vs. time — without Early Start, throughput dips for ~2 RTTs when the next set of flows starts]

SLIDE 20

Early Start: Seamless flow switching

[Figure: throughput vs. time — with Early Start there is no dip, but the queue grows while flows overlap]

Solution: rate controller at switches [XCP/TeXCP/D3]

SLIDE 21

Discussion


SLIDE 22

Mean FCT

[Figure: mean flow completion time for TCP, RCP, PDQ w/o Early Start, and PDQ, normalized to a lower bound: an omniscient scheduler with zero control feedback delay]

SLIDE 23

What if flow size is not known?

Why does flow size estimation (criticality = bytes sent) work better for Pareto flow-size distributions?

SLIDE 24

Other questions

  • Fairness: can long flows starve?
  • Resilience to error: what if a packet gets lost or flow information is inaccurate?
  • Multipath: does PDQ benefit from multipath?

99% of jobs complete faster under SJF than under fair sharing
[Bansal, Harchol-Balter; SIGMETRICS’01]

Assumption: heavy-tailed flow size distribution
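To make the flavor of that result concrete, here is a toy discrete-time simulation (my construction, not Bansal and Harchol-Balter's analysis; all parameters are arbitrary assumptions): jobs with heavy-tailed Pareto sizes arrive at one unit-rate server, and we count how many finish at least as fast under SRPT (the preemptive queueing analogue of SJF) as under processor sharing.

```python
# Toy simulation: Poisson arrivals, Pareto job sizes, one unit-rate server.
# Compare per-job completion times under SRPT vs. processor sharing.

import random

def simulate(arrivals, sizes, policy, dt=0.02):
    rem, done, t = list(sizes), [None] * len(sizes), 0.0
    while any(d is None for d in done):
        active = [i for i, d in enumerate(done) if d is None and arrivals[i] <= t]
        if active:
            if policy == "srpt":
                served = [min(active, key=lambda i: rem[i])]  # one job, full rate
            else:
                served = active                               # equal shares
            for i in served:
                rem[i] -= dt / len(served)
                if rem[i] <= 0:
                    done[i] = t
        t += dt
    return done

random.seed(1)
n, lam = 100, 0.25                     # load ~ lam * E[size] = 0.25 * 3
arrivals, t = [], 0.0
for _ in range(n):
    t += random.expovariate(lam)
    arrivals.append(t)
sizes = [random.paretovariate(1.5) for _ in range(n)]  # heavy-tailed, mean 3

srpt = simulate(arrivals, sizes, "srpt")
ps = simulate(arrivals, sizes, "ps")
frac = sum(a <= b for a, b in zip(srpt, ps)) / n
print(f"{100 * frac:.0f}% of jobs finish at least as fast under SRPT")
```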

SLIDE 25

pFabric


SLIDE 26

pFabric in 1 Slide

Packets carry a single priority #

  • e.g., prio = remaining flow size

pFabric Switches

  • Send highest priority / drop lowest priority pkts
  • Very small buffers (20-30KB for 10Gbps fabric)

pFabric Hosts

  • Send/retransmit aggressively
  • Minimal rate control: just prevent congestion collapse


Main Idea: Decouple scheduling from rate control
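A hedged sketch of the switch behavior (a simplification: real pFabric dequeues the earliest packet of the flow with the highest-priority packet, to keep packets of a flow in order; here priorities are plain integers, lower = more important):

```python
# Simplified pFabric switch queue: dequeue the highest-priority packet;
# when the small buffer is full, evict the lowest-priority packet to
# admit a more important arrival.

class PFabricQueue:
    def __init__(self, capacity_pkts):
        self.capacity = capacity_pkts
        self.buf = []  # list of (priority, packet); lower priority = better

    def enqueue(self, priority, packet):
        if len(self.buf) < self.capacity:
            self.buf.append((priority, packet))
            return True
        worst = max(range(len(self.buf)), key=lambda i: self.buf[i][0])
        if priority < self.buf[worst][0]:
            self.buf[worst] = (priority, packet)  # drop lowest-priority packet
            return True
        return False  # the arrival itself is least important: drop it

    def dequeue(self):
        if not self.buf:
            return None
        best = min(range(len(self.buf)), key=lambda i: self.buf[i][0])
        return self.buf.pop(best)[1]

q = PFabricQueue(capacity_pkts=2)
q.enqueue(9, "bulk"); q.enqueue(3, "query-a"); q.enqueue(1, "query-b")
print(q.dequeue(), q.dequeue())  # query-b query-a  ("bulk" was evicted)
```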

SLIDE 27

pFabric Switch

Priority scheduling and dropping boil down to a sort

– Essentially unlimited priorities
– Thought to be difficult in hardware: existing switches only support 4–16 priority classes

Feasible because pFabric queues are very small

  • 51.2ns (one 64B packet time at 10Gbps) to find the min/max of ~600 numbers
  • Binary comparator tree: ~10 clock cycles
  • Current ASICs: clock ~ 1ns
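A small sketch of that feasibility arithmetic (the tree below mimics the hardware reduction in software; the 600-entry queue size is the slide's figure):

```python
# A binary comparator tree finds the min of N values in ceil(log2(N))
# comparator stages; at ~1ns per stage that is ~10ns for ~600 values,
# well inside the 51.2ns it takes one 64B packet to arrive at 10Gbps.

import math

def comparator_tree_min(values):
    """Pairwise-compare values level by level, as a hardware tree would."""
    level = list(values)
    while len(level) > 1:
        level = [min(level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

print(math.ceil(math.log2(600)))             # 10 comparator stages ~ 10ns
print(64 * 8 / 10e9 * 1e9)                   # 51.2 ns per 64B packet at 10Gbps
print(comparator_tree_min([7, 3, 9, 1, 5]))  # 1
```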

SLIDE 28

pFabric Rate Control

Minimal version of TCP algorithm

  • 1. Start at line rate
    – Initial window larger than BDP
  • 2. No retransmission timeout estimation
    – Fixed RTO at a small multiple of the round-trip time
  • 3. Reduce window size upon packet drops
    – Window increase same as TCP (slow start, congestion avoidance, …)
  • 4. After multiple consecutive timeouts, enter “probe mode”
    – Probe mode sends minimum-size packets until the first ACK

What about queue buildup? Why window control?
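A hedged sketch of these four rules as a sender-side state machine (the RTO multiple and the timeout threshold are assumptions, not the paper's exact constants):

```python
# Minimal pFabric-style rate control; window is in packets.

class MinimalRateControl:
    def __init__(self, bdp_pkts, rtt):
        self.cwnd = bdp_pkts + 1      # 1. start at line rate: window > BDP
        self.rto = 3 * rtt            # 2. fixed RTO, no estimation
        self.consec_timeouts = 0
        self.probe_mode = False

    def on_ack(self):
        self.consec_timeouts = 0
        if self.probe_mode:
            self.probe_mode = False   # 4. first ACK ends probe mode...
            self.cwnd = 1             #    ...restart cautiously
        else:
            self.cwnd += 1            # TCP-like window increase (simplified)

    def on_timeout(self):
        self.consec_timeouts += 1
        if self.consec_timeouts >= 5:
            self.probe_mode = True    # 4. probe with minimum-size packets
        else:
            self.cwnd = max(1, self.cwnd // 2)  # 3. reduce window on drops
```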

SLIDE 29

Why does pFabric work?

Key invariant:

At any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch.

Priority scheduling

  • High-priority packets traverse the fabric as quickly as possible

What about dropped packets?

  • Lowest priority → not needed till all other packets depart
  • Buffer > BDP → enough time (> RTT) to retransmit

SLIDE 30

Discussion


SLIDE 31

Overall Mean FCT

[Figure: mean FCT (normalized to optimal in an idle fabric) vs. load (0.1–0.8) for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]

SLIDE 32

Mice FCT (<100KB)

[Figure: normalized FCT vs. load (0.1–0.8), two panels — Average and 99th Percentile — for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]

SLIDE 33

Elephant FCT (>10MB)

[Figure: average normalized FCT vs. load (0.2–0.8) for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]

Why the gap?

SLIDE 34

Loss Rate vs Packet Priority

(at 80% load)

* Loss rate at other hops is negligible

Almost all packet loss is for large (latency-insensitive) flows

SLIDE 35

Next Time: Multi-Tenant Performance Isolation
