

SLIDE 1

A Generalized Blind Scheduling Policy

Hanhua Feng1, Vishal Misra2,3 and Dan Rubenstein2

1Infinio Systems 2Columbia University in the City of New York 3Google

TTIC SUMMER WORKSHOP: DATA CENTER SCHEDULING FROM THEORY TO PRACTICE

Feng, Misra, Rubenstein (Columbia) A Generalized Blind Scheduling Policy TTIC WORKSHOP 1 / 37

SLIDE 2

Overview

1

Introduction

2

PBS Policy

3

Properties of the PBS policy

4

Implementation and Experimental Results

5

PBS in the Data Center

6

Conclusion and Future Work

SLIDE 3

Introduction

SLIDE 4

Scheduling Policies in Queueing Models

Scheduling is a compromise . . .

not only between individual tasks, but also . . .
between systems with different workload patterns,
between different performance requirements, including

mean response time, mean slowdown, responsiveness, . . .
fairness measures: seniority, RAQFM, . . .

Our work

Design a flexible scheduling policy to balance these requirements.

Assumptions in this talk

Single-server queueing model
Work-conserving, preemption allowed

SLIDE 5

Blind Scheduling Policies

Non-blind policies

Know required and remaining service time when tasks arrive.

Blind policies

No information about remaining service until tasks complete.

Non-blind policy examples

SJF, SRPT, SMART . . .

Blind policy examples

FCFS, PS, LAS, LCFS, . . .

SLIDE 6

How Do We Measure Fairness of a Policy?

Fairness criteria [cf. Raz,Levy&Avi-Itzhak 2004]

Task seniority (emphasis on ti) ⇒ FCFS
Task service requirements (emphasis on xi)

Equal attained service ⇒ LAS/FBPS

Combination of the two: equal share of the processor

Current share: dxi(t)/dt ≡ x′i(t) ⇒ PS
Aggregate share: xi(t)/ti(t) ⇒ GAS

SLIDE 7

How to Measure Fairness of a Policy? (cont’d)

Fairness measures in literature

Comparison vs FCFS [Wang & Morris 1985] RAQFM: Comparison vs PS [Raz,Levy&Avi-Itzhak 2004]

A quantitative measure. Difficult to analyze: results exist for FCFS, LCFS, PLCFS, and Random in M/M/1; extended to G/D/m [Raz,Levy&Avi-Itzhak 2005]

Expected slowdown for given required service E[S|X = x] compared with PS [Wierman&Harchol-Balter 2004]

A classification: always fair/unfair, sometimes fair. Assume M/G/1. Extended in [Wierman&Harchol-Balter 2005].

SQF [Avi-Itzhak,Brosh&Levy 2007]

SLIDE 8

PBS Policy

SLIDE 9

Balance Between Two Fairness Criteria

Two fairness criteria (cont’d)

Seniority — Prefer larger sojourn time ti(t) Service requirements — Prefer smaller attained service xi(t)

Our idea: A configurable balance

Schedule a task with maximal ti(t) − αxi(t). More generally: g(ti(t)) − αg(xi(t)), e.g., log ti(t) − α log xi(t).

SLIDE 10

Our Parameterized Scheduler: PBS

The PBS policy with a single server

For every task i, compute its priority value pi(t) = log ti(t) − α log xi(t), equivalent to Pi(t) = ti(t)/[xi(t)]α.
α is a configurable parameter in [0, ∞).
At time t, serve the task with the highest priority pi (or Pi).

Randomly choose among equal-priority tasks.
Preempt lower-priority tasks if they are currently being served.

Can be used in continuous time (theory) or in discrete time (practice).

PBS: A Unified Priority-Based Scheduler Sigmetrics 2007
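The selection rule above can be written as a minimal discrete-time sketch (Python; the helper name and task representation are mine for illustration, not from the paper):

```python
import math

def pbs_pick(tasks, now, alpha):
    """Return the index of the task PBS would serve at time `now`.

    Each task is a dict with 'arrival' (arrival time) and 'attained'
    (service received so far).  Priority: p_i = log t_i - alpha*log x_i,
    where t_i = now - arrival is the sojourn time and x_i is attained
    service.  Assumes now > arrival for every task.  Sketch only: ties
    and the zero-attained-service case are handled crudely here.
    """
    best, best_p = None, -math.inf
    for i, task in enumerate(tasks):
        t = now - task['arrival']      # sojourn time t_i(now)
        x = task['attained']           # attained service x_i(now)
        if alpha > 0 and x == 0:
            return i                   # brand-new task: infinite priority
        p = math.log(t) - alpha * math.log(x) if x > 0 else math.log(t)
        if p > best_p:
            best, best_p = i, p
    return best
```

With α = 0 this degenerates to picking the largest sojourn time (FCFS); with large α the term −α log xi dominates and the task with the least attained service wins (LAS).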

SLIDE 11

PBS: Priority-based Blind Scheduling (cont’d)

Why PBS?

Tunable: Parameter α can be changed from 0 to ∞.

Emulates well-known policies: Pi = ti/[xi]α

α = 0: First-come first-serve (FCFS), Pi = ti
α → ∞: Least attained service (LAS), Pi ∼ 1/xi, a.k.a. Foreground-Background Processor-Sharing (FBPS)
α = 1: Greatest Attained Slowdown (GAS), Pi = ti/xi, closely emulates Processor-Sharing (PS)
Other values of α: hybrid policies

Blind: uses only past information (ti, xi).
Simple: easy to implement.
Dimensionless: independent of the time-unit scale (minute, second).
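The limiting cases can be sanity-checked numerically with the power form Pi = ti/xi^α (a toy check with made-up numbers, not the authors' code):

```python
def pbs_P(t, x, alpha):
    """Power-form PBS priority P_i = t_i / x_i**alpha."""
    return t / x ** alpha

# Two tasks: an old one with lots of attained service, a young one with little.
old = dict(t=10.0, x=8.0)     # arrived early, served a lot
young = dict(t=2.0, x=0.5)    # arrived late, served little

# alpha = 0 reduces to FCFS: priority is the sojourn time alone.
assert pbs_P(**old, alpha=0) > pbs_P(**young, alpha=0)

# Large alpha approaches LAS: the least-attained-service task wins.
assert pbs_P(**old, alpha=8) < pbs_P(**young, alpha=8)

# alpha = 1 is GAS: priority equals the task's current slowdown t/x.
assert pbs_P(**old, alpha=1) == 10.0 / 8.0
```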

SLIDE 12

Behavior of PBS

An example

Four tasks in 4 colors.
Arrival times: 0s, 1s, 3s, 5s.
Service times: 4.5s, 2.5s, 3s, 2s.

How to read the graphs

X-axis: time.
Y-axis: CPU utilization per task.
Area: service received.

SLIDE 13

Properties

SLIDE 14

Properties of PBS for 0 < α < ∞

Some properties of PBS proved in the paper

A new task immediately receives service after arrival:

a small CPU fraction for α < 1, a large CPU fraction for α > 1.

Seniority: earlier tasks get more attained service.
Time-shared: the CPU may be shared by two or more tasks.

Hospitality: a new task always gets a CPU share.

Convergence: converges to PS in the long run for long jobs;

converges to DPS with an offset in the log formula.

No starvation: priority values of temporarily blocked tasks increase without bound, so each eventually becomes the highest-priority task.

For α close to 0 (FCFS) or ∞ (LAS), tasks may still be blocked for a long time.

SLIDE 15

PBS Tunability: A Graphical Conclusion

PBS is monotonic in many aspects

Guidelines for tuning α manually.

SLIDE 16

Implementation

SLIDE 17

Implementation in Linux Kernel

CPU utilization measurement

Discrete-time implementation in Linux 2.6.15.
50ms moving average of measured CPU utilization per task.
Measurement results are close to simulation results; the only difference is roughness on small time scales.

SLIDE 18

Emulating Existing Linux Scheduler

A small tweak

Add a bonus priority γ to the current task in order to limit context switches. With α = 2 and γ = 0.07, PBS closely matches the Linux native scheduler.
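The tweak amounts to adding γ, in log space, to the running task's priority, so a waiting task must beat the incumbent by a margin of γ before it preempts. A sketch under that reading (helper names and task layout are mine, not the kernel patch):

```python
import math

def pbs_pick_with_bonus(tasks, now, alpha, gamma, running):
    """Pick a task under PBS with a bonus `gamma` (in log space) added
    to the currently running task, discouraging context switches.

    `tasks` maps task id -> (arrival, attained); `running` is the id of
    the task on the CPU, or None.  Assumes now > arrival and attained > 0.
    """
    def prio(tid):
        arrival, attained = tasks[tid]
        p = math.log(now - arrival) - alpha * math.log(attained)
        if tid == running:
            p += gamma  # bonus keeps the incumbent on the CPU a bit longer
        return p
    return max(tasks, key=prio)
```

With γ = 0 the incumbent is preempted as soon as any waiter's priority edges past it; a small positive γ suppresses such marginal switches.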

SLIDE 19

Experimental model

A closed model

A fixed number of users. Each user submits a task after thinking. Exponentially distributed thinking time. Response time of every task is measured.

SLIDE 20

Experimental Results (Set A)

Computational tasks with almost deterministic CPU usage. About 3-second processing for each task. 8 users, 25s average thinking time.

For this workload, small α works best: PBS (α < 0.7) outperforms Linux and Round-robin.

SLIDE 21

Experimental Results (Set B) (1/2)

Apache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking time. Processing time is heavy-tailed.

For this workload, large α works best: PBS (α > 2) outperforms Linux and Round-robin.

Conclusion

Different α’s are better for different workloads.

SLIDE 22

Experimental Results (Set B) (2/2)

Apache web server 2.0, dynamic pages with heavy processing. Overloaded with 30 users, 10s average thinking time. Processing time is heavy-tailed.

SLIDE 23

Data center

SLIDE 24

Data center fabric: A giant switch

[Diagram: the data-center fabric abstracted as one giant switch, with hosts H1–H9 on the TX side and H1–H9 on the RX side]

SLIDE 25

Transport in Data Centers

Objective: minimize average flow completion time (FCT).

DC transport = flow scheduling on a giant switch, subject to ingress & egress capacity constraints.

[Diagram: hosts H1–H9 on the TX and RX sides of the fabric]

SLIDE 26

pFabric

pFabric: Minimal Near-Optimal Datacenter Transport (Alizadeh et al. Sigcomm 2013)

Goal: Complete Flows Quickly

Requires scheduling flows such that:

High throughput for large flows Fabric latency (no queuing delays) for small flows

Prior work: use rate control to schedule flows; this vastly improves performance, but is complex.

SLIDE 27

pFabric in one slide

pFabric Packets

Packets carry a single priority number, e.g., prio = remaining flow size

pFabric Switches

Very small buffers (20–30KB for a 10Gbps fabric).
Send highest-priority packets / drop lowest-priority packets.

pFabric Hosts

Send/retransmit aggressively.
Minimal rate control: just prevent congestion collapse.
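The per-port behavior described above can be sketched as follows (a toy model, not the pFabric implementation; since prio = remaining flow size, a smaller number means higher priority):

```python
class PFabricPort:
    """Sketch of a pFabric switch port's small per-port "bag" of packets.

    Each packet carries prio = remaining flow size, so a SMALLER prio
    number means HIGHER priority.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.bag = []  # (prio, packet) pairs, unordered

    def enqueue(self, prio, pkt):
        self.bag.append((prio, pkt))
        if len(self.bag) > self.capacity:
            # Priority dropping: drop the lowest-priority packet,
            # i.e. the one with the LARGEST remaining flow size.
            worst = max(self.bag, key=lambda e: e[0])
            self.bag.remove(worst)

    def dequeue(self):
        # Priority scheduling: send the highest-priority packet first,
        # i.e. the one with the SMALLEST remaining flow size.
        if not self.bag:
            return None
        best = min(self.bag, key=lambda e: e[0])
        self.bag.remove(best)
        return best[1]
```

The buffer stays tiny because anything from a large flow is the first casualty on overflow, while short flows cut straight to the front.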

SLIDE 28

pFabric Switch

Switch port holds a small "bag" of packets (per-port), with prio = remaining flow size.

Ø Priority scheduling: send the highest-priority packet first
Ø Priority dropping: drop the lowest-priority packets first

[Diagram: a switch port holding a small bag of prioritized packets, with hosts H1–H9 attached]

SLIDE 29

pFabric results

SLIDE 30

Summary of results

SJF and SRPT achieve nearly equal performance (for large flows, SRPT is ≈ 15% better than SJF).

LAS (BytesSent) for DataMining is nearly as good as optimal/SRPT; for WebSearch, performance breaks down at high load: many jobs of similar sizes keep getting preempted until the new job "catches up", and then it starts again.

Tradeoffs

SJF/SRPT work across workload distributions, but require job-size information.

Job size is often not available ahead of time.

LAS requires no job-size information, but doesn't work well with non-heavy-tailed job-size distributions.
PBS can achieve a balance with the right α.

SLIDE 31

Homa: Practical Low Latency Datacenter Transport (Sigcomm 2018)

  • No persistent congestion in the core

Congestion At The Edge

[Diagram: Big Switch core, TOR switches, servers; congestion occurs at the edge]

SLIDE 32

Approach and Performance

Schedule messages in shortest-remaining-first order (SRPT).
Near-optimal average latency & good tail latency for short messages.

Key Ideas

Receiver-driven congestion control and packet scheduling: reduces buffer occupancy & improves latency.
Network priorities dynamically assigned by receivers: short messages bypass queues.
Controlled overcommitment of the receiver's downlink: avoids bandwidth waste, leading to high bandwidth utilization.

SLIDE 33

Homa mechanism

SRPT to schedule packets

  • Homa receivers schedule incoming packets: one grant per packet
  • Problem: 1 RTT additional latency for scheduling (size unknown)
  • Solution: Transmit 1 RTT of packets per message blindly

[Timeline diagram: the sender transmits unscheduled data packets, the receiver returns grants, and scheduled data packets follow]

PBS for Homa

No need to know flow sizes.
The first packet of a flow naturally has high priority (unless α = 0).
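The "natural high priority" claim follows from the power form: as attained service x → 0, P = t/x^α → ∞ for any α > 0, so a flow's first packets outrank everything. A tiny check with made-up numbers (illustrative helper, not Homa or PBS code):

```python
def pbs_P(t, x, alpha):
    """Power-form PBS priority P = t / x**alpha (illustrative helper)."""
    return t / x ** alpha

# A brand-new flow (tiny attained service, just arrived) vs an old flow
# that has already received lots of service.
new_flow = pbs_P(t=0.001, x=1e-6, alpha=1.0)
old_flow = pbs_P(t=100.0, x=50.0, alpha=1.0)
assert new_flow > old_flow   # the new flow outranks the old one

# With alpha = 0, PBS degenerates to FCFS and the advantage disappears.
assert pbs_P(t=0.001, x=1e-6, alpha=0.0) < pbs_P(t=100.0, x=50.0, alpha=0.0)
```

This is exactly the behavior Homa wants for blind first-RTT transmission: the first packets of a message are served ahead of long-running flows without any size information.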

SLIDE 34

Conclusion

SLIDE 35

Conclusion and Future Work

Contributions

We introduce a novel configurable policy, PBS.
By varying its single parameter, we can tune for various performance and fairness requirements.
We demonstrate the properties and advantages of PBS by analysis, simulation, implementation, and experiments.

Current/Future work

Closed form of the mean response time in M/G/1 for any α.
Design an automatic mechanism to dynamically adapt α to the workload.
Implement PBS in data-center scheduling/transport systems.
Extend PBS to multi-core systems.

SLIDE 36

The End

