Scheduling Jobs With Dependencies - PowerPoint PPT Presentation




SLIDE 1

Scheduling Jobs With Dependencies:

New Applications, Classic Problems

Janardhan Kulkarni, Microsoft Research, Redmond.

31 July 2018, TTIC, Chicago. TTIC SUMMER WORKSHOP: DATA CENTER SCHEDULING FROM THEORY TO PRACTICE

SLIDE 2

Roadmap

Ø Which theory models are closer to data-center settings?

Srikanth Kandula Ratul Mahajan Amar Phanishayee Monia Ghobadi

Ø Focus on algorithms

Even complex algorithms can have algorithmic intuitions that are useful in practice.

Ø One example

One system heuristic and one complex provable algorithm (Using LP Hierarchies) that has good heuristic value.

SLIDE 3

Luleå FB Data Center, South of the Arctic Circle

SLIDE 4

Luleå FB Data Center, South of the Arctic Circle. It is beautiful like this for 3 days…

SLIDE 5

Luleå FB Data Center, South of the Arctic Circle. A cold, cold place…

SLIDE 6

5%

“as large as cities”

Efficiency Matters a Lot

SLIDE 7

Efficiency Matters a Lot:

“as large as cities”

Emphasis on Principled Algorithms

[Chart: cost vs. time tradeoff between simple heuristics and theoretically sound algorithms.] Simplicity is not everything!

SLIDE 8

How we Measure Efficiency

Ø Makespan: minimize the maximum completion time among a set of jobs; the length of the schedule.
Ø Average (or total) flow-time (aka job completion time):

  • same as response time
  • measures the time a job spends in the system

Fj = Cj − rj, where Cj is the completion time and rj the release (arrival) time of job j.

SLIDE 9

How we Measure Efficiency

Ø Makespan: minimize the maximum completion time among a set of jobs; the length of the schedule.
Ø Average (or total) flow-time (aka job completion time):

  • same as response time
  • measures the time a job spends in the system

Fj = Cj − rj, where Cj is the completion time and rj the release (arrival) time of job j.

Throughput, energy, fairness, utilization, etc..
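The two objectives above are easy to compute from a finished schedule. A minimal sketch (job names and times are illustrative, not from the talk):

```python
# Hypothetical illustration: compute makespan and average flow-time
# from a finished schedule. Each job j has a release time r_j and a
# completion time C_j; its flow-time is F_j = C_j - r_j.

def makespan(completions):
    """Length of the schedule: the maximum completion time."""
    return max(completions.values())

def average_flow_time(releases, completions):
    """Average of F_j = C_j - r_j over all jobs (aka response time)."""
    flows = [completions[j] - releases[j] for j in releases]
    return sum(flows) / len(flows)

# Three jobs: released at times 0, 2, 3; completed at times 4, 5, 9.
r = {"a": 0, "b": 2, "c": 3}
C = {"a": 4, "b": 5, "c": 9}
print(makespan(C))              # 9
print(average_flow_time(r, C))  # (4 + 3 + 6) / 3 = 4.333...
```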

SLIDE 10

Challenges of Data Center Scheduling

Resources

Heterogeneous (FPGA + CPU, GPU + CPU) Multidimensional (CPU, memory, network)

Jobs

Complex dependencies: DAGs, Co-flows, etc.

Algorithms

Fast, simple, often online.

SLIDE 11

Challenges of Data Center Scheduling

Resources

Heterogeneous (FPGA + CPU, GPU + CPU) Multidimensional (CPU, memory, network)

Jobs

Complex dependencies: DAGs, Co-flows, etc.

Algorithms

Fast, simple, often online. Rich theory with many nice algorithms when jobs have simple structures.

SLIDE 12

Scheduling on Heterogeneous Clusters

SLIDE 13

Why are clusters heterogeneous?

Ø Special-purpose hardware
Ø Data locality
Ø Geographic location
Ø Privacy concerns

Scheduling on Heterogeneous Clusters

SLIDE 14

1000 100 300

jobs run faster on some clusters and slower on others

Modeling Heterogeneity

SLIDE 15

Jobs arrive over time

jobs machines

1    15   1000  …  10
66   100  5     …  98
1    15   88    …  13
100  788  9     …  13

Modeling Heterogeneity

jobs run faster on some clusters and slower on others

SLIDE 16

Jobs arrive over time

jobs machines

1    15   1000  …  10
66   100  5     …  98
1    15   88    …  13
100  788  9     …  13

Heterogeneous == “Unrelated Machines Scheduling”

Assign (match) jobs to clusters + schedule to optimize the QoS objective.
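As a toy illustration of the unrelated-machines model (this is a simple heuristic for intuition, not one of the cited algorithms): given a matrix of machine-dependent processing times like the one on the slide, assign each arriving job to the machine where it would finish earliest.

```python
# Toy sketch of unrelated machines: p[j][i] is the processing time of
# job j on machine i (jobs run fast on some machines, slow on others).
# Greedy rule: assign each arriving job to the machine on which it
# would complete earliest. Illustration only, not a cited algorithm.

def greedy_assign(p, num_machines):
    load = [0] * num_machines        # current finishing time per machine
    assignment = []
    for times in p:                  # jobs arrive one at a time
        # machine on which this job would complete earliest
        i = min(range(num_machines), key=lambda i: load[i] + times[i])
        load[i] += times[i]
        assignment.append(i)
    return assignment, load

# Two machines; job 0 is fast on machine 0, job 1 is fast on machine 1.
p = [[1, 100], [100, 1], [10, 10]]
assignment, load = greedy_assign(p, 2)
print(assignment)  # [0, 1, 0]  (machine 0 wins job 2's tie-break)
print(load)        # [11, 1]
```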

SLIDE 17

Beautiful Algorithms For Unrelated Machines Scheduling Problems

Makespan Flow-time Energy LST’87 CGK’09 AGK’12 ST’89 AGK’12 KLS’10 Svensson’12 BK’15 IKMP’14 AAFPW’97 IKMP’14 P’07 KD’18 A’06

Offline, online, multidimensional, clairvoyant, non-clairvoyant, stochastic, truthfulness… Has led to the development of very nice ideas: use of vertex solutions and duality in the design of algorithms, configuration LPs, potential functions, connections to game-theoretic ideas…

SLIDE 18

Beautiful Algorithms For Unrelated Machines Scheduling Problems

Makespan Flow-time Energy LST’87 CGK’09 AGK’12 ST’89 AGK’12 KLS’10 Svensson’12 BK’15 IKMP’14 AAFPW’97 IKMP’14 P’07 KD’18 A’06

Offline, online, multidimensional, clairvoyant, non-clairvoyant, stochastic, truthfulness… Has led to the development of very nice ideas: use of vertex solutions and duality in the design of algorithms, configuration LPs, potential functions, connections to game-theoretic ideas…

RESEARCH DIRECTION: Few machine types. Can we get better algorithms for some classic unrelated-machines scheduling problems?

SLIDE 19

Challenges of Data Center Scheduling

Resources

Heterogeneous (FPGA + CPU, GPU + CPU) Multidimensional (CPU, memory, network)

Jobs

Complex dependencies: DAGs, Co-flows, etc.

Algorithms

Fast, simple, often online.

SLIDE 20

The plan

GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. OSDI 2016.

One Heuristic

Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni.

One Complex Theoretical Framework

Very general, works well in practice, as bad as any other algorithm on paper ☺ Levey and Rothvoss ’16; Garg, Kulkarni, Li ’18; Garg, Kulkarni, Li ’18. Very specific, provable, and quite complex.

SLIDE 21

The plan

GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. OSDI 2016.

One Heuristic

Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni.

One Complex Theoretical Framework

Levey and Rothvoss ’16; Garg, Kulkarni, Li ’18; Garg, Kulkarni, Li ’18. One of the biggest hammers in approximation algorithms: “Lift and Project”. Very general, works well in practice, as bad as any other algorithm on paper ☺ Very specific, provable, and quite complex.

SLIDE 22

The plan

GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. OSDI 2016.

One Heuristic

Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni.

One Complex Theoretical Framework

Levey and Rothvoss ’16; Garg, Kulkarni, Li ’18; Garg, Kulkarni, Li ’18. Very general, works well in practice, as bad as any other algorithm on paper ☺ Very specific, provable, and quite complex.

SLIDE 23

A Directed Acyclic Graph (DAG) Scheduling Problem in Large Clusters

GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. OSDI 2016. Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni.

SLIDE 24

DAG Model Supported in Hadoop

Multidimensionality Heterogeneity of clusters

SLIDE 25

Resources of a cluster

(1, 1, 1) D types of resources

Cluster Scheduling

A single job represented as a DAG (task)

SLIDE 26

Resources of a cluster

(1, 1, 1) D types of resources

Cluster Scheduling

A single job represented as a DAG (task) Demand Vector

(1, 0,…,1/2) (1/2, 1/2,…,1/2) (1/4, 1,…,1/10)

SLIDE 27

Resources of a cluster

(1, 1, 1) D types of resources

Cluster Scheduling

A single job represented as a DAG (task) Demand Vector

(1, 0,…,1/2) (1/2, 1/2,…,1/2) (1/4, 1,…,1/10)

Processing length (duration)

SLIDE 28

Cluster Scheduling: Minimize Makespan

A single job represented as a DAG

(1, 0), 2 (0,1), 1 (1, 1), 1 (1, 1), 1 (0, 1), 1


(1, 1)

Cluster

SLIDE 29

Is There a Good Algorithm?

It is unlikely (UGC-hard) that a polynomial-time algorithm can achieve better than a D-approximation to the DAG scheduling problem. This holds even if all tasks of the DAG 1) have the same length, and 2) require exactly one resource.

Theorem: Bansal and Khot ’09.

Ø Any non-idling algorithm is equally good or equally bad!

Not a useful intuition for system designers.

SLIDE 30

Is There a Good Algorithm?

It is unlikely (UGC-hard) that a polynomial-time algorithm can achieve better than a D-approximation to the DAG scheduling problem. This holds even if all tasks of the DAG 1) have the same length, and 2) require exactly one resource.

Theorem: Bansal and Khot ’09.

Optimal Algorithm: do a greedy schedule respecting precedence constraints. At least one resource is used, so congestion for that resource decreases.

SLIDE 31

Is There a Good Algorithm?

It is unlikely (UGC-hard) that a polynomial-time algorithm can achieve better than a D-approximation to the DAG scheduling problem. This holds even if all tasks of the DAG 1) have the same length, and 2) require exactly one resource.

Theorem: Bansal and Khot ’09.

Optimal Algorithm: do a greedy schedule respecting precedence constraints. At least one resource is used, so congestion for that resource decreases.

SLIDE 32

Is There a Good Algorithm?

It is unlikely (UGC-hard) that a polynomial-time algorithm can achieve better than a D-approximation to the DAG scheduling problem. This holds even if all tasks of the DAG 1) have the same length, and 2) require exactly one resource.

Theorem: Bansal and Khot ’09.

Optimal Algorithm: do a greedy schedule respecting precedence constraints. At least one resource is used, so congestion for that resource decreases.

SLIDE 33

When did System Designers Care for Lowerbounds?

GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. OSDI 2016. Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, Janardhan Kulkarni.

² Could find almost-optimal solutions on MS data sets.
² Improves makespan by at least 30% compared to simple greedy heuristics.

SLIDE 34

Intuition of Graphene

What do greedy algorithms (list scheduling, critical path, etc.) miss? “Pathologically bad schedules in today’s approaches mostly arise due to two reasons: (a) long-running tasks have no other work to overlap with them, which reduces parallelism, and (b) the tasks that are runnable do not pack well with each other, which increases resource fragmentation.”

SLIDE 35

Intuition of Graphene

Main Steps Ø Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack.

SLIDE 36

Intuition of Graphene

Main Steps Ø Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack. Ø Place the troublesome tasks first onto a virtual resource-time space. This space would have d +1 dimensions when tasks require d resources; the last dimension being time.

SLIDE 37

Intuition of Graphene

Main Steps Ø Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack. Ø Place the troublesome tasks first onto a virtual resource-time space. This space would have d +1 dimensions when tasks require d resources; the last dimension being time. Ø Our intuition is that placing the troublesome tasks first leads to a good schedule since the remaining tasks can be placed into resultant holes in this space.
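The three steps above can be sketched in a few lines. This is a much-simplified illustration of the intuition only, not the GRAPHENE system: one resource dimension, no precedence constraints, and an illustrative "long task" threshold.

```python
# Much-simplified sketch of the intuition above: one resource dimension,
# no precedence constraints. "Troublesome" = long tasks; they are placed
# first in the resource-time space, and the short tasks are then packed
# into the resultant holes. Threshold and task numbers are illustrative.

def place(task, usage, capacity):
    """Earliest start where the task's demand fits for its whole duration."""
    demand, duration = task
    t = 0
    while True:
        if all(usage.get(t + k, 0.0) + demand <= capacity for k in range(duration)):
            for k in range(duration):
                usage[t + k] = usage.get(t + k, 0.0) + demand
            return t
        t += 1

def graphene_like(tasks, capacity, long_threshold=5):
    usage = {}  # time slot -> resource in use
    # long (troublesome) tasks first, then the rest, in input order
    order = sorted(range(len(tasks)), key=lambda j: tasks[j][1] < long_threshold)
    return {j: place(tasks[j], usage, capacity) for j in order}

tasks = [(0.5, 8), (0.5, 1), (0.5, 1), (1.0, 2)]  # (demand, duration)
starts = graphene_like(tasks, capacity=1.0)
span = max(starts[j] + tasks[j][1] for j in starts)
print(span)  # 10; placing the long task last instead gives 11 here
```

On this tiny instance the short tasks overlap the long task in its leftover capacity, which is exactly the "holes" intuition.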

SLIDE 38
SLIDE 39

Can we formalize this intuition?

SLIDE 40

A (1 + ε)-Approximation for Makespan Scheduling with Precedence Constraints using LP Hierarchies.

Levey and Rothvoss ’16

SLIDE 41

Identical Machines Scheduling

A single DAG. Each task needs to be scheduled on exactly one machine. Each task needs 1 unit of CPU. m identical machines (or CPUs) Minimize Makespan (Special case of DAG scheduling)

SLIDE 42

Identical Machines Scheduling

A single DAG. Each task needs to be scheduled on exactly one machine. Each task needs 1 unit of CPU. m identical machines (or CPUs)

SLIDE 43

Identical Machines Scheduling

A single DAG. Each task needs to be scheduled on exactly one machine. Each task needs 1 unit of CPU. m identical machines (or CPUs) Chain of length 4

SLIDE 44

Identical Machines Scheduling

A single DAG. Each task needs to be scheduled on exactly one machine. Each task needs 1 unit of CPU. m identical machines (or CPUs) Chain of length 4

SLIDE 45

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.
SLIDE 46

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.

BAD SLOTS GOOD SLOTS

SLIDE 47

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.

BAD SLOTS, GOOD SLOTS

Makespan ≤ (# good slots) + (# bad slots)

SLIDE 48

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.

BAD SLOTS, GOOD SLOTS

Makespan ≤ n/m + (length of longest chain)

SLIDE 49

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.

BAD SLOTS, GOOD SLOTS

Makespan ≤ n/m + (length of longest chain)

OPT ≥ n/m and OPT ≥ (length of longest chain)

SLIDE 50

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.

BAD SLOTS, GOOD SLOTS

Makespan ≤ n/m + (length of longest chain)

OPT ≥ n/m and OPT ≥ (length of longest chain)

Ø Theoretically optimal, but conveys very little information in practice. Ø Does not work well in practice when there is more than one resource type.
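Graham's argument is easy to demonstrate for unit-length tasks: run list scheduling and compare against the two lower bounds n/m and the longest chain. A minimal sketch (the diamond DAG and task numbering are illustrative):

```python
# List scheduling of n unit-length tasks with precedence constraints
# on m identical machines, plus the two lower bounds from the slide:
# OPT >= n/m and OPT >= longest chain, while the schedule satisfies
# makespan <= n/m + longest chain (hence a 2-approximation).
import math

def list_schedule(preds, m):
    """preds[j] = set of tasks that must finish before unit task j."""
    n = len(preds)
    done, t = set(), 0
    while len(done) < n:
        ready = [j for j in range(n) if j not in done and preds[j] <= done]
        done.update(ready[:m])       # greedily fill the m machines
        t += 1
    return t                         # makespan

def longest_chain(preds):
    memo = {}
    def depth(j):
        if j not in memo:
            memo[j] = 1 + max((depth(i) for i in preds[j]), default=0)
        return memo[j]
    return max(depth(j) for j in preds)

# Diamond DAG 0 -> {1, 2} -> 3, plus two independent tasks, on m = 2.
preds = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}, 4: set(), 5: set()}
m, n = 2, len(preds)
ms = list_schedule(preds, m)
print(ms, math.ceil(n / m), longest_chain(preds))  # 3 3 3
```

Here the makespan matches both lower bounds, so list scheduling happens to be optimal on this instance.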

SLIDE 51

Identical Machines Scheduling

There is a quasi-polynomial-time (1 + ε)-approximation for minimizing makespan when jobs have unit lengths and the number of machines is a constant.

  • Theorem. Levey and Rothvoss ’16.

Garg ’17 made it strictly quasi-polynomial time.

SLIDE 52

Identical Machines Scheduling

There is a quasi-polynomial-time (1 + ε)-approximation for minimizing makespan when jobs have arbitrary lengths and the number of machines is a constant. The algorithm schedules jobs on a single machine and may preempt jobs within a machine.

  • Theorem. Kulkarni, Li ’18.

SLIDE 53

Identical Machines Scheduling

There is a polynomial-time, essentially optimal (2 + ε)-approximation for minimizing weighted completion time of jobs, when the number of machines and job sizes are uniform.

  • Theorem. Garg, Kulkarni, Li ’18.

SLIDE 54

Identical Machines Scheduling

Greedy list scheduling is a 2-approximation for minimizing makespan.

  • Theorem. Graham 1966.

BAD SLOTS, GOOD SLOTS

Makespan ≤ n/m + (length of longest chain)

OPT ≥ n/m and OPT ≥ (length of longest chain)

SLIDE 55

Crucial Observation

BAD SLOTS, GOOD SLOTS

Makespan ≤ n/m + (length of longest chain), and OPT ≥ n/m

If (length of longest chain) ≤ ε · OPT, then Makespan ≤ (1 + ε) · OPT

SLIDE 56

Crucial Observation

BAD SLOTS, GOOD SLOTS

Makespan ≤ n/m + (length of longest chain), and OPT ≥ n/m

If (length of longest chain) ≤ ε · OPT, then Makespan ≤ (1 + ε) · OPT

The long chains are the troublesome tasks. How to schedule the troublesome tasks?

SLIDE 57

Framework

Time Interval

T T1 T2 T3

Partition the tasks into a set of bottom tasks and a single set of top tasks. For each set of bottom tasks we find a sub-interval where they should be scheduled.

Then do a recursive scheduling of bottom tasks.

SLIDE 58

Framework

Time Interval

T T1 T2 T3

Top tasks Bottom tasks Bottom tasks Bottom tasks

SLIDE 59

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints across bottom tasks are automatically satisfied.

SLIDE 60

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

SLIDE 61

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

[T2, T3]

SLIDE 62

For every task j in the set of top tasks we have a window [rj, dj], based on the tentative assignment of bottom tasks.

T T1 T2 T3

Precedence constraints going from bottom to top tasks are loose. There is enough space to schedule top tasks.

SLIDE 63

T T1 T2 T3

Precedence constraints going from bottom to top tasks are loose. There is enough space to schedule top tasks if there are no precedence constraints between top tasks. For every task j in the set of top tasks we have a window [rj, dj], based on the tentative assignment of bottom tasks.

SLIDE 64

T T1 T2 T3

Precedence constraints going from bottom to top tasks are loose. There is enough space to schedule top tasks if there are no precedence constraints between top tasks. EDF will schedule all top tasks in the empty space, but may violate the precedence constraints between top tasks. For every task j in the set of top tasks we have a window [rj, dj], based on the tentative assignment of bottom tasks.

SLIDE 65

Intuition of Graphene

Main Steps Ø Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack. Ø Place the troublesome tasks first onto a virtual resource-time space. This space would have d +1 dimensions when tasks require d resources; the last dimension being time. Ø Our intuition is that placing the troublesome tasks first leads to a good schedule since the remaining tasks can be placed into resultant holes in this space.

SLIDE 66

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

[T2, T3]

SLIDE 67

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

[T2, T3]

The chain length among top tasks is very small.

SLIDE 68

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

[T2, T3]

The chain length among top tasks is very small.

SLIDE 69

Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

[T2, T3]

The chain length among top tasks is very small.

The algorithm has recognized a crude schedule for troublesome tasks. That’s why the chain length among top tasks is small.
SLIDE 70

Intuition of Graphene

Main Steps Ø Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack. Ø Place the troublesome tasks first onto a virtual resource-time space. This space would have d +1 dimensions when tasks require d resources; the last dimension being time. Ø Our intuition is that placing the troublesome tasks first leads to a good schedule since the remaining tasks can be placed into resultant holes in this space.

SLIDE 71

LR’16 Framework

Time Interval

T T1 T2 T3

Bottom tasks Bottom tasks Bottom tasks Precedence constraints going from bottom to top tasks are loose.

[T2, T3]

The chain length among top tasks is very small.

SLIDE 72

For every task j in the set of top tasks we have a window [rj, dj].

T T1 T2 T3

Precedence constraints going from bottom to top tasks are loose. There is enough space to schedule top tasks if there are no precedence constraints between top tasks. EDF will schedule all top tasks, all except a few, in the empty space, but may violate the precedence constraints between top tasks.
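The EDF step can be sketched for unit-length top tasks with windows [rj, dj] on m machines: at each time step, run the (at most m) released, unfinished tasks with the earliest deadlines. As on the slide, precedence between top tasks is ignored; the windows below are illustrative.

```python
# EDF sketch: unit-length top tasks, each with a window [r_j, d_j],
# on m machines. At every time step, run the (at most m) released,
# unfinished tasks with the earliest deadlines. Precedence between
# top tasks is deliberately ignored, as in the slide.

def edf(windows, m):
    """windows[j] = (r_j, d_j); returns {task: start time} or None."""
    horizon = max(d for _, d in windows.values())
    start, done = {}, set()
    for t in range(horizon + 1):
        ready = sorted(
            (j for j, (r, d) in windows.items() if j not in done and r <= t),
            key=lambda j: windows[j][1],   # earliest deadline first
        )
        for j in ready[:m]:
            if t + 1 > windows[j][1]:
                return None                # deadline miss: infeasible
            start[j] = t
            done.add(j)
    return start if len(done) == len(windows) else None

# Three unit tasks, two machines; task "c" has the tightest deadline.
windows = {"a": (0, 3), "b": (0, 3), "c": (0, 1)}
schedule = edf(windows, m=2)
print(schedule)  # c and a run at time 0, b at time 1
```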

SLIDE 73

Time Interval

T T1 T2 T3

Top tasks Bottom tasks Bottom tasks Bottom tasks

How to partition the DAG?

  • 1. Precedence constraints between bottom tasks should be implied.
  • 2. The precedence constraints between top and bottom tasks are loose.
  • 3. The chain length among top tasks is small.
SLIDE 74

Linear Programming Formulation

Binary-search the optimal makespan as T.

For every task j (j is scheduled):
    Σ_{t=1..T} xjt = 1

For every time slot t (at most m tasks per slot):
    Σ_j xjt ≤ m

For every precedence relation i → j, satisfied at each time step t:
    Σ_{t'<t} xit' ≥ Σ_{t'≤t} xjt'

All variables are non-negative: xjt ≥ 0.
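This time-indexed LP can be written down directly. A small feasibility check (assuming scipy is available; the binary search over T is left to the caller):

```python
# Feasibility version of the time-indexed LP from the slide, for unit
# tasks: variables x[j][t], t in 1..T, with
#   sum_t x[j][t] = 1          (each task j is scheduled)
#   sum_j x[j][t] <= m         (at most m tasks per slot)
#   sum_{t'<t} x[i][t'] >= sum_{t'<=t} x[j][t']  for every edge i -> j
# Assumes scipy is available; the caller binary-searches T.
import numpy as np
from scipy.optimize import linprog

def lp_feasible(n, edges, m, T):
    idx = lambda j, t: j * T + t          # flatten x[j][t], t in 0..T-1
    nv = n * T
    A_eq = np.zeros((n, nv)); b_eq = np.ones(n)
    for j in range(n):
        A_eq[j, [idx(j, t) for t in range(T)]] = 1.0
    rows, rhs = [], []
    for t in range(T):                    # machine capacity per slot
        row = np.zeros(nv)
        for j in range(n):
            row[idx(j, t)] = 1.0
        rows.append(row); rhs.append(float(m))
    for (i, j) in edges:                  # precedence, for each t
        for t in range(T):
            row = np.zeros(nv)            # sum_{t'<=t} x_j - sum_{t'<t} x_i <= 0
            for tp in range(t + 1):
                row[idx(j, tp)] += 1.0
            for tp in range(t):
                row[idx(i, tp)] -= 1.0
            rows.append(row); rhs.append(0.0)
    res = linprog(np.zeros(nv), A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * nv,
                  method="highs")
    return res.status == 0                # status 0 = feasible solution found

# A chain 0 -> 1 on one machine needs two time slots:
print(lp_feasible(2, [(0, 1)], m=1, T=1))  # False
print(lp_feasible(2, [(0, 1)], m=1, T=2))  # True
```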

SLIDE 75

LP Cheats…

2/3 1/3 2/3

Optimal makespan is 4 but LP can complete in 3 time slots.

Time DAG

LP can schedule a job fractionally in a time slot.

SLIDE 76

Interval of a task

Consider the LP solution. The interval of a task is the smallest interval [t1, t2] that contains the entire fractional schedule of the task (e.g., fractions 1/10, 1/10, 3/10, 5/10 spread over time).
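Extracting this interval from a fractional LP solution is one line of bookkeeping; a minimal sketch (the slot numbers are illustrative):

```python
# Interval of a task: the smallest time interval containing all slots
# where the LP schedules the task fractionally (x[j][t] > 0).

def task_interval(x_j):
    """x_j maps time slot -> fractional amount for one task."""
    support = [t for t, frac in x_j.items() if frac > 0]
    return min(support), max(support)

# The slide's example fractions 1/10, 1/10, 3/10, 5/10:
x_j = {2: 0.1, 4: 0.1, 5: 0.3, 9: 0.5}
print(task_interval(x_j))  # (2, 9)
```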

t1 t2

SLIDE 77

What LP gives?

An interval for each task.

Time

SLIDE 78

What LP gives?

An interval for each task.

Time

We use these intervals to partition the DAG into top and bottom tasks.

SLIDE 79

Building Binary Tree

The LP schedules all tasks in [0, T]. Split [0, T] into [0, T/2] and [T/2 + 1, T], then each half again ([0, T/4], [T/4 + 1, T/2], …), and so on.

SLIDE 80

Building Binary Tree

The LP schedules all tasks in [0, T]: the root of a binary tree of depth log T, with children [0, T/2] and [T/2 + 1, T], and so on.

Assign each task to the smallest interval node in the tree that fully contains its interval.

SLIDE 81

Building Binary Tree

The LP schedules all tasks in [0, T]: the root of a binary tree of depth log T, with children [0, T/2] and [T/2 + 1, T], and so on.

Assign each task to the smallest interval node in the tree that fully contains its interval.

SLIDE 82

Building Binary Tree

The LP schedules all tasks in [0, T]: the root of a binary tree of depth log T, with children [0, T/2] and [T/2 + 1, T], and so on.

Assign each task to the smallest interval node in the tree that fully contains its interval.

SLIDE 83

Building Binary Tree

The LP schedules all tasks in [0, T]: the root of a binary tree of depth log T, with children [0, T/2] and [T/2 + 1, T], and so on.

Assign each task to the smallest interval node in the tree that fully contains its interval.
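Finding the smallest tree node that fully contains a task's interval is a walk down from the root. A sketch, assuming 0-indexed slots and T a power of two (both assumptions are mine, for illustration):

```python
# Assign a task's interval [a, b] (0-indexed slots, T a power of two)
# to the smallest dyadic interval of the binary tree that contains it:
# descend from the root, going left or right while the interval fits.

def smallest_dyadic_node(a, b, T):
    lo, hi = 0, T - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if b <= mid:
            hi = mid              # interval fits in the left child
        elif a > mid:
            lo = mid + 1          # interval fits in the right child
        else:
            break                 # interval straddles the midpoint: stop
    return lo, hi

print(smallest_dyadic_node(0, 1, 16))  # (0, 1)
print(smallest_dyadic_node(7, 8, 16))  # (0, 15): straddles the middle
```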

SLIDE 84

Defining Top and Bottom Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

(log log T)^2

SLIDE 85

Defining Top and Bottom Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

(log log T)^2

Throw Them Away!!

log log T

SLIDE 86

Defining Top and Bottom Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

(log log T)^2

Top Tasks Bottom Tasks Sets Throw Them Away!!

log log T

SLIDE 87

Defining Top and Bottom Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

(log log T)^2

Top Tasks Bottom Tasks Sets Throw Them Away!!

  • 1. Precedence constraints between bottom tasks should be implied.
  • 2. The precedence constraints between top and bottom tasks are loose.
  • 3. The chain length among top tasks is small.
SLIDE 88

Defining Top and Bottom Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

Top Tasks Bottom Tasks Sets Throw Them Away!!

log log T

SLIDE 89

Time Interval

T1 T2 T3

Precedence constraints going from bottom to top tasks are loose.

[T2, T3] T4

Every top task can lose one interval to the left and one interval to the right in terms of the space in which it should be scheduled. But bottom intervals are tiny compared to top intervals, so this is not a big loss.

Top tasks Bottom tasks Bottom tasks

SLIDE 90

Time Interval

T1 T2 T3

Precedence constraints going from bottom to top tasks are loose.

[T2, T3] T4

Every top task can lose one interval to the left and one interval to the right in terms of the space in which it should be scheduled. But bottom intervals are tiny compared to top intervals, so this is not a big loss.

Top tasks Bottom tasks Bottom tasks

  • 1. Precedence constraints between bottom tasks should be implied.
  • 2. The precedence constraints between top and bottom tasks are loose.
  • 3. The chain length among top tasks is small.
SLIDE 91

Lift and Project Method (LP Hierarchies)

Dimensions: the number of variables in the LP that you want to be integral; in the original LP (the integer program), all the variables are integral.

A systematic way of placing troublesome tasks!

SLIDE 92

Lift and Project Method (LP Hierarchies)

Dimensions: the number of variables in the LP that you want to be integral; in the original LP (the integer program), all the variables are integral. The running time increases by a factor of n per dimension: O(n^S) for S dimensions.

A systematic way of placing troublesome tasks!

SLIDE 93

Lift and Project Method (LP Hierarchies)

Time 1/10 1/10 3/10 5/10

t1 t2

“Conditioning” Touch a variable, and it becomes integral!

SLIDE 94

Lift and Project Method (LP Hierarchies)

Time 1/10 1/10 3/10 5/10

t1 t2

“Conditioning” Touch a variable, and it becomes integral!

SLIDE 95

Lift and Project Method (LP Hierarchies)

Time 10/10 “Conditioning” Touch a variable, and it becomes integral!

SLIDE 96

Lift and Project Method (LP Hierarchies)

Time 10/10

t1 t2

“Conditioning” Touch a variable, and it becomes integral!

SLIDE 97

Lift and Project Method (LP Hierarchies)

“Conditioning”: touch a variable, and it becomes integral! The LP solution changes in such a way that, for every other task, the interval in which it is scheduled in the new solution only shrinks: I now have a better understanding of where that task got scheduled.

SLIDE 98

Reducing Chain Length of Top Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

SLIDE 99

Reducing Chain Length of Top Tasks

The interval is of length T. We will make sure that there is no chain of length εT assigned to this interval.

SLIDE 100

Reducing Chain Length of Top Tasks

The interval is of length T. We will make sure that there is no chain of length εT assigned to this interval. Condition on a variable with xjt > 0 along such a chain.

SLIDE 101

Reducing Chain Length of Top Tasks

The interval is of length T. We will make sure that there is no chain of length εT assigned to this interval. Condition on a variable with xjt > 0 along such a chain.

SLIDE 102

Reducing Chain Length of Top Tasks

The interval is of length T. We will make sure that there is no chain of length εT assigned to this interval.

SLIDE 103

Reducing Chain Length of Top Tasks

[0, T] [0, T/2]

[T/2 + 1, T]

SLIDE 104

Reducing Chain Length of Top Tasks

The interval is of length T. We will make sure that there is no chain of length εT assigned to this interval.

How many conditionings are required? m/ε per interval. Now recall that the number of intervals containing top tasks is 2^((log log T)^2) = (log T)^(log log T).

SLIDE 105

Reducing Chain Length of Top Tasks

The interval is of length T. We will make sure that there is no chain of length εT assigned to this interval.

How many conditionings are required? m/ε per interval. Now recall that the number of intervals containing top tasks is 2^((log log T)^2) = (log T)^(log log T).

Running time: O(m/ε · (log T)^(log log T)) conditionings, which is quasi-polynomial.

SLIDE 106

There is a quasi-polynomial-time (1 + ε)-approximation for minimizing makespan when jobs have arbitrary lengths and the number of machines is a constant. The algorithm schedules jobs on a single machine and may preempt jobs within a machine.

  • Theorem. Garg, Kulkarni, Li ’18.

There is a polynomial-time (2 + ε)-approximation for minimizing weighted completion time of jobs, when the number of machines and job sizes are uniform.

More sophisticated use of conditioning and new algorithms for scheduling top tasks.

SLIDE 107

Intuition of Graphene

Ø Our approach is to identify the potentially troublesome tasks, such as those that run for a very long time or are hard to pack.
Ø Place the troublesome tasks first onto a virtual resource-time space. This space would have d + 1 dimensions when tasks require d resources; the last dimension being time.
Ø Our intuition is that placing the troublesome tasks first leads to a good schedule since the remaining tasks can be placed into resultant holes in this space.

Lift and Project Algorithms

Ø Using Lift and Project to figure out placing long tasks. Is there a simple, say DP, approach to it?
Ø Can we use LP support for placing tasks?
Ø Can recursion help in the Graphene setting?

Big Picture

SLIDE 108

Identical Machines Scheduling and Training Neural Networks

PipeDream: Fast and Efficient Pipeline Parallel DNN Training Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons

SLIDE 109

Training Deep Learning Models

Ø Large fraction of the data center workloads for many companies. Ø Improving training time is considered very important. Ø DAGs are good abstractions of DNN training computations. Ø Connections to DAG scheduling and communication delay problems.

SLIDE 110

Two Paradigms

Data Parallelism Model Parallelism

SLIDE 111

Model Parallelism

Ø Schedule the layers among a set of machines, typically identical.
Ø Or at most 2 types: CPU + FPGA, CPU + GPU, etc.

SLIDE 112

Model Parallelism

Ø Schedule the layers among a set of machines, typically identical.
Ø Or at most 2 types: CPU + FPGA, CPU + GPU, etc.

Ø There is communication between layers. Communication cost is crucial.

SLIDE 113

Model Parallelism

These problems are quite similar to scheduling with communication delays under precedence constraints (PY’90, VLL’90, MH’95, HLV’94). Very poorly understood. Good scheduling has the same effect as caching!

Zhicheng Yin, Jin Sun, Ming Li, Jaliya Ekanayake, Haibo Lin, Marc Friedman, José A. Blakeley, Clemens A. Szyperski, Nikhil R. Devanur. Bubble Execution: Resource-aware Reliable Analytics at Cloud Scale. PVLDB 11(7). PipeDream: Fast and Efficient Pipeline Parallel DNN Training Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons

SLIDE 114

Summary: Data Center Scheduling

Resources

Heterogeneous (FPGA + CPU, GPU + CPU) Multidimensional (CPU, memory, network)

Jobs

Complex dependencies: DAGs, Co-flows, etc.

Algorithms

Fast, simple, often online.

SLIDE 115

Summary: Data Center Scheduling

Resources

Heterogeneous (FPGA + CPU, GPU + CPU) Multidimensional (CPU, memory, network)

Jobs

Complex dependencies: DAGs, Co-flows, etc.

Algorithms

Fast, simple, often online.

Ø Often hard in worst case. What’s the right model? Ø Understand DAGs that arise in practice. Say DNNs. Ø What are the high-level algorithmic intuitions?

SLIDE 116

Summary: Data Center Scheduling

Resources

Heterogeneous (FPGA + CPU, GPU + CPU) Multidimensional (CPU, memory, network)

Jobs

Complex dependencies: DAGs, Co-flows, etc.

Algorithms

Fast, simple, often online.

Ø Often hard in worst case. What’s the right model? Ø Understand DAGs that arise in practice. Say DNNs. Ø What are the high-level algorithmic intuitions?