Multi-Resource Packing for Cluster Schedulers - Robert - PowerPoint PPT Presentation





SLIDE 1

Multi-Resource Packing for Cluster Schedulers

Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, Aditya Akella

SLIDE 2

Performance of cluster schedulers

We find that:

  • Resources are fragmented, i.e., machines run below capacity
  • Even at 100% usage, goodput is lower due to over-allocation
  • Pareto-efficient multi-resource fair schemes do not lead to good average performance

Tetris

Up to 40% improvement in makespan1 and job completion time, with near-perfect fairness

1Time to finish a set of jobs

SLIDE 3

Findings from analysis of Bing and Facebook traces

  • Tasks need varying amounts of each resource
  • Demands for resources are weakly correlated

Applications have (very) diverse resource needs. Multiple resources become tight. This matters because there is no single bottleneck resource in the cluster:

  • E.g., enough cross-rack network bandwidth to use all cores


Upper bound on potential gains

  • Makespan reduces by ≈ 49%
  • Avg. job completion time reduces by ≈ 46%
SLIDE 4


Why so bad #1

Production schedulers neither pack tasks nor consider all their relevant resource demands

#1 Resource Fragmentation
#2 Over-Allocation

SLIDE 5

Current Schedulers vs. “Packer” Scheduler

Resource Fragmentation (RF)

Machines A and B each have 4 GB of memory. Tasks T1 and T2 need 2 GB each; T3 needs 4 GB.

  • Current schedulers allocate resources per slots, fairness; they are not explicit about packing. Spreading T1 and T2 across both machines leaves neither with 4 GB free, so T3 must wait. Avg. task compl. time = 1.33t
  • A “packer” scheduler co-locates T1 and T2 on machine A, leaving machine B free for T3. Avg. task compl. time = 1t

RF increases with the number of resources being allocated!
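The arithmetic in the fragmentation example can be checked with a tiny simulation. This is an illustrative sketch, assuming every task runs for exactly one time unit (t = 1) and starts as soon as its assigned machine has enough free memory; task and machine names follow the slide.

```python
def avg_completion(machines, tasks, placements):
    """Simulate fixed placements: each task runs for 1 time unit and
    starts as soon as its machine has enough free memory.
    machines: {name: memory capacity}; tasks: {name: memory demand};
    placements: [(task, machine)] in arrival order. Every task must
    individually fit within its machine's capacity."""
    running = {m: [] for m in machines}  # (finish_time, demand) per machine
    finish = {}
    for task, m in placements:
        start = 0.0
        while True:
            used = sum(d for f, d in running[m] if f > start)
            if machines[m] - used >= tasks[task]:
                break
            # advance to the next time a task on this machine finishes
            start = min(f for f, d in running[m] if f > start)
        running[m].append((start + 1.0, tasks[task]))
        finish[task] = start + 1.0
    return sum(finish.values()) / len(finish)

machines = {"A": 4, "B": 4}          # 4 GB memory each
tasks = {"T1": 2, "T2": 2, "T3": 4}  # GB
# Current scheduler spreads T1, T2; T3 must wait for a whole machine.
print(round(avg_completion(machines, tasks,
                           [("T1", "A"), ("T2", "B"), ("T3", "A")]), 2))  # 1.33
# Packer co-locates T1, T2 on A, leaving B free for T3.
print(round(avg_completion(machines, tasks,
                           [("T1", "A"), ("T2", "A"), ("T3", "B")]), 2))  # 1.0
```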

SLIDE 6

Current Schedulers vs. “Packer” Scheduler

Over-Allocation

Machine A has 4 GB of memory and 20 MB/s of network. T1 and T2 each need 2 GB of memory and 20 MB/s of network; T3 needs only 2 GB of memory.

Not all of the resources are explicitly allocated; e.g., disk and network can be over-allocated.

  • Current schedulers see only memory and may run T1 and T2 together, over-allocating the network so both slow down. Avg. task compl. time = 2.33t
  • A “packer” scheduler accounts for the network demand as well and runs T3 alongside. Avg. task compl. time = 1.33t

SLIDE 7

Why so bad #2

Work conserving != no fragmentation, no over-allocation

Multi-resource fairness schemes do not solve the problem:

  • They treat the cluster as one big bag of resources → hides the impact of resource fragmentation
  • They assume a job has a fixed resource profile → different tasks in the same job have different demands

How the job is scheduled impacts jobs’ current resource profiles; one can schedule to create complementarity.

Pareto1-efficient != performant. Example in paper: Packer vs. DRF, makespan and avg. completion time improve by over 30%.

1no job can increase its share without decreasing the share of another

SLIDE 8

Competing objectives

Job completion time vs. Fairness vs. Cluster efficiency

Current schedulers:

  • 1. Resource Fragmentation
  • 2. Over-Allocation
  • 3. Fair allocations sacrifice performance

SLIDE 9

# 1

Pack tasks along multiple resources to improve cluster efficiency and reduce makespan

SLIDE 10

Theory vs. Practice

Multi-resource packing of tasks is similar to multi-dimensional bin packing: balls could be tasks; the bin could be machine, time. The problem is APX-Hard1.

1APX-Hard is a strict subset of NP-hard

Existing heuristics do not directly apply:

  • They assume balls of a fixed size → task demands vary with time / machine placed, and are elastic
  • They assume balls are known apriori → must cope with online arrival of jobs, dependencies, cluster activity

Avoiding fragmentation looks like:

  • Tight bin packing
  • Reduce # of bins → reduce makespan

SLIDE 11

# 1

A packing heuristic

  • 1. Check for fit to ensure no over-allocation: the task’s resource demand vector must be within the machine’s available resource vector → Fit
  • 2. Alignment score (A) for each (task, machine) pair that fits

A works because:

  • 1. The fit check prevents over-allocation
  • 2. Bigger balls get bigger scores
  • 3. Abundant resources are used first → reduces resource fragmentation
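The slide names the alignment score A without defining it; the Tetris paper computes it as a dot product of the task's demand vector and the machine's available-resource vector. A minimal sketch under that assumption, with illustrative resource values:

```python
def fits(demand, available):
    """A task fits only if every per-resource demand is within the
    machine's availability, which prevents over-allocation."""
    return all(d <= a for d, a in zip(demand, available))

def alignment_score(demand, available):
    """Dot product of demand and availability: bigger tasks score
    higher, and machines whose abundant resources match the task's
    needs win, so abundant resources get used first."""
    if not fits(demand, available):
        return None  # would over-allocate; task cannot be placed here
    return sum(d * a for d, a in zip(demand, available))

# Vectors are (memory GB, network MB/s); values are illustrative.
machine_a = (4.0, 20.0)
print(alignment_score((2.0, 20.0), machine_a))  # 408.0
print(alignment_score((2.0, 40.0), machine_a))  # None (network over-allocated)
```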

SLIDE 12

# 2

Faster average job completion time

SLIDE 13

CHALLENGE # 2

Job Completion Time Heuristic

Shortest Remaining Time First1 (SRTF) schedules jobs in ascending order of their remaining time.

1SRTF – M. Harchol-Balter et al., Connection Scheduling in Web Servers [USITS ’99]

Q: What is the shortest remaining time?

remaining work = remaining # tasks & tasks’ durations & tasks’ resource demands

A job completion time heuristic:

  • Gives a score P to every job
  • Extends SRTF to incorporate multiple resources
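A minimal sketch of the P score. The slide says remaining work combines the number of remaining tasks, their durations, and their resource demands, but does not spell out the formula; the combination below (duration times summed demand, with less remaining work scoring higher) is an assumption for illustration:

```python
def remaining_work(tasks):
    """Remaining work of a job: sum over its remaining tasks of
    duration x total resource demand (assumed formula combining the
    three quantities named on the slide)."""
    return sum(t["duration"] * sum(t["demand"]) for t in tasks)

def srtf_score(job, runnable_jobs):
    """P score: extends SRTF to multiple resources by ranking jobs so
    the one with the least remaining work scores highest."""
    most = max(remaining_work(j["tasks"]) for j in runnable_jobs)
    return most - remaining_work(job["tasks"])

# Illustrative jobs: demands are (memory, network) vectors.
small = {"tasks": [{"duration": 1.0, "demand": (2.0, 1.0)}]}
big = {"tasks": [{"duration": 2.0, "demand": (2.0, 2.0)},
                 {"duration": 1.0, "demand": (1.0, 1.0)}]}
print(srtf_score(small, [small, big]))  # 7.0: scheduled first
print(srtf_score(big, [small, big]))    # 0.0
```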
SLIDE 14

CHALLENGE # 2

Job Completion Time Heuristic

Combine the A and P scores! Packing efficiency vs. completion time: A alone delays job completion time; P alone loses packing efficiency.

1: among J runnable jobs
2: score(j) = A(t, R) + P(j)
3: where t = the task in j with max score and demandt ≤ R (resources free)
4: pick j*, t* = argmax score(j)
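The four-line pseudocode can be fleshed out as one matching step. The inline dot-product A and the precomputed `p_score` field stand in for the A and P heuristics of the previous slides and are assumptions for illustration:

```python
def schedule_next(jobs, available):
    """One matching step: among runnable jobs, consider each task that
    fits in the free resources R, score it A(t, R) + P(j), and return
    the best (score, job, task) triple, or None if nothing fits."""
    best = None
    for job in jobs:
        for task in job["tasks"]:
            if not all(d <= a for d, a in zip(task["demand"], available)):
                continue  # skip tasks that would over-allocate
            a_score = sum(d * r for d, r in zip(task["demand"], available))
            score = a_score + job["p_score"]
            if best is None or score > best[0]:
                best = (score, job, task)
    return best

# Illustrative data: demands are (memory GB, network MB/s) vectors.
jobs = [
    {"p_score": 0.0, "tasks": [{"demand": (2.0, 20.0)}]},
    {"p_score": 5.0, "tasks": [{"demand": (2.0, 0.0)}]},
]
score, job, task = schedule_next(jobs, (4.0, 20.0))
print(task["demand"])  # (2.0, 20.0): high alignment beats the small P edge
```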

SLIDE 15

# 3

Achieve performance and fairness

SLIDE 16

# 3

Fairness Heuristic

  • Packer says: task T should go next to improve packing efficiency
  • SRTF says: schedule job J to improve avg. completion time
  • Fairness says: this set of jobs should be scheduled next

Performance and fairness do not mix well in general. But… we can get perfect fairness and much better performance. It is possible to satisfy all three; in fact, this happens often in practice.

SLIDE 17

# 3

Fairness Heuristic

Fairness is not a tight constraint:

  • Long-term fairness, not short-term fairness
  • Lose a bit of fairness for a lot of gains in performance

Heuristic: fairness knob F ∈ [0, 1). Pick the best-for-performance task from among the 1−F fraction of jobs furthest from their fair share.

  • F = 0 → most efficient scheduling, but most unfair
  • F → 1 → close to perfect fairness
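A sketch of the fairness knob. The `deficit` field (fair share minus current allocation) and the per-task performance `score` are hypothetical names introduced for illustration; the slide only specifies that the packer chooses among the 1−F fraction of jobs furthest from their fair share:

```python
def pick_with_fairness(jobs, available, fairness_knob):
    """Fairness knob F in [0, 1): restrict the packer's choice to the
    (1 - F) fraction of jobs furthest below their fair share, then pick
    the best-for-performance task among them that fits."""
    # Jobs furthest from their fair share come first.
    eligible = sorted(jobs, key=lambda j: j["deficit"], reverse=True)
    k = max(1, int(round((1 - fairness_knob) * len(eligible))))
    best = None
    for job in eligible[:k]:
        for task in job["tasks"]:
            if all(d <= a for d, a in zip(task["demand"], available)):
                if best is None or task["score"] > best["score"]:
                    best = task
    return best

# F = 0 considers every job (most efficient); F near 1 considers only
# the most deprived jobs (close to perfect fairness).
jobs = [
    {"deficit": 10.0, "tasks": [{"demand": (2.0,), "score": 1.0}]},
    {"deficit": 0.0, "tasks": [{"demand": (2.0,), "score": 9.0}]},
]
print(pick_with_fairness(jobs, (4.0,), 0.0)["score"])  # 9.0
print(pick_with_fairness(jobs, (4.0,), 0.9)["score"])  # 1.0
```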

SLIDE 18


Putting it all together

We saw:

  • Packing efficiency
  • Prefer small remaining work
  • Fairness knob

Other things in the paper:

  • Estimate task demands
  • Deal with inaccuracies, barriers
  • Other cluster activities

YARN architecture, with the changes to add Tetris shown in orange:

  • Job Manager: multi-resource asks; barrier hint
  • Node Manager: track resource usage; enforce allocations
  • Cluster-wide Resource Manager: new logic to match tasks to machines (+packing, +SRTF, +fairness)
  • Messages between them: asks, offers, allocations, resource availability reports

SLIDE 19

Evaluation

  • Implemented in Yarn 2.4
  • 250-machine cluster deployment
  • Bing and Facebook workloads


SLIDE 20

Efficiency

Tetris vs. a multi-resource scheduler: makespan improves by 28% and avg. job completion time by 35%.

Tetris vs. a single-resource scheduler: makespan improves by 29% and avg. job completion time by 30%.

Gains come from:

  • avoiding fragmentation (a low utilization value → high fragmentation)
  • avoiding over-allocation (utilization above 100%)

[Figure: CPU, memory, network-in, and storage utilization (%) over time (s) for each scheduler.]

SLIDE 21

Fairness

Fairness knob F quantifies the extent to which Tetris adheres to fair allocation.

                       No Fairness   F = 0.25   Full Fairness
                       (F = 0)                  (F → 1)
Makespan               50 %          25 %       10 %
Job Compl. Time        40 %          35 %       23 %
Avg. Slowdown          25 %          5 %        2 %
[over impacted jobs]

SLIDE 22

Pack efficiently along multiple resources. Prefer jobs with less remaining work. Incorporate fairness.

  • Combine heuristics that improve packing efficiency with those that lower average job completion time
  • Achieving desired amounts of fairness can coexist with improving cluster performance
  • Implemented inside YARN; deployment and trace-driven simulations show encouraging initial results

We are working towards a Yarn check-in

http://research.microsoft.com/en-us/UM/redmond/projects/tetris/
