Learning Scheduling Algorithms for Data Processing Clusters (PowerPoint PPT Presentation)



SLIDE 1

Learning Scheduling Algorithms for Data Processing Clusters

Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh

SLIDE 2

Motivation

Scheduling is a fundamental task in computer systems

  • Cluster management (e.g., Kubernetes, Mesos, Borg)
  • Data analytics frameworks (e.g., Spark, Hadoop)
  • Machine learning (e.g., Tensorflow)

An efficient scheduler matters for large datacenters

  • Small improvements can save millions of dollars at scale

SLIDE 3

Designing Optimal Schedulers is Intractable

Must consider many factors for optimal performance:

  • Job dependency structure
  • Modeling complexity
  • Placement constraints
  • Data locality
  • ……

Graphene [OSDI '16], Carbyne [OSDI '16], Tetris [SIGCOMM '14], Jockey [EuroSys '12], TetriSched [EuroSys '16], device placement [NIPS '17], Delay Scheduling [EuroSys '10], …

Practical deployment:

  • Ignore complexity → resort to simple heuristics
  • Sophisticated system → complex configurations and tuning

No “one-size-fits-all” solution: Best algorithm depends on specific workload and system

SLIDE 4

Can machine learning help tame the complexity of efficient schedulers for data processing jobs?

SLIDE 5

Decima: A Learned Cluster Scheduler

  • Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)

[Figure: job DAGs (Job 1, Job 2, Job 3) are submitted to the scheduler, which places their stages on executors 1…m. "Stages" are groups of identical tasks that can run in parallel, linked by data dependencies.]

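A minimal sketch of this job representation, assuming a plain stage/DAG data structure; the class and field names below are illustrative, not Decima's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """A "stage": a group of identical tasks that can run in parallel."""
    stage_id: int
    num_tasks: int
    parents: list = field(default_factory=list)  # upstream stage_ids (data dependencies)

@dataclass
class JobDAG:
    stages: dict  # stage_id -> Stage

    def runnable_stages(self, finished):
        """Stages not yet done whose data dependencies are all satisfied."""
        return [sid for sid, s in self.stages.items()
                if sid not in finished and all(p in finished for p in s.parents)]

# Example: stage 2 joins the outputs of stages 0 and 1
job = JobDAG(stages={
    0: Stage(0, num_tasks=10),
    1: Stage(1, num_tasks=4),
    2: Stage(2, num_tasks=8, parents=[0, 1]),
})
```

Initially only stages 0 and 1 are runnable; stage 2 becomes schedulable once both parents finish.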

SLIDE 6

Decima: A Learned Cluster Scheduler

  • Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)

[Figure: a job DAG (Job 1) is submitted to the scheduler, which places its stages on servers 1…m.]

SLIDE 7

Design Overview

[Figure: the RL loop. The scheduling agent observes the state (job DAGs 1…n, executors 1…m, cluster status); a graph neural network feeds a policy network that outputs a distribution over schedulable nodes; the environment executes the decision and returns a reward tied to the objective.]

SLIDE 8

Demo

[Animation: number of servers working on each job over time.]

Scheduling policy: FIFO. Average Job Completion Time: 225 sec

SLIDE 9

Scheduling policy: Shortest-Job-First. Average Job Completion Time: 135 sec

SLIDE 10

Scheduling policy: Fair. Average Job Completion Time: 120 sec

SLIDE 11

Side by side: Shortest-Job-First (Average Job Completion Time: 135 sec) vs. Fair (120 sec)

SLIDE 12

Scheduling policy: Decima. Average Job Completion Time: 98 sec

SLIDE 13

Side by side: Fair (Average Job Completion Time: 120 sec) vs. Decima (98 sec)

SLIDE 14

Contributions


  • 1. First RL-based scheduler for complex data processing jobs
  • 2. Scalable graph neural network to express scheduling policies
  • 3. New learning method that enables training with online job arrivals

SLIDE 15

Encode Scheduling Decisions as Actions

[Figure: job DAGs 1…n and servers 1…m holding a set of identical free executors.]

SLIDE 16

Option 1: Assign All Executors in One Action

Problem: huge action space


SLIDE 17

Option 2: Assign One Executor Per Action

Problem: long action sequences


SLIDE 18

Decima: Assign Groups of Executors per Action

[Figure: one action assigns a group of executors, e.g., "use 3 servers" for one node and "use 1 server" for two others.]

Action = (node, parallelism limit)
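Applying one such (node, parallelism limit) action can be sketched as below; the dict-based allocation and node names are hypothetical, purely to show the shape of the decision:

```python
def apply_action(action, free_executors, assigned):
    """Grant the chosen node free executors up to its parallelism limit.
    Illustrative sketch, not Decima's exact assignment logic."""
    node, limit = action
    current = assigned.get(node, 0)
    # Never exceed the limit for this node, nor the free-executor pool.
    grant = min(max(limit - current, 0), free_executors)
    assigned[node] = current + grant
    return free_executors - grant

free = 5
alloc = {}
free = apply_action(("node_a", 3), free, alloc)  # node_a gets 3 executors
free = apply_action(("node_b", 4), free, alloc)  # only 2 remain, node_b gets 2
```

Grouping executors this way keeps the action space small (unlike Option 1) while avoiding one action per executor (unlike Option 2).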

SLIDE 19

Process Job Information

Node features:

  • # of tasks
  • avg. task duration
  • # of servers currently assigned to the node
  • are free servers local to this job?

Handles an arbitrary number of jobs.
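A minimal sketch of how these per-node features might be assembled; the ordering and lack of normalization are illustrative, not the paper's exact encoding:

```python
import numpy as np

def node_features(num_tasks, avg_task_duration, assigned_servers, free_servers_local):
    """One feature vector per DAG node, mirroring the list above."""
    return np.array([num_tasks, avg_task_duration, assigned_servers,
                     1.0 if free_servers_local else 0.0], dtype=np.float32)

# One row per node; the row count varies with the number of jobs and nodes,
# which is why a fixed-size network input would not work here.
feats = np.stack([
    node_features(10, 2.0, 3, True),
    node_features(4, 5.0, 0, False),
])
```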

SLIDE 20

Graph Neural Network

[Figure: the graph neural network embeds the job DAG and outputs a score on each node (e.g., 6, 8, 3, 2).]
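The per-node scoring can be sketched as one bottom-up message-passing sweep over the DAG. The weights below are random and untrained, purely to show the shape of the computation; Decima's actual network is learned end-to-end and architecturally richer:

```python
import numpy as np

def gnn_node_scores(features, children, seed=0):
    """Each node's embedding combines its own features with a transformed
    sum of its children's embeddings; a scalar score is read out per node."""
    d = len(next(iter(features.values())))
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d, d))   # message transform (untrained)
    w_out = rng.normal(scale=0.1, size=d)    # score readout (untrained)
    emb = {}

    def embed(n):
        if n not in emb:
            # Aggregate child embeddings (empty sum for leaf nodes).
            msg = sum((embed(c) for c in children.get(n, [])), np.zeros(d))
            emb[n] = np.tanh(np.asarray(features[n], dtype=float) + W @ msg)
        return emb[n]

    return {n: float(w_out @ embed(n)) for n in features}

scores = gnn_node_scores(
    features={"a": [1.0, 0.5], "b": [0.2, 2.0], "c": [0.7, 0.1]},
    children={"a": ["b", "c"]},   # node a depends on b and c
)
```

Because the same shared weights are applied at every node, the network handles DAGs of any size and shape, which is what makes the policy scalable.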

SLIDE 21

Training

The Decima agent interacts with the cluster (in simulation) to generate experience data, which drives reinforcement learning training.

SLIDE 22

Handle Online Job Arrivals

The RL agent has to experience continuous job arrivals during training; simply feeding it long sequences of jobs is inefficient.

[Plot: number of backlogged jobs vs. time. The initial random policy lets the backlog grow.]

SLIDE 23

[Plot continued: under the initial random policy the backlog keeps growing, so most of each long episode wastes training time.]

SLIDE 24

[Plot continued: early reset for initial training, terminating episodes before the backlog explodes.]

SLIDE 25

Curriculum learning: as training proceeds, the stronger policy keeps the queue stable, so the reset time is gradually increased.
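A minimal sketch of this early-reset curriculum; every constant below is illustrative, not Decima's actual schedule:

```python
def episode_horizon(step, initial=200, growth=2.0, lengthen_every=10_000, cap=50_000):
    """Episode length (e.g., simulated seconds) for a given training step:
    short episodes at first, lengthened geometrically as training proceeds,
    up to a fixed cap."""
    n = step // lengthen_every
    horizon = initial
    for _ in range(n):
        horizon = min(cap, int(horizon * growth))
        if horizon >= cap:
            break
    return horizon
```

Early on the agent only sees short bursts of arrivals; once its policy can keep the queue stable, longer horizons expose it to sustained online load.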

SLIDE 26

Variance from Job Sequences

The RL agent needs to be robust to variation in job arrival patterns; this huge variance can throw off the training process.

SLIDE 27

[Plot: job size vs. time t. The same action a_t taken at time t can be followed by very different future workloads (#1 vs. #2).]

Score for action a_t = (return after a_t) − (average return) = Σ_{t′ ≥ t} r_{t′} − b(s_t)

Must consider the entire job sequence to score actions.

SLIDE 28

Input-Dependent Baseline

Standard baseline: Score for action a_t = Σ_{t′ ≥ t} r_{t′} − b(s_t)

Input-dependent baseline: Score for action a_t = Σ_{t′ ≥ t} r_{t′} − b(s_t, z_t, z_{t+1}, …)

where b(s_t, z_t, z_{t+1}, …) is the average return for trajectories from state s_t with job sequence z_t, z_{t+1}, …

Broadly applicable to other systems with external input process: Adaptive video streaming, load balancing, caching, robotics with disturbance…

  • Variance Reduction for Reinforcement Learning in Input-Driven Environments. Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. International Conference on Learning Representations (ICLR), 2019.
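One way to realize such a baseline is to roll out several trajectories on the same recorded job-arrival sequence and subtract their mean return. The sketch below assumes returns are already grouped by input sequence; the keys and structure are illustrative:

```python
import numpy as np

def advantage_scores(returns_by_input):
    """Input-dependent baseline: average returns only across rollouts that
    saw the SAME job-arrival sequence z, so workload randomness cancels out
    of the action scores instead of drowning the policy's contribution."""
    scores = {}
    for z, returns in returns_by_input.items():
        baseline = float(np.mean(returns))      # b(s, z, ...)
        scores[z] = [r - baseline for r in returns]
    return scores

# Two very different workloads; within each, scores compare rollouts fairly.
scores = advantage_scores({
    "light_workload": [-10.0, -12.0, -8.0],
    "heavy_workload": [-100.0, -90.0, -110.0],
})
```

A workload-agnostic baseline would score every "heavy_workload" rollout as terrible regardless of the policy's choices; conditioning on the input sequence removes that confound.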

SLIDE 29
Decima vs. Baselines: Batched Arrivals

  • 20 TPC-H queries sampled at random; input sizes: 2, 5, 10, 20, 50, 100 GB
  • Decima trained on a simulator; tested on a real Spark cluster

Decima improves average job completion time by 21% to 3.1× over baseline schemes.

SLIDE 30

Decima with Continuous Job Arrivals

1,000 jobs arrive as a Poisson process with an average inter-arrival time of 25 sec. Decima achieves 28% lower average JCT than the best heuristic, and 2× better JCT in overload.
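The arrival process above can be reproduced with i.i.d. exponential inter-arrival gaps; a `numpy` sketch under the stated 25-second mean (the seed and function name are arbitrary):

```python
import numpy as np

def poisson_arrivals(n_jobs=1000, mean_interarrival=25.0, seed=0):
    """Job arrival times for a Poisson process: cumulative sum of
    exponential inter-arrival gaps with the given mean."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(mean_interarrival, size=n_jobs)
    return np.cumsum(gaps)

times = poisson_arrivals()
```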

SLIDE 31

Understanding Decima

[Figure: comparison of schedules under tuned weighted fair vs. Decima.]

SLIDE 32

Flexibility: Multi-Resource Scheduling

Industrial trace (Alibaba): 20,000 jobs from a production cluster. Multi-resource requirements: CPU cores + memory units.

SLIDE 33
Other Evaluations

  • Impact of each component in the learning algorithm
  • Generalization to different workloads
  • Training and inference speed
  • Handling missing features
  • Optimality gap

SLIDE 34

Summary

  • Decima uses reinforcement learning to generate workload-specific scheduling algorithms
  • Decima employs curriculum learning and variance reduction to enable training with stochastic job arrivals
  • Decima leverages a scalable graph neural network to process an arbitrary number of job DAGs
  • Decima outperforms existing heuristics and is flexible enough to apply to other applications

http://web.mit.edu/decima/