SLIDE 1

Building and Optimizing Learning-augmented Computer Systems

Hongzi Mao, October 24, 2019

  • Learning Scheduling Algorithms for Data Processing Clusters. Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh. ACM SIGCOMM, 2019.

SLIDE 2

Motivation

Scheduling is a fundamental task in computer systems

  • Cluster management (e.g., Kubernetes, Mesos, Borg)
  • Data analytics frameworks (e.g., Spark, Hadoop)
  • Machine learning (e.g., TensorFlow)

An efficient scheduler matters for large datacenters

  • A small improvement can save millions of dollars at scale

SLIDE 3

Designing Optimal Schedulers is Intractable

Must consider many factors for optimal performance:

  • Job dependency structure
  • Modeling complexity
  • Placement constraints
  • Data locality
  • …

Graphene [OSDI ’16], Carbyne [OSDI ’16], Tetris [SIGCOMM ’14], Jockey [EuroSys ’12], TetriSched [EuroSys ’16], device placement [NIPS ’17], Delayed Scheduling [EuroSys ’10], …

Practical deployment:

Ignore the complexity → resort to simple heuristics
Build a sophisticated system → complex configurations and tuning

No “one-size-fits-all” solution: the best algorithm depends on the specific workload and system

SLIDE 4

Can machine learning help tame the complexity of efficient schedulers for data processing jobs?

SLIDE 5

Decima: A Learned Cluster Scheduler

  • Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)

[Diagram: job DAGs (Job 1, Job 2, Job 3) are submitted to the scheduler, which assigns their stages to Executor 1 … Executor m. “Stages” are groups of identical tasks that can run in parallel; edges are data dependencies.]

SLIDE 6

Decima: A Learned Cluster Scheduler

  • Learns workload-specific scheduling algorithms for jobs with dependencies (represented as DAGs)

[Diagram: Job 1’s DAG is submitted to the scheduler, which places its tasks on Server 1 … Server m.]

SLIDE 7

Demo

Scheduling policy: FIFO. Average job completion time: 225 sec

[Animation: number of servers working on each job over time.]

SLIDE 8

Scheduling policy: Shortest-Job-First. Average job completion time: 135 sec

SLIDE 9

Scheduling policy: Fair. Average job completion time: 120 sec

SLIDE 10

Side by side. Shortest-Job-First: average job completion time 135 sec; Fair: average job completion time 120 sec
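To make the gap concrete, here is a minimal sketch (not from the talk; job durations are hypothetical) of how the same jobs yield different average JCTs under FIFO and Shortest-Job-First on a single server:

```python
# Toy single-server run: all jobs arrive at t=0 and execute back-to-back.
def avg_jct(durations, order):
    t, total = 0.0, 0.0
    for job in order:
        t += durations[job]   # job finishes at time t
        total += t            # its completion time counts toward the average
    return total / len(order)

jobs = {"A": 100, "B": 20, "C": 15}               # hypothetical durations (sec)
print(avg_jct(jobs, ["A", "B", "C"]))              # FIFO: ~118.3 sec
print(avg_jct(jobs, sorted(jobs, key=jobs.get)))   # SJF:  ~61.7 sec
```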

SLIDE 11

Scheduling policy: Decima. Average job completion time: 98 sec

SLIDE 12

Side by side. Fair: average job completion time 120 sec; Decima: average job completion time 98 sec

SLIDE 13

Decima, it=0: 166 sec

  • 20 Spark jobs (TPC-H queries), 50 servers

SLIDE 14

Decima, it=3000: 160 sec

SLIDE 15

Decima, it=6000: 148 sec

SLIDE 16

Decima, it=9000: 145 sec

SLIDE 17

Decima, it=12000: 142 sec

SLIDE 18

Decima, it=15000: 126 sec

SLIDE 19

Decima, it=18000: 111 sec

SLIDE 20

Decima, it=21000: 108 sec

SLIDE 21

Decima, it=24000: 107 sec

SLIDE 22

Decima, it=27000: 93 sec

SLIDE 23

Decima, it=30000: 89 sec

SLIDE 24

Design

SLIDE 25

Design Overview

[Diagram: the environment (Job DAG 1 … Job DAG n, Executor 1 … Executor m) sends the state, an observation of jobs and cluster status, to the scheduling agent; the agent’s policy network (a graph neural network over the schedulable nodes) selects actions, and the objective defines the reward.]

SLIDE 26

Contributions

[Same design-overview diagram as the previous slide.]

  • 1. First RL-based scheduler for complex data processing jobs
  • 2. Scalable graph neural network to express scheduling policies
  • 3. New learning methods that enable training with online job arrivals

SLIDE 27

Encode Scheduling Decisions as Actions

[Diagram: Job DAG 1 … Job DAG n and Server 1 … Server m, with a set of identical free executors to assign.]

SLIDE 28

Option 1: Assign All Executors in One Action

Problem: huge action space

[Same diagram: Job DAG 1 … Job DAG n; Server 1 … Server m.]

SLIDE 29

Option 2: Assign One Executor Per Action

Problem: long action sequences

[Same diagram.]

SLIDE 30

Decima: Assign Groups of Executors per Action

[Diagram: e.g., “use 3 servers” for one node and “use 1 server” for two others.]

Action = (node, parallelism limit)
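A rough sketch of this action encoding (names are illustrative, not Decima’s actual code): each scheduling event picks one runnable DAG node and caps how many executors its job may hold.

```python
from typing import NamedTuple

class Action(NamedTuple):
    node_id: int      # which DAG node (stage) to schedule next
    parallelism: int  # upper limit on executors for that node's job

def apply_action(action, free_executors, node_to_job, executors_in_use):
    """Grant free executors to the chosen node, up to the parallelism limit."""
    job = node_to_job[action.node_id]
    room = max(0, action.parallelism - executors_in_use.get(job, 0))
    grant = min(room, len(free_executors))
    executors_in_use[job] = executors_in_use.get(job, 0) + grant
    return free_executors[:grant], free_executors[grant:]  # (assigned, still free)
```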

SLIDE 31

Process Job Information

Node features (for each node of Job DAG 1 … Job DAG n):

  • # of tasks
  • avg. task duration
  • # of servers currently assigned to the node
  • are free servers local to this job?

Must handle an arbitrary number of jobs.

SLIDE 32

Graph Neural Network

[Diagram: a job DAG with a score on each node (e.g., 6, 8, 3, 2).]
slide-33
SLIDE 33

Step 1 Step 1 Job DAG 1 Job DAG n Step 2 Step 2

Children of v !" = $ %", !' '∈) " ; +

Gr Grap aph Neural al Network

Same aggregation applied to all nodes for each DAG
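A minimal sketch of this aggregation (not the actual Decima implementation: f and g are tiny random MLPs standing in for trained networks, and the dimensions are made up). Embeddings are computed bottom-up, with the same f and g reused at every node:

```python
import numpy as np

def mlp(dims, rng=np.random.default_rng(0)):
    """Tiny random ReLU MLP, returned as a callable (illustrative only)."""
    Ws = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(dims, dims[1:])]
    def forward(x):
        for W in Ws[:-1]:
            x = np.maximum(x @ W, 0.0)
        return x @ Ws[-1]
    return forward

f = mlp([8, 16, 8])    # transforms a child's embedding into a message
g = mlp([16, 16, 8])   # combines a node's own features with the messages

def embed(v, children, features, memo):
    """e_v = g(x_v, sum of f(e_u) over children u of v)."""
    if v not in memo:
        msg = sum((f(embed(u, children, features, memo)) for u in children[v]),
                  np.zeros(8))
        memo[v] = g(np.concatenate([features[v], msg]))
    return memo[v]

# Usage: embed(root, children, features, {}) with children[v] a list of
# child node ids and features[v] an 8-dim vector per node.
```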

SLIDE 34

Graph Neural Network

The same aggregation is applied everywhere in the DAGs; for example, with max as the aggregation, the network can compute each job’s critical path.

[Diagram: max operations propagated along a DAG to find the critical path.]

SLIDE 35

Graph Neural Network

Use supervised learning to verify the representation, with the same aggregation applied everywhere in the DAGs.

[Figure: supervised-learning training curve, testing accuracy (40%–100%) vs. number of iterations (50–350), comparing a single non-linear aggregation against Decima’s two-level aggregation.]

SLIDE 36

Training

[Diagram: the Decima agent interacts with a (simulated) cluster, generating experience data for reinforcement learning training.]

SLIDE 37

Handle Online Job Arrival

The RL agent has to experience continuous job arrivals during training → inefficient if we simply feed it long sequences of jobs.

[Figure: number of backlogged jobs over time under the initial random policy.]

SLIDE 38

Handle Online Job Arrival

The RL agent has to experience continuous job arrivals during training → inefficient if we simply feed it long sequences of jobs.

[Figure: under the initial random policy the backlog keeps growing, wasting training time.]

SLIDE 39

Handle Online Job Arrival

The RL agent has to experience continuous job arrivals during training → inefficient if we simply feed it long sequences of jobs.

[Figure: early reset for initial training limits how long the backlog can grow under the initial random policy.]

SLIDE 40

Handle Online Job Arrival

The RL agent has to experience continuous job arrivals during training → inefficient if we simply feed it long sequences of jobs.

Curriculum learning: increase the reset time as training proceeds; a stronger policy keeps the queue stable.

[Figure: the backlog stays bounded as episodes grow longer.]
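One way to implement this curriculum (an assumed form; the exact schedule is not given on the slide) is to grow the training-episode horizon over iterations:

```python
def episode_horizon(iteration, start=500, growth=1.3, every=100, cap=50_000):
    """Episode length in environment steps: short early resets at first,
    gradually longer as the policy learns to keep the job queue stable."""
    return int(min(start * growth ** (iteration // every), cap))

# episode_horizon(0) == 500; episode_horizon(1_000) is roughly 6_900;
# the horizon saturates at 50_000 steps once training matures.
```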

SLIDE 41

Variance from Job Sequences

The RL agent needs to be robust to variation in job arrival patterns → huge variance can throw off the training process.

SLIDE 42

Review: Policy Gradient RL Methods

$\theta \leftarrow \theta + \alpha \, \nabla_\theta \log \pi_\theta(s_t, a_t) \left( \sum_{t'=t}^{T} r_{t'} - b(s_t) \right)$

Here $\sum_{t'=t}^{T} r_{t'}$ is the “return” from step $t$, and $b(s_t)$ is the “baseline”, the expected return from state $s_t$. The update increases the probability of actions with better-than-expected returns.
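A compact sketch of this update (REINFORCE with a baseline; `baseline` and `grad_log_pi` are assumed callables, not part of the talk’s code):

```python
import numpy as np

def reinforce_update(theta, trajectory, baseline, grad_log_pi, lr=1e-3):
    """trajectory: list of (state, action, reward) tuples for one episode."""
    rewards = np.array([r for (_, _, r) in trajectory])
    for t, (s, a, _) in enumerate(trajectory):
        ret = rewards[t:].sum()                    # return from step t
        advantage = ret - baseline(s)              # better than expected?
        theta = theta + lr * advantage * grad_log_pi(theta, s, a)
    return theta
```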

SLIDE 43

Variance from Job Sequences

[Figure: job size over time around action $a_t$, under two different future workloads (#1 and #2).]

Score for action $a_t$ = (return after $a_t$) − (average return) = $\sum_{t'=t}^{T} r_{t'} - b(s_t)$

Must consider the entire job sequence to score actions.

SLIDE 44

Input-Dependent Baseline

Standard: score for action $a_t$ = $\sum_{t'=t}^{T} r_{t'} - b(s_t)$

Input-dependent: score for action $a_t$ = $\sum_{t'=t}^{T} r_{t'} - b(s_t, z_t, z_{t+1}, \dots)$, where $b(s_t, z_t, z_{t+1}, \dots)$ is the average return for trajectories from state $s_t$ with job sequence $z_t, z_{t+1}, \dots$

Theorem: input-dependent baselines reduce variance without adding bias,

$\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(s_t, a_t) \; b(s_t, z_t, z_{t+1}, \dots) \right] = 0$

  • Variance Reduction for Reinforcement Learning in Input-Driven Environments. Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. International Conference on Learning Representations (ICLR), 2019.
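A sketch of one way to estimate such a baseline (an assumed estimator, not the paper’s exact code): replay the same job arrival sequence z across several rollouts and average the per-step returns, so workload randomness cancels out of the advantage.

```python
import numpy as np

def input_dependent_baseline(returns_by_input):
    """returns_by_input: {z_id: array of shape (num_rollouts, T)} where every
    rollout under key z_id replayed the same job arrival sequence z.
    Returns {z_id: length-T array b(s_t, z)}: the mean return at each step."""
    return {z: rets.mean(axis=0) for z, rets in returns_by_input.items()}

def advantages(per_step_returns, z_id, baseline):
    """Advantage of each action: its return minus the z-conditioned baseline."""
    return per_step_returns - baseline[z_id]
```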

SLIDE 45

Input-Dependent Baseline

Broadly applicable to other systems with an external input process: adaptive video streaming, load balancing, caching, robotics with disturbances, …

[Figure: training curves with the standard baseline vs. the input-dependent baseline.]

  • Variance Reduction for Reinforcement Learning in Input-Driven Environments. Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh. International Conference on Learning Representations (ICLR), 2019.

SLIDE 46

Evaluation

SLIDE 47

Decima vs. Baselines: Batched Arrivals

  • 20 TPC-H queries sampled at random; input sizes: 2, 5, 10, 20, 50, 100 GB
  • Decima trained on a simulator; tested on a real Spark cluster

Decima improves average job completion time by 21%–3.1× over baseline schemes.

SLIDE 48

Decima with Continuous Job Arrivals

1,000 jobs arrive as a Poisson process with an average inter-arrival time of 25 sec. Decima achieves 28% lower average JCT than the best heuristic, and 2× better JCT in overload.

[Figure: JCT comparison across schedulers; lower is better.]
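The workload generator implied here is straightforward to sketch: a Poisson arrival process means i.i.d. exponential inter-arrival times.

```python
import numpy as np

rng = np.random.default_rng(0)
inter_arrivals = rng.exponential(scale=25.0, size=1000)  # mean 25 sec
arrival_times = np.cumsum(inter_arrivals)  # timestamps of the 1,000 jobs
```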

SLIDE 49

Understanding Decima

[Figure: executor assignments under the tuned weighted-fair scheduler vs. Decima.]

SLIDE 50

Flexibility: Multi-Resource Scheduling

Industrial trace (Alibaba): 20,000 jobs from a production cluster, with multi-resource requirements (CPU cores + memory units).

SLIDE 51

Flexibility: Different Objectives & Systems

[Figure: executors in use vs. time (30–120 seconds), one panel per objective.]

  • Objective = Makespan: avg. JCT 74.5 sec, makespan 102.1 sec
  • Objective = Avg JCT: avg. JCT 67.3 sec, makespan 119.6 sec

SLIDE 52

Flexibility: Different Objectives & Systems

[Figure: executors in use vs. time (30–120 seconds), one panel per setting.]

  • Objective = Avg JCT: avg. JCT 67.3 sec, makespan 119.6 sec
  • Objective = Avg JCT with zero-cost migration: avg. JCT 61.4 sec, makespan 114.3 sec

SLIDE 53

Other Evaluations

  • Impact of each component in the learning algorithm
  • Generalization to different workloads
  • Training and inference speed
  • Handling missing features
  • Optimality gap

SLIDE 54

Real-world Video Bitrate Adaptation with RL

[Diagram: a simulator (§3.1) replays network and watch-time traces (§3.3); the RL agent’s policy neural network (§3.2) observes the state (predicted bandwidth and current buffer) and samples a bitrate action (240P, 360P, 720P, or 1080P), trained with reward shaping (§3.4).]

  • Real-world Video Adaptation with Reinforcement Learning. Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, Eytan Bakshy. ICML Workshop, 2019.

SLIDE 55

Real-world Video Bitrate Adaptation with RL

[Diagram: on the back end, a simulator (§3.1) feeds RL training (§3.2–4, built on ELF), which stores experience and updates the model; policy translation (§3.5) produces a translated ABR model for the front end, which maps state observations from each video session to the next bitrate action.]

  • Real-world Video Adaptation with Reinforcement Learning. Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, Eytan Bakshy. ICML Workshop, 2019.

SLIDE 56

Real-world Video Bitrate Adaptation with RL

[Diagram: state → policy network → action. The state includes a bandwidth estimate $x_t$, buffer occupancy, and the file sizes $n_t^1 \dots n_t^M$ of the M candidate bitrates; layers with shared parameters $\theta$ feed a softmax that outputs $\pi_\theta(a_t \mid s_t)$ over the bitrates.]

  • Real-world Video Adaptation with Reinforcement Learning. Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, Eytan Bakshy. ICML Workshop, 2019.
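A minimal sketch of this policy network (assumed shapes, and a simple linear layer for illustration; the production model is a deeper network, and the file sizes below are hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def abr_policy(theta, bandwidth_est, buffer_occupancy, file_sizes):
    """pi_theta(a_t | s_t): distribution over the M candidate bitrates."""
    state = np.concatenate([[bandwidth_est, buffer_occupancy], file_sizes])
    return softmax(state @ theta)

M = 4                                       # e.g. 240P, 360P, 720P, 1080P
theta = np.random.default_rng(0).normal(scale=0.01, size=(2 + M, M))
probs = abr_policy(theta, bandwidth_est=3.2, buffer_occupancy=12.0,
                   file_sizes=np.array([0.4, 0.9, 2.1, 4.5]))  # MB, made up
```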

SLIDE 57

Real-world Video Bitrate Adaptation with RL

  • Real-world Video Adaptation with Reinforcement Learning. Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, Eytan Bakshy. ICML Workshop, 2019.
SLIDE 58

Real-world Video Bitrate Adaptation with RL

  • Real-world Video Adaptation with Reinforcement Learning. Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, Eytan Bakshy. ICML Workshop, 2019.

SLIDE 59

Park: An Open Platform for Learning-Augmented Systems

[Diagram: a computer-system environment (C++, Java, HTML, Rust, …) triggers an MDP step and sends an RPC request to a listening server; through a common Python interface, the RL agent object (actor, experience storage, training) receives state and reward and replies with an action.]

  • 12 system environments for networking, databases, distributed systems, …
  • Contains real systems and simulations
  • Interact with the system through a standard API (see the sketch below)

  • Park: An Open Platform for Learning-Augmented Computer Systems. H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus, R. Addanki, M. Khani, S. He, V. Nathan, F. Cangialosi, S. Venkatakrishnan, W. Weng, S. Han, T. Kraska, M. Alizadeh. Neural Information Processing Systems (NeurIPS), 2019.
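The standard API is gym-style; a sketch of the interaction loop along the lines of the Park README (the environment name is one of Park’s environments, and random actions stand in for a learned agent):

```python
import park

env = park.make('load_balance')   # one of the 12 Park system environments

obs = env.reset()
done = False
while not done:
    act = env.action_space.sample()          # stand-in for a learned policy
    obs, reward, done, info = env.step(act)  # state, reward from the system
```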

SLIDE 60

Park: An Open Platform for Learning-Augmented Systems

  • Park: An Open Platform for Learning-Augmented Computer Systems. H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus, R. Addanki, M. Khani, S. He, V. Nathan, F. Cangialosi, S. Venkatakrishnan, W. Weng, S. Han, T. Kraska, M. Alizadeh. Neural Information Processing Systems (NeurIPS), 2019.

SLIDE 61

Park: An Open Platform for Learning-Augmented Systems

Some example challenges:

  • Infinite horizon
  • Representation of the states and actions
  • Simulation-reality gap
  • Needle-in-the-haystack problem

  • Park: An Open Platform for Learning-Augmented Computer Systems. H. Mao, P. Negi, A. Narayan, H. Wang, J. Yang, H. Wang, R. Marcus, R. Addanki, M. Khani, S. He, V. Nathan, F. Cangialosi, S. Venkatakrishnan, W. Weng, S. Han, T. Kraska, M. Alizadeh. Neural Information Processing Systems (NeurIPS), 2019.
SLIDE 62

Summary

  • Decima develops new RL algorithms to learn workload-specific cluster scheduling algorithms: http://web.mit.edu/decima/
  • ABRL conducts a large-scale production experiment applying RL to video bitrate adaptation: https://openreview.net/forum?id=SJlCkwN8iV
  • Park open-sources a platform for RL research in computer systems: https://github.com/park-project/park