HyperSched: Deadline-aware Scheduler for Model Development (PowerPoint PPT Presentation)



slide-1
SLIDE 1

HyperSched

Deadline-aware Scheduler for Model Development

Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov

1
slide-2
SLIDE 2

2

slide-3
SLIDE 3

Data Science @ Boogle Inc.

2

slide-4
SLIDE 4

3


slide-6
SLIDE 6

Learning Rate? Momentum?? Network Size? Preprocessing Parameters??? Featurization?????

3


slide-8
SLIDE 8

How to optimize? Try Random Search

4


slide-10
SLIDE 10

[Figure: GPUs vs. Time; Accuracy vs. Time]

Terri is faced with the decision of choosing the right level of parallelism.

Trials (sets of hyperparameters to evaluate)

5



slide-16
SLIDE 16

[Figure: # GPUs vs. Time; Accuracy vs. Time]

DEADLINES EXIST

Scheduling Problem?

7



slide-20
SLIDE 20

Given finite time and compute resources (Scheduling Problem), evaluate many random trials, i.e. configurations (Exploration Problem), to obtain the best trained model (Exploitation Problem).

Instead of increasing:

  • DL cluster efficiency [OSDI 2018]
  • Job Completion Time [NSDI 2019, EuroSys 2018]

8


slide-24
SLIDE 24
HyperSched is an application-level scheduler for model development.

  • Balances explore and exploit by adaptively allocating resources based on:
  • Awareness of resource constraints (# GPUs, time budget)
  • Awareness of training objectives (accuracy over time)

9


slide-30
SLIDE 30

[Figure: Accuracy vs. Time]

Properties/Assumptions of model development workloads

Model development consists of evaluating many trials.

  • Each trial is iterative and returns intermediate results.
  • Trials can be checkpointed during training.
  • All trials share the same objective; we care only about 1 model.
  • Model training can be accelerated by parallelizing/distributing its workload (data parallelism).

10

slide-31
SLIDE 31

[Figure: # GPUs vs. Time]

How to use allocation for exploration and exploitation?

11


slide-34
SLIDE 34

Naive Approach: Static Space/Time Allocation

[Figure: # GPUs vs. Time, split into an Exploration phase and an Exploitation phase]

12


slide-36
SLIDE 36

4 Layer CNN on CIFAR10 - Mukkamala, ICML2017

Problem: Initial Performance is a weak proxy of final behavior

Naive Approach: Static Space/Time Allocation

13

slide-37
SLIDE 37

Naive Solution: Static Space/Time Allocation

[Figure: # GPUs vs. Time]

Underallocate exploration…

14

slide-38
SLIDE 38

Naive Solution: Static Space/Time Allocation

[Figure: # GPUs vs. Time]

… or underallocate exploitation

15

slide-39
SLIDE 39

Naive Solution: Static Space/Time Allocation

Main problem: Cannot rely on initial performance.

16


slide-43
SLIDE 43

Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]

  • Distributed hyperparameter tuning algorithm based on optimal resource allocation
  • SOTA results over other existing algorithms
  • Deployed in many AutoML offerings today

17


slide-50
SLIDE 50

[Diagram: trials advance in rungs of r, η·r, η²·r, … epochs]

LIMIT = r
while trial.iter < R:
    trial.run_one_epoch()
    if trial.iter == LIMIT:
        if is_top(trial, LIMIT, 1/η):
            LIMIT *= η
        else:
            # allow new trials to start
            trial.pause(); break

  • r: min epoch
  • R: max epoch
  • η (eta): balances explore/exploit
  • Intuition: progressively allocate more resources to promising trials

* Simplified representation

Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]

18
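The slide's simplified pseudocode can be turned into a runnable sketch. The helper below simulates the synchronous variant of successive halving over a precomputed per-epoch score table; the function name, the score-table input, and the ranking details are illustrative assumptions for the demo, not ASHA's actual implementation (real ASHA is asynchronous and promotes trials as results arrive).

```python
def successive_halving(scores_per_epoch, r=1, R=27, eta=3):
    """Toy synchronous successive halving: trials advance in rungs of
    r, eta*r, eta^2*r, ... epochs, and only the top 1/eta share of each
    rung advances. scores_per_epoch[i][e] is trial i's score after
    epoch e+1 (assumed known up front for this simulation)."""
    alive = list(range(len(scores_per_epoch)))
    limit = r
    while limit < R and len(alive) > 1:
        # rank survivors by their score at the current rung
        alive.sort(key=lambda i: scores_per_epoch[i][limit - 1], reverse=True)
        alive = alive[:max(1, len(alive) // eta)]  # top 1/eta advance
        limit *= eta
    # winner: best survivor at the final rung reached
    final = min(limit, R)
    return max(alive, key=lambda i: scores_per_epoch[i][final - 1])
```

For example, with r=1, R=9, eta=3 and nine trials, the population shrinks from nine to three after 1 epoch and to one after 3 epochs, mirroring the rung diagram on the slide.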


slide-52
SLIDE 52

Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]

Benefit: Mitigates noisy initial performance via adaptive allocation

[Figure: Accuracy vs. Time]

How to improve?

19



slide-57
SLIDE 57

HyperSched Solution

  • 1. Build on ASHA's adaptive allocation
  • 2. Avoid starting trials close to the deadline
  • 3. Consolidate parallel resources on the top trial near the deadline to maximize accuracy

20

slide-58
SLIDE 58

HyperSched: Early Termination

Build on ASHA’s adaptive allocation.

From ASHA:

  • Evaluate trials for a minimum of r epochs, up to a maximum of R epochs
  • Balance explore/exploit with parameter η
  • Mitigate the problem of noisy initial performance

[Figure: # GPUs vs. Time; Accuracy vs. Time]

21


slide-61
SLIDE 61

HyperSched: Admission Policy

Avoid starting trials close to deadline

  • R: max epoch
  • η: Explore/exploit parameter
  • Intuition: Only start trials if they have a chance of beating the incumbent

[Figure: # GPUs vs. Time; Accuracy vs. Time]

def should_start_trial():
    return Tleft > min(furthest_trial().time * η,
                       base_epoch_time * R)

22
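The admission check on this slide can be fleshed out into a self-contained function. The sketch below is an assumption-laden illustration: the slide's Tleft, furthest_trial(), and base_epoch_time become plain arguments, and the default R and eta values are arbitrary.

```python
def should_start_trial(time_left, furthest_trial_time, base_epoch_time,
                       R=27, eta=3):
    """Admit a new trial only if the remaining time would let it either
    (a) catch the furthest-along trial and survive the next eta-fold
    promotion, or (b) train all the way to the max epoch R."""
    return time_left > min(furthest_trial_time * eta, base_epoch_time * R)
```

For example, with 1000 s left before the deadline, a furthest trial that has trained for 200 s, and 10 s per epoch, the threshold is min(600, 270) = 270 s, so a new trial is admitted; with only 100 s left it is rejected.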


slide-64
SLIDE 64

HyperSched: Resource Reallocation

Dynamically allocate parallel resources to final trials

[Figure: # GPUs vs. Time; Accuracy vs. Time]

def on_result(trial):
    if should_stop(trial):
        update_allocation()
        return
    elif should_resize(trial):
        ckpt = trial.checkpoint()
        set_allocation(trial)
        trial.restart(ckpt)

  • Uniform allocation of available resources
  • Resize by checkpointing and restarting with more parallel workers

23
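The "uniform allocation" step can be made concrete with a small helper. This is a hypothetical sketch, not code from the paper: it splits a GPU budget as evenly as possible across the surviving trials, which is how a reallocation pass might size each trial before a checkpoint-and-restart resize.

```python
def uniform_allocation(num_gpus, trials):
    """Divide num_gpus as evenly as possible across trials: every trial
    gets the base share, and the remainder goes to the first few."""
    base, extra = divmod(num_gpus, len(trials))
    return {t: base + (1 if i < extra else 0) for i, t in enumerate(trials)}
```

For example, uniform_allocation(8, ["t1", "t2", "t3"]) assigns 3, 3, and 2 GPUs, so the full budget stays in use with at most a one-GPU imbalance.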

slide-65
SLIDE 65

HyperSched Implementation

24

slide-66
SLIDE 66

HyperSched leverages Ray Tune’s scheduler API

HyperSched

http://tune.io/

25


slide-71
SLIDE 71

HyperSched Implementation

[Diagram: HyperSched sits between trials and a cluster of GPU workers; trials tᵢ map to resource allocations rᵢ; job results flow in and scheduler decisions flow out]

  • Trials return intermediate information (performance, overhead)
  • Maintains an internal allocation mapping and a deadline timer
  • Uses Tune scheduling APIs for execution (resizing, checkpointing, pausing, etc.)
  • Does not manage physical placement decisions

26

slide-72
SLIDE 72

Overview of HyperSched Results

For more results, see paper + poster.

27


slide-78
SLIDE 78

CIFAR10 Experiment

Setup:

  • 1 hour deadline, 8 GPUs (V100)
  • Resnet50 on CIFAR10
  • 144 different hyperparameter configurations

[Plots: GPUs allocated per trial vs. time (s); validation accuracy vs. time (s)]

Mitigates noisy initial performance. Achieves 93.84% validation accuracy (original repo: 93.57%).

https://github.com/kuangliu/pytorch-cifar (2.3k stars)

28


slide-82
SLIDE 82

Performance across deadlines

  • ResNet50 model on CIFAR10, (8 V100 GPUs)
  • 144 different configurations

HyperSched outperforms ASHA across a variety of deadlines by evaluating fewer trials and exploiting existing trials.

[Plots: max accuracy vs. deadline (900 s, 1800 s, 3600 s) and trials evaluated vs. deadline, for ASHA and HyperSched]

29


slide-84
SLIDE 84

HyperSched Summary

  • HyperSched is an application-level scheduler for deadline-based model development
  • HyperSched uses constraint-awareness and is informed by application-level objectives to increase model accuracy
  • Our evaluation shows HyperSched outperforms state-of-the-art parameter tuning algorithms

Thank you! Questions?

30
