HyperSched: Deadline-aware Scheduler for Model Development (PowerPoint PPT Presentation)



slide-1
SLIDE 1

HyperSched

Deadline-aware Scheduler for Model Development

Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov

1
slide-2
SLIDE 2

2

slide-3
SLIDE 3

Data Science @ Boogle Inc.

2

slide-4
SLIDE 4

3


slide-6
SLIDE 6

Learning Rate? Momentum?? Network Size? Preprocessing Parameters??? Featurization?????

3


slide-8
SLIDE 8

How to optimize? Try Random Search

4


slide-10
SLIDE 10

[Figure: GPUs vs. Time; Accuracy vs. Time]

Terri is faced with the decision of choosing the right level of parallelism.

Trials (sets of hyperparameters to evaluate)

5



slide-16
SLIDE 16

[Figure: # GPUs vs. Time; Accuracy vs. Time]

DEADLINES EXIST

Scheduling Problem?

7



slide-20
SLIDE 20

Given finite time and compute resources (Scheduling Problem), evaluate many random trials, i.e. configurations (Exploration Problem), to obtain the best trained model (Exploitation Problem).

Instead of increasing:

  • DL cluster efficiency [OSDI 2018]
  • Job Completion Time [NSDI 2019, EuroSys 2018]

8


slide-24
SLIDE 24
HyperSched is an application-level scheduler for model development.

  • Balances explore and exploit by adaptively allocating resources based on:
  • Awareness of resource constraints (# GPUs, time budget)
  • Awareness of training objectives (accuracy over time)

9


slide-30
SLIDE 30

[Figure: Accuracy vs. Time]

Properties/Assumptions of model development workloads

Model development consists of evaluating many trials.

  • Each trial is iterative and returns intermediate results.
  • Trials can be checkpointed during training.
  • All trials share the same objective; we care only about 1 model.
  • Model training can be accelerated by parallelizing/distributing its workload (data parallelism).

10

slide-31
SLIDE 31

[Figure: # GPUs vs. Time]

How to use allocation for exploration and exploitation?

11


slide-34
SLIDE 34

Naive Approach: Static Space/Time Allocation

[Figure: # GPUs vs. Time, split into an Exploration phase and an Exploitation phase]

12


slide-36
SLIDE 36

4 Layer CNN on CIFAR10 - Mukkamala, ICML2017

Problem: Initial Performance is a weak proxy of final behavior

Naive Approach: Static Space/Time Allocation

13

slide-37
SLIDE 37

Naive Solution: Static Space/Time Allocation

[Figure: # GPUs vs. Time]

Underallocate exploration…

14

slide-38
SLIDE 38

Naive Solution: Static Space/Time Allocation

[Figure: # GPUs vs. Time]

… or underallocate exploitation

15

slide-39
SLIDE 39

Naive Solution: Static Space/Time Allocation

Main problem: Cannot rely on initial performance.

16


slide-43
SLIDE 43

Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]

  • Distributed hyperparameter tuning algorithm based on optimal resource allocation
  • SOTA results over other existing algorithms
  • Deployed in many AutoML offerings today

17


slide-50
SLIDE 50

[Diagram: trials advance in rungs of r, η·r, η²·r, … epochs]

LIMIT = r
while trial.iter < R:
    trial.run_one_epoch()
    if trial.iter == LIMIT:
        if is_top(trial, LIMIT, 1/η):
            LIMIT *= η
        else:
            # allow new trials to start
            trial.pause(); break

  • r: min epoch
  • R: max epoch
  • η (eta): balances explore/exploit
  • Intuition: progressively allocate more resources to promising trials

* Simplified representation

Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]

18
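The slide's simplified pseudocode can be turned into a runnable sketch. The helper below simulates the synchronous variant of successive halving over a precomputed per-epoch score table; the function name, the score-table input, and the ranking details are illustrative assumptions for the demo, not ASHA's actual implementation (real ASHA is asynchronous and promotes trials as results arrive).

```python
def successive_halving(scores_per_epoch, r=1, R=27, eta=3):
    """Toy synchronous successive halving: trials advance in rungs of
    r, eta*r, eta^2*r, ... epochs, and only the top 1/eta share of each
    rung advances. scores_per_epoch[i][e] is trial i's score after
    epoch e+1 (assumed known up front for this simulation)."""
    alive = list(range(len(scores_per_epoch)))
    limit = r
    while limit < R and len(alive) > 1:
        # rank survivors by their score at the current rung
        alive.sort(key=lambda i: scores_per_epoch[i][limit - 1], reverse=True)
        alive = alive[:max(1, len(alive) // eta)]  # top 1/eta advance
        limit *= eta
    # winner: best survivor at the final rung reached
    final = min(limit, R)
    return max(alive, key=lambda i: scores_per_epoch[i][final - 1])
```

For example, with r=1, R=9, eta=3 and nine trials, the population shrinks from nine to three after 1 epoch and to one after 3 epochs, mirroring the rung diagram on the slide.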


slide-52
SLIDE 52

Better Solution: Asynchronous Successive Halving Algorithm (ASHA) [Li2018]

Benefit: Mitigates noisy initial performance via adaptive allocation

[Figure: Accuracy vs. Time]

How to improve?

19



slide-57
SLIDE 57

HyperSched Solution

  • 1. Build on ASHA's adaptive allocation
  • 2. Avoid starting trials close to the deadline
  • 3. Consolidate parallel resources on the top trial near the deadline to maximize accuracy

20

slide-58
SLIDE 58

HyperSched: Early Termination

Build on ASHA’s adaptive allocation.

From ASHA:

  • Evaluate trials for a minimum of r epochs, up to a maximum of R epochs
  • Balance explore/exploit with parameter η
  • Mitigate the problem of noisy initial performance

[Figure: # GPUs vs. Time; Accuracy vs. Time]

21


slide-61
SLIDE 61

HyperSched: Admission Policy

Avoid starting trials close to deadline

  • R: max epoch
  • η: Explore/exploit parameter
  • Intuition: Only start trials if they have a chance of beating the incumbent

[Figure: # GPUs vs. Time; Accuracy vs. Time]

def should_start_trial():
    return Tleft > min(furthest_trial().time * η,
                       base_epoch_time * R)

22
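The admission check on this slide can be fleshed out into a self-contained function. The sketch below is an assumption-laden illustration: the slide's Tleft, furthest_trial(), and base_epoch_time become plain arguments, and the default R and eta values are arbitrary.

```python
def should_start_trial(time_left, furthest_trial_time, base_epoch_time,
                       R=27, eta=3):
    """Admit a new trial only if the remaining time would let it either
    (a) catch the furthest-along trial and survive the next eta-fold
    promotion, or (b) train all the way to the max epoch R."""
    return time_left > min(furthest_trial_time * eta, base_epoch_time * R)
```

For example, with 1000 s left before the deadline, a furthest trial that has trained for 200 s, and 10 s per epoch, the threshold is min(600, 270) = 270 s, so a new trial is admitted; with only 100 s left it is rejected.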


slide-64
SLIDE 64

HyperSched: Resource Reallocation

Dynamically allocate parallel resources to final trials

[Figure: # GPUs vs. Time; Accuracy vs. Time]

def on_result(trial):
    if should_stop(trial):
        update_allocation()
        return
    elif should_resize(trial):
        ckpt = trial.checkpoint()
        set_allocation(trial)
        trial.restart(ckpt)

  • Uniform allocation of available resources
  • Resize by checkpointing and restarting with more parallel workers

23
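The "uniform allocation" step can be made concrete with a small helper. This is a hypothetical sketch, not code from the paper: it splits a GPU budget as evenly as possible across the surviving trials, which is how a reallocation pass might size each trial before a checkpoint-and-restart resize.

```python
def uniform_allocation(num_gpus, trials):
    """Divide num_gpus as evenly as possible across trials: every trial
    gets the base share, and the remainder goes to the first few."""
    base, extra = divmod(num_gpus, len(trials))
    return {t: base + (1 if i < extra else 0) for i, t in enumerate(trials)}
```

For example, uniform_allocation(8, ["t1", "t2", "t3"]) assigns 3, 3, and 2 GPUs, so the full budget stays in use with at most a one-GPU imbalance.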

slide-65
SLIDE 65

HyperSched Implementation

24

slide-66
SLIDE 66

HyperSched leverages Ray Tune’s scheduler API

HyperSched

http://tune.io/

25


slide-71
SLIDE 71

HyperSched Implementation

[Diagram: HyperSched sits between trials and a cluster of GPU workers; trials tᵢ map to resource allocations rᵢ; job results flow in and scheduler decisions flow out]

  • Trials return intermediate information (performance, overhead)
  • Maintains an internal allocation mapping and a deadline timer
  • Uses Tune scheduling APIs for execution (resizing, checkpointing, pausing, etc.)
  • Does not manage physical placement decisions

26

slide-72
SLIDE 72

Overview of HyperSched Results

For more results, see paper + poster.

27


slide-78
SLIDE 78

CIFAR10 Experiment

Setup:

  • 1 hour deadline, 8 GPUs (V100)
  • Resnet50 on CIFAR10
  • 144 different hyperparameter configurations

[Plots: GPUs allocated per trial vs. time (s); validation accuracy vs. time (s)]

Mitigates noisy initial performance. Achieves 93.84% validation accuracy (original repo: 93.57%).

https://github.com/kuangliu/pytorch-cifar (2.3k stars)

28


slide-82
SLIDE 82

Performance across deadlines

  • ResNet50 model on CIFAR10, (8 V100 GPUs)
  • 144 different configurations

HyperSched outperforms ASHA across a variety of deadlines by evaluating fewer trials and exploiting existing trials.

[Plots: max accuracy vs. deadline (900 s, 1800 s, 3600 s) and trials evaluated vs. deadline, for ASHA and HyperSched]

29


slide-84
SLIDE 84

HyperSched Summary

  • HyperSched is an application-level scheduler for deadline-based model development
  • HyperSched uses constraint-awareness and is informed by application-level objectives to increase model accuracy
  • Our evaluation shows HyperSched outperforms state-of-the-art parameter tuning algorithms

Thank you! Questions?

30
