HyperSched
Deadline-aware Scheduler for Model Development
Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph E. Gonzalez, Ion Stoica, Alexey Tumanov
Data Science @ Boogle Inc.

Learning Rate?
[Figure: # GPUs vs. time; accuracy vs. time]
Instead of increasing … [OSDI 2018] [NSDI 2019, EuroSys 2018]
Model development consists of evaluating many trials, each training a model.
A single trial can be sped up by distributing its workload (data parallelism).
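The data-parallel speedup mentioned above can be sketched as a toy example: a batch is split into shards, each worker computes a partial result, and the results are averaged. `partial_gradient` and the even-shard split are illustrative stand-ins, not HyperSched code, and the batch size is assumed divisible by the worker count.

```python
# Toy sketch of data parallelism: split a batch across workers, compute a
# partial "gradient" per shard, then average (a stand-in for all-reduce).

def partial_gradient(shard):
    # Stand-in for a per-worker computation on its shard of the batch.
    return sum(shard) / len(shard)

def data_parallel_step(batch, n_workers):
    # Assumes len(batch) is divisible by n_workers for simplicity.
    size = len(batch) // n_workers
    shards = [batch[i * size:(i + 1) * size] for i in range(n_workers)]
    grads = [partial_gradient(s) for s in shards]
    return sum(grads) / len(grads)  # average the partial results

print(data_parallel_step(list(range(8)), n_workers=2))  # 3.5
```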
4-layer CNN on CIFAR-10 (Mukkamala, ICML 2017)
ASHA: successively allocate more resources to promising trials, in budgets of r, η·r, η²·r, … epochs.

LIMIT = r
while trial.iter < R:
    trial.run_one_epoch()
    if trial.iter == LIMIT:
        if is_top(trial, LIMIT, 1/η):
            LIMIT *= η
        else:
            # allow new trials to start
            trial.pause(); break

* Simplified representation
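The simplified loop above can be made runnable as a small sketch. The `Trial` class, the random stand-in for a training curve, and the `is_top` ranking are illustrative assumptions, not the actual ASHA implementation: a trial is promoted to an η× larger epoch budget only if it ranks in the top 1/η fraction of trials that reached the same budget, and is paused otherwise.

```python
import random

# Toy sketch of the successive-halving loop on the slide. Budgets grow
# r, η·r, η²·r, ... up to R epochs; only the top 1/η fraction is promoted.
R, r, eta = 27, 1, 3

class Trial:
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.iter = 0
        self.score = 0.0
        self.paused = False

    def run_one_epoch(self):
        self.iter += 1
        self.score += self.rng.random()  # stand-in for accuracy gains

def is_top(trial, limit, fraction, trials):
    # Is `trial` within the top `fraction` of trials that reached `limit`?
    peers = sorted((t.score for t in trials if t.iter >= limit), reverse=True)
    k = max(1, int(len(peers) * fraction))
    return trial.score >= peers[k - 1]

trials = [Trial(seed) for seed in range(9)]
for trial in trials:
    limit = r
    while trial.iter < R:
        trial.run_one_epoch()
        if trial.iter == limit:
            if is_top(trial, limit, 1 / eta, trials):
                limit *= eta       # promote: grant an η× larger budget
            else:
                trial.paused = True  # allow new trials to start
                break
```

Every trial either trains the full R epochs or is paused at one of the budget rungs, which is what frees resources for promising trials.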
Benefit: mitigates noisy initial performance via adaptive allocation.
HyperSched builds on ASHA's adaptive allocation.
Avoid starting trials close to the deadline: admit a new trial only if it has a chance of beating the incumbent.

def should_start_trial():
    return T_left > min(furthest_trial().time * η,
                        base_epoch_time * R)
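A runnable version of the admission test above, with made-up values for η, `base_epoch_time`, R, and the trial times (the real scheduler measures these at runtime): a new trial is started only if the remaining time exceeds both the time needed to catch the furthest-trained trial and, at most, a full training run.

```python
# Hypothetical sketch of the deadline-aware admission test.
eta = 3                 # ASHA reduction factor (assumed)
base_epoch_time = 2.0   # measured seconds per epoch (made-up value)
R = 100                 # maximum epochs per trial (made-up value)

def should_start_trial(t_left, furthest_trial_time):
    # Admit only if a new trial could plausibly overtake the incumbent
    # before the deadline; cap the requirement at one full run.
    return t_left > min(furthest_trial_time * eta, base_epoch_time * R)

# Furthest trial has trained 40 s, so a challenger needs ~40 * 3 = 120 s:
print(should_start_trial(150, 40))  # True  (enough time left)
print(should_start_trial(100, 40))  # False (too close to the deadline)
```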
Dynamically allocate parallel resources to the final trials: a stopped trial releases its resources; a resized trial is checkpointed and restarted again with more parallel workers.

def on_result(trial):
    if should_stop(trial):
        update_allocation()
        return
    elif should_resize(trial):
        ckpt = trial.checkpoint()
        set_allocation(trial)
        trial.restart(ckpt)
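The checkpoint-resize-restart step above can be sketched end to end. The `Trial` class, the `should_stop`/`should_resize` policies, and the grant-all-free-GPUs allocation are simplistic stand-ins for the scheduler's real logic:

```python
# Hypothetical sketch of resize-on-result: a surviving trial is checkpointed,
# granted a larger allocation, and restarted with more parallel workers.
free_gpus = 4

class Trial:
    def __init__(self):
        self.workers = 1
        self.ckpt = None
        self.stopped = False

    def checkpoint(self):
        return {"workers": self.workers}  # stand-in for saved model state

    def restart(self, ckpt):
        self.ckpt = ckpt                  # resume training from checkpoint

def should_stop(trial):
    return trial.stopped

def should_resize(trial):
    return free_gpus > 0                  # spare GPUs available -> scale up

def set_allocation(trial):
    global free_gpus
    trial.workers += free_gpus            # grant all free GPUs (simplistic)
    free_gpus = 0

def on_result(trial):
    if should_stop(trial):
        return                            # stopped trial releases resources
    elif should_resize(trial):
        ckpt = trial.checkpoint()
        set_allocation(trial)
        trial.restart(ckpt)               # run again with more workers

t = Trial()
on_result(t)
print(t.workers)  # 5
```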
HyperSched: http://tune.io/
HyperSched architecture

[Diagram: trials tᵢ mapped to resource allocations rᵢ (t1⇒r1, t2⇒r2, t3⇒r4, t4⇒r4) across GPU workers; job results flow up to the scheduler, scheduler decisions flow back down]

1. Workers report job information (performance, …) to the scheduler.
2. The scheduler makes placement decisions using its trial-to-resource mapping and deadline timer.
3. Decisions are dispatched for execution (resizing, checkpointing, pausing, etc.).
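The three numbered steps above can be sketched as a minimal control loop. Everything here is illustrative, not HyperSched's API: the accuracy threshold, the one-GPU-per-report growth, and the decision tuples are made-up stand-ins for the scheduler's real policies.

```python
import time

# Minimal sketch of the result -> decision loop in the architecture slide.
class Scheduler:
    def __init__(self, deadline_s):
        self.deadline = time.monotonic() + deadline_s
        self.allocation = {}               # trial id -> GPUs (t_i => r_i)

    def on_job_result(self, trial_id, accuracy):
        # (1) a worker reports job information (performance, ...)
        t_left = self.deadline - time.monotonic()
        # (2) decide using the trial-to-resource mapping and deadline timer
        if t_left <= 0:
            return ("stop", trial_id)
        if accuracy < 0.5:                 # made-up cutoff for illustration
            self.allocation.pop(trial_id, None)
            return ("pause", trial_id)     # (3) dispatched for execution
        self.allocation[trial_id] = self.allocation.get(trial_id, 1) + 1
        return ("resize", trial_id)

sched = Scheduler(deadline_s=60)
print(sched.on_job_result("t1", accuracy=0.9))  # ('resize', 't1')
print(sched.on_job_result("t2", accuracy=0.3))  # ('pause', 't2')
```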
For more results, see the paper and poster.
Setup: https://github.com/kuangliu/pytorch-cifar (2.3k stars)

[Figure: GPUs allocated per trial vs. time (0–3000 s); validation accuracy vs. time (0–3000 s)]

Mitigates noisy initial performance. Achieves 93.84% validation accuracy (original repo: 93.57%).
HyperSched outperforms ASHA across a variety of deadlines by evaluating fewer trials and exploiting existing trials.

[Figure: max accuracy (0.7–1.0) and number of trials evaluated vs. deadline (900 s, 1800 s, 3600 s), ASHA vs. HyperSched]
Summary: HyperSched is a scheduler for deadline-based model development. It exploits application-level objectives to increase model accuracy, building on state-of-the-art parameter tuning algorithms.