SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman
“AI is the new electricity.”
- Machine translation
- Recommendation system
- Autonomous driving
- Object detection and recognition
Types of machine learning: supervised, unsupervised, transfer, and reinforcement learning
ML algorithms are approximate
- ML model: a parametric transformation
- maps input variables X to output variables Y
- typically contains a set of parameters θ
- Quality: how well the model maps input to the correct output
- Loss function: measures the discrepancy between the model output and the ground truth

[Diagram: model f with parameters θ mapping input X to output Y]
Training ML models: an iterative process
- Training algorithms iteratively minimize a loss function
- E.g., stochastic gradient descent (SGD), L-BFGS
[Diagram: the scheduler sends tasks to workers; each worker holds a model replica and a data shard, and sends model updates back to the job's master copy of the model]
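The iterative loop above can be sketched in a few lines. This is a generic single-machine SGD illustration; the synthetic linear-regression data and squared loss are assumptions for the sketch, not SLAQ's distributed Spark implementation:

```python
import numpy as np

def sgd_train(X, y, lr=0.05, epochs=20, seed=0):
    """Minimal SGD loop: each epoch sweeps the data, updates the
    parameters theta, and records the loss after the sweep, producing
    the iterative loss curve that SLAQ monitors."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    losses = []
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient of squared error
            theta -= lr * grad
        losses.append(float(np.mean((X @ theta - y) ** 2)))
    return theta, losses

# Synthetic, noise-free linear data: the loss shrinks across epochs,
# with most of the reduction happening early.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta, losses = sgd_train(X, y)
```

Recording the loss after every iteration is exactly the signal the following slides build on.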
Training ML models: an iterative process
- Quality improvement is subject to diminishing returns
- More than 80% of work done in 20% of time
[Plot: loss reduction % vs. cumulative time % for LogReg, SVM, LDA, and MLPC]
Exploratory ML training: not a one-time effort
- Train model multiple times for exploratory purposes
- Provide early feedback, direct model search for high quality models
[Workflow: Collect Data → Extract Features → Train ML Models, iterating by adjusting the feature space, tuning hyperparameters, and restructuring models]
[Diagram: a scheduler assigning tasks from Jobs #1–#3 to four shared workers]
How to schedule multiple training jobs on a shared cluster?
- Key features of ML jobs
- Approximate
- Diminishing returns
- Exploratory process
- Problem with resource fairness scheduling
- Jobs in early stage: could benefit a lot from additional resources
- Jobs almost converged: make only marginal improvement
SLAQ: quality-aware scheduling
- Intuition: in the context of approximate ML training, more resources should
be allocated to jobs that have the most potential for quality improvement
[Plots: accuracy and loss vs. time under quality-aware vs. fair resource allocation]
Solution Overview
Normalize quality metrics → Predict quality improvement → Quality-driven scheduling
Normalizing quality metrics
- Criteria for a good metric: applicable to all algorithms? comparable magnitudes? known range? predictable?
- Candidates: accuracy / F1 score / area under curve / confusion matrix / etc.
- Loss
- Normalized Loss
- ∆Loss
- Normalized ∆Loss
- Normalize change of loss values w.r.t. largest change so far
- Currently does not support some non-convex optimization algorithms
[Plot: normalized ∆loss vs. iteration for K-Means, LogReg, SVM, SVMPoly, GBT, GBTReg, MLPC, LDA, and LinReg]
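The normalization described above can be sketched as follows; this is a plain-Python illustration of the idea, not SLAQ's exact implementation:

```python
def normalized_delta_loss(loss_history):
    """Normalize each iteration's loss change by the largest
    per-iteration change observed so far, so that jobs with very
    different loss magnitudes become comparable."""
    normalized = []
    largest = 0.0
    for prev, curr in zip(loss_history, loss_history[1:]):
        delta = prev - curr            # positive when the loss improves
        largest = max(largest, abs(delta))
        normalized.append(delta / largest if largest > 0 else 0.0)
    return normalized

# The first (largest) drop maps to 1.0; later drops shrink toward 0,
# mirroring diminishing returns.
changes = normalized_delta_loss([10.0, 4.0, 2.0, 1.5])
```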
Training iterations: loss prediction
- Previous work: offline profiling / analysis [Ernest NSDI 16] [CherryPick NSDI 17]
- Overhead for frequent offline analysis is huge
- Strawman: use last ∆Loss as prediction for future ∆Loss
- SLAQ: online prediction using weighted curve fitting
Prediction error % (strawman vs. weighted curve fitting):

| Algorithm | Strawman | Weighted Curve |
|-----------|----------|----------------|
| LDA       | 0.1      | 0.0            |
| GBT       | 0.4      | 0.4            |
| LinReg    | 1.1      | 0.2            |
| SVM       | 1.2      | 0.6            |
| MLPC      | 4.8      | 4.7            |
| LogReg    | 6.1      | 4.3            |
| SVMPoly   | 52.5     | 3.6            |
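The weighted curve fitting can be sketched with a small least-squares fit. The curve family (a + b/k) and the linearly increasing weights below are assumptions for illustration, not necessarily the exact family SLAQ fits:

```python
import numpy as np

def predict_loss(history, ahead=1):
    """Fit loss(k) ~ a + b/k to the loss history with weights that
    emphasize recent iterations, then extrapolate `ahead` iterations."""
    k = np.arange(1, len(history) + 1, dtype=float)
    A = np.column_stack([np.ones_like(k), 1.0 / k])
    w = np.sqrt(k / k.sum())             # later iterations count more
    coef, *_ = np.linalg.lstsq(A * w[:, None], np.asarray(history) * w, rcond=None)
    a, b = coef
    return a + b / (len(history) + ahead)

# On a history that exactly follows a + b/k, the fit recovers the curve.
hist = [1.0 + 2.0 / k for k in range(1, 11)]
pred = predict_loss(hist)  # predicted loss at iteration 11
```

Because the fit is a small linear least-squares problem, it is cheap enough to rerun online at every scheduling interval.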
Scheduling approximate ML training jobs
- Predict how much quality can be improved when assigning X workers to a job
- Reassign workers to maximize quality improvement
[Diagram: the scheduler predicts per-job quality improvement and reallocates workers across Jobs #1–#3]
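The reallocation step can be sketched as a greedy assignment: hand out workers one at a time, always to the job with the highest predicted marginal quality gain. The `gain(n)` interface below (predicted cumulative improvement when running on n workers) is a hypothetical stand-in for SLAQ's prediction component:

```python
import heapq

def allocate_workers(jobs, total_workers):
    """Greedy allocation: repeatedly assign the next worker to the job
    whose predicted marginal improvement is largest. `jobs` maps a job
    id to gain(n), the predicted quality improvement with n workers."""
    alloc = {job: 0 for job in jobs}
    # Max-heap via negated marginal gains: gain(1) - gain(0) for each job.
    heap = [(-(gain(1) - gain(0)), job) for job, gain in jobs.items()]
    heapq.heapify(heap)
    for _ in range(total_workers):
        _, job = heapq.heappop(heap)
        alloc[job] += 1
        n = alloc[job]
        gain = jobs[job]
        heapq.heappush(heap, (-(gain(n + 1) - gain(n)), job))
    return alloc

# Job "a" improves faster than "b"; with diminishing returns it should
# receive most, but not all, of the 4 workers.
jobs = {"a": lambda n: 10 * (1 - 0.5 ** n), "b": lambda n: 2 * (1 - 0.5 ** n)}
alloc = allocate_workers(jobs, 4)
```

With diminishing returns, the marginal gains are decreasing, so this greedy loop maximizes the predicted aggregate improvement.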
Experiment setup
- Representative mix of training jobs spanning different algorithms, optimizers, and datasets
- Compare against a work-conserving fair scheduler
| Algorithm | Acronym | Type | Optimization Algorithm | Dataset |
|---|---|---|---|---|
| K-Means | K-Means | Clustering | Lloyd Algorithm | Synthetic |
| Logistic Regression | LogReg | Classification | Gradient Descent | Epsilon [33] |
| Support Vector Machine | SVM | Classification | Gradient Descent | Epsilon |
| SVM (polynomial kernel) | SVMPoly | Classification | Gradient Descent | MNIST [34] |
| Gradient Boosted Tree | GBT | Classification | Gradient Boosting | Epsilon |
| GBT Regression | GBTReg | Regression | Gradient Boosting | YearPredictionMSD [35] |
| Multi-Layer Perceptron Classifier | MLPC | Classification | L-BFGS | Epsilon |
| Latent Dirichlet Allocation | LDA | Clustering | EM / Online Algorithm | Associated Press Corpus [36] |
| Linear Regression | LinReg | Regression | L-BFGS | YearPredictionMSD |
Evaluation: resource allocation across jobs
- 160 training jobs submitted to the cluster with Poisson-distributed arrivals
- 25% jobs with high loss values
- 25% jobs with medium loss values
- 50% jobs with low loss values (almost converged)
[Plots: share of cluster CPUs (%) over time (seconds) allocated to the top 25%, second 25%, and bottom 50% of jobs]
Evaluation: cluster-wide quality and time
- SLAQ’s average loss is 73% lower
than that of the fair scheduler
[Plots: average loss vs. time for SLAQ vs. the fair resource scheduler, and time to reach 80–100% loss reduction]
- SLAQ reduces the time to reach 90% (95%) loss reduction by 45% (30%)
SLAQ Evaluation: Scalability
- SLAQ frequently reschedules and reconfigures jobs in reaction to changes in training progress
- Even with thousands of concurrent jobs, SLAQ makes rescheduling
decisions in just a few seconds
[Plot: scheduling time (s) vs. number of workers (1,000–16,000), for 1,000–4,000 concurrent jobs]
Conclusion
- SLAQ leverages the approximate and iterative ML training process
- Highly tailored prediction for iterative job quality
- Allocate resources to maximize quality improvement
- SLAQ achieves better overall quality and end-to-end training time
Training iterations: runtime prediction
[Plot: iteration time (s) vs. number of cores (32–256) for different model sizes (10, 100) and data sizes (10K, 100K)]
- Iteration runtime ≈ C · D / N
- model complexity C, data size D, number of workers N
- The model update (i.e., the size of Δθ) is comparably much smaller
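The runtime model above can be written directly. The `unit_cost` parameter (seconds per unit of C·D work) is a hypothetical calibration constant that would be fit from a few measured iterations:

```python
def predict_iteration_time(model_complexity, data_size, num_workers, unit_cost=1e-6):
    """Iteration runtime model from the slide: time scales with model
    complexity C times data size D, divided by the number of workers N."""
    return unit_cost * model_complexity * data_size / num_workers

# Doubling the workers halves the predicted iteration time, since the
# per-iteration model-update traffic is comparatively negligible.
t32 = predict_iteration_time(model_complexity=100, data_size=100_000, num_workers=32)
t64 = predict_iteration_time(model_complexity=100, data_size=100_000, num_workers=64)
```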