SLAQ: Quality-Driven Scheduling for Distributed Machine Learning - PowerPoint PPT Presentation



SLIDE 1

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning

Haoyu Zhang*, Logan Stafman*, Andrew Or, Michael J. Freedman

SLIDE 2

“AI is the new electricity.”

  • Machine translation
  • Recommendation system
  • Autonomous driving
  • Object detection and recognition

Supervised Learning / Unsupervised Learning / Transfer Learning / Reinforcement Learning

SLIDE 3

ML algorithms are approximate

  • ML model: a parametric transformation

[Diagram: a model transforms input X into output Y through a set of parameters]

SLIDE 4

ML algorithms are approximate

  • ML model: a parametric transformation
  • maps input variables X to output variables Y
  • typically contains a set of parameters θ
  • Quality: how well the model maps input to the correct output
  • Loss function: discrepancy between model output and ground truth
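To make these bullets concrete, here is a minimal sketch (plain Python, not SLAQ's code) of a parametric model and a loss function measuring the discrepancy between model output and ground truth:

```python
# Minimal sketch: a model as a parametric transformation f(x; theta),
# and a squared-error loss measuring discrepancy from the ground truth.

def predict(x, theta):
    """Linear model: maps input x to output y using parameters theta = (w, b)."""
    w, b = theta
    return w * x + b

def loss(xs, ys, theta):
    """Mean squared error between model outputs and ground-truth labels."""
    return sum((predict(x, theta) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

theta = (2.0, 0.0)               # parameters: weight and bias
xs, ys = [1, 2, 3], [2, 4, 6]    # ground truth is exactly y = 2x
print(loss(xs, ys, theta))       # 0.0: the model maps inputs to the correct outputs
```

With a worse parameter setting, e.g. `theta = (1.0, 0.0)`, the loss is positive: lower loss means higher model quality.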


SLIDE 5

Training ML models: an iterative process

  • Training algorithms iteratively minimize a loss function
  • E.g., stochastic gradient descent (SGD), L-BFGS
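A toy illustration of this iterative process, assuming squared-error loss and a one-parameter linear model (illustrative, not the paper's implementation):

```python
# Illustrative sketch: SGD iteratively minimizes a loss function,
# nudging the parameter down the gradient on every training example.

def sgd(xs, ys, w=0.0, lr=0.01, iters=100):
    """Fit y ~ w*x by stochastic gradient descent on squared error."""
    losses = []
    for _ in range(iters):
        for x, y in zip(xs, ys):
            grad = 2 * (w * x - y) * x       # d/dw of (w*x - y)**2
            w -= lr * grad
        losses.append(sum((w * x - y) ** 2 for x, y in zip(xs, ys)))
    return w, losses

w, losses = sgd([1, 2, 3], [2, 4, 6])        # ground truth: y = 2x
print(w, losses[0], losses[-1])              # w approaches 2; loss shrinks per iteration
```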

[Diagram: a training job sends tasks to workers; each worker holds a model replica and a shard of the data, computes updates, and sends them back to update the model]

SLIDE 6

Training ML models: an iterative process

  • Quality improvement is subject to diminishing returns
  • More than 80% of work done in 20% of time
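The diminishing-returns effect can be seen with a toy geometrically decaying loss curve (illustrative numbers, not the paper's measurements):

```python
import math

# Toy illustration of diminishing returns: with a geometrically decaying
# loss, most of the total loss reduction happens early in training.
total_iters = 100
loss = [math.exp(-0.1 * t) for t in range(total_iters + 1)]

total_drop = loss[0] - loss[-1]
drop_at_20pct = loss[0] - loss[20]      # reduction after 20% of the iterations
print(drop_at_20pct / total_drop)       # ≈ 0.86: >80% of the work in 20% of the time
```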

[Plot: cumulative time (%) vs. loss reduction (%) for LogReg, SVM, LDA, MLPC]

SLIDE 7

Exploratory ML training: not a one-time effort

  • Train model multiple times for exploratory purposes
  • Provide early feedback, direct model search for high quality models

Collect Data → Extract Features → Train ML Models → (Adjust Feature Space / Tune Hyperparameters / Restructure Models) → repeat

SLIDE 8

How to schedule multiple training jobs on a shared cluster?

[Diagram: a scheduler assigning workers to jobs #1-#3 on a shared cluster]

  • Key features of ML jobs
  • Approximate
  • Diminishing returns
  • Exploratory process
  • Problem with resource fairness scheduling
  • Jobs in early stage: could benefit a lot from additional resources
  • Jobs almost converged: make only marginal improvement


SLIDE 9

SLAQ: quality-aware scheduling

  • Intuition: in the context of approximate ML training, more resources should be allocated to jobs that have the most potential for quality improvement

[Plots: accuracy and loss over time, quality-aware vs. fair resource scheduling]

SLIDE 10

Solution Overview

  • Normalize quality metrics
  • Predict quality improvement
  • Quality-driven scheduling

SLIDE 11

Normalizing quality metrics

Candidate quality metrics, judged on four criteria: applicable to all algorithms? comparable magnitudes? known range? predictable?

  • Accuracy / F1 Score / Area Under Curve / Confusion Matrix / etc.
  • Loss
  • Normalized Loss
  • ∆Loss
  • Normalized ∆Loss

SLIDE 12

Normalizing quality metrics

  • Normalize change of loss values w.r.t. largest change so far
  • Currently does not support some non-convex optimization algorithms
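A minimal sketch of the normalization rule described above, scaling each iteration's change in loss by the largest change observed so far (our reading of the slide, not SLAQ's code):

```python
# Sketch of loss normalization: divide each iteration's loss change by the
# largest change seen so far, so progress is comparable across algorithms
# whose raw loss values have very different magnitudes.

def normalized_delta_loss(losses):
    """losses: history of loss values, one per iteration."""
    deltas = [prev - cur for prev, cur in zip(losses, losses[1:])]
    out, largest = [], 0.0
    for d in deltas:
        largest = max(largest, abs(d))
        out.append(d / largest if largest > 0 else 0.0)
    return out

print(normalized_delta_loss([10.0, 6.0, 4.0, 3.0]))  # [1.0, 0.5, 0.25]
```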

[Plot: normalized ∆loss vs. iteration for K-Means, LogReg, SVM, SVMPoly, GBT, GBTReg, MLPC, LDA, LinReg]

SLIDE 13

Training iterations: loss prediction

  • Previous work: offline profiling / analysis [Ernest NSDI 16] [CherryPick NSDI 17]
  • Overhead for frequent offline analysis is huge
  • Strawman: use last ∆Loss as prediction for future ∆Loss
  • SLAQ: online prediction using weighted curve fitting
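One way to sketch online prediction via weighted curve fitting; the curve family (a + b/(t+1)) and the linearly increasing weights are illustrative assumptions, not SLAQ's exact model:

```python
# Sketch of SLAQ-style online loss prediction: fit a curve of the form
# a + b/(t+1) to the loss history by weighted least squares, with weights
# that favor recent iterations, then extrapolate one iteration ahead.

def predict_next_loss(losses):
    pts = [(1.0 / (t + 1), y) for t, y in enumerate(losses)]
    ws = [t + 1 for t in range(len(losses))]   # recent iterations weigh more
    Sw = sum(ws)
    Su = sum(w * u for w, (u, _) in zip(ws, pts))
    Sy = sum(w * y for w, (_, y) in zip(ws, pts))
    Suu = sum(w * u * u for w, (u, _) in zip(ws, pts))
    Suy = sum(w * u * y for w, (u, y) in zip(ws, pts))
    b = (Sw * Suy - Su * Sy) / (Sw * Suu - Su * Su)  # weighted least squares
    a = (Sy - b * Su) / Sw
    return a + b / (len(losses) + 1)                 # extrapolate to next iteration

history = [1.0, 0.55, 0.4, 0.32, 0.27, 0.24]         # a decaying loss curve
print(predict_next_loss(history))                    # predicts a value below 0.24
```

The strawman on this slide would instead just reuse the last observed ∆Loss, which over-predicts progress once the curve flattens.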

[Bar chart: loss prediction error (%) per algorithm, strawman vs. weighted curve fitting]

Algorithm | Strawman (%) | Weighted Curve (%)
LDA | 0.1 | 0.0
GBT | 0.4 | 0.4
LinReg | 1.1 | 0.2
SVM | 1.2 | 0.6
MLPC | 4.8 | 4.7
LogReg | 6.1 | 4.3
SVMPoly | 52.5 | 3.6

SLIDE 14

Scheduling approximate ML training jobs

  • Predict how much quality improves when assigning X workers to a job
  • Reassign workers to maximize cluster-wide quality improvement
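The reassignment step can be sketched as a greedy allocation: repeatedly hand the next worker to the job with the highest predicted marginal quality gain. Here `predicted_gain` is a hypothetical stand-in for SLAQ's prediction model:

```python
import heapq

# Greedy quality-driven allocation sketch: give out workers one at a time,
# each to the job whose next worker yields the largest predicted gain.

def allocate(num_workers, jobs, predicted_gain):
    """jobs: job ids; predicted_gain(job, k): gain from the job's k-th worker."""
    alloc = {j: 0 for j in jobs}
    # min-heap on negated gain acts as a max-heap on marginal gain
    heap = [(-predicted_gain(j, 1), j) for j in jobs]
    heapq.heapify(heap)
    for _ in range(num_workers):
        _, j = heapq.heappop(heap)
        alloc[j] += 1
        heapq.heappush(heap, (-predicted_gain(j, alloc[j] + 1), j))
    return alloc

# Job A improves fast, job C has nearly converged (diminishing returns),
# and each extra worker on a job is worth less than the previous one.
gains = {"A": 10.0, "B": 4.5, "C": 0.5}
alloc = allocate(6, ["A", "B", "C"], lambda j, k: gains[j] / k)
print(alloc)  # → {'A': 4, 'B': 2, 'C': 0}
```

Note how the nearly converged job C gets no workers: its predicted improvement never beats the others' marginal gains.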

[Diagram: scheduler with prediction and resource-allocation modules reassigning workers across jobs #1-#3]

SLIDE 15

Experiment setup

  • Representative mix of training jobs (algorithms, optimizers, and datasets listed below)
  • Compare against a work-conserving fair scheduler

Algorithm | Acronym | Type | Optimization Algorithm | Dataset
K-Means | K-Means | Clustering | Lloyd Algorithm | Synthetic
Logistic Regression | LogReg | Classification | Gradient Descent | Epsilon [33]
Support Vector Machine | SVM | Classification | Gradient Descent | Epsilon
SVM (polynomial kernel) | SVMPoly | Classification | Gradient Descent | MNIST [34]
Gradient Boosted Tree | GBT | Classification | Gradient Boosting | Epsilon
GBT Regression | GBTReg | Regression | Gradient Boosting | YearPredictionMSD [35]
Multi-Layer Perceptron Classifier | MLPC | Classification | L-BFGS | Epsilon
Latent Dirichlet Allocation | LDA | Clustering | EM / Online Algorithm | Associated Press Corpus [36]
Linear Regression | LinReg | Regression | L-BFGS | YearPredictionMSD

SLIDE 16

Evaluation: resource allocation across jobs

  • 160 training jobs submitted to the cluster with Poisson arrivals
  • 25% of jobs with high loss values
  • 25% of jobs with medium loss values
  • 50% of jobs with low loss values (almost converged)

[Plot: share of cluster CPUs (%) over time (seconds) for the bottom 50%, second 25%, and top 25% of jobs]

SLIDE 17

Evaluation: cluster-wide quality and time

  • Quality: SLAQ’s average loss is 73% lower than that of the fair scheduler
  • Time: SLAQ reduces the time to reach 90% (95%) loss reduction by 45% (30%)

[Plots: loss over time, and time to reach a given loss reduction, fair resource scheduling vs. SLAQ]

SLIDE 18

SLAQ Evaluation: Scalability

  • Frequently reschedules and reconfigures in reaction to changes in job progress
  • Even with thousands of concurrent jobs, SLAQ makes rescheduling decisions in just a few seconds

[Plot: scheduling time (s) vs. number of workers (1000-16000), for 1000-4000 concurrent jobs]

SLIDE 19

Conclusion

  • SLAQ leverages the approximate and iterative nature of ML training
  • Highly tailored prediction of iterative job quality
  • Allocates resources to maximize quality improvement
  • SLAQ achieves better overall quality and shorter end-to-end training time


SLIDE 20

Training iterations: runtime prediction

[Plot: iteration time (s) vs. number of cores (32-256), for data sizes 10K/100K and model sizes 10/100]

  • Iteration runtime: ∝ C · S / N
  • Model complexity C, data size S, number of workers N
  • Model update (i.e., the size of ∆w) is comparably much smaller
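A hedged sketch of this runtime model, calibrating the proportionality constant from a single profiled iteration (all numbers below are illustrative, not measurements from the paper):

```python
# Sketch of the per-iteration runtime model: time scales with model
# complexity C and data size S, and inversely with worker count N
# (model-update traffic is comparatively small, so scaling is near-linear).

def iteration_time(C, S, N, unit_cost):
    """Predicted time of one iteration: unit_cost * C * S / N."""
    return unit_cost * C * S / N

# Calibrate the constant from one profiled iteration, then extrapolate.
measured = 10.0                        # seconds observed with C=1, S=2**20, N=32
unit_cost = measured * 32 / 2**20
print(iteration_time(1, 2**20, 64, unit_cost))  # → 5.0: doubling workers halves the time
```

This is what lets a scheduler convert a predicted loss reduction per iteration into a predicted loss reduction per unit of wall-clock time for a given worker count.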
