Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems



SLIDE 1

Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1

1New York University, 2Microsoft

jeffjunzhang@nyu.edu

USENIX HotCloud 2020

[*] Machine Learning-as-a-Service

SLIDE 2

Deep-Learning Models are Pervasive


SLIDE 3

Computations in Deep Learning

[Figure: a convolution layer applies convolution filters to an activation tensor to produce an output tensor.]

[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017.

Convolutions account for more than 90% of the computation, so they dominate both run-time and energy.*

Execution-time factors: network depth, and the activation/filter sizes in each layer.
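To see why these factors dominate, the multiply-accumulate (MAC) count of a single convolution layer can be tallied directly (a sketch; the 3x3, 64-channel, 56x56 layer shape is illustrative, in the style of ResNet):

```python
def conv_macs(in_channels, out_channels, kernel, out_h, out_w):
    """Multiply-accumulate count for one convolution layer: each of the
    out_channels * out_h * out_w output elements needs
    in_channels * kernel * kernel multiply-accumulates."""
    return in_channels * kernel * kernel * out_channels * out_h * out_w

# A ResNet-style 3x3 convolution on a 56x56 feature map, 64 -> 64 channels:
macs = conv_macs(64, 64, 3, 56, 56)  # over 10^8 MACs for a single layer
```

Deeper networks stack tens of such layers, which is why the ResNet variants on the following slides span such a wide range of inference times.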

SLIDE 4

Machine Learning Lifecycle (Typical)

[Figure: offline model development (data collection, cleaning & visualization, feature engineering & model design, training & validation) produces trained models and training pipelines; an online prediction service answers end-user queries with predictions and feeds logged live data back for retraining.]

[Ref.] Gonzalez, J., et al., “Deploying Interactive ML Applications with Clipper.”

Model Development

  • Prescribes model design, architecture and data processing

Training

  • At scale on live data
  • Retraining on new data and managing model versioning

Serving or Online Inference (MLaaS)

  • Deploys trained models on device, at the edge, or in the cloud

SLIDE 5

MLaaS: Challenges and Limitations


[Ref.] Gujarati, A. et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.

Existing Solution

  • Static model versioning: tie each application to one specific model at run-time
  • In the event of load spikes, either:
  • Prune requests (new, low-priority, near-deadline, etc.) → QoS violations, bad for the customer
  • Add “significant new capacity” (autoscaling) → not economically viable for the provider

Goal: maintain QoS under dynamic workloads.

March 28, 2020:
  • Service success rates dropped below 99.99%
  • Teams suffered a 2-hour outage in Europe
  • Free offers and new subscriptions were limited

SLIDE 6

Opportunity: DNN Model Diversity

For the same application, many models can be trained with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Figure: single-image inference with ResNet-x (ResNet-18/34/50/101/152 at 1, 2, 4, 8, and 16 threads): execution times of roughly 200-1000 ms against accuracies of 90-94.5%.]

[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016.

SLIDE 7

Questions in this Study

Setting: end users send requests to an ML application hosted by an MLaaS provider, which renders predictions under SLAs.

Assumption: fluctuating workloads, fixed hardware capacity.

  • Which DNN model to use?
  • How to allocate resources?
  • What is the QoS?

Typical SLA objectives: latency, throughput, cost, … All decisions must be made online!

SLIDE 8
What Do Users Care About?

  • 1. “Cat”, 5 s
  • 2. “Dog”, 0.3 s

From the users’ perspective, deadline misses and incorrect predictions are equally bad:

  • A user can always meet the deadline by guessing randomly

Users want quick and correct predictions!

SLIDE 9

A New Metric for MLaaS

Effective Accuracy (b_eff): the fraction of correct predictions delivered within the deadline D,

b_eff = q_{μ,D} × b

where μ is the load, b is the model’s baseline accuracy, and q_{μ,D} is the likelihood of meeting the deadline D at load μ.

[Figure: effective accuracy (0.88-0.94) vs. load (5-30 queries per second) for ResNet-152/101/50/34/18.]

No single DNN works best at all load levels.
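Given measured per-query latencies, the metric can be estimated empirically (a sketch; the latency samples and the b = 0.93 baseline accuracy below are illustrative):

```python
def effective_accuracy(latencies_s, deadline_s, baseline_accuracy):
    """b_eff = q_{mu,D} * b: the fraction of requests that meet the
    deadline (q) times the model's baseline accuracy (b)."""
    q = sum(l <= deadline_s for l in latencies_s) / len(latencies_s)
    return q * baseline_accuracy

# 3 of 5 sampled queries meet a 0.75 s deadline, so a b = 0.93 model
# yields b_eff = 0.6 * 0.93 = 0.558.
b_eff = effective_accuracy([0.2, 0.5, 0.7, 0.9, 1.1], 0.75, 0.93)
```

At low load q approaches 1 and the biggest model wins; at high load q collapses for slow models, which is what makes switching to a smaller model pay off.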

slide-10
SLIDE 10

Characterizing DNN Parallelism

Fixed capacity: 16 threads. End-to-end query latency measured for replica/thread splits (R: replicas, T: threads per replica).

[Figure: ResNet-152 99th-percentile tail latency (roughly 500-4500 ms) vs. job arrival rate (1-8 queries per second) for R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, and R:1 T:16.]

As load increases, additional replicas help more than additional threads.
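The 99th-percentile tail latency reported in plots like this one can be computed from sampled latencies with a nearest-rank percentile (a minimal sketch, not code from the paper):

```python
import math

def tail_latency(samples_ms, p=99):
    """Nearest-rank percentile: the p-th percentile of observed latencies."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]

p99 = tail_latency(list(range(1, 101)))  # -> 99
```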

SLIDE 11

Online Model Switching Framework

The model that exhibits the best effective accuracy is a function of load, so the Model-Switching Controller dynamically selects the best model (by effective accuracy) based on the current load and the SLA deadline.

Components: load change detection, offline policy training, model switching.

Example switching policy:

Load (QPS)    Best Model    Parallelism
0-4           ResNet-152    <R:4 T:4>
…             …             …
> 20          ResNet-18     …

[Figure: the front-end receives queries, the Model-Switching Controller picks among the deployed models, and the online serving tier returns predictions.]
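The switching policy above amounts to a threshold lookup. Only the 0-4 QPS → ResNet-152 and > 20 QPS → ResNet-18 entries come from the slide; the intermediate thresholds below are hypothetical placeholders:

```python
import math

# (max load in QPS, model) pairs; entries other than the first and last
# are illustrative, not from the talk.
POLICY = [
    (4.0, "ResNet-152"),     # light load: largest, most accurate model
    (10.0, "ResNet-101"),
    (15.0, "ResNet-50"),
    (20.0, "ResNet-34"),
    (math.inf, "ResNet-18"), # heavy load: fastest model
]

def select_model(load_qps):
    """Return the most accurate model whose load bucket covers load_qps."""
    for threshold, model in POLICY:
        if load_qps <= threshold:
            return model
```

In the actual framework these thresholds would be produced by offline policy training against the measured effective-accuracy curves.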

SLIDE 12

Experimental Setup

  • Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
  • Deployed PyTorch pretrained ResNet models on ImageNet (R: 4, T: 4)
  • Two dedicated Azure VMs:
  • Server: 32 vCPUs + 128 GB RAM
  • Client: 8 vCPUs + 32 GB RAM
  • Markov-model-based load generator:
  • Open system model
  • Poisson inter-arrivals
  • Load sampling period: 1 sec

[Ref.] Crankshaw, D., et al., “Clipper: A Low-Latency Online Prediction Serving System”, USENIX NSDI 2017.
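An open-system load generator with Poisson inter-arrivals can be sketched as follows (an illustrative reconstruction, not the authors' code):

```python
import random

def poisson_arrivals(rate_qps, duration_s, seed=42):
    """Open-system arrival times: independent exponential inter-arrival
    gaps with mean 1/rate_qps form a Poisson process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_qps)
        if t >= duration_s:
            return times
        times.append(t)

arrivals = poisson_arrivals(10.0, 60.0)  # roughly 600 requests over a minute
```

Because the system is open, requests keep arriving regardless of how backed up the server is, which is what produces the tail-latency blow-ups under overload.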

SLIDE 13

Evaluation: Automatic Model-Switching

[Figure: model selected over time (0-300 s) as the load varies between roughly 5 and 25 queries per second; the controller switches among ResNet-18/34/50/101/152.]

Model-Switching can quickly adapt to load spikes.

SLIDE 14

Evaluation: Effective Accuracy

[Figure: effective accuracy (0.6-0.95) vs. deadline (0.7-1.5 s) for Model-Switch and each fixed ResNet model.]

Model-Switching achieves Pareto-optimal effective accuracy.

SLIDE 15

Evaluation: Tail Latency

SLA deadline: 750 ms

[Figure: empirical CDF of end-to-end latency (0.5-2 s) for Model-Switch and each fixed ResNet model, with the 0.75 s deadline marked.]

Model-Switching trades off deadline slack for accuracy.

SLIDE 16

Model-Switching: Managing Fluctuating Workloads in MLaaS Systems

Thank You and Questions

Jeff Zhang

jeffjunzhang@nyu.edu

SLIDE 17
Discussion and Future Work

  • How to prepare a pool of models for each application?
  • Neural Architecture Search, multi-level quantization
  • Current approach pre-deploys all (20) candidate models
  • Cold-start time (ML): tens of seconds
  • RAM overhead: currently 11.8% of the total 128 GB RAM
  • Reinforcement-learning-based controller for model switching
  • Accounts for job-queue status, system load, current latency
  • Free of offline policy training
  • Integrate with existing MLaaS techniques
  • Batching, caching, autoscaling, etc.
  • Exploit the availability of heterogeneous computing resources
  • CPU, GPU, TPU, FPGA