Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems



SLIDE 1

Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems

Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1

1New York University, 2Microsoft

jeffjunzhang@nyu.edu

USENIX HotCloud 2020

[*] Machine Learning-as-a-Service

SLIDE 2

Deep-Learning Models are Pervasive


SLIDE 3

Computations in Deep Learning

[Figure: a convolution layer applies convolution filters to an activation tensor to produce an output tensor.]

[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017.

Convolutions account for more than 90% of the computation, so they dominate both run-time and energy.*

Execution-time factors: network depth, and the activation/filter sizes in each layer.
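To see why these factors dominate, the multiply-accumulate (MAC) count of a single convolution layer can be tallied directly (a sketch; the 3x3, 64-channel, 56x56 layer shape is illustrative, in the style of ResNet):

```python
def conv_macs(in_channels, out_channels, kernel, out_h, out_w):
    """Multiply-accumulate count for one convolution layer: each of the
    out_channels * out_h * out_w output elements needs
    in_channels * kernel * kernel multiply-accumulates."""
    return in_channels * kernel * kernel * out_channels * out_h * out_w

# A ResNet-style 3x3 convolution on a 56x56 feature map, 64 -> 64 channels:
macs = conv_macs(64, 64, 3, 56, 56)  # over 10^8 MACs for a single layer
```

Deeper networks stack tens of such layers, which is why the ResNet variants on the following slides span such a wide range of inference times.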

SLIDE 4

Machine Learning Lifecycle (Typical)

[Figure: offline model development (data collection, cleaning & visualization, feature engineering & model design, training & validation) produces trained models and training pipelines; an online prediction service answers end-user queries with predictions and feeds logged live data back for retraining.]

[Ref.] Gonzalez, J., et al., “Deploying Interactive ML Applications with Clipper.”

Model Development

  • Prescribes model design, architecture and data processing

Training

  • At scale on live data
  • Retraining on new data and managing model versioning

Serving or Online Inference (MLaaS)

  • Deploys trained models on device, at the edge, or in the cloud

SLIDE 5

MLaaS: Challenges and Limitations


[Ref.] Gujarati, A. et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.

Existing Solution

  • Static model versioning: tie each application to one specific model at run-time
  • In the event of load spikes, either:
  • Prune requests (new, low-priority, near-deadline, etc.) → QoS violations, bad for the customer
  • Add “significant new capacity” (autoscaling) → not economically viable for the provider

Goal: maintain QoS under dynamic workloads.

March 28, 2020:
  • Service success rates dropped below 99.99%
  • Teams suffered a 2-hour outage in Europe
  • Free offers and new subscriptions were limited

SLIDE 6

Opportunity: DNN Model Diversity

For the same application, many models can be trained with tradeoffs among accuracy, inference time, and computation cost (parallelism).

[Figure: single-image inference with ResNet-x (ResNet-18/34/50/101/152 at 1, 2, 4, 8, and 16 threads): execution times of roughly 200-1000 ms against accuracies of 90-94.5%.]

[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016.

SLIDE 7

Questions in this Study

Setting: end users send requests to an ML application hosted by an MLaaS provider, which renders predictions under SLAs.

Assumption: fluctuating workloads, fixed hardware capacity.

  • Which DNN model to use?
  • How to allocate resources?
  • What is the QoS?

Typical SLA objectives: latency, throughput, cost, … All decisions must be made online!

SLIDE 8
What Do Users Care About?

  • 1. “Cat”, 5 s
  • 2. “Dog”, 0.3 s

From the users’ perspective, deadline misses and incorrect predictions are equally bad:

  • A user can always meet the deadline by guessing randomly

Users want quick and correct predictions!

SLIDE 9

A New Metric for MLaaS

Effective Accuracy (b_eff): the fraction of correct predictions delivered within the deadline D,

b_eff = q_{μ,D} × b

where μ is the load, b is the model’s baseline accuracy, and q_{μ,D} is the likelihood of meeting the deadline D at load μ.

[Figure: effective accuracy (0.88-0.94) vs. load (5-30 queries per second) for ResNet-152/101/50/34/18.]

No single DNN works best at all load levels.
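Given measured per-query latencies, the metric can be estimated empirically (a sketch; the latency samples and the b = 0.93 baseline accuracy below are illustrative):

```python
def effective_accuracy(latencies_s, deadline_s, baseline_accuracy):
    """b_eff = q_{mu,D} * b: the fraction of requests that meet the
    deadline (q) times the model's baseline accuracy (b)."""
    q = sum(l <= deadline_s for l in latencies_s) / len(latencies_s)
    return q * baseline_accuracy

# 3 of 5 sampled queries meet a 0.75 s deadline, so a b = 0.93 model
# yields b_eff = 0.6 * 0.93 = 0.558.
b_eff = effective_accuracy([0.2, 0.5, 0.7, 0.9, 1.1], 0.75, 0.93)
```

At low load q approaches 1 and the biggest model wins; at high load q collapses for slow models, which is what makes switching to a smaller model pay off.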

slide-10
SLIDE 10

Characterizing DNN Parallelism

Fixed capacity: 16 threads. End-to-end query latency measured for replica/thread splits (R: replicas, T: threads per replica).

[Figure: ResNet-152 99th-percentile tail latency (roughly 500-4500 ms) vs. job arrival rate (1-8 queries per second) for R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, and R:1 T:16.]

As load increases, additional replicas help more than additional threads.
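The 99th-percentile tail latency reported in plots like this one can be computed from sampled latencies with a nearest-rank percentile (a minimal sketch, not code from the paper):

```python
import math

def tail_latency(samples_ms, p=99):
    """Nearest-rank percentile: the p-th percentile of observed latencies."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]

p99 = tail_latency(list(range(1, 101)))  # -> 99
```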

SLIDE 11

Online Model Switching Framework

The model that exhibits the best effective accuracy is a function of load, so the Model-Switching Controller dynamically selects the best model (by effective accuracy) based on the current load and the SLA deadline.

Components: load change detection, offline policy training, model switching.

Example switching policy:

Load (QPS)    Best Model    Parallelism
0-4           ResNet-152    <R:4 T:4>
…             …             …
> 20          ResNet-18     …

[Figure: the front-end receives queries, the Model-Switching Controller picks among the deployed models, and the online serving tier returns predictions.]
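The switching policy above amounts to a threshold lookup. Only the 0-4 QPS → ResNet-152 and > 20 QPS → ResNet-18 entries come from the slide; the intermediate thresholds below are hypothetical placeholders:

```python
import math

# (max load in QPS, model) pairs; entries other than the first and last
# are illustrative, not from the talk.
POLICY = [
    (4.0, "ResNet-152"),     # light load: largest, most accurate model
    (10.0, "ResNet-101"),
    (15.0, "ResNet-50"),
    (20.0, "ResNet-34"),
    (math.inf, "ResNet-18"), # heavy load: fastest model
]

def select_model(load_qps):
    """Return the most accurate model whose load bucket covers load_qps."""
    for threshold, model in POLICY:
        if load_qps <= threshold:
            return model
```

In the actual framework these thresholds would be produced by offline policy training against the measured effective-accuracy curves.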

SLIDE 12

Experimental Setup

  • Built on top of Clipper, an open-source containerized model-serving framework (caching and adaptive batching disabled)
  • Deployed PyTorch pretrained ResNet models on ImageNet (R: 4, T: 4)
  • Two dedicated Azure VMs:
  • Server: 32 vCPUs + 128 GB RAM
  • Client: 8 vCPUs + 32 GB RAM
  • Markov-model-based load generator:
  • Open system model
  • Poisson inter-arrivals
  • Load sampling period: 1 sec

[Ref.] Crankshaw, D., et al., “Clipper: A Low-Latency Online Prediction Serving System”, USENIX NSDI 2017.
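An open-system load generator with Poisson inter-arrivals can be sketched as follows (an illustrative reconstruction, not the authors' code):

```python
import random

def poisson_arrivals(rate_qps, duration_s, seed=42):
    """Open-system arrival times: independent exponential inter-arrival
    gaps with mean 1/rate_qps form a Poisson process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_qps)
        if t >= duration_s:
            return times
        times.append(t)

arrivals = poisson_arrivals(10.0, 60.0)  # roughly 600 requests over a minute
```

Because the system is open, requests keep arriving regardless of how backed up the server is, which is what produces the tail-latency blow-ups under overload.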

SLIDE 13

Evaluation: Automatic Model-Switching

[Figure: model selected over time (0-300 s) as the load varies between roughly 5 and 25 queries per second; the controller switches among ResNet-18/34/50/101/152.]

Model-Switching can quickly adapt to load spikes.

SLIDE 14

Evaluation: Effective Accuracy

[Figure: effective accuracy (0.6-0.95) vs. deadline (0.7-1.5 s) for Model-Switch and each fixed ResNet model.]

Model-Switching achieves Pareto-optimal effective accuracy.

SLIDE 15

Evaluation: Tail Latency

SLA deadline: 750 ms

[Figure: empirical CDF of end-to-end latency (0.5-2 s) for Model-Switch and each fixed ResNet model, with the 0.75 s deadline marked.]

Model-Switching trades off deadline slack for accuracy.

SLIDE 16

Model-Switching: Managing Fluctuating Workloads in MLaaS Systems

Thank You and Questions

Jeff Zhang

jeffjunzhang@nyu.edu

SLIDE 17
Discussion and Future Work

  • How to prepare a pool of models for each application?
  • Neural Architecture Search, multi-level quantization
  • Current approach pre-deploys all (20) candidate models
  • Cold-start time (ML): tens of seconds
  • RAM overhead: currently 11.8% of the total 128 GB RAM
  • Reinforcement-learning-based controller for model switching
  • Accounts for job-queue status, system load, current latency
  • Free of offline policy training
  • Integrate with existing MLaaS techniques
  • Batching, caching, autoscaling, etc.
  • Exploit the availability of heterogeneous computing resources
  • CPU, GPU, TPU, FPGA