Model-Switching: Dealing with Fluctuating Workloads in MLaaS* Systems
1New York University, 2Microsoft
jeffjunzhang@nyu.edu
USENIX HotCloud 2020 [*] Machine Learning-as-a-Service
Jeff Zhang1, Sameh Elnikety2, Shuayb Zarar2, Atul Gupta2, Siddharth Garg1
[Figure: convolution of an activation tensor with convolution filters, producing an output tensor]
[Ref.] Sze, V., et al., “Efficient Processing of Deep Neural Networks: A Tutorial and Survey”, Proceedings of the IEEE, 2017.
[Figure: ML application lifecycle. Offline training: data collection, cleaning & visualization, feature engineering & model design, training & validation; training pipelines turn training data and live data into trained models. Online inference: the end-user application sends a query to the prediction service (inference and logic), receives a prediction, and returns feedback.]
[Ref.] Gonzalez, J. et al., “Deploying Interactive ML Applications with Clipper.”
[Ref.] Gujarati, A. et al., “Swayam: Distributed Autoscaling to Meet SLAs of ML Inference Services with Resource Efficiency”, ACM Middleware 2017.
March 28, 2020
[Figure: Single Image Inference with ResNet-x. Accuracy vs. Execution Time (ms) for ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, each run with 1, 2, 4, 8, and 16 threads.]
For the same application, many models can be trained, each with a different trade-off among accuracy, inference time, and computation cost (parallelism).
[Ref.] He, K., et al., “Deep Residual Learning for Image Recognition”, IEEE CVPR 2016.
[Figure: residual block]
[Figure: model serving: sending requests, rendering predictions]
[Figure: Effective Accuracy vs. Load (Queries Per Second) for ResNet-152, ResNet-101, ResNet-50, ResNet-34, and ResNet-18.]
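The slides do not spell out how effective accuracy is computed. A minimal sketch, assuming a query counts toward effective accuracy only when its prediction is correct and its latency meets the deadline (that definition, the function name `effective_accuracy`, and its parameters are assumptions for illustration, not taken from the slides):

```python
from typing import Sequence


def effective_accuracy(
    correct: Sequence[bool],
    latencies_s: Sequence[float],
    deadline_s: float,
) -> float:
    """Fraction of queries that are both correct and on time.

    ASSUMPTION: a query that misses the deadline is treated as a miss,
    even if the model's prediction was correct.
    """
    if len(correct) != len(latencies_s):
        raise ValueError("correct and latencies_s must have the same length")
    if not correct:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ok, lat in zip(correct, latencies_s) if ok and lat <= deadline_s
    )
    return hits / len(correct)


# Four queries against a hypothetical 0.75 s deadline:
# the second is late and the third is wrong, so two of four count.
example = effective_accuracy(
    correct=[True, True, False, True],
    latencies_s=[0.2, 0.9, 0.3, 0.5],
    deadline_s=0.75,
)
```

Under this reading, a large model's effective accuracy falls as load grows because more of its correct answers arrive after the deadline, which would match the shape of the curves in the figure.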
[Figure: ResNet-152 Tail Latency vs. Job Arrival Rate. 99th percentile tail latency (ms) vs. Load (Queries Per Second) for configurations R:16 T:1, R:8 T:2, R:4 T:4, R:2 T:8, and R:1 T:16.]
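The tail-latency figure reports the 99th percentile of query latency. As a rough illustration of that metric, the sketch below computes a percentile with the nearest-rank method; the choice of nearest-rank (rather than an interpolating estimator) and the function name `percentile_nearest_rank` are assumptions, since the slides do not say how the percentile was computed.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Smallest rank such that at least p percent of samples are <= the result.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies of 1..100 ms: the 99th percentile is the 99th value.
latencies_ms = list(range(1, 101))
p99 = percentile_nearest_rank(latencies_ms, 99)
```

Tracking the 99th percentile rather than the mean exposes queueing delay: a handful of slow queries barely moves the average but dominates the tail.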
Load      Best Model   Parallelism
0-4 QPS   ResNet-152   <R:4 T:4>
…
> 20      ResNet-18    …
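The load-to-configuration table above can be read as a lookup keyed on the measured load. The sketch below is one way to express that lookup, not the authors' implementation. The names `Config` and `select_config` are hypothetical, reading R as replicas and T as threads is an assumption, and rows the slide elides are left as `None` rather than guessed.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    model: str
    replicas: Optional[int]  # "R" on the slide (assumed meaning); None if elided
    threads: Optional[int]   # "T" on the slide (assumed meaning); None if elided


def select_config(
    load_qps: float,
    upper_bounds: Sequence[float],
    configs: Sequence[Optional[Config]],
) -> Optional[Config]:
    """Pick the table row whose load band contains load_qps.

    upper_bounds holds the inclusive upper edge of every band except the
    last, in ascending order; configs has one entry per band.
    """
    if len(configs) != len(upper_bounds) + 1:
        raise ValueError("need exactly one config per load band")
    return configs[bisect_left(upper_bounds, load_qps)]


# Only the two rows visible on the slide are filled in.
BOUNDS = [4.0, 20.0]
TABLE = [
    Config("ResNet-152", 4, 4),       # 0-4 QPS
    None,                             # 4-20 QPS: rows elided on the slide
    Config("ResNet-18", None, None),  # > 20 QPS: parallelism elided
]

low_load = select_config(3.0, BOUNDS, TABLE)
high_load = select_config(25.0, BOUNDS, TABLE)
```

Keeping the band edges and the rows in separate sequences makes the lookup a single binary search, so the table can be rebuilt from offline profiling without touching the selection logic.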
SLA deadline
Model Switching
[Ref.] D. Crankshaw et al., “Clipper: A low-latency online prediction serving system”, USENIX NSDI 2017.
sampling period 1 sec
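A sampling period of 1 sec implies that the load is re-estimated about once per second. A minimal sketch of such an estimator, assuming it simply counts the requests that arrived within the most recent period (the class `LoadMonitor` and its methods are hypothetical; the slides do not describe how load is measured):

```python
from collections import deque
from typing import Deque


class LoadMonitor:
    """Estimate load in queries per second over a sliding window."""

    def __init__(self, period_s: float = 1.0) -> None:
        if period_s <= 0:
            raise ValueError("period_s must be positive")
        self.period_s = period_s
        self._arrivals: Deque[float] = deque()

    def record(self, timestamp_s: float) -> None:
        """Record one request arrival; timestamps must be non-decreasing."""
        self._arrivals.append(timestamp_s)

    def qps(self, now_s: float) -> float:
        """Return arrivals in the window (now_s - period_s, now_s], per second."""
        cutoff = now_s - self.period_s
        while self._arrivals and self._arrivals[0] <= cutoff:
            self._arrivals.popleft()
        return len(self._arrivals) / self.period_s


# Four arrivals; only those after t = 0.6 s fall in the window ending at 1.6 s.
monitor = LoadMonitor(period_s=1.0)
for t in (0.1, 0.2, 0.9, 1.5):
    monitor.record(t)
current_qps = monitor.qps(now_s=1.6)
```

A controller could call `qps` once per sampling period and use the result to decide whether to switch models; a shorter period reacts faster to bursts but yields a noisier estimate.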
[Figure: Load (Queries Per Second) vs. Time (sec); legend: ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152.]
[Figure: Effective Accuracy vs. Deadline (axis labelled "ms", annotated "(s)" on the slide) for Model Switch, ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152.]
[Figure: Empirical CDF of Latency (axis labelled "ms", annotated "(s)" on the slide), with a marker at 0.75 (s), for Model Switch, ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152.]