Prediction Serving: What Happens After Learning?
Joseph E. Gonzalez
Asst. Professor, UC Berkeley (jegonzal@cs.berkeley.edu)
Co-founder, Dato Inc. (joseph@dato.com)
Outline
With: Daniel Crankshaw, Xin Wang, Michael Franklin, & Ion Stoica
Systems: Velox and Clipper
Frameworks: Create, VW, Caffe
Learning: Big Data → Training → Big Model
Timescale: minutes to days
Systems: offline and batch optimized
Heavily studied ... a major focus of the AMPLab
Inference: Application → Query → Big Model → Decision
Timescale: ~10 milliseconds
Systems: online and latency optimized
Less studied ...
Feedback: Application → Decision → Feedback → Training
Timescale: hours to weeks
Systems: a combination of systems
Less studied ...
The full loop of Learning, Inference, and Feedback must be both:
Responsive (~10 ms) and Adaptive (~1 second)
VELOX Model Serving System [CIDR’15]
Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan
Key Insight: decompose models into fast-changing and slow-changing components
Batch Training over Big Data follows the slow path, producing the Slow Changing Model; Feedback follows the fast path, updating the Fast Changing Model. Inference combines both to answer application queries with decisions.
Hybrid Offline + Online Learning
- Update the per-user weights online
- Update the feature functions offline using batch solvers
This split exploits a common modeling structure: matrix factorization (user and item factors), deep learning (a learned feature stack under a final linear layer), and ensemble methods all score an input by applying fast, per-user weights to slowly-changing learned features, as sketched below.
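A minimal sketch of this decomposition, assuming a linear per-user layer over a fixed feature function; the `UserModel` class, the squared-loss SGD step, and the learning rate are illustrative choices, not Velox's actual API:

```python
import numpy as np

class UserModel:
    """Split model: score(u, x) = w_u . phi(x).

    phi is the slow-changing feature function (retrained offline
    with batch solvers); w_u are fast-changing per-user weights
    (updated online from feedback)."""

    def __init__(self, phi, dim, lr=0.1):
        self.phi = phi           # fixed feature function, e.g. MF item factors or a deep net
        self.dim = dim
        self.lr = lr
        self.weights = {}        # user id -> w_u

    def predict(self, user, x):
        w = self.weights.setdefault(user, np.zeros(self.dim))
        return float(w @ self.phi(x))

    def observe(self, user, x, y):
        """One online SGD step on squared loss: microseconds, no retraining."""
        w = self.weights.setdefault(user, np.zeros(self.dim))
        err = w @ self.phi(x) - y
        w -= self.lr * err * self.phi(x)
```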
In Velox, the Fast Changing Model is maintained per user: feedback immediately updates that user's weights, while the shared Slow Changing Model is retrained offline.
Velox Online Learning for Recommendations (20 Newsgroups)
[Figure: test error vs. number of examples]
Online partial updates: 0.4 ms; full retraining: 7.1 seconds
>4 orders of magnitude faster adaptation, given sufficient offline training data
VELOX: the Missing Piece of BDAS
BDAS, the Berkeley Data Analytics Stack: Mesos, HDFS/S3, Tachyon, Spark, Spark Streaming, Spark SQL, BlinkDB, GraphX, GraphFrames, MLlib, KeystoneML
BDAS covers learning (MLlib, KeystoneML) but not model management and serving; Velox fills that gap.
VELOX Architecture
Models trained in Spark MLlib or KeystoneML are served by Velox from a single JVM instance, backing applications such as content recommendation, fraud detection, personal assistants, robotic control, and machine translation.
VELOX as a Middle-Layer Architecture?
Applications also rely on models from other frameworks (Create, VW, Caffe). Can the Velox design be generalized across them?
Clipper Generalizes Velox Across ML Frameworks
Daniel Crankshaw, Xin Wang, Michael Franklin, Joseph E. Gonzalez, Ion Stoica
Clipper sits between applications (content recommendation, fraud detection, personal assistants, robotic control, machine translation) and frameworks (Create, VW, Caffe).
Key Insight: the challenges of prediction serving can be addressed between end-user applications and machine learning frameworks.
As a result, Clipper is able to:
- hide complexity, by providing a common prediction interface
- bound latency and maximize throughput, through approximate caching and adaptive batching
- enable robust online learning and personalization, through generalized split-model correction policies
all without modifying machine learning frameworks or end-user applications.
Clipper Design Goals
- Low and bounded latency predictions: interactive applications need reliable latency objectives
- Up-to-date and personalized predictions across models and frameworks: generalize the split-model decomposition
- High throughput under heavy load: a single query can trigger many predictions
- Simple deployment: serve models using their original code and systems
Clipper Architecture
Applications interact with Clipper through a Predict/Observe RPC/REST interface. Inside Clipper, two layers sit between applications and the frameworks (Caffe, VW, Create, ...):
- Correction Layer: a correction policy that improves accuracy through ensembles, online learning, and personalization
- Model Abstraction Layer: a common interface to models (approximate caching + adaptive batching) that bounds latency and maximizes throughput
Each framework runs behind a Model Wrapper (MW) connected over RPC.
Model Abstraction Layer
Provides a unified, generic prediction API across frameworks:
- Reduce latency → approximate caching
- Increase throughput → adaptive batching
- Simplify deployment → RPC + model wrappers
The common interface (sketched below) simplifies deployment: models are evaluated using their original code and systems, and each model runs in a separate process for resource isolation.
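A sketch of what such a common interface might look like; the `ModelWrapper` base class and batch-oriented `predict_batch` signature are illustrative assumptions, not Clipper's published API:

```python
from abc import ABC, abstractmethod

class ModelWrapper(ABC):
    """Uniform batch-prediction interface: one wrapper per framework,
    each running in its own process (behind RPC) for isolation."""

    @abstractmethod
    def predict_batch(self, inputs):
        """Map a list of inputs to a list of predictions."""

class SklearnWrapper(ModelWrapper):
    """Example wrapper around a fitted scikit-learn estimator."""

    def __init__(self, model):
        self.model = model

    def predict_batch(self, inputs):
        # Evaluate with the framework's original code, unmodified.
        return self.model.predict(inputs).tolist()
```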
Running model wrappers in separate processes also enables scale-out in addition to resource isolation.
Adaptive Batching to Improve Throughput
Problem: frameworks are optimized for batch processing, not latency, yet a single page load may generate many queries.
The optimal batch size depends on the hardware configuration, the model and framework, and the system load.
Clipper's solution: be as slow as allowed...
- increase the batch size until the latency objective is exceeded (additive increase)
- if latency exceeds the SLO, cut the batch size by a fraction (multiplicative decrease)
Why batching helps: it exploits hardware acceleration and amortizes system overhead. A sketch of this AIMD loop follows below.
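A minimal sketch of the AIMD batching loop, assuming hypothetical `get_batch` and `predict_batch` callables supplied by the query queue and a model wrapper; the step sizes are illustrative:

```python
import time

def aimd_batch_loop(get_batch, predict_batch, slo_ms, add=2, beta=0.9):
    """Additive-increase / multiplicative-decrease batch sizing:
    grow the batch while latency stays under the SLO, and cut it
    by a fraction whenever the SLO is exceeded."""
    batch_size = 1
    while True:
        batch = get_batch(batch_size)     # block until up to batch_size queries arrive
        start = time.perf_counter()
        predict_batch(batch)
        latency_ms = (time.perf_counter() - start) * 1000
        if latency_ms <= slo_ms:
            batch_size += add                             # additive increase
        else:
            batch_size = max(1, int(batch_size * beta))   # multiplicative decrease
```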
[Figure: throughput (queries per second) and latency (ms) vs. batch size for a TensorFlow conv. net on GPU, showing the latency deadline and the optimal batch size]
Comparison to TensorFlow Serving
Takeaway: Clipper matches the average latency of TensorFlow Serving while reducing tail latency by 2x and improving throughput by 2x.
Approximate Caching to Reduce Latency
- Opportunity for caching: popular items may be evaluated frequently
- Need for approximation: high-dimensional, continuous-valued queries (e.g., bag-of-words vectors, images) have low exact cache hit rates
Clipper's solution: approximate caching, applying locality-sensitive hash functions so that near-duplicate queries hit the cache, at the cost of occasional cache-hit error. A sketch follows below.
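A minimal sketch of an approximate cache using random-hyperplane (SimHash) locality-sensitive hashing; the class name and bit width are illustrative assumptions, not Clipper's implementation:

```python
import numpy as np

class ApproximateCache:
    """LSH cache: nearby continuous queries hash to the same bucket,
    trading a small chance of cache-hit error for a much higher hit rate."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.store = {}

    def _key(self, x):
        # The sign pattern of the projections identifies the bucket.
        return tuple(bool(b) for b in (self.planes @ x) > 0)

    def get(self, x):
        return self.store.get(self._key(x))   # None on a miss

    def put(self, x, prediction):
        self.store[self._key(x)] = prediction
```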
Clipper Architecture: the Correction Layer
Goal: maximize accuracy through ensembles, online learning, and personalization. The Correction Layer generalizes the split-model insight from Velox to achieve:
- robust predictions, by combining multiple models & frameworks
- online learning and personalization, by correcting and personalizing predictions in response to feedback
As in Velox, a slow-changing model is paired with a fast-changing user model; in Clipper, the slow-changing model can come from any framework (e.g., Caffe), and the fast-changing correction is applied by Clipper itself.
The Correction Policy improves prediction accuracy by:
- incorporating real-time feedback
- managing personalization
- combining models & frameworks, which enables frameworks to compete (one such policy is sketched below)
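One way to realize such a policy is a multiplicative-weights ensemble over per-framework predictions; the class name, squared-loss choice, and learning rate below are illustrative assumptions rather than Clipper's exact policy:

```python
import numpy as np

class EnsembleCorrectionPolicy:
    """Weighted combination of per-framework predictions; weights are
    raised or lowered from observed feedback, so frameworks that
    predict well gradually win more influence."""

    def __init__(self, n_models, eta=0.5):
        self.w = np.ones(n_models)
        self.eta = eta

    def predict(self, preds):
        p = self.w / self.w.sum()
        return float(p @ np.asarray(preds))      # weighted combination

    def observe(self, preds, y):
        losses = (np.asarray(preds) - y) ** 2
        self.w *= np.exp(-self.eta * losses)     # multiplicative-weights update
```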
Improved Prediction Accuracy (ImageNet)
Applying Clipper's ensemble over a sequence of pre-trained state-of-the-art models:

System      Model         Error Rate   #Errors
Caffe       VGG           13.05%       6525
Caffe       LeNet         11.52%       5760
Caffe       ResNet        9.02%        4512
TensorFlow  Inception v3  6.18%        3088
Clipper     Ensemble      5.86%        2930

A 5.2% relative improvement in prediction accuracy over the best single model!
Increased Load
- Solutions: caching and batching; a load-shedding correction policy can prioritize frameworks
Stragglers
- e.g., a framework fails to meet its SLO
- Solution: anytime predictions
- The correction policy must render predictions with missing inputs; e.g., the built-in correction policies substitute the expected value (see the sketch below)
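A sketch of how anytime predictions might substitute expected values for straggling models, reusing the illustrative ensemble policy above; `running_means` (a list of each model's average output so far) is an assumed bookkeeping detail:

```python
def anytime_predict(policy, preds, running_means):
    """Render a prediction even if some model wrappers missed the
    deadline: replace each straggler's missing output (None) with
    its running mean, i.e., f_i(x) ~= E[f_i(x)]."""
    filled = [p if p is not None else m
              for p, m in zip(preds, running_means)]
    return policy.predict(filled)
```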
The Cost of Ensembles? Evaluation of Throughput Under Heavy Load
[Figure: accuracy vs. throughput (queries per second)]
Takeaway: Clipper is able to gracefully degrade accuracy to maintain availability under heavy load.
Conclusion
Clipper sits between applications and ML frameworks (Create, VW, Caffe) to:
- simplify deployment
- bound latency and increase throughput
- enable real-time learning and personalization across machine learning frameworks
[Recap diagram: Big Data → Training → Model → Application → Decision, with Feedback closing the loop; Velox and Clipper serve models from Create, VW, and Caffe]
Ongoing & Future Research Directions
- Serving and updating RL models
- Bandit techniques in correction policies
- Splitting inference across the cloud and the client to reduce latency and bandwidth requirements
- Secure model evaluation on the client (model DRM)
[Backup figure: accuracy under hash coarsening, from no coarsening through coarsening + anytime predictions (better, best) to overly coarsened, along axes of coarser hash and more features]
Anytime predictions substitute the expected value: f_i(x; θ) ≈ E[f_i(x; θ)]
Approximate caching returns the prediction for a nearby cached query z: f_i(x; θ) ≈ f_i(z; θ)