Research at the intersection of AI + Systems
Joseph E. Gonzalez
Assistant Professor, UC Berkeley
jegonzal@cs.berkeley.edu
Looking Back on AI Systems
Going back to when I started graduate school …
The machine learning community has had an evolving focus on AI Systems (2006–2017):
Fast Algorithms → Distributed Algorithms → Deep Learning Frameworks → ML for Systems
Integration of Communities
Machine Learning Frameworks
Big Data → Training (Learning) → Big Model
The focus of AI Systems research has been on model training.
Big Data → Training → Big Model, enabled by:
Distributed Dataflow Systems, Stochastic Optimization, GPU / TPU Acceleration, Deep Learning (CNN/RNN), Domain Specific Languages (TensorFlow), Symbolic Methods
Enabling Machine Learning and Systems Innovations
Splash, CoCoA, VW, rllab
But models must also drive actions. The full pipeline spans two phases:
Learning: Big Data → Training → Big Model
Prediction: Big Model → Application → Decision / Query
Goal: ~10 ms latency under heavy load
Complicated by deep learning → new ML algorithms and systems
Models are getting more complex:
Ø 10s of GFLOPs per prediction [1]
Ø Recurrent nets
Must support low-latency, high-throughput serving workloads
Deployed on the critical path:
Ø Maintain latency goals under heavy load
[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2016.
Using specialized hardware for predictions
Google Translate serving: 140 billion words a day [1] on 82,000 GPUs running 24/7
[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
Designed New Hardware! Tensor Processing Unit (TPU)
“If each of the world’s Android phones used the new Google voice search for just three minutes a day, these engineers realized, the company would need twice as many data centers.” – Wired
Prediction-Serving Challenges
Ø Large and growing ecosystem of ML models and frameworks (VW, Caffe, ...)
Ø Support low-latency, high-throughput serving workloads
Ø Wide range of applications and frameworks
One-Off Systems for High-Value Tasks
Problems:
Ø Expensive to build and maintain
Ø Requires AI + Systems expertise
Ø Tightly coupled model, framework, and application
Ø Difficult to update models and add new frameworks
Prediction Serving is an Open Problem
Ø Computationally challenging
Ø Need low latency & high throughput
Ø No standard technology or abstractions for serving models
Low Latency Prediction Serving System [NSDI’17]
Prediction Cascades
Learning to make fast predictions
[Work in Progress]
Daniel Crankshaw, Xin Wang, Ion Stoica, Giulio Zhou, Alexey Tumanov, Corey Zumar, Yika Luo
Low Latency Prediction Serving System
Clipper: a middle layer for prediction serving
Ø Common abstraction across a wide range of applications and frameworks (VW, Caffe, ...)
Ø System optimizations
Clipper Decouples Applications and Models
Applications issue Predict / Observe calls through an RPC/REST interface; Clipper forwards them over RPC to model containers (MC) hosting framework-specific models (e.g., Caffe).
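For illustration, a minimal sketch of an application-side call against such a REST predict endpoint (the host, route, and JSON schema are assumptions for exposition, not Clipper's exact API):

import requests

def predict(app_name, x, host="localhost:1337"):
    """Send one query to the serving layer and return its prediction."""
    resp = requests.post(
        f"http://{host}/{app_name}/predict",  # assumed route
        json={"input": x},                    # assumed request schema
        timeout=0.3,                          # serving SLOs are tens of ms
    )
    resp.raise_for_status()
    return resp.json()["output"]              # assumed response schema

# The application never touches the model or its framework directly:
# label = predict("image-classifier", pixel_values)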
Clipper Architecture
Applications → Predict / Observe → RPC/REST Interface
Model Selection Layer: combine predictions across frameworks
Model Abstraction Layer: provide a common interface and system optimizations
RPC → Model Containers (MC), e.g., Caffe
Model Abstraction Layer
Provides a common API with model isolation, caching, and optimized batching.
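To make the common API concrete, a minimal sketch of what a model container interface might look like (the class and method names are illustrative assumptions, not Clipper's actual container API):

from abc import ABC, abstractmethod
from typing import Any, List

class ModelContainer(ABC):
    """Wraps one model from any framework behind a uniform batch API."""

    @abstractmethod
    def predict_batch(self, inputs: List[Any]) -> List[Any]:
        """Return one prediction per input. Called over RPC by the
        abstraction layer, which handles caching and batching."""

class SKLearnContainer(ModelContainer):
    """Example container for a fitted scikit-learn estimator."""

    def __init__(self, model):
        self.model = model

    def predict_batch(self, inputs):
        return self.model.predict(inputs).tolist()

Keeping each container behind RPC means a misbehaving model cannot take down the serving layer (model isolation).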
A single page load may generate many queries
Batching to Improve Throughput
Ø Optimal batch size depends on:
  Ø hardware configuration
  Ø model and framework
  Ø system load
Ø Why batching helps: throughput-optimized frameworks amortize overheads, so throughput grows with batch size
[Figure: throughput vs. batch size for a throughput-optimized framework]
Clipper Solution: adaptively trade off latency and throughput
Ø Increase batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
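A small sketch of this additive-increase/multiplicative-decrease loop (the step size and backoff fraction are illustrative choices):

class AIMDBatchSizer:
    """Adapt batch size to maximize throughput under a latency SLO."""

    def __init__(self, slo_ms, step=1, backoff=0.9):
        self.slo_ms = slo_ms
        self.step = step          # additive increase
        self.backoff = backoff    # multiplicative decrease
        self.batch_size = 1

    def update(self, observed_latency_ms):
        if observed_latency_ms > self.slo_ms:
            # Latency exceeded the SLO: cut the batch size by a fraction.
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        else:
            # Under the SLO: probe for more throughput.
            self.batch_size += self.step

# After each batch completes:
#   sizer.update(measured_latency_ms)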
[Figure: Adaptive batching vs. no batching. Throughput (QPS), P99 latency (ms), and batch size for Random Forest (SKLearn), Linear SVM (PySpark), Linear SVM (SKLearn), Kernel SVM (SKLearn), and Log Regression (SKLearn).]
Overhead of the decoupled architecture
Compared Clipper's decoupled design (RPC/REST interface + model containers) against TensorFlow-Serving's integrated design.
[Figure: throughput (QPS, higher is better) and P99 latency (ms, lower is better)]
The decentralized system matches the performance of the centralized design.
Model Selection Layer
Combine predictions across frameworks.
Model selection in practice:
Ø Periodic retraining: Version 1 → Version 2 → Version 3
Ø Experiment with new models and frameworks (e.g., Caffe)
Example: models predict "CAT", "DJ", "CAT", "CAT" → return "CAT"; when models disagree → UNSURE
Selection Policy: Estimate Confidence
The selection policy can calibrate confidence: when all (or nearly all) models agree, report the prediction as confident; otherwise flag it as unsure.
[Figure: Top-5 error rate on ImageNet; bar width is the percentage of the query workload. Ensemble: 0.0586. 4-agree: 0.0469 (confident) / 0.3182 (unsure). 5-agree: 0.0327 (confident) / 0.1983 (unsure). Lower is better.]
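A sketch of an agreement-based policy in the spirit of the 4-agree/5-agree variants above (the sentinel and tie-breaking are assumptions):

from collections import Counter

UNSURE = object()  # sentinel for low-confidence predictions

def select(predictions, k):
    """Return the majority label if at least k of the models agree,
    otherwise flag the query as unsure.

    predictions: one label per model, e.g. ["CAT", "CAT", "CAT", "CAT", "DOG"]
    """
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= k else UNSURE

# With 5 models, k=5 is the "5-agree" policy and k=4 the "4-agree" policy:
#   select(["CAT"] * 5, k=5)            -> "CAT"
#   select(["CAT"] * 4 + ["DOG"], k=5)  -> UNSURE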
Open Research Questions
Ø Efficient execution of complex model compositions
Ø Optimal batching to achieve end-to-end latency goals
Ø Automatic model failure identification and correction
  Ø Use anomaly detection techniques to identify model failures
Ø Prediction serving on the edge
  Ø Allowing models to span cloud and edge infrastructure
http://clipper.ai
Low Latency Prediction Serving System [NSDI’17]
Prediction Cascades
Learning to make fast predictions.
[Work in Progress]
[Figure: accuracy (56.6, 69.8, 73.3, 76.2, 77.4, 78.3) vs. relative cost (0.08, 0.15, 0.31, 0.33, 0.67, 1.0) for models of increasing complexity.]
Model costs are increasing much faster than gains in accuracy: the accuracy gains are small but significant, while the cost gap spans an order of magnitude.
Prediction Cascades
Simple models for simple tasks
Daniel Crankshaw, Xin Wang, Alexey Tumanov, Yika Luo
Query → Simple Model (fast): return Prediction, or "I don't know" → Accurate Model (slow) → Prediction
Combine fast (inaccurate) models with slow (accurate) models to maximize accuracy while reducing computational cost.
https://arxiv.org/abs/1706.00885
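A minimal sketch of such a two-stage cascade with a confidence-gated fallback (the threshold and the (label, confidence) model interface are assumptions):

def cascade_predict(x, fast_model, slow_model, threshold=0.9):
    """Answer with the fast model when it is confident; otherwise
    fall back to the slow, accurate model."""
    label, confidence = fast_model(x)   # cheap, e.g. a small CNN
    if confidence >= threshold:
        return label                    # most queries stop here
    return slow_model(x)                # expensive, e.g. ResNet152

The threshold sets the accuracy/cost trade-off: raising it routes more queries to the slow model.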
[Figure: cascade accuracy stays at 78.3 while relative cost falls from 1.0 to 0.63.]
37% reduction in runtime at no loss in accuracy.
Ø Cascades within a Model
Rather than pairing a separate simple model with ResNet152, insert gates between the network's own blocks:
Query → Conv → Gate → Conv → Gate → ... → Conv → FC → Prediction
Each gate decides whether the remaining computation is needed, letting easy queries skip blocks (Skip Blocks).
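A sketch of a gated residual block in this spirit, using PyTorch (the gate design is an illustrative assumption; the actual work evaluates ConvGate and RNNGate variants):

import torch.nn as nn

class GatedBlock(nn.Module):
    """Wraps one residual block with a cheap gate that can skip it."""

    def __init__(self, block, channels):
        super().__init__()
        self.block = block                 # an ordinary residual block
        self.gate = nn.Sequential(         # tiny "is this block needed?" net
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Hard per-query decision, shown for batch size 1; training uses
        # a differentiable relaxation or reinforcement learning instead.
        if self.gate(x).item() < 0.5:
            return x                       # easy input: skip the block
        return self.block(x)               # hard input: run the full block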
Cascading reduces computational cost
[Figure: average network depth with gating]
ResNet110: No Gate 110.0, ConvGate 67.08, RNNGate 67.72 (~40% reduction)
ResNet74: No Gate 74.0, ConvGate 54.16, RNNGate 54.69 (~28% reduction)
ResNet38: No Gate 38.0, ConvGate 35.82, RNNGate 34.31 (~10% reduction)
Similar gains on larger models
[Figure: number of layers skipped per query; easy images skip more, difficult images skip less.]
Future Directions for Cascades
Ø Using reinforcement learning techniques to reduce gating costs
Ø Query triage during load spikes → forcing fractions
Ø Irregular execution:
  Ø complicates batching
  Ø issues for parallel execution
Low Latency Prediction Serving System [NSDI’17]
Prediction Cascades
Simple models for simple tasks
[Work in Progress]
Other AI Systems Projects in RISE
Ø Managing the Machine Learning Lifecycle
Ø Distributed Python for Reinforcement Learning
We are developing new technologies that enable applications to make low-latency, intelligent decisions on live data with strong security guarantees.
Joseph E. Gonzalez jegonzal@cs.berkeley.edu