Unifying Twitter Around a Single ML Platform
Yi Zhuang (@yz), Nicholas Leonard (@strife076) April 17, 2019
Overview
- ML Use Cases at Twitter
- ML Platform Requirements & Challenges
- Unifying Twitter Around a Single ML Platform
[Diagram: predicting a "Click" from User, Candidate Ad, and Context]
ML is Everywhere
- Features: PBs of data per day
- Training examples every day: some models train on tens of TBs of data per day
- Predictions every second: tens of millions of predictions per second
- Serving latency: tens of milliseconds
- TensorFlow
- PyTorch
- Scikit-learn
- Vowpal Wabbit (VW)
- Lua Torch
- In-house frameworks
- Models
- Tooling & Resources
- Knowledge
- Tasks: model training & serving, model refreshes, data cleaning and preprocessing, experiment tracking, etc.
A Single Consistent ML Platform Across Twitter
1. Pipeline Orchestration
2. Preprocessing and Featurization
3. Model Training and Evaluation
4. Experimentation Tracking
5. Production Model Serving
- Production ML Engineers
- Deep Learning Researchers
- Data Scientists
- E.g. user data, tweet data, engagement data
- Duplication of effort
- Inconsistent featurization schemes for training vs. serving

Consistency across teams => sharing & efficiency.
Important: feature consistency between training and serving.
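One common way to keep training and serving featurization consistent is to route both paths through a single shared featurization function. A minimal sketch (all names here are illustrative, not Twitter's actual API):

```python
# Illustrative sketch: share one featurization function between the
# training pipeline and the serving path so the two can never diverge.
# All names and features here are hypothetical.

def featurize(user, tweet):
    """Single source of truth for feature computation."""
    return {
        "follower_count_digits": len(str(user["followers"])),  # crude magnitude bucket
        "tweet_has_media": int(bool(tweet.get("media"))),
        "author_followed": int(tweet["author_id"] in user["follows"]),
    }

def build_training_example(user, tweet, clicked):
    # Offline pipeline: attach the label for training.
    return {"features": featurize(user, tweet), "label": int(clicked)}

def serve_features(user, tweet):
    # Online path: same function, no label.
    return featurize(user, tweet)

user = {"followers": 12345, "follows": {42}}
tweet = {"author_id": 42, "media": None}

# Identical inputs produce identical features on both paths.
assert build_training_example(user, tweet, True)["features"] == serve_features(user, tweet)
```

The point of the pattern is that a feature change made for training automatically applies at serving time, eliminating one source of training/serving skew.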
In-house frameworks:
- Rely on feature discretization
- Model learns new data as it becomes available (~15 min delay)

Lua Torch:
- Lua hidden via YAML
- Hard to debug and unit test
- JVM -> JNI -> Lua VMs -> C/C++
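The feature discretization that the in-house frameworks rely on can be sketched as mapping each continuous value to a bin ID, so that a linear online learner can assign a separate weight per bin. The bin edges below are illustrative, not Twitter's:

```python
# Hypothetical sketch of feature discretization: continuous values are
# mapped to bin indices; each bin then acts as its own sparse feature.
import bisect

def discretize(value, edges):
    """Map a continuous value to the index of its bin (0..len(edges))."""
    return bisect.bisect_right(edges, value)

# Illustrative bin edges for a continuous feature.
age_edges = [18, 25, 35, 50, 65]

assert discretize(17, age_edges) == 0
assert discretize(30, age_edges) == 2
assert discretize(70, age_edges) == 5
```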
TensorFlow:
- Export graphs as protobuf
- Serve graphs from Java/Scala:
  - JVM -> TensorFlow
... across different ML frameworks: small differences, large impacts.
Online experiments take time.
Need simple setup, fast iterations.
Different approaches to productionizing training algorithms:
- Retraining frequency varies
- Cron, Aurora, Airflow jobs
- Helps reduce model staleness

Apache Airflow: DAGs
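The scheduled-retraining idea above can be sketched as a job that compares model age against a retraining interval; the interval and names below are illustrative, not any specific Twitter pipeline:

```python
# Illustrative sketch: a periodic job (cron / Aurora / Airflow) decides
# whether to kick off retraining based on how stale the model is.
# The 7-day interval is a made-up example.
from datetime import datetime, timedelta

RETRAIN_AFTER = timedelta(days=7)  # hypothetical retraining frequency

def needs_retrain(model_trained_at, now=None):
    """Return True when the current model is older than the interval."""
    now = now or datetime.now()
    return now - model_trained_at > RETRAIN_AFTER

trained = datetime(2019, 4, 1)
assert needs_retrain(trained, now=datetime(2019, 4, 10))       # stale
assert not needs_retrain(trained, now=datetime(2019, 4, 5))    # still fresh
```

In an orchestrator such as Airflow, this check would simply be one task in a DAG, followed by the training, evaluation, and deployment tasks.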
ML models become stale over time.
Hyperparameter tuning is often tedious.
Status quo:
- Models still running on Lua Torch
- Retrained manually every ~6 months

Goals:
- Migrate Health ML models to the new ML Platform
- Reach metric parity with existing models (minimum)
- Training data: preprocessing, Feature Store, data exploration
- Experiment loop: training, offline evaluation, model tuning
- Production experiment: online A/B testing, prediction servers
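The experiment loop above, reduced to its core, is: train a candidate, evaluate it offline against the baseline, and only promote it to an online A/B test if it wins. A hedged sketch with made-up models and data:

```python
# Illustrative sketch of the offline gate in the experiment loop:
# promote a candidate model to online A/B testing only if it beats
# the production baseline on a held-out set. Everything here is toy data.

def accuracy(model, examples):
    """Fraction of held-out examples the model predicts correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

baseline = lambda x: 0              # current model: always predicts "no click"
candidate = lambda x: int(x > 0.5)  # hypothetical new model

# Held-out (score, label) pairs.
holdout = [(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0), (0.7, 1)]

promote = accuracy(candidate, holdout) > accuracy(baseline, holdout)
assert promote  # candidate wins offline, so it moves on to online A/B testing
```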
- ML engineers vs. DL researchers
- Production vs. exploration
2018 Strategy: Consistency & Adoption
2019 Strategy: Ease of Use & Velocity
- 10x, 50x training speed
- Auto model evaluation & validation
- Auto model deploy & auto scaling
- Auto hyperparameter tuning & architecture search
- Continuous deep learning model training
- and so on...
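Automated hyperparameter tuning, in its simplest form, is random search over a parameter space against an offline objective. A minimal sketch; the objective and search space are invented for illustration:

```python
# Minimal random-search sketch for hyperparameter tuning.
# The objective stands in for "train a model and evaluate it offline";
# its shape and the search space are made up for this example.
import random

def objective(lr, hidden):
    # Toy stand-in for offline evaluation; peaks at lr=0.01, hidden=128.
    return -((lr - 0.01) ** 2) - ((hidden - 128) ** 2) / 1e4

def random_search(trials=50, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),           # log-uniform learning rate
            "hidden": rng.choice([32, 64, 128, 256]),  # hidden layer width
        }
        score = objective(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

score, params = random_search()
```

Real tuners add smarter search (Bayesian optimization, architecture search), but the train/evaluate/keep-best loop is the same.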
If you are interested in learning more about Twitter Cortex, please contact: @yz @strife076