A journey towards real-life results illustrated by using AI in Twitter's Timelines
Overview
- ML Workflows
- The Timelines Ranking case
- The power of the platform, opportunities
- Future
Deep Learning Workflows
- Pure Research
  ○ Model Exploration
- Applied Research
  ○ Dataset/Feature Exploration
  ○ Model Exploration
- Production
  ○ Feature Addition
  ○ Data Addition
  ○ Training
  ○ Deployment
  ○ A/B test
Deep Learning Workflows
- Pure Research
  ○ Model Exploration + Training → Very flexible modeling framework
- Applied Research
  ○ Dataset/Feature Exploration → Flexible data exploration framework
  ○ Model Exploration + Training → Flexible modeling framework
- Production
  ○ Feature Addition → Scalable data manipulation framework
  ○ Data Addition → Scalable data manipulation framework
  ○ Training → Fast, robust training engine
  ○ Deployment → Seamless and tested ML services
  ○ A/B test → Good A/B test environment
Deep Learning Workflows
[Diagram: the three workflow families -- Pure Research, Applied Research, Production]
PRODUCTION
- Model architecture doesn't matter (anymore)
- Large-scale data manipulation matters
- Fast training matters
- Ease of deployment matters
- Testing matters!!!
  ○ Training vs. online
  ○ Continuous integration
Data First Workflow
Case Study: Timelines Ranking (Blog Post @TwitterEng)
- Sparse features
- A few billion data samples
- Low latency
- Pipeline: candidate generation → heavy model → sort → publish
- Before: decision trees + other sparse techniques
- Probability prediction
Timelines Ranking: New Modules
Sparse Linear Layer
[Diagram: sparse input features V1 … Vn with example ranges -- is_vit {0,1}, has_image {0,1}, engagement_ratio [0,+∞), days_since [0,+∞), bama_word {0,1} -- feeding an output layer through activation F = Sigmoid/ReLU/PReLU/…]
N_j = F(Σ_i W_i,j * norm(V_i) + B_j)
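The point of the layer is that only the weight rows of the features actually present in a sample are touched. As a rough illustration of the equation above, here is a minimal PyTorch sketch; the class name, the padded `(ids, values)` batch layout, and the choice of ReLU are assumptions, not the production implementation:

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    """Sketch of N_j = F(Σ_i W_i,j * norm(V_i) + B_j) over sparse inputs:
    only the weight rows of the active feature ids are read."""

    def __init__(self, num_features: int, output_size: int):
        super().__init__()
        self.weight = nn.Embedding(num_features, output_size)  # W: one row per feature id
        self.bias = nn.Parameter(torch.zeros(output_size))     # B

    def forward(self, ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        # ids, values: (batch, max_active); zero-padded values make padded ids
        # contribute nothing. norm() is assumed to be applied upstream.
        rows = self.weight(ids)                             # (batch, max_active, output_size)
        pre_act = (rows * values.unsqueeze(-1)).sum(dim=1)  # Σ_i W_i,j * norm(V_i)
        return torch.relu(pre_act + self.bias)              # F = ReLU (could be Sigmoid/PReLU)
```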
Sparse Linear Layer: Online Normalization
- Example: input feature value == 1M
  ⇒ weight_gradient == 1M ⇒ update == 1M * learning_rate ⇒ explosion
- Solution: normalization of input values
  norm(V_i) == V_i / max(all_abs_V_i) + b_i
  ○ norm(V_i) belongs to [-1, 1]
  ○ b_i is a trainable per-feature bias, to discriminate absence from presence of a feature
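A minimal sketch of this online normalization, again assuming padded `(ids, values)` batches; the module name and the running-max update are illustrative, not the production code:

```python
import torch
import torch.nn as nn

class OnlineMaxNorm(nn.Module):
    """Sketch: norm(V_i) = V_i / max(all_abs_V_i) + b_i, with the per-feature
    max tracked online during training, and b_i a trainable bias that lets the
    net tell an absent feature (exactly 0) apart from a present one."""

    def __init__(self, num_features: int):
        super().__init__()
        self.register_buffer("running_max", torch.ones(num_features))  # max |V_i| seen so far
        self.bias = nn.Parameter(torch.zeros(num_features))            # b_i

    def forward(self, ids: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
        if self.training:  # update the per-feature running max with this batch
            self.running_max.scatter_reduce_(
                0, ids.reshape(-1), values.detach().abs().reshape(-1), reduce="amax")
        # V_i / max|V_i| lands in [-1, 1]: a huge raw value (e.g. 1M) no longer
        # produces a huge gradient and update
        return values / self.running_max[ids] + self.bias[ids]
```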
Sparse Linear Layer: Speedups (CPU -- i7 3790k)
Forward pass -- ~500 features -- output size == 50

Batch Size | vs PyTorch (1 thread) | vs TensorFlow (1 thread) | vs PyTorch (4 threads) | vs TF* (4 threads)
1          | 2.1x                  | 4.1x                     | 2.8x                   | 5.5x
16         | 1.7x                  | 1.7x                     | 4.6x                   | 4.3x
64         | 1.7x                  | 1.3x                     | 5.1x                   | 3.8x
256        | 1.8x                  | 1.2x                     | 5.6x                   | 3.5x
Sparse Linear Layer: Speedups (GPU -- Tesla M40 -- CUDA 7.5)
Forward pass -- ~500 features -- output size == 50

Batch Size | vs cuSparse
1          | 0.7x
16         | 4.4x
64         | 5.2x
256        | 2x
Split Nets
[Diagram: input features V1 … Vn -- has_image {0,1}, has_link {0,1}, engagement_ratio [0,+∞), days_since [0,+∞), bama_word {0,1} -- routed by type into separate subnets: SPLIT NET 1 (tweet binary features) … SPLIT NET K (engagement features)]
Split Nets
[Diagram: glue all split nets into a unique deep net (N*K neurons)]
Prevent overfitting -- Split by feature type
- Send "dense" features on one side (see the sketch after this list):
  ○ BINARY
  ○ CONTINUOUS
  ○ (SPARSE_CONTINUOUS)
- "Sparse" features on the other side:
  ○ DISCRETE
  ○ STRING
  ○ SPARSE_BINARY
  ○ (SPARSE_CONTINUOUS)
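A minimal sketch of the split-by-feature-type idea: each feature group gets its own small subnet, and their outputs are glued (concatenated) into one net on top. Group names, sizes, and the single-logit head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SplitNets(nn.Module):
    """Sketch: one subnet per feature group, glued by concatenation,
    so dense and sparse feature types do not share first-layer weights."""

    def __init__(self, group_sizes: dict[str, int], hidden: int = 32):
        super().__init__()
        self.splits = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(size, hidden), nn.ReLU())
            for name, size in group_sizes.items()
        })
        self.glue = nn.Linear(hidden * len(group_sizes), 1)  # the unique deep net on top

    def forward(self, groups: dict[str, torch.Tensor]) -> torch.Tensor:
        parts = [self.splits[name](groups[name]) for name in self.splits]
        return self.glue(torch.cat(parts, dim=-1))

# e.g. tweet binary features on one side, engagement features on the other
net = SplitNets({"tweet_binary": 10, "engagement": 4})
```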
Sampling -- Calibration
- Sample the training data according to a target positive ratio P
- The output average probability == P, not the true positive rate ⇒ need calibration
- Use isotonic calibration
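Because negatives are downsampled, the model's average output tracks the sampled ratio P rather than the true engagement rate. A minimal sketch of the fix using scikit-learn's IsotonicRegression (the scores and labels below are toy placeholders; in practice they come from a held-out, unsampled set):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([0.10, 0.35, 0.40, 0.80, 0.90])  # raw model probabilities (toy data)
labels = np.array([0, 1, 0, 1, 1])                 # true engagement outcomes (toy data)

# Fit a monotone mapping from raw score to calibrated probability:
# ranking is preserved, the probability scale is corrected.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)
calibrated = iso.predict(scores)
```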
Feature Discretization
Intuition
- Max normalization is good to avoid explosions, BUT
- per-aggregate-feature min/max ranges are much larger,
- so max normalization generates very small input feature values,
- and the deep net has tremendous trouble learning on such small values
- Std/mean normalization? Better, but still not satisfying
Solution
- Discretization
Discretization
- Example: feature id == 10
- Over the entire dataset, compute equal-sized bins and assign each value a bin_id
- At inference time, for a key/value pair (id, value):
  ○ id → bin_id
  ○ value → 1
- Other possibilities: decision trees, ...
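A minimal sketch of such a discretizer, reading "equal-sized" as equal-frequency (quantile) bins; `num_bins` and the helper names are illustrative:

```python
import numpy as np

def fit_bins(values: np.ndarray, num_bins: int = 20) -> np.ndarray:
    """Compute bin edges over the entire dataset so each bin receives
    roughly the same number of samples."""
    return np.quantile(values, np.linspace(0.0, 1.0, num_bins + 1)[1:-1])

def discretize(value: float, edges: np.ndarray) -> int:
    """Inference: map a raw value to its bin_id; the value itself becomes 1."""
    return int(np.searchsorted(edges, value))

# e.g. a heavy-tailed aggregate feature (feature id == 10)
edges = fit_bins(np.random.exponential(scale=100.0, size=100_000))
bin_id = discretize(3.5, edges)  # the pair (10, 3.5) becomes (bin_id, 1)
```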
Final simplest architecture
1) Discretizer(s)
2) Sparse Layer with online normalization
3) MLP
4) Prediction
5) Isotonic Calibration
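Putting the pieces together, a compact self-contained sketch of steps 2-4 (step 1, the discretizer, runs upstream on the raw features; step 5, isotonic calibration, is fit on held-out data after training). All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class RankingNet(nn.Module):
    """Sketch: sparse layer (with normalization pre-applied) -> MLP -> probability."""

    def __init__(self, num_features: int, width: int = 50, hidden: int = 128):
        super().__init__()
        self.weight = nn.Embedding(num_features, width)  # 2) sparse linear layer
        self.bias = nn.Parameter(torch.zeros(width))
        self.mlp = nn.Sequential(                        # 3) MLP
            nn.Linear(width, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, ids: torch.Tensor, normed_values: torch.Tensor) -> torch.Tensor:
        # ids/normed_values: (batch, max_active), already discretized and normalized
        h = torch.relu((self.weight(ids) * normed_values.unsqueeze(-1)).sum(1) + self.bias)
        return torch.sigmoid(self.mlp(h)).squeeze(-1)    # 4) probability prediction
```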
The power of the platform
- Testing
- Tracking
- Automation
- Robustness
- Standardization
- Speed
- Workflow
- Examples
- Support
- Easy Multimodal (Text + media + sparse + …)
The power of the platform
- How to train all this?
  ○ Train the discretizer
  ○ Train the deep net
  ○ Calibrate the probabilities
  ○ Validate
  ○ ...
- Training loop + ML scheduler → one-liner
- Unique serialization format for params
The power of the platform
- How to deploy all this?
- Tight Twitter infra integration + saved model → one-liner deployment
- Arbitrary number of instances
- All the goodies from Twitter services infra!
- Seamless
The power of the platform
- How to test all this?
- Model offline validation → PredictionSet A
- Model online prediction → PredictionSet B
- PredictionSet A == PredictionSet B ??
- Yes → ready to ship
- Continuous integration
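A minimal sketch of that parity gate (tolerance and names are illustrative): the same inputs, scored through the offline validation path and through the deployed service, must agree before shipping.

```python
import numpy as np

def ready_to_ship(pred_a: np.ndarray, pred_b: np.ndarray, tol: float = 1e-6) -> bool:
    """PredictionSet A (offline validation) must match PredictionSet B
    (the online service replayed on the same inputs) up to numerical noise."""
    return bool(np.allclose(pred_a, pred_b, atol=tol))

offline = np.array([0.12, 0.87, 0.45])  # PredictionSet A (toy data)
online = np.array([0.12, 0.87, 0.45])   # PredictionSet B (toy data)
assert ready_to_ship(offline, online), "offline/online mismatch -- do not ship"
```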
Future
In a single platform:
- Abstract DAG of:
  ○ Services
  ○ Storages
  ○ Datasets
  ○ ...
- Model dependency handling
- Offline/Online feature mapping
- Coverage for all the workflows
- Bundling
- … Cloud?
Future of DL platforms
DAG of services
[Diagram: Models A-D linked through caches and storages, serving Timelines, Recommendations, ...]