SLIDE 1

A journey towards real-life results

… illustrated by using AI in Twitter’s Timelines

SLIDE 2
  • ML Workflows
  • The Timelines Ranking case
  • The power of the platform, opportunities
  • Future

Overview

SLIDE 3
  • Pure Research

Model Exploration

  • Applied Research

Dataset/Feature Exploration

Model Exploration

  • Production

○ Feature Addition
○ Data Addition
○ Training
○ Deployment
○ A/B test

Deep Learning Workflows

SLIDE 4
  • Pure Research

Model Exploration + Training → Very flexible modeling framework

  • Applied Research

Dataset/Feature Exploration → Flexible data exploration framework

Model Exploration + Training → Flexible modeling framework

  • Production

○ Feature Addition → Scalable data manipulation framework
○ Data Addition → Scalable data manipulation framework
○ Training → Fast, robust training engine
○ Deployment → Seamless and tested ML services
○ A/B test → Good A/B test environment

Deep Learning Workflows

SLIDE 5

Deep Learning Workflows

Diagram: the three workflow types (Pure Research, Applied Research, Production).

SLIDE 6

Deep Learning Workflows

PRODUCTION

SLIDE 7
  • Model architecture doesn’t matter (anymore)
  • Large Scale data manipulation matters
  • Fast training matters
  • Ease of deployment matters
  • Testing matters!!!

○ Training vs. online
○ Continuous integration

Data First Workflow

SLIDE 8

Case Study: Timelines Ranking (Blog Post @TwitterEng)

SLIDE 9
  • Sparse features
  • A few billion data samples
  • Low latency
  • Candidate generation → heavy model → sort → publish (flow sketched after this slide)
  • Before: decision trees + other sparse techniques
  • Probability prediction

Timelines Ranking
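As a rough illustration of the candidate generation → heavy model → sort → publish flow above, here is a minimal sketch; `generate_candidates` and `score` are hypothetical stand-ins for the real candidate source and the heavy ranking model.

```python
import numpy as np

# Hypothetical stand-in for the candidate source: n candidate tweets, ~500 features each.
def generate_candidates(user_id, n=1000):
    rng = np.random.default_rng(user_id)
    return rng.random((n, 500))

# Hypothetical stand-in for the heavy model's probability prediction.
def score(features):
    return features.mean(axis=1)

def rank_timeline(user_id, k=200):
    candidates = generate_candidates(user_id)   # candidate generation
    probs = score(candidates)                   # heavy model: probability of engagement
    order = np.argsort(-probs)                  # sort best-first
    return order[:k]                            # publish the top k

print(rank_timeline(user_id=42)[:5])
```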

SLIDE 10

Timelines Ranking: New Modules

SLIDE 11

Sparse Linear Layer

Diagram: input features V1 ... Vn with example types and ranges (is_vit {0,1}, has_image {0,1}, engagement_ratio [0, +infinity), days_since [0, +infinity), bama_word {0,1}) feed a fully connected layer; each output neuron applies an activation F = Sigmoid/ReLU/PReLU/...

Nj = F(∑i Wi,j * norm(Vi) + Bj)

SLIDE 12

Sparse Linear Layer: Online Normalization

  • Example: input feature (value == 1M)

⇒ weight_gradient == 1M
⇒ update == 1M * learning_rate
⇒ explosion

  • Solution: normalization of input values

norm(Vi) == Vi / max(all_abs_Vi) + bi

○ Vi / max(all_abs_Vi) belongs to [-1, 1]
○ bi: trainable per-feature bias, to discriminate absence vs. presence of a feature
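To make the two formulas above concrete, here is a minimal PyTorch sketch of a layer of this kind. The class name, the dense input representation, and the running-max update rule are illustrative assumptions; the real layer consumes sparse key/value features.

```python
import torch
import torch.nn as nn

class SparseLinearWithOnlineNorm(nn.Module):
    """Sketch of Nj = F(sum_i Wi,j * norm(Vi) + Bj) with norm(Vi) = Vi / max(|Vi|) + bi."""
    def __init__(self, n_features, n_outputs):
        super().__init__()
        self.linear = nn.Linear(n_features, n_outputs)              # Wi,j and Bj
        self.feature_bias = nn.Parameter(torch.zeros(n_features))   # trainable per-feature bi
        self.register_buffer("max_abs", torch.ones(n_features))     # running max of |Vi|, not learned

    def forward(self, values):  # values: (batch, n_features), dense here for simplicity
        if self.training:
            # online normalization: keep the largest absolute value seen so far per feature
            self.max_abs = torch.maximum(self.max_abs, values.abs().amax(dim=0).detach())
        # Vi / max(|Vi|) lands in [-1, 1]; bi lets the net tell feature absence (0 + bi) from presence
        normed = values / self.max_abs + self.feature_bias
        return torch.relu(self.linear(normed))                      # F = ReLU (could be Sigmoid/PReLU/...)
```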

SLIDE 13

Sparse Linear Layer: Speedups CPU -- i7 3790k -- forward pass -- ~500 features -- output size == 50

Batch Size | vs PyTorch (1 thread) | vs TensorFlow (1 thread) | vs PyTorch (4 threads) | vs TF* (4 threads)
1          | 2.1x                  | 4.1x                     | 2.8x                   | 5.5x
16         | 1.7x                  | 1.7x                     | 4.6x                   | 4.3x
64         | 1.7x                  | 1.3x                     | 5.1x                   | 3.8x
256        | 1.8x                  | 1.2x                     | 5.6x                   | 3.5x

SLIDE 14

Sparse Linear Layer: Speedups GPU -- Tesla M40 -- CUDA 7.5 -- forward pass -- ~500 features -- output size == 50

Batch Size | vs cuSparse
1          | 0.7x
16         | 4.4x
64         | 5.2x
256        | 2x

SLIDE 15

Split Nets

Diagram: the input features V1 ... Vn (has_image {0,1}, has_link {0,1}, engagement_ratio [0, +infinity), days_since [0, +infinity), bama_word {0,1}) are routed into separate sub-networks by type: SPLIT NET 1 handles the tweet binary features, ..., SPLIT NET K handles the engagement features.

SLIDE 16

Split Nets

Diagram: the outputs of Split Net 1 (tweet binary features) through Split Net K (engagement features), each with N neurons, are then concatenated: GLUE ALL SPLIT NETS!!! The result is a unique deep net layer of N*K neurons.
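A minimal PyTorch sketch of the split-nets idea, assuming one small sub-net per feature group; the layer widths and group sizes are illustrative, and the grouping by feature type follows the next slide.

```python
import torch
import torch.nn as nn

class SplitNets(nn.Module):
    """Each feature group gets its own sub-net; the K outputs are glued into one deep net."""
    def __init__(self, group_sizes, n_per_split=16):
        super().__init__()
        self.splits = nn.ModuleList(
            nn.Sequential(nn.Linear(size, n_per_split), nn.ReLU()) for size in group_sizes
        )
        # unique deep net over the concatenation of all split outputs (N*K units in)
        self.deep = nn.Sequential(nn.Linear(n_per_split * len(group_sizes), 64),
                                  nn.ReLU(), nn.Linear(64, 1))

    def forward(self, groups):  # groups: list of (batch, group_size) tensors, one per split
        glued = torch.cat([net(x) for net, x in zip(self.splits, groups)], dim=1)
        return torch.sigmoid(self.deep(glued))

# e.g. tweet binary features in one split, engagement features in another:
# model = SplitNets(group_sizes=[120, 35]); p = model([binary_feats, engagement_feats])
```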

SLIDE 17

Prevent overfitting -- Split by feature type

  • Send “dense” features on one side
    ○ BINARY
    ○ CONTINUOUS
    ○ (SPARSE_CONTINUOUS)
  • “Sparse” features on the other side
    ○ DISCRETE
    ○ STRING
    ○ SPARSE_BINARY
    ○ (SPARSE_CONTINUOUS)

SLIDE 18

Sampling -- Calibration

  • Sample according to positive ratio P
  • Output average probability == P (the sampled positive ratio), not the real-world rate ⇒ need calibration
  • Use Isotonic Calibration
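A minimal sketch of the calibration step using scikit-learn's isotonic regression; the held-out scores and labels below are made-up numbers, only the fit/predict pattern matters.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw model scores on held-out data (biased by the re-sampling to positive ratio P)
raw_scores = np.array([0.10, 0.35, 0.40, 0.60, 0.80, 0.95])
labels     = np.array([0,    0,    1,    0,    1,    1])

# Fit a monotonic mapping from raw score to calibrated probability
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, labels)

print(calibrator.predict(np.array([0.5, 0.9])))  # calibrated probabilities for new scores
```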
SLIDE 19

Feature Discretization

Intuition

  • Max normalization is good to avoid gradient explosion, BUT
  • The per-feature min/max range of aggregate features is much larger
  • Max normalization therefore generates very small input feature values
  • The deep net has tremendous trouble learning on such small values
  • Mean/std normalization? Better, but still not satisfying

Solution

  • Discretization
SLIDE 20

Discretization

  • Example: feature id == 10
  • → over the entire dataset, compute equal-sized bins and assign each a bin_id (see the sketch after this list)
  • At inference time, for a key/value pair (id, value):

○ id → bin_id
○ value → 1

  • Other possibilities: Decision trees, ...
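A minimal sketch of the discretizer, assuming that "equal sized bins" means equal-frequency (quantile) bins computed per feature id over the whole dataset; the function names, bin count, and synthetic data are illustrative.

```python
import numpy as np

def fit_bins(values, n_bins=10):
    """Offline: compute bin boundaries for one feature id from all its observed values."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])

def discretize(feature_id, value, boundaries):
    """Inference: map a key/value pair (id, value) to (bin_id, 1)."""
    bin_index = int(np.searchsorted(boundaries[feature_id], value))
    return (feature_id, bin_index), 1   # the bin becomes a new sparse binary feature

# e.g. feature id == 10, bins fitted over the entire (here synthetic) dataset
boundaries = {10: fit_bins(np.random.exponential(1000.0, size=100_000))}
print(discretize(10, 750.0, boundaries))   # -> ((10, bin_index), 1)
```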
SLIDE 21

Final simplest architecture

1) Discretizer(s)
2) Sparse Layer with online normalization
3) MLP
4) Prediction
5) Isotonic Calibration
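Read as code, the composition looks roughly like the PyTorch sketch below; the layer widths are assumptions, the discretizer (step 1) runs upstream as feature preprocessing, and the isotonic calibration (step 5) is fitted on held-out predictions afterwards.

```python
import torch
import torch.nn as nn

class FinalArchitecture(nn.Module):
    def __init__(self, n_discretized_features, hidden=50):
        super().__init__()
        self.sparse_layer = nn.Linear(n_discretized_features, hidden)              # 2) sparse layer (online norm omitted here)
        self.mlp = nn.Sequential(nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())  # 3) MLP
        self.head = nn.Linear(hidden, 1)                                            # 4) prediction

    def forward(self, x):
        return torch.sigmoid(self.head(self.mlp(self.sparse_layer(x))))
```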

SLIDE 22

The power of the platform

SLIDE 23
  • Testing
  • Tracking
  • Automation
  • Robustness
  • Standardization
  • Speed
  • Workflow
  • Examples
  • Support
  • Easy Multimodal (Text + media + sparse + …)

The power of the platform

SLIDE 24
  • How to train all this?

○ Train the discretizer
○ Train the deep net
○ Calibrate the probabilities
○ Validate
○ ...

  • Training loop + ML scheduler → one-liner
  • Unique serialization format for params

The power of the platform

SLIDE 25
  • How to deploy all this?
  • Tight Twitter infra integration + saved model → one-liner deployment
  • Arbitrary number of instances
  • All the goodies from Twitter services infra!
  • Seamless

The power of the platform

SLIDE 26
  • How to test all this?
  • Model offline validation → PredictionSet A
  • Model online prediction → PredictionSet B
  • PredictionSet A == PredictionSet B ??
  • Yes → ready to ship
  • Continuous integration (parity check sketched below)

The power of the platform
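A minimal sketch of that offline/online parity check; the file paths and tolerance are assumptions.

```python
import numpy as np

prediction_set_a = np.load("offline_predictions.npy")  # model offline validation -> PredictionSet A
prediction_set_b = np.load("online_predictions.npy")   # same examples through the deployed service -> PredictionSet B

if np.allclose(prediction_set_a, prediction_set_b, atol=1e-6):
    print("PredictionSet A == PredictionSet B: ready to ship")
else:
    diffs = np.abs(prediction_set_a - prediction_set_b)
    print(f"Mismatch: max diff {diffs.max():.3e} on {np.count_nonzero(diffs > 1e-6)} examples")
```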

SLIDE 27

Future

SLIDE 28

In a single platform:

  • Abstract DAG of:

○ Services
○ Storages
○ Datasets
○ ...

  • Model dependency handling
  • Offline/Online feature mapping
  • Coverage for all the workflows
  • Bundling
  • … Cloud?

Future of DL platforms

SLIDE 29

DAG of services

Diagram: a DAG of services, with models (Model A, Model B, Model C, Model D) connected through caches and storages, serving products such as Timelines, Recommendations, ...

SLIDE 30

THANKS!