Predictive Precompute with Recurrent Neural Networks - Hanson Wang (PowerPoint PPT Presentation)




SLIDE 1

Predictive Precompute with Recurrent Neural Networks

Hanson Wang Zehui Wang Yuanyuan Ma MLSys 2020

SLIDE 2

On client: prefetching

  • Improve the latency of user interactions in the Facebook app by precomputing data queries before the interactions occur

On server: cache warmup

  • Improve cache hit-rates in Facebook backend services by precomputing cache values hours in advance

Defining Precompute

SLIDE 3

Defining Precompute: Prefetching

User opens the tab; wait for data to arrive…

SLIDE 4

Defining Precompute: Prefetching

Data gets precomputed at startup time; data is immediately available!

SLIDE 5
  • Naïvely precomputing 100% of the time is too expensive
  • Facebook spends non-trivial % of compute on this
  • Idea: Predict user behavior to avoid wasting resources
  • Classification problem: P(tab access) at session start
  • Apply a threshold on top of the probability to make precompute decisions (can be tuned to product constraints)

Predictive Precompute
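The thresholded decision described above can be sketched in a few lines of Python; the function name and the example threshold value are illustrative, not from the paper:

```python
def should_precompute(p_access: float, threshold: float = 0.5) -> bool:
    """Precompute only when the predicted access probability clears a
    threshold. The threshold is tuned against product constraints:
    raising it saves compute, lowering it improves latency coverage."""
    return p_access >= threshold

# With a high threshold, only sessions the model considers likely
# to access the tab trigger a precompute.
decisions = [should_precompute(p, threshold=0.7) for p in (0.9, 0.3, 0.75)]
```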

SLIDE 6

Formulation as an ML problem

Timeline:

Session 1 (10 mins): Context (C1): hour of day = 9, # notifications = 1, user age = 25, … → access (A1 = 1); access prediction P(A1)
Session 2 (10 mins): Context (C2): hour of day = 11, # notifications = 0, user age = 25, … → no access (A2 = 0); access prediction P(A2)

In general, we want to estimate: P(An | C1, A1, C2, A2, …, Cn)

SLIDE 7

Simple features can be taken from the current context (Ci)

  • Time-based (hour of day, day of week)
  • User-based (age, country)
  • Session-based (notification count)
  • How to incorporate previous contexts and accesses?

Formulation as an ML problem

Features

SLIDE 8

Historical usage features must be “engineered” for traditional models

Historical Features (timeline):

Session 1: A1 = 1; Context (C1): hour of day = 9, # notifications = 1, …
Session 2: A2 = 1; Context (C2): hour of day = 11, # notifications = 1, …
Session 3: A3 = 0; Context (C3): hour of day = 13, # notifications = 0, …

SLIDE 9

Historical usage features must be “engineered” for traditional models

Historical Features (timeline):

Session 1: A1 = 1; Context (C1): hour of day = 9, # notifications = 1, …
Session 2: A2 = 1; Context (C2): hour of day = 11, # notifications = 1, …
Session 3: A3 = 0; Context (C3): hour of day = 13, # notifications = 0, …

Number of accesses in the past 7 days = 1
Access rate in the past 7 days = 50%

SLIDE 10

Historical usage features must be “engineered” for traditional models

Historical Features (timeline):

Session 1: A1 = 1; Context (C1): hour of day = 9, # notifications = 1, …
Session 2: A2 = 1; Context (C2): hour of day = 11, # notifications = 1, …
Session 3: A3 = 0; Context (C3): hour of day = 13, # notifications = 0, …

Number of accesses in the past 14 days with notifications = 2
Access rate in the past 14 days with notifications = 100%

SLIDE 11

Historical features dominate feature importance…

Sample feature importance from a GBDT model (quality drops >15% without access rates):

  • Referrer page
  • User's overall access rate (1 day)
  • User's overall access rate (28 days)
  • Notification count
  • User's access rate with current referrer page (28 days)
  • User's access rate with current notification count (28 days)
  • User's access rate with current notification count and referrer page (28 days)

SLIDE 12

“Recipe” for historical features:

  • Select an aggregation type (count, access rate, time elapsed…)
  • Select a time range (1 day, 7 days, 28 days…)
  • (Optional) Filter on a subset of context attributes (with / without notifications, at the current hour of the day, …)

Combinatorial explosion of features! Aggregation features make inference expensive!
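To see why this recipe explodes, here is one way such aggregation features might be computed; the session schema and data are hypothetical, chosen to mirror the slides' example:

```python
from datetime import datetime, timedelta
from itertools import product

# A session record: (timestamp, accessed, had_notification) -- illustrative schema.
sessions = [
    (datetime(2020, 3, 1, 9), 1, True),
    (datetime(2020, 3, 2, 11), 1, True),
    (datetime(2020, 3, 3, 13), 0, False),
]

def aggregate(sessions, now, days, with_notifications=None):
    """One engineered feature: access count and access rate over a time
    window, optionally filtered on a context attribute."""
    window = [
        (ts, a) for ts, a, notif in sessions
        if now - ts <= timedelta(days=days)
        and (with_notifications is None or notif == with_notifications)
    ]
    count = sum(a for _, a in window)
    rate = count / len(window) if window else 0.0
    return count, rate

# The "recipe" multiplies out: every (time range x filter) pair is a
# separate feature that must be computed at inference time.
now = datetime(2020, 3, 4)
features = {
    (days, filt): aggregate(sessions, now, days, filt)
    for days, filt in product((1, 7, 28), (None, True, False))
}
# 3 windows x 3 filters = 9 features already, before adding more
# aggregation types or context attributes.
```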

Formulation as an ML problem

Features

SLIDE 13

Traditional models

  • Simple baseline: output the lifetime access rate for each user
  • Most basic historical feature, surprisingly effective
  • Logistic Regression, Gradient-boosted Decision Trees
  • Consumes concatenated vector of engineered features

Formulation as an ML problem

Models

SLIDE 14

Alt-text: The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent. — xkcd #1838

SLIDE 15

Model each user's session history as a sequential prediction task. Recurrent neural networks address the problems with historical features:

  • Complex, non-linear interactions between features can be captured through a hidden state “memory” for each user.
  • Hidden state updates are incremental in nature.
  • Storage consumption is bounded by the number of dimensions.

Neural networks to the rescue
SLIDE 16

Recurrent Network Architecture

[Diagram: at each session i, a GRU cell folds the feature vector fi, the true label Ai, and the encoded time delta T(Δti) into the hidden state (h0 → h1 → h2 → h3); an MLP head combines fi, the last available hidden state, and the encoded elapsed time (e.g. T(t3 − t1)) to output P(Ai) at time ti + δ.]

SLIDE 17

Recurrent Network Architecture


Predictions (P(Ai)) are served online; hidden states (hi) are updated asynchronously, session by session.

SLIDE 18

Recurrent Network Architecture


Prediction Layer inputs:

  • h1: last known hidden state
  • f3: feature vector
  • t3: time of prediction
  • T(t3 − t1): time since h1, encoded

SLIDE 19

Recurrent Network Architecture


Hidden Layer inputs:

  • f3: feature vector
  • A3: true label for session 3
  • h2: previous hidden state
  • T(Δt3): time since h2, encoded

SLIDE 20

Recurrent Network Architecture


Model session + update delays (δ)

SLIDE 21

Recurrent Network Architecture


Hidden state updates are decoupled from predictions

SLIDE 22

Recurrent Network Architecture


  • GRU with 128 hidden dims
  • 1-layer fully-connected network (256 neurons)
  • Latent cross [1] is helpful: hi ∘ (1 + Linear(fi))

[1] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. (2018). Latent Cross: Making use of context in recurrent recommender systems.
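A minimal PyTorch sketch of this architecture. The 128-dim GRU, 256-neuron MLP head, and latent cross follow the slide; the feature dimension and the omission of the time-delta encoding T(Δt) are simplifications of mine:

```python
import torch
import torch.nn as nn

class PrecomputeRNN(nn.Module):
    """A GRU cell maintains a per-user hidden state; an MLP head predicts
    P(access) from the current features and the last hidden state,
    combined via a latent cross: h * (1 + Linear(f))."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim + 1, hidden_dim)  # +1 for the label A_i
        self.cross = nn.Linear(feat_dim, hidden_dim)     # latent cross projection
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def predict(self, f, h):
        """Online prediction: P(A_i) from features f and last hidden state h."""
        crossed = h * (1 + self.cross(f))
        return torch.sigmoid(self.head(crossed)).squeeze(-1)

    def update(self, f, a, h):
        """Asynchronous hidden-state update once the true label a is known."""
        return self.gru(torch.cat([f, a.unsqueeze(-1)], dim=-1), h)

model = PrecomputeRNN(feat_dim=8)
h = torch.zeros(1, 128)                 # h0
f = torch.randn(1, 8)                   # session features f1
p = model.predict(f, h)                 # P(A1), a probability in (0, 1)
h = model.update(f, torch.ones(1), h)   # fold in the true label A1 = 1
```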

SLIDE 23
  • 1M user histories over a 30 day period
  • ~60 sessions per user on average, ~10% positive rate
  • Only compute loss on last 21 days
  • All evaluation metrics use last 7 days
  • Training takes ~8 hours on a GPU (PyTorch)
  • Faster with BPPSA?

Training details

SLIDE 24

Facebook company

Results


SLIDE 25

Precision: (true positives) / (predicted positives)

  • What percentage of precomputed results are accessed?
  • Inversely correlated to additional compute cost.

Recall: (true positives) / (total positives)

  • What percentage of accesses used precomputed results?
  • Directly correlated to product latency improvements.
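Both metrics can be computed directly from precompute decisions and access labels; a minimal sketch with made-up example values:

```python
def precision_recall(predictions, labels, threshold):
    """Precision/recall for precompute decisions at a given threshold.

    precision = accessed precomputes / all precomputes (compute cost)
    recall    = accessed precomputes / all accesses    (latency wins)
    """
    decisions = [p >= threshold for p in predictions]
    tp = sum(1 for d, a in zip(decisions, labels) if d and a)
    predicted_pos = sum(decisions)
    total_pos = sum(labels)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / total_pos if total_pos else 0.0
    return precision, recall

# Four sessions: precompute fires for the first three (p >= 0.5);
# two of those precomputed results are actually accessed.
prec, rec = precision_recall([0.9, 0.8, 0.6, 0.2], [1, 0, 1, 1], 0.5)
```

Sweeping the threshold traces out the precision-recall curves shown on the next slides.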

Precision and Recall for Precompute

SLIDE 26

Precision-Recall Curves: FB Mobile Tab

[Plot: precision (y-axis, 10-100%) vs. recall (x-axis, 0-99%) curves for Baseline, Logistic Regression, GBDT, and RNN.]

SLIDE 27

Precision-Recall Curves: FB Mobile Tab


Recall at Precision = 50%

In practice, we typically try to hit a precision target.

SLIDE 28

Numerical comparison: FB Mobile Tab

Model Type                 PR-AUC   R@50%
Baseline                   0.470    0.413
Logistic Regression        0.546    0.596
GBDT                       0.578    0.616
Recurrent Neural Network   0.596    0.642
Improvement                3.11%    4.22%

~3.4% increase in successful prefetches

SLIDE 29

Numerical comparison: Mobile Phone Use [2]

[2] Public benchmark from Pielot, M., Cardoso, B., Katevas, K., Serra, J., Matic, A., and Oliver, N. (2017). Beyond interruptibility: Predicting opportune moments to engage mobile phone users.

Model Type                 PR-AUC   R@50%
Baseline                   0.591    0.811
Logistic Regression        0.683    0.906
GBDT                       0.686    0.917
Recurrent Neural Network   0.767    0.977
Improvement                11.8%    6.54%

SLIDE 30

Online Testing

[Plot: PR-AUC (y-axis, 0.1-0.7) vs. days since experiment start (x-axis, 10-60) for RNN and GBDT.]

Results are stable over long time periods

SLIDE 31

System Architecture

[Diagram: Locally / Globally / Server split (20% / 80%)]

  • 0. Prefetch request
SLIDE 32

System Architecture

[Diagram adds: Key Value Store]

  • 0. Prefetch request
  • 1. Fetch hidden state
SLIDE 33

System Architecture

[Diagram adds: Inference Service]

  • 0. Prefetch request
  • 1. Fetch hidden state
  • 2. Compute prediction
SLIDE 34

System Architecture

[Diagram adds: Logging Service]

  • 0. Prefetch request
  • 1. Fetch hidden state
  • 2. Compute prediction
  • 3. Log features and labels
SLIDE 35

System Architecture

  • 0. Prefetch request
  • 1. Fetch hidden state
  • 2. Compute prediction
  • 3. Log features and labels
  • 4. Compute new hidden states
  • 5. Record new hidden state
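The numbered flow above can be sketched end to end with in-memory stand-ins for the key-value store, inference service, and logging service; all names, the StubModel, and the default hidden-state size are illustrative:

```python
kv_store = {}       # user_id -> hidden state (one small vector per user)
training_log = []   # (user_id, features, label) rows awaiting async updates

def handle_prefetch_request(user_id, features, model, threshold=0.5):
    # 1. Fetch hidden state: a single key-value lookup per prediction.
    h = kv_store.get(user_id, [0.0] * 128)
    # 2. Compute the prediction (inference service).
    p = model.predict(features, h)
    # 0. The prefetch request is served or skipped based on the threshold.
    return p >= threshold

def log_session(user_id, features, label):
    # 3. Log features and the true label once the session outcome is known.
    training_log.append((user_id, features, label))

def async_update(model):
    # 4.-5. Compute and record new hidden states, decoupled from serving.
    for user_id, features, label in training_log:
        h = kv_store.get(user_id, [0.0] * 128)
        kv_store[user_id] = model.update(features, label, h)
    training_log.clear()

class StubModel:
    """Stand-in for the trained RNN from the architecture slides."""
    def predict(self, features, h):
        return 0.9  # pretend the model is confident
    def update(self, features, label, h):
        return [x + label for x in h]

model = StubModel()
will_precompute = handle_prefetch_request("user_1", [1.0, 0.0], model)
log_session("user_1", [1.0, 0.0], 1)
async_update(model)
```

Decoupling steps 4-5 from serving is what keeps the online path to one lookup plus one model call.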
SLIDE 36

Traditional Methods

  • Manually engineered features
  • 10-100s of aggregation feature lookups per prediction
  • Multiple KBs of storage required per user
  • ~0.1 ms model latency

RNN Method

  • Minimal feature engineering
  • 1 key-value lookup per prediction
  • Tunable, small storage cost per user (128 dims ≈ 0.5 KB)
  • ~1 ms model latency

10x overall reduction in compute costs

SLIDE 37

Summary

  • Precompute tasks, like application prefetching and cache warmup, can be modeled well through ML
  • Recurrent neural networks achieve superior modeling performance while reducing feature engineering time
  • RNNs also have surprisingly favorable characteristics when used in large-scale systems

SLIDE 38

Thank you