Predictive Precompute with Recurrent Neural Networks - Hanson Wang (PowerPoint PPT Presentation)




SLIDE 1

Predictive Precompute with Recurrent Neural Networks

Hanson Wang Zehui Wang Yuanyuan Ma MLSys 2020

SLIDE 2

On client: prefetching

  • Improve the latency of user interactions in the Facebook app by precomputing data queries before the interactions occur

On server: cache warmup

  • Improve cache hit-rates in Facebook backend services by precomputing cache values hours in advance

Defining Precompute

SLIDE 3

Defining Precompute: Prefetching

User opens the tab; wait for data to arrive…

SLIDE 4

Defining Precompute: Prefetching

Data gets precomputed at startup time; data is immediately available!

SLIDE 5
  • Naïvely precomputing 100% of the time is too expensive
  • Facebook spends non-trivial % of compute on this
  • Idea: Predict user behavior to avoid wasting resources
  • Classification problem: P(tab access) at session start
  • Apply a threshold on top of the probability to make precompute decisions (can be tuned to product constraints)

Predictive Precompute
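The thresholded decision described above can be sketched in a few lines of Python; the function name and the example threshold value are illustrative, not from the paper:

```python
def should_precompute(p_access: float, threshold: float = 0.5) -> bool:
    """Precompute only when the predicted access probability clears a
    threshold. The threshold is tuned against product constraints:
    raising it saves compute, lowering it improves latency coverage."""
    return p_access >= threshold

# With a high threshold, only sessions the model considers likely
# to access the tab trigger a precompute.
decisions = [should_precompute(p, threshold=0.7) for p in (0.9, 0.3, 0.75)]
```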

SLIDE 6

Formulation as an ML problem

Timeline:

Session 1 (10 mins): Context (C1): hour of day = 9, # notifications = 1, user age = 25, … → access (A1 = 1); access prediction P(A1)
Session 2 (10 mins): Context (C2): hour of day = 11, # notifications = 0, user age = 25, … → no access (A2 = 0); access prediction P(A2)

In general, we want to estimate: P(An | C1, A1, C2, A2, …, Cn)

SLIDE 7

Simple features can be taken from the current context (Ci)

  • Time-based (hour of day, day of week)
  • User-based (age, country)
  • Session-based (notification count)
  • How to incorporate previous contexts and accesses?

Formulation as an ML problem

Features

SLIDE 8

Historical usage features must be “engineered” for traditional models

Historical Features (timeline):

Session 1: A1 = 1; Context (C1): hour of day = 9, # notifications = 1, …
Session 2: A2 = 1; Context (C2): hour of day = 11, # notifications = 1, …
Session 3: A3 = 0; Context (C3): hour of day = 13, # notifications = 0, …

SLIDE 9

Historical usage features must be “engineered” for traditional models

Historical Features (timeline):

Session 1: A1 = 1; Context (C1): hour of day = 9, # notifications = 1, …
Session 2: A2 = 1; Context (C2): hour of day = 11, # notifications = 1, …
Session 3: A3 = 0; Context (C3): hour of day = 13, # notifications = 0, …

Number of accesses in the past 7 days = 1
Access rate in the past 7 days = 50%

SLIDE 10

Historical usage features must be “engineered” for traditional models

Historical Features (timeline):

Session 1: A1 = 1; Context (C1): hour of day = 9, # notifications = 1, …
Session 2: A2 = 1; Context (C2): hour of day = 11, # notifications = 1, …
Session 3: A3 = 0; Context (C3): hour of day = 13, # notifications = 0, …

Number of accesses in the past 14 days with notifications = 2
Access rate in the past 14 days with notifications = 100%

SLIDE 11

Historical features dominate feature importance…

Sample feature importance from a GBDT model (quality drops >15% without access rates):

  • Referrer page
  • User's overall access rate (1 day)
  • User's overall access rate (28 days)
  • Notification count
  • User's access rate with current referrer page (28 days)
  • User's access rate with current notification count (28 days)
  • User's access rate with current notification count and referrer page (28 days)

SLIDE 12

“Recipe” for historical features:

  • Select an aggregation type (count, access rate, time elapsed…)
  • Select a time range (1 day, 7 days, 28 days…)
  • (Optional) Filter on a subset of context attributes (with / without notifications, at the current hour of the day, …)

Combinatorial explosion of features! Aggregation features make inference expensive!
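To see why this recipe explodes, here is one way such aggregation features might be computed; the session schema and data are hypothetical, chosen to mirror the slides' example:

```python
from datetime import datetime, timedelta
from itertools import product

# A session record: (timestamp, accessed, had_notification) -- illustrative schema.
sessions = [
    (datetime(2020, 3, 1, 9), 1, True),
    (datetime(2020, 3, 2, 11), 1, True),
    (datetime(2020, 3, 3, 13), 0, False),
]

def aggregate(sessions, now, days, with_notifications=None):
    """One engineered feature: access count and access rate over a time
    window, optionally filtered on a context attribute."""
    window = [
        (ts, a) for ts, a, notif in sessions
        if now - ts <= timedelta(days=days)
        and (with_notifications is None or notif == with_notifications)
    ]
    count = sum(a for _, a in window)
    rate = count / len(window) if window else 0.0
    return count, rate

# The "recipe" multiplies out: every (time range x filter) pair is a
# separate feature that must be computed at inference time.
now = datetime(2020, 3, 4)
features = {
    (days, filt): aggregate(sessions, now, days, filt)
    for days, filt in product((1, 7, 28), (None, True, False))
}
# 3 windows x 3 filters = 9 features already, before adding more
# aggregation types or context attributes.
```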

Formulation as an ML problem

Features

SLIDE 13

Traditional models

  • Simple baseline: output the lifetime access rate for each user
  • Most basic historical feature, surprisingly effective
  • Logistic Regression, Gradient-boosted Decision Trees
  • Consumes concatenated vector of engineered features

Formulation as an ML problem

Models

SLIDE 14

Alt-text: The pile gets soaked with data and starts to get mushy over time, so it's technically recurrent. — xkcd #1838

SLIDE 15

Model each user's session history as a sequential prediction task. Recurrent neural networks address the problems with historical features:

  • Complex, non-linear interactions between features can be captured through a hidden state “memory” for each user.
  • Hidden state updates are incremental in nature.
  • Storage consumption is bounded by the number of dimensions.

Neural networks to the rescue
SLIDE 16

Recurrent Network Architecture

[Diagram: at each session i, a GRU cell folds the feature vector fi, the true label Ai, and the encoded time delta T(Δti) into the hidden state (h0 → h1 → h2 → h3); an MLP head combines fi, the last available hidden state, and the encoded elapsed time (e.g. T(t3 − t1)) to output P(Ai) at time ti + δ.]

SLIDE 17

Recurrent Network Architecture


Predictions (P(Ai)) are served online; hidden states (hi) are updated asynchronously, session by session.

SLIDE 18

Recurrent Network Architecture


Prediction Layer inputs:

  • h1: last known hidden state
  • f3: feature vector
  • t3: time of prediction
  • T(t3 − t1): time since h1, encoded

SLIDE 19

Recurrent Network Architecture


Hidden Layer inputs:

  • f3: feature vector
  • A3: true label for session 3
  • h2: previous hidden state
  • T(Δt3): time since h2, encoded

SLIDE 20

Recurrent Network Architecture


Model session + update delays (δ)

SLIDE 21

Recurrent Network Architecture


Hidden state updates are decoupled from predictions

SLIDE 22

Recurrent Network Architecture


  • GRU with 128 hidden dims
  • 1-layer fully-connected network (256 neurons)
  • Latent cross [1] is helpful: hi ∘ (1 + Linear(fi))

[1] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. (2018). Latent Cross: Making use of context in recurrent recommender systems.
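A minimal PyTorch sketch of this architecture. The 128-dim GRU, 256-neuron MLP head, and latent cross follow the slide; the feature dimension and the omission of the time-delta encoding T(Δt) are simplifications of mine:

```python
import torch
import torch.nn as nn

class PrecomputeRNN(nn.Module):
    """A GRU cell maintains a per-user hidden state; an MLP head predicts
    P(access) from the current features and the last hidden state,
    combined via a latent cross: h * (1 + Linear(f))."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim + 1, hidden_dim)  # +1 for the label A_i
        self.cross = nn.Linear(feat_dim, hidden_dim)     # latent cross projection
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def predict(self, f, h):
        """Online prediction: P(A_i) from features f and last hidden state h."""
        crossed = h * (1 + self.cross(f))
        return torch.sigmoid(self.head(crossed)).squeeze(-1)

    def update(self, f, a, h):
        """Asynchronous hidden-state update once the true label a is known."""
        return self.gru(torch.cat([f, a.unsqueeze(-1)], dim=-1), h)

model = PrecomputeRNN(feat_dim=8)
h = torch.zeros(1, 128)                 # h0
f = torch.randn(1, 8)                   # session features f1
p = model.predict(f, h)                 # P(A1), a probability in (0, 1)
h = model.update(f, torch.ones(1), h)   # fold in the true label A1 = 1
```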

SLIDE 23
  • 1M user histories over a 30 day period
  • ~60 sessions per user on average, ~10% positive rate
  • Only compute loss on last 21 days
  • All evaluation metrics use last 7 days
  • Training takes ~8 hours on a GPU (PyTorch)
  • Faster with BPPSA?

Training details

SLIDE 24

Facebook company

Results


SLIDE 25

Precision: (true positives) / (predicted positives)

  • What percentage of precomputed results are accessed?
  • Inversely correlated to additional compute cost.

Recall: (true positives) / (total positives)

  • What percentage of accesses used precomputed results?
  • Directly correlated to product latency improvements.
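Both metrics can be computed directly from precompute decisions and access labels; a minimal sketch with made-up example values:

```python
def precision_recall(predictions, labels, threshold):
    """Precision/recall for precompute decisions at a given threshold.

    precision = accessed precomputes / all precomputes (compute cost)
    recall    = accessed precomputes / all accesses    (latency wins)
    """
    decisions = [p >= threshold for p in predictions]
    tp = sum(1 for d, a in zip(decisions, labels) if d and a)
    predicted_pos = sum(decisions)
    total_pos = sum(labels)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / total_pos if total_pos else 0.0
    return precision, recall

# Four sessions: precompute fires for the first three (p >= 0.5);
# two of those precomputed results are actually accessed.
prec, rec = precision_recall([0.9, 0.8, 0.6, 0.2], [1, 0, 1, 1], 0.5)
```

Sweeping the threshold traces out the precision-recall curves shown on the next slides.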

Precision and Recall for Precompute

SLIDE 26

Precision-Recall Curves: FB Mobile Tab

[Plot: precision (y-axis, 10-100%) vs. recall (x-axis, 0-99%) curves for Baseline, Logistic Regression, GBDT, and RNN.]

SLIDE 27

Precision-Recall Curves: FB Mobile Tab


Recall at Precision = 50%

In practice, we typically try to hit a precision target.

SLIDE 28

Numerical comparison: FB Mobile Tab

Model Type                 PR-AUC   R@50%
Baseline                   0.470    0.413
Logistic Regression        0.546    0.596
GBDT                       0.578    0.616
Recurrent Neural Network   0.596    0.642
Improvement                3.11%    4.22%

~3.4% increase in successful prefetches

SLIDE 29

Numerical comparison: Mobile Phone Use [2]

[2] Public benchmark from Pielot, M., Cardoso, B., Katevas, K., Serra, J., Matic, A., and Oliver, N. (2017). Beyond interruptibility: Predicting opportune moments to engage mobile phone users.

Model Type                 PR-AUC   R@50%
Baseline                   0.591    0.811
Logistic Regression        0.683    0.906
GBDT                       0.686    0.917
Recurrent Neural Network   0.767    0.977
Improvement                11.8%    6.54%

SLIDE 30

Online Testing

[Plot: PR-AUC (y-axis, 0.1-0.7) vs. days since experiment start (x-axis, 10-60) for RNN and GBDT.]

Results are stable over long time periods

SLIDE 31

System Architecture

[Diagram: Locally / Globally / Server split (20% / 80%)]

  • 0. Prefetch request
SLIDE 32

System Architecture

[Diagram adds: Key Value Store]

  • 0. Prefetch request
  • 1. Fetch hidden state
SLIDE 33

System Architecture

[Diagram adds: Inference Service]

  • 0. Prefetch request
  • 1. Fetch hidden state
  • 2. Compute prediction
SLIDE 34

System Architecture

[Diagram adds: Logging Service]

  • 0. Prefetch request
  • 1. Fetch hidden state
  • 2. Compute prediction
  • 3. Log features and labels
SLIDE 35

System Architecture

  • 0. Prefetch request
  • 1. Fetch hidden state
  • 2. Compute prediction
  • 3. Log features and labels
  • 4. Compute new hidden states
  • 5. Record new hidden state
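The numbered flow above can be sketched end to end with in-memory stand-ins for the key-value store, inference service, and logging service; all names, the StubModel, and the default hidden-state size are illustrative:

```python
kv_store = {}       # user_id -> hidden state (one small vector per user)
training_log = []   # (user_id, features, label) rows awaiting async updates

def handle_prefetch_request(user_id, features, model, threshold=0.5):
    # 1. Fetch hidden state: a single key-value lookup per prediction.
    h = kv_store.get(user_id, [0.0] * 128)
    # 2. Compute the prediction (inference service).
    p = model.predict(features, h)
    # 0. The prefetch request is served or skipped based on the threshold.
    return p >= threshold

def log_session(user_id, features, label):
    # 3. Log features and the true label once the session outcome is known.
    training_log.append((user_id, features, label))

def async_update(model):
    # 4.-5. Compute and record new hidden states, decoupled from serving.
    for user_id, features, label in training_log:
        h = kv_store.get(user_id, [0.0] * 128)
        kv_store[user_id] = model.update(features, label, h)
    training_log.clear()

class StubModel:
    """Stand-in for the trained RNN from the architecture slides."""
    def predict(self, features, h):
        return 0.9  # pretend the model is confident
    def update(self, features, label, h):
        return [x + label for x in h]

model = StubModel()
will_precompute = handle_prefetch_request("user_1", [1.0, 0.0], model)
log_session("user_1", [1.0, 0.0], 1)
async_update(model)
```

Decoupling steps 4-5 from serving is what keeps the online path to one lookup plus one model call.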
SLIDE 36

Traditional Methods

  • Manually engineered features
  • 10-100s of aggregation feature lookups per prediction
  • Multiple KBs of storage required per user
  • ~0.1 ms model latency

RNN Method

  • Minimal feature engineering
  • 1 key-value lookup per prediction
  • Tunable, small storage cost per user (128 dims ≈ 0.5 KB)
  • ~1 ms model latency

10x overall reduction in compute costs

SLIDE 37

Summary

  • Precompute tasks, like application prefetching and cache warmup, can be modeled well through ML
  • Recurrent neural networks achieve superior modeling performance while reducing feature engineering time
  • RNNs also have surprisingly favorable characteristics when used in large-scale systems

SLIDE 38

Thank you