Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR Prediction


SLIDE 1

RecSys 2019

Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction

SI Ktena, A Tejani, L Theis, P Kumar Myana, D Dilipkumar, F Huszár, S Yoo, W Shi

SLIDE 2

Background

Why continuous training?

SLIDE 4

Background

New campaign IDs + non-stationary features

SLIDE 5

Challenge: Delayed feedback

Fact: users may click ads 1 second, 1 minute, or 1 hour after the impression.

SLIDE 6

Challenge: Delayed feedback

Why is it a challenge?
  • Should we wait? → Delays model training.
  • Should we not wait? → How do we decide the label?

[Figure: trade-off between training delay and model quality]

SLIDE 7

Solution: accept “fake negatives”

Every impression is logged immediately as a negative; if the user later clicks, the same features are logged again as a positive.

  Event               Label   Weight
  (user1, ad1, t1)    imp     1
  (user2, ad1, t2)    imp     1
  (user1, ad1, t3)    click   1

(user1, ad1, t3) has the same features as (user1, ad1, t1).

Assume X clicks out of Y impressions. This works well when CTR is low, where X/Y ≈ X/(X+Y).
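The logging scheme above can be sketched as a tiny stream transform. This is a minimal illustration, not Twitter's actual pipeline; the tuple layout and function name are invented for the sketch.

```python
# Sketch of the "fake negative" logging scheme: every impression is emitted
# immediately with label 0; if a click arrives later, the SAME features are
# emitted again with label 1. Field names are illustrative.

def fake_negative_stream(events):
    """events: iterable of (features, event_type) in time order."""
    for features, event_type in events:
        if event_type == "imp":
            # Log every impression right away as a (possibly fake) negative.
            yield features, 0, 1.0   # (features, label, weight)
        elif event_type == "click":
            # The click re-emits the same features as a true positive.
            yield features, 1, 1.0

events = [
    (("user1", "ad1"), "imp"),
    (("user2", "ad1"), "imp"),
    (("user1", "ad1"), "click"),  # same features as the first impression
]

samples = list(fake_negative_stream(events))
# The first sample is now a "fake negative": same features as the last one,
# but labelled 0 because the click had not arrived yet.
```

The model therefore trains on a biased label distribution, which the later slides correct for.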

SLIDE 17

Background

Delayed feedback models
  • The probability of a click is not constant through time [Chapelle 2014]
  • A second model, similar to survival-time analysis models, captures the delay between impression and click
  • Assume an exponential distribution or another non-parametric distribution for the delay


SLIDE 19

Background

Delayed feedback models

SLIDE 20

Our approach

SLIDE 21

Our approach

Importance sampling
  • p is the actual data distribution
  • b is the biased data distribution
  • Importance weights
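The identity behind this slide is standard importance sampling: an expectation under the actual distribution p can be estimated from samples of the biased distribution b by reweighting each sample with w = p/b. A minimal one-dimensional sketch follows; the toy densities are chosen for illustration and are not the paper's CTR distributions.

```python
# Importance sampling: estimate E_p[f(x)] using samples drawn from a biased
# distribution b, reweighting each sample by w(x) = p(x) / b(x).
import random

random.seed(0)

def p_pdf(x):
    # Target distribution: uniform on [0, 1)
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def b_pdf(x):
    # Biased proposal: uniform on [0, 2]
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def f(x):
    return x * x

n = 200_000
xs = [random.uniform(0.0, 2.0) for _ in range(n)]
estimate = sum(p_pdf(x) / b_pdf(x) * f(x) for x in xs) / n
# E_p[x^2] = 1/3 for x ~ Uniform[0, 1); the weighted estimate converges to it.
```

In the paper's setting, p is the true impression/click distribution and b is what the fake-negative stream produces, so the loss on logged samples is reweighted the same way.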

SLIDE 23

Our approach
  • Continuous training scheme → potentially wait an infinite time for a positive engagement
  • Two models
    ○ Logistic regression
    ○ Wide-and-deep model
  • Four loss functions
    ○ Delayed feedback loss [Chapelle, 2014]
    ○ Positive-unlabeled loss [du Plessis et al., 2015]
    ○ Fake negative weighted
    ○ Fake negative calibration
    (the last two both rely on importance sampling)

SLIDE 24

Loss functions

Delayed feedback loss: assume an exponential distribution for the time delay.

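The loss itself can be written directly from the likelihood of Chapelle's model. This is a sketch of the cited formulation with an exponential delay; in the real model `f` (click probability) and `lam` (delay rate) are network outputs, whereas here they are plain arguments.

```python
import math

def delayed_feedback_nll(f, lam, clicked, t):
    """Negative log-likelihood of one example under the delayed feedback
    model [Chapelle 2014], assuming an exponential click-delay distribution.

    f       : predicted probability of an eventual click, P(C=1|x)
    lam     : predicted rate of the exponential delay distribution
    clicked : whether a click has been observed yet
    t       : delay of the observed click, or elapsed time since impression
    """
    if clicked:
        # P(click with delay t) = f * lam * exp(-lam * t)
        return -(math.log(f) + math.log(lam) - lam * t)
    # No click yet: either it will never come, or it is still pending.
    # P(no click observed by time t) = (1 - f) + f * exp(-lam * t)
    return -math.log((1.0 - f) + f * math.exp(-lam * t))
```

Note how, as the elapsed time t grows with no click, the pending-click term vanishes and the loss increasingly pushes f toward 0.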
SLIDE 26

Loss functions

Fake negative weighted & calibration: don't apply any weights to the training samples; only calibrate the output of the network using the following formulation.

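The formulation referenced above did not survive the transcript, but the calibration step can be reconstructed from the fake-negative scheme (a reconstruction under that assumption, not a verbatim copy of the slide): because every clicked impression appears once as a negative and once as a positive, a model fit on the biased stream learns b(y=1|x) = p / (1 + p), which inverts to p = b / (1 - b).

```python
def biased_rate(p):
    """Click rate the model sees in the fake-negative stream: each clicked
    impression contributes one negative and one positive sample."""
    return p / (1.0 + p)

def fn_calibrate(b):
    """FN calibration: invert the sampling bias to recover the true
    click probability from the biased model output b."""
    return b / (1.0 - b)

# Round trip: calibrating the biased rate recovers the original CTR.
for p in (0.01, 0.05, 0.2):
    assert abs(fn_calibrate(biased_rate(p)) - p) < 1e-12
```

The same p/b ratio, applied as per-sample weights during training instead of as a post-hoc correction, gives the fake negative weighted loss.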

SLIDE 28

Experiments

SLIDE 29

Offline experiments

Criteo data
  ○ Small & public dataset
  ○ Training: 15.5M examples / Testing: 3.5M examples

RCE: normalised version of cross-entropy (higher values are better)
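RCE compares the model's cross-entropy against a constant baseline that always predicts the empirical average CTR. A minimal sketch of that normalisation; the exact clipping and averaging details here are assumptions.

```python
import math

def cross_entropy(labels, preds):
    """Mean binary cross-entropy, with predictions clipped away from 0/1."""
    eps = 1e-12
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, preds)
    ) / len(labels)

def rce(labels, preds):
    """Relative cross entropy: percentage improvement over a baseline that
    always predicts the average label. Higher is better; 0 means no better
    than the baseline, negative means worse."""
    avg = sum(labels) / len(labels)
    baseline = cross_entropy(labels, [avg] * len(labels))
    return 100.0 * (1.0 - cross_entropy(labels, preds) / baseline)

labels = [0, 0, 0, 1]
assert rce(labels, [0.25] * 4) == 0.0           # the baseline itself scores 0
assert rce(labels, [0.1, 0.1, 0.1, 0.7]) > 0.0  # a better model scores higher
```

Normalising against the average-CTR baseline makes scores comparable across datasets with different base rates, which raw cross-entropy is not.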

SLIDE 31

Offline experiments

Twitter data
  ○ Large & proprietary due to user information
  ○ Training: 668M ads with fake negatives / Testing: 7M ads

RCE: normalised version of cross-entropy (higher values are better)

SLIDE 33

Online experiment (A/B test)

Pooled RCE: RCE on the combined traffic generated by the models
RPMq: revenue per thousand requests

SLIDE 34

Conclusions
  • Solve the problem of delayed feedback in continuous training by relying on importance weights
  • FN weighted and FN calibration losses proposed and applied for the first time
  • Offline evaluation on a large proprietary dataset and an online A/B test

SLIDE 35

Future directions
  • Address catastrophic forgetting and overfitting
  • Exploration/exploitation strategies

SLIDE 36

Questions?

https://careers.twitter.com

@s0f1ra