RecSys 2019
Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR Prediction
S. I. Ktena, A. Tejani, L. Theis, P. Kumar Myana, D. Dilipkumar, F. Huszár, S. Yoo, W. Shi
SLIDE 1
SLIDES 2-3
Background: Why continuous training?
SLIDE 4
Background: New campaign IDs + non-stationary features
SLIDE 5
Challenge: Delayed feedback
Fact: users may click ads after 1 second, 1 minute, or 1 hour.
SLIDE 6
Challenge: Delayed feedback
Why is it a challenge?
- Should we wait? → Delays model training.
- Should we not wait? → Then how do we decide the label?
Trade-off: training delay vs. model quality.
SLIDES 7-12
Solution: accept "fake negatives"
Every impression is ingested immediately as a negative example; if the user later clicks, the same features are ingested again as a positive example.

Event              Label   Weight
(user1, ad1, t1)   imp     1
(user2, ad1, t2)   imp     1
(user1, ad1, t3)   click   1

(Events are ordered in time; the impression at t1 and the click at t3 carry the same features.)
Assume X clicks out of Y impressions: the stream then contains X positives among X + Y examples, so the observed click rate is X/(X+Y) rather than the true X/Y. This works well when CTR is low, where X/Y ≈ X/(X+Y); e.g. with X = 5 and Y = 1000, X/Y = 0.005 while X/(X+Y) ≈ 0.00498.
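To make the ingestion order concrete, here is a minimal sketch in plain Python (the helper and its names are hypothetical, not from the talk) of a fake-negative training stream:

    # Minimal sketch of fake-negative ingestion: every impression is emitted
    # immediately as a negative example; a later click re-emits the same
    # features as a positive. Names are illustrative, not from the paper.

    def fake_negative_stream(events):
        """events: (features, event_type) pairs in time order, where
        event_type is 'imp' or 'click'. Yields (features, label, weight)."""
        for features, event_type in events:
            if event_type == 'imp':
                yield features, 0, 1.0   # possibly a fake negative
            else:
                yield features, 1, 1.0   # click: same features, now positive

    # The table above as a stream; the click at t3 repeats the features of
    # the t1 impression, so that example appears once per label.
    events = [({'user': 'user1', 'ad': 'ad1'}, 'imp'),    # t1
              ({'user': 'user2', 'ad': 'ad1'}, 'imp'),    # t2
              ({'user': 'user1', 'ad': 'ad1'}, 'click')]  # t3
    for example in fake_negative_stream(events):
        print(example)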
SLIDES 13-19
Background: Delayed feedback models
- The probability of a click is not constant through time [Chapelle 2014].
- A second model, similar to survival time analysis models, captures the delay between impression and click.
- The delay is assumed to follow an exponential distribution, or alternatively a non-parametric one.
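A sketch of the exponential variant these bullets describe, following Chapelle (2014); y is the eventual click, d the click delay, and e the time elapsed since the impression:

    % Two factors: a click model and an exponential delay model.
    P(y = 1 \mid x) = f_\theta(x)
    P(d \mid y = 1, x) = \lambda(x)\, e^{-\lambda(x)\, d}
    % An example still unlabeled at elapsed time e is either a true negative
    % or a positive whose click simply has not arrived yet:
    P(\text{unlabeled at } e \mid x) = 1 - f_\theta(x) + f_\theta(x)\, e^{-\lambda(x)\, e}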
SLIDE 20
Our approach
SLIDE 21
Our approach: Importance sampling
- p is the actual data distribution.
- b is the biased data distribution that the training samples are drawn from.
- Importance weights p/b correct the training loss for this bias.
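The correction relies on the standard importance sampling identity: an expected loss under the actual distribution p can be estimated from samples of the biased distribution b by reweighting each sample.

    \mathbb{E}_{(x,y) \sim p}\big[\ell(x,y)\big]
      = \mathbb{E}_{(x,y) \sim b}\!\left[\frac{p(x,y)}{b(x,y)}\, \ell(x,y)\right],
    \qquad w(x,y) = \frac{p(x,y)}{b(x,y)} \;\; \text{(importance weight)}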
SLIDES 22-23
Our approach
- Continuous training scheme → we could potentially wait an infinite amount of time for a positive engagement.
- Two models:
  ○ Logistic regression
  ○ Wide-and-deep model
- Four loss functions:
  ○ Delayed feedback loss [Chapelle, 2014]
  ○ Positive-unlabeled loss [du Plessis et al., 2015]
  ○ Fake negative weighted
  ○ Fake negative calibration
The last two losses both rely on importance sampling.
SLIDES 24-25
Loss functions: Delayed feedback loss
Assume an exponential distribution for the time delay.
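A reconstruction of this loss, following Chapelle (2014), with f_\theta(x) the predicted click probability, \lambda(x) the exponential rate, d an observed click delay, and e the time elapsed since an unclicked impression; the negative log-likelihood contributions are:

    % Clicked examples (delay d observed):
    -\log\big(f_\theta(x)\, \lambda(x)\, e^{-\lambda(x)\, d}\big)
      = -\log f_\theta(x) - \log \lambda(x) + \lambda(x)\, d
    % Examples with no click yet (elapsed time e):
    -\log\big(1 - f_\theta(x) + f_\theta(x)\, e^{-\lambda(x)\, e}\big)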
SLIDES 26-27
Loss functions: Fake negative weighted & calibration
Fake negative calibration applies no weights to the training samples and only calibrates the output of the network, using the following formulation.
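A reconstruction of that formulation, derived from the fake-negative scheme above: each example enters the stream once as a negative and, with probability p(y = 1 | x), a second time as a positive, so the biased distribution b that the network fits satisfies

    b(y = 1 \mid x) = \frac{p(y = 1 \mid x)}{1 + p(y = 1 \mid x)}
    \quad\Longrightarrow\quad
    p(y = 1 \mid x) = \frac{b(y = 1 \mid x)}{1 - b(y = 1 \mid x)}

Fake negative calibration applies the inversion on the right to the network output at serving time. Fake negative weighted instead corrects the training loss with the importance weights w(x, y) = p(y | x) / b(y | x) implied by the same relation, estimated from the model's own prediction. A minimal pure-Python sketch of the per-example weighted log loss (variable names are ours; in practice no gradient is propagated through the weights):

    import math

    def fn_weighted_log_loss(p_hat, label):
        """p_hat: predicted click probability p(y=1|x);
        label: 1 for re-emitted positives, 0 for (possibly fake) negatives."""
        if label == 1:
            w = 1.0 + p_hat                  # w(x,1) = p(1|x) / b(1|x) = 1 + p
            return -w * math.log(p_hat)
        w = (1.0 - p_hat) * (1.0 + p_hat)    # w(x,0) = p(0|x) / b(0|x)
        return -w * math.log(1.0 - p_hat)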
SLIDE 28
Experiments
SLIDES 29-30
Offline experiments: Criteo data
○ Small, public dataset
○ Training: 15.5M examples / Testing: 3.5M examples
RCE: normalised version of cross-entropy (higher values are better).
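One common formulation of RCE, given as a sketch of the definition rather than code from the talk: the model's cross-entropy measured relative to a naive baseline that always predicts the average CTR of the data.

    import math

    def rce(labels, predictions):
        """Relative cross-entropy in percent: 0 means no better than always
        predicting the average CTR; higher is better, 100 is perfect."""
        def cross_entropy(preds):
            return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                        for y, p in zip(labels, preds)) / len(labels)
        ctr = sum(labels) / len(labels)   # the naive baseline's prediction
        return 100.0 * (1.0 - cross_entropy(predictions)
                        / cross_entropy([ctr] * len(labels)))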
SLIDES 31-32
Offline experiments: Twitter data
○ Large and proprietary, as it contains user information
○ Training: 668M ads with fake negatives (FN) / Testing: 7M ads
SLIDE 33
Online experiment (A/B test)
- Pooled RCE: RCE on the combined traffic generated by the models
- RPMq: revenue per thousand requests
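Spelled out, assuming the metric name means exactly what it says:

    \text{RPMq} = 1000 \times \frac{\text{total ad revenue}}{\text{number of ad requests}}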
SLIDE 34
Conclusions
- Solved the problem of delayed feedback in continuous training by relying on importance weights.
- FN weighted and FN calibration losses were proposed and applied for the first time.
- Offline evaluation on a large proprietary dataset, plus an online A/B test.
SLIDE 35
Future directions
- Address catastrophic forgetting and overfitting
- Exploration / exploitation strategies
SLIDE 36