
slide-1
SLIDE 1

Response prediction using collaborative filtering with hierarchies and side-information

Aditya Krishna Menon1 Krishna-Prasad Chitrapura2 Sachin Garg2 Deepak Agarwal3 Nagaraj Kota2

1UC San Diego 2Yahoo! Labs Bangalore 3Yahoo! Research Santa Clara

KDD ’11, August 22, 2011

1 / 36

slide-2
SLIDE 2

Outline

1. Background: response prediction
2. A latent feature approach to response prediction
3. Combining latent and explicit features
4. Exploiting hierarchical information
5. Experimental results

slide-3
SLIDE 3

The response prediction problem

Basic workflow in computational advertising: the content publisher (e.g. Yahoo!) receives bids from advertisers, i.e. the amount paid on some action (ad clicked, conversion, ...)

slide-4
SLIDE 4

The response prediction problem

Basic workflow in computational advertising: compute the expected revenue using the clickthrough rate (CTR), assuming a pay-per-click model

slide-5
SLIDE 5

The response prediction problem

Basic workflow in computational advertising: ads are sorted by expected revenue and the best ad is chosen. Response prediction: estimate the CTR for each candidate ad
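The selection step above can be sketched in a few lines; the names and numbers here are illustrative, not from the talk.

```python
# Hypothetical sketch of the ad-selection step: under a pay-per-click
# model, expected revenue = bid * estimated CTR, and ads are ranked by it.
def rank_ads(candidates):
    """candidates: list of (ad_id, bid, estimated_ctr) tuples."""
    return sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)

ads = [("ad_a", 0.50, 0.02), ("ad_b", 0.10, 0.15), ("ad_c", 1.00, 0.005)]
best_ad = rank_ads(ads)[0]  # ad_b wins: 0.10 * 0.15 = 0.015 per display
```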

slide-6
SLIDE 6

Approaches to estimating the CTR

Maximum likelihood estimate (MLE) is straightforward:

    Pr[Click | Display; (Page, Ad)] = (# of clicks in historical data) / (# of displays in historical data)

◮ Few displays → too noisy; never displayed → undefined
◮ Can apply statistical smoothing [Agarwal et al., 2009]

Logistic regression on page and ad features [Richardson et al., 2007]
LMMH [Agarwal et al., 2010], a log-linear model with hierarchical corrections, is state-of-the-art
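As a concrete illustration of the smoothing bullet, a Beta-prior estimate fixes both failure modes of the MLE (noisy for few displays, undefined for none). The prior values here are assumptions for illustration, not the scheme of Agarwal et al. [2009].

```python
def smoothed_ctr(clicks, displays, prior_clicks=1.0, prior_displays=51.0):
    # Beta-prior smoothing: pseudo-counts shrink the MLE toward the
    # prior mean (1/51 here), and the estimate is defined for 0 displays.
    return (clicks + prior_clicks) / (displays + prior_displays)

smoothed_ctr(0, 0)       # prior mean ~0.0196 instead of undefined 0/0
smoothed_ctr(2, 10)      # ~0.049, pulled strongly toward the prior
smoothed_ctr(200, 1000)  # ~0.191, close to the MLE 0.2
```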

slide-7
SLIDE 7

This work

We take a collaborative filtering approach to response prediction

◮ “Recommending” ads to pages based on past history
◮ Learns latent features for pages and ads

Key ingredient is exploiting hierarchical structure

◮ Ties together pages and ads in latent space
◮ Overcomes extreme sparsity of datasets

Experimental results demonstrate state-of-the-art performance

slide-8
SLIDE 8

Outline

1. Background: response prediction
2. A latent feature approach to response prediction
3. Combining latent and explicit features
4. Exploiting hierarchical information
5. Experimental results

slide-9
SLIDE 9

Response prediction as matrix completion

Response prediction has a natural interpretation as matrix completion:

[Matrix of historical CTRs of ads on pages, with entries like 0.5, 1.0, 0.25 and “?” marking missing cells]

◮ Cells are historical CTRs of ads on pages; many cells “missing”
◮ Wish to fill in missing entries, but also smooth existing ones

slide-10
SLIDE 10

Connection to movie recommendation

This is reminiscent of the movie recommendation problem:

[Matrix of user–movie ratings, with “?” marking missing cells]

◮ Cells are ratings of movies by users; many cells “missing”
◮ Very active research area following the Netflix prize

slide-11
SLIDE 11

Recommending movies with latent features

A popular approach is to learn latent features from the data:

◮ User i represented by α_i ∈ R^k, movie j by β_j ∈ R^k
◮ Ratings modelled as (user, movie) affinity in this latent space

For a matrix X with observed cells O, we optimize

    min_{α,β} Σ_{(i,j)∈O} ℓ(X_ij, α_i^T β_j) + Ω(α, β)

◮ Loss ℓ = square-loss, hinge-loss, ...
◮ Regularizer Ω = ℓ2 penalization typically

slide-12
SLIDE 12

Why try latent features for response prediction?

State-of-the-art method for movie recommendation

◮ Reason to think it can be successful for response prediction also

Data is allowed to “speak for itself”

◮ Historical information mined to determine influential factors

Flexible, with analogues to supervised learning

◮ Easy to incorporate explicit features, domain knowledge

slide-13
SLIDE 13

Response prediction via latent features - I

Modelling raw CTR matrix with latent features is not sensible

◮ Ignores the confidence in the individual cells

Instead, split each cell into # of displays and # of clicks:

[Matrix: each CTR cell expanded into its click and non-click events]

◮ Click = +ve example, non-click = −ve example
◮ Now focus on modelling entries in each cell


slide-15
SLIDE 15

Response prediction via latent features - II

Important to learn meaningful probabilities

◮ Discrimination of click versus not-click is insufficient

For page p and ad a, we may use a sigmoidal model for the individual CTRs:

    P̂_pa = Pr[Click | Display; (p, a)] = exp(α_p^T β_a) / (1 + exp(α_p^T β_a))

◮ α_p, β_a ∈ R^k are the latent feature vectors for pages and ads
◮ Corresponds to a logistic loss function [Agarwal and Chen, 2009, Menon and Elkan, 2010, Yang et al., 2011]

slide-16
SLIDE 16

Confidence weighted objective

We use the sigmoidal model on each cell entry

◮ Treats them as independent training examples

Now maximize the conditional log-likelihood:

    min_{α,β} − Σ_{(p,a)∈O} [ C_pa log P̂_pa(α, β) + (D_pa − C_pa) log(1 − P̂_pa(α, β)) ] + (λ_α/2) ||α||_F^2 + (λ_β/2) ||β||_F^2

where C = # of clicks, D = # of displays

◮ Terms in the objective are confidence weighted
◮ Estimates will be meaningful probabilities
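The per-cell term of this objective is easy to write down directly; a minimal sketch, with an assumed scalar score s = α_p · β_a:

```python
import math

def cell_nll(clicks, displays, score):
    # Confidence-weighted logistic loss for one (page, ad) cell:
    # clicks * -log(p) + (displays - clicks) * -log(1 - p), p = sigmoid(score).
    p = 1.0 / (1.0 + math.exp(-score))
    return -(clicks * math.log(p) + (displays - clicks) * math.log(1.0 - p))

# A well-calibrated score (sigmoid(-3) ~ 0.047 ~ 50/1000) costs far less
# than predicting 0.5 for the same 5% CTR cell; more displays = more weight.
cell_nll(50, 1000, -3.0)
cell_nll(50, 1000, 0.0)
cell_nll(5, 100, 0.0)  # same CTR, 10x fewer displays: 10x smaller contribution
```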

slide-17
SLIDE 17

Outline

1. Background: response prediction
2. A latent feature approach to response prediction
3. Combining latent and explicit features
4. Exploiting hierarchical information
5. Experimental results

slide-18
SLIDE 18

Incorporating explicit features

We’d like latent features to complement, rather than replace, explicit features

◮ For response prediction, explicit features are quite predictive
◮ Makes sense to use this information

Incorporate features s_pa ∈ R^d for the (page, ad) pair (p, a) via

    P̂_pa = σ(w^T s_pa + α_p^T β_a) = σ([w; 1]^T [s_pa; α_p^T β_a])

Alternating optimization of (α, β) and w works well

◮ Predictions from the factorization → additional features into the logistic regression
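A sketch of the combined predictor: a logistic regression on the explicit features, with the latent affinity entering as one extra feature with fixed weight 1. All values here are made up for illustration.

```python
import numpy as np

def predict_ctr(w, s_pa, alpha_p, beta_a):
    # Explicit features enter through w; the latent affinity enters as a
    # single extra "feature" alpha_p . beta_a with an implicit weight of 1.
    score = w @ s_pa + alpha_p @ beta_a
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([0.2, -0.1])        # learned weights on explicit features
s_pa = np.array([1.0, 3.0])      # explicit features for this (page, ad)
alpha_p = np.array([0.5, 0.5])   # latent page vector
beta_a = np.array([0.4, -0.2])   # latent ad vector
p_hat = predict_ctr(w, s_pa, alpha_p, beta_a)  # scores cancel here: 0.5
```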

slide-19
SLIDE 19

An issue of confidence

Rewrite the objective as

    min_{α,β,w} − Σ_{(p,a)∈O} D_pa [ M_pa log P̂_pa(α, β, w) + (1 − M_pa) log(1 − P̂_pa(α, β, w)) ] + (λ_α/2) ||α||_F^2 + (λ_β/2) ||β||_F^2 + (λ_w/2) ||w||_2^2

where M_pa := C_pa/D_pa is the MLE for the CTR

Issue: M_pa is noisy → confidence weighting is inaccurate

◮ Ideally want to use the true probability P_pa itself

slide-20
SLIDE 20

An iterative heuristic

After learning the model, replace M_pa with the model prediction, and re-learn with the new confidence weighting

◮ Can iterate until convergence

Can be used as part of the latent/explicit feature interplay:

[Diagram: confidence-weighted factorization and logistic regression in a loop, exchanging additional input features and updated confidences]
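The loop can be sketched as follows; `fit_model` and `predict` are stand-ins for the factorization and prediction steps, and the fixed iteration count is an assumption.

```python
def iterative_reweighting(cells, fit_model, predict, n_iters=5):
    """cells: dict (page, ad) -> (clicks, displays).

    Start with the noisy MLE M = C/D as the confidence target, then
    repeatedly refit and substitute model predictions for M.
    """
    M = {pa: c / d for pa, (c, d) in cells.items()}
    model = None
    for _ in range(n_iters):
        model = fit_model(cells, M)                   # weights depend on current M
        M = {pa: predict(model, pa) for pa in cells}  # smoother targets for next fit
    return model
```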

slide-21
SLIDE 21

Outline

1. Background: response prediction
2. A latent feature approach to response prediction
3. Combining latent and explicit features
4. Exploiting hierarchical information
5. Experimental results

slide-22
SLIDE 22

Hierarchical structure to response prediction data

Webpages and ads may be arranged into hierarchies:

[Diagram: ad-side hierarchy, Root → Advertiser 1 ... Advertiser a → Campaign 1 ... Campaign c → Ad 1 ... Ad n]

Hierarchy encodes correlations in CTRs

◮ e.g. Two ads by the same advertiser → similar CTRs
◮ Highly structured form of side-information

Successfully used in previous work [Agarwal et al., 2010]

◮ How to exploit this information in our model?

slide-23
SLIDE 23

Using hierarchies: big picture

Intuition: “similar” webpages/ads should have similar latent vectors
Each node in the hierarchy is given its own latent vector

[Diagram: the same hierarchy with a latent vector at every node, β_1, ..., β_n at the ad leaves, β_{n+1}, ..., β_{n+c} at the campaigns, β_{n+c+1}, ..., β_{n+c+a} at the advertisers, and β_Root at the root]

◮ We will tie parameters based on links in the hierarchy
◮ Achieved in three simple steps

slide-24
SLIDE 24

Principle 1: Hierarchical regularization

Each node’s latent vector should equal its parent’s, in expectation:

    α_p ∼ N(α_{Parent(p)}, σ^2 I)

With a MAP estimate of the parameters, this corresponds to the regularizer

    Ω(α) = Σ_{p,p′} S_{pp′} ||α_p − α_{p′}||_2^2

where S_{pp′} is a parent-indicator matrix

◮ Latent vectors constrained to be similar to parents
◮ Induces correlation amongst siblings in the hierarchy
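A minimal sketch of this regularizer; the node ids and vectors are invented for illustration.

```python
import numpy as np

def hier_penalty(alpha, parent):
    # Sum of squared distances between each node's latent vector and its
    # parent's: the MAP counterpart of alpha_p ~ N(alpha_Parent(p), s^2 I).
    return sum(np.sum((alpha[p] - alpha[q]) ** 2) for p, q in parent.items())

alpha = np.array([[1.0, 0.0],    # node 0: root
                  [1.0, 0.1],    # node 1: close to its parent 0, cheap
                  [0.0, 2.0]])   # node 2: far from its parent 0, costly
penalty = hier_penalty(alpha, parent={1: 0, 2: 0})  # 0.01 + 5.0 = 5.01
```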

slide-25
SLIDE 25

Principle 2: Agglomerative fitting

Can create meaningful priors by making parent nodes’ vectors predictive of data:

◮ Associate with each node clicks/views that are the sums of its children’s clicks/views
◮ Then consider an augmented matrix of all publisher and ad nodes, with the appropriate clicks and views
slide-26
SLIDE 26

Principle 2: Agglomerative fitting

We treat the aggregated data as just another response prediction dataset

◮ Learn latent features for parent nodes on this data
◮ Estimates will be more reliable than those of children

Once estimated, these vectors serve as a prior in the hierarchical regularizer

◮ Children’s vectors are shrunk towards the “agglomerated vector”
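The aggregation step itself is simple; a sketch with invented counts:

```python
def aggregate_children(children_stats):
    """children_stats: list of (clicks, displays) for one parent's children."""
    clicks = sum(c for c, _ in children_stats)
    displays = sum(d for _, d in children_stats)
    return clicks, displays

# Three sparse sibling ads yield one denser cell at their campaign node,
# whose latent vector can therefore be estimated more reliably.
campaign_cell = aggregate_children([(1, 40), (0, 25), (2, 35)])  # (3, 100)
```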

slide-27
SLIDE 27

Principle 3: Residual fitting

Augment the prediction to include bias terms for the nodes along the path from root to leaf:

    P̂_pa = σ(α_p^T β_a + α_p^T β_{Parent(a)} + α_{Parent(p)}^T β_{Parent(a)} + ...)

◮ Treats the hierarchy as a series of categorical features

Can be viewed as a decomposition of the latent vectors:

    α̃_p = Σ_{u∈Path(p)} α_u        β̃_a = Σ_{v∈Path(a)} β_v
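The path-sum decomposition can be sketched directly; the node names and vectors are placeholders.

```python
import numpy as np

def path_sum(vectors, path):
    """Effective latent vector for a leaf: sum of vectors on its root-to-leaf path."""
    return np.sum([vectors[u] for u in path], axis=0)

# Illustrative three-level ad-side path (root -> advertiser -> ad).
vecs = {"root": np.array([0.1, 0.0]),
        "advertiser": np.array([0.2, 0.1]),
        "ad": np.array([0.0, 0.3])}
beta_tilde = path_sum(vecs, ["root", "advertiser", "ad"])  # [0.3, 0.4]
```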

slide-28
SLIDE 28

The final model

Our final model has the following ingredients:

◮ Confidence weighting of the objective
◮ Logistic loss to estimate meaningful probabilities
◮ Incorporation of explicit features
  ⋆ Iterative heuristic for improving confidence weighting
◮ Tying together of latent features via the hierarchy

Optimization can be done in an alternating manner

◮ Fix α and optimize for β, and vice-versa
◮ Optimization for each β_j can be done in parallel
  ⋆ Individual optimization via stochastic gradient descent

slide-29
SLIDE 29

Outline

1. Background: response prediction
2. A latent feature approach to response prediction
3. Combining latent and explicit features
4. Exploiting hierarchical information
5. Experimental results

slide-30
SLIDE 30

Experimental setup

We compare the latent feature approach to three methods:

1. Generalized linear model (GLM) on explicit features
2. Logistic regression with cross-features [Richardson et al., 2007]
3. Hierarchical log-linear model (LMMH) [Agarwal et al., 2010]

Comparison is on three Yahoo! ad datasets:

1. Click: (90B, 3B) (train, test) pairs
2. Post-view conversions (PVC): (7B, 250M) (train, test) pairs
3. Post-click conversions (PCC): (500M, 20M) (train, test) pairs

Report % improvement in Bernoulli log-likelihood over GLM

◮ Measure of the quality of probabilities

slide-31
SLIDE 31

Results on Click

Learning predictive latent features is challenging due to sparsity

◮ Using biases from the hierarchy improves performance significantly

With hierarchical tying, our model outperforms existing methods
With explicit features, it has a clear lift over LMMH

◮ Value in combining the complementary information in the two

[Bar chart: % log-likelihood lift over GLM on Click, in increasing order: CWFact 4.1%, LogReg 14.54%, Residual 14.85%, LogReg+Hash 16.17%, Agglomerative 17.23%, LMMH 18.64%, Hybrid 18.91%, CWFact+LogReg 22.37%, Hybrid+LogReg 24.01%, Hybrid+LogReg++ 26.47%]

CWFact = confidence-weighted factorization
Hybrid = CWFact + all hierarchical components
Hybrid+LogReg = with explicit features
Hybrid+LogReg++ = with iterative heuristic

slide-32
SLIDE 32

Results on PVC and PCC

Our combined model gives the best results on these datasets also
Explicit features are again important for the best performance

◮ Latent features alone are only competitive with LMMH

On PCC, the iterative heuristic helps outperform LMMH

◮ Reliability of the confidence weighting is important

[Bar charts: % log-likelihood lift over GLM on PVC (up to 48.4%) and PCC (up to 24.8%) for the same methods as on Click; Hybrid+LogReg++ performs best on both]

slide-33
SLIDE 33

Value of iterative confidence reweighting

The trick of iteratively recomputing the confidence weighting from model predictions gives a useful performance boost

◮ Generally, log-likelihood improves after each such iteration

[Plot: log-likelihood lift vs. iteration number, rising from ~24% to ~27% over 10 iterations]

slide-34
SLIDE 34

Latent and explicit features

Ideally, predictions should be ≈ the MLE when the # of displays is large
With latent features, the model converges to the MLE faster

◮ Variance of the logistic regression model, which uses explicit features only, is significantly reduced

[Plots of log(CTR / Prediction) vs. training-set views against the optimal line: (a) explicit features only (LogReg); (b) latent + explicit features (Hybrid+LogReg++)]

slide-35
SLIDE 35

Conclusions

Response prediction can be approached from a collaborative filtering perspective
Learning latent features for pages and ads gives state-of-the-art performance
Some adaptation is required for success in this domain

◮ Had to use a confidence-weighting scheme
  ⋆ Iteratively refined the confidences
◮ Incorporating explicit features gives an important boost to lifts
◮ Hierarchical information helps overcome data sparsity

slide-36
SLIDE 36

References I

Agarwal, D., Agrawal, R., Khanna, R., and Kota, N. (2010). Estimating rates of rare events with multiple hierarchies through scalable log-linear models. In KDD ’10, pages 213–222, New York, NY, USA. ACM.

Agarwal, D. and Chen, B.-C. (2009). Regression-based latent factor models. In KDD ’09, pages 19–28, New York, NY, USA. ACM.

Agarwal, D., Chen, B.-C., and Elango, P. (2009). Spatio-temporal models for estimating click-through rate. In WWW ’09, pages 21–30, New York, NY, USA. ACM.

Menon, A. K. and Elkan, C. (2010). A log-linear model with latent features for dyadic prediction. In ICDM ’10.

Richardson, M., Dominowska, E., and Ragno, R. (2007). Predicting clicks: estimating the click-through rate for new ads. In WWW ’07, pages 521–530, New York, NY, USA. ACM.

Yang, S., Long, B., Smola, A., Sadagopan, N., Zheng, Z., and Zha, H. (2011). Like like alike: joint friendship and interest propagation in social networks. In WWW ’11.