Response prediction using collaborative filtering with hierarchies - PowerPoint PPT Presentation

Response prediction using collaborative filtering with hierarchies and side-information Aditya Krishna Menon 1 Krishna-Prasad Chitrapura 2 Sachin Garg 2 Deepak Agarwal 3 Nagaraj Kota 2 1 UC San Diego 2 Yahoo! Labs Bangalore 3 Yahoo! Research Santa Clara KDD ’11, August 22, 2011 1 / 36

Outline Background: response prediction 1 A latent feature approach to response prediction 2 Combining latent and explicit features 3 Exploiting hierarchical information 4 Experimental results 5 2 / 36

The response prediction problem Basic workflow in computational advertising: Content publisher (e.g. Yahoo!) receives bids from advertisers: Amount paid on some action e.g. ad is clicked, conversion, ... 3 / 36

The response prediction problem Basic workflow in computational advertising: Compute expected revenue using clickthrough rate (CTR): Assuming pay-per-click model 4 / 36

The response prediction problem Basic workflow in computational advertising: Ads are sort by expected revenue, best ad is chosen Response prediction : Estimate the CTR for each candidate ad 5 / 36

Approaches to estimating the CTR Maximum likelihood estimate (MLE) is straightforward: # of clicks in historical data � Pr [ Click | Display ; ( Page , Ad )] = # of displays in historical data ◮ Few displays → too noisy, not displayed → undefined ◮ Can apply statistical smoothing [Agarwal et al., 2009] Logistic regression on page and ad features [Richardson et al., 2007] LMMH [Agarwal et al., 2010], a log-linear model with hierarchical corrections, is state-of-the-art 6 / 36

This work We take a collaborative filtering approach to response prediction ◮ “Recommending” ads to pages based on past history ◮ Learns latent features for pages and ads Key ingredient is exploiting hierarchical structure ◮ Ties together pages and ads in latent space ◮ Overcomes extreme sparsity of datasets Experimental results demonstrate state-of-the-art performance 7 / 36

Response prediction as matrix completion Response prediction has a natural interpretation as matrix completion: 0.5 1.0 ? 0.5 ? 0.25 0.0 1.0 1.0 ◮ Cells are historical CTRs of ads on pages; many cells “missing” ◮ Wish to fill in missing entries, but also smoothen existing ones 9 / 36

Connection to movie recommendation This is reminiscent of the movie recommendation problem: ? ? ◮ Cells are ratings of movies by users; many cells “missing” ◮ Very active research area following Netflix prize 10 / 36

Recommending movies with latent features A popular approach is to learn latent features from the data: ◮ User i represented by α i ∈ R k , movie j by β j ∈ R k ◮ Ratings modelled as (user, movie) affinity in this latent space For a matrix X with observed cells O , we optimize � ℓ ( X ij , α T min i β j ) + Ω( α, β ) . α,β ( i,j ) ∈O ◮ Loss ℓ = square-loss, hinge-loss, ... ◮ Regularizer Ω = ℓ 2 penalization typically 11 / 36

Why try latent features for response prediction? State-of-the-art method for movie recommendation ◮ Reason to think it can be successful for response prediction also Data is allowed to “speak for itself” ◮ Historical information mined to determine influential factors Flexible, analogues to supervised learning ◮ Easy to incorporate explicit features, domain knowledge 12 / 36

Response prediction via latent features - I Modelling raw CTR matrix with latent features is not sensible ◮ Ignores the confidence in the individual cells Instead, split each cell into # of displays and # of clicks: 0.5 1.0 ? 0.5 ? 0.25 0.0 1.0 1.0 Click = +ve example, non-click = -ve example Now focus on modelling entries in each cell 13 / 36

Response prediction via latent features - I Modelling raw CTR matrix with latent features is not sensible ◮ Ignores the confidence in the individual cells Instead, split each cell into # of displays and # of clicks: ? ? ◮ Click = +ve example, non-click = -ve example ◮ Now focus on modelling entries in each cell 14 / 36

Response prediction via latent features - II Important to learn meaningful probabilities ◮ Discrimination of click versus not-click is insufficient For page p and ad a , we may use a sigmoidal model for the individual CTRs: exp( α T p β a ) ˆ P pa = Pr[ Click | Display ; ( p, a )] = 1 + exp( α T p β a ) ◮ α p , β a ∈ R k are the latent feature vectors for pages and ads ◮ Corresponds to a logistic loss function [Agarwal and Chen, 2009, Menon and Elkan, 2010, Yang et al., 2011] 15 / 36

Confidence weighted objective We use the sigmoidal model on each cell entry ◮ Treats them as independent training examples Now maximize conditional log-likelihood: � C pa log ˆ P pa ( α, β ) + ( D pa − C pa ) log(1 − ˆ min α,β − P pa ( α, β ))+ ( p,a ) ∈O λ α F + λ β 2 || α || 2 2 || β || 2 F where C = # of clicks, D = # of displays ◮ Terms in objective are confidence weighted ◮ Estimates will be meaningful probabilities 16 / 36

Incorporating explicit features We’d like latent features to complement, rather than replace, explicit features ◮ For response prediction, explicit features quite predictive ◮ Makes sense to use this information Incorporate features s pa ∈ R d for the (page, ad) pair ( p, a ) via ˆ P pa = σ ( w T s pa + α T p β a ) � � T � s pa ; α T � w ; 1 p β a = σ ( ) Alternating optimization of ( α, β ) and w works well ◮ Predictions from factorization → additional features into logistic regression 18 / 36

An issue of confidence Rewrite objective as � � � M pa log ˆ P pa ( α, β, w ) + (1 − M pa ) log(1 − ˆ α,β,w − min P pa ( α, β, w )) D pa ( p,a ) ∈O λ α F + λ β F + λ w 2 || α || 2 2 || β || 2 2 || w || 2 2 where M pa := C pa /D pa is the MLE for the CTR Issue : M pa is noisy → confidence weighting is inaccurate ◮ Ideally want to use true probability P pa itself 19 / 36

features Additional input Updated confidences Confidence weighted factorization Logistic regression An iterative heuristic After learning model, replace M pa with model prediction, and re-learn with new confidence weighting ◮ Can iterate until convergence Can be used as part of latent/explicit feature interplay: 20 / 36

Advertiser a Root Advertiser 1 Camp c Camp c -1 Ad 1 Ad n Ad 2 Camp 2 Camp 1 Hierarchical structure to response prediction data Webpages and ads may be arranged into hierarchies: ... ... ... ... Hierarchy encodes correlations in CTRs ◮ e.g. Two ads by same advertiser → similar CTRs ◮ Highly structured form of side-information Successfully used in previous work [Agarwal et al., 2010] ◮ How to exploit this information in our model? 22 / 36

Using hierarchies: big picture Intuition : “similar” webpages/ads should have similar latent vectors Each node in the hierarchy is given its own latent vector β Root Root β n + c +1 β n + c + a . . . Advertiser 1 Advertiser a β n +1 β n +2 β n + c . . . Camp 1 Camp 2 Camp c -1 Camp c . . . β 1 β 2 β n Ad 1 . . . Ad 2 Ad n ◮ We will tie parameters based on links in hierarchy ◮ Achieved in three simple steps 23 / 36

Principle 1: Hierarchical regularization Each node’s latent vector should equal its parent’s, in expectation: α p ∼ N ( α Parent ( p ) , σ 2 I ) With a MAP estimate of the parameters, this corresponds to the regularizer � S pp ′ || α p − α p ′ || 2 Ω( α ) = 2 p,p ′ where S pp ′ is a parent indicator matrix ◮ Latent vectors constrained to be similar to parents ◮ Induces correlation amongst siblings in hierarchy 24 / 36

Principle 2: Agglomerative fitting Can create meaningful priors by making parent nodes’ vectors predictive of data: ◮ Associate with each node clicks/views that are the sums of its childrens’ clicks/views ◮ Then consider an augmented matrix of all publisher and ad nodes, with appropriate clicks and views �� 25 / 36

Principle 2: Agglomerative fitting We treat the aggregated data as just another response prediction dataset ◮ Learn latent features for parent nodes on this data ◮ Estimates will be more reliable than those of children Once estimated, these vectors serve as prior in hierarchical regularizer ◮ Children’s vectors are shrunk towards “agglomerated vector” 26 / 36

Principle 3: Residual fitting Augment prediction to include bias terms for nodes along the path from root to leaf: ˆ P pa = σ ( α T p β a + α T p β Parent ( a ) + α T Parent ( p ) β Parent ( a ) + . . . ) ◮ Treats the hierarchy as a series of categorical features Can be viewed as decomposition of the latent vectors: � α p = ˜ α u u ∈ Path ( p ) � ˜ β a = β v v ∈ Path ( a ) 27 / 36

Response prediction using collaborative filtering with hierarchies - PowerPoint PPT Presentation

Response prediction using collaborative filtering with hierarchies and side-information Aditya Krishna Menon 1 Krishna-Prasad Chitrapura 2 Sachin Garg 2 Deepak Agarwal 3 Nagaraj Kota 2 1 UC San Diego 2 Yahoo! Labs Bangalore 3 Yahoo! Research Santa

CS490W: What is Collaborative Filtering? Collaborative Filtering (CF): Making recommendation

Filtering Cubemaps Filtering Cubemaps Angular Extent Filtering and Edge Seam Fixup Methods

Traffic Control Mechanisms Filtering Source address filtering Other forms of filtering

Lesson 7 Rate Conversion Filtering and Downsampling interchange Filtering and Upsampling

Collaborative Filtering Yun-Ta Tsai 1 , Markus Steinberger 2 , Dawid Pajk 3 , Kari Pulli 4 1

Nonlinear Filtering using Particles and Outline Nonlinear Quadrature Filtering Monte Carlo

Effective Missing Data Prediction for Collaborative Filtering Hao Ma, Irwin King, and Michael R.

1 An Filtering System that Monitors Document Search Engines Can Help, But Not Enough!

Collaborative Filtering Presentation by Alex Hugger Filtering Documents Mittwoch, 28. April 2010

Structured Prediction Introduction What is structured prediction? CS 6355: Structured Prediction

Branch Prediction Branch Prediction vs vs Execution Time Execution Time Prediction

aHomestake Array and Wiener Filtering Array Coherence Wiener Filtering Velocity Measurements

Least-Action Filtering L. C. G. Rogers Statistical Laboratory, University of Cambridge

The Filtering Matrix Interrogating Internet Filtering and Surveillance Practices Worldwide Nart

Statistical Filtering and Control for AI and Robotics Part I. Bayes filtering Riccardo Muradore

FILTERING MACROECONOMIC DATA WienerKolmogorov Filtering of Stationary Sequences The classical

Enabling Operator Reordering in Data Flow Programs Through Static Code Analysis XLDI 2012 Fabian

Numerical Optimization Techniques L eon Bottou NEC Labs America COS 424 3/2/2010

Random Graphs Lecture 10 CSCI 4974/6971 3 Oct 2016 1 / 11 Todays Biz 1. Reminders 2.

CS 6453: StreamScope Soumya Basu March 7, 2017 Motivation Streaming data is everywhere!

Time- -dependent Similarity Measure dependent Similarity Measure Time Time-dependent Similarity

Deep Character-Level Bora Edizel - Phd Student UPF Click-Through Rate Prediction Amin Mantrach -

4 Idiots Approach for Click-through Rate Prediction 1/15 Team Members 4 Idiots consist of:

Web Mining and Recommender Systems Algorithms for advertising Learning Goals Introduce the