SLIDE 1

Scalable Machine Learning

  • 8. Recommender Systems

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

Significant content courtesy of Yehuda Koren

slide-2
SLIDE 2
  • 8. Recommender Systems

Much content courtesy of (Mr Netflix) Yehuda Koren

slide-3
SLIDE 3

Outline

  • Neighborhood methods
  • User / movie similarity
  • Iteration on graph
  • Matrix Factorization
  • Singular value decomposition
  • Convex reformulation
  • Ranking and Session Modeling
  • Ordinal regression
  • Session models
  • Features
  • Latent dense (Bayesian Probabilistic Matrix Factorization)
  • Latent sparse (Dirichlet process factorization)
  • Coldstart problem (inferring features)
  • Hashing
slide-4
SLIDE 4

Why

slide-6
SLIDE 6

Netflix

slide-10
SLIDE 10

Personalized Content

adapt to general popularity; pick based on user preferences

slide-11
SLIDE 11

Spam Filtering

Something went wrong!

slide-12
SLIDE 12

A more formal view

  • User (requests content)
  • Objects (that can be displayed)
  • Context (device, location, time)
  • Interface (mobile browser, tablet, viewport)

(diagram: user u, context c, and interface combine to recommend relevant objects)

slide-13
SLIDE 13

Examples

  • Movie recommendation (Netflix)
  • Related product recommendation (Amazon)
  • Web page ranking (Google)
  • Social recommendation (Facebook)
  • News content recommendation (Yahoo)
  • Priority inbox & spam filtering (Google)
  • Online dating (OK Cupid)
  • Computational Advertising (Yahoo)
slide-14
SLIDE 14

Running Example Netflix Movie Recommendation

Training data (score, date, movie, user):

  1  5/7/02    21   1
  5  8/2/04    213  1
  4  3/6/01    345  2
  4  5/1/05    123  2
  3  7/15/02   768  2
  5  1/22/01   76   3
  4  8/3/00    45   4
  1  9/10/05   568  5
  2  3/5/03    342  5
  2  12/28/00  234  5
  5  8/11/02   76   6
  4  6/15/03   56   6

Test data (score, date, movie, user):

  ?  1/6/05    62   1
  ?  9/13/04   96   1
  ?  8/18/05   7    2
  ?  11/22/05  3    2
  ?  6/13/02   47   3
  ?  8/12/01   15   3
  ?  9/1/00    41   4
  ?  8/27/05   28   4
  ?  4/4/05    93   5
  ?  7/16/03   74   5
  ?  2/14/04   69   6
  ?  10/3/03   83   6

slide-15
SLIDE 15

Challenges

  • Scalability
  • Millions of objects
  • 100s of millions of users
  • Cold start
  • Changing user base
  • Changing inventory (movies, stories, goods)
  • Attributes
  • Imbalanced dataset

User activity / item reviews are power law distributed

http://www.igvita.com/2006/10/29/dissecting-the-netflix-dataset/

slide-16
SLIDE 16

Netflix competition yardstick

  • Least mean squares prediction error
  • Easy to define
  • Wrong measure for composing sessions!
  • Consistent (in the large sample size limit this converges to the minimizer)

\mathrm{rmse}(S) = \sqrt{ |S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2 }
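
As a sanity check, here is a minimal sketch of this yardstick in Python; the pairing of predictions with ratings is made up for illustration:

```python
import math

# RMSE over a set S of (prediction, rating) pairs, as in the formula above.
def rmse(S):
    return math.sqrt(sum((r_hat - r) ** 2 for r_hat, r in S) / len(S))

print(rmse([(3.5, 4.0), (2.0, 2.0), (5.0, 4.0)]))  # toy data
```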

slide-17
SLIDE 17

1 Neighborhood Methods

slide-18
SLIDE 18

Basic Idea

(figure: Joe's ratings are matched against similar users, producing ranked recommendations #1–#4)

slide-19
SLIDE 19
  • Derive unknown ratings from those of "similar" items
  • Basic Idea
  • (user,user) similarity to recommend items
  • good if item base is smaller than user base
  • good if item base changes rapidly
  • traverse bipartite similarity graph
  • (item,item) similarity to recommend new items that were also liked by the same users
  • good if the user base is small
  • Oldest known CF method
slide-20
SLIDE 20

Neighborhood based CF

(figure: a 6 users × 12 items rating matrix with entries between 1 and 5 and many unknown cells)

  • ? = unknown rating
  • ratings between 1 and 5

To fill in a missing entry, take a similarity-weighted average of known ratings, e.g. with similarities s13 = 0.2 and s16 = 0.3:

(0.2 · 2 + 0.3 · 3) / (0.2 + 0.3) = 2.6
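
The same weighted average in code, a minimal sketch using the slide's toy numbers:

```python
import numpy as np

# Similarity-weighted average from the slide: neighbors with similarities
# s13 = 0.2 and s16 = 0.3 gave ratings 2 and 3 respectively.
sims    = np.array([0.2, 0.3])
ratings = np.array([2.0, 3.0])
print(sims @ ratings / sims.sum())  # 2.6
```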


slide-25
SLIDE 25
  • Derive unknown ratings from those of "similar" items
  • Properties
  • Intuitive
  • No (substantial) training
  • Handles new users / items
  • Easy to explain to user
  • Accuracy & scalability questionable
slide-26
SLIDE 26

Normalization / Bias

  • Problem
  • Some items are significantly higher rated
  • Some users rate substantially lower
  • Ratings change over time
  • Bias correction is crucial for nearest neighbor recommender algorithms
  • Offset per user
  • Offset per movie
  • Time effects
  • Global bias

b_{ui} = \mu + b_u + b_i \quad \text{(global mean + user offset + item offset)}

Bell & Koren ICDM 2007 http://public.research.att.com/~volinsky/netflix/BellKorICDM07.pdf

slide-27
SLIDE 27

Baseline estimation

  • Mean rating is 3.7
  • Troll Hunter is 0.7 above mean
  • User rates 0.2 below mean
  • Baseline is 4.2 stars
  • Least mean squares problem
  • Jointly convex. Alternatively, remove the mean & iterate.

\text{minimize}_b \sum_{(u,i)} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \Big[ \sum_u b_u^2 + \sum_i b_i^2 \Big]

b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu - b_u)}{\lambda + |R(i)|} \quad \text{and} \quad b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda + |R(u)|}
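
A minimal sketch of these alternating updates; the toy triples and sizes below are hypothetical:

```python
import numpy as np

# ratings: (user, item, r_ui) triples; alternate the closed-form bias updates.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
n_users, n_items, lam = 2, 2, 1.0
mu = np.mean([r for _, _, r in ratings])
b_u, b_i = np.zeros(n_users), np.zeros(n_items)

for _ in range(20):
    for i in range(n_items):
        resid = [r - mu - b_u[u] for u, j, r in ratings if j == i]
        b_i[i] = sum(resid) / (lam + len(resid))
    for u in range(n_users):
        resid = [r - mu - b_i[j] for v, j, r in ratings if v == u]
        b_u[u] = sum(resid) / (lam + len(resid))

print(mu, b_u, b_i)  # baseline prediction: mu + b_u[u] + b_i[i]
```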

slide-28
SLIDE 28

Parzen Windows style CF

  • Similarity measure sij between items
  • Find set sk(i,u) of k-nearest neighbors to i that

were rated by user u

  • Weighted average over the set
  • How to compute sij?

\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in s^k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in s^k(i,u)} s_{ij}} \quad \text{where } b_{ui} = \mu + b_u + b_i

slide-29
SLIDE 29

each item rated by a distinct set of users

(figure: sparse user-rating vectors for item i and item j, overlapping only on a few shared users)

  • (item,item) similarity measures
  • Pearson correlation coefficient
  • nonuniform support
  • compute only over shared support
  • shrinkage towards 0 to address the problem of small support (typically few users in common)

s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}

slide-30
SLIDE 30

(item,item) similarity

  • Empirical Pearson correlation coefficient
  • Smoothing towards 0 for small support
  • Make neighborhood more peaked
  • Shrink towards baseline for small neighborhood

\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}

s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda} \hat{\rho}_{ij}, \qquad s_{ij} \to s_{ij}^2

\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in s^k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\lambda + \sum_{j \in s^k(i,u)} s_{ij}}
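
A sketch of the shrunk similarity, assuming the two rating vectors are already restricted to the shared raters U(i, j):

```python
import numpy as np

def shrunk_similarity(ri, rj, bi, bj, lam=100.0):
    """ri, rj: ratings of items i and j by their shared users;
    bi, bj: the corresponding baselines b_ui, b_uj."""
    di, dj = ri - bi, rj - bj
    rho = di @ dj / np.sqrt((di @ di) * (dj @ dj))  # Pearson over shared support
    n = len(ri)                                     # |U(i, j)|
    return (n - 1) / (n - 1 + lam) * rho            # shrink towards 0
```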

slide-31
SLIDE 31

Similarity for binary data

  • Pearson correlation meaningless
  • Views
  • Purchase behavior
  • Clicks
  • Jaccard similarity

(intersection vs. joint)

  • Observed/expected ratio

Improve by counting per user (many users better than heavy users)

m_i: users acting on i; m_{ij}: users acting on both i and j; m: total number of users

s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}} \qquad s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}
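
Both similarities are easy to compute from per-item user sets; a minimal sketch (the set names are illustrative):

```python
# users_i, users_j: sets of users who acted on items i and j; m: total users.
def jaccard_sim(users_i, users_j, alpha=1.0):
    m_ij = len(users_i & users_j)
    return m_ij / (alpha + len(users_i) + len(users_j) - m_ij)

def observed_expected_sim(users_i, users_j, m, alpha=1.0):
    m_ij = len(users_i & users_j)
    return m_ij / (alpha + len(users_i) * len(users_j) / m)
```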

slide-32
SLIDE 32

2 Matrix Factorization

slide-33
SLIDE 33

Basics

slide-34
SLIDE 34

Basic Idea


M ≈ U · V

slide-35
SLIDE 35

Latent variable view

(figure: movies and users embedded in a 2D latent space with axes "geared towards females ↔ males" and "serious ↔ escapist"; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, and users Gus and Dave)

slide-36
SLIDE 36

Basic matrix factorization

(figure: the 6 users × 12 items rating matrix is approximated by a rank-3 SVD — a users × 3 factor matrix times a 3 × items factor matrix)

Estimate unknown ratings as inner products of latent factors (e.g. a missing entry is predicted as 2.4).


slide-40
SLIDE 40

Properties

  • SVD is undefined for missing entries
  • stochastic gradient descent (faster)
  • alternating optimization
  • Overfitting without regularization, particularly if fewer reviews than dimensions
  • Very popular on Netflix

(figure: the rank-3 SVD approximation again)

  • SVD isn't defined when entries are unknown

slide-41
SLIDE 41

(figure: "Factor models: Error vs. #parameters" — RMSE from 0.875 to 0.91 against millions of parameters (10 to 100,000, log scale) for NMF, BiasSVD, SVD++, SVD v.2, v.3, v.4; Netflix baseline 0.9514, Prize target 0.8563)

slide-42
SLIDE 42

Risk Minimization View

  • Objective Function
  • Alternating least squares

\text{minimize}_{p,q} \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2 + \lambda \left[ \|p\|^2_{\text{Frob}} + \|q\|^2_{\text{Frob}} \right]

p_u \leftarrow \Big[ \lambda \mathbf{1} + \sum_{i | (u,i) \in S} q_i q_i^\top \Big]^{-1} \sum_{i | (u,i) \in S} q_i r_{ui} \qquad q_i \leftarrow \Big[ \lambda \mathbf{1} + \sum_{u | (u,i) \in S} p_u p_u^\top \Big]^{-1} \sum_{u | (u,i) \in S} p_u r_{ui}

good for MapReduce
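
A minimal alternating least squares sketch for this objective; it assumes a dense toy setup where R holds the ratings and a boolean mask S marks the observed entries:

```python
import numpy as np

def als(R, S, k=3, lam=0.1, iters=20):
    n_u, n_i = R.shape
    P, Q = np.random.randn(n_u, k), np.random.randn(n_i, k)
    for _ in range(iters):
        for u in range(n_u):                       # solve for user factors
            idx = S[u].nonzero()[0]
            A = lam * np.eye(k) + Q[idx].T @ Q[idx]
            P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
        for i in range(n_i):                       # solve for item factors
            idx = S[:, i].nonzero()[0]
            A = lam * np.eye(k) + P[idx].T @ P[idx]
            Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
    return P, Q
```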

slide-43
SLIDE 43

Risk Minimization View

  • Objective Function
  • Stochastic gradient descent
  • No need for locking
  • Multicore updates asynchronously

(Recht, Re, Wright, 2012 - Hogwild)

\text{minimize}_{p,q} \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2 + \lambda \left[ \|p\|^2_{\text{Frob}} + \|q\|^2_{\text{Frob}} \right]

much faster

p_u \leftarrow (1 - \lambda \eta_t) p_u + \eta_t q_i (r_{ui} - \langle p_u, q_i \rangle) \qquad q_i \leftarrow (1 - \lambda \eta_t) q_i + \eta_t p_u (r_{ui} - \langle p_u, q_i \rangle)
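
One SGD epoch for the same objective, sketched:

```python
# P, Q: user and item factor arrays; triples: observed (u, i, r_ui) entries.
def sgd_epoch(P, Q, triples, eta=0.01, lam=0.05):
    for u, i, r in triples:
        err = r - P[u] @ Q[i]                                # r_ui - <p_u, q_i>
        P[u], Q[i] = (1 - lam * eta) * P[u] + eta * err * Q[i], \
                     (1 - lam * eta) * Q[i] + eta * err * P[u]
```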

slide-44
SLIDE 44

Theoretical Motivation

slide-45
SLIDE 45

deFinetti Theorem

  • Independent random variables
  • Exchangeable random variables
  • There exists a conditionally independent

representation of exchangeable r.v. This motivates latent variable models

p(X) = \prod_{i=1}^m p(x_i)

p(X) = p(x_1, \ldots, x_m) = p(x_{\pi(1)}, \ldots, x_{\pi(m)})

(figure: plate diagrams — i.i.d. x_i, and x_i conditionally independent given \theta)

p(X) = \int dp(\theta) \prod_{i=1}^m p(x_i | \theta)

slide-46
SLIDE 46

Aldous Hoover Factorization

  • Matrix-valued set of random variables

Example - Erdos Renyi graph model

  • Independently exchangeable on matrix
  • Aldous Hoover Theorem

p(E) = \prod_{i,j} p(E_{ij})

p(E) = p(E_{11}, E_{12}, \ldots, E_{mn}) = p(E_{\pi(1)\rho(1)}, E_{\pi(1)\rho(2)}, \ldots, E_{\pi(m)\rho(n)})

p(E) = \int dp(\theta) \int \prod_{i=1}^m dp(u_i) \prod_{j=1}^n dp(v_j) \prod_{i,j} p(E_{ij} | u_i, v_j, \theta)

slide-47
SLIDE 47

Aldous Hoover Factorization

  • Rating matrix is (row, column) exchangeable
  • Draw latent variables per row and column
  • Draw matrix entries independently given pairs
  • Absence / presence of rating is a signal
  • Can be extended to graphs with vertex attributes

(figure: bipartite diagram of row factors u1–u6, column factors v1–v5, and observed entries e_ij)

slide-48
SLIDE 48

Aldous Hoover variants

  • Jointly exchangeable matrix
  • Social network graphs
  • Draw vertex attributes first, then edges
  • Cold start problem
  • New user appears
  • Attributes (age, location, browser)
  • Can estimate latent variables from that
  • User and item factors in the matrix factorization problem can be viewed as AH factors

slide-49
SLIDE 49

Improvements

slide-50
SLIDE 50

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41)

Add biases

slide-51
SLIDE 51

Bias

  • Objective Function
  • Stochastic gradient descent

\text{minimize}_{p,q,b} \sum_{(u,i) \in S} \big(r_{ui} - (\mu + b_u + b_i + \langle p_u, q_i \rangle)\big)^2 + \lambda \left[ \|p\|^2_{\text{Frob}} + \|q\|^2_{\text{Frob}} + \|b_{\text{users}}\|^2 + \|b_{\text{items}}\|^2 \right]

p_u \leftarrow (1 - \lambda \eta_t) p_u + \eta_t q_i \rho_{ui} \qquad q_i \leftarrow (1 - \lambda \eta_t) q_i + \eta_t p_u \rho_{ui}
b_u \leftarrow (1 - \lambda \eta_t) b_u + \eta_t \rho_{ui} \qquad b_i \leftarrow (1 - \lambda \eta_t) b_i + \eta_t \rho_{ui} \qquad \mu \leftarrow (1 - \lambda \eta_t) \mu + \eta_t \rho_{ui}
\text{where } \rho_{ui} = r_{ui} - (\mu + b_i + b_u + \langle p_u, q_i \rangle)
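
The corresponding SGD step with biases, sketched; as in the updates above, the global mean µ is shrunk as well:

```python
def sgd_step_bias(P, Q, b_u, b_i, mu, u, i, r, eta=0.005, lam=0.02):
    rho = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])  # residual rho_ui
    P[u], Q[i] = (1 - lam * eta) * P[u] + eta * rho * Q[i], \
                 (1 - lam * eta) * Q[i] + eta * rho * P[u]
    b_u[u] = (1 - lam * eta) * b_u[u] + eta * rho
    b_i[i] = (1 - lam * eta) * b_i[i] + eta * rho
    return (1 - lam * eta) * mu + eta * rho         # updated global bias
```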

slide-52
SLIDE 52

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41)

"who rated what"

slide-53
SLIDE 53

Ratings are not given at random

  • B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007

(figure: rating histograms — Yahoo! survey answers vs. Yahoo! music ratings vs. Netflix ratings)

slide-54
SLIDE 54

Movie rating matrix

  • Characterize users by which movies they rated: edge attributes (observed, rating)
  • Adding features to recommender system

(figure: the rating matrix r_ui side by side with the binary "who rated what" indicator matrix c_ui)

r_{ui} = \mu + b_u + b_i + \langle p_u, q_i \rangle + \langle c_u, x_i \rangle \quad \text{(regression term)}

slide-55
SLIDE 55

Alternative integration

  • Key idea - use related ratings to average
  • Salakhudtinov & Mnih, 2007
  • Koren et al., 2008

Overparametrize items by q and x:

q_i \leftarrow q_i + \sum_u c_{ui} p_u \qquad q_i \leftarrow q_i + \sum_u c_{ui} x_j

slide-56
SLIDE 56

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41)

temporal effects

slide-57
SLIDE 57

Something Happened in Early 2004…

2004 Netflix ratings by date

Netflix changed rating labels

slide-58
SLIDE 58

Are movies getting better with time?

slide-59
SLIDE 59

Sources of temporal change

  • Items
  • Seasonal effects (Christmas, Valentine's day, holiday movies)
  • Public perception of movies (Oscars etc.)
  • Users
  • Changed labeling of reviews
  • Anchoring (relative to previous movie)
  • Change of rater in household
  • Selection bias for time of viewing
slide-60
SLIDE 60

Modeling temporal change

  • Time-dependent bias
  • Time-dependent user preferences
  • Parameterize functions b and p
  • Slow changes for items
  • Fast sudden changes for users
  • Good parametrization is key

r_{ui}(t) = \mu + b_u(t) + b_i(t) + \langle q_i, p_u(t) \rangle

Koren et al., KDD 2009 (CF with temporal dynamics)

slide-61
SLIDE 61

Sources of Variance in Netflix data

total variance 1.276 = 0.732 unexplained (57%) + 0.415 biases (33%) + 0.129 personalization (10%)

Bias matters

slide-62
SLIDE 62

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41 — Netflix baseline 0.9514, Prize target 0.8563)

r_{ui} = q_i^\top p_u \quad \longrightarrow \quad r_{ui}(t) = b_u(t) + b_i(t) + q_i^\top \Big( p_u(t) + \sum_j b_{uj} x_j \Big)

slide-63
SLIDE 63

More ideas

  • Explain factorizations
  • Cold start (new users)
  • Different regularization for different parameter groups / different users
  • Sharing of statistical strength between users
  • Hierarchical matrix co-clustering / factorization (write a paper on that)

slide-64
SLIDE 64

3 Session Modeling

slide-65
SLIDE 65

Motivation

slide-66
SLIDE 66

User interaction

  • Explicit search query
  • Search engine
  • Genre selection on movie site
  • Implicit search query
  • News site
  • Priority inbox
  • Comments on article
  • Viewing specific movie (see also ...)
  • Sponsored search (advertising)

Space, users’ time and attention are limited.

slide-68
SLIDE 68

session? models?

slide-69
SLIDE 69

Did the user SCROLL DOWN?

slide-70
SLIDE 70

Bad ideas ...

  • Show items based on relevance
  • Yes, this user likes Die Hard.
  • But he likes other movies, too
  • Show items only for the majority of users ('apple' vs. 'Apple')

slide-71
SLIDE 71

User response

(figure: content module screenshots — users collapse stories they dislike; collapsing is a signal of implicit user interest, so log it!)

slide-72
SLIDE 72

hover on link

slide-73
SLIDE 73

Response is conditioned on available options

  • User searches for 'chocolate'
  • What the user really would have wanted
  • User can only pick from available items
  • Preferences are often relative

(figure: result grid for 'chocolate'; the user picks one of the displayed items)

slide-74
SLIDE 74

Models

slide-75
SLIDE 75

Independent click model

  • Each object has click probability
  • Object is viewed independently
  • Used in computational advertising (with some position correction)
  • Horribly wrong assumption
  • OK if probability is very small (OK in ads)

p(x|s) = \prod_{i=1}^n \frac{1}{1 + e^{-x_i s_i}}

slide-76
SLIDE 76

Logistic click model

  • User picks at most one object
  • Exponential family model for click
  • Ignores order of objects
  • Assumes that the user looks at all before taking action

p(x|s) = \frac{e^{s_x}}{e^{s_0} + \sum_{x'} e^{s_{x'}}} = \exp(s_x - g(s)) \qquad (s_0: \text{the no-click option})

slide-77
SLIDE 77

Sequential click model

  • User traverses list
  • At each position some probability of clicking
  • When user reaches end of the list he aborts
  • This assumes that a patient user viewed all items

(no click at positions 1, \ldots, j-1; click at position j)

p(x = j | s) = \left[ \prod_{i=1}^{j-1} \frac{1}{1 + e^{s_i}} \right] \frac{1}{1 + e^{-s_j}}
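
A direct transcription of this probability, as a sketch:

```python
import math

def p_click_at(j, s):
    """Probability that the single click lands at position j (1-indexed),
    given per-position scores s, under the sequential click model."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    p = 1.0
    for i in range(j - 1):
        p *= 1.0 - sigmoid(s[i])   # no click at positions 1..j-1
    return p * sigmoid(s[j - 1])   # click at position j
```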

slide-78
SLIDE 78

Skip click model

  • User traverses list
  • At each position some probability of clicking
  • At each position the user may abandon the process
  • This assumes that user traverses list sequentially

(figure: the user skips some items (no click), clicks others, and may abandon at any point)

slide-79
SLIDE 79

Context skip click model

  • User traverses list
  • At each position some probability of clicking which depends on previous content
  • At each position the user may abandon the process
  • User may click more than once

slide-81
SLIDE 81

Context skip click model

  • Viewing probability
  • Click probability (only if viewed)

p(v_i = 1 | v_{i-1} = 0) = 0 \quad \text{(user is gone)}
p(v_i = 1 | v_{i-1} = 1, c_{i-1} = 0) = \frac{1}{1 + e^{-\alpha_i}}
p(v_i = 1 | v_{i-1} = 1, c_{i-1} = 1) = \frac{1}{1 + e^{-\beta_i}} \quad \text{(user returns)}

Functional form: p(c_i = 1 | v_i = 1, c_{i-1}, d_i) = \frac{1}{1 + e^{-f(|c_{i-1}|, d_i, d_{i-1})}} \quad \text{(prior context)}

slide-82
SLIDE 82

Incremental gains score

  • Submodular gain per additional document
  • Relevance score per document
  • Coverage over different aspects
  • Position dependent score
  • Score dependent on number of previous clicks

f(|c_{i-1}|, d_i, d_{i-1}) := \rho(S, d_i | a, b) - \rho(S, d_{i-1} | a, b) + \gamma |c_{i-1}| + \delta_i

(the coverage gain \rho(S, d_i | a, b) - \rho(S, d_{i-1} | a, b) expands as a sum over aspects j of the query set S, weighted by a_j and b_j)

slide-83
SLIDE 83
Optimization

  • Latent variables: we don't know v, i.e. whether the user viewed a result
  • Use variational inference to integrate out v (more next week in graphical models)

-\log p(c) \le -\log p(c) + D(q(v) \| p(v|c)) = \mathbb{E}_{v \sim q(v)}\left[-\log p(c) + \log q(v) - \log p(v|c)\right] = \mathbb{E}_{v \sim q(v)}\left[-\log p(c, v)\right] - H(q(v))

slide-84
SLIDE 84

Optimization

  • Compute latent viewing probability given clicks
  • Easy since we only have one transition from views to no views (no DP needed)
  • Expected log-likelihood under viewing model
  • Convex expected log-likelihood
  • Stochastic gradient descent
  • Parametrization uses personalization, too (user, position, viewport, browser)

slide-89
SLIDE 89

4 Feature Representation

slide-90
SLIDE 90

Bayesian Probabilistic Matrix Factorization

slide-91
SLIDE 91

Statistical Model

  • Aldous-Hoover factorization
  • normal distribution for user and item attributes
  • rating given by inner product

(figure: graphical model — user factors U_i (i = 1, ..., N) and item factors V_j (j = 1, ..., M) with prior variances \sigma_U, \sigma_V generate ratings R_{ij} with noise \sigma)

Ratings: p(R_{ij} | U_i, V_j, \sigma^2) = \mathcal{N}(R_{ij} | U_i^\top V_j, \sigma^2)

Latent factors: p(U | \sigma_U^2) = \prod_{i=1}^N \mathcal{N}(U_i | 0, \sigma_U^2 I), \qquad p(V | \sigma_V^2) = \prod_{j=1}^M \mathcal{N}(V_j | 0, \sigma_V^2 I)

Salakhutdinov & Mnih, ICML 2008 (BPMF)

slide-92
SLIDE 92

Details

(figure: graphical model with hyperpriors \Theta_U, \Theta_V, \alpha_U, \alpha_V on the user and item factors)

  • Priors on all factors
  • Wishart prior is conjugate to Gaussian, hence use it
  • Allows us to adapt the variance automatically
  • Inference (Gibbs sampler)
  • Sample user factors (parallel)
  • Sample movie factors (parallel)
  • Sample hyperparameters (parallel)
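
A minimal sketch of the user-factor sweep with fixed hyperparameters (alpha: rating precision, tau: prior precision — both stand-ins for the sampled hyperparameters; a dense R with NaN for missing entries is assumed):

```python
import numpy as np

def gibbs_sweep_users(R, U, V, alpha=2.0, tau=1.0):
    k = U.shape[1]
    for i in range(U.shape[0]):
        obs = ~np.isnan(R[i])                              # items rated by user i
        Lam = tau * np.eye(k) + alpha * V[obs].T @ V[obs]  # posterior precision
        mu = np.linalg.solve(Lam, alpha * V[obs].T @ R[i, obs])
        U[i] = np.random.multivariate_normal(mu, np.linalg.inv(Lam))
    return U
```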
slide-93
SLIDE 93

Making it fancier (constrained BPMF)

(figure: graphical model for constrained BPMF — the latent user factors U_i are shifted by a weighted sum of per-item offset vectors W_k over the items the user rated)

who rated what

slide-94
SLIDE 94

Results (Mnih & Salakthudtinov)

(figure: RMSE by number of observed ratings per user (1–5 up to >641) for PMF, Constrained PMF, and the movie-average baseline, plus the fraction of users in each bucket)

helps for infrequent users

slide-95
SLIDE 95

Multiple Sources

slide-96
SLIDE 96

Social Network Data

Data: users, connections, features. Goal: suggest connections

slide-99
SLIDE 99

Social Network Data

Data: users, connections, features. Goal: model/suggest connections

(figure: two users with observed features y, y' and latent features x, x', connected by an edge e)

p(x, y, e) = \prod_{i \in \text{Users}} p(y_i) p(x_i | y_i) \prod_{i,j \in \text{Users}} p(e_{ij} | x_i, y_i, x_j, y_j)

Direct application of the Aldous-Hoover theorem. Edges are conditionally independent.

slide-103
SLIDE 103

Applications

social network = friendship + interests

slide-104
SLIDE 104

Applications

social network = friendship + interests

  • recommend users based on friendship & interests
  • recommend apps based on friendship & interests
slide-105
SLIDE 105

Social Recommendation

  • recommend users based on friendship & interests
  • boost traffic
  • make the user graph more dense
  • increase user population
  • stickiness
  • recommend apps based on friendship & interests
  • boost traffic
  • increased revenue
  • increased user participation
  • make app graph more dense

... usually addressed by separate tools ...

slide-106
SLIDE 106

Homophily

  • recommend users based on friendship & interests: users with similar interests are more likely to connect
  • recommend apps based on friendship & interests: friends install similar applications

Highly correlated. Estimate both jointly.

slide-107
SLIDE 107

Model

(figure: graphical model — observed user features y, latent user features x, observed app features u, latent app features v, social edges e, and app installs)
slide-108
SLIDE 108

Model

(figure: combined graphical model with user features x, y and app features v, u, social edges e, and installs a)

  • Social interaction: x_i \sim p(x | y_i), \; x_j \sim p(x | y_j), \; e_{ij} \sim p(e | x_i, y_i, x_j, y_j, \Phi)
  • App install: x_i \sim p(x | y_i), \; v_j \sim p(v | u_j), \; a_{ij} \sim p(a | x_i, y_i, u_j, v_j, \Phi)

slide-109
SLIDE 109

Model

  • Social interaction
  • App install

x_i = A y_i + \epsilon_i \qquad v_j = B u_j + \tilde{\epsilon}_j \qquad \text{(cold start)}

e_{ij} \sim p(e | x_i^\top x_j + y_i^\top W y_j) \qquad a_{ij} \sim p(a | x_i^\top v_j + y_i^\top M u_j) \qquad \text{(latent + bilinear features)}

slide-110
SLIDE 110

Optimization Problem

\text{minimize} \quad \lambda_e \sum_{(i,j)} l(e_{ij}, x_i^\top x_j + y_i^\top W y_j) \quad \text{(social)}
\quad + \lambda_a \sum_{(i,j)} l(a_{ij}, x_i^\top v_j + y_i^\top M u_j) \quad \text{(app)}
\quad + \lambda_x \sum_i \gamma(x_i | y_i) + \lambda_v \sum_i \gamma(v_i | u_i) \quad \text{(reconstruction)}
\quad + \lambda_W \|W\|^2 + \lambda_M \|M\|^2 + \lambda_A \|A\|^2 + \lambda_B \|B\|^2 \quad \text{(regularizer)}


slide-115
SLIDE 115

Loss Function

slide-116
SLIDE 116

Loss

  • Much more evidence of application non-installs (i.e. many more negative examples)
  • Few links between vertices in the friendship network (even within short graph distance)
  • Generate ranking problems (link, non-link) with non-links drawn from a background set

slide-117
SLIDE 117

Loss

application recommendation social recommendation

slide-118
SLIDE 118

Optimization

  • Nonconvex optimization problem
  • Large set of variables
  • Stochastic gradient descent on x, v, ε for speed
  • Use hashing to reduce memory load, i.e.

x_{ij} = \sigma(i, j) X[h(i, j)] \qquad (\sigma: \text{binary hash}, \; h: \text{hash})

slide-119
SLIDE 119

Y! Pulse

slide-120
SLIDE 120

Y! Pulse Data

1.2M users, 386 items, 6.1M friend connections, 29M interest indications

slide-121
SLIDE 121

App Recommendation

SIM: similarity based model; RLFM: regression based latent factor model (Chen&Agarwal); NLFM: SIM&RLFM

slide-122
SLIDE 122

Social recommendation

slide-123
SLIDE 123

app recommendation L2 penalty

slide-124
SLIDE 124
Extensions

  • Multiple relations: (user, user), (user, app), (app, advertisement)
  • Users visiting several properties: news, mail, frontpage, social network, etc.
  • Different statistical models
  • Latent Dirichlet Allocation for latent factors
  • Indian Buffet Process

(figure: the user-user and user-app graphical models chained together across relations)


slide-128
SLIDE 128

More strategies

slide-129
SLIDE 129

Multiple factor LDA

  • Discrete set of preferences (Porteous, Bart, Welling, 2008)
  • User picks one to assess a movie
  • Movie represented by a discrete attribute
  • Inference by Gibbs sampler
  • Works fairly well
  • Extension by Lester Mackey and coworkers to combine with the BPMF model

slide-130
SLIDE 130

More state representations

  • Indian Buffet Process (Griffiths & Ghahramani, 2005)
  • Attribute vector is a binary string
  • Models preferences naturally & very compactly (inference is costly)
  • Hierarchical attribute representation and clustering over users ... TO DO

slide-131
SLIDE 131

5 Hashing

slide-132
SLIDE 132

Parameter Storage

  • We have millions of users
  • We have millions of products
  • Storage - for 100 factors this requires 10^6 × 10^6 × 8 bytes = 8 TB
  • We want a model that can be kept in RAM (<16GB)
  • Instant response for each user
  • Disks have 20 IOP/s at best (SSDs much better)
  • Privacy (what if the parameter vector leaks)
slide-133
SLIDE 133

Recall - Hash Kernels

(figure: hash-kernel example — the message "Hey, please mention subtly during your talk that people should use Yahoo mail more often." is featurized per task/user (= barney): a global feature h('mention') and a personalized feature h('mention_barney'), each carrying a sign s(·) ∈ {−1, 1})

x_i \in \mathbb{R}^{N \times (U+1)} \qquad \sum_i \bar{w}[h(i)] \sigma(i) x_i

Similar to the count hash (Charikar, Chen, Farach-Colton, 2003)

slide-134
SLIDE 134
Collaborative Filtering

  • Hashing compression
  • Approximation error is O(1/n)
  • To show that the estimate is unbiased, take the expectation over the Rademacher hash

u_i = \sum_{j,k: h(j,k)=i} \xi(j,k) U_{jk} \quad \text{and} \quad v_i = \sum_{j,k: h(j,k)=i} \xi(j,k) V_{jk}

X_{ij} := \sum_k \xi(k,i) \xi(k,j) u_{h(k,i)} v_{h(k,j)}
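
A sketch of this compression scheme; the hash h and sign xi below are illustrative stand-ins for whatever Rademacher hash is actually used:

```python
def h(j, k, n):
    return hash((j, k)) % n                      # bucket in the compressed array

def xi(j, k):
    return 1 if hash((j, k, 's')) % 2 else -1    # sign in {-1, +1}

def compressed_entry(i, j, u, v, K, n):
    """Reconstruct X_ij = sum_k xi(k,i) xi(k,j) u[h(k,i)] v[h(k,j)]."""
    return sum(xi(k, i) * xi(k, j) * u[h(k, i, n)] * v[h(k, j, n)]
               for k in range(K))
```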

slide-135
SLIDE 135
  • Hashing compression
  • Expectation

u_i = \sum_{j,k: h(k,j)=i} \xi(k,j) U_{kj} \quad \text{and} \quad v_i = \sum_{j,k: h'(k,j)=i} \xi'(k,j) V_{kj}

X_{ij} := \sum_k \xi(k,i) \xi'(k,j) \sum_{l,\kappa: h(\kappa,l)=h(k,i)} \; \sum_{o,\kappa: h'(\kappa,o)=h'(k,j)} \xi(\kappa,l) \xi'(\kappa,o) U_{\kappa l} V_{\kappa o}

(in expectation all cross terms vanish)

slide-136
SLIDE 136

Collaborative Hashing

  • Combine with stochastic gradient descent
  • Random access in memory is expensive (we now have to do k lookups per pair)
  • Feistel networks can accelerate this
  • Distributed optimization without locking
slide-137
SLIDE 137

Examples

(figure: RMSE vs. number of hashed rows in M and U, for the EachMovie and MovieLens datasets)

slide-138
SLIDE 138

Summary

  • Neighborhood methods
  • User / movie similarity
  • Iteration on graph
  • Matrix Factorization
  • Singular value decomposition
  • Convex reformulation
  • Ranking and Session Modeling
  • Ordinal regression
  • Session models
  • Features
  • Latent dense (Bayesian Probabilistic Matrix Factorization)
  • Latent sparse (Dirichlet process factorization)
  • Coldstart problem (inferring features)
  • Hashing
slide-139
SLIDE 139

Further reading

  • Collaborative Filtering with temporal dynamics

http://research.yahoo.com/files/kdd-fp074-koren.pdf

  • Neighborhood factorization

http://research.yahoo.com/files/paper.pdf

  • Matrix Factorization for recommender systems

http://research.yahoo.com/files/ieeecomputer.pdf

  • CoFi Rank (collaborative filtering & ranking)

http://www.cofirank.org/

  • Yehuda Koren’s papers

http://research.yahoo.com/Yehuda_Koren