SLIDE 1 Scalable Machine Learning
Alex Smola Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12
Significant content courtesy of Yehuda Koren
SLIDE 2
Much content courtesy of (Mr Netflix) Yehuda Koren
SLIDE 3 Outline
- Neighborhood methods
- User / movie similarity
- Iteration on graph
- Matrix Factorization
- Singular value decomposition
- Convex reformulation
- Ranking and Session Modeling
- Ordinal regression
- Session models
- Features
- Latent dense (Bayesian Probabilistic Matrix Factorization)
- Latent sparse (Dirichlet process factorization)
- Coldstart problem (inferring features)
- Hashing
SLIDE 4
Why
SLIDE 5
SLIDE 6
Netflix
SLIDE 7
SLIDE 8
SLIDE 9
SLIDE 10 Personalized Content
- adapt to general popularity
- pick based on user preferences
SLIDE 11
Spam Filtering
Something went wrong!
SLIDE 12 A more formal view
- User (requests content)
- Objects (that can be displayed)
- Context (device, location, time)
- Interface (mobile browser, tablet, viewport)
[Diagram: user u, in a given context, on a given interface → recommend relevant objects]
SLIDE 13 Examples
- Movie recommendation (Netflix)
- Related product recommendation (Amazon)
- Web page ranking (Google)
- Social recommendation (Facebook)
- News content recommendation (Yahoo)
- Priority inbox & spam filtering (Google)
- Online dating (OK Cupid)
- Computational Advertising (Yahoo)
SLIDE 14 Running Example Netflix Movie Recommendation
Training data:
score  date      movie  user
1      5/7/02    21     1
5      8/2/04    213    1
4      3/6/01    345    2
4      5/1/05    123    2
3      7/15/02   768    2
5      1/22/01   76     3
4      8/3/00    45     4
1      9/10/05   568    5
2      3/5/03    342    5
2      12/28/00  234    5
5      8/11/02   76     6
4      6/15/03   56     6

Test data:
score  date      movie  user
?      1/6/05    62     1
?      9/13/04   96     1
?      8/18/05   7      2
?      11/22/05  3      2
?      6/13/02   47     3
?      8/12/01   15     3
?      9/1/00    41     4
?      8/27/05   28     4
?      4/4/05    93     5
?      7/16/03   74     5
?      2/14/04   69     6
?      10/3/03   83     6
SLIDE 15 Challenges
- Scalability
- Millions of objects
- 100s of millions of users
- Cold start
- Changing user base
- Changing inventory (movies, stories, goods)
- Attributes
- Imbalanced dataset
User activity / item reviews are power law distributed
http://www.igvita.com/2006/10/29/dissecting-the-netflix-dataset/
SLIDE 16 Netflix competition yardstick
- Least mean squares prediction error
- Easy to define
- Wrong measure for composing sessions!
- Consistent (in large sample size limit this will
converge to minimizer)
\mathrm{rmse}(S) = \sqrt{ |S|^{-1} \sum_{(i,u) \in S} (\hat r_{ui} - r_{ui})^2 }
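As a sanity check, here is a minimal numpy sketch of this yardstick (the arrays below are made-up placeholder ratings, not Netflix data):

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean squared error over a set S of (item, user) pairs."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.sqrt(np.mean((predicted - observed) ** 2))

# hypothetical predictions vs. held-out ratings
print(rmse([3.5, 4.2, 2.9], [4, 4, 3]))  # about 0.32
```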
SLIDE 17
1 Neighborhood Methods
SLIDE 18 Basic Idea
[Figure: user Joe and four movies ranked #1-#4]
SLIDE 19
- Derive unknown ratings from those of “similar” items
- Basic Idea
- (user,user) similarity to recommend items
- good if item base is smaller than user base
- good if item base changes rapidly
- traverse bipartite similarity graph
- (item,item) similarity to recommend new items that
were also liked by the same users
SLIDE 20 Neighborhood based CF
[Figure: rating matrix, users 1-6 (rows) by items 1-12 (columns); filled cells are ratings]
- unknown rating
- rating between 1 and 5
SLIDE 21 Neighborhood based CF
[Figure: same rating matrix; one unknown entry is marked "?"]
- unknown rating
- rating between 1 and 5
SLIDE 22 Neighborhood based CF
[Figure: same rating matrix, unknown entry marked "?"]
- unknown rating
- rating between 1 and 5
SLIDE 23 Neighborhood based CF
[Figure: same rating matrix]
- unknown rating
- rating between 1 and 5
- similarity s_{13} = 0.2, s_{16} = 0.3; predict the "?" by a weighted average
SLIDE 24 Neighborhood based CF
[Figure: same rating matrix]
- similarity s_{13} = 0.2, s_{16} = 0.3
- weighted average: (0.2 · 2 + 0.3 · 3) / (0.2 + 0.3) = 2.6
SLIDE 25
- Derive unknown ratings from those of “similar” items
- Properties
- Intuitive
- No (substantial) training
- Handles new users / items
- Easy to explain to user
- Accuracy & scalability questionable
SLIDE 26 Normalization / Bias
- Problem
- Some items are significantly higher rated
- Some users rate substantially lower
- Ratings change over time
- Bias correction is crucial for nearest-neighbor recommender algorithms
- Offset per user
- Offset per movie
- Time effects
- Global bias
b_{ui} = \mu + b_u + b_i \qquad (\mu: global mean, b_u: user offset, b_i: item offset)
Bell & Koren ICDM 2007 http://public.research.att.com/~volinsky/netflix/BellKorICDM07.pdf
SLIDE 27 Baseline estimation
- Mean rating is 3.7
- Troll Hunter is 0.7 above mean
- User rates 0.2 below mean
- Baseline is 4.2 stars
- Least mean squares problem
- Jointly convex. Alternatively remove mean & iterate
minimize_b \; \sum_{(u,i)} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \Big[ \sum_u b_u^2 + \sum_i b_i^2 \Big]

b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu - b_u)}{\lambda + |R(i)|} \qquad \text{and} \qquad b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda + |R(u)|}
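A small sketch of this alternating scheme (the rating-triple input format and variable names are illustrative, not from the slides):

```python
import numpy as np

def fit_biases(ratings, n_users, n_items, lam=10.0, n_iter=10):
    """ratings: iterable of (user, item, r). Alternates the closed-form
    updates for b_i and b_u around the global mean mu."""
    ratings = list(ratings)
    mu = np.mean([r for _, _, r in ratings])
    b_user, b_item = np.zeros(n_users), np.zeros(n_items)
    by_item, by_user = {}, {}
    for u, i, r in ratings:
        by_item.setdefault(i, []).append((u, r))
        by_user.setdefault(u, []).append((i, r))
    for _ in range(n_iter):
        for i, rs in by_item.items():   # b_i given the current b_u
            b_item[i] = sum(r - mu - b_user[u] for u, r in rs) / (lam + len(rs))
        for u, rs in by_user.items():   # b_u given the current b_i
            b_user[u] = sum(r - mu - b_item[i] for i, r in rs) / (lam + len(rs))
    return mu, b_user, b_item
```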
SLIDE 28 Parzen Windows style CF
- Similarity measure sij between items
- Find set sk(i,u) of k-nearest neighbors to i that
were rated by user u
- Weighted average over the set
- How to compute sij?
\hat r_{ui} = b_{ui} + \frac{\sum_{j \in s_k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in s_k(i,u)} s_{ij}} \qquad \text{where } b_{ui} = \mu + b_u + b_i
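Roughly, in code (a hedged sketch; the similarity function, baseline, and data layout are placeholders):

```python
def predict_rating(u, i, user_ratings, sim, baseline, k=20):
    """r_hat_ui = b_ui + sum_j s_ij (r_uj - b_uj) / sum_j s_ij, where j runs over
    the k items most similar to i among those user u has rated."""
    rated = user_ratings[u]  # dict: item -> rating
    neighbors = sorted(rated, key=lambda j: sim(i, j), reverse=True)[:k]
    num = sum(sim(i, j) * (rated[j] - baseline(u, j)) for j in neighbors)
    den = sum(sim(i, j) for j in neighbors)
    return baseline(u, i) + (num / den if den > 0 else 0.0)
```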
SLIDE 29
each item rated by a distinct set of users
[Figure: user ratings for item i and for item j, with only partially overlapping sets of users]
- (item,item) similarity measures
- Pearson correlation coefficient
- nonuniform support
- compute only over shared support
- shrinkage towards 0 to address problem of
small support (typically few items in common)
s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}] \, \mathrm{Std}[r_{uj}]}
SLIDE 30 (item,item) similarity
- Empirical Pearson correlation coefficient
- Smoothing towards 0 for small support
- Make neighborhood more peaked
- Shrink towards baseline for small neighborhood
\hat\rho_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \; \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}

s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda} \, \hat\rho_{ij}, \qquad s_{ij} \to s_{ij}^2 \;\; \text{(peaking)}

\hat r_{ui} = b_{ui} + \frac{\sum_{j \in s_k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\lambda + \sum_{j \in s_k(i,u)} s_{ij}}
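A sketch of the shrunk correlation (the residuals r_ui − b_ui are assumed precomputed; the data layout and names are illustrative):

```python
import numpy as np

def shrunk_similarity(res_i, res_j, lam=100.0):
    """res_i, res_j: dicts mapping user -> (r_ui - b_ui) for items i and j.
    Pearson-style correlation on the shared support, shrunk towards 0."""
    common = set(res_i) & set(res_j)
    n = len(common)
    if n < 2:
        return 0.0
    a = np.array([res_i[u] for u in common])
    b = np.array([res_j[u] for u in common])
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    rho = (a * b).sum() / denom if denom > 0 else 0.0
    return (n - 1) / (n - 1 + lam) * rho   # shrinkage towards 0 for small support
```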
SLIDE 31 Similarity for binary data
- Pearson correlation meaningless
- Views
- Purchase behavior
- Clicks
- Jaccard similarity
(intersection vs. joint)
Improve by counting per user (many users better than heavy users)
m_i: users acting on i; m_{ij}: users acting on both i and j; m: total number of users

s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}} \qquad s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}
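In code, both smoothed scores follow directly from the counts (α is the smoothing constant; the function name is made up for illustration):

```python
def binary_item_similarity(m_i, m_j, m_ij, m, alpha=5.0):
    """m_i, m_j: users acting on i / on j; m_ij: users acting on both; m: all users."""
    jaccard = m_ij / (alpha + m_i + m_j - m_ij)   # smoothed Jaccard
    lift = m_ij / (alpha + m_i * m_j / m)          # observed / expected count
    return jaccard, lift
```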
SLIDE 32
2 Matrix Factorization
SLIDE 33
Basics
SLIDE 34 Basic Idea
M ≈ U · V
SLIDE 35 Latent variable view
[Figure: movies and users placed in a two-dimensional latent space; one axis runs from "geared towards females" to "geared towards males", the other from "serious" to "escapist"; example movies: The Princess Diaries, Sense and Sensibility, The Color Purple, Amadeus, The Lion King, Braveheart, Ocean's 11, Lethal Weapon, Independence Day, Dumb and Dumber; example users: Gus, Dave]
SLIDE 36 Basic matrix factorization
[Figure: the users × items rating matrix approximated by the product of a users × 3 factor matrix and a 3 × items factor matrix: a rank-3 SVD approximation]
SLIDE 37 Estimate unknown ratings as inner products of latent factors
[Figure: the same rank-3 approximation; one unknown rating is marked "?"]
SLIDE 38 Estimate unknown ratings as inner products of latent factors
[Figure: the same rank-3 approximation; the corresponding user-factor row and item-factor column are highlighted]
SLIDE 39 Estimate unknown ratings as inner products of latent factors
[Figure: the same rank-3 approximation; the unknown rating is estimated as 2.4 from the inner product of the highlighted factors]
SLIDE 40 Properties
- SVD is undefined for missing entries
- stochastic gradient descent (faster)
- alternating optimization
- Overfitting without regularization
particularly if fewer reviews than dimensions
[Figure: the rating matrix and its rank-3 factorization, as before]
SLIDE 41 Factor models: Error vs. #parameters
[Plot: RMSE (roughly 0.875-0.91) vs. millions of parameters (10-100,000, log scale) for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4; point labels give the number of factors (40 up to 1500); reference lines: Prize 0.8563, Netflix 0.9514]
SLIDE 42 Risk Minimization View
- Objective Function
- Alternating least squares
minimize_{p,q} \; \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2 + \lambda \big[ \|p\|_{Frob}^2 + \|q\|_{Frob}^2 \big]

p_u \leftarrow \Big[ \lambda \mathbf{1} + \sum_{i | (u,i) \in S} q_i q_i^\top \Big]^{-1} \sum_{i | (u,i) \in S} q_i r_{ui}, \qquad q_i \leftarrow \Big[ \lambda \mathbf{1} + \sum_{u | (u,i) \in S} p_u p_u^\top \Big]^{-1} \sum_{u | (u,i) \in S} p_u r_{ui}
good for MapReduce
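A dense-numpy sketch of one round of these alternating updates (illustrative only; a real implementation would exploit sparsity and shard users and items across machines):

```python
import numpy as np

def als_round(R, mask, P, Q, lam=0.1):
    """R: ratings (users x items); mask: 1 where observed.
    P: user factors (users x d); Q: item factors (items x d)."""
    d = P.shape[1]
    for u in range(P.shape[0]):            # p_u given Q: regularized least squares
        obs = mask[u] > 0
        A = lam * np.eye(d) + Q[obs].T @ Q[obs]
        P[u] = np.linalg.solve(A, Q[obs].T @ R[u, obs])
    for i in range(Q.shape[0]):            # q_i given P
        obs = mask[:, i] > 0
        A = lam * np.eye(d) + P[obs].T @ P[obs]
        Q[i] = np.linalg.solve(A, P[obs].T @ R[obs, i])
    return P, Q
```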
SLIDE 43 Risk Minimization View
- Objective Function
- Stochastic gradient descent
- No need for locking
- Multicore updates asynchronously
(Recht, Re, Wright, 2012 - Hogwild)
minimize_{p,q} \; \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2 + \lambda \big[ \|p\|_{Frob}^2 + \|q\|_{Frob}^2 \big]

much faster

p_u \leftarrow (1 - \lambda\eta_t) p_u + \eta_t q_i (r_{ui} - \langle p_u, q_i \rangle), \qquad q_i \leftarrow (1 - \lambda\eta_t) q_i + \eta_t p_u (r_{ui} - \langle p_u, q_i \rangle)
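As a rough sketch, one stochastic gradient pass over the observed ratings (Hogwild-style training would simply run this loop on several cores without locks):

```python
def sgd_epoch(triples, P, Q, lam=0.05, eta=0.01):
    """triples: iterable of (u, i, r_ui); P, Q: numpy factor matrices, updated in place."""
    for u, i, r in triples:
        err = r - P[u] @ Q[i]              # r_ui - <p_u, q_i>
        pu_old = P[u].copy()
        P[u] = (1 - lam * eta) * P[u] + eta * err * Q[i]
        Q[i] = (1 - lam * eta) * Q[i] + eta * err * pu_old
    return P, Q
```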
SLIDE 44
Theoretical Motivation
SLIDE 45 de Finetti Theorem
- Independent random variables
- Exchangeable random variables
- There exists a conditionally independent
representation of exchangeable r.v. This motivates latent variable models
p(X) = \prod_{i=1}^m p(x_i)

p(X) = p(x_1, \ldots, x_m) = p(x_{\pi(1)}, \ldots, x_{\pi(m)})

[Graphical model: x_i drawn independently vs. x_i conditionally independent given \theta]

p(X) = \int dp(\theta) \prod_{i=1}^m p(x_i \mid \theta)
SLIDE 46 Aldous Hoover Factorization
- Matrix-valued set of random variables
Example: Erdős–Rényi graph model
- Independently exchangeable on matrix
- Aldous Hoover Theorem
p(E) = \prod_{i,j} p(E_{ij})

p(E) = p(E_{11}, E_{12}, \ldots, E_{mn}) = p(E_{\pi(1)\rho(1)}, E_{\pi(1)\rho(2)}, \ldots, E_{\pi(m)\rho(n)})

p(E) = \int dp(\theta) \int \prod_{i=1}^m dp(u_i) \prod_{j=1}^n dp(v_j) \prod_{i,j} p(E_{ij} \mid u_i, v_j, \theta)
SLIDE 47 Aldous Hoover Factorization
- (Row, column) exchangeable rating matrix
- Draw latent variables per row and column independently; given the pair of latents, the rating is a signal
- Can be extended to graphs with vertex attributes
[Figure: table of users u_1,...,u_6 and items v_1,...,v_5 with the observed entries e_{ij}]
SLIDE 48 Aldous Hoover variants
- Jointly exchangeable matrix
- Social network graphs
- Draw vertex attributes first, then edges
- Cold start problem
- New user appears
- Attributes (age, location, browser)
- Can estimate latent variables from that
- User and item factors in matrix factorization
problem can be viewed as AH-factors
SLIDE 49
Improvements
SLIDE 50 Factor models: Error vs. #parameters
[Plot: same RMSE vs. millions of parameters curves as before]
Add biases
SLIDE 51 Bias
- Objective Function
- Stochastic gradient descent
minimize_{p,q,b} \; \sum_{(u,i) \in S} \big(r_{ui} - (\mu + b_u + b_i + \langle p_u, q_i \rangle)\big)^2 + \lambda \big[ \|p\|_{Frob}^2 + \|q\|_{Frob}^2 + \|b_{users}\|^2 + \|b_{items}\|^2 \big]

p_u \leftarrow (1 - \lambda\eta_t) p_u + \eta_t q_i \rho_{ui}, \quad q_i \leftarrow (1 - \lambda\eta_t) q_i + \eta_t p_u \rho_{ui}
b_u \leftarrow (1 - \lambda\eta_t) b_u + \eta_t \rho_{ui}, \quad b_i \leftarrow (1 - \lambda\eta_t) b_i + \eta_t \rho_{ui}, \quad \mu \leftarrow (1 - \lambda\eta_t) \mu + \eta_t \rho_{ui}
where \rho_{ui} = r_{ui} - (\mu + b_i + b_u + \langle p_u, q_i \rangle)
SLIDE 52 Factor models: Error vs. #parameters
[Plot: same RMSE vs. millions of parameters curves as before]
"who rated what"
SLIDE 53 Ratings are not given at random
- Marlin et al., “Collaborative Filtering and the Missing at Random Assumption”, UAI 2007
[Figure: comparison of Yahoo! survey answers, Yahoo! music ratings, and Netflix ratings]
SLIDE 54 Movie rating matrix
- Characterize users by which movies they rated
Edge attributes (observed, rating)
- Adding features to recommender system
[Figure: the rating matrix r_{ui} (users × movies) alongside the binary indicator matrix c_{ui} marking which movies each user rated]

r_{ui} = \mu + b_u + b_i + \langle p_u, q_i \rangle + \langle c_u, x_i \rangle \qquad (a regression term on the rating indicators)
SLIDE 55 Alternative integration
- Key idea - use related ratings to average
- Salakhutdinov & Mnih, 2007
- Koren et al., 2008
Overparametrize items by q and x:

q_i \leftarrow q_i + \sum_u c_{ui} p_u \qquad q_i \leftarrow q_i + \sum_u c_{ui} x_j
SLIDE 56 Factor models: Error vs. #parameters
[Plot: same RMSE vs. millions of parameters curves as before]
temporal effects
SLIDE 57 Something Happened in Early 2004…
2004 Netflix ratings by date
Netflix changed rating labels
SLIDE 58
Are movies getting better with time?
SLIDE 59 Sources of temporal change
- Items
- Seasonal effects (Christmas, Valentine's day, holiday movies)
- Public perception of movies (Oscar etc.)
- Users
- Changed labeling of reviews
- Anchoring (relative to previous movie)
- Change of rater in household
- Selection bias for time of viewing
SLIDE 60 Modeling temporal change
- Time-dependent bias
- Time-dependent user preferences
- Parameterize functions b and p
- Slow changes for items
- Fast sudden changes for users
- Good parametrization is key
r_{ui}(t) = \mu + b_u(t) + b_i(t) + \langle q_i, p_u(t) \rangle
Koren et al., KDD 2009 (CF with temporal dynamics)
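One possible parametrization in the spirit of that paper (the bin width and the drift exponent below are assumptions for illustration, not values from the slides): item biases change slowly, so bin them by coarse time periods; user biases drift smoothly around each user's mean rating date.

```python
import numpy as np

def item_bias(i, t, b_item, b_item_bin, bin_days=70):
    """Slowly changing item bias: a static part plus a per-time-bin offset."""
    return b_item[i] + b_item_bin[i, int(t // bin_days)]

def user_bias(u, t, b_user, alpha_user, t_mean, beta=0.4):
    """User bias with a smooth drift dev_u(t) = sign(t - t_u) |t - t_u|^beta."""
    dev = np.sign(t - t_mean[u]) * np.abs(t - t_mean[u]) ** beta
    return b_user[u] + alpha_user[u] * dev
```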
SLIDE 61 Sources of Variance in Netflix data
[Pie chart: biases 33%, personalization 10%, unexplained 57%]
1.276 (total variance) = 0.415 (biases) + 0.129 (personalization) + 0.732 (unexplained)
Bias matters
SLIDE 62 Factor models: Error vs. #parameters
[Plot: same RMSE vs. millions of parameters curves as before, with the reference lines Prize: 0.8563 and Netflix: 0.9514, annotated with the model equations from the plain factorization r_{ui} = q_i^\top p_u up to the full time-dependent model r_{ui}(t)]
SLIDE 63 More ideas
- Explain factorizations
- Cold start (new users)
- Different regularization for different parameter
groups / different users
- Sharing of statistical strength between users
- Hierarchical matrix co-clustering / factorization
(write a paper on that)
SLIDE 64
3 Session Modeling
SLIDE 65
Motivation
SLIDE 66 User interaction
- Explicit search query
- Search engine
- Genre selection on movie site
- Implicit search query
- News site
- Priority inbox
- Comments on article
- Viewing specific movie (see also ...)
- Sponsored search (advertising)
Space, users’ time and attention are limited.
SLIDE 67
SLIDE 68 session? models?
SLIDE 69 Did the user SCROLL DOWN?
SLIDE 70 Bad ideas ...
- Show items based on relevance
- Yes, this user likes Die Hard.
- But he likes other movies, too
- Show items only for majority of users
‘apple’ vs. ‘Apple’
SLIDE 71 User response
collapse collapse
implicit user interest log it!
SLIDE 72
hover on link
SLIDE 73 Response is conditioned on available options
- User search for ‘chocolate’
- What the user really would have wanted
- User can only pick from
available items
- Preferences are often relative
user picks this
SLIDE 74
Models
SLIDE 75 Independent click model
- Each object has click probability
- Object is viewed independently
- Used in computational advertising (with some position correction)
- Horribly wrong assumption
- OK if probability is very small (OK in ads)
p(x \mid s) = \prod_{i=1}^n \frac{1}{1 + e^{-x_i s_i}}
SLIDE 76 Logistic click model
- User picks at most one object
- Exponential family model for click
- Ignores order of objects
- Assumes that the user looks at all before taking action
p(x \mid s) = \frac{e^{s_x}}{e^{s_0} + \sum_{x'} e^{s_{x'}}} = \exp(s_x - g(s)) \qquad (s_0 \text{ corresponds to the no-click option})
SLIDE 77 Sequential click model
- User traverses list
- At each position some probability of clicking
- When user reaches end of the list he aborts
- This assumes that a patient user viewed all items
[Figure: the user skips (no click) the first j-1 results and clicks at position j]

p(x = j \mid s) = \Big[ \prod_{i=1}^{j-1} \frac{1}{1 + e^{s_i}} \Big] \frac{1}{1 + e^{-s_j}}
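A direct transcription of this formula (s holds the per-position scores; positions are 0-indexed here):

```python
import numpy as np

def p_click_at(j, s):
    """P(the single click lands at position j): the user skips every earlier
    position (probability 1/(1+e^{s_i}) each) and then clicks (1/(1+e^{-s_j}))."""
    s = np.asarray(s, dtype=float)
    skip_before = np.prod(1.0 / (1.0 + np.exp(s[:j])))
    click_here = 1.0 / (1.0 + np.exp(-s[j]))
    return skip_before * click_here
```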
SLIDE 78 Skip click model
- User traverses list
- At each position some probability of clicking
- At each position the user may abandon the process
- This assumes that user traverses list sequentially
[Figure: list traversal with a mix of clicks, skips, and eventual abandonment]
SLIDE 79 Context skip click model
- User traverses list
- At each position some probability of clicking which depends on previous content
- At each position the user may abandon the process
- User may click more than once
SLIDE 80
Context skip click model
SLIDE 81 Context skip click model
- Viewing probability
p(v_i = 1 \mid v_{i-1} = 0) = 0 \quad \text{(user is gone)}
p(v_i = 1 \mid v_{i-1} = 1, c_{i-1} = 0) = \frac{1}{1 + e^{-\alpha_i}}
p(v_i = 1 \mid v_{i-1} = 1, c_{i-1} = 1) = \frac{1}{1 + e^{-\beta_i}} \quad \text{(user returns after a click)}
- Click probability (only if viewed), functional form:
p(c_i = 1 \mid v_i = 1, c_{i-1}, d_i) = \frac{1}{1 + e^{-f(|c_{i-1}|, d_i, d_{i-1})}} \quad \text{(depends on prior context)}
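To see how these pieces chain together, here is a sketch of the log-likelihood of one session with the view indicators treated as observed (in the actual model v is latent and is integrated out variationally, see slide 83; the parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def trajectory_logprob(views, clicks, alpha, beta, f_scores):
    """views, clicks: 0/1 per position; alpha, beta: view parameters;
    f_scores[i] = f(|c_{i-1}|, d_i, d_{i-1}). Transcribes the conditionals above;
    impossible trajectories (viewing after leaving) get probability 0."""
    logp = 0.0
    for i in range(1, len(views)):
        if views[i - 1] == 0:
            p_view = 0.0                       # once the user is gone, they stay gone
        elif clicks[i - 1] == 0:
            p_view = sigmoid(alpha[i])
        else:
            p_view = sigmoid(beta[i])          # user returns after a click
        logp += np.log(p_view if views[i] else 1.0 - p_view)
    for i in range(len(views)):
        if views[i]:                           # clicks only occur on viewed positions
            p_click = sigmoid(f_scores[i])
            logp += np.log(p_click if clicks[i] else 1.0 - p_click)
    return logp
```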
SLIDE 82 Incremental gains score
- Submodular gain per additional document
- Relevance score per document
- Coverage over different aspects
- Position dependent score
- Score dependent on number of previous clicks
f(|c_{i-1}|, d_i, d_{i-1}) := \rho(S, d_i \mid a, b) - \rho(S, d_{i-1} \mid a, b) + \gamma |c_{i-1}| + \delta_i
 := \sum_{s \in S} \sum_j [s]_j \Big( a_j \sum_{d \in d_i} [d]_j + b_j \big( \rho_j(d_i) - \rho_j(d_{i-1}) \big) \Big) + \gamma |c_{i-1}| + \delta_i
SLIDE 83
We don't know v, i.e. whether the user actually viewed a result
- Use variational inference to integrate out v
(more next week in graphical models)
Optimization
\log p(c) \geq \log p(c) - D(q(v) \,\|\, p(v \mid c)) = E_{v \sim q(v)} [\log p(c) - \log q(v) + \log p(v \mid c)] = E_{v \sim q(v)} [\log p(c, v)] + H(q(v))
SLIDE 84 Optimization
- Compute latent viewing probability given clicks
- Easy since we only have one transition from
views to no views (no DP needed)
- Expected log-likelihood under viewing model
- Convex expected log-likelihood
- Stochastic gradient descent
- Parametrization uses personalization, too
(user, position, viewport, browser)
SLIDE 85
SLIDE 86
SLIDE 87
SLIDE 88
SLIDE 89
4 Feature Representation
SLIDE 90
Bayesian Probabilistic Matrix Factorization
SLIDE 91 Statistical Model
- Aldous-Hoover factorization
- normal distribution for
user and item attributes
- rating given by inner product
- Ratings
- Latent factors
[Graphical model: plates i = 1,...,N (users) and j = 1,...,M (movies); latent factors U_i, V_j with prior variances \sigma_U^2, \sigma_V^2 generate the observed ratings R_{ij} with noise variance \sigma^2]

p(R_{ij} \mid U_i, V_j, \sigma^2) = N(R_{ij} \mid U_i^\top V_j, \sigma^2)

p(U \mid \sigma_U^2) = \prod_{i=1}^N N(U_i \mid 0, \sigma_U^2 I), \qquad p(V \mid \sigma_V^2) = \prod_{j=1}^M N(V_j \mid 0, \sigma_V^2 I)
Salakhutdinov & Mnih, ICML 2008 (BPMF)
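For intuition, the Gibbs step for a single user factor under this model (item factors and the variances held fixed) is just a Bayesian ridge-regression draw; a hedged sketch:

```python
import numpy as np

def sample_user_factor(ratings_u, V, sigma2, sigma2_U, rng):
    """ratings_u: dict item -> rating for one user; V: item factors (M x d).
    Draws U_i from the Gaussian conditional p(U_i | R, V, sigma2, sigma2_U)."""
    d = V.shape[1]
    precision = np.eye(d) / sigma2_U
    lin = np.zeros(d)
    for j, r in ratings_u.items():
        precision += np.outer(V[j], V[j]) / sigma2
        lin += V[j] * r / sigma2
    cov = np.linalg.inv(precision)
    return rng.multivariate_normal(cov @ lin, cov)

# usage sketch: rng = np.random.default_rng(0); all U_i can be drawn in parallel given V
```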
SLIDE 92 Details
[Graphical model: as before, with Gaussian-Wishart hyperpriors \Theta_U, \Theta_V (hyperparameters \alpha_U, \alpha_V) on the distributions of the user and item factors]
- Priors on all factors
- Wishart prior is conjugate to the Gaussian, hence use it; variances are inferred automatically
- Inference (Gibbs sampler)
- Sample user factors (parallel)
- Sample movie factors (parallel)
- Sample hyperparameters (parallel)
SLIDE 93 Making it fancier (constrained BPMF)
[Graphical model: constrained BPMF; in addition to U_i, V_j, and R_ij, a latent matrix W (one column W_k per movie, k = 1,...,M) and each user's observed "who rated what" indicator vector I_i enter the user factor]
SLIDE 94 Results (Mnih & Salakhutdinov)
[Plots: RMSE vs. number of observed ratings per user (1-5, 6-10, ..., >641) for PMF, constrained PMF, and the movie-average baseline; plus the percentage of users falling in each bin]
helps for infrequent users
SLIDE 95
Multiple Sources
SLIDE 96 Social Network Data
Data: users, connections, features Goal: suggest connections
SLIDE 97 Social Network Data
Data: users, connections, features Goal: suggest connections
SLIDE 98 Social Network Data
Data: users, connections, features Goal: suggest connections
[Graphical model: observed user features y, y', latent user factors x, x', and the connection e between two users]
SLIDE 99 Social Network Data
Data: users, connections, features Goal: model/suggest connections
[Graphical model: same as the previous slide]

p(x, y, e) = \prod_{i \in \text{Users}} p(y_i) \, p(x_i \mid y_i) \prod_{i,j \in \text{Users}} p(e_{ij} \mid x_i, y_i, x_j, y_j)
Direct application of the Aldous-Hoover theorem. Edges are conditionally independent.
SLIDE 100
Applications
SLIDE 101
Applications
SLIDE 102
Applications
SLIDE 103 Applications
social network = friendship + interests
SLIDE 104 Applications
social network = friendship + interests
recommend users based
recommend apps based
SLIDE 105 Social Recommendation
- Recommend users (based on friendship & interests)
- make the user graph more dense
- boost traffic
- stickiness
- Recommend apps (based on interests)
- boost traffic
- increased revenue
- increased user participation
... usually addressed by separate tools ...
SLIDE 106 Homophily
- Recommend users based on friendship & interests; recommend apps based on interests
- Users with similar interests are more likely to connect
- Highly correlated; estimate both jointly
SLIDE 107 Model
[Graphical model: observed user features y, y' and latent user factors x, x' with edges e; observed app features u, latent app features v, and app installs a]
SLIDE 108 Model
[Graphical model: same as the previous slide]
- Social interaction: x_i \sim p(x \mid y_i), \; x_j \sim p(x \mid y_j), \; e_{ij} \sim p(e \mid x_i, y_i, x_j, y_j, \Phi)
- App install: x_i \sim p(x \mid y_i), \; v_j \sim p(v \mid u_j), \; a_{ij} \sim p(a \mid x_i, y_i, u_j, v_j, \Phi)
SLIDE 109 Model
- Social interaction: x_i = A y_i + \epsilon_i, \quad e_{ij} \sim p(e \mid x_i^\top x_j + y_i^\top W y_j)
- App install: v_j = B u_j + \tilde\epsilon_j, \quad a_{ij} \sim p(a \mid x_i^\top v_j + y_i^\top M u_j)
(the terms A y_i, B u_j reconstruct latent from observed features and address cold start; x^\top x and x^\top v are latent-factor terms, y^\top W y and y^\top M u are bilinear terms in the observed features)
SLIDE 110 Optimization Problem
minimize \;\; \lambda_e \sum_{(i,j)} l(e_{ij}, x_i^\top x_j + y_i^\top W y_j) \;+\; \lambda_a \sum_{(i,j)} l(a_{ij}, x_i^\top v_j + y_i^\top M u_j) \;+\; \lambda_x \sum_i \gamma(x_i \mid y_i) + \lambda_v \sum_i \gamma(v_i \mid u_i) \;+\; \lambda_W \|W\|^2 + \lambda_M \|M\|^2 + \lambda_A \|A\|^2 + \lambda_B \|B\|^2
SLIDE 111 Optimization Problem
Same objective as on the previous slide; the first term (on e_{ij}) is the social part.
SLIDE 112 Optimization Problem
Same objective; the first term is the social part, the second term (on a_{ij}) is the app part.
SLIDE 113 Optimization Problem
Same objective; the terms \gamma(x_i \mid y_i) and \gamma(v_i \mid u_i) are the reconstruction part.
SLIDE 114 Optimization Problem
Same objective; the norm terms on W, M, A, B form the regularizer.
SLIDE 115
Loss Function
SLIDE 116 Loss
- Much more evidence of application non-install
(i.e. many more negative examples)
- Few links between vertices in friendship
network (even within short graph distance)
- Generate ranking problems (link, non-link) with
non-links drawn from background set
SLIDE 117 Loss
[Figure: loss setup for application recommendation vs. social recommendation]
SLIDE 118 Optimization
- Nonconvex optimization problem
- Large set of variables
- Stochastic gradient descent on x, v, ε for speed
- Use hashing to reduce
memory load, i.e.
x_i = A y_i + \epsilon_i, \quad v_j = B u_j + \tilde\epsilon_j, \quad e_{ij} \sim p(e \mid x_i^\top x_j + y_i^\top W y_j), \quad a_{ij} \sim p(a \mid x_i^\top v_j + y_i^\top M u_j)

x_{ij} = \sigma(i, j) \, X[h(i, j)] \qquad (h: \text{bucket hash}, \; \sigma: \text{binary} \pm 1 \text{ hash})
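A toy version of such a hashed parameter store (the hash functions below are simple stand-ins, not the ones from the actual system):

```python
import numpy as np

class HashedFactors:
    """All factor entries live in one flat array X of size m; entry (i, k) is
    read as sigma(i, k) * X[h(i, k)], so memory stays fixed as users grow."""
    def __init__(self, m, seed=0):
        self.X = np.zeros(m)
        self.m = m
        self.seed = seed

    def _h(self, i, k):            # bucket hash
        return hash((self.seed, i, k)) % self.m

    def _sigma(self, i, k):        # binary {-1, +1} sign hash
        return 1 if hash((self.seed + 1, i, k)) % 2 == 0 else -1

    def get(self, i, dims):        # materialize the factor vector of object i
        return np.array([self._sigma(i, k) * self.X[self._h(i, k)]
                         for k in range(dims)])

    def add(self, i, grad):        # scatter a gradient step back into the table
        for k, g in enumerate(grad):
            self.X[self._h(i, k)] += self._sigma(i, k) * g
```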
SLIDE 119
Y! Pulse
SLIDE 120 Y! Pulse Data
1.2M users, 386 items 6.1M friend connections 29M interest indications
SLIDE 121 App Recommendation
SIM: similarity based model; RLFM: regression based latent factor model (Chen&Agarwal); NLFM: SIM&RLFM
SLIDE 122
Social recommendation
SLIDE 123 app recommendation L2 penalty
SLIDE 124
(user, user) (user, app) (app, advertisement)
- Users visiting several properties
news, mail, frontpage, social network, etc.
- Different statistical models
- Latent Dirichlet Allocation for latent factors
- Indian Buffet Process
Extensions
[Graphical model: observed user features y, y', latent user factors x, x', and edges e]
SLIDE 125
(same content as the previous slide; the graphical model is extended with app factors v, u and installs a)
SLIDE 126
(same content as the previous slide)
SLIDE 127
(same content as the previous slide)
SLIDE 128
More strategies
SLIDE 129 Multiple factor LDA
- Discrete set of preferences
(Porteous, Bart, Welling, 2008)
- User picks one to assess movie
- Movie represented by a discrete attribute
- Inference by Gibbs sampler
- Works fairly well
- Extension by Lester Mackey and coworkers to
combine with BPMF model
SLIDE 130 More state representations
(Griffiths & Ghahramani, 2005)
- Attribute vector is binary string
- Models preferences naturally & very compact
(Inference is costly)
- Hierarchical attribute representation and
clustering over users ... TO DO
SLIDE 131
5 Hashing
SLIDE 132 Parameter Storage
- We have millions of users
- We have millions of products
- Storage - for 100 factors this requires
10^6 x 10^6 x 8 bytes = 8 TB
- We want a model that can be kept in RAM (<16GB)
- Instant response for each user
- Disks have 20 IOP/s at best (SSDs much better)
- Privacy (what if parameter vector leaks)
SLIDE 133 Recall - Hash Kernels
[Figure: personalized spam filtering. The email "Hey, please mention subtly during your talk that people should use Yahoo mail more often. Thanks, Someone" is featurized for the instance's task/user (= barney). Instead of an explicit x_i ∈ R^{N×(U+1)}, each token is hashed twice, globally as h('mention') and per user as h('mention_barney'), each with a sign s(m), s(m_b) ∈ {-1, 1}, into a shared weight vector]

\sum_i \bar{w}[h(i)] \, \sigma(i) \, x_i

Similar to the count hash (Charikar, Chen, Farach-Colton, 2003)
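A toy version of this personalized hashing trick (the token list, user name, and hash choices below are illustrative; the point is that every token gets a global slot and a user-specific slot, each with a random sign, in one shared weight vector):

```python
import hashlib
import numpy as np

def bucket(key, num_bins):
    """Stable bucket hash h(key)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_bins

def sign(key):
    """Rademacher sign hash s(key) in {-1, +1}."""
    return 1 if int(hashlib.md5(("#" + key).encode()).hexdigest(), 16) % 2 == 0 else -1

def hashed_features(tokens, user, num_bins):
    """Each token contributes a global slot h(tok) and a personalized slot h(tok_user)."""
    x = np.zeros(num_bins)
    for tok in tokens:
        for key in (tok, tok + "_" + user):
            x[bucket(key, num_bins)] += sign(key)
    return x

# the linear classifier then scores sum_i w_bar[h(i)] * sigma(i) * x_i
x = hashed_features(["please", "mention", "yahoo"], user="barney", num_bins=2 ** 18)
```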
SLIDE 134
Collaborative Filtering
- Hashing compression
- Approximation error is O(1/n)
- To show that the estimate is unbiased, take the expectation over the Rademacher hash

u_i = \sum_{j,k : h(k,j) = i} \xi(k, j) U_{kj} \quad \text{and} \quad v_i = \sum_{j,k : h'(k,j) = i} \xi'(k, j) V_{kj}

X_{ij} := \sum_k \xi(k, i) \, \xi'(k, j) \, u_{h(k,i)} \, v_{h'(k,j)}
SLIDE 135
- Hashing compression
- Expectation
Collaborative Filtering
X_{ij} := \sum_k \xi(k, i) \, \xi'(k, j) \, u_{h(k,i)} \, v_{h'(k,j)}

with u_i = \sum_{j,k : h(k,j) = i} \xi(k, j) U_{kj} \quad \text{and} \quad v_i = \sum_{j,k : h'(k,j) = i} \xi'(k, j) V_{kj}

Expanding X_{ij} yields the term U_{ki} V_{kj} plus cross terms \xi(k, l) \xi'(k, o) U_{kl} V_{ko} over colliding entries h(k, l) = h(k, i), h'(k, o) = h'(k, j); the expectation of the cross terms vanishes under the random signs.
SLIDE 136 Collaborative Hashing
- Combine with stochastic gradient descent
- Random access in memory is expensive
(we now have to do k lookups per pair)
- Feistel networks can accelerate this
- Distributed optimization without locking
SLIDE 137 Examples
[Plots: reconstruction error vs. the number of rows kept in the hashed factor matrices M and U (thousands of elements), for the EachMovie and MovieLens datasets]
SLIDE 138 Summary
- Neighborhood methods
- User / movie similarity
- Iteration on graph
- Matrix Factorization
- Singular value decomposition
- Convex reformulation
- Ranking and Session Modeling
- Ordinal regression
- Session models
- Features
- Latent dense (Bayesian Probabilistic Matrix Factorization)
- Latent sparse (Dirichlet process factorization)
- Coldstart problem (inferring features)
- Hashing
SLIDE 139 Further reading
- Collaborative Filtering with temporal dynamics
http://research.yahoo.com/files/kdd-fp074-koren.pdf
- Neighborhood factorization
http://research.yahoo.com/files/paper.pdf
- Matrix Factorization for recommender systems
http://research.yahoo.com/files/ieeecomputer.pdf
- CoFi Rank (collaborative filtering & ranking)
http://www.cofirank.org/
http://research.yahoo.com/Yehuda_Koren