SLIDE 1

Scalable Machine Learning

  • 8. Recommender Systems

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

Significant content courtesy of Yehuda Koren

slide-2
SLIDE 2
  • 8. Recommender Systems

Much content courtesy of (Mr Netflix) Yehuda Koren

slide-3
SLIDE 3

Outline

  • Neighborhood methods
  • User / movie similarity
  • Iteration on graph
  • Matrix Factorization
  • Singular value decomposition
  • Convex reformulation
  • Ranking and Session Modeling
  • Ordinal regression
  • Session models
  • Features
  • Latent dense (Bayesian Probabilistic Matrix Factorization)
  • Latent sparse (Dirichlet process factorization)
  • Coldstart problem (inferring features)
  • Hashing
slide-4
SLIDE 4

Why

slide-6
SLIDE 6

Netflix

slide-10
SLIDE 10

Personalized Content

adapt to general popularity; pick based on user preferences

slide-11
SLIDE 11

Spam Filtering

Something went wrong!

slide-12
SLIDE 12

A more formal view

  • User (requests content)
  • Objects (that can be displayed)
  • Context (device, location, time)
  • Interface (mobile browser, tablet, viewport)

(diagram: user u, context c, and interface combine to recommend relevant objects)

slide-13
SLIDE 13

Examples

  • Movie recommendation (Netflix)
  • Related product recommendation (Amazon)
  • Web page ranking (Google)
  • Social recommendation (Facebook)
  • News content recommendation (Yahoo)
  • Priority inbox & spam filtering (Google)
  • Online dating (OK Cupid)
  • Computational Advertising (Yahoo)
slide-14
SLIDE 14

Running Example Netflix Movie Recommendation

Training data (score, date, movie, user):

  1  5/7/02    21   1
  5  8/2/04    213  1
  4  3/6/01    345  2
  4  5/1/05    123  2
  3  7/15/02   768  2
  5  1/22/01   76   3
  4  8/3/00    45   4
  1  9/10/05   568  5
  2  3/5/03    342  5
  2  12/28/00  234  5
  5  8/11/02   76   6
  4  6/15/03   56   6

Test data (score, date, movie, user):

  ?  1/6/05    62   1
  ?  9/13/04   96   1
  ?  8/18/05   7    2
  ?  11/22/05  3    2
  ?  6/13/02   47   3
  ?  8/12/01   15   3
  ?  9/1/00    41   4
  ?  8/27/05   28   4
  ?  4/4/05    93   5
  ?  7/16/03   74   5
  ?  2/14/04   69   6
  ?  10/3/03   83   6

slide-15
SLIDE 15

Challenges

  • Scalability
  • Millions of objects
  • 100s of millions of users
  • Cold start
  • Changing user base
  • Changing inventory (movies, stories, goods)
  • Attributes
  • Imbalanced dataset

User activity / item reviews are power law distributed

http://www.igvita.com/2006/10/29/dissecting-the-netflix-dataset/

slide-16
SLIDE 16

Netflix competition yardstick

  • Least mean squares prediction error
  • Easy to define
  • Wrong measure for composing sessions!
  • Consistent (in the large sample size limit this converges to the minimizer)

\mathrm{rmse}(S) = \sqrt{ |S|^{-1} \sum_{(i,u) \in S} (\hat{r}_{ui} - r_{ui})^2 }
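
As a sanity check, here is a minimal sketch of this yardstick in Python; the pairing of predictions with ratings is made up for illustration:

```python
import math

# RMSE over a set S of (prediction, rating) pairs, as in the formula above.
def rmse(S):
    return math.sqrt(sum((r_hat - r) ** 2 for r_hat, r in S) / len(S))

print(rmse([(3.5, 4.0), (2.0, 2.0), (5.0, 4.0)]))  # toy data
```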

slide-17
SLIDE 17

1 Neighborhood Methods

slide-18
SLIDE 18

Basic Idea

(figure: Joe's ratings are matched against similar users, producing ranked recommendations #1–#4)

slide-19
SLIDE 19
  • Derive unknown ratings from those of "similar" items
  • Basic Idea
  • (user,user) similarity to recommend items
  • good if item base is smaller than user base
  • good if item base changes rapidly
  • traverse bipartite similarity graph
  • (item,item) similarity to recommend new items that were also liked by the same users
  • good if the user base is small
  • Oldest known CF method
slide-20
SLIDE 20

Neighborhood based CF

(figure: a 6 users × 12 items rating matrix with entries between 1 and 5 and many unknown cells)

  • ? = unknown rating
  • ratings between 1 and 5

To fill in a missing entry, take a similarity-weighted average of known ratings, e.g. with similarities s13 = 0.2 and s16 = 0.3:

(0.2 · 2 + 0.3 · 3) / (0.2 + 0.3) = 2.6
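
The same weighted average in code, a minimal sketch using the slide's toy numbers:

```python
import numpy as np

# Similarity-weighted average from the slide: neighbors with similarities
# s13 = 0.2 and s16 = 0.3 gave ratings 2 and 3 respectively.
sims    = np.array([0.2, 0.3])
ratings = np.array([2.0, 3.0])
print(sims @ ratings / sims.sum())  # 2.6
```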


slide-25
SLIDE 25
  • Derive unknown ratings from those of "similar" items
  • Properties
  • Intuitive
  • No (substantial) training
  • Handles new users / items
  • Easy to explain to user
  • Accuracy & scalability questionable
slide-26
SLIDE 26

Normalization / Bias

  • Problem
  • Some items are significantly higher rated
  • Some users rate substantially lower
  • Ratings change over time
  • Bias correction is crucial for nearest neighbor recommender algorithms
  • Offset per user
  • Offset per movie
  • Time effects
  • Global bias

b_{ui} = \mu + b_u + b_i \quad \text{(global mean + user offset + item offset)}

Bell & Koren ICDM 2007 http://public.research.att.com/~volinsky/netflix/BellKorICDM07.pdf

slide-27
SLIDE 27

Baseline estimation

  • Mean rating is 3.7
  • Troll Hunter is 0.7 above mean
  • User rates 0.2 below mean
  • Baseline is 4.2 stars
  • Least mean squares problem
  • Jointly convex. Alternatively, remove the mean & iterate.

\text{minimize}_b \sum_{(u,i)} (r_{ui} - \mu - b_u - b_i)^2 + \lambda \Big[ \sum_u b_u^2 + \sum_i b_i^2 \Big]

b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu - b_u)}{\lambda + |R(i)|} \quad \text{and} \quad b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda + |R(u)|}
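
A minimal sketch of these alternating updates; the toy triples and sizes below are hypothetical:

```python
import numpy as np

# ratings: (user, item, r_ui) triples; alternate the closed-form bias updates.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)]
n_users, n_items, lam = 2, 2, 1.0
mu = np.mean([r for _, _, r in ratings])
b_u, b_i = np.zeros(n_users), np.zeros(n_items)

for _ in range(20):
    for i in range(n_items):
        resid = [r - mu - b_u[u] for u, j, r in ratings if j == i]
        b_i[i] = sum(resid) / (lam + len(resid))
    for u in range(n_users):
        resid = [r - mu - b_i[j] for v, j, r in ratings if v == u]
        b_u[u] = sum(resid) / (lam + len(resid))

print(mu, b_u, b_i)  # baseline prediction: mu + b_u[u] + b_i[i]
```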

slide-28
SLIDE 28

Parzen Windows style CF

  • Similarity measure sij between items
  • Find set sk(i,u) of k-nearest neighbors to i that

were rated by user u

  • Weighted average over the set
  • How to compute sij?

\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in s^k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\sum_{j \in s^k(i,u)} s_{ij}} \quad \text{where } b_{ui} = \mu + b_u + b_i

slide-29
SLIDE 29

each item rated by a distinct set of users

(figure: sparse user-rating vectors for item i and item j, overlapping only on a few shared users)

  • (item,item) similarity measures
  • Pearson correlation coefficient
  • nonuniform support
  • compute only over shared support
  • shrinkage towards 0 to address the problem of small support (typically few users in common)

s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}

slide-30
SLIDE 30

(item,item) similarity

  • Empirical Pearson correlation coefficient
  • Smoothing towards 0 for small support
  • Make neighborhood more peaked
  • Shrink towards baseline for small neighborhood

\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}

s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda} \hat{\rho}_{ij}, \qquad s_{ij} \to s_{ij}^2

\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in s^k(i,u)} s_{ij} (r_{uj} - b_{uj})}{\lambda + \sum_{j \in s^k(i,u)} s_{ij}}
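
A sketch of the shrunk similarity, assuming the two rating vectors are already restricted to the shared raters U(i, j):

```python
import numpy as np

def shrunk_similarity(ri, rj, bi, bj, lam=100.0):
    """ri, rj: ratings of items i and j by their shared users;
    bi, bj: the corresponding baselines b_ui, b_uj."""
    di, dj = ri - bi, rj - bj
    rho = di @ dj / np.sqrt((di @ di) * (dj @ dj))  # Pearson over shared support
    n = len(ri)                                     # |U(i, j)|
    return (n - 1) / (n - 1 + lam) * rho            # shrink towards 0
```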

slide-31
SLIDE 31

Similarity for binary data

  • Pearson correlation meaningless
  • Views
  • Purchase behavior
  • Clicks
  • Jaccard similarity

(intersection vs. joint)

  • Observed/expected ratio

Improve by counting per user (many users better than heavy users)

m_i: users acting on i; m_{ij}: users acting on both i and j; m: total number of users

s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}} \qquad s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}
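
Both similarities are easy to compute from per-item user sets; a minimal sketch (the set names are illustrative):

```python
# users_i, users_j: sets of users who acted on items i and j; m: total users.
def jaccard_sim(users_i, users_j, alpha=1.0):
    m_ij = len(users_i & users_j)
    return m_ij / (alpha + len(users_i) + len(users_j) - m_ij)

def observed_expected_sim(users_i, users_j, m, alpha=1.0):
    m_ij = len(users_i & users_j)
    return m_ij / (alpha + len(users_i) * len(users_j) / m)
```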

slide-32
SLIDE 32

2 Matrix Factorization

slide-33
SLIDE 33

Basics

slide-34
SLIDE 34

Basic Idea


M ≈ U · V

slide-35
SLIDE 35

Latent variable view

(figure: movies and users embedded in a 2D latent space with axes "geared towards females ↔ males" and "serious ↔ escapist"; e.g. The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility, and users Gus and Dave)

slide-36
SLIDE 36

Basic matrix factorization

(figure: the 6 users × 12 items rating matrix is approximated by a rank-3 SVD — a users × 3 factor matrix times a 3 × items factor matrix)

Estimate unknown ratings as inner products of latent factors (e.g. a missing entry is predicted as 2.4).


slide-40
SLIDE 40

Properties

  • SVD is undefined for missing entries
  • stochastic gradient descent (faster)
  • alternating optimization
  • Overfitting without regularization, particularly if fewer reviews than dimensions
  • Very popular on Netflix

(figure: the rank-3 SVD approximation again)

  • SVD isn't defined when entries are unknown

slide-41
SLIDE 41

(figure: "Factor models: Error vs. #parameters" — RMSE from 0.875 to 0.91 against millions of parameters (10 to 100,000, log scale) for NMF, BiasSVD, SVD++, SVD v.2, v.3, v.4; Netflix baseline 0.9514, Prize target 0.8563)

slide-42
SLIDE 42

Risk Minimization View

  • Objective Function
  • Alternating least squares

\text{minimize}_{p,q} \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2 + \lambda \left[ \|p\|^2_{\text{Frob}} + \|q\|^2_{\text{Frob}} \right]

p_u \leftarrow \Big[ \lambda \mathbf{1} + \sum_{i | (u,i) \in S} q_i q_i^\top \Big]^{-1} \sum_{i | (u,i) \in S} q_i r_{ui} \qquad q_i \leftarrow \Big[ \lambda \mathbf{1} + \sum_{u | (u,i) \in S} p_u p_u^\top \Big]^{-1} \sum_{u | (u,i) \in S} p_u r_{ui}

good for MapReduce
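
A minimal alternating least squares sketch for this objective; it assumes a dense toy setup where R holds the ratings and a boolean mask S marks the observed entries:

```python
import numpy as np

def als(R, S, k=3, lam=0.1, iters=20):
    n_u, n_i = R.shape
    P, Q = np.random.randn(n_u, k), np.random.randn(n_i, k)
    for _ in range(iters):
        for u in range(n_u):                       # solve for user factors
            idx = S[u].nonzero()[0]
            A = lam * np.eye(k) + Q[idx].T @ Q[idx]
            P[u] = np.linalg.solve(A, Q[idx].T @ R[u, idx])
        for i in range(n_i):                       # solve for item factors
            idx = S[:, i].nonzero()[0]
            A = lam * np.eye(k) + P[idx].T @ P[idx]
            Q[i] = np.linalg.solve(A, P[idx].T @ R[idx, i])
    return P, Q
```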

slide-43
SLIDE 43

Risk Minimization View

  • Objective Function
  • Stochastic gradient descent
  • No need for locking
  • Multicore updates asynchronously

(Recht, Re, Wright, 2012 - Hogwild)

\text{minimize}_{p,q} \sum_{(u,i) \in S} (r_{ui} - \langle p_u, q_i \rangle)^2 + \lambda \left[ \|p\|^2_{\text{Frob}} + \|q\|^2_{\text{Frob}} \right]

much faster

p_u \leftarrow (1 - \lambda \eta_t) p_u + \eta_t q_i (r_{ui} - \langle p_u, q_i \rangle) \qquad q_i \leftarrow (1 - \lambda \eta_t) q_i + \eta_t p_u (r_{ui} - \langle p_u, q_i \rangle)
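
One SGD epoch for the same objective, sketched:

```python
# P, Q: user and item factor arrays; triples: observed (u, i, r_ui) entries.
def sgd_epoch(P, Q, triples, eta=0.01, lam=0.05):
    for u, i, r in triples:
        err = r - P[u] @ Q[i]                                # r_ui - <p_u, q_i>
        P[u], Q[i] = (1 - lam * eta) * P[u] + eta * err * Q[i], \
                     (1 - lam * eta) * Q[i] + eta * err * P[u]
```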

slide-44
SLIDE 44

Theoretical Motivation

slide-45
SLIDE 45

deFinetti Theorem

  • Independent random variables
  • Exchangeable random variables
  • There exists a conditionally independent

representation of exchangeable r.v. This motivates latent variable models

p(X) = \prod_{i=1}^m p(x_i)

p(X) = p(x_1, \ldots, x_m) = p(x_{\pi(1)}, \ldots, x_{\pi(m)})

(figure: plate diagrams — i.i.d. x_i, and x_i conditionally independent given \theta)

p(X) = \int dp(\theta) \prod_{i=1}^m p(x_i | \theta)

slide-46
SLIDE 46

Aldous Hoover Factorization

  • Matrix-valued set of random variables

Example - Erdos Renyi graph model

  • Independently exchangeable on matrix
  • Aldous Hoover Theorem

p(E) = \prod_{i,j} p(E_{ij})

p(E) = p(E_{11}, E_{12}, \ldots, E_{mn}) = p(E_{\pi(1)\rho(1)}, E_{\pi(1)\rho(2)}, \ldots, E_{\pi(m)\rho(n)})

p(E) = \int dp(\theta) \int \prod_{i=1}^m dp(u_i) \prod_{j=1}^n dp(v_j) \prod_{i,j} p(E_{ij} | u_i, v_j, \theta)

slide-47
SLIDE 47

Aldous Hoover Factorization

  • Rating matrix is (row, column) exchangeable
  • Draw latent variables per row and column
  • Draw matrix entries independently given pairs
  • Absence / presence of rating is a signal
  • Can be extended to graphs with vertex attributes

(figure: bipartite diagram of row factors u1–u6, column factors v1–v5, and observed entries e_ij)

slide-48
SLIDE 48

Aldous Hoover variants

  • Jointly exchangeable matrix
  • Social network graphs
  • Draw vertex attributes first, then edges
  • Cold start problem
  • New user appears
  • Attributes (age, location, browser)
  • Can estimate latent variables from that
  • User and item factors in the matrix factorization problem can be viewed as AH factors

slide-49
SLIDE 49

Improvements

slide-50
SLIDE 50

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41)

Add biases

slide-51
SLIDE 51

Bias

  • Objective Function
  • Stochastic gradient descent

\text{minimize}_{p,q,b} \sum_{(u,i) \in S} \big(r_{ui} - (\mu + b_u + b_i + \langle p_u, q_i \rangle)\big)^2 + \lambda \left[ \|p\|^2_{\text{Frob}} + \|q\|^2_{\text{Frob}} + \|b_{\text{users}}\|^2 + \|b_{\text{items}}\|^2 \right]

p_u \leftarrow (1 - \lambda \eta_t) p_u + \eta_t q_i \rho_{ui} \qquad q_i \leftarrow (1 - \lambda \eta_t) q_i + \eta_t p_u \rho_{ui}
b_u \leftarrow (1 - \lambda \eta_t) b_u + \eta_t \rho_{ui} \qquad b_i \leftarrow (1 - \lambda \eta_t) b_i + \eta_t \rho_{ui} \qquad \mu \leftarrow (1 - \lambda \eta_t) \mu + \eta_t \rho_{ui}
\text{where } \rho_{ui} = r_{ui} - (\mu + b_i + b_u + \langle p_u, q_i \rangle)
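
The corresponding SGD step with biases, sketched; as in the updates above, the global mean µ is shrunk as well:

```python
def sgd_step_bias(P, Q, b_u, b_i, mu, u, i, r, eta=0.005, lam=0.02):
    rho = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])  # residual rho_ui
    P[u], Q[i] = (1 - lam * eta) * P[u] + eta * rho * Q[i], \
                 (1 - lam * eta) * Q[i] + eta * rho * P[u]
    b_u[u] = (1 - lam * eta) * b_u[u] + eta * rho
    b_i[i] = (1 - lam * eta) * b_i[i] + eta * rho
    return (1 - lam * eta) * mu + eta * rho         # updated global bias
```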

slide-52
SLIDE 52

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41)

"who rated what"

slide-53
SLIDE 53

Ratings are not given at random

  • B. Marlin et al., "Collaborative Filtering and the Missing at Random Assumption", UAI 2007

(figure: rating histograms — Yahoo! survey answers vs. Yahoo! music ratings vs. Netflix ratings)

slide-54
SLIDE 54

Movie rating matrix

  • Characterize users by which movies they rated: edge attributes (observed, rating)
  • Adding features to recommender system

(figure: the rating matrix r_ui side by side with the binary "who rated what" indicator matrix c_ui)

r_{ui} = \mu + b_u + b_i + \langle p_u, q_i \rangle + \langle c_u, x_i \rangle \quad \text{(regression term)}

slide-55
SLIDE 55

Alternative integration

  • Key idea - use related ratings to average
  • Salakhudtinov & Mnih, 2007
  • Koren et al., 2008

Overparametrize items by q and x:

q_i \leftarrow q_i + \sum_u c_{ui} p_u \qquad q_i \leftarrow q_i + \sum_u c_{ui} x_j

slide-56
SLIDE 56

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41)

temporal effects

slide-57
SLIDE 57

Something Happened in Early 2004…

2004 Netflix ratings by date

Netflix changed rating labels

slide-58
SLIDE 58

Are movies getting better with time?

slide-59
SLIDE 59

Sources of temporal change

  • Items
  • Seasonal effects (Christmas, Valentine's day, holiday movies)
  • Public perception of movies (Oscars etc.)
  • Users
  • Changed labeling of reviews
  • Anchoring (relative to previous movie)
  • Change of rater in household
  • Selection bias for time of viewing
slide-60
SLIDE 60

Modeling temporal change

  • Time-dependent bias
  • Time-dependent user preferences
  • Parameterize functions b and p
  • Slow changes for items
  • Fast sudden changes for users
  • Good parametrization is key

r_{ui}(t) = \mu + b_u(t) + b_i(t) + \langle q_i, p_u(t) \rangle

Koren et al., KDD 2009 (CF with temporal dynamics)

slide-61
SLIDE 61

Sources of Variance in Netflix data

total variance 1.276 = 0.732 unexplained (57%) + 0.415 biases (33%) + 0.129 personalization (10%)

Bias matters

slide-62
SLIDE 62

(figure: "Factor models: Error vs. #parameters" plot, as on SLIDE 41 — Netflix baseline 0.9514, Prize target 0.8563)

r_{ui} = q_i^\top p_u \quad \longrightarrow \quad r_{ui}(t) = b_u(t) + b_i(t) + q_i^\top \Big( p_u(t) + \sum_j b_{uj} x_j \Big)

slide-63
SLIDE 63

More ideas

  • Explain factorizations
  • Cold start (new users)
  • Different regularization for different parameter groups / different users
  • Sharing of statistical strength between users
  • Hierarchical matrix co-clustering / factorization (write a paper on that)

slide-64
SLIDE 64

3 Session Modeling

slide-65
SLIDE 65

Motivation

slide-66
SLIDE 66

User interaction

  • Explicit search query
  • Search engine
  • Genre selection on movie site
  • Implicit search query
  • News site
  • Priority inbox
  • Comments on article
  • Viewing specific movie (see also ...)
  • Sponsored search (advertising)

Space, users’ time and attention are limited.

slide-68
SLIDE 68

session? models?

slide-69
SLIDE 69

Did the user SCROLL DOWN?

slide-70
SLIDE 70

Bad ideas ...

  • Show items based on relevance
  • Yes, this user likes Die Hard.
  • But he likes other movies, too
  • Show items only for the majority of users ('apple' vs. 'Apple')

slide-71
SLIDE 71

User response

(figure: content module screenshots — users collapse stories they dislike; collapsing is a signal of implicit user interest, so log it!)

slide-72
SLIDE 72

hover on link

slide-73
SLIDE 73

Response is conditioned on available options

  • User searches for 'chocolate'
  • What the user really would have wanted
  • User can only pick from available items
  • Preferences are often relative

(figure: result grid for 'chocolate'; the user picks one of the displayed items)

slide-74
SLIDE 74

Models

slide-75
SLIDE 75

Independent click model

  • Each object has click probability
  • Object is viewed independently
  • Used in computational advertising (with some position correction)
  • Horribly wrong assumption
  • OK if probability is very small (OK in ads)

p(x|s) = \prod_{i=1}^n \frac{1}{1 + e^{-x_i s_i}}

slide-76
SLIDE 76

Logistic click model

  • User picks at most one object
  • Exponential family model for click
  • Ignores order of objects
  • Assumes that the user looks at all before taking action

p(x|s) = \frac{e^{s_x}}{e^{s_0} + \sum_{x'} e^{s_{x'}}} = \exp(s_x - g(s)) \qquad (s_0: \text{the no-click option})

slide-77
SLIDE 77

Sequential click model

  • User traverses list
  • At each position some probability of clicking
  • When user reaches end of the list he aborts
  • This assumes that a patient user viewed all items

(no click at positions 1, \ldots, j-1; click at position j)

p(x = j | s) = \left[ \prod_{i=1}^{j-1} \frac{1}{1 + e^{s_i}} \right] \frac{1}{1 + e^{-s_j}}
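
A direct transcription of this probability, as a sketch:

```python
import math

def p_click_at(j, s):
    """Probability that the single click lands at position j (1-indexed),
    given per-position scores s, under the sequential click model."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    p = 1.0
    for i in range(j - 1):
        p *= 1.0 - sigmoid(s[i])   # no click at positions 1..j-1
    return p * sigmoid(s[j - 1])   # click at position j
```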

slide-78
SLIDE 78

Skip click model

  • User traverses list
  • At each position some probability of clicking
  • At each position the user may abandon the process
  • This assumes that user traverses list sequentially

(figure: the user skips some items (no click), clicks others, and may abandon at any point)

slide-79
SLIDE 79

Context skip click model

  • User traverses list
  • At each position some probability of clicking which depends on previous content
  • At each position the user may abandon the process
  • User may click more than once

slide-81
SLIDE 81

Context skip click model

  • Viewing probability
  • Click probability (only if viewed)

p(v_i = 1 | v_{i-1} = 0) = 0 \quad \text{(user is gone)}
p(v_i = 1 | v_{i-1} = 1, c_{i-1} = 0) = \frac{1}{1 + e^{-\alpha_i}}
p(v_i = 1 | v_{i-1} = 1, c_{i-1} = 1) = \frac{1}{1 + e^{-\beta_i}} \quad \text{(user returns)}

Functional form: p(c_i = 1 | v_i = 1, c_{i-1}, d_i) = \frac{1}{1 + e^{-f(|c_{i-1}|, d_i, d_{i-1})}} \quad \text{(prior context)}

slide-82
SLIDE 82

Incremental gains score

  • Submodular gain per additional document
  • Relevance score per document
  • Coverage over different aspects
  • Position dependent score
  • Score dependent on number of previous clicks

f(|c_{i-1}|, d_i, d_{i-1}) := \rho(S, d_i | a, b) - \rho(S, d_{i-1} | a, b) + \gamma |c_{i-1}| + \delta_i

(the coverage gain \rho(S, d_i | a, b) - \rho(S, d_{i-1} | a, b) expands as a sum over aspects j of the query set S, weighted by a_j and b_j)

slide-83
SLIDE 83
Optimization

  • Latent variables: we don't know v, i.e. whether the user viewed a result
  • Use variational inference to integrate out v (more next week in graphical models)

-\log p(c) \le -\log p(c) + D(q(v) \| p(v|c)) = \mathbb{E}_{v \sim q(v)}\left[-\log p(c) + \log q(v) - \log p(v|c)\right] = \mathbb{E}_{v \sim q(v)}\left[-\log p(c, v)\right] - H(q(v))

slide-84
SLIDE 84

Optimization

  • Compute latent viewing probability given clicks
  • Easy since we only have one transition from views to no views (no DP needed)
  • Expected log-likelihood under viewing model
  • Convex expected log-likelihood
  • Stochastic gradient descent
  • Parametrization uses personalization, too (user, position, viewport, browser)

slide-89
SLIDE 89

4 Feature Representation

slide-90
SLIDE 90

Bayesian Probabilistic Matrix Factorization

slide-91
SLIDE 91

Statistical Model

  • Aldous-Hoover factorization
  • normal distribution for user and item attributes
  • rating given by inner product

(figure: graphical model — user factors U_i (i = 1, ..., N) and item factors V_j (j = 1, ..., M) with prior variances \sigma_U, \sigma_V generate ratings R_{ij} with noise \sigma)

Ratings: p(R_{ij} | U_i, V_j, \sigma^2) = \mathcal{N}(R_{ij} | U_i^\top V_j, \sigma^2)

Latent factors: p(U | \sigma_U^2) = \prod_{i=1}^N \mathcal{N}(U_i | 0, \sigma_U^2 I), \qquad p(V | \sigma_V^2) = \prod_{j=1}^M \mathcal{N}(V_j | 0, \sigma_V^2 I)

Salakhutdinov & Mnih, ICML 2008 (BPMF)

slide-92
SLIDE 92

Details

(figure: graphical model with hyperpriors \Theta_U, \Theta_V, \alpha_U, \alpha_V on the user and item factors)

  • Priors on all factors
  • Wishart prior is conjugate to Gaussian, hence use it
  • Allows us to adapt the variance automatically
  • Inference (Gibbs sampler)
  • Sample user factors (parallel)
  • Sample movie factors (parallel)
  • Sample hyperparameters (parallel)
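
A minimal sketch of the user-factor sweep with fixed hyperparameters (alpha: rating precision, tau: prior precision — both stand-ins for the sampled hyperparameters; a dense R with NaN for missing entries is assumed):

```python
import numpy as np

def gibbs_sweep_users(R, U, V, alpha=2.0, tau=1.0):
    k = U.shape[1]
    for i in range(U.shape[0]):
        obs = ~np.isnan(R[i])                              # items rated by user i
        Lam = tau * np.eye(k) + alpha * V[obs].T @ V[obs]  # posterior precision
        mu = np.linalg.solve(Lam, alpha * V[obs].T @ R[i, obs])
        U[i] = np.random.multivariate_normal(mu, np.linalg.inv(Lam))
    return U
```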
slide-93
SLIDE 93

Making it fancier (constrained BPMF)

(figure: graphical model for constrained BPMF — the latent user factors U_i are shifted by a weighted sum of per-item offset vectors W_k over the items the user rated)

who rated what

slide-94
SLIDE 94

Results (Mnih & Salakthudtinov)

(figure: RMSE by number of observed ratings per user (1–5 up to >641) for PMF, Constrained PMF, and the movie-average baseline, plus the fraction of users in each bucket)

helps for infrequent users

slide-95
SLIDE 95

Multiple Sources

slide-96
SLIDE 96

Social Network Data

Data: users, connections, features. Goal: suggest connections

slide-99
SLIDE 99

Social Network Data

Data: users, connections, features. Goal: model/suggest connections

(figure: two users with observed features y, y' and latent features x, x', connected by an edge e)

p(x, y, e) = \prod_{i \in \text{Users}} p(y_i) p(x_i | y_i) \prod_{i,j \in \text{Users}} p(e_{ij} | x_i, y_i, x_j, y_j)

Direct application of the Aldous-Hoover theorem. Edges are conditionally independent.

slide-103
SLIDE 103

Applications

social network = friendship + interests

slide-104
SLIDE 104

Applications

social network = friendship + interests

  • recommend users based on friendship & interests
  • recommend apps based on friendship & interests
slide-105
SLIDE 105

Social Recommendation

  • recommend users based on friendship & interests
  • boost traffic
  • make the user graph more dense
  • increase user population
  • stickiness
  • recommend apps based on friendship & interests
  • boost traffic
  • increased revenue
  • increased user participation
  • make app graph more dense

... usually addressed by separate tools ...

slide-106
SLIDE 106

Homophily

  • recommend users based on friendship & interests: users with similar interests are more likely to connect
  • recommend apps based on friendship & interests: friends install similar applications

Highly correlated. Estimate both jointly.

slide-107
SLIDE 107

Model

(figure: graphical model — observed user features y, latent user features x, observed app features u, latent app features v, social edges e, and app installs)
slide-108
SLIDE 108

Model

(figure: combined graphical model with user features x, y and app features v, u, social edges e, and installs a)

  • Social interaction: x_i \sim p(x | y_i), \; x_j \sim p(x | y_j), \; e_{ij} \sim p(e | x_i, y_i, x_j, y_j, \Phi)
  • App install: x_i \sim p(x | y_i), \; v_j \sim p(v | u_j), \; a_{ij} \sim p(a | x_i, y_i, u_j, v_j, \Phi)

slide-109
SLIDE 109

Model

  • Social interaction
  • App install

x_i = A y_i + \epsilon_i \qquad v_j = B u_j + \tilde{\epsilon}_j \qquad \text{(cold start)}

e_{ij} \sim p(e | x_i^\top x_j + y_i^\top W y_j) \qquad a_{ij} \sim p(a | x_i^\top v_j + y_i^\top M u_j) \qquad \text{(latent + bilinear features)}

slide-110
SLIDE 110

Optimization Problem

\text{minimize} \quad \lambda_e \sum_{(i,j)} l(e_{ij}, x_i^\top x_j + y_i^\top W y_j) \quad \text{(social)}
\quad + \lambda_a \sum_{(i,j)} l(a_{ij}, x_i^\top v_j + y_i^\top M u_j) \quad \text{(app)}
\quad + \lambda_x \sum_i \gamma(x_i | y_i) + \lambda_v \sum_i \gamma(v_i | u_i) \quad \text{(reconstruction)}
\quad + \lambda_W \|W\|^2 + \lambda_M \|M\|^2 + \lambda_A \|A\|^2 + \lambda_B \|B\|^2 \quad \text{(regularizer)}


slide-115
SLIDE 115

Loss Function

slide-116
SLIDE 116

Loss

  • Much more evidence of application non-installs (i.e. many more negative examples)
  • Few links between vertices in the friendship network (even within short graph distance)
  • Generate ranking problems (link, non-link) with non-links drawn from a background set

slide-117
SLIDE 117

Loss

application recommendation social recommendation

slide-118
SLIDE 118

Optimization

  • Nonconvex optimization problem
  • Large set of variables
  • Stochastic gradient descent on x, v, ε for speed
  • Use hashing to reduce memory load, i.e.

x_{ij} = \sigma(i, j) X[h(i, j)] \qquad (\sigma: \text{binary hash}, \; h: \text{hash})

slide-119
SLIDE 119

Y! Pulse

slide-120
SLIDE 120

Y! Pulse Data

1.2M users, 386 items, 6.1M friend connections, 29M interest indications

slide-121
SLIDE 121

App Recommendation

SIM: similarity based model; RLFM: regression based latent factor model (Chen&Agarwal); NLFM: SIM&RLFM

slide-122
SLIDE 122

Social recommendation

slide-123
SLIDE 123

app recommendation L2 penalty

slide-124
SLIDE 124
Extensions

  • Multiple relations: (user, user), (user, app), (app, advertisement)
  • Users visiting several properties: news, mail, frontpage, social network, etc.
  • Different statistical models
  • Latent Dirichlet Allocation for latent factors
  • Indian Buffet Process

(figure: the user-user and user-app graphical models chained together across relations)


slide-128
SLIDE 128

More strategies

slide-129
SLIDE 129

Multiple factor LDA

  • Discrete set of preferences (Porteous, Bart, Welling, 2008)
  • User picks one to assess a movie
  • Movie represented by a discrete attribute
  • Inference by Gibbs sampler
  • Works fairly well
  • Extension by Lester Mackey and coworkers to combine with the BPMF model

slide-130
SLIDE 130

More state representations

  • Indian Buffet Process (Griffiths & Ghahramani, 2005)
  • Attribute vector is a binary string
  • Models preferences naturally & very compactly (inference is costly)
  • Hierarchical attribute representation and clustering over users ... TO DO

slide-131
SLIDE 131

5 Hashing

slide-132
SLIDE 132

Parameter Storage

  • We have millions of users
  • We have millions of products
  • Storage - for 100 factors this requires 10^6 × 10^6 × 8 bytes = 8 TB
  • We want a model that can be kept in RAM (<16GB)
  • Instant response for each user
  • Disks have 20 IOP/s at best (SSDs much better)
  • Privacy (what if the parameter vector leaks)
slide-133
SLIDE 133

Recall - Hash Kernels

(figure: hash-kernel example — the message "Hey, please mention subtly during your talk that people should use Yahoo mail more often." is featurized per task/user (= barney): a global feature h('mention') and a personalized feature h('mention_barney'), each carrying a sign s(·) ∈ {−1, 1})

x_i \in \mathbb{R}^{N \times (U+1)} \qquad \sum_i \bar{w}[h(i)] \sigma(i) x_i

Similar to the count hash (Charikar, Chen, Farach-Colton, 2003)

slide-134
SLIDE 134
Collaborative Filtering

  • Hashing compression
  • Approximation error is O(1/n)
  • To show that the estimate is unbiased, take the expectation over the Rademacher hash

u_i = \sum_{j,k: h(j,k)=i} \xi(j,k) U_{jk} \quad \text{and} \quad v_i = \sum_{j,k: h(j,k)=i} \xi(j,k) V_{jk}

X_{ij} := \sum_k \xi(k,i) \xi(k,j) u_{h(k,i)} v_{h(k,j)}
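
A sketch of this compression scheme; the hash h and sign xi below are illustrative stand-ins for whatever Rademacher hash is actually used:

```python
def h(j, k, n):
    return hash((j, k)) % n                      # bucket in the compressed array

def xi(j, k):
    return 1 if hash((j, k, 's')) % 2 else -1    # sign in {-1, +1}

def compressed_entry(i, j, u, v, K, n):
    """Reconstruct X_ij = sum_k xi(k,i) xi(k,j) u[h(k,i)] v[h(k,j)]."""
    return sum(xi(k, i) * xi(k, j) * u[h(k, i, n)] * v[h(k, j, n)]
               for k in range(K))
```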

slide-135
SLIDE 135
  • Hashing compression
  • Expectation

u_i = \sum_{j,k: h(k,j)=i} \xi(k,j) U_{kj} \quad \text{and} \quad v_i = \sum_{j,k: h'(k,j)=i} \xi'(k,j) V_{kj}

X_{ij} := \sum_k \xi(k,i) \xi'(k,j) \sum_{l,\kappa: h(\kappa,l)=h(k,i)} \; \sum_{o,\kappa: h'(\kappa,o)=h'(k,j)} \xi(\kappa,l) \xi'(\kappa,o) U_{\kappa l} V_{\kappa o}

(in expectation all cross terms vanish)

slide-136
SLIDE 136

Collaborative Hashing

  • Combine with stochastic gradient descent
  • Random access in memory is expensive (we now have to do k lookups per pair)
  • Feistel networks can accelerate this
  • Distributed optimization without locking
slide-137
SLIDE 137

Examples

(figure: RMSE vs. number of hashed rows in M and U, for the EachMovie and MovieLens datasets)

slide-138
SLIDE 138

Summary

  • Neighborhood methods
  • User / movie similarity
  • Iteration on graph
  • Matrix Factorization
  • Singular value decomposition
  • Convex reformulation
  • Ranking and Session Modeling
  • Ordinal regression
  • Session models
  • Features
  • Latent dense (Bayesian Probabilistic Matrix Factorization)
  • Latent sparse (Dirichlet process factorization)
  • Coldstart problem (inferring features)
  • Hashing
slide-139
SLIDE 139

Further reading

  • Collaborative Filtering with temporal dynamics

http://research.yahoo.com/files/kdd-fp074-koren.pdf

  • Neighborhood factorization

http://research.yahoo.com/files/paper.pdf

  • Matrix Factorization for recommender systems

http://research.yahoo.com/files/ieeecomputer.pdf

  • CoFi Rank (collaborative filtering & ranking)

http://www.cofirank.org/

  • Yehuda Koren’s papers

http://research.yahoo.com/Yehuda_Koren