
Recommender Systems: Tutorial Andras Benczur Institute for Computer Science and Control Hungarian Academy of Sciences Supported by the EC FET Open project "New tools and algorithms for directed network analysis" (NADINE No 288956)


  1. Latent factor models • Items and users described by unobserved factors • Each item i is summarized by a d-dimensional vector P_i • Similarly, each user u is summarized by Q_u • Predicted rating for item i by user u: the inner product of P_i and Q_u, i.e. Σ_k P_ik Q_uk
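As a minimal illustration of the inner-product prediction (the factor values below are made-up toy numbers, and d = 2 is an arbitrary choice of dimensionality):

```python
import numpy as np

# Hypothetical toy factors with d = 2 latent dimensions.
# P[i] describes item i, Q[u] describes user u.
P = np.array([[0.9, -0.4],   # item 0
              [0.2,  1.1]])  # item 1
Q = np.array([[1.0,  0.5],   # user 0
              [-0.3, 0.8]])  # user 1

def predict(u, i):
    """Predicted rating = inner product of the user and item factors."""
    return float(np.dot(Q[u], P[i]))

print(predict(0, 0))  # 0.9*1.0 + (-0.4)*0.5 = 0.7
```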

  2. Koren and Bell's example • [Figure: movies placed in a two-dimensional latent space, a serious↔escapist axis against a geared-towards-females↔geared-towards-males axis, with titles such as Braveheart, Amadeus, The Color Purple, Lethal Weapon, Sense and Sensibility, Ocean's 11, Dave, The Lion King, Dumb and Dumber, The Princess Diaries, Independence Day, and Gus.]

  3. Warmup • Hypertext-induced topic search (HITS) • Connections to Singular Value Decomposition • Ranking in Web Retrieval – a not-so-well-known matrix factorization application Source of some slides: Monika Henzinger's Stanford CS361 talk

  4. Motivation http://recsys.acm.org/ http://icml.cc/2014/ http://www.kdd.org/kdd2014/ – examples of an authority (content) versus a hub (link collection)

  5. Neighborhood graph • A subgraph associated to each query: the query results Result 1 … Result n form the start set; the back set b_1 … b_m consists of pages linking to a result, and the forward set f_1 … f_s of pages linked from a result • An edge for each hyperlink, but no edges within the same host

  6. HITS [Kleinberg 98] • Goal: Given a query find: o Good sources of content (authorities) o Good sources of links (hubs)

  7. Intuition • Authority comes from in-edges. Being a good hub comes from out-edges. • Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.

  8. HITS details Repeat until h and a converge: • Normalize h and a • h[v] := Σ a[u_i] for all u_i with Edge(v, u_i) • a[v] := Σ h[w_i] for all w_i with Edge(w_i, v)
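The iteration above can be sketched in a few lines of NumPy; the 3-page adjacency matrix is a made-up toy graph, and a fixed iteration count stands in for a convergence test:

```python
import numpy as np

# A[i, j] = 1 if page i links to page j (toy 3-page web graph).
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(3)                  # hub scores
a = np.ones(3)                  # authority scores
for _ in range(50):             # "repeat until convergence"
    a = A.T @ h                 # a[v] = sum of h over in-neighbours of v
    h = A @ a                   # h[v] = sum of a over out-neighbours of v
    a /= np.linalg.norm(a)      # normalize
    h /= np.linalg.norm(h)
```

As the next slides show, a converges to the top eigenvector of A^T A and h to that of A A^T.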

  9. HITS and matrices A_ij = 1 if (i, j) is an edge, 0 otherwise • a^(k+1)T = h^(k)T A • h^(k+1)T = a^(k+1)T A^T • Hence h^(k+1)T = h^(1)T (A A^T)^k and a^(k+1)T = a^(1)T (A^T A)^k

  10. HITS and matrices II • Decomposition theorem: A^T A = V W V^T and A A^T = U W U^T, where V V^T = U U^T = I and W = diag(w_1², …, w_n²) • Then a^(k+1)T = a^(1)T (A^T A)^k = a^(1)T V diag(w_1^2k, …, w_n^2k) V^T and h^(k+1)T = h^(1)T (A A^T)^k = h^(1)T U diag(w_1^2k, …, w_n^2k) U^T • Writing a = α_1 v_1 + … + α_n v_n (so that a^T v_i = α_i), the iteration amplifies the components belonging to the largest w_i

  11. Hubs and Authorities example

  12. Octave example • octave:1> • octave:2> h=[1,1,1,1,1] • octave:3> a=h*L • octave:4> h=a*transpose(L) • … • octave:12> h=[0,0,1,0,0] • octave:13> a=h*L • octave:14> h=a*transpose(L) • octave:15> [U,S,V]=svd(L) • octave:16> A=U*S*transpose(V) • octave:17> a=h*L/2.1889 • octave:4> h=a*transpose(L)/2.1889 • …

  13. Example Compare the authority scores of node D to nodes B1, B2, and B3 (despite being drawn in two separate pieces, it is a single graph). • Values from running the 2-step hub-authority computation, starting from the all-ones vector. • Formula for running the k-step hub-authority computation. • Rank order, as k goes to infinity. • Intuition: the difference between pages that have multiple reinforcing endorsements and those that simply have high in-degree.

  14. HITS and path concentration • [A²]_ij = Σ_k A_ik A_kj = number of paths of length exactly 2 between i and j (or maybe also shorter ones, if A_ii > 0) • A^k = |{paths of length k between the endpoints}| • (A A^T) = |{alternating back-and-forth routes}| • (A A^T)^k = |{routes alternating back and forth k times}|
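A quick numeric check of the path-counting identity on a made-up 3-node cycle:

```python
import numpy as np

# A[i, j] = 1 iff there is an edge i -> j (toy cycle 0 -> 1 -> 2 -> 0).
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

A2 = A @ A   # [A^2][i, j] = sum_k A[i, k] A[k, j] = number of length-2 paths
print(A2[0, 2])   # exactly one such path: 0 -> 1 -> 2
```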

  15. Guess the best hubs and authorities! • And the second best ones? • HITS is unstable: reversing a single connecting edge completely changes the scores

  16. Singular Value Decomposition (SVD) • Handy mathematical technique that has application to many problems • Given any m × n matrix A, an algorithm finds matrices U, W, and V such that A = U W V^T, where U is m × m and orthonormal, W is m × n and diagonal, and V is n × n and orthonormal • Notion of orthonormality?
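A sketch of the decomposition with NumPy on an arbitrary small matrix (note that `np.linalg.svd` returns V^T directly and the singular values as a vector):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, w, Vt = np.linalg.svd(A)          # A = U W V^T
W = np.zeros_like(A)                 # rebuild the m x n diagonal W
W[:len(w), :len(w)] = np.diag(w)

# U (m x m) and V (n x n) are orthonormal, W (m x n) is diagonal.
assert np.allclose(U @ U.T, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(2))
assert np.allclose(A, U @ W @ Vt)
```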

  17. Orthonormal Basis • With the orthonormal basis V = (v_1 v_2 … v_n), write a = α_1 v_1 + … + α_n v_n; then a^T v_i = α_i, i.e. [a^T V]_i = α_i • Consequently a^T (A^T A)^k = a^T V diag(w_1^2k, …, w_n^2k) V^T: each coordinate α_i is scaled by w_i^2k

  18. SVD and PCA • Principal Components Analysis (PCA): approximating a high-dimensional data set with a lower-dimensional subspace • [Figure: data points scattered around the original axes, with the first and second principal components overlaid.]

  19. SVD and Ellipsoids • {y = Ax : ||x|| = 1} is an ellipsoid with axes u_i of length w_i: Σ_i [U^T y]_i² / w_i² = 1 • [Figure: the same data points with the first and second principal components overlaid on the original axes.]

  20. Projection of graph nodes by A • [Figure: the first three singular components of a social network; clusters found by k-means.] • In A^T, the x_i are the base vectors of the nodes • When will two nodes be near? When their rows of A are close in cosine distance

  21. Recall the recommender example • [Figure: the same two-dimensional movie map as before, serious↔escapist against geared-towards-females↔geared-towards-males.]

  22. SVD proof: start with the longest axis … • Select v_1 to maximize ||Ax|| over ||x|| = 1; let w_1 = max ||Ax|| • Compute u_1 = A v_1 / w_1 • u_1 should play the same role for A^T: maximize ||A^T y|| over ||y|| = 1 – but why u_1? • Fix the conditions ||x|| = ||y|| = 1; then w_1 = max ||Ax|| ≥ max |y^T A x|, and in fact equal, as u_1 is in the direction of A v_1 • The same holds for x^T A^T y = (y^T A x)^T, so max ||A^T y|| = max |y^T A x| = w_1

  23. Surprise: We Are Done! • We need to show U^T A V = W (why?) • Extend u_1 and v_1 to orthonormal bases U* = (u_1 | U'), V* = (v_1 | V') and consider A* = U*^T A V* • A*_11 = u_1^T A v_1 = w_1 by the way we defined u_1 • The remaining entries of the first row and column of A* are of the form x^T A y and x^T A^T y with unit vectors, hence cannot exceed w_1; if any were nonzero, a suitable unit vector would give ||A*x|| > w_1, contradicting maximality, so they are zero • We have the first row and column; proceed by induction on the remaining block

  24. SVD with missing values • Most of the rating matrix is unknown • The Expectation Maximization algorithm: A_ij^(t+1) = A_ij if the rating is known, and Σ_k U_ki V_kj otherwise • Seems impossible as matrix A becomes dense, but … • For example, the Lanczos algorithm multiplies this matrix (or its transpose) with a vector x: the imputed part is the cheap operation Σ_k U_ki (Σ_j V_kj x_j) • Seemed promising but badly overfits – no way to "regularize" the elements of U and V (keep them small) • The imputed values will quickly dominate the matrix
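The EM-style imputation can be sketched as follows; the 3 × 3 rating matrix, the rank k = 1, and the iteration count are all toy assumptions (and, as the slide warns, nothing here regularizes U and V):

```python
import numpy as np

# Toy rating matrix; NaN marks the unknown entries.
A = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 1.0, 5.0]])
known = ~np.isnan(A)
k = 1                                       # target rank

X = np.where(known, A, np.nanmean(A))       # initial imputation: the mean
for _ in range(100):
    U, w, Vt = np.linalg.svd(X, full_matrices=False)
    X_hat = U[:, :k] * w[:k] @ Vt[:k]       # best rank-k fit of the dense X
    X = np.where(known, A, X_hat)           # keep known ratings, re-impute rest
```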

  25. General overview of MF approaches • Model: R ≈ P^T Q o How we approximate user preferences o r̂_u,i = p_u^T q_i • Objective function (error function) o What do we want to minimize or optimize? o E.g. optimize for RMSE with regularization: L = Σ_(u,i)∈Train (r_u,i − r̂_u,i)² + λ_P ||P||² + λ_Q ||Q||² • Learning method o How do we improve the objective function? o E.g. stochastic gradient descent (SGD)

  26. Matrix Factorization Recommenders • Singular Value Decomposition: R = U^T S V, with dimensions (M × N) = (M × M)(M × N)(N × N) • Stochastic Gradient Descent: R ≈ P^T Q, with dimensions (M × N) ≈ (M × k)(k × N) • In our case M is the number of users, N the number of items, and R the original (sparse) rating matrix • In comparison to SVD, the SGD factors are not ranked • Ranked factors: iterative SGD optimizes only a single factor at a time

  27. Iterative Stochastic Gradient Descent ("Simon Funk") • Iteration 1: fit a rank-1 approximation, an M × 1 factor times a 1 × N factor • Iteration 2: fix factor 1 and optimize only for factor 2, giving an (M × 2)(2 × N) approximation • … • Iteration k: fix factors 1..k−1 and optimize only for factor k, so that R (M × N) ≈ (M × k)(k × N)

  28. [Numeric example: a sparse rating matrix R approximated by factor matrices P and Q with concrete values.]

  29. [The same numeric example, with the product P^T Q filling in predictions for the missing entries of R.]

  30. Simplest SGD: Perceptron Learning • Compute a 0-1 or a graded function of the weighted sum of the inputs x_1 w_1 + x_2 w_2 + … + x_n w_n = Σ_i w_i x_i • g is the activation function: output = g(Σ_i w_i x_i)

  31. Perceptron Algorithm Input: dataset D, int number_of_iterations, float learning_rate
  initialize weights w_1, …, w_n randomly
  for (int i = 0; i < number_of_iterations; i++) do
      for each instance (x^(j), y^(j)) in D do
          y' = Σ_k x_k^(j) w_k
          err = y^(j) − y'
          for each w_k do
              d_j,k = learning_rate · err · x_k^(j)
              w_k = w_k + d_j,k
          end for
      end foreach
  end for
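A runnable sketch of the algorithm above with a linear activation; the training set (targets from y = 2·x_1 − x_2, with a constant 1.0 input acting as the bias) and the hyperparameters are made up for illustration:

```python
import random

random.seed(0)

def train_perceptron(data, number_of_iterations, learning_rate):
    """SGD on squared error with a linear unit (the delta rule above)."""
    n = len(data[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]
    for _ in range(number_of_iterations):
        for x, y in data:
            y_pred = sum(xk * wk for xk, wk in zip(x, w))  # y' = sum x_k w_k
            err = y - y_pred
            for k in range(n):
                w[k] += learning_rate * err * x[k]         # w_k += rate*err*x_k
    return w

# Toy targets from y = 2*x1 - x2; the leading constant 1.0 is the bias input.
data = [([1.0, 1.0, 0.0], 2.0),
        ([1.0, 0.0, 1.0], -1.0),
        ([1.0, 1.0, 1.0], 1.0),
        ([1.0, 2.0, 1.0], 3.0)]
w = train_perceptron(data, 500, 0.05)
```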

  32. The learning step is a derivative • Squared error target function err² = (y − Σ w_i x_i)² • Derivative with respect to w_i: d err² / d w_i = −2 x_i (y − Σ w_i x_i) = −2 x_i · err, so the update w_i ← w_i + learning_rate · err · x_i steps against the gradient

  33. Matrix factorization • We estimate matrix M as the product of two matrices U and V . • Based on the known values of M , we search for U and V so that their product best estimates the (known) values of M

  34. Matrix factorization algorithm • Random initialization of U and V • While U × V does not approximate the known values of M well enough: o choose a known value of M o adjust the values of the corresponding row of U and column of V to improve the approximation
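The loop above can be sketched as follows; the known entries of M, the rank k = 2, the step size, and the epoch count are all toy assumptions:

```python
import numpy as np

# Toy data: the known entries of M as a dict (row, column) -> value.
rng = np.random.default_rng(0)
M = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0, (2, 2): 5.0}
n_users, n_items, k = 3, 3, 2
eps = 0.05                                   # adjustment step size

U = rng.normal(0.0, 0.1, (n_users, k))
V = rng.normal(0.0, 0.1, (n_items, k))

for _ in range(500):                         # sweep the known values
    for (u, i), m in M.items():
        err = m - U[u] @ V[i]                # how far off the product is
        # Adjust row u of U and row i of V, each proportional to the
        # error and to the other factor's value.
        U[u], V[i] = U[u] + eps * err * V[i], V[i] + eps * err * U[u]
```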

  35. Example for an adjustment step • (2·2) + (1·1) = 5, which equals the selected value, so we do not do anything

  36. Example for an adjustment step • (3·1) + (2·3) = 9, and 9 > 4, so we decrease the values of the corresponding row and column so that their product will be closer to 4

  37. What is a good adjustment step? 1. Adjustment proportional to the error: let it be ε times the error o Example: error = 9 − 4 = 5; with ε = 0.1, the decrease is proportional to 0.1·5 = 0.5 • (3·1) + (2·3) = 9

  38. What is a good adjustment step? 2. Take into account how much a value contributes to the error o For the selected row: 3 is multiplied by 1, so 3 is adjusted by ε·5·1 = 0.5; 2 is multiplied by 3, so 2 is adjusted by ε·5·3 = 1.5 o For the selected column, respectively: ε·5·3 = 1.5 and ε·5·2 = 1.0

  39. Result of the adjustment step (ε = 0.1) • row values decrease by ε·5·1 = 0.5 and ε·5·3 = 1.5, giving (2.5, 0.5) • column values decrease by ε·5·3 = 1.5 and ε·5·2 = 1.0, giving (−0.5, 2) • new product: (2.5 · (−0.5)) + (0.5 · 2) = −0.25

  40. Gradient Descent • Why is the previously shown adjustment step a good one (at least in theory)? • Error function: sum of squared errors • Each value of U and V is a variable of the error function, so we take partial derivatives: err² = (u_1 v_1 + u_2 v_2 − m)², d err² / d u_1 = 2 (u_1 v_1 + u_2 v_2 − m) v_1 • Minimization of the error by gradient descent leads to the previously shown adjustment steps

  41. Gradient Descent Summary • We want to minimize RMSE o Same as minimizing MSE: MSE = (1/|R_test|) Σ_(u,i)∈R_test (r_ui − r̂_ui)² = (1/|R_test|) Σ_(u,i)∈R_test (r_ui − Σ_{k=1..K} p_uk q_ki)² • The minimum is where the derivatives are zero o because the error surface is quadratic • SGD optimization

  42. BRISMF model • Biased Regularized Incremental Simultaneous Matrix Factorization • Applies regularization to prevent overfitting • Uses bias values to further decrease RMSE • Model: r̂_ui = p_u^T q_i + b_u + c_i = Σ_{k=1..K} p_uk q_ki + b_u + c_i

  43. BRISMF Learning • Loss function: L = Σ_(u,i)∈R_train (r_ui − Σ_{k=1..K} p_uk q_ki − b_u − c_i)² + λ (Σ_{u,k} p_uk² + Σ_{i,k} q_ki² + Σ_u b_u² + Σ_i c_i²) • SGD update rules, with e_ui the prediction error: p_uk ← p_uk + η(e_ui q_ki − λ p_uk); q_ki ← q_ki + η(e_ui p_uk − λ q_ki); b_u ← b_u + η(e_ui − λ b_u); c_i ← c_i + η(e_ui − λ c_i)

  44. BRISMF – steps • Initialize P and Q randomly • For each iteration: o get the next rating from R o update P and Q (and the biases) simultaneously using the update rules • Do until… o the training error is below a threshold o the test error stops decreasing o other stopping criteria are also possible
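A minimal sketch of these steps on hypothetical toy ratings; the data, sizes, and hyperparameters (η, λ, K) are assumptions, and for simplicity the "next rating" is taken by sweeping the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 2, 5.0)]
n_users, n_items, K = 3, 3, 2
eta, lam = 0.05, 0.01                        # learning rate, regularization

P = rng.normal(0.0, 0.1, (n_users, K))       # user factors
Q = rng.normal(0.0, 0.1, (n_items, K))       # item factors
b = np.zeros(n_users)                        # user biases
c = np.zeros(n_items)                        # item biases

for _ in range(500):
    for u, i, r in ratings:
        e = r - (P[u] @ Q[i] + b[u] + c[i])  # prediction error e_ui
        p_old = P[u].copy()                  # simultaneous update: keep old p
        P[u] += eta * (e * Q[i] - lam * P[u])
        Q[i] += eta * (e * p_old - lam * Q[i])
        b[u] += eta * (e - lam * b[u])
        c[i] += eta * (e - lam * c[i])
```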

  45. CS345 Data Mining (2009) Recommendation Systems Netflix Challenge Anand Rajaraman, Jeffrey D. Ullman

  46. Content-based recommendations • Main idea: recommend to customer C items similar to previous items rated highly by C • Movie recommendations o recommend movies with the same actor(s), director, genre, … • Websites, blogs, news o recommend other sites with "similar" content

  47. Plan of action • [Diagram: from the items the user likes (e.g. red circles and triangles), build a user profile; match it against the item profiles; recommend the matching items.]

  48. Item Profiles • For each item, create an item profile • A profile is a set of features o movies: author, title, actor, director, … o text: set of "important" words in the document • How to pick important words? o The usual heuristic is TF.IDF (Term Frequency times Inverse Doc Frequency)

  49. TF.IDF • f_ij = frequency of term t_i in document d_j; TF_ij = f_ij / max_k f_kj • n_i = number of docs that mention term i; N = total number of docs; IDF_i = log(N / n_i) • TF.IDF score: w_ij = TF_ij × IDF_i • Doc profile = set of words with the highest TF.IDF scores, together with their scores
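A small sketch of the score, assuming the common normalizations TF_ij = f_ij / max_k f_kj and IDF_i = log(N / n_i), on made-up toy documents:

```python
import math

# Toy corpus: each document is a list of terms.
docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]
N = len(docs)

def tf_idf(term, doc):
    f = doc.count(term)                              # raw term frequency
    max_f = max(doc.count(t) for t in set(doc))      # most frequent term
    n = sum(1 for d in docs if term in d)            # docs mentioning the term
    return (f / max_f) * math.log(N / n)             # TF * IDF

print(tf_idf("apple", docs[0]))  # TF = 2/2, IDF = log(3/2)
```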

  50. User profiles and prediction • User profile possibilities: o weighted average of rated item profiles o variation: weight by difference from the average rating for the item o … • Prediction heuristic o Given user profile c and item profile s, estimate u(c, s) = cos(c, s) = c·s / (|c||s|) o Need an efficient method to find items with high utility: later

  51. Model-based approaches • For each user, learn a classifier that classifies items into rating classes o liked by the user and not liked by the user o e.g., Bayesian, regression, SVM • Apply the classifier to each item to find recommendation candidates • Problem: scalability • Won't investigate further in this class

  52. Limitations of the content-based approach • Finding the appropriate features o e.g., images, movies, music • Overspecialization o never recommends items outside the user's content profile o people might have multiple interests • Recommendations for new users o how to build a profile? o recent result: 20 ratings are more valuable than content

  53. Similarity-based Collaborative Filtering • Consider user c • Find a set D of other users whose ratings are "similar" to c's ratings • Estimate c's ratings based on the ratings of the users in D

  54. Similar users • Let r_x be the vector of user x's ratings • Cosine similarity measure o sim(x, y) = cos(r_x, r_y) • Pearson correlation coefficient o computed over S_xy = items rated by both users x and y

  55. Rating predictions • Let D be the set of k users most similar to c who have rated item s • Possibilities for the prediction function (item s): o r_cs = (1/k) Σ_{d∈D} r_ds o r_cs = (Σ_{d∈D} sim(c, d) · r_ds) / (Σ_{d∈D} sim(c, d))
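The similarity-weighted prediction (the second formula) can be sketched directly; the rating dictionaries are made-up toy data, and cosine similarity over the co-rated items serves as sim(c, d):

```python
import math

ratings = {                        # user -> {item: rating}
    "c": {"a": 4.0, "b": 5.0},
    "d1": {"a": 4.0, "b": 4.0, "s": 5.0},
    "d2": {"a": 1.0, "b": 2.0, "s": 2.0},
}

def cosine_sim(x, y):
    common = set(x) & set(y)                           # co-rated items
    num = sum(x[i] * y[i] for i in common)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return num / (nx * ny)

def predict(c, s, neighbours):
    """r_cs = sum_d sim(c, d) * r_ds / sum_d sim(c, d)."""
    num = sum(cosine_sim(ratings[c], ratings[d]) * ratings[d][s]
              for d in neighbours)
    den = sum(cosine_sim(ratings[c], ratings[d]) for d in neighbours)
    return num / den

print(predict("c", "s", ["d1", "d2"]))
```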

  56. Complexity • The expensive step is finding the k most similar customers: O(|U|) • Too expensive to do at runtime • Need to pre-compute • Naïve precomputation takes time O(N|U|) • Tricks give some speedup • Clustering and partitioning can be used as alternatives, but quality degrades

  57. The traditional similarity approach • One of the earliest algorithms • Warning: performance is very poor • Improved version next …

  58. Factorization Machine (Steffen Rendle) • Model: linear regression plus pairwise rank-k interactions: ŷ(x) = w_0 + Σ_i w_i x_i + Σ_{i<j} ⟨v_i, v_j⟩ x_i x_j • A substitute for traditional matrix factorization: with one-hot user and item indicator features, the pairwise term reduces to ⟨v_u, v_i⟩, i.e. the biased MF model above • If items have attributes (e.g. content, tf.idf, …), they simply enter as further features • One (but not the only) way to train is by gradient descent
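A sketch of the model equation with random toy parameters, using Rendle's O(k·n) reformulation of the pairwise interaction term:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2                                  # number of features, factor rank
w0 = 0.1                                     # global bias (toy value)
w = rng.normal(0.0, 0.1, n)                  # linear weights
V = rng.normal(0.0, 0.1, (n, k))             # one rank-k factor row per feature

def fm_predict(x):
    """y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    linear = w0 + w @ x
    s = V.T @ x                              # sum_i v_if x_i, per factor f
    # pairwise term = 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)
    pairwise = 0.5 * float(np.sum(s ** 2 - (V ** 2).T @ (x ** 2)))
    return float(linear + pairwise)

# e.g. a one-hot user index, a one-hot item index, and an extra attribute
x = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
y = fm_predict(x)
```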
