http://cs246.stanford.edu Training data 100 million ratings, - PowerPoint PPT Presentation

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

 Training data  100 million ratings, 480,000 users, 17,770 movies  6 years of data: 2000-2005  Test data  Last few ratings of each user (2.8 million)  Evaluation criterion: Root Mean Square Error ( RMSE )  Netflix’s system RMSE: 0.9514  Competition  2,700+ teams  $1 million prize for 10% improvement on Netflix 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2

480,000 users Matrix R 1 3 4 3 5 5 4 5 5 3 17,700 3 movies 2 2 2 5 2 1 1 3 3 1 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 3

480,000 users Matrix R 1 3 4 𝒔 𝟒,𝟕 3 5 5 4 5 5 3 17,700 3 movies 2 ? ? Training Data Set Test Data Set ? 2 1 ? 3 ? 1 True rating of user x on item i 𝑦𝑗 2 SSE = 𝑠 𝑦𝑗 − 𝑠 (𝑗,𝑦)∈𝑆 Predicted rating 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 4

 The winner of the Netflix Challenge  Multi-scale modeling of the data: Combine top level, “regional” Global effects modeling of the data, with a refined, local view:  Global: Factorization  Overall deviations of users/movies  Factorization: Collaborative filtering  Addressing “regional” effects  Collaborative filtering:  Extract local patterns 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 5

 Global:  Mean movie rating: 3.7 stars  The Sixth Sense is 0.5 stars above avg.  Joe rates 0.2 stars below avg.  Baseline estimation: Joe will rate The Sixth Sense 4 stars  Local neighborhood (CF/NN):  Joe didn’t like related movie Signs   Final estimate: Joe will rate The Sixth Sense 3.8 stars 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 6

 Earliest and most popular collaborative filtering method  Derive unknown ratings from those of “ similar ” movies (item-item variant)  Define similarity measure s ij of items i and j  Select k - nearest neighbors, compute the rating  N(i; x): items most similar to i that were rated by x   s r  ij xj  ˆ j N ( i ; x ) r s ij … similarity of items i and j  xi r uj … rating of user x on item j s N(i;x) … set of items similar to  ij j N ( i ; x ) item i that were rated by x 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 7

 In practice we get better estimates if we model deviations:    s ( r b ) ^  ij xj xj   j N ( i ; x ) r b  xi xi s  ij j N ( i ; x ) baseline estimate for r xi Problems/Issues: 1) Similarity measures are “arbitrary” 𝒄 𝒚𝒋 = 𝝂 + 𝒄 𝒚 + 𝒄 𝒋 2) Pairwise similarities neglect interdependencies among users μ = overall mean rating 3) Taking a weighted average can be b x = rating deviation of user x restricting = ( avg. rating of user x ) – μ Solution: Instead of s ij use w ij that b i = ( avg. rating of movie i ) – μ we estimate directly from data 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 8

 Use a weighted sum rather than weighted avg. : 𝑠 = 𝑐 𝑦𝑗 + 𝑥 𝑗𝑘 𝑠 𝑦𝑘 − 𝑐 𝑦𝑘 𝑦𝑗 𝑘∈𝑂(𝑗;𝑦)  A few notes:  We sum over all movies j that are similar to i and were rated by x  𝒙 𝒋𝒌 is the interpolation weight (some real number)  We allow: 𝒙 𝒋𝒌 ≠ 𝟐 𝒌∈𝑶(𝒋,𝒚)  𝒙 𝒋𝒌 models interaction between pairs of movies (it does not depend on user x )  𝑶(𝒋; 𝒚) … set of movies rated by user x that are similar to movie i 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 9

= 𝑐 𝑦𝑗 +  𝑠 𝑥 𝑗𝑘 𝑠 𝑦𝑘 − 𝑐 𝑦𝑘 𝑦𝑗 𝑘∈𝑂(𝑗,𝑦)  How to set w ij ? 𝑣𝑗 2  Remember, error metric is SSE : 𝑠 𝑣𝑗 − 𝑠 (𝑗,𝑣)∈𝑆  Find w ij that minimize SSE on training data!  Models relationships between item i and its neighbors j  w ij can be learned/estimated based on x and all other users that rated i Why is this a good idea? 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 10

1 3 4 3 5 5  Here is what we just did: 4 5 5 3 3  Goal: Make good recommendations 2 ? ? ? 2 1 ?  Quantify goodness using SSE: 3 ? 1 So, Lower SSE means better recommendations  We want to make good recommendations on items that some user has not yet seen. Can’t really do this. Why?  Let’s set values w such that they work well on known (user, item) ratings And hope these w s will predict well the unknown ratings  This is the first time in the class that we see Optimization methods 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 11

 Idea: Let’s set values w such that they work well on known (user, item) ratings  How to find such values w ?  Idea: Define an objective function and solve the optimization problem  Find w ij that minimize SSE on training data ! 2 min 𝑥 𝑗𝑘 𝑐 𝑦𝑗 + 𝑥 𝑗𝑘 𝑠 𝑦𝑘 − 𝑐 𝑦𝑘 − 𝑠 𝑦𝑗 𝑦 𝑘∈𝑂 𝑗;𝑦  Think of w as a vector of numbers 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 12

 We have the optimization 2 problem, now what? min 𝑥 𝑗𝑘 𝑐 𝑦𝑗 + 𝑥 𝑗𝑘 𝑠 𝑦𝑘 − 𝑐 𝑦𝑘 − 𝑠 𝑦𝑗  Gradient decent 𝑦 𝑘∈𝑂 𝑗;𝑦  Iterate until convergence: 𝒙 𝒙 −  𝜶𝒙  … learning rate  where 𝜶𝒙 is gradient (derivative evaluated on data): 𝜖 𝛼𝑥 = = 2 𝑐 𝑦𝑗 + 𝑥 𝑗𝑙 𝑠 𝑦𝑙 − 𝑐 𝑦𝑙 − 𝑠 𝑠 𝑦𝑘 − 𝑐 𝑦𝑘 𝑦𝑗 𝜖𝑥 𝑗𝑘 𝑦 𝑙∈𝑂 𝑗;𝑦 for 𝒌 ∈ {𝑶 𝒋; 𝒚 , ∀𝒋, ∀𝒚 } 𝜖 𝜖𝑥 𝑗𝑘 = 𝟏 else  Note: we fix movie i , go over all r xi , for every movie 𝒌 ∈ 𝑶 𝒋; 𝒚 , while | w new - w old | > ε : 𝝐 w old = w new we compute 𝝐𝒙 𝒋𝒌 w new = w old -  ·  w old 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 13

= 𝑐 𝑦𝑗 +  So far: 𝑠 𝑥 𝑗𝑘 𝑠 𝑦𝑘 − 𝑐 𝑦𝑘 𝑦𝑗 𝑘∈𝑂(𝑗;𝑦)  Weights w ij derived based Global effects on their role; no use of an arbitrary similarity measure ( w ij  s ij ) Factorization  Explicitly account for interrelationships among CF/NN the neighboring movies  Next: Latent factor model  Extract “regional” correlations 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 14

Global average: 1.1296 User average: 1.0651 Movie average: 1.0533 Netflix: 0.9514 Basic Collaborative filtering: 0.94 CF+Biases+learnt weights: 0.91 Grand Prize: 0.8563 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 15

Serious Braveheart The Color Amadeus Purple Lethal Sense and Weapon Sensibility Ocean’s 11 Geared Geared towards towards males females The Lion King The Princess Independence Diaries Day Dumb and Dumber Funny 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 16

SVD: A = U  V T  “SVD” on Netflix data: R ≈ Q · P T f factors users .1 -.4 .2 1 3 5 5 4 users -.5 .6 .5 4 5 4 2 1 3 items f factors 1.1 -.2 .3 .5 -2 -.5 .8 -.4 .3 1.4 2.4 -.9 -.2 .3 .5 2 4 1 2 3 4 3 5 ≈ -.8 .7 .5 1.4 .3 -1 1.4 2.9 -.7 1.2 -.1 1.3 1.1 2.1 .3 2 4 5 4 2 items 2.1 -.4 .6 1.7 2.4 .9 -.3 .4 .8 .7 -.6 .1 -.7 2.1 -2 4 3 4 2 2 5 P T -1 .7 .3 1 3 3 2 4 Q R  For now let’s assume we can approximate the rating matrix R as a product of “thin” Q · P T  R has missing entries but let’s ignore that for now!  Basically, we will want the reconstruction error to be small on known ratings and we don’t care about the values on the missing ones 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 17

 How to estimate the missing rating of 𝑼 user x for item i ? 𝒚𝒋 = 𝒓 𝒋 ⋅ 𝒒 𝒚 𝒔 users 1 3 5 5 4 = 𝒓 𝒋𝒈 ⋅ 𝒒 𝒚𝒈 4 ? 5 4 2 1 3 items ≈ 2 4 1 2 3 4 3 5 𝒈 2 4 5 4 2 4 3 4 2 2 5 q i = row i of Q 1 3 3 2 4 p x = column x of P T .1 -.4 .2 users f factors -.5 .6 .5 items 1.1 -.2 .3 .5 -2 -.5 .8 -.4 .3 1.4 2.4 -.9 -.2 .3 .5 -.8 .7 .5 1.4 .3 -1 1.4 2.9 -.7 1.2 -.1 1.3 1.1 2.1 .3 2.1 -.4 .6 1.7 2.4 .9 -.3 .4 .8 .7 -.6 .1 -.7 2.1 -2 P T -1 .7 .3 f factors Q 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 18

 How to estimate the missing rating of 𝑼 user x for item i ? 𝒚𝒋 = 𝒓 𝒋 ⋅ 𝒒 𝒚 𝒔 users 1 3 5 5 4 = 𝒓 𝒋𝒈 ⋅ 𝒒 𝒚𝒈 4 ? 5 4 2 1 3 items ≈ 2 4 1 2 3 4 3 5 𝒈 2 4 5 4 2 4 3 4 2 2 5 q i = row i of Q 1 3 3 2 4 p x = column x of P T .1 -.4 .2 users f factors -.5 .6 .5 items 1.1 -.2 .3 .5 -2 -.5 .8 -.4 .3 1.4 2.4 -.9 -.2 .3 .5 -.8 .7 .5 1.4 .3 -1 1.4 2.9 -.7 1.2 -.1 1.3 1.1 2.1 .3 2.1 -.4 .6 1.7 2.4 .9 -.3 .4 .8 .7 -.6 .1 -.7 2.1 -2 P T -1 .7 .3 f factors Q 2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 19

http://cs246.stanford.edu Training data 100 million ratings, - PowerPoint PPT Presentation

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu Training data 100 million ratings, 480,000 users, 17,770 movies 6 years of data: 2000-2005 Test data Last few ratings of each user (2.8

http://cs246.stanford.edu High dim. Graph Infinite Machine Apps data data data learning

http://cs246.stanford.edu CPU Machine Learning, Statistics Memory Classical Data Mining

http://cs246.stanford.edu Web pages are not equally important www.joe-schmoe.com vs.

http://cs246.stanford.edu Instructor: Jure Leskovec TAs: Aditya Parameswaran

http://cs246.stanford.edu More algorithms for streams: (1) Filtering a data stream: Bloom

http://cs246.stanford.edu High-dimension == many features Find concepts/topics/genres:

http://cs246.stanford.edu High dim. High dim. Graph Graph Infinite Infinite Machine Machine Apps

http://cs246.stanford.edu Classic model of algorithms You get to see the entire input, then

http://cs246.stanford.edu Rank nodes using link structure PageRank: Link voting: P

http://cs246.stanford.edu Web advertising Weve learned how to match advertisers to

http://cs246.stanford.edu Web advertising We discussed how to match advertisers to queries

http://cs246.stanford.edu Web advertising We discussed how to match

http://cs246.stanford.edu Web advertising We discussed how to match advertisers to

http://cs246.stanford.edu TAs : Bahman Bahmani Juthika Dabholkar Pierre Kreitmann

http://cs246.stanford.edu High dimensional == many features Find

http://cs246.stanford.edu Supermarket shelf management Market-basket model: Goal: Identify

Soumyajit Gupta, Mucahid Kutlu, Vivek Khetan, and Matthew Lease ECIR 2019, Cologne, Germany, So

Introduction to the Million Songs Dataset Jamen Long Data Scientist DataCamp Building

Google matrix of the world trade network Leonardo Ermann CNEA (Buenos Aires, Argentina) Colab.

Oberseminar Convergence Mechanisms for a Smart Space App Store Bibek Shrestha

Course Introduction Dr. Mattox Beckman University of Illinois at Urbana-Champaign Department of

Unplanned Returns to Hospital Care: A Linked Data Study Kathy SMITH 1 and Renee IANNOTTI Health

Natural Language Processing with Deep Learning CS224N/Ling284 Christopher Manning Lecture 11:

Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data