DS504/CS586: Big Data Analytics Recommender System Prof. Yanhua Li

Welcome to DS504/CS586: Big Data Analytics
Recommender System
Prof. Yanhua Li
Time: 6:00pm-8:50pm Thu.
Location: KH116
Fall 2017

Example: Recommender Systems
- Customer X
  - Star Wars I
  - Star Wars II
- Customer Y
  - Does search on Star Wars I
  - Recommender system suggests Star Wars II from data collected about customer X
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Recommendations
[Diagram: items reach users through Search and Recommendations]
Examples of items: products, web sites, blogs, news items, …

From Scarcity to Abundance
- Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters, …
- The Web enables near-zero-cost dissemination of information about products
  - From scarcity to abundance, e.g., Amazon, Target online, eBay, etc.
- More choice necessitates better filters
  - Recommendation engines

Types of Recommendations
- Editorial and hand curated
  - List of favorites
  - Lists of "essential" items
- Simple aggregates
  - Top 10, Most Popular, Recent Uploads
- Tailored to individual users
  - Amazon, Netflix, …

Formal Model
- X = set of customers
- S = set of items
- Utility function u : X × S → R
  - R = set of ratings
  - R is a totally ordered set
  - e.g., 0-5 stars, real number in [0,1]

Utility Matrix

         Avatar   LOTR   Matrix   Pirates
Alice      1               0.2
Bob                0.5               0.3
Carol     0.2               1
David                                0.4
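As a rough illustration of the formal model, the utility matrix above can be held as a sparse mapping that stores only the known ratings. This is a minimal sketch, not part of the original slides: the names `utility`, `items`, and `u` are hypothetical, and the column each value sits in is an assumption, since the flattened slide text only preserves which values belong to which row.

```python
# Sparse utility matrix as a dict of dicts: only known ratings are stored.
# ASSUMPTION: the column of each value is reconstructed, not given in the text.
utility = {
    "Alice": {"Avatar": 1.0, "Matrix": 0.2},
    "Bob": {"LOTR": 0.5, "Pirates": 0.3},
    "Carol": {"Avatar": 0.2, "Matrix": 1.0},
    "David": {"Pirates": 0.4},
}
items = ["Avatar", "LOTR", "Matrix", "Pirates"]


def u(x, s):
    """Utility function u: X x S -> R; returns None for an unknown rating."""
    return utility.get(x, {}).get(s)


# Fraction of cells with a known rating (7 of 16 here).
known = sum(len(row) for row in utility.values())
density = known / (len(utility) * len(items))

print(u("Alice", "Avatar"), u("Alice", "LOTR"), density)
```

Storing only known cells keeps memory proportional to the number of ratings rather than to |X| × |S|, which matters once most cells are empty.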

Key Problems
- (1) Gathering "known" ratings for matrix
  - How to collect the data in the utility matrix
- (2) Estimating unknown ratings from the known ones
  - Mainly interested in high unknown ratings
    - We are not interested in knowing what you don't like but what you like
- (3) Evaluating estimation methods
  - How to measure success/performance of recommendation methods

(1) Gathering Ratings
- Explicit
  - Ask people to rate items
  - Doesn't work well in practice: people can't be bothered
- Implicit
  - Learn ratings from user actions
    - E.g., purchase implies high rating
  - What about low ratings?

(2) Estimating Utilities
- Key problem: Utility matrix U is sparse
  - Most people have not rated most items
  - Cold start:
    - New items have no ratings
    - New users have no history
- Approaches to recommender systems:
  - 1) Content-based
  - 2) Collaborative filtering

Content-based Recommender Systems

Content-based Recommendations
- Main idea: Recommend items to customer x similar to previous items rated highly by x
  - Look at x's items vs all items
Example:
- Movie recommendations
  - Recommend movies with same actor(s), director, genre, …
- Websites, blogs, news
  - Recommend other sites with "similar" content

Plan of Action
[Diagram labels: user likes items; build item profiles (red, circles, triangles); build user profile; match; recommend]

Item Profiles
- For each item, create an item profile
- Profile is a set (vector) of features
  - Movies: author, title, actor, director, …
  - Text: Set of "important" words in document

User Profiles and Prediction
- User profile possibilities:
  - Weighted average of rated item profiles
  - Variation: weight by difference from average rating for item
    $w_x = \sum_{j=1}^{N_x} w_j \, (r_{xj} - \bar{r}_x)$
- Prediction heuristic:
  - Given user profile $w_x$ and item profile $w_j$, estimate
    $r_{xj} = \cos(w_x, w_j) = \frac{w_x \cdot w_j}{\lVert w_x \rVert \, \lVert w_j \rVert}$
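The profile-building and prediction steps on this slide can be sketched as follows. This is an illustrative example under stated assumptions rather than the course's implementation: the item feature vectors, the ratings, and the names `user_profile` and `cosine` are all hypothetical, and the average subtracted from each rating is taken to be the user's own mean rating, which is one reading of the slide's formula.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def user_profile(item_profiles, ratings):
    """w_x = sum over rated items j of w_j * (r_xj - mean rating of x)."""
    mean = sum(ratings.values()) / len(ratings)
    dim = len(next(iter(item_profiles.values())))
    w = [0.0] * dim
    for j, r in ratings.items():
        for k, feature in enumerate(item_profiles[j]):
            w[k] += feature * (r - mean)
    return w


# Hypothetical item profiles over the features [action, romance, sci-fi].
item_profiles = {
    "A": [1, 0, 1],
    "B": [0, 1, 0],
    "C": [1, 0, 0],
    "D": [0, 1, 1],  # not yet rated by x
    "E": [1, 0, 1],  # not yet rated by x
}
ratings_x = {"A": 5, "B": 1, "C": 4}  # hypothetical ratings of user x

w_x = user_profile(item_profiles, ratings_x)
score_D = cosine(w_x, item_profiles["D"])
score_E = cosine(w_x, item_profiles["E"])
print(w_x, score_D, score_E)
```

With these made-up numbers the profile leans toward action and sci-fi and away from romance, so item E scores higher than item D.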

Pros: Content-based Approach
- +: No need for data on other users
- +: Able to recommend to users with unique tastes
- +: Able to recommend new & unpopular items
  - No item cold-start
- +: Able to provide explanations
  - Can explain a recommendation by listing the content features that caused the item to be recommended

Cons: Content-based Approach
- –: Finding the appropriate features is hard
  - E.g., images, movies, music
- –: Recommendations for new users
  - How to build a user profile?
  - User cold-start problem
- –: Overspecialization
  - Never recommends items outside user's content profile
  - People might have multiple interests
  - Unable to exploit quality judgments of other users

Collaborative Filtering
Harnessing quality judgments of other users

Collaborative Filtering
- Consider user x
- Find set N of other users whose ratings are "similar" to x's ratings
- Estimate x's ratings based on ratings of users in N

Finding "Similar" Users
- Let $r_x$ be the vector of user x's ratings
  - Example: $r_x$ = [*, _, _, *, ***], $r_y$ = [*, _, **, **, _]
- Jaccard similarity measure
  - $r_x$, $r_y$ as sets: $r_x$ = {1, 4, 5}, $r_y$ = {1, 3, 4}
  - Problem: Ignores the value of the ratings
- Cosine similarity measure
  - $\mathrm{sim}(x,y) = \cos(r_x, r_y) = \frac{r_x \cdot r_y}{\lVert r_x \rVert \, \lVert r_y \rVert}$
  - $r_x$, $r_y$ as points: $r_x$ = [1, 0, 0, 1, 3], $r_y$ = [1, 0, 2, 2, 0]
  - Problem: Treats missing ratings as "negative"
- Pearson correlation coefficient
  - $\mathrm{sim}(x,y) = \cos(r_x - r_{x,\mathrm{ave}},\; r_y - r_{y,\mathrm{ave}}) = \frac{(r_x - r_{x,\mathrm{ave}}) \cdot (r_y - r_{y,\mathrm{ave}})}{\lVert r_x - r_{x,\mathrm{ave}} \rVert \, \lVert r_y - r_{y,\mathrm{ave}} \rVert}$
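The three measures can be compared on the slide's own example vectors with a short sketch like the one below. The function names are hypothetical, 0 is used to mark a missing rating, and the Pearson variant shown is the centered-cosine form (subtract each user's mean over rated items, leave missing entries at 0), which is an assumption about how the slide's formula is meant to be applied.

```python
import math

# Example vectors from the slide; 0 stands for a missing rating.
r_x = [1, 0, 0, 1, 3]
r_y = [1, 0, 2, 2, 0]


def jaccard(a, b):
    """Jaccard similarity of the sets of rated positions (rating values ignored)."""
    sa = {i for i, v in enumerate(a) if v}
    sb = {i for i, v in enumerate(b) if v}
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity; missing ratings count as 0."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centered(a):
    """Subtract the mean of the rated entries; missing entries stay at 0."""
    rated = [v for v in a if v]
    mean = sum(rated) / len(rated)
    return [v - mean if v else 0.0 for v in a]


def pearson(a, b):
    """Centered cosine similarity."""
    return cosine(centered(a), centered(b))


print(jaccard(r_x, r_y))  # 2 shared items out of 4 rated by either user
print(cosine(r_x, r_y))   # 3 / (sqrt(11) * 3)
print(pearson(r_x, r_y))
```

After centering, a missing rating sits at the user's average instead of below every real rating, which is why it no longer acts as a "negative".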

Similarity Metric
- Intuitively we want: sim(A, B) > sim(A, C)
- Jaccard similarity: 1/5 < 2/4
- Cosine similarity: 0.386 > 0.322
  - Considers missing ratings as "negative"
  - Solution: subtract the (row) mean
- Notice: cosine similarity is correlation when data is centered at 0

User-User Collaborative Filtering
- For user u, find other similar users
- Estimate rating for item i based on ratings from similar users:
  $\mathrm{pred}(u,i) = \frac{\sum_{n \in \mathrm{neighbors}(u)} \mathrm{sim}(u,n) \cdot r_{ni}}{\sum_{n \in \mathrm{neighbors}(u)} \mathrm{sim}(u,n)}$
- sim(u,n) … similarity of users u and n
- $r_{ui}$ … rating of user u on item i
- neighbors(u) … set of users similar to user u
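The weighted-average prediction can be sketched as below. This is a minimal illustration, not the course's code: the users, ratings, and similarity values are made up, and choosing the k most similar users who rated the item, restricted to positive similarity, is one possible definition of neighbors(u) that the slide does not specify.

```python
def predict(u, i, ratings, sim, k=2):
    """pred(u, i) = sum(sim(u, n) * r_ni) / sum(sim(u, n)) over n in neighbors(u).

    ASSUMPTION: neighbors(u) = the k users most similar to u who rated item i
    and have positive similarity to u.
    """
    candidates = [n for n in ratings if n != u and i in ratings[n] and sim(u, n) > 0]
    neighbors = sorted(candidates, key=lambda n: sim(u, n), reverse=True)[:k]
    den = sum(sim(u, n) for n in neighbors)
    if not den:
        return None  # no usable neighbor rated this item
    return sum(sim(u, n) * ratings[n][i] for n in neighbors) / den


# Hypothetical data: user "u" has not rated item "m".
ratings = {
    "u": {},
    "a": {"m": 5},
    "b": {"m": 1},
    "c": {"m": 3},
}
similarity = {"a": 0.8, "b": 0.2, "c": 0.5}  # precomputed sim(u, n), made up


def sim(u, n):
    return similarity[n]


pred = predict("u", "m", ratings, sim, k=2)
print(pred)  # (0.8 * 5 + 0.5 * 3) / (0.8 + 0.5)
```

Dropping non-positive similarities keeps the denominator positive, so the prediction stays a proper weighted average of the neighbors' ratings.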

Item-Item Collaborative Filtering
- So far: User-user collaborative filtering
- Another view: Item-item
  - For item i, find other similar items
  - Estimate rating for item i based on ratings for similar items
  - Can use same similarity metrics and prediction functions as in user-user model
    $r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
- $s_{ij}$ … similarity of items i and j
- $r_{xj}$ … rating of user x on item j
- N(i;x) … set of items rated by x similar to i

Item-Item CF (|N|=2)

        users
movie |  1  2  3  4  5  6  7  8  9 10 11 12
    1 |  1  .  3  .  .  5  .  .  5  .  4  .
    2 |  .  .  5  4  .  .  4  .  .  2  1  3
    3 |  2  4  .  1  2  .  3  .  4  3  5  .
    4 |  .  2  4  .  5  .  .  4  .  .  2  .
    5 |  .  .  4  3  4  2  .  .  .  .  2  5
    6 |  1  .  3  .  3  .  .  2  .  .  4  .

. = unknown rating; a number = rating between 1 and 5

Item-Item CF (|N|=2)

        users
movie |  1  2  3  4  5  6  7  8  9 10 11 12
    1 |  1  .  3  .  ?  5  .  .  5  .  4  .
    2 |  .  .  5  4  .  .  4  .  .  2  1  3
    3 |  2  4  .  1  2  .  3  .  4  3  5  .
    4 |  .  2  4  .  5  .  .  4  .  .  2  .
    5 |  .  .  4  3  4  2  .  .  .  .  2  5
    6 |  1  .  3  .  3  .  .  2  .  .  4  .

? = estimate rating of movie 1 by user 5

Item-Item CF (|N|=2)

        users
movie |  1  2  3  4  5  6  7  8  9 10 11 12 | sim(1,m)
    1 |  1  .  3  .  ?  5  .  .  5  .  4  . |   1.00
    2 |  .  .  5  4  .  .  4  .  .  2  1  3 |  -0.18
    3 |  2  4  .  1  2  .  3  .  4  3  5  . |   0.41
    4 |  .  2  4  .  5  .  .  4  .  .  2  . |  -0.10
    5 |  .  .  4  3  4  2  .  .  .  .  2  5 |  -0.31
    6 |  1  .  3  .  3  .  .  2  .  .  4  . |   0.59

Neighbor selection: Identify movies similar to movie 1, rated by user 5

Here we use Pearson correlation as similarity:
1) Subtract mean rating $m_i$ from each movie i
   $m_1$ = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
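The two steps above (center each row, then take cosine similarities between rows) can be sketched as follows. The matrix `R` and the helper names are hypothetical; in particular, the column each rating sits in is an assumption reconstructed from the slide layout, with 0 marking an unknown rating. With this placement the computed values round to the sim(1,m) column on the slide.

```python
import math

# Rows = movies 1..6, columns = users 1..12, 0 = unknown rating.
# ASSUMPTION: column positions are reconstructed from the slide layout.
R = [
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
]


def center(row):
    """Step 1: subtract the row's mean over known ratings; unknowns stay 0."""
    rated = [v for v in row if v]
    mean = sum(rated) / len(rated)
    return [v - mean if v else 0.0 for v in row]


def cosine(a, b):
    """Step 2: cosine similarity between two centered rows."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


C = [center(row) for row in R]
sims = [round(cosine(C[0], c), 2) for c in C]
print(sims)  # similarity of movie 1 to movies 1..6
```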

Item-Item CF (|N|=2)

        users
movie |  1  2  3  4  5  6  7  8  9 10 11 12 | sim(1,m)
    1 |  1  .  3  .  ?  5  .  .  5  .  4  . |   1.00
    2 |  .  .  5  4  .  .  4  .  .  2  1  3 |  -0.18
    3 |  2  4  .  1  2  .  3  .  4  3  5  . |   0.41
    4 |  .  2  4  .  5  .  .  4  .  .  2  . |  -0.10
    5 |  .  .  4  3  4  2  .  .  .  .  2  5 |  -0.31
    6 |  1  .  3  .  3  .  .  2  .  .  4  . |   0.59

Compute similarity weights: $s_{1,3}$ = 0.41, $s_{1,6}$ = 0.59
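Plugging these weights into the item-item prediction formula gives an estimate for the missing rating. The sketch below is illustrative only: the names are hypothetical, the similarities are the rounded values from the sim(1,m) column, and user 5's ratings of movies 3 to 6 (2, 5, 4, 3) are read off the reconstructed matrix, so their placement is an assumption.

```python
# Rounded similarities of movie 1 to the other movies (from the slide).
sim_to_movie1 = {2: -0.18, 3: 0.41, 4: -0.10, 5: -0.31, 6: 0.59}

# ASSUMPTION: user 5's known ratings, read off the reconstructed matrix.
user5 = {3: 2, 4: 5, 5: 4, 6: 3}


def predict(sims, user_ratings, k=2):
    """r_xi = sum(s_ij * r_xj) / sum(s_ij) over the k most similar items rated by x."""
    rated = [j for j in user_ratings if j in sims]
    neighbors = sorted(rated, key=lambda j: sims[j], reverse=True)[:k]
    num = sum(sims[j] * user_ratings[j] for j in neighbors)
    den = sum(sims[j] for j in neighbors)
    return neighbors, num / den


neighbors, r_15 = predict(sim_to_movie1, user5, k=2)
print(neighbors, round(r_15, 2))  # (0.59 * 3 + 0.41 * 2) / (0.59 + 0.41)
```

With |N| = 2 the neighborhood is movies 6 and 3, the only two movies rated by user 5 that are positively correlated with movie 1.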
