Data Mining Techniques CS 6220 - Section 3 - Fall 2016 - Lecture 14

SLIDE 1

Data Mining Techniques

CS 6220 - Section 3 - Fall 2016

Lecture 14

Jan-Willem van de Meent (credit: Andrew Ng, Alex Smola, Yehuda Koren, Stanford CS246)

SLIDE 2

Recommender Systems

SLIDE 3

The Long Tail

(from: https://www.wired.com/2004/10/tail/)

SLIDE 4

The Long Tail

(from: https://www.wired.com/2004/10/tail/)

SLIDE 5

The Long Tail

(from: https://www.wired.com/2004/10/tail/)

SLIDE 6

Problem Setting

SLIDE 7

Problem Setting

SLIDE 8

Problem Setting

SLIDE 9

Problem Setting

  • Task: Predict user preferences for unseen items
SLIDE 10

Content-based Filtering

[Figure: movies placed on two axes, "geared towards females" vs. "geared towards males" and "serious" vs. "escapist": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; users Gus and Dave are placed on the same axes]

SLIDE 11

Content-based Filtering

[Figure: movies placed on two axes, "geared towards females" vs. "geared towards males" and "serious" vs. "escapist": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; users Gus and Dave are placed on the same axes]

Idea: Predict rating using item features on a per-user basis

SLIDE 12

Content-based Filtering

[Figure: movies placed on two axes, "geared towards females" vs. "geared towards males" and "serious" vs. "escapist": The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility; users Gus and Dave are placed on the same axes]

Idea: Predict rating using user features on a per-item basis

SLIDE 13

Collaborative Filtering

[Figure: user Joe connected to movies #1-#4 through other users who rated the same movies]

Idea: Predict rating based on similarity to other users

SLIDE 14

Problem Setting

  • Task: Predict user preferences for unseen items
  • Content-based filtering: Model user/item features
  • Collaborative filtering: Implicit similarity of users and items
SLIDE 15

Recommender Systems

  • Movie recommendation (Netflix)
  • Related product recommendation (Amazon)
  • Web page ranking (Google)
  • Social recommendation (Facebook)
  • News content recommendation (Yahoo)
  • Priority inbox & spam filtering (Google)
  • Online dating (OK Cupid)
  • Computational Advertising (Everyone)
SLIDE 16

Challenges

  • Scalability
      • Millions of objects
      • 100s of millions of users
  • Cold start
      • Changing user base
      • Changing inventory
  • Imbalanced dataset
      • User activity / item reviews are power-law distributed
  • Ratings are not missing at random
SLIDE 17

Running Example: Netflix Data

Training data:

  score  date      movie  user
  1      5/7/02    21     1
  5      8/2/04    213    1
  4      3/6/01    345    2
  4      5/1/05    123    2
  3      7/15/02   768    2
  5      1/22/01   76     3
  4      8/3/00    45     4
  1      9/10/05   568    5
  2      3/5/03    342    5
  2      12/28/00  234    5
  5      8/11/02   76     6
  4      6/15/03   56     6

Test data (scores withheld):

  score  date      movie  user
  ?      1/6/05    62     1
  ?      9/13/04   96     1
  ?      8/18/05   7      2
  ?      11/22/05  3      2
  ?      6/13/02   47     3
  ?      8/12/01   15     3
  ?      9/1/00    41     4
  ?      8/27/05   28     4
  ?      4/4/05    93     5
  ?      7/16/03   74     5
  ?      2/14/04   69     6
  ?      10/3/03   83     6

  • Released as part of $1M competition by Netflix in 2006
  • Prize awarded to BellKor in 2009
SLIDE 18

Running Yardstick: RMSE

$\mathrm{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u)\in S} (\hat{r}_{ui} - r_{ui})^2}$

SLIDE 19

Running Yardstick: RMSE

$\mathrm{rmse}(S) = \sqrt{|S|^{-1} \sum_{(i,u)\in S} (\hat{r}_{ui} - r_{ui})^2}$

(doesn’t tell you how to actually do recommendation)
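As a quick sketch, the RMSE above can be computed directly from a set of (item, user) pairs; storing predicted and observed ratings in dictionaries keyed by (item, user) is an assumption for illustration, not the Netflix data format:

```python
import math

def rmse(pairs, r_hat, r):
    """Root mean square error over a set S of (item, user) pairs.

    r_hat and r map (item, user) -> predicted / observed rating.
    """
    return math.sqrt(sum((r_hat[iu] - r[iu]) ** 2 for iu in pairs) / len(pairs))

# Toy example with two test pairs
r = {(1, 1): 4.0, (2, 1): 3.0}
r_hat = {(1, 1): 3.5, (2, 1): 3.5}
print(rmse([(1, 1), (2, 1)], r_hat, r))  # sqrt((0.25 + 0.25) / 2) = 0.5
```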

SLIDE 20

Ratings aren’t everything

Netflix then Netflix now

SLIDE 21

Content-based Filtering

SLIDE 22

Item-based Features

SLIDE 23

Item-based Features

SLIDE 24

Item-based Features

SLIDE 25

Per-user Regression

$w_u = \operatorname*{argmin}_{w} \lVert r_u - X w \rVert^2$

Learn a set of regression coefficients for each user
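A minimal NumPy sketch of this per-user least-squares fit; the item-feature matrix and ratings below are made up for illustration (first feature column is an intercept):

```python
import numpy as np

# Hypothetical item features X (rows: items, cols: intercept + two genre scores)
# and one user's ratings r_u of those items.
X = np.array([[1.0, 0.9, 0.1],
              [1.0, 0.2, 0.8],
              [1.0, 0.8, 0.3],
              [1.0, 0.1, 0.9]])
r_u = np.array([5.0, 2.0, 4.0, 1.0])

# Per-user least squares: w_u = argmin_w ||r_u - X w||^2
w_u, *_ = np.linalg.lstsq(X, r_u, rcond=None)

# Predict this user's rating of an unseen item from its features
x_new = np.array([1.0, 0.7, 0.4])
print(x_new @ w_u)
```

Each user gets their own coefficient vector `w_u`, so this step is repeated once per user.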

SLIDE 26

Bias

SLIDE 27

Bias

SLIDE 28

Bias

[Figure: Moonrise Kingdom rated 4, 5, 4, 4 by four users; item features 0.3, 0.2]

SLIDE 29

Bias

[Figure: Moonrise Kingdom rated 4, 5, 4, 4 by four users; item features 0.3, 0.2]

Problem: Some movies are universally loved / hated

SLIDE 30

Bias

[Figure: Moonrise Kingdom rated 4, 5, 4, 4 by four users; item features 0.3, 0.2; a picky user rates 3, 3, 3]

Problem: Some movies are universally loved / hated;
 some users are more picky than others

SLIDE 31

Bias

Problem: Some movies are universally loved / hated;
 some users are more picky than others

Solution: Introduce a per-movie and per-user bias

[Figure: Moonrise Kingdom rated 4, 5, 4, 4 by four users; item features 0.3, 0.2]
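One simple way to estimate the per-user and per-item biases is to average the residuals against the global mean; the toy ratings and dictionary layout below are illustrative assumptions:

```python
import numpy as np

# Toy ratings: (user, item, rating) triples, made up for illustration
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 1, 1)]

mu = np.mean([r for _, _, r in ratings])          # global mean rating

def bias(keyed, mu):
    """Average residual (r - mu) per key (a user id or an item id)."""
    sums, counts = {}, {}
    for k, r in keyed:
        sums[k] = sums.get(k, 0.0) + (r - mu)
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

b_u = bias([(u, r) for u, i, r in ratings], mu)   # per-user bias
b_i = bias([(i, r) for u, i, r in ratings], mu)   # per-item bias

# Baseline prediction b_ui = mu + b_u + b_i
print(mu + b_u[0] + b_i[0])
```

Picky users get a negative `b_u`; universally loved movies get a positive `b_i`.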

SLIDE 32

Temporal Effects

SLIDE 33

Changes in user behavior

[Figure: mean rating over time, with a jump in 2004, when Netflix changed the rating labels]

SLIDE 34

Are movies getting better with time?

SLIDE 35

Temporal Effects

Solution: Model temporal effects in the biases, not the weights

SLIDE 36

Neighborhood Methods

SLIDE 37

Neighborhood Based Methods

[Figure: user Joe connected to movies #1-#4 through other users who rated the same movies]

Users and items form a bipartite graph (edges are ratings)

SLIDE 38

Neighborhood Based Methods

(user, user) similarity

  • predict rating based on average from k-nearest users
  • good if item base is smaller than user base
  • good if item base changes rapidly

(item, item) similarity

  • predict rating based on average from k-nearest items
  • good if the user base is small
  • good if user base changes rapidly
SLIDE 39

Parzen-Window Style CF

$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i,u)} s_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i,u)} s_{ij}}$, where $b_{ui} = \mu + b_u + b_i$

  • Define a similarity s_ij between items
  • Find the set S^k(i,u) of k-nearest neighbors to i that were rated by user u
  • Predict the rating using a weighted average over this set
  • How should we define s_ij?
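The prediction step can be sketched in a few lines; the dictionary shapes (ratings, similarities, and baselines keyed by ids) and the example values are assumptions for illustration:

```python
def predict(i, u, ratings_u, sim, b, k=2):
    """Neighborhood prediction: baseline plus a similarity-weighted average
    of the user's residuals on the k items most similar to i.

    ratings_u: {item j: r_uj} for user u
    sim:       {(i, j): s_ij} item-item similarities
    b:         {(u, i): b_ui} baseline predictions
    """
    # k items rated by u that are most similar to i
    neighbors = sorted(ratings_u, key=lambda j: sim.get((i, j), 0.0),
                       reverse=True)[:k]
    num = sum(sim.get((i, j), 0.0) * (ratings_u[j] - b[(u, j)]) for j in neighbors)
    den = sum(sim.get((i, j), 0.0) for j in neighbors)
    return b[(u, i)] + num / den if den > 0 else b[(u, i)]

sim = {(0, 1): 0.9, (0, 2): 0.5}              # made-up similarities
ratings_u = {1: 4.0, 2: 2.0}                  # user 7's ratings
b = {(7, 0): 3.0, (7, 1): 3.5, (7, 2): 3.0}   # made-up baselines b_ui
print(predict(0, 7, ratings_u, sim, b))       # nudges the baseline 3.0 slightly
```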
SLIDE 40

Pearson Correlation Coefficient

$s_{ij} = \frac{\mathrm{Cov}[r_{ui}, r_{uj}]}{\mathrm{Std}[r_{ui}]\,\mathrm{Std}[r_{uj}]}$

[Figure: a sparse rating matrix; each item is rated by a distinct set of users, so the correlation must be estimated from the users who rated both item i and item j]

SLIDE 41

(item,item) similarity

Empirical estimate of the Pearson correlation coefficient:

$\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$

Regularize towards 0 for small support:

$s_{ij} = \frac{|U(i,j)| - 1}{|U(i,j)| - 1 + \lambda}\,\hat{\rho}_{ij}$

Regularize towards the baseline for small neighborhoods:

$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i,u)} s_{ij}\,(r_{uj} - b_{uj})}{\lambda + \sum_{j \in S^k(i,u)} s_{ij}}$
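The shrunk correlation estimate can be sketched as follows; the per-item rating and baseline dictionaries are assumed shapes, and λ here is just a demonstration value:

```python
import math

def shrunk_pearson(ri, rj, bi, bj, lam=50.0):
    """Empirical Pearson correlation between items i and j over their common
    raters U(i,j), shrunk towards 0 when the support is small.

    ri, rj: {user: rating} for items i and j
    bi, bj: {user: baseline b_ui} for items i and j
    """
    common = ri.keys() & rj.keys()          # U(i, j)
    n = len(common)
    if n < 2:
        return 0.0
    num = sum((ri[u] - bi[u]) * (rj[u] - bj[u]) for u in common)
    den = math.sqrt(sum((ri[u] - bi[u]) ** 2 for u in common)
                    * sum((rj[u] - bj[u]) ** 2 for u in common))
    rho = num / den if den > 0 else 0.0
    return (n - 1) / (n - 1 + lam) * rho    # shrinkage factor

# Three common raters with baseline 3.0 everywhere; perfectly correlated
ri = {1: 5.0, 2: 1.0, 3: 3.0}
rj = {1: 4.0, 2: 2.0, 3: 3.0}
base = {1: 3.0, 2: 3.0, 3: 3.0}
print(shrunk_pearson(ri, rj, base, base, lam=2.0))  # 1.0 shrunk to 0.5
```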

SLIDE 42

Similarity for binary labels

Pearson correlation is not meaningful for binary labels (e.g. views, purchases, clicks).

  • m_i: users acting on i; m_j: users acting on j; m_ij: users acting on both i and j; m: total number of users
  • Jaccard similarity: $s_{ij} = \frac{m_{ij}}{\alpha + m_i + m_j - m_{ij}}$
  • Observed / expected ratio: $s_{ij} = \frac{\text{observed}}{\text{expected}} \approx \frac{m_{ij}}{\alpha + m_i m_j / m}$
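Both binary-label similarities are one-liners; the counts below (100 and 50 viewers, 30 in common, out of 1000 users) are made up, and α is a small smoothing constant:

```python
def jaccard(m_i, m_j, m_ij, alpha=1.0):
    """Smoothed Jaccard similarity: m_ij / (alpha + m_i + m_j - m_ij)."""
    return m_ij / (alpha + m_i + m_j - m_ij)

def obs_over_exp(m_i, m_j, m_ij, m, alpha=1.0):
    """Observed co-occurrences over the count expected under independence."""
    return m_ij / (alpha + m_i * m_j / m)

print(jaccard(100, 50, 30))             # 30 / 121
print(obs_over_exp(100, 50, 30, 1000))  # 30 / (1 + 5) = 5.0
```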

SLIDE 43

Matrix Factorization Methods

SLIDE 44

Matrix Factorization

[Figure: Moonrise Kingdom rated 4, 5, 4, 4 by four users; item features 0.3, 0.2]

SLIDE 45

Matrix Factorization

[Figure: Moonrise Kingdom rated 4, 5, 4, 4 by four users; item features 0.3, 0.2]

Idea: pose as (biased) matrix factorization problem

SLIDE 46

Matrix Factorization

[Figure: a users x items rating matrix (entries 1-5) approximated by the product of two small real-valued factor matrices: a rank-3 SVD approximation]

SLIDE 47

Prediction

[Figure: the same rank-3 SVD approximation, with one unknown rating marked "?"]

SLIDE 48

Prediction

[Figure: the unknown rating is predicted as 2.4 from the inner product of the corresponding user and item factors in the rank-3 SVD approximation]

SLIDE 49

SVD with missing values

[Figure: the rating matrix, with missing entries, and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • Pose as a regression problem
  • Regularize using the Frobenius norm

SLIDE 50

Alternating Least Squares

[Figure: the rating matrix, with missing entries, and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • Fix the item factors X and regress w_u given X

SLIDE 51

Alternating Least Squares

[Figure: the rating matrix, with missing entries, and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • Fix the item factors X and regress w_u given X
  • Remember ridge regression? With an L2 penalty there is a closed-form solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$
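A compact NumPy sketch of alternating least squares, with the ridge closed form used for both half-steps; the toy rating matrix, rank k=2, and λ=0.1 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix with missing entries marked as nan (made up for illustration)
R = np.array([[5.0, 4.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 2.0, 1.0]])
M = ~np.isnan(R)                       # mask of observed entries
n_users, n_items = R.shape
k, lam = 2, 0.1                        # rank and regularization strength

W = rng.normal(size=(n_users, k))      # user factors
X = rng.normal(size=(n_items, k))      # item factors

def ridge(A, y, lam):
    """Closed-form ridge solution w = (A^T A + lam I)^{-1} A^T y."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

for _ in range(50):                    # alternate the two least-squares solves
    for u in range(n_users):           # regress w_u given X (observed items only)
        obs = M[u]
        W[u] = ridge(X[obs], R[u, obs], lam)
    for i in range(n_items):           # regress x_i given W (observed users only)
        obs = M[:, i]
        X[i] = ridge(W[obs], R[obs, i], lam)

print(W @ X.T)                         # completed rating matrix
```

Each half-step only uses the entries that were actually observed, which is what makes this work where plain SVD does not.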

SLIDE 52

Alternating Least Squares

[Figure: the rating matrix, with missing entries, and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • Alternate: regress w_u given X, then regress x_i given W

SLIDE 53

Stochastic Gradient Descent

[Figure: the rating matrix, with missing entries, and its rank-3 factorization]

  • SVD isn't defined when entries are unknown
  • No need for locking
  • Multicore updates asynchronously (Recht, Re, Wright, 2012 - Hogwild)
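A sequential sketch of the SGD updates for regularized matrix factorization (Hogwild runs exactly these updates from multiple threads without locks; threading is omitted here). The toy ratings, learning rate, and regularization strength are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy observed ratings as (user, item, rating) triples, made up for illustration
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
n_users, n_items, k = 3, 3, 2
eta, lam = 0.05, 0.01                  # step size and regularization

W = 0.1 * rng.normal(size=(n_users, k))   # user factors
X = 0.1 * rng.normal(size=(n_items, k))   # item factors

for epoch in range(200):
    for u, i, r in ratings:            # one gradient step per observed rating
        e = r - W[u] @ X[i]            # prediction error on this entry
        W[u] += eta * (e * X[i] - lam * W[u])
        X[i] += eta * (e * W[u] - lam * X[i])
```

Each update touches only one row of W and one row of X, which is why lock-free multicore updates rarely collide on sparse data.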

SLIDE 54

Netflix Prize

SLIDE 55

Netflix Prize

Training data

  • 100 million ratings, 480,000 users, 17,770 movies
  • 6 years of data: 2000-2005

Test data

  • Last few ratings of each user (2.8 million)
  • Evaluation criterion: Root Mean Square Error (RMSE)

Competition

  • 2,700+ teams
  • Netflix’s system RMSE: 0.9514
  • $1 million prize for 10% improvement on Netflix
SLIDE 56

Improvements

[Figure: factor models, error vs. number of parameters. RMSE (0.875 to 0.91) against millions of parameters (10 to 100,000) for NMF, BiasSVD, SVD++, and SVD v.2-v.4; point labels (40 to 1500) give the number of factors per model]

Add biases

Do SGD, but also learn the biases μ, b_u, and b_i
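The bias-aware SGD loop adds μ, b_u, and b_i to the prediction and updates the biases alongside the factors; the toy data and hyperparameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observed ratings as (user, item, rating) triples, made up for illustration
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_items, k = 3, 3, 2
eta, lam = 0.05, 0.02

mu = np.mean([r for _, _, r in ratings])   # global mean, fixed here
b_u = np.zeros(n_users)                    # learned user biases
b_i = np.zeros(n_items)                    # learned item biases
W = 0.1 * rng.normal(size=(n_users, k))
X = 0.1 * rng.normal(size=(n_items, k))

for _ in range(200):
    for u, i, r in ratings:
        e = r - (mu + b_u[u] + b_i[i] + W[u] @ X[i])   # error incl. biases
        b_u[u] += eta * (e - lam * b_u[u])             # bias updates
        b_i[i] += eta * (e - lam * b_i[i])
        W[u] += eta * (e * X[i] - lam * W[u])          # factor updates
        X[i] += eta * (e * W[u] - lam * X[i])
```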

SLIDE 57

Improvements

Account for the fact that ratings are not missing at random.

[Figure: the same error vs. number-of-parameters plot, adding a model that accounts for "who rated what"]

SLIDE 58

Improvements

Account for drift in user and item biases

[Figure: the same error vs. number-of-parameters plot, adding temporal effects]

SLIDE 59

Improvements

Still pretty far from the 0.8563 grand prize target

[Figure: the same error vs. number-of-parameters plot, adding temporal effects]

SLIDE 60

Winning Solution from BellKor

SLIDE 61

Last 30 days

June 26th submission triggers 30-day “last call”

SLIDE 62

Last 30 days

June 26th submission triggers 30-day “last call”

SLIDE 63

BellKor fends off competitors by a hair

SLIDE 64

BellKor fends off competitors by a hair