

SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

SLIDE 2

 Training data

  • 100 million ratings, 480,000 users, 17,770 movies
  • 6 years of data: 2000-2005

 Test data

  • Last few ratings of each user (2.8 million)
  • Evaluation criterion: Root Mean Square Error (RMSE)
  • Netflix’s system RMSE: 0.9514

 Competition

  • 2,700+ teams
  • $1 million prize for a 10% improvement over Netflix's system

2/10/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2

SLIDE 3

[Figure: utility matrix R of 1-5 star ratings, 480,000 users × 17,770 movies]


Matrix R

SLIDE 4

[Figure: matrix R (480,000 users × 17,770 movies); known entries form the Training Data Set, while entries marked "?" form the Test Data Set. The figure highlights one predicted rating $\hat{r}_{xi}$ against the true rating of user x on item i.]

Evaluation uses the sum of squared errors over the test set:

$$\text{SSE} = \sum_{(i,x) \in R} \big( \hat{r}_{xi} - r_{xi} \big)^2$$

SLIDE 5

 The winner of the Netflix Challenge!

 Multi-scale modeling of the data: combine top-level, "regional" modeling of the data with a refined, local view:

  • Global: overall deviations of users/movies
  • Factorization: addressing "regional" effects
  • Collaborative filtering: extract local patterns


Global effects Factorization Collaborative filtering

SLIDE 6

 Global:

  • Mean movie rating: 3.7 stars
  • The Sixth Sense is 0.5 stars above avg.
  • Joe rates 0.2 stars below avg.

 Baseline estimation: Joe will rate The Sixth Sense 4 stars

 Local neighborhood (CF/NN):

  • Joe didn't like related movie Signs
  • ⇒ Final estimate: Joe will rate The Sixth Sense 3.8 stars


SLIDE 7

 Earliest and most popular collaborative filtering method

 Derive unknown ratings from those of "similar" movies (item-item variant)

 Define a similarity measure sij of items i and j

 Select the k-nearest neighbors, compute the rating

  • N(i; x): items most similar to i that were rated by x

$$\hat{r}_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$$

sij … similarity of items i and j
rxj … rating of user x on item j
N(i;x) … set of items similar to item i that were rated by x
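The prediction above is a similarity-weighted average. A minimal sketch in code, with toy ratings and similarities made up for illustration (the function name and data layout are ours, not from the slides):

```python
def predict_rating(x, i, ratings, sim, k=2):
    """Item-item CF: similarity-weighted average of user x's ratings
    over N(i;x), the k items most similar to i that x has rated."""
    rated = list(ratings[x])                         # items user x has rated
    neighbors = sorted(rated, key=lambda j: sim[i][j], reverse=True)[:k]
    num = sum(sim[i][j] * ratings[x][j] for j in neighbors)
    den = sum(sim[i][j] for j in neighbors)
    return num / den

# Toy example: user "x" rated items 0-2; estimate item 3.
ratings = {"x": {0: 4.0, 1: 2.0, 2: 5.0}}
sim = {3: {0: 0.8, 1: 0.1, 2: 0.4}}                  # s_3j for the target item
print(predict_rating("x", 3, ratings, sim))          # (0.8*4 + 0.4*5) / 1.2 ≈ 4.33
```

With k = 2 the neighbors are items 0 and 2 (highest similarity to item 3), so item 1's low rating is ignored entirely.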

SLIDE 8

 In practice we get better estimates if we model deviations:

$$\hat{r}_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij}\, (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$$

baseline estimate for rxi:  $b_{xi} = \mu + b_x + b_i$

μ = overall mean rating
bx = rating deviation of user x = (avg. rating of user x) – μ
bi = rating deviation of movie i = (avg. rating of movie i) – μ

Problems/Issues:
1) Similarity measures are "arbitrary"
2) Pairwise similarities neglect interdependencies among users
3) Taking a weighted average can be restricting
Solution: Instead of sij use wij that we estimate directly from the data
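A sketch of the deviation-based variant, reusing the toy data from before and made-up bias values (helper names are ours):

```python
def baseline(mu, b_user, b_item, x, i):
    """b_xi = mu + b_x + b_i."""
    return mu + b_user[x] + b_item[i]

def predict_deviation(x, i, ratings, sim, mu, b_user, b_item, k=2):
    """r_hat_xi = b_xi + weighted average of deviations (r_xj - b_xj)."""
    neighbors = sorted(ratings[x], key=lambda j: sim[i][j], reverse=True)[:k]
    num = sum(sim[i][j] * (ratings[x][j] - baseline(mu, b_user, b_item, x, j))
              for j in neighbors)
    den = sum(sim[i][j] for j in neighbors)
    return baseline(mu, b_user, b_item, x, i) + num / den

mu = 3.7
b_user = {"x": -0.2}
b_item = {0: 0.3, 1: -0.1, 2: 0.5, 3: 0.5}
ratings = {"x": {0: 4.0, 1: 2.0, 2: 5.0}}
sim = {3: {0: 0.8, 1: 0.1, 2: 0.4}}
print(predict_deviation("x", 3, ratings, sim, mu, b_user, b_item))   # ≈ 4.47
```

The neighbors now contribute only how far above or below their own baselines the user rated them; the item's baseline (4.0 here) anchors the estimate.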

SLIDE 9

 Use a weighted sum rather than a weighted avg.:

$$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\, (r_{xj} - b_{xj})$$

 A few notes:

  • We sum over all movies j that are similar to i and were rated by x
  • wij is the interpolation weight (some real number)
  • We allow $\sum_{j \in N(i;x)} w_{ij} \neq 1$
  • wij models interactions between pairs of movies (it does not depend on user x)
  • N(i;x) … set of movies rated by user x that are similar to movie i


SLIDE 10

$$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\, (r_{xj} - b_{xj})$$

 How to set wij?

  • Remember, the error metric is SSE:

$$\text{SSE} = \sum_{(i,x) \in R} \big( \hat{r}_{xi} - r_{xi} \big)^2$$

  • Find wij that minimize SSE on training data!
  • Models relationships between item i and its neighbors j
  • wij can be learned/estimated based on x and all other users that rated i


Why is this a good idea?

SLIDE 11

 Here is what we just did:

  • Goal: Make good recommendations
  • Quantify goodness using SSE:

So, lower SSE means better recommendations

  • We want to make good recommendations on items the user has not yet seen. Can't really do this directly. Why? We never observe those ratings.
  • Let's set values w such that they work well on known (user, item) ratings, and hope these w's will also predict the unknown ratings well

 This is the first time in the class that we see optimization methods


[Figure: matrix R with known ratings and unknown "?" entries]

SLIDE 12

 Idea: Let's set values w such that they work well on known (user, item) ratings

 How to find such values w?

 Idea: Define an objective function and solve the optimization problem

 Find wij that minimize SSE on training data:

$$\min_{w_{ij}} \sum_{x} \Big( \big[ b_{xi} + \sum_{j \in N(i;x)} w_{ij}\, (r_{xj} - b_{xj}) \big] - r_{xi} \Big)^2$$

 Think of w as a vector of numbers


SLIDE 13

 We have the optimization problem, now what?

 Gradient descent!

  • Iterate until convergence: $w \leftarrow w - \eta\, \nabla_w J$
  • where $\nabla_w J$ is the gradient (derivative evaluated on the data):

$$\nabla w_{ij} = \frac{\partial J}{\partial w_{ij}} = \sum_{x} 2 \Big( \big[ b_{xi} + \sum_{k \in N(i;x)} w_{ik}\, (r_{xk} - b_{xk}) \big] - r_{xi} \Big) (r_{xj} - b_{xj})$$

for $j \in \{N(i;x),\ \forall i, \forall x\}$, else $\frac{\partial J}{\partial w_{ij}} = 0$

  • Note: we fix movie i, go over all rxi, and for every movie $j \in N(i;x)$ we compute $\frac{\partial J}{\partial w_{ij}}$

η … learning rate

while |w_new − w_old| > ε:
    w_old = w_new
    w_new = w_old − η · ∇w_old

$$\min_{w_{ij}} \sum_{x} \Big( \big[ b_{xi} + \sum_{k \in N(i;x)} w_{ik}\, (r_{xk} - b_{xk}) \big] - r_{xi} \Big)^2$$
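A minimal sketch of this gradient descent for one fixed item i, assuming the deviations (r_xj − b_xj) are precomputed per user; the toy data and function name are ours:

```python
def fit_weights(D, t, eta=0.01, iters=2000):
    """Gradient descent for the interpolation weights of one item i.

    D[x][j] = r_xj - b_xj for the j-th neighbor of i, per user x who rated i
    t[x]    = r_xi - b_xi (the target deviation)
    Minimizes J(w) = sum_x (sum_j w_j * D[x][j] - t[x])**2."""
    n = len(D[0])
    w = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for dev, target in zip(D, t):
            err = sum(wj * dj for wj, dj in zip(w, dev)) - target
            for j in range(n):
                grad[j] += 2 * err * dev[j]          # dJ / dw_j
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

# Three users, two neighbors; the exact least-squares solution is w = (0.5, 0.2).
w = fit_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.5, 0.2, 0.7])
print(w)   # ≈ [0.5, 0.2]
```

This is an ordinary least-squares problem in w, so plain gradient descent converges for a small enough learning rate.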

SLIDE 14

 So far: $\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij}\, (r_{xj} - b_{xj})$

  • Weights wij derived based on their role; no use of an arbitrary similarity measure (wij ≠ sij)
  • Explicitly account for interrelationships among the neighboring movies

 Next: Latent factor model

  • Extract “regional” correlations


Global effects

Factorization

CF/NN

SLIDE 15

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Basic Collaborative filtering: 0.94
CF+Biases+learnt weights: 0.91


SLIDE 16

[Figure: ten movies (The Princess Diaries, The Lion King, Braveheart, Lethal Weapon, Independence Day, Amadeus, The Color Purple, Dumb and Dumber, Ocean's 11, Sense and Sensibility) placed on two axes: "geared towards females" ↔ "geared towards males" and "serious" ↔ "funny"]

SLIDE 17

 "SVD" on Netflix data: R ≈ Q · PT

 For now let's assume we can approximate the rating matrix R as a product of "thin" Q · PT

  • R has missing entries, but let's ignore that for now!
  • Basically, we want the reconstruction error to be small on known ratings, and we don't care about the values on the missing ones


[Figure: R (items × users) factored as Q · PT, with Q an items × f matrix and PT an f × users matrix of factor values; compare SVD: A = U Σ VT]

SLIDE 18

 How to estimate the missing rating of user x for item i?

[Figure: a "?" entry of R is estimated from row i of Q and column x of PT]

$$\hat{r}_{xi} = q_i \cdot p_x^{T} = \sum_{f} q_{if}\, p_{xf}$$

qi = row i of Q
px = column x of PT
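The estimate is just a dot product. A tiny sketch with made-up factor values (f = 2):

```python
def estimate(Q, PT, i, x):
    """r_hat_xi = q_i · p_x: row i of Q times column x of PT."""
    return sum(Q[i][f] * PT[f][x] for f in range(len(PT)))

# Two items, two users, f = 2 factors (hypothetical values).
Q = [[0.5, 1.0],     # q_0
     [1.5, 0.2]]     # q_1
PT = [[2.0, 0.0],    # factor 1, one column per user
      [1.0, 3.0]]    # factor 2
print(estimate(Q, PT, 0, 1))   # 0.5*0.0 + 1.0*3.0 = 3.0
```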

SLIDE 19

 How to estimate the missing rating of user x for item i?

[Figure: same build as the previous slide, highlighting the f factors of Q and PT]

$$\hat{r}_{xi} = q_i \cdot p_x^{T} = \sum_{f} q_{if}\, p_{xf}$$

qi = row i of Q
px = column x of PT

SLIDE 20

 How to estimate the missing rating of user x for item i?

[Figure: the missing entry is filled in with the estimate $q_i \cdot p_x^{T} = 2.4$]

$$\hat{r}_{xi} = q_i \cdot p_x^{T} = \sum_{f} q_{if}\, p_{xf}$$

qi = row i of Q
px = column x of PT

SLIDE 21

[Figure: the same movies plotted against Factor 1 ("geared towards females" ↔ "geared towards males") and Factor 2 ("serious" ↔ "funny")]

SLIDE 22

[Figure: the same movies plotted against Factor 1 ("geared towards females" ↔ "geared towards males") and Factor 2 ("serious" ↔ "funny")]

SLIDE 23

 Remember SVD:

  • A: Input data matrix
  • U: Left singular vecs
  • V: Right singular vecs
  • Σ: Singular values
  • SVD gives minimum reconstruction error (SSE!):

$$\min_{U,V,\Sigma} \sum_{ij} \big( A_{ij} - [U \Sigma V^{T}]_{ij} \big)^2$$

 So in our case, "SVD" on Netflix data: R ≈ Q · PT, with A = R, Q = U, PT = Σ VT

  • But, we are not done yet! R has missing entries!

[Figure: A (m × n) = U Σ VT]

The sum goes over all entries. But our R has missing entries!

$$\hat{r}_{xi} = q_i \cdot p_x^{T}$$

SLIDE 24

 SVD isn't defined when entries are missing!

 Use specialized methods to find P, Q:

$$\min_{P,Q} \sum_{(i,x) \in R} \big( r_{xi} - q_i \cdot p_x^{T} \big)^2$$

  • Note:
  • We don’t require cols of P, Q to be orthogonal/unit length
  • P, Q map users/movies to a latent space
  • The most popular model among Netflix contestants


[Figure: R ≈ Q · PT (items × f factors times f factors × users), with $\hat{r}_{xi} = q_i \cdot p_x^{T}$]

SLIDE 25

 Want to minimize SSE for unseen test data

 Idea: Minimize SSE on training data

  • Want large f (# of factors) to capture all the signals
  • But, SSE on test data begins to rise beyond a certain f: the model overfits

 Regularization is needed!

  • Allow a rich model where there are sufficient data
  • Shrink aggressively where data are scarce


$$\min_{P,Q} \underbrace{\sum_{(i,x) \in \text{training}} \big( r_{xi} - q_i p_x^{T} \big)^2}_{\text{“error”}} + \lambda \underbrace{\Big[ \sum_{x} \lVert p_x \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \Big]}_{\text{“length”}}$$

λ … regularization parameter

[Figure: matrix R with known ratings and unknown "?" entries]
SLIDE 26

[Figure: the latent-factor plot (females ↔ males, serious ↔ funny; Factor 1 vs. Factor 2), now fit with regularization]

$$\min_{\text{factors}} \text{“error”} + \lambda\, \text{“length”}$$

$$\min_{P,Q} \sum_{(i,x) \in \text{training}} \big( r_{xi} - q_i p_x^{T} \big)^2 + \lambda \Big[ \sum_{x} \lVert p_x \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \Big]$$

SLIDE 27

[Figure: the same plot; with regularization, factor estimates for sparsely-rated movies (e.g. The Princess Diaries) shrink toward the origin]

$$\min_{\text{factors}} \text{“error”} + \lambda\, \text{“length”}$$

$$\min_{P,Q} \sum_{(i,x) \in \text{training}} \big( r_{xi} - q_i p_x^{T} \big)^2 + \lambda \Big[ \sum_{x} \lVert p_x \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \Big]$$

SLIDE 28

[Figure: next step of the regularization animation; shrinking continues where data are scarce]

$$\min_{\text{factors}} \text{“error”} + \lambda\, \text{“length”}$$

$$\min_{P,Q} \sum_{(i,x) \in \text{training}} \big( r_{xi} - q_i p_x^{T} \big)^2 + \lambda \Big[ \sum_{x} \lVert p_x \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \Big]$$

SLIDE 29

[Figure: final step of the regularization animation]

$$\min_{\text{factors}} \text{“error”} + \lambda\, \text{“length”}$$

$$\min_{P,Q} \sum_{(i,x) \in \text{training}} \big( r_{xi} - q_i p_x^{T} \big)^2 + \lambda \Big[ \sum_{x} \lVert p_x \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \Big]$$

SLIDE 30

 Want to find matrices P and Q:

 Gradient descent:

  • Initialize P and Q (using SVD, pretend missing ratings are 0)
  • Do gradient descent:
  •   P ← P − η · ∇P
  •   Q ← Q − η · ∇Q
  • where ∇Q is the gradient/derivative of matrix Q:

$$\nabla Q = [\nabla q_{if}], \qquad \nabla q_{if} = \sum_{x,i} -2 \big( r_{xi} - q_i p_x^{T} \big) p_{xf} + 2 \lambda q_{if}$$

  • Here $q_{if}$ is entry f of row qi of matrix Q
  • Observation: Computing gradients is slow!


$$\min_{P,Q} \sum_{(i,x) \in \text{training}} \big( r_{xi} - q_i p_x^{T} \big)^2 + \lambda \Big[ \sum_{x} \lVert p_x \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \Big]$$

How to compute the gradient of a matrix? Compute the gradient of every element independently!

SLIDE 31

 Gradient Descent (GD) vs. Stochastic GD

  • Observation: $\nabla Q = [\nabla q_{if}]$, where

$$\nabla q_{if} = \sum_{x,i} -2 \big( r_{xi} - q_i p_x^{T} \big) p_{xf} + 2 \lambda q_{if} = \sum_{x,i} \nabla Q(r_{xi})$$

  • Here $q_{if}$ is entry f of row qi of matrix Q
  • $Q \leftarrow Q - \eta \nabla Q = Q - \eta \sum_{x,i} \nabla Q(r_{xi})$
  • Idea: Instead of evaluating the gradient over all ratings, evaluate it for each individual rating and make a step

 GD: $Q \leftarrow Q - \eta \sum_{x,i} \nabla Q(r_{xi})$

 SGD: $Q \leftarrow Q - \eta\, \nabla Q(r_{xi})$

  • Faster convergence!
  • Need more steps, but each step is computed much faster


SLIDE 32

 Convergence of GD vs. SGD

[Figure: value of the objective function vs. iteration/step]

GD improves the value of the objective function at every step. SGD improves the value too, but in a "noisy" way. GD takes fewer steps to converge, but each step takes much longer to compute. In practice, SGD is much faster!

SLIDE 33

 Stochastic gradient descent:

  • Initialize P and Q (using SVD, pretend missing ratings are 0)
  • Then iterate over the ratings (multiple times if necessary) and update the factors:

For each rxi:

  • $\varepsilon_{xi} = r_{xi} - q_i \cdot p_x^{T}$ (derivative of the "error")
  • $q_i \leftarrow q_i + \eta\, (\varepsilon_{xi}\, p_x - \lambda\, q_i)$ (update equation)
  • $p_x \leftarrow p_x + \eta\, (\varepsilon_{xi}\, q_i - \lambda\, p_x)$ (update equation)

 2 for loops:

  • For until convergence:
  •   For each rxi:
  •     Compute gradient, do a "step"


η … learning rate
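The two loops above, as a self-contained sketch: random initialization instead of SVD, a toy 2×2 rating matrix, and hyperparameters chosen by us for illustration:

```python
import random

def sgd_factorize(ratings, n_users, n_items, f=2, eta=0.05, lam=0.01, epochs=500):
    """SGD for R ~ Q · P^T: one step on q_i and p_x per known rating r_xi."""
    random.seed(0)   # reproducible toy run
    Q = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_items)]
    P = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(n_users)]
    for _ in range(epochs):
        for (x, i), r in ratings.items():
            eps = r - sum(q * p for q, p in zip(Q[i], P[x]))   # prediction error
            qi, px = Q[i][:], P[x][:]                          # old values
            Q[i] = [q + eta * (eps * p - lam * q) for q, p in zip(qi, px)]
            P[x] = [p + eta * (eps * q - lam * p) for q, p in zip(qi, px)]
    return P, Q

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 2.0}
P, Q = sgd_factorize(ratings, n_users=2, n_items=2)
sse = sum((r - sum(q * p for q, p in zip(Q[i], P[x]))) ** 2
          for (x, i), r in ratings.items())
print(sse)   # small: the known ratings are fit closely
```

Note the updates use the old `qi`, `px` values for both steps, matching the simultaneous update in the slide's equations.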

SLIDE 34

Koren, Bell, Volinsky, IEEE Computer, 2009


SLIDE 35


SLIDE 36


 μ = overall mean rating  bx = bias of user x  bi = bias of movie i

[Figure: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^{T}$, annotated with the user bias, movie bias, and user-movie interaction terms]

User-Movie interaction

  • Characterizes the matching between users and movies
  • Attracts most research in the field
  • Benefits from algorithmic and mathematical innovations

Baseline predictor

  • Separates users and movies
  • Benefits from insights into users' behavior
  • Among the main practical contributions of the competition

SLIDE 37

 We have expectations on the rating by user x of movie i, even without estimating x's attitude towards movies like i:

  – Rating scale of user x
  – Values of other ratings the user gave recently (day-specific mood, anchoring, multi-user accounts)
  – (Recent) popularity of movie i
  – Selection bias; related to the number of ratings the user gave on the same day ("frequency")


SLIDE 38

 Example:

  • Mean rating: μ = 3.7
  • You are a critical reviewer: your ratings are 1 star lower than the mean: bx = −1
  • Star Wars gets a mean rating 0.5 higher than the average movie: bi = +0.5
  • Predicted rating for you on Star Wars: 3.7 − 1 + 0.5 = 3.2


Overall mean rating + bias for user x + bias for movie i + User-Movie interaction:

$$\hat{r}_{xi} = \mu + b_x + b_i + q_i \cdot p_x^{T}$$
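The baseline part of the model is just three additive terms; a sketch using the slide's Star Wars numbers (the interaction term is ignored, as on the slide):

```python
def baseline_predict(mu, b_user, b_item, x, i, interaction=0.0):
    """r_hat_xi = mu + b_x + b_i + (q_i · p_x); interaction defaults to 0."""
    return mu + b_user[x] + b_item[i] + interaction

# The slide's example: mu = 3.7, b_x = -1 (critical reviewer), b_i = +0.5.
r_hat = baseline_predict(3.7, {"you": -1.0}, {"Star Wars": 0.5}, "you", "Star Wars")
print(r_hat)   # 3.7 - 1 + 0.5 = 3.2
```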

SLIDE 39

 Solve:

 Stochastic gradient descent to find the parameters

  • Note: Both biases bu, bi as well as interactions qi, pu

are treated as parameters (we estimate them)


$$\min_{P,Q,b} \underbrace{\sum_{(x,i) \in R} \Big( r_{xi} - (\mu + b_x + b_i + q_i p_x^{T}) \Big)^2}_{\text{goodness of fit}} + \lambda \underbrace{\Big[ \sum_i \lVert q_i \rVert^2 + \sum_x \lVert p_x \rVert^2 + \sum_x b_x^2 + \sum_i b_i^2 \Big]}_{\text{regularization}}$$

λ is selected via grid-search on a validation set

SLIDE 40

[Figure: RMSE (0.885-0.92) vs. millions of parameters (log scale, 1-1000) for CF (no time bias), Basic Latent Factors, and Latent Factors w/ Biases]


SLIDE 41

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors+Biases: 0.89
Collaborative filtering++: 0.91


SLIDE 42
SLIDE 43

 Sudden rise in the average movie rating (early 2004)

  • Improvements in Netflix
  • GUI improvements
  • Meaning of rating changed

 Movie age: two candidate explanations

  • Users prefer new movies without any particular reason
  • Older movies are just inherently better than newer ones


  • Y. Koren, Collaborative filtering with temporal dynamics, KDD '09

SLIDE 44

 Original model: $r_{xi} = \mu + b_x + b_i + q_i \cdot p_x^{T}$

 Add time dependence to the biases:

$$r_{xi} = \mu + b_x(t) + b_i(t) + q_i \cdot p_x^{T}$$

  • Make parameters bx and bi depend on time
  • (1) Parameterize the time-dependence by linear trends
  • (2) Each bin corresponds to 10 consecutive weeks

 Add temporal dependence to factors:

  • px(t) … user preference vector on day t


  • Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09
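One way to realize the binned variant of b_i(t) in code; the binning width comes from the slide (10 consecutive weeks), while the helper name and toy offsets are ours:

```python
def binned_movie_bias(b_i, bin_offsets, day, days_per_bin=10 * 7):
    """b_i(t) = b_i + b_i,Bin(t): a static movie bias plus a per-bin offset,
    where each bin covers 10 consecutive weeks (70 days) of the timeline."""
    b = day // days_per_bin          # which bin day t falls into
    return b_i + bin_offsets.get(b, 0.0)

# Toy offsets: the movie rated 0.1 lower than usual during its third bin (bin 2).
print(binned_movie_bias(0.5, {2: -0.1}, day=150))   # 150 // 70 = 2, so 0.5 - 0.1
```

Bins without an entry in `bin_offsets` fall back to the static bias, which mirrors the "shrink where data are scarce" idea from the regularization slides.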
SLIDE 45


[Figure: RMSE (0.875-0.92) vs. millions of parameters (log scale, 1-10000) for CF (no time bias), Basic Latent Factors, CF (time bias), Latent Factors w/ Biases, + Linear time factors, + Per-day user biases, + CF]

SLIDE 46

Grand Prize: 0.8563
Netflix: 0.9514
Movie average: 1.0533
User average: 1.0651
Global average: 1.1296
Basic Collaborative filtering: 0.94
Latent factors: 0.90
Latent factors+Biases: 0.89
Collaborative filtering++: 0.91


Latent factors+Biases+Time: 0.876

Still no prize! Getting desperate. Try a "kitchen sink" approach!

SLIDE 47


SLIDE 48


June 26th submission triggers 30-day “last call”

SLIDE 49

 Ensemble team formed

  • Group of other teams on leaderboard forms a new team
  • Relies on combining their models
  • Quickly also get a qualifying score over 10%

 BellKor

  • Continue to get small improvements in their scores
  • Realize that they are in direct competition with Ensemble

 Strategy

  • Both teams carefully monitoring the leaderboard
  • Only sure way to check for improvement is to submit a set of predictions
  • This alerts the other team of your latest score


SLIDE 50

 Submissions limited to 1 a day

  • Only 1 final submission could be made in the last 24h

 24 hours before deadline…

  • BellKor team member in Austria notices (by chance) that Ensemble has posted a score slightly better than BellKor's

 Frantic last 24 hours for both teams

  • Much computer time spent on final optimization
  • Carefully calibrated to end about an hour before the deadline

 Final submissions

  • BellKor submits a little early (on purpose), 40 mins before the deadline
  • Ensemble submits their final entry 20 mins later
  • …and everyone waits…


SLIDE 51


SLIDE 52


SLIDE 53

 Some slides and plots borrowed from Yehuda Koren, Robert Bell and Padhraic Smyth

 Further reading:

  • Y. Koren, Collaborative filtering with temporal dynamics, KDD '09
  • http://www2.research.att.com/~volinsky/netflix/bpc.html
  • http://www.the-ensemble.com/
