COMP9313: Big Data Management
Recommender System
Source from Dr. Xin Cao
2
Users reach items through search or recommendations.
Examples of items: products, web sites, blogs, news items, …
3
4
5
6
situational context)
(diversity)
7
8
[Figure: utility matrix of users (Alice, Bob, Carol, David) by movies (Avatar, LOTR, Matrix, Pirates)]
like
recommendation methods
9
bothered
10
11
Recommender systems reduce information overload by estimating relevance
12
Personalized recommendations
13
Collaborative: "Tell me what's popular among my peers"
14
Content-based: "Show me more of the same as what I've liked"
15
Knowledge-based: "Tell me what fits based on my needs"
16
Hybrid: combinations of various inputs and/or composition of different mechanisms
17
Show me more of the same as what I've liked
previous items rated highly by x
("content")
preferences)
18
19
likes
Item profiles
Red Circles Triangles
User profile
match recommend build
attributes
20
Title | Genre | Author | Type | Price | Keywords
The Night of the Gun | Memoir | David Carr | Paperback | 29.90 | Press and journalism, drug addiction, personal memoirs, New York
The Lace Reader | Fiction, Mystery | Brunonia Barry | Hardcover | 49.90 | American contemporary fiction, detective, historical
Into the Fire | Romance, Suspense | Suzanne Brockmann | Hardcover | 45.90 | American fiction, murder, neo-Nazism
(Term frequency * Inverse Doc Frequency)
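A minimal sketch of TF-IDF scoring for item keywords. Several TF-IDF variants exist; this one assumes TF = count divided by the largest count in the document and IDF = log2(N / number of documents containing the term). The keyword lists are hypothetical, loosely based on the book table above.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumed variant: TF = count / max count in the document,
    IDF = log2(N / number of documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())
        scores.append({
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical keyword lists for three items.
docs = [
    ["memoirs", "journalism", "addiction"],
    ["fiction", "detective", "historical", "fiction"],
    ["fiction", "murder"],
]
profiles = tf_idf(docs)
print(profiles[1])
```

A term that is frequent in one item but rare across items ("detective") scores higher than a term shared by many items ("fiction"), which is what makes the scores usable as an item profile.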
21
rating for item
$u(x, i) = \cos(x, i) = \frac{x \cdot i}{\lVert x \rVert \cdot \lVert i \rVert}$
22
tastes
listing content-features that caused an item to be recommended
23
24
25
show me more items favored by others who have similar tastes with me
recommendations
explicitly)
similar tastes in the future
26
27
by Alice
Alice in the past and who have rated item 𝑗
tastes in the future
28
Alice has not yet rated or seen
29
        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1
neighbors' ratings?
30
        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1
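Jaccard similarity: $sim(x, y) = \frac{|r_x \cap r_y|}{|r_x \cup r_y|}$
Cosine similarity: $sim(x, y) = \frac{r_x \cdot r_y}{\lVert r_x \rVert \cdot \lVert r_y \rVert}$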
31
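Pearson correlation ($S_{xy}$ = items rated by both x and y):
$sim(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$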
r_x = [*, _, _, *, ***]
r_y = [*, _, **, **, _]
r_x, r_y as sets: r_x = {1, 4, 5}, r_y = {1, 3, 4}
r_x, r_y as points: r_x = {1, 0, 0, 1, 3}, r_y = {1, 0, 2, 2, 0}
$\bar{r}_x$, $\bar{r}_y$ … avg. rating of x, y
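As a quick check of the two representations above, the following sketch computes Jaccard similarity on the set form and cosine similarity on the point form of r_x and r_y. The function names are illustrative only.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two sets of rated item ids."""
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity of two rating vectors (0 means not rated)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm_u = math.sqrt(sum(p * p for p in u))
    norm_v = math.sqrt(sum(q * q for q in v))
    return dot / (norm_u * norm_v)

# The example vectors from the slide.
rx_set, ry_set = {1, 4, 5}, {1, 3, 4}
rx_vec, ry_vec = [1, 0, 0, 1, 3], [1, 0, 2, 2, 0]

print(jaccard(rx_set, ry_set))             # 2 shared items out of 4 distinct
print(round(cosine(rx_vec, ry_vec), 3))
```

Jaccard ignores the rating values entirely, while cosine on the raw vectors treats a missing rating as 0; both weaknesses motivate the centered (Pearson) variant.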
32
sim A,B vs. A,C: 0.092 > -0.559
Notice cosine sim. is correlation when data is centered at 0
Cosine similarity: $sim(x, y) = \frac{\sum_i r_{xi} \cdot r_{yi}}{\sqrt{\sum_i r_{xi}^2} \cdot \sqrt{\sum_i r_{yi}^2}}$
33
correlation
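$sim(x, y) = \frac{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}} (r_{xs} - \bar{r}_x)^2} \sqrt{\sum_{s \in S_{xy}} (r_{ys} - \bar{r}_y)^2}}$
sim(Alice, User1) = 0.85, sim(Alice, User2) = 0.70, sim(Alice, User3) = 0.00, sim(Alice, User4) = -0.79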
        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1
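The similarity values for Alice can be reproduced with a short sketch. It assumes each user's mean is taken over the co-rated items only (Item1 to Item4); that choice is an assumption, but it is the one that matches the values on the slide.

```python
import math

ratings = {
    "Alice": [5, 3, 4, 4, None],   # None marks the unknown rating
    "User1": [3, 1, 2, 3, 3],
    "User2": [4, 3, 4, 3, 5],
    "User3": [3, 3, 1, 5, 4],
    "User4": [1, 5, 5, 2, 1],
}

def pearson(a, b):
    """Pearson correlation over the items rated by both users."""
    common = [(p, q) for p, q in zip(a, b) if p is not None and q is not None]
    mean_a = sum(p for p, _ in common) / len(common)
    mean_b = sum(q for _, q in common) / len(common)
    num = sum((p - mean_a) * (q - mean_b) for p, q in common)
    den_a = math.sqrt(sum((p - mean_a) ** 2 for p, _ in common))
    den_b = math.sqrt(sum((q - mean_b) ** 2 for _, q in common))
    return num / (den_a * den_b)

sims = {
    user: pearson(ratings["Alice"], row)
    for user, row in ratings.items()
    if user != "Alice"
}
for user, value in sims.items():
    print(user, round(value, 3))
```

The exact value for User2 is about 0.707; the slide's 0.70 appears to be truncated rather than rounded.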
From similarity metric to recommendations:
have rated item i
$r_{xi} = \frac{1}{k} \sum_{y \in N} r_{yi}$
$r_{xi} = \frac{\sum_{y \in N} s_{xy} \cdot r_{yi}}{\sum_{y \in N} s_{xy}}$
34
Shorthand: $s_{xy} = sim(x, y)$
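A small sketch of both prediction rules, applied to Alice and Item5. Choosing k = 2 neighbors (User1 and User2, the two most similar users) is an assumption made for illustration; the similarities 0.85 and 0.70 and the Item5 ratings 3 and 5 come from the example table.

```python
def predict_average(neighbor_ratings):
    """Plain average of the neighbors' ratings for the item."""
    return sum(neighbor_ratings) / len(neighbor_ratings)

def predict_weighted(similarities, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings."""
    weighted = sum(s * r for s, r in zip(similarities, neighbor_ratings))
    return weighted / sum(similarities)

# Assumed neighborhood N = {User1, User2}.
neighbor_sims = [0.85, 0.70]
neighbor_item5 = [3, 5]

print(predict_average(neighbor_item5))
print(round(predict_weighted(neighbor_sims, neighbor_item5), 2))
```

The weighted rule pulls the estimate slightly toward User1's rating because User1 is the more similar neighbor.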
millions of items
35
items
as in user-user model
36
$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x similar to i
rating for Item5
37
        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1
38
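Ratings matrix (rows: movies 1-6, columns: users 1-12, "." = unknown rating):
movie \ user   1  2  3  4  5  6  7  8  9 10 11 12
1              1  .  3  .  .  5  .  .  5  .  4  .
2              .  .  5  4  .  .  4  .  .  2  1  3
3              2  4  .  1  2  .  3  .  4  3  5  .
4              .  2  4  .  5  .  .  4  .  .  2  .
5              .  .  4  3  4  2  .  .  .  .  2  5
6              1  .  3  .  3  .  .  2  .  .  4  .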
39
[Figure: the same ratings matrix (rows: movies 1-6, columns: users 1-12), with the rating of movie 1 by user 5 marked "?" as the value to predict]
40
Neighbor selection: Identify movies similar to movie 1, rated by user 5
[Figure: the ratings matrix with an extra column sim(1,m), the similarity of movie 1 to each movie m; recoverable values: sim(1,1) = 1.00, sim(1,3) = 0.41, sim(1,6) = 0.59]
Here we use adjusted cosine similarity:
1) Subtract mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between rows
41
Compute similarity weights:
s1,3=0.41, s1,6=0.59
42
[Figure: the ratings matrix with the predicted rating 2.6 filled in for movie 1, user 5]
Predict by taking weighted average: $r_{1,5} = (0.41 \cdot 2 + 0.59 \cdot 3) / (0.41 + 0.59) = 2.6$
$r_{ix} = \frac{\sum_{j \in N(i;x)} s_{ij} \cdot r_{jx}}{\sum_{j \in N(i;x)} s_{ij}}$
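The whole worked example (centering, similarity, neighbor selection, weighted average) can be sketched end to end. The matrix below is a reconstruction of the slide's figure and should be treated as an assumption; it does reproduce the stated values m_1 = 3.6, s_{1,3} = 0.41, s_{1,6} = 0.59 and the prediction 2.6. Zeros stand for unknown ratings, and a neighborhood of size 2 is assumed.

```python
import numpy as np

# Rows: movies 1-6, columns: users 1-12, 0 = unknown (reconstructed matrix).
R = np.array([
    [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0],
    [0, 0, 5, 4, 0, 0, 4, 0, 0, 2, 1, 3],
    [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0],
    [0, 2, 4, 0, 5, 0, 0, 4, 0, 0, 2, 0],
    [0, 0, 4, 3, 4, 2, 0, 0, 0, 0, 2, 5],
    [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0],
], dtype=float)

def center_rows(R):
    """Subtract each movie's mean rating from its known ratings only."""
    rated = R > 0
    means = R.sum(axis=1) / rated.sum(axis=1)
    return np.where(rated, R - means[:, None], 0.0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(R, movie, user, k=2):
    """Weighted average over the k most similar movies the user has rated."""
    C = center_rows(R)
    sims = [cosine(C[movie], C[m]) for m in range(R.shape[0])]
    candidates = [m for m in range(R.shape[0]) if m != movie and R[m, user] > 0]
    neighbors = sorted(candidates, key=lambda m: sims[m], reverse=True)[:k]
    weighted = sum(sims[m] * R[m, user] for m in neighbors)
    return weighted / sum(sims[m] for m in neighbors), sims, neighbors

# Movie 1 and user 5 are at zero-based indices 0 and 4.
prediction, sims, neighbors = predict(R, movie=0, user=4)
print([round(s, 2) for s in sims])
print([m + 1 for m in neighbors], round(prediction, 1))
```

The selected neighbors are movies 6 and 3, the only movies rated by user 5 that have positive similarity to movie 1.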
43
[Figure: utility matrix of users (Alice, Bob, Carol, David) by movies (Avatar, LOTR, Matrix, Pirates)]
■ In practice, it has been observed that item-item often works better
than user-user
matrices → poor recommendation quality
44
the recommender system is embedded
recommender systems interpret this behavior as a positive rating
downloads …
additional efforts from the side of the user
interpreted
has bought; the user also might have bought a book for someone else
question of correctness of interpretation
45
https://mahout.apache.org/users/basics/algorithms.html
46
47
48
[Figure: utility matrix of movie ratings by users]
49
[Figure: the same utility matrix with some ratings hidden ("?") to form the Test Data Set]
$r_{xi} - r_{xi}^{*}$
$r_{xi}^{*}$ is the true rating of x on i
50
(RMSE) = $\sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$
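A minimal sketch of the RMSE computation over a held-out set of (item, user) pairs. The dictionaries and their values are made-up illustration data, not taken from the slides.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error over the (item, user) pairs in `actual`."""
    squared_errors = [(predicted[key] - actual[key]) ** 2 for key in actual]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical test ratings and predictions, keyed by (item, user).
actual = {(0, 0): 4.0, (1, 2): 2.0, (2, 1): 5.0}
predicted = {(0, 0): 3.5, (1, 2): 2.5, (2, 1): 4.0}

print(round(rmse(predicted, actual), 4))
```

Because errors are squared before averaging, the single error of 1.0 contributes four times as much as each error of 0.5.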
51
52
[Figure: utility matrix R, 480,000 users × 17,700 movies]
53
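[Figure: matrix R (480,000 users × 17,700 movies) split into a Training Data Set and a Test Data Set whose ratings are hidden ("?")]
RMSE = $\sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$
$r_{xi}$: true rating of user x on item i (example entry in the figure: $r_{6,3}$)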
Global effects Factorization Collaborative filtering
54
55
RMSE by method:
Global average: 1.1296
User average: 1.0651
Movie average: 1.0533
Netflix: 0.9514
Basic Collaborative filtering: 0.94
CF+Biases+learned weights: 0.91
Grand Prize: 0.8563
⇒ Baseline estimation: Joe will rate The Sixth Sense 4 stars
Joe will rate The Sixth Sense 3.8 stars
56
57
$\hat{r}_{xi} = b_{xi} + \frac{\sum_{j \in N(i;x)} s_{ij} \cdot (r_{xj} - b_{xj})}{\sum_{j \in N(i;x)} s_{ij}}$
baseline estimate for r_xi: $b_{xi} = \mu + b_x + b_i$
μ = overall mean rating
b_x = rating deviation of user x = (avg. rating of user x) - μ
b_i = rating deviation of movie i = (avg. rating of movie i) - μ
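The baseline estimate can be sketched directly from these definitions. The user and movie ids and all ratings below are hypothetical toy data.

```python
def baseline(ratings, user, item):
    """b_xi = mu + b_x + b_i, computed from a {(user, item): rating} dict."""
    mu = sum(ratings.values()) / len(ratings)
    user_ratings = [r for (u, _), r in ratings.items() if u == user]
    item_ratings = [r for (_, i), r in ratings.items() if i == item]
    b_x = sum(user_ratings) / len(user_ratings) - mu   # user deviation
    b_i = sum(item_ratings) / len(item_ratings) - mu   # movie deviation
    return mu + b_x + b_i

# Hypothetical ratings; ("u3", "m2") is the pair to estimate.
ratings = {
    ("u1", "m1"): 5, ("u1", "m2"): 3,
    ("u2", "m1"): 4, ("u2", "m2"): 2,
    ("u3", "m1"): 3,
}

print(round(baseline(ratings, "u3", "m2"), 2))
```

Here μ = 3.4, user u3 rates 0.4 below average and movie m2 is rated 0.9 below average, so the baseline is 3.4 - 0.4 - 0.9 = 2.1.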
Problems/Issues:
1) Similarity measures are “arbitrary”
2) Pairwise similarities neglect interdependencies among users
3) Taking a weighted average can be restricting
Solution: Instead of s_ij use w_ij that we estimate directly from data
$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
similar to movie i
does not depend on user x)
58
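$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
RMSE: $\sqrt{\frac{1}{|R|} \sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2}$ or equivalently SSE: $\sum_{(i,x) \in R} (\hat{r}_{xi} - r_{xi})^2$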
59
Lower RMSE ⇒ better recommendations
that user has not yet seen. Can’t really do this!
known (user, item) ratings, and hope the system will also predict the unknown ratings well
60
known (user, item) ratings
$J(w) = \sum_{x,i} \left( \left[ b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj}) \right] - r_{xi} \right)^2$
61
Predicted rating True rating
$\hat{r}_{xi} = b_{xi} + \sum_{j \in N(i;x)} w_{ij} (r_{xj} - b_{xj})$
arbitrary similarity measure (w_ij ≠ s_ij)
interrelationships among the neighboring movies
62
Global effects
Factorization
CF/NN
Analysis
63
The SVD theorem states that a given matrix $M$ can be decomposed into a product of three matrices as follows:
$M = U \Sigma V^{T}$
values of the diagonal of Σ are called the singular values
most important features – those with the largest singular values
some linear algebra software) but retain only the three most important features by taking only the first three columns of $U$ and the first three rows of $V^{T}$
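A short sketch of this truncation using NumPy's `numpy.linalg.svd`. The ratings matrix is made-up illustration data, and `truncated_svd` and `reconstruct` are hypothetical helper names.

```python
import numpy as np

def truncated_svd(M, k):
    """Keep the k largest singular values and the matching vectors."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Singular values come back sorted in descending order.
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

def reconstruct(M, k):
    """Rank-k approximation U_k * Sigma_k * V_k^T of M."""
    U_k, S_k, Vt_k = truncated_svd(M, k)
    return U_k @ S_k @ Vt_k

# Hypothetical 5 x 4 ratings matrix (no missing entries).
M = np.array([
    [5, 4, 1, 0],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
    [3, 3, 3, 3],
], dtype=float)

U_k, S_k, Vt_k = truncated_svd(M, k=3)
print(U_k.shape, S_k.shape, Vt_k.shape)
print(round(float(np.linalg.norm(M - reconstruct(M, 3))), 4))
```

Keeping more singular values never increases the reconstruction error, and keeping all of them reproduces M up to floating-point error. Note that this plain SVD requires a fully filled matrix.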
64
matrix R as a product of “thin” $Q \cdot P^{T}$
we don’t care about the values on the missing ones
65
[Figure: ratings matrix R (items × users) approximated by Q (items × factors) · P^T (factors × users)]
66
[Figure: ratings matrix R (items × users) with one missing entry marked "?", approximated by Q (items × factors) · P^T (factors × users)]
$q_i$ = row $i$ of $Q$, $p_x$ = column $x$ of $P^{T}$
$\hat{r}_{xi} = q_i \cdot p_x^{T} = \sum_f q_{if} \cdot p_{xf}$
Estimate of the missing entry in the figure: 2.4
67
$\min_{P,Q} \sum_{(i,x) \in R} \left( r_{xi} - q_i \cdot p_x^{T} \right)^2$
[Figure: ratings matrix R (items × users) approximated by Q (items × factors) · P^T (factors × users)]
$\hat{r}_{xi} = q_i \cdot p_x^{T}$
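One way to minimize this objective over the known ratings only is stochastic gradient descent; the slides in this section do not name an optimizer, so SGD, the learning rate, the number of factors and the toy ratings below are all assumptions for illustration. No regularization is included.

```python
import numpy as np

def sse(ratings, Q, P):
    """Sum of squared errors over the known (item, user, rating) triples."""
    return sum((r - Q[i] @ P[x]) ** 2 for i, x, r in ratings)

def factorize(ratings, n_items, n_users, k=2, epochs=500, lr=0.01, seed=0):
    """Fit R ~ Q . P^T on the known entries with plain SGD."""
    rng = np.random.default_rng(seed)
    Q = rng.uniform(0.0, 1.0, size=(n_items, k))   # item factors q_i
    P = rng.uniform(0.0, 1.0, size=(n_users, k))   # user factors p_x
    history = [sse(ratings, Q, P)]
    for _ in range(epochs):
        for i, x, r in ratings:
            err = r - Q[i] @ P[x]
            q_old = Q[i].copy()
            # Gradient steps on (r - q_i . p_x)^2 for q_i and p_x.
            Q[i] += lr * err * P[x]
            P[x] += lr * err * q_old
        history.append(sse(ratings, Q, P))
    return Q, P, history

# Hypothetical known ratings as (item, user, rating); other entries are missing.
ratings = [
    (0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (1, 3, 2),
    (2, 1, 2), (2, 2, 5), (2, 3, 5), (3, 0, 1), (3, 3, 4),
]
Q, P, history = factorize(ratings, n_items=4, n_users=4)

print(round(float(history[0]), 2), round(float(history[-1]), 4))
print(round(float(Q[0] @ P[2]), 2))  # estimate for the missing entry (item 0, user 2)
```

Only the rated entries enter the loss, which is what distinguishes this from a plain SVD of a matrix with zeros filled in for the missing values.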
rated
68
69
Knowledge-based: "Tell me what fits based on my needs"
70
recommendation process
71
72