SLIDE 1 Implicit Feedback and Performance Evaluation in Recommender Systems
Shay Ben Elazar, Mike Gartrell, Noam Koenigstein, Gal Lavee
SLIDE 2 Agenda
- Intro: Universal Store Recommendations
- Extreme Classification with Matrix Factorization
- Offline Evaluation Techniques
- Online Evaluation
- The Gap
- Bridging The Gap…
SLIDE 3
Microsoft Universal Store Recommendations
SLIDE 4
Windows Store
SLIDE 5
Groove Music
SLIDE 6
Xbox
SLIDE 7
Extreme Classification with Matrix Factorization
SLIDE 8 History: Netflix Prize
[Illustration: sparse user-item matrix of explicit star ratings (1-5)]
SLIDE 9 Two-class data – Extreme Classification
[Illustration: user-item matrix with two-class (positive/negative) feedback]
SLIDE 10 One-class data
[Illustration: user-item matrix with one-class (positive-only) feedback]
SLIDE 11 Problem formulation
N ≈ 10K–1M nodes, M ≈ 10–500M nodes
Bipartite graph → we care about ? = p(link) for the unobserved user-item pairs
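As a rough sketch of what such a model predicts (names and dimensions are illustrative, not the production system), a matrix-factorization link predictor gives every user and item a latent vector and turns their dot product into a link probability:

import numpy as np

def link_probability(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    # Score one edge of the bipartite graph and squash it into [0, 1].
    score = user_vec @ item_vec + user_bias + item_bias
    return 1.0 / (1.0 + np.exp(-score))

# Toy usage with 20-dimensional latent traits for one user and one item.
rng = np.random.default_rng(0)
u, v = rng.normal(size=20), rng.normal(size=20)
print(link_probability(u, v))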
SLIDE 12
Fully Bayesian model based on Variational Bayes optimization
SLIDE 13
Offline Evaluation Techniques
SLIDE 14 RMSE - Root Mean Square Error
RMSE is computed by averaging the squared error over all user-item pairs (u, i) ∈ R:
RMSE = \sqrt{ \frac{1}{|\mathcal{R}|} \sum_{(u,i) \in \mathcal{R}} \mathrm{SE}_{ui} }, where \mathrm{SE}_{ui} = (r_{ui} - \hat{r}_{ui})^2 is the squared error for pair (u, i).
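A minimal sketch of this computation in Python (variable names are illustrative):

import numpy as np

def rmse(ratings, predictions):
    # Root mean square error over all observed user-item pairs.
    ratings, predictions = np.asarray(ratings, float), np.asarray(predictions, float)
    return np.sqrt(np.mean((ratings - predictions) ** 2))

print(rmse([4, 2, 5], [3.5, 2.5, 4.0]))  # ~0.707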
SLIDE 15 wRMSE - Weighted Root Mean Square Error
This variant of RMSE is obtained by assigning each data point a weight w_{ui} based on its importance:
wRMSE = \sqrt{ \frac{1}{\sum_{(u,i) \in \mathcal{R}} w_{ui}} \sum_{(u,i) \in \mathcal{R}} w_{ui} \cdot \mathrm{SE}_{ui} }
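The weighted variant only changes the averaging; a sketch under the same assumptions as the RMSE snippet:

import numpy as np

def weighted_rmse(ratings, predictions, weights):
    # Each squared error is scaled by its importance weight w_ui,
    # and the sum is normalized by the total weight.
    ratings, predictions, weights = (np.asarray(a, float) for a in (ratings, predictions, weights))
    squared_errors = (ratings - predictions) ** 2
    return np.sqrt(np.sum(weights * squared_errors) / np.sum(weights))

print(weighted_rmse([4, 2, 5], [3.5, 2.5, 4.0], [1.0, 1.0, 2.0]))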
SLIDE 16 Precision@k / Recall@k
Ground truth: Positive Result 1, Positive Result 2, Positive Result 3
Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2
k = 3
precision@k = 2/3   recall@k = 2/3
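A small sketch that reproduces the example above (item names are placeholders):

def precision_recall_at_k(ranked_items, positives, k):
    # Precision@k and Recall@k for a single user's ranked list.
    hits = sum(1 for item in ranked_items[:k] if item in positives)
    return hits / k, hits / len(positives)

ranking = ["pos1", "pos3", "neg", "pos2"]
print(precision_recall_at_k(ranking, {"pos1", "pos2", "pos3"}, k=3))  # (2/3, 2/3)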
SLIDE 17 Mean Average Precision
[Plot: precision as a function of recall, with recall at 0%, 33%, 67%, 100%]
We can plot precision as a function of recall
Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2
Average Precision is the average of the precision values taken at the ranks where the positive results appear.
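A sketch of average precision for one ranked list; MAP is then the mean of this value over all test users (item names are placeholders):

def average_precision(ranked_items, positives):
    # Average the precision@i values at every rank i where a positive item appears.
    hits, precisions = 0, []
    for i, item in enumerate(ranked_items, start=1):
        if item in positives:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(positives) if positives else 0.0

print(average_precision(["pos1", "pos3", "neg", "pos2"], {"pos1", "pos2", "pos3"}))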
SLIDE 18 NDCG - Normalized Discounted Cumulative Gain
The relevance at rank j is discounted by \delta_j = \frac{1}{\log_2(j+1)}, and the sum @k is normalized by its upper bound, the IDCG.
Ground truth: Positive Result 1 (relevance 5), Positive Result 2 (relevance 3), Positive Result 3 (relevance 1)
Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2
k = 3
DCG@k = \frac{1}{\log_2(1+1)} + 0 + \frac{5}{\log_2(3+1)} = 3.5
IDCG@k = \frac{5}{\log_2(1+1)} + \frac{3}{\log_2(2+1)} + \frac{1}{\log_2(3+1)} = 7.39
NDCG@k = \frac{3.5}{7.39} ≈ 0.47
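A sketch of the computation; the relevance list is ordered as the algorithm ranked the items and is chosen to reproduce the slide's DCG@k = 3.5:

import math

def dcg_at_k(relevances, k):
    # Relevance at rank j is discounted by 1 / log2(j + 1).
    return sum(rel / math.log2(j + 1) for j, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # Normalize DCG by its upper bound, the ideal DCG of the sorted relevances.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([1, 0, 5, 3], k=3))  # ~0.47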
SLIDE 19 MPR - Mean Percentile Rank
Sometimes there is only one "positive" item in the test set...
Ground truth: Positive Result 1
Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2, Negative Result, Negative Result
rank_i = 3, MPR = 0.5
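A sketch of the percentile rank of a single positive item; MPR averages this quantity over all test cases. Conventions differ (rank/N vs. (rank-1)/(N-1)); the rank/N form used here reproduces the 0.5 above:

def percentile_rank(ranked_items, positive_item):
    # 1-based rank of the positive item, normalized by the list length.
    rank = ranked_items.index(positive_item) + 1
    return rank / len(ranked_items)

ranking = ["neg", "neg", "pos1", "neg", "neg", "neg"]
print(percentile_rank(ranking, "pos1"))  # 0.5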
SLIDE 20
MPR in Xbox
SLIDE 21 Spearman’s Rho Coefficient
In scenarios where we want to emphasize the full ranking, we may compare the ranking of the algorithm to a reference ranking.
Ground truth ranking: Result 1, Result 2, Result 3, Result 4
Ranking induced by algorithm: Result 1, Result 3, Result 4, Result 2
Rank differences: r_1 - \hat{r}_1 = 1 - 3,  r_2 - \hat{r}_2 = 2 - 4,  r_3 - \hat{r}_3 = 3 - 1,  r_4 - \hat{r}_4 = 4 - 2
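From these rank differences, Spearman's rho follows the standard closed form 1 - 6·Σd²/(n(n²-1)); a minimal sketch:

def spearman_rho(true_ranks, predicted_ranks):
    # Spearman's rho from squared rank differences.
    n = len(true_ranks)
    d_squared = sum((r - r_hat) ** 2 for r, r_hat in zip(true_ranks, predicted_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# The differences above: 1-3, 2-4, 3-1, 4-2.
print(spearman_rho([1, 2, 3, 4], [3, 4, 1, 2]))  # -0.6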
SLIDE 22 Kendall’s Tau Coefficient
In scenarios where we want to emphasize the full ranking, we may compare the ranking of the algorithm to a reference ranking.
Ground truth ranking: Positive Result 1, Positive Result 2, Positive Result 3
Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2, Negative Result
Same order: \mathrm{sign}(r_1 - r_2) \cdot \mathrm{sign}(\hat{r}_1 - \hat{r}_2) = 1
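Kendall's tau counts how many item pairs keep the same order in both rankings versus how many flip; a brute-force sketch:

def kendall_tau(true_ranks, predicted_ranks):
    # (concordant pairs - discordant pairs) / total number of pairs.
    n = len(true_ranks)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (true_ranks[i] - true_ranks[j]) * (predicted_ranks[i] - predicted_ranks[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [3, 4, 1, 2]))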
SLIDE 23 Offline Techniques – Open Questions
- How do we measure the importance / relevance of the positive items?
- Long-tail items are more important, but how do we quantify that?
- How many items do we care to recommend?
- Should the best item be the first item?
- Maybe the best item should be in the middle?
- What about diversity?
- What about contextual effects?
- What about item fatigue?
SLIDE 24
Online Experimentation
SLIDE 25 Online Experiments
- Randomized controlled experiments
- Measure KPIs (Key Performance Indicators) directly
- Can compare several variants simultaneously
- The ultimate evaluation technique!
SLIDE 26
Online Experiments in Xbox
SLIDE 27 Game Purchase
Direct Purchases
SLIDE 28 Total Game Purchase
Total Purchases
SLIDE 29 Experimentation Caveats
- What KPIs to measure?
- How long to run the experiment?
- External factors may influence the results
- Cannibalization is hard to account for
- Expensive to implement
- Can’t compare algorithms before “lighting up”
SLIDE 30
The Gap
SLIDE 31
Accuracy and Diversity Interactions
SLIDE 32 Characterizing The Offline / Online Evaluation Gap
- Overemphasis of popular items
- List recommendations (diversity, item position)
- Freshness / fatigue
- Contextual information is not fully utilized
- Learning from historical data lets you predict the future. But what we really care about is changing the future!
SLIDE 33
Bridging The Gap
SLIDE 34 Mitigating Evaluation Techniques
- Domain experts / focus groups
- Internal user studies
- Off-policy evaluation techniques
SLIDE 35 Off-Policy Evaluation - Example
\hat{V}_h^{\pi}(S) is the expected reward of a policy h given data S collected by a "logging policy" \pi:
\hat{V}_h^{\pi}(S) = \frac{1}{|S|} \sum_{(x, a, r) \in S} \frac{r \cdot \mathbb{I}[h(x) = a]}{\max(\hat{\pi}(a \mid x), \tau)}
where S denotes the set of context-action-reward tuples (x, a, r) available in the logs.
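A sketch of this estimator with hypothetical names: each logged tuple carries the context, the logged action, the observed reward, and the logging policy's propensity for that action; rewards count only when the candidate policy h picks the logged action, and propensities are clipped at tau:

def off_policy_value(logs, policy, tau=0.01):
    # Clipped importance-weighted estimate of the expected reward of `policy`.
    total = 0.0
    for context, action, reward, propensity in logs:
        if policy(context) == action:               # the indicator I[h(x) == a]
            total += reward / max(propensity, tau)  # clip small propensities at tau
    return total / len(logs)

# Toy usage: integer contexts, a candidate policy that recommends item (x % 2).
logs = [(0, 0, 1.0, 0.5), (1, 1, 0.0, 0.25), (2, 1, 1.0, 0.2)]
print(off_policy_value(logs, policy=lambda x: x % 2))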
SLIDE 36 Caveats of Off-policy Evaluation
- Need to formulate everything in terms of a policy
- Needs sufficient support
- Becomes very difficult when your policies are time-dependent
SLIDE 37
We are looking for postdoc researchers to join us in Israel… Email: RecoRecruitmentEmail@microsoft.com
Thank you!