Implicit Feedback and Performance Evaluation in Recommender Systems - PowerPoint PPT Presentation



SLIDE 1

Implicit Feedback and Performance Evaluation in Recommender Systems

Shay Ben Elazar, Mike Gartrell, Noam Koenigstein, Gal Lavee

SLIDE 2

Agenda

  • Intro: Universal Store Recommendations
  • Extreme Classification with Matrix Factorization
  • Offline Evaluation Techniques
  • Online Evaluation
  • The Gap
  • Bridging The Gap…
SLIDE 3

Microsoft Universal Store Recommendations

SLIDE 4

Windows Store

SLIDE 5

Groove Music

SLIDE 6

Xbox

SLIDE 7

Extreme Classification with Matrix Factorization

SLIDE 8

History: Netflix Prize

[Figure: user–item matrix of explicit star ratings (1–5), with most entries missing]

SLIDE 9

Two-class data – Extreme Classification

[Figure: user–item matrix with two-class labels (positives and negatives)]

SLIDE 10

One-class data

[Figure: user–item matrix with one-class data — only positive observations; all other entries unknown]

SLIDE 11

Problem formulation

[Figure: bipartite graph with N ≈ 10K–1M nodes on one side and M ≈ 10–500M nodes on the other; unobserved links marked “?”]

Bipartite graph → we care about ? = p(link)
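The deck's actual model (next slide) is a fully Bayesian matrix factorization trained with variational Bayes. As a much simpler illustrative sketch of how a factorization turns the “?” cells into link probabilities — the function name, the sigmoid link, and the bias term are our assumptions, not the deck's:

```python
import numpy as np

# Hypothetical sketch: score an unobserved user-item link from learned
# latent vectors. Each user u and item i has a latent vector; the inner
# product is squashed to a probability of a link.
def link_probability(user_vec, item_vec, bias=0.0):
    """Sigmoid of the latent inner product -> p(link)."""
    return 1.0 / (1.0 + np.exp(-(user_vec @ item_vec + bias)))
```

With all-zero vectors the score is 0.5, i.e. maximal uncertainty about the link.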

SLIDE 12

Fully Bayesian model based on Variational Bayes optimization

SLIDE 13

Offline Evaluation Techniques

SLIDE 14

RMSE - Root Mean Square Error

RMSE is computed by averaging the squared error over all user–item pairs (u, i) ∈ ℛ:

RMSE = √( (1/|ℛ|) Σ_{(u,i)∈ℛ} (r_ui − r̂_ui)² )
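A minimal sketch of the formula above, assuming ratings and predictions are stored as dicts keyed by (user, item) pair (the function name `rmse` is ours):

```python
import numpy as np

# RMSE over the observed (user, item) pairs only: average the squared
# error on the pairs present in `ratings`, then take the square root.
def rmse(ratings, predictions):
    sq_errors = [(ratings[k] - predictions[k]) ** 2 for k in ratings]
    return float(np.sqrt(np.mean(sq_errors)))
```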

SLIDE 15

wRMSE - Weighted Root Mean Square Error

This variant of RMSE is achieved by assigning each data point a weight w_ui based on its importance:

wRMSE = √( (1/Σ_{(u,i)∈ℛ} w_ui) Σ_{(u,i)∈ℛ} w_ui · (r_ui − r̂_ui)² )
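The weighted variant as a sketch under the same dict-based assumptions (names are ours); with all weights equal it reduces to plain RMSE:

```python
import numpy as np

# Weighted RMSE: each observed pair (u, i) contributes w_ui * SE_ui,
# and the sum is normalized by the total weight instead of the count.
def weighted_rmse(ratings, predictions, weights):
    keys = list(ratings)
    w = np.array([weights[k] for k in keys], dtype=float)
    se = np.array([(ratings[k] - predictions[k]) ** 2 for k in keys])
    return float(np.sqrt((w * se).sum() / w.sum()))
```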

SLIDE 16

Precision@𝑙 / Recall@𝑙

Ground truth: Positive Result 1, Positive Result 2, Positive Result 3. Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2.

l = 3

precision@l = 2/3, recall@l = 2/3
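A sketch of both metrics for a single user (function name ours): precision@l divides the hits in the top l by l, recall@l divides them by the number of ground-truth positives.

```python
def precision_recall_at_l(ranked_items, positives, l):
    """ranked_items: items ordered by predicted score, best first.
    positives: set of ground-truth positive items."""
    top = ranked_items[:l]
    hits = sum(1 for item in top if item in positives)
    return hits / l, hits / len(positives)
```

On the slide's example — top-3 of the induced ranking contains 2 of the 3 positives — both values come out 2/3.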

SLIDE 17

Mean Average Precision

We can plot precision as a function of recall.

[Plot: “Recall v Precision” — precision (0–100%) against recall at 0%, 33%, 67%, 100%]

Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2 → Average Precision
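As a sketch (name ours): Average Precision for one user is the mean of precision@k taken at every rank k where a positive item appears; MAP averages this value across users.

```python
def average_precision(ranked_items, positives):
    """AP = mean over the positives of precision@(rank of that positive)."""
    hits, score = 0, 0.0
    for k, item in enumerate(ranked_items, start=1):
        if item in positives:
            hits += 1
            score += hits / k  # precision@k at this hit
    return score / len(positives)
```

On the slide's ranking the positives land at ranks 1, 2 and 4, giving (1/1 + 2/2 + 3/4) / 3 ≈ 0.917.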

SLIDE 18

NDCG – Normalized Discounted Cumulative Gain

The relevance at position j is discounted by δ_j = 1/log₂(j+1), and the sum @k is normalized by its upper bound – the IDCG.

Ground truth: Positive Result 1 (relevance 5), Positive Result 2 (relevance 3), Positive Result 3 (relevance 1). Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2.

l = 3

DCG@l = 1/log₂(1+1) + 0 + 5/log₂(3+1) = 3.5
IDCG@l = 5/log₂(1+1) + 3/log₂(2+1) + 1/log₂(3+1) = 7.39
NDCG@l = 3.5 / 7.39 ≈ 0.47
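A sketch of NDCG@l (name ours), taking the graded relevances in the order the algorithm ranked the items and normalizing by the DCG of the ideal (descending-relevance) ordering:

```python
import math

def ndcg_at_l(relevances, l):
    """relevances: graded relevance of items in algorithm-ranked order."""
    def dcg(rels):
        # discount delta_j = 1 / log2(j + 1), positions j = 1..l
        return sum(r / math.log2(j + 1) for j, r in enumerate(rels[:l], start=1))
    return dcg(relevances) / dcg(sorted(relevances, reverse=True))
```

With the relevance sequence [1, 0, 5, 3] at l = 3 this reproduces the slide's 3.5 / 7.39 ≈ 0.47.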

SLIDE 19

MPR - Mean Percentile Rank

Sometimes there is only one “positive” item in the test set…

Ground truth: Positive Result 1. Ranking induced by algorithm: six results, with the positive item at rank 3.

rankᵢ = 3, MPR = 3/6 = 0.5
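A sketch (name and the rank/length normalization are our convention, matching the slide's 3/6 = 0.5 example): each test case contributes the positive item's percentile rank, and MPR averages across cases — 0 means the positive item is always ranked first, so lower is better.

```python
def mean_percentile_rank(test_cases):
    """test_cases: list of (ranked_items, positive_item) pairs, one per user."""
    pr = [
        (ranked.index(pos) + 1) / len(ranked)  # percentile rank of the positive
        for ranked, pos in test_cases
    ]
    return sum(pr) / len(pr)
```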

SLIDE 20

MPR in Xbox

SLIDE 21

Spearman’s Rho Coefficient

In scenarios where we want to emphasize the full ranking, we may compare the ranking of the algorithm to a reference ranking.

Ground truth ranking: Result 1, Result 2, Result 3, Result 4. Ranking induced by algorithm: Result 3, Result 4, Result 1, Result 2.

r₁ − r̂₁ = 1 − 3
r₂ − r̂₂ = 2 − 4
r₃ − r̂₃ = 3 − 1
r₄ − r̂₄ = 4 − 2

ρ = 1 − 6 Σᵢ (rᵢ − r̂ᵢ)² / (n(n² − 1)) = 1 − 6·16/60 = −0.6
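The rank-difference formula above as a sketch (function name ours), taking the two rank lists for the same n items:

```python
def spearman_rho(ranks, ranks_hat):
    """Spearman's rho: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(ranks)
    d2 = sum((r - rh) ** 2 for r, rh in zip(ranks, ranks_hat))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

On the slide's ranks (1,2,3,4) vs (3,4,1,2) this gives −0.6.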

SLIDE 22

Kendall’s Tau Coefficient

In scenarios where we want to emphasize the full ranking, we may compare the ranking of the algorithm to a reference ranking.

Ground truth: Positive Result 1, Positive Result 2, Positive Result 3. Ranking induced by algorithm: Positive Result 1, Positive Result 3, Negative Result, Positive Result 2.

Same order: sign(r₁ − r₂) · sign(r̂₁ − r̂₂) = 1

τ = (#concordant pairs − #discordant pairs) / (n(n − 1)/2)
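A sketch of Kendall's tau over all item pairs (names ours, no tie handling): a pair is concordant when both rankings order it the same way, i.e. the sign product above is +1.

```python
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def kendall_tau(ranks, ranks_hat):
    """tau = (#concordant - #discordant) / (n choose 2)."""
    n = len(ranks)
    total = sum(
        sign(ranks[i] - ranks[j]) * sign(ranks_hat[i] - ranks_hat[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)
```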

SLIDE 23

Offline Techniques – Open Questions

  • How do we measure the importance/relevance of the positive items?
  • Long-tail items are more important, but how do we quantify that?
  • How many items do we care to recommend?
  • Should the best item be the first item?
  • Maybe the best item should be in the middle?
  • What about diversity?
  • What about contextual effects?
  • What about item fatigue?
SLIDE 24

Online Experimentation

SLIDE 25

Online Experiments

  • Randomized controlled experiments
  • Measure KPIs (Key Performance Indicators) directly
  • Can compare several variants simultaneously
  • The ultimate evaluation technique!
SLIDE 26

Online Experiments in Xbox

SLIDE 27

Game Purchase

Direct Purchases

SLIDE 28

Total Game Purchase

Total Purchases

SLIDE 29

Experimentation Caveats

  • What KPIs to measure?
  • How long to run the experiment?
  • External factors may influence the results
  • Cannibalization is hard to account for
  • Expensive to implement
  • Can’t compare algorithms before “lighting up”
SLIDE 30

The Gap

SLIDE 31

Accuracy and Diversity Interactions

SLIDE 32

Characterizing The Offline / Online Evaluation Gap

  • Overemphasis of popular items
  • List recommendations (diversity, item position)
  • Freshness / fatigue
  • Contextual information is not fully utilized
  • Learning from historical data lets you predict the future. But what we really care about is changing the future!

SLIDE 33

Bridging The Gap

SLIDE 34

Mitigating Evaluation Techniques

  • Domain experts / focus groups
  • Internal user studies
  • Off-policy evaluation techniques
SLIDE 35

Off-Policy Evaluation - Example

V̂_h^π(S) - the expected reward of a policy h given data S from a “logging policy” π:

V̂_h^π(S) = (1/|S|) Σ_{(x,a,r)∈S} r · 𝕀[h(x) = a] / max(π̂(a|x), τ)

where S denotes the set of context–action–reward tuples available in the logs and τ is a threshold that clips small estimated propensities π̂(a|x).
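A sketch of this clipped inverse-propensity estimator (the names `offline_policy_value`, `pi_hat`, and the default `tau` are our assumptions): logged rewards count only when the evaluated policy would have chosen the logged action, reweighted by the estimated logging propensity.

```python
def offline_policy_value(logs, policy, pi_hat, tau=0.01):
    """Clipped inverse-propensity estimate of a policy's value.

    logs:    list of (context, action, reward) tuples from the logging policy
    policy:  the policy h to evaluate, mapping context -> action
    pi_hat:  estimated logging-policy probability, pi_hat(action, context)
    tau:     clipping floor for the propensities
    """
    total = sum(
        r * (policy(x) == a) / max(pi_hat(a, x), tau)
        for x, a, r in logs
    )
    return total / len(logs)
```

For example, with a uniform logging policy over two actions (π̂ = 0.5) the reweighting exactly recovers the reward the evaluated policy would earn.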

SLIDE 36

Caveats of Off-policy Evaluation

  • Need to formulate everything in terms of a policy
  • Needs sufficient support
  • Becomes very difficult when your policies are time dependent
SLIDE 37

We are looking for postdoc researchers to join us in Israel… Email: RecoRecruitmentEmail@microsoft.com

Thank you!