Evaluating Machine Learned User Experiences – Asela Gunawardana – PowerPoint Presentation



slide-1
SLIDE 1

Evaluating Machine Learned User Experiences

Asela Gunawardana Intelligent User Experiences Microsoft Research

slide-2
SLIDE 2

The typical machine learning problem

A predictor 𝑔 maps each input 𝑦𝑗 to a prediction ẑ𝑗 = 𝑔(𝑦𝑗); the true label is 𝑧𝑗, and the loss 𝑚𝑗 = 𝑀(ẑ𝑗, 𝑧𝑗) scores the prediction.

slide-3
SLIDE 3

Evaluation is easy: just measure ∑𝑗 𝑚𝑗 on the test set.

slide-4
SLIDE 4

Thank You Questions?

slide-5
SLIDE 5

Problem: for real problems, we need to decide what labels 𝑧𝑗 to look at, and what loss function 𝑀(⋅,⋅) to use.

slide-6
SLIDE 6

But is this really a serious problem? How hard can it be? E.g. Netflix: 𝑦𝑗 = (user𝑗, movie𝑗), 𝑧𝑗 ∈ {1, 2, 3, 4, 5}, 𝑀(𝑧𝑗, ẑ𝑗) = (𝑧𝑗 − ẑ𝑗)²
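As a concrete sketch of this loss (the ratings below are invented for illustration, not from the talk), the squared-error loss aggregates into the familiar RMSE:

```python
import math

def rmse(ratings, predictions):
    """Root-mean-square error between true ratings z_j and predictions z_hat_j."""
    assert len(ratings) == len(predictions) and ratings
    total = sum((z - z_hat) ** 2 for z, z_hat in zip(ratings, predictions))
    return math.sqrt(total / len(ratings))

# Hypothetical 1-5 star ratings and a model's predictions for them.
z = [5, 3, 1, 4]
z_hat = [4.5, 3.5, 2.0, 4.0]
print(rmse(z, z_hat))  # sqrt(1.5 / 4), roughly 0.612
```

Fixing 𝑧𝑗 to star ratings and 𝑀 to squared error makes evaluation a one-liner; the question the following slides raise is whether that one-liner measures anything users care about.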

slide-7
SLIDE 7

Fixing the labels and loss fixes the problem

The “Netflix problem” at NIPS is:

𝑉(𝑈×𝑁) ≈ 𝑆, i.e. approximate the observed rating matrix 𝑆 with a (low-rank) 𝑈 × 𝑁 users-by-movies matrix 𝑉.

slide-8
SLIDE 8

The user’s Netflix problem is:

slide-9
SLIDE 9

Really?

slide-10
SLIDE 10

Where are the stars?

slide-11
SLIDE 11
slide-12
SLIDE 12

Does our formulation of the problem really help users find things to watch?

slide-13
SLIDE 13

Does predicting ratings help users find things to watch?

slide-14
SLIDE 14

Predicting Ratings ≠ Predicting Usage

[Bar chart: RMS rating error for Alg A and Alg B on Netflix and BookCrossing]

slide-15
SLIDE 15

Predicting Ratings ≠ Predicting Usage

[Precision vs. recall curves: Alg. A and Alg. B on Online Retail Purchases]
slide-16
SLIDE 16

Predicting Ratings ≠ Predicting Usage

[Precision vs. recall curves: Alg. A and Alg. B on News Story Clicks]
slide-17
SLIDE 17

Lesson:

The “standard,” “given,” or “commonly used” labels and loss functions may tell us very little about how useful the system is.

slide-18
SLIDE 18

If not RMSE, what?

slide-19
SLIDE 19

Precision/Recall?

slide-20
SLIDE 20

AUC?

slide-21
SLIDE 21

Mean Avg Precision?

slide-22
SLIDE 22

Precision @16?
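To make these candidate metrics concrete, here is a minimal sketch (the ranked list and usage data are invented, not from the talk) of precision@k and average precision over a ranked recommendation list:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that the user actually used."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant item appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranked recommendations and the items the user actually used.
ranked = ["a", "b", "c", "d", "e"]
used = {"a", "c", "e"}
print(precision_at_k(ranked, used, 3))  # 2 of the top 3 were used
print(average_precision(ranked, used))
```

All of these metrics differ only in how they weight the ranked list; none of them, by itself, answers the offline-vs-online question the following slides raise.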

slide-23
SLIDE 23

A better Netflix evaluation protocol

  • 1. Log usage (not just ratings)
  • 2. Train recommender on log data from before yesterday.
  • 3. Recommend items for yesterday’s users.
  • 4. Score against yesterday’s actual usage data:

                  Actually Used     Actually Unused
Recommended       True Positive     False Positive
Not Recommended   False Negative    True Negative
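The scoring step of this protocol can be sketched as follows, with a hypothetical catalog and one user's day of logs (a minimal illustration, not an actual production pipeline):

```python
def usage_confusion(recommended, used, catalog):
    """Tally the four cells of the recommended-vs-used confusion matrix."""
    recommended, used = set(recommended), set(used)
    tp = len(recommended & used)   # recommended and actually used
    fp = len(recommended - used)   # recommended but unused
    fn = len(used - recommended)   # used but not recommended
    tn = len(set(catalog)) - tp - fp - fn
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical day: 10-item catalog, 4 recommendations, 3 items actually used.
tp, fp, fn, tn = usage_confusion([0, 1, 2, 3], [1, 3, 7], range(10))
print(precision_recall(tp, fp, fn))  # precision 2/4, recall 2/3
```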

slide-24
SLIDE 24

A better Netflix evaluation protocol

  • 1. Log usage (not just ratings)
  • 2. Train recommender on log data from before yesterday.
  • 3. Recommend items for yesterday’s users.
  • 4. Score against yesterday’s actual usage data:

Problem:

False Positive/True Negative

Maybe the user didn’t know about the video, but would have happily watched it if we had actually recommended it.

                  Actually Used     Actually Unused
Recommended       True Positive     False Positive
Not Recommended   False Negative    True Negative


slide-26
SLIDE 26

Problem #1

Our data isn’t an i.i.d. draw – it’s collected from a real running system.

slide-27
SLIDE 27

Really?

slide-28
SLIDE 28

Really?

slide-29
SLIDE 29

A better Netflix evaluation protocol

  • 1. Log usage (not just ratings)
  • 2. Train recommender on log data from before yesterday.
  • 3. Recommend items for yesterday’s users.
  • 4. Score against yesterday’s actual usage data:

Problems:

False Positive/True Negative

Maybe the user didn’t know about the video, but would have happily watched it if we had actually recommended it.

True Positive

Maybe the user would have watched the video anyway, even if we didn’t predict it.

                  Actually Used     Actually Unused
Recommended       True Positive     False Positive
Not Recommended   False Negative    True Negative


slide-31
SLIDE 31

Problem #2

Measuring prediction accuracy doesn’t tell us how the system will influence user behavior.

slide-32
SLIDE 32
slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35

A better Netflix evaluation protocol

  • 1. Log usage (not just ratings)
  • 2. Train recommender on log data from before yesterday.
  • 3. Recommend items for yesterday’s users.
  • 4. Score against yesterday’s actual usage data:

Problems:

False Positive/True Negative

Maybe the user didn’t know about the video, but would have happily watched it if we had actually recommended it.

True Positive

Maybe the user would have watched the video anyway, even if we didn’t predict it.

False Negative

Maybe the user watched the video but hated it.

                  Actually Used     Actually Unused
Recommended       True Positive     False Positive
Not Recommended   False Negative    True Negative


slide-37
SLIDE 37

Problem #3

The influence of our system may only manifest over the long term.

slide-38
SLIDE 38
  • 1. Our data isn’t an i.i.d. draw – it needs to be collected from a real running system.
  • 2. Measuring prediction accuracy doesn’t tell us how the system will influence user behavior.
  • 3. The influence of our system may only manifest over the long term.

How do we avoid being fooled about how useful our system is?

slide-39
SLIDE 39

How not to be fooled

  • 1. Identify what the goal is
  • Service usage
  • Sales
  • Ad monetization
  • User retention
  • 2. Randomly assign users to a control and a treatment group, and measure improvement due to the system over time.
  • 3. Use (with care) offline experiments to prioritize which experiments to run.
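Step 2 is the core of an A/B test. A minimal sketch, assuming an invented experiment name and an invented goal metric (mean hours watched per user), of deterministic group assignment and lift measurement:

```python
import hashlib

def assign_group(user_id, experiment="rec-v2"):
    """Deterministically hash each user into control or treatment (50/50).

    Hashing (rather than random.random) keeps a user in the same group
    across sessions, which the over-time measurement in step 2 requires.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "treatment" if digest[0] % 2 else "control"

def lift(metric_by_group):
    """Relative improvement of treatment over control on the goal metric."""
    control, treatment = metric_by_group["control"], metric_by_group["treatment"]
    return (treatment - control) / control

# Hypothetical goal metric: mean hours watched per user in each group.
print(lift({"control": 2.0, "treatment": 2.3}))  # roughly a 15% lift
```

A real experiment would also need a significance test on the group difference; this sketch only shows the assignment and the point estimate.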

slide-40
SLIDE 40

The objections

  • Experiments are expensive and time-consuming; we can only try a handful of variations.
  • We can’t really expect scientists to build user-facing systems before they do science.
  • Besides, I’m confident that <insert loss function here> will generally track <insert real criterion here>.
  • RMSE was good enough for Netflix: $1,000,000 says so.
  • The system owner is happy with improvements in my metric.

slide-41
SLIDE 41
slide-42
SLIDE 42

Science is a bit like the joke about the drunk who is looking under a lamppost for a key that he has lost on the other side of the street, because that's where the light is. It has no other choice. Noam Chomsky (at least, according to the web)


slide-44
SLIDE 44

Another choice: Build a new lamppost

(or at least a flashlight)

Joachims, KDD 2002 / WSDM 2015: use actual user behavior and mild assumptions about it to evaluate web search ranking.

Marlin et al., IJCAI 2011: how to estimate and account for selection bias in data sets.

Bottou et al., JMLR 2013: how to use data reweighting and a priori causal knowledge to correct for selection bias and make counterfactual inferences.

These issues have started to be addressed, and we need more work that builds on this start.

slide-45
SLIDE 45

Need data that

  • is collected through randomization of a real system
  • records what was presented to the user (“impression logs”)
  • records why (inputs and sampling probability/density)
  • records what the user did
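Logs with all four properties enable counterfactual evaluation offline. One standard use, in the spirit of the Bottou et al. reweighting approach, is an inverse-propensity-scored estimate of how a new recommender would have done; the log schema below is invented for illustration:

```python
def ips_estimate(logs):
    """Inverse-propensity-scored estimate of a new policy's per-impression reward.

    Each log entry records what was shown, the probability with which the
    logging policy showed it (why), what the user did (reward), and what the
    candidate policy would have shown for the same context.
    """
    total = 0.0
    for entry in logs:
        if entry["new_policy_action"] == entry["shown"]:
            # Reweight matching impressions by 1 / logging probability.
            total += entry["reward"] / entry["propensity"]
    return total / len(logs)

# Hypothetical impression log: item shown, logging probability, click (0/1),
# and what the candidate recommender would have shown instead.
logs = [
    {"shown": "a", "propensity": 0.5,  "reward": 1, "new_policy_action": "a"},
    {"shown": "b", "propensity": 0.25, "reward": 0, "new_policy_action": "a"},
    {"shown": "a", "propensity": 0.5,  "reward": 0, "new_policy_action": "a"},
    {"shown": "c", "propensity": 0.25, "reward": 1, "new_policy_action": "c"},
]
print(ips_estimate(logs))  # (1/0.5 + 0 + 1/0.25) / 4 = 1.5
```

The estimate is unbiased only when the logged propensities are correct and every action the new policy can take had nonzero probability under the logging policy, which is exactly why the randomization and "records why" requirements above matter.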