

SLIDE 1

Diagnostic Metrics, Part 1

Week 2 Video 2

SLIDE 2

Different Methods, Different Measures

• Today we’ll focus on metrics for classifiers
• Later this week we’ll discuss metrics for regressors
• Metrics for other methods will be discussed later in the course

SLIDE 3

Metrics for Classifiers

SLIDE 4

Accuracy

SLIDE 5

Accuracy

• One of the easiest measures of model goodness is accuracy
• Also called agreement, when measuring inter-rater reliability

Accuracy (agreement) = (# of agreements) / (total number of codes/assessments)
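A minimal sketch of this calculation in Python (not from the original slides; the function name and label values are illustrative):

```python
def agreement(rater_codes, detector_codes):
    """Accuracy/agreement: # of agreements divided by total # of codes."""
    assert len(rater_codes) == len(detector_codes)
    matches = sum(a == b for a, b in zip(rater_codes, detector_codes))
    return matches / len(rater_codes)

# Example: a human coder's labels vs. a detector's labels for 5 observations
print(agreement(["ON", "OFF", "ON", "ON", "OFF"],
                ["ON", "ON",  "ON", "ON", "OFF"]))  # 0.8
```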

SLIDE 6

Accuracy

• There is general agreement across fields that accuracy is not a good metric

SLIDE 7

Accuracy

• Let’s say that my new Kindergarten Failure Detector achieves 92% accuracy
• Good, right?

SLIDE 8

Non-even assignment to categories

• Accuracy does poorly when there is non-even assignment to categories
  ◦ Which is almost always the case
• Imagine an extreme case
  ◦ 92% of students pass Kindergarten
  ◦ My detector always says PASS
• Accuracy of 92%
• But essentially no information
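A quick sketch of that extreme case in Python (hypothetical numbers matching the slide):

```python
# Hypothetical cohort: 92 of 100 students pass Kindergarten
actual = ["PASS"] * 92 + ["FAIL"] * 8

# A "detector" that always predicts PASS, regardless of the student
predicted = ["PASS"] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.92 -- high accuracy, but the detector conveys no information
```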

SLIDE 9

Kappa

SLIDE 10

Kappa

Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)
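The same formula in symbols (a restatement, not from the original slides), writing P_o for observed agreement and P_e for expected chance agreement; the P_e term is what the next several slides work out step by step:

```latex
\kappa = \frac{P_o - P_e}{1 - P_e},
\qquad
P_e = \sum_{c} P(\mathrm{data}=c)\; P(\mathrm{detector}=c)
```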

SLIDE 11

Computing Kappa (Simple 2x2 example)

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 12

Computing Kappa (Simple 2x2 example)

  • What is the percent agreement?

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 13

Computing Kappa (Simple 2x2 example)

  • What is the percent agreement?
  • 80%

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 14

Computing Kappa (Simple 2x2 example)

  • What is Data’s expected frequency for on-task?

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 15

Computing Kappa (Simple 2x2 example)

  • What is Data’s expected frequency for on-task?
  • 75%

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 16

Computing Kappa (Simple 2x2 example)

  • What is Detector’s expected frequency for on-task?

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 17

Computing Kappa (Simple 2x2 example)

  • What is Detector’s expected frequency for on-task?
  • 65%

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 18

Computing Kappa (Simple 2x2 example)

  • What is the expected on-task agreement?

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 19

Computing Kappa (Simple 2x2 example)

  • What is the expected on-task agreement?
  • 0.65*0.75= 0.4875

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60

SLIDE 20

Computing Kappa (Simple 2x2 example)

  • What is the expected on-task agreement?
  • 0.65*0.75= 0.4875

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60 (48.75)

SLIDE 21

Computing Kappa (Simple 2x2 example)

  • What are Data and Detector’s expected frequencies for off-task behavior?

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60 (48.75)

SLIDE 22

Computing Kappa (Simple 2x2 example)

  • What are Data and Detector’s expected frequencies for off-task behavior?

  • 25% and 35%

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60 (48.75)

SLIDE 23

Computing Kappa (Simple 2x2 example)

  • What is the expected off-task agreement?

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60 (48.75)

SLIDE 24

Computing Kappa (Simple 2x2 example)

  • What is the expected off-task agreement?
  • 0.25*0.35= 0.0875

               Detector Off-Task   Detector On-Task
Data Off-Task         20                  5
Data On-Task          15                 60 (48.75)

SLIDE 25

Computing Kappa (Simple 2x2 example)

  • What is the expected off-task agreement?
  • 0.25*0.35= 0.0875

               Detector Off-Task   Detector On-Task
Data Off-Task         20 (8.75)           5
Data On-Task          15                 60 (48.75)

SLIDE 26

Computing Kappa (Simple 2x2 example)

  • What is the total expected agreement?

               Detector Off-Task   Detector On-Task
Data Off-Task         20 (8.75)           5
Data On-Task          15                 60 (48.75)

SLIDE 27

Computing Kappa (Simple 2x2 example)

  • What is the total expected agreement?
  • 0.4875+0.0875 = 0.575

               Detector Off-Task   Detector On-Task
Data Off-Task         20 (8.75)           5
Data On-Task          15                 60 (48.75)

SLIDE 28

Computing Kappa (Simple 2x2 example)

  • What is kappa?

               Detector Off-Task   Detector On-Task
Data Off-Task         20 (8.75)           5
Data On-Task          15                 60 (48.75)

SLIDE 29

Computing Kappa (Simple 2x2 example)

  • What is kappa?
  • (0.8 – 0.575) / (1-0.575)
  • 0.225/0.425
  • 0.529

               Detector Off-Task   Detector On-Task
Data Off-Task         20 (8.75)           5
Data On-Task          15                 60 (48.75)
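To check that arithmetic, here is a short Python sketch (not part of the original slides) that computes kappa directly from the 2x2 table, with rows as the data (ground truth) and columns as the detector:

```python
def kappa_2x2(table):
    """Cohen's kappa for a 2x2 table: rows = data labels, columns = detector labels."""
    total = sum(sum(row) for row in table)
    agreement = (table[0][0] + table[1][1]) / total    # 0.80 in this example

    # Marginal proportions for each category
    data_off = (table[0][0] + table[0][1]) / total     # 0.25
    data_on  = (table[1][0] + table[1][1]) / total     # 0.75
    det_off  = (table[0][0] + table[1][0]) / total     # 0.35
    det_on   = (table[0][1] + table[1][1]) / total     # 0.65

    expected = data_off * det_off + data_on * det_on   # 0.0875 + 0.4875 = 0.575
    return (agreement - expected) / (1 - expected)

# The worked example from these slides
print(kappa_2x2([[20, 5], [15, 60]]))  # ~0.529
```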

SLIDE 30

So is that any good?

  • What is kappa?
  • (0.8 – 0.575) / (1-0.575)
  • 0.225/0.425
  • 0.529

               Detector Off-Task   Detector On-Task
Data Off-Task         20 (8.75)           5
Data On-Task          15                 60 (48.75)

SLIDE 31

Interpreting Kappa

• Kappa = 0
  ◦ Agreement is at chance
• Kappa = 1
  ◦ Agreement is perfect
• Kappa = -1
  ◦ Agreement is perfectly inverse
• Kappa > 1
  ◦ You messed up somewhere

SLIDE 32

Kappa < 0

• This means your model is worse than chance
• Very rare to see, but seen more commonly if you’re using cross-validation
  ◦ It means your model is junk

SLIDE 33

0 < Kappa < 1

• What’s a good Kappa?
• There is no absolute standard

SLIDE 34

0 < Kappa < 1

• For data mined models,
  ◦ Typically 0.3-0.5 is considered good enough to call the model better than chance and publishable
  ◦ In affective computing, lower is still often OK

SLIDE 35

Why is there no standard?

• Because Kappa is scaled by the proportion of each category
• When one class is much more prevalent
  ◦ Expected agreement is higher than if classes are evenly balanced

SLIDE 36

Because of this…

• Comparing Kappa values between two data sets, in a principled fashion, is highly difficult
  ◦ It is OK to compare two Kappas, in the same data set, that have at least one variable in common
• A lot of work went into statistical methods for comparing Kappa values in the 1990s
• No real consensus
• Informally, you can compare two data sets if the proportions of each category are “similar”

SLIDE 37

Quiz

  • What is kappa?

A: 0.645 B: 0.502 C: 0.700 D: 0.398

                 Detector Insult        Detector No Insult
                 during Collaboration   during Collaboration
Data Insult              16                      7
Data No Insult            8                     19

SLIDE 38

Quiz

  • What is kappa?

A: 0.240 B: 0.947 C: 0.959 D: 0.007

                     Detector Academic   Detector No Academic
                     Suspension          Suspension
Data Suspension              1                    2
Data No Suspension           4                  141

SLIDE 39

Next lecture

• ROC curves
• A’
• Precision
• Recall