THE PREDICTION ADVANTAGE: A UNIVERSALLY MEANINGFUL PERFORMANCE MEASURE - PowerPoint Presentation


SLIDE 1

THE PREDICTION ADVANTAGE: A UNIVERSALLY MEANINGFUL PERFORMANCE MEASURE

Ran El-Yaniv, Yonatan Geifman, Yair Wiener

SLIDE 2

THE PREDICTION ADVANTAGE - EL-YANIV, GEIFMAN, WIENER

OUTLINE

Introduction and motivation
The prediction advantage
Bayesian marginal prediction
PA for several loss functions
Related measures
Empirical results
Future research and open questions
Conclusion

SLIDE 3

INTRODUCTION

Consider an imbalanced problem: is 99% accuracy good enough when the minority class is only 0.5%? Can 70% accuracy on a multi-class problem with 3 classes be compared to 70% accuracy with 4 classes? On Haberman, a dataset whose minority class is 26.4%, results of 27% have been reported. We are looking for a universal measure that captures the complexity and the bias of the problem.

SLIDE 4

MAIN IDEA

Let us quantify the performance advantage of the prediction function over the "random" function. Challenges: what is the "random classifier"? How can we compare two classifiers? Which loss? Do we subtract or divide? Does it generalize to both regression and classification, and to any loss function?

SLIDE 5

PREDICTION ADVANTAGE


PA_ℓ(f) = 1 − R_ℓ(f) / R_ℓ(f0) = 1 − E_{X,Y}[ℓ(f(X), Y)] / E_{X,Y}[ℓ(f0(X), Y)]
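To make the definition concrete, here is a minimal sketch (not from the slides; the function name is ours) that estimates the PA by plugging empirical risks into the formula above:

```python
import numpy as np

def prediction_advantage(losses_f, losses_f0):
    """Plug-in estimate of PA_l(f) = 1 - R_l(f) / R_l(f0).

    losses_f:  per-example losses of the evaluated predictor f
    losses_f0: per-example losses of the Bayesian marginal predictor f0
    """
    return 1.0 - np.mean(losses_f) / np.mean(losses_f0)
```

A predictor no better than the BMP gets PA = 0, a perfect predictor gets PA = 1, and a predictor worse than the BMP gets a negative PA.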

SLIDE 6

BAYESIAN MARGINAL PREDICTION (BMP)

The optimal prediction function with respect to the marginal distribution of Y. The BMP predicts a constant value/class while being oblivious to X and P(Y|X). We expect the BMP to capture only the complexity of the problem latent in P(Y).

SLIDE 7

THE BMP IS CONSTANT

Why is the BMP a constant? Lemma: consider a general randomized function g ∼ Q and a convex loss function ℓ; by Jensen's inequality, ℓ(E[g(X)], Y) ≤ E[ℓ(g(X), Y)], so a constant prediction is at least as good as any randomized one.

SLIDE 8

PREDICTION ADVANTAGE - PROPERTIES

Order preservation - the PA induces a weak ordering of prediction functions, matching the ordering induced by the loss function.
Boundedness - the PA is bounded above by 1; PA = 1 is achieved only by the perfect classifier.
Meaningfulness - PA = 0 when f has no advantage over the BMP.

SLIDE 9

PA FOR CROSS ENTROPY LOSS

Cross-entropy loss: a multi-class problem with k classes. The BMP outputs the marginal probability of each class. Labels are given in one-hot representation.

ℓ(f(X), Y) = − Σ_{i∈C} Pr{Y = i} log(Pr{f(X) = i}),   f : X → R^k,   f0(X)_i = Pr{Y = e_i}

SLIDE 10

PA FOR CROSS ENTROPY LOSS - PROOF

Let us define an arbitrary distribution Q and fQ(X) ∼ Q. Then:

R_ℓ(fQ) = E[ℓ(fQ(X), Y)] = Σ_{i∈C} Pr{Y = e_i} ℓ(fQ(X), e_i) = Σ_{i∈C} −Pr{Y = e_i} log(fQ_i(X))

R_ℓ(f0) = E[ℓ(f0(X), Y)] = Σ_{i∈C} Pr{Y = e_i} ℓ(f0(X), e_i) = Σ_{i∈C} −Pr{Y = e_i} log(Pr{Y = e_i}) = H(Y)

SLIDE 11

PA FOR CROSS ENTROPY LOSS - PROOF

We calculate R_ℓ(fQ) − R_ℓ(f0):

R_ℓ(fQ) − R_ℓ(f0) = Σ_{i∈C} −Pr{Y = e_i} log(fQ_i(X)) + Σ_{i∈C} Pr{Y = e_i} log(Pr{Y = e_i}) = Σ_{i∈C} Pr{Y = e_i} log(Pr{Y = e_i} / fQ_i(X)) = D_KL(f0(X) || fQ(X)) ≥ 0.

The BMP loss: R_ℓ(f0) = H(P(Y)).

The PA: PA_ℓ(f) = 1 − R_ℓ(f) / H(P(Y)).
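Concretely, the PA under cross-entropy is one minus the model's mean log-loss divided by the entropy of the marginal P(Y). A plug-in sketch using empirical estimates (function name and helper structure are ours, not the paper's code):

```python
import numpy as np

def pa_cross_entropy(y_true, probs):
    """PA under cross-entropy: 1 - R_l(f) / H(Y), with empirical plug-ins.

    y_true: integer class labels, shape (n,)
    probs:  predicted class probabilities, shape (n, k)
    """
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    # Empirical risk of f: mean negative log-probability of the true class.
    risk_f = -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))
    # BMP risk: entropy of the empirical marginal distribution of Y.
    marginal = np.bincount(y_true) / len(y_true)
    marginal = marginal[marginal > 0]
    entropy = -np.sum(marginal * np.log(marginal))
    return 1.0 - risk_f / entropy
```

A model that merely outputs the marginal class frequencies scores PA = 0, since its log-loss equals H(Y).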

SLIDE 12

PA FOR 0/1 LOSS

The BMP: f0 = argmax_i Pr{Y = i}.

The BMP risk: R_{ℓ0/1}(f0) = 1 − max_{i∈C} Pr{Y = i} = 1 − Pr{Y = j}, where j is the majority class.

The PA: PA_ℓ(f) = 1 − R_ℓ(f) / R_ℓ(f0) = 1 − R_ℓ(f) / (1 − max_{i∈C} Pr{Y = i}).
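An illustrative plug-in sketch (our own helper), which makes the imbalance issue from the introduction explicit:

```python
import numpy as np

def pa_zero_one(y_true, y_pred):
    """PA under 0/1 loss: 1 - R(f) / (1 - max_i Pr{Y = i}), empirical plug-in."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Empirical 0/1 risk of the predictor.
    risk_f = np.mean(y_true != y_pred)
    # BMP risk: error of always predicting the majority class.
    bmp_risk = 1.0 - np.max(np.bincount(y_true)) / len(y_true)
    return 1.0 - risk_f / bmp_risk
```

For example, 99% accuracy when the minority class is 0.5% means R(f) = 0.01 against a BMP risk of 0.005, giving PA = 1 − 0.01/0.005 = −1: worse than predicting the majority class everywhere.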

SLIDE 13

PA FOR SQUARED LOSS

The BMP: f0 = E[Y].

The BMP risk: R_ℓ(f0) = E_Y[(Y − f0)^2] = E_Y[(Y − E[Y])^2] = Var(Y).

The PA: PA_ℓ(f) = 1 − R_ℓ(f) / R_ℓ(f0) = 1 − R_ℓ(f) / Var(Y).
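A plug-in sketch (helper name ours); note that under squared loss this quantity coincides with the familiar R² coefficient of determination, since both normalize the MSE by Var(Y):

```python
import numpy as np

def pa_squared(y_true, y_pred):
    """PA under squared loss: 1 - MSE(f) / Var(Y), empirical plug-in."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Empirical risk of f under squared loss.
    mse = np.mean((y_true - y_pred) ** 2)
    # BMP risk: variance of Y (risk of the constant predictor E[Y]).
    return 1.0 - mse / np.var(y_true)
```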

SLIDE 14

PA FOR ABSOLUTE LOSS

The BMP: f0 = median(Y).

The BMP risk: R_ℓ(f0) = E_Y[|Y − median(Y)|] = D_med.

The PA: PA_ℓ(f) = 1 − R_ℓ(f) / R_ℓ(f0) = 1 − R_ℓ(f) / D_med.
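Similarly for absolute loss (helper name ours), normalizing the MAE by the mean absolute deviation from the median:

```python
import numpy as np

def pa_absolute(y_true, y_pred):
    """PA under absolute loss: 1 - MAE(f) / E|Y - median(Y)|, empirical plug-in."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Empirical risk of f under absolute loss.
    mae = np.mean(np.abs(y_true - y_pred))
    # BMP risk: mean absolute deviation from the median (D_med).
    d_med = np.mean(np.abs(y_true - np.median(y_true)))
    return 1.0 - mae / d_med
```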

SLIDE 15

RELATION TO OTHER MEASURES

Some other measures are defined as two numbers (e.g., precision and recall); we look for a single number. We compared the PA to the F-score, Cohen's kappa, and balanced accuracy. The PA bounds all the other measures from below.

SLIDE 16

EMPIRICAL RESULTS

We compared several relevant performance measures at different noise levels and imbalance levels on the breast cancer dataset. Measures: balanced accuracy - (TPR + TNR)/2; F-measure - the harmonic mean of precision and recall; Cohen's kappa - an inter-rater agreement measure.
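For a sense of how these measures behave on a trivial majority-class predictor, here is a hand-rolled sketch of the binary measures (our own helper, not the paper's code):

```python
import numpy as np

def binary_measures(y_true, y_pred):
    """Balanced accuracy ((TPR + TNR) / 2) and F-measure from confusion counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tpr = tp / (tp + fn)  # recall / sensitivity
    tnr = tn / (tn + fp)  # specificity
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if precision + tpr > 0 else 0.0
    return {"balanced_accuracy": (tpr + tnr) / 2, "f1": f1}
```

Predicting the majority class everywhere on a 90/10 split gives balanced accuracy 0.5 and an F-measure of 0 (taking the minority as the positive class), even though raw accuracy is a misleading 90%.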

SLIDES 17-20

EMPIRICAL RESULTS (result figures)

SLIDE 21

PA AND SELECTIVE PREDICTION

In selective prediction, every coverage rate induces a different P(Y). Risk-coverage curves are therefore misleading. We argue that in this case the objective has to be the PA, and we should measure the PA-coverage curve. It is still not clear how to construct a rejection mechanism that optimizes the PA.

SLIDE 22

CONCLUSION AND FUTURE WORK

We presented a universal performance measure. It is still not clear how best to estimate some of the required quantities (entropy, median, etc.). Can the PA be used as an optimization objective? Where is it needed, and how can it be optimized? (It is non-convex.)
