SLIDE 1

Diagnostics

Gad Kimmel

SLIDE 2

Outline

  • Introduction.
  • Bootstrap method.
  • Cross validation.
  • ROC plot.
SLIDE 3

Introduction

SLIDE 4

Motivation

  • Estimating properties of an estimator (an estimator

is a function of input points).

− Given data samples x1, x2, ..., xN, evaluate some estimator,

say the average (1/N) Σ xi.

− How can we estimate its properties, e.g., its variance

var((1/N) Σ xi) = (1/N²) var(Σ xi)?

  • Model selection.

− How many parameters should we use?

SLIDE 5

Bootstrap Method

SLIDE 6

Evaluating Accuracy

  • A simple approach for accuracy estimation is to

provide the bias or variance of the estimator.

  • Example: suppose the samples are independently

identically distributed (i.i.d.), with finite variance.

− We know, by the central limit theorem, that

n^(1/2) (x̄n − μ) / σ → Z ~ N(0, 1).

− Roughly speaking, x̄n is normally distributed with

expectation μ and variance σ²/n.

SLIDE 7

Assumptions Do Not Hold

  • What if the r.v. are not i.i.d.?
  • What if we want to evaluate another estimator (and

not x̄n)?

  • It would be nice to have many different samples of

samples.

  • In that case, one could calculate the estimator for

each sample of samples, and infer its distribution.

  • But... we don't have it.

SLIDE 8

Solution - Bootstrap

  • Estimating the sampling distribution of an estimator

by resampling with replacement from the original sample.

  • Efron, The Annals of Statistics, '79.
SLIDE 9

Bootstrap - Illustration

  • Goal: Sampling from P.

P

SLIDE 10

Bootstrap - Illustration

  • Goal: Sampling from P.

P

x1, x2 , x3 , x4,... , xn

SLIDE 11

Bootstrap - Illustration

  • Goal: Sampling from P.

... in order to estimate the variance of an estimator.

P

x1, x2 , x3 , x4,... , xn

SLIDE 12

Bootstrap - Illustration

P

x_{1,1}, x_{1,2}, x_{1,3}, ..., x_{1,n}  →  e_1
x_{2,1}, x_{2,2}, x_{2,3}, ..., x_{2,n}  →  e_2
x_{3,1}, x_{3,2}, x_{3,3}, ..., x_{3,n}  →  e_3
x_{4,1}, x_{4,2}, x_{4,3}, ..., x_{4,n}  →  e_4
...
x_{m,1}, x_{m,2}, x_{m,3}, ..., x_{m,n}  →  e_m

(Samples → Estimator)

SLIDE 13

Bootstrap - Illustration

  • What is the variance of e?

P

x_{1,1}, ..., x_{1,n}  →  e_1
x_{2,1}, ..., x_{2,n}  →  e_2
...
x_{m,1}, ..., x_{m,n}  →  e_m

(Samples → Estimator)

SLIDE 14

Bootstrap - Illustration

  • Estimate the variance by

var(e) = (1/m) Σ_{i=1}^{m} (e_i − ē)²,

where ē is the mean of the e_i.

P

x_{1,1}, ..., x_{1,n}  →  e_1
x_{2,1}, ..., x_{2,n}  →  e_2
...
x_{m,1}, ..., x_{m,n}  →  e_m

(Samples → Estimator)

SLIDE 15

Bootstrap - Illustration

P

x1, x2 , x3 , x4,... , xn

  • We only have 1 sample:
SLIDE 16

Bootstrap - Illustration

P

z_{1,1}, z_{1,2}, z_{1,3}, ..., z_{1,n}  →  e_1
z_{2,1}, z_{2,2}, z_{2,3}, ..., z_{2,n}  →  e_2
z_{3,1}, z_{3,2}, z_{3,3}, ..., z_{3,n}  →  e_3
z_{4,1}, z_{4,2}, z_{4,3}, ..., z_{4,n}  →  e_4
...
z_{m,1}, z_{m,2}, z_{m,3}, ..., z_{m,n}  →  e_m

(Samples → Estimator)

x_1, x_2, x_3, x_4, ..., x_n

  • Sampling is done from the empirical distribution.
SLIDE 17

Formalization

  • The data is x_1, x_2, ..., x_n ~ P. Note that the distribution

function P is unknown.

  • We sample m samples Y_1, Y_2, ..., Y_m. Each

Y_i = z_{i,1}, z_{i,2}, ..., z_{i,n} contains n samples drawn from the empirical distribution of the data:

Pr[z_{j,k} = x_i] = #(x_i) / n,

where #(x_i) is the number of times x_i appears in the original data.

SLIDE 18

The Main Idea

  • Y_i ~ P̂.
  • We wish that P̂ = P. Is it (always) true? NO.
  • Rather, P̂ is an approximation of P.

SLIDE 19

Example 1

  • The yield of the Dow Jones Index over the past two

years is ~12%.

  • You are considering a broker that had a yield of

25%, by picking specific stocks from the Dow Jones.

  • Let x be a r.v. that represents the yield of randomly

selected stocks.

  • Do we know the distribution of x?
SLIDE 20

Example 1

  • Prepare a sample x1, x2, ..., x10,000, where each xi is the

yield of randomly selected stocks.

  • Approximate the distribution of x using this sample.

SLIDE 21

Evaluation of Estimators

  • Using the approximate distribution, we can evaluate

estimators, e.g.:

− Variance of the mean.
− Confidence intervals.

SLIDE 22

Example 1

  • What is the probability of obtaining a yield larger than

25% (the p-value)?

SLIDE 23

Example 1

  • What is the probability of obtaining a yield larger than

25% (the p-value)?

[Plot of the empirical yield distribution; the tail above 25% is ≈ 30%]
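The p-value can be read straight off the empirical distribution: it is the fraction of sampled yields at or above the broker's 25%. A minimal sketch (the yield values below are invented for illustration):

```python
def empirical_pvalue(yields, threshold=0.25):
    """Fraction of sampled portfolio yields at or above the
    threshold: the empirical estimate of P(yield >= 25%)."""
    return sum(1 for y in yields if y >= threshold) / len(yields)

# Hypothetical yields of randomly picked portfolios.
sample = [0.10, 0.30, 0.12, 0.26, 0.05]
p_value = empirical_pvalue(sample)  # 2 of the 5 samples reach 25%
```

With the full 10,000-portfolio sample from the previous slide, the same one-liner gives the p-value of the broker's performance.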

SLIDE 24

Example 2 - Decision tree

  • Decision tree - short introduction.
SLIDE 25

Example 2

  • Building a decision tree.
SLIDE 26

Example 2

  • Many other trees can be built, using different

algorithms.

  • For a specific tree one can calculate the prediction

accuracy:

(# of elements classified correctly) / (total # of elements)

SLIDE 27

Example 2

  • Many other trees can be built, using different

algorithms.

  • For a specific tree one can calculate the prediction

accuracy:

(# of elements classified correctly) / (total # of elements)

  • To calculate error bars for this value, we need to

sample more, apply the algorithm many times, and each time evaluate the prediction accuracy.

SLIDE 28

Example 2 - Applying Bootstrap

1. Build a decision tree for each sample.
2. Calculate the prediction accuracy of each tree.
3. Evaluate error bars based on the predictions.

SLIDE 29

Example 2 - Applying Bootstrap

1. Build a decision tree for each sample: T_1, T_2, ..., T_n.
2. Calculate the prediction accuracy of each tree: p_1, p_2, ..., p_n.
3. Evaluate error bars based on the predictions:

p̄ ± 1.96 · STD(p_1, p_2, ..., p_n)
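The ±1.96·STD error bars from this slide are a one-line computation once the bootstrap accuracies p_1, ..., p_n are available. A minimal sketch (the accuracy values below are made up):

```python
import statistics

def error_bars(predictions, z=1.96):
    """95% normal-approximation error bars around the mean
    prediction accuracy of the bootstrap trees."""
    mean = statistics.mean(predictions)
    half = z * statistics.stdev(predictions)
    return mean - half, mean + half

# Hypothetical accuracies of trees built on bootstrap samples.
lo, hi = error_bars([0.81, 0.78, 0.84, 0.80, 0.79, 0.83])
```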

SLIDE 30

Example 2 - Applying Bootstrap

1. Build a decision tree for each sample.
2. Calculate the prediction accuracy of each tree.
3. Evaluate error bars based on the predictions.

But we have only one data set!

SLIDE 31

Example 2 - Applying Bootstrap

1. Use bootstrap to prepare many samples.
2. Build a decision tree for each sample.
3. Calculate the prediction accuracy of each tree.
4. Evaluate error bars based on the predictions.

SLIDE 32

Cross Validation

SLIDE 33

Objective

  • Model selection.
SLIDE 34

Formalization

  • Let (x, y) be drawn from distribution P, where

x ∈ ℜ^n and y ∈ ℜ.

  • Let f_θ : ℜ^n → ℜ be a learning algorithm, with

parameter(s) θ.
SLIDE 35

Example

  • Regression model.
SLIDE 36

What Do We Want?

  • We want the method that is going to predict future

data most accurately, assuming they are drawn from the distribution P.

SLIDE 37

What Do We Want?

  • We want the method that is going to predict future

data most accurately, assuming they are drawn from the distribution P.

  • Niels Bohr:

"It is very difficult to make an accurate prediction, especially about the future."

SLIDE 38

Choosing the Best Model

  • For a sample (x, y) which is drawn from the

distribution function P, measure the prediction error, e.g.:

(f_θ(x) − y)²  or  |f_θ(x) − y|

  • Since (x, y) is a r.v., we are usually interested in:

E[(f_θ(x) − y)²]

SLIDE 39

Choosing the Best Model (cont.)

  • Choose the parameter(s) θ:

argmin_θ E[(f_θ(x) − y)²]

  • The problem is that we don't know how to sample from

P.

SLIDE 40


Regression − Order of 1 (Linear)

SLIDE 41


Regression − Order of 2

SLIDE 42


Regression − Order of 3

SLIDE 43


Regression − Order of 4

SLIDE 44


Regression − Order of 5

SLIDE 45


Regression − Join the Dots

SLIDE 46

Solution - Cross Validation

  • Partition the data into 2 sets:

− Training set T.
− Test set S.

  • Calculate θ using only the training set T.
  • Given θ, calculate the test error:

(1/|S|) Σ_{(x_i, y_i) ∈ S} (f_θ(x_i) − y_i)²

SLIDE 47

Back to the Example

  • In our case, we should try different orders for the

regression (or different # of params).

  • Each time apply the regression only on the training

set, and calculate estimation error on the test set.

  • The # of parameters chosen will be the one minimizing

the test error.

SLIDE 48

Variants of Cross Validation

  • Test set.
  • Leave one out.
  • k-fold cross validation.
SLIDE 49

K-fold Cross Validation

[ Train | Train | Test | Train | Train ]

SLIDE 50

K-fold Cross Validation

  • We want to find a parameter θ that minimizes the

cross-validation estimate of prediction error:

CV(θ) = (1/N) Σ_{i=1}^{N} L(y_i, f_θ^{−k(i)}(x_i)),

where f_θ^{−k(i)} is the model fitted with the fold containing (x_i, y_i) removed.

SLIDE 51

K-fold Cross Validation

  • How to choose K?
  • K = N ( = leave one out): CV is unbiased for the true

prediction error, but can have high variance.

  • When K decreases, CV has lower variance, but bias

could be a problem (depending on how the performance of the learning method varies with the size of the training set).
SLIDE 52

ROC Plot

(Receiver Operating Characteristic)

SLIDE 53

Definitions

  • Let f : ℜ^n → {−1, 1} be a classifier function.

                     Predicted positive   Predicted negative
Positive examples    True positives       False negatives
Negative examples    False positives      True negatives

SLIDE 54

Example - Blood Pressure and Cardio Vascular Disease (CVD)

  • Classifier: if a person has a mean blood pressure

above t, he will have some CV event during 10 years.

  • We have 100 samples.
  • How do we choose t?
SLIDE 55

t = 0

                     Predicted positive   Predicted negative
Positive examples    70                   0
Negative examples    30                   0

SLIDE 56

t = 300

                     Predicted positive   Predicted negative
Positive examples    0                    70
Negative examples    0                    30

SLIDE 57

t = 150

                     Predicted positive   Predicted negative
Positive examples    30                   40
Negative examples    10                   20

SLIDE 58

More Definitions

  • True positive rate = TP / (TP + FN)
  • False positive rate = FP / (FP + TN)

                     Predicted positive   Predicted negative
Positive examples    TP                   FN
Negative examples    FP                   TN
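Both rates follow directly from the confusion-matrix counts at a given threshold t. A minimal sketch for the blood-pressure classifier (the readings and outcomes below are invented):

```python
def roc_point(scores, labels, t):
    """TP rate and FP rate of the threshold classifier
    'predict positive iff score > t'. Labels are 1 / -1."""
    tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > t and y == -1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= t and y == -1)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical blood-pressure readings and 10-year CVD outcomes.
pressures = [160, 140, 170, 120]
outcomes = [1, 1, -1, -1]
tpr, fpr = roc_point(pressures, outcomes, t=150)
```

Sweeping t over all thresholds and plotting the resulting (FP rate, TP rate) pairs yields the ROC curve of the next slides.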

SLIDE 59

ROC - Receiver Operating Characteristic Curve

[ROC plot: TP rate (y-axis) vs. FP rate (x-axis), both from 0 to 1]

SLIDE 60

ROC Curve

[ROC plot: TP rate vs. FP rate]

SLIDE 61

ROC Curve

[ROC plot: the point (0, 0) - everyone predicted negative]

You are healthy!

SLIDE 63

ROC Curve

[ROC plot: the point (1, 1) - everyone predicted positive]

You are sick!

SLIDE 66

ROC Curve

[ROC plot: the point (0, 1) - perfect classification]

Heaven

SLIDE 68

ROC Curve

[ROC plot: a point on the diagonal - no better than random]

???

SLIDE 71

ROC Curve

[ROC plot: a point below the diagonal]

Worse than random classifier

SLIDE 72

ROC Curve

[ROC plot: the full curve traced as the threshold t varies]

SLIDE 76

Alternative Terminology

  • Precision = TP / (TP+FP)

(= positive predictive value)

  • Recall = TP / (TP+FN)

(= sensitivity = true positive rate)

  • F-measure - the harmonic mean of precision and

recall:

F-score = 2 · Precision · Recall / (Precision + Recall)

                     Predicted positive   Predicted negative
Positive examples    TP                   FN
Negative examples    FP                   TN
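These three quantities are direct functions of the confusion-matrix counts. A minimal sketch, using the counts from the t = 150 blood-pressure slide as the worked example:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F-score),
    computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts from the t = 150 slide: TP = 30, FP = 10, FN = 40.
p, r, f = precision_recall_f(tp=30, fp=10, fn=40)
```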

SLIDE 77

Alternative Terminology (cont.)

  • Specificity = TN / (FP+TN)

(= 1 - false positive rate)

  • Negative predictive value = TN / (FN+TN)

                     Predicted positive   Predicted negative
Positive examples    TP                   FN
Negative examples    FP                   TN

SLIDE 78

The AUC Metric

  • The Area Under the Curve (AUC) metric assesses

the accuracy of the ranking in terms of separation of the classes.

  • For a random classifier (bad): AUC = 0.5.
  • For a perfect classifier (good): AUC = 1.
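The AUC equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, which gives a simple pair-counting sketch (the scores and labels in the examples are made up):

```python
def auc(scores, labels):
    """AUC as the probability that a random positive example is
    scored above a random negative one (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking gives AUC = 1; a fully reversed one gives 0.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])
reversed_ = auc([0.1, 0.2, 0.8, 0.9], [1, 1, -1, -1])
```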
SLIDE 79

Choosing a Point on the Curve

  • Depends on the application:

− Medical screening tests (e.g., mammography) - high TP. − Spam filtering - low FP.

SLIDE 80

Summary

  • Methods for:

− Estimation properties of an estimator. − Model selection.

SLIDE 81

References

  • Bootstrap Methods and their Application. A. C.

Davison and D. V. Hinkley.

  • The Elements of Statistical Learning. T. Hastie, R.

Tibshirani and J. H. Friedman.

  • ROC Graphs: Notes and Practical Considerations for

Researchers. T. Fawcett.