Lecture 3: Method evaluation and tuning parameter selection (Felix Held) - PowerPoint PPT Presentation

SLIDE 1

Lecture 3: Method evaluation and tuning parameter selection

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data 29th March 2019

SLIDE 2

Evaluating performance of a statistical method

SLIDE 3

Goals

▶ Model selection: Choose a hyper-parameter or model structure, e.g. k in kNN regression/classification, or "Choose between logistic regression, LDA and kNN"

▶ Model assessment: How well did a model do on a data set?

SLIDE 4

How to choose the best k for kNN?

[Figure: three panels (k = 1, k = 10, k = 100) plotting Symmetry against Compactness, coloured by Diagnosis (Benign/Malignant)]

▶ UCI breast cancer wisconsin (diagnostic) data set 1
▶ Which k will do best for class prediction of new data?

1 https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

SLIDE 5

Error rates (I)

▶ Remember: To determine the optimal regression function or classifier we looked at the expected prediction loss

EPE(f) = E_{p(x, y)}[L(y, f(x))]

Note that f was thought to be an arbitrary unknown function.

▶ Now: f is estimated from data under some model assumption

▶ The resulting regressor/classifier f̂(·|𝒯) is fixed after estimation but depends on the training samples 𝒯

▶ Expected prediction error for a fixed training set 𝒯:

R(𝒯) = E_{p(x, y)}[L(y, f̂(x|𝒯))]

SLIDE 6

Error rates (II)

▶ Conditional expected prediction error for a fixed training set 𝒯:

R(𝒯) = E_{p(x, y)}[L(y, f̂(x|𝒯))]

▶ Training samples are random too!

▶ Total expected prediction error:

R = E_{p(𝒯)}[R(𝒯)] = E_{p(𝒯)}[E_{p(x, y)}[L(y, f̂(x|𝒯))]]

SLIDE 7

Empirical error rates (I)

▶ Training error

R_tr = (1/n) Σ_{i=1}^{n} L(y_i, f̂(x_i|𝒯))

where 𝒯 = {(y_i, x_i) : 1 ≤ i ≤ n}

▶ Test error

R_test = (1/m) Σ_{i=1}^{m} L(ỹ_i, f̂(x̃_i|𝒯))

where (ỹ_i, x̃_i) for 1 ≤ i ≤ m are new samples from the same distribution as the training data 𝒯, i.e. p(𝒯).
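As a concrete sketch of these two rates (not from the slides): a toy 1-nearest-neighbour classifier on synthetic two-class Gaussian data, with the 0-1 loss as L. Since every training point is its own nearest neighbour, R_tr is exactly 0 for 1-NN, while R_test stays above zero:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: fraction of misclassified samples."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def one_nn_predict(X_train, y_train, X_new):
    """Predict with 1-nearest-neighbour (Euclidean distance)."""
    preds = []
    for x in X_new:
        dists = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes in 2D
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_train = np.repeat([0, 1], 50)
X_test = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_test = np.repeat([0, 1], 50)

# Training error R_tr: evaluated on the same samples used for fitting
R_tr = zero_one_loss(y_train, one_nn_predict(X_train, y_train, X_train))
# Test error R_test: evaluated on fresh samples from the same distribution
R_test = zero_one_loss(y_test, one_nn_predict(X_train, y_train, X_test))
```

This is exactly the "optimism of the training error" discussed on the next slide.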

SLIDE 8

Empirical error rates (II)

Can we directly use these empirical rates to approximate the total or conditional expected prediction error? Observations:

▶ 𝒯 has already been used to determine f̂(·|𝒯), and methods usually aim to minimize the training error

▶ Training error is often smaller for more complex models (so-called optimism of the training error) since they can adjust better to the available data (overfitting!)

▶ How do we get new samples from the data distribution p(𝒯)? What do we do if all we have is the training sample?

SLIDE 9

Splitting up the data

▶ Holdout method: If we have a lot of samples, randomly split the available data into a training set and a test set

▶ c-fold cross-validation: If we have few samples

  1. Randomly split the available data into c equally large subsets, so-called folds.
  2. Taking turns, use c − 1 folds as the training set and the remaining fold as the test set.
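The two splitting schemes can be sketched with index arrays (illustrative; the sample size, the 70/30 ratio and c = 5 are arbitrary choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
indices = rng.permutation(n)  # random order, so both splits are random

# Holdout method: e.g. 70% training, 30% test
n_train = int(0.7 * n)
train_idx, test_idx = indices[:n_train], indices[n_train:]

# c-fold cross-validation: c equally large folds
c = 5
folds = np.array_split(indices, c)
# Taking turns: fold j is the test set, the other c - 1 folds are the training set
splits = []
for j in range(c):
    test_fold = folds[j]
    train_folds = np.concatenate([folds[i] for i in range(c) if i != j])
    splits.append((train_folds, test_fold))
```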

SLIDE 10

Approximations of expected prediction error

▶ Use the test error for the holdout method, i.e.

R_test = (1/m) Σ_{i=1}^{m} L(ỹ_i, f̂(x̃_i|𝒯))

where (ỹ_i, x̃_i) for 1 ≤ i ≤ m are the elements of the test set.

▶ Use the average test error for c-fold cross-validation, i.e.

R_cv = (1/n) Σ_{j=1}^{c} Σ_{(y_i, x_i) ∈ ℱ_j} L(y_i, f̂(x_i|ℱ_{−j}))

where ℱ_j is the j-th fold and ℱ_{−j} is all data except fold j.

SLIDE 11

Careful data splitting

▶ Note: For the approximations to be justifiable, test and training sets need to be identically distributed

▶ Splitting has to be done randomly

▶ If data is unbalanced, then stratification is necessary. Examples:

  ▶ Class imbalance
  ▶ A continuous outcome is observed more often in some intervals than in others (e.g. high values more often than low values)
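Stratification for a class-imbalanced data set can be sketched by splitting each class separately across the folds (a minimal sketch; in practice a library routine such as scikit-learn's StratifiedKFold would be used):

```python
import numpy as np

def stratified_folds(y, c, seed=0):
    """Assign each sample to one of c folds so that every fold has
    (roughly) the same class proportions as the full data set."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)  # random split within each class
        for j, part in enumerate(np.array_split(idx, c)):
            fold_of[part] = j
    return fold_of

# Unbalanced data: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
fold_of = stratified_folds(y, c=5)
# Each of the 5 folds now contains 18 negatives and 2 positives
```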

SLIDE 12

Error estimation and tuning parameters

The holdout method and cross-validation can be used to determine tuning parameters.

1. For a sequence of tuning parameters λ_1, …, λ_T calculate

   R_cv(λ_t) = (1/n) Σ_{j=1}^{c} Σ_{(y_i, x_i) ∈ ℱ_j} L(y_i, f̂(x_i|λ_t, ℱ_{−j}))

2. Choose

   λ̂ = argmin_{λ_t} R_cv(λ_t)

This also works for a sequence of methods M_1, …, M_T (e.g. kNN, QDA, logistic regression).
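The two steps can be sketched end-to-end for kNN, with the number of neighbours k as the tuning parameter (toy Gaussian data and a plain numpy implementation, not the lecture's code):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """k-nearest-neighbour majority vote (Euclidean distance)."""
    preds = []
    for x in X_new:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        preds.append(np.bincount(y_train[nearest]).argmax())
    return np.array(preds)

def cv_error_knn(X, y, k, c=5, seed=0):
    """R_cv(k): c-fold cross-validation error of kNN for a fixed k."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), c)
    loss = 0
    for j in range(c):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(c) if i != j])
        loss += np.sum(y[test] != knn_predict(X[train], y[train], X[test], k))
    return loss / len(y)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.repeat([0, 1], 60)

ks = [1, 3, 5, 10, 25]                  # candidate tuning parameters
errors = {k: cv_error_knn(X, y, k) for k in ks}
k_hat = min(errors, key=errors.get)     # step 2: argmin of R_cv(k)
```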

SLIDE 13

Global rule & Simple boundary

[Figure: LDA decision boundaries on simulated data, x2 against x1]

▶ The red line is the true boundary.

▶ Each grey line represents a fit to a randomly chosen 20% of all data.

▶ The black line is the average of the grey lines.

▶ Here: low variance and low bias

SLIDE 14

Local rule & Simple boundary

[Figure: kNN (k = 3) decision boundaries on the same simulated data, x2 against x1]

▶ Here: high variance but on average low bias

SLIDE 15

Global rule & Complex boundary

[Figure: LDA decision boundaries for a complex true boundary, x2 against x1]

▶ Here: low variance but also large bias

SLIDE 16

Local rule & Complex boundary

[Figure: kNN (k = 3) decision boundaries for a complex true boundary, x2 against x1]

▶ Here: high variance but on average low bias

SLIDE 17

Global vs local rules

Observations

▶ Local rules are built using data in a local neighbourhood; they can capture complex boundaries, but have high variance

▶ Global rules are built using all data; they are usually less flexible, but have low variance

▶ Bias-Variance Trade-off: It can be theoretically motivated that bias and variance both contribute to the expected prediction error. The goal is to find a balance.
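For squared-error loss this motivation can be made precise: at a fixed input x_0, with y = f(x_0) + ε and Var(ε) = σ², the expected prediction error of f̂ decomposes into irreducible noise, squared bias and variance (a standard derivation, not spelled out on these slides):

```latex
\mathbb{E}_{\mathcal{T},\,\varepsilon}\!\left[(y - \hat{f}(x_0|\mathcal{T}))^2\right]
  = \sigma^2
  + \left(\mathbb{E}_{\mathcal{T}}[\hat{f}(x_0|\mathcal{T})] - f(x_0)\right)^2
  + \mathbb{E}_{\mathcal{T}}\!\left[\left(\hat{f}(x_0|\mathcal{T}) - \mathbb{E}_{\mathcal{T}}[\hat{f}(x_0|\mathcal{T})]\right)^2\right]
```

Flexible local rules shrink the squared-bias term at the cost of the variance term; rigid global rules do the opposite.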

SLIDE 18

Performance of LDA vs kNN

Table 1: Average cross-validation errors for ten folds

Boundary      simple   complex
LDA           0.011    0.092
kNN (k = 3)   0.018    0.021

LDA does better for simple boundaries, while kNN has an advantage for more complicated boundaries.

SLIDE 19

Choosing a classification method (I)

Remember: We looked at different classification methods for solving the same classification problem

[Figure: three panels (Nearest Centroids, LDA, QDA) plotting Sepal Width against Sepal Length, coloured by Species (setosa, versicolor, virginica)]

SLIDE 20

Choosing a classification method (II)

Table 2: Average cross-validation errors for ten folds

NC      LDA     QDA
0.193   0.200   0.220

SLIDE 21

Quality of a classification result

How do we quantify classification quality when we receive a classification result from our classifier? Setting:

▶ Language/notation comes from medical studies, where the presence or absence of a disease/condition is determined

▶ Binary classification with classes 0 and 1

▶ 0s are interpreted as negative outcomes (e.g. not sick = healthy individual) and 1s are interpreted as positive outcomes (e.g. sick individuals)

SLIDE 22

Confusion matrix

Table 3: Confusion matrix

                  True class
Predicted class   Positive              Negative
Positive          True Positive (TP)    False Positive (FP)
Negative          False Negative (FN)   True Negative (TN)

SLIDE 23

Measures of classification quality

▶ Accuracy: (TP + TN) / (TP + FP + FN + TN)

▶ Precision: TP / (TP + FP)

▶ Sensitivity/True positive rate (TPR)/Recall: TP / (TP + FN)

▶ Specificity: TN / (TN + FP)

▶ False positive rate (FPR)/fall-out: 1 − Specificity
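These measures are direct functions of the confusion counts; as code, with hypothetical label vectors (1 = positive, 0 = negative):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)   # recall / true positive rate
specificity = tn / (tn + fp)
fpr         = 1 - specificity  # false positive rate / fall-out
```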

SLIDE 24

Combined measures

▶ F1 score = 2 · Precision · Recall / (Precision + Recall)

▶ Matthews correlation coefficient:

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) ∈ [−1, 1]

where MCC = 0 for a random classifier, MCC < 0 if worse than random and MCC > 0 if better than random. Takes both classes into account.

▶ Receiver Operating Characteristic (ROC) curve: Trade-off between FPR and TPR. TPR = FPR for a random classifier, TPR < FPR for a worse-than-random classifier and TPR > FPR for a better-than-random classifier.

▶ Area under the ROC curve (AUC): 0.5 for a random classifier and > 0.5 for better classifiers. Maximum 1.
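Both combined measures are simple functions of the confusion counts (the counts below are made up for illustration):

```python
import math

def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0 for a random classifier,
    1 for a perfect one, -1 for a perfectly wrong one."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

# Counts from a hypothetical confusion matrix
tp, fp, fn, tn = 3, 1, 1, 5
f1 = f1_score(tp, fp, fn)
corr = mcc(tp, fp, fn, tn)
```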

SLIDE 25

How to choose the best k for kNN? (revisited, I)

Reminder: This motivated our discussion

[Figure: three panels (k = 1, k = 10, k = 100) plotting Symmetry against Compactness, coloured by Diagnosis (Benign/Malignant)]

SLIDE 26

How to choose the best k for kNN? (revisited, II)

[Figure: ROC curves for kNN with k = 1, 3, 5, 10 and 100, True Positive Rate against False Positive Rate]

Table 4: Average training and cross-validation errors for five folds

k      R_tr    R_cv
1      0.000   0.276
3      0.137   0.243
5      0.160   0.228
10     0.182   0.204
100    0.204   0.207

k = 10 and k = 100 lead to the best measurable results. Judging from the plots for k = 1, k = 10 and k = 100, kNN is trying to approximate a linear decision boundary and "tries to become a global method".

SLIDE 27

Take-home message

▶ Cross-validation or splitting the data into a training and a test set are valuable approaches for model selection and model assessment

▶ Method complexity and global/local rules exhibit a bias-variance trade-off

▶ There is no single best measure of classification quality; use multiple!
