Lecture 3: Method evaluation and tuning parameter selection
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
29th March 2019
Evaluating performance of a statistical method
Goals
▶ Model selection: Choose a hyper-parameter or model structure, e.g. k in kNN regression/classification, or "choose between logistic regression, LDA and kNN"
▶ Model assessment: How well did a model do on a data set?
How to choose the best k for kNN?
[Figure: scatter plots of the data (Compactness vs. Symmetry), coloured by Diagnosis (Benign/Malignant), with kNN decision boundaries for the panels k = 1, k = 10 and k = 100]
▶ UCI Breast Cancer Wisconsin (Diagnostic) data set¹
▶ Which k will do best for class prediction of new data?
¹ https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
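A minimal sketch of this experiment, assuming scikit-learn's bundled copy of the data set and its "mean compactness" and "mean symmetry" columns as stand-ins for the plotted features:

    # Sketch: kNN on two features of the Wisconsin breast cancer data.
    from sklearn.datasets import load_breast_cancer
    from sklearn.neighbors import KNeighborsClassifier

    data = load_breast_cancer()
    names = list(data.feature_names)
    cols = [names.index("mean compactness"), names.index("mean symmetry")]
    X, y = data.data[:, cols], data.target

    for k in (1, 10, 100):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        # Training accuracy only; how to judge k properly is the
        # subject of the rest of this lecture.
        print(k, knn.score(X, y))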
Error rates (I)
▶ Remember: To determine the optimal regression function or classifier we looked at the expected prediction loss

  $\mathrm{EPE}(f) = \mathbb{E}_{P(\mathbf{x}, y)}\left[L(y, f(\mathbf{x}))\right]$

  Note that $f$ was thought to be an arbitrary unknown function.
▶ Now: $f$ is estimated from data under some model assumption
▶ The resulting regressor/classifier $\hat{f}(\cdot \mid \mathcal{T})$ is fixed after estimation but depends on the training samples $\mathcal{T}$
▶ Expected prediction error for a fixed training set $\mathcal{T}$:

  $\mathrm{Err}(\mathcal{T}) = \mathbb{E}_{P(\mathbf{x}, y)}\left[L(y, \hat{f}(\mathbf{x} \mid \mathcal{T}))\right]$
Error rates (II)
▶ Conditional expected prediction error for a fixed training set $\mathcal{T}$:

  $\mathrm{Err}(\mathcal{T}) = \mathbb{E}_{P(\mathbf{x}, y)}\left[L(y, \hat{f}(\mathbf{x} \mid \mathcal{T}))\right]$

▶ Training samples are random too!
▶ Total expected prediction error:

  $\mathrm{Err} = \mathbb{E}_{P(\mathcal{T})}\left[\mathrm{Err}(\mathcal{T})\right] = \mathbb{E}_{P(\mathcal{T})}\left[\mathbb{E}_{P(\mathbf{x}, y)}\left[L(y, \hat{f}(\mathbf{x} \mid \mathcal{T}))\right]\right]$
Empirical error rates (I)
▶ Training error

  $\mathrm{err}_{\mathrm{train}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}(\mathbf{x}_i \mid \mathcal{T}))$

  where $\mathcal{T} = \{(y_i, \mathbf{x}_i) : 1 \le i \le n\}$
▶ Test error

  $\mathrm{err}_{\mathrm{test}} = \frac{1}{m} \sum_{i=1}^{m} L(\tilde{y}_i, \hat{f}(\tilde{\mathbf{x}}_i \mid \mathcal{T}))$

  where $(\tilde{y}_i, \tilde{\mathbf{x}}_i)$ for $1 \le i \le m$ are new samples from the same distribution as $\mathcal{T}$, i.e. $P(\mathcal{T})$.
Empirical error rates (II)
Can we use these empirical rates directly to approximate the total or conditional expected prediction error? Observations:
▶ $\mathcal{T}$ has already been used to determine $\hat{f}(\cdot \mid \mathcal{T})$, and methods usually aim to minimize the training error
▶ Training error is often smaller for more complex models (the so-called optimism of the training error), since they can adjust better to the available data (overfitting!); see the sketch below
▶ How do we get new samples from the data distribution $P(\mathcal{T})$? What do we do if all we have is the training sample?
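A small sketch of this optimism, assuming the scikit-learn setup from above: a 1-nearest-neighbour classifier reproduces its training data perfectly but does measurably worse on held-out samples:

    # Sketch: the training error of kNN with k = 1 is (essentially) zero,
    # while the error on new data is not.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    print("training error:", 1 - knn.score(X_tr, y_tr))  # ~0.0
    print("test error:    ", 1 - knn.score(X_te, y_te))  # clearly > 0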
Splitting up the data
▶ Holdout method: If we have a lot of samples, randomly split the available data into a training set and a test set
▶ c-fold cross-validation: If we have few samples
  1. Randomly split the available data into c equally large subsets, so-called folds.
  2. Taking turns, use c − 1 folds as the training set and the remaining fold as the test set
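Both schemes are one-liners in scikit-learn (a sketch; names such as n_splits are that library's, not the lecture's notation):

    # Holdout: one random split into training and test set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold, train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=1)

    # c-fold cross-validation: every fold is the test set exactly once.
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    for train_idx, test_idx in cv.split(X):
        pass  # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx]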
Approximations of expected prediction error
▶ Use the test error for the holdout method, i.e.

  $\mathrm{err}_{\mathrm{test}} = \frac{1}{m} \sum_{i=1}^{m} L(\tilde{y}_i, \hat{f}(\tilde{\mathbf{x}}_i \mid \mathcal{T}))$

  where $(\tilde{y}_i, \tilde{\mathbf{x}}_i)$ for $1 \le i \le m$ are the elements of the test set.
▶ Use the average test error for c-fold cross-validation, i.e.

  $\mathrm{err}_{\mathrm{cv}} = \frac{1}{n} \sum_{j=1}^{c} \sum_{(y_i, \mathbf{x}_i) \in \mathcal{F}_j} L(y_i, \hat{f}(\mathbf{x}_i \mid \mathcal{F}_{-j}))$

  where $\mathcal{F}_j$ is the $j$-th fold and $\mathcal{F}_{-j}$ is all data except fold $j$.
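Written out, the cross-validation estimate is a sum of per-sample losses over the folds. A sketch with 0-1 loss and kNN (k = 5 here is an arbitrary choice):

    # Sketch: err_cv computed exactly as in the formula above.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    loss_sum, n = 0, len(y)
    for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        fhat = KNeighborsClassifier(n_neighbors=5).fit(X[tr], y[tr])
        loss_sum += np.sum(fhat.predict(X[te]) != y[te])  # 0-1 loss on fold j
    print("err_cv:", loss_sum / n)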
Careful data splitting
▶ Note: For the approximations to be justifiable, test and training sets need to be identically distributed
▶ Splitting has to be done randomly
▶ If the data are unbalanced, stratification is necessary. Examples:
  ▶ Class imbalance
  ▶ A continuous outcome that is observed more often in some intervals than in others (e.g. high values more often than low values)
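For class imbalance, stratified splitting is built into scikit-learn; a sketch checking that the class shares stay roughly constant across folds (for a continuous outcome one would instead stratify on binned values):

    # Sketch: stratified folds preserve class proportions.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import StratifiedKFold

    X, y = load_breast_cancer(return_X_y=True)
    for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                                  random_state=0).split(X, y):
        print(np.bincount(y[te]) / len(te))  # class shares in each test fold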
Error estimation and tuning parameters
The holdout method and cross-validation can be used to determine tuning parameters.

1. For a sequence of tuning parameters $\lambda_1, \dots, \lambda_m$ calculate

   $\mathrm{err}_{\mathrm{cv}}(\lambda_t) = \frac{1}{n} \sum_{j=1}^{c} \sum_{(y_i, \mathbf{x}_i) \in \mathcal{F}_j} L(y_i, \hat{f}(\mathbf{x}_i \mid \lambda_t, \mathcal{F}_{-j}))$

2. Choose

   $\hat{\lambda} = \arg\min_{\lambda_t} \mathrm{err}_{\mathrm{cv}}(\lambda_t)$

This also works for a sequence of methods $f_1, \dots, f_m$ (e.g. kNN, QDA, logistic regression).
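A sketch of this recipe for kNN's k, using cross_val_score (whose default score for classifiers is accuracy, i.e. 1 minus the 0-1 error):

    # Sketch: pick the tuning parameter with the smallest CV error.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    ks = [1, 3, 5, 10, 100]
    cv_err = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X, y, cv=10).mean()
              for k in ks]
    best_k = ks[int(np.argmin(cv_err))]
    print("errors:", dict(zip(ks, np.round(cv_err, 3))), "best k:", best_k)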
Global rule & Simple boundary
[Figure: LDA decision boundaries for simulated data with a simple (linear) true class boundary; axes x1 and x2]

▶ The red line is the true boundary.
▶ Each grey line represents a fit to a randomly chosen 20% of all data.
▶ The black line is the average of the grey lines.
▶ Here: low variance and low bias
Local rule & Simple boundary
[Figure: kNN (k = 3) decision boundaries on the same simple-boundary data; axes x1 and x2]

▶ Here: high variance but on average low bias
Global rule & Complex boundary
[Figure: LDA decision boundaries for simulated data with a complex true class boundary; axes x1 and x2]

▶ Here: low variance but also large bias
Local rule & Complex boundary
[Figure: kNN (k = 3) decision boundaries on the complex-boundary data; axes x1 and x2]

▶ Here: high variance but on average low bias
Global vs local rules
Observations
▶ Local rules are built from data in a local neighbourhood; they can capture complex boundaries, but have high variance
▶ Global rules are built from all the data; they are usually less flexible, but have low variance
▶ Bias-Variance Trade-off: It can be motivated theoretically that bias and variance both contribute to the expected prediction error. The goal is to find a balance (see the decomposition below).
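For squared loss this can be made explicit (a standard decomposition, stated here for reference; it is not derived on these slides). With $y_0 = f(\mathbf{x}_0) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and the expectation taken over training sets and noise:

  $\mathbb{E}\left[(y_0 - \hat{f}(\mathbf{x}_0 \mid \mathcal{T}))^2\right] = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{\left(\mathbb{E}[\hat{f}(\mathbf{x}_0 \mid \mathcal{T})] - f(\mathbf{x}_0)\right)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}\left(\hat{f}(\mathbf{x}_0 \mid \mathcal{T})\right)}_{\text{variance}}$

Simple global rules keep the variance term small at the price of bias; flexible local rules do the opposite.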
Performance of LDA vs kNN
Table 1: Average cross-validation errors for ten folds
Boundary      simple   complex
LDA           0.011    0.092
kNN (k = 3)   0.018    0.021

LDA does better for simple boundaries, while kNN has an advantage for more complicated boundaries.
Choosing a classification method (I)
Remember: We looked at different classification methods for solving the same classification problem
[Figure: decision regions of Nearest Centroids, LDA and QDA on the iris data (Sepal Length vs. Sepal Width), with species setosa, versicolor and virginica]
Choosing a classification method (II)
Table 2: Average cross-validation errors for ten folds
NC      LDA     QDA
0.193   0.200   0.220
Quality of a classification result
How do we quantify classification quality when we receive a classification result from our classifier? Setting:
▶ Language and notation come from medical studies, where the presence or absence of a disease/condition is determined
▶ Binary classification with classes 0 and 1
▶ 0s are interpreted as negative outcomes (e.g. not sick = healthy individual) and 1s are interpreted as positive outcomes (e.g. sick individuals)
Confusion matrix
Table 3: Confusion matrix
                   True class
Predicted class    Positive               Negative
Positive           True Positive (TP)     False Positive (FP)
Negative           False Negative (FN)    True Negative (TN)
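In scikit-learn the same table is produced by confusion_matrix, though with rows as true classes and columns as predicted classes, i.e. transposed relative to Table 3 (a sketch):

    # Sketch: confusion matrix on a held-out test set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=0)
    y_hat = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr).predict(X_te)
    # Rows = true class, columns = predicted class, labels sorted (0, 1).
    print(confusion_matrix(y_te, y_hat))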
Measures of classification quality
▶ Accuracy:
  $\frac{TP + TN}{TP + FP + FN + TN}$
▶ Precision:
  $\frac{TP}{TP + FP}$
▶ Sensitivity/True positive rate (TPR)/Recall:
  $\frac{TP}{TP + FN}$
▶ Specificity:
  $\frac{TN}{TN + FP}$
▶ False positive rate (FPR)/fall-out: $1 - \text{Specificity}$
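All of these follow directly from the four counts; a sketch (the toy labels are made up for illustration):

    # Sketch: quality measures from the confusion-matrix counts.
    from sklearn.metrics import confusion_matrix

    y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # toy data
    y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # TPR / recall
    specificity = tn / (tn + fp)
    fpr         = 1 - specificity  # false positive rate
    print(accuracy, precision, sensitivity, specificity, fpr)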
Combined measures
▶ $F_1$ score:
  $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
▶ Matthews correlation coefficient:
  $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \in (-1, 1)$
  where MCC = 0 for a random classifier, MCC < 0 if worse than random and MCC > 0 if better than random. Takes both classes into account.
▶ Receiver Operating Characteristic (ROC) curve: trade-off between FPR and TPR. TPR = FPR for a random classifier, TPR < FPR for a worse-than-random classifier, and TPR > FPR for a better-than-random one.
▶ Area under the ROC curve (AUC): 0.5 for a random classifier and > 0.5 for better classifiers. The maximum is 1.
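A sketch of the combined measures with scikit-learn; the ROC curve needs a score per sample, here the predicted class-1 probability from kNN:

    # Sketch: F1, MCC, ROC curve and AUC.
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import (f1_score, matthews_corrcoef,
                                 roc_auc_score, roc_curve)
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=0)
    knn = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

    y_hat = knn.predict(X_te)                 # hard class predictions
    p_hat = knn.predict_proba(X_te)[:, 1]     # scores for the ROC curve
    print("F1: ", f1_score(y_te, y_hat))
    print("MCC:", matthews_corrcoef(y_te, y_hat))
    fpr, tpr, _ = roc_curve(y_te, p_hat)      # one (FPR, TPR) per threshold
    print("AUC:", roc_auc_score(y_te, p_hat))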
How to choose the best k for kNN? (revisited, I)
Reminder: This motivated our discussion
[Figure: scatter plots of the data (Compactness vs. Symmetry), coloured by Diagnosis (Benign/Malignant), with kNN decision boundaries for the panels k = 1, k = 10 and k = 100]
How to choose the best k for kNN? (revisited, II)
[Figure: ROC curves (True Positive Rate vs. False Positive Rate) for kNN with k = 1, 3, 5, 10 and 100]
Table 4: Average training and cross-validation errors for five folds
k     err_train   err_cv
1     0.000       0.276
3     0.137       0.243
5     0.160       0.228
10    0.182       0.204
100   0.204       0.207
k = 100 leads to the best measurable results; its cross-validation error (0.207) is essentially tied with the minimum at k = 10 (0.204). Judging from the plots for k = 1, k = 10 and k = 100, kNN is trying to approximate a linear decision boundary and "tries to become a global method".
Take-home message
▶ Cross-validation, or splitting the data into a training set and a test set, are valuable approaches for model selection and model assessment
▶ Method complexity and global/local rules exhibit a bias-variance trade-off
▶ There is no single best measure of classification quality