Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]
Model Evaluation

Overview
- Introduction
- Classification performance
- Regression performance
- Cross-validation and tuning
- Revisiting the churn example
- Additional notes on multiclass, multilabel, and calibration
- Monitoring and maintenance

The analytics process

It's all about generalization
- You have trained a model on a particular data set (e.g. a decision tree)
- This is your "train data": used to build the model
- Performance on your train data gives you an initial idea of your model's validity, but not much more than that
- Much more important: ensure this model will do well on unseen data (out-of-time, out-of-sample, out-of-population), as predictive models are going to be "put to work"
- Validation needed! Test (hold-out) data: used to objectively measure performance
- A strict separation between training and test set is needed!

It's all about generalization: at the very least, use a test set
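The strict train/test separation can be sketched as a simple random partition; a minimal sketch in plain Python (the function name and the 70/30 split are illustrative, not prescribed by the slides):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Randomly partition a data set into a train set and a hold-out test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Strict separation: every row ends up in exactly one of the two sets
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))        # 70 30
assert set(train).isdisjoint(test)  # no overlap between train and test
```

In practice you would use a library routine for this, but the principle stays the same: the model never sees the test rows during training.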

What do we want to validate?
- Out-of-sample
- Out-of-time
- Out-of-population
It is not possible to foresee everything that will happen in the future, as you are by definition limited to the data you have now, but it is your duty to be as thorough as possible.

Classification performance

Confusion matrix (threshold: 0.50)

True label | Prediction | Predicted label | Correct?
no         | 0.11       | no              | Correct
no         | 0.20       | no              | Correct
yes        | 0.85       | yes             | Correct
yes        | 0.84       | yes             | Correct
yes        | 0.80       | yes             | Correct
no         | 0.65       | yes             | Incorrect
yes        | 0.44       | no              | Incorrect
no         | 0.10       | no              | Correct
yes        | 0.32       | no              | Incorrect
yes        | 0.87       | yes             | Correct
yes        | 0.61       | yes             | Correct
yes        | 0.60       | yes             | Correct
yes        | 0.78       | yes             | Correct
no         | 0.61       | yes             | Incorrect

Confusion matrix: depends on the threshold!

Metrics: depend on the confusion matrix, and hence on the threshold!

Common metrics (from the example above: tp = 7, tn = 3, fp = 2, fn = 2)
- Accuracy = (tp + tn) / total = (7 + 3) / 14 = 0.71
- Recall (sensitivity) = tp / (tp + fn) = 7 / 9 = 0.78 ("How many of the actual positives did we predict as such?")
- Specificity = tn / (tn + fp) = 3 / 5 = 0.60
- Precision = tp / (tp + fp) = 7 / 9 = 0.78 ("How many of the predicted positives are actually positive?")
- Balanced accuracy = (recall + specificity) / 2 = (0.5 × tp) / (tp + fn) + (0.5 × tn) / (tn + fp) = 0.5 × 0.78 + 0.5 × 0.60 = 0.69
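These metrics can be reproduced directly from the 14 labelled predictions of the example; a minimal sketch in plain Python, with no library assumed:

```python
# The 14 (true label, predicted probability) pairs from the example above
data = [("no", 0.11), ("no", 0.20), ("yes", 0.85), ("yes", 0.84), ("yes", 0.80),
        ("no", 0.65), ("yes", 0.44), ("no", 0.10), ("yes", 0.32), ("yes", 0.87),
        ("yes", 0.61), ("yes", 0.60), ("yes", 0.78), ("no", 0.61)]

threshold = 0.50
tp = sum(1 for y, p in data if y == "yes" and p >= threshold)
fn = sum(1 for y, p in data if y == "yes" and p < threshold)
tn = sum(1 for y, p in data if y == "no" and p < threshold)
fp = sum(1 for y, p in data if y == "no" and p >= threshold)

accuracy = (tp + tn) / len(data)
recall = tp / (tp + fn)            # sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (recall + specificity) / 2

print(tp, fp, tn, fn)  # 7 2 3 2
print(round(accuracy, 2), round(recall, 2),
      round(precision, 2), round(balanced_accuracy, 2))  # 0.71 0.78 0.78 0.69
```

The same numbers fall out as on the slide; in practice a library such as scikit-learn provides these metrics ready-made.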

Tuning the threshold
For each possible threshold t ∈ T, with T the set of all predicted probabilities, we can obtain a confusion matrix, and hence different metrics. So which threshold to pick? (Recall here our discussion on "well-calibrated" classifiers.)

True label | Prediction
no         | 0.11
no         | 0.20
yes        | 0.85
yes        | 0.84
yes        | 0.80
no         | 0.65
yes        | 0.44
no         | 0.10
yes        | 0.32
yes        | 0.87
yes        | 0.61
yes        | 0.60
yes        | 0.78
no         | 0.61
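One simple way to pick a threshold is to sweep over the set of predicted probabilities and keep the value that maximises a metric of interest. The choice of balanced accuracy below is illustrative; any metric from the previous slide could be plugged in:

```python
# The 14 (true label, predicted probability) pairs from the example above
data = [("no", 0.11), ("no", 0.20), ("yes", 0.85), ("yes", 0.84), ("yes", 0.80),
        ("no", 0.65), ("yes", 0.44), ("no", 0.10), ("yes", 0.32), ("yes", 0.87),
        ("yes", 0.61), ("yes", 0.60), ("yes", 0.78), ("no", 0.61)]

def balanced_accuracy(data, t):
    """Balanced accuracy when predicting 'yes' for probabilities >= t."""
    tp = sum(1 for y, p in data if y == "yes" and p >= t)
    fn = sum(1 for y, p in data if y == "yes" and p < t)
    tn = sum(1 for y, p in data if y == "no" and p < t)
    fp = sum(1 for y, p in data if y == "no" and p >= t)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (recall + specificity) / 2

# T: the set of all predicted probabilities, as on the slide
thresholds = sorted({p for _, p in data})
best = max(thresholds, key=lambda t: balanced_accuracy(data, t))
print(best, round(balanced_accuracy(data, best), 2))  # 0.32 0.8
```

Note that the "best" threshold depends entirely on the metric chosen, which is exactly the point of this slide.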

Tuning the model?
For most models, it's extremely hard to push them towards optimizing your metric of choice: they'll often inherently optimize for accuracy given the training set. In most cases, you will be interested in something else:
- The class imbalance present in the training set might conflict with a model's notion of accuracy
- You might want to focus on recall or precision, or...
What can we do?
- Tune the threshold based on your metric of interest
- Adjust the model parameters
- Adjust the target definition
- Sample/filter the data set
- Apply misclassification costs
- Apply instance weighting (a super easy way to do this: duplicate instances)
- Adjust the loss function (if the model supports doing so, and even then it oftentimes remains accuracy-related)

Tuning the threshold

Applying misclassification costs
Let's go on a small detour. Let us illustrate the basic problem with a setting you'll encounter over and over again: a binary classification problem where the class of interest (the positive class) occurs rarely compared to the negative class. Say fraud only occurs in 1% of cases in the training data. Almost all techniques you run out of the box will show this in your confusion matrix:

                   | Actual Negative | Actual Positive
Predicted Negative | TN: 99          | FN: 1
Predicted Positive | FP: 0           | TP: 0

Applying misclassification costs

                   | Actual Negative | Actual Positive
Predicted Negative | TN: 99          | FN: 1
Predicted Positive | FP: 0           | TP: 0

What's happening here? Remember that the model will optimize for accuracy, and it gets an accuracy of 99% by predicting everything as negative. That's why you should never believe people who only report on accuracy. "No worries, I'll just pick a stricter threshold." But how to formalize this a bit better? How do I tell my model that I am willing to make some mistakes on the negative side to catch the positives?
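This degenerate behaviour is easy to reproduce. The sketch below uses a hypothetical always-negative "classifier" on a 99:1 data set to show how 99% accuracy can coexist with zero recall:

```python
# 99 negatives and 1 positive, mirroring the fraud example above
labels = ["neg"] * 99 + ["pos"]
# A degenerate classifier that always predicts the majority class
predictions = ["neg"] * 100

accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
tp = sum(1 for y, p in zip(labels, predictions) if y == "pos" and p == "pos")
recall = tp / labels.count("pos")

print(accuracy, recall)  # 0.99 0.0
```

High accuracy, yet the model catches none of the cases we actually care about.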

Applying misclassification costs
What we would like to do is set misclassification costs as follows:

                   | Actual Negative | Actual Positive
Predicted Negative | C(0, 0) = 0     | C(0, 1) = 5
Predicted Positive | C(1, 0) = 1     | C(1, 1) = 0

Mispredicting a positive as a negative is 5 times as bad as mispredicting a negative as a positive.
How to determine the costs?
- Use real average observed costs (hard to find in many settings)
- Expert estimate
- Inverse class distribution (...)

Applying misclassification costs
Inverse class distribution: 99% negative versus 1% positive

C(1, 0) = 0.99 / 0.99 = 1
C(0, 1) = 0.99 / 0.01 = 99

                   | Actual Negative | Actual Positive
Predicted Negative | C(0, 0) = 0     | C(0, 1) = 99
Predicted Positive | C(1, 0) = 1     | C(1, 1) = 0
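The inverse-class-distribution costs can be computed with a small helper; the function name `inverse_class_costs` and the normalisation (false positive cost fixed at 1) are illustrative:

```python
def inverse_class_costs(n_neg, n_pos):
    """Cost matrix entries from the inverse class distribution,
    normalised so that C(1, 0), the false positive cost, equals 1."""
    c_fp = n_neg / n_neg  # C(1, 0) = 1
    c_fn = n_neg / n_pos  # C(0, 1): how much rarer the positive class is
    return c_fp, c_fn

# 99 negatives versus 1 positive, as on the slide
print(inverse_class_costs(99, 1))  # (1.0, 99.0)
```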

Applying misclassification costs
With a given cost matrix (no matter how we define it), we can calculate the expected loss.

                   | Actual Negative | Actual Positive
Predicted Negative | C(0, 0) = 0     | C(0, 1) = 5
Predicted Positive | C(1, 0) = 1     | C(1, 1) = 0

l(x, j), the expected loss for classifying an observation x as class j, is:

l(x, j) = Σ_k p(k|x) C(j, k)

For binary classification:

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = (here) p(1|x) C(0, 1)
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = (here) p(0|x) C(1, 0)

Applying misclassification costs
Classify an observation as positive if the expected loss for classifying it as positive is smaller than the expected loss for classifying it as negative:

l(x, 1) < l(x, 0) → classify as positive (1)

                   | Actual Negative | Actual Positive
Predicted Negative | C(0, 0) = 0     | C(0, 1) = 5
Predicted Positive | C(1, 0) = 1     | C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

→ Classify as positive!

Applying misclassification costs
The cost-sensitive threshold T_CS follows from setting both expected losses equal:

l(x, 1) = l(x, 0)
p(0|x) C(0, 0) + p(1|x) C(0, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1)

With p(0|x) = 1 − p(1|x), solving for p(1|x) gives:

T_CS = (C(1, 0) − C(0, 0)) / (C(1, 0) − C(0, 0) + C(0, 1) − C(1, 1))

When C(1, 0) = C(0, 1) = 1 and C(1, 1) = C(0, 0) = 0, then:

T_CS = (1 − 0) / (1 − 0 + 1 − 0) = 0.5

Applying misclassification costs

                   | Actual Negative | Actual Positive
Predicted Negative | C(0, 0) = 0     | C(0, 1) = 5
Predicted Positive | C(1, 0) = 1     | C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22

l(x, 0) = p(0|x) C(0, 0) + p(1|x) C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x) C(1, 0) + p(1|x) C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

T_CS = 1 / (1 + 5) = 0.1667 ≤ 0.22 → Classify as positive!
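The threshold formula can be captured in a short helper; `cost_sensitive_threshold` is an illustrative name, not a library function:

```python
def cost_sensitive_threshold(c00, c01, c10, c11):
    """T_CS = (C(1,0) - C(0,0)) / (C(1,0) - C(0,0) + C(0,1) - C(1,1)).
    Classify as positive when the predicted p(1|x) exceeds this threshold."""
    return (c10 - c00) / (c10 - c00 + c01 - c11)

# Cost matrix from the example: false negatives are 5x as costly
t = cost_sensitive_threshold(c00=0, c01=5, c10=1, c11=0)
print(round(t, 4))  # 0.1667
print(0.22 > t)     # True -> classify p(1|x) = 0.22 as positive

# Symmetric unit costs recover the familiar 0.5 threshold
print(cost_sensitive_threshold(0, 1, 1, 0))  # 0.5
```

The cost-insensitive classifier itself is unchanged; only the decision threshold moves.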

Sampling approaches
From the above, a new cost-sensitive class distribution can be obtained based on the cost-sensitive threshold as follows:

New number of positive observations: n′_1 = n_1 × (1 − T_CS) / T_CS
Or, new number of negative observations: n′_0 = n_0 × T_CS / (1 − T_CS)

E.g. 1 positive versus 99 negatives (class-inverse cost matrix):

                   | Actual Negative | Actual Positive
Predicted Negative | C(0, 0) = 0     | C(0, 1) = 99
Predicted Positive | C(1, 0) = 1     | C(1, 1) = 0

T_CS = 1 / (1 + 99) = 0.01

n′_1 = 1 × (1 − 0.01) / 0.01 = 99, or: n′_0 = 99 × 0.01 / (1 − 0.01) = 1
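The count conversion can be sketched in a few lines; the helper name `resampled_counts` is illustrative:

```python
def resampled_counts(n_pos, n_neg, t_cs):
    """Class counts that mimic a cost-sensitive threshold t_cs via sampling:
    either oversample positives to n_pos', or undersample negatives to n_neg'."""
    new_n_pos = n_pos * (1 - t_cs) / t_cs
    new_n_neg = n_neg * t_cs / (1 - t_cs)
    return new_n_pos, new_n_neg

# Inverse class distribution costs: C(0, 1) = 99, C(1, 0) = 1
t_cs = 1 / (1 + 99)
n_pos_new, n_neg_new = resampled_counts(n_pos=1, n_neg=99, t_cs=t_cs)
print(round(n_pos_new), round(n_neg_new))  # 99 1
```

Either route (blowing up the positives or shrinking the negatives) yields a balanced 1:1 distribution, matching the slide's example.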

Sampling approaches
And we now arrive at a nice conclusion:

"Sampling the data set so the minority class is equal to the majority class boils down to biasing the classifier in the same way as when you would use a cost matrix constructed from the inverse class imbalance."

Oversampling (upsampling)

Undersampling (downsampling)
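Random over- and undersampling can be sketched in a few lines of plain Python; the helper names and the dictionary-based rows are illustrative, and libraries such as imbalanced-learn offer production-ready versions:

```python
import random

def random_oversample(rows, label_key, minority, n_target, seed=0):
    """Duplicate randomly drawn minority-class rows until n_target is reached."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[label_key] == minority]
    extra = [rng.choice(minority_rows)
             for _ in range(n_target - len(minority_rows))]
    return rows + extra

def random_undersample(rows, label_key, majority, n_target, seed=0):
    """Keep only a random subset of n_target majority-class rows."""
    rng = random.Random(seed)
    majority_rows = [r for r in rows if r[label_key] == majority]
    others = [r for r in rows if r[label_key] != majority]
    return others + rng.sample(majority_rows, n_target)

# 99 negatives and 1 positive, as in the fraud example
rows = [{"y": "neg"} for _ in range(99)] + [{"y": "pos"}]
balanced = random_oversample(rows, "y", minority="pos", n_target=99)
print(sum(r["y"] == "pos" for r in balanced))  # 99
```

Oversampling by duplication is also the "super easy" way to implement instance weighting mentioned earlier.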

Intelligent sampling: SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002)
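A minimal SMOTE-style sketch: instead of duplicating minority rows, interpolate between a minority sample and one of its nearest minority-class neighbours. This is a simplified illustration of the idea in Chawla et al. (2002); the helper name and toy data are assumptions, not the reference implementation:

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating between a random
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: math.dist(x, m),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nn
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic

minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
new_points = smote(minority, n_synthetic=5)
print(len(new_points))  # 5
```

The synthetic points lie between existing minority samples, so the minority region is filled in rather than merely repeated.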

Sampling approaches
- Note: combinations of over- and undersampling are possible
- You can also try oversampling the minority class above the 1:1 level (this would boil down to using even more extreme costs in the cost matrix)
- Very closely related to the field of "cost-sensitive learning":
  - Setting misclassification costs (some implementations allow this as well)
  - Cost-sensitive logistic regression
  - Cost-sensitive decision trees (using modified entropy and information gain measures)
  - Cost-sensitive evaluation measures (e.g. average misclassification cost)
