Advanced Analytics in Business [D0S07a] Big Data Platforms & - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Model Evaluation

slide-2
SLIDE 2

Overview

Introduction Classification performance Regression performance Cross-validation and tuning Revisiting the churn example Additional notes on multiclass, multilabel, and calibration Monitoring and maintenance

2

slide-3
SLIDE 3

The analytics process

3

slide-4
SLIDE 4

It's all about generalization

You have trained a model on a particular data set (e.g. a decision tree) This is your “train data”: used to build model

Performance on your train data gives you an initial idea of your model's validity, but not much more than that

Much more important: ensure this model will do well on unseen data (out-of-time, out-of-sample, out-of-population)

As predictive models are going to be "put to work", validation is needed!

Test (hold-out) data: used to objectively measure performance! A strict separation between training and test set is needed!
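The idea of a strict hold-out can be sketched in a few lines of Python (a toy stand-alone illustration; the function and data are ours, not from the deck, whose own examples use R):

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle once, then cut off a strict hold-out test set."""
    rng = random.Random(seed)
    shuffled = rows[:]                 # copy: don't mutate the caller's data
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # train, test

rows = list(range(100))                # stand-in for 100 labelled instances
train, test = holdout_split(rows)
print(len(train), len(test))  # 70 30
assert not set(train) & set(test)      # strict separation: no overlap
```

The model only ever sees `train`; `test` is touched once, for the final performance measurement.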

slide-5
SLIDE 5

It's all about generalization

At the very least, use a test set

slide-6
SLIDE 6

What do we want to validate?

Out-of-sample Out-of-time Out-of-population

Not possible to foresee everything that will happen in the future, as you are by definition limited to the data you have now

But it is your duty to be as thorough as possible

6

slide-7
SLIDE 7

Classification performance

7

slide-8
SLIDE 8

Threshold: 0.50

True label   Prediction   Predicted label   Correct?
no           0.11         no                Correct
no           0.2          no                Correct
yes          0.85         yes               Correct
yes          0.84         yes               Correct
yes          0.8          yes               Correct
no           0.65         yes               Incorrect
yes          0.44         no                Incorrect
no           0.1          no                Correct
yes          0.32         no                Incorrect
yes          0.87         yes               Correct
yes          0.61         yes               Correct
yes          0.6          yes               Correct
yes          0.78         yes               Correct
no           0.61         yes               Incorrect
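The confusion-matrix counts behind this example can be reproduced with a short Python sketch (labels and scores copied from the table above; the deck's own code examples use R):

```python
# True labels and predicted probabilities from the slide's example
y_true = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def confusion(y_true, scores, threshold=0.5):
    """Count tp, fp, tn, fn at the given probability threshold."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = "yes" if s >= threshold else "no"
        if pred == "yes" and t == "yes":
            tp += 1
        elif pred == "yes":
            fp += 1
        elif t == "no":
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion(y_true, scores)
print(tp, fp, tn, fn)  # 7 2 3 2
```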

Confusion matrix

8

slide-9
SLIDE 9

Confusion matrix

Depends on the threshold! 9

slide-10
SLIDE 10

Metrics

Depends on the confusion matrix, and hence on the threshold! 10

slide-11
SLIDE 11

Common metrics

Accuracy = (tp + tn) / total = (7 + 3) / 14 = 0.71
Recall (sensitivity) = tp / (tp + fn) = 7 / 9 = 0.78 ("How many of the actual positives did we predict as such?")
Specificity = tn / (tn + fp) = 3 / 5 = 0.60
Balanced accuracy = (recall + specificity) / 2 = 0.5 × 0.78 + 0.5 × 0.60 = 0.69
Precision = tp / (tp + fp) = 7 / 9 = 0.78 ("How many of the predicted positives are actually positive?")
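These numbers are easy to check in a few lines of Python, starting from the confusion matrix of the example (tp = 7, fp = 2, tn = 3, fn = 2):

```python
tp, fp, tn, fn = 7, 2, 3, 2   # confusion matrix at threshold 0.50

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)                    # sensitivity
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
precision = tp / (tp + fp)

print(round(accuracy, 2), round(recall, 2), round(specificity, 2),
      round(balanced_accuracy, 2), round(precision, 2))
# 0.71 0.78 0.6 0.69 0.78
```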

11

slide-12
SLIDE 12

True label   Prediction
no           0.11
no           0.2
yes          0.85
yes          0.84
yes          0.8
no           0.65
yes          0.44
no           0.1
yes          0.32
yes          0.87
yes          0.61
yes          0.6
yes          0.78
no           0.61

Recall here our discussion on "well-calibrated" classifiers

Tuning the threshold

For each possible threshold t ∈ T with T the set of all predicted probabilities, we can obtain a confusion matrix And hence different metrics So which threshold to pick?
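A minimal Python sketch of this sweep over the example's predicted probabilities (variable names are illustrative):

```python
# Scores and labels from the slide's running example (1 = "yes")
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def confusion_at(t):
    """Confusion matrix (tp, fp, tn, fn) at threshold t."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
    tn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
    return tp, fp, tn, fn

# One confusion matrix (and hence one set of metrics) per candidate threshold
matrices = {t: confusion_at(t) for t in sorted(set(scores))}
print(len(matrices))   # 13 distinct thresholds (0.61 occurs twice)
print(matrices[0.44])  # (8, 2, 3, 1)
```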

12

slide-13
SLIDE 13

Tuning the model?

For most models, it's extremely hard to push them towards optimizing your metric of choice They'll often inherently optimize for accuracy given the training set In most cases, you will be interested in something else

The class imbalance present in the training set might conflict with a model's notion of accuracy You might want to focus on recall or precision, or...

What can we do?

Tuning the threshold on your metric of interest
Adjust the model parameters
Adjust the target definition
Sample/filter the data set
Apply misclassification costs
Apply instance weighting (a super easy way to do this: duplicate instances)
Adjust the loss function (if the model supports doing so, and even then oftentimes related to the accuracy concern)

13

slide-14
SLIDE 14

Tuning the threshold

14

slide-15
SLIDE 15

Applying misclassification costs

Let's go on a small detour... Let us illustrate the basic problem with a setting you'll encounter over and over again: a binary classification problem where the class of interest (the positive class) happens rarely compared to the negative class

Say fraud only occurs in 1% of cases in the training data

Almost all techniques you run out of the box will show this in your confusion matrix:

                     Actual Negative    Actual Positive
Predicted Negative   TN: 99             FN: 1
Predicted Positive   FP: 0              TP: 0

slide-16
SLIDE 16

Applying misclassification costs

                     Actual Negative    Actual Positive
Predicted Negative   TN: 99             FN: 1
Predicted Positive   FP: 0              TP: 0

What's happening here?

Remember that the model will optimize for accuracy, and gets an accuracy of 99%. That's why you should never trust people who only report accuracy

"No worries, I'll just pick a stricter threshold"

But how to formalize this a bit better? How do I tell my model that I am willing to make some mistakes on the negative side to catch the positives?

16

slide-17
SLIDE 17

Applying misclassification costs

What we would like to do is set misclassification costs as such:

                     Actual Negative    Actual Positive
Predicted Negative   C(0, 0) = 0        C(0, 1) = 5
Predicted Positive   C(1, 0) = 1        C(1, 1) = 0

Mispredicting a positive as a negative is 5 times as bad as mispredicting a negative as a positive

How to determine the costs:

Use real average observed costs (hard to find in many settings)
Expert estimate
Inverse class distribution (...)

17

slide-18
SLIDE 18

Applying misclassification costs

Inverse class distribution

99% negative versus 1% positive:

C(1, 0) = 0.99 / 0.99 = 1
C(0, 1) = 0.99 / 0.01 = 99

                     Actual Negative    Actual Positive
Predicted Negative   C(0, 0) = 0        C(0, 1) = 99
Predicted Positive   C(1, 0) = 1        C(1, 1) = 0

18

slide-19
SLIDE 19

Applying misclassification costs

With a given cost matrix (no matter how we define it), we can calculate the expected loss:

                     Actual Negative    Actual Positive
Predicted Negative   C(0, 0) = 0        C(0, 1) = 5
Predicted Positive   C(1, 0) = 1        C(1, 1) = 0

l(x, j), the expected loss for classifying an observation x as class j:

l(x, j) = Σₖ p(k|x) C(j, k)

For binary classification:

l(x, 0) = p(0|x)C(0, 0) + p(1|x)C(0, 1) = (here) p(1|x)C(0, 1)
l(x, 1) = p(0|x)C(1, 0) + p(1|x)C(1, 1) = (here) p(0|x)C(1, 0)

19

slide-20
SLIDE 20

Applying misclassification costs

Classify an observation as positive if the expected loss for classifying it as a positive observation is smaller than the expected loss for classifying it as a negative observation

l(x, 1) < l(x, 0) → classify as positive (1)

                     Actual Negative    Actual Positive
Predicted Negative   C(0, 0) = 0        C(0, 1) = 5
Predicted Positive   C(1, 0) = 1        C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22

l(x, 0) = p(0|x)C(0, 0) + p(1|x)C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x)C(1, 0) + p(1|x)C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

→ Classify as positive!
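This expected-loss computation is easy to verify in Python (the cost matrix is the slide's hypothetical one; function and variable names are ours):

```python
# Cost matrix from the slide: C[j][k] = cost of predicting j when the truth is k
C = {0: {0: 0, 1: 5},   # predicted negative: C(0,0) = 0, C(0,1) = 5
     1: {0: 1, 1: 0}}   # predicted positive: C(1,0) = 1, C(1,1) = 0

def expected_loss(p1, j):
    """Expected loss l(x, j) of predicting class j when p(1|x) = p1."""
    p = {0: 1 - p1, 1: p1}
    return sum(p[k] * C[j][k] for k in (0, 1))

p1 = 0.22
l0, l1 = expected_loss(p1, 0), expected_loss(p1, 1)
decision = 1 if l1 < l0 else 0
print(round(l0, 2), round(l1, 2), decision)  # 1.1 0.78 1
```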

slide-21
SLIDE 21

Applying misclassification costs

Setting the two expected losses equal gives the cost-sensitive threshold:

l(x, 1) = l(x, 0)
p(0|x)C(1, 0) + p(1|x)C(1, 1) = p(0|x)C(0, 0) + p(1|x)C(0, 1)

With p(0|x) = 1 − p(1|x), solving for p(1|x):

p(1|x) = (C(1, 0) − C(0, 0)) / (C(1, 0) − C(0, 0) + C(0, 1) − C(1, 1)) = T_CS

When C(1, 0) = C(0, 1) = 1 and C(1, 1) = C(0, 0) = 0, then

T_CS = (1 − 0) / (1 − 0 + 1 − 0) = 0.5
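A small Python sketch of this threshold formula (function name is illustrative):

```python
def cost_threshold(c00, c01, c10, c11):
    """Cost-sensitive threshold T_CS on p(1|x): predict positive above it.
    Arguments are the costs C(0,0), C(0,1), C(1,0), C(1,1)."""
    return (c10 - c00) / (c10 - c00 + c01 - c11)

# Symmetric unit costs recover the usual 0.5 threshold
print(cost_threshold(0, 1, 1, 0))            # 0.5
# A false negative five times as costly as a false positive
print(round(cost_threshold(0, 5, 1, 0), 4))  # 0.1667
```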

21

slide-22
SLIDE 22

Applying misclassification costs

                     Actual Negative    Actual Positive
Predicted Negative   C(0, 0) = 0        C(0, 1) = 5
Predicted Positive   C(1, 0) = 1        C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1|x) = 0.22

l(x, 0) = p(0|x)C(0, 0) + p(1|x)C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10
l(x, 1) = p(0|x)C(1, 0) + p(1|x)C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78

T_CS = 1 / (1 + 5) = 0.1667 ≤ 0.22 → Classify as positive!

22

slide-23
SLIDE 23

Sampling approaches

From the above, a new cost-sensitive class distribution can be obtained based on the cost-sensitive threshold T_CS as follows:

New number of positive observations: n1′ = n1 × (1 − T_CS) / T_CS
Or, new number of negative observations: n0′ = n0 × T_CS / (1 − T_CS)

E.g. 1 positive versus 99 negatives (class-inverse cost matrix):

                     Actual Negative    Actual Positive
Predicted Negative   C(0, 0) = 0        C(0, 1) = 99
Predicted Positive   C(1, 0) = 1        C(1, 1) = 0

T_CS = 1 / (1 + 99) = 0.01
n1′ = 1 × (1 − 0.01) / 0.01 = 99, or: n0′ = 99 × 0.01 / (1 − 0.01) = 1
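The resampling arithmetic in Python (illustrative helper name):

```python
def resampled_counts(n_pos, n_neg, t_cs):
    """Class counts that mimic the cost-sensitive threshold t_cs."""
    new_pos = n_pos * (1 - t_cs) / t_cs      # oversample the positives, or...
    new_neg = n_neg * t_cs / (1 - t_cs)      # ...undersample the negatives
    return new_pos, new_neg

t_cs = 1 / (1 + 99)                          # inverse-class-distribution costs
new_pos, new_neg = resampled_counts(1, 99, t_cs)
print(round(new_pos), round(new_neg))  # 99 1
```

Either option produces a balanced 99:99 (or 1:1) training set, which is exactly the conclusion on the next slide.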

23

slide-24
SLIDE 24

Sampling approaches

And we now arrive at a nice conclusion: Sampling the data set so the minority class is equal to the majority class boils down to biasing the classifier in the same way as when you would use a cost matrix constructed from the inverse class imbalance


24

slide-25
SLIDE 25

Oversampling (upsampling)

25

slide-26
SLIDE 26

Undersampling (downsampling)

26

slide-27
SLIDE 27

Intelligent sampling

SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002)

27

slide-28
SLIDE 28

Sampling approaches

Note: combinations of over- and undersampling are possible. You can also try oversampling the minority class above the 1:1 level (boils down to using even more extreme costs in the cost matrix). Very closely related to the field of "cost-sensitive learning"

Setting misclassification costs (some implementations allow this as well) Cost sensitive logistic regression Cost sensitive decision trees (uses modified entropy and information gain measures) Cost sensitive evaluation measures (e.g. Average Misclassification Cost)

28

slide-29
SLIDE 29

Sampling approaches

Only on your training set! Test set remains untouched!

Basically, a way to indicate to the learner that both classes are equally important
On the test set, you can use AUC or the metric you are actually interested in
Note that the accuracy on the test set after up/downsampling will most likely be lower than what you got in the "just always predict the majority class" case
I.e. your model will now start to identify cases as being fraudulent... some of these will be false positives: the price to pay to get the true positives out
Remember the precision versus recall trade-off
Experimentation with the right amounts of over/undersampling is required
SMOTE and other intelligent sampling techniques work well, but are not magic: you'll still need some positives...
Also, don't expect SMOTE to create "hidden, future, ..." cases of positive instances
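A minimal sketch of plain random oversampling in Python, on a toy list of (features, label) pairs. This only illustrates the idea; the deck's own examples use R with the ROSE package:

```python
import random

def oversample_minority(data, seed=0):
    """data: list of (features, label) pairs with labels 0/1.
    Randomly duplicate minority-class rows until both classes are equal.
    Apply this to the TRAINING data only; the test set stays untouched."""
    rng = random.Random(seed)
    pos = [d for d in data if d[1] == 1]
    neg = [d for d in data if d[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy training set: 99 negatives, 1 positive
train = [((i,), 0) for i in range(99)] + [((99,), 1)]
balanced = oversample_minority(train)
labels = [y for _, y in balanced]
print(len(balanced), labels.count(1), labels.count(0))  # 198 99 99
```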

Class imbalance occurs in many settings! 29

slide-30
SLIDE 30

Sampling approaches

Some techniques also support instance weighting: not defined per cell in the confusion matrix but per instance

Indicate that some instances are more important to get right Similar derivation is possible here: a rough approach consists of duplicating instance rows that are deemed more important Again: biasing the training in the same way Again: only in the training data (the fact that some instances are more important can then be evaluated with a corresponding evaluation scheme during testing)

30

slide-31
SLIDE 31

Sampling approaches

However, note that sampling biases the training set and the probability ranges your model outputs. This is fine if you're only interested in a ranking, but it distorts a calibrated view on the probabilities. In case this is important, you can unbias the probability output using (Saerens et al., 2002):

p(Cᵢ|x) = [ (p(Cᵢ) / pₛ(Cᵢ)) × pₛ(Cᵢ|x) ] / [ Σⱼ₌₁..ₘ (p(Cⱼ) / pₛ(Cⱼ)) × pₛ(Cⱼ|x) ]

with Cᵢ class i, pₛ(Cᵢ|x) the biased probability (on the sampled data set), pₛ(Cᵢ) the prior probability (proportion) of class Cᵢ on the sampled training data set, and p(Cᵢ) the original prior (proportion) before sampling (e.g. 1% vs. 99%)
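A sketch of this prior correction in Python (the naming is ours, not from the paper):

```python
def unbias(p_biased, prior_sampled, prior_true):
    """Saerens et al. (2002) prior correction.
    p_biased: class probabilities for one instance from the model trained on
    the sampled set; prior_sampled: class proportions in that sampled set;
    prior_true: original class proportions before sampling."""
    w = [pt / ps * p for p, ps, pt in zip(p_biased, prior_sampled, prior_true)]
    total = sum(w)
    return [x / total for x in w]

# A model trained on a balanced 50/50 sample says p(pos|x) = 0.8,
# but the true prior is only 1% positives:
p = unbias([0.2, 0.8], prior_sampled=[0.5, 0.5], prior_true=[0.99, 0.01])
print(round(p[1], 4))  # 0.0388
```

Reweighting by the prior ratio pulls the inflated 0.8 back down toward a realistic probability.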

31

slide-32
SLIDE 32

Example

library(caret)
library(tidyverse)
library(magrittr)
library(ROCR)
library(PRROC)
library(ROSE)

data <- read.csv('data.csv')
table(data$TARGET)
#    0    1
# 4748  252

train.index <- createDataPartition(data$TARGET, p = .7, list = FALSE)
train <- data[ train.index,]
test  <- data[-train.index,]

dtree <- train(TARGET ~ ., data = train, method = "rpart", tuneLength = 10)
predictions <- predict(dtree, test, type = 'prob')

32

slide-33
SLIDE 33

Example

train.sampled <- ROSE(TARGET ~ ., data = train, p = 0.5)$data
table(train.sampled$TARGET)
#    0    1
# 1754 1747

dtree.sampled <- train(TARGET ~ ., data = train.sampled, method = "rpart", tuneLength = 10)
predictions <- predict(dtree.sampled, test, type = 'prob')

33

slide-34
SLIDE 34

Example

After rescaling: 34

slide-35
SLIDE 35

(Back to) classification performance

Let's get back on track. We have seen in any case that accuracy is not the only metric we should focus on

Recall and precision concerns are much more important. They depend on the threshold, however. We have already seen a recall/precision curve. Other smart approaches?

35

slide-36
SLIDE 36

ROC curve

Make a table with sensitivity and specificity for each possible cut-off
The receiver operating characteristic (ROC) curve plots sensitivity (tp rate) versus 1 − specificity (fp rate) for each possible cut-off
A perfect model has sensitivity of 1 and specificity of 1 (i.e. the upper left corner)
The ROC curve can be summarized by the area underneath it (area under the (RO) curve, AUC)
The AUC represents the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative instance (Hanley and McNeil, 1982)
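The pairwise interpretation gives a direct, if O(n²), way to compute AUC; on the deck's running example it comes out around 0.83 (Python sketch, counting ties as 1/2):

```python
# Scores and labels from the slide's running example (1 = "yes")
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
scores = [0.11, 0.2, 0.85, 0.84, 0.8, 0.65, 0.44,
          0.1, 0.32, 0.87, 0.61, 0.6, 0.78, 0.61]

def auc_pairwise(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative),
    ties counted as 1/2 (the Hanley & McNeil interpretation)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(round(auc_pairwise(y_true, scores), 4))  # 0.8333
```

Note this is threshold-free: it only uses the ranking of the scores.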

36

slide-37
SLIDE 37

True label   Prediction
no           0.11
no           0.2
yes          0.85
yes          0.84
yes          0.8
no           0.65
yes          0.44
no           0.1
yes          0.32
yes          0.87
yes          0.61
yes          0.6
yes          0.78
no           0.61

Try this in Excel

ROC curve

37

slide-38
SLIDE 38

ROC curve

38

slide-39
SLIDE 39

ROC curve

39

slide-40
SLIDE 40

ROC curve

https://arxiv.org/pdf/1812.01388.pdf

40

slide-41
SLIDE 41

ROC curve

ROC curve can be summarized by the area underneath (area under (RO) curve, AUC) Similar AUC for precision-recall curve exists

But: visual inspection and understanding required!

You might only be interested in a certain area of the curve "Weighted" approaches exist, but not commonly known about

Also see:

http://www.rduin.nl/presentations/ROC Tutorial Peter Flach/ROCtutorialPartI.pdf
https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221

41

slide-42
SLIDE 42

Lift

Assume a random model handing out random probabilities
Take the top n (e.g. the top 100)
See how many of them were indeed "yes", e.g. 10 / 100
Now do the same for your model, which gives e.g. 80 / 100
The lift of your model over random is 80 / 10 = 8
A lift of 1 means random sorting
Depends on n (in general, getting more hits is more difficult in a shorter list) and on the a priori class distribution between "no" and "yes" instances
Can be done over distinct groups instead of cumulatively
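A Python sketch of lift-at-n on hypothetical data matching the numbers above (names and data are illustrative):

```python
def lift_at_n(y_true, scores, n):
    """Hit rate in the model's top n, relative to the overall positive rate."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits_model = sum(y for _, y in ranked[:n]) / n
    base_rate = sum(y_true) / len(y_true)     # what a random ordering gives
    return hits_model / base_rate

# Hypothetical scored population: 1000 cases, 10% positives overall,
# and a model that puts 80 positives in its top 100
y_true = [1] * 80 + [0] * 20 + [1] * 20 + [0] * 880
scores = [1 - i / 1000 for i in range(1000)]  # already sorted best-first
print(round(lift_at_n(y_true, scores, 100), 2))  # 8.0
```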

Recall/precision at n: same concept, for top ranked n observations

Especially important if shortlists need to be delivered E.g. common in the setting of recommender systems

Lorenz curve

Same, but from economics field

42

slide-43
SLIDE 43

The H-measure

A coherent alternative to the area under the ROC curve (Hand, 2009)

The area under the ROC curve (AUC) is a very widely used measure of performance for classification and diagnostic rules
It has the appealing property of being objective, requiring no subjective input from the user
On the other hand, the AUC has disadvantages
For example, the AUC can give potentially misleading results if ROC curves cross
It is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules

A nice alternative, though less commonly used

slide-44
SLIDE 44

Regression performance

Hypothesis tests on the coefficients, with confidence intervals:

H0: β = 0 versus HA: β > 0 or HA: β < 0

r², the coefficient of determination: the proportion of variation in y explained ("captured") by the regression model:

r² = 1 − SSE / Syy, with Syy = Σᵢ (yᵢ − ȳ)² and SSE = Σᵢ (yᵢ − ŷᵢ)²

44

slide-45
SLIDE 45

Regression performance

Scatter plot between predicted and true y value

Calculate e.g. Pearson correlation

45

slide-46
SLIDE 46

Regression performance

AIC (Akaike Information Criterion): AIC = 2k − 2 ln(L̂), with k the number of estimated parameters and L̂ the maximized likelihood

A relative estimate (!) of the information lost when a given model is used to represent the process that generates the data
A trade-off between the goodness of fit and the complexity of the model

BIC (Bayesian Information Criterion), a.k.a. Schwarz criterion

Closely related to AIC

Adjusted r²: r²ₐ = 1 − (1 − r²) (n − 1) / (n − k), with k the number of estimated parameters (including the intercept)

A version of r-squared adjusted for the number of predictors in the model
Increases only if a new term improves the model more than would be expected by chance, decreases otherwise (plain r-squared would continue to increase even after dumping useless features in)
Most implementations report this, even if they might simply call it r²

Others: deviance information criterion, Hannan-Quinn information criterion, Jensen-Shannon divergence, Kullback-Leibler divergence, minimum message length, ...

Look at Mean Squared Error, Mean Absolute Deviation, Root Mean Squared Error, ...

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
MAD = (1/n) Σᵢ |yᵢ − ŷᵢ|
RMSE = √MSE (equals the standard deviation of the residuals for an unbiased model)

Note: cost-sensitive measures and tuning exist here as well (e.g. "BSZ tuning", Bansal, Sinha, and Zhao):

AMC = (1/n) Σᵢ C(yᵢ − ŷᵢ)
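The MSE / MAD / RMSE definitions in a short Python sketch (toy data, illustrative only):

```python
import math

def regression_metrics(y, y_hat):
    """MSE, MAD and RMSE for true values y and predictions y_hat."""
    errors = [yi - yhi for yi, yhi in zip(y, y_hat)]
    mse = sum(e * e for e in errors) / len(errors)
    mad = sum(abs(e) for e in errors) / len(errors)
    return mse, mad, math.sqrt(mse)

y, y_hat = [3.0, 5.0, 2.0, 8.0], [2.0, 5.0, 4.0, 7.0]
mse, mad, rmse = regression_metrics(y, y_hat)
print(mse, mad, round(rmse, 3))  # 1.5 1.0 1.225
```

MSE punishes the one error of 2 more heavily than MAD does, which is the usual reason for choosing one over the other.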

46

slide-47
SLIDE 47

Regression performance

Perform some basic validation checks

Check residuals of the model Check variables with extreme coefficients (especially when applying regularization) Check the sign of the coefficients

Note that this applies to basically any model: don't just train and look at the AUC; take a look at the top misclassified instances (would they be hard for you as well?). Take a look at variable importance, the position of features in the tree, splitting points. Interpretability is key (see later)

slide-48
SLIDE 48

Regression performance

There's a difference between "predicting the future" and "extrapolating from training data"!

Use the appropriate technique

Also applies to all model types 48

slide-49
SLIDE 49

Cross-validation and tuning

49

slide-50
SLIDE 50

Cross-validation and tuning

50

slide-51
SLIDE 51

Cross-validation and tuning

Decision trees with early stopping: 51

slide-52
SLIDE 52

Cross-validation and tuning

General train-valid-test split: 52

slide-53
SLIDE 53

Cross-validation and tuning

53

slide-54
SLIDE 54

Cross-validation and tuning

54

slide-55
SLIDE 55

Cross-validation and tuning

55

slide-56
SLIDE 56

Cross-validation and tuning

56

slide-57
SLIDE 57

Cross-validation and tuning

# Note that we are scaling the predictors
glmnet_model <- train(annual_pm ~ ., data = dplyr::select(lur, -site_id),
                      preProcess = c("center", "scale"),
                      method = "glmnet", trControl = tr)

arrange(glmnet_model$results, RMSE) %>% head
##   alpha      lambda     RMSE  Rsquared    RMSESD RsquaredSD
## 1  0.10 0.330925285 1.046882 0.8213086 0.3711204  0.1662474  <--
## 2  1.00 0.033092528 1.057797 0.8151413 0.3165820  0.1661203
## 3  0.55 0.033092528 1.058651 0.8152392 0.3179481  0.1677805
## 4  0.10 0.033092528 1.067397 0.8131885 0.3243109  0.1708488
## 5  1.00 0.003309253 1.073726 0.8113261 0.3224757  0.1711788
## 6  0.55 0.003309253 1.073969 0.8109472 0.3231762  0.1722758

57

slide-58
SLIDE 58

Cross-validation and tuning

Cross-validation is a way to protect against overfitting and to ensure sound validation by adding diversity through repeated runs

Prevent lucky hits

Many different types exist:

Repeated (nested) cross-validation
Repeated out-of-time validation
Leave-one-out cross-validation (an extreme form of cross-validation)
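A minimal sketch of plain k-fold index construction in Python (illustrative; this is not how caret or scikit-learn implement it internally):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 5)
print([len(f) for f in folds])  # [2, 2, 2, 2, 2]

# Every instance appears in exactly one validation fold
assert sorted(i for f in folds for i in f) == list(range(10))
# Each run trains on k-1 folds and validates on the held-out one
train_run0 = [i for f in folds[1:] for i in f]
assert len(train_run0) == 8 and not set(train_run0) & set(folds[0])
```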

58

slide-59
SLIDE 59

Revisiting the churn example

59

slide-60
SLIDE 60

Marketing analytics: churn prediction

Three enormous challenges...

  • 1. Need to make a distinction between a characteristic predictor for future churn and a symptom of already-occurring churn

E.g. a sudden peak in usage often occurs right before churn because the customer has already decided to churn. Focus on early-warning predictors

  • 2. Real-life churn data sets have a very skewed class distribution (e.g. about 1-5% churners)

Logistic regression and decision tree models cannot be appropriately estimated. Use oversampling on the train (not test!) data

  • 3. How to make it actionable?

60

slide-61
SLIDE 61

Marketing analytics: churn prediction

In fact, you’re now predicting the past!

Some people like this approach: build a model and look at the false positives :(

61

slide-62
SLIDE 62

Marketing analytics: churn prediction

Better 62

slide-63
SLIDE 63

Marketing analytics: churn prediction

Better still 63

slide-64
SLIDE 64

Marketing analytics: churn prediction

Better still 64

slide-65
SLIDE 65

Marketing analytics: churn prediction

Or even (panel data analysis)

But be very careful when setting up your (cross-)validation

65

slide-66
SLIDE 66

Marketing analytics: churn prediction

Common approach 66

slide-67
SLIDE 67

Marketing analytics: churn prediction

Don't forget to apply upsampling on the minority class Which AUC to expect?

It depends on the setting. The 0.7-0.9 range is common. > 0.9: be sceptical -- carefully check your variables, assumptions, approach, validation

67

slide-68
SLIDE 68

Some additional notes on validation

68

slide-69
SLIDE 69

What about multiclass?

Concept of confusion matrix still applies But: metrics somewhat harder to calculate (multiple "positive" classes possible here, so potentially multiple ROC curves that can be constructed and inspected!)

Averaging techniques across the curves exist, e.g. https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

69

slide-70
SLIDE 70

What about multiclass?

What if your technique only supports binary classification to begin with? One simple approach is a transformation to binary:

One-vs.-all (one-vs.-rest):

Contrast every class against all other classes For k classes, build k classifiers Assign a new observation using the highest posterior probability

One-vs.-one:

Contrast every class against every (single) other class Pairwise approach For k classes, build k(k-1)/2 classifiers Assign a new observation using the majority voting rule
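The classifier counts are simple arithmetic; a tiny Python sketch:

```python
def n_classifiers(k):
    """Binary classifiers needed to cover k classes."""
    one_vs_rest = k                   # one classifier per class
    one_vs_one = k * (k - 1) // 2     # one classifier per pair of classes
    return one_vs_rest, one_vs_one

print(n_classifiers(3))   # (3, 3)
print(n_classifiers(10))  # (10, 45)
```

For large k, one-vs.-one trains many more (but individually smaller) models.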

70

slide-71
SLIDE 71

One-vs.-all

71

slide-72
SLIDE 72

One-vs.-one

72

slide-73
SLIDE 73

What about multilabel?

Evaluation: specific definitions for precision, recall, Jaccard index, Hamming loss, i.e. adapted to incorporate the fact that an instance can have multiple labels What if your technique does not support it?

Transform into binary classification ("binary relevance method")

Independently train one binary classifier for each label (instance has label: yes/no)
The combined model then predicts all labels for which the respective classifiers predict a positive result ("has label")
Not the same as one-vs.-one or one-vs.-all
Does not consider label relationships, but simple
Alternatives exist: e.g. classifier chaining

Transform into a multi-class problem:

Based on making the powerset over the labels
E.g., if possible labels are Dog, Cat, Duck, the label powerset representation of this problem is a multi-class classification problem with the classes a:[0 0 0], b:[1 0 0], c:[0 1 0], d:[0 0 1], e:[1 1 0], f:[1 0 1], g:[0 1 1], h:[1 1 1], where for example [1 0 1] denotes an example where labels Dog and Duck are present and label Cat is absent
Simple, but leads to an explosion of classes!
Better: ensemble methods or neural-network-based approaches
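The powerset construction itself is a one-liner in Python (using the slide's Dog/Cat/Duck example):

```python
from itertools import product

labels = ["Dog", "Cat", "Duck"]

# Label powerset: every 0/1 combination over the labels becomes one class
classes = list(product([0, 1], repeat=len(labels)))
print(len(classes))  # 8 classes for 3 labels

# e.g. (1, 0, 1) is the class "Dog and Duck present, Cat absent"
assert (1, 0, 1) in classes
# ...and the number of classes explodes as 2**k
assert len(list(product([0, 1], repeat=10))) == 1024
```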

73

slide-74
SLIDE 74

Validation is hard

What if final test set evaluation gives bad results? (Throw away the whole project? Hunt for new data set?)

You should, but it happens Be sure to know the risks

Should feature engineering and transformation be done on the whole data set? ("It's so hard not to")

No: fit transformations on the training data only (Python packages are often more sensible in this regard, e.g. via pipelines)

Even when holding out a final test set, too much re-use of the same train/validation split leads to hidden overtraining ("I'll just make a small parameter tweak")

So do too many parameter combination runs (over-usage of the same data). Suddenly, the test set result will be disappointing

Some models try to avoid overfitting by themselves (see later: bootstrapping). Also, if scores are too good to be true, they probably are (target variable "leakage")

74

slide-75
SLIDE 75

http://scikit-learn.org/stable/modules/calibration.html http://fastml.com/classifier-calibration-with-platts-scaling- and-isotonic-regression

Probability calibration

As seen above, some models can give you poor estimates of the class probabilities, and some do not even support probability prediction. Sampling the training set also biases the probability distribution

Logistic regression returns well-calibrated predictions by default as it directly optimizes log-loss. In contrast, other methods return biased probabilities, with different biases per method

E.g. methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1, because variance in the underlying base models will bias predictions that should be near zero or one away from these values

Calibration methods exist to "fix" this

75

slide-76
SLIDE 76

Monitoring and maintenance

76

slide-77
SLIDE 77

Monitoring

Validation doesn’t stop at deployment

Input data

Distributions, check categorical levels, check missing values System stability index

Output predictions

Hard to monitor unless true outcomes are tracked. But we can monitor the prediction distribution

https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population- stability/
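One common way to monitor input or prediction distributions is the system stability index mentioned above; a Python sketch of the usual PSI/SSI formula (the bin proportions here are made up):

```python
import math

def stability_index(expected, actual):
    """PSI/SSI between two binned distributions (proportions per bin)."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if a > 0 and e > 0)

# Score distribution at development time vs. this month's population (made up)
dev     = [0.10, 0.20, 0.40, 0.20, 0.10]
current = [0.08, 0.18, 0.38, 0.24, 0.12]
ssi = stability_index(dev, current)
print(round(ssi, 4))  # 0.0185
# Common rule of thumb: < 0.10 stable, 0.10-0.25 some shift, > 0.25 investigate
```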

77

slide-78
SLIDE 78

Monitoring

What to report, which performance metrics

"Does AUC matter?"
Excel, scorecard, traffic lights
API (REST)
Oftentimes the prediction probability is combined with another factor: risk, consequence, damage, value…

78

slide-79
SLIDE 79

Monitoring

Monitoring your population at deployment…

The goal is to set up a host of warnings which initiate a retraining (maintenance) trigger

79

slide-80
SLIDE 80

Monitoring

Assertive R Programming with assertr

https://cran.r-project.org/web/packages/assertr/vignettes/assertr.html

mtcars %>%
  insist(within_n_sds(2), mpg) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))

80

slide-81
SLIDE 81

Monitoring

Visibility and Monitoring for Machine Learning Models There’s a great paper that I highly recommend you read by this guy named D. Sculley, who is a professor at Tufts, engineer at Google. He says machine learning is the high interest credit card of technical debt because machine learning is basically spaghetti code that you deploy on purpose. That’s essentially what machine learning is. You’re taking a bunch of data, generating a bunch of numbers and then putting it in a rush intentionally. And then trying to figure out, reverse engineer how does this thing actually work. There are a bunch of terrible downstream consequences to this. It’s a risky thing to do. So you only want to do it when you absolutely have to. http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/

What’s your ML test score? A rubric for production ML systems.

https://research.google.com/pubs/pub45742.html

81

slide-82
SLIDE 82

Monitoring

https://research.google.com/pubs/pub45742.html

82

slide-83
SLIDE 83

The road to data science maturity

Domino Data Labs https://www.dominodatalab.com/resources/data-science-maturity-model/

83

slide-84
SLIDE 84

The road to data science maturity

Structured Processes:

When I get any request, I first check this library for existing work
When I select data, I must note the assumptions and limitations of my sample
Model validation requires three sign-offs: peer, manager, and business stakeholder
Datasets including certain demographic variables need compliance sign-off
Models have a pre-defined shelf life and variation tolerance which triggers reviews or re-development

https://www.dominodatalab.com/resources/data-science-maturity-model/

84

slide-85
SLIDE 85

The road to data science maturity

Ten ways your data project is going to fail

http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/

  • 1. Your data isn’t ready

Has the data been used before in a project? If not, add 6-12 months onto the schedule for data cleansing

  • 2. Somebody heard “data is the new oil”

Data is not a commodity, it needs to be transformed into a product before it's valuable

  • 3. Your data scientists are about to quit
  • 4. You don’t have a data scientist leader
  • 5. You shouldn’t have hired scientists
  • 6. Your boss read a blog post about machine learning
  • 7. Your models are too complex

Use an interpretable model first

  • 8. Your results are not reproducible
  • 9. R&D is alien to your company culture
  • 10. Designing data products without seeing live data

The core concern is data! 85

slide-86
SLIDE 86

Data science platforms as the solution

A lot of "data science platforms" entered the market in previous years

H2O Domino Databricks Dataiku Anaconda MLflow CometML ...

86

slide-87
SLIDE 87

Data science platforms as the solution

https://www.comet.ml/

87

slide-88
SLIDE 88

Data science platforms as the solution

https://www.dominodatalab.com/

88

slide-89
SLIDE 89

Data science platforms as the solution

https://www.dataiku.com/

89

slide-90
SLIDE 90

Data science platforms as the solution

https://mlflow.org/

90

slide-91
SLIDE 91

Do it yourself?

https://github.com/spotify/luigi https://github.com/thieman/dagobah https://airflow.apache.org/

91

slide-92
SLIDE 92

Data science platforms as the solution?

Most of these focus on the data scientist in the role of a model developer:

Versioning: for models (but also data?) Collaboration Scalable execution Multiple language/environment support

But it should also be about:

Reproducibility (model, data, environment freezing)
Acyclic dependency graphs
Monitoring
Scheduling
Checks that warn when retraining is in order
Models as data

92

slide-93
SLIDE 93

Data science platforms as the solution?

Great to see ML "governance" work being done on the training-part of the pipeline. Seems like this provides a Domino Data Labs based dashboards but without the walled garden environment. I've yet to see similar great initiatives also tackling the deployment-part. E.g. something similar you can stick on top of your model's API (or scheduled batch predictive outputs), as well as incoming instances, to monitor usage patterns, population shifts through time, probability distributions, newly popping up missing values or categorical levels, logs, etc, in order to provide warning lights to indicate that a retraining might be in order, for instance. Google's "What's your ML test score" paper provides some great insights, but I hope someone will tackle this with a turnkey solution as well.


gidim 10 months ago: Thanks! We indeed solve a similar pain point as Domino, but unlike them we allow you to train your models on your own infra/laptop. As for monitoring production models, that's something we're also working on. It was important to get the training part out first so we can measure those distributions changing.


93

slide-94
SLIDE 94

Closing with two brain teasers

94

slide-95
SLIDE 95

Brain teaser

You want to predict whether a machine will fail in the near future
Decision tree result: 95% accuracy
You have 5% positives in your data set; failure doesn't happen often
Your tree looks like this:

Target = NoFail

Are you happy? "But I used a test set?"

slide-96
SLIDE 96

(Tree fragment: ⬋ (yes) → Target = Fail, ⬊ (no) → ...)

Brain teaser

You fix the previous issue... and train again
Somewhere in your decision tree, you spot:

PurchaseYear < 2015

Are you happy?