Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]

Model Evaluation
Overview
Introduction
Classification performance
Regression performance
Cross-validation and tuning
Revisiting the churn example
Additional notes on multiclass, multilabel, and calibration
Monitoring and maintenance
The analytics process
It's all about generalization
You have trained a model on a particular data set (e.g. a decision tree)
This is your “train data”: used to build the model
Performance on your train data gives you an initial idea of your model’s validity
But not much more than that
Much more important: ensure this model will do well on unseen data (out-of-time, out-of-sample, out-of-population)
As predictive models are going to be "put to work", validation is needed!
Test (hold-out) data: used to objectively measure performance!
A strict separation between training and test set is needed!
It's all about generalization
At the very least, use a test set
What do we want to validate?
Out-of-sample
Out-of-time
Out-of-population
Not possible to foresee everything that will happen in the future, as you are by definition limited to the data you have now
But it is your duty to be as thorough as possible
Classification performance
True Label   Prediction   →   Predicted Label (threshold: 0.50)   Correct?
no           0.11             no                                  Correct
no           0.20             no                                  Correct
yes          0.85             yes                                 Correct
yes          0.84             yes                                 Correct
yes          0.80             yes                                 Correct
no           0.65             yes                                 Incorrect
yes          0.44             no                                  Incorrect
no           0.10             no                                  Correct
yes          0.32             no                                  Incorrect
yes          0.87             yes                                 Correct
yes          0.61             yes                                 Correct
yes          0.60             yes                                 Correct
yes          0.78             yes                                 Correct
no           0.61             yes                                 Incorrect
Confusion matrix
Confusion matrix
Depends on the threshold!
Metrics
Depends on the confusion matrix, and hence on the threshold!
Common metrics
Accuracy = (tp + tn) / total = (3 + 7) / 14 = 0.71
Recall (sensitivity) = tp / (tp + fn) = 7 / 9 = 0.78
  “How many of the actual positives did we predict as such?”
Specificity = tn / (tn + fp) = 3 / 5 = 0.60
Balanced accuracy = (recall + specificity) / 2 = 0.5 × 0.78 + 0.5 × 0.60 = 0.69
Precision = tp / (tp + fp) = 7 / 9 = 0.78
  “How many of the predicted positives are actually positive?”
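These metrics follow mechanically from the predictions; a minimal R sketch, rebuilding the confusion matrix and the metrics above from the example table at the 0.50 threshold:

# True labels and predicted probabilities from the example table
truth <- factor(c("no", "no", "yes", "yes", "yes", "no", "yes",
                  "no", "yes", "yes", "yes", "yes", "yes", "no"))
score <- c(0.11, 0.20, 0.85, 0.84, 0.80, 0.65, 0.44,
           0.10, 0.32, 0.87, 0.61, 0.60, 0.78, 0.61)

pred <- factor(ifelse(score >= 0.50, "yes", "no"), levels = c("no", "yes"))
cm <- table(Predicted = pred, Actual = truth)

tp <- cm["yes", "yes"]; tn <- cm["no", "no"]
fp <- cm["yes", "no"];  fn <- cm["no", "yes"]

(tp + tn) / sum(cm)                      # accuracy: 0.71
tp / (tp + fn)                           # recall: 0.78
tn / (tn + fp)                           # specificity: 0.60
tp / (tp + fp)                           # precision: 0.78
(tp / (tp + fn) + tn / (tn + fp)) / 2    # balanced accuracy: 0.69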
True Label   Prediction
no           0.11
no           0.20
yes          0.85
yes          0.84
yes          0.80
no           0.65
yes          0.44
no           0.10
yes          0.32
yes          0.87
yes          0.61
yes          0.60
yes          0.78
no           0.61
→
Recall here our discussion on “well-calibrated” classifiers
Tuning the threshold
For each possible threshold t ∈ T, with T the set of all predicted probabilities, we can obtain a confusion matrix
And hence different metrics
So which threshold to pick?
Tuning the model?
For most models, it's extremely hard to push them towards optimizing your metric of choice
They'll often inherently optimize for accuracy given the training set
In most cases, you will be interested in something else
The class imbalance present in the training set might conflict with a model's notion of accuracy
You might want to focus on recall or precision, or...
What can we do?
Tune the threshold on your metric of interest
Adjust the model parameters
Adjust the target definition
Sample/filter the data set
Apply misclassification costs
Apply instance weighting (super easy way to do this: duplicate instances)
Adjust the loss function (if the model supports doing so, and even then it is oftentimes tied to an accuracy-style objective)
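The first option is easy to sketch in R: sweep every predicted probability as a candidate threshold (reusing the truth and score vectors from the sketch above) and compute the metrics of interest at each:

# One confusion matrix (and hence one metric value) per candidate threshold
thresholds <- sort(unique(score))
metrics <- sapply(thresholds, function(thr) {
  pred <- score >= thr
  tp <- sum(pred & truth == "yes")
  fp <- sum(pred & truth == "no")
  fn <- sum(!pred & truth == "yes")
  c(threshold = thr, recall = tp / (tp + fn), precision = tp / (tp + fp))
})
t(metrics)   # one row per threshold: pick the row that best serves your metric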
Tuning the threshold
Applying misclassification costs
Let's go on a small detour... Let us illustrate the basic problem with a setting you'll encounter over and over again: a binary classification problem where the class of interest (the positive class) happens rarely compared to the negative class
Say fraud only occurs in 1% of cases in the training data
Almost all techniques you run out of the box will show this in your confusion matrix:

                     Actual Negative   Actual Positive
Predicted Negative   TN: 99            FN: 1
Predicted Positive   FP: 0             TP: 0
Applying misclassification costs
                     Actual Negative   Actual Positive
Predicted Negative   TN: 99            FN: 1
Predicted Positive   FP: 0             TP: 0

What's happening here?
Remember that the model will optimize for accuracy, and gets an accuracy of 99%
That's why you should never believe people who only report on accuracy
"No worries, I'll just pick a stricter threshold"
But how to formalize this a bit better? How do I tell my model that I am willing to make some mistakes on the negative side to catch the positives?
Applying misclassification costs
What we would like to do is set misclassification costs as such:

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0
Mispredicting a positive as a negative is 5 times as bad as mispredicting a negative as a positive

How to determine the costs?
Use real average observed costs (hard to find in many settings)
Expert estimate
Inverse class distribution (...)
Applying misclassification costs
Inverse class distribution: 99% negative versus 1% positive

C(1, 0) = 0.99 / 0.99 = 1
C(0, 1) = 0.99 / 0.01 = 99

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 99
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0
Applying misclassification costs
With a given cost matrix (no matter how we define it), we can calculate the expected loss:

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

l(x, j), the expected loss for classifying an observation x as class j, is:

l(x, j) = ∑ₖ p(k∣x) C(j, k)

For binary classification:

l(x, 0) = p(0∣x)C(0, 0) + p(1∣x)C(0, 1) = (here) p(1∣x)C(0, 1)
l(x, 1) = p(0∣x)C(1, 0) + p(1∣x)C(1, 1) = (here) p(0∣x)C(1, 0)
Applying misclassification costs
Classify an observation as positive if the expected loss for classifying it as a positive observation is smaller than the expected loss for classifying it as a negative observation
l(x, 1) < l(x, 0) → classify as positive (1)
                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1∣x) = 0.22
l(x, 0) = p(0∣x)C(0, 0) + p(1∣x)C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10 l(x, 1) = p(0∣x)C(1, 0) + p(1∣x)C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78
→ Classify as positive!
Applying misclassification costs
Setting l(x, 1) = l(x, 0):

p(0∣x)C(0, 0) + p(1∣x)C(0, 1) = p(0∣x)C(1, 0) + p(1∣x)C(1, 1)

With p(0∣x) = 1 − p(1∣x), solving for p(1∣x) gives the cost-sensitive threshold:

p(1∣x) = (C(1, 0) − C(0, 0)) / (C(1, 0) − C(0, 0) + C(0, 1) − C(1, 1)) = T_CS

When C(1, 0) = C(0, 1) = 1 and C(1, 1) = C(0, 0) = 0, then:

T_CS = (1 − 0) / (1 − 0 + 1 − 0) = 0.5
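As a sketch, the threshold formula as a small R helper (cost matrix entries passed as arguments, C(predicted, actual)):

# Cost-sensitive threshold T_CS derived above
cs_threshold <- function(c10, c01, c00 = 0, c11 = 0) {
  (c10 - c00) / (c10 - c00 + c01 - c11)
}

cs_threshold(c10 = 1, c01 = 1)   # 0.5: the usual cost-insensitive threshold
cs_threshold(c10 = 1, c01 = 5)   # 0.1667: classify as positive once p(1|x) exceeds this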
Applying misclassification costs
                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 5
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

Example: a cost-insensitive classifier predicts p(1∣x) = 0.22
l(x, 0) = p(0∣x)C(0, 0) + p(1∣x)C(0, 1) = 0.78 × 0 + 0.22 × 5 = 1.10 l(x, 1) = p(0∣x)C(1, 0) + p(1∣x)C(1, 1) = 0.78 × 1 + 0.22 × 0 = 0.78
T_CS = 1 / (1 + 5) = 0.1667 ≤ 0.22 → Classify as positive!
Sampling approaches
From the above, a new cost-sensitive class distribution can be obtained based on the cost-sensitive threshold as follows:

New number of positive observations: n₁′ = n₁ × (1 − T_CS) / T_CS
Or, new number of negative observations: n₀′ = n₀ × T_CS / (1 − T_CS)

E.g. 1 positive versus 99 negative (class inverse cost matrix):

                     Actual Negative   Actual Positive
Predicted Negative   C(0, 0) = 0       C(0, 1) = 99
Predicted Positive   C(1, 0) = 1       C(1, 1) = 0

T_CS = 1 / (1 + 99) = 0.01
n₁′ = 1 × (1 − 0.01) / 0.01 = 99, or: n₀′ = 99 × 0.01 / (1 − 0.01) = 1
Sampling approaches
And we now arrive at a nice conclusion: Sampling the data set so the minority class is equal to the majority class boils down to biasing the classifier in the same way as when you would use a cost matrix constructed from the inverse class imbalance
Oversampling (upsampling)
Undersampling (downsampling)
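caret ships simple helpers for both of these; a sketch, assuming a data frame train with a factor target TARGET (as in the worked example a few slides further):

library(caret)

predictors <- train[, setdiff(names(train), "TARGET")]

# Random oversampling of the minority class up to a 1:1 ratio
train.up <- upSample(x = predictors, y = train$TARGET, yname = "TARGET")
table(train.up$TARGET)

# Random undersampling of the majority class down to a 1:1 ratio
train.down <- downSample(x = predictors, y = train$TARGET, yname = "TARGET")
table(train.down$TARGET)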
Intelligent sampling
SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002)
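A sketch using the smotefamily package (one of several R implementations of SMOTE; it assumes purely numeric predictors and the train/TARGET data from before):

library(smotefamily)

# SMOTE synthesizes new minority instances by interpolating between a minority
# instance and one of its K nearest minority-class neighbours
X <- train[, setdiff(names(train), "TARGET")]   # numeric features only
sm <- SMOTE(X = X, target = train$TARGET, K = 5, dup_size = 0)  # 0: balance automatically
train.smote <- sm$data        # original plus synthetic rows; label column is "class"
table(train.smote$class)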
Sampling approaches
Note: combinations of over/undersampling are possible
You can also try oversampling the minority class above the 1:1 level (this would boil down to using even more extreme costs in the cost matrix)
Very closely related to the field of "cost-sensitive learning":
Setting misclassification costs (some implementations allow this as well)
Cost-sensitive logistic regression
Cost-sensitive decision trees (using modified entropy and information gain measures)
Cost-sensitive evaluation measures (e.g. Average Misclassification Cost)
Sampling approaches
Only on your training set! Test set remains untouched!
Basically, a way to indicate to the learner: both classes are equally important
On the test set, you can use AUC or the metric you are actually interested in
Note that the accuracy on the test set after up/down sampling will most likely be lower than what you got in the "just always predict the majority class" case
I.e. your model will now start to identify cases as being fraudulent... some of these will be false positives: the price to pay to get out the true positives
Remember the precision versus recall trade-off
Experimentation with the right amounts of over/undersampling is required
SMOTE and other intelligent sampling techniques work well, but they are not magic: you'll still need some positives...
Also, don't expect SMOTE to create "hidden, future, ..." cases of positive instances
Class imbalance occurs in many settings!
Sampling approaches
Some techniques also support instance weighting: not defined per cell in the confusion matrix but per instance
Indicate that some instances are more important to get right
A similar derivation is possible here: a rough approach consists of duplicating instance rows that are deemed more important
Again: this biases the training in the same way
Again: only in the training data (the fact that some instances are more important can then be evaluated with a corresponding evaluation scheme during testing)
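The duplication trick as a sketch (the importance column is hypothetical):

# Duplicate rows in proportion to an (assumed, integer) importance weight,
# biasing training towards getting the important instances right
weights <- ifelse(train$high_value == 1, 3, 1)   # hypothetical importance flag
train.weighted <- train[rep(seq_len(nrow(train)), times = weights), ]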
Sampling approaches
However, note that sampling biases the training set and the probability ranges your model outputs. This is fine if you're only interested in a ranking, but it distorts a calibrated view on the probabilities. In case this is important, you can unbias the probability output using (Saerens et al., 2002):

p_unbiased(Cᵢ∣x) = [ (p(Cᵢ) / pₛ(Cᵢ)) pₛ(Cᵢ∣x) ] / [ ∑ⱼ₌₁ᵐ (p(Cⱼ) / pₛ(Cⱼ)) pₛ(Cⱼ∣x) ]

With Cᵢ class i, pₛ(Cᵢ∣x) the biased probability (on the sampled data set), pₛ(Cᵢ) the prior probability (proportion) of class Cᵢ on the sampled training data set, and p(Cᵢ) the original prior (proportion) before sampling (e.g. 1% vs. 99%)
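A direct translation of this correction into R, as a sketch: p_s is an n × m matrix of biased posteriors, priors_s the class proportions of the sampled training set, priors the original proportions:

# Saerens et al. (2002): reweight each biased posterior by p(C_i) / p_s(C_i),
# then renormalize so the corrected posteriors sum to one per observation
unbias_probs <- function(p_s, priors_s, priors) {
  w <- priors / priors_s              # per-class reweighting factors
  adj <- sweep(p_s, 2, w, `*`)        # (p(C_i) / p_s(C_i)) * p_s(C_i | x)
  adj / rowSums(adj)
}

# E.g. posteriors from a model trained on a 50/50 sample of a 99%/1% problem:
# unbias_probs(p_s, priors_s = c(0.5, 0.5), priors = c(0.99, 0.01))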
Example
library(caret)
library(tidyverse)
library(magrittr)
library(ROCR)
library(PRROC)
library(ROSE)

data <- read.csv('data.csv')
table(data$TARGET)
#    0    1
# 4748  252

# Make the target a factor so caret treats this as classification
data$TARGET <- factor(data$TARGET, levels = c(0, 1), labels = c("no", "yes"))

train.index <- createDataPartition(data$TARGET, p = .7, list = FALSE)
train <- data[ train.index, ]
test  <- data[-train.index, ]

dtree <- train(TARGET ~ ., data = train, method = "rpart", tuneLength = 10)
predictions <- predict(dtree, test, type = 'prob')
Example
train.sampled <- ROSE(TARGET ~ ., data = train, p = 0.5)$data
table(train.sampled$TARGET)
#   no  yes
# 1754 1747

dtree.sampled <- train(TARGET ~ ., data = train.sampled, method = "rpart", tuneLength = 10)
predictions <- predict(dtree.sampled, test, type = 'prob')
Example
After rescaling:
(Back to) classification performance
Let's get back on track
We have seen in any case that accuracy is not the only metric we should focus on
Recall and precision concerns are much more important
They depend on the threshold, however
We have already seen a recall/precision curve
Other smart approaches?
ROC curve
Make a table with sensitivity and specificity for each possible cut-off
The receiver operating characteristic (ROC) curve plots sensitivity (tp rate) versus 1 − specificity (fp rate) for each possible cut-off
A perfect model has sensitivity of 1 and specificity of 1 (i.e. the upper left corner)
The ROC curve can be summarized by the area underneath (area under the (ROC) curve, AUC)
AUC represents the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative instance (Hanley and McNeil, 1982)
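With the ROCR package (already loaded in the example earlier), a sketch reusing the truth and score vectors from the threshold sketches:

library(ROCR)

# ROC curve and AUC from scores and true labels ("yes" = positive class)
pred.obj <- prediction(score, truth)
roc <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(roc); abline(0, 1, lty = 2)     # dashed diagonal = random model
performance(pred.obj, measure = "auc")@y.values[[1]]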
True Label   Prediction
no           0.11
no           0.20
yes          0.85
yes          0.84
yes          0.80
no           0.65
yes          0.44
no           0.10
yes          0.32
yes          0.87
yes          0.61
yes          0.60
yes          0.78
no           0.61
→ Try this in Excel
ROC curve
ROC curve
ROC curve
ROC curve
https://arxiv.org/pdf/1812.01388.pdf
ROC curve
The ROC curve can be summarized by the area underneath (area under the (ROC) curve, AUC)
A similar AUC for the precision-recall curve exists
But: visual inspection and understanding are required!
You might only be interested in a certain area of the curve
"Weighted" approaches exist, but are not commonly known about
Also see:
http://www.rduin.nl/presentations/ROC Tutorial Peter Flach/ROCtutorialPartI.pdf
https://stats.stackexchange.com/questions/225210/accuracy-vs-area-under-the-roc-curve/225221#225221
Lift
Assume a random model handing out random probabilities
Take the top n (e.g. the top 100)
See how many of them were indeed "yes", e.g. 10 / 100
Now do the same for your model, which gives e.g. 80 / 100
The lift of your model over random is 80 / 10 = 8
Lift of 1: random sorting
Depends on n (in general, getting more hits is more difficult in a shorter list) and on the a priori class distribution between "no" and "yes" instances
Can be done over distinct groups instead of cumulatively
Recall/precision at n: same concept, for top ranked n observations
Especially important if shortlists need to be delivered
E.g. common in the setting of recommender systems
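Cumulative lift at top n, as a sketch (again reusing score and truth):

# Hit rate in the n highest-scored cases versus the overall base rate
lift_at_n <- function(score, truth, n, positive = "yes") {
  top <- order(score, decreasing = TRUE)[seq_len(n)]
  mean(truth[top] == positive) / mean(truth == positive)
}

lift_at_n(score, truth, n = 5)   # > 1 means better than random sorting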
Lorenz curve
Same idea, but from the economics field
h-index
A coherent alternative to the area under the ROC curve (Hand, 2009)
The area under the ROC curve (AUC) is a very widely used measure of performance for classification and diagnostic rules
It has the appealing property of being objective, requiring no subjective input from the user
On the other hand, the AUC has disadvantages
For example, the AUC can give potentially misleading results if ROC curves cross
It is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers
This means that using the AUC is equivalent to using different metrics to evaluate different classification rules
A nice alternative, though lesser used
Regression performance
Hypothesis tests on the coefficients, with confidence intervals:

H₀: β₁ = 0, H_A+: β₁ > 0, H_A−: β₁ < 0

r²: coefficient of determination: the proportion of variation in y explained ("captured") by the regression model

r² = 1 − SSE / S_yy, with S_yy = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² and SSE = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
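These definitions in R, as a sketch (any vector of predictions yhat against true values y):

# Coefficient of determination, straight from the definition
r_squared <- function(y, yhat) {
  sse <- sum((y - yhat)^2)        # residual (error) sum of squares
  syy <- sum((y - mean(y))^2)     # total sum of squares
  1 - sse / syy
}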
Regression performance
Scatter plot between predicted and true y value
Calculate e.g. Pearson correlation
Regression performance
AIC (Akaike Information Criterion): AIC = 2k − 2 ln(L̂), with k the number of parameters and L̂ the maximized likelihood

A relative estimate (!) of the information lost when a given model is used to represent the process that generates the data
A trade-off between the goodness of fit and the complexity of the model

BIC (Bayesian Information Criterion), a.k.a. Schwarz criterion

Closely related to AIC

Adjusted r²: r²ₐ = 1 − (1 − r²) (n − 1) / (n − k)

A version of r-squared adjusted for the number of predictors in the model
Increases only if a new term improves the model more than would be expected by chance, decreases otherwise (r-squared would continue to increase even after dumping useless features in)
Most implementations provide this, even if they might still call it r²

Others: deviance information criterion, Hannan-Quinn information criterion, Jensen-Shannon divergence, Kullback-Leibler divergence, minimum message length, ...

Look at Mean Squared Error, Mean Absolute Deviation, Root Mean Squared Error, ...

MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
MAD = (1/n) ∑ᵢ₌₁ⁿ ∣yᵢ − ŷᵢ∣
RMSE = √MSE (the standard deviation of the errors for an unbiased model)

Note: cost-sensitive measures and tuning exist here as well (e.g. "BSZ tuning", Bansal, Sinha, and Zhao):

AMC = (1/n) ∑ᵢ₌₁ⁿ C(yᵢ − ŷᵢ)
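For a fitted (g)lm these are all one-liners in base R; a sketch on a built-in data set:

fit <- lm(mpg ~ wt + hp, data = mtcars)

AIC(fit)                        # Akaike Information Criterion
BIC(fit)                        # Bayesian (Schwarz) Information Criterion
summary(fit)$adj.r.squared      # adjusted r-squared
mean(residuals(fit)^2)          # MSE (in-sample)
mean(abs(residuals(fit)))       # MAD
sqrt(mean(residuals(fit)^2))    # RMSE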
Regression performance
Perform some basic validation checks
Check the residuals of the model
Check variables with extreme coefficients (especially when applying regularization)
Check the sign of the coefficients
Note that this applies for basically any model: don't just train and look at the AUC
Take a look at the top misclassified instances: would they be hard for you as well?
Take a look at variable importance, the position of features in the tree, splitting points
Interpretability is key (see later)
Regression performance
There's a difference between "predicting the future" and "extrapolating from training data"!
Use the appropriate technique
Also applies to all model types
Cross-validation and tuning
Cross-validation and tuning
Cross-validation and tuning
Decision trees with early stopping:
Cross-validation and tuning
General train-valid-test split:
Cross-validation and tuning
Cross-validation and tuning
Cross-validation and tuning
Cross-validation and tuning
Cross-validation and tuning
# Note that we are scaling the predictors
glmnet_model <- train(annual_pm ~ .,
                      data = dplyr::select(lur, -site_id),
                      preProcess = c("center", "scale"),
                      method = "glmnet",
                      trControl = tr)

arrange(glmnet_model$results, RMSE) %>% head
##   alpha      lambda     RMSE  Rsquared    RMSESD RsquaredSD
## 1  0.10 0.330925285 1.046882 0.8213086 0.3711204  0.1662474   <--
## 2  1.00 0.033092528 1.057797 0.8151413 0.3165820  0.1661203
## 3  0.55 0.033092528 1.058651 0.8152392 0.3179481  0.1677805
## 4  0.10 0.033092528 1.067397 0.8131885 0.3243109  0.1708488
## 5  1.00 0.003309253 1.073726 0.8113261 0.3224757  0.1711788
## 6  0.55 0.003309253 1.073969 0.8109472 0.3231762  0.1722758
Cross-validation and tuning
Cross-validation is a way to protect against overfitting and to ensure valid estimates by adding diversity in repeated runs
Prevent lucky hits
Many different types exist:
Repeated (nested) cross-validation
Repeated out-of-time validation
Leave-one-out cross-validation (an extreme form of cross-validation)
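In caret (used throughout these examples), repeated k-fold cross-validation is a trainControl away; a sketch reusing the train data frame from the example earlier:

library(caret)

tr <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
model <- train(TARGET ~ ., data = train, method = "rpart",
               tuneLength = 10, trControl = tr)
model$results   # resampled performance per candidate, as in the glmnet output above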
Revisiting the churn example
Marketing analytics: churn prediction
Three enormous challenges...
- 1. Need to make a distinction between a characteristic predictor for future churn and a symptom of already-occurring churn
E.g. a sudden peak in usage often occurs right before churn, because the customer has already decided to churn
Focus on early-warning predictors
- 2. Real-life churn data sets have a very skewed class distribution (e.g. about 1-5% churners)
Logistic regression and decision tree models cannot be appropriately estimated on such data
Use oversampling on the train (not test!) data
- 3. How to make it actionable?
Marketing analytics: churn prediction
In fact, you’re now predicting the past!
Some people like this approach: build a model and look at the false positives :(
Marketing analytics: churn prediction
Better
Marketing analytics: churn prediction
Better still
Marketing analytics: churn prediction
Better still
Marketing analytics: churn prediction
Or even (panel data analysis)
But be very careful when setting up your (cross-)validation
Marketing analytics: churn prediction
Common approach
Marketing analytics: churn prediction
Don't forget to apply upsampling on the minority class
Which AUC to expect?
It depends on the setting
The 0.7-0.9 range is common
> 0.9: be sceptical -- carefully check your variables, assumptions, approach, validation
Some additional notes on validation
What about multiclass?
The concept of the confusion matrix still applies
But: metrics are somewhat harder to calculate (multiple "positive" classes are possible here, so potentially multiple ROC curves can be constructed and inspected!)
Averaging techniques across the curves exist, e.g. https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
What about multiclass?
What if your technique only supports binary classification to begin with?
One simple approach is a transformation to binary:
One-vs.-all (one-vs.-rest):
Contrast every class against all other classes
For k classes, build k classifiers
Assign a new observation using the highest posterior probability

One-vs.-one:
Contrast every class against every (single) other class
Pairwise approach
For k classes, build k(k-1)/2 classifiers
Assign a new observation using the majority voting rule
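A bare-bones one-vs.-rest sketch with k binary logistic regressions on the built-in iris data (expect perfect-separation warnings on such a toy set):

# One-vs.-rest: one binary model per class, assign via the highest posterior
classes <- levels(iris$Species)
models <- lapply(classes, function(cl) {
  df <- iris[, 1:4]
  df$y <- as.integer(iris$Species == cl)   # 1 = "this class", 0 = "the rest"
  glm(y ~ ., data = df, family = binomial)
})

scores <- sapply(models, function(m) predict(m, iris[, 1:4], type = "response"))
pred <- classes[max.col(scores)]           # highest posterior probability wins
mean(pred == iris$Species)                 # (training) accuracy of the combined scheme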
One-vs.-all
One-vs.-one
What about multilabel?
Evaluation: specific definitions for precision, recall, Jaccard index, Hamming loss exist, i.e. adapted to incorporate the fact that an instance can have multiple labels
What if your technique does not support it?

Transform into binary classification ("binary relevance method"):
Independently train one binary classifier for each label (instance has label yes/no)
The combined model then predicts all labels for a sample for which the respective classifiers predict a positive result ("has label")
Not the same as one-vs.-one or one-vs.-all
Does not consider label relationships, but simple
Alternatives exist: e.g. classifier chaining

Transform into a multi-class problem:
Based on taking the powerset over the labels
E.g., if the possible labels are Dog, Cat, Duck, the label powerset representation of this problem is a multi-class classification problem with the classes a:[0 0 0], b:[1 0 0], c:[0 1 0], d:[0 0 1], e:[1 1 0], f:[1 0 1], g:[0 1 1], h:[1 1 1], where for example [1 0 1] denotes an example where labels Dog and Duck are present and label Cat is absent
Simple, but leads to an explosion of classes!
Better: ensemble methods or neural network based approaches
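Binary relevance as a sketch (X a feature data frame and Y a 0/1 matrix with one column per label; both hypothetical):

# Binary relevance: one independent binary classifier per label
fit_br <- function(X, Y) {
  lapply(colnames(Y), function(lbl) {
    glm(y ~ ., data = cbind(X, y = Y[, lbl]), family = binomial)
  })
}

# Predict every label whose classifier fires above the threshold
predict_br <- function(models, X, threshold = 0.5) {
  sapply(models, function(m) as.integer(predict(m, X, type = "response") >= threshold))
}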
Validation is hard
What if the final test set evaluation gives bad results? (Throw away the whole project? Hunt for a new data set?)
You should, but it happens
Be sure to know the risks
Should feature engineering and transformation be done on the whole data set? (“It’s so hard not to”)
- Definitely not: fit them on the training data only (Python packages are often more sensible in this regard)
Even when waiting to use the final test set, too much re-use of the same train/validation split leads to hidden overtraining (“I’ll just do a small parameter tuning”)
So do too many parameter-combination runs (over-usage of the same data)
Suddenly, the test set result will be disappointing
Some models try to avoid overfitting by themselves (see later: bootstrapping)
Also, if scores are too good to be true, they probably are (target variable “leakage”)
http://scikit-learn.org/stable/modules/calibration.html
http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression
Probability calibration
As seen above, some models can give you poor estimates of the class probabilities, and some do not even support probability prediction
Sampling the training set also biases the probability distribution
Logistic regression returns well calibrated predictions by default, as it directly optimizes log-loss
In contrast, the other methods return biased probabilities, with different biases per method
E.g. methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1, because variance in the underlying base models will bias predictions that should be near zero or one away from these values
Calibration methods exist to "fix" this
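The simplest such fix is Platt scaling: fit a logistic regression of the true outcomes on the raw scores, using a held-out calibration set, and pass future scores through it. A sketch (calib.score, calib.y and new.score are assumed vectors):

# Platt scaling: learn p(y = 1 | score) as a logistic curve over the raw scores
calibrator <- glm(calib.y ~ calib.score, family = binomial)

# Calibrated probabilities for new raw model scores
calibrated <- predict(calibrator,
                      newdata = data.frame(calib.score = new.score),
                      type = "response")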
Monitoring and maintenance
Monitoring
Validation doesn’t stop at deployment
Input data
Check distributions, categorical levels, missing values
System stability index

Output predictions

Hard to monitor unless true outcomes are tracked
But we can monitor the prediction distribution
https://www.dataminingapps.com/2016/10/what-is-a-system-stability-index-ssi-and-how-can-it-be-used-to-monitor-population-stability/
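The system stability index from the linked article compares a binned reference distribution (at training time) against the current one; a sketch (assumes current values stay within the reference range and no bin ends up empty):

# SSI / PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)
ssi <- function(expected, actual, bins = 10) {
  cuts <- quantile(expected, probs = seq(0, 1, length.out = bins + 1))
  e <- table(cut(expected, cuts, include.lowest = TRUE)) / length(expected)
  a <- table(cut(actual,   cuts, include.lowest = TRUE)) / length(actual)
  sum((a - e) * log(a / e))
}
# Common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate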
Monitoring
What to report, which performance metrics
“Does AUC matter?”
Excel, scorecard, traffic lights
API (REST)
Oftentimes the prediction probability is combined with another factor: risk, consequence, damage, value…
Monitoring
Monitoring your population at deployment…
The goal is to set up a host of warnings which initiate a retraining (maintenance) trigger
Monitoring
Assertive R Programming with assertr
https://cran.r-project.org/web/packages/assertr/vignettes/assertr.html
library(dplyr)
library(assertr)

# Halt the pipeline if any mpg value lies more than 2 standard deviations from the mean
mtcars %>%
  insist(within_n_sds(2), mpg) %>%
  group_by(cyl) %>%
  summarise(avg.mpg = mean(mpg))
Monitoring
Visibility and Monitoring for Machine Learning Models
There's a great paper that I highly recommend you read, by D. Sculley, who is a professor at Tufts and an engineer at Google. He says machine learning is the high interest credit card of technical debt, because machine learning is basically spaghetti code that you deploy on purpose. That's essentially what machine learning is. You're taking a bunch of data, generating a bunch of numbers and then putting it into production intentionally. And then trying to figure out, reverse engineer, how this thing actually works. There are a bunch of terrible downstream consequences to this. It's a risky thing to do. So you only want to do it when you absolutely have to.
http://blog.launchdarkly.com/visibility-and-monitoring-for-machine-learning-models/
What’s your ML test score? A rubric for production ML systems.
https://research.google.com/pubs/pub45742.html
Monitoring
https://research.google.com/pubs/pub45742.html
The road to data science maturity
Domino Data Labs https://www.dominodatalab.com/resources/data-science-maturity-model/
The road to data science maturity
Structured Processes:
When I get any request, I first check this library for existing work
When I select data, I must note the assumptions and limitations of my sample
Model validation requires three sign-offs: peer, manager, and business stakeholder
Datasets including certain demographic variables need compliance sign-off
Models have a pre-defined shelf life and variation tolerance which triggers reviews or re-development
https://www.dominodatalab.com/resources/data-science-maturity-model/
The road to data science maturity
Ten ways your data project is going to fail
http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/
- 1. Your data isn’t ready
Has the data been used before in a project? If not, add 6-12 months onto the schedule for data cleansing
- 2. Somebody heard “data is the new oil”
Data is not a commodity, it needs to be transformed into a product before it's valuable
- 3. Your data scientists are about to quit
- 4. You don’t have a data scientist leader
- 5. You shouldn’t have hired scientists
- 6. Your boss read a blog post about machine learning
- 7. Your models are too complex
Use an interpretable model first
- 8. Your results are not reproducible
- 9. R&D is alien to your company culture
- 10. Designing data products without seeing live data
The core concern is data!
Data science platforms as the solution
A lot of "data science platforms" entered the market in previous years
H2O Domino Databricks Dataiku Anaconda MLflow CometML ...
86
Data science platforms as the solution
https://www.comet.ml/
Data science platforms as the solution
https://www.dominodatalab.com/
Data science platforms as the solution
https://www.dataiku.com/
Data science platforms as the solution
https://mlflow.org/
Do it yourself?
https://github.com/spotify/luigi
https://github.com/thieman/dagobah
https://airflow.apache.org/
Data science platforms as the solution?
Most of these focus on the data scientist in the role of a model developer:
Versioning: for models (but also data?)
Collaboration
Scalable execution
Multiple language/environment support

But it should also be about:
Reproducibility (model, data, environment freezing)
Acyclic dependency graphs
Monitoring
Scheduling
Checks warning that retraining is in order
Models as data
Data science platforms as the solution?
Great to see ML "governance" work being done on the training-part of the pipeline. Seems like this provides Domino Data Labs-style dashboards but without the walled garden environment. I've yet to see similar great initiatives also tackling the deployment-part. E.g. something similar you can stick on top of your model's API (or scheduled batch predictive outputs), as well as incoming instances, to monitor usage patterns, population shifts through time, probability distributions, newly popping up missing values or categorical levels, logs, etc., in order to provide warning lights to indicate that a retraining might be in order, for instance. Google's "What's your ML test score" paper provides some great insights, but I hope someone will tackle this with a turnkey solution as well.
gidim: Thanks! We indeed solve a similar pain point as Domino, but unlike them we allow you to train your models on your own infra/laptop. As for monitoring production models, that's something we're also working on. It was important to get the training part out first so we can measure those distributions changing.
Closing with two brain teasers
Brain teaser
You want to predict that a machine will fail in the near future
Decision tree result: 95% accuracy
You have 5% positives in your data set; failure doesn't happen often
Your tree looks like this:

Target = NoFail

Are you happy?
“But I used a test set?”
Brain teaser
You fix the previous issue... and train again
Somewhere in your decision tree, you spot:

PurchaseYear < 2015
⬋ (yes)           ⬊ (no)
Target = Fail      ...

Are you happy?