Advanced Analytics in Business [D0S07a] Big Data Platforms & - - PowerPoint PPT Presentation
Advanced Analytics in Business [D0S07a] / Big Data Platforms & Technologies [D0S06a]: Ensemble Modeling: Bagging and Boosting; Opening the Black Box (Part 1)
Overview
The basic motivation Combination rules and voting systems Bagging Boosting Opening the black box (part 1)
2
The Basic Motivation
3
Ensemble modeling: the basic motivation
Are two models better than one?
Intuitively, this makes sense: you might have two models that are each good at predicting a certain (different) subsegment of your data set, so combining them seems like a good way to increase performance. We'll also see that ensembles make models more robust to overfitting and more robust to noise. "Wisdom of (simulated) crowds": a combination of models is called an "ensemble".
https://towardsdatascience.com/the-unexpected-lesson-within-a-jelly-bean-jar-1b6de9c40cca
4
Can we have it all?
Overfitting:
Model is too specific, works great on training data but not on a new data set E.g.: a very deep decision tree
5
Can we have it all?
We have seen early stopping and pruning
Using a strong validation setup, too. But in the end: an accuracy level we might not be happy with
6
Can we have it all?
Also consider: what if we could combine multiple linear classifiers?
7
Combination Rules and Voting Systems
8
Combination rules
Let’s say we’ve created two models. How do we combine them?

True label | Model 1 (threshold: 0.54) | Model 2 (threshold: 0.50) | Ensemble?
Yes        | 0.80 (yes)                | 0.70 (yes)                | ?
Yes        | 0.78 (yes)                | 0.50 (yes)                | ?
Yes        | 0.54 (yes)                | 0.50 (yes)                | ?
No         | 0.57 (yes)                | 0.30 (no)                 | ?
No         | 0.30 (no)                 | 0.70 (yes)                | ?
No         | 0.22 (no)                 | 0.40 (no)                 | ?
9
Combination rules
Algebraic combination: determine a new, optimal cutoff!

True label | Model 1 (0.54) | Model 2 (0.50) | Min (0.50) | Max (0.78) | Mean (0.52)
Yes        | 0.80 (yes)     | 0.70 (yes)     | 0.70 (yes) | 0.80 (yes) | 0.75 (yes)
Yes        | 0.78 (yes)     | 0.50 (yes)     | 0.50 (yes) | 0.78 (yes) | 0.64 (yes)
Yes        | 0.54 (yes)     | 0.50 (yes)     | 0.50 (yes) | 0.54 (no)  | 0.52 (yes)
No         | 0.57 (yes)     | 0.30 (no)      | 0.30 (no)  | 0.57 (no)  | 0.44 (no)
No         | 0.30 (no)      | 0.70 (yes)     | 0.30 (no)  | 0.70 (no)  | 0.50 (no)
No         | 0.22 (no)      | 0.40 (no)      | 0.22 (no)  | 0.40 (no)  | 0.31 (no)

As always, the mean is pretty stable, especially when combining well-calibrated classifiers. It breaks down with uncalibrated classifiers when adding many models together. Determining the optimal cutoff is a learning step in itself (meta-learning)
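A minimal sketch of the algebraic combination rules, using the probabilities from the table above (any cutoff in the interval (0.50, 0.52] separates the classes perfectly for the mean rule, so 0.51 is used below to avoid floating-point edge effects):

```python
import numpy as np

# Predicted probabilities of the two models, rows matching the table above
p1 = np.array([0.80, 0.78, 0.54, 0.57, 0.30, 0.22])  # model 1
p2 = np.array([0.70, 0.50, 0.50, 0.30, 0.70, 0.40])  # model 2
y = np.array([1, 1, 1, 0, 0, 0])                     # true labels (yes = 1)

# Algebraic combination rules
p_min = np.minimum(p1, p2)
p_max = np.maximum(p1, p2)
p_mean = (p1 + p2) / 2

# The mean rule with a new cutoff classifies all six instances correctly
pred_mean = (p_mean >= 0.51).astype(int)
accuracy = (pred_mean == y).mean()
```

Note that neither model alone gets all six instances right at its own threshold, but the averaged probabilities with a re-tuned cutoff do.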
10
Voting
Useful when combining models: majority voting. Less sensitive to underlying probability distributions; no need for calibration or determination of an optimal new cutoff.

Example with six models voting on one instance: "yes" wins (4 to 2).

What about weighted voting? Say model 4 gets 5 votes and the others 1 each: "no" now wins (5 + 1 to 4).

We could also go for a linear combination of the probabilities. Though again: how to determine the weights? A learning step in itself
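The voting example from this slide can be sketched directly (the vote pattern below is an assumption chosen to reproduce the slide's tallies):

```python
import numpy as np

# Votes of six models for one instance: 1 = "yes", 0 = "no"
votes = np.array([1, 1, 0, 0, 1, 1])
# Plain majority voting: "yes" wins 4 to 2
majority = int(votes.sum() > len(votes) / 2)

# Weighted voting: model 4 (index 3) gets 5 votes, the others 1 each
weights = np.array([1, 1, 1, 5, 1, 1])
yes_weight = weights[votes == 1].sum()  # 4
no_weight = weights[votes == 0].sum()   # 5 + 1 = 6
weighted = int(yes_weight > no_weight)  # "no" now wins
```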
11
Mixture of experts
Jordan and Jacobs’ mixture of experts (Jacobs, 1991) generates several “experts” (classifiers) whose outputs are combined through a linear rule
The weights of this combination are determined by a “gating network”, typically trained using the expectation maximization (EM) algorithm But: loss of interpretability, additional production strain!
12
Stacking
Wolpert’s (Wolpert, 1992) stacked generalization (or stacking):
An ensemble of Tier 1 classifiers is first trained on a subset of the training data. Outputs of these classifiers are then used to train a Tier 2 classifier (meta-classifier). The underlying idea is to learn whether the training data have been properly learned. For example, if a particular classifier incorrectly learned a certain region of the feature space, the Tier 2 classifier may be able to learn this behavior and correct for it. But: loss of interpretability, additional production strain! http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
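A minimal sketch of stacking with scikit-learn's StackingClassifier, which trains the Tier 2 meta-classifier on cross-validated Tier 1 outputs (the data set here is synthetic, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Tier 1 classifiers; their cross-validated outputs feed the Tier 2 meta-classifier
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),  # Tier 2 learns from Tier 1 outputs
    cv=5,
)
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)
```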
13
Smoothing
http://www.overkillanalytics.net/more-is-always-better-the-power-of-simple-ensembles/

Smoothing blends two models with a mixing weight λ: prediction = λ × model₁ + (1 − λ) × model₂
14
Bagging
15
Bagging
How about techniques with a “built-in” ensemble system? Bagging (bootstrap aggregating) is one of the earliest, most intuitive and perhaps the simplest ensemble based algorithms, with a surprisingly good performance (Breiman, 1996)
The main idea is to add diversity to the classifiers, obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn, with replacement, from the entire training dataset. Each training data subset is used to train a different classifier of the same type. The individual classifiers are then combined by taking a simple majority vote of their decisions.
Since the training datasets may overlap substantially, additional measures can be used to increase diversity, such as
Using a subset of the training data for training each classifier Using a subset of features Using unstable classifiers Other ideas
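The diversity measures above map directly onto scikit-learn's BaggingClassifier parameters (synthetic data assumed for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Each base tree sees a bootstrap replica of the rows; max_features additionally
# subsamples the columns to increase diversity between the base classifiers
bag = BaggingClassifier(
    DecisionTreeClassifier(),  # an unstable base learner
    n_estimators=50,
    max_samples=1.0,    # bootstrap size: 100% of the data, drawn with replacement
    max_features=0.7,   # feature subsampling for extra diversity
    bootstrap=True,
    random_state=42,
)
bag.fit(X, y)
pred = bag.predict(X[:5])  # majority vote over the 50 trees
```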
16
Bagging
17
Out-of-bag (OOB) validation
When using bagging, one can already estimate the generalization capabilities of the ensemble model using the training data: out-of-bag (OOB) validation
When validating an instance i, only consider those models which did not have i in their bootstrap sample A good initial validation check, though an independent test set is still required!
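In scikit-learn the OOB estimate comes essentially for free; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# oob_score=True scores each training instance only with the trees that
# did not have it in their bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
oob = rf.oob_score_  # initial estimate of generalization performance
```

As the slide notes, treat this as a sanity check: an independent test set is still required.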
18
Random forests: the quintessential bagging technique
Random forests: bagging-based ensemble learning method for classification and regression
Construct a multitude of decision trees at training time and output the class that is the majority vote of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests apply bagging, so one part of the randomness comes from bootstrapping: each decision tree sees a random bootstrap of the training data. In addition, random forests use an extra piece of randomness: to select the candidate attributes to split on, every split in every tree only considers a random subset of features (sampled at every split!)
Random decision forests correct for decision trees’ habit of overfitting to their training set
No more pruning needed The algorithm was developed by Leo Breiman and Adele Cutler Great performance in most cases!
19
Random forests: the quintessential bagging technique
How many trees?
No risk of overfit, so use plenty
Depth of tree?
No pruning necessary, strictly speaking But one can still decide to apply some pruning or early stopping mechanisms (many techniques will do so)
Size of bootstrap
Can be 100% (this doesn’t mean selecting all instances, as we’re drawing with replacement!) Lower values possible given enough data points Key is to build enough trees
M: size of subset of features?
1, 2, all (i.e. "default bagging")? Heuristic (with N the number of features): max(⌊N/3⌋, 1) for regression, ⌊√N⌋ for classification. Alternative: find through cross-validation!
Thinking points: how to assign a probability? How to set the thresholds of the base classifiers (do we need to)?
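The classification heuristic for the feature-subset size M can be passed straight to scikit-learn (synthetic data, 16 features, so M = ⌊√16⌋ = 4):

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=42)
N = X.shape[1]

# Heuristic for classification: consider floor(sqrt(N)) features at every split
rf = RandomForestClassifier(
    n_estimators=100, max_features=int(math.sqrt(N)), random_state=42
)
rf.fit(X, y)
m_used = rf.estimators_[0].max_features  # each tree samples 4 of the 16 features per split
```

scikit-learn also accepts `max_features="sqrt"` directly, which applies the same rule.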
20
Random forests: the quintessential bagging technique
Random forests are easy to use and don't require much configuration or preprocessing. Because you are building many trees, they include lots of interaction effects "for free". Good at avoiding overfitting (by design). However... how to explain 100 trees vs. 1? Many fun extensions exist, e.g. Extremely Randomized Trees: also consider a random subset of the possible splitting points, instead of only a random subset of features!
See also Maximizing Tree Diversity by Building Complete-Random Decision Trees (Liu et al., 2005). There is even such a thing as completely randomized trees (and we'll see an application of those soon)
21
Boosting
22
Boosting
Similar to bagging, boosting also creates an ensemble of classifiers which are then combined by majority voting. However, it does not use bootstrapping this time around. Instead, classifiers are added sequentially, where each new classifier aims at correcting the mistakes of the ensemble thus far. In short: steering the learning towards fixing the mistakes made in a previous step. The main idea is cooperation between classifiers, rather than adding diversity
23
Boosting
24
AdaBoost: a (not so quintessential any more) boosting technique
Iterative approach: AdaBoost trains an ensemble of weak learners over a number of rounds T.
At first, every instance has the same weight (D₁ = 1/N), so AdaBoost trains a normal classifier. Next, samples that were misclassified by the ensemble so far are given a heavier weight. The learner is also given a weight (αₜ) depending on its accuracy and incorporated into the ensemble. AdaBoost then constructs a new learner, now incorporating the weights so far
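A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners (synthetic data assumed); the fitted `estimator_weights_` are the αₜ values from the slide:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# T = 50 rounds of reweighting; each weak learner (a depth-1 stump) receives
# its own weight alpha_t based on its weighted error
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42
)
ada.fit(X, y)
alphas = ada.estimator_weights_  # the alpha_t learner weights, one per round
```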
25
AdaBoost
Friedman et al. showed that AdaBoost can be implemented as an additive logistic regression model.
Assuming logistic regression as the base (weak) learner, AdaBoost optimizes an exponential loss function. A nice mathematical result, showing that AdaBoost is closely linked to a particular loss function
26
Gradient Boosting Machines
Friedman et al.: “what if we would want to optimize a different loss function?”
Sadly doesn’t work with standard additive logistic regression setup or AdaBoost So take a different view: instead of weighting instances (and classifiers in the ensemble) in every cycle, let every sequential classifier predict on the residuals on the ensemble so far
Learning to predict the errors. This boils down to the same thing: predicting the errors and then adjusting accordingly. But it leads to a nicer mathematical formulation that allows optimizing for any loss function based on its gradients. Weak decision trees are used as the base learner.
Now boosting can be applied on a variety of settings:
MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
LogLoss (binary classification) = −(1/n) ∑ᵢ₌₁ⁿ [yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ)]
MLogLoss (multiclass classification) = −(1/n) ∑ᵢ₌₁ⁿ ∑ⱼ₌₁ᵐ yᵢⱼ log(pᵢⱼ)
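The "predict the residuals of the ensemble so far" idea can be implemented by hand in a few lines for squared loss, where the residual is exactly the negative gradient (synthetic 1-D regression data assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# Gradient boosting for squared loss: each weak tree fits the residuals
# (the negative gradient of the loss) of the ensemble so far
pred = np.full_like(y, y.mean())  # start from the mean prediction
learning_rate, trees = 0.1, []
for _ in range(100):
    residuals = y - pred                       # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2)  # a weak base learner
    tree.fit(X, residuals)
    pred += learning_rate * tree.predict(X)    # nudge the ensemble towards the errors
    trees.append(tree)

mse = np.mean((y - pred) ** 2)  # training error shrinks towards the noise level
```

Swapping in another loss only changes the `residuals` line to that loss's gradient, which is exactly what makes the GBM formulation so general.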
27
Extreme Gradient Boosting
Expansion on GBM idea by improving the loss optimization method (mathematical improvement that speeds up training)
First, it uses second partial derivatives of the loss function, which provide more information about the direction of the gradient and how to get to the minimum of the loss function (gradient and Hessian needed)
As implemented by xgboost , lightgbm and catboost packages
Very powerful techniques: wins the majority of Kaggle competitions that deal with structured data!
But:
How about the risk of overfitting? Explainability? We still have a model with 100 trees…
28
Extreme Gradient Boosting
Expansion on GBM idea by improving the loss optimization method (mathematical improvement that speeds up training)
Second, it adds regularization (L1 or L2), which improves model generalization
Again, defined as a “control” parameter on the complexity of the model
Here defined over the depth of the tree: the objective function combines the loss and the simplicity of the trees
29
Comparing Bagging and Boosting
30
Comparing bagging and boosting
Bagging can be done in parallel (each sub-model is built independently) Boosting is sequential (try to add new sub-models that do well where previous model lacks)
But the trees themselves can be parallelized somewhat
Bagging decreases variance, not necessarily bias
Suitable to combine high variance low bias models (complex models) Example algorithm: random forest Reducing the overfit of ensembles of complex models (strength of diversity) Hence: deep decision trees
Boosting decreases bias, not necessarily variance
Suitable to combine low variance high bias models (simple models) Example algorithms: AdaBoost, gradient boosting machines (GBM), xgboost Reducing the error of ensembles of simple models (strength of cooperation) Hence: logistic regression, or non-deep, “weak” decision stumps
31
Comparing bagging and boosting
Bagging decreases variance; boosting decreases bias
(Model spectrum: high bias at one end, high variance at the other)
32
Comparing bagging and boosting
This means that you need to be careful of overfitting when using boosting techniques
There is such a thing as overboosting. Theory and practice disagree on this point. xgboost and others implement regularization and other strategies to protect against it; these require significant hyperparameter tuning, however. In most practical use cases, the same accuracy can be obtained using random forests without the risk of overfitting
33
Wrap-up: the strength of ensembles
One part working-together, one part randomness 34
Opening the Black Box
35
Opening the black box
Since we’ve now stepped into the domain of black-box models, we need to look at appropriate techniques to gain understanding of the models we construct! Model interpretability techniques:
Some of these are “native” to the concept of decision trees used in our ensembles Others can be used for any type of model (“model agnostic” techniques)
Work at different levels:
Help to explain the model (make the model simpler) Help to explain a feature (explain which ones are important and how they drive the outcome) Help to explain a particular instance-level prediction
How to explain 100 trees versus 1?
36
Feature importance
37
Feature importance
Which features are important according to the model? Different techniques exist:
Based on mean decrease in impurity in tree based ensembles
Danger: can be biased: http://explained.ai/rf-importance/index.html Model-dependent (gradient boosted models, random forests)
Based on position in trees
Model-dependent as well
Drop feature importance
Robust, but requires multiple retraining runs (slow) Model agnostic
Permutation importance
The best we have at the moment Model agnostic
38
Feature importance
Randomly permute a feature, have the model predict again, and assess the drop in AUC, accuracy, ... Note: permuting a single feature breaks up interaction effects.
But it is possible to investigate multiple features at once to zoom in on interaction effects, by permuting two or more features together.
Important: correlated features will "share" importance, which might lead to misinterpretation!
So it remains important to check correlated features! See the rfpimp Python package for an implementation which handles this built-in
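The mechanism is simple enough to hand-roll (scikit-learn also ships a ready-made `sklearn.inspection.permutation_importance`); synthetic data assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

rng = np.random.default_rng(0)
baseline = accuracy_score(y, rf.predict(X))
importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature/target link
    # Importance = drop in score caused by destroying this feature
    importances.append(baseline - accuracy_score(y, rf.predict(X_perm)))
```

Permuting two or more columns together in the same loop is the extension mentioned above for zooming in on interaction effects.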
Feature importance rankings can be used for feature selection
First, build a large random forest using all features. Then retrain using only the top-n features, for decreasing values of n. In R, implemented as the Boruta package; but recall the remarks above
Extensions exist which add significance values, e.g. https://academic.oup.com/bioinformatics/article/26/10/1340/193348
39
Feature importance
https://explained.ai/rf-importance/index.html We’ve known for years that this common mechanism for computing feature importance is biased; i.e. it tends to inflate the importance of continuous or high-cardinality categorical variables. For example, in 2007 Strobl et al pointed out in Bias in random forest variable importance measures: “the variable importance measures of Breiman’s original Random Forest method … are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories.” That’s unfortunate because not having to normalize or otherwise futz with predictor variables for Random Forests is very convenient.
40
Feature importance
Example: data generated as y ~ x0 * x1 + x2 + noise

print(gbr.feature_importances_)  # gradient boosting rankings (x0, x1, x2): [0.17174317 0.19054922 0.6377076 ]
print(rfr.feature_importances_)  # random forest rankings (x0, x1, x2):     [0.19065131 0.18586131 0.62348737]
Check your documentation, many implementations (e.g. scikit-learn) implement Gini Impurity based importance
And also: take into account correlation effects!
Train or test set? Both have their use!
Test set: in line with evaluation concerns: e.g. the feature importance based on training data might make us mistakenly believe that features are important for the predictions, when in reality the model was just overfitting and the features were not important at all Train set: learn and understand how the model has actually relied on each feature
41
Partial dependence plots
Each point on the partial dependence plot is the average vote percentage (or average continuous prediction) across all observations, keeping the one feature under observation as-is and using the median/mode for the others
42
Partial dependence plots
Impute with the median and mode over all instances, keeping the feature under observation as-is.
It is also possible to define a grid-based range over the feature under observation, between its observed minimum and maximum.
Have the trained model predict on the new dataset and plot the results over the values of the feature under observation
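The grid-based procedure can be sketched in a few lines; this variant averages predictions over all instances for each grid value (the classic partial dependence computation) rather than imputing with the median/mode, on synthetic regression data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

j = 0  # feature under observation
grid = np.linspace(X[:, j].min(), X[:, j].max(), 20)  # observed min to max

pdp = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, j] = v  # fix the feature under observation at grid value v
    pdp.append(model.predict(X_mod).mean())  # average prediction over all instances
```

Plotting `pdp` against `grid` gives the partial dependence curve for feature j.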
43
Partial dependence plots
Note that absence of evidence does not mean evidence of absence
E.g. interaction effects might not show in partial dependence plot given the ceteris paribus approach (keeping everything fixed except for the variable under observation)
y ~ x0 * x1 + x2 + noise
44
Partial dependence plots
On the other hand, this is also a benefit
You might infer correlations from this univariate investigation which might not be true given the presence of interaction effects! E.g. “sales drop for customers between 30 and 40 years old” (data) vs. “sales stay constant” (partial dependence plot) indicates presence of interaction effects: age alone is not a sufficient explanation! Need to inspect both!
Why not look at the data itself to assess the impact on an outcome (i.e. by constructing bins on a feature under observation and looking at the percentage of yes vs. no cases per bin)?
45
Partial dependence plots
Here too, like permutation importance, it is possible to keep more than one feature as-is whilst replacing the others with their median and mode Harder to visualize, however (e.g. contour plots, 3d plots… infeasible for higher dimensions)
See the forestfloor R package: http://forestfloor.dk/
46
Individual conditional expectation (ICE) plots
Similar idea to partial dependence plots
The key idea is that we do not replace features with their median and mode, but keep every feature as-is, and create new instances based on the values of the feature under observation. Again, it is also possible to define a grid-based range over the feature under observation between its observed minimum and maximum
47
Individual conditional expectation (ICE) plots
Note that every original instance now leads to multiple rows in the modified data set Again, we let the model predict over all these instances For each distinct value for the feature under observation, we now have multiple predictions We can hence plot multiple lines Finally, the lines are commonly centered An average line can also be plotted (yellow line in plots below)
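The steps above can be sketched directly: one row of the `ice` matrix per original instance, one column per grid value (synthetic data assumed):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

j = 0  # feature under observation
grid = np.linspace(X[:, j].min(), X[:, j].max(), 10)

# One ICE line per original instance: all of its features kept as-is,
# except the feature under observation, which sweeps over the grid
ice = np.empty((X.shape[0], len(grid)))
for k, v in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, j] = v
    ice[:, k] = model.predict(X_mod)

ice_centered = ice - ice[:, [0]]  # lines are commonly centered at the first grid value
pdp_line = ice.mean(axis=0)       # the average line over all ICE curves
```

Plotting each row of `ice_centered` plus `pdp_line` reproduces the plots described on the slide.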
48
Tree inspection
"Feed" an instance through every tree in the forest and tally which variables were used most often.
Do this either only for the trees agreeing with the majority-vote outcome, or for all of them. One can also keep track of the splitting points per variable
https://medium.com/airbnb-engineering/unboxing-the-random-forest-classifier-the-threshold-distributions-22ea2bb58ea6
49
Decision path gathering
Decision paths: for a forest, the prediction is simply the average of the bias terms plus the average contribution of each feature
http://blog.datadive.net/interpreting-random-forests/
50
LIME
Local interpretable model-agnostic explanations
“Local surrogate model” Works on the instance level
https://github.com/marcotcr/lime
51
LIME
52
LIME
53
LIME
A simple model is trained over this perturbed data set, e.g. a Lasso regression.
It is a regression, as the target is now the predicted probability of the black-box model: a continuous value
This provides us with a simple, local decision boundary which can be easily inspected:
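The lime package automates this; as a hand-rolled sketch of the same idea (perturbation, kernel weighting, and a simple weighted Lasso), with the noise scale and kernel width as illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
black_box = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

x0 = X[0]  # instance under study
rng = np.random.default_rng(0)
Z = x0 + rng.normal(0, 0.5, size=(1000, X.shape[1]))  # perturbed neighborhood
p = black_box.predict_proba(Z)[:, 1]  # regression target: black-box probabilities

# Weight perturbed samples by an exponential kernel on their distance to x0
d = np.linalg.norm(Z - x0, axis=1)
w = np.exp(-(d ** 2) / 0.75)

surrogate = Lasso(alpha=0.01).fit(Z, p, sample_weight=w)
local_effects = surrogate.coef_  # the easily inspected local explanation
```

The signs and magnitudes of `local_effects` describe the black box's behavior around this one instance only.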
54
LIME
Decision boundary of “explanatory model” approximates decision boundary of black box model around the instance under study
55
LIME
LIME also works on non-tabular data; only the way the perturbation is performed changes. Text: new texts are created by randomly removing words from the original text. Images: perturbing individual pixels does not make a lot of sense; instead, groups of pixels in the image are perturbed at once by "blanking" them (removing them from the image).
These groups are called "superpixels": interconnected pixels with similar coloring. They can be found using e.g. k-means.
Ribeiro et al., 2016
56
LIME
Defining neighborhood around instance to define weights is difficult
Distance measure or bandwidth of exponential smoothing kernel can impact results One can also simply select k nearest neighbors around instance under study (but need to decide upon appropriate k)
Choosing the simple model is somewhat arbitrary
E.g. tuning of regularization parameter of LASSO regression can impact results
Main advantage is that LIME is easy to understand and works on tabular data, text and images
57
Global surrogate
Simply train an interpretable model to use for explaining, which hopefully stays pretty close to the original model. However:
How close is close enough? It could happen that the interpretable model is very close for one subset of the dataset but widely divergent for another: the interpretation from the simple model would not be equally good for all data points
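A minimal global-surrogate sketch: an interpretable tree is fit on the black box's predictions rather than the true labels, and its fidelity to the black box is then measured (synthetic data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
black_box = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Train an interpretable tree on the *black box's predictions*, not the labels
y_bb = black_box.predict(X)
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, y_bb)

# Fidelity: how often does the surrogate agree with the black box?
fidelity = accuracy_score(y_bb, surrogate.predict(X))
```

Checking fidelity per subgroup (not just overall) addresses the "close for one subset, divergent for another" caveat above.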
Also remember the basics regarding model inspection (look at your confusion matrix, look at the "most confused" instances)
58
Shapley values
Combines the best of the above: importance, partial dependence, and instance-level Originated in game theory Represent the fair payout for each player in a cooperative game Can be used to measure the contribution of a feature to a model
59
Shapley values
We want to play a game. What is contribution of each player to the team? We are interested in the marginal contribution of each player
60
Shapley values
Suppose the (hypothetical) value of the team with only player A is v({A}) = 6, where v is called the characteristic function.
Next, we add player B: v({A, B}) = 10, so B has a marginal contribution of 4.
Next, we add player C: v({A, B, C}) = 12, so C has a marginal contribution of 2.
61
Shapley values
Adding C before B? v({A}) = 6, v({A, C}) = 10, v({A, C, B}) = 12: now C has a marginal contribution of 4, and B of 2.
To be fair to C in terms of contribution payout, we need to average each player's contribution over all possible orderings of the team.
A contribution does not depend on the order of the players that were added before: v({A, B, C}) − v({A, B}) = v({B, A, C}) − v({B, A}). A contribution is also independent from how the remaining players are added afterwards.
62
Shapley values
The Shapley value of a player j is now:

φⱼ = ∑_{S ⊆ {x₁, …, xₚ} \ {xⱼ}} [ |S|! (p − |S| − 1)! / p! ] (v(S ∪ {xⱼ}) − v(S))

Now how do we bring this to machine learning? The players are the features with their values; the game that is played is the prediction of an instance x = {x₁, …, xₚ} using a model f̂. The value function is the prediction of the model using only the included feature values, minus the average prediction:

v(x_S) = ∫ f̂(x₁, …, xₚ) dP_{x∉S} − E_X(f̂(X))
63
Shapley values
Problem: there are lots of possible subsets of features. Also: predicting with only the included features means retraining, or imputing with the median/mode?
Solution: Monte Carlo sampling with a permutation-based approximation.
To calculate the Shapley value for a feature j and an instance x, we sample M times; in each iteration m we:
Draw a random instance z from the data. Construct two new instances: one with the value of x for feature j (xᵐ₊ⱼ), and one with the value of z (xᵐ₋ⱼ). The other features are randomly chosen (permuted) across x and z. We then let the model predict for both.

φ̂ⱼ = (1/M) ∑_{m=1}^{M} (f̂(xᵐ₊ⱼ) − f̂(xᵐ₋ⱼ))
64
Shapley values
(Figure: with the feature/instance under observation fixed, we permute over {R, Debt} versus {age, M, income})
65
Shapley values
Start from the default (average) prediction and assess the contribution of each variable towards the outcome.
A Shapley value represents the contribution of a feature to the given output, not by how much that output would change when removing the feature.
This is the most commonly used instance-level explainability technique in industry.
Example: 0.1 (br) + 0.4 (age=65) − 0.3 (sex=F) + 0.1 (bp=180) + 0.1 (bmi=40) = 0.4
66
Shapley values
Explaining a single instance 67
Shapley values
Features, feature values, impact on model output, across all instances 68
Shapley values
Singling out one or two features and plotting their Shapley values across instances: comparable to a partial dependence plot
Bring in second feature to look at interaction effects
69
Closing
More reading and packages:
Fantastic book on the topic: https://christophm.github.io/interpretable-ml-book/
Great overview: https://github.com/jphall663/awesome-machine-learning-interpretability
Forest Floor (http://forestfloor.dk/) for higher-dimensional partial dependence plots in R
The pdp R package: https://cran.r-project.org/web/packages/pdp/pdp.pdf
The iml R package: https://cran.r-project.org/web/packages/iml/index.html
Descriptive mAchine Learning EXplanations (DALEX) R package: https://github.com/pbiecek/DALEX
eli5 for Python: https://eli5.readthedocs.io/en/latest/index.html
Skater for Python: https://github.com/datascienceinc/Skater
scikit-learn has Gini-reduction based importance (not nice), but permutation importance can be done manually:
sklearn.model_selection.permutation_test_score
Or with https://github.com/parrt/random-forest-importances (rfpimp package: recommended)
Or with https://github.com/ralphhaygood/sklearn-gbmi (sklearn-gbmi)
pdpbox for Python: https://github.com/SauceCat/PDPbox (recommended)
vip for R: https://koalaverse.github.io/vip/index.html
Shapley values: https://github.com/slundberg/shap
https://medium.com/@Zelros/a-brief-history-of-machine-learning-models-explainability-f1c3301be9dc
http://blog.macuyiko.com/post/2019/discovering-interaction-effects-in-ensemble-models.html
http://explained.ai/rf-importance/index.html