Lecture 4: Rule-based classification and regression
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
1st April 2019
Amendment: Bias-Variance Tradeoff
Bias-Variance Decomposition
$$
E = \mathbb{E}_{p(\mathbf{X},\mathbf{x},y)}\!\left[(y - \hat{f}(\mathbf{x}))^2\right]
= \underbrace{\sigma^2}_{\text{Irreducible error}}
+ \underbrace{\mathbb{E}_{p(\mathbf{x})}\!\left[\left(f(\mathbf{x}) - \mathbb{E}_{p(\mathbf{X})}[\hat{f}(\mathbf{x})]\right)^2\right]}_{\text{Bias}^2 \text{ averaged over } \mathbf{x}}
+ \underbrace{\mathbb{E}_{p(\mathbf{x})}\!\left[\mathrm{Var}_{p(\mathbf{X})}[\hat{f}(\mathbf{x})]\right]}_{\text{Variance of } \hat{f} \text{ averaged over } \mathbf{x}}
$$

[Figure: total error, squared bias, variance and irreducible error as functions of model complexity; underfitting on the left, overfitting on the right.]
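As an illustration (not from the slides), the decomposition can be estimated by simulation: repeatedly draw training sets, fit models of increasing complexity, and average over the test points. The setup below (polynomial fits to a sine curve) is an assumed example.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true regression function
sigma = 0.3                              # sd of the irreducible noise
x_test = np.linspace(0, 1, 50)           # points at which f-hat is evaluated
n, n_sims = 30, 200                      # training size, number of simulated training sets

for degree in [1, 3, 9]:                 # model complexity = polynomial degree
    preds = np.empty((n_sims, x_test.size))
    for s in range(n_sims):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coef = np.polyfit(x, y, degree)          # fit f-hat on this training set
        preds[s] = np.polyval(coef, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)  # Bias^2 averaged over x
    var = np.mean(preds.var(axis=0))                         # Variance averaged over x
    print(f"degree {degree}: bias^2 = {bias2:.3f}, "
          f"variance = {var:.3f}, irreducible = {sigma**2:.3f}")
```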
Observations
▶ Irreducible error cannot be changed
▶ Bias and variance of $\hat{f}$ are sample-size dependent
▶ For a consistent estimator $\hat{f}$: $\mathbb{E}_{p(\mathbf{X})}[\hat{f}(\mathbf{x})] \to f(\mathbf{x})$ for increasing sample size
▶ In many cases: $\mathrm{Var}_{p(\mathbf{X})}(\hat{f}(\mathbf{x})) \to 0$ for increasing sample size
▶ Caution: Theoretical guarantees often depend on the number of variables $p$ staying fixed while $n$ increases. This might not be fulfilled in reality.
Amendment: Leave-One-Out Cross-validation (LOOCV)
Cross-validation with $K = n$ is called leave-one-out cross-validation.
▶ Popular because explicit formulas (or approximations) exist for many special cases (e.g. regularized regression)
▶ Uses the most data possible for training
▶ More variable than $K$-fold CV for $K < n$, since only one data point is used for testing and the training sets are very similar
▶ In practice: Try out different values for $K$. Be cautious if results vary drastically with $K$; maybe the underlying model assumptions are not appropriate.
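A small sketch of LOOCV versus 10-fold CV with scikit-learn (an assumed tool choice; ridge regression on simulated data is an illustrative example, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)

model = Ridge(alpha=1.0)
# LOOCV: one fold per observation, i.e. n model fits
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                           scoring="neg_mean_squared_error").mean()
# 10-fold CV for comparison
kf_mse = -cross_val_score(model, X, y,
                          cv=KFold(n_splits=10, shuffle=True, random_state=1),
                          scoring="neg_mean_squared_error").mean()
print(f"LOOCV MSE: {loo_mse:.3f}, 10-fold CV MSE: {kf_mse:.3f}")
```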
Classification and Partitions
A classification algorithm constructs a partition of feature space and assigns a class to each region.
▶ kNN creates local neighbourhoods in feature space and assigns a class in each
▶ Logistic regression divides feature space implicitly by modelling $p(c \mid \mathbf{x})$ and determines decision boundaries through Bayes' rule
▶ Discriminant analysis creates an explicit model of the feature space conditional on the class. It models $p(\mathbf{x}, c)$ by assuming that $p(\mathbf{x} \mid c)$ is a normal distribution and either estimates $p(c)$ from data or through prior knowledge.
New point-of-view: Rectangular Partitioning
Idea: Create an explicit partition by dividing feature space into rectangular regions and assign a constant conditional mean (regression) or constant conditional class probability (classification) to each region.

Given regions $R_m$ for $m = 1, \dots, M$, a classification rule for classes $c \in \{1, \dots, L\}$ is
$$
\hat{c}(\mathbf{x}) = \arg\max_{1 \le c \le L} \sum_{m=1}^{M} \mathbb{1}(\mathbf{x} \in R_m) \left( \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(c_i = c) \right)
$$
and a regression function is given by
$$
\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} \left( \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} y_i \right) \mathbb{1}(\mathbf{x} \in R_m)
$$
(Derivations are similar to kNN with regions instead of neighbourhoods.)
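A minimal numpy sketch of the regression formula, assuming the rectangular regions are given as axis-aligned lower/upper bounds (illustrative only; it is not a fitted CART partition):

```python
import numpy as np

def region_means(X, y, regions):
    """Mean response of the training points falling in each rectangle.
    regions: list of (lower, upper) arrays defining axis-aligned boxes."""
    means = []
    for lower, upper in regions:
        in_region = np.all((X >= lower) & (X < upper), axis=1)
        means.append(y[in_region].mean() if in_region.any() else np.nan)
    return np.array(means)

def predict(x, regions, means):
    """f-hat(x) = sum_m mean_m * 1(x in R_m); regions assumed disjoint."""
    for (lower, upper), mean in zip(regions, means):
        if np.all((x >= lower) & (x < upper)):
            return mean
    return np.nan  # x falls outside all regions

# Toy example: two regions splitting the unit square at x1 = 0.5
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(0, 0.1, 200)
regions = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),
           (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
means = region_means(X, y, regions)
print(predict(np.array([0.2, 0.7]), regions, means))  # approx. 1
print(predict(np.array([0.9, 0.3]), regions, means))  # approx. 3
```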
Classification and Regression Trees (CART)
▶ Complexity of partitioning: arbitrary partition > rectangular partition > partition from a sequence of binary splits
▶ Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce the variability of values/classes in each region

[Figure: two-dimensional example with features $x_1$ and $x_2$; a tree with splits $x_2 \ge 2.2$ and $x_1 \ge 3.5$ partitions the plane into three rectangular regions containing mostly one class each.]
CART: Tree building/growing
1. Start with all data in a root node
2. Binary splitting
   2.1 Consider each feature $x_j$ for $j = 1, \dots, p$. Choose a threshold $t_j$ (for continuous features) or a partition of the feature categories (for categorical features) that results in the greatest improvement in node purity: $\{\mathbf{x}_i : x_{ij} > t_j\}$ and $\{\mathbf{x}_i : x_{ij} \le t_j\}$
   2.2 Choose the feature $j$ that led to the best split of the data and create a new child node for each subset
3. Repeat Step 2 on all child nodes until the tree reaches a stopping criterion

All nodes without descendants are called leaf nodes. The sequence of splits preceding them defines the regions $R_m$.
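A hedged sketch of growing a classification tree with scikit-learn (an assumed tool choice, not prescribed by the slides); each printed rule is one axis-parallel binary split and the leaves correspond to the regions $R_m$:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# Grow the tree by greedy binary splits on the feature/threshold
# that most improves node purity
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))
```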
Measures of node purity
Use
$$
\hat{\pi}_{mc} = \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(c_i = c)
$$
▶ Three common measures of impurity in a region $R_m$ are (for classification trees)
  Misclassification error: $1 - \max_c \hat{\pi}_{mc}$
  Gini impurity: $\sum_{c=1}^{L} \hat{\pi}_{mc}(1 - \hat{\pi}_{mc})$
  Entropy/deviance: $-\sum_{c=1}^{L} \hat{\pi}_{mc} \log \hat{\pi}_{mc}$
▶ All criteria are zero when only one class is present and maximal when all classes are equally common.
▶ For regression trees the decrease in mean squared error after a split can be used as an impurity measure.
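A short numerical check of the three measures (an assumed illustration), evaluated on a vector of class proportions $\hat{\pi}_{mc}$:

```python
import numpy as np

def misclassification(p):
    return 1 - np.max(p)

def gini(p):
    return np.sum(p * (1 - p))

def entropy(p):
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

pure = np.array([1.0, 0.0])      # one class present -> all measures are 0
mixed = np.array([0.5, 0.5])     # classes equally common -> all measures maximal
for p in (pure, mixed):
    print(misclassification(p), gini(p), entropy(p))
```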
Node impurity in two class case
Example for a two-class problem ($c = 0$ or $1$). $\hat{\pi}_{0m}$ is the empirical frequency of class 0 in a region $R_m$.

[Figure: misclassification error, Gini impurity and entropy as functions of $\hat{\pi}_{0m}$; all three are zero at 0 and 1 and maximal at 0.5.]

Only Gini impurity and entropy are used in practice (averaging problems for the misclassification error).
Stopping criteria
▶ Minimum size of leaf nodes (e.g. 5 samples per leaf node)
▶ Minimum decrease in impurity (e.g. cutoff at 1%)
▶ Maximum tree depth, i.e. number of splits (e.g. maximum 30 splits from the root node)
▶ Maximum number of leaf nodes

Running CART until one of these criteria is fulfilled generates a max tree. (A sketch of how these criteria map to software options follows below.)
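For illustration only, the criteria above map roughly onto the tree-growing options of scikit-learn's `DecisionTreeClassifier` (an assumed implementation; option names and exact definitions are library-specific, e.g. `min_impurity_decrease` uses a weighted impurity decrease):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum size of leaf nodes
    min_impurity_decrease=0.01,  # minimum decrease in impurity required for a split
    max_depth=30,                # maximum tree depth from the root node
    max_leaf_nodes=50,           # maximum number of leaf nodes
)
```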
Summary of CART
▶ Pro: Outcome is easily interpretable
▶ Pro: Can easily handle missing data
▶ Neutral: Only suitable for axis-parallel decision boundaries
▶ Con: Features with more potential splits have a higher chance of being picked
▶ Con: Prone to overfitting/unstable (only the best feature is used for splitting, and which feature is best might change with small changes in the data)
CART and overfitting
How can overfitting be avoided?
▶ Tuning of stopping criteria: These can easily lead to early stopping, since a weak split might be followed by a strong split later
▶ Pruning: Build a max tree first, then reduce its size by collapsing internal nodes. This can be more effective since weak splits are allowed during tree building. ("The silly certainty of hindsight")
▶ Ensemble methods: Examples are bagging, boosting, stacking, …
A note on pruning
▶ A common strategy is cost-complexity pruning.
▶ For a given $\alpha > 0$ and a tree $T$, its cost-complexity is defined as
$$
C_\alpha(T) = \underbrace{\sum_{R_m \in T} \left( \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}\bigl(c_i \ne \hat{c}(\mathbf{x}_i)\bigr) \right)}_{\text{Cost}} + \underbrace{\alpha |T|}_{\text{Complexity}}
$$
where $(c_i, \mathbf{x}_i)$ is the training data, $\hat{c}$ is the CART classification rule and $|T|$ is the number of leaf nodes/regions defined by the tree.
▶ It can be shown that successive subtrees $T_k$ of the max tree $T_{\max}$ can be found such that each tree $T_k$ minimizes $C_{\alpha_k}(T_k)$, where $\alpha_1 \ge \dots \ge \alpha_K$
▶ The tree with the lowest cost-complexity is chosen
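A sketch of cost-complexity pruning with scikit-learn (assumed tooling): `cost_complexity_pruning_path` returns the sequence of effective $\alpha$ values for the nested subtrees; picking $\alpha$ by cross-validation is one common practical choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
max_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Alphas corresponding to the nested subtrees T_k of the max tree
path = max_tree.cost_complexity_pruning_path(X, y)

# Choose alpha by cross-validation and refit the pruned tree
cv_scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(f"alpha = {best_alpha:.4f}, "
      f"leaves: {max_tree.get_n_leaves()} -> {pruned.get_n_leaves()}")
```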
Re-cap of the bootstrap and variance reduction
The Bootstrap – A short recapitulation (I)
Given a sample $x_i$, $i = 1, \dots, n$, from an underlying population, estimate a statistic $\theta$ by $\hat{\theta} = \hat{\theta}(x_1, \dots, x_n)$. What is the uncertainty of $\hat{\theta}$?

Solution: Find confidence intervals (CIs) quantifying the variability of $\hat{\theta}$.

Computation:
▶ Through theoretical results (e.g. linear models) if distributional assumptions are fulfilled
▶ Linearisation for more complex models (e.g. nonlinear or generalized linear models)
▶ Nonparametric approaches using the data (e.g. bootstrap)

All of these approaches require fairly large sample sizes.
The Bootstrap – A short recapitulation (II)
Nonparametric bootstrap: Given a sample $x_1, \dots, x_n$, bootstrapping performs for $b = 1, \dots, B$
1. Sample $\tilde{x}_1, \dots, \tilde{x}_n$ with replacement from the original sample
2. Calculate $\hat{\theta}_b(\tilde{x}_1, \dots, \tilde{x}_n)$

▶ $B$ should be large (in the 1000–10000s)
▶ The distribution of the $\hat{\theta}_b$ approximates the sampling distribution of $\hat{\theta}$
▶ The bootstrap makes exactly one strong assumption: the data is discrete and values not seen in the data are impossible.
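A minimal numpy sketch of the nonparametric bootstrap for the sample mean, assuming the exponential setting used on the following slide (numbers will not match the slide exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=3.0, size=200)   # sample from Exp(1/3), mean 3

B = 2000
theta_b = np.empty(B)
for b in range(B):
    resample = rng.choice(x, size=x.size, replace=True)  # draw n values with replacement
    theta_b[b] = resample.mean()                         # bootstrap replicate of the statistic

print(x.mean(), theta_b.std(ddof=1))  # point estimate and bootstrap standard error
```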
CI for statistics of an exponential random variable
[Figure: data ($n = 200$) simulated from $x \sim \mathrm{Exp}(1/3)$, i.e. $\mathbb{E}_{p(x)}[x] = 3$]
▶ Orange histogram shows the original sample
▶ Blue line is the true density
▶ Black outlined histogram shows a bootstrapped sample
▶ Vertical lines are the mean of $x$ (dashed) and the 99% quantile (dotted) [red = empirical, blue = theoretical]
CI calculation: Normal approximation and percentile method
1. Normal approximation: Set $\bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b$ and estimate the standard error of $\hat{\theta}$ as
$$
\widehat{\mathrm{se}} = \sqrt{\frac{\sum_{b=1}^{B} (\hat{\theta}_b - \bar{\theta})^2}{B - 1}}
$$
Assume the distribution of $\hat{\theta}$ is approximately $N(\hat{\theta}, \widehat{\mathrm{se}})$, giving the CI $\hat{\theta} \pm z_{1-\alpha/2}\, \widehat{\mathrm{se}}$
2. Percentile/quantile method: Take the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap estimates $\hat{\theta}_b$ as the boundaries of the CI
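Both CI constructions applied to bootstrap replicates of the mean (a self-contained sketch continuing the assumed exponential example above):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=3.0, size=200)
theta_hat = x.mean()
theta_b = np.array([rng.choice(x, size=x.size, replace=True).mean()
                    for _ in range(2000)])

alpha = 0.05
# 1. Normal approximation
se_hat = theta_b.std(ddof=1)
z = 1.959964                     # z_{1 - alpha/2} for alpha = 0.05
ci_normal = (theta_hat - z * se_hat, theta_hat + z * se_hat)
# 2. Percentile method
ci_perc = tuple(np.quantile(theta_b, [alpha / 2, 1 - alpha / 2]))
print(ci_normal, ci_perc)
```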
CI calculation: Applied to example
[Figure: bootstrap distributions of $\hat{\theta}_{b,\mathrm{mean}}$ and $\hat{\theta}_{b,0.99}$, based on $B = 1000$ bootstrap samples]
▶ For the mean value, the normal approximation seems reasonable
▶ 95% CIs: Normal approx. (2.68, 3.65); Percentile method (2.71, 3.67)
▶ For the quantile, bootstrapping requires a much larger $n$ and shows high uncertainty
Modifications to nonparametric bootstrap
▶ Different sampling strategies. Some examples:
  ▶ $m$-out-of-$n$ bootstrap: Draw $m < n$ samples without replacement
  ▶ Draw from a smooth density estimate of the data
  ▶ Draw from a parametric distribution fitted to the original data
▶ The normal approximation doesn't always apply and the percentile method is unstable for complicated statistics. Example of an alternative:
  ▶ Bootstrap-t: Instead of normal quantiles, estimate quantiles from $(\hat{\theta}_b - \hat{\theta}) / \widehat{\mathrm{se}}_b$, where $\widehat{\mathrm{se}}_b$ is an estimate of the standard error
▶ Many other alternatives exist …
Limitations of the bootstrap
▶ The number of samples needs to be quite large
▶ Extreme values (minimum, maximum, very small or large quantiles) can be hard to estimate since they might not even appear in the data
▶ Many basic CI estimation algorithms assume that the bootstrap distribution is approximately normal (often not the case in reality)
Bootstrap aggregation (bagging)
1. Given a training sample $(y_i, \mathbf{x}_i)$ or $(c_i, \mathbf{x}_i)$, we want to fit a predictive model $\hat{f}(\mathbf{x})$
2. For $b = 1, \dots, B$, form bootstrap samples of the training data and fit the model, resulting in $\hat{f}_b(\mathbf{x})$
3. Define
$$
\hat{f}_{\mathrm{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x})
$$
where $\hat{f}_b(\mathbf{x})$ is a continuous value for a regression problem or a vector of class probabilities for a classification problem

Majority vote can be used for classification problems instead of averaging.
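A hand-rolled bagging sketch for regression trees (an assumed example, similar in spirit to scikit-learn's BaggingRegressor): fit one tree per bootstrap sample and average the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 300)
X_new = np.linspace(-3, 3, 100).reshape(-1, 1)

B = 100
preds = np.empty((B, X_new.shape[0]))
for b in range(B):
    idx = rng.choice(len(y), size=len(y), replace=True)  # bootstrap sample of the training data
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx])
    preds[b] = tree.predict(X_new)

f_bag = preds.mean(axis=0)   # f_bag(x) = average of the B tree predictions
print(f_bag[:5])
```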
Bagging and variance reduction
▶ Bagging using averages approximates
$$
f_{\mathrm{ag}}(\mathbf{x}) = \mathbb{E}_{p(\mathbf{X})}\!\left[\hat{f}(\mathbf{x})\right]
$$
▶ For the conditional expected error under squared error loss
$$
\mathbb{E}_{p(\mathbf{X}, y \mid \mathbf{x})}\!\left[(y - \hat{f}(\mathbf{x}))^2\right] \ge \mathbb{E}_{p(\mathbf{X}, y \mid \mathbf{x})}\!\left[(y - f_{\mathrm{ag}}(\mathbf{x}))^2\right]
$$
▶ Some notes:
  ▶ Remember the graphs of kNN from the last lecture: noisy individually, more stable (less variable) on average
  ▶ Bagging shows no effect on linear models
Correlation and bagged variance
Recall: For identically distributed (i.d.) random variables $x_i$, $i = 1, \dots, n$,
$$
\mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = \frac{1 - \rho}{n}\, \sigma^2 + \rho\, \sigma^2
$$
where $\rho \in [0, 1)$ is the (positive) pairwise correlation coefficient and $\sigma^2$ is the variance of each $x_i$.
▶ Bootstrap samples are correlated, which increases the total variance
▶ Decreasing the correlation between bootstrap samples would decrease the variance of a bagging estimate
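A short worked consequence of the formula above (not spelled out on the slide): the first term vanishes as the number of averaged quantities grows, so the correlation puts a floor on how much averaging can reduce the variance,
$$
\mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = \frac{1 - \rho}{n}\, \sigma^2 + \rho\, \sigma^2 \;\longrightarrow\; \rho\, \sigma^2 \quad \text{as } n \to \infty .
$$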
Random Forests
1. Given a training sample with $p$ features, do for $b = 1, \dots, B$:
   1.1 Draw a bootstrap sample of size $n$ from the training data (with replacement)
   1.2 Grow a tree $T_b$ until each node reaches the minimal node size $n_{\min}$:
       1.2.1 Randomly select $m$ variables from the $p$ available
       1.2.2 Find the best splitting variable among these $m$
       1.2.3 Split the node
2. For a new $\mathbf{x}$ predict
   Regression: $\hat{f}_{\mathrm{rf}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$
   Classification: Majority vote at $\mathbf{x}$ across trees

Note: Step 1.2.1 leads to less correlation between trees built on bootstrapped data.
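A sketch of the procedure with scikit-learn's RandomForestClassifier (assumed tooling); `max_features` corresponds to the number $m$ of variables drawn in Step 1.2.1 and `min_samples_leaf` to $n_{\min}$:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(
    n_estimators=500,        # B bootstrapped trees
    max_features="sqrt",     # m variables considered at each split
    min_samples_leaf=1,      # n_min
    oob_score=True,          # keep the out-of-bag error (used for variable importance below)
    random_state=0,
).fit(X, y)
print(rf.oob_score_)         # out-of-bag accuracy; prediction is a majority vote across trees
```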
Variable importance
1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure of variable importance.
2. Out-of-bag error:
   ▶ During bootstrapping, for large enough $n$, each sample has a chance of about 63% to be selected
   ▶ For bagging, the remaining samples are out-of-bag
   ▶ These out-of-bag samples for tree $T_b$ can be used as a test set for that particular tree, since they were not used during training, resulting in a test error $E_0$
   ▶ Permute variable $j$ in the out-of-bag samples and calculate the test error again, $E_1^{(j)}$
   ▶ The increase in error $E_1^{(j)} - E_0 \ge 0$ serves as an importance measure for variable $j$
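A sketch of both importance measures with scikit-learn (assumed tooling). Note that `permutation_importance` permutes variables on a supplied held-out set rather than on the per-tree out-of-bag samples described above; it approximates the same idea.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance (measure 1): summed impurity reductions per feature
print(rf.feature_importances_[:5])

# Permutation importance (measure 2): drop in accuracy when one feature is permuted
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean[:5])
```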
Take-home message
▶ Direct partitioning of feature space is a complex task
▶ Simplifications in the form of binary splits, resulting in tree models, work well
▶ High interpretability of CART, but also high variability
▶ Random Forests tackle variance reduction through de-correlation of the trees grown on bootstrapped data