SLIDE 1

Lecture 4: Rule-based classification and regression

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data 1st April 2019

SLIDE 2

Amendment: Bias-Variance Tradeoff

Bias-Variance Decomposition

𝑆 = π”½π‘ž(𝒰,𝐲,𝑧) [(𝑧 βˆ’ Λ† 𝑔(𝐲))2] Total expected prediction error = 𝜏2 Irreducible Error + π”½π‘ž(𝐲) [(𝑔(𝐲) βˆ’ π”½π‘ž(𝒰) [ Λ† 𝑔(𝐲)])

2

] Bias2 averaged over 𝐲 + π”½π‘ž(𝐲) [Varπ‘ž(𝒰) [ Λ† 𝑔(𝐲)]] Variance of Λ† 𝑔 averaged over 𝐲

𝑆

[Figure: total error vs. model complexity — bias² decreases and variance increases with complexity, irreducible error is constant; low complexity underfits, high complexity overfits]

SLIDE 3

Observations

β–Ά Irreducible error cannot be changed
β–Ά Bias and variance of $\hat{f}$ are sample-size dependent
β–Ά For a consistent estimator $\hat{f}$: $\mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})] \to f(\mathbf{x})$ for increasing sample size
β–Ά In many cases: $\operatorname{Var}_{p(\mathcal{T})}(\hat{f}(\mathbf{x})) \to 0$ for increasing sample size
β–Ά Caution: Theoretical guarantees often depend on the number of variables $p$ staying fixed while the sample size $n$ increases. This might not be fulfilled in reality.

SLIDE 4

Amendment: Leave-One-Out Cross-validation (LOOCV)

Cross-validation with 𝑑 = π‘œ is called leave-one-out cross-validation.

β–Ά Popular because explicit formulas (or approximations)

exist for many special cases (e.g. regularized regression)

β–Ά Uses the most data for training possible β–Ά More variable than 𝑑-fold CV for 𝑑 < π‘œ since only one data

point is used for testing and the training sets are very similar

β–Ά In praxis: Try out different values for 𝑑. Be cautious if

results vary drastically with 𝑑. Maybe the underlying model assumptions are not appropriate.
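A minimal scikit-learn sketch comparing LOOCV with $K$-fold CV for a few values of $K$; the ridge model and the diabetes dataset are placeholders, not part of the lecture:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)

# LOOCV: K-fold CV with as many folds as observations (K = n)
loo = cross_val_score(model, X, y, cv=LeaveOneOut(),
                      scoring="neg_mean_squared_error")
print("LOOCV MSE:", -loo.mean())

# Try out different values for K; be cautious if results vary drastically
for k in (5, 10, 20):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    print(f"{k}-fold MSE:", -scores.mean())
```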

SLIDE 5

Classification and Partitions

SLIDE 6

Classification and Partitions

A classification algorithm constructs a partition of feature space and assigns a class to each region.

β–Ά kNN creates local neighbourhoods in feature space and assigns a class in each
β–Ά Logistic regression divides feature space implicitly by modelling $p(k|\mathbf{x})$ and determines decision boundaries through Bayes' rule
β–Ά Discriminant analysis creates an explicit model of the feature space conditional on the class. It models $p(\mathbf{x}, k)$ by assuming that $p(\mathbf{x}|k)$ is a normal distribution and either estimates $p(k)$ from data or through prior knowledge.

SLIDE 7

New point-of-view: Rectangular Partitioning

Idea: Create an explicit partition by dividing feature space into rectangular regions and assign a constant conditional mean (regression) or constant conditional class probability (classification) to each region.

Given regions $R_m$ for $m = 1, \dots, M$, a classification rule for classes $k \in \{1, \dots, K\}$ is
$$
\hat{c}(\mathbf{x}) = \arg\max_{1 \le k \le K} \sum_{m=1}^{M} \mathbb{1}(\mathbf{x} \in R_m) \left( \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(y_i = k) \right)
$$
and a regression function is given by
$$
\hat{f}(\mathbf{x}) = \sum_{m=1}^{M} \left( \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} y_i \right) \mathbb{1}(\mathbf{x} \in R_m)
$$
(Derivations are similar to kNN, with regions instead of neighbourhoods.)

SLIDE 8

Classification and Regression Trees (CART)

β–Ά Complexity of partitioning:
  Arbitrary partition > rectangular partition > partition from a sequence of binary splits
β–Ά Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce the variability of values/classes in each region

[Figure: two-class data in the $(x_1, x_2)$-plane and the corresponding tree: the root splits at $x_2 \ge 2.2$, one child splits at $x_1 \ge 3.5$, and each leaf shows its class proportions and share of the data]

SLIDE 9

CART: Tree building/growing

  • 1. Start with all data in a root node
  • 2. Binary splitting
    2.1 Consider each feature $x_{\cdot j}$ for $j = 1, \dots, p$. Choose a threshold $t_j$ (for continuous features) or a partition of the feature categories (for categorical features) that results in the greatest improvement in node purity, splitting the samples into $\{y_i : x_{ij} > t_j\}$ and $\{y_i : x_{ij} \le t_j\}$
    2.2 Choose the feature $j$ that led to the best split of the data and create a new child node for each subset
  • 3. Repeat Step 2 on all child nodes until the tree reaches a stopping criterion

All nodes without descendants are called leaf nodes. The sequence of splits preceding them defines the regions $R_m$.
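scikit-learn's DecisionTreeClassifier implements this greedy growing scheme (for continuous features); a small sketch on the iris data, which is only a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X, y = data.data, data.target

# Greedy binary splitting: at each node, pick the feature/threshold with the
# greatest improvement in node purity, until a stopping criterion is reached
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5,
                              random_state=0).fit(X, y)

# The printed sequence of splits defines the leaf regions R_m
print(export_text(tree, feature_names=data.feature_names))
```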

SLIDE 10

Measures of node purity

Use Λ† πœŒπ‘—π‘› = 1 |𝑆𝑛| βˆ‘

π²π‘šβˆˆπ‘†π‘›

1(π‘—π‘š = 𝑗)

β–Ά Three common measures to determine impurity in a

region 𝑆𝑛 are (for classification trees) Misclassification error: 1 βˆ’ max𝑗 Λ† πœŒπ‘—π‘› Gini impurity: βˆ‘

𝐿 𝑗=1 Λ†

πœŒπ‘—π‘›(1 βˆ’ Λ† πœŒπ‘—π‘›) Entropy/deviance: βˆ’ βˆ‘

𝐿 𝑗=1 Λ†

πœŒπ‘—π‘› log Λ† πœŒπ‘—π‘›

β–Ά All criteria are zero when only one class is present and

maximal when all classes are equally common.

β–Ά For regression trees the decrease in mean squared error

after a split can be used as an impurity measure.
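The measures written out as small numpy functions of the vector of class proportions $\hat{p}_{m\cdot}$ (a sketch, not library code):

```python
import numpy as np

def misclassification(p):
    # 1 - max_k p_mk
    return 1.0 - np.max(p)

def gini(p):
    # sum_k p_mk * (1 - p_mk)
    return np.sum(p * (1.0 - p))

def entropy(p):
    # -sum_k p_mk * log(p_mk), using the convention 0 * log 0 = 0
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_pure = np.array([1.0, 0.0, 0.0])     # one class present: all measures zero
p_mixed = np.array([1/3, 1/3, 1/3])    # all classes equally common: maximal
for measure in (misclassification, gini, entropy):
    print(measure.__name__, measure(p_pure), measure(p_mixed))
```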

SLIDE 11

Node impurity in two class case

Example for a two-class problem ($k = 0$ or $1$). $\hat{p}_{0m}$ is the empirical frequency of class 0 in a region $R_m$.

[Figure: impurity as a function of $\hat{p}_{0m} \in [0, 1]$ for entropy, Gini impurity and misclassification error]

Only Gini impurity and entropy are used in practice (the misclassification error leads to problems when averaging).

SLIDE 12

Stopping criteria

β–Ά Minimum size of leaf nodes (e.g. 5 samples per leaf node)
β–Ά Minimum decrease in impurity (e.g. cutoff at 1%)
β–Ά Maximum tree depth, i.e. number of splits (e.g. at most 30 splits from the root node)
β–Ά Maximum number of leaf nodes

Running CART until one of these criteria is fulfilled generates a max tree.
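In scikit-learn these criteria correspond directly to hyperparameters of the tree classes; the values below simply mirror the examples above and are not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# One argument per stopping criterion from the list above
max_tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum size of leaf nodes
    min_impurity_decrease=0.01,  # minimum (weighted) decrease in impurity
    max_depth=30,                # maximum tree depth
    max_leaf_nodes=None,         # maximum number of leaf nodes (unlimited here)
)
```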

SLIDE 13

Summary of CART

β–Ά Pro: Outcome is easily interpretable
β–Ά Pro: Can easily handle missing data
β–Ά Neutral: Only suitable for axis-parallel decision boundaries
β–Ά Con: Features with more potential splits have a higher chance of being picked
β–Ά Con: Prone to overfitting/unstable (only the best feature is used for splitting, and which feature is best can change with small changes in the data)

SLIDE 14

CART and overfitting

How can overfitting be avoided?

β–Ά Tuning of stopping criteria: These can easily lead to early stopping, since a weak split might lead to a strong split later
β–Ά Pruning: Build a max tree first, then reduce its size by collapsing internal nodes. This can be more effective, since weak splits are allowed during tree building. ("The silly certainty of hindsight")
β–Ά Ensemble methods: Examples are bagging, boosting, stacking, …

SLIDE 15

A note on pruning

β–Ά A common strategy is cost-complexity pruning.
β–Ά For a given $\alpha > 0$ and a tree $T$, its cost-complexity is defined as
$$
C_\alpha(T) = \underbrace{\sum_{R_m \in T} \left( \frac{1}{|R_m|} \sum_{\mathbf{x}_i \in R_m} \mathbb{1}(y_i \ne \hat{c}(\mathbf{x}_i)) \right)}_{\text{Cost}} + \underbrace{\alpha \, |T|}_{\text{Complexity}}
$$
where $(y_i, \mathbf{x}_i)$ is the training data, $\hat{c}$ the CART classification rule and $|T|$ the number of leaf nodes/regions defined by the tree.
β–Ά It can be shown that successive subtrees $T_l$ of the max tree $T_{\max}$ can be found such that each tree $T_l$ minimizes $C_{\alpha_l}(T_l)$, where $\alpha_1 \ge \dots \ge \alpha_L$
β–Ά The tree with the lowest cost-complexity is chosen

A sketch of this procedure follows below.
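scikit-learn implements this scheme: cost_complexity_pruning_path returns the sequence of effective $\alpha$ values for the successive subtrees, and ccp_alpha selects one. A sketch that picks $\alpha$ by cross-validation (the dataset is a placeholder):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Alphas of the successive subtrees T_l of the max tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Evaluate each subtree by cross-validation and keep the best alpha
cv_scores = [cross_val_score(
                 DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                 X, y, cv=5).mean()
             for alpha in path.ccp_alphas]
print("chosen alpha:", path.ccp_alphas[int(np.argmax(cv_scores))])
```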

SLIDE 16

Re-cap of the bootstrap and variance reduction

SLIDE 17

The Bootstrap – A short recapitulation (I)

Given a sample $x_i$, $i = 1, \dots, n$ from an underlying population, estimate a statistic $\theta$ by $\hat{\theta} = \hat{\theta}(x_1, \dots, x_n)$. What is the uncertainty of $\hat{\theta}$?

Solution: Find confidence intervals (CIs) quantifying the variability of $\hat{\theta}$. Computation:

β–Ά Through theoretical results (e.g. linear models) if distributional assumptions are fulfilled
β–Ά Linearisation for more complex models (e.g. nonlinear or generalized linear models)
β–Ά Nonparametric approaches using the data (e.g. bootstrap)

All of these approaches require fairly large sample sizes.

SLIDE 18

The Bootstrap – A short recapitulation (II)

Nonparametric bootstrap: Given a sample $x_1, \dots, x_n$, bootstrapping performs, for $b = 1, \dots, B$:

  • 1. Sample $\tilde{x}_1, \dots, \tilde{x}_n$ with replacement from the original sample
  • 2. Calculate $\hat{\theta}_b = \hat{\theta}(\tilde{x}_1, \dots, \tilde{x}_n)$

β–Ά $B$ should be large (in the 1000s–10000s)
β–Ά The distribution of the $\hat{\theta}_b$ approximates the sampling distribution of $\hat{\theta}$
β–Ά The bootstrap makes exactly one strong assumption: the data is discrete and values not seen in the data are impossible.¹

¹Check out this blog post!
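A minimal numpy sketch of the two steps for the sample mean; the exponential data and the value of $B$ are placeholders matching the example on the next slide:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=3.0, size=200)   # original sample, n = 200

B = 5000
theta_b = np.empty(B)
for b in range(B):
    # Step 1: sample n values with replacement from the original sample
    x_tilde = rng.choice(x, size=x.size, replace=True)
    # Step 2: calculate the statistic on the bootstrap sample
    theta_b[b] = x_tilde.mean()

# The theta_b approximate the sampling distribution of the sample mean
print("bootstrap mean:", theta_b.mean(), "bootstrap s.e.:", theta_b.std(ddof=1))
```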

SLIDE 19

CI for statistics of an exponential random variable

[Figure: histogram of the original sample with the true density overlaid and one bootstrapped sample outlined]

Data ($n = 200$) simulated from $x \sim \operatorname{Exp}(1/3)$, i.e. $\mathbb{E}[x] = 3$

β–Ά Orange histogram shows the original sample
β–Ά Blue line is the true density
β–Ά Black outlined histogram shows a bootstrapped sample
β–Ά Vertical lines are the mean of $x$ (dashed) and the 99% quantile (dotted) [red = empirical, blue = theoretical]

SLIDE 20

CI calculation: Normal approximation and percentile method

  • 1. Normal approximation: Set $\bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b$ and estimate the standard error of $\hat{\theta}$ as
$$
\widehat{se} = \sqrt{\frac{\sum_{b=1}^{B} (\hat{\theta}_b - \bar{\theta})^2}{B - 1}}
$$
Assume the distribution of $\hat{\theta}$ is approximately $N(\hat{\theta}, \widehat{se}^2)$, giving the CI $\hat{\theta} \pm z_{1-\alpha/2} \, \widehat{se}$
  • 2. Percentile/quantile method: Take the $\alpha/2$ and $1 - \alpha/2$ quantiles of the bootstrap estimates $\hat{\theta}_b$ as the boundaries of the CI

Both constructions are sketched below.

SLIDE 21

CI calculation: Applied to example

[Figure: bootstrap distributions of $\hat{\theta}_{b,\text{mean}}$ (left) and $\hat{\theta}_{b,0.99}$ (right)]

Based on $B = 1000$ bootstrap samples.

For the mean value, the normal approximation seems reasonable. 95% CIs:
  Normal approx.: (2.68, 3.65)
  Percentile method: (2.71, 3.67)

For the 99% quantile, bootstrapping requires a much larger $n$ and shows high uncertainty.

SLIDE 22

Modifications to nonparametric bootstrap

β–Ά Different sampling strategies. Some examples:
  β–Ά $m$-out-of-$n$ bootstrap: Draw $m < n$ samples without replacement
  β–Ά Draw from a smooth density estimate of the data
  β–Ά Draw from a parametric distribution fitted to the original data
β–Ά The normal approximation doesn't always apply, and the percentile method is unstable for complicated statistics. Example of an alternative:
  β–Ά Bootstrap-t: Instead of normal quantiles, estimate quantiles from $(\hat{\theta}_b - \hat{\theta}) / \widehat{se}_b$, where $\widehat{se}_b$ is an estimate of the standard error
β–Ά Many other alternatives exist …

SLIDE 23

Limitations of the bootstrap

β–Ά The number of samples needs to be quite large
β–Ά Extreme values (minimum, maximum, very small or large quantiles) can be hard to estimate since they might not even appear in the data
β–Ά Many basic CI estimation algorithms assume that the bootstrap distribution is approximately normal (often not the case in reality)

SLIDE 24

Bootstrap aggregation (bagging)

  • 1. Given a training sample $(y_i, \mathbf{x}_i)$, $i = 1, \dots, n$ (regression or classification), we want to fit a predictive model $\hat{f}(\mathbf{x})$
  • 2. For $b = 1, \dots, B$, form bootstrap samples of the training data and fit the model, resulting in $\hat{f}_b(\mathbf{x})$
  • 3. Define
$$
\hat{f}_{\text{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(\mathbf{x})
$$
where $\hat{f}_b(\mathbf{x})$ is a continuous value for a regression problem or a vector of class probabilities for a classification problem

Majority vote can be used for classification problems instead of averaging.
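A sketch of bagged regression trees written out by hand (scikit-learn's BaggingRegressor wraps the same loop); X and y are assumed to be numpy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, B=100, seed=0):
    """Fit B trees, each on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for b in range(B):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    # f_bag(x) = (1/B) * sum_b f_b(x)
    return np.mean([m.predict(X) for m in models], axis=0)
```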

SLIDE 25

Bagging and variance reduction

β–Ά Bagging using averages approximates
$$
f_{\text{ag}}(\mathbf{x}) = \mathbb{E}_{p(\mathcal{T})}\big[\hat{f}(\mathbf{x})\big]
$$
β–Ά For the conditional expected error in squared error loss
$$
\mathbb{E}_{p(\mathcal{T}, y|\mathbf{x})}\big[(y - \hat{f}(\mathbf{x}))^2\big] \ge \mathbb{E}_{p(\mathcal{T}, y|\mathbf{x})}\big[(y - f_{\text{ag}}(\mathbf{x}))^2\big]
$$
β–Ά Some notes:
  β–Ά Remember the kNN graphs from last lecture: noisy individually, more stable (less variable) on average
  β–Ά Bagging shows no effect on linear models

SLIDE 26

Correlation and bagged variance

Recall: For identically distributed (i.d.) random variables $x_i$, $i = 1, \dots, n$,
$$
\operatorname{Var}\left(\frac{1}{n} \sum_{i=1}^{n} x_i\right) = \frac{1 - \rho}{n} \sigma^2 + \rho \sigma^2
$$
where $\rho \in [0, 1)$ is the (positive) pairwise correlation coefficient and $\sigma^2$ is the variance of each $x_i$.

β–Ά Bootstrap samples are correlated, which increases the total variance
β–Ά Decreasing the correlation between bootstrap samples would decrease the variance of a bagging estimate

A small simulation of this formula is sketched below.
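A numpy check of the variance formula for equicorrelated normal variables; n, rho and sigma2 are arbitrary toy values:

```python
import numpy as np

n, rho, sigma2 = 20, 0.5, 1.0
# Equicorrelation covariance: sigma2 on the diagonal, rho * sigma2 off it
cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

rng = np.random.default_rng(2)
draws = rng.multivariate_normal(np.zeros(n), cov, size=100_000)
sample_means = draws.mean(axis=1)

print("empirical variance:", sample_means.var())
print("formula:           ", (1 - rho) / n * sigma2 + rho * sigma2)
```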

SLIDE 27

Random Forests

SLIDE 28

Random Forests

  • 1. Given a training sample with $p$ features, do for $b = 1, \dots, B$:
    1.1 Draw a bootstrap sample of size $n$ from the training data (with replacement)
    1.2 Grow a tree $T_b$ until each node reaches the minimal node size $n_{\min}$:
      1.2.1 Randomly select $m$ variables from the $p$ available
      1.2.2 Find the best splitting variable among these $m$
      1.2.3 Split the node
  • 2. For a new $\mathbf{x}$ predict
    Regression: $\hat{f}_{\text{rf}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^{B} T_b(\mathbf{x})$
    Classification: Majority vote at $\mathbf{x}$ across trees

Note: Step 1.2.1 leads to less correlation between trees built on bootstrapped data.
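The same algorithm as implemented in scikit-learn; here n_estimators plays the role of $B$, max_features of $m$ and min_samples_leaf of $n_{\min}$ (dataset and values are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=500,     # B trees, each grown on a bootstrap sample of size n
    max_features="sqrt",  # m randomly selected candidate features per split
    min_samples_leaf=1,   # n_min: grow the trees deep
    random_state=0,
)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```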

SLIDE 29

Variable importance

  • 1. Impurity index: Splitting on a feature leads to a reduction of node impurity. Summing all improvements over all trees per feature gives a measure of variable importance.
  • 2. Out-of-bag error:
    β–Ά During bootstrapping, for large enough $n$, each sample has a chance of about 63% to be selected
    β–Ά For bagging, the remaining samples are out-of-bag
    β–Ά These out-of-bag samples for tree $T_b$ can be used as a test set for that particular tree, since they were not used during training, resulting in test error $e_0$
    β–Ά Permute variable $j$ in the out-of-bag samples and calculate the test error again, $e_1^{(j)}$
    β–Ά The increase in error $e_1^{(j)} - e_0 \ge 0$ serves as an importance measure for variable $j$

Both measures are sketched below.
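In scikit-learn, feature_importances_ gives the impurity index and permutation_importance permutes one variable at a time. Here the permutation measure is computed on a held-out set instead of per-tree out-of-bag samples, a common simplification:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# 1. Impurity index: impurity reductions summed per feature over all trees
print(forest.feature_importances_)

# 2. Permutation importance: increase in error after permuting variable j
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```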

SLIDE 30

Take-home message

β–Ά Direct partitioning of feature space is a complex task
β–Ά Simplifications in the form of binary splits, resulting in tree models, work well
β–Ά High interpretability of CART, but also high variability
β–Ά Random Forests tackle variance reduction through bagging and random selection of splitting features
