

SLIDE 1

On Feature Selection, Bias-Variance, and Bagging

Art Munson¹, Rich Caruana²

¹Department of Computer Science, Cornell University

²Microsoft Corporation

ECML-PKDD 2009


SLIDES 2–4

Task: Model Presence/Absence of Birds

Tried: SVMs, boosted decision trees, bagged decision trees, neural networks, ...

Ultimate goal: understand avian population dynamics.

Ran feature selection to find the smallest feature set with excellent performance.

SLIDE 5

Bagging Likes Many Noisy Features (?)

[Plot: RMS (0.35–0.385) vs. # features (5–30) on European Starling; legend: bagging, all features.]

SLIDE 6

Surprised Reviewers

Reviewer A

[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set].

Reviewer B

It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased.


SLIDE 7

Purpose of this Study

Does bagging often benefit from many features? If so, why?

SLIDE 8

Outline

1. Story Behind the Paper
2. Background
3. Experiment 1: FS and Bias-Variance
4. Experiment 2: Weak, Noisy Features


SLIDE 9

Review of Bagging

Bagging is a simple ensemble learning algorithm [Bre96]:

draw a random sample of the training data
train a model on the sample (e.g. a decision tree)
repeat N times (e.g. 25 times)
bagged prediction: average the predictions of the N models
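A minimal sketch of this procedure in Python, assuming scikit-learn decision trees as the base model; bag_trees and bagged_predict are illustrative names, not from the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, n_models=25, seed=0):
    """Train n_models trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.choice(n, size=n, replace=True)  # random sample of the data
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Bagged prediction: average the member models' predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)
```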


SLIDE 10

Facts about Bagging

Surprisingly competitive performance & rarely overfits [BK99].
Main benefit is reducing the variance of the constituent models [BK99].
Improves the ability to ignore irrelevant features [AP96].


SLIDE 11

Review of Bias-Variance Decomposition

The error of a learning algorithm on example x comes from 3 sources:

noise: intrinsic error / uncertainty in x's true label
bias: how close, on average, the algorithm is to the optimal prediction
variance: how much the prediction changes if the training set changes

The error decomposes as: error(x) = noise(x) + bias(x) + variance(x)

On real problems, bias and noise cannot be measured separately.
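Since the experiments use squared error, the decomposition can be made concrete. A short derivation sketch, assuming \bar{y} = E_D[y] is the average prediction over training sets D and t^* = E[t | x] is the optimal prediction (both symbols are notation introduced here, not from the slides):

```latex
% Squared-error decomposition. The cross terms vanish because
% E_D[y - \bar{y}] = 0 and E_t[t - t^*] = 0.
E_{D,t}\big[(y - t)^2\big]
  = \underbrace{E_t\big[(t - t^*)^2\big]}_{\text{noise}(x)}
  + \underbrace{(t^* - \bar{y})^2}_{\text{bias}(x)}
  + \underbrace{E_D\big[(y - \bar{y})^2\big]}_{\text{variance}(x)}
```

This is why bias and noise cannot be measured separately on real data: t^* is unknown, so any empirical "bias" measured against observed labels t absorbs the noise term.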


SLIDES 12–13

Measuring Bias & Variance (Squared Error)

Generate an empirical distribution of the algorithm's predictions [BK99]:

Randomly sample 1/2 of the training data.
Train a model on the sample and make predictions y for the test data.
Repeat R times (e.g. 20 times).
Compute the average prediction ȳ for every test example.

For each test example x with true label t:

bias(x) = (t − ȳ)²
variance(x) = (1/R) · Σᵢ₌₁ᴿ (yᵢ − ȳ)²

Average over test cases to get the expected bias & variance for the algorithm.
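A minimal sketch of this measurement procedure, assuming a hypothetical train_and_predict callable that trains the learner being measured and returns its test-set predictions; as the slides note, the "bias" term here also absorbs noise:

```python
import numpy as np

def bias_variance(train_and_predict, X_train, y_train, X_test, y_test,
                  R=20, seed=0):
    """Estimate expected bias(+noise) and variance for squared error."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(R):
        idx = rng.choice(n, size=n // 2, replace=False)  # half the training data
        preds.append(train_and_predict(X_train[idx], y_train[idx], X_test))
    preds = np.asarray(preds)              # shape: (R, number of test cases)
    y_bar = preds.mean(axis=0)             # average prediction per test case
    bias = (y_test - y_bar) ** 2           # absorbs noise: inseparable on real data
    variance = ((preds - y_bar) ** 2).mean(axis=0)
    return bias.mean(), variance.mean()
```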


SLIDES 14–15

Review of Feature Selection

Forward Stepwise Feature Selection

Start from an empty selected set.
Evaluate the benefit of adding each non-selected feature (training a model for each choice).
Select the most beneficial feature.
Repeat until a stopping criterion is met.

Correlation-based Feature Filtering

Rank features by their individual correlation with the class label.
Choose a cutoff point (by statistical test or cross-validation).
Keep the features above the cutoff point. Discard the rest.
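Hedged sketches of both procedures; score is a hypothetical evaluation function (e.g. cross-validated MSE of a model trained on the candidate subset, lower assumed better), and neither function is taken from the paper's implementation:

```python
import numpy as np

def forward_stepwise(features, score, max_size):
    """Greedy forward selection: grow the set one best feature at a time."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < max_size:   # stopping criterion
        # Evaluate the benefit of adding each non-selected feature.
        best = min(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def correlation_filter(X, y, k):
    """Keep the k features most correlated (absolute value) with the label."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]               # indices of top-k features
```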


SLIDE 16

Experiment 1: Bias-Variance of Feature Selection

Summary:

19 datasets.
Order features using feature selection: forward stepwise selection or correlation filtering, depending on dataset size.
Estimate bias & variance at multiple feature set sizes.
5-fold cross-validation.

[Plot: dataset sizes; # features (1 to 1e+06, log scale) vs. # samples (100 to 100,000, log scale).]

SLIDE 17

Case 1: No Improvement from Feature Selection

[Plot: MSE (0.02–0.08) vs. # features (1–54) on covtype, split into bias/noise and variance, for a single decision tree and a bagged decision tree.]

SLIDE 18

Case 2: FS Improves Non-Bagged Model

[Plot: MSE (0.06–0.1) vs. # features (1–63) on medis, split into bias/noise and variance. Annotation: the non-bagged model overfits with too many features.]

SLIDE 19

Take Away Points

More features ⇒ lower bias/noise, higher variance.
Feature selection does not improve bagged model performance (1 exception).
The best subset size corresponds to the best bias/variance tradeoff point.
This tradeoff is algorithm dependent.
Relevant features may be discarded if the variance increase outweighs the extra information.

SLIDES 20–22

Why Does Bagging Benefit from so Many Features?

[Plot: MSE (0.08–0.13) vs. # features (1–1,341) on cryst, split into variance and bias/noise.]

SLIDE 23

Hypothesis

Bagging improves the base learner's ability to benefit from weak, noisy features.

SLIDE 24

Experiment 2: Noisy Informative Features

Summary:

Generate synthetic data (6 features).
Duplicate 1/2 of the features 20 times.
Corrupt X% of the values in the duplicated features.
Train single and bagged trees with the corrupted features and the 3 non-duplicated features.
Compare to: the ideal, unblemished feature set, and no noisy features (3 non-duplicated only).
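A sketch of the duplication-and-corruption step under stated assumptions: the exact noise model is not given here, so corrupted values are simply replaced with uniform random draws over the data's observed range, and all names are illustrative:

```python
import numpy as np

def make_noisy_copies(X, p, n_copies=20, seed=0):
    """Duplicate the first half of X's features n_copies times, then
    corrupt a fraction p of the duplicated values with uniform noise."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    dup = X[:, : d // 2]                               # the duplicated half
    noisy = np.tile(dup, (1, n_copies)).astype(float)  # e.g. 3 x 20 = 60 copies
    mask = rng.random(noisy.shape) < p                 # values to corrupt
    noisy[mask] = rng.uniform(X.min(), X.max(), size=mask.sum())
    return np.hstack([X[:, d // 2 :], noisy])          # clean half + noisy copies
```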


SLIDES 25–28

Bagging Extracts More Info from Noisy Features

[Plot: MSE (0.05–0.3) vs. fraction of feature values corrupted (0.0–1.0), split into variance and bias/noise. Reference lines: 6 original features (ideal) and 3 non-duplicated features (baseline); all other curves use the 3 non-duplicated features + 60 noisy features.]

SLIDE 29

Conclusions

After training 9,060,936 decision trees . . .

Experiment 1:
More features ⇒ lower bias/noise, higher variance.
Feature selection does not improve bagged model performance.
The best subset size corresponds to the best bias/variance tradeoff point.

Experiment 2:
Bagged trees are surprisingly good at extracting useful information from noisy features.
Different trees exploit different weak features.

SLIDE 30

Bibliography

[AP96] Kamal M. Ali and Michael J. Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173–202, 1996.
[BK99] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1-2):105–139, 1999.
[Bre96] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

SLIDE 31

Exception: Overfitting Pseudo-Identifiers

[Plot: MSE (0.13–0.2) vs. # features (1–175) on bunting, split into variance and bias/noise.]