On Feature Selection, Bias-Variance, and Bagging
Art Munson (Cornell University) and Rich Caruana (Microsoft Corporation), ECML-PKDD 2009


  1. On Feature Selection, Bias-Variance, and Bagging. Art Munson (Department of Computer Science, Cornell University) and Rich Caruana (Microsoft Corporation). ECML-PKDD 2009.

  2–4. Task: Model Presence/Absence of Birds. Tried: SVMs, boosted decision trees, bagged decision trees, neural networks, ... Ultimate goal: understand avian population dynamics. Ran feature selection to find the smallest feature set with excellent performance.

  5. Bagging Likes Many Noisy Features (?) [Figure: European Starling. RMS vs. # features (0 to 30); curves labeled 'bagging' and 'all features'.]

  6. Surprised Reviewers. Reviewer A: "[I] also found that the results reported in Figure 2 [were] strange, where the majority [of] results show that classifiers built from selected features are actually inferior to the ones trained from the whole feature [set]." Reviewer B: "It is very surprising that the performance of all methods improves (or stays constant) when the number of features is increased."

  7. Purpose of this Study: Does bagging often benefit from many features? If so, why?

  8. Outline: 1. Story Behind the Paper; 2. Background; 3. Experiment 1: FS and Bias-Variance; 4. Experiment 2: Weak, Noisy Features.

  9. Review of Bagging. Bagging: simple ensemble learning algorithm [Bre96]: draw a random sample of the training data; train a model using the sample (e.g. a decision tree); repeat N times (e.g. 25 times); bagged predictions: average the predictions of the N models.
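A minimal sketch of this procedure (not the authors' code), assuming numpy arrays and a scikit-learn decision tree as the base learner; the function name bagged_predict is made up for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor  # base learner choice is an assumption

    def bagged_predict(X_train, y_train, X_test, n_models=25, seed=0):
        """Bagging [Bre96]: train N trees on random (bootstrap) samples of the
        training data and average their predictions."""
        rng = np.random.RandomState(seed)
        n = len(X_train)
        predictions = []
        for _ in range(n_models):
            idx = rng.randint(0, n, size=n)  # draw a random sample of the training data
            tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
            predictions.append(tree.predict(X_test))
        return np.mean(predictions, axis=0)  # bagged prediction: average of the N models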

  10. Facts about Bagging: Surprisingly competitive performance & rarely overfits [BK99]. Main benefit is reducing variance of constituent models [BK99]. Improves ability to ignore irrelevant features [AP96].

  11. Review of Bias-Variance Decomposition. The error of a learning algorithm on example x comes from 3 sources: noise (intrinsic error / uncertainty in x's true label), bias (how close, on average, the algorithm is to the optimal prediction), and variance (how much the prediction changes when the training set changes). Error decomposes as: error(x) = noise(x) + bias(x) + variance(x). On real problems, bias and noise cannot be measured separately.

  12–13. Measuring Bias & Variance (Squared Error). Generate an empirical distribution of the algorithm's predictions [BK99]: randomly sample 1/2 of the training data; train a model using the sample and make predictions y for the test data; repeat R times (e.g. 20 times); compute the average prediction y_m for every test example. For each test example x with true label t: bias(x) = (t − y_m)², and variance(x) = (1/R) Σ_{i=1..R} (y_m − y_i)². Average over test cases to get the expected bias & variance for the algorithm.
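A sketch of this estimate for squared error, assuming numpy arrays and a scikit-learn-style estimator; make_model (a factory returning a fresh model, e.g. lambda: DecisionTreeRegressor()) and the function name are hypothetical. As noted above, on real data the "bias" term also absorbs the noise:

    import numpy as np

    def estimate_bias_variance(make_model, X_train, y_train, X_test, y_test, R=20, seed=0):
        rng = np.random.RandomState(seed)
        n = len(X_train)
        preds = np.empty((R, len(X_test)))
        for r in range(R):
            # randomly sample 1/2 of the training data (without replacement)
            idx = rng.choice(n, size=n // 2, replace=False)
            model = make_model().fit(X_train[idx], y_train[idx])
            preds[r] = model.predict(X_test)             # predictions y_i for the test data
        y_m = preds.mean(axis=0)                         # average prediction per test example
        bias = (y_test - y_m) ** 2                       # bias(x) = (t - y_m)^2 (includes noise)
        variance = ((preds - y_m) ** 2).mean(axis=0)     # variance(x) = (1/R) sum_i (y_m - y_i)^2
        return bias.mean(), variance.mean()              # average over test cases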

  14–15. Review of Feature Selection. Forward Stepwise Feature Selection: start from an empty selected set; evaluate the benefit of selecting each non-selected feature (training a model for each choice); select the most beneficial feature; repeat the search until a stopping criterion is met. Correlation-based Feature Filtering: rank features by individual correlation with the class label; choose a cutoff point (by statistical test or cross-validation); keep the features above the cutoff point and discard the rest.
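Illustrative sketches of both procedures (not the authors' code). Here score(features) is a hypothetical callback that trains a model on the given feature subset and returns its validation performance (higher is better); the cutoff k for the filter is taken as given, though the slide notes it would be chosen by a statistical test or cross-validation:

    import numpy as np

    def forward_stepwise(score, all_features, max_features):
        """Greedily add the single most beneficial feature until nothing improves."""
        selected, best_score = [], float("-inf")
        remaining = list(all_features)
        while remaining and len(selected) < max_features:
            # evaluate the benefit of selecting each non-selected feature
            candidate = max(remaining, key=lambda f: score(selected + [f]))
            candidate_score = score(selected + [candidate])
            if candidate_score <= best_score:            # stopping criterion: no improvement
                break
            selected.append(candidate)
            remaining.remove(candidate)
            best_score = candidate_score
        return selected

    def correlation_filter(X, y, k):
        """Rank features by absolute correlation with the class label; keep the top k."""
        corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
        return list(np.argsort(corr)[::-1][:k])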

  16. Experiment 1: Bias-Variance of Feature Selection. Summary: 19 datasets; order features using feature selection (forward stepwise feature selection or correlation feature filtering, depending on dataset size); estimate bias & variance at multiple feature set sizes; 5-fold cross-validation. [Figure: Dataset Sizes. # Features vs. # Samples for the 19 datasets.]

  17. Case 1: No Improvement from Feature Selection. [Figure: covtype. MSE vs. # features (1 to 54), decomposed into variance and bias/noise, for a single decision tree and a bagged decision tree.]

  18. Case 2: FS Improves Non-Bagged Model. [Figure: medis. MSE vs. # features (1 to 63), decomposed into variance and bias/noise; the non-bagged model overfits with too many features.]

  19. Take Away Points. More features ⇒ lower bias/noise, higher variance. Feature selection does not improve bagged model performance (1 exception). The best subset size corresponds to the best bias/variance tradeoff point, which is algorithm dependent. Relevant features may be discarded if the variance increase outweighs the extra information.

  20–22. Why Does Bagging Benefit from so Many Features? [Figure: cryst. MSE vs. # features (1 to 1,341), decomposed into variance and bias/noise; successive build slides add markers highlighting regions of the curves.]

  23. Hypothesis: Bagging improves the base learner’s ability to benefit from weak, noisy features.

  24. Experiment 2: Noisy Informative Features. Summary: generate synthetic data (6 features); duplicate 1/2 of the features 20 times; corrupt X% of the values in the duplicated features; train single and bagged trees with the corrupted features plus the 3 non-duplicated features; compare to the ideal, unblemished feature set and to no noisy features (3 non-duplicated only).
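A rough sketch of this setup. The slide does not say how a value is corrupted; this sketch assumes corruption means replacing the value with a random draw from the same column (breaking its link to the label), and the column indices and function name are made up for illustration:

    import numpy as np

    def make_noisy_duplicates(X, dup_cols, keep_cols, n_copies=20, corrupt_frac=0.3, seed=0):
        """Stack the non-duplicated columns with n_copies noisy copies of the
        duplicated columns, corrupting corrupt_frac of the copied values."""
        rng = np.random.RandomState(seed)
        blocks = [X[:, keep_cols]]                        # the 3 non-duplicated features
        for _ in range(n_copies):
            block = X[:, dup_cols].copy()
            mask = rng.rand(*block.shape) < corrupt_frac  # corrupt X% of the duplicated values
            for j, c in enumerate(dup_cols):
                block[mask[:, j], j] = rng.choice(X[:, c], size=int(mask[:, j].sum()))
            blocks.append(block)
        return np.hstack(blocks)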

  25–27. Bagging Extracts More Info from Noisy Features. [Figure: damaged. MSE vs. fraction of feature values corrupted ('core', 0.0 to 1.0), decomposed into variance and bias/noise; reference lines mark the 6 original features (ideal) and the 3 non-duplicated features (baseline).]
