Chapter 6. Ensemble Methods
Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475
Introduction
◮ Have a base learner/algorithm; use multiple versions of it to form a final classifier (or regression model). Goal: improve over the base/weak learner (and others). Often the base learner is a simple tree (e.g. a stump).
◮ Includes bagging (§8.7), boosting (Chapter 10), random forest (Chapter 15). Others: Bayesian model averaging (Chapter 8); model averaging and stacking (§8.8); ARM (Yang, JASA), ...
Bagging
◮ Bootstrap Aggregation (Bagging) (§8.7).
◮ Training data: $D = \{(X_i, Y_i) \mid i = 1, \ldots, n\}$.
◮ A bootstrap sample is a random sample of size n drawn from D with replacement.
◮ Bagging regression (an R sketch is given below):
  1. Draw B bootstrap samples $D^*_1, \ldots, D^*_B$;
  2. fit a (base) model $\hat f^*_b(x)$ with $D^*_b$ for each $b = 1, \ldots, B$;
  3. the bagging estimate is $\hat f_B(x) = \sum_{b=1}^B \hat f^*_b(x) / B$.
◮ If $\hat f(x)$ is linear, then $\hat f_B(x) \to \hat f(x)$ as $B \to \infty$; but not in general.
◮ A surprise (Breiman 1996): $\hat f_B(x)$ can be much better than $\hat f(x)$, especially so if the base learner is not stable (e.g. a tree).
◮ Classification: same as regression, but
  1) $\hat G_B(x)$ = majority vote of $(\hat G^*_1(x), \ldots, \hat G^*_B(x))$; or
  2) if $\hat f(x) = (\hat\pi_1, \ldots, \hat\pi_K)'$, then $\hat f_B(x) = \sum_{b=1}^B \hat f^*_b(x)/B$ and $\hat G_B(x) = \arg\max_k \hat f_{B,k}(x)$.
  2) may be better than 1).
◮ Example: Fig 8.9, Fig 8.10.
◮ Why does bagging work? It reduces the variance of the base learner, which helps most when the base learner is unstable; but it does not always help, while it always increases the bias (Buja), which explains why bagging is sometimes not the best.
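A minimal R sketch of the bagging steps above, assuming the rpart package; bag_trees() and predict_bagged() are illustrative helper names, not from the course files:

library(rpart)

## Steps 1-2: draw B bootstrap samples and fit a regression tree to each.
bag_trees <- function(formula, data, B = 100) {
  n <- nrow(data)
  lapply(1:B, function(b) {
    idx <- sample(n, n, replace = TRUE)      # bootstrap sample D*_b
    rpart(formula, data = data[idx, ])       # base model f*_b
  })
}

## Step 3: the bagging estimate averages the B tree predictions.
predict_bagged <- function(fits, newdata) {
  rowMeans(sapply(fits, predict, newdata = newdata))
}

For classification, one would instead take a majority vote of the $\hat G^*_b(x)$, or average the class probabilities and take the arg max (method 2 above).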
[Figure 8.9 (ESL 2nd Ed., Chap 8): the original tree and trees fit to 11 bootstrap samples; the splitting variables and split points (e.g. x.1 < 0.395) vary across bootstrap samples.]
[Figure 8.10 (ESL 2nd Ed., Chap 8): Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and bagged trees as a function of the number of bootstrap samples. The orange points correspond to the consensus vote, while the green points average the probabilities.]
(8000) Bayesian Model Averaging
◮ §8.8; Hoeting et al (1999, Stat Sci).
◮ Suppose we have M models $M_m$, $m = 1, \ldots, M$.
◮ Suppose $\xi$ is the parameter of interest: given training data Z,
$$\Pr(\xi|Z) = \sum_{m=1}^M \Pr(\xi|M_m, Z)\,\Pr(M_m|Z), \qquad E(\xi|Z) = \sum_{m=1}^M E(\xi|M_m, Z)\,\Pr(M_m|Z).$$
◮ Need to specify models, priors, ...; complex!
$$\Pr(M_m|Z) \propto \Pr(M_m)\Pr(Z|M_m) = \Pr(M_m)\int \Pr(Z|M_m, \theta_m)\,\Pr(\theta_m|M_m)\,d\theta_m.$$
◮ An approximation:
$$BIC(M_m) = \log \Pr(Z|M_m, \hat\theta_m(Z)) - \log(n)\,p/2 \approx \log \Pr(M_m|Z) \ \text{(up to a constant, with equal prior probabilities)}.$$
◮ Hence, use weights $\propto \exp[BIC(M_m)]$.
◮ Buckland et al (1997, Biometrics): use AIC,
$$AIC(M_m) = \log \Pr(Z|M_m, \hat\theta_m(Z)) - p \approx E_{Z^*} \log \Pr(Z^*|M_m, \hat\theta_m(Z)).$$
◮ ARM (Yang 2001): use sample-splitting (or CV), $\log \Pr(Z^{ts}|M_m, \hat\theta_m(Z^{tr}))$.
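As a small illustration of BIC-based weights for candidate lm() fits (dat, y, x1, x2 are hypothetical; note R's BIC() returns $-2\log L + \log(n)\,p$, so exp(-BIC/2) plays the role of $\exp[BIC(M_m)]$ above, under equal prior model probabilities):

fits <- list(m1 = lm(y ~ x1,      data = dat),
             m2 = lm(y ~ x1 + x2, data = dat))
bic  <- sapply(fits, BIC)                # R's BIC = -2*logLik + log(n)*p
w    <- exp(-(bic - min(bic)) / 2)       # subtract min(bic) for stability
w    <- w / sum(w)                       # approximate Pr(M_m | Z)
pred <- sapply(fits, predict, newdata = dat) %*% w   # model-averaged prediction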
Stacking
◮ §8.8; Wolpert (1992, Neural Networks), Breiman (1996, ML).
◮ $\hat f(x) = \sum_{m=1}^M w_m \hat f_m(x)$, $w = (w_1, \ldots, w_M)'$.
◮ Ideally, if P is the distribution for (X, Y),
$$\hat w = \arg\min_w E_P\Big[Y - \sum_{m=1}^M w_m \hat f_m(X)\Big]^2.$$
◮ But P is unknown; use its empirical distribution:
$$\hat w = \arg\min_w \sum_{i=1}^n \Big[Y_i - \sum_{m=1}^M w_m \hat f_m(X_i)\Big]^2.$$
Good? Why? Think about best subset selection: judged on the training data, the most complex model tends to win, just as best subset selection by training error favors the largest subset.
◮ Stacking: let $\hat f_m^{-i}$ denote $\hat f_m$ fitted without $(X_i, Y_i)$; LOOCV.
$$\hat w^{st} = \arg\min_w \sum_{i=1}^n \Big[Y_i - \sum_{m=1}^M w_m \hat f_m^{-i}(X_i)\Big]^2.$$
◮ How? OLS; but QP if we impose $\hat w^{st}_m \ge 0$ and $\sum_{m=1}^M \hat w^{st}_m = 1$ (see the sketch below).
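A hedged R sketch of stacking with two lm() candidates (dat, y, x1, x2 hypothetical); unconstrained weights by OLS, with a note on the constrained QP version:

n <- nrow(dat)
cv_pred <- matrix(NA, n, 2)              # hat f_m^{-i}(X_i): LOOCV predictions
for (i in 1:n) {
  cv_pred[i, 1] <- predict(lm(y ~ x1,      data = dat[-i, ]), dat[i, ])
  cv_pred[i, 2] <- predict(lm(y ~ x1 + x2, data = dat[-i, ]), dat[i, ])
}
w_st <- coef(lm(dat$y ~ cv_pred - 1))    # OLS stacking weights, no intercept
## To impose w >= 0 and sum(w) = 1, solve a quadratic program instead
## (e.g. with quadprog::solve.QP).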
Adaptive Regression by Mixing
◮ Yang (2001, JASA).
◮ $\hat f(x) = \sum_{m=1}^M w_m \hat f_m(x)$, $w = (w_1, \ldots, w_M)'$.
◮ Key: how to estimate w?
◮ ARM:
  1. Partition the data into two parts, $D = D_1 \cup D_2$;
  2. use $D_1$ to fit the candidate models $\hat f_m(x; \hat\theta_m(D_1))$;
  3. use $D_2$ to estimate the weights via the predictive likelihood: $w_m \propto \prod_{i \in D_2} \hat f_m(Y_i | X_i; \hat\theta_m(D_1))$, where $\hat f_m(y|x;\cdot)$ is model m's predictive density.
◮ Note: AIC is asymptotically unbiased for the predictive
log-likelihood, so ARM ≈ ...?
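A sketch of ARM's split-and-weight recipe under Gaussian working models (an assumption; dat, y, x1, x2 are hypothetical, and the plug-in error SD is a simplification):

i1   <- sample(nrow(dat), nrow(dat) %/% 2)         # D1; the rest is D2
fits <- list(lm(y ~ x1, data = dat[i1, ]),
             lm(y ~ x1 + x2, data = dat[i1, ]))    # fit candidates on D1
logw <- sapply(fits, function(f) {
  mu <- predict(f, dat[-i1, ])
  sum(dnorm(dat$y[-i1], mu, summary(f)$sigma, log = TRUE))  # pred. log-lik on D2
})
w <- exp(logw - max(logw)); w <- w / sum(w)        # normalized ARM weights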
(8000) Other topics
◮ Model selection vs model mixing (averaging).
Theory: Yang (2003, Statistica Sinica); Shen & Huang (2006, JASA). My summary: if selection is easy (one model clearly dominates), use the former; o/w use the latter. Applications: to testing in genomics and genetics (Newton et al 2007, Ann Appl Stat; Pan et al 2014, Genetics).
◮ Generalize model averaging to input-dependent weighting:
wm = wm(x). Pan et al (2006, Stat Sinica).
◮ Generalize model selection to “localized model selection”
(Yang 2008, Econometric Theory).
◮ Model selection: AIC or BIC or CV? LOOCV or k-fold CV?
Zhang & Yang (2015, J Econometrics).
Random Forest
◮ RF (Chapter 15); by Breiman (2001).
◮ Main idea: similar to bagging,
  1) use bootstrap samples to generate many trees;
  2) in generating each tree:
    i) at each node, rather than using the best splitting variable among all the predictors, use the best one out of a random subset of predictors (the subset size m is a tuning parameter to be determined by the user, though results are not too sensitive to it); typically $m \sim \sqrt{p}$;
    ii) grow each tree to the maximum size; no pruning.
◮ Why do so?
  1) Better base trees improve the performance;
  2) correlations among the base trees degrade the performance.
  Reducing m decreases the correlations (but also the performance of each tree); the variance identity below makes the trade-off explicit.
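A standard variance calculation (ESL §15.2) quantifies this: for B identically distributed trees $T_b(x)$, each with variance $\sigma^2$ and pairwise correlation $\rho$,
$$\mathrm{Var}\Big[\frac{1}{B}\sum_{b=1}^B T_b(x)\Big] = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2 \;\to\; \rho\,\sigma^2 \quad (B \to \infty),$$
so for large B only the first term remains: decreasing m lowers $\rho$ (helps) but may increase each tree's variance or bias (hurts).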
◮ Output: gives an OOB (out-of-bag) estimate of the prediction error.
  Some obs's do not appear in a given bootstrap sample, so they can be treated as test data for the base tree trained on that bootstrap sample!
◮ Output: gives a measure of the importance of each predictor.
  1) Use the original data to get an OOB error estimate $e_0$;
  2) permute the values of $x_j$ across obs's, then use the permuted data to get an OOB estimate $e_j$;
  3) the importance of $x_j$ is defined as $e_j - e_0$.
◮ RF can handle large datasets, and can do clustering! ◮ Example code: ex6.1.R
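For orientation (ex6.1.R is the course file; this is only a minimal illustration with the randomForest package, with dat and y hypothetical):

library(randomForest)
rf <- randomForest(y ~ ., data = dat, ntree = 500,
                   mtry = floor(sqrt(ncol(dat) - 1)),   # m ~ sqrt(p)
                   importance = TRUE)
print(rf)        # reports the OOB estimate of the prediction error
importance(rf)   # permutation-based importance, in the spirit of e_j - e_0
varImpPlot(rf)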
[Figure 15.1 (ESL 2nd Ed., Chap 15): Bagging, random forest, and gradient boosting applied to the spam data (test error vs. number of trees). For boosting, 5-node trees were used, and the number of trees was chosen by 10-fold cross-validation (2500 trees). Each "step" in the figure corresponds to a change in a single misclassification (in a test set of 1536).]
[Figure 15.3 (ESL 2nd Ed., Chap 15): Random forests compared to gradient boosting on the California housing data. The curves represent mean absolute error on the test data as a function of the number of trees. Two random forests are shown, with m = 2 and m = 6. The two gradient boosted models use a shrinkage parameter ν = 0.05 in (10.41), and have interaction depths of 4 and 6. The boosted models outperform random forests.]
[Figure 15.4 (ESL 2nd Ed., Chap 15): OOB error computed on the spam training data, compared to the test error computed on the test set, as a function of the number of trees.]
Boosting
◮ Chapter 10.
◮ AdaBoost: proposed by Freund and Schapire (1997).
◮ Main idea (see Fig 10.1):
  1. Fit multiple models using weighted samples;
  2. misclassified obs's are weighted more and more;
  3. combine the multiple models by weighted majority voting.
◮ Training data: {(Yi, Xi)|i = 1, ..., n} and Yi = ±1.
[Figure 10.1 (ESL 2nd Ed., Chap 10): Schematic of AdaBoost. Classifiers $G_1(x), G_2(x), \ldots, G_M(x)$ are trained on successively reweighted versions of the dataset, and then combined to produce the final classifier $G(x) = \mathrm{sign}\big[\sum_{m=1}^M \alpha_m G_m(x)\big]$.]
Alg 10.1 AdaBoost
  1. Initialize $w_i = 1/n$ for $i = 1, \ldots, n$.
  2. For $m = 1$ to $M$:
    2.1 Fit a classifier $G_m(x)$ to the training data with weights $w_i$;
    2.2 $err_m = \sum_{i=1}^n w_i I(Y_i \neq G_m(X_i)) \big/ \sum_{i=1}^n w_i$;
    2.3 $\alpha_m = \log[(1 - err_m)/err_m]$;
    2.4 set $w_i \leftarrow w_i \exp[\alpha_m I(Y_i \neq G_m(X_i))]$, $i = 1, \ldots, n$.
  3. Output $G(x) = \mathrm{sign}\big[\sum_{m=1}^M \alpha_m G_m(x)\big]$.
◮ Example: use stumps (trees with only two terminal nodes) as the base learner; $X_i \sim$ iid $N_{10}(0, I)$, $Y_i = 1$ if $\|X_i\|_2^2 > \chi^2_{10}(0.5) = 9.34$ and $Y_i = -1$ o/w; $n_{tr} = 1000 + 1000$, $n_{ts} = 10{,}000$. Fig 10.2. (A sketch of this setup follows below.)
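A from-scratch sketch of Alg 10.1 with rpart stumps on data generated as in (10.2); an illustration only, not the course code (M and the rpart controls are chosen for brevity):

library(rpart)
set.seed(1)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- ifelse(rowSums(X^2) > qchisq(0.5, p), 1, -1)   # qchisq(0.5, 10) = 9.34
dat <- data.frame(y = factor(y), X)

M <- 100; w <- rep(1/n, n); alpha <- numeric(M); stump <- vector("list", M)
for (m in 1:M) {
  stump[[m]] <- rpart(y ~ ., data = dat, weights = w,   # 2.1: weighted fit
                      control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  pred <- ifelse(predict(stump[[m]], dat, type = "class") == "1", 1, -1)
  err  <- sum(w * (pred != y)) / sum(w)                 # 2.2: err_m
  alpha[m] <- log((1 - err) / err)                      # 2.3: alpha_m
  w <- w * exp(alpha[m] * (pred != y))                  # 2.4: upweight mistakes
}
## 3. G(x) = sign(sum_m alpha_m G_m(x)); training error shown here.
score <- rowSums(sapply(1:M, function(m)
  alpha[m] * ifelse(predict(stump[[m]], dat, type = "class") == "1", 1, -1)))
mean(sign(score) != y)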
◮ Puzzles:
  1) AdaBoost worked really well! Why?
  2) No over-fitting? Even after the training error reaches 0, the test error still goes down.
[Figure 10.2 (ESL 2nd Ed., Chap 10): Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rate for a single stump, and for a 244-node classification tree.]
Forward Stagewise Additive Modeling
◮ $f(x) = \sum_{m=1}^M \beta_m b_m(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$. Estimate each $(\beta_m, \gamma_m)$ stagewise (sequentially).
◮ Algorithm 10.2: FSAM
  1) Initialize $f_0(x) = 0$;
  2) for m = 1 to M:
    2.a) $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^n L(Y_i, f_{m-1}(X_i) + \beta b(X_i; \gamma))$;
    2.b) set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.
◮ Exponential loss: $Y \in \{-1, 1\}$, $L(Y, f(x)) = \exp(-Y f(x))$.
◮ Stat contribution: AdaBoost = FSAM using the exponential loss function! Why important?
[Figure 10.3 (ESL 2nd Ed., Chap 10): Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss $(1/N)\sum_{i=1}^N \exp(-y_i f(x_i))$. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.]
◮ Why exponential loss?
$$f^*(x) = \arg\min_{f(x)} E_{Y|x} \exp(-Y f(x)) = \frac{1}{2} \log \frac{\Pr(Y=1|x)}{\Pr(Y=-1|x)}.$$
This explains why we use $\mathrm{sign}(\hat f(x))$ to do prediction.
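The minimizer follows from one differentiation step: writing $p = \Pr(Y = 1|x)$,
$$E_{Y|x}\, e^{-Y f(x)} = p\, e^{-f(x)} + (1-p)\, e^{f(x)},$$
and setting the derivative with respect to $f(x)$ to zero gives $-p\,e^{-f(x)} + (1-p)\,e^{f(x)} = 0$, i.e. $e^{2f(x)} = p/(1-p)$, so $f^*(x) = \frac{1}{2}\log\frac{p}{1-p}$. In particular $f^*(x) > 0$ exactly when $\Pr(Y = 1|x) > 1/2$, which is why $\mathrm{sign}(\hat f(x))$ is the natural prediction.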
◮ AdaBoost estimates $f^*(x)$ stagewise.
◮ Other loss functions: Fig 10.4
  Misclassification: $I(\mathrm{sign}(f) \neq y)$;
  squared error: $(y - f)^2$;
  binomial deviance: $\log[1 + \exp(-2yf)]$;
  hinge loss (SVM): $(1 - yf)\,I(yf < 1) = (1 - yf)_+$.
◮ Loss functions for regression: Fig 10.5
[Figure 10.4 (ESL 2nd Ed., Chap 10): Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are misclassification: I(sign(f) ≠ y); exponential: exp(−yf); binomial deviance: log(1 + exp(−2yf)); squared error: (y − f)²; and support vector: (1 − yf)₊ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).]
[Figure 10.5 (ESL 2nd Ed., Chap 10): A comparison of three loss functions for regression (squared error, absolute error, Huber), plotted as a function of the margin y − f. The Huber loss function combines the good properties of squared-error loss near zero and absolute-error loss when |y − f| is large.]
Boosting trees
◮ Each $f_m(x; \gamma) = T(x; \theta)$ is a tree.
◮ Gradient boosting: Alg 10.3. Also called MART; in the R package gbm; Weka: http://www.cs.waikato.ac.nz/ml/weka/index.html
◮ Can perform better than AdaBoost; Fig 10.9 ◮ And, more flexible: can be extended to K > 2 classes,
regression... Q: Is it possible to apply a binary classifier to a K-class problem with K > 2?
◮ Regularization/shrinkage: Fig 10.11
$$f_m(x) = f_{m-1}(x) + \nu\, T(x; \theta_m), \quad 0 < \nu \leq 1,$$
with ν the shrinkage parameter (as in the Figure 15.3 caption above).
◮ Relative importance of predictors: Fig 10.6
how often used in the trees as a splitting var and how much it improves fitting/prediction.
◮ Example code: ex6.2.R
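For orientation (ex6.2.R is the course file; below is only a minimal illustration with the gbm package, with dat and the 0/1-coded response y01 hypothetical):

library(gbm)
fit <- gbm(y01 ~ ., data = dat, distribution = "bernoulli",  # y01 coded 0/1
           n.trees = 2000, interaction.depth = 2,
           shrinkage = 0.05, cv.folds = 5)
best <- gbm.perf(fit, method = "cv")   # number of trees chosen by CV
summary(fit, n.trees = best)           # relative importance of predictors
p1   <- predict(fit, newdata = dat, n.trees = best, type = "response")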
[Figure 10.9 (ESL 2nd Ed., Chap 10): Boosting with different sized trees (stumps, 10-node, 100-node, AdaBoost), applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.]
[Figure 10.11 (ESL 2nd Ed., Chap 10): Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART) trained with binomial deviance. Four panels: test-set deviance and misclassification error for stumps (no shrinkage vs. shrinkage = 0.2) and for 6-node trees (no shrinkage vs. shrinkage = 0.6).]
[Figure 10.6 (ESL 2nd Ed., Chap 10): Relative importance of the predictors for the spam data; the most important include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you.]
Boosting vs Forward Stagewise Reg
◮ Forward stagewise univariate linear regression ≈ Lasso; Alg 3.4, p. 86.
◮ "Boosting as a regularized ... classifier." (Rosset et al 2004).
[Figure 3.19 (ESL 2nd Ed., Chap 3): Coefficient profiles for the prostate data (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). The left panel shows incremental forward stagewise regression FSε with step size ε = 0.01. The right panel shows the infinitesimal version FS0 obtained by letting ε → 0, fit by modification 3.2b to the LAR Algorithm 3.2. In this example the FS0 profiles are monotone, and hence identical to those of lasso and LAR.]