

SLIDE 1

Bagging and Random Forests

David S. Rosenberg

New York University

April 10, 2018

David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 April 10, 2018 1 / 29

SLIDE 2

Contents

1. Ensemble Methods: Introduction
2. The Benefits of Averaging
3. Review: Bootstrap
4. Bagging
5. Random Forests

SLIDE 3

Ensemble Methods: Introduction

SLIDE 4

Ensembles: Parallel vs Sequential

Ensemble methods combine multiple models.

Parallel ensembles: each model is built independently.

e.g. bagging and random forests. Main idea: combine many (high complexity, low bias) models to reduce variance.

Sequential ensembles:

Models are generated sequentially. Try to add new models that do well where previous models fall short.

SLIDE 5

The Benefits of Averaging

SLIDE 6

A Poor Estimator

Let Z, Z1, ..., Zn be i.i.d. with EZ = µ and Var Z = σ². We could use any single Zi to estimate µ. Performance? It is unbiased: EZi = µ. The standard error of this estimator is σ.

The standard error is the standard deviation of the sampling distribution of a statistic: SD(Zi) = √Var(Zi) = √σ² = σ.

SLIDE 7

Variance of a Mean

Let Z, Z1, ..., Zn be i.i.d. with EZ = µ and Var Z = σ². Let's consider the average of the Zi's.

The average has the same expected value but smaller standard error:

E[(1/n) ∑_{i=1}^n Zi] = µ    Var[(1/n) ∑_{i=1}^n Zi] = σ²/n.

Clearly the average is preferred to a single Zi as an estimator of µ. Can we apply this to reduce the variance of general prediction functions?
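The σ²/n variance reduction is easy to check empirically. Below is a minimal simulation sketch (the function name `sample_mean_se` and all parameter values are illustrative, not from the slides): it estimates the standard error of the mean of n i.i.d. draws and compares n = 1 against n = 25.

```python
import random

def sample_mean_se(mu, sigma, n, trials=20000, seed=0):
    """Empirical standard error (SD over repeated experiments) of the
    mean of n i.i.d. Normal(mu, sigma) draws."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(mu, sigma) for _ in range(n)) / n
             for _ in range(trials)]
    m = sum(means) / trials
    return (sum((x - m) ** 2 for x in means) / trials) ** 0.5

# Theory: SE = sigma / sqrt(n), so averaging 25 draws should cut the
# standard error of a single draw (sigma = 2) by a factor of 5.
se1 = sample_mean_se(mu=0.0, sigma=2.0, n=1)
se25 = sample_mean_se(mu=0.0, sigma=2.0, n=25)
```

With these settings, se1 comes out near σ = 2 and se25 near σ/√25 = 0.4.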

SLIDE 8

Averaging Independent Prediction Functions

Suppose we have B independent training sets from the same distribution. The learning algorithm gives B decision functions: f̂1(x), f̂2(x), ..., f̂B(x).

Define the average prediction function as f̂avg = (1/B) ∑_{b=1}^B f̂b.

What's random here? The B independent training sets are random, which gives rise to variation among the f̂b's.

SLIDE 9

Averaging Independent Prediction Functions

Fix some particular x0 ∈ X. Then the average prediction on x0 is f̂avg(x0) = (1/B) ∑_{b=1}^B f̂b(x0).

Consider f̂avg(x0) and f̂1(x0), ..., f̂B(x0) as random variables, since the training sets were random.

We have no idea about the distributions of f̂1(x0), ..., f̂B(x0) – they could be crazy... But we do know that f̂1(x0), ..., f̂B(x0) are i.i.d. And that's all we need here...

SLIDE 10

Averaging Independent Prediction Functions

The average prediction on x0 is f̂avg(x0) = (1/B) ∑_{b=1}^B f̂b(x0).

f̂avg(x0) and f̂b(x0) have the same expected value, but f̂avg(x0) has smaller variance:

Var(f̂avg(x0)) = (1/B²) Var(∑_{b=1}^B f̂b(x0)) = (1/B) Var(f̂1(x0)).
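A toy simulation can confirm the 1/B factor. In this sketch (the "learning algorithm" and all names are invented for illustration), each fitted model simply predicts its training-set mean, so its prediction at any fixed x0 varies across training sets; averaging B independently trained models should shrink that variance by roughly a factor of B.

```python
import random

def train_predictor(rng, n=10):
    """Toy 'learning algorithm': the fitted model predicts the
    training-set mean, i.e. a constant function of x."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return sum(data) / n

def prediction_variance(B, trials=5000, seed=0):
    """Variance (over training sets) of the B-model average prediction
    at a fixed query point x0."""
    rng = random.Random(seed)
    preds = [sum(train_predictor(rng) for _ in range(B)) / B
             for _ in range(trials)]
    m = sum(preds) / trials
    return sum((p - m) ** 2 for p in preds) / trials

v1 = prediction_variance(B=1)    # Var of a single model's prediction
v10 = prediction_variance(B=10)  # should be roughly 10x smaller
```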

SLIDE 11

Averaging Independent Prediction Functions

Using f̂avg = (1/B) ∑_{b=1}^B f̂b seems like a win. But in practice we don't have B independent training sets... Instead, we can use the bootstrap...

SLIDE 12

Review: Bootstrap

SLIDE 13

The Bootstrap Sample

Definition: A bootstrap sample from Dn is a sample of size n drawn with replacement from Dn.

In a bootstrap sample, some elements of Dn will show up multiple times, and some won't show up at all.

Each Xi has probability (1 − 1/n)^n of not being selected. Recall from analysis that for large n, (1 − 1/n)^n ≈ 1/e ≈ .368.

So we expect ~63.2% of the elements of Dn to show up at least once.
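The 63.2% figure is straightforward to verify by simulation. A small sketch (the function name and sample sizes are illustrative): draw bootstrap index samples and measure the average fraction of distinct original points.

```python
import random

def unique_fraction(n, trials=500, seed=0):
    """Average fraction of distinct original points appearing in a
    bootstrap sample of size n drawn with replacement."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        distinct = {rng.randrange(n) for _ in range(n)}  # sampled indices
        total += len(distinct) / n
    return total / trials

frac = unique_fraction(n=1000)  # theory: 1 - (1 - 1/n)^n ≈ 0.632
```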

SLIDE 14

The Bootstrap Method

Definition: The bootstrap method simulates having B independent samples from P by taking B bootstrap samples from the sample Dn.

Given original data Dn, compute B bootstrap samples D^1_n, ..., D^B_n. For each bootstrap sample, compute some function φ(D^1_n), ..., φ(D^B_n). Work with these values as though D^1_n, ..., D^B_n were i.i.d. samples from P.

Amazing fact: things often come out very close to what we'd get with independent samples from P.
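A classic use of this recipe is estimating the standard error of a statistic: take φ to be the statistic itself, and compute the standard deviation of φ(D^1_n), ..., φ(D^B_n). A minimal sketch (names and parameters are illustrative, not from the slides):

```python
import random

def bootstrap_se(data, stat, B=2000, seed=0):
    """Bootstrap estimate of the standard error of `stat`: compute
    the statistic on each of B bootstrap samples and return the
    standard deviation of those B values."""
    rng = random.Random(seed)
    n = len(data)
    vals = []
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        vals.append(stat(resample))
    m = sum(vals) / B
    return (sum((v - m) ** 2 for v in vals) / B) ** 0.5

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
# For the sample mean, the bootstrap SE should land near sigma/sqrt(n) ≈ 0.071.
se_mean = bootstrap_se(data, lambda xs: sum(xs) / len(xs))
```

The same function works unchanged for statistics with no closed-form standard error (median, trimmed mean, etc.), which is where the bootstrap earns its keep.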

SLIDE 15

Bagging

SLIDE 16

Bagging

Draw B bootstrap samples D^1, ..., D^B from the original data D. Let f̂1, f̂2, ..., f̂B be the prediction functions trained on each sample. The bagged prediction function is a combination of these:

f̂avg(x) = Combine(f̂1(x), f̂2(x), ..., f̂B(x)).

How might we combine prediction functions for regression? binary class predictions? binary probability predictions? multiclass predictions?

Bagging was proposed by Leo Breiman (1996).
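Here is one plausible set of answers to the questions above, as a sketch (these Combine implementations are illustrative choices, not prescribed by the slides): average for regression, consensus vote for class labels, and averaged probabilities thresholded for binary probability predictions.

```python
from collections import Counter

def combine_regression(preds):
    """Regression: average the real-valued predictions."""
    return sum(preds) / len(preds)

def combine_vote(preds):
    """Class labels (binary or multiclass): take the consensus
    (majority) class among the B predictions."""
    return Counter(preds).most_common(1)[0][0]

def combine_proba(preds, threshold=0.5):
    """Binary probabilities: average them, then threshold to a label."""
    p = sum(preds) / len(preds)
    return 1 if p >= threshold else -1
```

Note that voting and probability-averaging can disagree on the same ensemble; the HTF comparison on the next slides looks at exactly that difference.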

SLIDE 17

Bagging for Regression

Draw B bootstrap samples D^1, ..., D^B from the original data D. Let f̂1, f̂2, ..., f̂B : X → R be the prediction functions for each sample. The bagged prediction function is given by

f̂bag(x) = (1/B) ∑_{b=1}^B f̂b(x).

Empirically, f̂bag often performs similarly to what we'd get from training on B independent samples:

f̂bag(x) has the same expectation as f̂1(x), but f̂bag(x) has smaller variance than f̂1(x).
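The whole regression pipeline fits in a few lines. Below is a self-contained sketch (the stump base learner and all names are illustrative choices, not from the slides) that bags depth-1 regression trees on 1-D toy data.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D inputs by brute-force
    search over split points, minimizing squared error."""
    best = None
    for s in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= s]
        right = [y for x, y in zip(xs, ys) if x > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda x: lm if x <= s else rm

def bag_regressor(xs, ys, B=25, seed=0):
    """Train B stumps on bootstrap samples; return the averaged predictor."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap indices
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in stumps) / B

# Toy step-function data: y jumps from 0 to 1 at x = 0.5.
xs = [i / 20 for i in range(21)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
f_bag = bag_regressor(xs, ys)
```

The bagged predictor is a smoothed step: predictions well left of the jump stay near 0, well right of it near 1.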

SLIDE 18

Out-of-Bag Error Estimation

Each bagged predictor is trained on about 63% of the data. The remaining 37% are called out-of-bag (OOB) observations.

For the ith training point, let Si = {b | D^b does not contain the ith point}.

The OOB prediction on xi is f̂OOB(xi) = (1/|Si|) ∑_{b∈Si} f̂b(xi).

The OOB error is a good estimate of the test error. OOB error is similar to cross-validation error – both are computed on the training set.
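Computing the sets Si is the bookkeeping at the heart of OOB estimation. A sketch (names are illustrative): each point should be out-of-bag for about 36.8% of the B bags.

```python
import random

def oob_sets(n, B, seed=0):
    """Draw B bootstrap index samples of size n and return, for each
    training point i, the set S_i of bags whose sample omits point i."""
    rng = random.Random(seed)
    bag_sets = [{rng.randrange(n) for _ in range(n)} for _ in range(B)]
    return [{b for b in range(B) if i not in bag_sets[b]} for i in range(n)]

S = oob_sets(n=200, B=50)
# Each point should be out-of-bag for about 0.368 * 50 ≈ 18 bags.
avg_oob = sum(len(s) for s in S) / len(S)
```

Averaging only the models in Si then gives the OOB prediction f̂OOB(xi), and the OOB error is the loss of those predictions over the training set.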

SLIDE 19

Bagging Classification Trees

Input space X = R^5 and output space Y = {−1, 1}. Sample size n = 30.

Each bootstrap tree is quite different: a different splitting variable appears at the root. This high degree of variability from small perturbations of the training data is why tree methods are described as high variance.

From HTF, Figure 8.9.
SLIDE 20

Comparing Classification Combination Methods

Two ways to combine classifications: consensus class or average probabilities.

From HTF, Figure 8.10.
SLIDE 21

Terms “Bias” and “Variance” in Casual Usage (Warning! Confusion Zone!)

Restricting the hypothesis space F "biases" the fit away from the best possible fit of the training data, and towards a [usually] simpler model.

Full, unpruned decision trees have very little bias. Pruning decision trees introduces a bias.

Variance describes how much the fit changes across different random training sets. If different random training sets give very similar fits, then the algorithm has high stability. Decision trees are found to be high variance (i.e. not very stable).

SLIDE 22

Conventional Wisdom on When Bagging Helps

The hope is that bagging reduces variance without making the bias worse. The general sentiment is that bagging helps most with

relatively unbiased base prediction functions, and

high variance / low stability, i.e. small changes in the training set can cause large changes in predictions.

It is hard to find clear and convincing theoretical results on this, but following this intuition leads to improved ML methods, e.g. random forests.

SLIDE 23

Random Forests

SLIDE 24

Recall the Motivating Principle of Bagging

Averaging f̂1, ..., f̂B reduces variance, if they're based on i.i.d. samples from P_{X×Y}.

Bootstrap samples are independent samples from the training set, but they are not independent samples from P_{X×Y}.

This dependence limits the amount of variance reduction we can get. It would be nice to reduce the dependence between the f̂i's...

SLIDE 25

Random Forest

Main idea of random forests: use bagged decision trees, but modify the tree-growing procedure to reduce the dependence between trees.

Key step in random forests: when constructing each tree node, restrict the choice of splitting variable to a randomly chosen subset of features of size m.

Typically we choose m ≈ √p, where p is the number of features. We can also choose m by cross-validation.
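The key step amounts to one line of sampling per node. A sketch (the function name is illustrative): with p = 16 features, the common default gives m = 4 candidate splitting variables at each node.

```python
import random

def candidate_features(p, m, rng):
    """At each tree node, sample m of the p features (without
    replacement) as the candidate splitting variables."""
    return rng.sample(range(p), m)

rng = random.Random(0)
p = 16
m = round(p ** 0.5)  # the common default m ≈ sqrt(p)
feats = candidate_features(p, m, rng)
```

A fresh subset is drawn at every node, not once per tree, which is what breaks the tendency of all trees to split on the same dominant feature.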

SLIDE 26

Random Forest

The usual approach is to build very deep trees (low bias). Diversity among the individual tree prediction functions comes from

bootstrap samples (somewhat different training data), and randomized tree building.

Bagging seems to work better when we are combining a diverse set of prediction functions.

SLIDE 27

Random Forest: The Effect of m

From An Introduction to Statistical Learning, with Applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie, and R. Tibshirani.

SLIDE 28

Appendix

SLIDE 29

Variance of a Mean of Correlated Variables

For Z, Z1, ..., Zn i.i.d. with EZ = µ and Var Z = σ²,

E[(1/n) ∑_{i=1}^n Zi] = µ    Var[(1/n) ∑_{i=1}^n Zi] = σ²/n.

What if the Z's are correlated? Suppose Corr(Zi, Zj) = ρ for all i ≠ j. Then

Var[(1/n) ∑_{i=1}^n Zi] = ρσ² + ((1 − ρ)/n) σ².

For large n, the ρσ² term dominates – it limits the benefit of averaging.
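This formula is easy to sanity-check by simulation. The sketch below (construction and names are illustrative) builds equicorrelated standard normals from a shared factor W: Zi = √ρ·W + √(1−ρ)·εi has unit variance and pairwise correlation ρ, so with n = 50 and ρ = 0.3 the theory predicts Var ≈ 0.3 + 0.7/50 = 0.314.

```python
import random

def var_of_correlated_mean(n, rho, trials=10000, seed=0):
    """Simulated variance of the mean of n equicorrelated standard
    normals: Z_i = sqrt(rho)*W + sqrt(1-rho)*eps_i for a shared W."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    means = []
    for _ in range(trials):
        w = rng.gauss(0.0, 1.0)
        zs = [a * w + b * rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(zs) / n)
    m = sum(means) / trials
    return sum((x - m) ** 2 for x in means) / trials

# Theory: rho*sigma^2 + (1 - rho)*sigma^2/n = 0.3 + 0.7/50 = 0.314.
v = var_of_correlated_mean(n=50, rho=0.3)
```

Even as n grows, the variance floor ρσ² remains, which is exactly the motivation for decorrelating the trees in random forests.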
