
SLIDE 1

Random Forests What, Why, And How

Andy Liaw Biometrics Research, Merck & Co., Inc. andy_liaw@merck.com

SLIDE 2

Outline

  • Brief description of random forests
  • Why does it work?
  • Tuning random forests
  • Comparison with other algorithms
  • Gravy
  • Wrap up
SLIDE 3

CART

  • Find best “gap” in a variable to split the data into two parts
  • Repeat until futile
  • Naturally handles categorical and numerical variables
  • Very greedy algorithm => unstable
  • Algorithm can be parallelized at different levels
  • Finding “right-sized” tree requires cross-validation
  • Generally not very accurate
SLIDE 4

Example: Iris Data
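As a stand-in for the slide's iris figure, here is a minimal R sketch (using the rpart package, an assumption not named on the slide) that fits a single CART tree to the iris data and shows the cross-validation table used to pick the "right-sized" tree:

```r
## Minimal sketch: one CART tree on the iris data via rpart.
library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)      # the chosen splits (e.g., on petal measurements)
printcp(tree)    # cross-validated complexity table used for pruning
```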

SLIDE 5

Random Forests

(Diagram: grow each tree on a randomly sampled subset of the rows; at each node, randomly subset the columns and pick the best column to split the data into two parts; repeat for many trees.)
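As a concrete counterpart to the diagram, a minimal sketch with the randomForest R package (the iris data stand in for the Y/X1–X8 table above):

```r
## Minimal sketch: 500 trees, each grown on a random sample of the rows;
## at every node, mtry randomly chosen columns are tried for the split.
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
print(rf)    # OOB error estimate and confusion matrix
```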

SLIDE 6

(Figure: Random Forest vs. CART)

SLIDE 7

Why Is RF Popular?

  • Inherits many advantages of CART: requires very little data preprocessing
  • High performance
  • “Does not overfit”
SLIDE 8

How Ensemble Models Work

  • Every model needs to be better than random guessing
  • Try to have different models make mistakes on different data

(Table: true label Y and the predictions of Model1 through Model5; individually the models are 67%, 83%, 67%, 67%, and 50% accurate, while the majority-vote aggregate is 100% accurate.)
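A small simulation in the same spirit (hypothetical data, not the table from the slide): five weak classifiers that make their mistakes on different observations, combined by majority vote:

```r
## Each model is right ~70% of the time, but on different cases,
## so the majority vote is right far more often.
set.seed(42)
y <- rep(1, 100)                             # true labels
preds <- sapply(1:5, function(i) {
  p <- y
  p[sample(100, 30)] <- 0                    # each model errs on a different 30%
  p
})
colMeans(preds == y)                         # individual accuracies ~0.70
vote <- as.integer(rowMeans(preds) > 0.5)    # majority vote
mean(vote == y)                              # typically well above 0.70
```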

SLIDE 9

Correlation and Strength

  • Correlation: how similar the base predictors are
  • Strength: how accurate the base predictors are
  • Low correlation => high diversity
  • Correlation comes with sufficiently high strength
  • Find a good compromise between the two
  • Small mtry promotes diversity
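One way to see this trade-off directly (a sketch on simulated data, not from the slides): compare the average pairwise correlation of individual tree predictions as mtry grows:

```r
## Smaller mtry -> less correlated (more diverse) trees.
library(randomForest)

set.seed(7)
x <- data.frame(matrix(runif(200 * 10), 200, 10))
y <- rowSums(x[, 1:5]) + rnorm(200)
test <- data.frame(matrix(runif(100 * 10), 100, 10))

tree_cor <- function(m) {
  rf  <- randomForest(x, y, ntree = 100, mtry = m)
  ind <- predict(rf, test, predict.all = TRUE)$individual  # per-tree predictions
  cc  <- cor(ind)
  mean(cc[upper.tri(cc)])    # average pairwise correlation between trees
}
sapply(c(2, 5, 10), tree_cor)    # correlation typically rises with mtry
```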

SLIDE 10

Nearest Neighbor Classifier

  • Terminal nodes in a decision tree represent groups of similar data; the sizes of the neighborhoods are decided from the training data (hyper-rectangular regions)
  • RF averages terminal nodes from many trees, so neighborhoods are varied, “smoothing out” the crude neighborhoods of a single tree

(Figure: CART vs. RF neighborhoods)

SLIDE 11

What Controls RF’s Model Complexity?

  • Viewed as an adaptive weighted NN: increasing the number of trees makes the weights “smoother”

  • Sizes of neighborhoods can also be an indicator of model complexity
  • There is evidence that smaller trees in RF work better for some data
  • Given the same data, smaller trees ⇔ larger neighborhoods
SLIDE 12

Median Correlation vs. Median RMSE

(Figure: median correlation vs. median RMSE across tuning grids: nodesize = 3, 5, 7, 15, 30 with mtry = 3, 6, 9, and sampsize = 20%, 30%, …, 80% with mtry = 3, 6, 9. One end of each grid gives higher diversity but worse accuracy; the other gives higher accuracy but less diversity.)

SLIDE 13

Tuning RF

  • Use mtry to balance correlation and strength
  • A larger nodesize forces the algorithm to produce smaller trees, thus larger neighborhoods
  • A smaller sampsize also induces smaller trees and makes trees more diverse (but should be used with a larger number of trees); see the sketch below
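A small grid search over these knobs, scored by OOB error, might look like the following sketch (simulated data assumed; grid values are illustrative):

```r
## Compare OOB MSE across a grid of mtry, nodesize, and sampsize.
library(randomForest)

set.seed(1)
dat <- data.frame(matrix(runif(500 * 10), 500, 10))
dat$y <- rowSums(dat[, 1:5]) + rnorm(500)

grid <- expand.grid(mtry     = c(3, 6, 9),
                    nodesize = c(5, 15, 30),
                    sampsize = round(c(0.3, 0.6) * nrow(dat)))
grid$oob_mse <- apply(grid, 1, function(g) {
  rf <- randomForest(y ~ ., data = dat, ntree = 300, mtry = g["mtry"],
                     nodesize = g["nodesize"], sampsize = g["sampsize"])
  tail(rf$mse, 1)    # OOB mean squared error after the last tree
})
head(grid[order(grid$oob_mse), ])    # best settings first
```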
SLIDE 14

Simulated Example: Friedman #1

(Figure: Friedman #1 results across tuning grids: nodesize = 3, 5, 7, 15, 30 with mtry = 3, 6, 9; sampsize = 20%, 30%, …, 80% with mtry = 3, 6, 9; and sampsize = 20%, 30%, …, 80% with nodesize = 3, 5, 7, 15 at mtry = 6.)
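Friedman #1 has 10 uniform predictors of which only the first 5 matter: y = 10·sin(π·x1·x2) + 20·(x3 − 0.5)² + 10·x4 + 5·x5 + noise. A sketch of generating it and fitting an RF (mlbench::mlbench.friedman1 offers the same generator):

```r
## Friedman #1 benchmark: only X1-X5 carry signal, X6-X10 are noise.
library(randomForest)

set.seed(2)
n <- 1000
X <- data.frame(matrix(runif(n * 10), n, 10))
y <- 10 * sin(pi * X$X1 * X$X2) + 20 * (X$X3 - 0.5)^2 +
     10 * X$X4 + 5 * X$X5 + rnorm(n)

rf <- randomForest(X, y, ntree = 500, mtry = 6, importance = TRUE)
tail(rf$mse, 1)    # OOB MSE
varImpPlot(rf)     # X1-X5 should dominate, X6-X10 near zero
```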

SLIDE 15

“Does Not Overfit”

  • Definition of “overfitting”?
  • Very different from something like Neural Networks
  • Early stopping: monitor the difference between training and validation errors; divergence of the two is seen as overfitting
  • RF grows each tree to maximum size, thus has nearly 0 training error
  • For boosting, test set error can keep decreasing as iterations go on even after training error has reached 0! (But it will eventually increase: can't boost forever)
  • Bottom line: a gap between training error and test set error does not necessarily indicate overfitting; increasing test set error with increasing model complexity does
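A quick way to see this behavior (a sketch on simulated data): track the test-set error as trees are added; with RF it levels off rather than turning upward:

```r
## Test-set MSE as a function of the number of trees.
library(randomForest)

set.seed(3)
n <- 600
X <- data.frame(matrix(runif(n * 10), n, 10))
y <- 10 * sin(pi * X$X1 * X$X2) + 20 * (X$X3 - 0.5)^2 +
     10 * X$X4 + 5 * X$X5 + rnorm(n)

train <- 1:400
rf <- randomForest(X[train, ], y[train],
                   xtest = X[-train, ], ytest = y[-train], ntree = 1000)
plot(rf$test$mse, type = "l", xlab = "number of trees",
     ylab = "test MSE")    # decreases, then flattens; no upturn
```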

SLIDE 16

RF vs. Boosting

RF
  • Trees are independently grown
  • Use randomness to get diverse trees
  • Grow trees to maximum sizes
  • Number of trees is not a tuning parameter
  • Model size can be huge due to maximal-size trees

Boosting
  • Trees are grown sequentially
  • Each tree tries to correct previous mistakes
  • Keep each tree relatively small
  • Number of trees should be tuned
  • Model size is usually small
SLIDE 17

MDS Projections of Individual Tree Predictions

(Figure: MDS projections of individual tree predictions, colored from the 1st tree to the last tree, for RF, GBM with 100 trees and shrinkage = 0.2, and GBM with 700 trees and shrinkage = 0.05.)

SLIDE 18

RF vs. DNN

RF
  • Adding more trees does not seem to increase model complexity
  • No explicit “optimization”
  • Prediction cannot exceed the range of the training data
  • Difficult to update the model with new data

DNN
  • Complexity is predetermined by the network architecture
  • Optimization with controlled greed
  • Prediction can be unbounded, depending on the activation function
  • Trivial to update the model with new data

SLIDE 19

RF vs. XGBoost vs. DNN: Performance

Courtesy of R. P. Sheridan

SLIDE 20

RF vs. XGBoost vs. DNN: Training Time

Courtesy of R. P. Sheridan

SLIDE 21

RF vs. XGBoost vs. DNN: Model Sizes

Courtesy of R. P. Sheridan

SLIDE 22

Variable Importance by Permutations

(Diagram: a trained model is scored on hold-out data; the kth feature is shuffled n times, giving the error on the original hold-out data, E, and the error on the ith shuffle of the kth feature, E_ik.)

For the kth feature at the ith shuffle: d_ik = E_ik − E, and VarImp(F_k) = mean_i(d_ik) / SD_i(d_ik)

  • Breiman's idea of permuting the data one variable at a time and seeing how much accuracy drops can be applied to any algorithm, not just RF
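A minimal sketch of this hold-out permutation scheme, written against a randomForest model but applicable to any fitted model (simulated data and 20 shuffles are assumptions):

```r
## Shuffle one feature at a time on hold-out data and record the error increase.
library(randomForest)

set.seed(4)
dat <- data.frame(matrix(runif(600 * 10), 600, 10))
dat$y <- 10 * sin(pi * dat$X1 * dat$X2) + 20 * (dat$X3 - 0.5)^2 +
         10 * dat$X4 + 5 * dat$X5 + rnorm(600)
train <- dat[1:400, ]
hold  <- dat[401:600, ]
rf <- randomForest(y ~ ., data = train, ntree = 500)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
E <- rmse(predict(rf, hold), hold$y)    # error on the original hold-out data
vimp <- sapply(paste0("X", 1:10), function(k) {
  d <- replicate(20, {                  # n shuffles of feature k
    h <- hold
    h[[k]] <- sample(h[[k]])
    rmse(predict(rf, h), hold$y) - E    # d_ik = E_ik - E
  })
  mean(d) / sd(d)                       # VarImp(F_k)
})
sort(vimp, decreasing = TRUE)           # X1-X5 should rank highest
```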

SLIDE 23

Partial Dependence Plot

  • Every predictive model represents a function of multiple variables: y = f(x1, …, xp)
  • The marginal relation between y and a particular variable/predictor x_i can be examined using Partial Dependence (proposed by Friedman, 1999)
  • Example: assuming a model with 2 predictors, y = f(x1, x2), the partial dependence on x1 is p(x1) = ∫ f(x1, x2) dx2, i.e., all remaining variables are integrated out

SLIDE 24

Computing Partial Dependence

Computing the partial dependence of variable x1 for the model y = f(x1, …, xp):

  • Start from the original data, with columns x1, x2, x3, … and the model predictions f(x)
  • Replace the original values of x1 with some constant, such as 1.2, to obtain the modified data
  • Predict the outcome using the modified data
  • Compute the average prediction, p(x1 = 1.2)
  • Repeat the process with different x1 values to obtain the partial dependence function p(x1)

Note: the R packages “pdp”, “ALEPlot”, and “ICEbox” implement this and extensions
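A hand-rolled version of these steps for a randomForest model (simulated data assumed; pdp::partial() automates the same computation):

```r
## Partial dependence of X1, computed by averaging predictions on modified data.
library(randomForest)

set.seed(5)
dat <- data.frame(matrix(runif(500 * 5), 500, 5))
dat$y <- 10 * sin(pi * dat$X1 * dat$X2) + 5 * dat$X3 + rnorm(500)
rf <- randomForest(y ~ ., data = dat, ntree = 300)

grid <- seq(0, 1, length.out = 25)    # X1 values to evaluate
pd <- sapply(grid, function(v) {
  modified <- dat
  modified$X1 <- v                    # overwrite X1 with the constant v
  mean(predict(rf, modified))         # average prediction = p(X1 = v)
})
plot(grid, pd, type = "l", xlab = "X1", ylab = "partial dependence")
```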

SLIDE 25

Prediction Intervals

  • The idea behind quantregForest enables prediction intervals to be formed by post-processing a randomForest object
    ○ Each new data point to be predicted lands in a leaf in each tree and is predicted by the mean of the (in-bag) data in that leaf, averaged over all trees
    ○ The in-bag data that fell in the same terminal nodes as the new data point can be used as a sample from the conditional distribution, and hence to estimate the conditional quantiles
    ○ The grf package takes this idea further and uses it for local likelihood
  • While it might be possible to get such intervals with other methods by customizing the loss function to estimate quantiles, that requires fitting a separate model for each quantile
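A hedged sketch with the quantregForest package (its predict interface is assumed from the package documentation; data are simulated): conditional quantiles from the forest give a prediction interval without refitting per quantile.

```r
## 80% prediction intervals from the 10% and 90% conditional quantiles.
library(quantregForest)    # builds on randomForest

set.seed(6)
dat <- data.frame(matrix(runif(500 * 5), 500, 5))
dat$y <- 10 * sin(pi * dat$X1 * dat$X2) + 5 * dat$X3 + rnorm(500)
train <- dat[1:400, ]
test  <- dat[401:500, ]

qrf  <- quantregForest(x = train[, 1:5], y = train$y, ntree = 500)
pi80 <- predict(qrf, newdata = test[, 1:5], what = c(0.1, 0.9))
mean(test$y >= pi80[, 1] & test$y <= pi80[, 2])    # empirical coverage, ~0.8
```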

SLIDE 26

Room for Improvement

  • Classification runs faster than regression, due to the lack of pre-sorting in regression
  • Currently tree depth is not tracked, thus cannot easily be controlled
  • Splitting criteria are hard-coded, with no easy way to customize
  • Handling of a large number of categories is tricky
  • Missing value handling can be better
  • Some special tricks can speed up the algorithm for some specific data types (e.g., all binary predictors)

SLIDE 27

Wrapping Up

  • RF is a flexible, robust, and high-performance ML method
  • A basic understanding of how it works can give intuitions on how to tune it
  • For large data, try a small sampsize and a larger number of trees
  • Some of the ideas introduced with the method can be extended to other methods

SLIDE 28

Acknowledgement

  • Leo Breiman
  • Adele Cutler
  • Vladimir Svetnik
  • Matt Wiener
  • Numerous former interns
  • Users who reported bugs