Understanding Random Forests


  1. Understanding Random Forests Gilles Louppe (@glouppe) CERN, September 21, 2015

  2. Outline 1 Motivation 2 Growing decision trees 3 Random forests 4 Boosting 5 Variable importances 6 Summary

  3. Motivation

  4. Running example
      From physicochemical properties (alcohol, acidity, sulphates, ...), learn a model to predict wine taste preferences.

  5. Outline 1 Motivation 2 Growing decision trees 3 Random Forests 4 Boosting 5 Variable importances 6 Summary

  6. Supervised learning
      • Data comes as a finite learning set L = (X, y) where
        - Input samples are given as an array of shape (n_samples, n_features).
          E.g., feature values for wine physicochemical properties:
              # fixed acidity, volatile acidity, ...
              X = [[ 7.4  0.   ...  0.56  9.4  0. ]
                   [ 7.8  0.   ...  0.68  9.8  0. ]
                   ...
                   [ 7.8  0.04 ...  0.65  9.8  0. ]]
        - Output values are given as an array of shape (n_samples,).
          E.g., wine taste preferences (from 0 to 10):
              y = [5 5 5 ... 6 7 6]
      • The goal is to build an estimator φ_L : X → Y minimizing
            Err(φ_L) = E_{X,Y}{ L(Y, φ_L.predict(X)) }.
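
      A minimal sketch (not part of the original slides) of getting the running example into this (X, y) shape, assuming a local copy of the UCI wine-quality CSV; the file name and "quality" target column are assumptions about that dataset, not taken from the deck.

          import pandas as pd

          # Assumption: semicolon-separated UCI wine-quality file with a "quality" column.
          data = pd.read_csv("winequality-red.csv", sep=";")
          X = data.drop(columns="quality").values      # shape (n_samples, n_features)
          y = data["quality"].values                   # shape (n_samples,)
          feature_names = [c for c in data.columns if c != "quality"]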

  7. Decision trees (Breiman et al., 1984)
      [Figure: a decision tree with split nodes (tests X1 ≤ 0.7 and X2 ≤ 0.5) and leaf nodes t3, t4, t5, together with the axis-aligned partition of the input space it induces, yielding p(Y = c | X = x).]

      function BuildDecisionTree(L)
          Create node t
          if the stopping criterion is met for t then
              Assign a model to ŷ_t
          else
              Find the split on L that maximizes impurity decrease:
                  s* = argmax_s  i(t) − p_L i(t_L^s) − p_R i(t_R^s)
              Partition L into L_{t_L} ∪ L_{t_R} according to s*
              t_L = BuildDecisionTree(L_{t_L})
              t_R = BuildDecisionTree(L_{t_R})
          end if
          return t
      end function
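
      A compact Python sketch of the recursion above for regression with a mean-squared-error impurity; the class and helper names are illustrative, and it only scans axis-aligned thresholds exhaustively, so it is a didactic toy rather than the scikit-learn implementation.

          import numpy as np

          class Node:
              def __init__(self, prediction=None, feature=None, threshold=None, left=None, right=None):
                  self.prediction, self.feature, self.threshold = prediction, feature, threshold
                  self.left, self.right = left, right

          def impurity(y):
              # i(t): variance of y in the node, i.e. MSE around the leaf model mean(y | t)
              return y.var() if len(y) else 0.0

          def build_decision_tree(X, y, min_samples=5, depth=0, max_depth=5):
              # Stopping criterion: node too small or tree too deep -> leaf with mean prediction
              if len(y) < min_samples or depth >= max_depth:
                  return Node(prediction=y.mean())
              best, parent = None, impurity(y)
              for j in range(X.shape[1]):                     # candidate split variables
                  for thr in np.unique(X[:, j])[:-1]:         # candidate thresholds
                      left = X[:, j] <= thr
                      p_l = left.mean()
                      decrease = parent - p_l * impurity(y[left]) - (1 - p_l) * impurity(y[~left])
                      if best is None or decrease > best[0]:
                          best = (decrease, j, thr, left)
              if best is None or best[0] <= 0:                # no useful split found
                  return Node(prediction=y.mean())
              _, j, thr, left = best
              return Node(feature=j, threshold=thr,
                          left=build_decision_tree(X[left], y[left], min_samples, depth + 1, max_depth),
                          right=build_decision_tree(X[~left], y[~left], min_samples, depth + 1, max_depth))

          def predict_one(tree, x):
              while tree.prediction is None:                  # descend until a leaf is reached
                  tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
              return tree.prediction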

  8. Composability of decision trees
      Decision trees can be used to solve several machine learning tasks by swapping the impurity and leaf model functions:

      0-1 loss (classification)
          ŷ_t = argmax_{c ∈ Y} p(c|t),   i(t) = entropy(t) or i(t) = gini(t)
      Mean squared error (regression)
          ŷ_t = mean(y|t),   i(t) = (1/N_t) Σ_{x,y ∈ L_t} (y − ŷ_t)²
      Least absolute deviance (regression)
          ŷ_t = median(y|t),   i(t) = (1/N_t) Σ_{x,y ∈ L_t} |y − ŷ_t|
      Density estimation
          ŷ_t = N(μ_t, Σ_t),   i(t) = differential entropy(t)
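
      The first three pairs map directly onto the criterion parameter of scikit-learn's tree estimators (density estimation has no tree counterpart there). A small sketch; the regression criterion names below are the older ones ("mse"/"mae"), which recent releases rename to "squared_error"/"absolute_error".

          from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

          clf_gini = DecisionTreeClassifier(criterion="gini")     # 0-1 loss, Gini impurity
          clf_ent  = DecisionTreeClassifier(criterion="entropy")  # 0-1 loss, entropy impurity
          reg_mse  = DecisionTreeRegressor(criterion="mse")       # mean squared error
          reg_mae  = DecisionTreeRegressor(criterion="mae")       # least absolute deviance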

  9. Sample weights
      Sample weights can be accounted for by adapting the impurity and leaf model functions.

      Weighted mean squared error
          ŷ_t = (1 / Σ_w w) Σ_{x,y,w ∈ L_t} w y
          i(t) = (1 / Σ_w w) Σ_{x,y,w ∈ L_t} w (y − ŷ_t)²

      Weights are assumed to be non-negative since these quantities may otherwise be undefined. (E.g., what if Σ_w w < 0?)
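
      In scikit-learn this is exposed through the sample_weight argument of fit; a minimal sketch, with weights chosen purely for illustration and assuming the earlier X_train / y_train split.

          import numpy as np
          from sklearn.tree import DecisionTreeRegressor

          w = np.ones(len(y_train))
          w[y_train >= 7] = 2.0        # e.g., up-weight highly rated wines
          DecisionTreeRegressor(max_leaf_nodes=5).fit(X_train, y_train, sample_weight=w)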

  10. sklearn.tree
      # Fit a decision tree
      from sklearn.tree import DecisionTreeRegressor
      estimator = DecisionTreeRegressor(criterion="mse",   # Set i(t) function
                                        max_leaf_nodes=5)
      estimator.fit(X_train, y_train)

      # Predict target values
      y_pred = estimator.predict(X_test)

      # MSE on test data
      from sklearn.metrics import mean_squared_error
      score = mean_squared_error(y_test, y_pred)   # 0.572049826453

  11. Visualize and interpret
      # Display tree
      from sklearn.tree import export_graphviz
      export_graphviz(estimator, out_file="tree.dot",
                      feature_names=feature_names)
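
      The generated tree.dot still has to be rendered to an image; with Graphviz installed this can be done from the shell with `dot -Tpng tree.dot -o tree.png`, or by passing the .dot source to the Python graphviz package.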

  12. Strengths and weaknesses of decision trees
      • Non-parametric model, proved to be consistent.
      • Support heterogeneous data (continuous, ordered or categorical variables).
      • Flexibility in loss functions (but choice is limited).
      • Fast to train, fast to predict. In the average case, complexity of training is Θ(p N log² N).
      • Easily interpretable.
      • Low bias, but usually high variance.
        Solution: combine the predictions of several randomized trees into a single model.

  13. Outline 1 Motivation 2 Growing decision trees 3 Random Forests 4 Boosting 5 Variable importances 6 Summary

  14. Random Forests (Breiman, 2001; Geurts et al., 2006)
      [Figure: an ensemble of M randomized trees φ_1, ..., φ_M; the ensemble prediction ψ(Y = c | X = x) averages the individual tree predictions p_{φ_m}(Y = c | X = x).]

      Randomization
      • Bootstrap samples + random selection of K ≤ p split variables → Random Forests
      • Random selection of K ≤ p split variables + random selection of the threshold → Extra-Trees
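
      In scikit-learn the two flavours of randomization map onto two estimators; a minimal sketch, with hyper-parameter values chosen for illustration only.

          from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

          # Random Forests: bootstrap samples + K = max_features candidate split variables
          rf = RandomForestRegressor(n_estimators=100, max_features=4, bootstrap=True)
          # Extra-Trees: K candidate split variables + random thresholds (no bootstrap by default)
          et = ExtraTreesRegressor(n_estimators=100, max_features=4)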

  15. Bias and variance

  16. Bias-variance decomposition
      Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} at X = x of an ensemble of M randomized models φ_{L,θ_m} is

          E_L{Err(ψ_{L,θ_1,...,θ_M}(x))} = noise(x) + bias²(x) + var(x),

      where
          noise(x) = Err(φ_B(x)),
          bias²(x) = (φ_B(x) − E_{L,θ}{φ_{L,θ}(x)})²,
          var(x)   = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x),

      and where ρ(x) is the Pearson correlation coefficient between the predictions of two randomized trees built on the same learning set.
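
      A quick worked example (numbers chosen for illustration, not from the slides): with single-tree variance σ²_{L,θ}(x) = 1, correlation ρ(x) = 0.3 and M = 100 trees, var(x) = 0.3·1 + (0.7/100)·1 = 0.307. Averaging removes most of a single tree's variance, but the correlated part ρ(x) σ²_{L,θ}(x) remains no matter how many trees are added.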

  17. Diagnosing the error of random forests (Louppe, 2014)
      • Bias: identical to the bias of a single randomized tree.
      • Variance: var(x) = ρ(x) σ²_{L,θ}(x) + ((1 − ρ(x)) / M) σ²_{L,θ}(x)
        As M → ∞, var(x) → ρ(x) σ²_{L,θ}(x).
        The stronger the randomization, ρ(x) → 0, var(x) → 0.
        The weaker the randomization, ρ(x) → 1, var(x) → σ²_{L,θ}(x).

      Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model. The crux of the problem is to find the right trade-off.
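
      One rough way to see this trade-off empirically (not from the slides) is to retrain the same forest on resampled learning sets and measure how much its predictions move at fixed test points; the helper name and the use of bootstrap resampling as a stand-in for drawing fresh learning sets are our assumptions.

          import numpy as np
          from sklearn.ensemble import RandomForestRegressor
          from sklearn.utils import resample

          def prediction_variance(X_train, y_train, X_test, n_repeats=20, **forest_params):
              # Train the same forest on n_repeats resampled learning sets and
              # measure the spread of its predictions at fixed test points.
              preds = []
              for seed in range(n_repeats):
                  Xr, yr = resample(X_train, y_train, random_state=seed)
                  forest = RandomForestRegressor(random_state=seed, **forest_params)
                  preds.append(forest.fit(Xr, yr).predict(X_test))
              preds = np.asarray(preds)              # shape (n_repeats, n_test_samples)
              return preds.var(axis=0).mean()        # average variance over test points

          # Stronger randomization (smaller max_features) should lower this number:
          # prediction_variance(X_train, y_train, X_test, n_estimators=100, max_features=2)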

  18. Tuning randomization in sklearn.ensemble
      import numpy as np
      import matplotlib.pyplot as plt
      from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
      from sklearn.cross_validation import ShuffleSplit
      from sklearn.learning_curve import validation_curve

      # Validation of max_features, controlling randomness in forests
      param_range = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

      _, test_scores = validation_curve(
          RandomForestRegressor(n_estimators=100, n_jobs=-1), X, y,
          cv=ShuffleSplit(n=len(X), n_iter=10, test_size=0.25),
          param_name="max_features", param_range=param_range,
          scoring="mean_squared_error")
      test_scores_mean = np.mean(-test_scores, axis=1)
      plt.plot(param_range, test_scores_mean, label="RF", color="g")

      _, test_scores = validation_curve(
          ExtraTreesRegressor(n_estimators=100, n_jobs=-1), X, y,
          cv=ShuffleSplit(n=len(X), n_iter=10, test_size=0.25),
          param_name="max_features", param_range=param_range,
          scoring="mean_squared_error")
      test_scores_mean = np.mean(-test_scores, axis=1)
      plt.plot(param_range, test_scores_mean, label="ETs", color="r")
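
      The module layout above is the 2015 one; in current scikit-learn releases the same utilities live in sklearn.model_selection and the MSE scorer is explicitly negated. The first call would then look roughly like this.

          from sklearn.model_selection import ShuffleSplit, validation_curve

          _, test_scores = validation_curve(
              RandomForestRegressor(n_estimators=100, n_jobs=-1), X, y,
              cv=ShuffleSplit(n_splits=10, test_size=0.25),
              param_name="max_features", param_range=param_range,
              scoring="neg_mean_squared_error")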

  19. Tuning randomization in sklearn.ensemble
      [Figure: validation curves of test MSE vs. max_features for RF and ETs.]
      Best trade-off: Extra-Trees, for max_features=6.

  20. Benchmarks and implementation
      Scikit-Learn provides a robust implementation combining both algorithmic and code optimizations. It is one of the fastest among all libraries and programming languages.
      [Figure: fit times (s) of random forest / extra-trees implementations across libraries — Scikit-Learn (Python, Cython), OpenCV (C++), OK3 (C), Weka (Java), randomForest (R, Fortran), Orange (Python) — ranging from roughly 200 s to 13,400 s, with the Scikit-Learn implementations the fastest.]

  21. Benchmarks and implementation

  22. Strengths and weaknesses of forests
      • One of the best off-the-shelf learning algorithms, requiring almost no tuning.
      • Fine control of bias and variance through averaging and randomization, resulting in better performance.
      • Moderately fast to train and to predict.
        Θ(M K Ñ log² Ñ) for RFs (where Ñ = 0.632 N)
        Θ(M K N log N) for ETs
      • Embarrassingly parallel (use n_jobs).
      • Less interpretable than decision trees.

  23. Outline 1 Motivation 2 Growing decision trees 3 Random Forests 4 Boosting 5 Variable importances 6 Summary

  24. Gradient Boosted Regression Trees (Friedman, 2001)
      • GBRT fits an additive model of the form
            φ(x) = Σ_{m=1}^{M} γ_m h_m(x)
      • The ensemble is built in a forward stagewise manner. That is,
            φ_m(x) = φ_{m−1}(x) + γ_m h_m(x)
        where h_m : X → R is a regression tree approximating the gradient step ∆_φ L(Y, φ_{m−1}(X)).
      [Figure: ground truth ≈ tree 1 + tree 2 + tree 3 — each successive tree fits what the previous stages left unexplained.]
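
      A bare-bones sketch of the stagewise idea for the squared error loss, where the gradient step is simply the current residual; the function names, learning rate and depth values are illustrative, and the line search for γ_m done by a full implementation is omitted.

          import numpy as np
          from sklearn.tree import DecisionTreeRegressor

          def fit_gbrt(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
              # phi_0: a constant model (the mean), then M stagewise corrections.
              f0 = y.mean()
              trees, prediction = [], np.full(len(y), f0)
              for m in range(n_stages):
                  residual = y - prediction                      # negative gradient of squared loss
                  h_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
                  trees.append(h_m)
                  prediction += learning_rate * h_m.predict(X)   # phi_m = phi_{m-1} + gamma_m h_m
              return f0, trees

          def predict_gbrt(model, X, learning_rate=0.1):
              f0, trees = model
              return f0 + learning_rate * sum(t.predict(X) for t in trees)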

  25. Careful tuning required
      from sklearn.ensemble import GradientBoostingRegressor
      from sklearn.cross_validation import ShuffleSplit
      from sklearn.grid_search import GridSearchCV

      # Careful tuning is required to obtain good results
      param_grid = {"loss": ["ls", "lad", "huber"],   # squared error, absolute error, Huber
                    "learning_rate": [0.1, 0.01, 0.001],
                    "max_depth": [3, 5, 7],
                    "min_samples_leaf": [1, 3, 5],
                    "subsample": [1.0, 0.9, 0.8]}
      est = GradientBoostingRegressor(n_estimators=1000)
      grid = GridSearchCV(est, param_grid,
                          cv=ShuffleSplit(n=len(X), n_iter=10, test_size=0.25),
                          scoring="mean_squared_error",
                          n_jobs=-1).fit(X, y)
      gbrt = grid.best_estimator_

      See our PyData 2014 tutorial for further guidance:
      https://github.com/pprett/pydata-gbrt-tutorial
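
      After the search, the usual next step is to inspect what won; a short follow-up using standard GridSearchCV attributes (X_test is assumed from the earlier split).

          print(grid.best_params_)        # best hyper-parameter combination found
          print(grid.best_score_)         # its cross-validated score
          y_pred = gbrt.predict(X_test)   # gbrt is the refitted best estimator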
