

SLIDE 1

Gradient Boosted Regression Trees

Peter Prettenhofer (@pprett), DataRobot

Gilles Louppe (@glouppe), Université de Liège, Belgium

SLIDE 2

Motivation

SLIDE 3

Motivation

SLIDE 4

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 5

About us

Peter

  • @pprett
  • Python & ML ∼ 6 years
  • sklearn dev since 2010

Gilles

  • @glouppe
  • PhD student (Liège, Belgium)

  • sklearn dev since 2011
  • Chief tree hugger

SLIDE 6

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 7

Machine Learning 101

  • Data comes as...
    • A set of examples {(xᵢ, yᵢ) | 0 ≤ i < n_samples}, with
    • Feature vector x ∈ ℝ^n_features, and
    • Response y ∈ ℝ (regression) or y ∈ {−1, 1} (classification)
  • Goal is to...
    • Find a function ŷ = f(x)
    • Such that the error L(y, ŷ) on new (unseen) x is minimal

SLIDE 8

Classification and Regression Trees [Breiman et al., 1984]

(Figure: decision tree for California housing; internal nodes split on MedInc, AveRooms, and AveOccup, leaves predict median house values from 1.16 to 4.57.)

sklearn.tree.DecisionTreeClassifier|Regressor
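
As a minimal sketch of this API (toy data and parameter values of our own, not from the slides):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # toy 1-D regression problem (illustrative only)
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

    # max_depth bounds the number of successive splits
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
    pred = tree.predict(X)  # piecewise-constant predictions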

SLIDE 9

Function approximation with Regression Trees

(Plot: ground truth vs. regression tree fits with max_depth=1, 3, and 20.)

SLIDE 10

Function approximation with Regression Trees

(Plot: ground truth vs. regression tree fits with max_depth=1, 3, and 20.)

Deprecated

  • Nowadays seldom used alone
  • Ensembles: Random Forest, Bagging, or Boosting (see sklearn.ensemble)

SLIDE 11

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 12

Gradient Boosted Regression Trees

Advantages

  • Heterogeneous data (features measured on different scales)
  • Supports different loss functions (e.g. Huber)
  • Automatically detects (non-linear) feature interactions

Disadvantages

  • Requires careful tuning
  • Slow to train (but fast to predict)
  • Cannot extrapolate
SLIDE 13

Boosting

AdaBoost [Y. Freund & R. Schapire, 1995]

  • Ensemble: each member is an expert on the errors of its predecessor

  • Iteratively re-weights training examples based on errors

(Plots: decision boundaries on a 2-D toy dataset (features x0, x1) over successive AdaBoost iterations.)

sklearn.ensemble.AdaBoostClassifier|Regressor
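
A minimal usage sketch of the estimator named above (toy data; n_estimators is an arbitrary choice):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # each new weak learner concentrates on the examples its
    # predecessors misclassified (via example re-weighting)
    clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
    print(clf.score(X, y))  # training accuracy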

SLIDE 14

Boosting

AdaBoost [Y. Freund & R. Schapire, 1995]

  • Ensemble: each member is an expert on the errors of its predecessor

  • Iteratively re-weights training examples based on errors

(Plots: decision boundaries on a 2-D toy dataset (features x0, x1) over successive AdaBoost iterations.)

sklearn.ensemble.AdaBoostClassifier|Regressor

Huge success

  • Viola-Jones Face Detector (2001)
  • Freund & Schapire won the Gödel Prize 2003
SLIDE 15

Gradient Boosting [J. Friedman, 1999]

Statistical view on boosting

  • ⇒ Generalization of boosting to arbitrary loss functions

(Plot: panels showing the ground truth, then ∼ tree 1, + tree 2, + tree 3; the running sum of trees approaches the ground truth.)

SLIDE 16

Gradient Boosting [J. Friedman, 1999]

Statistical view on boosting

  • ⇒ Generalization of boosting to arbitrary loss functions

Residual fitting

(Plot: panels showing the ground truth, then ∼ tree 1, + tree 2, + tree 3; the running sum of trees approaches the ground truth.)

sklearn.ensemble.GradientBoostingClassifier|Regressor
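
To make residual fitting concrete, here is a hand-rolled sketch of the loop (squared loss, learning rate 1, our own toy data); GradientBoostingRegressor implements this idea with more machinery:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    pred = np.zeros_like(y)  # start from the zero model
    for m in range(3):
        residual = y - pred  # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred += tree.predict(X)  # ground truth ~ tree 1 + tree 2 + tree 3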

SLIDE 17

Functional Gradient Descent

Least Squares Regression

  • Squared loss: L(yᵢ, f(xᵢ)) = (yᵢ − f(xᵢ))²
  • The residual ∼ the (negative) gradient ∂L(yᵢ, f(xᵢ)) / ∂f(xᵢ)

(Plots: regression losses (squared, absolute, Huber) as functions of y − f(x); classification losses (zero-one, log, exponential) as functions of y · f(x).)

SLIDE 18

Functional Gradient Descent

Least Squares Regression

  • Squared loss: L(yᵢ, f(xᵢ)) = (yᵢ − f(xᵢ))²
  • The residual ∼ the (negative) gradient ∂L(yᵢ, f(xᵢ)) / ∂f(xᵢ)

Steepest Descent

  • Regression trees approximate the (negative) gradient
  • Each tree is a successive gradient descent step

(Plots: regression losses (squared, absolute, Huber) as functions of y − f(x); classification losses (zero-one, log, exponential) as functions of y · f(x).)
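
A quick numeric sanity check of this residual-gradient correspondence (our own toy values; the factor 2 comes from differentiating the squared loss):

    import numpy as np

    y = np.array([1.0, 2.0, 3.0])
    f = np.array([0.5, 2.5, 2.0])
    eps = 1e-6
    # finite-difference gradient of L(y, f) = (y - f)**2 with respect to f
    grad = ((y - (f + eps)) ** 2 - (y - f) ** 2) / eps
    print(-grad / 2)  # ~ [ 0.5 -0.5  1. ], i.e. the residual y - f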

SLIDE 19

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 20

GBRT in scikit-learn

How to use it

    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> from sklearn.datasets import make_hastie_10_2
    >>> X, y = make_hastie_10_2(n_samples=10000)
    >>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    >>> est.fit(X, y)
    ...
    >>> # get predictions
    >>> pred = est.predict(X)
    >>> est.predict_proba(X)[0]  # class probabilities
    array([ 0.67,  0.33])

Implementation

  • Written in pure Python/Numpy (easy to extend).
  • Builds on top of sklearn.tree.DecisionTreeRegressor (Cython).
  • Custom node splitter that uses pre-sorting (better for shallow trees).
SLIDE 21

Example

    from sklearn.ensemble import GradientBoostingRegressor
    est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
    for pred in est.staged_predict(X):
        plt.plot(X[:, 0], pred, color='r', alpha=0.1)

(Plot: ground truth, RT max_depth=1, RT max_depth=3, and the staged GBRT max_depth=1 predictions, moving from high bias / low variance to low bias / high variance.)

SLIDE 22

Model complexity & Overfitting

    test_score = np.empty(len(est.estimators_))
    for i, pred in enumerate(est.staged_predict(X_test)):
        test_score[i] = est.loss_(y_test, pred)
    plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
    plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

(Plot: train and test error vs. n_estimators; the test error reaches a minimum while the train error keeps falling, opening a train-test gap.)

SLIDE 23

Model complexity & Overfitting

    test_score = np.empty(len(est.estimators_))
    for i, pred in enumerate(est.staged_predict(X_test)):
        test_score[i] = est.loss_(y_test, pred)
    plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
    plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

(Plot: train and test error vs. n_estimators; the test error reaches a minimum while the train error keeps falling, opening a train-test gap.)

Regularization

GBRT provides a number of knobs to control overfitting:

  • Tree structure
  • Shrinkage
  • Stochastic Gradient Boosting
SLIDE 24

Regularization: Tree structure

  • The max_depth of the trees controls the degree of feature interactions
  • Use min_samples_leaf to ensure a sufficient number of samples per leaf (a minimal sketch follows)
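
A minimal sketch of these two knobs (parameter values are illustrative, and X_train, y_train are assumed to exist as in the earlier snippets):

    from sklearn.ensemble import GradientBoostingRegressor

    # shallow trees keep interactions low-order; min_samples_leaf
    # prevents leaves fit to only a handful of points
    est = GradientBoostingRegressor(n_estimators=1000, max_depth=4,
                                    min_samples_leaf=9)
    est.fit(X_train, y_train)
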
SLIDE 25

Regularization: Shrinkage

  • Slow learning by shrinking tree predictions with 0 < learning_rate ≤ 1
  • A lower learning_rate requires a higher n_estimators

(Plot: with learning_rate=0.1 the test error is lower than with the default, but more trees are required; curves: Train/Test for both settings.)

SLIDE 26

Regularization: Stochastic Gradient Boosting

  • Samples: random subset of the training set (subsample)
  • Features: random subset of features (max_features)
  • Improved accuracy and reduced runtime (see the sketch below)

(Plot: subsample alone does poorly, but subsample=0.5 combined with learning_rate=0.1 gives an even lower test error; curves: Train/Test for both settings.)
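
A sketch combining shrinkage with subsampling, as the plot suggests (values illustrative; X_train, y_train assumed as before):

    from sklearn.ensemble import GradientBoostingRegressor

    est = GradientBoostingRegressor(n_estimators=1000,
                                    learning_rate=0.1,  # shrinkage
                                    subsample=0.5,      # half the samples per tree
                                    max_features=0.3)   # 30% of features per split
    est.fit(X_train, y_train)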

SLIDE 27

Hyperparameter tuning

  • 1. Set n_estimators as high as possible (e.g. 3000)
  • 2. Tune hyperparameters via grid search:

    from sklearn.grid_search import GridSearchCV
    param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
                  'max_depth': [4, 6],
                  'min_samples_leaf': [3, 5, 9, 17],
                  'max_features': [1.0, 0.3, 0.1]}
    est = GradientBoostingRegressor(n_estimators=3000)
    gs_cv = GridSearchCV(est, param_grid).fit(X, y)
    # best hyperparameter setting
    gs_cv.best_params_

  • 3. Finally, set n_estimators even higher and tune learning_rate (see the sketch below).
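
Step 3 might look like the following sketch, reusing the imports above (the frozen structure parameters are hypothetical stand-ins for whatever step 2 returned):

    # raise n_estimators, freeze the structure found in step 2,
    # and re-tune learning_rate alone
    est = GradientBoostingRegressor(n_estimators=6000, max_depth=4,
                                    min_samples_leaf=9, max_features=0.3)
    gs_cv = GridSearchCV(est, {'learning_rate': [0.05, 0.02, 0.01, 0.005]}).fit(X, y)
    gs_cv.best_params_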

SLIDE 28

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 29

Case Study

California Housing dataset

  • Predict log(medianHouseValue)
  • Block groups in the 1990 census
  • 20,640 groups with 8 features (median income, median age, lat, lon, ...)
  • Evaluation: mean absolute error on an 80/20 train/test split

Challenges

  • Heterogeneous features
  • Non-linear interactions
SLIDE 30

Predictive accuracy & runtime

             Train time [s]   Test time [ms]   MAE
    Mean     -                -                0.4635
    Ridge    0.006            0.11             0.2756
    SVR      28.0             2000.00          0.1888
    RF       26.3             605.00           0.1620
    GBRT     192.0            439.00           0.1438

(Plot: GBRT train and test error vs. n_estimators on the California housing data.)

SLIDE 31

Model interpretation

Which features are important?

    >>> est.feature_importances_
    array([ 0.01,  0.38, ...])

(Bar chart: relative feature importance, ascending: HouseAge, Population, AveBedrms, Latitude, AveOccup, Longitude, AveRooms, MedInc.)

SLIDE 32

Model interpretation

What is the effect of a feature on the response?

    from sklearn.ensemble import partial_dependence as pd
    features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
                ('AveOccup', 'HouseAge')]
    fig, axs = pd.plot_partial_dependence(est, X_train, features,
                                          feature_names=names)

(Figure: partial dependence of house value on the non-location features MedInc, AveOccup, HouseAge, and AveRooms, plus a two-way plot for AveOccup and HouseAge, for the California housing dataset.)

SLIDE 33

Model interpretation

Automatically detects spatial effects

(Maps: partial dependence of median house value on longitude and latitude.)

SLIDE 34

Summary

  • Flexible non-parametric classification and regression technique
  • Applicable to a variety of problems
  • Solid, battle-worn implementation in scikit-learn
SLIDE 35

Thanks! Questions?

SLIDE 36

Benchmarks

(Bar charts: error, train time, and test time of gbm vs. sklearn-0.15 on the datasets Arcene, Boston, California, Covtype, Example 10.2, Expedia, Madelon, Solar, Spam, YahooLTRC, and bioresp.)

SLIDE 37

Tips & Tricks 1

Input layout

Use dtype=np.float32 to avoid memory copies, and Fortran layout for a slight runtime benefit:

    X = np.asfortranarray(X, dtype=np.float32)

SLIDE 38

Tips & Tricks 2

Feature interactions

GBRT automatically detects feature interactions, but explicit interactions often help. Trees required to approximate X1 − X2: 10 (left), 1000 (right).

(Surface plots: GBRT approximation of x − y after 10 trees (left) and 1000 trees (right).)
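
A sketch of adding such an explicit interaction as a derived feature (the column indices and X, y are hypothetical):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # append the explicit difference X1 - X2 as an extra column
    X_aug = np.hstack([X, (X[:, 0] - X[:, 1])[:, np.newaxis]])
    est = GradientBoostingRegressor(n_estimators=100).fit(X_aug, y)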

SLIDE 39

Tips & Tricks 3

Categorical variables

Sklearn requires that categorical variables be encoded as numerics. Tree-based methods work well with ordinal encoding:

    df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
    # ordinal encoding
    df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
    X = np.asfortranarray(df_enc.values, dtype=np.float32)