

SLIDE 1

Gradient Boosted Regression Trees

Peter Prettenhofer (@pprett), DataRobot

Gilles Louppe (@glouppe), Université de Liège, Belgium

SLIDE 2

Motivation

SLIDE 3

Motivation

SLIDE 4

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 5

About us

Peter

  • @pprett
  • Python & ML ∼ 6 years
  • sklearn dev since 2010

Gilles

  • @glouppe
  • PhD student (Liège, Belgium)

  • sklearn dev since 2011
  • Chief tree hugger

SLIDE 6

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 7

Machine Learning 101

  • Data comes as...
    • A set of examples {(xᵢ, yᵢ) | 0 ≤ i < n_samples}, with
    • Feature vector x ∈ ℝ^n_features, and
    • Response y ∈ ℝ (regression) or y ∈ {−1, 1} (classification)
  • Goal is to...
    • Find a function ŷ = f(x)
    • Such that the error L(y, ŷ) on new (unseen) x is minimal

SLIDE 8

Classification and Regression Trees [Breiman et al., 1984]

(Figure: decision tree for California housing; internal nodes split on MedInc, AveRooms, and AveOccup, leaves predict median house values from 1.16 to 4.57.)

sklearn.tree.DecisionTreeClassifier|Regressor
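
As a minimal sketch of this API (toy data and parameter values of our own, not from the slides):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # toy 1-D regression problem (illustrative only)
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

    # max_depth bounds the number of successive splits
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
    pred = tree.predict(X)  # piecewise-constant predictions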

SLIDE 9

Function approximation with Regression Trees

(Plot: ground truth vs. regression tree fits with max_depth=1, 3, and 20.)

SLIDE 10

Function approximation with Regression Trees

(Plot: ground truth vs. regression tree fits with max_depth=1, 3, and 20.)

Deprecated

  • Nowadays seldom used alone
  • Ensembles: Random Forest, Bagging, or Boosting (see sklearn.ensemble)

SLIDE 11

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 12

Gradient Boosted Regression Trees

Advantages

  • Heterogeneous data (features measured on different scales)
  • Supports different loss functions (e.g. Huber)
  • Automatically detects (non-linear) feature interactions

Disadvantages

  • Requires careful tuning
  • Slow to train (but fast to predict)
  • Cannot extrapolate
SLIDE 13

Boosting

AdaBoost [Y. Freund & R. Schapire, 1995]

  • Ensemble: each member is an expert on the errors of its predecessor

  • Iteratively re-weights training examples based on errors

(Plots: decision boundaries on a 2-D toy dataset (features x0, x1) over successive AdaBoost iterations.)

sklearn.ensemble.AdaBoostClassifier|Regressor
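
A minimal usage sketch of the estimator named above (toy data; n_estimators is an arbitrary choice):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # each new weak learner concentrates on the examples its
    # predecessors misclassified (via example re-weighting)
    clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
    print(clf.score(X, y))  # training accuracy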

SLIDE 14

Boosting

AdaBoost [Y. Freund & R. Schapire, 1995]

  • Ensemble: each member is an expert on the errors of its predecessor

  • Iteratively re-weights training examples based on errors

(Plots: decision boundaries on a 2-D toy dataset (features x0, x1) over successive AdaBoost iterations.)

sklearn.ensemble.AdaBoostClassifier|Regressor

Huge success

  • Viola-Jones Face Detector (2001)
  • Freund & Schapire won the Gödel Prize 2003
SLIDE 15

Gradient Boosting [J. Friedman, 1999]

Statistical view on boosting

  • ⇒ Generalization of boosting to arbitrary loss functions

(Plot: panels showing the ground truth, then ∼ tree 1, + tree 2, + tree 3; the running sum of trees approaches the ground truth.)

SLIDE 16

Gradient Boosting [J. Friedman, 1999]

Statistical view on boosting

  • ⇒ Generalization of boosting to arbitrary loss functions

Residual fitting

(Plot: panels showing the ground truth, then ∼ tree 1, + tree 2, + tree 3; the running sum of trees approaches the ground truth.)

sklearn.ensemble.GradientBoostingClassifier|Regressor
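
To make residual fitting concrete, here is a hand-rolled sketch of the loop (squared loss, learning rate 1, our own toy data); GradientBoostingRegressor implements this idea with more machinery:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    pred = np.zeros_like(y)  # start from the zero model
    for m in range(3):
        residual = y - pred  # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        pred += tree.predict(X)  # ground truth ~ tree 1 + tree 2 + tree 3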

SLIDE 17

Functional Gradient Descent

Least Squares Regression

  • Squared loss: L(yᵢ, f(xᵢ)) = (yᵢ − f(xᵢ))²
  • The residual ∼ the (negative) gradient ∂L(yᵢ, f(xᵢ)) / ∂f(xᵢ)

(Plots: regression losses (squared, absolute, Huber) as functions of y − f(x); classification losses (zero-one, log, exponential) as functions of y · f(x).)

SLIDE 18

Functional Gradient Descent

Least Squares Regression

  • Squared loss: L(yᵢ, f(xᵢ)) = (yᵢ − f(xᵢ))²
  • The residual ∼ the (negative) gradient ∂L(yᵢ, f(xᵢ)) / ∂f(xᵢ)

Steepest Descent

  • Regression trees approximate the (negative) gradient
  • Each tree is a successive gradient descent step

(Plots: regression losses (squared, absolute, Huber) as functions of y − f(x); classification losses (zero-one, log, exponential) as functions of y · f(x).)
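
A quick numeric sanity check of this residual-gradient correspondence (our own toy values; the factor 2 comes from differentiating the squared loss):

    import numpy as np

    y = np.array([1.0, 2.0, 3.0])
    f = np.array([0.5, 2.5, 2.0])
    eps = 1e-6
    # finite-difference gradient of L(y, f) = (y - f)**2 with respect to f
    grad = ((y - (f + eps)) ** 2 - (y - f) ** 2) / eps
    print(-grad / 2)  # ~ [ 0.5 -0.5  1. ], i.e. the residual y - f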

SLIDE 19

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 20

GBRT in scikit-learn

How to use it

    >>> from sklearn.ensemble import GradientBoostingClassifier
    >>> from sklearn.datasets import make_hastie_10_2
    >>> X, y = make_hastie_10_2(n_samples=10000)
    >>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    >>> est.fit(X, y)
    ...
    >>> # get predictions
    >>> pred = est.predict(X)
    >>> est.predict_proba(X)[0]  # class probabilities
    array([ 0.67,  0.33])

Implementation

  • Written in pure Python/Numpy (easy to extend).
  • Builds on top of sklearn.tree.DecisionTreeRegressor (Cython).
  • Custom node splitter that uses pre-sorting (better for shallow trees).
SLIDE 21

Example

    from sklearn.ensemble import GradientBoostingRegressor
    est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
    for pred in est.staged_predict(X):
        plt.plot(X[:, 0], pred, color='r', alpha=0.1)

(Plot: ground truth, RT max_depth=1, RT max_depth=3, and the staged GBRT max_depth=1 predictions, moving from high bias / low variance to low bias / high variance.)

SLIDE 22

Model complexity & Overfitting

    test_score = np.empty(len(est.estimators_))
    for i, pred in enumerate(est.staged_predict(X_test)):
        test_score[i] = est.loss_(y_test, pred)
    plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
    plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

(Plot: train and test error vs. n_estimators; the test error reaches a minimum while the train error keeps falling, opening a train-test gap.)

SLIDE 23

Model complexity & Overfitting

    test_score = np.empty(len(est.estimators_))
    for i, pred in enumerate(est.staged_predict(X_test)):
        test_score[i] = est.loss_(y_test, pred)
    plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
    plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

(Plot: train and test error vs. n_estimators; the test error reaches a minimum while the train error keeps falling, opening a train-test gap.)

Regularization

GBRT provides a number of knobs to control overfitting:

  • Tree structure
  • Shrinkage
  • Stochastic Gradient Boosting
SLIDE 24

Regularization: Tree structure

  • The max_depth of the trees controls the degree of feature interactions
  • Use min_samples_leaf to ensure a sufficient number of samples per leaf (a minimal sketch follows)
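
A minimal sketch of these two knobs (parameter values are illustrative, and X_train, y_train are assumed to exist as in the earlier snippets):

    from sklearn.ensemble import GradientBoostingRegressor

    # shallow trees keep interactions low-order; min_samples_leaf
    # prevents leaves fit to only a handful of points
    est = GradientBoostingRegressor(n_estimators=1000, max_depth=4,
                                    min_samples_leaf=9)
    est.fit(X_train, y_train)
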
SLIDE 25

Regularization: Shrinkage

  • Slow learning by shrinking tree predictions with 0 < learning_rate ≤ 1
  • A lower learning_rate requires a higher n_estimators

(Plot: with learning_rate=0.1 the test error is lower than with the default, but more trees are required; curves: Train/Test for both settings.)

SLIDE 26

Regularization: Stochastic Gradient Boosting

  • Samples: random subset of the training set (subsample)
  • Features: random subset of features (max_features)
  • Improved accuracy and reduced runtime (see the sketch below)

(Plot: subsample alone does poorly, but subsample=0.5 combined with learning_rate=0.1 gives an even lower test error; curves: Train/Test for both settings.)
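
A sketch combining shrinkage with subsampling, as the plot suggests (values illustrative; X_train, y_train assumed as before):

    from sklearn.ensemble import GradientBoostingRegressor

    est = GradientBoostingRegressor(n_estimators=1000,
                                    learning_rate=0.1,  # shrinkage
                                    subsample=0.5,      # half the samples per tree
                                    max_features=0.3)   # 30% of features per split
    est.fit(X_train, y_train)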

SLIDE 27

Hyperparameter tuning

  • 1. Set n_estimators as high as possible (e.g. 3000)
  • 2. Tune hyperparameters via grid search:

    from sklearn.grid_search import GridSearchCV
    param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
                  'max_depth': [4, 6],
                  'min_samples_leaf': [3, 5, 9, 17],
                  'max_features': [1.0, 0.3, 0.1]}
    est = GradientBoostingRegressor(n_estimators=3000)
    gs_cv = GridSearchCV(est, param_grid).fit(X, y)
    # best hyperparameter setting
    gs_cv.best_params_

  • 3. Finally, set n_estimators even higher and tune learning_rate (see the sketch below).
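
Step 3 might look like the following sketch, reusing the imports above (the frozen structure parameters are hypothetical stand-ins for whatever step 2 returned):

    # raise n_estimators, freeze the structure found in step 2,
    # and re-tune learning_rate alone
    est = GradientBoostingRegressor(n_estimators=6000, max_depth=4,
                                    min_samples_leaf=9, max_features=0.3)
    gs_cv = GridSearchCV(est, {'learning_rate': [0.05, 0.02, 0.01, 0.005]}).fit(X, y)
    gs_cv.best_params_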

SLIDE 28

Outline

1 Basics
2 Gradient Boosting
3 Gradient Boosting in scikit-learn
4 Case Study: California housing

SLIDE 29

Case Study

California Housing dataset

  • Predict log(medianHouseValue)
  • Block groups in the 1990 census
  • 20,640 groups with 8 features (median income, median age, lat, lon, ...)
  • Evaluation: mean absolute error on an 80/20 train/test split

Challenges

  • Heterogeneous features
  • Non-linear interactions
SLIDE 30

Predictive accuracy & runtime

             Train time [s]   Test time [ms]   MAE
    Mean     -                -                0.4635
    Ridge    0.006            0.11             0.2756
    SVR      28.0             2000.00          0.1888
    RF       26.3             605.00           0.1620
    GBRT     192.0            439.00           0.1438

(Plot: GBRT train and test error vs. n_estimators on the California housing data.)

SLIDE 31

Model interpretation

Which features are important?

    >>> est.feature_importances_
    array([ 0.01,  0.38, ...])

(Bar chart: relative feature importance, ascending: HouseAge, Population, AveBedrms, Latitude, AveOccup, Longitude, AveRooms, MedInc.)

SLIDE 32

Model interpretation

What is the effect of a feature on the response?

    from sklearn.ensemble import partial_dependence as pd
    features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
                ('AveOccup', 'HouseAge')]
    fig, axs = pd.plot_partial_dependence(est, X_train, features,
                                          feature_names=names)

(Figure: partial dependence of house value on the non-location features MedInc, AveOccup, HouseAge, and AveRooms, plus a two-way plot for AveOccup and HouseAge, for the California housing dataset.)

SLIDE 33

Model interpretation

Automatically detects spatial effects

(Maps: partial dependence of median house value on longitude and latitude.)

SLIDE 34

Summary

  • Flexible non-parametric classification and regression technique
  • Applicable to a variety of problems
  • Solid, battle-worn implementation in scikit-learn
SLIDE 35

Thanks! Questions?

SLIDE 36

Benchmarks

(Bar charts: error, train time, and test time of gbm vs. sklearn-0.15 on the datasets Arcene, Boston, California, Covtype, Example 10.2, Expedia, Madelon, Solar, Spam, YahooLTRC, and bioresp.)

SLIDE 37

Tips & Tricks 1

Input layout

Use dtype=np.float32 to avoid memory copies, and Fortran layout for a slight runtime benefit:

    X = np.asfortranarray(X, dtype=np.float32)

SLIDE 38

Tips & Tricks 2

Feature interactions

GBRT automatically detects feature interactions, but explicit interactions often help. Trees required to approximate X1 − X2: 10 (left), 1000 (right).

(Surface plots: GBRT approximation of x − y after 10 trees (left) and 1000 trees (right).)
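
A sketch of adding such an explicit interaction as a derived feature (the column indices and X, y are hypothetical):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # append the explicit difference X1 - X2 as an extra column
    X_aug = np.hstack([X, (X[:, 0] - X[:, 1])[:, np.newaxis]])
    est = GradientBoostingRegressor(n_estimators=100).fit(X_aug, y)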

SLIDE 39

Tips & Tricks 3

Categorical variables

Sklearn requires that categorical variables be encoded as numerics. Tree-based methods work well with ordinal encoding:

    df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
    # ordinal encoding
    df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
    X = np.asfortranarray(df_enc.values, dtype=np.float32)