Gradient Boosted Regression Trees
Peter Prettenhofer (@pprett)
DataRobot
Gilles Louppe (@glouppe)
Université de Liège, Belgium
Outline
1. Basics
2. Gradient Boosting
3. Gradient Boosting in scikit-learn
4. Case Study: California housing
Predictive modeling: learn y = f(x) from training data such that the error of the predictions f(x) ≈ y on new (unseen) x is minimal.
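A minimal sketch of this setup, assuming a synthetic regression task (the dataset and model are illustrative, not from the talk): fit on training data, then measure the error on held-out (unseen) x.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in 0.15-era releases
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
est = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
# the error on unseen data is what we actually care about
print(mean_squared_error(y_test, est.predict(X_test)))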
[Figure: regression tree for California housing — splits on MedInc, AveRooms, and AveOccup; leaf values range from 1.16 to 4.57]
sklearn.tree.DecisionTreeClassifier|Regressor
[Figure: regression tree fits of increasing depth vs. ground truth — RT max_depth=1, RT max_depth=3, RT max_depth=20]
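A hedged sketch reproducing this kind of comparison, assuming a 1-D sinusoidal toy target (not the exact data behind the figure):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) * 4 + rng.normal(scale=0.5, size=100)
X = x.reshape(-1, 1)

plt.plot(x, np.sin(x) * 4, label='ground truth')
for depth in (1, 3, 20):
    est = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    plt.plot(x, est.predict(X), label='RT max_depth=%d' % depth)
plt.legend()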
2. Gradient Boosting
[Figure: AdaBoost decision boundaries in the (x0, x1) plane over successive boosting iterations]
sklearn.ensemble.AdaBoostClassifier|Regressor
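A minimal sketch of this API on a two-class toy dataset (the data is an assumption, not the one in the figure):

from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier

X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=2)
est = AdaBoostClassifier(n_estimators=200).fit(X, y)
print(est.score(X, y))  # training accuracy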
[Figure: stage-wise additive fitting — ground truth, then the cumulative model ~ tree 1, + tree 2, + tree 3]
sklearn.ensemble.GradientBoostingClassifier|Regressor
Each boosting stage fits a regression tree to the negative gradient of the loss at the current model:

r_i = -∂L(y_i, f(x_i)) / ∂f(x_i)

[Figure: regression losses (squared, absolute, Huber error) as a function of y − f(x); classification losses (zero-one, log, exponential loss) as a function of y · f(x)]
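For squared error, the negative gradient is just the residual y − f(x), so gradient boosting reduces to repeatedly fitting small trees to residuals. A minimal from-scratch sketch under that assumption (toy data and shrinkage value are illustrative, not the library's internals):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(13)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

f = np.full_like(y, y.mean())      # start from the mean prediction
learning_rate, trees = 0.1, []
for m in range(100):
    residual = y - f               # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    f += learning_rate * tree.predict(X)
    trees.append(tree)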
3. Gradient Boosting in scikit-learn
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0]  # class probabilities
array([ 0.67,  0.33])
from sklearn.ensemble import GradientBoostingRegressor

est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)
[Figure: ground truth vs. RT max_depth=1, RT max_depth=3, and GBRT max_depth=1 — annotated "high bias, low variance" vs. "low bias, high variance"]
test_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')
[Figure: train and test error vs. n_estimators — the lowest test error is marked; the train-test gap widens as more trees are added]
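A natural follow-up (my addition, not shown on the slide): pick the iteration with the lowest held-out error from the test_score array computed above.

best_n = int(np.argmin(test_score)) + 1  # iteration with the lowest test error
print('best n_estimators: %d' % best_n)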
[Figure: same train/test error curves with learning_rate=0.1 — lower test error, but requires more trees]
[Figure: train/test error with subsample=0.5 and learning_rate=0.1 — even lower test error; subsampling alone (without shrinkage) does poorly]
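A sketch combining both regularizers, with the parameter values taken from the figures above (the training data is assumed):

est = GradientBoostingRegressor(n_estimators=1000,
                                learning_rate=0.1,  # shrinkage
                                subsample=0.5)      # stochastic gradient boosting
est.fit(X_train, y_train)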
from sklearn.grid_search import GridSearchCV

param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]}
est = GradientBoostingRegressor(n_estimators=3000)
gs_cv = GridSearchCV(est, param_grid).fit(X, y)
# best hyperparameter setting
gs_cv.best_params_
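The model refit with the best setting is then available directly (X_test here is an assumed held-out set):

best_est = gs_cv.best_estimator_  # refit on the full data with the best parameters
pred = best_est.predict(X_test)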
4. Case Study: California housing
California Housing dataset
Features: median income, median age, latitude, longitude, ...
Challenges
[Figure: train and test error vs. n_estimators on California housing]
>>> est.feature_importances_
array([ 0.01, 0.38, ...])
[Figure: relative feature importance — MedInc most important, followed by AveRooms, Longitude, AveOccup, Latitude, AveBedrms, Population, HouseAge]
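A sketch of how such a chart can be drawn from feature_importances_, assuming names holds the feature names (as used in the partial dependence snippet below):

import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(est.feature_importances_)  # ascending, so the top bar is most important
plt.barh(np.arange(len(order)), est.feature_importances_[order])
plt.yticks(np.arange(len(order)), np.array(names)[order])
plt.xlabel('Relative importance')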
from sklearn.ensemble import partial_dependence as pd

features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
            ('AveOccup', 'HouseAge')]
fig, axs = pd.plot_partial_dependence(est, X_train, features,
                                      feature_names=names)
[Figure: partial dependence of house value on non-location features (MedInc, AveOccup, HouseAge, AveRooms) plus a two-way plot of (AveOccup, HouseAge) for the California housing dataset]
[Figure: partial dependence of median house value on latitude and longitude]
[Figure: benchmark of gbm vs. sklearn-0.15 — error, train time, and test time across datasets: Arcene, Boston, California, Covtype, Example 10.2, Expedia, Madelon, Solar, Spam, YahooLTRC, bioresp]
Use dtype=np.float32 to avoid memory copies, and Fortran layout for a slight runtime benefit:

X = np.asfortranarray(X, dtype=np.float32)
GBRT detects feature interactions automatically, but explicit interaction features often help. Trees required to approximate X1 − X2: 10 (left) vs. 1000 (right).
[Figure: surface plots of GBRT approximations to the function x − y, contrasting the 10-tree and 1000-tree fits]
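A sketch of adding the interaction explicitly as an extra column (variable names are assumptions):

import numpy as np

# append x1 - x2 as an explicit feature so a single split can capture it
X_aug = np.hstack([X, (X[:, 0] - X[:, 1])[:, np.newaxis]])
est = GradientBoostingRegressor(n_estimators=100).fit(X_aug, y)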
Sklearn requires that categorical variables are encoded as numerics. Tree-based methods work well with ordinal encoding:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
# ordinal encoding
df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
X = np.asfortranarray(df_enc.values, dtype=np.float32)