Applied Machine Learning
Bootstrap, Bagging and Boosting
Siamak Ravanbakhsh
COMP 551 (Winter 2020)

Learning objectives
bootstrap for uncertainty estimation
bagging and boosting for better predictive performance
Non-parametric bootstrap
a simple approach to estimate the uncertainty in prediction:
given the dataset D = {(x^(n), y^(n))}_{n=1}^{N}
subsample with replacement B datasets of size N (called bootstrap samples):
D_b = {(x^{(n,b)}, y^{(n,b)})}_{n=1}^{N},  b = 1, …, B
train a model on each of these bootstrap datasets
produce a measure of uncertainty from these models:
  for model parameters
  for predictions
ϕ_k(x) = exp(−(x − μ_k)² / s²)
import numpy as np
import matplotlib.pyplot as plt

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)            # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])     # N x 10
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')
Recall: linear model with nonlinear Gaussian bases (N=100)
[figure: the function before adding noise, the noisy data, and the fitted model]
ϕ_k(x) = exp(−(x − μ_k)² / s²)

# Phi: N x D, y: N
B = 500
ws = np.zeros((B, D))
for b in range(B):
    inds = np.random.randint(N, size=N)    # sample N indices with replacement
    Phi_b = Phi[inds, :]                   # N x D bootstrap design matrix
    y_b = y[inds]                          # N bootstrap targets
    # fit the subsampled data
    ws[b, :] = np.linalg.lstsq(Phi_b, y_b, rcond=None)[0]

plt.hist(ws, bins=50)
Recall: linear model with nonlinear Gaussian bases (N=100)
using B=500 bootstrap samples gives a measure of uncertainty of the parameters
[figure: histogram of the bootstrap weights; each color is a different weight w_d]
ϕ_k(x) = exp(−(x − μ_k)² / s²)

# Phi: N x D, Phi_test: Nt x D, y: N
# ws: B x D from the previous code
y_hats = np.zeros((B, Nt))
for b in range(B):
    wb = ws[b, :]
    y_hats[b, :] = np.dot(Phi_test, wb)

# get the 5% and 95% quantiles across bootstrap predictions
y_5 = np.quantile(y_hats, .05, axis=0)
y_95 = np.quantile(y_hats, .95, axis=0)
Recall: linear model with nonlinear Gaussian bases (N=100)
using B=500 bootstrap samples also gives a measure of uncertainty of the predictions
[figure: the red lines are the 5% and 95% quantiles; for each point we can get these across the bootstrap model predictions]
Use bootstrap for more accurate prediction (not just uncertainty)

variance of the sum of two random variables:
Var(z_1 + z_2) = E[(z_1 + z_2)²] − E[z_1 + z_2]²
             = E[z_1² + z_2² + 2 z_1 z_2] − (E[z_1] + E[z_2])²
             = E[z_1²] + E[z_2²] + E[2 z_1 z_2] − E[z_1]² − E[z_2]² − 2 E[z_1] E[z_2]
             = Var(z_1) + Var(z_2) + 2 Cov(z_1, z_2)
for uncorrelated variables the covariance term is zero
if z_1, …, z_B are uncorrelated random variables with mean μ and variance σ²,
their average z̄ = (1/B) ∑_b z_b has mean μ and variance
Var(z̄) = Var((1/B) ∑_b z_b) = (1/B²) ∑_b Var(z_b) = (1/B²) B σ² = σ²/B

the average of uncorrelated random variables has a lower variance
use this to reduce the variance of our models (the bias remains the same)

bagging (bootstrap aggregation): use bootstrap for more accurate prediction (not just uncertainty)
regression: average the model predictions,  f̂(x) = (1/B) ∑_b f̂_b(x)
use bootstrap samples to reduce correlation
issue: the model predictions are not uncorrelated (they are trained on the same data)
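A minimal numpy check of the variance claim (B, σ² and the seed are illustrative choices, not from the lecture):

import numpy as np

# quick check: the average of B uncorrelated variables has variance sigma^2 / B
np.random.seed(0)
B, trials, sigma2 = 50, 100000, 4.0
z = np.random.normal(0.0, np.sqrt(sigma2), size=(trials, B))
z_bar = z.mean(axis=1)                 # average over the B variables
print(z_bar.var())                     # close to sigma2 / B = 0.08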
averaging makes sense for regression; how about classification? use voting
the mode of iid classifiers that are better than chance is a better classifier

wisdom of crowds
crowds are wiser when individuals are better than random and votes are uncorrelated
z_1, …, z_B ∈ {0, 1} are IID Bernoulli random variables with mean μ = .5 + ϵ
for the average z̄ = (1/B) ∑_b z_b and ϵ > 0, p(z̄ > .5) goes to 1 as B grows
bagging (bootstrap aggregation): use bootstrap samples to reduce correlation
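A small simulation of this claim (ϵ, the trial counts and the seed are illustrative):

import numpy as np

# B independent voters, each correct with probability .5 + eps
np.random.seed(0)
eps, trials = 0.01, 10000
for B in [10, 100, 1000, 10000]:
    votes = np.random.rand(trials, B) < (0.5 + eps)     # 1 = correct vote
    p_majority = (votes.mean(axis=1) > 0.5).mean()      # majority is correct
    print(B, p_majority)                                # approaches 1 as B grows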
example
setup: synthetic dataset; 5 correlated features; the 1st feature is a noisy predictor of the label
bootstrap samples create different decision trees (due to their high variance)
to aggregate: vote for the most probable class, or average the predicted probabilities
compared to a single decision tree, the ensemble is no longer interpretable!
random forests: further reduce the correlation (dependence) between decision trees
feature sub-sampling: only a random subset of features is considered at each split
magic number? this is a hyper-parameter and can be optimized using CV
Out-Of-Bag (OOB) samples: the instances not included in a bootstrap dataset can be used for validation
simultaneous validation of the decision trees in a forest; no need to set aside data for cross-validation
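Not the lecture's code: a minimal scikit-learn sketch showing feature sub-sampling and OOB validation together; the dataset and hyper-parameter values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data standing in for the lecture's example
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # size of the forest (a hyper-parameter)
    max_features="sqrt",     # feature sub-sampling at each split
    bootstrap=True,          # each tree is trained on a bootstrap sample
    oob_score=True,          # validate on the out-of-bag instances
    random_state=0,
).fit(X, y)

print(forest.oob_score_)     # OOB accuracy; no held-out validation set needed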
Dataset
an example of feature engineering
N = 4601 emails; binary classification task: spam vs. not spam
D = 57 features:
  48 words: percentage of words in the email that match a given word
    e.g., business, address, internet, free, George (customized per user)
  6 characters: again the percentage of characters that match ch; ch( ch[ ch! ch$ ch#
  CAPAVE, CAPMAX, CAPTOT: average, max, and sum of the lengths of uninterrupted sequences of capital letters
[table: average value of these features in the spam and non-spam emails]
decision tree after pruning
the number of leaves (17) in the optimal pruning is decided based on cross-validation error
[figure: CV error and test error versus tree size; misclassification rate on test data]
Bagging and Random Forests do much better than a single decision tree! Out Of Bag (OOB) error can be used for parameter tuning (e.g., size of the forest)
Bootstrap is a powerful technique for getting uncertainty estimates
Bootstrap aggregation (bagging) can reduce the variance of unstable models
Random forests: bagging + further de-correlation via feature sub-sampling at each split
  OOB validation instead of CV
  destroys the interpretability of decision trees
  performs well in practice
  can fail if only a few relevant features exist (due to feature sub-sampling)
Adaptive bases
several methods can be seen as adaptively learning the bases ϕ_d(x):
  decision trees
  generalized additive models
  boosting
  neural networks
in boosting, each basis is a classifier or regression function (a weak learner, or base learner)
idea: create a strong learner by sequentially combining weak learners
Boosting
model:  f(x) = ∑_{t=1}^{T} w^{t} ϕ(x; v^{t})
each ϕ(x; v^{t}) is a simple model, such as a decision stump (a decision tree with a single split)
cost:  J({w^{t}, v^{t}}_t) = ∑_{n=1}^{N} L(y^(n), f(x^(n)))
so far we have seen the L2 loss, the log loss and the hinge loss
add one weak learner at each stage t, to reduce the error of the previous stage:
f^{t}(x) = f^{t−1}(x) + w^{t} ϕ(x; v^{t})
v^{t}, w^{t} = argmin_{v,w} ∑_{n=1}^{N} L(y^(n), f^{t−1}(x^(n)) + w ϕ(x^(n); v))
L2 boosting
at stage t, consider weak learners that are individual features:  ϕ^{t}(x) = w^{t} x_{d^{t}}
using the L2 loss for regression, the cost at stage t is
argmin_{d,w} (1/2) ∑_{n=1}^{N} (y^(n) − (f^{t−1}(x^(n)) + w x_d^(n)))²
where r^(n) = y^(n) − f^{t−1}(x^(n)) is the residual of the current model
recall: the optimal weight for each feature d is  w_d = ∑_n x_d^(n) r^(n) / ∑_n (x_d^(n))²
pick the feature that most significantly reduces the residual
the model at time-step t:  f^{t}(x) = ∑_{t'≤t} α w^{t'} x_{d^{t'}}
using a small learning rate α helps with the test error
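A minimal sketch of this procedure under the update rule above, assuming a numeric design matrix X (the function and variable names are illustrative, not from the lecture):

import numpy as np

def l2_boost(X, y, T=1000, alpha=0.01):
    # L2 boosting sketch with single features as weak learners
    N, D = X.shape
    w = np.zeros(D)                          # accumulated coefficients
    f = np.zeros(N)                          # current predictions f^{t}(x)
    for t in range(T):
        r = y - f                            # residuals under the L2 loss
        w_all = X.T @ r / (X ** 2).sum(0)    # optimal weight for every feature d
        gains = w_all * (X.T @ r)            # drop in squared error per feature
        d = np.argmax(gains)                 # feature reducing the residual most
        w[d] += alpha * w_all[d]             # small-step (shrunken) update
        f += alpha * w_all[d] * X[:, d]
    return w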
is this related to L1-regularized linear regression?
example
using a small learning rate (α = .01), L2 boosting has a regularization path similar to lasso
[figure: coefficient paths w_d, for L2 boosting against the number of steps t and for lasso against the amount of regularization]
we can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K)
the number of steps t plays a role similar to (the inverse of) the regularization hyper-parameter
at each time-step only one feature d^{t} is updated / added
loss functions for binary classification, with y ∈ {−1, +1} and predicted label ŷ = sign(f(x)):
misclassification loss (0-1 loss):  L(y, f(x)) = I(y f(x) < 0)
log-loss (aka cross-entropy loss or binomial deviance):  L(y, f(x)) = log(1 + e^{−y f(x)})
hinge loss (support vector loss):  L(y, f(x)) = max(0, 1 − y f(x))
yet another loss function is the exponential loss:  L(y, f(x)) = e^{−y f(x)}
note that the exponential loss grows faster than the other surrogate losses (it is more sensitive to outliers)
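For comparison, a small snippet evaluating these losses at a few margin values y·f(x) (the grid is illustrative):

import numpy as np

# losses as a function of the margin y*f(x)
margin = np.linspace(-2, 2, 9)
zero_one = (margin < 0).astype(float)        # misclassification (0-1) loss
log_loss = np.log(1 + np.exp(-margin))       # log-loss / binomial deviance
hinge    = np.maximum(0, 1 - margin)         # hinge (support vector) loss
exp_loss = np.exp(-margin)                   # exponential loss
print(np.c_[margin, zero_one, log_loss, hinge, exp_loss])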
a useful property of the exponential loss when working with additive models:
L(y, f^{t−1}(x) + w^{t} ϕ(x; v^{t})) = L(y, f^{t−1}(x)) · L(y, w^{t} ϕ(x; v^{t}))
treat the first factor as a weight q for the instance:
instances that were not properly classified at previous stages receive a higher weight
cost:
J({w^{t}, v^{t}}_t) = ∑_{n=1}^{N} L(y^(n), f^{t−1}(x^(n)) + w^{t} ϕ(x^(n); v^{t})) = ∑_n q^(n) L(y^(n), w^{t} ϕ(x^(n); v^{t}))
where q^(n) = L(y^(n), f^{t−1}(x^(n))) is the loss for this instance at the previous stage

discrete AdaBoost: assume the weak learner is a simple classifier, so its output is ±1
using the exponential loss:
J({w^{t}, v^{t}}_t) = ∑_n q^(n) e^{−y^(n) w^{t} ϕ(x^(n); v^{t})}
  = e^{−w^{t}} ∑_n q^(n) I(y^(n) = ϕ(x^(n); v^{t})) + e^{w^{t}} ∑_n q^(n) I(y^(n) ≠ ϕ(x^(n); v^{t}))
  = e^{−w^{t}} ∑_n q^(n) + (e^{w^{t}} − e^{−w^{t}}) ∑_n q^(n) I(y^(n) ≠ ϕ(x^(n); v^{t}))
assuming w^{t} ≥ 0, the weak learner should minimize the last sum: this is classification with weighted instances
(the first term does not depend on the weak learner)
we still need to find the optimal w^{t} given v^{t}: setting ∂J/∂w^{t} = 0 gives
w^{t} = (1/2) log((1 − ℓ^{t}) / ℓ^{t})
where ℓ^{t} = ∑_n q^(n) I(ϕ(x^(n); v^{t}) ≠ y^(n)) / ∑_n q^(n) is the weight-normalized misclassification error
since the weak learner is better than chance, ℓ^{t} < .5 and so w^{t} ≥ 0

we can now update the instance weights q for the next iteration (multiply by the new loss):
q^{(n),t+1} = q^{(n),t} e^{−w^{t} y^(n) ϕ(x^(n); v^{t})}
since w^{t} > 0, the weights q of misclassified points increase and the rest decrease
AdaBoost
initialize q^(n) := 1/N  ∀n
for t = 1:T
  fit the simple classifier ϕ(x; v^{t}) to the weighted dataset
  ℓ^{t} := ∑_n q^(n) I(ϕ(x^(n); v^{t}) ≠ y^(n)) / ∑_n q^(n)
  w^{t} := (1/2) log((1 − ℓ^{t}) / ℓ^{t})
  q^(n) := q^(n) e^{−w^{t} y^(n) ϕ(x^(n); v^{t})}  ∀n
return  f(x) = sign(∑_t w^{t} ϕ(x; v^{t}))
[figure: the final classifier as a weighted combination w^{1}ϕ(x; v^{1}), w^{2}ϕ(x; v^{2}), …, w^{T}ϕ(x; v^{T})]
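A minimal sketch of discrete AdaBoost following the pseudocode above, using scikit-learn depth-1 trees as the decision stumps; labels are assumed to be in {−1, +1} and the function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # discrete AdaBoost sketch; labels y must be in {-1, +1}
    N = X.shape[0]
    q = np.full(N, 1.0 / N)                    # instance weights q^(n)
    stumps, ws = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=q)       # fit to the weighted dataset
        pred = stump.predict(X)
        err = q[pred != y].sum() / q.sum()     # weight-normalized error l^{t}
        err = np.clip(err, 1e-10, 1 - 1e-10)
        w = 0.5 * np.log((1 - err) / err)      # w^{t} = 1/2 log((1-l)/l)
        q = q * np.exp(-w * y * pred)          # up-weight misclassified points
        stumps.append(stump)
        ws.append(w)
    return stumps, ws

def adaboost_predict(X, stumps, ws):
    F = sum(w * s.predict(X) for w, s in zip(ws, stumps))
    return np.sign(F)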
example
each weak learner is a decision stump (dashed line)
ŷ = sign(∑_t w^{t} ϕ(x; v^{t}))
[figure: decision boundaries at t = 1, 2, 3, 6, 10, 150; green is the decision boundary of f^{t}; circle size is proportional to q^{(n),t}]
example
features x_1^(n), …, x_10^(n) are samples from a standard Gaussian
the label is y^(n) = I(∑_d (x_d^(n))² > 9.34)
N = 2000 training examples
notice that the test error does not increase: AdaBoost is very slow to overfit
application: face detection
Haar features are computationally efficient; each feature is a weak learner
AdaBoost picks one feature at a time (label: face / no-face)
this can still be inefficient: use the fact that faces are rare (.01% of subwindows are faces)
cascade of classifiers: because positives are so rare, each stage is tuned for ~100% detection at a modest false-positive rate, and the cumulative false-positive rate drops across stages
fast enough for real-time (object) detection; the cascade is applied over all image subwindows
image source: David Lowe's slides
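A toy sketch of the cascade idea only; the helper and its stages are hypothetical, not the actual detector code.

def cascade_predict(window, stages):
    # `stages` is a list of (score_fn, threshold) pairs ordered cheap-to-expensive
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:   # below the stage threshold: reject
            return False                   # most non-face windows exit early here
    return True                            # passed every stage: accept as a face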
Gradient boosting
let f^{t} = [f^{t}(x^(1)), …, f^{t}(x^(N))]^⊤ and the true labels y = [y^(1), …, y^(N)]^⊤
so far we treated f as a parameter vector, ignoring the structure of f:  f̂ = argmin_f L(f, y)
if we use gradient descent to minimize the loss, we can write f̂ as a sum of steps:
f̂ = f^{T} = f^{0} − ∑_{t=1}^{T} w^{t} g^{t}
where g^{t} = ∂L(f, y)/∂f^{t−1} is the gradient vector; its role is similar to the residual
we can look for the optimal step size:  w^{t} = argmin_w L(f^{t−1} − w g^{t}, y)
idea: fit the weak learner to the negative of the gradient of the cost
v^{t} = argmin_v (1/2) ||ϕ_v − (−g)||²_2,  where ϕ_v = [ϕ(x^(1); v), …, ϕ(x^(N); v)]^⊤
we are fitting the gradient using the L2 loss, regardless of the original loss function
apply gradient boosting to CART (classification and regression trees):
initialize f^{0} to predict a constant
for t = 1:T
  calculate the negative of the gradient  r = −∂L(f, y)/∂f^{t−1}
  fit a regression tree to (X, r)  (X is N × D, r is N) and produce regions R_1, …, R_K
  readjust the prediction per region:  w_k = argmin_w ∑_{x^(n) ∈ R_k} L(y^(n), f^{t−1}(x^(n)) + w)   (this is effectively the line-search)
  update  f^{t}(x) = f^{t−1}(x) + ∑_{k=1}^{K} w_k I(x ∈ R_k)
return f^{T}(x)

stochastic gradient boosting combines bootstrap and boosting: use a subsample at each iteration above (similar to stochastic gradient descent)
using a small learning rate α here improves the test error (shrinkage)
shallow trees with K = 4-8 leaves usually work well as weak learners
decide T using a validation set (early stopping)
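A minimal sketch of gradient tree boosting for regression with the L2 loss, where the negative gradient is the residual and the tree's leaf means already perform the per-region line-search; scikit-learn regression trees stand in for CART, and the names and default values are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, T=100, alpha=0.1, subsample=1.0, max_leaf_nodes=8):
    # gradient tree boosting sketch for the L2 loss
    N = X.shape[0]
    f0 = y.mean()                               # constant initial prediction f^{0}
    f = np.full(N, f0)
    trees = []
    for t in range(T):
        r = y - f                               # negative gradient for the L2 loss
        idx = np.random.choice(N, int(subsample * N), replace=False)  # stochastic variant
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X[idx], r[idx])                # fit a regression tree to the residuals
        f += alpha * tree.predict(X)            # shrunken update of f^{t}
        trees.append(tree)

    def predict(Xnew):
        return f0 + alpha * sum(t.predict(Xnew) for t in trees)
    return predict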
example
recall the synthetic example: features x_1^(n), …, x_10^(n) are samples from a standard Gaussian,
label y^(n) = I(∑_d (x_d^(n))² > 9.34), N = 2000 training examples
since the label depends on a sum over individual features (an additive structure), stumps work best
gradient tree boosting (using the log-loss) works better than AdaBoost
example
recall the synthetic example: features x_1^(n), …, x_10^(n) are samples from a standard Gaussian,
label y^(n) = I(∑_d (x_d^(n))² > 9.34), N = 2000 training examples
deviance = cross entropy = log-loss
[figure: test deviance and misclassification error for stumps (K=2) and K=6 trees, with and without shrinkage α = .2]
in both cases using shrinkage (α = .2) helps
while the test loss may increase, the test misclassification error does not
example
recall the synthetic example: features x_1^(n), …, x_10^(n) are samples from a standard Gaussian,
label y^(n) = I(∑_d (x_d^(n))² > 9.34), N = 2000 training examples
[figure: test deviance for α = 1, α = .1, stochastic with batch size 50%, and α = .1 combined with stochastic subsampling]
both shrinkage and subsampling can help, at the price of more hyper-parameters to tune
example
see the interactive demo: https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html
Summary: two ensemble methods
bagging & random forests (reduce variance): produce models with minimal correlation and use their average prediction
boosting (reduces the bias of the weak learner): models are added in steps, and a single cost function is minimized
  for the exponential loss: interpret as re-weighting the instances (AdaBoost)
  gradient boosting: fit the weak learner to the negative of the gradient
  interpretation as L1 regularization for "weak learner" selection
  also related to max-margin classification (for a large number of steps T)
random forests and (gradient) boosting generally perform very well
Gradients for some loss functions
f̂ = f^{T} = f^{0} − ∑_{t=1}^{T} w^{t} ∂L(f, y)/∂f^{t−1}

setting                    | loss function                | −∂L(f, y)/∂f^{t−1}
regression                 | (1/2) ||y − f||²_2           | y − f
regression                 | ||y − f||_1                  | sign(y − f)
multiclass classification  | multi-class cross-entropy    | Y − P
binary classification      | exponential loss exp(−y f)   | y exp(−y f)

here Y is the N × C matrix of labels and P the N × C matrix of predicted class probabilities, P_{n,c} = softmax(f^(n))[c]
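As a small illustration of the (negative) gradient column above for the element-wise losses (names are illustrative):

import numpy as np

# negative gradients -dL/df from the table above
neg_grad_l2  = lambda y, f: y - f                # 1/2 ||y - f||^2
neg_grad_l1  = lambda y, f: np.sign(y - f)       # ||y - f||_1
neg_grad_exp = lambda y, f: y * np.exp(-y * f)   # exp(-y f), with y in {-1, +1}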