SLIDE 1

Applied Machine Learning

Bootstrap, Bagging and Boosting

Siamak Ravanbakhsh

COMP 551 (Winter 2020)

SLIDE 2

Learning objectives

- bootstrap for uncertainty estimation
- bagging for variance reduction; random forests
- boosting: AdaBoost, gradient boosting, and the relationship to L1 regularization

SLIDE 3

Bootstrap

A simple approach to estimate the uncertainty in predictions. Given the dataset

$D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$

subsample with replacement B datasets of size N (called bootstrap samples):

$D_b = \{(x^{(n,b)}, y^{(n,b)})\}_{n=1}^{N}, \quad b = 1, \ldots, B$

Train a model on each of these bootstrap datasets, then produce a measure of uncertainty from these models:
- for model parameters
- for predictions

This is the non-parametric bootstrap.

SLIDE 4

Bootstrap: example

Recall: linear model with nonlinear Gaussian bases (N=100)

$\phi_k(x) = e^{-s^2 (x - \mu_k)^2}$

$y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon^{(n)}$  (the underlying function, before adding the noise $\epsilon$)

```python
# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)         # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])  # N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)                 # our fit to the data using 10 Gaussian bases
plt.plot(x, yh, 'g-')
```

SLIDE 5

Bootstrap: example

$\phi_k(x) = e^{-s^2 (x - \mu_k)^2}$

```python
# Phi: N x D, y: N
B = 500
ws = np.zeros((B, D))
for b in range(B):
    # subsample with replacement
    inds = np.random.randint(N, size=N)
    Phi_b = Phi[inds, :]  # N x D
    y_b = y[inds]         # N
    # fit the subsampled data
    ws[b, :] = np.linalg.lstsq(Phi_b, y_b)[0]
plt.hist(ws, bins=50)
```

Recall: linear model with nonlinear Gaussian bases (N=100). Using B=500 bootstrap samples gives a measure of uncertainty of the parameters (in the histogram, each color is a different weight $w_d$).

SLIDE 6

Winter 2020 | Applied Machine Learning (COMP551)

Bootstrap: example

$\phi_k(x) = e^{-s^2 (x - \mu_k)^2}$

```python
# Phi: N x D, Phi_test: Nt x D, y: N
# ws: B x D from previous code
y_hats = np.zeros((B, Nt))
for b in range(B):
    wb = ws[b, :]
    y_hats[b, :] = np.dot(Phi_test, wb)
# get the 5% and 95% quantiles
y_5 = np.quantile(y_hats, .05, axis=0)
y_95 = np.quantile(y_hats, .95, axis=0)
```

Recall: linear model with nonlinear Gaussian bases (N=100). Using B=500 bootstrap samples also gives a measure of uncertainty of the predictions: the red lines are the 5% and 95% quantiles (for each point we can get these across the bootstrap models' predictions).

SLIDE 7

Bagging

Use bootstrap for more accurate prediction (not just uncertainty).

Variance of a sum of random variables:

$\mathrm{Var}(z_1 + z_2) = E[(z_1 + z_2)^2] - E[z_1 + z_2]^2$

$= E[z_1^2 + z_2^2 + 2 z_1 z_2] - (E[z_1] + E[z_2])^2$

$= E[z_1^2] + E[z_2^2] + E[2 z_1 z_2] - E[z_1]^2 - E[z_2]^2 - 2 E[z_1] E[z_2]$

$= \mathrm{Var}(z_1) + \mathrm{Var}(z_2) + 2\,\mathrm{Cov}(z_1, z_2)$

For uncorrelated variables the covariance term is zero.

SLIDE 8

Bagging

Bagging (bootstrap aggregation): use bootstrap for more accurate prediction (not just uncertainty).

Suppose $z_1, \ldots, z_B$ are uncorrelated random variables with mean $\mu$ and variance $\sigma^2$. Their average $\bar{z} = \frac{1}{B} \sum_b z_b$ has mean $\mu$ and variance

$\mathrm{Var}(\frac{1}{B} \sum_b z_b) = \frac{1}{B^2} \sum_b \mathrm{Var}(z_b) = \frac{1}{B^2} B \sigma^2 = \frac{1}{B} \sigma^2$

The average of uncorrelated random variables has a lower variance. Use this to reduce the variance of our models (the bias remains the same). For regression, average the model predictions:

$\hat{f}(x) = \frac{1}{B} \sum_b \hat{f}_b(x)$

Issue: model predictions are not uncorrelated (they are trained using the same data). Use bootstrap samples to reduce the correlation.
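The $\frac{1}{B}\sigma^2$ variance of the average can be checked numerically; a minimal sketch (the values of B, sigma, and the number of replications are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma = 100, 2.0

# many replications of B uncorrelated variables with std sigma
z = rng.normal(0.0, sigma, size=(200_000, B))
z_bar = z.mean(axis=1)       # the averaged prediction in each replication

var_single = z[:, 0].var()   # ~ sigma^2
var_avg = z_bar.var()        # ~ sigma^2 / B
print(var_single / var_avg)  # ~ B
```

The ratio of the two empirical variances comes out close to B, matching the derivation.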

SLIDE 9

Bagging for classification

Averaging makes sense for regression; how about classification? Use voting: the mode of iid classifiers that are better than chance is a better classifier.

Wisdom of crowds: crowds are wiser when individuals are better than random and votes are uncorrelated. Suppose $z_1, \ldots, z_B \in \{0, 1\}$ are iid Bernoulli random variables with mean $\mu = .5 + \epsilon$, and let $\bar{z} = \frac{1}{B} \sum_b z_b$. For $\epsilon > 0$, $p(\bar{z} > .5)$ goes to 1 as B grows.

Bagging (bootstrap aggregation): use bootstrap samples to reduce correlation.
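The wisdom-of-crowds claim can be simulated directly; a minimal sketch, with $\epsilon = 0.1$ and the ensemble sizes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.1, 20_000   # each iid vote z_b is 1 (correct) with prob .5 + eps

p_maj = {}
for B in [1, 11, 101]:
    votes = rng.random((trials, B)) < 0.5 + eps   # z_b in {0, 1}
    p_maj[B] = (votes.mean(axis=1) > 0.5).mean()  # estimate of p(z_bar > .5)

print(p_maj)
```

The estimated majority-vote accuracy increases with B and approaches 1, as the slide states.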

SLIDE 10

Bagging decision trees

Example setup: a synthetic dataset with 5 correlated features, where the 1st feature is a noisy predictor of the label.

Bootstrap samples create B different decision trees (due to the high variance of trees). To aggregate them we can vote for the most probable class, or average the predicted probabilities. Compared to a single decision tree, the bagged model is no longer interpretable!
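Bagged trees with majority voting can be sketched in a few lines. This assumes scikit-learn's `DecisionTreeClassifier` is available, and the synthetic data here is illustrative (not the slide's 5-feature dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
N, D, B = 500, 5, 25

# illustrative data: the label depends on a noisy first feature
X = rng.normal(size=(N, D))
y = (X[:, 0] + 0.5 * rng.normal(size=N) > 0).astype(int)

votes = np.zeros((B, N))
for b in range(B):
    inds = rng.integers(N, size=N)                     # bootstrap sample
    tree = DecisionTreeClassifier().fit(X[inds], y[inds])
    votes[b] = tree.predict(X)

y_hat = (votes.mean(axis=0) > 0.5).astype(int)         # vote for the most probable class
print("training accuracy:", (y_hat == y).mean())
```

Each tree differs because it sees a different bootstrap sample; the vote averages away part of their variance.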

SLIDE 11

Random forests

Further reduce the correlation (dependence) between decision trees by feature sub-sampling: only a random subset of the D features is available for the split at each step. A common heuristic is to consider about $\sqrt{D}$ features per split; a magic number? This is a hyper-parameter and can be optimized using CV.

Out-Of-Bag (OOB) samples: the instances not included in a bootstrap dataset can be used for validation. This gives simultaneous validation of the decision trees in a forest, with no need to set aside data for cross-validation.
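Feature sub-sampling and OOB validation are both exposed by scikit-learn's `RandomForestClassifier` (assumed available here; the synthetic data is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
N, D = 1000, 10
X = rng.normal(size=(N, D))
y = (X[:, :3].sum(axis=1) > 0).astype(int)   # label driven by a few features

# max_features='sqrt': only ~sqrt(D) features are candidates at each split
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                            oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", rf.oob_score_)        # validation without a held-out set
```

`oob_score_` is computed from each tree's out-of-bag instances, exactly the idea on this slide.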

SLIDE 12

Example: spam detection

Dataset: N=4601 emails; binary classification task: spam vs. not spam; D=57 features:
- 48 words: the percentage of words in the email that match a given word, e.g. business, address, internet, free, george (customized per user)
- 6 characters: again the percentage of characters that match these: ch; ch( ch[ ch! ch$ ch#
- average, max, and sum of the lengths of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT

The average values of these features differ between the spam and non-spam emails. This is an example of feature engineering.

SLIDE 13

Example: spam detection

Decision tree after pruning: the number of leaves (17) in the optimal pruning is decided based on cross-validation error. (Figure: CV error and test error, i.e. misclassification rate on test data, vs. tree size.)

SLIDE 14

Example: spam detection

Bagging and Random Forests do much better than a single decision tree! The Out-Of-Bag (OOB) error can be used for parameter tuning (e.g., the size of the forest).

SLIDE 15

Summary so far...

- Bootstrap is a powerful technique to get uncertainty estimates.
- Bootstrap aggregation (bagging) can reduce the variance of unstable models.
- Random forests: bagging + further de-correlation of trees by feature sub-sampling at each split, with OOB validation instead of CV. They
  - destroy the interpretability of decision trees,
  - perform well in practice,
  - can fail if only a few relevant features exist (due to feature sub-sampling).

SLIDE 16

Adaptive bases

Several methods can be classified as learning a set of bases adaptively:

$f(x) = \sum_d w_d\, \phi_d(x; v_d)$

- decision trees
- generalized additive models
- boosting
- neural networks

In boosting, each basis is a classifier or regression function (a weak learner, or base learner). The idea is to create a strong learner by sequentially combining weak learners.

SLIDE 17

Forward stagewise additive modelling

Model: $f(x) = \sum_{t=1}^{T} w^{\{t\}} \phi(x; v^{\{t\}})$, where each $\phi$ is a simple model, such as a decision stump (a decision tree with a single split).

Cost: $J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L(y^{(n)}, f(x^{(n)}))$ — so far we have seen the L2 loss, log loss, and hinge loss.

Optimizing this cost is difficult given the form of f.

Optimization idea: add one weak learner in each stage t, to reduce the error of the previous stage.

1. Find the best weak learner:

$v^{\{t\}}, w^{\{t\}} = \arg\min_{v, w} \sum_{n=1}^{N} L(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w\, \phi(x^{(n)}; v))$

2. Add it to the current model:

$f^{\{t\}}(x) = f^{\{t-1\}}(x) + w^{\{t\}} \phi(x; v^{\{t\}})$

SLIDE 18

L2 loss & forward stagewise linear model

At stage t, consider weak learners that are individual features: $\phi^{\{t\}}(x) = w^{\{t\}} x_{d^{\{t\}}}$.

Using the L2 loss for regression, the cost at stage t is

$\arg\min_{d, w} \frac{1}{2} \sum_{n=1}^{N} \big(y^{(n)} - (f^{\{t-1\}}(x^{(n)}) + w\, x_d^{(n)})\big)^2 = \arg\min_{d, w} \frac{1}{2} \sum_{n=1}^{N} \big(r^{(n)} - w\, x_d^{(n)}\big)^2$

where $r^{(n)} = y^{(n)} - f^{\{t-1\}}(x^{(n)})$ is the residual.

Optimization: recall that the optimal weight for each d is

$w_d = \frac{\sum_n x_d^{(n)} r^{(n)}}{\sum_n (x_d^{(n)})^2}$

so pick the feature that most significantly reduces the residual. The model at time-step t is

$f^{\{t\}}(x) = \sum_t \alpha\, w^{\{t\}} x_{d^{\{t\}}}$

Using a small $\alpha$ helps with test error. Is this related to L1-regularized linear regression?
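The stagewise updates above can be sketched directly in numpy. This is a minimal sketch under illustrative assumptions (the sparse ground-truth weights, noise level, and the values of T and alpha are mine, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T, alpha = 200, 10, 500, 0.1

X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.0, 0.5]                 # sparse ground truth (illustrative)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
r = y.copy()                                  # residual r = y - f(x); f starts at 0
sq_norm = (X**2).sum(axis=0)
for t in range(T):
    w_all = X.T @ r / sq_norm                 # optimal per-feature weight w_d
    d = np.argmax((X.T @ r)**2 / sq_norm)     # feature that most reduces the residual
    w[d] += alpha * w_all[d]                  # shrunk update with learning rate alpha
    r -= alpha * w_all[d] * X[:, d]

print(np.round(w, 2))
```

Tracking `w` over the T steps traces out the regularization path discussed on the next slide: only one coordinate moves per step, and the relevant features enter first.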

SLIDE 19

Example: using a small learning rate ($\alpha = .01$), L2 boosting has a regularization path similar to that of the lasso. At each time-step only one coefficient $w_{d^{\{t\}}}$ is updated / added, and the number of steps t plays a role similar to (the inverse of) the regularization hyper-parameter on $\sum_d |w_d|$. (Figure: coefficient paths of the lasso vs. L2 boosting.)

We can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K).

SLIDE 20

Exponential loss & AdaBoost

Loss functions for binary classification, with $y \in \{-1, +1\}$ and predicted label $\hat{y} = \mathrm{sign}(f(x))$:

- misclassification loss (0-1 loss): $L(y, f(x)) = \mathbb{I}(y f(x) \le 0)$
- log-loss (aka cross-entropy loss or binomial deviance): $L(y, f(x)) = \log(1 + e^{-y f(x)})$
- hinge loss (support vector loss): $L(y, f(x)) = \max(0, 1 - y f(x))$

Yet another loss function is the exponential loss: $L(y, f(x)) = e^{-y f(x)}$. Note that this loss grows faster than the other surrogate losses (it is more sensitive to outliers).

A useful property when working with additive models:

$L(y, f^{\{t-1\}}(x) + w^{\{t\}} \phi(x; v^{\{t\}})) = L(y, f^{\{t-1\}}(x)) \cdot L(y, w^{\{t\}} \phi(x; v^{\{t\}}))$

Treat the first factor as a weight q for an instance: instances that were not properly classified before receive a higher weight.
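The multiplicative property follows from $e^{a+b} = e^a e^b$ and can be confirmed directly; the point values below are illustrative:

```python
import numpy as np

def exp_loss(y, f):
    return np.exp(-y * f)

y = np.array([1., -1., 1., -1.])
f_prev = np.array([0.3, -1.2, -0.5, 2.0])   # f^{t-1}(x) on four points
step = 0.7 * np.array([1., 1., -1., -1.])   # w^{t} * phi(x; v^{t})

lhs = exp_loss(y, f_prev + step)
rhs = exp_loss(y, f_prev) * exp_loss(y, step)
print(np.allclose(lhs, rhs))  # True: the old loss acts as an instance weight q
```

The first factor on the right is exactly the weight $q^{(n)}$ used in the AdaBoost derivation on the next slide.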

SLIDE 21

Exponential loss & AdaBoost

Cost:

$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w^{\{t\}} \phi(x^{(n)}; v^{\{t\}})) = \sum_n q^{(n)} L(y^{(n)}, w^{\{t\}} \phi(x^{(n)}; v^{\{t\}}))$

where $q^{(n)} = L(y^{(n)}, f^{\{t-1\}}(x^{(n)}))$ is the loss for this instance at the previous stage. Using the exponential loss,

$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_n q^{(n)} e^{-y^{(n)} w^{\{t\}} \phi(x^{(n)}; v^{\{t\}})}$

Discrete AdaBoost: assume the weak learner is a simple classifier, so its output is +/- 1. Then

$J = e^{-w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}(y^{(n)} = \phi(x^{(n)}; v^{\{t\}})) + e^{w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v^{\{t\}}))$

$= e^{-w^{\{t\}}} \sum_n q^{(n)} + (e^{w^{\{t\}}} - e^{-w^{\{t\}}}) \sum_n q^{(n)} \mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v^{\{t\}}))$

Optimization: the objective is to find the weak learner minimizing the cost above. The first term does not depend on the weak learner, so assuming $w^{\{t\}} \geq 0$, the weak learner should minimize the second sum — this is classification with weighted instances.

SLIDE 22

Exponential loss & AdaBoost

$J = e^{-w^{\{t\}}} \sum_n q^{(n)} + (e^{w^{\{t\}}} - e^{-w^{\{t\}}}) \sum_n q^{(n)} \mathbb{I}(y^{(n)} \neq \phi(x^{(n)}; v^{\{t\}}))$

Assuming $w^{\{t\}} \geq 0$, the weak learner should minimize this cost; this is classification with weighted instances, and it gives $v^{\{t\}}$. We still need to find the optimal $w^{\{t\}}$. Setting $\frac{\partial J}{\partial w^{\{t\}}} = 0$ gives

$w^{\{t\}} = \frac{1}{2} \log \frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$

where $\ell^{\{t\}}$ is the weight-normalized misclassification error

$\ell^{\{t\}} = \frac{\sum_n q^{(n)} \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$

Since the weak learner is better than chance, $\ell^{\{t\}} < .5$ and so $w^{\{t\}} > 0$.

We can now update the instance weights q for the next iteration (multiply by the new loss):

$q^{(n),\{t+1\}} = q^{(n),\{t\}}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$

Since $w^{\{t\}} > 0$, the weights q of misclassified points increase and the rest decrease.

SLIDE 23

Exponential loss & AdaBoost

Overall algorithm for discrete AdaBoost:

initialize $q^{(n)} := \frac{1}{N} \;\; \forall n$
for t = 1:T
    fit the simple classifier $\phi(x; v^{\{t\}})$ to the weighted dataset
    $\ell^{\{t\}} := \frac{\sum_n q^{(n)} \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$
    $w^{\{t\}} := \frac{1}{2} \log \frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$
    $q^{(n)} := q^{(n)}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})} \;\; \forall n$
return $f(x) = \mathrm{sign}(\sum_t w^{\{t\}} \phi(x; v^{\{t\}}))$

The final model is a weighted vote over the sequence of weak learners $w^{\{1\}} \phi(x; v^{\{1\}}), w^{\{2\}} \phi(x; v^{\{2\}}), \ldots, w^{\{T\}} \phi(x; v^{\{T\}})$.
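The algorithm above fits in a short script. A minimal sketch of discrete AdaBoost with decision stumps, using the deck's later synthetic dataset ($x_d \sim N(0,1)$, label from the sum of squares); the quantile-based stump search is my own simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 300, 10, 150
X = rng.normal(size=(N, D))
y = np.where((X**2).sum(axis=1) > 9.34, 1, -1)       # y in {-1, +1}

def fit_stump(X, y, q):
    # weighted search over (feature, threshold, sign); thresholds from quantiles
    best_err, best_v = np.inf, None
    for d in range(X.shape[1]):
        for thr in np.quantile(X[:, d], np.linspace(0.05, 0.95, 19)):
            for s in (1, -1):
                pred = s * np.where(X[:, d] > thr, 1, -1)
                err = q @ (pred != y)
                if err < best_err:
                    best_err, best_v = err, (d, thr, s)
    return best_v

def stump_predict(v, X):
    d, thr, s = v
    return s * np.where(X[:, d] > thr, 1, -1)

q = np.ones(N) / N                                   # initialize q^(n) := 1/N
stumps, ws = [], []
for t in range(T):
    v = fit_stump(X, y, q)
    pred = stump_predict(v, X)
    ell = q @ (pred != y) / q.sum()                  # weight-normalized error
    w = 0.5 * np.log((1 - ell) / ell)                # w^{t}
    q = q * np.exp(-w * y * pred)                    # re-weight the instances
    stumps.append(v)
    ws.append(w)

F = sum(w * stump_predict(v, X) for w, v in zip(ws, stumps))
acc = (np.sign(F) == y).mean()
print("training accuracy:", acc)
```

Trying both signs in the stump search guarantees $\ell^{\{t\}} \le .5$, so every $w^{\{t\}}$ is non-negative, as the derivation assumes.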

SLIDE 24

AdaBoost: example

Each weak learner is a decision stump (dashed line), and the prediction is $\hat{y} = \mathrm{sign}(\sum_t w^{\{t\}} \phi(x; v^{\{t\}}))$. (Figure: panels at t = 1, 2, 3, 6, 10, 150; green is the decision boundary of $f^{\{t\}}$; circle size is proportional to $q^{(n),\{t\}}$.)

SLIDE 25

AdaBoost: example

The features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ are samples from a standard Gaussian, and the label is $y^{(n)} = \mathbb{I}(\sum_d (x_d^{(n)})^2 > 9.34)$, with N=2000 training examples. Notice that the test error does not increase: AdaBoost is very slow to overfit.

SLIDE 26

Application: Viola-Jones face detection

Haar features are computationally efficient; each feature is a weak learner, and AdaBoost picks one feature at a time (label: face / no-face). This can still be inefficient, so use the fact that faces are rare (.01% of subwindows are faces): a cascade of classifiers, each kept at ~100% detection rate with a modest false-positive rate, so that the cumulative false-positive rate of the cascade becomes small. The cascade is applied over all image subwindows and is fast enough for real-time (object) detection.

image source: David Lowe slides

SLIDE 27

Gradient boosting

Let $f^{\{t\}} = [f^{\{t\}}(x^{(1)}), \ldots, f^{\{t\}}(x^{(N)})]^\top$ be the model's predictions and $y = [y^{(1)}, \ldots, y^{(N)}]^\top$ the true labels.

Ignoring the structure of f (so far we treated f as a parameter vector), we want $\hat{f} = \arg\min_f L(f, y)$. If we use gradient descent to minimize the loss, we can write $\hat{f}$ as a sum of steps:

$\hat{f} = f^{\{T\}} = f^{\{0\}} - \sum_{t=1}^{T} w^{\{t\}} g^{\{t\}}$

where $g^{\{t\}} = \frac{\partial}{\partial f} L(f^{\{t-1\}}, y)$ is the gradient vector; its role is similar to the residual. We can look for the optimal step size

$w^{\{t\}} = \arg\min_w L(f^{\{t-1\}} - w\, g^{\{t\}})$

Idea: fit the weak learner to the negative of the gradient of the cost,

$v^{\{t\}} = \arg\min_v \frac{1}{2} ||\phi_v - (-g)||_2^2, \quad \phi_v = [\phi(x^{(1)}; v), \ldots, \phi(x^{(N)}; v)]^\top$

Note that we are fitting the gradient using the L2 loss regardless of the original loss function.

SLIDE 28

Gradient tree boosting

Apply gradient boosting to CART (classification and regression trees):

initialize $f^{\{0\}}$ to predict a constant
for t = 1:T
    calculate the negative of the gradient: $r = -\frac{\partial}{\partial f} L(f^{\{t-1\}}, y)$
    fit a regression tree to $(X, r)$ (inputs $N \times D$, targets $N$) and produce regions $R_1, \ldots, R_K$
    re-adjust the predictions per region (this is effectively the line search):
        $w_k = \arg\min_w \sum_{x^{(n)} \in R_k} L(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w)$
    update $f^{\{t\}}(x) = f^{\{t-1\}}(x) + \sum_{k=1}^{K} w_k \mathbb{I}(x \in R_k)$
return $f^{\{T\}}(x)$

Notes:
- stochastic gradient boosting combines bootstrap and boosting: use a subsample at each iteration above (similar to stochastic gradient descent)
- using a small learning rate $\alpha$ here improves test error (shrinkage)
- shallow trees of K = 4-8 leaves usually work well as weak learners
- decide T using a validation set (early stopping)
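For the L2 loss the negative gradient is just the residual, and the tree's leaf means already give the per-region $w_k$, so the loop above collapses to a few lines. A minimal sketch, assuming scikit-learn's `DecisionTreeRegressor` as the regression-tree fitter (the 1-D sine data, T, K, and alpha are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
N, T, alpha = 300, 100, 0.1
X = rng.uniform(0, 10, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

f = np.full(N, y.mean())         # f^{0}: predict a constant
trees = []
for t in range(T):
    r = y - f                    # negative gradient of the L2 loss = residual
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, r)  # K = 8 regions
    f += alpha * tree.predict(X) # shrinkage with a small learning rate
    trees.append(tree)

print("train MSE:", np.mean((y - f)**2))
```

A new input is scored by adding `alpha * tree.predict(x)` over all stored trees to the initial constant, mirroring $f^{\{T\}}$.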

SLIDE 29

Gradient tree boosting: example

Recall the synthetic example: features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ sampled from a standard Gaussian, label $y^{(n)} = \mathbb{I}(\sum_d (x_d^{(n)})^2 > 9.34)$, N=2000 training examples.

Since a sum over features is used in the label, stumps work best as weak learners here. Gradient tree boosting (using log-loss) works better than AdaBoost.

SLIDE 30

Gradient tree boosting: example

Same synthetic example; here deviance = cross-entropy = log-loss. Comparing stumps (K=2) with deeper trees (K=6): in both cases using shrinkage ($\alpha = .2$) helps. While the test loss may increase, the test misclassification error does not.

SLIDE 31

Gradient tree boosting: example

Same synthetic example, comparing $\alpha = 1$, $\alpha = .1$, stochastic with batch size 50%, and $\alpha = .1$ combined with stochastic subsampling. Both shrinkage and subsampling can help, at the cost of more hyper-parameters to tune.

SLIDE 32

Gradient tree boosting: example

See the interactive demo: https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html

SLIDE 33

Summary

Two ensemble methods:
- bagging & random forests (reduce variance): produce models with minimal correlation and use their average prediction
- boosting (reduces the bias of the weak learner): models are added in steps, and a single cost function is minimized
  - for the exponential loss: interpret as re-weighting the instances (AdaBoost)
  - gradient boosting: fit the weak learner to the negative of the gradient
  - interpretation as L1 regularization for "weak learner" selection
  - also related to max-margin classification (for a large number of steps T)

Random forests and (gradient) boosting generally perform very well.

SLIDE 34

Gradient boosting

$\hat{f} = f^{\{T\}} = f^{\{0\}} - \sum_{t=1}^{T} w^{\{t\}} \frac{\partial}{\partial f} L(f^{\{t-1\}}, y)$

Negative gradients for some loss functions (using one-hot coding Y for C-class classification, where P is the $N \times C$ matrix of predicted class probabilities with $P_{n,:} = \mathrm{softmax}(f_n)$):

setting                   | loss function                      | negative gradient $-\frac{\partial L}{\partial f}$
regression                | $\frac{1}{2} ||y - f||_2^2$        | $y - f$
regression                | $||y - f||_1$                      | $\mathrm{sign}(y - f)$
binary classification     | exponential loss $\exp(-y f)$      | $y \exp(-y f)$
multiclass classification | multi-class cross-entropy          | $Y - P$ ($N \times C$)
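Any row of this table can be checked with finite differences; a minimal sketch for the exponential-loss row (the point values are illustrative):

```python
import numpy as np

f = np.array([0.5, -1.3, 2.0])
y = np.array([1., -1., 1.])
eps = 1e-6

# exponential loss: L(f) = sum_n exp(-y f);  dL/df = -y * exp(-y f)
L = lambda f: np.exp(-y * f).sum()
num_grad = np.array([(L(f + eps * np.eye(3)[i]) - L(f - eps * np.eye(3)[i])) / (2 * eps)
                     for i in range(3)])
analytic = -y * np.exp(-y * f)
print(np.allclose(num_grad, analytic, atol=1e-5))  # True
```

The table lists the negative of this, $y\exp(-yf)$, which is the "residual" the weak learner is fitted to.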