

slide-1
SLIDE 1

Applied Machine Learning

Bootstrap, Bagging and Boosting

Siamak Ravanbakhsh

COMP 551 (Winter 2020) 1

slide-2
SLIDE 2

Learning objectives

  • bootstrap for uncertainty estimation
  • bagging for variance reduction
  • random forests
  • boosting: AdaBoost, gradient boosting, and the relationship to L1 regularization

2

slide-3
SLIDE 3

Bootstrap

a simple approach to estimate the uncertainty in prediction

3 . 1

slide-4
SLIDE 4

Bootstrap

a simple approach to estimate the uncertainty in prediction, given the dataset

$D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$

3 . 1

slide-5
SLIDE 5

Bootstrap

subsample with replacement B datasets of size N:

$D_b = \{(x^{(n,b)}, y^{(n,b)})\}_{n=1}^{N}, \quad b = 1, \ldots, B$

3 . 1

slide-6
SLIDE 6

Bootstrap

train a model on each of these bootstrap datasets (called bootstrap samples)

3 . 1

slide-7
SLIDE 7

Bootstrap

produce a measure of uncertainty from these models, for the model parameters and for the predictions

3 . 1

slide-8
SLIDE 8

Bootstrap

this is the non-parametric bootstrap

3 . 1
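
A minimal sketch (not from the slides) of the non-parametric bootstrap for a simple statistic, the sample mean, before the regression example on the following slides; the dataset x and the number of bootstrap samples B here are placeholders.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)          # placeholder dataset of size N=100
B = 1000                          # number of bootstrap samples

stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(x), size=len(x))   # sample N indices with replacement
    stats[b] = x[idx].mean()                     # the statistic on this bootstrap sample

print(stats.std())   # bootstrap estimate of the standard error of the mean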

slide-9
SLIDE 9

Bootstrap: example

Recall: linear model with nonlinear Gaussian bases (N=100)

$\phi_k(x) = e^{-s^2 (x - \mu_k)^2}$

the targets $y^{(n)}$ are a noisy nonlinear function of $x^{(n)}$ (combining $\sin$, $\cos$, $|x^{(n)}|$, and noise $\epsilon$); the plot shows the data before and after adding noise

#x: N  #y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)           #10 Gaussian bases
Phi = phi(x[:,None], mu[None,:])      #N x 10
w = np.linalg.lstsq(Phi, y)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')

  • our fit to the data using 10 Gaussian bases

3 . 2

slide-10
SLIDE 10

Bootstrap: example

#Phi: N x D
#y: N
B = 500
ws = np.zeros((B, D))
for b in range(B):
    inds = np.random.randint(N, size=N)   #sample N indices with replacement
    Phi_b = Phi[inds, :]                  #N x D
    y_b = y[inds]                         #N
    #fit the subsampled data
    ws[b, :] = np.linalg.lstsq(Phi_b, y_b)[0]

plt.hist(ws, bins=50)

using B=500 bootstrap samples gives a measure of uncertainty of the parameters

3 . 3

slide-11
SLIDE 11

Bootstrap: example

in the histogram of bootstrap estimates, each color is a different weight $w_d$

3 . 3


slide-14
SLIDE 14

Bootstrap: example

Recall: linear model with nonlinear Gaussian bases (N=100), using B=500 bootstrap samples

3 . 4

slide-15
SLIDE 15

Bootstrap: example

the bootstrap also gives a measure of uncertainty of the predictions

3 . 4

slide-16
SLIDE 16

Bootstrap: example

#Phi: N x D
#Phi_test: Nt x D
#y: N
#ws: B x D from previous code
y_hats = np.zeros((B, Nt))
for b in range(B):
    wb = ws[b, :]
    y_hats[b, :] = np.dot(Phi_test, wb)

# get 95% quantiles
y_5 = np.quantile(y_hats, .05, axis=0)
y_95 = np.quantile(y_hats, .95, axis=0)

3 . 4

slide-17
SLIDE 17

Bootstrap: example

the red lines in the plot are the 5% and 95% quantiles (for each point, we can get these across the bootstrap model predictions)

3 . 4


slide-19
SLIDE 19

Bagging

use the bootstrap for more accurate prediction (not just uncertainty estimation)

variance of a sum of random variables:

$\mathrm{Var}(z_1 + z_2) = E[(z_1 + z_2)^2] - E[z_1 + z_2]^2$

4 . 1

slide-20
SLIDE 20

Bagging

$\qquad = E[z_1^2 + z_2^2 + 2 z_1 z_2] - (E[z_1] + E[z_2])^2$

4 . 1

slide-21
SLIDE 21

Bagging

$\qquad = E[z_1^2] + E[z_2^2] + E[2 z_1 z_2] - E[z_1]^2 - E[z_2]^2 - 2 E[z_1] E[z_2]$

4 . 1

slide-22
SLIDE 22

Bagging

$\qquad = \mathrm{Var}(z_1) + \mathrm{Var}(z_2) + 2\,\mathrm{Cov}(z_1, z_2)$

4 . 1

slide-23
SLIDE 23

Bagging

for uncorrelated variables the covariance term is zero

4 . 1

slide-24
SLIDE 24

Bagging

the average of uncorrelated random variables has a lower variance

4 . 2

slide-25
SLIDE 25

Bagging

if $z_1, \ldots, z_B$ are uncorrelated random variables with mean $\mu$ and variance $\sigma^2$, the average $\bar{z} = \frac{1}{B} \sum_b z_b$ has mean $\mu$

4 . 2

slide-26
SLIDE 26

Bagging

and variance

$\mathrm{Var}\big(\tfrac{1}{B} \sum_b z_b\big) = \tfrac{1}{B^2} \sum_b \mathrm{Var}(z_b) = \tfrac{1}{B^2}\, B \sigma^2 = \tfrac{1}{B} \sigma^2$

4 . 2

slide-27
SLIDE 27

Bagging

use this to reduce the variance of our models (the bias remains the same); for regression, average the model predictions:

$\hat{f}(x) = \tfrac{1}{B} \sum_b \hat{f}_b(x)$

4 . 2

slide-28
SLIDE 28

Bagging

issue: the model predictions are not uncorrelated (they are trained using the same data); use bootstrap samples to reduce the correlation: this is bagging (bootstrap aggregation); a code sketch follows below

4 . 2
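
A minimal sketch (not from the slides) of bagging for regression, assuming the Phi (N x D), Phi_test (Nt x D), y, and N variables from the earlier bootstrap example: each model is fit on a bootstrap sample and the bagged prediction is the average of the B model predictions.

#Phi: N x D, Phi_test: Nt x D, y: N (assumed from the earlier example)
B = 500
y_hats = np.zeros((B, Phi_test.shape[0]))
for b in range(B):
    inds = np.random.randint(N, size=N)              # bootstrap sample
    w_b = np.linalg.lstsq(Phi[inds, :], y[inds])[0]  # fit on the bootstrap sample
    y_hats[b, :] = Phi_test.dot(w_b)                 # predictions of model b

y_bag = y_hats.mean(axis=0)   # bagged prediction: average over the B models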

slide-29
SLIDE 29

Bagging for classification

averaging makes sense for regression; how about classification?

4 . 3

slide-30
SLIDE 30

Bagging for classification

wisdom of crowds: suppose $z_1, \ldots, z_B \in \{0, 1\}$ are IID Bernoulli random variables with mean $\mu = .5 + \epsilon$, and let $\bar{z} = \tfrac{1}{B} \sum_b z_b$; for $\epsilon > 0$, $p(\bar{z} > .5)$ goes to 1 as B grows

4 . 3

slide-31
SLIDE 31

Bagging for classification

the mode of IID classifiers that are each better than chance is a better classifier: use voting

4 . 3

slide-32
SLIDE 32

Bagging for classification

crowds are wiser when the individuals are better than random and their votes are uncorrelated

4 . 3

slide-33
SLIDE 33

Bagging for classification

use bootstrap samples to reduce the correlation: bagging (bootstrap aggregation); a small simulation sketch follows below

4 . 3
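
A small simulation sketch (not from the slides) of the wisdom-of-crowds statement above: each of B voters is an independent Bernoulli with accuracy .5 + epsilon, and we estimate $p(\bar{z} > .5)$ for growing B; the values of epsilon, B, and the number of trials are placeholders.

import numpy as np

rng = np.random.default_rng(0)
eps, trials = 0.05, 10000
for B in [1, 11, 101, 1001]:
    z = rng.random((trials, B)) < 0.5 + eps      # B IID Bernoulli(.5 + eps) votes per trial
    p_majority = (z.mean(axis=1) > 0.5).mean()   # fraction of trials where the majority is right
    print(B, p_majority)                         # approaches 1 as B grows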

slide-34
SLIDE 34

Bagging decision trees

example setup: a synthetic dataset with 5 correlated features; the 1st feature is a noisy predictor of the label

4 . 4

slide-35
SLIDE 35

Bagging decision trees

bootstrap samples create different decision trees (due to their high variance); compared to a single decision tree, the bagged ensemble is no longer interpretable!

4 . 4

slide-36
SLIDE 36

Bagging decision trees

to combine the predictions of the B trees: vote for the most probable class, or average the predicted probabilities

4 . 4

slide-37
SLIDE 37

Random forests

further reduce the correlation between decision trees

4 . 5

slide-38
SLIDE 38

Random forests

feature sub-sampling: only a random subset of the features is available for the split at each step; this further reduces the dependence between decision trees

4 . 5

slide-39
SLIDE 39

Random forests

how many of the D features to subsample is not a magic number: it is a hyper-parameter that can be optimized using CV

4 . 5

slide-40
SLIDE 40

Random forests

Out Of Bag (OOB) samples: the instances not included in a bootstrap dataset can be used for validation
  • simultaneous validation of the decision trees in a forest
  • no need to set aside data for cross-validation
(a scikit-learn sketch follows below)

4 . 5
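
A minimal sketch (not from the slides) of a random forest with feature sub-sampling and OOB validation, using scikit-learn; the dataset here is a synthetic placeholder.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # placeholder data
rf = RandomForestClassifier(
    n_estimators=500,       # B: the size of the forest
    max_features="sqrt",    # random subset of features considered at each split
    oob_score=True,         # evaluate each tree on its out-of-bag samples
    random_state=0,
)
rf.fit(X, y)
print(rf.oob_score_)        # OOB accuracy, usable in place of a separate validation set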

slide-41
SLIDE 41

Example: spam detection

Dataset: N=4601 emails; binary classification task: spam vs. not spam; D=57 features:

  • 48 words: percentage of words in the email that match these words (e.g., business, address, internet, free, George; customized per user)
  • 6 characters: again, the percentage of characters that match these: ch; , ch( , ch[ , ch! , ch$ , ch#
  • average, max, and sum of the lengths of uninterrupted sequences of capital letters: CAPAVE, CAPMAX, CAPTOT

4 . 6

slide-42
SLIDE 42

Example: spam detection

these features are an example of feature engineering

4 . 6

slide-43
SLIDE 43

Example: spam detection

the slide shows the average value of these features in the spam and non-spam emails

4 . 6

slide-44
SLIDE 44

Example: spam detection

decision tree after pruning

4 . 7

slide-45
SLIDE 45

Example: spam detection

misclassification rate on test data

4 . 7

slide-46
SLIDE 46

Example: spam detection

the number of leaves (17) in the optimal pruning is decided based on the cross-validation error (the plot shows test error and CV error)

4 . 7

slide-47
SLIDE 47

Example: spam detection

Bagging and Random Forests do much better than a single decision tree!

4 . 8

slide-48
SLIDE 48

Example: spam detection

Out Of Bag (OOB) error can be used for parameter tuning (e.g., the size of the forest)

4 . 8

slide-49
SLIDE 49

Summary so far...

Bootstrap is a powerful technique to get uncertainty estimates. Bootstrap aggregation (Bagging) can reduce the variance of unstable models.

5

slide-50
SLIDE 50

Summary so far...

Random forests: Bagging + further de-correlation by sub-sampling the features at each split
  • OOB validation instead of CV
  • destroy the interpretability of decision trees
  • perform well in practice
  • can fail if only a few relevant features exist (due to feature sub-sampling)

5

slide-51
SLIDE 51

Adaptive bases

$f(x) = \sum_d w_d\, \phi_d(x; v_d)$

several methods can be seen as learning these bases adaptively:
  • decision trees
  • generalized additive models
  • boosting
  • neural networks

in boosting, each basis is a classifier or regression function (a weak learner, or base learner); a strong learner is created by sequentially combining weak learners

6 . 1

slide-52
SLIDE 52

Forward stagewise additive modelling

6 . 2

slide-53
SLIDE 53

Forward stagewise additive modelling

model:  $f(x) = \sum_{t=1}^{T} w^{\{t\}}\, \phi(x; v^{\{t\}})$

where each $\phi(x; v^{\{t\}})$ is a simple model, such as a decision stump (a decision tree with a single split node)

6 . 2

slide-54
SLIDE 54

Forward stagewise additive modelling

cost:  $J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L\big(y^{(n)}, f(x^{(n)})\big)$

so far we have seen the L2 loss, the log loss, and the hinge loss

  • optimizing this cost directly is difficult given the form of f

6 . 2

slide-55
SLIDE 55

Forward stagewise additive modelling

  • optimization idea: add one weak learner at each stage t, to reduce the error of the previous stage

  1. find the best weak learner:
     $v^{\{t\}}, w^{\{t\}} = \arg\min_{v, w} \sum_{n=1}^{N} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w\, \phi(x^{(n)}; v)\big)$

  2. add it to the current model:
     $f^{\{t\}}(x) = f^{\{t-1\}}(x) + w^{\{t\}}\, \phi(x; v^{\{t\}})$

6 . 2

slide-56
SLIDE 56

L2 loss & forward stagewise linear model

model: consider weak learners that are individual features, $\phi^{\{t\}}(x) = x_{d^{\{t\}}}$, so each stage adds a term $w^{\{t\}} x_{d^{\{t\}}}$

7 . 1

slide-57
SLIDE 57

L2 loss & forward stagewise linear model

cost: using the L2 loss for regression, at stage t

$\arg\min_{d, w_d}\; \tfrac{1}{2} \sum_{n=1}^{N} \big(y^{(n)} - (f^{\{t-1\}}(x^{(n)}) + w_d\, x_d^{(n)})\big)^2$

7 . 1

slide-58
SLIDE 58

L2 loss & forward stagewise linear model

here $r^{(n)} = y^{(n)} - f^{\{t-1\}}(x^{(n)})$ is the residual

7 . 1

slide-59
SLIDE 59

L2 loss & forward stagewise linear model

  • optimization: recall that the optimal weight for each feature d is $w_d = \frac{\sum_n x_d^{(n)} r^{(n)}}{\sum_n (x_d^{(n)})^2}$; pick the feature that most significantly reduces the residual

7 . 1

slide-60
SLIDE 60

L2 loss & forward stagewise linear model

the model at time-step t is $f^{\{t\}}(x) = \sum_t \alpha\, w^{\{t\}} x_{d^{\{t\}}}$; using a small learning rate $\alpha$ helps with test error

is this related to L1-regularized linear regression? (a numpy sketch of this procedure follows below)

7 . 1
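
A minimal numpy sketch (not from the slides) of this forward stagewise procedure with individual features as weak learners: at each step it computes the residual, picks the single feature whose optimal least-squares weight reduces the residual the most, and takes a small step alpha on that coordinate; X, y, T, and alpha are placeholders.

import numpy as np

def l2_boost(X, y, T=1000, alpha=0.01):
    N, D = X.shape
    beta = np.zeros(D)                         # accumulated coefficients
    f = np.zeros(N)                            # current predictions f^{t}
    for t in range(T):
        r = y - f                              # residual at this stage
        w = X.T.dot(r) / (X**2).sum(axis=0)    # optimal per-feature weight w_d
        sse = ((r[:, None] - w[None, :] * X)**2).sum(axis=0)
        d = np.argmin(sse)                     # the feature that reduces the residual the most
        beta[d] += alpha * w[d]                # small step on that single coordinate
        f += alpha * w[d] * X[:, d]
    return beta

Tracking beta over the T steps traces the regularization path that the next slide compares to the lasso path.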

slide-61
SLIDE 61

L2 loss & forward stagewise linear model

example: using a small learning rate ($\alpha = .01$), L2 boosting has a regularization path similar to that of lasso; the plots show the coefficients $w_{d^{\{t\}}}$ against the number of steps t for boosting, and against $\sum_d |w_d|$ for lasso

at each time-step only one feature $d^{\{t\}}$ is updated / added

7 . 2

slide-62
SLIDE 62

L2 loss & forward stagewise linear model

we can view boosting as doing feature (base learner) selection in exponentially large spaces (e.g., all trees of size K)

the number of steps t plays a role similar to (the inverse of) a regularization hyper-parameter

7 . 2

slide-63
SLIDE 63

Exponential loss & AdaBoost

loss functions for binary classification, $y \in \{-1, +1\}$; the predicted label is $\hat{y} = \mathrm{sign}(f(x))$

8 . 1

slide-64
SLIDE 64

Exponential loss & AdaBoost

misclassification (0-1) loss:  $L(y, f(x)) = \mathbb{I}(y f(x) < 0)$

8 . 1

slide-65
SLIDE 65

Exponential loss & AdaBoost

log-loss (aka cross-entropy loss or binomial deviance):  $L(y, f(x)) = \log(1 + e^{-y f(x)})$

8 . 1

slide-66
SLIDE 66

Exponential loss & AdaBoost

hinge loss (support vector loss):  $L(y, f(x)) = \max(0, 1 - y f(x))$

8 . 1

slide-67
SLIDE 67

Exponential loss & AdaBoost

yet another loss function is the exponential loss:  $L(y, f(x)) = e^{-y f(x)}$

note that this loss grows faster than the other surrogate losses (it is more sensitive to outliers)

8 . 1

slide-68
SLIDE 68

Exponential loss & AdaBoost

a useful property when working with additive models:

$L\big(y, f^{\{t-1\}}(x) + w^{\{t\}} \phi(x, v^{\{t\}})\big) = L\big(y, f^{\{t-1\}}(x)\big) \cdot L\big(y, w^{\{t\}} \phi(x, v^{\{t\}})\big)$

treat the first factor as a weight q for the instance: instances that were not properly classified before receive a higher weight

8 . 1

slide-69
SLIDE 69

Exponential loss & AdaBoost

cost, using the exponential loss:

$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_{n=1}^{N} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w^{\{t\}} \phi(x^{(n)}, v^{\{t\}})\big) = \sum_n q^{(n)} L\big(y^{(n)}, w^{\{t\}} \phi(x^{(n)}, v^{\{t\}})\big)$

where $q^{(n)} = L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)})\big)$ is the loss for this instance at the previous stage

8 . 2

slide-70
SLIDE 70

Exponential loss & AdaBoost

discrete AdaBoost: assume the weak learner is a simple classifier, so its output is +/- 1

8 . 2

slide-71
SLIDE 71

Exponential loss & AdaBoost

  • optimization: the objective is to find the weak learner minimizing the cost

$J(\{w^{\{t\}}, v^{\{t\}}\}_t) = \sum_n q^{(n)} e^{-y^{(n)} w^{\{t\}} \phi(x^{(n)}, v^{\{t\}})}$

8 . 2

slide-72
SLIDE 72

Exponential loss & AdaBoost

$= e^{-w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} = \phi(x^{(n)}, v^{\{t\}})\big) + e^{w^{\{t\}}} \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}, v^{\{t\}})\big)$

8 . 2

slide-73
SLIDE 73

Exponential loss & AdaBoost

$= e^{-w^{\{t\}}} \sum_n q^{(n)} + \big(e^{w^{\{t\}}} - e^{-w^{\{t\}}}\big) \sum_n q^{(n)} \mathbb{I}\big(y^{(n)} \neq \phi(x^{(n)}, v^{\{t\}})\big)$

the first term does not depend on the weak learner; assuming $w^{\{t\}} \geq 0$, the weak learner should minimize the second sum: this is classification with weighted instances

8 . 2

slide-74
SLIDE 74

Exponential loss & AdaBoost

minimizing this weighted classification cost gives $v^{\{t\}}$

8 . 3

slide-75
SLIDE 75

Exponential loss & AdaBoost

we still need to find the optimal $w^{\{t\}}$

8 . 3

slide-76
SLIDE 76

Exponential loss & AdaBoost

setting $\frac{\partial J}{\partial w^{\{t\}}} = 0$ gives

$w^{\{t\}} = \tfrac{1}{2} \log \frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$

where $\ell^{\{t\}} = \frac{\sum_n q^{(n)} \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$ is the weight-normalized misclassification error

8 . 3

slide-77
SLIDE 77

Exponential loss & AdaBoost

since the weak learner is better than chance, $\ell^{\{t\}} < .5$ and so $w^{\{t\}} > 0$

8 . 3

slide-78
SLIDE 78

Exponential loss & AdaBoost

we can now update the instance weights q for the next iteration (multiply by the new loss):

$q^{(n), \{t+1\}} = q^{(n), \{t\}}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$

8 . 3

slide-79
SLIDE 79

Exponential loss & AdaBoost

since $w^{\{t\}} > 0$, the weights q of misclassified points increase and the rest decrease

8 . 3

slide-80
SLIDE 80

Exponential loss & AdaBoost

  • overall algorithm for discrete AdaBoost

8 . 4


slide-84
SLIDE 84

Exponential loss & AdaBoost

the weak learners $w^{\{1\}} \phi(x; v^{\{1\}}), w^{\{2\}} \phi(x; v^{\{2\}}), \ldots, w^{\{T\}} \phi(x; v^{\{T\}})$ are combined into

$f(x) = \mathrm{sign}\big(\sum_t w^{\{t\}} \phi(x; v^{\{t\}})\big)$

8 . 4

slide-85
SLIDE 85

Exponential loss & AdaBoost

  • overall algorithm for discrete AdaBoost:

initialize $q^{(n)} := \tfrac{1}{N}$ for all n
for t = 1:T
    fit the simple classifier $\phi(x; v^{\{t\}})$ to the weighted dataset
    $\ell^{\{t\}} := \frac{\sum_n q^{(n)} \mathbb{I}(\phi(x^{(n)}; v^{\{t\}}) \neq y^{(n)})}{\sum_n q^{(n)}}$
    $w^{\{t\}} := \tfrac{1}{2} \log \frac{1 - \ell^{\{t\}}}{\ell^{\{t\}}}$
    $q^{(n)} := q^{(n)}\, e^{-w^{\{t\}} y^{(n)} \phi(x^{(n)}; v^{\{t\}})}$ for all n
return $f(x) = \mathrm{sign}\big(\sum_t w^{\{t\}} \phi(x; v^{\{t\}})\big)$

(a code sketch follows below)

8 . 4
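
A minimal sketch (not from the slides) of this discrete AdaBoost loop, using scikit-learn decision stumps fit to the weighted dataset as the weak learners; X and y (with labels in {-1, +1}) and the number of rounds T are placeholders.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=100):
    # y must take values in {-1, +1}
    N = X.shape[0]
    q = np.full(N, 1.0 / N)                   # instance weights q^{(n)}
    stumps, ws = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=q)      # fit to the weighted dataset
        pred = stump.predict(X)
        err = q[pred != y].sum() / q.sum()    # weight-normalized misclassification error
        w = 0.5 * np.log((1 - err) / err)     # w^{t} = 1/2 log((1 - l)/l)
        q = q * np.exp(-w * y * pred)         # re-weight: misclassified points go up
        stumps.append(stump)
        ws.append(w)
    def predict(X_new):
        scores = sum(w * s.predict(X_new) for w, s in zip(ws, stumps))
        return np.sign(scores)
    return predict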

slide-86
SLIDE 86

AdaBoost

example: each weak learner is a decision stump (dashed line); the panels show t = 1, 2, 3, 6, 10, 150; green is the decision boundary of $f^{\{t\}}$, and the circle size is proportional to $q^{(n),\{t\}}$

8 . 5

slide-87
SLIDE 87

AdaBoost

example: the features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ are samples from a standard Gaussian

8 . 6

slide-88
SLIDE 88

AdaBoost

the label is $y^{(n)} = \mathbb{I}\big(\sum_d (x_d^{(n)})^2 > 9.34\big)$

8 . 6

slide-89
SLIDE 89

AdaBoost

N=2000 training examples

8 . 6


slide-91
SLIDE 91

AdaBoost

notice that the test error does not increase: AdaBoost is very slow to overfit (a short sketch of this synthetic setup follows below)

8 . 6
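
A short sketch (not from the slides) generating this synthetic dataset: 10 standard-Gaussian features and the label mapped to {-1, +1}; the threshold 9.34 is approximately the median of a chi-squared with 10 degrees of freedom, so the classes are roughly balanced.

import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 10
X = rng.normal(size=(N, D))                       # features ~ standard Gaussian
y = np.where((X**2).sum(axis=1) > 9.34, 1, -1)    # label: sum of squared features above the threshold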

slide-92
SLIDE 92

application: Viola-Jones face detection

  • Haar features are computationally efficient; each feature is a weak learner
  • AdaBoost picks one feature at a time (label: face / no-face)
  • this can still be inefficient: use the fact that faces are rare (.01% of subwindows are faces) and build a cascade of classifiers that exploits the small rate

image source: David Lowe slides

9

slide-93
SLIDE 93

application: Viola-Jones face detection

(the slide shows the cascade, annotated with the per-stage 100% detection rate, the per-stage FP rate, and the cumulative FP rate)

9

slide-94
SLIDE 94

application: Viola-Jones face detection

the cascade is applied over all image subwindows

9

slide-95
SLIDE 95

application: Viola-Jones face detection

fast enough for real-time (object) detection

9

slide-96
SLIDE 96

Gradient boosting

idea: fit the weak learner to the gradient of the cost

10 . 1

slide-97
SLIDE 97

Gradient boosting

let $\mathbf{f}^{\{t\}} = [f^{\{t\}}(x^{(1)}), \ldots, f^{\{t\}}(x^{(N)})]^\top$ and let the true labels be $\mathbf{y} = [y^{(1)}, \ldots, y^{(N)}]^\top$

10 . 1

slide-98
SLIDE 98

Gradient boosting

ignoring the structure of f, suppose we use gradient descent to minimize the loss, $\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} L(\mathbf{f}, \mathbf{y})$

10 . 1

slide-99
SLIDE 99

Gradient boosting

write $\hat{\mathbf{f}}$ as a sum of steps:

$\hat{\mathbf{f}} = \mathbf{f}^{\{T\}} = \mathbf{f}^{\{0\}} - \sum_{t=1}^{T} w^{\{t\}} \mathbf{g}^{\{t\}}$, where $\mathbf{g}^{\{t\}} = \frac{\partial}{\partial \mathbf{f}} L(\mathbf{f}^{\{t-1\}}, \mathbf{y})$ is the gradient vector

the role of the (negative) gradient is similar to that of the residual

10 . 1

slide-100
SLIDE 100

Gradient boosting

we can look for the optimal step size:  $w^{\{t\}} = \arg\min_w L\big(\mathbf{f}^{\{t-1\}} - w\, \mathbf{g}^{\{t\}}, \mathbf{y}\big)$

10 . 1

slide-101
SLIDE 101

Gradient boosting

so far we have treated f as a parameter vector

10 . 1

slide-102
SLIDE 102

Gradient boosting

instead, fit the weak learner to the negative of the gradient:

$v^{\{t\}} = \arg\min_v \tfrac{1}{2}\, \|\phi_v - (-\mathbf{g})\|_2^2$

10 . 1

slide-103
SLIDE 103

Gradient boosting

where $\phi_v = [\phi(x^{(1)}; v), \ldots, \phi(x^{(N)}; v)]^\top$; note that we fit the gradient using the L2 loss regardless of the original loss function

10 . 1

slide-104
SLIDE 104

Gradient tree boosting

apply gradient boosting to CART (classification and regression trees):

initialize $f^{\{0\}}$ to predict a constant
for t = 1:T
    calculate the negative of the gradient  $r = -\frac{\partial}{\partial \mathbf{f}} L(\mathbf{f}^{\{t-1\}}, \mathbf{y})$
    fit a regression tree to $X$ ($N \times D$) and $r$ ($N$), producing regions $R_1, \ldots, R_K$
    re-adjust the prediction per region:  $w_k = \arg\min_w \sum_{x^{(n)} \in R_k} L\big(y^{(n)}, f^{\{t-1\}}(x^{(n)}) + w\big)$
    update  $f^{\{t\}}(x) = f^{\{t-1\}}(x) + \alpha \sum_{k=1}^{K} w_k\, \mathbb{I}(x \in R_k)$
return $f^{\{T\}}(x)$

10 . 2

slide-105
SLIDE 105

Gradient tree boosting

decide T using a validation set (early stopping)

10 . 2

slide-106
SLIDE 106

Gradient tree boosting

shallow trees with K = 4-8 leaves usually work well as weak learners

10 . 2

slide-107
SLIDE 107

Gradient tree boosting

the per-region re-adjustment of $w_k$ is effectively the line-search step

10 . 2

slide-108
SLIDE 108

Gradient tree boosting

using a small learning rate $\alpha$ in the update improves test error (shrinkage)

10 . 2

slide-109
SLIDE 109

Gradient tree boosting

stochastic gradient boosting combines the bootstrap and boosting: use a subsample of the data at each iteration above, similar to stochastic gradient descent (a code sketch follows below)

10 . 2
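
A minimal sketch (not from the slides) of gradient tree boosting for regression with the L2 loss, where the negative gradient is just the residual; it uses shallow scikit-learn regression trees, a shrinkage rate alpha, and optional row subsampling for the stochastic variant. X, y, and the hyper-parameters are placeholders.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, T=200, alpha=0.1, max_leaf_nodes=8, subsample=1.0):
    rng = np.random.default_rng(0)
    f = np.full(len(y), y.mean())              # f^{0}: predict a constant
    trees = []
    for t in range(T):
        r = y - f                              # negative gradient of the L2 loss = residual
        n_sub = int(subsample * len(y))
        idx = rng.choice(len(y), size=n_sub, replace=False)   # stochastic variant: row subsample
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X[idx], r[idx])               # fit a shallow regression tree to the residual
        f += alpha * tree.predict(X)           # shrinkage: take a small step
        trees.append(tree)
    def predict(X_new):
        return y.mean() + alpha * sum(tree.predict(X_new) for tree in trees)
    return predict

For the L2 loss the per-region re-adjustment (line search) is implicit here, because the regression tree's leaf means already minimize the squared error in each region; for other losses one would re-optimize w_k per region as in the algorithm above.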

slide-110
SLIDE 110

Gradient tree boosting

example: recall the synthetic setup: the features $x_1^{(n)}, \ldots, x_{10}^{(n)}$ are samples from a standard Gaussian, the label is $y^{(n)} = \mathbb{I}\big(\sum_d (x_d^{(n)})^2 > 9.34\big)$, and there are N=2000 training examples

10 . 3

slide-111
SLIDE 111

Gradient tree boosting

gradient tree boosting (using the log-loss) works better than AdaBoost here

10 . 3

slide-112
SLIDE 112

Gradient tree boosting

since a sum over the features determines the label, using stumps works best

10 . 3

slide-113
SLIDE 113

Gradient tree boosting

the learning rate is $\alpha = .2$

10 . 4

slide-114
SLIDE 114

Gradient tree boosting

in the plots, deviance = cross-entropy = log-loss; the weak learner is a stump (K=2)

10 . 4

slide-115
SLIDE 115

Gradient tree boosting

also shown: trees with K=6 leaves

10 . 4

slide-116
SLIDE 116

Gradient tree boosting

in both cases using shrinkage helps; while the test loss may increase, the test misclassification error does not

10 . 4

slide-117
SLIDE 117

Gradient tree boosting

comparing $\alpha = 1$, $\alpha = .1$, stochastic with batch size 50%, and $\alpha = .1$ combined with the stochastic variant: both shrinkage and subsampling can help, at the cost of more hyper-parameters to tune

10 . 5

slide-118
SLIDE 118

Gradient tree boosting

example: see the interactive demo at https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html

10 . 6

slide-119
SLIDE 119

Summary

two ensemble methods

11

slide-120
SLIDE 120

Summary

  • bagging & random forests (reduce variance): produce models with minimal correlation and use their average prediction

11

slide-121
SLIDE 121

Summary

  • boosting (reduces the bias of the weak learner): models are added in steps, and a single cost function is minimized
      - for the exponential loss: interpret as re-weighting the instances (AdaBoost)
      - gradient boosting: fit the weak learner to the negative of the gradient
      - interpretation as L1 regularization for "weak learner" selection
      - also related to max-margin classification (for a large number of steps T)

11

slide-122
SLIDE 122

Summary

random forests and (gradient) boosting generally perform very well

11

slide-123
SLIDE 123

Gradient boosting

$\hat{\mathbf{f}} = \mathbf{f}^{\{T\}} = \mathbf{f}^{\{0\}} - \sum_{t=1}^{T} w^{\{t\}} \frac{\partial}{\partial \mathbf{f}} L(\mathbf{f}^{\{t-1\}}, \mathbf{y})$

negative gradients (pseudo-residuals) for some loss functions, with one-hot coding Y for C-class classification:

setting                   | loss function                   | negative gradient $-\partial L / \partial \mathbf{f}$
regression                | $\frac{1}{2}\|y - f\|_2^2$      | $y - f$
regression                | $\|y - f\|_1$                   | $\mathrm{sign}(y - f)$
multiclass classification | multi-class cross-entropy       | $Y - P$ (both $N \times C$; P holds the predicted class probabilities, $P_{n,:} = \mathrm{softmax}(f^{(n)})$)
binary classification     | exponential loss $\exp(-y f)$   | $y \exp(-y f)$

12
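
A tiny sketch (not from the slides) of the pseudo-residuals in the table above, written as functions of the current scores and targets; the function names are illustrative.

import numpy as np

def neg_grad_l2(y, f):        # L = 0.5 * ||y - f||^2   ->  -dL/df = y - f
    return y - f

def neg_grad_l1(y, f):        # L = ||y - f||_1         ->  -dL/df = sign(y - f)
    return np.sign(y - f)

def neg_grad_exp(y, f):       # L = exp(-y f), y in {-1, +1}  ->  -dL/df = y * exp(-y f)
    return y * np.exp(-y * f)

def neg_grad_softmax(Y, F):   # multi-class cross-entropy, one-hot Y (N x C), scores F (N x C)
    P = np.exp(F - F.max(axis=1, keepdims=True))
    P = P / P.sum(axis=1, keepdims=True)      # row-wise softmax probabilities
    return Y - P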