
slide-1
SLIDE 1

Cross Validation & Ensembling

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 1 / 34

slide-2
SLIDE 2

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 2 / 34

slide-4
SLIDE 4

Cross Validation

So far, we have used the hold-out method for:

  Hyperparameter tuning: validation set
  Performance reporting: testing set

What if we get an "unfortunate" split?

K-fold cross validation:

1 Split the data set $X$ evenly into $K$ subsets $X^{(i)}$ (called folds)

2 For $i = 1, \cdots, K$, train $f_N^{(i)}$ using all data but the $i$-th fold ($X \setminus X^{(i)}$)

3 Report the cross-validation error $C_{CV}$ by averaging all the testing errors $C[f_N^{(i)}]$ evaluated on $X^{(i)}$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 4 / 34
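
Below is a minimal NumPy sketch of this K-fold procedure. The least-squares learner, the squared-error cost, and the synthetic data are placeholder choices for illustration only, not part of the slides.

```python
import numpy as np

def kfold_cv_error(X, y, train_fn, cost_fn, K=5, seed=0):
    """Split X into K folds; train on K-1 folds, evaluate on the held-out fold; average."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for i in range(K):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = train_fn(X[train], y[train])              # f_N^(i), trained on X \ X^(i)
        errors.append(cost_fn(model, X[test], y[test]))   # C[f_N^(i)] on X^(i)
    return np.mean(errors)                                # C_CV

# Placeholder learner and cost: ordinary least squares with squared error
train_fn = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
cost_fn = lambda w, X, y: np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(kfold_cv_error(X, y, train_fn, cost_fn, K=5))
```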

slide-6
SLIDE 6

Nested Cross Validation

Cross validation (CV) can be applied to both hyperparameter tuning and performance reporting, e.g., $5 \times 2$ nested CV:

1 Inner (2 folds): select the hyperparameters giving the lowest $C_{CV}$
    Can be wrapped by grid search

2 Train the final model using both the training and validation sets with the selected hyperparameters

3 Outer (5 folds): report $C_{CV}$ as the test error

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 5 / 34
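
A sketch of this nested scheme with scikit-learn; the SVC estimator and the C grid are arbitrary stand-ins for whatever model and hyperparameters are being tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop (2 folds): grid search picks the hyperparameters with the lowest CV error,
# then refits the final model on all the data it was given (training + validation).
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=2)

# Outer loop (5 folds): report the cross-validated score of the whole tuning procedure.
outer_scores = cross_val_score(inner, X, y, cv=5)
print("estimated test accuracy:", outer_scores.mean())
```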

slide-10
SLIDE 10

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 6 / 34

slide-11
SLIDE 11

How Many Folds K? I

The cross-validation error $C_{CV}$ is an average of the $C[f_N^{(i)}]$'s
Regard each $C[f_N^{(i)}]$ as an estimator of the expected generalization error $E_X(C[f_N])$
$C_{CV}$ is an estimator too, and we have
$$\mathrm{MSE}(C_{CV}) = E_X\big[(C_{CV} - E_X(C[f_N]))^2\big] = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 7 / 34

slide-15
SLIDE 15

Point Estimation Revisited: Mean Square Error

Let $\hat{\theta}_n$ be an estimator of a quantity $\theta$ related to a random variable $\mathrm{x}$, computed from $n$ i.i.d. samples of $\mathrm{x}$
Mean square error of $\hat{\theta}_n$: $\mathrm{MSE}(\hat{\theta}_n) = E_X\big[(\hat{\theta}_n - \theta)^2\big]$
Can be decomposed into the bias and variance:
$$\begin{aligned}
E_X\big[(\hat{\theta}_n - \theta)^2\big] &= E\big[(\hat{\theta}_n - E[\hat{\theta}_n] + E[\hat{\theta}_n] - \theta)^2\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2 + (E[\hat{\theta}_n] - \theta)^2 + 2(\hat{\theta}_n - E[\hat{\theta}_n])(E[\hat{\theta}_n] - \theta)\big] \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + E\big[(E[\hat{\theta}_n] - \theta)^2\big] + 2\,E\big[\hat{\theta}_n - E[\hat{\theta}_n]\big]\cdot(E[\hat{\theta}_n] - \theta) \\
&= E\big[(\hat{\theta}_n - E[\hat{\theta}_n])^2\big] + (E[\hat{\theta}_n] - \theta)^2 + 2 \cdot 0 \cdot (E[\hat{\theta}_n] - \theta) \\
&= \mathrm{Var}_X(\hat{\theta}_n) + \mathrm{bias}(\hat{\theta}_n)^2
\end{aligned}$$
The MSE of an unbiased estimator is its variance

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 8 / 34
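
A quick numerical check of this decomposition: estimate the mean of a Gaussian with a deliberately biased estimator and compare the MSE against Var + bias² over many simulated datasets. All the numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 20, 200_000

# Many datasets of n i.i.d. samples each; one estimate per dataset.
samples = rng.normal(loc=theta, scale=1.0, size=(trials, n))
theta_hat = samples.mean(axis=1) + 0.3      # sample mean shifted by 0.3 -> biased on purpose

mse = np.mean((theta_hat - theta) ** 2)
var = np.var(theta_hat)
bias = np.mean(theta_hat) - theta
print(mse, var + bias ** 2)                 # the two values should nearly coincide
```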

slide-21
SLIDE 21

Example: 5-Fold vs. 10-Fold CV

$\mathrm{MSE}(C_{CV}) = E_X\big[(C_{CV} - E_X(C[f_N]))^2\big] = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$
Consider polynomial regression where $P(y \,|\, x)$ is given by $y = \sin(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
Let $C[\cdot]$ be the MSE of the predictions (made by a function) against the true labels
[Figure omitted] $E_X(C[f_N])$: red line; $\mathrm{bias}(C_{CV})$: gaps between the red line and the other solid lines ($E_X[C_{CV}]$); $\mathrm{Var}_X(C_{CV})$: shaded areas

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 9 / 34
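
A rough simulation in the spirit of that figure: draw many datasets from $y = \sin(x) + \varepsilon$, compute the 5-fold and 10-fold CV errors of a polynomial fit on each, and compare their means and spread. The sample size, noise level, and polynomial degree below are assumed values, not the settings used in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, degree, sigma = 60, 3, 0.3                    # assumed settings

def draw(n):
    x = rng.uniform(0, 2 * np.pi, n)
    return x, np.sin(x) + sigma * rng.normal(size=n)

fit = lambda x, y: np.polyfit(x, y, degree)                # least-squares polynomial fit
mse = lambda c, x, y: np.mean((np.polyval(c, x) - y) ** 2)

def cv_error(x, y, K):
    folds = np.array_split(rng.permutation(len(y)), K)
    errs = []
    for i in range(K):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(K) if j != i])
        errs.append(mse(fit(x[tr], y[tr]), x[te], y[te]))
    return np.mean(errs)

cv5 = [cv_error(*draw(N), 5) for _ in range(300)]          # distribution of C_CV over datasets
cv10 = [cv_error(*draw(N), 10) for _ in range(300)]
print("5-fold : mean %.3f, std %.3f" % (np.mean(cv5), np.std(cv5)))
print("10-fold: mean %.3f, std %.3f" % (np.mean(cv10), np.std(cv10)))
```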

slide-27
SLIDE 27

Decomposing Bias and Variance

$C_{CV}$ is an estimator of the expected generalization error $E_X(C[f_N])$:
$\mathrm{MSE}(C_{CV}) = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$, where
$$\begin{aligned}
\mathrm{bias}(C_{CV}) &= E_X(C_{CV}) - E_X(C[f_N]) = E\Big[\sum_i \tfrac{1}{K} C[f_N^{(i)}]\Big] - E(C[f_N]) \\
&= \tfrac{1}{K}\sum_i \Big(E\big[C[f_N^{(i)}]\big] - E(C[f_N])\Big) \\
&= E\big[C[f_N^{(s)}]\big] - E(C[f_N]), \; \forall s \\
&= \mathrm{bias}\big(C[f_N^{(s)}]\big), \; \forall s
\end{aligned}$$
$$\begin{aligned}
\mathrm{Var}_X(C_{CV}) &= \mathrm{Var}\Big(\sum_i \tfrac{1}{K} C[f_N^{(i)}]\Big) = \tfrac{1}{K^2}\,\mathrm{Var}\Big(\sum_i C[f_N^{(i)}]\Big) \\
&= \tfrac{1}{K^2}\Big(\sum_i \mathrm{Var}\big(C[f_N^{(i)}]\big) + 2\sum_{i,j,\,j>i}\mathrm{Cov}_X\big(C[f_N^{(i)}], C[f_N^{(j)}]\big)\Big) \\
&= \tfrac{1}{K}\,\mathrm{Var}\big(C[f_N^{(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j,\,j>i}\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big), \; \forall s
\end{aligned}$$
(The per-fold errors $C[f_N^{(i)}]$ are identically distributed, so any single fold $s$ gives the same bias and variance.)

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 10 / 34

slide-35
SLIDE 35

How Many Folds K? II

$\mathrm{MSE}(C_{CV}) = \mathrm{Var}_X(C_{CV}) + \mathrm{bias}(C_{CV})^2$, where
$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{(s)}]\big), \; \forall s$
$\mathrm{Var}(C_{CV}) = \tfrac{1}{K}\mathrm{Var}\big(C[f_N^{(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j,\,j>i}\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big), \; \forall s$

By learning theory, we can reduce $\mathrm{bias}(C_{CV})$ and $\mathrm{Var}(C_{CV})$ by:
  Choosing the right model complexity, avoiding both underfitting and overfitting
  Collecting more training examples ($N$)
Furthermore, we can reduce $\mathrm{Var}(C_{CV})$ by making $f_N^{(i)}$ and $f_N^{(j)}$ uncorrelated

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 11 / 34

slide-37
SLIDE 37

How Many Folds K? III

$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{(s)}]\big), \; \forall s$
$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\mathrm{Var}\big(C[f_N^{(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j,\,j>i}\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big), \; \forall s$

With a large K, $C_{CV}$ tends to have:
  Low $\mathrm{bias}\big(C[f_N^{(s)}]\big)$ and $\mathrm{Var}\big(C[f_N^{(s)}]\big)$, as each $f_N^{(s)}$ is trained on more examples
  High $\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big)$, as the training sets $X \setminus X^{(i)}$ and $X \setminus X^{(j)}$ are more similar, thus $C[f_N^{(i)}]$ and $C[f_N^{(j)}]$ are more positively correlated

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 12 / 34

slide-40
SLIDE 40

How Many Folds K? IV

$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{(s)}]\big), \; \forall s$
$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\mathrm{Var}\big(C[f_N^{(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j,\,j>i}\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big), \; \forall s$

Conversely, with a small K, the cross-validation error tends to have high $\mathrm{bias}\big(C[f_N^{(s)}]\big)$ and $\mathrm{Var}\big(C[f_N^{(s)}]\big)$ but low $\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big)$
In practice, we usually set K = 5 or 10

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 13 / 34

slide-42
SLIDE 42

Leave-One-Out CV

$\mathrm{bias}(C_{CV}) = \mathrm{bias}\big(C[f_N^{(s)}]\big), \; \forall s$
$\mathrm{Var}_X(C_{CV}) = \tfrac{1}{K}\mathrm{Var}\big(C[f_N^{(s)}]\big) + \tfrac{2}{K^2}\sum_{i,j,\,j>i}\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big), \; \forall s$

For a very small dataset:
  $\mathrm{MSE}(C_{CV})$ is dominated by $\mathrm{bias}\big(C[f_N^{(s)}]\big)$ and $\mathrm{Var}\big(C[f_N^{(s)}]\big)$, not by $\mathrm{Cov}\big(C[f_N^{(i)}], C[f_N^{(j)}]\big)$
  We can choose K = N, which we call the leave-one-out CV

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 14 / 34

slide-44
SLIDE 44

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 15 / 34

slide-45
SLIDE 45

Ensemble Methods

No free lunch theorem: there is no single ML algorithm that always outperforms the others in all domains/tasks
Can we combine multiple base-learners to improve
  Applicability across different domains, and/or
  Generalization performance in a specific task?
These are the goals of ensemble learning
How do we "combine" multiple base-learners?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 16 / 34

slide-49
SLIDE 49

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 17 / 34

slide-50
SLIDE 50

Voting

Voting: linearly combine the predictions of the base-learners for each $x$:
$$\tilde{y}_k = \sum_j w_j\, \hat{y}_k^{(j)}, \quad \text{where } w_j \ge 0, \; \sum_j w_j = 1$$
If all learners are given equal weight $w_j = 1/L$, we have the plurality vote (the multi-class version of the majority vote)

Voting Rule   | Formula
Sum           | $\tilde{y}_k = \frac{1}{L}\sum_{j=1}^{L} \hat{y}_k^{(j)}$
Weighted sum  | $\tilde{y}_k = \sum_j w_j\, \hat{y}_k^{(j)}, \; w_j \ge 0, \, \sum_j w_j = 1$
Median        | $\tilde{y}_k = \mathrm{median}_j\, \hat{y}_k^{(j)}$
Minimum       | $\tilde{y}_k = \min_j \hat{y}_k^{(j)}$
Maximum       | $\tilde{y}_k = \max_j \hat{y}_k^{(j)}$
Product       | $\tilde{y}_k = \prod_j \hat{y}_k^{(j)}$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 18 / 34
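
The rules in the table map directly to NumPy reductions over the stacked per-learner outputs; the class-probability numbers below are made up for illustration.

```python
import numpy as np

# preds[j, k]: the score learner j assigns to class k for one input x (made-up numbers)
preds = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3]])
w = np.array([0.5, 0.3, 0.2])                    # w_j >= 0 and sum_j w_j = 1

rules = {
    "sum":          preds.mean(axis=0),          # equal weights w_j = 1/L
    "weighted sum": w @ preds,
    "median":       np.median(preds, axis=0),
    "minimum":      preds.min(axis=0),
    "maximum":      preds.max(axis=0),
    "product":      preds.prod(axis=0),
}
for name, y_tilde in rules.items():
    print(f"{name:12s} -> class {np.argmax(y_tilde)}, combined scores {np.round(y_tilde, 3)}")
```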

slide-51
SLIDE 51

Why Voting Works? I

Assume that each $\hat{y}^{(j)}$ has the same expected value $E_X\big(\hat{y}^{(j)} \,|\, x\big)$ and variance $\mathrm{Var}_X\big(\hat{y}^{(j)} \,|\, x\big)$
When $w_j = 1/L$, we have:
$$E_X(\tilde{y} \,|\, x) = E\Big(\sum_j \tfrac{1}{L}\hat{y}^{(j)} \,\Big|\, x\Big) = \tfrac{1}{L}\sum_j E\big(\hat{y}^{(j)} \,|\, x\big) = E\big(\hat{y}^{(j)} \,|\, x\big)$$
$$\mathrm{Var}_X(\tilde{y} \,|\, x) = \mathrm{Var}\Big(\sum_j \tfrac{1}{L}\hat{y}^{(j)} \,\Big|\, x\Big) = \tfrac{1}{L^2}\mathrm{Var}\Big(\sum_j \hat{y}^{(j)} \,\Big|\, x\Big) = \tfrac{1}{L}\mathrm{Var}\big(\hat{y}^{(j)} \,|\, x\big) + \tfrac{2}{L^2}\sum_{i,j,\,i<j}\mathrm{Cov}\big(\hat{y}^{(i)}, \hat{y}^{(j)} \,|\, x\big)$$
The expected value doesn't change, so the bias doesn't change

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 19 / 34

slide-54
SLIDE 54

Why Voting Works? II

$$\mathrm{Var}_X(\tilde{y} \,|\, x) = \tfrac{1}{L}\mathrm{Var}\big(\hat{y}^{(j)} \,|\, x\big) + \tfrac{2}{L^2}\sum_{i,j,\,i<j}\mathrm{Cov}\big(\hat{y}^{(i)}, \hat{y}^{(j)} \,|\, x\big)$$

If $\hat{y}^{(i)}$ and $\hat{y}^{(j)}$ are uncorrelated, the variance can be reduced (by a factor of $L$)
Unfortunately, the $\hat{y}^{(j)}$'s may not be i.i.d. in practice
If the voters are positively correlated, the variance increases

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 20 / 34
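
A small simulation of this formula: average L voters with unit variance and a common pairwise correlation rho, and compare the simulated ensemble variance against (1 + (L-1)·rho)/L, which is what the expression above evaluates to under that assumption. L and the rho values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

def ensemble_variance(rho):
    # Voters with unit variance and pairwise correlation rho, built from a shared component.
    common = rng.normal(size=(trials, 1))
    noise = rng.normal(size=(trials, L))
    voters = np.sqrt(rho) * common + np.sqrt(1 - rho) * noise
    return voters.mean(axis=1).var()             # variance of the equally weighted vote

for rho in (0.0, 0.3, 0.9):
    theory = (1 + (L - 1) * rho) / L             # (1/L) Var + (2/L^2) * sum of Cov terms
    print(f"rho = {rho:.1f}: simulated {ensemble_variance(rho):.3f}, theory {theory:.3f}")
```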

slide-57
SLIDE 57

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 21 / 34

slide-58
SLIDE 58

Bagging

Bagging (short for bootstrap aggregating) is a voting method, but the base-learners are made different deliberately
How? Why not train them using slightly different training sets?

1 Generate $L$ slightly different samples from the given sample by bootstrapping: given $X$ of size $N$, draw $N$ points randomly from $X$ with replacement to get $X^{(j)}$
    Some instances may be drawn more than once, and some not at all

2 Train a base-learner on each $X^{(j)}$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 22 / 34
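
A minimal bagging sketch following these two steps, using decision trees as an assumed base-learner and a plain majority vote over the bootstrap-trained copies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

L, N = 25, len(ytr)
learners = []
for _ in range(L):
    idx = rng.integers(0, N, size=N)             # bootstrap: draw N indices with replacement
    learners.append(DecisionTreeClassifier(random_state=0).fit(Xtr[idx], ytr[idx]))

votes = np.stack([d.predict(Xte) for d in learners])      # shape (L, n_test)
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)           # majority vote for 0/1 labels
print("bagged trees:", (y_hat == yte).mean())
print("single tree :", (learners[0].predict(Xte) == yte).mean())
```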

slide-62
SLIDE 62

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 23 / 34

slide-63
SLIDE 63

Boosting

In bagging, generating "uncorrelated" base-learners is left to chance and to the instability of the learning method
In boosting, we generate complementary base-learners
How? Why not train the next learner on the mistakes of the previous learners?
For simplicity, let's consider binary classification here: $d^{(j)}(x) \in \{-1, 1\}$
The original boosting algorithm combines three weak learners to generate a strong learner
  A weak learner has error probability less than 1/2 (better than random guessing)
  A strong learner has arbitrarily small error probability

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 24 / 34

slide-67
SLIDE 67

Original Boosting Algorithm

Training

1 Given a large training set, randomly divide it into three parts $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$

2 Use $X^{(1)}$ to train the first learner $d^{(1)}$, then feed $X^{(2)}$ to $d^{(1)}$

3 Use all the points of $X^{(2)}$ misclassified by $d^{(1)}$ to train $d^{(2)}$; then feed $X^{(3)}$ to $d^{(1)}$ and $d^{(2)}$

4 Use the points on which $d^{(1)}$ and $d^{(2)}$ disagree to train $d^{(3)}$

Testing

1 Feed a point to $d^{(1)}$ and $d^{(2)}$ first; if their outputs agree, use that as the final prediction

2 Otherwise, take the output of $d^{(3)}$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 25 / 34
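
A sketch of these steps with decision stumps as the weak learners. It follows the simplified description above (d^(2) is trained only on d^(1)'s mistakes, and it assumes d^(1) does make some mistakes and that d^(1) and d^(2) disagree somewhere); the dataset and split sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, random_state=0)
X1, X2, X3, Xte = np.split(X, 4)                 # three training parts plus a held-out test part
y1, y2, y3, yte = np.split(y, 4)

stump = lambda: DecisionTreeClassifier(max_depth=1)       # weak learner: a decision stump

d1 = stump().fit(X1, y1)                                  # step 2: train d1 on X1
wrong = d1.predict(X2) != y2
d2 = stump().fit(X2[wrong], y2[wrong])                    # step 3: train d2 on d1's mistakes in X2
disagree = d1.predict(X3) != d2.predict(X3)
d3 = stump().fit(X3[disagree], y3[disagree])              # step 4: train d3 where d1 and d2 disagree

def predict(Xq):
    p1, p2, p3 = d1.predict(Xq), d2.predict(Xq), d3.predict(Xq)
    return np.where(p1 == p2, p1, p3)                     # testing: d3 breaks the disagreements

print("ensemble:", (predict(Xte) == yte).mean())
print("d1 alone:", (d1.predict(Xte) == yte).mean())
```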

slide-73
SLIDE 73

Example

Assuming $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ are the same:
Disadvantage: requires a large training set to afford the three-way split

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 26 / 34

slide-74
SLIDE 74

AdaBoost

AdaBoost: uses the same training set over and over again
How to make some points "larger"?
Modify the probabilities of drawing the instances as a function of error
Notation:
  $\Pr^{(i,j)}$: probability that an example $(x^{(i)}, y^{(i)})$ is drawn to train the $j$-th base-learner $d^{(j)}$
  $\varepsilon^{(j)} = \sum_i \Pr^{(i,j)}\, 1\big(y^{(i)} \ne d^{(j)}(x^{(i)})\big)$: error rate of $d^{(j)}$ on its training set

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 27 / 34

slide-77
SLIDE 77

Algorithm

Training

1 Initialize $\Pr^{(i,1)} = \frac{1}{N}$ for all $i$

2 For $j = 1, 2, \ldots$:
    1 Randomly draw $N$ examples from $X$ with probabilities $\Pr^{(i,j)}$ and use them to train $d^{(j)}$
    2 Stop adding new base-learners if $\varepsilon^{(j)} \ge \frac{1}{2}$
    3 Define $\alpha_j = \frac{1}{2}\log\big(\frac{1-\varepsilon^{(j)}}{\varepsilon^{(j)}}\big) > 0$ and set $\Pr^{(i,j+1)} = \Pr^{(i,j)} \cdot \exp\big(-\alpha_j\, y^{(i)} d^{(j)}(x^{(i)})\big)$ for all $i$
    4 Normalize $\Pr^{(i,j+1)}$, $\forall i$, by multiplying by $\big(\sum_i \Pr^{(i,j+1)}\big)^{-1}$

Testing

1 Given $x$, calculate $\hat{y}^{(j)}$ for all $j$

2 Make the final prediction $\tilde{y}$ by voting: $\tilde{y} = \sum_j \alpha_j d^{(j)}(x)$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 28 / 34
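
A compact NumPy sketch of these training and testing steps, with decision stumps as an assumed weak learner; the weighted error is computed from the drawing probabilities exactly as defined above, and the small clip on epsilon is a defensive addition, not part of the listed algorithm.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, L=20, seed=0):
    """y must be in {-1, +1}. Returns the base-learners d^(j) and their weights alpha_j."""
    rng = np.random.default_rng(seed)
    N = len(y)
    p = np.full(N, 1.0 / N)                       # 1. Pr^(i,1) = 1/N
    learners, alphas = [], []
    for _ in range(L):
        idx = rng.choice(N, size=N, p=p)          # 2.1 draw N examples with probabilities Pr^(i,j)
        d = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        eps = np.sum(p * (d.predict(X) != y))     # eps^(j) = sum_i Pr^(i,j) 1(y^(i) != d^(j)(x^(i)))
        if eps >= 0.5:                            # 2.2 stop: no better than random guessing
            break
        eps = max(eps, 1e-12)                     # guard against a perfect stump (log of 1/0)
        alpha = 0.5 * np.log((1 - eps) / eps)     # 2.3 alpha_j > 0
        p = p * np.exp(-alpha * y * d.predict(X)) #     raise Pr for mistakes, lower it for hits
        p = p / p.sum()                           # 2.4 normalize
        learners.append(d)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(learners, alphas, X):
    score = sum(a * d.predict(X) for a, d in zip(alphas, learners))   # sum_j alpha_j d^(j)(x)
    return np.sign(score)

X, y = make_classification(n_samples=400, random_state=0)
y = 2 * y - 1                                     # map {0, 1} labels to {-1, +1}
ds, al = adaboost_train(X[:300], y[:300])
print("test accuracy:", (adaboost_predict(ds, al, X[300:]) == y[300:]).mean())
```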

slide-82
SLIDE 82

Example

$d^{(j+1)}$ complements $d^{(j)}$ and $d^{(j-1)}$ by focusing on the predictions they disagree on
The voting weights $\big(\alpha_j = \frac{1}{2}\log\big(\frac{1-\varepsilon^{(j)}}{\varepsilon^{(j)}}\big)\big)$ used in prediction are proportional to the base-learner's accuracy

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 29 / 34

slide-84
SLIDE 84

Outline

1 Cross Validation
    How Many Folds?

2 Ensemble Methods
    Voting
    Bagging
    Boosting
    Why AdaBoost Works?

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 30 / 34

slide-85
SLIDE 85

Why AdaBoost Works

Why does AdaBoost improve performance?
By increasing model complexity? Not exactly
  Empirical study: AdaBoost reduces overfitting as $L$ grows, even when there is no training error
AdaBoost increases the margin [1, 2]

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 31 / 34

slide-89
SLIDE 89

Margin as Confidence of Predictions

Recall that in SVC, a larger margin improves generalizability
  Due to higher-confidence predictions over the training examples
We can define the margin for AdaBoost similarly
In binary classification, define the margin of the prediction for an example $(x^{(i)}, y^{(i)}) \in X$ as:
$$\mathrm{margin}(x^{(i)}, y^{(i)}) = y^{(i)} f(x^{(i)}) = \sum_{j:\, y^{(i)} = d^{(j)}(x^{(i)})} \alpha_j \;-\; \sum_{j:\, y^{(i)} \ne d^{(j)}(x^{(i)})} \alpha_j$$

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 32 / 34

slide-92
SLIDE 92

Margin Distribution

Margin distribution over $\theta$:
$$\Pr_X\big(y^{(i)} f(x^{(i)}) \le \theta\big) \approx \frac{\big|\{(x^{(i)}, y^{(i)}) : y^{(i)} f(x^{(i)}) \le \theta\}\big|}{|X|}$$
A complementary learner:
  Clarifies low-confidence areas
  Increases the margin of points in these areas

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 33 / 34
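
The margin and the empirical margin distribution follow directly from these definitions; the sketch below reuses the hypothetical learners/alphas names from the AdaBoost sketch earlier and keeps the unnormalized margin exactly as defined above.

```python
import numpy as np

def margins(learners, alphas, X, y):
    # margin(x, y) = sum_{j: d_j(x) = y} alpha_j - sum_{j: d_j(x) != y} alpha_j = y * f(x)
    f = sum(a * d.predict(X) for a, d in zip(alphas, learners))
    return y * f

def margin_distribution(m, theta):
    # Pr_X(y f(x) <= theta), estimated as a fraction of the training set
    return np.mean(m <= theta)

# Example usage, reusing ds, al and the {-1,+1}-labeled data from the AdaBoost sketch above:
#   m = margins(ds, al, X[:300], y[:300])
#   print([margin_distribution(m, t) for t in (0.0, 0.5, 1.0)])
```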

slide-94
SLIDE 94

Reference I

[1] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.
[2] Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. On the margin explanation of boosting algorithms. In COLT, pages 479-490. Citeseer, 2008.

Shan-Hung Wu (CS, NTHU) CV & Ensembling Machine Learning 34 / 34