SLIDE 1

Ensemble methods

CS 446

slide-2
SLIDE 2

Why ensembles?

Standard machine learning setup:
◮ We have some data.
◮ We train 10 predictors (3-nn, least squares, SVM, ResNet, . . . ).
◮ We output the best on a validation set.

1 / 27

slide-3
SLIDE 3

Why ensembles?

Standard machine learning setup:
◮ We have some data.
◮ We train 10 predictors (3-nn, least squares, SVM, ResNet, . . . ).
◮ We output the best on a validation set.

Question: can we do better than the best?

1 / 27

slide-4
SLIDE 4

Why ensembles?

Standard machine learning setup:
◮ We have some data.
◮ We train 10 predictors (3-nn, least squares, SVM, ResNet, . . . ).
◮ We output the best on a validation set.

Question: can we do better than the best? What if we use an ensemble/aggregate/combination?

1 / 27

slide-5
SLIDE 5

Why ensembles?

Standard machine learning setup:
◮ We have some data.
◮ We train 10 predictors (3-nn, least squares, SVM, ResNet, . . . ).
◮ We output the best on a validation set.

Question: can we do better than the best? What if we use an ensemble/aggregate/combination? We’ll consider two approaches: boosting and bagging.

1 / 27

slide-6
SLIDE 6

Bagging

2 / 27

slide-7
SLIDE 7

Bagging?

This first approach is based upon a simple idea:
◮ If the predictors have independent errors, a majority vote of their outputs should be good.
Let’s first check this.

3 / 27

slide-8
SLIDE 8

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

4 / 27

slide-9
SLIDE 9

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

4 / 27

slide-10
SLIDE 10

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 10.]

4 / 27

slide-11
SLIDE 11

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 20.]

4 / 27

slide-12
SLIDE 12

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 30.]

4 / 27

slide-13
SLIDE 13

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 40.]

4 / 27

slide-14
SLIDE 14

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 50.]

4 / 27

slide-15
SLIDE 15

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4).

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 60.]

4 / 27

slide-16
SLIDE 16

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Red: all classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 2, fraction red = 0.16.]

4 / 27

slide-17
SLIDE 17

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Red: all classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 3, fraction red = 0.064.]

4 / 27

slide-18
SLIDE 18

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Red: all classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 4, fraction red = 0.0256.]

4 / 27

slide-19
SLIDE 19

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Red: all classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 5, fraction red = 0.01024.]

4 / 27

slide-20
SLIDE 20

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Red: all classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 6, fraction red = 0.004096.]

4 / 27

slide-21
SLIDE 21

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Red: all classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 7, fraction red = 0.0016384.]

4 / 27

slide-22
SLIDE 22

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 10, fraction green = 0.366897.]

4 / 27

slide-23
SLIDE 23

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 20, fraction green = 0.244663.]

4 / 27

slide-24
SLIDE 24

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 30, fraction green = 0.175369.]

4 / 27

slide-25
SLIDE 25

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 40, fraction green = 0.129766.]

4 / 27

slide-26
SLIDE 26

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 50, fraction green = 0.0978074.]

4 / 27

slide-27
SLIDE 27

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 60, fraction green = 0.0746237.]

4 / 27

slide-28
SLIDE 28

Combining classifiers

Suppose we have n classifiers. Suppose each is wrong independently with probability 0.4. Model classifier errors as random variables (Z_i)_{i=1}^n (thus E(Z_i) = 0.4).

We can model the distribution of errors with Binom(n, 0.4). Green: at least half classifiers wrong.

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 60, fraction green = 0.0746237.]

Green region is error of majority vote! 0.075 ≪ 0.4 !!!
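A quick numerical sketch (not from the slides; it assumes scipy is available) that reproduces these green fractions, i.e. Pr[Binom(n, 0.4) ≥ n/2], and compares them with the exponential bound exp(−n(1/2 − p)^2) used below:

```python
# Reproduce the "fraction green" numbers: Pr[Binom(n, p) >= n/2] for p = 0.4.
import math
from scipy.stats import binom

p = 0.4
for n in [10, 20, 30, 40, 50, 60]:
    k = math.ceil(n / 2)                   # "at least half of the classifiers wrong"
    maj_err = binom.sf(k - 1, n, p)        # Pr[X >= k] for X ~ Binom(n, p)
    bound = math.exp(-n * (0.5 - p) ** 2)  # exponential bound from the slides
    print(f"n={n:2d}  Pr[majority wrong]={maj_err:.6f}  bound={bound:.3f}")
# n=10 gives 0.366897 and n=60 gives 0.074624, matching the figures above.
```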

4 / 27

slide-29
SLIDE 29

Majority vote

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 10, fraction green = 0.366897.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}. MAJ(y_1, . . . , y_n) := +1 when Σ_i y_i ≥ 0, and −1 when Σ_i y_i < 0.

Error rate of majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

5 / 27

slide-30
SLIDE 30

Majority vote

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 20, fraction green = 0.244663.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}. MAJ(y_1, . . . , y_n) := +1 when Σ_i y_i ≥ 0, and −1 when Σ_i y_i < 0.

Error rate of majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

5 / 27

slide-31
SLIDE 31

Majority vote

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 30, fraction green = 0.175369.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}. MAJ(y_1, . . . , y_n) := +1 when Σ_i y_i ≥ 0, and −1 when Σ_i y_i < 0.

Error rate of majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

5 / 27

slide-32
SLIDE 32

Majority vote

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 40, fraction green = 0.129766.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}. MAJ(y_1, . . . , y_n) := +1 when Σ_i y_i ≥ 0, and −1 when Σ_i y_i < 0.

Error rate of majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

5 / 27

slide-33
SLIDE 33

Majority vote

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 50, fraction green = 0.0978074.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}. MAJ(y_1, . . . , y_n) := +1 when Σ_i y_i ≥ 0, and −1 when Σ_i y_i < 0.

Error rate of majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

5 / 27

slide-34
SLIDE 34

Majority vote

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 60, fraction green = 0.0746237.]

Green region is error of majority vote! Suppose y_i ∈ {−1, +1}. MAJ(y_1, . . . , y_n) := +1 when Σ_i y_i ≥ 0, and −1 when Σ_i y_i < 0.

Error rate of majority classifier (with individual error probability p):

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

5 / 27

slide-35
SLIDE 35

Bottom line

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 60, fraction green = 0.0746237.]

Green region is error of majority vote!

6 / 27

slide-36
SLIDE 36

Bottom line

[Figure: PMF of Binom(n, 0.4); #classifiers = n = 60, fraction green = 0.0746237.]

Green region is error of majority vote! Error of majority vote classifier goes down exponentially in n:

Pr[Binom(n, p) ≥ n/2] = Σ_{i ≥ n/2} (n choose i) p^i (1−p)^{n−i} ≤ exp(−n(1/2 − p)^2).

6 / 27

slide-37
SLIDE 37

From independent errors to an algorithm

How to use independent errors in an algorithm?

1. For t = 1, 2, . . . , T:
   1.1 Obtain IID data S_t := ((x_i^{(t)}, y_i^{(t)}))_{i=1}^n,
   1.2 Train classifier f_t on S_t.
2. Output x → MAJ( f_1(x), . . . , f_T(x) ).

7 / 27

slide-38
SLIDE 38

From independent errors to an algorithm

How to use independent errors in an algorithm?

1. For t = 1, 2, . . . , T:
   1.1 Obtain IID data S_t := ((x_i^{(t)}, y_i^{(t)}))_{i=1}^n,
   1.2 Train classifier f_t on S_t.
2. Output x → MAJ( f_1(x), . . . , f_T(x) ).

◮ Good news: errors are independent! (Our exponential error estimate from before is valid.)
◮ Bad news: each classifier is trained on only a 1/T fraction of the data (why not just train ResNet on all of it. . . ).
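One way to realize step 1.1 with a fixed dataset is to split it into T disjoint chunks; a minimal sketch (my own illustration, the decision-tree base learner is an arbitrary choice, labels assumed in {−1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def split_and_vote_ensemble(X, y, T=10):
    # Partition the data into T disjoint chunks and train one classifier per chunk.
    chunks = np.array_split(np.random.permutation(len(X)), T)
    models = [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in chunks]

    def predict(X_new):
        votes = sum(clf.predict(X_new) for clf in models)
        return np.where(votes >= 0, 1, -1)   # MAJ: ties go to +1, as on the slide
    return predict
```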

7 / 27

slide-39
SLIDE 39

Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, . . . , T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining “bootstrap sample” S_t.
   2.2 Train classifier f_t on S_t.
3. Output x → MAJ( f_1(x), . . . , f_T(x) ).

8 / 27

slide-40
SLIDE 40

Bagging

Bagging = Bootstrap aggregating (Leo Breiman, 1994).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, . . . , T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining “bootstrap sample” S_t.
   2.2 Train classifier f_t on S_t.
3. Output x → MAJ( f_1(x), . . . , f_T(x) ).

◮ Good news: using most of the data for each f_t!
◮ Bad news: errors no longer independent. . . ?
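A minimal bagging sketch (my own illustration, assuming NumPy/scikit-learn, a decision-tree base learner, and labels in {−1, +1}):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)   # bootstrap: n points with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X_new):
    votes = sum(clf.predict(X_new) for clf in models)   # labels in {-1, +1}
    return np.where(votes >= 0, 1, -1)
```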

8 / 27

slide-41
SLIDE 41

Sampling with replacement?

Question: Take n samples uniformly at random with replacement from a population of size n. What is the probability that a given individual is not picked?

9 / 27

slide-42
SLIDE 42

Sampling with replacement?

Question: Take n samples uniformly at random with replacement from a population of size n. What is the probability that a given individual is not picked? Answer:

(1 − 1/n)^n ; for large n: lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.

9 / 27

slide-43
SLIDE 43

Sampling with replacement?

Question: Take n samples uniformly at random with replacement from a population of size n. What is the probability that a given individual is not picked? Answer:

(1 − 1/n)^n ; for large n: lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.3679.

Implications for bagging:
◮ Each bootstrap sample contains about 63% of the data set.
◮ The remaining 37% can be used to estimate the error rate of the classifier trained on that bootstrap sample.
◮ If we have three classifiers, some of their error estimates must share examples! Independence is violated!
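A quick numerical check of the limit (plain arithmetic, nothing assumed beyond the formula above):

```python
import math

for n in [10, 100, 1000, 10000]:
    print(n, (1 - 1 / n) ** n)   # approaches 1/e ~= 0.3679 from below
print("1/e =", 1 / math.e)
# So a bootstrap sample misses ~37% of the points and contains ~63% of them.
```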

9 / 27

slide-44
SLIDE 44

Random Forests

Random Forests (Leo Breiman, 2001).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, . . . , T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining “bootstrap sample” S_t.
   2.2 Train a decision tree f_t on S_t as follows: when greedily splitting tree nodes, consider only √d (not all d) possible features.
3. Output x → MAJ( f_1(x), . . . , f_T(x) ).

10 / 27

slide-45
SLIDE 45

Random Forests

Random Forests (Leo Breiman, 2001).

1. Obtain IID data S := ((x_i, y_i))_{i=1}^n.
2. For t = 1, 2, . . . , T:
   2.1 Resample n points uniformly at random with replacement from S, obtaining “bootstrap sample” S_t.
   2.2 Train a decision tree f_t on S_t as follows: when greedily splitting tree nodes, consider only √d (not all d) possible features.
3. Output x → MAJ( f_1(x), . . . , f_T(x) ).

◮ Heuristic news: maybe errors are more independent now?
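In practice this procedure is available off the shelf; a sketch using scikit-learn (the hyperparameter values here are illustrative, not from the slides):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # T bootstrap-trained trees
    max_features="sqrt",   # consider only sqrt(d) features at each split
    bootstrap=True,        # resample n points with replacement for each tree
)
# rf.fit(X_train, y_train); rf.predict(X_test)  -- predictions are a majority vote.
```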

10 / 27

slide-46
SLIDE 46

Boosting — a quick look

11 / 27

slide-47
SLIDE 47

Boosting overview

◮ We no longer assume classifiers have independent errors.
◮ We no longer output a simple majority: we reweight the classifiers via optimization.
◮ There is a rich theory with many interpretations.

12 / 27

slide-48
SLIDE 48

Simplified boosting scheme

1. Start with data ((x_i, y_i))_{i=1}^n and classifiers (h_1, . . . , h_T).
2. Find weights w ∈ R^T which approximately minimize

   (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) = (1/n) Σ_{i=1}^n ℓ( y_i w^T z_i ),

   where z_i = ( h_1(x_i), . . . , h_T(x_i) ) ∈ R^T. (We use classifiers to give us features.)
3. Predict with x → Σ_{j=1}^T w_j h_j(x).

13 / 27

slide-49
SLIDE 49

Simplified boosting scheme

1. Start with data ((x_i, y_i))_{i=1}^n and classifiers (h_1, . . . , h_T).
2. Find weights w ∈ R^T which approximately minimize

   (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) = (1/n) Σ_{i=1}^n ℓ( y_i w^T z_i ),

   where z_i = ( h_1(x_i), . . . , h_T(x_i) ) ∈ R^T. (We use classifiers to give us features.)
3. Predict with x → Σ_{j=1}^T w_j h_j(x).

Remarks.
◮ If ℓ is convex, this is standard linear prediction: convex in w.
◮ In the classical setting: ℓ(r) = exp(−r), optimizer = coordinate descent, T = ∞.
◮ Most commonly, (h_1, . . . , h_T) are decision stumps.
◮ Popular software implementation: xgboost.
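A sketch of step 2 for the concrete convex choice ℓ(r) = exp(−r), using plain gradient descent over w (the classifier list, step size, and iteration count are placeholders, not the course's choices):

```python
import numpy as np

def fit_ensemble_weights(hs, X, y, steps=500, eta=0.1):
    # Z[i, j] = h_j(x_i); y has entries in {-1, +1}; hs is a list of classifiers.
    Z = np.column_stack([h(X) for h in hs])
    n, T = Z.shape
    w = np.zeros(T)
    for _ in range(steps):
        margins = y * (Z @ w)                                  # y_i * w^T z_i
        grad = -(Z * (y * np.exp(-margins))[:, None]).mean(axis=0)
        w -= eta * grad                                        # gradient descent step
    return w                                                   # predict with sign(Z_new @ w)
```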

13 / 27

slide-50
SLIDE 50

Decision stumps?

[Figure: scatter plot of irises; axes are sepal length/width and petal length/width.]

Classifying irises by sepal and petal measurements
◮ X = R^2, Y = {1, 2, 3}
◮ x_1 = ratio of sepal length to width
◮ x_2 = ratio of petal length to width

14 / 27

slide-51
SLIDE 51

Decision stumps?

[Figure: scatter plot of irises; axes are sepal length/width and petal length/width.]

Classifying irises by sepal and petal measurements
◮ X = R^2, Y = {1, 2, 3}
◮ x_1 = ratio of sepal length to width
◮ x_2 = ratio of petal length to width

ŷ = 2

14 / 27

slide-52
SLIDE 52

Decision stumps?

[Figure: scatter plot of irises; axes are sepal length/width and petal length/width.]

Classifying irises by sepal and petal measurements
◮ X = R^2, Y = {1, 2, 3}
◮ x_1 = ratio of sepal length to width
◮ x_2 = ratio of petal length to width

Split: x_1 > 1.7

14 / 27

slide-53
SLIDE 53

Decision stumps?

[Figure: scatter plot of irises; axes are sepal length/width and petal length/width.]

Classifying irises by sepal and petal measurements
◮ X = R^2, Y = {1, 2, 3}
◮ x_1 = ratio of sepal length to width
◮ x_2 = ratio of petal length to width

Split: x_1 > 1.7, with leaves ŷ = 1 and ŷ = 3

14 / 27

slide-54
SLIDE 54

Decision stumps?

[Figure: scatter plot of irises; axes are sepal length/width and petal length/width.]

Classifying irises by sepal and petal measurements
◮ X = R^2, Y = {1, 2, 3}
◮ x_1 = ratio of sepal length to width
◮ x_2 = ratio of petal length to width

Split: x_1 > 1.7, with leaves ŷ = 1 and ŷ = 3

. . . and stop there!

14 / 27

slide-55
SLIDE 55

Boosting decision stumps

Minimizing (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) over w ∈ R^T, where (h_1, . . . , h_T) are decision stumps.

[Figure: decision-surface contour plots. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

15 / 27

slide-56
SLIDE 56

Boosting decision stumps

Minimizing (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) over w ∈ R^T, where (h_1, . . . , h_T) are decision stumps.

[Figure: decision-surface contour plots. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

15 / 27

slide-57
SLIDE 57

Boosting decision stumps

Minimizing (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) over w ∈ R^T, where (h_1, . . . , h_T) are decision stumps.

[Figure: decision-surface contour plots. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

15 / 27

slide-58
SLIDE 58

Boosting decision stumps

Minimizing (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) over w ∈ R^T, where (h_1, . . . , h_T) are decision stumps.

[Figure: decision-surface contour plots. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

15 / 27

slide-59
SLIDE 59

Boosting decision stumps

Minimizing (1/n) Σ_{i=1}^n ℓ( y_i Σ_{j=1}^T w_j h_j(x_i) ) over w ∈ R^T, where (h_1, . . . , h_T) are decision stumps.

[Figure: decision-surface contour plots. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

15 / 27

slide-60
SLIDE 60
slide-61
SLIDE 61

Boosting — classical perspective

16 / 27

slide-62
SLIDE 62

Coordinate descent?

The classical methods used coordinate descent:
◮ Find the maximum-magnitude coordinate of the gradient:

  arg max_j | d/dw_j Σ_{i=1}^n ℓ( Σ_j w_j h_j(x_i) y_i ) |
    = arg max_j | Σ_{i=1}^n ℓ′( Σ_j w_j h_j(x_i) y_i ) h_j(x_i) y_i |
    = arg max_j | Σ_{i=1}^n q_i h_j(x_i) y_i |,

  where we’ve defined q_i := ℓ′( Σ_j w_j h_j(x_i) y_i ).

◮ Iterate: w′ := w − η s e_j, where j is the maximum coordinate, s ∈ {−1, +1} is its sign, and η is a step size.
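A sketch of one such coordinate-descent step for ℓ(r) = exp(−r) (array shapes, the step size, and variable names are my own assumptions for illustration):

```python
import numpy as np

def coordinate_descent_step(w, Z, y, eta=0.5):
    # Z[i, j] = h_j(x_i); y has entries in {-1, +1}.
    q = -np.exp(-(Z @ w) * y)        # q_i = ell'( sum_j w_j h_j(x_i) y_i )
    corr = Z.T @ (q * y)             # j-th entry: d/dw_j of the total loss
    j = np.argmax(np.abs(corr))      # maximum-magnitude coordinate
    s = np.sign(corr[j])             # its sign
    w = w.copy()
    w[j] -= eta * s                  # w' = w - eta * s * e_j
    return w
```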

17 / 27

slide-63
SLIDE 63

Interpreting coordinate descent

Suppose h_j : R^d → {−1, +1}; then h_j(x)y = 2 · 1[h_j(x) = y] − 1, and each step solves

  arg max_j | Σ_{i=1}^n q_i h_j(x_i) y_i | = arg max_j | Σ_{i=1}^n q_i ( 1[h_j(x_i) = y_i] − 1/2 ) |.

We are solving a weighted zero-one loss minimization problem.

18 / 27

slide-64
SLIDE 64

Interpreting coordinate descent

Suppose h_j : R^d → {−1, +1}; then h_j(x)y = 2 · 1[h_j(x) = y] − 1, and each step solves

  arg max_j | Σ_{i=1}^n q_i h_j(x_i) y_i | = arg max_j | Σ_{i=1}^n q_i ( 1[h_j(x_i) = y_i] − 1/2 ) |.

We are solving a weighted zero-one loss minimization problem.

Remarks:
◮ The classical choice of coordinate descent is equivalent to solving a problem akin to weighted zero-one loss minimization.
◮ We can abstract away the finite set (h_1, . . . , h_T) and allow an arbitrary set of predictors (e.g., all linear classifiers).

18 / 27

slide-65
SLIDE 65

Classical boosting setup

There is a Weak Learning Oracle, and a corresponding γ-weak-learnable assumption: a set of points is γ-weak-learnable by a weak learning oracle if, for any weighting q, it returns a predictor h so that

  E_q[ h(X) Y ] ≥ γ.

Interpretation: for any reweighting q, we get a predictor h which is at least γ-correlated with the target.

19 / 27

slide-66
SLIDE 66

Classical boosting setup

There is a Weak Learning Oracle, and a corresponding γ-weak-learnable assumption: a set of points is γ-weak-learnable by a weak learning oracle if, for any weighting q, it returns a predictor h so that

  E_q[ h(X) Y ] ≥ γ.

Interpretation: for any reweighting q, we get a predictor h which is at least γ-correlated with the target.

Remarks:
◮ The classical methods iteratively invoke the oracle with different weightings and then output a final aggregated predictor.
◮ The best-known method, AdaBoost, performs coordinate-descent updates (invoking the oracle) with a specific step size, and needs O( (1/γ^2) ln(1/ε) ) iterations for accuracy ε > 0.
◮ The original description of AdaBoost is in terms of the sequence of weightings q_1, q_2, . . ., and says nothing about coordinate descent.
◮ Adaptive Boosting: the method doesn’t need to know γ, and adapts to varying γ_t := E_{q_t}( h_t(X) Y ).
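A compact sketch of AdaBoost in that reweighting form (my own illustration; `weak_learner(X, y, q)` is an assumed stand-in for the oracle, returning a classifier that does well under the weights q):

```python
import numpy as np

def adaboost(X, y, weak_learner, rounds=10):
    n = len(y)
    q = np.full(n, 1.0 / n)                     # q_1: uniform weighting
    hs, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(X, y, q)
        pred = h(X)                             # predictions in {-1, +1}
        eps = q[pred != y].sum()                # weighted error this round
        alpha = 0.5 * np.log((1 - eps) / eps)   # classical AdaBoost step size
        q = q * np.exp(-alpha * y * pred)       # upweight mistakes, downweight hits
        q /= q.sum()
        hs.append(h); alphas.append(alpha)

    def predict(X_new):
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hs)))
    return predict
```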

19 / 27

slide-67
SLIDE 67

Example: AdaBoost with decision stumps

(This example from Schapire&Freund’s book.)

Weak learning oracle (WLO): pick the best decision stump, meaning

  F := { x → sign(x_i − b) : i ∈ {1, . . . , d}, b ∈ R }.

(Straightforward to handle weights in ERM.)

20 / 27

slide-68
SLIDE 68

Example: AdaBoost with decision stumps

(This example from Schapire&Freund’s book.)

Weak learning oracle (WLO): pick the best decision stump, meaning

  F := { x → sign(x_i − b) : i ∈ {1, . . . , d}, b ∈ R }.

(Straightforward to handle weights in ERM.)

Remark:
◮ Only need to consider O(n) stumps (Why?) — see the sketch below.
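A sketch of this weighted-ERM stump oracle (my own illustration, assuming NumPy and labels in {−1, +1}; ties in feature values are ignored for simplicity). Sorting each feature once shows the answer to the "(Why?)": only the O(n) thresholds between consecutive sorted values ever need to be tried.

```python
import numpy as np

def best_stump(X, y, q):
    n, d = X.shape
    best = (np.inf, None)                       # (weighted error, (feature, threshold, sign))
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ys, qs = X[order, j], y[order], q[order]
        err = qs[ys != 1].sum()                 # threshold below all points: predict +1 everywhere
        for i in range(n):
            err += qs[i] if ys[i] == 1 else -qs[i]   # point i now falls below the threshold
            if err < best[0]:
                best = (err, (j, xs[i], +1))
            if q.sum() - err < best[0]:              # flipped stump: predict -1 above threshold
                best = (q.sum() - err, (j, xs[i], -1))
    j, b, s = best[1]
    return lambda Xnew: s * np.where(Xnew[:, j] > b, 1, -1)
```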

20 / 27

slide-69
SLIDE 69

Example: execution of AdaBoost

[Figure: training points under the initial weighting D1.]

21 / 27

slide-70
SLIDE 70

Example: execution of AdaBoost

[Figure: weighting D1 and the first weak classifier f1.]

21 / 27

slide-71
SLIDE 71

Example: execution of AdaBoost

[Figure: weighting D1, reweighted D2, and classifier f1.]

21 / 27

slide-72
SLIDE 72

Example: execution of AdaBoost

[Figure: weightings D1, D2 and classifiers f1, f2.]

21 / 27

slide-73
SLIDE 73

Example: execution of AdaBoost

[Figure: weightings D1, D2, D3 and classifiers f1, f2; training points labeled + and −.]

21 / 27

slide-74
SLIDE 74

Example: execution of AdaBoost

[Figure: weightings D1, D2, D3 and classifiers f1, f2, f3; training points labeled + and −.]

21 / 27

slide-75
SLIDE 75

Example: final classifier from AdaBoost

[Figure: training points (labeled + and −) with the three classifiers f1, f2, f3.]

22 / 27

slide-76
SLIDE 76

Example: final classifier from AdaBoost

[Figure: training points (labeled + and −) with classifiers f1, f2, f3 combined.]

Final classifier: f̂(x) = sign( 0.42 f1(x) + 0.65 f2(x) + 0.92 f3(x) ). (Zero training error rate!)

22 / 27

slide-77
SLIDE 77
slide-78
SLIDE 78

A typical run of boosting.

AdaBoost+C4.5 on the “letters” dataset.

[Figure: error rate vs. number of rounds T (10 to 1000, log scale); curves for AdaBoost training error, AdaBoost test error, and C4.5 test error. The number of nodes across all decision trees in f̂ is >2 × 10^6.]

Training error rate is zero after just five rounds, but test error rate continues to decrease, even up to 1000 rounds!

(Figure 1.7 from Schapire & Freund text)

23 / 27

slide-79
SLIDE 79

Boosting the margin.

Final classifier from AdaBoost:

  f̂(x) = sign( Σ_{t=1}^T α_t f_t(x) / Σ_{t=1}^T |α_t| ),   with g(x) := Σ_{t=1}^T α_t f_t(x) / Σ_{t=1}^T |α_t| ∈ [−1, +1].

Call y · g(x) ∈ [−1, +1] the margin achieved on example (x, y). (Note: ℓ1 not ℓ2 normalized.)

24 / 27

slide-80
SLIDE 80

Boosting the margin.

Final classifier from AdaBoost:

  f̂(x) = sign( Σ_{t=1}^T α_t f_t(x) / Σ_{t=1}^T |α_t| ),   with g(x) := Σ_{t=1}^T α_t f_t(x) / Σ_{t=1}^T |α_t| ∈ [−1, +1].

Call y · g(x) ∈ [−1, +1] the margin achieved on example (x, y). (Note: ℓ1 not ℓ2 normalized.)

Margin theory [Schapire, Freund, Bartlett, and Lee, 1998]:
◮ Larger margins ⇒ better generalization, independent of T.
◮ AdaBoost tends to increase margins on training examples.

“letters” dataset:

                        T = 5    T = 100   T = 1000
  training error rate    0.0%      0.0%      0.0%
  test error rate        8.4%      3.3%      3.1%
  % margins ≤ 0.5        7.7%      0.0%      0.0%
  min. margin            0.14      0.52      0.55

◮ Similar phenomenon in deep networks and gradient descent.

24 / 27

slide-81
SLIDE 81

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .
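A small sketch of how such a margin distribution can be computed (my framing of the formula above: `scores` is assumed to be an (n, k) array of per-class scores f(x_i), and `y` holds integer class labels):

```python
import numpy as np

def margin_distribution(scores, y):
    n = len(y)
    correct = scores[np.arange(n), y]          # f(x_i)_{y_i}
    masked = scores.copy()
    masked[np.arange(n), y] = -np.inf          # exclude the true class
    runner_up = masked.max(axis=1)             # max over y' != y_i of f(x_i)_{y'}
    return np.sort(correct - runner_up)        # sorted margins, ready to plot
```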

25 / 27

slide-82
SLIDE 82

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distribution and decision-surface contours. Panel: Boosted stumps (O(n) param.).]

25 / 27

slide-83
SLIDE 83

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distribution and decision-surface contours. Panel: Boosted stumps (O(n) param.).]

25 / 27

slide-84
SLIDE 84

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distribution and decision-surface contours. Panel: Boosted stumps (O(n) param.).]

25 / 27

slide-85
SLIDE 85

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distribution and decision-surface contours. Panel: Boosted stumps (O(n) param.).]

25 / 27

slide-86
SLIDE 86

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distribution and decision-surface contours. Panel: Boosted stumps (O(n) param.).]

25 / 27

slide-87
SLIDE 87

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distributions and decision-surface contours. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

25 / 27

slide-88
SLIDE 88

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distributions and decision-surface contours. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

25 / 27

slide-89
SLIDE 89

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distributions and decision-surface contours. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

25 / 27

slide-90
SLIDE 90

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distributions and decision-surface contours. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

25 / 27

slide-91
SLIDE 91

Margin plots

Given ((x_i, y_i))_{i=1}^n and f, plot the unnormalized margin distribution

  f(x_i)_{y_i} − max_{y ≠ y_i} f(x_i)_y .

[Figure: margin distributions and decision-surface contours. Panels: Boosted stumps (O(n) param.), 2-layer ReLU (O(n) param.), 3-layer ReLU (O(n) param.).]

25 / 27

slide-92
SLIDE 92

Summary

26 / 27

slide-93
SLIDE 93

Summary

◮ We can do better than the best predictor.
◮ (Bagging.) If errors are independent, a majority vote works well.
◮ (Boosting.) If they are not independent, a reweighted majority works well; the weights can be found with convex optimization.

27 / 27