BBM406 Fundamentals of Machine Learning – Lecture 20: AdaBoost



SLIDE 1

BBM406 Fundamentals of Machine Learning

Lecture 20: AdaBoost

Aykut Erdem // Hacettepe University // Fall 2019

Illustration adapted from Alex Rogozhnikov

SLIDE 2

Last time… Bias/Variance Tradeoff

2

Graphical illustration of bias and variance. (http://scott.fortmann-roe.com/docs/BiasVariance.html)

slide by David Sontag
SLIDE 3

Last time… Bagging

  • Leo Breiman (1994)
  • Take repeated bootstrap samples from training set D.
  • Bootstrap sampling: Given set D containing N training examples, create D’ by drawing N examples at random with replacement from D.
  • Bagging (sketched in code below):
  • Create k bootstrap samples D1 ... Dk.
  • Train distinct classifier on each Di.
  • Classify new instance by majority vote / average.

3

slide by David Sontag
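A minimal sketch of the bagging procedure above, assuming scikit-learn-style base classifiers and labels in {−1, +1}; the function names (bagging_fit, bagging_predict) and the decision-tree base learner are illustrative choices, not part of the slide.

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, k=25, base=DecisionTreeClassifier(max_depth=3)):
        """Train k classifiers, one per bootstrap sample D_i drawn from D."""
        n = len(X)
        models = []
        for _ in range(k):
            idx = np.random.randint(0, n, size=n)          # N draws with replacement
            models.append(clone(base).fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Classify new instances by majority vote (labels assumed to be -1/+1)."""
        votes = np.stack([m.predict(X) for m in models])   # one row of votes per model
        return np.sign(votes.sum(axis=0))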
SLIDE 4

Last time… Random Forests

4

slide by Nando de Freitas

[From the book of Hastie, Friedman and Tibshirani]

[Figure: bootstrap trees for t = 1, 2, 3]

SLIDE 5

Boosting

5

SLIDE 6

Boosting Ideas

  • Main idea: use weak learner to create strong learner.
  • Ensemble method: combine base classifiers returned by weak learner.
  • Finding simple relatively accurate base classifiers often not hard.
  • But, how should base classifiers be combined?

6

slide by Mehryar Mohri
SLIDE 7

Example: “How May I Help You?”

  • Goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
  • yes I’d like to place a collect call long distance please (Collect)
  • operator I need to make a call but I need to bill it to my office (ThirdNumber)
  • yes I’d like to place a call on my master card please (CallingCard)
  • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
  • Observation:
  • easy to find “rules of thumb” that are “often” correct
  • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
  • hard to find single highly accurate prediction rule

7

[Gorin et al.]

slide by Rob Schapire
SLIDE 8

Boosting: Intuition

  • Instead of learning a single (weak) classifier, learn many weak classifiers that are good at different parts of the input space
  • Output class: (weighted) vote of each classifier
  • Classifiers that are most “sure” will vote with more conviction
  • Classifiers will be most “sure” about a particular part of the space
  • On average, do better than single classifier!
  • But how do you???
  • force classifiers to learn about different parts of the input space?
  • weigh the votes of different classifiers?

8

slide by Aarti Singh & Barnabas Poczos
SLIDE 9

Boosting [Schapire, 1989]

  • Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
  • On each iteration t:
  • weight each training example by how incorrectly it was classified
  • Learn a hypothesis – ht
  • A strength for this hypothesis – αt
  • Final classifier:
  • A linear combination of the votes of the different classifiers weighted by their strength
  • Practically useful
  • Theoretically interesting

9

slide by Aarti Singh & Barnabas Poczos
SLIDE 10

Boosting: Intuition

  • Want to pick weak classifiers that contribute something to the ensemble

10

Greedy algorithm: for m = 1, ..., M
  • Pick a weak classifier hm
  • Adjust weights: misclassified examples get “heavier”
  • αm set according to weighted error of hm

slide by Raquel Urtasun

[Source: G. Shakhnarovich]

SLIDES 11–16

Boosting: Intuition

(The same slide is repeated, with figures building up the ensemble on a toy example over successive iterations.)

slide by Raquel Urtasun

[Source: G. Shakhnarovich]

SLIDE 17

First Boosting Algorithms

  • [Schapire ’89]:
  • first provable boosting algorithm
  • [Freund ’90]:
  • “optimal” algorithm that “boosts by majority”
  • [Drucker, Schapire & Simard ’92]:
  • first experiments using boosting
  • limited by practical drawbacks
  • [Freund & Schapire ’95]:
  • introduced “AdaBoost” algorithm
  • strong practical advantages over previous boosting algorithms

17

slide by Rob Schapire
SLIDE 18

The AdaBoost Algorithm

18

SLIDE 19

Toy Example

weak hypotheses = vertical or horizontal half-planes

19

Minimize the weighted error εt. For binary ht, typically use αt = ½ log((1 − εt)/εt).

slide by Rob Schapire
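As a quick check on the numbers in the rounds that follow, the vote formula αt = ½ log((1 − εt)/εt) roughly reproduces the α values shown on the next slides; small differences come from rounding of the reported errors. A few lines of Python:

    import math

    def alpha(eps):
        """Vote for a weak classifier with weighted error eps."""
        return 0.5 * math.log((1 - eps) / eps)

    for eps in (0.30, 0.21, 0.14):
        print(f"eps = {eps:.2f}  ->  alpha = {alpha(eps):.2f}")
    # eps = 0.30  ->  alpha = 0.42
    # eps = 0.21  ->  alpha = 0.66   (slide reports 0.65)
    # eps = 0.14  ->  alpha = 0.91   (slide reports 0.92)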
SLIDE 20

Round 1

20

h1: ε1 = 0.30

slide by Rob Schapire
SLIDE 21

Round 1

21

h1: ε1 = 0.30, α1 = 0.42

slide by Rob Schapire
SLIDE 22

Round 1

22

h1: ε1 = 0.30, α1 = 0.42; updated weights D2

slide by Rob Schapire
SLIDE 23

Round 2

23

h2: ε2 = 0.21

slide by Rob Schapire
SLIDE 24

Round 2

24

h2: ε2 = 0.21, α2 = 0.65

slide by Rob Schapire
SLIDE 25

Round 2

25

h2: ε2 = 0.21, α2 = 0.65; updated weights D3

slide by Rob Schapire
SLIDE 26

Round 3

26

h3: ε3 = 0.14

slide by Rob Schapire
SLIDE 27

Round 3

27

h3: ε3 = 0.14, α3 = 0.92

slide by Rob Schapire
SLIDE 28

Hfinal = sign(0.42 h1 + 0.65 h2 + 0.92 h3)

Final Hypothesis

28

slide by Rob Schapire
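The final hypothesis is just the sign of the weighted vote of the three rounds. A small sketch, where h1, h2, h3 stand in for the three half-plane classifiers; the thresholds below are placeholders, not the ones from the figures.

    import numpy as np

    def H_final(x, h1, h2, h3):
        """Weighted vote of the three weak hypotheses from the toy example."""
        return np.sign(0.42 * h1(x) + 0.65 * h2(x) + 0.92 * h3(x))

    # Placeholder stumps, for illustration only:
    h1 = lambda x: 1 if x[0] < 0.3 else -1   # vertical half-plane
    h2 = lambda x: 1 if x[0] < 0.8 else -1   # vertical half-plane
    h3 = lambda x: 1 if x[1] > 0.5 else -1   # horizontal half-plane
    print(H_final((0.2, 0.9), h1, h2, h3))   # prints 1.0 (all three vote +1)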
SLIDE 29

Voted combination of classifiers

  • The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier
  • We consider voted combinations of simple binary ±1 component classifiers, where the (non-negative) votes αi can be used to emphasize component classifiers that are more reliable than others

29

slide by Tommi S. Jaakkola
SLIDE 30

Components: Decision stumps

  • Consider the following simple family of component classifiers generating ±1 labels; these are called decision stumps (see the sketch below)
  • Each decision stump pays attention to only a single component of the input vector

30

slide by Tommi S. Jaakkola
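One plausible way to fit a single decision stump under example weights, consistent with the description above: for each input dimension, try candidate thresholds and keep the dimension/threshold/sign with the lowest weighted error. The names fit_stump and stump_predict are made up for illustration.

    import numpy as np

    def fit_stump(X, y, w):
        """Return (k, t, s) so the stump predicts s where x[k] > t, else -s.

        X: (n, d) inputs, y: (n,) labels in {-1, +1}, w: (n,) non-negative weights.
        """
        n, d = X.shape
        best, best_err = None, np.inf
        for k in range(d):                                   # one input component per stump
            vals = np.unique(X[:, k])
            cands = (vals[:-1] + vals[1:]) / 2 if len(vals) > 1 else vals
            for t in cands:
                for s in (+1, -1):
                    pred = s * np.where(X[:, k] > t, 1, -1)
                    err = w[pred != y].sum()                 # weighted misclassification
                    if err < best_err:
                        best, best_err = (k, t, s), err
        return best

    def stump_predict(X, k, t, s):
        return s * np.where(X[:, k] > t, 1, -1)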
SLIDE 31

Voted combinations (cont’d.)

  • We need to define a loss function for the combination so we can determine which new component h(x; θ) to add and how many votes it should receive
  • While there are many options for the loss function, we consider here only a simple exponential loss

31

slide by Tommi S. Jaakkola
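The exponential loss mentioned above is exp(−y·f(x)), where f(x) is the current voted combination; its empirical (training-set) version is just an average. A tiny sketch, with names of my own choosing:

    import numpy as np

    def empirical_exp_loss(F, y):
        """Average exponential loss exp(-y_i * F_i) over the training set.

        F: (n,) real-valued ensemble scores f(x_i); y: (n,) labels in {-1, +1}.
        """
        return np.mean(np.exp(-y * F))

Points the ensemble already classifies correctly with a large margin contribute almost nothing, while badly misclassified points dominate the average, which is what produces the weighting towards mistakes on the following slides.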
SLIDE 32

Modularity, errors, and loss

  • Consider adding the mth component:

32

slide by Tommi S. Jaakkola
SLIDE 33

Modularity, errors, and loss

  • Consider adding the mth component:

33

slide by Tommi S. Jaakkola
SLIDE 34

Modularity, errors, and loss

  • Consider adding the mth component:

  • So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes).

34

slide by Tommi S. Jaakkola
SLIDE 35

Empirical exponential loss (cont’d.)

  • To increase modularity we’d like to further decouple the optimization of h(x; θm) from the associated votes αm
  • To this end we select the h(x; θm) that optimizes the rate at which the loss would decrease as a function of αm

35

slide by Tommi S. Jaakkola
SLIDE 36

Empirical exponential loss (cont’d.)

  • We find the h(x; θm) that minimizes the weighted training loss
  • We can also normalize the weights so that they sum to one

36

slide by Tommi S. Jaakkola
SLIDE 37

Empirical exponential loss (cont’d.)

  • We find the h(x; θm) that minimizes the weighted training error, where the weights are those defined on the previous slide
  • αm is subsequently chosen to minimize the empirical exponential loss

37

slide by Tommi S. Jaakkola
SLIDE 38

The AdaBoost Algorithm

38

slide by Jiri Matas and Jan Šochman
SLIDES 39–44

The AdaBoost Algorithm

Given: (x1, y1), . . . , (xm, ym); xi ∈ X, yi ∈ {−1, +1}

Initialise weights D1(i) = 1/m

For t = 1, ..., T:
  • Find ht = arg min over hj ∈ H of εj, where εj = Σ_{i=1..m} Dt(i) · ⟦yi ≠ hj(xi)⟧
  • If εt ≥ 1/2 then stop
  • Set αt = ½ log((1 − εt)/εt)
  • Update Dt+1(i) = Dt(i) · exp(−αt yi ht(xi)) / Zt, where Zt is a normalisation factor

Output the final classifier: H(x) = sign( Σ_{t=1..T} αt ht(x) )

(Slides 39–43 reveal the algorithm one line at a time; slide 44 shows the complete algorithm together with a plot of training error against boosting step at t = 1.)

slide by Jiri Matas and Jan Šochman

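A compact sketch of the algorithm as written above, with the weak learner passed in as a pair of functions (for example the decision-stump sketch from earlier, wired in with a small adapter). This is one possible implementation, not the original authors' code; the names are illustrative.

    import numpy as np

    def adaboost_fit(X, y, T, weak_fit, weak_predict):
        """Run T rounds of AdaBoost on labels in {-1, +1}.

        weak_fit(X, y, D) -> weak-hypothesis parameters;
        weak_predict(X, params) -> predictions in {-1, +1}.
        """
        n = len(X)
        D = np.full(n, 1.0 / n)                       # D_1(i) = 1/m
        hyps, alphas = [], []
        for t in range(T):
            h = weak_fit(X, y, D)                     # weak hypothesis with small weighted error
            pred = weak_predict(X, h)
            eps = D[pred != y].sum()                  # weighted training error
            if eps >= 0.5:                            # no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
            D = D * np.exp(-alpha * y * pred)         # misclassified examples get heavier
            D = D / D.sum()                           # divide by the normalisation factor Z_t
            hyps.append(h)
            alphas.append(alpha)
        return hyps, alphas

    def adaboost_predict(X, hyps, alphas, weak_predict):
        """Final classifier H(x) = sign(sum_t alpha_t * h_t(x))."""
        F = sum(a * weak_predict(X, h) for h, a in zip(hyps, alphas))
        return np.sign(F)

    # Example wiring with the stump sketch from the decision-stumps slide:
    #   hyps, alphas = adaboost_fit(X, y, T=40, weak_fit=fit_stump,
    #                               weak_predict=lambda X, p: stump_predict(X, *p))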
SLIDES 45–51

The AdaBoost Algorithm

(The same algorithm slide is repeated while the training-error plot is updated for t = 2, 3, 4, 5, 6, 7, and 40.)

slide by Jiri Matas and Jan Šochman

SLIDES 52–54

Reweighting

(Figures illustrating how the example weights Dt change from round to round.)

slide by Jiri Matas and Jan Šochman
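To make the reweighting step concrete, here is a small numeric illustration of one update with Dt+1(i) = Dt(i)·exp(−αt·yi·ht(xi))/Zt; the labels and predictions are invented for the example, not taken from the slides.

    import numpy as np

    y    = np.array([+1, +1, -1, -1, +1])     # true labels
    pred = np.array([+1, -1, -1, +1, +1])     # h_t's predictions: examples 2 and 4 are wrong
    D    = np.full(5, 0.2)                    # uniform weights before the update

    eps   = D[pred != y].sum()                # weighted error = 0.4
    alpha = 0.5 * np.log((1 - eps) / eps)     # about 0.20
    D_new = D * np.exp(-alpha * y * pred)
    D_new = D_new / D_new.sum()
    print(np.round(D_new, 3))                 # -> [0.167 0.25 0.167 0.25 0.167]

After the update the misclassified examples carry half of the total weight between them, so the next weak classifier is pushed to get them right.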
SLIDE 55

Boosting results – Digit recognition

  • Boosting is often (but not always) robust to overfitting
  • Test set error decreases even after training error is zero

55

[Schapire, 1989]

[Plot: training error and test error vs. number of rounds]

slide by Carlos Guestrin
SLIDE 56

Application: Detecting Faces

  • Training Data
  • 5000 faces
  • All frontal
  • 300 million non-faces
  • 9500 non-face images

56

[Viola & Jones]

slide by Rob Schapire
SLIDE 57

Application: Detecting Faces

  • Problem: find faces in photograph or movie
  • Weak classifiers: detect a light/dark rectangle in the image (see the sketch below)


  • Many clever tricks to make extremely fast and accurate

57

[Viola & Jones]

slide by Rob Schapire
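The "light/dark rectangle" weak classifiers are Haar-like features that can be evaluated very quickly with an integral image; below is a rough sketch of one two-rectangle feature. The particular coordinates, window size, and threshold are made-up illustrations, not Viola & Jones's actual configuration.

    import numpy as np

    def integral_image(img):
        """Cumulative sums so any rectangle sum costs only four lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, r0, c0, r1, c1):
        """Sum of img[r0:r1, c0:c1] from the integral image ii."""
        s = ii[r1 - 1, c1 - 1]
        if r0 > 0:            s -= ii[r0 - 1, c1 - 1]
        if c0 > 0:            s -= ii[r1 - 1, c0 - 1]
        if r0 > 0 and c0 > 0: s += ii[r0 - 1, c0 - 1]
        return s

    def two_rect_feature(ii, r0, c0, h, w):
        """'Light minus dark': left rectangle sum minus right rectangle sum."""
        left  = rect_sum(ii, r0, c0, r0 + h, c0 + w)
        right = rect_sum(ii, r0, c0 + w, r0 + h, c0 + 2 * w)
        return left - right

    def weak_classify(window, theta=0.0):
        """A stump on one such feature for a 24x24 grey-level window."""
        ii = integral_image(window.astype(float))
        return 1 if two_rect_feature(ii, 4, 4, 8, 6) > theta else -1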
SLIDE 58

Boosting vs. Logistic Regression

58

Logistic regression:
  • Minimize log loss
  • Define f(x) as a weighted sum of predefined features xj (a linear classifier)
  • Jointly optimize over all weights w0, w1, w2, …

Boosting:
  • Minimize exp loss
  • Define f(x) as a weighted sum of weak classifiers ht(x), chosen dynamically to fit the data (not a linear classifier)
  • Weights αt learned incrementally, one per iteration t

slide by Aarti Singh
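The two losses being contrasted can be written as functions of the margin m = y·f(x); this side-by-side view is my framing of the slide's comparison, not text from the slide.

    import numpy as np

    def log_loss(margin):
        """Logistic regression: log(1 + exp(-y f(x)))."""
        return np.log1p(np.exp(-margin))

    def exp_loss(margin):
        """Boosting: exp(-y f(x))."""
        return np.exp(-margin)

    for m in (-2.0, 0.0, 2.0):
        print(f"margin {m:+.1f}:  log loss {log_loss(m):.3f}   exp loss {exp_loss(m):.3f}")
    # Both decrease with the margin, but the exponential loss punishes badly
    # misclassified (large negative margin) points much more aggressively.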
SLIDE 59

Boosting vs. Bagging

Bagging:
  • Resample data points
  • Weight of each classifier is the same
  • Only variance reduction

59

Boosting:
  • Reweights data points (modifies their distribution)
  • Weight is dependent on classifier’s accuracy
  • Both bias and variance reduced – learning rule becomes more complex with iterations

slide by Aarti Singh
SLIDE 60

Next Lecture:

K-Means Clustering

60