SLIDE 1

Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS

Statistical Machine Learning

A Crash Course

Part III: Boosting

  • 11.05.2012
SLIDE 2

Combining Classifiers

■ Horse race prediction:

SLIDE 3

Combining Classifiers

■ How do we make money from horse racing bets?
■ Ask a professional.
■ It is very likely that...

  • The professional cannot give a single highly accurate rule.
  • But presented with a set of races, can always generate better-than-random rules.

■ Can you get rich?

■ Disclaimer: We are not saying you should actually try this at home :-)

SLIDE 4

Combining Classifiers

■ Idea:

  • Ask an expert for their rule-of-thumb.
  • Assemble the set of cases where the rule-of-thumb fails (hard cases).

  • Ask the expert again for the selected set of hard cases.
  • And so on…

■ Combine many rules-of-thumb.

SLIDE 5

Combining Classifiers

■ How do we actually do this?
■ How do we choose races on each round?

  • Concentrate on the “hardest” races (those most often misclassified by previous rules of thumb).

■ How do we combine rules of thumb into a single prediction rule?

  • Take a (weighted) majority vote of several rules-of-thumb.
  • We take a weighted average of simple rules (models):

h_t : \mathbb{R}^d \to \{+1, -1\}

H(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} \alpha_t h_t(x) \Bigr)
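To make this concrete, here is a minimal Python sketch (my own illustration, not from the slides) of the weighted vote, assuming each rule-of-thumb h_t is given as a function that maps a feature vector to +1 or −1:

```python
def strong_classify(x, weak_rules, alphas):
    """Weighted majority vote: H(x) = sign(sum_t alpha_t * h_t(x))."""
    score = sum(a * h(x) for a, h in zip(alphas, weak_rules))
    return 1 if score >= 0 else -1

# Hypothetical rules-of-thumb on a 2D feature vector (e.g., two horse statistics).
rules = [lambda x: 1 if x[0] > 0 else -1,
         lambda x: 1 if x[1] > 0 else -1,
         lambda x: 1 if x[0] + x[1] > 1 else -1]

print(strong_classify([0.5, 2.0], rules, [0.9, 0.4, 0.3]))  # -> 1
```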

SLIDE 6

Boosting

■ General method of converting rough rules of thumb into a highly accurate prediction rule.
■ More formally:

  • Given a “weak” learning algorithm that can consistently find “weak classifiers” with a (training) error of ε_t ≤ 1/2 − γ.
  • A boosting algorithm can provably construct a “strong classifier” with an arbitrarily small training error.

■ As long as we have a “weak” learning algorithm that does better than chance, we can convert it into an algorithm that performs arbitrarily well!

SLIDE 7

AdaBoost: Toy Example

■ Training data:

SLIDE 8

AdaBoost: Toy Example

■ Round 1:

[Figure: 1st weak classifier; reweighted training data]

SLIDE 9

AdaBoost: Toy Example

■ Round 2:

[Figure: 1st weak classifier; 2nd weak classifier; reweighted training data]

SLIDE 10

AdaBoost: Toy Example

■ Round 3:

[Figure: 1st, 2nd, and 3rd weak classifiers]

SLIDE 11

AdaBoost: Toy Example

■ Weighted combination:

SLIDE 12

AdaBoost: Toy Example

■ Final hypothesis / “strong” classifier:

SLIDE 13

AdaBoost

■ Given: Training data with labels (x_1, y_1), \ldots, (x_N, y_N), where x_i \in \mathbb{R}^d, y_i \in \{+1, -1\}.
■ Initialize weights for every data point: D_1(i) = \frac{1}{N}.
■ Loop over t = 1, \ldots, T (the number of boosting rounds):

  • Train the weak learner h_t : \mathbb{R}^d \to \{+1, -1\} on the training data so that the weighted error with weights D_t is minimized.
  • Choose an appropriate weight \alpha_t \in \mathbb{R}^+ for the weak classifier.
  • Update the data weights as

    D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\},

    where Z_t is chosen such that D_{t+1} sums to 1.
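To make the loop concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture). It assumes a helper weak_learner(X, y, D) that returns a prediction function together with its weighted error, and it already uses the α_t weighting derived a few slides further on:

```python
import numpy as np

def adaboost_train(X, y, weak_learner, T):
    """Sketch of the AdaBoost loop: X is (N, d), y is (N,) with labels in {+1, -1}.

    weak_learner(X, y, D) -> (predict_fn, eps) is an assumed helper returning a
    classifier (predict_fn(X) -> vector of +/-1) and its weighted error eps.
    """
    N = X.shape[0]
    D = np.full(N, 1.0 / N)                      # D_1(i) = 1/N
    rules, alphas = [], []
    for t in range(T):
        h, eps = weak_learner(X, y, D)
        eps = np.clip(eps, 1e-10, 1.0)           # guard against log(0)
        if eps >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)  # alpha_t from the later slides
        D *= np.exp(-alpha * y * h(X))           # up-weight mistakes, down-weight correct
        D /= D.sum()                             # Z_t: renormalize so D_{t+1} sums to 1
        rules.append(h)
        alphas.append(alpha)
    return rules, alphas

def adaboost_predict(X, rules, alphas):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    scores = sum(a * h(X) for a, h in zip(alphas, rules))
    return np.where(scores >= 0, 1, -1)
```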

SLIDE 14

AdaBoost

■ Given: Training data with labels (x_1, y_1), \ldots, (x_N, y_N), where x_i \in \mathbb{R}^d, y_i \in \{+1, -1\}.
■ Return the weighted (“strong”, ensemble) classifier:

  H(x) = \operatorname{sign}\Bigl( \sum_{t=1}^{T} \alpha_t h_t(x) \Bigr)

■ Intuition:

  • Boosting uses weighted training data and adapts the weights every round.
  • The weights make the algorithm focus on the wrongly classified examples:

    \exp\{-\alpha_t y_i h_t(x_i)\} \;
    \begin{cases} > 1 & \text{if } y_i \neq h_t(x_i) \\ < 1 & \text{if } y_i = h_t(x_i) \end{cases}

SLIDE 15

AdaBoost: Weak Learners

■ Training the weak learner:

  • Given training data (x_1, y_1), \ldots, (x_N, y_N)
  • and weights D_t(i) for all data points.
  • Select the weak classifier with the smallest weighted error:

    h_t = \arg\min_{h \in \mathcal{H}} \epsilon_t
    \qquad \text{with} \qquad
    \epsilon_t = \sum_{i=1}^{N} D_t(i)\, [y_i \neq h(x_i)]

  • Prerequisite: Weighted training error \epsilon_t \leq \frac{1}{2} - \gamma_t, with \gamma_t > 0.

■ Examples for \mathcal{H}:

  • Weighted least-squares classifier
  • Decision stumps (hold on...)
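A small NumPy sketch (my own, not from the slides) of this selection step, assuming the hypothesis class H is given as a finite pool of candidate prediction functions:

```python
import numpy as np

def weighted_error(D, y, pred):
    """eps = sum_i D(i) * [y_i != h(x_i)], with D assumed to sum to 1."""
    return float(np.sum(D * (pred != y)))

def select_weak_classifier(X, y, D, hypotheses):
    """h_t = argmin over the candidate pool of the weighted training error."""
    best_h, best_eps = None, np.inf
    for h in hypotheses:                 # each h maps X -> vector of +/-1
        eps = weighted_error(D, y, h(X))
        if eps < best_eps:
            best_h, best_eps = h, eps
    return best_h, best_eps
```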

SLIDE 16

AdaBoost: Weak Learners

SLIDE 17

AdaBoost: Weak Learners

SLIDE 18

AdaBoost

■ How do we select α_t?
■ We want to minimize the empirical error:

  \epsilon_{\mathrm{tr}}(H) = \frac{1}{N} \sum_{i=1}^{N} [y_i \neq H(x_i)]

■ The empirical error can be upper bounded [Freund & Schapire]:

  \epsilon_{\mathrm{tr}}(H) \leq \prod_{t=1}^{T} Z_t
  \qquad \text{with} \qquad
  Z_t = \sum_{i=1}^{N} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

■ To minimize the empirical error, we can greedily minimize Z_t in each round.

SLIDE 19

AdaBoost

■ Select α_t by greedily minimizing Z_t(α) in each round.

  • Minimizes an upper bound on the empirical error.

■ Minimize

  Z_t(\alpha) = \sum_{i=1}^{N} D_t(i) \exp\{-\alpha y_i h_t(x_i)\}

■ We obtain the AdaBoost weighting:

  \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
  \qquad \text{with} \qquad
  \epsilon_t = \sum_{i=1}^{N} D_t(i)\, [y_i \neq h_t(x_i)]
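For completeness (the slide only states the result), the minimization can be done in closed form by splitting Z_t(α) over correctly and incorrectly classified points, using that y_i h_t(x_i) is either +1 or −1:

```latex
Z_t(\alpha) = \sum_{i:\, y_i = h_t(x_i)} D_t(i)\, e^{-\alpha}
            + \sum_{i:\, y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha}
            = (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}

\frac{\partial Z_t}{\partial \alpha} = -(1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
\;\Longrightarrow\; e^{2\alpha} = \frac{1 - \epsilon_t}{\epsilon_t}
\;\Longrightarrow\; \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
```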

SLIDE 20

AdaBoost: Reweighting

D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

SLIDE 21

AdaBoost: Reweighting

Increase the weight on incorrectly classified examples; decrease the weight on correctly classified examples:

D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

\exp\{-\alpha_t y_i h_t(x_i)\} \;
\begin{cases} > 1 & \text{if } y_i \neq h_t(x_i) \\ < 1 & \text{if } y_i = h_t(x_i) \end{cases}

SLIDE 22

AdaBoost: Reweighting

■ Eventually, the algorithm focuses only on the very difficult cases:

SLIDE 23

AdaBoost: More realistic example

■ Initialize...

t = 0

SLIDE 24

AdaBoost: More realistic example

■ Initialize...
■ For t = 1, \ldots, T:

  • Find h_t = \arg\min_{h \in \mathcal{H}} \epsilon_t
  • Stop if \epsilon_t > \frac{1}{2}
  • Set \alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}
  • Reweight the data: D_{t+1}(i) = \frac{1}{Z_t} D_t(i) \exp\{-\alpha_t y_i h_t(x_i)\}

[Figure: round t = 1]

SLIDE 25

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 2.

SLIDE 26

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 3.

SLIDE 27

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 4.

SLIDE 28

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 5.

SLIDE 29

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 6.

SLIDE 30

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 7.

SLIDE 31

AdaBoost: More realistic example

■ Same loop as on the previous slide; the figure now shows round t = 40.

SLIDE 32

AdaBoost: Convergence

■ It can be shown that the training error is upper-bounded as:

  \epsilon_{\mathrm{tr}}(H) \leq \prod_{t=1}^{T} Z_t = \exp\Bigl( -2 \sum_{t=1}^{T} \gamma_t^2 \Bigr)

  • where γ_t = 1/2 − ε_t denotes how much better the weak learner is compared to random guessing.

■ If γ_t > γ, it holds that

  \epsilon_{\mathrm{tr}} \leq \exp\{-2 \gamma^2 T\}

  • This means: If the weak learner is always better than chance, we can make the boosted classifier perform arbitrarily well (on the training data).
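A quick numeric illustration (my own example, not from the slides): with a hypothetical constant edge of γ = 0.1, the bound guarantees less than 1% training error after roughly 231 rounds.

```python
import numpy as np

def training_error_bound(gamma, T):
    """Upper bound exp(-2 * gamma^2 * T) on the training error after T rounds."""
    return np.exp(-2.0 * gamma ** 2 * T)

for T in (10, 100, 231, 500):
    print(T, training_error_bound(0.1, T))
# 10 -> 0.82, 100 -> 0.14, 231 -> 0.0098, 500 -> 4.5e-05
```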

SLIDE 33

AdaBoost: Generalization

■ We might expect:

  • That the training error will go down as we add more weak classifiers.
  • That the test error will drop for a while, but then go up again as the model starts to overfit the training data.

SLIDE 34

AdaBoost: Generalization

■ Instead, what we typically see:

  • The test error goes down even though the training error is already 0.

  • AdaBoost doesn’t seem to overfit!
  • Why?

SLIDE 35

AdaBoost: Generalization

■ AdaBoost has a built-in notion of margin maximization that in turn leads to good generalization.
■ SVM margin:

  \min_{(x_i, y_i)} \frac{y_i (w^T x_i + b)}{\|w\|_2}

■ AdaBoost margin:

  \min_{(x_i, y_i)} \frac{y_i\, \alpha^T h(x_i)}{\|\alpha\|_1}
  = \min_{(x_i, y_i)} \frac{y_i f(x_i)}{\|\alpha\|_1}
  \qquad \text{with} \qquad
  f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

■ Caveat:

  • Margin maximization is not always effective.
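A small sketch (my own illustration) of computing the normalized AdaBoost margins of a training set, reusing the rules/alphas representation assumed in the earlier snippets:

```python
import numpy as np

def adaboost_margins(X, y, rules, alphas):
    """Normalized margins y_i * f(x_i) / ||alpha||_1, each lying in [-1, +1]."""
    alphas = np.asarray(alphas, dtype=float)
    f = sum(a * h(X) for a, h in zip(alphas, rules))   # f(x_i) = sum_t alpha_t h_t(x_i)
    return y * f / np.abs(alphas).sum()

# margins.min() is the "minimum margin" that the table on the next slide tracks.
```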

SLIDE 36

AdaBoost: Generalization

■ Margin of a data point:  y_i f(x_i) / ||α||_1
■ Distribution of margins:

  • AdaBoost tends to increase the margin, because it focuses on the difficult cases.

  Rounds             5      100    1000
  Training error     0.0    0.0    0.0
  Test error         8.4    3.3    3.1
  % margins ≤ 0.5    7.7    0.0    0.0
  Minimum margin     0.14   0.52   0.55

SLIDE 37

Decision Stumps

■ Often the simplest, but still useful, weak learner.
■ Axis-aligned linear classifier:

  • Try all possible thresholds along all feature dimensions (see the sketch below).
  • We only have to try thresholds that lead to different classifications.
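A minimal NumPy sketch (my own illustration, not the lecture's code) of such a decision stump, trained by exhaustive search over feature dimensions and thresholds to minimize the weighted error ε_t:

```python
import numpy as np

def train_stump(X, y, D):
    """Axis-aligned decision stump: h(x) = s * sign(x[j] - theta), s in {+1, -1}.

    Tries every feature dimension j and every threshold between consecutive
    distinct feature values, keeping the stump with the smallest weighted error.
    """
    N, d = X.shape
    best = (np.inf, None)                          # (eps, (j, theta, s))
    for j in range(d):
        values = np.unique(X[:, j])
        midpoints = (values[:-1] + values[1:]) / 2.0
        for theta in np.concatenate(([values[0] - 1.0], midpoints)):
            pred = np.where(X[:, j] > theta, 1, -1)
            for s in (+1, -1):
                eps = np.sum(D * (s * pred != y))
                if eps < best[0]:
                    best = (eps, (j, theta, s))
    eps, (j, theta, s) = best
    predict = lambda Xnew: s * np.where(Xnew[:, j] > theta, 1, -1)
    return predict, eps
```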

SLIDE 38

AdaBoost: Advantages

■ Quite simple & easy to program.

  • We don’t need a complex classifier, but just a simple weak learner.

■ Only a single parameter to tune: the number of boosting rounds T.

  • Though AdaBoost is not very sensitive to it...

■ Very flexible:

  • Can be combined with almost any classifier (even neural networks).
  • Also works well with discrete features (not covered here).
  • Very useful as a feature selection method.

■ Solid theoretical foundation:

  • E.g., provably effective (assuming a weak learner that does better than chance).

SLIDE 39

AdaBoost: Caveats

■ The actual performance depends on the suitability of the data & the weak learner.
■ AdaBoost can fail if:

  • The weak hypothesis is too complex (overfitting)
  • The weak hypothesis is too weak (γ_t → 0 too quickly)
  • Underfitting
  • Small margins → overfitting

■ Empirically, AdaBoost seems especially susceptible to noise:

  • There are variants that are more robust, however.

SLIDE 40

AdaBoost: History

■ 1990 – Boost-by-majority algorithm (Freund)
■ 1995 – AdaBoost (Freund & Schapire)
■ 1997 – Generalized version of AdaBoost (Schapire & Singer)
■ 1998 – Theoretical analyses (Friedman, Hastie & Tibshirani and many others)
■ 2001 – AdaBoost in face detection (Viola & Jones)

  • Took off in various application areas.

■ Freund & Schapire won the 2003 Gödel Prize for AdaBoost.

SLIDE 41

Extensions

■ Posterior probability of the class assignment (of the “strong” classifier):

  p(y = +1 \mid x) = \frac{\exp(f(x))}{\exp(f(x)) + \exp(-f(x))}
  \qquad \text{with} \qquad
  f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)

■ Multi-class extensions are also possible:

  • AdaBoost.M1
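A one-line sketch (my own) of this posterior; note that it is algebraically the logistic sigmoid applied to 2·f(x):

```python
import numpy as np

def posterior_positive(f_x):
    """p(y = +1 | x) = exp(f(x)) / (exp(f(x)) + exp(-f(x))) = sigmoid(2 * f(x))."""
    return 1.0 / (1.0 + np.exp(-2.0 * f_x))

print(posterior_positive(0.0))   # 0.5: undecided
print(posterior_positive(2.3))   # ~0.99: confidently positive
```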

SLIDE 42

Try it yourself

■ Boosting applet by Yoav Freund:

  • http://www.cse.ucsd.edu/~yfreund/adaboost/index.html

SLIDE 43

Application: Face Detection

■ Training data

  • 5000 faces, all frontal
  • 10^8 non-faces
  • normalized for scale and translation

SLIDE 44

Scan window over image pyramid

Image source: A. Zisserman
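The slide only names the idea; below is a rough Python sketch (my own, with assumed parameters) of sliding a fixed-size window over a crude factor-2 image pyramid. classify_window is an assumed hook, e.g. the boosted classifier from the previous slides:

```python
import numpy as np

def detect(image, classify_window, win=24, stride=4):
    """Slide a win x win window over a simple factor-2 image pyramid.

    classify_window(patch) -> score is an assumed hook; detections are returned
    in original-image coordinates as (x, y, size, score).
    """
    detections, scale = [], 1
    img = np.asarray(image, dtype=float)
    while min(img.shape[:2]) >= win:
        H, W = img.shape[:2]
        for y in range(0, H - win + 1, stride):
            for x in range(0, W - win + 1, stride):
                score = classify_window(img[y:y + win, x:x + win])
                if score > 0:
                    detections.append((x * scale, y * scale, win * scale, score))
        img = img[::2, ::2]      # crude downscaling; a real pyramid would smooth first
        scale *= 2
    return detections
```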

SLIDE 45

Features: Simple Filters

SLIDE 46

Face Detection with Boosting

The first couple of features selected are quite intuitive.

SLIDE 47

SLIDE 48

Application: Pedestrian Detection

Viola, Jones and Snow, ICCV’03

SLIDE 49

Training Data

Some positive training examples.

Viola, Jones and Snow, ICCV’03

SLIDE 50

Features: Simple Features

Viola, Jones and Snow, ICCV’03

Examples of simple linear filters. Many different possible filters of this type. 24x24 windows applied at multiple scales. 45,396 possible features in each window.
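The transcript only shows the filters as images. As an aside, Viola & Jones evaluate such rectangle filters in constant time via an integral image; the sketch below (my own, with an assumed left-minus-right sign convention) computes a two-rectangle feature that way:

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so that any rectangle sum costs four lookups."""
    ii = np.cumsum(np.cumsum(np.asarray(img, dtype=float), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))   # zero-pad so rect_sum needs no special cases

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h rectangle with top-left corner (x, y)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, x, y, w, h):
    """Left-minus-right two-rectangle filter, one of the simple features above."""
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```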

SLIDE 51

Pedestrian Filters

Viola, Jones and Snow, ICCV’03

SLIDE 52

Viola, Jones and Snow, ICCV’03
