SLIDE 1

Department of Computer Science CSCI 5622: Machine Learning Chenhao Tan Lecture 13: Boosting Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

SLIDE 2

Learning objectives

  • Understand the general idea behind ensembling
  • Learn about Adaboost
  • Learn the math behind boosting

SLIDE 3

Ensemble methods

  • We have learned:
  • KNN
  • Naïve Bayes
  • Logistic regression
  • Neural networks
  • Support vector machines
  • Why use a single model?

SLIDE 4

Ensemble methods

  • Bagging
  • Train classifiers on subsets of data
  • Predict based on majority vote
  • Stacking
  • Take multiple classifiers’ outputs as inputs and train another classifier to make the final prediction
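As a quick illustration of the bagging side, here is a minimal sketch (hypothetical helper names, not from the slides) of bootstrap sampling and majority voting:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as `data`, with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(labels):
    """Final prediction: the label most of the base classifiers chose."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)  # train one classifier per such sample
print(majority_vote([+1, -1, +1]))               # two of three classifiers vote +1
```

Each base classifier is trained on its own bootstrap sample; at prediction time their votes are combined.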

SLIDE 5

Boosting intuition

  • Boosting is an ensemble method, but with a different twist
  • Idea:
  • Build a sequence of dumb models
  • Modify training data along the way to focus on difficult-to-classify examples
  • Predict based on weighted majority vote of all the models
  • Challenges
  • What do we mean by dumb?
  • How do we promote difficult examples?
  • Which models get more say in vote?

SLIDE 6

Boosting intuition

  • What do we mean by dumb?
  • Each model in our sequence will be a weak learner
  • The most common weak learner in boosting is a decision stump: a decision tree with a single split
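A decision stump over one feature can be trained by exhaustive search. A sketch (hypothetical helper names, assuming labels in {-1, +1} and weights that sum to 1):

```python
def stump_predict(x, threshold, polarity):
    """Predict `polarity` on the >= side of the threshold, -polarity otherwise."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Try every (threshold, polarity) pair; keep the one with the
    smallest weighted error. Returns (error, threshold, polarity)."""
    best = None
    for threshold in xs:
        for polarity in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

# Perfectly separable toy data: the stump finds the split at x = 3
print(train_stump([1, 2, 3, 4], [-1, -1, +1, +1], [0.25] * 4))
```

Candidate thresholds are just the training values themselves, which is enough for a single feature.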

SLIDE 7

Boosting intuition

  • How do we promote difficult examples?
  • After each iteration, we'll increase the importance of training examples that we got wrong on the previous iteration and decrease the importance of examples that we got right
  • Each example will carry around a weight w_i that plays into the decision stump and the error estimation
  • Weights are normalized so they act like a probability distribution
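The promotion step can be sketched as follows (assuming the standard AdaBoost-style multiplicative update, with labels and predictions in {-1, +1}):

```python
import math

def reweight(weights, ys, preds, alpha):
    """Multiply each weight by exp(-alpha * y * h(x)): up for mistakes
    (y * h = -1), down for correct predictions (y * h = +1), then renormalize."""
    new = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
    z = sum(new)                  # normalizer, so the weights sum to 1 again
    return [w / z for w in new]

# One mistake out of four (weighted error 0.25), alpha = 0.5 * ln(3):
w = reweight([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1], 0.5 * math.log(3))
# after renormalizing, the misclassified example carries half the total weight
```

A well-known consequence of this update is that the misclassified examples end up holding exactly half of the total weight on the next round.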

SLIDE 8

Boosting intuition

  • Which models get more say in vote?
  • The models that performed better on training data get more say in the vote
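In the standard AdaBoost formulation (which I'm assuming these slides follow), "more say" is made precise by each model's vote weight, computed from its weighted training error $\epsilon_k$:

```latex
\alpha_k = \tfrac{1}{2}\ln\frac{1-\epsilon_k}{\epsilon_k}
```

So $\alpha_k$ grows as $\epsilon_k \to 0$, is zero at $\epsilon_k = 1/2$ (no better than a coin flip), and would be negative for $\epsilon_k > 1/2$.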

SLIDE 9

The Plan

  • Learn Adaboost
  • Unpack it for intuition
  • Come back later and show the math

SLIDE 10

The Plan

  • Learn Adaboost
  • Unpack it for intuition
  • Come back later and show the math

SLIDE 11

Adaboost

SLIDE 12

Adaboost

Weights are initialized to the uniform distribution. Every training example counts equally on the first iteration.
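In code, that initialization is one line (a sketch, with n standing for the number of training examples):

```python
n = 5                    # number of training examples (toy value)
w = [1.0 / n] * n        # uniform weights: every example counts equally
```

The weights already sum to 1, so they form a valid probability distribution from the start.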

SLIDE 13

Adaboost

SLIDE 14

Adaboost


Mistakes on highly weighted examples hurt more. Mistakes on lowly weighted examples don't register too much.
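This asymmetry is exactly what the weighted error captures. A minimal sketch (`weighted_error` is a hypothetical helper; labels and predictions in {-1, +1}):

```python
def weighted_error(weights, ys, preds):
    """Sum of the weights of the misclassified examples.
    With weights normalized to sum to 1, this is a weighted error rate."""
    return sum(w for w, y, p in zip(weights, ys, preds) if y != p)

# One mistake on a heavily weighted example hurts a lot...
print(weighted_error([0.7, 0.1, 0.1, 0.1], [1, 1, -1, -1], [-1, 1, -1, -1]))  # 0.7
# ...while the same single mistake on a lightly weighted example barely registers.
print(weighted_error([0.7, 0.1, 0.1, 0.1], [1, 1, -1, -1], [1, -1, -1, -1]))  # 0.1
```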

SLIDE 15

Adaboost

SLIDE 16

Adaboost

  • If an example was misclassified, its weight goes up
  • If an example was classified correctly, its weight goes down
  • How big the jump is depends on the accuracy of the model
  • Do we need to compute Z_k?
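Putting the pieces together, here is a minimal end-to-end sketch (my own toy version on 1-D data with decision stumps, not the slides' exact pseudocode):

```python
import math

def stump(x, t, p):
    """Decision stump: predict p on the >= side of threshold t."""
    return p if x >= t else -p

def adaboost(xs, ys, rounds):
    """Minimal AdaBoost on 1-D data with decision stumps.
    Returns a list of (alpha, threshold, polarity) weak learners."""
    n = len(xs)
    w = [1.0 / n] * n                               # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        # weak learner: the stump with the smallest weighted error
        err, t, p = min(
            ((sum(wi for xi, yi, wi in zip(xs, ys, w) if stump(xi, t, p) != yi), t, p)
             for t in xs for p in (+1, -1)),
            key=lambda cand: cand[0])
        err = max(err, 1e-10)                       # guard against log(1/0)
        alpha = 0.5 * math.log((1 - err) / err)     # better model -> bigger say
        ensemble.append((alpha, t, p))
        # up-weight mistakes, down-weight correct answers, renormalize
        w = [wi * math.exp(-alpha * yi * stump(xi, t, p))
             for xi, yi, wi in zip(xs, ys, w)]
        z = sum(w)                                  # this is Z_k
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all the stumps."""
    return 1 if sum(a * stump(x, t, p) for a, t, p in ensemble) >= 0 else -1

ensemble = adaboost([1, 2, 3, 4], [-1, -1, +1, +1], rounds=2)
preds = [predict(ensemble, x) for x in [1, 2, 3, 4]]   # fits the toy data exactly
```

Note that Z_k shows up only to keep the weights a probability distribution; the final weighted vote never needs it.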

SLIDE 17

Adaboost

SLIDE 18

Adaboost Example

Suppose we have the following training data

SLIDE 19

Adaboost Example

First decision stump

SLIDE 20

Adaboost Example

First decision stump

SLIDE 21

Adaboost Example

Second decision stump

SLIDE 22

Adaboost Example

Second decision stump

SLIDE 23

Adaboost Example

Third decision stump

SLIDE 24

Adaboost Example

Third decision stump

SLIDE 25

Adaboost Example

SLIDE 26

Generalization performance

Recall the standard experiment of measuring test and training error vs. model complexity


Once overfitting begins, test error goes up

SLIDE 27

Generalization performance

Boosting has a remarkably uncommon effect


Overfitting happens much more slowly with boosting

SLIDE 28

The Math

  • So far this looks like a reasonable thing that just worked out
  • But is there math behind it?

SLIDE 29

The Math

  • Yep! It is the minimization of a loss function, as always
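Concretely (in the standard derivation, which I assume the following slides develop), AdaBoost performs stagewise minimization of the exponential loss of the weighted ensemble:

```latex
L(f) \;=\; \sum_{i=1}^{n} \exp\bigl(-y_i f(x_i)\bigr),
\qquad
f(x) \;=\; \sum_{k=1}^{K} \alpha_k h_k(x)
```

Minimizing over each new weak learner $h_k$ and its coefficient $\alpha_k$, holding the previous terms fixed, recovers $\alpha_k = \tfrac{1}{2}\ln\frac{1-\epsilon_k}{\epsilon_k}$ and the exponential weight-update rule.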

SLIDE 30

The Math

SLIDE 31

The Math

SLIDE 32

The Math

SLIDE 33

The Math

SLIDE 34

The Math

SLIDE 35

The Math

SLIDE 36

The Math

SLIDE 37

The Math

SLIDE 38

Practical Advantages of Boosting

  • It's fast!
  • Simple and easy to program
  • No parameters to tune (except the number of rounds K)
  • Flexible. Can choose any weak learner
  • Shift in mindset. Now we can look for weak classifiers instead of strong classifiers

  • Can be used in lots of settings

SLIDE 39

Caveats

  • Performance depends on data and weak learner
  • Adaboost can fail if
  • Weak classifier not weak enough (overfitting)
  • Weak classifier too weak (underfitting)
