

SLIDE 1

MACHINE LEARNING - 2013

Bagging, Boosting and RANSAC

MACHINE LEARNING

SLIDE 2

Bootstrap Aggregation (Bagging)

  • The Main Idea
  • Some Examples
  • Why it works


SLIDE 3

The Main Idea: Aggregation

  • Imagine we have m sets of n independent observations

S^(1) = {(X_1,Y_1), ..., (X_n,Y_n)}^(1) , ... , S^(m) = {(X_1,Y_1), ..., (X_n,Y_n)}^(m)

all taken iid from the same underlying distribution P

  • Traditional approach: generate some ϕ(x,S) from all the data samples
  • Aggregation: learn ϕ(x,S) by averaging ϕ(x,S^(k)) over many k

[Figure: regression fits before and after aggregation, a single ϕ(x,S) vs. the averaged ϕ(x,S^(k))]

SLIDE 4

The Main Idea

  • Unfortunately, we usually have one single observation set S
  • Idea: bootstrap S to form the S^(k) observation sets
  • (canonical) Choose some samples, duplicate them until you fill a new S^(i) of the same size as S
  • (practical) Take a sub-set of samples of S (use a smaller set)
  • The samples not used by each set are validation samples

Bootstrapping

[Figure: bootstrap sets S^(1), S^(2), S^(3) drawn from S]

SLIDE 5

The Main Idea

  • Generate S^(1), ..., S^(m) from bootstrapping
  • Compute the ϕ(x,S^(k)) individually
  • Compute ϕ(x,S) = E_k[ϕ(x,S^(k))] by aggregation

Bagging
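Below is a minimal sketch of this bagging loop in Python/numpy (not code from the course): the fit_poly regressor and the sine-plus-noise data are illustrative assumptions, chosen only because an unstable model makes the effect of averaging visible.

    import numpy as np

    def bagged_predictor(X, y, fit, m=50, rng=np.random.default_rng(0)):
        """Train m models on bootstrap resamples of (X, y) and return a predictor
        that averages their outputs: phi(x,S) = mean over k of phi(x, S^(k))."""
        n = len(X)
        models = []
        for _ in range(m):
            idx = rng.integers(0, n, size=n)    # canonical bootstrap: n draws with replacement
            models.append(fit(X[idx], y[idx]))  # phi(x, S^(k))
        return lambda X_new: np.mean([predict(X_new) for predict in models], axis=0)

    def fit_poly(X, y, degree=3):
        """A deliberately unstable weak regressor: an unregularised cubic fit."""
        coeffs = np.polyfit(X, y, degree)
        return lambda X_new: np.polyval(coeffs, X_new)

    rng = np.random.default_rng(1)
    X = np.sort(rng.uniform(-3, 3, 40))
    y = np.sin(X) + 0.3 * rng.normal(size=40)
    predict = bagged_predictor(X, y, fit_poly, m=60)
    print(predict(np.array([0.0, 1.5])))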

SLIDE 6

  • We select some input samples
  • We learn a regression model
  • Very sensitive to the input selection
  • m training sets = m different models

A concrete example

(x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))   →   f̂^(1)(x) = Y^(1)
(x_1^(2), y_1^(2)), ..., (x_n^(2), y_n^(2))   →   f̂^(2)(x) = Y^(2)
...
f̂^(1), ..., f̂^(m)   give predictions   Y^(1), ..., Y^(m)

SLIDE 7

Aggregation: combine several models

[Figure: aggregated fits with m = 4, m = 10 and m = 60 models]

  • Linear combination of simple models
  • More examples = Better model
  • We can stop when we’re satisfied

Z = (1/m) Σ_{i=1}^m Y^(i)

SLIDE 8

Proof of convergence

  • Assumptions
  • Y^(1), ..., Y^(m) are iid
  • E(Y) = y (Y is an unbiased estimator of y)

Z = (1/m) Σ_{i=1}^m Y^(i)        E(Z) = (1/m) Σ_{i=1}^m E(Y^(i)) = (1/m) Σ_{i=1}^m y = y

  • Expected Error:    E((Y − y)²) = E((Y − E(Y))²) = σ²(Y)
  • With Aggregation:  E((Z − y)²) = E((Z − E(Z))²) = σ²(Z) = σ²( (1/m) Σ_{i=1}^m Y^(i) )
                       = (1/m²) Σ_{i=1}^m σ²(Y^(i)) = (1/m) ( (1/m) Σ_{i=1}^m σ²(Y^(i)) ) = (1/m) σ²(Y)

Hypothesis: the average will converge to something meaningful.

Infinite observations ⇒ zero error: we recover our underlying estimator!
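As a quick numerical sanity check of the 1/m factor (not part of the slides), the snippet below draws iid unbiased Gaussian estimators; the Gaussian choice and the constants are assumptions for illustration only.

    import numpy as np

    # Check that averaging m iid unbiased estimators divides the variance by m.
    rng = np.random.default_rng(0)
    y_true, sigma, m, trials = 2.0, 1.0, 25, 100_000

    Y = rng.normal(y_true, sigma, size=(trials, m))  # Y^(1), ..., Y^(m) for each trial
    Z = Y.mean(axis=1)                               # Z = (1/m) * sum_i Y^(i)

    print("var(Y) ≈", Y[:, 0].var())                 # ≈ sigma² = 1.0
    print("var(Z) ≈", Z.var())                       # ≈ sigma²/m = 0.04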

SLIDE 9

In layman's terms

[Figure: distributions of Y and Z around the true value y]

The expected error (variance) of Y is larger than that of Z.

The variance of Z shrinks with m.

SLIDE 10

Relaxing the assumptions

  • We DROP the second assumption (unbiasedness):
  • (kept) Y^(1), ..., Y^(m) are iid
  • (dropped) E(Y) = y (Y is an unbiased estimator of y)

10

E((Y −y)2) = E((Y −E(Y )+E(Y )−y)2) = E(((Y −E(Y )+(E(Y )−y))2) = E((Y − E(Y ))2) + E((E(Y ) − y)2) + E(2(Y − E(Y ))(E(Y ) − y))

E

  • Y − E(Y )
  • E
  • 2(E(Y ) − y)
  • E
  • (Y − y)2

≥ E

  • (E(Y ) − y)2

E

  • (Y − y)2

≥ E

  • (Z − y)2

using Z gives us a smaller error

(even if we can’t prove convergence to zero)

Z

we add these we regroup them

= 0 σ 2(Y ) ≥ 0

The larger it is, the better for us

SLIDE 11

Peculiarities

  • Instability is good
  • The more variable (unstable) the form of ϕ(x,S) is, the more improvement can potentially be obtained
  • Low-variability methods (e.g. PCA, LDA) improve less than high-variability ones (e.g. LWR, Decision Trees)

  • Loads of redundancy
  • Most predictors do roughly “the same thing”


SLIDE 12

From Bagging to Boosting

  • Bagging: each model is trained independently
  • Boosting: each model is built on top of the previous ones


SLIDE 13

Adaptive Boosting (AdaBoost)

  • The Main Idea
  • The Thousand Flavours of Boost
  • Weak Learners and Cascades


SLIDE 14

The Main Idea

  • Combine several simple models (weak learners)
  • Avoid redundancy
  • Each learner complements the previous ones
  • Keep track of the errors of the previous learners

Iterative Approach

SLIDE 15

Weak Learners

  • A “simple” classifier that can be generated easily
  • As long as it is better than random, we can use it
  • Better when tailored to the problem at hand
  • E.g. very fast at retrieval (for images)


SLIDE 16

AdaBoost

  • We choose a weak learner model ϕ(x), e.g. f(x,v) = x·v > θ
  • Initialization
  • Generate ϕ_1(x), ..., ϕ_N(x) weak learners
  • N can be in the hundreds of thousands
  • Assign a weight w_i to each training sample

[Figure: weak learners as random projection directions v_1, ..., v_7 with threshold θ]

SLIDE 17

AdaBoost: Iterations

  • Compute the error e_j for each classifier ϕ_j(x)
  • Select the ϕ_j with the smallest classification error
  • Update the weights w_i depending on how they are classified by ϕ_j

[Figure: weighted samples along projection directions v_1, ..., v_7]

e_j = Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]        select  argmin_j ( Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i] )

Here comes the important part

SLIDE 18

Updating the weights

  • Evaluate how “well” ϕj(x) is performing
  • Update the weights for each sample

How far are we from a perfect classification?

α^(t) = ½ ln( (1 − e_j) / e_j )

w_i^(t+1) = w_i^(t) · exp(α^(t))    if ϕ_j(x_i) ≠ y_i   (make it bigger)
w_i^(t+1) = w_i^(t) · exp(−α^(t))   if ϕ_j(x_i) = y_i   (make it smaller)

SLIDE 19

AdaBoost: Rinse and Repeat

  • Recompute the error e_j for each classifier ϕ_j(x) using the updated weights
  • Select the new ϕ_j with the smallest classification error
  • Update the weights w_i

e_j = Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]        select  argmin_j ( Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i] )
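Putting slides 16 to 19 together, here is a compact sketch of the discrete AdaBoost loop (an illustrative reimplementation, not the course code), assuming numpy, labels y in {−1, +1}, and the random-projection stump f(x,v) = x·v > θ used in the checkerboard demo later.

    import numpy as np

    def adaboost(X, y, n_rounds=40, n_learners=1000, rng=np.random.default_rng(0)):
        """Discrete AdaBoost with a pool of random-projection stumps phi_j(x) = sign(x·v_j − θ_j)."""
        n, d = X.shape
        # Initialization: generate the weak-learner pool and uniform sample weights w_i
        V = rng.normal(size=(n_learners, d))              # random directions v_j
        proj = X @ V.T                                    # x_i · v_j for every sample and learner
        theta = rng.uniform(proj.min(0), proj.max(0))     # one threshold per learner
        preds = np.where(proj > theta, 1, -1)             # phi_j(x_i) for all i, j
        w = np.full(n, 1.0 / n)

        chosen = []                                       # [(v_j, theta_j, alpha_t), ...]
        for _ in range(n_rounds):
            miss = preds != y[:, None]                    # 1[phi_j(x_i) != y_i]
            errs = w @ miss                               # e_j = sum_i w_i · 1[phi_j(x_i) != y_i]
            j = int(np.argmin(errs))                      # pick the weak learner with smallest error
            alpha = 0.5 * np.log((1 - errs[j]) / (errs[j] + 1e-12))
            w *= np.exp(np.where(miss[:, j], alpha, -alpha))   # bigger if wrong, smaller if right
            w /= w.sum()                                  # keep the weights a distribution
            chosen.append((V[j], theta[j], alpha))

        def H(Xq):                                        # strong classifier: sign of the weighted vote
            return np.sign(sum(a * np.where(Xq @ v > t, 1, -1) for v, t, a in chosen))
        return H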
SLIDE 20

Boosting In Action

The Checkerboard Problem

SLIDE 21

Boosting In Action: Initialization

  • We choose a simple weak learner
  • We generate a thousand random vectors v_1, ..., v_1000 and corresponding learners f_j(x,v_j)
  • For each f_j(x,v_j) we compute a good threshold θ_j

f(x,v) = x·v > θ
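The slides only say "a good threshold"; one plausible way to pick θ_j (an assumption, not stated in the course) is to scan the midpoints of the sorted projections and keep the weighted-error minimizer.

    import numpy as np

    def best_threshold(proj, y, w):
        """proj: projections x_i·v_j for one direction, y in {−1,+1}, w sample weights (summing to 1).
        Returns the θ that minimizes the weighted error, allowing either polarity."""
        order = np.argsort(proj)
        p, yy, ww = proj[order], y[order], w[order]
        best_theta, best_err = p[0], np.inf
        for theta in (p[:-1] + p[1:]) / 2:                # candidate thresholds: midpoints
            pred = np.where(p > theta, 1, -1)
            err = ww[pred != yy].sum()
            err = min(err, 1 - err)                       # flipping the sign handles the other polarity
            if err < best_err:
                best_theta, best_err = theta, err
        return best_theta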

SLIDE 22

Boosting In Action

[Plot: classification accuracy (50%–100%) over the first 10 boosting rounds]

  • We look for the best weak learner
  • We adjust the importance (weight) of the errors
  • Rinse and repeat

and we keep going...

SLIDE 23

Boosting In Action

[Figure: decision boundaries with 20, 40, 80 and 120 weak learners; accuracy (50%–100%) vs. number of learners (1–120)]

SLIDE 24

Drawbacks of Boosting

  • Overfitting!
  • Boost will always overfit with many weak learners
  • Training Time
  • Training of a face detector takes up to 2 weeks on modern computers


SLIDE 25

A thousand different flavors

1989 Boosting · 1996 AdaBoost · 1999 Real AdaBoost · 2000 Margin Boost, Modest AdaBoost, Gentle AdaBoost, AnyBoost, LogitBoost · 2001 BrownBoost · 2003 KLBoost, Weight Boost · 2004 FloatBoost, ActiveBoost · 2005 JensenShannonBoost, Infomax Boost · 2006 Emphasis Boost · 2007 Entropy Boost, Reweight Boost · ...

  • A couple of new boost variants every year
  • Reduce overfitting
  • Increase robustness to noise
  • Tailored to specific problems
  • Mainly change two things
  • How the error is represented
  • How the weights are updated
SLIDE 26

An example

  • Instead of counting the errors, we compute the probability of correct classification

Discrete AdaBoost:
    e_j = Σ_{i=1}^n w_i · 1[ϕ_j(x_i) ≠ y_i]
    α = ½ ln( (1 − e_j) / e_j )
    w_i^(t+1) = w_i^(t) exp(α^(t))    if ϕ_j(x_i) ≠ y_i
    w_i^(t+1) = w_i^(t) exp(−α^(t))   if ϕ_j(x_i) = y_i

Real AdaBoost:
    p_j = Π_{i=1}^n w_i P(y_i = 1 | x_i)
    α = ½ ln( (1 − p_j) / p_j )
    w_i^(t+1) = w_i^(t) exp(−y_i α^(t))

SLIDE 27

A celebrated example

Viola-Jones: Haar-like wavelets

I(x): pixel of image I at position x

f(x) = Σ_{x∈A} I(x) − Σ_{x∈B} I(x)        ϕ(x) = 1 if f(x) > 0, −1 otherwise

  • 2 rectangles of pixels: 1 positive (A), 1 negative (B)
  • Millions of possible classifiers ϕ_1(x), ϕ_2(x), ...

[Figure: Haar-like rectangle features A and B over the image pixels]
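A small sketch of one such two-rectangle feature (an illustrative reimplementation, not the original Viola-Jones code), assuming numpy; it also shows the integral-image trick that makes each rectangle sum cost four lookups.

    import numpy as np

    def integral_image(img):
        """Zero-padded cumulative sums: ii[r, c] = sum of img[0:r, 0:c]."""
        return np.pad(img.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

    def rect_sum(ii, top, left, h, w):
        """Sum of img[top:top+h, left:left+w] using four integral-image lookups."""
        return ii[top + h, left + w] - ii[top, left + w] - ii[top + h, left] + ii[top, left]

    def haar_two_rect(ii, top, left, h, w):
        """f = Σ_{x∈A} I(x) − Σ_{x∈B} I(x) with A on top of B;  ϕ = +1 if f > 0 else −1."""
        f = rect_sum(ii, top, left, h, w) - rect_sum(ii, top + h, left, h, w)
        return 1 if f > 0 else -1

    patch = np.random.default_rng(0).random((24, 24))   # a 24×24 patch, as in Viola-Jones
    print(haar_two_rect(integral_image(patch), top=4, left=6, h=5, w=8))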

SLIDE 28

Real-Time on HD video

SLIDE 29

Some simpler examples

[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Circles and Random Projections]

f(x,c) = (x − c)ᵀ(x − c) > θ

Feature: the distance from a point c
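Read literally, this weak learner labels a point by whether its squared distance to a random centre c exceeds θ; a tiny sketch follows (the centre, θ and the class signs are illustrative assumptions).

    import numpy as np

    def circle_learner(c, theta):
        """f(x,c) = (x − c)ᵀ(x − c) > θ : +1 outside the circle of radius √θ around c, −1 inside."""
        def phi(X):
            d2 = ((X - c) ** 2).sum(axis=-1)   # squared distance to the centre
            return np.where(d2 > theta, 1, -1)
        return phi

    phi = circle_learner(c=np.array([0.3, -0.2]), theta=0.5)
    print(phi(np.array([[0.0, 0.0], [2.0, 2.0]])))   # → [-1  1]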

SLIDE 30

Some simpler examples

[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Circles, Random Projections and Random Rectangles]

f(x,R) = 1[x ∈ R]

Feature: being inside a rectangle R

SLIDE 31

Some simpler examples

f(x,μ,Σ) = P(x | μ,Σ)

Feature: full-covariance Gaussian

[Plot: accuracy (50%–100%) vs. number of weak learners (1–120) for Random Gaussians, Random Circles, Random Projections and Random Rectangles]

SLIDE 32

Weak Learners don’t need to be weak!

20 boosted SVMs with 5 SVs and the RBF kernel

SLIDE 33

Cascades of Weak Classifiers

  • Split the classification task into stages
  • Each stage has an increasing number of weak classifiers
  • Each stage only needs to ‘learn’ to classify what the previous ones let through


SLIDE 34

Cascades of Weak Classifiers

[Figure: a sample flows through Stage 1 → Stage 2 → Stage 3; a “no” at any stage rejects it]

Stage 1: 10 weak learners, 90% classification
Stage 2: 100 weak learners, 95% classification
Stage 3: 1000 weak learners, 99.9% classification

With 1000 samples:
  Stage 1: 1000 samples × 10 learners  = 10’000 tests  → 100 samples pass
  Stage 2: 100 samples × 100 learners  = 10’000 tests  → 50 samples pass
  Stage 3: 50 samples × 1000 learners  = 50’000 tests

Total: 70’000 tests, i.e. 70 tests per sample, instead of 1110 (10 + 100 + 1000) per sample without the cascade.
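The control flow of such a cascade is easy to sketch (illustrative only; the stage classifiers, weights and thresholds are assumed to come from earlier boosting runs).

    def cascade_predict(stages, x):
        """stages: list of (classifiers, alphas, threshold), ordered cheapest → most expensive.
        A sample is rejected (-1) by the first stage that says 'no'; only samples that
        pass every stage are accepted (+1), so most samples cost only a few tests."""
        for classifiers, alphas, threshold in stages:
            score = sum(a * phi(x) for phi, a in zip(classifiers, alphas))
            if score < threshold:        # this stage says "no": stop early
                return -1
        return +1                        # survived every stage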

SLIDE 35

Cascades of Weak Classifiers

  • Advantages
  • Stage splits can be chosen manually
  • Trade off performance and accuracy of the first stages

  • Disadvantages
  • Later stages become difficult to train (few samples)
  • Very large amount of samples for training
  • Even slower to train than standard boosting


SLIDE 36

The curious case of RANSAC

  • Created to withstand hordes of outliers
  • Hasn’t really been proven to converge to the optimal solution in any reasonable time

  • Extremely effective in practice
  • Easy to implement
  • Widely used in Computer Vision

RANdom SAmple Consensus

SLIDE 37

RANSAC in practice

  • Select a random subset of samples
  • Estimate the model on these samples
  • Sift through all other samples
  • If they are close to the model, add them to the consensus

  • If the consensus is big enough, keep it
  • Repeat from top

Keep the best consensus (e.g. most samples, least error, etc.); a sketch of this loop follows below.
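The sketch below implements these steps for a 2-D line model (the line model, tolerance and iteration count are illustrative assumptions; the slides describe only the generic procedure).

    import numpy as np

    def ransac_line(points, n_iter=200, inlier_tol=0.05, min_consensus=20,
                    rng=np.random.default_rng(0)):
        """Fit y = a·x + b robustly: repeat {sample, fit, collect consensus}, keep the best."""
        best_model, best_consensus = None, np.array([], dtype=int)
        x, y = points[:, 0], points[:, 1]
        for _ in range(n_iter):
            i, j = rng.choice(len(points), size=2, replace=False)   # random minimal subset
            if x[i] == x[j]:
                continue
            a = (y[j] - y[i]) / (x[j] - x[i])                       # estimate the model on it
            b = y[i] - a * x[i]
            residuals = np.abs(y - (a * x + b))                     # sift through all samples
            consensus = np.flatnonzero(residuals < inlier_tol)      # close ones join the consensus
            if len(consensus) >= min_consensus and len(consensus) > len(best_consensus):
                best_consensus = consensus                          # keep the best consensus so far
                best_model = np.polyfit(x[consensus], y[consensus], 1)   # refit on its inliers
        return best_model, best_consensus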

SLIDE 38

Some examples of RANSAC


  • B. Noris and A. Billard. “Aggregation of Asynchronous Eye-Tracking streams from Sets of Uncalibrated Panoramic Images”. (2011)

[Figure: input video and reference panoramic image]

SLIDE 39

Some examples of RANSAC


Scaramuzza et al. “Real-Time Monocular Visual Odometry for On-Road Vehicles with 1-Point RANSAC” ICRA 2009.

Pipeline: input image (360° camera) → compute visual features (KLT) → exclude outliers (bad features, RANSAC) → compute visual odometry

[Video: Scaramuzza et al. (ETHZ)]

SLIDE 40

RANSAC vs Bagging

RANSAC: Select a random subset of samples · Keep the best · Needs to be lucky · Very light to compute
Bagging: Train many individual models · Keep them all · Proven to converge · Heavy to compute

SLIDE 41

Summing up

  • Bagging: linear combination of multiple learners
  • + Very robust to noise
  • − A lot of redundant effort
  • Boosting: weighted combination of arbitrary learners
  • + Very strong learner from very simple ones
  • − Sensitive to noise (at least Discrete AdaBoost)
  • RANSAC: iterative evaluation of random learners
  • + Very robust against outliers, simple to implement
  • − Not ensured to converge (although in practice it does)
