SLIDE 1

An overview of Boosting

Yoav Freund UCSD

SLIDE 2

Plan of talk

  • Generative vs. non-generative modeling
  • Boosting
  • Alternating decision trees
  • Boosting and over-fitting
  • Applications


SLIDE 3

Toy Example

  • Computer receives telephone call
  • Measures Pitch of voice
  • Decides gender of caller

(Figure: a human voice is classified as Male or Female.)

SLIDE 4

Generative modeling

(Figure: probability density over voice pitch, modeled by two Gaussians with parameters mean1, var1 and mean2, var2.)
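A minimal sketch of the generative approach on this toy problem (the pitch values and helper names are invented for illustration): fit one Gaussian per class and classify a new caller by whichever class density is higher.

    import numpy as np

    # Toy pitch data in Hz; the values are made up for illustration.
    male_pitch   = np.array([110.0, 120.0, 125.0, 130.0, 140.0])
    female_pitch = np.array([190.0, 200.0, 210.0, 215.0, 225.0])

    def fit_gaussian(samples):
        """Estimate (mean, variance) for one class."""
        return samples.mean(), samples.var()

    mean1, var1 = fit_gaussian(male_pitch)     # male model
    mean2, var2 = fit_gaussian(female_pitch)   # female model

    def log_density(x, mean, var):
        return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

    def classify(pitch):
        # Pick the class whose fitted Gaussian assigns the higher likelihood.
        better_male = log_density(pitch, mean1, var1) > log_density(pitch, mean2, var2)
        return "male" if better_male else "female"

    print(classify(150.0))   # closer to the male model, so prints "male"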

SLIDE 5

Discriminative approach

(Figure: a single threshold on voice pitch, chosen to minimize the number of mistakes on the training data.)

[Vapnik 85]
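By contrast, a minimal sketch of the discriminative approach on the same kind of toy data (values again invented): skip the density estimates and search directly for the threshold with the fewest training mistakes.

    import numpy as np

    # Toy pitch data; label male = -1, female = +1 (invented values).
    pitch  = np.array([110., 120., 125., 130., 140., 190., 200., 210., 215., 225.])
    labels = np.array([-1, -1, -1, -1, -1, +1, +1, +1, +1, +1])

    def best_threshold(x, y):
        """Try a threshold between every pair of sorted values and keep the one
        with the fewest misclassifications (predict +1 when x > threshold)."""
        xs = np.sort(x)
        candidates = (xs[:-1] + xs[1:]) / 2
        errors = [np.sum(np.where(x > t, 1, -1) != y) for t in candidates]
        best = int(np.argmin(errors))
        return candidates[best], errors[best]

    theta, mistakes = best_threshold(pitch, labels)
    print(theta, mistakes)   # a threshold around 165 Hz with 0 mistakes on this data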

SLIDE 6

Ill-behaved data

(Figure: an ill-behaved voice-pitch distribution; the two fitted Gaussian means (mean1, mean2) and the number of mistakes of the resulting rule are shown.)
SLIDE 7

Traditional Statistics vs. Machine Learning

(Diagram: Data → Estimated world state → Predictions/Actions, with Statistics, Decision Theory, and Machine Learning labeling different parts of this pipeline.)

SLIDE 8

Comparison of methodologies

Model               | Generative            | Discriminative
Goal                | Probability estimates | Classification rule
Performance measure | Likelihood            | Misclassification rate
Mismatch problems   | Outliers              | Misclassifications

SLIDE 9

Boosting

SLIDE 10

A weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)

where the xi are feature vectors, the labels yi ∈ {−1,+1} are binary, and the weights wi are positive.

SLIDE 11

A weak learner

The weak learner receives a weighted training set

(x1,y1,w1), (x2,y2,w2), …, (xm,ym,wm)

and, given the instances x1, x2, …, xm, returns a weak rule h with predictions ŷ1, ŷ2, …, ŷm, where ŷi ∈ {−1,+1}.

The weak requirement (the rule must be better than random guessing by some edge γ):

( Σ_{i=1}^m yi ŷi wi ) / ( Σ_{i=1}^m wi )  >  γ  >  0
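A sketch of one possible weak learner, a decision stump over single features (a common illustrative choice, not necessarily the weak learner used in the talk): it takes the weighted training set and returns the stump with the smallest weighted error, from which the weak requirement can be checked as a weighted correlation.

    import numpy as np

    def stump_weak_learner(X, y, w):
        """Return the decision stump (feature, threshold, sign) with the lowest
        weighted error on examples X (m x d), labels y in {-1,+1}, weights w > 0."""
        m, d = X.shape
        best, best_err = None, np.inf
        for j in range(d):
            for theta in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = np.where(X[:, j] > theta, s, -s)
                    err = np.sum(w * (pred != y)) / np.sum(w)
                    if err < best_err:
                        best, best_err = (j, theta, s), err
        return best, best_err

    def stump_predict(stump, X):
        j, theta, s = stump
        return np.where(X[:, j] > theta, s, -s)

    # With normalized weights, sum_i wi * yi * yhat_i = 1 - 2 * weighted_error,
    # so the weak requirement holds exactly when weighted_error < 1/2 - gamma/2.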

SLIDE 12

The boosting process

(x1,y1,1), (x2,y2,1), …, (xn,yn,1)                →  weak learner  →  h1
(x1,y1,w1^1), (x2,y2,w2^1), …, (xn,yn,wn^1)       →  weak learner  →  h2
(x1,y1,w1^2), (x2,y2,w2^2), …, (xn,yn,wn^2)       →  weak learner  →  h3
…
(x1,y1,w1^(T−1)), (x2,y2,w2^(T−1)), …, (xn,yn,wn^(T−1))  →  weak learner  →  hT

Prediction(x) = sign(FT(x))

SLIDE 13

A Formal Description of Boosting

  • given training set

(x1, y1), . . . , (xm, ym)

  • yi ∈ {−1, +1} correct label of instance xi ∈ X
  • for t = 1, . . . , T:
  • construct distribution Dt on {1, . . . , m}
  • find weak classifier (“rule of thumb”)

ht : X → {−1,+1} with small error εt on Dt:   εt = Pr_{i∼Dt}[ ht(xi) ≠ yi ]

  • output final classifier Hfinal

[Slides from Rob Schapire]

SLIDE 14

AdaBoost

[with Freund]

  • constructing Dt:
  • D1(i) = 1/m
  • given Dt and ht:

Dt+1(i) = ( Dt(i) / Zt ) × { e^(−αt) if yi = ht(xi),  e^(αt) if yi ≠ ht(xi) }
        = ( Dt(i) / Zt ) · exp( −αt yi ht(xi) )

where Zt = normalization factor and

αt = ½ ln( (1 − εt) / εt ) > 0

  • final classifier:

Hfinal(x) = sign( Σt αt ht(x) )

  • [Slides from Rob Schapire]
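A compact sketch of the AdaBoost loop just described; the toy data and the stump weak learner below are illustrative, and the small clamp on the error is a common safeguard rather than part of the slide.

    import numpy as np

    def adaboost(X, y, weak_learn, weak_predict, T):
        """Run T rounds of AdaBoost and return the combined classifier H(x)."""
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D1(i) = 1/m
        rules, alphas = [], []
        for t in range(T):
            h = weak_learn(X, y, D)                  # weak rule for distribution Dt
            pred = weak_predict(h, X)
            eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
            D = D * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct
            D = D / D.sum()                          # divide by the normalization factor Z_t
            rules.append(h)
            alphas.append(alpha)

        def H_final(Xnew):
            F = sum(a * weak_predict(h, Xnew) for h, a in zip(rules, alphas))
            return np.sign(F)
        return H_final

    # Illustrative weak learner: a threshold stump on the first feature.
    def stump_learn(X, y, D):
        best, best_err = None, np.inf
        for theta in np.unique(X[:, 0]):
            for s in (+1, -1):
                err = np.sum(D * (np.where(X[:, 0] > theta, s, -s) != y))
                if err < best_err:
                    best, best_err = (theta, s), err
        return best

    def stump_pred(h, X):
        theta, s = h
        return np.where(X[:, 0] > theta, s, -s)

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([-1, -1, +1, -1, +1, +1])
    H = adaboost(X, y, stump_learn, stump_pred, T=5)
    print(H(X))   # combined predictions on the training points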
SLIDE 15

Toy Example

(Figure: the toy data set under the initial distribution D1.)

weak classifiers = vertical or horizontal half-planes

[Slides from Rob Schapire]

SLIDE 16

Round 1

h1:  ε1 = 0.30,  α1 = 0.42   (updated distribution D2 shown)

[Slides from Rob Schapire]

SLIDE 17

Round 2

h2:  ε2 = 0.21,  α2 = 0.65   (updated distribution D3 shown)

[Slides from Rob Schapire]

SLIDE 18

Round 3

h3:  ε3 = 0.14,  α3 = 0.92

[Slides from Rob Schapire]

SLIDE 19

Final Classifier

Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )

[Slides from Rob Schapire]

SLIDE 20

Analyzing the Training Error

[with Freund]

  • Theorem: write εt as 1/2 − γt    [ γt = “edge” ]
  • then

training error(Hfinal)  ≤  ∏t 2 √( εt (1 − εt) )  =  ∏t √( 1 − 4 γt² )  ≤  exp( −2 Σt γt² )

  • so: if ∀t : γt ≥ γ > 0, then training error(Hfinal) ≤ e^(−2 γ² T)
  • AdaBoost is adaptive:
    – does not need to know γ or T a priori
    – can exploit γt ≫ γ

[Slides from Rob Schapire]
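A quick numerical illustration of these bounds (the per-round errors are chosen arbitrarily): compute the product ∏t 2√(εt(1−εt)) and compare it with exp(−2 Σt γt²).

    import numpy as np

    # Arbitrary per-round errors, each a little better than random guessing.
    eps = np.array([0.30, 0.35, 0.40, 0.38, 0.32])
    gamma = 0.5 - eps                               # per-round edges

    product_bound = np.prod(2 * np.sqrt(eps * (1 - eps)))
    exp_bound = np.exp(-2 * np.sum(gamma ** 2))

    print(product_bound, exp_bound)                 # product_bound <= exp_bound
    assert product_bound <= exp_bound + 1e-12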

SLIDE 21

Boosting block diagram

(Block diagram: the Booster sends example weights to the Weak Learner, which returns a weak rule; iterating this loop yields a Strong Learner that outputs an accurate rule.)

SLIDE 22

Boosting with specialists

  • Specialists predict only when they are confident.
  • In addition to {−1,+1}, specialists use 0 to indicate no prediction.
  • As boosting allows both positive and negative weights, we restrict attention to specialists that output {0,1}.

SLIDE 23

AdaBoost as variational optimization

Weak rule: ht : X → {0,1}
Label: y ∈ {−1,+1}
Training set: {(x1,y1,1), (x2,y2,1), …, (xn,yn,1)}

αt = ½ ln( ( ε + Σ_{i: ht(xi)=1, yi=+1} wi^t ) / ( ε + Σ_{i: ht(xi)=1, yi=−1} wi^t ) )

Ft = Ft−1 + αt ht

SLIDE 24

AdaBoost as variational optimization

(Figure: one-dimensional training data with + and − labels, with the curves Ft−1 and ht plotted against x.)

  • x = input, a scalar
  • y = output, −1 or +1
  • (x1,y1), (x2,y2), …, (xn,yn) = training set
  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 25

AdaBoost as variational optimization

(Figure: the same data with the candidate combination Ft−1(x) + 0·ht(x), i.e. the new weak rule given weight zero.)

  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 26

AdaBoost as variational optimization

(Figure: the same data with the candidate combination Ft−1(x) + 0.4·ht(x), i.e. the new weak rule given weight 0.4.)

  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 27

AdaBoost as variational optimization

(Figure: the chosen combination Ft(x) = Ft−1(x) + 0.8·ht(x).)

  • Ft−1 = strong rule after t−1 boosting iterations
  • ht = weak rule produced at iteration t

SLIDE 28

Margins

Fix a set of weak rules and let h(x) = (h1(x), h2(x), …, hT(x)) denote the vector of their outputs.
Represent the i'th example xi by this vector of weak-rule outputs, h(xi).
Labels: y ∈ {−1,+1}.
Training set: (h(x1),y1), (h(x2),y2), …, (h(xn),yn).
Goal: find a weight vector α = (α1, …, αT) that minimizes the number of training mistakes.

margin of example i:   yi · ( h(xi) · α )

(Figure: the examples projected onto the direction α; the cumulative number of examples is plotted against the margin, with negative margins corresponding to mistakes and positive margins to correct classifications.)

SLIDE 29

Boosting as gradient descent

(Figure: loss as a function of the margin; the 0-1 loss is a step at zero, and the AdaBoost loss e^(−margin) lies above it.)

✦ Our goal is to minimize the number of mistakes (0-1 loss).
✦ Unfortunately, the step function has derivative zero at all points other than zero, where the derivative is undefined.
✦ AdaBoost uses an upper bound on the 0-1 loss.
✦ AdaBoost minimizes the exponential loss, which is such an upper bound.
✦ It finds the vector α through coordinate-wise gradient descent.
✦ Weak rules are added one at a time.


SLIDE 30

AdaBoost as gradient descent

(Figure: loss versus margin, comparing the 0-1 loss with the exponential loss.)

✦ Weak rules are added one at a time.
✦ Fixing the weights α1, …, αt−1, find the weight α for the new rule ht.
✦ The weak rule defines how each example would move, i.e. whether its margin increases or decreases.

new margin of (xi,yi):   yi Ft−1(xi) + α yi ht(xi)

new total loss:   Σ_{i=1}^n exp( −yi ( Ft−1(xi) + α ht(xi) ) )

derivative of the total loss with respect to α, at α = 0:

  − ∂/∂α Σ_{i=1}^n exp( −yi ( Ft−1(xi) + α ht(xi) ) )  =  Σ_{i=1}^n yi ht(xi) · exp( −yi Ft−1(xi) )

where exp( −yi Ft−1(xi) ) is the weight of example (xi,yi).

SLIDE 31

LogitBoost as gradient descent

Write m for the margin. Instead of the AdaBoost loss e^(−m), define the loss to be log(1 + e^(−m)).

When the margin ≫ 0:   log(1 + e^(−m)) ≈ e^(−m)
When the margin ≪ 0:   log(1 + e^(−m)) ≈ −m

(Figure: the AdaBoost loss and weight compared with the LogitBoost loss and weight as functions of the margin.)

new margin of (xi,yi):   yi Ft−1(xi) + α yi ht(xi)

new total loss:   Σ_{i=1}^n log( 1 + exp( −yi ( Ft−1(xi) + α ht(xi) ) ) )

derivative of the total loss with respect to α, at α = 0:

  − ∂/∂α Σ_{i=1}^n log( 1 + exp( −yi ( Ft−1(xi) + α ht(xi) ) ) )  =  Σ_{i=1}^n yi ht(xi) · exp( −yi Ft−1(xi) ) / ( 1 + exp( −yi Ft−1(xi) ) )

where exp( −yi Ft−1(xi) ) / ( 1 + exp( −yi Ft−1(xi) ) ) is the weight of example (xi,yi).

Also called GentleBoost and LogitBoost [Friedman, Hastie & Tibshirani].
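A short sketch contrasting the two weighting schemes over a range of margins (purely illustrative): the AdaBoost weight e^(−m) blows up for large negative margins, while the LogitBoost weight never exceeds 1.

    import numpy as np

    margins = np.linspace(-5, 5, 11)

    ada_weight   = np.exp(-margins)                            # e^(-m), unbounded as m -> -inf
    logit_weight = np.exp(-margins) / (1 + np.exp(-margins))   # always below 1

    for m, wa, wl in zip(margins, ada_weight, logit_weight):
        print(f"margin={m:5.1f}   adaboost weight={wa:9.3f}   logitboost weight={wl:6.3f}")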

SLIDE 32

Noise resistance

  • AdaBoost:
    – performs well when the achievable error rate is close to zero (the almost-consistent case)
    – errors, i.e. examples with negative margins, get very large weights and can cause over-fitting
  • LogitBoost:
    – inferior to AdaBoost when the achievable error rate is close to zero
    – often better than AdaBoost when the achievable error rate is high
    – the weight on any example is never larger than 1

SLIDE 33

Gradient-Boost / AnyBoost

  • A general recipe for learning by incremental optimization.
  • Applies to any (differentiable) loss function.
  • At each iteration:
    – calculate the example weights
    – call the weak learner with the weighted examples
    – add the generated weak rule to create the new rule:  Ft = Ft−1 + αt ht

A labeled example is (x,y), where x can be anything. The prediction is of the form

  Ft(x) = Σ_{i=1}^t αi hi(x)

Loss: L( Ft(x), y ).   Total loss so far:  Σ_{j=1}^n L( Ft−1(xj), yj )

Weight of example (x,y):   ∂/∂z L( Ft−1(x) + z, y )
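A minimal sketch of this recipe (the weak-learner interface is a placeholder, and the logistic loss is just one example of a differentiable loss): weight each example by the derivative of its loss, call the weak learner, and add the new rule with a fixed step size standing in for αt.

    import numpy as np

    def logistic_grad(F, y):
        """d/dz log(1 + exp(-y (F + z))) evaluated at z = 0."""
        return -y * np.exp(-y * F) / (1 + np.exp(-y * F))

    def gradient_boost(X, y, loss_grad, weak_learn, weak_predict, T, step=0.5):
        """Generic boosting loop: any differentiable per-example loss can be
        plugged in through loss_grad; weak_learn/weak_predict are placeholders."""
        F = np.zeros(len(y))
        rules = []
        for t in range(T):
            w = np.abs(loss_grad(F, y))           # example weights from the loss derivative
            h = weak_learn(X, y, w)               # weak learner sees the weighted examples
            F = F + step * weak_predict(h, X)     # F_t = F_{t-1} + alpha_t * h_t (fixed alpha here)
            rules.append((step, h))
        return rules

With the exponential loss the weights reduce to AdaBoost's e^(−yF); in practice the fixed step would be replaced by a line search or the closed-form αt.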
SLIDE 34

Styles of weak learners

  • Simple (stumps)
  • Complex (fully grown trees)
  • Everything in between (neural networks, nearest neighbors, Naive Bayes, …)

(Figure: example stump weak rules built on features such as Season (Summer/Winter) and Location (Beach, Ski Slope, …), each predicting +1 or −1 with weight ±α; the rules are grouped into specialists, which only score part of the input space, and a generalist.)

SLIDE 35

Alternating Decision Trees

Joint work with Llew Mason

SLIDE 36

Decision Trees

(Figure: a decision tree that tests X>3 at the root and Y>5 below it, with leaf predictions +1 and −1, together with the corresponding axis-parallel partition of the (X,Y) plane at X=3 and Y=5.)
SLIDE 37
Decision tree as a sum

(Figure: the same tree redrawn as a sum of prediction nodes, with scores −0.2, +0.1, −0.1, +0.2 and −0.3 attached to the root and to the branches of the X>3 and Y>5 tests; the classifier outputs the sign of the summed scores, which reproduces the +1/−1 regions of the original tree.)
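A minimal sketch of the idea (the assignment of scores to branches is my own illustrative choice, since the figure did not survive extraction): write the tree as a constant plus one score per branch taken, and classify by the sign of the sum.

    # Illustrative only: score-to-branch assignments are guessed so that the
    # signs produce a plausible +1/-1 labeling of the three regions.
    def tree_as_sum(x, y):
        score = -0.2                               # root prediction node
        if x > 3:
            score += 0.1
            score += 0.2 if y <= 5 else -0.3       # the Y>5 test only applies on this branch
        else:
            score += -0.1
        return 1 if score > 0 else -1

    for point in [(2.0, 4.0), (4.0, 4.0), (4.0, 6.0)]:
        print(point, tree_as_sum(*point))          # one prediction per region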

SLIDE 38

An alternating decision tree

(Figure: an alternating decision tree on the same data: a root score plus prediction nodes for the tests X>3, Y>5 and Y<1, with branch scores such as +0.1, −0.1, +0.2, −0.3, 0.0 and +0.7; the classification is the sign of the sum of the scores along all paths the example reaches.)
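A sketch of how an alternating decision tree is evaluated (the node structure and scores are illustrative, loosely following the figure): add up the scores of every prediction node whose precondition the example satisfies, then take the sign.

    # Illustrative alternating decision tree: each rule is
    # (precondition, test, score_if_true, score_if_false).
    RULES = [
        (lambda x, y: True,   lambda x, y: x > 3,  +0.1, -0.1),
        (lambda x, y: x > 3,  lambda x, y: y > 5,  -0.3, +0.2),
        (lambda x, y: True,   lambda x, y: y < 1,  +0.7,  0.0),
    ]
    ROOT_SCORE = -0.2

    def adt_predict(x, y):
        score = ROOT_SCORE
        for precondition, test, s_true, s_false in RULES:
            if precondition(x, y):                 # only rules the example "reaches" contribute
                score += s_true if test(x, y) else s_false
        return 1 if score > 0 else -1

    print(adt_predict(4.0, 0.5))   # this example reaches all three rules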

SLIDE 39

Example: Medical Diagnostics

  • Cleve dataset from UC Irvine database.
  • Heart disease diagnostics (+1=healthy,-1=sick)
  • 13 features from tests (real valued and discrete).
  • 303 instances.
SLIDE 40

ADtree for the Cleveland heart-disease diagnostics problem

SLIDE 41

Cross-validated accuracy

Learning algorithm | Number of splits | Average test error | Test error variance
ADtree             | 6                | 17.0%              | 0.6%
C5.0               | 27               | 27.2%              | 0.5%
C5.0 + boosting    | 446              | 20.2%              | 0.5%
Boost Stumps       | 16               | 16.5%              | 0.8%

SLIDE 42

Applications of Boosting
SLIDE 43

Applications of Boosting

  • Academic research
  • Applied research
  • Commercial deployment
SLIDE 44

Academic research

Database  | Other            | Boosting | Error reduction
Cleveland | 27.2 (DT)        | 16.5     | 39%
Promoters | 22.0 (DT)        | 11.8     | 46%
Letter    | 13.8 (DT)        | 3.5      | 74%
Reuters 4 | 5.8, 6.0, 9.8    | 2.95     | ~60%
Reuters 8 | 11.3, 12.1, 13.4 | 7.4      | ~40%

(% test error rates)

SLIDE 45

Applied research

  • “AT&T, How may I help you?”
  • Classify voice requests
  • Voice → text → category
  • Fourteen categories:
    – area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge

Schapire, Singer, Gorin 98
SLIDE 46
Example transcribed sentences

  • “Yes I’d like to place a collect call long distance please”
  • “Operator I need to make a call but I need to bill it to my office”
  • “Yes I’d like to place a call on my master card please”
  • “I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill”

Categories: collect, third party, billing credit, calling card

SLIDE 47

Weak rules generated by “BoosTexter”

(Figure: each weak rule tests whether a word occurs in the request or not and assigns a score to each category, e.g. calling card, collect call, third party.)

SLIDE 48

Results

  • 7844 training examples
    – hand transcribed
  • 1000 test examples
    – hand / machine transcribed
  • Accuracy with 20% rejected:
    – machine transcribed: 75%
    – hand transcribed: 90%

SLIDE 49

Commercial deployment

  • Distinguish business/residence customers
  • Using statistics from call-detail records
  • Alternating decision trees
    – combine very simple rules
    – can over-fit; cross-validation is used to stop

Freund, Mason, Rogers, Pregibon, Cortes 2000

SLIDE 50

Massive datasets (for 1997)

  • 260M calls / day
  • 230M telephone numbers
  • Label unknown for ~30%
  • Hancock: software for computing statistical signatures
  • 100K randomly selected training examples; ~10K is enough
  • Training takes about 2 hours
  • Generated classifier has to be both accurate and efficient

SLIDE 51

Alternating tree for “buizocity”

SLIDE 52

Alternating Tree (Detail)

SLIDE 53

Precision/recall graphs

(Figure: accuracy plotted against score.)

SLIDE 54

Face Detection

Viola and Jones / 2001

SLIDE 55

Face Detection / Viola and Jones

Many Uses

  • Standard feature in cameras
  • User Interfaces
  • Security Systems
  • Video Compression
  • Image Database Analysis

  • Live demo at the 2001 conference
  • Higher accuracy than the manually designed state of the art
  • Real-time using a 2001 laptop
SLIDE 56

Face Detection as a Filtering process

(Figure: the detector scans roughly 50,000 locations/scales per image, from the smallest scale up to larger scales.)

SLIDE 57

Classifier is Learned from Labeled Data

  • 5000 faces, 10^8 non-faces
  • Faces are normalized
  • Scale, translation
  • Rotation remains…
  • Example: 28x28 image patch
  • Label: Face / Non-face

SLIDE 58

Image Features

Unique binary features: “rectangle filters”, similar to Haar wavelets [Papageorgiou et al.]

ht(xi) = 1 if ft(xi) > θt, 0 otherwise

Very fast to compute using the “integral image”. Combined using AdaBoost.
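A minimal sketch of the integral-image trick (a standard technique, not Viola and Jones' code): precompute cumulative sums so that the sum over any rectangle, and hence any rectangle filter, costs only a few array lookups.

    import numpy as np

    def integral_image(img):
        """Cumulative sum over rows and columns, padded with a leading zero
        row and column so rectangle sums need no boundary checks."""
        ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
        return np.pad(ii, ((1, 0), (1, 0)))

    def rect_sum(ii, top, left, height, width):
        """Sum of img[top:top+height, left:left+width] from four lookups."""
        return (ii[top + height, left + width] - ii[top, left + width]
                - ii[top + height, left] + ii[top, left])

    # Example two-rectangle (Haar-like) feature: left half minus right half.
    img = np.arange(16.0).reshape(4, 4)
    ii = integral_image(img)
    feature = rect_sum(ii, 0, 0, 4, 2) - rect_sum(ii, 0, 2, 4, 2)
    print(feature, img[:, :2].sum() - img[:, 2:].sum())   # the two values match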

SLIDE 59

Cascaded boosting

  • Features combined using AdaBoost
    – find the best weak rule by exhaustive search
    – ran for 2 days on a 250-node cluster
  • For detection, features are combined in a cascade:
    – runs in real time, 15 FPS, on a laptop

(Diagram: each image sub-window passes through stages of 1 feature, 5 features, 20 features, …; a stage either rejects the sub-window as NON-FACE or passes it on (roughly 50%, 20%, 2% of windows survive the successive stages), and only windows surviving every stage are labeled FACE.)
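A tiny sketch of the cascade idea (the stage classifiers and thresholds are placeholders, not the trained detector): each stage can reject a sub-window early, so most windows are handled by the cheapest stages.

    def cascade_classify(window, stages):
        """stages: list of (classifier, threshold) pairs, cheapest first;
        classifier(window) returns a score. Reject on the first failing stage."""
        for classifier, threshold in stages:
            if classifier(window) < threshold:
                return "NON-FACE"          # early exit: cheap stages reject most windows
        return "FACE"

    # Illustrative stages with made-up scores and thresholds.
    stages = [
        (lambda w: w["score1"], 0.5),      # 1-feature stage
        (lambda w: w["score5"], 0.6),      # 5-feature stage
        (lambda w: w["score20"], 0.7),     # 20-feature stage
    ]
    print(cascade_classify({"score1": 0.9, "score5": 0.8, "score20": 0.75}, stages))  # FACE
    print(cascade_classify({"score1": 0.2, "score5": 0.8, "score20": 0.9}, stages))   # NON-FACE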

SLIDE 60

Paul Viola and Mike Jones


SLIDE 61

Summary

  • Boosting is a computational method for learning accurate classifiers
  • Resistance to over-fitting is explained by margins
  • Boosting has been applied successfully to a variety of classification problems