Multiclass Boosting with Repartitioning - Ling Li - PowerPoint Presentation


SLIDE 1

Multiclass Boosting with Repartitioning

Ling Li

Learning Systems Group, Caltech

ICML 2006

SLIDE 2

Binary and Multiclass Problems

Binary classification problems: Y = {−1, +1}
Multiclass classification problems: Y = {1, 2, . . . , K}
A multiclass problem can be reduced to a collection of binary problems. Examples:

  • one-vs-one
  • one-vs-all

Usually we obtain an ensemble of binary classifiers
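As a concrete illustration (not from the talk), here is a minimal one-vs-all reduction in Python; `one_vs_all_labels` and `train_stub` are hypothetical names, and the stub is a trivial class-mean rule chosen only to keep the sketch self-contained:

    import numpy as np

    def one_vs_all_labels(y, k):
        # Relabel the multiclass problem as binary: class k vs. the rest.
        return np.where(y == k, 1, -1)

    def train_stub(X, yb):
        # Stand-in binary learner: a linear rule from the class-mean
        # difference (illustrative only; any binary learner fits here).
        w = X[yb == 1].mean(axis=0) - X[yb == -1].mean(axis=0)
        return lambda Z: np.sign(Z @ w)

    X = np.random.randn(120, 5)
    y = np.random.randint(0, 3, size=120)
    # One binary classifier per class: the "ensemble of binary classifiers".
    fs = [train_stub(X, one_vs_all_labels(y, k)) for k in range(3)]
    votes = np.stack([f(X) for f in fs], axis=1)
    pred = votes.argmax(axis=1)   # predict the class whose classifier voted +1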

SLIDE 3

A Unified Approach [Allwein et al., 2000]

Given a coding matrix

$$M = \begin{pmatrix} - & - \\ - & + \\ + & - \\ + & + \end{pmatrix}$$

  • Each row is a codeword for a class, e.g., the codeword for class 2 is "−+"
  • Construct a binary classifier for each column (partition), e.g., f_1 should discriminate classes 1 and 2 from classes 3 and 4
  • Decode (f_1(x), f_2(x)) to predict, e.g., (f_1(x), f_2(x)) = (+, +) predicts class label 4
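Decoding is just a nearest-codeword lookup; a minimal sketch with the matrix above (assuming ±1-valued classifier outputs):

    import numpy as np

    # The 4-class, 2-column coding matrix from this slide (+1/-1 encoding).
    M = np.array([[-1, -1],
                  [-1, +1],
                  [+1, -1],
                  [+1, +1]])

    def decode(f_outputs, M):
        # Predict the class whose codeword is closest in Hamming distance
        # to the vector of binary predictions.
        hamming = (M != f_outputs).sum(axis=1)
        return int(np.argmin(hamming)) + 1    # classes are 1-indexed

    print(decode(np.array([+1, +1]), M))      # -> 4, as on the slide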

SLIDE 4

Coding Matrix

Error-Correcting
  • If a few binary classifiers make mistakes, the correct label can still be predicted
  • Ensure that the Hamming distance between codewords is large, e.g.,

$$M = \begin{pmatrix} - & - & - & + & + \\ - & + & + & - & + \\ + & - & + & - & - \\ + & + & - & + & - \end{pmatrix}$$

  • Assumes that errors are independent
Extensions
  • Some entries can be 0
  • Various distance measures can be used
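The error-correcting ability can be checked directly; the sketch below computes the minimum pairwise Hamming distance for the matrix as reconstructed above (minimum distance 3, so any single binary error is corrected):

    import numpy as np
    from itertools import combinations

    M = np.array([[-1, -1, -1, +1, +1],
                  [-1, +1, +1, -1, +1],
                  [+1, -1, +1, -1, -1],
                  [+1, +1, -1, +1, -1]])

    # Minimum Hamming distance between any two codewords (rows).
    d_min = min(int((r1 != r2).sum()) for r1, r2 in combinations(M, 2))
    print(d_min, (d_min - 1) // 2)   # -> 3 1: up to one binary error corrected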

SLIDE 5

Multiclass Boosting [Guruswami & Sahai, 1999]

Problems
  • Errors of the binary classifiers may be highly correlated
  • The optimal coding matrix is problem dependent
Boosting Approach
  • Dynamically generates the coding matrix
  • Reweights examples to reduce the error correlation
  • Minimizes a multiclass margin cost

SLIDE 6

Prototype

The ensemble F = (f_1, f_2, . . . , f_T); each f_t has a coefficient α_t. The weighted Hamming distance is

$$\Delta(M(k), F(x)) = \sum_{t=1}^{T} \alpha_t \, \frac{1 - M(k,t)\, f_t(x)}{2}.$$

Multiclass Boosting

1: F ← (0, 0, . . . , 0), i.e., f_t ← 0
2: for t = 1 to T do
3:   Pick the t-th column M(·, t) ∈ {−, +}^K
4:   Train a binary hypothesis f_t on {(x_n, M(y_n, t))}, n = 1, . . . , N
5:   Decide a coefficient α_t
6: end for
7: return M, F, and the α_t's
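A skeletal Python rendering of this loop (illustrative; `pick_column`, `train_binary`, and `pick_alpha` are hypothetical stand-ins for steps 3 to 5, and labels are 0-indexed):

    import numpy as np

    def multiclass_boost(X, y, K, T, pick_column, train_binary, pick_alpha):
        # Build the coding matrix column by column, training one binary
        # hypothesis f_t per column and assigning it a coefficient alpha_t.
        M = np.zeros((K, T), dtype=int)          # entries in {-1, +1}
        F, alphas = [], []
        for t in range(T):
            M[:, t] = pick_column(M, t)          # step 3: choose the partition
            yb = M[y, t]                         # step 4: relabel y_n -> M(y_n, t)
            F.append(train_binary(X, yb))
            alphas.append(pick_alpha(F, M, t))   # step 5: choose alpha_t
        return M, F, alphas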

SLIDE 7

Multiclass Margin Cost

For an example (x, y), we want

$$\Delta(M(k), F(x)) > \Delta(M(y), F(x)) \quad \forall k \neq y.$$

Margin: the margin of the example (x, y) for class k is

$$\rho_k(x, y) = \Delta(M(k), F(x)) - \Delta(M(y), F(x)).$$

Exponential Margin Cost:

$$C(F) = \sum_{n=1}^{N} \sum_{k \neq y_n} e^{-\rho_k(x_n, y_n)}.$$

This is similar to the binary exponential margin cost.
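In code, the cost is a direct computation from the weighted Hamming distances (a sketch under the conventions above; labels are 0-indexed):

    import numpy as np

    def margin_cost(M, F_outputs, alphas, y):
        # M: (K, T) coding matrix in {-1,+1}; F_outputs: (N, T) values f_t(x_n);
        # alphas: (T,) coefficients; y: (N,) labels in 0..K-1.
        a = np.asarray(alphas)
        # Delta(M(k), F(x_n)) for every example n and class k: shape (N, K).
        delta = ((1 - F_outputs[:, None, :] * M[None, :, :]) / 2 * a).sum(axis=2)
        rho = delta - delta[np.arange(len(y)), y][:, None]   # rho_k(x_n, y_n)
        cost = np.exp(-rho)
        cost[np.arange(len(y)), y] = 0                       # exclude k = y_n
        return cost.sum()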

SLIDE 8

Gradient Descent [Sun et al., 2005]

A multiclass boosting algorithm can be derived as gradient descent on the margin cost.

Multiclass Boosting
1: F ← (0, 0, . . . , 0), i.e., f_t ← 0
2: for t = 1 to T do
3:   Pick M(·, t) and f_t to maximize the negative gradient
4:   Pick α_t to minimize the cost along the gradient
5: end for
6: return M, F, and the α_t's

AdaBoost.ECC is a concrete instance of this scheme for the exponential cost.
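For the exponential cost, the line search in step 4 has a closed form analogous to binary AdaBoost (stated here as a standard result, assuming f_t outputs ±1, with ε_t the weighted binary error defined on the next slide):

$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}.$$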

SLIDE 9

Gradient of Exponential Cost

(most math equations skipped) Say F = (f_1, . . . , f_t, 0, . . . ). Then

$$-\left.\frac{\partial C(F)}{\partial \alpha_t}\right|_{\alpha_t=0} = U_t\,(1 - 2\varepsilon_t)$$

where

$$\tilde{D}_t(n, k) = e^{-\rho_k(x_n, y_n)} \quad \text{(before } f_t \text{ is added)}$$

measures how easily this example of class y_n would be confused with class k;

$$U_t = \sum_{n=1}^{N} \sum_{k=1}^{K} \tilde{D}_t(n, k)\, [\![\, M(k,t) \neq M(y_n,t) \,]\!]$$

is the sum of the "confusion" over the binary relabeled examples;

$$D_t(n) = U_t^{-1} \sum_{k=1}^{K} \tilde{D}_t(n, k)\, [\![\, M(k,t) \neq M(y_n,t) \,]\!]$$

is the sum of the "confusion" for an individual example; and

$$\varepsilon_t = \sum_{n=1}^{N} D_t(n)\, [\![\, f_t(x_n) \neq M(y_n,t) \,]\!]$$

is the weighted binary error.
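These quantities are cheap to compute from the confusion table (a sketch; shapes follow the notation above, labels are 0-indexed):

    import numpy as np

    def gradient_terms(D_tilde, M_col, f_vals, y):
        # D_tilde: (N, K) confusion weights; M_col: (K,) column in {-1,+1};
        # f_vals: (N,) binary predictions f_t(x_n); y: (N,) labels in 0..K-1.
        differ = M_col[None, :] != M_col[y][:, None]   # [[ M(k,t) != M(y_n,t) ]]
        U_t = (D_tilde * differ).sum()
        D_t = (D_tilde * differ).sum(axis=1) / U_t     # per-example weight
        eps_t = D_t[f_vals != M_col[y]].sum()          # weighted binary error
        return U_t, D_t, eps_t, U_t * (1 - 2 * eps_t)  # last: negative gradient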

SLIDE 10

Picking Partitions

Recall

$$-\left.\frac{\partial C(F)}{\partial \alpha_t}\right|_{\alpha_t=0} = U_t\,(1 - 2\varepsilon_t).$$

  • U_t is determined by the t-th column (partition)
  • ε_t also depends on the binary learning performance

It seems we should pick the partition to maximize U_t and ask the binary learner to minimize ε_t.

Picking Partitions
  • max-cut: picks the partition with the largest U_t
  • rand-half: randomly assigns + to half of the classes

Which one would you pick?
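Both heuristics are easy to sketch (illustrative; finding the exact max-cut is NP-hard in general, so the brute force below is feasible only for small K):

    import numpy as np
    from itertools import product

    def rand_half(K, rng=None):
        # Randomly assign + to half of the classes.
        rng = np.random.default_rng() if rng is None else rng
        col = -np.ones(K, dtype=int)
        col[rng.choice(K, K // 2, replace=False)] = 1
        return col

    def max_cut(D_tilde, y, K):
        # Brute force over all 2^K columns: the partition with the largest U_t.
        best, best_U = None, -1.0
        for bits in product([-1, 1], repeat=K):
            col = np.array(bits)
            differ = col[None, :] != col[y][:, None]
            U = (D_tilde * differ).sum()
            if U > best_U:
                best, best_U = col, U
        return best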

SLIDE 11

Tangram

[Figure: a tangram whose seven pieces, labeled 1-7, define a 7-class toy problem]

SLIDE 12

Margin Cost (with perceptrons)

[Figure: training cost (normalized, log scale) vs. number of iterations, for AdaBoost.ECC (max-cut) and AdaBoost.ECC (rand-half)]

SLIDE 13

Why was Max-Cut Worse?

Maximizing U_t brings strong error-correcting ability, but it also generates many "hard" binary problems.

SLIDE 14

Trade-Off

Recall

$$-\left.\frac{\partial C(F)}{\partial \alpha_t}\right|_{\alpha_t=0} = U_t\,(1 - 2\varepsilon_t).$$

Hard problems deteriorate the binary learning, so the negative gradient may end up smaller overall. We need a trade-off between U_t and ε_t. Since the "hardness" depends on the binary learner, we may "ask" the binary learner for a better partition.

SLIDE 15

Repartitioning

Given a binary classifier f_t, which partition is the best? The one that maximizes

$$-\left.\frac{\partial C(F)}{\partial \alpha_t}\right|_{\alpha_t=0}.$$

(most math equations skipped) M(k, t) can be decided from the output of f_t and the "confusion".
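The slide skips the derivation; as a reconstruction from the gradient expression above (not the talk's verbatim formula), the negative gradient is linear in each entry M(k, t) once f_t is fixed, so every entry can be set by a per-class sign:

    import numpy as np

    def repartition(D_tilde, f_vals, y, K):
        # Reconstruction (hypothetical): choose M(:, t) entry-wise to
        # maximize the negative gradient, given f_t's outputs f_vals and
        # the confusion table D_tilde. A(j, k) aggregates the
        # confusion-weighted outputs of class-j examples against class k.
        A = np.zeros((K, K))
        for j in range(K):
            A[j] = (D_tilde[y == j] * f_vals[y == j][:, None]).sum(axis=0)
        b = (A - A.T).sum(axis=1)        # linear coefficient of M(j, t)
        return np.where(b >= 0, 1, -1)   # M(j, t) = sign(b_j)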

SLIDE 16

AdaBoost.ERP

  • Given a partition, a binary classifier can be learned
  • Given a binary classifier, a better partition can be generated
  • These two steps can be carried out alternately (see the sketch below)
We use a string of "L" and "R" to denote the schedule, e.g., "LRL" means "Learning → Repartitioning → Learning".
We can also start from partial partitions, e.g., rand-2 starts with two random classes. This gives faster learning and a focus on local class structure.
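A sketch of the schedule loop (illustrative; `train_binary` and `repartition` stand in for the two steps, and the schedule is assumed to start with "L"):

    def run_schedule(schedule, col, X, y, D_tilde, train_binary, repartition):
        # Alternate learning ("L") and repartitioning ("R"), e.g. "LRL":
        # learn f_t, improve the column given f_t, then learn f_t again.
        f_t = None
        for step in schedule:
            if step == "L":
                f_t = train_binary(X, col[y])    # relabel by column, then learn
            else:
                col = repartition(D_tilde, f_t(X), y, len(col))
        return col, f_t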

SLIDE 17

Experiment Settings

  • We compared one-vs-one, one-vs-all, AdaBoost.ECC, and AdaBoost.ERP
  • Four different binary learners: decision stumps, perceptrons, binary AdaBoost, and SVM-perceptron
  • Ten UCI data sets, with the number of classes varying from 3 to 26

SLIDE 18

Cost with Decision Stumps on letter

[Figure: training cost (normalized, log scale) vs. number of iterations, for AdaBoost.ECC (max-cut, rand-half) and AdaBoost.ERP (max-2 LRL, rand-2 LRL, rand-2 LRLR)]

SLIDE 19

Test Error with Decision Stumps on letter

[Figure: test error (%) vs. number of iterations for the same methods]

SLIDE 20

Cost with Perceptrons on letter

[Figure: training cost (normalized, log scale) vs. number of iterations, for AdaBoost.ECC (max-cut, rand-half) and AdaBoost.ERP (max-2 LRL, rand-2 LRL, rand-2 LRLR)]

SLIDE 21

Test Error with Perceptrons on letter

[Figure: test error (%) vs. number of iterations for the same methods]

SLIDE 22

Overall Results

  • AdaBoost.ERP achieved the lowest cost, and the lowest test error on most of the data sets
  • The improvement is especially significant for weak binary learners
  • With SVM-perceptron, all methods were comparable
  • AdaBoost.ERP starting with partial partitions was much faster than AdaBoost.ECC
  • One-vs-one is much worse with weak binary learners, though it is much faster

SLIDE 23

Summary

  • A multiclass problem can be reduced to a collection of binary problems via an error-correcting coding matrix
  • Multiclass boosting dynamically generates the coding matrix and the binary problems
  • Hard binary problems deteriorate the binary learning
  • AdaBoost.ERP achieves a better trade-off between error-correcting ability and binary learning performance