Multiclass Boosting with Repartitioning
Ling Li, Learning Systems Group, Caltech
ICML 2006

Outline: Introduction, Multiclass Boosting, Repartitioning, Experiments, Summary
Binary and Multiclass Problems
Binary classification problems: Y = {−1, 1}
Multiclass classification problems: Y = {1, 2, . . . , K}
A multiclass problem can be reduced to a collection of binary problems. Examples:
- one-vs-one
- one-vs-all
Usually we obtain an ensemble of binary classifiers.
A Unified Approach [Allwein et al., 2000]
Given a coding matrix

    M =  class 1: − −
         class 2: − +
         class 3: + −
         class 4: + +

Each row is a codeword for a class; e.g., the codeword for class 2 is “−+”.
Construct a binary classifier for each column (partition); e.g., f1 should discriminate classes 1 and 2 from classes 3 and 4.
Decode (f1(x), f2(x)) to predict; e.g., (f1(x), f2(x)) = (+, +) predicts class label 4.
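A minimal numpy sketch of this decoding step, assuming ±1-valued classifier outputs and hard Hamming decoding:

```python
import numpy as np

# Coding matrix from the slide: rows are class codewords, columns are partitions.
M = np.array([[-1, -1],    # class 1
              [-1, +1],    # class 2
              [+1, -1],    # class 3
              [+1, +1]])   # class 4

def decode(binary_outputs, M):
    """Predict the class whose codeword is closest in Hamming distance
    to the vector of binary classifier outputs."""
    f = np.asarray(binary_outputs)        # e.g. (f1(x), f2(x)) = (+1, +1)
    hamming = np.sum(M != f, axis=1)      # distance to each codeword
    return int(np.argmin(hamming)) + 1    # classes are numbered 1..K

print(decode([+1, +1], M))  # -> 4, as on the slide
```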
Coding Matrix
Error-correcting: if a few binary classifiers make mistakes, the correct label can still be predicted.
Assure the Hamming distance between codewords is large (a small checker is sketched below), e.g.,

    M =  class 1: − − − + +
         class 2: − + + − +
         class 3: + − + − −
         class 4: + + − + −

Assume errors are independent.
Extensions:
- Some entries can be 0
- Various distance measures can be used
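A small helper that makes the “large Hamming distance” requirement checkable for any candidate coding matrix (a sketch, assuming ±1 entries):

```python
import numpy as np

def min_codeword_distance(M):
    """Smallest pairwise Hamming distance d between codewords (rows of M).
    With nearest-codeword decoding, up to floor((d - 1) / 2) binary errors
    can still be corrected."""
    K = len(M)
    return min(int(np.sum(M[i] != M[j]))
               for i in range(K) for j in range(i + 1, K))
```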
Multiclass Boosting [Guruswami & Sahai, 1999]
Problems:
- Errors of the binary classifiers may be highly correlated
- The optimal coding matrix is problem dependent

Boosting approach:
- Dynamically generates the coding matrix
- Reweights examples to reduce the error correlation
- Minimizes a multiclass margin cost
Prototype
The ensemble is F = (f1, f2, . . . , fT); each ft has a coefficient αt.
The Hamming distance is

    ∆(M(k), F(x)) = Σ_{t=1}^{T} αt · (1 − M(k, t) ft(x)) / 2

Multiclass Boosting (prototype)
1: F ← (0, 0, . . . , 0), i.e., ft ← 0
2: for t = 1 to T do
3:   Pick the t-th column M(·, t) ∈ {−, +}^K
4:   Train a binary hypothesis ft on {(xn, M(yn, t))}_{n=1}^{N}
5:   Decide a coefficient αt
6: end for
7: return M, F, and the αt's
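A minimal runnable sketch of this prototype, assuming classes indexed 0..K−1, rand-half columns, a depth-1 scikit-learn tree standing in for the binary learner, and the usual AdaBoost formula for αt (the exact AdaBoost.ECC choices come later):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # a stand-in binary learner

def multiclass_boost(X, y, K, T=50, seed=0):
    """Illustrative version of the prototype loop above: pick a column
    (here: a rand-half partition), train a binary hypothesis on the
    relabeled data, and pick a coefficient (here: the usual AdaBoost
    formula). Not the exact AdaBoost.ECC/ERP update from the paper."""
    rng = np.random.default_rng(seed)
    N = len(y)
    M = np.zeros((K, T), dtype=int)
    hypotheses, alphas = [], []
    D = np.full(N, 1.0 / N)                    # per-example weights (simplified)
    for t in range(T):
        col = np.where(rng.permutation(K) < K // 2, 1, -1)   # rand-half column
        M[:, t] = col
        b = col[y]                             # binary relabels M(yn, t)
        f = DecisionTreeClassifier(max_depth=1).fit(X, b, sample_weight=D)
        pred = f.predict(X)
        eps = float(np.sum(D * (pred != b)))
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * b * pred)      # reweight, then renormalize
        D /= D.sum()
        hypotheses.append(f)
        alphas.append(alpha)
    return M, hypotheses, np.array(alphas)
```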
Multiclass Margin Cost
For an example (x, y), we want

    ∆(M(k), F(x)) > ∆(M(y), F(x))   for all k ≠ y

Margin: the margin of the example (x, y) for class k is

    ρk(x, y) = ∆(M(k), F(x)) − ∆(M(y), F(x))

Exponential margin cost:

    C(F) = Σ_{n=1}^{N} Σ_{k ≠ yn} e^{−ρk(xn, yn)}

This is similar to the binary exponential margin cost.
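These formulas translate directly to code; a sketch, assuming classes indexed 0..K−1 and ±1-valued M(k, t) and ft(xn):

```python
import numpy as np

def exponential_margin_cost(M, alphas, f_outputs, y):
    """C(F) written directly from the slide's formulas.
    M[k, t] = M(k, t) in {-1, +1}; alphas[t] = alpha_t;
    f_outputs[n, t] = f_t(x_n) in {-1, +1}; y[n] in 0..K-1."""
    # Delta(M(k), F(x_n)) = sum_t alpha_t * (1 - M[k, t] * f_t(x_n)) / 2
    delta = 0.5 * ((1.0 - f_outputs[:, None, :] * M[None, :, :]) * alphas).sum(axis=-1)
    # rho_k(x_n, y_n) = Delta(M(k), F(x_n)) - Delta(M(y_n), F(x_n))
    rho = delta - delta[np.arange(len(y)), y][:, None]
    cost = np.exp(-rho)
    cost[np.arange(len(y)), y] = 0.0           # the sum excludes k = y_n
    return float(cost.sum())
```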
Gradient Descent [Sun et al., 2005]
A multiclass boosting algorithm can be derived as gradient descent on the margin cost.

Multiclass Boosting (gradient-descent view)
1: F ← (0, 0, . . . , 0), i.e., ft ← 0
2: for t = 1 to T do
3:   Pick M(·, t) and ft to maximize the negative gradient
4:   Pick αt to minimize the cost along the gradient
5: end for
6: return M, F, and the αt's
AdaBoost.ECC is the concrete instance of this scheme for the exponential cost.
Gradient of Exponential Cost
(Most of the math is skipped.) Say F = (f1, . . . , ft, 0, . . . ). Then

    −∂C(F)/∂αt |_{αt = 0} = Ut (1 − 2εt)

where

    D̃t(n, k) = e^{−ρk(xn, yn)}   (before ft is added)
        How would this example of class yn be confused as class k?

    Ut = Σ_{n=1}^{N} Σ_{k=1}^{K} D̃t(n, k) ⟦M(k, t) ≠ M(yn, t)⟧
        Sum of the “confusion” for the binary relabeled examples

    Dt(n) = Ut^{−1} · Σ_{k=1}^{K} D̃t(n, k) ⟦M(k, t) ≠ M(yn, t)⟧
        Sum of the “confusion” for an individual example

    εt = Σ_{n=1}^{N} Dt(n) ⟦ft(xn) ≠ M(yn, t)⟧
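A sketch of these quantities in numpy, under the same indexing assumptions as before (classes 0..K−1, ±1-valued entries):

```python
import numpy as np

def gradient_quantities(D_tilde, M_col, f_out, y):
    """U_t, D_t(n), eps_t and the negative gradient, from the definitions
    above. D_tilde[n, k] = exp(-rho_k(x_n, y_n)) before f_t is added;
    M_col[k] = M(k, t) in {-1, +1}; f_out[n] = f_t(x_n) in {-1, +1};
    y[n] is the class of example n."""
    relabel = M_col[y]                                # M(y_n, t)
    mismatch = (M_col[None, :] != relabel[:, None])   # [[ M(k, t) != M(y_n, t) ]]
    U_t = float(np.sum(D_tilde * mismatch))           # total "confusion"
    D_t = np.sum(D_tilde * mismatch, axis=1) / U_t    # per-example weights
    eps_t = float(np.sum(D_t * (f_out != relabel)))   # weighted binary error
    return U_t, D_t, eps_t, U_t * (1.0 - 2.0 * eps_t)
```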
Picking Partitions
Recall

    −∂C(F)/∂αt |_{αt = 0} = Ut (1 − 2εt)

Ut is determined by the t-th column/partition; εt also depends on how well the binary learner does.
It seems we should pick the partition to maximize Ut and ask the binary learner to minimize εt.

Picking partitions:
- max-cut: picks the partition with the largest Ut (a brute-force version is sketched below)
- rand-half: randomly assigns + to half of the classes

Which one would you pick?
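Exact max-cut over all 2^(K−1) distinct partitions is only feasible for small K; the brute-force sketch below is just to make the criterion concrete, not how one would implement it for many classes:

```python
import numpy as np
from itertools import product

def max_cut_column(D_tilde, y, K):
    """Enumerate all 2^(K-1) sign assignments (class 0 fixed to + by
    symmetry) and keep the partition with the largest U_t."""
    best_col, best_U = None, -np.inf
    for signs in product([-1, 1], repeat=K - 1):
        col = np.array((1,) + signs)
        mism = (col[None, :] != col[y][:, None])    # [[ M(k,t) != M(yn,t) ]]
        U = float((D_tilde * mism).sum())
        if U > best_U:
            best_col, best_U = col, U
    return best_col, best_U
```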
Tangram
[Figure: the tangram example, with its seven classes labeled 1 through 7]
Margin Cost (with perceptrons)
[Plot: normalized training cost (log scale) vs. number of iterations (up to 50), comparing AdaBoost.ECC (max-cut) and AdaBoost.ECC (rand-half)]
Why was Max-Cut Worse?
Maximizing Ut brings strong error-correcting ability, but it also generates many “hard” binary problems.
Trade-Off
Recall

    −∂C(F)/∂αt |_{αt = 0} = Ut (1 − 2εt)

Hard problems deteriorate the binary learning, so overall the negative gradient might be smaller.
We need to find a trade-off between Ut and εt.
The “hardness” depends on the binary learner, so we may “ask” the binary learner for a better partition.
Repartitioning
Given a binary classifier ft, which partition is the best? The one that maximizes

    −∂C(F)/∂αt |_{αt = 0}

(Most of the math is skipped.) M(k, t) can be decided from the output of ft and the “confusion”.
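The closed form is skipped here; the sketch below reconstructs one from the gradient expression two slides back, by grouping the negative gradient by class and taking the sign of each class's coefficient, so read it as an illustration rather than the exact published rule:

```python
import numpy as np

def repartition(D_tilde, f_out, y, K):
    """Choose each M(k, t), given f_t, to (approximately) maximize
    U_t (1 - 2 eps_t). Reconstructed from the gradient formula above."""
    N = len(y)
    Dt = D_tilde.copy()
    Dt[np.arange(N), y] = 0.0                 # the cost sums over k != y_n
    col = np.empty(K, dtype=int)
    for c in range(K):
        own = (y == c)
        # Examples of class c pull M(c, t) toward the sign of f_t on them
        # (weighted by their total confusion); examples of other classes
        # that are confused as c pull it the opposite way.
        pull = np.sum(f_out[own] * Dt[own].sum(axis=1)) \
             - np.sum(Dt[~own, c] * f_out[~own])
        col[c] = 1 if pull >= 0 else -1
    return col
```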
AdaBoost.ERP
Given a partition, a binary classifier can be learned; given a binary classifier, a better partition can be generated.
These two steps can be carried out alternately (one such iteration is sketched below).
We use a string of “L” and “R” to denote the schedule; e.g., “LRL” means “Learning → Repartitioning → Learning”.
We can also start from partial partitions; e.g., rand-2 starts with two random classes.
This gives faster learning and focuses on local class structure.
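A sketch of one such iteration, combining the pieces above; the binary learner is again a scikit-learn stump, and the starting column is a rand-half partition rather than a rand-2 partial partition:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in binary learner

def erp_iteration(X, y, D_tilde, K, schedule="LRL", seed=0):
    """One boosting iteration following an 'L'/'R' schedule string, in the
    spirit of AdaBoost.ERP. Uses the repartition(...) sketch above."""
    rng = np.random.default_rng(seed)
    col = np.where(rng.permutation(K) < K // 2, 1, -1)   # initial partition
    f = None
    for step in schedule:
        if step == "L":                                  # learn ft for the current column
            mism = (col[None, :] != col[y][:, None])     # [[ M(k,t) != M(yn,t) ]]
            D_t = (D_tilde * mism).sum(axis=1)
            D_t = D_t / D_t.sum()                        # binary example weights Dt(n)
            f = DecisionTreeClassifier(max_depth=1).fit(X, col[y], sample_weight=D_t)
        elif step == "R":                                # repartition given ft
            col = repartition(D_tilde, f.predict(X), y, K)
    return col, f
```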
Experiment Settings
We compared one-vs-one, one-vs-all, AdaBoost.ECC, and AdaBoost.ERP.
Four different binary learners: decision stumps, perceptrons, binary AdaBoost, and SVM-perceptron.
Ten UCI data sets, with the number of classes varying from 3 to 26.
Cost with Decision Stumps on letter
[Plot: normalized training cost (log scale) vs. number of iterations (up to 1000) with decision stumps on letter; curves for AdaBoost.ECC (max-cut), AdaBoost.ECC (rand-half), AdaBoost.ERP (max-2, LRL), AdaBoost.ERP (rand-2, LRL), and AdaBoost.ERP (rand-2, LRLR)]
Test Error with Decision Stumps on letter
[Plot: test error (%) vs. number of iterations (up to 1000) with decision stumps on letter, same five methods]
Cost with Perceptrons on letter
[Plot: normalized training cost (log scale) vs. number of iterations (up to 500) with perceptrons on letter; curves for AdaBoost.ECC (max-cut), AdaBoost.ECC (rand-half), AdaBoost.ERP (max-2, LRL), AdaBoost.ERP (rand-2, LRL), and AdaBoost.ERP (rand-2, LRLR)]
Test Error with Perceptrons on letter
[Plot: test error (%) vs. number of iterations (up to 500) with perceptrons on letter, same five methods]
Overall Results
- AdaBoost.ERP achieved the lowest cost, and the lowest test error on most of the data sets
- The improvement is especially significant for weak binary learners
- With SVM-perceptron, all methods were comparable
- AdaBoost.ERP starting with partial partitions was much faster than AdaBoost.ECC
- One-vs-one is much worse with weak binary learners, but it is much faster