Model Combination in Multiclass Classification


SLIDE 1

Regularization in Linear Combinations of Multiclass Classifiers Model Selection in Binary Subproblems Probabilistic Pairwise Classification

Model Combination in Multiclass Classification

Sam Reid
Advisors: Mike Mozer, Greg Grudic

Department of Computer Science, University of Colorado at Boulder, USA

April 5, 2010

Sam Reid Model Combination in Multiclass Classification 1/ 76

SLIDE 2

Multiclass Classification

◮ From examples, make multiclass predictions on unseen data.
◮ Applications in:
  ◮ Heartbeat arrhythmia monitoring
  ◮ Protein structure classification
  ◮ Handwritten digit recognition
  ◮ Part-of-speech tagging
  ◮ Vehicle identification
  ◮ Many others...
◮ Our approach: model combination


SLIDE 3

Multiclass Classification: Example

Heartbeat Arrhythmia Monitoring Data Set (truncated)

age    gender  height  weight  BPM   QRS duration  274 other wave    class
(yrs)          (cm)    (kg)          (ms)          characteristics
 75      m      190      80     91       63        ...               Supraventricular Pre.
 56      f      165      64     81       53        ...               Sinus bradycardy
 54      m      172      95    138       75        ...               Right bundle block
 55      m      175      94    100       71        ...               normal
 75      m      190      80     88        ?        ...               Ventricular Pre.
 13      m      169      51    100       84        ...               Left ventricule hyper.
 40      f      160      52     77       70        ...               normal
 49      f      162      54     78       67        ...               normal
 44      m      168      56     84       64        ...               normal
 50      f      167      67     89       63        ...               Right bundle block
...     ...     ...     ...    ...      ...        ...               ...
 62      m      170      72    102       70        ...               ?
 45      f      165      86     77       72        ...               ?


SLIDE 4

Model Combination

◮ Combine multiclass classifiers (e.g. KNN, Decision Trees, Random Forests)
  ◮ Voting
  ◮ Averaging
  ◮ Linear
  ◮ Nonlinear

◮ Combine binary classifiers (e.g. SVM, AdaBoost) to solve multiclass
  ◮ One vs. All
  ◮ Pairwise Classification
  ◮ Error Correcting Output Coding
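The voting and averaging combiners above can be sketched directly. This is a minimal illustration, not code from the thesis; the toy probability matrices and function names are invented, and each base classifier is assumed to output one class-probability row per example:

```python
import numpy as np

def combine_by_vote(prob_list):
    """Plurality vote: each classifier votes for its argmax class."""
    votes = np.array([p.argmax(axis=1) for p in prob_list])       # (L, n)
    n_classes = prob_list[0].shape[1]
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes)  # (k, n)
    return counts.argmax(axis=0)

def combine_by_average(prob_list):
    """Average the class-probability vectors, then take the argmax."""
    return np.mean(prob_list, axis=0).argmax(axis=1)

# Two toy classifiers, three examples, three classes.
p1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
p2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.1, 0.2, 0.7]])
vote_pred = combine_by_vote([p1, p2])
avg_pred = combine_by_average([p1, p2])
```

A linear combiner generalizes averaging by learning per-classifier (or, in StackingC, per-classifier-per-class) weights instead of uniform ones.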

SLIDE 5

Outline

Regularization in Linear Combinations of Multiclass Classifiers
  Background
  Model
  Experiments

Model Selection in Binary Subproblems
  Background
  Experiments
  Discussion

Probabilistic Pairwise Classification
  Background
  Our Method
  Experiments


SLIDE 6

Outline

Regularization in Linear Combinations of Multiclass Classifiers
  Background
  Model
  Experiments

Model Selection in Binary Subproblems
  Background
  Experiments
  Discussion

Probabilistic Pairwise Classification
  Background
  Our Method
  Experiments


SLIDE 7

Classifier Combination

◮ Goal: optimize predictions on test data
◮ Maintain diversity without sacrificing accuracy
◮ Train many classifiers with different algorithms/hyperparameters
◮ Combine with a linear combination function
  ◮ Ting & Witten, 1999
  ◮ Seewald, 2002
  ◮ Caruana et al., 2004

SLIDE 8

Linear StackingC 1/2

◮ Stacked Generalization
  ◮ Predictions on validation data are meta-training data

◮ Linear StackingC, class-conscious stacked generalization:

  \hat{p}_j(x) = \sum_{i=1}^{L} w_{ij} y_{ij}(x)

  ◮ \hat{p}_j(x) is the predicted probability for class c_j
  ◮ w_{ij} is the weight corresponding to classifier y_i and class c_j
  ◮ y_{ij}(x) is the i-th classifier's output on class c_j

◮ Training set = classifier predictions on unseen data + labels
◮ Determine weights using linear regression
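A minimal sketch of fitting StackingC weights by per-class least squares. The random meta-training data below stands in for real validation-set predictions, and all array names are invented; in StackingC, each class's indicator regression uses only the base classifiers' outputs for that class:

```python
import numpy as np

rng = np.random.default_rng(0)
L, k, n = 5, 3, 200          # classifiers, classes, meta-training examples
# Hypothetical meta-training data: base_probs[i, :, j] is classifier i's
# predicted probability for class j on each validation example.
base_probs = rng.dirichlet(np.ones(k), size=(L, n))     # (L, n, k)
labels = rng.integers(0, k, size=n)

# One indicator-regression subproblem per class, using only the L
# outputs for that class as features.
weights = np.zeros((L, k))
for j in range(k):
    X = base_probs[:, :, j].T                  # (n, L) features for class j
    t = (labels == j).astype(float)            # 0/1 indicator target
    weights[:, j], *_ = np.linalg.lstsq(X, t, rcond=None)

def predict(probs):
    """p_hat_j(x) = sum_i w_ij * y_ij(x); predict the argmax class."""
    scores = np.einsum('inj,ij->nj', probs, weights)    # (n, k)
    return scores.argmax(axis=1)

preds = predict(base_probs)
```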


SLIDE 9

Linear StackingC 2/2

[Figure: stacked generalization architecture: base-classifier outputs y_1(x), y_2(x) are combined into the final prediction \hat{y}; the meta-level inputs are the per-class outputs y_A, y_B, y_C.]



SLIDE 11

Problems

◮ Caruana et al., 2004: "Stacking [linear] performs poorly because regression overfits dramatically when there are 2000 highly correlated input models and only 1k points in the validation set."
◮ How can we scale up stacking to a large number of classifiers?
◮ Our hypothesis: a regularized linear combiner will
  ◮ reduce variance & prevent overfitting on indicator subproblems
  ◮ increase accuracy on the multiclass problem
◮ Penalty terms in our studies:
  ◮ Ridge Regression: L = \|y - X\beta\|^2 + \lambda\|\beta\|_2^2
  ◮ Lasso Regression: L = \|y - X\beta\|^2 + \lambda\|\beta\|_1
  ◮ Elastic Net Regression: L = \|y - X\beta\|^2 + \lambda\left[(1-\alpha)\|\beta\|_2^2 + \alpha\|\beta\|_1\right]
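The effect of the ridge penalty on highly correlated inputs can be seen with the closed-form solution beta = (X^T X + lambda I)^{-1} X^T y. The data here are synthetic stand-ins for correlated base-classifier outputs; none of this comes from the thesis experiments:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: beta = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
# Five nearly identical features, mimicking correlated base classifiers.
z = rng.normal(size=(100, 1))
X = z + 0.01 * rng.normal(size=(100, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

beta_ols = ridge_fit(X, y, 0.0)     # unregularized: unstable weights
beta_ridge = ridge_fit(X, y, 10.0)  # regularized: shrunk, stable weights
```

For any lambda > 0 the ridge solution has a strictly smaller norm than the least-squares solution, which is exactly the variance reduction the hypothesis above relies on.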

SLIDE 12

Thesis Statement - Part I

◮ In linear combinations of multiclass classifiers, regularization significantly improves performance.


SLIDE 13

Multiclass Classification Data Sets

Dataset               Att. (numeric)  Instances  Classes
balance-scale                4            625        3
glass                        9            214        6
letter                      16           4000       26
mfeat-morphological          6           2000       10
optdigits                   64           5620       10
sat-image                   36           6435        6
segment                     19           2310        7
vehicle                     18            846        4
waveform-5000               40           5000        3
yeast                        8           1484       10


SLIDE 14

Algorithms

◮ About 1000 base classifiers for each problem

1. Neural Network
2. Support Vector Machine (C-SVM from LibSVM)
3. K-Nearest Neighbor
4. Decision Stump
5. Decision Tree
6. AdaBoost.M1
7. Bagging classifier
8. Random Forest (Weka)
9. Random Forest (R)


SLIDE 15

Results: Average Accuracy

[Figure: bar chart of average accuracy (%) by combination method: sg-linear, sg-lasso, vote, average, select-best, and sg-ridge; accuracy axis spans 67.5–85.0%.]


SLIDE 16

Statistical Analysis

◮ Ridge outperforms unregularized at p ≤ 0.002
  ◮ Validates the hypothesis: regularization improves accuracy
◮ Ridge outperforms lasso at p ≤ 0.0019
  ◮ Dense better than sparse
◮ Voting and averaging all models not competitive


SLIDE 17

Multiclass Accuracy ∝ Binary Accuracy 1/3

[Figure: RMSE (0.06–0.24) vs. ridge parameter (log scale, 10⁻⁹ to 10³). Caption: Root mean squared error for the first (class-1) indicator subproblem in sat-image, over 10 folds of Dietterich's 5x2 CV.]


SLIDE 18

Multiclass Accuracy ∝ Binary Accuracy 2/3

[Figure: multiclass accuracy (0.85–0.93) vs. ridge parameter (log scale, 10⁻⁹ to 10³). Caption: Multiclass classification accuracy as a function of the regularization hyperparameter λ_ridge.]


SLIDE 19

Multiclass Accuracy ∝ Binary Accuracy 3/3

[Figure: multiclass accuracy (0.85–0.94) vs. RMSE on subproblem 1 (0.06–0.24). Caption: Accuracy vs RMSE on the first (class-1) indicator subproblem.]

Multiclass Accuracy ∝ Binary Accuracy


SLIDE 20

Ridge More Effective than Lasso

[Figure: overall accuracy (0.90–0.93) vs. elastic-net penalty (log scale) for α = 0.95, α = 0.5, α = 0.05, and select-best. Caption: Overall accuracy on sat-image with various parameters for elastic-net.]


SLIDE 21

Focus on Subproblems

◮ Choose from classifiers and predictions
◮ Allow classifiers to focus on subproblems
◮ Example: benefit from a classifier that predicts well-calibrated probabilities for class A but has B & C backwards
◮ This advantage is possible in multiclass classification but not in binary classification, since \sum_{i=1}^{k} p_i(x) = 1


SLIDE 22

Sparse Linear Combinations

[Figure: elastic-net coefficient profiles vs. log λ (−7 to −1) for the Class 1, Class 2, and Class 3 subproblems. Caption: Coefficient profiles for the first three subproblems in StackingC for the sat-image dataset with elastic net regression at α = 0.95.]


SLIDE 23

Selected Classifiers

Classifier       | red  cotton  grey  damp  veg  v.damp | total
adaboost-500     | 6.3  1.4  2.3                        | 10.0
ann-0.5-32-1000  | 6.1  3.5  0.4                        | 10.0
ann-0.5-16-500   | 3.9  1.8  0.9  3.4                   | 10.1
ann-0.9-16-500   | 0.2  8.2  0.7  1.6                   | 10.8
ann-0.5-32-500   | 0.0  7.5  10.0  2.7                  | 11.1
knn-1            | 7.6  6.5  0.8  9.7                   | 24.6

Weights (%) for the sat-image problem in elastic net StackingC with α = 0.95 for the 6 models with highest total weights.


SLIDE 24

Conclusions & Future Work

◮ Regularization is essential in linear combinations of multiclass classifiers
◮ Dense combiners outperform sparse combiners
◮ One-weight-per-output (instead of one-weight-per-classifier) allows classifiers to specialize in subproblems
◮ Future Work
  ◮ Bayesian treatment, Gaussian/Laplacian priors over weights
  ◮ Constrain coefficients to be positive
◮ This work published as:
  ◮ Regularized Linear Models in Stacked Generalization, Sam Reid and Greg Grudic, Multiple Classifier Systems, 2009, Springer LNCS 5519, pp. 112–121


SLIDE 25

Outline

Regularization in Linear Combinations of Multiclass Classifiers
  Background
  Model
  Experiments

Model Selection in Binary Subproblems
  Background
  Experiments
  Discussion

Probabilistic Pairwise Classification
  Background
  Our Method
  Experiments


SLIDE 26

Reducing Multiclass to Binary

◮ Some classifiers are designed for binary problems (e.g. SVM, AdaBoost)
◮ Transform multiclass ⇒ set of binary problems
◮ Combine binary predictions ⇒ predict multiclass

[Figure: example decision boundaries: A vs B,C in one-vs-all; A vs C in all-pairs]
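The two reductions can be sketched as plain data transformations. This is an illustrative sketch with invented function names, not the thesis code:

```python
import numpy as np
from itertools import combinations

def one_vs_all_problems(X, y, classes):
    """One binary problem per class: class c (label 1) vs. the rest (0)."""
    return [(X, (y == c).astype(int)) for c in classes]

def all_pairs_problems(X, y, classes):
    """One binary problem per unordered class pair, using only examples
    from those two classes."""
    problems = []
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        problems.append(((a, b), X[mask], (y[mask] == a).astype(int)))
    return problems

# Six examples, three classes: k one-vs-all and k(k-1)/2 all-pairs problems.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 2, 2])
ova = one_vs_all_problems(X, y, [0, 1, 2])
pairs = all_pairs_problems(X, y, [0, 1, 2])
```

A binary classifier is then trained on each subproblem, and the binary predictions are decoded (e.g. by voting or a loss-based decoding) into a multiclass prediction.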





SLIDE 30

Model Selection in Reducing Multiclass to Binary

◮ No Model Selection
  ◮ Dietterich and Bakiri, 1995
  ◮ Allwein et al., 2000
◮ Shared Hyperparameters
  ◮ Rifkin uses greedy 1-d hillclimbing, with OVA + LBD; Rifkin & Klautau, 2004
  ◮ Model selection in LibSVM; Chang & Lin, 2001
◮ Optimize Subproblems Independently
  ◮ Homogeneous: Friedman, 1996
  ◮ Heterogeneous: Szepannek et al., 2007
◮ Optimize the Joint Distribution
  ◮ Evolutionary search: de Souza et al., 2006; Lebrun et al., 2007

SLIDE 31

Shared Hyperparameters vs Independent Optimization

◮ Shared Hyperparameters
  ◮ Optimizes the target multiclass metric directly
  ◮ Increases bias and reduces variance for model selection

◮ Independent Optimization
  ◮ Accommodates subproblems with different structure
  ◮ Improved subproblem performance ⇒ improved multiclass performance

SLIDE 32

Thesis Statement - Part II

◮ When solving a multiclass problem with a set of binary classifiers, it is more effective to constrain subproblems to use the same hyperparameters than to optimize each independently.


SLIDE 33

Multiclass Classification Data Sets 1/2

dataset          classes  numeric  train  test  sampled-from
anneal              4        6      300    150       878
arrhythmia          5      206      257    129       386
authorship          4       70      300    150       841
autos               5       15      134     68       202
cars                3        6      270    136       406
collins            11       19      300    150       451
dj30-1985-2003     20        6      133     67    138123
ecoli               4        7      204    103       307
eucalyptus          5       14      300    150       736
halloffame          3       15      300    150      1340


SLIDE 34

Multiclass Classification Data Sets 2/2

dataset              classes  numeric  train  test  sampled-from
hypothyroid             3        7      300    150      3707
letter                 18       16      136     68     18668
mfeat-morphological    10        6      300    150      1888
optdigits              10       64      300    150      5620
page-blocks             5       10      300    150      5393
segment                 7       19      300    150      2086
synthetic-control       6       60      300    150       600
vehicle                 4       18      300    150       846
vowel                  11       10      300    150       990
waveform                3       40      300    150      5000


SLIDE 35

Methods

◮ Reductions: {one-vs-all, all-pairs} × {hamming, squared}
◮ Model selection: {shared, independent}
◮ Base classifier: LibSVM with 2-phase grid search
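The 2-phase (coarse-then-fine) grid search can be sketched generically. The thesis does not specify grid spacings, so the spans, step sizes, and the toy score function below are all assumptions for illustration:

```python
import numpy as np

def two_phase_grid_search(score, coarse=4.0, fine=1.0, span=16):
    """Coarse grid over (log2 C, log2 gamma), then a finer grid around
    the best coarse point. `score` is any CV-accuracy-like callable."""
    def best_on(grid_c, grid_g):
        return max((score(c, g), c, g) for c in grid_c for g in grid_g)
    cs = np.arange(-span, span + 1, coarse)
    _, c0, g0 = best_on(cs, cs)                       # phase 1: coarse
    cs2 = np.arange(c0 - coarse, c0 + coarse + 0.5, fine)
    gs2 = np.arange(g0 - coarse, g0 + coarse + 0.5, fine)
    return best_on(cs2, gs2)                          # phase 2: fine

# Toy unimodal score peaking at (log2 C, log2 g) = (3, -5).
score = lambda c, g: -((c - 3) ** 2 + (g + 5) ** 2)
best, c, g = two_phase_grid_search(score)
```

In the shared setting the score would be multiclass CV accuracy of the whole reduction; in the independent setting, each binary subproblem runs its own search with its own binary CV accuracy.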


SLIDE 36

Shared vs Independent: Test Set Accuracy

[Figure: average test-set accuracy (%, 71–79) of shared vs. independent model selection for the four reductions one-vs-all, all-pairs, one-vs-all-hamming, and all-pairs-squared; per-reduction p-values: p ≤ 0.0663, p ≤ 0.0027, p ≤ 0.0027, p ≤ 0.0028.]


SLIDE 37

Subproblems are Similar - Vehicle, one-vs-all

[Figure: accuracy vs. log₂(g) model-selection curves for subproblems 0–3 (accuracy range 0.725–0.975). Caption: Independent model selection curves for one-vs-all on vehicle.]


SLIDE 38

Subproblems are Similar - Vehicle, all-pairs

[Figure: accuracy vs. log₂(g) model-selection curves for subproblems 0–5 (accuracy range 0.450–1.000). Caption: Independent model selection curves for all-pairs on vehicle.]


SLIDE 39

Subproblems are Similar - Examples

[Figure: six panels of independent model selection curves (accuracy vs. log₂(g) per subproblem) for cars, page-blocks, and letter, under one-vs-all (top row) and all-pairs (bottom row).]


SLIDE 40

Subproblems are Similar - Aggregate Results 1/3

◮ Define γ_s = optimal shared hyperparameter
◮ Define γ_i = optimal independent hyperparameter
◮ Compute the accuracy difference d = ā(γ_i) − ā(γ_s)
  ◮ where ā indicates an average over subproblems
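The quantity d can be computed directly from a per-subproblem accuracy grid. The 3×3 accuracy values below are invented for illustration; by construction d is nonnegative, since each subproblem's independent optimum is at least as good as its accuracy at the shared optimum:

```python
import numpy as np

# Hypothetical validation accuracies: acc[s, h] is the accuracy of
# subproblem s with hyperparameter setting h.
acc = np.array([[0.90, 0.94, 0.91],
                [0.88, 0.92, 0.95],
                [0.93, 0.89, 0.90]])

gamma_s = acc.mean(axis=0).argmax()         # one shared hyperparameter
gamma_i = acc.argmax(axis=1)                # one optimum per subproblem
a_shared = acc[:, gamma_s].mean()
a_indep = acc[np.arange(3), gamma_i].mean()
d = a_indep - a_shared                      # >= 0 by construction
```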


SLIDE 41

Subproblems are Similar - Aggregate Results 2/3

Average Subproblem Loss at Selected Optimum (one-vs-all)

[Figure: average accuracy loss (%) per dataset, 0.0–0.8, for halloffame, vehicle, synthetic-control, authorship, optdigits, anneal, waveform, vowel, letter, dj30-1985-2003, collins, segment, page-blocks, mfeat-morphological, hypothyroid, eucalyptus, ecoli, cars, arrhythmia, and autos.]

◮ For each dataset i, d_i < 0.80%
◮ Average d̄ = 0.30%


SLIDE 42

Subproblems are Similar - Aggregate Results 3/3

Average Subproblem Loss at Selected Optimum (all-pairs)

[Figure: average accuracy loss (%) per dataset, roughly 5–35, for anneal, waveform, authorship, halloffame, hypothyroid, optdigits, eucalyptus, segment, ecoli, page-blocks, vowel, vehicle, collins, cars, synthetic-control, arrhythmia, autos, mfeat-morphological, dj30-1985-2003, and letter.]

◮ Largest values: 36.6% (letter), 29.4% (dj30-1985-2003)
◮ Average d̄ = 4.24%


SLIDE 43

Differing Subproblems Favor Independent

◮ Construct a synthetic problem with different shapes of decision boundaries
◮ Requires different hyperparameters ⇒ requires independent optimization
◮ First, a control experiment with only linear decision boundaries


SLIDE 44

Differing Subproblems Favor Independent - Linear Synthetic Data 1/2

[Figure: synthetic data with linear decision boundaries and varying noise; three classes (Class_0, Class_1, Class_2) plotted in the (x, y) plane.]


SLIDE 45

Differing Subproblems Favor Independent - Linear Synthetic Data 2/2

Accuracy (%) results for linear decision boundaries. Standard error over 10 random samplings is indicated in parentheses.

reduction            shared      independent
one-vs-all           66.7 (1.3)  66.1 (1.3)
one-vs-all-hamming   58.2 (2.5)  58.1 (1.9)
all-pairs            67.6 (1.3)  66.5 (1.9)


SLIDE 46

Differing Subproblems Favor Independent - Mixed Synthetic Data 1/2

[Figure: synthetic data with mixed linear and nonlinear decision boundaries; three classes (A, B, C) plotted in the (x, y) plane.]


SLIDE 47

Differing Subproblems Favor Independent - Mixed Synthetic Data 2/2

Accuracy (%) results for mixed linear and nonlinear decision boundaries. Standard error over 10 random samplings is indicated in parentheses.

reduction            shared      independent
one-vs-all           82.4 (0.6)  83.5 (0.9)
one-vs-all-hamming   78.5 (1.3)  79.5 (1.3)
all-pairs            82.4 (1.3)  84.2 (0.9)


SLIDE 48

Multiclass Accuracy ∝ Binary Accuracy + Noise

[Figure: arrhythmia results, four panels. Left: model-selection accuracy curves for one-vs-all (shared, sharedsub, shared-oracle) and for all-pairs (shared, sharedsub, shared-oracle). Right: multiclass accuracy (%) vs. average binary accuracy (%) for one-vs-all / one-vs-all-hamming and for all-pairs / all-pairs-squared.]


SLIDE 49

Multiclass Accuracy ∝ Binary Accuracy + Noise: Anneal

[Figure: anneal results, four panels. Left: model-selection accuracy curves for one-vs-all (shared, sharedsub, shared-oracle) and for all-pairs (shared, sharedsub, shared-oracle). Right: multiclass accuracy (%) vs. average binary accuracy (%) for one-vs-all / one-vs-all-hamming and for all-pairs / all-pairs-squared.]


SLIDE 50

Multiclass Accuracy ∝ Binary Accuracy + Noise: Aggregate

[Figure: R-squared value for one-vs-all (0.40–1.00), per dataset, of the multiclass-vs-binary accuracy relationship.]

◮ Average R-Squared Value: One-vs-all=0.791, All-pairs=0.910

SLIDE 51

Multiclass Metric Non-Essential

◮ Hypothesis: the advantage of shared hyperparameters is due to selection on the target multiclass metric
◮ To test, implement shared-sub:
  ◮ Constrains models to be shared
  ◮ But selects based on average binary accuracy
◮ Results comparing shared vs. shared-sub:
  ◮ one-vs-all: p ≤ 0.65
  ◮ all-pairs: p ≤ 0.10
  ◮ ova-hamming: p ≤ 0.57
◮ No statistically significant differences
◮ Conclusion: sharing hyperparameters is valuable whether you use the average binary or the multiclass metric

SLIDE 52

Oracle Selection favors Shared

◮ To rule out sampling problems, use an oracle to select the optimal model
◮ Use the oracle for both shared and independent

method               winner   p-value
one-vs-all           shared   0.0071
all-pairs            indep    5.72 × 10−6
one-vs-all-hamming   indep    4.77 × 10−5
all-pairs-squared    indep    0.3955

P-values from the Wilcoxon signed-ranks test are indicated by the winning strategy.

◮ For one-vs-all, shared still beats independent
◮ Independent wins for all-pairs and one-vs-all-hamming
◮ No difference for all-pairs-squared
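The Wilcoxon signed-ranks statistic behind these comparisons can be sketched in a few lines. The paired per-data-set accuracies below are made up for illustration, and tied |differences| are not rank-averaged in this sketch.

```python
# Wilcoxon signed-ranks statistic for paired samples (simplified sketch).
def wilcoxon_w(a, b):
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_pos = sum(r + 1 for r, i in enumerate(order) if diffs[i] > 0)
    w_neg = sum(r + 1 for r, i in enumerate(order) if diffs[i] < 0)
    return min(w_pos, w_neg)   # a small W signals a consistent advantage

shared = [91, 85, 78, 88]   # accuracy (%) under shared hyperparameters (toy)
indep = [89, 86, 77, 86]    # accuracy (%) under independent optimization (toy)
```

A small statistic relative to its null distribution yields a small p-value, as in the table above.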

SLIDE 53

Supplementary Result: Comparing Methods

Average ranks of the 7 algorithms under study (omitted ova-ham-indep); algorithms not statistically significantly different from the top-scoring algorithm are connected to it with a vertical line.

SLIDE 54

Conclusions

◮ Shared hyperparameters are often better than independent optimization
◮ Subproblems are often similar, especially in one-vs-all
◮ If there are different decision boundary shapes, use independent optimization
◮ Future Work
  ◮ Multiclass metrics with no binary analog in independent optimization? (e.g. a multiclass cost matrix)
  ◮ Relationship to the regret transform (Langford & Beygelzimer, 2005)

SLIDE 55

Outline

Regularization in Linear Combinations of Multiclass Classifiers: Background, Model, Experiments
Model Selection in Binary Subproblems: Background, Experiments, Discussion
Probabilistic Pairwise Classification: Background, Our Method, Experiments

SLIDE 56

Pairwise Classification

◮ Assume a classification problem with k ≥ 3 classes
◮ k(k − 1)/2 subproblems, one for each pair of classes
◮ Estimate µ̂ij(x) ≈ µij(x) = P(y = ci | y = ci or cj, x)
◮ Note that µij = pi / (pi + pj)
◮ Combine: p = {p1, p2, ..., pk} = f(µ̂ij(x))
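The subproblem construction above can be sketched in a few lines of Python; the function name `make_pairwise_sets` and the toy data are illustrative, not from the dissertation.

```python
# Build the k(k-1)/2 pairwise training sets from a multiclass data set.
from itertools import combinations
import numpy as np

def make_pairwise_sets(X, y):
    """Return {(ci, cj): (X_sub, y_sub)} with y_sub = 1 for class ci."""
    subproblems = {}
    for ci, cj in combinations(np.unique(y), 2):
        mask = (y == ci) | (y == cj)   # keep only instances of the two classes
        subproblems[(ci, cj)] = (X[mask], (y[mask] == ci).astype(int))
    return subproblems

# Toy 3-class data set: k = 3 gives 3 pairwise subproblems
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 2, 0])
subs = make_pairwise_sets(X, y)
```

Any probabilistic binary classifier trained on a subproblem then supplies the estimate µ̂ij(x).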

SLIDE 57

Pairwise Classification Subproblem Example

Illustration of an A-C decision boundary in a 2D, 3-class example of pairwise classification.

SLIDE 58

Pairwise Classification Methods

◮ Voted pairwise classification (VPC): Friedman, 1996
  ◮ ŷ(x) = argmax_i Σ_{j≠i} 1(µ̂ij(x) > µ̂ji(x))
  ◮ Equivalent to the Bayes-optimal prediction if µ̂ij(x) = µij(x)
◮ Hastie & Tibshirani (HT), 1996
  ◮ Iteratively update p = {p1, p2, ..., pk}
  ◮ Minimize the KL divergence between µ and µ̂: l(p) = Σ_{i≠j} nij µ̂ij log(µ̂ij / µij)
  ◮ Converges to the minimum of the KL divergence
◮ Wu, Lin, Weng (WLW), 2004
  ◮ µij = pi / (pi + pj) ⇒ µij / µji = pi / pj
  ◮ Approximately solve min_p Σ_{i=1}^{k} Σ_{j≠i} (µ̂ji pi − µ̂ij pj)², s.t. Σ_{i=1}^{k} pi = 1, pi ≥ 0
  ◮ Guaranteed convergence
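The VPC voting rule can be sketched as follows, assuming the pairwise estimates are stored once per unordered pair (i < j) with µ̂ji = 1 − µ̂ij; names and values are illustrative.

```python
# Voted pairwise classification: each pairwise estimate casts a vote for the
# more probable class of its pair; predict the class with the most votes.
def vpc_predict(mu_hat, k):
    votes = [0] * k
    for (i, j), m_ij in mu_hat.items():
        votes[i if m_ij > 0.5 else j] += 1   # m_ij > 0.5 means mu_ij > mu_ji
    return max(range(k), key=lambda c: votes[c])

# 3-class example: class 1 wins both of its pairwise contests
mu_hat = {(0, 1): 0.2, (0, 2): 0.7, (1, 2): 0.9}
```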

SLIDE 61

Pairwise Classification

◮ Pros (Fürnkranz, 2002)
  ◮ Smaller subproblems
  ◮ Simpler subproblems
  ◮ Improved accuracy (disputed by Rifkin & Klautau, 2004)
◮ Cons
  ◮ Larger number of subproblems than one-vs-all
  ◮ Each pairwise classifier is trained on only two of the classes but makes predictions for instances from any class (Hastie & Tibshirani, 1996; Cutzu, 2003), e.g. a classifier trained on cA and cB may have unpredictable behavior for instances with y(x) = cC
SLIDE 62

Thesis Statement - Part III

◮ When solving a multiclass problem with a set of pairwise binary classifiers, incorporation of the probability of membership in each pair improves performance.

SLIDE 63

Probabilistic Pairwise Classification: Derivation 1/2

Theorem of Total Probability:

p(b|x) = Σ_{i=1}^{N} p(b|ai, x) p(ai|x)   (1)

Assumes:
◮ a1..aN mutually exclusive and exhaustive, so Σ_{i=1}^{N} p(ai|x) = 1

Let:
◮ b = ci
◮ N = 2
◮ a1 = ci ∪ cj
◮ a2 = L − ci − cj, for L = {c1..ck}

Then:

p(ci|L, x) = p(ci|ci ∪ cj, x) p(ci ∪ cj|L, x) + p(ci|L − ci − cj, x) p(L − ci − cj|L, x)

SLIDE 64

Probabilistic Pairwise Classification: Derivation 2/2

◮ But

p(ci|L − ci − cj, x) = 0   (2)

⇒ p(ci|x) = p(ci|ci ∪ cj, x) p(ci ∪ cj|L, x)   (3)

◮ Average over all j ≠ i:

p̂(ci|L, x) = (1 / (k − 1)) Σ_{j≠i} p̂(ci|ci ∪ cj, x) p̂(ci ∪ cj|L, x)   (4)

◮ Normalize so that Σ_i p̂(ci|L, x) = 1.
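Equation (4) plus the normalization step can be sketched as follows; the dictionaries `pair` and `weight` and their toy values are illustrative assumptions, standing in for trained pairwise and pair-vs-rest classifiers.

```python
# Sketch of the PPC combination rule (equation 4) plus normalization.
# pair[(i, j)] ~ p(ci | ci or cj, x) and weight[(i, j)] ~ p(ci or cj | L, x),
# each stored once per unordered pair (i < j).
import numpy as np

def ppc_combine(pair, weight, k):
    p = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if j == i:
                continue
            mu = pair[(i, j)] if (i, j) in pair else 1.0 - pair[(j, i)]
            w = weight[(i, j)] if (i, j) in weight else weight[(j, i)]
            p[i] += mu * w
        p[i] /= k - 1              # average over the k-1 pairs containing ci
    return p / p.sum()             # normalize so the estimates sum to 1

# 3-class toy example
pair = {(0, 1): 0.8, (0, 2): 0.7, (1, 2): 0.3}
weight = {(0, 1): 0.5, (0, 2): 0.7, (1, 2): 0.8}
p = ppc_combine(pair, weight, 3)
```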

SLIDE 65

Comparison to Other Pairwise Classification Methods

◮ PPC
  ◮ Solves for each term pi(x) independently
  ◮ Models pi + pj = p(i or j|L, x) directly
  ◮ Conceptually simpler
  ◮ Easier to implement
  ◮ Theoretically well motivated
◮ The Hastie-Tibshirani (HT) method approximates pi = Σ_{j≠i} (2 / (k(k − 1))) µij (Wu et al., 2004)
  ◮ Equivalent to our method under the assumption pi + pj = 2/k
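The equivalence claim can be checked numerically: with the constant pair-vs-rest weight pi + pj = 2/k, the PPC average is proportional to the HT-style sum pi ∝ Σ_{j≠i} µij. The toy µ values below are illustrative only.

```python
# Numeric check: PPC with constant weight 2/k reduces (up to normalization)
# to the HT approximation. Toy 3-class pairwise probabilities.
k = 3
mu = {(0, 1): 0.8, (0, 2): 0.7, (1, 2): 0.3}

def mu_ij(i, j):
    return mu[(i, j)] if (i, j) in mu else 1.0 - mu[(j, i)]

ht = [sum(mu_ij(i, j) for j in range(k) if j != i) for i in range(k)]
ppc = [sum(mu_ij(i, j) * (2.0 / k) for j in range(k) if j != i) / (k - 1)
       for i in range(k)]
ratios = [p / h for p, h in zip(ppc, ht)]   # constant ratio = (2/k)/(k-1)
```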

SLIDE 66

Computational Complexity

Computational complexity of one-vs-all (OVA), pairwise coupling (PC), and probabilistic pairwise classification (PPC):

                              OVA       PC          PPC
subproblems                   k         k(k−1)/2    k(k−1)
instances per subproblem      N         2N/k        N (half) + 2N/k (other half)
computational complexity/SVM  O(kN³)    O(k⁻¹N³)    O(k²N³)
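The subproblem counts in the table can be sanity-checked in a few lines (function name illustrative):

```python
# Subproblem counts: OVA trains k classifiers, pairwise coupling k(k-1)/2,
# and PPC k(k-1) (pairwise plus pair-vs-rest problems).
def subproblem_counts(k):
    return {"ova": k, "pc": k * (k - 1) // 2, "ppc": k * (k - 1)}

counts_10 = subproblem_counts(10)   # e.g. a 10-class problem
```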

SLIDE 67

Experiments

◮ Base Classifiers
  ◮ Decision Tree (J48)
  ◮ K-Nearest Neighbor (KNN)
  ◮ Random Forests (RF-100)
  ◮ Support Vector Machines (SVM-121)
◮ Multiclass Classification Methods
  ◮ Multi (for J48, KNN, RF-100)
  ◮ Voted Pairwise Classification (VPC)
  ◮ Hastie-Tibshirani (HT)
  ◮ Wu, Lin, Weng (WLW)
  ◮ Probabilistic Pairwise Classification (PPC)
◮ Metrics
  ◮ Accuracy
  ◮ Rectified Brier: 1 − b(x) = 1 − (1/d) Σ_j (tj(x) − p̂j(x))², where tj(x) = 1(y(x) = cj)
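The rectified Brier metric can be sketched as follows (the function name is illustrative):

```python
# Rectified Brier score: 1 minus the mean squared difference between the
# one-hot target t and the predicted distribution p_hat.
import numpy as np

def rectified_brier(p_hat, true_class):
    t = np.zeros_like(p_hat)
    t[true_class] = 1.0               # t_j(x) = 1(y(x) = c_j)
    return 1.0 - np.mean((t - p_hat) ** 2)
```

A perfect probabilistic prediction scores 1.0; hedged or wrong predictions score lower.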

SLIDE 70

Average Accuracy

[Bar chart: accuracy (%) of multiclass, VPC, HT, WLW, and PPC for each base classifier (J48, KNN, RF-100, SVM-121); y-axis roughly 72.5–79.0%.]

Accuracy averaged over all 20 data sets.

SLIDE 71

Average Brier Score

[Bar chart: rectified Brier score (%) of multiclass, VPC, HT, WLW, and PPC for each base classifier (J48, KNN, RF-100, SVM-121); y-axis roughly 91.5–95.5%.]

Rectified Brier score averaged over all 20 data sets.

SLIDE 72

Average Ranks

SLIDE 73

Varying Base Classifier Accuracy

Accuracy vs. Number of Trees Averaged over 20 Data Sets

[Line plot: accuracy (%) vs. log10(number of trees) for multi, VPC, HT, WLW, and PPC; accuracy roughly 76.25–79.25%.]

Accuracy vs number of trees in random forest

SLIDE 74

Learning Curves

[Line plot: accuracy (%) vs. number of data points (450–1,000) for multi, VPC, HT, WLW, and PPC.]

Accuracy vs sample size for 10 largest data sets

SLIDE 75

Duplicate Decision Boundaries Favors MULTI

◮ Hypothesis: a direct multiclass method will outperform PPC when decision boundaries are shared
◮ Construct a synthetic data set meant to favor multi-j48
◮ Decision boundaries are shared

SLIDE 76

Duplicate Decision Boundaries: Noiseless Synthetic Data

[Scatter plot: noiseless synthetic data set with four classes A–D in a 2D feature space; decision boundaries are shared.]

Accuracy (standard error):
multi-j48   99.2 (0.08)
ppc-j48     98.7 (0.10)

SLIDE 77

Duplicate Decision Boundaries: Noisy Synthetic Data

[Scatter plot: noisy synthetic data set with four classes A–D in a 2D feature space.]

Accuracy (standard error):
multi-j48   84.5 (0.34)
ppc-j48     86.0 (0.31)

SLIDE 78

PPC More Accurate at Large Number of Classes 1/2

[Line plot: accuracy relative to RF-100 (%) vs. number of classes (3–20) for VPC, HT, WLW, and PPC.]

Method accuracy relative to RF-100

SLIDE 79

PPC More Accurate at Large Number of Classes 2/2

[Line plot: accuracy relative to RF-100 (%) vs. number of classes (2–15) for the discretized regression data sets housing, autoMpg, meta, pbc, quake, sensory, strike, cholesterol, and cleveland, plus their average.]

PPC relative to RF-100 for discretized regression data sets

SLIDE 80

Terms in PPC estimate equally important

◮ Hypothesis: both terms in the PPC estimate are equally important

p̂(ci|L, x) = (1 / (k − 1)) Σ_{j≠i} p̂(ci|ci ∪ cj, x) p̂(ci ∪ cj|L, x)

◮ Pairwise term: p̂(ci|ci ∪ cj, x)
◮ Weight (pair-vs-rest) term: p̂(ci ∪ cj|L, x)
◮ Use J48 decision trees, 100 replications, 20 data sets.

Adjusted p-values under various degradations:

hypothesis              pHolm
both vs. none           2.25 × 10−10
no-pair vs. none        6.87 × 10−5
no-weight vs. none      7.49 × 10−4
both vs. no-weight      0.012
both vs. no-pair        0.04693
no-weight vs. no-pair   0.540291
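The pHolm column reflects Holm's step-down adjustment for multiple comparisons; a minimal sketch (the input p-values below are illustrative, not the dissertation's):

```python
# Holm step-down adjustment: multiply the r-th smallest p-value by (m - r),
# enforce monotonicity, and cap at 1.
def holm_adjust(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

adj = holm_adjust([0.01, 0.04, 0.03])
```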

SLIDE 81

PPC Summary & Conclusions

◮ Introduced a new pairwise classification algorithm, PPC
  ◮ Based on the Theorem of Total Probability
  ◮ Explicitly models p(ci ∪ cj|L, x)
◮ Outperforms or ties related methods
  ◮ For several base classifiers, metrics, and data sets
◮ Some data sets benefit from direct multiclass methods
◮ PPC works well at a large number of classes
◮ Future Work
  ◮ Faster but less accurate pair-vs-rest classifier?
  ◮ Independent vs. shared hyperparameters in PPC?

SLIDE 82

Thesis Statement

Multiclass classification problems can be productively solved by combining multiple classifiers. Specifically:

◮ In linear combinations of multiclass classifiers, regularization significantly improves performance.
◮ When solving a multiclass problem with a set of binary classifiers, it is more effective to constrain subproblems to use the same hyperparameters than to optimize each independently.
◮ When solving a multiclass problem with a set of pairwise binary classifiers, incorporation of the probability of membership in each pair improves performance.

SLIDE 83

Acknowledgments

◮ PhET Interactive Simulations
◮ NSF Grants
  ◮ SBE-0542013
  ◮ Science of Learning Center (Garrison Cottrell, PI)
  ◮ BCS-0339103
  ◮ BCS-720375
  ◮ SBE-0518699
◮ Mike Mozer, Greg Grudic
◮ Dissertation Support Group/CAPS
◮ Turing Institute
◮ UCI Repository

SLIDE 84

Questions?

◮ Questions?