

SLIDE 1

Regularized Linear Models in Stacked Generalization

Sam Reid and Greg Grudic

Department of Computer Science, University of Colorado at Boulder, USA

June 11, 2009

Reid & Grudic (Univ. of Colo. at Boulder) Regularized Linear Models in Stacking June 11, 2009 1 / 33

SLIDE 2

How to combine classifiers?

- Which classifiers? How to combine?
- AdaBoost and Random Forest prescribe both the classifiers and the combiner
- We want L ≥ 1000 heterogeneous classifiers
- Vote / Average / Forward Stepwise Selection / Linear / Nonlinear?
- Our combiner: a regularized linear model
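
The two simplest combiners mentioned above can be sketched in a few lines (an illustration of ours, not code from the talk; the toy posteriors are made up):

```python
import numpy as np

# Posterior predictions from L classifiers: shape (L, n_samples, k_classes).
posteriors = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # classifier 1
    [[0.4, 0.5, 0.1], [0.2, 0.5, 0.3]],   # classifier 2
    [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]],   # classifier 3
])

def average_combiner(p):
    # Average the posteriors over classifiers, then pick the argmax class.
    return p.mean(axis=0).argmax(axis=1)

def vote_combiner(p):
    # Each classifier votes for its argmax class; the majority class wins.
    votes = p.argmax(axis=2)                              # (L, n_samples)
    k = p.shape[2]
    counts = np.apply_along_axis(np.bincount, 0, votes, None, k)
    return counts.argmax(axis=0)                          # per-sample winner

print(average_combiner(posteriors))
print(vote_combiner(posteriors))
```

Both are fixed rules; the rest of the talk replaces them with a combiner that is itself trained.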

SLIDE 3

Outline

1. Introduction
   - How to combine classifiers?
2. Model
   - Stacked Generalization
   - StackingC
   - Linear Regression and Regularization
3. Experiments
   - Setup
   - Results
   - Discussion

SLIDE 4

Outline

1. Introduction
   - How to combine classifiers?
2. Model
   - Stacked Generalization
   - StackingC
   - Linear Regression and Regularization
3. Experiments
   - Setup
   - Results
   - Discussion

SLIDE 5

Stacked Generalization

- The combiner is produced by a classification algorithm
- Training set = base classifier predictions on unseen data + labels
- Learn to compensate for classifier biases
- Linear and nonlinear combiners
- What classification algorithm should be used?
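
A minimal sketch of this pipeline (our own scikit-learn illustration, not the authors' code; the talk combines ~1000 heterogeneous classifiers with per-class regularized linear regression, shrunk here to two base learners and a regularized logistic combiner for brevity):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

base_learners = [DecisionTreeClassifier(random_state=0),
                 KNeighborsClassifier(n_neighbors=5)]

# Level-1 training data: each base classifier's posterior predictions on
# data it did not see during training (out-of-fold, via cross-validation),
# paired with the true labels.
meta_features = np.hstack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    for clf in base_learners
])  # shape (n_samples, L * k) = (150, 2 * 3)

# The combiner is itself a learned (regularized) model.
combiner = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_features.shape)  # (150, 6)
```

Training the combiner on out-of-fold predictions rather than resubstitution predictions is what lets it learn to compensate for classifier biases without being fooled by overfit base models.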

SLIDE 6

Stacked Generalization - Combiners

- Wolpert, 1992: relatively global, smooth combiners
- Ting & Witten, 1999: linear regression combiners
- Seewald, 2002: low-dimensional combiner inputs

SLIDE 7

Problems

Caruana et al., 2004: Stacking performs poorly because regression "overfits dramatically when there are 2000 highly correlated input models and only 1k points in the validation set."

How can we scale up stacking to a large number of classifiers?

Our hypothesis: a regularized linear combiner will
- reduce variance
- prevent overfitting
- increase accuracy

SLIDE 8

Posterior Predictions in Multiclass Classification

(Diagram: input x → classifier y(x) → posterior prediction vector p.)

Classification with d = 4, k = 3

SLIDE 9

Ensemble Methods for Multiclass Classification

(Diagram: input x → base classifiers y1(x), y2(x) → predictions x′1, x′2 → combiner y′(x′1, x′2) → final prediction ŷ.)

Multiple classifier system with 2 classifiers

SLIDE 10

Stacked Generalization

(Diagram: input x → base classifiers y1(x), y2(x) → stacked feature vector x′ → learned combiner y′(x′) → final prediction ŷ.)

Stacked generalization with 2 classifiers

SLIDE 11

Classification via Regression

(Diagram: base classifier outputs form the meta-feature vector x′; per-class regressors yA(x′), yB(x′), yC(x′) each score one class, and ŷ is the highest-scoring class.)

Stacking using Classification via Regression

SLIDE 12

StackingC

(Diagram: as before, but each per-class regressor sees only its class-specific subset of the meta-features: yA(x′A), yB(x′B), yC(x′C).)

StackingC, class-conscious stacked generalization
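
The class-conscious pruning in StackingC amounts to column selection on the meta-feature matrix: with L classifiers each emitting k posteriors, the regressor for class j sees only the L columns that predict class j. A sketch (array shapes and column ordering are our assumption):

```python
import numpy as np

L, n, k = 4, 10, 3                 # classifiers, samples, classes
rng = np.random.default_rng(0)
meta = rng.random((n, L * k))      # columns ordered [clf0 class 0..k-1, clf1 ...]

def stackingc_columns(j, L, k):
    # Indices of the class-j posterior from every one of the L classifiers.
    return [c * k + j for c in range(L)]

cols = stackingc_columns(1, L, k)
sub = meta[:, cols]                # inputs for the class-1 regressor
print(cols)        # [1, 4, 7, 10]
print(sub.shape)   # (10, 4): d drops from L*k to L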

SLIDE 13

Linear Models

Linear model for use in Stacking or StackingC:

ŷ = β₀ + Σᵢ₌₁ᵈ βᵢxᵢ

Least Squares: L = |y − Xβ|²

Problems:
- High variance
- Overfitting
- Ill-posed problem
- Poor accuracy

SLIDE 14

Regularization

- Increase bias a little, decrease variance a lot
- Constrain weights ⇒ reduce flexibility ⇒ prevent overfitting

Penalty terms in our studies:
- Ridge Regression: L = |y − Xβ|² + λ|β|₂²
- Lasso Regression: L = |y − Xβ|² + λ|β|₁
- Elastic Net Regression: L = |y − Xβ|² + λ|β|₂² + (1 − λ)|β|₁

Lasso/Elastic Net produce sparse models
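
For intuition (a numpy sketch of our own, not from the slides): ridge has the closed form β = (XᵀX + λI)⁻¹Xᵀy, while for an orthonormal design the lasso solution is soft-thresholding of the least-squares coefficients, which is exactly where the sparsity comes from:

```python
import numpy as np

# Ridge: closed-form solution beta = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso with an orthonormal design (X^T X = I): coordinate-wise
# soft-thresholding of the least-squares solution by lambda / 2.
def lasso_orthonormal(X, y, lam):
    b_ols = X.T @ y
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2, 0.0)

X = np.eye(4)                            # trivially orthonormal design
y = np.array([3.0, 0.2, -1.0, 0.05])

print(ridge(X, y, lam=1.0))              # every weight shrunk toward 0: y / 2
print(lasso_orthonormal(X, y, lam=1.0))  # [2.5, 0.0, -0.5, 0.0]: small weights zeroed
```

Ridge shrinks every coefficient but keeps all of them; lasso sets the small ones exactly to zero, discarding weak classifier-prediction pairs.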

SLIDE 15

Model

- About 1000 base classifiers making probabilistic predictions
- Stacked Generalization to create the combiner
- StackingC to reduce dimensionality
- Convert multiclass to regression
- Use linear regression
- Regularization on the weights

SLIDE 16

Model

- About 1000 base classifiers making probabilistic predictions
- Stacked Generalization to create the combiner
- StackingC to reduce dimensionality
- Convert multiclass to regression
- Use linear regression
- Regularization on the weights

SLIDE 17

Outline

1. Introduction
   - How to combine classifiers?
2. Model
   - Stacked Generalization
   - StackingC
   - Linear Regression and Regularization
3. Experiments
   - Setup
   - Results
   - Discussion

SLIDE 18

Datasets

Table: Datasets and their properties

Dataset               Attributes   Instances   Classes
balance-scale                  4         625         3
glass                          9         214         6
letter                        16        4000        26
mfeat-morphological            6        2000        10
optdigits                     64        5620        10
sat-image                     36        6435         6
segment                       19        2310         7
vehicle                       18         846         4
waveform-5000                 40        5000         3
yeast                          8        1484        10

SLIDE 19

Base Classifiers

About 1000 base classifiers for each problem:

1. Neural Network
2. Support Vector Machine (C-SVM from LibSVM)
3. K-Nearest Neighbor
4. Decision Stump
5. Decision Tree
6. AdaBoost.M1
7. Bagging classifier
8. Random Forest (Weka)
9. Random Forest (R)

SLIDE 20

Results

Dataset     select-best    vote     average   sg-linear   sg-lasso   sg-ridge
balance          0.9872   0.9234     0.9265      0.9399     0.9610     0.9796
glass            0.6689   0.5887     0.6167      0.5275     0.6429     0.7271
letter           0.8747   0.8400     0.8565      0.5787     0.6410     0.9002
mfeat            0.7426   0.7390     0.7320      0.4534     0.4712     0.7670
optdigits        0.9893   0.9847     0.9858      0.9851     0.9660     0.9899
sat-image        0.9140   0.8906     0.9024      0.8597     0.8940     0.9257
segment          0.9768   0.9567     0.9654      0.9176     0.6147     0.9799
vehicle          0.7905   0.7991     0.8133      0.6312     0.7716     0.8142
waveform         0.8534   0.8584     0.8624      0.7230     0.6263     0.8599
yeast            0.6205   0.6024     0.6105      0.2892     0.4218     0.5970

SLIDE 21

Statistical Analysis

Pairwise Wilcoxon signed-rank tests:

- Ridge outperforms unregularized at p ≤ 0.002
- Lasso outperforms unregularized at p ≤ 0.375
  → Validates hypothesis: regularization improves accuracy
- Ridge outperforms lasso at p ≤ 0.0019
  → Dense techniques outperform sparse techniques
- Ridge outperforms Select-Best at p ≤ 0.084
  → A properly trained model beats the single best classifier
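
The ridge-vs-unregularized comparison can be reproduced from the results table with scipy's signed-rank test: sg-ridge beats sg-linear on all 10 datasets, which gives the exact two-sided p = 2/2¹⁰ ≈ 0.002 quoted on the slide.

```python
from scipy.stats import wilcoxon

# Accuracies from the results table (slide 20).
sg_linear = [0.9399, 0.5275, 0.5787, 0.4534, 0.9851,
             0.8597, 0.9176, 0.6312, 0.7230, 0.2892]
sg_ridge  = [0.9796, 0.7271, 0.9002, 0.7670, 0.9899,
             0.9257, 0.9799, 0.8142, 0.8599, 0.5970]

# Two-sided Wilcoxon signed-rank test over the 10 paired accuracies.
stat, p = wilcoxon(sg_ridge, sg_linear)
print(stat, p)  # statistic 0.0: ridge wins every pairing; p ~ 0.002
```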

SLIDE 22

Baseline Algorithms

- Average outperforms Vote at p ≤ 0.014
  → Probabilistic predictions are valuable
- Select-Best outperforms Average at p ≤ 0.084
  → Validation/training is valuable

SLIDE 23

Subproblem/Overall Accuracy - I

(Plot: RMSE on one subproblem vs. ridge parameter, swept from 10⁻⁹ to 10³.)

SLIDE 24

Subproblem/Overall Accuracy - II

(Plot: overall accuracy vs. ridge parameter, swept from 10⁻⁹ to 10³.)

SLIDE 25

Subproblem/Overall Accuracy - III

(Plot: overall accuracy vs. RMSE on subproblem 1.)

SLIDE 26

Accuracy for Elastic Nets

(Plot: overall accuracy vs. penalty for elastic net at alpha = 0.95, 0.5, and 0.05, with the select-best baseline.)

Figure: Overall accuracy on sat-image with various parameters for the elastic net.

SLIDE 27

Partial Ensemble Selection

- Sparse techniques perform partial ensemble selection
- Choose from classifiers and their individual predictions
- Allow classifiers to focus on subproblems
- Example: benefit from a classifier good at separating A from B but poor at A/C and B/C

SLIDE 28

Partial Selection

(Plot: coefficient profiles vs. log lambda for classes 1-3; the annotations along the top axis give the number of nonzero coefficients at several values of lambda.)

Figure: Coefficient profiles for the first three subproblems in StackingC for the sat-image dataset with elastic net regression at α = 0.95.

SLIDE 29

Selected Classifiers

Classifier (classes: red, cotton, grey, damp, veg, v.damp; last value = total)
adaboost-500      0.063  0.014  0.000  0.0226         0.100
ann-0.5-32-1000   0.061  0.035  0.004                 0.100
ann-0.5-16-500    0.039  0.018  0.009  0.034          0.101
ann-0.9-16-500    0.002  0.082  0.007  0.016          0.108
ann-0.5-32-500    0.000  0.075  0.100  0.027          0.111
knn-1             0.076  0.065  0.008  0.097          0.246

Table: Selected posterior probabilities and corresponding weights for the sat-image problem for elastic net StackingC with α = 0.95 for the 6 models with highest total weights.

SLIDE 30

Conclusions

- Regularization is essential in linear StackingC
- A trained linear combination outperforms Select-Best
- Dense combiners outperform sparse combiners
- Sparse models allow classifiers to specialize in subproblems

SLIDE 31

Future Work

- Examine full Bayesian solutions
- Constrain coefficients to be positive
- Choose a single regularizer for all subproblems

SLIDE 32

Acknowledgments

- PhET Interactive Simulations
- Turing Institute
- UCI Repository
- University of Colorado at Boulder

SLIDE 33

Questions?
