

SLIDE 1

Regularized Linear Models in Stacked Generalization

Sam Reid and Greg Grudic

Department of Computer Science, University of Colorado at Boulder, USA

June 11, 2009

Reid & Grudic (Univ. of Colo. at Boulder) Regularized Linear Models in Stacking June 11, 2009 1 / 33

SLIDE 2

How to combine classifiers?

- Which classifiers? How to combine?
- AdaBoost and Random Forest prescribe both the classifiers and the combiner
- We want L ≥ 1000 heterogeneous classifiers
- Vote / Average / Forward Stepwise Selection / Linear / Nonlinear?
- Our combiner: a regularized linear model
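
The two simplest combiners mentioned above can be sketched in a few lines (an illustration of ours, not code from the talk; the toy posteriors are made up):

```python
import numpy as np

# Posterior predictions from L classifiers: shape (L, n_samples, k_classes).
posteriors = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],   # classifier 1
    [[0.4, 0.5, 0.1], [0.2, 0.5, 0.3]],   # classifier 2
    [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]],   # classifier 3
])

def average_combiner(p):
    # Average the posteriors over classifiers, then pick the argmax class.
    return p.mean(axis=0).argmax(axis=1)

def vote_combiner(p):
    # Each classifier votes for its argmax class; the majority class wins.
    votes = p.argmax(axis=2)                              # (L, n_samples)
    k = p.shape[2]
    counts = np.apply_along_axis(np.bincount, 0, votes, None, k)
    return counts.argmax(axis=0)                          # per-sample winner

print(average_combiner(posteriors))
print(vote_combiner(posteriors))
```

Both are fixed rules; the rest of the talk replaces them with a combiner that is itself trained.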

SLIDE 3

Outline

1. Introduction
   - How to combine classifiers?
2. Model
   - Stacked Generalization
   - StackingC
   - Linear Regression and Regularization
3. Experiments
   - Setup
   - Results
   - Discussion

SLIDE 4

Outline

1. Introduction
   - How to combine classifiers?
2. Model
   - Stacked Generalization
   - StackingC
   - Linear Regression and Regularization
3. Experiments
   - Setup
   - Results
   - Discussion

SLIDE 5

Stacked Generalization

- The combiner is produced by a classification algorithm
- Training set = base classifier predictions on unseen data + labels
- Learn to compensate for classifier biases
- Linear and nonlinear combiners
- What classification algorithm should be used?
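
A minimal sketch of this pipeline (our own scikit-learn illustration, not the authors' code; the talk combines ~1000 heterogeneous classifiers with per-class regularized linear regression, shrunk here to two base learners and a regularized logistic combiner for brevity):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

base_learners = [DecisionTreeClassifier(random_state=0),
                 KNeighborsClassifier(n_neighbors=5)]

# Level-1 training data: each base classifier's posterior predictions on
# data it did not see during training (out-of-fold, via cross-validation),
# paired with the true labels.
meta_features = np.hstack([
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    for clf in base_learners
])  # shape (n_samples, L * k) = (150, 2 * 3)

# The combiner is itself a learned (regularized) model.
combiner = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_features.shape)  # (150, 6)
```

Training the combiner on out-of-fold predictions rather than resubstitution predictions is what lets it learn to compensate for classifier biases without being fooled by overfit base models.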

SLIDE 6

Stacked Generalization - Combiners

- Wolpert, 1992: relatively global, smooth combiners
- Ting & Witten, 1999: linear regression combiners
- Seewald, 2002: low-dimensional combiner inputs

SLIDE 7

Problems

Caruana et al., 2004: Stacking performs poorly because regression "overfits dramatically when there are 2000 highly correlated input models and only 1k points in the validation set."

How can we scale up stacking to a large number of classifiers?

Our hypothesis: a regularized linear combiner will
- reduce variance
- prevent overfitting
- increase accuracy

SLIDE 8

Posterior Predictions in Multiclass Classification

(Diagram: input x → classifier y(x) → posterior prediction vector p.)

Classification with d = 4, k = 3

SLIDE 9

Ensemble Methods for Multiclass Classification

(Diagram: input x → base classifiers y1(x), y2(x) → predictions x′1, x′2 → combiner y′(x′1, x′2) → final prediction ŷ.)

Multiple classifier system with 2 classifiers

SLIDE 10

Stacked Generalization

(Diagram: input x → base classifiers y1(x), y2(x) → stacked feature vector x′ → learned combiner y′(x′) → final prediction ŷ.)

Stacked generalization with 2 classifiers

SLIDE 11

Classification via Regression

(Diagram: base classifier outputs form the meta-feature vector x′; per-class regressors yA(x′), yB(x′), yC(x′) each score one class, and ŷ is the highest-scoring class.)

Stacking using Classification via Regression

SLIDE 12

StackingC

(Diagram: as before, but each per-class regressor sees only its class-specific subset of the meta-features: yA(x′A), yB(x′B), yC(x′C).)

StackingC, class-conscious stacked generalization
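
The class-conscious pruning in StackingC amounts to column selection on the meta-feature matrix: with L classifiers each emitting k posteriors, the regressor for class j sees only the L columns that predict class j. A sketch (array shapes and column ordering are our assumption):

```python
import numpy as np

L, n, k = 4, 10, 3                 # classifiers, samples, classes
rng = np.random.default_rng(0)
meta = rng.random((n, L * k))      # columns ordered [clf0 class 0..k-1, clf1 ...]

def stackingc_columns(j, L, k):
    # Indices of the class-j posterior from every one of the L classifiers.
    return [c * k + j for c in range(L)]

cols = stackingc_columns(1, L, k)
sub = meta[:, cols]                # inputs for the class-1 regressor
print(cols)        # [1, 4, 7, 10]
print(sub.shape)   # (10, 4): d drops from L*k to L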

SLIDE 13

Linear Models

Linear model for use in Stacking or StackingC:

ŷ = β₀ + Σᵢ₌₁ᵈ βᵢxᵢ

Least Squares: L = |y − Xβ|²

Problems:
- High variance
- Overfitting
- Ill-posed problem
- Poor accuracy

SLIDE 14

Regularization

- Increase bias a little, decrease variance a lot
- Constrain weights ⇒ reduce flexibility ⇒ prevent overfitting

Penalty terms in our studies:
- Ridge Regression: L = |y − Xβ|² + λ|β|₂²
- Lasso Regression: L = |y − Xβ|² + λ|β|₁
- Elastic Net Regression: L = |y − Xβ|² + λ|β|₂² + (1 − λ)|β|₁

Lasso/Elastic Net produce sparse models
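
For intuition (a numpy sketch of our own, not from the slides): ridge has the closed form β = (XᵀX + λI)⁻¹Xᵀy, while for an orthonormal design the lasso solution is soft-thresholding of the least-squares coefficients, which is exactly where the sparsity comes from:

```python
import numpy as np

# Ridge: closed-form solution beta = (X^T X + lambda I)^{-1} X^T y.
def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Lasso with an orthonormal design (X^T X = I): coordinate-wise
# soft-thresholding of the least-squares solution by lambda / 2.
def lasso_orthonormal(X, y, lam):
    b_ols = X.T @ y
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2, 0.0)

X = np.eye(4)                            # trivially orthonormal design
y = np.array([3.0, 0.2, -1.0, 0.05])

print(ridge(X, y, lam=1.0))              # every weight shrunk toward 0: y / 2
print(lasso_orthonormal(X, y, lam=1.0))  # [2.5, 0.0, -0.5, 0.0]: small weights zeroed
```

Ridge shrinks every coefficient but keeps all of them; lasso sets the small ones exactly to zero, discarding weak classifier-prediction pairs.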

SLIDE 15

Model

- About 1000 base classifiers making probabilistic predictions
- Stacked Generalization to create the combiner
- StackingC to reduce dimensionality
- Convert multiclass to regression
- Use linear regression
- Regularization on the weights

SLIDE 16

Model

- About 1000 base classifiers making probabilistic predictions
- Stacked Generalization to create the combiner
- StackingC to reduce dimensionality
- Convert multiclass to regression
- Use linear regression
- Regularization on the weights

SLIDE 17

Outline

1. Introduction
   - How to combine classifiers?
2. Model
   - Stacked Generalization
   - StackingC
   - Linear Regression and Regularization
3. Experiments
   - Setup
   - Results
   - Discussion

SLIDE 18

Datasets

Table: Datasets and their properties

Dataset               Attributes   Instances   Classes
balance-scale                  4         625         3
glass                          9         214         6
letter                        16        4000        26
mfeat-morphological            6        2000        10
optdigits                     64        5620        10
sat-image                     36        6435         6
segment                       19        2310         7
vehicle                       18         846         4
waveform-5000                 40        5000         3
yeast                          8        1484        10

SLIDE 19

Base Classifiers

About 1000 base classifiers for each problem:

1. Neural Network
2. Support Vector Machine (C-SVM from LibSVM)
3. K-Nearest Neighbor
4. Decision Stump
5. Decision Tree
6. AdaBoost.M1
7. Bagging classifier
8. Random Forest (Weka)
9. Random Forest (R)

SLIDE 20

Results

Dataset     select-best    vote     average   sg-linear   sg-lasso   sg-ridge
balance          0.9872   0.9234     0.9265      0.9399     0.9610     0.9796
glass            0.6689   0.5887     0.6167      0.5275     0.6429     0.7271
letter           0.8747   0.8400     0.8565      0.5787     0.6410     0.9002
mfeat            0.7426   0.7390     0.7320      0.4534     0.4712     0.7670
optdigits        0.9893   0.9847     0.9858      0.9851     0.9660     0.9899
sat-image        0.9140   0.8906     0.9024      0.8597     0.8940     0.9257
segment          0.9768   0.9567     0.9654      0.9176     0.6147     0.9799
vehicle          0.7905   0.7991     0.8133      0.6312     0.7716     0.8142
waveform         0.8534   0.8584     0.8624      0.7230     0.6263     0.8599
yeast            0.6205   0.6024     0.6105      0.2892     0.4218     0.5970

SLIDE 21

Statistical Analysis

Pairwise Wilcoxon signed-rank tests:

- Ridge outperforms unregularized at p ≤ 0.002
- Lasso outperforms unregularized at p ≤ 0.375
  → Validates hypothesis: regularization improves accuracy
- Ridge outperforms lasso at p ≤ 0.0019
  → Dense techniques outperform sparse techniques
- Ridge outperforms Select-Best at p ≤ 0.084
  → A properly trained model beats the single best classifier
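
The ridge-vs-unregularized comparison can be reproduced from the results table with scipy's signed-rank test: sg-ridge beats sg-linear on all 10 datasets, which gives the exact two-sided p = 2/2¹⁰ ≈ 0.002 quoted on the slide.

```python
from scipy.stats import wilcoxon

# Accuracies from the results table (slide 20).
sg_linear = [0.9399, 0.5275, 0.5787, 0.4534, 0.9851,
             0.8597, 0.9176, 0.6312, 0.7230, 0.2892]
sg_ridge  = [0.9796, 0.7271, 0.9002, 0.7670, 0.9899,
             0.9257, 0.9799, 0.8142, 0.8599, 0.5970]

# Two-sided Wilcoxon signed-rank test over the 10 paired accuracies.
stat, p = wilcoxon(sg_ridge, sg_linear)
print(stat, p)  # statistic 0.0: ridge wins every pairing; p ~ 0.002
```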

SLIDE 22

Baseline Algorithms

- Average outperforms Vote at p ≤ 0.014
  → Probabilistic predictions are valuable
- Select-Best outperforms Average at p ≤ 0.084
  → Validation/training is valuable

SLIDE 23

Subproblem/Overall Accuracy - I

(Plot: RMSE on one subproblem vs. ridge parameter, swept from 10⁻⁹ to 10³.)

SLIDE 24

Subproblem/Overall Accuracy - II

(Plot: overall accuracy vs. ridge parameter, swept from 10⁻⁹ to 10³.)

SLIDE 25

Subproblem/Overall Accuracy - III

(Plot: overall accuracy vs. RMSE on subproblem 1.)

SLIDE 26

Accuracy for Elastic Nets

(Plot: overall accuracy vs. penalty for elastic net at alpha = 0.95, 0.5, and 0.05, with the select-best baseline.)

Figure: Overall accuracy on sat-image with various parameters for the elastic net.

SLIDE 27

Partial Ensemble Selection

- Sparse techniques perform partial ensemble selection
- Choose from classifiers and their individual predictions
- Allow classifiers to focus on subproblems
- Example: benefit from a classifier good at separating A from B but poor at A/C and B/C

SLIDE 28

Partial Selection

(Plot: coefficient profiles vs. log lambda for classes 1-3; the annotations along the top axis give the number of nonzero coefficients at several values of lambda.)

Figure: Coefficient profiles for the first three subproblems in StackingC for the sat-image dataset with elastic net regression at α = 0.95.

SLIDE 29

Selected Classifiers

Classifier (classes: red, cotton, grey, damp, veg, v.damp; last value = total)
adaboost-500      0.063  0.014  0.000  0.0226         0.100
ann-0.5-32-1000   0.061  0.035  0.004                 0.100
ann-0.5-16-500    0.039  0.018  0.009  0.034          0.101
ann-0.9-16-500    0.002  0.082  0.007  0.016          0.108
ann-0.5-32-500    0.000  0.075  0.100  0.027          0.111
knn-1             0.076  0.065  0.008  0.097          0.246

Table: Selected posterior probabilities and corresponding weights for the sat-image problem for elastic net StackingC with α = 0.95 for the 6 models with highest total weights.

SLIDE 30

Conclusions

- Regularization is essential in linear StackingC
- A trained linear combination outperforms Select-Best
- Dense combiners outperform sparse combiners
- Sparse models allow classifiers to specialize in subproblems

SLIDE 31

Future Work

- Examine full Bayesian solutions
- Constrain coefficients to be positive
- Choose a single regularizer for all subproblems

SLIDE 32

Acknowledgments

- PhET Interactive Simulations
- Turing Institute
- UCI Repository
- University of Colorado at Boulder

SLIDE 33

Questions?
