SLIDE 1

Variable selection in model-based classification

G. Celeux¹, M.-L. Martin-Magniette², C. Maugis³

1: INRIA Saclay-Île-de-France
2: UMR AgroParisTech/INRA MIA 518 and URGV (Unité de Recherche en Génomique Végétale)
3: Institut de Mathématiques de Toulouse

SLIDE 2

Variable selection in clustering and classification

Variable selection is highly desirable for unsupervised or supervised classification in high-dimensional settings, and the question has received a lot of attention in recent years. Different variable selection procedures have been proposed from heuristic points of view. Roughly speaking, the variables are separated into two groups: the relevant variables and the independent variables. In the same spirit, sparse classification methods have been proposed that depend on some tuning parameters. We opt for a mixture model which allows variable selection in classification to be dealt with properly.

SLIDE 3

Gaussian mixture model for clustering

Purpose: clustering of y = (y_1, ..., y_n), where the y_i ∈ ℝ^Q are iid observations with unknown pdf h.

The pdf h is modelled with a Gaussian mixture

    f_clust(·|K, m, α) = Σ_{k=1}^{K} p_k Φ(·|µ_k, Σ_k)

with α = (p, µ_1, ..., µ_K, Σ_1, ..., Σ_K), where p = (p_1, ..., p_K), Σ_{k=1}^{K} p_k = 1, and Φ(·|µ_k, Σ_k) is the pdf of a N_Q(µ_k, Σ_k).

T = set of models (K, m), where K ∈ ℕ⋆ is the number of mixture components and m is the Gaussian mixture type.
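
As a toy illustration (a sketch only, not the authors' MIXMOD implementation), such a mixture can be fitted by EM with scikit-learn; the data y and the choice K = 2 are placeholder assumptions:

```python
# Minimal sketch (not the MIXMOD implementation): fit f_clust(.|K, m, alpha)
# by EM and read off the estimated parameters. The toy data y and the
# choice K = 2 are placeholder assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
y = np.vstack([rng.normal(0, 1, (100, 2)),     # toy sample: n = 200, Q = 2,
               rng.normal(4, 1, (100, 2))])    # two well-separated groups

gm = GaussianMixture(n_components=2, covariance_type="full").fit(y)
p_hat, mu_hat, Sigma_hat = gm.weights_, gm.means_, gm.covariances_
# p_hat sums to 1; Sigma_hat[k] is the Q x Q variance matrix of component k
```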

SLIDE 4

The Gaussian mixture collection

It is based on the eigenvalue decomposition of the mixture component variance matrices:

    Σ_k = L_k D′_k A_k D_k

where Σ_k is the Q × Q variance matrix of component k, L_k = |Σ_k|^{1/Q} (cluster volume), D_k is the eigenvector matrix of Σ_k (cluster orientation), and A_k is the normalised eigenvalue diagonal matrix of Σ_k (cluster shape).

⇒ 3 families (spherical, diagonal, general) ⇒ 14 models
Free or fixed proportions ⇒ 28 Gaussian mixture models
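
A minimal numerical sketch of this volume/orientation/shape decomposition; the toy matrix Sigma_k is an assumption, and numpy returns the factors in the D A D′ orientation convention (transposed relative to the slide's notation):

```python
# Decompose a toy covariance matrix into volume L_k, shape A_k and
# orientation D_k, then check that the factors recompose Sigma_k.
import numpy as np

Sigma_k = np.array([[3.0, 1.0],
                    [1.0, 2.0]])               # toy Q x Q variance matrix
Q = Sigma_k.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma_k)
L_k = np.linalg.det(Sigma_k) ** (1.0 / Q)      # volume |Sigma_k|^(1/Q)
A_k = np.diag(eigvals / L_k)                   # normalised shape, det(A_k) = 1
D_k = eigvecs                                  # orientation (eigenvectors)

assert np.allclose(L_k * (D_k @ A_k @ D_k.T), Sigma_k)   # factors recompose
assert np.isclose(np.linalg.det(A_k), 1.0)
```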

SLIDE 5

Model selection

Asymptotic approximation of the integrated or completed integrated likelihood.

BIC (Bayesian Information Criterion):

    2 ln f(y|K, m) ≈ 2 ln f(y|K, m, α̂) − λ_{(K,m)} ln(n) = BIC_clust(y|K, m)

where α̂ is computed by the EM algorithm and λ_{(K,m)} is the number of free parameters.

ICL (Integrated Completed Likelihood): ICL = BIC minus the entropy of the fuzzy classification matrix.

The classifier ẑ = MAP(α̂) is given by

    ẑ_ik = 1 if p̂_k Φ(y_i|µ̂_k, Σ̂_k) > p̂_j Φ(y_i|µ̂_j, Σ̂_j) for all j ≠ k, and 0 otherwise.

MIXMOD software: http://www.mixmod.org
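
Continuing the fit from the slide-3 sketch (`gm`, `y`), the three quantities above can be sketched as follows; the sign and factor-2 conventions are those of this slide, where BIC is a quantity to maximise:

```python
# BIC_clust, ICL and the MAP classifier from the fitted mixture `gm`.
import numpy as np

n = y.shape[0]
K, Q = gm.means_.shape
loglik = gm.score(y) * n                      # ln f(y | K, m, alpha_hat)
lam = (K - 1) + K * Q + K * Q * (Q + 1) // 2  # free parameters, full model
bic_clust = 2 * loglik - lam * np.log(n)

t = gm.predict_proba(y)                       # fuzzy classification matrix
ent = -np.sum(t * np.log(np.clip(t, 1e-300, None)))
icl = bic_clust - 2 * ent                     # entropy penalises fuzzy partitions

z_hat = gm.predict(y)                         # MAP rule: largest p_k Phi(y_i|.)
```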

SLIDE 6

Variable selection in the mixture setting

Law, Figueiredo and Jain (2004): the irrelevant variables are assumed to be independent of the relevant variables. Raftery and Dean (2006): the irrelevant variables are linked to all the relevant variables through a linear regression. Maugis, Celeux and Martin-Magniette (2009a, b), SRUW model: an irrelevant variable may be linked to a subset of the relevant variables through a linear regression, or be independent of them.

SLIDE 7

Our model: four different variable roles

Modelling the pdf h:

    x ∈ ℝ^Q → f_clust(x^S|K, m, α) × f_reg(x^U|r, a + x^R β, Ω) × f_indep(x^W|ℓ, γ, τ)

relevant variables (S): Gaussian mixture density
    f_clust(x^S|K, m, α) = Σ_{k=1}^{K} p_k Φ(x^S|µ_k, Σ_k)
redundant variables (U): linear regression of x^U on x^R (R ⊆ S)
    f_reg(x^U|r, a + x^R β, Ω) = Φ(x^U|a + x^R β, Ω(r))
independent variables (W): Gaussian density
    f_indep(x^W|ℓ, γ, τ) = Φ(x^W|γ, τ(ℓ))

SLIDE 8

SRUW model

It is assumed that h can be written

    x ∈ ℝ^Q → f_clust(x^S|K, m, α) × f_reg(x^U|r, a + x^R β, Ω) × f_indep(x^W|ℓ, γ, τ)

relevant variables (S): Gaussian mixture pdf
redundant variables (U): linear regression of x^U with respect to x^R
independent variables (W): Gaussian pdf

Model collection:

    N = {(K, m, r, ℓ, V); (K, m) ∈ T, r ∈ {[LI], [LB], [LC]}, ℓ ∈ {[LI], [LB]}, V ∈ V}

where

    V = {(S, R, U, W); S ⊔ U ⊔ W = {1, ..., Q}, S ≠ ∅, R ⊆ S, R = ∅ if U = ∅ and R ≠ ∅ otherwise}

SLIDE 9

Model selection criterion

Variable selection by maximising the integrated likelihood:

    (K̂, m̂, r̂, ℓ̂, V̂) = argmax_{(K,m,r,ℓ,V)∈N} crit(K, m, r, ℓ, V)

where

    crit(K, m, r, ℓ, V) = BIC_clust(y^S|K, m) + BIC_reg(y^U|r, y^R) + BIC_indep(y^W|ℓ)

Theoretical properties: the model collection is identifiable, and the selection criterion is consistent.
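
The criterion decomposes into three standard BIC terms, one per variable block. A minimal sketch (hypothetical helpers, not the SelvarClustIndep code) evaluating crit for one candidate partition V = (S, R, U, W); for simplicity the regression form is fixed to a diagonal Ω rather than chosen among [LI], [LB], [LC]:

```python
# crit(K, m, r, l, V) = BIC_clust(y^S) + BIC_reg(y^U | y^R) + BIC_indep(y^W)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

def bic_gaussian_mixture(yS, K):
    """BIC_clust in the slide's sign convention (to be maximised)."""
    gm = GaussianMixture(n_components=K, covariance_type="full").fit(yS)
    return -gm.bic(yS)          # sklearn returns -2 loglik + lambda ln n

def bic_regression(yU, yR):
    """BIC of the regression of y^U on y^R with diagonal Omega ([LB] form)."""
    n, qU = yU.shape
    resid = yU - LinearRegression().fit(yR, yU).predict(yR)
    var = resid.var(axis=0)                      # MLE residual variances
    loglik = -0.5 * n * np.sum(np.log(2 * np.pi * var) + 1)
    lam = qU * (yR.shape[1] + 2)                 # slopes + intercepts + variances
    return 2 * loglik - lam * np.log(n)

def crit(y, S, R, U, W, K):
    """Sum of the three BIC terms for one partition V = (S, R, U, W)."""
    value = bic_gaussian_mixture(y[:, S], K)
    if U:
        value += bic_regression(y[:, U], y[:, R])
    if W:
        value += bic_gaussian_mixture(y[:, W], 1)  # one-component Gaussian
    return value
```

For instance, crit(y, S=[0, 1], R=[0], U=[2], W=[3], K=3) scores the model in which variables 0 and 1 cluster into 3 components, variable 2 regresses on variable 0, and variable 3 is independent.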

SLIDE 10

Selection algorithm (SelvarclustIndep)

It makes use of two embedded (for-back)ward stepwise algorithms. Three situations are possible for a candidate variable j:

M1: f_clust(y^S, y^j|K, m) (j relevant)
M2: f_clust(y^S|K, m) f_reg(y^j|[LI], y^R̃[j]) where R̃[j] ⊆ S, R̃[j] ≠ ∅ (j redundant)
M3: f_clust(y^S|K, m) f_indep(y^j|[LI]), i.e. f_clust(y^S|K, m) f_reg(y^j|[LI], y^R̃[j]) with R̃[j] = ∅ (j independent)

It reduces to comparing

    f_clust(y^S, y^j|K, m) versus f_clust(y^S|K, m) f_reg(y^j|[LI], y^R̃[j])

⇒ algorithm SelvarClust (SR model), and j falls in model M2 if R̃[j] ≠ ∅, in model M3 otherwise.
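
Reusing the two BIC helpers from the slide-9 sketch, the role decision for one candidate variable j can be sketched as follows; the crude backward pass over the regressors is an illustrative stand-in for the exact stepwise procedure (⋆), and the [LI] regression form is replaced by the diagonal one:

```python
def select_regressors(y, j, S):
    """Backward pass: drop regressors of y^j while BIC does not decrease."""
    R, best = list(S), bic_regression(y[:, [j]], y[:, list(S)])
    improved = True
    while improved and R:
        improved = False
        for v in list(R):
            Rm = [u for u in R if u != v]
            b = (bic_regression(y[:, [j]], y[:, Rm]) if Rm
                 else bic_gaussian_mixture(y[:, [j]], 1))  # empty R[j]: model M3
            if b >= best:
                best, R, improved = b, Rm, True
    return R, best

def variable_role(y, j, S, K):
    """Compare M1 against the better of M2/M3 for candidate variable j."""
    bic_M1 = bic_gaussian_mixture(y[:, S + [j]], K)        # j relevant
    R_j, bic_r = select_regressors(y, j, S)
    bic_M23 = bic_gaussian_mixture(y[:, S], K) + bic_r     # j redundant or indep.
    if bic_M1 > bic_M23:
        return "S"
    return "U" if R_j else "W"
```
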
SLIDE 11

Synopsis of the backward algorithm

1. For each mixture model (K, m):
   Step A: backward stepwise selection for clustering:
   ◮ initialisation: S(K, m) = {1, ..., Q}
   ◮ exclusion step (remove a variable from S)
   ◮ inclusion step (add a variable to S)
   both using backward stepwise variable selection for regression (⋆)
   ⇒ a two-cluster partition of the variables into Ŝ(K, m) and Ŝᶜ(K, m).
   Step B: Ŝᶜ(K, m) is partitioned into Û(K, m) and Ŵ(K, m) with (⋆).
   Step C: for each regression model form r, selection with (⋆) of the variables R̂(K, m, r); for each independent model form ℓ, estimation of the parameters θ̂ and computation of the criterion
   c̃rit(K, m, r, ℓ) = crit(K, m, r, ℓ, Ŝ(K, m), R̂(K, m, r), Û(K, m), Ŵ(K, m)).

2. Selection of (K̂, m̂, r̂, ℓ̂) maximising c̃rit(K, m, r, ℓ); the selected model is (K̂, m̂, r̂, ℓ̂, Ŝ(K̂, m̂), R̂(K̂, m̂, r̂), Û(K̂, m̂), Ŵ(K̂, m̂)).
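
The outer loop of this synopsis might look as follows; stepA_backward_selection, stepB_split_complement and stepC_select_regressors are hypothetical names standing in for Steps A, B and C (not defined here), and the model forms m, r, ℓ are reduced to one choice each:

```python
# Skeleton only: the step helpers are hypothetical stand-ins for Steps A-C.
best_score, best_model = -np.inf, None
for K in range(2, 6):                              # candidate (K, m) in T
    S = stepA_backward_selection(y, K)             # hypothetical: gives S_hat
    U, W = stepB_split_complement(y, S, K)         # hypothetical: splits S^c
    R = stepC_select_regressors(y, S, U)           # hypothetical: gives R_hat
    score = crit(y, S, R, U, W, K)                 # criterion from slide-9 sketch
    if score > best_score:
        best_score, best_model = score, (K, S, R, U, W)
```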

SLIDE 12

Alternative sparse clustering methods

Model-based regularisation: Zhou and Pan (2009) propose to minimise a penalised log-likelihood through an EM-like algorithm with the penalty

    p(λ) = λ_1 Σ_{k=1}^{K} Σ_{j=1}^{Q} |µ_{jk}| + λ_2 Σ_{k=1}^{K} Σ_{j=1}^{Q} Σ_{j'=1}^{Q} |Σ^{-1}_{k;jj'}|.

Sparse clustering framework: Witten and Tibshirani (2010) define a general criterion

    Σ_{j=1}^{Q} w_j f_j(y^j, θ)   with ‖w‖_2 ≤ 1, ‖w‖_1 ≤ s, w_j ≥ 0 ∀j,

where f_j measures the clustering fit for variable j. Example: for sparse K-means clustering,

    f_j = (1/n) Σ_{i=1}^{n} Σ_{i'=1}^{n} d^j_{ii'} − Σ_{k=1}^{K} (1/n_k) Σ_{i,i'∈C_k} d^j_{ii'}.

SLIDE 13

Comparing sparse clustering and MBC variable selection

Results from 20 simulations with Q = 25 and card(s) = 5; CER = classification error rate, card(ŝ) = number of selected variables.

    Setting           Method         CER             card(ŝ)
    n = 30, δ = 0.6   SparseKmeans   0.40 (±0.03)    14.4 (±1.3)
                      Kmeans         0.39 (±0.04)    25.0 (±0)
                      SU-LI          0.62 (±0.06)    22.2 (±1.2)
                      SRUW-LI        0.40 (±0.03)    8.1 (±1.9)
    n = 30, δ = 1.7   SparseKmeans   0.08 (±0.02)    8.2 (±0.8)
                      Kmeans         0.25 (±0.01)    25.0 (±0)
                      SU-LI          0.57 (±0.03)    23.1 (±0.2)
                      SRUW-LI        0.085 (±0.08)   6.8 (±1.4)
    n = 300, δ = 0.6  SparseKmeans   0.38 (±0.003)   24.00 (±0.5)
                      Kmeans         0.36 (±0.003)   25.0 (±0)
                      SU-LI          0.37 (±0.03)    25.0 (±0)
                      SRUW-LI        0.34 (±0.02)    7.0 (±1.7)
    n = 300, δ = 1.7  SparseKmeans   0.05 (±0.01)    25.0 (±0)
                      Kmeans         0.16 (±0.06)    25.0 (±0)
                      SU-LI          0.05 (±0.01)    14.6 (±2.0)
                      SRUW-LI        0.05 (±0.01)    5.6 (±0.9)

SLIDE 14

Comparing sparse clustering and MBC variable selection

Fifty independent simulated data sets with n = 2000 and Q = 14. The first two variables follow a mixture of 4 equiprobable spherical Gaussians with µ_1 = (0, 0), µ_2 = (4, 0), µ_3 = (0, 2) and µ_4 = (4, 2). The remaining variables are generated as

    y_i^{3,...,14} = ã + y_i^{1,2} β̃ + ε_i   with ε_i ∼ N(0, Ω̃),

where ã = (0, 0, 0.4, ..., 4), with two different scenarios for β̃ and Ω̃.

Adjusted Rand index:
    Method          Scenario 1       Scenario 2
    Sparse Kmeans   0.47 (±0.016)    0.31 (±0.035)
    Kmeans          0.52 (±0.014)    0.57 (±0.015)
    SR-LI           0.39 (±0.039)    0.42 (±0.082)
    SRUW-LI         0.57 (±0.04)     0.60 (±0.015)

Number of selected variables:
    Method          Scenario 1       Scenario 2
    Sparse Kmeans   14 (±0)          13.5 (±1.5)
    Kmeans          14 (±0)          14 (±0)
    SU-LI           12 (±0)          3.96 (±0.57)
    SRUW-LI         2 (±0.20)        2 (±0)
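
The two comparison measures in these tables can be computed as follows; equating CER with one minus the best one-to-one matching of cluster labels is an assumption, since the slides do not define it:

```python
# Comparison measures: classification error rate (CER, one convention)
# and the adjusted Rand index.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def cer(z_true, z_hat):
    """Classification error rate after optimally matching cluster labels."""
    cm = confusion_matrix(z_true, z_hat)
    rows, cols = linear_sum_assignment(-cm)    # maximise matched counts
    return 1.0 - cm[rows, cols].sum() / len(z_true)

ari = adjusted_rand_score                      # adjusted Rand index
```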

SLIDE 15

Variable selection in a supervised classification context

We now turn to another variable selection problem.

Aim: classify observations described by Q variables into one of K groups given a priori.

The classifier is designed from a training sample {(y_1, z_1), ..., (y_n, z_n); y_i ∈ ℝ^Q, z_i ∈ {1, ..., K}} where the labels z_i, i = 1, ..., n, are known. We consider here generative models, which assume a parameterised form for the group conditional density f(y_i|z_i = k); it follows that the density of the y_i is a mixture density with K components. In such a decision-making context, variable selection is often crucial for designing an efficient classifier.

SLIDE 16

Variable selection for Gaussian Classifiers

The classifier is designed from a training sample {(y_1, z_1), ..., (y_n, z_n); y_i ∈ ℝ^Q, z_i ∈ {1, ..., K}}.

Gaussian generative model: f(y_i|z_i = k, m) = Φ(y_i|µ_k, Σ_k) for all i ∈ {1, ..., n}, with P(z_i = k) = p_k.

LDA: m = [LC] (∀k, Σ_k = Σ)
QDA: m = [L_k C_k]
EDDA: 14 models derived from the eigenvalue decomposition of the group variance matrices.

Variable selection can be carried out with the SRUW model in a simple way, since the classification is known. The resulting (for-back)ward procedures generalise the standard variable selection procedures for LDA (Murphy et al. 2010, Maugis et al. 2010). A sketch of the two basic classifiers follows.
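
A minimal sketch of the two Gaussian classifiers on a selected variable subset; S_hat, the toy data X and the labels z are placeholders, not the Landsat or Leukemia setup:

```python
# LDA (m = [LC]) and QDA (m = [LkCk]) restricted to selected variables.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # toy training sample
z = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy labels
S_hat = [0, 1, 2]                              # hypothetical selected variables

lda = LinearDiscriminantAnalysis().fit(X[:, S_hat], z)     # m = [LC]
qda = QuadraticDiscriminantAnalysis().fit(X[:, S_hat], z)  # m = [LkCk]
error = 1 - qda.score(X[:, S_hat], z)   # training error; use a test set in practice
```
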
SLIDE 17

Illustrations of variable selection in a supervised setting

Landsat Satellite data set: it consists of the multi-spectral values of pixels in a tiny sub-area of a satellite image. The data points are in ℝ^26 and split into six classes. The original learning set has 4435 samples, and a test set of 2000 samples is available. LDA and QDA are compared: 1000 samples, randomly selected 100 times from the training data, are used to estimate and select the model. On average, the same 12 variables are selected for both models; R̂ = Ŝ (r̂ = [LC]) and Ŵ = ∅.

Averaged classification error rate:
                  With variable selection    Without variable selection
                  LDA           QDA          LDA           QDA
    Error rate    21.00         16.21        18.05         17.90
                  ±0.53         ±0.68        ±0.48         ±0.57

SLIDE 18

Illustrations of variable selection in a supervised setting

Leukemia data set: these data come from a study of gene expression in two types of acute leukemia: 47 tumor samples of acute lymphoblastic leukemia (ALL) and 25 of acute myeloid leukemia (AML), measured on Q = 3571 genes. We analyse the Leukemia data set using 38 samples (27 ALL, 11 AML) in the training set and 34 samples (20 ALL, 14 AML) in the test set.

Variable selection and misclassification error rate:
    Model                                 LDA     QDA     [LkC]
    card(Ŝ)                               8       8       3
    card(R̂)                               2       2       3
    card(Û)                               3058    2848    1912
    card(Ŵ)                               505     715     1656
    Misclassified test obs. (ALL, AML)    (2,4)   (0,0)   (0,0)

SLIDE 19

Discussion

Interest of variable selection: in the unsupervised setting, variable selection is essentially useful for interpreting the clustering; in the supervised setting, variable selection can dramatically improve the performance of quadratic classifiers.

Backward or forward selection? Backward selection can be expected to provide more stable results, while forward selection is necessary in high-dimensional settings.

Software: free software can be downloaded from Cathy Maugis's home page, http://www.math.univ-toulouse.fr/~maugis

SLIDE 20

Celeux, G., Martin-Magniette, M.-L., Maugis, C., and Raftery, A. E. (2011). Letter to the editor in relation with "A framework for feature selection in clustering". Journal of the American Statistical Association, 106.

Law, M. H., Figueiredo, M. A. T., and Jain, A. K. (2004). Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1154–1166.

Maugis, C., Celeux, G., and Martin-Magniette, M.-L. (2009a). Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701–709.

Maugis, C., Celeux, G., and Martin-Magniette, M.-L. (2009b). Variable selection in model-based clustering: A general variable role modeling. Computational Statistics and Data Analysis, 53(11):3872–3882.

Maugis, C., Celeux, G., and Martin-Magniette, M.-L. (2011). Variable selection in model-based discriminant analysis. Journal of Multivariate Analysis. In revision.

Murphy, T. B., Dean, N., and Raftery, A. E. (2010). Variable selection and updating in model-based discriminant analysis for high-dimensional data. Annals of Applied Statistics, 4:396–421.

Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178.

Witten, D. M. and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490):713–726.

Zhou, H. and Pan, W. (2009). Penalized model-based clustering with unconstrained covariance matrices. Electronic Journal of Statistics, 3:1473–1496.