Model-based clustering using mixtures of t-factor analyzers: A food authenticity example

Jeffrey L. Andrews
Ph.D. Candidate, Department of Mathematics & Statistics
University of Guelph, Guelph, Ontario, Canada
July 26, 2010

Sensometrics 2010. Short title: MBC using MMtFAs: A food authenticity example

Sections: Introduction, Constraints, Other Techniques, Applications, Conclusion, References

Overview: Welcome

This presentation focuses on model-based clustering using a six-member family of mixtures of multivariate t-distribution models, as introduced by Andrews and McNicholas (2010). Parameter estimation, model selection, and model performance are discussed. The six-member MMtFA family is illustrated via an application to two food authenticity data sets.

The Data: Italian Wines

The wine data set from the gclus library in R:
- 13 chemical properties;
- 178 samples of wine;
- 3 varieties of wine: Barolo, Barbera, and Grignolino.

Can we objectively cluster types of wine according to their chemical properties?

Table: Thirteen of the chemical and physical properties of the Italian wines.

Alcohol           Proline                OD 280/OD 315 of diluted wines
Malic acid        Ash                    Alcalinity of ash
Hue               Total phenols          Magnesium
Flavonoids        Nonflavonoid phenols   Proanthocyanins
Color intensity

Mixture Models: Mixtures of Multivariate t-Distributions

The model density is of the form

$$f(x) = \sum_{g=1}^{G} \pi_g \, f_t(x \mid \mu_g, \Sigma_g, \nu_g),$$

where

$$f_t(x \mid \mu, \Sigma, \nu) = \frac{\Gamma\left(\frac{\nu + p}{2}\right) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2} \, \Gamma\left(\frac{\nu}{2}\right) \left\{1 + \frac{\delta(x, \mu \mid \Sigma)}{\nu}\right\}^{(\nu + p)/2}}$$

is the multivariate t-distribution with mean $\mu$, covariance matrix $\Sigma$, and degrees of freedom $\nu$. Here $\delta(x, \mu \mid \Sigma) = (x - \mu)' \Sigma^{-1} (x - \mu)$ is the squared Mahalanobis distance between $x$ and $\mu$, $p$ is the dimension of $x$, and $\pi_g$ are the mixing proportions.
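As an illustration of the density on this slide, the following is a minimal sketch that evaluates the log of $f_t$ and the mixture density using only the Python standard library. The function names (`cholesky`, `mahalanobis_sq`, `log_t_density`, `mixture_density`) are hypothetical helpers introduced here, not part of the original presentation or of any package, and the Cholesky-based solve is just one convenient way to obtain $\delta$ and $|\Sigma|$.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    p = len(a)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mahalanobis_sq(x, mu, sigma):
    """delta(x, mu | Sigma) = (x - mu)' Sigma^{-1} (x - mu)."""
    low = cholesky(sigma)
    # Forward substitution: solve L y = (x - mu); then delta = y'y.
    y = []
    for i in range(len(x)):
        s = sum(low[i][k] * y[k] for k in range(i))
        y.append((x[i] - mu[i] - s) / low[i][i])
    return sum(v * v for v in y)


def log_t_density(x, mu, sigma, nu):
    """Log of the multivariate t density f_t(x | mu, Sigma, nu)."""
    p = len(x)
    low = cholesky(sigma)
    # log|Sigma| is twice the sum of the log-diagonal of the Cholesky factor.
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    delta = mahalanobis_sq(x, mu, sigma)
    return (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * p * math.log(math.pi * nu) - 0.5 * log_det
            - 0.5 * (nu + p) * math.log1p(delta / nu))


def mixture_density(x, pis, mus, sigmas, nus):
    """f(x) = sum_g pi_g f_t(x | mu_g, Sigma_g, nu_g)."""
    return sum(pi_g * math.exp(log_t_density(x, mu_g, sig_g, nu_g))
               for pi_g, mu_g, sig_g, nu_g in zip(pis, mus, sigmas, nus))
```

A quick sanity check: with $p = 1$, $\nu = 1$, $\mu = 0$ and $\Sigma = 1$ the density reduces to the standard Cauchy, so `math.exp(log_t_density([0.0], [0.0], [[1.0]], 1.0))` should equal $1/\pi$.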

Mixture Models: Mixtures of Multivariate t-Factor Analyzers

The model density is of the form

$$f(x) = \sum_{g=1}^{G} \pi_g \, f_t(x \mid \mu_g, \Sigma_g, \nu_g).$$

MMtFAs adjust the covariance structure of the density such that

$$\Sigma_g = \Lambda_g \Lambda_g' + \Psi_g.$$

This is the factor analysis covariance structure.
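To make the factor analysis covariance structure concrete, here is a small sketch that assembles $\Sigma_g = \Lambda_g \Lambda_g' + \Psi_g$ from a $p \times q$ loading matrix and the diagonal of $\Psi_g$. The function name `factor_covariance` and the numbers in the example are illustrative assumptions, not values from the presentation.

```python
def factor_covariance(loadings, psi_diag):
    """Return Sigma = Lambda Lambda' + Psi.

    loadings: p x q loading matrix Lambda, as a list of p rows of length q.
    psi_diag: the p diagonal entries of the diagonal matrix Psi.
    """
    p = len(loadings)
    # (Lambda Lambda')[i][j] is the inner product of rows i and j of Lambda.
    sigma = [[sum(a * b for a, b in zip(loadings[i], loadings[j]))
              for j in range(p)] for i in range(p)]
    for i in range(p):
        sigma[i][i] += psi_diag[i]
    return sigma


# Hypothetical example with p = 3 variables and q = 1 latent factor.
example_sigma = factor_covariance([[1.0], [2.0], [0.5]], [0.1, 0.2, 0.3])
```

Under the PPCA constraint discussed later, `psi_diag` would simply be the same scalar repeated $p$ times.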

Mixture Models: Extensions

McLachlan et al. (2007) develop the unconstrained case: $\Sigma_g = \Lambda_g \Lambda_g' + \Psi_g$.

Zhao and Jiang (2006) develop a version of the PPCA constraint: $\Sigma_g = \Lambda_g \Lambda_g' + \psi_g I_p$.

We will consider:
- constraining the degrees of freedom parameter, or $\nu_g = \nu$;
- the PPCA constraint, or $\Psi_g = \psi_g I$;
- the loading matrix constraint, or $\Lambda_g = \Lambda$.

Parameter Estimation: EM Algorithms

The expectation-maximization (EM) algorithm is an iterative procedure used to find maximum likelihood estimates in the presence of missing or incomplete data.

The expectation-conditional maximization (ECM) algorithm replaces the maximization (M) step with a series of computationally simpler conditional maximization (CM) steps.

The alternating expectation-conditional maximization (AECM) algorithm permits the complete-data vector to vary, or alternate, on each CM-step.

Parameters are estimated using the AECM algorithm in the t-factors case because there are three types of missing data.

Model Selection: BIC and ICL

Model selection is performed using the Bayesian information criterion (BIC) and the integrated completed likelihood (ICL):

$$\text{BIC} = 2\, l(x, \hat{\Psi}) - m \log n,$$

$$\text{ICL} = \text{BIC} + \sum_{i=1}^{n} \sum_{g=1}^{G} \text{MAP}(\hat{z}_{ig}) \ln(\hat{z}_{ig}),$$

where $l(x, \hat{\Psi})$ is the maximized log-likelihood, $\hat{\Psi}$ denotes the maximum likelihood estimate of the model parameters, $m$ is the number of free parameters, and $n$ is the number of observations. Note that

$$\text{MAP}(\hat{z}_{ig}) = \begin{cases} 1 & \text{if } \max_g \{\hat{z}_{ig}\} \text{ occurs at group } g, \\ 0 & \text{otherwise.} \end{cases}$$
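The two criteria can be sketched in a few lines. This is an illustrative implementation under the sign convention on this slide (larger values are better); the function names `bic` and `icl`, and the argument `z_hat` (an $n \times G$ list of estimated posterior membership probabilities), are assumptions made for the example.

```python
import math


def bic(loglik, m, n):
    """BIC = 2 * maximized log-likelihood - m * log(n)."""
    return 2.0 * loglik - m * math.log(n)


def icl(loglik, m, n, z_hat):
    """ICL = BIC + sum_i sum_g MAP(z_ig) * ln(z_ig).

    MAP(z_ig) is 1 only for the group with the largest posterior
    probability, so each observation contributes ln(max_g z_ig).
    """
    entropy_term = 0.0
    for row in z_hat:
        entropy_term += math.log(max(row))
    return bic(loglik, m, n) + entropy_term
```

Because $\ln(\hat{z}_{ig}) \le 0$, the ICL is never larger than the BIC; the two coincide when every observation is assigned to its group with probability one, and the ICL drops further the more uncertain the classification is.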

Model Performance: Adjusted Rand Index

Clustering performance will be evaluated using the adjusted Rand index.

The Rand index is calculated by

$$\frac{\text{number of agreements}}{\text{number of agreements} + \text{number of disagreements}},$$

where the numbers of agreements and disagreements are based on pairwise comparisons.

The adjusted Rand index corrects for chance, recognizing that clustering performed randomly would correctly classify some pairs.
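A short sketch of both indices follows. The Rand index is computed directly from the pairwise definition on this slide; the adjusted version uses the standard contingency-table form (observed index minus its expected value, divided by the maximum index minus the expected value). The function names are hypothetical, and returning 1.0 in the degenerate case where the denominator is zero is a convention chosen for this example.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two partitions agree."""
    n = len(labels_a)
    agreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            # A pair agrees if it is together in both or apart in both.
            if same_a == same_b:
                agreements += 1
    return agreements / comb(n, 2)


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for chance agreement."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case (assumed convention for this sketch)
    return (sum_cells - expected) / (max_index - expected)
```

Both indices depend only on the partitions, not on the label names, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are a perfect match. Unlike the Rand index, the adjusted version can be negative when agreement is worse than expected by chance.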

Overview: MMtFA Family Development

Three constraints will now be introduced that lead to a family of six mixture models.

Degrees of Freedom: Constraining $\nu_g = \nu$

Constraining the degrees of freedom to be equal across groups ($\nu_g = \nu$) effectively assumes that each group can be modelled using the same distributional shape.

The savings in parameter estimation are quite small ($G - 1$ parameters); in practice, however, constraining the degrees of freedom can lead to better clustering performance (Andrews and McNicholas, 2010).

This is likely due to more stable estimation of the degrees of freedom parameter from all $n$ samples rather than from the $n_g$ samples in each group.

PPCA: Constraining $\Psi_g = \psi_g I$

Utilizing the isotropic constraint ($\Psi_g = \psi_g I$) assumes that each group contains a unique, scalar error in the variance estimation under the factor analysis structure.

As $\Psi_g$ is a diagonal matrix, $Gp$ parameters are normally needed for estimation.

Under this constraint, only $G$ parameters are estimated: a significant reduction, especially for high-dimensional data sets.

Loading Matrix: Constraining $\Lambda_g = \Lambda$

Constraining the loading matrices to be equal across groups ($\Lambda_g = \Lambda$) assumes that each group’s covariance estimates are identical.

As $\Lambda_g$ is a $p \times q$ matrix, $G[pq - q(q - 1)/2]$ parameters are normally needed for estimation.

Under this constraint, only $pq - q(q - 1)/2$ parameters are estimated: a large reduction in free parameters.

Resulting Family of Models: The Six Models

Covariance structures derived from the mixtures of t-factor analyzers model (C = Constrained, U = Unconstrained):

Model   Λ   ψ_g I   ν   Covariance and DF parameters
CCC     C   C       C   [pq − q(q − 1)/2] + G + 1
CCU     C   C       U   [pq − q(q − 1)/2] + G + G
UCC     U   C       C   G[pq − q(q − 1)/2] + G + 1
UCU     U   C       U   G[pq − q(q − 1)/2] + G + G
UUC     U   U       C   G[pq − q(q − 1)/2] + Gp + 1
UUU     U   U       U   G[pq − q(q − 1)/2] + Gp + G
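The counts in the table can be reproduced with a small helper. This is a sketch only: the function name `covariance_df_parameters` is hypothetical, and the example uses the wine data dimensions from the earlier slide ($p = 13$, $G = 3$) together with an assumed $q = 2$ latent factors, a value chosen here purely for illustration.

```python
MODELS = ("CCC", "CCU", "UCC", "UCU", "UUC", "UUU")


def covariance_df_parameters(model, p, q, G):
    """Number of covariance and degrees-of-freedom parameters.

    model: three letters for (Lambda, psi_g I, nu); 'C' = constrained,
           'U' = unconstrained, restricted to the six models in the table.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    loading, isotropic, dof = model
    per_loading = p * q - q * (q - 1) // 2
    n_loading = per_loading if loading == "C" else G * per_loading
    n_psi = G if isotropic == "C" else G * p
    n_dof = 1 if dof == "C" else G
    return n_loading + n_psi + n_dof


# Wine data dimensions (p = 13, G = 3) with an assumed q = 2.
counts = {m: covariance_df_parameters(m, p=13, q=2, G=3) for m in MODELS}
```

These counts cover only the covariance and degrees-of-freedom parameters, as in the table; a full parameter count for use in the BIC would also include the component means and mixing proportions.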

Established Model-based Clustering Methods: Overview

The MMtFA family will be compared to established model-based clustering techniques:
- parsimonious Gaussian mixture models (PGMMs; McNicholas and Murphy, 2008);
- MCLUST (Fraley and Raftery, 1999); and
- variable selection (Dean and Raftery, 2006).

A brief summary of these methods follows.

Established Model-based Clustering Methods: PGMMs

McNicholas and Murphy (2008) introduce PGMMs, a family based on mixtures of factor analyzers.

The model density is

$$f(x) = \sum_{g=1}^{G} \pi_g \, \phi(x \mid \mu_g, \Lambda_g \Lambda_g' + \Psi_g),$$

where $\phi(\cdot)$ is the multivariate Gaussian density.

Constraining $\Lambda_g = \Lambda$, $\Psi_g = \Psi$, and/or $\Psi_g = \psi_g I_p$ leads to a family of 8 mixture models.

Established Model-based Clustering Methods: MCLUST

Fraley and Raftery (1999) introduce MCLUST, a family based on the eigendecomposition of the multivariate Gaussian covariance structure.

The model density is

$$f(x) = \sum_{g=1}^{G} \pi_g \, \phi(x \mid \mu_g, \lambda_g D_g A_g D_g').$$

Constraining $\lambda_g = \lambda$, $\lambda = 1$, $D_g = D$, $A_g = A$, or replacing $A$ and/or $D$ with the identity matrix leads to a family of 10 mixture models.
