# Classification of High Dimensional Data By Two-way Mixture Models


1. Classification of High Dimensional Data By Two-way Mixture Models. Jia Li, Statistics Department, The Pennsylvania State University.

2. Outline
   - Goals
   - Two-way mixture model approach
     - Background: mixture discriminant analysis
     - Model assumptions and motivations
     - Dimension reduction implied by the two-way mixture model
     - Estimation algorithm
   - Examples
     - Document topic classification (discrete): a mixture of Poisson distributions
     - Disease-type classification using microarray gene expression data (continuous): a mixture of normal distributions
   - Conclusions and future work

3. Goals
   - Achieve high accuracy for the classification of high-dimensional data.
     - Document data:
       - Dimension: $p > 3400$.
       - Training sample size: $n \approx 2500$.
       - Number of classes: $K = 5$.
       - The feature vectors are sparse.
     - Gene expression data:
       - Dimension: $p > 4000$.
       - Training sample size: $n < 100$.
       - Number of classes: $K = 4$.
   - Attribute (variable, feature) clustering may be desired.
     - Document data: which words play similar roles and do not need to be distinguished for identifying a set of topics?
     - Gene expression data: which genes function similarly?

4. Mixture Discriminant Analysis
   - Proposed as an extension of linear discriminant analysis.
     - T. Hastie and R. Tibshirani, "Discriminant analysis by Gaussian mixtures," Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 155-176, 1996.
   - A mixture of normals is used to obtain a density estimate for each class.
   - Denote the feature vector by $X$ and the class label by $Y$.
   - For class $k = 1, 2, \ldots, K$, the within-class density is:

     $$f_k(x) = \sum_{r=1}^{R_k} \pi_{kr}\,\phi(x \mid \mu_{kr}, \Sigma)$$

5. A two-class example: class 1 is a mixture of 3 normals and class 2 a mixture of 2 normals. The variances of all the normals are 3.0.

   [Figure: the component and class densities plotted over roughly $-10$ to $25$.]

6. The overall model is:

   $$P(X = x, Y = k) = a_k f_k(x) = a_k \sum_{r=1}^{R_k} \pi_{kr}\,\phi(x \mid \mu_{kr}, \Sigma),$$

   where $a_k$ is the prior probability of class $k$.
   - Equivalent formulation:

     $$P(X = x, Y = k) = \sum_{m=1}^{M} \pi_m\,\phi(x \mid \mu_m, \Sigma)\, q_m(k),$$

     where $q_m$ is a pmf for the class label $Y$ within mixture component $m$.
   - Here $q_m(k) = 1.0$ if mixture component $m$ "belongs to" class $k$, and zero otherwise.
   - The ML estimate of $a_k$ is the proportion of training samples in class $k$.
   - The EM algorithm is used to estimate $\pi_{kr}$, $\mu_{kr}$, and $\Sigma$.
   - Bayes classification rule:

     $$\hat{y} = \arg\max_k \; a_k \sum_{r=1}^{R_k} \pi_{kr}\,\phi(x \mid \mu_{kr}, \Sigma)$$
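
As a concrete illustration, the Bayes rule above can be sketched in one dimension. This is a toy setup: the shared variance, priors, and component means are all made up for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x (one-dimensional)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mda_classify(x, a, pi, mu, sigma2):
    """Bayes rule for MDA: y_hat = argmax_k a_k * sum_r pi_{kr} phi(x | mu_{kr}, Sigma).

    a[k]   : prior probability of class k
    pi[k]  : mixing proportions of the R_k components of class k
    mu[k]  : component means of class k
    sigma2 : variance shared by all components (1-D sketch)
    """
    scores = [a_k * sum(p * normal_pdf(x, m, sigma2) for p, m in zip(pi_k, mu_k))
              for a_k, pi_k, mu_k in zip(a, pi, mu)]
    return int(np.argmax(scores))
```

For example, if class 0 is a mixture of normals centered at 0 and 3 while class 1 has a single component centered at 10, a point near 0 is assigned to class 0 and a point near 10 to class 1.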

7. Assumptions for the Two-way Mixture
   - For each mixture component, the variables are independent.
     - As a class may contain multiple mixture components, the variables are NOT independent in general given the class.
     - To approximate the density within each class, the restriction on each component can be compensated for by having more components.
     - Convenient for extending to a two-way mixture model.
     - Efficient for treating missing data.
   - Suppose $X = (x_1, x_2, \ldots, x_p)^T$ is $p$-dimensional. The mixture model is:

     $$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} \phi(x_j \mid \theta_{m,j})$$

   - We need to estimate a parameter $\theta_{m,j}$ for each dimension $j$ in each mixture component $m$.

8. Parsimony via variable clustering
   - When the dimension is very high (sometimes $p \gg n$), we may need an even more parsimonious model for each mixture component.
   - Impose a clustering structure on the variables:
     - Assume that the $p$ variables belong to $L$ clusters. Two variables $j_1$, $j_2$ in the same cluster have $\theta_{m,j_1} = \theta_{m,j_2}$, $m = 1, 2, \ldots, M$.
     - Denote the cluster identity of variable $j$ by $c(j) \in \{1, \ldots, L\}$.
     - For a fixed mixture component $m$, we only need to estimate $L$ values of $\theta$.
     - The $\theta_{m,j}$'s are shrunk to $L$ values $\theta_{m,c(j)}$.

     $$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} \phi(x_j \mid \theta_{m,c(j)})$$

   - This way of regularizing the model leads to variable clustering.
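
A minimal sketch of evaluating this tied-parameter joint probability, assuming Poisson component densities (the deck's discrete case). All parameter values, shapes, and names here are hypothetical.

```python
import numpy as np
from math import lgamma

def two_way_joint(x, k, a, q, lam, c):
    """P(X = x, Y = k) = sum_m a_m q_m(k) prod_j f(x_j | theta_{m, c(j)}).

    Sketch with Poisson densities f(x | lam) = e^{-lam} lam^x / x!.
    a   : (M,) component prior probabilities
    q   : (M, K) pmf of the class label within each component
    lam : (M, L) one Poisson rate per (component, variable cluster)
    c   : (p,) cluster label c(j) of each variable -- ties the parameters
    """
    x = np.asarray(x)
    total = 0.0
    for m in range(len(a)):
        rates = lam[m, c]                       # theta_{m, c(j)} for every j
        log_f = (x * np.log(rates) - rates
                 - np.array([lgamma(v + 1) for v in x]))
        total += a[m] * q[m, k] * np.exp(log_f.sum())
    return total
```

With a single component ($M = 1$) and all variables in one cluster, this reduces to a product of identical Poisson pmfs, which is a quick sanity check on the tying.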

9. Properties of variable clusters:
   - Variables in the same cluster have the same distributions within each class.
   - For each cluster of variables, only a small number of statistics are sufficient for predicting the class label.

   [Diagram: mixture components 1-6 grouped into classes 1-3, each component carrying its own distribution over the same variable clusters, e.g. {1, 2, 3} and {4, 5, 6}.]

10. Dimension Reduction
    - Within each mixture component, variables in the same cluster are i.i.d. random variables.
    - For i.i.d. random variables sampled from an exponential family, the dimension of the sufficient statistic for the parameter $\theta$ is fixed w.r.t. the sample size.
    - Assume the exponential family form:

      $$p(x_j \mid \theta) = \exp\!\left( \sum_{s=1}^{S} \eta_s(\theta)\, T_s(x_j) - B(\theta) \right) h(x_j)$$

    - Proposition: For the $j$'s in cluster $l$, $l = 1, \ldots, L$, define

      $$\bar{T}_{l,s}(x) = \sum_{j:\, c(j) = l} T_s(x_j), \quad s = 1, 2, \ldots, S.$$

      Given $\bar{T}_{l,s}(x)$, $l = 1, \ldots, L$, $s = 1, \ldots, S$, the class label $Y$ of a sample is conditionally independent of $X_1, X_2, \ldots, X_p$.

11. Dimension reduction: the "sufficient information" for predicting $Y$ is of dimension $LS$. We often have $LS \ll p$.
    - Examples:
      - Mixture of Poissons: $S = 1$.

        $$\bar{T}_{l,1}(x) = \sum_{j:\, c(j) = l} x_j$$

      - Mixture of normals: $S = 2$.

        $$\bar{T}_{l,1}(x) = \sum_{j:\, c(j) = l} x_j, \qquad \bar{T}_{l,2}(x) = \sum_{j:\, c(j) = l} x_j^2$$

        Equivalently:

        Sample mean: $$\bar{T}_{l,1}(x) = \frac{\sum_{j:\, c(j) = l} x_j}{\sum_j I(c(j) = l)}$$

        Sample variance: $$\bar{T}_{l,2}(x) = \frac{\sum_{j:\, c(j) = l} \left(x_j - \bar{T}_{l,1}(x)\right)^2}{\sum_j I(c(j) = l)}$$
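
The collapse from $p$ variables down to $LS$ numbers can be sketched as follows for the normal case ($S = 2$); the cluster labels here are illustrative.

```python
import numpy as np

def cluster_sufficient_stats(x, c, L):
    """Collapse a p-dimensional vector x to per-cluster sample means and
    variances -- the LS-dimensional summary (S = 2, the normal case) that
    is sufficient for predicting the class label under the two-way model.

    c : (p,) array of variable-cluster labels in {0, ..., L-1}
    """
    x = np.asarray(x, dtype=float)
    means = np.array([x[c == l].mean() for l in range(L)])
    variances = np.array([((x[c == l] - means[l]) ** 2).mean() for l in range(L)])
    return means, variances
```

For instance, $p = 6$ variables in $L = 2$ clusters collapse to $LS = 4$ statistics, and the reduction grows dramatic as $p$ reaches the thousands while $L$ stays small.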

12. Model Fitting
    - We need to estimate the following:
      - Mixture component prior probabilities $a_m$, $m = 1, \ldots, M$.
      - Parameters of the Poisson distributions: $\lambda_{m,l}$, $m = 1, \ldots, M$, $l = 1, \ldots, L$.
      - The variable clustering function $c(j)$, $j = 1, \ldots, p$, $c(j) \in \{1, \ldots, L\}$.
    - Criterion: maximum likelihood estimation.
    - Algorithm: EM.
      - E-step: compute the posterior probability of each sample coming from each mixture component.
      - M-step:
        - Update the parameters $a_m$ and $\lambda_{m,l}$.
        - Update the variable clustering function $c(j)$ by optimizing $c(j)$ individually for each $j$, $j = 1, \ldots, p$, with all the other parameters fixed.
    - Computational perspective:
      - E-step: a "soft" clustering of samples into mixture components ("row-wise" clustering).
      - M-step: 1) update parameters; 2) a clustering of attributes ("column-wise" clustering).
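
One iteration of this alternation might be sketched as below for the Poisson case. This is a simplified sketch under stated assumptions: the component-to-class pmf `q` is held fixed, the constant $\log x_j!$ terms are dropped (they do not affect the optimization), and all names and update details are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def em_step(X, y, q, a, lam, c):
    """One EM iteration for a two-way Poisson mixture (illustrative sketch).

    X : (n, p) count matrix;  y : (n,) class labels
    q : (M, K) fixed component-to-class pmf
    a : (M,) component priors;  lam : (M, L) Poisson rates
    c : (p,) variable-cluster labels
    """
    n, p = X.shape
    M, L = lam.shape
    eps = 1e-300
    # E-step: posterior of each sample under each component ("row-wise" clustering).
    log_r = np.zeros((n, M))
    for m in range(M):
        rates = lam[m, c]                                    # tied rates, one per variable
        log_r[:, m] = (np.log(a[m] * q[m, y] + eps)
                       + (X * np.log(rates + eps) - rates).sum(axis=1))
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)                        # (n, M) posteriors
    # M-step, part 1: update the priors a_m and the rates lam_{m,l}.
    a = r.mean(axis=0)
    for m in range(M):
        for l in range(L):
            in_l = (c == l)
            lam[m, l] = (r[:, m] @ X[:, in_l].sum(axis=1)) / \
                        (r[:, m].sum() * max(in_l.sum(), 1))
    # M-step, part 2: reassign each variable to its best cluster
    # ("column-wise" clustering), holding everything else fixed.
    for j in range(p):
        scores = [(r * (np.outer(X[:, j], np.log(lam[:, l] + eps)) - lam[:, l])).sum()
                  for l in range(L)]
        c[j] = int(np.argmax(scores))
    return a, lam, c, r
```

Note how the E-step softly clusters the rows (samples) while the final loop hard-clusters the columns (variables), matching the computational perspective above.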

13. Document Topic Classification
    - Classify documents into different topics.
    - Five document classes from the Newsgroup data set collected by Lang (1995):
      1. comp.graphics
      2. rec.sport.baseball
      3. sci.med
      4. sci.space
      5. talk.politics.guns
    - Classification is based on word counts.
      - Example: bit: 2, graphic: 3, sun: 2.
    - Each document is represented by a vector of word counts; every dimension corresponds to a particular word.
    - Each class contains about 1000 documents. Roughly half of them are randomly selected as training data, and the rest are used for testing.
    - Pre-processing: for each document class, the 1000 words with the largest total counts in the training data are used as variables.
    - The dimension of the word vector is $p = 3455$, so $p > n$.
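
The pre-processing step can be sketched like this. It is a toy sketch: the function names and the tiny corpus are invented for illustration, and the real pipeline would also tokenize more carefully than `str.split`.

```python
from collections import Counter

def select_vocab(docs_by_class, top_n=1000):
    """For each class, keep the top_n words by total training count;
    the union over classes forms the feature vocabulary."""
    vocab = set()
    for docs in docs_by_class:
        counts = Counter()
        for doc in docs:
            counts.update(doc.split())
        vocab.update(w for w, _ in counts.most_common(top_n))
    return sorted(vocab)

def count_vector(doc, vocab):
    """Represent a document as a vector of word counts over vocab."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]
```

Because the top-1000 lists of the five classes overlap, the union is smaller than 5000 words, which is consistent with the deck's $p = 3455$.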

14. Mixture of Poisson Distributions
    - The Poisson distribution is uni-modal:

      $$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$$

    - Example mixtures of Poisson distributions:

      [Figure: two example Poisson-mixture pmfs plotted over counts 0 to 50.]

    - Mixture of multivariate independent Poisson distributions with variable clustering:

      $$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} e^{-\lambda_{m,c(j)}} \frac{\lambda_{m,c(j)}^{x_j}}{x_j!}$$
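
Although each Poisson is uni-modal, a mixture of Poissons can be multi-modal, which is what makes it flexible enough for density estimation. A quick sketch (the weights and rates here are made up):

```python
import math

def poisson_mixture_pmf(k, weights, lams):
    """pmf of a Poisson mixture: sum_r w_r * e^{-lam_r} * lam_r^k / k!."""
    return sum(w * math.exp(-lam) * lam ** k / math.factorial(k)
               for w, lam in zip(weights, lams))
```

With equal weights on rates 2 and 20, for example, the pmf has a peak near $k = 2$, another near $k = 20$, and a trough in between.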

15. Results
    - Classification error rates achieved without variable clustering, with 1 to 20 mixture components per class:

      [Figure: classification error rate (%) vs. number of mixture components per class; the y-axis spans roughly 10% to 24%.]

    - With variable clustering: $L = 30$ to $3455$ word clusters, 6 mixture components per class:

      [Figure: classification error rate (%) vs. number of word clusters on a log scale ($10^1$ to $10^4$); the y-axis spans roughly 11% to 16%.]

16. Confusion table for $M = 30$ mixture components, without word clustering, $p = 3455$ (rows: true class; columns: predicted class). Classification error rate: 11.22%.

    |               | graphics | baseball | sci.med | sci.space | politics.guns |
    |---------------|---------:|---------:|--------:|----------:|--------------:|
    | graphics      | 463 | 5 | 9 | 16 | 3 |
    | baseball      | 3 | 459 | 4 | 2 | 9 |
    | sci.med       | 22 | 12 | 435 | 20 | 14 |
    | sci.space     | 27 | 14 | 28 | 409 | 18 |
    | politics.guns | 11 | 27 | 17 | 17 | 434 |

    For $M = 30$ and $L = 168$ word clusters, the classification error rate is 12.51%.

    |               | graphics | baseball | sci.med | sci.space | politics.guns |
    |---------------|---------:|---------:|--------:|----------:|--------------:|
    | graphics      | 458 | 1 | 12 | 15 | 10 |
    | baseball      | 3 | 446 | 2 | 5 | 21 |
    | sci.med       | 23 | 9 | 408 | 21 | 42 |
    | sci.space     | 24 | 9 | 21 | 404 | 38 |
    | politics.guns | 4 | 15 | 18 | 17 | 452 |