  1. Classification of High Dimensional Data By Two-way Mixture Models. Jia Li, Statistics Department, The Pennsylvania State University.

  2. Outline
     - Goals
     - Two-way mixture model approach
       - Background: mixture discriminant analysis
       - Model assumptions and motivations
       - Dimension reduction implied by the two-way mixture model
       - Estimation algorithm
     - Examples
       - Document topic classification (discrete): a mixture of Poisson distributions
       - Disease-type classification using microarray gene expression data (continuous): a mixture of normal distributions
     - Conclusions and future work

  3. Goals
     - Achieve high accuracy for the classification of high dimensional data.
       - Document data: dimension p > 3400; training sample size n ≈ 2500; number of classes K = 5; the feature vectors are sparse.
       - Gene expression data: dimension p > 4000; training sample size n < 100; number of classes K = 4.
     - Attribute (variable, feature) clustering may be desired.
       - Document data: which words play similar roles and do not need to be distinguished for identifying a set of topics?
       - Gene expression data: which genes function similarly?

  4. Mixture Discriminant Analysis
     - Proposed as an extension of linear discriminant analysis.
     - T. Hastie and R. Tibshirani, "Discriminant analysis by Gaussian mixtures," Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 155-176, 1996.
     - The mixture of normals is used to obtain a density estimation for each class.
     - Denote the feature vector by X and the class label by Y.
     - For class k = 1, 2, ..., K, the within-class density is:
       f_k(x) = \sum_{r=1}^{R_k} \pi_{kr} \, \phi(x \mid \mu_{kr}, \Sigma)

  5. A two-class example: class 1 is a mixture of 3 normals and class 2 a mixture of 2 normals; the variance of every normal is 3.0.
     [Figure: density curves for the two class-conditional mixtures.]

  6. - The overall model is:
       P(X = x, Y = k) = a_k f_k(x) = a_k \sum_{r=1}^{R_k} \pi_{kr} \, \phi(x \mid \mu_{kr}, \Sigma)
       where a_k is the prior probability of class k.
     - Equivalent formulation:
       P(X = x, Y = k) = \sum_{m=1}^{M} \pi_m \, q_m(k) \, \phi(x \mid \mu_m, \Sigma)
       where q_m is a pmf for the class label Y within a mixture component.
     - Here q_m(k) = 1.0 if mixture component m "belongs to" class k, and zero otherwise.
     - The ML estimate of a_k is the proportion of training samples in class k.
     - The EM algorithm is used to estimate \pi_{kr}, \mu_{kr}, and \Sigma.
     - Bayes classification rule:
       \hat{y} = \arg\max_k \, a_k \sum_{r=1}^{R_k} \pi_{kr} \, \phi(x \mid \mu_{kr}, \Sigma)
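A minimal sketch of this Bayes rule, assuming the class priors a_k, mixing proportions pi_{kr}, component means mu_{kr}, and the shared covariance Sigma have already been estimated (all function and variable names here are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify_mda(x, class_priors, weights, means, cov):
    """Bayes rule for mixture discriminant analysis:
    y_hat = argmax_k  a_k * sum_r pi_{kr} * phi(x | mu_{kr}, Sigma).

    class_priors : list of a_k, one per class
    weights      : list of arrays pi_{k,:}, mixing proportions within class k
    means        : list of arrays of shape (R_k, p), component means within class k
    cov          : shared covariance matrix Sigma of shape (p, p)
    """
    scores = []
    for a_k, pi_k, mu_k in zip(class_priors, weights, means):
        # Within-class density f_k(x): a mixture of normals with a common covariance.
        f_k = sum(pi_kr * multivariate_normal.pdf(x, mean=mu_kr, cov=cov)
                  for pi_kr, mu_kr in zip(pi_k, mu_k))
        scores.append(a_k * f_k)
    return int(np.argmax(scores))
```

In high dimensions one would likely work with log-densities (logpdf plus a log-sum-exp) instead of the raw pdf values to avoid underflow.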

  7. Assumptions for the Two-way Mixture
     - For each mixture component, the variables are independent.
       - As a class may contain multiple mixture components, the variables are NOT independent in general given the class.
       - To approximate the density within each class, the restriction on each component can be compensated by having more components.
       - Convenient for extending to a two-way mixture model.
       - Efficient for treating missing data.
     - Suppose X = (x_1, x_2, ..., x_p)^T is p-dimensional. The mixture model is:
       P(X = x, Y = k) = \sum_{m=1}^{M} a_m \, q_m(k) \prod_{j=1}^{p} \phi(x_j \mid \theta_{m,j})
     - We need to estimate the parameter \theta_{m,j} for each dimension j in each mixture component m.

  8. - When the dimension is very high (sometimes p >> n), we may need an even more parsimonious model for each mixture component.
     - Clustering structure on the variables:
       - Assume that the p variables belong to L clusters. Two variables j_1, j_2 in the same cluster have \theta_{m,j_1} = \theta_{m,j_2}, m = 1, 2, ..., M.
       - Denote the cluster identity of variable j by c(j) \in \{1, ..., L\}.
       - For a fixed mixture component m, only L \theta's need to be estimated.
       - The \theta_{m,j}'s are shrunk to the \theta_{m,c(j)}'s.
       P(X = x, Y = k) = \sum_{m=1}^{M} a_m \, q_m(k) \prod_{j=1}^{p} \phi(x_j \mid \theta_{m,c(j)})
     - This way of regularizing the model leads to variable clustering.
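To make the parameter tying concrete, here is a small sketch assuming Poisson component distributions (as used later for the document data); the names are made up for illustration. Each component m stores only L parameters, and the cluster map c(j) routes each of the p variables to one of them:

```python
import numpy as np
from scipy.stats import poisson

def component_log_density(x, lam_m, c):
    """log prod_j phi(x_j | theta_{m, c(j)}) for one mixture component m.

    x     : length-p vector of counts
    lam_m : length-L vector of Poisson means for component m (one per variable cluster)
    c     : length-p integer array of cluster labels in {0, ..., L-1}, i.e. c(j)
    """
    # Parameter tying: variable j uses the parameter of its cluster c(j),
    # so component m needs only L parameters instead of p.
    return poisson.logpmf(x, lam_m[c]).sum()
```

With p around 3455 and L around 30, each component keeps roughly 30 parameters instead of 3455, which is exactly the shrinkage described above.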

  9. Properties of variable clusters:
     - Variables in the same cluster have the same distributions within each class.
     - For each cluster of variables, only a small number of statistics are sufficient for predicting the class label.
     [Diagram: classes 1-3, each consisting of two mixture components (components 1-6); within every component the variables are grouped into the same clusters, e.g. {1, 2, 3} and {4, 5, 6}.]

  10. Dimension Reduction
      - Within each mixture component, variables in the same cluster are i.i.d. random variables.
      - For i.i.d. random variables sampled from an exponential family, the dimension of the sufficient statistic for the parameter \theta is fixed w.r.t. the sample size.
      - Assume the exponential family to be:
        p(x_j \mid \theta) = \exp\left( \sum_{s=1}^{S} \eta_s(\theta) T_s(x_j) - B(\theta) \right) h(x_j)
      - Proposition: For the x_j's in cluster l, l = 1, ..., L, define
        T_{l,s}(x) = \sum_{j: c(j) = l} T_s(x_j),  s = 1, 2, ..., S.
        Given T_{l,s}(x), l = 1, ..., L, s = 1, ..., S, the class label Y of a sample is conditionally independent of X_1, X_2, ..., X_p.
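As a concrete instance of this exponential-family form (a standard identity, added here for illustration), the Poisson pmf used later fits the template with S = 1, \eta_1(\theta) = \log\lambda, T_1(x_j) = x_j, B(\theta) = \lambda, and h(x_j) = 1/x_j!:

p(x_j \mid \lambda) = e^{-\lambda} \frac{\lambda^{x_j}}{x_j!} = \exp\big( (\log\lambda)\, x_j - \lambda \big) \cdot \frac{1}{x_j!}, \qquad \text{so } T_{l,1}(x) = \sum_{j:\, c(j)=l} x_j.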

  11. - Dimension reduction: the "sufficient information" for predicting Y is of dimension LS. We often have LS << p.
      - Examples:
        - Mixtures of Poisson: S = 1.
          T_{l,1}(x) = \sum_{j: c(j) = l} x_j
        - Mixtures of normal: S = 2.
          T_{l,1}(x) = \sum_{j: c(j) = l} x_j
          T_{l,2}(x) = \sum_{j: c(j) = l} x_j^2
          Equivalently:
          Sample mean: \bar{T}_{l,1}(x) = \sum_{j: c(j) = l} x_j \,/\, \sum_j I(c(j) = l)
          Sample variance: \bar{T}_{l,2}(x) = \sum_{j: c(j) = l} (x_j - \bar{T}_{l,1}(x))^2 \,/\, \sum_j I(c(j) = l)
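A minimal sketch of this collapse from p raw variables to the LS summary statistics (function names are illustrative):

```python
import numpy as np

def poisson_sufficient_stats(x, c, L):
    """T_{l,1}(x) = sum_{j: c(j)=l} x_j  -- one number per variable cluster (S = 1)."""
    return np.bincount(c, weights=x, minlength=L)

def normal_sufficient_stats(x, c, L):
    """T_{l,1} = sum_j x_j and T_{l,2} = sum_j x_j^2 per cluster (S = 2);
    the cluster-wise sample mean and variance follow from these."""
    t1 = np.bincount(c, weights=x, minlength=L)
    t2 = np.bincount(c, weights=x**2, minlength=L)
    return t1, t2

# Example: p = 3455 word counts reduced to L = 30 cluster totals.
# x = np.random.poisson(1.0, size=3455); c = np.random.randint(0, 30, size=3455)
# poisson_sufficient_stats(x, c, 30).shape   ->   (30,)
```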

  12. Model Fitting
      - We need to estimate the following:
        - Mixture component prior probabilities a_m, m = 1, ..., M.
        - Parameters of the Poisson distributions: \lambda_{m,l}, m = 1, ..., M, l = 1, ..., L.
        - The variable clustering function c(j), j = 1, ..., p, with c(j) \in \{1, ..., L\}.
      - Criterion: maximum likelihood estimation.
      - Algorithm: EM.
        - E-step: compute the posterior probability of each sample coming from each mixture component.
        - M-step: update the parameters a_m and \lambda_{m,l}; update the variable clustering function c(j) by optimizing c(j) individually for each j, j = 1, ..., p, with all the other parameters fixed.
      - Computational perspective:
        - E-step: a "soft" clustering of samples into mixture components ("row-wise" clustering).
        - M-step: 1) update parameters; 2) a clustering of attributes ("column-wise" clustering).
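A compact sketch of one such EM iteration for the two-way Poisson mixture. All names are hypothetical; this is a schematic of the update structure, not the authors' implementation, and for brevity it drops the class-label factor q_m(k) from the E-step:

```python
import numpy as np
from scipy.stats import poisson

def em_step(X, a, lam, c, L):
    """One EM iteration for a two-way Poisson mixture (class-label factor omitted).

    X   : (n, p) count matrix
    a   : (M,) component prior probabilities a_m
    lam : (M, L) Poisson means lambda_{m,l}, one per (component, variable cluster)
    c   : (p,) integer variable-cluster labels c(j)
    """
    n, p = X.shape
    M = len(a)
    # E-step ("row-wise" clustering): posterior of each sample under each component.
    log_post = np.log(a) + np.array(
        [poisson.logpmf(X, lam[m][c]).sum(axis=1) for m in range(M)]).T   # (n, M)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # M-step, part 1: update a_m and lambda_{m,l} given the soft assignments.
    a = post.mean(axis=0)
    T = np.stack([np.bincount(c, weights=X[i], minlength=L) for i in range(n)])  # (n, L)
    n_l = np.bincount(c, minlength=L)                                            # cluster sizes
    lam = (post.T @ T) / (post.sum(axis=0)[:, None] * np.maximum(n_l, 1))
    lam = np.maximum(lam, 1e-8)   # guard against zero means

    # M-step, part 2 ("column-wise" clustering): reassign each variable j to the
    # cluster l maximizing the expected log-likelihood, other parameters fixed.
    for j in range(p):
        scores = [(post * poisson.logpmf(X[:, [j]], lam[:, l])).sum() for l in range(L)]
        c[j] = int(np.argmax(scores))
    return a, lam, c
```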

  13. Document Topic Classification
      - Classify documents into different topics.
      - Five document classes from the Newsgroup data set collected by Lang (1995):
        1. comp.graphics
        2. rec.sport.baseball
        3. sci.med
        4. sci.space
        5. talk.politics.guns
      - Classification is based on word counts.
        - Example: bit: 2, graphic: 3, sun: 2.
      - Each document is represented by a vector of word counts; every dimension corresponds to a particular word.
      - Each class contains about 1000 documents. Roughly half of them are randomly selected as training data, and the rest are used for testing.
      - Pre-processing: for each document class, the 1000 words with the largest total counts in the training data are used as variables.
      - The dimension of the word vector is p = 3455, so p > n.
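A minimal sketch of this kind of preprocessing, using scikit-learn's CountVectorizer. The exact tokenization and word-selection details of the original experiment are not specified on the slides, so treat the names and choices below as illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_count_matrix(train_docs, train_labels, words_per_class=1000):
    """Represent each document as a vector of word counts, keeping for each class
    the words with the largest total training counts (union over all classes)."""
    vec = CountVectorizer()
    X = vec.fit_transform(train_docs)                 # (n_docs, vocab) sparse count matrix
    labels = np.asarray(train_labels)
    keep = set()
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        totals = np.asarray(X[idx].sum(axis=0)).ravel()
        keep.update(np.argsort(totals)[::-1][:words_per_class])   # top words for class k
    keep = sorted(keep)
    vocab = vec.get_feature_names_out()
    return X[:, keep], [vocab[i] for i in keep]
```

Because the per-class top-1000 lists overlap, the union has fewer than 5000 words; on these data the resulting dimension is p = 3455.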

  14. Mixture of Poisson Distributions
      - The Poisson distribution is uni-modal:
        P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}
      - Example mixtures of Poisson distributions:
        [Figure: two example mixture-of-Poisson pmfs over counts 0 to 50.]
      - Mixture of multivariate independent Poisson distributions with variable clustering:
        P(X = x, Y = k) = \sum_{m=1}^{M} a_m \, q_m(k) \prod_{j=1}^{p} e^{-\lambda_{m,c(j)}} \frac{\lambda_{m,c(j)}^{x_j}}{x_j!}
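Putting the pieces together, a sketch of the classification rule implied by this model. Names are hypothetical; the parameters a_m, q_m(k), lambda_{m,l}, and c(j) are assumed to have been fitted already, e.g. by an EM procedure like the one sketched earlier:

```python
import numpy as np
from scipy.stats import poisson

def classify_two_way_poisson(x, a, q, lam, c):
    """y_hat = argmax_k sum_m a_m q_m(k) prod_j Poisson(x_j | lambda_{m, c(j)}).

    x   : (p,) word-count vector
    a   : (M,) component priors a_m
    q   : (M, K) values q_m(k), the pmf of the class label within each component
    lam : (M, L) Poisson means per (component, variable cluster)
    c   : (p,) integer variable-cluster labels c(j)
    """
    # Work in log space for numerical stability; a common positive factor
    # cancels in the argmax over k.
    log_comp = np.log(a) + np.array([poisson.logpmf(x, lam_m[c]).sum() for lam_m in lam])
    log_comp -= log_comp.max()
    return int(np.argmax(np.exp(log_comp) @ q))
```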

  15. Results
      - Classification error rates achieved without variable clustering, with the number of components per class ranging from 1 to 20.
        [Plot: classification error rate (%) versus the number of mixture components per class.]
      - With word clustering: L = 30 to 3455, 6 components per class.
        [Plot: classification error rate (%) versus the number of word clusters (log scale).]

  16. - Confusion table for M = 30, without word clustering, p = 3455. Classification error rate: 11.22%.

                         graphics  baseball  sci.med  sci.space  politics.guns
        graphics              463         5        9         16              3
        baseball                3       459        4          2              9
        sci.med                22        12      435         20             14
        sci.space              27        14       28        409             18
        politics.guns          11        27       17         17            434

      - For M = 30, L = 168. Classification error rate: 12.51%.

                         graphics  baseball  sci.med  sci.space  politics.guns
        graphics              458         1       12         15             10
        baseball                3       446        2          5             21
        sci.med                23         9      408         21             42
        sci.space              24         9       21        404             38
        politics.guns           4        15       18         17            452
