SLIDE 1

Classification of High Dimensional Data By Two-way Mixture Models

Jia Li, Statistics Department, The Pennsylvania State University

SLIDE 2

Outline

Goals

Two-way mixture model approach
– Background: mixture discriminant analysis
– Model assumptions and motivations
– Dimension reduction implied by the two-way mixture model
– Estimation algorithm

Examples
– Document topic classification (discrete): a mixture of Poisson distributions
– Disease-type classification using microarray gene expression data (continuous): a mixture of normal distributions

Conclusions and future work

SLIDE 3

Goals

Achieve high accuracy for the classification of high dimensional data.
– Document data: dimension p > 3400, training sample size n ≈ 2500, number of classes K = 5. The feature vectors are sparse.
– Gene expression data: dimension p > 4000, training sample size n < 100, number of classes K = 4.

Attribute (variable, feature) clustering may be desired.
– Document data: which words play similar roles and do not need to be distinguished for identifying a set of topics?
– Gene expression data: which genes function similarly?

SLIDE 4

Mixture Discriminant Analysis

Proposed as an extension of linear discriminant analysis:

T. Hastie and R. Tibshirani, "Discriminant analysis by Gaussian mixtures," Journal of the Royal Statistical Society, Series B (Methodological), vol. 58, no. 1, pp. 155-176, 1996.

A mixture of normals is used to obtain a density estimate for each class.

Denote the feature vector by X and the class label by Y. For class k = 1, 2, ..., K, the within-class density is:

$$f_k(x) = \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x \mid \mu_{kr}, \Sigma)$$

SLIDE 5

A two-class example. Class 1 is a mixture of 3 normals and class 2 a mixture of 2 normals. The variances of all the normals are 3.0.

[Figure: the two class-conditional mixture densities plotted over the feature axis.]

SLIDE 6

The overall model is:

$$P(X = x, Y = k) = a_k f_k(x) = a_k \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x \mid \mu_{kr}, \Sigma)$$

where a_k is the prior probability of class k.

Equivalent formulation:

$$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, \phi(x \mid \mu_m, \Sigma)\, q_m(k)$$

where q_m is a pmf for the class label Y within mixture component m. Here q_m(k) = 1.0 if mixture component m "belongs to" class k, and zero otherwise.

The ML estimate of a_k is the proportion of training samples in class k. The EM algorithm is used to estimate the π_{kr}, μ_{kr}, and Σ.

Bayes classification rule:

$$\hat{y} = \arg\max_k\; a_k \sum_{r=1}^{R_k} \pi_{kr}\, \phi(x \mid \mu_{kr}, \Sigma)$$
SLIDE 7

Assumptions for the Two-way Mixture

For each mixture component, the variables are independent.
– As a class may contain multiple mixture components, the variables are NOT independent in general given the class.
– To approximate the density within each class, the restriction on each component can be compensated for by having more components.
– Convenient for extending to a two-way mixture model.
– Efficient for treating missing data.

Suppose X is p-dimensional, x = (x_1, x_2, ..., x_p)^T. The mixture model is:

$$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} \phi(x_j \mid \theta_{m,j})$$

We need to estimate the parameter θ_{m,j} for each dimension j in each mixture component m.

SLIDE 8

When the dimension is very high (sometimes p ≫ n), we may need an even more parsimonious model for each mixture component.

Clustering structure on the variables:
– Assume that the p variables belong to L clusters.
– Two variables j_1, j_2 in the same cluster have θ_{m,j_1} = θ_{m,j_2}, m = 1, 2, ..., M.
– Denote the cluster identity of variable j by c(j) ∈ {1, ..., L}.
– For a fixed mixture component m, we only need to estimate L θ's.
– The p θ_{m,j}'s are shrunk to L θ_{m,c(j)}'s.

$$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} \phi(x_j \mid \theta_{m,c(j)})$$

This way of regularizing the model leads to variable clustering (a small sketch of the parameter tying follows).
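A minimal sketch of what the tying θ_{m,j} = θ_{m,c(j)} means computationally; the array names and toy sizes are mine, not the slides':

```python
# Store an M x L parameter table plus the cluster map c, instead of a full
# M x p table: only M*L free parameters remain.
import numpy as np

M, p, L = 3, 8, 2                        # components, variables, variable clusters
c = np.array([0, 0, 1, 1, 0, 1, 1, 0])   # hypothetical clustering function c(j)
theta_ml = np.arange(M * L, dtype=float).reshape(M, L)  # M*L free parameters
theta_mj = theta_ml[:, c]                # expanded to the M x p table used above
print(theta_mj.shape)                    # (3, 8); columns with equal c(j) coincide
```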

SLIDE 9

Properties of variable clusters:
– Variables in the same cluster have the same distributions within each class.
– For each cluster of variables, only a small number of statistics are sufficient for predicting the class label.

[Diagram: six mixture components, each assigned to one of classes 1-3; within every component the variables are grouped into clusters such as {1, 2, 3} and {4, 5, 6}.]

SLIDE 10

Dimension Reduction

Within each mixture component, variables in the same cluster are i.i.d. random variables.

For i.i.d. random variables sampled from an exponential family, the dimension of the sufficient statistic for the parameter is fixed w.r.t. the sample size.

Assume the exponential family to be:

$$p_\theta(x_j) = \exp\!\left( \sum_{s=1}^{S} \eta_s(\theta)\, T_s(x_j) - B(\theta) \right) h(x_j)$$

Proposition: For the X_j's in cluster l, l = 1, ..., L, define

$$\tilde{T}_{l,s}(x) = \sum_{j:\, c(j)=l} T_s(x_j), \qquad s = 1, 2, ..., S.$$

Given $\tilde{T}_{l,s}(x)$, l = 1, ..., L, s = 1, ..., S, the class label Y of a sample is conditionally independent of X_1, X_2, ..., X_p.
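For concreteness (a standard identification, added here; it is not on the original slide), the Poisson pmf fits this exponential-family form with S = 1, which grounds the Poisson example on the next slide:

$$p_\lambda(x_j) = \frac{\lambda^{x_j}}{x_j!}\, e^{-\lambda} = \exp\big(x_j \log \lambda - \lambda\big)\, \frac{1}{x_j!}, \qquad \eta_1(\lambda) = \log \lambda,\quad T_1(x_j) = x_j,\quad B(\lambda) = \lambda,\quad h(x_j) = \frac{1}{x_j!}.$$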

SLIDE 11

Dimension reduction: the "sufficient information" for predicting Y is of dimension LS. We often have LS ≪ p.

Examples:
– Mixtures of Poisson: S = 1.

$$\tilde{T}_{l,1}(x) = \sum_{j:\, c(j)=l} x_j$$

– Mixtures of normal: S = 2.

$$\tilde{T}_{l,1}(x) = \sum_{j:\, c(j)=l} x_j, \qquad \tilde{T}_{l,2}(x) = \sum_{j:\, c(j)=l} x_j^2$$

Equivalently:

Sample mean: $\bar{T}_{l,1}(x) = \dfrac{\sum_{j:\, c(j)=l} x_j}{\sum_j I(c(j)=l)}$

Sample variance: $\bar{T}_{l,2}(x) = \dfrac{\sum_{j:\, c(j)=l} \big(x_j - \bar{T}_{l,1}(x)\big)^2}{\sum_j I(c(j)=l)}$

These per-cluster statistics are cheap to compute; see the sketch below.

SLIDE 12

Model Fitting

We need to estimate the following:

– Mixture component prior probabilities a_m, m = 1, ..., M.
– Parameters of the Poisson distributions: λ_{m,l}, m = 1, ..., M, l = 1, ..., L.
– The variable clustering function c(j), j = 1, ..., p, with c(j) ∈ {1, ..., L}.

Criterion: maximum likelihood estimation. Algorithm: EM.
– E-step: compute the posterior probability of each sample coming from each mixture component.
– M-step: update the parameters a_m and λ_{m,l}; update the variable clustering function c(j) by optimizing c(j) individually for each j, j = 1, ..., p, with all the other parameters fixed.

Computational perspective:
– E-step: a "soft" clustering of samples into mixture components ("row-wise" clustering).
– M-step: 1) update parameters; 2) a clustering of attributes ("column-wise" clustering).

A compact sketch of these alternating steps follows.
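The following rough Python sketch implements the E/M steps above for the Poisson case. It is my own compact reconstruction, not the author's code, and it simplifies: the class labels and q_m(k) are omitted (so it fits an unsupervised two-way Poisson mixture), and initialization is random.

```python
import numpy as np

def em_two_way_poisson(X, M, L, n_iter=50, eps=1e-10, seed=0):
    """X: (n, p) matrix of counts. Returns priors a (M,), rates lam (M, L),
    and the variable clustering c (p,), with c[j] in {0, ..., L-1}."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    a = np.full(M, 1.0 / M)                  # component priors a_m
    lam = rng.gamma(2.0, 1.0, size=(M, L))   # Poisson rates lambda_{m,l}
    c = rng.integers(0, L, size=p)           # clustering function c(j)
    for _ in range(n_iter):
        # E-step ("row-wise" soft clustering): posterior w[i, m] that sample i
        # came from component m, up to the constant sum_j log x_ij!.
        ll = X @ np.log(lam[:, c] + eps).T - lam[:, c].sum(axis=1) + np.log(a + eps)
        w = np.exp(ll - ll.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step, parameters: closed-form updates of a_m and lambda_{m,l}.
        a = w.mean(axis=0)
        wm = w.sum(axis=0)                   # effective size of each component
        num = w.T @ X                        # num[m, j] = sum_i w[i, m] x_ij
        for l in range(L):
            in_l = c == l
            if in_l.any():
                lam[:, l] = num[:, in_l].sum(axis=1) / (wm * in_l.sum() + eps)
        # M-step, clustering ("column-wise"): reassign each variable j on its own,
        # maximizing score[l, j] = sum_m (num[m, j] log lam[m, l] - wm[m] lam[m, l]).
        score = np.log(lam + eps).T @ num - (lam.T @ wm)[:, None]
        c = score.argmax(axis=0)
    return a, lam, c

X = np.random.default_rng(1).poisson(2.0, size=(40, 12))  # synthetic counts
a, lam, c = em_two_way_poisson(X, M=3, L=2)
print(a.round(3), c)
```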

SLIDE 13

Document Topic Classification

Classify documents into different topics.

Five document classes from the Newsgroup data set collected by Lang (1995):
1. comp.graphics
2. rec.sport.baseball
3. sci.med
4. sci.space
5. talk.politics.guns

Classification is based on word counts.
– Example: bit: 2, graphic: 3, sun: 2.

Each document is represented by a vector of word counts. Every dimension corresponds to a particular word.

Each class contains about 1000 documents. Roughly half of them are randomly selected as training data, and the others are used for testing.

Pre-processing: for each document class, the 1000 words with the largest total counts in the training data are used as variables. The dimension of the word vector is p = 3455, with p > n.

SLIDE 14

Mixture of Poisson Distributions

The Poisson distribution is unimodal:

$$P(X = k) = \frac{\lambda^k}{k!}\, e^{-\lambda}$$

Example mixtures of Poisson distributions:

[Figure: pmfs of two example one-dimensional Poisson mixtures, each with multiple modes.]

Mixture of multivariate independent Poisson distributions with variable clustering (a sketch of evaluating this joint probability follows):

$$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} \frac{\lambda_{m,c(j)}^{x_j}}{x_j!}\, e^{-\lambda_{m,c(j)}}$$
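A minimal sketch of evaluating log P(X = x, Y = k) under the clustered Poisson model above; the toy parameter values are my own choices, not estimates from the slides:

```python
import numpy as np
from scipy.stats import poisson

def log_joint(x, k, a, q, lam, c):
    """log sum_m a_m q_m(k) prod_j Poisson(x_j; lambda_{m, c(j)})."""
    log_terms = (np.log(a) + np.log(q[:, k])
                 + poisson.logpmf(x, lam[:, c]).sum(axis=1))  # one term per m
    top = log_terms.max()
    return top + np.log(np.exp(log_terms - top).sum())        # stable log-sum-exp

x = np.array([2, 0, 3, 1])              # word counts of one toy document
c = np.array([0, 1, 0, 1])              # word clusters c(j)
a = np.array([0.6, 0.4])                # component priors a_m
q = np.array([[0.9, 0.1],               # q_m(k): component-to-class pmf
              [0.1, 0.9]])
lam = np.array([[2.0, 0.5],             # lambda_{m,l}
                [0.3, 1.5]])
print(log_joint(x, k=0, a=a, q=q, lam=lam, c=c))
```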

SLIDE 15

Results

Classification error rates achieved without variable clustering, with 1 to 20 components per class:

[Figure: classification error rate (%) versus the number of mixture components per class.]

Classification error rates with word clustering, L = 30 to 3455, with 6 components per class:

[Figure: classification error rate (%) versus the number of word clusters, log scale.]

SLIDE 16

Confusion table for M = 30, without word clustering, p = 3455. Classification error rate: 11.22%.

                graphics  baseball  sci.med  sci.space  politics.guns
graphics             463         5        9         16              3
baseball               3       459        4          2              9
sci.med               22        12      435         20             14
sci.space             27        14       28        409             18
politics.guns         11        27       17         17            434

For M = 30, L = 168. Classification error rate: 12.51%.

                graphics  baseball  sci.med  sci.space  politics.guns
graphics             458         1       12         15             10
baseball               3       446        2          5             21
sci.med               23         9      408         21             42
sci.space             24         9       21        404             38
politics.guns          4        15       18         17            452

SLIDE 17

For M = 30, L = 168, the median cluster size is 7. The cluster sizes are highly skewed: the largest 10 clusters account for more than half of the 3455 words.

[Figure: number of words in each cluster, log scale, versus word cluster index.]

The corresponding weighted average of the λ_{m,l}'s for each cluster l, $\sum_{m=1}^{M} a_m \lambda_{m,l}$, is shown below. The largest few word clusters have very low average counts.

[Figure: average λ versus word cluster index.]

SLIDE 18

If the 612 + 220 + 180 + 166 + 137 = 1315 words in the largest five clusters are not used when classifying test samples, the error rate is only slightly increased, from 12.15% to 12.99%.

Words in all of the clusters of size 5:
– patient, eat, food, treatment, physician
– nasa, space, earth, mission, satellit
– compil, transform, enhanc, misc, lc
– game, team, player, fan, pitcher
– unit, period, journal, march, sale
– wai, switch, describ, directli, docum
– faq, resourc, tool, distribut, hardwar
– approxim, aspect, north, angl, simul
– recogn, wisdom, vm, significantli, breast
– bought, simultan, composit, walter, mag
– statu, ny, dark, eventu, phase
– closer, po, paid, er, huge
– necessarili, steven, ct, encourag, dougla
– replac, chri, slow, nl, adob

SLIDE 19

Disease Classification by Microarray Data

The microarray data are provided at the web site http://llmpp.nih.gov/lymphoma/

Every sample in the data set contains expression levels of 4026 genes.

There are 96 samples divided into 9 classes. Four classes totaling 78 samples are chosen for the classification experiment:
– DLBCL (diffuse large B-cell lymphoma): 42
– ABB (activated blood B): 16
– FL (follicular lymphoma): 9
– CLL (chronic lymphocytic leukemia): 11

Five-fold cross-validation is used to assess the accuracy of classification.

Mixture of normal distributions with variable clustering (a sketch of the per-component log-density follows):

$$P(X = x, Y = k) = \sum_{m=1}^{M} a_m\, q_m(k) \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi \sigma_{m,c(j)}^2}} \exp\!\left( -\frac{(x_j - \mu_{m,c(j)})^2}{2\sigma_{m,c(j)}^2} \right)$$
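A minimal sketch of the Gaussian analog: the per-component log-density with cluster-tied means and variances. The toy values and variable names are mine, not from the slides:

```python
import numpy as np

def gauss_log_density(x, mu, sig2, c):
    """sum_j log N(x_j; mu_{m,c(j)}, sigma^2_{m,c(j)}) for every component m."""
    mu_j, s2_j = mu[:, c], sig2[:, c]                     # expand (M, L) -> (M, p)
    return (-0.5 * np.log(2 * np.pi * s2_j)
            - (x - mu_j) ** 2 / (2 * s2_j)).sum(axis=1)

x = np.array([0.2, -1.0, 0.8])                            # toy expression levels
c = np.array([0, 1, 0])                                   # gene clusters c(j)
mu = np.array([[0.0, -1.0], [1.0, 0.0]])                  # mu_{m,l}
sig2 = np.array([[1.0, 0.5], [2.0, 1.0]])                 # sigma^2_{m,l}
print(gauss_log_density(x, mu, sig2, c))                  # one value per component m
```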

SLIDE 20

Results

Classification error rates achieved without variable clustering, M = 4 to 36:

[Figure: classification error rate (%) versus the number of mixture components.]

The minimum error rate, 10.26%, is achieved at M = 6. Due to the small sample size, classification performance degrades rapidly as M increases.

SLIDE 21

Classification error rates achieved with gene clustering, L = 10 to 100, M = 4, 18, 36:

[Figure: classification error rate (%) versus the number of variable clusters, for 4, 18, and 36 components.]

Gene clustering improves classification.

Error rate (%):

               M = 4   M = 6   M = 12   M = 18   M = 36
No clustering  12.82   10.26    14.10    29.49    43.59
L = 50          8.97   10.26     7.69     5.13     5.13
L = 100         8.97    8.97     6.41     7.69     3.85

SLIDE 22

Variable clustering allows us to have more mixture components than the sample size. The number of parameters in the model is kept small by clustering along the variables.

Fix L = 20 (20 gene clusters) and vary M = 4 to 144:

[Figure: classification error rate (%) versus the number of mixture components.]

Even for M ≥ 36, the classification error rate remains below 8%.

SLIDE 23

Conclusions

A two-way mixture model approach is developed to classify high dimensional data.
– The model implies dimension reduction.
– Attributes are clustered in a way that preserves information about the class of a sample.

Applications of both discrete and continuous models have been studied.

Future work:
– Can the two-way mixture approach be extended to achieve dimension reduction under more general settings?
