

  1. Cluster Subspace Identification Via Conditional Entropy Calculations
  James Diggans, George Mason University, jdiggans@gmu.edu
  Jeffrey L. Solka, George Mason University, jsolka@gmu.edu

  2. Outline
  • Subspace identification – why?
  • Conditional entropy and clusters in R².
  • Ordering dimensions for easy subspace visualization and identification.
  • Maximal cliques lead to automatic subspace identification.

  3. Subspace identification
  • Initial, high-level exploration of complex data can inform downstream analyses.
  • Explore samples (observations) or genes (dimensions), depending on intent.
  • Cluster structure in patients may only be revealed on a subset of genes, and vice versa (Getz et al.).
  • Uninformed feature selection can discard informative features.

  4. Conditional entropy and clusters in R²
  • Use of conditional entropy gives us a measure that is:
    • distribution-free;
    • robust to outliers/extreme values;
    • dependent on minimal nuisance parameters;
    • robust to noise, as long as the noise exists in all subspaces.
  • Adapted from a method proposed by Guo et al. of the Geography department at Penn State (Guo et al., Workshop on Clustering High-Dimensional Data and its Applications, 2003).

  5. Geography to … Microarrays?
  • Guo et al. have data with many (~10,000) observations on a few (~50) dimensions (measurements). [Diagram: the data matrix and its transpose, with the 'Obs.' and 'Dim.' axes swapped.]
  • We have the opposite problem: many more 'dimensions' (genes) than observations ('samples' or 'patients') on those dimensions.
  • We flip Guo's method on its ear: pretend that observations are dimensions and vice versa.

  6. The method
  [Pipeline diagram: Gene Expression Data (n_g × n_s) → Nested Means Matrix (n_r × n_s) → CE Distance Matrix (n_s × n_s) → MST Ordering / Clique Discovery → Cliques]

  7. CE – what are we looking for?

  8. Nested means discretization
  • Resistant to extreme outliers, unlike an equal-interval approach.
  • We calculate nested mean vectors as follows (sketched in R below):
    • Calculate the mean value of a dimension.
    • Divide the data into two halves at this mean.
    • Recursively divide each half in half again, accumulating a vector of 'nested mean' boundaries.
    • Stop once we have the required number of intervals r; k levels of recursion give r = 2^k intervals.
  • We want enough intervals that, on average, each cell contains ~35 points (Cheng et al., 1999). Guo uses n / r² ≈ 35, where r is the number of intervals. Example: for n = 10,000, r = 16, because 16 × 16 = 256 and 256 × 35 = 8,960 ≤ 10,000.
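
  Below is a minimal R sketch of this discretization, assuming boundaries are accumulated by recursive mean-splitting as described; the function name is illustrative, not from the authors' code.

    # Nested-means discretization: recursively split a vector at its mean;
    # k levels of recursion yield 2^k - 1 boundaries, i.e. r = 2^k intervals.
    nested_means_breaks <- function(x, k) {
      if (k == 0) return(numeric(0))
      m <- mean(x)
      c(nested_means_breaks(x[x <= m], k - 1),
        m,
        nested_means_breaks(x[x > m], k - 1))
    }

    # Example: discretize one dimension into r = 2^2 = 4 intervals
    x <- rnorm(1000)
    breaks <- nested_means_breaks(x, k = 2)
    bins <- cut(x, breaks = c(-Inf, breaks, Inf), labels = FALSE)
    table(bins)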

  9. The method
  [Pipeline diagram, repeated: Gene Expression Data (n_g × n_s) → Nested Means Matrix (n_r × n_s) → CE Distance Matrix (n_s × n_s) → MST Ordering / Clique Discovery → Cliques]

  10. Calculating CE
  • For every pair of dimensions (X and Y), discretize the 2D subspace (using the nested-means intervals); each cell is then represented in a table by the number of observations that fall in that cell.
  • Calculate entropy for every row and column; weight each by the row or column sum divided by the total number of observations.
  • Add up the weighted row entropies to get CE(X|Y), and the weighted column entropies to get CE(Y|X). The maximum of these two values is the final cluster-tendency measure.

  11. Calculating CE – worked example (taken from Guo et al.; 150 total values, r = 6 intervals)

  Normalized entropy of a single row or column C:

      H(C) = −Σ_{x ∈ χ} d(x) log d(x) / log|χ|

  where d(x) is the fraction of that row's (or column's) observations falling in cell x, and |χ| = r is the number of cells.

             X1    X2    X3    X4    X5    X6 |  Sum    Wt     CE
      Y1      0     1     3     0     0     0 |    4   .03   .314
      Y2      1     9     1     0     1     2 |   14   .09   .629
      Y3      7    14     3     7     6     0 |   37   .25   .835
      Y4      7     6    13    19    12     5 |   62   .41   .939
      Y5      0     4    14     5     1     1 |   25   .17   .668
      Y6      1     2     3     2     0     0 |    8   .05   .737
      Sum    16    36    37    33    20     8    weighted row CEs:    CE(X|Y) = .812
      Wt    .11   .24   .25   .22   .13   .05
      CE   .597  .847  .806  .615  .540  .502    weighted column CEs: CE(Y|X) = .700
                                                 CE_max = max(.812, .700) = .812
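
  A minimal R sketch of this computation, checked against the worked example above (function names are illustrative, not from the authors' code):

    # Normalized entropy of one row or column of the contingency table;
    # zero cells contribute nothing, and log(length(counts)) = log|chi|
    # rescales the result to [0, 1].
    norm_entropy <- function(counts) {
      p <- counts[counts > 0] / sum(counts)
      -sum(p * log(p)) / log(length(counts))
    }

    # CE for one pair of dimensions: weighted row entropies give CE(X|Y),
    # weighted column entropies give CE(Y|X); report the maximum.
    ce_pair <- function(tab) {
      n <- sum(tab)
      ce_x_given_y <- sum(apply(tab, 1, norm_entropy) * rowSums(tab) / n)
      ce_y_given_x <- sum(apply(tab, 2, norm_entropy) * colSums(tab) / n)
      max(ce_x_given_y, ce_y_given_x)
    }

    # The contingency table from the worked example (rows Y1..Y6, cols X1..X6)
    tab <- matrix(c(0,  1,  3,  0,  0, 0,
                    1,  9,  1,  0,  1, 2,
                    7, 14,  3,  7,  6, 0,
                    7,  6, 13, 19, 12, 5,
                    0,  4, 14,  5,  1, 1,
                    1,  2,  3,  2,  0, 0), nrow = 6, byrow = TRUE)
    ce_pair(tab)   # ~= .81, matching CE_max above up to rounding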

  12. The method
  [Pipeline diagram, repeated: Gene Expression Data (n_g × n_s) → Nested Means Matrix (n_r × n_s) → CE Distance Matrix (n_s × n_s) → MST Ordering / Clique Discovery → Cliques]

  13. Graph-theoretic analysis
  • The CE calculation results in a distance matrix; visualizing the fully-connected graph directly is of little use.
  • We can use graph theory to answer two questions:
    • Topologically, is there a linear order that, when sorted and imaged, can reveal cluster structure?
    • What fully-connected sub-graphs (cliques) exist in my data?

  14. Sample ordering – the MST
  • A minimum spanning tree (MST) is a spanning tree whose edges carry weights or lengths, and whose total weight (the sum of the weights of its edges) is minimal over all spanning trees.
  • We can use the topological ordering of the MST to create a relative ordering of our samples. Sorting the samples in this way in a data image can reveal structure.
  • We used Kruskal's algorithm – a greedy approach to generating an MST – via mstree.kruskal() in the RBGL R library (see the sketch below).
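
  A sketch of the graph construction and MST step, assuming the pairwise CE values sit in a symmetric matrix ce with sample names as dimnames:

    # Build a complete, undirected, weighted graphNEL from the CE matrix
    # and extract the MST with Kruskal's algorithm.
    library(graph)   # Bioconductor
    library(RBGL)

    samples <- rownames(ce)
    g <- new("graphNEL", nodes = samples, edgemode = "undirected")
    for (i in 1:(length(samples) - 1)) {
      for (j in (i + 1):length(samples)) {
        g <- addEdge(samples[i], samples[j], g, weights = ce[i, j])
      }
    }

    mst <- mstree.kruskal(g)   # list holding the MST's edge list and weights
    mst$edgeList               # 2 x (n-1) matrix of node-name pairs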

  15. Use of the MST to Induce Orderings on the Dimensions
  • Similar to UPGMA tree-building.
  • The linear ordering can be viewed as a 1D compression of the resulting hierarchical tree.
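
  One way to realize such a linear ordering is a depth-first walk of the MST. The sketch below reuses the mstree.kruskal() result from the previous slide and uses the igraph package for the traversal (an assumption; the slides do not say how the ordering was extracted).

    # Turn the MST into a linear sample ordering via a depth-first walk.
    library(igraph)

    g_mst <- graph_from_edgelist(t(mst$edgeList), directed = FALSE)
    mst_order <- names(dfs(g_mst, root = 1)$order)   # sample names in walk order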

  16. MST orderings on the image of the CE values
  • After ordering the samples according to their MST order, R's image() function generates a data image of the sorted CE matrix (sketched below).
  • This ordering can show us formerly-hidden cluster structure without any presupposition.
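
  A sketch of the imaging step, reusing mst_order from the traversal above:

    # Image the CE matrix with rows/columns sorted into the MST order.
    ce_ord <- ce[mst_order, mst_order]
    image(ce_ord, axes = FALSE, main = "CE values, MST-ordered")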

  17. Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph
  • If we can see cluster structure, can we retrieve it in an automatic fashion?
  • On the fully-connected graph, break all edges longer than a threshold distance (somewhat subjective; varies between data sets).

  18. Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph (continued)
  • On the resulting graph, find all cliques (fully-connected node sets); see the sketch below.
  • We used the clique() routine from Dr. Marchette's graph library.
  • Future work: a more efficient method is required.
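
  A sketch of the thresholding and clique steps. It uses RBGL's maxClique() rather than the clique() routine named on the slide, and the threshold value is illustrative:

    # Threshold the complete CE graph, then enumerate maximal cliques.
    library(graph)
    library(RBGL)

    threshold <- 0.8   # illustrative; the slides note this is subjective
    keep <- which(ce <= threshold & upper.tri(ce), arr.ind = TRUE)
    g2 <- new("graphNEL", nodes = rownames(ce), edgemode = "undirected")
    g2 <- addEdge(rownames(ce)[keep[, 1]], rownames(ce)[keep[, 2]], g2)
    maxClique(g2)$maxCliques   # list of maximal fully-connected node sets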

  19. Implementation details
  • Nested means discretization and calculation of conditional entropy written in R.
  • MST ordering and dot-file generation (our graph format of choice) written in Perl.
  • Graphs visualized using AT&T's Graphviz.
  • All input and output files are tab-delimited ASCII text.

  20. Anecdotal Results

  21. Artificial Data Set
  • 1,000 observations in R¹⁰⁰, distributed N(0,1) in each of the variates.
  • Observations 1–250 translated by +3 in dimensions {5, 6, 7, 8}.
  • Observations 251–500 translated by −3 in dimensions {24, 25, 26, 27, 28, 29, 30}.
  • Observations 501–750 translated by +5 in dimensions {55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67}.
  • Observations 751–1000 translated by −5 in dimensions {10, 11, 12, 13, 14}.
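
  This data set is easy to reconstruct in R; the sketch below follows the slide's description (the seed is an arbitrary assumption):

    # 1000 observations in R^100, N(0,1) everywhere, with four blocks of
    # observations translated in four disjoint subsets of dimensions.
    set.seed(1)
    X <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)
    X[1:250,    5:8]   <- X[1:250,    5:8]   + 3
    X[251:500,  24:30] <- X[251:500,  24:30] - 3
    X[501:750,  55:67] <- X[501:750,  55:67] + 5
    X[751:1000, 10:14] <- X[751:1000, 10:14] - 5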

  22. Artificial dataset results - MST

  23. Image of Sorted CE Values for the Artificial Dataset

  24. Golub dataset
  • An experiment to determine the ability of microarray data to separate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
  • Custom microarray, 7,129 genes.
  • 72 samples:
    • 47 ALL samples (both B- and T-cell)
    • 25 AML samples
  T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999).
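
  For readers who want to reproduce this, the experiment is packaged for Bioconductor; the sketch below assumes the golubEsets package, which the slides do not name:

    # Load the Golub data (assumes Bioconductor's golubEsets package).
    library(golubEsets)
    data(Golub_Merge)
    dim(exprs(Golub_Merge))      # 7129 genes x 72 samples
    table(Golub_Merge$ALL.AML)   # 47 ALL, 25 AML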

  25. Golub Dataset – MST
  [MST figure; legend: • AML samples • ALL samples]

  26. Image of Sorted CE Values for the Golub Dataset

  27. ALL data set
  • Acute lymphoblastic leukemia B- and T-cell data set contributed to Bioconductor by the Dana-Farber Cancer Institute.
  • Affymetrix U95Av2 chip, 12,625 genes.
  • 128 samples:
    • 95 B-cell samples
    • 33 T-cell samples
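
  Since the slide says the data were contributed to Bioconductor, they can presumably be loaded via the ALL package (an assumption; the slides do not name the package):

    # Load the ALL data set (assumes Bioconductor's ALL package).
    library(ALL)
    data(ALL)
    dim(exprs(ALL))                            # 12625 genes x 128 samples
    table(substr(as.character(ALL$BT), 1, 1))  # 95 B-cell, 33 T-cell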

  28. ALL – MST
  [MST figure; legend: • B-cell samples • T-cell samples]

  29. Image of Sorted CE Values for the ALL Dataset

  30. Summary/Conclusions
  • An informative technique for initial, high-level data exploration.
  • Future directions:
    • Concretely determine sensitivity to noise.
    • Develop a visualization tool for the MST ordering.
    • A more efficient clique-discovery method.

  31. References
  • Cheng, C., A. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA (1999).
  • Getz, G., E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. PNAS 97:22, 12079 (2000).
  • Guo, D. et al. Breaking Down Dimensionality: Effective and Efficient Feature Selection for High-Dimensional Clustering. Workshop on Clustering High-Dimensional Data and its Applications (2003).
  • Guo, D., D. Peuquet, and M. Gahegan. Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Patterns. The 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA (2002).
