Cluster Subspace Identification Via Conditional Entropy Calculations
James Diggans George Mason University jdiggans@gmu.edu Jeffrey L. Solka George Mason University jsolka@gmu.edu
Outline
- Subspace identification - why?
- Conditional entropy and clusters in R^2
- Ordering dimensions for subspace exploration
- Maximal cliques lead to automatic subspace identification
Initial, high-level exploration of complex data
Explore samples (observations) or genes
Cluster structure in patients may only be present in a subset of the dimensions.
Uninformed feature selection can discard the very dimensions that carry that structure.
Use of conditional entropy gives us:
- Distribution-free
- Robust to outliers/extreme values
- Minimal nuisance parameters
- Robust to noise as long as the noise exists in all subspaces
Adapted from a method proposed by Guo et al at the Workshop on Clustering High-Dimensional Data and its Applications, 2003.
Guo et al have data with many (~10,000) observations in relatively few dimensions. We have the opposite problem; we have many more dimensions (genes) than observations (samples). We flip Guo's method on its ear - pretend that the dimensions are the observations and the observations are the dimensions.
Pipeline: Gene Expression Data -> Nested Means discretization -> CE Distance Matrix -> Minimal Spanning Tree -> MST Order; CE Distance Matrix -> Clique Discovery -> Cliques
Each dimension is discretized at 'nested mean' boundaries; each cell of the resulting grid should hold ~35 points (Cheng et al, 1999). Guo uses (r is the number of intervals per dimension):

    r^2 * 35 <= n,  so  r = floor(sqrt(n / 35))

Example: For n = 10,000, r = 16, because 16 * 16 = 256 cells and 256 * 35 = 8,960 < 10,000.
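The interval rule and the discretization can be sketched as follows. This is a minimal reading of the slide, not the authors' code: the function names are mine, and the recursive split-at-the-mean scheme is the usual interpretation of 'nested means' (it yields 2**depth intervals per dimension, consistent with r = 16 = 2^4 in the example above).

```python
import math

def num_intervals(n, target=35):
    """Choose r intervals per dimension so the r x r grid averages
    ~`target` points per cell: r^2 * target <= n, i.e.
    r = floor(sqrt(n / target))."""
    return math.isqrt(n // target)

def nested_means_boundaries(values, depth):
    """Recursive 'nested means' cut points: split at the mean `depth`
    times, producing 2**depth intervals per dimension."""
    if depth == 0 or not values:
        return []
    mean = sum(values) / len(values)
    left = [v for v in values if v < mean]
    right = [v for v in values if v >= mean]
    return (nested_means_boundaries(left, depth - 1) + [mean]
            + nested_means_boundaries(right, depth - 1))
```

For n = 10,000 this reproduces r = 16 from the example above.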
Nested Means Matrix
Minimal Spanning Tree MST Order CE Distance Matrix Clique Discovery Cliques Gene Expression Data
For every pair of dimensions (X and Y), discretize the data into an r x r grid. Calculate entropy for every row and column; weight each by the fraction of points falling in that row or column. Add up the weighted row and column entropy values to obtain the conditional entropies CE(Y|X) and CE(X|Y).
    H(C) = - Σ_{x ∈ χ} d(x) log d(x)

where χ is the set of cells in a row or column and d(x) is the density (fraction of points) of cell x.
Example taken from Guo et al (150 total values, r = 6 intervals).

Row summaries (nonzero cell counts, row sum, weight, entropy):

    X1: 1, 3                  Sum  4   Wt .03   CE .314
    X2: 1, 9, 1, 1, 2         Sum 14   Wt .09   CE .629
    X3: 7, 14, 3, 7, 6        Sum 37   Wt .25   CE .835
    X4: 7, 6, 13, 19, 12, 5   Sum 62   Wt .41   CE .939
    X5: 4, 14, 5, 1, 1        Sum 25   Wt .17   CE .668
    X6: 1, 2, 3, 2            Sum  8   Wt .05   CE .737

Column summaries (X1..X6):

    Sum: 16, 36, 37, 33, 20, 8
    Wt:  .11, .24, .25, .22, .13, .05
    CE:  .597, .847, .806, .615, .540, .502

CE(Y|X) = .700   CE(X|Y) = .812   CEmax = .812
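A minimal sketch of the per-pair CE computation, assuming the H(C) formula above with natural logarithms. The toy grids below are my own; they are not the Guo et al table, whose full cell layout is not reproduced here.

```python
import math

def cell_entropy(counts):
    """H = -sum d(x) log d(x) over the nonzero densities d(x)
    of the cells in one row or column."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def conditional_entropies(grid):
    """Given a 2D count grid (rows = intervals of X, columns = intervals
    of Y), return the two weighted sums: each row/column entropy weighted
    by the fraction of all points falling in that row/column."""
    n = sum(sum(row) for row in grid)
    ce_rows = sum((sum(row) / n) * cell_entropy(row) for row in grid)
    cols = [list(c) for c in zip(*grid)]
    ce_cols = sum((sum(col) / n) * cell_entropy(col) for col in cols)
    return ce_rows, ce_cols
```

The CE distance between the two dimensions is then the larger of the two values, matching the CEmax entry in the example.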
The CE calculation results in a distance matrix over the dimensions. We can use graph theory to answer two questions:
- Topologically, is there a linear order that, when applied to the dimensions, places related dimensions next to one another?
- What fully-connected sub-graphs (cliques) exist in the graph induced by the distance matrix?
A minimum spanning tree (MST) is a spanning tree of minimal total edge weight. We can use the topological ordering of the MST to order the dimensions, as a 1D compression of the resulting hierarchical tree. We used Kruskal's algorithm in the RBGL R library.
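The original analysis called Kruskal's algorithm through the RBGL R package; as a self-contained stand-in, the same algorithm (union-find with path halving) can be sketched as:

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm over nodes 0..n-1.
    `edges` is a list of (weight, u, v) tuples, e.g. CE distances
    between pairs of dimensions. Returns the MST as (u, v, weight)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):  # consider edges in increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:               # keep the edge only if it joins two trees
            parent[ru] = rv
            mst.append((u, v, w))
    return mst
```

A depth-first walk of the returned tree then yields the 1D dimension ordering.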
After ordering the dimensions along the MST, related dimensions fall side by side. This ordering can show cluster structure when the data are viewed in adjacent dimensions. If we can see cluster structure across a run of adjacent dimensions, those dimensions suggest a common subspace. On the fully-connected graph induced by the CE distance matrix, we drop edges with large CE distances; on the resulting graph, we search for maximal cliques, each of which identifies a candidate subspace. Future work: a more efficient clique-discovery method.
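One way to realize the clique step, assuming the graph is built by dropping CE edges above a chosen threshold `t` (the threshold, function names, and the Bron-Kerbosch enumeration are my assumptions; the slides do not give the exact construction):

```python
def threshold_graph(dist, t):
    """Adjacency sets: connect dimensions whose CE distance is below t."""
    n = len(dist)
    return {i: {j for j in range(n) if j != i and dist[i][j] < t}
            for i in range(n)}

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; each clique is a
    candidate subspace (a set of mutually low-CE dimensions)."""
    cliques = []

    def bk(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r is maximal: nothing extends it
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p.discard(v)
            x.add(v)

    bk(set(), set(adj), set())
    return cliques
```

Bron-Kerbosch is exponential in the worst case, consistent with the note above that more efficient clique discovery is future work.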
- Nested means discretization and CE calculation
- MST ordering and dot files (our graph output format)
- Graphs visualized using AT&T's Graphviz
- All input and output files are tab-delimited
1,000 observations in R^100, distributed N(0,1) in each dimension:
- Observations 1-250 translated by +3 in a subset of the dimensions
- Observations 251-500 translated by -3 in a subset of the dimensions
- Observations 501-750 translated by +5 in a subset of the dimensions
- Observations 751-1000 translated by -5 in a subset of the dimensions
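A sketch of this synthetic data set. The slides do not list which dimensions each block is shifted in, so the dimension subsets below (five dimensions per block) are illustrative only, as is the function name:

```python
import random

def make_synthetic(n_obs=1000, n_dim=100, seed=0):
    """N(0,1) noise everywhere, with four blocks of 250 observations
    shifted by +3, -3, +5, -5 in small (hypothetical) dimension subsets."""
    rng = random.Random(seed)
    blocks = [(range(0, 250), +3), (range(250, 500), -3),
              (range(500, 750), +5), (range(750, 1000), -5)]
    # Hypothetical subspaces; the actual dimensions used are not given.
    subspaces = [range(0, 5), range(5, 10), range(10, 15), range(15, 20)]
    data = [[rng.gauss(0, 1) for _ in range(n_dim)] for _ in range(n_obs)]
    for (rows, delta), dims in zip(blocks, subspaces):
        for i in rows:
            for d in dims:
                data[i][d] += delta
    return data
```

Each block then forms a cluster that exists only in its own subspace, which is what the method is meant to recover.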
An experiment to determine the ability of the method to recover the known, embedded subspaces.
- Custom microarray: 7,129 genes, 72 samples
- 47 ALL samples (both B- and T-cell), 25 AML samples

T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999)
Acute lymphoblastic leukemia, B- and T-cell:
- Affymetrix U95Av2 chip: 12,625 genes, 128 samples
- 95 B-cell samples, 33 T-cell samples
An informative technique for initial high-level exploration of high-dimensional data.
Future directions:
- Concretely determine sensitivity to noise
- Develop a visualization tool for the MST ordering
- A more efficient clique-discovery method
C.-H. Cheng, A.W. Fu, Y. Zhang. Entropy-based Subspace Clustering for Mining Numerical Data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA (1999)
Feature Selection for High-Dimensional Clustering. [Name of Conference]. [date]
D. Guo, D.J. Peuquet, M. Gahegan. Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Analysis. ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA (2002)