SLIDE 1

Cluster Subspace Identification Via Conditional Entropy Calculations

James Diggans, George Mason University, jdiggans@gmu.edu
Jeffrey L. Solka, George Mason University, jsolka@gmu.edu

SLIDE 2

Outline

• Subspace identification - why?
• Conditional entropy and clusters in R².
• Ordering dimensions for easy subspace visualization and identification.
• Maximal cliques lead to automatic subspace identification.

SLIDE 3

Subspace identification

• Initial, high-level exploration of complex data can inform downstream analyses.
• Explore samples (observations) or genes (dimensions), depending on intent.
• Cluster structure in patients may only be revealed on a subset of genes, and vice versa (Getz et al.).
• Uninformed feature selection can discard informative features.

SLIDE 4

Conditional entropy and clusters in R²

Use of conditional entropy gives us:

• Distribution-free operation.
• Robustness to outliers/extreme values.
• Minimal nuisance parameters.
• Robustness to noise, as long as the noise exists in all subspaces.

Adapted from a method proposed by Guo et al. of the Geography department at Penn State (Guo et al., Workshop on Clustering High-Dimensional Data and its Applications, 2003).

SLIDE 5

Geography to … Microarrays?

Guo et al. have data with many (~10,000) observations in a few (~50) dimensions (measurements). We have the opposite problem: many more 'dimensions' (genes) than observations ('samples' or 'patients') on those dimensions.

We flip Guo's method on its ear: pretend that observations are dimensions and vice versa.

[Diagram: the Dim. × Obs. data matrix is transposed so that "Obs." play the role of "Dim." and vice versa.]

SLIDE 6

The method

[Flow diagram: Gene Expression Data (n_g × n_s) → Nested Means Matrix → CE Distance Matrix (n_s × n_s) → Minimal Spanning Tree → MST Order; the CE Distance Matrix also feeds Clique Discovery → Cliques.]

SLIDE 7

CE – what are we looking for?

SLIDE 8

Nested means discretization

• Resistant to extreme outliers, unlike an equal-interval approach.
• We calculate nested mean vectors by:
  • calculating the mean value of a dimension;
  • dividing the data into two halves on this mean;
  • recursively dividing each half in half again, building up a vector of 'nested mean' boundaries.
• Stop once we have the 'required' number of intervals (denoted r).
• We want enough intervals so that, on average, each cell contains ~35 points (Cheng et al., 1999). Guo uses (r is the number of intervals, k the recursion depth):

r² ≈ n / 35 and r = 2^k

Example: for n = 10,000, r = 16, because 16 × 16 = 256 and 256 × 35 = 8,960 < 10,000 (the next power of two, r = 32, would give 32² × 35 = 35,840 > 10,000).
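The recursion is compact enough to sketch in base R; this is an illustrative version, not our production code, and the function names are ours:

```r
## Sketch: nested-means interval boundaries for one dimension.
## x: numeric vector of values; k: recursion depth, so r = 2^k intervals.
nested_means <- function(x, k) {
  if (k == 0 || length(x) < 2) return(numeric(0))
  m <- mean(x)                              # split point at this level
  c(nested_means(x[x <= m], k - 1),         # boundaries within lower half
    m,
    nested_means(x[x > m], k - 1))          # boundaries within upper half
}

## Guo's rule: largest r = 2^k giving ~35 points per cell, i.e. 35 * r^2 <= n
choose_k <- function(n) max(1, floor(log2(sqrt(n / 35))))

## Example: n = 10,000 gives k = 4, so r = 16 intervals, matching the slide.
x <- rnorm(10000)
b <- nested_means(x, choose_k(10000))       # 15 boundaries -> 16 intervals
cells <- findInterval(x, b)                 # interval index for each value
```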

SLIDE 9

The method

[Flow diagram repeated: Gene Expression Data (n_g × n_s) → Nested Means Matrix → CE Distance Matrix (n_s × n_s) → Minimal Spanning Tree → MST Order; the CE Distance Matrix also feeds Clique Discovery → Cliques.]

SLIDE 10

Calculating CE

For every pair of dimensions (X and Y), discretize

the 2D sub-space (using the nested means intervals); each cell is then represented in a table by the number of observations that fall in that cell.

Calculate entropy for every row and column; weight

each by the row or column sum divided by the total number of observations.

Add up weighted row and column entropy values to

get CE(Y|X) and CE(X|Y). The maximum of these two values is the final cluster tendency measure.

SLIDE 11

Calculating CE

For a row or column C_x of the table, entropy is computed over its cell probabilities and normalized by the log of the number of intervals |χ|, so that 0 ≤ H ≤ 1:

H(C_x) = −[ Σ_{d∈χ} p_x(d) · log p_x(d) ] / log|χ|

where p_x(d) is the fraction of C_x's observations falling in cell d.

Worked example, taken from Guo et al. (150 total values, r = 6 intervals; zero cells omitted):

Row   Cell counts            Sum   Wt    CE
X1    1, 3                     4   .03   .314
X2    1, 9, 1, 1, 2           14   .09   .629
X3    7, 14, 3, 7, 6          37   .25   .835
X4    7, 6, 13, 19, 12, 5     62   .41   .939
X5    4, 14, 5, 1, 1          25   .17   .668
X6    1, 2, 3, 2               8   .05   .737

Column sums:     16    36    37    33    20    8
Column weights:  .11   .24   .25   .22   .13   .05
Column CEs:      .597  .847  .806  .615  .540  .502

CE(X|Y) = .812 (weighted row CEs), CE(Y|X) = .700 (weighted column CEs), CE_max = .812.
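A minimal base-R sketch of this computation (function names are ours; each entropy is normalized by log r as above, which reproduces the row CEs and CE_max ≈ .812 for this table):

```r
## Sketch: CE cluster-tendency for one pair of dimensions, given the
## r x r contingency table 'tab' of nested-means cell counts.
norm_entropy <- function(counts, r) {
  p <- counts[counts > 0] / sum(counts)     # cell probabilities (drop zeros)
  -sum(p * log(p)) / log(r)                 # entropy normalized to [0, 1]
}

ce_max <- function(tab) {
  r <- ncol(tab); n <- sum(tab)
  rs <- rowSums(tab); cs <- colSums(tab)
  ## CE(X|Y): entropy of each non-empty row, weighted by row sum / n
  ce_x_given_y <- sum(sapply(which(rs > 0), function(i)
    norm_entropy(tab[i, ], r)) * rs[rs > 0] / n)
  ## CE(Y|X): entropy of each non-empty column, weighted by column sum / n
  ce_y_given_x <- sum(sapply(which(cs > 0), function(j)
    norm_entropy(tab[, j], r)) * cs[cs > 0] / n)
  max(ce_x_given_y, ce_y_given_x)           # final cluster-tendency measure
}
```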

SLIDE 12

The method

[Flow diagram repeated: Gene Expression Data (n_g × n_s) → Nested Means Matrix → CE Distance Matrix (n_s × n_s) → Minimal Spanning Tree → MST Order; the CE Distance Matrix also feeds Clique Discovery → Cliques.]

SLIDE 13

Graph-theoretic analysis

CE calculation results in a distance matrix; visualizing the fully-connected graph is of little use. We can use graph theory to answer two questions:

• Topologically, is there a linear order that, when sorted and imaged, can reveal cluster structure?
• What fully-connected sub-graphs (cliques) exist in my data?

SLIDE 14

Sample ordering – the MST

• A minimum spanning tree (MST) is a spanning tree whose edges carry weights or lengths and whose total weight (the sum of its edge weights) is minimal.
• We can use the topological ordering of the MST to create a relative ordering of our samples; sorting the samples this way in a data image can reveal structure.
• We used Kruskal's algorithm, a greedy approach, via the RBGL R library (mstree.kruskal()) to generate the MST; a self-contained sketch follows the next slide.

SLIDE 15

Use of the MST to Induce Orderings on the Dimensions

• Similar to UPGMA tree-building.
• The linear ordering can be viewed as a 1D compression of the resulting hierarchical tree; see the sketch below.
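In practice we called mstree.kruskal(); as a self-contained illustration, here is a base-R Kruskal plus a depth-first read-out of the linear order, assuming D is the n_s × n_s CE distance matrix. The exact traversal we used to linearize the tree lived in our Perl code, so treat the DFS here as one reasonable choice, not a record of our implementation:

```r
## Sketch: Kruskal's MST over a symmetric CE distance matrix D,
## then a depth-first traversal of the tree to get a sample ordering.
mst_kruskal <- function(D) {
  n <- nrow(D)
  parent <- seq_len(n)                            # union-find forest
  find <- function(i) { while (parent[i] != i) i <- parent[i]; i }
  e <- which(upper.tri(D), arr.ind = TRUE)        # all candidate edges
  e <- e[order(D[e]), , drop = FALSE]             # sorted by CE distance
  mst <- matrix(0L, 0, 2)
  for (k in seq_len(nrow(e))) {
    a <- find(e[k, 1]); b <- find(e[k, 2])
    if (a != b) {                                 # edge joins two components
      parent[a] <- b
      mst <- rbind(mst, e[k, ])
      if (nrow(mst) == n - 1) break               # spanning tree complete
    }
  }
  mst
}

mst_order <- function(mst, n, root = 1) {
  adj <- lapply(seq_len(n), function(i)           # adjacency lists of the tree
    c(mst[mst[, 1] == i, 2], mst[mst[, 2] == i, 1]))
  seen <- rep(FALSE, n); ord <- integer(0); stack <- root
  while (length(stack) > 0) {                     # iterative depth-first walk
    v <- stack[length(stack)]; stack <- stack[-length(stack)]
    if (!seen[v]) {
      seen[v] <- TRUE; ord <- c(ord, v)
      stack <- c(stack, adj[[v]][!seen[adj[[v]]]])
    }
  }
  ord
}
```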

SLIDE 16

MST orderings on the image of the CE values

After ordering the samples according to their MST order, R's image() function can generate the data image shown here. This ordering can show us formerly-hidden cluster structure without any presupposition.

[Figure: data image of the CE matrix with rows and columns in MST order.]
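With the helpers sketched earlier (and D again the CE distance matrix), the reordered data image is two lines of R:

```r
ord <- mst_order(mst_kruskal(D), nrow(D))   # MST-derived sample ordering
image(D[ord, ord])                          # data image of the reordered CE matrix
```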

SLIDE 17

Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph

• If we can see cluster structure, can we retrieve it in an automatic fashion?
• On the fully-connected graph, break all edges longer than a threshold distance (somewhat subjective; it varies between data sets).

SLIDE 18

Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph

• On the resulting graph, find all cliques (fully-connected node sets), using the clique() routine from Dr. Marchette's graph library; a sketch with a stand-in routine is below.
• Future work: a more efficient method is required.
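As a reproducible stand-in for that clique() routine, the igraph package's max_cliques() does the same job. A sketch, with an illustrative (not prescribed) threshold:

```r
library(igraph)

thr <- 0.75                            # subjective threshold; varies by data set
A <- (D < thr) * 1                     # adjacency: keep only short edges
diag(A) <- 0                           # drop self-loops
g <- graph_from_adjacency_matrix(A, mode = "undirected")
cl <- max_cliques(g, min = 3)          # all maximal cliques of size >= 3
```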

SLIDE 19

Implementation details

• Nested means discretization and calculation of conditional entropy written in R.
• MST ordering and dot files (our graph format of choice) written in Perl.
• Graphs visualized using AT&T's Graphviz.
• All input and output files are tab-delimited ASCII text.

SLIDE 20

Anecdotal Results

SLIDE 21

Artificial Data Set

• 1000 observations in R¹⁰⁰, distributed N(0,1) in each of the variates.
• Observations 1-250 translated by +3 in dimensions {5, 6, 7, 8}.
• Observations 251-500 translated by −3 in dimensions {24, 25, 26, 27, 28, 29, 30}.
• Observations 501-750 translated by +5 in dimensions {55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67}.
• Observations 751-1000 translated by −5 in dimensions {10, 11, 12, 13, 14}.
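This construction is straightforward to reproduce in R (the seed is arbitrary):

```r
## Sketch: the artificial data set described above
set.seed(42)
X <- matrix(rnorm(1000 * 100), nrow = 1000)  # 1000 observations in R^100, N(0,1)
X[1:250,    5:8]   <- X[1:250,    5:8]   + 3
X[251:500,  24:30] <- X[251:500,  24:30] - 3
X[501:750,  55:67] <- X[501:750,  55:67] + 5
X[751:1000, 10:14] <- X[751:1000, 10:14] - 5
```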

SLIDE 22

Artificial dataset results - MST

SLIDE 23

Image of Sorted CE Values for the Artificial Dataset

SLIDE 24

Golub dataset

• An experiment to determine the ability of microarray data to separate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• Custom microarray, 7,129 genes.
• 72 samples: 47 ALL (both B- and T-cell) and 25 AML.

T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999).

SLIDE 25

Golub Dataset - MST

[Figure: MST of the Golub samples; legend: • ALL samples, • AML samples.]

SLIDE 26

Image of Sorted CE Values for the Golub Dataset

SLIDE 27

ALL data set

• Acute lymphoblastic leukemia B- and T-cell data set, contributed to Bioconductor by the Dana-Farber Cancer Institute.
• Affymetrix U95Av2 chip, 12,625 genes.
• 128 samples: 95 B-cell and 33 T-cell.
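The data are distributed as the Bioconductor ALL package; loading them for the CE pipeline is short (assuming the package is installed):

```r
library(ALL)        # Bioconductor data package from Dana-Farber
data(ALL)           # ExpressionSet: 12,625 probes x 128 samples
X <- exprs(ALL)     # expression matrix input for the CE pipeline
```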

SLIDE 28

ALL - MST

[Figure: MST of the ALL samples; legend: • B-cell samples, • T-cell samples.]

SLIDE 29

Image of Sorted CE Values for the ALL Dataset

SLIDE 30

Summary/Conclusions

An informative technique for initial, high-level data exploration.

Future directions:
• Concretely determine sensitivity to noise.
• Develop a visualization tool for the MST ordering.
• Find a more efficient clique-discovery method.

SLIDE 31

References

• Cheng, C., A. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA (1999).
• Getz, G., E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. PNAS 97:22, 12079 (2000).
• Guo, D., et al. Breaking Down Dimensionality: Effective and Efficient Feature Selection for High-Dimensional Clustering. Workshop on Clustering High-Dimensional Data and its Applications (2003).
• Guo, D., D. Peuquet, and M. Gahegan. Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Patterns. The 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA (2002).