Graph Theoretic Latent Class Discovery and Its Robustness to Minimal Dominating Set Choice

SLIDE 1

Graph Theoretic Latent Class Discovery and Its Robustness to Minimal Dominating Set Choice

  • J. L. Solka, C. E. Priebe, and D. J. Marchette

jsolka@nswc.navy.mil;dmarche@nswc.navy.mil

NSWCDD

Interface04 – p.1/24

SLIDE 2

Agenda

  • What is latent class discovery?
  • What are some approaches to the latent class discovery process?
  • The class cover catch digraph classifier.
  • Latent class discovery results on a gene expression data set.
  • Wrap-up and conclusions.

SLIDE 3

Acknowledgments

  • Michael C. Minnotte, Jurgen Symanzik, and others, for organizing the conference.
  • The Office of Naval Research, through its ILIR Program, for funding this effort.

SLIDE 4

What is Latent Class Discovery?

  • A latent class is a class of observations that resides undiscovered within a known class of observations.
  • Develop a general methodology for discerning latent class structure during discriminant analysis:
      • in moderately large hyperdimensional data sets;
      • during training or testing.
  • Explore applications of the developed methodologies to the analysis of data sets in the areas of hyperdimensional image analysis, artificial olfactory systems, computer security data, gene expression data, and text data mining.

SLIDE 5

Flow Chart

[Flow chart boxes: Hyperdimensional Data; Graph Theoretic Discriminant Analysis; Metric Space Adaptation; Latent Classes; Nonlinear Dimensionality Reduction; Multidimensional Scaling; Insights]

SLIDE 6

Dominating Set

[Figure panels: two-class data and covering discs; dominating set]
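The covering-disc picture can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, gives each target-class point an open ball whose radius is the distance to the nearest point of the other class, and picks a dominating set with a simple greedy rule (ties go to the lowest index). The function names `cccd_radii`, `cccd_adjacency`, and `greedy_dominating_set` are hypothetical.

```python
import numpy as np

def cccd_radii(target, other):
    """Ball radius per target point: distance to the nearest other-class point."""
    # Pairwise Euclidean distances, shape (n_target, n_other).
    d = np.linalg.norm(target[:, None, :] - other[None, :, :], axis=2)
    return d.min(axis=1)

def cccd_adjacency(target, radii):
    """covers[i, j] is True when target point j lies inside the open ball of point i."""
    d = np.linalg.norm(target[:, None, :] - target[None, :, :], axis=2)
    covers = d < radii[:, None]
    # Assumption: every point covers itself, even if its radius is zero.
    np.fill_diagonal(covers, True)
    return covers

def greedy_dominating_set(covers):
    """Repeatedly pick the vertex whose ball covers the most still-uncovered points."""
    uncovered = np.ones(covers.shape[0], dtype=bool)
    chosen = []
    while uncovered.any():
        gains = (covers & uncovered[None, :]).sum(axis=1)
        best = int(np.argmax(gains))  # ties go to the lowest index
        chosen.append(best)
        uncovered &= ~covers[best]
    return chosen

# Toy one-dimensional layout embedded in the plane (illustrative data only).
target = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0], [5.5, 0.0]])
other = np.array([[3.0, 0.0]])
radii = cccd_radii(target, other)
print(greedy_dominating_set(cccd_adjacency(target, radii)))  # → [0, 3]
```

In this toy layout two balls suffice, one for each group of target points on either side of the other-class point. Different tie-breaking rules would return different, equally small sets, which is the source of the robustness question raised later in the talk.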

SLIDE 7

CCCD-Based Latent Class Discovery

[Figure: two-dimensional plot]

SLIDE 8

ALL/AML Leukemia Gene Expression Analysis

Data: 72 patients, 7129 genes.

  • Apply CCCD to ALL observations.
  • Cluster CCCD solution based on radii.
  • Examine clusters for latent class structure.
  • Ascertain significance of latent class structure.

[Figure legend: AML; ALL B-cell; ALL T-cell]

SLIDE 9

Resubstitution Error Rate Estimate

For each cluster map dimension $k$, an empirical risk (resubstitution error rate estimate) $\hat{L}_k$ is calculated as

$$\hat{L}_k := \frac{1}{n} \sum_{i=1}^{n} I\{ g_k(x_i) \neq y_i \},$$

where $g_k$ is the classifier built from the $k$-cluster map and $(x_i, y_i)$, $i = 1, \ldots, n$, are the training observations and their class labels.

SLIDE 10

Classification Dimension

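We proceed by defining the “scale dimension” $\hat{d}^{\star}$ to be the cluster map dimension that minimizes a dimensionality-penalized empirical risk:

$$\hat{d}^{\star} := \min \{ \arg\min_{k} ( \hat{L}_k + \delta \cdot k ) \}$$

for some penalty coefficient $\delta \in [0, 1]$, where $\hat{L}_k$ is the empirical risk at cluster map dimension $k$.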
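The selection rule can be illustrated with a short sketch. It assumes the empirical risks have already been computed for each cluster map dimension; the helper names `resubstitution_error` and `scale_dimension`, and the risk values used below, are made up for illustration and are not taken from the talk.

```python
def resubstitution_error(predicted, actual):
    """Fraction of training observations that the classifier gets wrong."""
    mistakes = sum(p != a for p, a in zip(predicted, actual))
    return mistakes / len(actual)

def scale_dimension(empirical_risk, delta):
    """Return the smallest k minimizing empirical_risk[k] + delta * k.

    empirical_risk maps each cluster map dimension k to its resubstitution
    error rate; delta is the penalty coefficient.
    """
    penalized = {k: risk + delta * k for k, risk in empirical_risk.items()}
    best = min(penalized.values())
    # The outer min breaks ties in favor of the lowest dimension.
    return min(k for k, value in penalized.items() if value == best)

# Illustrative risks only: error drops quickly, then flattens.
risks = {1: 0.30, 2: 0.10, 3: 0.05, 4: 0.05}
print(scale_dimension(risks, delta=0.02))  # → 3
print(scale_dimension(risks, delta=0.10))  # → 2
```

A larger penalty coefficient pushes the chosen dimension down, trading a higher resubstitution error for a simpler cluster map.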

SLIDE 11

ALL/AML Classification Dimension Plot

SLIDE 12

Gene Latent Class Discovery

SLIDE 13

ALL/AML MDS Plot

SLIDE 14

How Robust is the Methodology?

  • One other “success” story using artificial nose data.
  • What if we had used another dominating set in our analysis?
  • Is the discovered latent class structure independent of the dominating set used?

SLIDE 15

An Exhaustive Enumeration of All Possible Dominating Sets for the Gene Data

  • 180 21-node solutions
  • 16 of the nodes remain fixed across the solutions
  • 14 greedy solutions
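To make the enumeration concrete, here is a brute-force sketch on a toy digraph. It is an assumption-laden illustration rather than the procedure used for the gene data: it tries subsets in order of increasing size, so it is practical only for small graphs. The name `minimum_dominating_sets` and the toy `covers` mapping are hypothetical.

```python
from itertools import combinations

def minimum_dominating_sets(covers):
    """Enumerate every dominating set of minimum size.

    covers[v] is the set of vertices dominated by v, including v itself.
    """
    vertices = sorted(covers)
    everything = set(vertices)
    for size in range(1, len(vertices) + 1):
        found = [
            set(subset)
            for subset in combinations(vertices, size)
            if set().union(*(covers[v] for v in subset)) == everything
        ]
        if found:
            return found  # all solutions at the smallest feasible size
    return []

# Toy digraph: vertices 0-2 cover each other, 3-4 cover each other,
# and vertex 5 is covered only by itself.
covers = {
    0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2},
    3: {3, 4}, 4: {3, 4},
    5: {5},
}
solutions = minimum_dominating_sets(covers)
fixed = set.intersection(*solutions)  # vertices present in every solution
print(len(solutions), fixed)  # → 6 {5}
```

Vertex 5 plays the role of a node that remains fixed across the solutions: nothing else dominates it, so every minimum dominating set must contain it.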

SLIDE 16

Classification Space Curves for the 180 Solutions

[Figure: classification space curves, one per solution; axis ticks run from 5 to 20 and from 0.00 to 0.30]

SLIDE 17

Classification Dimension for the 180 Solutions (Red o: Greedy Solutions; Green *: Previous Solution)

[Figure: classification dimension for each solution; axis ticks run from 20 to 180 and from 2 to 7]

SLIDE 18

Number of Dominating Sets for Each Vertex

[Figure: number of dominating sets for each vertex. Axes: Vertex; # Dominating Sets. Legend: T-Cell; B-Cell; In-degree 0]

SLIDE 19

Digraph Analysis

Figure 1: Unique induced subgraphs on the 5 changing vertices in the 180 dominating sets.

Figure 2: Unique induced subgraphs on the 5 changing vertices in the 14 dominating sets that could result from a greedy algorithm.

SLIDE 20

Latent Class Discovery Figures of Merit

  • How can we be assured that all of the greedy dominating set solutions discover the same latent classes?
  • The previous greedy solution had 3 clusters that are pure B and 1 cluster that contained 8/9 of the T observations.
  • Figures of merit: the percentage of B points that are in pure B clusters, and the highest percentage of T points in any one cluster.
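The two figures of merit can be computed directly from cluster assignments. The sketch below assumes every point carries a label "B" or "T", that both labels occur at least once, and that a cluster is "pure B" when it contains no T points. It returns fractions rather than percentages; the function name `purity_figures` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def purity_figures(cluster_ids, labels):
    """Return (bpercent, tpercent) as fractions for one clustering."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)
    counts = {c: Counter(ls) for c, ls in members.items()}
    total_b = sum(cnt["B"] for cnt in counts.values())
    total_t = sum(cnt["T"] for cnt in counts.values())
    # B points that sit in clusters containing no T points.
    pure_b = sum(cnt["B"] for cnt in counts.values() if cnt["T"] == 0)
    # Largest number of T points gathered in any single cluster.
    max_t = max(cnt["T"] for cnt in counts.values())
    return pure_b / total_b, max_t / total_t

# Toy clustering: two pure B clusters, one mixed cluster, one lone T point.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
labels = ["B", "B", "B", "B", "B", "T", "T", "T", "B", "T"]
bpercent, tpercent = purity_figures(cluster_ids, labels)
print(bpercent, tpercent)  # 5 of 6 B points are in pure B clusters; 3 of 4 T points share a cluster
```

Computing this pair for every dominating set solution gives one point per solution, which is what a scatter of the two figures of merit against each other would show.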

SLIDE 21

Purity (Latent Class Discovery) for the Golub Gene Data (Red Triangles Are the Greedy Solutions)

[Figure: scatter plot of bpercent (ticks from 0.4 to 0.9) against tpercent (ticks from 0.80 to 1.00)]

SLIDE 22

Remaining Questions

  • Demonstrated similar latent class discovery among all of the greedy dominating set solutions.
  • Many of the 7129 variates (genes) are superfluous to the discriminant analysis problem.
  • Work is ongoing to examine the discovered latent classes based on subsets of the genes.
  • Various figures of merit have been used to choose the subsets of the genes.

SLIDE 23

Conclusions

  • Developed a new concept for latent class discovery during discriminant analysis.
  • Illustrated one graph theoretic methodology for the discovery of the latent classes.
  • Illustrated this methodology with a gene expression data set.
  • Presented some preliminary results examining the robustness of the discovery process to the CCCD process.

SLIDE 24

Readings

  • C. E. Priebe, J. L. Solka, D. J. Marchette, and B. T. Clark, “Class Cover Catch Digraphs for Latent Class Discovery in Gene Expression Monitoring by DNA Microarrays,” to appear in the Special Issue of Computational Statistics and Data Analysis on Statistical Visualization, 2002+.

  • J. L. Solka, C. E. Priebe, and B. T. Clark, “A Visualization Framework for the Analysis of Hyperdimensional Data,” in International Journal of Image and Graphics, Special Issue on Data Mining, 2002.

  • D. J. Marchette and C. E. Priebe, “Characterizing the Scale Dimension of a High-Dimensional Classification Problem,” in Pattern Recognition, 2002.
