Use of Microarray Data via Model-Based Classification in the Study and Prediction of Survival from Lung Cancer



slide-1
SLIDE 1

Use of Microarray Data via Model-Based Classification in the Study and Prediction of Survival from Lung Cancer

Liat Jones*, Angus Ng*, Chris Ambroise** , Katrina Monico* and Geoff McLachlan*

*Institute for Molecular Bioscience **Laboratoire Heudiasyc

University of Queensland

slide-2
SLIDE 2

AIM: To link gene-expression data with survival from lung cancer

A. CLUSTER ANALYSIS: We apply a model-based clustering approach to classify tumor tissues on the basis of microarray gene expression.

B. SURVIVAL ANALYSIS: The association between the clusters so formed and patient survival (recurrence) times is established.

C. DISCRIMINANT ANALYSIS: We demonstrate the potential of the clustering-based prognosis as a predictor of the outcome of disease.

slide-3
SLIDE 3

STANFORD and ONTARIO DATASETS:

cDNA microarrays were used to obtain gene expression profiles for the tissue (tumor) samples.

STANFORD: 918 genes ONTARIO: 2880 genes

The Stanford Dataset contains relatively more adenocarcinoma (AC) samples, and the Ontario Dataset contains only non-small cell lung carcinomas (NSCLC).

slide-4
SLIDE 4

Tumor Types in Stanford and Ontario Datasets

Tumor Type        Number of Samples
                  Stanford   Ontario
Adenocarcinoma    41         19
Squamous cell     16         14
Large cell        5          4
Adenosquamous     0          1
Carcinoid         0          1
Small Cell        5          0
TOTAL             67         39

slide-5
SLIDE 5

[Figure: Heat Map for 2880 Ontario Genes (39 Tissues). Axes: Tissues and Genes.]

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

CLUSTERING OF ONTARIO TUMORS Using EMMIX-GENE

Steps used in the application of EMMIX-GENE:

  • Select the most relevant genes from this filtered set of 2,880 genes. The set of retained genes is thus reduced to 766.
  • Cluster these 766 genes into twenty groups. The majority of gene groups produced were reasonably cohesive and distinct.
  • Using these twenty group means, cluster the tissue samples into two groups using a mixture of normal components/factor analyzers.
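The "group means" in the last step can be sketched as follows. This is a minimal illustration, not the EMMIX-GENE implementation: the function name `metagenes`, the dictionary layout, and the toy values are assumptions made here, and the gene-to-group labels stand in for the output of the gene-clustering step.

```python
from statistics import mean

def metagenes(expr, groups):
    """Return one group-mean profile ("metagene") per gene group.

    expr   : dict gene -> list of expression values, one per tissue
    groups : dict gene -> group label (hypothetical output of the
             gene-clustering step)
    """
    by_group = {}
    for gene, label in groups.items():
        by_group.setdefault(label, []).append(expr[gene])
    # Average the member genes tissue by tissue.
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_group.items()}

# Toy data: three genes measured on three tissues, two gene groups.
expr = {"g1": [1.0, 2.0, 3.0], "g2": [3.0, 2.0, 1.0], "g3": [0.0, 0.0, 6.0]}
groups = {"g1": "A", "g2": "A", "g3": "B"}
meta = metagenes(expr, groups)
```

With twenty gene groups, each tissue is then described by twenty metagene values rather than 766 gene values, which is what makes fitting a normal mixture to 39 tissues feasible.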

slide-9
SLIDE 9

Heat Maps for the 20 Ontario Gene-Groups (39 Tissues)

Tissues are ordered as: Recurrence (1-24) and Censored (25-39)

[Figure: heat maps. Axes: Tissues and Genes.]

slide-10
SLIDE 10

[Figure: Expression Profiles for Useful Metagenes (Ontario 39 Tissues). Panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20. x-axis: Tissues, Recurrence (1-24) then Censored (25-39); y-axis: Log Expression Value. Legend: Our Tissue Cluster 1, Our Tissue Cluster 2.]

slide-11
SLIDE 11

Expression Profiles of some Genes Identified in Ontario

Cluster A (down Rec, up Censored)

Clusters B and C (up Rec, down Censored)

[Figure: gene panels PNUTL1, FUS, Wee1, ATM, HIF1A, RABIF. x-axis: Tissues, Recurrence (1-24) then Censored (25-39); y-axis: Log Expression Value.]

slide-12
SLIDE 12

Only ZNF136 is retained by us and also identified in Ontario. It is found in our Group 19 (up-regulated in recurrence).

[Figure: expression profile of ZNF136. x-axis: Tissues, Recurrence (1-24) then Censored (25-39); y-axis: Log Expression Value.]

slide-13
SLIDE 13

Tissue Clusters

Tumors 1-24 belong to RECURRENCE group
Tumors 25-39 are CENSORED

CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:

CLUSTER 1: 1-14, 16-24 (recurrence) plus 25-29, 33, 36, 38 (censored)
CLUSTER 2: 15 (recurrence) plus 30-32, 34, 35, 37, 39 (censored)

slide-14
SLIDE 14

SURVIVAL ANALYSIS:

LONG-TERM SURVIVOR (LTS) MODEL

$$S(t) = \mathrm{prob}\{T > t\} = \pi_1 S_1(t) + \pi_2,$$

where T is time to recurrence and π1 = 1 − π2 is the prior probability of recurrence. Adopt a Weibull model for the survival function for recurrence S1(t).
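A short numerical sketch of the long-term survivor model may help. It assumes the Weibull parameterisation S1(t) = exp(−αt^β); the function names and the parameter values used below are illustrative choices, not fitted estimates for the Ontario data.

```python
import math

def weibull_survival(t, alpha, beta):
    """Weibull survival function S1(t) = exp(-alpha * t**beta)."""
    return math.exp(-alpha * t ** beta)

def lts_survival(t, pi2, alpha, beta):
    """Long-term survivor model: S(t) = pi1 * S1(t) + pi2, with pi1 = 1 - pi2."""
    pi1 = 1.0 - pi2
    return pi1 * weibull_survival(t, alpha, beta) + pi2

# Illustrative values only: 30% long-term survivors.
curve = [lts_survival(t, pi2=0.3, alpha=0.02, beta=1.1) for t in (0, 10, 50, 200)]
```

The curve starts at 1 and levels off at π2 rather than falling to zero, which is the plateau the LTS fit is compared against in a Kaplan-Meier plot.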

slide-15
SLIDE 15

Fitted LTS Model vs. Kaplan-Meier

slide-16
SLIDE 16

PCA of Tissues Based on Metagenes

First PC Second PC

slide-17
SLIDE 17

PCA of Tissues Based on Metagenes

First PC Second PC

slide-18
SLIDE 18

PCA of Tissues Based on All Genes (via SVD)

First PC Second PC

slide-19
SLIDE 19

PCA of Tissues Based on All Genes (via SVD)

First PC Second PC

slide-20
SLIDE 20

Cluster-Specific Kaplan-Meier Plots

slide-21
SLIDE 21

Survival Analysis for Ontario Dataset

Cluster   No. of Tissues   No. of Censored   Mean time to Failure (±SE)
1         29               8                 665 ± 85.9
2         8                7                 1388 ± 155.7

  • Nonparametric analysis:

A significant difference between Kaplan-Meier estimates for the two clusters (P=0.027).

  • Cox’s proportional hazards analysis:

Variable                     Hazard ratio (95% CI)   P-value
Cluster 1 vs. Cluster 2      6.78 (0.9 – 51.5)       0.06
Tumor stage (I vs. II&III)   1.07 (0.57 – 2.0)       0.83

slide-22
SLIDE 22

Discriminant Analysis (Supervised Classification)

A prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias.

A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes and the metagenes.

A cross-validation (CV) procedure was used to calculate the prediction error, after correction for the selection bias.
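The correction for selection bias amounts to redoing the gene selection inside every cross-validation fold, so that held-out tissues never influence which genes are chosen. The sketch below illustrates only that structure. To stay dependency-free it swaps in a simple mean-difference filter and a nearest-centroid classifier in place of the SVM with recursive feature elimination used in the slides; all function names and the toy data are assumptions.

```python
import random

def select_genes(X, y, k):
    """Rank genes by absolute difference of class means; keep the top k."""
    n_genes = len(X[0])
    def class_mean(j, c):
        vals = [x[j] for x, label in zip(X, y) if label == c]
        return sum(vals) / len(vals)
    scores = [abs(class_mean(j, 0) - class_mean(j, 1)) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: -scores[j])[:k]

def nearest_centroid(X, y, genes):
    """Return a predict function based on class centroids over `genes`."""
    cents = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in genes]
    def predict(x):
        return min(cents, key=lambda c: sum((x[j] - m) ** 2
                                            for j, m in zip(genes, cents[c])))
    return predict

def external_cv_error(X, y, k_genes, n_folds=10, seed=0):
    """Cross-validated error with gene selection repeated in each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    errors = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(Xtr, ytr, k_genes)   # uses training tissues only
        predict = nearest_centroid(Xtr, ytr, genes)
        errors += sum(predict(X[i]) != y[i] for i in test)
    return errors / len(X)

# Toy data: gene 0 separates the two classes, genes 1-5 carry no signal.
y = [i % 2 for i in range(20)]
X = [[10.0 * y[i] + 0.1 * (i % 3)] + [0.1 * ((i * (j + 3)) % 7) for j in range(1, 6)]
     for i in range(20)]
cv_error = external_cv_error(X, y, k_genes=1, n_folds=5)
```

Selecting genes once on all tissues and cross-validating only the classifier would reuse the held-out labels during selection and tends to give an optimistic error estimate.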

slide-23
SLIDE 23

ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)

[Figure: Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39). x-axis: log2 (number of genes); y-axis: Error Rate (CV10E).]

slide-24
SLIDE 24

STANFORD DATA

918 genes based on 73 tissue samples from 67 patients. Row and column normalized, retained 451 genes after select-genes step. Used 20 metagenes to cluster tissues. Retrieved histological groups.

slide-25
SLIDE 25

Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73) Heat Maps for the 20 Stanford Gene-Groups (73 Tissues)

Genes Tissues

slide-26
SLIDE 26

Reduced dataset of 35 Adenocarcinoma (AC) Tissues

Full dataset had 41 AC tissues. According to our cluster analysis:

  • AC tissues 5, 16, 26 are put with LCLC
  • AC tissues 7, 29 are put with SCLC
  • AC tissue 40 is put with SCC

Also, we did not add tissues 43 (LCLC) or 68 (SCC) (as done in the Stanford study), as they were both assigned to the LCLC cluster.

This left 35 AC tissues with 918 genes, reduced to 219 genes, which were clustered into 15 groups (metagenes).

slide-27
SLIDE 27

STANFORD CLASSIFICATION:

Cluster 1: 1-19 (good prognosis)
Cluster 2: 20-26 (long-term survivors)
Cluster 3: 27-35 (poor prognosis)

slide-28
SLIDE 28

Heat Maps for the 15 Stanford Gene-Groups (35 Tissues) Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)

Tissues Genes

slide-29
SLIDE 29

[Figure: Expression Profiles for Top Metagenes (Stanford 35 AC Tissues). Panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4. x-axis: Tissues; y-axis: Log Expression Value. Legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, Misallocated.]

slide-30
SLIDE 30

Which Genes make up the top 4 Metagenes ?

Group 1 (22 genes) includes:

  • ESTs Hs.11607
  • ataxia-telangiectasia group D-associated protein
  • solute carrier family 7, member 5 (CD98)
  • vascular endothelial growth factor C

Marker Genes for Group 3 (Supervised): High in group 3, low in 1 and 2 (4/10 genes)

Group 2 (12 genes) includes:

  • ornithine decarboxylase
  • carbonyl reductase (metabolic enzyme)

Marker Genes for Group 2 (Supervised): High in group 2, low in 3 (1/8 genes)

Group 3 (16 genes) includes:

  • aldo-keto reductase family 1
  • glutathione peroxidase
  • thioredoxin reductase

Metabolic Enzymes (Unsupervised): High in group 3, also SCC (3/6 genes)

Group 4 (14 genes) includes:

  • cartilage paired-class homeoprotein
  • tumor suppressor deleted in oral cancer-related 1

Marker Genes for Group 2 (Supervised): High in group 2, low in 3 (2/8 genes)

slide-31
SLIDE 31

Some other interesting Metagenes

Gene Group 7

Group 7 (19 genes) includes:

  • citron
  • surfactant A1

Marker Genes for Group 1 (Supervised): High in group 1, low in 2 (1/9 genes)

Surfactant Proteins (Unsupervised): High in groups 1 and 2, low in 3

Gene Group 9

Group 9 (22 genes) includes:

  • ICAM-1 (CD54)
  • collagen, type IX
  • hepsin
  • thyroid transcription factor

Marker Genes for Group 1 (Supervised): High in group 1, low in 2 (4/9 genes)

[Figure: expression profiles for Gene Groups 7 and 9. x-axis: Tissues; y-axis: Log Expression Value.]

slide-32
SLIDE 32

Cluster-Specific Kaplan-Meier Plots

slide-33
SLIDE 33

Cluster-Specific Kaplan-Meier Plots

slide-34
SLIDE 34

STANFORD DATA:

TWO-COMPONENT WEIBULL MIXTURE MODEL

$$S(t) = \pi_1 S_1(t) + \pi_2 S_2(t),$$

where

$$S_i(t) = \exp(-\alpha_i t^{\beta_i}) \quad (i = 1, 2).$$

slide-35
SLIDE 35

Plot of 1- and 2-component Weibull Mixture vs. Kaplan-Meier

slide-36
SLIDE 36

Survival Analysis for Stanford Dataset

Cluster   No. of Tissues   No. of Censored   Mean time to Failure (±SE)
1         17               10                37.5 ± 5.0
2         5                                  5.2 ± 2.3

  • Kaplan-Meier estimation:

A significant difference in survival between clusters (P<0.001)

  • Cox’s proportional hazards analysis:

Variable                       Hazard ratio (95% CI)   P-value
Cluster 3 vs. Clusters 1&2     13.2 (2.1 – 81.1)       0.005
Grade 3 vs. grades 1 or 2      1.94 (0.5 – 8.5)        0.38
Tumor size                     0.96 (0.3 – 2.8)        0.93
No. of tumors in lymph nodes   1.65 (0.7 – 3.9)        0.25
Presence of metastases         4.41 (1.0 – 19.8)       0.05

slide-37
SLIDE 37

Survival Analysis for Stanford Dataset

  • Univariate Cox’s proportional hazards analysis (metagenes):

Metagene   Coefficient (SE)   P-value
1          1.37 (0.44)        0.002
2          -0.24 (0.31)       0.44
3          0.14 (0.34)        0.68
4          -1.01 (0.56)       0.07
5          0.66 (0.65)        0.31
6          -0.63 (0.50)       0.20
7          -0.68 (0.57)       0.24
8          0.75 (0.46)        0.10
9          -1.13 (0.50)       0.02
10         0.73 (0.39)        0.06
11         0.35 (0.50)        0.48
12         -0.55 (0.41)       0.18
13         -0.61 (0.48)       0.20
14         0.22 (0.36)        0.53
15         1.70 (0.92)        0.06
slide-38
SLIDE 38

Survival Analysis for Stanford Dataset

  • Multivariate Cox’s proportional hazards analysis (metagenes):

Metagene   Coefficient (SE)   P-value
1          3.44 (0.95)        0.0003
2          -1.60 (0.62)       0.010
8          -1.55 (0.73)       0.033
11         1.16 (0.54)        0.031

The final model consists of four metagenes.
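A Cox coefficient and its standard error can be turned into a hazard ratio, an approximate 95% confidence interval, and a two-sided P-value. The sketch below assumes the reported P-values are Wald tests (the slides do not say which test was used); the function name `wald_summary` is hypothetical. The input is the coefficient 3.44 with standard error 0.95 reported for metagene 1.

```python
import math

def wald_summary(coef, se, z_crit=1.96):
    """Hazard ratio, approximate 95% CI, and two-sided Wald P-value."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail area
    hr = math.exp(coef)                      # hazard ratio per unit of the covariate
    ci = (math.exp(coef - z_crit * se), math.exp(coef + z_crit * se))
    return hr, ci, p

# Metagene 1 in the multivariate model: coefficient 3.44, SE 0.95.
hr, ci, p = wald_summary(3.44, 0.95)
```

Under this assumption the computed P-value rounds to the 0.0003 shown for metagene 1, which suggests the tabulated P-values are consistent with a Wald test.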

slide-39
SLIDE 39

STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)

[Figure: Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM), applied to g=2 clusters. x-axis: log2 (number of genes); y-axis: Error Rate (CV10E).]

slide-40
SLIDE 40

HARVARD DATA (203 Tissues) 3312 genes on 203 tumors (oligonucleotide arrays)

We imposed a floor of 1; ceiling of 3,000; logged data and then row (but not column) normalized. Continuing with 3190 genes, 1363 were retained after select-genes step, which were then clustered into 40 groups (metagenes).
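The preprocessing described here can be sketched as below. Several details are assumptions not stated on the slide: natural logarithms are used, "row normalized" is taken to mean standardising each gene (row) to zero mean and unit standard deviation across tissues, and the population standard deviation is used. The function name `preprocess` is hypothetical.

```python
import math
from statistics import mean, pstdev

def preprocess(rows, floor=1.0, ceiling=3000.0):
    """Clip to [floor, ceiling], log-transform, then standardise each row (gene)."""
    out = []
    for row in rows:                          # one row per gene
        clipped = [min(max(v, floor), ceiling) for v in row]
        logged = [math.log(v) for v in clipped]
        m, s = mean(logged), pstdev(logged)
        # Constant genes have zero spread; leave them centred at zero.
        out.append([(v - m) / s if s > 0 else 0.0 for v in logged])
    return out

# Toy data: the first gene is clipped at both ends, the second is constant.
normalized = preprocess([[0.5, 10.0, 5000.0], [7.0, 7.0, 7.0]])
```

The floor of 1 keeps the logarithm defined, and the ceiling limits the influence of saturated intensities.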

slide-41
SLIDE 41

Heat Maps for the 40 Harvard Gene Clusters (203 Tissues)

Tissues are ordered by their histological classification: Adenocarcinoma (1-127), Colon Mets (128-139), Normal (140-156), Squamous cell (157-177), Pulm. Carcinoids (178-197), Small cell (198-203)

[Figure: heat maps. Axes: Genes and Tissues.]

slide-42
SLIDE 42

HARVARD DATA (127 Tissues)

We retrieved histological groups and then focused on the 127 ACs.

Starting with 3190 genes, 858 genes were retained after the select-genes step, which were clustered into 20 groups.

slide-43
SLIDE 43

Heat Maps for the 20 Harvard Gene-Groups (127 AC Tumors)

Genes Tissues

Tissues are ordered as our clusters: Cluster 1 (1-55), Cluster 2 (56-110), Cluster 3 (111-127)

slide-44
SLIDE 44

Cluster-Specific Kaplan-Meier Plots

slide-45
SLIDE 45

Survival Analysis for Harvard Dataset

Cluster   No. of Tissues   No. of Censored   Mean time to Failure (±SE)
1         54               22                50.9 ± 5.8
2         47               25                62.2 ± 5.7
3         14               4                 26.5 ± 4.7

  • Kaplan-Meier estimation:

A significant difference in survival between clusters (P=0.044)

  • Cox’s proportional hazards analysis:

Variable                           Hazard ratio (95% CI)   P-value
Cluster 2 vs. Cluster 1            0.74 (0.4 – 1.4)        0.34
Cluster 3 vs. Cluster 2            3.08 (1.4 – 6.8)        0.005
Age                                1.02 (1.0 – 1.1)        0.12
Female vs. Male                    0.56 (0.3 – 1.0)        0.06
Smoking frequency                  1.24 (0.6 – 2.5)        0.54
Tumor size                         1.68 (0.9 – 3.3)        0.13
Presence of tumor in lymph nodes   2.50 (1.3 – 4.8)        0.006
Grade (1 vs. 2 to 4)               1.43 (0.7 – 2.9)        0.33

slide-46
SLIDE 46

HARVARD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE) applied to g=2 clusters

[Figure: Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM) applied to 2 clusters. x-axis: log2 (number of genes); y-axis: Error Rate (CV10E).]

slide-47
SLIDE 47

MARKER GENES FOR HARVARD DATA

For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

No. of genes   Times selected
55             1
18             2
11             3
7              4
8              5
6              6
10             7
8              8
12             9
17             10

slide-48
SLIDE 48

MARKER GENES FOR HARVARD DATA

  • Fc fragment of IgG, receptor, transporter, alpha
  • sine oculis homeobox (Drosophila) homolog 3
  • transcriptional intermediary factor 1 gamma
  • transcription elongation factor A (SII)-like 1
  • like mouse brain protein E46
  • minichromosome maintenance deficient (mis5, S. pombe) 6
  • transcription factor 12 (HTF4, helix-loop-helix transcription factors 4)
  • guanine nucleotide binding protein (G protein), gamma 3, linked
  • dihydropyrimidinase-like 2
  • Cluster Incl AI951946
  • transforming growth factor, beta receptor II (70-80kD)
  • protein kinase C-like 1
  • tubulin, alpha, ubiquitous
  • Cluster Incl N90862
  • cyclin-dependent kinase inhibitor 2C (p18, inhibits CDK4)
  • DEK oncogene (DNA binding)
  • Cluster Incl AF035316
  • transducin-like enhancer of split 2, homolog of Drosophila E(sp1)
  • ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase)
  • benzodiazapine receptor (peripheral)
  • Cluster Incl D21063
  • galactosidase, beta 1
  • high-mobility group (nonhistone chromosomal) protein 2
  • cold inducible RNA-binding protein
  • Cluster Incl U79287
  • BAF53
  • tubulin, beta polypeptide
  • thromboxane A2 receptor
  • H1 histone family, member X

slide-49
SLIDE 49

MICHIGAN DATA

4965 genes on 86 AC tumors (oligonucleotide arrays)

We imposed a floor of –100 and a ceiling of 26,000, applied the generalized log transformation, and then row normalized but did not column normalize. We retained 1394 genes after the select-genes step, which were clustered into 40 groups (metagenes).

Generalized log transformation:

$$\log\left[ x + \sqrt{x^2 + c^2} \right]$$
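The generalized log transformation is commonly written as log[x + √(x² + c²)]. Assuming that form, a minimal sketch follows. The constant c is not given on the slide, so the values passed below are purely illustrative.

```python
import math

def glog(x, c):
    """Generalized log: log(x + sqrt(x**2 + c**2)); defined for negative x too."""
    return math.log(x + math.sqrt(x * x + c * c))

# Behaves like log(2x) for large x, and stays finite at the floor of -100.
values = [glog(x, c=50.0) for x in (-100.0, 0.0, 100.0, 26000.0)]
```

Unlike the plain logarithm, this transformation accepts zero and negative intensities, which is presumably why a negative floor could be used for this dataset.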

slide-50
SLIDE 50

Heat Maps for the 40 Michigan Gene-Groups

Tissues in order of our clusters: Cluster 1 (1-34), Cluster 2 (35-69), Cluster 3 (70-86)

[Figure: heat maps. Axes: Tissues and Genes.]

slide-51
SLIDE 51

Cluster-Specific Kaplan-Meier Plots

slide-52
SLIDE 52

CONCLUSIONS

We applied a model-based clustering approach to classify tumors using their gene signatures into:

  (a) clusters corresponding to tumor type
  (b) clusters corresponding to clinical outcomes for tumors of a given subtype

In (a), there was almost perfect correspondence between cluster and tumor type, at least for non-AC tumors (except for the Ontario dataset).

slide-53
SLIDE 53

CONCLUSIONS (cont.)

The clusters in (b) were identified with clinical outcomes (e.g. recurrence/recurrence-free and death/long-term survival).

Except for the Michigan dataset, we were able to show that gene-expression data provide prognostic information beyond that of clinical indicators such as stage.

slide-54
SLIDE 54

CONCLUSIONS (cont.)

Based on the tissue clusters, a discriminant analysis using support vector machines (SVM) demonstrated further the potential of gene expression as a tool for guiding treatment and patient care for lung cancer patients.

This supervised classification procedure was used to provide marker genes for prediction of clinical outcomes (in addition to those provided by the cluster-genes step in the initial unsupervised classification).

slide-55
SLIDE 55

LIMITATIONS

  • Small number of tumors available (e.g. Ontario and Stanford datasets).
  • Clinical data available for only subsets of the tumors; often for only one tumor type (AC).
  • High proportion of censored observations limits comparison of survival rates.

slide-56
SLIDE 56

Effect of different platforms?

In both oligonucleotide arrays and the one cDNA array with sufficient AC tumors, we worked with 3 tissue clusters, which corresponded to low to high chance of survival.

There was a small subset of genes important for this differentiation of tissues that are common to the arrays, regardless of platform (e.g. Harvard and Stanford).

slide-57
SLIDE 57

Comparing some genes from Harvard and Stanford

Metabolic

  Genes: ornithine decarboxylase 1; glutathione S transferase pi; aldo-keto reductase family 1; aldehyde dehydrogenase 3 family; S100 calcium binding protein P; mucin 1 epithelial; glutathione peroxidase 2; thioredoxin reductase 1; S100 calcium binding protein A8; glutathione S transferase theta 1
  Harvard (Our Group): C3 (Group 6), C3, C3, C3, C3, C3 (Group 2)
  Stanford (Our Group): AC2 (Group 2), AC2/3 (Group 3), AC2/3 (Group 3), (Yes), AC2/3 (Group 3), AC2/3 (Group 3), AC3 (Group 1), (Yes)

Type II Pneumocyte

  Genes: thyroid transcription factor 1; surfactant B, C and D; cathepsin H; hepsin; cadherin 1
  Harvard (Our Group): C4, C4, C4 (Group 6)
  Stanford (Our Group): AC1 (Group 9), AC1 (Group 9), AC1 (Group 9)

slide-58
SLIDE 58

Ontario Dataset (39 Tumors)

Top 10 Selected Genes by Emmix-Gene

  • 161395:Data not found:not available
  • 146669:Hs.181125:IGL@
  • 182472:Data not found:not available
  • 148115:Hs.111279:FLJ10404
  • 346624:Data not found:not available
  • 114620:Hs.25911:D6S51E
  • 152203:Hs.108043:FLI1
  • 143221:Hs.76781:ABCD3
  • 124418:Data not found:not available
  • 153697:Hs.25486:not available

slide-59
SLIDE 59

Stanford Dataset (35 AC Tumors)

Top 10 Selected Genes by Emmix-Gene

  • CD36 antigen (collagen type I receptor, thrombospondin receptor)
  • signal transducer and activator of transcription 4 Hs.80642 R91570
  • aldo-keto reductase family 1, member C1
  • kynureninase (L-kynurenine hydrolase) Hs.169139 H87471
  • ESTs Hs.297820 AA013260
  • aldo-keto reductase family 1, member C1
  • maternally expressed 3 Hs.112844 W85841
  • **ATP-binding cassette, sub-family C (CFTR/MRP)
  • ESTs Hs.11607 AA443569
  • hypothetical protein FLJ12541 similar to Stra6 Hs.24553 R32753

slide-60
SLIDE 60

Michigan Dataset (86 AC Tumors)

Top 10 Selected Genes by Emmix-Gene

NULL PSMB9 ATP2B1 NULL COL13A1 IRS1 ETFDH ELF3 MMP7 NRGN

slide-61
SLIDE 61

Harvard Dataset (127 AC Tumors)

Top 10 Selected Genes by Emmix-Gene

  • glucuronidase, beta
  • Ets2 repressor factor
  • Cluster Incl M69245:Human pregnancy-specific beta-1 glycoprotein
  • KIAA0225 protein
  • tissue inhibitor of metalloproteinase 1
  • PHD finger protein 2
  • poliovirus receptor-related 2 (herpesvirus entry mediator B)
  • KIAA0699 protein
  • chromobox homolog 1 (Drosophila HP1 beta)
  • ATPase, H+ transporting, lysosomal (vacuolar proton pump), member D

slide-62
SLIDE 62

MICHIGAN DATA:

LTS MODEL

$$S_i(t) = \pi_{1i} S_1(t) + \pi_{2i},$$

where

$$S_1(t) = \exp(-\alpha t^{\beta}).$$

CONCLUDE:

$$\pi_{21} > \pi_{22} = \pi_{23}$$

slide-63
SLIDE 63

Survival Analysis for Michigan Dataset

Cluster   No. of Tissues   No. of Censored   Mean time to Failure (±SE)
1         34               27                84.1 ± 7.3
2         35               23                73.5 ± 8.5
3         17               12                41.7 ± 7.8

  • Kaplan-Meier estimation:

No significant difference in survival between clusters

  • Long-term survivor model:

                               Estimates (SE)
S1(t) (Weibull distribution)   α : 0.017 (0.007); β : 1.116 (0.215)
π2 (Logistic function)         -0.712 (0.218); 0.654 (0.169); 0.723 (0.702)

A significant difference in π2 between Clusters 1 & 2

Cluster   Proportion of long-term survivors
1         67.1%
2         51.5%
3         34.0%

slide-64
SLIDE 64

MICHIGAN DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE) applied to g=3 clusters

[Figure: Ten-fold Cross-Validation Error Rate (CV10E) of Support Vector Machine (SVM) applied to 3 clusters. x-axis: log2 (number of genes); y-axis: Error Rate (CV10E).]

slide-65
SLIDE 65

MARKER GENES FOR MICHIGAN DATA

For an SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.

No. of genes   Times selected
93             1
21             2
14             3
15             4
9              5
6              6
7              7
12             8
3              9
15             10

slide-66
SLIDE 66

MARKER GENES FOR MICHIGAN DATA For a SVM based on 64 genes, and using 10-fold CV, we noted the number of times a gene was selected.


STAT6 MT2A TNFRSF12 NULL PPFIA1 HBB FEN1 FRG1 F7 FABP5 BAK1 TMP21 SLC21A2 FOXM1 BMPR1A REGL NR3C1 MMP12 DGKQ SELP PIK3R1 COX7A1 CHAF1A SELPLG LOC51763 AGER PDLIM1 MEN1 VARS2 FGFR1

slide-67
SLIDE 67

MIXTURE OF g NORMAL COMPONENTS

$$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$$

where

$$-2 \log \phi(x; \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu) + \text{constant}$$

MAHALANOBIS DISTANCE: $(x - \mu)^T \Sigma^{-1} (x - \mu)$

EUCLIDEAN DISTANCE: $(x - \mu)^T (x - \mu)$
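The difference between the two quadratic forms on this slide can be seen in a small two-dimensional sketch. The function names are hypothetical and the 2 × 2 covariance matrix is inverted by hand to avoid any dependency; this is an illustration, not part of EMMIX-GENE.

```python
def euclidean_sq(x, mu):
    """Squared Euclidean distance (x - mu)^T (x - mu)."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def mahalanobis_sq_2d(x, mu, S):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu) for a 2 x 2 matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))

# A point two units away along an axis whose variance is 4.
x, mu = (2.0, 0.0), (0.0, 0.0)
S = [[4.0, 0.0], [0.0, 1.0]]
d_euclid = euclidean_sq(x, mu)          # 4.0
d_mahal = mahalanobis_sq_2d(x, mu, S)   # 1.0
```

The Mahalanobis form rescales each direction by the component covariance, so a point that is far away in Euclidean terms can still be close to a component that is elongated in that direction.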

slide-68
SLIDE 68

MIXTURE OF g NORMAL COMPONENTS

$$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$$

SPHERICAL CLUSTERS (k-means)

$$\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$$
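The link to k-means can be made explicit with a short derivation. It assumes, in addition to the common spherical covariance stated on the slide, that the mixing proportions are equal and that observations are assigned outright to a single component; these extra conditions are assumptions added here for the illustration.

```latex
% With \Sigma_i = \sigma^2 I and \pi_i = 1/g for all i (p = dimension of x):
-2 \log\{\pi_i \, \phi(x; \mu_i, \sigma^2 I)\}
  = \frac{(x - \mu_i)^T (x - \mu_i)}{\sigma^2}
  + p \log(2\pi\sigma^2) + 2 \log g .
% Only the first term depends on i, so
\arg\max_i \; \pi_i \, \phi(x; \mu_i, \sigma^2 I)
  = \arg\min_i \; (x - \mu_i)^T (x - \mu_i).
```

Under these conditions the mixture rule assigns each observation to the nearest component mean in Euclidean distance, which is the k-means assignment step.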

slide-69
SLIDE 69

With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.

$$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_i \phi(x; \mu_i, \Sigma_i) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$$
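The outright assignment rule can be sketched for a univariate normal mixture. The function names and the two-component parameters below are illustrative assumptions; in practice the parameters would come from fitting the mixture (for example by the EM algorithm) rather than being fixed by hand.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def assign(x, components):
    """components: list of (pi_i, mu_i, var_i).

    Returns the index of the component with the largest weighted density,
    together with the posterior probabilities of component membership.
    """
    weighted = [pi * normal_pdf(x, mu, var) for pi, mu, var in components]
    total = sum(weighted)
    post = [w / total for w in weighted]
    return max(range(len(post)), key=post.__getitem__), post

# Illustrative two-component mixture with equal proportions.
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
label, posterior = assign(0.5, components)
```

The posterior probabilities also indicate how clear-cut each assignment is, which a hard partition alone does not show.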

slide-70
SLIDE 70

Mixtures of Factor Analyzers

A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular, with high dimensional data.

One approach for reducing the number of parameters is to work in a lower dimensional space by adopting mixtures of factor analyzers.

slide-71
SLIDE 71

$$f(x_j) = \sum_{i=1}^{g} \pi_i \phi(x_j; \mu_i, \Sigma_i), \quad \text{where} \quad \Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g).$$

B_i is a p × q matrix and D_i is a diagonal matrix.
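The saving in parameters can be illustrated with a small sketch. The function names are hypothetical, and the count for the factor-analytic form uses the standard figure pq + p − q(q − 1)/2, which allows for the rotational indeterminacy of the loadings. The choice p = 20 mirrors the twenty metagenes; q = 2 is an arbitrary illustrative value.

```python
def fa_covariance(B, d):
    """Sigma = B B^T + D for a p x q loading matrix B and diagonal entries d."""
    p, q = len(B), len(B[0])
    return [[sum(B[r][k] * B[s][k] for k in range(q)) + (d[r] if r == s else 0.0)
             for s in range(p)] for r in range(p)]

def n_cov_params(p, q=None):
    """Free covariance parameters per component: unrestricted if q is None."""
    if q is None:
        return p * (p + 1) // 2
    return p * q + p - q * (q - 1) // 2

# One factor (q = 1) in two dimensions (p = 2).
sigma = fa_covariance([[1.0], [2.0]], [0.5, 0.5])
full, reduced = n_cov_params(20), n_cov_params(20, 2)   # 210 versus 59
```

With few tissues per cluster, the reduction from 210 to 59 covariance parameters per component is what makes the component covariances estimable at all.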