Comparative Clustering Analysis of Gene Expression Profiles



SLIDE 1

Comparative Clustering Analysis of Gene Expression Profiles

Qun Shan
Genetics Division, MCB Department, University of California at Berkeley

Charless Fowlkes, Serge Belongie (now at UCSD), Jitendra Malik
Computer Science Department, University of California at Berkeley

SLIDE 2

Take Home Messages

• Part I: The GeneCut Program
  • a global, hierarchical clustering program
  • based on normalized cut
• Part II: Application to Rosetta’s dataset
  • Both GeneCut and the hierarchical clustering program used in the original paper capture essentially the same obvious clusters
  • Comparative clustering analysis provides an avenue for generating testable hypotheses

SLIDE 3

Challenges & Motivations

• Unknown ground truth and sparse data in multidimensional space
• None of the numerous clustering programs is perfect
• Our solutions
  • provide a state-of-the-art clustering program
  • comparative clustering analysis
  • user knowledge of how these different clustering programs work

SLIDE 4

Related Work

• Hierarchical Clustering Program (Eisen et al., 1998) is the most popular
  • simple, fast, and free
  • nice data visualization interface
  • Heuristic algorithm: greedy and pairwise
• SOM (Tamayo et al., 1999)
• CLICK (Sharan & Shamir, 2000)
• Support Vector Machine (Brown, 2000)

SLIDE 5

Why Should We Care About GeneCut

• GeneCut uses a state-of-the-art clustering algorithm (Kannan et al., 2000)
• GeneCut offers a global clustering method
• Comparative clustering analysis

SLIDE 6

GeneCut does Pairwise Clustering

• Pairwise clustering methods are based directly on the distances between all pairs of feature vectors in the data set
• In contrast, central clustering, used by k-means-based clustering programs, requires a small number of prototypical feature vectors
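
To make the pairwise idea concrete, here is a minimal sketch of the input a pairwise method works from: a matrix of distances between every pair of expression profiles, converted to similarities. The Euclidean distance, the Gaussian conversion, the `sigma` scale, and the toy numbers are all illustrative assumptions; the slides do not say which similarity measure GeneCut uses.

```python
import math

def pairwise_distances(profiles):
    """Euclidean distance between every pair of expression profiles."""
    n = len(profiles)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(profiles[i], profiles[j])
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist

def gaussian_affinity(dist, sigma=1.0):
    """Map distances to similarities in (0, 1]; sigma is a hypothetical scale."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row]
            for row in dist]

# Toy data (made-up numbers): 4 genes measured in 3 experiments.
profiles = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [3.0, 4.0, 0.0],
    [3.0, 4.0, 1.0],
]
dist = pairwise_distances(profiles)
W = gaussian_affinity(dist, sigma=1.0)
```

A central method such as k-means would instead keep only a few prototype vectors and never form the full n-by-n matrix `W`.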

SLIDE 7

Central Clustering Does Not Handle Transitivity Well

SLIDE 8

GeneCut does Global Clustering

SLIDE 9

Some Terminology for Graph Partitioning

• How do we bipartition a graph:

$$\mathrm{cut}(A, B) = \sum_{u \in A,\, v \in B} W(u, v), \quad \text{with } A \cap B = \emptyset$$

$$\mathrm{assoc}(A, A') = \sum_{u \in A,\, v \in A'} W(u, v), \quad A \text{ and } A' \text{ not necessarily disjoint}$$
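
As a small illustration of the two quantities on this slide, the sketch below computes cut (total edge weight between two disjoint node sets) and assoc (total edge weight from one node set to another, where the sets may overlap) on a toy weight matrix. The matrix values and the function names are made up for the example.

```python
def cut(W, A, B):
    """Total weight of edges running between disjoint node sets A and B."""
    if set(A) & set(B):
        raise ValueError("cut is defined only for disjoint sets")
    return sum(W[u][v] for u in A for v in B)

def assoc(W, A, A_prime):
    """Total weight from nodes in A to nodes in A_prime (sets may overlap)."""
    return sum(W[u][v] for u in A for v in A_prime)

# Toy symmetric weights: nodes 0-1 and 2-3 are tightly linked,
# with two weak links crossing between the pairs.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.1, 1.0, 0.0],
]
A, B = [0, 1], [2, 3]
V = A + B                   # all nodes
cut_AB = cut(W, A, B)       # weight crossing the partition
assoc_AV = assoc(W, A, V)   # all weight attached to nodes in A
assoc_AA = assoc(W, A, A)   # weight staying inside A
```

When A and B partition V, assoc(A, V) splits into the weight that stays inside A plus the weight that crosses to B, which is the identity checked by `assoc_AV == assoc_AA + cut_AB` up to rounding.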

SLIDE 10

Normalized Cut, A measure of dissimilarity

• Normalized Cut, Ncut:

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}$$

• Minimum cut is not appropriate since it favors cutting small pieces.
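
The following sketch, on an invented 7-node graph, shows why the normalization matters: the plain cut value is smallest when a single weakly attached node is split off, while Ncut, which divides the cut by each side's total attached weight, prefers the split between the two tight groups. The graph, weights, and helper names are hypothetical.

```python
def link_weight(W, A, B):
    """Sum of W[u][v] over u in A and v in B: equals cut(A, B) when the sets
    are disjoint, and assoc(A, B) in general."""
    return sum(W[u][v] for u in A for v in B)

def ncut(W, A, B):
    """Normalized cut of the bipartition (A, B) of all nodes."""
    V = list(A) + list(B)
    c = link_weight(W, A, B)
    return c / link_weight(W, A, V) + c / link_weight(W, B, V)

def build_weights(n, edges):
    """Symmetric weight matrix from (u, v, weight) triples."""
    W = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        W[u][v] = W[v][u] = w
    return W

# Two tight triangles {0,1,2} and {3,4,5} joined by a weak edge,
# plus node 6 hanging off node 5 by an even weaker edge.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 0.3), (5, 6, 0.2)]
W = build_weights(7, edges)

isolate = ([6], [0, 1, 2, 3, 4, 5])    # split off the single outlying node
balanced = ([0, 1, 2], [3, 4, 5, 6])   # split between the two triangles

cut_isolate = link_weight(W, *isolate)
cut_balanced = link_weight(W, *balanced)
ncut_isolate = ncut(W, *isolate)
ncut_balanced = ncut(W, *balanced)
```

Here `cut_isolate` is smaller than `cut_balanced`, so minimum cut would pick the single-node split, whereas `ncut_balanced` is far smaller than `ncut_isolate`, because cutting off node 6 removes all of its attached weight and drives the first Ncut term to 1.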

SLIDE 11

Solving the Normalized Cut problem

• Exact discrete solution to Ncut is NP-complete even on a regular grid
  • [Papadimitriou’97]
• Drawing on spectral graph theory, a good approximation can be obtained by solving a generalized eigenvalue problem.
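
In the normalized cut literature the relaxed problem is usually written as (D − W) y = λ D y, where W is the weight matrix and D is the diagonal matrix of node degrees; the eigenvector with the second-smallest eigenvalue is then thresholded to obtain a bipartition. The slide does not show this formula, so treat the sketch below as an illustration under that assumption. To stay dependency-free it uses power iteration with deflation in place of a library eigensolver, and it thresholds at zero, which is only one of several possible splitting rules. The toy graph and all names are made up.

```python
import math
import random

def second_generalized_eigenvector(W, iters=500, seed=0):
    """Approximate the eigenvector of (D - W) y = lambda * D y that has the
    second-smallest eigenvalue. Assumes a connected graph with positive degrees."""
    n = len(W)
    d = [sum(row) for row in W]            # node degrees (diagonal of D)
    s = [math.sqrt(x) for x in d]
    # S = (I + D^{-1/2} W D^{-1/2}) / 2 has eigenvalues in [0, 1]. Its top
    # eigenvector is proportional to sqrt(d) and corresponds to lambda = 0.
    S = [[(W[i][j] / (s[i] * s[j]) + (1.0 if i == j else 0.0)) / 2.0
          for j in range(n)] for i in range(n)]
    norm = math.sqrt(sum(x * x for x in s))
    top = [x / norm for x in s]
    rng = random.Random(seed)
    z = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(z, top))
        z = [a - proj * b for a, b in zip(z, top)]   # remove the trivial eigenvector
        z = [sum(S[i][j] * z[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in z))
        z = [a / length for a in z]
    return [z[i] / s[i] for i in range(n)]           # map back: y = D^{-1/2} z

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by one weak edge.
n = 6
W = [[0.0] * n for _ in range(n)]
for u, v, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 0.1)]:
    W[u][v] = W[v][u] = w

y = second_generalized_eigenvector(W)
side = {i for i, value in enumerate(y) if value > 0}   # threshold at zero
```

The overall sign of an eigenvector is arbitrary, so `side` may come out as either triangle; what matters is that the two triangles land on opposite sides.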

SLIDE 12

Approximating Using Random Samples

• Solving a big eigenvalue problem is computationally expensive
• An approximate solution is possible using a small subset of random samples → Nyström approximation
• Originally developed in 1928 for the solution of eigenfunction problems
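
For reference, the Nyström extension is commonly written in block form as below. The symbols are not from the slides and are introduced only for this sketch: A holds the weights among the n sampled points, B holds the weights between the samples and the remaining m points, and C (weights among the remaining points) is never computed.

```latex
\[
W = \begin{bmatrix} A & B \\ B^{\top} & C \end{bmatrix},
\qquad
A = U \Lambda U^{\top}
\]
\[
\bar{U} = \begin{bmatrix} U \\ B^{\top} U \Lambda^{-1} \end{bmatrix},
\qquad
\hat{W} = \bar{U} \Lambda \bar{U}^{\top}
        = \begin{bmatrix} A & B \\ B^{\top} & B^{\top} A^{-1} B \end{bmatrix}
\]
```

Only the n(n + m) entries of A and B are needed, and the unseen block C is implicitly replaced by B^T A^{-1} B. The columns of the extended eigenvector matrix are not orthogonal in general, so a practical implementation needs an additional orthogonalization step, which is omitted here.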

SLIDE 13

Summary

• GeneCut is based on normalized cut
  • global, hierarchical clustering
  • Recursive K-way partitioning
  • Stable clustering results
  • Nyström approximation

SLIDE 14

GeneCut: Web Interface

SLIDE 15

Rosetta Data Set

SLIDE 16

GeneCut: hierarchical trees

SLIDE 17

Ergosterol Cluster

erg2, erg3, ERG11 (tet promoter), HMG2 (tet promoter), yer044c (haploid), Itraconazole, Lovastatin, Terbinafine, hmg1 (haploid)

SLIDE 18

Cell Wall Cluster

yar014c, spf1, fks1 (haploid), anp1, 2-deoxy-D-glucose, glucosamine, swi4, swi5, gas1, Tunicamycin, yer083c

• YAR014C = BUD14: unknown function; however, it interacts with the cell wall related proteins GLC7p and YOL154p
• FKS1 is a plasma membrane protein
• Swi4p-Swi6p activates genes involved in cell wall biosynthesis

SLIDE 19

Mitochondria Cluster

SLIDE 20

GeneCut: A Close Look

SLIDE 21

Genes in The Ergosterol Cluster

SLIDE 22

A Close Look at YER044C

SLIDE 23

YER044C May Share Functions with Genes in This Cluster:

SLIDE 24

Where Does YER044C Fit in the Ergosterol Pathway?

SLIDE 25

Proteolysis Model for Regulation of Ergosterol Biosynthesis

  • A. Vik (2001)
SLIDE 26

Concluding Remarks

• GeneCut is a global clustering analysis program for gene expression data
• GeneCut is based on the normalized cut algorithm, and incorporates features such as k-way clustering and Nyström approximation
• Exploration of gene expression profiles through comparative clustering analysis