SLIDE 1

TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes

Inuk Jung, Hongryul Ahn, Kyuri Jo, Hyejin Kang, Youngjae Yu and Sun Kim

Inuk Jung (inukjung@snu.ac.kr) Bio and Health Informatics lab Seoul National University

SLIDE 2

Goal of this study

Identify biologically meaningful gene clusters (triclusters) that have significantly similar or differential expression patterns from 3 dimensional time series data (Gene-Time-Condition)

SLIDE 3

Goal of this study

Example
Organism: Mouse (18117 genes)
Time points: day 0, day 3, day 7, day 14
Conditions: malaria-infected intact female, gonadectomized* (gdx) female, intact male, gdx male

289872 expression values (G×T×C = 18117 × 4 × 4)

[Figure: example gene clusters (100 genes, 80 genes, 200 genes) labeled as Differentially Expressed Patterns (DEP) and Similarly Expressed Pattern (SEP)]

*Removal of ovaries or testes

SLIDE 4

Two technical problem statements

1. High clustering complexity by dimensions
2. Technical difficulty in capturing differential expression patterns between two or more conditions (What are DEGs in time series data?)

SLIDE 5

P1. High clustering complexity by dimensions

One dimension (C): DEG analysis used for time series analysis [1] (2000)
  Does not take into account the sequential nature of time series expression data

Two dimensions (GT or GC): Biclustering algorithm developed for time series data [2] (2000)
  Biclustering is NP-hard and is bound to 2 dimensional clustering (either gene-time or gene-condition)

Three dimensions (GCT): First triclustering algorithm developed, TriCluster [3] (2005)
  Only able to identify triclusters with similar expression patterns (SEP)

Three dimensions (GCT): Triclustering tool that is able to identify DEPs [4] (2012)
  Identification process of DEP is based on similarity measures, which gives poor performance

[1] Alizadeh et al, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 2000
[2] Cheng et al, Biclustering of expression data, ISMB 2000
[3] Zhao et al, The Tricluster algorithm, ACM SIGMOD 2005
[4] Tchagang et al, The OPTricluster algorithm, BMC Bioinformatics 2012

SLIDE 6

P2. Capturing differential expression patterns between two or more conditions

Divergent pattern recognition is not available
A divergent pattern is one where the expression pattern differs between all conditions
OPTricluster performs a pairwise comparison for detecting divergent expression pattern clusters
In the case of four conditions (A, B, C, D) the comparisons are:
A vs BCD, B vs ACD, C vs ABD, D vs ABC
Hence A!=B!=C!=D is not supported
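
The comparison scheme listed on this slide can be enumerated in a few lines. This is only an illustrative sketch of the one-against-the-rest pattern shown above (the variable names are made up for the example), not OPTricluster code.

```python
# Enumerate the one-against-the-rest comparisons for four conditions.
conditions = ["A", "B", "C", "D"]

comparisons = [
    (single, [other for other in conditions if other != single])
    for single in conditions
]

for single, rest in comparisons:
    print(f"{single} vs {''.join(rest)}")
```

Every comparison isolates exactly one condition, so a cluster in which all four conditions differ from each other never appears as a single test.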

SLIDE 7

TimesVector Framework

[Figure: framework overview with two stages, Clustering and Detecting patterns]

SLIDE 8

Clustering – Dimension reduction

Dimension reduction by stripping away the sample dimension and concatenating it to the time dimension
Reduces the burden on the clustering and post-processing procedures
No information is lost

3 dimensional matrix (G×C×T): one Genes (i) × Time (j) table per sample s1, s2, …, sk

2 dimensional matrix (G×CT) after concatenating samples: Genes (i) × Time (j)⋅Conditions (k)

      s1          s2              sk
      t1  t2  t3  t1  t2  t3  …   t1  t2  t3
g1    15  20  10  15  10   5  …   25  23  22
g2    39  52  31  35  22  12  …   55  52  48
g3     8  16   6   7   3   1  …   20  18  17
…      …   …   …   …   …   …  …    …   …   …
gi    25  23  25  14  15  13  …   17  16  16
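
The concatenation step can be sketched with plain nested lists. The values below are the s1 and s2 panels for g1 to g3 from the slide; the function name and the condition-major column order are assumptions made for this example.

```python
# Toy G×C×T data stored as tensor[condition][gene] = expression over time.
tensor = [
    [[15, 20, 10], [39, 52, 31], [8, 16, 6]],  # sample s1: g1, g2, g3
    [[15, 10, 5], [35, 22, 12], [7, 3, 1]],    # sample s2: g1, g2, g3
]

def concatenate_conditions(tensor):
    """Flatten a C x G x T nested list into a G x (C*T) matrix."""
    n_genes = len(tensor[0])
    matrix = []
    for g in range(n_genes):
        row = []
        for condition in tensor:
            row.extend(condition[g])  # append this condition's time course
        matrix.append(row)
    return matrix

matrix = concatenate_conditions(tensor)
for row in matrix:
    print(row)
```

Each gene keeps all of its C×T values, which is why the slide can state that no information is lost.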

SLIDE 9

Spherical K-means clustering

Spherical K-means (skmeans) for clustering the vectors
A K-means clustering algorithm with cosine similarity as its distance metric
Vectors are normalized to unit vectors – this causes projection of vectors to a sphere
Minimize the cosine dissimilarity in all clusters:

\min \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \left( 1 - \cos(x_i, c_k) \right)

\mu_{ik}: indicator of gene i having membership to cluster k
c_k: the centroid of cluster k
x_i: expression level vector of gene i
N: total number of genes
K: total number of clusters
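
A minimal sketch of spherical K-means, assuming non-zero input vectors. The deterministic farthest-first initialization and all function names are choices made for this example; they are not taken from the skmeans package or from TimesVector itself.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(vectors, k, max_iter=100):
    unit = [normalize(v) for v in vectors]
    # Assumed initialization: start from the first vector, then repeatedly
    # add the vector least similar to the centroids chosen so far.
    centroids = [unit[0]]
    while len(centroids) < k:
        centroids.append(min(unit, key=lambda x: max(dot(x, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(x, centroids[j])) for x in unit]
        if new_labels == labels:
            break
        labels = new_labels
        # Update each centroid as the re-normalized sum of its members.
        for j in range(k):
            members = [x for x, lab in zip(unit, labels) if lab == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids

# Three "peak then decline" vectors from the slide plus two made-up rising ones.
vectors = [
    [15, 20, 10, 15, 10, 5],
    [39, 52, 31, 35, 22, 12],
    [8, 16, 6, 7, 3, 1],
    [1, 2, 9, 1, 2, 9],
    [2, 3, 10, 1, 3, 12],
]
labels, centroids = spherical_kmeans(vectors, k=2)
print(labels)
```

Because every vector is scaled to unit length, genes with the same shape but different magnitude (such as g1 and g2) end up in the same cluster.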

SLIDE 10

Selecting K by silhouette score

Using four microarray and RNA-seq time-series data sets, the K with the highest silhouette score was chosen

[Figure: optimal K (y-axis, 100 to 700) plotted against Condition × Time points, C×T (x-axis, 5 to 30)]

Data               C  T  C×T  K
GSE74465 (Rice)    2  3   6   100
GSE11651 (Yeast)   5  3  15   200
GSE4324 (Mouse)    4  4  16   500
GSE39429 (Rice)    4  6  24   600
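
Selecting K by silhouette score can be sketched as follows, using cosine distance to match the clustering step. The two candidate labelings are written by hand for illustration; in practice each would come from a clustering run with that K. The toy vectors and function names are assumptions.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def silhouette_score(vectors, labels):
    """Mean silhouette over all vectors; a singleton cluster scores 0."""
    clusters = sorted(set(labels))
    scores = []
    for i, x in enumerate(vectors):
        def mean_dist(cluster):
            d = [cosine_distance(x, y) for j, y in enumerate(vectors)
                 if labels[j] == cluster and j != i]
            return sum(d) / len(d) if d else None
        a = mean_dist(labels[i])          # cohesion within own cluster
        if a is None:
            scores.append(0.0)
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

vectors = [
    [15, 20, 10, 15, 10, 5],
    [39, 52, 31, 35, 22, 12],
    [8, 16, 6, 7, 3, 1],
    [1, 2, 9, 1, 2, 9],
    [2, 3, 10, 1, 3, 12],
]
# Hypothetical labelings for K=2 and K=3.
candidates = {2: [0, 0, 0, 1, 1], 3: [0, 0, 2, 1, 1]}
scores = {k: silhouette_score(vectors, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```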

SLIDE 11

Detecting clusters with distinct expression patterns

Re-introduce the condition dimension by splitting vectors by condition
The bZIP gene vector is dissected into one vector per condition

v(bZIP)=<1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3>

[Figure: dissected vectors of conditions A, B, C and D over 0h, 1h and 6h, shown with the condition centroids]
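
Dissecting the bZIP vector can be sketched as a simple slicing step. This assumes the concatenated vector is ordered condition by condition (A, B, C, D), each with the three time points 0h, 1h and 6h; the function name is made up for the example.

```python
def dissect(vector, n_conditions):
    """Split a concatenated C*T vector into one time course per condition."""
    n_time = len(vector) // n_conditions
    if n_time * n_conditions != len(vector):
        raise ValueError("vector length is not a multiple of n_conditions")
    return [vector[c * n_time:(c + 1) * n_time] for c in range(n_conditions)]

v_bzip = [1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3]
parts = dict(zip("ABCD", dissect(v_bzip, 4)))
for condition, course in parts.items():
    print(condition, course)
```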

SLIDE 12

Three types of patterns are defined

DEP (Differentially Expressed Pattern)
All samples in a cluster have different expression patterns
ODEP (One Differentially Expressed Pattern)
One sample in a cluster has a different expression pattern from the others
SEP (Similarly Expressed Pattern)
All samples in a cluster have similar expression patterns

SLIDE 13

Method – DEP pattern recognition

Objective: Test whether the expression patterns of conditions A, B, C satisfy A!=B!=C
Build a centroid for each condition within each cluster
Select the outermost centroid as the base centroid

[Figure: clusters C1, C2 and C3 over 0h, 1h and 6h, showing the dissected vectors of conditions A, B and C with the A, B and C centroids]

SLIDE 14

Method – DEP pattern recognition

1. Compute cosine distance from each dissected vector to the base centroid for each cluster
2. Rank dissected vectors by cosine distance
3. Measure Mutual Information with X as distance to base centroid and Y as condition
4. Measure significance of MI by 1000 random permutation tests

[Figure: cluster C2 over 0h, 1h and 6h, showing the dissected vectors of conditions A, B and C, the A, B and C centroids and the base centroid]

Phenotype         A     A     A     A     B     B     B     B     C     C     C     C
clid              G1_A  G2_A  G3_A  G4_A  G1_B  G2_B  G3_B  G4_B  G1_C  G2_C  G3_C  G4_C
C2                0.9   0.87  0.96  0.99  0.1   0.05  0.2   0.18  0.5   0.6   0.57  0.61
Rank              10    9     11    12    2     1     4     3     5     7     6     8
Discretized Rank  3     3     3     3     1     1     1     1     2     2     2     2
MI                Log2(4)=2
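
Steps 2 to 4 can be sketched on the cluster C2 table. This assumes ranks are discretized into as many equal-size bins as there are conditions and that MI is the standard discrete mutual information in bits; both are assumptions of this sketch. Under this definition the perfectly separated table gives log2(3) ≈ 1.585 bits, so the Log2(4)=2 shown on the slide presumably follows a different convention.

```python
import math
import random
from collections import Counter

conditions = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
distances = [0.9, 0.87, 0.96, 0.99, 0.1, 0.05, 0.2, 0.18, 0.5, 0.6, 0.57, 0.61]

def discretized_ranks(values, n_bins):
    """Rank values ascending and map the ranks onto n_bins equal-size bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank0, i in enumerate(order):
        bins[i] = rank0 * n_bins // len(values) + 1
    return bins

def mutual_information(xs, ys):
    """Discrete mutual information in bits."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Share of label shuffles whose MI reaches the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

bins = discretized_ranks(distances, n_bins=3)
mi = mutual_information(bins, conditions)
p_value = permutation_p_value(bins, conditions)
print(bins, round(mi, 3), p_value)
```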

SLIDE 15

Method – ODEP pattern recognition

Objective: Test whether the expression pattern of one condition among A, B, C differs from the others: A!=BC (B=C) or B!=AC (A=C) or C!=AB (A=B)

1. Compute a base centroid of the comparing conditions (BC, AC, AB)
2. Compute cosine distance of dissected vectors to the centroid for each combination
3. Perform ANOVA on the computed cosine distance combinations

[Figure: clusters C1, C2 and C3 over 0h, 1h and 6h, showing the dissected vectors of conditions A, B and C with the A, B and C centroids]
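
The three ODEP steps can be sketched as below. The dissected vectors are made-up toy values (A rising, B and C peaking at the middle time point), and the sketch stops at the one-way ANOVA F statistic: turning F into a p-value needs the F distribution, which is left out here. All names are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(u, v):
    return 1.0 - sum(a * b for a, b in zip(normalize(u), normalize(v)))

def centroid(vectors):
    """Unit-length mean direction of the given vectors."""
    unit = [normalize(v) for v in vectors]
    return normalize([sum(col) for col in zip(*unit)])

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of numeric groups."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical dissected vectors (0h, 1h, 6h) of one cluster.
dissected = {
    "A": [[1, 2, 9], [2, 3, 10], [1, 3, 12]],
    "B": [[15, 20, 10], [39, 52, 31], [8, 16, 6]],
    "C": [[14, 21, 9], [35, 50, 33], [9, 15, 7]],
}

f_by_candidate = {}
for odd in dissected:
    # Step 1: base centroid of the other conditions (BC, AC or AB).
    rest = [v for cond, vecs in dissected.items() if cond != odd for v in vecs]
    base = centroid(rest)
    # Step 2: cosine distance of every dissected vector to that centroid.
    groups = [[cosine_distance(v, base) for v in vecs] for vecs in dissected.values()]
    # Step 3: ANOVA across the per-condition distance groups.
    f_by_candidate[odd] = one_way_anova_f(groups)

print(f_by_candidate)
```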

SLIDE 16

Method – SEP pattern recognition

Objective: Test whether the expression patterns of conditions A, B, C satisfy A=B=C

1. Compute a base centroid of all conditions within a cluster
2. Compute cosine distance of dissected vectors to the base centroid
3. Tightness threshold: lower bound of the 99% confidence interval of tightness over all clusters
4. Clusters with tightness below this lower bound are SEP clusters

[Figure: clusters C1, C2 and C3 over 0h, 1h and 6h, showing the dissected vectors of conditions A, B and C with the A, B and C centroids]
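
Steps 3 and 4 can be sketched once a tightness value (for example the mean cosine distance to the base centroid) is available per cluster. The slide does not say how the confidence interval is built, so this sketch assumes a normal-approximation interval around the mean tightness; the tightness values are invented for illustration.

```python
import math
import statistics

def sep_clusters(tightness, confidence=0.99):
    """Return the CI lower bound and the indices of clusters below it."""
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    mean = statistics.mean(tightness)
    sem = statistics.stdev(tightness) / math.sqrt(len(tightness))
    lower = mean - z * sem
    return lower, [i for i, t in enumerate(tightness) if t < lower]

# Hypothetical per-cluster tightness values; smaller means tighter.
tightness = [0.30, 0.32, 0.28, 0.31, 0.29, 0.05]
lower, sep = sep_clusters(tightness)
print(round(lower, 3), sep)
```

Only the cluster whose vectors from all conditions sit very close to the shared centroid falls below the bound and is reported as SEP.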

SLIDE 17

Results

Data:
  Malaria infected / gonadectomized male and female mice
  Rice plants treated with 4 phytohormones
  Dehydration stress treated rice plants
  Fermentation of five yeast strains
Biologically significant clusters detected
Performance compared with TriCluster and OPTricluster

SLIDE 18

Results – Cluster patterns

[Figure: cluster patterns in two panels, C=4, T=4 and C=4, T=6]

SLIDE 19

Results – Malaria infected Mouse data

[Figure: (a) DEP cluster 51, (b) ODEP cluster 20, (c) SEP cluster 357]

SLIDE 20

Results – Phytohormone treated rice plants

5 clusters were found that responded to the ABA (abscisic acid) phytohormone
Genes in these clusters were gradually induced over time
Enriched GO terms in these clusters were related to ‘Response to abscisic acid’

SLIDE 21

Results – Comparison with other tools

[Figure: three comparison panels]
Average number of genes per cluster
Tightness (average within-cluster cosine distance)
Weighted silhouette score

SLIDE 22

Conclusion

TimesVector is able to detect gene clusters in 3D time-series data that exhibit distinct expression patterns

In particular, it is able to detect clusters with distinctively different expression patterns across conditions

It showed significantly improved clustering quality compared to recent triclustering tools

SLIDE 23

Funding

The Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01121102), Rural Development Administration (RDA), Republic of Korea

The Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Ministry of Science, ICT & Future Planning (2012M3A9D1054622)

The Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224)

SLIDE 24

Thank you for your attention