TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes



slide-1
SLIDE 1

TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes

Inuk Jung, Hongryul Ahn, Kyuri Jo, Hyejin Kang, Youngjae Yu and Sun Kim

Inuk Jung (inukjung@snu.ac.kr), Bio and Health Informatics Lab, Seoul National University

slide-2
SLIDE 2

Goal of this study

Identify biologically meaningful gene clusters (triclusters) that have significantly similar or differential expression patterns from 3-dimensional time series data (Gene-Time-Condition)

slide-3
SLIDE 3

Goal of this study

Example
  • Organism: Mouse (18117 genes)
  • Time points: day 0, day 3, day 7, day 14
  • Conditions: Malaria infected intact female, gonadectomized* (gdx) female, intact male, gdx male
  • 18117 × 4 × 4 = 289872 expression values (G×T×C)

[Figure: example gene clusters with Differentially Expressed Patterns (DEP) and Similarly Expressed Patterns (SEP); cluster sizes shown: 100 genes, 80 genes, 200 genes]

*Removal of ovaries or testes

slide-4
SLIDE 4

Two technical problem statements

  • 1. High clustering complexity by dimensions
  • 2. Technical difficulty of capturing differential expression patterns between two or more conditions (What are DEGs in time series data?)

slide-5
SLIDE 5
  • P1. High clustering complexity by dimensions

  • One dimension (C): DEG analysis used for time series analysis [1] (2000)
      • Does not take into account the sequential nature of time series expression data
  • Two dimensions (GT or GC): Biclustering algorithm developed for time series data [2] (2000)
      • Biclustering is NP-hard and is bound to 2-dimensional clustering (either gene-time or gene-condition)
  • Three dimensions (GCT): First triclustering algorithm developed, TriCluster [3] (2005)
      • Only able to identify triclusters with similar expression patterns (SEP)
  • Three dimensions (GCT): Triclustering tool that is able to identify DEPs [4] (2012)
      • Identification process of DEP is based on similarity measures, resulting in poor performance

[1] Alizadeh et al, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 2000
[2] Cheng et al, Biclustering of expression data, ISMB 2000
[3] Zhao et al, The Tricluster algorithm, ACM SIGMOD 2005
[4] Tchagang et al, The OPTricluster algorithm, BMC Bioinformatics 2012

slide-6
SLIDE 6
  • P2. Capturing differential expression patterns between two or more conditions
  • Divergent pattern recognition is not available
      • Divergent pattern: the expression pattern differs between all conditions
  • OPTricluster performs a pairwise comparison for detecting divergent expression pattern clusters
      • In case of four conditions (A, B, C, D): A vs BCD, B vs ACD, C vs ABD, D vs ABC
      • Hence A!=B!=C!=D is not supported

slide-7
SLIDE 7

TimesVector Framework

[Figure: framework overview with two stages: (1) Clustering, (2) Detecting patterns]

slide-8
SLIDE 8

Clustering – Dimension reduction

  • Dimension reduction by stripping away the sample dimension and concatenating it to the time dimension
  • Takes the burden off the clustering and post-processing procedures
  • No information is lost

Concatenate samples: the 3-dimensional G×C×T matrix (one Genes (i) × Time (j) layer per sample s1, s2, …, sk) becomes a 2-dimensional G×CT matrix (Genes (i) × Time (j)⋅Conditions (k))

G×CT matrix:

        s1            s2            …   sk
        t1  t2  t3    t1  t2  t3    …   t1  t2  t3
g1      15  20  10    15  10   5    …   25  23  22
g2      39  52  31    35  22  12    …   55  52  48
g3       8  16   6     7   3   1    …   20  18  17
…        …   …   …     …   …   …    …    …   …   …
gi      25  23  25    14  15  13    …   17  16  16
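The concatenation step can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the nested-list layout `tensor[gene][sample][time]`, the helper names `flatten_conditions` and `restore_conditions`, and the use of exactly three samples are assumptions. The values are the ones shown on the slide.

```python
# Toy G x C x T data, indexed as tensor[gene][sample][time point].
# Values are taken from the slide; treating sk as a third sample is an assumption.
tensor = [
    [[15, 20, 10], [15, 10, 5], [25, 23, 22]],   # g1
    [[39, 52, 31], [35, 22, 12], [55, 52, 48]],  # g2
    [[8, 16, 6], [7, 3, 1], [20, 18, 17]],       # g3
    [[25, 23, 25], [14, 15, 13], [17, 16, 16]],  # gi
]


def flatten_conditions(tensor):
    """Concatenate the per-sample time series of each gene into one long vector."""
    return [[value for sample in gene for value in sample] for gene in tensor]


def restore_conditions(matrix, n_time):
    """Split each long vector back into per-sample chunks of n_time values."""
    return [[row[i:i + n_time] for i in range(0, len(row), n_time)] for row in matrix]


matrix = flatten_conditions(tensor)       # G x (C*T)
restored = restore_conditions(matrix, 3)  # back to G x C x T
```

Because the round trip returns the original nested structure, the sketch also illustrates the "no information is lost" point.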

slide-9
SLIDE 9

Spherical K-means clustering

  • Spherical K-means (skmeans) for clustering the vectors
  • A K-means clustering algorithm that uses cosine dissimilarity (1 - cosine similarity) as its distance measure
  • Vectors are normalized to unit vectors, which projects them onto a sphere
  • Objective: minimize the cosine dissimilarity between genes and their cluster centroids over all clusters

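minimize \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \left(1 - \cos(x_i, c_k)\right)

  • \mu_{ik}: indicator of gene i having membership to cluster k
  • c_k: the centroid of cluster k
  • x_i: expression level vector of gene i
  • N: total number of genes
  • K: total number of clusters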
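A minimal sketch of spherical K-means, assuming hard cluster membership and a simple deterministic farthest-first initialization. The initialization, the function names and the toy data are illustrative assumptions; this is not the skmeans package used in the study, and zero vectors are not handled.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(vectors, k, n_iter=100):
    unit = [normalize(v) for v in vectors]
    # Assumed initialization: start from the first vector, then repeatedly add
    # the vector that is least similar to the centroids chosen so far.
    centroids = [unit[0]]
    while len(centroids) < k:
        centroids.append(min(unit, key=lambda u: max(dot(u, c) for c in centroids)))
    labels = None
    for _ in range(n_iter):
        # Assignment: highest cosine similarity, i.e. lowest cosine dissimilarity.
        new_labels = [max(range(k), key=lambda j: dot(u, centroids[j])) for u in unit]
        if new_labels == labels:
            break
        labels = new_labels
        # Update: the centroid is the re-normalized sum of its member vectors.
        for j in range(k):
            members = [u for u, lab in zip(unit, labels) if lab == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Toy expression vectors: three rising profiles and three falling profiles.
expression = [[1, 2, 3], [2, 4, 6], [10, 20, 31], [3, 2, 1], [6, 4, 2], [30, 21, 10]]
labels, centroids = spherical_kmeans(expression, 2)
```

Note that `[1, 2, 3]` and `[10, 20, 31]` land in the same cluster despite very different magnitudes, since only the direction of each vector matters after normalization.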

slide-10
SLIDE 10

Selecting K by silhouette score

  • Using four microarray and RNA-seq time-series datasets, the K with the highest silhouette score was chosen

[Figure: optimal K (y-axis, 100 to 700) against Condition × Time points, C×T (x-axis, 5 to 30)]

Data               C   T   C×T   K
GSE74465 (Rice)    2   3    6    100
GSE11651 (Yeast)   5   3   15    200
GSE4324 (Mouse)    4   4   16    500
GSE39429 (Rice)    4   6   24    600
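The selection rule can be sketched as follows, assuming the silhouette is computed with cosine distance and that a labeling is already available for each candidate K. The toy vectors, the candidate labelings and the function names are hypothetical; in practice each labeling would come from a clustering run with that K.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_silhouette(vectors, labels):
    """Average silhouette score over all vectors, using cosine distance."""
    scores = []
    for i, v in enumerate(vectors):
        by_cluster = {}  # cluster label -> distances from v to that cluster's members
        for j, w in enumerate(vectors):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(cosine_distance(v, w))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single cluster: 0 by convention
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


vectors = [[1, 2, 3], [2, 4, 6], [10, 20, 31], [3, 2, 1], [6, 4, 2], [30, 21, 10]]
# Hypothetical labelings for two candidate values of K.
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
scores = {k: mean_silhouette(vectors, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```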

slide-11
SLIDE 11
Detecting clusters with distinct expression patterns

  • Re-introduce the condition dimension by splitting vectors by conditions
  • The bZIP gene vector is dissected into one sub-vector per condition

v(bZIP)=<1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3>

[Figure: the bZIP vector dissected into sub-vectors for conditions A, B, C and D, each over time points 0h, 1h and 6h, shown together with the condition centroids]
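The dissection of the bZIP vector can be written out directly. This sketch assumes the concatenated vector is ordered condition by condition (all time points of A, then B, and so on), matching the G×CT layout; the function name `dissect` is illustrative.

```python
v_bzip = [1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3]
conditions = ["A", "B", "C", "D"]
time_points = ["0h", "1h", "6h"]


def dissect(vector, conditions, n_time):
    """Split a concatenated vector into one sub-vector per condition.

    Assumes condition-major order: the first n_time values belong to the
    first condition, the next n_time values to the second, and so on.
    """
    if len(vector) != len(conditions) * n_time:
        raise ValueError("vector length must equal conditions x time points")
    return {
        cond: vector[i * n_time:(i + 1) * n_time]
        for i, cond in enumerate(conditions)
    }


dissected = dissect(v_bzip, conditions, len(time_points))
# dissected["A"] is the 0h, 1h, 6h profile of bZIP under condition A, and so on.
```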

slide-12
SLIDE 12

Three types of patterns are defined

  • DEP (Differentially Expressed Pattern): all samples in a cluster have different expression patterns
  • ODEP (One Differentially Expressed Pattern): one sample in a cluster has a different expression pattern from the others
  • SEP (Similarly Expressed Pattern): all samples in a cluster have similar expression patterns
slide-13
SLIDE 13

Method – DEP pattern recognition

  • Objective: Test whether the expression patterns of conditions A, B and C all differ (A!=B!=C)
  • Build a centroid for each condition within each cluster
  • Select the outermost centroid as the base centroid

[Figure: clusters C1, C2 and C3 over time points 0h, 1h and 6h, with gene vectors and centroids for conditions A, B and C]

slide-14
SLIDE 14

Method – DEP pattern recognition

  1. Compute the cosine distance from each dissected vector to the base centroid for each cluster
  2. Rank the dissected vectors by cosine distance
  3. Measure Mutual Information (MI) with X as the distance to the base centroid and Y as the condition
  4. Measure the significance of MI with 1000 random permutation tests

[Figure: cluster C2 over time points 0h, 1h and 6h, with gene vectors and centroids for conditions A, B and C and the base centroid]

Example for cluster C2 (cosine distance of each dissected vector to the base centroid):

Vector   Phenotype   Distance   Rank   Discretized rank
G1_A     A           0.9        10     3
G2_A     A           0.87        9     3
G3_A     A           0.96       11     3
G4_A     A           0.99       12     3
G1_B     B           0.1         2     1
G2_B     B           0.05        1     1
G3_B     B           0.2         4     1
G4_B     B           0.18        3     1
G1_C     C           0.5         5     2
G2_C     C           0.6         7     2
G3_C     C           0.57        6     2
G4_C     C           0.61        8     2

MI: Log2(4)=2
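Steps 2 to 4 can be sketched on the distances from the cluster C2 example. The equal-size rank bins, the plug-in MI estimate, the add-one p-value and the fixed random seed are assumptions made for illustration, not details confirmed by the slides. With this estimate the three-condition example yields log2(3) ≈ 1.585; log2(4) = 2 is the value the same estimate reaches when four equally sized conditions are perfectly separated.

```python
import math
import random
from collections import Counter

# Cosine distances to the base centroid, from the cluster C2 example.
distances = [0.9, 0.87, 0.96, 0.99, 0.1, 0.05, 0.2, 0.18, 0.5, 0.6, 0.57, 0.61]
phenotypes = ["A"] * 4 + ["B"] * 4 + ["C"] * 4


def discretized_ranks(values, n_bins):
    """Rank values (1 = smallest) and cut the ranks into n_bins equal-size bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    bin_size = len(values) / n_bins
    return [math.ceil(r / bin_size) for r in ranks]


def mutual_information(xs, ys):
    """Plug-in estimate of MI (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Share of label shufflings whose MI is at least the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


bins = discretized_ranks(distances, n_bins=3)
mi = mutual_information(bins, phenotypes)
p_value = permutation_p_value(bins, phenotypes)
```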

slide-15
SLIDE 15

Method – ODEP pattern recognition

Objective: Test whether the expression pattern of one condition among A, B, C differs from the others: A!=BC (B=C), B!=AC (A=C) or C!=AB (A=B)

  1. Compute a base centroid of the compared conditions (BC, AC, AB)
  2. Compute the cosine distance of the dissected vectors to the centroid for each combination
  3. Perform ANOVA on the computed cosine distances of each combination

[Figure: clusters C1, C2 and C3 over time points 0h, 1h and 6h, with gene vectors and centroids for conditions A, B and C]
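One possible reading of these steps, sketched under stated assumptions: for each candidate odd-one-out condition, the base centroid is built from the other two conditions, every dissected vector gets a cosine distance to it, and a one-way ANOVA F statistic is computed over the distances grouped by condition. The grouping, the toy data and all function names are hypothetical. Only the F statistic is computed here; a p-value would come from the F distribution, for example via `scipy.stats.f_oneway`.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(u, v):
    return 1.0 - sum(a * b for a, b in zip(unit(u), unit(v)))


def centroid(vectors):
    """Direction of the summed unit vectors."""
    return unit([sum(col) for col in zip(*(unit(v) for v in vectors))])


def anova_f(groups):
    """One-way ANOVA F statistic (assumes non-zero within-group variance)."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (len(values) - len(groups)))


def odep_f_statistic(dissected, odd_one_out):
    """F statistic for the hypothesis that `odd_one_out` differs from the rest."""
    rest = [v for cond, vs in dissected.items() if cond != odd_one_out for v in vs]
    base = centroid(rest)
    groups = [[cosine_distance(v, base) for v in vs] for vs in dissected.values()]
    return anova_f(groups)


# Toy cluster: condition A rises over time, B and C both fall.
cluster = {
    "A": [[1, 2, 3], [1, 2, 3.2], [1.1, 2, 3]],
    "B": [[3, 2, 1], [3, 2.1, 1], [3.1, 2, 1]],
    "C": [[3, 2, 1.1], [2.9, 2, 1], [3, 1.9, 1]],
}
f_a_vs_bc = odep_f_statistic(cluster, "A")
f_b_vs_ac = odep_f_statistic(cluster, "B")
```

In this toy cluster the A vs BC split gives by far the larger F statistic, which is the behaviour one would expect from an A!=BC (B=C) pattern.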

slide-16
SLIDE 16

Method – SEP pattern recognition

Objective: Test whether the expression patterns of conditions A, B and C are similar (A=B=C)

  1. Compute a base centroid of all conditions within a cluster
  2. Compute the cosine distance of the dissected vectors to the base centroid
  3. Tightness threshold: lower bound of the 99% confidence interval over all clusters
  4. Clusters with tightness below the lower bound of the 99% CI are SEP clusters

[Figure: clusters C1, C2 and C3 over time points 0h, 1h and 6h, with gene vectors and centroids for conditions A, B and C]
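A rough sketch of the tightness test. The slides do not spell out how tightness or the confidence interval is computed, so two assumptions are made here: tightness is the mean cosine distance of a cluster's dissected vectors to its base centroid, and the 99% interval is a normal-approximation interval for the mean tightness across clusters. The tightness values in the example are made up.

```python
import math
from statistics import NormalDist, mean, stdev


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def tightness(vectors):
    """Mean cosine distance of the vectors to their base centroid (assumed definition)."""
    units = [unit(v) for v in vectors]
    base = unit([sum(col) for col in zip(*units)])
    return mean(1.0 - sum(a * b for a, b in zip(u, base)) for u in units)


def sep_clusters(tightness_by_cluster, confidence=0.99):
    """Clusters whose tightness lies below the lower bound of the confidence interval."""
    values = list(tightness_by_cluster.values())
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    lower = mean(values) - z * stdev(values) / math.sqrt(len(values))
    return [cid for cid, t in tightness_by_cluster.items() if t < lower]


# Hypothetical tightness values for six clusters; only C1 is very tight.
tightness_by_cluster = {"C1": 0.01, "C2": 0.20, "C3": 0.22, "C4": 0.25, "C5": 0.21, "C6": 0.24}
sep = sep_clusters(tightness_by_cluster)

# Vectors that point in the same direction have a tightness of (numerically) zero.
same_direction = tightness([[1, 2, 3], [2, 4, 6]])
```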

slide-17
SLIDE 17

Results

  • Data
      • Malaria infected / gonadectomized male and female mice
      • Rice plants treated with 4 phytohormones
      • Dehydration stress treated rice plants
      • Fermentation of five yeast strains
  • Biologically significant clusters detected
  • Performance compared with TriCluster and OPTricluster

slide-18
SLIDE 18

Results – Cluster patterns

[Figure: cluster patterns for two datasets, one with C=4, T=4 and one with C=4, T=6]

slide-19
SLIDE 19

Results – Malaria infected Mouse data

[Figure panels: (a) DEP cluster 51, (b) ODEP cluster 20, (c) SEP cluster 357]

slide-20
SLIDE 20

Results – Phytohormone treated rice plants

  • Five clusters were found that responded to the phytohormone abscisic acid (ABA)
  • Genes in these clusters were gradually induced over time
  • Enriched GO terms in these clusters were related to ‘Response to abscisic acid’
slide-21
SLIDE 21

Results – Comparison with other tools

[Figure panels comparing the tools on three measures:]
  • Average number of genes per cluster
  • Tightness (average within-cluster cosine distance)
  • Weighted silhouette score

slide-22
SLIDE 22

Conclusion

  • TimesVector is able to detect gene clusters in 3D time-series data that exhibit distinct expression patterns
  • In particular, it is able to detect clusters with distinctively different expression patterns across conditions
  • It showed significantly improved clustering quality compared to recent triclustering tools

slide-23
SLIDE 23

Funding

  • The Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01121102), Rural Development Administration (RDA), Republic of Korea
  • The Bio & Medical Technology Development Program of the National Research Foundation (NRF) funded by the Ministry of Science, ICT & Future Planning (2012M3A9D1054622)
  • The Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224)

slide-24
SLIDE 24

Thank you for your attention