30‐Mar‐15 1

Clustering

Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 30th 2015

A distance matrix allows us to create clusters

  • In this example distance matrix:

0      0.265  0.799
0.265  0      0.534
0.799  0.534  0

– The pair at distance 0.265 has the most similar vectors
– The pair at distance 0.534 is the second most similar
– The pair at distance 0.799 is the third most similar

  • These relationships are hierarchical
  • We can write them in bracket notation:

(( , ), );

  • Or as a cladogram
  • If we have a lot of genes/microbes/etc., then we can define hierarchical clusters that might represent:

– Genes that are involved in the same metabolic pathway
– Micro‐organisms that respond to the same environmental signals

  • Thus, clustering allows us to make biological discoveries!

– (or at least: make specific predictions about potential biological discoveries, and these predictions can be tested)
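
The step from a distance matrix to a hierarchy can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the labels v1, v2 and v3 are hypothetical stand-ins for the three vectors, and the rows and columns of the example matrix are assumed to follow that order.

```python
from itertools import combinations

# Hypothetical labels: stand-ins for the three vectors of the example.
labels = ["v1", "v2", "v3"]

# Symmetric distance matrix from the example (zeros on the diagonal).
dist = [
    [0.0,   0.265, 0.799],
    [0.265, 0.0,   0.534],
    [0.799, 0.534, 0.0],
]

# Rank every unordered pair from most similar (smallest distance) to least.
pairs = sorted(combinations(range(len(labels)), 2),
               key=lambda p: dist[p[0]][p[1]])
ranked = [(labels[i], labels[j], dist[i][j]) for i, j in pairs]

# The closest pair is joined first; the remaining vector joins that cluster.
first, second, _ = ranked[0]
rest = [name for name in labels if name not in (first, second)]
bracket = f"(({first},{second}),{rest[0]});"

for a, b, d in ranked:
    print(f"{a} - {b}: {d}")
print(bracket)
```

With only three vectors the hierarchy follows directly from the ranking; with more vectors a linkage rule is needed to define the distance between clusters.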


Representation in fewer dimensions

  • Below is a two‐dimensional representation of ten multi‐dimensional vectors

– Some vectors are similar in multi‐dimensional space
– This is still visible in the two‐dimensional representation

  • The distance matrix depends on the distance measure (previous lecture)

Distance matrix:

[Figure: symmetric matrix of pairwise distances between the vectors, one symbol per distance. Legend: similar vectors, different vectors, identical vector]

Clustering algorithms

  • Single linkage

– Distance between clusters is the distance between the closest vectors

  • Complete linkage

– Distance between clusters is the distance between the most distant vectors

  • Average linkage

– A.k.a. Unweighted Pair Group Method with Arithmetic mean (UPGMA)
– Distance between clusters is the average distance between all pairs of vectors, one from each cluster
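
The three linkage rules differ only in how they summarize the pairwise distances between two clusters. The sketch below uses the distances of the A to E example from this lecture; the function names are illustrative, not part of any library.

```python
# Pairwise distances of the A-E example (upper triangle only).
PAIRS = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 5, ("A", "E"): 7,
    ("B", "C"): 8, ("B", "D"): 7, ("B", "E"): 9,
    ("C", "D"): 3, ("C", "E"): 5,
    ("D", "E"): 5,
}

def dist(x, y):
    """Look up a distance regardless of the order of the two labels."""
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]

def between(c1, c2):
    """All distances between a vector in c1 and a vector in c2."""
    return [dist(a, b) for a in c1 for b in c2]

def single_linkage(c1, c2):
    return min(between(c1, c2))      # closest vectors

def complete_linkage(c1, c2):
    return max(between(c1, c2))      # most distant vectors

def average_linkage(c1, c2):
    values = between(c1, c2)
    return sum(values) / len(values)  # mean over all cross-cluster pairs

ab, cd = ["A", "B"], ["C", "D"]
print(single_linkage(ab, cd), complete_linkage(ab, cd), average_linkage(ab, cd))
```

For the clusters (A,B) and (C,D) the cross-cluster distances are 2, 5, 8 and 7, so the three rules give 2, 8 and 5.5 respectively.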

Chaining versus clumping behavior

  • Single linkage cladograms typically show chaining behavior

– Often, a single close data point (vector) is added to an existing cluster

  • Conversely, complete linkage cladograms are more clumped
  • Average linkage (UPGMA) is intermediate

Example

Distance matrix:

    A  B  C  D  E
A   ‐  1  2  5  7
B      ‐  8  7  9
C         ‐  3  5
D            ‐  5
E               ‐

  • Cluster using single linkage

((((A,B),C),D),E);

  • Cluster using complete linkage

((A,B),((C,D),E));

  • Cluster using average linkage

((A,B),((C,D),E));
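
The example can be reproduced with a small agglomerative clustering loop. This is a minimal sketch in plain Python rather than a library implementation; the function `cluster` and the way ties are broken (first pair in list order) are choices made for this illustration.

```python
from itertools import combinations

LABELS = "ABCDE"
PAIRS = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 5, ("A", "E"): 7,
    ("B", "C"): 8, ("B", "D"): 7, ("B", "E"): 9,
    ("C", "D"): 3, ("C", "E"): 5,
    ("D", "E"): 5,
}

def dist(x, y):
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]

# Each linkage rule reduces the list of cross-cluster distances to one number.
LINKAGE = {
    "single": min,
    "complete": max,
    "average": lambda values: sum(values) / len(values),
}

def cluster(method):
    """Return the bracket notation of the hierarchy for the given linkage."""
    combine = LINKAGE[method]
    # A cluster is (members, bracket string); start with one cluster per vector.
    clusters = [((x,), x) for x in LABELS]

    def cluster_distance(pair):
        (m1, _), (m2, _) = pair
        return combine([dist(a, b) for a in m1 for b in m2])

    while len(clusters) > 1:
        # Join the two closest clusters; ties go to the first pair found.
        c1, c2 = min(combinations(clusters, 2), key=cluster_distance)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        # Write the cluster holding the alphabetically first label first.
        left, right = sorted((c1, c2), key=lambda c: min(c[0]))
        clusters.append((left[0] + right[0], f"({left[1]},{right[1]})"))
    return clusters[0][1] + ";"

for method in LINKAGE:
    print(method, cluster(method))
```

Single linkage keeps adding one vector at a time to the growing (A,B) cluster, while complete and average linkage first build (C,D) and then add E.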


Branch lengths

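  • Cluster using average linkage (UPGMA)

((A,B),((C,D),E));

    A  B  C  D  E
A   ‐  1  2  5  7
B      ‐  8  7  9
C         ‐  3  5
D            ‐  5
E               ‐

  • Each node is placed at half the distance between the two clusters it joins: 0.5 for (A,B), 1.5 for (C,D), 2.5 for ((C,D),E) and 3.17 for the root
  • A branch length is the height of the parent node minus the height of the child node

((A:0.5,B:0.5):2.67,((C:1.5,D:1.5):1,E:2.5):0.67);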
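
The node heights and branch lengths can be checked with a short UPGMA sketch. It assumes the usual UPGMA convention that a node sits at half the average distance between the two clusters it joins, and that a branch length is the parent height minus the child height; the function `upgma` is written for this illustration only.

```python
from itertools import combinations

PAIRS = {
    ("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 5, ("A", "E"): 7,
    ("B", "C"): 8, ("B", "D"): 7, ("B", "E"): 9,
    ("C", "D"): 3, ("C", "E"): 5,
    ("D", "E"): 5,
}

def dist(x, y):
    return PAIRS[(x, y)] if (x, y) in PAIRS else PAIRS[(y, x)]

def upgma(labels):
    """Return (list of (cluster, node height), bracket string with lengths)."""
    # A cluster is (members, node height, bracket string).
    clusters = [((x,), 0.0, x) for x in labels]
    merges = []

    def average(pair):
        m1, m2 = pair[0][0], pair[1][0]
        return sum(dist(a, b) for a in m1 for b in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=average)
        height = average((c1, c2)) / 2  # assumed convention: half the distance
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        left, right = sorted((c1, c2), key=lambda c: min(c[0]))
        # Branch length = height of the new node minus height of the child.
        bracket = (f"({left[2]}:{height - left[1]:.2f},"
                   f"{right[2]}:{height - right[1]:.2f})")
        merges.append(("".join(sorted(left[0] + right[0])), round(height, 2)))
        clusters.append((left[0] + right[0], height, bracket))
    return merges, clusters[0][2] + ";"

merges, tree = upgma("ABCDE")
print(merges)
print(tree)
```

Every leaf ends up at the same total distance from the root (3.17), which is the defining property of a UPGMA tree.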

Effect of distance measures on clustering

[Figure: abundance/expression profiles of Gene 1, Gene 2 and Gene 3 over time/environments/samples, with the cladograms obtained from Euclidean distance and from correlation]
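
How the distance measure changes the outcome can be sketched with three made-up profiles. The numbers below are hypothetical and not taken from the slides, and the correlation distance is assumed to be defined as 1 minus the Pearson correlation coefficient.

```python
import math

# Hypothetical expression profiles (not taken from the slides).
profiles = {
    "Gene 1": [10, 12, 14, 12, 10],
    "Gene 2": [12, 11, 10, 11, 12],  # similar level, opposite trend
    "Gene 3": [20, 24, 28, 24, 20],  # same trend, higher level
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def correlation_distance(u, v):
    """1 - Pearson r: 0 for perfectly correlated, 2 for anti-correlated."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    r = sum(a * b for a, b in zip(du, dv)) / math.sqrt(
        sum(a * a for a in du) * sum(b * b for b in dv)
    )
    return 1 - r

def nearest(gene, measure):
    """Name of the other gene with the smallest distance to `gene`."""
    others = [g for g in profiles if g != gene]
    return min(others, key=lambda g: measure(profiles[gene], profiles[g]))

print(nearest("Gene 1", euclidean))             # closest in absolute level
print(nearest("Gene 1", correlation_distance))  # closest in shape
```

Euclidean distance groups genes with similar absolute levels, whereas correlation groups genes whose profiles rise and fall together, so the same data can yield different clusters.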

Changing perspective

Between the genes (etc)

[Figure: abundance/expression of x, y and z across samples 1 to 10, compared pairwise (x y, x z, y z)]

Between the samples (etc)

[Figure: distance matrix over time/environments/samples, with the samples shown in the order 1 2 3 4 7 8 9 10 5 6]

 1  0.20  0.15  0.12
 2  0.17  0.15  0.09
 3  0.16  0.16  0.08
 4  0.20  0.15  0.11
 5  0.20  0.16  0.12
 6  0.17  0.16  0.10
 7  0.16  0.15  0.08
 8  0.20  0.15  0.12
 9  0.18  0.16  0.11
10  0.16  0.15  0.08

Bi‐clustering

[Figure: data matrix clustered along both axes: samples and genes]
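
Clustering along both axes uses the same data table twice: once with the rows as vectors and once with the columns as vectors. The sketch below shows only that change of perspective; the table is hypothetical, and Euclidean distance is used as an arbitrary choice of measure.

```python
import math
from itertools import combinations

# Hypothetical abundance table: rows are genes, columns are samples.
genes = ["x", "y", "z"]
samples = ["s1", "s2", "s3", "s4"]
table = [
    [5, 6, 20, 21],    # x
    [6, 5, 21, 20],    # y
    [20, 22, 5, 6],    # z
]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def closest_pair(names, vectors):
    """Return the two names whose vectors have the smallest distance."""
    i, j = min(combinations(range(len(names)), 2),
               key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]))
    return names[i], names[j]

# Between the genes: every row is one vector.
closest_genes = closest_pair(genes, table)

# Between the samples: transpose, so that every column becomes one vector.
columns = [list(column) for column in zip(*table)]
closest_samples = closest_pair(samples, columns)

print(closest_genes, closest_samples)
```

Running a full clustering on the rows and on the transposed table gives the two cladograms that order the genes and the samples.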

How good is my clustering?

  • Does the picture or cladogram fit your expectation?
  • Do genes that “should” cluster together, cluster together?

The cell cycle

  • The cell cycle is an important process in all cellular life forms

– Reproduction
– Growth and development
– Tissue renewal


Genes involved in the cell cycle oscillate

Discovering new cell cycle related genes

  • The expression of all ~6,200 genes in the genome of baker’s yeast (Saccharomyces cerevisiae) was measured during several cell cycles
  • ~800 genes oscillate, indicating that they are involved in the cell cycle

[Figure: the ~800 oscillating genes over time]

Enterotypes

  • Bacterial abundances in fecal samples from 39 people were determined by using metagenomics
  • The abundance profiles were clustered into three distinct types of gut flora, called “enterotypes”

– Bacteroides‐dominated
– Prevotella‐dominated
– Ruminococcus‐dominated

  • Including more data revealed that enterotypes are actually gradients