30‐Mar‐15 1
Clustering
Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 30th 2015
A distance matrix allows us to create clusters
- In this example distance matrix:
– and have the most similar vectors – and are the second most similar – and are the third most similar
- These relationships are hierarchical
- We can write them as a bracket‐notation
0.265 0.265 0.799 0.799 0.534 0.534
(( , ), );
- r as a cladogram:
- If we have a lot of genes/microbes/etc, then we can define
hierarchical clusters that might represent:
– Genes that are involved in the same metabolic pathway – Micro‐organisms that respond to the same environmental signals
- Thus, clustering allows us to make biological discoveries!
– (or at least: make specific predictions about potential biological discoveries, and these predictions can be tested)
(( , ), );
Representation in fewer dimensions
- Below is a two‐dimensional representation of ten multi‐
dimensional vectors
– Some vectors are similar in multi‐dimensional space – This is still visible in the two‐dimensional representation
- Distance matrix depends on distance measure (previous
lecture)
a b c d a j k l b j r s c k r y d l s y e f g h i m n
- p
q t u v w x z α β γ δ ε ζ η θ ι f n u α ζ g
- v
β η h p w γ θ i q x δ ι e m t z ε κ ξ π ρ λ ξ ς σ μ π ς τ ν ρ σ τ κ λ μ ν a b c d a j k l b j r s c k r y d l s y e f g h i m n
- p
q t u v w x z α β γ δ ε ζ η θ ι f n u α ζ g
- v
β η h p w γ θ i q x δ ι e m t z ε κ ξ π ρ λ ξ ς σ μ π ς τ ν ρ σ τ κ λ μ ν
Distance matrix:
Similar vectors Different vectors Identical vector
Clustering algorithms
- Single linkage
– Distance between clusters is the distance between closest vectors
- Complete linkage
– Distance between clusters is distance between most distant vectors
- Average linkage
– A.k.a. Unweighted Pair Group Method with Arithmetic mean (UPGMA) – Distance between clusters is the average distance between all vectors
d d d
Chaining versus clumping behavior
- Single linkage cladograms typically
show chaining behavior
– Often, a single close data point (vector) is added to an existing cluster
- Conversely, complete linkage
y, p g cladograms are more clumped
- Average linkage (UPGMA) is
intermediate
Example
- Cluster using single linkage
((((A,B),C),D),E);
- Cluster using complete linkage
((A,B),((C,D),E)));
A B C D E A ‐ 1 2 5 7 B ‐ 8 7 9 C ‐ 3 5 D ‐ 5 E ‐
A B C D E A
- Cluster using average linkage
((A,B),((C,D),E)));
B C D E A B C D E