CSCI1950Z Computational Methods for Biology
Lecture 13, Ben Raphael, March 11, 2009
http://cs.brown.edu/courses/csci1950z/

Topic 2: Functional Genomics
Biology 101
Central Dogma
What can we measure?
- DNA: sequencing (expensive), hybridization (noisy)
- RNA: sequencing (expensive), hybridization (noisy)
- Protein: mass spectrometry (noisy), hybridization (very noisy!)
DNA/RNA Base Pairing
- RNA is single stranded.
- In RNA, uracil (U) takes the place of thymine (T).
RNA Microarrays: Gene Expression Data

[Heatmap: rows = genes, columns = samples/conditions. BMC Genomics 2006, 7:279]

Each microarray experiment gives an expression vector u = (u1, …, un), where ui is the expression value of gene i.
Topics
- Methods for Clustering
– Hierarchical, graph-based (clique finding), matrix-based (PCA)
- Methods for Classification
– Nearest neighbors, support vector machines
- Data Integration: Bayesian Networks
Gene Expression Data

[Heatmap: rows = genes, columns = samples/conditions. BMC Genomics 2006, 7:279]

Each microarray experiment gives an expression vector u = (u1, …, un), where ui is the expression value of gene i.
Goal: group genes with similar expression patterns over multiple samples/conditions.
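As a concrete illustration of the data layout, here is a toy expression matrix in Python; the values are invented for illustration:

```python
# Toy expression matrix: rows = genes, columns = samples/conditions.
# Values are made up for illustration only.
X = [[2.1, 0.3, 4.0],   # gene 1
     [2.0, 0.4, 3.9],   # gene 2
     [0.1, 5.2, 0.2]]   # gene 3

# Expression vector for sample/condition 1: one value per gene.
u = [row[0] for row in X]
print(u)   # [2.1, 2.0, 0.1]
```

Clustering then asks which rows (genes) have similar vectors across the columns.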
Clustering
Goal: group data points into clusters.
- Input: n data points.
- Output: k clusters, with points in a cluster "closer" to each other than to points in other clusters.

n × n distance matrix (example, points 1–4):

      1    2    3    4
 1    0   11    7    5
 2   11    0    4    6
 3    7    4    0    9
 4    5    6    9    0
Clustering
Properties of a good clustering/partition:
- Separation: points in different clusters are far apart.
- Homogeneity: points in the same cluster are close.

(Same n × n distance matrix example as above.)
Agglomerative Hierarchical Clustering
Iteratively combine the closest groups into larger groups.

    C ← { {1}, …, {n} }
    while |C| > 1 do
        [Find closest clusters]  (Ci, Cj) ← argmin d(Ci, Cj)
        Ck ← Ci ∪ Cj
        [Replace Ci and Cj by Ck]  C ← (C \ {Ci, Cj}) ∪ {Ck}
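A minimal pure-Python sketch of this loop, using average linkage as the cluster distance d and the 4-point distance matrix from the earlier example (function names are mine):

```python
from itertools import combinations

# Pairwise distances for the 4-point example (keys are 1-indexed pairs).
d = {(1, 2): 11, (1, 3): 7, (1, 4): 5,
     (2, 3): 4, (2, 4): 6, (3, 4): 9}

def dist(p, q):
    return d[(min(p, q), max(p, q))]

def avg_linkage(ci, cj):
    """Average pairwise distance between two disjoint clusters."""
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

def agglomerate(points):
    """Repeatedly merge the closest pair of clusters until one remains."""
    clusters = [frozenset([p]) for p in points]
    merges = []                      # record each merged cluster, in order
    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: avg_linkage(*pair))
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)
        merges.append(sorted(ci | cj))
    return merges

print(agglomerate([1, 2, 3, 4]))   # [[2, 3], [1, 4], [1, 2, 3, 4]]
```

The merge order records exactly the dendrogram structure: {2, 3} first (distance 4), then {1, 4} (distance 5), then everything.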
Agglomerative Hierarchical Clustering
How do we compute d(Ci, Cj)?
Agglomerative Hierarchical Clustering
Distance between clusters defined as the average pairwise distance: average linkage clustering.

Given two disjoint clusters Ci, Cj:

    d(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{p ∈ Ci, q ∈ Cj} d_pq
Agglomerative Hierarchical Clustering
Initialization: assign each xi to its own cluster Ci.
Iteration: find the two clusters Ci and Cj whose distance dij is minimum; let Ck = Ci ∪ Cj; delete Ci and Cj.
Termination: when a single cluster remains.
Dendrogram
UPGMA Algorithm
Unweighted Pair Group Method with Averages.
Initialization: assign each xi to its own cluster Ci; define one leaf per sequence, each at height 0.
Iteration: find the two clusters Ci and Cj whose distance dij is minimum; let Ck = Ci ∪ Cj; add a vertex connecting Ci and Cj and place it at height dij / 2; delete Ci and Cj.
Termination: when a single cluster remains.
Agglomerative Hierarchical Clustering
Distances between clusters Ci, Cj:

    Average linkage:   d(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ_{p ∈ Ci, q ∈ Cj} d_pq
    Single linkage:    d(Ci, Cj) = min_{p ∈ Ci, q ∈ Cj} d_pq
    Complete linkage:  d(Ci, Cj) = max_{p ∈ Ci, q ∈ Cj} d_pq
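A quick sketch of the three linkage rules on the earlier 4-point distance matrix (helper names are illustrative):

```python
# Single, complete, and average linkage between two clusters,
# using the 4-point example distance matrix from earlier slides.
d = {(1, 2): 11, (1, 3): 7, (1, 4): 5,
     (2, 3): 4, (2, 4): 6, (3, 4): 9}

def dist(p, q):
    return d[(min(p, q), max(p, q))]

def single(ci, cj):    # closest pair across the two clusters
    return min(dist(p, q) for p in ci for q in cj)

def complete(ci, cj):  # farthest pair across the two clusters
    return max(dist(p, q) for p in ci for q in cj)

def average(ci, cj):   # mean over all |Ci|*|Cj| cross pairs
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

ci, cj = {1, 4}, {2, 3}
print(single(ci, cj), complete(ci, cj), average(ci, cj))   # 6 11 8.25
```

The three rules can merge clusters in different orders: single linkage tends to chain, complete linkage favors compact clusters, and average linkage sits in between.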
Agglomerative Hierarchical Clustering
Where are the clusters?
Cut the tree at some height; this can define any number of clusters.
Cluster Centers
Each cluster is defined by its center/centroid. (Whiteboard)
Another Greedy Algorithm: k-means
cost(P) = k-means "cost" of partition P (typically the sum of squared distances of points to their cluster centers).
P_{i→C} = the clustering with point i moved to cluster C.
Δ(i → C) = cost(P) − cost(P_{i→C})
Move: reassign point i to cluster C if Δ(i → C) > 0.
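A sketch of the move step, assuming the usual k-means cost (sum of squared distances to cluster centroids) and 1-D toy data; all names and values here are invented for illustration:

```python
# Greedy k-means improvement step: move a point if it lowers the cost.

def centroid(cluster, x):
    return sum(x[i] for i in cluster) / len(cluster)

def cost(partition, x):
    """k-means cost: sum of squared distances to cluster centroids."""
    total = 0.0
    for cluster in partition:
        c = centroid(cluster, x)
        total += sum((x[i] - c) ** 2 for i in cluster)
    return total

def delta(partition, x, i, src, dst):
    """Delta(i -> C): cost improvement from moving point i from src to dst."""
    moved = [set(c) for c in partition]
    moved[src].discard(i)
    moved[dst].add(i)
    return cost(partition, x) - cost(moved, x)

x = [0.0, 0.2, 0.4, 9.8, 10.0]     # two obvious groups on the line
bad = [{0, 1, 2, 3}, {4}]          # point 3 is misassigned
print(delta(bad, x, 3, 0, 1) > 0)  # True: moving point 3 lowers the cost
```

A greedy pass repeats such moves until no Δ(i → C) is positive, which gives a local optimum of the cost.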
How many clusters?
Distance Graph
Example: Θ = 7.
Distance graph G(Θ) = (V, E), where V = data points and E = {(i, j) : d(i, j) < Θ}.
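Building the distance graph is a one-liner over the pairwise distances; here it is on the earlier 4-point matrix with Θ = 7 (strict inequality, as in the definition):

```python
# Build the distance graph G(theta): vertices = data points,
# edges = pairs whose distance is strictly below the threshold theta.
d = {(1, 2): 11, (1, 3): 7, (1, 4): 5,
     (2, 3): 4, (2, 4): 6, (3, 4): 9}

def distance_graph(d, theta):
    return {pair for pair, dist in d.items() if dist < theta}

print(sorted(distance_graph(d, 7)))   # [(1, 4), (2, 3), (2, 4)]
```

Note that the pair (1, 3), at distance exactly 7, is excluded by the strict inequality.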
Cliques
- A graph is complete provided all possible edges are present.
- A subgraph that is a complete graph is called a clique.
- The separation and homogeneity properties of a good clustering imply: clusters = cliques.

[Figures: K3, K4, K5]
Cliques and Clustering
A good clustering has:
1. One connected component for each cluster (separation).
2. An edge between every pair of vertices within each connected component (homogeneity).

[Figures: K3, K4, K5]
Clique Graphs
A graph whose connected components are all cliques.
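A small sketch that tests this definition directly: find the connected components, then check that each one is complete (the example graphs are toy data):

```python
from itertools import combinations

def is_clique_graph(vertices, edges):
    """True iff every connected component of the graph is a clique."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Find connected components by depth-first search.
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    # A component is a clique iff every pair inside it is an edge.
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((u, w)) in edge_set
               for comp in comps
               for u, w in combinations(comp, 2))

# Two disjoint triangles: a clique graph.
print(is_clique_graph(range(6), [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]))
# A path on 3 vertices: connected but missing an edge, so not a clique graph.
print(is_clique_graph(range(3), [(0, 1), (1, 2)]))
```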
From Distance Graph to Clique Graph
Distance graphs from real data have missing edges and extra edges.
Corrupted Cliques Problem
Input: a graph G.
Output: the smallest number of edges to add or remove to transform G into a clique graph.
NP-hard (Sharan, Shamir & Tsur 2004).
Extending a Subpartition
- Suppose we knew the optimal clustering for a subset V′ ⊆ V.
- Extend this clustering to all of V.
Cluster Affinity
Define the affinity (relative density) of vertex v to cluster Cj as N(v, Cj) / |Cj|, where N(v, Cj) = the number of edges from v to Cj.
Maximum affinity extension: assign v to argmax_j N(v, Cj) / |Cj|.
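A sketch of maximum affinity extension on a toy graph; the graph, the clusters, and the vertex being assigned are all invented for illustration:

```python
# Maximum affinity extension: assign a new vertex to the cluster
# with the highest relative edge density toward it.

def affinity(v, cluster, edges):
    """Relative density N(v, C) / |C|: fraction of C adjacent to v."""
    n = sum(1 for u in cluster if frozenset((v, u)) in edges)
    return n / len(cluster)

def assign(v, clusters, edges):
    """Assign v to the cluster maximizing its affinity."""
    return max(clusters, key=lambda c: affinity(v, c, edges))

edges = {frozenset(e) for e in
         [(1, 2), (1, 3), (2, 3), (4, 5), (6, 4), (6, 5)]}
clusters = [frozenset({1, 2, 3}), frozenset({4, 5})]
print(sorted(assign(6, clusters, edges)))   # [4, 5]
```

Vertex 6 has edges to both members of {4, 5} (affinity 1.0) and none to {1, 2, 3} (affinity 0), so it is assigned to {4, 5}.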
Parallel Clustering with Cores (PCC)
(Ben-Dor et al. 1999)
Score(P) = the minimum number of edges to add or remove so that partition P becomes a clique graph. Straightforward to compute since P is known.
PCC: Algorithmic Analysis
Very inefficient: the number of such partitions equals φ(|S′|, k), a Stirling number of the second kind:

    φ(r, k) = (1 / k!) Σ_{i=0}^{k} (−1)^i C(k, i) (k − i)^r
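The Stirling number of the second kind can be computed directly from this inclusion-exclusion formula:

```python
from math import comb, factorial

def stirling2(r, k):
    """Stirling number of the second kind: the number of ways to
    partition r labeled items into k nonempty unlabeled blocks,
    via phi(r, k) = (1/k!) * sum_i (-1)^i * C(k, i) * (k - i)^r."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** r
               for i in range(k + 1)) // factorial(k)

print(stirling2(4, 2))   # 7 ways to split 4 items into 2 blocks
```

These numbers grow roughly like k^r / k!, which is why enumerating all partitions of even a modest core set is infeasible.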
Corrupted Cliques Random Graph
1) Start with a clique graph H.
2) Randomly add/remove each edge with probability p.
This yields the graph G_{H,p}.
PCC: Algorithmic Analysis
- PCC selects two random sets of vertices; the analysis relies on a probabilistic argument.
- Let PCC(G) denote the output graph (a clique graph).
- For graphs G = (V, E) and G′ = (V′, E′), define Δ(G, G′) = |E Δ E′| = |E \ E′| + |E′ \ E|.
- One can show (see Shamir's notes) that, with high probability, the output of PCC is as good as the clique graph H:
  Pr[ Δ(PCC(G_{H,p}), G_{H,p}) ≤ Δ(H, G_{H,p}) ] > 1 − δ.
Cluster Affinity Search Technique (CAST)
Clustering of Gene Expression
[Heatmap: rows = genes, columns = samples. BMC Genomics 2006, 7:279]

Each microarray experiment gives an expression vector x = (x1, …, xn), where xi is the expression value of gene i. Goal: group similar vectors.
Distances between vectors
Pearson correlation coefficient rij: measures the linear relationship between vectors xi and xj; −1 ≤ rij ≤ 1.

    rij = Σ_{k=1}^{m} (xik − x̄i)(xjk − x̄j) / ((m − 1) si sj)

with the sample mean and sample standard deviation

    x̄i = (1/m) Σ_{k=1}^{m} xik
    si = sqrt( (1 / (m − 1)) Σ_{k=1}^{m} (xik − x̄i)² )
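A direct translation of these formulas into Python, using the sample mean and sample standard deviation as defined above (the input vectors are toy data):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between expression vectors x and y."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m                       # sample means
    sx = sqrt(sum((v - mx) ** 2 for v in x) / (m - 1))    # sample std devs
    sy = sqrt(sum((v - my) ** 2 for v in y) / (m - 1))
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (m - 1)
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # 1.0: perfectly linear
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # -1.0: anti-correlated
```

Because correlation captures the shape of the profile rather than its magnitude, 1 − rij is a common dissimilarity for clustering expression vectors.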