Topic 2: Functional Genomics

SLIDE 1

CSCI1950-Z Computational Methods for Biology, Lecture 13

Ben Raphael
March 11, 2009

http://cs.brown.edu/courses/csci1950-z/

Topic 2: Functional Genomics

SLIDE 2

Biology 101

Central Dogma: DNA → RNA → protein.

What can we measure?

• DNA: sequencing (expensive), hybridization (noisy)
• RNA: sequencing (expensive), hybridization (noisy)
• Protein: mass spectrometry (noisy), hybridization (very noisy!)

SLIDE 3

DNA Basepairing / DNA/RNA Basepairing

RNA is single-stranded; in RNA, T → U (uracil replaces thymine).

SLIDE 4

RNA Microarrays / Gene Expression Data

[Figure: gene expression matrix, genes × samples/conditions (BMC Genomics 2006, 7:279)]

Each microarray experiment gives an expression vector u = (u1, …, un), where ui is the expression value of gene i.

SLIDE 5

Topics

  • Methods for Clustering

– Hierarchical, Graph based (Clique‐finding), Matrix‐based (PCA),

  • Methods for Classifica3on

– Nearest neighbors, support vector machines

  • Data Integra3on: Bayesian

Networks

Gene Expression Data

[Figure: gene expression matrix, genes × samples/conditions (BMC Genomics 2006, 7:279)]

Each microarray experiment gives an expression vector u = (u1, …, un), where ui is the expression value of gene i. Goal: group genes with similar expression patterns over multiple samples/conditions.
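To make the representation concrete, here is a toy sketch in Python; the numbers are made up purely for illustration:

```python
import numpy as np

# Toy gene expression matrix: rows = genes, columns = samples/conditions.
# Each column is one microarray experiment's expression vector
# u = (u1, ..., un), with one entry per gene; each row is a gene's
# pattern across conditions.
X = np.array([
    [2.1, 0.3, 1.8, 0.2],   # gene 1
    [2.0, 0.4, 1.9, 0.1],   # gene 2: pattern similar to gene 1
    [0.1, 1.7, 0.2, 1.9],   # gene 3: roughly opposite pattern
])
print(X[:, 0])   # expression vector u for the first experiment
```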

SLIDE 6

Clustering

Goal: group data points into groups (clusters).

• Input: n data points.
• Output: k clusters, with points in the same cluster "closer" to each other than to points in other clusters.

n × n distance matrix (here n = 4):

      1    2    3    4
  1   0   11    7    5
  2  11    0    4    6
  3   7    4    0    9
  4   5    6    9    0

[Figure: the four points plotted in the plane]

Clustering

Properties of a good clustering/partition:

• Separation: points in different clusters are far apart.
• Homogeneity: points in the same cluster are close.


SLIDE 7

Agglomerative Hierarchical Clustering

Iteratively combine the closest groups into larger groups.

C ← { {1}, …, {n} }
while |C| > 1 do
    (Ci, Cj) ← argmin d(Ci, Cj)   [find the closest pair of clusters]
    Ck ← Ci ∪ Cj
    C ← (C \ {Ci, Cj}) ∪ {Ck}     [replace Ci and Cj by Ck]

[Figure: five points labeled 1–5]

Agglomerative Hierarchical Clustering

How do we compute d(Ci, Cj)? The procedure above is unchanged; only the definition of the inter-cluster distance d varies (see the linkage rules below).

SLIDE 8

Agglomerative Hierarchical Clustering

Distance between clusters defined as the average pairwise distance: average linkage clustering.

Given two disjoint clusters Ci, Cj:

$$ d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq} $$

Agglomera3ve Hierarchical Clustering

Ini.aliza.on: Assign each xi to its own cluster Ci Itera.on: Find two clusters Ci and Cj such that dij is min Let Ck = Ci ∪ Cj

Delete Ci and Cj

Termina.on: When a single cluster remains

1 4 3 2 5 1 4 2 3 5

Dendrogram
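A minimal Python sketch of the agglomerative procedure above, using average linkage and the four-point distance matrix from the earlier slide; it is a naive implementation that stops at k clusters rather than one, purely for illustration:

```python
import numpy as np

def agglomerative_clustering(D, k):
    # D: (n, n) symmetric distance matrix; k: number of clusters to keep.
    n = D.shape[0]
    clusters = [[i] for i in range(n)]          # C <- { {1}, ..., {n} }
    while len(clusters) > k:
        # Find the pair (Ci, Cj) with minimal average pairwise distance.
        best_i, best_j, best_d = None, None, float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])].mean()
                if d < best_d:
                    best_i, best_j, best_d = i, j, d
        # Ck <- Ci ∪ Cj; replace Ci and Cj by Ck.
        clusters[best_i].extend(clusters[best_j])
        del clusters[best_j]
    return clusters

# Distance matrix for the four points from the earlier slide (0-indexed).
D = np.array([[0, 11, 7, 5],
              [11, 0, 4, 6],
              [7, 4, 0, 9],
              [5, 6, 9, 0]], dtype=float)
print(agglomerative_clustering(D, 2))   # [[0, 3], [1, 2]]
```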

SLIDE 9

UPGMA Algorithm

Unweighted Pair Group Method with Averages.

Initialization: assign each xi to its own cluster Ci; define one leaf per sequence, each at height 0.
Iteration: find the two clusters Ci and Cj such that dij is minimal; let Ck = Ci ∪ Cj; add a vertex connecting Ci and Cj and place it at height dij / 2; delete Ci and Cj.
Termination: when a single cluster remains.
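If SciPy is available (an assumption; it is not part of the slides), its "average" linkage implements the UPGMA merge rule; each internal vertex of the tree sits at half the recorded merge distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 11, 7, 5],
              [11, 0, 4, 6],
              [7, 4, 0, 9],
              [5, 6, 9, 0]], dtype=float)
# Each row of Z: (cluster a, cluster b, merge distance d_ij, new size);
# UPGMA places the new internal vertex at height d_ij / 2.
Z = linkage(squareform(D), method="average")
print(Z)
```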

Agglomerative Hierarchical Clustering

For clusters Ci, Cj:

Average linkage:
$$ d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq} $$

Single linkage:
$$ d(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d_{pq} $$

Complete linkage:
$$ d(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d_{pq} $$
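The three rules differ only in how the block of pairwise distances between the two clusters is reduced; a small sketch (the function name is illustrative):

```python
import numpy as np

def linkage_distance(D, Ci, Cj, method="average"):
    # Ci, Cj: disjoint lists of point indices; D: full distance matrix.
    block = D[np.ix_(Ci, Cj)]       # all pairwise distances d_pq
    if method == "average":
        return block.mean()         # (1 / |Ci||Cj|) * sum of d_pq
    if method == "single":
        return block.min()          # min over p in Ci, q in Cj
    if method == "complete":
        return block.max()          # max over p in Ci, q in Cj
    raise ValueError(f"unknown linkage: {method}")
```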

SLIDE 10

Agglomerative Hierarchical Clustering

Where are the clusters?

[Figure: dendrogram over points 1–5]

Cut the tree at some height; different cut heights can define any number of clusters.
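Continuing the SciPy sketch from the UPGMA slide (again an assumption, not part of the lecture), fcluster performs exactly this cut, and different heights t yield different numbers of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0, 11, 7, 5],
              [11, 0, 4, 6],
              [7, 4, 0, 9],
              [5, 6, 9, 0]], dtype=float)
Z = linkage(squareform(D), method="average")
for t in (3.0, 6.0, 12.0):
    # Cut the dendrogram at height t: one cluster label per point.
    print(t, fcluster(Z, t=t, criterion="distance"))
# t = 3.0 -> 4 clusters, t = 6.0 -> 2 clusters, t = 12.0 -> 1 cluster
```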

Cluster Centers

Each cluster is defined by its center/centroid. (Whiteboard)

SLIDE 11

Another Greedy k-means

cost(P) = k-means "cost" of partition P. Let P(i→C) denote the clustering with point i moved to cluster C, and define Δ(i → C) = cost(P) − cost(P(i→C)).

Move: greedily apply moves with Δ(i → C) > 0, i.e. moves that decrease the cost.

How many clusters?
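A minimal sketch of the move score, assuming the usual k-means cost (sum of squared distances to cluster centroids); the helper names are illustrative:

```python
import numpy as np

def kmeans_cost(X, labels, k):
    # Sum of squared distances from each point to its cluster centroid.
    cost = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost

def move_delta(X, labels, k, i, c):
    # Delta(i -> c) = cost(P) - cost(P with point i moved to cluster c);
    # a positive value means the move improves the clustering.
    new_labels = labels.copy()
    new_labels[i] = c
    return kmeans_cost(X, labels, k) - kmeans_cost(X, new_labels, k)
```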

SLIDE 12

Distance Graph

Distance graph G(Θ) = (V, E), where V = data points and E = {(i, j) : d(i, j) < Θ}. (In the figure, Θ = 7.)
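Building G(Θ) from a distance matrix is nearly a one-liner; a sketch using the slide's Θ = 7 on the earlier four-point matrix:

```python
import numpy as np

def distance_graph(D, theta):
    # Boolean adjacency matrix: edge (i, j) iff d(i, j) < theta (no self-loops).
    return (D < theta) & ~np.eye(D.shape[0], dtype=bool)

D = np.array([[0, 11, 7, 5],
              [11, 0, 4, 6],
              [7, 4, 0, 9],
              [5, 6, 9, 0]], dtype=float)
print(distance_graph(D, theta=7))   # edges: (1,4), (2,3), (2,4) in 1-indexing
```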

Cliques

  • A graph is complete provided all possible

edges are present.

  • A subgraph that is a complete graph is called a

clique.

  • Separa3on and homogeneity proper3es of

good clustering imply: clusters = cliques.

K4 K5 K3

SLIDE 13

Cliques and Clustering

A good clustering has:

1. One connected component for each cluster (separation).
2. An edge between every pair of vertices within each connected component (homogeneity).

Clique Graphs

A graph whose connected components are all cliques.
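A sketch of testing this property directly: find each connected component and check that it is complete.

```python
import numpy as np

def is_clique_graph(A):
    # A: symmetric boolean adjacency matrix without self-loops.
    n = A.shape[0]
    seen = np.zeros(n, dtype=bool)
    for s in range(n):
        if seen[s]:
            continue
        comp, stack = [], [s]       # collect s's connected component
        seen[s] = True
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in np.flatnonzero(A[v]):
                if not seen[u]:
                    seen[u] = True
                    stack.append(u)
        m = len(comp)
        # A size-m component is a clique iff all m(m-1) ordered pairs are edges.
        if A[np.ix_(comp, comp)].sum() != m * (m - 1):
            return False
    return True

# Two disjoint triangles form a clique graph; a bridge between them breaks it.
A = np.zeros((6, 6), dtype=bool)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = True
print(is_clique_graph(A))      # True
A[2, 3] = A[3, 2] = True
print(is_clique_graph(A))      # False
```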

SLIDE 14

Distance Graph → Clique Graph

Distance graphs built from real data have missing edges and extra edges.

Corrupted Cliques Problem
Input: a graph G.
Output: the smallest number of edges to add or remove to transform G into a clique graph.
NP-hard (Sharan, Shamir & Tsur 2004).

SLIDE 15

Extending a subpar33on

  • Suppose we knew
  • p3mal clustering for

subset V’ ⊆ V.

  • Extend this clustering

to V.

Cluster Affinity

Let N(v, Cj) = number of edges from v to Cj. Define the affinity (relative density) of v to Cj as N(v, Cj) / |Cj|.

Maximum Affinity Extension: assign v to argmax_j N(v, Cj) / |Cj|.
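A sketch of the extension step, assuming clusters are lists of vertex indices and A is a boolean adjacency matrix (the names are illustrative):

```python
import numpy as np

def max_affinity_extension(A, clusters, unassigned):
    # Assign each unassigned vertex v to argmax_j N(v, Cj) / |Cj|,
    # where N(v, Cj) counts edges from v into cluster Cj.
    assignment = {}
    for v in unassigned:
        affinities = [A[v, Cj].sum() / len(Cj) for Cj in clusters]
        assignment[v] = int(np.argmax(affinities))
    return assignment
```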

SLIDE 16

Parallel Clustering with Cores (PCC)

(Ben-Dor et al. 1999)

Score(P) = minimum number of edges to add/remove for partition P to become a clique graph. Straightforward to compute since P is known.

PCC: Algorithmic Analysis

Very inefficient: the number of such partitions equals φ(|S′|, k), the Stirling number of the second kind:

$$ \varphi(r, k) = \frac{1}{k!} \sum_{i=0}^{k} (-1)^i \binom{k}{i} (k - i)^r $$
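The formula can be evaluated directly; a quick check in Python (the sum is always divisible by k!):

```python
from math import comb, factorial

def stirling2(r, k):
    # phi(r, k) = (1/k!) * sum_{i=0}^{k} (-1)^i * C(k, i) * (k - i)^r
    return sum((-1) ** i * comb(k, i) * (k - i) ** r
               for i in range(k + 1)) // factorial(k)

print(stirling2(5, 2))   # 15 ways to split 5 items into 2 nonempty parts
```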
SLIDE 17

Corrupted Cliques Random Graph

1) Start with a clique graph H.
2) Randomly add/remove edges with probability p.

This yields the graph G_{H,p}.

PCC: Algorithmic Analysis

  • PCC selects two random sets of ver3ces. Analysis

is relies on probability.

  • Let PCC(G) denote output graph (clique graph).
  • For graphs G = (V, E) and G’ = (V’, E’) define:

Δ(G,G’) = | E Δ E’| = | E \E’| + |E’ \ E|

  • Can show (See Shamir notes) that with high

probability, output graph from PCC is as good as clique graph H. Pr[ Δ( PCC( GH,p), GH,p) ≤ Δ(H, GH,p)] > 1 – δ.
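Δ is simply the size of the symmetric difference of the edge sets; a sketch with edges stored as sets of frozensets (the example graphs are made up):

```python
def edge_delta(E1, E2):
    # Delta(G, G') = |E Δ E'| = |E \ E'| + |E' \ E|
    return len(E1 ^ E2)

E1 = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3)]}   # a triangle
E2 = {frozenset(e) for e in [(1, 2), (3, 4)]}
print(edge_delta(E1, E2))   # 3: edges {2,3}, {1,3} removed and {3,4} added
```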

SLIDE 18

Cluster Affinity Search Technique (CAST)

Clustering of Gene Expression

[Figure: gene expression matrix, genes × samples (BMC Genomics 2006, 7:279)]

Each microarray experiment gives an expression vector x = (x1, …, xn), where xi is the expression value of gene i. Goal: group similar vectors.

SLIDE 19

Distances between vectors

Pearson product-moment correlation coefficient: measures the linear relationship between vectors xi and xj, with −1 ≤ rij ≤ 1:

$$ r_{ij} = \frac{\sum_{k=1}^{m} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{(m - 1)\, s_i s_j} $$

where the sample mean and sample standard deviation are

$$ \bar{x}_i = \frac{1}{m} \sum_{k=1}^{m} x_{ik}, \qquad s_i = \sqrt{\frac{1}{m - 1} \sum_{k=1}^{m} (x_{ik} - \bar{x}_i)^2} $$
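These formulas transcribe directly into Python (equivalently, np.corrcoef(xi, xj)[0, 1], since the (m − 1) factors cancel):

```python
import numpy as np

def pearson(xi, xj):
    # r_ij between two expression vectors of length m.
    m = len(xi)
    mi, mj = xi.mean(), xj.mean()                      # sample means
    si = np.sqrt(((xi - mi) ** 2).sum() / (m - 1))     # sample std devs
    sj = np.sqrt(((xj - mj) ** 2).sum() / (m - 1))
    return ((xi - mi) * (xj - mj)).sum() / ((m - 1) * si * sj)

xi = np.array([1.0, 2.0, 3.0, 4.0])
xj = np.array([2.0, 4.0, 6.0, 8.0])
print(pearson(xi, xj))   # 1.0: perfectly linearly related
```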