Comparative Genomics: Sequence, Structure, and Networks
Bonnie Berger, MIT
Comparative Genomics
Look at the same kind of data across species with the hope that areas of high correlation correspond to functional parts or modules of the genome.
Biology in One Slide
Function Protein
Comparative Genomics of DNA
Function Protein
Multiple Species Comparison
Look at multiple species simultaneously
Application to Regulatory Motif Discovery
Species compared:
- S. par
- S. mik
- S. bay
- S. cer
A signature for regulatory motifs. Evaluate conservation within:

                              Gal4    Controls
(1) All intergenic regions    22%     5%
(2) Intergenic : coding       4:1     1:3
(3) Upstream : downstream     12:1    1:1
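As a minimal sketch of the conservation test above, the following counts how many motif instances found in the reference species (S. cer) are preserved at the same aligned positions in all four species. The toy sequences, the gap-free alignment and the helper names `conserved_fraction` and `site` are assumptions made for illustration; the pattern CGG-N11-CCG is the commonly reported Gal4 binding site.

```python
import re

# Gal4 binding site as commonly reported: CGG, 11 arbitrary bases, CCG.
GAL4 = re.compile(r"CGG[ACGT]{11}CCG")

def conserved_fraction(alignment, reference="S. cer"):
    """Return (conserved, total) motif instances found in the reference.

    `alignment` maps species name -> aligned sequence of equal length.
    Assumption: no gaps; a real multiple alignment needs gap handling.
    """
    hits = [m.span() for m in GAL4.finditer(alignment[reference])]
    conserved = sum(
        all(GAL4.fullmatch(seq[start:end]) for seq in alignment.values())
        for start, end in hits
    )
    return conserved, len(hits)

def site(spacer, head="CGG"):
    """Build one hypothetical 17-base site: head + 11-base spacer + CCG."""
    return head + spacer + "CCG"

# Hypothetical toy data: two Gal4-like sites per sequence.
toy = {
    "S. cer": "TT" + site("A" * 11) + "TTTT" + site("T" * 11) + "TT",
    "S. par": "TT" + site("A" * 11) + "TTTT" + site("T" * 11) + "TT",
    # Second site destroyed by a mutation in the CGG half-site.
    "S. mik": "TT" + site("A" * 11) + "TTTT" + site("T" * 11, head="CAG") + "TT",
    # Spacer differences are allowed by the N11 part of the pattern.
    "S. bay": "TT" + site("ACGTACGTACG") + "TTTT" + site("T" * 11) + "TT",
}

conserved, total = conserved_fraction(toy)
print(f"{conserved} of {total} sites conserved across all species")
```

Running the same count separately on intergenic and coding regions, or on upstream and downstream regions, would give ratios of the kind listed in rows (2) and (3).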
Result Highlights
- Identify gene correspondence across species for > 90% of genes in yeast.
  – 99.9% sensitivity and 99% specificity on 4000 known genes.
  – Refine boundaries of hundreds of genes (5700 genes total).
- Identify most previously known and 41 novel regulatory motifs.
  – Genome-wide, unbiased search.
  – No previous knowledge necessary.
[Kellis, Patterson, Birren, Berger*, Lander* (2004). RECOMB, 157-166; J. Comp Biol special issue, 11:2-3, 319-355; Kellis et al. Nature (2003)]
Comparative Genomics of RNA
Function Protein
RNA Secondary Structure Detection
Problem: Identify biologically significant RNA secondary structure.
Challenge: Any given single sequence will have a plausible secondary structure.
[Figure: RNA secondary structure elements: hairpin loops, stems, bulge loop, interior loops, multi-branched loop. Base pairs shown: A-U, G-C, G-U]
Compensatory Mutations
Given K orthologous aligned RNA sequences: if the ith and jth positions are base-paired in many organisms, then their nucleotides must covary.
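One common way to quantify covariation between two alignment columns is mutual information. The sketch below is an illustration under that assumption and is not the statistic used by MSARi; the four-sequence alignment and the function names are hypothetical.

```python
import math
from collections import Counter

# Pairs that can form in RNA stems: Watson-Crick plus the G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def column(alignment, pos):
    """Nucleotides at position `pos` across all aligned sequences."""
    return [seq[pos] for seq in alignment]

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * math.log2((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in count_ij.items()
    )

def pairing_fraction(col_i, col_j):
    """Fraction of sequences in which the two positions can base-pair."""
    return sum((a, b) in PAIRS for a, b in zip(col_i, col_j)) / len(col_i)

# Hypothetical alignment of K = 4 orthologs. Positions 0 and 4 change
# together (G-C, A-U, C-G, U-A); position 2 is fully conserved.
alignment = ["GAAAC", "AAAAU", "CAAAG", "UAAAA"]

mi_paired = mutual_information(column(alignment, 0), column(alignment, 4))
mi_conserved = mutual_information(column(alignment, 0), column(alignment, 2))
paired = pairing_fraction(column(alignment, 0), column(alignment, 4))
print(mi_paired, mi_conserved, paired)
```

Columns 0 and 4 carry high mutual information and always pair, which is the compensatory-mutation signal; a conserved column carries none, so conservation alone does not indicate pairing.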
Approaches to Secondary Structure Detection
- Statistical
  – Stochastic context-free grammars for 2-species comparison (QRNA)
  – Machine learning (RNAGenie)
  – Our approach: statistical significance across multiple species (MSARi)
- Homology
  – Train on a particular RNA secondary structure and try to predict that structure
Result Highlights
- Identifies RNA secondary structure with 90% sensitivity at 98% specificity.
  – No previous knowledge necessary
- Used to identify functionally significant RNA secondary structure in mRNA.
- Can be used to scan multiple genomes for RNA secondary structure.
- Benchmarks:
  – QRNA: 19.8% sensitivity at 98% specificity
  – ddbRNA: 68% sensitivity at 97.7% specificity
[Coventry, Kleitman, Berger (2004), PNAS]
Comparative Genomics of Proteins
Function Protein
Protein Structure: The Protein Folding Problem
Given an amino acid sequence, e.g., MDPNCSCAAAGDSCTCANSCTCLACKCTSCK, how will it fold in 3D?
- Proteins must fold to function
- Some diseases are caused by misfolding, e.g., mad cow disease
Protein Folding by Comparative Modeling
- Similar protein sequences imply similar structures
- Use known structures to predict a new one
- About 40,000 protein structures have been solved using experimental techniques and stored in the Protein Data Bank (PDB); ~1000 are unique structural folds
[Figure: examples of different structural folds and of the same structural fold]
Protein Threading
Threading = match between a string and a 3D object
Query sequence: DRVYIHPFA
[Figure: the best match of the query sequence to a 3D structure]
Result Highlights
- RAPTOR: threading as Linear-Programming (Jinbo Xu)

$\text{Minimize } E = \sum a_{i,l}\, x_{i,l} + \sum b_{(i,l)(j,k)}\, y_{(i,l)(j,k)}$
s.t.
$x_{i,l} = \sum_{k \in R[i,j,l]} y_{(i,l)(j,k)}, \quad \forall\, l \in D[i]$
$x_{j,k} = \sum_{l \in R[j,i,k]} y_{(i,l)(j,k)}, \quad \forall\, k \in D[j]$
$\sum_{l \in D[i]} x_{i,l} = 1$
$x_{i,l},\ y_{(i,l)(j,k)} \in \{0, 1\}$

[Figure: input sequence T N L A K Y E T L … threaded onto a structural template with positions 1 to 10]
- RAPTOR was the best-performing algorithm at CAFASP, a worldwide competition
  – DBLRAP is our extension of RAPTOR for joint homology modeling of two structures (PSB ’06)
- Extend LP formulation to score interfaces between two structures as well
- DBLRAP was able to predict interactions for 8% of proteins in the yeast genome (cf. 5% previously)
Threading Protein Complexes
[Figure: DBLRAP threading of two sequences, EGAATQY… and RGPPQLIK…, onto structures A and B]
Structure Alignment
Protein Structure Alignment
Problem: find the optimal alignment between two protein structures
Contact Map Alignment
Goal: find the maximum common subgraph of two contact maps, where two residues are in contact if their distance is ≤ Du (5Å−7.5Å)
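A minimal sketch of the objects involved: building a contact map from residue coordinates with a distance threshold, and counting the contacts that a given residue mapping preserves. The coordinates, the function names and the use of one point per residue are assumptions for illustration. Searching over all mappings for the maximum common subgraph is the hard part and is not attempted here.

```python
import math

def contact_map(coords, d_u=7.5):
    """Residue index pairs (i, j), i < j, whose distance is <= d_u (in Å)."""
    return {
        (i, j)
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
        if math.dist(coords[i], coords[j]) <= d_u
    }

def shared_contacts(map_a, map_b, mapping):
    """Count contacts of A that are also contacts in B under `mapping`.

    `mapping` sends residue indices of A to residue indices of B.
    """
    count = 0
    for i, j in map_a:
        if i in mapping and j in mapping:
            u, v = sorted((mapping[i], mapping[j]))
            count += (u, v) in map_b
    return count

# Hypothetical traces: a straight chain with 3.8 Å spacing and a bent chain.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 0.0)]

map_a = contact_map(chain_a)
map_b = contact_map(chain_b)
identity = {i: i for i in range(len(chain_a))}
print(sorted(map_a))
print(sorted(map_b))
print(shared_contacts(map_a, map_b, identity))
```

The value returned by `shared_contacts` is the size of the common subgraph induced by one particular alignment; contact map alignment asks for the mapping that maximizes it.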
State of the Art: Contact Map Alignment
- History: more than 20 years, many programs based on heuristic algorithms
- NP-hard and hard to approximate when measured by Maximum Common Subgraph (Goldman, Papadimitriou & Istrail, FOCS 99)
- Lagrangian relaxation (Caprara & Lancia, Recomb 2002)
- Integer programming (Caprara et al, JCB 2004)
Tree-Decomposition for Protein Structure Alignment
Method: tree-decomposition of one protein structure into small pieces to exploit the geometric characteristics of a structure
Results: there is a polynomial-time approximation scheme (PTAS) that finds an alignment scoring at least (1-1/k) of the best. Its time complexity is:
[Equation garbled in extraction. Surviving symbols: O, k, poly, N, Δ, tw, ε, D, Dc, Dl; the exact expression is not recoverable]
The parameters D, Dc and Dl are small constants, and so is D/Dl. Therefore, this problem admits a PTAS, the best that we can achieve since this problem is NP-hard.
Biological Applications of Tree Decomposition
- Sidechain packing (Xu & Berger, Recomb ’05, JACM ’06)
- Protein threading (Xu, Jiao, & Berger, CSB ’05)
- Network motif search (Dost et al, Recomb ’07)
- RNA secondary structure alignment (Song et al, CSB ’05)
- De novo sequencing (Liu et al, PSB ’06)
- Protein structure alignment (Xu, Jiao, & Berger, Recomb ’06, JCB ’06; Xu, CDC ’07)
Comparative Genomics of Networks
Function Protein
Why understanding function-level differences is important
- Increased complexity (function) is not explained simply by variations in gene (or protein) count
[Figure: chart of estimated number of genes and estimated number of proteins per species. Values as extracted: 6600, 21000, 14000, 24500, 23000, 6600, 27000, 19000, 32000, 49000; species labels not recoverable]
Numbers from http://www.ensembl.org
Protein-Protein Interactions (PPIs)
- Often, proteins interact with other proteins to perform their functions
- Many cellular activities are a result of protein interactions
[Figure: MAPK signaling cascade. Image from: http://focosi.altervista.org/mapkmap2.html]
Modeling PPIs
- Traditional perspective: low-throughput, structural
- New perspective: high-throughput, network-based
[Figure: G-protein complex (Gα, Gβ, Gγ, GDP) shown from the traditional perspective and from the new systems-level perspective. Image from www.rcsb.org]
Protein-Protein Interaction (PPI) Network
http://internal.binf.ku.dk
Yeast PPI Network
Cusick et al. Hum Med Gen, 05
Yeast 2-Hybrid method: X + Y = ?
Motivation behind Network Comparison
- Compare PPI networks at the species level
- Transfer annotation from one species to another
  – More feasible, cheaper and easier than in humans
  – Error detection
- Compute functional orthologs
  – Functional orthologs: proteins which perform the same function across species
The Problem
Given two protein-protein interaction networks, find, for a piece of one network, a piece of the other network with a comparable structure.
Our approach: match neighborhood topologies
Algorithm: IsoRank
[Figure: two networks with nodes a1 … a8 and b1 … b9]

Sequence similarity

Node (graph 1)   Node (graph 2)   Score
a3               b6               3e-9
a3               b1               5e-4
a5               b9               1e-4
a5               b3               1e-7
…
a5               b1               2e-8
a5               b7               1e-2

Functional similarity for each possible node pairing

Node (graph 1)   Node (graph 2)   Score
a5               b7               2.1
a5               b9               1.5
a3               b2               3.4
Functional Similarity Score: Intuition
- Compute pairwise scores Rij
- Goal: “high Rij” ⇒ “i and j are a good match”
- Intuition: i and j are a good match if their sequences align and their neighbors are a good match
[Figure: example networks with nodes a1 … a5 and b1 … b5] Ra5,b1 = ?
Computing Rij
- Combine both sequence and network data
- Sequence similarity only: Rij = Eij
- Sequence and network similarity: Rij = (1-α)Eij + αNij
  where Rij is the functional similarity, Eij the sequence similarity and Nij the network similarity
Simple Case: α = 1 (no Eij)
- Rij = Nij. Rij depends on the neighborhoods of i and j
- N(a) is the set of neighbors of a

$R_{ij} = N_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$

[Figure: example networks with nodes a1 … a5 and b1 … b5]
- Pairing a1 with b4 (neighbors a2 and b3):
  $R_{a1,b4} = \frac{1}{2 \times 3} R_{a2,b3}$
- Pairing a2 with b2 (neighbors a1, a3 and b1, b3):
  $R_{a2,b2} = \frac{1}{1 \times 1} R_{a1,b1} + \frac{1}{1 \times 3} R_{a1,b3} + \frac{1}{3 \times 1} R_{a3,b1} + \frac{1}{3 \times 3} R_{a3,b3}$
Example: Computed Rij values
[Figure: example networks with nodes a1 … a5 and b1 … b5]

        b1       b2       b3       b4       b5
a1      0.0312            0.0937
a2               0.1250            0.0625   0.0625
a3      0.0937            0.2813
a4               0.0625            0.0312   0.0312
a5               0.0625            0.0312   0.0312

Empty cell indicates Rij = 0
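The computed values above can be checked against the α = 1 update rule: applying the rule once to the table should reproduce the table, up to the four-decimal rounding. In this sketch the edge lists of the two example networks are an assumption inferred from the neighbor counts in the worked examples, and the placement of values in columns b1 … b5 is likewise inferred.

```python
# Example networks as adjacency lists. Assumption: edges inferred from the
# neighbor counts used in the worked alpha = 1 examples.
G1 = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2", "a4", "a5"],
      "a4": ["a3"], "a5": ["a3"]}
G2 = {"b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2", "b4", "b5"],
      "b4": ["b3"], "b5": ["b3"]}

def isorank_step(R, g1, g2):
    """Apply R_ij <- sum_{u in N(i)} sum_{v in N(j)} R_uv / (|N(u)| |N(v)|)."""
    return {
        (i, j): sum(R[u, v] / (len(g1[u]) * len(g2[v]))
                    for u in g1[i] for v in g2[j])
        for i in g1 for j in g2
    }

# Nonzero entries as printed (4 decimals); column placement is inferred.
TABLE = {
    ("a1", "b1"): 0.0312, ("a1", "b3"): 0.0937,
    ("a2", "b2"): 0.1250, ("a2", "b4"): 0.0625, ("a2", "b5"): 0.0625,
    ("a3", "b1"): 0.0937, ("a3", "b3"): 0.2813,
    ("a4", "b2"): 0.0625, ("a4", "b4"): 0.0312, ("a4", "b5"): 0.0312,
    ("a5", "b2"): 0.0625, ("a5", "b4"): 0.0312, ("a5", "b5"): 0.0312,
}
R = {(i, j): TABLE.get((i, j), 0.0) for i in G1 for j in G2}

R_next = isorank_step(R, G1, G2)
max_change = max(abs(R_next[key] - R[key]) for key in R)
print(f"sum(R) = {sum(R.values()):.4f}")
print(f"largest change after one update = {max_change:.1e}")
```

The largest change stays within rounding error, so the table is a fixed point of the update under the assumed edges, and the highest score goes to the pair of degree-3 hubs, a3 and b3.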
Capturing non-local effects?
- The algorithm can resolve between p-r vs. p-q
[Figure: network with nodes p, q, r] Rpr = 8.12e-3, Rpq = 8.64e-3
Computing R: an eigenvalue problem
- The equations for R describe an eigenvalue problem: R is the principal eigenvector of A

$R_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv} \quad\Longleftrightarrow\quad R = AR$
$A[i,j][u,v] = \frac{1}{|N(u)|\,|N(v)|}$ if $u \in N(i)$ and $v \in N(j)$, and 0 otherwise
$\mathrm{size}(A) = N_1 N_2 \times N_1 N_2$

N1 = # nodes in Graph 1, N2 = # nodes in Graph 2
- A is about 10^8 x 10^8 when aligning yeast and fly networks
  – However, both A and R are very sparse
  – We use the power method to efficiently compute R
- Extension to weighted edges is straightforward
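To make the eigenvalue formulation concrete, the sketch below assembles A as a sparse dictionary for two small graphs and runs the power method on it. The graphs, node names and function names are hypothetical. Both toy graphs contain a triangle, an assumption chosen so that the plain α = 1 iteration settles instead of oscillating.

```python
from itertools import product

# Hypothetical toy graphs as adjacency lists; each contains a triangle.
G1 = {"i": ["u", "r"], "u": ["i", "r"], "r": ["i", "u", "p"], "p": ["r"]}
G2 = {"j": ["v", "s"], "v": ["j", "s"], "s": ["j", "v"]}

def build_A(g1, g2):
    """Sparse A with A[(i, j)][(u, v)] = 1 / (|N(u)| |N(v)|)
    whenever u is a neighbor of i and v is a neighbor of j."""
    A = {pair: {} for pair in product(g1, g2)}
    for (i, j) in A:
        for u in g1[i]:
            for v in g2[j]:
                A[i, j][u, v] = 1.0 / (len(g1[u]) * len(g2[v]))
    return A

def power_method(A, iterations=200):
    """Repeatedly apply R <- A R, renormalizing so the entries sum to 1."""
    R = {pair: 1.0 / len(A) for pair in A}
    for _ in range(iterations):
        R = {row: sum(w * R[col] for col, w in A[row].items()) for row in A}
        total = sum(R.values())
        R = {pair: value / total for pair, value in R.items()}
    return R

A = build_A(G1, G2)
# Each column of A sums to 1, so A is the transition matrix of a random walk
# on pairs of nodes.
column_sums = {col: sum(A[row].get(col, 0.0) for row in A) for col in A}
R = power_method(A)
print(len(A), "x", len(A))
print(max(R, key=R.get), round(max(R.values()), 4))
```

Only the nonzero entries of A are stored, which is what makes the approach feasible when N1·N2 is large. With no sequence term, the scores on these toy graphs end up proportional to the product of the two node degrees.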
A Random Walk Interpretation
[Figure: tensor product graph G1 x G2, whose nodes are pairs such as (i,j), (u,v), (r,s). A random walk moves from (u,v) to (i,j) with probability 1/(|N(u)||N(v)|) and from (i,j) to (u,v) with probability 1/(|N(i)||N(j)|)]
General Case: 0 ≤ α ≤ 1
- Let Bij = sequence similarity score between i (from graph #1) and j (from graph #2)
- $E_{ij} = B_{ij} / \|B\|_1$
- For α = 1: $R = AR$
- In general: $R = \alpha A R + (1-\alpha) E, \quad 0 \le \alpha \le 1$
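A minimal sketch of the general-case iteration R = αAR + (1-α)E on a pair of five-node networks. The edge lists and the sequence scores in `B` are hypothetical, `B` is assumed to use "higher means more similar" scores, and α = 0.6 is the value reported for the yeast-fly alignment. A is applied implicitly through the neighbor sums rather than stored.

```python
ALPHA = 0.6  # value reported for the yeast-fly alignment

# Hypothetical five-node networks as adjacency lists.
G1 = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2", "a4", "a5"],
      "a4": ["a3"], "a5": ["a3"]}
G2 = {"b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2", "b4", "b5"],
      "b4": ["b3"], "b5": ["b3"]}

# Hypothetical sequence similarity scores B_ij (higher = more similar).
B = {("a1", "b1"): 2.0, ("a3", "b3"): 5.0, ("a5", "b4"): 3.0}

def apply_A(R, g1, g2):
    """Compute (AR)_ij = sum_{u in N(i)} sum_{v in N(j)} R_uv / (|N(u)| |N(v)|)."""
    return {
        (i, j): sum(R[u, v] / (len(g1[u]) * len(g2[v]))
                    for u in g1[i] for v in g2[j])
        for i in g1 for j in g2
    }

def isorank(g1, g2, B, alpha, tol=1e-12, max_iter=1000):
    """Iterate R <- alpha * A R + (1 - alpha) * E until R stops changing."""
    pairs = [(i, j) for i in g1 for j in g2]
    norm = sum(B.values())                      # |B|_1 for non-negative scores
    E = {p: B.get(p, 0.0) / norm for p in pairs}
    R = {p: 1.0 / len(pairs) for p in pairs}    # uniform start, sums to 1
    for _ in range(max_iter):
        AR = apply_A(R, g1, g2)
        R_new = {p: alpha * AR[p] + (1 - alpha) * E[p] for p in pairs}
        change = sum(abs(R_new[p] - R[p]) for p in pairs)
        R = R_new
        if change < tol:
            break
    return R

R = isorank(G1, G2, B, ALPHA)
for pair, score in sorted(R.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 4))
```

Because A preserves the total mass and E sums to 1, R keeps summing to 1 without explicit renormalization, and for α < 1 each iteration shrinks the remaining error by at least a factor of α.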
Complex Case: Multiple Networks
[Figure: three networks #1, #2 and #3, with a pairwise score matrix R computed between each pair: #1-#2, #1-#3 and #2-#3]
Results: Yeast-Fly Global Alignment
- # of edges in the common subgraph: 1420
- Implies about 5% overlap! Why so low?
- PPI data currently is noisy and low-coverage
- # of edges in the largest component: 35
- The value of α used: 0.6
- Provided best overall agreement with previous gene correspondence predictions
Various Topologies Are Found
Existing local alignment methods (PathBlast; Kelley et al.) often find only specific topologies
Role of α: why the dip?
Robustness to Error in PPI data
[Figure: two copies of a network with nodes a1 … a11; plot annotated “True curve somewhere around here”]
Functional Orthologs
- Genes that perform similar functions
  – “functional orthologs” vs “plain old orthologs”
  – distinguish between orthologs and paralogs
- Bandyopadhyay et al. [Genome Res. ’06]
  – Use local network alignment results
  – Then use an MRF to partially resolve ambiguities
- We compared our results with theirs
Functional Orthologs: IsoRank Pairwise Alignment Predictions

           Functional Ortholog
Protein    IsoRank    Bandyopadhyay et al.
Gid8       CG6617     CG6617 (76%), CG18467 (-)
Gpa1       Goα47a     Goα47a (41%), Giα65a (-)
Kap104     Trn        Trn (41%), CG8219 (47%)
CG18617    Vph1       Vph1 (43%), Stv1 (48%)
Egd1       Bic        Bcd (47%)
Results: Multiple Network Alignment
- Size of networks
  – human (36387 PPIs), yeast (31899 PPIs), fly (25831 PPIs), worm (4573 PPIs) and mouse (255 PPIs)
- # of edges in the common subgraph with IsoRank
  – 1663 PPIs aligned in at least 2 species
  – 157 PPIs aligned in at least 3 species
- Comparison with NCBI’s Homologene
  – 509 PPIs in at least 2 species
  – 40 PPIs in at least 3 species
- INPARANOID (http://inparanoid.sbc.su.se/cgi-bin/index.cgi)
  – ≤ 1172 PPIs in at least 2 species
Multiple Network Alignment: Functional Orthologs
- Coverage of known genes
  – Out of 86,932 proteins in five species
  – IsoRank: 59,539 have at least one mapping
  – INPARANOID: ≤ 55,000; 66% overlap with ours
  – Homologene: 33,434
- Functional coherence
  – 0.220 IsoRank
  – 0.206 INPARANOID
  – 0.223 Homologene
Theoretical Considerations
- Limitation: K-regular graphs, i.e., what if all the nodes have the same degree?
- Convergence guarantee: number of iterations of the power method scales as log(1/α)
Biological Applications of IsoRank
- Pairwise network alignment (Singh, Xu & Berger, Recomb ’07, SODA ’08)
- Multiple network alignment (Singh, Xu & Berger, PSB ’08)
- Multiple RNA secondary and tertiary structure alignment
- Multiple protein structure alignment
Related Work on (Pairwise) Network Alignment
- PathBlast: Kelley et al.
  – Use sequence similarity to shortlist possible pairs of matching nodes
  – Search for conserved topologies like pathways and hub-and-spokes
- Koyuturk et al.
  – Like PathBlast, but with a more sophisticated objective function that models gene deletion etc.
- Graemlin: Batzoglou et al.
  – First uses sequence data to generate matching pairs of “seed” subgraphs and then heuristically grows the seed matches in search of a specific topology
Open Issues
- How can traditional graph-theoretic algorithms be extended to handle noise and incomplete data in biology?
Acknowledgments
Sequence Genomics: Manolis Kellis, Eric Lander, Nick Patterson
RNA Secondary Structure: Alex Coventry, Dan Kleitman
Protein Structure: Jinbo Xu, Rohit Singh
PPI Network Alignment: Rohit Singh, Jinbo Xu
Thanks also to:
- Michael Baym
- Gopal Ramachandran
- Leonid Chindelevitch
- Michael Schnall-Levin
- Chris Bakal, HMS & CSAIL
- Lenore Cowen, Tufts