Comparative Genomics: Sequence, Structure, and Networks



slide-1
SLIDE 1

Comparative Genomics: Sequence, Structure, and Networks

Bonnie Berger MIT

slide-2
SLIDE 2

Comparative Genomics

Look at the same kind of data across species with the hope that areas of high correlation correspond to functional parts or modules of the genome.

slide-3
SLIDE 3

Biology in One Slide

Function Protein

slide-4
SLIDE 4

Comparative Genomics of DNA

Function Protein

slide-5
SLIDE 5

Multiple Species Comparison

Look at multiple species simultaneously

slide-6
SLIDE 6

Application to Regulatory Motif Discovery

  • S. par
  • S. mik
  • S. bay
  • S. cer

A signature for regulatory motifs

Evaluate conservation within:     Gal4     Controls
(1) All intergenic regions        22%      5%
(2) Intergenic : coding           4:1      1:3
(3) Upstream : downstream         12:1     1:1
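The conservation rate behind such a table can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the aligned sequences are toy data, the alignment is assumed to be gap-free, the Gal4 consensus `CGG-N11-CCG` is supplied from general knowledge rather than from the slide, and `motif_conservation` is a hypothetical helper name.

```python
import re

# Assumption: the Gal4 site is modeled as CGG, 11 arbitrary bases, then CCG.
GAL4 = "CGG[ACGT]{11}CCG"

def motif_conservation(aligned, pattern):
    """Count motif instances in the first (reference) sequence and how many
    of them also match the motif at the same columns in every other species.
    Assumes a gap-free alignment of equal-length sequences."""
    regex = re.compile(pattern)
    total = conserved = 0
    # A lookahead lets overlapping instances in the reference be counted too.
    for m in re.finditer(f"(?=({pattern}))", aligned[0]):
        start, end = m.start(1), m.end(1)
        total += 1
        if all(regex.fullmatch(seq[start:end]) for seq in aligned[1:]):
            conserved += 1
    return conserved, total

# Toy four-species alignment (hypothetical data).
inst_a = "CGG" + "ACGTACGTACG" + "CCG"
inst_b = "CGG" + "T" * 11 + "CCG"
ref = "AA" + inst_a + "TTTT" + inst_b + "AA"
sp2 = "AA" + "CGG" + "AAGTACGTACG" + "CCG" + "TTTT" + inst_b + "AA"  # spacer change only
sp3 = "AT" + inst_a + "TTTT" + "CAG" + "T" * 11 + "CCG" + "AA"       # second site broken
sp4 = ref

conserved, total = motif_conservation([ref, sp2, sp3, sp4], GAL4)
print(f"{conserved} of {total} reference instances conserved in all species")
```

Running the same count separately on intergenic versus coding regions, or upstream versus downstream regions, would give ratios of the kind listed in rows (2) and (3).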

slide-7
SLIDE 7

Result Highlights

  • Identify gene correspondence across species for > 90% of genes in yeast.
    – 99.9% sensitivity and 99% specificity on 4000 known genes.
    – Refine boundaries of hundreds of genes (5700 genes total).
  • Identify most previously known and 41 novel regulatory motifs.
    – Genome-wide, unbiased search.
    – No previous knowledge necessary.

[Kellis, Patterson, Birren, Berger*, Lander* (2004). RECOMB, 157-166; J. Comp Biol special issue, 11:2-3, 319-355; Kellis et al. Nature (2003)]

slide-8
SLIDE 8

Comparative Genomics of RNA

Function Protein

slide-9
SLIDE 9

RNA Secondary Structure Detection

Problem: Identify biologically significant RNA secondary structure.

Challenge: Any given single sequence will have a plausible secondary structure.

[Figure: RNA secondary structure elements: hairpin loops, stems, bulge loop, interior loops, multi-branched loop. Base pairs shown: A-U, G-C, G-U]

slide-10
SLIDE 10

Compensatory Mutations

Given K orthologous aligned RNA sequences: if the ith and jth positions are base-paired in many organisms, then their nucleotides must covary.
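Covariation between two alignment columns can be quantified in several ways. The sketch below uses mutual information, a standard measure that is swapped in here for illustration; it is not the statistic used by MSARi or the other tools in this deck. The toy alignment and the helper names `mutual_information` and `fraction_paired` are assumptions.

```python
from collections import Counter
from math import log2

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

def fraction_paired(alignment, i, j):
    """Fraction of sequences whose nucleotides at i and j can base-pair."""
    return sum((seq[i], seq[j]) in PAIRS for seq in alignment) / len(alignment)

# Toy alignment of K = 4 sequences: columns 0 and 4 change together
# (compensatory mutations), while columns 1 and 2 never change.
alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
mi_paired = mutual_information(alignment, 0, 4)
mi_constant = mutual_information(alignment, 1, 2)
print(mi_paired, fraction_paired(alignment, 0, 4), mi_constant)
```

Columns that stay paired while their nucleotides change score high on both measures; perfectly conserved columns carry no covariation signal even if they are paired, which is one reason a significance test across many species is needed.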

slide-12
SLIDE 12

Approaches to Secondary Structure Detection

  • Statistical
    – Stochastic context-free grammars for 2-species comparison (QRNA)
    – Machine learning (RNAGenie)
    – Our approach: statistical significance across multiple species (MSARi)
  • Homology
    – Train on a particular RNA secondary structure and try to predict that structure

slide-13
SLIDE 13

Result Highlights

  • Identifies RNA secondary structure with 90% sensitivity at 98% specificity.
    – No previous knowledge necessary
  • Used to identify functionally significant RNA secondary structure in mRNA.
  • Can be used to scan multiple genomes for RNA secondary structure.
  • Benchmarks:
    – QRNA: 19.8% sensitivity at 98% specificity
    – ddbRNA: 68% sensitivity at 97.7% specificity

[Coventry, Kleitman, Berger (2004), PNAS]

slide-14
SLIDE 14

Comparative Genomics of Proteins

Function Protein

slide-15
SLIDE 15

Protein Structure: The Protein Folding Problem

Given an amino acid sequence, e.g., MDPNCSCAAAGDSCTCANSCTCLACKCTSCK, how will it fold in 3D?

  • Proteins must fold to function
  • Some diseases are caused by misfolding, e.g., mad cow disease

slide-16
SLIDE 16

Protein Folding by Comparative Modeling

  • Similar protein sequences ⇒ similar structures
  • Use known structures to predict a new one
  • About 40,000 protein structures have been solved using experimental techniques and stored in the Protein Data Bank (PDB); ~1000 are unique structural folds

[Figure panels: different structural folds; same structural folds]

slide-17
SLIDE 17

Protein Threading

Threading = match between a string and a 3D object

Query sequence: DRVYIHPFADRVYIHPFA

[Figure: the query sequence and its best-matching structure, labeled "The Best Match"]

slide-18
SLIDE 18

Result Highlights

  • RAPTOR: threading as linear programming (Jinbo Xu)

$$
\begin{aligned}
\text{Minimize}\quad & E = \sum_{i,l} a_{i,l}\,x_{i,l} + \sum_{(i,l),(j,k)} b_{(i,l)(j,k)}\,y_{(i,l)(j,k)} \\
\text{s.t.}\quad & x_{i,l} = \sum_{k \in R[i,j,l]} y_{(i,l)(j,k)}, \quad \forall\, l \in D[i] \\
& x_{j,k} = \sum_{l \in R[j,i,k]} y_{(i,l)(j,k)}, \quad \forall\, k \in D[j] \\
& \sum_{l \in D[i]} x_{i,l} = 1 \\
& x_{i,l},\ y_{(i,l)(j,k)} \in \{0, 1\}
\end{aligned}
$$

[Figure: an input sequence (T N L A K Y E T L …) aligned to positions 1-10 of a structural template]

  • RAPTOR was the best-performing algorithm at CAFASP, a worldwide competition

slide-19
SLIDE 19

Threading Protein Complexes

  • DBLRAP is our extension of RAPTOR for joint homology modeling of two structures (PSB 06)
  • Extend LP formulation to score interfaces between two structures as well
  • DBLRAP was able to predict interactions for 8% of proteins in the yeast genome (cf. 5% previously)

[Figure: DBLRAP applied to two chains, labeled A and B, with sequences EGAATQY… and RGPPQLIK…]

slide-20
SLIDE 20

Structure Alignment

slide-21
SLIDE 21

Protein Structure Alignment

Problem: find the optimal alignment between two protein structures

slide-22
SLIDE 22

Contact Map Alignment

Goal: find the maximum common subgraph of the two contact maps

Contact: distance ≤ Du (5 Å to 7.5 Å)
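As a minimal sketch of the objects involved, the code below builds a contact map from 3D coordinates with a distance threshold and counts the contacts shared under a given residue alignment. It only scores a supplied alignment; it does not search for the optimal one, which is the hard part. The coordinates, the one-point-per-residue simplification, and the function names are assumptions made for illustration.

```python
import numpy as np

def contact_map(coords, d_u=7.5):
    """Boolean matrix: True where two distinct residues lie within d_u angstroms.
    Assumes one representative 3D point per residue (e.g. its C-alpha atom)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = dist <= d_u
    np.fill_diagonal(contacts, False)  # a residue is not in contact with itself
    return contacts

def shared_contacts(map_a, map_b, alignment):
    """Number of contacts preserved by an alignment, given as (i, j) pairs
    matching residue i of structure A to residue j of structure B."""
    count = 0
    for x, (i, j) in enumerate(alignment):
        for k, l in alignment[x + 1:]:
            if map_a[i, k] and map_b[j, l]:
                count += 1
    return count

# Toy structure: four residues on a line, compared against itself.
coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (20, 0, 0)]
cm = contact_map(coords, d_u=5.0)
identity = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(shared_contacts(cm, cm, identity))
```

The maximum common subgraph objective corresponds to maximizing `shared_contacts` over all admissible alignments.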

slide-23
SLIDE 23

State of the Art: Contact Map Alignment

  • History: more than 20 years, many programs based on heuristic algorithms
  • NP-hard, and hard to approximate when measured by maximum common subgraph (Goldman, Papadimitriou & Istrail, FOCS 99)
  • Lagrangian relaxation (Caprara & Lancia, Recomb 2002)
  • Integer programming (Caprara et al, JCB 2004)

slide-24
SLIDE 24

Tree-Decomposition for Protein Structure Alignment

Method: tree-decomposition of one protein structure into small pieces to exploit the geometric characteristics of a structure

Results: there is a polynomial-time approximation scheme (PTAS) that finds an alignment scoring at least (1-1/k) of the best. Its time complexity is:

[Equation garbled in extraction; the surviving symbols are O, k, poly(N), Δ, tw, ε, D, Dc, and Dl]

The parameters D, Dc and Dl are small constants, as is D/Dl. Therefore, this problem admits a PTAS, the best that we can achieve since this problem is NP-hard.

slide-25
SLIDE 25

Biological Applications of Tree Decomposition

  • Sidechain packing (Xu & Berger, Recomb ’05, JACM ’06)
  • Protein threading (Xu, Jiao, & Berger, CSB ’05)
  • Network motif search (Dost et al, Recomb ’07)
  • RNA secondary structure alignment (Song et al, CSB ’05)
  • De novo sequencing (Liu et al, PSB ’06)
  • Protein structure alignment (Xu, Jiao, & Berger, Recomb ’06, JCB ’06; Xu, CDC ’07)

slide-26
SLIDE 26

Comparative Genomics of Networks

Function Protein

slide-27
SLIDE 27

Why understanding function-level differences is important

  • Increased complexity (function) is not explained simply by variations in gene (or protein) count

[Figure: estimated number of genes (6600, 21000, 14000, 24500, 23000) and estimated number of proteins (6600, 27000, 19000, 32000, 49000) for five species, in the order extracted; species labels not recoverable]

Numbers from http://www.ensembl.org

slide-28
SLIDE 28

Protein-Protein Interactions (PPIs)

  • Often, proteins interact with other proteins to perform their functions
  • Many cellular activities are a result of protein interactions

[Figure: MAPK Signaling Cascade. Image from: http://focosi.altervista.org/mapkmap2.html]

slide-29
SLIDE 29

Modeling PPIs

  • Traditional perspective: low-throughput, structural
  • New perspective: high-throughput, network-based

[Figure, traditional perspective: structure of the G-protein complex (Gα, Gβ, Gγ, GDP). Image from www.rcsb.org]

[Figure, new systems-level perspective: network with nodes Gα, Gβ, Gγ, GDP]

slide-30
SLIDE 30

Protein-Protein Interaction (PPI) Network

[Figure: Yeast PPI Network. Image from http://internal.binf.ku.dk]

[Figure: Yeast 2-Hybrid method, testing whether proteins X and Y interact (X + Y = ?). Cusick et al. Hum Med Gen, 05]

slide-31
SLIDE 31

Motivation behind Network Comparison

  • Compare PPI networks at the species level
  • Transfer annotation from one species to another
    – More feasible, cheaper and easier than in humans
    – Error detection
  • Compute functional orthologs
    – Functional orthologs: proteins which perform the same function across species

slide-32
SLIDE 32

The Problem

Given two protein-protein interaction networks, find, for a piece of one network, something that has a comparable structure in the other network.

Our approach: match neighborhood topologies

slide-33
SLIDE 33

Algorithm: IsoRank

[Figure: two networks, one with nodes a1-a8 and one with nodes b1-b9]

Sequence similarity

a3   b6   3e-9
a3   b1   5e-4
a5   b9   1e-4
a5   b3   1e-7
a5   b1   2e-8
a5   b7   1e-2

Functional similarity for each possible node pairing

a5   b7   2.1
a5   b9   1.5
a3   b2   3.4

slide-34
SLIDE 34

Functional Similarity Score: Intuition

  • Compute pairwise scores Rij
  • Goal: “high Rij” ⇒ “i and j are a good match”
  • Intuition: i and j are a good match if their sequences align and their neighbors are a good match

[Figure: two networks with nodes a1-a5 and b1-b5; Ra5,b1 = ?]

slide-35
SLIDE 35

Computing Rij

  • Combine both sequence and network data

Sequence data only: Rij = Eij (functional similarity = sequence similarity)

Sequence and network data: Rij = (1-α)Eij + αNij, where Rij is the functional similarity, Eij the sequence similarity, and Nij the network similarity

slide-36
SLIDE 36

Simple Case: α=1 (no Eij)

  • Rij = Nij. Rij depends on the neighborhoods of i and j
  • N(a) is the set of neighbors of a

$$
R_{ij} = N_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|}\, R_{uv}
$$

[Figure: two networks with nodes a1-a5 and b1-b5; nodes a1, a2, b3, b4 highlighted]

Example:

$$
R_{a1,b4} = \frac{1}{2 \times 3}\, R_{a2,b3}
$$

slide-37
SLIDE 37

Simple Case: α=1 (no Eij)

  • Rij = Nij. Rij depends on the neighborhoods of i and j
  • N(a) is the set of neighbors of a

$$
R_{ij} = N_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|}\, R_{uv}
$$

[Figure: two networks with nodes a1-a5 and b1-b5; nodes a1, a2, a3, b1, b2, b3 highlighted]

Example:

$$
R_{a2,b2} = \frac{1}{1 \times 1} R_{a1,b1} + \frac{1}{1 \times 3} R_{a1,b3} + \frac{1}{3 \times 1} R_{a3,b1} + \frac{1}{3 \times 3} R_{a3,b3}
$$
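The neighborhood sum can be written as two matrix products. The sketch below is an illustration, not the authors' implementation. The edge list (a1-a2, a2-a3, a3-a4, a3-a5, and the same for b) is an assumption inferred from the coefficients in the worked examples, since the figure itself did not survive extraction; `neighbor_update` is a hypothetical name.

```python
import numpy as np

def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on n nodes."""
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0
    return adj

def neighbor_update(R, adj1, adj2):
    """One application of
    R_ij <- sum_{u in N(i)} sum_{v in N(j)} R_uv / (|N(u)| |N(v)|).
    Assumes every node has at least one neighbor."""
    deg1 = adj1.sum(axis=1)
    deg2 = adj2.sum(axis=1)
    return adj1 @ (R / np.outer(deg1, deg2)) @ adj2

# Assumed topology for both toy networks (indices 0..4 stand for a1..a5 / b1..b5).
G = adjacency(5, [(0, 1), (1, 2), (2, 3), (2, 4)])
deg = G.sum(axis=1)

# Candidate solution: R_ij proportional to |N(i)| |N(j)| on pairs whose nodes
# fall on the same side of the two-coloring of this tree, zero elsewhere.
side = np.array([0, 1, 0, 1, 1])
same_side = side[:, None] == side[None, :]
R = np.where(same_side, np.outer(deg, deg) / 32.0, 0.0)

R_next = neighbor_update(R, G, G)
print(np.allclose(R_next, R), R[2, 2])
```

Under the assumed topology this candidate is left unchanged by the update, and its entry for (a3, b3) is 9/32 ≈ 0.2813, the largest value in the deck's computed example.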

slide-38
SLIDE 38

Example: Computed Rij values

[Figure: two networks with nodes a1-a5 and b1-b5]

R       b1       b2       b3       b4       b5
a1    0.0312            0.0937
a2             0.1250            0.0625   0.0625
a3    0.0937            0.2813
a4             0.0625            0.0312   0.0312
a5             0.0625            0.0312   0.0312

Empty cell indicates Rij = 0

slide-41
SLIDE 41

Capturing non-local effects?

  • The algorithm can distinguish the pairing p-r from the pairing p-q

[Figure: nodes p, q, r; Rpr = 8.12e-3, Rpq = 8.64e-3]

slide-42
SLIDE 42

Computing R: an eigenvalue problem

  • The equations for R describe an eigenvalue problem:

$$
R = AR, \qquad A[ij][uv] = \begin{cases} \dfrac{1}{|N(u)|\,|N(v)|} & \text{if } u \in N(i) \text{ and } v \in N(j) \\ 0 & \text{otherwise} \end{cases}, \qquad \text{size}(A) = N_1 N_2 \times N_1 N_2
$$

N1 = # nodes in Graph 1; N2 = # nodes in Graph 2

  • R is the principal eigenvector of A
  • A is about 10^8 × 10^8 when aligning yeast and fly networks
    – However, both A and R are very sparse
    – We use the power method to efficiently compute R
  • Extension to weighted edges is straightforward
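For toy graphs the matrix A can be built explicitly as a Kronecker product and handed to a plain power iteration. This is a sketch under assumptions: the two small graphs are invented, the dense `np.kron` construction is only viable at this scale (a real implementation would need sparse or implicit products), and `build_A` and `power_method` are hypothetical names.

```python
import numpy as np

def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on n nodes."""
    adj = np.zeros((n, n))
    for u, v in edges:
        adj[u, v] = adj[v, u] = 1.0
    return adj

def build_A(adj1, adj2):
    """Dense A with A[(i,j),(u,v)] = 1/(|N(u)||N(v)|) if u in N(i), v in N(j).
    The pair (i, j) is flattened to index i*N2 + j. Assumes no isolated nodes."""
    p1 = adj1 / adj1.sum(axis=0, keepdims=True)  # column u divided by |N(u)|
    p2 = adj2 / adj2.sum(axis=0, keepdims=True)
    return np.kron(p1, p2)

def power_method(A, tol=1e-12, max_iter=10_000):
    """Principal eigenvector of A, normalized to sum to 1."""
    r = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(max_iter):
        nxt = A @ r
        nxt /= nxt.sum()
        if np.abs(nxt - r).sum() < tol:
            return nxt
        r = nxt
    return r

# Toy graphs: a triangle with one pendant node, and a plain triangle.
g1 = adjacency(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
g2 = adjacency(3, [(0, 1), (1, 2), (0, 2)])
A = build_A(g1, g2)
R = power_method(A).reshape(4, 3)
print(R.round(4))
```

Both toy graphs contain a triangle on purpose: when both inputs are bipartite, the plain iteration can oscillate instead of converging, so this unmodified loop should not be read as a general-purpose solver.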

slide-43
SLIDE 43

A Random Walk Interpretation

$$
R_{ij} = \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|}\, R_{uv}
$$

Tensor Product: G1 x G2

[Figure: graphs G1 and G2 and their tensor product, whose nodes are pairs such as (i,j), (u,v), (r,s), (i,v), (u,j); the transitions between (i,j) and (u,v) are labeled 1/(|N(u)||N(v)|) and 1/(|N(i)||N(j)|)]
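The random-walk reading can be checked directly from the update rule. The short derivation below is a sketch that assumes undirected graphs with no isolated nodes, so that u ∈ N(i) exactly when i ∈ N(u). The weights leaving any product node (u,v) sum to one, so they form the transition probabilities of a walk on G1 x G2:

$$
\sum_{i \in N(u)} \sum_{j \in N(v)} \frac{1}{|N(u)|\,|N(v)|} = \frac{|N(u)|\,|N(v)|}{|N(u)|\,|N(v)|} = 1
$$

A vector proportional to the product of degrees is left unchanged by the update, which is what a stationary distribution of this walk looks like:

$$
\sum_{u \in N(i)} \sum_{v \in N(j)} \frac{|N(u)|\,|N(v)|}{|N(u)|\,|N(v)|} = |N(i)|\,|N(j)|
$$

With topology alone, then, the score of a pair is driven by node degrees, which is why sequence information is needed to separate nodes of equal degree.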

slide-44
SLIDE 44

General Case: 0 ≤ α ≤ 1

  • Let Bij = sequence similarity score between i (from graph #1) and j (from graph #2)
  • Eij = Bij / |B|1

$$
R = \alpha A R + (1 - \alpha) E, \qquad 0 \le \alpha \le 1
$$

For α = 1 this reduces to R = AR.
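A minimal sketch of the general iteration follows, applying A implicitly through two matrix products rather than materializing it. The toy graphs and the similarity matrix B are invented; the default α = 0.6 is the value the deck reports for the yeast-fly alignment; the function name `isorank` and the stopping rule are assumptions, not the published implementation.

```python
import numpy as np

def isorank(adj1, adj2, B, alpha=0.6, tol=1e-10, max_iter=1000):
    """Iterate R <- alpha * A R + (1 - alpha) * E until the L1 change is small.
    adj1, adj2: symmetric 0/1 adjacency matrices with no isolated nodes.
    B: non-negative sequence-similarity scores, shape (N1, N2), not all zero."""
    deg1 = adj1.sum(axis=1)
    deg2 = adj2.sum(axis=1)
    E = B / B.sum()                      # E_ij = B_ij / |B|_1
    R = E.copy()
    for _ in range(max_iter):
        # (A R)_ij = sum_{u in N(i)} sum_{v in N(j)} R_uv / (|N(u)| |N(v)|)
        AR = adj1 @ (R / np.outer(deg1, deg2)) @ adj2
        nxt = alpha * AR + (1.0 - alpha) * E
        if np.abs(nxt - R).sum() < tol:
            return nxt
        R = nxt
    return R

# Toy inputs: a triangle with a pendant node, a triangle, and a similarity
# matrix that strongly favors pairing node 3 of graph 1 with node 0 of graph 2.
g1 = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
g2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
B = np.ones((4, 3))
B[3, 0] = 5.0

R = isorank(g1, g2, B)
print(R.round(4))
```

Because E and the starting R both sum to one and the update preserves that total, no renormalization is needed inside the loop in this sketch.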

slide-45
SLIDE 45

Complex Case: Multiple Networks

[Figure: three networks #1, #2, #3 with pairwise score matrices 1R2, 1R3, and 2R3]

slide-46
SLIDE 46

Results: Yeast-Fly Global Alignment

  • # of edges in the common subgraph: 1420
  • Implies about 5% overlap! Why so low?
  • PPI data currently is noisy and low-coverage
  • # of edges in the largest component: 35
  • The value of α used: 0.6
  • Provided best overall agreement with previous gene correspondence predictions
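Counting the edges of the common subgraph induced by a node mapping can be sketched as below. The edge lists and mapping are toy data, `conserved_edges` is a hypothetical helper, and dividing by the edge count of the smaller network is one possible normalization chosen here as an assumption; the deck does not state how its overlap percentage was computed.

```python
def conserved_edges(edges1, edges2, mapping):
    """Edges (u, v) of network 1 whose images (mapping[u], mapping[v])
    form an edge of network 2. Edges are undirected."""
    set2 = {frozenset(e) for e in edges2}
    common = []
    for u, v in edges1:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in set2:
                common.append((u, v))
    return common

# Toy networks and a one-to-one mapping between some of their nodes.
edges1 = [("a1", "a2"), ("a2", "a3"), ("a3", "a4")]
edges2 = [("b1", "b2"), ("b3", "b2"), ("b4", "b5")]
mapping = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b5"}

common = conserved_edges(edges1, edges2, mapping)
overlap = len(common) / min(len(edges1), len(edges2))
print(len(common), overlap)
```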

slide-47
SLIDE 47

Various Topologies Are Found

Existing local alignment methods (PathBlast; Kelley et al.) often find only specific topologies
slide-48
SLIDE 48

Role of α: why the dip?

slide-49
SLIDE 49

Robustness to Error in PPI data

[Figure: two copies of a network with nodes a1-a11]

slide-50
SLIDE 50

Robustness to Error in PPI data

[Figure annotation: true curve somewhere around here]

slide-51
SLIDE 51

Functional Orthologs

  • Genes that perform similar functions
    – “functional orthologs” vs “plain old orthologs”
    – distinguish between orthologs and paralogs
  • Bandyopadhyay et al. [Genome Res. ’06]
    – Use local network alignment results
    – Then use an MRF to partially resolve ambiguities
  • We compared our results with theirs
slide-52
SLIDE 52

Functional Orthologs: IsoRank Pairwise Alignment Predictions

              Functional Ortholog
Protein       IsoRank      Bandyopadhyay et al.
Gid8          CG6617       CG6617 76%, CG18467 •
Gpa1          Goα47a       Goα47a 41%, Giα65a •
Kap104        Trn          Trn 41%, CG8219 47%
CG18617       Vph1         Vph1 43%, Stv1 48%
Egd1          Bic          Bcd 47%

slide-53
SLIDE 53

Results: Multiple Network Alignment

  • Size of networks
    – human (36387 PPIs), yeast (31899 PPIs), fly (25831 PPIs), worm (4573 PPIs) and mouse (255 PPIs)
  • # of edges in the common subgraph with IsoRank
    – 1663 PPIs aligned in at least 2 species
    – 157 PPIs aligned in at least 3 species
  • Comparison with NCBI’s Homologene
    – 509 PPIs in at least 2 species
    – 40 PPIs in at least 3 species
  • INPARANOID (http://inparanoid.sbc.su.se/cgi-bin/index.cgi)
    – ≤ 1172 PPIs in at least 2 species

slide-54
SLIDE 54

Multiple Network Alignment: Functional Orthologs

  • Coverage of known genes
    – Out of 86,932 proteins in five species
    – IsoRank: 59,539 have at least one mapping
    – INPARANOID: ≤ 55,000; 66% overlap with ours
    – Homologene: 33,434
  • Functional coherence
    – 0.220 IsoRank
    – 0.206 INPARANOID
    – 0.223 Homologene

slide-55
SLIDE 55

Theoretical Considerations

  • Limitation: K-regular graphs, i.e., what if all the nodes have the same degree?
  • Convergence guarantee: number of iterations of the power method scales as log(1/α)

slide-56
SLIDE 56

Biological Applications of IsoRank

  • Pairwise network alignment (Singh, Xu & Berger, Recomb ’07, SODA ’08)
  • Multiple network alignment (Singh, Xu & Berger, PSB ’08)
  • Multiple RNA secondary and tertiary structure alignment
  • Multiple protein structure alignment
slide-57
SLIDE 57

Related Work on (Pairwise) Network Alignment

  • PathBlast: Kelley et al.
    – Use sequence similarity to shortlist possible pairs of matching nodes
    – Search for conserved topologies like pathways and hub-and-spokes
  • Koyuturk et al.
    – Like PathBlast, but with a more sophisticated objective function that models gene deletion etc.
  • Graemlin: Batzoglou et al.
    – First uses sequence data to generate matching pairs of “seed” subgraphs and then heuristically grows the seed matches in search of a specific topology

slide-58
SLIDE 58

Open Issues

  • How can traditional graph-theoretic algorithms be extended to handle noise and incomplete data in biology?

slide-59
SLIDE 59

Acknowledgments

  • Sequence Genomics: Manolis Kellis, Eric Lander, Nick Patterson
  • RNA Secondary Structure: Alex Coventry, Dan Kleitman
  • Protein Structure: Jinbo Xu, Rohit Singh
  • PPI Network Alignment: Rohit Singh, Jinbo Xu

Thanks also to:

  • Michael Baym
  • Gopal Ramachandran
  • Leonid Chindelevitch
  • Michael Schnall-Levin
  • Chris Bakal, HMS & CSAIL
  • Lenore Cowen, Tufts