Bioinformatics: Network Analysis Evolution of Genes and Genomes

SLIDE 1

Bioinformatics: Network Analysis

Evolution of Genes and Genomes

COMP 572 (BIOS 572 / BIOE 564) - Fall 2013 Luay Nakhleh, Rice University

SLIDE 2

The “Traditional” Phylogeny Reconstruction Problem

U V W X Y

AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT

SLIDE 4

The Evolution of Genes Within the Branches of a Species Tree

[Source: W.P. Maddison, Syst. Biol. 46(3):523-536,1997.]

SLIDE 5

So, What Tree is Being Reconstructed?

[Figure labels: Species tree; Gene tree]

SLIDE 6

The Pre-Genomic Era

[Figure: Locus i sampled from genomes A, B, C, D, E; species phylogeny on A, B, C, D, E.]

SLIDE 10

The Pre-Genomic Era

[Figure: Locus i sampled from genomes A, B, C, D, E; the gene tree for Locus i; the species phylogeny on A, B, C, D, E.]

The “traditional” phylogeny reconstruction problem

SLIDE 14

The Post-genomic Era

[Figure: Locus 1 through Locus 6 sampled from genomes A, B, C, D, E; species phylogeny on A, B, C, D, E, marked with "?".]

SLIDE 16

The Post-genomic Era:

  • I. Gene Tree Incongruence

[Figure: Locus 1 through Locus 6 sampled from genomes A, B, C, D, E; one gene tree per locus (six gene trees on A, B, C, D, E).]

SLIDE 18

The Post-genomic Era:

  • II. Genome Rearrangements

[Figure: the genomic context of Locus 1 through Locus 6 in genomes A, B, C, D, E.]

SLIDE 19

The Post-genomic Era: Incongruence and Rearrangements

  • Gene tree incongruence and genome rearrangements pose challenges and opportunities:
      • Challenges: how to model the events, how to infer the events, how to infer species phylogeny while accounting for these events, ...
      • Opportunities: resolve very shallow and very deep evolutionary relationships, inform about gene function, understand genomic structural variations and their role in disease (e.g., cancer), ...

SLIDE 20

Outline of the Rest of this Tutorial

  • Gene tree incongruence
      • Biological causes
      • General mathematical frameworks
  • Genome rearrangement
      • Rearrangement events
      • General mathematical frameworks

SLIDE 21

Gene Tree Incongruence

SLIDE 22

Three Main Biological Events

Lineage sorting

[Source: W.P. Maddison, Syst. Biol. 46(3):523-536,1997.]

SLIDE 23

Horizontal (or, Lateral) Gene Transfer (HGT/LGT)

[Source: W.P. Maddison, Syst. Biol. 46(3):523-536,1997.]

SLIDE 24

Detecting HGT

[Source: http://topicpages.ploscompbiol.org/wiki/Detection_of_horizontal_gene_transfer]

SLIDE 25

Detecting HGT

  • The explicit phylogeny-based approach for detecting HGT mostly seeks the minimum number of tree transformation operations (often, the “subtree prune and regraft” operation) that reconciles a gene tree with a species tree.
  • This number is taken as a lower bound on the number of HGT events required to explain the evolutionary history of the gene under study.
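
To make the subtree prune and regraft (SPR) idea concrete, here is a minimal brute-force sketch for very small rooted binary trees. It is not taken from the slides: the tree encoding (nested frozensets built from nested tuples), the function names, and the search bound `max_moves` are assumptions chosen for illustration, and exhaustive search like this is only feasible for a handful of leaves.

```python
def tree(t):
    """Convert nested tuples of leaf names into nested frozensets (unordered children)."""
    return t if isinstance(t, str) else frozenset(tree(c) for c in t)

def nodes(t):
    """Yield every subtree of t, including t itself."""
    yield t
    if not isinstance(t, str):
        for child in t:
            yield from nodes(child)

def prune(t, s):
    """Remove subtree s from t and suppress the degree-two node left behind."""
    if isinstance(t, str):
        return t
    if s in t:
        (sibling,) = t - {s}  # binary tree: exactly one sibling remains
        return sibling
    return frozenset(prune(child, s) for child in t)

def graft(t, x, s):
    """Attach subtree s on the branch directly above node x of t."""
    if t == x:
        return frozenset({t, s})
    if isinstance(t, str):
        return t
    return frozenset(graft(child, x, s) for child in t)

def spr_neighbors(t):
    """Yield every tree reachable from t by one rooted SPR move."""
    for s in nodes(t):
        if s == t:
            continue  # the whole tree cannot be pruned
        rest = prune(t, s)
        for x in nodes(rest):
            yield graft(rest, x, s)

def spr_distance(a, b, max_moves=3):
    """Breadth-first search for the fewest SPR moves turning a into b (None if above max_moves)."""
    frontier, seen = {a}, {a}
    for moves in range(max_moves + 1):
        if b in frontier:
            return moves
        frontier = {n for t in frontier for n in spr_neighbors(t)} - seen
        seen |= frontier
    return None

species = tree(((("A", "B"), "C"), "D"))
gene = tree(((("A", "C"), "B"), "D"))
print(spr_distance(species, gene))  # number of SPR moves separating the two trees
```

In this toy example a single move (pruning B and regrafting it above the pair A, C) is enough, so under the slide's interpretation at least one HGT event would be needed to explain the gene tree.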

SLIDE 26

Gene Duplication and Loss

[Source: W.P. Maddison, Syst. Biol. 46(3):523-536,1997.]

SLIDE 27

Gene Duplication and Loss

[Source: Understanding Bioinformatics]

SLIDE 29

Gene Duplication and Loss

[Figure: a species tree and a gene tree are reconciled ("Reconcile") to give a reconciled gene tree.]

[Source: Understanding Bioinformatics]

SLIDE 30

Gene Duplication and Loss

  • The parsimony approach to the reconciliation problem seeks the minimum number of duplications and losses (or a weighted sum thereof) to explain the incongruence between the gene tree and species tree.
      • Beginning with Goodman et al., 1979
  • Probabilistic models of gene duplication/loss are now emerging, allowing for probabilistic reconciliations.

[Paper: Lars Arvestad and Jens Lagergren (Royal Institute of Technology and Stockholm Bioinformatics Center, Stockholm, Sweden) and Bengt Sennblad (Stockholm University and Stockholm Bioinformatics Center, Stockholm, Sweden), “The Gene Evolution Model and Computing Its Associated Probabilities.”]
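
As a rough illustration of parsimony reconciliation, the sketch below maps each gene-tree node to the smallest species-tree clade containing its species (an LCA mapping) and flags a node as a duplication when it maps to the same clade as one of its children. This is a common textbook formulation, not code from the slides; the nested-tuple encoding and all function names are assumptions, gene-tree leaves are labeled by the species they come from, and losses are not counted here.

```python
def leaf_set(tree):
    """Set of leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clusters(tree):
    """Leaf sets of all nodes (clades) of a nested-tuple tree."""
    out = [leaf_set(tree)]
    if not isinstance(tree, str):
        for child in tree:
            out.extend(clusters(child))
    return out

def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes inferred as duplications under the LCA mapping."""
    species_clusters = clusters(species_tree)

    def lca(species):
        # Smallest species-tree clade that contains all the given species.
        return min((c for c in species_clusters if species <= c), key=len)

    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([node])  # a gene copy maps to its own species
        child_maps = [visit(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        if here in child_maps:  # same clade as a child: duplication
            duplications += 1
        return here

    visit(gene_tree)
    return duplications

species_tree = (("A", "B"), "C")
gene_tree = (("A", "B"), ("A", "C"))  # species A carries two gene copies
print(count_duplications(gene_tree, species_tree))
```

Counting losses as well, or weighting the two event types differently, would require walking the species-tree branches between mapped nodes, which is omitted to keep the sketch short.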

SLIDE 31

Incomplete Lineage Sorting (ILS)

SLIDE 35

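Incomplete Lineage Sorting (ILS)

[Figure labels: T1, T2, MRCA(C,G), MRCA(H,C,G)]

$P[((H, C), G)] = 1 - \frac{2}{3} e^{-(T_2 - T_1)/N}$

$P[((H, G), C)] = \frac{1}{3} e^{-(T_2 - T_1)/N}$

$P[(H, (C, G))] = \frac{1}{3} e^{-(T_2 - T_1)/N}$

[Plot: probability of each gene tree topology, (HC)G, (HG)C, and H(CG), as a function of (T2 − T1)/N; the x-axis runs from 0.5 to 3 and the y-axis from 0.2 to 1. The same curves carry the generic labels (AB)C, (AC)B, and A(BC).]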
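
The three probabilities on this slide are easy to evaluate numerically. The sketch below is only an illustration: the function name and the dictionary keys are assumptions, and the single argument `t` stands for the quantity (T2 − T1)/N that appears in the slide's formulas.

```python
import math

def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for H, C, G, with t = (T2 - T1)/N."""
    if t < 0:
        raise ValueError("t must be non-negative")
    discordant = math.exp(-t) / 3.0  # each of the two discordant topologies
    return {
        "((H,C),G)": 1.0 - 2.0 * discordant,  # topology matching the species tree
        "((H,G),C)": discordant,
        "(H,(C,G))": discordant,
    }

for t in (0.0, 0.5, 1.0, 3.0):
    print(t, three_taxon_gene_tree_probs(t))
```

At t = 0 all three topologies are equally likely (1/3 each), and as t grows the probability of the topology matching the species tree approaches 1, which is the behavior the plot shows.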

SLIDE 36

Incomplete Lineage Sorting (ILS)

  • A gene tree can be reconciled with a species tree under ILS using
      • a parsimony approach, which seeks to minimize the amount of “deep coalescence” of the gene tree within the branches of the species tree, and
      • a probabilistic approach, which seeks to maximize the probability of observing the gene tree given the species tree, using the coalescent framework.
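
One common way to quantify deep coalescence is to count "extra lineages": for each species-tree branch, the number of gene lineages leaving the branch minus one. The sketch below assumes one sampled allele per species, so that the number of lineages leaving the branch above a species clade equals the number of maximal gene-tree clades contained in that clade. The nested-tuple encoding and the function names are assumptions made for illustration.

```python
def leaf_set(tree):
    """Set of leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clusters(tree):
    """Leaf sets of all nodes (clades) of a nested-tuple tree."""
    out = [leaf_set(tree)]
    if not isinstance(tree, str):
        for child in tree:
            out.extend(clusters(child))
    return out

def lineages_leaving(gene_tree, species_cluster):
    """Number of maximal gene-tree clades contained in the species clade."""
    if leaf_set(gene_tree) <= species_cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_leaving(child, species_cluster) for child in gene_tree)

def extra_lineages(gene_tree, species_tree):
    """Total extra lineages over all species-tree branches (one allele per species)."""
    return sum(lineages_leaving(gene_tree, c) - 1 for c in clusters(species_tree))

species_tree = (("H", "C"), "G")
print(extra_lineages((("H", "C"), "G"), species_tree))  # concordant gene tree
print(extra_lineages((("H", "G"), "C"), species_tree))  # discordant gene tree
```

A gene tree that matches the species tree needs no extra lineages, while the discordant tree forces the H and C lineages to remain separate in the branch above their common ancestor, adding one extra lineage.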

SLIDE 37

Incomplete Lineage Sorting (ILS)

  • The inference problem seeks a species tree from a collection of gene trees (or sequence alignments).
  • Many approaches have been proposed: parsimony, likelihood, Bayesian, distance-based, and summary statistics.
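
As a toy instance of a summary-statistic approach, the sketch below takes rooted three-taxon gene trees and returns the most frequent topology as the species tree estimate. This is an illustrative assumption, not a method named in the slides; it is motivated by the three-taxon ILS formulas, where the topology matching the species tree is never less probable than either alternative, and it should not be assumed to carry over to larger trees. The input encoding and function names are hypothetical.

```python
from collections import Counter

def cherry(tree):
    """Return the pair of sister taxa in a rooted three-taxon tree given as nested tuples."""
    for child in tree:
        if not isinstance(child, str):
            return frozenset(child)
    raise ValueError("expected a resolved rooted three-taxon tree")

def most_frequent_topology(gene_trees):
    """Return the most frequent cherry and the counts of all observed cherries."""
    counts = Counter(cherry(t) for t in gene_trees)
    best, _ = counts.most_common(1)[0]
    return best, counts

# Hypothetical collection of ten gene trees on H, C, G.
gene_trees = (
    [(("H", "C"), "G")] * 6
    + [(("H", "G"), "C")] * 2
    + [("H", ("C", "G"))] * 2
)
best, counts = most_frequent_topology(gene_trees)
print(sorted(best), counts[best])
```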

SLIDE 38

Inferring Phylogenetic Relationships in the Post-Genomic Era: A New Paradigm

  • The increasing availability of multi-locus data is highlighting the extent of incongruence between a species tree and its “contained” gene trees, as well as among the gene trees themselves, and the need for new methods to establish phylogenetic relationships in light of this incongruence.
  • The result is the emergence of a new paradigm that simultaneously accounts for
      • mutations within a locus (base pair mutations and indels), and
      • incongruence among loci (HGT, dup/loss, and ILS).

SLIDE 39

Dup/Loss + ILS

[Paper (Method): Matthew D. Rasmussen and Manolis Kellis, “Unified modeling of gene duplication, loss, and coalescence using a locus tree.” Affiliations: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; Broad Institute, Cambridge, Massachusetts 02139, USA.]

SLIDE 40

Dup/Loss + HGT

[Paper: Mukul S. Bansal, Eric J. Alm, and Manolis Kellis, “Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss,” Bioinformatics, Vol. 28, ISMB 2012, pages i283–i291, doi:10.1093/bioinformatics/bts225. Affiliations: Computer Science and Artificial Intelligence Laboratory and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.]

SLIDE 41

ILS + Hybridization

[Paper: Yun Yu, James H. Degnan, and Luay Nakhleh, “The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection.” Affiliations: Department of Computer Science, Rice University, Houston, Texas, United States of America; Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand; National Institute of Mathematical and Biological Synthesis, Knoxville, Tennessee, United States of America.]

SLIDE 42

Dup/Loss + HGT + ILS

[Paper: Maureen Stolzer, Han Lai, Minli Xu, Deepa Sathaye, Benjamin Vernot, and Dannie Durand, “Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees,” Bioinformatics, Vol. 28, ECCB 2012, pages i409–i415, doi:10.1093/bioinformatics/bts386. Affiliations: Department of Biological Sciences, Lane Center for Computational Biology, and Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.]

SLIDE 43

Keep In Mind...

  • In practice, gene trees are estimated from sequence data.
  • Gene tree estimates may be inaccurate.
  • These inaccuracies in the gene tree estimates give rise to incongruence similar to that caused by true evolutionary events.
  • It is important to recognize this and account for errors in the gene tree estimates before or during the species phylogeny inference process.

SLIDE 44

Genome Rearrangements

SLIDE 45

Genome Rearrangements

  • In addition to HGT and dup/loss, other “large” mutational events act on the genome:
      • transpositions
      • translocations
      • inversions
      • fusions
      • ...

SLIDE 46

Genome Rearrangements

[Source: Bourque et al., Genome Research, 12(1):26-36,2002.]

SLIDE 47

Genome Rearrangements

Rearrangements in the MCF-7 breast cancer cell line

[Source: Hampton et al., Genome Research, 19(2):167-177,2009.]

SLIDE 48

Genome Rearrangements

  • Genome representation: The molecular sequence information of genes is abstracted out, and the genome is turned into a list of signed numbers, where each element in the list corresponds to a gene, and the sign corresponds to the direction (strand) on the genome.
  • Under this representation, a genome is viewed as a single character with a very large state space (all possible permutations of the list), and the evolution of this character is sought for a set of species.
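
The signed-list representation can be sketched in a few lines. The gene names, the `"+"`/`"-"` strand encoding, and the function name below are hypothetical; the only idea taken from the slide is that each gene becomes a number and its strand becomes the sign.

```python
def to_signed_permutation(reference_order, genome):
    """Encode a genome as signed integers relative to a reference gene order.

    reference_order: gene names in the order that defines 1, 2, ..., n.
    genome: list of (gene_name, strand) pairs, with strand "+" or "-".
    """
    index = {gene: i + 1 for i, gene in enumerate(reference_order)}
    return [index[gene] if strand == "+" else -index[gene]
            for gene, strand in genome]

# Hypothetical example with four genes.
reference = ["geneA", "geneB", "geneC", "geneD"]
other = [("geneA", "+"), ("geneC", "-"), ("geneB", "-"), ("geneD", "+")]
print(to_signed_permutation(reference, other))
```

Here the second genome differs from the reference by a reversed segment covering geneB and geneC, which shows up as the negated, reordered entries in the middle of the list.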

SLIDE 49

Genome Rearrangements

G1 = (1 2 3 4 5 6 7 8)
G2 = (1 2 −5 −4 −3 6 7 8)

Breakpoints (arrows) are missing adjacencies.

[Figure: examples of an inversion, a transposition, and an inverted transposition applied to a gene order.]

[Source: Slides on Comparative Genomics, by B.M.E. Moret, PSB 2010]
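
The relationship between G1 and G2 on this slide can be reproduced with a small sketch: G2 is G1 with the segment 3 4 5 inverted, and the breakpoints are the adjacencies of G1 that are missing from G2. The function names and the choice to treat the genomes as linear, without capping the two ends, are assumptions made for illustration.

```python
def invert(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def adjacencies(genome):
    """Adjacent gene pairs of a linear signed genome, in both reading directions."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the other strand
    return adj

def breakpoints(g1, g2):
    """Number of adjacencies of g1 that are missing from g2."""
    adj2 = adjacencies(g2)
    return sum((a, b) not in adj2 for a, b in zip(g1, g1[1:]))

G1 = [1, 2, 3, 4, 5, 6, 7, 8]
G2 = invert(G1, 2, 5)       # invert the segment holding genes 3, 4, 5
print(G2)                   # [1, 2, -5, -4, -3, 6, 7, 8]
print(breakpoints(G1, G2))  # 2: the adjacencies (2, 3) and (5, 6) are lost
```

A single inversion creates at most two breakpoints, one at each end of the inverted segment, which is why adjacencies inside the segment such as (3, 4) still count as conserved.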

SLIDE 50

Genome Rearrangements

  • Evolutionary models for genome rearrangements include
      • parsimony: minimize the number of “allowed” rearrangement events that transform one genome into another
      • “weighted”:
      • The Nadeau-Taylor and Generalized Nadeau-Taylor models
      • The double-cut-and-join (DCJ) model

SLIDE 51

Genome Rearrangements

  • Computational problems include:
      • Computing the distance between two genomes
      • Ancestral reconstruction of genomes
      • The “median problem” (given three genomes, find a genome that minimizes the sum of distances to all three genomes)
      • Multiple alignment of genomes
      • Phylogeny reconstruction from genomes
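
For intuition about the median problem, the sketch below solves a tiny instance by exhaustive search. Everything specific here is an assumption made for illustration: the breakpoint count is used as the distance (the slides do not fix one), genomes are linear signed permutations on the same genes, and the three input genomes are made up. The search space grows as n! · 2^n, so this is only workable for a few genes.

```python
from itertools import permutations, product

def adjacencies(genome):
    """Adjacent gene pairs of a linear signed genome, in both reading directions."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add((a, b))
        adj.add((-b, -a))
    return adj

def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that are missing from g2."""
    adj2 = adjacencies(g2)
    return sum((a, b) not in adj2 for a, b in zip(g1, g1[1:]))

def brute_force_median(genomes):
    """Try every signed permutation and keep one minimizing the summed distance."""
    n = len(genomes[0])
    best, best_score = None, None
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            candidate = [s * g for s, g in zip(signs, order)]
            score = sum(breakpoint_distance(candidate, g) for g in genomes)
            if best_score is None or score < best_score:
                best, best_score = candidate, score
    return best, best_score

# Three hypothetical genomes on four genes.
genomes = [[1, 2, 3, 4], [1, -3, -2, 4], [1, 2, 4, 3]]
median, score = brute_force_median(genomes)
print(median, score)
```

The returned genome is one optimal median among possibly several; ties are broken by enumeration order.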

SLIDE 52

Challenges

  • Currently, our best understanding, from a modeling perspective, is in the area of incomplete lineage sorting (thanks to the coalescent model). There is a need for similar models for the other events.
  • The holy grail: a model that captures all events simultaneously!
  • Ultimately, we are interested in inference. Most methods currently employ the parsimony principle. There is a need for probabilistic inference.
  • Scalability: Current methods, even if accurate, are too slow to analyze large data sets. There is a need for efficient algorithms and high-performance computing techniques.

SLIDE 53

Summary

  • Phylogenomic analyses often involve dealing with incongruent gene trees and/or genome rearrangements.
  • It is important to infer the evolutionary mechanism, or mechanisms, that gave rise to the incongruence.
  • Inferring species phylogenies (trees or networks) in the post-genomic era requires accounting for “evolution within a gene” (e.g., nucleotide evolution) and “evolution within and across the branches of the species phylogeny” (i.e., gene tree incongruence).
  • The evolution of networks should be tied to the evolution of genes/genomes!
