SLIDE 1

STRINGS EVOLUTIONARY MODELS INFERRING PHYLOGENIES

EVOLUTIONARY DISTANCES

FROM STRINGS TO TREES

Luca Bortolussi

Dipartimento di Matematica ed Informatica, Università degli studi di Trieste

luca@dmi.units.it

Trieste, 14th November 2007

SLIDE 2

OUTLINE

1 STRINGS: DISTANCES AND EVOLUTION

2 EVOLUTIONARY MODELS
    Examples

3 INFERRING PHYLOGENIES

SLIDE 4

DIGITAL MOLECULES

DNA

DNA can be considered as a very long string over an alphabet of 4 bases (A, C, G, T). This encyclopedia stores genetic information in volumes (chromosomes), with interesting chapters (genes), reading instructions (regulatory elements) and less interesting material (junk DNA).

SLIDE 5

GENES ENCODE PROTEINS

The same gene can be present in different organisms, but with variations: the same chapter can be written in French, in Italian, in English, ...

HOW CAN WE MEASURE THE DISTANCE BETWEEN TWO GENES?

Genes are strings of DNA: we can count the positions at which they differ (Hamming distance).

  A C C T G T T A G C
  A A C T G G T A C C

Since insertions and deletions shift positions, we should actually use edit distance and construct an alignment between the strings.
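The two distances mentioned on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the edit distance is the plain Levenshtein variant with unit costs, and the 20 bases on the slide are assumed to be two sequences of 10 bases each.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


# The two sequences from the slide (assumed split: 10 bases each).
s1 = "ACCTGTTAGC"
s2 = "AACTGGTACC"
print(hamming(s1, s2), edit_distance(s1, s2))

# A shifted copy: every position differs, yet two edits are enough.
print(hamming("ACGT", "CGTA"), edit_distance("ACGT", "CGTA"))
```

The second pair shows why an alignment matters: a single shift makes the Hamming distance maximal while the edit distance stays small.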

SLIDE 7

HOW DOES EVOLUTION ACT ON DNA?

EVOLUTIONARY EVENTS

Evolution can modify DNA in several ways:
  • Local pointwise mutations can substitute, delete or insert a base somewhere.
  • Entire DNA fragments can be deleted or duplicated, possibly reversed in their order.
  • Bigger pieces of DNA can be swapped or inverted.
  • Entire genomes can be duplicated.

MUTATIONS HAPPEN RANDOMLY!

OUR FOCUS

For simplicity, we focus only on pointwise substitution events.

SLIDE 8

OUR SCENARIO

The scenario is the following: consider two species (human and chimp) that evolved from a common ancestor (some old primate). As the ancestor evolved into human or chimp, its DNA mutated pointwise at some positions, chosen randomly.

evolutionary distance = number of mutations

SLIDE 9

DOES HAMMING DISTANCE COUNT THE NUMBER OF MUTATIONS?

Consider the following situations:

  A → C;   A → G → C;   A → C → G → C

The same observation (A in one sequence, C in the other) can correspond to different evolutionary histories. Hamming distance ignores multiple substitutions at a site.

Moreover:

  A → G → A;   A → C → G → A

Hamming distance ignores back-mutations! It underestimates the number of mutations.

CORRECTING DISTANCES

The strategy is to develop a stochastic model of DNA evolution, and use it to correct the observed distance to account for multiple substitutions at a site.
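A small simulation makes the underestimate visible. The sketch below is an assumption-laden toy, not the model developed in the following slides: it applies a fixed number of substitutions at uniformly random positions of a random sequence (the length, the number of events and the seed are arbitrary choices) and compares that number with the Hamming distance to the ancestor.

```python
import random

BASES = "ACGT"


def mutate(seq: str, n_events: int, rng: random.Random) -> str:
    """Apply n_events pointwise substitutions at uniformly random positions."""
    seq = list(seq)
    for _ in range(n_events):
        pos = rng.randrange(len(seq))
        # a substitution always changes the base currently at that site
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


rng = random.Random(0)                      # fixed seed, arbitrary choice
ancestor = "".join(rng.choice(BASES) for _ in range(200))
n_events = 150
descendant = mutate(ancestor, n_events, rng)

# Hamming distance between ancestor and descendant
observed = sum(x != y for x, y in zip(ancestor, descendant))
print("substitution events:", n_events)
print("observed differences:", observed)
```

Every differing site was hit at least once, so the observed count can never exceed the number of events; sites hit several times, or mutated back, account for the gap.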

SLIDE 11

A SIMPLE MODEL OF NUCLEOTIDE EVOLUTION

HYPOTHESES
  • Time evolves continuously.
  • Each site undergoes substitutions independently of the others.
  • The rate of substitutions (expected frequency per unit of time) does not change in time (homogeneity).
  • The rate of change from base i to base j does not depend on the mutation history of the site (memoryless property).

CONSEQUENCES
  • The waiting time until a single mutation event is modeled by an exponential distribution.
  • The number of mutations is modeled by a Poisson process.
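The two consequences can be written out explicitly. As a sketch, assume a single site accumulates substitutions at a constant rate, written here as λ (a symbol introduced only for this illustration); T is the waiting time until the next event and N(t) the number of events in an interval of length t.

```latex
\begin{align*}
  \Pr(T > t) &= e^{-\lambda t}, &
  \mathbb{E}[T] &= \frac{1}{\lambda}, \\
  \Pr\bigl(N(t) = k\bigr) &= e^{-\lambda t}\,\frac{(\lambda t)^{k}}{k!}, &
  \mathbb{E}[N(t)] &= \lambda t .
\end{align*}
```

The expected number of events grows linearly in t, which is the form the expected number of changes takes in the rest of the deck.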

SLIDE 12

MARKOV PROCESSES

If we consider all possible mutations (from A to C, G, T and so on), we end up with a matrix of rates and with a time-homogeneous continuous-time Markov chain.

FURTHER SIMPLIFYING HYPOTHESES
  • Frequencies are in equilibrium: $\pi_A, \pi_C, \pi_G, \pi_T$ (stationary chain).
  • The process is time reversible: $\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$.

RATE MATRIX

Under the previous hypotheses, the rate matrix Q decomposes as $q_{ij} = R_{ij}\pi_j$ for $i \neq j$, where
  • R is a symmetric matrix;
  • $\pi$ are the stationary frequencies (solution of $\pi Q = 0$, with $\pi$ a row vector).

For nucleotide substitution models, we have 6+3 parameters to set.
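The decomposition of the rate matrix can be checked numerically. The sketch below builds Q from a symmetric R and frequencies π, sets each diagonal entry so that its row sums to zero, and verifies stationarity and reversibility. The function name and all parameter values are hypothetical, chosen only for illustration.

```python
def rate_matrix(R, pi):
    """Build Q with q_ij = R_ij * pi_j for i != j; diagonal makes rows sum to 0."""
    n = len(pi)
    Q = [[R[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        Q[i][i] = -sum(Q[i])
    return Q


# Illustrative values only (order A, C, G, T); not taken from the slides.
pi = [0.1, 0.2, 0.3, 0.4]
R = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]          # symmetric: 6 free off-diagonal entries

Q = rate_matrix(R, pi)

# Stationarity: sum_i pi_i * q_ij = 0 for every column j.
stationary = [sum(pi[i] * Q[i][j] for i in range(4)) for j in range(4)]

# Reversibility at the level of rates: pi_i * q_ij = pi_j * q_ji.
balanced = all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12
               for i in range(4) for j in range(4))

print(stationary, balanced)
```

The symmetry of R is what makes both checks pass: the flow from i to j, $\pi_i R_{ij} \pi_j$, equals the flow from j to i.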

SLIDE 14

EXPECTATIONS

TOTAL RATE OF CHANGE

$\mu = -\sum_i q_{ii}\pi_i$

EXPECTED NUMBER OF CHANGES AFTER TIME t

$d = \mu t$

PROBABILITY OF OBSERVING A SUBSTITUTION AFTER TIME t

$p = 1 - \sum_i \pi_i P_{ii}(t)$

p is also the expected number of observed substitutions per site.
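Both quantities can be evaluated numerically for any rate matrix once $P(t) = e^{Qt}$ is available. The sketch below uses a hand-written matrix exponential (truncated Taylor series plus repeated squaring) so that it needs no external library. All function names are hypothetical. The example matrix has every off-diagonal rate equal to 1/4 and uniform frequencies, for which $p = \frac{3}{4}(1 - e^{-t})$ serves as a check.

```python
import math


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(M, terms=30, squarings=10):
    """exp(M): Taylor series of M / 2**squarings, then square the result back."""
    n = len(M)
    scale = 2.0 ** squarings
    A = [[x / scale for x in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, A)]  # A**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def total_rate(Q, pi):
    """mu = -sum_i q_ii * pi_i"""
    return -sum(pi[i] * Q[i][i] for i in range(len(pi)))


def prob_difference(Q, pi, t):
    """p = 1 - sum_i pi_i * P_ii(t), with P(t) = exp(Q t)"""
    P = mat_exp([[q * t for q in row] for row in Q])
    return 1.0 - sum(pi[i] * P[i][i] for i in range(len(pi)))


# Example: all off-diagonal rates 1/4, uniform frequencies.
Q = [[-0.75 if i == j else 0.25 for j in range(4)] for i in range(4)]
pi = [0.25] * 4

mu = total_rate(Q, pi)
p = prob_difference(Q, pi, 1.0)
print(mu, p, 0.75 * (1.0 - math.exp(-1.0)))
```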

SLIDE 15

CORRECTING HAMMING DISTANCE

1. Estimate p as $\hat{p} = \frac{\text{Hamming distance}}{\text{total length}}$.

2. From $d = \mu t$ and $p = 1 - \sum_i \pi_i P_{ii}(t)$ deduce $p = 1 - \sum_i \pi_i P_{ii}(d/\mu)$.

3. Solve the previous formula for d and use the estimate $\hat{p}$ of p to compute the estimate $\hat{d}$.
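When no closed form is at hand, step 3 can be carried out numerically. The sketch below inverts p(d) by bisection, assuming p(d) is increasing in d. The solver name and its parameters are hypothetical. The example curve is the one obtained when all off-diagonal rates equal 1/4 and frequencies are uniform, $p(d) = \frac{3}{4}(1 - e^{-4d/3})$, and the counts (3 differences over 10 sites) are illustrative.

```python
import math


def solve_distance(p_of_d, p_hat, d_max=10.0, tol=1e-10):
    """Find d in [0, d_max] with p_of_d(d) = p_hat (p_of_d assumed increasing)."""
    if not p_of_d(0.0) <= p_hat <= p_of_d(d_max):
        raise ValueError("p_hat is outside the range reachable by the model")
    lo, hi = 0.0, d_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if p_of_d(mid) < p_hat:
            lo = mid            # true distance lies to the right of mid
        else:
            hi = mid
    return (lo + hi) / 2


def p_uniform_rates(d):
    """p as a function of d when every off-diagonal rate is 1/4 (mu = 3/4)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# Step 1: estimate p from the data (illustrative counts).
hamming_distance, total_length = 3, 10
p_hat = hamming_distance / total_length

# Steps 2-3: invert p(d) to obtain the corrected distance.
d_hat = solve_distance(p_uniform_rates, p_hat)
print(p_hat, d_hat)
```

The corrected distance is larger than the raw proportion of differences, as expected when some substitutions are hidden.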

SLIDE 16

DIFFERENT EVOLUTIONARY MODELS

There are 6 parameters to fix the rate matrix R and 3 to fix the equilibrium frequencies π.

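SLIDE 17

THE JUKES-CANTOR MODEL

The Jukes-Cantor model was published in 1969. It is the simplest model of evolution, assuming $R_{ij} = 1$ and $\pi_i = \frac{1}{4}$.

$$
Q = \begin{pmatrix}
-\frac{3}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\
\frac{1}{4} & -\frac{3}{4} & \frac{1}{4} & \frac{1}{4} \\
\frac{1}{4} & \frac{1}{4} & -\frac{3}{4} & \frac{1}{4} \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & -\frac{3}{4}
\end{pmatrix}
$$

SOLUTION FOR P

$P(t) = \frac{1}{4} - Q e^{-t}$, where $\frac{1}{4}$ denotes the matrix with all entries equal to 1/4.

CORRECTION FOR THE DISTANCE

$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\hat{p}\right)$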
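The correction is a one-line function. The sketch below (function name hypothetical) adds a guard for the domain of the logarithm, since the formula is only defined for $\hat{p} < \frac{3}{4}$, and prints the corrected distance for a few illustrative values of $\hat{p}$.

```python
import math


def jukes_cantor_distance(p_hat: float) -> float:
    """d = -(3/4) * ln(1 - (4/3) * p_hat), defined for 0 <= p_hat < 3/4."""
    if not 0.0 <= p_hat < 0.75:
        raise ValueError("Jukes-Cantor correction needs 0 <= p_hat < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)


# Illustrative proportions of differing sites.
for p_hat in (0.05, 0.10, 0.30, 0.50):
    print(p_hat, round(jukes_cantor_distance(p_hat), 4))
```

For small $\hat{p}$ the correction is negligible. It grows quickly as $\hat{p}$ approaches 3/4, the proportion of differences expected between two unrelated random sequences under this model.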

SLIDE 19

RECONSTRUCTING HISTORY OF LIFE

WHAT DOES “PHYLOGENETIC INFERENCE” MEAN?

All species on Earth come from a common ancestor. If we have data from a pool of species, we wish to reconstruct the history of speciation events that led to their emergence: we want to find the phylogenetic tree giving this information!

This is a hard task, because data is often incomplete (we lack information about most of the ancestor species) and noisy.

SLIDE 20

METHODS TO INFER PHYLOGENY

APPROACHES TO PHYLOGENY
  • Distance-based methods
  • Parsimony methods
  • Likelihood methods
  • Bayesian inference methods

DISTANCE-BASED METHODS

Given a matrix of pairwise distances, find the tree that best explains it. Several algorithms:
  • UPGMA (clustering methods)
  • Neighbor Joining
  • Fitch-Margoliash (sum of squares methods)

SLIDE 21

AN EXAMPLE: PRIMATES

DNA FROM PRIMATES

  Tarsius         AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT...
  Lemur           AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT...
  Homo sapiens    AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT...
  Chimp           AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT...
  Gorilla         AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT...
  Pongo           AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT...
  Hylobates       AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT...
  Macaca fuscata  AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT...

DISTANCE MATRIX

Rows and columns are in the same order.

  Tarsius         0.00  0.29  0.40  0.39  0.38  0.34  0.38  0.37
  Lemur           0.29  0.00  0.37  0.38  0.35  0.33  0.36  0.34
  Homo sapiens    0.40  0.37  0.00  0.10  0.11  0.15  0.21  0.24
  Chimp           0.39  0.38  0.10  0.00  0.12  0.17  0.21  0.24
  Gorilla         0.38  0.35  0.11  0.12  0.00  0.16  0.21  0.26
  Pongo           0.34  0.33  0.15  0.17  0.16  0.00  0.22  0.24
  Hylobates       0.38  0.36  0.21  0.21  0.21  0.22  0.00  0.26
  Macaca fuscata  0.37  0.34  0.24  0.24  0.26  0.24  0.26  0.00

SLIDE 22

LEAST SQUARES METHOD

We have our observed distance matrix $D_{ij}$ and a tree T with branch lengths predicting an additive distance matrix $d_{ij}$.

TARGET

Find the tree T minimizing the error between $d_{ij}$ and $D_{ij}$, i.e. the tree minimizing the weighted least squares sum

$S(T) = \sum_{i,j} w_{ij}(D_{ij} - d_{ij})^2$

  • Given a tree topology, the best branch lengths for S can be computed by solving a linear system.
  • A least squares algorithm needs to search the tree space for the best tree T: this is an NP-hard problem.
  • The search for the best tree can use branch and bound methods or heuristic state space explorations.
  • This method gives the best explanation of the data in the least squares sense.

SLIDE 23

FITCH-MARGOLIASH ALGORITHM

Letting $w_{ij} = \frac{1}{D_{ij}^2}$ in $S(T) = \sum_{i,j} w_{ij}(D_{ij} - d_{ij})^2$, we obtain the method of Fitch-Margoliash.

[Figure: Fitch-Margoliash tree of the eight primates, shown twice; leaves: Lemur, M. fuscata, Hylobates, Pongo, Gorilla, Pan, Homo sap., Tarsius]
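The least squares score can be sketched for the smallest non-trivial case. With three leaves there is a single topology (a star), and the linear system for the branch lengths has a closed-form solution that fits the data exactly. The code below uses the Homo sapiens, Chimp and Gorilla entries of the primate distance matrix. The function names are hypothetical, and each unordered pair is counted once in the sum, which only rescales S for symmetric matrices.

```python
def least_squares_score(D, d, weight=lambda Dij: 1.0):
    """S(T) = sum over pairs i < j of w_ij * (D_ij - d_ij)**2."""
    n = len(D)
    return sum(weight(D[i][j]) * (D[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def star_tree_branches(D):
    """Branch lengths a, b, c of the three-leaf tree with a+b=D01, a+c=D02, b+c=D12."""
    a = (D[0][1] + D[0][2] - D[1][2]) / 2
    b = (D[0][1] + D[1][2] - D[0][2]) / 2
    c = (D[0][2] + D[1][2] - D[0][1]) / 2
    return a, b, c


# Observed distances among Homo sapiens, Chimp, Gorilla (from the matrix).
D = [[0.00, 0.10, 0.11],
     [0.10, 0.00, 0.12],
     [0.11, 0.12, 0.00]]

a, b, c = star_tree_branches(D)

# Additive distances predicted by the tree: sum of branches along each path.
d = [[0.0, a + b, a + c],
     [a + b, 0.0, b + c],
     [a + c, b + c, 0.0]]

score_unit = least_squares_score(D, d)                         # w_ij = 1
score_fm = least_squares_score(D, d, lambda Dij: 1 / Dij**2)   # w_ij = 1/D_ij^2
print((a, b, c), score_unit, score_fm)
```

With four or more leaves the fit is generally not exact, the branch lengths come from a genuine least squares solve, and the score has to be compared across topologies.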

SLIDE 24

HEURISTIC METHODS: UPGMA

HIERARCHICAL CLUSTERING

Hierarchical clustering works by iteratively merging the two closest clusters (sets of elements) in the current collection of clusters. It requires a matrix of distances among singletons. Different ways of computing intercluster distances give rise to different HC-algorithms.

UPGMA

UPGMA (Unweighted Pair Group Method with Arithmetic mean) computes the distance between two clusters as

$d(A, B) = \frac{1}{|A||B|} \sum_{i \in A,\, j \in B} d_{ij}$.

When two clusters A and B are merged, their union is represented by their ancestor node in the tree. The distance between A and B is evenly split between the two branches entering A and B.
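The merging procedure can be sketched compactly. The version below is a plain, unoptimized illustration: it recomputes average distances from the original matrix at every step, and the function name, the Newick-style output and the toy matrix are all assumptions made for this example.

```python
from itertools import combinations


def upgma(D, names):
    """Return (Newick string, list of merges) built by average-linkage clustering."""
    # each cluster: (member indices, Newick text of its subtree, height of its root)
    clusters = [([i], name, 0.0) for i, name in enumerate(names)]

    def dist(A, B):
        # arithmetic mean of all leaf-to-leaf distances between the two clusters
        return sum(D[i][j] for i in A[0] for j in B[0]) / (len(A[0]) * len(B[0]))

    merges = []
    while len(clusters) > 1:
        A, B = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        height = dist(A, B) / 2        # distance split evenly between both sides
        newick = f"({A[1]}:{height - A[2]:.3f},{B[1]}:{height - B[2]:.3f})"
        merges.append((A[1], B[1], height))
        clusters = [c for c in clusters if c is not A and c is not B]
        clusters.append((A[0] + B[0], newick, height))
    return clusters[0][1] + ";", merges


# Toy ultrametric matrix (illustrative values, not from the slides).
names = ["a", "b", "c", "d"]
D = [[0, 2, 4, 6],
     [2, 0, 4, 6],
     [4, 4, 0, 6],
     [6, 6, 6, 0]]

tree, merges = upgma(D, names)
print(tree)
```

Each branch length is the height of the new ancestor minus the height of the subtree it joins, so every leaf ends up at the same distance from the root.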

SLIDE 25

UPGMA - II

HYPOTHESIS

UPGMA reconstructs the tree correctly if the input distance is an ultrametric (molecular clock).

[Figure: UPGMA tree of the eight primates; leaves: Tarsius, Lemur, Homo sap., Pan, Gorilla, Pongo, Hylobates, M. fuscata]

SLIDE 26

HEURISTIC METHODS: NEIGHBOR-JOINING

NEIGHBOR-JOINING

Neighbor-Joining works similarly to UPGMA, but it merges the two clusters minimizing $D_{ij} = d_{ij} - r_i - r_j$, where $r_i = \frac{1}{C-2}\sum_k d_{ik}$ is the average distance of i from all other nodes (C is the current number of clusters).

When i and j are merged, their new ancestor x has distance from any other node k equal to $d_{xk} = \frac{1}{2}(d_{ik} + d_{jk} - d_{ij})$.

The branch lengths are $d_{ix} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jx} = \frac{1}{2}(d_{ij} + r_j - r_i)$.

NJ reconstructs the correct tree if the input distance is additive.

[Figure: Neighbor-Joining tree of the eight primates, shown twice; leaves: Lemur, M. fuscata, Hylobates, Pongo, Gorilla, Homo sap., Pan, Tarsius]
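The three formulas on this slide translate directly into code. The sketch below is a minimal, unoptimized version: the function name, the edge-list output and the internal node labels x1, x2, ... are assumptions for this example, and ties are broken by taking the first minimal pair. The toy matrix is additive, generated from a tree with leaf branches 1, 2, 3, 4 and an internal branch of length 5.

```python
def neighbor_joining(D, names):
    """Return the edges (node, node, branch length) of the unrooted NJ tree."""
    d = {(a, b): D[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        C = len(nodes)
        # r_i = (1 / (C - 2)) * sum_k d_ik
        r = {i: sum(d[i, k] for k in nodes if k != i) / (C - 2) for i in nodes}
        # pick the pair minimizing D_ij = d_ij - r_i - r_j
        i, j = min(((a, b) for n, a in enumerate(nodes) for b in nodes[n + 1:]),
                   key=lambda p: d[p] - r[p[0]] - r[p[1]])
        counter += 1
        x = f"x{counter}"                   # hypothetical label for the ancestor
        edges.append((i, x, (d[i, j] + r[i] - r[j]) / 2))
        edges.append((j, x, (d[i, j] + r[j] - r[i]) / 2))
        for k in nodes:
            if k not in (i, j):
                d[x, k] = d[k, x] = (d[i, k] + d[j, k] - d[i, j]) / 2
        d[x, x] = 0.0
        nodes = [k for k in nodes if k not in (i, j)] + [x]
    a, b = nodes
    edges.append((a, b, d[a, b]))           # last two nodes are joined directly
    return edges


# Toy additive matrix (illustrative values, not from the slides).
names = ["A", "B", "C", "D"]
D = [[0, 3, 9, 10],
     [3, 0, 10, 11],
     [9, 10, 0, 7],
     [10, 11, 7, 0]]

edges = neighbor_joining(D, names)
for edge in edges:
    print(edge)
```

On additive input the recovered branch lengths match the generating tree. Unlike UPGMA, the two branches below an ancestor need not be equal, so no molecular clock is assumed.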

SLIDE 27

NOT ONLY DNA EVOLVES...

SLIDE 28

THE END

Thanks for your attention!