Markov models in molecular phylogeny and evolution Nicolas Galtier - - PowerPoint PPT Presentation
Markov models in molecular phylogeny and evolution Nicolas Galtier - - PowerPoint PPT Presentation
Markov models in molecular phylogeny and evolution Nicolas Galtier CNRS UMR 5554 Institut des Sciences de lEvolution Universit Montpellier 2 galtier@univ-montp2.fr Markov models in molecular phylogeny Generalities about Markov
Markov models in molecular phylogeny
- Definition:
Markov chains (= Markov processes) are mathematical objects devoted to the description/modelling of the variations in time of a system under the (very weak) hypothesis of lack of memory: the future of the system only depends on its current state, not on the pathway that was followed to reach it.
- A few examples: discrete time, discrete states: branching process
discrete time, continuous states: random walks continuous time, discrete states: Poisson process continuous time, continuous states : Brownian motion
- In molecular phylogeny, states are the 4 nucleotides / 20 amino-acids / 61 codons,
and the process is typically represented by a rate matrix in continuous time. Generalities about Markov processes
Markov models in molecular phylogeny
A C G T A C G T X α κ.α α
α X α κ.α κ.α α X α α κ.α α X
Kimura model (nucleotides) WAG model (amino-acids)
Example rate matrices
- because evolution is very generally memoryless
Why? How?
- thanks to the statistical approach in molecular phylogeny
- simulating data
- building phylogenies accounting for the evolutionary process
- inferring the processes and learn about the forces underlying molecular evolution
What for? Markov models are the fundamental tool of molecular phylogeny
- because the theory of Markov chains is well developed
Markov models in molecular phylogeny
The statistical approach in molecular phylogeny 1- modelling 2- computing expectations 3- fitting model to data Sequence evolution is represented by a Markov process running along a tree. Calculate the likelihood function, i.e. the probability of the data given the model. Maximise the likelihood over the parameter space, and thus obtain maximum likelihood estimates for parameters.
- r
Calculate the posterior probability of parameters given the data and the priors (bayesian approach).
Markov models in molecular phylogeny
A C G T A C G T β α α α α β β β β β β β
Rate matrix : M
X0 X1 X2 X3 l1 l2 l3 l4 l5 l6 l7 l8 A A C A G T T C T T A A A A A y1: y2: y3:
data : Y Tree topology T Branch lengths: li Likelihood calculation in molecular phylogeny
Markov models in molecular phylogeny
A C G T A C G T β α α α α β β β β β β β
Rate matrix : M
X0 X1 X2 X3 l1 l2 l3 l4 l5 l6 l7 l8 A A C A G T T C T T A A A A A y1: y2: y3:
data : Y Tree topology T Branch lengths: li Likelihood calculation in molecular phylogeny
Markov models in molecular phylogeny
L(li, Μ, T ) = Pr(Y | li, Μ, T ) = Π Pr(yi | li, Μ , T )
i
Pr(y1 | li, Μ, T ) = ΣΣΣΣ Pr(X0=x0).Pr(X1=x1| X0=x0). Pr(X2=x2|X1=x1).Pr(y11=A| X2=x2). Pr(y12=A| X2=x2).
Pr(y13=C| X1=x1). Pr(X3=x3| X0=x0). Pr(y14=A| X3=x3). Pr(y15=G| X3=x3) x0 x1 x2 x3 Felsenstein 1981 J Mol Evol 17:368
Calculating transition probabilities
Markov models in molecular phylogeny
P(t)=eMt
t is for time (branch length) M is the rate matrix :1/mij = average waiting time before state i changes to state j P(t) is the substitution probability matrix: pij(t) is the probability of observing state j after evolution during time t starting from state i. Deriving this formula starts by writing differential equations like: A(t+dt)=A(t)(-mAC-mAG-mAT)dt + C(t)mCAdt + G(t)mGAdt + T(t)mTAdt Calculating the exponential of a matrix is easy when diagonalisable.
Markov models in molecular phylogeny
Using the likelihood function The bayesian approach can fulfill the same purposes with more complex models, if we accept to draw prior distributions for parameters. Knowing how to calculate the likelihood, we can:
- estimate parameters by maximising (ML = Maximum Likelihood)
- test hypotheses by comparing models (LRT = Likelihood Ratio Test)
- recover details of the process using conditional likelihoods (EB = Empirical
Bayesian)
Example biological questions requiring good usage of Markov models:
- have my favourite protein evolved under positive selection? (codon models)
- have it undergone any functional change ? (covarion = heterotachous models)
- can we exhibit coevolution between sites ? (models of departure from independence)
Markov models in molecular phylogeny
- what did the ancestral sequence look like ? (empirical bayesian)
- have it undergone any compositional change ? (non-stationary models)
- which changes occurred ? In which branches ? (substitution mapping)
- when did speciations occur ? (clock-relaxed models)
1 2 3 4 5
θ θ θ θ θ θ θ θ ω stationary, homogeneous
1 2 3 4 5
ω non-stationary, non-homogeneous θ1 θ2 θ4 θ7 θ3 θ5 θ8 θ6
A non-stationary model
Galtier and Gouy 1998 Mol Biol Evol 15:871
A C G T A C G T X (1-θ)α (1-θ)κα (1-θ)α
θα X θα θκα θκα θα X θα
(1-θ)α (1-θ)κα (1-θ)α X Tamura 1992 model
θ = equilibrium GC-content
actual MP NHML 18% 10% 22% 14% 14%
low GCanc (10-25%) high eqGC (90%) medium sequence GC (~40%)
actual MP NHML 18% 32% 10% 27% 22% 40% 14% 30% 14% 28% actual MP NHML 18% 32% 19% 10% 27% 11% 22% 40% 21% 14% 30% 16% 14% 28% 15%
Accuracy of ancestral GC% estimation (simulations)
A non-stationary model
A non-stationary model
40 80 40 80
50 60 70
SSU LSU Topt Topt rRNA G+C-content Optimal growth temperature versus rRNA GC% in prokaryotes
A non-stationary model
Giardia 70.4% Entamoeba 43.7%
Desulfurococcus 64.2% Thermoproteus 63.5% M.jannashi 62.3% M.vannieli 57.7% Halococcus 58.9% Halobacterium 58.7%
Thermus 61.3% Thermotoga 60.9% Euglena 51.7% FUNGI 48.6% PLANTA 50.4% METAZOA 52.4% EUCARYA CRENARCHAE EURYARCHAE BACTERIA LOW GC GRAM+ 54.2% PROTEOBACTERIA 54.1% HIGH GC GRAM+ 57.0% CHLOROPLASTS 52.5%
56.1%
estimated ancestral GC% : The rRNA universal tree of life
40 80 40 80
50 60 70
SSU LSU
Topt Topt rRNA G+C-content
A non-stationary model
A non-hyperthermophilic ancestor?
Giardia 70.4% Entamoeba 43.7%
Desulfurococcus 64.2% Thermoproteus 63.5% M.jannashi 62.3% M.vannieli 57.7% Halococcus 58.9% Halobacterium 58.7%
Thermus 61.3% Thermotoga 60.9% Euglena 51.7% FUNGI 48.6% PLANTA 50.4% METAZOA 52.4% EUCARYA CRENARCHAE EURYARCHAE BACTERIA LOW GC GRAM+ 54.2% PROTEOBACTERIA 54.1% HIGH GC GRAM+ 57.0% CHLOROPLASTS 52.5%
56.1% 57.3%
Eukaryote 1 70.9% Eukaryote 2 70.9% Crenarchae 1 65.4% Crenarchae 2 65.1% Euryarchae 1 65.2% Euryarchae 2 65.0% Bacteria 1 63.2% Bacteria 2 62.3% A non-stationary model
Controlling for species sampling
A non-stationary model
40 80 40 80
50 60 70
SSU LSU
Topt Topt rRNA G+C-content
A non-hyperthermophilic ancestor?
Galtier et al. 1999 Science 283:221
Codon models, positive selection
T C A G T C A G
TTT → Phe TTC → Phe TTA → Leu TTG → Leu CTT → Leu CTC → Leu CTA → Leu CTG → Leu ATT → Ile ATC → Ile ATA → Ile ATG → Met GTT → Val GTC → Val GTA → Val GTG → Val TCT → Ser TCC → Ser TCA → Ser TCG → Ser CCT → Pro CCC → Pro CCA → Pro CCG → Pro ACT → Thr ACC → Thr ACA → Thr ACG → Thr GCT → Ala GCC → Ala GCA → Ala GCG → Ala TAT → Tyr TAC → Tyr TAA → Stop TAG → Stop CAT → His CAC → His CAA → Gln CAG → Gln AAT → Asn AAC → Asn AAA → Lys AAG → Lys GAT → Asp GAC → Asp GAA → Glu GAG → Glu TGT → Cys TGC → Cys TGA → Stop TGG → Trp CGT → Arg CGC → Arg CGA → Arg CGG → Arg AGT → Ser AGC → Ser AGA → Arg AGG → Arg GGT → Gly GGC → Gly GGA → Gly GGG → Gly The standard genetic code
Codon models, positive selection
The Goldman-Yang codon model if codon X and codon Y differ by more than one base β .πY if codon X and codon Y differ by one synonymous transversion β ω.πY if codon X and codon Y differ by one nonsynonymous transversion α .πY if codon X and codon Y differ by one synonymous transition α.ω.πY if codon X and codon Y differ by one non-synonymous transition mXY =
Goldman & Yang 1994 Mol Biol Evol 11:725
ω is the parameter of interest:
- ω=1 in case of neutral evolution
- ω<1 in case of negative selection (constraint)
- ω>1 in case of positive selection (adaptation)
Codon models, positive selection
Model 1 : ω0 ≠ ωC ln(L)= -1041.70 ω0 = 0.489 ; ωC = 3.383 Model 0 : ω0 = ωC ln(L)= -1043.84 ω0 = ωC = 0.574 Primate lysosyme evolution
Yang 1998 Mol Biol Evol 15:568
Codon models, positive selection
- 2. log(L1/L0) ~ χ2 (n df)
Let MO and M1 be two nested models: MO (p parameters) is a special instance of M1 (p+n parameters) Let L0 and L1 be the maximum likelihoods under MO and M1, respectively. Twice the log-likelihood ratio is asymptotically χ2 distributed (n degrees of freedom) under MO The likelihood ratio test (LRT) LRT are used to decide whether the increase in likelihood obtained by adding parameters (=degrees of freedom) to a model is significant.
Codon models, positive selection
Yang 1998 Mol Biol Evol 15:568
Model 1 : ω0 ≠ ωC ln(L)= -1041.70 ω0 = 0.489 ; ωC = 3.383 Model 0 : ω0 = ωC ln(L)= -1043.84 ω0 = ωC = 0.574 Primate lysosyme evolution 2 [ ln(L1) - ln(L0)] = 4.28 *
Codon models, positive selection
Yang 1998 Mol Biol Evol 15:568
Model 1 : ω0 ≠ ωC ln(L)= -1041.70 ω0 = 0.489 ; ωC = 3.383 Model 0 : ω0 = ωC ln(L)= -1043.84 ω0 = ωC = 0.574 Primate lysosyme evolution Model 2 : ω0 ≠ ωC, ωC =1 ln(L)= -1042.50 ω0 = 0.488 ; ωC = 1 2 [ ln(L2) - ln(L1)] = 1.6 NS
Codon models, positive selection
green: volume blue: polarity
- range: charge
brown: dN/dS
Sainudiin et al 2005 J. Mol. Evol. 60:315
Variation of ω across sites: class 1 MHC
Codon models, positive selection
MORMYRIFORMES (Afrique) GYMNOTIFORMES (Amérique du Sud)
Zakon et al 2006 PNAS 103:3675
A sodium channel in electric fish
Codon models, positive selection
A genomic approach: 13731 human/chimpanzee orthologous genes Fonction n p-val
Immunité Perception sensorielle Gametogenèse Inhibition apoptose
417 51 40 133
<10-10 <10-3 <10-2 <5%
Tissu n p-val
Testicules Cerveau Thyroïde Sang
247 66 405 133
<10-3 <5% NS NS
The main target for adaptation in apes are immunity, perception/communication, and spermatic competition / genomic conflicts.
Nielsen et al 2005 PLoS 3:170
Covarion models, functional shifts
Constant rate among sites Distribution of substitution rate across sites
Covarion models, functional shifts
Constant rate among sites Variable rates between sites Distribution of substitution rate across sites
Covarion models, functional shifts
Gamma
r
( ) ( ). ( / ). L y f r L y r dr
- =
: Gamma probability density function
f
r1 r2 r3 r4
discrete Gamma g: assumed number of classes
1
( ) Pr( ). ( / )
g i i i
L y r r L y r
=
= =
- The likelihood conditional on r is obtained by first multiplying branch lengths by r
Yang 1994 J Mol Evol 39:306
Calculating the likelihood assuming variable rates across sites
Covarion models, functional shifts
favourable mutation function 1 function 2 covarion Functional shifts and site-specific rate variations
Constant rate among sites Variable rates between sites Distribution of substitution rate across sites
Covarion models, functional shifts
Site-specific rate variation = COVARIONS
Covarion models, functional shifts
Galtier 2001 Mol Biol Evol 18:866
- a. Constant rate across sites
- c. Site-specific rate variation = covarions = heterotachy
- b. Variable rates across sites
M M.r1 M.r2 M.r3 M.r1 M.r2 M.r3
ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν ν
Covarion models, functional shifts
LR = 2 . [ln(L1) – ln(L0)] ~ χ2 (1 ddl) r < 1 r = 1 r > 1
M0 (constant rate in time)
r2 > r1 r1 > r2
M1 (variable rate in time) A likelihood-ratio test for detecting site-specific rate variation
Covarion models, functional shifts
a b c d e f g S T M F S L P S T M F S L P S T M F I F P S T M F T F P S T M F Y F M S T M F H F H S T M F H F T S T M F Y F P S T M F L F P S T M F F F F S T M F H F T S T M F Y F A S T M F P F P S T M F P F P S T M F P H L S T M F P F P S T M F L H T S T M F W V F S T M F F T P S T M F T V F S T M F L F L A A M V L F I A T M I L F I A T N A L F I A I V S L F I S V M F L F I T T V I L F I F T T L L F I S T M F W S I S T M M W S T S T M F M N Q S T M F P H Y S T M F P H P P R I M A T E S Pupko & Galtier 2002 Proc Roy Soc 269:1313
Coevolution between sites AA AC AG AT CA … AA AG AT CA . . .
Modelling coevolution ?
Tillier & Collins 1998 Genetics 148:1993, Pollock et al 1999 J Mol Biol 287:187
(pairs of states are what evolve) Such models, however, are uneasy to use and to generalize.
Coevolution between sites
An approach based on substitution mapping A C A G T T C . . . A G A G C T A . . . A G A G C T A . . . T C A G T T C . . . T C G G T T T . . . . . . . . .
probabilistic mapping clustering of mappings significance test
Coevolution between sites
Probabilistic substitution mapping For each site di, we want to estimate the number vik of changes that occurred in each branch k of the tree.
Vik = [ Pr(di, X, Y) / Pr(di) ] nXY(k)
Pr(di) : likelihood for site di Pr(di | X, Y) : likelihood for site di and states X and Y at top and bottom nodes
- f branch k
X Y
Σ Σ
nXY(k) : expected number of changes along branch k knowing states X and Y at top and bottom nodes
Coevolution between sites
- U
- A1
B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15 B16 B17 B18 B19 B20 B21 C1 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 D16 D17 D18 D19 D20 D21 D22 E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 E13 E14 E15 E16 E17 E18 E19 E20 E21 E22 E23 E24 E25 E26 E27 E28 F1 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20 H1 H2 H3 H4 H1_1 I1 I2 I3 Escherichia coli D U18997
16S rRNA P formyl-transferase
Dutheil et al 2005 Mol Biol Evol 22:1919, Dutheil & Galtier 2007 submitted
Molecular clock models
Dating divergences
- a natural goal: providing a time scale to species divergence
- easy in principle:
A B t t = r * dist(A,B) / 2
- uneasy in practice:
(i) We typically deal with an arbitrary number of sequences, not just two: need for a phylogenetic approach. (ii) We typically have several, potentially discordant, fossil calibrations, not just one: need to reconcile them. (r : molecular evolutionary rate) (iii) There is no such thing as "the molecular evolutionary rate": need to account for departures from the molecular clock.
Molecular clock models
Dating divergences A B C D E A B C D E
tmin(A,B) tmax(A,B) tmin(D,E) tmax(D,E)
t What we have What we want
Molecular clock models
A bayesian approach
- data D: a set of aligned sequences
- parameters θ : tree topology, divergence times, rate autocorrelation
- priors: "flat" for topology, from fossils for some divergence dates
Pr(θ | D) Pr(D | θ) posterior distribution likelihood
Molecular clock models
A bayesian approach
- data D: a set of aligned sequences
- parameters θ : tree topology, divergence times, rate autocorrelation
- priors: "flat" for topology, from fossils for some divergence dates
Pr(θ | D) = Pr(D | θ) Pr(θ) Pr(D) Bayes theorem: posterior distribution likelihood prior distribution Calculating Pr(D) would require to integrate the likelihood over the space of branch lengths. This is not computationally feasable ⇒ Monte Carlo Markov Chains.
Conclusions
- A basic tool of phyloinformatics
New biological questions in molecular evolution are typically (optimally) answered by building new models
- A field under rapid development since the mid-90's
Most existing programs use Markov models. Markov models: See papers by Yang, Goldman, Whelan, Pupko, Thorne, Nielsen, Huelsenbeck, Rannala, Suchard, Lartillot, Rodrigo, Guindon, among other. A Royal Society conference about this topic, London, April 2008
- Many good achievements, but still some work to be done