[PPT] - Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe PowerPoint Presentation

SLIDE 1

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

SLIDE 2

Gene copies in a population of 10 individuals

Time

A random−mating population

Week 9: Coalescents – p.2/60

SLIDE 3

Going back one generation

Time

A random−mating population

Week 9: Coalescents – p.3/60

SLIDE 4

... and one more

Time

A random−mating population

Week 9: Coalescents – p.4/60

SLIDE 5

... and one more

Time

A random−mating population

Week 9: Coalescents – p.5/60

SLIDE 6

... and one more

Time

A random−mating population

Week 9: Coalescents – p.6/60

SLIDE 7

... and one more

Time

A random−mating population

Week 9: Coalescents – p.7/60

SLIDE 8

... and one more

Time

A random−mating population

Week 9: Coalescents – p.8/60

SLIDE 9

... and one more

Time

A random−mating population

Week 9: Coalescents – p.9/60

SLIDE 10

... and one more

Time

A random−mating population

Week 9: Coalescents – p.10/60

SLIDE 11

... and one more

Time

A random−mating population

Week 9: Coalescents – p.11/60

SLIDE 12

... and one more

Time

A random−mating population

Week 9: Coalescents – p.12/60

SLIDE 13

... and one more

Time

A random−mating population

Week 9: Coalescents – p.13/60

SLIDE 14

The genealogy of gene copies is a tree

Time

Genealogy of gene copies, after reordering the copies

Week 9: Coalescents – p.14/60

SLIDE 15

Ancestry of a sample of 3 copies

Time

Genealogy of a small sample of genes from the population

Week 9: Coalescents – p.15/60

SLIDE 16

Here is that tree of 3 copies in the pedigree

Time

Week 9: Coalescents – p.16/60

SLIDE 17

Kingman’s coalescent

u9 u7 u5 u3 u8 u6 u4 u2

Random collision of lineages as go back in time (sans recombination) Collision is faster the smaller the effective population size

Average time for n Average time for copies to coalesce to 4N k(k−1) k−1 = In a diploid population of effective population size N, copies to coalesce = 4N (1 − 1 n

(

generations k Average time for two copies to coalesce = 2N generations

Week 9: Coalescents – p.17/60

SLIDE 18

The Wright-Fisher model

This is the canonical model of genetic drift in populations. It was invented in 1930 and 1932 by Sewall Wright and R. A. Fisher. In this model the next generation is produced by doing this: Choose two individuals with replacement (including the possibility that they are the same individual) to be parents, Each produces one gamete, these become a diploid individual, Repeat these steps until N diploid individuals have been produced. The effect of this is to have each locus in an individual in the next generation consist of two genes sampled from the parents’ generation at random, with replacement.

Week 9: Coalescents – p.18/60

SLIDE 19

The coalescent – a derivation

The probability that k lineages becomes k − 1 one generation earlier is (as each lineage “chooses” its ancestor independently): k(k − 1)/2 × Prob (First two have same parent, rest are different) (since there are k

2

= k(k − 1)/2 different pairs of copies)

We add up terms, all the same, for the k(k − 1)/2 pairs that could coalesce: k(k − 1)/2 × 1 ×

1 2N ×

1 −

1 2N

×
1 −

2 2N

× · · · ×
1 − k−2

2N

so that the total probability that a pair coalesces is

= k(k − 1)/4N + O(1/N2)

Week 9: Coalescents – p.19/60

SLIDE 20

Can probabilities of two or more lineages coalescing

Note that the total probability that some combination of lineages coalesces is 1 − Prob (Probability all genes have separate ancestors) = 1 −

1 ×
1 − 1

2N 1 − 2 2N

. . .
1 − k − 1

2N

= 1 −
1 − 1 + 2 + 3 + · · · + (k − 1)

2N + O(1/N2)

and since

1 + 2 + 3 + . . . + (n − 1) = n(n − 1)/2 the quantity = 1 −

1 − k(k − 1)/4N + O(1/N2)
≃ k(k − 1)/4N + O(1/N2)

Week 9: Coalescents – p.20/60

SLIDE 21

Can calculate how many coalescences are of pairs

This shows, since the terms of order 1/N are the same, that the events involving 3 or more lineages simultaneously coalescing are in the terms

f order 1/N2 and thus become unimportant if N is large.

Here are the probabilities of 0, 1, or more coalescences with 10 lineages in populations of different sizes: N 1 > 1 100 0.79560747 0.18744678 0.01694575 1000 0.97771632 0.02209806 0.00018562 10000 0.99775217 0.00224595 0.00000187 Note that increasing the population size by a factor of 10 reduces the coalescent rate for pairs by about 10-fold, but reduces the rate for triples (or more) by about 100-fold.

Week 9: Coalescents – p.21/60

SLIDE 22

The coalescent

To simulate a random genealogy, do the following:

1. Start with k lineages
2. Draw an exponential time interval with mean 4N/(k(k − 1))

generations.

3. Combine two randomly chosen lineages.
4. Decrease k by 1.
5. If k = 1, then stop
6. Otherwise go back to step 2.

Week 9: Coalescents – p.22/60

SLIDE 23

Random coalescent trees with 16 lineages

Week 9: Coalescents – p.23/60

SLIDE 24

Coalescence is faster in small populations

Change of population size and coalescents

Ne

time

the changes in population size will produce waves of coalescence

time

Coalescence events

time

the tree

The parameters of the growth curve for Ne can be inferred by likelihood methods as they affect the prior probabilities of those trees that fit the data.

Week 9: Coalescents – p.24/60

SLIDE 25

Migration can be taken into account

Time

population #1 population #2

Week 9: Coalescents – p.25/60

SLIDE 26

Recombination creates loops

Recomb.

Different markers have slightly different coalescent trees

Week 9: Coalescents – p.26/60

SLIDE 27

If we have a sample of 50 copies

50−gene sample in a coalescent tree

Week 9: Coalescents – p.27/60

SLIDE 28

The first 10 account for most of the branch length

10 genes sampled randomly out of a 50−gene sample in a coalescent tree

Week 9: Coalescents – p.28/60

SLIDE 29

... and when we add the other 40 they add less length

10 genes sampled randomly out of a 50−gene sample in a coalescent tree

(orange lines are the 10−gene tree)

Week 9: Coalescents – p.29/60

SLIDE 30

We want to be able to analyze human evolution

Africa Europe Asia "Out of Africa" hypothesis (vertical scale is not time or evolutionary change)

Week 9: Coalescents – p.30/60

SLIDE 31

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

Week 9: Coalescents – p.31/60

SLIDE 32

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

Week 9: Coalescents – p.32/60

SLIDE 33

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

Week 9: Coalescents – p.33/60

SLIDE 34

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

Week 9: Coalescents – p.34/60

SLIDE 35

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

Week 9: Coalescents – p.35/60

SLIDE 36

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

coalescence time

Week 9: Coalescents – p.36/60

SLIDE 37

If the branch is more than Ne generations long ...

t1 t2 N1 N2 N4 N3 N5

Gene tree and Species tree

Week 9: Coalescents – p.37/60

SLIDE 38

If the branch is more than Ne generations long ...

t1 t2 N1 N2 N4 N3 N5

Gene tree and Species tree

Week 9: Coalescents – p.38/60

SLIDE 39

If the branch is more than Ne generations long ...

t1 t2 N1 N2 N4 N3 N5

Gene tree and Species tree

Week 9: Coalescents – p.39/60

SLIDE 40

Labelled histories

Labelled Histories (Edwards, 1970; Harding, 1971)

Trees that differ in the time−ordering of their nodes A B C D A B C D

These two are the same:

A B C D A B C D

These two are different:

Week 9: Coalescents – p.46/60

SLIDE 41

Inconsistency of estimation from concatenated gene sequences

Degnan and Rosenberg (2006) show that the most likely topology for a gene tree is not necessarily the tree that agrees with the phylogenetic tree. For some phylogenetic shapes (e.g. imbalanced trees with short internal nodes) there exists (at least) one other tree shape that has a higher probability of agreeing with a gene tree. Argues for explicitly considering the coalescent process in phylogenetic inference.

SLIDE 42

How do we compute a likelihood for a population sample?

CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTCAGCGTAC CAGTTTCAGCGTAC CAGTTTCAGCGTAC

, CAGTTTCAGCGTCC CAGTTTCAGCGTCC ) , ... L = Prob ( = ??

Week 9: Coalescents – p.40/60

SLIDE 43

If we have a tree for the sample sequences, we can

CAGTTTTAGCGTCC

CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTTAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTCAGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTTGGCGTCC CAGTTTCAGCGTAC CAGTTTCAGCGTAC CAGTTTCAGCGTAC CAGTTTCAGCGTCC CAGTTTTGGCGTCC

,

CAGTTTCAGCGTCC CAGTTTCAGCGTCC

Prob( | Genealogy)

so we can compute but how to computer the overall likelihood from this?

, ...

CAGTTTCAGCGTCC CAGTTTCAGCGTCC

Week 9: Coalescents – p.41/60

SLIDE 44

The basic equation for coalescent likelihoods

In the case of a single population with parameters Ne effective population size µ mutation rate per site and assuming G′ stands for a coalescent genealogy and D for the sequences, L = Prob (D | Ne, µ) =

G′

Prob (G′ | Ne) Prob (D | G′, µ)

Kingman′s prior

likelihood of tree

Week 9: Coalescents – p.42/60

SLIDE 45

Rescaling the branch lengths

Rescaling branch lengths of G′ so that branches are given in expected mutations per site, G = µG′ , we get (if we let Θ = 4Neµ ) L =

G

Prob (G | Θ) Prob (D | G) as the fundamental equation. For more complex population scenarios

ne simply replaces Θ with a vector of parameters.

Week 9: Coalescents – p.43/60

SLIDE 46

The variability comes from two sources

Ne Ne can reduce variability by looking at (i) more gene copies, or (ii) more loci

(2) Randomness of coalescence of lineages

affected by the can reduce variance of branch by examining more sites number of mutations per site per mutation rate u

(1) Randomness of mutation

affected by effective population size coalescence times allow estimation of

Week 9: Coalescents – p.44/60

SLIDE 47

We can compute the likelihood by averaging over coalescents

t

Likelihood of t Likelihood of

The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter

The likelihood calculation in a sample of two gene copies

t

1

Θ

2

Θ

3

Θ Θ

Prior Prob of t

2 Θ

3 Θ Θ1

Θ Θ

Week 9: Coalescents – p.45/60

SLIDE 48

Rearrangement to sample points in tree space

A conditional coalescent rearrangement strategy

Week 9: Coalescents – p.51/60

SLIDE 49

Dissolving a branch and regrowing it backwards

First pick a random node (interior or tip) and remove its subtree

Week 9: Coalescents – p.52/60

SLIDE 50

We allow it coalesce with the other branches

Then allow this node to re−coalesce with the tree

Week 9: Coalescents – p.53/60

SLIDE 51

and this gives anothern coalescent

The resulting tree proposed by this process

Week 9: Coalescents – p.54/60

SLIDE 52

An example of an MCMC likelihood curve

−10 −20 −30 −40 −50 −60 −70 −80 0.001 0.002 0.005 0.01 0.02 0.05 0.1

Θ ln L

0.00650776

Results of analysing a data set with 50 sequences of 500 bases which was simulated with a true value of

Θ = 0.01

Week 9: Coalescents – p.56/60

SLIDE 53

Major MCMC likelihood or Bayesian programs

LAMARC by Mary Kuhner and Jon Yamato and others. Likelihood inference with multiple populations, recombination, migration, population growth. No historical branching events, yet. BEAST by Andrew Rambaut, Alexei Drummond and others. Bayesian inference with multiple populations related by a tree. Support for serial sampling (no migration or recombination yet). genetree by Bob Griffiths and Melanie Bahlo. Likelihood inference

f migration rates and changes in population size.

migrate by Peter Beerli. Likelihood inference with multiple populations and migration rates. IM and IMa by Rasmus Nielsen and Jody Hey. Two populations allowing both historical splitting and migration after that.

Week 9: Coalescents – p.57/60

SLIDE 54

“Skyline” and “Skyride” plots in BEAST

Classical Skyline Plot

Effective Population Size

0.15 0.10 0.05 0.00 0.001 0.01 1.0 ORMCP Model 0.15 0.10 0.05 0.00 0.001 0.01 1.0 Bayesian Skyline Plot 0.15 0.10 0.05 0.00 0.001 0.01 1.0 Uniform Bayesian Skyride

Time (Past to Present) Effective Population Size

0.15 0.10 0.05 0.00 0.001 0.01 1.0 Time−Aware Bayesian Skyride

Time (Past to Present)

0.15 0.10 0.05 0.00 0.001 0.01 1.0 BEAST Bayesian Skyride

Time (Past to Present)

0.15 0.10 0.05 0.00 0.001 0.01 1.0

Figure from Minin, Bloomquist, and Suchard 2008

SLIDE 55

BEST Liu and Pearl (2007); Edwards et al. (2007)

X – sequence data
G – a genealogy (gene tree – with branch lengths)
S – a species tree
θ – demographic parameters
Λ – parameters of molecular sequence evolution

Pr(S, θ|X) = Pr(S, θ) Pr(X|S, θ) Pr(X) = Pr(S) Pr(θ)

Pr(X|G) Pr(G|S, θ)dG

∝ Pr(S) Pr(θ) Pr(X|G, Λ) Pr(Λ)dΛ

Pr(G|S, θ)dG

SLIDE 56

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the

coalescent prior

2. Sample from the distribution of the species trees conditional on the gene

trees, G.

3. Use “importance weights” to correct the sample for the fact that an

approximate prior was used

SLIDE 57

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the

coalescent prior (a) Use a tweaked version of MrBayes to sample N sets of gene trees, G, from

†

Pr(G|X) = Pr†(G) Pr(X|G) Pr†(X) (b) Pr†(G) is an approximate prior on gene trees from using a “maximal” species tree.

2. Sample from the distribution of the species trees conditional on the gene

trees, G.

3. Use “importance weights” to correct the sample for the fact that an

approximate prior was used

SLIDE 58

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the

coalescent prior

2. Sample from the distribution of the species trees conditional on the gene

trees, G. (a) From each set of gene trees (Gj for 1 ≤ j ≤ N) generate k species trees using coalescent theory: Pr(Si|Gj) = Pr(Si) Pr(Gj|Si) Pr(Gj)

3. Use “importance weights” to correct the sample for the fact that an

approximate prior was used

SLIDE 59

BEST – importance sampling

1. Generate a collection of gene trees, G, using an approximation of the

coalescent prior

2. Sample from the distribution of the species trees conditional on the gene

trees, G.

3. Use “importance weights” to correct the sample for the fact that an

approximate prior was used (a) Estimate Pr(Gj) by using the harmonic mean estimator from the MCMC in step 2. (b) Compute a normalization factor β =

N

j=1
Pr(Gj)

Pr(Gj) (c) Reweight all sampled species trees by

Pr(Gj)

Pr(Gj)β

SLIDE 60

BEST – conclusions

1. very expensive computationally (long MrBayes runs are needed)
2. should correctly deal with the variability in gene tree caused by the

coalescent process.

SLIDE 61

∗BEST

Similar model to BEST, but much more efficient implementation. Both will be very sensitive to migration, but they represent the state-of-the- art for estimating species trees from gene trees.

SLIDE 62

Gene tree in a species tree w/ variable population size

Figure from Heled and Drummond 2010

SLIDE 63

Multiple gene tree in a species tree w/ variable population size

Figure from Heled and Drummond 2010

SLIDE 64

References

Degnan, J. and Rosenberg, N. (2006). Discordance of species trees with their most likely gene trees. PLoS Genet, 2(5). Edwards, S. V., Liu, L., and Pearl, D. K. (2007). High-resolution species trees without concatenation. Proceedings of the National Academy of Sciences, 104(14):5936–5941. Liu, L. and Pearl, D. K. (2007). Species trees from gene trees: reconstruction Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology, 56(3):504–514.