Week 9: Coalescents, part 2
Genome 562 March, 2015
Week 9: Coalescents, part 2 – p.1/71
Fixation probabilities with multiplicative fitnesses
[Figure: fixation probability (0.0 to 1.0) plotted against initial gene frequency (0.0 to 1.0); curves labelled 100, 10, 1, 0.1, −0.1, −1, −10, −100.]
$$U(p) = \frac{1 - e^{-4Nsp}}{1 - e^{-4Ns}}$$
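The fixation probability formula can be evaluated directly. The following is a minimal sketch, not code from the course; the function name and the choice to pass the product 4Ns as a single argument are assumptions made for illustration.

```python
import math

def fixation_probability(p, four_ns):
    """U(p) = (1 - exp(-4Ns p)) / (1 - exp(-4Ns)), with four_ns = 4*N*s."""
    if abs(four_ns) < 1e-12:
        return p  # neutral limit: U(p) tends to p as 4Ns goes to 0
    # expm1(x) = exp(x) - 1, so the two minus signs cancel in the ratio
    return math.expm1(-four_ns * p) / math.expm1(-four_ns)

for four_ns in (10, 1, 0.1, -0.1, -1, -10):
    print(four_ns, round(fixation_probability(0.1, four_ns), 4))
```

For very large negative 4Ns the exponential overflows; this sketch makes no attempt to handle that range.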
A random-mating population
[Figure, repeated over twelve slides; axis: Time.]
Genealogy of gene copies, after reordering the copies
[Figure; axis: Time.]
Genealogy of a small sample of genes from the population
[Figure, shown on two slides; axis: Time.]
Random collision of lineages as we go back in time (sans recombination).
Collision is faster the smaller the effective population size.
[Figure: coalescent tree with successive intervals labelled u9, u8, u7, u6, u5, u4, u3, u2.]
In a diploid population of effective population size N:
Average time for k copies to coalesce to k − 1 = $\frac{4N}{k(k-1)}$ generations
Average time for two copies to coalesce = 2N generations
Average time for n copies to coalesce = $4N\left(1 - \frac{1}{n}\right)$ generations
What’s misleading about this diagram: the lineages that coalesce are random pairs, not necessarily ones that are next to each other in a linear order.
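The three averages are connected: summing the waiting times 4N/(k(k − 1)) for k = n, n − 1, ..., 2 gives 4N(1 − 1/n). A short numerical check, with hypothetical function names chosen for illustration:

```python
def mean_wait(k, N):
    """Expected generations for k lineages to become k - 1: 4N / (k(k-1))."""
    return 4 * N / (k * (k - 1))

def mean_time_to_common_ancestor(n, N):
    """Sum of the expected waits as n lineages drop to 1."""
    return sum(mean_wait(k, N) for k in range(2, n + 1))

N = 10_000
for n in (2, 10, 100):
    # The second and third printed columns should agree.
    print(n, mean_time_to_common_ancestor(n, N), 4 * N * (1 - 1 / n))
```

Note that the last pair alone accounts for 2N of the at most 4N generations, so adding gene copies adds little depth to the tree.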
This is the canonical model of genetic drift in populations, usually called the Wright-Fisher model. It was invented in 1930 and 1932 by Sewall Wright and R. A. Fisher.
In this model the next generation is produced by doing this:
1. Choose two individuals with replacement (including the possibility that they are the same individual) to be parents.
2. Each produces one gamete; these become a diploid individual.
3. Repeat these steps until N diploid individuals have been produced.
The effect of this is to have each locus in an individual in the next generation consist of two genes sampled from the parents’ generation at random, with replacement.
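The sampling scheme can be sketched in Python as below. This is an illustrative sketch, not code from the course: the function names are hypothetical, and it uses the shortcut stated in the last sentence (draw 2N gene copies with replacement from the parental gene pool) rather than forming individuals explicitly.

```python
import random

def next_generation(genes, rng):
    """Sample the same number of gene copies, with replacement, from the parents."""
    return [rng.choice(genes) for _ in genes]

def drift_until_fixed(N, p0, rng):
    """Run drift on 2N gene copies until allele 'A' is lost or fixed.
    Returns (number of generations, surviving allele)."""
    n_a = round(2 * N * p0)
    genes = ["A"] * n_a + ["a"] * (2 * N - n_a)
    t = 0
    while 0 < genes.count("A") < 2 * N:
        genes = next_generation(genes, rng)
        t += 1
    return t, genes[0]

print(drift_until_fixed(N=20, p0=0.5, rng=random.Random(1)))
```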
J. F. C. Kingman in about 1983. Currently Emeritus Professor of Mathematics at Cambridge University, U.K., and former head of the Isaac Newton Institute for Mathematical Sciences.
The probability that k lineages become k − 1 one generation earlier turns out to be (as each lineage “chooses” its ancestor independently):
$$\frac{k(k-1)}{2} \times \text{Prob (first two have same parent, rest are different)}$$
(since there are $\binom{k}{2} = k(k-1)/2$ pairs that could be the one to coalesce).
We add up terms, all the same, for the k(k − 1)/2 pairs that could coalesce; the sum is:
$$\frac{k(k-1)}{2} \times 1 \times \frac{1}{2N} \times \left(1 - \frac{1}{2N}\right) \times \left(1 - \frac{2}{2N}\right) \times \cdots \times \left(1 - \frac{k-2}{2N}\right) = \frac{k(k-1)}{4N} + O(1/N^2)$$
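Note that the total probability that some combination of lineages coalesces is
$$1 - \text{Prob (all genes have separate ancestors)} = 1 - \left[1 \times \left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right) \cdots \left(1 - \frac{k-1}{2N}\right)\right] = 1 - \left[1 - \frac{1 + 2 + 3 + \cdots + (k-1)}{2N} + O(1/N^2)\right]$$
and since 1 + 2 + 3 + . . . + (n − 1) = n(n − 1)/2, the quantity
$$= 1 - \left[1 - \frac{k(k-1)}{4N} + O(1/N^2)\right] \simeq \frac{k(k-1)}{4N} + O(1/N^2)$$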
This shows, since the terms of order 1/N are the same, that the events involving 3 or more lineages simultaneously coalescing are in the terms of order 1/N^2.
Here are the probabilities of 0, 1, or more coalescences with 10 lineages in populations of different sizes:

    N        0            1            > 1
    100      0.79560747   0.18744678   0.01694575
    1000     0.97771632   0.02209806   0.00018562
    10000    0.99775217   0.00224595   0.00000187

Note that increasing the population size by a factor of 10 reduces the coalescent rate for pairs by about 10-fold, but reduces the rate for triples (or more) by about 100-fold.
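The table can be recomputed from the exact one-generation probabilities used in the derivation. A minimal sketch, with a hypothetical function name; "exactly one coalescence" is taken to mean that one pair shares a parent and all other lineages have distinct parents:

```python
from math import comb

def coalescence_probabilities(k, N):
    """Return (P[no coalescence], P[exactly one pair coalesces], P[anything more])
    for k lineages going back one generation among 2N gene copies."""
    two_n = 2 * N
    p_none = 1.0
    for i in range(1, k):          # (1 - 1/2N)(1 - 2/2N)...(1 - (k-1)/2N)
        p_none *= 1 - i / two_n
    p_one = comb(k, 2) / two_n     # k(k-1)/2 pairs, each with probability 1/2N
    for i in range(1, k - 1):      # the other k - 2 lineages avoid used parents
        p_one *= 1 - i / two_n
    return p_none, p_one, 1 - p_none - p_one

for N in (100, 1000, 10000):
    print(N, *(f"{p:.8f}" for p in coalescence_probabilities(10, N)))
```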
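To simulate a random genealogy, do the following:
1. Start with k lineages.
2. Draw an exponential time interval with mean 4N/(k(k − 1)) generations.
3. Combine two randomly chosen lineages.
4. Decrease k by 1.
5. If k = 1, then stop; otherwise go back to step 2.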
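A minimal sketch of simulating a random genealogy, assuming the usual procedure of drawing exponential waiting times with mean 4N/(k(k − 1)) generations and merging a random pair each time. The function name and the event-list representation of the tree are hypothetical choices for illustration.

```python
import random

def simulate_genealogy(n, N, rng):
    """Return a list of (time, (child_a, child_b), parent) coalescence events
    for n sampled gene copies in a diploid population of effective size N."""
    lineages = list(range(n))        # tips are labelled 0 .. n-1
    next_label = n                   # interior nodes get new labels
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / (4 * N))  # mean 4N / (k(k-1))
        a, b = rng.sample(lineages, 2)               # a random pair collides
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(next_label)
        events.append((t, (a, b), next_label))
        next_label += 1
    return events

for event in simulate_genealogy(5, 1000, random.Random(0)):
    print(event)
```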
There is a box with bugs that are hyperactive, indiscriminate, voracious (eats other bug: Gulp!) and insatiable.
Change of population size and coalescents
[Figure, three panels sharing a time axis: Ne against time; coalescence events; the tree.]
The changes in population size will produce waves of coalescence.
The parameters of the growth curve for Ne can be inferred by likelihood methods as they affect the prior probabilities of those trees that fit the data.
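The waves of coalescence can be seen in a small generation-by-generation simulation. This is a sketch under stated assumptions: at most one coalescence is allowed per generation, occurring with probability k(k − 1)/(4Ne), and the `bottleneck` history (Ne of 100 for 200 generations, otherwise 10,000) is invented for illustration.

```python
import random

def coalescence_times(n, ne_at, rng, max_gen=10**6):
    """Generations back in time at which coalescences occur.
    ne_at(t) returns the effective population size t generations ago."""
    k, t, times = n, 0, []
    while k > 1 and t < max_gen:
        t += 1
        if rng.random() < min(1.0, k * (k - 1) / (4 * ne_at(t))):
            times.append(t)
            k -= 1
    return times

def bottleneck(t):
    """Hypothetical history: a brief crash in Ne between 2000 and 2200 generations ago."""
    return 100 if 2000 <= t < 2200 else 10_000

times = coalescence_times(20, bottleneck, random.Random(3))
in_crash = sum(2000 <= t < 2200 for t in times)
print(len(times), "coalescences,", in_crash, "of them during the crash")
```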
[Figure; axis: Time; labels: population #1, population #2.]
[Figure; label: Recomb.]
Different markers have slightly different coalescent trees
[Photos: Becky Cann, Mark Stoneking, the late Allan Wilson]
Cann, R. L., M. Stoneking, and A. C. Wilson. 1987. Mitochondrial DNA and human evolution. Nature 325: 31-36.
"Out of Africa" hypothesis
[Figure: tree with groups labelled Africa, Europe, Asia (vertical scale is not time or evolutionary change).]
Consistency of gene tree with species tree
[Figure sequence over six slides; the last is labelled "coalescence time".]
Gene tree and Species tree
[Figure sequence over three slides; labels: t1, t2, N1, N2, N3, N4, N5.]
L = Prob( sequences, ... ) = ??
[Sequence data shown on the slide, 29 copies in all: 11 × CAGTTTTAGCGTCC, 10 × CAGTTTCAGCGTCC, 5 × CAGTTTTGGCGTCC, 3 × CAGTTTCAGCGTAC.]
Given a genealogy we can compute Prob( sequences, ... | Genealogy ), but how do we compute the overall likelihood from this?
[Figure: the same 29 sequences, some of them placed at the tips of a genealogy.]
In the case of a single population with parameters
    Ne   effective population size
    µ    mutation rate per site
and assuming G′ stands for a coalescent genealogy and D for the sequences,
$$L = \text{Prob}(D \mid N_e, \mu) = \sum_{G'} \text{Prob}(G' \mid N_e)\, \text{Prob}(D \mid G', \mu)$$
Rescaling branch lengths of G′ so that branches are given in expected mutations per site, G = µG′, we get (if we let Θ = 4Neµ)
$$L = \sum_{G} \text{Prob}(G \mid \Theta)\, \text{Prob}(D \mid G)$$
as the fundamental equation. For more complex population scenarios one simply replaces Θ with a vector of parameters.
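The term Prob(G | Θ) can be made concrete. The slides do not write it out; the sketch below assumes the standard Kingman density for a labelled history with branch lengths in expected mutations per site, $\prod_{k=2}^{n} \frac{2}{\Theta} \exp\left(-\frac{k(k-1)\,u_k}{\Theta}\right)$, where $u_k$ is the length of the interval with k lineages. The function name and argument layout are hypothetical.

```python
import math

def log_coalescent_prior(intervals, theta):
    """log Prob(G | Theta) under the assumed Kingman density.
    intervals[i] is the time, in expected mutations per site, during which
    k = n - i lineages exist (so intervals run from k = n down to k = 2)."""
    n = len(intervals) + 1
    total = 0.0
    for i, u in enumerate(intervals):
        k = n - i
        total += math.log(2.0 / theta) - k * (k - 1) * u / theta
    return total

# Three sampled copies: an interval with 3 lineages, then one with 2.
print(log_coalescent_prior([0.002, 0.006], theta=0.01))
```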
(1) Randomness of mutation
    affected by the mutation rate µ
    can reduce variance of number of mutations per site per branch by examining more sites
(2) Randomness of coalescence of lineages
    affected by effective population size Ne
    coalescence times allow estimation of Ne
    can reduce variability by looking at (i) more gene copies, or (ii) more loci
[Figure sequence over four slides: the prior probability of t under Θ1, Θ2 and Θ3, shown with the likelihood of t; a second panel shows the likelihood of Θ at Θ1, Θ2 and Θ3.]
The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter Θ.
Labelled Histories (Edwards, 1970; Harding, 1971)
Trees that differ in the time-ordering of their nodes
These two are the same: [Figure: two trees with tips A, B, C, D]
These two are different: [Figure: two trees with tips A, B, C, D]
[Photos: Bob Griffiths; Simon Tavaré; Mary Kuhner and Jon Yamato]
To get the area under a curve, we can either evaluate the function (f(x)) at a series of grid points and add up heights × widths:
[Figure]
or sample points at random and add up height × width:
[Figure]
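Both ways of adding up heights × widths can be compared on a curve whose area is known. A minimal sketch with hypothetical function names; the test curve exp(−x²/2), whose area is √(2π), is chosen only for illustration.

```python
import math
import random

def grid_area(f, a, b, n):
    """Evaluate f at n evenly spaced midpoints and add up heights x widths."""
    w = (b - a) / n
    return sum(f(a + (i + 0.5) * w) * w for i in range(n))

def random_area(f, a, b, n, rng):
    """Evaluate f at n uniformly random points, each given the same width."""
    w = (b - a) / n
    return sum(f(rng.uniform(a, b)) * w for _ in range(n))

def curve(x):
    return math.exp(-x * x / 2)

print(grid_area(curve, -5, 5, 1000))
print(random_area(curve, -5, 5, 20000, random.Random(0)))
print(math.sqrt(2 * math.pi))
```

The random version is noisier in one dimension, but it carries over to spaces such as the set of genealogies where no grid can be laid down.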
$$L = \int f(x)\, dx = \int \frac{f(x)}{g(x)}\, g(x)\, dx = E_g\left[\frac{f(x)}{g(x)}\right].$$
This is approximated by sampling a lot (n) of points from g(x) and then computing the average:
$$L \approx \frac{1}{n} \sum_{i=1}^{n} \frac{f(x_i)}{g(x_i)}$$
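A minimal sketch of this average, assuming for illustration that f(x) = exp(−x²/2) (whose integral is √(2π)) and that g is a normal density with standard deviation 1.5; both choices and the function names are hypothetical.

```python
import math
import random

def f(x):
    """Function to integrate; its exact integral is sqrt(2*pi)."""
    return math.exp(-x * x / 2)

def g_density(x, sd=1.5):
    """Importance density: normal with mean 0 and standard deviation sd."""
    return math.exp(-x * x / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def importance_estimate(n, rng, sd=1.5):
    """Average f(x_i) / g(x_i) over n points x_i drawn from g."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sd)
        total += f(x) / g_density(x, sd)
    return total / n

print(importance_estimate(20000, random.Random(1)), math.sqrt(2 * math.pi))
```

The estimate is most stable when g is close in shape to f and has tails at least as heavy, which is why g here is made wider than f.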
In Mary Kuhner and Jon Yamato’s program LAMARC they use as the importance function the probability density of the tree given the data at a set of “driving values” Θ0 of the parameters:
$$f(G) = \frac{\text{Prob}(D \mid G)\, \text{Prob}(G \mid \Theta_0)}{\text{Prob}(D \mid \Theta_0)}$$
The denominator is impossible to evaluate but, as we will see, isn’t really needed. The resulting likelihood ratio is
$$\frac{L(\Theta)}{L(\Theta_0)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\text{Prob}(G_i \mid \Theta)}{\text{Prob}(G_i \mid \Theta_0)}$$
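The likelihood-ratio average can be sketched as follows. This is not LAMARC's code: it assumes the standard Kingman density for Prob(G | Θ) with branch lengths in expected mutations per site, and every function name is hypothetical. With no data at all, genealogies drawn from the prior at Θ0 play the role of the sampled G_i, and the estimated ratio should then be close to 1 for any Θ, which serves as a sanity check.

```python
import math
import random

def log_prior(intervals, theta):
    """Assumed Kingman log density; intervals run from k = n lineages down to k = 2."""
    n = len(intervals) + 1
    return sum(math.log(2.0 / theta) - (n - i) * (n - i - 1) * u / theta
               for i, u in enumerate(intervals))

def relative_likelihood(sampled, theta, theta0):
    """(1/n) * sum of Prob(G_i | theta) / Prob(G_i | theta0) over sampled genealogies."""
    ratios = [math.exp(log_prior(u, theta) - log_prior(u, theta0)) for u in sampled]
    return sum(ratios) / len(ratios)

def draw_intervals(n, theta0, rng):
    """Stand-in for the MCMC sample: coalescent intervals drawn from the prior at theta0."""
    return [rng.expovariate(k * (k - 1) / theta0) for k in range(n, 1, -1)]

rng = random.Random(42)
sampled = [draw_intervals(5, 0.01, rng) for _ in range(20000)]
for theta in (0.008, 0.01, 0.012):
    print(theta, relative_likelihood(sampled, theta, 0.01))
```

In a real analysis the G_i would come from a sampler run on the data, and the curve would peak near the best-supported Θ. The average becomes unreliable when Θ is far from Θ0.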
To do the importance sampling, MCMC methods are employed (in all programs that do full likelihood or Bayesian analyses). To sample from f(G), start with a tree Gold and
1. Propose a new tree Gnew by making a random change to Gold.
2. Draw a uniform random number R between 0 and 1.
3. If R < f(Gnew)/f(Gold), accept the new tree; otherwise keep Gold. (Note that in that ratio any horrible, but shared, denominators cancel out.)
4. Repeat this vast numbers of times (the correct number of times is infinity).
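The accept/reject rule can be illustrated on a target much simpler than a space of trees. The sketch below is a generic Metropolis sampler for a one-dimensional unnormalized density, assuming a symmetric proposal; the target, the proposal and all names are hypothetical stand-ins for f(G) and the tree rearrangement.

```python
import math
import random

def metropolis(f, x0, propose, steps, rng):
    """Sample from a density proportional to f using symmetric proposals."""
    x, fx, chain = x0, f(x0), []
    for _ in range(steps):
        y = propose(x, rng)
        fy = f(y)
        if rng.random() < fy / fx:   # any constant factor shared by fy and fx cancels
            x, fx = y, fy
        chain.append(x)              # a rejected proposal repeats the old state
    return chain

def target(x):
    """Unnormalized normal density centred on 3; the factor 1000 is deliberate."""
    return 1000.0 * math.exp(-(x - 3.0) ** 2 / 2)

def step(x, rng):
    return x + rng.uniform(-1.0, 1.0)

chain = metropolis(target, 0.0, step, 50000, random.Random(5))
kept = chain[1000:]                  # discard an initial burn-in period
print(sum(kept) / len(kept))
```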
A conditional coalescent rearrangement strategy
1. First pick a random node (interior or tip) and remove its subtree.
2. Then allow this node to re-coalesce with the tree.
3. The resulting tree is the one proposed by this process.
[Figure: curve with vertical axis from −10 to −80 and horizontal axis ticks at 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1; the value 0.00650776 is marked.]
Results of analysing a data set with 50 sequences of 500 bases which was simulated with a true value of Θ = 0.01
LAMARC by Mary Kuhner and Jon Yamato and others. Likelihood inference with multiple populations, recombination, migration, population growth. No historical branching events or serial sampling, yet.
BEAST by Andrew Rambaut, Alexei Drummond and others. Bayesian inference with multiple populations related by a tree. Support for serial sampling (no migration or recombination yet).
genetree by Bob Griffiths and Melanie Bahlo. Likelihood inference of migration rates and changes in population size. No recombination or historical branching events.
migrate by Peter Beerli. Likelihood inference with multiple populations and migration rates. No recombination or historical branching events yet.
IM and IMa by Rasmus Nielsen and Jody Hey. Two or more populations allowing both historical splitting and migration after that. No recombination yet.
These involve approximating the sampling by computing some “summary statistics” from the data, then finding parameter values that, in a simulation, give summary statistics close to the observed ones.
They are faster, and very popular now. But they are very dependent on getting the right summary statistics so as not to lose too much power compared to fully-powerful likelihood or Bayesian MCMC methods.
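The description reads like what is commonly called approximate Bayesian computation (ABC). Assuming that is the intended family of methods, a rejection version can be sketched as below. Everything specific here is an assumption for illustration: the summary statistic (number of segregating sites under an infinite-sites model), the flat prior on Θ, the tolerance, the observed value of 22, and all function names.

```python
import random

def simulate_segregating_sites(n, sites, theta, rng):
    """Simulate a coalescent tree for n copies, then count mutations on it."""
    tree_length = 0.0
    for k in range(n, 1, -1):
        # interval with k lineages, in expected mutations per site
        tree_length += k * rng.expovariate(k * (k - 1) / theta)
    expected = tree_length * sites
    # Poisson draw with this mean, by counting unit-rate arrivals before `expected`
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < expected:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def abc_rejection(s_obs, n, sites, draws, tol, rng):
    """Keep prior draws of theta whose simulated statistic is within tol of s_obs."""
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(0.001, 0.05)   # assumed flat prior
        if abs(simulate_segregating_sites(n, sites, theta, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted

accepted = abc_rejection(s_obs=22, n=50, sites=500, draws=5000, tol=3,
                         rng=random.Random(7))
print(len(accepted), sum(accepted) / len(accepted))
```

Reducing the data to a single count discards the information carried by how the variable sites are distributed across sequences, which is the loss of power the slide warns about.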