Week 9: Coalescents, part 2 Genome 562 March, 2015

SLIDE 1

Week 9: Coalescents, part 2

Genome 562 March, 2015

Week 9: Coalescents, part 2 – p.1/71

SLIDE 2

Fixation probabilities with multiplicative fitnesses

[Figure: fixation probability (vertical axis, 0.0 to 1.0) against initial gene frequency (horizontal axis, 0.0 to 1.0), with curves labelled 100, 10, 1, 0.1, −0.1, −1, −10, −100.]

$$U(p) = \frac{1 - e^{-4Nsp}}{1 - e^{-4Ns}}$$
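A minimal sketch of evaluating the fixation probability formula U(p) = (1 − e^(−4Nsp)) / (1 − e^(−4Ns)) numerically. The function name and the choice N = 1 in the loop are assumptions made for illustration; treating the listed numbers as values of 4Ns is also an assumption, not something the slide states.

```python
import math

def fixation_probability(p, N, s):
    """U(p) = (1 - exp(-4Nsp)) / (1 - exp(-4Ns)); tends to p as s -> 0."""
    a = 4.0 * N * s
    if abs(a) < 1e-12:
        return p  # neutral limit of the formula
    # expm1(x) = exp(x) - 1, so the two minus signs cancel in the ratio
    return math.expm1(-a * p) / math.expm1(-a)

# Illustrative values only: N = 1 so that s = (4Ns) / 4 (an assumption).
for four_ns in (100, 10, 1, 0.1, -0.1, -1, -10, -100):
    u = fixation_probability(0.1, N=1, s=four_ns / 4.0)
    print(f"4Ns = {four_ns:>6}: U(0.1) = {u:.6f}")
```

Using `expm1` avoids loss of precision when 4Ns is close to zero.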

SLIDE 3

Gene copies in a population of 10 individuals

Time

A random−mating population

SLIDE 4

Going back one generation

Time

A random−mating population

SLIDE 5

... and one more

Time

A random−mating population

SLIDE 6

... and one more

Time

A random−mating population

SLIDE 7

... and one more

Time

A random−mating population

SLIDE 8

... and one more

Time

A random−mating population

SLIDE 9

... and one more

Time

A random−mating population

SLIDE 10

... and one more

Time

A random−mating population

SLIDE 11

... and one more

Time

A random−mating population

SLIDE 12

... and one more

Time

A random−mating population

SLIDE 13

... and one more

Time

A random−mating population

SLIDE 14

... and one more

Time

A random−mating population

SLIDE 15

The genealogy of gene copies is a tree

Time

Genealogy of gene copies, after reordering the copies

SLIDE 16

Ancestry of a sample of 3 copies

Time

Genealogy of a small sample of genes from the population

SLIDE 17

Here is that tree of 3 copies in the pedigree

Time

SLIDE 18

Kingman’s coalescent

Random collision of lineages as we go back in time (sans recombination). Collision is faster the smaller the effective population size.

[Figure: a coalescent tree with successive intervals labelled u9, u8, u7, u6, u5, u4, u3, u2.]

In a diploid population of effective population size N:

Average time for k copies to coalesce to k − 1 = $\frac{4N}{k(k-1)}$ generations

Average time for n copies to coalesce = $4N\left(1 - \frac{1}{n}\right)$ generations

Average time for two copies to coalesce = 2N generations

What’s misleading about this diagram: the lineages that coalesce are random pairs, not necessarily ones that are next to each other in a linear order.
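The total time follows from adding the interval means, because 1/(k(k−1)) = 1/(k−1) − 1/k and the sum from k = 2 to n telescopes to 1 − 1/n. A small numerical check of that step; the function names are illustrative, not from the lecture.

```python
def mean_interval(k, N):
    """Mean generations for k lineages to become k - 1: 4N / (k(k-1))."""
    return 4.0 * N / (k * (k - 1))

def mean_time_to_common_ancestor(n, N):
    """Sum of the interval means for k = n, n-1, ..., 2."""
    return sum(mean_interval(k, N) for k in range(2, n + 1))

N = 1000  # illustrative effective population size
for n in (2, 5, 10, 100):
    total = mean_time_to_common_ancestor(n, N)
    closed_form = 4.0 * N * (1.0 - 1.0 / n)
    print(n, round(total, 3), round(closed_form, 3))
```

The last interval (two lineages) alone accounts for 2N of the at most 4N generations, which is why adding more copies to a sample deepens the tree only slightly.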

SLIDE 19

The Wright-Fisher model

This is the canonical model of genetic drift in populations. It was invented in 1930 and 1932 by Sewall Wright and R. A. Fisher. In this model the next generation is produced by doing this:

  • Choose two individuals with replacement (including the possibility that they are the same individual) to be parents,
  • Each produces one gamete, these become a diploid individual,
  • Repeat these steps until N diploid individuals have been produced.

The effect of this is to have each locus in an individual in the next generation consist of two genes sampled from the parents’ generation at random, with replacement.
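A minimal sketch of the sampling scheme just described, tracking the 2N gene copies at one locus rather than whole individuals (which, as the slide notes, amounts to drawing 2N copies at random with replacement). The function names, the allele labels, and the population size are assumptions for illustration.

```python
import random

def next_generation(gene_copies, rng):
    """Draw the same number of gene copies from the parents, with replacement."""
    return [rng.choice(gene_copies) for _ in gene_copies]

def generations_until_fixation(gene_copies, rng):
    """Iterate the Wright-Fisher step until only one allele remains."""
    generations = 0
    while len(set(gene_copies)) > 1:
        gene_copies = next_generation(gene_copies, rng)
        generations += 1
    return gene_copies[0], generations

rng = random.Random(1)
start = ["A"] * 10 + ["a"] * 10  # N = 10 diploid individuals, 20 gene copies
allele, gens = generations_until_fixation(start, rng)
print(f"allele {allele} fixed after {gens} generations")
```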

SLIDE 20

Sir John Kingman

J. F. C. Kingman in about 1983

Currently Emeritus Professor of Mathematics at Cambridge University, U.K., and former head of the Isaac Newton Institute of Mathematical Sciences.

SLIDE 21

The coalescent – a derivation

The probability that k lineages become k − 1 one generation earlier turns out to be (as each lineage “chooses” its ancestor independently):

k(k − 1)/2 × Prob (First two have same parent, rest are different)

(since there are $\binom{k}{2} = k(k-1)/2$ different pairs of copies).

We add up terms, all the same, for the k(k − 1)/2 pairs that could coalesce; the sum is:

$$\frac{k(k-1)}{2} \times 1 \times \frac{1}{2N} \times \left(1 - \frac{1}{2N}\right) \times \left(1 - \frac{2}{2N}\right) \times \cdots \times \left(1 - \frac{k-2}{2N}\right)$$

so that the total probability that a pair coalesces is

$$= \frac{k(k-1)}{4N} + O(1/N^2)$$

SLIDE 22

Can probabilities of two or more lineages coalescing

Note that the total probability that some combination of lineages coalesces is

$$1 - \mathrm{Prob}(\text{all genes have separate ancestors})$$

$$= 1 - \left[1 \times \left(1 - \frac{1}{2N}\right)\left(1 - \frac{2}{2N}\right) \cdots \left(1 - \frac{k-1}{2N}\right)\right]$$

$$= 1 - \left[1 - \frac{1 + 2 + 3 + \cdots + (k-1)}{2N} + O(1/N^2)\right]$$

and since 1 + 2 + 3 + . . . + (n − 1) = n(n − 1)/2, the quantity

$$= 1 - \left[1 - \frac{k(k-1)}{4N} + O(1/N^2)\right] \simeq \frac{k(k-1)}{4N} + O(1/N^2)$$

SLIDE 23

Can calculate how many coalescences are of pairs

This shows, since the terms of order 1/N are the same, that the events involving 3 or more lineages simultaneously coalescing are in the terms of order 1/N² and thus become unimportant if N is large.

Here are the probabilities of 0, 1, or more coalescences with 10 lineages in populations of different sizes:

    N        0            1            > 1
    100      0.79560747   0.18744678   0.01694575
    1000     0.97771632   0.02209806   0.00018562
    10000    0.99775217   0.00224595   0.00000187

Note that increasing the population size by a factor of 10 reduces the coalescent rate for pairs by about 10-fold, but reduces the rate for triples (or more) by about 100-fold.
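The three columns can be recomputed from the Wright-Fisher products: no coalescence means all k lineages pick distinct parents among the 2N gene copies, and exactly one coalescence means one of the k(k−1)/2 pairs shares a parent while all others are distinct. This is a sketch under that reading of the table; the function name is illustrative.

```python
from math import comb

def coalescence_probabilities(k, N):
    """Return (P[no coalescence], P[exactly one pair], P[anything more])
    for k lineages in one Wright-Fisher generation with 2N gene copies."""
    two_n = 2 * N
    p_none = 1.0
    for i in range(1, k):            # (1 - 1/2N)(1 - 2/2N)...(1 - (k-1)/2N)
        p_none *= 1.0 - i / two_n
    p_one = comb(k, 2) / two_n       # choose the pair; it shares one parent
    for i in range(1, k - 1):        # the other k-2 lineages stay distinct
        p_one *= 1.0 - i / two_n
    return p_none, p_one, 1.0 - p_none - p_one

for N in (100, 1000, 10000):
    p0, p1, p_more = coalescence_probabilities(10, N)
    print(f"{N:>6} {p0:.8f} {p1:.8f} {p_more:.8f}")
```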

SLIDE 24

The coalescent

To simulate a random genealogy, do the following:

  • 1. Start with k lineages
  • 2. Draw an exponential time interval with mean 4N/(k(k − 1)) generations.

  • 3. Combine two randomly chosen lineages.
  • 4. Decrease k by 1.
  • 5. If k = 1, then stop
  • 6. Otherwise go back to step 2.
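The six steps above can be sketched as follows. The representation of a lineage as a list of tip labels, the function name, and the parameter values are assumptions chosen for illustration.

```python
import random

def simulate_coalescent(k, N, rng):
    """Simulate one random genealogy; return a list of
    (time_in_generations, left_tips, right_tips), one per coalescence."""
    lineages = [[i] for i in range(k)]       # step 1: start with k lineages
    time = 0.0
    events = []
    while len(lineages) > 1:                 # step 5: stop when one is left
        n = len(lineages)
        # step 2: exponential waiting time with mean 4N / (n(n-1));
        # expovariate takes the rate, which is 1 / mean
        time += rng.expovariate(n * (n - 1) / (4.0 * N))
        # step 3: combine two randomly chosen lineages
        i, j = rng.sample(range(n), 2)
        left, right = lineages[i], lineages[j]
        events.append((time, left, right))
        # step 4: the number of lineages drops by one
        lineages = [lin for idx, lin in enumerate(lineages) if idx not in (i, j)]
        lineages.append(left + right)
    return events

rng = random.Random(42)
for t, left, right in simulate_coalescent(5, 1000, rng):
    print(f"{t:10.1f}  {left} + {right}")
```

Averaged over many replicates, the time of the last event should approach 4N(1 − 1/k) generations.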

SLIDE 25

An accurate analogy: Bugs In A Box

There is a box ...

SLIDE 26

An accurate analogy: Bugs In A Box

with bugs that are ...

SLIDE 27

An accurate analogy: Bugs In A Box

hyperactive, ...

SLIDE 28

An accurate analogy: Bugs In A Box

indiscriminate, ...

SLIDE 29

An accurate analogy: Bugs In A Box

voracious ...

SLIDE 30

An accurate analogy: Bugs In A Box

(eats other bug) ...

Gulp!
SLIDE 31

An accurate analogy: Bugs In A Box

and insatiable.

SLIDE 32

Random coalescent trees with 16 lineages

[Figure: several random coalescent trees, with tips labelled by letters in a different order on each tree.]

SLIDE 33

Coalescence is faster in small populations

Change of population size and coalescents

[Figure: three panels plotted against time: Ne; coalescence events; the tree.]

The changes in population size will produce waves of coalescence.

The parameters of the growth curve for Ne can be inferred by likelihood methods as they affect the prior probabilities of those trees that fit the data.

SLIDE 34

Migration can be taken into account

Time

population #1 population #2

SLIDE 35

Recombination creates loops

Recomb.

Different markers have slightly different coalescent trees

SLIDE 36

Cann, Stoneking, and Wilson

[Photos: Becky Cann; Mark Stoneking; the late Allan Wilson]

Cann, R. L., M. Stoneking, and A. C. Wilson. 1987. Mitochondrial DNA and human evolution. Nature 325: 31-36.

SLIDE 37

Mitochondrial Eve

SLIDE 38

We want to be able to analyze human evolution

[Figure: the "Out of Africa" hypothesis, with branches labelled Africa, Europe, and Asia (vertical scale is not time or evolutionary change).]

SLIDE 39

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

SLIDE 40

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

SLIDE 41

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

SLIDE 42

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

SLIDE 43

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

SLIDE 44

coalescent and “gene trees” versus species trees

Consistency of gene tree with species tree

coalescence time

SLIDE 45

If the branch is more than Ne generations long ...

t1 t2 N1 N2 N4 N3 N5

Gene tree and Species tree

SLIDE 46

If the branch is more than Ne generations long ...

t1 t2 N1 N2 N4 N3 N5

Gene tree and Species tree

SLIDE 47

If the branch is more than Ne generations long ...

t1 t2 N1 N2 N4 N3 N5

Gene tree and Species tree

SLIDE 48

How do we compute a likelihood for a population sample?

L = Prob (sample of aligned sequences, ...) = ??

[Figure: the sample is shown as blocks of aligned sequences; the distinct sequences appearing are CAGTTTTAGCGTCC, CAGTTTCAGCGTCC, CAGTTTTGGCGTCC, and CAGTTTCAGCGTAC, each repeated several times.]

SLIDE 49

If we have a tree for the sample sequences, we can

compute Prob (sequences | Genealogy)

[Figure: blocks of aligned sample sequences (CAGTTTTAGCGTCC, CAGTTTCAGCGTCC, CAGTTTTGGCGTCC, CAGTTTCAGCGTAC, ...) shown alongside a genealogy whose tips carry those sequences.]

So we can compute this, but how do we compute the overall likelihood from it?

SLIDE 50

The basic equation for coalescent likelihoods

In the case of a single population with parameters

  • Ne, the effective population size
  • µ, the mutation rate per site

and assuming G′ stands for a coalescent genealogy and D for the sequences,

$$L = \mathrm{Prob}(D \mid N_e, \mu) = \sum_{G'} \underbrace{\mathrm{Prob}(G' \mid N_e)}_{\text{Kingman's prior}} \; \underbrace{\mathrm{Prob}(D \mid G', \mu)}_{\text{likelihood of tree}}$$

SLIDE 51

Rescaling the branch lengths

Rescaling branch lengths of G′ so that branches are given in expected mutations per site, G = µG′, we get (if we let Θ = 4Neµ)

$$L = \sum_{G} \mathrm{Prob}(G \mid \Theta) \, \mathrm{Prob}(D \mid G)$$

as the fundamental equation. For more complex population scenarios one simply replaces Θ with a vector of parameters.

SLIDE 52

The variability comes from two sources

(1) Randomness of mutation

  • affected by the mutation rate µ
  • can reduce variance of the number of mutations per site per branch by examining more sites

(2) Randomness of coalescence of lineages

  • affected by effective population size Ne
  • coalescence times allow estimation of Ne
  • can reduce variability by looking at (i) more gene copies, or (ii) more loci

SLIDE 53

Computing the likelihood: averaging over coalescents

The likelihood calculation in a sample of two gene copies

[Figure: the prior probability of t, the likelihood of t, and the likelihood of Θ, with the value Θ1 marked.]

The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter.

SLIDE 54

Computing the likelihood: averaging over coalescents

The likelihood calculation in a sample of two gene copies

[Figure: the prior probability of t, the likelihood of t, and the likelihood of Θ, with the value Θ2 marked.]

The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter.

SLIDE 55

Computing the likelihood: averaging over coalescents

The likelihood calculation in a sample of two gene copies

[Figure: the prior probability of t, the likelihood of t, and the likelihood of Θ, with the value Θ3 marked.]

The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter.

SLIDE 56

Computing the likelihood: averaging over coalescents

The likelihood calculation in a sample of two gene copies

[Figure: the prior probability of t, the likelihood of t, and the likelihood of Θ, with the values Θ1, Θ2, and Θ3 all marked.]

The product of the prior on t, times the likelihood of that t from the data, when integrated over all possible t’s, gives the likelihood for the underlying parameter.
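A worked sketch of that integral for two gene copies, under assumptions that are not in the slides: mutations fall as a Poisson process on the two branches (so the number of differences d among `sites` sites has mean 2·t·sites), and t is measured in expected mutations per site, so its coalescent prior is exponential with mean Θ/2. Under these assumptions the integral also has a closed form, which the code uses as a check.

```python
import math

def prior_density(t, theta):
    """Coalescent prior on t for two copies: exponential with mean theta/2."""
    return (2.0 / theta) * math.exp(-2.0 * t / theta)

def likelihood_of_t(t, d, sites):
    """Assumed Poisson model: d differences on two branches of length t."""
    mean = 2.0 * t * sites
    return math.exp(-mean) * mean ** d / math.factorial(d)

def likelihood_of_theta(theta, d, sites, steps=20000):
    """Integrate prior(t) * likelihood(t) over t with the midpoint rule."""
    rate = 2.0 / theta + 2.0 * sites   # decay rate of the integrand
    h = (60.0 / rate) / steps          # beyond 60/rate the integrand is negligible
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += prior_density(t, theta) * likelihood_of_t(t, d, sites) * h
    return total

def closed_form(theta, d, sites):
    """Same integral done analytically: a geometric distribution in d."""
    x = theta * sites
    return x ** d / (1.0 + x) ** (d + 1)

# Illustrative data: 3 differences between two sequences of 500 sites.
for theta in (0.002, 0.006, 0.02):
    print(theta, likelihood_of_theta(theta, 3, 500), closed_form(theta, 3, 500))
```

Evaluating the integral at several values of Θ traces out the likelihood curve that the three marked values in the figure illustrate.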

SLIDE 57

Labelled histories

Labelled Histories (Edwards, 1970; Harding, 1971)

Trees that differ in the time−ordering of their nodes

These two are the same:

[Figure: two trees, each with tips A B C D]

These two are different:

[Figure: two trees, each with tips A B C D]

SLIDE 58

Sampling approaches to coalescent likelihood

[Photos: Bob Griffiths; Simon Tavaré; Mary Kuhner and Jon Yamato]

SLIDE 59

Monte Carlo integration

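To get the area under a curve, we can either evaluate the function (f(x)) at a series of grid points and add up heights × widths,

[Figure: a curve evaluated at evenly spaced grid points]

or we can sample at random the same number of points and add up heights × widths.

[Figure: the same curve evaluated at randomly placed points]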
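The two approaches can be compared side by side. This is a minimal sketch; the integrand (sin x on [0, π], whose area is exactly 2), the function names, and the sample sizes are chosen for illustration only.

```python
import math
import random

def grid_integral(f, a, b, n):
    """Evaluate f at n evenly spaced midpoints and add up heights x widths."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) * width for i in range(n))

def monte_carlo_integral(f, a, b, n, rng):
    """Evaluate f at n uniformly random points and add up heights x widths."""
    width = (b - a) / n
    return sum(f(rng.uniform(a, b)) * width for _ in range(n))

rng = random.Random(3)
print("grid:       ", grid_integral(math.sin, 0.0, math.pi, 1000))
print("Monte Carlo:", monte_carlo_integral(math.sin, 0.0, math.pi, 1000, rng))
```

In one dimension the grid is far more accurate for the same number of points; random sampling becomes attractive when the space is too large to cover with a grid, as the space of genealogies is.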

SLIDE 60

Importance sampling

SLIDE 61

Importance sampling

[Figure: two curves: f(x), the function we integrate, and g(x), the density we sample from.]

SLIDE 62

The math of importance sampling

$$\int f(x)\,dx = \int \frac{f(x)}{g(x)}\, g(x)\,dx = E_g\left[\frac{f(x)}{g(x)}\right]$$

which is the expectation for points sampled from g(x) of the ratio f(x)/g(x).

This is approximated by sampling a lot (n) of points from g(x) and then computing the average:

$$L = \frac{1}{n} \sum_{i=1}^{n} \frac{f(x_i)}{g(x_i)}$$
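A minimal sketch of that average. The integrand f(x) = exp(−x²/2), whose integral over the real line is √(2π), and the sampling density g (a normal density with standard deviation 1.5) are illustrative choices, not taken from the lecture.

```python
import math
import random

def f(x):
    """The function we integrate (an unnormalised bell curve)."""
    return math.exp(-0.5 * x * x)

SIGMA = 1.5  # spread of the sampling density g; an arbitrary choice

def g(x):
    """The density we sample from: normal with mean 0, s.d. SIGMA."""
    return math.exp(-0.5 * (x / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def importance_sampling_estimate(n, rng):
    """Average f(x_i) / g(x_i) over n points x_i drawn from g."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, SIGMA)
        total += f(x) / g(x)
    return total / n

rng = random.Random(11)
print(importance_sampling_estimate(100000, rng), math.sqrt(2.0 * math.pi))
```

The estimate is most stable when g is roughly proportional to f; here g is deliberately a little wider than f so that the ratio f/g stays bounded in the tails.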

SLIDE 63

The importance function used in LAMARC

In Mary Kuhner and Jon Yamato’s program LAMARC they use as the importance function the probability density of the tree given the data at a set of “driving values” Θ0 of the parameters:

$$f(G) = \frac{\mathrm{Prob}(D \mid G)\, \mathrm{Prob}(G \mid \Theta_0)}{\mathrm{Prob}(D \mid \Theta_0)}$$

The denominator is impossible to evaluate but, as we will see, isn’t really needed. The resulting likelihood ratio is

$$\frac{L(\Theta)}{L(\Theta_0)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{Prob}(G_i \mid \Theta)}{\mathrm{Prob}(G_i \mid \Theta_0)}$$
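A toy illustration of the likelihood-ratio average, not LAMARC's actual code. It assumes the two-gene-copy case, where the "genealogy" is a single coalescence time t with an exponential prior of mean Θ/2, and an assumed Poisson mutation model with d differences over `sites` sites. Under those assumptions the density of t given the data at Θ0 is a gamma density, so it can be sampled directly here instead of by MCMC, and the exact ratio is available as a check.

```python
import math
import random

def prior_density(t, theta):
    """Prob(G | theta) in the toy model: exponential with mean theta/2."""
    return (2.0 / theta) * math.exp(-2.0 * t / theta)

def sample_times_at_driving_value(theta0, d, sites, n, rng):
    """Draw t from Prob(t | D, theta0): gamma with shape d+1 and
    rate 2/theta0 + 2*sites (gammavariate takes shape and scale)."""
    rate = 2.0 / theta0 + 2.0 * sites
    return [rng.gammavariate(d + 1, 1.0 / rate) for _ in range(n)]

def likelihood_ratio(theta, theta0, times):
    """Average of Prob(G_i | theta) / Prob(G_i | theta0) over the samples."""
    ratios = (prior_density(t, theta) / prior_density(t, theta0) for t in times)
    return sum(ratios) / len(times)

def exact_likelihood(theta, d, sites):
    """Closed form of the toy-model likelihood, used only for comparison."""
    x = theta * sites
    return x ** d / (1.0 + x) ** (d + 1)

rng = random.Random(8)
times = sample_times_at_driving_value(0.01, d=3, sites=500, n=100000, rng=rng)
for theta in (0.005, 0.01, 0.02):
    estimate = likelihood_ratio(theta, 0.01, times)
    exact = exact_likelihood(theta, 3, 500) / exact_likelihood(0.01, 3, 500)
    print(theta, round(estimate, 4), round(exact, 4))
```

One set of samples drawn at Θ0 serves for every Θ, which is what lets a single run trace out a whole likelihood curve; the estimate degrades as Θ moves far from the driving value.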

SLIDE 64

Markov Chain Monte Carlo (MCMC) methods

To do the importance sampling, MCMC methods are employed (in all programs that do full likelihood or Bayesian analyses). To sample from f(G), start with a tree G_old and

  • 1. Have a “proposal distribution” from which you sample a new tree G_new
  • 2. Compute the function f(G_new) (we have that also for the old tree)
  • 3. Draw a random fraction R between 0 and 1
  • 4. If R < f(G_new) / f(G_old), accept the new tree. (Note that in that ratio any horrible, but shared, denominators cancel out.)

Repeat this vast numbers of times (the correct number of times is infinity).
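The four steps above can be sketched on a one-dimensional target instead of a tree. Everything specific here is an assumption for illustration: the target (an unnormalised bell curve centred at 3), the symmetric uniform proposal, and the step size. With a symmetric proposal the acceptance rule is exactly the ratio test in step 4; an asymmetric proposal would need an extra correction factor.

```python
import math
import random

def unnormalised_target(x):
    """f(x) up to a constant; the missing constant cancels in the ratio."""
    return math.exp(-0.5 * (x - 3.0) ** 2)

def metropolis(f, x_start, n_steps, step_size, rng):
    """Metropolis sampler with a symmetric uniform proposal."""
    x_old, f_old = x_start, f(x_start)
    samples = []
    for _ in range(n_steps):
        x_new = x_old + rng.uniform(-step_size, step_size)  # 1. propose
        f_new = f(x_new)                                    # 2. evaluate f
        r = rng.random()                                    # 3. random fraction
        if r < f_new / f_old:                               # 4. accept?
            x_old, f_old = x_new, f_new
        samples.append(x_old)  # a rejected proposal repeats the old state
    return samples

rng = random.Random(5)
chain = metropolis(unnormalised_target, 0.0, 50000, 1.0, rng)
kept = chain[1000:]  # discard an initial burn-in stretch
print("sample mean:", sum(kept) / len(kept))
```

Recording the old state again after a rejection is essential; dropping rejected steps would bias the sample.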

SLIDE 65

Rearrangement to sample points in tree space

A conditional coalescent rearrangement strategy

SLIDE 66

Dissolving a branch and regrowing it backwards

First pick a random node (interior or tip) and remove its subtree

SLIDE 67

We allow it to coalesce with the other branches

Then allow this node to re−coalesce with the tree

SLIDE 68

and this gives another coalescent

The resulting tree proposed by this process

SLIDE 69

An example of an MCMC likelihood curve

[Figure: ln L (vertical axis, −80 to −10) plotted against Θ (horizontal axis, logarithmic scale from 0.001 to 0.1), with the value 0.00650776 marked.]

Results of analysing a data set with 50 sequences of 500 bases which was simulated with a true value of Θ = 0.01

SLIDE 70

Major MCMC likelihood or Bayesian programs

  • LAMARC by Mary Kuhner and Jon Yamato and others. Likelihood inference with multiple populations, recombination, migration, population growth. No historical branching events or serial sampling, yet.
  • BEAST by Andrew Rambaut, Alexei Drummond and others. Bayesian inference with multiple populations related by a tree. Support for serial sampling (no migration or recombination yet).
  • genetree by Bob Griffiths and Melanie Bahlo. Likelihood inference of migration rates and changes in population size. No recombination or historical branching events.
  • migrate by Peter Beerli. Likelihood inference with multiple populations and migration rates. No recombination or historical branching events yet.
  • IM and IMa by Rasmus Nielsen and Jody Hey. Two or more populations allowing both historical splitting and migration after that. No recombination yet.

SLIDE 71

Approximate Bayesian Computation (ABC) methods

These involve approximating the sampling by computing some “summary statistics” from the data, then finding parameter values that, in a simulation of a tree and data, result in summary statistic values close to these.

They are faster, and very popular now. But ... they are very dependent on getting the right summary statistics so as not to lose too much power compared to fully-powerful likelihood or Bayesian MCMC methods.
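A minimal rejection-sampling sketch of the ABC idea, under assumptions that are not in the slides: the data are two gene copies, the summary statistic is the number of differences between them, mutations follow a Poisson process, and Θ has a uniform prior on (0, theta_max). All names and numbers are illustrative.

```python
import random

def simulate_differences(theta, sites, rng):
    """Simulate a two-copy tree, then data, and return the summary statistic."""
    t = rng.expovariate(2.0 / theta)   # coalescence time, mean theta/2
    mean = 2.0 * t * sites             # expected differences on both branches
    # Poisson draw: count unit-rate exponential arrivals that land before `mean`
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def abc_rejection(observed, sites, n_draws, tolerance, theta_max, rng):
    """Keep prior draws of theta whose simulated statistic is close to the data."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, theta_max)   # draw from the (assumed) prior
        if theta <= 0.0:
            continue                          # guard against a zero rate
        if abs(simulate_differences(theta, sites, rng) - observed) <= tolerance:
            accepted.append(theta)
    return accepted

rng = random.Random(21)
kept = abc_rejection(observed=3, sites=500, n_draws=20000,
                     tolerance=0, theta_max=0.05, rng=rng)
print(len(kept), "accepted; mean theta =", sum(kept) / len(kept))
```

The accepted values approximate the posterior for Θ given the summary statistic; how much information is lost relative to a full likelihood depends entirely on which statistics are chosen.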
