The Coalescent: Evolution backward in time
Joachim Hermisson
Mathematics and Biosciences Group, Mathematics & MFPL, University of Vienna
Introduction to the Coalescent
data, data, data, …
Massive accumulation of DNA sequence data
- 1980s: PhD projects of 3-4 years to sequence a single gene (some 1000 base pairs)
- 1990 – 2003: Human Genome Project (~ 3 × 10^9 (3 billion) bases)
  expected: 3 billion $, final: ~ 300 million $
- since 2010: 1000 Genomes Project, 4000 $ – 10000 $ per genome, soon < 1000 $
- today: extended to 2500 (25 x 100), completed May 2013
  1000 genomes also for Drosophila, Arabidopsis …
Introduction to the Coalescent
data, data, data, …
A C A T T A A G C G T A G A C T T A G G T G T T G C
A C A T T A A G C C T A G A C A T A G G T G T T G C
A G A T T C A G C C T A G A C T T A G G T G A T G C
A G A T T C A G C C T A G A C T T A G G T G T T G C
A C A T T A A G C C T A G A C A T A G G T G T T G C
A C A T T C A G C C T A G A C T T A G T T G T T G C
Patterns of Evolution
”Summary Statistics”
Sample size (n = 6)
Sequence alignment (length m = 26)
Number of possible alignments: \( 4^{6 \times 26} \approx 8.3 \times 10^{93} \)
- only polymorphic sites …
- compare with outgroup …

site       2  6 10 16 20 23
seq 1      C  A  G  T  G  T
seq 2      C  A  C  A  G  T
seq 3      G  C  C  T  G  A
seq 4      G  C  C  T  G  T
seq 5      C  A  C  A  G  T
seq 6      C  C  C  T  T  T
outgroup   G  A  C  T  G  T

- forget about molecular state … (assumes infinite sites mutation model)
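As a rough illustration of the reduction steps above, the following Python sketch extracts the polymorphic columns of the six example sequences and recodes them as 0/1 (ancestral/derived) relative to an outgroup. The names `ALIGNMENT`, `OUTGROUP`, `segregating_sites`, and `derived_matrix` are made up for this example, and the full outgroup sequence is an assumption, chosen so that the derived-allele counts match the mutation sizes 4 3 1 2 1 1 quoted on the slides.

```python
# Example alignment from the slides: n = 6 sequences, m = 26 sites.
ALIGNMENT = [
    "ACATTAAGCGTAGACTTAGGTGTTGC",
    "ACATTAAGCCTAGACATAGGTGTTGC",
    "AGATTCAGCCTAGACTTAGGTGATGC",
    "AGATTCAGCCTAGACTTAGGTGTTGC",
    "ACATTAAGCCTAGACATAGGTGTTGC",
    "ACATTCAGCCTAGACTTAGTTGTTGC",
]
# Assumed outgroup sequence (hypothetical, see the lead-in).
OUTGROUP = "AGATTAAGCCTAGACTTAGGTGTTGC"


def segregating_sites(alignment):
    """Return the 0-based positions at which the sample is polymorphic."""
    return [i for i, column in enumerate(zip(*alignment)) if len(set(column)) > 1]


def derived_matrix(alignment, outgroup):
    """0/1 matrix over the segregating sites: 1 where a sequence differs from the outgroup."""
    sites = segregating_sites(alignment)
    return [[int(seq[i] != outgroup[i]) for i in sites] for seq in alignment]


matrix = derived_matrix(ALIGNMENT, OUTGROUP)
sizes = [sum(column) for column in zip(*matrix)]  # mutation "size" per site
```

Once the molecular state is dropped, only this 0/1 matrix is kept; the column sums are the mutation sizes.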
Patterns of Evolution
Summary statistics based on segregating sites
- number of segregating sites and allele frequencies
- associations not important (“molecular bean bag“)
- genome position does not matter
mutation “size”: 4 3 1 2 1 1
Patterns of Evolution
Summary statistics based on segregating sites
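Site Frequency Spectrum
[Figure: histogram for the mutation sizes 4 3 1 2 1 1; horizontal axis: mutation size (1 to 5), vertical axis: number of mutations (1 to 3)]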
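The site frequency spectrum simply counts how many mutations have each size. A minimal Python sketch, using the mutation sizes quoted above (the function name `site_frequency_spectrum` is chosen here for illustration only):

```python
from collections import Counter


def site_frequency_spectrum(sizes, n):
    """Return [xi_1, ..., xi_{n-1}]: number of mutations of size k = 1 .. n-1."""
    counts = Counter(sizes)
    return [counts.get(k, 0) for k in range(1, n)]


# Mutation sizes from the example with sample size n = 6.
sfs = site_frequency_spectrum([4, 3, 1, 2, 1, 1], n=6)
```

The total number of segregating sites S is the sum of the spectrum.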
Reconstruction of evolutionary history
- Process (selection and demographic events) → Pattern (distributions for summary statistics (S, π))
- Statistical reconstruction: observed patterns (S, π from data) → estimated parameters
Reconstruction of evolutionary history
- Process (standard neutral model) → Pattern (distributions?)
Patterns of Evolution
What does pure randomness look like?
- Null model of evolutionary theory
Neutral genetic variation
- single locus, multiple alleles
Drift:
- random sampling of parents
- k types: multinomial offspring distribution
Mutation:
- probability u for each offspring
- infinite alleles model: every mutation leads to a new allele (“new color”)
[Figure: population (size 2N) in generations 1, 2, …]
Patterns of Evolution
Wright-Fisher model
[Figure: Wright-Fisher population over successive generations, with a sample taken from the present generation]
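The drift and mutation rules of the neutral Wright-Fisher model can be sketched in a few lines of Python. This is a minimal forward simulation under stated assumptions: alleles are integer labels, each offspring picks its parent uniformly at random (which is equivalent to multinomial sampling of offspring numbers), and the names `wright_fisher_generation` and `simulate` are illustrative only.

```python
import random
from itertools import count


def wright_fisher_generation(parents, u, new_alleles, rng):
    """Produce the next generation of the same size (one allele label per gene copy)."""
    offspring = []
    for _ in range(len(parents)):
        allele = rng.choice(parents)      # drift: each offspring picks a random parent
        if rng.random() < u:              # mutation with probability u per offspring
            allele = next(new_alleles)    # infinite alleles: always a brand-new label
        offspring.append(allele)
    return offspring


def simulate(two_n, u, generations, seed=1):
    """Run the model forward, starting from a monomorphic population of size 2N."""
    rng = random.Random(seed)
    new_alleles = count(1)                # allele 0 is the initial type
    population = [0] * two_n
    for _ in range(generations):
        population = wright_fisher_generation(population, u, new_alleles, rng)
    return population


final_population = simulate(two_n=100, u=0.01, generations=200)
```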
Patterns of Evolution
coalescence process
All information about the genetic variation pattern is contained in the sample genealogy.
Construct a process to generate genealogies: the “coalescence process” (continuous time).
Coalescent Theory
The standard neutral model
Haploid Wright-Fisher population of size 2N:
- Genetic differences have no consequences on fitness
- No population subdivision
- Constant population size
Individuals are equivalent with respect to descent:
- Exchangeable offspring distribution, independent of any state label (genotype, location, age, …)
- Wright-Fisher: multinomial sampling
“State” and “Descent” are decoupled. 2 steps:
- 1. Construct genealogy independently of the state
- 2. Decide on the state only afterwards
Coalescent Theory
Construction of the Genealogy: Sample Size 2
Coalescence probability (population size 2N) …
… in a single generation:
\[ p_{c,1} = \frac{1}{2N} \]
… for exactly t generations:
\[ p_{c,t} = \left(1 - \frac{1}{2N}\right)^{t-1} \frac{1}{2N} \]
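The pairwise coalescence time in generations is geometrically distributed with success probability 1/(2N). A quick numerical sanity check in Python (the function name `pair_coalescence_prob` and the value N = 50 are chosen here purely for illustration):

```python
def pair_coalescence_prob(t, N):
    """P(two lineages find their common ancestor exactly t generations back)."""
    p = 1.0 / (2 * N)
    return (1.0 - p) ** (t - 1) * p


N = 50
generations = range(1, 20001)             # long enough that the tail is negligible
probs = [pair_coalescence_prob(t, N) for t in generations]
total = sum(probs)                         # should be very close to 1
mean_time = sum(t * p for t, p in zip(generations, probs))  # should be close to 2N
```

The mean of 2N generations is what motivates measuring time in units of 2N generations.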
Coalescent Theory
Construction of the Genealogy: Sample Size n
Multiple (e.g. triple) mergers:
\[ p_{\text{triple}} = \frac{1}{(2N)^2} = \frac{1}{4N^2} \]
Multiple coalescences (more than one merger in the same generation):
\[ \Pr \sim p_{c,t}^{2} \sim N^{-2} \]
Both can be ignored if N >> n:
- only binary mergers for \( N \to \infty \): “Kingman coalescent”
Coalescent Theory
Construction of the Genealogy: Sample Size n
Coalescence probability (single binary merger, population size 2N) …
… in a single generation:
\[ p_{c,1}^{(n)} = \binom{n}{2} \frac{1}{2N} = \frac{n(n-1)}{4N} \]
… for exactly t generations:
\[ p_{c,t}^{(n)} = \left(1 - \frac{n(n-1)}{4N}\right)^{t-1} \frac{n(n-1)}{4N} \]
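To see how good the binary-merger approximation is, one can compare n(n-1)/(4N) with the exact probability that at least two of n lineages share a parent in the previous Wright-Fisher generation. The sketch below does this in Python; the function names are illustrative, and the exact expression (all n lineages picking distinct parents among 2N) is a standard Wright-Fisher calculation that is not spelled out on the slides.

```python
from math import comb, prod


def prob_binary_merger(n, N):
    """Approximate per-generation coalescence probability: C(n, 2) / (2N)."""
    return comb(n, 2) / (2 * N)


def prob_any_coalescence(n, N):
    """Exact probability that the n lineages do NOT all have distinct parents."""
    all_distinct = prod(1.0 - i / (2 * N) for i in range(1, n))
    return 1.0 - all_distinct


n, N = 10, 10_000
approx = prob_binary_merger(n, N)
exact = prob_any_coalescence(n, N)
relative_error = abs(approx - exact) / exact   # of order n^2 / N
```

For N >> n the two values agree closely, which is the regime in which multiple mergers can be ignored.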
Coalescent Theory
Distribution of Coalescence Times
Define coalescence time scale: \( \tau = t / (2N) \)
Coalescence time T_2 for sample size 2:
\[ \Pr[T_2 > \tau] = \left(1 - \frac{1}{2N}\right)^{2N\tau} \;\to\; \exp(-\tau) \quad (N \to \infty) \]
Exponential distribution with parameter 1:
\[ \mathrm{E}[T_2] = 1 \quad \text{(2N generations)} \]
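The exponential limit follows from a first-order expansion of the logarithm. A short worked step, assuming time is measured as \( \tau = t/(2N) \) so that t = 2Nτ generations correspond to scaled time τ:

```latex
\[
\left(1 - \frac{1}{2N}\right)^{2N\tau}
= \exp\!\left[\, 2N\tau \,\ln\!\left(1 - \frac{1}{2N}\right) \right]
= \exp\!\left[\, -\tau + O\!\left(\frac{\tau}{N}\right) \right]
\;\xrightarrow{\;N \to \infty\;}\; e^{-\tau}
\]
```

The correction term is of order τ/N, so the approximation is accurate whenever the population is large compared with the time span of interest measured in units of 2N generations.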
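Coalescent Theory
Distribution of Coalescence Times
Coalescence time T_n with sample size n:
\[ \Pr[T_n > \tau] = \left(1 - \binom{n}{2} \frac{1}{2N}\right)^{2N\tau} \;\to\; \exp\!\left(-\binom{n}{2} \tau\right) \]
Exponential distribution with parameter \( \binom{n}{2} = \frac{n(n-1)}{2} \):
\[ \mathrm{E}[T_n] = \frac{2}{n(n-1)} \]
iterate until most recent common ancestor (MRCA)
[Figure: genealogy with coalescence time on the vertical axis and intervals T4, T3, T2]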
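Sampling the coalescence times is a one-liner per interval, since each T_k is exponential with rate k(k-1)/2 on the coalescent time scale. A minimal Python sketch (the function name `sample_coalescence_times` is an illustrative choice):

```python
import random
from math import comb


def sample_coalescence_times(n, rng):
    """Return {k: T_k} for k = n, ..., 2, in units of 2N generations."""
    # random.expovariate takes the rate, so the mean of T_k is 1 / C(k, 2).
    return {k: rng.expovariate(comb(k, 2)) for k in range(n, 1, -1)}


rng = random.Random(0)
times = sample_coalescence_times(4, rng)   # one draw of T4, T3, T2
t_mrca = sum(times.values())               # time to the MRCA for this draw
```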
Coalescent Theory
Tree Topologies
- pick two random individuals from the sample and merge
- sample size n → n-1 and iterate until n = 1 (MRCA): “random bifurcating tree”
- all individuals exchangeable
- topology invariant under permutation of “leaves”
[Figure: example genealogies with the same topology and with a different topology; vertical axis: coalescence time with intervals T4, T3, T2]
Distribution of tree topologies
- independent of coalescence times
- depends only on the separation of state and descent and on the “no multiple merger” condition
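The merging rule above translates directly into code. The following Python sketch builds a random bifurcating topology as nested tuples; the representation and the names `random_topology` and `leaves` are choices made for this example only.

```python
import random


def random_topology(n, rng):
    """Merge two randomly chosen lineages until a single one (the MRCA) is left."""
    lineages = list(range(n))                       # leaves are labelled 0 .. n-1
    while len(lineages) > 1:
        i, j = sorted(rng.sample(range(len(lineages)), 2))
        right = lineages.pop(j)                     # pop the larger index first
        left = lineages.pop(i)
        lineages.append((left, right))              # new internal node
    return lineages[0]


def leaves(tree):
    """List the leaf labels below a node of the nested-tuple tree."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


tree = random_topology(6, random.Random(3))
```

No coalescence times are needed here, which reflects the independence of topology and times.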
Coalescent Theory
Mutation “Dropping”
- only number of mutations on each branch matters
Infinite sites mutation model: mutation rate u, all mutations on the genealogy are visible as polymorphisms on different sites
- Poisson distributed with parameter \( \theta L / 2 = 2NuL \), i.e. \( \theta = 4Nu \)
- branch length of a branch from state j through k (state = number of lineages): \( L = \sum_{i=k}^{j} T_i \)
(also other mutation schemes possible)
[Figure: genealogy with states 4, 3, 2]
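Dropping mutations on a single branch can be sketched as follows in Python. The standard library has no Poisson sampler, so this example counts the events of a unit-rate Poisson process, which is one common way to draw a Poisson variate; the helper names and the convention θ = 4Nu with branch lengths in units of 2N generations are assumptions of the sketch.

```python
import random


def poisson_count(mean, rng):
    """Number of events of a unit-rate Poisson process in the interval [0, mean]."""
    events, t = 0, rng.expovariate(1.0)
    while t < mean:
        events += 1
        t += rng.expovariate(1.0)
    return events


def mutations_on_branch(times, j, k, theta, rng):
    """Drop mutations on a branch that exists from state j down to state k (j >= k)."""
    length = sum(times[i] for i in range(k, j + 1))   # L = T_k + ... + T_j
    return poisson_count(theta * length / 2.0, rng)   # Poisson with mean theta * L / 2


# Illustrative coalescence times {state: T_state}, not taken from the slides.
example_times = {4: 0.1, 3: 0.3, 2: 1.0}
n_mut = mutations_on_branch(example_times, j=4, k=3, theta=10.0, rng=random.Random(7))
```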
Coalescent Theory
Basic Properties
Three independent stochastic factors determine the polymorphism pattern:
- 1. coalescence times
- 2. tree topology
- 3. mutation
(very easy to implement in simulations)
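As a rough illustration of how little code such a simulation needs, the following Python sketch combines the three factors and returns the site frequency spectrum of one replicate. It is a minimal sketch under the standard neutral model with infinite sites, not a substitute for a dedicated coalescent simulator; all names are illustrative, and tracking only the number of sampled descendants per lineage is a shortcut that suffices for the spectrum.

```python
import random
from math import comb


def poisson_count(mean, rng):
    """Number of events of a unit-rate Poisson process in the interval [0, mean]."""
    events, t = 0, rng.expovariate(1.0)
    while t < mean:
        events += 1
        t += rng.expovariate(1.0)
    return events


def simulate_sfs(n, theta, rng):
    """One coalescent replicate; sfs[k-1] is the number of mutations of size k."""
    sfs = [0] * (n - 1)
    lineages = [1] * n                       # number of sampled descendants per lineage
    while len(lineages) > 1:
        k = len(lineages)
        t_k = rng.expovariate(comb(k, 2))    # 1. coalescence time in state k
        for size in lineages:                # 3. mutations on each branch segment
            sfs[size - 1] += poisson_count(theta * t_k / 2.0, rng)
        i, j = sorted(rng.sample(range(k), 2))   # 2. topology: merge a random pair
        merged = lineages.pop(j) + lineages.pop(i)
        lineages.append(merged)
    return sfs


replicate = simulate_sfs(n=6, theta=5.0, rng=random.Random(1))
```

Averaging many replicates gives Monte Carlo estimates of the expected number of segregating sites and of the expected spectrum.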
Coalescent Theory
Basic Properties
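Time to the most recent common ancestor:
\[ \mathrm{E}[T_{MRCA}] = \sum_{k=2}^{n} \mathrm{E}[T_k] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2 \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 2 \left(1 - \frac{1}{n}\right) \]
Compare: \( \mathrm{E}[T_2] = 1 \). More than half for the last two branches!
[Figure: genealogy with coalescence intervals T4, T3, T2]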
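The telescoping sum can be checked exactly with rational arithmetic. A small Python sketch (the function name `expected_tmrca` is illustrative):

```python
from fractions import Fraction


def expected_tmrca(n):
    """E[T_MRCA] in units of 2N generations, as the sum of E[T_k] = 2 / (k (k - 1))."""
    return sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))


# Share of the expected tree height spent in the last interval T2, where E[T2] = 1.
share_of_t2 = {n: Fraction(1) / expected_tmrca(n) for n in (2, 5, 10, 100)}
```

Because the expected height never exceeds 2, the final interval with two lineages always accounts for more than half of it.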
Coalescent Theory
Basic Properties
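Total length of the tree and expected number of polymorphic sites:
\[ \mathrm{E}[L_{tree}] = \sum_{k=2}^{n} k \, \mathrm{E}[T_k] = 2 \sum_{k=1}^{n-1} \frac{1}{k} \]
\[ \mathrm{E}[S] = \frac{\theta}{2} \, \mathrm{E}[L_{tree}] = \theta \sum_{k=1}^{n-1} \frac{1}{k} = \theta \, a_n \]
with:
\[ a_n = \sum_{k=1}^{n-1} \frac{1}{k} \approx \log n + 0.577 \]
(logarithmic dependence on sample size)
[Figure: genealogy with coalescence intervals T4, T3, T2]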
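The following Python sketch evaluates a_n and the expected number of segregating sites, and compares a_n with the logarithmic approximation. Function names and the example value θ = 5 are illustrative assumptions.

```python
from math import log

EULER_GAMMA = 0.5772156649   # Euler-Mascheroni constant, the 0.577 in the approximation


def a_n(n):
    """Harmonic sum 1 + 1/2 + ... + 1/(n-1)."""
    return sum(1.0 / k for k in range(1, n))


def expected_segregating_sites(n, theta):
    """E[S] = theta * a_n under the standard neutral model."""
    return theta * a_n(n)


# Doubling the sample size adds only about theta * log(2) segregating sites.
gain = expected_segregating_sites(2000, 5.0) - expected_segregating_sites(1000, 5.0)
approx_error = abs(a_n(1000) - (log(1000) + EULER_GAMMA))
```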
Coalescent Theory
Basic Properties
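Expected site frequency spectrum:
\[ \mathrm{E}[\xi_k] = \frac{\theta}{k} \]
\( \xi_k \): number of mutations that appear k times in the sample (= of size k)
indeed:
\[ \mathrm{E}[S] = \sum_{k=1}^{n-1} \mathrm{E}[\xi_k] = \theta \sum_{k=1}^{n-1} \frac{1}{k} \]
in particular:
\[ \mathrm{E}[\xi_1] = \theta \]
[Figure: \( \mathrm{E}[\xi_k] / \theta \) plotted against k = 1, …, 6 for n = 7; vertical axis from 0 to 1]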
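The values shown in the plot for n = 7 can be reproduced with a short Python sketch. The function name `expected_sfs` is illustrative, and θ is set to 1 so that the output equals the normalized spectrum.

```python
def expected_sfs(n, theta):
    """Expected neutral site frequency spectrum: E[xi_k] = theta / k for k = 1 .. n-1."""
    return [theta / k for k in range(1, n)]


normalized = expected_sfs(7, theta=1.0)   # 1, 1/2, 1/3, 1/4, 1/5, 1/6
expected_s = sum(normalized)              # equals a_7 for theta = 1
```

An observed spectrum can be compared against these expectations, for example to look for an excess of singletons.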