The Coalescent: Evolution backward in time (Joachim Hermisson)



slide-1
SLIDE 1

The Coalescent

Evolution backward in time

Joachim Hermisson Mathematics and Biosciences Group Mathematics & MFPL, University of Vienna

slide-2
SLIDE 2

Massive accumulation of DNA sequence data

  • 1980s: PhD projects of 3-4 years to sequence a single gene (some 1000 base pairs)
  • 1990 – 2003: Human Genome Project (~3 × 10^9, i.e. 3 billion, bases); expected: 3 billion $, final: ~300 million $
  • since 2010: 1000 Genomes Project; 4000 $ – 10000 $ per genome, soon < 1000 $
  • today: extended to 2500 (25 × 100), completed May 2013; 1000 genomes also for Drosophila, Arabidopsis …

Introduction to the Coalescent

data, data, data, …

slide-3
SLIDE 3

A C A T T A A G C G T A G A C T T A G G T G T T G C
A C A T T A A G C C T A G A C A T A G G T G T T G C
A G A T T C A G C C T A G A C T T A G G T G A T G C
A G A T T C A G C C T A G A C T T A G G T G T T G C
A C A T T A A G C C T A G A C A T A G G T G T T G C
A C A T T C A G C C T A G A C T T A G T T G T T G C

Patterns of Evolution

”Summary Statistics”

Sample size (n = 6)

Sequence alignment (length m = 26)

\(4^{6 \times 26} \approx 8.3 \times 10^{93}\) possible alignments

slide-4
SLIDE 4

A C A T T A A G C G T A G A C T T A G G T G T T G C
A C A T T A A G C C T A G A C A T A G G T G T T G C
A G A T T C A G C C T A G A C T T A G G T G A T G C
A G A T T C A G C C T A G A C T T A G G T G T T G C
A C A T T A A G C C T A G A C A T A G G T G T T G C
A C A T T C A G C C T A G A C T T A G T T G T T G C

Patterns of Evolution

”Summary Statistics”

  • only polymorphic sites …
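
The reduction to polymorphic sites can be sketched in a few lines. This is only an illustration on the six example sequences of the slide; the names `ALIGNMENT` and `segregating_sites` are hypothetical and not part of the original presentation.

```python
# Alignment from the slide: n = 6 sequences of length m = 26.
ALIGNMENT = [
    "ACATTAAGCGTAGACTTAGGTGTTGC",
    "ACATTAAGCCTAGACATAGGTGTTGC",
    "AGATTCAGCCTAGACTTAGGTGATGC",
    "AGATTCAGCCTAGACTTAGGTGTTGC",
    "ACATTAAGCCTAGACATAGGTGTTGC",
    "ACATTCAGCCTAGACTTAGTTGTTGC",
]


def segregating_sites(alignment):
    """Return the 1-based positions at which the sequences are not all identical."""
    return [
        pos + 1
        for pos, column in enumerate(zip(*alignment))
        if len(set(column)) > 1
    ]


sites = segregating_sites(ALIGNMENT)
print(sites)       # positions of the polymorphic columns
print(len(sites))  # number of segregating sites S
```

Only these columns carry information about differences within the sample; all other columns are identical across the six sequences.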
slide-5
SLIDE 5

Patterns of Evolution

”Summary Statistics”

compare with outgroup …

sample 1   C A G T G T
sample 2   C A C A G T
sample 3   G C C T G A
sample 4   G C C T G T
sample 5   C A C A G T
sample 6   C C C T T T
outgroup   G A C T G T

slide-6
SLIDE 6

Patterns of Evolution

”Summary Statistics”

forget about molecular state … (assumes infinite sites mutation model)

slide-7
SLIDE 7

Patterns of Evolution

Summary statistics based on segregating sites

  • number of segregating sites and allele frequencies

mutation “size”: 4 3 1 2 1 1

slide-8
SLIDE 8

Patterns of Evolution

Summary statistics based on segregating sites

  • number of segregating sites and allele frequencies
  • associations not important (“molecular bean bag“)

mutation “size”: 4 3 1 2 1 1

slide-9
SLIDE 9

Patterns of Evolution

Summary statistics based on segregating sites

  • number of segregating sites and allele frequencies
  • associations not important (“molecular bean bag“)

mutation “size”: 4 3 1 2 1 1

  • genome position does not matter

slide-10
SLIDE 10

Patterns of Evolution

Summary statistics based on segregating sites

Site Frequency Spectrum

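mutation “size”: 4 3 1 2 1 1

[Figure: site frequency spectrum, number of sites (axis 1 to 3) against mutation size (axis 1 to 5)]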
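
As a rough sketch of how the mutation sizes and the site frequency spectrum follow from the polymorphic sites: the columns below are read off the example alignment, while the outgroup states are an assumption chosen to be consistent with the sizes 4 3 1 2 1 1 given on the slides. The function names are hypothetical.

```python
from collections import Counter

# Polymorphic columns of the example alignment (samples 1..6, top to bottom).
POLYMORPHIC_COLUMNS = ["CCGGCC", "AACCAC", "GCCCCC", "TATTAT", "GGGGGT", "TTATTT"]
# Assumed outgroup state per site (consistent with the sizes 4 3 1 2 1 1).
OUTGROUP = "GACTGT"


def mutation_sizes(columns, outgroup):
    """Per site, count the sampled sequences that differ from the outgroup state."""
    return [sum(base != anc for base in col) for col, anc in zip(columns, outgroup)]


def site_frequency_spectrum(sizes, n):
    """Entry k-1 is the number of sites whose mutation has size k, for k = 1..n-1."""
    counts = Counter(sizes)
    return [counts.get(k, 0) for k in range(1, n)]


sizes = mutation_sizes(POLYMORPHIC_COLUMNS, OUTGROUP)
sfs = site_frequency_spectrum(sizes, n=6)
print(sizes)
print(sfs)
```

The spectrum discards both the association between sites and their genome position, which is exactly the information the preceding slides declare unimportant.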

slide-11
SLIDE 11

Reconstruction of evolutionary history

Process (selection and demographic events) → Pattern (distributions for summary statistics (S, π))

Patterns of Evolution

Statistical reconstruction: observed patterns (S, π from data) → estimated parameters

slide-12
SLIDE 12

Reconstruction of evolutionary history

Process (standard neutral model) → Pattern (distributions ?)

Patterns of Evolution

What does pure randomness look like?

  • Null model of evolutionary theory
slide-13
SLIDE 13

Neutral genetic variation

  • single locus, multiple alleles

Drift:

  • random sampling of parents
  • k types: multinomial offspring distribution

Mutation:

  • probability u for each offspring
  • infinite alleles model: every mutation leads to a new allele (“new color”)

[Figure: population (size 2N) in generations 1 and 2]

Patterns of Evolution

Wright-Fisher model
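
A minimal forward-in-time sketch of this model, assuming a haploid population stored as a list of integer allele labels. The function names and the integer encoding of the “colors” are illustrative choices, not taken from the slides. Letting every offspring draw its parent independently and uniformly is equivalent to multinomial sampling of offspring numbers.

```python
import random


def wright_fisher_generation(population, u, next_allele, rng):
    """One generation: each offspring draws a random parent, then mutates with probability u."""
    offspring = []
    for _ in range(len(population)):
        allele = rng.choice(population)   # drift: random sampling of parents
        if rng.random() < u:              # mutation: always creates a new allele
            allele = next_allele
            next_allele += 1
        offspring.append(allele)
    return offspring, next_allele


def simulate(two_n, u, generations, seed=1):
    """Run the model from a monomorphic population of size two_n (= 2N)."""
    rng = random.Random(seed)
    population = [0] * two_n
    next_allele = 1
    for _ in range(generations):
        population, next_allele = wright_fisher_generation(
            population, u, next_allele, rng
        )
    return population


final = simulate(two_n=100, u=0.01, generations=500)
print(len(set(final)), "distinct alleles in the final generation")
```

Simulating the whole population forward is wasteful when only a small sample is of interest, which motivates the backward-in-time view of the following slides.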

slide-14
SLIDE 14

Patterns of Evolution

Wright-Fisher model

sample generation

slide-15
SLIDE 15

Patterns of Evolution

Wright-Fisher model

slide-16
SLIDE 16

Patterns of Evolution

Wright-Fisher model

slide-17
SLIDE 17

Patterns of Evolution

coalescence process

All information about the genetic variation pattern is contained in the sample genealogy.

slide-18
SLIDE 18

Patterns of Evolution

coalescence process

All information about the genetic variation pattern is contained in the sample genealogy.

Construct a process to generate genealogies: “coalescence process” (continuous time)

slide-19
SLIDE 19

Coalescent Theory

The standard neutral model

Haploid Wright-Fisher population of size 2N:

  • Genetic differences have no consequences on fitness
  • No population subdivision
  • Constant population size

Individuals are equivalent with respect to descent

Exchangeable offspring distribution, independent of any state label (genotype, location, age, …)

  • Wright-Fisher: multinomial sampling

“State” and “Descent” are decoupled

2 steps:

  • 1. Construct genealogy independently of the state
  • 2. Decide on the state only afterwards
slide-20
SLIDE 20

Coalescent Theory

Construction of the Genealogy: Sample Size 2

Coalescence probability …

… in a single generation:

\[ p_{c,1} = \frac{1}{2N} \]

[Figure: population of size 2N]

slide-21
SLIDE 21

Coalescent Theory

Construction of the Genealogy: Sample Size 2

Coalescence probability …

… in a single generation:

\[ p_{c,1} = \frac{1}{2N} \]

… for exactly t generations:

\[ p_{c,t} = \left(1 - \frac{1}{2N}\right)^{t-1} \frac{1}{2N} \]

[Figure: population of size 2N]
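
The waiting time for two lineages is geometric with success probability 1/(2N). The following sketch checks numerically that the probabilities sum to one and that the mean is 2N generations; the function name and the value N = 50 are arbitrary choices for illustration.

```python
def p_coalescence(t, N):
    """Probability that two lineages coalesce exactly t generations back (population size 2N)."""
    p = 1.0 / (2 * N)
    return (1.0 - p) ** (t - 1) * p


N = 50
horizon = 20000  # far enough back that the remaining probability mass is negligible
total = sum(p_coalescence(t, N) for t in range(1, horizon + 1))
mean = sum(t * p_coalescence(t, N) for t in range(1, horizon + 1))
print(total)  # close to 1
print(mean)   # close to 2N = 100
```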

slide-22
SLIDE 22

Coalescent Theory

Construction of the Genealogy: Sample Size n

Multiple (e.g. triple) mergers:

\[ p_{\mathrm{triple}} = \frac{1}{4N^{2}} \sim N^{-2} \]

[Figure: population of size 2N]

slide-23
SLIDE 23

Coalescent Theory

Construction of the Genealogy: Sample Size n

Multiple (e.g. triple) mergers:

\[ p_{\mathrm{triple}} = \frac{1}{4N^{2}} \sim N^{-2} \]

Multiple coalescences:

\[ \Pr \sim p_{c,t}^{2} \sim N^{-2} \]

[Figure: population of size 2N]

slide-24
SLIDE 24

Coalescent Theory

Construction of the Genealogy: Sample Size n

Multiple (e.g. triple) mergers:

\[ p_{\mathrm{triple}} = \frac{1}{4N^{2}} \sim N^{-2} \]

Multiple coalescences:

\[ \Pr \sim p_{c,t}^{2} \sim N^{-2} \]

can be ignored if N >> n:

  • only binary mergers for N → ∞ (“Kingman coalescent”)

[Figure: population of size 2N]

slide-25
SLIDE 25

Coalescent Theory

Construction of the Genealogy: Sample Size n

Coalescence probability (single binary merger) …

… in a single generation:

\[ p_{c,1}^{(n)} = \binom{n}{2} \frac{1}{2N} = \frac{n(n-1)}{4N} \]

… for exactly t generations:

\[ p_{c,t}^{(n)} = \left(1 - \frac{n(n-1)}{4N}\right)^{t-1} \frac{n(n-1)}{4N} \]

[Figure: population of size 2N]
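
How good the binary-merger approximation is can be checked against the exact Wright-Fisher probability that n lineages choose n distinct parents. This is a sketch with hypothetical function names; the difference between the two quantities is of order 1/N², in line with the argument that multiple mergers can be ignored for N >> n.

```python
from math import prod


def p_no_coalescence(n, N):
    """Exact probability that n lineages pick n distinct parents among 2N."""
    return prod(1 - i / (2 * N) for i in range(1, n))


def p_binary_merger_approx(n, N):
    """Leading-order coalescence probability n(n-1)/(4N)."""
    return n * (n - 1) / (4 * N)


for N in (100, 1000, 10000):
    exact_any = 1 - p_no_coalescence(10, N)   # any coalescence event at all
    approx = p_binary_merger_approx(10, N)    # single binary merger only
    print(N, exact_any, approx, abs(exact_any - approx))
```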

slide-26
SLIDE 26

Coalescent Theory

Distribution of Coalescence Times

Coalescence time T2 for sample size 2:

\[ \Pr[T_2 > \tau] = \left(1 - \frac{1}{2N}\right)^{2N\tau} \xrightarrow{N \to \infty} \exp(-\tau) \]

Define coalescence time scale:

\[ \tau = \frac{t}{2N} \]

Exponential distribution with parameter 1:

\[ E[T_2] = 1 \quad \text{(2N generations)} \]

[Figure: genealogy of two lineages with coalescence time T2]

slide-27
SLIDE 27

Coalescent Theory

Distribution of Coalescence Times

with sample size n:

\[ \Pr[T_n > \tau] = \left(1 - \binom{n}{2} \frac{1}{2N}\right)^{2N\tau} \xrightarrow{N \to \infty} \exp\left(-\binom{n}{2} \tau\right) \]

Exponential distribution with parameter \(\binom{n}{2} = \frac{n(n-1)}{2}\):

\[ E[T_n] = \frac{2}{n(n-1)} \]

iterate until most recent common ancestor (MRCA)

[Figure: genealogy with coalescence times T4, T3, T2]
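
Drawing the coalescence times is a matter of sampling independent exponential variables, one per number of remaining lineages. A minimal sketch follows, with times measured in units of 2N generations; the function name, seed and number of replicates are arbitrary.

```python
import random


def sample_coalescence_times(n, rng):
    """Return {k: T_k} for k = n..2, with T_k exponential with parameter k(k-1)/2."""
    return {k: rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1)}


rng = random.Random(42)
replicates = 20000
mean_t4 = sum(sample_coalescence_times(4, rng)[4] for _ in range(replicates)) / replicates
print(mean_t4)  # should be near E[T_4] = 2 / (4 * 3)
```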

slide-28
SLIDE 28

Coalescent Theory

Tree Topologies

  • pick two random individuals from the sample and merge
  • sample size n → n-1 and iterate until n = 1 (MRCA)

“random bifurcating tree”

  • all individuals exchangeable
  • topology invariant under permutation of “leaves”

[Figure: random bifurcating tree with coalescence time axis]

slide-29
SLIDE 29

Coalescent Theory

Tree Topologies

  • pick two random individuals from the sample and merge
  • sample size n → n-1 and iterate until n = 1 (MRCA)

“random bifurcating tree”

  • all individuals exchangeable
  • topology invariant under permutation of “leaves”

[Figure: trees with the same topology, with coalescence time axis]

slide-30
SLIDE 30

Coalescent Theory

Tree Topologies

  • pick two random individuals from the sample and merge
  • sample size n → n-1 and iterate until n = 1 (MRCA)

“random bifurcating tree”

  • all individuals exchangeable
  • topology invariant under permutation of “leaves”

[Figure: trees with different topology, with coalescence time axis]

slide-31
SLIDE 31

Coalescent Theory

Tree Topologies

  • pick two random individuals from the sample and merge
  • sample size n → n-1 and iterate until n = 1 (MRCA)

“random bifurcating tree”

  • all individuals exchangeable
  • topology invariant under permutation of “leaves”

Distribution of tree topologies

  • independent of coalescence times
  • depends only on the separation of state and descent and on the “no multiple merger” condition

[Figure: random bifurcating tree with coalescence time axis]
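
The merging rule translates directly into a short procedure. In this sketch a tree is represented as nested tuples of leaf labels, which is an illustrative choice and not something the slides prescribe; the function names are hypothetical.

```python
import random


def random_topology(n, rng):
    """Merge two uniformly chosen lineages until one remains (the MRCA)."""
    lineages = list(range(1, n + 1))
    while len(lineages) > 1:
        i, j = rng.sample(range(len(lineages)), 2)   # two distinct lineages
        merged = (lineages[i], lineages[j])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)                      # sample size n -> n-1
    return lineages[0]


def leaves(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


tree = random_topology(5, random.Random(3))
print(tree)
```

No coalescence times enter this procedure, which reflects the independence of topology and times stated above.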

slide-32
SLIDE 32

Coalescent Theory

Mutation “Dropping”

Infinite sites mutation model: mutation rate u, all mutations on the genealogy are visible as polymorphisms on different sites

  • only number of mutations on each branch matters
  • Poisson distributed with parameter

\[ 2NuL = \frac{\theta L}{2} \quad (\text{i.e. } \theta = 4Nu), \qquad L = \sum_{i=j}^{k} T_i \]

    where L is the branch length of a branch from state j through k

(also other mutation schemes possible)

[Figure: genealogy with coalescence times T4, T3, T2 and states 4, 3, 2]

slide-33
SLIDE 33

Coalescent Theory

Basic Properties

Three independent stochastic factors determine the polymorphism pattern:

  • 1. coalescent times
  • 2. tree topology
  • 3. mutation

(very easy to implement in simulations)
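
A minimal simulation sketch combining the three factors, assuming the infinite sites model with θ = 4Nu and times in units of 2N generations. Only the number of sampled descendants of each lineage is tracked, which is enough to obtain mutation sizes. The function names and the Poisson sampler built from exponential waiting times are illustrative choices.

```python
import random


def poisson(mean, rng):
    """Poisson draw obtained by counting rate-1 exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_sfs(n, theta, rng):
    """One coalescent replicate; entry k-1 of the result counts mutations of size k."""
    xi = [0] * (n - 1)
    sizes = [1] * n                               # sampled descendants per lineage
    for k in range(n, 1, -1):
        t_k = rng.expovariate(k * (k - 1) / 2)    # 1. coalescence time
        for size in sizes:                        # 3. mutations on each branch segment
            xi[size - 1] += poisson(theta / 2 * t_k, rng)
        i, j = rng.sample(range(k), 2)            # 2. topology: merge a random pair
        merged = sizes[i] + sizes[j]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (i, j)] + [merged]
    return xi


rng = random.Random(7)
print(simulate_sfs(6, 2.0, rng))
```

Summing the returned list gives the number of segregating sites S of the replicate, so repeated calls yield simulated distributions of the summary statistics under the standard neutral model.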

slide-34
SLIDE 34

Coalescent Theory

Basic Properties

Time to the most recent common ancestor:

\[ E[T_{\mathrm{MRCA}}] = \sum_{k=2}^{n} E[T_k] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2 \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 2\left(1 - \frac{1}{n}\right) \]

Compare: \(E[T_2] = 1\). More than half for the last two branches!

[Figure: genealogy with coalescence times T4, T3, T2]
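
The telescoping sum can be checked with exact rational arithmetic. This is a small sketch; the function name is hypothetical and times are in units of 2N generations.

```python
from fractions import Fraction


def expected_tmrca(n):
    """E[T_MRCA] as the sum of 2/(k(k-1)) for k = 2..n."""
    return sum(Fraction(2, k * (k - 1)) for k in range(2, n + 1))


for n in (2, 5, 10, 100):
    value = expected_tmrca(n)
    # E[T_2] = 1 is the share of the last two branches
    print(n, value, float(1 / value))
```

The printed share 1 / E[T_MRCA] stays above one half for every sample size, since E[T_MRCA] is always below 2.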

slide-35
SLIDE 35

Coalescent Theory

Basic Properties

Total length of the tree and expected number of polymorphic sites:

\[ E[L_{\mathrm{tree}}] = \sum_{k=2}^{n} k\,E[T_k] = 2 \sum_{k=1}^{n-1} \frac{1}{k} \]

\[ E[S] = \theta \sum_{k=1}^{n-1} \frac{1}{k} = \theta\, a_n \]

with:

\[ a_n = \sum_{k=1}^{n-1} \frac{1}{k} \approx \log n + 0.577 \]

(logarithmic dependence on sample size)

[Figure: genealogy with coalescence times T4, T3, T2]
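
A short numerical sketch of a_n and its logarithmic approximation. The constant is the Euler-Mascheroni constant, of which 0.577 is the rounded value; the function names are illustrative.

```python
from math import log

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant (rounded to 0.577 on the slide)


def a_n(n):
    """a_n as the sum of 1/k for k = 1..n-1."""
    return sum(1.0 / k for k in range(1, n))


def expected_segregating_sites(n, theta):
    """E[S] = theta * a_n."""
    return theta * a_n(n)


for n in (6, 20, 100, 1000):
    print(n, round(a_n(n), 4), round(log(n) + EULER_GAMMA, 4))
```

Because a_n grows only logarithmically, increasing the sample from 100 to 1000 sequences raises the expected number of segregating sites by well under a factor of two.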

slide-36
SLIDE 36

Coalescent Theory

Basic Properties

Expected site frequency spectrum:

\[ E[\xi_k] = \frac{\theta}{k} \]

\(\xi_k\): number of mutations that appear k times in the sample (= of size k)

in particular:

\[ E[\xi_1] = \theta \]

indeed:

\[ E[S] = \sum_{k=1}^{n-1} E[\xi_k] = \theta \sum_{k=1}^{n-1} \frac{1}{k} \]

[Figure: \(E[\xi_k]/\theta\) (axis 0 to 1) against k = 1, …, 6 for n = 7]
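
To connect the expected spectrum with the example alignment from the beginning, the sketch below compares the observed spectrum (mutation sizes 4 3 1 2 1 1, n = 6) with θ/k. The value of θ is obtained by solving E[S] = θ a_n for the observed S; this is an illustrative step added here, not one taken from the slides, and the function names are hypothetical.

```python
def a_n(n):
    """a_n as the sum of 1/k for k = 1..n-1."""
    return sum(1.0 / k for k in range(1, n))


def expected_sfs(n, theta):
    """E[xi_k] = theta / k for k = 1..n-1."""
    return [theta / k for k in range(1, n)]


observed = [3, 1, 1, 1, 0]             # example data: sizes 4 3 1 2 1 1, n = 6
n = len(observed) + 1
theta_from_s = sum(observed) / a_n(n)  # illustrative: solve E[S] = theta * a_n
expected = expected_sfs(n, theta_from_s)
for k, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(k, obs, round(exp, 2))
```

With only six segregating sites the comparison is very noisy; the point is the shape of the expectation, which decays as 1/k.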