Diffusion Models in Population Genetics (Laura Kubatko)

SLIDE 1

Diffusion Models in Population Genetics

Laura Kubatko kubatko.2@osu.edu MBI Workshop on Spatially-varying stochastic differential equations, with application to the biological sciences

July 10, 2015

Laura Kubatko Diffusion Models in Population Genetics July 10, 2015 1 / 24

SLIDE 2

Population Genetics

Population genetics: Study of genetic variation within a population
Assume that a gene has two alleles, call them A and a
Population is composed of N individuals who have two copies of each gene, so possible genotypes are: AA, Aa, aa
The population evolves over time
We are interested in the composition of the population at generation t
Need a model for how a generation is derived from the previous generation

SLIDE 3

Wright-Fisher Model

Assumptions:

◮ Population of 2N gene copies
◮ Discrete, non-overlapping generations of equal size
◮ Parents of next generation of 2N genes are picked randomly with replacement from preceding generation (genetic differences have no fitness consequences)
◮ Probability of a specific parent for a gene in the next generation is $\frac{1}{2N}$

SLIDE 4

Wright-Fisher Model

Source: Popvizard, a Python program to simulate evolution under the WF and other models, written by Peter Beerli
http://people.sc.fsu.edu/ pbeerli/popvizard.tar.gz

SLIDE 5

The Wright-Fisher Model

View Wright-Fisher model as a discrete-time Markov process
Let $Y_t$ = number of alleles of type A in population at generation $t$, $0 \le Y_t \le 2N$ for $t = 0, 1, \ldots$
Define $p_{ij} = P(Y_{t+1} = j \mid Y_t = i)$. Then,

$$p_{ij} = \begin{cases} \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(\frac{2N-i}{2N}\right)^{2N-j}, & j = 0, 1, \ldots, 2N \\ 0, & \text{otherwise} \end{cases}$$

States 0 and 2N are absorbing states: we can never leave these states
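The transition probabilities above can be sketched in a few lines of Python. This is an illustration rather than code from the talk: the function names `wf_transition_row` and `simulate_wf` are made up here, and the binomial draw is done with plain Bernoulli trials so that only the standard library is needed.

```python
import math
import random

def wf_transition_row(i, two_n):
    """Row i of the Wright-Fisher transition matrix: p_ij for j = 0..2N."""
    p = i / two_n
    return [math.comb(two_n, j) * p**j * (1 - p)**(two_n - j)
            for j in range(two_n + 1)]

def simulate_wf(y0, two_n, generations, rng=None):
    """Simulate Y_0, Y_1, ... by resampling 2N gene copies each generation."""
    rng = rng or random.Random()
    y, path = y0, [y0]
    for _ in range(generations):
        p = y / two_n
        # Each of the 2N gene copies picks an A parent with probability p.
        y = sum(rng.random() < p for _ in range(two_n))
        path.append(y)
    return path

# States 0 and 2N are absorbing: all probability mass stays put.
print(wf_transition_row(0, 10)[0], wf_transition_row(10, 10)[10])
print(simulate_wf(y0=5, two_n=10, generations=20, rng=random.Random(1)))
```

A path started at 0 or at 2N never moves, which is the absorbing behaviour noted on the slide.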

SLIDE 6

The Wright-Fisher Model

Note that:

◮ $E(Y_{t+1} \mid Y_t = i) = 2N \left(\frac{i}{2N}\right) = i$
◮ $\mathrm{Var}(Y_{t+1} \mid Y_t = i) = 2N \left(\frac{i}{2N}\right)\left(1 - \frac{i}{2N}\right)$
◮ So the expected number of A alleles remains the same, but the actual number may vary between 0 and 2N
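As a quick numerical check of these two moments, the sketch below sums over the binomial distribution of the next generation directly. The helper name `wf_next_generation_moments` and the values 2N = 20, i = 6 are chosen here only for illustration.

```python
import math

def wf_next_generation_moments(i, two_n):
    """Mean and variance of Y_{t+1} given Y_t = i, from the binomial pmf."""
    p = i / two_n
    pmf = [math.comb(two_n, j) * p**j * (1 - p)**(two_n - j)
           for j in range(two_n + 1)]
    mean = sum(j * pj for j, pj in enumerate(pmf))
    var = sum((j - mean) ** 2 * pj for j, pj in enumerate(pmf))
    return mean, var

mean, var = wf_next_generation_moments(i=6, two_n=20)
# Formulas above give mean = i = 6 and variance = 20 * 0.3 * 0.7 = 4.2
print(mean, var)
```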

Classical approach: Look at the limit as the population size N → ∞
Kingman’s Coalescent Process

◮ Widely used in population genetics and phylogenetics
◮ Difficult to extend to handle features of the evolutionary process, such as selection

SLIDE 7

Wright-Fisher Model as a Diffusion Process

Define a diffusion process $\{X_t\}_{t \ge 0}$ as a continuous-time Markov process with approximately Gaussian increments over small time intervals and for which the following three conditions hold for small $\delta t$ and $X_t = x$:

◮ $E(X_{t+\delta t} - X_t \mid X_t = x) = \mu(t, x)\,\delta t + o(\delta t)$
◮ $E((X_{t+\delta t} - X_t)^2 \mid X_t = x) = \sigma^2(t, x)\,\delta t + o(\delta t)$
◮ $E((X_{t+\delta t} - X_t)^k \mid X_t = x) = o(\delta t)$ for $k > 2$

From Radu’s slides, we had: $dX_t = S(X_t)\,dt + \sigma(X_t)\,dW_t$, where $S(X_t)$ is the drift coefficient (the role played by $\mu$ above) and $\sigma(X_t)$ is the diffusion coefficient.
For standard Brownian Motion, $\mu(t, x) = 0$ and $\sigma^2(t, x) = 1$.

SLIDE 8

Wright-Fisher Model as a Diffusion Process

Let $Y_t$ be the number of A alleles in the population at generation $t$
Let $X_t$ = proportion of A alleles in population at generation $t$; $X_t = \frac{Y_t}{2N}$
Let $X_t$ represent the continuous-time process (eventually measure time in units of 2N generations, as before)
Define $\Delta Y_t = Y_{t+1} - Y_t$ and $\Delta X_t = X_{t+1} - X_t$. Then

$$\begin{aligned}
E(Y_{t+1} \mid X_t = x) &= 2Nx \\
E(\Delta Y_t \mid X_t = x) &= 0 \\
E[(\Delta Y_t)^2 \mid X_t = x] &= 2Nx(1 - x) \\
E(\Delta X_t \mid X_t = x) &= 0 = \mu(t, x) = \mu(x) \\
E((\Delta X_t)^2 \mid X_t = x) &= \frac{x(1 - x)}{2N} = \sigma^2(t, x) = \sigma^2(x)
\end{aligned}$$

SLIDE 9

Wright-Fisher Model as a Diffusion Process

Now re-define $\Delta Y_t = Y_{t+\Delta t} - Y_t$ and $\Delta X_t = X_{t+\Delta t} - X_t$, where $\Delta t = \frac{1}{2N}$, and let $N \to \infty$, so that

$$E((\Delta X_t)^2 \mid X_t) = X_t(1 - X_t)\,\Delta t$$

The corresponding SDE is

$$dX_t = \sqrt{X_t(1 - X_t)}\,dW_t, \quad X_t \in [0, 1]$$

where $W_t$ is standard Brownian Motion (see Pardoux, 2009, for a rigorous proof)

SLIDE 10

The Wright-Fisher Model with Selection

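Model for selection:

◮ Suppose that allele A is superior to allele a, so that when the current frequency of A is $x$, the probability that a gene copy in the next generation descends from an A allele is

$$p_x = \frac{2Nx(1 + s)}{2Nx(1 + s) + (2N - 2Nx)}$$

◮ As before, let $N \to \infty$ and define $s = \beta/(2N)$.
◮ $E(\Delta X_t \mid X_t) \approx \beta X_t(1 - X_t)\,\Delta t$
◮ $E((\Delta X_t)^2 \mid X_t) \approx X_t(1 - X_t)\,\Delta t$

The corresponding SDE is

$$dX_t = \beta X_t(1 - X_t)\,dt + \sqrt{X_t(1 - X_t)}\,dW_t, \quad X_t \in [0, 1]$$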
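The two approximate moments can be checked directly from $p_x$. The short derivation below is a sketch that assumes $\Delta t = 1/(2N)$, as on the previous slide, and that the next generation is a Binomial$(2N, p_x)$ sample, so that $s = \beta\,\Delta t$.

```latex
\begin{aligned}
E(\Delta X_t \mid X_t = x) &= p_x - x = \frac{x(1+s)}{1+sx} - x = \frac{s\,x(1-x)}{1+sx} \\
  &= \frac{\beta\,x(1-x)\,\Delta t}{1+\beta x\,\Delta t}
   = \beta\,x(1-x)\,\Delta t + O(\Delta t^2), \\
E((\Delta X_t)^2 \mid X_t = x) &= \frac{p_x(1-p_x)}{2N} + (p_x - x)^2
   = x(1-x)\,\Delta t + O(\Delta t^2).
\end{aligned}
```

Selection therefore enters only through the drift term at first order in $\Delta t$, while the diffusion term is the same as in the neutral model.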

SLIDE 11

The Wright-Fisher Diffusion with Selection: Intuition

Use the Euler Method (see Radu’s lectures) to simulate from the WF Diffusion model

$$X(t_{i+1}) = X(t_i) + \beta X(t_i)(1 - X(t_i))(t_{i+1} - t_i) + \sqrt{t_{i+1} - t_i}\,\sqrt{X(t_i)(1 - X(t_i))}\,Z$$

where $Z \sim N(0, 1)$

Python code to simulate this:

◮ T = 0.05
◮ Define $0 = t_0 < t_1 < \cdots < t_{N-1} < t_N = T$, equally spaced
◮ Vary β
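The code used in the talk is not part of this transcript, so the following is only a minimal sketch of the Euler update above with equally spaced time points. The function name `euler_wf_path`, the starting frequency of 0.5, and the clipping of each step back into [0, 1] are assumptions made for illustration.

```python
import math
import random

def euler_wf_path(x0, beta, T=0.05, n_steps=1000, rng=None):
    """Euler scheme for dX = beta*X(1-X) dt + sqrt(X(1-X)) dW on [0, T]."""
    rng = rng or random.Random()
    dt = T / n_steps                      # equally spaced: t_{i+1} - t_i = dt
    x, path = x0, [x0]
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)           # Z ~ N(0, 1)
        x = (x + beta * x * (1.0 - x) * dt
               + math.sqrt(dt) * math.sqrt(x * (1.0 - x)) * z)
        # Assumption: clip to [0, 1], since an Euler step can overshoot
        # the boundaries even though the diffusion itself cannot.
        x = min(1.0, max(0.0, x))
        path.append(x)
    return path

for beta in (0.0, 2.0, 10.0):             # vary beta, as on the slide
    path = euler_wf_path(x0=0.5, beta=beta, rng=random.Random(42))
    print(beta, round(path[-1], 4))
```

Because both the drift and the diffusion coefficient vanish at 0 and 1, a clipped path that reaches a boundary stays there, mirroring the absorbing states of the discrete model.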

SLIDE 12

The Wright-Fisher Diffusion with Selection: Intuition

β = 0, varying N

SLIDE 13

The Wright-Fisher Diffusion with Selection: Intuition

N = 1000, vary β

SLIDE 14

Application: Inferring Selection From Genome-scale Data

Diffusion models are increasingly used in analyzing genome-scale data.
Example: Williamson, S. H. et al. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. PNAS 102(22): 7882-7887.
Data: NIEHS Environmental Genome Project web site (http://egp.gs.washington.edu)

◮ Sequenced 301 genes associated with variation in response to environmental exposure
◮ 90 individuals: 24 African Americans, 24 Asian Americans, 24 European Americans, 12 Mexican Americans, and 6 Native Americans

Goal: Detect selection in different types of mutations; distinguish selection from other demographic factors, such as population size change

SLIDE 15

Application: Inferring Selection From Genome-scale Data

Data are recorded as SNPs: bases in the DNA sequence at which there is variation across individuals

Example data:

Taxon          Sequence
(A) Human      GCCGATGCCGATGCCGAA
(B) Chimp      GCCGTTGCCGTTGCCGTT
(C) Gorilla    GCGGAAGCGGAAGCGGAA

Recorded as SNPs, this would be

Taxon          Sequence
(A) Human      CATCATCAA
(B) Chimp      CTTCTTCTT
(C) Gorilla    GAAGAAGAA

SLIDE 16

Application: Inferring Selection From Genome-scale Data

Example SNP data is

Taxon          Sequence
(A) Human      CATCATCAA
(B) Chimp      CTTCTTCTT
(C) Gorilla    GAAGAAGAA

Record this as the site frequency spectrum (SFS), denoted by the vector $u$, where entry $u_i$ = number of SNP sites with $i$ copies of the derived allele
For the example, we have (assuming that the ancestral state is that found in Gorilla), $u = (4, 5)$
If we let Human be ancestral, we’d have $u = (9, 0)$
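The SFS for the example can be reproduced with a short script. This is an illustrative sketch: the function names are invented here, and it assumes that every base differing from the chosen ancestral sequence counts as the derived allele (all sites in the example are biallelic).

```python
from collections import Counter

def snp_columns(sequences):
    """Alignment columns (as tuples) at which the sequences differ."""
    return [col for col in zip(*sequences) if len(set(col)) > 1]

def site_frequency_spectrum(sequences, ancestral_index):
    """u[i-1] = number of SNP sites carrying i copies of the derived allele."""
    n = len(sequences)
    counts = Counter()
    for col in snp_columns(sequences):
        ancestral = col[ancestral_index]
        derived_copies = sum(1 for base in col if base != ancestral)
        counts[derived_copies] += 1
    return [counts[i] for i in range(1, n)]

alignment = [
    "GCCGATGCCGATGCCGAA",  # (A) Human
    "GCCGTTGCCGTTGCCGTT",  # (B) Chimp
    "GCGGAAGCGGAAGCGGAA",  # (C) Gorilla
]
print(["".join(s) for s in zip(*snp_columns(alignment))])
# ['CATCATCAA', 'CTTCTTCTT', 'GAAGAAGAA']
print(site_frequency_spectrum(alignment, ancestral_index=2))  # [4, 5]
print(site_frequency_spectrum(alignment, ancestral_index=0))  # [9, 0]
```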

SLIDE 17

Application: Inferring Selection From Genome-scale Data

Idea of analysis:

◮ Write the likelihood function and obtain MLEs of the parameters of interest
◮ Likelihood function for a sample of K SNPs:

$$L(\beta) = \prod_{k=1}^{K} \Pr(i_k, n_k \mid \beta)$$

where $\Pr(i_k, n_k \mid \beta)$ is the probability that SNP $k$ is at frequency $\frac{i_k}{n_k}$

$\Pr(i_k, n_k \mid \beta)$ comes from the diffusion model. How?

◮ Williamson et al. (2005): Use numerical methods to approximate the diffusion
◮ Today: use a naive sampling method based on the Euler approximation
◮ Ongoing work (with Radu Herbei and Jeff Gory): use exact sampling from the WF diffusion to implement a Bayesian version of the model

SLIDE 18

Application: Inferring Selection From Genome-scale Data

Naive method:

1. Use the Euler method to simulate a path from the WF diffusion with selection parameter β, and record the final allele frequency, $q$.
2. For the $q$ from step 1, simulate the data for a SNP by drawing $Y \sim \mathrm{Bin}(2n, q)$, where $n$ is the number of “people” in the sample.
3. Repeat steps 1-2 a large number of times, say $M$ (the larger, the better), to generate a set of simulated $Y$ values, $Y_1, Y_2, \ldots, Y_M$.
4. Form the estimates

$$\hat{P}_i(\beta) = \frac{1}{M} \sum_{m=1}^{M} I(Y_m = i)$$

The approximate likelihood is then

$$\hat{L}(\beta) = \prod_{k=1}^{K} \hat{P}_{i_k}(\beta)$$
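The four steps can be put together as follows. This is a sketch under stated assumptions, not the code behind the results in the talk: the starting frequency `x0 = 0.5`, the number of Euler steps, the clipping to [0, 1], and the small `floor` that guards against taking the log of a zero estimate are all choices made here, and the likelihood is returned on the log scale to avoid underflow in the product over K SNPs.

```python
import math
import random
from collections import Counter

def final_frequency(beta, x0, T, n_steps, rng):
    """Step 1: Euler path of the WF diffusion; return the final frequency q."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x += beta * x * (1.0 - x) * dt + math.sqrt(dt * x * (1.0 - x)) * z
        x = min(1.0, max(0.0, x))  # assumption: clip so x stays in [0, 1]
    return x

def estimate_probs(beta, n_people, M, x0=0.5, T=0.05, n_steps=100, rng=None):
    """Steps 2-4: P-hat_i(beta) for i = 0..2n from M simulated SNPs."""
    rng = rng or random.Random()
    two_n = 2 * n_people
    counts = Counter()
    for _ in range(M):
        q = final_frequency(beta, x0, T, n_steps, rng)
        y = sum(rng.random() < q for _ in range(two_n))  # Y ~ Bin(2n, q)
        counts[y] += 1
    return [counts[i] / M for i in range(two_n + 1)]

def approx_log_likelihood(observed, probs, floor=1e-12):
    """log L-hat(beta) = sum over SNPs k of log P-hat_{i_k}(beta)."""
    return sum(math.log(max(probs[i], floor)) for i in observed)

# Hypothetical observed derived-allele counts i_k for three SNPs.
observed = [14, 15, 16]
probs = estimate_probs(beta=2.0, n_people=15, M=500, rng=random.Random(0))
print(approx_log_likelihood(observed, probs))
```

Evaluating `approx_log_likelihood` over a grid of β values and keeping the largest gives an approximate MLE; larger `M` reduces the Monte Carlo noise in each $\hat{P}_i(\beta)$.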

SLIDE 19

Application: Inferring Selection From Genome-scale Data

Does it work?
Simulate data for 15 people and 100 SNPs with various values of β and M

β = 0.2

SLIDE 20

Application: Inferring Selection From Genome-scale Data

Does it work?
Simulate data for 15 people and 100 SNPs with various values of β and M

β = 2.0

SLIDE 21

Application: Inferring Selection From Genome-scale Data

Does it work?
Simulate data for 15 people and 100 SNPs with various values of β and M

β = 10.0

SLIDE 22

Application: Inferring Selection From Genome-scale Data

Does it work?
Take the maximum value of the approximate likelihood as the MLE
Repeat the simulation multiple times and look at properties of the MLEs

True β    Number of reps    Mean MLE    MSE
2.0       30                2.10        2.13
10.0      15                10.25       3.58
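For reference, the columns of such a table can be computed as sketched below. The helper names and the input numbers are hypothetical and serve only to show the arithmetic; they are not the replicates from the talk.

```python
def grid_mle(beta_grid, log_likelihoods):
    """Return the grid value of beta with the largest (log-)likelihood."""
    best = max(range(len(beta_grid)), key=lambda k: log_likelihoods[k])
    return beta_grid[best]

def summarize_mles(true_beta, estimates):
    """Number of reps, mean MLE, and mean squared error around the true beta."""
    reps = len(estimates)
    mean_mle = sum(estimates) / reps
    mse = sum((b - true_beta) ** 2 for b in estimates) / reps
    return reps, mean_mle, mse

# Hypothetical log-likelihood values on a small grid (illustration only).
print(grid_mle([0.0, 1.0, 2.0, 3.0], [-310.2, -305.7, -304.9, -306.1]))

# Hypothetical MLEs from three replicate simulations (illustration only).
print(summarize_mles(2.0, [1.5, 2.5, 2.0]))
```

The MSE combines the variance of the estimates with their squared bias, so a mean MLE close to the true β can still come with a sizeable MSE.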

SLIDE 23

Conclusions

Diffusion models are being increasingly used for data analysis in population genetics.
Methods used for estimation are mostly based on numerical approximations, rather than on statistical techniques.
Promising area of application as availability of whole-genome sequence data is increasing rapidly.

SLIDE 24

References

Wakeley, J. (2009) Coalescent Theory: An Introduction. Roberts and Company.
Williamson, S. et al. (2005) Simultaneous inference of selection and population growth from patterns of variation in the human genome. PNAS 102(22): 7882-7887.
Gutenkunst, R. et al. (2009) Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLoS Genetics 5(10): e1000695.
Pardoux, E. (2009) Probabilistic models of population genetics. http://www.cmi.univ-mrs.fr/∼pardoux/enseignement/cours genpop.pdf
dadi: Diffusion Approximation for Demographic Inference. https://code.google.com/p/dadi/

Thank you!
