An Efficient Algorithm for An Efficient Algorithm for Simulating - - PowerPoint PPT Presentation

an efficient algorithm for an efficient algorithm for
SMART_READER_LITE
LIVE PREVIEW

An Efficient Algorithm for An Efficient Algorithm for Simulating - - PowerPoint PPT Presentation

An Efficient Algorithm for An Efficient Algorithm for Simulating Coalescence with Simulating Coalescence with Recombination Recombination Katy L. Simonsen (Statistics) (Statistics) Katy L. Simonsen Dan A. Noland (CS and ITaP ITaP) ) Dan


slide-1
SLIDE 1

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 1 1

An Efficient Algorithm for An Efficient Algorithm for Simulating Coalescence with Simulating Coalescence with Recombination Recombination

Katy L. Simonsen Katy L. Simonsen (Statistics) (Statistics) Dan A. Noland (CS and Dan A. Noland (CS and ITaP ITaP) ) Chinh Chinh Le ( Le (ITaP ITaP), ), Purdue University Purdue University

slide-2
SLIDE 2

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 2 2

Coalescence with Coalescence with Recombination Recombination

  • The coalescent process is a Markov Chain

The coalescent process is a Markov Chain process which models the ancestry of a process which models the ancestry of a random sample of DNA sequences from a random sample of DNA sequences from a population population

  • Recombination allows genetic information

Recombination allows genetic information from two separate individuals to be from two separate individuals to be combined on a single chromosome combined on a single chromosome

  • Tree

Tree-

  • like structure describes possibly

like structure describes possibly different ancestry at different loci different ancestry at different loci

slide-3
SLIDE 3

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 3 3

Model Parameters Model Parameters

Parameters: Parameters:

  • n

n = 4 sequences = 4 sequences

  • m

m = 3 loci = 3 loci

  • recombination rate

recombination rate r ri

i < 0.5

< 0.5

  • effective

effective population size N population size N

  • R

Ri

i = 2Nr

= 2Nri

i for i = 1

for i = 1 … m … m-

  • 1

1

slide-4
SLIDE 4

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 4 4

Why simulate coalescence? Why simulate coalescence?

  • Simulating CWR

Simulating CWR

  • allows the generation of large samples of DNA

allows the generation of large samples of DNA sequences with linkage sequences with linkage

  • statistical properties can empirically be determined

statistical properties can empirically be determined

  • Realistic levels of linkage disequilibrium are

Realistic levels of linkage disequilibrium are needed to discover effective strategies for needed to discover effective strategies for mapping genes in natural populations mapping genes in natural populations

  • Recently popular in human genetics literature

Recently popular in human genetics literature

slide-5
SLIDE 5

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 5 5

Original algorithm Original algorithm

( (Simonsen Simonsen & Churchill, 1997) & Churchill, 1997)

  • steps through Markov Chain to build tree

steps through Markov Chain to build tree

  • series of recombination and coalescent events

series of recombination and coalescent events

  • To calculate transition probabilities we need

To calculate transition probabilities we need

  • the number of each type of individual present

the number of each type of individual present

  • there are 2

there are 2m

m possible types

possible types

  • the probability that each type has a recombination event

the probability that each type has a recombination event

  • Stores a state vector Y of length 2

Stores a state vector Y of length 2m

m

  • Impossible to go beyond

Impossible to go beyond m m = 32 (or 64) = 32 (or 64)

  • Can’t even store the indices of the vector let alone the vector!

Can’t even store the indices of the vector let alone the vector!

  • Limited by exponential nature

Limited by exponential nature

  • Could not simulate more than

Could not simulate more than m m = 24 loci = 24 loci

  • Very slow, especially for large

Very slow, especially for large m m and and R R

slide-6
SLIDE 6

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 6 6

Goal Goal

  • Want to be able to simulate hundreds,

Want to be able to simulate hundreds, even thousands of loci ( even thousands of loci (m m) )

  • Also want large

Also want large R R = 2Nr where = 2Nr where

  • N is very large (10

N is very large (103

3 –

– 10 106

6),

),

  • 0 < r < 0.5

0 < r < 0.5

  • as

as m m increases, increases, r r decreases in such a decreases in such a way that way that mr mr = constant (genome size) = constant (genome size)

slide-7
SLIDE 7

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 7 7

What is taking so much space? What is taking so much space?

  • Vector

Vector Y Y changes at each step changes at each step

  • Matrix

Matrix Λ Λ (0 (0/1 /1 indicator) size 2 indicator) size 2m

m ×

× m m– –1 (fixed) 1 (fixed)

  • whether recombination can happen for each type

whether recombination can happen for each type

  • Need all entries of

Need all entries of Y Y Λ Λ R R at every step at every step

  • to calculate transition probabilities for next step

to calculate transition probabilities for next step

  • number of steps in the chain is random:

number of steps in the chain is random:

  • G

G = 2H + = 2H + n n – – 1 where r.v. H depends on 1 where r.v. H depends on m m, , n n, , R R

  • End result is a tree with

End result is a tree with G G “ “events events” ”

  • Have to store all this: takes

Have to store all this: takes memory memory

  • Have to calculate all this: takes

Have to calculate all this: takes time time

slide-8
SLIDE 8

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 8 8

How can we fix it? How can we fix it?

  • Don

Don’ ’t want to store t want to store Y Y

  • Don

Don’ ’t want to store t want to store Λ ΛR R

  • Don

Don’ ’t want to do all that multiplication t want to do all that multiplication

  • Don

Don’ ’t want to store the big tree with t want to store the big tree with G G events events

slide-9
SLIDE 9

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 9 9

Exploit structure: Exploit structure: Y Y

  • Y

Y contains integers in [0, contains integers in [0,n n] ]

  • Sum of entries in

Sum of entries in Y Y is is ≤ ≤ nm nm

  • At most nm non

At most nm non-

  • zero elements out of

zero elements out of 2 2m

m

  • Y

Y starts with one non starts with one non-

  • zero element (=

zero element (=n n) and ) and ends with one non ends with one non-

  • zero element (=1)

zero element (=1)

  • Y

Y must be very sparse! must be very sparse!

  • At most 3 entries change at any one step

At most 3 entries change at any one step

  • Can store

Can store Y Y cleverly to exploit this structure cleverly to exploit this structure

slide-10
SLIDE 10

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 10 10

Exploit structure: Exploit structure: Λ ΛR R

  • Λ

Λ turns out not to be sparse turns out not to be sparse

  • Indicates where recombination is relevant (0

Indicates where recombination is relevant (0 – – 1) 1)

  • Looks kind of sparse for small

Looks kind of sparse for small m m

  • Becomes more dense with

Becomes more dense with m m, and is completely , and is completely dense in the limit (Lei Liu) dense in the limit (Lei Liu)

  • Can

Can’ ’t get away with clever storage here t get away with clever storage here

  • BUT since

BUT since Y Y multiplies multiplies Λ ΛR R we don we don’ ’t need it all t need it all

  • Only need rows of

Only need rows of Λ ΛR R with >0 entries of with >0 entries of Y Y

  • Will recalculate

Will recalculate Λ ΛR R “ “on the fly

  • n the fly”

” as entries of as entries of Y Y become >0 become >0

  • Seems inefficient but trades recalculation for space

Seems inefficient but trades recalculation for space

  • Use bitwise calculations for speed

Use bitwise calculations for speed

slide-11
SLIDE 11

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 11 11

Key data representations: Key data representations: Y YΛ ΛR R

  • Store

Store Y Y as a linked list as a linked list

  • nly the non
  • nly the non-
  • zero entries

zero entries

  • Along with each element of

Along with each element of Y Y, store its , store its corresponding corresponding Λ ΛR R row row

  • Store index of

Store index of Y Y (type) as a multi (type) as a multi-

  • precision

precision integer since this is still of length 2 integer since this is still of length 2m

m

  • use the Gnu

use the Gnu Multiprecision Multiprecision ( (gmp gmp) Library (Dan ) Library (Dan’ ’s idea) s idea)

  • At each step, only recalculate

At each step, only recalculate Λ ΛR R row for row for “ “new new” ” Y Y entries entries

slide-12
SLIDE 12

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 12 12

Exploit structure: Trees Exploit structure: Trees

  • Trees are

Trees are complicated when complicated when mingled but simple if mingled but simple if looked at one by one looked at one by one

  • Store the

Store the “ “marginal marginal” ” trees instead of the trees instead of the “ “joint joint” ” tree: tree:

slide-13
SLIDE 13

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 13 13

Key data representations: Key data representations: Trees Trees

  • Instead of storing combined tree, just store

Instead of storing combined tree, just store m trees. m trees.

  • These are just binary trees so data size is

These are just binary trees so data size is FIXED at FIXED at m m(2 (2n n– –1) 1)

  • Storage is independent of

Storage is independent of G G which is which is random and could be huge random and could be huge

  • this does lose information which could be

this does lose information which could be stored separately stored separately

slide-14
SLIDE 14

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 14 14

Timing Results Timing Results

  • Time is polynomial instead of exponential

Time is polynomial instead of exponential

  • Quadratic in

Quadratic in m m

  • Linear in

Linear in n n

  • Linear in

Linear in R R (indirectly through (indirectly through G G) )

  • O(

O(nm nm2

2R

R)

  • Memory is O(

Memory is O(mn mn) and ) and not random not random

slide-15
SLIDE 15

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 15 15

Quadratic in Quadratic in m m

Timings vs Number of Loci

y = 0.0086x2 - 4.0833x + 478.57

  • 1000

1000 2000 3000 4000 5000 6000 200 400 600 800 1000 1200 m (Number of Loci) time (seconds) n=100, R=2 n=100, R=4 n=100, R=8 Quadratic(n=100, R=8)

slide-16
SLIDE 16

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 16 16

Linear in Linear in n n

Timing vs n

2 4 6 8 10 12 14 16 18 20 200 400 600 800 1000 1200 n time (seconds) m=100, R=2 m=100, R=4 m=100, R=6 m=100, R=8

slide-17
SLIDE 17

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 17 17

Linear in Linear in R R

Timings vs Recombination Rate

20 40 60 80 100 120 140 160 180 2 4 6 8 10

R = 2Nr (Recombination Rate) time (seconds)

n=100,m=100 n=100,m=200 n=100,m=300

slide-18
SLIDE 18

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 18 18

  • No. Events
  • No. Events G

G is Quadratic in is Quadratic in m m

Number of Events (G) vs Number of Loci

y = 4.9047x2 - 344.82x + 82791 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 200 400 600 800 1000 1200 Thousands m (Number of Loci) Number of Events n=100, R=2 n=100, R=4 n=100, R=8 Quadratic(n=200, R=8)

slide-19
SLIDE 19

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 19 19

Time vs. Time vs. nm nm2

2R

R

Time vs nRm2 y = 8E-06x - 1049.2 R2 = 0.9868 1 2 3 4 5 6 7 8 9 10 200 400 600 800 1000 1200

Thousands Millions

nRm2 time (seconds)

slide-20
SLIDE 20

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 20 20

Comparison: Log(time) vs. Comparison: Log(time) vs. m m

Comparison (Log Scale) n = 200, R = 1

0.01 0.10 1.00 10.00 100.00 1000.00 10000.00 200 400 600 800 1000 1200 m time (seconds) New Orig

slide-21
SLIDE 21

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 21 21

Limits Limits

  • So far have simulated up to

So far have simulated up to m m = 5000 loci = 5000 loci (took 19GB) (took 19GB)

  • Time

Time is proportional to the number of is proportional to the number of events (can’t get around that) events (can’t get around that)

  • Memory

Memory is linear in is linear in m m and and n n, independent , independent

  • f
  • f R

R, and not random , and not random

slide-22
SLIDE 22

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 22 22

Applications Applications

  • Whole genome simulations

Whole genome simulations

  • m

m = 1000 is the scale of markers now in use = 1000 is the scale of markers now in use

  • m

m = 10,000? 100,000 = 10,000? 100,000

  • Human Populations

Human Populations

  • understanding LD and

understanding LD and haplotype haplotype blocks blocks

  • Natural Populations

Natural Populations

  • statistical mapping methods

statistical mapping methods

  • power calculations

power calculations

slide-23
SLIDE 23

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 23 23

Refinements Refinements

  • more complex models

more complex models

  • population size changes

population size changes ☺ ☺

  • Windows GUI

Windows GUI ☺ ☺

  • complex mutation models

complex mutation models

  • population structure

population structure

  • selection

selection

slide-24
SLIDE 24

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 24 24

Thank You Thank You

  • Dan Noland (programming, ideas)

Dan Noland (programming, ideas)

  • Chinh

Chinh Le (profiling, timing, advice) Le (profiling, timing, advice)

  • Ahmed

Ahmed Sameh Sameh, David Moffett, and , David Moffett, and ITaP ITaP

  • Koen

Koen Verhoeven Verhoeven, Lei Liu , Lei Liu

  • http://www.stat.

http://www.stat.purdue purdue.edu/~ .edu/~simonsen simonsen/Argos/ /Argos/

slide-25
SLIDE 25

Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 25 25

Human Genome Parameters Human Genome Parameters

(very approximate) (very approximate)

  • 3

3 × × 10 109

9 base pairs (3 Gb) in 23

base pairs (3 Gb) in 23 chromosomes chromosomes

  • 10

107

7 SNPs

SNPs ⇒ ⇒ m m = = 5 5 × × 10 105

5 SNPs

SNPs / / chrom chrom

  • r

r ≈ ≈ .01/Mb (highly variable) .01/Mb (highly variable)

≈ 1 1-

  • 2 SNP /Kb

2 SNP /Kb ⇒ ⇒ r r ≈ ≈ 10 10-

  • 5

5 between

between SNPs SNPs

  • Historical N

Historical N ≈ ≈ 10 104

4 –

– 10 105

5 so

so R R ≈ ≈ 1 1

  • sample sizes

sample sizes n n ≈ ≈ 1000 or as large as 1000 or as large as necessary (power) necessary (power)