Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 1 1
An Efficient Algorithm for An Efficient Algorithm for Simulating - - PowerPoint PPT Presentation
An Efficient Algorithm for An Efficient Algorithm for Simulating - - PowerPoint PPT Presentation
An Efficient Algorithm for An Efficient Algorithm for Simulating Coalescence with Simulating Coalescence with Recombination Recombination Katy L. Simonsen (Statistics) (Statistics) Katy L. Simonsen Dan A. Noland (CS and ITaP ITaP) ) Dan
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 2 2
Coalescence with Coalescence with Recombination Recombination
- The coalescent process is a Markov Chain
The coalescent process is a Markov Chain process which models the ancestry of a process which models the ancestry of a random sample of DNA sequences from a random sample of DNA sequences from a population population
- Recombination allows genetic information
Recombination allows genetic information from two separate individuals to be from two separate individuals to be combined on a single chromosome combined on a single chromosome
- Tree
Tree-
- like structure describes possibly
like structure describes possibly different ancestry at different loci different ancestry at different loci
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 3 3
Model Parameters Model Parameters
Parameters: Parameters:
- n
n = 4 sequences = 4 sequences
- m
m = 3 loci = 3 loci
- recombination rate
recombination rate r ri
i < 0.5
< 0.5
- effective
effective population size N population size N
- R
Ri
i = 2Nr
= 2Nri
i for i = 1
for i = 1 … m … m-
- 1
1
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 4 4
Why simulate coalescence? Why simulate coalescence?
- Simulating CWR
Simulating CWR
- allows the generation of large samples of DNA
allows the generation of large samples of DNA sequences with linkage sequences with linkage
- statistical properties can empirically be determined
statistical properties can empirically be determined
- Realistic levels of linkage disequilibrium are
Realistic levels of linkage disequilibrium are needed to discover effective strategies for needed to discover effective strategies for mapping genes in natural populations mapping genes in natural populations
- Recently popular in human genetics literature
Recently popular in human genetics literature
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 5 5
Original algorithm Original algorithm
( (Simonsen Simonsen & Churchill, 1997) & Churchill, 1997)
- steps through Markov Chain to build tree
steps through Markov Chain to build tree
- series of recombination and coalescent events
series of recombination and coalescent events
- To calculate transition probabilities we need
To calculate transition probabilities we need
- the number of each type of individual present
the number of each type of individual present
- there are 2
there are 2m
m possible types
possible types
- the probability that each type has a recombination event
the probability that each type has a recombination event
- Stores a state vector Y of length 2
Stores a state vector Y of length 2m
m
- Impossible to go beyond
Impossible to go beyond m m = 32 (or 64) = 32 (or 64)
- Can’t even store the indices of the vector let alone the vector!
Can’t even store the indices of the vector let alone the vector!
- Limited by exponential nature
Limited by exponential nature
- Could not simulate more than
Could not simulate more than m m = 24 loci = 24 loci
- Very slow, especially for large
Very slow, especially for large m m and and R R
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 6 6
Goal Goal
- Want to be able to simulate hundreds,
Want to be able to simulate hundreds, even thousands of loci ( even thousands of loci (m m) )
- Also want large
Also want large R R = 2Nr where = 2Nr where
- N is very large (10
N is very large (103
3 –
– 10 106
6),
),
- 0 < r < 0.5
0 < r < 0.5
- as
as m m increases, increases, r r decreases in such a decreases in such a way that way that mr mr = constant (genome size) = constant (genome size)
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 7 7
What is taking so much space? What is taking so much space?
- Vector
Vector Y Y changes at each step changes at each step
- Matrix
Matrix Λ Λ (0 (0/1 /1 indicator) size 2 indicator) size 2m
m ×
× m m– –1 (fixed) 1 (fixed)
- whether recombination can happen for each type
whether recombination can happen for each type
- Need all entries of
Need all entries of Y Y Λ Λ R R at every step at every step
- to calculate transition probabilities for next step
to calculate transition probabilities for next step
- number of steps in the chain is random:
number of steps in the chain is random:
- G
G = 2H + = 2H + n n – – 1 where r.v. H depends on 1 where r.v. H depends on m m, , n n, , R R
- End result is a tree with
End result is a tree with G G “ “events events” ”
- Have to store all this: takes
Have to store all this: takes memory memory
- Have to calculate all this: takes
Have to calculate all this: takes time time
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 8 8
How can we fix it? How can we fix it?
- Don
Don’ ’t want to store t want to store Y Y
- Don
Don’ ’t want to store t want to store Λ ΛR R
- Don
Don’ ’t want to do all that multiplication t want to do all that multiplication
- Don
Don’ ’t want to store the big tree with t want to store the big tree with G G events events
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 9 9
Exploit structure: Exploit structure: Y Y
- Y
Y contains integers in [0, contains integers in [0,n n] ]
- Sum of entries in
Sum of entries in Y Y is is ≤ ≤ nm nm
- At most nm non
At most nm non-
- zero elements out of
zero elements out of 2 2m
m
- Y
Y starts with one non starts with one non-
- zero element (=
zero element (=n n) and ) and ends with one non ends with one non-
- zero element (=1)
zero element (=1)
- Y
Y must be very sparse! must be very sparse!
- At most 3 entries change at any one step
At most 3 entries change at any one step
- Can store
Can store Y Y cleverly to exploit this structure cleverly to exploit this structure
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 10 10
Exploit structure: Exploit structure: Λ ΛR R
- Λ
Λ turns out not to be sparse turns out not to be sparse
- Indicates where recombination is relevant (0
Indicates where recombination is relevant (0 – – 1) 1)
- Looks kind of sparse for small
Looks kind of sparse for small m m
- Becomes more dense with
Becomes more dense with m m, and is completely , and is completely dense in the limit (Lei Liu) dense in the limit (Lei Liu)
- Can
Can’ ’t get away with clever storage here t get away with clever storage here
- BUT since
BUT since Y Y multiplies multiplies Λ ΛR R we don we don’ ’t need it all t need it all
- Only need rows of
Only need rows of Λ ΛR R with >0 entries of with >0 entries of Y Y
- Will recalculate
Will recalculate Λ ΛR R “ “on the fly
- n the fly”
” as entries of as entries of Y Y become >0 become >0
- Seems inefficient but trades recalculation for space
Seems inefficient but trades recalculation for space
- Use bitwise calculations for speed
Use bitwise calculations for speed
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 11 11
Key data representations: Key data representations: Y YΛ ΛR R
- Store
Store Y Y as a linked list as a linked list
- nly the non
- nly the non-
- zero entries
zero entries
- Along with each element of
Along with each element of Y Y, store its , store its corresponding corresponding Λ ΛR R row row
- Store index of
Store index of Y Y (type) as a multi (type) as a multi-
- precision
precision integer since this is still of length 2 integer since this is still of length 2m
m
- use the Gnu
use the Gnu Multiprecision Multiprecision ( (gmp gmp) Library (Dan ) Library (Dan’ ’s idea) s idea)
- At each step, only recalculate
At each step, only recalculate Λ ΛR R row for row for “ “new new” ” Y Y entries entries
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 12 12
Exploit structure: Trees Exploit structure: Trees
- Trees are
Trees are complicated when complicated when mingled but simple if mingled but simple if looked at one by one looked at one by one
- Store the
Store the “ “marginal marginal” ” trees instead of the trees instead of the “ “joint joint” ” tree: tree:
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 13 13
Key data representations: Key data representations: Trees Trees
- Instead of storing combined tree, just store
Instead of storing combined tree, just store m trees. m trees.
- These are just binary trees so data size is
These are just binary trees so data size is FIXED at FIXED at m m(2 (2n n– –1) 1)
- Storage is independent of
Storage is independent of G G which is which is random and could be huge random and could be huge
- this does lose information which could be
this does lose information which could be stored separately stored separately
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 14 14
Timing Results Timing Results
- Time is polynomial instead of exponential
Time is polynomial instead of exponential
- Quadratic in
Quadratic in m m
- Linear in
Linear in n n
- Linear in
Linear in R R (indirectly through (indirectly through G G) )
- O(
O(nm nm2
2R
R)
- Memory is O(
Memory is O(mn mn) and ) and not random not random
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 15 15
Quadratic in Quadratic in m m
Timings vs Number of Loci
y = 0.0086x2 - 4.0833x + 478.57
- 1000
1000 2000 3000 4000 5000 6000 200 400 600 800 1000 1200 m (Number of Loci) time (seconds) n=100, R=2 n=100, R=4 n=100, R=8 Quadratic(n=100, R=8)
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 16 16
Linear in Linear in n n
Timing vs n
2 4 6 8 10 12 14 16 18 20 200 400 600 800 1000 1200 n time (seconds) m=100, R=2 m=100, R=4 m=100, R=6 m=100, R=8
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 17 17
Linear in Linear in R R
Timings vs Recombination Rate
20 40 60 80 100 120 140 160 180 2 4 6 8 10
R = 2Nr (Recombination Rate) time (seconds)
n=100,m=100 n=100,m=200 n=100,m=300
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 18 18
- No. Events
- No. Events G
G is Quadratic in is Quadratic in m m
Number of Events (G) vs Number of Loci
y = 4.9047x2 - 344.82x + 82791 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 200 400 600 800 1000 1200 Thousands m (Number of Loci) Number of Events n=100, R=2 n=100, R=4 n=100, R=8 Quadratic(n=200, R=8)
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 19 19
Time vs. Time vs. nm nm2
2R
R
Time vs nRm2 y = 8E-06x - 1049.2 R2 = 0.9868 1 2 3 4 5 6 7 8 9 10 200 400 600 800 1000 1200
Thousands Millions
nRm2 time (seconds)
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 20 20
Comparison: Log(time) vs. Comparison: Log(time) vs. m m
Comparison (Log Scale) n = 200, R = 1
0.01 0.10 1.00 10.00 100.00 1000.00 10000.00 200 400 600 800 1000 1200 m time (seconds) New Orig
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 21 21
Limits Limits
- So far have simulated up to
So far have simulated up to m m = 5000 loci = 5000 loci (took 19GB) (took 19GB)
- Time
Time is proportional to the number of is proportional to the number of events (can’t get around that) events (can’t get around that)
- Memory
Memory is linear in is linear in m m and and n n, independent , independent
- f
- f R
R, and not random , and not random
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 22 22
Applications Applications
- Whole genome simulations
Whole genome simulations
- m
m = 1000 is the scale of markers now in use = 1000 is the scale of markers now in use
- m
m = 10,000? 100,000 = 10,000? 100,000
- Human Populations
Human Populations
- understanding LD and
understanding LD and haplotype haplotype blocks blocks
- Natural Populations
Natural Populations
- statistical mapping methods
statistical mapping methods
- power calculations
power calculations
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 23 23
Refinements Refinements
- more complex models
more complex models
- population size changes
population size changes ☺ ☺
- Windows GUI
Windows GUI ☺ ☺
- complex mutation models
complex mutation models
- population structure
population structure
- selection
selection
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 24 24
Thank You Thank You
- Dan Noland (programming, ideas)
Dan Noland (programming, ideas)
- Chinh
Chinh Le (profiling, timing, advice) Le (profiling, timing, advice)
- Ahmed
Ahmed Sameh Sameh, David Moffett, and , David Moffett, and ITaP ITaP
- Koen
Koen Verhoeven Verhoeven, Lei Liu , Lei Liu
- http://www.stat.
http://www.stat.purdue purdue.edu/~ .edu/~simonsen simonsen/Argos/ /Argos/
Katy L. Simonsen Katy L. Simonsen Interface 2004 Interface 2004 25 25
Human Genome Parameters Human Genome Parameters
(very approximate) (very approximate)
- 3
3 × × 10 109
9 base pairs (3 Gb) in 23
base pairs (3 Gb) in 23 chromosomes chromosomes
- 10
107
7 SNPs
SNPs ⇒ ⇒ m m = = 5 5 × × 10 105
5 SNPs
SNPs / / chrom chrom
- r
r ≈ ≈ .01/Mb (highly variable) .01/Mb (highly variable)
- ≈
≈ 1 1-
- 2 SNP /Kb
2 SNP /Kb ⇒ ⇒ r r ≈ ≈ 10 10-
- 5
5 between
between SNPs SNPs
- Historical N
Historical N ≈ ≈ 10 104
4 –
– 10 105
5 so
so R R ≈ ≈ 1 1
- sample sizes