
SLIDE 1

INFERENCE OF EVOLUTIONARY HISTORY WITH APPROXIMATE BAYESIAN COMPUTATION

Ariella Gladstein
Ecology and Evolutionary Biology
University of Arizona

SLIDE 2

SLIDE 3

HOW DID HUMANS SPREAD ACROSS THE WORLD?

WHAT DEMOGRAPHIC EVENTS LED US TO WHERE WE ARE TODAY AND THE DIVERSITY WE SEE?

(Nielsen et al. 2017)

SLIDE 4

(Nielsen et al. 2017)

SLIDE 5

(Nielsen et al. 2017)

SLIDE 6

(Nielsen et al. 2017)

SLIDE 7

(Nielsen et al. 2017)

SLIDE 8

(Nielsen et al. 2017)

SLIDE 9

(Nielsen et al. 2017)

SLIDE 10

WHAT ARE “DEMOGRAPHIC EVENTS”?

SLIDE 11

WHAT ARE “DEMOGRAPHIC EVENTS”?

Divergence

SLIDE 12

WHAT ARE “DEMOGRAPHIC EVENTS”?

Divergence
Expansion or reduction

SLIDE 13

WHAT ARE “DEMOGRAPHIC EVENTS”?

Divergence
Expansion or reduction
Gene flow

SLIDE 14

AIM: INFER THE DEMOGRAPHIC HISTORY OF THE ASHKENAZI JEWS.

SLIDE 15

ASHKENAZI JEWS: AN INTERESTING STUDY POPULATION

High frequency of genetic disorders
Population isolate
Complex demographic history
Well documented historical record

SLIDE 16


SLIDE 17

HYPOTHESIS OF ASHKENAZI ORIGINS

SLIDE 18

WESTERN VS. EASTERN ASHKENAZI JEWS

YIVO Institute for Jewish Research. People of a Thousand Towns. Online Photographic Catalog. Record Id: 6820
JDC Archives. Reference Code: NY_02044

Cracow, Poland, 1932
Germany, 1900s

SLIDE 19

WESTERN VS. EASTERN ASHKENAZI JEWS

YIVO Institute for Jewish Research. People of a Thousand Towns. Online Photographic Catalog. Record Id: 6820
JDC Archives. Reference Code: NY_02044

Cracow, Poland, 1932
Germany, 1900s

Reference census data

SLIDE 20

MOTIVATION

Numerous genetic studies on the Ashkenazi Jews.
All genome-wide studies treat Ashkenazi Jews as one population.
Preliminary work is consistent with genetic differentiation.
It is not informative about the cause of differentiation.

SLIDE 21

MODELS OF ASHKENAZI HISTORY

SLIDE 22

APPROXIMATE BAYESIAN COMPUTATION

Infer parameter values
Choose among models

SLIDE 23

APPROXIMATE BAYESIAN COMPUTATION

1. Define priors of parameters of model

t ~ Unif[10, 1000], where t = time (in generations) of divergence between Jewish and Middle Eastern populations

SLIDE 24

APPROXIMATE BAYESIAN COMPUTATION

1. Define priors of parameters of model
2. Simulate data many times

SLIDE 25

APPROXIMATE BAYESIAN COMPUTATION

1. Define priors of parameters of model
2. Simulate data many times
3. Choose model and estimate parameters based on simulations closest to real data
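
The three steps can be sketched as a rejection-style ABC loop. This is a toy illustration only: `simulate_summary` is a hypothetical stand-in for a real genome simulator, the single summary statistic and the 1% acceptance fraction are assumptions, and only the prior range (10 to 1000 generations) comes from the slides.

```python
import random


def simulate_summary(t, rng):
    """Hypothetical stand-in for a genome simulator.

    Returns one noisy summary statistic that grows with divergence time t.
    """
    return t / 1000.0 + rng.gauss(0.0, 0.05)


def abc_rejection(observed, n_sims=20000, keep_frac=0.01, seed=1):
    """Return the prior draws whose simulated summary is closest to `observed`."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        t = rng.uniform(10, 1000)        # step 1: draw t from its prior
        s = simulate_summary(t, rng)     # step 2: simulate data, summarise it
        sims.append((abs(s - observed), t))
    sims.sort()                          # step 3: rank by distance to real data
    n_keep = max(1, round(n_sims * keep_frac))
    return [t for _, t in sims[:n_keep]]


accepted = abc_rejection(observed=0.5)
posterior_mean = sum(accepted) / len(accepted)
print(f"kept {len(accepted)} draws, posterior mean t = {posterior_mean:.0f} generations")
```

The accepted draws approximate the posterior distribution of t; a real analysis would use many summary statistics and a distance defined over all of them.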

SLIDE 26

SIMULATION

Model parameters -> genotype sequences (stored in memory) -> calculate summaries -> <10 KB file with parameter values and summaries

SLIDE 27

EMBARRASSINGLY PARALLEL!

[Figure: the single-simulation pipeline (model parameters -> genotype sequences stored in memory -> calculate summaries -> <10 KB file with parameter values and summaries) repeated many times, one copy per independent job]
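
What makes the workload embarrassingly parallel is that each job depends only on its own inputs. A minimal sketch, assuming a hypothetical `run_job` whose only input is a job number used as the random seed; the simulate-and-summarise step is a toy placeholder, not the script described in the talk.

```python
import random


def run_job(job_id):
    """One independent simulation job: a pure function of its job_id."""
    rng = random.Random(job_id)                   # per-job seed, no shared state
    t = rng.uniform(10, 1000)                     # parameter drawn from its prior
    summary = t / 1000.0 + rng.gauss(0.0, 0.05)   # toy stand-in for simulate + summarise
    return f"{job_id}\t{t:.3f}\t{summary:.5f}"    # small per-job record


def job_number(record):
    """Extract the job id from a tab-separated record."""
    return int(record.split("\t")[0])


# Jobs can run in any order (or on different machines) and be combined afterwards.
forward = [run_job(i) for i in range(8)]
backward = [run_job(i) for i in reversed(range(8))]
combined = sorted(forward, key=job_number)
```

On a cluster each `job_id` would be submitted as its own job; the scheduler-specific submission is not shown here.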

SLIDE 28

INHERITED SCRIPT INTENDED FOR SMALL SEQUENCES

1,389 10 kb regions
00000110001
00100010000
00000100101
00100000000
00010001010
00100010001

SLIDE 29

[Figure: a screen-filling block of simulated 0/1 genotype data]

SIMULATE WHOLE CHROMOSOME

~250 million sites on human chromosome 1

SLIDE 30

PROBLEM!

Parameters   Average Walltime   Average Memory
Minimum      00:21:00           2.7 GB
Random       00:55:11           20 GB
Maximum      08:02:11           117 GB

Too much memory!

Over a decade to complete at 6000 runs/month with UA resources

Each core on UA HPC has 6 GB, so each run needs < 6 GB of memory
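
The "over a decade" figure follows from simple arithmetic. The sketch below assumes a target of one million runs, an order of magnitude taken from the simulation counts quoted elsewhere in the talk; the 6000 runs/month rate is the one on this slide.

```python
runs_needed = 1_000_000   # assumption: roughly one million simulations
runs_per_month = 6000     # throughput quoted on the slide

months = runs_needed / runs_per_month
years = months / 12
print(f"{months:.0f} months, about {years:.1f} years")   # 167 months, about 13.9 years
```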

SLIDE 31

EMBARRASSINGLY PARALLEL & RESOURCE LIGHT!

Same input Combined output

Each job:
Runs in ~40 min (max 50 hrs)
Uses ~1 GB of memory (max 5 GB)
Uses ~2 MB of storage

SLIDE 32

HIGH THROUGHPUT COMPUTING

OSG Connect, XSEDE, UA HPC, UW HTC

SLIDE 33

SIMULATIONS ON HTC CLUSTERS, ANALYSES ON VM

Simulations: OSG Connect, XSEDE, UA HPC, UW HTC
Data storage, analyses: CyVerse Atmosphere
Data backup: CyVerse Data Store, Google Drive

SLIDE 34

CHALLENGES: TECHNICAL

How to handle millions of files?
UA HPC has file number limit
If there are too many files in a directory, simple things take a long time

How to not overload UA HPC system?
How to reliably backup data?
Why do jobs fail?
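
One common way to keep any single directory small is to spread output files across numbered subdirectories. The sketch below is an illustration of that idea, not necessarily what was done for this project; the file naming, the `sharded_path` helper, and the 1000-files-per-directory bucket size are all assumptions.

```python
import tempfile
from pathlib import Path


def sharded_path(root, job_id, files_per_dir=1000):
    """Place each job's output in a numbered bucket directory."""
    bucket = job_id // files_per_dir
    return Path(root) / f"{bucket:04d}" / f"sim_{job_id:07d}.txt"


def write_result(root, job_id, text):
    """Write one job's output, creating its bucket directory if needed."""
    path = sharded_path(root, job_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


with tempfile.TemporaryDirectory() as root:
    for job_id in (0, 999, 1000, 2500):
        write_result(root, job_id, "parameters\tsummaries\n")
    buckets = sorted(p.name for p in Path(root).iterdir())

print(buckets)   # ['0000', '0001', '0002']
```

Sharding only addresses slow operations on very large directories. A limit on the total number of files would instead call for concatenating many small results into fewer, larger files.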

SLIDE 35

>1 MILLION SIMULATIONS OF EACH MODEL

SLIDE 36

MODEL CHOICE

Posterior probability of each model: 0.0065, 0.85, 0.14
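
In a rejection-style ABC analysis, a model's posterior probability can be approximated by the share of accepted simulations that came from that model. A toy sketch, assuming three hypothetical models with made-up simulators and equal prior probability; it does not reproduce the numbers above.

```python
import random
from collections import Counter

MODELS = ["model_1", "model_2", "model_3"]          # hypothetical model names
CENTRES = {"model_1": 0.2, "model_2": 0.5, "model_3": 0.7}


def simulate_summary(model, rng):
    """Toy simulator: each model yields a summary centred on a different value."""
    return rng.gauss(CENTRES[model], 0.1)


def model_posteriors(observed, n_sims=30000, keep_frac=0.01, seed=7):
    """Fraction of the closest simulations contributed by each model."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        model = rng.choice(MODELS)                  # equal prior weight per model
        sims.append((abs(simulate_summary(model, rng) - observed), model))
    sims.sort()                                     # closest simulations first
    n_keep = max(1, round(n_sims * keep_frac))
    counts = Counter(model for _, model in sims[:n_keep])
    return {model: counts[model] / n_keep for model in MODELS}


posteriors = model_posteriors(observed=0.5)
best_model = max(posteriors, key=posteriors.get)
```

Because the observed value sits at the centre of `model_2`, most accepted simulations come from that model, so it receives the highest approximate posterior probability.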

SLIDE 37

BEST MODEL

Timeline: 17 kya, 3200 ya, 860 ya, 490 ya

~1200 BCE: ancestors of Jewish populations diverged from other Middle Eastern populations
Experienced extreme population size reduction

~1100 CE: ancestors of Ashkenazi Jews diverged from other Jewish populations
Experienced another population size reduction
Experienced gene flow from Europeans (unresolved how much or when)

~1500 CE: Eastern and Western Ashkenazi Jews diverged
Western AJ moderately grew in size
Eastern AJ massively grew in size

SLIDE 38

SIMPRILY: GENERALIZATION OF CODE AND WORKFLOW

Developed program to simulate any demographic model

Memory & space efficient
Use Singularity container
Pegasus workflow for OSG

https://agladstein.github.io/SimPrily/

SLIDE 39

HAMMER LAB

Michael Hammer
Consuelo Quinto-Cortes

THANK YOU!

CYVERSE

Blake Joyce
Julian Pistorius

UA HPC CONSULTING

Mike Bruck
Dima Shyshlov

OPEN SCIENCE GRID USER SCHOOL

Tim Cartwright
Lauren Michael
Christina Koch

OPEN SCIENCE GRID & PEGASUS

Mats Rynge

CODING MINIONS

David Christy
Logan Gantner
Mack Skodiak
Daniel Olson
Rafael Lopez
Kayleen Gurrola
Katie McCready

UW CENTER FOR HTC

Lauren Michael
Christina Koch

RESOURCES PROVIDED BY

University of Arizona HPC
University of Wisconsin HTC

CyVerse
Open Science Grid
XSEDE
Bridges
Comet
Jetstream

SLIDE 40

CPU HOURS ON THE OPEN SCIENCE GRID

SLIDE 41

DNA SEQUENCE

Indiv 1
AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA

SLIDE 42

DNA SEQUENCE, SEGREGATING SITES

Indiv 1
AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA

SLIDE 43

DNA SEQUENCE, SEGREGATING SITES

Indiv 1
AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA
Indiv 2
AATCATTTCGGTTTTAATGCTTGGGCTGCCTTGGTAAA
AAACATTTCCGTCTTTATGGTTGCGCTGCATTGGGGAA

SLIDE 44

DNA SEQUENCE, GENOTYPES ENCODED 0/1

Indiv 1
AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA
Indiv 2
AATCATTTCGGTTTTAATGCTTGGGCTGCCTTGGTAAA
AAACATTTCCGTCTTTATGGTTGCGCTGCATTGGGGAA

Indiv 2
00000000000000000000000000000100000000
00100000010010010001000100000000001100
Indiv 1
00000000000000000000000100000000001000
00000010000010000000000000000100000000

SLIDE 45

SEQUENCE OF GENOTYPES, ONLY SEGREGATING SITES

Indiv 2
0000000100
1011111011
Indiv 1
0000001010
0101000100
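
Reducing the sequences to segregating sites means dropping every column in which all haplotypes carry the same value. A minimal sketch using the four DNA haplotypes and the four 0/1 strings shown on the preceding slides; the function names are illustrative, and the slide's mapping from bases to 0/1 is taken as given rather than derived.

```python
def segregating_columns(haplotypes):
    """Indices of columns where the haplotypes do not all agree."""
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must all have the same length")
    return [i for i in range(length) if len({h[i] for h in haplotypes}) > 1]


def keep_columns(haplotypes, columns):
    """Keep only the given columns of each haplotype."""
    return ["".join(h[i] for i in columns) for h in haplotypes]


dna = [
    "AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA",
    "AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA",
    "AATCATTTCGGTTTTAATGCTTGGGCTGCCTTGGTAAA",
    "AAACATTTCCGTCTTTATGGTTGCGCTGCATTGGGGAA",
]
encoded = [
    "00000000000000000000000000000100000000",
    "00100000010010010001000100000000001100",
    "00000000000000000000000100000000001000",
    "00000010000010000000000000000100000000",
]

dna_sites = segregating_columns(dna)      # variable positions in the DNA
sites = segregating_columns(encoded)      # variable positions in the 0/1 encoding
compact = keep_columns(encoded, sites)    # 0/1 strings restricted to those sites
```

Storing only the variable columns is what shrinks each 38-character haplotype here to 10 characters.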

SLIDE 46

PYTHON SCRIPT: GENOME SIMULATIONS AND COMPUTE SUMMARY STATISTICS

Inherited from lab mates
Intended for millions of relatively small simulations
1,389 10kb regions
65 individuals
Originally took a few minutes to run
Originally ran parallel on U of A HPC
1 million runs would take approximately 1 month.

SLIDE 47

PROFILE OF PYTHON SCRIPT

[Figure: profiles of the script run with maximum and minimum simulation parameters. Note the different scales.]

SLIDE 48

[Figure: two sets of profiles of the script run with maximum and minimum simulation parameters. Note the different scales.]

Goal: max memory < 6 GB
Can now run efficiently in parallel