INFERENCE OF EVOLUTIONARY HISTORY WITH APPROXIMATE BAYESIAN - - PowerPoint PPT Presentation



slide-1
SLIDE 1

INFERENCE OF EVOLUTIONARY HISTORY WITH APPROXIMATE BAYESIAN COMPUTATION

Ariella Gladstein Ecology and Evolutionary Biology University of Arizona

slide-2
SLIDE 2
slide-3
SLIDE 3

HOW DID HUMANS SPREAD ACROSS THE WORLD?

WHAT DEMOGRAPHIC EVENTS LED US TO WHERE WE ARE TODAY AND THE DIVERSITY WE SEE?

(Nielsen et al. 2017)

slide-4
SLIDE 4

(Nielsen et al. 2017)

slide-5
SLIDE 5

(Nielsen et al. 2017)

slide-6
SLIDE 6

(Nielsen et al. 2017)

slide-7
SLIDE 7

(Nielsen et al. 2017)

slide-8
SLIDE 8

(Nielsen et al. 2017)

slide-9
SLIDE 9

(Nielsen et al. 2017)

slide-10
SLIDE 10

WHAT ARE “DEMOGRAPHIC EVENTS”?

slide-11
SLIDE 11
WHAT ARE “DEMOGRAPHIC EVENTS”?

  • Divergence

slide-12
SLIDE 12
WHAT ARE “DEMOGRAPHIC EVENTS”?

  • Divergence
  • Expansion or reduction

slide-13
SLIDE 13
WHAT ARE “DEMOGRAPHIC EVENTS”?

  • Divergence
  • Expansion or reduction
  • Gene flow

slide-14
SLIDE 14

AIM: INFER THE DEMOGRAPHIC HISTORY OF THE ASHKENAZI JEWS.

slide-15
SLIDE 15

ASHKENAZI JEWS: AN INTERESTING STUDY POPULATION

  • High frequency of genetic disorders
  • Population isolate
  • Complex demographic history
  • Well-documented historical record

slide-16
SLIDE 16

ASHKENAZI JEWS: AN INTERESTING STUDY POPULATION

  • High frequency of genetic disorders
  • Population isolate
  • Complex demographic history
  • Well-documented historical record

slide-17
SLIDE 17

HYPOTHESIS OF ASHKENAZI ORIGINS

slide-18
SLIDE 18

WESTERN VS. EASTERN ASHKENAZI JEWS

YIVO Institute for Jewish Research. People of a Thousand Towns. Online Photographic Catalog. Record Id: 6820

JDC Archives. Reference Code: NY_02044

Cracow, Poland, 1932
Germany, 1900s

slide-19
SLIDE 19

WESTERN VS. EASTERN ASHKENAZI JEWS

YIVO Institute for Jewish Research. People of a Thousand Towns. Online Photographic Catalog. Record Id: 6820

JDC Archives. Reference Code: NY_02044

Cracow, Poland, 1932
Germany, 1900s

Reference census data

slide-20
SLIDE 20

MOTIVATION

  • Numerous genetic studies on the Ashkenazi Jews.
  • All genome-wide studies treat Ashkenazi Jews as one population.

  • Preliminary work consistent with genetic differentiation.
  • Not informative of cause of differentiation.
slide-21
SLIDE 21

MODELS OF ASHKENAZI HISTORY

slide-22
SLIDE 22

APPROXIMATE BAYESIAN COMPUTATION

  • Infer parameter values
  • Choose among models
slide-23
SLIDE 23

APPROXIMATE BAYESIAN COMPUTATION

  • 1. Define priors of parameters of model

t ~ Unif[10, 1000], where t = time (in generations) of divergence between Jewish and Middle Eastern populations

slide-24
SLIDE 24

APPROXIMATE BAYESIAN COMPUTATION

  • 1. Define priors of parameters of model
  • 2. Simulate data many times
slide-25
SLIDE 25

APPROXIMATE BAYESIAN COMPUTATION

  • 1. Define priors of parameters of model
  • 2. Simulate data many times
  • 3. Choose model and estimate parameters based on simulations closest to real data
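
The three steps can be sketched as a rejection-style ABC loop. This is a minimal toy illustration, not the code used in this work: the simulator (`simulate`), the summary statistic (`summary`), the noise level, and the 1% acceptance fraction are all assumptions made for the example. Only the uniform prior on t, from 10 to 1000 generations, comes from the slides.

```python
import random
import statistics

def simulate(t, rng, n_obs=20):
    """Toy stand-in for a genetic simulator: noisy data centred on t."""
    return [rng.gauss(t, 50.0) for _ in range(n_obs)]

def summary(data):
    """Reduce a simulated dataset to one summary statistic (here, the mean)."""
    return statistics.fmean(data)

def abc_rejection(observed_summary, n_sims=5000, accept_frac=0.01, seed=1):
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        t = rng.uniform(10, 1000)        # step 1: draw t from its prior
        s = summary(simulate(t, rng))    # step 2: simulate data and summarize it
        draws.append((abs(s - observed_summary), t))
    draws.sort()                         # step 3: rank simulations by distance to the real data
    n_keep = max(1, int(n_sims * accept_frac))
    return [t for _, t in draws[:n_keep]]  # accepted draws approximate the posterior of t

posterior_t = abc_rejection(observed_summary=400.0)
print(len(posterior_t), round(statistics.fmean(posterior_t)))
```

The accepted values of t form an approximate posterior sample. Its spread depends on how informative the summary statistic is and on how strict the acceptance fraction is.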

slide-26
SLIDE 26

SIMULATION

Model parameters → Store genotype sequences in memory → Calculate summaries of sequences → <10 KB file with parameter values and summaries

slide-27
SLIDE 27

EMBARRASSINGLY PARALLEL!

Model parameters → Store genotype sequences in memory → Calculate summaries of sequences → <10 KB file with parameter values and summaries

(The same pipeline is repeated independently for every simulation.)
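
A minimal sketch of why this workload parallelizes so easily: each job depends only on its own ID, keeps its genotypes in memory, and emits one small record. The function name `run_job`, the parameter `mutation_prob`, and the toy genotype generator are hypothetical placeholders, not the simulator used in this work.

```python
import json
import random

def run_job(job_id, n_haplotypes=8, n_sites=200):
    """One independent simulation: parameters in, a small summary record out."""
    rng = random.Random(job_id)              # per-job seed, so jobs share no state
    mutation_prob = rng.uniform(0.01, 0.5)   # hypothetical model parameter drawn from a prior
    # Genotypes (0/1) are held in memory only and never written to disk.
    haplotypes = [
        [1 if rng.random() < mutation_prob else 0 for _ in range(n_sites)]
        for _ in range(n_haplotypes)
    ]
    # Summary statistic: number of sites where both alleles are present.
    segregating = sum(1 for col in zip(*haplotypes) if 0 < sum(col) < n_haplotypes)
    return json.dumps({
        "job_id": job_id,
        "mutation_prob": mutation_prob,
        "segregating_sites": segregating,
    })

# A plain loop here; on a cluster each call could be submitted as a separate job.
records = [run_job(job_id) for job_id in range(4)]
print(records[0])
```

Because no job reads another job's output, the jobs can run in any order and on any machine, and the small records can be combined afterwards.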

slide-28
SLIDE 28

INHERITED SCRIPT INTENDED FOR SMALL SEQUENCE

1,389 10kb regions

00000110001
00100010000
00000100101
00100000000
00010001010
00100010001

slide-29
SLIDE 29

SIMULATE WHOLE CHROMOSOME

[Slide shows a long block of 0/1 genotype characters illustrating the size of a whole-chromosome simulation]

~250 million sites on human chromosome 1

slide-30
SLIDE 30

PROBLEM!

Parameters   Average Walltime   Average Memory
Minimum      00:21:00           2.7 GB
Random       00:55:11           20 GB
Maximum      08:02:11           117 GB

Too much memory!

At 6,000 runs/month with UA resources, it would take over a decade to complete.

Each core on UA HPC has 6 GB of memory, so each run needs < 6 GB.
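
The "over a decade" figure can be checked with quick arithmetic. The target of one million runs is an assumption made here for illustration, since this slide does not state it. The throughput of 6,000 runs/month is the number quoted above.

```python
target_runs = 1_000_000   # assumed target number of simulations (not stated on this slide)
runs_per_month = 6_000    # throughput quoted on the slide

months = target_runs / runs_per_month
years = months / 12
print(f"{months:.0f} months, about {years:.1f} years")
# → 167 months, about 13.9 years
```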

slide-31
SLIDE 31

EMBARRASSINGLY PARALLEL & RESOURCE LIGHT!

Same input Combined output

Each job:
  • Runs ~40 min (max 50 hrs)
  • Uses ~1 GB of memory (max 5 GB)
  • Uses ~2 MB of storage
slide-32
SLIDE 32

HIGH THROUGHPUT COMPUTING

  • OSG Connect
  • XSEDE
  • UA HPC
  • UW HTC

slide-33
SLIDE 33

SIMULATIONS ON HTC CLUSTERS, ANALYSES ON VM

  • Simulations: OSG Connect, XSEDE, UA HPC, UW HTC
  • Data storage, analyses: CyVerse Atmosphere, CyVerse Data Store
  • Data backup: Google Drive

slide-34
SLIDE 34

CHALLENGES: TECHNICAL

  • How to handle millions of files?
      • UA HPC has a file number limit
      • If there are too many files in a directory, simple things take a long time
  • How to not overload the UA HPC system?
  • How to reliably back up data?
  • Why do jobs fail?
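
One possible way to cope with millions of small result files is to spread them across numbered subdirectories and then merge them into a single file. The sketch below illustrates that idea only. It is not the approach described in the talk, and the names `shard_path` and `merge_results`, the directory layout, and the file naming are all hypothetical.

```python
import tempfile
from pathlib import Path

def shard_path(root, job_id, files_per_dir=1000):
    """Put each result in a numbered subdirectory so no directory grows too large."""
    shard = job_id // files_per_dir
    return Path(root) / f"shard_{shard:05d}" / f"job_{job_id}.txt"

def merge_results(root, combined_path):
    """Concatenate every per-job file under root into one file; return how many were merged."""
    count = 0
    with open(combined_path, "w") as out:
        for path in sorted(Path(root).rglob("job_*.txt")):
            out.write(path.read_text().rstrip("\n") + "\n")
            count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "results"
    for job_id in (0, 999, 1000, 2500):
        path = shard_path(root, job_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{job_id}\t0.5\n")   # toy record: job ID and one summary value
    combined = Path(tmp) / "combined.tsv"
    n_merged = merge_results(root, combined)
    merged_lines = combined.read_text().splitlines()

print(n_merged, "files merged")
```

After merging, only the combined file needs to be counted against a file quota and backed up.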
slide-35
SLIDE 35

>1 MILLION SIMULATIONS OF EACH MODEL

slide-36
SLIDE 36

MODEL CHOICE

Posterior probability: 0.0065 0.85 0.14
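
A common way to estimate model posterior probabilities in ABC is to pool the simulations from all models, keep those closest to the observed data, and report each model's share of the accepted set. The slides do not say which estimator was used here, so the sketch below is only an illustration: it uses a single toy summary value per simulation, and it assumes equal prior weight per model, meaning an equal number of simulations from each. The labels `M1` to `M3` are placeholders.

```python
from collections import Counter

def model_posteriors(simulations, observed, n_keep):
    """simulations: list of (model_label, summary_value) pairs pooled over all models."""
    closest = sorted(simulations, key=lambda s: abs(s[1] - observed))[:n_keep]
    counts = Counter(label for label, _ in closest)
    labels = sorted({label for label, _ in simulations})
    return {label: counts[label] / n_keep for label in labels}

# Toy data: three simulations per model.
sims = [
    ("M1", 0.10), ("M1", 0.90), ("M1", 0.00),
    ("M2", 0.48), ("M2", 0.52), ("M2", 0.55),
    ("M3", 0.45), ("M3", 2.00), ("M3", 1.20),
]
posteriors = model_posteriors(sims, observed=0.5, n_keep=4)
print(posteriors)
# → {'M1': 0.0, 'M2': 0.75, 'M3': 0.25}
```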

slide-37
SLIDE 37

BEST MODEL

Model timeline: 17 kya, 3200 ya, 860 ya, 490 ya

  • ~1200 BCE ancestors of Jewish populations diverged from other Middle Eastern populations
      • Experienced extreme population size reduction
  • ~1100 CE ancestors of Ashkenazi Jews diverged from other Jewish populations
      • Experienced another population size reduction
      • Experienced gene flow from Europeans (unresolved how much or when)
  • ~1500 CE Eastern and Western Ashkenazi Jews diverged
      • Western AJ moderately grew in size
      • Eastern AJ massively grew in size
slide-38
SLIDE 38

SIMPRILY: GENERALIZATION OF CODE AND WORKFLOW

  • Developed program to simulate any demographic model

  • Memory & space efficient
  • Use Singularity container
  • Pegasus workflow for OSG

https://agladstein.github.io/SimPrily/

slide-39
SLIDE 39

THANK YOU!

HAMMER LAB

  • Michael Hammer
  • Consuelo Quinto-Cortes

CYVERSE

  • Blake Joyce
  • Julian Pistorius

UA HPC CONSULTING

  • Mike Bruck
  • Dima Shyshlov

OPEN SCIENCE GRID USER SCHOOL

  • Tim Cartwright
  • Lauren Michael
  • Christina Koch

OPEN SCIENCE GRID & PEGASUS

  • Mats Rynge

CODING MINIONS

  • David Christy
  • Logan Gantner
  • Mack Skodiak
  • Daniel Olson
  • Rafael Lopez
  • Kayleen Gurrola
  • Katie McCready

UW CENTER FOR HTC

  • Lauren Michael
  • Christina Koch

RESOURCES PROVIDED BY

  • University of Arizona HPC
  • University of Wisconsin HTC
  • CyVerse
  • Open Science Grid
  • XSEDE
      • Bridges
      • Comet
      • Jetstream
slide-40
SLIDE 40

CPU HOURS ON THE OPEN SCIENCE GRID

slide-41
SLIDE 41

DNA SEQUENCE

Indiv 1  AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
         AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA

slide-42
SLIDE 42

DNA SEQUENCE, SEGREGATING SITES

Indiv 1  AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
         AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA

slide-43
SLIDE 43

DNA SEQUENCE, SEGREGATING SITES

Indiv 1  AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
         AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA
Indiv 2  AATCATTTCGGTTTTAATGCTTGGGCTGCCTTGGTAAA
         AAACATTTCCGTCTTTATGGTTGCGCTGCATTGGGGAA

slide-44
SLIDE 44

DNA SEQUENCE, GENOTYPES ENCODED 0/1

Indiv 1  AATCATTTCGGTTTTAATGCTTGGGCTGCATTGGGAAA
         AATCATATCGGTCTTAATGCTTGCGCTGCCTTGGTAAA
Indiv 2  AATCATTTCGGTTTTAATGCTTGGGCTGCCTTGGTAAA
         AAACATTTCCGTCTTTATGGTTGCGCTGCATTGGGGAA

Indiv 2  00000000000000000000000000000100000000
         00100000010010010001000100000000001100
Indiv 1  00000000000000000000000100000000001000
         00000010000010000000000000000100000000
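
A minimal sketch of 0/1 encoding, assuming the convention that 0 means "matches a reference sequence" and 1 means "differs from it". The slide does not state which convention it uses, so both the reference and the short toy sequences below are hypothetical rather than taken from the slide.

```python
def encode_haplotype(haplotype, reference):
    """Encode a haplotype as a 0/1 string: 0 = same base as the reference, 1 = different."""
    if len(haplotype) != len(reference):
        raise ValueError("sequences must be the same length")
    return "".join("0" if h == r else "1" for h, r in zip(haplotype, reference))

reference = "AATCATTTCG"   # hypothetical reference sequence (toy data)
haplotypes = ["AATCATTTCG", "AATCATATCG", "AAACATTTCC"]
encoded = [encode_haplotype(h, reference) for h in haplotypes]
print(encoded)
# → ['0000000000', '0000001000', '0010000001']
```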

slide-45
SLIDE 45

SEQUENCE OF GENOTYPES, ONLY SEGREGATING SITES

Indiv 2  0000000100
         1011111011
Indiv 1  0000001010
         0101000100
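
Dropping the sites where every haplotype carries the same value is what shrinks the data so much. Here is a small sketch of that filter, using hypothetical toy rows rather than the sequences on the slide. The function name `keep_segregating_sites` is a placeholder.

```python
def keep_segregating_sites(encoded_rows):
    """Keep only columns where the haplotypes are not all identical.

    Returns the compacted rows and the original positions of the kept columns.
    """
    columns = list(zip(*encoded_rows))
    keep = [i for i, col in enumerate(columns) if len(set(col)) > 1]
    compact = ["".join(row[i] for i in keep) for row in encoded_rows]
    return compact, keep

rows = ["0000000000", "0000001000", "0010000001"]   # toy 0/1-encoded haplotypes
compact, positions = keep_segregating_sites(rows)
print(compact, positions)
# → ['000', '010', '101'] [2, 6, 9]
```

Storing the kept positions alongside the compact rows preserves where each segregating site sits on the original sequence.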

slide-46
SLIDE 46

PYTHON SCRIPT: GENOME SIMULATIONS AND COMPUTE SUMMARY STATISTICS

  • Inherited from lab mates
  • Intended for millions of relatively small simulations
      • 1,389 10kb regions
      • 65 individuals
  • Originally took a few minutes to run
  • Originally ran in parallel on U of A HPC
      • 1 million runs would take approximately 1 month

slide-47
SLIDE 47

PROFILE OF PYTHON SCRIPT

[Profile plots, two panels: Maximum Simulation Parameters; Minimum Simulation Parameters]

*Note different scales

slide-48
SLIDE 48

[Profile plots, two sets of panels: Maximum Simulation Parameters; Minimum Simulation Parameters]

*Note different scales

Max memory < 6G goal
Can now run efficiently in parallel