Quartet Inference from SNP Data Under the Coalescent Model
Syed Shalan Naqvi

Problem Statement
◮ We’re given aligned sequence data from multiple genes
◮ We want a good estimate for the species tree

Two Common Approaches
◮ Summary Methods (e.g. STEM, MP-EST)
  ◮ Use sequence data to estimate gene trees
  ◮ Use gene trees to estimate species trees
◮ Bayesian Methods (e.g. BEST, *BEAST)
  ◮ Co-estimate gene trees and species trees using MCMC

Issues
◮ Summary Methods (e.g. STEM, MP-EST)
  ◮ Assume the estimated gene trees are error free
  ◮ For short sequences, this can be a big problem
◮ Bayesian Methods (e.g. BEST, *BEAST)
  ◮ Don’t scale to large datasets

A new approach
◮ SVDQuartets uses the sequence data directly
◮ Does not use a Bayesian approach

Background
◮ Suppose we’re given a species tree and a model for sequence evolution along gene trees (e.g. Jukes-Cantor, GTR)
◮ A species tree defines a probability distribution on gene trees
◮ Using this, and the model for sequence evolution, we can compute the probability of observing a particular character state at each leaf of the species tree

Background II
◮ For a species tree with 4 taxa, write $p_{ijkl}$ for the probability $P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$
◮ For a given split, we can calculate all these probabilities and write them in a 16 × 16 matrix (for the split 12 | 34, rows represent the possible values of $(X_1, X_2)$ and columns those of $(X_3, X_4)$)
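As a rough illustration of how the 16 × 16 matrices can be laid out, the sketch below assumes the probabilities $p_{ijkl}$ are stored in a 4 × 4 × 4 × 4 NumPy array `P` indexed by the states of taxa 1 to 4. The function name `flattenings` and the array layout are assumptions made for this example, not something taken from the slides.

```python
import numpy as np

def flattenings(P):
    """Return the 16 x 16 matrix for each of the three splits.

    P[i, j, k, l] is assumed to hold p_ijkl for taxa 1, 2, 3, 4.
    Rows index the state pair on the left of the split, columns the
    state pair on the right.
    """
    P = np.asarray(P)
    if P.shape != (4, 4, 4, 4):
        raise ValueError("expected a 4 x 4 x 4 x 4 array")
    return {
        # rows (X1, X2), columns (X3, X4)
        "12|34": P.reshape(16, 16),
        # rows (X1, X3), columns (X2, X4)
        "13|24": P.transpose(0, 2, 1, 3).reshape(16, 16),
        # rows (X1, X4), columns (X2, X3)
        "14|23": P.transpose(0, 3, 1, 2).reshape(16, 16),
    }
```

With this layout, the entry for the pattern (i, j, k, l) sits at row 4i + j, column 4k + l in the 12 | 34 matrix, and at the analogous positions in the other two.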

Background III
◮ We can make this matrix for all 3 possible splits (12 | 34, 13 | 24, 14 | 23)
◮ We then have the following theorem
Theorem: Assuming a strict molecular clock, for the split corresponding to the true species tree, the rank of the corresponding matrix is at most 10. For all other splits, the rank is strictly greater than 10.

SVDQuartets
◮ The theorem suggests the following procedure for estimating the species tree
  ◮ Estimate the probabilities using the sequences (assuming each site has its own genealogy)
  ◮ For each of the 3 splits, compute the rank of the corresponding matrix
  ◮ The matrix with rank ≤ 10 gives the topology of the species tree
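The first step, estimating the probabilities from the sequences, can be sketched as a simple count of site patterns. This is a minimal sketch assuming four aligned DNA sequences given as equal-length strings; the function name `site_pattern_frequencies` and the choice to skip columns containing gaps or ambiguity codes are assumptions of this example.

```python
import numpy as np

STATE = {"A": 0, "C": 1, "G": 2, "T": 3}

def site_pattern_frequencies(seqs):
    """Estimate p_ijkl as the observed frequency of each site pattern.

    seqs: four aligned sequences (strings of equal length), one per taxon.
    Returns a 4 x 4 x 4 x 4 array whose entries sum to 1.
    """
    if len(seqs) != 4 or len({len(s) for s in seqs}) != 1:
        raise ValueError("expected four aligned sequences of equal length")
    counts = np.zeros((4, 4, 4, 4))
    for column in zip(*seqs):
        # Assumption: columns with gaps or ambiguity codes are ignored.
        if all(c in STATE for c in column):
            counts[tuple(STATE[c] for c in column)] += 1
    total = counts.sum()
    if total == 0:
        raise ValueError("no usable sites in the alignment")
    return counts / total
```

Pooling all sites into one table treats every site as an independent draw from the same pattern distribution, which is how the "each site has its own genealogy" assumption enters.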

SVDQuartets II
◮ Since the calculated probabilities are estimates, we don’t really expect to find a matrix with rank ≤ 10
◮ So instead we pick the matrix that’s closest to a rank 10 matrix using SVD
◮ We can use bootstrap samples to estimate uncertainty
◮ For more than four taxa
  ◮ Do this for all $\binom{n}{4}$ subsets of four taxa
  ◮ Use a quartet assembly algorithm to get the full species tree
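The bookkeeping for more than four taxa and for bootstrapping can be sketched as follows. The helper names (`all_quartets`, `quartet_splits`, `bootstrap_alignment`) are hypothetical, the bootstrap shown is a plain resampling of alignment columns with replacement, and the quartet assembly step itself is not implemented here.

```python
import random
from itertools import combinations

def all_quartets(taxa):
    """All subsets of four taxa; there are C(n, 4) of them."""
    return list(combinations(taxa, 4))

def quartet_splits(quartet):
    """The three possible splits of a quartet (a, b, c, d)."""
    a, b, c, d = quartet
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    seqs: aligned sequences as equal-length strings.
    rng:  a random.Random instance, so replicates are reproducible.
    """
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(s[i] for i in picks) for s in seqs]
```

For example, `all_quartets("ABCDEF")` yields 15 quartets, each of which would be scored on its three splits before being handed to an assembly method.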

SVDQuartets III
◮ Let $M_{12}$ denote the matrix for the split 12 | 34
◮ Then factoring $M_{12}$ using SVD, we get $M_{12} = U \Sigma V^T$ (where $\Sigma$ is a diagonal matrix)
◮ Then the distance of $M_{12}$ to a rank 10 matrix is
  $$\sqrt{\sum_{i=11}^{16} \Sigma_{ii}^2}$$
◮ This is defined as the SVDScore for the split 12 | 34
◮ We pick the split corresponding to the lowest SVDScore
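The score described above can be computed directly from the singular values. The sketch below uses `numpy.linalg.svd`; the function names `svd_score` and `best_split` and the dictionary-of-matrices input are assumptions of this example rather than the interface of any existing SVDQuartets implementation.

```python
import numpy as np

def svd_score(M, rank=10):
    """Frobenius distance from M to the nearest matrix of the given rank.

    This equals the square root of the sum of the squared singular
    values beyond the first `rank` (numpy returns them in descending order).
    """
    s = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def best_split(matrices):
    """Pick the split with the lowest score.

    matrices: dict mapping a split label (e.g. "12|34") to its 16 x 16 matrix.
    Returns (label of the best split, dict of all scores).
    """
    scores = {split: svd_score(M) for split, M in matrices.items()}
    return min(scores, key=scores.get), scores
```

A matrix of exact rank 10 or less scores zero up to floating-point error, so with estimated probabilities the true split is expected to have the smallest score rather than a score of exactly zero.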

Simulations
◮ Generated a model species tree of the form ((1:x, 2:x):x, (3:x, 4:x):x) (where x is the branch length)
◮ Sampled g gene trees according to the model species tree
◮ Generated sequences of length n on those gene trees
◮ For generating unlinked SNP data, g was set to 5000 and n to 1
◮ For multi-locus data, g = 10 and n = 500 were used
◮ Simulations were done for Jukes-Cantor and GTR + I + Γ
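A simulation of this kind might look roughly like the sketch below. It is not the code used for the results; it assumes one sampled lineage per species, branch lengths x in coalescent units, a hypothetical scaling factor `mu` that converts coalescent time to expected substitutions per site, and the Jukes-Cantor model only. All function names are made up for this example.

```python
import math
import random

BASES = "ACGT"

def coalesce(lineages, start, end, rng):
    """Coalesce lineages from time `start` until `end` (None means no limit).

    Each lineage is a tuple (height, left, right, leaf_label).
    Returns the lineages that are still separate at time `end`.
    """
    lineages = list(lineages)
    t = start
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2)  # wait for the next coalescence
        if end is not None and t >= end:
            break
        a, b = rng.sample(range(k), 2)
        node = (t, lineages[a], lineages[b], None)
        lineages = [x for i, x in enumerate(lineages) if i not in (a, b)] + [node]
    return lineages

def sample_gene_tree(x, rng):
    """Sample a gene tree inside ((1:x, 2:x):x, (3:x, 4:x):x)."""
    leaves = [(0.0, None, None, i) for i in range(4)]
    left = coalesce(leaves[:2], x, 2 * x, rng)   # branch above taxa 1 and 2
    right = coalesce(leaves[2:], x, 2 * x, rng)  # branch above taxa 3 and 4
    (root,) = coalesce(left + right, 2 * x, None, rng)  # above the root
    return root

def evolve_site(node, state, mu, rng, out):
    """Evolve one site down the gene tree under Jukes-Cantor."""
    height, left, right, label = node
    if label is not None:
        out[label] = state
        return
    for child in (left, right):
        t = mu * (height - child[0])  # expected substitutions on this branch
        p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
        if rng.random() < p_same:
            child_state = state
        else:
            child_state = rng.choice([b for b in BASES if b != state])
        evolve_site(child, child_state, mu, rng, out)

def simulate(x, g, n, mu=0.05, seed=0):
    """Return four sequences of length g * n: g gene trees, n sites each."""
    rng = random.Random(seed)
    seqs = [[] for _ in range(4)]
    for _ in range(g):
        tree = sample_gene_tree(x, rng)
        for _ in range(n):
            out = [None] * 4
            evolve_site(tree, rng.choice(BASES), mu, rng, out)
            for i in range(4):
                seqs[i].append(out[i])
    return ["".join(s) for s in seqs]
```

Under these assumptions, `simulate(x, g=5000, n=1)` mimics the unlinked-SNP setting (every site on its own gene tree, with invariant sites left in), and `simulate(x, g=10, n=500)` mimics the multi-locus setting.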

Results (Jukes-Cantor)

Results (GTR + I + Γ)

Discussion
◮ Across both datasets, we can see that SVDQuartets easily identifies the correct split
◮ The theory for the model was derived for SNP sites, and for the GTR model and its sub-models
