Quartet Inference from SNP Data Under the Coalescent Model

Quartet Inference from SNP Data Under the Coalescent Model
Julia Chifman and Laura Kubatko
By Shashank Yaduvanshi

Estimating Species Trees from Gene Sequences
• Input: Alignments from multiple genes
• Output: Unified species tree
• Challenges:
  – Every gene has its own phylogeny
  – Gene trees might differ from the species tree due to incomplete lineage sorting (ILS), horizontal gene transfer, etc.

Phylogeny Estimation Methods under the Coalescent Model
• The coalescent model is used to model ILS in gene trees
• Summary-based methods
  – Quartet-based methods
• Concatenation methods
• Co-estimation methods

Summary-Based Methods
• First, estimate a gene tree independently for each gene using methods like RAxML
• Second, combine the gene trees into a species tree using methods like ASTRAL
• Computationally efficient for large data sets
• Estimation error in the gene trees lowers the overall accuracy

Quartet-Based Methods
• Estimate the most likely true quartet tree for each set of 4 taxa using the multi-gene sequence data
• Combine all (or a subset) of these quartet trees using a supertree method to get the species tree
• Works on the entire data set together while still remaining computationally efficient
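A minimal sketch of the first step above, enumerating every set of 4 taxa. The function name `all_quartets` and the taxon labels are hypothetical, chosen only for illustration. The number of quartets grows as n choose 4, which is why a subset of quartets is often used for large n.

```python
from itertools import combinations
from math import comb


def all_quartets(taxa):
    """Return every 4-taxon subset of `taxa` as a tuple, keeping input order."""
    return list(combinations(taxa, 4))


taxa = ["A", "B", "C", "D", "E", "F"]  # hypothetical taxon labels
quartets = all_quartets(taxa)

# The count always equals the binomial coefficient C(n, 4).
print(len(quartets), comb(len(taxa), 4))  # 15 15
```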

Concatenation Methods
• Concatenate all gene sequence alignments to get one long sequence alignment for each taxon
• Get the species tree using these long alignments directly with methods such as ML
• Ignores differences in the gene trees for different genes

Co-estimation Methods
• Co-estimate the gene trees and the species tree from the sequence alignments with methods such as Bayesian inference
• Generally higher accuracy than other methods
• Computationally inefficient for large datasets

Estimating Quartet Trees
• Most methods seen so far are distance-based or ML-based
• This paper introduces a new measure, SVD scores, based on the frequencies of quartet site patterns across all gene alignments
• SVD scores can be used to estimate the most likely quartet tree for any quartet of taxa

Important Concepts
• p_ijkl = P(X1 = i, X2 = j, X3 = k, X4 = l): the probability that taxa 1 to 4 show states i, j, k, l at a site
• A SPLIT of a taxon set L is a bipartition of L into two non-overlapping subsets L1 and L2, denoted L1|L2
• VALID SPLIT L1|L2 for tree T: there is some edge in T whose removal results in the same bipartition L1|L2. If no such edge exists, the split is INVALID
• For taxon quartets, we consider splits into two groups of two. There are 3 such possible splits for each quartet
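As a small illustration of the last bullet, the three two-versus-two splits of a quartet can be listed directly. The function name `quartet_splits` is hypothetical. For the quartet tree ((1,2),(3,4)), only the first split, 12|34, corresponds to an edge of the tree and is therefore valid.

```python
def quartet_splits(quartet):
    """Return the three splits of a 4-taxon set into two groups of two."""
    a, b, c, d = quartet
    return [
        ((a, b), (c, d)),  # ab|cd
        ((a, c), (b, d)),  # ac|bd
        ((a, d), (b, c)),  # ad|bc
    ]


for left, right in quartet_splits((1, 2, 3, 4)):
    print(f"{left[0]}{left[1]}|{right[0]}{right[1]}")
```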

Flattening
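The body of this slide did not survive extraction. As a hedged sketch, assuming the usual construction: for a split L1|L2 the flattening is a 16 x 16 matrix whose rows are indexed by the states of the two taxa in L1, whose columns are indexed by the states of the two taxa in L2, and whose entries are the site pattern probabilities p_ijkl. The helper names (`pattern_frequencies`, `flattening`) and the toy columns are hypothetical. Empirical pattern frequencies stand in for the true probabilities.

```python
from collections import Counter

STATES = "ACGT"
INDEX = {s: i for i, s in enumerate(STATES)}


def pattern_frequencies(columns):
    """Relative frequency of each 4-taxon site pattern.

    `columns` holds one 4-character string per site (taxa 1..4 in order).
    Columns with gaps or ambiguity codes are skipped (an assumption).
    """
    usable = [c for c in columns if len(c) == 4 and all(ch in INDEX for ch in c)]
    if not usable:
        raise ValueError("no usable alignment columns")
    counts = Counter(usable)
    return {pattern: n / len(usable) for pattern, n in counts.items()}


def flattening(freqs, split):
    """Build the 16x16 flattening for `split`, given as 0-based taxon positions."""
    (r1, r2), (c1, c2) = split
    flat = [[0.0] * 16 for _ in range(16)]
    for pattern, p in freqs.items():
        row = 4 * INDEX[pattern[r1]] + INDEX[pattern[r2]]
        col = 4 * INDEX[pattern[c1]] + INDEX[pattern[c2]]
        flat[row][col] += p
    return flat


columns = ["AACC", "AACC", "ACAC", "GGTT"]  # toy data, not from the paper
freqs = pattern_frequencies(columns)
flat_12_34 = flattening(freqs, ((0, 1), (2, 3)))
print(flat_12_34[0][5])  # pattern AACC: row AA (index 0), column CC (index 5)
```

The same pattern frequencies are rearranged differently for each of the three splits; only the row and column indexing changes.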

Important Concepts
• The RANK of a matrix A is the size of the largest collection of linearly independent columns (or rows) of A
• SVD: The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, A = U D V^T, where the columns of U and V are orthonormal and the matrix D is diagonal with non-negative real entries (the singular values)
• Rank(A) equals the number of non-zero diagonal elements (singular values) in D
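A short numerical check of these two facts, assuming NumPy is available. The example matrix and the tolerance `tol` are arbitrary choices. In floating point, singular values that are zero in exact arithmetic come out tiny rather than exactly zero, so a threshold is needed.

```python
import numpy as np

# Row 2 is twice row 1, so only two rows are linearly independent.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)  # A = U @ diag(s) @ Vt, s sorted in decreasing order

# The factorization reproduces A up to rounding error.
assert np.allclose(U @ np.diag(s) @ Vt, A)

tol = 1e-10  # assumed threshold for "numerically zero"
rank = int(np.sum(s > tol))
print(rank)  # 2
```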

Theorem
• [Chifman and Kubatko, 2014]. Let C denote the class of coalescent models under the four-state GTR model on a four-taxon binary species tree. For a valid split L1|L2, rank(Flat_{L1|L2}(P)) <= 10 for all distributions P arising from C. For a non-valid split L1|L2, generically rank(Flat_{L1|L2}(P)) > 10.

Approximation to Flattening

Finding the Best Split
• Calculate Flat_{L1|L2}(P') for all three possible splits
• Calculate the rank of each of these three matrices. The true split will have rank <= 10
• Not computationally intensive to get these counts and calculate the rank
• Can be run in parallel for different quartets

SVD Scores
• SVD score = 0 implies rank(Flat_{L1|L2}) <= 10, hence L1|L2 is a valid split
• SVD score > 0 implies rank(Flat_{L1|L2}) > 10, hence L1|L2 is an invalid split
• Choose the split with the lowest SVD score
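The slide does not show the formula for the score. A minimal sketch, assuming the score is the square root of the sum of the squared singular values from the 11th to the 16th, that is, the Frobenius distance from the flattening to the nearest matrix of rank 10. The function names and the random test matrices are hypothetical; NumPy is assumed.

```python
import numpy as np


def svd_score(flat, k=10):
    """Distance from `flat` to the nearest rank-k matrix (Frobenius norm)."""
    s = np.linalg.svd(np.asarray(flat, dtype=float), compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))


def best_split(flattenings):
    """Pick the split whose flattening has the lowest SVD score.

    `flattenings` maps a split label to its 16x16 matrix.
    """
    scores = {name: svd_score(flat) for name, flat in flattenings.items()}
    return min(scores, key=scores.get), scores


rng = np.random.default_rng(0)
low_rank = rng.random((16, 10)) @ rng.random((10, 16))  # rank 10 by construction
full_a = rng.random((16, 16))                           # generically rank 16
full_b = rng.random((16, 16))

winner, scores = best_split({"12|34": low_rank, "13|24": full_a, "14|23": full_b})
print(winner)
```

With real data the score of the true split is small rather than exactly zero, because the flattening is built from estimated pattern frequencies; this is why the rule is "lowest score" rather than "score equal to zero".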

Suitable Data
• SVD scores are applicable to data where each site evolves independently, coming from a different locus
• However, the authors claim that this method also works well when each locus contributes multiple sites, on both simulated and real-world data
• Bootstrapping for a dataset consisting of M aligned sites
  – Re-sample columns with replacement M times
  – Calculate SVD scores of the three splits for this data matrix
  – Repeat this procedure B times
  – Each bootstrap matrix votes for a particular split. The total number of votes for each split is its bootstrap support
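The bootstrap steps above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the score is taken to be the Frobenius distance to the nearest rank-10 matrix, columns are 4-character strings over A, C, G, T, and all names and the toy data are hypothetical. NumPy is assumed.

```python
import random
from collections import Counter

import numpy as np

IDX = {s: i for i, s in enumerate("ACGT")}
SPLITS = {  # 0-based taxon positions for the three splits
    "12|34": ((0, 1), (2, 3)),
    "13|24": ((0, 2), (1, 3)),
    "14|23": ((0, 3), (1, 2)),
}


def svd_score(columns, split, k=10):
    """SVD score of the empirical flattening of `columns` for `split`."""
    (r1, r2), (c1, c2) = split
    flat = np.zeros((16, 16))
    for col in columns:
        flat[4 * IDX[col[r1]] + IDX[col[r2]], 4 * IDX[col[c1]] + IDX[col[c2]]] += 1
    flat /= len(columns)
    s = np.linalg.svd(flat, compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))


def bootstrap_support(columns, B=100, seed=0):
    """Votes per split over B bootstrap replicates of the alignment columns."""
    rng = random.Random(seed)
    M = len(columns)
    votes = Counter({name: 0 for name in SPLITS})
    for _ in range(B):
        sample = rng.choices(columns, k=M)  # M columns, with replacement
        scores = {name: svd_score(sample, sp) for name, sp in SPLITS.items()}
        votes[min(scores, key=scores.get)] += 1  # ties go to the first split
    return dict(votes)


toy = ["AACC", "ACAC", "GGTT", "AAAA", "CCGG", "TTAA"]  # toy data only
votes = bootstrap_support(toy, B=20)
print(votes)
```

Dividing each vote count by B gives the bootstrap support as a proportion.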

Experiments
• Simulation Study
• Rattlesnake Multi-Loci Data
• Soybean SNP Data

Simulation Study
[Figure: model species tree on taxa 1, 2, 3, 4, with each branch labeled x]

Simulation Study
• Generate a sample of g gene trees under the coalescent model from the model species tree ((1:x,2:x):x,(3:x,4:x):x), where x is the length of each branch, using the program COAL (Degnan and Salter)
• Generate sequence data of length n on each gene tree under a specified substitution model
• Construct the flattening matrix for each of the three possible splits, and compute SVD(L1|L2) for each
• Repeat 1000 times and record SVD(L1|L2)_k, k = 1, 2, ..., 1000, for each split. For each of the 1000 datasets, generate B bootstrapped datasets and record SVD(L1|L2)_{k,b} for each split

Simulation Study
• x (branch length) = 0.5, 1, 2
• g = 5000, n = 1: Simulate SNP data, one site per gene
• g = 10, n = 500: Simulate multiple sites per gene
• Substitution models: the Jukes–Cantor model (JC69) and the GTR model with a proportion of invariant sites and with gamma-distributed mutation rates across sites (GTR + I + Γ)
• n = 1, g = 1000, 5000, 10000: Check runtime for quartets

Results
