Quartet Inference from SNP Data Under the Coalescent Model
Syed Shalan Naqvi
Problem Statement
◮ We’re given aligned sequence data from multiple genes
◮ We want a good estimate for the species tree
Two Common Approaches
◮ Summary methods (e.g. STEM, MP-EST)
  ◮ Use sequence data to estimate gene trees
  ◮ Use gene trees to estimate species trees
◮ Bayesian methods (e.g. BEST, *BEAST)
  ◮ Co-estimate gene trees and species trees using MCMC
Issues
◮ Summary methods (e.g. STEM, MP-EST)
  ◮ Assume the estimated gene trees are error-free
  ◮ For short sequences, this can be a big problem
◮ Bayesian methods (e.g. BEST, *BEAST)
  ◮ Don’t scale to large datasets
A new approach
◮ SVDQuartets uses the sequence data directly
◮ Does not use a Bayesian approach
Background
◮ Suppose we’re given a species tree and a model for sequence evolution along gene trees (e.g. Jukes-Cantor, GTR)
◮ A species tree defines a probability distribution on gene trees
◮ Using this, and the model for sequence evolution, we can compute the probability of observing a particular site pattern (one character at each leaf of the species tree)
Background II
◮ For a species tree with 4 taxa, write $p_{ijkl}$ for the probability $P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$
◮ We can calculate all these probabilities and write them in a $16 \times 16$ matrix for a given split (for the split 12|34, rows represent the possible values of $(X_1, X_2)$ and columns those of $(X_3, X_4)$)
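As a rough illustration of the matrix described above, the sketch below builds an empirical version from four aligned sequences by counting site patterns. The names `flattening`, `SPLITS`, and the choice to skip sites with gaps or ambiguity codes are assumptions made for this example, not part of the original method description.

```python
import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

# The three splits of taxa (0, 1, 2, 3): (row taxa, column taxa).
SPLITS = {
    "12|34": ((0, 1), (2, 3)),
    "13|24": ((0, 2), (1, 3)),
    "14|23": ((0, 3), (1, 2)),
}

def flattening(seqs, split):
    """Return a 16x16 matrix of observed site-pattern frequencies.

    seqs  : four aligned sequences of equal length over A, C, G, T
    split : ((a, b), (c, d)); rows index the states of taxa (a, b),
            columns index the states of taxa (c, d)
    """
    (a, b), (c, d) = split
    M = np.zeros((16, 16))
    for site in zip(*seqs):
        if any(ch not in IDX for ch in site):
            continue  # assumption: ignore gaps and ambiguity codes
        row = 4 * IDX[site[a]] + IDX[site[b]]
        col = 4 * IDX[site[c]] + IDX[site[d]]
        M[row, col] += 1
    total = M.sum()
    return M / total if total else M
```

For example, `flattening(seqs, SPLITS["13|24"])` gives the matrix whose rows are indexed by the states of taxa 1 and 3.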
Background III
◮ We can make this matrix for all 3 possible splits
(12|34, 13|24, 14|23)
◮ We then have the following theorem
Theorem
Assuming a strict molecular clock, for the split corresponding to the true species tree, the rank of the corresponding matrix is at most 10. For all other splits, the rank is strictly greater than 10.
SVDQuartets
◮ The theorem above suggests the following procedure for estimating the species tree
  ◮ Estimate the site-pattern probabilities from the sequences (assuming each site has its own genealogy)
  ◮ For all 3 splits, compute the rank of the matrices
  ◮ The matrix with rank ≤ 10 gives the topology of the species tree
SVDQuartets II
◮ Since the calculated probabilities are estimates, we don’t really expect to find a matrix with rank ≤ 10
◮ So instead we pick the matrix that’s closest to a rank 10 matrix, using SVD
◮ We can use bootstrap samples to estimate uncertainty
◮ For more than four taxa
  ◮ Do this for all $\binom{n}{4}$ subsets of four taxa
  ◮ Use a quartet assembly algorithm to get the full species tree
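The two bookkeeping steps above (enumerating quartets and drawing bootstrap samples) can be sketched as follows. Both helper names are hypothetical, and resampling alignment columns with replacement is assumed here as the bootstrap scheme; the quartet assembly step is not shown.

```python
from itertools import combinations

import numpy as np

def quartet_splits(taxa):
    """Yield each 4-taxon subset together with its three possible splits."""
    for a, b, c, d in combinations(taxa, 4):
        yield (a, b, c, d), [
            ((a, b), (c, d)),
            ((a, c), (b, d)),
            ((a, d), (b, c)),
        ]

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_sites = len(seqs[0])
    cols = rng.integers(0, n_sites, size=n_sites)
    return ["".join(s[i] for i in cols) for s in seqs]
```

A replicate can be drawn with `bootstrap_alignment(seqs, np.random.default_rng(0))`; scoring each replicate and recording how often each split wins gives a support value for that split.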
SVDQuartets III
◮ Let $M_{12}$ denote the matrix for the split 12|34
◮ Factoring $M_{12}$ using SVD, we get $M_{12} = U \Sigma V^T$ (where $\Sigma$ is a diagonal matrix)
◮ Then the distance (in the Frobenius norm) of $M_{12}$ to the nearest rank 10 matrix is
$$\sqrt{\sum_{i=11}^{16} \Sigma_{ii}^2}$$
◮ This is defined as the SVDScore for the split 12|34
◮ We pick the split corresponding to the lowest SVDScore
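A minimal sketch of the score computation, assuming the distance is measured in the Frobenius norm (so the score is the root of the sum of the squared trailing singular values). The function names `svd_score` and `best_split` are illustrative only.

```python
import numpy as np

def svd_score(M, rank=10):
    """Frobenius distance from M to the nearest matrix of the given rank."""
    # Singular values are returned in descending order.
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def best_split(flattenings):
    """Pick the split with the lowest score.

    flattenings : dict mapping a split label (e.g. "12|34") to its
                  16x16 matrix of estimated site-pattern probabilities
    """
    scores = {name: svd_score(M) for name, M in flattenings.items()}
    return min(scores, key=scores.get), scores
```

A matrix that already has rank at most 10 scores (numerically) zero, which matches the role the theorem plays in the procedure.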
Simulations
◮ Generated a model species tree of the form ((1:x, 2:x):x, (3:x, 4:x):x), where x is the branch length
◮ Sampled g gene trees according to the model species tree
◮ Generated sequences of length n on those gene trees
◮ For unlinked SNP data, g was set to 5000 and n to 1
◮ For multi-locus data, g = 10 and n = 500 were used
◮ Simulations were done for Jukes-Cantor and GTR + I + Γ
Results (Jukes Cantor)
Results (GTR + I + Γ)
Discussion
◮ Across both datasets, we can see that SVDQuartets easily identifies the correct split
◮ The theory for the method was derived for SNP sites, and for the GTR model and its sub-models
◮ But we can see that the method performs well even for multi-locus data under GTR + I + Γ
  ◮ In fact, changing from SNP to multi-locus data has an almost negligible effect on the bootstrap scores
◮ However, the difference in support for the true split vs. false splits was smaller under the GTR + I + Γ model than under Jukes-Cantor
Discussion II
◮ The branch length controls the amount of Incomplete Lineage Sorting (ILS)
  ◮ The greater the length, the smaller the ILS
◮ We see from the results that inference is easier for lower amounts of ILS
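To make the link between branch length and ILS concrete: under the standard coalescent, two lineages that enter an ancestral population fail to coalesce within a branch of length x with probability exp(-x). The sketch below assumes x is measured in coalescent units, which the slides do not state, and the function name is hypothetical.

```python
import math

def prob_no_coalescence(x):
    """Probability that two lineages do not coalesce within a branch of
    length x (assumed to be in coalescent units)."""
    return math.exp(-x)

# Longer internal branches leave less room for incomplete lineage sorting.
for x in (0.5, 1.0, 2.0):
    print(f"x = {x}: P(no coalescence) = {prob_no_coalescence(x):.3f}")
```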
Conclusion
◮ SVDQuartets provides a novel approach to using sequence data directly to estimate species trees
◮ The model assumes unlinked SNP sites, but works well for multi-locus data as well
◮ The method can be easily parallelized by estimating different quartets on different cores
◮ However, for a large number of sequences, estimating all $\binom{n}{4}$ quartets might still become prohibitive
◮ Only estimates the topology (unrooted)
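The scaling concern about the number of quartets can be checked with a quick count. This is only arithmetic on the binomial coefficient; the helper name is illustrative.

```python
import math

def quartet_count(n):
    """Number of 4-taxon subsets among n sequences."""
    return math.comb(n, 4)

for n in (10, 100, 1000):
    print(n, quartet_count(n))
# 10 210
# 100 3921225
# 1000 41417124750
```

Since the count grows on the order of n^4, spreading quartets across cores helps by a constant factor but does not change the growth rate.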