SLIDE 1

Quartet Inference from SNP Data Under the Coalescent Model

Syed Shalan Naqvi

SLIDE 2

Problem Statement

◮ We’re given aligned sequence data from multiple genes
◮ We want a good estimate for the species tree

SLIDE 4

Two Common Approaches

◮ Summary Methods (e.g. STEM, MP-EST)
  ◮ Use sequence data to estimate gene trees
  ◮ Use gene trees to estimate species trees
◮ Bayesian Methods (e.g. BEST, *BEAST)
  ◮ Co-estimate gene trees and species trees using MCMC

SLIDE 6

Issues

◮ Summary Methods (e.g. STEM, MP-EST)
  ◮ Assume the estimated gene trees are error-free
  ◮ For short sequences, this can be a big problem
◮ Bayesian Methods (e.g. BEST, *BEAST)
  ◮ Don’t scale to large datasets

SLIDE 8

A new approach

◮ SVDQuartets uses the sequence data directly
◮ Does not use a Bayesian approach

SLIDE 11

Background

◮ Suppose we’re given a species tree and a model for sequence evolution along gene trees (e.g. Jukes-Cantor, GTR)
◮ A species tree defines a probability distribution on gene trees
◮ Using this, and the model for sequence evolution, we can compute the probability of observing a particular character on a leaf of the species tree

SLIDE 13

Background II

◮ For a species tree with 4 taxa, write $p_{ijkl}$ for the probability $P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$ (for a given split)
◮ We can calculate all these probabilities, and write them in a 16 × 16 matrix (with rows representing the possible values for $(X_1, X_2)$ and, in the case of the split 12|34, columns those for $(X_3, X_4)$)
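A minimal sketch of this flattening for the split 12|34, assuming the probabilities are held in a 4×4×4×4 NumPy array indexed as `p[i, j, k, l]` with the nucleotides coded 0 to 3. The function name `flatten_12_34` and the uniform toy input are illustrative assumptions, not part of the talk.

```python
import numpy as np

def flatten_12_34(p):
    """Flatten p[i, j, k, l] into a 16x16 matrix for the split 12|34.

    Row index 4*i + j encodes the pair (X1, X2);
    column index 4*k + l encodes the pair (X3, X4).
    """
    p = np.asarray(p, dtype=float)
    if p.shape != (4, 4, 4, 4):
        raise ValueError("expected a 4x4x4x4 array of site-pattern probabilities")
    # C-order reshape groups the first two axes into rows, the last two into columns.
    return p.reshape(16, 16)

# Toy input: a uniform distribution over the 256 site patterns (not real data).
p = np.full((4, 4, 4, 4), 1.0 / 256)
M = flatten_12_34(p)
```

Each row of `M` then corresponds to one of the 16 joint states of the first two taxa, and each column to one of the 16 joint states of the last two.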

SLIDE 15

Background III

◮ We can make this matrix for all 3 possible splits (12|34, 13|24, 14|23)
◮ We then have the following theorem

Theorem

Assuming a strict molecular clock, for the split corresponding to the true species tree, the rank of the corresponding matrix is at most 10. For all other splits, the rank is strictly greater than 10.

SLIDE 16

SVDQuartets

◮ Theorem 1 suggests the following procedure for estimating the species tree
  ◮ Estimate the site-pattern probabilities using the sequences (assuming each site has its own genealogy)
  ◮ For all 3 splits, compute the rank of the matrices
  ◮ The matrix with rank ≤ 10 gives the topology of the species tree
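The first two steps of this procedure can be sketched as follows, assuming the four sequences are given as equal-length strings over A, C, G, T and that the probabilities are estimated by simple relative frequencies of the observed site patterns. The names `pattern_frequencies`, `flattenings` and `SPLITS`, and the choice to skip sites with gaps or ambiguity codes, are assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"
# Axis orders that bring the two taxa on the left of each split to the front.
SPLITS = {"12|34": (0, 1, 2, 3), "13|24": (0, 2, 1, 3), "14|23": (0, 3, 1, 2)}

def pattern_frequencies(seqs):
    """Relative frequency of every site pattern in four aligned sequences."""
    if len(seqs) != 4 or len({len(s) for s in seqs}) != 1:
        raise ValueError("expected four aligned sequences of equal length")
    index = {base: n for n, base in enumerate(BASES)}
    counts = np.zeros((4, 4, 4, 4))
    for column in zip(*seqs):                      # one alignment column per site
        if all(c in index for c in column):        # skip gaps / ambiguity codes
            counts[tuple(index[c] for c in column)] += 1
    total = counts.sum()
    if total == 0:
        raise ValueError("no usable sites")
    return counts / total

def flattenings(p):
    """Return the 16x16 flattening of p for each of the three splits."""
    return {name: np.transpose(p, axes).reshape(16, 16)
            for name, axes in SPLITS.items()}

# Tiny made-up alignment, only to show the call pattern.
p_hat = pattern_frequencies(["AAC", "AAG", "CCT", "CCT"])
mats = flattenings(p_hat)
ranks = {name: int(np.linalg.matrix_rank(M)) for name, M in mats.items()}
```

`np.linalg.matrix_rank` applies a numerical tolerance, so the rank it reports for estimated frequencies depends on sampling noise rather than only on the underlying tree.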

SLIDE 20

SVDQuartets II

◮ Since the calculated probabilities are estimates, we don’t really expect to find a matrix with rank ≤ 10
◮ So instead we pick the matrix that’s closest to a rank-10 matrix, using SVD
◮ We can use bootstrap samples to estimate uncertainty
◮ For more than four taxa
  ◮ Do this for all $\binom{n}{4}$ quartets of taxa
  ◮ Use a quartet assembly algorithm to get the full species tree
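As a rough sketch of the bookkeeping involved, the quartets can be enumerated with `itertools.combinations`, and a bootstrap replicate can be drawn by resampling site indices with replacement. The taxon labels, the helper names and the choice of a site-level bootstrap are assumptions for illustration; the quartet assembly step is not shown.

```python
import itertools
import random

def all_quartets(taxa):
    """Every 4-element subset of the taxa, in lexicographic order."""
    return list(itertools.combinations(taxa, 4))

def bootstrap_sites(n_sites, rng):
    """Site indices for one bootstrap replicate (sampling with replacement)."""
    return [rng.randrange(n_sites) for _ in range(n_sites)]

taxa = ["A", "B", "C", "D", "E", "F"]      # hypothetical taxon labels
quartets = all_quartets(taxa)              # 15 subsets for 6 taxa
rng = random.Random(0)                     # fixed seed for reproducibility
replicate = bootstrap_sites(10, rng)       # indices into a 10-site alignment
```

Each quartet would then be scored on the original alignment and on every replicate, and the fraction of replicates supporting a split gives its bootstrap support.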

SLIDE 25

SVDQuartets III

◮ Let $M_{12}$ denote the matrix for the split 12|34
◮ Then factoring $M_{12}$ using SVD, we get $M_{12} = U \Sigma V^T$ (where $\Sigma$ is a diagonal matrix)
◮ Then the distance of $M_{12}$ to the nearest rank-10 matrix is $\sqrt{\sum_{i=11}^{16} \Sigma_{ii}^2}$
◮ This is defined as the SVDScore for the split 12|34
◮ We pick the split corresponding to the lowest SVDScore
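A minimal NumPy sketch of this scoring step, taking the score to be the Frobenius-norm distance to the nearest rank-10 matrix, i.e. the square root of the sum of the squared 11th to 16th singular values (dropping the square root would not change which split ranks lowest). The function names and the diagonal toy matrices are assumptions for illustration, not real flattenings.

```python
import numpy as np

def svd_score(M, rank=10):
    """Frobenius distance from M to the nearest matrix of the given rank."""
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(np.asarray(M, dtype=float), compute_uv=False)
    return float(np.sqrt(np.sum(singular_values[rank:] ** 2)))

def best_split(matrices):
    """Return the name of the lowest-scoring split and all scores."""
    scores = {name: svd_score(M) for name, M in matrices.items()}
    return min(scores, key=scores.get), scores

# Toy 16x16 matrices whose trailing singular values are set by hand.
toy = {
    "12|34": np.diag([1.0] * 10 + [0.01] * 6),
    "13|24": np.diag([1.0] * 10 + [0.20] * 6),
    "14|23": np.diag([1.0] * 10 + [0.30] * 6),
}
winner, scores = best_split(toy)
```

Here `winner` is the split whose matrix is closest to rank 10, which in this toy input is the one with the smallest trailing singular values.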

SLIDE 29

Simulations

◮ Generated a model species tree of the form ((1:x, 2:x):x, (3:x, 4:x):x) (where x is the branch length)
◮ Sampled g gene trees according to the model species tree
◮ Generated sequences of length n on those gene trees
◮ For generating unlinked SNP data, g was set to 5000 and n to 1
◮ For multi-locus data, g = 10, n = 500 was considered
◮ Simulations were done for Jukes-Cantor and GTR + I + Γ
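The unlinked-site setting (n = 1) can be sketched in pure Python as below. This is not the simulation pipeline used in the talk: it assumes one sampled lineage per species, branch lengths in coalescent units, a Jukes-Cantor model with a hypothetical substitution rate `mu` per coalescent unit, and it keeps constant sites rather than filtering for variable ones. All function names are illustrative.

```python
import math
import random

BASES = "ACGT"

def height(node):
    """Leaves are ints at height 0; internal nodes are (left, right, height)."""
    return 0.0 if isinstance(node, int) else node[2]

def sample_gene_tree(x, rng):
    """One gene tree for the species tree ((1:x,2:x):x,(3:x,4:x):x)."""
    lineages = []
    for a, b in ((1, 2), (3, 4)):
        t = rng.expovariate(1.0)              # pairwise coalescence waiting time
        if t < x:
            lineages.append((a, b, x + t))    # coalesced inside the internal branch
        else:
            lineages.extend([a, b])           # incomplete lineage sorting
    now = 2.0 * x                             # all remaining lineages meet at the root
    while len(lineages) > 1:
        k = len(lineages)
        now += rng.expovariate(k * (k - 1) / 2.0)
        i, j = rng.sample(range(k), 2)        # a uniformly random pair coalesces
        merged = (lineages[i], lineages[j], now)
        lineages = [node for n, node in enumerate(lineages) if n not in (i, j)]
        lineages.append(merged)
    return lineages[0]

def evolve(node, state, mu, rng, out):
    """Propagate one Jukes-Cantor site from `node` down to the leaves."""
    if isinstance(node, int):
        out[node] = state
        return
    for child in node[:2]:
        d = mu * (node[2] - height(child))    # expected substitutions on the branch
        p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
        if rng.random() < p_same:
            child_state = state
        else:
            child_state = rng.choice([b for b in BASES if b != state])
        evolve(child, child_state, mu, rng, out)

def simulate_sites(g, x, mu, seed=0):
    """Simulate g unlinked sites (one site per gene tree) for taxa 1..4."""
    rng = random.Random(seed)
    columns = []
    for _ in range(g):
        site = {}
        evolve(sample_gene_tree(x, rng), rng.choice(BASES), mu, rng, site)
        columns.append("".join(site[t] for t in (1, 2, 3, 4)))
    return ["".join(col[n] for col in columns) for n in range(4)]

# g = 5000 follows the slide; x = 1.0 and mu = 0.1 are arbitrary example values.
seqs = simulate_sites(g=5000, x=1.0, mu=0.1)
```

The multi-locus setting would instead evolve n sites on each of the g gene trees, and a model such as GTR + I + Γ would need a different transition-probability calculation in place of the Jukes-Cantor formula.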

SLIDE 30

Results (Jukes-Cantor)

SLIDE 31

Results (GTR + I + Γ)

SLIDE 35

Discussion

◮ Across both datasets, we can see that SVDQuartets easily identifies the correct split
◮ The theory for the model was derived for SNP sites, and for GTR and its sub-models
◮ But we can see that the model performs well even for multi-locus data under GTR + I + Γ
◮ In fact, changing from SNP to multi-locus data has an almost negligible effect on the bootstrap scores
◮ However, the difference in support for the true split vs. false splits was smaller under the GTR + I + Γ model than under Jukes-Cantor

SLIDE 37

Discussion II

◮ The branch length controls the amount of Incomplete Lineage Sorting (ILS)
◮ The greater the length, the less ILS
◮ We see from the results that inference is easier for lower amounts of ILS
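One way to see the link between branch length and ILS, under assumptions not stated on the slide (x measured in coalescent units, one lineage sampled per species): two lineages entering an internal branch of length x coalesce after an exponentially distributed waiting time with rate 1, so the probability that they leave the branch without coalescing is

```latex
P(\text{no coalescence within the branch}) = \int_x^{\infty} e^{-t}\, dt = e^{-x}
```

This decreases toward 0 as x grows; for example, x = 0.5 gives about 0.61 and x = 2 gives about 0.14.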

SLIDE 40

Conclusion

◮ SVDQuartets provides a novel approach to using sequence data directly to estimate species trees
◮ The model assumes unlinked SNP sites, but works well for multi-locus data as well
◮ The method can be easily parallelized by estimating the different quartets on different cores
◮ However, for a large number of sequences, estimating all $\binom{n}{4}$ quartets might still become prohibitive
◮ Only estimates the topology (unrooted)
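To give a feel for the scaling and for how the work could be divided, the sketch below counts the quartets for a few values of n and groups an enumeration of quartets into fixed-size batches that separate workers could process. The helper names and the batch size are assumptions; the actual scheduling across cores is not shown.

```python
import itertools
import math

def quartet_count(n):
    """Number of 4-taxon subsets of n taxa."""
    return math.comb(n, 4)

def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        block = list(itertools.islice(iterator, size))
        if not block:
            return
        yield block

# 10 -> 210, 100 -> 3921225, 1000 -> 41417124750
counts = {n: quartet_count(n) for n in (10, 100, 1000)}

# 70 quartets on 8 taxa, grouped into batches of 20 for hypothetical workers.
batches = list(batched(itertools.combinations(range(8), 4), 20))
```

Because the quartets are scored independently, each batch can be handed to a different core, but the count still grows on the order of n^4.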

SLIDE 41

Questions?