Principles of Phylogenetics Reading and Inferring Trees Finlay - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Principles of Phylogenetics

Reading and Inferring Trees

Finlay Maguire April 1, 2020

FCS, Dalhousie

slide-2
SLIDE 2

Table of contents

  • 1. What are phylogenies?
  • 2. Reading a Tree
  • 3. Making a Tree
  • 4. Tree Inference methods
  • 5. Aside: sources of error
  • 6. Back to inference
  • 7. Conclusion

1

slide-3
SLIDE 3

What are phylogenies?

slide-4
SLIDE 4

Hypotheses for understanding alignments

https://itol.embl.de/help.cgi

2

slide-5
SLIDE 5

Tree of Life

[Hug et al., 2016]

3

slide-6
SLIDE 6

2013- Ebola Outbreak

[Holmes et al., 2016]

4

slide-7
SLIDE 7

Uses outside biology

[Skelton, 2008]

5

slide-8
SLIDE 8

Uses outside biology

  • Manuscript change [Barbrook et al., 1998]
  • Social evolution (many examples, some questionable).
  • Plagiarism [Ryu et al., 2008]
  • Anything you can measure distances between.

6

slide-9
SLIDE 9

Reading a Tree

slide-10
SLIDE 10

Toy Tree

Andrew Rambaut’s Tutorial http://artic.network/how-to-read-a-tree.html

7

slide-11
SLIDE 11

Parts of the Tree

  • Leaf node (or terminal node)
  • Internal node (or vertex)
  • Branch (or edge)
  • Root

8

slide-12
SLIDE 12

Support Values

9

slide-13
SLIDE 13

Other formats

http://artic.network/how-to-read-a-tree.html

10

slide-14
SLIDE 14

Meaningful Branch Lengths

https://biology-forums.com/gallery/18099_27_04_12_2_16_20.jpeg

11

slide-15
SLIDE 15

Khan Academy

12

slide-16
SLIDE 16

Groupings on the Tree

Can Infect Humans?

13

slide-17
SLIDE 17

Monophyletic

Can Infect Humans? Monophyletic Synapomorphy Plesiomorphy

14

slide-18
SLIDE 18

Paraphyletic

Can Infect Humans? Paraphyletic

15

slide-19
SLIDE 19

Polyphyletic

Can Infect Humans? Polyphyletic

16

slide-20
SLIDE 20

Rooting

17

slide-21
SLIDE 21

Rooting

http://artic.network/how-to-read-a-tree.html

18

slide-22
SLIDE 22

Rooting

http://artic.network/how-to-read-a-tree.html

19

slide-23
SLIDE 23

Topology and rotation

20

slide-24
SLIDE 24

Nodes can rotate

http://artic.network/how-to-read-a-tree.html

21

slide-25
SLIDE 25

Nodes can rotate

http://artic.network/how-to-read-a-tree.html

22

slide-26
SLIDE 26

Adding Metadata

http://artic.network/how-to-read-a-tree.html

23

slide-27
SLIDE 27

Ancestral Node Reconstruction

http://artic.network/how-to-read-a-tree.html

24

slide-28
SLIDE 28

Ancestral Node Reconstruction

http://artic.network/how-to-read-a-tree.html

25

slide-29
SLIDE 29

Ancestral Node Reconstruction

http://artic.network/how-to-read-a-tree.html

26

slide-30
SLIDE 30

Ancestral Node Reconstruction

http://artic.network/how-to-read-a-tree.html

27

slide-31
SLIDE 31

Making a Tree

slide-32
SLIDE 32

Going from data to a tree

  • Getting your data
  • Aligning your data
  • Tree-inference
      • Maximum Parsimony
      • Distance Methods
      • Maximum-Likelihood
      • Bayesian
  • Sequence evolution models
  • Exploring topology space
  • Statistical support

28

slide-33
SLIDE 33

Getting and preparing your data

slide-34
SLIDE 34

Finding Similar Sequences

29

slide-35
SLIDE 35

BLAST

30

slide-36
SLIDE 36

BLAST

31

slide-37
SLIDE 37

Core Genome Inference

anvi’o documentation

32

slide-38
SLIDE 38

Multiple Sequence Alignment

https://bioinf.comav.upv.es/courses/biotech3/theory/multiple.html

33

slide-39
SLIDE 39

Multiple Sequence Alignment

https://bioinf.comav.upv.es/courses/biotech3/theory/multiple.html

34

slide-40
SLIDE 40

Alignment Trimming

https://itol.embl.de/help.cgi

35

slide-41
SLIDE 41

Trimmed Alignment

https://bioinf.comav.upv.es/courses/biotech3/theory/multiple.html

36

slide-42
SLIDE 42

Tree Inference methods

slide-43
SLIDE 43

Difficulties

  • Huge number of possible trees (for n taxa)
  • unrooted = $\frac{(2n-5)!}{2^{n-3}(n-3)!}$
  • rooted = $\frac{(2n-3)!}{2^{n-2}(n-2)!}$
  • 10 taxa = 2,027,025 unrooted and 34,459,425 rooted topologies.
  • 50 taxa = 2.84e74 unrooted and 2.75e76 rooted topologies.
  • Topology space geometry is large and awkward to traverse.
  • Large number of parameters to optimise
  • How do you choose which tree is optimal? Which criterion?
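
The topology counts quoted on this slide can be checked directly. The following is a minimal Python sketch, assuming strictly bifurcating trees with n labelled taxa; the function names are illustrative and do not come from the slides.

```python
from math import factorial

def unrooted_topologies(n: int) -> int:
    """Number of unrooted bifurcating topologies for n >= 3 labelled taxa."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def rooted_topologies(n: int) -> int:
    """Number of rooted bifurcating topologies for n >= 2 labelled taxa."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for taxa in (10, 50):
    # Integer arithmetic is exact; the format spec only shortens the display.
    print(taxa, f"{unrooted_topologies(taxa):.3g}", f"{rooted_topologies(taxa):.3g}")
```

Rooting an unrooted tree means choosing one of its branches, which is why the rooted count for n taxa equals the unrooted count for n + 1 taxa.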

37

slide-44
SLIDE 44

Parsimony

Intuitive: minimise the number of changes needed.
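
One common way to count the minimum number of changes for a single character on a fixed tree is Fitch's algorithm. The sketch below is illustrative rather than the method used in the talk: it assumes a rooted binary tree written as nested tuples of leaf names, and the leaf names and states are hypothetical.

```python
def fitch(tree, states):
    """Return (possible_states, changes) for one character on a rooted binary tree.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_changes + right_changes
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_changes + right_changes + 1

# Hypothetical single-site data for five taxa.
site = {"a": "A", "b": "A", "c": "A", "d": "G", "e": "G"}
tree_1 = (("a", "b"), ("c", ("d", "e")))  # groups the two G taxa together
tree_2 = (("a", "d"), ("b", ("c", "e")))  # splits the two G taxa apart

score_1 = fitch(tree_1, site)[1]
score_2 = fitch(tree_2, site)[1]
print(score_1, score_2)
```

Under parsimony the tree with the lower score is preferred for this site; a full analysis would sum the scores over all sites and compare many candidate topologies.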

38

slide-45
SLIDE 45

Parsimony Pros/Cons

  • Advantages:
      • Very simple
      • Works on any type of data (no explicit model).
  • Disadvantages:
      • Very simple
      • Requires informative sites with consistent signal.
      • Poor handling of multiple substitutions.
      • Can’t incorporate extra information.
      • Not consistent for certain tree shapes (misleading support values).

39

slide-46
SLIDE 46

Evolutionary Models

slide-47
SLIDE 47

Sequence Evolution Models

http://carrot.mcb.uconn.edu/~olgazh/bioinf2010/class24.html

40

slide-48
SLIDE 48

Sequence Evolution Models

[Nickle et al., 2007]

41

slide-49
SLIDE 49

How do we select a model?

  • Which model is most likely given the data?
  • Information Criterion (regularisation to penalise overly complex models)

  • Decision Theory: risk minimisation.
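
The information-criterion idea can be sketched with the standard AIC and BIC formulas. In the Python sketch below the log-likelihoods, the alignment length, and the choice to count only substitution-model parameters (not branch lengths) are all hypothetical values chosen for illustration.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood

# Hypothetical fits: model -> (log-likelihood, free substitution-model parameters).
fits = {"JC69": (-5230.0, 0), "HKY85": (-5190.0, 4), "GTR+G": (-5184.0, 9)}
n_sites = 1000  # hypothetical alignment length

best_aic = min(fits, key=lambda model: aic(*fits[model]))
best_bic = min(fits, key=lambda model: bic(*fits[model], n_sites))
print(best_aic, best_bic)
```

With these made-up numbers the two criteria pick different models, because BIC penalises each extra parameter more heavily than AIC once the alignment has more than about seven sites.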

42

slide-50
SLIDE 50

What happens theoretically if the wrong model is specified?

  • Increased Inaccuracy (wrong tree more often)
  • Inconsistency (adding more data converges to wrong tree)
  • Wrong branch lengths (important for certain analyses)
  • Wrong tree support values

43

slide-51
SLIDE 51

What actually happens

[Abadi et al., 2019]

  • Almost always use the most flexible model (GTR+I+G/LG)
  • Criteria are inconsistent (BIC/AIC disagree in 62% of cases)
  • Different models change the distance matrix trivially.
  • ALL models lead to very similar topologies.
  • Model only really important if branch length matters to you.

44

slide-52
SLIDE 52

Distance Matrix

https://slideplayer.com/slide/4422868/
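
Each entry of a distance matrix is usually a model-corrected pairwise distance. As a minimal sketch, assuming the Jukes-Cantor (JC69) model and pre-aligned sequences with "-" for gaps, the raw proportion of differing sites can be corrected for unseen multiple substitutions as follows; the function names are illustrative.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p: float) -> float:
    """JC69 corrected distance: d = -3/4 * ln(1 - 4p/3), in substitutions per site."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

p = p_distance("ACGT", "ACGA")
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the raw proportion, and the gap widens as sequences diverge, which is one reason distance methods struggle with divergent sequences.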

45

slide-53
SLIDE 53

Neighbour-Joining

Iteratively pair off branches that minimise the total sum of branch lengths

https://en.wikipedia.org/wiki/Neighbor_joining
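
A single neighbour-joining step can be sketched as below. This is a partial illustration, not a full implementation: it only selects the pair to join via the Q criterion and computes the two new branch lengths. The five-taxon distance matrix is a small symmetric example chosen for illustration.

```python
def nj_pick_pair(dist):
    """Return (q, i, j) for the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j."""
    n = len(dist)
    totals = [sum(row) for row in dist]  # r_i: summed distance from taxon i to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best

def limb_lengths(dist, i, j):
    """Branch lengths from taxa i and j to the new internal node that joins them."""
    n = len(dist)
    length_i = 0.5 * dist[i][j] + (sum(dist[i]) - sum(dist[j])) / (2 * (n - 2))
    return length_i, dist[i][j] - length_i

# Illustrative symmetric distance matrix for taxa a, b, c, d, e.
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
q, i, j = nj_pick_pair(dist)
print(q, i, j, limb_lengths(dist, i, j))
```

A complete run would replace the joined pair with the new node, recompute the distances to it, and repeat until the tree is fully resolved.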

46

slide-54
SLIDE 54

Distance Approaches Pros/Cons

  • Advantages:
      • Very fast (often used as starting point)
      • Works well for clock-like and closely related sequences
  • Disadvantages:
      • Requires a sequence evolution model
      • Pairwise distance isn’t always an error-free estimate of evolutionary distance (bigger problem with divergent sequences).
      • Doesn’t use all available information
      • Cannot reconstruct character histories

47

slide-55
SLIDE 55

Aside: sources of error

slide-56
SLIDE 56

Sources of Error

  • Bad data
  • Sampling error
  • Misleading evolutionary events
  • Misspecified models
  • Inappropriate inference

48

slide-61
SLIDE 61

Saturation

[Leonard, 2010]

49

slide-62
SLIDE 62

Misleading Signal: Recombination

50

slide-63
SLIDE 63

Misleading Signal: Hidden Paralogy/Incomplete Sampling

[Leonard, 2010]

51

slide-64
SLIDE 64

Misleading Signal: Horizontal Gene Transfer

[Leonard, 2010]

52

slide-65
SLIDE 65

Misleading Signal: Horizontal Gene Transfer

[Richards et al., 2009]

53

slide-66
SLIDE 66

Tree not always correct paradigm

Ask for a tree, get a tree.

54

slide-67
SLIDE 67

Tree not always correct paradigm

Ask for a tree, get a tree.

Reanalysis of [Marwick, 2012] from http://phylonetworks.blogspot.ca/2013/02/

55

slide-68
SLIDE 68

Back to inference

slide-69
SLIDE 69

Maximum-Likelihood

  • Likelihood = p(data | topology, branch lengths, evolutionary model) = p(D|τ, θ)
  • The maximum-likelihood estimate is the topology, branch lengths and model parameters with the highest likelihood.
  • Performed site by site: search topology space, then find the optimal tree parameters.
  • Too expensive to exhaustively search the likelihood surface, so heuristics are used.
  • Most methods start with a distance-based starting tree and greedily traverse model space.

56

slide-70
SLIDE 70

Maximum-Likelihood

57

slide-71
SLIDE 71

Maximum-Likelihood

58

slide-72
SLIDE 72

Maximum-Likelihood

  • $p(D \mid \tau, \theta) = \sum_{\alpha} \sum_{\beta} \sum_{\gamma} \sum_{\delta} p(A, A, A, G, G, \alpha, \beta, \gamma, \delta \mid \tau, \theta)$

59
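
The sum over the four unobserved internal-node states does not have to be enumerated term by term. A minimal sketch of Felsenstein's pruning approach for one site is given below, assuming the JC69 model with equal base frequencies; the tree shape, branch lengths and leaf names are hypothetical and chosen only to match the five-leaf A, A, A, G, G pattern.

```python
import math

BASES = "ACGT"

def jc69_prob(i: str, j: str, t: float) -> float:
    """JC69 probability of state i becoming state j along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def partial(node, site):
    """Conditional likelihoods {x: p(leaf data below node | node is in state x)}."""
    if isinstance(node, str):  # leaf: the observed state is certain
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:  # internal node: tuple of (child, branch_length) pairs
        below = partial(child, site)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, t) * below[y] for y in BASES)
    return result

def site_likelihood(tree, site) -> float:
    """p(site | topology, branch lengths), summing over all internal states."""
    root = partial(tree, site)
    return sum(0.25 * root[x] for x in BASES)  # 0.25 = JC69 base frequency

# Hypothetical tree with four internal nodes and made-up branch lengths.
cherry_ab = (("s1", 0.1), ("s2", 0.1))
cherry_de = (("s4", 0.1), ("s5", 0.1))
clade_cde = (("s3", 0.1), (cherry_de, 0.05))
tree = ((cherry_ab, 0.05), (clade_cde, 0.05))

site = {"s1": "A", "s2": "A", "s3": "A", "s4": "G", "s5": "G"}
print(site_likelihood(tree, site))
```

The likelihood of a whole alignment is the product of such per-site values (in practice, the sum of their logarithms), which is what the search over topologies and parameters tries to maximise.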

slide-73
SLIDE 73

Maximum-Likelihood

60

slide-74
SLIDE 74

Maximum-Likelihood

61

slide-75
SLIDE 75

Maximum-Likelihood Pros/Cons

  • Advantages:
      • Maximum use of information in data
      • Explicit Model
      • Can handle complex models
      • Robust and consistent (for correct model)
      • Allows comparison of trees (which is ‘best’ and by how much)
  • Disadvantages:
      • Treats sites as independent by default.
      • Very slow for exhaustive search
      • Model misspecification issues
      • Difficult to extend.
      • Question formulation can be unintuitive

62

slide-76
SLIDE 76

Bayesian

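  • Bayes Rule: $p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D \mid \theta)\,p(\theta)\,d\theta}$
  • For trees: $p(\theta, \tau \mid D) = \frac{p(D \mid \theta, \tau)\,p(\theta)\,p(\tau)}{\int_{\theta} \int_{\tau} p(D \mid \theta, \tau)\,p(\theta)\,p(\tau)\,d\tau\,d\theta}$
  • Approximate the marginal probability using Markov chain Monte Carlo (MCMC)
  • Run multiple chains to assess convergence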
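
The MCMC idea can be illustrated on a deliberately tiny problem: the posterior of a single JC69 distance between two sequences, rather than a full tree. The Metropolis sampler below is a sketch under stated assumptions; the data (10 differences in 100 sites), the exponential prior with rate 10, the step size and the chain lengths are all hypothetical.

```python
import math
import random

def log_posterior(t, n_sites, n_diff, prior_rate):
    """Unnormalised log posterior of the JC69 distance t between two sequences."""
    if t <= 0.0:
        return -math.inf
    p = -0.75 * math.expm1(-4.0 * t / 3.0)  # probability that a site differs
    log_lik = n_diff * math.log(p) + (n_sites - n_diff) * math.log(1.0 - p)
    log_prior = math.log(prior_rate) - prior_rate * t  # exponential prior (assumed)
    return log_lik + log_prior

def run_chain(n_sites, n_diff, prior_rate=10.0, steps=20000, burn_in=2000,
              step_sd=0.05, seed=0):
    """Metropolis sampler with a symmetric Gaussian proposal; returns post-burn-in samples."""
    rng = random.Random(seed)
    t = 0.1
    current = log_posterior(t, n_sites, n_diff, prior_rate)
    samples = []
    for step in range(steps):
        proposal = t + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, n_sites, n_diff, prior_rate)
        # Accept with probability min(1, posterior ratio); the normaliser cancels.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            t, current = proposal, candidate
        if step >= burn_in:
            samples.append(t)
    return samples

# Two independent chains; similar means are a crude sign of convergence.
chains = [run_chain(100, 10, seed=s) for s in (1, 2)]
means = [sum(chain) / len(chain) for chain in chains]
print(means)
```

The sampler never evaluates the integral in the denominator of Bayes' rule, which is the reason MCMC is practical for tree inference, where that integral ranges over every topology and parameter value.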

63

slide-77
SLIDE 77

Bayesian Pros/Cons

  • Advantages:
      • Fast (relatively)
      • Can infer many different parameters
      • More flexible framework
      • More intuitive formulation
  • Disadvantages:
      • Choice of priors
      • Difficulty determining convergence
      • Model misspecification issues.

64

slide-78
SLIDE 78

Searching Tree-Space: NNI
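
A nearest-neighbour interchange (NNI) swaps subtrees across one internal branch. As a minimal sketch, assume an internal branch is written as a pair of pairs, with subtrees a and b on one side and c and d on the other; this representation and the function name are illustrative.

```python
def nni_neighbours(edge):
    """Return the two alternative arrangements around one internal branch.

    edge : ((a, b), (c, d)), where a, b, c, d are subtrees (leaf names or nested tuples)
    """
    (a, b), (c, d) = edge
    # Swap b with c, or b with d; every other exchange repeats one of these groupings.
    return [((a, c), (b, d)), ((a, d), (b, c))]

print(nni_neighbours((("A", "B"), ("C", "D"))))
```

An unrooted bifurcating tree with n taxa has n − 3 internal branches, so each tree has 2(n − 3) NNI neighbours; a hill-climbing search scores these and moves to the best one until no neighbour improves the score.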

65

slide-79
SLIDE 79

Searching Tree-Space

66

slide-80
SLIDE 80

Searching Tree-Space

67

slide-81
SLIDE 81

Searching Tree-Space

68

slide-82
SLIDE 82

Searching Tree-Space

69

slide-83
SLIDE 83

Searching Tree-Space

70

slide-84
SLIDE 84

Searching Tree-Space

71

slide-85
SLIDE 85

Searching Tree-Space

72

slide-86
SLIDE 86

Searching Tree-Space

73

slide-87
SLIDE 87

Searching Tree-Space

74

slide-88
SLIDE 88

Conclusion

slide-89
SLIDE 89

Summary

  • Phylogenetics is a useful tool to investigate the relationships between sequences.
  • There are some tricks to interpreting trees.
  • Inferring a phylogeny requires: data, alignment, trimming, method selection.
  • Parsimony is simplest but easily misled.
  • Distance, ML, and Bayesian methods need an evolutionary model.
  • Distance methods are fast but naive.
  • ML and Bayesian methods treat phylogenetics as a statistics problem.
  • They allow probabilistic reconstruction of ancestral states and population parameters.
  • Tree topology space is non-trivial to search.

75

slide-98
SLIDE 98

Questions?

75

slide-99
SLIDE 99

References i

Abadi, S., Azouri, D., Pupko, T., and Mayrose, I. (2019). Model selection may not be a mandatory step for phylogeny reconstruction. Nature Communications, 10(1):934.

Barbrook, A. C., Howe, C. J., Blake, N., and Robinson, P. (1998). The phylogeny of The Canterbury Tales. Nature, 394(6696):839.

Holmes, E. C., Dudas, G., Rambaut, A., and Andersen, K. G. (2016). The evolution of Ebola virus: Insights from the 2013–2016 epidemic. Nature, 538(7624):193.

76

slide-100
SLIDE 100

References ii

Hug, L. A., Baker, B. J., Anantharaman, K., Brown, C. T., Probst, A. J., Castelle, C. J., Butterfield, C. N., Hernsdorf, A. W., Amano, Y., Ise, K., et al. (2016). A new view of the tree of life. Nature Microbiology, 1(5):16048.

Leonard, G. (2010). Development of fusion and duplication finder BLAST (fdfBLAST): a systematic tool to detect differentially distributed gene fusions and resolve trifurcations in the tree of life.

Marwick, B. (2012). A cladistic evaluation of ancient Thai bronze Buddha images: six tests for a phylogenetic signal in the Griswold collection. Connecting Empires, pages 159–176.

77

slide-101
SLIDE 101

References iii

Nickle, D. C., Heath, L., Jensen, M. A., Gilbert, P. B., Mullins, J. I., and Pond, S. L. K. (2007). HIV-specific probabilistic models of protein evolution. PLoS One, 2(6):e503.

Richards, T. A., Soanes, D. M., Foster, P. G., Leonard, G., Thornton, C. R., and Talbot, N. J. (2009). Phylogenomic analysis demonstrates a pattern of rare and ancient horizontal gene transfer between plants and fungi. The Plant Cell, 21(7):1897–1911.

Ryu, C.-K., Kim, H.-J., Ji, S.-H., Woo, G., and Cho, H.-G. (2008). Detecting and tracing plagiarized documents by reconstruction plagiarism-evolution tree. In 2008 8th IEEE International Conference on Computer and Information Technology, pages 119–124. IEEE.

78

slide-102
SLIDE 102

References iv

Skelton, C. (2008). Methods of using phylogenetic systematics to reconstruct the history of the Linear B script. Archaeometry, 50(1):158–176.

79