
DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens

Katherine St. John, City University of New York
Thanks to the DIMACS Staff: Linda Casals, Walter Morris, Nicole Clark


  2. Distance-Based Methods
     Popular distance-based methods include:
     • Neighbor Joining (Saitou & Nei ‘87), which repeatedly joins the “nearest neighbors” to build a tree;
     • UPGMA (“Unweighted Pair Group Method with Arithmetic Mean”) (Sneath & Sokal ‘73), which similarly clusters close taxa, assuming the rate of evolution is the same across lineages; and
     • Quartet-based methods, which decide the topology for every 4 taxa and then assemble them to form a tree (Berry et al. 1999, 2000, 2001).

  5. Other Distance-Based Methods
     • Weighbor (Bruno et al. ‘00) is a weighted version of Neighbor Joining that chooses which pair to join based on a likelihood function of the distances.
     • Disk Covering Method (Warnow et al. ‘98, ‘99, ‘04): a divide-and-conquer approach of theoretical interest that has been combined with many other methods.

  10. Neighbor Joining (NJ)
     • [Saitou & Nei 1987]: very popular and fast: $O(n^3)$.
       – Based on the distances between nodes, join neighboring leaves, replace them by their parent, calculate distances to this new node, and repeat.
       – This process eventually returns a binary (fully resolved) tree.
       – Joining the pair of leaves with the minimal distance does not suffice, so subtract the averaged distances to compensate for long edges.
       – Experimental work shows that NJ trees are reasonably accurate, provided the rate of evolution is neither too low nor too high.
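
The join-and-repeat loop described above can be sketched in Python. This is a minimal sketch, not the authors' code: it assumes the commonly used selection criterion Q(x, y) = (n − 2)·d(x, y) − r(x) − r(y), where r(x) is the sum of distances from x to all other active nodes, and the usual averaged distance update for the new parent. The function name `neighbor_joining` and the dictionary-of-frozensets input format are hypothetical choices made for illustration; branch lengths are omitted.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return an unrooted NJ topology as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) -> distance between x and y.
    The result is a tuple of the (up to three) subtrees left at the end.
    """
    nodes = list(names)
    d = dict(dist)  # copy so the caller's matrix is not modified
    while len(nodes) > 3:
        n = len(nodes)
        # r[x]: total distance from x to every other active node
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Choose the pair minimizing Q = (n - 2) d(x, y) - r[x] - r[y];
        # subtracting r compensates for leaves sitting on long edges.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        parent = (a, b)  # the new internal node replaces a and b
        for z in nodes:
            if z not in (a, b):
                d[frozenset((parent, z))] = 0.5 * (
                    d[frozenset((a, z))]
                    + d[frozenset((b, z))]
                    - d[frozenset((a, b))]
                )
        nodes = [z for z in nodes if z not in (a, b)] + [parent]
    return tuple(nodes)
```

Each round costs O(n²) for the row sums and the pair search, and there are O(n) rounds, which matches the O(n³) bound on the slide. On a four-taxon input where A and C are the closest pair but A and B are the true neighbors (two short and two long pendant edges), the Q criterion still joins A with B.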

  12. Quartet Methods
     • A quartet is an unrooted binary tree on four taxa. On taxa a, b, c, d there are three: {ab|cd}, {ac|bd}, {ad|bc}.
       [Figure: the three quartet topologies on a, b, c, d]
     • Let Q(T) = all quartets that agree with T.
       [Erdős et al. 1997]: T can be reconstructed from Q(T) in polynomial time.
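
To make Q(T) concrete, here is a small brute-force sketch in Python. It assumes the tree T is given by its non-trivial splits (one side of each internal edge's bipartition), which is a representation chosen for this example rather than anything the slides prescribe. A quartet ab|cd agrees with T when some split of T separates {a, b} from {c, d}. The names `quartets_of_tree` and `make_quartet` are hypothetical, and this is only an enumeration of Q(T), not the reconstruction algorithm of Erdős et al.

```python
from itertools import combinations


def make_quartet(a, b, c, d):
    """Represent the quartet ab|cd as an unordered pair of unordered pairs."""
    return frozenset((frozenset((a, b)), frozenset((c, d))))


def quartets_of_tree(taxa, splits):
    """Enumerate Q(T) for a tree T described by its non-trivial splits.

    taxa:   iterable of taxon labels.
    splits: iterable of sets, each holding one side of a bipartition.
    """
    taxa = set(taxa)
    sides = [set(s) for s in splits]
    result = set()
    for a, b, c, d in combinations(sorted(taxa), 4):
        # The three possible topologies on {a, b, c, d}
        for left, right in (((a, b), (c, d)),
                            ((a, c), (b, d)),
                            ((a, d), (b, c))):
            l, r = set(left), set(right)
            for side in sides:
                other = taxa - side
                if (l <= side and r <= other) or (l <= other and r <= side):
                    result.add(make_quartet(*left, *right))
                    break
    return result
```

For the five-taxon tree with splits {a, b} and {d, e}, every four-taxon subset is resolved, so the function returns one quartet per subset.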

  19. Quartet Methods
     • Quartet-based methods operate in two phases:
       – Construct a quartet on every set of four taxa.
       – Combine these quartets into a tree.
     • Running time:
       – For most optimization criteria, determining a quartet is fast.
       – There are $\Theta(n^4)$ quartets, giving $\Omega(n^4)$ running time.
       – In practice, the input quality is insufficient to ensure that all quartets are accurately inferred.
       – Quartet methods therefore have to handle incorrect quartets.
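
The first phase can be illustrated with a short Python sketch. The slides do not say how each quartet is determined, so this example assumes the distance-based four-point method: among the three pairings, pick the one with the smallest sum of within-pair distances. The function names and the `dist` callback are hypothetical. The count of four-taxon sets, C(n, 4), is where the Θ(n⁴) figure comes from.

```python
from itertools import combinations
from math import comb


def infer_quartet(a, b, c, d, dist):
    """Four-point method: choose the pairing with the smallest distance sum.

    dist is a function dist(x, y) -> distance. Returns ((x, y), (z, w)).
    """
    options = {
        ((a, b), (c, d)): dist(a, b) + dist(c, d),
        ((a, c), (b, d)): dist(a, c) + dist(b, d),
        ((a, d), (b, c)): dist(a, d) + dist(b, c),
    }
    return min(options, key=options.get)


def all_quartets(taxa, dist):
    """Phase 1: infer one quartet for every set of four taxa."""
    return [infer_quartet(a, b, c, d, dist)
            for a, b, c, d in combinations(taxa, 4)]


def count_quartets(n):
    """Number of four-taxon subsets of n taxa: C(n, 4), which is Theta(n^4)."""
    return comb(n, 4)
```

With noisy distances the smallest sum can point at the wrong pairing, which is why the second phase has to tolerate incorrect quartets.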

  22. Popular Quartet Methods
     • Q* or Naive Method [Berry & Gascuel ‘97, Buneman ‘71]: Only add edges that agree with all input quartets. Doesn’t tolerate errors: outputs a conservative but unresolved tree.
     • Quartet Cleaning (QC) [Berry et al. 1999]: Add edges with a small number of errors proportional to $q_e$. Many variants: all handle a small number of errors.
     • Quartet Puzzling [Strimmer & von Haeseler 1996]: “Order taxa randomly, greedily add edges, repeat 1000 times.” Output majority tree. Most popular with biologists.
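
The acceptance rule of the naive method ("only add edges that agree with all input quartets") can be sketched as a single check in Python. This is an illustration under assumptions, not the full Q* algorithm: a candidate edge is represented by one side of its split, and it is accepted only if every quartet that crosses it (two taxa from each side) appears in the input set. The names `quartet` and `split_agrees` are hypothetical.

```python
from itertools import combinations


def quartet(a, b, c, d):
    """Represent the quartet ab|cd as an unordered pair of unordered pairs."""
    return frozenset((frozenset((a, b)), frozenset((c, d))))


def split_agrees(side, taxa, quartets):
    """Return True if every quartet crossing the split is in the input set.

    side:     one side of the candidate split (a set of taxa).
    taxa:     all taxa.
    quartets: set of input quartets built with quartet().
    """
    side = set(side)
    other = set(taxa) - side
    for a, b in combinations(sorted(side), 2):
        for c, d in combinations(sorted(other), 2):
            if quartet(a, b, c, d) not in quartets:
                return False  # a single disagreeing quartet rejects the edge
    return True
```

Replacing just one correct input quartet with a wrong one is enough to make the check reject a true edge, which is the sense in which the method does not tolerate errors and tends to return unresolved trees.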

  26. Constructing Networks
     • What if evolution isn’t tree-like? For example:
       [Figure from W.P. Maddison, Systematic Biology ‘97]

  29. Network Methods
     • Split Decomposition (Bandelt & Dress ‘92) decomposes the distance matrix into a sum of “split” metrics and a small residue, yielding a set of splits (bipartitions of taxa).
     • NeighborNet (Bryant & Moulton ‘02) is an agglomerative clustering algorithm that uses splits to produce networks.
     • TCS (Posada & Crandall ‘01) estimates gene phylogenies based on the statistical parsimony method.

  30. Input to Reconstruction Algorithms
     • Almost all assume that the data is aligned.
       [Figure: alignment of bacterial genes by Geneious (Drummond ‘06)]
     • Many assume corrections have been made for the underlying model of evolution.

  38. Models of Evolution
     • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
     • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.
       [Figure: the root sequence AACGT evolves down a rooted binary tree with integer-labeled edges; the final build shows only the resulting set of sequences {ACCCT, GACGT, AACGT, GACGT, GGCGA}.]
     • The assumptions of the model are:
       1. the sites (i.e., the positions within the sequences) evolve independently and identically,
       2. if a site changes state, it changes with equal probability to each of the remaining states, and
       3. the number of changes of each site on an edge e is a Poisson random variable with expectation λ(e) (this is also called the “length” of the edge e).
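
The three assumptions translate almost line for line into code. The following Python sketch evolves a sequence along a single edge of length λ(e): each site independently draws a Poisson number of changes, and each change moves to one of the three other bases with equal probability. The function names are hypothetical, and the Poisson sampler uses Knuth's simple multiplication method, which is an implementation choice suitable only for small λ.

```python
import math
import random

BASES = "ACGT"


def poisson(lam, rng):
    """Sample a Poisson(lam) count with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def evolve_edge(seq, lam, rng):
    """Evolve seq along one edge of length lam under the JC assumptions."""
    out = []
    for base in seq:
        # Assumption 3: Poisson number of changes at this site
        for _ in range(poisson(lam, rng)):
            # Assumption 2: move to one of the other three bases uniformly
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)  # Assumption 1: every site is handled the same way
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    print(evolve_edge("AACGT", 0.3, rng))
```

Evolving sequences down a whole tree amounts to calling `evolve_edge` on each edge from the root outward, feeding each child the sequence produced at its parent.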

  40. How Methods Use Models of Evolution
     • As an explicit part of the algorithm: for example, maximum likelihood and Weighbor.
     • Indirectly, via assumptions on the data or by inputting data that has been corrected under a certain model.
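
As an example of the indirect route, distance-based methods are often given distances that have been corrected under the Jukes-Cantor model. If p is the observed fraction of differing sites between two aligned sequences, the corrected distance is d = −(3/4)·ln(1 − (4/3)·p), an estimate of the expected number of changes per site. The Python sketch below is one possible implementation; the function name `jc_distance` and the choice to return infinity when p ≥ 3/4 (where the formula is undefined) are assumptions made for this example.

```python
import math


def jc_distance(s, t):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    if len(s) != len(t) or not s:
        raise ValueError("sequences must be aligned and non-empty")
    # p: observed proportion of sites at which the sequences differ
    p = sum(x != y for x, y in zip(s, t)) / len(s)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

The correction grows faster than p because multiple changes at the same site hide one another; for p = 0.25 it gives roughly 0.304 rather than 0.25.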

  44. Testing Methods Empirically
     • How accurate are the methods at reconstructing trees?
     • In biological applications, the true, historical tree is almost never known, which makes assessing the quality of phylogenetic reconstruction methods problematic.
     • Simulation is used instead to evaluate methods, given a model of evolution.

  49. Simulation Studies
     1. Construct a “model” tree.
     2. “Evolve” sequences down the tree:
          A  GTTAGAAGGCGGCCA...
          B  CATTTGTCCTAACTT...
          C  CAAGAGGCCACTGCA...
          D  CCGACTTCCAACCTC...
          E  ATGGGGCACGATGGA...
          F  TACAAATACGCGCAA...
     3. Reconstruct the tree using a method.
     4. Evaluate the accuracy of the constructed tree.

  53. Simulating Data: Choosing Trees
     • Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees).
       [Figure: example tree]
     • Can view this as two different random processes:
       – generate the tree shape, and then
       – assign weights or branch lengths to the shape.
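
The two random processes can be sketched separately in Python. The shape generator below follows the Yule-Harding idea of repeatedly picking a current leaf uniformly at random and splitting it in two, then labeling the leaves in random order. The branch-length step draws each edge length uniformly from an interval; that distribution and its default bounds are assumptions for illustration only, as are the function names and the integer node ids.

```python
import random


def yule_shape(leaves, rng):
    """Generate a rooted binary tree shape by splitting random leaves.

    Returns (children, names): children maps an internal node id to its two
    child ids (the root is id 0); names maps each leaf id to a taxon label.
    """
    labels = list(leaves)
    rng.shuffle(labels)
    children = {}
    tips = [0]  # current leaves; start with the root alone
    next_id = 1
    while len(tips) < len(labels):
        node = tips.pop(rng.randrange(len(tips)))  # pick a leaf uniformly
        children[node] = (next_id, next_id + 1)    # split it into two leaves
        tips += [next_id, next_id + 1]
        next_id += 2
    return children, dict(zip(tips, labels))


def assign_lengths(children, rng, low=0.01, high=0.5):
    """Assign a branch length to the edge above every non-root node."""
    return {child: rng.uniform(low, high)
            for kids in children.values() for child in kids}


if __name__ == "__main__":
    rng = random.Random(1)
    shape, names = yule_shape("ABCDEF", rng)
    print(shape, names, assign_lengths(shape, rng))
```

Keeping the two steps apart makes it easy to hold the shape fixed while varying the edge lengths, or the reverse, when probing how a reconstruction method responds.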

  55. Simulating Data: Evolving Sequences
     • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
     • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.
       [Figure: the root sequence AACGT evolves down a rooted binary tree with integer-labeled edges, yielding the set of sequences {ACCCT, GACGT, AACGT, GACGT, GGCGA}.]


  59. Evaluating Accuracy
     • To compare the reconstructed tree to the model tree, the Robinson-Foulds score is often used:
       $\dfrac{\text{False Positives} + \text{False Negatives}}{\text{total edges}}$
       [Figure: two trees on the taxa a, b, c, d, e, f]
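
A split-based version of this score can be sketched in Python. Each internal edge of a tree corresponds to a bipartition of the taxa; a false positive is a split in the reconstructed tree that is missing from the model tree, and a false negative is the reverse. The slide does not spell out what "total edges" counts, so this sketch assumes it means the number of internal edges in both trees combined; the function name `rf_score` and the input format (one side of each non-trivial split) are likewise hypothetical.

```python
def rf_score(model_splits, inferred_splits, taxa):
    """(false positives + false negatives) / total internal edges.

    Each split is given as one side of its bipartition; either side works.
    """
    taxa = frozenset(taxa)

    def canon(side):
        # Normalize a split so that both of its sides map to the same key
        side = frozenset(side)
        other = taxa - side
        return min(side, other, key=lambda s: (len(s), sorted(s)))

    model = {canon(s) for s in model_splits}
    inferred = {canon(s) for s in inferred_splits}
    false_neg = len(model - inferred)  # model edges the method missed
    false_pos = len(inferred - model)  # edges the method invented
    total = len(model) + len(inferred)
    return (false_pos + false_neg) / total if total else 0.0
```

With this normalization the score is 0 for identical trees and 1 when the two trees share no internal edge.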
