DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens


slide-1
SLIDE 1

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens

Katherine St. John City University of New York 1

slide-2
SLIDE 2

Thanks to the DIMACS Staff

  • Linda Casals
  • Walter Morris
  • Nicole Clark

slide-3
SLIDE 3

Tutorial Outline

  • Day 1: Introduction to Phylogenetic Reconstruction
  • Day 2: Applications to Rapidly Evolving Pathogens

slide-4
SLIDE 4

Tutorial Outline

  • Day 1: Introduction to Phylogenetic Reconstruction

– Overview: Katherine St. John, CUNY
– Parsimony Reconstruction of Phylogenetic Trees: Trevor Bruen, McGill University
– Using Maximum Likelihood for Phylogenetic Tree Reconstruction: Rachel Bevan, McGill University
– Hands-on Session: Constructing Trees, Katherine St. John

  • Day 2: Applications to Rapidly Evolving Pathogens

slide-5
SLIDE 5

Tutorial Outline

  • Day 1: Intro to Phylogenetic Reconstruction
  • Day 2: Applications to Rapidly Evolving Pathogens

– Statistical Overview: Alexei Drummond, University of Auckland
– Tricks for trees: Having reconstructed trees, what can we do with them? Mike Steel, University of Canterbury
– Hands-on Session: Katherine St. John

slide-10
SLIDE 10

Overview Outline

  • Overview
  • Constructing Trees
  • Constructing Networks
  • Comparing Reconstruction Methods
  • Evaluating the Results

slide-13
SLIDE 13

Goal: Reconstruct the Evolutionary History

(www.amnh.org/education/teacherguides/dinosaurs)

The evolutionary process not only determines relationships among taxa, but allows prediction of structural, physiological, and biochemical properties.

slide-15
SLIDE 15

Process for Reconstruction: Input Data

Start with information about the taxa. For example:

  • Morphological Characters
  • Biomolecular Sequences

A  GTTAGAAGGCGGCCAGCGAC...
B  CATTTGTCCTAACTTGACGG...
C  CAAGAGGCCACTGCAGAATC...
D  CCGACTTCCAACCTCATGCG...
E  ATGGGGCACGATGGATATCG...
F  TACAAATACGCGCAAGTTCG...

(Other: molecular markers (e.g., SNPs), gene order, etc.)

slide-19
SLIDE 19

Process for Reconstruction

Input Data

A  GTTAGAAGGC...
B  CATTTGTCCT...
C  CAAGAGGCCA...
D  CCGACTTCCA...
E  ATGGGGCACG...
F  TACAAATACG...

→ Reconstruction Algorithms

  • Maximum Parsimony
  • Maximum Likelihood
  • Distance Methods: NJ, Quartet-Based, Fast Converging, ...

→ Output Tree

slide-23
SLIDE 23

Applications

In addition to finding the evolutionary history of species, phylogeny is also used for:

  • drug discovery: used to determine structural and biochemical properties of potential drugs
  • multiple sequence alignment
  • determining the origin of viral and bacterial strains

slide-24
SLIDE 24

Talk Outline

  • Overview
  • Constructing Trees
  • Constructing Networks
  • Comparing Reconstruction Methods
  • Evaluating the Results

slide-29
SLIDE 29

Algorithms for Reconstruction

  • Most optimization criteria are hard:

– Maximum Parsimony (NP-hard: Foulds & Graham ‘82): find the tree that can explain the observed sequences with a minimal number of substitutions.

– Maximum Likelihood Estimation: find the tree with the maximum likelihood, P(data | tree).

  • More on these later today...
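
Searching for the best tree is the hard part; scoring one fixed tree under parsimony is cheap. The sketch below uses Fitch's algorithm, a standard technique that the slides do not name, to count the minimum number of substitutions on a given rooted binary tree. The nested-tuple tree encoding and the toy alignment are assumptions made for illustration.

```python
def fitch_site_changes(tree, states):
    """Count the minimum substitutions at one site on a fixed rooted binary tree.

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    states -- dict mapping each leaf name to its character at this site
    """
    changes = 0

    def candidate_states(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left = candidate_states(node[0])
        right = candidate_states(node[1])
        if left & right:                   # children can agree: no change needed
            return left & right
        changes += 1                       # children disagree: one substitution
        return left | right

    candidate_states(tree)
    return changes


def parsimony_score(tree, sequences):
    """Sum the per-site change counts over all aligned sites."""
    n_sites = len(next(iter(sequences.values())))
    return sum(
        fitch_site_changes(tree, {name: seq[i] for name, seq in sequences.items()})
        for i in range(n_sites)
    )


# Hypothetical toy alignment (not taken from the slides).
sequences = {"a": "AAC", "b": "AAG", "c": "GAG", "d": "GAG"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_1, sequences), parsimony_score(tree_2, sequences))
```

Here tree_1 needs fewer substitutions than tree_2, so maximum parsimony would prefer it among these two candidates.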

slide-34
SLIDE 34

Approximating Trees

  • Exact answers are often wanted, but hard to find.
  • But approximate is often good enough:

– drug design: predicting function via similarity
– sequence alignment: guide trees for alignment
– use as priors or starting points for expensive searches

slide-37
SLIDE 37

Approximation Algorithms

  • Since calculating the exact answer is hard, algorithms that estimate the answer have been developed.

– Heuristics for maximum parsimony and maximum likelihood estimation (use clever ways to limit the number of trees checked, while still sampling much of “tree-space”)
– Polynomial-time methods, often based on the distance between taxa
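
To give a sense of why heuristics must limit the number of trees checked, the following sketch counts the unrooted binary trees on n labeled taxa using the standard formula (2n-5)!! = 3 · 5 · ... · (2n-5). The formula is not stated in the slides; the function name is hypothetical.

```python
def count_unrooted_binary_trees(n_taxa):
    """Number of distinct unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product when n = 3).
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


for n in (4, 6, 10, 20):
    print(n, count_unrooted_binary_trees(n))
```

Even at 20 taxa the count is far too large to enumerate, which is why exhaustive search over tree-space is impractical.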

slide-39
SLIDE 39

Distance-Based Methods

  • These methods calculate the distance between taxa:

     B         D         A         C         F         E
B    -         0.496505  0.496505  0.444519  0.375798  0.268166
D    0.496505  -         0.496505  0.375798  0.275673  0.279728
A    0.496505  0.496505  -         0.362124  0.323812  0.496505
C    0.444519  0.375798  0.362124  -         0.496505  0.496505
F    0.375798  0.275673  0.323812  0.496505  -         0.496505
E    0.268166  0.279728  0.496505  0.496505  0.496505  -

and then determine the tree using the distance matrix.

  • One way to calculate distance is to take the number of differences divided by the sequence length (the normalized Hamming distance).
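
A minimal sketch of the normalized Hamming distance and the resulting pairwise matrix follows. It assumes the sequences are already aligned to equal length; the function names are hypothetical, and the example uses only the 10-base prefixes shown on the slides, so it will not reproduce the matrix above.

```python
from itertools import combinations


def normalized_hamming(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def distance_matrix(sequences):
    """Symmetric dict-of-dicts of pairwise normalized Hamming distances."""
    dist = {name: {name: 0.0} for name in sequences}
    for a, b in combinations(sequences, 2):
        d = normalized_hamming(sequences[a], sequences[b])
        dist[a][b] = d
        dist[b][a] = d
    return dist


# 10-base prefixes of three of the example sequences.
prefixes = {"A": "GTTAGAAGGC", "B": "CATTTGTCCT", "C": "CAAGAGGCCA"}
matrix = distance_matrix(prefixes)
for name, row in matrix.items():
    print(name, {other: round(d, 2) for other, d in row.items()})
```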

slide-43
SLIDE 43

Distance-Based Methods

Popular distance-based methods include

  • Neighbor Joining (Saitou & Nei ‘87), which repeatedly joins the “nearest neighbors” to build a tree,
  • UPGMA (“Unweighted Pair Group Method with Arithmetic Mean”) (Sneath & Sokal ‘73), which similarly clusters close taxa, assuming the rate of evolution is the same across lineages, and
  • Quartet-based methods that decide the topology for every 4 taxa and then assemble them to form a tree (Berry et al. 1999, 2000, 2001).

slide-46
SLIDE 46

Other Distance-Based Methods

  • Weighbor (Bruno et al. ‘00) is a weighted version of Neighbor Joining that chooses which taxa to combine based on a likelihood function of the distances.
  • Disk Covering Method (Warnow et al. ‘98, ‘99, ‘04): a divide-and-conquer approach of theoretical interest that has been combined with many other methods.

slide-51
SLIDE 51

Neighbor Joining (NJ)

  • [Saitou & Nei 1987]: very popular and fast: O(n^3).

– Based on the distance between nodes, join neighboring leaves, replace them by their parent, calculate distances to this node, and repeat.
– This process eventually returns a binary (fully resolved) tree.
– Joining the leaves with the minimal distance does not suffice, so subtract the averaged distances to compensate for long edges.
– Experimental work shows that NJ trees are reasonably accurate, provided the rate of evolution is neither too low nor too high.
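
The steps above can be sketched as follows. This is a simplified illustration of the usual textbook formulation of Neighbor Joining (topology only, no branch lengths), not the authors' code; the helper names and the 4-taxon distances are assumptions. The example is built from a tree with long edges to A and C, so the closest pair (B, D) is not a pair of true neighbors.

```python
def symmetric(pairs):
    """Expand {(a, b): d} into a symmetric dict-of-dicts."""
    dist = {}
    for (a, b), value in pairs.items():
        dist.setdefault(a, {})[b] = value
        dist.setdefault(b, {})[a] = value
    return dist


def neighbor_joining(names, dist):
    """Return the NJ topology as nested tuples (three subtrees at the top level)."""
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}          # work on a copy
    while len(nodes) > 3:
        n = len(nodes)
        totals = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                # Subtracting the summed distances corrects for long edges.
                q = (n - 2) * d[a][b] - totals[a] - totals[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        parent = (a, b)                            # replace the pair by its parent
        d[parent] = {}
        for c in nodes:
            if c not in (a, b):
                new_distance = 0.5 * (d[a][c] + d[b][c] - d[a][b])
                d[parent][c] = new_distance
                d[c][parent] = new_distance
        nodes = [c for c in nodes if c not in (a, b)] + [parent]
    return tuple(nodes)


# Hypothetical additive distances from the tree ((A:3, B:1):1, (C:3, D:1)).
example = symmetric({
    ("A", "B"): 4, ("A", "C"): 7, ("A", "D"): 5,
    ("B", "C"): 5, ("B", "D"): 3, ("C", "D"): 4,
})
nj_tree = neighbor_joining(["A", "B", "C", "D"], example)
print(nj_tree)
```

Although B and D are the closest pair, the corrected criterion joins A with B, which matches the tree the distances were generated from.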

slide-53
SLIDE 53

Quartet Methods

  • A quartet is an unrooted binary tree on four taxa:

[Figure: the three possible quartets on the taxa a, b, c, d: {ab|cd}, {ac|bd}, and {ad|bc}.]

  • Let Q(T) = all quartets that agree with T.
  • [Erdős et al. 1997]: T can be reconstructed from Q(T) in polynomial time.

slide-60
SLIDE 60

Quartet Methods

  • Quartet-based methods operate in two phases:

– Construct quartets on all sets of four taxa.
– Combine these quartets into a tree.

  • Running time:

– For most optimizations, determining a quartet is fast.
– There are Θ(n^4) quartets, giving Ω(n^4) running time.
– In practice, the input quality is insufficient to ensure that all quartets are accurately inferred.
– Quartet methods have to handle incorrect quartets.

slide-63
SLIDE 63

Popular Quartet Methods

  • Q∗ or Naive Method [Berry & Gascuel ‘97, Buneman ‘71]: Only add edges that agree with all input quartets. Doesn’t tolerate errors: outputs a conservative, but unresolved, tree.
  • Quartet Cleaning (QC) [Berry et al. 1999]: Add edges with a small number of errors proportional to q_e. Many variants: all handle a small number of errors.
  • Quartet Puzzling [Strimmer & von Haeseler 1996]: “Order taxa randomly, greedily add edges, repeat 1000 times.” Output majority tree. Most popular with biologists.

slide-67
SLIDE 67

Constructing Networks

  • What if evolution isn’t tree-like?

For example:

(from W.P. Maddison, Systematic Biology ‘97)

slide-70
SLIDE 70

Network Methods

  • Split Decomposition (Bandelt & Dress ‘92) decomposes the distance matrix into a sum of “split” metrics and a small residue, yielding a set of splits (bipartitions of taxa).
  • NeighborNet (Bryant & Moulton ‘02) is an agglomerative clustering algorithm that uses splits to produce networks.
  • TCS (Posada & Crandall ‘01) estimates gene phylogenies using the statistical parsimony method.

slide-71
SLIDE 71

Input to Reconstruction Algorithms

  • Almost all assume that the data is aligned:

(Alignment of bacterial genes by Geneious (Drummond ‘06).)

  • Many assume corrections have been made for the underlying model of evolution.

slide-78
SLIDE 78

Models of Evolution

  • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
  • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.

{ACCCT, GACGT, AACGT, GACGT, GGCGA}

slide-79
SLIDE 79

Models of Evolution

  • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
  • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.
  • The assumptions of the model are:

1. the sites (i.e., the positions within the sequences) evolve independently and identically,
2. if a site changes state, it changes with equal probability to each of the remaining states, and
3. the number of changes of each site on an edge e is a Poisson random variable with expectation λ(e) (this is also called the “length” of the edge e).
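
The three assumptions translate fairly directly into a simulation. The sketch below is one possible reading, not the tutorial's own code: each site independently draws a Poisson number of changes for the edge, and each change moves to one of the three other states uniformly. The tree encoding (a leaf name, or a list of (child, edge_length) pairs), the edge lengths, and the function names are hypothetical.

```python
import math
import random


def poisson_sample(mean, rng):
    """Draw from Poisson(mean) using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def evolve_along_edge(sequence, edge_length, rng):
    """Evolve every site independently along one edge of expected length λ(e)."""
    evolved = []
    for state in sequence:
        for _ in range(poisson_sample(edge_length, rng)):
            # A change goes to each of the remaining three states equally often.
            state = rng.choice([s for s in "ACGT" if s != state])
        evolved.append(state)
    return "".join(evolved)


def evolve_down_tree(node, sequence, rng, leaves=None):
    """Return {leaf name: sequence} after evolving `sequence` down the tree."""
    if leaves is None:
        leaves = {}
    if isinstance(node, str):                      # leaf: record its sequence
        leaves[node] = sequence
        return leaves
    for child, edge_length in node:
        child_sequence = evolve_along_edge(sequence, edge_length, rng)
        evolve_down_tree(child, child_sequence, rng, leaves)
    return leaves


# Hypothetical model tree: ((w:0.1, x:0.1):0.2, y:0.3), root sequence AACGT.
model_tree = [([("w", 0.1), ("x", 0.1)], 0.2), ("y", 0.3)]
leaf_sequences = evolve_down_tree(model_tree, "AACGT", random.Random(1))
print(leaf_sequences)
```

Passing a seeded `random.Random` makes a run repeatable; different seeds give different leaf sequences from the same model tree.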

slide-81
SLIDE 81

How Methods Use Models of Evolution

  • As an explicit part of the algorithm: for example, maximum likelihood, Weighbor.
  • Indirectly, via assumptions on the data or by inputting data that has been corrected under a certain model.
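
As an example of a correction under a model, the standard Jukes-Cantor formula converts an observed fraction of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). The slides do not state this formula; it is included here as a commonly used illustration, and the function name is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed difference fraction p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as p, because repeated changes at the same site hide some of the substitutions that actually occurred.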

slide-85
SLIDE 85

Testing Methods Empirically

  • How accurate are the methods at reconstructing trees?
  • In biological applications, the true, historical tree is almost never known, which makes assessing the quality of phylogenetic reconstruction methods problematic.
  • Simulation is used instead to evaluate methods, given a model of evolution.

slide-89
SLIDE 89

Simulation Studies

1. Construct a “model” tree.
2. “Evolve” sequences down the tree.

A  GTTAGAAGGCGGCCA...
B  CATTTGTCCTAACTT...
C  CAAGAGGCCACTGCA...
D  CCGACTTCCAACCTC...
E  ATGGGGCACGATGGA...
F  TACAAATACGCGCAA...

3. Reconstruct the tree using the method being tested.
4. Evaluate the accuracy of the constructed tree.

slide-94
SLIDE 94

Simulating Data: Choosing Trees

  • Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees)
  • Can view this as two different random processes:

– generate the tree shape, and then
– assign weights or branch lengths to the shape.

slide-95
SLIDE 95

Simulating Data: Evolving Sequences

  • The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.
  • A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T.

[Figure: a rooted binary tree with numbered edge lengths; the sequence AACGT at the root mutates along the edges, yielding the sequences {ACCCT, GACGT, AACGT, GACGT, GGCGA}.]

slide-101
SLIDE 101

Evaluating Accuracy

  • To compare the reconstructed tree to the model tree, the Robinson-Foulds Score is often used:

(False Positives + False Negatives) / (total edges)

[Figure: two trees on the taxa a, b, c, d, e, f being compared.]

  • If there are many possible answers, choose the one with the best parsimony score: the sum of the number of site changes across the edges in the tree.

slide-103
SLIDE 103

Talk Outline

  • Overview
  • Constructing Trees
  • Constructing Networks
  • Comparing Reconstruction Methods
  • Evaluating the Results

slide-104
SLIDE 104

Analyzing & Visualizing Sets of Trees

  • Visualizing single trees
  • Comparing pairs of trees
  • Handling Large Sets of Trees

slide-107
SLIDE 107

Visualizing Single or Pairs of Trees

  • SplitsTree (Huson et al.)
  • TreeView (Page et al.)
  • TreeJuxtaposer (Munzner et al.)

slide-108
SLIDE 108

Analyzing & Visualizing Sets of Trees

  • Amenta & Klingner, InfoVis ‘02
  • Hillis, Heath, & St. John, Sys. Biol. ‘05

slide-111
SLIDE 111

Evaluating the Results

  • Often, a search will result in many (often thousands) of trees with the same score.

Input Data

A  GTTAGAAGGC...
B  CATTTGTCCT...
C  CAAGAGGCCA...
D  CCGACTTCCA...
E  ATGGGGCACG...
F  TACAAATACG...

→ Reconstruction Algorithms

  • Maximum Parsimony
  • Maximum Likelihood

→ Output Trees

slide-112
SLIDE 112

Summarizing Trees

Input Trees

→ Consensus Method

  • Strict Consensus
  • Majority-rule

→ Output Trees

slide-113
SLIDE 113

Strict Consensus Tree

Input trees → Strict Consensus

[Figure: three input trees on the taxa s0, s1, s2, s3, s4 and their strict consensus tree.]

Splits of the input trees:

  • s1s2 | s0s3s4 and s1s2s3 | s0s4
  • s2s3 | s0s1s4 and s1s2s3 | s0s4
  • s2s4 | s0s1s3 and s2s3s4 | s0s1

O(nt) running time: Day ‘85.

slide-114
SLIDE 114

Majority-rule Tree

Input trees → Majority-rule Tree

[Figure: input trees on the taxa s0, s1, s2, s3, s4 and their majority-rule tree.]

  • Includes splits found in a majority of trees
  • Can be 2/3 majority, etc.
  • O(nt) randomized running time: Amenta, Clark, & S. ‘03.
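
Both consensus methods reduce to counting splits. The sketch below is a plain counting approach, not the O(nt) algorithms cited on the slides. Each tree is represented as a set of splits, and each split by the side that excludes s0. The grouping of the six splits into three trees is read off the layout of the Strict Consensus slide and should be treated as an assumption, as should the function names.

```python
from collections import Counter


def split(*side):
    """A split, stored as the frozenset of taxa on the side without s0."""
    return frozenset(side)


def majority_splits(trees, threshold=0.5):
    """Splits that occur in more than `threshold` of the input trees."""
    counts = Counter(s for tree in trees for s in tree)
    return {s for s, c in counts.items() if c > threshold * len(trees)}


def strict_splits(trees):
    """Splits that occur in every input tree."""
    return set.intersection(*trees)


# Assumed reading of the three input trees on s0..s4.
trees = [
    {split("s1", "s2"), split("s1", "s2", "s3")},
    {split("s2", "s3"), split("s1", "s2", "s3")},
    {split("s2", "s4"), split("s2", "s3", "s4")},
]
print("majority:", [sorted(s) for s in majority_splits(trees)])
print("strict:  ", [sorted(s) for s in strict_splits(trees)])
```

Under this reading, only s1s2s3 | s0s4 appears in a majority of the trees, and no split is shared by all three, so the strict consensus is unresolved. Raising `threshold` to 2/3 gives the stricter majority variant mentioned above.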

slide-115
SLIDE 115

Visualizing Sets of Trees

Efficiency is important for real-time visualization.

slide-116
SLIDE 116

Multidimensional Scaling (MDS)

  • Each point represents a tree.
  • Points for similar trees are displayed near one another.

slide-117
SLIDE 117

Distances Between Trees

  • Robinson-Foulds distance: # of edges that occur in only one tree.
  • Calculate in O(n) time using Day’s Algorithm (1985).
  • Extends naturally to weighted trees.
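
With each tree stored as a set of splits, the Robinson-Foulds distance is the size of the symmetric difference of the two sets. The sketch below is this simple set-based version, not Day's O(n) algorithm; the split encoding (the side of each split that excludes s0) and the example trees are assumptions for illustration.

```python
def robinson_foulds(splits_a, splits_b):
    """Number of splits (internal edges) that occur in only one of the two trees."""
    return len(splits_a ^ splits_b)


def false_positives_and_negatives(model_splits, reconstructed_splits):
    """Edges only in the reconstruction, and edges only in the model tree."""
    false_positives = len(reconstructed_splits - model_splits)
    false_negatives = len(model_splits - reconstructed_splits)
    return false_positives, false_negatives


# Hypothetical trees on s0..s4, each given by its two non-trivial splits.
tree_1 = {frozenset({"s1", "s2"}), frozenset({"s1", "s2", "s3"})}
tree_2 = {frozenset({"s2", "s3"}), frozenset({"s1", "s2", "s3"})}
tree_3 = {frozenset({"s2", "s4"}), frozenset({"s2", "s3", "s4"})}

print(robinson_foulds(tree_1, tree_2), robinson_foulds(tree_1, tree_3))
print(false_positives_and_negatives(tree_1, tree_2))
```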

slide-118
SLIDE 118

Other Natural Metrics

  • Tree-bisection-reconnect (TBR):

[Figure: a TBR move on a tree with the taxa A, B, C, D, E, F, G.]

  • Computing the TBR distance is NP-hard (Allen & Steel ‘01).
  • Many attempts, but no approximations with provable bounds.

slide-119
SLIDE 119

Other Natural Metrics

  • Subtree-prune-regraft (SPR):

[Figure: an SPR move on a tree with the taxa A, B, C, D, E, F, G.]

  • Computing the SPR distance is NP-hard for rooted trees (Bordewich & Semple ‘05).
  • 5-approximation for rooted trees (Bonet, Amenta, Mahindru, & S.).

slide-120
SLIDE 120

Summary

  • Constructing Trees
  • Constructing Networks
  • Comparing Reconstruction Methods:
  • Evaluating the Results:
