

slide-1
SLIDE 1

1

Phylogenetics 3: Methods to reconstruct phylogenies

A generalized protocol for molecular phylogenetics, and the associated concerns with each step

  • 1. Collect homologous sequences
       Concerns: gene tree-species tree / paralogy–orthology / trees within trees
  • 2. Multiple sequence alignment
       Concerns: positional homology / gaps / subjectivity-objectivity / methods
  • 3. Phylogeny estimation
       Concerns: philosophy / methods / consistency / power and accuracy
  • 4. Test reliability or fit of phylogenetic estimates
       Concerns: branch support / tree comparison / statistical issues with trees
  • 5. Interpretation and application
       Concerns: independent contrasts / impact of error on conclusions

slide-2
SLIDE 2

2

Molecular characters

  • Nucleotide sequences
      • structural genes (protein, RNA, regulatory)
      • non-structural genes (introns, intergenic sequences, pseudogenes)
  • Protein sequences
      • translate DNA to protein
      • thousands to choose from
  • INDELS
      • nucleotides, amino acids, genes, segments of DNA, etc.
  • DNA-DNA hybridization
  • Restriction fragment length polymorphisms (RFLPs)
  • Large scale genomic rearrangements


slide-3
SLIDE 3

3

Molecular characters: multiple sequence alignment

Some methods:

  • 1. sum-all-pairs method: count the cost of aligning all pairs of sequences and select the alignment that minimizes the total cost
  • 2. star alignment: alignment based on a tree that assumes all sequences are equally related
  • 3. tree alignment: uses “known” information about the relationships of sequences (lineages) to guide the alignment

Take the course “Bioinformatics” (BIOC 4010 / BIOL 4041) to learn more about alignments.

Species 1   … T A G …
Species 2   … T A G …
Species 3   … T A A …
Species 4   … T C A …
Species 5   … C C A …
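To make the sum-all-pairs idea concrete, here is a minimal sketch that scores the five-species toy alignment above. The unit costs (1 per mismatch, 2 per gap) and the function names are assumptions chosen for illustration; real alignment programs use their own cost schemes and also search over alternative alignments, which this sketch does not do.

```python
from itertools import combinations

# Toy alignment from the slide: three aligned sites for five species.
alignment = {
    "Species 1": "TAG",
    "Species 2": "TAG",
    "Species 3": "TAA",
    "Species 4": "TCA",
    "Species 5": "CCA",
}

def pair_cost(a, b, mismatch=1, gap=2):
    """Cost of one aligned pair of rows (hypothetical unit costs)."""
    cost = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue              # a gap shared by both rows costs nothing
        if x == "-" or y == "-":
            cost += gap           # a residue aligned against a gap
        elif x != y:
            cost += mismatch      # two different residues
    return cost

def sum_of_pairs(rows):
    """Total cost summed over every pair of sequences in the alignment."""
    return sum(pair_cost(a, b) for a, b in combinations(rows, 2))

total = sum_of_pairs(list(alignment.values()))
print(total)
```

Under this criterion, a competing alignment of the same sequences would be preferred only if its total cost were lower.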

slide-4
SLIDE 4

4

Molecular characters: multiple sequence alignment

Positional homology is always assumed when constructing alignments

Molecular characters: DNA alignment

human    GTG CTG TCT CCT GCC GAC AAG ACC AAC GTC AAG GCC GCC TGG GGC AAG GTT GGC GCG CAC
cow      ... ... ... G.C ... ... ... T.. ..T ... ... ... ... ... ... ... ... ... .GC A..
rabbit   ... ... ... ..C ..T ... ... ... ... A.. ... A.T ... ... .AA ... A.C ... AGC ...
rat      ... ..C ... G.A .AT ... ..A ... ... A.. ... AA. TG. ... ..G ... A.. ..T .GC ..T
opossum  ... ..C ..G GA. ..T ... ... ..T C.. ..G ..A ... AT. ... ..T ... ..G ..A .GC ...

human    GCT GGC GAG TAT GGT GCG GAG GCC CTG GAG AGG ATG TTC CTG TCC TTC CCC ACC ACC AAG
cow      ... ..A .CT ... ..C ..A ... ..T ... ... ... ... ... ... AG. ... ... ... ... ...
rabbit   .G. ... ... ... ..C ..C ... ... G.. ... ... ... ... T.. GG. ... ... ... ... ...
rat      .G. ..T ..A ... ..C .A. ... ... ..A C.. ... ... ... GCT G.. ... ... ... ... ...
opossum  ..C ..T .CC ..C .CA ..T ..A ..T ..T .CC ..A .CC ... ..C ... ... ... ..T ... ..A

human    ACC TAC TTC CCG CAC TTC GAC CTG AGC CAC GGC TCT GCC CAG GTT AAG GGC CAC GGC AAG
cow      ... ... ... ..C ... ... ... ... ... ... ... ..G ... ... ..C ... ... ... ... G..
rabbit   ... ... ... ..C ... ... ... T.C .C. ... ... ... .AG ... A.C ..A .C. ... ... ...
rat      ... ... ... T.T ... A.T ..T G.A ... .C. ... ... ... ... ..C ... .CT ... ... ...
opossum  ..T ... ... ..C ... ... ... ... TC. .C. ... ..C ... ... A.C C.. ..T ..T ..T ...

Alignment of the nucleotide character states of the β-globin gene from five species of mammals

The order of DNA sequences in the alignment is specified by the order of the taxa in the list. To fit it on the page, the alignment is broken into three parts; such alignments are called INTERLEAVED. The complete DNA sequence is shown for the first taxon (human). All the other sequences are shown relative to human, with the dot, “.”, signifying a match in the character state with the human sequence. Differences are indicated by using the single-letter nucleotide code (A, C, T or G). Note that this alignment could also be analyzed by using distance, likelihood, and Bayesian methods.

slide-5
SLIDE 5

5

Molecular characters: presence-absence data matrix (easy)

Species 1   1 0 0 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 1 1 0 1 1 1 0 0
Species 2   1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1
Species 3   1 1 0 0 1 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0 1
Species 4   1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1
Species 5   1 0 1 0 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 1 1 1

Column groups, in order:

  • Amino acid INDELS in 8 different genes
  • Pseudogene / functional gene
  • Presence / absence of transposon elements at 8 different genomic locations
  • Tandem gene duplication events

Hypothetical presence-absence data matrix for a diversity of molecular characters

Molecular characters: positional homology of gaps is a real pain in the ass

                   10         20         30         40         50         60
           ....|....| ....|....| ....|....| ....|....| ....|....| ....|....|
Mus2.FAS   MTTPALLPLS -----GRRIP PLNL--GPP- ----SFPHHR ATLRLSEKFI LLLILSAFIT
Human_GIA  ---------- ---------- ---------- -----MNSNF ITFDLKMSLL PSNLFSAFIT
Human_GIB  MTTPALLPLS -----GRRIP PLNL--GPP- ----SFPHHR ATLRLSEKFI LLLILSAFIT
Mus_GIA    MPVGGLLPLF SSPGGGGLGS GLGGGLGGG- ----RKGSGP AAFRLTEKFV LLLVFSAFIT
Rabbit_GIA ---------- ---------- ---------- ---------- ---------- ----------
Sus_GIA    MPVGGLLPLF SSPAGGGLGG GLGGGLGGGG GGGGRKGSGP SAFRLTEKFV LLLVFSAFIT

                   70         80         90        100        110        120
           ....|....| ....|....| ....|....| ....|....| ....|....| ....|....|
Mus2.FAS   LCFGAFFFLP DSSKHKRFDL G-LEDVLIPH VDAGKG---- AKNPGVFLIH GPDEHRHREE
Human_GIA  LCFGAIFFLP DSSKLLSGVL FHSSPALQPA ADHKPGPGAR AEDAAEGRAR RREEGAPGDP
Human_GIB  LCFGAFFFLP DSSKHKRFDL G-LEDVLIPH VDAGKG---- AKNPGVFLIH GPDEHRHREE
Mus_GIA    LCFGAIFFLP DSSKLLSGVL FHSNPALQPP AEHKPGLGAR AEDAAEGRVR HREEGAPGDP
Rabbit_GIA ---------- ---------- ---------- ---------- AEDAADGRAR PGEEGAPGDP
Sus_GIA    LCFGAIFFLP DSSKLLSGVL FHSSPALQPA ADHKPGPGAR AEDAADGRAR PGEEGAPGDP

An alignment of real amino acid sequences of the mannosidase protein

slide-6
SLIDE 6

6

Molecular characters: multiple sequence alignment

  • software is far from flawless (many different methods)
  • all alignments must be inspected “by eye”
  • any manual adjustment (by eye) introduces subjectivity
  • one solution is to publish alignments:
      • in a public database
      • in a scientific paper
      • in the supplementary online materials of a scientific journal


slide-7
SLIDE 7

7

Molecular phylogenetics: methods

We divide methods up by two criteria (data and method):

Type of data:

  • 1. characters: discrete character states at positionally homologous sites in a multiple sequence alignment (hence, discrete character methods)
  • 2. distances: evolutionary distance, measured as the average number of substitutions per positionally homologous site, between all pairs of taxa (hence, distance methods)

Species 1   … T A G …
Species 2   … T A G …
Species 3   … T A A …
Species 4   … T C A …
Species 5   … C C A …
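As an illustration of how character data become distance data, the sketch below converts the five-species toy alignment into a table of pairwise distances. It uses the uncorrected proportion of differing sites (often called the p-distance), which is an assumption made here for simplicity; a model-based distance would also correct for multiple substitutions at the same site.

```python
from itertools import combinations

# Toy alignment from the slide: three aligned sites for five species.
alignment = {
    "Species 1": "TAG",
    "Species 2": "TAG",
    "Species 3": "TAA",
    "Species 4": "TCA",
    "Species 5": "CCA",
}

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def distance_matrix(aln):
    """Return {(name_i, name_j): distance} for every pair of taxa."""
    return {(i, j): p_distance(aln[i], aln[j]) for i, j in combinations(aln, 2)}

distances = distance_matrix(alignment)
for (i, j), d in distances.items():
    print(f"{i} vs {j}: {d:.3f}")
```

Once the matrix is built, the individual character states are no longer used: a distance method sees only the pairwise values.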

slide-8
SLIDE 8

8

Molecular phylogenetics: methods


Type of tree-building:

  • 1. clustering algorithm: computationally “build a tree” according to a specific set of “steps”.
  • 2. optimality criterion: a criterion for scoring a tree and comparing different trees with the goal of finding the tree with the best (optimal) score. [also called objective function]

Molecular phylogenetics: most common methods

                          Type of data
Tree-building method      Character                  Distance
Clustering algorithm      -                          UPGMA
                                                     Neighbor-joining (NJ)
Optimality criterion      Maximum parsimony (MP)     Least squares
                          Maximum likelihood (ML)    Minimum evolution (ME)
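To show what “building a tree by a fixed set of steps” looks like, here is a minimal sketch of UPGMA, one of the clustering algorithms listed above. The taxon names and the distance matrix are made up for illustration, and the data structures (nested tuples for clusters) are an arbitrary choice, not the layout of any particular program.

```python
def upgma(names, matrix):
    """Cluster taxa by UPGMA; returns (tree as nested tuples, list of merges)."""
    clusters = [(name, 1) for name in names]        # (label, number of taxa)
    d = {}
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            d[(a, names[j])] = matrix[i][j]

    def dist(x, y):
        return d[(x, y)] if (x, y) in d else d[(y, x)]

    merges = []
    while len(clusters) > 1:
        # Step 1: find the pair of clusters separated by the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = dist(clusters[i][0], clusters[j][0])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        (a, na), (b, nb) = clusters[i], clusters[j]
        new = (a, b)
        merges.append((new, dij / 2))               # node height = half the distance
        # Step 2: distance from the new cluster to each remaining cluster is
        # the average over all member taxa (size-weighted mean).
        for k, (c, _) in enumerate(clusters):
            if k not in (i, j):
                d[(new, c)] = (na * dist(a, c) + nb * dist(b, c)) / (na + nb)
        # Step 3: replace the two clusters with the merged one and repeat.
        clusters = [cl for k, cl in enumerate(clusters) if k not in (i, j)]
        clusters.append((new, na + nb))
    return clusters[0][0], merges

# Hypothetical distances among four taxa (symmetric matrix, zero diagonal).
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, merges = upgma(names, matrix)
print(tree)
print([height for _, height in merges])
```

Note that no tree is ever scored or compared with another: the algorithm simply follows its steps and returns whatever tree results, which is the key contrast with optimality-criterion methods.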

slide-9
SLIDE 9

9

Molecular phylogenetics: maximum parsimony

The parsimony principle is derived from the philosophical principle known as Occam’s Razor: plurality should not be posited without necessity (Pluralitas non est ponenda sine necessitate, William of Occam, medieval English philosopher [ca. 1285-1349]). The “simplest” hypothesis is the one that is chosen under the maximum parsimony criterion (principle of economy).

Molecular phylogenetics: maximum parsimony as optimality criterion

Simple dataset: one positionally homologous DNA site with two character states (A and G)

Species 1   A
Species 2   A
Species 3   G
Species 4   G

We do not know if these states reflect homology or homoplasy. What do we do?

What happens if we consider the other two character states that are possible?

Tree length is measured in steps.

slide-10
SLIDE 10

10

Molecular phylogenetics: maximum parsimony as optimality criterion

Parsimony optimality criterion: minimum tree length in steps

  • can be used to reconstruct ancestral character states
  • can be used to compare different tree topologies
  • tree length is a sum over all the characters in a given dataset
  • e.g., tree 1 = 10 steps; tree 2 = 27 steps

Shortest tree (in steps) = “most parsimonious tree”

What if the lowest number of steps applies to two or more trees? These are “equally parsimonious trees”.

slide-11
SLIDE 11

11

Molecular phylogenetics: maximum parsimony as optimality criterion

Example of the maximum parsimony principle in phylogenetics:

SITE        1 2 3 4 5 6 7 8 9 0 1 2
SPECIES 1   A T G T T G T G A T A A
SPECIES 2   A T G T T C T G G T A A
SPECIES 3   A T G T T A T C A T A A
SPECIES 4   A T G T T A T C G T A A

[Figure: the character states at the variable sites 6, 8 and 9 mapped onto the three possible unrooted trees for four species (TREE 1, TREE 2, TREE 3)]

Lengths of the three possible trees:

  • TREE 1: 5 steps
  • TREE 2: 5 steps
  • TREE 3: 6 steps
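The tree lengths in this example can be checked with a short sketch that counts steps by Fitch's set-intersection method. The data matrix is the one on the slide; the assignment of topologies to the labels TREE 1 to TREE 3 is an assumption inferred from the stated lengths, and the function names are illustrative.

```python
# Data matrix from the slide: 12 sites for 4 species.
data = {
    1: "ATGTTGTGATAA",
    2: "ATGTTCTGGTAA",
    3: "ATGTTATCATAA",
    4: "ATGTTATCGTAA",
}

def fitch_site(tree, states):
    """Return (possible ancestral states, steps) for one site on a binary tree."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0                  # a tip: its observed state
    left_set, left_steps = fitch_site(tree[0], states)
    right_set, right_steps = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                                    # children can agree: no new step
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def tree_length(tree, matrix):
    """Tree length = sum of the minimum number of steps over all sites."""
    n_sites = len(next(iter(matrix.values())))
    total = 0
    for site in range(n_sites):
        states = {taxon: seq[site] for taxon, seq in matrix.items()}
        total += fitch_site(tree, states)[1]
    return total

# Assumed topologies: the unrooted tree ((a,b),(c,d)) is written as if rooted
# on its internal branch, which does not change the parsimony length.
trees = {
    "TREE 1": ((1, 2), (3, 4)),
    "TREE 2": ((1, 3), (2, 4)),
    "TREE 3": ((1, 4), (2, 3)),
}
lengths = {name: tree_length(t, data) for name, t in trees.items()}
print(lengths)
```

With these topologies the sketch reproduces the lengths on the slide, and shows that two trees tie as equally parsimonious.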

Molecular phylogenetics: distance method with an optimality criterion

  • Least squares method: criterion is the fit of the pairwise distances to the branch lengths of the tree.
  • Minimum evolution: criterion is the sum of the branch lengths (t or L).
  • Linear programming techniques under branch length constraints

slide-12
SLIDE 12

12

Example of a distance-based approach to molecular phylogenetics:

  • 1. Obtain a set of homologous gene sequences (here: human, chimp, gorilla, orang) and produce an alignment.
  • 2. Transform the primary data into a matrix of pairwise genetic distance values. This is most often done by using a model of evolution.
  • 3. Select a method of inferring a phylogenetic tree from distance data; in this case it is the least squares method.
  • 4. In this case, determine the S statistic for the set of candidate trees, and select the tree that minimizes S. Note that S is a function of both the tree topology and its branch lengths.
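A minimal sketch of the S statistic follows, assuming the unweighted form: the sum over all pairs of taxa of the squared difference between the observed distance and the path length on the tree. The observed distances and branch lengths below are made-up numbers, not real primate data, and the sketch evaluates S at fixed branch lengths rather than optimizing them for each topology as a full least squares analysis would.

```python
from itertools import combinations

def quartet_path_lengths(pairing, terminal, internal):
    """Tree-implied distances for the unrooted four-taxon tree ((a,b),(c,d))."""
    (a, b), (c, d) = pairing
    side = {a: 0, b: 0, c: 1, d: 1}
    paths = {}
    for x, y in combinations(side, 2):
        length = terminal[x] + terminal[y]
        if side[x] != side[y]:
            length += internal            # the path crosses the internal branch
        paths[frozenset((x, y))] = length
    return paths

def least_squares_score(observed, paths):
    """S = sum over pairs of (observed distance - path length) squared."""
    return sum((observed[pair] - paths[pair]) ** 2 for pair in observed)

# Hypothetical observed distances (arbitrary units).
observed = {
    frozenset(("human", "chimp")): 3,
    frozenset(("human", "gorilla")): 4,
    frozenset(("chimp", "gorilla")): 5,
    frozenset(("human", "orang")): 8,
    frozenset(("chimp", "orang")): 9,
    frozenset(("gorilla", "orang")): 8,
}
# Hypothetical branch lengths, held fixed for both candidate topologies.
terminal = {"human": 1, "chimp": 2, "gorilla": 2, "orang": 6}
internal = 1

tree_a = (("human", "chimp"), ("gorilla", "orang"))
tree_b = (("human", "gorilla"), ("chimp", "orang"))
s_a = least_squares_score(observed, quartet_path_lengths(tree_a, terminal, internal))
s_b = least_squares_score(observed, quartet_path_lengths(tree_b, terminal, internal))
print(s_a, s_b)
```

The candidate with the smaller S fits the distance matrix better; because S depends on the branch lengths as well as the topology, both must be considered when comparing trees.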

Molecular phylogenetics: distance method with an algorithm

Clustering algorithm: computationally “build a tree” according to a specific set of “steps”.

Common algorithms:

  • 1. Stepwise addition
  • 2. Star decomposition
  • 3. Quartet puzzling

slide-13
SLIDE 13

13

Molecular phylogenetics: distance method with an algorithm

Obtaining a tree by star decomposition

[Figure: taxa A to F shown at four successive stages of the decomposition]

Molecular phylogenetics: maximum likelihood as optimality criterion

Maximum likelihood is a statistical tool to answer the question “how adequate is a given hypothesis at explaining the data in hand?”

  • Data: multiple alignment
  • Hypothesis: Tree (and model of sequence evolution)

The method of maximum likelihood measures this notion of “adequacy” by the probability of observing the data (e.g., DNA sequence) given a hypothesis (e.g., tree topology and model). The model is required because we need to compute probabilities of character-state change. We write the likelihood as L = Prob(D|H). (Note that trees are scored in log-likelihoods.)
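As a small worked case of L = Prob(D|H), the sketch below computes the log-likelihood of two aligned sequences as a function of their distance t. The Jukes-Cantor model is assumed here purely because it is the simplest model of sequence evolution; the site counts (100 sites, 10 differences) are hypothetical, and the grid search stands in for the numerical optimizers that real programs use.

```python
import math

def jc69_log_likelihood(t, n_sites, n_diff):
    """log Prob(D | t) for two aligned sequences under the Jukes-Cantor model.

    t is the distance in expected substitutions per site (must be > 0).
    """
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e       # probability a site keeps the same base
    p_diff = 0.25 - 0.25 * e       # probability of one particular different base
    n_same = n_sites - n_diff
    # 0.25 is the equilibrium frequency of the base seen in the first sequence.
    return (n_sites * math.log(0.25)
            + n_same * math.log(p_same)
            + n_diff * math.log(p_diff))

# Hypothetical data: 100 aligned sites, 10 of which differ.
n_sites, n_diff = 100, 10

# Score a grid of candidate distances and keep the one with the highest log-likelihood.
grid = [i / 1000 for i in range(1, 2001)]
t_hat = max(grid, key=lambda t: jc69_log_likelihood(t, n_sites, n_diff))

# For this model the maximum is also known in closed form.
p = n_diff / n_sites
t_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(t_hat, round(t_closed, 4))
```

Here the hypothesis is only a single distance; for a full tree the same idea applies, but the probability must account for the topology, all branch lengths, and the unobserved ancestral states.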

slide-14
SLIDE 14

14


Molecular phylogenetics: another BIG problem

Optimality-based methods require comparison of different trees. The notion of the “best”, or optimal, tree implies that all trees have been scored and compared. Tree space is vast!

NT = 3 × 5 × 7 × … × (2n – 5)
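The product above is easy to evaluate directly. The sketch below is one way to compute it with exact integer arithmetic; the function name is illustrative.

```python
def num_unrooted_trees(n):
    """Number of unrooted, fully bifurcating trees for n >= 3 labelled lineages."""
    if n < 3:
        raise ValueError("need at least three lineages")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (3, 4, 5, 10, 15, 20):
    print(n, f"{num_unrooted_trees(n):,}")
```

Each additional lineage multiplies the count by the next odd number, which is why exhaustive comparison of all trees quickly becomes impossible.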

slide-15
SLIDE 15

15

Molecular phylogenetics: tree space is vast!

Number of lineages    Number of unrooted trees
3                     1
4                     3
5                     15
6                     105
7                     945
8                     10,395
9                     135,135
10                    2,027,025
11                    34,459,425
12                    654,729,075
13                    13,749,310,575
14                    316,234,143,225
15                    7,905,853,580,625
20                    221,643,095,476,699,771,875
50                    ~3 × 10^74
100                   ~2 × 10^182

Getting close to Eddington’s number!!!

[Avogadro’s number is only 6 × 10^23]

Molecular phylogenetics: big datasets require heuristic methods

Heuristic: a procedure that seeks a solution with no guarantee of finding the globally optimal solution.

Phylogenetics uses branch-swapping algorithms to search tree space. Branch-swapping algorithms make small rearrangements in the tree topology, and the optimality criterion is recalculated after each rearrangement. These methods work by trying to climb a hill blind.
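The following toy sketch illustrates why such a search carries no guarantee. It is not a branch-swapping implementation: “tree space” is replaced by ten positions on a line with made-up scores, and a “rearrangement” is a move to an adjacent position. All names and numbers are hypothetical.

```python
def hill_climb(score, start, neighbors):
    """Greedy search: move to the best neighbour until none scores higher."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current          # no small rearrangement improves the score
        current = best

# Hypothetical landscape: a low peak at position 2 and the highest peak at position 7.
scores = [1, 3, 5, 4, 2, 3, 6, 9, 7, 1]

def neighbors(i):
    """Positions reachable by one small step."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(scores)]

def score(i):
    return scores[i]

from_left = hill_climb(score, 0, neighbors)
from_middle = hill_climb(score, 5, neighbors)
print(from_left, from_middle)
```

The search that starts at position 0 stops on the lower peak because every neighbouring move makes the score worse. This is why heuristic tree searches are usually repeated from several different starting trees.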

slide-16
SLIDE 16

16

You have seen “hill-climbing” before…

[Figure: likelihood curve peaking at the ML estimate of p = 0.42, where the maximum likelihood score = 0.228]

Gene 1: ATG ATC CTG CCG ACT … … … … … … TAA
Gene 2: ATG ATT CTG CCA ACT … … … … … … TAA

[Figure: mouse gene 1 and human gene 2 joined to their common ancestor by branches t0 and t1]

Genetic distance between gene 1 and 2 (t) = t0 + t1

slide-17
SLIDE 17

17

Molecular phylogenetics: hill climbing

An extremely simplified example (but from real data):

  • Gene: acetylcholine α receptor
  • Parameters: t and ω

[Figure: mouse and human lineages descending from their common ancestor]

Molecular phylogenetics: the problem with hill climbing

slide-18
SLIDE 18

18


Random versus systematic error

Random error is defined as the deviation between the true value of a parameter and an estimate of that parameter that is due solely to the effects of finite sample size.

Systematic error is the deviation between the true value of a parameter and an estimate that is due to incorrect assumptions of the estimation method.

*An important difference between these two types of error is that while random error decreases with increasing sample size, systematic errors persist, and sometimes intensify, as sample sizes increase.
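The difference can be seen in a small simulation. The sketch below assumes sequences evolve under the Jukes-Cantor model at a true distance of 1.0 substitutions per site, then estimates that distance in two ways: with the raw proportion of differences (an estimator whose assumption of no multiple hits is wrong) and with the Jukes-Cantor correction (an estimator whose assumptions match the simulation). The model, the true distance and the seed are all choices made for illustration.

```python
import math
import random

def simulate_estimates(true_d, n_sites, rng):
    """Simulate n_sites at distance true_d and return (raw, corrected) estimates."""
    # Under Jukes-Cantor, each site differs independently with this probability.
    p_expected = 0.75 * (1.0 - math.exp(-4.0 * true_d / 3.0))
    n_diff = sum(1 for _ in range(n_sites) if rng.random() < p_expected)
    p_hat = n_diff / n_sites
    if p_hat < 0.75:
        jc_hat = -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)
    else:
        jc_hat = float("inf")       # the correction is undefined at saturation
    return p_hat, jc_hat

rng = random.Random(1)              # fixed seed so the run is repeatable
for n_sites in (100, 10_000, 200_000):
    raw, corrected = simulate_estimates(1.0, n_sites, rng)
    print(f"{n_sites:>7} sites: raw = {raw:.3f}, corrected = {corrected:.3f}")
```

As the number of sites grows, the corrected estimate settles near the true value of 1.0 (its error is random), while the raw estimate settles near 0.55 no matter how much data is added (its error is systematic).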

slide-19
SLIDE 19

19

See lecture notes for much more detail…

slide-20
SLIDE 20

20

Uncertainty in a phylogeny: bootstrap

slide-21
SLIDE 21

21

  • Bootstrap: sample with replacement ⇒ pseudo-sample
  • Jackknife: sample without replacement ⇒ pseudo-sample

A phylogenetic tree is not a numerical quantity

slide-22
SLIDE 22

22

Original data:

Site        1  2  3  4  5  6  7  8  9  10
Species 1   T  C  A  G  T  T  C  G  A  T
Species 2   C  C  G  G  T  G  A  C  A  T
Species 3   A  C  A  T  T  T  A  G  A  A
Species 4   G  C  A  T  T  G  A  C  A  A

One possible bootstrap pseudosample:

Site        9  4  3  1  4  8  10 3  9  10
Species 1   A  G  A  T  G  G  T  A  A  T
Species 2   A  G  G  C  G  C  T  G  A  T
Species 3   A  T  A  A  T  G  A  A  A  A
Species 4   A  T  A  G  T  C  A  A  A  A

Bootstrap: sampling a data matrix with replacement
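The resampling step can be sketched in a few lines. The data matrix and the list of sampled sites are taken from the slide; the function names are illustrative, and the random draw uses Python's standard `random` module as one possible source of randomness.

```python
import random

# Original data matrix from the slide (10 sites, 4 species).
original = {
    "Species 1": "TCAGTTCGAT",
    "Species 2": "CCGGTGACAT",
    "Species 3": "ACATTTAGAA",
    "Species 4": "GCATTGACAA",
}

def resample_columns(matrix, sites):
    """Build a pseudosample from 1-based site numbers; whole columns are copied."""
    return {name: "".join(seq[s - 1] for s in sites) for name, seq in matrix.items()}

def bootstrap_sites(n_sites, rng):
    """Draw n_sites site numbers with replacement."""
    return [rng.randint(1, n_sites) for _ in range(n_sites)]

# The draw shown on the slide: some sites appear more than once, others not at all.
slide_sites = [9, 4, 3, 1, 4, 8, 10, 3, 9, 10]
pseudo = resample_columns(original, slide_sites)
for name, seq in pseudo.items():
    print(name, " ".join(seq))

# A fresh random draw (the seed is arbitrary).
random_sites = bootstrap_sites(10, random.Random(42))
another = resample_columns(original, random_sites)
```

In a bootstrap analysis, a tree is estimated from each pseudosample, and the proportion of those trees that contain a given group is reported as the support for that group.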

slide-23
SLIDE 23

23


slide-24
SLIDE 24

24

Rooted trees provide polarity

[Figure: three trees whose tips carry the character states “G” and “A”, with the captions:]

  • “G” is derived and “A” is ancestral
  • “A” is derived and “G” is ancestral
  • We don’t know what is derived and ancestral

slide-25
SLIDE 25

25

Rooting a phylogeny with an outgroup

Let’s define some terms:

  • INGROUP: A group of lineages, assumed to be monophyletic, whose phylogenetic relationships are of primary interest.
  • OUTGROUP: One or more terminal taxa that are assumed to be outside of the monophyletic group that has been specified as the ingroup. Unlike the ingroup, the outgroup does not have to be monophyletic.
  • ROOT: The most evolutionarily basal point of a phylogeny. The root orients the direction of change along a phylogeny relative to time.
  • CHARACTER POLARITY: The evolutionary relationship between two or more states for a given character. Say we have a character with two states, “a” and “b”. By mapping them on a phylogeny we can determine that “b” preceded “a” in evolutionary history; hence “a” is the derived state and “b” is the primitive state.

Rooting a phylogenetic tree by placing the root between the ingroup and outgroup

[Figure: three panels showing the same five taxa (OG, IG-1, IG-2, IG-3, IG-4): the unrooted tree, the root placed between ingroup and outgroup, and the resulting rooted tree]

IG: ingroup
OG: outgroup

slide-26
SLIDE 26

26

See lecture notes for much more detail…