
31‐Mar‐15 1

Phylogenetic trees

Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 31st 2015

A case story

  • In August 1994 a nurse in Lafayette, LA, tests negative for HIV
  • A few weeks later, she breaks off a messy 10-year affair with a doctor
  • Three weeks later, while she is suffering from chronic fatigue symptoms, the doctor gives his ex-mistress a vitamin B-12 shot, somewhat against her will
  • In January 1995, the nurse tests positive for both HIV and hepatitis C. Investigation reveals no obvious means of infection (positive test for a sexual partner, accident with a patient, et cetera). The vitamin B-12 shot becomes suspicious
  • The doctor's office records from the day are conveniently missing but eventually found by police buried in the back of a closet. The records show that the doctor had withdrawn blood samples from a known HIV patient and a known hepatitis C patient the same day as the vitamin B-12 shot. The record keeping is not in line with standard office procedure and there is no information as to what happened to either blood sample
  • The nurse never had contact with either patient
  • Seemingly strong, but otherwise circumstantial, evidence that the doctor deliberately infected the nurse with HIV and hepatitis C

Case story continued

  • HIV evolves very fast
– This is partly why it has been so difficult to develop a cure
  • Can we show that the HIV in the nurse is related to the HIV from the patient?
  • 1. Take samples of HIV from the nurse
  • 2. Take samples of HIV from the patient
  • 3. Take samples of HIV from other HIV positive people from the same town
  • 4. Sequence HIV genes
  • 5. Construct a phylogeny of the HIV

HIV phylogeny

[Figure: HIV phylogeny with bracketed groups of tips labeled "HIV strains found in patient", "HIV strains found in victim", and "HIV strains found in other individuals from Lafayette"]

Phylogenetic trees

  • A phylogenetic tree represents the phylogeny of species or sequences
– Evolutionary signatures reveal the phylogenetic history
  • Phylogenetic trees contain:
– Present day sequences
– Ancestral nodes
– A root

[Figure: example tree drawn along a time axis]

  • The same tree can be represented in many different ways:
[Figure: alternative drawings of the same tree along a time axis]

Ways of constructing phylogenetic trees

  • Distance-based approaches
– Among the fastest programs for making phylogenetic trees
– Unweighted Pair Group Method with Arithmetic mean (UPGMA)
– Neighbor Joining (NJ)
  • Maximum likelihood and maximum parsimony approaches

Distance‐based approach

Multiple sequence alignment
  -> (calculate evolutionary divergence) -> Evolutionary distance matrix
  -> (cluster: UPGMA, Neighbor Joining) -> Phylogenetic tree

Evolutionary distances

  • Sequence (dis-)similarity represents evolutionary distance
– Use similarity quantification methods from last week's lectures
  • But: evolutionary distance does not correlate 1:1 with sequence alignment score
– Because multiple mutations at the same position in the sequence become increasingly likely as sequences diverge
– So we have to correct for that with the Jukes-Cantor correction:

$$d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} D\right)$$

– $D$: observed number of mutated positions, as a fraction of the aligned positions
– $d$: actual number of mutations per position
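As a minimal sketch of the correction, the snippet below computes the observed fraction of differing positions D for two aligned sequences and then applies d = -3/4 ln(1 - 4/3 D). The function names and the two example sequences are illustrative assumptions, not part of the lecture material.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ (D)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(D):
    """Jukes-Cantor corrected distance d = -3/4 * ln(1 - 4/3 * D)."""
    if not 0 <= D < 0.75:
        raise ValueError("the correction is undefined for D >= 0.75")
    return -0.75 * math.log(1 - 4 * D / 3)

# Hypothetical aligned sequences, used only for illustration.
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGTAA"

D = p_distance(seq_1, seq_2)
d = jukes_cantor(D)
print(f"observed D = {D:.2f}, corrected d = {d:.3f}")
```

The corrected distance is always at least as large as the observed one, and the gap widens as D grows, which reflects the hidden multiple mutations at the same position.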

The molecular clock

  • The concept of a molecular clock is based on the observation that the number of amino acid differences in hemoglobin between different lineages changes roughly linearly with time, as estimated from fossil evidence
  • In reverse, this would allow us to estimate the dates of evolutionary events from biological sequence analysis
  • The molecular clock holds in some cases

UPGMA example

        Human  Mouse  Chimp   Worm  Yeast
Human       -      5      1      8      9
Mouse              -      4     10     11
Chimp                     -      9      9
Worm                             -      2
Yeast                                   -

  • UPGMA is a greedy algorithm
– This means that nodes with the smallest distances are joined first

After joining Human and Chimp (H+C):

          H+C  Mouse   Worm  Yeast
H+C         -    4.5    8.5      9
Mouse              -     10     11
Worm                      -      2
Yeast                            -

[Figure: tree so far. H and C join with branch lengths 0.5 each; W and Y join with branch lengths 1 each]


UPGMA example

  • UPGMA is a greedy algorithm
– This means that nodes with the smallest distances are joined first

After joining Worm and Yeast (W+Y):

          H+C  Mouse    W+Y
H+C         -    4.5   8.75
Mouse              -   10.5
W+Y                       -

[Figure: tree so far. H and C join with branch lengths 0.5 each; W and Y join with branch lengths 1 each; M joins H+C with branch length 2.25 for M and 1.75 for H+C]

UPGMA example

  • UPGMA is a greedy algorithm
– This means that nodes with the smallest distances are joined first
– The root is added last at the mid-point

After joining Mouse and H+C ((H+C)+M):

          (H+C)+M      W+Y
(H+C)+M         -     9.33
W+Y                      -

[Figure: final tree. H and C join with branch lengths 0.5 each; W and Y join with branch lengths 1 each; M joins H+C with branch length 2.25 for M and 1.75 for H+C; the root joins (H+C)+M with branch length 2.42 and W+Y with branch length 3.67]
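The worked example can be reproduced with a short script. This is a minimal sketch of UPGMA, assuming the distance matrix from the example and single-letter leaf names (H, M, C, W, Y); the function name, the dictionary-based data layout, and the Newick output are choices made for this illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster leaves with UPGMA and return the tree as a Newick string.

    dist maps (leaf_a, leaf_b) tuples to distances; each pair is listed once.
    """
    size = {n: 1 for n in names}        # number of leaves in each cluster
    height = {n: 0.0 for n in names}    # distance from each cluster to its leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Greedy step: join the two clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        h = d[frozenset((a, b))] / 2    # new node sits at half the distance
        new = f"({a}:{h - height[a]:.2f},{b}:{h - height[b]:.2f})"
        # Distance to every other cluster: average weighted by cluster size.
        for c in size:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        height[new] = h
        del size[a], size[b]
    (tree,) = size                      # the last cluster is the rooted tree
    return tree + ";"

example = {
    ("H", "M"): 5, ("H", "C"): 1, ("H", "W"): 8, ("H", "Y"): 9,
    ("M", "C"): 4, ("M", "W"): 10, ("M", "Y"): 11,
    ("C", "W"): 9, ("C", "Y"): 9,
    ("W", "Y"): 2,
}
newick = upgma(["H", "M", "C", "W", "Y"], example)
print(newick)
```

The branch lengths in the resulting string (0.5, 1, 1.75, 2.25, 2.42, 3.67) match the ones in the example, and the final join places the root at half the distance between the last two clusters.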

An exam question

[Figure: rooted tree with leaves A, B, D, E, and C; the branch lengths 2, 1.5, 1, and 0.5 are given and the other branch lengths are missing]

  • 1. Assume that the molecular clock holds.
a) Fill in the missing branch lengths.
b) What algorithm was used for building this phylogenetic tree?
c) What are dAB and dCD?
d) Write this tree in Newick tree format with branch lengths.
  • 2. Research has revealed that the molecular clock does not hold for the lineage leading to C. If dBC = 6, what is the distance between C and its last common ancestor with A, B, D, and E?

Answers

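[Figure: the same tree with all branch lengths filled in (given: 2, 1.5, 1, 0.5; added: 2, 1.5, 1.5, 3.5)]

  • 1. Assume that the molecular clock holds
a) See above
b) Unweighted Pair Group Method with Arithmetic mean (UPGMA)
c) dAB = 4, dCD = 7
d) (((A:2,B:2):1,(D:1.5,E:1.5):1.5):0.5,C:3.5);
  • 2. 2.5 (dBC minus the distance from B to that ancestor: 6 - (2 + 1 + 0.5))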

Answers

[Figure: two alternative drawings of the same tree with different leaf orders]

  • 1. The following bracket-notations are also correct:
d) (((A:2,B:2):1,(D:1.5,E:1.5):1.5):0.5,C:3.5);
d) (C:3.5,((D:1.5,E:1.5):1.5,(B:2,A:2):1):0.5);
d) (C:3.5,((B:2,A:2):1,(D:1.5,E:1.5):1.5):0.5);
d) (C:3.5,((A:2,B:2):1,(E:1.5,D:1.5):1.5):0.5);

Et cetera
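The answers to 1c and 2 can be checked by reading the Newick string and adding up branch lengths along the path between two leaves. The parser below is a minimal sketch that only handles the simple form used here (names, branch lengths, parentheses, a closing semicolon); the function names are assumptions made for this illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
        start = pos
        while text[pos] not in ":,();":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()

def leaf_edges(tree, path=()):
    """Map each leaf name to the branches (keyed by position in the tree) above it."""
    name, length, children = tree
    edge = {path: length} if path else {}   # the root has no branch above it
    if not children:
        return {name: edge}
    result = {}
    for i, child in enumerate(children):
        for leaf, edges in leaf_edges(child, path + (i,)).items():
            result[leaf] = {**edge, **edges}
    return result

def distance(tree, a, b):
    """Sum the branch lengths on the path between leaves a and b."""
    edges = leaf_edges(tree)
    only_a = sum(v for k, v in edges[a].items() if k not in edges[b])
    only_b = sum(v for k, v in edges[b].items() if k not in edges[a])
    return only_a + only_b

tree = parse_newick("(((A:2,B:2):1,(D:1.5,E:1.5):1.5):0.5,C:3.5);")
print(distance(tree, "A", "B"), distance(tree, "C", "D"))

# Question 2: B lies 2 + 1 + 0.5 from the root, so with dBC = 6 the branch to C is:
b_to_root = sum(leaf_edges(tree)["B"].values())
print(6 - b_to_root)
```

Rotating subtrees changes the string but not these distances, which is why several bracket notations describe the same tree.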


Non‐uniform molecular clock

  • The greedy clustering used by UPGMA only gives the correct tree if the clock runs at the same speed in all branches
– All distances from the leaves to the root are equal
– The tree is called ultrametric
  • This is often not the case:

[Figure: tree of Species A (fast evolving), Species B (slow evolving), Species C (fast evolving), and Species D (slow evolving), with the time axis ending at "Now"]

Unequal rates of evolution are the rule

[Figure: phylogeny with the labels "Protist mitochondrion" and "Plant mitochondrion"]

  • Neighbor Joining (NJ) is designed to account for a non-uniform molecular clock

For detailed explanation check: youtu.be/B‐oHOoYvE6E
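Whether a distance matrix is consistent with a uniform clock can be tested before choosing UPGMA. The sketch below uses the three-point condition, a standard property of ultrametric distances that the slides do not state explicitly: for every three leaves, the two largest of the three pairwise distances must be equal. The helper name, the tolerance, and the two example matrices are assumptions for illustration.

```python
from itertools import combinations

def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition on a dict of frozenset({a, b}) -> distance."""
    for a, b, c in combinations(names, 3):
        ordered = sorted([
            dist[frozenset((a, b))],
            dist[frozenset((a, c))],
            dist[frozenset((b, c))],
        ])
        # The two largest distances must be equal for a clock-like tree.
        if abs(ordered[2] - ordered[1]) > tol:
            return False
    return True

def as_matrix(pairs):
    """Turn {(a, b): distance} into the frozenset-keyed form used above."""
    return {frozenset(pair): value for pair, value in pairs.items()}

# Distances read from a clock-like tree: (((A:2,B:2):1,(D:1.5,E:1.5):1.5):0.5,C:3.5);
clock_like = as_matrix({
    ("A", "B"): 4, ("D", "E"): 3,
    ("A", "D"): 6, ("A", "E"): 6, ("B", "D"): 6, ("B", "E"): 6,
    ("A", "C"): 7, ("B", "C"): 7, ("D", "C"): 7, ("E", "C"): 7,
})

# The matrix from the UPGMA example (Human, Mouse, Chimp, Worm, Yeast).
upgma_example = as_matrix({
    ("H", "M"): 5, ("H", "C"): 1, ("H", "W"): 8, ("H", "Y"): 9,
    ("M", "C"): 4, ("M", "W"): 10, ("M", "Y"): 11,
    ("C", "W"): 9, ("C", "Y"): 9, ("W", "Y"): 2,
})

print(is_ultrametric("ABCDE", clock_like))
print(is_ultrametric("HMCWY", upgma_example))
```

A matrix that fails this check, such as the Human/Mouse/Chimp triple with distances 1, 4, and 5, cannot be fitted exactly by an ultrametric tree, which is the situation where NJ is the safer distance method.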

Maximum parsimony (MP) and likelihood (ML)

  • Maximum parsimony (MP): the tree that requires the fewest evolutionary events to explain the alignment
– Occam's razor: the simplest explanation of the observations
  • Maximum likelihood (ML): the tree most likely to have led to the alignment given a certain model of evolution

Maximum parsimony (MP)

  • MP example for a single position "alignment" in 5 species:

Chimpanzee  C
Gibbon      T
Gorilla     C
Human       C
Orangutan   T

  • Draw all possible trees for the sequences/species present in your multiple alignment
  • For each tree, identify where the mutations have taken place
– Make parsimony assumption: minimum number of required mutations
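Counting the minimum number of mutations on one candidate tree can be sketched with Fitch's algorithm, a standard method that the slides do not name and that is swapped in here for illustration. The two tree topologies below are hypothetical candidates chosen for this example; only the character states come from the single-position alignment.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum mutations) for a rooted binary tree.

    A tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, states)
    right_states, right_cost = fitch(right, states)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra mutation needed.
        return shared, left_cost + right_cost
    # The children disagree: one mutation is required on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1

states = {
    "Chimpanzee": "C", "Gibbon": "T", "Gorilla": "C",
    "Human": "C", "Orangutan": "T",
}

# Two hypothetical candidate topologies (assumptions for illustration).
tree_1 = ((("Human", "Chimpanzee"), "Gorilla"), ("Orangutan", "Gibbon"))
tree_2 = ((("Human", "Gibbon"), "Gorilla"), ("Orangutan", "Chimpanzee"))

cost_1 = fitch(tree_1, states)[1]
cost_2 = fitch(tree_2, states)[1]
print(cost_1, cost_2)
```

The first topology explains the column with a single mutation and the second needs two, so under the parsimony assumption the first is preferred for this position.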

Maximum parsimony (MP)

  • How many trees are there? (n is the number of sequences/species)
– # unrooted trees: $N_U = (2n - 5)!! = (2n - 5) \times (2n - 7) \times \dots \times 3 \times 1$
– # rooted trees: $N_R = (2n - 3)!! = (2n - 3) \times (2n - 5) \times \dots \times 3 \times 1$
  • The MP tree has the minimum number of required mutations
– It is the simplest explanation of the alignment
– Informative positions contain ≥2 different characters, each occurring ≥2 times

[Figure: tree with mutations marked on its branches (7: a-t, 6: a-t, 1: c-t, 3: t-a)]
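To get a feeling for how quickly the number of candidate trees grows, the double factorials can be evaluated directly. This is a small sketch; the function names are assumptions made for this illustration.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted_trees(n):
    """Number of unrooted binary trees for n sequences: (2n - 5)!!"""
    return double_factorial(2 * n - 5)

def n_rooted_trees(n):
    """Number of rooted binary trees for n sequences: (2n - 3)!!"""
    return double_factorial(2 * n - 3)

for n in (3, 5, 10, 20):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

For the 5 species in the MP example there are 15 unrooted and 105 rooted trees; at 10 sequences the counts already exceed two million, which is why drawing all possible trees is only feasible for small alignments.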

Maximum likelihood (ML)

  • The simplest explanation is not always the most likely one
– We know that some mutations are more likely than others
  • Like MP, ML starts by drawing all possible trees
  • For each tree, ML then calculates how likely it is that this tree gave rise to the observations (i.e. the alignment)
– This depends on assumptions made about evolutionary events
      • Substitution matrix and gap penalties
      • Faster/slower evolving positions in the sequence
      • Faster/slower evolving lineages in the tree
– These assumptions are called the evolutionary model
  • The maximum likelihood tree is the tree that is most likely to have generated the observations (i.e. the multiple alignment) given the model
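The idea of scoring a hypothesis by the probability of the observations can be shown on the smallest possible case. The sketch below is not a tree search: it assumes only two aligned sequences, so the "tree" is a single branch of length d, and it assumes the Jukes-Cantor model, in which a site stays the same with probability 1/4 + 3/4 e^(-4d/3) and changes to one specific other base with probability 1/4 - 1/4 e^(-4d/3). The site counts and the grid of candidate branch lengths are made up for illustration.

```python
import math

def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length d (JC model)."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay     # site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay     # site shows one specific different base
    # Each site starts from one of four equally likely bases (factor 1/4).
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

# Hypothetical alignment summary: 100 sites, 30 of which differ.
n_same, n_diff = 70, 30

# Try a grid of candidate branch lengths and keep the most likely one.
candidates = [i / 1000 for i in range(1, 2001)]
best_d = max(candidates, key=lambda d: jc_log_likelihood(d, n_same, n_diff))

jc_corrected = -0.75 * math.log(1 - 4 * (n_diff / (n_same + n_diff)) / 3)
print(best_d, round(jc_corrected, 3))
```

In this two-sequence case the most likely branch length coincides with the Jukes-Cantor corrected distance. A real ML program repeats this kind of calculation over all branches of every candidate tree, under whichever evolutionary model is assumed.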


How to root a tree

  • Outgroup rooting
– Include a basal sequence/species in the tree that branched off before the rest of the sequences/species diverged
– You know that the root lies between this distant homolog and the rest of the sequences/species
  • Midpoint rooting
– Assume that the root lies in the middle of the longest distance between two nodes in the tree
  • Use a gene duplication event
– Applicable when studying paralogous genes in a family (next lecture)

[Figure: rooting examples on trees of Yeast (Y), Fish (F), Mouse (M), and Human (H), with additional leaf labels f- and h-; branch lengths not recoverable]

Remember: garbage in = garbage out!

  • Any sequences will be aligned by an alignment program
– Even if the sequences are not homologous, and thus the "multiple sequence alignment" is meaningless
  • Any "multiple sequence alignment" will be turned into a "tree" by a phylogeny program
– Even if the sequences are not homologous or if they are very badly aligned, and thus the "phylogenetic tree" is meaningless
  • Solutions:
– Check your multiple sequence alignment carefully before making a phylogenetic tree
– If the tree shows unexpected branching order, think twice about your methods before interpreting it as a true biological event

Which phylogeny approach to use?

  • As with sequence alignments, remember that a phylogenetic tree is a hypothesis of the true evolutionary history
  • Distance trees are fast and give you a quick idea of the phylogenetic relationships
– NJ is better than UPGMA because unequal rates of evolution are a common phenomenon
  • ML trees are slower but tend to give more reliable phylogenetic trees
– MP is rarely used for phylogenetic inference
  • You can test which approach is the best if you know the true Tree of Life

Phylogenetic markers

  • Available/easy to sequence
  • Present in all species
  • Constant function
  • Slowly evolving
  • Example markers:
– Cytochrome C (Fitch, Science 1967)
– SSU rRNA (Woese et al, PNAS 1977)

Ribosomal RNA genes

  • Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)
– 16S rRNA (Bacteria and Archaea)
– 18S rRNA (Eukaryotes)

Different genes tell different stories

  • Conflict between trees based on single genes
– Unrecognized paralogy (next lecture)
– Horizontal gene transfer
– Mutation saturation, biases, divergent rates of evolution
  • Combine information from more genes to average out these anomalies
– Complete genomes contain the maximum phylogenetic information
– Application of complete genomes to reconstruct phylogenies is called phylogenomics


Testing phylogeny approaches on fungi

  • Many complete genomes are available
  • Consensus about phylogeny
  • ML tree of the concatenated multiple sequence alignments of universal proteins recovers most target nodes

19 target nodes

How do we assess confidence in an experiment?

  • We use statistics to assess confidence in an experiment
– We repeat the experiment N times and test how robust the result is
– How much variation is there in the result?

How do we assess confidence in a tree?

  • We have already used all the data to create the phylogenetic tree, so how can we get information about the statistical confidence of the nodes in the tree?
  • "To pull oneself up by one's bootstraps" means "to better oneself without external help"
  • Bootstrap re-sampling or bootstrapping:
– A multiple alignment consists of many observations (i.e. the positions or columns)
– We can randomly re-sample new datasets from these many observations and test if the nodes in the phylogenetic tree are robust in these new datasets

Bootstrap re‐sampling

  • Randomly sample columns (positions) from a multiple alignment
– The number of sampled characters is identical to the number of positions in the original alignment
– Sampling is done with replacement, so some positions can be sampled multiple times, while other positions are never chosen
  • Re-calculate a bootstrap tree based on the randomly re-sampled alignment
  • Repeat this e.g. 100-1,000 times
– This means making 100-1,000 trees
  • For each branch in the original tree, in what percentage of the bootstrap trees was it correctly recovered?
– This is the bootstrap score for the branch
  • Like many alignment algorithms, bootstrapping assumes that all positions in the alignment evolve independently (which is not true)
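The re-sampling procedure can be sketched in a few lines. Everything named here is an assumption for illustration: the alignment is a made-up dictionary of equal-length sequences, and `closest_pair_clade` is a toy stand-in for a real tree-building program that reports only the single clade formed by the two most similar sequences.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """Sample as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

def closest_pair_clade(alignment):
    """Toy 'tree builder': return the clade of the two sequences with fewest differences."""
    def differences(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    pair = min(combinations(sorted(alignment), 2), key=lambda p: differences(*p))
    return {frozenset(pair)}

def bootstrap_scores(alignment, build_clades, n_replicates=100, seed=0):
    """Percentage of bootstrap replicates in which each original clade is recovered."""
    rng = random.Random(seed)
    original = build_clades(alignment)
    counts = {clade: 0 for clade in original}
    for _ in range(n_replicates):
        replicate = build_clades(resample_columns(alignment, rng))
        for clade in original:
            if clade in replicate:
                counts[clade] += 1
    return {clade: 100 * n / n_replicates for clade, n in counts.items()}

# Hypothetical alignment: A and B are identical, C differs at every position.
alignment = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "CATGCATG"}
scores = bootstrap_scores(alignment, closest_pair_clade)
print(scores)
```

In practice `build_clades` would be replaced by the same tree-building method used for the original tree, and every internal branch of that tree would receive its own score.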

An exam question

  • What is the maximum number of bootstrap trees where Zebrafish_1A and Stickleback_1A could have formed a clade of two leaves?
  • What is the maximum number of bootstrap trees where any Clawed Frog sequence could have formed a clade of two leaves with any Stickleback sequence?