Probabilistic Models of Phylogeny



SLIDE 1

2/15/09 1

CSCI1950-Z Computational Methods for Biology, Lecture 7

Ben Raphael February 9, 2009

http://cs.brown.edu/courses/csci1950-z/

Outline

Probabilistic Models of Phylogeny

  1. Models of nucleotide change
  2. Computing likelihood of trees

SLIDE 2

Distances from Sequences

Chimpanzee: CCTGCCAGTTAGCAAACGG
Ancestor:   CCCGCGACTTAACAAACGC
Human:      CCTGCGAGTTAACAAACGA

Hamming distance = 3. D_S = differences per site = 3/20. 12 total mutations.

[Figure: types of substitution, labeled coincident, back, parallel, convergent, multiple.]
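The distance calculation above can be sketched in a few lines of Python. This is a minimal illustration: the function names `hamming_distance` and `differences_per_site` are made up for this example. The sequences as transcribed here have 19 characters, so the per-site value computed below is 3/19 rather than the slide's 3/20.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def differences_per_site(a: str, b: str) -> float:
    """Hamming distance divided by the number of aligned sites."""
    return hamming_distance(a, b) / len(a)


# Sequences copied from the slide as transcribed (19 characters each).
chimpanzee = "CCTGCCAGTTAGCAAACGG"
ancestor = "CCCGCGACTTAACAAACGC"
human = "CCTGCGAGTTAACAAACGA"

# Differences visible when comparing the two present-day sequences.
observed = hamming_distance(chimpanzee, human)

# Differences counted along both lineages from the ancestor.
along_tree = hamming_distance(ancestor, chimpanzee) + hamming_distance(ancestor, human)

print(observed, along_tree, differences_per_site(chimpanzee, human))
```

The count along the two lineages is larger than the observed distance, and the slide's total of 12 mutations is larger still. The gap comes from back, parallel and multiple substitutions, which a direct comparison cannot see.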

Jukes‐Cantor Model

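Rate matrix (rows and columns ordered A, C, G, T):

\[
Q = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\]

\[
P(t) = e^{Qt}
\]

\[
p_{xy}(t) = \Pr[X_t = x \mid X_0 = y] =
\begin{cases}
\tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} & \text{if } x = y \\
\tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t} & \text{if } x \neq y
\end{cases}
\]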
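As a sanity check on the closed form, the sketch below compares it with a truncated Taylor series for e^{Qt}. The helper names (`jc_prob`, `jc_rate_matrix`, `mat_exp`) and the values of alpha and t are assumptions chosen for illustration. The series is only a rough stand-in for a proper matrix exponential routine.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t, alpha=1.0):
    """Closed-form Jukes-Cantor probability Pr[X_t = x | X_0 = y]."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def jc_rate_matrix(alpha):
    """Rate matrix Q: -3*alpha on the diagonal, alpha elsewhere."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Product of two 4 x 4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def mat_exp(q, t, terms=40):
    """Truncated Taylor series for e^{Qt}; adequate when alpha * t is small."""
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for n in range(1, terms):
        scaled = [[q[i][j] * t / n for j in range(4)] for i in range(4)]
        term = mat_mul(term, scaled)  # term now holds (Qt)^n / n!
        result = [[result[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return result


alpha, t = 1.0, 0.3  # illustrative values only
P = mat_exp(jc_rate_matrix(alpha), t)
max_gap = max(
    abs(P[i][j] - jc_prob(BASES[i], BASES[j], t, alpha))
    for i in range(4)
    for j in range(4)
)
print(max_gap)
```

Because Q is symmetric under Jukes-Cantor, it does not matter here whether rows index the starting or the ending state.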

SLIDE 3

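Observed Differences vs. Jukes-Cantor

[Figure: observed differences compared with the Jukes-Cantor model; the only surviving axis label reads "= αt".]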

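Other Models

In biology, not all substitutions are equally likely.

  • Purines: {A, G}. Pyrimidines: {C, T}.
  • Transition: a substitution within a class (purine to purine, or pyrimidine to pyrimidine).
  • Transversion: a substitution between the two classes.

Kimura 2 parameter model (rows and columns ordered A, C, G, T; α is the transition rate and β is the transversion rate):

\[
Q = \begin{pmatrix}
-2\beta - \alpha & \beta & \alpha & \beta \\
\beta & -2\beta - \alpha & \beta & \alpha \\
\alpha & \beta & -2\beta - \alpha & \beta \\
\beta & \alpha & \beta & -2\beta - \alpha
\end{pmatrix}
\]

Other models

  • HKY model for DNA
  • Other models for protein sequences (20 x 20 matrices)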
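The Kimura matrix can be built directly from the purine/pyrimidine classes. The following is a small sketch; `is_transition` and `kimura_rate_matrix` are illustrative names, and the rate values are arbitrary.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True when x and y differ but are both purines or both pyrimidines."""
    return x != y and ((x in PURINES) == (y in PURINES))


def kimura_rate_matrix(alpha, beta):
    """Kimura 2-parameter Q: alpha for transitions, beta for transversions."""
    q = []
    for x in BASES:
        row = []
        for y in BASES:
            if x == y:
                row.append(-2.0 * beta - alpha)
            elif is_transition(x, y):
                row.append(alpha)
            else:
                row.append(beta)
        q.append(row)
    return q


Q = kimura_rate_matrix(alpha=2.0, beta=0.5)  # arbitrary example rates
for base, row in zip(BASES, Q):
    print(base, row)
```

Setting alpha equal to beta recovers the Jukes-Cantor matrix, which is a quick way to check the construction.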

SLIDE 4

Probabilistic Model

Given a character matrix M, what is the “most likely” tree T that generated M?

Pr[ x | y, t ] = probability that y mutates to x in time t.

[Figure: a branch of length t from y to x.]

Rat:        ACAGTGACGCCCCAAACGT
Mouse:      ACAGTGACGCTACAAACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGACGTAGCAAACGA
Human:      CCTGTGACGTAGCAAACGA

Likelihood

Given data D and a model (hypothesis) H for the generation of D, the likelihood of H is the quantity L[H; D] = Pr[ D | H ]. Note: the likelihood of H is not the probability of H.

SLIDE 5

Probabilistic Model

Given a character matrix M, what is the “most likely” tree T that generated M?

Assume:

  1. Characters evolve independently.
  2. Constant rate of mutation on each branch.
  3. State of a node depends only on parent and branch length, i.e. Pr[ x | y, t ] depends only on y and t (Markov process).

Pr[ x | y, t ] = probability that y mutates to x in time t.

Maximum Likelihood

Given a character matrix M and a tree T with branch lengths $t^* = t_1, \dots, t_{2n-2}$:

  L(T, t*) = Pr[ M | T, t* ] is called the likelihood.

Maximum likelihood: find $\arg\max_{T,\, t^*} L(T, t^*)$.

First: how to compute L(T, t*)?

SLIDE 6

Probabilistic Model

Given a tree (T, t*) with leaves labeled by characters in M, Pr[ M | T, t* ] is the probability of the leaf labels, obtained by summing over the possible labelings of the ancestral nodes.

Assume:

  1. Characters evolve independently: $\Pr[M \mid T, t^*] = \prod_i \Pr[M_i \mid T, t^*]$, so consider each character separately.
  2. Constant rate of mutation on each branch.
  3. State of a vertex depends only on parent and branch length, i.e. Pr[ x | y, t ] depends only on y and t (Markov process).

Pr[ x | y, t ] = probability that y mutates to x in time t.

Example

[Figure: example tree with leaves labeled A, T, C, G, internal nodes x, y, z, and branch lengths t1, …, t6.]

SLIDE 7

Probabilistic Model

Two species:

\[
\Pr[x_1, x_2, a \mid T, t_1, t_2] = q_a \Pr[x_1 \mid a, t_1] \Pr[x_2 \mid a, t_2]
\]

where

  T = tree topology
  x_1, x_2 = characters for each species
  a = character for ancestor
  q_a = Pr[ ancestor has character a ]
  Pr[ x | y, t ] = probability that y mutates to x in time t

[Figure: ancestor a joined to leaves x_1 and x_2 by branches of length t_1 and t_2.]

Probabilistic Model

Two species, summing over the unobserved ancestor:

\[
\Pr[x_1, x_2 \mid T, t_1, t_2] = \sum_a q_a \Pr[x_1 \mid a, t_1] \Pr[x_2 \mid a, t_2]
\]

This follows from the joint probability $\Pr[x_1, x_2, a \mid T, t_1, t_2]$ by the Law of Total Probability: P(X) = Σ_i P(X | Y_i) P(Y_i).
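The two-species sum is small enough to write out directly. The sketch below assumes the Jukes-Cantor model for Pr[ x | y, t ] and a uniform root distribution q_a = 1/4. Both are choices made for this example, as are the function names.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor Pr[x | y, t] (assumed substitution model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def two_species_likelihood(x1, x2, t1, t2, prior=None):
    """Sum over the unobserved ancestor a of q_a Pr[x1 | a, t1] Pr[x2 | a, t2]."""
    if prior is None:
        prior = {a: 0.25 for a in BASES}  # assumed uniform q_a
    return sum(prior[a] * jc_prob(x1, a, t1) * jc_prob(x2, a, t2) for a in BASES)


value = two_species_likelihood("A", "G", t1=0.1, t2=0.2)

# Under these assumptions the sum collapses to a single branch of length t1 + t2.
shortcut = 0.25 * jc_prob("G", "A", 0.1 + 0.2)
print(value, shortcut)
```

The agreement between `value` and `shortcut` reflects the symmetry of Jukes-Cantor with a uniform prior: only the total path length between the two species matters, not where the ancestor sits along it.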

SLIDE 8

Probabilistic Model

n species: x_1, x_2, …, x_n. Let α(i) = ancestor of node i. Let a_{n+1}, a_{n+2}, …, a_{2n-1} = characters on internal nodes, where nodes are numbered from internal vertices up to the root.

\[
\Pr[x_1, \dots, x_n \mid T, t_1, \dots, t_{2n-2}] =
\sum_{a_{n+1}, a_{n+2}, \dots, a_{2n-1}} q_{a_{2n-1}}
\prod_{i=n+1}^{2n-2} \Pr[a_i \mid a_{\alpha(i)}, t_i]
\prod_{i=1}^{n} \Pr[x_i \mid a_{\alpha(i)}, t_i]
\]

Follows from the Law of Total Probability: P(X) = Σ_i P(X | Y_i) P(Y_i).

Felsenstein’s Algorithm

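Let Pr[T_k | a] = probability of leaf nodes “below” node k, given a_k = a. Compute via dynamic programming. For an internal node k with children i and j:

\[
\Pr[T_k \mid a] =
\Bigl( \sum_b \Pr[b \mid a, t_i] \Pr[T_i \mid b] \Bigr)
\Bigl( \sum_c \Pr[c \mid a, t_j] \Pr[T_j \mid c] \Bigr)
\]

Initial conditions. For k = 1, …, n (leaf nodes):

\[
\Pr[T_k \mid a] =
\begin{cases}
1 & \text{if } a = x_k \\
0 & \text{otherwise}
\end{cases}
\]

[Figure: a node with state a and two children with states b and c.]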
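The recursion can be sketched as a short recursive function. This is a minimal illustration, not a reference implementation. It assumes the Jukes-Cantor model, a uniform root distribution, and a hypothetical nested-tuple tree encoding: a leaf is a one-letter string, and an internal node is a pair of (subtree, branch length) pairs. The final sum over root states with weights q_a is included so the sketch produces a single likelihood value.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor Pr[x | y, t]; any substitution model could be swapped in."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def conditional(node):
    """Return {a: Pr[T_k | a]} for the subtree rooted at `node`."""
    if isinstance(node, str):
        # Initial condition: 1 if a matches the observed character, else 0.
        return {a: 1.0 if a == node else 0.0 for a in BASES}
    (left, t_left), (right, t_right) = node
    below_left, below_right = conditional(left), conditional(right)
    table = {}
    for a in BASES:
        left_sum = sum(jc_prob(b, a, t_left) * below_left[b] for b in BASES)
        right_sum = sum(jc_prob(c, a, t_right) * below_right[c] for c in BASES)
        table[a] = left_sum * right_sum
    return table


def likelihood(root, prior=None):
    """Sum over root states a of q_a * Pr[T_root | a]."""
    if prior is None:
        prior = {a: 0.25 for a in BASES}  # assumed uniform q_a
    table = conditional(root)
    return sum(prior[a] * table[a] for a in BASES)


def make_tree(w, x, y, z):
    """Hypothetical four-leaf tree with made-up branch lengths."""
    return (((w, 0.1), (x, 0.2)), 0.05), (((y, 0.3), (z, 0.1)), 0.15)


site_likelihood = likelihood(make_tree("A", "T", "C", "G"))

# Summed over every possible leaf labeling, the likelihoods should total 1.
total = sum(likelihood(make_tree(*leaves)) for leaves in product(BASES, repeat=4))
print(site_likelihood, total)
```

Each node is visited once and does a constant amount of work per pair of states, so the cost per site grows linearly with the number of nodes. The explicit sum over all internal labelings grows exponentially.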

SLIDE 9

Computing the Likelihood

Let Pr[T_k | a] = probability of leaf nodes “below” node k, given a_k = a.

\[
\Pr[x_1, \dots, x_n \mid T, t^*] = \sum_a \Pr[T_{2n-1} \mid a] \, q_a
\]

Note: the root is node 2n-1.

Maximum Likelihood when T unknown

Find T, t* that maximize:

\[
\Pr[x_1, \dots, x_n \mid T, t^*] = \sum_a \Pr[T_{2n-1} \mid a] \, q_a
\]

Must search over all trees T. Complexity unknown until recently:

  – Felsenstein book (2004): “There has also been no proof that the problem is NP-hard (as there has been for many other methods)”
  – Shamir notes (2000): “[Maximum likelihood] not proven to be NP-complete.”
  • ML is NP-hard (B. Chor and T. Tuller, RECOMB 2005).

SLIDE 10

Unknown branch lengths

  • T fixed, branch lengths t* are unknown.
  • Use a local optimization routine, e.g. Newton’s method or Expectation Maximization.
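For the simplest case, two sequences under Jukes-Cantor, the branch length can be fitted numerically and checked against a closed form. The sketch below uses a plain ternary search in place of Newton's method or EM, purely because it is short and has no derivative to get wrong. The function names and the search bounds are assumptions.

```python
import math


def log_likelihood(u, n_sites, n_diff):
    """Two-sequence Jukes-Cantor log-likelihood as a function of u = alpha * t.

    t is the total branch length separating the two sequences. The constant
    contributed by a uniform root distribution is omitted.
    """
    decay = math.exp(-4.0 * u)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return (n_sites - n_diff) * math.log(same) + n_diff * math.log(diff)


def maximize(f, lo, hi, iterations=200):
    """Ternary search for the maximum of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# 19 aligned sites with 3 differences: illustrative counts.
n_sites, n_diff = 19, 3
u_hat = maximize(lambda u: log_likelihood(u, n_sites, n_diff), 1e-9, 2.0)

# Closed-form maximizer obtained by solving n_diff / n_sites = 3/4 (1 - e^{-4u}).
closed_form = -0.25 * math.log(1.0 - 4.0 * n_diff / (3.0 * n_sites))
print(u_hat, closed_form)
```

On a full tree there is no closed form, and each branch length is typically updated in turn with the others held fixed. The search only finds a local optimum, as the slide's phrase "local optimization" suggests.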

Finding Ancestral States

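Let Pr[T_k | a] = probability of best assignment of ancestral states to nodes “below” node k, given a_k = a. For an internal node k with children i and j:

\[
\Pr[T_k \mid a] =
\Bigl( \max_b \Pr[b \mid a, t_i] \Pr[T_i \mid b] \Bigr)
\Bigl( \max_c \Pr[c \mid a, t_j] \Pr[T_j \mid c] \Bigr)
\]

Traceback as before with Sankoff’s algorithm.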
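Replacing the sums with maxima and remembering which child state won gives the ancestral assignment. The sketch below is hedged in the same way as before: Jukes-Cantor, a uniform root distribution, and the nested-tuple tree encoding are all assumptions made for illustration. The traceback is folded into the recursion by carrying the best labeling alongside each score.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor Pr[x | y, t] (assumed substitution model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def best_below(node):
    """Return {a: (score, labeling)} for the subtree rooted at `node`.

    score is the max-product analogue of Pr[T_k | a]. labeling is a nested
    tuple (state, left_labeling, right_labeling); a leaf is its own labeling.
    A leaf is a one-letter string; an internal node is
    ((left_subtree, t_left), (right_subtree, t_right)).
    """
    if isinstance(node, str):
        return {a: (1.0 if a == node else 0.0, node) for a in BASES}
    (left, t_left), (right, t_right) = node
    below_left, below_right = best_below(left), best_below(right)
    table = {}
    for a in BASES:
        # Best child states b and c given the parent state a.
        b = max(BASES, key=lambda s: jc_prob(s, a, t_left) * below_left[s][0])
        c = max(BASES, key=lambda s: jc_prob(s, a, t_right) * below_right[s][0])
        score = (jc_prob(b, a, t_left) * below_left[b][0]
                 * jc_prob(c, a, t_right) * below_right[c][0])
        table[a] = (score, (a, below_left[b][1], below_right[c][1]))
    return table


def best_labeling(root, prior=None):
    """Pick the root state maximizing q_a times the best score below it."""
    if prior is None:
        prior = {a: 0.25 for a in BASES}  # assumed uniform q_a
    table = best_below(root)
    a = max(BASES, key=lambda s: prior[s] * table[s][0])
    return prior[a] * table[a][0], table[a][1]


# Hypothetical tree: two cherries, (A, A) and (C, C), all branches 0.1.
tree = (((("A", 0.1), ("A", 0.1)), 0.1), ((("C", 0.1), ("C", 0.1)), 0.1))
score, labeling = best_labeling(tree)
print(score, labeling)
```

Ties between equally good states are broken by the order of `BASES`, so the reported root state in a symmetric example like this one is arbitrary.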

SLIDE 11

Max. Parsimony vs. Max. Likelihood

  • Set δ_ij = -log P(j | i) in weighted parsimony (Sankoff algorithm).
  • Weighted parsimony produces “maximum probability” assignments, ignoring branch lengths.
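The correspondence rests on the fact that minimizing a sum of -log probabilities is the same as maximizing their product. A tiny sketch for one parent with two children follows. It assumes Jukes-Cantor probabilities at a single fixed branch length `FIXED_T`, which is how branch lengths are "ignored" here; all names are illustrative.

```python
import math

BASES = "ACGT"
FIXED_T = 0.1  # one assumed branch length used for every edge


def jc_prob(x, y, t=FIXED_T, alpha=1.0):
    """Jukes-Cantor Pr[x | y, t] (assumed substitution model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def cost(i, j):
    """Weighted-parsimony cost delta_ij = -log P(j | i)."""
    return -math.log(jc_prob(j, i))


def best_parent_by_cost(x1, x2):
    """Parent state minimizing the summed substitution cost to both children."""
    return min(BASES, key=lambda a: cost(a, x1) + cost(a, x2))


def best_parent_by_probability(x1, x2):
    """Parent state maximizing the product of transition probabilities."""
    return max(BASES, key=lambda a: jc_prob(x1, a) * jc_prob(x2, a))


summed_cost = cost("A", "G") + cost("A", "T")
product_of_probs = jc_prob("G", "A") * jc_prob("T", "A")
print(math.exp(-summed_cost), product_of_probs)
print(best_parent_by_cost("G", "G"), best_parent_by_probability("G", "G"))
```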

Searching over Tree Space

  • How to find T with maximum likelihood?
  • How to find T with maximum parsimony?