

SLIDE 1

1/27/09 1

CSCI1950-Z Computational Methods for Biology, Lecture 2

Ben Raphael, January 26, 2009

http://cs.brown.edu/courses/csci1950-z/

Outline

  • Review of trees. Counting features.
  • Character-based phylogeny
    – Maximum parsimony
    – Maximum likelihood

SLIDE 2

Tree Definitions

graph: A set V of vertices (nodes) and a set E of edges, where each edge (vi, vj) connects a pair of vertices.

tree: A connected acyclic graph G = (V, E).

A path in G is a sequence (v1, v2, …, vn) of vertices in V such that each (vi, vi+1) is an edge in E. A graph is connected provided that for every pair vi, vj of vertices there is a path between vi and vj. A cycle is a path with the same starting and ending vertices. A graph is acyclic provided it has no cycles.

Tree Definitions

The degree of a vertex v is the number of edges incident to v. A phylogenetic tree is a tree with a label for each leaf (vertex of degree one). A binary phylogenetic tree is a phylogenetic tree in which every interior (non-leaf) vertex has degree 3 (one parent and two children). A rooted (binary) phylogenetic tree is a phylogenetic tree with a single designated vertex r (of degree 2 in the binary case). w is a parent (ancestor) of v provided (v, w) is on the path from v to the root; in this case v is a child (descendant) of w.

SLIDE 3

Tree Definitions

tree: A connected acyclic graph G = (V, E). The degree of a vertex v is the number of edges incident to v. A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).

  • Leaves represent existing species.
  • Other vertices represent most recent common ancestors.
  • Lengths of branches represent evolutionary time.
  • The root (if present) represents the oldest evolutionary ancestor.

Counting and Trees

  • A tree with n vertices has n-1 edges. (Proof?)
  • A rooted binary phylogenetic tree with n leaves has n-1 internal vertices, and thus 2n-1 total vertices.
  • How many rooted binary phylogenetic trees with n leaves?
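The closed-form answer to the last question is a standard fact (not derived on this slide): there are (2n-3)!! = 1 · 3 · 5 ··· (2n-3) rooted binary phylogenetic trees on n labeled leaves, since the k-th leaf can be attached at any of the 2k-3 positions of a (k-1)-leaf tree (its 2k-4 edges plus the position above the root). A quick sketch:

```python
def num_rooted_binary_trees(n):
    """(2n-3)!! = 1 * 3 * 5 * ... * (2n-3) rooted binary phylogenetic
    trees on n >= 2 labeled leaves."""
    count = 1
    for k in range(3, n + 1):      # attach leaves 3..n one at a time
        count *= 2 * k - 3         # 2k-3 possible attachment points
    return count
```

The count grows super-exponentially, which is why the "over all possible trees" problem later in the lecture is hard.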

SLIDE 4

Character-based Phylogenetic Tree Reconstruction

  1. What is character data?
  2. What is the criterion for evaluating a tree?
  3. How do we optimize this criterion:
     1. over all possible trees?
     2. over a restricted class of trees?

[Diagram: Input (characters: molecular, morphological) → Algorithm → Output (optimal phylogenetic tree)]

Character-Based Tree Reconstruction

  • Characters may be nucleotides of DNA (A, G, C, T) or amino acids (20-letter alphabet).
  • Values are called states of the character.
  • Characters may be morphological features: # of eyes or legs, or the shape of a beak or a fin.

Gorilla:     CCTGTGACGTAACAAACGA
Chimpanzee:  CCTGTGACGTAGCAAACGA
Human:       CCTGTGACGTAGCAAACGA

[Slide annotations mark one column as a 2-state character and another as a non-informative character]

SLIDE 5

Character-Based Tree Reconstruction

GOAL: determine what character strings at internal nodes would best explain the character strings for the n observed species

An Example

Character   Value 1   Value 2
Mouth       Smile     Frown
Eyebrows    Normal    Pointed

SLIDE 6

Character-Based Tree Reconstruction

Which tree is better?

Character-Based Tree Reconstruction

Count the changes on the tree

SLIDE 7

Character-Based Tree Reconstruction

Maximum Parsimony: minimize the number of changes on the edges of the tree

Maximum Parsimony

  • Ockham's razor: the "simplest" explanation for the data
  • Assumes that observed character differences resulted from the fewest possible mutations
  • Seeks the tree with the lowest possible parsimony score, defined as the sum of the costs of all mutations found in the tree

SLIDE 8

Character Matrix

Given n species, each labeled by m characters. Each character has k possible states. This gives an n x m character matrix. Assume that the characters in a character string are independent.

Gorilla:     CCTGTGACGTAACAAACGA
Chimpanzee:  CCTGTGACGTAGCAAACGA
Human:       CCTGTGACGTAGCAAACGA

Parsimony Score

Assume that the characters in a character string are independent. Given character strings S = s1…sm and T = t1…tm:

#changes(S → T) = Σi dH(si, ti)

where dH is the Hamming distance:
dH(v, w) = 0 if v = w, 1 otherwise.

The parsimony score of a tree is the sum of the lengths (weights) of its edges.

Gorilla:     CCTGTGACGTAACAAACGA
Chimpanzee:  CCTGTGACGTAGCAAACGA
Human:       CCTGTGACGTAGCAAACGA
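Since characters are independent, the number of changes between two aligned strings is just a coordinate-wise Hamming distance; a minimal sketch using the sequences above:

```python
def hamming(s, t):
    """Number of positions at which aligned strings s and t differ."""
    assert len(s) == len(t), "strings must be aligned to equal length"
    return sum(1 for a, b in zip(s, t) if a != b)

gorilla    = "CCTGTGACGTAACAAACGA"
chimpanzee = "CCTGTGACGTAGCAAACGA"
human      = "CCTGTGACGTAGCAAACGA"
```

Here gorilla differs from human at exactly one position (the A/G column), while chimpanzee and human are identical.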

SLIDE 9

Parsimony and Tree Reconstruction

Maximum Parsimony

Two computational sub-problems:

  1. Find the parsimony score for a fixed tree.
     – Small Parsimony Problem (easy)
  2. Find the lowest parsimony score over all trees with n leaves.
     – Large Parsimony Problem (hard)

SLIDE 10

Small Parsimony Problem

Input: Tree T with each leaf labeled by an m-character string.
Output: Labeling of the internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we may assume every leaf is labeled by a single character.

Small Parsimony Problem

Input: T: tree with each leaf labeled by an m-character string.
Output: Labeling of the internal vertices of the tree T minimizing the parsimony score.

Large Parsimony Problem

Input: M: an n x m character matrix.
Output: A tree T with:
  • n leaves labeled by the n rows of matrix M
  • a labeling of the internal vertices of T minimizing the parsimony score over all possible trees and all possible labelings of internal vertices

SLIDE 11

Small Parsimony Problem

Input: Binary tree T with each leaf labeled by an m-character string.
Output: Labeling of the internal vertices of the tree T minimizing the parsimony score.

Since characters are independent, we may assume every leaf is labeled by a single character.

Weighted Small Parsimony Problem

A more general version of the Small Parsimony Problem:

  • The input includes a k x k scoring matrix δ describing the cost of transforming each of the k states into any other state.
  • The Small Parsimony Problem is the special case:
    δij = 0 if i = j, 1 otherwise.

SLIDE 12

Scoring Matrices

Small Parsimony Problem:

    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0

Weighted Small Parsimony Problem:

    A  T  G  C
A   0  3  4  9
T   3  0  2  4
G   4  2  0  4
C   9  4  4  0

Unweighted vs. Weighted

Small Parsimony Scoring Matrix:

    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0

Small Parsimony Score: 5

SLIDE 13

Unweighted vs. Weighted

Weighted Parsimony Scoring Matrix:

    A  T  G  C
A   0  3  4  9
T   3  0  2  4
G   4  2  0  4
C   9  4  4  0

Weighted Parsimony Score: 22

Weighted Small Parsimony Problem

Input: T: tree with each leaf labeled by an m-character string from a k-letter alphabet; δ: a k x k scoring matrix.
Output: Labeling of the internal vertices of the tree T minimizing the weighted parsimony score.

SLIDE 14

Sankoff Algorithm

Calculate and keep track of a score for every possible label at each vertex:

st(v) = minimum parsimony score of the subtree rooted at vertex v, if v has character t

Sankoff Algorithm

st(v) = minimum parsimony score of the subtree rooted at vertex v, if v has character t

The score st(v) is based only on the scores of v's children:

st(parent) = mini { si(left child) + δi,t } + minj { sj(right child) + δj,t }

SLIDE 15

Sankoff Algorithm (cont.)

  • Begin at the leaves:
    – If the leaf has the character in question, the score is 0
    – Else, the score is ∞

Sankoff Algorithm (cont.)

st(v) = mini { si(u) + δi,t } + minj { sj(w) + δj,t }

sA(v) = mini { si(u) + δi,A } + minj { sj(w) + δj,A }

[Worked example on slide: a table of si(u), δi,A, and their sum for each state i of the left child u]

SLIDE 16

Sankoff Algorithm (cont.)

st(v) = mini { si(u) + δi,t } + minj { sj(w) + δj,t }

sA(v) = mini { si(u) + δi,A } + minj { sj(w) + δj,A }

[Worked example continues: the corresponding score table for the right child w]

Sankoff Algorithm (cont.)

st(v) = mini { si(u) + δi,t } + minj { sj(w) + δj,t }

Repeat for T, G, and C

SLIDE 17

Sankoff Algorithm (cont.)

Repeat for right subtree

Sankoff Algorithm (cont.)

Repeat for root

SLIDE 18

Sankoff Algorithm (cont.)

The smallest score at the root is the minimum weighted parsimony score.

In this case it is 9, so label the root with T.

Sankoff Algorithm: Traveling down the Tree

  • The scores at the root vertex have been computed by going up the tree.
  • After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character.
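The bottom-up pass can be sketched compactly, with a hypothetical tree given as nested pairs and the weighted scoring matrix from these slides (this example tree is illustrative, not the one drawn on the slides):

```python
import math

STATES = "ATGC"
# Weighted scoring matrix from the slides (rows/columns in A, T, G, C order)
COSTS = [[0, 3, 4, 9],
         [3, 0, 2, 4],
         [4, 2, 0, 4],
         [9, 4, 4, 0]]
DELTA = {(a, b): COSTS[i][j]
         for i, a in enumerate(STATES) for j, b in enumerate(STATES)}

def sankoff(tree):
    """Bottom-up pass: dict t -> s_t(v), the minimum parsimony score of
    the subtree rooted at v given that v is assigned character t."""
    if isinstance(tree, str):          # leaf: 0 for its own character, inf otherwise
        return {t: (0 if t == tree else math.inf) for t in STATES}
    sl, sr = sankoff(tree[0]), sankoff(tree[1])
    return {t: min(sl[i] + DELTA[i, t] for i in STATES)
             + min(sr[j] + DELTA[j, t] for j in STATES)
            for t in STATES}

# Illustrative tree ((A,T),(G,C)); the minimum over the root's scores
# is the weighted parsimony score of the tree.
root = sankoff((("A", "T"), ("G", "C")))
```

A traceback that records the minimizing states i and j at each vertex then recovers the optimal internal labels, as described above.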

SLIDE 19

Sankoff Algorithm (cont.)

9 is derived from 7 + 2, so the left child is labeled T and the right child is labeled T.

Sankoff Algorithm (cont.)

And the tree is thus labeled…

SLIDE 20

Analysis of Sankoff's Algorithm

A dynamic programming algorithm:

Optimal substructure: the solution is obtained by solving smaller problems of the same type:
st(parent) = mini { si(left child) + δi,t } + minj { sj(right child) + δj,t }

The recurrence terminates at the leaves, where the solution is known.

Analysis of Sankoff's Algorithm

How many computations do we perform for n species, m characters, and k states per character?

Forward step:
  • At each internal node of the tree:
    st(parent) = mini { si(left child) + δi,t } + minj { sj(right child) + δj,t }
  • 2k sums and 2(k-1) comparisons = 4k - 2 operations
  • n-1 internal nodes
  • (4k - 2)(n - 1) operations

Traceback: one "lookup" per internal node: (n-1) operations.

For each character: (4k - 2)(n-1) + (n-1) operations ≤ C n k

  • The above calculation is performed once for each character:
    ≤ C m n k operations
  • O(mnk) time. ["big-O"]
  • Increases linearly with the # of species or # of characters.

SLIDE 21

Analysis of Sankoff's Algorithm

How many computations do we perform for n species, m characters, and k states per character?

Traceback: 2k sums

  • The above calculation is performed once for each character
  • O(mnk) time. ["big-O"]
  • Increases linearly with the # of species or # of characters.

Fitch's Algorithm

  • Solves the Small Parsimony Problem
    – Published 4 years before Sankoff (Fitch, 1971)
  • Makes two passes through the tree:
    – Leaves → root
    – Root → leaves

SLIDE 22

Fitch Algorithm: Step 1

Assign a set S(v) of letters to every vertex v in the tree, traversing the tree from leaves to root.

  • S(l) = observed character for each leaf l
  • For a vertex v with children u and w:
    S(v) = S(u) ∩ S(w) if the intersection is non-empty, S(u) ∪ S(w) otherwise
  • E.g., if a vertex has a left child labeled {A, C} and a right child labeled {A, T}, the intersection {A} is non-empty, so the vertex is labeled {A}; if the children were labeled {G, C} and {A, T}, the empty intersection would give the union {A, C, G, T}.

Fitch's Algorithm: Example

[Tree figure: leaves labeled with single characters (a, c, t); internal vertices carry sets such as {t,a} and {a,c} from the intersection/union rule]

SLIDE 23

Fitch Algorithm: Step 2

Assign a label to each vertex, traversing the tree from root to leaves.

  • Assign the root r a label arbitrarily from its set S(r)
  • For all other vertices v:
    – If its parent's label is in its set S(v), assign it its parent's label
    – Else, choose an arbitrary letter from its set S(v) as its label
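The leaves-to-root pass, together with the mutation count it implies, fits in a few lines; a sketch with trees encoded as nested pairs (a hypothetical encoding, not from the slides):

```python
def fitch(tree):
    """Leaves-to-root pass: return (S, score), where S is the candidate
    set at the root of `tree` and score counts the mutations charged."""
    if isinstance(tree, str):              # leaf labeled by one character
        return {tree}, 0
    (su, cu), (sw, cw) = fitch(tree[0]), fitch(tree[1])
    inter = su & sw
    if inter:                              # non-empty intersection: no change charged
        return inter, cu + cw
    return su | sw, cu + cw + 1            # empty intersection: union, one change

# Small illustrative tree: ((a,c),(a,t))
sets, score = fitch((("a", "c"), ("a", "t")))
```

The root-to-leaves pass then fixes one label per vertex, preferring the parent's label whenever it lies in S(v).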

Fitch's Algorithm: Example

[Tree figure: the example tree with a final single-character label assigned to every vertex in the top-down pass]

SLIDE 24

Fitch Algorithm (cont.)

Fitch vs. Sankoff

  • Both have an O(nk) runtime
  • Are they actually different?
  • Let's compare…

SLIDE 25

Fitch

As seen previously: [worked Fitch example figure]

Comparison of Fitch and Sankoff

  • As seen earlier, the scoring matrix for the Fitch algorithm is simply:

        A  T  G  C
    A   0  1  1  1
    T   1  0  1  1
    G   1  1  0  1
    C   1  1  1  0

  • So let's do the same problem using the Sankoff algorithm and this scoring matrix

SLIDE 26

Sankoff vs. Fitch

  • The Sankoff algorithm gives the same set of optimal labels as the Fitch algorithm.
  • For the Sankoff algorithm, character t is optimal for vertex v if st(v) = min1≤i≤k si(v).
  • Let Sv = the set of optimal letters for v. Then
    Sv = Su ∩ Sw if Su ∩ Sw ≠ ∅, Su ∪ Sw otherwise.
  • This is also the Fitch recurrence.
  • The two algorithms are identical.
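The claimed equivalence is easy to check numerically: run Sankoff with unit costs and compare its set of optimal root characters with the Fitch set. A sketch on a small hypothetical tree:

```python
import math

STATES = "acgt"

def sankoff(tree):
    """Unit-cost Sankoff bottom-up pass: dict t -> s_t(v)."""
    if isinstance(tree, str):
        return {t: (0 if t == tree else math.inf) for t in STATES}
    sl, sr = sankoff(tree[0]), sankoff(tree[1])
    return {t: min(sl[i] + (i != t) for i in STATES)      # bool adds as 0/1
             + min(sr[j] + (j != t) for j in STATES)
            for t in STATES}

def fitch_sets(tree):
    """Fitch bottom-up pass: candidate set at the root of `tree`."""
    if isinstance(tree, str):
        return {tree}
    su, sw = fitch_sets(tree[0]), fitch_sets(tree[1])
    return (su & sw) or (su | sw)       # intersection if non-empty, else union

tree = (("a", "c"), ("a", "t"))
scores = sankoff(tree)
optimal = {t for t in STATES if scores[t] == min(scores.values())}
```

On this tree both passes agree: the Sankoff argmin set at the root equals the Fitch candidate set.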

SLIDE 27

A Problem with Parsimony

Parsimony ignores branch lengths on trees.

[Figure: two trees with the same leaf labels (A, A, A, C) and internal A labels, but with the single change placed on branches of different lengths]

Both trees have the same parsimony score, but a mutation is "more likely" on the longer branch.

Probabilistic Model

Given a tree T with leaves labeled by present-day characters, what is the probability of a labeling of the ancestral nodes?

Assume:
  1. Characters evolve independently.
  2. Constant rate of mutation on each branch.
  3. The state of a vertex depends only on its parent and the branch length:
     i.e., Pr[x | y, t] depends only on y and t. (Markov process)

Pr[x | y, t] = probability that y mutates to x in time t

SLIDE 28

Probabilistic Model

Two species:

Pr[x1, x2, a | T, t1, t2] = qa Pr[x1 | a, t1] Pr[x2 | a, t2]

where:
  T = tree topology
  x1, x2 : characters for each species
  a : character for the ancestor
  qa = Pr[ancestor has character a]
  Pr[x | y, t] = probability that y mutates to x in time t

Probabilistic Model

n species: x1, x2, …, xn. Let α(i) = the ancestor (parent) of node i. Let an+1, an+2, …, a2n-1 be the characters on the internal nodes, where nodes are numbered from the internal vertices up to the root.

Pr[x1, …, xn | T, t1, …, t2n-2] =
    Σ_{an+1,…,a2n-1} q_{a2n-1} Π_{i=n+1}^{2n-2} Pr[ai | aα(i), ti] Π_{i=1}^{n} Pr[xi | aα(i), ti]

This follows from the Law of Total Probability: P(X) = Σi P(X | Yi) P(Yi).

SLIDE 29

Felsenstein's Algorithm

Let Pr[Tk | a] = probability of the leaf nodes "below" node k, given ak = a. Compute via dynamic programming:

Pr[Tk | a] = ( Σ_b Pr[b | a, ti] Pr[Ti | b] ) · ( Σ_c Pr[c | a, tj] Pr[Tj | c] )

where i and j are the children of node k.

Initial conditions: for k = 1, …, n (leaf nodes):
Pr[Tk | a] = 1 if a = xk, 0 otherwise.

Computing the Likelihood

Let Pr[Tk | a] = probability of the leaf nodes "below" node k, given ak = a.

Pr[x1, …, xn | T, t*] = Σ_a Pr[T2n-1 | a] qa

Note: the root is node 2n-1.
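The pruning recursion is a direct translation of the formulas above. The sketch below assumes a Jukes-Cantor substitution model for Pr[x | y, t] and uniform root frequencies qa = 1/4; the slides only require some Markov process, so both choices are illustrative assumptions:

```python
import math

def jc_prob(x, y, t, mu=1.0):
    """Pr[x | y, t] under the Jukes-Cantor model (an assumed model;
    the slides only require a Markov transition probability)."""
    e = math.exp(-4.0 * mu * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def prune(tree, states="ACGT"):
    """Felsenstein's pruning recursion: dict a -> Pr[T_k | a].
    Internal nodes are ((left, t_left), (right, t_right)); leaves are strings."""
    if isinstance(tree, str):                       # leaf: observed character
        return {a: 1.0 if a == tree else 0.0 for a in states}
    (left, t1), (right, t2) = tree
    pl, pr = prune(left, states), prune(right, states)
    return {a: sum(jc_prob(b, a, t1) * pl[b] for b in states)
             * sum(jc_prob(c, a, t2) * pr[c] for c in states)
            for a in states}

def likelihood(tree, states="ACGT"):
    """Pr[x1, ..., xn | T, t*] = sum_a Pr[T_root | a] q_a, uniform q_a."""
    root = prune(tree, states)
    return sum(0.25 * root[a] for a in states)

lik_same = likelihood((("A", 0.1), ("A", 0.1)))   # two identical leaves
lik_diff = likelihood((("A", 0.1), ("C", 0.1)))   # two differing leaves
```

As expected, with short branches the identical-leaf pattern is much more likely than the differing-leaf pattern.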

SLIDE 30

Maximum Likelihood

Let Pr[Tk | a] = probability of the leaf nodes "below" node k, given ak = a. Traceback as before, as in Sankoff's algorithm.

Pr[Tk | a] = ( max_b Pr[b | a, ti] Pr[Ti | b] ) · ( max_c Pr[c | a, tj] Pr[Tj | c] )

Max. Parsimony vs. Max. Likelihood

  • Set δij = -log P(j | i) in weighted parsimony (Sankoff algorithm)
  • Weighted parsimony then produces "maximum probability" assignments, ignoring branch lengths
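The correspondence rests on the fact that -log turns products into sums: minimizing Σ -log P is the same as maximizing Π P. A tiny check with hypothetical per-edge probabilities:

```python
import math

# Hypothetical substitution probabilities along the edges of some labeling
probs = [0.9, 0.05, 0.9]

# Sankoff costs delta_ij = -log P(j | i): minimizing the summed cost...
costs = [-math.log(p) for p in probs]

# ...is equivalent to maximizing the product of the probabilities,
# since exp(-sum(costs)) recovers that product exactly.
product = math.prod(probs)
```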