CSCI1950-Z Computational Methods for Biology
Lecture 2, Ben Raphael, January 26, 2009
http://cs.brown.edu/courses/csci1950-z/


  1. CSCI1950-Z Computational Methods for Biology
Lecture 2
Ben Raphael, January 26, 2009
http://cs.brown.edu/courses/csci1950-z/

Outline
• Review of trees; counting features.
• Character-based phylogeny
  – Maximum parsimony
  – Maximum likelihood

  2. Tree Definitions
tree: a connected acyclic graph G = (V, E).
graph: a set V of vertices (nodes) and a set E of edges, where each edge (v_i, v_j) connects a pair of vertices.
A path in G is a sequence (v_1, v_2, …, v_n) of vertices in V such that each (v_i, v_{i+1}) is an edge in E.
A graph is connected provided that for every pair of vertices v_i, v_j there is a path between v_i and v_j.
A cycle is a path with the same starting and ending vertex. A graph is acyclic provided it has no cycles.

The degree of a vertex v is the number of edges incident to v.
A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).
A binary phylogenetic tree is a phylogenetic tree in which every interior (non-leaf) vertex has degree 3 (one parent and two children).
A rooted (*binary) phylogenetic tree is a phylogenetic tree with a single designated root vertex r (*of degree 2).
w is a parent (ancestor) of v provided (v, w) is on the path from v to the root; in this case v is a child (descendant) of w.
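The definitions above are easy to exercise in code. A minimal sketch in Python, using a made-up edge list (the vertex names r, a, b, x, y are illustrative, not from the slides):

```python
from collections import Counter

# Hypothetical edge list for a small rooted tree: r is the root,
# a is an internal vertex, and b, x, y are leaves.
EDGES = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "y")]

def degrees(edges):
    """Number of edges incident to each vertex."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def leaves(edges):
    """Leaves are exactly the vertices of degree one."""
    return sorted(v for v, d in degrees(edges).items() if d == 1)
```

Note that this 5-vertex tree has 4 edges, consistent with the n-1 edge count on the next slide.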

  3. Tree Definitions (cont.)
tree: a connected acyclic graph G = (V, E).
The degree of a vertex v is the number of edges incident to v.
A phylogenetic tree is a tree with a label for each leaf (vertex of degree one).
• Leaves represent existing species.
• Other vertices represent most recent common ancestors.
• Branch lengths represent evolutionary time.
• The root (if present) represents the oldest evolutionary ancestor.

Counting and Trees
• A tree with n vertices has n-1 edges. (Proof?)
• A rooted binary phylogenetic tree with n leaves has n-1 internal vertices, and thus 2n-1 total vertices.
• How many rooted binary phylogenetic trees with n leaves?
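The counting question posed above has a standard closed-form answer, though the slide leaves it open: there are (2n-3)!! = 1 * 3 * 5 * ... * (2n-3) rooted binary phylogenetic trees on n labeled leaves. A quick sketch to compute it:

```python
def num_rooted_binary_trees(n):
    """Number of rooted binary phylogenetic trees on n labeled leaves:
    the double factorial (2n-3)!! = 1 * 3 * 5 * ... * (2n-3)."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count
```

The count grows super-exponentially, which is why the Large Parsimony Problem later in the lecture is hard.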

  4. Character-Based Phylogenetic Tree Reconstruction
Input: characters (molecular or morphological)  →  Algorithm  →  Output: optimal phylogenetic tree

1. What is character data?
2. What is the criterion for evaluating a tree?
3. How do we optimize this criterion:
   1. over all possible trees?
   2. over a restricted class of trees?

Character-Based Tree Reconstruction
• Characters may be morphological features: the number of eyes or legs, or the shape of a beak or a fin.
• Characters may be nucleotides of DNA (A, G, C, T) or amino acids (a 20-letter alphabet).
• The values a character can take are called the states of the character.

2-state character:
Gorilla:     CCTGTGACGTAACAAACGA
Chimpanzee:  CCTGTGACGTAGCAAACGA
Human:       CCTGTGACGTAGCAAACGA
(The A/G site is a non-informative character.)
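A common convention (not spelled out on the slide) is that a site is parsimony-informative when at least two of its states each occur in at least two species; the A/G site above fails this test because A occurs only once, so it cannot favor one tree topology over another. A small check, assuming that convention:

```python
from collections import Counter

def is_parsimony_informative(column):
    """A site is parsimony-informative when at least two states
    each appear in at least two species (a standard convention,
    assumed here rather than taken from the slide)."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2
```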

  5. Character-Based Tree Reconstruction
GOAL: determine which character strings at the internal nodes would best explain the character strings observed for the n species.

An Example
Character   Value 1   Value 2
Mouth       Smile     Frown
Eyebrows    Normal    Pointed

  6. Character-Based Tree Reconstruction
Which tree is better? Count the changes on each tree.

  7. Character-Based Tree Reconstruction
Maximum Parsimony: minimize the number of changes on the edges of the tree.
• Ockham's razor: the "simplest" explanation for the data.
• Assumes that the observed character differences resulted from the fewest possible mutations.
• Seeks the tree with the lowest possible parsimony score, defined as the sum of the costs of all mutations found in the tree.

  8. Character Matrix
Given n species, each labeled by m characters. Each character has k possible states.
Gorilla:     CCTGTGACGTAACAAACGA
Chimpanzee:  CCTGTGACGTAGCAAACGA
Human:       CCTGTGACGTAGCAAACGA
This gives an n x m character matrix. Assume that the characters in a character string are independent.

Parsimony Score
Given character strings S = s_1…s_m and T = t_1…t_m:
  #changes(S → T) = Σ_i d_H(s_i, t_i)
where d_H is the Hamming distance:
  d_H(v, w) = 0 if v = w, and 1 otherwise.
The parsimony score of a tree is the sum of the lengths (weights) of all its edges.
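Summing the per-site Hamming distance d_H gives the number of changes between two character strings. A minimal sketch using the three sequences above:

```python
GORILLA    = "CCTGTGACGTAACAAACGA"
CHIMPANZEE = "CCTGTGACGTAGCAAACGA"
HUMAN      = "CCTGTGACGTAGCAAACGA"

def hamming(s, t):
    """Sum of d_H over positions: the number of sites where s and t differ."""
    assert len(s) == len(t), "character strings must have equal length"
    return sum(1 for a, b in zip(s, t) if a != b)
```

The gorilla and chimpanzee strings differ only at the single A/G site, so their distance is 1.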

  9. Parsimony and Tree Reconstruction
Maximum parsimony involves two computational sub-problems:
1. Find the parsimony score for a fixed tree: the Small Parsimony Problem (easy).
2. Find the lowest parsimony score over all trees with n leaves: the Large Parsimony Problem (hard).

  10. Small Parsimony Problem
Input: a tree T with each leaf labeled by an m-character string.
Output: a labeling of the internal vertices of the tree T minimizing the parsimony score.
Since characters are independent, we may treat every leaf as labeled by a single character.

Small Parsimony Problem vs. Large Parsimony Problem
Small Parsimony Problem
  Input: T, a tree with each leaf labeled by an m-character string.
  Output: a labeling of the internal vertices of T minimizing the parsimony score.
Large Parsimony Problem
  Input: M, an n x m character matrix.
  Output: a tree T with n leaves labeled by the n rows of matrix M, together with a labeling of the internal vertices of T minimizing the parsimony score over all possible trees and all possible labelings of internal vertices.

  11. Small Parsimony Problem
Input: a binary tree T with each leaf labeled by an m-character string.
Output: a labeling of the internal vertices of the tree T minimizing the parsimony score.
Since characters are independent, every leaf may be treated as labeled by a single character.

Weighted Small Parsimony Problem
A more general version of the Small Parsimony Problem:
• The input also includes a k x k scoring matrix δ describing the cost of transforming each of the k states into any other state.
• The Small Parsimony Problem is the special case δ_ij = 0 if i = j, and 1 otherwise.
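The special case can be written down directly. A tiny helper (the name unit_delta is my own) builds the 0/1 matrix of the unweighted problem from any state alphabet:

```python
def unit_delta(states):
    """delta_ij = 0 if i == j, and 1 otherwise:
    the unweighted Small Parsimony Problem as a special case."""
    return {i: {j: 0 if i == j else 1 for j in states} for i in states}
```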

  12. Scoring Matrices

Small Parsimony Problem:
    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0

Weighted Small Parsimony Problem:
    A  T  G  C
A   0  3  4  9
T   3  0  2  4
G   4  2  0  4
C   9  4  4  0

Unweighted vs. Weighted
Scoring matrix (unweighted):
    A  T  G  C
A   0  1  1  1
T   1  0  1  1
G   1  1  0  1
C   1  1  1  0
Small parsimony score: 5

  13. Unweighted vs. Weighted (cont.)
Scoring matrix (weighted):
    A  T  G  C
A   0  3  4  9
T   3  0  2  4
G   4  2  0  4
C   9  4  4  0
Weighted parsimony score: 22

Weighted Small Parsimony Problem
Input:
  T: a tree with each leaf labeled by an m-character string over a k-letter alphabet.
  δ: a k x k scoring matrix.
Output: a labeling of the internal vertices of the tree T minimizing the weighted parsimony score.

  14. Sankoff Algorithm
Calculate and keep track of a score for every possible label at each vertex:
  s_t(v) = minimum parsimony score of the subtree rooted at vertex v, given that v has character t.
The score s_t(v) depends only on the scores of v's children:
  s_t(parent) = min_i { s_i(left child) + δ_{i,t} } + min_j { s_j(right child) + δ_{j,t} }
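The recurrence can be sketched directly as a recursion over the tree. The scoring matrix below is the weighted matrix from the slides; the four-leaf tree ((A, C), (T, G)) is my reconstruction of the worked example on the following slides, inferred from the scores shown there:

```python
from math import inf

# Weighted scoring matrix from the slides.
DELTA = {
    'A': {'A': 0, 'T': 3, 'G': 4, 'C': 9},
    'T': {'A': 3, 'T': 0, 'G': 2, 'C': 4},
    'G': {'A': 4, 'T': 2, 'G': 0, 'C': 4},
    'C': {'A': 9, 'T': 4, 'G': 4, 'C': 0},
}
STATES = "ATGC"

def sankoff(tree):
    """Forward pass: return s_t(v) for every state t at the root of `tree`.
    A tree is either a leaf character or a (left, right) pair."""
    if isinstance(tree, str):
        # Leaf: score 0 for its own character, infinity otherwise.
        return {t: 0 if t == tree else inf for t in STATES}
    left, right = sankoff(tree[0]), sankoff(tree[1])
    return {t: min(left[i] + DELTA[i][t] for i in STATES)
             + min(right[j] + DELTA[j][t] for j in STATES)
            for t in STATES}

root_scores = sankoff((('A', 'C'), ('T', 'G')))
```

Taking the minimum of root_scores over all states recovers the weighted parsimony score of 9 from the worked example.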

  15. Sankoff Algorithm (cont.)
• Begin at the leaves:
  – If the leaf has the character in question, its score is 0.
  – Otherwise, its score is ∞.

s_t(v) = min_i { s_i(u) + δ_{i,t} } + min_j { s_j(w) + δ_{j,t} }

Computing s_A(v) from the left child u (a leaf labeled A):
  i   s_i(u)   δ_{i,A}   sum
  A   0        0         0
  T   ∞        3         ∞
  G   ∞        4         ∞
  C   ∞        9         ∞
so min_i { s_i(u) + δ_{i,A} } = 0.

  16. Sankoff Algorithm (cont.)
s_t(v) = min_i { s_i(u) + δ_{i,t} } + min_j { s_j(w) + δ_{j,t} }

Computing s_A(v) from the right child w (a leaf labeled C):
  j   s_j(w)   δ_{j,A}   sum
  A   ∞        0         ∞
  T   ∞        3         ∞
  G   ∞        4         ∞
  C   0        9         9
so s_A(v) = min_i { s_i(u) + δ_{i,A} } + min_j { s_j(w) + δ_{j,A} } = 0 + 9 = 9.

Repeat for T, G, and C.

  17. Sankoff Algorithm (cont.)
Repeat for the right subtree.
Then repeat for the root.

  18. Sankoff Algorithm (cont.)
The smallest score at the root is the minimum weighted parsimony score: in this case 9, so label the root with T.

Sankoff Algorithm: Traveling down the Tree
• The scores at the root vertex were computed by going up the tree.
• After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character.

  19. Sankoff Algorithm (cont.)
The root score of 9 is derived as 7 + 2, so the left child is labeled T and the right child is labeled T.

And the tree is thus labeled…
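Putting the two passes together: a sketch of the full algorithm under the same assumptions as before (the weighted matrix from the slides and my reconstructed four-leaf tree ((A, C), (T, G))). `forward` computes s_t(v) going up; `backward` then assigns each vertex the state that achieves the minimum going down:

```python
from math import inf

# Weighted scoring matrix from the slides.
DELTA = {
    'A': {'A': 0, 'T': 3, 'G': 4, 'C': 9},
    'T': {'A': 3, 'T': 0, 'G': 2, 'C': 4},
    'G': {'A': 4, 'T': 2, 'G': 0, 'C': 4},
    'C': {'A': 9, 'T': 4, 'G': 4, 'C': 0},
}
STATES = "ATGC"

def build(t):
    """A tree is a leaf character or a (left, right) pair."""
    if isinstance(t, str):
        return {'char': t, 'children': []}
    return {'children': [build(c) for c in t]}

def forward(node):
    """Going up: compute s_t(v) for every state t at every vertex."""
    if not node['children']:
        node['s'] = {t: 0 if t == node['char'] else inf for t in STATES}
    else:
        for c in node['children']:
            forward(c)
        l, r = node['children'][0]['s'], node['children'][1]['s']
        node['s'] = {t: min(l[i] + DELTA[i][t] for i in STATES)
                      + min(r[j] + DELTA[j][t] for j in STATES)
                     for t in STATES}

def backward(node, parent_state=None):
    """Going down: the root takes its minimum-score state; every other
    vertex takes the state i minimizing s_i(v) + delta_{i, parent}."""
    if parent_state is None:
        node['label'] = min(STATES, key=lambda t: node['s'][t])
    else:
        node['label'] = min(STATES,
                            key=lambda i: node['s'][i] + DELTA[i][parent_state])
    for c in node['children']:
        backward(c, node['label'])

root = build((('A', 'C'), ('T', 'G')))
forward(root)
backward(root)
```

On this example the root and both internal children come out labeled T, matching the labeling on the slide.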

  20. Analysis of Sankoff's Algorithm
A dynamic programming algorithm.
Optimal substructure: a solution is obtained by solving smaller problems of the same type:
  s_t(parent) = min_i { s_i(left child) + δ_{i,t} } + min_j { s_j(right child) + δ_{j,t} }
The recurrence terminates at the leaves, where the solution is known.

How many computations do we perform for n species, m characters, and k states per character?
Forward step:
• At each internal node of the tree, computing s_t(parent) for one state t takes 2k sums and 2(k-1) comparisons, i.e. 4k-2 operations.
• There are n-1 internal nodes, giving (4k-2)(n-1) operations.
Traceback: one "lookup" per internal node, i.e. n-1 operations.
For each character: (4k-2)(n-1) + (n-1) operations ≤ Cnk.
• The above calculation is performed once for each character: ≤ Cmnk operations in total.
• O(mnk) time ["big-O"].
• The running time increases linearly with the number of species or the number of characters.
