algorithm summary




  1. 2/4/09
     CSCI1950-Z Computational Methods for Biology, Lecture 4
     Ben Raphael, February 2, 2009
     http://cs.brown.edu/courses/csci1950-z/

     Algorithm Summary

         Method                                 Input              Output
         Parsimony (Sankoff's & Fitch's Alg.)   Characters, T      A, B
         Perfect Phylogeny                      Characters         A, B, T
         Probabilistic (Felsenstein)            Characters, T, B   A

     T = tree topology, B = branch lengths, A = ancestral states

  2. Pairwise Compatibility Test (Wilson 1965)

     Binary characters i and j are pairwise compatible if and only if j is homogeneous w.r.t. i_0 or i_1, where i_0 and i_1 are the sets of species with state 0 and state 1 for character i.
     Equivalently: i_1 and j_1 are disjoint, or one contains the other.
     Equivalently: the four combinations (0,0), (0,1), (1,0), (1,1) do not all occur among the rows of columns i and j.

         Species   i   j   k
         A         0   0   0
         B         0   0   0
         C         0   1   1
         D         1   0   1
         E         1   0   0

     Here i_0 = {A, B, C} and i_1 = {D, E}.

     Pairwise Compatibility Theorem (Estabrook et al. 1976)

     A set S of binary characters is mutually compatible if and only if all pairs c and c' of characters in S are pairwise compatible.
     Pairwise compatibility ⇔ mutual compatibility.
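
As a minimal sketch (not part of the lecture), the "not all four combinations" form of the test can be checked directly. The function name `pairwise_compatible` and the list encoding are assumptions made for illustration; the values are the i, j, k columns for species A to E from the table on this slide.

```python
def pairwise_compatible(col_a, col_b):
    """Return True unless all four pairs (0,0), (0,1), (1,0), (1,1) occur."""
    observed = set(zip(col_a, col_b))  # distinct row-wise state pairs
    return len(observed) < 4

# Character columns for species A, B, C, D, E (values from the slide).
i = [0, 0, 0, 1, 1]
j = [0, 0, 1, 0, 0]
k = [0, 0, 1, 1, 0]

for name, (p, q) in {"i,j": (i, j), "i,k": (i, k), "j,k": (j, k)}.items():
    print(name, pairwise_compatible(p, q))
```

Under this encoding, i and k fail the test because species A, C, E, D supply (0,0), (0,1), (1,0), (1,1) respectively, while the other two pairs pass.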

  3. Perfect Phylogeny

     A set of mutually compatible binary characters gives a perfect phylogeny:

     1. Evolutionary model
        – Binary characters {0,1}
        – Each character changes state only once in evolutionary history (no homoplasy!).
     2. Tree in which every mutation is on an edge of the tree.
        – All the species in one sub-tree contain a 0, and all species in the other contain a 1.
        – For simplicity, assume root = (0, 0, 0, 0, 0)

                   traits
         species   1   2   3   4   5
         A         1   1   0   0   0
         B         0   0   1   0   0
         C         1   1   0   1   0
         D         0   0   1   0   1
         E         1   0   0   0   0

     Last time: algorithm to reconstruct a tree.

     Trees and Splits

     • Given a set X, a split is a partition of X into two non-empty subsets A and B, written X = A | B.
     • For a phylogenetic tree T with leaves L, each edge e defines a split L_e = A | B, where A and B are the leaves in the subtrees obtained by removing e.

     In a perfect phylogeny, the edge where binary character i changes state gives the split i_0 | i_1.
     We will return to splits in a future lecture.
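
Returning to the species-by-traits matrix on this slide, the following sketch checks that its five characters are mutually compatible by testing every pair of columns, which the Pairwise Compatibility Theorem says is sufficient. The names `MATRIX`, `column`, and `mutually_compatible` are illustrative assumptions, not code from the course.

```python
from itertools import combinations

# Rows are species A..E, columns are traits 1..5 (values from the slide).
MATRIX = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 1, 0],
    "D": [0, 0, 1, 0, 1],
    "E": [1, 0, 0, 0, 0],
}

def column(matrix, t):
    """Return the states of trait index t across all species."""
    return [row[t] for row in matrix.values()]

def mutually_compatible(matrix):
    """True if no pair of columns shows all four state combinations."""
    n_traits = len(next(iter(matrix.values())))
    for s, t in combinations(range(n_traits), 2):
        pairs = set(zip(column(matrix, s), column(matrix, t)))
        if len(pairs) == 4:
            return False
    return True

print(mutually_compatible(MATRIX))
```

Adding a hypothetical sixth species with states (0, 1, 0, 0, 0) would break compatibility, since traits 1 and 2 would then show all four combinations.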

  4. Splits Equivalence Theorem

     A phylogenetic tree T defines a collection of splits Σ(T) = { L_e | e is an edge in T }.
     Splits A_1 | B_1 and A_2 | B_2 are pairwise compatible if at least one of A_1 ∩ A_2, A_1 ∩ B_2, B_1 ∩ A_2, and B_1 ∩ B_2 is the empty set.

     Splits Equivalence Theorem: Let Σ be a collection of splits. There is a phylogenetic tree T such that Σ(T) = Σ if and only if the splits in Σ are pairwise compatible.

     The Pairwise Compatibility Theorem (for binary characters) follows from this theorem.

     Outline

     Distance-based methods for phylogenetic tree reconstruction.
     • Review of distances/metrics.
     • Tree distances and additive distances
       – Small and large phylogeny problems.
     • Non-additive distances and clustering
       – UPGMA and ultrametric distances.
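
The compatibility condition for two splits translates almost word for word into set operations. This is a small sketch under the assumption that a split is represented as a pair of Python sets; the function name and the example splits over leaves A to E are made up for illustration.

```python
def splits_compatible(split1, split2):
    """True if at least one of the four side-by-side intersections is empty."""
    a1, b1 = split1
    a2, b2 = split2
    return any(not (p & q) for p in (a1, b1) for q in (a2, b2))

# Hypothetical splits of the leaf set {A, B, C, D, E}.
s1 = ({"A", "B"}, {"C", "D", "E"})
s2 = ({"A", "B", "C"}, {"D", "E"})
s3 = ({"A", "C"}, {"B", "D", "E"})

print(splits_compatible(s1, s2))  # the side {A, B} does not meet {D, E}
print(splits_compatible(s1, s3))  # all four intersections are non-empty
```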

  5. Distances

     A distance on a set X is a function d: X × X → R satisfying:
     • d(x, y) ≥ 0, with equality iff x = y.
     • For all x, y ∈ X, d(x, y) = d(y, x) [symmetry]
     • For all x, y, z ∈ X, d(x, z) ≤ d(x, y) + d(y, z) [triangle inequality]

     Examples:
     • X = real numbers: d(x, y) = |x - y| is a distance.
     • X = strings of equal length over some alphabet: d_H(s, t) = number of positions where s and t differ is called the Hamming distance.

     Distances in Biological Data

     • String distances (e.g. Hamming distance, edit distance) on DNA/protein sequence data
     • Substitution model (Jukes-Cantor, Kimura, etc.): scores for particular changes A → T, C → G, etc.

         Rat:         ACAGTCACGCCCCACACGT
         Mouse:       ACAGTGACGCCACACACGT
         Gorilla:     CCTGTGACGTAACAAACGA
         Chimpanzee:  CCTGTGAGGTAGCAAACGA
         Human:       CCTGTGAGGTAGCACACGA
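
To make the Hamming distance concrete, here is a short sketch applied to some of the sequences listed on this slide. The function name `hamming` and the variable names are assumptions for illustration.

```python
def hamming(s, t):
    """Number of positions at which equal-length strings s and t differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s, t))

# Sequences as listed on the slide.
rat = "ACAGTCACGCCCCACACGT"
mouse = "ACAGTGACGCCACACACGT"
chimp = "CCTGTGAGGTAGCAAACGA"
human = "CCTGTGAGGTAGCACACGA"

print(hamming(rat, mouse))    # 2: the sequences differ at positions 6 and 12
print(hamming(chimp, human))  # 1: the sequences differ at position 15
```

Symmetry is immediate from the definition, and the triangle inequality holds because any position where s and u differ must also differ between s and t or between t and u.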

  6. Distance Matrix

     • For n species, form the n x n distance matrix D_ij
     • Example: D_ij = edit distance between a gene in species i and species j.

         Mouse:       ACAGTGACGCCACACACGT
         Gorilla:     CCTGCGACGTAACAAACGC
         Chimpanzee:  CCTGCCAGTTAGCAAACGC
         Human:       CCTGCCAGTTAGCACACGA

                      Mouse   Gorilla   Chimpanzee   Human
         Mouse          0        7         11         10
         Gorilla        7        0          4          6
         Chimpanzee    11        4          0          2
         Human         10        6          2          0

     Alignment vs. Distance Matrix

     Sequencing a gene of length m in n species gives an n x m alignment matrix (the four sequences above), which can be transformed into an n x n distance matrix (the matrix above).
     The reverse transformation is not possible due to loss of information.
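
The transformation from an n x m alignment to an n x n distance matrix can be sketched in a few lines. The slide describes the entries as edit distances; the sketch below substitutes the simpler count of mismatched positions (Hamming distance), which for these four equal-length sequences gives the same matrix as shown on the slide. The names `SEQS`, `hamming`, and `distance_matrix` are illustrative assumptions.

```python
# Aligned sequences from the slide, in row order of the matrix.
SEQS = {
    "Mouse":      "ACAGTGACGCCACACACGT",
    "Gorilla":    "CCTGCGACGTAACAAACGC",
    "Chimpanzee": "CCTGCCAGTTAGCAAACGC",
    "Human":      "CCTGCCAGTTAGCACACGA",
}

def hamming(s, t):
    """Count mismatched positions (a stand-in for edit distance here)."""
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Return the n x n matrix of pairwise distances, in dict order."""
    names = list(seqs)
    return [[hamming(seqs[a], seqs[b]) for b in names] for a in names]

for name, row in zip(SEQS, distance_matrix(SEQS)):
    print(f"{name:<11}", row)
```

The matrix keeps only 6 independent numbers out of 4 x 19 characters, which is why the alignment cannot be recovered from it.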

  7. Distances in Trees

     Given a tree T with a positive weight w(e) on each edge, we define the tree distance d_T on the set L of leaves by:
     d_T(i, j) = sum of weights of edges on the unique path from i to j.
     In evolutionary biology, weights are sometimes called branch lengths.

     Distance in Trees: an Example

     [Figure: weighted tree with leaves i and j marked]
     d_T(1,4) = 12 + 13 + 14 + 17 + 13 = 69
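
The tree distance can be computed by walking the unique path between two leaves. The figure for this slide is not recoverable here, so the tree below is hypothetical: only the five weights 12, 13, 14, 17, 13 on the path from leaf 1 to leaf 4 come from the slide, while the internal vertex names a to d, the extra leaves, and their weights are invented for illustration.

```python
from collections import defaultdict

def build_tree(edges):
    """Adjacency map {u: {v: weight}} from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj

def tree_distance(adj, src, dst):
    """Sum of edge weights on the unique src-dst path (iterative DFS)."""
    stack = [(src, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for nbr, w in adj[node].items():
            if nbr != parent:  # never walk back along the edge we came from
                stack.append((nbr, node, dist + w))
    raise ValueError("no path between the given vertices")

# Hypothetical tree: path 1-a-b-c-d-4 carries the slide's weights.
EDGES = [
    (1, "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", 4, 13),
    ("a", 2, 5), ("b", 3, 6), ("c", 5, 7), ("d", 6, 8),  # invented side leaves
]
tree = build_tree(EDGES)
print(tree_distance(tree, 1, 4))
```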

  8. Distance vs. Tree Distance

     • n x n distance matrix for n species
     • Note that d_T(i, j), the tree distance between i and j, is not necessarily equal to D_ij as given by the distance matrix.

         Rat:         ACAGTGACGCCCCAAACGT
         Mouse:       ACAGTGACGCTACAAACGT
         Gorilla:     CCTGTGACGTAACAAACGA
         Chimpanzee:  CCTGTGACGTAGCAAACGA
         Human:       CCTGTGACGTAGCAAACGA

     Fitting a Distance Matrix

     • Given n species, we can compute the n x n distance matrix D_ij
     • Evolution of these species is described by a tree that we don't know.
     • We need an algorithm to construct a tree that best fits the distance matrix D_ij

     Find a tree T such that:
         D_ij = d_T(i, j)
     where D_ij is the distance between species (known) and d_T(i, j) is the length of the path in an (unknown) tree T.

  9. Distance Based Phylogeny Problem

     Goal: Reconstruct an evolutionary tree from a distance matrix
     Input: n x n distance matrix D_ij
     Output: weighted tree T with n leaves fitting D

     Unknown topology of the tree makes evolutionary tree reconstruction hard!
     # unrooted binary trees with n leaves: T(n) = (2n-5)! / ((n-3)! 2^(n-3))
     n = 24: T(n) ≈ 5.64 x 10^26

     If D is additive, this problem has a solution and there is a simple algorithm to solve it.

     Distance-based vs. character-based

     Key difference: Distance-based methods do not reconstruct ancestral states.

              A   B   C   D
         A    0   1   2   2
         B    1   0   1   1
         C    2   1   0   0
         D    2   1   0   0

     Note that C and D are identical.
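
A quick way to get a feel for how fast the number of topologies grows is to compute it directly. The sketch below uses the standard double-factorial form (2n-5)!! = 3 x 5 x ... x (2n-5) for the number of unrooted binary trees on n labeled leaves (n ≥ 3); the function name is made up for illustration.

```python
def num_unrooted_binary_trees(n):
    """Count unrooted binary trees on n labeled leaves as (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count

for n in (3, 4, 5, 10, 24):
    print(n, num_unrooted_binary_trees(n))
```

Already at n = 10 there are over two million topologies, and by n = 24 the count exceeds 10^26, so exhaustive search over topologies is not an option.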

  10. Reconstructing a 3 Leaved Tree

      • Tree reconstruction for a 3x3 matrix is straightforward
      • We have 3 leaves i, j, k and a center vertex c

      Observe:
          d_ic + d_jc = D_ij
          d_ic + d_kc = D_ik
          d_jc + d_kc = D_jk

      Reconstructing a 3 Leaved Tree (cont'd)

      Adding the first two equations and substituting the third:
            d_ic + d_jc = D_ij
          + d_ic + d_kc = D_ik
          ------------------------------------
            2 d_ic + d_jc + d_kc = D_ij + D_ik
            2 d_ic + D_jk = D_ij + D_ik
            d_ic = (D_ij + D_ik - D_jk)/2

      Similarly,
          d_jc = (D_ij + D_jk - D_ik)/2
          d_kc = (D_ki + D_kj - D_ij)/2
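
The three closed-form expressions can be wrapped in a small function. This is only a sketch; the function name and the sample distances (10, 17, 15) are chosen for illustration.

```python
def three_leaf_tree(D_ij, D_ik, D_jk):
    """Edge lengths (d_ic, d_jc, d_kc) of the star tree on leaves i, j, k."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc

d_ic, d_jc, d_kc = three_leaf_tree(D_ij=10, D_ik=17, D_jk=15)
print(d_ic, d_jc, d_kc)
# Sanity check: the edge lengths add back up to the input distances.
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

Note that the result is only a valid tree when all three lengths are non-negative, which is exactly the triangle inequality on the inputs.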

  11. Trees with > 3 Leaves

      • An unrooted binary tree with n leaves has 2n-3 edges
      • Fitting a given tree to a distance matrix D requires solving a system with n(n-1)/2 equations and 2n-3 variables
      • Solution not always possible for n > 3.

      Additive Distance Matrices

      Matrix D is ADDITIVE if there exists a tree T with d_T(i, j) = D_ij for all i, j, and NON-ADDITIVE otherwise.

  12. Additive Distance Phylogeny

      Small Additive Distance Phylogeny: Given phylogenetic tree T and distance matrix D, determine branch lengths such that d_T(i, j) = D_ij.
      Large Additive Distance Phylogeny: Given distance matrix D, find T and branch lengths such that d_T(i, j) = D_ij.
      Both of these problems can be solved efficiently.

      Reconstructing Additive Distances Given T

      [Figure: tree T with leaves v, w, x, y, z and edge lengths 5, 4, 3, 3, 4, 7, 6]

      D:
               v    w    x    y    z
          v    0   10   17   16   16
          w         0   15   14   14
          x              0    9   15
          y                   0   14
          z                        0

      If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.

  13. Reconstructing Additive Distances Given T

      Find neighbors v, w (common parent a) in D:

               v    w    x    y    z
          v    0   10   17   16   16
          w         0   15   14   14
          x              0    9   15
          y                   0   14
          z                        0

      Replace v and w by their parent a, using:
          d_ax = ½ (d_vx + d_wx - d_vw)
          d_ay = ½ (d_vy + d_wy - d_vw)
          d_az = ½ (d_vz + d_wz - d_vw)

      D_1:
               a    x    y    z
          a    0   11   10   10
          x         0    9   15
          y              0   14
          z                   0
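
The step from D to D_1 can be reproduced mechanically. The sketch below assumes the matrix is stored as a symmetric dict of dicts; the helper names `symmetric` and `collapse` are invented for illustration, and the numbers are the upper triangle of D from this slide.

```python
def symmetric(labels, upper):
    """Build a full symmetric dict-of-dicts from upper-triangle rows."""
    D = {a: {a: 0.0} for a in labels}
    for r, row in enumerate(upper):
        for offset, value in enumerate(row):
            a, b = labels[r], labels[r + 1 + offset]
            D[a][b] = D[b][a] = float(value)
    return D

def collapse(D, i, j, k):
    """Replace neighboring leaves i and j by their common parent k."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[k] = {k: 0.0}
    for m in others:
        # d_km = (d_im + d_jm - d_ij) / 2, as on the slide
        new[k][m] = new[m][k] = (D[i][m] + D[j][m] - D[i][j]) / 2
    return new

D = symmetric("vwxyz", [[10, 17, 16, 16], [15, 14, 14], [9, 15], [14]])
D1 = collapse(D, "v", "w", "a")
print({m: D1["a"][m] for m in "xyz"})
```

The distances among the untouched leaves x, y, z are copied over unchanged; only the row and column for the new vertex a are computed.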

  14. Reconstructing Additive Distances Given T

      Neighbors x, y (common parent b) in D_1:

               a    x    y    z
          a    0   11   10   10
          x         0    9   15
          y              0   14
          z                   0

      D_2:
               a    b    z
          a    0    6   10
          b         0   10
          z              0

      D_3:
               a    c
          a    0    3
          c         0

      d(a, c) = 3
      d(b, c) = d(a, b) - d(a, c) = 3
      d(c, z) = d(a, z) - d(a, c) = 7
      d(b, x) = d(a, x) - d(a, b) = 5
      d(b, y) = d(a, y) - d(a, b) = 4
      d(a, w) = d(z, w) - d(a, z) = 4
      d(a, v) = d(z, v) - d(a, z) = 6

      [Figure: tree T with internal vertices a, b, c and the reconstructed edge lengths]
      Correct!!!

      Trees and Neighbors

      Previous algorithm relied only on finding neighboring leaves:
      1. Find neighboring leaves i and j with parent k
      2. Remove the rows and columns of i and j
      3. Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:
             D_km = (D_im + D_jm - D_ij)/2
      Compress i and j into k, iterate algorithm for rest of tree
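
The three-step loop above can be sketched end to end for the case where the tree is given, so the order in which neighbors are merged is supplied by the caller. Everything named here (`symmetric`, `reconstruct`, the `merges` argument) is a hypothetical helper for illustration; the pendant edge length uses d_ik = (D_ij + D_im - D_jm)/2 for any third leaf m, which is the three-leaf formula applied to i, j, m.

```python
def symmetric(labels, upper):
    """Build a full symmetric dict-of-dicts from upper-triangle rows."""
    D = {a: {a: 0.0} for a in labels}
    for r, row in enumerate(upper):
        for offset, value in enumerate(row):
            a, b = labels[r], labels[r + 1 + offset]
            D[a][b] = D[b][a] = float(value)
    return D

def reconstruct(D, merges):
    """Recover edge lengths, merging neighbors (i, j) into parent k in order."""
    D = {a: dict(row) for a, row in D.items()}  # work on a copy
    edges = {}
    for i, j, k in merges:
        m = next(x for x in D if x not in (i, j))  # any third vertex
        d_ik = (D[i][j] + D[i][m] - D[j][m]) / 2
        edges[(k, i)] = d_ik
        edges[(k, j)] = D[i][j] - d_ik
        others = [x for x in D if x not in (i, j)]
        new = {x: {y: D[x][y] for y in others} for x in others}
        new[k] = {k: 0.0}
        for x in others:
            new[k][x] = new[x][k] = (D[i][x] + D[j][x] - D[i][j]) / 2
        D = new
    p, q = D  # exactly two vertices must remain; they share the last edge
    edges[(p, q)] = D[p][q]
    return edges

D = symmetric("vwxyz", [[10, 17, 16, 16], [15, 14, 14], [9, 15], [14]])
edges = reconstruct(D, [("v", "w", "a"), ("x", "y", "b"), ("b", "z", "c")])
for (u, v), length in edges.items():
    print(u, v, length)
```

With the merge order used on the slides, the sketch recovers the same seven edge lengths. Choosing the merge order without knowing the tree is the hard part, which is what the discussion of neighboring leaves addresses.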

  15. Finding Neighboring Leaves

      To find neighboring leaves we simply select a pair of closest leaves. WRONG!

               i    j    k    l
          i    0   13   21   22
          j         0   12   13
          k              0   13
          l                   0

      i and j are neighbors, but (d_ij = 13) > (d_jk = 12).
      Finding a pair of neighboring leaves is a nontrivial problem!

      Degenerate Triples

      • A degenerate triple is a set of three distinct elements 1 ≤ i, j, k ≤ n where D_ij + D_jk = D_ik
      • Element j in a degenerate triple i, j, k lies on the evolutionary path from i to k (or is attached to this path by an edge of length 0).
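
Both points on this slide are easy to check numerically. The sketch below first shows that the closest pair in the 4 x 4 example is (j, k) rather than the true neighbors (i, j), then searches a matrix for degenerate triples. The helper names and the three-point example `LINE` are assumptions made for illustration; only the i, j, k, l distances come from the slide.

```python
from itertools import combinations, permutations

def dist(D, a, b):
    """Look up a distance stored once per unordered pair."""
    return 0 if a == b else D.get((a, b), D.get((b, a)))

def closest_pair(labels, D):
    """Pair of distinct labels with the smallest distance."""
    return min(combinations(labels, 2), key=lambda p: dist(D, *p))

def degenerate_triples(labels, D):
    """All (i, j, k) with D_ij + D_jk = D_ik, each reported once (i < k)."""
    return [(a, b, c) for a, b, c in permutations(labels, 3)
            if a < c and dist(D, a, b) + dist(D, b, c) == dist(D, a, c)]

# Distances from the slide's 4 x 4 example.
LABELS = ["i", "j", "k", "l"]
D = {("i", "j"): 13, ("i", "k"): 21, ("i", "l"): 22,
     ("j", "k"): 12, ("j", "l"): 13, ("k", "l"): 13}

# Hypothetical example: q sits on the path from p to r.
LINE = {("p", "q"): 2, ("q", "r"): 3, ("p", "r"): 5}

print(closest_pair(LABELS, D))
print(degenerate_triples(LABELS, D))
print(degenerate_triples(["p", "q", "r"], LINE))
```

The slide's matrix contains no degenerate triple, so every leaf hangs off the tree by an edge of positive length.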
