
Algorithms in Bioinformatics: A Practical Introduction. Phylogenetic Tree Comparison and Consensus Trees



  1. Algorithms in Bioinformatics: A Practical Introduction Phylogenetic Tree comparison and Consensus Trees

  2. Phylogenetic Tree comparison

  3. Why tree comparison?
     - Different phylogenies result from using different:
       - kinds of data (different segments of the genomes)
       - kinds of model (CF model, Jukes-Cantor model)
       - kinds of reconstruction algorithm
     - Tree comparison helps us gain information from multiple trees.

  4. Two types of comparisons
     - Similarity measurement: find the common structure among the given trees
       - Maximum Agreement Subtree
     - Dissimilarity measurement: determine the differences among the given trees
       - Robinson-Foulds distance
       - Nearest neighbor interchange
       - Subtree Transfer Distance
       - Quartet Distance

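  5. Restricted subtree
     - Consider a tree $T$ that carries the evolution information of $x_1, x_2, x_3, x_4, x_5$.
     - To obtain the evolution information of $x_1, x_3, x_5$ only, restrict $T$ to those leaves (delete all other leaves), then simplify the result (contract every internal node that is left with a single child).
     - [Figure: $T$ on leaves $x_1, \dots, x_5$; $T$ restricted on $x_1, x_3, x_5$; the simplified restricted subtree.]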
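
The restrict-then-simplify step can be sketched in a few lines of Python. This is a minimal sketch under assumptions that are not in the slides: a rooted tree is written as nested tuples with strings as leaves, and the example tree is hypothetical because the shape of the tree in the original figure is not recoverable.

```python
def restrict(tree, keep):
    """Restrict a rooted tree to the leaves in `keep` and simplify it.

    Assumed representation: a leaf is a string, an internal node is a
    tuple of child subtrees. Returns None if no kept leaf lies below.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    # Restrict every child and drop the ones that became empty.
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        # Simplify: an internal node with a single child is contracted.
        return kids[0]
    return tuple(kids)


# Hypothetical five-leaf tree, restricted on x1, x3, x5.
T = ((("x1", "x2"), "x3"), ("x4", "x5"))
print(restrict(T, {"x1", "x3", "x5"}))
```

With this representation, two trees agree on a leaf set when `restrict` returns the same shape for both, up to the order of children.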

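  6. Agreement subtree
     - Restrict $T$ to $x_1, x_2, x_4, x_5$ and simplify; do the same for $T'$.
     - A tree that can be obtained from both $T$ and $T'$ in this way is an agreement subtree of $T$ and $T'$.
     - [Figure: $T$ and $T'$ on leaves $x_1, \dots, x_5$; both restricted on $x_1, x_2, x_4, x_5$ and simplified, giving an agreement subtree of $T$ and $T'$.]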

  7. Maximum agreement subtree (MAST)
     - Given two trees $T_1$ and $T_2$:
       - An agreement subtree of $T_1$ and $T_2$ is the common information agreed on by both trees.
       - Since both trees agree on it, the evolution described by the agreement subtree is more reliable.
     - Maximum agreement subtree problem:
       - Find the agreement subtree with the largest possible number of leaves.
       - Such an agreement subtree is called the maximum agreement subtree.

  8. MAST for rooted trees
     - The MAST of two degree-$d$ rooted trees $T_1$ and $T_2$ with $n$ leaves can be computed in $O(\sqrt{d}\, n \log(n/d))$ time (Journal of Algorithms 2001).
     - This lecture considers an $O(n^2)$-time algorithm that computes the maximum agreement subtree of two binary trees with $n$ leaves.

  9. Computing MAST by dynamic programming
     - For any two binary rooted trees $T_1$ and $T_2$, let $\mathrm{MAST}(T_1, T_2)$ denote the number of leaves in their maximum agreement subtree.
     - Definition: for a tree $T$ and a node $u$, $T^u$ is the subtree of $T$ rooted at $u$.

  10. Not complete!
      - For any node pair $(u, v) \in T_1 \times T_2$:
        - let $a$ and $b$ be the two children of $u$
        - let $c$ and $d$ be the two children of $v$
      - Let $R$ be the maximum agreement subtree of $T_1$ and $T_2$.
      - We have the following cases:
        - $R$ is an agreement subtree of $T_1^a$
        - $R$ is an agreement subtree of $T_1^b$

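  11. Recurrence

      $$
      \mathrm{MAST}(T_1^u, T_2^v) = \max \left\{
      \begin{array}{l}
      \mathrm{MAST}(T_1^a, T_2^c) + \mathrm{MAST}(T_1^b, T_2^d) \\
      \mathrm{MAST}(T_1^a, T_2^d) + \mathrm{MAST}(T_1^b, T_2^c) \\
      \mathrm{MAST}(T_1^a, T_2^v) \\
      \mathrm{MAST}(T_1^b, T_2^v) \\
      \mathrm{MAST}(T_1^u, T_2^c) \\
      \mathrm{MAST}(T_1^u, T_2^d)
      \end{array}
      \right.
      $$

      [Figure: node $u$ of $T_1$ with children $a$ and $b$; node $v$ of $T_2$ with children $c$ and $d$.]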

  18. Time complexity
      - Suppose $T_1$ and $T_2$ are rooted phylogenies for $n$ species.
      - We have to compute $\mathrm{MAST}(T_1^u, T_2^v)$ for every node $u$ in $T_1$ and every node $v$ in $T_2$.
      - Thus, we need to fill in $O(n^2)$ entries, and each entry can be computed in $O(1)$ time.
      - In total, the time complexity is $O(n^2)$.
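
The recurrence on slide 11 translates almost directly into a memoized function. The following is a minimal Python sketch, not the lecture's implementation. It assumes binary rooted trees written as nested tuples with strings as leaves, and it adds a base case that the slides do not spell out: when one of the two nodes is a leaf, the value is 1 if that leaf also occurs in the other subtree and 0 otherwise. The example trees are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def leaves(tree):
    """Set of leaf labels below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


@lru_cache(maxsize=None)
def mast(u, v):
    """Number of leaves in a maximum agreement subtree of T1^u and T2^v."""
    # Assumed base case: a single leaf agrees with the other subtree
    # exactly when the same label occurs there.
    if isinstance(u, str):
        return 1 if u in leaves(v) else 0
    if isinstance(v, str):
        return 1 if v in leaves(u) else 0
    a, b = u  # children of u in T1
    c, d = v  # children of v in T2
    return max(
        mast(a, c) + mast(b, d),
        mast(a, d) + mast(b, c),
        mast(a, v),
        mast(b, v),
        mast(u, c),
        mast(u, d),
    )


# Hypothetical pair of trees on five species.
T1 = ((("x1", "x2"), "x3"), ("x4", "x5"))
T2 = ((("x1", "x2"), "x4"), ("x3", "x5"))
print(mast(T1, T2))
```

The cache plays the role of the table with one entry per node pair. Hashing nested tuples is not constant time, so this sketch illustrates the recurrence rather than meeting the $O(n^2)$ bound exactly; an implementation aiming for that bound would index nodes by integers.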

  19. MAST for unrooted trees
      - In real life, we normally want to compute the MAST of unrooted trees.
      - For unrooted degree-3 trees $U_1$ and $U_2$, $\mathrm{MAST}(U_1, U_2)$ can be computed in $O(n \log n)$ time (STOC 97).
      - For general unrooted trees $U_1$ and $U_2$, $\mathrm{MAST}(U_1, U_2)$ can be computed in $O(n^{1.5} \log n)$ time (SIAM J. of Comp 2000).
      - This lecture shows the relationship between unrooted MAST and rooted MAST.

  20. Relating rooted and unrooted trees (I)
      - Definition: for an unrooted tree $U$ and any edge $e$ in $U$, $U^e$ is the rooted tree obtained by rooting $U$ at the edge $e$.
      - [Figure: an unrooted tree on leaves $x_1, \dots, x_5$ with an edge $e$, and the same tree rooted at edge $e$.]

  21. Relating rooted and unrooted trees (II)
      - Consider two unrooted trees $U_1$ and $U_2$.
      - Lemma: for any edge $e$ of $U_1$,

        $$\mathrm{MAST}(U_1, U_2) = \max\{\, \mathrm{MAST}(U_1^e, U_2^f) \mid f \text{ is an edge of } U_2 \,\}$$

      - Proof: Exercise!
      - Based on the above lemma, we can relate rooted MAST and unrooted MAST.

  22. Robinson-Foulds distance
      - Given two phylogenies $T_1$ and $T_2$, this method intuitively counts the number of edges on which $T_1$ and $T_2$ do not agree.
      - First, we need some definitions.

  23. Partitioning of a tree
      - Each edge partitions the set of species into two parts.
      - In the following tree, the red edge partitions the species into {a, b, c} and {d, e}.
      - [Figure: an unrooted tree on leaves a, b, c, d, e with one edge marked in red.]

  24. Good and bad edges
      - Consider two unrooted trees $T$ and $T'$. An edge $x$ in $T$ is called a good edge if there exists an edge $x'$ in $T'$ such that both of them form the same partition. In that case, $x'$ is also called a good edge.
      - Otherwise, the edge is called a bad edge.
      - [Figure: trees $T$ and $T'$ on leaves a, b, c, d, e, with an edge $x$ in $T$ and an edge $x'$ in $T'$ marked.]

  25. Leaf edges are always good
      - [Figure: trees $T$ and $T'$ on leaves a, b, c, d, e, with an edge $x$ in $T$ and an edge $x'$ in $T'$ marked.]

  26. Robinson-Foulds (RF) distance
      - Robinson-Foulds distance = (number of bad edges in $T$ w.r.t. $T'$ + number of bad edges in $T'$ w.r.t. $T$) / 2
      - $T$ and $T'$ look similar if RF-dist($T$, $T'$) is small.
      - For example, the Robinson-Foulds distance of the trees $T$ and $T'$ in the figure is (1 + 1)/2 = 1.
      - [Figure: trees $T$ and $T'$ on leaves a, b, c, d, e, with one bad edge marked in each.]

  27. Degree-3 trees $T$ and $T'$
      - When both $T$ and $T'$ are of degree 3, the number of bad edges in $T$ w.r.t. $T'$ equals the number of bad edges in $T'$ w.r.t. $T$.
      - Proof:
        - Since both $T$ and $T'$ are of degree 3, they have the same number of edges.
        - The number of good edges in $T$ w.r.t. $T'$ equals the number of good edges in $T'$ w.r.t. $T$.
        - Subtracting the good edges from the total leaves the same number of bad edges in each tree, so the lemma follows.

  28. How to find the set of good edges in $T$ w.r.t. $T'$?
      - Brute-force algorithm:
        - For every edge $e$ in $T$: if the partition formed by $e$ is the same as the partition formed by some edge $e'$ in $T'$, then $e$ is a good edge.
      - Time analysis:
        - For every edge $e$ in $T$, the check takes $O(n)$ time.
        - In total, the time complexity is $O(n^2)$.
      - Can we do better?
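
The definitions of good edges, bad edges and the RF distance can be checked with a short Python sketch. It is a simple set-based version in the spirit of the brute-force approach, not Day's algorithm. Assumptions not in the slides: an unrooted tree is written as nested tuples rooted at an arbitrary place, each edge is identified by the side of its partition that does not contain a fixed reference leaf, and the two example trees are hypothetical.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree`; append every subtree's leaf set to `out`."""
    if isinstance(tree, str):
        below = frozenset([tree])
    else:
        below = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(below)
    return below


def splits(tree):
    """Partitions induced by the edges of the tree, one frozenset per edge."""
    out = []
    all_leaves = leaf_sets(tree, out)
    ref = min(all_leaves)  # fixed reference leaf used to name each partition
    result = set()
    for below in out:
        # Keep the side of the partition that does not contain `ref`.
        side = all_leaves - below if ref in below else below
        if side:  # the root itself induces no partition
            result.add(side)
    return result


def rf_distance(t1, t2):
    """(bad edges in t1 w.r.t. t2 + bad edges in t2 w.r.t. t1) / 2."""
    s1, s2 = splits(t1), splits(t2)
    return (len(s1 - s2) + len(s2 - s1)) / 2


# Hypothetical trees on species a..e that differ in one internal edge.
T = ((("a", "b"), "c"), ("d", "e"))
T_prime = ((("a", "c"), "b"), ("d", "e"))
good = splits(T) & splits(T_prime)
print(len(good), rf_distance(T, T_prime))
```

Leaf edges always land in the intersection, since each of them separates a single species from all the others in both trees.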

  29. Day's algorithm
      - Yes! The problem can be solved in $O(n)$ time using Day's algorithm.
      - Input: two unrooted phylogenies $T_1$ and $T_2$ for the same set of species
      - Output: the set of good edges in $T_1$ w.r.t. $T_2$
      - Idea: build a data structure that enables constant-time checking of whether a particular partition of the leaves exists in $T_1$.

  30. Step 1
      - Root $T_1$ and $T_2$ at their leaves labeled $n$.
      - This step takes $O(n)$ time.
      - [Figure: $T_1$ and $T_2$, each rooted at leaf $n$.]

  31. Example for step 1
      - [Figure: unrooted trees $T_1$ and $T_2$ on leaves 1 to 5 (top), and the same trees rooted at leaf 5 (bottom).]

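  32. Step 2
      - Relabel the leaves of $T_1$ in increasing order.
      - Note: for every internal node $x$ of $T_1$, the set of leaf labels in the subtree of $x$ forms an interval $[i..j]$.
      - This step takes $O(n)$ time.
      - [Figure: $T_1$ and $T_2$ rooted at leaf $n$; the leaves of $T_1$ are labeled 1 to $n-1$, and the leaves below an internal node $x$ carry the labels $i$ to $j$.]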
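
Step 2 can be sketched as a single left-to-right traversal. This is a minimal Python sketch under assumptions that are not in the slides: the part of $T_1$ hanging below the root leaf $n$ is written as nested tuples with integer leaves, the function name `relabel_with_intervals` is made up for illustration, and the example tree is hypothetical.

```python
def relabel_with_intervals(tree):
    """Relabel leaves 1, 2, ... in left-to-right order.

    Returns (mapping, intervals): `mapping` sends each old leaf label to
    its new label, and `intervals` lists the interval (i, j) of new
    labels below every internal node, in post-order.
    """
    mapping = {}
    intervals = []

    def visit(node):
        if not isinstance(node, tuple):
            # Leaf: it receives the next unused label.
            mapping[node] = len(mapping) + 1
            return mapping[node], mapping[node]
        spans = [visit(child) for child in node]
        # Children are visited left to right, so their label ranges are
        # consecutive and the node covers one contiguous interval.
        lo, hi = spans[0][0], spans[-1][1]
        intervals.append((lo, hi))
        return lo, hi

    visit(tree)
    return mapping, intervals


# Hypothetical subtree below the root leaf 5, with leaf order 3, 1, 2, 4.
mapping, intervals = relabel_with_intervals(((3, (1, 2)), 4))
print(mapping)
print(intervals)
```

Each leaf is visited once and each internal node does work proportional to its number of children, which is in line with the $O(n)$ bound stated for this step.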

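  33. Example for step 2
      - [Figure: $T_1$ and $T_2$ rooted at leaf 5, before (top) and after (bottom) relabeling; after relabeling, the leaves of $T_1$ read 1, 2, 3, 4 and an internal node of $T_1$ is annotated with the interval $[2..3]$.]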
