
On-line Construction of Compact Suffix Vectors and Maximal Repeats



  1. On-line Construction of Compact Suffix Vectors and Maximal Repeats. Élise Prieur and Thierry Lecroq, elise.prieur@univ-rouen.fr, Laboratoire d’Informatique de Traitement de l’Information et des Systèmes. Journées Montoises, August 30th, 2006, Rennes.

  2. Plan. 1 Introduction; 2 Suffix Vectors; 3 Computing maximal repeats; 4 Conclusion.

  3. Outline. 1 Introduction: Motivation, Suffix trees, Ukkonen’s algorithm. 2 Suffix Vectors: Introduction, Compact Suffix Vectors, On-line construction of a compact suffix vector. 3 Computing maximal repeats. 4 Conclusion.

  4. Motivation. Detecting repeats in long biological sequences; this calls for an adapted index structure.

  5. Notations. y is a sequence of length n on the alphabet A. $ is a terminator symbol. Suffix tree: an index structure; all substrings are represented; edges are labeled (begin position, length); leaves represent suffixes. [Figure: suffix tree of tata$, with edges (0,2) ta, (1,1) a, (2,3) ta$ and (4,1) $, and leaves numbered 0 to 4.]
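
As a small illustration of the (begin position, length) edge labels, the suffix tree of tata$ can be written out by hand and checked against the suffixes of y. This is only a sketch: the nested-dictionary layout and the names `tree`, `label` and `suffixes` are assumptions made for the example, not notation from the slides.

```python
y = "tata$"

# Each node maps an edge (begin, length) to its child. A child is either
# another node (dict) or a leaf, stored as the start position of its suffix.
tree = {
    (0, 2): {(2, 3): 0, (4, 1): 2},   # "ta" then "ta$" (leaf 0) or "$" (leaf 2)
    (1, 1): {(2, 3): 1, (4, 1): 3},   # "a"  then "ta$" (leaf 1) or "$" (leaf 3)
    (4, 1): 4,                        # "$"  (leaf 4)
}

def label(begin, length):
    """Decode an edge label into the letters it stands for."""
    return y[begin:begin + length]

def suffixes(node, prefix=""):
    """Yield (leaf number, string spelled from the root) for every leaf."""
    for (begin, length), child in node.items():
        path = prefix + label(begin, length)
        if isinstance(child, dict):
            yield from suffixes(child, path)
        else:
            yield child, path

spelled = dict(suffixes(tree))
print(spelled[2])  # prints: ta$
```

Every leaf number i spells exactly y[i..], which is what "leaves represent suffixes" means on the slide.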

  6. Ukkonen’s algorithm. An on-line algorithm: the construction is split into n phases, each of which is split into extensions. During phase i, the implicit tree of y[0..i] is built from that of y[0..i−1]. During extension j of phase i, the suffix y[j+1..i] is added to the tree. The last added substring is w = y[j+1..i−1].
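
The phase and extension structure can be sketched naively as follows. This is not Ukkonen’s linear-time algorithm: it inserts every suffix letter by letter into an uncompacted trie, and it only shows how the phases and extensions are nested. The names `build_by_phases` and `contains` are hypothetical, and extension j here inserts y[j..i], so the numbering is shifted by one with respect to the slide.

```python
def build_by_phases(y):
    """Naive on-line construction: after phase i the trie holds every substring of y[0..i]."""
    root = {}
    for i in range(len(y)):            # phase i reads the letter y[i]
        for j in range(i + 1):         # extension j inserts the suffix y[j..i]
            node = root
            for letter in y[j:i + 1]:
                node = node.setdefault(letter, {})
    return root

def contains(trie, s):
    """Return True if s can be spelled from the root of the trie."""
    node = trie
    for letter in s:
        if letter not in node:
            return False
        node = node[letter]
    return True

trie = build_by_phases("tata$")
print(contains(trie, "ata"), contains(trie, "tt"))  # prints: True False
```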

  7. The 3 rules. Ukkonen’s algorithm is based on 3 rules expressed by Gusfield [1]. Rule 1, before the extension: the path w = y[j+1..i−1] ends at a leaf. [1] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.

  8. The 3 rules. Rule 1, after the extension: y[i] is appended to the leaf edge, so the path now spells y[j+1..i].

  9. The 3 rules. Rule 2, before the extension: w is followed in the tree by some letters x, none of which is y[i].

  10. The 3 rules. Rule 2, after the extension: a new leaf edge labeled y[i] is created at the end of w, next to x.

  11. The 3 rules. Rule 3: w is already followed by y[i] in the tree (a path y[i]x leaves w), so nothing is done.

  12. Some properties. Leaves are added in increasing order; rule 1 does not need any treatment; phase i begins at extension j_ℓ + 1, where j_ℓ is the number of the last created leaf; phase i ends at the first extension j > j_ℓ such that rule 3 is applied.
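
These properties can be observed on a naive simulation that records which of Gusfield’s rules applies in each extension. The sketch below works on an uncompacted trie rather than on a real suffix tree, and it runs every extension instead of skipping any, so it illustrates the properties without exploiting them. The function name `rules_per_phase` is hypothetical, and extension j handles the suffix starting at position j.

```python
import re

def rules_per_phase(y):
    """For each phase i, list the rule (1, 2 or 3) applied by extensions 0..i."""
    root = {}
    phases = []
    for i in range(len(y)):
        applied = []
        for j in range(i + 1):
            node = root
            for letter in y[j:i]:          # walk w = y[j..i-1], already in the trie
                node = node[letter]
            if j < i and not node:         # w ends at a leaf: extend it
                rule = 1
            elif y[i] in node:             # w is already followed by y[i]
                rule = 3
            else:                          # w is followed by other letters only: new leaf
                rule = 2
            node.setdefault(y[i], {})
            applied.append(rule)
        phases.append(applied)
    return phases

print(rules_per_phase("tata$"))
# prints: [[2], [1, 2], [1, 1, 3], [1, 1, 3, 3], [1, 1, 2, 2, 2]]
```

In every phase the rules appear in the order 1, then 2, then 3. The leading run of rule 1 covers exactly the leaves created so far, which is why a phase can start at extension j_ℓ + 1, and everything after the first rule 3 is again rule 3, which is why the phase can stop there.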

  13. Outline. 1 Introduction: Motivation, Suffix trees, Ukkonen’s algorithm. 2 Suffix Vectors: Introduction, Compact Suffix Vectors, On-line construction of a compact suffix vector. 3 Computing maximal repeats. 4 Conclusion.

  14. Introduction to suffix vectors. [Figure: the suffix tree of tata$ next to the corresponding suffix vector over positions 0 to 4 (t a t a $); root edge list (0,2) − (1,1) − (4,1).]

  15. Introduction to suffix vectors. [Figure: the suffix tree of aatttatttatta$ next to the corresponding suffix vector over positions 0 to 13; root edge list (0,1) − (2,1) − (13,1).]

  16. Introduction to suffix vectors. An alternative data structure to suffix trees: same information in reduced space; introduced by K. Monostori in 2001. [Figure: suffix vector of aatttatttatta$ (positions 0 to 13), root edge list (0,1) − (2,1) − (13,1), with box lines 3 | 2 | (13,1); 2 | 2 | (13,1); 3 | 4 | (12,2); 2 | 4 | (5,1); 7 | 6 | (12,2); 6 | 6 | (12,2); 5 | 6 | (12,2); 4 | 6 | (12,2); 1 | 2 | (5,1); 1 | 13 | (2,2) − (13,1).]

  17. Introduction to suffix vectors. Definition: a succession of boxes whose lines contain the depth of the node, the natural edge, and the edge list. The root is a special box. Notations: B_j is the box at position j in y; the natural edge of a line in B_j is the end position of the edge beginning by y[j+1]. [Figure: suffix vector of aatttatttatta$.]

  18. Introduction to suffix vectors. Example: is tatt a substring of y? The root contains the edge (2,1), beginning with t, leading to B_2. The edge (5,1), beginning with a, leads to B_5. The natural edge begins with tt. [Figure: suffix vector of aatttatttatta$, with the root edge (2,1) and the line 1 | 1 | (5,1) highlighted.]
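
The same query can be replayed on a plain suffix tree whose edges are stored as (begin, length) pairs. The sketch below does not build a suffix vector and does not use Ukkonen’s algorithm: it inserts the suffixes one by one in quadratic time, which is enough to follow the walk t, then a, then tt. The names `build_suffix_tree` and `contains` and the dictionary layout are assumptions made for the example.

```python
def build_suffix_tree(y):
    """Insert every suffix of y, one after the other (quadratic time).

    y is assumed to end with a terminator that occurs nowhere else.
    A node maps a first letter to [begin, length, child]; a child is
    either another node (dict) or a leaf number (int).
    """
    root = {}
    for start in range(len(y)):
        node, pos = root, start
        while True:
            first = y[pos]
            if first not in node:                  # no edge starts with this letter: new leaf
                node[first] = [pos, len(y) - pos, start]
                break
            begin, length, child = node[first]
            k = 1                                  # the first letter already matches
            while k < length and y[begin + k] == y[pos + k]:
                k += 1
            if k < length:                         # mismatch inside the edge: split it
                child = {y[begin + k]: [begin + k, length - k, child]}
                node[first] = [begin, k, child]
            node, pos = child, pos + k
    return root

def contains(root, y, s):
    """Follow s from the root, edge by edge."""
    node, i = root, 0
    while i < len(s):
        if not isinstance(node, dict) or s[i] not in node:
            return False
        begin, length, child = node[s[i]]
        for k in range(length):
            if i == len(s):
                return True
            if y[begin + k] != s[i]:
                return False
            i += 1
        node = child
    return True

y = "aatttatttatta$"
root = build_suffix_tree(y)
print(sorted((b, l) for b, l, _ in root.values()))  # prints: [(0, 1), (2, 1), (13, 1)]
print(contains(root, y, "tatt"))                    # prints: True
```

The root edges found this way are the ones in the root edge list of the slide, (0,1) − (2,1) − (13,1), and the edge starting with a below the node t is (5,1).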

  19. Compacting a vector. Definition: a group of nodes is a set of nodes which are in the same box and have exactly the same edges.

  20. Compact suffix vectors. 3 rules of compaction of a box. Rule A: the node with depth d − 2 has the same edges as the node with depth d − 1. Rule B: the node with depth d − 1 has the same edges as the node with depth d and some extra edges. Rule C: the node with depth d − 3 has different edges from the node with depth d − 2. [Figure: a box with nodes of depths d, d − 1, d − 2 and d − 3; rule B links d and d − 1, rule A links d − 1 and d − 2, rule C links d − 2 and d − 3.]

  21. Compacting V(aatttatttatta$). [Figure: the suffix vector before and after compaction, both with root edge list (0,1) − (2,1) − (13,1). The lines 3 | 2 | (13,1) and 2 | 2 | (13,1) become the single line 3 | 2 | (13,1) followed by the number 2; the lines 7 | 6 | (12,2), 6 | 6 | (12,2), 5 | 6 | (12,2) and 4 | 6 | (12,2) become the single line 7 | 6 | (12,2) followed by the number 4; the lines 3 | 4 | (12,2), 2 | 4 | (5,1), 1 | 2 | (5,1) and 1 | 13 | (2,2) − (13,1) are unchanged.]
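
The effect of compaction on this example can be sketched by merging, inside one box, runs of lines that carry the same edges and keeping the deepest depth together with the size of the run. This is only a reading of the figure, not the authors' procedure: it ignores rules A, B and C, it treats two lines as equal when both their natural edge and their edge list coincide, and the assignment of lines to boxes is assumed. The name `compact_box` is hypothetical.

```python
from itertools import groupby

def compact_box(lines):
    """lines: (depth, natural edge, edge list) tuples of one box, deepest node first.

    Returns one (deepest depth, natural edge, edge list, group size) per run
    of consecutive lines with identical natural edge and edge list.
    """
    compacted = []
    for (natural, edges), run in groupby(lines, key=lambda line: line[1:]):
        run = list(run)
        compacted.append((run[0][0], natural, edges, len(run)))
    return compacted

# Lines taken from the figure; grouping them into these boxes is an assumption.
two_lines = [(3, 2, ((13, 1),)), (2, 2, ((13, 1),))]
four_lines = [(d, 6, ((12, 2),)) for d in (7, 6, 5, 4)]
mixed = [(3, 4, ((12, 2),)), (2, 4, ((5, 1),))]

print(compact_box(two_lines))   # prints: [(3, 2, ((13, 1),), 2)]
print(compact_box(four_lines))  # prints: [(7, 6, ((12, 2),), 4)]
```

The box called `mixed` is left as two groups of size 1, since its two lines have different edge lists.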

  22. y → extended vector (Monostori, O(n)) → compact vector (Monostori, O(n)).

  23. On-line construction of a compact vector. Monostori: y → extended vector → compact vector, each step in O(n). Prieur, Lecroq: y → compact vector directly, in O(n). Faster and more space-economical construction.

  24. On-line construction of a compact vector. Proposition: when an edge is added to the node w of depth d in a box B_p, this edge will also be added to all the nodes of B_p of depth smaller than d in the group of nodes of w. [Figure: two views of y showing the factors v and w followed by the letter a, with positions p+1 (respectively p′+1), j and i marked.]

  25. On-line construction of a compact vector. Skip k − 1 extensions, where k is the number of nodes in the group to which the edge is added.
