
Composite repetition-aware text indexing



  1. Composite repetition-aware text indexing. Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot.

  2. Compressed text indexes
     ◮ LZ family: LZ77 or LZ78.
     ◮ BWT family: FM index or run-length encoded BWT (RLBWT).
     ◮ Compact directed acyclic word graph (CDAWG).
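As a rough illustration of the BWT family listed above, the following Python sketch builds the Burrows-Wheeler transform by sorting rotations and then run-length encodes it. It is a quadratic toy, not how an FM index or an RLBWT is built in practice; the names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for this example.

```python
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    assert sentinel not in text, "sentinel must not occur in the text"
    s = text + sentinel
    # Sort all cyclic rotations and read off their last characters.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> List[Tuple[str, int]]:
    """Collapse each maximal run of equal characters into (char, length)."""
    runs: List[Tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed, run_length_encode(transformed))
```

On `banana` the transform is `annb$aa`, which collapses to five runs. An RLBWT stores only those runs, so its size follows the number of runs rather than the text length.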

  3. Repetition measures
     ◮ Number of phrases in the Lempel-Ziv parsing (LZ77).
     ◮ Number of runs in the Burrows-Wheeler transform (RLBWT).
     ◮ Number of maximal repeats. Number of right extensions and/or left extensions of maximal repeats (CDAWG).
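To make the last measure concrete, here is a brute-force Python sketch that lists the nonempty maximal repeats of a short string together with the characters that follow them. It assumes the usual definition (a substring that occurs at least twice and whose occurrences are neither all preceded nor all followed by the same character), uses hypothetical boundary markers `#` and `$`, and counts a text boundary as an extension. These conventions are assumptions and may differ from the exact sets used in the slides.

```python
from collections import defaultdict
from typing import Dict, List


def maximal_repeats(text: str, start_mark: str = "#", end_mark: str = "$") -> Dict[str, List[str]]:
    """Map each nonempty maximal repeat to its sorted right extensions.

    Brute force over all substrings; only suitable for tiny inputs.
    """
    assert start_mark not in text and end_mark not in text
    n = len(text)
    count: Dict[str, int] = defaultdict(int)
    left = defaultdict(set)   # characters seen immediately before each substring
    right = defaultdict(set)  # characters seen immediately after each substring
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = text[i:j]
            count[w] += 1
            left[w].add(text[i - 1] if i > 0 else start_mark)
            right[w].add(text[j] if j < n else end_mark)
    return {
        w: sorted(right[w])
        for w in count
        if count[w] >= 2 and len(left[w]) >= 2 and len(right[w]) >= 2
    }


if __name__ == "__main__":
    print(maximal_repeats("banana"))
```

For `banana` the sketch reports two maximal repeats, `a` and `ana`, each with two right extensions under the conventions above.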

  4. Repetition measures (notation)
     ◮ Number of phrases in the Lempel-Ziv parsing: $|Z_T|$ (LZ77).
     ◮ Number of runs in the BWT: $|R_T|$ (RLBWT).
     ◮ Number of runs in the BWT of the reverse: $|R_{\overline{T}}|$ (RLBWT), where $\overline{T}$ denotes the reverse of $T$.
     ◮ Number of right extensions of maximal repeats: $|E^r_T \cup F^r_T|$ (CDAWG).
     ◮ Number of left extensions of maximal repeats: $|E^\ell_T \cup F^\ell_T|$ (CDAWG).
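The first three quantities can be computed naively on small inputs, as in the sketch below. It assumes one common LZ77 variant (each phrase is the longest prefix of the remaining text that also starts earlier, overlaps allowed, followed by one fresh character when available) and a rotation-sorting BWT with a `$` sentinel. The function names and the dictionary keys `z`, `r`, `r_rev` are hypothetical, and the variant may not match the slides' definition exactly.

```python
from typing import Dict, List


def lz77_phrases(text: str) -> List[str]:
    """Greedy LZ77 parse: longest earlier-starting match plus one character."""
    phrases: List[str] = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Extend while text[i:i+length+1] has an occurrence starting before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        phrases.append(text[i:i + length + 1])
        i += length + 1
    return phrases


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    assert sentinel not in text
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def repetition_measures(text: str) -> Dict[str, int]:
    """z: LZ77 phrases; r: BWT runs of text; r_rev: BWT runs of reversed text."""
    return {
        "z": len(lz77_phrases(text)),
        "r": count_runs(bwt(text)),
        "r_rev": count_runs(bwt(text[::-1])),
    }


if __name__ == "__main__":
    print(repetition_measures("banana"))
    print(repetition_measures("ab" * 50))
```

On the 100-character string `"ab" * 50` both the phrase count and the run count come out as 3, which is the kind of gap between text length and repetition measure that these indexes exploit.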

  5. Repetition measures: highly repetitive strings
     [Figure: repetition measures on 39 Saccharomyces cerevisiae genomes.]
     Distinct measures of repetition all grow sublinearly.
     Composite repetition-aware data structures. Djamal Belazzougui (1), Fabio Cunial (2), Travis Gagie (1), Nicola Prezza (3), Mathieu Raffinot (4).
     (1) Department of Computer Science, University of Helsinki, Finland.
     (2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
     (3) Department of Mathematics and Computer Science, University of Udine, Italy.
     (4) LIAFA, Paris Diderot University - Paris 7, France.
     [1] Paolo Ferragina and Gonzalo Navarro. Pizza&Chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus.html. Accessed: 2015-01-25.

  6. Results: combining repetition-aware data structures
     Components: $\mathrm{RLBWT}_T$, $\mathrm{CDAWG}_T$, LZ77 index.
     [Table: space in words and locating time for RLBWT+CDAWG and RLBWT+LZ77, citing [1] and [2]; the bounds are not recoverable.]
     Suffix tree representations.
     [1] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
     [2] Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.

  7. Results: combining repetition-aware data structures (continued)
     [Table: space in words and locating time for RLBWT+CDAWG and RLBWT+LZ77, plus space in words for the suffix tree representation; the bounds are not recoverable. Citations [1] and [2] as on the previous slide.]

  8. Locate with LZ77 and RLBWT (locating with RLBWT+LZ77)
     ◮ Rank/select in $\mathrm{RLBWT}_T$: predecessor data structure (time and word bounds not recoverable).
     ◮ Primary occurrences: 4-sided range reporting (time and word bounds not recoverable).
     ◮ Secondary occurrences: 2-sided range reporting (time and word bounds not recoverable).
     [1] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81–84, 1983.
     [2] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the Twenty-seventh Annual Symposium on Computational Geometry, pages 1–10. ACM, 2011.
     [3] Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP'96), pages 141–155, 1996.

  9. Locate with LZ77 and RLBWT (locating with RLBWT+LZ77, continued)
     [Figure: the pattern $P[1..m]$ split at position $k$ into $P[1..k-1]$ and $P[k..m]$, with RLBWT searches on the parts; time and word bounds not recoverable.]
     ◮ Rank in the RLBWT: predecessor data structure.
     ◮ Primary occurrences: 4-sided range reporting.
     ◮ Secondary occurrences: 2-sided range reporting.
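In LZ77-based indexes, occurrences of a pattern are commonly divided into primary ones, which touch a phrase boundary, and secondary ones, which lie inside the copied part of a phrase and are therefore copies of earlier occurrences. The Python sketch below classifies occurrences by brute force under one such convention (an occurrence is primary if it contains the last character of some phrase). The parse variant, the convention, and the names `lz77_phrases` and `classify_occurrences` are assumptions for illustration; the range-reporting structures named on the slide are replaced here by a plain scan and a binary search.

```python
from bisect import bisect_left
from typing import List, Tuple


def lz77_phrases(text: str) -> List[str]:
    """Greedy LZ77 parse: longest earlier-starting match plus one character."""
    phrases: List[str] = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        phrases.append(text[i:i + length + 1])
        i += length + 1
    return phrases


def classify_occurrences(text: str, pattern: str) -> Tuple[List[int], List[int]]:
    """Return (primary, secondary) 0-based starting positions of pattern."""
    assert pattern, "pattern must be nonempty"
    # 0-based position of the last character of each phrase, in increasing order.
    ends: List[int] = []
    pos = 0
    for phrase in lz77_phrases(text):
        pos += len(phrase)
        ends.append(pos - 1)
    primary: List[int] = []
    secondary: List[int] = []
    start = text.find(pattern)
    while start != -1:
        stop = start + len(pattern) - 1
        # First phrase end at or after the occurrence start.
        j = bisect_left(ends, start)
        if j < len(ends) and ends[j] <= stop:
            primary.append(start)
        else:
            secondary.append(start)
        start = text.find(pattern, start + 1)
    return primary, secondary


if __name__ == "__main__":
    print(classify_occurrences("abababab", "ab"))
```

With the parse `a | b | ababab` of `abababab`, the occurrences of `ab` at positions 0 and 6 are primary and those at positions 2 and 4 are secondary under this convention.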

  10. Locate with CDAWG (locating with RLBWT+CDAWG)
     [Figure: blind search of the pattern $P$ using $\mathrm{RLBWT}_T$ and $\mathrm{CDAWG}_T$; labels include the root $\varepsilon$, the strings $W$, $W_1$, $V$, $X$, $Y$, the lengths $|W|$ and $|V|$, and the annotations $(c, p)$, $(c, |Y|)$ and $(a, |X|)$.]
     [1] Maxime Crochemore and Christophe Hancart. Automata for matching patterns. In Handbook of Formal Languages, pages 399–462. Springer, 1997.

  11. Suffix tree operations with CDAWG
     ◮ The CDAWG used for locating also supports suffix tree operations (time bounds not recoverable).
     ◮ Applications: matching statistics, constant-space traversal.
     [Figure: CDAWG fragment with root $\varepsilon$, strings $W$, $V$, $Y$, and the annotations $(c, p)$ and $(c, |Y|)$, in three numbered steps.]
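For readers unfamiliar with matching statistics, the following Python sketch computes them by brute force: for each position of a query, the length of the longest substring starting there that also occurs in the text. It only illustrates the definition; it does not use a CDAWG, an RLBWT, or a suffix tree, and the name `matching_statistics` is an assumption made for this example.

```python
from typing import List


def matching_statistics(query: str, text: str) -> List[int]:
    """ms[i] = length of the longest prefix of query[i:] that occurs in text."""
    ms: List[int] = []
    for i in range(len(query)):
        length = 0
        # Grow the candidate while it is still a substring of the text.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("nanb", "banana"))
```

For the query `nanb` against `banana` the result is `[3, 2, 1, 1]`: `nan` occurs in the text, but `nanb` does not.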

  12. Maximal repeats and LZ factorization
     [Figure: rightmost maximal repeats and LZ factors; consecutive factors $T_i$ and $T_{i+1}$, the string $W_i$, the character $c$, and a maximal repeat $X$ followed by $c$.]

  13. Maximal repeats and LZ77
     [Figure: rightmost maximal repeats and LZ factors; factors $T_i$, $T_{i+1}$ with $W_i$ and $T_j$, $T_{j+1}$ with $W_j$, and a maximal repeat $X$ followed by the characters $c$ and $d$.]
