Composite repetition-aware text indexing
Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, Mathieu Raffinot
Compressed text indexes
◮ LZ family: LZ77 or LZ78.
◮ BWT family: FM-index or run-length encoded BWT (RLBWT).
◮ Compact directed acyclic word graph.
Repetition measures
◮ Number of phrases in the Lempel-Ziv parsing (LZ77).
◮ Number of runs in the Burrows-Wheeler transform (RLBWT).
◮ Number of maximal repeats; number of right and/or left extensions of maximal repeats (CDAWG).
Repetition measures (notation)
◮ Number of phrases in the Lempel-Ziv parsing: |Z_T| (LZ77).
◮ Number of runs in the BWT of T: |R_T| (RLBWT).
◮ Number of runs in the BWT of the reverse of T: |R_T̄|, where T̄ denotes the reverse of T (RLBWT).
◮ Number of right extensions of maximal repeats: |E^r_T ∪ F^r_T| (CDAWG).
◮ Number of left extensions of maximal repeats: |E^ℓ_T ∪ F^ℓ_T| (CDAWG).
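The first two measures can be computed directly on small inputs. The following is a minimal sketch, not the construction used in the talk: `bwt`, `count_runs` and `lz77_phrases` are hypothetical helper names, the BWT is built by sorting rotations of the text with an appended sentinel `$`, and the LZ77 variant assumed here lets each phrase copy the longest previously occurring string (overlaps allowed) and then adds one explicit character.

```python
def bwt(text, sentinel="$"):
    """BWT via sorted rotations of text + sentinel (quadratic space; illustration only)."""
    s = text + sentinel  # assumes the sentinel is smaller than every character of text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def lz77_phrases(text):
    """Greedy LZ77 parse (assumed variant): longest earlier-starting match plus one character."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Extend while text[i:i+length+1] also occurs starting at some position < i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        phrases.append(text[i:i + length + 1])
        i += length + 1
    return phrases


if __name__ == "__main__":
    t = "abab" * 8
    print(len(t), count_runs(bwt(t)), len(lz77_phrases(t)))
```

On a highly repetitive input both counts stay far below the text length, which is the behavior these measures are meant to capture.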
Repetition measures
Composite repetition-aware data structures
Djamal Belazzougui (1), Fabio Cunial (2), Travis Gagie (1), Nicola Prezza (3), Mathieu Raffinot (4)
(1) Department of Computer Science, University of Helsinki, Finland.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
(3) Department of Mathematics and Computer Science, University of Udine, Italy.
(4) LIAFA, Paris Diderot University - Paris 7, France.
Highly-repetitive strings
39 Saccharomyces cerevisiae genomes
Distinct measures of repetition all grow sublinearly
[1] Paolo Ferragina and Gonzalo Navarro. Pizza&Chili repetitive corpus. Accessed: 2015-01-25.
http://pizzachili.dcc.uchile.cl/repcorpus.html
Results
Combining repetition-aware data structures
[Figure: RLBWT_T, an LZ77 index and CDAWG_T are combined into RLBWT+LZ77 (locating) and RLBWT+CDAWG (locating, suffix tree representation), each with its space in words and its time, compared against [1] and [2].]
[1] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
[2] Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.
Locate with LZ77 and RLBWT
◮ Secondary occurrences: ... time, ... words (2-sided range reporting).
◮ Primary occurrences: ... time, ... words (4-sided range reporting).
◮ Rank/select: ... time, ... words (predecessor data structure).
[1] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81–84, 1983.
[2] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the Twenty-seventh Annual Symposium on Computational Geometry, pages 1–10. ACM, 2011.
[3] Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP'96), pages 141–155, 1996.
Locating with RLBWT+LZ77
[Figure: the pattern P[1..m] is split at a position k. The suffix P[k..m] is searched in RLBWT_T and the prefix P[1..k-1] in a second RLBWT, each with its space in words and its time; rank queries are supported by a predecessor data structure.]
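Rank over a run-length encoded BWT reduces to a predecessor query on the starting positions of the runs. The sketch below illustrates that idea under stated assumptions: the class name `RLBWT` and its fields are hypothetical, a sorted Python list with binary search stands in for the predecessor data structure, and only counting by backward search is shown, not the range-reporting step.

```python
from bisect import bisect_left
from itertools import groupby


class RLBWT:
    """Run-length encoded BWT with rank answered by predecessor search (sketch)."""

    def __init__(self, bwt):
        self.n = len(bwt)
        # Per character: run start positions, run lengths, and occurrences before each run.
        self.starts, self.lengths, self.before = {}, {}, {}
        totals = {}
        pos = 0
        for ch, group in groupby(bwt):
            length = sum(1 for _ in group)
            self.starts.setdefault(ch, []).append(pos)
            self.lengths.setdefault(ch, []).append(length)
            self.before.setdefault(ch, []).append(totals.get(ch, 0))
            totals[ch] = totals.get(ch, 0) + length
            pos += length
        # C[ch] = number of characters in the BWT that are smaller than ch.
        self.C, acc = {}, 0
        for ch in sorted(totals):
            self.C[ch] = acc
            acc += totals[ch]

    def rank(self, ch, i):
        """Number of occurrences of ch in bwt[0:i]."""
        starts = self.starts.get(ch)
        if not starts:
            return 0
        k = bisect_left(starts, i) - 1  # last run of ch that starts before position i
        if k < 0:
            return 0
        return self.before[ch][k] + min(i - starts[k], self.lengths[ch][k])

    def count(self, pattern):
        """Number of occurrences of pattern, by backward search."""
        lo, hi = 0, self.n
        for ch in reversed(pattern):
            if ch not in self.C:
                return 0
            lo = self.C[ch] + self.rank(ch, lo)
            hi = self.C[ch] + self.rank(ch, hi)
            if lo >= hi:
                return 0
        return hi - lo
```

For example, the BWT of `abab$` is `bb$aa`, so `RLBWT("bb$aa").count("ab")` counts the occurrences of `ab` in `abab`. The space used is proportional to the number of runs rather than to the text length.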
Locate with CDAWG
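[Figure: the CDAWG of T, from source ε to sink T, with nodes V and W (string lengths |V| and |W|), strings X and Y, and arc labels of the form (c, |Y|) and (c, p).]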
Locating with RLBWT+CDAWG
[Figure: locating the pattern P with RLBWT_T and CDAWG_T: node W1, an arc labeled (a, |X|), and a step annotated "blind".]
[1] Maxime Crochemore and Christophe Hancart. Automata for matching patterns. In Handbook of formal languages, pages 399–462. Springer, 1997.
Suffix tree operations with CDAWG
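CDAWG for locating
[Figure: the CDAWG of T, from source ε to sink T, with nodes V and W and arc labels (c, |Y|) and (c, p).]
Suffix tree operations
[Figure: time bounds for three suffix tree operations, numbered 1) to 3); operations 1) and 3) are shown together with constant-space traversal and matching statistics.]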
Maximal Repeats and LZ-factorization
[Figure: LZ factors T_i and T_{i+1} and a character c.]
Rightmost maximal repeats and LZ factors
[Figure: LZ factors T_i and T_{i+1} with the string W_i; a maximal repeat X is followed by c at its marked occurrences.]
[Figure: LZ factors T_i, T_j, T_{i+1}, T_{j+1} with the strings W_i and W_j; the maximal repeat X occurs followed by c and followed by d.]
Rightmost Maximal Repeats and RLBWT
◮ Maximal repeats and the BWT and the CDAWG.
◮ "Rightmost" maximal repeats and the BWT and LZ77 factors.
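Preliminaries
Maximal repeats
[Figure: a maximal repeat W as a node of the suffix tree ST_T; W occurs preceded by distinct characters (a, b) and followed by distinct characters (c, d).]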
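In the standard definition, a maximal repeat is a substring that occurs at least twice, is preceded by at least two distinct characters and is followed by at least two distinct characters. The sketch below enumerates maximal repeats naively from that definition; it is only an illustration for short strings, not the suffix-tree or CDAWG machinery of the talk. The function name `maximal_repeats` is hypothetical, the start and end of the text are each treated as one extra distinct context (an assumption), and the empty string is left out.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Naive enumeration of the non-empty maximal repeats of text (cubic time)."""
    n = len(text)
    count = defaultdict(int)   # substring -> number of occurrences
    left = defaultdict(set)    # substring -> characters seen just before it
    right = defaultdict(set)   # substring -> characters seen just after it
    for i in range(n):
        for j in range(i + 1, n + 1):
            w = text[i:j]
            count[w] += 1
            # None marks the text boundary and counts as one distinct context.
            left[w].add(text[i - 1] if i > 0 else None)
            right[w].add(text[j] if j < n else None)
    return {
        w for w in count
        if count[w] >= 2 and len(left[w]) >= 2 and len(right[w]) >= 2
    }


if __name__ == "__main__":
    print(sorted(maximal_repeats("abcabd")))
```

In `abcabd`, the substring `ab` qualifies because its two occurrences have different left contexts and are followed by `c` and `d`, while `a` alone does not because it is always followed by `b`.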
Maximal repeats
[Figure: maximal repeats V, U and W as nodes of ST_T, with their left and right extension characters, connected by a chain of explicit Weiner links.]
Maximal repeats and BWT
[Figure: BWT_T and ST_T; the BWT intervals of the suffixes W[1..m], W[2..m], W[3..m], ..., W[k..m] of W, and of their right extensions by c and by d.]
Compact Directed Acyclic Word Graph
[Figure: the CDAWG of T, with nodes V and W and strings X and Y.]
[1] Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578–595, 1987.
[2] Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Alberto Apostolico and Jotun Hein, editors, CPM, volume 1264 of Lecture Notes in Computer Science, pages 116–129. Springer, 1997.
Rightmost maximal repeats
[Figure: rightmost maximal repeats in BWT_T and ST_T.]
Maximal repeats and CDAWG
[Figure: the CDAWG of T, from source ε to sink T, with maximal repeats V and W and parent pointers.]
Rightmost maximal repeats
[Figure: rightmost maximal repeats in ST_T.]
Rightmost maximal repeats and BWT runs
[Figure: rightmost maximal repeats and runs in BWT_T and ST_T.]
Open Problems
◮ For which classes of strings does LZ77 behave better than RLBWT (and vice versa)?
◮ Fibonacci strings and Thue-Morse strings of length n have O(log n) LZ77 factors but only 2 BWT runs.
◮ A binary de Bruijn sequence of length n has n BWT runs, but only n / log n LZ77 factors.
◮ What is the widest asymptotic gap between LZ77 and RLBWT (in both directions)?
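The Fibonacci case can be checked empirically on short words. The sketch below is an illustration under stated assumptions: `fibonacci_word`, `cyclic_bwt` and `count_runs` are hypothetical helper names, the words are defined by f(1) = "b", f(2) = "a", f(k) = f(k-1) f(k-2), and the BWT is taken over the cyclic rotations of the word with no sentinel appended (appending a sentinel would change the run count).

```python
def fibonacci_word(k):
    """k-th Fibonacci word with f(1) = "b", f(2) = "a", f(k) = f(k-1) + f(k-2)."""
    prev, cur = "b", "a"
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur if k >= 2 else prev


def cyclic_bwt(word):
    """Last column of the sorted cyclic rotations of word (no sentinel)."""
    rotations = sorted(word[i:] + word[:i] for i in range(len(word)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


if __name__ == "__main__":
    for k in range(3, 13):
        w = fibonacci_word(k)
        print(k, len(w), count_runs(cyclic_bwt(w)))
```

For every order from 3 upward the transform groups all occurrences of `b` before all occurrences of `a`, so the run count stays at 2 while the length grows exponentially in k.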
Bibliography
◮ CPM 2015: arxiv.org/abs/1502.05937.
◮ Experiments: arxiv.org/abs/1604.06002.
Follow up
◮ We can support (most of) the other suffix tree (ST) operations in O(log n) time.
◮ This includes the CSA and inverse CSA, as well as treedepth, level-ancestor, LCA, and others.
Time            Operation
O(1)            stringDepth, locateLeaf, isAncestor, nLeaves, Child, firstChild
O(log log n)    parent, nextSibling, suffixLink, weinerLink, edgeChar
O(log n)        Treedepth, SA[i], leafSelect, ISA[i], LCA, levelAncestor, stringLevelAncestor