Composite repetition-aware text indexing

SLIDE 1

Composite repetition-aware text indexing

Djamal Belazzougui Fabio Cunial Travis Gagie Nicola Prezza Mathieu Raffinot

SLIDE 2

Compressed text indexes

◮ LZ family: LZ77 or LZ78.
◮ BWT family: FM index or run-length encoded BWT (RLBWT).
◮ Compact directed acyclic word graph (CDAWG).

SLIDE 3

Repetition measures

◮ Number of phrases in the Lempel-Ziv parsing (LZ77).
◮ Number of runs in the Burrows-Wheeler transform (RLBWT).
◮ Number of maximal repeats.
◮ Number of right extensions and/or left extensions of maximal repeats (CDAWG).
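
As a rough illustration of the first two measures, the sketch below computes the number of BWT runs and a greedy LZ77 parse by brute force. The function names, the "$" sentinel and the particular LZ77 variant (longest match that starts earlier in the text, followed by one explicit character) are assumptions made for this example and are not taken from the slides.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, illustration only)."""
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def lz77_phrases(text):
    """Greedy LZ77 parse (assumed variant): each phrase is the longest prefix of
    the remaining text that also starts at an earlier position (overlaps are
    allowed), extended by one explicit character when one is left."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        length = 0
        # text[i:i+length+1] must occur with a start position strictly before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        phrases.append(text[i:i + length + 1])
        i += length + 1
    return phrases


if __name__ == "__main__":
    t = "banana"
    print(bwt(t), count_runs(bwt(t)))   # BWT of "banana$" and its number of runs
    print(lz77_phrases(t))              # phrases of the greedy parse
```

On a highly repetitive input both counts stay far below the text length, which is the property the indexes in this talk exploit.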

SLIDE 4

Repetition measures (notation)

◮ Number of phrases in the Lempel-Ziv parsing: $|Z_T|$ (LZ77).
◮ Number of runs in the BWT: $|R_T|$ (RLBWT).
◮ Number of runs in the BWT of the reverse: $|R_{\overline{T}}|$ (RLBWT).
◮ Number of right extensions of maximal repeats: $|E^r_T \cup F^r_T|$ (CDAWG).
◮ Number of left extensions of maximal repeats: $|E^\ell_T \cup F^\ell_T|$ (CDAWG).
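
To make the last two measures concrete, here is a brute-force sketch that enumerates the maximal repeats of a short string together with their right-extension characters. It assumes the usual definition (a substring preceded by at least two distinct characters and followed by at least two distinct characters, with the text bracketed by unique sentinels "#" and "$"). The function name is hypothetical, and the resulting count is only a naive stand-in that may differ in detail from the sets behind the notation on this slide.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Map each maximal repeat of text to its set of right-extension characters.

    Assumes '#' and '$' do not occur in text. Cubic time, illustration only."""
    s = "#" + text + "$"
    n = len(s)
    left = defaultdict(set)    # characters seen immediately before each substring
    right = defaultdict(set)   # characters seen immediately after each substring
    for i in range(1, n - 1):
        for j in range(i + 1, n):          # s[i:j] is a non-empty substring of text
            w = s[i:j]
            left[w].add(s[i - 1])
            right[w].add(s[j])
    # Two distinct left (right) contexts imply at least two occurrences.
    return {w: right[w] for w in left if len(left[w]) >= 2 and len(right[w]) >= 2}


if __name__ == "__main__":
    reps = maximal_repeats("banana")
    for w in sorted(reps):
        print(w, sorted(reps[w]))
    print("right extensions:", sum(len(ext) for ext in reps.values()))
```

For "banana" the maximal repeats are "a" and "ana", each with right extensions "n" and "$".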

SLIDE 5

Repetition measures

Composite repetition-aware data structures

Djamal Belazzougui (1), Fabio Cunial (2), Travis Gagie (1), Nicola Prezza (3), Mathieu Raffinot (4)

(1) Department of Computer Science, University of Helsinki, Finland.
(2) Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany.
(3) Department of Mathematics and Computer Science, University of Udine, Italy.
(4) LIAFA, Paris Diderot University - Paris 7, France.

Highly-repetitive strings

39 Saccharomyces cerevisiae genomes

Distinct measures of repetition all grow sublinearly

[Figure: plot of the repetition measures for this collection; axes and curves not recoverable.]

[1] Paolo Ferragina and Gonzalo Navarro. Pizza&Chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus.html. Accessed: 2015-01-25.

SLIDE 6

Results

Combining repetition-aware data structures

[Diagram: the components RLBWT_T, LZ77 index and CDAWG_T are combined into RLBWT+LZ77 and RLBWT+CDAWG, with panels labelled "Locating" and "Suffix tree representations". The "Words:" and "Time:" bounds, shown next to citations [1] and [2], are not recoverable.]

[1] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010.
[2] Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.

SLIDE 7

Results

[Diagram, continued: panels "Locating" and "Suffix tree representation"; the "Words:" bounds are not recoverable.]

SLIDE 8

Locate with LZ77 and RLBWT

◮ Secondary occurrences: 2-sided range reporting (time and space bounds not recoverable).
◮ Primary occurrences: 4-sided range reporting (time and space bounds not recoverable).
◮ Rank/select: predecessor data structure (time and space bounds not recoverable).

[1] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17(2):81–84, 1983.
[2] Timothy M. Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the Twenty-seventh Annual Symposium on Computational Geometry, pages 1–10. ACM, 2011.
[3] Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP’96), pages 141–155, 1996.

Locating with RLBWT+LZ77

[Diagram: RLBWT_T, LZ77 index and CDAWG_T; layout not recoverable.]
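
The split into primary and secondary occurrences can be sketched by brute force as follows. This is a minimal illustration and not the index from the talk. It assumes that an occurrence is secondary when it lies entirely inside the copied part of one LZ77 phrase and primary otherwise, and it replaces both range-reporting structures with plain scans. The function names and the parse variant (copied part plus one explicit character) are assumptions.

```python
def lz77_parse(text):
    """Greedy LZ77 parse as (start, copy_len, source) triples: the phrase copies
    text[source:source+copy_len] and then adds one explicit character, if any."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        copy_len, source = 0, -1
        while i + copy_len < n:
            j = text.find(text[i:i + copy_len + 1], 0, i + copy_len)
            if j == -1:
                break
            copy_len, source = copy_len + 1, j
        phrases.append((i, copy_len, source))
        i += copy_len + 1
    return phrases


def locate(text, pattern):
    """Return (primary, secondary) occurrence positions of a non-empty pattern."""
    m = len(pattern)
    phrases = lz77_parse(text)

    def inside_copy(p):
        return any(s <= p and p + m <= s + length for s, length, _ in phrases)

    # Stand-in for 4-sided range reporting: scan the text for primary occurrences.
    primary = [p for p in range(len(text) - m + 1)
               if text.startswith(pattern, p) and not inside_copy(p)]
    found, stack = set(primary), list(primary)
    # Stand-in for 2-sided range reporting: every occurrence that lies inside a
    # phrase's source is copied to the corresponding position in that phrase.
    while stack:
        q = stack.pop()
        for s, length, src in phrases:
            if src <= q and q + m <= src + length:
                p = q - src + s
                if p not in found:
                    found.add(p)
                    stack.append(p)
    return sorted(primary), sorted(found - set(primary))


if __name__ == "__main__":
    print(lz77_parse("banana"))
    print(locate("banana", "ana"))
```

Every secondary occurrence maps back to a strictly earlier occurrence in its phrase's source, so the propagation loop reaches all of them from the primary ones.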

SLIDE 9

Locate with LZ77 and RLBWT

Locating with RLBWT+LZ77

[Diagram: the pattern P[1..m] is split at a position k. The suffix P[k..m] is matched in RLBWT_T and the prefix P[1..k-1] is matched in a second RLBWT, with a predecessor data structure supporting rank. The space (words) and time bounds are not recoverable.]

SLIDE 10

Locate with CDAWG

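Locating with RLBWT+CDAWG

[Diagram: RLBWT_T and CDAWG_T are used together to locate a pattern P. The CDAWG runs from ε to T, its nodes are labelled with strings such as V, W, X and Y, and its arcs carry pairs such as (c, |Y|), (c, p) and (a, |X|); one step is marked "blind". Layout not recoverable.]

[1] Maxime Crochemore and Christophe Hancart. Automata for matching patterns. In Handbook of formal languages, pages 399–462. Springer, 1997.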

SLIDE 11

Suffix tree operations with CDAWG

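CDAWG for locating

[Diagram: CDAWG from ε to T with nodes V, W, Y and arcs labelled (c, |Y|) and (c, p); layout not recoverable.]

Suffix tree operations

[Time bounds 1), 2) and 3) were shown as formulas and are not recoverable. The panel also lists constant-space traversal and matching statistics.]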

SLIDE 12

Maximal Repeats and LZ-factorization

Rightmost maximal repeats and LZ factors

[Diagram: consecutive LZ factors T_i and T_{i+1}, a string W_i, and a maximal repeat X followed by the character c (X c); layout not recoverable.]

SLIDE 13

Maximal Repeats and LZ-77

Rightmost maximal repeats and LZ factors

[Diagram: LZ factors T_i, T_{i+1} and T_j, T_{j+1} with strings W_i and W_j. The same maximal repeat X appears followed by c (X c) and followed by d (X d); layout not recoverable.]

SLIDE 14

Rightmost Maximal Repeats and RLBWT

Maximal repeats, the BWT and the CDAWG. "Rightmost" maximal repeats, the BWT and LZ77 factors.

[Poster panels; the diagrams are not recoverable. Panel titles and surviving labels:]

◮ Preliminaries.
◮ Maximal repeats: nodes V, W and U of the suffix tree ST_T; chain of explicit Weiner links.
◮ Maximal repeats and BWT: the suffixes W[1..m], W[2..m], W[3..m], ..., W[k..m] of W and their right extensions by c and d, shown against BWT_T.
◮ Compact Directed Acyclic Word Graph [1, 2]: graph from ε to T with nodes V, W, X, Y.
◮ Maximal repeats and CDAWG.
◮ Rightmost maximal repeats.
◮ Rightmost maximal repeats and BWT runs.

[1] Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM, 34(3):578–595, 1987.
[2] Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Alberto Apostolico and Jotun Hein, editors, CPM, volume 1264 of Lecture Notes in Computer Science, pages 116–129. Springer, 1997.

SLIDE 15

Open Problems

◮ For which classes of strings does LZ77 behave better than RLBWT, and vice versa?
◮ Fibonacci strings and Thue-Morse strings of length n have O(log n) LZ77 factors but only 2 BWT runs.
◮ A binary de Bruijn sequence of length n has n BWT runs, but only n/log n LZ77 factors.
◮ What is the widest asymptotic gap between LZ77 and RLBWT, in both directions?
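
The Fibonacci case can be checked by brute force on small inputs. The sketch below assumes the cyclic BWT (sorted rotations, no sentinel) and the convention F(1) = "b", F(2) = "a", F(k) = F(k-1) F(k-2). Both are choices made for this example, and appending a sentinel may change the run count slightly.

```python
def fibonacci_word(k):
    """k-th Fibonacci word under the assumed convention F(1)="b", F(2)="a"."""
    prev, cur = "b", "a"
    for _ in range(k - 2):
        prev, cur = cur, cur + prev
    return cur if k >= 2 else prev


def cyclic_bwt(s):
    """BWT over the sorted cyclic rotations of s (no sentinel appended)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


if __name__ == "__main__":
    for k in range(3, 12):
        w = fibonacci_word(k)
        print(k, len(w), count_runs(cyclic_bwt(w)))
```

For example, F(5) = "abaab" has cyclic BWT "bbaaa", which consists of two runs.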

SLIDE 16

Bibliography

◮ CPM 2015: arxiv.org/abs/1502.05937.
◮ Experiments: arxiv.org/abs/1604.06002.

SLIDE 17

Follow up

◮ We can support (most of) the other operations in ST in O(log n) time.
◮ Including CSA and inverse CSA, and also treedepth, level-ancestor, LCA, ...

Time          Operation
O(1)          stringDepth, locateLeaf, isAncestor, nLeaves, Child, firstChild
O(log log n)  parent, nextSibling, suffixLink, weinerLink, edgeChar
O(log n)      Treedepth, SA[i], leafSelect, ISA[i], LCA, levelAncestor, stringLevelAncestor
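
For orientation, the two array operations in the last row can be defined by brute force as below. This only fixes what SA[i] and ISA[i] return on a small example. It is unrelated to how the compressed structures achieve the bounds in the table, and the function names and the "$" sentinel are assumptions.

```python
def suffix_array(text, sentinel="$"):
    """Start positions of the suffixes of text + sentinel in lexicographic order."""
    s = text + sentinel
    return sorted(range(len(s)), key=lambda i: s[i:])


def inverse_suffix_array(sa):
    """ISA[p] is the lexicographic rank of the suffix that starts at position p."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa)
    print(inverse_suffix_array(sa))
```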