Indexing in repetition-aware space (Nicola Prezza)




SLIDE 1

Overview LZ77 in RLE space lz-rlbwt, in practice

Indexing in repetition-aware space

Nicola Prezza

University of Udine, Department of Computer Science

Dagstuhl Seminar 16431: "Computation over Compressed Structured Data"

SLIDE 2

Topics:

  • LZ77 computation in O(|RLBWT|) space
  • RLBWT ↔ LZ77 conversions in O(|RLBWT| + |LZ77|) space
  • lz-rlbwt construction in asymptotically optimal space
  • The DYNAMIC library
  • Practical variants of the lz-rlbwt index, with results

SLIDE 3

Lempel-Ziv parsing

Problem: LZ77 computation

  • LZ77 can be computed with an index on T[1..n].
  • Problem: on extremely repetitive texts, an entropy-compressed FM-index can be exponentially larger than |LZ77| ...
  • r = number of runs in BWT(T).
  • r is a good measure of repetitiveness, and can be exponentially smaller than n on repetitive texts.

Goal: build LZ77 in O(r) working space.
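To make the output format concrete, here is a minimal Python sketch of a quadratic-time greedy LZ77 factorization. It is not the index-based method this slide refers to. The function name `lz77_naive`, the 0-based positions, and the convention that every phrase ends with an explicit character (matching the ⟨occ, length, char⟩ triples of the worked example on T = bababaa) are assumptions made for illustration.

```python
def lz77_naive(text):
    """Greedy LZ77 parse into (occ, length, char) triples, 0-based.

    occ is the start of an earlier occurrence of the copied part
    (None when length == 0); char is the explicit trailing character.
    Quadratic-time illustration only.
    """
    n = len(text)
    phrases = []
    i = 0
    while i < n:
        best_len, best_occ = 0, None
        max_len = n - i - 1              # reserve one character to close the phrase
        for j in range(i):               # candidate earlier start positions
            k = 0
            # The copy may overlap the phrase itself (j + k can reach i).
            while k < max_len and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_occ = k, j
        phrases.append((best_occ, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


print(lz77_naive("bababaa"))
# [(None, 0, 'b'), (None, 0, 'a'), (0, 4, 'a')]
```

The third phrase copies 4 characters starting at position 0 while being written from position 2 onward, so source and destination overlap; this is allowed in LZ77.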

SLIDE 4

Run-length compression of the BWT

T = "abcabbcaabcabcabbc"

Sorted rotations of T# (the last column spells the BWT):

#abcabbcaabcabcabbc
aabcabcabbc#abcabbc
abbc#abcabbcaabcabc
abbcaabcabcabbc#abc
abcabbc#abcabbcaabc
abcabbcaabcabcabbc#
abcabcabbc#abcabbca
bbc#abcabbcaabcabca
bbcaabcabcabbc#abca
bc#abcabbcaabcabcab
bcaabcabcabbc#abcab
bcabbc#abcabbcaabca
bcabbcaabcabcabbc#a
bcabcabbc#abcabbcaa
c#abcabbcaabcabcabb
caabcabcabbc#abcabb
cabbc#abcabbcaabcab
cabbcaabcabcabbc#ab
cabcabbc#abcabbcaab

BWT(abcabbcaabcabcabbc) = ccccc#aaabbaaabbbbb
RLE(BWT(T)) = RLBWT(T) = ⟨5, c⟩ ⟨1, #⟩ ⟨3, a⟩ ⟨2, b⟩ ⟨3, a⟩ ⟨5, b⟩
r = 6
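The numbers on this slide can be reproduced with a short Python sketch. It builds the BWT by sorting all rotations, which is fine for a toy string but is not how an RLBWT would be built in practice. The helper names `bwt` and `run_length_encode` are hypothetical, and `#` is assumed to be absent from the text and smaller than every text character.

```python
from itertools import groupby


def bwt(text, sentinel="#"):
    """BWT via explicit sorting of all rotations of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def run_length_encode(s):
    """Collapse s into (run length, character) pairs."""
    return [(len(list(group)), char) for char, group in groupby(s)]


b = bwt("abcabbcaabcabcabbc")
runs = run_length_encode(b)
print(b)                  # ccccc#aaabbaaabbbbb
print(runs)               # [(5, 'c'), (1, '#'), (3, 'a'), (2, 'b'), (3, 'a'), (5, 'b')]
print("r =", len(runs))   # r = 6
```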

SLIDE 5

Case study: highly repetitive text collections

Motivational example (from pizzachili.dcc.uchile.cl): all revisions (since 2001, text only) of en.wikipedia.org/wiki/Albert_Einstein

  • Uncompressed: 456 MB
  • z ≈ 76 · 10^3 (z = number of LZ77 phrases): z log n + z log σ ≈ 310 KB (7-Zip: 314 KB), 1400x compression rate
  • r ≈ 290 · 10^3: r log(n/r) + r log σ ≈ 544 KB, 840x compression rate

SLIDE 6

LZ77 in RLE space: algorithm overview

1. Build (online) $\mathrm{RLBWT}(\overleftarrow{T})$, the run-length BWT of the reversed text.
2. Forward-navigate T using $\mathrm{RLBWT}(\overleftarrow{T})$ and:
  • keep, for each BWT run, only the 2 outermost SA samples (the first and the last sampled position in the run)
  • search the current LZ77 factor on $\mathrm{RLBWT}(\overleftarrow{T})$

It can be shown that this SA sampling is sufficient to locate at least one previous occurrence of each LZ factor.
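The two steps can be simulated in a few lines of Python. This is a sketch under loud assumptions, not the author's implementation: it stores the BWT of the reversed text as a plain string and uses `str.count` for rank, so it does not run in O(r) space; it only mirrors the parsing and sampling rules of the worked example on T = bababaa. All names (`lz77_rle_sketch`, `lf`, `samples`) are hypothetical, positions are 0-based, and `#` is assumed to be absent from the text and smaller than every text character.

```python
def lz77_rle_sketch(text, sentinel="#"):
    """LZ77 triples (occ, length, char) via the BWT of the reversed text."""
    s = text[::-1] + sentinel
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in order)        # uncompressed, for clarity
    first = {c: sum(1 for x in s if x < c) for c in set(s)}   # C array

    def lf(c, k):                                 # C[c] + rank_c(bwt, k)
        return first[c] + bwt.count(c, 0, k)

    run_of = [0] * n                              # run index of each BWT position
    for k in range(1, n):
        run_of[k] = run_of[k - 1] + (bwt[k] != bwt[k - 1])

    samples = {}                                  # run -> sorted [(bwt_pos, text_pos)]
    phrases = []
    k, lo, hi = 0, 0, n                           # current row, phrase range [lo, hi)
    length, occ = 0, None
    for i, c in enumerate(text):                  # invariant: bwt[k] == text[i]
        one_run = run_of[lo] == run_of[hi - 1]
        hit = None
        if not one_run:                           # look for a sampled c in the range
            for kept in samples.values():
                for pos, text_pos in kept:
                    if lo <= pos < hi and bwt[pos] == c:
                        hit = text_pos
        if (not one_run and hit is None) or i == len(text) - 1:
            phrases.append((occ, length, c))      # parsing rule 1 (or end of text)
            lo, hi, length, occ = 0, n, 0, None
        else:
            if not one_run:
                occ = hit - length                # parsing rule 2
            lo, hi = lf(c, lo), lf(c, hi)         # rules 2 and 3: extend the phrase
            length += 1
        kept = samples.setdefault(run_of[k], [])  # sampling rule 1
        kept.append((k, i))
        kept.sort()
        if len(kept) == 3:                        # sampling rule 2: drop the middle
            del kept[1]
        k = lf(c, k)                              # advance one position in the text
    return phrases


print(lz77_rle_sketch("bababaa"))
# [(None, 0, 'b'), (None, 0, 'a'), (0, 4, 'a')]
```

Because each run keeps at most two samples, the `samples` dictionary never holds more than 2r entries, which is the property the space bound relies on; everything else in the sketch (the plain BWT string, the linear scan over samples) would be replaced by compressed structures in a real implementation.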

SLIDE 7

Example: T = bababaa

Between horizontal lines: range of the current LZ phrase (here, the full range).

  • Parsing rule 1/3: the current BWT range contains > 1 runs and none of the "b" characters inside the range is marked with an SA sample: new LZ phrase ⟨−, 0, b⟩.
  • Sampling rule 1/2: always add an SA sample at the current position.

#aababab
aababab#
ab#aabab
abab#aab
ababab#a
b#aababa
bab#aaba
babab#aa

SLIDE 8

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩

  • Parsing rule 1: new LZ phrase ⟨−, 0, a⟩.
  • Sampling rule 1: add a new SA sample at the current position.

#aababab
aababab#
ab#aabab
abab#aab
ababab#a
b#aababa
bab#aaba
babab#aa

SLIDE 9

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩ ⟨−, 0, a⟩

  • Parsing rule 2/3: the current BWT range contains > 1 runs and there is a sampled "b" in the current range: occ = sample − length = 0 − 0 = 0. Update the BWT range ("b").
  • Sampling rule 1: add a new SA sample at the current position.

#aababab
aababab#
ab#aabab
abab#aab
ababab#a
b#aababa  1
bab#aaba
babab#aa

SLIDE 10

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩ ⟨−, 0, a⟩

  • Parsing rule 3/3: the current range contains only one run. Keep the previous occ (= 0) and update the BWT range ("ab").
  • Sampling rule 1: add a new SA sample at the current position.

#aababab
aababab#
ab#aabab  2
abab#aab
ababab#a
b#aababa  1
bab#aaba
babab#aa

SLIDE 11

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩ ⟨−, 0, a⟩

  • Parsing rule 2: occ = sample − length = 2 − 2 = 0. Update the BWT range ("bab").
  • Sampling rule 1: add a new SA sample at the current position.

#aababab
aababab#
ab#aabab  2
abab#aab
ababab#a
b#aababa  1
bab#aaba  3
babab#aa

SLIDE 12

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩ ⟨−, 0, a⟩

  • Parsing rule 3: keep the previous occ (= 0) and update the BWT range ("abab").
  • Sampling rule 1: add a new SA sample at the current position.
  • Sampling rule 2/2: the current a-run now has 3 samples: delete the sample in the middle (3).

#aababab
aababab#
ab#aabab  2
abab#aab  4
ababab#a
b#aababa  1
bab#aaba  3
babab#aa

SLIDE 13

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩ ⟨−, 0, a⟩

  • Parsing rule 1: output the new phrase ⟨occ, length, current_char⟩ = ⟨0, 4, a⟩ and reset the BWT range.
  • Sampling rule 1: add a new SA sample at the current position.
  • Sampling rule 2: the current a-run now has 3 samples: delete the sample in the middle (1).

#aababab
aababab#
ab#aabab  2
abab#aab  4
ababab#a
b#aababa  1
bab#aaba  3
babab#aa  5

SLIDE 14

Example: T = bababaa
LZ77(T) = ⟨−, 0, b⟩ ⟨−, 0, a⟩ ⟨0, 4, a⟩

Finish!

#aababab
aababab#
ab#aabab  2
abab#aab  4
ababab#a  6
b#aababa  1
bab#aaba
babab#aa  5
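As a sanity check on the final parse, the three phrases can be decoded back into T. The sketch below assumes 0-based source positions and uses `None` where the slides write −; the function name `lz77_decode` is hypothetical.

```python
def lz77_decode(phrases):
    """Rebuild the text from (occ, length, char) triples."""
    out = []
    for occ, length, char in phrases:
        for offset in range(length):
            # Copy one character at a time so that the source may
            # overlap the part of the phrase already written.
            out.append(out[occ + offset])
        out.append(char)
    return "".join(out)


print(lz77_decode([(None, 0, "b"), (None, 0, "a"), (0, 4, "a")]))  # bababaa
```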

SLIDE 15

Data structures

Important: at each stage, ≤ 2r SA samples.

Result: LZ77 can be computed in O(r) space (words) and O(n log r) time.

Simon G. does not like big-O notation... r(4 log(n/r) + 2 log n + log σ)(1 + o(1)) bits
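To get a feel for the explicit bound, the leading term can be evaluated directly. The sketch below assumes base-2 logarithms and ignores the (1 + o(1)) factor; the function name `rle_lz77_bits` and the toy parameter values are hypothetical and are not measurements from the talk.

```python
import math


def rle_lz77_bits(n, r, sigma):
    """Leading term r * (4*log2(n/r) + 2*log2(n) + log2(sigma)), in bits."""
    return r * (4 * math.log2(n / r) + 2 * math.log2(n) + math.log2(sigma))


# Toy values: n = 1024 characters, r = 4 runs, alphabet of size 4.
print(rle_lz77_bits(n=1024, r=4, sigma=4))   # 216.0
```

With these values the three terms are 4·8, 2·10 and 2 bits per run, so the total is 4 · 54 = 216 bits.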

SLIDE 16

Applications:

  • The lz-rlbwt index can be built in O(r + z) words of space and O(n · (log r + log z)) time.
  • Conversion between compressed formats/compressed indexes: next slide ...

SLIDE 17

Conversion between compressed formats

$\mathrm{LZ77} = \langle \pi_i, \lambda_i, c_i \rangle_{i=1,\dots,z}$
$\mathrm{RLBWT} = \langle \lambda_i, c_i \rangle_{i=1,\dots,r}$

Results

  • We can compute RLBWT → LZ77 in O(n log r) time and O(r) words of space.
  • We can compute LZ77 → RLBWT in O(n(log r + log z)) time and O(r + z) words of space (not discussed here).

SLIDE 18

The DYNAMIC library

github.com/nicolaprezza/DYNAMIC

SLIDE 19

Theoretical bounds, two examples:

  • SPSI (searchable partial sums with inserts): sequence s_1, ..., s_m of total sum M.
      Space: ≈ 1.3 · m · log(M/m) bits
      O(log m)-time sum, search, update, and insert
  • Run-length encoded string with r runs:
      Space: ≈ r · (1.1 · log |Σ| + 2.6 · log(n/r)) bits
      O(log r)-time rank, select, access, and insert
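The partial-sum operations can be illustrated with a Fenwick (binary indexed) tree. This is a swapped-in textbook structure, not the DYNAMIC library's implementation: it supports sum, search, and update in O(log m) time but has no insert, and it makes no attempt at the succinct space bound above. The class and method names are hypothetical, and search(x) is assumed here to return the smallest i with s_1 + ... + s_i ≥ x, for non-negative entries and x ≥ 1.

```python
class FenwickSPSI:
    """Partial sums over s_1..s_m (1-based) without insert support."""

    def __init__(self, values):
        self.m = len(values)
        self.tree = [0] * (self.m + 1)
        for i, v in enumerate(values, start=1):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to s_i."""
        while i <= self.m:
            self.tree[i] += delta
            i += i & -i

    def sum(self, i):
        """Return s_1 + ... + s_i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, x):
        """Smallest i with sum(i) >= x (m + 1 if the total is below x)."""
        pos, step = 0, 1 << self.m.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.m and self.tree[nxt] < x:
                pos = nxt              # the whole block stays below x: skip it
                x -= self.tree[nxt]
            step >>= 1
        return pos + 1


spsi = FenwickSPSI([3, 1, 4, 1, 5])
print(spsi.sum(3), spsi.search(5))   # 8 3
spsi.update(2, 10)                   # s_2 becomes 11
print(spsi.sum(3), spsi.search(5))   # 18 2
```

Supporting insert in logarithmic time requires a balanced tree over the sequence rather than a fixed-size array, which is why the Fenwick tree only covers three of the four operations listed on the slide.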

SLIDE 20

LZ77 construction algorithms: benchmark

File          Size (MB)   7-Zip-compressed size (MB)   Rate
cere          440.0       8.10                         0.0184
para          410.0       9.80                         0.0239
influenzae    148.0       2.50                         0.0169
escherichia   108.0       7.10                         0.0657
sdsl          1024.0      0.60                         0.0006
samtools      1024.0      1.20                         0.0012
boost         1024.0      0.20                         0.0002
bwa           419.0       0.38                         0.0009
Einstein      1024.0      1.60                         0.0016
earth         1024.0      1.70                         0.0017
Bush          1024.0      1.90                         0.0019
wikipedia     1024.0      2.40                         0.0023

SLIDE 21

[Figure: one panel per file (cere, para, influenzae, escherichia, sdsl, samtools, boost, bwa, einstein, earth, bush, wikipedia). Axes: Time (log10(s)) and RAM (log10(MB)). Legend: ISA6r, KKP1s, LZscan, h0-lz77, rle-lz77-1, rle-lz77-2, plain size, 7-zip.]

SLIDE 22

lz-rlbwt: implementation

C++ implementations of the lz-rlbwt index (using SDSL):

  • https://github.com/nicolaprezza/lz-rlbwt
  • https://github.com/nicolaprezza/lz-rlbwt-sparse

SLIDE 23

Variants

We propose 3 variants of the index: full, bidirectional, sparse.

RLBWT sparsification

We implemented the RLBWT (with SDSL) using sparsification on the gap-encoded bitvectors: (1 + ε) r log(n/r) + r log σ bits of space ⇒ half the space of the RLCSA.

SLIDE 24

Full index

RLBWT(T), $\mathrm{RLBWT}(\overleftarrow{T})$, 4-sided and 2-sided range structures, subset of suffix-tree nodes.

Theorem: the lz-rlbwt-f index takes

$\big(6z \log n + (2 + \epsilon)\, r \log(n/r) + 2r \log \sigma\big) \cdot (1 + o(1))$

bits of space and supports:

  • Count in O(m · (log(n/r) + log σ)) time
  • Locate in O((m + occ) · log n) time

for any constant 0 < ε ≤ 1 (RLBWT sparsification).

SLIDE 25

Bidirectional index

RLBWT(T), 4-sided and 2-sided range structures, subset of suffix-tree nodes.

Theorem: the lz-rlbwt-b index takes

$\big(6z \log n + (1 + \epsilon)\, r \log(n/r) + r \log \sigma\big) \cdot (1 + o(1))$

bits of space and supports:

  • Count in O(m · (log(n/r) + log σ)) time
  • Locate in $O\big(m^2 \sigma \log(n/r) + (m + occ) \cdot \log n\big)$ time

for any constant 0 < ε ≤ 1.

SLIDE 26

Sparse index

RLBWT(T), 2-sided range structure, sparse variant of LZ77: after each phrase, skip d characters before opening a new phrase ⇒ z_d ≤ z phrases.

Theorem: the lz-rlbwt-s index takes

$\big(4 z_d \log n + (1 + \epsilon)\, r \log(n/r) + r \log \sigma\big) \cdot (1 + o(1))$

bits of space and supports:

  • Count in O(m(log(n/r) + log σ)) time
  • Locate in O((occ + 1) · (m + d) · log n) time

for any constants 0 < ε ≤ 1 and d ≥ 0.
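One possible reading of the sparse parse is sketched below in Python: a greedy, quadratic-time LZ77-style parser that leaves d characters uncovered after every phrase. This is an interpretation of the one-line description on the slide, not code from lz-rlbwt-sparse. The function name `sparse_lz77` and the tuple layout (start, occ, length, char) are hypothetical; the start position is stored because phrases are no longer contiguous.

```python
def sparse_lz77(text, d):
    """Greedy LZ77-style parse that skips d characters after each phrase.

    Returns (start, occ, length, char) tuples with 0-based positions;
    occ is None when nothing is copied.
    """
    n = len(text)
    phrases = []
    i = 0
    while i < n:
        best_len, best_occ = 0, None
        max_len = n - i - 1              # reserve one character to close the phrase
        for j in range(i):               # candidate earlier start positions
            k = 0
            while k < max_len and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_occ = k, j
        phrases.append((i, best_occ, best_len, text[i + best_len]))
        i += best_len + 1 + d            # skip d characters before the next phrase
    return phrases


print(len(sparse_lz77("bababaa", 0)))    # 3
print(sparse_lz77("bababaa", 1))
# [(0, None, 0, 'b'), (2, 0, 4, 'a')]
```

With d = 0 the parse coincides with the ordinary greedy factorization; on this toy string d = 1 drops the phrase count from 3 to 2.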

SLIDE 27

Overview LZ77 in RLE space lz-rlbwt, in practice results