Introduction CST design CST in practice
Compressed Suffix Trees in Practice
Simon Gog
Computing and Information Systems The University of Melbourne
February 13th 2013
Outline
1. Introduction: basic data structures, the suffix tree
2. CST design: NAV (tree topology and navigation), CSA (lexicographic information), LCP (longest common prefixes)
3. CST in practice: the sdsl library

Succinct data structures (1)
Data structure D = representation of an object X + operations on X
Example: rank bit vector
Bit vector b of length n, e.g. b = (0,1,0,1,1,0,1,1) with rank values (0,0,1,1,2,3,3,4)
Stored in n bits of space: access b[i] in O(1) time, $rank(i) = \sum_{j=0}^{i-1} b[j]$ in O(n) time
Stored in n + n log n bits of space (all rank values precomputed): access b[i] in O(1) time, rank(i) in O(1) time
Succinct data structure D: the space of D is close to the information-theoretic lower bound for representing X, while operations can still be performed efficiently.
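The two ends of the trade-off in the rank example can be sketched as follows. This is a minimal illustration, not sdsl code: the names `rank_naive` and `RankTable` are hypothetical, and the table stores one machine word per position rather than log n bits. A succinct rank structure sits between the two, adding only o(n) bits while keeping O(1) queries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive rank: no extra space, O(i) time per query.
std::size_t rank_naive(const std::vector<bool>& b, std::size_t i) {
    assert(i <= b.size());
    std::size_t ones = 0;
    for (std::size_t j = 0; j < i; ++j) ones += b[j] ? 1 : 0;
    return ones;
}

// Precomputed rank: one counter per position (conceptually n log n extra
// bits, here one std::size_t each), O(1) time per query.
class RankTable {
public:
    explicit RankTable(const std::vector<bool>& b) : prefix_(b.size() + 1, 0) {
        for (std::size_t j = 0; j < b.size(); ++j)
            prefix_[j + 1] = prefix_[j] + (b[j] ? 1 : 0);
    }
    // Number of 1 bits in b[0, i).
    std::size_t rank(std::size_t i) const {
        assert(i < prefix_.size());
        return prefix_[i];
    }

private:
    std::vector<std::size_t> prefix_;
};
```

Calling either function on the bit vector of the slide for i = 0, ..., 7 reproduces the rank values listed there.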

Succinct data structures (2)
Can succinct data structures replace classic uncompressed data structures in practice?

Less memory ⇒ fewer CPU cycles?
  Level       Size            Access time
  CPU         ≈ 100 B         ≈ 1 CPU cycle
  L1 cache    ≈ 10 KB         ≈ 5 CPU cycles
  L2 cache    ≈ 512 KB        ≈ 10-20 CPU cycles
  L3 cache    ≈ 1-8 MB        ≈ 20-100 CPU cycles
  DRAM        ≈ 4 GB          ≈ 100-500 CPU cycles
  Disk        ≈ x · 100 GB    ≈ 10^6 CPU cycles

Less memory ⇒ lower costs?
  Instance name                        Main memory    Price per hour
  Micro                                613.0 MB       0.02 US$
  High-Memory Quadruple Extra Large    68.4 GB        2.00 US$
  Pricing of Amazon's Elastic Compute Cloud (EC2) service in July 2011.

Problems:
in theory
develop succinct data structures
in practice
constants in O(1)-time terms are large
complex data structures are hard to implement

The classic index data structure: the suffix tree (ST)
Let T be a text of length n over an alphabet Σ of size σ.
The suffix tree is an index data structure for T (construction in O(n) time).
It can be used to solve many problems in optimal time complexity, e.g. in bioinformatics and data compression.
It uses O(n log n) bits! In practice (ASCII alphabet): ≥ 17 times the size of T.
It cannot handle "The Attack of Massive Data": DNA sequencing data (NGS), ...

Example: ST of T = umulmundumulmum$
[Figure: suffix tree of T; edges are labeled with substrings of T, leaves with the starting positions of the corresponding suffixes.]
n = 16, Σ = {$, d, l, m, n, u}, σ = 6
Classic implementations use pointers, each of size 4 or 8 bytes!
Operations: root(), is_leaf(v), parent(v), degree(v), child(v, c), select_child(v, i), depth(v), edge(v, d), lca(v, w), sl(v), wl(v, c)

CSTs
Goal of a CST implementation: replace the fastest uncompressed ST implementations in different scenarios
(a) both fit in RAM and we measure time
(b) both fit in RAM and we measure resource costs
(c) only the CST fits in RAM and we measure time
Proposals which might work for (a) and (b)
Sadakane's CST (cst_sada)
Fully Compressed Suffix Tree (Russo et al.)
CSTs based on interval representation of nodes (Fischer et

Big picture of CST design
[Figure: the suffix tree of T = umulmundumulmum$ together with the CST components derived from it; figure labels: BWT, excess, 2n bits.]
SA     = 15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5
LCP    = 0 0 0 3 0 1 5 2 2 0 0 4 1 2 6 1
BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
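The suffix array and LCP array of the running example T = umulmundumulmum$ can be recomputed with a brute-force sketch. The function names are hypothetical, and the quadratic algorithms are chosen for brevity only; they are not how a CST library constructs these arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by plain comparison sorting of all suffixes.
// Fine for a 16-character example, far too slow for real inputs.
std::vector<std::size_t> suffix_array_naive(const std::string& t) {
    std::vector<std::size_t> sa(t.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&t](std::size_t a, std::size_t b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// lcp[i] = length of the longest common prefix of the suffixes
// starting at sa[i-1] and sa[i]; lcp[0] = 0.
std::vector<std::size_t> lcp_naive(const std::string& t,
                                   const std::vector<std::size_t>& sa) {
    assert(sa.size() == t.size());
    std::vector<std::size_t> lcp(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); ++i) {
        const std::size_t a = sa[i - 1], b = sa[i];
        std::size_t len = 0;
        while (a + len < t.size() && b + len < t.size() && t[a + len] == t[b + len])
            ++len;
        lcp[i] = len;
    }
    return lcp;
}
```

For T = umulmundumulmum$ the two functions return the SA and LCP rows shown on the slide (15 7 11 3 ... and 0 0 0 3 ...).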

Example: Compressing NAV
[Figure: suffix tree of T = umulmundumulmum$ traversed in depth-first order; over the successive builds of the slide, BPSdfs grows by one parenthesis per step.]
An opening parenthesis is written when the traversal first visits a node, a closing parenthesis when it leaves the node.
BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
Positions of the opening parentheses: 0 1 3 5 6 8 11 12 14 15 16 18 21 23 27 29 30 31 33 36 37 39 40 42 46
Tree: uncompressed O(n log n) bits, compressed 4n bits

NAV data structures (1)
4n bits: [Figure: suffix tree of T = umulmundumulmum$]
BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
Positions of the opening parentheses: 0 1 3 5 6 8 11 12 14 15 16 18 21 23 27 29 30 31 33 36 37 39 40 42 46
2n bits: [Figure: the same suffix tree with its LCP intervals]
BPSsct (one opening parenthesis per LCP entry, annotated with the LCP value where it is nonzero) = ( ( ( ( 3 )( ( 1 ( 5 )( 2 ( 2 )))( ( ( 4 )( 1 ( 2 ( 6 ))( 1 ))))))))
LCP intervals (ℓ-[i, j]): 0-[0, 15], 3-[2, 3], 1-[4, 8], 2-[5, 8], 5-[5, 6], 1-[10, 15], 4-[10, 11], 2-[12, 14], 6-[13, 14]
Plus o(n) bits to answer find_open(i), find_close(i), enclose(i), double_enclose(i, j), rank(i, c), select(i, c), ... in constant time
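The 2n-bit sequence BPSsct can be reproduced from the LCP array with a stack. The sketch below is a reconstruction that matches the parenthesis pattern on the slide for the running example; the function name `bps_sct` is hypothetical, and the sketch is not taken from the sdsl sources.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// For every LCP entry: close all stacked entries that are strictly larger,
// then open a parenthesis for the current entry. Remaining entries are
// closed at the end, so the result has exactly 2n parentheses.
std::string bps_sct(const std::vector<std::size_t>& lcp) {
    std::string bps;
    std::vector<std::size_t> stack;
    for (std::size_t value : lcp) {
        while (!stack.empty() && stack.back() > value) {
            stack.pop_back();
            bps += ')';
        }
        stack.push_back(value);
        bps += '(';
    }
    bps.append(stack.size(), ')');
    assert(bps.size() == 2 * lcp.size());
    return bps;
}
```

Applied to LCP = 0 0 0 3 0 1 5 2 2 0 0 4 1 2 6 1, the function emits 32 parentheses in the same order as the annotated sequence above.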

NAV data structures (2)
Comparison of different NAV structures

                        cst_sada       cst_sct             cst_sct3
  space in bits         4n + o(n)      2n + o(n)           3n + o(n)
  root()                O(1)           O(1)                O(1)
  degree(v)             O(σ)           O(t_LCP log σ)      O(1)
  depth(v)              O(t_LCP)       O(t_LCP)            O(t_LCP)
  parent(v)             O(1)           O(t_LCP log σ)      O(1)
  select_child(v, i)    O(i)           O(t_LCP)            O(1)
  sibling(v)            O(1)           O(t_LCP)            O(1)
  sl(v), lca(v, w)      O(1)           O(t_LCP log σ)      O(1)
  child(v, c)           O(t_SA σ)      O(t_SA log σ)       O(t_SA log σ)

Example operations: select_leaf(i) and lca(v, w) on NAV
[Figure: suffix tree of T = umulmundumulmum$ with the leaves at BPSdfs positions 8 and 12 highlighted.]
BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
select_leaf(4) = select(4, '10') = 8
select_leaf(5) = select(5, '10') = 12
lca(8, 12) = double_enclose(8, 12) = 0
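The three values in this example can be checked with linear scans over the parenthesis string. This is a sketch for illustration only: a real NAV structure answers these queries in constant time with o(n) extra bits, and the helper names below merely mirror the operation names on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Position of the i-th (i >= 1) occurrence of "()", i.e. of the i-th leaf.
std::size_t select_leaf(const std::string& bps, std::size_t i) {
    assert(i >= 1);
    std::size_t seen = 0;
    for (std::size_t p = 0; p + 1 < bps.size(); ++p)
        if (bps[p] == '(' && bps[p + 1] == ')' && ++seen == i) return p;
    return std::string::npos;
}

// Position of the closing parenthesis matching the opening one at `open`.
std::size_t find_close(const std::string& bps, std::size_t open) {
    long excess = 0;
    for (std::size_t p = open; p < bps.size(); ++p) {
        excess += (bps[p] == '(') ? 1 : -1;
        if (excess == 0) return p;
    }
    return std::string::npos;
}

// Opening parenthesis of the tightest pair enclosing both i and j,
// where i < j are opening parentheses of disjoint subtrees.
std::size_t double_enclose(const std::string& bps, std::size_t i, std::size_t j) {
    long depth = 0;  // closing parentheses still unmatched while scanning left
    for (std::size_t k = i; k-- > 0;) {
        if (bps[k] == ')') ++depth;
        else if (depth > 0) --depth;
        else if (find_close(bps, k) > j) return k;  // ancestor of i that also covers j
    }
    return std::string::npos;
}
```

With the BPSdfs string of the running example, `select_leaf` gives 8 and 12 for the fourth and fifth leaf, and `double_enclose(bps, 8, 12)` gives 0, the root.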

Virtues of a CSA based on BWT
Small size: $|CSA| = |BWT| + \frac{n \log n}{s_{SA}}$ bits, where |BWT| can be chosen to be
$n \log \sigma$ bits
$n H_0(T)$ bits
$n H_k(T) + O(\sigma^k)$ bits
Pattern matching in time O(|P| log σ) (even O(|P|) for σ ∈ polylog(n)) by backward search (Ferragina & Manzini)
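As a rough illustration of the sampling term in the size formula, consider a 200 MB text. The sampling rate $s_{SA} = 32$ is an assumption chosen for this example, not a value taken from the slides.

```latex
n = 200 \cdot 2^{20} \approx 2.1 \times 10^{8}, \qquad \lceil \log_2 n \rceil = 28,
\qquad
\frac{n \log n}{s_{SA}} \approx \frac{28\,n}{32} = 0.875\,n \ \text{bits}
```

Under these assumptions the samples cost less than one bit per character, compared with about 28 bits per character for the uncompressed suffix array.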

Hk of the Pizza&Chili 200MB test cases

        dblp.xml         dna              english          proteins         rand_k128        sources
   k    Hk     CT/n      Hk     CT/n      Hk     CT/n      Hk     CT/n      Hk     CT/n      Hk     CT/n
   0    5.257  0.0000    1.974  0.0000    4.525  0.0000    4.201  0.0000    7.000  0.0000    5.465  0.0000
   1    3.479  0.0000    1.930  0.0000    3.620  0.0000    4.178  0.0000    7.000  0.0000    4.077  0.0000
   2    2.170  0.0000    1.920  0.0000    2.948  0.0001    4.156  0.0000    6.993  0.0001    3.102  0.0000
   3    1.434  0.0007    1.916  0.0000    2.422  0.0005    4.066  0.0001    5.979  0.0100    2.337  0.0012
   4    1.045  0.0043    1.910  0.0000    2.063  0.0028    3.826  0.0011    0.666  0.6939    1.852  0.0082
   5    0.817  0.0130    1.901  0.0000    1.839  0.0103    3.162  0.0173    0.006  0.9969    1.518  0.0250
   6    0.705  0.0265    1.884  0.0001    1.672  0.0265    1.502  0.1742    0.000  1.0000    1.259  0.0509
   7    0.634  0.0427    1.862  0.0001    1.510  0.0553    0.340  0.4506    0.000  1.0000    1.045  0.0850
   8    0.574  0.0598    1.834  0.0004    1.336  0.0991    0.109  0.5383    0.000  1.0000    0.867  0.1255
   9    0.537  0.0773    1.802  0.0013    1.151  0.1580    0.074  0.5588    0.000  1.0000    0.721  0.1701
  10    0.508  0.0955    1.760  0.0051    0.963  0.2292    0.061  0.5699    0.000  1.0000    0.602  0.2163
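Values of this kind can be computed directly from a text. The sketch below uses one common definition of the k-th order empirical entropy, $H_k(T) = \frac{1}{n} \sum_{w \in \Sigma^k} |T_w| \, H_0(T_w)$, where $T_w$ collects the characters that follow the context w in T. The function names are hypothetical, and whether the slides use exactly this variant of the definition is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Zeroth-order empirical entropy in bits per character.
double empirical_entropy_h0(const std::string& t) {
    if (t.empty()) return 0.0;
    std::array<std::size_t, 256> count{};
    for (unsigned char c : t) ++count[c];
    const double n = static_cast<double>(t.size());
    double h0 = 0.0;
    for (std::size_t occ : count) {
        if (occ == 0) continue;
        const double p = static_cast<double>(occ) / n;
        assert(p > 0.0 && p <= 1.0);
        h0 -= p * std::log2(p);
    }
    return h0;
}

// k-th order empirical entropy: weighted H_0 of the characters
// following each length-k context. For k = 0 this equals H_0.
double empirical_entropy_hk(const std::string& t, std::size_t k) {
    if (t.empty()) return 0.0;
    std::map<std::string, std::string> followers;
    for (std::size_t i = 0; i + k < t.size(); ++i)
        followers[t.substr(i, k)] += t[i + k];
    double total = 0.0;
    for (const auto& entry : followers)
        total += static_cast<double>(entry.second.size()) * empirical_entropy_h0(entry.second);
    return total / static_cast<double>(t.size());
}
```

The map-based context collection is only practical for small inputs; it is meant to make the definition concrete, not to reproduce the 200 MB measurements.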

Backward search
[Figure: panels (a), (b) and (c) show the table below at three successive steps of a backward search.]

   i   TBWT   sorted suffixes (first character = F)
   0   m      $
   1   n      dumulmum$
   2   u      lmum$
   3   u      lmundumulmum$
   4   u      m$
   5   u      mulmum$
   6   u      mulmundumulmum$
   7   l      mum$
   8   l      mundumulmum$
   9   u      ndumulmum$
  10   m      ulmum$
  11   m      ulmundumulmum$
  12   m      um$
  13   d      umulmum$
  14   $      umulmundumulmum$
  15   m      undumulmum$
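Backward search processes the pattern from its last character to its first and narrows a suffix-array interval with two rank queries per character. The sketch below is a plain illustration of that idea: `rank_char` is a linear scan standing in for a wavelet tree, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Occurrences of c in bwt[0, i). Linear scan; a CSA would use a wavelet tree.
std::size_t rank_char(const std::string& bwt, unsigned char c, std::size_t i) {
    assert(i <= bwt.size());
    const auto end = bwt.begin() + static_cast<std::string::difference_type>(i);
    return static_cast<std::size_t>(std::count(bwt.begin(), end, static_cast<char>(c)));
}

// Half-open suffix-array interval [lo, hi) of all suffixes that start
// with `pattern`; lo == hi means the pattern does not occur.
std::pair<std::size_t, std::size_t> backward_search(const std::string& bwt,
                                                    const std::string& pattern) {
    // C[c] = number of characters in the text that are smaller than c.
    std::array<std::size_t, 257> C{};
    for (unsigned char c : bwt) ++C[c + 1];
    for (std::size_t c = 1; c < C.size(); ++c) C[c] += C[c - 1];

    std::size_t lo = 0, hi = bwt.size();
    for (std::size_t k = pattern.size(); k-- > 0 && lo < hi;) {
        const unsigned char c = static_cast<unsigned char>(pattern[k]);
        lo = C[c] + rank_char(bwt, c, lo);
        hi = C[c] + rank_char(bwt, c, hi);
    }
    return {lo, hi};
}
```

On the TBWT of the running example, searching for "um" narrows the interval to rows 12 to 14 of the table, which are exactly the suffixes starting with "um".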
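
Wavelet tree: rank for character sequences
[Figure: two wavelet trees over TBWT = mnuuuuullummmd$m; each node stores one bit per character of its sequence.]
(a) balanced shape
  root: mnuuuuullummmd$m → 1111111001111001
  0-child: lld$ → 1100, with 0-child d$ → 10
  1-child: mnuuuuuummmm → 001111110000, with 0-child mnmmmm → 010000
(b) alternative shape in which the most frequent character u is separated at the root
  root: mnuuuuullummmd$m → 0011111001000000
  0-child: mnllmmmd$m → 1000111001
  below it: nlld$ → 11100, with children d$ → 10 and nll → 100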
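A balanced wavelet tree with the same splits as panel (a) can be sketched as follows. This is an illustrative toy, not the sdsl `wt` class: each node keeps a full prefix-count array where a real implementation would keep a bit vector with o(n)-bit rank support, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class WaveletTree {
    struct Node {
        std::vector<std::size_t> ones;  // ones[j] = number of 1 bits among the first j bits
        std::unique_ptr<Node> left, right;
    };

    // Builds the node for the symbols sigma[lo, hi); s contains only those symbols.
    static std::unique_ptr<Node> build(const std::string& s, const std::string& sigma,
                                       std::size_t lo, std::size_t hi) {
        if (hi - lo <= 1) return nullptr;  // a single symbol needs no bits
        const std::size_t mid = lo + (hi - lo + 1) / 2;
        std::unique_ptr<Node> node(new Node);
        node->ones.assign(1, 0);
        std::string left, right;
        for (char c : s) {
            const bool bit = c >= sigma[mid];
            (bit ? right : left) += c;
            node->ones.push_back(node->ones.back() + (bit ? 1 : 0));
        }
        node->left = build(left, sigma, lo, mid);
        node->right = build(right, sigma, mid, hi);
        return node;
    }

    std::string alphabet_;
    std::size_t size_;
    std::unique_ptr<Node> root_;

public:
    explicit WaveletTree(const std::string& s) : alphabet_(s), size_(s.size()) {
        std::sort(alphabet_.begin(), alphabet_.end());
        alphabet_.erase(std::unique(alphabet_.begin(), alphabet_.end()), alphabet_.end());
        root_ = build(s, alphabet_, 0, alphabet_.size());
    }

    // Number of occurrences of c among the first i characters: one step per level.
    std::size_t rank(char c, std::size_t i) const {
        assert(i <= size_);
        const auto it = std::lower_bound(alphabet_.begin(), alphabet_.end(), c);
        if (it == alphabet_.end() || *it != c) return 0;
        const std::size_t sym = static_cast<std::size_t>(it - alphabet_.begin());
        std::size_t lo = 0, hi = alphabet_.size();
        const Node* node = root_.get();
        while (hi - lo > 1) {
            const std::size_t mid = lo + (hi - lo + 1) / 2;
            const bool bit = sym >= mid;
            const std::size_t ones = node->ones[i];
            i = bit ? ones : i - ones;  // map position into the child sequence
            node = bit ? node->right.get() : node->left.get();
            if (bit) lo = mid; else hi = mid;
        }
        return i;
    }

    // Bits stored at the root, as a string of '0' and '1'.
    std::string root_bits() const {
        std::string bits;
        if (!root_) return bits;
        for (std::size_t j = 0; j + 1 < root_->ones.size(); ++j)
            bits += (root_->ones[j + 1] > root_->ones[j]) ? '1' : '0';
        return bits;
    }
};
```

Built over mnuuuuullummmd$m, `root_bits()` yields the root bit vector of panel (a), and `rank` answers the character-rank queries that backward search needs in O(log σ) steps.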

Example: Compressing CSA (practical approach)

   i   SA   LF   TBWT   T[SA[i]..n-1]
   0   15    4   m      $
   1    7    9   n      dumulmum$
   2   11   10   u      lmum$
   3    3   11   u      lmundumulmum$
   4   14   12   u      m$
   5    9   13   u      mulmum$
   6    1   14   u      mulmundumulmum$
   7   12    2   l      mum$
   8    4    3   l      mundumulmum$
   9    6   15   u      ndumulmum$
  10   10    5   m      ulmum$
  11    2    6   m      ulmundumulmum$
  12   13    7   m      um$
  13    8    1   d      umulmum$
  14    0    0   $      umulmundumulmum$
  15    5    8   m      undumulmum$

Example: Compressing CSA (practical approach)
[Figure: SA = 15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5 at positions 0 to 15, with the sampled entries highlighted and the (TBWT, F) character pairs of each row.]
s_SA = 3: only the SA values that are multiples of 3 are stored.
Recovering an unsampled value by following LF: SA[13] = SA[1] + 1 = SA[9] + 2 = 6 + 2 = 8
Access LF[i] in time O(log σ)
Access CSA[i] in time O(s_SA log σ)
SA: uncompressed n log n bits
Compressed CSA = TBWT + SA samples: $n \log \sigma + o(n \log \sigma)$ bits $+ \frac{n \log n}{s_{SA}}$ bits
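The SA[13] walk-through can be replayed in code: follow LF until a sampled entry is reached, then add the number of steps taken. This is a simplified sketch with hypothetical names; LF is stored explicitly and the rank over the sample marks is a linear count, whereas a CSA would derive LF from the wavelet tree and use a rank structure.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// LF[i] = C[c] + (occurrences of c in bwt[0, i)), with c = bwt[i].
std::vector<std::size_t> lf_mapping(const std::string& bwt) {
    std::array<std::size_t, 257> C{};
    for (unsigned char c : bwt) ++C[c + 1];
    for (std::size_t c = 1; c < C.size(); ++c) C[c] += C[c - 1];
    std::vector<std::size_t> lf(bwt.size());
    for (std::size_t i = 0; i < bwt.size(); ++i)
        lf[i] = C[static_cast<unsigned char>(bwt[i])]++;  // C doubles as running counter
    return lf;
}

struct SaSamples {
    std::vector<bool> marked;         // marked[i] is true iff SA[i] is a multiple of s
    std::vector<std::size_t> values;  // the sampled SA values, in suffix-array order
};

SaSamples sample_sa(const std::vector<std::size_t>& sa, std::size_t s) {
    assert(s > 0);
    SaSamples out;
    out.marked.assign(sa.size(), false);
    for (std::size_t i = 0; i < sa.size(); ++i) {
        if (sa[i] % s == 0) {
            out.marked[i] = true;
            out.values.push_back(sa[i]);
        }
    }
    return out;
}

// Recovers SA[i] from the samples: each LF step moves to the suffix that
// starts one position earlier, so the sampled value plus the step count is SA[i].
std::size_t sa_access(const SaSamples& samples, const std::vector<std::size_t>& lf,
                      std::size_t i) {
    assert(i < lf.size());
    std::size_t steps = 0;
    while (!samples.marked[i]) {
        i = lf[i];
        ++steps;
    }
    const auto end = samples.marked.begin() + static_cast<std::ptrdiff_t>(i);
    const std::size_t r = static_cast<std::size_t>(std::count(samples.marked.begin(), end, true));
    return samples.values[r] + steps;
}
```

The loop terminates after at most s - 1 steps because the entry with SA value 0 is always sampled. For the running example with s = 3, `sa_access` on position 13 takes two LF steps to the sampled value 6 and returns 8, as on the slide.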

Overview of LCP data structures

  data structure       uses        access               memory in bits
  lcp_uncompressed                                      n log n
  lcp_support_sada     CSA         O(t_SA)              2n + o(n)
  lcp_kurtz                                             8 n_1 + 2 n_2 log n
  ...                  ...         ...                  ...
  lcp_support_tree     NAV         O(log σ')            H_0 q_1 + q_2 log n
  lcp_support_tree2    NAV & LF    O(s_LCP log σ')      H_0 q_1 + (q_2 log n)/s_LCP

with n_1 + n_2 = n and q_1 + q_2 = q < n, where q is the number of inner nodes of the ST

Runtime for random access to the LCP array (1)
[Figure: two panels, dblp.xml.200MB and proteins.200MB. x-axis: memory in bits per character (4 to 24); y-axis: time in nanoseconds per operation (10 to 10^4, logarithmic). Series: lcp_uncompressed, lcp_wt, lcp_dac, lcp_kurtz, lcp_support_sada, lcp_support_tree, lcp_support_tree2.]

Runtime for random access to the LCP array (2)
[Figure: same axes and series; panels dna.200MB and rand_k128.200MB.]

Runtime for random access to the LCP array (3)
[Figure: same axes and series; panels english.200MB and sources.200MB.]

The succinct data structure library sdsl
Provides basic and advanced succinct data structures
Easy to use (very similar to the C++ STL)
Fast and space-efficient construction of data structures
64-bit implementation
Well-optimized implementation (e.g. now using the hardware POPCOUNT operation, ...)
Easy configuration of myriads of CSAs and CSTs with many time-space trade-offs
Fast prototyping of other complex succinct data structures

[Figure: sdsl class diagram, shown twice with a different configuration highlighted: cst_sada<csa_sada<>,lcp_.. and cst_sct3<csa_wt<wt_huff<>.. Edge types: "has member of class type" and "has template parameter".]
Class families in the diagram:
vector: int_vector, enc_vector, rrr_vector
rank_support: rank_support_v, rank_support_v5, rrr_rank_support
select_support: select_support_mcl, select_support_bs, rrr_select_support
wavelet_tree: wt, wt_int, wt_huff, wt_rlmn, wt_rlg
csa: csa_uncompressed, csa_sada, csa_wt
lcp: lcp_uncompressed, lcp_dac, lcp_wt, lcp_kurtz, lcp_support_sada, lcp_support_tree, lcp_support_tree2
bp_support: bp_support_g, bp_support_gg, bp_support_sada
rmq: rmq_support_sparse_table, rmq_succinct_sct, rmq_succinct_sada
cst: cst_sada, cst_sct3

CST space in practice (for english.200MB)
[Figures: one space breakdown chart per configuration; the labeled values are listed below.]

cst_sada<csa_wt<wt_huff<> >,lcp_dac<> >
CSA: 180.8 MB (wt: 148 MB [data: 114 MB, rank: 7.1 MB, select 1: 14.4 MB, select 0: 12.5 MB], sa_sample: 21.9 MB, isa_sample: 10.9 MB)
lcp: 215 MB (lcp values: 170.4 MB, rank: 2.6 MB)
nav: 144.4 MB (BPSdfs(bit_vector): 80.9 MB, bp_support: 37.1 MB [small block: 5.7 MB, medium block: 1.6 MB, bp rank: 20.2 MB, bp select: 9.6 MB], rank_support10: 20.2 MB, select_support10: 6.1 MB)

cst_sada<csa_wt<wt_huff<> >,lcp_support_tree2<> >
CSA: 180.8 MB (wt: 148 MB [data: 114 MB, rank: 7.1 MB, select 1: 14.4 MB, select 0: 12.5 MB], sa_sample: 21.9 MB, isa_sample: 10.9 MB)
lcp: 96.1 MB (small lcp: 72.1 MB [data: 67.9 MB, rank: 4.2 MB], big lcp: 24 MB)
nav: 129.2 MB (bp: 80.9 MB, bp_support: 21.9 MB [small block: 5.7 MB, medium block: 1.6 MB, bp rank: 5.1 MB, bp select: 9.6 MB], rank_support10: 20.2 MB, select_support10: 6.1 MB)

cst_sada<csa_wt<wt_huff<rrr_vector<> > >,lcp_support_tree2<> >
CSA: 108.4 MB (wt: 75.6 MB [data: 75.6 MB {bt: 30.4 MB, btnr: 31.4 MB, btnrp: 6.7 MB, rank samples: 6.9 MB, invert: 0.2 MB}, rank: 0 MB, select 1: 0 MB, select 0: 0 MB], sa_sample: 21.9 MB, isa_sample: 10.9 MB)
lcp: 96.1 MB (small lcp: 72.1 MB [data: 67.9 MB, rank: 4.2 MB], big lcp: 24 MB)
nav: 129.2 MB (BPSdfs(bit_vector): 80.9 MB, bp_support: 21.9 MB [small block: 5.7 MB, medium block: 1.6 MB, bp rank: 5.1 MB, bp select: 9.6 MB], rank_support10: 20.2 MB, select_support10: 6.1 MB)

cst_sct3<csa_wt<wt_huff<rrr_vector<> > >,lcp_support_tree2<> >
CSA: 108.4 MB (wt: 75.6 MB [data: 75.6 MB {bt: 30.4 MB, btnr: 31.4 MB, btnrp: 6.7 MB, rank samples: 6.9 MB, invert: 0.2 MB}, rank: 0 MB, select 1: 0 MB, select 0: 0 MB], sa_sample: 21.9 MB, isa_sample: 10.9 MB)
lcp: 96.1 MB (small lcp: 72.1 MB [data: 67.9 MB, rank: 4.2 MB], big lcp: 24 MB)
nav: 88.4 MB (bp: 50 MB, bp_support: 13.4 MB [small block: 3.5 MB, medium block: 0.8 MB, remaining entries cut off in the source])

Experimental setup
0 ≙ cst_sada<csa_sada<>, lcp_dac<> >
1 ≙ cst_sada<csa_sada<>, lcp_support_tree2<> >
2 ≙ cst_sada<csa_wt<>, lcp_dac<> >
3 ≙ cst_sada<csa_wt<>, lcp_support_tree2<> >
4 ≙ cst_sct3<csa_sada<>, lcp_dac<> >
5 ≙ cst_sct3<csa_sada<>, lcp_support_tree2<> >
6 ≙ cst_sct3<csa_wt<>, lcp_dac<> >
7 ≙ cst_sct3<csa_wt<>, lcp_support_tree2<> >
All configurations use the same basic data structures, i.e. it's a very fair comparison.

Runtime of operations of cst_sada
[Figure: bar chart of the time per operation, with two time axes (0 to 60 µs and 0 to 6 µs). Operations: mstats, dfs and depth(v), dfs and id(v), lca(v, w), select_child(v, 1), child(v, c), sl(v), sibling(v), parent(v), depth(v)*, depth(v), id(v), lcp[i], psi[i], csa[i].]

Runtime of operations of cst_sct3
[Figure: same operations and axes as for cst_sada.]

Time-space trade-off for select_child(v, 1)
[Figure: two panels, english.200MB and sources.200MB. x-axis: memory in bits per character (4 to 32); y-axis: time in nanoseconds per operation (100 to 200). Points are labeled 1 to 8 by configuration.]

Time-space trade-off for child(v, c)
[Figure: two panels, english.200MB and sources.200MB. x-axis: memory in bits per character (4 to 32); y-axis: time in nanoseconds per operation (20000 to 60000). Points are labeled 1 to 8 by configuration.]

Construction of a CST
    #include <sdsl/suffixtrees.hpp>
    #include <sdsl/util.hpp>

    using namespace sdsl;
    typedef cst_sct3<> tCST;

    int main(int argc, char* argv[]) {
        if (argc < 2) return 1;       // expects the path of the text file as first argument
        tCST cst;
        construct_cst(argv[1], cst);  // builds the CST for the text stored in argv[1]
    }

Runtime for construction (prefixes of english text)
[Figure: x-axis: text size in MB (100 to 500); y-axis: time in seconds (10 to 10^4, logarithmic). Series: cstV, ST, cst_sada (0), cst_sct3 (4).]

CST construction: resources comparison
[Figure: time in seconds (1000 to 5000) against memory in bytes per character for the construction phases CSA, LCP and NAV, comparing sdsl (2010) with the first CST implementation (2007).]
Text: english.200MB, σ: 226
CST type: cst_sada, CSA type: csa_wt, LCP type: lcp_sada

Detailed resources for the construction of CSTs
[Figure: three panels for text english.200MB, with CST types cst_sada, cst_sada and cst_sct3, each with CSA type csa_wt. x-axis: time in seconds (100 to 400); y-axis: memory in bytes per input character (4 to 8).]
Phase labels with their printed values, one value per panel where available (the assignment of values to panels is not recoverable from the text):
SA: 56.7, 56.5, 57.4
TBWT: 29.5, 29.6, 29.4
wt: 9.9, 9.9, 9.9
ISA: 33.9
CSA: 106.8, 108.5, 108.3
LCP: 98.8, 65.9, 67.2
NAV: 141.5, 144.6, 9.9
bp_support: 24.3, 23.5, 2.3

Depth first search traversal in a CST
    #include <iostream>

    // Sums the string depth of all inner nodes visited by the CST iterator.
    template<class Cst>
    void test_cst_dfs_iterator_and_depth(Cst& cst) {
        typedef typename Cst::const_iterator iterator;
        long long cnt = 0;
        for (iterator it = cst.begin(); it != cst.end(); ++it) {
            if (!cst.is_leaf(*it))
                cnt += cst.depth(*it);
        }
        std::cout << cnt << std::endl;
    }

Runtime of dfs on CST (prefixes of english text)
[Figure: x-axis: text size in MB (100 to 500); y-axis: time in nanoseconds (10^2 to 10^4, logarithmic). Series: cstV, ST, cst_sada (0), cst_sct3 (4).]

Conclusion
CSTs ...
... can be built quickly and space-efficiently
... provide a rich set of functionality
fast operations: basic navigation, access to LF or Ψ
slow operations: child(v, c), access to SA
You can use the sdsl library to configure a CST that fits your needs.

Runtime of int_vector access
[Figure: time in nanoseconds per operation (20 to 80) for random access read, random access write and sequential write. Containers: int_vector<32>, bit_vector, int_vector<> v(..,..,32), int_vector<> v(..,..,27).]

Operation runtime of basic data structures
[Figure: time in nanoseconds per operation (100 to 300), random access. Structures: int_vector<32>, rank_support_v, rank_support_v5, select_support_mcl.]

Runtime of CSA access (1)
[Figure: two panels, dblp.xml.200MB and proteins.200MB. x-axis: memory in bits per character (2 to 32); y-axis: time in nanoseconds per operation (10^2 to 10^6, logarithmic). Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]

Runtime of CSA access (2)
[Figure: same axes and series; panels dna.200MB and rand_k128.200MB.]

Runtime of CSA access (3)
[Figure: same axes and series; panels english.200MB and sources.200MB.]

Runtime of CSA operations (1)
[Figure: same axes and series; panels dblp.xml.200MB and proteins.200MB.]

Runtime of CSA operations (2)
[Figure: same axes and series; panels dna.200MB and rand_k128.200MB.]

Runtime of CSA operations (3)
[Figure: same axes and series; panels english.200MB and sources.200MB.]