

slide-1
SLIDE 1

Introduction CST design CST in practice

Compressed Suffix Trees in Practice

Simon Gog

Computing and Information Systems The University of Melbourne

February 13th 2013

slide-2
SLIDE 2

Outline

1. Introduction
   Basic data structures
   The suffix tree

2. CST design
   NAV (tree topology and navigation)
   CSA (lexicographic information)
   LCP (longest common prefixes)

3. CST in practice
   The sdsl library

slide-3
SLIDE 3

Succinct data structures (1)

Data structure D = representation of an object X + operations on X

Example: Rank-bit-vector
  bit vector b of length n, e.g. b = (0,1,0,1,1,0,1,1), in n bits of space
  + access b[i] in O(1) time
  + rank(i) = \sum_{j=0}^{i-1} b[j] in O(n) time; for the example the rank values are (0,0,1,1,2,3,3,4)

Succinct data structure D
  The space of D is close to the information-theoretic lower bound for representing X, while operations can still be performed efficiently.

slide-5
SLIDE 5

Succinct data structures (1)

Data structure D = representation of an object X + operations on X

Example: Rank-bit-vector
  bit vector b of length n, e.g. b = (0,1,0,1,1,0,1,1), plus the precomputed rank values (0,0,1,1,2,3,3,4), in n + n log n bits of space
  + access b[i] in O(1) time
  + rank(i) = \sum_{j=0}^{i-1} b[j] in O(1) time

Succinct data structure D
  The space of D is close to the information-theoretic lower bound for representing X, while operations can still be performed efficiently.
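
The two variants of the rank example can be sketched in a few lines of C++. This is only an illustration of the trade-off on the slide, not the sdsl implementation; the function names `rank_scan` and `build_rank_table` are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Variant 1: no extra space, rank(i) counts the 1-bits in b[0..i) by scanning.
std::size_t rank_scan(const std::vector<bool>& b, std::size_t i) {
    assert(i <= b.size());
    std::size_t ones = 0;
    for (std::size_t j = 0; j < i; ++j) ones += b[j];
    return ones;
}

// Variant 2: precompute table[i] = number of 1-bits in b[0..i).
// The table costs roughly n log n extra bits, and rank(i) becomes one lookup.
std::vector<std::size_t> build_rank_table(const std::vector<bool>& b) {
    std::vector<std::size_t> table(b.size() + 1, 0);
    for (std::size_t i = 0; i < b.size(); ++i) table[i + 1] = table[i] + b[i];
    return table;
}
```

For b = (0,1,0,1,1,0,1,1) the table starts with the values 0,0,1,1,2,3,3,4 shown on the slide. A succinct rank structure would keep only sampled counts, so that the extra space drops to o(n) bits while the query stays constant time.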

slide-7
SLIDE 7

Succinct data structures (2)

Can succinct data structures replace classic uncompressed data structures in practice?
  Less memory ⇒ fewer CPU cycles!?
  Less memory ⇒ lower costs!?

Problems:
  in theory
    develop succinct data structures
  in practice
    constants in O(1)-time terms are large
    o(n)-space term is not negligible
    complex data structures are hard to implement

slide-8
SLIDE 8

Succinct data structures (2)

Can succinct data structures replace classic uncompressed data structures in practice?
  Less memory ⇒ fewer CPU cycles!?

  Level      Size          Access time
  CPU        ≈ 100 B       ≈ 1 CPU cycle
  L1-Cache   ≈ 10 KB       ≈ 5 CPU cycles
  L2-Cache   ≈ 512 KB      ≈ 10-20 CPU cycles
  L3-Cache   ≈ 1-8 MB      ≈ 20-100 CPU cycles
  DRAM       ≈ 4 GB        ≈ 100-500 CPU cycles
  Disk       ≈ x · 100 GB  ≈ 10^6 CPU cycles


slide-9
SLIDE 9

Succinct data structures (2)

Can succinct data structures replace classic uncompressed data structures in practice?
  Less memory ⇒ fewer CPU cycles!?
  Less memory ⇒ lower costs!?

  Instance name                       main memory   price per hour
  Micro                               613.0 MB      0.02 US$
  High-Memory Quadruple Extra Large   68.4 GB       2.00 US$

  Pricing of Amazon's Elastic Compute Cloud (EC2) service in July 2011.


slide-11
SLIDE 11

The classic index data structure: The suffix tree (ST)

Let T be a text of length n over alphabet Σ of size σ.

Suffix tree
  index data structure for T (construction O(n))
  can be used to solve many problems in optimal time complexity
    bioinformatics
    data compression
  uses O(n log n) bits!
  In practice (ASCII alphabet) ≥ 17 times the size of T
  Cannot handle “The Attack of Massive Data”
    DNA sequencing data (NGS)
    ...

slide-12
SLIDE 12

Example: ST of T=umulmundumulmum$

[Figure: suffix tree of T. Each leaf is labelled with the starting position of its suffix; from left to right the leaves read 15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5.]

n = 16, Σ = {$,d,l,m,n,u}, σ = 6

Classic implementation uses pointers, each of size 4 or 8 bytes!

slide-13
SLIDE 13

Example: ST of T=umulmundumulmum$

[Figure: suffix tree of T, leaves labelled with suffix starting positions.]

Operations
  root()
  is_leaf(v)
  parent(v)
  degree(v)
  child(v, c)
  select_child(v, i)
  depth(v)
  edge(v, d)
  lca(v, w)
  sl(v)
  wl(v, c)

slide-15
SLIDE 15

CSTs

Goal of a CST implementation
  Replace the fastest uncompressed ST implementations in different scenarios:
  (a) both fit in RAM and we measure time
  (b) both fit in RAM and we measure resource costs
  (c) only the CST fits in RAM and we measure time

Proposals which might work for (a) and (b)
  Sadakane’s CST (cst_sada)
  Fully Compressed Suffix Tree (Russo et al.)
  CSTs based on interval representation of nodes (Fischer et al. cstY, Ohlebusch et al. cst_sct3)
slide-17
SLIDE 17

Big picture of CST design

[Figure: the three CST components CSA, NAV and LCP with their building blocks.
 Labels near CSA: Wavelet Tree, Ψ, LF, T, BWT, Huffman.
 Labels near NAV: Min-Max-Tree, excess, RMQ, Pioneers, Balanced Parentheses Sequence, first child.
 Labels near LCP: PLCP, 2n bits.]

slide-18
SLIDE 18

Big picture of CST design

[Figure: suffix tree of T = umulmundumulmum$ drawn above the arrays that represent it.]

SA     = 15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5
LCP    = 0 0 0 3 0 1 5 2 2 0 0 4 1 2 6 1
BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
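
The SA and LCP rows of this example can be reproduced with a deliberately naive sketch. It sorts all suffixes by direct comparison, which is far from the construction algorithms a library would use; the function names are chosen for this illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by comparison sort of all suffixes (demo only, not linear time).
std::vector<std::size_t> naive_suffix_array(const std::string& t) {
    std::vector<std::size_t> sa(t.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&t](std::size_t a, std::size_t b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]; lcp[0] = 0.
std::vector<std::size_t> naive_lcp_array(const std::string& t,
                                         const std::vector<std::size_t>& sa) {
    assert(sa.size() == t.size());
    std::vector<std::size_t> lcp(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); ++i) {
        std::size_t a = sa[i - 1], b = sa[i], len = 0;
        while (a + len < t.size() && b + len < t.size() && t[a + len] == t[b + len]) ++len;
        lcp[i] = len;
    }
    return lcp;
}
```

Called on "umulmundumulmum$", the two functions return exactly the SA and LCP rows listed above. The sketch relies on '$' being smaller than every letter in ASCII.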

slide-19
SLIDE 19

Example: Compressing NAV

[Figure: suffix tree of T = umulmundumulmum$]

BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))

BPSdfs is written during a depth-first traversal of the tree: "(" when a node is entered, ")" when it is left. The number shown for each "(" is its position in BPSdfs (the root's "(" is at position 0).

  prefix of BPSdfs    positions of "(" shown so far
  (
  ((                  1
  (()                 1
  (()(                1 3
  (()()               1 3
  (()()(              1 3 5
  (()()((             1 3 5 6
  (()()(()            1 3 5 6
  (()()(()(           1 3 5 6 8
  (()()(()()          1 3 5 6 8
  (()()(()())         1 3 5 6 8

Positions for the complete sequence: 1 3 5 6 8 11 12 14 15 16 18 21 23 27 29 30 31 33 36 37 39 40 42 46

tree: uncompressed O(n log n) bits, compressed 4n bits

slide-31
SLIDE 31

NAV data structures (1)

4n bits

[Figure: suffix tree of T = umulmundumulmum$]
BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))
positions of "(": 1 3 5 6 8 11 12 14 15 16 18 21 23 27 29 30 31 33 36 37 39 40 42 46

2n bits

[Figure: suffix tree of T = umulmundumulmum$]
BPSsct (annotated with LCP values) = ( ( ( ( 3 )( ( 1 ( 5 )( 2 ( 2 )))( ( ( 4 )( 1 ( 2 ( 6 ))( 1 ))))))))
LCP intervals: 0-[0, 15], 3-[2, 3], 1-[4, 8], 2-[5, 8], 5-[5, 6], 1-[10, 15], 4-[10, 11], 2-[12, 14], 6-[13, 14]

+ o(n) bits to answer find_open(i), find_close(i), enclose(i), double_enclose(i, j), rank(i, c), select(i, c), ... in constant time

slide-32
SLIDE 32

NAV data structures (2)

Comparison of different NAV structures

                       cst_sada       cst_sct             cst_sct3
  space in bits        4n + o(n)      2n + o(n)           3n + o(n)
  root()               O(1)           O(1)                O(1)
  degree(v)            O(σ)           O(t_LCP log σ)      O(1)
  depth(v)             O(t_LCP)       O(t_LCP)            O(t_LCP)
  parent(v)            O(1)           O(t_LCP log σ)      O(1)
  select_child(v, i)   O(i)           O(t_LCP)            O(1)
  sibling(v)           O(1)           O(t_LCP)            O(1)
  sl(v), lca(v, w)     O(1)           O(t_LCP log σ)      O(1)
  child(v, c)          O(t_SA σ)      O(t_SA log σ)       O(t_SA log σ)

slide-33
SLIDE 33

Example operations: select_leaf(i) and lca(v, w) on NAV

[Figure: suffix tree of T = umulmundumulmum$ with the nodes at BPSdfs positions 8 and 12 marked.]

BPSdfs = (()()(()())(()((()())()()))()((()())(()(()()))()))

select_leaf(4) = select(4, '10') = 8
select_leaf(5) = select(5, '10') = 12
lca(8, 12) = double_enclose(8, 12) = 0
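
The navigation on the parentheses sequence can be imitated with plain scans. The sketch below is an assumption-laden stand-in: it answers the same queries as the slide, but in linear time and without the o(n)-bit support structures, and `lca` climbs with `enclose` instead of using a constant-time `double_enclose`. Nodes are identified by the position of their "(".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Position of the i-th leaf (i >= 1), i.e. the i-th occurrence of "()".
// Returns bp.size() if there are fewer than i leaves.
std::size_t select_leaf(const std::string& bp, std::size_t i) {
    assert(i >= 1);
    for (std::size_t p = 0; p + 1 < bp.size(); ++p)
        if (bp[p] == '(' && bp[p + 1] == ')' && --i == 0) return p;
    return bp.size();
}

// Position of the ")" that matches the "(" at position v.
std::size_t find_close(const std::string& bp, std::size_t v) {
    assert(v < bp.size() && bp[v] == '(');
    long excess = 0;
    for (std::size_t p = v; p < bp.size(); ++p) {
        excess += (bp[p] == '(') ? 1 : -1;
        if (excess == 0) return p;
    }
    return bp.size();  // unbalanced input
}

// Position of the "(" of the parent of node v; bp.size() for the root.
std::size_t enclose(const std::string& bp, std::size_t v) {
    long excess = 0;
    for (std::size_t p = v; p-- > 0;) {
        excess += (bp[p] == '(') ? 1 : -1;
        if (excess == 1) return p;
    }
    return bp.size();
}

// Lowest common ancestor: climb from the left node until its range covers the right one.
std::size_t lca(const std::string& bp, std::size_t v, std::size_t w) {
    if (v > w) std::swap(v, w);
    while (find_close(bp, v) < w) v = enclose(bp, v);
    return v;
}
```

On the BPSdfs of the example, `select_leaf(bp, 4)` is 8, `select_leaf(bp, 5)` is 12 and `lca(bp, 8, 12)` is 0, matching the slide.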

slide-34
SLIDE 34

Virtues of a CSA based on BWT

Small size: |CSA| = |BWT| + \frac{n \log n}{s_{SA}} bits, where |BWT| can be chosen to be
  n log σ bits
  nH_0(T) bits
  nH_k(T) + O(σ^k) bits

Pattern matching in time O(|P| log σ) (even O(|P|) for σ ∈ polylog(n)) by backward search (Ferragina & Manzini)

slide-35
SLIDE 35

Hk of the Pizza&Chili 200MB test cases

        dblp.xml        dna             english         proteins        rand_k128       sources
  k     Hk     CT/n     Hk     CT/n     Hk     CT/n     Hk     CT/n     Hk     CT/n     Hk     CT/n
  0     5.257  0.0000   1.974  0.0000   4.525  0.0000   4.201  0.0000   7.000  0.0000   5.465  0.0000
  1     3.479  0.0000   1.930  0.0000   3.620  0.0000   4.178  0.0000   7.000  0.0000   4.077  0.0000
  2     2.170  0.0000   1.920  0.0000   2.948  0.0001   4.156  0.0000   6.993  0.0001   3.102  0.0000
  3     1.434  0.0007   1.916  0.0000   2.422  0.0005   4.066  0.0001   5.979  0.0100   2.337  0.0012
  4     1.045  0.0043   1.910  0.0000   2.063  0.0028   3.826  0.0011   0.666  0.6939   1.852  0.0082
  5     0.817  0.0130   1.901  0.0000   1.839  0.0103   3.162  0.0173   0.006  0.9969   1.518  0.0250
  6     0.705  0.0265   1.884  0.0001   1.672  0.0265   1.502  0.1742   0.000  1.0000   1.259  0.0509
  7     0.634  0.0427   1.862  0.0001   1.510  0.0553   0.340  0.4506   0.000  1.0000   1.045  0.0850
  8     0.574  0.0598   1.834  0.0004   1.336  0.0991   0.109  0.5383   0.000  1.0000   0.867  0.1255
  9     0.537  0.0773   1.802  0.0013   1.151  0.1580   0.074  0.5588   0.000  1.0000   0.721  0.1701
  10    0.508  0.0955   1.760  0.0051   0.963  0.2292   0.061  0.5699   0.000  1.0000   0.602  0.2163
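
The k = 0 row of such a table is the zero-order empirical entropy, which is straightforward to compute from character counts. The following is a minimal sketch under that standard definition; the function name `empirical_h0` is an assumption of this example, and higher orders k would additionally need per-context counts.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Zero-order empirical entropy H_0(T) in bits per character:
// H_0(T) = sum over characters c of (n_c / n) * log2(n / n_c).
double empirical_h0(const std::string& t) {
    assert(!t.empty());
    std::map<char, std::size_t> count;
    for (char c : t) ++count[c];
    const double n = static_cast<double>(t.size());
    double h0 = 0.0;
    for (const auto& kv : count) {
        const double p = static_cast<double>(kv.second) / n;
        h0 -= p * std::log2(p);
    }
    return h0;
}
```

For the running example T = umulmundumulmum$ this gives roughly 2.18 bits per character, so n·H_0(T) is about 35 bits against n log σ ≈ 41 bits for the plain encoding.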

slide-36
SLIDE 36

Backward search

  i    TBWT   F (sorted suffixes)
  0    m      $
  1    n      dumulmum$
  2    u      lmum$
  3    u      lmundumulmum$
  4    u      m$
  5    u      mulmum$
  6    u      mulmundumulmum$
  7    l      mum$
  8    l      mundumulmum$
  9    u      ndumulmum$
  10   m      ulmum$
  11   m      ulmundumulmum$
  12   m      um$
  13   d      umulmum$
  14   $      umulmundumulmum$
  15   m      undumulmum$

[Figure: panels (a), (b) and (c) show this table at three successive steps of a backward search.]
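
Backward search itself fits in a few lines once rank on the BWT is available. In this sketch rank is a plain scan (`occ`), which a wavelet tree would replace in a real CSA; the interval is half-open and 0-based, and all names are illustrative rather than taken from sdsl.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// occ(c, i): number of occurrences of c in bwt[0..i), computed by scanning.
std::size_t occ(const std::string& bwt, char c, std::size_t i) {
    assert(i <= bwt.size());
    return static_cast<std::size_t>(
        std::count(bwt.begin(), bwt.begin() + static_cast<std::ptrdiff_t>(i), c));
}

// Half-open suffix array interval [lo, hi) of all suffixes that start with `pattern`.
// The pattern is processed from its last character to its first.
std::pair<std::size_t, std::size_t> backward_search(const std::string& bwt,
                                                    const std::string& pattern) {
    std::size_t lo = 0, hi = bwt.size();
    for (std::size_t k = pattern.size(); k-- > 0 && lo < hi;) {
        const char c = pattern[k];
        // C[c]: number of text characters smaller than c (the BWT permutes the text).
        const std::size_t smaller = static_cast<std::size_t>(
            std::count_if(bwt.begin(), bwt.end(), [c](char x) { return x < c; }));
        lo = smaller + occ(bwt, c, lo);
        hi = smaller + occ(bwt, c, hi);
    }
    return {lo, hi};
}
```

With the BWT of the example, "mnuuuuullummmd$m", searching for "umu" yields the interval [13, 15), i.e. the two rows umulmum$ and umulmundumulmum$ of the table.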

slide-37
SLIDE 37

Wavelet tree: rank for character sequences

(a) balanced wavelet tree of TBWT
  mnuuuuullummmd$m   1111111001111001
    0: lld$            1100
      0: d$              10     (leaves $ and d)
      1: ll                     (leaf l)
    1: mnuuuuuummmm    001111110000
      0: mnmmmm          010000 (leaves m and n)
      1: uuuuuu                 (leaf u)

(b) Huffman-shaped wavelet tree of TBWT
  mnuuuuullummmd$m   0011111001000000
    0: mnllmmmd$m      1000111001
      0: nlld$           11100
        0: d$              10   (leaves $ and d)
        1: nll             100  (leaves l and n)
      1: mmmmm                  (leaf m)
    1: uuuuuu                   (leaf u)
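
How rank on characters reduces to rank on bits can be shown with a small pointer-based version of tree (a). This is a sketch under simplifying assumptions: the alphabet is split so that the left child receives the smaller half of the sorted symbols, bit rank is done by scanning, and the type and function names are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct WtNode {
    std::string alphabet;        // sorted symbols handled below this node
    std::vector<bool> bits;      // bits[j] = 1 if the j-th symbol goes to the right child
    std::unique_ptr<WtNode> left, right;
};

std::unique_ptr<WtNode> build_wt(const std::string& s, const std::string& alphabet) {
    std::unique_ptr<WtNode> node(new WtNode());
    node->alphabet = alphabet;
    if (alphabet.size() <= 1) return node;               // leaf: nothing to store
    const std::size_t half = (alphabet.size() + 1) / 2;  // left child gets the smaller half
    std::string s_left, s_right;
    for (char c : s) {
        const bool go_right = alphabet.find(c) >= half;
        node->bits.push_back(go_right);
        (go_right ? s_right : s_left).push_back(c);
    }
    node->left = build_wt(s_left, alphabet.substr(0, half));
    node->right = build_wt(s_right, alphabet.substr(half));
    return node;
}

std::unique_ptr<WtNode> make_wt(const std::string& s) {
    std::string alphabet = s;
    std::sort(alphabet.begin(), alphabet.end());
    alphabet.erase(std::unique(alphabet.begin(), alphabet.end()), alphabet.end());
    return build_wt(s, alphabet);
}

// Number of occurrences of c among the first i symbols of the indexed sequence.
std::size_t wt_rank(const WtNode& root, char c, std::size_t i) {
    if (root.alphabet.find(c) == std::string::npos) return 0;
    assert(i <= root.bits.size() || root.alphabet.size() <= 1);
    const WtNode* node = &root;
    while (node->alphabet.size() > 1) {
        const std::size_t half = (node->alphabet.size() + 1) / 2;
        const bool go_right = node->alphabet.find(c) >= half;
        std::size_t cnt = 0;  // bit rank by scanning; a rank_support would make this O(1)
        for (std::size_t j = 0; j < i; ++j) cnt += (node->bits[j] == go_right);
        i = cnt;
        node = go_right ? node->right.get() : node->left.get();
    }
    return i;
}
```

Built over "mnuuuuullummmd$m", the root bit vector is 1111111001111001 as in panel (a), and each rank query touches one node per level, which is where the O(log σ) factor in the CSA bounds comes from.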

slide-38
SLIDE 38

Example: Compressing CSA (practical approach)

  i    SA   LF   TBWT   T (sorted suffixes)
  0    15   4    m      $
  1    7    9    n      dumulmum$
  2    11   10   u      lmum$
  3    3    11   u      lmundumulmum$
  4    14   12   u      m$
  5    9    13   u      mulmum$
  6    1    14   u      mulmundumulmum$
  7    12   2    l      mum$
  8    4    3    l      mundumulmum$
  9    6    15   u      ndumulmum$
  10   10   5    m      ulmum$
  11   2    6    m      ulmundumulmum$
  12   13   7    m      um$
  13   8    1    d      umulmum$
  14   0    0    $      umulmundumulmum$
  15   5    8    m      undumulmum$

slide-39
SLIDE 39

Example: Compressing CSA (practical approach)

[Figure: SA = 15 7 11 3 14 9 1 12 4 6 10 2 13 8 0 5 over the indices 0 to 15, together with the BWT character and the first character of each suffix; the sampled SA entries are highlighted.]

sSA = 3
access LF[i] in time O(log σ)
access CSA[i] in time O(sSA log σ)

Worked example (follow LF until a sampled entry is reached):
  SA[13] = SA[1] + 1 = SA[9] + 2 = 6 + 2 = 8

SA
  uncompressed: n log n bits
  compressed: CSA = TBWT + SA samples: n log σ + o(n log σ) bits + \frac{n \log n}{s_{SA}} bits
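
The worked example can be replayed in code. The sketch assumes that exactly those SA values that are multiples of sSA are kept, which is consistent with the sampled value 6 above but is not stated on the slide. LF is stored explicitly and unsampled rows hold a marker, whereas a real CSA would derive LF from the wavelet tree and store only about n/sSA samples; all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct SampledCsa {
    std::string bwt;                  // BWT of the text
    std::vector<std::size_t> lf;      // LF mapping, explicit for the demo
    std::vector<std::size_t> sample;  // SA[i] if sampled, otherwise n as "not sampled"
};

// Assumes t ends with a unique character that is smaller than all others (here '$').
SampledCsa build_sampled_csa(const std::string& t, std::size_t s_sa) {
    assert(!t.empty() && s_sa >= 1);
    const std::size_t n = t.size();
    std::vector<std::size_t> sa(n);
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&t](std::size_t a, std::size_t b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    SampledCsa csa;
    csa.bwt.resize(n);
    csa.lf.resize(n);
    csa.sample.assign(n, n);
    for (std::size_t i = 0; i < n; ++i) {
        csa.bwt[i] = t[(sa[i] + n - 1) % n];
        if (sa[i] % s_sa == 0) csa.sample[i] = sa[i];  // keep every s_sa-th text position
    }
    for (std::size_t i = 0; i < n; ++i) {
        const char c = csa.bwt[i];
        std::size_t smaller = 0, before = 0;  // C[c] and occ(c, i)
        for (std::size_t j = 0; j < n; ++j) {
            if (csa.bwt[j] < c) ++smaller;
            if (j < i && csa.bwt[j] == c) ++before;
        }
        csa.lf[i] = smaller + before;
    }
    return csa;
}

// SA[i] = SA[LF^k(i)] + k, where k is the number of LF steps to the next sampled row.
std::size_t access_sa(const SampledCsa& csa, std::size_t i, std::size_t* steps = nullptr) {
    const std::size_t n = csa.bwt.size();
    assert(i < n);
    std::size_t k = 0;
    while (csa.sample[i] == n) { i = csa.lf[i]; ++k; }
    if (steps) *steps = k;
    return csa.sample[i] + k;
}
```

For T = umulmundumulmum$ and sSA = 3, `access_sa(csa, 13)` walks 13 → 1 → 9, finds the sample 6 and returns 8 after two LF steps. Since text position 0 is always sampled under this scheme, the walk never wraps around the text.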

slide-44
SLIDE 44

Overview of LCP data structures

  data structure       uses       access              memory in bits
  lcp_uncompressed     -          O(1)                n log n
  lcp_support_sada     CSA        O(t_SA)             2n + o(n)
  lcp_kurtz            -          O(log n) or O(1)    8n_1 + 2n_2 log n
  ...                  ...        ...                 ...
  lcp_support_tree     NAV        O(log σ′)           H_0 q_1 + q_2 log n
  lcp_support_tree2    NAV & LF   O(s_LCP log σ′)     H_0 q_1 + (q_2 log n)/s_LCP

with n_1 + n_2 = n and q_1 + q_2 = q < n, where q is the number of inner nodes of the ST

slide-45
SLIDE 45

Runtime for random access to the LCP array (1)

[Figure: two time-space plots, dblp.xml.200MB and proteins.200MB: memory in bits per character (4 to 24) against time in nanoseconds per operation (10 to 10^4). Series: lcp_wt, lcp_dac, lcp_kurtz, lcp_support_sada, lcp_support_tree, lcp_support_tree2, lcp_uncompressed.]

slide-46
SLIDE 46

Runtime for random access to the LCP array (2)

[Figure: two time-space plots, dna.200MB and rand_k128.200MB: memory in bits per character (4 to 24) against time in nanoseconds per operation (10 to 10^4). Series: lcp_uncompressed, lcp_wt, lcp_dac, lcp_kurtz, lcp_support_sada, lcp_support_tree, lcp_support_tree2.]

slide-47
SLIDE 47

Runtime for random access to the LCP array (3)

[Figure: two time-space plots, english.200MB and sources.200MB: memory in bits per character (4 to 24) against time in nanoseconds per operation (10 to 10^4). Series: lcp_uncompressed, lcp_wt, lcp_dac, lcp_kurtz, lcp_support_sada, lcp_support_tree, lcp_support_tree2.]

slide-49
SLIDE 49

The succinct data structure library sdsl

  Provides basic and advanced succinct data structures
  Easy to use (very similar to the C++ STL)
  Fast and space-efficient construction of data structures
  64-bit implementation
  Well-optimized implementation (e.g. now using the hardware POPCOUNT operation, ...)
  Easy configuration of myriad CSAs and CSTs with many time-space trade-offs
  Fast prototyping of other complex succinct data structures

slide-50
SLIDE 50

cst_sada<csa_sada<>,lcp_.. and cst_sct3<csa_wt<wt_huff<>..

[Figure: the sdsl concepts and their classes, shown once for each of the two CST configurations above. Edge types: "has member of class type" and "has template parameter of concept".]

  vector:          int_vector, enc_vector, rrr_vector
  rank_support:    rank_support_v, rank_support_v5, rrr_rank_support
  select_support:  select_support_mcl, select_support_bs, rrr_select_support
  wavelet_tree:    wt, wt_int, wt_huff, wt_rlmn, wt_rlg
  csa:             csa_uncompressed, csa_sada, csa_wt
  lcp:             lcp_uncompressed, lcp_dac, lcp_wt, lcp_kurtz, lcp_support_sada, lcp_support_tree, lcp_support_tree2
  bp_support:      bp_support_g, bp_support_gg, bp_support_sada
  rmq:             rmq_support_sparse_table, rmq_succinct_sct, rmq_succinct_sada
  cst:             cst_sada, cst_sct3
slide-51
SLIDE 51

CST space in practice (for english.200MB)

cst_sada<csa_wt<wt_huff<> >,lcp_dac<> >

Total: 540.2 MB (270% of the text size)

  CSA: 180.8 MB
    wt: 148 MB
      data: 114 MB
      rank: 7.1 MB
      select 1: 14.4 MB
      select 0: 12.5 MB
    sa_sample: 21.9 MB
    isa_sample: 10.9 MB
  lcp: 215 MB
    lcp values: 170.4 MB
    overflow mark: 41.9 MB
    rank: 2.6 MB
  nav: 144.4 MB
    BPSdfs(bit_vector): 80.9 MB
    bp_support: 37.1 MB
      small block: 5.7 MB
      medium block: 1.6 MB
      bp rank: 20.2 MB
      bp select: 9.6 MB
    rank_support10: 20.2 MB
    select_support10: 6.1 MB

slide-52
SLIDE 52

CST space in practice (for english.200MB)

cst_sada<csa_wt<wt_huff<> >,lcp_support_tree2<> >

Total: 406.2 MB (203% of the text size)

  CSA: 180.8 MB
    wt: 148 MB
      data: 114 MB
      rank: 7.1 MB
      select 1: 14.4 MB
      select 0: 12.5 MB
    sa_sample: 21.9 MB
    isa_sample: 10.9 MB
  lcp: 96.1 MB
    small lcp: 72.1 MB
      data: 67.9 MB
      rank: 4.2 MB
    big lcp: 24 MB
  nav: 129.2 MB
    bp: 80.9 MB
    bp_support: 21.9 MB
      small block: 5.7 MB
      medium block: 1.6 MB
      bp rank: 5.1 MB
      bp select: 9.6 MB
    rank_support10: 20.2 MB
    select_support10: 6.1 MB

slide-53
SLIDE 53

CST space in practice (for english.200MB)

cst_sada<csa_wt<wt_huff<rrr_vector<> > >,lcp_support_tree2<> >

Total: 333.7 MB (167% of the text size)

  CSA: 108.4 MB
    wt: 75.6 MB
      data: 75.6 MB
        bt: 30.4 MB
        btnr: 31.4 MB
        btnrp: 6.7 MB
        rank samples: 6.9 MB
        invert: 0.2 MB
      rank: 0 MB
      select 1: 0 MB
      select 0: 0 MB
    sa_sample: 21.9 MB
    isa_sample: 10.9 MB
  lcp: 96.1 MB
    small lcp: 72.1 MB
      data: 67.9 MB
      rank: 4.2 MB
    big lcp: 24 MB
  nav: 129.2 MB
    BPSdfs(bit_vector): 80.9 MB
    bp_support: 21.9 MB
      small block: 5.7 MB
      medium block: 1.6 MB
      bp rank: 5.1 MB
      bp select: 9.6 MB
    rank_support10: 20.2 MB
    select_support10: 6.1 MB

slide-54
SLIDE 54

CST space in practice (for english.200MB)

cst_sct3<csa_wt<wt_huff<rrr_vector<> > >,lcp_support_tree2<> >

Total: 294.5 MB (147% of the text size)

  CSA: 108.4 MB
    wt: 75.6 MB
      data: 75.6 MB
        bt: 30.4 MB
        btnr: 31.4 MB
        btnrp: 6.7 MB
        rank samples: 6.9 MB
        invert: 0.2 MB
      rank: 0 MB
      select 1: 0 MB
      select 0: 0 MB
    sa_sample: 21.9 MB
    isa_sample: 10.9 MB
  lcp: 96.1 MB
    small lcp: 72.1 MB
      data: 67.9 MB
      rank: 4.2 MB
    big lcp: 24 MB
  nav: 88.4 MB
    bp: 50 MB
    bp_support: 13.4 MB
      small block: 3.5 MB
      medium block: 0.8 MB

slide-55
SLIDE 55

Experimental setup

  0 ≙ cst_sada<csa_sada<>, lcp_dac<> >
  1 ≙ cst_sada<csa_sada<>, lcp_support_tree2<> >
  2 ≙ cst_sada<csa_wt<>, lcp_dac<> >
  3 ≙ cst_sada<csa_wt<>, lcp_support_tree2<> >
  4 ≙ cst_sct3<csa_sada<>, lcp_dac<> >
  5 ≙ cst_sct3<csa_sada<>, lcp_support_tree2<> >
  6 ≙ cst_sct3<csa_wt<>, lcp_dac<> >
  7 ≙ cst_sct3<csa_wt<>, lcp_support_tree2<> >

The same basic data structures are used, i.e. it's a very fair comparison

slide-56
SLIDE 56

Runtime of operations of cst_sada

[Figure: runtime per operation on two scales (0µs to 60µs and 0µs to 6µs) for: mstats, dfs and depth(v), dfs and id(v), lca(v, w), select_child(v, 1), child(v, c), sl(v), sibling(v), parent(v), depth(v)∗, depth(v), id(v), lcp[i], psi[i], csa[i].]

slide-57
SLIDE 57

Runtime of operations of cst_sct3

[Figure: runtime per operation on two scales (0µs to 60µs and 0µs to 6µs) for: mstats, dfs and depth(v), dfs and id(v), lca(v, w), select_child(v, 1), child(v, c), sl(v), sibling(v), parent(v), depth(v)∗, depth(v), id(v), lcp[i], psi[i], csa[i].]

slide-58
SLIDE 58

Time-space trade-off for select_child(v, 1)

[Figure: two plots, english.200MB and sources.200MB: memory in bits per character (4 to 32) against time in nanoseconds per operation (100 to 200). Data points are labelled 1, 2, 3, 4’, 5’, 6’, 7’, 8’.]

slide-59
SLIDE 59

Time-space trade-off for child(v, c)

[Figure: two plots, english.200MB and sources.200MB: memory in bits per character (4 to 32) against time in nanoseconds per operation (20000 to 60000). Data points are labelled 1, 2, 3, 4, 5, 6, 7, 8”.]

slide-60
SLIDE 60

Construction of a CST

    #include <sdsl/suffixtrees.hpp>
    #include <sdsl/util.hpp>

    using namespace sdsl;

    typedef cst_sct3<> tCST;

    int main(int argc, char* argv[]) {
        if (argc < 2) return 1;       // expects the text file as first argument
        tCST cst;
        construct_cst(argv[1], cst);  // build the CST of the file argv[1]
    }

slide-61
SLIDE 61

Runtime for construction (prefixes of english text)

[Figure: text size in MB (100 to 500) against construction time in seconds (10 to 10^4). Series: cstV, ST, cst_sada (0), cst_sct3 (4).]

slide-62
SLIDE 62

CST construction – resources comparison

[Figure: construction time in seconds against memory in bytes per character for the CSA, LCP and NAV parts, comparing sdsl (2010) with the first CST implementation (2007).]

text: english.200MB, σ: 226
CST type: cst_sada, CSA type: csa_wt, LCP type: lcp_sada

slide-63
SLIDE 63

Detailed resources for the construction of CSTs

[Figure: three panels, each plotting time in seconds against memory in bytes per input character for text english.200MB with CSA type csa_wt. The CST types of the panels are cst_sada, cst_sada and cst_sct3.]

Labels in the figure: bp_support: 24.3, NAV: 141.5, ISA: 33.9, TBWT: 29.5, wt: 9.9, LCP: 98.8, CSA: 106.8, CSA: 108.5, LCP: 65.9, NAV: 144.6, bp_support: 23.5, SA: 56.7, SA: 56.5, TBWT: 29.6, wt: 9.9, TBWT: 29.4, wt: 9.9, CSA: 108.3, bp_support: 2.3, NAV: 9.9, LCP: 67.2, SA: 57.4

slide-64
SLIDE 64

Depth first search traversal in a CST

    #include <iostream>

    // Adds depth(v) for every inner node v that the DFS iterator yields.
    template<class Cst>
    void test_cst_dfs_iterator_and_depth(Cst& cst) {
        typedef typename Cst::const_iterator iterator;
        long long cnt = 0;
        for (iterator it = cst.begin(); it != cst.end(); ++it) {
            if (!cst.is_leaf(*it))
                cnt += cst.depth(*it);
        }
        std::cout << cnt << std::endl;
    }

slide-65
SLIDE 65

Runtime of dfs on CST (prefixes of english text)

[Figure: text size in MB (100 to 500) against time in nanoseconds (10^2 to 10^4). Series: cstV, ST, cst_sada (0), cst_sct3 (4).]

slide-66
SLIDE 66

Conclusion

CSTs ...
  ... can be built fast and space-efficiently
  ... provide a rich set of functionality
      fast operations: basic navigation, access to LF or Ψ
      slow operations: child(v, c), access to SA

You can use the sdsl library to configure a CST which fits your needs

slide-67
SLIDE 67

Thank you!

slide-68
SLIDE 68

Runtime of int_vector access

[Figure: time in nanoseconds per operation (20 to 80) for random access read, random access write and sequential write on int_vector<32>, bit_vector, int_vector<> v(..,..,32) and int_vector<> v(..,..,27).]

slide-69
SLIDE 69

Operation runtime of basic data structures

[Figure: time in nanoseconds per operation (100 to 300) under random access for int_vector<32>, rank_support_v, rank_support_v5 and select_support_mcl.]

slide-70
SLIDE 70

Runtime of CSA access (1)

[Figure: two time-space plots, dblp.xml.200MB and proteins.200MB: memory in bits per character (2 to 32) against time in nanoseconds per operation (10^2 to 10^6). Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]

slide-71
SLIDE 71

Runtime of CSA access (2)

[Figure: two time-space plots, dna.200MB and rand_k128.200MB: memory in bits per character (2 to 32) against time in nanoseconds per operation (10^2 to 10^6). Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]

slide-72
SLIDE 72

Runtime of CSA access (3)

[Figure: two time-space plots, english.200MB and sources.200MB: memory in bits per character (2 to 32) against time in nanoseconds per operation (10^2 to 10^6). Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]

slide-73
SLIDE 73

Runtime of CSA operations (1)

[Figure: two time-space plots, dblp.xml.200MB and proteins.200MB: memory in bits per character (2 to 32) against time in nanoseconds per operation (10^2 to 10^6). Markers: o = psi[i], △ = psi(i)=LF[i], + = bwt[i]. Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]

slide-74
SLIDE 74

Runtime of CSA operations (2)

[Figure: two time-space plots, dna.200MB and rand_k128.200MB: memory in bits per character (2 to 32) against time in nanoseconds per operation (10^2 to 10^6). Markers: o = psi[i], △ = psi(i)=LF[i], + = bwt[i]. Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]

slide-75
SLIDE 75

Runtime of CSA operations (3)

[Figure: two time-space plots, english.200MB and sources.200MB: memory in bits per character (2 to 32) against time in nanoseconds per operation (10^2 to 10^6). Markers: o = psi[i], △ = psi(i)=LF[i], + = bwt[i]. Series: csa_wt<wt<>>, csa_wt<wt_huff<>>, csa_wt<wt_rlmn<>>, csa_wt<wt_rlg<8>>, csa_sada<δ>, csa_sada<Φ>.]