Compact Data Structures (To compress is to Conquer) - Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto - PowerPoint PPT Presentation



slide-1
SLIDE 1

Compact Data Structures

(To compress is to Conquer)

Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto

23rd August 2017

3rd KEYSTONE Training School Keyword search in Big Linked Data

slide-2
SLIDE 2
  • Introduction
  • Basic compression
  • Sequences
  • Bit sequences
  • Integer sequences
  • A brief Review about Indexing

PAGE 2

Agenda

images: zurb.com

slide-3
SLIDE 3

Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): one looks at data representations that not only permit space close to the minimum possible (as in compression) but also require that those representations allow one to efficiently carry out some operations on the data.

Introduction to Compact Data Structures

COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER PAGE 3

slide-4
SLIDE 4

Introduction Why compression?

  • Disks are cheap!! But they are also slow!
  • Compression can help more data fit in main memory.

(access to main memory is around 10^6 times faster than HDD)

  • CPU speed is increasing faster than disk speed
  • We can trade processing time (needed to decompress the data) for space.
slide-5
SLIDE 5

Introduction Why compression?

  • Compression does not only reduce space!
  • It also reduces I/O access on disks and networks
  • and processing time* (less data has to be processed)
  • *if appropriate methods are used
  • For example, methods that allow handling the data in compressed form all the time.

(Figure: a text collection (100%, Doc 1 … Doc n) compressed to 30% with a method that still allows processing, vs. 20% with P7zip and others.)

Let's search for "Keystone"

slide-6
SLIDE 6

Introduction Why indexing?

  • Indexing permits sublinear search time

(Figure: the text collection (100%, Doc 1 … Doc n) or its compressed form (30%), plus an index over terms (term 1 … Keystone … term n) that adds a further 5–30% or more.)

Let's search for "Keystone"

slide-7
SLIDE 7

Introduction Why compact data structures?

  • Self-indexes:
  • sublinear search time
  • Text implicitly kept

(Figure: the text collection (Doc 1 … Doc n) with a classical index (term 1 … Keystone … term n, an extra 5–30% or more) vs. a self-index (WT, WCSA, …) that supports the same searches while keeping the text implicitly.)

Let's search for "Keystone"

slide-8
SLIDE 8
  • Introduction
  • Basic compression
  • Sequences
  • Bit sequences
  • Integer sequences
  • A brief Review about Indexing


Agenda


slide-9
SLIDE 9

Compression aims at representing data within less space. How does it work? Which are the most traditional compression techniques?

Compression


slide-10
SLIDE 10

Basic Compression Modeling & Coding

  • A compressor could use as a source alphabet:
  • A fixed number of symbols (statistical compressors)
  • 1 char, 1 word
  • A variable number of symbols (dictionary-based compressors)
  • 1st occ of ‘a’ encoded alone, 2nd occ encoded with next one ‘ax’
  • Codes are built using symbols of a target alphabet:
  • Fixed length codes (10 bits, 1 byte, 2 bytes, …)
  • Variable length codes (1,2,3,4 bits/bytes …)
  • Classification (fixed-to-variable, variable-to-fixed, …):

    Input \ Target     fixed         variable
    fixed              -             statistical
    variable           dictionary    var2var

slide-11
SLIDE 11

Basic Compression Main families of compressors

  • Taxonomy
  • Dictionary based (gzip, compress, p7zip… )
  • Grammar based (BPE, Repair)
  • Statistical compressors (Huffman, arithmetic, Dense, PPM,… )
  • Statistical compressors
  • Gather the frequencies of the source symbols.
  • Assign shorter codewords to the most frequent symbols to obtain compression.

slide-12
SLIDE 12

Basic Compression Dictionary-based compressors

  • How do they achieve compression?
  • Assign fixed-length codewords to variable-length symbols (text substrings)
  • The longer the replaced substring, the better the compression
  • Well-known representatives: Lempel-Ziv family
  • LZ77 (1977): GZIP, PKZIP, ARJ, P7zip
  • LZ78 (1978)
  • LZW (1984): Compress, GIF images
slide-13
SLIDE 13

Basic Compression LZW

  • Starts with an initial dictionary D (contains the symbols of the alphabet Σ)
  • For a given position of the text:
  • while D contains w, read a prefix w = w0 w1 w2 …
  • If w0 … wk wk+1 is not in D (but w0 … wk is!)
  • Output i = entryPos(w0 … wk) (Note: codeword length = log2(|D|) bits)
  • Add w0 … wk wk+1 to D
  • Continue from wk+1 on (included)
  • Dictionary has limited length? Policies: LRU, truncate & go, …

EXAMPLE
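The loop above can be sketched in Python. This is a simplified LZW: codewords are plain dictionary positions (the log2(|D|)-bit packing and the limited-dictionary policies are omitted), and the standard LZW inverse is included for completeness.

```python
def lzw_compress(text):
    """Simplified LZW: emit dictionary positions of the longest known prefixes."""
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}  # initial D = alphabet
    output = []
    w = ""
    for c in text:
        if w + c in dictionary:
            w += c                               # keep extending the known prefix
        else:
            output.append(dictionary[w])         # output entryPos(w0...wk)
            dictionary[w + c] = len(dictionary)  # add w0...wk wk+1 to D
            w = c                                # continue from wk+1 (included)
    if w:
        output.append(dictionary[w])
    return output

def lzw_decompress(codes, alphabet):
    """Inverse: rebuild the dictionary while decoding."""
    dictionary = {i: c for i, c in enumerate(sorted(alphabet))}
    prev = dictionary[codes[0]]
    result = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                    # code defined in this very step
            entry = prev + prev[0]
        result.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(result)
```

For example, `lzw_compress("abababab")` produces 5 codewords instead of 8 symbols, and decompressing them recovers the original string.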

slide-14
SLIDE 14

Basic Compression LZW

  • Starts with an initial dictionary D (contains the symbols of the alphabet Σ)
  • For a given position of the text:
  • while D contains w, read a prefix w = w0 w1 w2 …
  • If w0 … wk wk+1 is not in D (but w0 … wk is!)
  • Output i = entryPos(w0 … wk) (Note: codeword length = log2(|D|) bits)
  • Add w0 … wk wk+1 to D
  • Continue from wk+1 on (included)
  • Dictionary has limited length? Policies: LRU, truncate & go, …

EXAMPLE

slide-15
SLIDE 15

Basic Compression Grammar-based – BPE - Repair

  • Repeatedly replaces the most frequent pair of symbols by a new one, until no pair appears twice
  • Each replacement adds a rule to a dictionary.

Source sequence:   A B C D E A B D E F D E D E F A B E C D
Rule DE → G:       A B C G A B G F G G F A B E C D
Rule AB → H:       H C G H G F G G F H E C D
Rule GF → I:       H C G H I G I H E C D   (final Repair sequence)

Dictionary of rules: G → DE, H → AB, I → GF
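The rule extraction above can be sketched as a naive Repair/BPE pass. The nonterminal names R0, R1, … are an assumption of this sketch, standing in for the slide's G, H, I.

```python
from collections import Counter

def repair(seq):
    """Repeatedly replace the most frequent pair by a fresh symbol
    until no pair occurs twice. Returns (final sequence, rules)."""
    rules = {}
    fresh = 0
    seq = list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break                           # no pair repeats twice: done
        pair = pairs.most_common(1)[0][0]
        new = f"R{fresh}"                   # fresh nonterminal (hypothetical naming)
        fresh += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(seq):                 # left-to-right replacement of the pair
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Undo the grammar: expand a symbol back to the original terminals."""
    if symbol not in rules:
        return [symbol]
    a, b = rules[symbol]
    return expand(a, rules) + expand(b, rules)
```

On the slide's 20-symbol sequence this extracts three rules (DE, AB, GF in that order) and leaves an 11-symbol final sequence; expanding the rules recovers the source exactly.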

slide-16
SLIDE 16

Basic Compression Statistical compressors

  • Assign shorter codewords to the most frequent symbols
  • Must gather the frequency of each symbol c in S.
  • Compression is lower bounded by the (zero-order) empirical entropy of the sequence S.

  • Most representative method: Huffman coding

n = number of symbols in S;  nc = number of occurrences of symbol c

H0(S) = Σc (nc / n) · log2(n / nc),  with H0(S) ≤ log2(|Σ|)

n · H0(S) is a lower bound on the size of S compressed with a zero-order compressor
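The entropy formula above can be computed directly from the symbol frequencies (a minimal sketch):

```python
from collections import Counter
from math import log2

def h0(seq):
    """Zero-order empirical entropy: H0(S) = sum over c of (nc/n) * log2(n/nc)."""
    n = len(seq)
    return sum((nc / n) * log2(n / nc) for nc in Counter(seq).values())
```

For instance, a sequence with symbol counts 9, 4, 2 and 1 (n = 16) gives H0 ≈ 1.59 bits per symbol, the figure used later in the Sequences section; a sequence with a single distinct symbol gives H0 = 0.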

slide-17
SLIDE 17

Basic Compression Statistical compressors: Huffman coding

  • Optimal prefix-free coding
  • No codeword is a prefix of another.
  • Decoding requires no look-ahead!
  • Asymptotically optimal: |Huffman(S)| <= n(H0(S)+1)
  • Typically using bit-wise codewords
  • Yet D-ary Huffman variants exist (D=256 byte-wise)
  • Builds a Huffman tree to generate codewords
slide-18
SLIDE 18

Basic Compression Statistical compressors: Huffman coding

  • Sort symbols by frequency: S=ADBAAAABBBBCCCCDDEEE
slide-19
SLIDE 19

Basic Compression Statistical compressors: Huffman coding

  • Bottom – Up tree construction
slide-20
SLIDE 20

Basic Compression Statistical compressors: Huffman coding

  • Bottom – Up tree construction
slide-21
SLIDE 21

Basic Compression Statistical compressors: Huffman coding

  • Bottom – Up tree construction
slide-22
SLIDE 22

Basic Compression Statistical compressors: Huffman coding

  • Bottom – Up tree construction
slide-23
SLIDE 23

Basic Compression Statistical compressors: Huffman coding

  • Bottom – Up tree construction
slide-24
SLIDE 24

Basic Compression Statistical compressors: Huffman coding

  • Branch labeling
slide-25
SLIDE 25

Basic Compression Statistical compressors: Huffman coding

  • Code assignment
slide-26
SLIDE 26

Basic Compression Statistical compressors: Huffman coding

  • Compression of the sequence S = ADB…
  • ADB… → 01 000 10 …
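The whole pipeline of slides 18 to 26 (count frequencies, build the tree bottom-up, assign codes along the branches) can be sketched with a min-heap. Exact codewords depend on how ties are broken, so they may differ from the slide's 01/000/10; the code lengths and the total compressed size are the same.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Count frequencies, merge the two least frequent subtrees repeatedly,
    prepend 0/1 per branch; returns {symbol: codeword}."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)                      # tie-breaker so the dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # the two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))  # ...are merged bottom-up
        tie += 1
    return heap[0][2]
```

For S = ADBAAAABBBBCCCCDDEEE the resulting code is prefix-free, the code lengths are {2, 2, 2, 3, 3}, and the encoded sequence takes 46 bits.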
slide-27
SLIDE 27

Basic Compression Burrows-Wheeler Transform (BWT)

  • Given S = mississippi$, BWT(S) is obtained by: (1) creating a matrix M with all circular permutations of S, (2) sorting the rows of M, and (3) taking the last column.

Rotations of S (before sorting): mississippi$, $mississippi, i$mississipp, pi$mississip, ppi$mississi, ippi$mississ, sippi$missis, ssippi$missi, issippi$miss, sissippi$mis, ssissippi$mi, ississippi$m

Sorted rows of M (F = first column, L = last column):
$mississippi
i$mississipp
ippi$mississ
issippi$miss
ississippi$m
mississippi$
pi$mississip
ppi$mississi
sippi$missis
sissippi$mis
ssippi$missi
ssissippi$mi

L = BWT(S) = ipssm$pissii
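The three steps above can be sketched directly. This is the naive O(n² log n) construction (materializing all rotations), fine for short strings; practical implementations build the suffix array instead.

```python
def bwt(s):
    """Burrows-Wheeler Transform: sort all rotations of s, take the last column.
    Assumes s already ends with a unique sentinel '$'."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))  # the sorted matrix M
    return "".join(row[-1] for row in rotations)              # last column L
```

With the slide's example, `bwt("mississippi$")` returns `"ipssm$pissii"`.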

slide-28
SLIDE 28

Basic Compression Burrows-Wheeler Transform: reversible (BWT-1)

  • Given L=BWT(S), we can recover S=BWT-1(L)

Sorted rotations (rows 1–12; F = first column, L = last column):
$mississippi, i$mississipp, ippi$mississ, issippi$miss, ississippi$m, mississippi$, pi$mississip, ppi$mississi, sippi$missis, sissippi$mis, ssippi$missi, ssissippi$mi

Steps:
1. Sort L to obtain F.
2. Build the LF mapping: if L[i] = 'c' and k = the number of times 'c' occurs in L[1..i], then LF[i] = j, the position in F of the k-th occurrence of 'c'.
   Example: L[7] = 'p' is the 2nd 'p' in L → LF[7] = 8, which is the position of the 2nd 'p' in F.

LF = 2 7 9 10 6 1 8 3 11 12 4 5

slide-29
SLIDE 29

Basic Compression Burrows-Wheeler Transform: reversible (BWT-1)

  • Given L=BWT(S), we can recover S=BWT-1(L)

(Sorted matrix and LF = 2 7 9 10 6 1 8 3 11 12 4 5 as in the previous slide.)

Steps:
1. Sort L to obtain F.
2. Build the LF mapping (as in the previous slide).
3. Recover the source sequence S in n steps. Initially p = 6 (the position of $ in L); i = 0; n = 12. In each step: S[n-i] = L[p]; p = LF[p]; i = i + 1.

S (so far) = _ _ _ _ _ _ _ _ _ _ _ $

slide-30
SLIDE 30

Basic Compression Burrows-Wheeler Transform: reversible (BWT-1)

  • Given L=BWT(S), we can recover S=BWT-1(L)

Step i = 0:  S[n-i] = L[p] → S[12] = '$';  p = LF[6] = 1;  i = 1

(Sorted matrix and LF = 2 7 9 10 6 1 8 3 11 12 4 5 as in the previous slides.)

S (so far) = _ _ _ _ _ _ _ _ _ _ _ $

slide-31
SLIDE 31

Basic Compression Burrows-Wheeler Transform: reversible (BWT-1)

  • Given L=BWT(S), we can recover S=BWT-1(L)

Step i = 1:  S[n-i] = L[p] → S[11] = L[1] = 'i';  p = LF[1] = 2;  i = 2

(Sorted matrix and LF as in the previous slides.)

S (so far) = _ _ _ _ _ _ _ _ _ _ i $

slide-32
SLIDE 32

Basic Compression Burrows-Wheeler Transform: reversible (BWT-1)

  • Given L=BWT(S), we can recover S=BWT-1(L)

Continuing the same step (S[n-i] = L[p]; p = LF[p]; i = i + 1), after n = 12 steps the whole sequence is recovered:

S = m i s s i s s i p p i $

(Sorted matrix and LF as in the previous slides.)
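The recovery procedure of slides 28 to 32 can be sketched as follows (0-based indexing in the code, vs. the slides' 1-based positions):

```python
def inverse_bwt(L):
    """Invert the BWT via the LF mapping: pair the k-th 'c' in L
    with the k-th 'c' in F, then walk backwards from the '$' row."""
    n = len(L)
    F = sorted(L)
    first = {}                         # first position of each char in F
    for j, c in enumerate(F):
        first.setdefault(c, j)
    LF, seen = [0] * n, {}
    for i, c in enumerate(L):          # LF[i] = position in F matching L[i]
        k = seen.get(c, 0)
        LF[i] = first[c] + k
        seen[c] = k + 1
    p = L.index("$")                   # start at the row whose last char is '$'
    S = [""] * n
    for i in range(n):                 # S[n-1-i] = L[p]; p = LF[p]
        S[n - 1 - i] = L[p]
        p = LF[p]
    return "".join(S)
```

With the running example, `inverse_bwt("ipssm$pissii")` recovers `"mississippi$"`, and the LF values it computes match the slide's array 2 7 9 10 6 1 8 3 11 12 4 5 (shifted down by one for 0-based indexing).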

slide-33
SLIDE 33

Basic Compression Bzip2: Burrows-Wheeler Transform (BWT)

  • BWT: many similar symbols appear adjacent.
  • MTF (move-to-front):
  • Output the position of the current symbol within the alphabet Σ'.
  • Keep Σ' = {a,b,c,d,e,…} sorted so that the last used symbol is moved to the beginning of Σ'.
  • RLE:
  • If a value appears several times (e.g. 0 appears 6 times: 000000)
  • replace it by a pair <value,times> → <0,6>
  • Huffman stage.

Why does it work? In a text it is likely that "he" is preceded by "t", "ssi" by "i", …

slide-34
SLIDE 34
  • Introduction
  • Basic compression
  • Sequences
  • Bit sequences
  • Integer sequences
  • A brief Review about Indexing


Agenda


slide-35
SLIDE 35

We want to represent (compactly) a sequence of elements and to efficiently handle them.

(Who is in the 2nd position?? How many Barts up to position 5?? Where is the 3rd Bart??)

Sequences


1 2 3 4 5 6 7 8 9

slide-36
SLIDE 36

Sequences Plain Representation of Data

  • Given a Sequence of
  • n integers
  • m = maximum value
  • We can represent it with n ⌈log2(m+1)⌉ bits
  • 16 symbols × 3 bits per symbol = 48 bits → fits in an array of two 32-bit ints
  • Direct access (access to an integer + bit operations)

4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

100 001 100 100 100 100 001 100 010 100 001 001 010 011 100 100

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
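The fixed-width packing above can be sketched with shifts and masks (positions 1-based, as in the slides; Python's unbounded ints stand in for the array of 32-bit words):

```python
from math import ceil, log2

def pack(values, m):
    """Pack n integers in [0..m] using ceil(log2(m+1)) bits each into one int."""
    b = ceil(log2(m + 1))
    word = 0
    for v in values:
        word = (word << b) | v          # append the next b-bit field
    return word, b

def access(word, b, n, i):
    """Direct access to the i-th value (1-based): one shift plus one mask."""
    shift = (n - i) * b
    return (word >> shift) & ((1 << b) - 1)
```

For the 16-symbol example with maximum value 4, `b` comes out as 3 bits and every position can be read back directly.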

slide-37
SLIDE 37

Sequences Compressed Representation of Data (H0)

  • Is it compressible?
  • H0(S) = 1.59 (bits per symbol)
  • Huffman: 1.62 bits per symbol

26 bits: No direct access!

(but we could add sampling)

Symbol             4  1  2  3
Occurrences (nc)   9  4  2  1

Huffman tree (n = 16): merge 2+1 → 3, then 4+3 → 7, then 9+7 → 16.
Codes: 4 → 1, 1 → 01, 2 → 001, 3 → 000

S = 4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Encoded: 1 01 1 1 1 1 01 1 001 1 01 01 001 000 1 1   (26 bits)

slide-38
SLIDE 38

Sequences Summary: Plain/Compressed → access/rank/select

  • Operations of interest:
  • Access(i): value of the i-th symbol
  • Rank_s(i): number of occurrences of symbol s up to position i (count)
  • Select_s(i): position of the i-th occurrence of symbol s (locate)

4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

100 001 100 100 100 100 001 100 010 100 001 001 010 011 100 100

1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46

1 01 1 1 1 1 01 1 001 1 01 01 001 000 1 1

1 5 10 15 20 25

slide-39
SLIDE 39
  • Introduction
  • Basic compression
  • Sequences
  • Bit sequences
  • Integer sequences
  • A brief Review about Indexing


Agenda


slide-40
SLIDE 40

Bit Sequences access/rank/select on bitmaps

Rank1(6) = 3 Rank0(10) = 5

0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

B =

select0(10) =15 access (19) = 0

see [Navarro 2016]

slide-41
SLIDE 41
  • f 78

41

Bit Sequences Applications

  • Bitmaps are a basic part of most Compact Data Structures
  • Example: (We will see it later in the CSA)

S: AAABBCCCCCCCCDDDEEEEEEEEEEFG  → n·log σ bits
B: 1001010000000100100000000011  → n bits
D: ABCDEFG                       → σ·log σ bits

  • Saves space:
  • Fast access/rank/select is of interest !!
  • Where is the 2nd C?
  • How many Cs up to position k?

HDT Bitmaps from Javi's talk !!!

slide-42
SLIDE 42

Bit Sequences Reaching O(1) rank & o(n) bits of extra space

  • Jacobson, Clark, Munro
  • Variant by Fariña et al.
  • Assuming 32 bit machine-word
  • Step 1: Split the bitmap into superblocks of 256 bits, and store the number of 1s up to positions 1+256k (k = 0, 1, 2, …)

  • O(1) time to superblock. Space: n/256 superblocks and 1 int each

Example bitmap: superblock 1 = bits 1–256 (35 bits set to 1), superblock 2 = bits 257–512 (27 bits set to 1), superblock 3 = bits 513–768, …

Ds = 35  62  …   (cumulative number of 1s after superblocks 1, 2, …)

slide-43
SLIDE 43

Bit Sequences Reaching O(1) rank & o(n) bits of extra space

  • Step 2: For each superblock of 256 bits
  • Divide it into 8 blocks of 32 bits each (machine word size)
  • Store the number of ones from the beginning of the superblock
  • O(1) time to the blocks, 8 blocks per superblock, 1 byte each

Example: within a superblock, block 1 = bits 1–32 (4 bits set to 1), block 2 = bits 33–64 (6 bits set to 1), …, block 8 = bits 225–256.

Db = 4  10  …   (cumulative number of 1s within the superblock, after blocks 1, 2, …)
slide-44
SLIDE 44

Bit Sequences Reaching O(1) rank & o(n) bits of extra space

  • Step 3: Rank within a 32 bit block

Finally solving: rank1( D , p ) = Ds[ p / 256 ] + Db[ p / 32 ] + rank1(blk, i) where i= p mod 32

  • Ex: rank1(D,300) = 35 + 4 + 4 = 43
  • Yet, how to compute rank1(blk, i) in constant time?

1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 blk =

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
slide-45
SLIDE 45

Bit Sequences Reaching O(1) rank & o(n) bits of extra space

  • How to compute rank1 (blk, i) in constant time?
  • Option 1: popcount within a machine word
  • Option 2: Universal Table onesInByte (solution for each byte)

Only 256 entries storing values [0..8]

  • Finally, sum value onesInByte for the 4 bytes in blk
  • Overall space: 1.375 n bits

1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 blk =

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

Rank1(blk, 12): shift blk right by 32 - 12 = 20 positions so that only the first 12 bits remain:
1 0 0 1 0 0 0 0 1 1 0 0

onesInByte table (256 entries, values in [0..8]):

Val   binary     onesInByte
0     00000000   0
1     00000001   1
2     00000010   1
3     00000011   2
…     …          …
252   11111100   6
253   11111101   7
254   11111110   7
255   11111111   8
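The whole three-level structure of slides 42 to 45 can be sketched as below. This variant stores the counters as Python lists rather than packed ints/bytes, but it follows the same layout: Ds every 256 bits, superblock-relative Db every 32 bits, and a 256-entry onesInByte table for the final block.

```python
POPC8 = [bin(v).count("1") for v in range(256)]   # universal onesInByte table

class RankBitmap:
    """rank1 via cumulative superblock counters (Ds), superblock-relative
    block counters (Db), and byte popcounts inside the 32-bit block."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.Ds, self.Db = [], []
        total = run = 0
        for i, bit in enumerate(self.bits):
            if i % 256 == 0:
                self.Ds.append(total)    # 1s before this superblock
                run = 0
            if i % 32 == 0:
                self.Db.append(run)      # 1s in this superblock before this block
            run += bit
            total += bit

    def rank1(self, p):
        """Number of 1s in positions 1..p (1-based, as in the slides)."""
        i = p - 1                                    # 0-based bit index
        start = (i // 32) * 32
        end = min(start + 32, len(self.bits))
        word = 0
        for j in range(start, end):                  # the block holding p
            word = (word << 1) | self.bits[j]
        word >>= (end - start) - (i - start + 1)     # shift out bits after p
        inblock = sum(POPC8[(word >> s) & 0xFF] for s in (0, 8, 16, 24))
        return self.Ds[i // 256] + self.Db[i // 32] + inblock
```

The final answer is Ds[p / 256] + Db[p / 32] + the popcount inside the block, exactly the formula of slide 44.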
slide-46
SLIDE 46

Bit Sequences Select1 in O(log n) with the same structures

select1(p)

  • In practice, binary search using rank
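The binary search on rank can be sketched generically: given any rank function (constant-time with the structures above), select runs in O(log n). Passing rank0 instead of rank1 gives select0 with the same code.

```python
def select1(rank1, n, j):
    """Position of the j-th 1-bit in a bitmap of n bits,
    by binary search on rank1; returns -1 if there are fewer than j ones."""
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if rank1(mid) < j:       # the j-th one is to the right of mid
            lo = mid + 1
        else:                    # mid already covers j ones: answer is <= mid
            hi = mid
    return lo if rank1(lo) == j else -1
```

With the bitmap of slide 40, select1(·, 3) = 6, and running it over rank0 gives select0(·, 10) = 15, matching the slide.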
slide-47
SLIDE 47

Bit Sequences Compressed representations

  • Compressed bit-sequence representations exist!
  • Compressed → [Raman et al., 2002]
  • For very sparse bitmaps → [Okanohara and Sadakane, 2007]
  • … see [Navarro 2016]
slide-48
SLIDE 48
  • Introduction
  • Basic compression
  • Sequences
  • Bit sequences
  • Integer sequences
  • A brief Review about Indexing


Agenda


slide-49
SLIDE 49

Integer Sequences access/rank/select on general sequences

Rank2(9) = 3

S=

select4(3) =7 access (13) = 3

4 4 3 2 6 2 4 2 4 1 1 2 3 5

1 2 3 4 5 6 7 8 9 10 11 12 13 14

see [Navarro 2016]

slide-50
SLIDE 50

  • [Grossi et al 2003]
  • Given a sequence of symbols and an encoding
  • The bits of the code of each symbol are distributed along the different levels of the tree

DATA:  A B A C D A C
CODES: A = 00, B = 01, C = 10, D = 11

WAVELET TREE:
Broot (1st bit of each code):  A B A C D A C → 0 0 0 1 1 0 1
B0 (symbols A,B; 2nd bits):    A B A A → 0 1 0 0
B1 (symbols C,D; 2nd bits):    C D C → 0 1 0

Integer Sequences Wavelet tree (construction)

slide-51
SLIDE 51

  • Searching for the 1st occurrence of 'D' (selectD(1))?

DATA: A B A C D A C; CODES: A = 00, B = 01, C = 10, D = 11
Broot = 0 0 0 1 1 0 1;  B0 (A,B) = 0 1 0 0;  B1 (C,D) = 0 1 0

'D' is stored under B1. The 1st 'D' is the 1st '1' in B1 → it is at position 2 of B1. Moving up to the root: where is the 2nd '1' in Broot? → at position 5. So the 1st 'D' is at position 5 of the data.

Integer Sequences Wavelet tree (select)

slide-52
SLIDE 52

  • Recovering data: extracting a symbol
  • Which symbol appears at the 6th position?

DATA: A B A C D A C; CODES: A = 00, B = 01, C = 10, D = 11
Broot = 0 0 0 1 1 0 1;  B0 (A,B) = 0 1 0 0;  B1 (C,D) = 0 1 0

Broot[6] = 0 → go left to B0. How many '0's up to position 6 in Broot? → it is the 4th '0'. Which bit is at position 4 in B0? → 0. The codeword read is '00' → A.

Integer Sequences Wavelet tree (access)

slide-53
SLIDE 53

  • Recovering data: extracting a symbol
  • Which symbol appears at the 7th position?

DATA: A B A C D A C; CODES: A = 00, B = 01, C = 10, D = 11
Broot = 0 0 0 1 1 0 1;  B0 (A,B) = 0 1 0 0;  B1 (C,D) = 0 1 0

Broot[7] = 1 → go right to B1. How many '1's up to position 7 in Broot? → it is the 3rd '1'. Which bit is at position 3 in B1? → 0. The codeword read is '10' → C.

Integer Sequences Wavelet tree (access)

slide-54
SLIDE 54

  • How many C’s are there up to position 7?

A B A C D A C 0 0 0 1 1

1

A B A A C D C 0 1 0 1 How many 0s up to position 3 in B1? How many ‘1’s are there up to pos 7? it is the 3rd ‘1’ 1 2 !!

TEXT SYMBOL CODE WAVELET TREE A B A C D A C C D 00 01 10 11 B A

B1 Broot B0 Select (locate symbol) Access and Rank:

Integer Sequences Wavelet tree (rank)

slide-55
SLIDE 55

  • Each level contains n + o(n) bits; total space is n ⌈log σ⌉ (1 + o(1)) bits
  • rank/select/access take O(log σ) time

DATA: A B A C D A C; CODES: A = 00, B = 01, C = 10, D = 11
Broot = 0 0 0 1 1 0 1 (n + o(n) bits);  B0 = 0 1 0 0 and B1 = 0 1 0 (n + o(n) bits in total)

Integer Sequences Wavelet tree (space and times)
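The select/access/rank walks of slides 51 to 54 can be sketched for the running two-level example. The naive linear-scan `_rank`/`_select` helpers stand in for the constant-time bitmap operations seen earlier; positions are 1-based as in the slides.

```python
class WaveletTree:
    """Two-level wavelet tree for the 4-symbol example (A=00, B=01, C=10, D=11)."""
    CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}

    def __init__(self, data):
        self.root = [int(self.CODE[c][0]) for c in data]                 # 1st bits
        self.b0 = [int(self.CODE[c][1]) for c in data if self.CODE[c][0] == "0"]
        self.b1 = [int(self.CODE[c][1]) for c in data if self.CODE[c][0] == "1"]

    @staticmethod
    def _rank(bits, b, i):
        """Occurrences of bit b in bits[1..i]."""
        return sum(1 for x in bits[:i] if x == b)

    @staticmethod
    def _select(bits, b, k):
        """Position of the k-th occurrence of bit b, or -1."""
        for pos, x in enumerate(bits, 1):
            if x == b:
                k -= 1
                if k == 0:
                    return pos
        return -1

    def access(self, i):
        b = self.root[i - 1]
        child = self.b1 if b else self.b0
        j = self._rank(self.root, b, i)              # project i into the child
        return {v: k for k, v in self.CODE.items()}[f"{b}{child[j - 1]}"]

    def rank(self, sym, i):
        c = self.CODE[sym]
        j = self._rank(self.root, int(c[0]), i)      # go down to the child...
        child = self.b1 if c[0] == "1" else self.b0
        return self._rank(child, int(c[1]), j)       # ...and count there

    def select(self, sym, k):
        c = self.CODE[sym]
        child = self.b1 if c[0] == "1" else self.b0
        j = self._select(child, int(c[1]), k)        # find it in the child...
        return -1 if j == -1 else self._select(self.root, int(c[0]), j)  # ...map up
```

On DATA = ABACDAC this reproduces the slides' examples: access(6) = A, access(7) = C, rankC(7) = 2, selectD(1) = 5.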

slide-56
SLIDE 56

  • Using Huffman coding (or others) → unbalanced tree
  • rank/select/access take O(H0(S)) average time
  • Space: nH0(S) + o(n) bits

DATA: A B A C D A C; Huffman CODES: A = 1, B = 000, C = 01, D = 001

Broot (1st bits):            A B A C D A C → 1 0 1 0 0 1 0
Node '0' (B,C,D; 2nd bits):  B C D C → 0 1 0 1
Node '00' (B,D; 3rd bits):   B D → 0 1

Encoded stream: 1 000 1 01 001 1 01
Integer Sequences Huffman-shaped (or others) Wavelet tree

slide-57
SLIDE 57
  • Introduction
  • Basic compression
  • Sequences
  • Bit sequences
  • Integer sequences
  • A brief Review about Indexing


Agenda


slide-58
SLIDE 58

Inverted Indexes are the most well-known index for text […] Suffix Arrays are powerful but huge full-text indexes. Self-indexes trade performance for more compact space.

A brief review about indexing

COMPACT DATA STRUCTURES: TO COMPRESS IS TO CONQUER PAGE 59

slide-59
SLIDE 59

A brief Review about Indexing

Text indexing: well-known structures from the Web

  • Traditional indexes (with or without compression)
  • Inverted Indexes, Suffix Arrays, …
  • Compressed Self-indexes
  • Wavelet trees, Compressed Suffix Arrays, FM-index, LZ-index, …

(Traditional indexes: auxiliary structure + explicit text. Self-indexes: the text is kept implicitly.)

slide-60
SLIDE 60

A brief Review about Indexing Inverted indexes

Space-time trade-off

Vocabulary: DCC, communications, compression, image, data, information, Cliff, Lodge
Posting lists: 142 104 165 341 506 368 219 445 / 99 207 336 128 395 19 25 …

Indexed text: "DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not only compression methods for specific types of data (text, image, video, audio, space, graphics, web content, […]) … also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI)."

Searches

Word → posting list of that word. Phrase → intersection of postings.

Doc 1 Doc 2

Compression

  • Indexed text (Huffman,...)
  • Posting lists (Rice,...)

(A doc-addressing inverted index stores, for each vocabulary word, the documents (Doc 1, Doc 2) where it occurs; a full-positional index stores every occurrence position.)

slide-61
SLIDE 61

A brief Review about Indexing Inverted indexes

  • Lists contain increasing integers
  • Gaps between integers are smaller in the longest lists

4 10 15 25 29 40 46 54 57 70 79 82

Original posting list

1 2 3 4 5 6 7 8 9 10 11 12

4 6 5 10 4 11 6 8 3 13 9 3

Differences (gaps)

4  c6 c5 c10  29  c11 c6 c8  57  c13 c9 c3

Absolute sampling + var-length coding → direct access, partial decompression

c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3

Var-length coding only → complete decompression needed

slide-62
SLIDE 62

A brief Review about Indexing Suffix Arrays

  • Sorting all the suffixes of T lexicographically

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

$  a$  abra$  abracadabra$  acadabra$  adabra$  bra$  bracadabra$  cadabra$  dabra$  ra$  racadabra$

slide-63
SLIDE 63

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

P = a b

slide-64
SLIDE 64

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

P = a b

slide-65
SLIDE 65

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

P = a b

slide-66
SLIDE 66

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

P = a b

slide-67
SLIDE 67

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

P = a b

slide-68
SLIDE 68

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

P = a b

slide-69
SLIDE 69

A brief Review about Indexing Suffix Arrays

  • Binary search for any pattern: “ab”

a b r a c a d a b r a $

1 2 3 4 5 6 7 8 9 10 11 12

T = 12 11 8 1 4 6 9 2 5 7 10 3

1 2 3 4 5 6 7 8 9 10 11 12

A =

locations:
noccs = (4 - 3) + 1 = 2;  occs = A[3] .. A[4] = {8, 1}

Binary search is fast: O(m lg n) time to find the interval, O(m lg n + noccs) to report the occurrences. Space: O(4n) bytes for A (one 32-bit integer per suffix) + |T|.

P = a b
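The binary search of slides 63 to 69 can be sketched as below (naive O(n² log n) suffix-array construction, 1-based positions as in the slides; `search` materializes the suffixes for clarity only, whereas a real implementation compares against T on the fly):

```python
from bisect import bisect_left

def suffix_array(T):
    """All suffix start positions (1-based), sorted lexicographically."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])

def search(T, A, P):
    """Binary search the interval of suffixes starting with P;
    return the text positions of the occurrences."""
    suffixes = [T[i - 1:] for i in A]
    lo = bisect_left(suffixes, P)         # first suffix >= P
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(P):
        hi += 1                           # extend over all suffixes prefixed by P
    return [A[k] for k in range(lo, hi)]
```

For T = abracadabra$ this reproduces A = 12 11 8 1 4 6 9 2 5 7 10 3, and searching "ab" finds the interval A[3..4] = {8, 1}.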

slide-70
SLIDE 70

A brief Review about Indexing BWT → FM-index

  • BWT(S) + other structures → it is an index
  • C[c] : for each char c in S , stores the number of occs

in S of the chars that are lexicographically smaller than c.

C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8

  • OCC(c, k): Number of occs of char c in the prefix of

L: L [1 ..k]

For k in [1..12] Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1 Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4 Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1 Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2 Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4

  • Char L[i] occurs in F at position LF(i):

LF(i) = C[L[i]] + Occ(L[i],i)

slide-71
SLIDE 71

A brief Review about Indexing BWT → FM-index

  • Count(S[1,u], P[1,p]): count the occurrences of P in S
  • Example: Count(S, "issi") processes P backwards: i, s, s, i

In each step, the current range [sp, ep] of matrix rows is updated with:
sp = C[c] + Occ(c, sp - 1) + 1;  ep = C[c] + Occ(c, ep)

C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8
Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1
Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4
Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1
Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2
Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
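The backward-search count can be sketched end to end from L = BWT(S) alone. The naive `L[:k].count(c)` stands in for the Occ table (or a wavelet tree over L, as the next slide notes); positions are 1-based as in the slides.

```python
def fm_count(L, P):
    """FM-index Count: number of occurrences of P in S, using only
    C[] and Occ(c,k) computed from L = BWT(S)."""
    chars = sorted(set(L))
    C, total = {}, 0
    for c in chars:                      # C[c]: chars in S smaller than c
        C[c] = total
        total += L.count(c)
    occ = lambda c, k: L[:k].count(c)    # Occ(c,k): occs of c in L[1..k]
    sp, ep = 1, len(L)                   # initial range: all rows
    for c in reversed(P):                # process the pattern right to left
        if c not in C:
            return 0
        sp = C[c] + occ(c, sp - 1) + 1
        ep = C[c] + occ(c, ep)
        if sp > ep:                      # empty range: P does not occur
            return 0
    return ep - sp + 1
```

With L = ipssm$pissii (the BWT of mississippi$), `fm_count(L, "issi")` returns 2, matching the two occurrences of "issi" in the text.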

slide-72
SLIDE 72

A brief Review about Indexing BWT → FM-index

  • Representing L with a wavelet tree → Occ is "compressed"
slide-73
SLIDE 73

Bibliography

1. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994. http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/
2. F. Claude and G. Navarro. Practical rank/select queries over arbitrary sequences. In Proc. 15th SPIRE, LNCS 5280, pages 176–187, 2008.
3. P. Ferragina and G. Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
4. P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005.
5. P. Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994.
6. A. Golynski, I. Munro, and S. Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th SODA, pages 368–373, 2006.
7. R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.

slide-74
SLIDE 74

Bibliography

8. D. A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9):1098–1101, 1952.
9. N. J. Larsson and A. Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000.
10. U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Comp., 22(5):935–948, 1993.
11. A. Moffat and A. Turpin. Compression and Coding Algorithms. Kluwer, 2002. ISBN 0-7923-7668-4.
12. I. Munro. Tables. In Proc. 16th FSTTCS, LNCS 1180, pages 37–42, 1996.
13. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), article 2, 2007.
14. G. Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 570 pages, 2016.
15. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th ALENEX, 2007.

slide-75
SLIDE 75

Bibliography

16. R. Raman, V. Raman, and S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th SODA, pages 233–242, 2002.
17. E. Silva de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, 2000.
18. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999.
19. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
20. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.

slide-76
SLIDE 76

Compact Data Structures

(To compress is to Conquer)

Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto

23rd August 2017

3rd KEYSTONE Training School Keyword search in Big Linked Data

(Thanks: slides partially by Susana Ladra, E. Rodríguez, and José R. Paramá)