Compact Data Structures
(To compress is to Conquer)
Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto
23RD AUGUST 2017
3rd KEYSTONE Training School Keyword search in Big Linked Data
Agenda
Compact data structures lie at the intersection of Data Structures (indexing) and Information Theory (compression): One looks at data representations that not only permit space close to the minimum possible (as in compression) but also require that those representations allow one to efficiently carry
Introduction to Compact Data Structures
Introduction Why compression?
(access to memory is around 10^6 times faster than HDD)
Text collection (100%): Doc 1, Doc 2, Doc 3, ..., Doc n
Compressed text collection (30%): Doc 1, Doc 2, Doc 3, ..., Doc n
Compressed text collection (20%, p7zip and others): Doc 1, Doc 2, Doc 3, ..., Doc n
Let’s search for “Keystone”
Introduction Why indexing?
Text collection (100%): Doc 1, Doc 2, Doc 3, ..., Doc n
Compressed text collection (30%): Doc 1, Doc 2, Doc 3, ..., Doc n
Index (> 5-30%): term 1 … Keystone … term n
Let’s search for “Keystone”
Introduction Why compact data structures?
Text collection: Doc 1, Doc 2, Doc 3, ..., Doc n
Index (> 5-30%): term 1 … Keystone … term n
Self-index (WT, WCSA,…): term 1 … Keystone … term n
Let’s search for “Keystone”
Compressing aims at representing data within less
traditional compression techniques?
Compression
Basic Compression Modeling & Coding
Input alphabet
dictionary
var2var
Target alphabet fixed var fixed var
Basic Compression Main families of compressors
Obtain compression
Basic Compression Dictionary-based compressors
Basic Compression LZW
EXAMPLE
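The LZW example itself was a figure that did not survive extraction. As a stand-in, here is a minimal Python sketch of LZW coding and decoding. It assumes the initial dictionary holds the distinct input symbols in sorted order; the function names lzw_compress and lzw_decompress are illustrative, not taken from the slides.

```python
def lzw_compress(text):
    """Encode text as a list of integer codes (the dictionary grows as we read)."""
    # Assumption: the initial dictionary holds the distinct input symbols, sorted.
    alphabet = sorted(set(text))
    table = {ch: code for code, ch in enumerate(alphabet)}
    codes, w = [], ""
    for ch in text:
        if w + ch in table:
            w += ch                      # keep extending the current phrase
        else:
            codes.append(table[w])       # emit the longest known phrase
            table[w + ch] = len(table)   # learn the new phrase
            w = ch
    if w:
        codes.append(table[w])
    return alphabet, codes


def lzw_decompress(alphabet, codes):
    """Rebuild the text; the decoder relearns the same phrases one step late."""
    table = {code: ch for code, ch in enumerate(alphabet)}
    if not codes:
        return ""
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        # A code not yet in the table refers to the phrase being defined right now.
        entry = table[code] if code in table else w + w[0]
        out.append(entry)
        table[len(table)] = w + entry[0]
        w = entry
    return "".join(out)
```

Only the alphabet and the code list need to be stored: the decoder rebuilds the phrase dictionary on the fly.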
Basic Compression Grammar-based – BPE - Re-Pair
Source sequence: A B C D E A B D E F D E D E F A B E C D
After DE → G: A B C G A B G F G G F A B E C D
After AB → H: H C G H G F G G F H E C D
After GF → I: H C G H I G I H E C D
Dictionary of rules: DE → G, AB → H, GF → I
Final Re-Pair sequence: H C G H I G I H E C D
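The replacement process in this example can be sketched in a few lines of Python. This is a naive quadratic-time illustration, not the linear-time Re-Pair algorithm; the function name repair and the fresh_symbols parameter (the names given to new rules) are assumptions made for the example.

```python
from collections import Counter


def repair(seq, fresh_symbols):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    seq = list(seq)
    rules = {}
    fresh = iter(fresh_symbols)          # assumed to hold enough unused symbols
    while True:
        # Note: overlapping pairs such as "GGG" are counted twice by this sketch.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break                        # no pair repeats: nothing left to gain
        new = next(fresh)
        rules[new] = (a, b)
        out, i = [], 0
        while i < len(seq):              # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules
```

Called as repair("ABCDEABDEFDEDEFABECD", "GHI"), it creates the rules for DE, AB and GF in that order, as in the slide.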
Basic Compression Statistical compressors
sequence (S).
n = number of symbols; nc = occurrences of symbol c
H0(S) <= log (|S|)
n H0(S) = lower bound of the size of S compressed with a zero-order compressor
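The entropy formula was an image on the original slide. The standard definition of zero-order empirical entropy is H0(S) = sum over symbols c of (nc/n) log2(n/nc). A small Python sketch, with an illustrative 16-symbol sequence chosen for the example:

```python
from collections import Counter
from math import log2


def zero_order_entropy(seq):
    """H0(S) = sum over symbols c of (nc/n) * log2(n/nc), in bits per symbol."""
    n = len(seq)
    return sum((nc / n) * log2(n / nc) for nc in Counter(seq).values())


# Illustrative sequence over the alphabet {1, 2, 3, 4} (an assumption of this sketch).
S = [4, 1, 4, 4, 4, 4, 1, 4, 2, 4, 1, 1, 2, 3, 4, 4]
h0 = zero_order_entropy(S)          # about 1.59 bits per symbol
lower_bound_bits = len(S) * h0      # about 25.5 bits for the whole sequence
```

With a plain encoding this sequence takes 16 x 3 = 48 bits (or 32 bits with 2 bits per symbol), so n H0(S) shows how much a zero-order compressor could save.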
Basic Compression Statistical compressors: Huffman coding
Basic Compression Burrows-Wheeler Transform (BWT)
BWT(S) is obtained by (1) building a matrix M with all the circular permutations of S$, (2) sorting the rows of M, and (3) taking the last column.
Example with S$ = mississippi$:
Rotations      Sorted rows (M)   F   L
mississippi$   $mississippi      $   i
$mississippi   i$mississipp      i   p
i$mississipp   ippi$mississ      i   s
pi$mississip   issippi$miss      i   s
ppi$mississi   ississippi$m      i   m
ippi$mississ   mississippi$      m   $
sippi$missis   pi$mississip      p   p
ssippi$missi   ppi$mississi      p   i
issippi$miss   sippi$missis      s   s
sissippi$mis   sissippi$mis      s   s
ssissippi$mi   ssippi$missi      s   i
ississippi$m   ssissippi$mi      s   i
L = BWT(S) = ipssm$pissii
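The three steps translate almost literally into Python. This is a didactic sketch that materializes all rotations (quadratic space); practical implementations derive L from a suffix array instead. The function name bwt is illustrative.

```python
def bwt(s):
    """Return the last column L of the sorted rotation matrix of s.

    Assumes s already ends with a unique terminator (here '$') that sorts
    before every other symbol.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))   # steps (1) and (2)
    return "".join(row[-1] for row in rotations)               # step (3)
```

For the running example, bwt("mississippi$") yields the column L of the sorted matrix.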
Basic Compression Burrows-Wheeler Transform: reversible (BWT-1)
 i   F   L   LF   Sorted rotation
 1   $   i    2   $mississippi
 2   i   p    7   i$mississipp
 3   i   s    9   ippi$mississ
 4   i   s   10   issippi$miss
 5   i   m    6   ississippi$m
 6   m   $    1   mississippi$
 7   p   p    8   pi$mississip
 8   p   i    3   ppi$mississi
 9   s   s   11   sippi$missis
10   s   s   12   sissippi$mis
11   s   i    4   ssippi$missi
12   s   i    5   ssissippi$mi
Steps:
1. Sort L to obtain F.
2. Build the LF mapping: if L[i] = ‘c’, k = the number of times ‘c’ occurs in L[1..i], and j = the position in F of the k-th occurrence of ‘c’, then set LF[i] = j.
   Example: L[7] = ‘p’ is the 2nd ‘p’ in L, so LF[7] = 8, which is the 2nd occurrence of ‘p’ in F.
3. Recover the source sequence S in n steps. Initially p = l = 6 (position of $ in L); i = 0; n = 12. In each step: S[n-i] = L[p]; p = LF[p]; i = i+1.
   Step i=0: S[12] = L[6] = ‘$’; p = LF[6] = 1; i = 1
   Step i=1: S[11] = L[1] = ‘i’; p = LF[1] = 2; i = 2
   ...
   Result: S = m i s s i s s i p p i $
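The three steps above can be sketched in Python as follows. The function name inverse_bwt and its terminator parameter are illustrative. The code works with 0-based positions internally and reports LF 1-based so that it can be compared with the slide.

```python
def inverse_bwt(L, terminator="$"):
    """Recover S from L = BWT(S); also return the LF mapping (1-based)."""
    n = len(L)
    # Steps 1 and 2: a stable sort of the positions of L by symbol yields F, and
    # the rank of position i in that order is LF[i] (k-th 'c' in L -> k-th 'c' in F).
    order = sorted(range(n), key=lambda i: L[i])
    LF = [0] * n
    for j, i in enumerate(order):
        LF[i] = j
    # Step 3: start at the row whose last symbol is the terminator and fill S
    # from its last position to its first one.
    p = L.index(terminator)
    S = [""] * n
    for i in range(n):
        S[n - 1 - i] = L[p]
        p = LF[p]
    return "".join(S), [x + 1 for x in LF]
```

Python's sort is stable, which is what guarantees that equal symbols keep their relative order between L and F.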
Basic Compression Bzip2: Burrows-Wheeler Transform (BWT)
[…] to the beginning of S’.
Why does it work? In a text it is likely that “he” is preceded by “t”, “ssisii” by “i”, …
We want to represent (compactly) a sequence of elements and to efficiently handle them.
(Who is in the 2nd position?? How many Barts up to position 5?? Where is the 3rd Bart??)
Sequences
Sequences Plain Representation of Data
4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Sequences Compressed Representation of Data (H0)
Symbol            4   1   2    3
Occurrences (nc)  9   4   2    1
Huffman codeword  1   01  000  001
[Figure: Huffman tree with internal node weights 16, 7 and 3 over the leaves 2, 3, 1, 4]
S:          4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4
Position:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Compressed: 1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1
26 bits: No direct access!
(but we could add sampling)
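The 26-bit figure can be checked with a short Huffman construction. The Python sketch below only computes codeword lengths. The actual 0/1 labels depend on how ties and children are ordered, so the bits may differ from the slide while the lengths, and therefore the total size, agree. The function name huffman_code_lengths is illustrative.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(seq):
    """Return {symbol: codeword length} for a Huffman code built from seq."""
    freq = Counter(seq)
    tie = count()                         # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                    # degenerate case: a single distinct symbol
        return {sym: 1 for sym in freq}
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)   # the two least frequent subtrees...
        f2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))   # ...become siblings
    return heap[0][2]


S = [4, 1, 4, 4, 4, 4, 1, 4, 2, 4, 1, 1, 2, 3, 4, 4]
lengths = huffman_code_lengths(S)
total_bits = sum(lengths[x] for x in S)
```

Because codewords have different lengths, finding where the i-th symbol starts requires decoding from the beginning, which is why the slide mentions sampling.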
Sequences Summary: Plain/Compressed access/rank/select
S:                    4 1 4 4 4 4 1 4 2 4 1 1 2 3 4 4
Position:             1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Plain (3 bits each):  100 010 100 100 100 100 001 100 010 100 001 001 010 011 100 100
Bit offset:           1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46
Compressed (Huffman): 1 01 1 1 1 1 01 1 000 1 01 01 000 001 1 1
Bit positions:        1 to 26
Bit Sequences access/rank/select on bitmaps
B = 0 1 0 0 1 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0
(positions 1 to 21)
rank1(6) = 3    rank0(10) = 5
select0(10) = 15    access(19) = 0
see [Navarro 2016]
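To pin down what the three operations mean, here is a naive Python sketch that answers them by scanning the bitmap. It is a reference for checking results, not an efficient structure. The class name NaiveBitmap is an assumption of this example, and positions are 1-based as in the slide.

```python
class NaiveBitmap:
    """access/rank/select by scanning; positions are 1-based."""

    def __init__(self, bits):
        self.bits = list(bits)

    def access(self, i):
        """Bit stored at position i."""
        return self.bits[i - 1]

    def rank(self, b, i):
        """Number of bits equal to b in B[1..i]."""
        return sum(1 for x in self.bits[:i] if x == b)

    def select(self, b, j):
        """Position of the j-th bit equal to b, or None if there is none."""
        seen = 0
        for pos, x in enumerate(self.bits, start=1):
            if x == b:
                seen += 1
                if seen == j:
                    return pos
        return None
```

Note that rank and select are inverses in the sense that rank(b, select(b, j)) == j whenever the j-th occurrence exists.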
Bit Sequences Applications
S: AAABBCCCCCCCCDDDEEEEEEEEEEFG (n log s bits)
B: 1001010000000100100000000011 (n bits)
D: ABCDEFG (s log s bits)
HDT Bitmaps from Javi's talk !!!
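One way to read the S, B, D example: B marks with a 1 the position where each run of equal symbols starts, D keeps one symbol per run, and S[i] can be recovered as D[rank1(B, i)]. The Python sketch below follows that reading, which is an interpretation of the slide. The function names are illustrative, and rank is computed by a plain scan.

```python
def mark_runs(S):
    """Return (B, D): B has a 1 where a run starts, D lists one symbol per run."""
    B, D = [], []
    for i, ch in enumerate(S):
        start = i == 0 or ch != S[i - 1]
        B.append(1 if start else 0)
        if start:
            D.append(ch)
    return B, "".join(D)


def access(B, D, i):
    """S[i] (1-based) recovered as D[rank1(B, i)]; rank is a plain scan here."""
    return D[sum(B[:i]) - 1]
```

With a constant-time rank structure on B, the scan in access disappears and S never needs to be stored explicitly.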
Bit Sequences Reaching O(1) rank & o(n) bits of extra space
Superblocks: Ds stores the number of 1s up to positions 1+256k (k= 0,1,2,…)
[Figure: bitmap split into superblocks of 256 bits (positions 1-256, 257-512, 513-768, ...). The first superblock has 35 bits set to 1 and the second has 27, so the first entries of Ds are 35 and 62.]
[Figure: each superblock is further split into blocks of 32 bits (positions 1-32, 33-64, ...). Db stores, for each block, the number of 1s counted from the start of its superblock; the first block has 4 bits set to 1 and the second has 6.]
Finally solving: rank1(D, p) = Ds[p / 256] + Db[p / 32] + rank1(blk, i), where i = p mod 32
blk = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1
(positions 1 to 32)
rank1(blk, 12): shift blk 32 – 12 = 20 positions to the right, then count the 1s of the result byte by byte with a lookup table.
blk  = 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1
blks = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0
Only 256 entries storing values [0..8]:
Val   binary     OnesInByte
0     00000000   0
1     00000001   1
2     00000010   1
3     00000011   2
...   ...        ...
252   11111100   6
253   11111101   7
254   11111110   7
255   11111111   8
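Putting the pieces together, the following Python sketch builds Ds, Db and the byte table and answers rank1 with three lookups plus a shift. It is a sketch under assumptions: 0-based array indexes internally, 1-based query positions with 1 <= p <= n, and Python integers standing in for 32-bit machine words. The names build_rank and rank1 are illustrative.

```python
SUPER, BLOCK = 256, 32

# 256-entry table: number of 1s in every possible byte value.
ONES_IN_BYTE = [bin(v).count("1") for v in range(256)]


def build_rank(bits):
    """Return (Ds, Db, words) for a list of 0/1 values.

    Ds[k]: 1s before superblock k. Db[j]: 1s between the start of the
    superblock that contains block j and the start of block j.
    """
    bits = list(bits)
    bits += [0] * (-len(bits) % BLOCK)          # pad to whole 32-bit blocks
    Ds, Db, words = [], [], []
    total = in_super = 0
    for start in range(0, len(bits), BLOCK):
        if start % SUPER == 0:
            Ds.append(total)
            in_super = 0
        Db.append(in_super)
        word = int("".join(map(str, bits[start:start + BLOCK])), 2)
        words.append(word)
        ones = bin(word).count("1")
        total += ones
        in_super += ones
    return Ds, Db, words


def rank1(index, p):
    """Number of 1s in positions 1..p (1-based, p within the bitmap)."""
    Ds, Db, words = index
    if p <= 0:
        return 0
    blk, i = divmod(p, BLOCK)                   # i bits are taken from block blk
    if i == 0:
        blk, i = blk - 1, BLOCK                 # p ends exactly on a block boundary
    word = words[blk] >> (BLOCK - i)            # keep only the first i bits of the block
    in_block = sum(ONES_IN_BYTE[(word >> s) & 0xFF] for s in (0, 8, 16, 24))
    return Ds[blk * BLOCK // SUPER] + Db[blk] + in_block
```

Storing Db relative to the superblock keeps its entries small (at most 256), which is what makes the extra space sublinear in the full scheme.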
Bit Sequences Select1 in O(log n) with the same structures
select1(p)
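A simple way to obtain select1 in O(log n) from a constant-time rank is to binary-search the positions for the first one whose rank reaches the target. The Python sketch below uses prefix counts as a stand-in for the rank structure. Both function names are illustrative, and a real implementation would first search Ds, then Db, and finish inside one block.

```python
from itertools import accumulate


def make_rank1(bits):
    """Stand-in for the O(1) rank structure: prefix counts of 1s."""
    prefix = [0] + list(accumulate(bits))
    return lambda p: prefix[p]


def select1(rank1, n, j):
    """Smallest position p in 1..n with rank1(p) == j, or None if j is out of range."""
    if j < 1 or rank1(n) < j:
        return None
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if rank1(mid) >= j:
            hi = mid                 # the j-th 1 is at mid or to its left
        else:
            lo = mid + 1             # not enough 1s yet: move right
    return lo
```

The search is correct because rank1 is non-decreasing in p.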
Bit Sequences Compressed representations
Integer Sequences access/rank/select on general sequences
S = 4 4 3 2 6 2 4 2 4 1 1 2 3 5
(positions 1 to 14)
rank2(9) = 3    select4(3) = 7    access(13) = 3
see [Navarro 2016]
Integer Sequences Wavelet tree (construction)
[…] levels of the tree
DATA: A B A C D A C
SYMBOL CODE: A = 00, B = 01, C = 10, D = 11
WAVELET TREE:
Broot: A B A C D A C  ->  0 0 0 1 1 0 1
B0:    A B A A        ->  0 1 0 0
B1:    C D C          ->  0 1 0
Integer Sequences Wavelet tree (select)
DATA: A B A C D A C; SYMBOL CODE: A = 00, B = 01, C = 10, D = 11
Broot = 0 0 0 1 1 0 1 (A B A C D A C); B0 = 0 1 0 0 (A B A A); B1 = 0 1 0 (C D C)
selectD(1): the codeword of D is ‘11’.
  Where is the 1st ‘1’ in B1? At pos 2: it is the 2nd bit in B1.
  Where is the 2nd ‘1’ in Broot? At pos 5.
Integer Sequences Wavelet tree (access)
access(6): Broot[6] is set to 0.
  How many ‘0’s are there up to pos 6? It is the 4th ‘0’.
  Which bit occurs at position 4 in B0? It is set to 0.
  The codeword read is ‘00’: A
access(7): Broot[7] is set to 1.
  How many ‘1’s are there up to pos 7? It is the 3rd ‘1’.
  Which bit occurs at position 3 in B1? It is set to 0.
  The codeword read is ‘10’: C
Integer Sequences Wavelet tree (rank)
rankC(7): the codeword of C is ‘10’.
  How many ‘1’s are there up to pos 7 in Broot? 3.
  How many ‘0’s are there up to position 3 in B1? 2 !!
Access and rank traverse the tree top-down; select (locate symbol) traverses it bottom-up.
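The three traversals can be sketched with a small pointer-based wavelet tree in Python. This is a sketch under assumptions: the alphabet is split in halves at each node (which matches the codes A = 00, B = 01, C = 10, D = 11 for a sorted four-symbol alphabet), bit-level rank and select are plain scans rather than O(1) structures, queried symbols must belong to the alphabet, and the class and method names are illustrative.

```python
class WaveletTree:
    """Balanced wavelet tree over a sequence; positions are 1-based."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            self.bits = None                     # leaf: a single symbol, no bitmap
            return
        half = (len(self.alphabet) + 1) // 2
        self.low = set(self.alphabet[:half])     # symbols routed to the left child
        self.bits = [0 if s in self.low else 1 for s in seq]
        self.left = WaveletTree([s for s in seq if s in self.low], self.alphabet[:half])
        self.right = WaveletTree([s for s in seq if s not in self.low], self.alphabet[half:])

    def _rank_bit(self, b, i):
        return sum(1 for x in self.bits[:i] if x == b)   # plain scan, not O(1)

    def _select_bit(self, b, j):
        seen = 0
        for pos, x in enumerate(self.bits, 1):
            seen += x == b
            if seen == j and x == b:
                return pos
        return None

    def access(self, i):
        """Top-down: read one bit per level and map the position with rank."""
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i - 1]
        child = self.right if b else self.left
        return child.access(self._rank_bit(b, i))

    def rank(self, c, i):
        """Top-down: follow the codeword of c, shrinking i with rank at each level."""
        if self.bits is None or i == 0:
            return i
        b = 0 if c in self.low else 1
        child = self.right if b else self.left
        return child.rank(c, self._rank_bit(b, i))

    def select(self, c, j):
        """Bottom-up: locate the j-th c in the leaf, then map it upwards with select."""
        if self.bits is None:
            return j
        b = 0 if c in self.low else 1
        child = self.right if b else self.left
        pos = child.select(c, j)
        return None if pos is None else self._select_bit(b, pos)
```

Built on "ABACDAC", the root, left and right bitmaps are 0001101, 0100 and 010, and the queries traced in the slides give access(6) = A, access(7) = C, rank of C up to 7 = 2 and select of the 1st D = 5.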
Integer Sequences Wavelet tree (space and times)
DATA: A B A C D A C; SYMBOL CODE: A = 00, B = 01, C = 10, D = 11
Plain encoding: 00010010110010
Each level of the wavelet tree: n + o(n) bits
Total: n ⌈log s⌉ (1 + o(1)) bits
Integer Sequences Huffman-shaped (or others) Wavelet tree
DATA: A B A C D A C
SYMBOL CODE: A = 1, B = 000, C = 01, D = 001
Encoded data: 1 000 1 01 001 1 01
WAVELET TREE:
Root:    A B A C D A C  ->  1 0 1 0 0 1 0
Node 0:  B C D C        ->  0 1 0 1
Node 00: B D            ->  0 1
nH0(S) + o(n) bits
Inverted indexes are the best-known indexes for text […] Suffix arrays are powerful but huge full-text indexes. Self-indexes trade performance for a more compact space.
A brief review about indexing
A brief Review about Indexing
Text indexing: well-known structures from the Web
implicit text, auxiliary structure, explicit text
A brief Review about Indexing Inverted indexes
Space-time trade-off
Vocabulary: DCC, communications, compression, image, data, information, Cliff, Lodge
Posting lists (full-positional information): 142 104 165 341 506 368 219 445 99 207 336 128 395 19 25
Posting lists (doc-addressing inverted index): 1 1 2 2 1 2 1 2 1 2 1 1
Indexed text (Doc 1, Doc 2): DCC is held at the Cliff Lodge convention center. It is an international forum for current work on data compression and related applications. DCC addresses not […] compression methods for specific types of data (text, image, video, audio, space, graphics, web content, [...] ... also the use of techniques from information theory and data compression in networking, communications, and storage applications involving large datasets (including image and information mining, retrieval, archiving, backup, communications, and HCI).
Searches
Word: posting of that word
Phrase: intersection of postings
Compression
Posting list compression
Position:               1 2 3 4 5 6 7 8 9 10 11 12
Original posting list:  4 10 15 25 29 40 46 54 57 70 79 82
Differences:            4 6 5 10 4 11 6 8 3 13 9 3
Var-length coding:      c4 c6 c5 c10 c4 c11 c6 c8 c3 c13 c9 c3
  Complete decompression
Absolute sampling + var length coding: 4 c6 c5 c10 29 c11 c6 c8 57 c13 c9 c3
  Direct access, partial decompression
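The posting-list transformation in this slide can be sketched in Python. The sketch stops at the gap values: the variable-length codewords written cX in the slide are left as plain integers, so the choice of code (Huffman, vbyte, ...) stays open. The function names and the sampling step k are illustrative.

```python
def gaps(postings):
    """Differences between consecutive postings (the first value is kept as is)."""
    return [p - q for p, q in zip(postings, [0] + postings[:-1])]


def sample_and_gap(postings, k):
    """Keep every k-th posting absolute and store the others as gaps."""
    g = gaps(postings)
    return [p if i % k == 0 else ("gap", g[i]) for i, p in enumerate(postings)]


def access(encoded, k, i):
    """Decode the i-th posting (0-based) starting from the nearest sample to its left."""
    start = i - i % k
    value = encoded[start]
    for _, gap in encoded[start + 1:i + 1]:
        value += gap
    return value
```

With k = 4, reading any posting touches at most three gaps after a sample, which is the "partial decompression" the slide refers to; without samples every access has to add up all the gaps from the start.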
A brief Review about Indexing Suffix Arrays
T = a b r a c a d a b r a $
(positions 1 to 12)
 i   A[i]   Suffix
 1   12     $
 2   11     a$
 3   8      abra$
 4   1      abracadabra$
 5   4      acadabra$
 6   6      adabra$
 7   9      bra$
 8   2      bracadabra$
 9   5      cadabra$
10   7      dabra$
11   10     ra$
12   3      racadabra$
Searching for P = a b: binary search over A, comparing P with the suffixes T[A[i]..]
T = a b r a c a d a b r a $
A = 12 11 8 1 4 6 9 2 5 7 10 3
Noccs = (4-3)+1
Occs = A[3] .. A[4] = {8, 1} (locations)
Fast: O(m lg n) to count, O(m lg n + noccs) to locate
Space: O(4n) + |T|
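The search can be sketched in Python with two binary searches that delimit the range of suffixes starting with P. The suffix array is built naively by sorting the suffixes (fine for a small example, not for large texts), and the function names are illustrative. Positions are 1-based to match the slides.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def search(text, A, pattern):
    """Return the 1-based range (lo, hi) of A whose suffixes start with pattern."""
    m = len(pattern)

    def prefix(k):                       # first m symbols of the k-th suffix (0-based k)
        start = A[k] - 1
        return text[start:start + m]

    lo, hi = 0, len(A)
    while lo < hi:                       # leftmost suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(A)
    while lo < hi:                       # leftmost suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return (first + 1, lo) if first < lo else None
```

Each comparison costs up to m symbols and there are O(lg n) of them, which gives the O(m lg n) counting time; reporting the occurrences then reads the noccs entries of A in the range.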
A brief Review about Indexing BWT FM-index
C[c]: number of occurrences in S of the chars that are lexicographically smaller than c.
C[$]=0 C[i]=1 C[m]=5 C[p]=6 C[s]=8
Occ(c, k): number of occurrences of c in L[1..k].
For k in [1..12]:
Occ[$] = 0,0,0,0,0,1,1,1,1,1,1,1
Occ[i] = 1,1,1,1,1,1,1,2,2,2,3,4
Occ[m] = 0,0,0,0,1,1,1,1,1,1,1,1
Occ[p] = 0,1,1,1,1,1,2,2,2,2,2,2
Occ[s] = 0,0,1,2,2,2,2,2,3,4,4,4
LF(i) = C[L[i]] + Occ(L[i],i)
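C and Occ are enough to compute LF and to count pattern occurrences by backward search, the counting procedure of the FM-index of Ferragina and Manzini. The Python sketch below stores Occ as full arrays, which is only reasonable for a toy example: a real FM-index answers Occ with rank on a compressed representation of L. The function names are illustrative.

```python
from collections import Counter


def build_fm(L):
    """Return (C, Occ) for the BWT string L; Occ[c][k] counts c in L[1..k]."""
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total                     # symbols smaller than c seen so far
        total += counts[c]
    Occ = {c: [0] * (len(L) + 1) for c in counts}
    for k, ch in enumerate(L, start=1):
        for c in Occ:
            Occ[c][k] = Occ[c][k - 1] + (c == ch)
    return C, Occ


def lf(L, C, Occ, i):
    """LF(i) = C[L[i]] + Occ(L[i], i), with 1-based i."""
    c = L[i - 1]
    return C[c] + Occ[c][i]


def count(L, C, Occ, pattern):
    """Backward search: number of occurrences of pattern in the indexed text."""
    sp, ep = 1, len(L)                   # current range of rows, 1-based and inclusive
    for c in reversed(pattern):
        if c not in C:
            return 0
        sp = C[c] + Occ[c][sp - 1] + 1
        ep = C[c] + Occ[c][ep]
        if sp > ep:
            return 0
    return ep - sp + 1
```

For L = ipssm$pissii, searching "ssi" narrows the range to rows [2,5], then [9,10], then [11,12], so the pattern occurs twice.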
s s i
Bibliography
1. […] Digital Systems Research Center, 1994. http://gatekeeper.dec.com/pub/DEC/SRC/researchreports/.
2. […] LNCS 5280, pages 176–187, 2008.
3. Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index. In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington (USA), 2001.
4. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
5. Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, February 1994.
6. […]
7. […] pages 841–850, 2003.
8. David A. Huffman. A method for the construction of minimum-redundancy codes. Proc. of the Institute of Radio Engineers, 40(9):1098-1101, 1952.
9. […] 88(11):1722–1732, 2000.
[…] 22(5):935–948, 1993.
[…] n.1, p.2-es, 2007.
[…] pages, 2016.
[…] ALENEX, 2007.
[…] trees and multisets. In Proc. 13th SODA, pages 233–242, 2002.
[…] searching on compressed text. ACM Transactions on Information Systems, 18(2):113–139, 2000.
[…] Documents and Images. Morgan Kaufmann, 1999.
[…] Information Theory 23, 3, 337–343.
[…] Transactions on Information Theory 24, 5, 530–536.
(Thanks: slides partially by: Susana Ladra, E. Rodríguez, & José R. Paramá)