Space and Time-Efficient Data Structures for Massive Datasets
Giulio Ermanno Pibiri
giulio.pibiri@di.unipi.it
Supervisor: Rossano Venturini
Computer Science Department, University of Pisa
10/10/2017
1
2
High Level Thesis: Data Structures +
3
Journal paper
Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS), 2017.
Conference paper
Giulio Ermanno Pibiri and Rossano Venturini. Annual Symposium on Combinatorial Pattern Matching (CPM), 2017.
Conference paper
Giulio Ermanno Pibiri and Rossano Venturini. ACM Conference on Research and Development in Information Retrieval (SIGIR), 2017.
EVERYTHING that I do (papers, slides and code) is fully accessible at my page: http://pages.di.unipi.it/pibiri/
4
Inverted indexes owe their popularity to the efficient resolution of queries such as: "return all documents in which the terms {t1,…,tk} occur".
[Example: a collection of 5 documents over the vocabulary T = {always, boy, good, house, hungry, is, red, the}, with one posting list per term ti.]
Lt1=[1, 3]  Lt2=[4, 5]  Lt3=[1]  Lt4=[2, 3]  Lt5=[3, 5]  Lt6=[1, 2, 3, 4, 5]  Lt7=[1, 2, 4]  Lt8=[2, 3, 5]
Query q = {boy, is, the}: intersecting Lboy = Lt2 = [4, 5], Lis = Lt6 = [1, 2, 3, 4, 5] and Lthe = Lt8 = [2, 3, 5] returns document 5.
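To make the query-resolution claim concrete, here is a minimal C++ sketch of conjunctive query processing by intersecting sorted posting lists (my illustration, not the thesis code; the function name and in-memory layout are mine):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

// Answer a conjunctive query: return the documents present in every list.
std::vector<uint32_t> intersect(std::vector<std::vector<uint32_t>> lists) {
    if (lists.empty()) return {};
    // Start from the shortest list: the result can never be larger than it.
    std::sort(lists.begin(), lists.end(),
              [](auto const& a, auto const& b) { return a.size() < b.size(); });
    std::vector<uint32_t> result = lists.front();
    for (size_t i = 1; i != lists.size() && !result.empty(); ++i) {
        std::vector<uint32_t> out;
        std::set_intersection(result.begin(), result.end(),
                              lists[i].begin(), lists[i].end(),
                              std::back_inserter(out));
        result.swap(out);
    }
    return result;
}

int main() {
    // q = {boy, is, the} on the example index above.
    auto docs = intersect({{4, 5},            // L_boy
                           {1, 2, 3, 4, 5},   // L_is
                           {2, 3, 5}});       // L_the
    for (uint32_t d : docs) std::cout << d << '\n';  // prints 5
}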
5
Every encoder represents each sequence individually: no exploitation of the redundancy shared across lists.
Idea: encode clusters of posting lists.
6
A cluster of posting lists is encoded with respect to a reference list R.
Encoding a docID directly takes log u bits; since |R| << u, encoding its position inside R takes only log |R| bits.
Problems
1. Build the clusters.
2. Synthesise the reference list.
This is an NP-hard problem already for a simplified formulation.
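A minimal sketch of the saving this enables, under the simplifying assumption that every docID of a clustered list also appears in R (the function and names are mine, not the paper's algorithm): each docID is replaced by its rank in R, which then fits in ceil(log2 |R|) bits instead of ceil(log2 u).

#include <algorithm>
#include <cstdint>
#include <vector>

// Remap a posting list onto a reference list R: each docID is replaced
// by its position in R, shrinking the universe from u to |R|.
// Assumes R is sorted and contains every docID of the list.
std::vector<uint32_t> remap(std::vector<uint32_t> const& list,
                            std::vector<uint32_t> const& R) {
    std::vector<uint32_t> positions;
    positions.reserve(list.size());
    for (uint32_t doc : list) {
        auto it = std::lower_bound(R.begin(), R.end(), doc);
        positions.push_back(uint32_t(it - R.begin()));
    }
    return positions;  // encode these with ceil(log2(|R|)) bits each
}

Decoding simply inverts the mapping by reading each stored position back through R.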
7
Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%).
Speed: much faster than BIC (103% on average), slightly slower than PEF (20% on average).
8
A dynamic ordered set S is a data structure representing n keys and supporting insert, delete, search, minimum/maximum and predecessor/successor operations.
In the comparison model this is solved optimally by any self-balancing tree data structure in O(log n) time and O(n) space. More efficient solutions exist if the considered keys are integers drawn from a bounded universe of size u.
How can we optimally solve the integer dynamic ordered set problem?
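For reference, a balanced search tree already gives all of these operations in O(log n); e.g., with std::set in C++ (typically a red-black tree in common standard-library implementations):

#include <cstdint>
#include <iostream>
#include <set>

int main() {
    std::set<uint64_t> S;                 // self-balancing tree: O(log n) per op
    S.insert(3); S.insert(9); S.insert(7);
    S.erase(9);                           // delete
    bool found = S.count(7) > 0;          // search
    auto it = S.lower_bound(5);           // successor(5): smallest key >= 5
    std::cout << found << ' ' << *it << ' '
              << *S.begin() << ' ' << *S.rbegin() << '\n';  // 1 7 3 7 (min = 3, max = 7)
}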
9
Integer data structures: fast operations, but large space.
Elias-Fano encoding: encodes an ordered integer sequence S in compressed space, but is static.
Can we grab the best from both?
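As background for what follows, a compact sketch of plain Elias-Fano encoding (my illustration, simplified): each of the n sorted integers in [0, u) is split into l = ceil(log2(u/n)) low bits, stored verbatim, and the remaining high bits, stored as gaps in a unary-coded bitmap of at most 2n bits; random access additionally needs a select structure on the bitmap, omitted here.

#include <cmath>
#include <cstdint>
#include <vector>

// Plain Elias-Fano: split each value into a high and a low part.
struct EliasFano {
    uint64_t l;                  // number of low bits per element
    std::vector<uint64_t> low;   // low parts (l bits each in a real implementation)
    std::vector<bool> high;      // high parts: n ones among at most u/2^l zeros

    EliasFano(std::vector<uint64_t> const& S, uint64_t u) {
        uint64_t n = S.size();
        l = (u > n) ? uint64_t(std::ceil(std::log2(double(u) / double(n)))) : 0;
        high.assign((u >> l) + n + 1, false);
        for (uint64_t i = 0; i != n; ++i) {
            low.push_back(S[i] & ((uint64_t(1) << l) - 1));  // keep l low bits
            high[(S[i] >> l) + i] = true;                    // unary bucket code
        }
    }
};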
10
For u = n^γ, γ = Θ(1):
Result 1
Result 2
Result 3
11
EF(S(n,u)) = n log(u/n) + 2n bits
Lower level: S is cut into mini blocks of size b = log n / log log n, grouped into blocks of log² n mini blocks each.
T is a k-ary tree of constant height, built on each block.
Upper level: Y is a y-fast trie and P is a dynamic prefix-sums data structure, each storing one entry per block, i.e., O(n / (b · log² n)) entries of O(log u) bits = o(n) bits each.
+ some technicalities
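To give the space bound a feel, a quick worked instance (numbers mine):

\[
\mathrm{EF}(S(n,u)) = n \left\lceil \log_2 \tfrac{u}{n} \right\rceil + 2n \text{ bits.}
\]
For $n = 2^{20}$ integers drawn from a universe of size $u = 2^{32}$:
\[
2^{20} \cdot \log_2\!\left(2^{32}/2^{20}\right) + 2 \cdot 2^{20}
  = 2^{20} (12 + 2) = 14 \cdot 2^{20} \text{ bits} \approx 1.75\,\text{MB},
\]
versus $32 \cdot 2^{20}$ bits $= 4\,\text{MB}$ for plain 32-bit storage.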
12
N-grams: strings of N words, where N typically ranges from 1 to 5. Extracted from text using a sliding-window approach.

N    number of grams
1         24,359,473
2        667,284,771
3      7,397,041,901
4      1,644,807,896
5      1,415,355,596
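A minimal sketch of sliding-window extraction (illustrative, not the thesis tooling): slide a window of N consecutive words over the tokenized text and emit each window as one gram.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Emit every N-gram of a text via a sliding window over its tokens.
std::vector<std::string> extract_ngrams(std::string const& text, size_t N) {
    std::vector<std::string> words, grams;
    std::istringstream in(text);
    for (std::string w; in >> w;) words.push_back(w);
    if (N == 0 || words.size() < N) return grams;
    for (size_t i = 0; i + N <= words.size(); ++i) {   // window start
        std::string gram = words[i];
        for (size_t j = 1; j != N; ++j) gram += ' ' + words[i + j];
        grams.push_back(gram);
    }
    return grams;
}

int main() {
    for (auto const& g : extract_ngrams("the house is red", 2))
        std::cout << g << '\n';   // "the house", "house is", "is red"
}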
13
Compressed Tries with Context-based ID Remapping - SIGIR'17
High-level idea: map a word ID to the position it takes within its sibling IDs (the IDs following a context of fixed length k).
Observation: the number of words following a given context is small: a funnel-shaped distribution.
[Figure: the ratio u/n obtained by varying the context length k = 1, 2, 3, 4.]
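The remapping can be sketched as follows for context length k = 1 (the structure and names are mine, simplified to a dictionary from each context to its sorted successor IDs):

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Context-based ID remapping, sketched for context length k = 1:
// a word is stored as its position among the (few) words that follow
// its context, so it fits in log2(#successors) bits, not log2(vocab).
using WordId = uint32_t;

struct Remapper {
    // successors[c] = sorted IDs of the words observed after context word c.
    std::map<WordId, std::vector<WordId>> successors;

    uint32_t remap(WordId context, WordId word) const {
        auto const& sib = successors.at(context);
        auto it = std::lower_bound(sib.begin(), sib.end(), word);
        return uint32_t(it - sib.begin());   // rank among sibling IDs
    }
    WordId unmap(WordId context, uint32_t pos) const {
        return successors.at(context)[pos];  // inverse mapping
    }
};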
14
Compressed Tries with Context-based ID Remapping - SIGIR'17
Test machine: Intel Xeon E5-2630 v3, 2.4 GHz; 193 GB of RAM; Linux, 64 bits.
C++ implementation, compiled with gcc 5.4.1 with the highest optimization setting.
[Chart annotations: Context-based ID Remapping; "you will notice this!"; "will you notice this?"]
15
Compressed Tries with Context-based ID Remapping - SIGIR'17
[Figure: improvement factors over competitors, ranging from 2X up to 5.8X: 2X, 2.3X, 2.5X, 2.7X, 2.8X, 3X, 3.5X, 5.5X, plus the ranges 2.5X to 5.2X and 3.1X to 5.8X.]
16
Scalable Modified Kneser-Ney Language Model Estimation

Dataset 1: 455,265,524 grams (1.3 GB), 233,035,325 total words.
1-grams: 1,255,027; 2-grams: 20,431,391; 3-grams: 82,815,629; 4-grams: 153,984,231; 5-grams: 196,779,246.

seconds          counting   normalization   interpolation
Tongrams - 1        11           17              36
Tongrams - 2        20           30              60
Tongrams - 4        35           52             104

seconds          counting   normalization   interpolation
Tongrams            11           17              36
KenLM               51           60             138

Dataset 2: 863,966,768 grams (3.2 GB), 495,527,349 total words.
1-grams: 15,039,323; 2-grams: 44,033,774; 3-grams: 142,894,817; 4-grams: 280,714,113; 5-grams: 381,284,741.

seconds          counting   normalization   interpolation
Tongrams - 1        22           26              46
Tongrams - 2        32           42              77
Tongrams - 4        53           62             112
Tongrams - 8        84          104             179
Tongrams - 16      148          155             297

seconds          counting   normalization   interpolation
Tongrams            22           26              46
KenLM              153          171             261
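For context, "modified Kneser-Ney" refers to Chen and Goodman's interpolated formulation, recalled here as textbook background; the phases timed above mirror its ingredients (collecting counts, computing normalization constants, interpolating with lower orders):

\[
P(w \mid c) \;=\; \frac{\max\{\,\mathrm{count}(cw) - D(\mathrm{count}(cw)),\, 0\,\}}{\mathrm{count}(c)}
\;+\; \lambda(c)\, P(w \mid c'),
\]
where $c'$ is the context $c$ with its first word dropped, the discount $D(\cdot)$ takes one of three values $D_1$, $D_2$, $D_{3+}$ depending on whether the count is $1$, $2$, or $\ge 3$, and $\lambda(c)$ is the normalization constant that makes the distribution sum to one.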
17
TCS   VLDBJ
Inverted indexes with false positives allowed.
Compressed tries based on double-arrays.
Data structures for a features repository.
18