Space- and Time-Efficient Data Structures for Massive Datasets
Supervisor Rossano Venturini
Giulio Ermanno Pibiri
Referee Daniel Lemire Referee Simon Gog
08/03/2019 Department of Computer Science University of Pisa
“Software is getting slower more rapidly than hardware becomes faster.”
Niklaus Wirth, A Plea for Lean Software
Clustered Elias-Fano Indexes
Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS). Full paper, 34 pages, 2017. Journal paper.

Dynamic Elias-Fano Representation
Giulio Ermanno Pibiri and Rossano Venturini. Annual Symposium on Combinatorial Pattern Matching (CPM). Full paper, 14 pages, 2017. Conference paper.

Efficient Data Structures for Massive N-Gram Datasets
Giulio Ermanno Pibiri and Rossano Venturini. ACM Conference on Research and Development in Information Retrieval (SIGIR). Full paper, 10 pages, 2017. Conference paper.

On Optimally Partitioning Variable-Byte Codes
Giulio Ermanno Pibiri and Rossano Venturini. IEEE Transactions on Knowledge and Data Engineering (TKDE). To appear. Full paper, 12 pages, 2019. Journal paper.

Handling Massive N-Gram Datasets Efficiently
Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS). To appear. Full paper, 41 pages, 2019. Journal paper.

Fast Dictionary-based Compression for Inverted Indexes
Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat. ACM Conference on Web Search and Data Mining (WSDM). Full paper, 9 pages, 2019. Conference paper.
Thesis topics: integer sequences and short strings.
The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

Example with five documents:
1: "red is always good"
2: "the house is red"
3: "the house is always hungry"
4: "boy is red"
5: "the boy is hungry"

Dictionary: {always, boy, good, house, hungry, is, red, the} = t1, t2, t3, t4, t5, t6, t7, t8

Lt1=[1, 3]  Lt2=[4, 5]  Lt3=[1]  Lt4=[2, 3]  Lt5=[3, 5]  Lt6=[1, 2, 3, 4, 5]  Lt7=[1, 2, 4]  Lt8=[2, 3, 5]
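The toy index above can be built in a few lines of Python; the document texts are the slide's example, and the resulting posting lists match the lists shown:

```python
# Five toy documents, keyed by document ID.
docs = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}

# Build the inverted index: for each term, the sorted list of the
# documents containing it. Processing doc IDs in increasing order
# keeps every posting list sorted by construction.
index = {}
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index.setdefault(term, []).append(doc_id)

print(index["is"])  # [1, 2, 3, 4, 5] — "is" occurs in every document
```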
Large research corpora describe different space/time trade-offs, spanning encoders from ~1970 to 2014.

On the space/time spectrum: Binary Interpolative Coding (BIC) sits at the space-efficient end; the Variable-Byte (VByte) family sits at the time-efficient end.
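VByte's speed comes from its byte-aligned layout: each integer is split into 7-bit payloads, one per byte. A minimal sketch of one common convention (high bit set marks the final byte; some implementations invert this and mark continuation bytes instead):

```python
def vbyte_encode(n):
    # Emit 7 payload bits per byte, least-significant first;
    # the high bit of the last byte is set as a stop marker.
    out = bytearray()
    while n >= 128:
        out.append(n & 127)
        n >>= 7
    out.append(n | 128)
    return bytes(out)

def vbyte_decode(data):
    # Decode a concatenation of VByte-coded integers.
    n, shift, values = 0, 0, []
    for b in data:
        n |= (b & 127) << shift
        shift += 7
        if b & 128:          # stop byte: value complete
            values.append(n)
            n, shift = 0, 0
    return values
```

In an inverted index, VByte is applied to the d-gaps (differences between consecutive document IDs), which are typically small and thus fit in one byte each.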
1. Is it possible to design an encoding that is as small as BIC and much faster?
2. Is it possible to design an encoding that is as fast as VByte and much smaller?
3. What about both objectives at the same time?

TOIS 2017 · WSDM 2019 · TKDE 2019
Every encoder represents each sequence individually. Idea: encode clusters of (similar) inverted lists against a shared reference list.

Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%). Time: slightly slower than PEF (~20%), but much faster than BIC (2X).
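The clustering idea can be sketched as follows; the function names and layout are illustrative, not the paper's actual encoding (which compresses the remapped positions with Elias-Fano). Each list in a cluster is rewritten as positions into the reference list, so its values live in a much smaller universe:

```python
def cluster_encode(lists):
    # The reference list is the sorted union of the cluster's lists
    # (one possible choice of reference; a sketch, not the paper's).
    reference = sorted(set().union(*map(set, lists)))
    rank = {v: i for i, v in enumerate(reference)}
    # Each list becomes a sequence of positions into the reference.
    return reference, [[rank[x] for x in lst] for lst in lists]

def cluster_decode(reference, encoded):
    return [[reference[i] for i in lst] for lst in encoded]
```

Since positions range over [0, |reference|) rather than over the full space of document IDs, similar lists in a cluster compress better.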
Spectrum
The majority of values are small (very small indeed). Idea: encode dense regions with unary codes and sparse regions with VByte.

Compression ratio improves by 2X; query processing speed and sequential decoding are (almost) not affected; the optimal partitioning is computed in linear time and constant space.
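The dense/sparse choice can be illustrated with a simplified cost model (an assumption for this sketch, not the paper's exact costs, and not its linear-time, constant-space optimal partitioner): a block spanning a universe of u values costs about u bits as a unary/bitmap region, versus roughly one byte per element with VByte.

```python
def block_cost_bits(block):
    # block: a sorted run of postings.
    span = block[-1] - block[0] + 1   # universe spanned by the run
    bitmap_bits = span                # one bit per universe position
    vbyte_bits = 8 * len(block)      # assumed one byte per small d-gap
    if bitmap_bits <= vbyte_bits:
        return bitmap_bits, "dense"
    return vbyte_bits, "sparse"
```

A run of consecutive IDs is cheaper as a bitmap; a run of widely spaced IDs is cheaper with VByte, which is exactly the intuition the partitioner exploits.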
If we consider subsequences of d-gaps in inverted lists, these are repetitive across the whole inverted index. Idea: put the most frequent patterns in a dictionary of size k, then encode inverted lists as sequences of log2 k-bit codewords.

Close to the most space-efficient representation (~7% away from BIC) and almost as fast as the fastest SIMD-ized decoders.
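A toy sketch of the dictionary idea, under simplifying assumptions: fixed-length patterns only, a gap stream whose length is a multiple of the pattern length, and every pattern present in the dictionary (the real codec also handles variable-length patterns and escapes for rare ones):

```python
from collections import Counter

def build_dictionary(gaps, pattern_len=2, k=256):
    # Count non-overlapping fixed-length patterns of d-gaps and keep
    # the k most frequent ones as the dictionary.
    patterns = Counter(
        tuple(gaps[i:i + pattern_len])
        for i in range(0, len(gaps) - pattern_len + 1, pattern_len)
    )
    return [p for p, _ in patterns.most_common(k)]

def encode(gaps, dictionary, pattern_len=2):
    # Replace each pattern occurrence by its dictionary index, which
    # fits in ceil(log2 k) bits per codeword.
    codebook = {p: i for i, p in enumerate(dictionary)}
    return [codebook[tuple(gaps[i:i + pattern_len])]
            for i in range(0, len(gaps), pattern_len)]
```

Decoding is a sequence of table lookups (`dictionary[code]`), which is why this scheme approaches the speed of the fastest decoders.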
N    number of N-grams
1       24,359,473
2      667,284,771
3    7,397,041,901
4    1,644,807,896
5    1,415,355,596
The number of words following a given context is small. Idea (k = 1): map a word ID to the position it takes within its sibling IDs, i.e., the IDs following a context of fixed length k.

The (Elias-Fano) context-based remapped trie is as fast as the fastest competitor, but up to 65% smaller. It is even smaller than the most space-efficient competitors, which are lossy and allow false positives, while being up to 5X faster.
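The remapping for k = 1 can be sketched as follows; the builder and function names are illustrative. For each unigram context, the IDs of the words that follow it form a small sorted set, so a follower can be stored as its (small) rank in that set instead of as a full vocabulary ID:

```python
import bisect

def build_siblings(bigrams):
    # bigrams: iterable of (context_word_id, following_word_id) pairs.
    siblings = {}
    for ctx, w in bigrams:
        siblings.setdefault(ctx, set()).add(w)
    # Sorted sibling sets allow rank/position lookups.
    return {ctx: sorted(ws) for ctx, ws in siblings.items()}

def remap(ctx, w, siblings):
    # The remapped ID is the position of w among its siblings.
    return bisect.bisect_left(siblings[ctx], w)
```

Since the remapped values are bounded by the number of followers of the context rather than by the vocabulary size, they need far fewer bits.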
To compute the modified Kneser-Ney probabilities of the N-grams, the fastest algorithm in the literature uses 3 sorting steps in external memory.

Suffix order vs. context order: the distinct left extensions are computed using a scan of the block and O(|V|) space, and the last level of the trie is rebuilt by turning per-symbol counts into starting offsets:

counts   A 4   B 2   C 2   X 4
offsets  A 1   B 5   C 7   X 9
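The counts-to-offsets table is an exclusive prefix sum, as in a counting-sort pass; a minimal sketch reproducing the slide's numbers (1-based offsets):

```python
def counts_to_offsets(counts, base=1):
    # counts: ordered (symbol, count) pairs.
    # Each symbol's offset is base plus the sum of all earlier counts,
    # i.e., where that symbol's run starts in the rebuilt level.
    offsets, run = {}, base
    for sym, c in counts:
        offsets[sym] = run
        run += c
    return offsets

counts = [("A", 4), ("B", 2), ("C", 2), ("X", 4)]
print(counts_to_offsets(counts))  # {'A': 1, 'B': 5, 'C': 7, 'X': 9}
```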
Estimation runs 4.5X faster with billions of strings.
The impact is far-reaching and implies substantial economic gains.
space and time-efficient ?   (the preceding words are the context; ? is the word to predict)

Candidate next words with their frequencies: algorithms 1214, foo 2, data 3647, bar 3, baz 1.

P(“data” | “space and time-efficient”) ≈ f(“space and time-efficient data”) / f(“space and time-efficient”)
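The estimate can be evaluated directly from follower frequencies. In this sketch, f(context) is taken as the sum of the follower counts shown on the slide, which is an assumption for the illustration:

```python
# Example follower counts for the context "space and time-efficient"
# (the slide's numbers).
followers = {"algorithms": 1214, "foo": 2, "data": 3647, "bar": 3, "baz": 1}

def prob(word, followers):
    # P(word | context) ≈ f(context·word) / f(context),
    # approximating f(context) by the total count of its followers.
    return followers[word] / sum(followers.values())
```

Here "data" dominates the distribution, matching the intuition that it is the most likely continuation of this context.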
Elias-Fano encoding of a sorted integer sequence S: excellent space, but static. Dynamic integer data structures: fast operations, but more space. Can we grab the best from both?
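A textbook sketch of plain (static) Elias-Fano, assuming S is nonempty, sorted, and all values are below u: each value is split into about floor(log2(u/n)) explicit low bits plus a high part stored in unary, for roughly n·log2(u/n) + 2n bits in total. This toy version stores the unary part as a list of gaps and answers access by a prefix sum (a real implementation uses a bitvector with constant-time select):

```python
def elias_fano(S, u):
    # Split each value into low bits (stored verbatim) and a high
    # "bucket" (stored as unary gaps between consecutive buckets).
    n = len(S)
    low_bits = max(0, (u // n).bit_length() - 1)   # ~ floor(log2(u/n))
    mask = (1 << low_bits) - 1
    low = [x & mask for x in S]
    high, prev_bucket = [], 0
    for x in S:
        bucket = x >> low_bits
        high.append(bucket - prev_bucket)          # unary-coded gap
        prev_bucket = bucket
    return low, high, low_bits

def access(i, low, high, low_bits):
    # Recover S[i]: sum the unary gaps to get the bucket, reattach lows.
    # O(i) here; constant time with a select structure on the high part.
    bucket = sum(high[: i + 1])
    return (bucket << low_bits) | low[i]
```

For n = 8 values in a universe of u = 45, each element stores only 2 explicit low bits, while the high parts cost about 2 bits each on average, illustrating the n·log2(u/n) + 2n bound.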
For u = n^γ, γ = Θ(1):

Result 1, Result 2, Result 3: optimal time bounds for all, using a sublinear redundancy.