Space- and Time-Efficient Data Structures for Massive Datasets

Giulio Ermanno Pibiri (giulio.pibiri@di.unipi.it)
Supervisor: Rossano Venturini
Department of Computer Science, University of Pisa
15/11/2018

Evidence
The increase of information does not scale with technology.
“Software is getting slower more rapidly than hardware becomes faster.” Niklaus Wirth, A Plea for Lean Software
Even more relevant today!

Scenario
[Diagram: data structures and algorithms, measured in space and time]
PERFORMANCE: how quickly a program does its work (faster work).
EFFICIENCY: how much work is required by a program (less work).
Data compression: space, time?

The dichotomy problem
Small vs. fast? Choose one.
NO

High-level thesis
Data Structures + Data Compression
Fast Algorithms
Design space-efficient ad-hoc data structures, both from a theoretical and practical perspective, that support fast data extraction.
Data Compression & Fast Retrieval together.

Achieved results

Integer sequences
Clustered Elias-Fano Indexes. Giulio Ermanno Pibiri and Rossano Venturini. Journal paper, ACM Transactions on Information Systems (TOIS). Full paper, 34 pages, 2017.
Dynamic Elias-Fano Representation. Giulio Ermanno Pibiri and Rossano Venturini. Conference paper, Annual Symposium on Combinatorial Pattern Matching (CPM). Full paper, 14 pages, 2017.
Variable-Byte Encoding is Now Space-Efficient Too. Giulio Ermanno Pibiri and Rossano Venturini. Journal paper, arXiv (CoRR), April 2018. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE). Full paper, 12 pages, 2018.
Fast Dictionary-based Compression for Inverted Indexes. Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat. Conference paper, ACM Conference on Web Search and Data Mining (WSDM). Full paper, 9 pages, 2019.

Short strings
Efficient Data Structures for Massive N-Gram Datasets. Giulio Ermanno Pibiri and Rossano Venturini. Conference paper, ACM Conference on Research and Development in Information Retrieval (SIGIR). Full paper, 10 pages, 2017.
Handling Massive N-Gram Datasets Efficiently. Giulio Ermanno Pibiri and Rossano Venturini. Journal paper, ACM Transactions on Information Systems (TOIS), 2018. To appear. Full paper, 41 pages, 2018.

Problem 1
Consider a sorted integer sequence.
How to represent it as a bit-vector where each original integer is uniquely decodable, using as few bits as possible?
How to maintain fast decompression speed?
This is a difficult problem that has been studied since the 1960s.
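To make Problem 1 concrete, here is a minimal Python sketch of one classical approach: store the gaps between consecutive values and encode each gap with a variable-byte code. It is an illustration only, not the method proposed in the thesis. The function names `vbyte_encode_gaps` and `vbyte_decode_gaps` are hypothetical, and the input is assumed to be a sorted sequence of non-negative integers.

```python
from typing import Iterable, List


def vbyte_encode_gaps(sorted_values: Iterable[int]) -> bytes:
    """Encode a sorted sequence of non-negative integers as variable-byte gaps."""
    out = bytearray()
    previous = 0
    for value in sorted_values:
        gap = value - previous
        if gap < 0:
            raise ValueError("input must be non-negative and sorted")
        previous = value
        # Emit 7 payload bits per byte, least-significant group first.
        while gap >= 128:
            out.append(gap & 0x7F)  # high bit clear: more bytes follow
            gap >>= 7
        out.append(gap | 0x80)      # high bit set: last byte of this gap
    return bytes(out)


def vbyte_decode_gaps(data: bytes) -> List[int]:
    """Invert vbyte_encode_gaps: rebuild the original sorted sequence."""
    values: List[int] = []
    previous = 0
    current_gap = 0
    shift = 0
    for byte in data:
        current_gap |= (byte & 0x7F) << shift
        if byte & 0x80:             # terminator byte: the gap is complete
            previous += current_gap
            values.append(previous)
            current_gap = 0
            shift = 0
        else:
            shift += 7
    return values


sequence = [3, 4, 7, 13, 14, 15, 21, 25, 36, 38, 54, 62]
encoded = vbyte_encode_gaps(sequence)
decoded = vbyte_decode_gaps(encoded)
```

Each gap in this toy sequence is below 128, so every integer takes one byte. Byte alignment is what makes decoding fast, and it is also why this scheme generally uses more space than bit-aligned codes, which is the space/time tension the slide describes.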

Applications
Inverted indexes, databases, RDF indexing, e-commerce, geo-spatial data, graph compression.

Inverted indexes
The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

Example collection of five documents:
1: red is always good
2: the house is red
3: the house is always hungry
4: boy is red
5: the boy is hungry

Vocabulary: {t1, ..., t8} = {always, boy, good, house, hungry, is, red, the}

Inverted lists:
L_t1 = [1, 3]
L_t2 = [4, 5]
L_t3 = [1]
L_t4 = [2, 3]
L_t5 = [3, 5]
L_t6 = [1, 2, 3, 4, 5]
L_t7 = [1, 2, 4]
L_t8 = [2, 3, 5]
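The construction in the example can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation. The five document texts are read off the slide's figure (the word order inside each document is an assumption, but the set of terms per document follows from the inverted lists), and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List

# Toy collection from the slide; word order within a document is assumed.
documents: Dict[int, str] = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each term to the sorted list of document IDs that contain it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for doc_id in sorted(docs):               # ascending IDs keep lists sorted
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)        # one posting per (term, doc)
    return dict(index)


index = build_inverted_index(documents)
vocabulary = sorted(index)                    # t1..t8 in lexicographic order
```

Because documents are visited in increasing ID order, every list comes out sorted, which is exactly the kind of sorted integer sequence that Problem 1 asks to compress.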

Inverted indexes
Inverted indexes owe their popularity to the efficient resolution of queries, such as: “return all documents in which terms {t1, …, tk} occur”.
Example query on the same five-document collection: Q = {boy, is, the}
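A conjunctive query of this kind reduces to intersecting the sorted lists of the query terms. The following Python sketch shows the idea on the lists from the slide. It is an assumption-laden illustration: `intersect` and `conjunctive_query` are hypothetical names, and real systems typically work on compressed lists with skipping rather than plain Python lists.

```python
from typing import Dict, List, Sequence

# Inverted lists copied from the slide's example.
INVERTED_LISTS: Dict[str, List[int]] = {
    "always": [1, 3],
    "boy": [4, 5],
    "good": [1],
    "house": [2, 3],
    "hungry": [3, 5],
    "is": [1, 2, 3, 4, 5],
    "red": [1, 2, 4],
    "the": [2, 3, 5],
}


def intersect(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Two-pointer intersection of two sorted lists."""
    result: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def conjunctive_query(index: Dict[str, List[int]], terms: Sequence[str]) -> List[int]:
    """Return the IDs of the documents that contain every query term."""
    if not terms:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    result = list(lists[0])
    for other in lists[1:]:
        result = intersect(result, other)
    return result


answer = conjunctive_query(INVERTED_LISTS, ["boy", "is", "the"])  # [5]
```

For Q = {boy, is, the} the lists are [4, 5], [1, 2, 3, 4, 5] and [2, 3, 5], so only document 5 contains all three terms.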
