Space- and Time-Efficient Data Structures for Massive Datasets - PowerPoint PPT Presentation



SLIDE 1

Space- and Time-Efficient Data Structures for Massive Datasets

Giulio Ermanno Pibiri

Supervisor: Rossano Venturini
Referees: Daniel Lemire, Simon Gog

Department of Computer Science, University of Pisa. 08/03/2019

SLIDES 2-4

Evidence

“Software is getting slower more rapidly than hardware becomes faster.”

Niklaus Wirth, A Plea for Lean Software

The increase of data and, hence, information does not scale with technology. This is even more relevant today!

SLIDES 8-10

Achieved results

Clustered Elias-Fano Indexes
Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS). Full paper, 34 pages, 2017. Journal paper.

Dynamic Elias-Fano Representation
Giulio Ermanno Pibiri and Rossano Venturini. Annual Symposium on Combinatorial Pattern Matching (CPM). Full paper, 14 pages, 2017. Conference paper.

Efficient Data Structures for Massive N-Gram Datasets
Giulio Ermanno Pibiri and Rossano Venturini. ACM Conference on Research and Development in Information Retrieval (SIGIR). Full paper, 10 pages, 2017. Conference paper.

On Optimally Partitioning Variable-Byte Codes
Giulio Ermanno Pibiri and Rossano Venturini. IEEE Transactions on Knowledge and Data Engineering (TKDE). To appear. Full paper, 12 pages, 2019. Journal paper.

Handling Massive N-Gram Datasets Efficiently
Giulio Ermanno Pibiri and Rossano Venturini. ACM Transactions on Information Systems (TOIS). To appear. Full paper, 41 pages, 2019. Journal paper.

Fast Dictionary-based Compression for Inverted Indexes
Giulio Ermanno Pibiri, Matthias Petri and Alistair Moffat. ACM Conference on Web Search and Data Mining (WSDM). Full paper, 9 pages, 2019. Conference paper.

The results cover two themes: integer sequences and short strings.

SLIDES 11-12

Problem 1

Consider a sorted integer sequence. How do we represent it as a bit-vector, where each original integer is uniquely decodable, using as few bits as possible? How do we maintain fast decompression speed?

SLIDE 13

Ubiquity

  • Inverted indexes
  • Databases
  • Semantic data
  • Geo-spatial data
  • Graph compression
  • E-Commerce

SLIDES 14-16

Inverted indexes

The inverted index is the de-facto data structure at the basis of every large-scale retrieval system.

Example: five documents (IDs 1 to 5) over the dictionary {always, boy, good, house, hungry, is, red, the}, whose terms are labeled t1, …, t8. Each term t is associated with its inverted list Lt, the sorted list of the documents containing it:

Lt1 = [1, 3]
Lt2 = [4, 5]
Lt3 = [1]
Lt4 = [2, 3]
Lt5 = [3, 5]
Lt6 = [1, 2, 3, 4, 5]
Lt7 = [1, 2, 4]
Lt8 = [2, 3, 5]
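As a toy illustration (not the thesis code), an inverted index maps each term to the sorted list of the IDs of the documents containing it. The five hypothetical documents below are chosen to be consistent with the lists on the slide:

```python
from collections import defaultdict

# Hypothetical toy corpus; document IDs are 1-based as on the slide.
docs = {
    1: "red is always good",
    2: "the house is red",
    3: "the house is always hungry",
    4: "boy is red",
    5: "the boy is hungry",
}

# Map each term to the sorted list of IDs of the documents containing it.
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)
inverted = {term: sorted(ids) for term, ids in postings.items()}

print(inverted["is"])   # → [1, 2, 3, 4, 5]
```

Each posting list is a sorted integer sequence, which is exactly the object Problem 1 asks to compress.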

SLIDES 17-18

Many solutions

A large body of research describes different space/time trade-offs (from ~1970 to 2014):

  • Elias’ Gamma and Delta
  • Variable-Byte Family
  • Binary Interpolative Coding
  • Simple Family
  • PForDelta
  • QMX
  • Elias-Fano
  • Partitioned Elias-Fano

At the two extremes of the spectrum: Binary Interpolative Coding is ~3X smaller; the Variable-Byte Family is ~4.5X faster.

SLIDES 19-23

Key research questions

On the space/time spectrum, Binary Interpolative Coding (BIC) is ~3X smaller and the Variable-Byte (VByte) Family is ~4.5X faster.

  1. Is it possible to design an encoding that is as small as BIC and much faster? (TOIS 2017)
  2. Is it possible to design an encoding that is as fast as VByte and much smaller? (TKDE 2019)
  3. What about both objectives at the same time? (WSDM 2019)

SLIDES 24-28

1 - Clustered inverted indexes (TOIS 2017)

Every encoder represents each sequence individually. Idea: encode clusters of (similar) inverted lists against a common reference list.

  • Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%).
  • Time: slightly slower than PEF (~20%), but much faster than BIC (2X).
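The actual clustered encoding is detailed in the TOIS paper; the sketch below only illustrates one plausible intuition, under the assumption that each list in a cluster is a subset of a shared reference list, so that its elements can be rewritten as ranks in the reference, which live in a much smaller universe and hence compress better:

```python
import bisect

def encode_against_reference(lst, reference):
    """Map each element of `lst` (assumed a subset of the sorted
    `reference` list) to its rank in the reference; ranks span a much
    smaller universe than the original document IDs."""
    return [bisect.bisect_left(reference, x) for x in lst]

def decode_against_reference(ranks, reference):
    """Invert the mapping: look the ranks up in the reference list."""
    return [reference[r] for r in ranks]

reference = [3, 9, 12, 50, 700, 980]   # e.g., the union of a cluster's lists
lst = [9, 50, 980]
ranks = encode_against_reference(lst, reference)
assert ranks == [1, 3, 5]
assert decode_against_reference(ranks, reference) == lst
```

The ranks can then be fed to any integer encoder; the names and the subset assumption above are illustrative only.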

SLIDES 29-32

2 - Optimally-partitioned Variable-Byte codes (TKDE 2019)

The majority of the values are small (very small, indeed). Idea: encode the dense regions of a sequence with unary codes and the sparse regions with VByte.

  • The compression ratio improves by 2X.
  • Query processing speed and sequential decoding speed are (almost) not affected.
  • The optimal partitioning is computed in linear time and constant space.
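To make the baseline concrete, here is a minimal sketch of classic Variable-Byte coding, using one common byte-layout convention (7 data bits per byte, high bit marking the last byte; real implementations differ in layout and use SIMD):

```python
def vbyte_encode(n):
    """Encode one non-negative integer: 7 data bits per byte,
    least-significant chunk first; the 8th bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 127)
        n >>= 7
    out.append(n | 128)   # stop bit set on the final byte
    return bytes(out)

def vbyte_decode(buf):
    """Decode a concatenation of VByte codes back into integers."""
    n, shift, values = 0, 0, []
    for b in buf:
        n |= (b & 127) << shift
        shift += 7
        if b & 128:       # last byte of this code
            values.append(n)
            n, shift = 0, 0
    return values

data = b"".join(vbyte_encode(x) for x in [5, 130, 20000])
assert vbyte_decode(data) == [5, 130, 20000]
```

Since most d-gaps are tiny, dense regions full of 1s waste a whole byte per value under VByte, which is why the partitioned scheme switches those regions to unary codes.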

SLIDES 33-36

3 - Dictionary-based compression (WSDM 2019)

If we consider subsequences of d-gaps in inverted lists, these are repetitive across the whole inverted index. Idea: put the k most frequent patterns in a dictionary, then encode the inverted lists as sequences of log2(k)-bit codewords.

  • Space: close to the most space-efficient representation (~7% away from BIC).
  • Time: almost as fast as the fastest SIMD-ized decoders.
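A toy sketch of the dictionary idea, under simplifying assumptions (fixed patterns of two d-gaps, no escape mechanism for patterns missing from the dictionary, which a real implementation needs):

```python
from collections import Counter

def dgaps(lst):
    """Strictly increasing list -> first element plus deltas (d-gaps)."""
    return [lst[0]] + [b - a for a, b in zip(lst, lst[1:])]

def undgaps(gaps):
    """Prefix-sum the d-gaps back into the original list."""
    out, acc = [], 0
    for g in gaps:
        acc += g
        out.append(acc)
    return out

# Hypothetical tiny index: d-gap patterns repeat across the lists.
lists = [[1, 2, 3, 4], [7, 8, 9, 10], [2, 3, 4, 5]]
patterns = Counter(tuple(dgaps(l)[i:i + 2])
                   for l in lists for i in range(0, len(dgaps(l)), 2))

# Dictionary of the (at most) k most frequent patterns; each list is
# then a sequence of log2(k)-bit codewords indexing the dictionary.
k = 4
dictionary = [p for p, _ in patterns.most_common(k)]
codeword = {p: i for i, p in enumerate(dictionary)}

def encode(lst):
    g = dgaps(lst)
    return [codeword[tuple(g[i:i + 2])] for i in range(0, len(g), 2)]

def decode(codes):
    return undgaps([g for c in codes for g in dictionary[c]])

assert all(decode(encode(l)) == l for l in lists)
```

With k = 4 each codeword takes log2(4) = 2 bits, so a pair of d-gaps costs 2 bits instead of two full codes.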

SLIDES 37-39

The bigger picture

SLIDES 40-41

Problem 2

Consider a large text. How do we represent all its substrings of 1 ≤ k ≤ N words, for a fixed N (e.g., N = 5), using as few bits as possible? How do we estimate the probability of occurrence of the patterns under a given probability model? Can we support fast access to individual N-grams?
SLIDE 42

Indexing

Books: ~6% of the books ever published.

N    number of N-grams
1    24,359,473
2    667,284,771
3    7,397,041,901
4    1,644,807,896
5    1,415,355,596

More than 11 billion N-grams in total!

SLIDES 43-49

Context-based remapped tries (SIGIR 2017)

The number of words following a given context is small. Idea: map a word ID to the position it takes within its sibling IDs, i.e., the IDs following a context of fixed length k (e.g., k = 1).

  • The (Elias-Fano) context-based remapped trie is as fast as the fastest competitor, but up to 65% smaller.
  • It is even smaller than the most space-efficient competitors, which are lossy and allow false positives, and it is up to 5X faster.
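A minimal sketch of the remapping idea for k = 1, using words in place of integer IDs and hypothetical bigrams (illustrative only, not the thesis code):

```python
from collections import defaultdict

# Hypothetical bigrams: (context word, following word), with k = 1.
bigrams = [("red", "is"), ("red", "house"), ("the", "house"),
           ("the", "boy"), ("red", "wine")]

# Collect, per context, the sorted set of sibling words observed after it.
followers = defaultdict(set)
for ctx, w in bigrams:
    followers[ctx].add(w)
siblings = {ctx: sorted(ws) for ctx, ws in followers.items()}

def remap(ctx, w):
    """Map a word to its position among its context's siblings.
    Positions are tiny compared to global word IDs, hence far more
    compressible; a real trie stores these positions with Elias-Fano."""
    return siblings[ctx].index(w)

assert remap("red", "house") == 0   # siblings of "red": house, is, wine
assert remap("the", "house") == 1   # siblings of "the": boy, house
```

The remapped value is bounded by the number of distinct followers of the context, which the slide observes is small.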

SLIDES 50-64

Fast estimation in external memory (TOIS 2019)

To compute the modified Kneser-Ney probabilities of the N-grams, the fastest algorithm in the literature uses 3 sorting steps in external memory: one in suffix order, one in context order, and one for computing the distinct left extensions.

  • The distinct left extensions are instead computed with a single scan of the block, using O(|V|) space.
  • The last level of the trie is rebuilt rather than sorted: the per-symbol counts (A 4, B 2, C 2, X 4 in the example) are turned into starting offsets (A 1, B 5, C 7, X 9).

Estimation runs 4.5X faster on datasets with billions of strings.
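The counts-to-offsets step in the example (A 4, B 2, C 2, X 4 becoming A 1, B 5, C 7, X 9) is an exclusive prefix sum, as in counting sort; a sketch, assuming the 1-based offsets the slide's numbers suggest:

```python
def counts_to_offsets(counts, base=1):
    """Exclusive prefix sum: turn per-symbol counts into the starting
    offset of each symbol's bucket (counting-sort style), so the last
    trie level can be rebuilt by scattering instead of sorting."""
    offsets, acc = {}, base
    for symbol, c in counts:
        offsets[symbol] = acc
        acc += c
    return offsets

# The slide's example: counts A 4, B 2, C 2, X 4 -> offsets 1, 5, 7, 9.
offsets = counts_to_offsets([("A", 4), ("B", 2), ("C", 2), ("X", 4)])
assert offsets == {"A": 1, "B": 5, "C": 7, "X": 9}
```

Each record is then written directly at its symbol's running offset, replacing a full external-memory sort with a linear scatter pass.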

SLIDE 65

Take-home messages

  • Efficiency: deliver better services by using fewer resources. The impact is far-reaching and implies substantial economic gains.
  • Compression is mandatory if your data are “big”.
  • Experiments are primary: design driven by numbers.
SLIDE 66

Any questions?

Thanks for your attention, time, and patience!

SLIDE 67

High-level thesis

Data Structures + Data Compression = Fast Algorithms

Design space-efficient ad-hoc data structures, from both a theoretical and a practical perspective, that support fast data extraction.

SLIDES 68-72

Next word prediction

Given the context “space and time-efficient”, which word comes next? Candidate continuations and their frequencies: algorithms 1214, foo 2, data 3647, bar 3, baz 1.

P(“data” | “space and time-efficient”) ≈ f(“space and time-efficient data”) / f(“space and time-efficient”)
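The estimate above can be sketched as follows, assuming (for illustration only) that the context frequency equals the sum of its continuation counts:

```python
# Hypothetical counts from the slide: continuations of the context
# "space and time-efficient" with their frequencies.
continuations = {"algorithms": 1214, "foo": 2, "data": 3647, "bar": 3, "baz": 1}

def p_next(word, continuations):
    """Maximum-likelihood estimate f(context + word) / f(context),
    taking f(context) as the sum of its continuation counts here."""
    return continuations[word] / sum(continuations.values())

best = max(continuations, key=continuations.get)
assert best == "data"               # the most likely next word
```

Real language models smooth this estimate (e.g., with modified Kneser-Ney, as in the estimation work above) rather than using the raw ratio.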

SLIDES 73-74

Problem 3

Integer data structures span a spectrum: dynamic structures favor time, static structures favor space.

Dynamic (time-optimized):
  • van Emde Boas Trees
  • X/Y-Fast Tries
  • Fusion Trees
  • Exponential Search Trees

Static (space-optimized): the Elias-Fano encoding, which represents a sorted integer sequence S of n elements drawn from a universe of size u in EF(S(n,u)) = n log(u/n) + 2n bits, supporting:
  • O(1) Access
  • O(1 + log(u/n)) Predecessor

Can we grab the best from both?
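A minimal sketch of the Elias-Fano encoding in plain Python: each value is split into l low bits, stored verbatim, and a high part, stored in unary in a bit-vector. A real implementation packs bits and answers the select query in O(1) with an o(n)-bit auxiliary structure; the linear scan below is for clarity only.

```python
import math

def ef_encode(seq, u):
    """Encode a sorted sequence of n values from universe [0, u):
    l = floor(log2(u/n)) low bits per value, high parts in unary."""
    n = len(seq)
    l = max(0, int(math.floor(math.log2(u / n))))   # low-bit width
    low = [x & ((1 << l) - 1) for x in seq]
    high = [0] * (n + (u >> l) + 1)
    for i, x in enumerate(seq):
        high[(x >> l) + i] = 1                      # unary-coded high part
    return low, high, l

def ef_access(low, high, l, i):
    """Return the i-th value: select the (i+1)-th set bit in `high`
    to recover the high part, then append the stored low bits."""
    ones = -1
    for pos, bit in enumerate(high):
        ones += bit
        if ones == i:
            return ((pos - i) << l) | low[i]
    raise IndexError(i)

seq = [3, 4, 7, 13, 14, 15, 21, 43]
low, high, l = ef_encode(seq, u=44)
assert [ef_access(low, high, l, i) for i in range(len(seq))] == seq
```

With n = 8 and u = 44, l = 2, matching the stated n log(u/n) + 2n space up to rounding: 2 bits of low part per value plus roughly 2 bits per value for the unary high parts.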

SLIDE 75

Dynamic inverted indexes

Classic solution: use two indexes, one big and static, the other small and dynamic, and merge them periodically. This motivates append-only inverted indexes.

SLIDES 76-77

Integer dictionaries in succinct space (CPM 2017)

For u = n^γ, with γ = Θ(1):

Result 1:
  • EF(S(n,u)) + o(n) bits
  • O(1) Access
  • O(min{1 + log(u/n), loglog n}) Predecessor

Result 2:
  • EF(S(n,u)) + o(n) bits
  • O(1) Access
  • O(1) Append (amortized)
  • O(min{1 + log(u/n), loglog n}) Predecessor

Result 3:
  • EF(S(n,u)) + o(n) bits
  • O(log n / loglog n) Access
  • O(log n / loglog n) Insert/Delete (amortized)
  • O(min{1 + log(u/n), loglog n}) Predecessor

Optimal time bounds for all operations, using a sublinear redundancy.