
On Optimally Partitioning Variable-Byte Index Data

Giulio Ermanno Pibiri, University of Pisa and ISTI-CNR, Pisa, Italy (giulio.pibiri@di.unipi.it)
Rossano Venturini, University of Pisa and ISTI-CNR, Pisa, Italy (rossano.venturini@unipi.it)

Melbourne, 17/05/2018

Context - Inverted Indexes

We focus on compression effectiveness and retrieval speed in inverted indexes. The inverted index is the de facto data structure at the basis of every large-scale retrieval system.

Example: a collection of five documents, numbered 1 to 5, over the vocabulary T = {always, boy, good, house, hungry, is, red, the}, whose terms are labelled t1, ..., t8 in this order. Each term t has an inverted list L_t holding the identifiers of the documents in which it occurs:

L_t1 = [1, 3] (always)
L_t2 = [4, 5] (boy)
L_t3 = [1] (good)
L_t4 = [2, 3] (house)
L_t5 = [3, 5] (hungry)
L_t6 = [1, 2, 3, 4, 5] (is)
L_t7 = [1, 2, 4] (red)
L_t8 = [2, 3, 5] (the)
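The construction of the inverted lists can be sketched in a few lines of Python. The exact wording of the five example documents is not recoverable from the slide, so the `docs` dictionary below is an assumption: it holds, for each document, the set of terms obtained by inverting the lists shown on the slide. The function name `build_inverted_index` is likewise hypothetical.

```python
from collections import defaultdict

# Assumption: documents are given as term sets, reconstructed by
# inverting the lists on the slide (the original sentences are unknown).
docs = {
    1: {"always", "good", "is", "red"},
    2: {"house", "is", "red", "the"},
    3: {"always", "house", "hungry", "is", "the"},
    4: {"boy", "is", "red"},
    5: {"boy", "hungry", "is", "the"},
}


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    # Visiting documents in increasing id order keeps every list sorted.
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Keeping each list sorted by document identifier is what later makes both gap-based compression and merge-style intersection possible.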

Context - Inverted Indexes

Inverted indexes owe their popularity to the efficient resolution of queries such as: “return all documents in which terms {t1, …, tk} occur”.

Example queries on the five-document collection: q = {boy, is, the} and q = {good, hungry}.
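Such a query amounts to intersecting the inverted lists of its terms. The following is a minimal sketch, assuming the lists are plain sorted Python lists copied from the example; `intersect` and `conjunctive_query` are hypothetical helper names, and a real index would intersect compressed lists instead.

```python
# Inverted lists copied from the running example.
index = {
    "always": [1, 3],
    "boy": [4, 5],
    "good": [1],
    "house": [2, 3],
    "hungry": [3, 5],
    "is": [1, 2, 3, 4, 5],
    "red": [1, 2, 4],
    "the": [2, 3, 5],
}


def intersect(a, b):
    """Intersect two sorted lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_query(index, terms):
    """Return the ids of the documents that contain every term."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


print(conjunctive_query(index, ["boy", "is", "the"]))  # [5]
print(conjunctive_query(index, ["good", "hungry"]))    # []
```

Only document 5 contains all of boy, is and the, while no document contains both good and hungry.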

Many solutions

A large body of research describes different space/time trade-offs.

• Elias gamma/delta
• Variable-Byte
• Binary Interpolative Coding
• Simple-9/16
• PForDelta
• Optimized PForDelta
• Elias-Fano
• Partitioned Elias-Fano
• Clustered Elias-Fano
• Asymmetric Numeral Systems

The two ends of the spectrum: on the space side, Binary Interpolative Coding is ~3X smaller; on the time side, the Variable-Byte (VByte) family is ~4.5X faster.

Our research question

Can we improve the space of a VByte-encoded sequence and preserve its query processing speed?

Variable-Byte Encoding

Simple idea: encode each number using as few bytes as possible. Each byte carries one control bit and 7 data bits.

6: 1 0000110
127: 1 1111111
128: 1 0000001 0 0000000
65790: 1 0000100 1 0000001 0 1111110

Decoding is fast: keep reading bytes until you hit a value smaller than 128. Decoding can also be accelerated with SIMD (Single Instruction, Multiple Data) instructions.
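A scalar encoder and decoder can be sketched as follows. The byte order and the control-bit convention are assumptions chosen to match the decoding rule stated on the slide (stop at the first byte smaller than 128): the low-order 7-bit group is written first and the control bit is set on every byte except the last. The bit patterns printed on the slide may follow a different layout. The names `vbyte_encode` and `vbyte_decode` are hypothetical.

```python
def vbyte_encode(x):
    """Encode a non-negative integer into as few bytes as possible."""
    out = bytearray()
    while x >= 128:
        out.append(128 | (x & 127))  # control bit set: more bytes follow
        x >>= 7
    out.append(x)  # last byte is smaller than 128 and ends the codeword
    return bytes(out)


def vbyte_decode(data, pos=0):
    """Decode one integer starting at data[pos]; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 127) << shift
        shift += 7
        if byte < 128:  # a byte smaller than 128 terminates the codeword
            return value, pos


for x in (6, 127, 128, 65790):
    code = vbyte_encode(x)
    print(x, len(code), vbyte_decode(code)[0])
```

Under this convention 6 and 127 take one byte, 128 takes two and 65790 takes three, the same byte counts as in the slide's examples.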

So… what’s “wrong” with VByte?

The majority of values are small (very small indeed), yet VByte needs at least 8 bits per integer (bpi). This is considerably far from bit-level effectiveness:

Binary Interpolative Coding (BIC): 3.8 bpi on Gov2
Partitioned Elias-Fano (PEF): 4.1 bpi on Gov2

High-level idea

1. Partition each inverted list into variable-length partitions.
2. Encode dense partitions with their characteristic bitvector.
3. Encode sparse partitions with VByte.

Example: the dense partition [13, 15, 16, 17, 20, 21, 23, 24] has the characteristic bitvector

value: 13 14 15 16 17 18 19 20 21 22 23 24
bit:    1  0  1  1  1  0  0  1  1  0  1  1

which takes 24 - 13 + 1 = 12 bits, versus 64 bits with VByte (8 integers at 8 bits each): 5.33X smaller.
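The comparison in the example can be reproduced with a short sketch. The cost model is an assumption made for illustration: the bitvector spans the values from the first to the last element of the partition, and VByte encodes each value on its own. The function names are hypothetical.

```python
def characteristic_bitvector(partition):
    """One bit per value in [first, last]: 1 if the value is present."""
    present = set(partition)
    return [1 if v in present else 0
            for v in range(partition[0], partition[-1] + 1)]


def bitvector_bits(partition):
    """Bits taken by the characteristic bitvector of a sorted partition."""
    return partition[-1] - partition[0] + 1


def vbyte_bits(partition):
    """Bits taken by VByte: 8 bits for every 7-bit group of each value."""
    total = 0
    for v in partition:
        groups = max(1, (v.bit_length() + 6) // 7)
        total += 8 * groups
    return total


p = [13, 15, 16, 17, 20, 21, 23, 24]
print(characteristic_bitvector(p))          # [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
print(bitvector_bits(p), vbyte_bits(p))     # 12 64
print(round(vbyte_bits(p) / bitvector_bits(p), 2))  # 5.33
```

A partition is worth encoding as a bitvector whenever its span in bits is smaller than its VByte cost, which is exactly the case for dense runs of document identifiers.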

Computing an optimal partition

The partitioned list is laid out on two levels.

1st level: stores a fixed amount of bits, say F, for each partition.
2nd level: stores the encoded partitions of the list of n integers.

Dynamic Programming (DP) guarantees an optimal partition.
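The DP formulation can be illustrated with a simple quadratic-time sketch. It is not claimed to be the authors' algorithm, and its cost model is an assumption: every partition pays F bits on the first level plus the cheaper of (a) a bitvector spanning from the end of the previous partition to its own last element, or (b) VByte applied to the gaps between consecutive elements. The value F = 16 in the usage example is arbitrary, and `optimal_partition` and `vbyte_len` are hypothetical names.

```python
def vbyte_len(x):
    """Number of bytes VByte uses for the non-negative integer x."""
    n = 1
    while x >= 128:
        x >>= 7
        n += 1
    return n


def optimal_partition(seq, F=16):
    """Return (minimum cost in bits, partitions) for a strictly increasing seq."""
    n = len(seq)
    best = [0] + [float("inf")] * n  # best[j]: cheapest encoding of seq[:j]
    cut = [0] * (n + 1)              # cut[j]: start of the last partition for seq[:j]
    for i in range(n):
        base = seq[i - 1] if i > 0 else -1  # last value before this partition
        prev = base
        vbyte_bits = 0
        for j in range(i + 1, n + 1):
            # Extend the candidate partition seq[i:j] by one element.
            vbyte_bits += 8 * vbyte_len(seq[j - 1] - prev - 1)
            prev = seq[j - 1]
            bitvector_bits = seq[j - 1] - base
            cost = best[i] + F + min(vbyte_bits, bitvector_bits)
            if cost < best[j]:
                best[j] = cost
                cut[j] = i
    # Walk the cut points backwards to recover the partitions.
    parts = []
    j = n
    while j > 0:
        parts.append(seq[cut[j]:j])
        j = cut[j]
    parts.reverse()
    return best[n], parts


seq = list(range(100)) + [1000, 5000, 9000]
cost, parts = optimal_partition(seq, F=16)
print(cost, [len(p) for p in parts])
```

On this input the sketch separates the dense run of 100 consecutive values, which is cheap as a bitvector, from the three sparse values, which are cheaper with VByte. Encoding the whole list as a single partition would cost several times more under the same model.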
