The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds
Paolo Ferragina, Giorgio Vinciguerra

pgm.di.unipi.it
The predecessor search problem
• Given n sorted input keys (e.g. integers), implement predecessor(x) = “largest key ≤ x”
• Range queries and joins in DBs, conjunctive queries in search engines, IP routing…
• Lookups alone are much easier: just use Cuckoo hashing for lookups in at most 2 memory accesses (without sorting data!)
Keys, at positions 1 to n: 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
predecessor(36) = 36, predecessor(50) = 48
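
The definition above can be checked on the slide's own keys with a plain binary search. This is only a baseline sketch using Python's standard `bisect` module, not the PGM-index; the names `KEYS` and `predecessor` are chosen here for illustration.

```python
from bisect import bisect_right

# The 23 sorted keys shown on the slide.
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def predecessor(keys, x):
    """Return the largest key <= x, or None if every key is greater than x."""
    i = bisect_right(keys, x)  # number of keys that are <= x
    return keys[i - 1] if i > 0 else None

print(predecessor(KEYS, 36))  # 36 is itself a key
print(predecessor(KEYS, 50))  # 50 falls between 48 and 55
```

A hash table could answer the first query but not the second, which is why the slide separates lookups from predecessor search.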

Indexes
key = 36 → B-tree → position = 11
Keys, at positions 1 to n: 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
(values associated to keys are not shown)

Input data as pairs (key, position)
[Figure: the keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 plotted against their positions 1 to n]
Ao et al. [VLDB 2011]

Input data as pairs (key, position)
[Figure: the same plot, with a zoom on the first four points: keys 2, 11, 13, 15 on the x-axis and positions 1 to 4 on the y-axis]
Ao et al. [VLDB 2011]

Learned indexes
key → black box → (approximate) position
• The black box is trained on a dataset of pairs (key, pos): D = {(2, 1), (11, 2), …, (95, n)}
• Binary search in [position − error, position + error]
Keys, at positions 1 to n: 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
Ao et al. [VLDB 2011], Kraska et al. [SIGMOD 2018]
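
A minimal sketch of the idea on this slide, assuming the "black box" is a single least-squares line (a stand-in chosen here for brevity, not the model of the cited papers). Positions are 0-based in the code, while the slide counts from 1. The window is widened by one extra slot because a query key may fall between two input keys.

```python
from bisect import bisect_right
from math import ceil, floor

KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def build(keys):
    """Fit position ~ slope * key + intercept and record the maximum error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    intercept = mean_p - slope * mean_k
    error = ceil(max(abs(slope * k + intercept - i) for i, k in enumerate(keys)))
    return slope, intercept, error

def predecessor(keys, model, x):
    """Largest key <= x, found by a binary search around the predicted position."""
    slope, intercept, error = model
    n = len(keys)
    p = min(max(slope * x + intercept, 0), n - 1)  # clamp the prediction
    lo = max(floor(p) - error - 1, 0)
    hi = min(ceil(p) + error, n - 1)
    i = bisect_right(keys, x, lo, hi + 1)  # search only inside the window
    return keys[i - 1] if i > 0 else None

model = build(KEYS)
print("error:", model[2])
print(predecessor(KEYS, model, 36), predecessor(KEYS, model, 50))
```

Here the error is whatever the fitted line happens to produce on the data, which is the weakness the following slides address.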

The problem with learned indexes
Fast query time and excellent space usage in practice, but no worst-case guarantees
• Too much I/O when data is on disk
• Unpredictable latency
• Very slow to train
• Unscalable to big data
• Must be tuned for each new dataset
• Vulnerable to adversarial inputs and queries
• Blind to the query distribution

Introducing the PGM-index
Fast query time and excellent space usage in practice, and guaranteed worst-case bounds
• Constant I/O when data is on disk
• Predictable latency
• Very fast to build
• Scalable to big data
• No additional tuning needed
• Resistant to adversarial inputs and queries
• Query distribution aware

Ingredients of the PGM-index
• Opt. piecewise linear model: fast to construct, best space usage for linear learned indexes
• Fixed model “error” ε: control the size of the search range (like the page size in a B-tree)
• Recursive design: adapt to the memory hierarchy and enable query-time guarantees

PGM-index construction
Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
Keys, at positions 1 to n: 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146

PGM-index construction
Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
Step 2. Store the segments as triples s_i = (key, slope, intercept)
Keys, at positions 1 to n: 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146

Partial memory layout of the PGM-index
Each segment indexes a variable and potentially large sequence of keys while guaranteeing a search range size of 2ε + 1
Segments: (2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
Keys, at positions 1 to n: 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146
Binary search in [pos − ε, pos + ε]
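
Steps 1 and 2 can be illustrated with a simple greedy "shrinking cone" pass. Note the assumption: this is a simpler technique than the optimal algorithm the slides refer to, so it may emit more segments than the eight shown above, but every segment it emits still respects the error ε. In this sketch the intercept of a triple is the 0-based position of the segment's first key, and a position is predicted as slope * (x − key) + intercept.

```python
from bisect import bisect_right
from math import inf

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
        60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123, 128, 140, 145, 146]

def segment(keys, eps):
    """Greedy piecewise linear eps-approximation of the points (key, position).

    Returns triples (key, slope, intercept). Not the optimal algorithm.
    """
    segments = []
    start = 0            # position of the first key of the current segment
    lo, hi = 0.0, inf    # slopes that keep every point so far within eps
    for i in range(1, len(keys)):
        dk = keys[i] - keys[start]
        dp = i - start
        new_lo = max(lo, (dp - eps) / dk)
        new_hi = min(hi, (dp + eps) / dk)
        if new_lo <= new_hi:          # the point fits: shrink the cone
            lo, hi = new_lo, new_hi
        else:                         # the point does not fit: close the segment
            segments.append((keys[start], (lo + hi) / 2, start))
            start, lo, hi = i, 0.0, inf
    slope = lo if hi == inf else (lo + hi) / 2
    segments.append((keys[start], slope, start))
    return segments

def max_error(keys, segments):
    """Largest |predicted position - true position| over all input keys."""
    starts = [s[0] for s in segments]
    worst = 0.0
    for pos, k in enumerate(keys):
        key, slope, intercept = segments[bisect_right(starts, k) - 1]
        worst = max(worst, abs(slope * (k - key) + intercept - pos))
    return worst

segments = segment(KEYS, eps=1)
for key, slope, intercept in segments:
    print(f"({key}, {slope:.3f}, {intercept})")
print("max error:", max_error(KEYS, segments))
```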

PGM-index construction
Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
Step 2. Store the segments as triples s_i = (key, slope, intercept)
Step 3. Keep only s_i.key
Keys, at positions 1 to n: 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146

PGM-index construction
Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
Step 2. Store the segments as triples s_i = (key, slope, intercept)
Step 3. Keep only s_i.key
Kept keys: 2 23 31 48 71 88 122 145

PGM-index construction
Step 1. Compute the optimal piecewise linear ε-approximation in O(n) time
Step 2. Store the segments as triples s_i = (key, slope, intercept)
Step 3. Keep only s_i.key
Step 4. Repeat recursively
Kept keys: 2 23 31 48 71 88 122 145

Memory layout of the PGM-index
• Very fast construction, a couple of seconds for 1 billion keys
• It can also be constructed in a single pass
Segments, from the root level down:
(2, sl, ic)
(2, sl, ic) (31, sl, ic) (88, sl, ic) (145, sl, ic)
(2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
Keys, at positions 1 to n: 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146

Predecessor search with ε = 1
predecessor(57)?
• The PGM-index is never worse in time and space than a B-tree
• B = disk page-size. Set ε = Θ(B) for queries in O(log_B n) I/Os and O(n/ε) space
Segments, from the root level down:
(2, sl, ic)
(2, sl, ic) (31, sl, ic) (88, sl, ic) (145, sl, ic)
(2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
Keys, at positions 1 to n: 2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146
At each level the search range has size 2ε + 1.
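
The recursive construction and the top-down query can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: the segmentation is a greedy "shrinking cone" pass rather than the optimal algorithm, positions are 0-based, `eps` is an integer, and the names `segment`, `build` and `predecessor` are hypothetical. Because the segmentation is greedy, the levels may differ from those drawn on the slide.

```python
from bisect import bisect_right
from math import ceil, floor, inf

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
        60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123, 128, 140, 145, 146]

def segment(keys, eps):
    """Greedy eps-approximation: triples (key, slope, intercept)."""
    segments, start, lo, hi = [], 0, 0.0, inf
    for i in range(1, len(keys)):
        dk, dp = keys[i] - keys[start], i - start
        new_lo, new_hi = max(lo, (dp - eps) / dk), min(hi, (dp + eps) / dk)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi
        else:
            segments.append((keys[start], (lo + hi) / 2, start))
            start, lo, hi = i, 0.0, inf
    segments.append((keys[start], lo if hi == inf else (lo + hi) / 2, start))
    return segments

def build(keys, eps):
    """Steps 1 to 4: segment, keep the first keys, repeat until one segment is left.

    data[0] is the input; data[d + 1] holds the first keys of levels[d].
    """
    data = [list(keys)]
    levels = [segment(data[0], eps)]
    while len(levels[-1]) > 1:
        data.append([s[0] for s in levels[-1]])
        levels.append(segment(data[-1], eps))
    return data, levels

def predecessor(index, eps, x):
    """Largest key <= x, descending from the root with one small search per level."""
    data, levels = index
    if x < data[0][0]:
        return None
    j = 0  # the root level has a single segment
    for d in range(len(levels) - 1, -1, -1):
        key, slope, intercept = levels[d][j]
        n = len(data[d])
        # Never predict past the first position covered by the next segment.
        limit = levels[d][j + 1][2] if j + 1 < len(levels[d]) else n - 1
        p = min(max(slope * (x - key) + intercept, 0), limit)
        lo = max(floor(p) - eps - 1, 0)
        hi = min(ceil(p) + eps, n - 1)
        j = bisect_right(data[d], x, lo, hi + 1) - 1
    return data[0][j]

index = build(KEYS, eps=1)
print("segments per level, bottom first:", [len(level) for level in index[1]])
print("predecessor(57) =", predecessor(index, 1, 57))
```

The binary search at each level touches only a window of about 2ε + 1 entries, which is the quantity the slide marks at every level.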

Experiments

Experiments
[Figure: plot with labels “Page size”, “Avg search range” and “2ε”]
• Fastest CSS-tree: 128-byte pages, ≈ 350 MB
• Matched by PGM with 2ε set to 256: ≈ 4 MB (−83×)
Intel Xeon Gold 5118 CPU @ 2.30GHz, data held in main memory

Experiments on updates
[Figure: plot not recoverable from the extracted text]

Experiments on updates
B+-tree page size   Index size   vs. Dynamic PGM-index
128-byte            5.65 GB      3891×
256-byte            2.98 GB      2051×
512-byte            1.66 GB      1140×
1024-byte           0.89 GB      611×
Dynamic PGM-index: 1.45 MB

Why is the PGM so effective?
• A B-tree node: page size B, keys k_1 k_2 … k_B. In one I/O and O(log_2 B) steps the search range is reduced by 1/B
• A PGM-index node: 2ε = B, one triple (k, sl, ic). Here the search range is reduced by at least 1/B, and w.h.p. by 1/B^2
Ferragina et al. [ICML 2020]
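
The bounds quoted in these slides can be retraced with a short back-of-the-envelope derivation. It assumes a property that is not stated on the slides: each segment of the optimal piecewise linear ε-approximation covers at least 2ε keys. The symbol m_j, the number of segments at level j counted from the bottom, is introduced here for illustration.

```latex
\begin{align*}
m_1 &\le \frac{n}{2\varepsilon}, \qquad m_{j+1} \le \frac{m_j}{2\varepsilon}
  && \text{(each segment covers at least } 2\varepsilon \text{ keys)} \\
\text{number of levels} &= O(\log_{2\varepsilon} n) = O(\log_B n)
  && \text{(setting } 2\varepsilon = B\text{)} \\
\text{space} &= \sum_{j \ge 1} m_j \le \sum_{j \ge 1} \frac{n}{(2\varepsilon)^j}
  = \frac{n}{2\varepsilon - 1} = O(n/\varepsilon)
  && \text{(for } \varepsilon \ge 1\text{)}
\end{align*}
```

Each level costs one binary search over a window of 2ε + 1 = B + 1 entries, that is O(log_2 B) steps, which matches the cost of searching inside one B-tree node.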

New experiments with tuned Linear RMI
• 8-byte keys, 8-byte payload
• Tuned Linear RMI and PGM have the same size
• 10M predecessor searches, uniform query workload
PGM improved the empirical performance of a tuned Linear RMI.
Each PGM took about 2 seconds to construct. RMI took 30× more!
They tested positive lookups. Here we test predecessor queries.
New tuned Linear RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]

New experiments with tuned Hybrid RMI
• 8-byte keys, 8-byte payload
• RMI with non-linear models, tuned via grid search
• 10M predecessor searches, uniform query workload
Plot annotations: avg search range 2^8, max search range 2^8; avg 2^15, max 2^29
Each PGM took about 2 seconds to construct. Hybrid RMI took 40× (90× with tuning) more!
New tuned Hybrid RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]

New experiments
Adversarial query workload
• 8-byte keys, 8-byte payload
• RMI with non-linear models, tuned via grid search
• 10M predecessor searches
About adversarial data inputs, see Kornaropoulos et al., 2020 [arXiv:2008.00297]
New tuned Hybrid RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]

More results in the paper
• Query-distribution aware: minimise average query time wrt a given query workload
• Index compression: reduce the space of the index by a further 52% via the compression of slopes and intercepts
• Multicriteria tuner: minimise query time under a given space constraint, and vice versa, in a few dozen seconds
