
Why are learned indexes so effective?
Paolo Ferragina (1), Fabrizio Lillo (2), Giorgio Vinciguerra (1)
(1) University of Pisa, (2) University of Bologna

A classical problem in computer science
• Given a set of n sorted input keys (e.g. integers)
• Implement membership and predecessor queries
• Range queries in databases, conjunctive queries in search engines, IP lookup in routers…
Example, with the keys stored at positions 1 to n:
2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
member(36) = True
predecessor(50) = 48
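The two queries can be sketched with plain binary search over the sorted keys. This is only an illustration, not code from the talk: the function names are chosen here, and `predecessor` is assumed to return the largest key less than or equal to the query, which agrees with the slide's example.

```python
from bisect import bisect_left, bisect_right

# The sorted keys shown on the slide.
keys = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def member(keys, x):
    """Return True if x occurs in the sorted list keys."""
    i = bisect_left(keys, x)
    return i < len(keys) and keys[i] == x

def predecessor(keys, x):
    """Return the largest key <= x, or None if every key is greater than x."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None

print(member(keys, 36))       # True
print(predecessor(keys, 50))  # 48
```

Both queries cost O(log n) comparisons; an index (classical or learned) aims to narrow the search range before this final step.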

Indexes
[Figure: a B-tree takes a key as input and returns its position in the sorted array 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95, whose positions run from 1 to n.]

Input data as pairs (key, position)
[Figure: the keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 on the horizontal axis plotted against their positions 1, 2, …, n on the vertical axis. The first four keys 2, 11, 13, 15 give the points (2, 1), (11, 2), (13, 3), (15, 4).]

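Learned indexes
• A black-box model takes a key and returns an (approximate) position
• The black box is trained on a dataset of pairs (key, pos): D = {(2, 1), (11, 2), …, (95, n)}
• The exact position is found by a binary search in [position − ε, position + ε], where ε is e.g. of the order of 100–1000
[Figure: the keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 plotted against their positions 1 to n, with the model's approximation.]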
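As a rough sketch of this lookup scheme, the black box below is a single least-squares line, which is an assumption made here for brevity (real learned indexes use richer models). The helper names `fit_line`, `build` and `lookup` are hypothetical. The error bound ε is measured on the data after fitting, then the final binary search is restricted to the window around the predicted position.

```python
import math
from bisect import bisect_left

keys = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def fit_line(keys):
    """Least-squares line mapping a key to its 1-based position."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n + 1) / 2
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys, start=1))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k

def build(keys):
    """Fit the model and measure its maximum error on the indexed keys."""
    slope, intercept = fit_line(keys)
    err = max(abs(slope * k + intercept - p) for p, k in enumerate(keys, start=1))
    # +1 absorbs the rounding of the prediction to an integer position.
    return slope, intercept, math.ceil(err) + 1

def lookup(keys, model, x):
    """Return the 1-based position of x, or None if x is not a key."""
    slope, intercept, eps = model
    n = len(keys)
    guess = round(slope * x + intercept)
    lo = min(max(guess - eps, 1) - 1, n)   # 0-based, inclusive
    hi = max(lo, min(guess + eps, n))      # 0-based, exclusive
    i = bisect_left(keys, x, lo, hi)
    return i + 1 if i < n and keys[i] == x else None

model = build(keys)
print(lookup(keys, model, 36))  # 11
print(lookup(keys, model, 50))  # None
```

The binary search touches at most 2ε + 1 positions, so its cost depends on ε rather than on n.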

The knowledge gap in learned indexes
• Query time. Practice: same query time as traditional tree-based indexes. Theory: same asymptotic query time as traditional tree-based indexes.
• Space. Practice: space improvements of orders of magnitude, from GBs to a few MBs. Theory: same asymptotic space occupancy as traditional tree-based indexes.

PGM-index: An optimal learned index [Ferragina and Vinciguerra, PVLDB 2020]
1. Fix a max error ε, e.g. so that the keys in [pos − ε, pos + ε] fit a cache line
2. Find the smallest Piecewise Linear ε-Approximation (PLA)
3. Store triples (first_key, slope, intercept) for each segment
[Figure: the keys 1 3 8 11 12 19 22 23 24 28 29 33 38 47 48 53 55 56 57 plotted against their positions and covered by segments; a query is answered by searching the range [pos − ε, pos + ε].]
https://pgm.di.unipi.it
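The three steps can be illustrated with a simple greedy "shrinking cone" segmentation. Note that this is not the optimal algorithm used by the PGM-index: each segment here is forced through its first point, so it may produce more segments than the smallest PLA. Positions are 0-based, distinct keys are assumed, and the names `build_segments` and `predict` are hypothetical.

```python
from bisect import bisect_right

def build_segments(keys, eps):
    """Greedy PLA: return (first_key, slope, intercept) triples such that
    intercept + slope * (key - first_key) is within eps of the key's position."""
    def pick(lo, hi):
        # Any slope in [lo, hi] is valid; a one-point segment has hi = inf.
        return lo if hi == float("inf") else (lo + hi) / 2

    segments = []
    k0, p0 = keys[0], 0                # first point of the current segment
    lo, hi = 0.0, float("inf")         # feasible slopes for the current segment
    for p in range(1, len(keys)):
        dx = keys[p] - k0
        new_lo = max(lo, (p - p0 - eps) / dx)
        new_hi = min(hi, (p - p0 + eps) / dx)
        if new_lo > new_hi:            # no single slope fits: close the segment
            segments.append((k0, pick(lo, hi), p0))
            k0, p0 = keys[p], p
            lo, hi = 0.0, float("inf")
        else:
            lo, hi = new_lo, new_hi
    segments.append((k0, pick(lo, hi), p0))
    return segments

def predict(segments, x):
    """Approximate 0-based position of x using the segment that covers it."""
    firsts = [s[0] for s in segments]
    first_key, slope, intercept = segments[max(bisect_right(firsts, x) - 1, 0)]
    return intercept + slope * (x - first_key)

# Keys from the slide's figure, with a deliberately tiny eps.
keys = [1, 3, 8, 11, 12, 19, 22, 23, 24, 28, 29, 33, 38, 47, 48, 53, 55, 56, 57]
segments = build_segments(keys, eps=1)
print(len(segments), "segments for", len(keys), "keys")
```

A larger ε lets each cone survive longer, so fewer triples need to be stored.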

What is the space of learned indexes?
• Space occupancy ∝ number of segments
• The number of segments depends on
  • The size of the input dataset
  • How the points (key, pos) map to the plane
  • The value ε, i.e. how precise the approximation is
[Figure: three plots of positions against keys, comparing the segments needed for an error ε_1 with those needed for a much smaller error ε_2 ≪ ε_1.]

Model and assumptions
• Consider the gaps g_i = k_{i+1} − k_i between consecutive input keys
• Model the gaps as positive iid random variables that follow a distribution with finite mean μ and variance σ²
[Figure: five consecutive keys plotted against positions 1 to 5, with the four gaps between them marked on the keys axis.]
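To make the quantities concrete, the sketch below computes the gaps of the example key set used earlier in the slides and estimates their mean and variance. The empirical estimates are only illustrative: the model treats the gaps as draws from a distribution, and a 23-key sample says little about it.

```python
from statistics import mean, pvariance

keys = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

# g_i = k_{i+1} - k_i for consecutive keys
gaps = [b - a for a, b in zip(keys, keys[1:])]

mu = mean(gaps)           # empirical mean of the gaps
sigma2 = pvariance(gaps)  # empirical (population) variance of the gaps

print(gaps)
print(mu, sigma2)
```

The gaps always sum to the difference between the last and the first key, here 95 − 2 = 93 over 22 gaps.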

The main result
Theorem. If ε is sufficiently larger than σ/μ, the expected number of keys covered by a segment with maximum error ε is
K = (μ²/σ²) ε²
and the number of segments on a dataset of size n is n/K with high probability.

The main consequence
The PGM-index achieves the same asymptotic query performance as a traditional ε-way tree-based index while improving its space from Θ(n/ε) to O(n/ε²).
Learned indexes are provably better than traditional indexes (note that ε is of the order of 100–1000).
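A back-of-the-envelope comparison shows what the quadratic dependence on ε means in numbers. The values n = 10^9, ε = 128 and μ = σ are illustrative assumptions, and the tree estimate n/ε ignores constant factors, so this is an order-of-magnitude sketch only.

```python
def expected_segments(n, eps, mu, sigma):
    """n / K with K = (mu^2 / sigma^2) * eps^2, the theorem's estimate."""
    return n / ((mu / sigma) ** 2 * eps ** 2)

def tree_nodes(n, eps):
    """Order-of-magnitude n / eps for an eps-way tree (constants ignored)."""
    return n / eps

n, eps = 10**9, 128
print(tree_nodes(n, eps))               # 7812500.0
print(expected_segments(n, eps, 1, 1))  # 61035.15625
```

With μ = σ the ratio between the two counts is exactly ε; a distribution with a smaller σ/μ widens the advantage further.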

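Sketch of the proof
1. Consider a segment on the stream of random gaps and the two parallel lines at distance ε
2. How many steps before a new segment is needed?
[Figure: positions against keys, with a strip bounded by two parallel lines at distance ε from the segment; a new segment starts from the first point that falls outside the strip.]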

Sketch of the proof (2)
3. A discrete-time random walk, iid increments with mean μ
4. Compute the expectation of i* = min{i ∈ ℕ | (k_i, i) is outside the red strip}, i.e. the Mean Exit Time (MET) of the random walk
5. Show that the slope m = 1/μ maximises E[i*], giving E[i*] = (μ²/σ²) ε²
[Figure: the same strip of half-width ε drawn twice, once as positions against keys and once as random walker location against time, with a line of slope m; the walker leaves the strip at the point (k_{i*}, i*), where a new segment starts.]
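The mean exit time can be estimated numerically with a small Monte Carlo sketch. Several choices here are assumptions rather than details from the talk: the strip is measured vertically (in positions) around the line of slope 1/μ through the origin, the gaps are exponential with mean 1 (so σ/μ = 1), and the trial count and seed are arbitrary.

```python
import random

def exit_time(eps, mu, draw_gap, rng):
    """Number of steps until the walk (k_i, i) leaves the strip of
    half-width eps around the line of slope 1/mu through the origin."""
    key, i = 0.0, 0
    while True:
        key += draw_gap(rng)   # k_i = k_{i-1} + g_i
        i += 1
        if abs(i - key / mu) > eps:
            return i

def mean_exit_time(eps, mu, draw_gap, trials=2000, seed=0):
    """Average exit time over independent random streams of gaps."""
    rng = random.Random(seed)
    return sum(exit_time(eps, mu, draw_gap, rng) for _ in range(trials)) / trials

# Exponential gaps with mean 1: mu = sigma = 1, so the predicted
# mean exit time (mu^2 / sigma^2) * eps^2 is about eps^2 = 400.
eps = 20
met = mean_exit_time(eps, 1.0, lambda rng: rng.expovariate(1.0))
print(met, "vs predicted", eps ** 2)
```

For a small ε like this one the estimate tends to sit somewhat above ε², because the walker overshoots the boundary on its last step; the relative effect shrinks as ε grows.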

Simulations
1. Generate 10^7 random streams of gaps according to several probability distributions
2. Compute and average
   I. The length of a segment found by the algorithm that computes the smallest PLA, adopted in the PGM-index
   II. The exit time of the random walk

Simulations of (μ²/σ²) ε²
OPT = Average segment length in a PGM-index
MET = Mean exit time of the random walk
[Figure: two panels plotting the mean segment length against ε, for ε from 0 to 250, with curves for OPT, MET and Thm 1. Left panel: Pareto k = 3, α = 3, where Thm 1 gives 3.0 ε². Right panel: Lognormal μ = 1, σ = 0.5, where Thm 1 gives 3.521 ε².]
Both OPT and MET agree on the slope 1/μ, but OPT is more robust.
More distributions in the paper.
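The constants in front of ε² follow from the standard mean and variance formulas of each distribution. The sketch below recomputes them; the function names are chosen here for illustration, and the Pareto is assumed to be parameterised by scale k and shape α.

```python
import math

def pareto_ratio(k, alpha):
    """mu^2 / sigma^2 for a Pareto with scale k and shape alpha > 2."""
    mu = alpha * k / (alpha - 1)
    var = k ** 2 * alpha / ((alpha - 1) ** 2 * (alpha - 2))
    return mu ** 2 / var

def lognormal_ratio(sigma):
    """mu^2 / sigma^2 for a lognormal; it depends only on the shape sigma."""
    return 1 / math.expm1(sigma ** 2)

def gamma_ratio(shape):
    """mu^2 / sigma^2 for a gamma equals its shape parameter."""
    return shape

print(round(pareto_ratio(3, 3), 3))    # 3.0
print(round(lognormal_ratio(0.5), 3))  # 3.521
```

For the Pareto the ratio simplifies to α(α − 2), so the scale k cancels out.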

Stress test of "ε sufficiently larger than σ/μ"
[Figure: three panels plotting the relative error against ε, for ε from 0 to 250.
Panel σ/μ = 0.15: Pareto k = 10, α = 7.741; Gamma θ = 5, k = 44.444; Lognormal μ = 2, σ = 0.149; Thm 1 gives 44.444 ε².
Panel σ/μ = 1.5: Pareto k = 10, α = 2.202; Gamma θ = 5, k = 0.444; Lognormal μ = 2, σ = 1.086; Thm 1 gives 0.444 ε².
Panel σ/μ = 15: Pareto k = 10, α = 2.002; Gamma θ = 5, k = 0.004; Lognormal μ = 2, σ = 2.328; Thm 1 gives 0.004 ε².]

Conclusions
• No theoretical grounds for the efficiency of learned indexes were known
• We have shown that on data with iid gaps, the mean segment length is Θ(ε²)
• The PGM-index takes O(n/ε²) space w.h.p., a quadratic improvement in ε over traditional indexes (ε is usually of the order of 100–1000)
• Open problems:
  1. Do the results still hold without the iid assumption on the gaps?
  2. Is the segment found by the optimal algorithm adopted in the PGM-index a constant factor longer than the one found by the random walker?
