Learned Index Structures, Naufal Fikri Setiawan, Benjamin I.P. Rubinstein, Renata Borovica-Gajic. PowerPoint PPT Presentation



SLIDE 1

Function Interpolation for Learned Index Structures

Naufal Fikri Setiawan, Benjamin I.P. Rubinstein, Renata Borovica-Gajic University of Melbourne Acknowledgement: CORE Student Travel Scholarship

SLIDE 2

Querying data with an index

  • Indexes are external structures used to make lookups faster.
  • B-Tree indexes are created on databases where the keys have an ordering.

(key, pos)

Query on key β†’ position

SLIDE 3

On Learned Indexes

  • An experiment by Kraska, et al. [*] to replace range index structures (i.e. B-Trees) with neural networks that "predict" the position of an entry in a database.
  • Reduces O(log n) traversal time to O(1) evaluation time.
  • Indexing is a problem of learning how data is distributed.
  • Aim: to explore the feasibility of an alternative statistical tool, polynomial interpolation, for indexing.

Kraska, Tim, et al. "The case for learned index structures." Proceedings of the 2018 International Conference on Management of Data. 2018.

SLIDE 4

Mathematical View on Indexing

Product     Price (Key)
Product A   100
Product X   161
Product L   299
Product D   310
Product G   590

An index is a function f: U β†’ β„• that takes a query key and returns its position.

[Figure: scatter plot of price of product (x-axis, 200-800) vs. position in table (y-axis, 1-6), with F(key), the indexing function on price, overlaid.]

SLIDE 5

So... we can build a model to predict them!

Neural Networks, Polynomial Models, Trees!  f(x) β‰ˆ Ξ£_i a_i x^i

[Figure: scatter plot of price of product (x-axis) vs. position in table (y-axis), with F(x), the indexing function, overlaid.]
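To make this concrete, here is a minimal sketch (not from the slides) of fitting such a model; the five-row price table and the use of NumPy's `polyfit` are illustrative assumptions:

```python
import numpy as np

# Illustrative five-row table: sorted keys (prices) and their positions.
keys = np.array([100.0, 161.0, 299.0, 310.0, 590.0])
positions = np.arange(1, len(keys) + 1)  # positions 1..5

# Fit a low-degree polynomial F(key) ~ position by least squares.
coeffs = np.polyfit(keys, positions, deg=2)
predict = np.poly1d(coeffs)

# The prediction for an existing key lands near its true slot (3).
print(float(predict(299.0)))
```

In the learned-index setting the same rows are used for fitting and for lookups, so a tight fit is exactly what is wanted.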

SLIDE 6

Polynomial Models - Preface

[Figure: scatter plot of price of product (x-axis) vs. position in table (y-axis), with F(x), the indexing function, overlaid.]

π‘žπ‘π‘‘π‘—π‘’π‘—π‘π‘œ β‰ˆ 𝑏0 + 𝑏1𝑦 + 𝑏2𝑦2 + β‹― + π‘π‘œπ‘¦π‘œ Use two different interpolation methods to obtain 𝑏𝑗:

  • Bernstein Polynomial Interpolation
  • Chebyshev Polynomial Interpolation

For a chosen degree n.

SLIDE 7

Meet our Models

Bernstein Interpolation Method

Ξ£_{i=0}^{N} Ξ²_i Β· C(N, i) Β· x^i Β· (1 βˆ’ x)^(Nβˆ’i)

Model parameters βŸ¨Ξ²_0, Ξ²_1, Ξ²_2, β‹―, Ξ²_N⟩

where Ξ²_i = f(i/N)

and f is the function we want to approximate, scaled to [0, 1]. In memory: only need to store the coefficients Ξ²_i Β· C(N, i).
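As a sketch (not from the slides; `math.comb` and the toy sanity check are assumptions), the stored products Ξ²_i Β· C(N, i) and the evaluation could look like:

```python
from math import comb

def bernstein_coeffs(f, N):
    # Stored parameters: beta_i * C(N, i), with beta_i = f(i / N).
    return [f(i / N) * comb(N, i) for i in range(N + 1)]

def bernstein_eval(coeffs, x):
    # Evaluate sum_i [beta_i * C(N, i)] * x^i * (1 - x)^(N - i) on [0, 1].
    N = len(coeffs) - 1
    return sum(c * x**i * (1 - x) ** (N - i) for i, c in enumerate(coeffs))

# Sanity check: Bernstein polynomials reproduce f(x) = x exactly.
coeffs = bernstein_coeffs(lambda x: x, 10)
print(bernstein_eval(coeffs, 0.3))  # ~0.3
```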

SLIDE 8

Meet our Models

Chebyshev Interpolation Method

Ξ£_{i=0}^{N} Ξ²_i Β· T_i(x)

T_0(x) = 1
T_1(x) = x
T_n(x) = 2x Β· T_{nβˆ’1}(x) βˆ’ T_{nβˆ’2}(x)

Ξ²_i = (p_i / N) Β· Ξ£_{k=0}^{Nβˆ’1} f(βˆ’cos(Ο€(k + Β½)/N)) Β· cos(iΟ€(N + k + Β½)/N)

with p_0 = 1, p_i = 2 (if i > 0). Coefficients given by the Discrete Chebyshev Transform. Domain is [βˆ’1, 1].
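A sketch of the transform in its equivalent standard form, using nodes cos(Ο€(k + Β½)/N) (listing the nodes in descending rather than ascending order only changes the bookkeeping); the toy target f(x) = xΒ² is an illustrative assumption:

```python
import math

def chebyshev_coeffs(f, N):
    # Discrete Chebyshev transform on N nodes x_k = cos(pi*(k + 1/2)/N).
    nodes = [math.cos(math.pi * (k + 0.5) / N) for k in range(N)]
    coeffs = []
    for i in range(N):
        p = 1.0 if i == 0 else 2.0  # p_0 = 1, p_i = 2 otherwise
        coeffs.append(p / N * sum(f(x) * math.cos(i * math.pi * (k + 0.5) / N)
                                  for k, x in enumerate(nodes)))
    return coeffs

def chebyshev_eval(coeffs, x):
    # Evaluate via the recurrence T_n(x) = 2x*T_{n-1}(x) - T_{n-2}(x).
    total, t_prev, t_curr = 0.0, 1.0, x
    for i, c in enumerate(coeffs):
        if i == 0:
            total += c          # T_0(x) = 1
        elif i == 1:
            total += c * x      # T_1(x) = x
        else:
            t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
            total += c * t_curr
    return total

# Sanity check: x^2 = (T_0(x) + T_2(x)) / 2, recovered exactly.
coeffs = chebyshev_coeffs(lambda x: x * x, 8)
print(chebyshev_eval(coeffs, 0.5))  # ~0.25
```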

SLIDE 9

Indexing as CDF Approximation

If we pre-sort the values in the table, we get the following equation:

[Figure: scatter plot of price of product (x-axis) vs. position in table (y-axis), with F(x), the indexing function, overlaid.]

F(key) = P(x ≀ key) Γ— N

Our polynomial models need to simply predict the CDF, with key rescaled to the interpolation domain.
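A sketch of building those CDF targets from a hypothetical unsorted key column (the five keys and the [0, 1] rescaling for Bernstein are illustrative assumptions):

```python
import numpy as np

# Hypothetical unsorted key column.
keys = np.array([310.0, 100.0, 590.0, 161.0, 299.0])

# Pre-sort once; position i then satisfies F(key_i) = P(x <= key_i) * N.
sorted_keys = np.sort(keys)
n = len(sorted_keys)
cdf_targets = np.arange(1, n + 1) / n  # empirical CDF values in (0, 1]

# Rescale keys to the interpolation domain, e.g. [0, 1] for Bernstein.
lo, hi = sorted_keys[0], sorted_keys[-1]
scaled = (sorted_keys - lo) / (hi - lo)
print(list(zip(scaled.round(3), cdf_targets)))
```

The model is then fit on (scaled key, CDF) pairs, and a predicted CDF value times N gives the predicted position.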

SLIDE 10

A Query System

Query Model, Step 1: creation of the data array. Data is not necessarily sorted in the DB.

SLIDE 11

A Query System

βŸ¨π‘™π‘“π‘§1, π‘žπ‘π‘‘1⟩ βŸ¨π‘™π‘“π‘§2, π‘žπ‘π‘‘2⟩ βŸ¨π‘™π‘“π‘§3, π‘žπ‘π‘‘3⟩ βŸ¨π‘™π‘“π‘§4, π‘žπ‘π‘‘4⟩ Sorted Data Dupe (A) Data is not necessarily sorted in DB

SLIDE 12

A Query System

Model βŸ¨π‘™π‘“π‘§1, π‘žπ‘π‘‘1⟩ βŸ¨π‘™π‘“π‘§2, π‘žπ‘π‘‘2⟩ βŸ¨π‘™π‘“π‘§3, π‘žπ‘π‘‘3⟩ βŸ¨π‘™π‘“π‘§4, π‘žπ‘π‘‘4⟩ Key Query Model Step 1: Predict position

SLIDE 13

A Query System

Model π‘₯π‘ π‘π‘œπ‘• βŸ¨π‘™π‘“π‘§, π‘žπ‘π‘‘βŸ© 𝑑𝑝𝑠𝑠𝑓𝑑𝑒 βŸ¨π‘™π‘“π‘§, π‘žπ‘π‘‘βŸ© Key Query Model Step 2: Error correction

SLIDE 14

Experiment Setup

  • Created random datasets with multiple distributions as keys:
      • Normal, Log-Normal, and Uniform.
      • Each distribution: 500k, 1M, 1.5M, 2M rows.
  • We test the performance of each index:
      • NN, B-Tree, polynomial.
  • Hardware setup:
      • Core i7, 16 GB of RAM.
      • Python 3.7, compiled with GCC, running on Linux.
      • PyTorch for neural network purposes.
      • No form of GPU use.
SLIDE 15

Benchmark Neural Network

  • Neural Network:
      • 1 hr benchmark training time.
      • 2 hidden layers Γ— 32 neurons.
      • ReLU activation.
SLIDE 16

Index Creation / β€œTraining” Time

Model Type                Creation Time
B-Tree                    34.57 seconds
Bernstein(25) Polynomial  3.366 seconds
Chebyshev(25) Polynomial  3.809 seconds
Neural Network Model      1 hr (benchmark)

  • Polynomial models are created faster than B-Trees.
  • Polynomial models do not require any hyperparameter tuning.
  • NNs, however, can be incrementally trained.

Factor of 10 creation time reduction over B-Trees.

SLIDE 17

Model Type                Prediction Time (nanoseconds)
                          Normal   LogNormal   Uniform
B-Tree                    24.4     40.1        41.5
Bernstein(25) Polynomial  277      336         166
Chebyshev(25) Polynomial  25.9     31.7        16.4
Neural Network Model      406      806         148

Model Prediction Time

Model prediction time for 2 million rows.

Polynomial models are able to predict faster than NNs.

SLIDE 18

Model Type                Root Mean Squared Positional Error
                          Normal    LogNormal   Uniform
B-Tree                    N/A       N/A         N/A
Bernstein(25) Polynomial  9973.67   39566.59    62.58
Chebyshev(25) Polynomial  57.14     474.91      26.39
Neural Network Model      105.84    711.12      22.67

Model Accuracy

Average error for 2 million rows.

Chebyshev Models are ~50% more accurate

SLIDE 19

Total Query Speed

Model Type                Average Query Time (nanoseconds)
                          Normal   LogNormal   Uniform
B-Tree                    31.5     46.0        56.3
Chebyshev(25) Polynomial  62.1     751         40.2
Bernstein(25) Polynomial  8080     11800       192
Neural Network Model      402      1100        516

Chebyshev Models are 30% - 90% faster at querying.

SLIDE 20

Memory Usage

Model Type                Size of Database (in Entries)
                          500k Entries   1M Entries   1.5M Entries   2M Entries
B-Tree                    33.034 MB      66.126 MB    99.123 MB      132.163 MB
Neural Network            210.73 kB      210.73 kB    210.73 kB      210.73 kB
Bernstein(25) Polynomial  1.8 kB         1.8 kB       1.8 kB         1.8 kB
Chebyshev(25) Polynomial  1.8 kB         1.8 kB       1.8 kB         1.8 kB

99.4% reduction from B-Trees; 99.3% reduction from the Neural Network Model.

SLIDE 21

Main Key Insight

  • "Indexing" is better interpreted less as a learning problem and more as a fitting problem, where overfitting is advantageous.

  • Learning: separate training and test data.
  • Fitting: same training and test data.
SLIDE 22

Conclusion

  • We advocate for the use of function interpolation as a 'learned index' due to the following benefits:
      • No hyperparameter tuning.
      • Fast creation time in a CPU-only environment.
      • Provides a higher compression rate vs. neural networks, and definitely vs. B-Trees.