The Case for Learned Index Structures (R244) - Michael Chi Ian Tang




SLIDE 1

The Case for Learned Index Structures

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N.

R244 Michael Chi Ian Tang

SLIDE 2

Background

SLIDE 3

Index Structures

  • Index structures are built for efficient data access
  • E.g. B-Trees
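A sorted array searched with binary search is the simplest stand-in for such an index (both it and a B-Tree answer lookups over sorted keys in O(log n)); a minimal sketch, not from the slides:

```python
import bisect

# A sorted array with binary search stands in for a B-Tree here: both
# support point lookups and range scans over sorted keys in O(log n).
keys = [3, 7, 7, 12, 19, 25, 31, 44]

def lookup(key):
    """Return the position of the first entry >= key (a range-index lookup)."""
    return bisect.bisect_left(keys, key)

print(lookup(12))  # 3
print(lookup(20))  # 5 (the first key >= 20 is 25)
```

The key list and `lookup` helper are purely illustrative.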

SLIDE 4

Index Structures as Models

SLIDE 5

Index Structures as Models

SLIDE 6

Range Index

SLIDE 7

Range Index Models = CDF Models

True position ๐‘žโˆ— = ๐‘ ๐‘๐‘œ๐‘™ ๐‘™๐‘“๐‘ง = | ๐‘™ ๐‘™ โ‰ค ๐‘™๐‘“๐‘ง | = ๐‘„ ๐‘Œ โ‰ค ๐‘™๐‘“๐‘ง โˆ— ๐‘‚ ๐‘„ ๐‘Œ โ‰ค ๐‘™๐‘“๐‘ง is the CDF of keys

SLIDE 8

Range Index Models = CDF Models

Model: ๐‘ž๐‘๐‘ก = ๐บ ๐‘™๐‘“๐‘ง โˆ— ๐‘‚ โ‰ˆ ๐‘žโˆ— ๐บ ๐‘™๐‘“๐‘ง โ‰ˆ ๐‘„ ๐‘Œ โ‰ค ๐‘™๐‘“๐‘ง

SLIDE 9

The Recursive Model Index (RMI)

  • Prediction from previous stage chooses the next model
  • Progressively refine the prediction
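The two-stage idea can be sketched as follows (an assumed simplification with linear models at both stages, not the paper's implementation):

```python
# Stage 1 routes each key to one of several stage-2 models; each stage-2
# model is fit only to its own slice of the CDF, refining the prediction.
keys = sorted(i * i for i in range(1, 1500))   # non-uniform key distribution
N = len(keys)
NUM_MODELS = 16

def linear_fit(xs, ys):
    """Least-squares line; degenerates to a constant if xs has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return slope, my - slope * mx

# Stage 1: a single model over the whole key space picks the stage-2 model.
s1 = linear_fit(keys, range(N))

def route(key):
    pos = s1[0] * key + s1[1]
    return min(NUM_MODELS - 1, max(0, int(pos * NUM_MODELS / N)))

# Stage 2: fit one model per partition induced by stage 1's predictions.
parts = [([], []) for _ in range(NUM_MODELS)]
for i, k in enumerate(keys):
    xs, ys = parts[route(k)]
    xs.append(k)
    ys.append(i)
s2 = [linear_fit(xs, ys) if xs else s1 for xs, ys in parts]

def rmi_predict(key):
    slope, intercept = s2[route(key)]
    return min(N - 1, max(0, int(round(slope * key + intercept))))
```

Because each stage-2 model is least-squares optimal on its own partition, the total squared error can only go down relative to the stage-1 model alone.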

SLIDE 10

The Recursive Model Index

  • Benefits
  • Decouples execution cost & model size
  • Notion of progressively learning the shape of the CDF
  • Divides the space into smaller ranges, making the final prediction easier to refine

SLIDE 11

The Recursive Model Index

  • Worst-case performance
  • If last-stage models do not meet the error requirement, replace them with B-Trees
  • Has the same worst-case guarantee as B-Trees

SLIDE 12

The Recursive Model Index - Training

  • Loss defined as:

L = Σ_(x,y) (f(x) − y)²

where f(x) is the predicted position of key x and y is its true position

  • Simple models train in seconds, neural nets in minutes
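The loss above, spelled out as a tiny illustrative snippet (the model `f` and the training pairs are made up):

```python
# Squared-error loss over training pairs (key x, true position y).
def squared_error_loss(f, pairs):
    return sum((f(x) - y) ** 2 for x, y in pairs)

# A toy model that is off by 1 and 2 on two of the three keys:
f = lambda x: x
print(squared_error_loss(f, [(0, 0), (1, 2), (2, 4)]))  # 0 + 1 + 4 = 5
```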

SLIDE 13

Experiments

  • Integer datasets
  • Weblogs dataset contains 200M log entries
  • Maps dataset indexes the longitude of ≈ 200M user-maintained features
  • Log-normal dataset synthesized by sampling 190M unique values
  • Models
  • 2-stage RMI models with second-stage sizes of 10k, 50k, 100k, and 200k
  • Read-optimized B-Trees with different page sizes

SLIDE 14

Results

SLIDE 15

Point Index

SLIDE 16

Point Index

  • Example: hash-map
  • Deterministically maps keys to positions inside an array

SLIDE 17

The Hash-Model Index

  • Build a hash function based on the CDF of the data (M is the size of the hash-map):

h(key) = F(key) · M, where F(key) ≈ P(X ≤ key)

SLIDE 18

The Hash-Model Index

  • Main objective is to reduce the number of conflicts
  • Conflicts can induce high cost depending on the architecture (e.g. distributed)

SLIDE 19

Experiments

  • Learned models with same settings as in range index
  • Compared against a MurmurHash3-like hash function

SLIDE 20

Existence Index

SLIDE 21

Existence Index

  • Example: Bloom filters
  • Return whether a key exists in a dataset
  • No false negatives, but potential false positives
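A minimal Bloom filter sketch (illustrative; the bit-array size and number of hash functions are arbitrary choices):

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits per key; lookups can yield false
    positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("example.com/login")
print("example.com/login" in bf)   # True: an added key is always found
```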

SLIDE 22

Bloom filters as a Classification Problem

  • Binary probabilistic classification task: whether a key exists in the dataset

[Diagram: key → Model → Exists / Does not exist]
SLIDE 23

Bloom filters as a Classification Problem

  • Guarantee of no false negatives
  • Overflow Bloom filter: remembers the model's false negatives
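The model-plus-overflow construction can be sketched as follows (a hypothetical keyword classifier stands in for the RNN, and a plain set stands in for the overflow Bloom filter):

```python
# Any positive key the classifier misses at build time goes into the
# overflow structure, so the combined index has no false negatives.
positives = {"evil.example/a", "evil.example/b", "phish.example/x"}

def classifier(url):
    # Stand-in for the learned model: flags URLs containing "evil".
    return "evil" in url

overflow = set()   # a real implementation would use a Bloom filter here
for url in positives:
    if not classifier(url):   # a false negative of the model...
        overflow.add(url)     # ...is remembered exactly

def exists(url):
    return classifier(url) or url in overflow

print(all(exists(u) for u in positives))  # True: no false negatives
```

False positives (e.g. a benign URL containing "evil") remain possible, exactly as with a conventional Bloom filter.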

SLIDE 24

Experiments

  • Data
  • 1.7M blacklisted phishing URLs
  • Negative set: random URLs + whitelisted URLs
  • Comparison
  • Learned filter: RNN with GRU
  • Normal Bloom filter

SLIDE 25

Critique

SLIDE 26

Major Contributions

  • 1. Proposed the idea of applying machine learning to index structures
  • 2. Solutions for offering guarantees on performance and determinism with ML models
  • 3. Showed significant performance improvements (time and space)
  • 4. Inspired a new research direction (27 citations since June 2018)

SLIDE 27

Criticism

  • 1. Details of the platform used for the experiments are not given
  • 2. Little discussion of training time
  • 3. Experiments on CPU only

SLIDE 28

Conclusion & Future Direction

  • Proposed a new direction in database research that
  • Makes effective use of machine learning methods
  • Shows promising preliminary results
  • Has inspired new research work
  • Requires more detailed performance evaluation
  • Has potential in learned algorithms and multi-dimensional indexes
