The Case for Learned Index Structures
R244
Michael Chi Ian Tang
Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N.
Background
Index Structures
• Index structures are built for efficient data access
• E.g. B-Trees
Index Structures as Models
Range Index
Range Index Models = CDF Models
True position: $p^{*} = |\{x \mid x \le \mathit{key}\}| = P(x \le \mathit{key}) \cdot N$
$P(x \le \mathit{key})$ is the CDF of keys
Range Index Models = CDF Models
Model: $\mathit{pos} = F(\mathit{key}) \cdot N \approx p^{*}$, where $F(\mathit{key}) \approx P(x \le \mathit{key})$
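The CDF view above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a single least-squares line as the model F, unique integer keys in a sorted list, and 0-indexed positions (so a key's position is the number of strictly smaller keys). The names `fit_linear_cdf`, `build`, `predict`, and `lookup` are hypothetical.

```python
import bisect


def fit_linear_cdf(keys):
    """Least-squares fit of position ~ a * key + b over sorted keys."""
    n = len(keys)
    mean_x = sum(keys) / n
    mean_y = (n - 1) / 2  # mean of positions 0..n-1
    var_x = sum((x - mean_x) ** 2 for x in keys)
    cov = sum((x - mean_x) * (i - mean_y) for i, x in enumerate(keys))
    a = cov / var_x if var_x else 0.0
    return a, mean_y - a * mean_x


def predict(a, b, key, n):
    """Predicted position, clamped to a valid array index."""
    return min(n - 1, max(0, round(a * key + b)))


def build(keys):
    """Fit the model and record its worst-case error on the stored keys."""
    a, b = fit_linear_cdf(keys)
    err = max(abs(predict(a, b, x, len(keys)) - i) for i, x in enumerate(keys))
    return a, b, err


def lookup(keys, model, key):
    """Predict a position, then binary-search only inside the error window."""
    a, b, err = model
    guess = predict(a, b, key, len(keys))
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None


keys = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
model = build(keys)
positions = [lookup(keys, model, k) for k in keys]
```

Because the error bound is measured over every stored key, a stored key is always inside the search window; the model only decides how wide that window has to be.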
The Recursive Model Index (RMI)
• Prediction from previous stage chooses the next model
• Progressively refine the prediction
The Recursive Model Index
• Benefits
  • Decouple execution cost & model size
  • Notion of progressively learning the shape of the CDF
  • Divide the space into smaller ranges, easier to refine the final prediction
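The staging idea can be sketched as a two-stage index in which a root model picks a leaf model and the leaf model predicts the position. This is an illustrative sketch under assumptions, not the authors' code: both stages are simple least-squares lines, keys are unique and sorted, and the class name `TwoStageRMI` and helper `fit_line` are hypothetical.

```python
import bisect


def fit_line(points):
    """Least-squares line through (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    if var == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in points) / var
    return slope, my - slope * mx


class TwoStageRMI:
    def __init__(self, keys, num_leaves):
        self.keys = keys
        self.n = len(keys)
        self.num_leaves = num_leaves
        # Stage 1: one model over all (key, position) pairs.
        self.root = fit_line([(k, p) for p, k in enumerate(keys)])
        # Route every key through the root to decide which leaf trains on it.
        buckets = [[] for _ in range(num_leaves)]
        for p, k in enumerate(keys):
            buckets[self._route(k)].append((k, p))
        # Stage 2: one model per leaf, plus its worst-case error.
        self.leaves = []
        for bucket in buckets:
            slope, icpt = fit_line(bucket)
            err = max(
                (abs(self._clamp(slope * k + icpt) - p) for k, p in bucket),
                default=0,
            )
            self.leaves.append((slope, icpt, err))

    def _clamp(self, pos):
        return min(self.n - 1, max(0, round(pos)))

    def _route(self, key):
        # Scale the root's position estimate to a leaf index.
        slope, icpt = self.root
        leaf = int((slope * key + icpt) * self.num_leaves / self.n)
        return min(self.num_leaves - 1, max(0, leaf))

    def lookup(self, key):
        slope, icpt, err = self.leaves[self._route(key)]
        guess = self._clamp(slope * key + icpt)
        lo, hi = max(0, guess - err), min(self.n, guess + err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < self.n and self.keys[i] == key else None


keys = [x * x for x in range(1, 200)]  # a non-linear key distribution
rmi = TwoStageRMI(keys, num_leaves=8)
found = [rmi.lookup(k) for k in keys]
```

Lookup cost here is two model evaluations plus a bounded binary search, regardless of how many leaf models exist, which is one way to read the "decouple execution cost & model size" bullet.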
The Recursive Model Index
• Worst case performance
  • If a last-stage model does not meet the error requirement, replace it with a B-Tree
  • Gives the same worst-case guarantee as B-Trees
The Recursive Model Index - Training
• Loss defined as: $L = \sum_{(x,y)} (f(x) - y)^2$
• Simple model trained in seconds, Neural Nets in minutes
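As a small worked example of the squared loss, the sketch below fits a linear model f(x) = a·x + b in closed form and checks that nudging either parameter increases the loss. It assumes x is a key and y its position in the sorted array; the function names and the toy data are hypothetical, and a linear model is only one of the simple models the slide alludes to.

```python
def squared_loss(model, data):
    """L = sum over (x, y) pairs of (model(x) - y)^2."""
    return sum((model(x) - y) ** 2 for x, y in data)


def fit_linear(data):
    """Closed-form least-squares fit of y ~ a * x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    a = sum((x - mean_x) * (y - mean_y) for x, y in data) / var_x
    return a, mean_y - a * mean_x


keys = [1, 2, 4, 8, 16]
data = list(zip(keys, range(len(keys))))  # (key, position) pairs

a, b = fit_linear(data)
fitted = squared_loss(lambda x: a * x + b, data)
nudged_slope = squared_loss(lambda x: (a + 0.05) * x + b, data)
nudged_offset = squared_loss(lambda x: a * x + b + 0.1, data)
```

The fitted line is the minimizer of L for this model family, so both nudged losses come out larger than `fitted`.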
Experiments
• Integer Datasets
  • Weblogs dataset contains 200M log entries
  • Maps dataset indexed the longitude of ≈ 200M user-maintained features
  • Log-normal dataset synthesized by sampling 190M unique values
• Models
  • 2-stage RMI model having second-stage sizes (10k, 50k, 100k, and 200k)
  • Read-optimized B-Tree with different page sizes
Results
Point Index
• Example: hash-map
• Deterministically map keys to positions inside an array
The Hash-Model Index
• Build a hash function based on the CDF of the data ($M$ is size of hash-map):
  $h(\mathit{key}) = F(\mathit{key}) \cdot M$, where $F(\mathit{key}) \approx P(x \le \mathit{key})$
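To make h(key) = F(key) · M concrete, the sketch below uses the exact empirical CDF of the stored keys as F and counts slot conflicts. That is an idealized assumption: a learned model only approximates the CDF, so a real hash-model index would see some conflicts. The baseline is uniform random slot assignment, used here as a stand-in for a well-mixing hash function; `learned_hash` and `count_conflicts` are hypothetical names.

```python
import bisect
import random


def learned_hash(sorted_keys, key, num_slots):
    """h(key) = F(key) * M, with F the empirical CDF of the stored keys."""
    rank = bisect.bisect_left(sorted_keys, key)  # number of smaller keys
    # Integer arithmetic avoids floating-point rounding in F(key) * M.
    return min(num_slots - 1, rank * num_slots // len(sorted_keys))


def count_conflicts(slots):
    """Number of keys that land in an already-occupied slot."""
    seen = set()
    conflicts = 0
    for s in slots:
        if s in seen:
            conflicts += 1
        seen.add(s)
    return conflicts


rng = random.Random(0)
keys = sorted(rng.sample(range(10**6), 1000))
M = len(keys)  # one slot per key

learned = count_conflicts(learned_hash(keys, k, M) for k in keys)
baseline = count_conflicts(rng.randrange(M) for _ in keys)
```

With a perfect CDF and M equal to the number of keys, every key gets its own slot, while uniformly random placement collides for a sizeable fraction of keys.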
The Hash-Model Index
• Main objective is to reduce number of conflicts
• Conflicts could induce high cost depending on architecture (e.g. distributed)
Experiments
• Learned models with same settings as in range index
• Compared against MurmurHash3-like hash-function
Existence Index
• Example: Bloom filters
• Return whether a key exists in a dataset
• No false negatives, but has potential false positives
Bloom filters as a Classification Problem
• Binary probabilistic classification task: Whether key exists in dataset
• Diagram: key → Model → Exists / Does not exist
Bloom filters as a Classification Problem
• Guarantee for no false negative
  • Overflow bloom filter: remember false negatives from models
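The overflow idea can be sketched as follows: keys that the model scores below the threshold (its false negatives) are stored in a small conventional Bloom filter, and a query is accepted if either the model or the overflow filter says yes. This is a sketch under assumptions: `toy_score` is a hypothetical stand-in for the trained classifier, the threshold and filter sizes are arbitrary, and the SHA-256 based hashing is chosen only for simplicity.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class LearnedBloomFilter:
    def __init__(self, keys, score, threshold, num_bits, num_hashes):
        self.score = score
        self.threshold = threshold
        self.overflow = BloomFilter(num_bits, num_hashes)
        # Remember every key the model would wrongly reject.
        for key in keys:
            if score(key) < threshold:
                self.overflow.add(key)

    def __contains__(self, key):
        return self.score(key) >= self.threshold or key in self.overflow


def toy_score(url):
    """Hypothetical classifier: a crude keyword rule, not a trained model."""
    return 0.9 if "login" in url else 0.1


keys = ["evil.example/login", "phish.example/verify", "bad.example/login-update"]
lbf = LearnedBloomFilter(keys, toy_score, threshold=0.5, num_bits=1024, num_hashes=3)
all_present = all(k in lbf for k in keys)
```

Every stored key is either accepted by the model or present in the overflow filter, so there are no false negatives; false positives can still come from the model or from the overflow filter itself.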
Experiments
• Data
  • 1.7M blacklisted phishing URLs
  • Negative set: random URLs + whitelisted URLs
• Comparison
  • Learned filter: RNN with GRU
  • Normal Bloom filter
Critique
Major Contributions
1. Proposed the idea of applying machine learning in index structures
2. Offered solutions for guaranteeing performance and determinism with ML models
3. Showed significant performance improvements (time and space)
4. Inspired new research direction (27 citations since June 2018)
Criticism
1. Details of the platform used for experiments not given
2. Little discussion of training time
3. Experiments on CPU only
Conclusion & Future Direction
• Proposed a new direction in database research that
  • Makes effective use of machine learning methods
  • Shows promising preliminary results
  • Inspired new research work
• Requires more details on performance evaluation
• Potential in learned algorithms, multi-dimensional indexes