The Case for Learned Index Structures (R244) - Michael Chi Ian Tang




SLIDE 1

The Case for Learned Index Structures

Kraska, T., Beutel, A., Chi, E. H., Dean, J., & Polyzotis, N.

R244 Michael Chi Ian Tang

SLIDE 2

Background

SLIDE 3

Index Structures

  • Index structures are built for efficient data access
  • E.g. B-Trees
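A sorted array searched with binary search is the simplest stand-in for such an index (both it and a B-Tree answer lookups over sorted keys in O(log n)); a minimal sketch, not from the slides:

```python
import bisect

# A sorted array with binary search stands in for a B-Tree here: both
# support point lookups and range scans over sorted keys in O(log n).
keys = [3, 7, 7, 12, 19, 25, 31, 44]

def lookup(key):
    """Return the position of the first entry >= key (a range-index lookup)."""
    return bisect.bisect_left(keys, key)

print(lookup(12))  # 3
print(lookup(20))  # 5 (the first key >= 20 is 25)
```

The key list and `lookup` helper are purely illustrative.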

SLIDE 4

Index Structures as Models

SLIDE 5

Index Structures as Models

SLIDE 6

Range Index

SLIDE 7

Range Index Models = CDF Models

True position ๐‘žโˆ— = ๐‘ ๐‘๐‘œ๐‘™ ๐‘™๐‘“๐‘ง = | ๐‘™ ๐‘™ โ‰ค ๐‘™๐‘“๐‘ง | = ๐‘„ ๐‘Œ โ‰ค ๐‘™๐‘“๐‘ง โˆ— ๐‘‚ ๐‘„ ๐‘Œ โ‰ค ๐‘™๐‘“๐‘ง is the CDF of keys

SLIDE 8

Range Index Models = CDF Models

Model: ๐‘ž๐‘๐‘ก = ๐บ ๐‘™๐‘“๐‘ง โˆ— ๐‘‚ โ‰ˆ ๐‘žโˆ— ๐บ ๐‘™๐‘“๐‘ง โ‰ˆ ๐‘„ ๐‘Œ โ‰ค ๐‘™๐‘“๐‘ง

SLIDE 9

The Recursive Model Index (RMI)

  • Prediction from previous stage chooses the next model
  • Progressively refine the prediction
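The two-stage idea can be sketched as follows (an assumed simplification with linear models at both stages, not the paper's implementation):

```python
# Stage 1 routes each key to one of several stage-2 models; each stage-2
# model is fit only to its own slice of the CDF, refining the prediction.
keys = sorted(i * i for i in range(1, 1500))   # non-uniform key distribution
N = len(keys)
NUM_MODELS = 16

def linear_fit(xs, ys):
    """Least-squares line; degenerates to a constant if xs has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return slope, my - slope * mx

# Stage 1: a single model over the whole key space picks the stage-2 model.
s1 = linear_fit(keys, range(N))

def route(key):
    pos = s1[0] * key + s1[1]
    return min(NUM_MODELS - 1, max(0, int(pos * NUM_MODELS / N)))

# Stage 2: fit one model per partition induced by stage 1's predictions.
parts = [([], []) for _ in range(NUM_MODELS)]
for i, k in enumerate(keys):
    xs, ys = parts[route(k)]
    xs.append(k)
    ys.append(i)
s2 = [linear_fit(xs, ys) if xs else s1 for xs, ys in parts]

def rmi_predict(key):
    slope, intercept = s2[route(key)]
    return min(N - 1, max(0, int(round(slope * key + intercept))))
```

Because each stage-2 model is least-squares optimal on its own partition, the total squared error can only go down relative to the stage-1 model alone.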

SLIDE 10

The Recursive Model Index

  • Benefits
  • Decouples execution cost & model size
  • Notion of progressively learning the shape of the CDF
  • Divides the space into smaller ranges, making the final prediction easier to refine

SLIDE 11

The Recursive Model Index

  • Worst-case performance
  • If last-stage models do not meet the error requirement, replace them with B-Trees
  • Has the same worst-case guarantee as B-Trees

SLIDE 12

The Recursive Model Index - Training

  • Loss defined as:

L = Σ_(x,y) (f(x) − y)²

where f(x) is the predicted position of key x and y is its true position

  • Simple models train in seconds, neural nets in minutes
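The loss above, spelled out as a tiny illustrative snippet (the model `f` and the training pairs are made up):

```python
# Squared-error loss over training pairs (key x, true position y).
def squared_error_loss(f, pairs):
    return sum((f(x) - y) ** 2 for x, y in pairs)

# A toy model that is off by 1 and 2 on two of the three keys:
f = lambda x: x
print(squared_error_loss(f, [(0, 0), (1, 2), (2, 4)]))  # 0 + 1 + 4 = 5
```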

SLIDE 13

Experiments

  • Integer datasets
  • Weblogs dataset contains 200M log entries
  • Maps dataset indexes the longitude of ≈ 200M user-maintained features
  • Log-normal dataset synthesized by sampling 190M unique values
  • Models
  • 2-stage RMI models with second-stage sizes of 10k, 50k, 100k, and 200k
  • Read-optimized B-Trees with different page sizes

SLIDE 14

Results

SLIDE 15

Point Index

SLIDE 16

Point Index

  • Example: hash-map
  • Deterministically maps keys to positions inside an array

SLIDE 17

The Hash-Model Index

  • Build a hash function based on the CDF of the data (M is the size of the hash-map):

h(key) = F(key) · M, where F(key) ≈ P(X ≤ key)

SLIDE 18

The Hash-Model Index

  • Main objective is to reduce the number of conflicts
  • Conflicts can induce high cost depending on the architecture (e.g. distributed)

SLIDE 19

Experiments

  • Learned models with same settings as in range index
  • Compared against a MurmurHash3-like hash function

SLIDE 20

Existence Index

SLIDE 21

Existence Index

  • Example: Bloom filters
  • Return whether a key exists in a dataset
  • No false negatives, but potential false positives
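A minimal Bloom filter sketch (illustrative; the bit-array size and number of hash functions are arbitrary choices):

```python
import hashlib

class BloomFilter:
    """k hash functions set k bits per key; lookups can yield false
    positives but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("example.com/login")
print("example.com/login" in bf)   # True: an added key is always found
```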

SLIDE 22

Bloom filters as a Classification Problem

  • Binary probabilistic classification task: whether a key exists in the dataset

[Diagram: key → Model → Exists / Does not exist]
SLIDE 23

Bloom filters as a Classification Problem

  • Guarantee of no false negatives
  • Overflow Bloom filter: remembers the model's false negatives
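The model-plus-overflow construction can be sketched as follows (a hypothetical keyword classifier stands in for the RNN, and a plain set stands in for the overflow Bloom filter):

```python
# Any positive key the classifier misses at build time goes into the
# overflow structure, so the combined index has no false negatives.
positives = {"evil.example/a", "evil.example/b", "phish.example/x"}

def classifier(url):
    # Stand-in for the learned model: flags URLs containing "evil".
    return "evil" in url

overflow = set()   # a real implementation would use a Bloom filter here
for url in positives:
    if not classifier(url):   # a false negative of the model...
        overflow.add(url)     # ...is remembered exactly

def exists(url):
    return classifier(url) or url in overflow

print(all(exists(u) for u in positives))  # True: no false negatives
```

False positives (e.g. a benign URL containing "evil") remain possible, exactly as with a conventional Bloom filter.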

SLIDE 24

Experiments

  • Data
  • 1.7M blacklisted phishing URLs
  • Negative set: random URLs + whitelisted URLs
  • Comparison
  • Learned filter: RNN with GRU
  • Normal Bloom filter

SLIDE 25

Critique

SLIDE 26

Major Contributions

  • 1. Proposed the idea of applying machine learning to index structures
  • 2. Solutions for offering guarantees on performance and determinism with ML models
  • 3. Showed significant performance improvements (time and space)
  • 4. Inspired a new research direction (27 citations since June 2018)

SLIDE 27

Criticism

  • 1. Details of the platform used for the experiments are not given
  • 2. Little discussion of training time
  • 3. Experiments on CPU only

SLIDE 28

Conclusion & Future Direction

  • Proposed a new direction in database research that
  • Makes effective use of machine learning methods
  • Shows promising preliminary results
  • Has inspired new research work
  • Requires more detailed performance evaluation
  • Has potential in learned algorithms and multi-dimensional indexes
