Superseding Traditional Indexes with Multicriteria Data Structures



slide-1
SLIDE 1

Superseding Traditional Indexes with Multicriteria Data Structures

GIORGIO VINCIGUERRA

PhD student in Computer Science

giorgio.vinciguerra@phd.unipi.it

slide-2
SLIDE 2

Outline

  • 1. Multicriteria data structures
  • 2. The dictionary problem
      • External memory model
      • Multiway trees
      • Novel approaches
      • Our results
  • 3. Bonus slides

slide-3
SLIDE 3

Motivation

  • 1. Algorithms and data structures often offer a collection of different trade-offs (e.g. time, space occupancy, energy consumption, …)
  • 2. Software engineers have to choose the one that best fits the needs of their application
  • 3. These needs change with time, data, devices, and users

slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

Multicriteria Data Structures

A multicriteria data structure selects the best data structure within some performance and computational constraints

  • FAMILY of data structures
  • CONSTRAINTS: space, time, energy, …
  • OPTIMISATION: find the best structure
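The family / constraints / optimisation triple can be illustrated with a minimal Python sketch. Everything here is hypothetical and only for illustration: the `Candidate` record, its fields, and the numbers are not from the talk, and the optimisation is a plain exhaustive filter rather than anything the author describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One member of the family, with illustrative (made-up) costs."""
    name: str
    space_bytes: int
    query_ns: float


def select_best(family, max_space_bytes):
    """Return the fastest candidate that fits in the space bound, or None."""
    feasible = [c for c in family if c.space_bytes <= max_space_bytes]
    return min(feasible, key=lambda c: c.query_ns, default=None)


# Hypothetical family: more space buys faster queries.
FAMILY = [
    Candidate("large", 1000, 50.0),
    Candidate("medium", 200, 120.0),
    Candidate("small", 50, 400.0),
]
```

Calling `select_best(FAMILY, 300)` picks the fastest member that fits in 300 bytes; swapping the roles of the two fields gives the "most compact within a time bound" variant.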

slide-7
SLIDE 7

The dictionary problem

We are given a set of “objects”, and we are asked to store them succinctly and to support efficient retrieval

Databases, file systems, search engines, social networks

slide-8
SLIDE 8
slide-9
SLIDE 9

Memory hierarchy

slide-10
SLIDE 10

Memory hierarchy

[Figure labels: L1, L2, L3]

slide-11
SLIDE 11

Memory hierarchy

[Figure labels: L1, L2, L3]

slide-12
SLIDE 12

Memory hierarchy

[Figure labels, capacities: L1 32 KB, L2 256 KB, L3 3 MB, 8 GB, 256 GB, ∞ TB]

[Figure labels, access latencies: 100 ns, 16 µs (SSD), 3 ms (HDD), 150 ms]

slide-13
SLIDE 13
slide-14
SLIDE 14

The External Memory (aka I/O) model

  • 1. Internal memory (RAM) of capacity M
  • 2. External memory (disk) of unlimited capacity
  • 3. RAM and disk exchange blocks of size B
  • 4. Count the number of block transfers, in Big O notation, instead of the number of operations

[Figure labels: M, B ≈ 4 KiB]

slide-15
SLIDE 15

The External Memory (aka I/O) model

  • 1. Internal memory (RAM) of capacity M
  • 2. External memory (disk) of unlimited capacity
  • 3. RAM and disk exchange blocks of size B
  • 4. Count the number of block transfers, in Big O notation, instead of the number of operations

[Figure labels: M = LLC, B = 64 B]

slide-16
SLIDE 16

Back to the dictionary problem

We are given a set of “objects” (integers or reals), and we are asked to store them succinctly and to support efficient retrieval (e.g. point and range queries)

61 71 12 15 18 1 24 22 88 34 3 10 5 13 55 44 60 2 5 74 90 81

slide-17
SLIDE 17

Predecessor search & range queries

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: sorted array of n keys, positions 1 to n, labelled M]

pred(36) = 36
pred(50) = 48
range(67, 110)
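The two queries on this slide can be sketched on the same sorted array with Python's standard `bisect` module. This is a minimal illustration, assuming that pred(x) means the largest key ≤ x and that range(lo, hi) is inclusive at both ends; the function names are ours.

```python
from bisect import bisect_left, bisect_right

# The sorted array shown on the slide.
KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


def pred(keys, x):
    """Largest key <= x, or None if every key is greater than x."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None


def range_query(keys, lo, hi):
    """All keys k with lo <= k <= hi, in sorted order."""
    return keys[bisect_left(keys, lo):bisect_right(keys, hi)]
```

A range query costs one predecessor-style search for each endpoint plus a scan of the reported keys, which is why the rest of the deck concentrates on predecessor search.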

slide-18
SLIDE 18

Baseline solutions for predecessor search

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure labels: positions 1 to n, M, B = 4]

Solution        RAM model          EM model           EM model
                Worst case time    Worst case I/Os    Best case I/Os
Scan            O(n)               O(n/B)             O(1)

slide-19
SLIDE 19

Baseline solutions for predecessor search

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure labels: positions 1 to n, M, B = 4]

Solution        RAM model          EM model           EM model
                Worst case time    Worst case I/Os    Best case I/Os
Scan            O(n)               O(n/B)             O(1)
Binary search   O(log n)

slide-20
SLIDE 20

Baseline solutions for predecessor search

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure labels: positions 1 to n, M, B = 4]

Solution        RAM model          EM model           EM model
                Worst case time    Worst case I/Os    Best case I/Os
Scan            O(n)               O(n/B)             O(1)
Binary search   O(log n)           O(log(n/B))        O(log(n/B))
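The I/O counts in the table can be made concrete by counting how many distinct blocks of B keys each baseline touches. The sketch below is an illustration under simplifying assumptions that are ours: the array is split into consecutive blocks of B keys, a block already fetched is not counted again, and the scan stops at the first block whose last key exceeds the query.

```python
KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


def scan_ios(keys, x, B):
    """Blocks read by a left-to-right scan looking for the predecessor of x."""
    ios = 0
    for start in range(0, len(keys), B):
        ios += 1                       # fetch one block of B keys
        if keys[start:start + B][-1] > x:
            break                      # the predecessor cannot be further right
    return ios


def binary_search_ios(keys, x, B):
    """Distinct blocks touched by a binary search for the predecessor of x."""
    touched = set()
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        touched.add(mid // B)          # block that contains position mid
        if keys[mid] <= x:
            lo = mid + 1
        else:
            hi = mid
    return len(touched)
```

With B = 4 the array spans 7 blocks: a scan for a key smaller than everything reads one block (the O(1) best case), while a scan for a key larger than everything reads all of them (the O(n/B) worst case).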

slide-21
SLIDE 21

B+ trees

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built on the array (positions 1 to n); node keys: 12 23 31, 122 ∞ ∞, 55 71 76, 31 76 ∞]

slide-22
SLIDE 22

B+ trees

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built on the array (positions 1 to n); node keys: 12 23 31, 122 ∞ ∞, 31 76 ∞, 55 71 76; query: 48?]

slide-23
SLIDE 23

B+ trees

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built on the array (positions 1 to n); node keys: 12 23 31, 122 ∞ ∞, 55 71 76, 31 76 ∞; labels: B + 1, B = 3]

Solution        Space    RAM model          EM model           EM model
                         Worst case time    Worst case I/Os    Best case I/Os
Scan            O(1)     O(n)               O(n/B)             O(1)
Binary search   O(1)     O(log n)           O(log(n/B))        O(log(n/B))
B+ tree         O(n)     O(log n)           O(log_B n)         O(log_B n)
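The O(log_B n) bound comes from visiting one node of B keys per level. The sketch below shows this with a simplified static multiway tree, which is not the B+ tree drawn on the slide: each level simply stores the first key of every node of the level below. The layout and function names are assumptions made for illustration.

```python
from bisect import bisect_right

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


def build_levels(keys, B):
    """levels[0] is the data; levels[k+1] keeps the first key of each
    node (group of B consecutive keys) of levels[k]."""
    levels = [list(keys)]
    while len(levels[-1]) > B:
        levels.append(levels[-1][::B])
    return levels


def multiway_pred(levels, x, B):
    """Return (predecessor of x, number of nodes visited)."""
    pos = 0          # index of the node to visit in the current level
    visited = 0
    for level in reversed(levels):       # from the root down to the data
        lo = pos * B
        hi = min(lo + B, len(level))
        visited += 1                     # one node, i.e. one block transfer
        pos = bisect_right(level, x, lo, hi) - 1
        if pos < lo:                     # x is smaller than every key
            return None, visited
    return levels[0][pos], visited
```

For the 28 keys above, `multiway_pred(build_levels(KEYS, 3), 48, 3)` walks four levels and `B = 4` needs three, so a larger block size directly shortens the search path.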

slide-24
SLIDE 24

B-trees are everywhere

  • 1. “B-trees have become, de facto, a standard for file organization” (Comer. Ubiquitous B-tree. ACM Computing Surveys. ’79)
  • 2. This is still true today

slide-25
SLIDE 25

B-trees are everywhere

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built on the array (positions 1 to n); node keys: 12 23 31, 122 ∞ ∞, 55 71 76, 31 76 ∞]

slide-26
SLIDE 26

B-trees are machine learning models

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: a model maps a key to a position in the array (positions 1 to n); the key is then searched in the range [position − ε, position + ε]]

“All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.”

Trained on the dataset {(key_i, i)} for i = 1, …, n
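The idea of a model trained on (key, position) pairs can be sketched with a single least-squares line. This is an illustration only: the class name is ours, positions are 0-based, the keys are assumed sorted and non-empty, and ε is measured after fitting as the largest prediction error over the stored keys rather than guaranteed in advance.

```python
import math
from bisect import bisect_right

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = list(keys)
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_i = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_i)
                  for i, k in enumerate(self.keys))
        # Least-squares fit of position as a linear function of the key.
        self.slope = cov / var if var > 0 else 0.0
        self.intercept = mean_i - self.slope * mean_k
        # eps: largest error of the model on the training pairs.
        self.eps = math.ceil(max(abs(self._predict(k) - i)
                                 for i, k in enumerate(self.keys)))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def pred(self, x):
        """Largest key <= x, found by searching only around the prediction."""
        n = len(self.keys)
        p = math.floor(self._predict(x))
        # One extra slot on the left covers queries that are not stored keys.
        lo = min(n, max(0, p - self.eps - 1))
        hi = max(lo, min(n, p + self.eps + 1))
        j = bisect_right(self.keys, x, lo, hi) - 1
        return self.keys[j] if j >= 0 else None
```

The final binary search runs over at most 2ε + 2 positions instead of the whole array, so the quality of the model directly bounds the query cost.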

slide-27
SLIDE 27

B-trees are machine learning models

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: key, its predicted position in the array (positions 1 to n), and the offsets 2^0, 2^1, 2^2, 2^3 around that position]

“All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.”

Trained on the dataset {(key_i, i)} for i = 1, …, n

slide-28
SLIDE 28

The Recursive Model Index (RMI)

[Figure: three stages of models. Stage 1: Model 1.1. Stage 2: Models 2.1, 2.2, 2.3. Stage 3: Models 3.1, 3.2, 3.3, 3.4. Input: key. Output: pos. Final check: key ∈ [pos − ε, pos + ε]?]

2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

slide-29
SLIDE 29

Construction of RMI

  • 1. Train the root model on the dataset
  • 2. Use it to distribute keys to the next stage
  • 3. Repeat for each model in the next stage (on smaller datasets)

[Figure: Stage 1: Model 1.1. Stage 2: Models 2.1, 2.2, 2.3. Input: key. Output: pos]
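The three construction steps can be sketched as a two-stage index in Python. This is a simplified illustration, not the original RMI implementation: every model is a least-squares line, the class name and the `fanout` parameter are ours, keys are assumed sorted, distinct and non-empty, and each second-stage model records its own maximum error so that a lookup only searches a small window.

```python
import math
from bisect import bisect_left

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


def fit_line(points):
    """Least-squares line through (x, y) points: returns (slope, intercept)."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    var = sum((x - mx) ** 2 for x, _ in points)
    cov = sum((x - mx) * (y - my) for x, y in points)
    slope = cov / var if var > 0 else 0.0
    return slope, my - slope * mx


class TwoStageRMI:
    def __init__(self, keys, fanout):
        self.keys = list(keys)
        self.fanout = fanout
        n = len(self.keys)
        # Step 1: train the root on (key, position), rescaled to a model id.
        s, c = fit_line([(k, i) for i, k in enumerate(self.keys)])
        self.root = (s * fanout / n, c * fanout / n)
        # Step 2: use the root to distribute the keys to the next stage.
        buckets = [[] for _ in range(fanout)]
        for i, k in enumerate(self.keys):
            buckets[self._route(k)].append((k, i))
        # Step 3: train one model per bucket and record its maximum error.
        self.models = []
        for pts in buckets:
            s, c = fit_line(pts)
            err = max((abs(s * k + c - i) for k, i in pts), default=0.0)
            self.models.append((s, c, math.ceil(err)))

    def _route(self, key):
        s, c = self.root
        return min(self.fanout - 1, max(0, int(s * key + c)))

    def lookup(self, key):
        """Position of key in the array, or None if it is not stored."""
        s, c, err = self.models[self._route(key)]
        p = math.floor(s * key + c)
        lo, hi = max(0, p - err), min(len(self.keys), p + err + 1)
        if lo >= hi:
            return None
        j = bisect_left(self.keys, key, lo, hi)
        return j if j < hi and self.keys[j] == key else None
```

Note that `fanout` and the kind of model are fixed by hand before training, and the per-model errors are only known after construction.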

slide-30
SLIDE 30

Performance of RMI

slide-31
SLIDE 31

Limitations of RMI

  • 1. Fixed structure with many hyperparameters: number of stages, number of models in each stage, kinds of regression models
  • 2. No a priori error guarantees: difficult to predict latencies
  • 3. Models are agnostic to the power of the models below: can result in underused models (waste of space)

[Figure: RMI with Stage 1 (model 1.1), Stage 2 (models 2.1, 2.2, 2.3) and Stage 3 (models 3.1, 3.2, 3.3, 3.4)]

slide-32
SLIDE 32

Our idea (submitted)

Compute the optimal piecewise linear approximation with guaranteed error ε in O(n) time
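To show what a piecewise linear approximation with guaranteed error ε looks like, here is a Python sketch that uses a greedy "shrinking cone" pass. This is a swapped-in, simpler technique and not the optimal algorithm the slide refers to: it also runs in O(n) and respects the error bound, but it may produce more segments than necessary. Keys are assumed sorted and strictly increasing, positions are 0-based, and each segment is stored as a (key, slope, intercept) triple whose prediction is slope · (x − key) + intercept.

```python
import math
from bisect import bisect_right

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


def build_segments(keys, eps):
    """Greedy shrinking-cone segmentation of the points (key, position)."""
    segments = []
    n = len(keys)
    start = 0
    while start < n:
        lo_s, hi_s = 0.0, math.inf     # slopes still feasible for the segment
        end = start + 1
        while end < n:
            dx = keys[end] - keys[start]
            dy = end - start
            new_lo = max(lo_s, (dy - eps) / dx)
            new_hi = min(hi_s, (dy + eps) / dx)
            if new_lo > new_hi:        # no slope fits this point: close here
                break
            lo_s, hi_s = new_lo, new_hi
            end += 1
        slope = lo_s if hi_s == math.inf else (lo_s + hi_s) / 2
        segments.append((keys[start], slope, start))
        start = end
    return segments


def predict(segments, x):
    """Approximate position of x, using the segment that covers it."""
    seg_keys = [key for key, _, _ in segments]
    j = max(0, bisect_right(seg_keys, x) - 1)
    key, slope, intercept = segments[j]
    return slope * (x - key) + intercept
```

A larger ε lets each segment absorb more keys, so the number of segments m shrinks as ε grows; this is the trade-off the later slides explore.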

slide-33
SLIDE 33

Our idea (submitted)

Save the m segments in a vector as triples s_i = (key, slope, intercept)

slide-34
SLIDE 34

Our idea (submitted)

Drop all the points except s_i.key

slide-35
SLIDE 35

Our idea (submitted)

… and repeat!
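The four steps (segment, store triples, keep only the segment keys, repeat) can be put together in a short Python sketch. It is an illustration under assumptions, not the authors' implementation: the segmentation is a greedy shrinking cone rather than the optimal algorithm, the class and method names are ours, keys must be sorted and strictly increasing, ε must be an integer, and the search window is widened by a couple of slots to cover non-stored queries and floating-point rounding.

```python
import math
from bisect import bisect_right

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]


def build_segments(keys, eps):
    """Greedy shrinking-cone segmentation (NOT the optimal algorithm)."""
    segments, start, n = [], 0, len(keys)
    while start < n:
        lo_s, hi_s, end = 0.0, math.inf, start + 1
        while end < n:
            dx, dy = keys[end] - keys[start], end - start
            new_lo = max(lo_s, (dy - eps) / dx)
            new_hi = min(hi_s, (dy + eps) / dx)
            if new_lo > new_hi:
                break
            lo_s, hi_s, end = new_lo, new_hi, end + 1
        slope = lo_s if hi_s == math.inf else (lo_s + hi_s) / 2
        segments.append((keys[start], slope, start))  # (key, slope, intercept)
        start = end
    return segments


class RecursiveSegmentIndex:
    def __init__(self, keys, eps):
        self.eps = eps
        self.arrays = [list(keys)]   # arrays[d]: keys approximated at level d
        self.segs = []               # segs[d]: segments built over arrays[d]
        while True:
            level = build_segments(self.arrays[-1], eps)
            self.segs.append(level)
            if len(level) <= 1:      # a single segment is the root
                break
            # Drop all the points except the segment keys, and repeat.
            self.arrays.append([key for key, _, _ in level])

    def pred_index(self, x):
        """0-based position of the largest key <= x, or None."""
        keys = self.arrays[0]
        if not keys or x < keys[0]:
            return None
        j = 0                        # segment to use at the current level
        for array, level in zip(reversed(self.arrays), reversed(self.segs)):
            key, slope, intercept = level[j]
            # Never predict past the first position of the next segment.
            limit = level[j + 1][2] if j + 1 < len(level) else len(array)
            pos = math.floor(min(slope * (x - key) + intercept, limit))
            lo = max(0, pos - self.eps - 2)
            hi = min(len(array), pos + self.eps + 2)
            j = bisect_right(array, x, lo, hi) - 1
        return j
```

Each level costs one model evaluation plus a search over O(ε) positions, and every level has at most about half as many entries as the one below, so the number of levels is logarithmic in the number of segments.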

slide-36
SLIDE 36

Memory layout of the PGM-index

slide-37
SLIDE 37

Some asymptotic bounds

Data structure       Space of index   RAM model         EM model          EM model
                                      Worst case time   Worst case I/Os   Best case I/Os
Plain sorted array   O(1)             O(log n)          O(log(n/B))       O(log(n/B))
Multiway tree        Θ(n)             O(log n)          O(log_B n)        O(log_B n)
RMI                  Fixed            O(?)              O(?)              O(1)
PGM-index            Θ(m)             O(log m)          O(log_c m)        O(1)

where c ≥ 2ε = Ω(B), for n keys, m segments, error ε and block size B

slide-38
SLIDE 38

PGM-index in practice

[Figure: number of segments against the error of the position estimate, for the whole datasets and for their first 25M entries; annotation: 3 seconds to compute]

Datasets: Web logs (715M points), Longitude (166M points), IoT (26M points)

slide-39
SLIDE 39

Space-time performance

slide-40
SLIDE 40

How to explore this space of trade-offs?

Given a space bound S, efficiently find the index that minimizes the query time within space S, and vice versa

slide-41
SLIDE 41

Back to Multicriteria Data Structures

A multicriteria data structure is defined by a family of data structures and an optimisation algorithm that selects the best data structure in the family within some computational constraints

  • FAMILY: PGM-indexes ∀ε
  • CONSTRAINTS: Space & Time
  • OPTIMISATION: ???

slide-42
SLIDE 42

The Multicriteria PGM-index

  • 1. We designed a cost model for the space s(ε) and the time t(ε)
  • 2. … but we don’t have a closed formula for s(ε): it depends on the input array
  • 3. We fit s(ε) with a power law of the form a·ε^(−b)

[Figure: space as a function of ε]
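One simple way to fit a power law a·ε^(−b) to measured (ε, space) pairs is ordinary least squares in log-log space, since log s = log a − b·log ε is a straight line. The Python sketch below assumes this fitting method, which the slides do not specify; the function name and the sample values in the usage note are made up for illustration.

```python
import math


def fit_power_law(samples):
    """Fit space ~= a * eps ** (-b) to (eps, space) pairs with positive
    values and at least two distinct eps. Returns (a, b)."""
    xs = [math.log(eps) for eps, _ in samples]
    ys = [math.log(space) for _, space in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    a = math.exp(my - slope * mx)
    return a, -slope
```

For example, feeding it points generated from 1000·ε^(−1.5) at ε = 8, 16, 32, 64 recovers a ≈ 1000 and b ≈ 1.5; on real data the fit is only approximate, which is why it has to be refined as more (ε, space) pairs are measured.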

slide-43
SLIDE 43

Under the hood

  • 1. A sort of interpolation search over ε values
  • 2. Each iteration improves the fitting of a·ε^(−b) by updating a and b
  • 3. Bias the ε-iterate towards the midpoint of a binary search
  • 4. In practice, given a space (time) bound, it finds the fastest (most compact) index for 715M keys in < 1 min

[Figure: space as a function of ε, with successive ε iterates and the target ε*]

slide-44
SLIDE 44

Future work

  • 1. Insertions and deletions
  • 2. Non-linear models
  • 3. Compression

slide-45
SLIDE 45

Bonus slides

Tools that you may find useful

slide-46
SLIDE 46
slide-47
SLIDE 47
slide-48
SLIDE 48
slide-49
SLIDE 49

3× faster than py_distance 117× faster than scipy.spatial.distance.euclidean

slide-50
SLIDE 50
slide-51
SLIDE 51
slide-52
SLIDE 52

GIORGIO VINCIGUERRA

PhD student in Computer Science

http://pages.di.unipi.it/vinciguerra/

giorgio.vinciguerra@phd.unipi.it