
SLIDE 1

Superseding Traditional Indexes with Multicriteria Data Structures

GIORGIO VINCIGUERRA

PhD student in Computer Science

giorgio.vinciguerra@phd.unipi.it

SLIDE 2

Outline

1. Multicriteria data structures
2. The dictionary problem
   External memory model
   Multiway trees
   Novel approaches
   Our results
3. Bonus slides

SLIDE 3

Motivation

1. Algorithms and data structures often offer a collection of different trade-offs (e.g. time, space occupancy, energy consumption, …)
2. Software engineers have to choose the one that best fits the needs of their application
3. These needs change with time, data, devices, and users

SLIDE 4

SLIDE 5

SLIDE 6

Multicriteria Data Structures

A multicriteria data structure selects the best data structure within some performance and computational constraints

FAMILY of data structures
CONSTRAINTS: space, time, energy…
OPTIMISATION: find the best structure

SLIDE 7

The dictionary problem

We are given a set of “objects”, and we are asked to store them succinctly and to support efficient retrieval

Databases, File Systems, Search Engines, Social Networks

SLIDE 8

SLIDE 9

Memory hierarchy

SLIDE 10

Memory hierarchy

[Figure: memory hierarchy with CPU caches L1, L2, L3]

SLIDE 11

Memory hierarchy

[Figure: memory hierarchy with CPU caches L1, L2, L3]

SLIDE 12

Memory hierarchy

[Figure: memory hierarchy annotated with capacities and latencies.
Capacities: L1 32 KB, L2 256 KB, L3 3 MB, 8 GB, 256 GB, ∞ TB.
Latencies: 100 ns, 16 µs (SSD), 3 ms (HDD), 150 ms.]

SLIDE 13

SLIDE 14

The External Memory (aka I/O) model

1. Internal memory (RAM) of capacity M
2. External memory (disk) of unlimited capacity
3. RAM and disk exchange blocks of size B
4. Count the number of block transfers in Big O, instead of the number of operations

[Figure: RAM of capacity M and disk, exchanging blocks of size B ≈ 4 KiB]

SLIDE 15

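The External Memory (aka I/O) model

1. Internal memory (RAM) of capacity M
2. External memory (disk) of unlimited capacity
3. RAM and disk exchange blocks of size B
4. Count the number of block transfers in Big O, instead of the number of operations

[Figure: the same model one level up, with the LLC as the internal memory of capacity M and blocks of size B = 64 bytes]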

SLIDE 16

Back to the dictionary problem

We are given a set of “objects”, and we are asked to store them succinctly and to support efficient retrieval

Integers or reals
e.g. point and range queries ✓

61 71 12 15 18 1 24 22 88 34 3 10 5 13 55 44 60 2 5 74 90 81

SLIDE 17

Predecessor search & range queries

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

pred(36) = 36
pred(50) = 48
range(67, 110)
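The two queries above can be sketched with Python's standard `bisect` module on the slide's sorted array. The function names `pred` and `range_query` are illustrative choices, and this is only a minimal in-memory sketch, not an indexed solution.

```python
import bisect

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def pred(keys, q):
    """Largest key <= q, or None if every key is greater than q."""
    i = bisect.bisect_right(keys, q)
    return keys[i - 1] if i > 0 else None

def range_query(keys, lo, hi):
    """All keys k with lo <= k <= hi, in sorted order."""
    return keys[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

print(pred(KEYS, 36))               # 36
print(pred(KEYS, 50))               # 48
print(range_query(KEYS, 67, 110))   # [71, 73, 74, 76, 88, 95, 99, 102]
```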

SLIDE 18

Baseline solutions for predecessor search

Sorted array (positions 1…n), internal memory M, block size B = 4:
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

Solution        RAM model          EM model           EM model
                worst-case time    worst-case I/Os    best-case I/Os
Scan            O(n)               O(n/B)             O(1)

SLIDE 19

Baseline solutions for predecessor search

Sorted array (positions 1…n), internal memory M, block size B = 4:
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

Solution        RAM model          EM model           EM model
                worst-case time    worst-case I/Os    best-case I/Os
Scan            O(n)               O(n/B)             O(1)
Binary search   O(log n)

SLIDE 20

Baseline solutions for predecessor search

Sorted array (positions 1…n), internal memory M, block size B = 4:
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

Solution        RAM model          EM model           EM model
                worst-case time    worst-case I/Os    best-case I/Os
Scan            O(n)               O(n/B)             O(1)
Binary search   O(log n)           O(log(n/B))        O(log(n/B))
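To make the I/O columns concrete, the sketch below counts the distinct blocks of size B = 4 touched by a scan and by a binary search on the slide's array (n = 28, so n/B = 7 blocks). Counting each distinct block once assumes a block stays in internal memory after its first transfer; the function names are illustrative.

```python
KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def scan_ios(keys, q, B):
    """Predecessor by left-to-right scan; returns (pred, blocks touched)."""
    blocks, pred = set(), None
    for i, k in enumerate(keys):
        blocks.add(i // B)          # reading keys[i] touches block i // B
        if k > q:
            break                   # first key beyond q: stop scanning
        pred = k
    return pred, len(blocks)

def binary_search_ios(keys, q, B):
    """Predecessor by binary search; returns (pred, blocks touched)."""
    blocks = set()
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        blocks.add(mid // B)        # each probe touches one block
        if keys[mid] <= q:
            lo = mid + 1
        else:
            hi = mid
    return (keys[lo - 1] if lo > 0 else None), len(blocks)

print(scan_ios(KEYS, 50, 4))            # (48, 4)
print(binary_search_ios(KEYS, 50, 4))   # (48, 3)
print(scan_ios(KEYS, 1, 4))             # (None, 1): the O(1) best case
```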

SLIDE 21

B+ trees

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built over the sorted array; node keys shown: 12 23 31 122 ∞ ∞ 55 71 76 31 76 ∞]

SLIDE 22

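B+ trees

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built over the sorted array, with a search for the key 48; node keys shown: 12 23 31 122 ∞ ∞ 31 76 ∞ 55 71 76]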

SLIDE 23

B+ trees

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built over the sorted array; figure labels: B + 1, B = 3]

Solution        Space    RAM model          EM model           EM model
                         worst-case time    worst-case I/Os    best-case I/Os
Scan            O(1)     O(n)               O(n/B)             O(1)
Binary search   O(1)     O(log n)           O(log(n/B))        O(log(n/B))
B+ tree         O(n)     O(log n)           O(log_B n)         O(log_B n)
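The O(log_B n) bound can be illustrated with a static multiway tree: each upper level keeps the first key of every block of B keys below it, and a query reads one block per level. This is a simplified stand-in for a B+ tree (no insertions, no ∞ sentinels), assuming B ≥ 2 and distinct sorted keys; `build_levels` and `tree_pred` are illustrative names.

```python
import bisect

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def build_levels(keys, B):
    """levels[0] is the data; each upper level keeps the first key of every
    block of B keys in the level below, until a single block remains."""
    levels = [keys]
    while len(levels[-1]) > B:
        levels.append(levels[-1][::B])
    return levels

def tree_pred(levels, q, B):
    """Return (predecessor of q or None, number of blocks read)."""
    ios, pos = 0, 0
    for depth, level in enumerate(reversed(levels)):
        start = pos * B if depth > 0 else 0   # child block of the chosen key
        block = level[start:start + B]        # one block transfer
        ios += 1
        j = bisect.bisect_right(block, q) - 1  # last key <= q in the block
        if j < 0:
            return None, ios                  # q is smaller than every key
        pos = start + j
    return levels[0][pos], ios

levels = build_levels(KEYS, 3)
print(len(levels))               # 4 levels for n = 28 and B = 3
print(tree_pred(levels, 48, 3))  # (48, 4)
```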

SLIDE 24

B-trees are everywhere

1. “B-trees have become, de facto, a standard for file organization” (Comer. Ubiquitous B-tree. ACM Computing Surveys. ’79)
2. This is still true today

SLIDE 25

B-trees are everywhere

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

[Figure: B+ tree built over the sorted array; node keys shown: 12 23 31 122 ∞ ∞ 55 71 76 31 76 ∞]

SLIDE 26

B-trees are machine learning models

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

key → position, then search in [position − ε, position + ε]

“All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.”

Trained on the dataset {(key_i, i)} for i = 1, …, n
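As a minimal sketch of this view, the code below fits a single least-squares line on the pairs (key_i, i), records the maximum prediction error ε, and answers a lookup by searching only in [position − ε, position + ε]. A plain linear regression is an assumption made here for illustration (the slide does not fix the model); positions are 0-based and the keys are assumed sorted and distinct.

```python
import bisect
import math

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def train(keys):
    """Fit position ~ slope * key + intercept and measure the max error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    eps = math.ceil(max(abs(i - (slope * k + intercept))
                        for i, k in enumerate(keys)))
    return slope, intercept, eps

def lookup(keys, model, q):
    """Position of q in keys, or None; probes only the +/- eps window."""
    slope, intercept, eps = model
    position = round(slope * q + intercept)
    lo = min(len(keys), max(0, position - eps))
    hi = max(lo, min(len(keys), position + eps + 1))
    i = bisect.bisect_left(keys, q, lo, hi)
    return i if i < len(keys) and keys[i] == q else None

model = train(KEYS)
print(lookup(KEYS, model, 48))   # 13 (0-based position of key 48)
print(lookup(KEYS, model, 50))   # None: 50 is not a key
```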

SLIDE 27

B-trees are machine learning models

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

key → position, then a search around the position with steps 2^0, 2^1, 2^2, 2^3

“All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.”

Trained on the dataset {(key_i, i)} for i = 1, …, n

SLIDE 28

The Recursive Model Index (RMI)

Stage 1: Model 1.1
Stage 2: Model 2.1, Model 2.2, Model 2.3
Stage 3: Model 3.1, Model 3.2, Model 3.3, Model 3.4

Sorted array (positions 1…n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123

key → pos, then check: key ∈ [pos − ε, pos + ε]?

SLIDE 29

Construction of RMI

1. Train the root model on the dataset
2. Use it to distribute keys to the next stage
3. Repeat for each model in the next stage (on smaller datasets)

[Figure: key → pos through Stage 1 (Model 1.1) and Stage 2 (Model 2.1, Model 2.2, Model 2.3)]
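The three construction steps can be sketched for a two-stage RMI as follows. Using least-squares lines at both stages and scaling the root's position estimate to pick a stage-2 model are assumptions made for illustration; `fit_line`, `route`, `build_rmi`, `rmi_lookup` and the `fanout` parameter are hypothetical names. Keys are assumed sorted and distinct, positions are 0-based.

```python
import bisect
import math

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def fit_line(points):
    """Least-squares line through (key, position) pairs."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0
    mean_k = sum(k for k, _ in points) / n
    mean_p = sum(p for _, p in points) / n
    var = sum((k - mean_k) ** 2 for k, _ in points)
    if var == 0:
        return 0.0, mean_p
    slope = sum((k - mean_k) * (p - mean_p) for k, p in points) / var
    return slope, mean_p - slope * mean_k

def route(root, key, n, fanout):
    """Stage 1: scale the root's position estimate to a stage-2 model id."""
    slope, intercept = root
    model_id = int((slope * key + intercept) * fanout / n)
    return min(fanout - 1, max(0, model_id))

def build_rmi(keys, fanout):
    n = len(keys)
    root = fit_line(list(zip(keys, range(n))))          # 1. train the root
    buckets = [[] for _ in range(fanout)]
    for pos, key in enumerate(keys):                     # 2. distribute keys
        buckets[route(root, key, n, fanout)].append((key, pos))
    leaves = []
    for bucket in buckets:                               # 3. train each model
        slope, intercept = fit_line(bucket)
        err = max((abs(p - (slope * k + intercept)) for k, p in bucket),
                  default=0)
        leaves.append((slope, intercept, math.ceil(err)))
    return root, leaves

def rmi_lookup(keys, rmi, fanout, q):
    """Position of q in keys, or None; the error is known only after training."""
    root, leaves = rmi
    slope, intercept, err = leaves[route(root, q, len(keys), fanout)]
    guess = round(slope * q + intercept)
    lo = min(len(keys), max(0, guess - err))
    hi = max(lo, min(len(keys), guess + err + 1))
    i = bisect.bisect_left(keys, q, lo, hi)
    return i if i < len(keys) and keys[i] == q else None

rmi = build_rmi(KEYS, 4)
print(rmi_lookup(KEYS, rmi, 4, 74))   # 19 (0-based position of key 74)
```

Note that the per-model error is measured after training rather than chosen in advance, and nothing prevents a stage-2 model from receiving few or no keys.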

SLIDE 30

Performance of RMI

SLIDE 31

Limitations of RMI

1. Fixed structure with many hyperparameters
   # stages, # models in each stage, kinds of regression models
2. No a priori error guarantees
   Difficult to predict latencies
3. Models are agnostic to the power of models below
   Can result in underused models (waste of space)

[Figure: RMI with Stage 1 (model 1.1), Stage 2 (models 2.1, 2.2, 2.3) and Stage 3 (models 3.1, 3.2, 3.3, 3.4)]

SLIDE 32

Our idea (submitted)

Compute the optimal piecewise linear approximation with guaranteed error ε in O(n) time

SLIDE 33

Our idea (submitted)

Save the m segments in a vector as triples s_i = (key, slope, intercept)

SLIDE 34

Our idea (submitted)

Drop all the points except s_i.key

SLIDE 35

Our idea (submitted)

… and repeat!
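The four steps above can be sketched as follows. One substitution should be stated plainly: instead of the optimal O(n) piecewise linear approximation, this sketch uses a simpler greedy "shrinking cone" that still guarantees error ε but may produce more segments than the optimum. The names `build_segments`, `build_pgm`, `locate` and `pgm_predecessor` are illustrative; keys are assumed sorted and distinct, ε a non-negative integer, positions 0-based.

```python
import bisect

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48,
        55, 59, 60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
INF = float("inf")

def build_segments(xs, eps):
    """Greedy PLA: every xs[i] is predicted within eps of its position i.
    Returns triples (key, slope, intercept)."""
    segs = []
    x0, y0, lo, hi = xs[0], 0, 0.0, INF     # segment origin and slope range
    for y in range(1, len(xs)):
        dx = xs[y] - x0
        new_lo = max(lo, (y - y0 - eps) / dx)
        new_hi = min(hi, (y - y0 + eps) / dx)
        if new_lo <= new_hi:                # point fits: shrink the cone
            lo, hi = new_lo, new_hi
        else:                               # close the segment, start a new one
            segs.append((x0, (lo + hi) / 2, y0))
            x0, y0, lo, hi = xs[y], y, 0.0, INF
    segs.append((x0, lo if hi == INF else (lo + hi) / 2, y0))
    return segs

def build_pgm(keys, eps):
    """levels[i] = (xs, segments approximating xs); repeat on segment keys."""
    levels, xs = [], keys
    while True:
        segs = build_segments(xs, eps)
        levels.append((xs, segs))
        if len(segs) == 1:
            return levels
        xs = [s[0] for s in segs]           # drop all points except s_i.key

def locate(xs, segs, j, q, eps):
    """Index of the rightmost element of xs that is <= q, using segs[j]."""
    key, slope, intercept = segs[j]
    limit = segs[j + 1][2] if j + 1 < len(segs) else len(xs)
    pos = round(min(intercept + slope * (q - key), limit))
    lo, hi = max(0, pos - eps - 1), min(len(xs), pos + eps + 2)
    return bisect.bisect_right(xs, q, lo, hi) - 1

def pgm_predecessor(levels, q, eps):
    if q < levels[0][0][0]:
        return None                         # q precedes every key
    j = 0                                   # the top level has one segment
    for xs, segs in reversed(levels):
        j = locate(xs, segs, j, q, eps)
    return levels[0][0][j]

levels = build_pgm(KEYS, 1)
print(pgm_predecessor(levels, 50, 1))       # 48
```

Each level is searched in a window of O(ε) positions around the prediction, which is what makes the error bound usable a priori.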

SLIDE 36

Memory layout of the PGM-index

SLIDE 37

Some asymptotic bounds

Data structure       Space of   RAM model          EM model           EM model
                     index      worst-case time    worst-case I/Os    best-case I/Os
Plain sorted array   O(1)       O(log n)           O(log(n/B))        O(log(n/B))
Multiway tree        Θ(n)       O(log n)           O(log_B n)         O(log_B n)
RMI                  Fixed      O(?)               O(?)               O(1)
PGM-index            Θ(m)       O(log m)           O(log_c m)         O(1)

where c ≥ 2ε = Ω(B)

n keys, m segments, ε error

SLIDE 38

PGM-index in practice

[Figure: two plots, "Whole datasets" and "First 25M entries"; axes: error of the position estimate, number of segments]

Datasets: Web logs = 715M points, Longitude = 166M points, IoT = 26M points

3 seconds to compute

SLIDE 39

Space-time performance

SLIDE 40

How to explore this space of trade-offs?

Given a space bound S, efficiently find the index that minimizes the query time within space S, and vice versa

SLIDE 41

Back to Multicriteria Data Structures

A multicriteria data structure is defined by a family of data structures and an optimisation algorithm that selects the best data structure in the family within some computational constraints

FAMILY: PGM-indexes ∀ε
CONSTRAINTS: Space & Time
OPTIMISATION: ???

SLIDE 42

The Multicriteria PGM-index

1. We designed a cost model for the space s(ε) and the time t(ε)
2. … but we don’t have a closed formula for s(ε): it depends on the input array
3. We fit s(ε) with a power law of the form a·ε^(−b)

[Figure: space as a function of ε]

SLIDE 43

Under the hood

1. A sort of interpolation search over ε values
2. Each iteration improves the fitting of a·ε^(−b), updating a, b
3. Bias the ε-iterate towards the midpoint of a binary search
4. In practice, given a space (time) bound, it finds the fastest (most compact) index for 715M keys in < 1 min

[Figure: space as a function of ε, with the successive ε iterates and the final ε*]
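One way to read the steps above is sketched below for the space-bounded case: keep a bracket of ε values, refit the power law a·ε^(−b) through the two bracket endpoints, solve it for the space bound, and average that guess with the binary-search midpoint. This is a guess at the flavour of the procedure, not the authors' implementation. `smallest_eps_within` is a hypothetical name, and `toy_space` is a made-up stand-in for the cost of actually building a PGM-index with a given ε.

```python
def smallest_eps_within(space, bound, lo, hi):
    """Smallest integer eps in [lo, hi] with space(eps) <= bound, or None.

    Assumes space() is positive and non-increasing in eps, and lo >= 1.
    A smaller eps means a faster index, so this is the fastest one that fits.
    """
    s_lo, s_hi = space(lo), space(hi)
    if s_lo <= bound:
        return lo
    if s_hi > bound:
        return None
    # Invariant: space(lo) > bound >= space(hi)
    while hi - lo > 1:
        # Exponent b of the power law a * eps**(-b) through both endpoints
        # (a is implied: a = s_lo * lo**b).
        b = _log_ratio(s_lo, s_hi) / _log_ratio(hi, lo)
        # Solve a * eps**(-b) = bound for eps, starting from (lo, s_lo).
        guess = lo * (s_lo / bound) ** (1 / b)
        mid = (lo + hi) / 2
        eps = round((guess + mid) / 2)        # bias towards the midpoint
        eps = min(hi - 1, max(lo + 1, eps))   # stay strictly inside the bracket
        s = space(eps)
        if s <= bound:
            hi, s_hi = eps, s
        else:
            lo, s_lo = eps, s
    return hi

def _log_ratio(x, y):
    import math
    return math.log(x / y)

def toy_space(eps):
    """Hypothetical space cost, for illustration only."""
    return 5000 * eps ** -0.8

print(smallest_eps_within(toy_space, 100, 1, 4096))
```

Because the bracket always shrinks by at least one value, the search stays correct even when the power law fits poorly; a good fit only reduces the number of indexes that have to be built.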

SLIDE 44

Future work

1. Insertions and deletions
2. Non-linear models
3. Compression

SLIDE 45

Bonus slides

Tools that you may find useful

SLIDE 46

SLIDE 47

SLIDE 48

SLIDE 49

3× faster than py_distance
117× faster than scipy.spatial.distance.euclidean

SLIDE 50

SLIDE 51

SLIDE 52


GIORGIO VINCIGUERRA

PhD student in Computer Science

http://pages.di.unipi.it/vinciguerra/
giorgio.vinciguerra@phd.unipi.it