Superseding Traditional Indexes
with Multicriteria Data Structures
GIORGIO VINCIGUERRA
PhD student in Computer Science
giorgio.vinciguerra@phd.unipi.it
Outline
1. Multicriteria data structures
2. The dictionary problem
External
FAMILY
CONSTRAINTS: space, time, energy…
OPTIMISATION: find the best structure
We are given a set of “objects”, and we are asked to store them succinctly and to support efficient retrieval
Databases, File Systems, Search Engines, Social Networks
[Figure: memory hierarchy. Capacities: L1 32 KB, L2 256 KB, L3 3 MB, then 8 GB, 256 GB, ∞ TB. Access times: 100 ns, 16 µs (SSD), 3 ms (HDD), 150 ms.]
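[Figure: external-memory model, with memory of size M and blocks of size B ≈ 4 KiB]
[Figure: the same model at the cache level, with B = 64 B and M = LLC]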
Integers, reals
e.g. point and range queries
61 71 12 15 18 1 24 22 88 34 3 10 5 13 55 44 60 2 5 74 90 81
Sorted array of n keys (positions 1 to n):
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123
pred(36) = 36
pred(50) = 48
range(67, 110)
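The two queries on this slide can be sketched in Python with the standard bisect module. This is a minimal sketch, assuming that pred(x) returns the largest key ≤ x (consistent with pred(36) = 36) and that range(lo, hi) includes both endpoints. The function names pred and range_query are illustrative, not taken from the slides.

```python
from bisect import bisect_left, bisect_right

# The sorted array shown on the slide.
A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59, 60,
     71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def pred(keys, x):
    """Largest key <= x, or None if every key is greater than x."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None

def range_query(keys, lo, hi):
    """All keys k with lo <= k <= hi, in sorted order (assumed inclusive)."""
    return keys[bisect_left(keys, lo):bisect_right(keys, hi)]
```

Under these assumptions, pred(A, 36) is 36, pred(A, 50) is 48, and range_query(A, 67, 110) returns the keys from 71 to 102.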
Same sorted array of n keys, memory M, block size B = 4:

Solution        RAM model          EM model           EM model
                worst-case time    worst-case I/Os    best-case I/Os
Scan            O(n)               O(n/B)             O(1)
Binary search   O(log n)           O(log(n/B))        O(log(n/B))
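To make the I/O columns concrete, the sketch below counts the blocks touched by a scan and by a binary search on the slide's array. It assumes that a block, once read, stays in memory (so repeated accesses to the same block cost one I/O) and that the scan stops at the first key ≥ x. The function names are illustrative.

```python
A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59, 60,
     71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def scan_ios(keys, x, B):
    """Blocks read by a left-to-right scan that stops at the first key >= x."""
    for i, k in enumerate(keys):
        if k >= x:
            return i // B + 1
    return (len(keys) + B - 1) // B  # x is larger than every key: all blocks

def binary_search_ios(keys, x, B):
    """Distinct blocks touched by a binary search for x."""
    lo, hi = 0, len(keys)
    touched = set()
    while lo < hi:
        mid = (lo + hi) // 2
        touched.add(mid // B)  # block containing position mid
        if keys[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return len(touched)
```

With B = 4, scanning for 2 reads 1 block, scanning for a key larger than 123 reads all 7 blocks, and the binary search for 48 touches 3 blocks, because its last probes fall in the same block.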
[Figure: B+ tree with B = 3 (node fan-out B + 1) over the sorted array. Node keys shown: 31 76 ∞ at the root; 12 23 31, 55 71 76 and 122 ∞ ∞ below. Example search: 48?]

Solution        Space    RAM model          EM model           EM model
                         worst-case time    worst-case I/Os    best-case I/Os
Scan            O(1)     O(n)               O(n/B)             O(1)
Binary search   O(1)     O(log n)           O(log(n/B))        O(log(n/B))
B+ tree         O(n)     O(log n)           O(log_B n)         O(log_B n)
[Figure: a model maps a key to an approximate position in the sorted array; the key is then searched in the range [position − ε, position + ε]]
“All existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes.”
Trained on the dataset {(key_i, i)} for i = 1, …, n
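A minimal sketch of this idea, assuming the model is a single least-squares line (the slides only say "model") and using 0-based positions. The function names fit_line, max_error and lookup are illustrative. Here ε is measured after fitting as the largest prediction error over the keys, so the final search only needs to look at 2ε + 1 positions.

```python
import math
from bisect import bisect_left

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59, 60,
     71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]

def fit_line(keys):
    """Least-squares line through the points (key_i, i), i = 0..n-1."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_i = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_i) for i, k in enumerate(keys))
    slope = cov / var
    return slope, mean_i - slope * mean_k

def max_error(keys, slope, intercept):
    """Smallest integer eps such that every key is within eps of its prediction."""
    return max(math.ceil(abs(slope * k + intercept - i))
               for i, k in enumerate(keys))

def lookup(keys, model, eps, x):
    """Position of x in keys, or None; searches only [pos - eps, pos + eps]."""
    slope, intercept = model
    n = len(keys)
    pos = round(slope * x + intercept)
    lo = min(max(0, pos - eps), n)
    hi = max(lo, min(n, pos + eps + 1))
    i = bisect_left(keys, x, lo, hi)
    return i if i < n and keys[i] == x else None

model = fit_line(A)
eps = max_error(A, *model)
```

Every key of A is found at its true position by lookup(A, model, eps, key), while a key that is not in A, such as 50, yields None.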
[Figure: RMI with three stages. Stage 1: Model 1.1. Stage 2: Models 2.1, 2.2, 2.3. Stage 3: Models 3.1, 3.2, 3.3, 3.4. The last stage maps a key to a position pos in the sorted array, followed by the check: key ∈ [pos − ε, pos + ε]?]
[Figure: two-stage version. Stage 1: Model 1.1. Stage 2: Models 2.1, 2.2, 2.3. Input: key. Output: pos.]
Parameters to choose: number of stages, number of models in each stage, kinds of regression models
Difficult to predict latencies
Can result in underused models (waste of space)
Compute the optimal piecewise linear approximation with guaranteed error ε in O(n) time
Save the m segments in a vector as triples s_i = (key, slope, intercept)
Drop all the points except s_i.key
… and repeat!
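The construction steps above can be sketched as follows. This is not the optimal O(n) algorithm mentioned on the slide: for brevity it swaps in a simpler greedy "shrinking cone" segmentation, which also guarantees error ε but may produce more segments. Positions are 0-based, the intercept of a segment is taken to be the position of its first key, and keys are assumed distinct. The names segment, build and rank are illustrative. The query searches a window of ε + 1 around the prediction, because a queried key that is absent from a level may fall between two indexed keys.

```python
from bisect import bisect_right

def segment(keys, eps):
    """Greedy shrinking-cone segmentation (NOT the optimal algorithm).

    Returns triples (key, slope, intercept): for each key k covered by a
    segment, slope * (k - key) + intercept is within eps of k's position.
    """
    segments = []
    start = 0                       # position of the open segment's first key
    lo, hi = 0.0, float("inf")      # slopes still feasible for the open segment
    for i in range(1, len(keys)):
        dx, dy = keys[i] - keys[start], i - start
        new_lo, new_hi = max(lo, (dy - eps) / dx), min(hi, (dy + eps) / dx)
        if new_lo <= new_hi:
            lo, hi = new_lo, new_hi
        else:                       # empty cone: close the segment, open a new one
            segments.append((keys[start], (lo + hi) / 2, start))
            start, lo, hi = i, 0.0, float("inf")
    slope = 0.0 if hi == float("inf") else (lo + hi) / 2
    segments.append((keys[start], slope, start))
    return segments

def build(keys, eps):
    """Segment the keys, keep only the segments' keys, and repeat.

    Returns a list of (indexed_keys, segments) pairs, bottom level first.
    """
    levels, ks = [], list(keys)
    while True:
        segs = segment(ks, eps)
        levels.append((ks, segs))
        if len(segs) == 1:
            return levels
        ks = [s[0] for s in segs]

def rank(levels, eps, x):
    """Index of the rightmost key <= x in the data, or -1 if there is none."""
    if x < levels[0][0][0]:
        return -1
    s = 0                           # the top level has a single segment
    for ks, segs in reversed(levels):
        key, slope, intercept = segs[s]
        end = segs[s + 1][2] if s + 1 < len(segs) else len(ks)
        pos = round(slope * (x - key) + intercept)
        pos = min(max(pos, intercept), end - 1)   # stay inside this segment
        lo = max(intercept, pos - eps - 1)
        hi = min(end, pos + eps + 2)
        s = bisect_right(ks, x, lo, hi) - 1
    return s

A = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59, 60,
     71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123]
levels = build(A, 1)
```

Each level answers "which segment of the level below covers x?", so a query descends from the single top segment to the data while inspecting only a few positions per level.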
n keys, m segments, ε error

Data structure       Space of index   RAM model          EM model           EM model
                                      worst-case time    worst-case I/Os    best-case I/Os
Plain sorted array   O(1)             O(log n)           O(log(n/B))        O(log(n/B))
Multiway tree        Θ(n)             O(log n)           O(log_B n)         O(log_B n)
RMI                  Fixed            O(?)               O(?)               O(1)
PGM-index            Θ(m)             O(log m)           O(log_c m)         O(1)

where c ≥ 2ε = Ω(B)
[Figure: plots for the whole datasets and for their first 25M entries. Axes: error of the position estimate, number of segments. Datasets: Web logs (715M points), Longitude (166M points), IoT (26M points). Annotation: 3 seconds to compute.]
Given a space bound S, efficiently find the index that minimizes the query time within space S, and vice versa
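One simple way to approach the space-bounded direction is sketched below, under two loudly stated assumptions that are not from the slides: the space of the index is non-increasing in ε, and a smaller ε is preferable for query time. Under those assumptions, a binary search finds the smallest ε whose index fits in the budget. The cost function toy_space is a purely hypothetical stand-in for building the index and measuring it.

```python
import math

def smallest_eps_within(space_of, budget, eps_max):
    """Smallest eps in [1, eps_max] with space_of(eps) <= budget, else None.

    Assumes space_of is non-increasing in eps (an assumption of this sketch).
    """
    if space_of(eps_max) > budget:
        return None
    lo, hi = 1, eps_max             # invariant: space_of(hi) <= budget
    while lo < hi:
        mid = (lo + hi) // 2
        if space_of(mid) <= budget:
            hi = mid
        else:
            lo = mid + 1
    return lo

def toy_space(eps):
    """Hypothetical cost model for illustration only: space shrinks as 1000 / eps."""
    return math.ceil(1000 / eps)
```

With this toy model and a budget of 50, the search returns ε = 20 after about ten evaluations instead of trying every value. When query time is not monotone in ε, this heuristic is not guaranteed to find the fastest index.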
FAMILY
PGM-indexes ∀ε
CONSTRAINTS
Space & Time
OPTIMISATION
???
[Figure: time t(ε) and space as functions of ε. Note: depends on the input array.]
Fastest (most compact) index for 715M keys found in < 1 min
[Figure: space as a function of ε, with candidate values of ε and the optimum ε* marked]
Tools that you may find useful
3× faster than py_distance
117× faster than scipy.spatial.distance.euclidean
GIORGIO VINCIGUERRA
PhD student in Computer Science
http://pages.di.unipi.it/vinciguerra/
giorgio.vinciguerra@phd.unipi.it