The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds
Paolo Ferragina Giorgio Vinciguerra
pgm.di.unipi.it
The predecessor search problem
Given sorted input keys (e.g. integers), implement predecessor queries
predecessor(36) = 36
predecessor(50) = 48
2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
(positions 1 to n)
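As a baseline for the slides that follow, predecessor search over a sorted array can be sketched with a plain binary search. The function name `predecessor` and the use of Python's `bisect` module are choices made for this illustration only; the slides do not prescribe an implementation.

```python
from bisect import bisect_right

# Example keys from the slide (sorted, distinct).
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def predecessor(keys, x):
    """Return the largest key <= x, or None if every key is larger than x."""
    i = bisect_right(keys, x)  # number of keys that are <= x
    return keys[i - 1] if i > 0 else None

print(predecessor(KEYS, 36))  # 36, because 36 is itself a key
print(predecessor(KEYS, 50))  # 48, the largest key not exceeding 50
```

Every index discussed later answers the same query; what changes is how quickly the search range is narrowed before the final binary search.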
2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
key = 36, position = 11 (positions run from 1 to n)
B-tree
(values associated with keys are not shown)
(Plot of positions against keys)
2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
(positions 1 to n)
Ao et al. [VLDB 2011]
(Plot of positions against keys)
2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
(positions 1 to n)
First four keys 2 11 13 15 at positions 1 2 3 4
Ao et al. [VLDB 2011]
Black-box model: key → position (approximate)
Black-box trained on a dataset of pairs (key, pos): D = {(2, 1), (11, 2), …, (95, n)}
Binary search in [position - error, position + error]
(Plot of positions against keys)
2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
(positions 1 to n)
Ao et al. [VLDB 2011], Kraska et al. [SIGMOD 2018]
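The "predict, then binary search around the prediction" idea can be sketched as follows. This is a minimal illustration, not the method of the cited papers: the black box is replaced by a single least-squares line, positions are 0-based rather than 1-based, and the names `fit_line`, `make_index` and `learned_predecessor` are hypothetical. The window is widened by one slot on the left so that queries for keys absent from the array are also handled.

```python
from bisect import bisect_right

KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def fit_line(keys):
    """Least-squares line mapping key -> 0-based position (stand-in for the black box)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k

def make_index(keys):
    """Train the model and measure its maximum error over the indexed keys."""
    slope, intercept = fit_line(keys)

    def predict(x):
        # Round to an integer position and clamp it into the array.
        return min(max(round(slope * x + intercept), 0), len(keys) - 1)

    error = max(abs(predict(k) - p) for p, k in enumerate(keys))
    return predict, error

def learned_predecessor(keys, predict, error, x):
    """Largest key <= x, found by a binary search restricted to the error window."""
    if x < keys[0]:
        return None
    pos = predict(x)
    lo = max(pos - error - 1, 0)          # one extra slot for keys not in the array
    hi = min(pos + error + 1, len(keys))  # exclusive upper bound
    return keys[bisect_right(keys, x, lo, hi) - 1]

predict, error = make_index(KEYS)
print(error)                                         # model-dependent window half-width
print(learned_predecessor(KEYS, predict, error, 50)) # 48
```

Note that here the error is only known after training and depends on the data, which is the weakness the fixed ε of the PGM-index is meant to remove.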
Very slow to train
Vulnerable to adversarial inputs and queries
Must be tuned for each new dataset
Too much I/O when data is on disk
Unscalable to big data
Unpredictable latency
Blind to the query distribution
Predictable latency
Resistant to adversarial inputs and queries
Scalable to big data
Very fast to build
Constant I/O when data is on disk
No additional tuning needed
Query distribution aware
Fixed model “error” ε
Control the size of the search range (like the page size in a B-tree)
Fast to construct, best space usage for linear learned indexes
Recursive design
Adapt to the memory hierarchy and enable query-time guarantees
Step 1. Compute the optimal piecewise linear ε-approximation in Θ(n) time
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146
(positions 1 to n)
Step 2. Store the segments as triples
s_i = (key, slope, intercept)
Segments
(2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146
(positions 1 to n)
Each segment indexes a variable and potentially large sequence of keys while guaranteeing a search range size of 2ε + 1
Binary search in [pos - ε, pos + ε]
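The guarantee can be written out explicitly. As an assumption made for illustration, suppose a segment s = (key_s, slope_s, intercept_s) predicts the position of a key k by the line f_s below; the symbols f_s and pos(k) are notation introduced here, not taken from the slides.

```latex
f_s(k) = \mathit{slope}_s \cdot (k - \mathit{key}_s) + \mathit{intercept}_s ,
\qquad
\lvert f_s(k) - \mathit{pos}(k) \rvert \le \varepsilon
\quad \text{for every key } k \text{ covered by } s ,
\]
\[
\text{so}\quad \mathit{pos}(k) \in \bigl[\, f_s(k) - \varepsilon,\; f_s(k) + \varepsilon \,\bigr],
\quad \text{a window of at most } 2\varepsilon + 1 \text{ positions.}
```

The final binary search therefore costs O(log ε) comparisons, regardless of how many keys the segment covers.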
Step 3. Keep only s_i.key
2 23 31 48 71 88 122 145
Step 4. Repeat recursively
(2, sl, ic)
(2, sl, ic) (31, sl, ic) (88, sl, ic) (145, sl, ic)
(2, sl, ic) (23, sl, ic) (31, sl, ic) (48, sl, ic) (71, sl, ic) (88, sl, ic) (122, sl, ic) (145, sl, ic)
2 11 12 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 99 102 115 122 123 128 140 145 146
(positions 1 to n)
It can also be constructed in a single pass
Very fast construction, a couple
predecessor(57)?
At each of the three levels the search range has size 2ε + 1
B = disk page-size
Set ε = Θ(B) for queries in O(log_B n) I/Os and O(n/ε) space
The PGM-index is never worse in time and space than a B-tree
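Steps 1 to 4 and the top-down query can be sketched end to end as below. This is an illustration under stated assumptions, not the authors' implementation: the segmentation is a simple greedy "shrinking cone" pass rather than the optimal one, so it may produce more segments than the slides show; positions are 0-based; ε is assumed to be a positive integer; and the names `build_segments`, `build_pgm` and `pgm_predecessor` are hypothetical. The search window is widened by one slot on the left to cover queries for keys that are not in the array.

```python
from bisect import bisect_right

KEYS = [2, 11, 12, 15, 18, 23, 24, 29, 31, 34, 36, 44, 47, 48, 55, 59,
        60, 71, 73, 74, 76, 88, 95, 99, 102, 115, 122, 123, 128, 140, 145, 146]
INF = float("inf")

def build_segments(xs, eps):
    """Greedy 'shrinking cone' pass over sorted, distinct xs.

    Returns triples (key, slope, intercept) whose line
    intercept + slope * (x - key) stays within eps of the 0-based
    index of every x the segment covers."""
    segments = []
    x0, y0, lo, hi = xs[0], 0, 0.0, INF   # segment origin and feasible slope range
    for y in range(1, len(xs)):
        dx = xs[y] - x0
        new_lo = max(lo, (y - y0 - eps) / dx)
        new_hi = min(hi, (y - y0 + eps) / dx)
        if new_lo <= new_hi:              # point fits: shrink the cone
            lo, hi = new_lo, new_hi
        else:                             # point does not fit: close the segment
            segments.append((x0, (lo + hi) / 2, y0))
            x0, y0, lo, hi = xs[y], y, 0.0, INF
    segments.append((x0, lo if hi == INF else (lo + hi) / 2, y0))
    return segments

def build_pgm(keys, eps):
    """Segment the keys, keep the first key of each segment, and repeat."""
    arrays, levels = [list(keys)], []
    while True:
        levels.append(build_segments(arrays[-1], eps))
        if len(levels[-1]) == 1:          # a single root segment: done
            return arrays, levels
        arrays.append([seg[0] for seg in levels[-1]])

def pgm_predecessor(arrays, levels, eps, q):
    """Largest key <= q, or None, descending from the root segment to the keys."""
    if q < arrays[0][0]:
        return None
    i = 0  # index of the current segment; the top level has exactly one
    for arr, level in zip(reversed(arrays), reversed(levels)):
        key, slope, intercept = level[i]
        end = level[i + 1][2] if i + 1 < len(level) else len(arr)
        pos = round(intercept + slope * (q - key))
        pos = min(max(pos, intercept), end - 1)  # stay inside this segment
        lo = max(pos - eps - 1, 0)               # one extra slot for absent q
        hi = min(pos + eps + 1, len(arr))        # exclusive upper bound
        i = bisect_right(arr, q, lo, hi) - 1
    return arrays[0][i]

arrays, levels = build_pgm(KEYS, eps=1)
print(pgm_predecessor(arrays, levels, 1, 57))  # 55
```

Each level costs one prediction plus a binary search over a window whose size depends only on ε, which is what makes the query time predictable.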
Intel Xeon Gold 5118 CPU @ 2.30GHz, data held in main memory
Fastest CSS-tree: 128-byte pages, ≈350 MB
Matched by PGM with 2ε set to 256: ≈4 MB (83× less)
(Plot labels: page size, 2ε, avg search range)
B+-tree page size    Index size    Size relative to Dynamic PGM-index
128-byte             5.65 GB       3891×
256-byte             2.98 GB       2051×
512-byte             1.66 GB       1140×
1024-byte            0.89 GB       611×
Dynamic PGM-index: 1.45 MB
B-tree node k_1 k_2 … k_B (page size B): in one I/O and O(log_2 B) steps the search range is reduced by 1/B
PGM-index segment (k, sl, ic) with 2ε = B: here the search range is reduced by at least 1/B, and w.h.p. by 1/B^2
Ferragina et al. [ICML 2020]
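The per-level reductions can be turned into a rough count of levels, and hence of I/Os. The following is a back-of-the-envelope sketch under the slide's setting 2ε = B. It assumes, purely for illustration, that the high-probability reduction is 1/B^2 and that it holds at every level; the symbol h (number of levels) is introduced here.

```latex
\begin{aligned}
\text{B-tree:} \quad & n \cdot \left(\tfrac{1}{B}\right)^{h} \le 1
  \;\Longrightarrow\; h = \lceil \log_B n \rceil \\
\text{PGM-index (worst case, reduction} \ge 1/B\text{):} \quad
  & h \le \lceil \log_B n \rceil \\
\text{PGM-index (reduction } 1/B^2 \text{ at every level):} \quad
  & n \cdot \left(\tfrac{1}{B^2}\right)^{h} \le 1
  \;\Longrightarrow\; h = \lceil \log_{B^2} n \rceil = \left\lceil \tfrac{1}{2} \log_B n \right\rceil
\end{aligned}
```

Under these assumptions the PGM-index needs no more levels than a B-tree in the worst case, and about half as many when the stronger reduction applies.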
New tuned Linear RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]
PGM improved the empirical performance of a tuned Linear RMI
Each PGM took about 2 seconds to construct
RMI took 30× more!
They tested positive lookups. Here we test predecessor queries
New tuned Hybrid RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]
Each PGM took about 2 seconds to construct
Hybrid RMI took 40× (90× with tuning) more!
Avg search range 28 Max search range 28
Avg 215 Max 229
New tuned Hybrid RMI implementation and datasets from Marcus et al., 2020 [arXiv:2006.12804]
Adversarial query workload
About adversarial data inputs, see Kornaropoulos et al., 2020 [arXiv:2008.00297]
Index compression
Reduce the space of the index by a further 52% via the compression of slopes and intercepts
Query-distribution aware
Minimise average query time with respect to a given query workload
Multicriteria tuner
Minimise query time under a given space constraint, and vice versa, in a few dozen seconds