Why are learned indexes so effective?



SLIDE 1

Why are learned indexes so effective?

Paolo Ferragina (1), Fabrizio Lillo (2), Giorgio Vinciguerra (1)

(1) University of Pisa, (2) University of Bologna

SLIDE 2

A classical problem in computer science

  • Given a set of $n$ sorted input keys (e.g. integers)
  • Implement membership and predecessor queries
  • Applications: range queries in databases, conjunctive queries in search engines, IP lookup in routers…

Example: member(36) = True, predecessor(50) = 48

Keys, at positions 1 to $n$: 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
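A minimal Python sketch of the two queries on the sorted array above, using binary search from the standard `bisect` module. The function names `member` and `predecessor` mirror the slide's notation. Treating predecessor(x) as the largest key less than or equal to x is an assumption; the slide's example is consistent with either the strict or the non-strict definition.

```python
from bisect import bisect_left, bisect_right

# The sorted keys shown on the slide.
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def member(keys, x):
    """Return True if x occurs in the sorted list keys."""
    i = bisect_left(keys, x)
    return i < len(keys) and keys[i] == x

def predecessor(keys, x):
    """Return the largest key <= x (assumed definition), or None if there is none."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None

print(member(KEYS, 36))       # True
print(predecessor(KEYS, 50))  # 48
```

Both queries cost O(log n) comparisons; the indexes discussed next aim to narrow the range that has to be searched.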

SLIDE 3

Indexes

[Figure: a B-tree built over the sorted keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 (positions 1 to $n$); given a key, it returns its position.]

SLIDE 4

Input data as pairs (key, position)

[Figure: the keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 plotted against their positions 1 to $n$; axes: keys, positions.]

SLIDE 5

Input data as pairs (key, position)

[Figure: the keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 plotted against their positions 1 to $n$, with a zoom on the first four pairs (2, 1), (11, 2), (13, 3), (15, 4); axes: keys, positions.]

SLIDE 6

Learned indexes

  • A black box, trained on a dataset of pairs (key, pos) $\mathcal{D} = \{(2, 1), (11, 2), \dots, (95, n)\}$, maps a key to an (approximate) position
  • The key is then located by a binary search in $[\mathit{position} - \varepsilon, \mathit{position} + \varepsilon]$
  • E.g. $\varepsilon$ is of the order of 100–1000

[Figure: the keys 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95 plotted against their positions 1 to $n$; axes: keys, positions.]
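The lookup scheme on this slide can be sketched as follows. This is an illustration under stated assumptions, not the design of any specific learned index: a single least-squares line stands in for the black box, `eps` is measured from the data after fitting rather than fixed in advance, and positions are 0-based. The names `fit_line`, `build` and `lookup` are hypothetical.

```python
from bisect import bisect_left

KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def fit_line(keys):
    """Least-squares line pos ~ slope * key + intercept (0-based positions)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k

def predict(model, n, x):
    """Predicted position of x, clamped to the valid range [0, n - 1]."""
    slope, intercept, _ = model
    return min(max(round(slope * x + intercept), 0), n - 1)

def build(keys):
    """Fit the model and record eps, the largest prediction error on the keys."""
    slope, intercept = fit_line(keys)
    model = (slope, intercept, 0)
    eps = max(abs(predict(model, len(keys), k) - p) for p, k in enumerate(keys))
    return slope, intercept, eps

def lookup(keys, model, x):
    """Binary search restricted to [pred - eps, pred + eps]; None if x is absent."""
    eps = model[2]
    pred = predict(model, len(keys), x)
    lo = max(0, pred - eps)
    hi = min(len(keys), pred + eps + 1)
    i = bisect_left(keys, x, lo, hi)
    return i if i < len(keys) and keys[i] == x else None

model = build(KEYS)
print("eps =", model[2])
print(lookup(KEYS, model, 36), lookup(KEYS, model, 50))
```

The final binary search touches at most 2·eps + 1 positions, which is why the value of $\varepsilon$ controls the query cost.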

SLIDE 7

The knowledge gap in learned indexes

Practice (👍)
  • Same query time as traditional tree-based indexes
  • Space improvements of orders of magnitude, from GBs to a few MBs

vs

Theory (👎)
  • Same asymptotic query time as traditional tree-based indexes
  • Same asymptotic space occupancy as traditional tree-based indexes

SLIDE 8

PGM-index: An optimal learned index

  1. Fix a max error $\varepsilon$, e.g. so that the keys in $[\mathit{pos} - \varepsilon, \mathit{pos} + \varepsilon]$ fit a cache line
  2. Find the smallest Piecewise Linear $\varepsilon$-Approximation (PLA)
  3. Store the triple (first key, slope, intercept) for each segment

[Figure: the keys 1 3 8 11 12 19 22 23 24 28 29 33 38 47 48 53 55 56 57 plotted against their positions and covered by segments, with the search range $[\mathit{pos} - \varepsilon, \mathit{pos} + \varepsilon]$ highlighted; axes: keys, positions.]

https://pgm.di.unipi.it

[Ferragina and Vinciguerra, PVLDB 2020]
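The three steps above can be sketched in Python as follows. Important assumption: the segmentation below is a simple greedy "shrinking cone" heuristic that anchors each segment at its first key. It is NOT the optimal algorithm that finds the smallest PLA in the PGM-index, so it may produce more segments. The function names, the 0-based positions, and the prediction formula `slope * (x - first_key) + intercept` are choices made for this sketch.

```python
from bisect import bisect_left, bisect_right

# Keys shown in the slide's figure.
KEYS = [1, 3, 8, 11, 12, 19, 22, 23, 24, 28, 29, 33, 38, 47, 48, 53, 55, 56, 57]

def build_segments(keys, eps):
    """Greedy PLA over strictly increasing keys (not the optimal PLA).
    Returns triples (first_key, slope, intercept)."""
    segments, start, n = [], 0, len(keys)
    while start < n:
        k0 = keys[start]
        lo, hi = 0.0, float("inf")        # slopes that keep every error <= eps
        end = start + 1
        while end < n:
            dx, dy = keys[end] - k0, end - start
            new_lo = max(lo, (dy - eps) / dx)
            new_hi = min(hi, (dy + eps) / dx)
            if new_lo > new_hi:           # no slope fits this point: close the segment
                break
            lo, hi = new_lo, new_hi
            end += 1
        slope = lo if hi == float("inf") else (lo + hi) / 2
        segments.append((k0, slope, start))   # intercept = position of the first key
        start = end
    return segments

def lookup(keys, segments, eps, x):
    """Return the 0-based position of x, or None if x is absent."""
    first_keys = [s[0] for s in segments]     # would be precomputed in practice
    j = bisect_right(first_keys, x) - 1       # segment whose first key is <= x
    if j < 0:
        return None
    k0, slope, intercept = segments[j]
    pred = int(slope * (x - k0) + intercept)
    # One extra slot on each side absorbs truncation and floating-point rounding.
    lo = max(0, pred - eps - 1)
    hi = min(len(keys), pred + eps + 2)
    i = bisect_left(keys, x, lo, hi)
    return i if i < len(keys) and keys[i] == x else None

segments = build_segments(KEYS, eps=1)
print(len(segments), "segments")
print(lookup(KEYS, segments, 1, 24), lookup(KEYS, segments, 1, 20))
```

Only the triples are stored, so the space of the index is proportional to the number of segments rather than to the number of keys.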

SLIDE 9

What is the space of learned indexes?

  • Space occupancy ∝ number of segments
  • The number of segments depends on
      • The size of the input dataset
      • How the points (key, pos) map to the plane
      • The value of $\varepsilon$, i.e. how precise the approximation is

[Figure: three plots of positions against keys, with labels $\varepsilon_1$ and $\varepsilon_2 \ll \varepsilon_1$.]

SLIDE 10

Model and assumptions

  • Consider the gaps $g_i = k_{i+1} - k_i$ between consecutive input keys
  • Model the gaps as positive iid random variables that follow a distribution with finite mean $\mu$ and variance $\sigma^2$

[Figure: the keys $k_1, \dots, k_5$ at positions 1 to 5, with the gaps $g_1, \dots, g_4$ between consecutive keys; axes: keys, positions.]
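A small Python sketch of how the quantities in this model could be estimated from data. The helper `gap_stats` is a hypothetical name, and the input is the example key array from the earlier slides, used only for illustration; nothing here claims that those keys have iid gaps.

```python
from statistics import mean, pvariance

KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def gap_stats(keys):
    """Return (gaps, mu, sigma2) for a strictly increasing list of keys."""
    gaps = [b - a for a, b in zip(keys, keys[1:])]   # g_i = k_{i+1} - k_i
    mu = mean(gaps)                                  # empirical mean of the gaps
    sigma2 = pvariance(gaps, mu)                     # empirical (population) variance
    return gaps, mu, sigma2

gaps, mu, sigma2 = gap_stats(KEYS)
print(gaps)
print("mu =", mu, "sigma^2 =", sigma2, "sigma/mu =", sigma2 ** 0.5 / mu)
```

The ratio sigma/mu is the quantity that the error bound has to exceed for the result on segment length to apply.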

SLIDE 11

The main result

  • Theorem. If $\varepsilon$ is sufficiently larger than $\sigma/\mu$, the expected number of keys covered by a segment with maximum error $\varepsilon$ is

$$K = \frac{\mu^2}{\sigma^2}\,\varepsilon^2$$

and the number of segments on a dataset of size $n$ is $n/K$ with high probability.
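A short Python sketch that plugs numbers into the theorem. The Pareto parameters k = 3, α = 3 are the ones that appear in the simulations later in the deck; reading k as the scale and α as the shape is an assumption, and so are the values n = 10^9 and ε = 100. The function names are hypothetical.

```python
def pareto_moments(k, alpha):
    """Mean and variance of a Pareto distribution (scale k, shape alpha > 2)."""
    mean = alpha * k / (alpha - 1)
    var = k * k * alpha / ((alpha - 1) ** 2 * (alpha - 2))
    return mean, var

def expected_segment_length(mu, sigma2, eps):
    """K = (mu^2 / sigma^2) * eps^2, as stated by the theorem."""
    return (mu * mu / sigma2) * eps * eps

def expected_segments(n, mu, sigma2, eps):
    """n / K, the number of segments predicted for n keys."""
    return n / expected_segment_length(mu, sigma2, eps)

mu, sigma2 = pareto_moments(3, 3)                  # mean 4.5, variance 6.75
print(expected_segment_length(mu, sigma2, 100))    # 30000.0
print(expected_segments(10**9, mu, sigma2, 100))
```

For this distribution μ²/σ² = 3, so K = 3ε², and doubling ε divides the predicted number of segments by four.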

SLIDE 12

The main consequence

The PGM-index achieves the same asymptotic query performance as a traditional $\varepsilon$-way tree-based index while improving its space from $\Theta(n/\varepsilon)$ to $O(n/\varepsilon^2)$

Learned indexes are provably better than traditional indexes

(note that $\varepsilon$ is of the order of 100–1000)
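As a rough worked comparison of the two bounds, ignoring the constants hidden by the asymptotic notation and the size of each stored entry (both simplifications made here, not in the slides), the ratio between the space of the traditional index and the number of segments $n/K$ is:

```latex
\[
  \frac{n/\varepsilon}{n/K}
  \;=\; \frac{K}{\varepsilon}
  \;=\; \frac{\mu^2}{\sigma^2}\,\varepsilon ,
  \qquad\text{where } K = \frac{\mu^2}{\sigma^2}\,\varepsilon^2 .
\]
```

For instance, with $\sigma/\mu = 1$ and $\varepsilon = 100$ (illustrative values), the ratio is about 100.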

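SLIDE 13

Sketch of the proof

  1. Consider a segment on the stream of random gaps and the two parallel lines at distance $\varepsilon$
  2. How many steps before a new segment is needed?

[Figure: keys against positions, with a segment and the two parallel lines at distance $\varepsilon$ on either side of it; the first point outside the strip is labelled "Start a new segment from here".]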

SLIDE 14

Sketch of the proof (2)

  3. A discrete-time random walk, iid increments with mean $\mu$
  4. Compute the expectation of $i^* = \min\{\, i \in \mathbb{N} \mid (k_i, i) \text{ is outside the red strip} \,\}$, i.e. the Mean Exit Time (MET) of the random walk
  5. Show that the slope $m = 1/\mu$ maximises $E[i^*]$, giving $E[i^*] = (\mu^2/\sigma^2)\,\varepsilon^2$
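One heuristic way to see where the $\varepsilon^2$ in step 5 comes from, for the slope $m = 1/\mu$. This is a sketch that ignores the overshoot at the exit step and is not the authors' proof; the centred walk $W_i$ is notation introduced here, and the last line uses Wald's second identity.

```latex
\begin{align*}
  W_i &= \sum_{j=1}^{i} (g_j - \mu),
      \qquad E[g_j - \mu] = 0, \quad \operatorname{Var}[g_j - \mu] = \sigma^2, \\
  i^* &= \min\{\, i \in \mathbb{N} : |W_i| > \varepsilon/m \,\}
       = \min\{\, i \in \mathbb{N} : |W_i| > \varepsilon\mu \,\}, \\
  \sigma^2\, E[i^*] &= E\!\left[W_{i^*}^2\right] \approx (\varepsilon\mu)^2
      \quad\Longrightarrow\quad
      E[i^*] \approx \frac{\mu^2}{\sigma^2}\,\varepsilon^2 .
\end{align*}
```

A vertical distance of $\varepsilon$ from a line of slope $m$ corresponds to a horizontal distance of $\varepsilon/m$ in the key direction, which is why the strip for the walk has half-width $\varepsilon\mu$.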

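[Figure: two panels. First panel: keys against positions, with the strip of half-width $\varepsilon$ around the segment and the exit point $(k_{i^*}, i^*)$ labelled "Start a new segment from here". Second panel: random walker location against time, with strip boundaries labelled $\varepsilon/m$ and the exit time $i^*$ labelled "Start a new segment from here".]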

SLIDE 15

Simulations

  1. Generate $10^7$ random streams of gaps according to several probability distributions
  2. Compute and average
      I. The length of a segment found by the algorithm that computes the smallest PLA, adopted in the PGM-index
      II. The exit time of the random walk
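A scaled-down Python sketch of part II of this experiment, the exit time of the random walk. Everything specific here is an assumption made for illustration: exponential gaps with mean 1 (a distribution chosen for simplicity), 2000 streams instead of $10^7$, ε = 20, and the hypothetical function names. Part I (the optimal PLA algorithm) is not reproduced.

```python
import random

def exit_time(eps, mu, draw_gap, rng):
    """Steps until the point (k_i, i) leaves the strip of vertical half-width
    eps around the line of slope 1/mu through the origin."""
    key, i = 0.0, 0
    while abs(i - key / mu) <= eps:
        key += draw_gap(rng)   # next key = previous key + random gap
        i += 1                 # next position
    return i

def mean_exit_time(eps, mu, draw_gap, trials=2000, seed=0):
    """Average exit time over independent streams of gaps."""
    rng = random.Random(seed)
    return sum(exit_time(eps, mu, draw_gap, rng) for _ in range(trials)) / trials

def exp_gap(rng):
    """Exponential gap with mean 1 and variance 1 (assumed distribution)."""
    return rng.expovariate(1.0)

eps = 20
met = mean_exit_time(eps, mu=1.0, draw_gap=exp_gap)
predicted = (1.0 ** 2 / 1.0) * eps ** 2    # (mu^2 / sigma^2) * eps^2 = 400
print("simulated MET:", met, "predicted:", predicted)
```

The simulated value is expected to sit somewhat above the prediction for small ε, because the walk overshoots the strip boundary at the exit step.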

SLIDE 16

Simulations of $(\mu^2/\sigma^2)\,\varepsilon^2$

[Figure: mean segment length ($\cdot 10^6$) as a function of $\varepsilon$ (ticks from 50 to 250), two panels:
  • Pareto $k = 3$, $\alpha = 3$: curves OPT, MET and Thm 1 ($3.0\,\varepsilon^2$)
  • Lognormal $\mu = 1$, $\sigma = 0.5$: curves OPT, MET and Thm 1 ($3.521\,\varepsilon^2$)]

More distributions in the paper

OPT = Average segment length in a PGM-index
MET = Mean exit time of the random walk

Both OPT and MET agree on the slope $1/\mu$, but OPT is more robust

SLIDE 17

Stress test of “$\varepsilon$ sufficiently larger than $\sigma/\mu$”

[Figure: relative error as a function of $\varepsilon$ (ticks from 50 to 250), three panels:
  • $\sigma/\mu = 0.15$: Pareto $k = 10$, $\alpha = 7.741$; Gamma $\theta = 5$, $k = 44.444$; Lognormal $\mu = 2$, $\sigma = 0.149$; $(\mu^2/\sigma^2)\,\varepsilon^2 = 44.444\,\varepsilon^2$
  • $\sigma/\mu = 1.5$: Pareto $k = 10$, $\alpha = 2.202$; Gamma $\theta = 5$, $k = 0.444$; Lognormal $\mu = 2$, $\sigma = 1.086$; $(\mu^2/\sigma^2)\,\varepsilon^2 = 0.444\,\varepsilon^2$
  • $\sigma/\mu = 15$: Pareto $k = 10$, $\alpha = 2.002$; Gamma $\theta = 5$, $k = 0.004$; Lognormal $\mu = 2$, $\sigma = 2.328$; $(\mu^2/\sigma^2)\,\varepsilon^2 = 0.004\,\varepsilon^2$]

SLIDE 18

Conclusions

  • No theoretical grounds for the efficiency of learned indexes were known
  • We have shown that on data with iid gaps, the mean segment length is $\Theta(\varepsilon^2)$
  • The PGM-index takes $O(n/\varepsilon^2)$ space w.h.p., a quadratic improvement in $\varepsilon$ over traditional indexes ($\varepsilon$ is usually of the order of 100–1000)
  • Open problems:
      1. Do the results still hold without the iid assumption on the gaps?
      2. Is the segment found by the optimal algorithm adopted in the PGM-index a constant factor longer than the one found by the random walker?