Why are learned indexes so effective?



  1. Why are learned indexes so effective? Paolo Ferragina (1), Fabrizio Lillo (2), Giorgio Vinciguerra (1). (1) University of Pisa, (2) University of Bologna

  2. A classical problem in computer science
     • Given a set of $n$ sorted input keys (e.g. integers)
     • Implement membership and predecessor queries
     • Range queries in databases, conjunctive queries in search engines, IP lookup in routers…
     Example, with the keys stored at positions 1 to $n$: 2 11 13 15 18 23 24 29 31 34 36 44 47 48 55 59 60 71 73 74 76 88 95
     member(36) = True, predecessor(50) = 48
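The two queries on this slide can be sketched with plain binary search over the sorted array. This is a minimal illustration, not code from the presentation: the function names `member` and `predecessor` mirror the slide's labels, and `predecessor(x)` is assumed to mean "the largest key not greater than x".

```python
from bisect import bisect_left, bisect_right

# The sorted keys shown on the slide.
KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def member(keys, x):
    """Return True if x occurs in the sorted list keys."""
    i = bisect_left(keys, x)
    return i < len(keys) and keys[i] == x

def predecessor(keys, x):
    """Return the largest key <= x (assumed definition), or None if there is none."""
    i = bisect_right(keys, x)
    return keys[i - 1] if i > 0 else None

print(member(KEYS, 36), predecessor(KEYS, 50))
```

Both functions take O(log n) time; an index such as a B-tree or a learned model aims to reduce the cost of this search on large arrays.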

  3. Indexes. key → B-tree → position, over the sorted keys at positions 1 to $n$

  4. Input data as pairs (key, position). [Figure: the keys plotted against their positions 1 to $n$; x-axis: keys, y-axis: positions]

  5. Input data as pairs (key, position). [Figure: the same plot with the first points highlighted: keys 2, 11, 13, 15 at positions 1, 2, 3, 4; x-axis: keys, y-axis: positions]

  6. Learned indexes
     • key → black-box model → (approximate) position
     • The black-box is trained on a dataset of pairs (key, pos): $\mathcal{D} = \{(2,1), (11,2), \dots, (95,n)\}$
     • Binary search in $[\text{position} - \varepsilon,\ \text{position} + \varepsilon]$, e.g. $\varepsilon$ is of the order of 100–1000
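The lookup scheme on this slide (predict an approximate position, then binary search in a window around it) can be sketched as follows. The "black-box" is assumed here to be a single least-squares line, which is only one possible model; `fit_line`, `max_error` and `lookup` are hypothetical helper names, positions are 0-based, and the maximum error `eps` is measured on the data rather than fixed in advance.

```python
from bisect import bisect_left

KEYS = [2, 11, 13, 15, 18, 23, 24, 29, 31, 34, 36, 44,
        47, 48, 55, 59, 60, 71, 73, 74, 76, 88, 95]

def fit_line(keys):
    """Least-squares line mapping a key to its (0-based) position."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var
    return slope, mean_p - slope * mean_k

def max_error(keys, slope, intercept):
    """Largest distance between predicted and true position over the stored keys."""
    return max(abs(round(slope * k + intercept) - p) for p, k in enumerate(keys))

def lookup(keys, slope, intercept, eps, x):
    """Return the position of x, or None, searching only the window around the prediction."""
    n = len(keys)
    guess = round(slope * x + intercept)
    lo = min(max(0, guess - eps), n)
    hi = max(lo, min(n, guess + eps + 1))
    i = bisect_left(keys, x, lo, hi)
    return i if i < n and keys[i] == x else None

slope, intercept = fit_line(KEYS)
eps = max_error(KEYS, slope, intercept)
print(eps, lookup(KEYS, slope, intercept, eps, 36))
```

Because every stored key lies within `eps` positions of its prediction, the binary search touches at most `2 * eps + 1` entries regardless of the array size.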

  7. The knowledge gap in learned indexes (practice vs theory)
     • Query time. Practice: same query time as traditional tree-based indexes. Theory: same asymptotic query time as traditional tree-based indexes.
     • Space. Practice: space improvements of orders of magnitude, from GBs to few MBs. Theory: same asymptotic space occupancy as traditional tree-based indexes.

  8. PGM-index: An optimal learned index [Ferragina and Vinciguerra, PVLDB 2020], https://pgm.di.unipi.it
     1. Fix a max error $\varepsilon$, e.g. so that the keys in $[\text{pos} - \varepsilon,\ \text{pos} + \varepsilon]$ fit a cache-line
     2. Find the smallest Piecewise Linear $\varepsilon$-Approximation (PLA)
     3. Store triples (first key, slope, intercept) for each segment
     [Figure: a PLA over the keys 1 3 8 11 12 19 22 23 24 28 29 33 38 47 48 53 55 56 57, with the search range $[\text{pos} - \varepsilon,\ \text{pos} + \varepsilon]$ marked; x-axis: keys, y-axis: positions]
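To make the three steps concrete, here is a sketch that builds a piecewise linear approximation with maximum error `eps` and stores one triple per segment. It uses a simple greedy "shrinking cone" construction, which is a swapped-in technique: it guarantees the error bound but not the smallest number of segments that the PGM-index's optimal algorithm finds. The names `build_segments` and `predict`, the 0-based positions, and the convention that the intercept is the position of the segment's first key are all assumptions of this sketch.

```python
import math
from bisect import bisect_right

# Keys shown in the slide's figure.
KEYS = [1, 3, 8, 11, 12, 19, 22, 23, 24, 28, 29, 33, 38, 47, 48, 53, 55, 56, 57]

def _close(first_key, first_pos, lo, hi):
    # A one-point segment has an unbounded cone; any slope works, so use 0.
    slope = 0.0 if math.isinf(lo) else (lo + hi) / 2
    return (first_key, slope, first_pos)

def build_segments(keys, eps):
    """Greedy PLA over strictly increasing keys: list of (first_key, slope, intercept)."""
    segments = []
    first_key, first_pos = keys[0], 0
    lo, hi = -math.inf, math.inf          # feasible slopes for the current segment
    for pos in range(1, len(keys)):
        dx, dy = keys[pos] - first_key, pos - first_pos
        new_lo = max(lo, (dy - eps) / dx)
        new_hi = min(hi, (dy + eps) / dx)
        if new_lo <= new_hi:              # the point still fits: shrink the cone
            lo, hi = new_lo, new_hi
        else:                             # no slope fits: start a new segment here
            segments.append(_close(first_key, first_pos, lo, hi))
            first_key, first_pos = keys[pos], pos
            lo, hi = -math.inf, math.inf
    segments.append(_close(first_key, first_pos, lo, hi))
    return segments

def predict(segments, x):
    """Approximate position of x using the rightmost segment whose first key is <= x."""
    first_keys = [s[0] for s in segments]
    j = max(0, bisect_right(first_keys, x) - 1)
    first_key, slope, intercept = segments[j]
    return slope * (x - first_key) + intercept

segments = build_segments(KEYS, eps=1)
print(len(segments), "segments for", len(KEYS), "keys")
```

The space of the index is proportional to `len(segments)`, which is why the number of segments is the quantity studied in the rest of the talk.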

  9. What is the space of learned indexes?
     • Space occupancy ∝ number of segments
     • The number of segments depends on
       • The size of the input dataset
       • How the points (key, pos) map to the plane
       • The value $\varepsilon$, i.e. how precise the approximation is
     [Figure: three plots of positions against keys, showing PLAs with error $\varepsilon_1$ and with $\varepsilon_2 \ll \varepsilon_1$]

  10. Model and assumptions
     • Consider gaps $g_i = k_{i+1} - k_i$ between consecutive input keys
     • Model the gaps as positive iid rvs that follow a distribution with finite mean $\mu$ and variance $\sigma^2$
     [Figure: five keys on the x-axis at positions 1 to 5 on the y-axis, with the gaps between consecutive keys marked]

  11. The main result. Theorem. If $\varepsilon$ is sufficiently larger than $\sigma/\mu$, the expected number of keys covered by a segment with maximum error $\varepsilon$ is $K = \frac{\mu^2}{\sigma^2}\,\varepsilon^2$, and the number of segments on a dataset of size $n$ is $n/K$ with high probability.
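As a worked instance of the theorem, take gaps whose standard deviation equals their mean (as is the case for exponentially distributed gaps). The values $\varepsilon = 100$ and $n = 10^9$ below are hypothetical, chosen only to illustrate the arithmetic:

```latex
\[
  \sigma = \mu
  \;\Longrightarrow\;
  K = \frac{\mu^2}{\sigma^2}\,\varepsilon^2 = \varepsilon^2,
  \qquad
  \varepsilon = 100 \;\Longrightarrow\; K = 10^4,
  \qquad
  n = 10^9 \;\Longrightarrow\; \frac{n}{K} = 10^5 \text{ segments.}
\]
```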

  12. The main consequence. The PGM-index achieves the same asymptotic query performance as a traditional $\varepsilon$-way tree-based index while improving its space from $\Theta(n/\varepsilon)$ to $O(n/\varepsilon^2)$. Learned indexes are provably better than traditional indexes (note that $\varepsilon$ is of the order of 100–1000).

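  13. Sketch of the proof
     1. Consider a segment on the stream of random gaps and the two parallel lines at distance $\varepsilon$
     2. How many steps before a new segment is needed?
     [Figure: a segment with the two parallel lines at distance $\varepsilon$; the first point falling outside them is where a new segment starts; x-axis: keys, y-axis: positions]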

  14. Sketch of the proof (2)
     3. A discrete-time random walk, iid increments with mean $\mu$
     4. Compute the expectation of $i^* = \min\{\, i \in \mathbb{N} \mid (k_i, i) \text{ is outside the red strip} \,\}$, i.e. the Mean Exit Time (MET) of the random walk
     5. Show that the slope $m = 1/\mu$ maximises $E[i^*]$, giving $E[i^*] = (\mu^2/\sigma^2)\,\varepsilon^2$
     [Figure: random walker location against time, with the strip at distance $\varepsilon$ around the line of slope $m$; the walk leaves the strip at $(k_{i^*}, i^*)$, where a new segment starts. The same picture is shown in the keys/positions plane.]
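The mean exit time can be estimated with a small Monte Carlo simulation. The sketch below assumes exponentially distributed gaps with mean `mu` (so the standard deviation equals the mean and the formula above predicts roughly `eps ** 2` steps) and fixes the slope to `1 / mu`; the function names, the number of trials and the seed are arbitrary choices of this illustration.

```python
import random

def exit_time(eps, mu, rng):
    """Steps until the point (k_i, i) leaves the strip of half-width eps
    around the line of slope 1/mu through the starting key."""
    key_offset = 0.0                      # k_i - k_0, the sum of the gaps so far
    i = 0
    while True:
        i += 1
        key_offset += rng.expovariate(1.0 / mu)   # exponential gap with mean mu
        if abs(key_offset / mu - i) > eps:        # vertical distance from the line
            return i

def mean_exit_time(eps, mu=5.0, trials=2000, seed=0):
    """Average exit time over independent random walks."""
    rng = random.Random(seed)
    return sum(exit_time(eps, mu, rng) for _ in range(trials)) / trials

for eps in (10, 20):
    print(eps, mean_exit_time(eps), "formula:", eps ** 2)
```

For small `eps` the estimate can deviate noticeably from the formula, since the formula is stated for `eps` sufficiently larger than the ratio between standard deviation and mean.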

  15. Simulations
     1. Generate $10^7$ random streams of gaps according to several probability distributions
     2. Compute and average
        I. The length of a segment found by the algorithm that computes the smallest PLA, adopted in the PGM-index
        II. The exit time of the random walk

  16. Simulations of $(\mu^2/\sigma^2)\,\varepsilon^2$
     • OPT = average segment length in a PGM-index
     • MET = mean exit time of the random walk
     [Figure: two panels plotting OPT, MET and the value given by Thm 1 against $\varepsilon$ from 0 to 250; y-axis: mean segment length ($\cdot 10^6$). Panel Pareto $k = 3$, $\alpha = 3$: Thm 1 gives $3.0\,\varepsilon^2$. Panel Lognormal $\mu = 1$, $\sigma = 0.5$: Thm 1 gives $3.521\,\varepsilon^2$.]
     Both OPT and MET agree on the slope $1/\mu$, but OPT is more robust. More distributions in the paper.

  17. Stress test of "$\varepsilon$ sufficiently larger than $\sigma/\mu$"
     [Figure: three panels plotting the relative error against $\varepsilon$ from 0 to 250, one per value of $\sigma/\mu$]
     • $\sigma/\mu = 0.15$: Pareto $k = 10$, $\alpha = 7.741$; Gamma $\theta = 5$, $k = 44.444$; Lognormal $\mu = 2$, $\sigma = 0.149$; Thm 1 gives $44.444\,\varepsilon^2$
     • $\sigma/\mu = 1.5$: Pareto $k = 10$, $\alpha = 2.202$; Gamma $\theta = 5$, $k = 0.444$; Lognormal $\mu = 2$, $\sigma = 1.086$; Thm 1 gives $0.444\,\varepsilon^2$
     • $\sigma/\mu = 15$: Pareto $k = 10$, $\alpha = 2.002$; Gamma $\theta = 5$, $k = 0.004$; Lognormal $\mu = 2$, $\sigma = 2.328$; Thm 1 gives $0.004\,\varepsilon^2$
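The parameters on this slide can be related to the ratio between standard deviation and mean through the standard closed forms for each distribution. The sketch below assumes the usual parameterisations (Pareto with scale k and shape alpha, Gamma with scale theta and shape k, Lognormal with log-scale parameters mu and sigma); the function names are hypothetical. Since the slide reports rounded parameters, the results match the panel titles only approximately.

```python
import math

def cv_pareto(alpha):
    """sigma/mu of a Pareto distribution with shape alpha > 2 (independent of the scale)."""
    if alpha <= 2:
        raise ValueError("the variance is finite only for alpha > 2")
    return 1.0 / math.sqrt(alpha * (alpha - 2.0))

def cv_gamma(shape):
    """sigma/mu of a Gamma distribution with the given shape (independent of the scale)."""
    return 1.0 / math.sqrt(shape)

def cv_lognormal(sigma):
    """sigma/mu of a Lognormal distribution with log-standard-deviation sigma."""
    return math.sqrt(math.exp(sigma ** 2) - 1.0)

def theorem_coefficient(cv):
    """The factor mu^2 / sigma^2 that multiplies eps^2 in the theorem."""
    return 1.0 / cv ** 2

for name, cv in [("Pareto alpha=7.741", cv_pareto(7.741)),
                 ("Gamma k=44.444", cv_gamma(44.444)),
                 ("Lognormal sigma=0.149", cv_lognormal(0.149))]:
    print(name, round(cv, 3), round(theorem_coefficient(cv), 3))
```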

  18. Conclusions
     • No theoretical grounds for the efficiency of learned indexes were known
     • We have shown that on data with iid gaps, the mean segment length is $\Theta(\varepsilon^2)$
     • The PGM-index takes $O(n/\varepsilon^2)$ space w.h.p., a quadratic improvement in $\varepsilon$ over traditional indexes ($\varepsilon$ is usually of the order of 100–1000)
     • Open problems:
       1. Do the results still hold without the iid assumption on the gaps?
       2. Is the segment found by the optimal algorithm adopted in the PGM-index a constant factor longer than the one found by the random walker?
