Period Index: A Learned 2D Hash Index for Range and Duration Queries - - PowerPoint PPT Presentation

period index a learned 2d hash index for range and
SMART_READER_LITE
LIVE PREVIEW

Period Index: A Learned 2D Hash Index for Range and Duration Queries - - PowerPoint PPT Presentation

Period Index: A Learned 2D Hash Index for Range and Duration Queries Andreas Behrend 1 os 2 Johann Gamper 2 Anton Dign Philip Schmiegelt 3 Hannes Voigt 4 Matthias Rottmann 5 Karsten Kahl 5 1 University of Bonn, Germany 2 Free University of


slide-1
SLIDE 1

Period Index: A Learned 2D Hash Index for Range and Duration Queries

Andreas Behrend1 Anton Dign¨

  • s2

Johann Gamper2 Philip Schmiegelt3 Hannes Voigt4 Matthias Rottmann5 Karsten Kahl5

1University of Bonn, Germany 2Free University of Bozen-Bolzano, Italy 3Fraunhofer FKIE Bonn, Germany 4TU Dresden, Germany 5University of Wuppertal, Germany

SSTD’ 19, Vienna, Austria

slide-2
SLIDE 2

Background and Motivation Contributions Period Index Experiments Conclusion and Future Work

SSTD ’19 2/30

  • A. Dign¨
  • s
slide-3
SLIDE 3

Background and Motivation /1

Temporal period data is produced in many application domains:

◮ Personnel data (work contract periods, project assignment periods) ◮ Financial data (insurance policy periods, rental contract periods) ◮ Medical data (hospital stay periods)

. . . or derived via state analysis:

◮ Medical data (treatment periods, high fever periods) ◮ Air traffic data (maneuvering periods, landing periods)

SSTD ’19 3/30

  • A. Dign¨
  • s
slide-4
SLIDE 4

Background and Motivation /2

◮ Example: Fever periods of patients

Patient Period Ann [01-Aug-2019, 03-Aug-2019) Sam [02-Aug-2019, 05-Aug-2019) Tom [06-Aug-2019, 13-Aug-2019) Joe [10-Aug-2019, 12-Aug-2019)

◮ Graphical illustration

Joe Tom Sam Ann

1-Aug 2-Aug 3-Aug 4-Aug 5-Aug 6-Aug 7-Aug 8-Aug 9-Aug 10-Aug 11-Aug 12-Aug

t

SSTD ’19 4/30

  • A. Dign¨
  • s
slide-5
SLIDE 5

Background and Motivation /3

◮ Heavy research efforts since decades

◮ Temporal data storage ◮ Temporal keys ◮ Temporal indexing ◮ Time travel queries ◮ Temporal aggregation queries ◮ Temporal join queries

◮ Some works found their way into commercial DBMSs

SSTD ’19 5/30

  • A. Dign¨
  • s
slide-6
SLIDE 6

Background and Motivation /4

◮ Temporal features in the SQL:2011 standard [7]

◮ Period specification, application time and system time ◮ Predicates like overlap, before, during ◮ Implementation in IBM D2, Oracle, Teradata, SQLServer [4]

◮ Range types in PostgreSQL (2012) [9]

◮ Range datatype, predicates, and functions ◮ Advanced temporal query processing prototype [5] SSTD ’19 6/30

  • A. Dign¨
  • s
slide-7
SLIDE 7

Background and Motivation /5

◮ An important primitive is time travel or range queries

◮ Find all active insurance policies as of yesterday ◮ Find all air planes maneuvering yesterday between 08:00 and 10:00

◮ An often neglected attribute is the duration of periods

◮ Find all active insurance policies longer than a year ◮ Find all air planes maneuvering for 5 to 6 hours

Position and duration are key properties for querying data with intervals!

SSTD ’19 7/30

  • A. Dign¨
  • s
slide-8
SLIDE 8

Three Types of Queries

Range Query

Find all patients that had fever last week

Duration Query

Find all patients that had fever for more than two days

Range-Duration Query

Find all patients that had fever last week for more than two days

SSTD ’19 8/30

  • A. Dign¨
  • s
slide-9
SLIDE 9

Challenges

Range Predicate

◮ Corresponds to the overlap of two intervals ◮ Inequality among both endpoints (difficult to index and optimize) ◮ overlap(I, [start, end)) = Ie > start AND Is < end

Duration Predicate

◮ Rarely interested in a precise duration, i.e., 2h 5min 32 sec ◮ Needs to be expressible as a range, i.e., duration between 2h and 4h ◮ duration(I) > min AND duration(I) < max

We want to use both at the same time!

SSTD ’19 9/30

  • A. Dign¨
  • s
slide-10
SLIDE 10

Background and Motivation Contributions Period Index Experiments Conclusion and Future Work

SSTD ’19 10/30

  • A. Dign¨
  • s
slide-11
SLIDE 11

Contributions

◮ Period Index: novel index for intervals according to duration and

position on the timeline

◮ Grid-based data structure with constant time lookup – well suited

for parallelization

◮ Space-efficient implementation using arrays and linked lists to avoid

storage of empty cells

◮ Adaptive bucket length for different data distribution (learn)

SSTD ’19 11/30

  • A. Dign¨
  • s
slide-12
SLIDE 12

Related Work

Range Queries

◮ Interval tree [3, 6] ◮ Segment tree [2]

Duration Queries

◮ Any sorted index, most notably B-trees

Range-Duration Queries

◮ Multidimensional indices, most notably quad-trees, R-trees [1] and

Grid file [8]

SSTD ’19 12/30

  • A. Dign¨
  • s
slide-13
SLIDE 13

Background and Motivation Contributions Period Index Experiments Conclusion and Future Work

SSTD ’19 13/30

  • A. Dign¨
  • s
slide-14
SLIDE 14

Intuition of the Period Index

Idea

◮ Index position of intervals using buckets and cells ◮ Index duration of intervals using levels

Observations / Assumptions

◮ Position of intervals may be arbitrarily skewed ◮ Duration of intervals follows ZIPF distribution

(many short / few long)

Desiderata

◮ Access to relevant cells should be fast ◮ Cells should be equally filled

SSTD ’19 14/30

  • A. Dign¨
  • s
slide-15
SLIDE 15

Data Structure

◮ Buckets of fixed length l to index position (horizontal) ◮ Levels with smaller cells to index duration (vertical)

ti ti + l bucket length l indexation of es ti +l Level 0 |e| ∈

  • l

2 , ∞

  • es ∈ [ti, ti + l)

+ l

2

Level 1 |e| ∈

  • l

4 , l 2

  • es ∈
  • ti, ti + l

2

  • es ∈
  • ti + l

2 , ti + l

  • + l

4

+ 3l

4

Level 2 |e| ∈

  • l

8 , l 4

  • + l

8

+ 3l

8

+ 5l

8

+ 7l

8

. . . indexation of |e|

◮ Intervals are stored in all overlapping cells on the corresponding level ◮ Relevant cells are calculated arithmetically

SSTD ’19 15/30

  • A. Dign¨
  • s
slide-16
SLIDE 16

Construction / Interval Assignment

Given an interval e = [es, ee) with duration |e|.

  • 1. Find corresponding level:

x = min

  • max
  • ⌊log2( l

|e|)⌋, 0

  • , nl
  • 2. Place e into cells ci s.t.:

es − o l · 2x

  • ≤ i ≤

ee − o l · 2x

  • Relevant cells for interval e are calculated arithmetically

f (e, d) → {ci, cj, . . .}

SSTD ’19 16/30

  • A. Dign¨
  • s
slide-17
SLIDE 17

Example / Interval Assignment

◮ Buckets length l = 24h and number of levels nl = 3 ◮ e5 = [02:45, 08:30) with |e5| = 5h 45m

4-Aug-2016 00:00 5-Aug-2016 00:00 l = 24h +24h Level 0 |e| ∈ (12h, ∞] +12h Level 1 |e| ∈ (6h, 12h] +6h +18h Level 2 |e| ∈ (3h, 6h] +3h +9h +15h +21h Level 3 |e| ∈ (0h, 3h] e1 e2 e3 e4 e5 e6

◮ e5 → level 2, cells {c1 and c2}

SSTD ’19 17/30

  • A. Dign¨
  • s
slide-18
SLIDE 18

Query Evaluation

Given a Range-Duration query Q with range e = [es, ee) and duration restriction d = [dmin, dmax).

  • 1. For each level x s.t.:

min

  • max
  • ⌊log2(

l dmin )⌋, 0

  • , nl
  • ≤ x ≤ min
  • max
  • ⌊log2(

l dmax )⌋, 0

  • , nl
  • 2. Scan all cells ci s.t.:

es − o l · 2x

  • ≤ i ≤

ee − o l · 2x

  • Relevant cells for query Q are calculated arithmetically

f (e, d) → {. . . , ci, . . .}

SSTD ’19 18/30

  • A. Dign¨
  • s
slide-19
SLIDE 19

Example / Query Evaluation

Q(e, d): e = [08:00, 10:00) and d = [5h, 10h)

4-Aug-2016 00:00 5-Aug-2016 00:00 l = 24h +24h Level 0 |e| ∈ (12h, ∞] +12h Level 1 |e| ∈ (6h, 12h] +6h +18h Level 2 |e| ∈ (3h, 6h] +3h +9h +15h +21h Level 3 |e| ∈ (0h, 3h] e1 e2 e3 e4 e5 e6

◮ Q → {(level 1, cells c1), (level 2, cells c2)}

SSTD ’19 19/30

  • A. Dign¨
  • s
slide-20
SLIDE 20

Analysis

◮ The number of cells proportionally determines the number of

collisions in the index; Large dataset → many cells

◮ The maximum duration of intervals limits the maximum bucket

length l

◮ The minimum duration of intervals limits the maximum number of

levels nl

◮ Given l and the number of levels nl we can control the total number

  • f cells.

#cells = (2 · 2nl − 1) · |D|

l

adjustable in the range of 1 to 2 · |D|1

1|D| is the domain size SSTD ’19 20/30

  • A. Dign¨
  • s
slide-21
SLIDE 21

Adaptive Bucket Length

◮ What if start times are not uniformly distributed? ◮ Use Histogram of starting points to model the distribution ◮ Replace regular time division with weight from cumulative histogram

time 20 40 60 80 100 number of elements ×10 4 1 2 3 time 20 40 60 80 100 number of elements ×10 5 2 4 6 8 10

◮ Relevant Cells are calculated arithmetically

f (e, d) + H → {ci, cj, . . .}

SSTD ’19 21/30

  • A. Dign¨
  • s
slide-22
SLIDE 22

Background and Motivation Contributions Period Index Experiments Conclusion and Future Work

SSTD ’19 22/30

  • A. Dign¨
  • s
slide-23
SLIDE 23

Experiments - Setup

◮ Adaptive bucket lengths ◮ Competitors

◮ Grid file ◮ Interval tree (Range Queries only) ◮ B+-tree (Duration Queries only)

◮ Runtime in Query/sec

◮ Range Queries only ◮ Duration Queries only ◮ Range-Duration Queries ◮ Different duration distributions SSTD ’19 23/30

  • A. Dign¨
  • s
slide-24
SLIDE 24

Experiments - Adaptive bucket lengths

◮ Impact of start time point distribution on cell fill factor ◮ Adaptive bucket length is able to counteract (learn) data skew and

distribute load among cells

SSTD ’19 24/30

  • A. Dign¨
  • s
slide-25
SLIDE 25

Experiments - Range Query

Interval Tree B+-tree Grid Period Index 5 10 15 20 100 101 102

Number of tuples [M] Queries/sec

Uniform distribution

5 10 15 20 100 101 102

Number of tuples [M] Queries/sec

Zipf distribution

◮ Period index applies efficient calculation of relevant cells

(compared to trees)

◮ Period index has smaller cells for small intervals (compared to Grid)

SSTD ’19 25/30

  • A. Dign¨
  • s
slide-26
SLIDE 26

Experiments - Duration Query

Interval Tree B+-tree Grid Period Index 5 10 15 20 100 101 102 103

Number of tuples [M] Queries/sec

Uniform distribution

5 10 15 20 100 102 104

Number of tuples [M] Queries/sec

Zipf distribution

◮ Period index applies efficient calculation of relevant cells

(compared to trees)

◮ Regular grid may in some cases have less relevant cell accesses

SSTD ’19 26/30

  • A. Dign¨
  • s
slide-27
SLIDE 27

Experiments - Range-Duration Query

Interval Tree B+-tree Grid Period Index 5 10 15 20 100 101 102 103

Number of tuples [M] Queries/sec

Uniform distribution

5 10 15 20 100 102 104

Number of tuples [M] Queries/sec

Zipf distribution

◮ Period index applies both range and duration restriction

(compared to trees)

◮ Period index has smaller cells for small intervals (compared to Grid)

SSTD ’19 27/30

  • A. Dign¨
  • s
slide-28
SLIDE 28

References

Norbert Beckmann et al. “The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles”. In: SIGMOD Conference. ACM Press, 1990, pp. 322–331. Jon Louis Bentley. “Solutions to Klee’s rectangle problems”. In: Technical report (1977). Mark de Berg et al. “Interval Trees”. In: Computational Geometry. Springer-Verlag, 2000. Chap. 10.1, pp. 212–217. Michael H. B¨

  • hlen et al. “Temporal Data Management - An Overview”. In: eBISS. Vol. 324. Lecture Notes in

Business Information Processing. Springer, 2018, pp. 51–83. Anton Dign¨

  • s et al. “Extending the Kernel of a Relational DBMS with Comprehensive Support for Sequenced

Temporal Queries”. In: ACM Trans. Database Syst. 41.4 (2016), 26:1–26:46. Hans-Peter Kriegel, Marco P¨

  • tke, and Thomas Seidl. “Managing Intervals Efficiently in Object-Relational

Databases”. In: VLDB. Morgan Kaufmann, 2000, pp. 407–418. Krishna G. Kulkarni and Jan-Eike Michels. “Temporal features in SQL: 2011”. In: SIGMOD Record 41.3 (2012),

  • pp. 34–43.

J¨ urg Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik. “The Grid File: An Adaptable, Symmetric Multikey File Structure”. In: ACM Trans. Database Syst. 9.1 (1984), pp. 38–71.

  • PostgreSQL. Documentation manual PostgreSQL - range types.

https://www.postgresql.org/docs/9.2/rangetypes.html. 2012. SSTD ’19 28/30

  • A. Dign¨
  • s
slide-29
SLIDE 29

Background and Motivation Contributions Period Index Experiments Conclusion and Future Work

SSTD ’19 29/30

  • A. Dign¨
  • s
slide-30
SLIDE 30

Conclusion and Future Work

◮ Period index: efficient access method for period-duration queries ◮ Constant time access to relevant cells ◮ Adaptive to different start time distributions ◮ Well suited for parallelization

Future Work

◮ Extend to more fine grained duration distribution among cells ◮ Extensive experiments on parallelization and NUMA ◮ Adaptability to dynamic scenarios similar to extendible hashing

Thank you for your attention!

SSTD ’19 30/30

  • A. Dign¨
  • s