overlap interval partition join
play

Overlap Interval Partition Join os 1 ohlen 1 Johann Gamper 2 Anton - PowerPoint PPT Presentation

Overlap Interval Partition Join os 1 ohlen 1 Johann Gamper 2 Anton Dign Michael H. B 1 University of Z urich, Switzerland 2 Free University of Bozen-Bolzano, Italy SIGMOD 2014 June 22-27, 2014 - Snowbird, Utah, USA Introduction


  1. Overlap Interval Partition Join os 1 ohlen 1 Johann Gamper 2 Anton Dign¨ Michael H. B¨ 1 University of Z¨ urich, Switzerland 2 Free University of Bozen-Bolzano, Italy SIGMOD 2014 June 22-27, 2014 - Snowbird, Utah, USA

  2. Introduction ◮ Temporal relations : tuples have a time interval. ◮ Overlap join : join tuples with overlapping time intervals. r 1 r 2 r 3 r s 1 s 3 s s 2 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t r 1 ◦ s 2 r 2 ◦ s 3 r 3 ◦ s 3 r 2 ◦ s 2 r ⊲ ⊳ s SIGMOD 2014 2/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  3. Introduction ◮ Temporal relations : tuples have a time interval. ◮ Overlap join : join tuples with overlapping time intervals. r 1 r 2 r 3 r s 1 s 3 s s 2 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t r 1 ◦ s 2 r 2 ◦ s 3 r 3 ◦ s 3 r 2 ◦ s 2 r ⊲ ⊳ s ◮ Goal : Efficient and robust overlap join ◮ Alternative for query optimizer when other predicates are absent, have poor selectivity (long histories), or need to be evaluated after the join (on overlapping interval) SIGMOD 2014 2/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  4. Outline ◮ OIP : an efficient partitioning for interval data ◮ OIPJoin : a partition join based on OIP ◮ Determine the optimal OIP parameter k for OIPJoin ◮ Empirical evaluation SIGMOD 2014 3/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  5. Idea of Overlap Interval Partitioning OIP ◮ Given input data with intervals t SIGMOD 2014 4/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  6. Idea of Overlap Interval Partitioning OIP ◮ Given input data with intervals t ◮ Partition intervals according to position and duration t ◮ Constant clustering guarantee: Difference in duration of tuple and partition is upper-bounded by a constant. SIGMOD 2014 4/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  7. Overlap Interval Partitioning ( OIP ) ◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap SIGMOD 2014 5/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  8. Overlap Interval Partitioning ( OIP ) ◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap k = 3: Q 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t granule 0 granule 1 granule 2 SIGMOD 2014 5/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  9. Overlap Interval Partitioning ( OIP ) ◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap k = 3: k = 4: Q Q 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t t granule 0 granule 1 granule 2 granule 0 granule 1 granule 2 granule 3 SIGMOD 2014 5/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  10. Overlap Interval Partitioning ( OIP ) ◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap k = 3: k = 4: Q Q 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t t granule 0 granule 1 granule 2 granule 0 granule 1 granule 2 granule 3 Low k ⇒ fewer partition accesses High k ⇒ more precise partitions (less overlapping boxes) (better fitting boxes) SIGMOD 2014 5/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  11. The OIPJoin 1. Determine number of granules k r 1 r r 2 r 3 s 3 s s 4 s 6 s 1 s 2 s 5 s 7 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t SIGMOD 2014 6/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  12. The OIPJoin 1. Determine number of granules k 2. Partition both input relations using OIP r 1 r r 2 r 3 s 3 s s 4 s 6 s 1 s 2 s 5 s 7 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t SIGMOD 2014 6/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  13. The OIPJoin 1. Determine number of granules k 2. Partition both input relations using OIP 3. Join tuples within overlapping partitions r 1 r r 2 r 3 s 3 s s 4 s 6 s 1 s 2 s 5 s 7 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t SIGMOD 2014 6/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  14. The OIPJoin 1. Determine number of granules k 2. Partition both input relations using OIP 3. Join tuples within overlapping partitions r 1 r r 2 r 3 s 3 s s 4 s 6 s 1 s 2 s 5 s 7 2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12 t Properties: ◮ Only 11 tuple comparisons ◮ 9 result tuples ◮ 2 false hits ( r 1 ◦ s 6 and r 2 ◦ s 5 ) ◮ Only 5 inner partitions scanned (5 partition accesses) SIGMOD 2014 6/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  15. Properties of OIP ◮ Constant clustering guarantee : The difference in duration between a tuple and its partition is less than two granules. ◮ All tuples in a partition behave similarly ◮ Very few false hits ◮ Scans of partitions instead of random tuple access : ◮ High cache locality ◮ Much faster than index look-ups SIGMOD 2014 7/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  16. How to Determine k ? Intuition: Find optimal k s.t. the number of false hits of OIP justifies the number of partition accesses and vice versa. SIGMOD 2014 8/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  17. Cost Dimensions We consider CPU and IO costs Cost CPU IO False Hits Increase the number of CPU Increase the number of operations (identifying and block transfers (more discarding false hits). data is fetched). Partition Increase the number of CPU Increase the number of Accesses operations (search in the ac- block transfers (more cess structure). partially filled blocks) SIGMOD 2014 9/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  18. Cost Dimensions We consider CPU and IO costs Cost CPU IO False Hits Increase the number of CPU Increase the number of operations (identifying and block transfers (more discarding false hits). data is fetched). Partition Increase the number of CPU Increase the number of Accesses operations (search in the ac- block transfers (more cess structure). partially filled blocks) What does that mean for k ? ◮ High k ⇒ few false hits, many partition accesses ◮ Low k ⇒ many false hits, few partition accesses SIGMOD 2014 9/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  19. Determining k for the OIPJoin 1. Quantify false hits on average: AFR ≤ 1 k (Probability that a tuple is a false hit) 2. Quantify partition accesses on average: APA = k 2 + k +1 3 (Number of partitions accessed by a query interval) 3. Define the cost function for the overhead due to AFR and APA using CPU and IO cost 4. Minimize the cost function w.r.t. k SIGMOD 2014 10/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  20. Overhead Cost for Partition Accesses ◮ For each of the | p r | outer partitions ◮ APA inner partition accesses (scans) partially filled blocks search in access structure (1 trailing block per partition) (2 comparisons in access list) | p r | · APA · ( c io + 2 · c cpu ) part. accesses ◮ Average number of Partition Accesses APA = k 2 + k +1 3 SIGMOD 2014 11/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  21. Overhead Cost for False Hits ◮ For each of the | p r | outer partitions ◮ AFR · n s false hits (inner) fetched ◮ Each outer tuple ◮ Is compared with AFR · n s false hits (inner) ◮ Is AFR · n s times a false hits | p r | · n s · AFR · c io + 2 · n s · n r · AFR · 2 · c cpu ) false hits b more data is fetched identifying and discarding (1 false hit within a block) (2 comparisons per false hit) ◮ Average False hit Ratio AFR ≤ 1 k SIGMOD 2014 12/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  22. The Overhead Cost Function partially filled blocks search in access structure (1 trailing block per partition) (2 comparisons in access list) cost ( k ) = | p r | · APA · ( c io + 2 · c cpu ) + part. accesses | p r | · n s · AFR · ( c io + 2 · n r | p r | · 2 · c cpu ) false hits b more data is fetched identifying and discarding (1 false hit within a block) (2 comparisons per false hit) SIGMOD 2014 13/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

  23. Determining k for the OIPJoin ◮ By minimizing cost ( k ) we get: k = f ( n r , n s , c cpu , c io , b ) Example: ◮ n r = 10M tuples ◮ n r = 100M tuples ◮ c cpu = 0 . 5 ◮ c io = 10 ◮ b = 15 tuples on average in storage block k = f (10M , 100M , 0 . 5 , 10 , 15) = 16 , 521 SIGMOD 2014 14/23 A. Dign¨ os, M. H. B¨ ohlen, J. Gamper

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend