Overlap Interval Partition Join os 1 ohlen 1 Johann Gamper 2 Anton - - PowerPoint PPT Presentation

overlap interval partition join
SMART_READER_LITE
LIVE PREVIEW

Overlap Interval Partition Join os 1 ohlen 1 Johann Gamper 2 Anton - - PowerPoint PPT Presentation

Overlap Interval Partition Join os 1 ohlen 1 Johann Gamper 2 Anton Dign Michael H. B 1 University of Z urich, Switzerland 2 Free University of Bozen-Bolzano, Italy SIGMOD 2014 June 22-27, 2014 - Snowbird, Utah, USA Introduction


slide-1
SLIDE 1

Overlap Interval Partition Join

Anton Dign¨

  • s1

Michael H. B¨

  • hlen1

Johann Gamper2

1University of Z¨

urich, Switzerland

2Free University of Bozen-Bolzano, Italy

SIGMOD 2014 June 22-27, 2014 - Snowbird, Utah, USA

slide-2
SLIDE 2

Introduction

◮ Temporal relations: tuples have a time interval. ◮ Overlap join: join tuples with overlapping time intervals. r r1 r2 r3 s s1 s3 s2

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t r ⊲ ⊳ s r1◦s2 r2◦s3 r2◦s2 r3◦s3

SIGMOD 2014 2/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-3
SLIDE 3

Introduction

◮ Temporal relations: tuples have a time interval. ◮ Overlap join: join tuples with overlapping time intervals. r r1 r2 r3 s s1 s3 s2

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t r ⊲ ⊳ s r1◦s2 r2◦s3 r2◦s2 r3◦s3 ◮ Goal: Efficient and robust overlap join

◮ Alternative for query optimizer when other predicates are absent,

have poor selectivity (long histories), or need to be evaluated after the join (on overlapping interval)

SIGMOD 2014 2/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-4
SLIDE 4

Outline

◮ OIP: an efficient partitioning for interval data ◮ OIPJoin: a partition join based on OIP ◮ Determine the optimal OIP parameter k for OIPJoin ◮ Empirical evaluation

SIGMOD 2014 3/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-5
SLIDE 5

Idea of Overlap Interval Partitioning OIP

◮ Given input data with intervals

t

SIGMOD 2014 4/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-6
SLIDE 6

Idea of Overlap Interval Partitioning OIP

◮ Given input data with intervals

t

◮ Partition intervals according to position and duration

t

◮ Constant clustering guarantee: Difference in duration of tuple and

partition is upper-bounded by a constant.

SIGMOD 2014 4/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-7
SLIDE 7

Overlap Interval Partitioning (OIP)

◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap

SIGMOD 2014 5/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-8
SLIDE 8

Overlap Interval Partitioning (OIP)

◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap

k = 3:

Q

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

granule 0 granule 1 granule 2

SIGMOD 2014 5/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-9
SLIDE 9

Overlap Interval Partitioning (OIP)

◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap

k = 3:

Q

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

granule 0 granule 1 granule 2

k = 4:

Q

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

granule 0 granule 1 granule 2 granule 3

SIGMOD 2014 5/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-10
SLIDE 10

Overlap Interval Partitioning (OIP)

◮ Divide time range into k granules of equal duration ◮ Partitions are sequences of contiguous granules ◮ Partitions can overlap

k = 3:

Q

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

granule 0 granule 1 granule 2

Low k ⇒ fewer partition accesses (less overlapping boxes) k = 4:

Q

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

granule 0 granule 1 granule 2 granule 3

High k ⇒ more precise partitions (better fitting boxes)

SIGMOD 2014 5/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-11
SLIDE 11

The OIPJoin

  • 1. Determine number of granules k

r r1 r2 r3 s s1 s2 s5 s4 s6 s7 s3

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

SIGMOD 2014 6/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-12
SLIDE 12

The OIPJoin

  • 1. Determine number of granules k
  • 2. Partition both input relations using OIP

r r1 r2 r3 s s1 s2 s5 s4 s6 s7 s3

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

SIGMOD 2014 6/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-13
SLIDE 13

The OIPJoin

  • 1. Determine number of granules k
  • 2. Partition both input relations using OIP
  • 3. Join tuples within overlapping partitions

r r1 r2 r3 s s1 s2 s5 s4 s6 s7 s3

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

SIGMOD 2014 6/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-14
SLIDE 14

The OIPJoin

  • 1. Determine number of granules k
  • 2. Partition both input relations using OIP
  • 3. Join tuples within overlapping partitions

r r1 r2 r3 s s1 s2 s5 s4 s6 s7 s3

2012-1 2012-2 2012-3 2012-4 2012-5 2012-6 2012-7 2012-8 2012-9 2012-10 2012-11 2012-12

t

Properties:

◮ Only 11 tuple comparisons

◮ 9 result tuples ◮ 2 false hits (r1 ◦ s6 and r2 ◦ s5)

◮ Only 5 inner partitions scanned (5 partition accesses)

SIGMOD 2014 6/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-15
SLIDE 15

Properties of OIP

◮ Constant clustering guarantee: The difference in duration

between a tuple and its partition is less than two granules.

◮ All tuples in a partition behave similarly ◮ Very few false hits

◮ Scans of partitions instead of random tuple access:

◮ High cache locality ◮ Much faster than index look-ups SIGMOD 2014 7/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-16
SLIDE 16

How to Determine k?

Intuition: Find optimal k s.t. the number of false hits of OIP justifies the number of partition accesses and vice versa.

SIGMOD 2014 8/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-17
SLIDE 17

Cost Dimensions

We consider CPU and IO costs Cost CPU IO False Hits Increase the number of CPU

  • perations (identifying and

discarding false hits). Increase the number of block transfers (more data is fetched). Partition Accesses Increase the number of CPU

  • perations (search in the ac-

cess structure). Increase the number of block transfers (more partially filled blocks)

SIGMOD 2014 9/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-18
SLIDE 18

Cost Dimensions

We consider CPU and IO costs Cost CPU IO False Hits Increase the number of CPU

  • perations (identifying and

discarding false hits). Increase the number of block transfers (more data is fetched). Partition Accesses Increase the number of CPU

  • perations (search in the ac-

cess structure). Increase the number of block transfers (more partially filled blocks) What does that mean for k?

◮ High k ⇒ few false hits, many partition accesses ◮ Low k ⇒ many false hits, few partition accesses

SIGMOD 2014 9/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-19
SLIDE 19

Determining k for the OIPJoin

  • 1. Quantify false hits on average: AFR ≤ 1

k

(Probability that a tuple is a false hit)

  • 2. Quantify partition accesses on average: APA = k2+k+1

3

(Number of partitions accessed by a query interval)

  • 3. Define the cost function for the overhead due to AFR and APA

using CPU and IO cost

  • 4. Minimize the cost function w.r.t. k

SIGMOD 2014 10/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-20
SLIDE 20

Overhead Cost for Partition Accesses

◮ For each of the |pr| outer partitions

◮ APA inner partition accesses (scans)

|pr| · APA · (c io + 2 · c cpu)

  • part. accesses

partially filled blocks (1 trailing block per partition) search in access structure (2 comparisons in access list)

◮ Average number of Partition Accesses APA = k2+k+1 3

SIGMOD 2014 11/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-21
SLIDE 21

Overhead Cost for False Hits

◮ For each of the |pr| outer partitions

◮ AFR · ns false hits (inner) fetched

◮ Each outer tuple

◮ Is compared with AFR · ns false hits (inner) ◮ Is AFR · ns times a false hits

|pr| · ns · AFR · c io b + 2 · ns · nr · AFR · 2 · c cpu) false hits

more data is fetched (1 false hit within a block) identifying and discarding (2 comparisons per false hit)

◮ Average False hit Ratio AFR ≤ 1 k

SIGMOD 2014 12/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-22
SLIDE 22

The Overhead Cost Function

cost(k) = |pr| · APA · (c io + 2 · c cpu) +

  • part. accesses

|pr| · ns · AFR · (c io b + 2 · nr |pr| · 2 · c cpu) false hits

partially filled blocks (1 trailing block per partition) search in access structure (2 comparisons in access list) more data is fetched (1 false hit within a block) identifying and discarding (2 comparisons per false hit)

SIGMOD 2014 13/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-23
SLIDE 23

Determining k for the OIPJoin

◮ By minimizing cost(k) we get:

k = f (nr, ns, c cpu, c io, b) Example:

◮ nr = 10M tuples ◮ nr = 100M tuples ◮ c cpu = 0.5 ◮ c io = 10 ◮ b = 15 tuples on average in storage block

k = f (10M, 100M, 0.5, 10, 15) = 16, 521

SIGMOD 2014 14/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-24
SLIDE 24

Related Work /1

◮ Overlap join based on space partitioning approaches, such as

quadtree1 and loose quadtree2

◮ Divide time range recursively into two sub-ranges ◮ Join cells of outer relation with all relevant of inner relation

◮ Properties

◮ Long-lived tuples reside high up in hierarchy (many FH) ◮ Cells grow with a factor of two (too much, many FH) ◮ Parent cells are required for children (many possibly empty partitions)

◮ OIPJoin does not deteriorate in performance with long-lived

tuples, partitions grow by a constant factor.

  • 1R. A. Finkel and J. L. Bentley. Quad trees: A data structure for retrieval on composite
  • keys. Acta Inf., 4:1-9, 1974.
  • 2T. Ulrich. Loose octrees. In Game Programming Gems, pages 444-453. Charles River

Media, 2000.

SIGMOD 2014 15/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-25
SLIDE 25

Related Work /2

◮ Overlap join based on indexing approaches, such as interval tree,

relational interval tree3, segment tree

◮ Associate intervals with index node(s) ◮ Join index nodes or tuples of outer relation with all relevant of inner

◮ Properties

◮ Long-Lived tuples reside high up in hierarchy (∼ many partitions) ◮ Requires many node joins (∼ many partitions) ◮ No physical clustering possible (2 indices) (∼ FH in storage)

◮ OIPJoin carefully balances the cost due to the access structure

and groups tuple into partitions (cache locality)

3H.-P. Kriegel, M. Ptke, and T. Seidl. Managing intervals efficiently in object-relational

  • databases. In VLDB, pages 407418, 2000.
  • J. Enderle, M. Hampel, and T. Seidl. Joining interval data in relational databases. In

SIGMOD, pages 683694, 2004.

SIGMOD 2014 16/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-26
SLIDE 26

Empirical Evaluation

  • 1. Cost function compared with runtime
  • 2. k adapts to CPU and IO cost
  • 3. Comparison with state-of-the-art approaches

◮ Clustering guarantee is highly relevant for long-lived tuples ◮ CPU cost is also relevant for disk resident data SIGMOD 2014 17/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-27
SLIDE 27

Cost function Compared with Runtime

◮ OIPJoin between 10M and 100M tuples ◮ Data in main memory

0.1 1 10 100 1000 5 10 15 20 25 30 Cost [sec x 1000] k [# x 1000] 1 10 100 1000 5 10 15 20 25 30 Runtime [sec x 1000] k [# x 1000]

Cost Function Runtime

◮ Minimum of the cost function matches minimum of the runtime.

SIGMOD 2014 18/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-28
SLIDE 28

k Adapts to CPU and IO Cost

4 8 12 16 20 0.001 0.01 0.1 1 10 100 k [# x 1000] CPU cost / IO cost 0.004 0.008 0.012 0.016 0.02 0.001 0.01 0.1 1 10 100 AFR [%] CPU cost / IO cost 40 80 120 0.001 0.01 0.1 1 10 100 APA [# x 1000] CPU cost / IO cost 500 1000 1500 2000 2500 0.001 0.01 0.1 1 10 100 IO [M] CPU cost / IO cost

◮ Cost for access structure and false hits depends on CPU and IO cost.

SIGMOD 2014 19/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-29
SLIDE 29

Varying Duration of Tuples

◮ Outer and inner relation 10M tuples ◮ Data in main memory

50 100 150 2 4 6 8 10 Runtime [sec x 1000]

  • Max. Tuple Duration [%]

sgt rit lqt smj

  • ip

1 2 3 4 2 4 6 8 10 AFR [%]

  • Max. Tuple Duration [%]

smj lqt

  • ip

◮ Clustering guarantee is important for long-lived tuples ◮ Partition scans more efficient than random memory access

SIGMOD 2014 20/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-30
SLIDE 30

Real World Datasets

◮ Personnel data ◮ File changes

5 10 15 20 25 30 35 25 50 75 100 Runtime [sec] # of Outer Tuples [%] sgt smj rit lqt

  • ip

1 2 3 4 5 6 25 50 75 100 Runtime [sec x 1000] # of Outer Tuples [%] sgt rit smj lqt

  • ip

◮ Real world data contain a mix of short and long tuples

SIGMOD 2014 21/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-31
SLIDE 31

Varying Number of Tuples on Disk

◮ Outer relation 1% of inner relation ◮ Tuple durations up to 0.1%

500 1000 1500 2000 2500 3000 3500 500 1000 1500 IO [M] # of Inner Tuples [M] sgt rit smj

  • ip

lqt 100 200 300 400 500 1000 1500 Runtime [sec x 1000] # of Inner Tuples [M] sgt rit lqt smj

  • ip

◮ Minimizing IOs is not enough ◮ Also on disk the CPU cost of access structure and false hits is

important.

SIGMOD 2014 22/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-32
SLIDE 32

Conclusion

Summary

◮ OIP offers a constant clustering guarantee ◮ OIPJoin is self-adjusting ◮ OIPJoin outperforms state-of-the-art approaches

Future Work

◮ Advanced statistics to calculate the number of empty partitions for

APA, e.g., using histograms.

◮ Study the maintenance of OIP. ◮ Refinement of cost function for different buffer replacement

strategies.

SIGMOD 2014 23/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper
slide-33
SLIDE 33

Conclusion

Summary

◮ OIP offers a constant clustering guarantee ◮ OIPJoin is self-adjusting ◮ OIPJoin outperforms state-of-the-art approaches

Future Work

◮ Advanced statistics to calculate the number of empty partitions for

APA, e.g., using histograms.

◮ Study the maintenance of OIP. ◮ Refinement of cost function for different buffer replacement

strategies. Thank you for your attention!

SIGMOD 2014 23/23

  • A. Dign¨
  • s, M. H. B¨
  • hlen, J. Gamper