Multi-dimensional Indexing


Advanced Data Structures, NTUA 2007
R-trees and Grid File

Multi-dimensional Indexing

GIS applications (maps):
  • Urban planning, route optimization, fire or pollution monitoring, utility networks, etc.
  • ESRI (ArcInfo), Oracle Spatial, etc.

Other applications:
  • VLSI design, CAD/CAM, model of human brain, etc.

Traditional applications:
  • Multidimensional records

Spatial data types

  • Point: 2 real numbers
  • Line: sequence of points
  • Region: area enclosed by n points

Spatial Relationships

  • Topological relationships: adjacent, inside, disjoint, etc.
  • Direction relationships: above, below, north_of, etc.
  • Metric relationships: “distance < 100”
  • And operations to express the relationships

Spatial Queries

  • Selection queries: “Find all objects inside query q”; "inside" can be replaced by intersects, north, etc.
  • Nearest-neighbor queries: “Find the closest object to a query point q”; also the k closest objects
  • Spatial join queries: given two spatial relations S1 and S2, find all pairs {(x, y) : x in S1, y in S2, and x rel y = true}, where rel = intersect, inside, etc.

Access Methods

  • Point Access Methods (PAMs): index methods for 2- or 3-dimensional points (k-d trees, Z-ordering, grid file)
  • Spatial Access Methods (SAMs): index methods for 2- or 3-dimensional regions and points (R-trees)

Indexing using SAMs

Approximate each region with a simple shape: usually the Minimum Bounding Rectangle (MBR) = [(x1, x2), (y1, y2)]

[Figure: a region enclosed by its MBR, with sides at x1, x2, y1, y2]

Indexing using SAMs (cont.)

Two steps:
  • Filtering step: find all the MBRs (using the SAM) that satisfy the query
  • Refinement step: for each qualified MBR, check the original object against the query
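
The two steps can be illustrated with a small sketch for a point query over polygons. It is only an illustration under stated assumptions: the function names are hypothetical, a linear scan over the MBRs stands in for the SAM, and the refinement uses a standard ray-casting point-in-polygon test.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x1, x2, y1, y2), as in [(x1, x2), (y1, y2)]

def mbr(polygon: List[Point]) -> Rect:
    """Smallest axis-aligned rectangle that encloses all vertices."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), max(xs), min(ys), max(ys))

def mbr_contains(r: Rect, p: Point) -> bool:
    return r[0] <= p[0] <= r[1] and r[2] <= p[1] <= r[3]

def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test on the exact geometry (used by the refinement step)."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        (xa, ya), (xb, yb) = polygon[i], polygon[(i + 1) % n]
        if (ya > y) != (yb > y):  # edge crosses the horizontal line through p
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside

def point_query(p: Point, polygons: List[List[Point]]):
    # Filtering step: cheap test against MBRs only (a real SAM would avoid the scan).
    candidates = [i for i, poly in enumerate(polygons) if mbr_contains(mbr(poly), p)]
    # Refinement step: exact test on the original objects of the candidates.
    hits = [i for i in candidates if point_in_polygon(p, polygons[i])]
    return candidates, hits
```

For a triangle with vertices (0, 0), (4, 0), (0, 4), the point (3, 3) passes the filter (it lies in the MBR) but is rejected by the refinement, which is exactly the kind of false hit the second step exists to remove.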

Spatial Indexing

Point Access Methods (PAMs) vs Spatial Access Methods (SAMs)

PAM: index only point data
  • Hierarchical (tree-based) structures
  • Multidimensional hashing
  • Space-filling curves

SAM: index both points and regions
  • Transformations
  • Overlapping regions
  • Clipping methods

Spatial Indexing

Point Access Methods

The problem

Given a point set and a rectangular query Q, find the points enclosed in the query.
We allow insertions/deletions online.

Grid File

  • Hashing methods for multidimensional points (extension of extendible hashing)
  • Idea: use a grid to partition the space; each cell is associated with one page
  • Two-disk-access principle (exact match)

The Grid File: An Adaptable, Symmetric Multikey File Structure
  • J. Nievergelt, H. Hinterberger (Institut für Informatik, ETH) and K. C. Sevcik (University of Toronto). ACM TODS 1984.
Grid File

  • Start with one bucket for the whole space.
  • Select dividers along each dimension. Partition space into cells.
  • Dividers cut all the way.

Grid File

  • Each cell corresponds to 1 disk page.
  • Many cells can point to the same page.
  • Cell directory potentially exponential in the number of dimensions

Grid File Implementation

  • Dynamic structure using a grid directory
  • Grid array: a 2-dimensional array with pointers to buckets (this array can be large, disk resident): G(0, …, nx-1, 0, …, ny-1)
  • Linear scales: two 1-dimensional arrays that are used to access the grid array (main memory): X(0, …, nx-1), Y(0, …, ny-1)

Example

[Figure: grid directory, buckets/disk blocks, linear scale X, linear scale Y]

Grid File Search

  • Exact Match Search: at most 2 I/Os, assuming the linear scales fit in memory.
      First use the linear scales to determine the index into the cell directory
      Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory)
      Access the appropriate bucket (1 I/O)
  • Range Queries:
      Use the linear scales to determine the index into the cell directory.
      Access the cell directory to retrieve the bucket addresses of buckets to visit.
      Access the buckets.
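
A minimal in-memory sketch of both searches follows. The class name and its fields are hypothetical, the linear scales are modeled as sorted lists of interior dividers, and the directory and buckets are plain Python containers rather than disk pages; comments mark where the I/Os would occur.

```python
from bisect import bisect_right

class GridFile:
    def __init__(self, x_scale, y_scale, directory, buckets):
        self.x_scale = x_scale      # sorted dividers along x (kept in main memory)
        self.y_scale = y_scale      # sorted dividers along y (kept in main memory)
        self.directory = directory  # directory[i][j] -> bucket id (may be on disk)
        self.buckets = buckets      # bucket id -> list of (x, y) points (on disk)

    def cell(self, x, y):
        """Use the linear scales to turn coordinates into directory indexes."""
        return bisect_right(self.x_scale, x), bisect_right(self.y_scale, y)

    def exact_match(self, x, y):
        i, j = self.cell(x, y)
        bucket_id = self.directory[i][j]          # directory access (at most 1 I/O)
        return (x, y) in self.buckets[bucket_id]  # bucket access (1 I/O)

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        i_lo, j_lo = self.cell(x_lo, y_lo)
        i_hi, j_hi = self.cell(x_hi, y_hi)
        # Several cells may share a bucket, so collect distinct bucket ids first.
        bucket_ids = {self.directory[i][j]
                      for i in range(i_lo, i_hi + 1)
                      for j in range(j_lo, j_hi + 1)}
        return sorted(p for b in bucket_ids for p in self.buckets[b]
                      if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)

# Hypothetical example: one divider per dimension; the two left cells share bucket 0.
gf = GridFile(x_scale=[5], y_scale=[5],
              directory=[[0, 0], [1, 2]],
              buckets={0: [(1, 1), (2, 7)], 1: [(6, 2)], 2: [(8, 9)]})
```

Using a set of bucket ids in the range query reflects the point that many cells can point to the same page: each page is then fetched once.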

Grid File Insertions

  • Determine the bucket into which the insertion must occur.
  • If there is space in the bucket, insert.
  • Else, split the bucket
      How to choose a good dimension to split? Answer: create convex regions for buckets.
  • If the bucket split causes a cell directory split, do so and adjust the linear scales.
  • Insertion of these new entries potentially requires a complete reorganization of the cell directory: expensive!

Grid File Deletions

  • Deletions may decrease the space utilization: merge buckets
  • We need to decide which cells to merge and a merging threshold
  • Buddy system and neighbor system
      Buddy system: a bucket can merge with only one buddy in each dimension
      Neighbor system: merge adjacent regions if the result is a rectangle

Z-ordering

  • Basic assumption: finite precision in the representation of each coordinate, K bits ($2^K$ values)
  • The address space is a square (image), represented as a $2^K \times 2^K$ array
  • Each element is called a pixel

Z-ordering

Impose a linear ordering on the pixels of the image: this gives a 1-dimensional problem

[Figure: 4 x 4 pixel grid with both axes labeled 00, 01, 10, 11, showing pixels A and B]

ZA = shuffle(xA, yA) = shuffle(“01”, “11”) = 0111 = (7)10
ZB = shuffle(“01”, “01”) = 0011 = (3)10
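
The shuffle operation is plain bit interleaving, x bit first. A short sketch (the helper name `z_value` is an assumption, not part of the slides) reproduces the two values above:

```python
def shuffle(x_bits: str, y_bits: str) -> str:
    """Interleave two equal-length bit strings, taking the x bit first."""
    assert len(x_bits) == len(y_bits)
    return "".join(xb + yb for xb, yb in zip(x_bits, y_bits))

def z_value(x: int, y: int, k: int) -> int:
    """z-value of pixel (x, y) when each coordinate has k bits."""
    return int(shuffle(format(x, f"0{k}b"), format(y, f"0{k}b")), 2)

print(shuffle("01", "11"), z_value(1, 3, 2))  # pixel A
print(shuffle("01", "01"), z_value(1, 1, 2))  # pixel B
```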

Z-ordering

  • Given a point (x, y) and the precision K, find the pixel for the point and then compute the z-value
  • Given a set of points, use a B+-tree to index the z-values
  • A range (rectangular) query in 2-d is mapped to a set of ranges in 1-d

Queries

Find the z-values that are contained in the query, and then the ranges

[Figure: 4 x 4 pixel grid with both axes labeled 00, 01, 10, 11, showing query rectangles QA and QB]

  • QA: range [4, 7]
  • QB: ranges [2, 3] and [8, 9]
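
A brute-force sketch of this mapping: enumerate the pixels of the query rectangle, compute their z-values, and merge consecutive values into ranges. The pixel coordinates used for QA and QB below are assumptions inferred from the ranges quoted above (the figure itself is not available), and practical implementations decompose the rectangle recursively instead of enumerating every pixel.

```python
def z_value(x: int, y: int, k: int) -> int:
    """Interleave the k bits of x and y, x bit first."""
    z = 0
    for bit in range(k - 1, -1, -1):
        z = (z << 1) | ((x >> bit) & 1)
        z = (z << 1) | ((y >> bit) & 1)
    return z

def z_ranges(x_lo: int, x_hi: int, y_lo: int, y_hi: int, k: int):
    """1-d ranges of z-values covering the pixel rectangle [x_lo, x_hi] x [y_lo, y_hi]."""
    zs = sorted(z_value(x, y, k)
                for x in range(x_lo, x_hi + 1)
                for y in range(y_lo, y_hi + 1))
    ranges = []
    for z in zs:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z          # extend the current run
        else:
            ranges.append([z, z])      # start a new run
    return [tuple(r) for r in ranges]

print(z_ranges(0, 1, 2, 3, k=2))  # assumed QA: x in 00..01, y in 10..11
print(z_ranges(1, 2, 0, 1, k=2))  # assumed QB: x in 01..10, y in 00..01
```

Each resulting range becomes one range scan on the B+-tree of z-values; a rectangle that straddles quadrant boundaries, like the assumed QB, splits into several ranges.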

Hilbert Curve

  • We want points that are close in 2-d to be close in 1-d
  • Note that in 2-d there are 4 neighbors for each point, whereas in 1-d there are only 2.
  • The Z-curve has some “jumps” that we would like to avoid
  • The Hilbert curve avoids the jumps: recursive definition
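
As an illustration of the "no jumps" property, here is the widely used iterative conversion from pixel coordinates to a position along the Hilbert curve. It is a sketch of one common formulation, not necessarily the one used in the lecture; the function name and the orientation of the curve (starting at pixel (0, 0)) are assumptions.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of pixel (x, y) along the Hilbert curve of an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)   # which quadrant, in curve order
        # Rotate/flip so the sub-quadrant is traversed in the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def curve_order(n: int):
    """All pixels of the n x n grid, listed in Hilbert order."""
    cells = [(x, y) for x in range(n) for y in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
```

Walking `curve_order(8)` shows that every pair of consecutive pixels is adjacent in the grid, which is not true of the Z-curve.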

Hilbert Curve - example

  • It has been shown that in general Hilbert is better than the other space-filling curves for retrieval [Jag90]
  • Hi: the order-i Hilbert curve for a $2^i \times 2^i$ array

[Figure: H1, H2, ..., H(n+1)]

Reference
  • [Jag90] H. V. Jagadish: Linear Clustering of Objects with Multiple Attributes. ACM SIGMOD Conference 1990: 332-342

Problem

Given a collection of geometric objects (points, lines, polygons, ...), organize them on disk to answer spatial queries (range, nn, etc.)

R-trees

[Guttman 84] Main idea: extend the B+-tree to multi-dimensional spaces!
(only deal with Minimum Bounding Rectangles, MBRs)

R-trees

  • A multi-way external memory tree
  • Index nodes and data (leaf) nodes
  • All leaf nodes appear on the same level
  • Every node contains between t and M entries
  • The root node has at least 2 entries (children)

Example

E.g., with fanout 4: group nearby rectangles to parent MBRs; each group -> disk page

[Figure: rectangles A through J]

Example

F = 4

[Figure: rectangles A through J grouped under parent MBRs P1 through P4]

Example

F = 4

[Figure: the same grouping shown as a tree, with root entries P1, P2, P3, P4 pointing to the leaf pages]

R-trees - format of nodes

{(MBR; obj_ptr)} for leaf nodes

[Figure: leaf entry layout: x-low; x-high | y-low; y-high | ... | obj ptr | ...]

R-trees - format of nodes

{(MBR; node_ptr)} for non-leaf nodes

[Figure: non-leaf entry layout: x-low; x-high | y-low; y-high | ... | node ptr | ...]

[Figure: example R-tree over points a to i with node MBRs E1 to E9 and root entries E1, E2; x axis and y axis from 0 to 10; contents of some nodes omitted]

R-trees: Search

[Figure: the example tree with root entries P1 to P4 and leaf rectangles A through J]

R-trees: Search

[Figure: the example tree with root entries P1 to P4 and leaf rectangles A through J, highlighting the entries visited by the search]

R-trees: Search

Main points:
  • every parent node completely covers its ‘children’
  • a child MBR may be covered by more than one parent; it is stored under ONLY ONE of them (i.e., no need for duplicate elimination)
  • a point query may follow multiple branches
  • everything works for any(?) dimensionality

R-trees: Insertion

Insert X

[Figure: the example tree with a new rectangle X]

R-trees: Insertion

Insert Y

[Figure: the example tree with a new rectangle Y]

R-trees: Insertion

Extend the parent MBR

[Figure: the example tree with the parent MBR extended to cover Y]

R-trees: Insertion

How to find the next node to insert the new object?
  • Using ChooseLeaf: find the entry that needs the least enlargement to include Y. Resolve ties using the area (smallest)
  • Other methods (later)
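
The selection rule applied at each level by ChooseLeaf can be sketched as follows. Rectangles are assumed to be tuples (x_lo, y_lo, x_hi, y_hi), and the function names are hypothetical.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """MBR of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(entry, new):
    """Extra area the entry's MBR needs in order to include the new rectangle."""
    return area(union(entry, new)) - area(entry)

def choose_entry(entries, new):
    """Index of the entry with least enlargement; ties go to the smaller area."""
    return min(range(len(entries)),
               key=lambda i: (enlargement(entries[i], new), area(entries[i])))
```

Descending from the root and applying `choose_entry` at every index node until a leaf is reached yields the leaf that receives the new object.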

R-trees: Insertion

If node is full then Split: e.g., insert W

[Figure: the example tree where leaf P1 already holds A, B, C, K and W must be inserted]

R-trees: Insertion

If node is full then Split: e.g., insert W

[Figure: P1 is split into P1 and P5; the root is split as well, giving new root entries Q1 and Q2 over P1, P5, P2, P3, P4]

R-trees: Split

Split node P1: partition the MBRs into two groups.

[Figure: node P1 containing A, B, C, W, K]

  • (A1: plane sweep, until 50% of rectangles)
  • A2: ‘linear’ split
  • A3: quadratic split
  • A4: exponential split: $2^{M-1}$ choices

R-trees: Split

Pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the ‘closest’ ‘seed’

[Figure: seed1, seed2 and a rectangle R]

R-trees: Split

Pick two rectangles as ‘seeds’; assign each rectangle ‘R’ to the ‘closest’ ‘seed’, where ‘closest’ means the smallest increase in area

[Figure: seed1, seed2 and a rectangle R]

R-trees: Split

How to pick seeds:
  • Linear: find the highest and lowest side in each dimension, normalize the separations, choose the pair with the greatest normalized separation
  • Quadratic: for each pair E1 and E2, calculate the rectangle J = MBR(E1, E2) and d = area(J) - area(E1) - area(E2). Choose the pair with the largest d
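
A simplified sketch of the quadratic seed choice followed by the "closest seed" assignment. It deliberately omits parts of Guttman's full algorithm (the minimum fill requirement and the order in which remaining entries are picked), and the names and the (x_lo, y_lo, x_hi, y_hi) layout are assumptions.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds_quadratic(rects):
    """Pair (i, j) maximizing d = area(J) - area(E1) - area(E2), with J = MBR(E1, E2)."""
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

def split(rects):
    """Assign every rectangle to the seed group whose MBR grows the least."""
    s1, s2 = pick_seeds_quadratic(rects)
    groups = ([s1], [s2])
    mbrs = [rects[s1], rects[s2]]
    for i, r in enumerate(rects):
        if i in (s1, s2):
            continue
        growth = [area(union(m, r)) - area(m) for m in mbrs]
        g = 0 if growth[0] <= growth[1] else 1
        groups[g].append(i)
        mbrs[g] = union(mbrs[g], r)   # the group's MBR grows as entries join
    return groups
```

The seed search examines every pair, which is where the quadratic cost in the number of entries comes from.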

R-trees: Insertion

  • Use ChooseLeaf to find the leaf node to insert an entry E
  • If the leaf node is full, then Split; otherwise insert there
      Propagate the split upwards, if necessary
  • Adjust parent nodes

R-trees: Deletion

  • Find the leaf node that contains the entry E
  • Remove E from this node
  • If underflow:
      Eliminate the node by removing the node entries and the parent entry
      Reinsert the orphaned (other) entries into the tree using Insert
  • Other method (later)

R-trees: Variations

  • R+-tree: do not allow overlapping, so split the objects (similar to z-values). Greek R-tree (Faloutsos, Roussopoulos, Sellis)
  • R*-tree: change the insertion and deletion algorithms (minimize not only area but also perimeter; forced re-insertion). German R-tree: Kriegel’s group
  • Hilbert R-tree: use the Hilbert values to insert objects into the tree

R-tree

  • The original R-tree tries to minimize the area of each enclosing rectangle in the index nodes.
  • Is there any other property that can be optimized? R*-tree: Yes!

R*-tree

Optimization criteria:
  • (O1) Area covered by an index MBR
  • (O2) Overlap between index MBRs
  • (O3) Margin of an index rectangle
  • (O4) Storage utilization

Sometimes it is impossible to optimize all the above criteria at the same time!

R*-tree

ChooseSubtree:
  • If the next node is a leaf node, choose the node using the following criteria:
      Least overlap enlargement
      Least area enlargement
      Smaller area
  • Else:
      Least area enlargement
      Smaller area

R*-tree

SplitNode
  • Choose the axis to split
  • Choose the two groups along the chosen axis

ChooseSplitAxis
  • Along each axis, sort rectangles and break them into two groups (M-2m+2 possible ways where one group contains at least m rectangles).
  • Compute the sum S of all margin-values (perimeters) of each pair of groups. Choose the one that minimizes S

ChooseSplitIndex
  • Along the chosen axis, choose the grouping that gives the minimum overlap-value

R*-tree

Forced Reinsert:
  • Defer splits by forced reinsert, i.e.: instead of splitting, temporarily delete some entries, shrink the overflowing MBR, and re-insert those entries
  • Which ones to re-insert?
  • How many? A: 30%

Spatial Queries

Given a collection of geometric objects (points, lines, polygons, ...), organize them on disk to answer efficiently
  • point queries
  • range queries
  • k-nn queries
  • spatial joins (‘all pairs’ queries)

R-tree

[Figure: example R-tree with rectangles numbered 1 to 13]

R-trees - Range search

pseudocode:
    check the root
    for each branch,
        if its MBR intersects the query rectangle
            apply range-search (or print out, if this is a leaf)
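
A runnable rendering of this pseudocode, assuming a hypothetical in-memory `Node` class in which index nodes hold child nodes and leaves hold (MBR, object id) pairs; rectangles are (x_lo, y_lo, x_hi, y_hi) tuples.

```python
class Node:
    def __init__(self, mbr, children=None, objects=None):
        self.mbr = mbr
        self.children = children or []  # index node: child nodes
        self.objects = objects or []    # leaf node: list of (mbr, obj_id)

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_search(node, query, out):
    """Collect ids of all objects whose MBR intersects the query rectangle."""
    if not node.children:                         # leaf: report qualifying entries
        out.extend(oid for mbr, oid in node.objects if intersects(mbr, query))
    else:
        for child in node.children:               # index node: follow every
            if intersects(child.mbr, query):      # branch that can contain hits
                range_search(child, query, out)
    return out

# Hypothetical two-leaf tree.
leaf1 = Node((0, 0, 3, 3), objects=[((0, 0, 1, 1), "A"), ((2, 2, 3, 3), "B")])
leaf2 = Node((8, 8, 9, 9), objects=[((8, 8, 9, 9), "C")])
root = Node((0, 0, 9, 9), children=[leaf1, leaf2])
```

The reported ids are only the filtering result; exact geometries would still be checked in the refinement step.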

R-trees - NN search

[Figure: rectangles A through J under parent MBRs P1 to P4, with a query point q]

R-trees - NN search

Q: How? (find near neighbor; refine...)

R-trees - NN search

A1: depth-first search; then range query

[Figure: a depth-first descent finds a near neighbor of q; a range query around q with that distance then refines the answer]

R-trees - NN search: Branch and Bound

A2: [Roussopoulos+, SIGMOD 95]:
  • At each node, a priority queue with promising MBRs and their best- and worst-case distances
  • Main idea: every face of any MBR contains at least one point of an actual spatial object!

MBR face property

  • An MBR is a d-dimensional rectangle, which is the minimal rectangle that fully encloses (bounds) an object (or a set of objects)
  • MBR f.p.: every face of the MBR contains at least one point of some object in the database

Search improvement

  • Visit an MBR (node) only when necessary
  • How to do pruning? Using MINDIST and MINMAXDIST

MINDIST

  • MINDIST(P, R) is the minimum distance between a point P and a rectangle R
  • If the point is inside R, then MINDIST = 0
  • If P is outside of R, MINDIST is the distance of P to the closest point of R (one point of the perimeter)

MINDIST computation

  • MINDIST(p, R) is the minimum distance between p and R with corner points l = (l1, l2, …, ld) and u = (u1, u2, …, ud)
  • Every point in R is at least this distance away

$$\mathrm{MINDIST}(P, R) = \sqrt{\sum_{i=1}^{d} (p_i - r_i)^2}, \qquad r_i = \begin{cases} l_i & \text{if } p_i < l_i \\ u_i & \text{if } p_i > u_i \\ p_i & \text{otherwise} \end{cases}$$

$$\forall o \in R:\ \mathrm{MINDIST}(P, R) \le \lVert (P, o) \rVert$$

[Figure: rectangle R with corners l and u; points p outside R and a point p inside R, for which MINDIST = 0]

MINMAXDIST

  • MINMAXDIST(P, R): for each dimension, find the closest face, compute the distance to the furthest point on this face, and take the minimum of all these (d) distances
  • MINMAXDIST(P, R) is the smallest possible upper bound of distances from P to R
  • MINMAXDIST guarantees that there is at least one object in R with a distance to P smaller than or equal to it.

$$\exists o \in R:\ \lVert (P, o) \rVert \le \mathrm{MINMAXDIST}(P, R)$$
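
Both metrics can be computed directly from the corner points l and u. The sketch below follows the verbal definitions above and returns true (not squared) Euclidean distances; that choice, the function names and the argument layout are assumptions.

```python
import math

def mindist(p, lo, hi):
    """Distance from point p to the closest point of the rectangle [lo, hi]."""
    total = 0.0
    for pi, li, ui in zip(p, lo, hi):
        ri = li if pi < li else ui if pi > ui else pi  # clamp p into the rectangle
        total += (pi - ri) ** 2
    return math.sqrt(total)

def minmaxdist(p, lo, hi):
    """Smallest upper bound on the distance from p to some object inside [lo, hi]."""
    d = len(p)
    # near[k]: coordinate of the closer face along dimension k.
    near = [lo[k] if p[k] <= (lo[k] + hi[k]) / 2 else hi[k] for k in range(d)]
    # far[k]: coordinate of the farther face along dimension k.
    far = [lo[k] if p[k] >= (lo[k] + hi[k]) / 2 else hi[k] for k in range(d)]
    far_sq = [(p[k] - far[k]) ** 2 for k in range(d)]
    total_far = sum(far_sq)
    # For each dimension k: closest face along k, farthest corner of that face.
    best = min(total_far - far_sq[k] + (p[k] - near[k]) ** 2 for k in range(d))
    return math.sqrt(best)

p, lo, hi = (0.0, 0.0), (1.0, 1.0), (3.0, 2.0)
print(mindist(p, lo, hi), minmaxdist(p, lo, hi))
```

For this example the closest point of the rectangle is its corner (1, 1), while the bound is reached on the face x = 1 at its far end (1, 2).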

MINDIST and MINMAXDIST

MINDIST(P, R) <= NN(P) <= MINMAXDIST(P, R)

[Figure: rectangles R1 to R4 with their MINDIST and MINMAXDIST from P]

Pruning in NN search

  1. Downward pruning: an MBR R is discarded if there exists another R’ s.t. MINDIST(P, R) > MINMAXDIST(P, R’)
  2. Downward pruning: an object O is discarded if there exists an R s.t. Actual-Dist(P, O) > MINMAXDIST(P, R)
  3. Upward pruning: an MBR R is discarded if an object O is found s.t. MINDIST(P, R) > Actual-Dist(P, O)

Pruning 1 example

Downward pruning: an MBR R is discarded if there exists another R’ s.t. MINDIST(P, R) > MINMAXDIST(P, R’)

[Figure: MINDIST to R and MINMAXDIST to R’]

Pruning 2 example

Downward pruning: an object O is discarded if there exists an R s.t. Actual-Dist(P, O) > MINMAXDIST(P, R)

[Figure: Actual-Dist to O and MINMAXDIST to R]

Pruning 3 example

Upward pruning: an MBR R is discarded if an object O is found s.t. MINDIST(P, R) > Actual-Dist(P, O)

[Figure: MINDIST to R and Actual-Dist to O]

Ordering Distance

MINDIST is an optimistic distance, whereas MINMAXDIST is a pessimistic one.

[Figure: MINDIST and MINMAXDIST from P]

NN-search Algorithm

1. Initialize the nearest distance as infinite distance
2. Traverse the tree depth-first starting from the root. At each index node, sort all MBRs using an ordering metric and put them in an Active Branch List (ABL).
3. Apply pruning rules 1 and 2 to the ABL
4. Visit the MBRs from the ABL following the order until it is empty
5. If leaf node, compute actual distances, compare with the best NN so far, update if necessary.
6. At the return from the recursion, use pruning rule 3
7. When the ABL is empty, the NN search returns.

K-NN search

  • Keep the sorted buffer of at most k current nearest neighbors
  • Pruning is done using the k-th distance

Another NN search: Best-First

Global order [HS99]
  • Maintain distance to all entries in a common Priority Queue
  • Use only MINDIST
  • Repeat
      Inspect the next MBR in the list
      Add the children to the list and reorder
  • Until all remaining MBRs can be pruned

Nearest Neighbor Search (NN) with R-Trees

Best-first (BF) algorithm:

[Figure: example R-tree over points a to i with node MBRs E1 to E9, a query point and its search region; x axis and y axis from 0 to 10; contents of some nodes omitted. Distances from the query point: a 5, b 13, c 18, d 13, e 13, f 10, g 13, h 2, i 10; E1 1, E2 2, E3 8, E4 5, E5 5, E6 9, E7 13, E8 2, E9 17]

Action                   Heap                                                     Result
Visit Root               E1 1, E2 2                                               {empty}
follow E1                E2 2, E4 5, E5 5, E3 8, E6 9                             {empty}
follow E2                E8 2, E4 5, E5 5, E3 8, E6 9, E7 13, E9 17               {empty}
follow E8                h 2, E4 5, E5 5, E3 8, E6 9, i 10, E7 13, g 13, E9 17    {(h, 2)}
Report h and terminate

HS algorithm

Initialize PQ (priority queue)
InsertQueue(PQ, Root)
While not IsEmpty(PQ)
    R = Dequeue(PQ)
    If R is an object
        Report R and exit (done!)
    If R is a leaf page node
        For each O in R, compute the Actual-Dists, InsertQueue(PQ, O)
    If R is an index node
        For each MBR C, compute MINDIST, insert into PQ
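
The same loop can be written with Python's `heapq` as the priority queue. This is a sketch for point data under stated assumptions: the `Node` class is hypothetical, leaves store points as (x, y) tuples, and a counter breaks ties so that the heap never has to compare nodes with each other.

```python
import heapq
import math
from itertools import count

class Node:
    def __init__(self, mbr, children=None, points=None):
        self.mbr = mbr                  # (x_lo, y_lo, x_hi, y_hi)
        self.children = children or []  # index node: child nodes
        self.points = points or []      # leaf node: (x, y) tuples

def mindist(q, r):
    dx = max(r[0] - q[0], 0.0, q[0] - r[2])
    dy = max(r[1] - q[1], 0.0, q[1] - r[3])
    return math.hypot(dx, dy)

def best_first_nn(root, q):
    tie = count()
    heap = [(0.0, next(tie), root)]
    while heap:
        dist, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):      # an object: nothing closer can remain
            return item, dist
        if item.children:                # index node: enqueue children by MINDIST
            for child in item.children:
                heapq.heappush(heap, (mindist(q, child.mbr), next(tie), child))
        else:                            # leaf: enqueue objects by actual distance
            for pt in item.points:
                d = math.hypot(q[0] - pt[0], q[1] - pt[1])
                heapq.heappush(heap, (d, next(tie), pt))
    return None, math.inf

# Hypothetical two-leaf tree.
leaf1 = Node((1, 1, 2, 3), points=[(1, 1), (2, 3)])
leaf2 = Node((8, 7, 9, 8), points=[(8, 8), (9, 7)])
root = Node((1, 1, 9, 8), children=[leaf1, leaf2])
print(best_first_nn(root, (7, 7)))
```

In this example `leaf1` is never opened: its MINDIST is larger than the distance of the reported point, so it is still in the queue when the search stops.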

Best-First vs Branch and Bound

  • Best-First is the “optimal” algorithm in the sense that it visits all the necessary nodes and nothing more!
  • But it needs to store a large Priority Queue in main memory. If the PQ becomes large, we have thrashing…
  • BB uses small lists for each node. It also uses MINMAXDIST to prune some entries

Spatial Join

  • Find all parks in each city in MA
  • Find all trails that go through a forest in MA
  • Basic operation: find all pairs of objects that overlap
  • Single-scan queries: nearest neighbor queries, range queries
  • Multiple-scan queries: spatial join

Algorithms

  • No existing index structures
      Transform data into 1-d space [O89]: z-transform; sensitive to size of pixel
      Partition-based spatial-merge join [PW96]: partition into tiles that can fit into memory; plane sweep algorithm on tiles
      Spatial hash joins [LR96, KS97]
      Sort data using recursive partitioning [BBKK01]
  • With index structures [BKS93, HJR97]
      k-d trees and grid files
      R-trees

R-tree based Join [BKS93]

[Figure: R-trees R and S]

Join1(R,S)

  • Tree synchronized traversal algorithm

    Join1(R, S)
    Repeat
        Find a pair of intersecting entries E in R and F in S
        If R and S are leaf pages then
            add (E, F) to result-set
        Else Join1(E, F)
    Until all pairs are examined

  • CPU and I/O bottleneck

[Figure: nodes R and S]
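
A sketch of the synchronized traversal, assuming both trees have the same height and using a hypothetical in-memory `Node` class (index nodes hold children, leaves hold (MBR, id) pairs). The nested loops are the naive "examine all pairs" version whose CPU cost the tuning techniques aim to reduce.

```python
class Node:
    def __init__(self, mbr, children=None, objects=None):
        self.mbr = mbr                  # (x_lo, y_lo, x_hi, y_hi)
        self.children = children or []  # index node: child nodes
        self.objects = objects or []    # leaf node: list of (mbr, obj_id)

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def join1(r, s, out):
    """Append all (id_r, id_s) pairs with intersecting MBRs (equal-height trees assumed)."""
    if not r.children and not s.children:        # both are leaf pages
        for e_mbr, e_id in r.objects:
            for f_mbr, f_id in s.objects:
                if intersects(e_mbr, f_mbr):
                    out.append((e_id, f_id))
    else:                                        # descend only into intersecting pairs
        for e in r.children:
            for f in s.children:
                if intersects(e.mbr, f.mbr):
                    join1(e, f, out)
    return out

# Hypothetical data: two leaves per tree.
R = Node((0, 0, 6, 6), children=[
    Node((0, 0, 2, 2), objects=[((0, 0, 2, 2), "r1")]),
    Node((5, 5, 6, 6), objects=[((5, 5, 6, 6), "r2")]),
])
S = Node((1, 1, 21, 21), children=[
    Node((1, 1, 3, 3), objects=[((1, 1, 3, 3), "s1")]),
    Node((5.5, 5.5, 21, 21), objects=[((5.5, 5.5, 7, 7), "s2"), ((20, 20, 21, 21), "s3")]),
])
print(sorted(join1(R, S, [])))
```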

CPU-Time Tuning

Two ways to improve CPU time:
  • Restricting the search space
  • Spatial sorting and plane sweep

Reducing CPU bottleneck

[Figure: nodes R and S and their intersection]

Join2(R,S,IntersectedVol)

    Join2(R, S, IV)
    Repeat
        Find a pair of intersecting entries E in R and F in S that overlap with IV
        If R and S are leaf pages then
            add (E, F) to result-set
        Else Join2(E, F, CommonEF)
    Until all pairs are examined

  • In general, the number of comparisons equals size(R) + size(S) + relevant(R) * relevant(S)
  • Reduce the product term

Restricting the search space

[Figure: entries of R and S, of which only a few overlap the intersection of the two nodes]

  • Join1: 7 of R * 7 of S = 49 comparisons
  • Now: 3 of R * 2 of S = 6 comparisons
  • Plus scanning: 7 of R + 7 of S = 14 comparisons

Using Plane Sweep

[Figure: rectangles r1, r2, r3 of R and s1, s2 of S, swept by a vertical line]

  • Consider the extents along the x-axis. Start with the first entry r1; sweep a vertical line
  • Check if (r1, s1) intersect along the y-dimension. Add (r1, s1) to result set
  • Check if (r1, s2) intersect along the y-dimension. Add (r1, s2) to result set
  • Reached the end of r1. Start with next entry r2
  • Reposition sweep line
  • Check if r2 and s1 intersect along y. Do not add (r2, s1) to result
  • Reached the end of r2. Start with next entry s1
  • Total of 2(r1) + 1(r2) + 0(s1) + 1(s2) + 0(r3) = 4 comparisons
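
The sweep can be sketched as follows. Entries are tuples (name, x_lo, y_lo, x_hi, y_hi); the function name and the coordinates in the example are hypothetical, chosen so that the run mirrors the walkthrough above (r1 meets s1 and s2, r2 only overlaps s1 along x, and 4 comparisons are made in total).

```python
def plane_sweep_join(rs, ss):
    """Return (intersecting (r, s) name pairs, number of y-comparisons performed)."""
    rs = sorted(rs, key=lambda t: t[1])   # sort both inputs by x_lo
    ss = sorted(ss, key=lambda t: t[1])
    out, comparisons = [], 0
    i = j = 0
    while i < len(rs) and j < len(ss):
        if rs[i][1] <= ss[j][1]:
            cur, other, k, cur_is_r = rs[i], ss, j, True
            i += 1
        else:
            cur, other, k, cur_is_r = ss[j], rs, i, False
            j += 1
        # Scan entries of the other set whose x-extent starts before cur ends.
        while k < len(other) and other[k][1] <= cur[3]:
            comparisons += 1
            if cur[2] <= other[k][4] and other[k][2] <= cur[4]:   # overlap along y
                pair = (cur[0], other[k][0]) if cur_is_r else (other[k][0], cur[0])
                out.append(pair)
            k += 1
    return out, comparisons

R = [("r1", 0, 0, 4, 4), ("r2", 1, 6, 2.5, 8), ("r3", 8, 0, 9, 1)]
S = [("s1", 2, 1, 6, 2), ("s2", 3, 3, 8.5, 5)]
print(plane_sweep_join(R, S))
```

Because both lists are sorted by x_lo, each entry is compared only against entries of the other set that overlap it along x, instead of against all of them.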

I/O Tuning

  • Compute a read schedule of the pages to minimize the number of disk accesses
  • Local optimization policy based on spatial locality
  • Three methods
      Local plane sweep
      Local plane sweep with pinning
      Local z-order

Reducing I/O

Plane sweep again:
  • Read schedule r1, s1, s2, r3
  • Every subtree examined only once
  • Consider a slightly different layout

Reducing I/O

[Figure: rectangles r1, r2, r3 of R and s1, s2 of S in a slightly different layout]

  • Read schedule is r1, s2, r2, s1, s2, r3
  • Subtree s2 is examined twice

Pinning of nodes

  • After examining a pair (E, F), compute the degree of intersection of each entry
      degree(E) is the number of intersections between E and unprocessed rectangles of the other dataset
  • If the degrees are non-zero, pin the pages of the entry with maximum degree
  • Perform spatial joins for this page
  • Continue with plane sweep

Reducing I/O

[Figure: rectangles r1, r2, r3 of R and s1, s2 of S]

  • After computing join(r1, s2): degree(r1) = 0, degree(s2) = 1
  • So, examine s2 next
  • Read schedule = r1, s2, r3, r2, s1
  • Subtree s2 examined only once

Local Z-Order

Idea:
1. Compute the intersections between each rectangle of the one node and all rectangles of the other node
2. Sort the rectangles according to the Z-ordering of their centers
3. Use this ordering to fetch pages

Local Z-ordering

[Figure: rectangles r1 to r4 and s1, s2 with intersection regions I to IV]

Read schedule: <s1, r2, r1, s2, r4, r3>

R-trees - performance analysis

How many disk (= node) accesses will we need for
  • range queries
  • nn queries
  • spatial joins

Worst Case vs. Average Case

Worst Case Performance

  • In the worst case, we need to perform O(N/B) I/Os for an empty query (pretty bad!)
  • We need to show a family of datasets and queries where any R-tree will perform like that

Example:

[Figure: dataset and query; x axis from 2 to 20, y axis from 2 to 10]

Average Case analysis

How many disk accesses (expected value) for range queries?
  • query distribution wrt location?
  • query distribution wrt size?

R-trees - performance analysis

How many disk accesses for range queries?
  • query distribution wrt location? uniform; (biased)
  • query distribution wrt size? uniform

R-trees - performance analysis

Easier case: we know the positions of data nodes and their MBRs, e.g.:

R-trees - performance analysis

How many times will P1 be retrieved (unif. queries)?

[Figure: node P1 with sides x1 and x2]

R-trees - performance analysis

How many times will P1 be retrieved (unif. POINT queries)? A: x1 * x2

[Figure: node P1 with sides x1 and x2 inside the unit square]

R-trees - performance analysis

How many times will P1 be retrieved (unif. queries of size q1 x q2)?

[Figure: node P1 with sides x1 and x2 and a q1 x q2 query inside the unit square]

R-trees - performance analysis

Minkowski sum

[Figure: P1 enlarged by q1/2 on each side horizontally and by q2/2 on each side vertically]

R-trees - performance analysis

How many times will P1 be retrieved (unif. queries of size q1 x q2)? A: (x1 + q1) * (x2 + q2)

[Figure: node P1 with sides x1 and x2 and a q1 x q2 query inside the unit square]

R-trees - performance analysis

Thus, given a tree with n nodes (i = 1, ..., n) we expect

$$DA(q_1, q_2) = \sum_{i=1}^{n} (x_{i,1} + q_1)(x_{i,2} + q_2) = \sum_{i=1}^{n} x_{i,1}\, x_{i,2} + q_1 \sum_{i=1}^{n} x_{i,2} + q_2 \sum_{i=1}^{n} x_{i,1} + q_1 q_2\, n$$

The three groups of terms are the ‘volume’ ($\sum_i x_{i,1} x_{i,2}$), the ‘surface area’ ($q_1 \sum_i x_{i,2} + q_2 \sum_i x_{i,1}$) and the count ($q_1 q_2\, n$).

R-trees - performance analysis

Observations:
  • for point queries: only volume matters
  • for horizontal-line queries (q2 = 0): vertical length matters
  • for large queries (q1, q2 >> 0): the count n matters
  • overlap: does not seem to matter (but it is related to area)
  • formula: easily extendible to higher dimensions
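
A small numeric sketch of the formula and of its three groups of terms. The node sizes are made-up values in a unit-square work space, and, like the formula itself, the code ignores boundary effects (the per-node access probability is not capped at 1).

```python
def expected_accesses(nodes, q1, q2):
    """DA(q1, q2) for nodes given as (x1, x2) MBR side lengths."""
    return sum((x1 + q1) * (x2 + q2) for x1, x2 in nodes)

def access_terms(nodes, q1, q2):
    """The same quantity split into 'volume', 'surface area' and count terms."""
    volume = sum(x1 * x2 for x1, x2 in nodes)
    surface = q1 * sum(x2 for _, x2 in nodes) + q2 * sum(x1 for x1, _ in nodes)
    count = q1 * q2 * len(nodes)
    return volume, surface, count

nodes = [(0.5, 0.5), (0.2, 0.4)]           # hypothetical node MBR sizes
print(expected_accesses(nodes, 0.1, 0.1))  # total expected accesses
print(access_terms(nodes, 0.1, 0.1))       # the three contributions
print(expected_accesses(nodes, 0.0, 0.0))  # point queries: volume only
```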

R-trees - performance analysis

Conclusions:
  • splits should try to minimize area and perimeter
  • i.e., we want few, small, square-like parent MBRs
  • rule of thumb: shoot for queries with q1 = q2 = 0.1 (or 0.05 or so)

More general Model

  • What if we have only the dataset D and the set of queries S?
  • We should “predict” the structure of a “good” R-tree for this dataset. Then use the previous model to estimate the average query performance for S
  • For a point dataset, we can use the Fractal Dimension to find the “average” structure of the tree
  • (More in the [FK94] paper)

Uniform dataset

  • Assume that the dataset (that contains only rectangles) is uniformly distributed in space.
  • Density of a set of N MBRs is the average number of MBRs that contain a given point in space, OR the total area covered by the MBRs over the area of the work space.
  • N boxes with average size s = (s1, s2): D(N, s) = N s1 s2
  • If s1 = s2 = s, then:

$$D = N s^2 \;\Rightarrow\; s = \sqrt{\frac{D}{N}}$$

Density of Leaf nodes

  • Assume a dataset of N rectangles. If the average page capacity is f, then we have $N_{ln} = N/f$ leaf nodes.
  • If $D_1$ is the density of the leaf MBRs, and the average area of each leaf MBR is $s_1^2$, then:

$$D_1 = \frac{N}{f}\, s_1^2 \;\Rightarrow\; s_1 = \sqrt{\frac{D_1 f}{N}}$$

  • So, we can estimate $s_1$ from N, f, $D_1$
  • We need to estimate $D_1$ from the dataset’s density…

Estimating D1

  • Consider a leaf node that contains f MBRs. Then for each side of the leaf node MBR we have $\sqrt{f}$ MBRs.
  • Also, $N_{ln}$ leaf nodes contain N MBRs, uniformly distributed.
  • The average distance between the centers of two consecutive MBRs is $t = 1/\sqrt{N}$ (assuming $[0,1]^2$ space)

Estimating D1

  • Combining the previous observations, we can estimate the density at the leaf level from the density of the dataset:

$$D_1 = \left\{ 1 + \frac{\sqrt{D} - 1}{\sqrt{f}} \right\}^2$$

  • We can apply the same ideas recursively to the other levels of the tree.

R-trees - performance analysis

Assuming uniform distribution:

$$DA(q) = 1 + \sum_{j=1}^{h} \left\{ \left( \sqrt{D_j} + q \sqrt{\frac{N}{f^j}} \right)^2 \right\}$$

where

$$D_0 = D \quad \text{and} \quad D_j = \left\{ 1 + \frac{\sqrt{D_{j-1}} - 1}{\sqrt{f}} \right\}^2$$

and D is the density of the dataset, f the fanout [TS96], N the number of objects, and h the number of levels summed over
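
The recursion and the sum are easy to evaluate numerically. In this sketch the number of levels h is passed in by the caller, which is an assumption (the slides do not state how it is derived), and the parameter values in the example are made up.

```python
import math

def level_densities(D, f, h):
    """[D_0, D_1, ..., D_h] with D_0 = D and D_j = (1 + (sqrt(D_{j-1}) - 1) / sqrt(f))^2."""
    dens = [D]
    for _ in range(h):
        dens.append((1 + (math.sqrt(dens[-1]) - 1) / math.sqrt(f)) ** 2)
    return dens

def expected_accesses(D, f, N, q, h):
    """DA(q) = 1 + sum over j = 1..h of (sqrt(D_j) + q * sqrt(N / f^j))^2."""
    dens = level_densities(D, f, h)
    return 1 + sum((math.sqrt(dens[j]) + q * math.sqrt(N / f ** j)) ** 2
                   for j in range(1, h + 1))

# Hypothetical setting: N = 10000 rectangles with density 1, fanout 100, two levels.
print(expected_accesses(D=1.0, f=100, N=10_000, q=0.1, h=2))
print(level_densities(D=0.0, f=100, h=1))  # point data: leaf density from D = 0
```

A dataset with density 1 keeps density 1 at every level under this recursion, while point data (D = 0) produces leaf MBRs with density (1 - 1/sqrt(f))^2.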

References

  • [FK94] Christos Faloutsos and Ibrahim Kamel. “Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension”. Proc. ACM PODS, 1994.
  • [TS96] Yannis Theodoridis and Timos Sellis. “A Model for the Prediction of R-tree Performance”. Proc. ACM PODS, 1996.