SLIDE 1

CMPUT 391 Database Management Systems Spatial Data Management

University of Alberta

  • Dr. Jörg Sander, 2006

1

CMPUT 391 – Database Management Systems

SLIDE 2

Spatial Data Management

  • Shortcomings of Relational Databases and ORDBMS
  • Modeling Spatial Data
  • Spatial Queries
  • Space-Filling Curves + B-Trees
  • R-trees

SLIDE 3

The Need for a DBMS

  • On one hand, there is a tremendous increase in the amount of data that applications have to handle; on the other hand, we want a reduced application development time.

    – Object-oriented programming
    – DBMS features: query capability with optimization, concurrency control, recovery, indexing, etc.

  • Can we merge these two to get an object database management system, since data is getting more complex?

SLIDE 4

Manipulating New Kinds of Data

  • A television channel needs to store video sequences, radio interviews, multimedia documents, geographical information, etc., and retrieve them efficiently.
  • A movie production company needs to store movies, frame sequences, data about actors and theaters, etc.
  • A biological lab needs to store complex data about molecules, chromosomes, etc., and retrieve parts of the data as well as complete data.

SLIDE 5

What are the Needs?

  • Images
  • Video
  • Multimedia in general
  • Spatial data (GIS)
  • Biological data
  • CAD data
  • Virtual Worlds
  • Games
  • List of lists
  • User defined data types

SLIDE 6

Shortcomings with RDBMS

  • Supports only a small, fixed collection of relatively simple data types (integers, floating point numbers, dates, strings)
  • No set-valued attributes (sets, lists, …)
  • No inheritance in the Is-a relationship
  • No complex objects, apart from BLOB (binary large object) and CLOB (character large object)
  • Impedance mismatch between the data access language (declarative SQL) and the host language (procedural C or Java): the programmer must explicitly specify how things are to be done. Is there a different solution?

SLIDE 7

Existing Object Databases

  • An object database is a persistent storage manager for objects:

    – Persistent storage for object-oriented programming languages (C++, Smalltalk, etc.)
    – Object-database systems:
        • Object-oriented database systems: alternative to relational systems
        • Object-relational database systems: extension to relational systems

  • Market: RDBMS ($8 billion), OODBMS ($30 million) worldwide
  • OODB commercial products: ObjectStore, GemStone, Orion, etc.

SLIDE 8

DBMS Classification Matrix

            Simple Data        Complex Data
Query       Relational DBMS    Object-Relational DBMS
No Query    File System        Object-Oriented DBMS

SLIDE 9

Object-Relational Features of Oracle

Methods

CREATE TYPE Rectangle_typ AS OBJECT (
  len NUMBER,
  wid NUMBER,
  MEMBER FUNCTION area RETURN NUMBER
);

CREATE TYPE BODY Rectangle_typ AS
  MEMBER FUNCTION area RETURN NUMBER IS
  BEGIN
    RETURN len * wid;
  END area;
END;

SLIDE 10

Object-Relational Features of Oracle

Collection types / nested tables

CREATE TYPE PointType AS OBJECT (
  x NUMBER,
  y NUMBER);

CREATE TYPE PolygonType AS TABLE OF PointType;

CREATE TABLE Polygons (
  name VARCHAR2(20),
  points PolygonType)
NESTED TABLE points STORE AS PointsTable;

The relations representing individual polygons are not stored directly as values of the points attribute; they are stored in a single table, PointsTable.

SLIDE 11

Spatial Data Management

  • Shortcomings of Relational Databases
  • Modeling Spatial Data
  • Spatial Queries
  • Space-Filling Curves + B-Trees
  • R-trees

SLIDE 12

Relational Representation of Spatial Data

  • Example: Representation of geometric objects (here: parcels/fields of land) in normalized relations

[Figure: map of the parcels F1 to F7]

Parcels

FNr   BNr
F1    B1
F1    B2
F1    B3
F1    B4
F4    B2
F4    B5
F4    B6
F4    B7
F4    B8
F4    B9
F7    B7
F7    B10
F7    B11
F7    B12
…     …

Borders

BNr   PNr1   PNr2
B1    P1     P2
B2    P2     P3
B3    P3     P4
B4    P4     P1
B5    P2     P5
B6    P5     P6
B7    P6     P7
B8    P7     P8
B9    P8     P3
B10   P6     P9
B11   P9     P10
B12   P10    P7

Points

PNr   X-Coord   Y-Coord
P1    X_P1      Y_P1
P2    X_P2      Y_P2
P3    X_P3      Y_P3
P4    X_P4      Y_P4
P5    X_P5      Y_P5
P6    X_P6      Y_P6
P7    X_P7      Y_P7
P8    X_P8      Y_P8
P9    X_P9      Y_P9
P10   X_P10     Y_P10

  • A redundancy-free representation requires distributing the information over 3 tables: Parcels, Borders, Points

SLIDE 13

Relational Representation of Spatial Data

  • For (spatial) queries involving parcels, it is necessary to reconstruct the spatial information from the different tables

    – E.g., to determine whether a given point P is inside parcel F2, we first have to find all corner points of parcel F2

SELECT Points.PNr, "X-Coord", "Y-Coord"
FROM Parcels, Borders, Points
WHERE FNr = 'F2'
  AND Parcels.BNr = Borders.BNr
  AND (Borders.PNr1 = Points.PNr OR Borders.PNr2 = Points.PNr)

  • Even this simple query requires expensive joins of three tables
  • Querying the geometry (e.g., P in F2?) is not directly supported.
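
The three-table join can be tried out with a small sketch using Python's built-in sqlite3 module. Several things here are assumptions for illustration only: the coordinates are made up (the slides give only symbolic values), the columns are renamed XCoord and YCoord to avoid quoting hyphenated names, DISTINCT is added because every corner belongs to two borders, and parcel F1 is queried because its borders are listed in the Parcels table.

```python
import sqlite3

def corner_points(conn, parcel):
    """Return the distinct corner points of a parcel via the three-table join."""
    sql = """
        SELECT DISTINCT Points.PNr, XCoord, YCoord
        FROM Parcels, Borders, Points
        WHERE FNr = ?
          AND Parcels.BNr = Borders.BNr
          AND (Borders.PNr1 = Points.PNr OR Borders.PNr2 = Points.PNr)
        ORDER BY Points.PNr
    """
    return conn.execute(sql, (parcel,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Parcels (FNr TEXT, BNr TEXT);
    CREATE TABLE Borders (BNr TEXT, PNr1 TEXT, PNr2 TEXT);
    CREATE TABLE Points  (PNr TEXT, XCoord REAL, YCoord REAL);
""")
conn.executemany("INSERT INTO Parcels VALUES (?, ?)",
                 [("F1", "B1"), ("F1", "B2"), ("F1", "B3"), ("F1", "B4")])
conn.executemany("INSERT INTO Borders VALUES (?, ?, ?)",
                 [("B1", "P1", "P2"), ("B2", "P2", "P3"),
                  ("B3", "P3", "P4"), ("B4", "P4", "P1")])
# Hypothetical coordinates: parcel F1 is modeled as a unit square.
conn.executemany("INSERT INTO Points VALUES (?, ?, ?)",
                 [("P1", 0.0, 0.0), ("P2", 0.0, 1.0),
                  ("P3", 1.0, 1.0), ("P4", 1.0, 0.0)])

corners = corner_points(conn, "F1")
```

Even after this join, the database only returns coordinates; the actual geometric test (is P inside the parcel?) still has to be done by the application.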

SLIDE 14

Extension of the Relational Model to Support Spatial Data

  • Integration of spatial data types and operations into the core of a DBMS (object-oriented and object-relational databases)

    – Data types such as Point, Line, Polygon
    – Operations such as ObjectIntersect, RangeQuery, etc.

  • Advantages

    – Natural extension of the relational model and query languages
    – Facilitates design and querying of spatial databases
    – Spatial data types and operations can be supported by spatial index structures and efficient algorithms, implemented in the core of a DBMS

  • All major database vendors today implement support for spatial data and operations in their database systems via object-relational extensions

SLIDE 15

Extension of the Relational Model to Support Spatial Data – Example

Relation: ForestZones(Zone: Polygon, ForestOfficial: String, Area: Cardinal)

  • The province decides that reforestation is necessary in an area described by a polygon S. Find all forest officials affected by this decision.

SELECT ForestOfficial
FROM ForestZones
WHERE ObjectIntersects (S, Zone)

[Figure: map of the forest zones R1 to R6]

ForestZones

Zone   ForestOfficial   Area (m2)
R1     Stevens          3900
R2     Behrens          4250
R3     Lee              6700
R4     Goebel           5400
R5     Jones            1900
R6     Kent             4600

SLIDE 16

Data Types for Spatial Objects

  • Spatial objects are described by

    – Spatial extent
        • Location and/or boundary with respect to a reference point in a coordinate system, which is at least 2-dimensional.
        • Basic object types: Point, Line, Polygon
    – Other non-spatial attributes
        • Thematic attributes such as height, area, name, land use, etc.

[Figure: 2-dim. points, 2-dim. lines, and 2-dim. polygons (crop, forest, water) in an X/Y coordinate system]
SLIDE 17

Spatial Data Management

  • Shortcomings of Relational Databases
  • Modeling Spatial Data
  • Spatial Queries
  • Space-Filling Curves + B-Trees
  • R-trees

SLIDE 18

Spatial Query Processing

  • A DBMS has to support two types of operations

    – Operations to retrieve certain subsets of spatial objects from the database
        • “Spatial Queries/Selections”, e.g., window query, point query, etc.
    – Operations that perform basic geometric computations and tests
        • E.g., point-in-polygon test, intersection of two polygons, etc.

  • Spatial selections, e.g., in geographic information systems, are often supported by an interactive graphical user interface
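
As an illustration of a basic geometric test, the following is a minimal sketch of a point-in-polygon test using ray casting. The slides do not say which algorithm is meant, so ray casting is an assumption here, as is the representation of a polygon as a list of (x, y) corner tuples. Points exactly on the boundary are not treated specially.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how often a horizontal ray from the point
    towards +x crosses the polygon boundary; odd means inside."""
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            # x-coordinate at which the edge crosses that line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
```

For the hypothetical square above, point_in_polygon((2, 2), square) holds, while points to the left or right of the square are rejected.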

[Figure: a window query with window W and a point query with point P]
SLIDE 19

Basic Spatial Queries

  • Containment Query: Given a spatial object R, find all objects that completely contain R. If R is a point: Point Query
  • Region Query: Given a region R (polygon or circle), find all spatial objects that intersect with R. If R is a rectangle: Window Query
  • Enclosure Query: Given a polygon region R, find all objects that are completely contained in R
  • K-Nearest Neighbor Query: Given an object P, find the k objects that are closest to P (typically for points)

[Figure: containment query with object R; point query with point P]
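
The query types can be sketched on simple data. The following is an illustrative linear-scan version only: it assumes all objects are axis-parallel rectangles given as (xmin, ymin, xmax, ymax) tuples, the function names are made up for this example, and the k-nearest-neighbor query uses Euclidean distance between points.

```python
import math

def contains(outer, inner):
    """True if rectangle `outer` completely contains rectangle `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def intersects(a, b):
    """True if the rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def containment_query(objects, r):
    return [name for name, rect in objects.items() if contains(rect, r)]

def window_query(objects, w):
    return [name for name, rect in objects.items() if intersects(rect, w)]

def enclosure_query(objects, r):
    return [name for name, rect in objects.items() if contains(r, rect)]

def knn_query(points, p, k):
    """Names of the k points closest to p."""
    return sorted(points, key=lambda name: math.dist(points[name], p))[:k]

objects = {"A": (0, 0, 4, 4), "B": (1, 1, 2, 2), "C": (5, 5, 6, 6)}
points = {"P1": (0, 0), "P2": (1, 1), "P3": (5, 5)}
```

A point query is the containment query with a degenerate rectangle, e.g. containment_query(objects, (1.5, 1.5, 1.5, 1.5)).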

[Figure: region query, window query, and enclosure query with region R; 2-nn query with point P]
SLIDE 20

Basic Spatial Operation – Spatial Join

  • Given two sets of spatial objects (typically minimum bounding rectangles)

    – S1 = {R1, R2, …, Rm} and S2 = {R’1, R’2, …, R’n}

  • Spatial Join: Compute all pairs of objects (R, R’) such that

    – R ∈ S1, R’ ∈ S2,
    – and R intersects R’ (R ∩ R’ ≠ ∅)
    – Spatial predicates other than intersection are also possible, e.g., all pairs of objects that are within a certain distance from each other

[Figure: rectangles A1 to A6 and B1 to B3]

Example: Spatial-Join of {A1, …, A6} and {B1, …, B3}
Answer Set: (A5, B1), (A4, B1), (A1, B2), (A6, B2), (A2, B3)
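
A naive version of the spatial join can be sketched as a nested loop over both sets. This is only meant to make the definition concrete; it assumes rectangles as (xmin, ymin, xmax, ymax) tuples with made-up coordinates, and it is not an index-based join algorithm.

```python
def intersects(a, b):
    """True if the rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_join(s1, s2):
    """All pairs (name1, name2) whose rectangles intersect (nested loop)."""
    return [(n1, n2)
            for n1, r1 in s1.items()
            for n2, r2 in s2.items()
            if intersects(r1, r2)]

# Hypothetical rectangles, not the ones from the figure.
s1 = {"A1": (0, 0, 2, 2), "A2": (5, 5, 6, 6)}
s2 = {"B1": (1, 1, 3, 3), "B2": (7, 7, 8, 8)}
pairs = spatial_join(s1, s2)
```

The nested loop compares every pair, so its cost grows with m · n; avoiding this is one motivation for spatial index structures.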

SLIDE 21

Index Support for Spatial Queries

  • Conventional index structures such as B-trees are not designed to support spatial queries

    – Group objects only along one dimension
    – Do not preserve spatial proximity

  • E.g., nearest neighbor query: the nearest neighbor of Q is typically not the nearest neighbor in any single dimension

[Figure: query point Q, its nearest neighbor NN(Q), and points A, B, C, D in the X/Y plane. A and B are closer in the X dimension; C and D are closer in the Y dimension.]

SLIDE 22

Index Support for Spatial Queries

  • Spatial index structures try to preserve spatial proximity

    – Group objects that are close to each other on the same data page
    – Problem: the number of bytes needed to store extended spatial objects (lines, polygons) varies
    – Solution:
        • Store approximations of spatial objects in the index structure, typically axis-parallel minimum bounding rectangles (MBR)
        • Exact object representation (ER) stored separately; pointers to the ER in the index

[Figure: spatial index with entries of the form (MBR, pointer, ...); each pointer refers to a separately stored ER]

SLIDE 23

Query Processing Using Approximations

Two-Step Procedure

1. Filter Step:

    – Use the index to find all approximations that satisfy the query
    – Some objects already satisfy the query based on the approximation; others have to be checked in the refinement step (candidate set)

2. Refinement Step:

    – Load the exact object representations for the candidates left after the filter step and test whether they satisfy the query

[Figure: query → filter → candidates → refinement (exact evaluation) → final results; false hits are discarded]

Example (query window with objects a to g):

  • a and b are certainly answers
  • f, d, and g are certainly not answers
  • c and e are candidates
  • c is a false hit (a candidate that is not an answer)

Why?
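
The two-step procedure can be sketched for a window query. The setup below is an assumption chosen to keep the example short: the exact objects are circles (cx, cy, r), their approximations are the MBRs of the circles, and the object names and coordinates are made up to mirror the roles of a, c, e, and f in the example.

```python
def mbr_of_circle(c):
    cx, cy, r = c
    return (cx - r, cy - r, cx + r, cy + r)

def rect_intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rect_contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def circle_intersects_rect(c, w):
    """Exact test: distance from the centre to the closest window point."""
    cx, cy, r = c
    nx = min(max(cx, w[0]), w[2])
    ny = min(max(cy, w[1]), w[3])
    return (cx - nx) ** 2 + (cy - ny) ** 2 <= r * r

def window_query(circles, window):
    answers, false_hits = [], []
    for name, c in circles.items():
        mbr = mbr_of_circle(c)
        if not rect_intersects(mbr, window):
            continue                      # filter: certainly not an answer
        if rect_contains(window, mbr):
            answers.append(name)          # filter: certainly an answer
        elif circle_intersects_rect(c, window):
            answers.append(name)          # refinement: candidate confirmed
        else:
            false_hits.append(name)       # refinement: false hit
    return answers, false_hits

circles = {"a": (5, 5, 1), "c": (11, 11, 1.2), "e": (10.5, 5, 1), "f": (20, 20, 1)}
answers, false_hits = window_query(circles, (0, 0, 10, 10))
```

In this sketch, c is a false hit because only the empty corner of its MBR reaches into the window, while the circle itself does not.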

SLIDE 24

Spatial Data Management

  • Shortcomings of Relational Databases
  • Modeling Spatial Data
  • Spatial Queries
  • Space-Filling Curves + B-Trees
  • R-trees

SLIDE 25

Embedding of the 2-dimensional space into a 1 dimensional space

  • Basic idea:

    – The data space is partitioned into rectangular cells.
    – Use a space-filling curve to assign cell numbers to the cells (define a linear order on the cells)
        • The curve should preserve spatial proximity as well as possible
        • Cell numbers should be easy to compute
    – Objects are approximated by cells.
    – Store the cell numbers for objects in a conventional index structure with respect to the linear order

[Figure: 8 × 8 grid of cells numbered along a space-filling curve]

SLIDE 26

Space Filling Curves

[Figure: three 8 × 8 grids with cells numbered 0 to 63 in lexicographic order, in Z-order, and along the Hilbert curve]

  • Z-Order preserves spatial proximity relatively well
  • Z-Order is easy to compute
SLIDE 27

Z-Order – Z-Values

  • Coding of cells

    – Partition the data space recursively into two halves
    – Alternate X and Y dimension
    – Left/bottom: 0
    – Right/top: 1
    – Z-value: (c, l)
        • c = decimal value of the bit string
        • l = level (number of bits)
        • If all cells are on the same level, then l can be omitted

[Figure: recursive partitioning of the X/Y data space with example cells: bit string 10 has c = 2, bit string 0111 has c = 7, bit string 010000 has c = 16]
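
The coding can be sketched in a few lines. The sketch assumes that the first split is along the X dimension, so that the bits of x and y are interleaved starting with the most significant bit of x; the function names are made up for this example.

```python
def z_from_bits(bit_string):
    """Z-value (c, l) of a cell given by its bit string."""
    return int(bit_string, 2), len(bit_string)

def z_value(x, y, bits):
    """Z-value of the grid cell (x, y) on a 2^bits x 2^bits grid.

    Interleaves the bits of x and y, most significant bit first,
    taking the x bit before the y bit (assumed split order).
    """
    c = 0
    for i in reversed(range(bits)):
        c = (c << 1) | ((x >> i) & 1)
        c = (c << 1) | ((y >> i) & 1)
    return c, 2 * bits
```

For example, z_from_bits("0111") yields (7, 4), and on an 8 × 8 grid the cell with x = 1, y = 0 receives the Z-value (2, 6).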

SLIDE 28

Z-Order – Representation of Spatial Objects

  • For points

    – Use a fixed resolution of the space in both dimensions, i.e., each cell has the same size
    – Each point is then approximated by one cell

  • For extended spatial objects

    – Minimum enclosing cell
        • Problems with objects that already intersect the first partitions (their minimum enclosing cell becomes very large)
    – Improvement: use several cells
        • Better approximation of the objects
        • Redundant storage
        • Redundant retrieval in spatial queries (a query returns the same answer several times)

[Figure: coding of R by one cell C, and coding of R by several cells C1 to C4 on the Z-order grid; a query window that overlaps several cells of R]

SLIDE 29

Z-Order – Mapping to a B+-Tree

  • Linear order for Z-values to store them in a B+-tree:

Let (c1, l1) and (c2, l2) be two Z-values and let l = min{l1, l2}. The order relation ≤Z (which defines a linear order on Z-values) is then defined by

$(c_1, l_1) \le_Z (c_2, l_2) \iff (c_1 \operatorname{div} 2^{\,l_1 - l}) \le (c_2 \operatorname{div} 2^{\,l_2 - l})$

Examples: (1,2) ≤Z (3,2), (3,4) ≤Z (3,2), (1,2) ≤Z (10,4)
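
The order relation translates directly into code: both values are truncated to the shorter level l before they are compared. A minimal sketch, with Z-values as (c, l) tuples and a made-up function name:

```python
def z_leq(z1, z2):
    """(c1, l1) <=_Z (c2, l2): compare both values at level l = min(l1, l2)."""
    (c1, l1), (c2, l2) = z1, z2
    l = min(l1, l2)
    # Integer division by 2^(li - l) is a right shift by (li - l) bits.
    return (c1 >> (l1 - l)) <= (c2 >> (l2 - l))
```

Note that a cell and any cell nested inside it compare as equal in both directions, since the longer value is cut down to the common prefix.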

SLIDE 30

Mapping to a B+-Tree - Example

[Figure: B+-tree over the Z-values (0,2), (2,3), (4,3), (6,3), (7,3), (6,4), (7,4), (11,4), (20,5), (21,5) of the coded cells]

Orderings shown in the figure:

(0,2) ≤ (7,4) ≤ (7,4) ≤ (6,3)
(2,3) ≤ (7,4) ≤ (4,3) ≤ (6,3)
(6,4) ≤ (7,4) ≤ (20,5) ≤ (6,3)

Exact representations are stored in a different location.

SLIDE 31

Mapping to a B+-Tree – Window Query

  • Window Query ⇒ Range Query in the B+-tree

    – Find all entries (Z-values) in the range [l, u] where
        • l = smallest Z-value of the window (bottom left corner)
        • u = largest Z-value of the window (top right corner)
        • l and u are computed with respect to the maximum resolution/length of the Z-values in the tree (here: 6)

Example: Window: Min = (0,6), Max = (10,6)
Result: (0,2); the entry (2,3) already lies beyond the window, since (10,6) ≤ (2,3)

[Figure: the B+-tree over the Z-values (0,2), (2,3), (4,3), (6,3), (7,3), (6,4), (7,4), (11,4), (20,5), (21,5)]

SLIDE 32

Spatial Data Management

  • Shortcomings of Relational Databases
  • Modeling Spatial Data
  • Spatial Queries
  • Space-Filling Curves + B-Trees
  • R-trees

SLIDE 33

The R-Tree – Properties

  • Balanced tree designed to organize rectangles [Gut 84].
  • Each page contains between m and M entries.
  • Data page entries are of the form (MBR, PointerToExactRepr).

    – MBR is the minimum bounding rectangle of the spatial object that PointerToExactRepr points to

  • Directory page entries are of the form (MBR, PointerToSubtree).

    – MBR is the minimum bounding rectangle of all entries in the subtree that PointerToSubtree points to.

  • Rectangles can overlap
  • The height h of an R-tree for N spatial objects:

$h \le \lceil \log_m N \rceil + 1$

[Figure: R-tree with directory level 1, directory level 2, data pages, and the separately stored exact representations]

SLIDE 34

The R-Tree – Queries

[Figure: R-tree with directory entries R, S, T over the data rectangles A1 to A6, showing the paths that each query has to follow. Point query: answer set []. Window query: answer set [A2, A3].]
SLIDE 35

The R-Tree – Queries

PointQuery (Page, Point);
  FOR ALL Entry ∈ Page DO
    IF Point IN Entry.MBR THEN
      IF Page = DataPage THEN
        PointInPolygonTest (load(Entry.ExactRepr), Point)
      ELSE
        PointQuery (Entry.Subtree, Point);

WindowQuery (Page, Window);
  FOR ALL Entry ∈ Page DO
    IF Window INTERSECTS Entry.MBR THEN
      IF Page = DataPage THEN
        Intersection (load(Entry.ExactRepr), Window)
      ELSE
        WindowQuery (Entry.Subtree, Window);

First call: Page = Root of the R-tree
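
The recursive search can be sketched in Python. This sketch simplifies in one respect, which is an assumption and not part of the pseudocode: the stored objects are themselves rectangles, so the MBR test at the data page is already exact and no separate exact representation is loaded. The class and field names are made up for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    mbr: tuple            # (xmin, ymin, xmax, ymax)
    child: object = None  # subtree (Page) for directory entries
    obj: object = None    # object name for data page entries

@dataclass
class Page:
    is_data_page: bool
    entries: list = field(default_factory=list)

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def window_query(page, window, result):
    """Follow every entry whose MBR intersects the window."""
    for e in page.entries:
        if intersects(e.mbr, window):
            if page.is_data_page:
                result.append(e.obj)
            else:
                window_query(e.child, window, result)
    return result

def point_query(page, point, result):
    """A point query is a window query with a degenerate window."""
    x, y = point
    return window_query(page, (x, y, x, y), result)

# Hypothetical two-level tree: first call uses the root page.
r = Page(True, [Entry((0, 0, 1, 1), obj="A1"), Entry((2, 2, 3, 3), obj="A2")])
s = Page(True, [Entry((5, 5, 6, 6), obj="A3")])
root = Page(False, [Entry((0, 0, 3, 3), child=r), Entry((5, 5, 6, 6), child=s)])
```

Because directory MBRs can overlap, a query may have to descend into several subtrees; a query that hits no directory MBR ends at the root.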

SLIDE 36

R-Tree Construction – Optimization Goals

  • Overlap between the MBRs

    ⇒ spatial queries have to follow several paths
    ⇒ try to minimize overlap

  • Empty space in MBRs

    ⇒ spatial queries may have to follow irrelevant paths
    ⇒ try to minimize area and empty space in MBRs

Example (M = 3, m = 2):

Start: empty data page (= root)
Insert: A5, A1, A3, A4 ⇒ the data page with A5, A1, A3, A4 overflows (*)

[Figure: rectangles A1, A3, A4, A5 in the X/Y plane]

SLIDE 37

R-Tree Construction – Important Issues

  • Split strategy: How to divide a set of rectangles into 2 sets?

[Figure: the overflowing page with A5, A1, A4, A3 is split into 2 pages R and S]

  • Insertion strategy: Where to insert a new rectangle?

[Figure: inserting A2 into the tree with the pages R and S]
SLIDE 38

R-Tree Construction – Insertion Strategies

  • Dynamic construction by insertion of rectangles R

    – The search for the data page into which R will be inserted traverses the tree from the root to a data page.
    – When considering the entries of a directory page P, 3 cases can occur:
        • 1. R falls into exactly one Entry.MBR ⇒ follow Entry.Subtree
        • 2. R falls into the MBRs of more than one entry e1, ..., en ⇒ follow ei.Subtree for the entry ei with the smallest area of ei.MBR.
        • 3. R does not fall into any Entry.MBR of the current page ⇒ check the increase in area of the MBR for each entry when enlarging the MBR to enclose R. Choose the entry with the minimum increase in area (if this entry is not unique, choose the one with the smallest area); enlarge Entry.MBR and follow Entry.Subtree

  • Construction by “bulk-loading” the rectangles

    – Sort the rectangles, e.g., using Z-Order
    – Create the R-tree “bottom-up”
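
The three cases of the dynamic insertion strategy can be sketched as a function that picks the directory entry to follow. The sketch assumes rectangles as (xmin, ymin, xmax, ymax) tuples and a hypothetical function name choose_entry; it returns only the index of the chosen entry, so enlarging the chosen MBR in case 3 is left to the caller.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def enlarge(r, s):
    """Smallest rectangle that encloses both r and s."""
    return (min(r[0], s[0]), min(r[1], s[1]), max(r[2], s[2]), max(r[3], s[3]))

def choose_entry(mbrs, r):
    """Index of the directory entry to follow when inserting rectangle r."""
    covering = [i for i, m in enumerate(mbrs) if contains(m, r)]
    if covering:
        # Cases 1 and 2: among the MBRs that contain r, take the smallest.
        return min(covering, key=lambda i: area(mbrs[i]))
    # Case 3: minimum increase in area, ties broken by the smallest area.
    return min(range(len(mbrs)),
               key=lambda i: (area(enlarge(mbrs[i], r)) - area(mbrs[i]),
                              area(mbrs[i])))

mbrs = [(0, 0, 10, 10), (2, 2, 6, 6), (20, 20, 30, 30)]
```

With these made-up MBRs, the rectangle (3, 3, 4, 4) falls into two MBRs and is routed to the smaller one, while (11, 0, 12, 1) falls into none and goes to the entry that needs the least enlargement.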

SLIDE 39

R-Tree Construction – Split

  • Insertion will eventually lead to an overflow of a data page

    – The parent entry for that page is deleted.
    – The page is split into 2 new pages according to a split strategy.
    – 2 new entries pointing to the newly created pages are inserted into the parent page.
    – A possible overflow in the parent page is then handled recursively in a similar way; if the root has to be split, a new root is created to contain the entries pointing to the newly created pages.

[Figure (M = 3, m = 2): inserting A6 causes an overflow (*) in a data page of the tree with the pages R and S; the overflowing node is split, giving the pages U, S, and V]

SLIDE 40

R-Tree Construction – Splitting Strategies

  • Overflow of node K with |K| = M+1 entries ⇒ distribution of the entries into two new nodes K1 and K2 such that |K1| ≥ m and |K2| ≥ m
  • Exhaustive algorithm:

    – Searching for the “best” split in the set of all possible splits is too expensive (O(2^M) possibilities!)

  • Quadratic algorithm:

    – Choose the pair of rectangles R1 and R2 that has the largest value d(R1, R2) for the empty space in an MBR that covers both R1 and R2.

      d(R1, R2) := Area(MBR(R1 ∪ R2)) – (Area(R1) + Area(R2))

    – Set K1 := {R1} and K2 := {R2}
    – Repeat until STOP
        • if all Ri are assigned: STOP
        • if all remaining Ri are needed to fill the smaller node to guarantee minimal occupancy m: assign them to the smaller node and STOP
        • else: choose the next Ri and assign it to the node that will have the smallest increase in area of the MBR by the assignment. If not unique: choose the Ki that covers the smaller area (if still not unique: the one with fewer entries).
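
The quadratic algorithm as described on this slide can be sketched as follows. The sketch assumes rectangles as (xmin, ymin, xmax, ymax) tuples, takes "the next Ri" simply in input order, and uses made-up function names and coordinates.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """MBR that covers both rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds(rects):
    """Indices of the pair with the largest empty space d(R1, R2)."""
    def d(i, j):
        return area(union(rects[i], rects[j])) - (area(rects[i]) + area(rects[j]))
    return max(combinations(range(len(rects)), 2), key=lambda p: d(*p))

def quadratic_split(rects, m):
    i, j = pick_seeds(rects)
    groups = [[rects[i]], [rects[j]]]
    mbrs = [rects[i], rects[j]]
    remaining = [r for k, r in enumerate(rects) if k not in (i, j)]
    while remaining:
        small = 0 if len(groups[0]) <= len(groups[1]) else 1
        if len(groups[small]) + len(remaining) <= m:
            # All remaining entries are needed to reach minimal occupancy m.
            groups[small].extend(remaining)
            break
        r = remaining.pop(0)
        # Smallest MBR enlargement; ties: smaller area, then fewer entries.
        g = min((0, 1), key=lambda k: (area(union(mbrs[k], r)) - area(mbrs[k]),
                                       area(mbrs[k]), len(groups[k])))
        groups[g].append(r)
        mbrs[g] = union(mbrs[g], r)
    return groups

# Overflowing node with M + 1 = 4 entries (M = 3, m = 2).
rects = [(0, 0, 1, 1), (1, 0, 2, 1), (10, 10, 11, 11), (11, 10, 12, 11)]
groups = quadratic_split(rects, 2)
```

Evaluating d for all pairs is what makes the seed selection quadratic in the number of entries.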

SLIDE 41

R-Tree Construction – Splitting Strategies

  • Linear algorithm:

    – Same as the quadratic algorithm, except for the choice of the initial pair: choose the pair with the largest normalized distance.
        • For each dimension, determine the rectangle with the largest minimal value and the rectangle with the smallest maximal value (the difference is the maximal distance/separation).
        • Normalize the maximal distance of each dimension by dividing by the sum of the extensions of the rectangles in this dimension.
        • Choose the pair of rectangles that has the greatest normalized distance. Set K1 := {R1} and K2 := {R2}.

[Figure: rectangles in the X/Y plane, marking the smallest maximal value and the largest minimal value in each dimension, and the resulting max. distance for X and for Y]

SLIDE 42

R-Trees – Variants

  • Many variants of R-trees exist,

    – e.g., the R*-tree, the X-tree for higher-dimensional point data, …
    – For further information see http://www.cs.umd.edu/~hjs/rtrees/index.html (includes an interactive demo)

  • R-trees are also efficient index structures for point data, since points can be modeled as “degenerate” rectangles

    – Multi-dimensional points for which a distance function between the points is defined play an important role for similarity search in so-called “feature” or “multi-media” databases.

[Figure: R-tree over the points P1 to P13]

SLIDE 43

Examples of Feature Databases

  • Measurements for celestial objects (e.g., intensity of emission in different wavelengths)
  • Colour histograms of images
  • Documents, shape descriptors, …

n d-dimensional feature vectors: (o11, o12, …, o1d), (o21, o22, …, o2d), …, (on1, on2, …, ond)

SLIDE 44

Feature Databases and Similarity Queries

  • Objects + metric distance function

    – The distance function measures (dis)similarity between objects

  • Basic types of similarity queries

    – Range queries with range ε
        • Retrieve all objects that are similar to the query object up to a certain degree ε
    – k-nearest neighbor queries
        • Retrieve the k objects most similar to the query object

[Figure: a range query with query range ε around the query object, and a 3-nn query with its 3-nn distance]
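
Both query types can be sketched with a linear scan over the feature vectors. The sketch assumes the Euclidean distance as the metric (the slides leave the distance function open) and uses made-up function names and data; an index such as an R-tree would replace the scan in practice.

```python
import heapq
import math

def range_query(db, q, eps):
    """All feature vectors within distance eps of the query object q."""
    return [o for o in db if math.dist(o, q) <= eps]

def knn_query(db, q, k):
    """The k feature vectors closest to the query object q."""
    return heapq.nsmallest(k, db, key=lambda o: math.dist(o, q))

# Hypothetical 2-dimensional feature vectors.
db = [(0, 0), (1, 0), (0, 2), (5, 5)]
```

The range query returns a variable number of objects that depends on ε, whereas the k-nearest neighbor query always returns k objects (if the database contains at least k).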
