Spatial Indexing: Ramakrishnan/Gehrke Ch. 28



SLIDE 1

1 340151 Big Data & Cloud Services (P. Baumann)

Spatial Indexing

Ramakrishnan/Gehrke Ch. 28

SLIDE 2

Applications of Multidimensional Data

  • Geographic Information Systems (GIS)
      • Geospatial information; service standards by the Open Geospatial Consortium (OGC)
      • Vendors: ESRI, Intergraph, SmallWorld, …, Oracle, …; open source: GRASS, PostGIS, …
      • All classes of spatial queries and data are common
  • Computer-Aided Design / Manufacturing (CAD/CAM)
      • Spatial objects, e.g., the surface of an airplane fuselage
      • Range queries and spatial join queries are common
  • Multimedia Databases
      • Images, video, text, etc. stored and retrieved by content
      • Content is first converted to feature vectors; high dimensionality
      • Nearest-neighbor queries are the most common

SLIDE 3

Multidimensional Data

  • Point data
      • = points in a multidimensional space
      • Ex: geographic locations; feature vectors extracted from text
  • Region data
      • = objects having spatial extent, with location and boundary
      • DB typically uses geometric approximations constructed from line segments, polygons, etc., called vector data
  • What about raster data such as satellite imagery?
      • = each pixel stores a measured value
      • For this chapter we assume vector data; raster data and their operations have specific, rather distinct characteristics

SLIDE 4

Multidimensional Queries

  • Point queries
      • "Show Bremen"
  • Spatial range queries
      • "Find all hotels within a radius of 5 miles from the conference venue"
      • "Find all cities that lie on the Nile in Egypt"
      • 50 < age < 55 AND 80K < sal < 90K
      • 50 < Lat < 55 AND 80 < Long < 90
  • Nearest-neighbor queries
      • "Find the 10 cities nearest to Bremen"
      • "Find the city with population 500,000 or more that is nearest to Kalamazoo, MI"
  • Spatial join queries
      • "Find all cities near a lake"
      • "Find all parts that touch the fuselage" (in airplane design)
      • Expensive; join condition involves regions and proximity!
  • Similarity queries
      • "Given a face, find the five most similar faces"
  • …plus aggregation, and more
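
As a baseline for the query types above, a range query and a k-nearest-neighbor query can both be answered without any index by scanning every point. The sketch below assumes a small made-up table of cities with rounded (lat, long) values and plain Euclidean distance on the raw coordinates; the names `CITIES`, `range_query`, and `knn` are illustrative only.

```python
import heapq
import math

# Hypothetical sample data: name -> (lat, long), rounded values.
CITIES = {
    "Bremen": (53.1, 8.8),
    "Hamburg": (53.6, 10.0),
    "Hannover": (52.4, 9.7),
    "Berlin": (52.5, 13.4),
    "Munich": (48.1, 11.6),
}

def range_query(points, lat_lo, lat_hi, long_lo, long_hi):
    """Names of all points with lat_lo < lat < lat_hi and long_lo < long < long_hi."""
    return sorted(name for name, (lat, lon) in points.items()
                  if lat_lo < lat < lat_hi and long_lo < lon < long_hi)

def knn(points, query, k):
    """Names of the k points closest to `query` (Euclidean distance, full scan)."""
    return heapq.nsmallest(k, points,
                           key=lambda name: math.dist(points[name], query))

in_box = range_query(CITIES, 52, 54, 8, 11)
nearest_three = knn(CITIES, CITIES["Bremen"], 3)
```

Both functions touch every point, which is exactly the cost a multidimensional index is meant to avoid.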

SLIDE 5

First Try: Composite B+ Tree

  • With the Emp relation: index on <age, sal>
      • Sorts entries first by age and then by sal
  • Observation: a composite-search-key B+ tree linearizes 2-D space
  • Problems:
      • Spatial proximity is lost: "close in nature" should imply "close on disk"
      • Not symmetric in dimensions

Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

[Figure: the entries plotted by age (11 to 13) and sal (10 to 80), contrasting the B+ tree order with the spatial clusters]
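
The loss of spatial proximity can be checked on the four sample entries. The sketch below assumes that tuple sorting models the composite <age, sal> key order and uses unscaled Euclidean distance as the notion of "close in nature"; the helper names are made up for illustration.

```python
import math

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]

# A composite <age, sal> key orders entries lexicographically,
# which is the same order Python uses when sorting tuples.
btree_order = sorted(entries)

def nearest(p, points):
    """Spatially nearest other entry (Euclidean distance, an assumption)."""
    return min((q for q in points if q != p), key=lambda q: math.dist(p, q))

def index_neighbors(p, ordered):
    """Entries stored directly before and after p in the linear order."""
    i = ordered.index(p)
    return ordered[max(i - 1, 0):i] + ordered[i + 1:i + 2]

spatial_neighbor = nearest((11, 80), entries)
stored_next_to = index_neighbors((11, 80), btree_order)
```

Here the spatially nearest entry to <11, 80> is <13, 75>, yet the only entry stored next to it is <12, 10>; the two entries of the same cluster end up at opposite ends of the order.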

SLIDE 6

Second Try: Multiple B+ Trees

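  • Several conventional indexes, one per attribute
  • Query example:

select * from R where a0 < A < a1 and b0 < B < b1

      • Read tuples with a0 < A < a1
      • Read tuples with b0 < B < b1
      • Intersect
  • Wanted: read only tuples with a0 < A < a1 and b0 < B < b1

[Figure: the A/B plane with the strip a0 < A < a1, the strip b0 < B < b1, and the rectangle where they intersect]

  • Problems:
      • Selects way too much data: each index returns a whole strip, not just the rectangle
      • Index space grows with dimensionality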
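
The read-read-intersect plan can be sketched with sorted lists standing in for the two B+ trees. The relation `R`, its values, and the helper names below are hypothetical; `bisect` provides the binary search a B+ tree lookup would do.

```python
from bisect import bisect_left, bisect_right

# Hypothetical relation R: rid -> (A, B)
R = {0: (1, 9), 1: (2, 2), 2: (3, 5), 3: (5, 4), 4: (6, 8), 5: (8, 3), 6: (9, 6)}

def build_index(rel, attr):
    """Sorted (value, rid) pairs; stands in for a B+ tree on one attribute."""
    return sorted((row[attr], rid) for rid, row in rel.items())

def open_range(index, lo, hi):
    """rids with lo < value < hi, located by binary search on the keys."""
    keys = [value for value, _ in index]
    return {rid for _, rid in index[bisect_right(keys, lo):bisect_left(keys, hi)]}

index_a = build_index(R, 0)
index_b = build_index(R, 1)

from_a = open_range(index_a, 2, 7)   # strip a0 < A < a1 with a0=2, a1=7
from_b = open_range(index_b, 3, 7)   # strip b0 < B < b1 with b0=3, b1=7
result = from_a & from_b             # rectangle = intersection of both strips
```

In this toy data each strip yields three rids while only two tuples qualify, and the gap widens as the strips get longer relative to the rectangle.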

SLIDE 7

Wanted: a Multi-Dimensional Index

  • Requirements:
      • Any number of dimensions
      • Symmetric behavior for all dimensions
      • Supports inserts and deletes gracefully
      • Ideally, supports non-point data as well (e.g., lines, shapes)
  • Zillions of approaches and variants in the literature
      • Grid file, quad-/octree, k-d-B-tree, space-filling curves, …
  • Core idea always: spatial clustering of entries on disk
  • We look into the R-tree
      • Widely used, in many variants

SLIDE 8

The R-Tree

  • R-tree = tree-structured n-D index [Guttman 1984]
  • Discriminating value of the B+ tree is replaced by bounding intervals (bounding box, bbox)
  • Index search is by bbox, not by exact (polygon) shape
  • Leaf entry = < n-dimensional box, rid >
      • Box is the tightest bounding box for the object
  • Non-leaf entry = < n-dim box, ptr to child node >
      • Box covers all boxes in the child node (in fact, in the whole subtree)
  • 2-D sketch:

[Figure: 2-D sketch over axes X and Y, showing the root of the R-tree and the leaf level]
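
The two kinds of boxes can be sketched with a small n-dimensional box type. This is an illustration under assumptions, not Guttman's data layout: `Box`, `bounding_box`, and `cover` are hypothetical names, and boxes are closed intervals per dimension.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple

@dataclass(frozen=True)
class Box:
    lo: Tuple[float, ...]   # lower corner, one value per dimension
    hi: Tuple[float, ...]   # upper corner, one value per dimension

    def overlaps(self, other: "Box") -> bool:
        """Boxes intersect iff their intervals intersect in every dimension."""
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

    def covers(self, other: "Box") -> bool:
        """True if `other` lies completely inside this box."""
        return all(s_lo <= o_lo and o_hi <= s_hi
                   for s_lo, s_hi, o_lo, o_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))

def bounding_box(points: Sequence[Tuple[float, ...]]) -> Box:
    """Tightest axis-parallel box around an object's vertices (leaf entry box)."""
    return Box(tuple(map(min, zip(*points))), tuple(map(max, zip(*points))))

def cover(boxes: Iterable[Box]) -> Box:
    """Smallest box covering all child boxes (non-leaf entry box)."""
    boxes = list(boxes)
    return Box(tuple(map(min, zip(*(b.lo for b in boxes)))),
               tuple(map(max, zip(*(b.hi for b in boxes)))))

triangle_box = bounding_box([(1, 1), (4, 2), (2, 5)])
segment_box = bounding_box([(6, 0), (8, 3)])
parent_box = cover([triangle_box, segment_box])
```

A leaf entry would pair `triangle_box` with the object's rid, and the non-leaf entry above it would pair `parent_box` with a pointer to the leaf.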

SLIDE 9

Sample R-Tree

[Figure: sample R-tree drawn in the plane with boxes R1 to R19; legend: leaf entry, index entry, spatial object]

SLIDE 10

Sample R-Tree (contd.)

[Figure: the same sample R-tree drawn as a tree over entries R1 to R19; edges mean "contains"]

SLIDE 11

Sample 3D R+-Tree [Wikipedia]

SLIDE 12

Search for Objects Overlapping Box Q

Start with current node := root.

  • 1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, then search the subtree identified by ptr.
  • 2. If the current node is a leaf: for each entry <E, rid>, if box E overlaps Q, then rid identifies an object that might overlap Q.
  • May have to search several subtrees at each node! (A B-tree equality search goes to just one leaf.)
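
The two steps translate almost directly into a recursive function. The sketch below assumes 2-D boxes written as (xlo, ylo, xhi, yhi) tuples and a hand-built two-level tree with made-up data; `Node`, `overlaps`, and `search` are illustrative names, and the optional `visited` list exists only to show how many nodes get read.

```python
from dataclasses import dataclass, field

def overlaps(a, b):
    """True if 2-D boxes a and b, given as (xlo, ylo, xhi, yhi), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Leaf: list of (box, rid). Non-leaf: list of (box, child Node).
    entries: list = field(default_factory=list)

def search(node, q, visited=None):
    """Return rids of all leaf entries whose box overlaps query box q."""
    if visited is not None:
        visited.append(node)
    hits = []
    for box, target in node.entries:
        if not overlaps(box, q):
            continue                      # prune this entry
        if node.is_leaf:
            hits.append(target)           # candidate: object might overlap q
        else:
            hits.extend(search(target, q, visited))
    return hits

# Hypothetical tree: the boxes of leaf1 and leaf2 overlap each other.
leaf1 = Node(True, [((0, 0, 2, 2), "a"), ((1, 3, 3, 5), "b")])
leaf2 = Node(True, [((2, 1, 6, 4), "c"), ((5, 5, 7, 7), "d")])
leaf3 = Node(True, [((8, 0, 9, 1), "e")])
root = Node(False, [((0, 0, 3, 5), leaf1), ((2, 1, 7, 7), leaf2), ((8, 0, 9, 1), leaf3)])

q = (2.5, 3.5, 4, 4.5)
visited = []
candidates = search(root, q, visited)
```

Because the boxes of `leaf1` and `leaf2` overlap and both intersect `q`, the search descends into two subtrees, while `leaf3` is pruned at the root.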

SLIDE 13

R-Tree Properties

  • Balanced: all leaves at the same distance from the root
      • Remains balanced on inserts and deletes
  • Nodes can be kept at least 50% full (except the root)
      • Can choose a parameter m <= 50% and ensure that every node is at least m% full
  • Inexact match: search by object bounding box, not by the object itself
      • Needs a (usually inexpensive) exact-match step afterwards
      • Common to all multidimensional access methods
  • Generally good behavior in practice, but not necessarily good worst-case performance
      • Priority R-tree: as efficient as the R-tree, plus worst-case optimal [Arge, de Berg, Haverkort, Yi 2004]
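
The inexact match and the exact-match step afterwards are often described as "filter" and "refine". The sketch below assumes circles as spatial objects and a point query, both chosen only because the exact test is a one-liner; `CIRCLES`, `filter_step`, and `refine_step` are hypothetical names, and the filter is a plain scan standing in for the index lookup.

```python
import math

# Hypothetical spatial objects: rid -> circle (cx, cy, r)
CIRCLES = {"c1": (0.0, 0.0, 1.0), "c2": (3.0, 3.0, 1.0), "c3": (0.5, 0.5, 0.2)}

def bbox(circle):
    """Tightest axis-parallel box (xlo, ylo, xhi, yhi) around a circle."""
    cx, cy, r = circle
    return (cx - r, cy - r, cx + r, cy + r)

def filter_step(objects, p):
    """Candidates whose bounding box contains p (what the index can answer)."""
    x, y = p
    hits = []
    for rid, circle in objects.items():
        xlo, ylo, xhi, yhi = bbox(circle)
        if xlo <= x <= xhi and ylo <= y <= yhi:
            hits.append(rid)
    return sorted(hits)

def refine_step(objects, candidates, p):
    """Exact test on the real geometry: keep circles that actually contain p."""
    return [rid for rid in candidates
            if math.dist(p, objects[rid][:2]) <= objects[rid][2]]

corner_point = (0.9, 0.9)             # inside c1's box, outside the circle
corner_candidates = filter_step(CIRCLES, corner_point)
corner_result = refine_step(CIRCLES, corner_candidates, corner_point)

inner_point = (0.6, 0.6)
inner_candidates = filter_step(CIRCLES, inner_point)
inner_result = refine_step(CIRCLES, inner_candidates, inner_point)
```

For `corner_point` the filter reports `c1` although the circle does not contain the point; the refine step removes that false positive. The filter never misses a true match, since the box encloses the object.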

SLIDE 14

R-Tree Variants

  • R+ tree [Sellis, Roussopoulos, Faloutsos 1987]: avoids overlap by inserting an object into multiple leaves if necessary
      • Single path to leaf
      • …at the cost of redundancy
  • R* tree [Beckmann, Kriegel, Schneider, Seeger 1990]: forced reinserts to reduce overlap in tree nodes
      • When a node overflows, instead of splitting:
      • Remove some (say, 30% of the) entries and reinsert them into the tree
      • Could result in all reinserted entries fitting on some existing pages, avoiding a split
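
The slide does not say which entries get removed on overflow. The sketch below assumes the selection rule usually attributed to the R*-tree paper: take the entries whose box centers lie farthest from the center of the node's bounding box. Only the selection is shown; the reinsertion itself and the fallback to a split are omitted, and all names and data are hypothetical.

```python
import math

def center(box):
    """Center point of a 2-D box (xlo, ylo, xhi, yhi)."""
    xlo, ylo, xhi, yhi = box
    return ((xlo + xhi) / 2, (ylo + yhi) / 2)

def cover(boxes):
    """Smallest box covering all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def pick_reinserts(boxes, fraction=0.3):
    """Split an overflowing node's boxes into (kept, to_reinsert)."""
    node_center = center(cover(boxes))
    count = max(1, int(len(boxes) * fraction))
    # Farthest-from-center entries come first.
    by_distance = sorted(boxes,
                         key=lambda b: math.dist(center(b), node_center),
                         reverse=True)
    return by_distance[count:], by_distance[:count]

# Hypothetical overflowing node: four nearby boxes and one outlier.
overflowing = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 1, 4, 3), (3, 2, 5, 4), (8, 8, 10, 10)]
kept, to_reinsert = pick_reinserts(overflowing)
```

In this example the outlier is selected for reinsertion, and the node's bounding box shrinks from (0, 0, 10, 10) to (0, 0, 5, 4), which leaves less room for overlap with sibling nodes.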

SLIDE 15

Summary

  • Index support for multidimensional queries has many applications
      • GIS, CAD/CAM, …: spatio-temporal, 2..4-D
      • Multimedia indexing, statistical databases: non-spatial dimensions, 3-D..12-D..10,000-D…
  • Fundamental difference between space/time and feature spaces
      • Up to 4-D vs. 1000s of dimensions
      • R-tree tends to be worse than a sequential scan for 12+ dimensions
  • Main multidimensional data and query types:
      • Point and region data
      • Overlap/containment and nearest-neighbor queries

SLIDE 16

Summary (contd.)

  • Many approaches to indexing; the R-tree approach is widely used in GIS
      • Overall, works quite well for 2..4-D datasets
      • Several variants (notably R+ and R* trees) have been proposed and are widely used
      • Search performance can be improved by using a convex polygon (instead of a bounding box) to approximate the query shape and testing for polygon-box intersection
  • Issues
      • For high-dimensional datasets, unless the data has good "contrast", the nearest neighbor may not be well separated from the other points
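
The "contrast" issue can be made visible with a small experiment. The sketch below assumes uniformly random points in the unit cube and measures contrast as (d_max - d_min) / d_min over the distances from one random query point; this particular measure, the sample size, and the function name are choices made for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)       # nearest point is far closer than farthest
high_dim = relative_contrast(1000)   # all points at nearly the same distance
```

With uniform data the 2-D value is large, while the 1000-D value is well below 1: the farthest point is only slightly farther away than the nearest one, so "nearest" carries little information and pruning by bounding boxes stops paying off.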