
Spatial Indexing: Ramakrishnan/Gehrke Ch. 28



  1. Spatial Indexing (Ramakrishnan/Gehrke Ch. 28). 340151 Big Data & Cloud Services (P. Baumann)

  2. Applications of Multidimensional Data
     ▪ Geographic Information Systems (GIS)
       • Geospatial information; service standards by the Open Geospatial Consortium (OGC)
       • Vendors: ESRI, Intergraph, SmallWorld, …, Oracle, …; open source: GRASS, PostGIS, …
       • All classes of spatial queries and data are common
     ▪ Computer-Aided Design / Manufacturing
       • Spatial objects, e.g. the surface of an airplane fuselage
       • Range queries and spatial join queries are common
     ▪ Multimedia Databases
       • Images, video, text, etc. stored and retrieved by content
       • First converted to feature vector form; high dimensionality
       • Nearest-neighbor queries are the most common

  3. Multidimensional Data
     ▪ Point Data
       • = points in a multidimensional space
       • Ex: geographic locations; feature vectors extracted from text
     ▪ Region Data
       • = objects having spatial extent, with location and boundary
       • DB typically uses geometric approximations constructed from line segments, polygons, etc., called vector data
     ▪ What about raster data such as satellite imagery?
       • = each pixel stores a measured value
       • For this chapter we assume vector data; raster data and their operations have specific, rather distinct characteristics

  4. Multidimensional Queries
     ▪ Point queries
       • "Show Bremen"
     ▪ Spatial Range queries
       • "Find all hotels within a radius of 5 miles from the conference venue"
       • "Find all cities that lie on the Nile in Egypt"
       • 50 < age < 55 AND 80K < sal < 90K
       • 50 < Lat < 55 AND 80 < Long < 90
     ▪ Nearest-Neighbor queries
       • "Find the 10 cities nearest to Bremen"
       • "Find the city with population 500,000 or more that is nearest to Kalamazoo, MI"
     ▪ Spatial Join queries
       • "Find all cities near a lake"
       • "Find all parts that touch the fuselage" (in airplane design)
       • Expensive; join condition involves regions and proximity!
     ▪ Similarity queries
       • "Given a face, find the five most similar faces"
     ▪ …plus aggregation, and more
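To pin down what the range and nearest-neighbor queries above compute, here is a minimal brute-force sketch in Python. It scans every point, which is exactly the cost a spatial index tries to avoid. The sample data and the function names (`range_query`, `within_radius`, `nearest`) are made up for illustration and are not from the slides.

```python
import math

# Hypothetical sample data: (name, x, y) in arbitrary planar units.
cities = [
    ("A", 0.0, 0.0),
    ("B", 3.0, 4.0),
    ("C", 6.0, 8.0),
    ("D", 1.0, 1.0),
]

def range_query(points, lo, hi):
    """Names of points inside the axis-aligned box with corners lo and hi."""
    return [name for name, x, y in points
            if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]]

def within_radius(points, center, radius):
    """Names of points at most `radius` away from `center`."""
    return [name for name, x, y in points
            if math.dist((x, y), center) <= radius]

def nearest(points, center, k):
    """Names of the k points closest to `center`, nearest first."""
    ranked = sorted(points, key=lambda p: math.dist((p[1], p[2]), center))
    return [name for name, _, _ in ranked[:k]]

print(range_query(cities, (0, 0), (3, 4)))
print(within_radius(cities, (0, 0), 5))
print(nearest(cities, (0, 0), 2))
```

Every function touches all points once, so the cost grows linearly with the data size no matter how selective the query is.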

  5. First Try: Composite B+ Tree
     ▪ With Emp relation:
       • sort entries first by age and then by sal
       • Ex: index on <age, sal>
     ▪ Observation: a composite search key B+ tree linearizes 2-D space
     ▪ Problems:
       • Spatial proximity lost: "close in nature" should imply "close on disk"
       • Not symmetric in dimensions
     ▪ Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>
     [Figure: the four entries plotted with age (11 to 13) on the x-axis and sal (10 to 80) on the y-axis, contrasting spatial clusters with B+ tree order]
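The loss of proximity can be checked directly on the four entries from the slide. The sketch below uses Python's lexicographic tuple ordering as a stand-in for the composite <age, sal> key order; the helper name `closest_in_space` is an assumption made for this example.

```python
import math

# The four <age, sal> entries from the slide, in arbitrary insertion order.
entries = [(12, 20), (13, 75), (11, 80), (12, 10)]

# A composite-key B+ tree stores entries in lexicographic (age, sal) order.
btree_order = sorted(entries)
position = {e: i for i, e in enumerate(btree_order)}

def closest_in_space(entry):
    """The other entry with the smallest Euclidean distance in (age, sal) space."""
    others = (o for o in entries if o != entry)
    return min(others, key=lambda o: math.dist(entry, o))

for e in btree_order:
    nn = closest_in_space(e)
    gap = abs(position[e] - position[nn])
    print(e, "nearest in space:", nn, "positions apart in index:", gap)
```

<11, 80> and <13, 75> are neighbors in the plane, yet they land at opposite ends of the index order, with the two low-salary entries in between.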

  6. Second Try: Multiple B+ Trees
     ▪ Query example: select * from R where a0 < A < a1 and b0 < B < b1
     ▪ Several conventional indexes:
       - read tuples with a0 < A < a1
       - read tuples with b0 < B < b1
       - intersect
     ▪ Wanted: read only tuples with a0 < A < a1 and b0 < B < b1
     [Figure: two plots over axes A and B; the left shades the full strips a0 < A < a1 and b0 < B < b1, the right shades only their intersection rectangle]
     ▪ Problems:
       • Selects way too much data
       • Index space grows with dimensionality
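A small sketch of the "read both strips, then intersect" plan, using sorted Python lists and `bisect` as a stand-in for two single-attribute B+ trees. The relation contents and the helper `rids_in_open_range` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical relation R: rid -> (A, B).
rows = {0: (1, 9), 1: (2, 2), 2: (5, 5), 3: (6, 1), 4: (9, 6)}

# One sorted (key, rid) list per attribute stands in for one B+ tree each.
index_a = sorted((a, rid) for rid, (a, b) in rows.items())
index_b = sorted((b, rid) for rid, (a, b) in rows.items())

def rids_in_open_range(index, lo, hi):
    """rids whose key satisfies lo < key < hi."""
    start = bisect_right(index, (lo, float("inf")))   # skip keys <= lo
    end = bisect_left(index, (hi, float("-inf")))     # stop before keys >= hi
    return {rid for _, rid in index[start:end]}

# select * from R where 1 < A < 7 and 4 < B < 7
from_a = rids_in_open_range(index_a, 1, 7)
from_b = rids_in_open_range(index_b, 4, 7)
result = from_a & from_b

print("read via A index:", sorted(from_a))
print("read via B index:", sorted(from_b))
print("qualifying rids:", sorted(result))
```

Here five index entries are read to produce a single qualifying tuple; the gap widens as the strips get longer relative to their intersection.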

  7. Wanted: a Multi-Dimensional Index
     ▪ Requirements:
       • Any number of dimensions
       • Symmetric behavior for all dimensions
       • Supports inserts and deletes gracefully
       • Ideally, want to support non-point data as well (e.g., lines, shapes)
     ▪ Zillions of approaches and variants in literature
       • Grid file, Quad/Oct-tree, kdb-tree, space-filling curves, …
       • Core idea always: spatial clustering of entries on disk
     ▪ We look into the R-tree
       • Widely used, in many variants

  8. The R-Tree
     ▪ R-tree = tree-structured n-D index [Guttman 1984]
       • Discriminating value of the B+ tree substituted by bounding intervals (bbox)
       • Index search by bbox, not by exact (polygon) shape
     ▪ Leaf entry = <n-dimensional box, rid>
       • Tightest bounding box for the object
     ▪ Non-leaf entry = <n-dim box, ptr to child node>
       • Box covers all boxes in the child node (in fact, the subtree)
     ▪ 2-D sketch:
     [Figure: nested boxes in the X/Y plane, from the root of the R-tree down to the leaf level]
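The two entry kinds can be written down as plain data types. The following is a minimal sketch, not Guttman's original formulation: the class names (`Box`, `LeafEntry`, `IndexEntry`, `Node`) and the sample polygons are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, ...]

@dataclass(frozen=True)
class Box:
    lo: Point  # minimum coordinate per dimension
    hi: Point  # maximum coordinate per dimension

    @staticmethod
    def of_points(points: Sequence[Point]) -> "Box":
        """Tightest axis-aligned box around a set of n-D points."""
        dims = range(len(points[0]))
        return Box(tuple(min(p[d] for p in points) for d in dims),
                   tuple(max(p[d] for p in points) for d in dims))

    @staticmethod
    def enclosing(boxes: Sequence["Box"]) -> "Box":
        """Tightest box covering all given boxes."""
        return Box.of_points([b.lo for b in boxes] + [b.hi for b in boxes])

    def covers(self, other: "Box") -> bool:
        return (all(s <= o for s, o in zip(self.lo, other.lo))
                and all(o <= s for s, o in zip(self.hi, other.hi)))

@dataclass
class LeafEntry:
    box: Box   # bounding box of the object
    rid: int   # record id of the object itself

@dataclass
class Node:
    is_leaf: bool
    entries: list  # LeafEntry items in a leaf, IndexEntry items otherwise

@dataclass
class IndexEntry:
    box: Box     # covers every box in the child subtree
    child: Node

# Two hypothetical polygons, given by their vertices.
triangle = [(1, 1), (4, 2), (2, 5)]
square = [(6, 0), (8, 0), (8, 2), (6, 2)]

leaf = Node(True, [LeafEntry(Box.of_points(triangle), rid=1),
                   LeafEntry(Box.of_points(square), rid=2)])
parent = IndexEntry(Box.enclosing([e.box for e in leaf.entries]), leaf)
print(parent.box)
```

The parent box is derived purely from the child boxes, which is why the index never needs the exact polygon shapes during search.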

  9. Sample R-Tree
     [Figure: 2-D layout of the nested boxes R1 to R19; legend: leaf entry, index entry, spatial object]

  10. Sample R-Tree (contd.)
      [Figure: tree view of the same index; edges mean "contains". Root: R1, R2. Next level: R3 to R7. Leaf level: R8 to R19]

  11. Sample 3D R+-Tree
      [Figure: 3-D R+-tree, from Wikipedia]

  12. Search for Objects Overlapping Box Q
      Current node := root;
      1. If current node is non-leaf:
         for each entry <E, ptr>: if box E overlaps Q, then search the subtree identified by ptr;
      2. If current node is leaf:
         for each entry <E, rid>: if box E overlaps Q, then rid identifies an object that might overlap Q.
      ▪ May have to search several subtrees at each node! (B-tree equality search goes to just one leaf)
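The two steps translate almost line by line into a recursive function. This is a minimal 2-D sketch under assumed representations: boxes are `(xmin, ymin, xmax, ymax)` tuples and nodes are dictionaries with the keys `leaf` and `entries`; neither choice comes from the slides.

```python
def overlaps(a, b):
    """True if boxes a and b, each (xmin, ymin, xmax, ymax), share a point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, hits=None):
    """Collect rids of leaf entries whose box overlaps query box q."""
    if hits is None:
        hits = []
    for box, target in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                hits.append(target)       # target is a rid: candidate object
            else:
                search(target, q, hits)   # target is a child node: descend
    return hits

# A hypothetical two-level tree with four objects.
leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((1, 3, 3, 5), "r2")]}
leaf2 = {"leaf": True, "entries": [((6, 6, 8, 8), "r3"), ((7, 1, 9, 3), "r4")]}
root = {"leaf": False, "entries": [((0, 0, 3, 5), leaf1), ((6, 1, 9, 8), leaf2)]}

print(search(root, (1, 1, 2, 4)))    # only leaf1 is visited
print(search(root, (2.5, 0, 7, 2)))  # both subtrees are visited
```

The second query overlaps both root entries, so both leaves are read even though only one object qualifies, which is the multi-subtree effect noted on the slide.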

  13. R-Tree Properties
      ▪ Balanced: all leaves at the same distance from the root
        • Remains balanced on inserts and deletes
      ▪ Nodes can be kept 50% full (except root)
        • Can choose parameter m <= 50%, and ensure that every node is at least m% full
      ▪ Inexact match: search by object bounding box, not object
        • Needs (usually inexpensive) exact match step afterwards
        • Common for all multidimensional access methods
      ▪ Generally good behavior in practice, however not necessarily good worst-case performance
        • Priority R-Tree: as efficient as R-Tree + worst-case optimal [Arge, de Berg, Haverkort, Yi 2004]
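The inexact-match property leads to a two-phase "filter, then refine" evaluation. The sketch below illustrates it with circles as the stored objects, since their exact overlap test against a box is short; the object set, the query box, and all function names are assumptions for this example.

```python
import math

def bbox_of_circle(cx, cy, r):
    return (cx - r, cy - r, cx + r, cy + r)

def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circle_overlaps_box(cx, cy, r, box):
    """Exact test: distance from the center to the nearest box point is <= r."""
    nx = min(max(cx, box[0]), box[2])
    ny = min(max(cy, box[1]), box[3])
    return math.hypot(cx - nx, cy - ny) <= r

# Hypothetical objects: rid -> (center x, center y, radius).
circles = {"c1": (0.0, 0.0, 1.0), "c2": (5.0, 5.0, 1.0), "c3": (2.0, 2.0, 0.6)}
query = (0.8, 0.8, 1.6, 1.6)

# Filter step: what the index delivers, based on bounding boxes only.
candidates = [rid for rid, c in circles.items()
              if boxes_overlap(bbox_of_circle(*c), query)]

# Refine step: exact geometry test on the few surviving candidates.
results = [rid for rid in candidates
           if circle_overlaps_box(*circles[rid], query)]

print("candidates:", candidates)
print("results:", results)
```

`c1` passes the filter because the corner of its bounding box reaches into the query box, but the circle itself does not, so the refine step discards it.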

  14. R-Tree Variants
      ▪ R+ tree: [Sellis, Roussopoulos, Faloutsos 1987] avoid overlap by inserting an object into multiple leaves if necessary
        • Single path to leaf
        • …at cost of redundancy
      ▪ R* tree: [Beckmann, Kriegel, Schneider, Seeger 1990] forced re-inserts to reduce overlap in tree nodes
        • When a node overflows, instead of splitting:
          - Remove some (say, 30% of the) entries and reinsert them into the tree
          - Could result in all reinserted entries fitting on some existing pages, avoiding a split
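The slide does not say which entries get removed on overflow. One plausible rule, assumed here and modeled on the R*-tree idea of evicting the entries whose centers lie farthest from the center of the node's bounding box, can be sketched as follows. The function names and sample node are hypothetical, and the actual reinsertion into the tree is left out.

```python
import math

def center(box):
    """Center of a 2-D box (xmin, ymin, xmax, ymax)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def enclosing(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def pick_reinsert(entries, fraction=0.3):
    """Split an overflowing node's entries into (to_reinsert, to_keep).

    Assumed rule: evict the entries farthest from the node's center.
    """
    node_center = center(enclosing([box for box, _ in entries]))
    ranked = sorted(entries,
                    key=lambda e: math.dist(center(e[0]), node_center),
                    reverse=True)
    p = max(1, int(fraction * len(entries)))
    return ranked[:p], ranked[p:]

# Hypothetical overflowing node with five (box, rid) entries.
node_entries = [
    ((0, 1, 1, 2), "a"),
    ((3, 0, 4, 1), "b"),
    ((4, 0, 5, 1), "c"),
    ((3, 2, 4, 3), "d"),
    ((4, 2, 5, 2.5), "e"),
]

to_reinsert, to_keep = pick_reinsert(node_entries)
print("reinsert:", [rid for _, rid in to_reinsert])
print("keep:", sorted(rid for _, rid in to_keep))
```

The evicted entries would then be inserted again from the root, where they may land in a different node with spare room.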

  15. Summary
      ▪ Index support for multidimensional queries has many applications
        • GIS, CAD/CAM, …: spatio-temporal, 2..4-D
        • Multimedia indexing, statistical databases: non-spatial dimensions, 3-D..12-D..10,000-D…
      ▪ Fundamental difference between space/time and feature spaces
        • Up to 4-D vs. 1000s of dimensions
        • R-tree worse than sequential scan for 12+ D
      ▪ Main multidimensional data and query types:
        • Point and region data
        • Overlap/containment and nearest-neighbor queries

  16. Summary (contd.)
      ▪ Many approaches to indexing; R-tree approach widely used in GIS
        • Overall, works quite well for 2..4-D datasets
        • Several variants (notably, R+ and R* trees) proposed; widely used
      ▪ Can improve search performance by using a convex polygon to approximate the query shape (instead of a bounding box) and testing for polygon-box intersection
      ▪ Issues
        • For high-dimensional datasets, unless the data has good "contrast", the nearest neighbor may not be well-separated
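The "contrast" issue can be observed with a quick experiment. The sketch below assumes uniformly random points in the unit hypercube and defines relative contrast as (farthest - nearest) / nearest distance from a random query point; both the data model and this particular measure are assumptions for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min over distances from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

As the dimension grows, the nearest and farthest points end up at almost the same distance, so "nearest neighbor" carries little information and box-based pruning stops excluding anything.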
