Quadtree-based Resource Description Techniques for Spatial Data in Distributed Databases
Stefan Kufer and Andreas Henrich
stefan.kufer@uni-bamberg.de University of Bamberg Media Informatics Group
Stuttgart, 09.03.2017
Quadtree-based Resource Description Techniques for Spatial Data in Distributed Databases (p. 2) Stefan Kufer and Andreas Henrich − BTW 2017 in Stuttgart, March 09, 2017
age of social media: creation and distribution of media items
→ maintained in (personal) media archives
large, heterogeneous distributed database of various
resources (= nodes in the network) → adequate indexing techniques are needed
Motivation
heterogeneous resources in the distributed database
…
search criteria to be addressed: text, timestamps, content features, geographic information
retrieval tasks in a distributed environment:
resource description problem
resource selection problem
(result merging)
Problem Description
general preliminaries:
set of resources
each resource maintains a set of geotagged media items
plate-carrée projection: lat/lon coordinates = y/x coordinates in a 2-dimensional plane
more general spatial data scenario:
summaries of the spatial content of a resource
query routing based on summaries
Search Scenario
[Figure: resource A with data points [lat/y=48.22, lon/x=11.62] and [lat/y=-33.86, lon/x=151.22], summarized into a resource description]
Search Scenario
[Figure: resources A, B and C are each summarized into a resource description; resource selection ranks them 1. C, 2. A, 3. B; legend: resource data points (database objects) and the query object]
similarity query criterion: d(q,o)
d = Euclidean distance, q = query object, o = resource data point (database object)
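The similarity criterion d(q,o) can be illustrated with a minimal sketch. It assumes points are stored as (x, y) = (lon, lat) tuples in the plate-carrée plane; the function names and the query point are hypothetical and not taken from the slides.

```python
import math

def euclidean(q, o):
    """Euclidean distance d(q, o) between two (x, y) = (lon, lat) points."""
    return math.hypot(q[0] - o[0], q[1] - o[1])

def knn(q, points, k):
    """Return the k points closest to q, nearest first."""
    return sorted(points, key=lambda o: euclidean(q, o))[:k]

# The two data points of resource A from the slide, as (x=lon, y=lat).
resource_a = [(11.62, 48.22), (151.22, -33.86)]
query = (11.58, 48.14)  # hypothetical query point
nearest = knn(query, resource_a, 1)
```

This is the exact search over known data points; the resource descriptions discussed next only approximate where such points can lie.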
objective: encoding sets of two-dimensional data points
effectiveness → accurate delineation (selectivity)
efficiency → compact storage (space efficiency)
categories of resource description techniques (previous work):
Geometric Approaches
Space Partitioning Approaches
Hybrid Approaches
Resource Descriptions
[KBH12], [KBH13], [KH14]
approaches that organize the data
one | several bounding volumes (bv) to delimit the set of data points → extents of bv described in summaries
evaluated approaches:
MBR (as a comparative baseline)
RecMAR_k
Geometric Approaches
[Figure: MBR and RecMAR_k for k = 2]
k = maximum number of Minimum Area Rectangles
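As a minimal sketch of the MBR baseline: the summary is just the four extents of the rectangle enclosing all data points of a resource. The helper names are hypothetical; points are assumed to be (x, y) = (lon, lat) tuples.

```python
def mbr(points):
    """Minimum bounding rectangle (x_min, y_min, x_max, y_max) of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def area(rect):
    """Area of an axis-parallel rectangle (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = rect
    return (x1 - x0) * (y1 - y0)

# The two data points of resource A from the search scenario slide.
resource_a = [(11.62, 48.22), (151.22, -33.86)]
summary = mbr(resource_a)
```

Two points that lie far apart already yield a rectangle covering a large part of the data space, which is the dead-space problem that several rectangles (RecMAR_k) try to reduce.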
[Figure: MBR and RecMAR_k for k = 3 and for k = 6]
approaches that organize the embedding space
decompose the space into disjoint subspaces
identify regions (not) containing data points → information about cell occupancy
evaluated approach: UFS_n (n = number of sites/subspaces)
Space Partitioning Approaches
[Figure: uniform grid, kd space partitioning, UFS_32]
global space partitioning → the same for all resources!
(summaries only need to contain information about cell occupancy)
space partitioning must be adapted to the data distribution of
the whole data collection!
additional tasks:
collect information about the data distribution in the network
partition space, distribute information in the network
(update information as data collection changes)
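A minimal sketch of a cell-occupancy summary under a global partitioning. For simplicity it uses a uniform grid with `rows` rows and 2*rows columns over lon [-180, 180] and lat [-90, 90]; the grid layout, the function name and the bit order are assumptions for illustration, not the evaluated UFS technique.

```python
def grid_occupancy(points, rows):
    """One bit per cell of a rows x (2*rows) grid: 1 if the cell holds a point."""
    cols = 2 * rows
    bits = [0] * (rows * cols)
    for x, y in points:
        # Clamp so that points on the upper border fall into the last cell.
        col = min(int((x + 180.0) / 360.0 * cols), cols - 1)
        row = min(int((y + 90.0) / 180.0 * rows), rows - 1)
        bits[row * cols + col] = 1
    return bits

# Resource A from the search scenario slide, summarized on a 2 x 4 grid.
resource_a = [(11.62, 48.22), (151.22, -33.86)]
summary = grid_occupancy(resource_a, 2)
```

Because every resource uses the same cells, the summary needs no geometry at all, only the bit vector.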
Space Partitioning Approaches
[Figure: panels A, B, C, D]
combine properties of two arbitrary resource description techniques
method A: builds foundation, method B: refines foundation
evaluated approach: KDMBR_{n,b}
→ summary: binary information about cell occupancy (foundation), quantized MBR information for occupied cells (refinement)
Hybrid Approaches
n = number of subspaces, b = number of bits per bound (4*b for an MBR)
[Figure: KDMBR_{32,3}]
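The refinement step can be sketched as follows: the MBR of the points inside an occupied cell is stored with b bits per bound, i.e. 4*b bits in total. The conservative rounding (lower bounds down, upper bounds up, so the decoded rectangle never shrinks) and all function names are assumptions of this sketch.

```python
import math

def quantize_mbr(mbr, cell, b):
    """Encode an MBR inside a cell as four codes of b bits each."""
    steps = 2 ** b
    cx0, cy0, cx1, cy1 = cell

    def low(v, lo, hi):
        # Round the lower bound down to a grid line.
        return min(math.floor((v - lo) / (hi - lo) * steps), steps - 1)

    def high(v, lo, hi, low_code):
        # Round the upper bound up; code j stands for grid line j + 1.
        code = math.ceil((v - lo) / (hi - lo) * steps) - 1
        return min(max(code, low_code), steps - 1)

    lx, ly = low(mbr[0], cx0, cx1), low(mbr[1], cy0, cy1)
    return (lx, ly, high(mbr[2], cx0, cx1, lx), high(mbr[3], cy0, cy1, ly))

def dequantize_mbr(codes, cell, b):
    """Decode the four codes back into a rectangle enclosing the original MBR."""
    steps = 2 ** b
    cx0, cy0, cx1, cy1 = cell
    wx, wy = (cx1 - cx0) / steps, (cy1 - cy0) / steps
    lx, ly, hx, hy = codes
    return (cx0 + lx * wx, cy0 + ly * wy, cx0 + (hx + 1) * wx, cy0 + (hy + 1) * wy)

codes = quantize_mbr((1.5, 2.0, 3.2, 7.9), (0, 0, 8, 8), 3)
decoded = dequantize_mbr(codes, (0, 0, 8, 8), 3)
```

A larger b tightens the decoded rectangle at the cost of 4 extra bits per occupied cell for every additional bit per bound.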
quadtree: recursive division of space into four quadrants
regular decomposition (equal sized cells) → linear storage of quadtrees possible (memory efficient representation)
linear quadtree encoding types:
only black nodes
encoding whole quadtree structure (all internal nodes + leaves)
Novel Quadtree-based Resource Description Techniques
[MRJ02]
linear quadtrees: allow for local space partitioning adapted to the data distribution of the single resource
area-driven decomposition of the space, parameters:
c → maximum number of subspaces of the quadtree structure (storage space oriented stopping criterion)
a → threshold area, if undercut by all black cells: end of construction (selectivity oriented stopping criterion)
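The two stopping criteria can be sketched as follows. The sketch assumes that "subspaces" means the leaves of the quadtree, that the largest black (occupied) cell is split first, and that a split is skipped once it would exceed c leaves; these details and all names are assumptions, not the authors' construction algorithm.

```python
def area(cell):
    x0, y0, x1, y1 = cell
    return (x1 - x0) * (y1 - y0)

def split(cell, pts):
    """Split a cell into its NW, NE, SW, SE quadrants and distribute its points."""
    x0, y0, x1, y1 = cell
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quads = [(x0, my, mx, y1), (mx, my, x1, y1), (x0, y0, mx, my), (mx, y0, x1, my)]
    parts = [[], [], [], []]
    for x, y in pts:
        parts[(0 if y >= my else 2) + (0 if x < mx else 1)].append((x, y))
    return list(zip(quads, parts))

def build_quadtree(points, space, c, a):
    """Return the leaves as (cell, points) pairs; leaves with points are black."""
    leaves = [(space, list(points))]
    while True:
        black = [i for i, (_, pts) in enumerate(leaves) if pts]
        if not black:
            break
        i = max(black, key=lambda j: area(leaves[j][0]))
        cell, pts = leaves[i]
        # Stop if even the largest black cell undercuts a, or c would be exceeded.
        if area(cell) < a or len(leaves) + 3 > c:
            break
        leaves[i:i + 1] = split(cell, pts)
    return leaves

pts = [(1, 1), (7, 7)]
leaves = build_quadtree(pts, (0, 0, 8, 8), 4, 1)
```

With c = 4 the construction stops after one split; a larger c keeps refining the largest occupied cell, and a larger a stops earlier.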
Novel Quadtree-based Resource Description Techniques
[Figure: panels A, B, C, D]
QT_{c,a}
space partitioning (sp) technique: resource-individual sp (local sp)
GridQT_{r,c,a}
hybrid technique: uniform grid (global sp) + qt-structure (local sp)
KDQT_{n,c,a}
hybrid technique: kd-structure (global sp) + qt-structure (local sp)
Novel Quadtree-based Resource Description Techniques
r = number of rows (columns = 2*r)
[Figure: QT_{32,0.1}, GridQT_{4,32,0.1}, KDQT_{32,32,0.1}]
QTMBR_{c,a,b}
hybrid technique: qt-structure (local sp) + quantized MBRs (bv)
MBRQT_{c,a}
hybrid technique: external MBR (bv) + qt-structure (local sp)
Novel Quadtree-based Resource Description Techniques
[Figure: QTMBR_{32,0.1,3}, MBRQT_{32,0.1}]
all techniques describe areas containing data points
→ ranking is based on minimum distance between the areas of a resource and the query point q
Resource Selection - Ranking
for details!
[Figure: resources A and B; legend: query point q, resource data points]
example: mindist of the areas described by the summary of resource B < mindist of the areas described by the summary of resource A ⇒ B ranked higher than A
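A minimal sketch of the mindist-based ranking, assuming every summary can be decoded into a list of axis-parallel rectangles (x_min, y_min, x_max, y_max). The function names and the example summaries are hypothetical.

```python
import math

def mindist(q, rect):
    """Minimum Euclidean distance from point q to a rectangle (0 if q is inside)."""
    x0, y0, x1, y1 = rect
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return math.hypot(dx, dy)

def rank_resources(q, summaries):
    """Order resource ids by the smallest mindist over all areas of their summary."""
    return sorted(summaries, key=lambda r: min(mindist(q, rect) for rect in summaries[r]))

summaries = {
    "A": [(0, 0, 1, 1)],
    "B": [(4, 4, 5, 5), (2, 2, 3, 3)],
}
ranking = rank_resources((2.5, 2.5), summaries)
```

Here the query point lies inside one of B's areas, so B gets mindist 0 and is ranked before A.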
406,450 geo-referenced images from Flickr
5,951 different users → 5,951 resources
long-tail distribution of data to resources
data space: densely populated and unpopulated areas vary
Evaluation – Data Collection
[Figure: log-scaled! n = 4.0 → 10^4 – 1 = 9,999]
50 kNN queries; k = 50
performance measures:
avg. resource fraction contacted (rfc) → selectivity
avg. resource description size (rds) → space efficiency
numerous parameterizations for each technique
Skyline operator for comparing the different techniques
two dimensions: rfc and rds
find dominant parameterizations for each technique → ‘as good or better in all dimensions and better in at least one dimension’
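The dominance rule quoted above can be written down directly. In this sketch both dimensions (rfc, rds) are minimized; the parameterization names and their values are hypothetical and only illustrate the mechanics.

```python
def dominates(p, q):
    """p dominates q: as good or better in all dimensions, better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(configs):
    """Names of all parameterizations that no other parameterization dominates."""
    return [
        name for name, p in configs.items()
        if not any(dominates(q, p) for q in configs.values())
    ]

# Hypothetical (rfc, rds) pairs; smaller is better in both dimensions.
configs = {"p1": (0.05, 40.0), "p2": (0.01, 44.0), "p3": (0.02, 50.0)}
names = skyline(configs)
```

p3 drops out because p2 contacts fewer resources with a smaller description; p1 and p2 remain since each wins in one dimension.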
Evaluation - Experimental Setting
for details!
[BKS01]
[Figure: Skyline of RecMAR_k]
all resource descriptions: bit vectors → Java gzip compression (if beneficial)
if summary bigger than the data points themselves → direct transfer
for quadtree-based techniques: LQ scheme | CBLQ scheme
27 byte serialization overhead + 1 extra byte (resource description type + resource size)
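A minimal sketch of the "compress if beneficial, otherwise transfer directly" decision. It uses Python's `gzip` module as a stand-in for the Java gzip compression named on the slide; encoding each point as two 32-bit floats and the function names are assumptions of this sketch.

```python
import gzip
import struct

def encode_points(points):
    """Serialize (x, y) points as pairs of big-endian 32-bit floats (assumed format)."""
    return b"".join(struct.pack(">ff", x, y) for x, y in points)

def choose_payload(summary, points):
    """Pick the smallest of: summary, zipped summary, data points, zipped data points."""
    raw = encode_points(points)
    candidates = {
        "summary": summary,
        "summary_gz": gzip.compress(summary),
        "points": raw,
        "points_gz": gzip.compress(raw),
    }
    return min(candidates.items(), key=lambda item: len(item[1]))

# A resource with a single data point: direct transfer beats any summary.
kind, payload = choose_payload(bytes(range(100)), [(11.62, 48.22)])
```

For very small resources the gzip header alone outweighs the data, so the uncompressed data points win; large, sparse bit vectors are where zipping the summary pays off.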
Evaluation - Optimizations
Evaluation – Experimental Results
k
baseline: 0.0013 rfc (7.74 res.) 265.6 byte per res.
exactly depicted areas described
improvable
→ bounding volumes not really suited
QTMBR_{64,0.1,3}: 0.0087 rfc (~52) 44.04 rds
MBR: 0.0471 rfc (~280) 42.35 rds
→ dead space is reduced much more efficiently
QTMBR_{c,a,b} (for low rds) and KDMBR_{n,b} (for bigger rds)
(local||global) is a must → GridQT_{r,c,a} → MBRQT_{c,a}
bounding volumes are not suited for the given data collection, subpar scaling (trade-off storage space vs. gained selectivity)
space partitioning approaches offer better performance and scale better
adaptive techniques required (both for global and local space partitioning)
hybrid approaches can improve the performance (if designed thoughtfully → adaptive space partitioning for the foundation!)
local space partitioning: similar performance to global space partitioning → especially suited for ‘small’ resources
Evaluation – Summary
Thank you for your attention!
evaluate techniques for a bigger data collection/different distribution of data points to resources → robustness
slight optimizations of existing techniques (e.g. employ quantized MARs as a refinement for hybrid techniques, …)
utilization of summaries in different application fields (e.g. binary image compression for QTMBR_{c,a,b}, …)
Future Work
for good results: global space partitioning must be adjusted to the data distribution of the whole data collection → how?
UFS_n: random selection of n data points out of the data collection
kd-based approaches (KDMBR_{n,b} and KDQT_{n,c,a}):
space partitioning is learned from training data
random selection of training data points out of data collection
bucket-based method: one bucket at the beginning; data points are inserted into bucket → bucket overflow → bucket is split in halves (cyclically alternating dimensions)
repeat until n buckets are reached
Adaption of the Global Space Partitioning Approaches
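The bucket-based method for the kd-based approaches can be sketched as follows. The bucket capacity, splitting at the median coordinate of the overflowing bucket, and all names are assumptions of this sketch; the slide only states that an overflowing bucket is split in halves with cyclically alternating dimensions until n buckets exist.

```python
def contains(cell, p):
    x0, y0, x1, y1 = cell
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def kd_partition(training_points, space, n, capacity):
    """Learn at most n cells from training points that all lie inside `space`."""
    buckets = [{"cell": space, "pts": [], "dim": 0}]
    for p in training_points:
        if len(buckets) >= n:
            break
        i = next(j for j, b in enumerate(buckets) if contains(b["cell"], p))
        b = buckets[i]
        b["pts"].append(p)
        if len(b["pts"]) <= capacity:
            continue
        dim = b["dim"]  # 0 = split along x, 1 = split along y
        lo, hi = b["cell"][dim], b["cell"][dim + 2]
        coords = sorted(q[dim] for q in b["pts"])
        cut = coords[len(coords) // 2]
        if not lo < cut < hi:
            continue  # degenerate split, keep the bucket as it is
        low_cell, high_cell = list(b["cell"]), list(b["cell"])
        low_cell[dim + 2] = cut
        high_cell[dim] = cut
        buckets[i:i + 1] = [
            {"cell": tuple(low_cell), "pts": [q for q in b["pts"] if q[dim] < cut], "dim": 1 - dim},
            {"cell": tuple(high_cell), "pts": [q for q in b["pts"] if q[dim] >= cut], "dim": 1 - dim},
        ]
    return [b["cell"] for b in buckets]

training = [(1, 1), (2, 5), (6, 2), (7, 7), (3, 3)]
cells = kd_partition(training, (0, 0, 8, 8), 3, 2)
```

The resulting cells are the same for all resources, so each summary only has to state which of the n cells are occupied.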
Cell-interior, quantized MBRs
4 bits for encoding each axis
Trade-off: accuracy vs. storage space
LQ code:
df-order, only black nodes → complete path for every node described
5 literals → 0|1|2|3|X (NW|NE|SW|SE|stop at non-maximum depth)
code: 13X-210-211-212-213-22X-23X, condensed: 13-21-22-23
CBLQ code:
bf-order, encoding of all leaves and internal nodes → complete quadtree structure
4 literals → 0|1|2|3 (white leaf|black leaf|internal node, 1+ descendants are
code: 0320-0001-0311-1111, condensed: 0330-0001-0111
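To illustrate how an LQ path addresses a cell, the following sketch decodes one path (literals 0|1|2|3|X = NW|NE|SW|SE|stop) into the bounds of the black cell. The unit-square data space, north meaning larger y, and the function name are assumptions of this sketch.

```python
def lq_cell(path, space=(0.0, 0.0, 1.0, 1.0)):
    """Follow an LQ path from the root and return the cell (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = space
    for literal in path:
        if literal == "X":  # stop above the maximum depth
            break
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        quadrant = int(literal)
        if quadrant in (0, 1):  # NW, NE: northern half
            y0 = my
        else:                   # SW, SE: southern half
            y1 = my
        if quadrant in (0, 2):  # NW, SW: western half
            x1 = mx
        else:                   # NE, SE: eastern half
            x0 = mx
    return (x0, y0, x1, y1)

code = "13X-210-211-212-213-22X-23X"
cells = [lq_cell(path) for path in code.split("-")]
```

The four paths 210 to 213 address the four quadrants of cell 21, which is why the condensed code can replace them by the single path 21.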
Linear Quadtree Encoding Schemes
Ranking Algorithm
query points are chosen right out of the query collection
→ at least nearest neighbor has distance 0!
two-step approach:
1. random resource is selected
2. random data point of this resource is selected as query point
big and small resources have the same probability of issuing a query
ranking algorithm biased for big resources → harder search task
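The two-step selection can be sketched as follows, assuming the collection is held as a dictionary from resource id to its list of data points. The names and the example data are hypothetical.

```python
import random

def pick_query_point(resources, rng):
    """Step 1: pick a resource uniformly; step 2: pick one of its data points."""
    owner = rng.choice(sorted(resources))
    return owner, rng.choice(resources[owner])

resources = {
    "big": [(float(i), float(i)) for i in range(100)],
    "small": [(0.5, 0.5)],
}
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
owner, query = pick_query_point(resources, rng)
```

Sampling the resource first gives "small" the same chance of issuing a query as "big", whereas sampling directly from all data points would favor "big" by a factor of 100.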
Selection of Query Points
kNN algorithm (precise search)
Non-quadtree utilizing
zipped summary || non-zipped summary
zipped data points || non-zipped data points
→ 4 possibilities
Quadtree utilizing
zipped LQ-coded || non-zipped LQ-coded
zipped CBLQ-coded || non-zipped CBLQ-coded
zipped data points || non-zipped data points
→ 6 possibilities
1 extra byte → 3 bit for resource description type + 5 bit for resource size (32 different sizes quantized)
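The extra byte can be sketched as a simple bit field. Placing the 3 type bits in the upper part of the byte and the 5 size bits in the lower part is an assumption of this sketch, as are the function names.

```python
def pack_header(description_type, size_class):
    """Pack a 3-bit description type and a 5-bit quantized resource size into one byte."""
    if not 0 <= description_type < 8:
        raise ValueError("description type must fit into 3 bits")
    if not 0 <= size_class < 32:
        raise ValueError("size class must fit into 5 bits")
    return (description_type << 5) | size_class

def unpack_header(byte):
    """Inverse of pack_header: return (description_type, size_class)."""
    return byte >> 5, byte & 0b11111

header = pack_header(5, 17)
```

Three bits are enough because at most 6 description types occur, and five bits cover the 32 quantized resource sizes.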
Resource Description Types
Resource Descriptions & Summaries
Resource Description
hypernym; contains information about the spatial features of a resource
Summary
high-level aggregation of the spatial features of a resource (geometric approach, space partitioning approach, hybrid approach)
Direct Transfer
coordinates of the data points are transferred instead of a summary
Parameter Variation
Key Figures (transfer type, resource sizes)
Key Figures (detailed view for QTMBR_{c,a,b})
Dead Space
MBR (44 Byte): bounds are exactly delineated in all dimensions; but: a lot of dead space inside
QTMBR_{16,1,2} (46 Byte): bounds are not exactly delineated in the single dimensions; but: significantly less dead space
[Be92] Becker, B.; Franciosa, P. G.; Gschwind, S.; Ohler, T.; Thiemt, G.; Widmayer, P.: Enclosing many boxes by an optimal pair of boxes. In: Proc. of STACS 92: 9th
Heidelberg, pp. 475–486, 1992.
[BH12] Blank, D.; Henrich, A.: Describing and Selecting Collections of Georeferenced Media Items in Peer-to-Peer Information Retrieval Systems. In: Discovery of Geospatial Resources: Methodologies, Technologies, and Emergent Applications. Information Science Reference, pp. 1–20, 2012.
[BHK16] Blank, D.; Henrich, A.; Kufer, S.: Using Summaries to Search and Visualize Distributed Resources Addressing Spatial and Multimedia Features. Datenbank-Spektrum 16/1, pp. 67–76, 2016.
[BKS01] Börzsönyi, S.; Kossmann, D.; Stocker, K.: The Skyline Operator. In: Proc. of the 17th Int. Conf. on Data Engineering. IEEE Computer Society, Washington, DC, USA, pp. 421–430, 2001.
[Ca00] Callan, J.: Distributed Information Retrieval. In: Advances in Information
Literature 1/4
[Ca05] Caldwell, D. R.: Unlocking the Mysteries of the Bounding Box. A/2, pp. 1–20,
[Cu03] Cuenca-Acuna, F. M.; Peery, C.; Martin, R. P.; Nguyen, T. D.: PlanetP: Using Gossiping to Build Content Addressable Peer-to-Peer Information Sharing
Distributed Computing (HPDC-12 ’03). IEEE Press, Seattle, Washington, pp. 1–11, 2003.
[Ga82] Gargantini, I.: An Effective Way to Represent Quadtrees. Commun. ACM 25/12,
[GG98] Gaede, V.; Günther, O.: Multidimensional Access Methods. ACM Comput. Surv. 30/2, pp. 170–231, 1998.
[HB10] Henrich, A.; Blank, D.: Description and Selection of Media Archives for Geographic Nearest Neighbor Queries in P2P Networks, 2010.
[He09] Hetland, M. L.: The Basic Principles of Metric Indexing. In: Swarm Intelligence for Multi-objective Problems in Data Mining. Springer Berlin Heidelberg,
Literature 2/4
[KBH12] Kufer, S.; Blank, D.; Henrich, A.: Techniken der Ressourcenbeschreibung und -auswahl für das geographische Information Retrieval. In: Proc. of the IR Workshop at LWA 2012. Dortmund, Germany, pp. 1–8, 2012.
[KBH13] Kufer, S.; Blank, D.; Henrich, A.: Using Hybrid Techniques for Resource Description and Selection in the Context of Distributed Geographic Information
SSTD 2013, Munich, Germany. Springer Berlin Heidelberg, pp. 330–347, 2013.
[KH14] Kufer, S.; Henrich, A.: Hybrid Quantized Resource Descriptions for Geospatial Source Selection. In: Proc. of the 4th Int. Workshop on Location and the Web. LocWeb ’14, ACM, Shanghai, China, pp. 17–24, 2014.
[Li97] Lin, T.-W.: Set Operations on Constant Bit-length Linear Quadtrees. Pattern
[MRJ02] Manouvrier, M.; Rukoz, M.; Jomier, G.: Quadtree representations for storage and manipulation of clusters of images. Im. Vis. Comp. 20/7, pp. 513–527, 2002.
[Oo99] Oosterom, P.V.: Spatial Access Methods. In. Vol. 1, Geographical Information Systems, chap. 27, pp. 385–400, 1999.
Literature 3/4
[Sa05] Samet, H.: Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[Sa84] Samet, H.: The Quadtree and Related Hierarchical Data Structures. ACM
[SK90] Seeger, B.; Kriegel, H.-P.: The Buddy Tree: An Efficient and Robust Access Method for Spatial Data Base. In: Proc. of the Sixteenth Intl. Conf. on VLDB. Morgan Kaufmann Publishers Inc., Brisbane, Australia, pp. 590–601, 1990.
Literature 4/4