Big Spatial Data Management on Hadoop and Beyond
Ahmed Eldawy Computer Science and Engineering
Claudius Ptolemy (AD 90 – AD 168)
Al Idrisi (1099–1165)
Cholera cases in the London epidemic of 1854
Cool computer technology..!! Can I use it in my application? Oh..!! But, it is not made for me. Can’t make use of it as is My pleasure. Here it is. I have BIG data. I need HELP..!!
Kindly let me get the technology you have Kindly let me understand your needs
1969
HELP..!! I have BIG data. Your technology is not helping me
mmm…Let me check with my good friends there.
My pleasure. Here it is.
Cool Database technology..!! Can I use it in my application?
Oh..!! But, it is not made for me. Can’t make use of it as is
Kindly let me understand your needs Kindly let me get the technology you have
HELP..!! Again, I have BIG data. Your technology is not helping me Sorry, seems like the DBMS technology cannot scale more Let me check with my other good friends there. My pleasure. Here it is. Cool Big Data technology..!! Can I use it in my application? Oh..!! But, it’s not made for me. Can’t make use of it as is
Kindly let me understand your needs Kindly let me get the technology you have
Big Spatial Data Management
Tons of Spatial data out there…
Smart Phones Satellite Images Medical Data Traffic Data Geotagged Microblogs VGI Sensor Networks Geotagged Pictures
26
Spatial Data & Hadoop
Hadoop (takes 193 seconds):
points = LOAD 'points' AS (id:int, x:int, y:int);
result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;
SpatialHadoop (finishes in 2 seconds):
points = LOAD 'points' AS (id:int, location:point);
result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));
Hadoop and Spatial Data ➔ SpatialHadoop
27
80,000 downloads in one year
Industry Academia
Conducted more than seven keynotes, tutorials, and invited talks
Incubated by Eclipse Foundation and renamed to GeoJinni
>500GB public datasets for benchmarking and testing
Spatial Language, Spatial Indexes, Spatial Operations, Visualization
Students Projects, Collaboration
University
Genova
28
The Built-in Approach of SpatialHadoop
The on-top approach: user programs and spatial modules sit on top of an unmodified Hadoop stack (Pig Latin, Hadoop Java APIs, MapReduce runtime, job monitoring and scheduling, HDFS storage)
The from-scratch approach: (spatial) user program + MapReduce APIs + job monitoring and scheduling + MapReduce runtime + storage + …
The built-in approach (SpatialHadoop): spatial language, spatial operators, early pruning, and spatial indexing are added inside the Hadoop stack (Pig Latin, Hadoop Java APIs, MapReduce runtime, HDFS storage)
29
Agenda
The ecosystem of SpatialHadoop
Motivation Internal system design Applications Related work Experiments
Open Research Problems
30
SpatialHadoop Architecture [VLDB’13, ICDE’15]
Applications: SHAHED [ICDE’15] – MNTG [SSTD’13, ICDE’14] – TAREEG [SIGMOD’14, SIGSPATIAL’14]
Language: Pigeon [ICDE’14]
Visualization: [VLDB’15, ICDE’16]
ST-Hadoop
Operations: Basic operations – CG_Hadoop [SIGSPATIAL’13]
MapReduce: Spatial File Splitter – Spatial Record Reader
Indexing: Grid – R-tree – R+-tree – Quad tree [VLDB’15]
31
Indexing
32
Data Loading in Hadoop
Blindly chops down a big file into 128MB chunks
Values of records are not considered
Relevant records are typically assigned to two different blocks
HDFS is too restrictive where files cannot be modified
Input File Data Nodes
128MB 128MB 128MB 128MB
33
Two-layer Index Layout
Global Indexing
Locally Indexed HDFS Blocks Data Nodes Global Index
34
Uniform Grid
Works only for uniformly distributed data
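A uniform grid assigns each point to a cell with simple arithmetic, which is why partition sizes stay balanced only when the data is uniformly distributed. The following is a minimal sketch of that idea, not SpatialHadoop's code: the function name `grid_cell`, the row-major cell numbering, and the sample points are all assumptions made for illustration.

```python
from collections import Counter

def grid_cell(x, y, mbr, cols, rows):
    """Return the row-major ID of the grid cell that contains (x, y).

    mbr is (xmin, ymin, xmax, ymax) of the whole input (hypothetical layout).
    """
    xmin, ymin, xmax, ymax = mbr
    # Clamp so points on the upper border fall into the last column/row.
    col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
    row = min(int((y - ymin) / (ymax - ymin) * rows), rows - 1)
    return row * cols + col

# Skewed sample: most points cluster near the origin.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 3), (90, 90)]
load = Counter(grid_cell(x, y, (0, 0, 100, 100), 2, 2) for x, y in points)
print(load)  # cell sizes are uneven for skewed data
```

With this skewed sample, one cell receives almost all the points while two cells stay empty, which is the load-balancing problem the slide refers to.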
35
R-tree
Read a sample
Bulk load the sample into an R-tree
Leaf node capacity $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$
k: Sample size, B: HDFS Block capacity, |R|: Input size, α: Index overhead
Use MBR of leaf nodes as partition boundaries
Partition the data
Optional: Build R-tree Local indexes
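The steps above can be sketched as follows, assuming the leaf capacity formula reads C = k·B / (|R|·(1 + α)) and assuming an STR-style (sort-tile-recursive) packing for the bulk load; the slide does not say which bulk-loading method is used. The function names and the sample are hypothetical.

```python
import math

def leaf_capacity(k, block_size, input_size, alpha):
    """C = k*B / (|R| * (1 + alpha)): sample points per leaf (rounded down, an assumption)."""
    return max(1, math.floor(k * block_size / (input_size * (1 + alpha))))

def str_partition_mbrs(sample, capacity):
    """Pack sample points into leaves with a simplified STR pass; return the leaf MBRs."""
    n_leaves = math.ceil(len(sample) / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))
    per_slice = n_slices * capacity
    by_x = sorted(sample)                       # sort by x, then cut vertical slices
    mbrs = []
    for i in range(0, len(by_x), per_slice):
        column = sorted(by_x[i:i + per_slice], key=lambda p: p[1])  # sort slice by y
        for j in range(0, len(column), capacity):
            leaf = column[j:j + capacity]
            xs = [p[0] for p in leaf]
            ys = [p[1] for p in leaf]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs

# 16 sample points, 128 MB blocks, 400 MB input, 20% index overhead.
sample = [(x, y) for x in range(4) for y in range(4)]
capacity = leaf_capacity(k=len(sample), block_size=128, input_size=400, alpha=0.2)
boundaries = str_partition_mbrs(sample, capacity)
print(capacity, boundaries)
```

Each returned MBR would then serve as one partition boundary; every input record is assigned to the partition whose boundary contains it.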
38
R-tree-based Index of a 400 GB road network
39
Non-indexed Heap File
40
MapReduce Layer
Applications: SHAHED [ICDE’15] – MNTG [SSTD’13, ICDE’14]
TAREEG[SIGMOD’14, SIGSPATIAL’14]
Visualization
[VLDB’15,ICDE’16]
ST-Hadoop [TODS ]
Operations
Basic operations – CG_Hadoop [SIGSPATIAL’13, TSAS]
Indexing
Grid – R-tree – R+-tree – Quad tree [VLDB’15]
Language
Pigeon [ICDE’14]
Under review Demo paper
MapReduce
Spatial File Splitter Spatial Record Reader
VLDB’13 ICDE’15
41
Map plan – Hadoop
Input heap file → File Splitter → splits → Record Reader → (k,v) pairs → Map task
Map plan – SpatialHadoop
Indexed input file(s) → Spatial File Splitter (with filter function) → splits → Spatial Record Reader → (k,v) pairs → Map task
42
Operations
43
Operations Layer
Basic operations: e.g., range query and kNN
Spatial join operations
Computational geometry operations: e.g., Polygon Union, Voronoi diagram, Delaunay Triangulation, and Convex Hull
User-defined operations: e.g., kNN join
44
Range Query
Use the global index to prune disjoint partitions
Use local indexes to find matching records
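A minimal sketch of the two steps, under stated assumptions: the global index is modeled as a dictionary from partition MBR to the points in that partition, and a linear scan stands in for the local index lookup. None of these names come from SpatialHadoop.

```python
def overlaps(a, b):
    """True if rectangles a and b, each (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(global_index, query):
    """global_index maps a partition MBR to the list of points stored in it."""
    results = []
    for mbr, points in global_index.items():
        if not overlaps(mbr, query):
            continue  # pruned by the global index: the partition is never read
        # A linear scan stands in for the local index lookup here.
        results.extend(
            p for p in points
            if query[0] <= p[0] <= query[2] and query[1] <= p[1] <= query[3]
        )
    return results

global_index = {
    (0, 0, 10, 10): [(1, 1), (9, 9)],
    (10, 0, 20, 10): [(11, 2), (19, 8)],
    (0, 10, 10, 20): [(5, 15)],
}
matches = range_query(global_index, (8, 0, 12, 5))
print(matches)
```

In this example the third partition is disjoint from the query rectangle, so only two of the three partitions are scanned.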
45
KNN over Indexed Data
k=3
First iteration runs kNN in the block that contains the query point, and the result is tested for correctness
Answer is incorrect
Second iteration processes other blocks that might contain an answer
✓ Answer is correct
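The two iterations can be sketched as below. The correctness test used here is an assumption: the first answer is accepted only if the circle centered at the query point and passing through the k-th neighbor lies entirely inside the partition that was read. The data layout and function names are hypothetical.

```python
import heapq
import math

def knn_in(points, q, k):
    """The k points closest to q, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

def min_dist(mbr, q):
    """Distance from q to the nearest point of rectangle mbr."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def contains(mbr, q):
    return mbr[0] <= q[0] <= mbr[2] and mbr[1] <= q[1] <= mbr[3]

def knn(partitions, q, k):
    """partitions maps a partition MBR to its points; returns (answer, iterations)."""
    # Iteration 1: answer from the partition that contains the query point.
    home = next(m for m in partitions if contains(m, q))
    answer = knn_in(partitions[home], q, k)
    radius = math.dist(answer[-1], q) if len(answer) == k else math.inf
    # Correctness test: does the circle (q, radius) stay inside the home partition?
    inside = (q[0] - radius >= home[0] and q[0] + radius <= home[2] and
              q[1] - radius >= home[1] and q[1] + radius <= home[3])
    if inside:
        return answer, 1
    # Iteration 2: also read every partition that the circle reaches.
    candidates = list(partitions[home])
    for mbr, pts in partitions.items():
        if mbr != home and min_dist(mbr, q) <= radius:
            candidates.extend(pts)
    return knn_in(candidates, q, k), 2

partitions = {
    (0, 0, 10, 10): [(1, 1), (2, 2), (9, 5)],
    (10, 0, 20, 10): [(11, 5), (12, 5)],
}
result, iterations = knn(partitions, (9.5, 5), 3)
print(result, iterations)
```

Here the query point sits near a partition border, so the first answer fails the test and the neighboring partition supplies two of the three final neighbors.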
46
Spatial Join
Join directly: total of 36 overlapping pairs
Partition – Join: only 16 overlapping pairs
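The cost difference between the two strategies comes from how many pairs of partitions overlap, since each overlapping pair has to be joined. The sketch below counts those pairs for two hypothetical grid-partitioned files; the grids, the helper names, and the resulting counts are illustrative and are not the figures from the slide.

```python
def overlaps(a, b):
    """True if rectangles a and b share interior area (touching borders do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def overlapping_pairs(index_a, index_b):
    """All pairs of partitions, one from each file, that must be joined."""
    return [(a, b) for a in index_a for b in index_b if overlaps(a, b)]

def grid(n, size=12):
    """n x n grid of partition MBRs covering [0, size] x [0, size]."""
    step = size / n
    return [(i * step, j * step, (i + 1) * step, (j + 1) * step)
            for i in range(n) for j in range(n)]

# Join directly: file A is a 2x2 grid, file B is a 3x3 grid.
direct = overlapping_pairs(grid(2), grid(3))
# Partition-join: B is first repartitioned to match A's boundaries.
repartitioned = overlapping_pairs(grid(2), grid(2))
print(len(direct), len(repartitioned))
```

Repartitioning costs an extra pass over one file, so it pays off only when the reduction in overlapping pairs outweighs that pass.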
48
CG_Hadoop
Polygon Union Voronoi Diagram Delaunay Triangulation Skyline Convex Hull Farthest/closest pair
Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x
49
Convex Hull
Find the minimal convex polygon that contains all points
Input Output
50
Convex Hull in CG_Hadoop
Hadoop: Partition, Local hull, Global hull
SpatialHadoop: Partition, Pruning, Local hull, Global hull
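The local-hull and global-hull steps work because the convex hull of a union of point sets equals the convex hull of the union of their hulls. A minimal single-process sketch follows; it uses Andrew's monotone chain as the hull algorithm (an assumption, since the slide does not name one), omits the pruning step, and uses hypothetical function names.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last vertex is repeated by the other half

    return half(pts) + half(reversed(pts))

def distributed_hull(partitions):
    """Local hull per partition (map side), then one global hull (reduce side)."""
    local_hulls = [convex_hull(part) for part in partitions]
    merged = [v for hull in local_hulls for v in hull]
    return convex_hull(merged)

partitions = [
    [(0, 0), (1, 1), (2, 0), (1, 0.5)],
    [(0, 2), (2, 2), (1, 1.5)],
]
hull = distributed_hull(partitions)
print(hull)
```

Only the local hull vertices travel to the final step, which is what keeps the global hull computation small even for large inputs.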
51
Visualization
52
Visualization in HadoopViz
Scatter Plot, Road Network, Heat Map, Satellite Data, Vector Map, Admin Boundaries
The goal of HadoopViz is not to propose new visualization techniques; instead, its goal is to scale out existing techniques.
53
Heat Map From 2009 to 2014 Month-by-Month
72 Frames × 14 Billion points per frame Total = 1 Trillion points
Created in 3 hours on 10 nodes instead of 60 hours
54
Abstract Visualization
Input Partition
canvas
canvas
Output Image
55
Example: Satellite Data Visualization
2D matrix with zeros
Update the matrix
Matrix addition
Generate the image
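These four steps can be sketched with plain nested lists. This is an illustration only: the function names, the (col, row, value) record layout, and the gray-level scaling used in place of writing an actual image file are all assumptions.

```python
def create_canvas(width, height):
    """2D matrix with zeros."""
    return [[0] * width for _ in range(height)]

def plot(canvas, records):
    """Update the matrix with the (col, row, value) records of one partition."""
    for col, row, value in records:
        canvas[row][col] += value
    return canvas

def merge(a, b):
    """Matrix addition of two partial canvases."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def to_gray(canvas):
    """Scale values to 0..255 gray levels as a stand-in for generating the image."""
    peak = max(max(row) for row in canvas) or 1
    return [[255 * v // peak for v in row] for row in canvas]

# Two partitions are plotted independently, then merged.
part1 = plot(create_canvas(2, 2), [(0, 0, 10), (1, 0, 5)])
part2 = plot(create_canvas(2, 2), [(1, 0, 5), (1, 1, 20)])
merged = merge(part1, part2)
image = to_gray(merged)
print(merged, image)
```

Because matrix addition is associative and commutative, partial canvases can be merged in any order, which is what lets the plotting run in parallel across partitions.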
56
Example: Road Network Visualization
Create a blank image
roads as polygons
PNG and write to file
image on the other
57
Multilevel Images in HadoopViz
Map of California – 2GB
Generated in 2 minutes on 10-node cluster instead of one hour
58
Multi-level Visualization
Abstract multi-level visualization algorithm
The choice of partitioning technique changes for each zoom level
A threshold zoom level $z_\theta$ separates default Hadoop partitioning from spatial partitioning
$z_\theta = \frac{1}{2} \log(\ldots)$, an expression in B, k, and |t|
B: Block capacity, k: Partitioning parameter, |t|: Tile size
59
Applications
60
SHAHED – A system for querying and visualizing spatio-temporal satellite data
http://shahed.cs.umn.edu/
(Best poster runner-up)
ICDE’15
Visualize animated heat maps
Run spatio-temporal selection and aggregate queries
61
TAREEG – Web-based extractor for OpenStreetMap data using MapReduce
http://tareeg.net/
OpenStreetMap”, ACM SIGSPATIAL’14
“TAREEG: A MapReduce-Based Web Service for Extracting Spatial Data from OpenStreetMap”, SIGMOD’14
62
Agenda
The ecosystem of SpatialHadoop
Motivation Internal system design Applications Related work Experiments
Other research projects Future work
63
Other Big Spatial Data Systems
Parallel
ESRI Tools for Hadoop
SpatialHadoop is the only extensible system that can be easily expanded by researchers and developers
64
Experimental Results
[Figure: Spatial join running time (sec) with different indexes: Quad, K-d, Hilbert]
[Figure: Throughput of range query: Hadoop vs. SpatialHadoop]
[Figure: Speedup of CG_Hadoop for Union, Voronoi, Skyline, Convex Hull, Closest Pair, Farthest Pair: Baseline vs. Hadoop vs. SpatialHadoop; labeled speedups 500X and 260X]
[Figure: Visualization speedup for Scatter Plot, Roads, Heatmap, Satellite, Vector Map, Border Lines: Baseline vs. HadoopViz; labeled speedup 48X]
65
Agenda
The ecosystem of SpatialHadoop
Motivation System design Applications Related work Experiments
Future directions
66
Future Research Directions
Extend HadoopViz
Interactive visualization Support more case studies
Migrate spatial indexes to Spark to support spatial data mining
Improve HDFS spatial indexes to support dynamic data
67
Adaptive Multilevel Visualization
Empty tiles: covered by parent data tiles
Data tiles: small data sizes in those areas; not worth preprocessing
Image tiles: big enough to consider preprocessing
Empty tiles: no data in there
68
Dynamic Indexes
69
Advanced Analytic Queries
Partitioning Local VD Pruning Vertical Merge Pruning Horizontal Merge Final output
70
Sphinx: Distributed SQL Engine for Big Spatial Data
HDFS
Storage
Spatial Indexes
Query Executor
R-tree Scanner Spatial Join
Query Planner
Range Query Plans Spatial Join Plans
Query Parser
Spatial Data types/ Functions Spatial Indexing Command
Cloudera Impala Sphinx
Queries on Big Spatial Data”, poster in ACM SIGSPATIAL’15
71
Summary
Indexes Operations Visualization (HadoopViz) Apps
72
Questions?
73
A Unified Big Data Interface
Impala Sphinx MLLib
HDFS – File System YARN – Resource Manager
Unified Big Data Abstraction
Cost Model Query Optimizer Query Executor
SparkSQL
74
Language
75
Language (Pigeon)
Hides the complexity of the system with a high level language
OGC standard used by Oracle Spatial and PostGIS
Extends Pig Latin with OGC-compliant primitives
Spatial data types (e.g., Polygon)
Basic operations (e.g., Area)
Spatial predicates (e.g., Touches)
Spatial analysis (e.g., Union)
Spatial aggregate functions (e.g., Convex Hull)
76
Spatial data types and data loading:
lakes = LOAD 'lakes' AS (id:int, area:polygon);
Range query:
houses_in_range = Filter houses BY Overlap(house_loc, range);
KNN:
nearest_houses = KNN houses WITH_K=100 USING DistanceTo(house_loc, query_loc);
Spatial join:
lakes_states = Join lakes BY lakes_boundary states BY states_boundary Predicate = Overlap
77
Spatio-temporal Indexing
Spatio-temporal Data” Submitted to ACM TODS
78
Multiresolution Spatio-temporal Index
[Figure: yearly indexes (2012, 2013, …), monthly indexes (jan, feb, …, dec), and daily indexes (1, 2, …, 365 or 366)]
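One way such a multiresolution index could be used at query time is to cover the requested date range with the coarsest indexes available: yearly for whole years, monthly for whole months, and daily for the remaining days. The sketch below is an assumption about that selection logic, and the `yearly/…`, `monthly/…`, `daily/…` names are hypothetical labels rather than actual index paths.

```python
import calendar
from datetime import date, timedelta

def select_indexes(start, end):
    """Cover [start, end] (inclusive) with the coarsest possible indexes."""
    chosen, day = [], start
    while day <= end:
        year_end = date(day.year, 12, 31)
        last_dom = calendar.monthrange(day.year, day.month)[1]
        month_end = date(day.year, day.month, last_dom)
        if day.month == 1 and day.day == 1 and year_end <= end:
            chosen.append(f"yearly/{day.year}")            # whole year fits
            day = year_end + timedelta(days=1)
        elif day.day == 1 and month_end <= end:
            chosen.append(f"monthly/{day.year}-{day.month:02d}")  # whole month fits
            day = month_end + timedelta(days=1)
        else:
            chosen.append(f"daily/{day.isoformat()}")      # fall back to one day
            day += timedelta(days=1)
    return chosen

selected = select_indexes(date(2012, 12, 30), date(2014, 2, 1))
print(selected)
```

For this range of more than 13 months, five indexes are read instead of roughly 400 daily ones.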
79
Performance of SHAHED
80
Reference Point
Intersection rectangle Reference point
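The reference-point idea avoids duplicate results when two overlapping rectangles are both replicated to several cells: each cell reports a pair only if a fixed point of the pair's intersection rectangle falls inside that cell. In the sketch below the reference point is the lower-left corner of the intersection and cells are half-open; both choices, and all names, are assumptions for illustration.

```python
def intersection(a, b):
    """Intersection rectangle of a and b, or None if they are disjoint."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 <= x2 and y1 <= y2 else None

def cell_contains(cell, point):
    # Half-open cells, so a point on a shared border belongs to exactly one cell.
    return cell[0] <= point[0] < cell[2] and cell[1] <= point[1] < cell[3]

def join_in_cell(cell, rs, ss):
    """Report each overlapping pair only in the cell holding its reference point."""
    out = []
    for r in rs:
        for s in ss:
            inter = intersection(r, s)
            if inter and cell_contains(cell, (inter[0], inter[1])):
                out.append((r, s))
    return out

cells = [(0, 0, 10, 10), (10, 0, 20, 10)]
r = (8, 2, 12, 6)   # spans both cells, so it is replicated to both
s = (9, 3, 14, 8)   # spans both cells as well
per_cell = [join_in_cell(cell, [r], [s]) for cell in cells]
print(per_cell)
```

Both cells see the pair, but only the first one reports it, so no separate duplicate-elimination pass is needed.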
81
Index building
82
Index Building for NASA Data
83
Related Work
Most techniques for spatial data processing in Hadoop use Hadoop as a black box
RQ, KNN and SJMR [Zhang et al’09]
R-tree construction [Cary et al’09]
KNN Join [Lu et al’12, Zhang et al’12]
RNN [Akdogan et al’10]
ANN [Wang et al’10]
MD-HBase [Nishimura et al’11]
Framework for multi-dimensional data processing
Based on HBase, a key-value store on HDFS
Does not support MapReduce programming
84
85
KNN
k=3
SpatialFileSplitter selects the block that contains the query point
Map function performs kNN in the selected block
Answer is tested for correctness
✓ Answer is correct
86
Range query
88
K Nearest Neighbor
89
Preliminary Results
90