Big Spatial Data Management on Hadoop and Beyond
Ahmed Eldawy Computer Science and Engineering
11/30/18 1
Claudius Ptolemy (AD 90 – AD 168)
Al Idrisi (1099 – 1165)
Cholera cases in the London epidemic of 1854
I have BIG data. I need HELP..!!
Kindly let me understand your needs
Kindly let me get the technology you have
My pleasure. Here it is.
Cool computer technology..!! Can I use it in my application?
Oh..!! But, it is not made for me. Can’t make use of it as is
HELP..!! I have BIG data. Your technology is not helping me
mmm… Let me check with my good friends there.
Kindly let me understand your needs
Kindly let me get the technology you have
My pleasure. Here it is.
Cool Database technology..!! Can I use it in my application?
Oh..!! But, it is not made for me. Can’t make use of it as is
HELP..!! Again, I have BIG data. Your technology is not helping me
Sorry, seems like the DBMS technology cannot scale any further
Let me check with my other good friends there.
Kindly let me understand your needs
Kindly let me get the technology you have
My pleasure. Here it is.
Cool Big Data technology..!! Can I use it in my application?
Oh..!! But, it’s not made for me. Can’t make use of it as is
Spatial data on Hadoop: takes 193 seconds
points = LOAD 'points' AS (id:int, x:int, y:int);
result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;
SpatialHadoop: finishes in 2 seconds
points = LOAD 'points' AS (id:int, location:point);
result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));
80,000 downloads in one year
Conducted more than seven keynotes, tutorials, and invited talks
Incubated by Eclipse Foundation and renamed to GeoJinni
>500GB public datasets for benchmarking and testing
Spatial Language, Spatial Indexes, Spatial Operations, Visualization
Students Projects
Collaboration
University
Genova
Hadoop stack: Storage (HDFS), MapReduce Runtime, Job Monitoring and Scheduling, Java APIs, Pig Latin, User Programs
Spatial Modules
From Scratch Approach: (Spatial) User Program + MapReduce APIs + Job Monitoring and Scheduling + MapReduce Runtime + Storage +
The Built-in Approach (SpatialHadoop): Spatial Indexing, Early Pruning, Spatial Operators, Spatial Language
VLDB’13 ICDE’15
TAREEG[SIGMOD’14, SIGSPATIAL’14]
Pigeon [ICDE’14]
[VLDB’15, ICDE’16]
ST-Hadoop
Basic operations – CG_Hadoop [SIGSPATIAL’13]
Spatial File Splitter Spatial Record Reader
Grid – R-tree – R+-tree – Quad tree [VLDB’15]
128MB 128MB 128MB 128MB
Global Indexing
Locally Indexed HDFS Blocks, Data Nodes, Global Index
Leaf node capacity: $C = \frac{k \cdot B}{|R|\,(1 + \alpha)}$
k: Sample size; B: HDFS block capacity; |R|: Input size; α: Index overhead
Use MBR of leaf nodes as partition boundaries
Partition the data
Optional: build R-tree local indexes
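The sizing rule and partitioning steps above can be sketched as follows. This is an illustration under stated assumptions, not SpatialHadoop's implementation: B and |R| are assumed to be in the same unit (for example MB), k is a number of sample records, cutting the x-sorted sample into runs stands in for real R-tree bulk loading, and all function names are hypothetical.

```python
import math

def leaf_capacity(k, block_capacity, input_size, alpha):
    """C = k * B / (|R| * (1 + alpha)), kept to at least one sample record per leaf."""
    return max(1, math.floor(k * block_capacity / (input_size * (1 + alpha))))

def partition_boundaries(sample, capacity):
    """Cut the x-sorted sample into leaves of `capacity` points; return each leaf's MBR.

    Simplification: a real bulk loader would build an R-tree over the sample.
    """
    pts = sorted(sample)
    leaves = [pts[i:i + capacity] for i in range(0, len(pts), capacity)]
    return [(min(x for x, _ in leaf), min(y for _, y in leaf),
             max(x for x, _ in leaf), max(y for _, y in leaf)) for leaf in leaves]

def distance_to_mbr(p, mbr):
    """Euclidean distance from point p to rectangle mbr (0 if p is inside)."""
    dx = max(mbr[0] - p[0], 0, p[0] - mbr[2])
    dy = max(mbr[1] - p[1], 0, p[1] - mbr[3])
    return math.hypot(dx, dy)

def partition(points, mbrs):
    """Assign every input point to the partition whose MBR is closest to it."""
    parts = [[] for _ in mbrs]
    for p in points:
        best = min(range(len(mbrs)), key=lambda i: distance_to_mbr(p, mbrs[i]))
        parts[best].append(p)
    return parts
```

With k = 100 sample records, B = 128, |R| = 1024 and α = 0.2, the rule gives 10 sample records per leaf, which corresponds to roughly |R|(1 + α)/B ≈ 10 partitions of about one block each.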
TAREEG[SIGMOD’14, SIGSPATIAL’14]
[VLDB’15, ICDE’16]
ST-Hadoop [TODS]
Basic operations – CG_Hadoop [SIGSPATIAL’13, TSAS]
Grid – R-tree – R+-tree – Quad tree [VLDB’15]
Pigeon [ICDE’14]
Under review Demo paper
Spatial File Splitter Spatial Record Reader
VLDB’13 ICDE’15
Hadoop: Input Heap File → File Splitter → Splits → Record Reader → (k,v) pairs → Map tasks
SpatialHadoop: Indexed Input File(s) → Spatial File Splitter (with Filter Function) → Splits → Spatial Record Reader → Map tasks
Use the global index to prune disjoint partitions
Use local indexes to find matching records
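A minimal sketch of this two-level lookup, assuming the global index is simply a list of (partition MBR, records) pairs and rectangles are (x1, y1, x2, y2) tuples. The function names are hypothetical, and a linear scan stands in for the local index inside each partition.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    """Half-open containment test: xmin <= x < xmax and ymin <= y < ymax."""
    return rect[0] <= p[0] < rect[2] and rect[1] <= p[1] < rect[3]

def range_query(global_index, query):
    """Return matching points and the number of partitions actually scanned."""
    result, scanned = [], 0
    for mbr, records in global_index:
        if not overlaps(mbr, query):
            continue  # global index: this partition is disjoint from the query, skip it
        scanned += 1
        # Stand-in for the local index: scan the records of the surviving partition.
        result.extend(p for p in records if contains(query, p))
    return result, scanned
```

Partitions whose MBR is disjoint from the query rectangle are never read, so the work grows with the number of overlapping partitions rather than with the whole file.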
First iteration runs as before and the result is tested for correctness
Answer is incorrect: second iteration processes other blocks that might contain an answer
✓ Answer is correct
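The slide title is not preserved, so the sketch below assumes this two-iteration procedure describes a k-nearest-neighbor query over partitions that tile the space. The correctness test used here (the circle through the k-th neighbor must fit inside the first partition) and all function names are assumptions for illustration.

```python
import math

def nearest(records, q, k):
    """The k records closest to q, nearest first."""
    return sorted(records, key=lambda p: math.dist(p, q))[:k]

def circle_inside(q, r, mbr):
    """True if the circle (q, r) lies entirely inside rectangle mbr."""
    return (q[0] - r >= mbr[0] and q[0] + r <= mbr[2] and
            q[1] - r >= mbr[1] and q[1] + r <= mbr[3])

def circle_overlaps(q, r, mbr):
    """True if the circle (q, r) touches rectangle mbr."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy) <= r

def knn(partitions, q, k):
    """partitions: list of (mbr, records). Returns (answer, iterations used)."""
    # Iteration 1: search only the partition that contains the query point.
    home = next(i for i, (m, _) in enumerate(partitions)
                if m[0] <= q[0] <= m[2] and m[1] <= q[1] <= m[3])
    mbr, records = partitions[home]
    answer = nearest(records, q, k)
    radius = math.dist(answer[-1], q) if len(answer) == k else math.inf
    if circle_inside(q, radius, mbr):
        return answer, 1  # correct: no other partition can hold a closer point
    # Iteration 2: also read the partitions that the test circle reaches.
    candidates = list(answer)
    for i, (m, recs) in enumerate(partitions):
        if i != home and circle_overlaps(q, radius, m):
            candidates.extend(recs)
    return nearest(candidates, q, k), 2
```

A query point deep inside a partition is usually answered in one iteration; a point near a partition boundary triggers the second one.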
Polygon Union, Voronoi Diagram, Delaunay Triangulation, Skyline, Convex Hull, Farthest/closest pair
Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x
Find the minimal convex polygon that contains all points
Input Output
Hadoop, SpatialHadoop
Partition, Pruning, Local hull, Global hull
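The local hull and global hull steps can be sketched on a single machine as below. The hull routine is Andrew's monotone chain, chosen here for brevity and not necessarily what CG_Hadoop uses; the pruning step is omitted, and `distributed_hull` is a hypothetical name standing in for the map and reduce phases.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            # Drop the last vertex while it makes a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]

def distributed_hull(partitions):
    """Local hull per partition (map), then one hull over all local hulls (reduce)."""
    local_hulls = [convex_hull(part) for part in partitions]
    return convex_hull([p for hull in local_hulls for p in hull])
```

The approach works because a point on the hull of the whole dataset is also on the hull of its own partition, so only local hull vertices need to reach the final step.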
Input Partition
canvas
canvas
Output Image
2D Matrix with zeros
Update the matrix
Matrix addition
Generate the image
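The four steps above map naturally onto four small functions. This is a sketch under assumptions: points are already in integer pixel coordinates, the image is a plain grayscale matrix scaled to 0..255, and the function names are hypothetical and not HadoopViz's API.

```python
from functools import reduce

def new_canvas(width, height):
    """Create: a 2D matrix of zeros, one cell per pixel."""
    return [[0] * width for _ in range(height)]

def plot(canvas, points):
    """Plot: update the matrix by counting the points that fall in each pixel."""
    for x, y in points:
        canvas[y][x] += 1
    return canvas

def merge(a, b):
    """Merge: combine two partial canvases by matrix addition."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def to_image(canvas):
    """Write: scale counts to 0..255 gray levels (integer division)."""
    peak = max(max(row) for row in canvas) or 1
    return [[255 * v // peak for v in row] for row in canvas]

def heat_map(partitions, width, height):
    """One canvas per partition, merged into a single image."""
    partial = [plot(new_canvas(width, height), part) for part in partitions]
    return to_image(reduce(merge, partial))
```

Because matrix addition is associative and commutative, partial canvases can be merged in any order, which is what lets each partition be plotted independently.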
Create a blank image
roads as polygons
PNG and write to file
image on the other
Map of California (2GB): generated in 2 minutes on a 10-node cluster instead of one hour
B: Block capacity; k: Partitioning parameter; |t|: Tile size
visualizing spatio-temporal satellite data
(Best poster runner-up)
ICDE’15
Visualize animated heat maps
Run spatio-temporal selection and aggregate queries
OpenStreetMap”, ACM SIGSPATIAL’14
___ “TAREEG: A MapReduce-Based Web Service for Extracting Spatial Data from OpenStreetMap”, SIGMOD’14
[Chart: Spatial join running time (sec) with different indexes (Quad, K-d, Hilbert)]
[Chart: Throughput of range query, Hadoop vs. SpatialHadoop]
[Chart: Speedup of CG_Hadoop for Union, Voronoi, Skyline, Convex Hull, Closest Pair, Farthest Pair; series: Baseline, Hadoop, SpatialHadoop; labeled values 500X and 260X]
[Chart: Visualization speedup for Scatter Plot, Roads, Heatmap, Satellite, Vector Map, Border Lines; series: Baseline, HadoopViz; labeled value 48X]
Interactive visualization
Support more case studies
Empty tiles: covered by parent data tiles
Data tiles: small data sizes in those areas; not worth preprocessing
Image tiles: big enough to consider preprocessing
Empty tiles: no data in there
Partitioning → Local VD → Pruning → Vertical Merge → Pruning → Horizontal Merge → Final output
Spatial Indexes
R-tree Scanner, Spatial Join
Range Query Plans, Spatial Join Plans
Spatial Data Types/Functions, Spatial Indexing Command
Queries on Big Spatial Data”, poster in ACM SIGSPATIAL’15