Big Spatial Data Management on Hadoop and Beyond (Ahmed Eldawy)

SLIDE 1

Big Spatial Data Management on Hadoop and Beyond

Ahmed Eldawy Computer Science and Engineering

11/30/18 1

SLIDE 2
SLIDE 3

Claudius Ptolemy (AD 90 – AD 168)

SLIDE 4

Al Idrisi (1099–1165)

SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8

Cholera cases in the London epidemic of 1854

SLIDE 9
SLIDE 10
SLIDE 11
SLIDE 12

I have BIG data. I need HELP..!!
My pleasure. Here it is.
Cool computer technology..!! Can I use it in my application?
Oh..!! But, it is not made for me. Can’t make use of it as is.

SLIDE 13
SLIDE 14

Kindly let me get the technology you have.
Kindly let me understand your needs.

1969

SLIDE 15
SLIDE 16

HELP..!! I have BIG data. Your technology is not helping me.
Mmm… Let me check with my good friends there.
My pleasure. Here it is.
Cool Database technology..!! Can I use it in my application?
Oh..!! But, it is not made for me. Can’t make use of it as is.

SLIDE 17
SLIDE 18

Kindly let me understand your needs.
Kindly let me get the technology you have.

SLIDE 19
SLIDE 20
SLIDE 21
SLIDE 22

HELP..!! Again, I have BIG data. Your technology is not helping me.
Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there.
My pleasure. Here it is.
Cool Big Data technology..!! Can I use it in my application?
Oh..!! But, it’s not made for me. Can’t make use of it as is.

SLIDE 23
SLIDE 24

Kindly let me understand your needs.
Kindly let me get the technology you have.

SLIDE 25

Big Spatial Data Management

SLIDE 26

Tons of Spatial data out there…

Smart phones, satellite images, medical data, traffic data, geotagged microblogs, VGI, sensor networks, geotagged pictures

SLIDE 27

Spatial Data & Hadoop

Hadoop with spatial data (takes 193 seconds):

    points = LOAD 'points' AS (id:int, x:int, y:int);
    result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;

SpatialHadoop (finishes in 2 seconds):

    points = LOAD 'points' AS (id:int, location:point);
    result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));

SLIDE 28

80,000 downloads in one year
Industry, Academia
Conducted more than seven keynotes, tutorials, and invited talks
Incubated by Eclipse Foundation and renamed to GeoJinni
>500GB public datasets for benchmarking and testing
Spatial Language, Spatial Indexes, Spatial Operations, Visualization
Students, Projects, Collaboration
University of Genova

SLIDE 29

The Built-in Approach of SpatialHadoop

From Scratch Approach: (Spatial) User Program + MapReduce APIs + Job Monitoring and Scheduling + MapReduce Runtime + Storage

The On-top Approach: User Programs and Spatial Modules on top of Hadoop (Pig Latin, Hadoop Java APIs, Job Monitoring and Scheduling, MapReduce Runtime, Storage (HDFS))

The Built-in Approach (SpatialHadoop): User Programs on top of Hadoop (Pig Latin, Hadoop Java APIs, Job Monitoring and Scheduling, MapReduce Runtime, Storage (HDFS)), with Spatial Language, Spatial Operators, Early Pruning, and Spatial Indexing built in

SLIDE 30

Agenda

The ecosystem of SpatialHadoop
  • Motivation
  • Internal system design
  • Applications
  • Related work
  • Experiments

Open Research Problems

SLIDE 31

SpatialHadoop Architecture [VLDB’13, ICDE’15]

Applications: SHAHED [ICDE’15] – MNTG [SSTD’13, ICDE’14] – TAREEG [SIGMOD’14, SIGSPATIAL’14]
Language: Pigeon [ICDE’14]
Visualization: [VLDB’15, ICDE’16]
ST-Hadoop
Operations: Basic operations – CG_Hadoop [SIGSPATIAL’13]
MapReduce: Spatial File Splitter – Spatial Record Reader
Indexing: Grid – R-tree – R+-tree – Quad tree [VLDB’15]

SLIDE 32

Indexing

(SpatialHadoop architecture diagram repeated, with the Indexing layer highlighted: Grid – R-tree – R+-tree – Quad tree [VLDB’15])

SLIDE 33

Data Loading in Hadoop

Blindly chops a big file into 128MB chunks
Values of records are not considered
Relevant records are typically assigned to two different blocks
HDFS is too restrictive: files cannot be modified

(Figure: an input file split into four 128MB blocks across the data nodes)

SLIDE 34

Two-layer Index Layout

Global Indexing

(Figure: a global index over locally indexed HDFS blocks stored on the data nodes)

SLIDE 35

Uniform Grid

Works only for uniformly distributed data
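
A minimal sketch of why a uniform grid only suits uniformly distributed data. The function names, the 2×2 grid, and the toy point sets are assumptions made for illustration; they are not taken from SpatialHadoop.

```python
from collections import Counter

def grid_cell(x, y, mbr, cols, rows):
    """Return the (col, row) of the uniform grid cell that contains (x, y)."""
    x_min, y_min, x_max, y_max = mbr
    # Points on the upper boundary are clamped into the last cell.
    col = min(int((x - x_min) / (x_max - x_min) * cols), cols - 1)
    row = min(int((y - y_min) / (y_max - y_min) * rows), rows - 1)
    return col, row

def partition_sizes(points, mbr, cols, rows):
    """Count how many points fall into each grid cell (one cell = one partition)."""
    return Counter(grid_cell(x, y, mbr, cols, rows) for x, y in points)

space = (0, 0, 4, 4)
uniform = [(i + 0.5, j + 0.5) for i in range(4) for j in range(4)]
skewed = [(0.1 * i, 0.1 * j) for i in range(4) for j in range(4)]

print(partition_sizes(uniform, space, 2, 2))  # balanced partitions
print(partition_sizes(skewed, space, 2, 2))   # every point lands in one cell
```

With the skewed input, one partition holds all the data and the other three are empty, which is the load imbalance that sample-based indexes such as the R-tree are meant to avoid.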

SLIDE 38

R-tree

Read a sample
Bulk load the sample into an R-tree

Leaf node capacity: $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$
k: Sample size; B: HDFS block capacity; |R|: Input size; α: Index overhead

Use MBR of leaf nodes as partition boundaries
Partition the data
Optional: Build R-tree local indexes
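
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the SpatialHadoop implementation: the slide does not say which bulk-loading method is used, so the sketch picks Sort-Tile-Recursive (STR) packing, rounds the leaf capacity down, and uses hypothetical function names. The leaf capacity is k·B divided by |R|·(1 + α), in the slide's notation.

```python
import math

def leaf_capacity(k, block_capacity, input_size, alpha):
    """C = k * B / (|R| * (1 + alpha)), rounded down (assumption), at least 1."""
    return max(1, int(k * block_capacity / (input_size * (1 + alpha))))

def str_leaf_mbrs(sample, capacity):
    """Pack sample points into leaves with STR and return one MBR per leaf."""
    pts = sorted(sample)                          # sort by x (ties by y)
    num_leaves = math.ceil(len(pts) / capacity)
    num_strips = math.ceil(math.sqrt(num_leaves))
    strip_size = num_strips * capacity            # points per vertical strip
    mbrs = []
    for s in range(0, len(pts), strip_size):
        strip = sorted(pts[s:s + strip_size], key=lambda p: p[1])  # sort by y
        for i in range(0, len(strip), capacity):
            leaf = strip[i:i + capacity]
            xs = [p[0] for p in leaf]
            ys = [p[1] for p in leaf]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs

# Worked numbers (all assumed): 1000 sample points, 128 MB blocks,
# a 10 GB (10240 MB) input file, and 20% index overhead.
C = leaf_capacity(k=1000, block_capacity=128, input_size=10240, alpha=0.2)
print(C)  # sample points per leaf

# Each leaf MBR becomes the boundary of one partition.
sample = [(i, j) for i in range(4) for j in range(4)]
print(str_leaf_mbrs(sample, capacity=4))
```

Because each leaf holds C of the k sample points, each partition receives roughly C/k of the input, which is what keeps a partition (plus its index overhead) within one HDFS block.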

SLIDE 39

R-tree-based Index of a 400 GB road network

SLIDE 40

Non-indexed Heap File

SLIDE 41

MapReduce Layer

(SpatialHadoop architecture diagram repeated, with the MapReduce layer highlighted: Spatial File Splitter, Spatial Record Reader. This version also cites ST-Hadoop [TODS] and CG_Hadoop [SIGSPATIAL’13, TSAS], with a legend for under-review and demo papers.)

SLIDE 42

Map plan – Hadoop

Input Heap File ➔ File Splitter (number of splits) ➔ Split ➔ Record Reader ➔ k,v pairs ➔ Map task

Map plan – SpatialHadoop

Indexed Input File(s) ➔ Spatial File Splitter (with Filter Function) ➔ Spatial Record Reader ➔ Map task

SLIDE 43

Operations

(SpatialHadoop architecture diagram repeated, with the Operations layer highlighted: Basic operations – CG_Hadoop [SIGSPATIAL’13])

SLIDE 44

Operations Layer

Basic operations: e.g., range query and kNN
Spatial join operations
Computational geometry operations: e.g., Polygon Union, Voronoi diagram, Delaunay Triangulation, and Convex Hull
User-defined operations: e.g., kNN join

SLIDE 45

Range Query

Use the global index to prune disjoint partitions
Use local indexes to find matching records
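
The two steps can be sketched as below. This is a simplified illustration, not SpatialHadoop code: the global index is assumed to be a plain list of (partition MBR, records) pairs, and the local index lookup is replaced by a linear filter as a stand-in.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (x1, y1, x2, y2), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(global_index, query):
    """Return (matching points, number of partitions actually read)."""
    results, scanned = [], 0
    for mbr, points in global_index:
        if not overlaps(mbr, query):
            continue                      # global index prunes this partition
        scanned += 1
        # Stand-in for the local index: filter the partition's records.
        results.extend(p for p in points
                       if query[0] <= p[0] <= query[2]
                       and query[1] <= p[1] <= query[3])
    return results, scanned

index = [
    ((0, 0, 5, 5), [(1, 1), (4, 4)]),
    ((5, 0, 10, 5), [(6, 1), (9, 4)]),
    ((0, 5, 10, 10), [(2, 8), (7, 7)]),
]
matches, scanned = range_query(index, (3, 3, 4.5, 4.5))
print(matches, scanned)
```

Only one of the three partitions is read for this query; the other two are skipped without touching their records.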

SLIDE 46

KNN over Indexed Data

k=3

First iteration runs as before and the result is tested for correctness ➔ ✗ Answer is incorrect
Second iteration processes other blocks that might contain an answer ➔ ✓ Answer is correct
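
A sketch of the two-iteration idea, under assumptions the slide does not spell out: the correctness test is taken to be a circle centered at the query point whose radius is the distance to the k-th neighbor found so far, and the answer is accepted only if that circle reaches no other partition. Function names and data are illustrative.

```python
import heapq
import math

def contains(mbr, p):
    return mbr[0] <= p[0] <= mbr[2] and mbr[1] <= p[1] <= mbr[3]

def min_dist(mbr, q):
    """Smallest distance from point q to rectangle mbr (0 if q is inside)."""
    dx = max(mbr[0] - q[0], 0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0, q[1] - mbr[3])
    return math.hypot(dx, dy)

def knn(partitions, q, k):
    """partitions: list of (mbr, points). Returns (neighbors, iterations used)."""
    def nearest(points):
        return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

    # Iteration 1: search only the partition that contains the query point.
    home = next(i for i, (mbr, _) in enumerate(partitions) if contains(mbr, q))
    answer = nearest(partitions[home][1])
    radius = math.dist(answer[-1], q) if len(answer) == k else math.inf

    # Correctness test: does the circle reach any other partition?
    others = [pts for i, (mbr, pts) in enumerate(partitions)
              if i != home and min_dist(mbr, q) <= radius]
    if not others:
        return answer, 1

    # Iteration 2: also process the partitions that might hold closer points.
    candidates = list(partitions[home][1])
    for pts in others:
        candidates.extend(pts)
    return nearest(candidates), 2

parts = [
    ((0, 0, 10, 10), [(1, 1), (2, 2), (9, 5)]),
    ((10, 0, 20, 10), [(10.5, 5), (11, 5), (19, 9)]),
]
print(knn(parts, (9.5, 5), 2))    # needs the second iteration
print(knn(parts, (1.5, 1.5), 2))  # first iteration is already correct
```

For the first query the home partition alone would return (2, 2) as the second neighbor, which is wrong because (10.5, 5) in the neighboring partition is closer.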

SLIDE 48

Spatial Join

Join Directly: total of 36 overlapping pairs
Partition – Join: only 16 overlapping pairs
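
To illustrate how repartitioning can reduce the number of partition pairs a join has to process, here is a toy sketch. The grids are an assumption chosen for illustration (a 4×4 grid joined with a 3×3 grid over the same space); they happen to reproduce the counts on the slide but are not claimed to be the slide's actual figure.

```python
def grid(n, size):
    """n x n grid of rectangles (x1, y1, x2, y2) covering [0, size] x [0, size]."""
    step = size // n
    return [(c * step, r * step, (c + 1) * step, (r + 1) * step)
            for r in range(n) for c in range(n)]

def overlapping_pairs(parts_a, parts_b):
    """Pairs of partitions whose interiors overlap; each pair is one join task."""
    return [(a, b) for a in parts_a for b in parts_b
            if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]]

file_a = grid(4, 12)   # first file, indexed as a 4x4 grid
file_b = grid(3, 12)   # second file, indexed as a 3x3 grid

# Join directly: every overlapping pair of partitions must be processed.
print(len(overlapping_pairs(file_a, file_b)))

# Partition, then join: repartition the second file on the first file's
# boundaries, so each partition overlaps exactly one partition of the other.
file_b_repartitioned = grid(4, 12)
print(len(overlapping_pairs(file_a, file_b_repartitioned)))
```

Repartitioning has its own cost, so which plan is cheaper depends on the file sizes as well as the pair counts.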

SLIDE 49

CG_Hadoop

Polygon Union, Voronoi Diagram, Delaunay Triangulation, Skyline, Convex Hull, Farthest/closest pair

Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x

  • A. Eldawy, Y. Li, M. F. Mokbel, R. Janardan. “CG_Hadoop: Computational Geometry in MapReduce”, ACM SIGSPATIAL’13

SLIDE 50

Convex Hull

Find the minimal convex polygon that contains all points

Input Output

SLIDE 51

Convex Hull in CG_Hadoop

Hadoop SpatialHadoop Partition Pruning Local hull Global hull

SLIDE 52

Visualization

(SpatialHadoop architecture diagram repeated, with the Visualization layer highlighted: [VLDB’15, ICDE’16])

SLIDE 53

Visualization in HadoopViz

Scatter Plot, Road Network, Heat Map, Satellite Data, Vector Map, Admin Boundaries

The goal of HadoopViz is not to propose new visualization techniques; instead, its goal is to scale out existing techniques.

SLIDE 54

Heat Map From 2009 to 2014, Month-by-Month

72 frames × 14 billion points per frame; total ≈ 1 trillion points

Created in 3 hours on 10 nodes instead of 60 hours

SLIDE 55

Abstract Visualization

Input ➔ Partition ➔ 1. smooth ➔ 2. create canvas ➔ 3. plot ➔ 4. merge ➔ 5. write ➔ Output Image

SLIDE 56

Example: Satellite Data Visualization

  • 1. Smooth: Recover holes
  • 2. Create Canvas: Initialize a 2D matrix with zeros
  • 3. Plot: Update the matrix
  • 4. Merge: Matrix addition (e.g., 7 + 15 = 22)
  • 5. Write: Generate the image
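
The five steps for the satellite example can be sketched as plain functions. This is an illustration only: the record layout (x, y, value), the hole-filling rule (mean of known 4-neighbors), and the text output used in place of a real image are all assumptions, and none of the names come from the HadoopViz API.

```python
def smooth(records):
    """Recover holes: fill a missing value with the mean of its known 4-neighbors."""
    known = {(x, y): v for x, y, v in records if v is not None}
    out = []
    for x, y, v in records:
        if v is None:
            near = [known[n] for n in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                    if n in known]
            v = sum(near) / len(near) if near else 0
        out.append((x, y, v))
    return out

def create_canvas(width, height):
    """Initialize a 2D matrix with zeros."""
    return [[0] * width for _ in range(height)]

def plot(canvas, records):
    """Update the matrix with each record's value."""
    for x, y, v in records:
        canvas[y][x] += v
    return canvas

def merge(a, b):
    """Matrix addition of two partial canvases."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def write(canvas):
    """Stand-in for image generation: one text row per canvas row."""
    return "\n".join(" ".join(str(v) for v in row) for row in canvas)

def visualize(partitions, width, height):
    final = create_canvas(width, height)
    for part in partitions:               # each partition is handled independently
        partial = plot(create_canvas(width, height), smooth(part))
        final = merge(final, partial)     # partial canvases are combined
    return write(final)

partitions = [
    [(0, 0, 7), (1, 0, None), (1, 1, 3)],   # contains one hole
    [(0, 0, 15)],
]
print(visualize(partitions, width=2, height=2))
```

Because merge is a plain matrix addition, partial canvases can be combined in any order, which is what lets the plotting run in parallel across partitions.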

SLIDE 57

Example: Road Network Visualization

  • 1. Smooth: Merge intersections
  • 2. Create Canvas: Create a blank image
  • 3. Plot: Draw roads as polygons
  • 4. Merge: Plot one image on top of the other
  • 5. Write: Encode as PNG and write to file

SLIDE 58

Multilevel Images in HadoopViz

Map of California – 2GB
Generated in 2 minutes on a 10-node cluster instead of one hour

SLIDE 59

Multi-level Visualization

Abstract multi-level visualization algorithm
The choice of partitioning technique changes for each zoom level
A threshold zoom level $z_\theta$ separates default Hadoop partitioning from spatial partitioning:

$z_\theta = \frac{1}{2} \log \frac{B}{k \, |t|}$

B: Block capacity; k: Partitioning parameter; |t|: Tile size

SLIDE 60

Applications

(SpatialHadoop architecture diagram repeated, with the Applications layer highlighted: SHAHED [ICDE’15] – MNTG [SSTD’13, ICDE’14] – TAREEG [SIGMOD’14, SIGSPATIAL’14])

SLIDE 61

SHAHED – A system for querying and visualizing spatio-temporal satellite data

http://shahed.cs.umn.edu/

  • A. Eldawy et al. “SHAHED: A MapReduce-based System for Querying and Visualizing Spatio-temporal Satellite Data”, IEEE ICDE’15 (Best poster runner-up)
  • A. Eldawy et al. “A Demonstration of SHAHED: A MapReduce-based System for Querying and Visualizing Satellite Data”, IEEE ICDE’15

Visualize animated heat maps or still images

Run spatio-temporal selection and aggregate queries

SLIDE 62

TAREEG – Web-based extractor for OpenStreetMap data using MapReduce

http://tareeg.net/

  • L. Alarabi, A. Eldawy, R. Alghamdi, M. F. Mokbel. “TAREEG: A MapReduce-Based System for Extracting Spatial Data from OpenStreetMap”, ACM SIGSPATIAL’14
  • ___ “TAREEG: A MapReduce-Based Web Service for Extracting Spatial Data from OpenStreetMap”, SIGMOD’14

SLIDE 63

Agenda

The ecosystem of SpatialHadoop
  • Motivation
  • Internal system design
  • Applications
  • Related work
  • Experiments

Other research projects
Future work

SLIDE 64

Other Big Spatial Data Systems

  • A. Eldawy and M. Mokbel. “The Era of Big Spatial Data: A Survey”, Foundations and Trends in Databases 2016

Parallel

ESRI Tools for Hadoop

SpatialHadoop is the only extensible system that can be easily expanded by researchers and developers

SLIDE 65

Experimental Results

(Chart) Spatial Join: running time (sec) with different indexes (Quad, K-d, Hilbert)
(Chart) Throughput of Range Query: Hadoop vs. SpatialHadoop
(Chart) Speedup of CG_Hadoop: Baseline, Hadoop, and SpatialHadoop on Union, Voronoi, Skyline, Convex Hull, Closest Pair, Farthest Pair; labeled speedups of 500X and 260X
(Chart) Visualization Speedup: Baseline vs. HadoopViz on Scatter Plot, Roads, Heatmap, Satellite, Vector Map, Border Lines; labeled speedup of 48X

SLIDE 66

Agenda

The ecosystem of SpatialHadoop
  • Motivation
  • System design
  • Applications
  • Related work
  • Experiments

Future directions

SLIDE 67

Future Research Directions

Extend HadoopViz
  • Interactive visualization
  • Support more case studies

Migrate spatial indexes to Spark to support spatial data mining operations

Improve HDFS spatial indexes to support dynamic data

SLIDE 68

Adaptive Multilevel Visualization

Image tiles: Big enough to consider preprocessing
Data tiles: Small data sizes in those areas; not worth preprocessing
Empty tiles: Covered by parent data tiles
Empty tiles: No data in there

SLIDE 69

Dynamic Indexes

SLIDE 70

Advanced Analytic Queries

Partitioning ➔ Local VD ➔ Pruning ➔ Vertical Merge ➔ Pruning ➔ Horizontal Merge ➔ Final output

SLIDE 71

Sphinx: Distributed SQL Engine for Big Spatial Data

Layers (diagram legend: Cloudera Impala, Sphinx):
Query Parser: Spatial data types/functions, spatial indexing command
Query Planner: Range query plans, spatial join plans
Query Executor: R-tree scanner, spatial join
Storage: HDFS, spatial indexes

  • A. Eldawy, M. Elganainy, I. Sabek, A. Bakeer, A. Abdelmotaleb, M. F. Mokbel. “Sphinx: Distributed Execution of Interactive SQL Queries on Big Spatial Data”, poster in ACM SIGSPATIAL’15

SLIDE 72

Summary

Indexes, Operations, Visualization (HadoopViz), Apps

SLIDE 73

Thank You

Questions?
