slide-1
SLIDE 1

The Era of Big Spatial Data

Ahmed Eldawy Computer Science and Engineering

slide-2
SLIDE 2
slide-3
SLIDE 3

Claudius Ptolemy (AD 90 – AD 168)

slide-4
SLIDE 4

Al Idrisi (1099–1165)

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

Cholera cases in the London epidemic of 1854

slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

Cool computer technology..!! Can I use it in my application?

Oh..!! But, it is not made for me. Can’t make use of it as is.

My pleasure. Here it is.

I have BIG data. I need HELP..!!

slide-13
SLIDE 13
slide-14
SLIDE 14

Kindly let me get the technology you have

Kindly let me understand your needs

1969

slide-15
SLIDE 15
slide-16
SLIDE 16

HELP..!! I have BIG data. Your technology is not helping me.

mmm… Let me check with my good friends there.

My pleasure. Here it is.

Cool Database technology..!! Can I use it in my application?

Oh..!! But, it is not made for me. Can’t make use of it as is.

slide-17
SLIDE 17
slide-18
SLIDE 18

Kindly let me understand your needs

Kindly let me get the technology you have

slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

HELP..!! Again, I have BIG data. Your technology is not helping me.

Sorry, seems like the DBMS technology cannot scale more. Let me check with my other good friends there.

My pleasure. Here it is.

Cool Big Data technology..!! Can I use it in my application?

Oh..!! But, it’s not made for me. Can’t make use of it as is.

slide-23
SLIDE 23
slide-24
SLIDE 24

Kindly let me understand your needs

Kindly let me get the technology you have

slide-25
SLIDE 25

Big Spatial Data

slide-26
SLIDE 26

Tons of Spatial data out there…

Smart Phones, Satellite Images, Medical Data, Traffic Data, Geotagged Microblogs, VGI, Sensor Networks, Geotagged Pictures

slide-27
SLIDE 27

Spatial Data & Hadoop

Hadoop

points = LOAD 'points' AS (id:int, x:int, y:int);
result = FILTER points BY x < xmax AND x >= xmin AND y < ymax AND y >= ymin;

Takes 193 seconds

SpatialHadoop

points = LOAD 'points' AS (id:int, location:point);
result = FILTER points BY Overlap(location, rectangle(xmin, ymin, xmax, ymax));

Finishes in 2 seconds

slide-28
SLIDE 28

80,000 downloads in one year

Industry Academia

Conducted more than seven keynotes, tutorials, and invited talks

>500GB public datasets for benchmarking and testing

Spatial Language, Spatial Indexes, Spatial Operations, Visualization

Students Projects Collaboration

University of Genova

slide-29
SLIDE 29

The Built-in Approach of SpatialHadoop

The On-top Approach

Hadoop stack: Storage (HDFS), MapReduce Runtime, Job Monitoring and Scheduling, Hadoop Java APIs, Pig Latin, User Programs

Added on top: Spatial Modules

From Scratch Approach

(Spatial) User Program + MapReduce APIs + Job Monitoring and Scheduling + MapReduce Runtime + Storage +

The Built-in Approach (SpatialHadoop)

Hadoop stack: Storage (HDFS), MapReduce Runtime, Job Monitoring and Scheduling, Hadoop Java APIs, Pig Latin, User Programs

Built-in spatial components: Spatial Indexing, Early Pruning, Spatial Operators, Spatial Language

slide-30
SLIDE 30

Agenda

The ecosystem of SpatialHadoop

  • Motivation
  • Internal system design
  • Applications
  • Related work
  • Performance results

Open Research Problems

slide-31
SLIDE 31

SpatialHadoop Architecture [VLDB’13, ICDE’15]

Applications: SHAHED [ICDE’15], MNTG [SSTD’13, ICDE’14], TAREEG [SIGMOD’14, SIGSPATIAL’14]

Language: Pigeon [ICDE’14]

Visualization: [VLDB’15, ICDE’16]

ST-Hadoop

Operations: Basic operations, CG_Hadoop [SIGSPATIAL’13]

MapReduce: Spatial File Splitter, Spatial Record Reader

Indexing: Grid, R-tree, R+-tree, Quad tree [VLDB’15]

slide-32
SLIDE 32

Indexing

Grid, R-tree, R+-tree, Quad tree [VLDB’15]

slide-33
SLIDE 33

Data Loading in Hadoop

  • Blindly chops a big file into 128MB chunks
  • Values of records are not considered
  • Relevant (spatially related) records are typically assigned to different blocks
  • HDFS is too restrictive: files cannot be modified

[Figure: an input file split into 128MB blocks that are distributed over the data nodes]

slide-34
SLIDE 34

Spatial Distributed File System

Default Partitioning Spatial Partitioning

slide-35
SLIDE 35

Uniform Grid

Works only for uniformly distributed data
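
To see why a uniform grid suits only uniformly distributed data, here is a minimal Python sketch. The helper names (`grid_cell`, `partition_sizes`) and the toy datasets are assumptions made for illustration and are not part of SpatialHadoop. It assigns points to the cells of a 2x2 grid and counts the points per cell.

```python
from collections import Counter

def grid_cell(x, y, mbr, cols, rows):
    """Return the (col, row) of the uniform grid cell that contains (x, y)."""
    xmin, ymin, xmax, ymax = mbr
    col = min(int((x - xmin) / (xmax - xmin) * cols), cols - 1)
    row = min(int((y - ymin) / (ymax - ymin) * rows), rows - 1)
    return col, row

def partition_sizes(points, mbr, cols, rows):
    """Count how many points fall into each grid cell (one cell = one partition)."""
    return Counter(grid_cell(x, y, mbr, cols, rows) for x, y in points)

mbr = (0.0, 0.0, 100.0, 100.0)
# 100 points spread evenly over the whole space
uniform = [(x + 0.5, y + 0.5) for x in range(0, 100, 10) for y in range(0, 100, 10)]
# 100 points squeezed into the lower-left corner of the same space
skewed = [(x / 10 + 0.05, y / 10 + 0.05) for x in range(0, 100, 10) for y in range(0, 100, 10)]

uniform_sizes = partition_sizes(uniform, mbr, 2, 2)  # 25 points in each of the 4 cells
skewed_sizes = partition_sizes(skewed, mbr, 2, 2)    # all 100 points in cell (0, 0)
```

With skewed data a single partition receives every record, so one machine would do all the work while the others stay idle.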

slide-36
SLIDE 36

R-tree

Read a sample

Bulk load the sample into an R-tree

Leaf node capacity: $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$

k: Sample size
B: HDFS Block capacity
|R|: Input size
α: Index overhead

Use MBR of leaf nodes as partition boundaries

slide-37
SLIDE 37

R-tree

Read a sample

Bulk load the sample into an R-tree

Leaf node capacity: $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$

k: Sample size
B: HDFS Block capacity
|R|: Input size
α: Index overhead

Use MBR of leaf nodes as partition boundaries

Partition the data

slide-38
SLIDE 38

R-tree

Read a sample

Bulk load the sample into an R-tree

Leaf node capacity: $C = \frac{k \cdot B}{|R| \, (1 + \alpha)}$

k: Sample size
B: HDFS Block capacity
|R|: Input size
α: Index overhead

Use MBR of leaf nodes as partition boundaries

Partition the data

Optional: Build R-tree Local indexes
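
The sample-based steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the leaf capacity is taken as k·B / (|R|·(1 + α)) rounded up, the bulk loading is a simple Sort-Tile-Recursive style packing chosen here for brevity, and the sizes (1 GB input, 128 MB blocks, 20% overhead) are hypothetical. It is not the SpatialHadoop implementation.

```python
import math
import random

def leaf_capacity(k, block_capacity, input_size, alpha):
    """Leaf capacity C = k * B / (|R| * (1 + alpha)), rounded up, at least 1."""
    return max(1, math.ceil(k * block_capacity / (input_size * (1 + alpha))))

def str_partition_mbrs(sample, capacity):
    """Pack sample points into leaves of `capacity` points and return the leaf MBRs."""
    n_leaves = math.ceil(len(sample) / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))
    per_slice = capacity * math.ceil(n_leaves / n_slices)
    by_x = sorted(sample)  # sort by x, then cut into vertical slices
    mbrs = []
    for i in range(0, len(by_x), per_slice):
        column = sorted(by_x[i:i + per_slice], key=lambda p: p[1])  # sort slice by y
        for j in range(0, len(column), capacity):
            leaf = column[j:j + capacity]
            xs = [p[0] for p in leaf]
            ys = [p[1] for p in leaf]
            mbrs.append((min(xs), min(ys), max(xs), max(ys)))
    return mbrs

random.seed(0)
sample = [(random.random(), random.random()) for _ in range(1000)]
# Hypothetical sizes: 1 GB input, 128 MB HDFS blocks, 20% index overhead
C = leaf_capacity(k=len(sample), block_capacity=128 * 2**20, input_size=2**30, alpha=0.2)
boundaries = str_partition_mbrs(sample, C)  # one MBR per partition
```

Because the boundaries come from the sample, dense regions get many small partitions and sparse regions get few large ones, which is what keeps each partition close to one block in size.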

slide-39
SLIDE 39

R-tree-based Index of a 400 GB road network

slide-40
SLIDE 40

Non-indexed Heap File

slide-41
SLIDE 41

Operations

Basic operations, CG_Hadoop [SIGSPATIAL’13]

slide-42
SLIDE 42

Operations Layer

Basic operations: e.g., Range query and KNN

Spatial Join Operations

Computational geometry operations: e.g., Polygon Union, Voronoi diagram, Delaunay Triangulation, and Convex Hull

User-defined operations: e.g., kNN join

slide-43
SLIDE 43

Range Query

Use the global index to prune disjoint partitions

Use local indexes to find matching records
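
A minimal Python sketch of this two-step range query follows. The global index is modeled as a plain list of (partition MBR, records) pairs, and a linear scan stands in for the local index; both are simplifying assumptions, and the function names are illustrative only.

```python
def overlaps(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(rect, p):
    """True if point p lies inside rect (borders included)."""
    return rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]

def range_query(global_index, query):
    """Return (matching points, number of partitions actually scanned)."""
    results, scanned = [], 0
    for mbr, records in global_index:
        if not overlaps(mbr, query):
            continue  # global index: skip partitions disjoint from the query
        scanned += 1
        # local step: a linear scan stands in for the local index here
        results.extend(p for p in records if contains(query, p))
    return results, scanned

global_index = [
    ((0, 0, 5, 5), [(1, 1), (4, 4)]),
    ((5, 0, 10, 5), [(6, 1), (9, 4)]),
    ((0, 5, 5, 10), [(1, 6), (4, 9)]),
    ((5, 5, 10, 10), [(6, 6), (9, 9)]),
]
matches, scanned = range_query(global_index, (3, 3, 4.5, 4.5))
# matches == [(4, 4)], scanned == 1: three of the four partitions are never read
```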

slide-44
SLIDE 44

KNN over Indexed Data

k=3

First iteration runs as before and the result is tested for correctness → Answer is incorrect

Second iteration processes other blocks that might contain an answer → Answer is correct
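
The two iterations can be sketched in Python as below. The correctness test used here is an assumption about how such a check can be done: the first answer is accepted only if no other partition lies closer to the query point than the current k-th neighbor. The sketch also assumes the query point falls inside some partition, and the names are illustrative.

```python
import math

def knn(global_index, q, k):
    """global_index: list of (MBR, points). Returns (k nearest points, iterations used)."""
    def dist(p):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def min_dist_to_rect(r):
        dx = max(r[0] - q[0], 0, q[0] - r[2])
        dy = max(r[1] - q[1], 0, q[1] - r[3])
        return math.hypot(dx, dy)

    # Iteration 1: answer the query from the partition that contains q
    home = next(i for i, (r, _) in enumerate(global_index)
                if r[0] <= q[0] <= r[2] and r[1] <= q[1] <= r[3])
    best = sorted(global_index[home][1], key=dist)[:k]
    radius = dist(best[-1]) if len(best) == k else math.inf

    # Correctness test: does the circle around q with that radius reach other partitions?
    others = [i for i, (r, _) in enumerate(global_index)
              if i != home and min_dist_to_rect(r) < radius]
    if not others:
        return best, 1

    # Iteration 2: also process the partitions that might contain a closer point
    candidates = list(best)
    for i in others:
        candidates.extend(global_index[i][1])
    return sorted(candidates, key=dist)[:k], 2

index = [
    ((0, 0, 5, 10), [(1, 5), (2, 5), (4.5, 5)]),
    ((5, 0, 10, 10), [(5.5, 5), (9, 5)]),
]
near_border = knn(index, (4, 5), 3)  # needs the second iteration
far_inside = knn(index, (1, 5), 2)   # first iteration is already correct
```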

slide-45
SLIDE 45

Spatial Join

Join Directly

Partition – Join

slide-46
SLIDE 46

Spatial Join

Join Directly: total of 36 overlapping pairs

Partition – Join: only 16 overlapping pairs
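
The effect of partition alignment on the number of overlapping pairs can be sketched in Python. The 2x2 grids, the helper names, and the toy records below are assumptions for illustration, so the counts differ from the 36 and 16 pairs in the figure. Only partition pairs whose MBRs overlap are joined.

```python
from itertools import product

def overlaps(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def grid(xs, ys):
    """Build empty partitions (MBR, records) from sorted boundary lists."""
    return [((x1, y1, x2, y2), [])
            for x1, x2 in zip(xs, xs[1:]) for y1, y2 in zip(ys, ys[1:])]

def overlapping_pairs(parts_r, parts_s):
    """Indexes of partition pairs whose MBRs overlap; each pair is one join task."""
    return [(i, j)
            for (i, (mbr_r, _)), (j, (mbr_s, _))
            in product(enumerate(parts_r), enumerate(parts_s))
            if overlaps(mbr_r, mbr_s)]

def spatial_join(parts_r, parts_s):
    """Join records pair by pair; the set removes results found in several pairs."""
    result = set()
    for i, j in overlapping_pairs(parts_r, parts_s):
        for a, b in product(parts_r[i][1], parts_s[j][1]):
            if overlaps(a, b):
                result.add((a, b))
    return result

r = grid([0, 5, 10], [0, 5, 10])
s_misaligned = grid([0, 4, 10], [0, 4, 10])  # partitioned independently of r
s_aligned = grid([0, 5, 10], [0, 5, 10])     # repartitioned to match r

direct_pairs = len(overlapping_pairs(r, s_misaligned))  # 9 join tasks
aligned_pairs = len(overlapping_pairs(r, s_aligned))    # 4 join tasks

r[0][1].append((1, 1, 2, 2))
s_misaligned[0][1].append((1.5, 1.5, 3, 3))
joined = spatial_join(r, s_misaligned)
```

Repartitioning one input costs an extra pass over it, so it pays off only when the saved join tasks outweigh that pass.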

slide-47
SLIDE 47

CG_Hadoop

Polygon Union, Voronoi Diagram, Delaunay Triangulation, Skyline, Convex Hull, Farthest/closest pair

Speedup: Single Machine 1x, Hadoop 29x, SpatialHadoop 260x

  • A. Eldawy, Y. Li, M. F. Mokbel, R. Janardan. “CG_Hadoop: Computational Geometry in MapReduce”, ACM SIGSPATIAL’13
slide-48
SLIDE 48

Convex Hull

Find the minimal convex polygon that contains all points

Input Output

slide-49
SLIDE 49

Convex Hull in CG_Hadoop

Steps (shown for Hadoop and SpatialHadoop): Partition, Pruning, Local hull, Global hull
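
The local hull and global hull steps can be sketched in Python as below. The hull routine is Andrew's monotone chain, chosen here for brevity and not necessarily the algorithm used in CG_Hadoop, and the pruning step is left out. The point is that the hull of the union of the local hulls equals the hull of all points.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def distributed_hull(partitions):
    """Local hull per partition (map side), then one global hull (reduce side)."""
    local_hulls = [convex_hull(part) for part in partitions]
    merged = [p for hull in local_hulls for p in hull]
    return convex_hull(merged)

partitions = [
    [(0, 0), (1, 1), (2, 0), (1, 0.5)],
    [(0, 2), (2, 2), (1, 1.5)],
]
hull = distributed_hull(partitions)
all_points = [p for part in partitions for p in part]
```

Each local hull is usually much smaller than its partition, so the single machine that computes the global hull receives only a small fraction of the input.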

slide-50
SLIDE 50

Advanced Analytics

(Ongoing work)

Steps: Partitioning → Local VD → Pruning → Vertical Merge → Pruning → Horizontal Merge → Final output

slide-51
SLIDE 51

Applications

SHAHED [ICDE’15], MNTG [SSTD’13, ICDE’14], TAREEG [SIGMOD’14, SIGSPATIAL’14]

slide-52
SLIDE 52

SHAHED – A system for querying and visualizing spatio-temporal satellite data

http://shahed.cs.umn.edu/

  • A. Eldawy et al. “SHAHED: A MapReduce-based System for Querying and Visualizing Spatio-temporal Satellite Data”, IEEE ICDE’15 (Best poster runner-up)

  • A. Eldawy et al. “A Demonstration of SHAHED: A MapReduce-based System for Querying and Visualizing Satellite Data”, IEEE ICDE’15

Visualize animated heat maps or still images

Run spatio-temporal selection and aggregate queries

slide-53
SLIDE 53

TAREEG – Web-based extractor for OpenStreetMap data using MapReduce

http://tareeg.net/

  • L. Alarabi, A. Eldawy, R. Alghamdi, M. F. Mokbel. “TAREEG: A MapReduce-Based System for Extracting Spatial Data from OpenStreetMap”, ACM SIGSPATIAL’14

  • ___ “TAREEG: A MapReduce-Based Web Service for Extracting Spatial Data from OpenStreetMap”, SIGMOD’14

slide-54
SLIDE 54

Agenda

The ecosystem of SpatialHadoop

  • Motivation
  • Internal system design
  • Applications
  • Related work
  • Performance results

Other research projects Future work

slide-55
SLIDE 55

Other Big Spatial Data Systems

  • A. Eldawy and M. Mokbel. “The Era of Big Spatial Data: A Survey”, Foundations and Trends in Databases 2016

Parallel

ESRI Tools for Hadoop

SpatialHadoop is the only extensible system that can be easily expanded by researchers and developers

slide-56
SLIDE 56

Performance Results

[Chart: Spatial join running time (sec) with different indexes: Quad, K-d, Hilbert]

[Chart: Throughput of range query, Hadoop vs. SpatialHadoop]

[Chart: Speedup of CG_Hadoop for Union, Voronoi, Skyline, Convex Hull, Closest Pair, and Farthest Pair; legend: Baseline, Hadoop, SpatialHadoop; labeled values: 500X, 260X]

[Chart: Visualization speedup for Scatter Plot, Roads, Heatmap, Satellite, Vector Map, and Border Lines; legend: Baseline, HadoopViz; labeled value: 48X]

slide-57
SLIDE 57

Agenda

The ecosystem of SpatialHadoop

  • Motivation
  • System design
  • Applications
  • Related work
  • Performance results

Future directions

slide-58
SLIDE 58

Adaptive Multilevel Visualization

Image tiles: big enough to consider preprocessing

Data tiles: small data sizes in those areas; not worth preprocessing

Empty tiles: no data in there

Empty tiles: covered by parent data tiles
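
One possible reading of this tile classification is sketched in Python below. The size threshold, the unit-square coordinate space, and the function names are all assumptions made for illustration; the slide does not specify them. Tiles are classified top-down so that children of a data tile are not generated.

```python
from collections import Counter

def tile_of(x, y, z):
    """Tile (z, col, row) containing a point of the unit square at zoom level z."""
    n = 2 ** z
    return z, min(int(x * n), n - 1), min(int(y * n), n - 1)

def classify_pyramid(points, max_zoom, threshold):
    """Label every tile of the pyramid as 'image', 'data', or 'empty'."""
    counts = Counter(tile_of(x, y, z) for x, y in points for z in range(max_zoom + 1))
    kinds = {}
    for z in range(max_zoom + 1):
        for col in range(2 ** z):
            for row in range(2 ** z):
                # the root has no parent; treat it as if its parent were an image tile
                parent = kinds.get((z - 1, col // 2, row // 2), "image")
                n = counts[(z, col, row)]
                if n == 0 or parent != "image":
                    kinds[(z, col, row)] = "empty"  # no data, or covered by a parent data tile
                elif n > threshold:
                    kinds[(z, col, row)] = "image"  # big enough to preprocess into an image
                else:
                    kinds[(z, col, row)] = "data"   # small: keep raw data, render on demand
    return kinds

# Ten points clustered in one corner plus one isolated point (hypothetical data)
points = [(0.1 + i * 0.001, 0.1) for i in range(10)] + [(0.9, 0.9)]
kinds = classify_pyramid(points, max_zoom=2, threshold=3)
summary = Counter(kinds.values())  # 3 image tiles, 1 data tile, 17 empty tiles
```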

slide-59
SLIDE 59

Adaptive Multilevel Images

9/27/2017 Ahmed Eldawy 59

Full Image (3,160 tiles)

Adaptive Image (231 files)

slide-60
SLIDE 60

Dynamic Indexes

slide-61
SLIDE 61

Analysis of Satellite Data

slide-62
SLIDE 62

Existing Methods

Input Rasterize Vectorize

Huge intermediate data for high-resolution images

slide-63
SLIDE 63

Scanline Method

slide-64
SLIDE 64

Performance

[Chart: performance of Vectorize, Rasterize (SciDB), Rasterize (PostGIS), and Scanline on the GLC2000, MERIS, ASTER, and Treecover datasets, each with Counties, States, and Boundaries; axis ticks from 1 to 100,000 in powers of ten]

slide-65
SLIDE 65

Summary

Indexes, Operations, Visualization (HadoopViz), Apps

slide-66
SLIDE 66

Thank You

Questions?