Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond



SLIDE 1

Spatial Data Management in Apache Spark

The GeoSpark Perspective and Beyond

Jia Yu

SLIDE 2

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 3

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 4

WHAT IS GEOSPARK

  • A spatial data management system on top of Apache Spark since 2015
  • Some statistics
  • Monthly: downloads > 4K, visits > 8K; overall: downloads > 40K, visits > 100K
  • Was listed as an Infrastructure Project on the official Apache Spark third-party projects page
  • Users and contributors from Apple, Facebook, Uber, and numerous startup companies
  • Evaluation from a Very Large Data Bases (VLDB) 2018 research paper

“GeoSpark comes close to a complete spatial analytics system because of data types and queries supported and the control user has while writing applications. It also exhibits the best performance in most cases.”

  • How Good Are Modern Spatial Analytics Systems? PVLDB, Vol. 11

SLIDE 5

GEOSPARK OVERVIEW

Figure: GeoSpark architecture.

  • APIs: Spatial SQL API, Scala/Java RDD API
  • Query Optimizer
  • Spatial Query Processing Layer: Range, KNN, Range Join, Distance, Distance Join
  • Spatial RDD Layer: Spatial RDD, Global Spatial RDD Partitioner, Spatial Index
  • Geometrical Operations Library: Point, Polygon, Line string ...

Example flow: SQL query -> Spatial RDD / DataFrame -> spatial partitioning, index, query optimization -> query result

SELECT superhero.name FROM city, superhero WHERE ST_Contains(city.geom, superhero.geom) AND city.name = 'Gotham';

SLIDE 6

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 7

HETEROGENEOUS GEOMETRIES

How about this? Geometries can be even more complex, e.g., country boundaries

SELECT ST_GeomFromWKT ( TaxiTripRawTable.pickuppointString ) FROM TaxiTripRawTable

SLIDE 8

CUSTOM SERIALIZER

Figure: serializing and de-serializing spatial objects (Point, Polygon, LineString, …, spatial index, …) to and from byte arrays, shown for Spark and for GeoSpark.
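To make the idea of a compact byte-array encoding concrete, here is a minimal sketch. The layout (a type tag, a vertex count, then raw coordinates) and the names serialize / deserialize are assumptions for illustration only; this is not GeoSpark's actual wire format.

```python
import struct

# Hypothetical compact layout (an assumption, not GeoSpark's actual format):
#   1 byte   geometry type tag
#   4 bytes  little-endian vertex count
#   16 bytes per vertex (two float64 coordinates)
POINT, LINESTRING, POLYGON = 1, 2, 3


def serialize(geom_type, coords):
    """Pack a geometry into a byte array: header followed by raw coordinates."""
    flat = [value for xy in coords for value in xy]
    return struct.pack(f"<BI{len(flat)}d", geom_type, len(coords), *flat)


def deserialize(buf):
    """Rebuild (geom_type, coords) from the byte array produced by serialize()."""
    geom_type, n = struct.unpack_from("<BI", buf, 0)
    flat = struct.unpack_from(f"<{2 * n}d", buf, 5)  # header is 5 bytes
    return geom_type, [(flat[i], flat[i + 1]) for i in range(0, 2 * n, 2)]
```

A fixed header plus raw coordinates keeps the encoding small and lets the reader skip any per-object class metadata, which is the general motivation for a custom serializer.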

SLIDE 9

SPATIAL PARTITIONING

Figure labels: "Not scalable" vs. "Scalable and fast" (range query, join query)

SLIDE 10

SPATIAL PARTITIONING

Uniform grids, KDB-Tree, Quad-Tree, R-Tree
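A rough sketch of why tree-based partitioning balances load better than a uniform grid on skewed data. The median-split function below is a simplified stand-in for KDB-Tree partitioning, not GeoSpark's implementation; all names are hypothetical, and capacity is assumed to be at least 1.

```python
def uniform_grid_id(x, y, bounds, cells_per_side):
    """Map a point to a cell of a fixed grid; skewed data overloads some cells."""
    min_x, min_y, max_x, max_y = bounds
    col = min(int((x - min_x) / (max_x - min_x) * cells_per_side), cells_per_side - 1)
    row = min(int((y - min_y) / (max_y - min_y) * cells_per_side), cells_per_side - 1)
    return row * cells_per_side + col


def kd_partitions(points, capacity, depth=0):
    """Split at the median, alternating axes, until each leaf holds <= capacity points."""
    if len(points) <= capacity:
        return [points]
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partitions(ordered[:mid], capacity, depth + 1)
            + kd_partitions(ordered[mid:], capacity, depth + 1))
```

The grid fixes cell boundaries up front, while the median split adapts boundaries to where the data actually is.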

SLIDE 11

SPATIAL INDEXING

  • Master: global spatial partitioning grid file
  • Each worker: local R-Tree or Quad-Tree index

SLIDE 12

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 13

SPATIAL RANGE QUERY

SELECT * FROM TaxiTripTable WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint)
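A range query like the one above can be sketched as a cheap bounding-box filter followed by an exact point-in-polygon test. This is an illustrative, single-machine sketch with hypothetical names; boundary handling is simplified and may differ from ST_Contains semantics.

```python
def mbr(polygon):
    """Minimum bounding rectangle of a polygon given as a list of (x, y) vertices."""
    xs, ys = zip(*polygon)
    return min(xs), min(ys), max(xs), max(ys)


def in_mbr(point, box):
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def in_polygon(point, polygon):
    """Ray casting: count how many edges a horizontal ray from the point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def range_query(points, polygon):
    box = mbr(polygon)
    candidates = [p for p in points if in_mbr(p, box)]        # filter: cheap
    return [p for p in candidates if in_polygon(p, polygon)]  # refine: exact
```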

SLIDE 14

SPATIAL KNN QUERY

Figure: KNN query data flow.

  • Stage 1 (narrow dependency): query the local index of each partition of the cached, indexed SRDD and take the first K, producing an intermediate SRDD
  • Stage 2 (wide dependency, small shuffle): sort the intermediate SRDD and write the output file
  • Other figure label: Recycle

SELECT ST_Neighbors(MyLocation, Restaurants.Locations, 20) FROM Restaurants
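The two-stage KNN flow can be sketched on plain Python lists standing in for partitions. The function name knn and the use of Euclidean distance are assumptions for illustration, not GeoSpark's API.

```python
import heapq
from math import dist


def knn(partitions, query, k):
    # Stage 1 (narrow dependency): each partition keeps only its own k nearest.
    local_top_k = [heapq.nsmallest(k, part, key=lambda p: dist(p, query))
                   for part in partitions]
    # Stage 2 (wide dependency, small shuffle): at most k * len(partitions)
    # candidates are gathered and sorted to produce the global answer.
    candidates = [p for top in local_top_k for p in top]
    return sorted(candidates, key=lambda p: dist(p, query))[:k]
```

The shuffle stays small because each partition contributes at most K candidates, regardless of how many points it holds.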

SLIDE 15

SPATIAL JOIN QUERY

SELECT * FROM TaxiZones, TaxiTripTable WHERE ST_Contains(TaxiZones.bound, TaxiTripTable.pickuppoint)

SLIDE 16

SPATIAL JOIN QUERY

Figure: join query data flow for two algorithms.

  • Broadcast Join algorithm: without spatial partitioning
  • GSJoin algorithm: with spatial partitioning on the two input SRDDs

GSJoin data flow: SRDD A (indexed, repartitioned, cached) and SRDD B (repartitioned) -> zip partitions by ID -> query local index (Stage 1, narrow dependency) -> intermediate SRDD -> result SRDD

Other figure labels: Wide dependency, Shuffle, Small shuffle, Recycle
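A minimal sketch of the difference between the two join strategies, using axis-aligned boxes and points on a uniform grid. It illustrates "zip partitions by ID" under simplifying assumptions (grid partitioning, no local index) and is not the GSJoin implementation; all names are hypothetical.

```python
def contains(box, point):
    min_x, min_y, max_x, max_y = box
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y


def cell_of(x, y, cell_size):
    return int(x // cell_size), int(y // cell_size)


def partition_points(points, cell_size):
    parts = {}
    for p in points:
        parts.setdefault(cell_of(p[0], p[1], cell_size), []).append(p)
    return parts


def partition_boxes(boxes, cell_size):
    # A box is copied to every cell it overlaps so no match is lost.
    parts = {}
    for b in boxes:
        c0, r0 = cell_of(b[0], b[1], cell_size)
        c1, r1 = cell_of(b[2], b[3], cell_size)
        for c in range(c0, c1 + 1):
            for r in range(r0, r1 + 1):
                parts.setdefault((c, r), []).append(b)
    return parts


def partitioned_join(boxes, points, cell_size):
    box_parts = partition_boxes(boxes, cell_size)
    point_parts = partition_points(points, cell_size)
    # "Zip partitions by ID": only partitions with the same cell ID are joined.
    return {(b, p)
            for cell in box_parts.keys() & point_parts.keys()
            for b in box_parts[cell]
            for p in point_parts[cell]
            if contains(b, p)}


def broadcast_join(boxes, points):
    # No spatial partitioning: every point is checked against every box.
    return {(b, p) for b in boxes for p in points if contains(b, p)}
```

Both functions return the same pairs; the partitioned version only compares objects that share a cell.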

SLIDE 17

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 18

QUERY OPTIMIZER (V1.2.0)

  • Heuristics-Based Optimization
  • Cost-Based Optimization

Figure: query optimization pipeline.

DataFrame / SQL -> Unresolved Logical Plan -> (Analysis, using the System Catalog) -> Logical Plan -> (Logical Optimization) -> Optimized Logical Plan -> (Physical Planning) -> Physical Plans -> (Cost Models) -> Selected Physical Plan -> (Code Generation)

GeoSpark extensions: GeoSpark Expressions, GeoSpark Heuristic Rules, GeoSpark Cost-based Strategies, GeoSpark Statistics

SLIDE 19

HEURISTICS BASED OPTIMIZATION

  • Predicate pushdown

SELECT * FROM TaxiStopStations, TaxiTripTable WHERE ST_Contains(TaxiStopStations.bound, TaxiTripTable.pickuppoint) AND ST_Contains(Manhattan, TaxiStopStations.bound)

(a) No predicate pushdown: taxi stops and pickup points -> range join (broadcast or GSJoin) -> range filter: Manhattan -> result

(b) With predicate pushdown: taxi stops -> range filter: Manhattan -> range join (broadcast or GSJoin) with pickup points -> result
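The effect of predicate pushdown can be sketched by counting join comparisons under both plans. Stations and the region are modeled as boxes and pickups as points; the function names and the comparison counter are illustrative assumptions.

```python
def box_contains_box(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def box_contains_point(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def join_then_filter(stations, pickups, region):
    """Plan (a): join everything, then apply the range filter to the result."""
    pairs = [(s, p) for s in stations for p in pickups if box_contains_point(s, p)]
    result = [(s, p) for s, p in pairs if box_contains_box(region, s)]
    return result, len(stations) * len(pickups)  # second value: join comparisons


def filter_then_join(stations, pickups, region):
    """Plan (b): push the range filter below the join so the join sees fewer rows."""
    kept = [s for s in stations if box_contains_box(region, s)]
    result = [(s, p) for s in kept for p in pickups if box_contains_point(s, p)]
    return result, len(kept) * len(pickups)
```

Both plans return the same rows; plan (b) simply does less join work when the range filter discards stations.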

SLIDE 20

HEURISTICS BASED OPTIMIZATION

  • Predicate merging

(a) AND, take the intersection

SELECT * FROM TaxiTripTable WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint) AND ST_Contains(Queens, TaxiTripTable.pickuppoint)

(b) OR, take the union

SELECT * FROM TaxiTripTable WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint) OR ST_Contains(Queens, TaxiTripTable.pickuppoint)
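A small sketch of the AND case, assuming rectangular query windows so that the intersection is again a rectangle. Real query regions are polygons and the OR case needs a polygon union, which this sketch does not attempt; all names are hypothetical.

```python
def intersect(a, b):
    """Intersection of two boxes (min_x, min_y, max_x, max_y), or None if empty."""
    box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box if box[0] <= box[2] and box[1] <= box[3] else None


def contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def two_filters(points, a, b):
    # Unmerged plan: two containment tests per point.
    return [p for p in points if contains(a, p) and contains(b, p)]


def merged_filter(points, a, b):
    # Merged plan: intersect the windows once, then one test per point.
    window = intersect(a, b)
    return [] if window is None else [p for p in points if contains(window, p)]
```

When the two windows do not overlap, the merged plan can return an empty result without scanning any points.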

SLIDE 21

HEURISTICS BASED OPTIMIZATION

  • Intersection query rewrite

Cross join, slow:

SELECT ST_Intersection(Lions.habitat, Zebras.habitat) FROM Lions, Zebras;

Optimized GeoSpark inner join, fast:

SELECT ST_Intersection(Lions.habitat, Zebras.habitat) FROM Lions, Zebras WHERE ST_Intersects(Lions.habitat, Zebras.habitat);
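The rewrite is safe because pairs that do not intersect contribute only empty intersections. A sketch with boxes standing in for habitats (names hypothetical; the nested loops here only show which rows survive, not how a spatial join avoids the cross product):

```python
def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def intersection(a, b):
    """Box intersection, or None when the boxes do not overlap."""
    if not intersects(a, b):
        return None
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def cross_join(lions, zebras):
    # Original query: one output row per pair, most of them empty.
    return [intersection(lion, zebra) for lion in lions for zebra in zebras]


def rewritten_join(lions, zebras):
    # Rewritten query: the added intersects predicate keeps only overlapping
    # pairs, which is what allows a spatial join instead of a cross join.
    return [intersection(lion, zebra)
            for lion in lions for zebra in zebras if intersects(lion, zebra)]
```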

SLIDE 22

COST BASED OPTIMIZATION

  • Cost: based on GeoSpark statistics (MBR, count)
  • Index scan selection: index scan vs. DataFrame scan, based on query selectivity
  • Spatial join algorithm selection: partition-wise GeoSpark join vs. broadcast join
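How MBR and count statistics could drive these two choices can be sketched as follows. The selectivity estimate assumes uniformly distributed data, and both thresholds (index_threshold, broadcast_limit) are made-up values for illustration, not GeoSpark's cost model.

```python
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def estimate_selectivity(data_mbr, query_window):
    """Fraction of the dataset MBR covered by the query window (uniformity assumed)."""
    total = area(data_mbr)
    return area(overlap(data_mbr, query_window)) / total if total else 1.0


def choose_scan(data_mbr, query_window, index_threshold=0.1):
    # Hypothetical rule: use the index only when the window covers a small fraction.
    selectivity = estimate_selectivity(data_mbr, query_window)
    return "index scan" if selectivity < index_threshold else "DataFrame scan"


def choose_join(left_count, right_count, broadcast_limit=10_000):
    # Hypothetical rule: broadcast when one side is small enough to ship to every worker.
    small = min(left_count, right_count)
    return "broadcast join" if small <= broadcast_limit else "partition-wise GeoSpark join"
```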

SLIDE 23

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 24

Feature                                  MapD                            Kinetica       GeoSpark
Distributed                              Yes, late 2017                  Yes            Yes
SpatialSQL                               Yes                             Yes, limited   Yes
Compact in-mem geometry, index           No                              No             Yes
Distributed spatial index                No, nested loop                 No             Yes, dist. Quad-Tree, R-Tree
Distributed spatial data partitioning    No, still hash or round-robin   No             Yes, 4 spatial partition methods
Opt. distributed spatial join            No                              No             Yes
Spatial query optimizer                  No                              No             Yes, HBO, CBO
Fault tolerance                          No, fail right away             Yes            Yes, RDD lineage
SQL CodeGen                              Yes                             No             Yes
Streaming                                Yes                             Yes            Yes
Storage system                           Yes                             Yes            No, but + MapD, + Kinetica

SLIDE 25

REFERENCE

  • MapD RoadMap: https://github.com/mapd/mapd-core/blob/master/ROADMAP.md
  • Kinetica: https://www.kinetica.com/product/faq/

SLIDE 26

THIS TALK

  • GeoSpark overview
  • Spatial RDD / DataFrame layer
  • Spatial query processing layer
  • Query optimizer
  • GPU-based spatial database
  • Experiments

SLIDE 27

JOIN QUERY

Figure panels: Point, Polygon, LineString

  • 1.3 billion points join 171 thousand polygons
  • 72.7 million line strings join 171 thousand polygons
  • 263 million polygons join 171 thousand polygons
  • Cluster: 4 machines

SLIDE 28

CONCLUSION

  • GeoSpark is the fastest approach compared to other systems
  • For join queries, GeoSpark uses the least memory because it lets Spark serialize and deserialize data quickly, without keeping too much intermediate data in memory

SLIDE 29

QUESTIONS?

SLIDE 30

THE IMPACT OF INDEX

Point data Polygon data

SLIDE 31

CONCLUSION 2

  • A spatial index only helps when pruning complex shapes, because of the filter-and-refine model

SLIDE 32

THE IMPACT OF SPATIAL PARTITIONING

Point Polygon LineString

SLIDE 33

CONCLUSION 3

  • KDB-Tree partitioning is the most load-balanced
  • Quad-Tree is better than R-Tree
  • R-Tree is the worst of the three, but still better than uniform grids

SLIDE 34

POINT RANGE

SLIDE 35

POLYGON RANGE

SLIDE 36

LINE STRING RANGE