Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond
Jia Yu

THIS TALK
• GeoSpark overview
• Spatial RDD / DataFrame layer
• Spatial query processing layer
• Query optimizer
• GPU-based spatial database
• Experiments

WHAT IS GEOSPARK
• A spatial data management system on top of Apache Spark, since 2015
• Some statistics
  • Monthly: downloads > 4K, visits > 8K; overall: downloads > 40K, visits > 100K
• Was listed as an Infrastructure Project on the official Apache Spark third-party project page
• Users and contributors from Apple, Facebook, Uber, and numerous startup companies
• Evaluation from a recent Very Large Data Bases (VLDB) 2018 research paper:
  “GeoSpark comes close to a complete spatial analytics system because of data types and queries supported and the control user has while writing applications. It also exhibits the best performance in most cases.” (How Good Are Modern Spatial Analytics Systems? PVLDB Vol. 11)

GEOSPARK OVERVIEW
• APIs: Spatial SQL API; Scala/Java RDD API
• Query Optimizer
• Spatial Query Processing Layer: Range, Range Join, Distance, Distance Join, KNN; query optimization
• Spatial RDD / DataFrame Layer: Spatial RDD (Point, Polygon, Line string, ...); Geometrical Operations Library; spatial partitioning (global Spatial RDD partitioner); spatial index
Example Spatial SQL query:
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';

HETEROGENEOUS GEOMETRIES
• How about this? It can be even more complex, e.g., country boundaries
SELECT ST_GeomFromWKT(TaxiTripRawTable.pickuppointString)
FROM TaxiTripRawTable

CUSTOM SERIALIZER
[Figure: serializing objects to a byte array and de-serializing them back, in Spark and in GeoSpark. The GeoSpark serializer handles Point, Polygon, LineString, ..., and spatial indexes.]

SPATIAL PARTITIONING
• Applies to range queries and join queries
[Figure: two data layouts, one labeled "not scalable" and one labeled "scalable and fast".]

SPATIAL PARTITIONING
• Uniform grids
• KDB-Tree
• Quad-Tree
• R-Tree
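The simplest of these methods, uniform grids, can be sketched in a few lines. This is an illustrative sketch only, not GeoSpark code: the function names, the bounding box, and the 2 x 2 grid are assumptions made for the example, and every point is assumed to lie inside the bounding box.

```python
from collections import defaultdict

def grid_cell(x, y, bounds, cols, rows):
    """Map a point to the (col, row) of the uniform grid cell that contains it."""
    min_x, min_y, max_x, max_y = bounds
    # Points on the upper boundary are clamped into the last cell.
    col = min(int((x - min_x) / (max_x - min_x) * cols), cols - 1)
    row = min(int((y - min_y) / (max_y - min_y) * rows), rows - 1)
    return col, row

def partition_points(points, bounds, cols, rows):
    """Group points by grid cell; each cell plays the role of one partition."""
    parts = defaultdict(list)
    for x, y in points:
        parts[grid_cell(x, y, bounds, cols, rows)].append((x, y))
    return dict(parts)

# Hypothetical data: five points in a 10 x 10 space, split into a 2 x 2 grid.
points = [(0.5, 0.5), (0.6, 0.4), (9.5, 9.5), (10.0, 10.0), (2.0, 7.0)]
partitions = partition_points(points, bounds=(0.0, 0.0, 10.0, 10.0), cols=2, rows=2)
```

Nearby points end up in the same partition, which is what lets range and join queries skip partitions that cannot contain results. With skewed data a uniform grid leaves some cells crowded and others empty, which is the motivation for the tree-based methods in the list.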

SPATIAL INDEXING
• Master: global spatial partitioning grid file
• Each worker: local R-Tree or Quad-Tree

SPATIAL RANGE QUERY
SELECT * FROM TaxiTripTable
WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint)

SPATIAL KNN QUERY
SELECT ST_Neighbors(MyLocation, Restaurants.Locations, 20)
FROM Restaurants
[Figure: KNN query data flow. Stage 1: query the local index of each partition of the indexed SRDD (cached and recycled), producing an intermediate SRDD over a narrow dependency. Stage 2: after a small shuffle (wide dependency), sort the intermediate SRDD and take the first K into the output file.]
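The two-stage KNN flow can be sketched as follows. This is a minimal sketch under stated assumptions, not the GeoSpark implementation: partitions are plain Python lists, a linear scan stands in for the local index lookup, and the names local_top_k and knn are made up for the example.

```python
import heapq
import math

def local_top_k(partition, query, k):
    """Stage 1: each partition returns only its own K nearest candidates."""
    return heapq.nsmallest(k, partition, key=lambda p: math.dist(p, query))

def knn(partitions, query, k):
    """Stage 2: gather the small candidate lists, sort them, take the first K."""
    candidates = [p for part in partitions for p in local_top_k(part, query, k)]
    return sorted(candidates, key=lambda p: math.dist(p, query))[:k]

# Hypothetical data: three partitions of points, query located at the origin.
partitions = [[(0, 0), (5, 5), (1, 1)], [(2, 2), (9, 9)], [(0, 1), (7, 7)]]
nearest = knn(partitions, query=(0, 0), k=3)
```

Because each partition contributes at most K candidates, the data moved between the two stages stays small regardless of the input size, which matches the "small shuffle" label in the data flow.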

SPATIAL JOIN QUERY
SELECT * FROM TaxiZones, TaxiTripTable
WHERE ST_Contains(TaxiZones.bound, TaxiTripTable.pickuppoint)

SPATIAL JOIN QUERY
• Broadcast Join algorithm: without spatial partitioning
• GSJoin algorithm: with spatial partitioning on the two input SRDDs
[Figure: join query data flows. Labels: SRDD A and indexed SRDD B, repartitioned by a shuffle (wide dependency), cached and recycled; zip partitions by ID (narrow dependency); query local index; small shuffle; intermediate SRDD; result SRDD.]
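The idea behind a partition-wise join (partition both inputs with the same grid, zip partitions by ID, join locally) can be sketched as below. This is an illustrative sketch, not the GSJoin implementation: zones are simplified to axis-aligned rectangles, a nested loop stands in for the local index lookup, and CELL, cell_of, and partitioned_join are names invented for the example.

```python
from collections import defaultdict

CELL = 5.0  # hypothetical grid cell size shared by both inputs

def cell_of(x, y):
    """Partition ID of the grid cell containing (x, y)."""
    return (int(x // CELL), int(y // CELL))

def partition_points(points):
    parts = defaultdict(list)
    for p in points:
        parts[cell_of(*p)].append(p)
    return parts

def partition_rects(rects):
    """A rectangle is copied into every cell it overlaps."""
    parts = defaultdict(list)
    for name, (x1, y1, x2, y2) in rects.items():
        c1, r1 = cell_of(x1, y1)
        c2, r2 = cell_of(x2, y2)
        for c in range(c1, c2 + 1):
            for r in range(r1, r2 + 1):
                parts[(c, r)].append((name, (x1, y1, x2, y2)))
    return parts

def contains(rect, p):
    x1, y1, x2, y2 = rect
    return x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def partitioned_join(rects, points):
    """Zip partitions by ID and join each pair of partitions locally."""
    rect_parts, point_parts = partition_rects(rects), partition_points(points)
    result = []
    for pid in rect_parts.keys() & point_parts.keys():
        for name, rect in rect_parts[pid]:
            for p in point_parts[pid]:
                if contains(rect, p):
                    result.append((name, p))
    return sorted(result)

# Hypothetical data: two zones and four pickup points.
zones = {"A": (0, 0, 4, 4), "B": (3, 3, 8, 8)}
pickups = [(1, 1), (3.5, 3.5), (6, 6), (9, 9)]
pairs = partitioned_join(zones, pickups)
```

Each point lives in exactly one cell, so copying a rectangle into several cells does not produce duplicate result pairs in this point-versus-rectangle case. Joining two sets of extended shapes would need an extra de-duplication step that this sketch leaves out.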

QUERY OPTIMIZER (V1.2.0)
• Heuristics-Based Optimization
• Cost-Based Optimization
[Figure: query planning pipeline. SQL / DataFrames → Unresolved Logical Plan → Analysis (with the Catalog) → Logical Plan → Logical Optimization → Optimized Logical Plan → Physical Planning → Physical Plans → Cost Models → Selected Physical Plan → Code Generation. GeoSpark contributes heuristic rules, expressions, cost-based strategies, and statistics.]

HEURISTICS BASED OPTIMIZATION
• Predicate pushdown
SELECT * FROM TaxiStopStations, TaxiTripTable
WHERE ST_Contains(TaxiStopStations.bound, TaxiTripTable.pickuppoint)
AND ST_Contains(Manhattan, TaxiStopStations.bound)
(a) No predicate pushdown: range join (Broadcast or GSJoin) of pickup points and taxi stops first, then the Manhattan range filter on the join result
(b) With predicate pushdown: the Manhattan range filter on pickup points and on taxi stops first, then the range join (Broadcast or GSJoin) on the filtered inputs

HEURISTICS BASED OPTIMIZATION
• Predicate merging
(a) AND, take the intersection
SELECT * FROM TaxiTripTable
WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint)
AND ST_Contains(Queens, TaxiTripTable.pickuppoint)
(b) OR, take the union
SELECT * FROM TaxiTripTable
WHERE ST_Contains(Manhattan, TaxiTripTable.pickuppoint)
OR ST_Contains(Queens, TaxiTripTable.pickuppoint)
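Case (a) can be illustrated with a small sketch in which the two query regions are simplified to axis-aligned rectangles (an assumption; real regions such as Manhattan are polygons). The helper names merge_and and range_query are hypothetical. Two ANDed range predicates collapse into one window, so the data is scanned once against a smaller region, and an empty intersection answers the query without scanning at all. Case (b) is left out because the union of two rectangles is generally not a rectangle.

```python
def merge_and(a, b):
    """Intersect two windows given as (min_x, min_y, max_x, max_y).

    Returns None when the windows are disjoint, meaning no point can match.
    """
    min_x, min_y = max(a[0], b[0]), max(a[1], b[1])
    max_x, max_y = min(a[2], b[2]), min(a[3], b[3])
    if min_x > max_x or min_y > max_y:
        return None
    return (min_x, min_y, max_x, max_y)

def range_query(points, window):
    """Single range filter against the merged window."""
    if window is None:
        return []
    return [p for p in points
            if window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]]

# Hypothetical windows standing in for the two regions in the query.
window_a = (0, 0, 10, 10)
window_b = (5, 5, 15, 15)
merged = merge_and(window_a, window_b)
pickups = [(1, 1), (6, 6), (12, 12)]
hits = range_query(pickups, merged)
```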

HEURISTICS BASED OPTIMIZATION
• Intersection query rewrite
Cross join, slow:
SELECT ST_Intersection(Lions.habitat, Zebras.habitat)
FROM Lions, Zebras
Optimized GeoSpark inner join, fast:
SELECT ST_Intersection(Lions.habitat, Zebras.habitat)
FROM Lions, Zebras
WHERE ST_Intersects(Lions.habitat, Zebras.habitat);

COST BASED OPTIMIZATION
• Cost: based on GeoSpark statistics (MBR, count)
• Index scan selection: index scan vs. DataFrame scan, based on query selectivity
• Spatial join algorithm selection: partition-wise GeoSpark join vs. broadcast join

Feature                               | MapD                          | Kinetica     | GeoSpark
Distributed                           | Yes, late 2017                | Yes          | Yes
Spatial SQL                           | Yes                           | Yes, limited | Yes
Compact in-mem geometry, index        | No                            | No           | Yes
Distributed spatial index             | No, nested loop               | No           | Yes, dist. Quad-Tree, R-Tree
Distributed spatial data partitioning | No, still hash or round-robin | No           | Yes, 4 spatial partition methods
Opt. distributed spatial join         | No                            | No           | Yes
Spatial query optimizer               | No                            | No           | Yes, HBO, CBO
Fault tolerance                       | No, fail right away           | Yes          | Yes, RDD lineage
SQL CodeGen                           | Yes                           | No           | Yes
Streaming                             | Yes                           | Yes          | Yes
Storage system                        | Yes                           | Yes          | No, but + MapD, + Kinetica

REFERENCE
• MapD RoadMap: https://github.com/mapd/mapd-core/blob/master/ROADMAP.md
• Kinetica: https://www.kinetica.com/product/faq/

JOIN QUERY
• 4 machines
• 1.3 billion points join 171 thousand polygons
• 72.7 million line strings join 171 thousand polygons
• 263 million polygons join 171 thousand polygons
[Figure: join query results, with panels for Point, Polygon, and LineString.]

CONCLUSION
• GeoSpark is the fastest of the systems compared
• For join queries, GeoSpark uses the least memory, because it lets Spark serialize and deserialize data quickly so that less intermediate data lingers in memory

QUESTIONS?

THE IMPACT OF INDEX
[Figure: panels for point data and polygon data.]

CONCLUSION 2
• A spatial index helps only when it prunes complex shapes, because of the filter-and-refine model
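The filter-and-refine model behind this conclusion can be sketched as follows. This is a minimal illustration, not GeoSpark code: the filter step uses a plain bounding-box (MBR) check where a real system would consult a spatial index, the refine step uses a standard ray-casting test, and all function names are made up for the example.

```python
def mbr(polygon):
    """Minimum bounding rectangle of a polygon given as a list of (x, y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_mbr(box, p):
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]

def in_polygon(polygon, p):
    """Exact ray-casting test; points exactly on an edge get no special handling."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_and_refine(polygon, points):
    """Filter: cheap MBR check. Refine: exact test on the survivors only."""
    box = mbr(polygon)
    candidates = [p for p in points if in_mbr(box, p)]
    return [p for p in candidates if in_polygon(polygon, p)], len(candidates)

# Hypothetical data: a triangle and four points.
triangle = [(0, 0), (10, 0), (0, 10)]
points = [(1, 1), (9, 9), (20, 20), (4, 4)]
matches, n_candidates = filter_and_refine(triangle, points)
```

The filter step pays off when the refine step is expensive, as it is for polygons with many vertices. For points and other simple shapes the exact test is already about as cheap as the filter, so pruning saves little.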

THE IMPACT OF SPATIAL PARTITIONING
[Figure: panels for Point, Polygon, and LineString.]

CONCLUSION 3
• KDB-Tree partitioning is the most load-balanced
• Quad-Tree is the next best
• R-Tree is the worst of the tree-based methods, but still better than uniform grids

POINT RANGE

POLYGON RANGE

LINE STRING RANGE
