Indexing the Pickup and Drop-off Locations of NYC Taxi Trips in PostgreSQL: Lessons from the Road
Jia Yu and Mohamed Sarwat
Arizona State University

A Little Story…
• August 1, 2015: Over 1 billion taxi trip records from 2009 to 2015 were released by the New York City Taxi & Limousine Commission
• Since then: New taxi trip records keep being published on the Internet
• As of today: Millions of new records have been added to the dataset

Sample record:
PickupTime: 2009-01-01 08:01:01
DropoffTime: 2009-01-01 08:20:37
TripDistance: 2.2
PickupLocation: (40.7577, -73.9851)
DropoffLocation: (40.7497, -73.9882)
PaymentType: Credit Card
FareAmount: 15.5
TipAmount: 3.5

Photo credit: NYC TLC website

A Little Story… (cont.)
• People want to run spatial queries on this 175 GB dataset in PostGIS
• They need a spatial index to speed up the queries
• Which spatial index can handle 1 billion records, 175 GB of data, and millions of new records that keep being published?

Compared Approaches: GiST
• Generalized Search Tree (GiST-Spatial, similar to R-Tree)
• Index structure: tree index (root, non-leaf nodes, and leaf entries that point to tuples)
• Index entry (tree node): Minimum Bounding Rectangle (MBR), tuple pointers
• Index search: top-down, fast pruning by checking the query window against each MBR
• Index maintenance: search the tree, then split nodes (if full) and merge nodes (if too empty)
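The top-down search with MBR pruning described above can be sketched in a few lines of Python. This is a minimal illustration, not the GiST implementation in PostgreSQL: the `Node` class, the two-level tree, and the sample points are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]


def intersects(a: Rect, b: Rect) -> bool:
    """True if rectangles a and b overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(window: Rect, p: Point) -> bool:
    """True if point p lies inside the query window."""
    return window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]


@dataclass
class Node:
    """Hypothetical tree node: an MBR plus child nodes or leaf points."""
    mbr: Rect
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)


def search(node: Node, window: Rect, descended: List[Rect]) -> List[Point]:
    """Top-down search; subtrees whose MBR misses the window are pruned."""
    if not intersects(node.mbr, window):
        return []                      # prune: nothing below can match
    descended.append(node.mbr)         # record nodes that were actually opened
    if node.children:
        hits: List[Point] = []
        for child in node.children:
            hits.extend(search(child, window, descended))
        return hits
    return [p for p in node.points if contains(window, p)]


# Toy tree: one root with two leaves.
leaf_a = Node(mbr=(0, 0, 2, 2), points=[(1, 1), (2, 2)])
leaf_b = Node(mbr=(5, 5, 8, 8), points=[(6, 6), (7, 8)])
root = Node(mbr=(0, 0, 8, 8), children=[leaf_a, leaf_b])

opened: List[Rect] = []
hits = search(root, (0.5, 0.5, 1.5, 1.5), opened)
print(hits, opened)
```

Only the root and `leaf_a` are opened for this window; `leaf_b` is skipped because its MBR does not overlap the query window.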

Compared Approaches: GiST
• Summary of GiST
  • Fast index search
  • Large storage overhead: 20% or more additional overhead
  • Slow maintenance: split, merge tree nodes

Index name | Data size | Index size | Initialization time | Insertion (0.1%)
GiST | 175 GB (NYC) | 84 GB | 28 hours | 6 hours

Compared Approaches: BRIN-Spatial
• Block Range Index (BRIN-Spatial):
  • PostgreSQL 9.5, PostGIS 2.3
  • Indexes heap file pages
• Index search:
  1. Serial search, checking the query window against the MBR of each page range
  2. Filter false positive pages
• Index maintenance:
  • Update the MBR on insertion
  • No update on deletion
• Example query:
  SELECT * FROM NYCtaxi N WHERE ST_Within(N.pickuppoint, QueryWindow)
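A minimal Python sketch of the two search steps above, under stated assumptions: heap pages are modeled as lists of points, and `build_brin` and `brin_search` are hypothetical helper names, not PostgreSQL or PostGIS functions.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]
Entry = Tuple[int, int, Optional[Rect]]    # first page, last page, MBR


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def contains(window: Rect, p: Point) -> bool:
    return window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]


def mbr_of(points: List[Point]) -> Optional[Rect]:
    """Minimum bounding rectangle of a list of points (None if empty)."""
    if not points:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def build_brin(pages: List[List[Point]], pages_per_range: int) -> List[Entry]:
    """One entry per fixed-size page range, holding that range's MBR."""
    entries: List[Entry] = []
    for first in range(0, len(pages), pages_per_range):
        last = min(first + pages_per_range, len(pages)) - 1
        pts = [p for page in pages[first:last + 1] for p in page]
        entries.append((first, last, mbr_of(pts)))
    return entries


def brin_search(entries: List[Entry], pages: List[List[Point]], window: Rect):
    results: List[Point] = []
    inspected: List[int] = []
    for first, last, mbr in entries:             # step 1: serial scan of all entries
        if mbr is None or not intersects(mbr, window):
            continue
        for page_no in range(first, last + 1):   # step 2: inspect pages, drop non-matches
            inspected.append(page_no)
            results.extend(p for p in pages[page_no] if contains(window, p))
    return results, inspected


# Toy heap: four pages, two pages per range.
pages = [[(1, 1), (2, 2)], [(1, 2), (9, 9)], [(5, 5), (6, 6)], [(5, 6), (6, 5)]]
entries = build_brin(pages, pages_per_range=2)
print(brin_search(entries, pages, (4, 4, 7, 7)))
```

In this toy data the single outlier `(9, 9)` stretches the first range's MBR to `(1, 1, 9, 9)`, so pages 0 and 1 are inspected for the window `(4, 4, 7, 7)` even though they hold no matching tuple. They are false positive pages.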

Compared Approaches: BRIN-Spatial
• Summary of BRIN-Spatial
  • Indexes heap file pages
  • Very small
  • Fast maintenance
  • Not good at queries

Compared Approaches: Hippo-Spatial
• Hippo-Spatial: PVLDB 2016
  • Indexes heap file pages
  • Index entry: dynamic page range, partial histogram
• Index search:
  1. Serial search, finding overlapped buckets between the query window and each partial histogram
  2. Filter false positive pages

[Figure: 4 x 4 grid of histogram buckets on X and Y, with the index entries below. In the example, pages 1 to 10 are inspected: pages 3, 4, 6, 8, and 9 return results, and pages 1, 2, 5, 7, and 10 are false positives.]

Page range (start, end) | Bucket (1,1) | (1,2) | … | (4,3) | (4,4)
1, 10 | 1 | 0 | … | 1 | 0
26, 30 | 0 | 1 | … | 0 | 1
11, 25 | 0 | 1 | … | 0 | 0
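The search steps above can be sketched as follows. This is an illustrative Python model, not the Hippo code: the equal-width 4 x 4 bucket boundaries, the fixed page ranges, and the helper names (`bucket_of`, `query_buckets`, `summarize`, `hippo_search`) are assumptions made for the example.

```python
from bisect import bisect_right
from typing import FrozenSet, List, Set, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Point = Tuple[float, float]
Bucket = Tuple[int, int]                   # (X bucket, Y bucket), 1-based
Entry = Tuple[int, int, FrozenSet[Bucket]]

# Assumed histogram boundaries: 4 buckets on X and 4 on Y.
X_BOUNDS = [0, 10, 20, 30, 40]
Y_BOUNDS = [0, 10, 20, 30, 40]


def bucket_index(value: float, bounds: List[float]) -> int:
    """1-based bucket number of value; out-of-range values are clamped."""
    return min(max(bisect_right(bounds, value), 1), len(bounds) - 1)


def bucket_of(p: Point) -> Bucket:
    return (bucket_index(p[0], X_BOUNDS), bucket_index(p[1], Y_BOUNDS))


def query_buckets(window: Rect) -> Set[Bucket]:
    """All buckets that the query window overlaps."""
    x_lo, x_hi = bucket_index(window[0], X_BOUNDS), bucket_index(window[2], X_BOUNDS)
    y_lo, y_hi = bucket_index(window[1], Y_BOUNDS), bucket_index(window[3], Y_BOUNDS)
    return {(bx, by) for bx in range(x_lo, x_hi + 1) for by in range(y_lo, y_hi + 1)}


def summarize(pages: List[List[Point]], first: int, last: int) -> Entry:
    """Partial histogram: the set of buckets hit by tuples in pages first..last."""
    buckets = frozenset(bucket_of(p) for page in pages[first:last + 1] for p in page)
    return (first, last, buckets)


def inside(window: Rect, p: Point) -> bool:
    return window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]


def hippo_search(entries: List[Entry], pages: List[List[Point]], window: Rect):
    wanted = query_buckets(window)
    results: List[Point] = []
    inspected: List[int] = []
    for first, last, buckets in entries:          # step 1: serial scan of entries
        if buckets & wanted:                      # shared bucket: possible match
            for page_no in range(first, last + 1):
                inspected.append(page_no)         # step 2: filter false positives
                results.extend(p for p in pages[page_no] if inside(window, p))
    return results, inspected


pages = [[(1, 1), (2, 3)], [(35, 35), (38, 31)], [(12, 12), (15, 18)], [(1, 35), (3, 39)]]
entries = [summarize(pages, 0, 1), summarize(pages, 2, 3)]
print(hippo_search(entries, pages, (11, 11, 19, 19)))
```

The first entry covers points in opposite corners of the space, yet its partial histogram holds only buckets (1,1) and (4,4), so a window over bucket (2,2) skips it. Page 3 is still inspected as a false positive because it shares an entry with page 2.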

Compared Approaches: Hippo-Spatial
• Hippo-Spatial index maintenance:
  • Data insertion: eager update of the partial histogram
  • Data deletion: lazy update of the partial histogram

[Figure: lazy deletion. Traverse each index entry and check whether it is out of date; if yes, resummarize it. In the example, the entry for pages 26 to 30 (histogram bits 0 1 … 0 1) is resummarized.]
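The difference between eager insertion and lazy deletion can be illustrated with a small Python model. Everything here is an assumption made for the sketch: the dictionary-based entries, the `stale` flag used to mark an out-of-date entry, and the equal-width bucket boundaries. The slides do not say how Hippo detects out-of-date entries.

```python
from bisect import bisect_right

BOUNDS = [0, 10, 20, 30, 40]   # assumed boundaries, shared by X and Y (4 buckets each)


def bucket_of(p):
    """(X bucket, Y bucket) of point p, 1-based, clamped to the grid."""
    def idx(v):
        return min(max(bisect_right(BOUNDS, v), 1), len(BOUNDS) - 1)
    return (idx(p[0]), idx(p[1]))


def find_entry(entries, page_no):
    """Index entry whose page range covers page_no."""
    for entry in entries:
        if entry["first"] <= page_no <= entry["last"]:
            return entry
    raise KeyError(page_no)


def insert(entries, pages, page_no, point):
    """Eager: the partial histogram is updated together with the heap page."""
    pages[page_no].append(point)
    find_entry(entries, page_no)["buckets"].add(bucket_of(point))


def delete(entries, pages, page_no, point):
    """Lazy: the heap page changes now, the histogram is only marked stale."""
    pages[page_no].remove(point)
    find_entry(entries, page_no)["stale"] = True


def resummarize(entries, pages):
    """Traverse each entry; rebuild the partial histogram of stale ones."""
    for entry in entries:
        if entry["stale"]:
            entry["buckets"] = {
                bucket_of(p)
                for page_no in range(entry["first"], entry["last"] + 1)
                for p in pages[page_no]
            }
            entry["stale"] = False


pages = [[(5, 5), (35, 35)], [(15, 15)]]
entries = [{"first": 0, "last": 1, "buckets": {(1, 1), (4, 4), (2, 2)}, "stale": False}]

insert(entries, pages, 1, (25, 25))
after_insert = set(entries[0]["buckets"])     # bucket (3, 3) appears immediately

delete(entries, pages, 0, (35, 35))
after_delete = set(entries[0]["buckets"])     # bucket (4, 4) is still set

resummarize(entries, pages)
after_resummarize = set(entries[0]["buckets"])
print(after_insert, after_delete, after_resummarize)
```

Between the delete and the resummarize step the stale bucket can only cause extra false positive pages, never a missed result, which is why the update can safely be deferred in this model.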

Compared Approaches: Hippo-Spatial
• Summary of Hippo-Spatial
  • Indexes heap file pages
  • Still small
  • Fast maintenance
  • Good at common queries

[Figure: index entries (page range plus partial histogram over the 4 x 4 bucket grid), reached through pointers from a list sorted by start page number.]

Experimental Environment
• Datasets
  • NYC Taxi Trips, 175 GB
• Parameter settings
  • Hippo: histogram buckets (H) 400, partial histogram density (D) 20%
  • BRIN: pages per range (P) 128
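How the density parameter D could shape Hippo's dynamic page ranges is sketched below. The rule used here is an assumption for illustration: pages are added to the current entry until the fraction of set buckets in its partial histogram reaches D, then the entry is closed. `build_ranges` is a hypothetical helper, and the 16-bucket toy data stands in for the H = 400 setting.

```python
from typing import FrozenSet, List, Set, Tuple

Entry = Tuple[int, int, FrozenSet[int]]   # first page, last page, set bucket ids


def build_ranges(page_buckets: List[Set[int]], total_buckets: int, density: float) -> List[Entry]:
    """Group consecutive pages into entries, closing an entry once the
    fraction of set buckets in its partial histogram reaches `density`."""
    entries: List[Entry] = []
    first = 0
    current: Set[int] = set()
    for page_no, buckets in enumerate(page_buckets):
        current |= buckets
        if len(current) / total_buckets >= density:
            entries.append((first, page_no, frozenset(current)))
            first, current = page_no + 1, set()
    if first < len(page_buckets):          # trailing pages below the threshold
        entries.append((first, len(page_buckets) - 1, frozenset(current)))
    return entries


# Toy heap: 8 pages, each touching one distinct bucket out of 16.
page_buckets = [{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}]

low = build_ranges(page_buckets, total_buckets=16, density=0.2)
high = build_ranges(page_buckets, total_buckets=16, density=0.5)
print([(f, l) for f, l, _ in low])
print([(f, l) for f, l, _ in high])
```

Under this rule a larger D yields fewer, wider page ranges, so each matching entry forces more pages to be inspected. BRIN's P, by contrast, fixes the range width regardless of the data.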

Indexing Overhead
• Index size
  • Hippo: 100x smaller than GiST
  • BRIN: 100x smaller than Hippo
• Reason
  • Hippo and BRIN index pages, not tuples
  • A partial histogram is larger than an MBR

Indexing Overhead (cont.)
• Index initialization time
  • Hippo, BRIN-Spatial: 100x less time than GiST
  • Hippo takes 60% of the time of BRIN
• Reason
  • A hierarchical index structure costs more to build than a flat one
  • GiST writes lots of temporary disk files
  • BRIN's in-memory entry is updated frequently

Indexing Overhead (cont.)
[Figure: index structures built over the heap pages in the DBMS]
• GiST: tree of non-leaf nodes and nodes
• Hippo entry: StartPageID, EndPageID, Bit 1, Bit 2, …, Bit b
• BRIN entry: StartPageID, EndPageID, Xmin, Ymin, Xmax, Ymax

Query Response Time: vary query selectivity factor
• Hippo ≈ GiST at 0.1% and 1% selectivity
• BRIN is always the worst

Index Probe Time: vary query selectivity factor
• Hippo and BRIN have constant index probe time
  • They search all index entries for a given query
• GiST index probe time increases along with the selectivity factor

Inspected Pages: vary query selectivity factor
• Hippo inspects 5 times fewer disk pages than BRIN
• BRIN searches too many pages with 32, 128, or 512 pages per range
• Higher density makes Hippo inspect more pages

Query Response Time: vary query areas
• Setting
  • Area: percent of NYC region area
  • Dense locations: Times Square, JFK, …
  • Random locations: random within NYC
• Hippo works better in dense locations and at medium selectivity factors
• GiST works better in random locations and for highly selective queries

Maintenance Time: vary update ratio
• Insertion:
  • Hippo is 100x faster than GiST because of its flat index structure
  • BRIN is 50x faster than Hippo because Hippo updates the partial histogram
• Deletion:
  • Hippo is 100x faster than GiST
  • BRIN is slower than Hippo because BRIN has to rebuild the index

Throughput: Hybrid workloads
• Queries + updates
• Update-intensive workloads (10%-50%): Hippo's throughput is 100x that of GiST
• Query-intensive workloads (70%-90%): Hippo ≈ GiST

Summary of Results

Metric | GiST-Spatial | Hippo-Spatial | BRIN-Spatial
Storage overhead | 84 GB | 2 GB | 10 MB
Initialization time | 28 hours | 30 minutes | 45 minutes
Favored selectivity query | 0.001% selectivity range query | selectivity between 0.01% and 1% range query | X
Favored dense area query | 10^-5 % area | area ≥ 10^-4 % | X
Index insertion | 6 minutes for inserting 10^-4 % data | 4 seconds for inserting 10^-4 % data | 1 second for inserting 10^-4 % data
Index deletion | 2 hours for deleting 10^-4 % data | 2 min for deleting 10^-4 % data | Index rebuilt
Hybrid workload | Query-intensive | Balanced workload and update-intensive | Update-intensive

Take-home Lesson
• Do not use GiST (spatial tree index) if storage is limited
• Do not use BRIN or Hippo for Yelp-like applications (use GiST)
• Use Hippo for spatial analytics applications over dynamic and dense spatial data
  • Query selectivity is 0.1% - 1%, update-intensive workloads

Questions?
• Hippo on the 9.6.1 kernel: https://github.com/DataSystemsLab/hippo-postgresql
• Build Hippo:
  CREATE INDEX hippo_idx ON hippo_tbl USING hippo (randomNumber) WITH (density = 20);
