indexing the pickup and drop off locations of nyc taxi
play

Indexing the Pickup and Drop-off Locations of NYC Taxi Trips in - PowerPoint PPT Presentation

Indexing the Pickup and Drop-off Locations of NYC Taxi Trips in PostgreSQL Lessons from the Road Jia Yu and Mohamed Sarwat Arizona State University A Little Story August 1, 2015 : Over 1 billion taxi trip records from 2009 to 2015


  1. Indexing the Pickup and Drop-off Locations of NYC Taxi Trips in PostgreSQL – Lessons from the Road Jia Yu and Mohamed Sarwat Arizona State University

  2. A Little Story… • August 1, 2015 : Over 1 billion taxi trip records from 2009 to 2015 were released by New York City Taxi & Limousine Commission • Since then: New taxi trip records keep being published on the Internet • As of TODAY: Millions of new records have been added into the dataset PickupTime DropoffTime TripDistance PickupLocation DropoffLocation PaymentType FareAmount TipAmount 2009-01-01 2009-01-01 (40.7577,- (40.7497,- 2.2 Credit Card 15.5 3.5 08:01:01 08:20:37 73.9851) 73.9882) Photo credit: NYC TLC website

  3. A Little Story… (cont.) • People really want to do Spatial Query on this 175 GB data in PostGIS • People really need a Spatial Index to speed up the queries. Which Spatial Index can handle these? 1 billion records, 175 GB, millions new records, keep being published

  4. Compared Approaches: GiST • Generalized Search Tree (GiST-Spatial, Similar to R-Tree) • Index structure: Tree index Root … Non-leaf Non-leaf … … Tuple Tuple Tuple Tuple Tuple Tuple • Index entry (tree node): Minimum Bounding Rectangle, Tuple pointers • Index search: Top-down, fast prune by checking Query Window with MBR • Index maintenance: Search tree, then split (if full) and merge (if too empty)

  5. Compared Approaches: GiST • Summary of GiST • Fast index search • Large storage overhead: 20% or more additional overhead • Slow maintenance: Split, merge tree nodes Index Name Data size Index size Initial. time Insertion (0.1%) GiST 175 GB NYC 84 GB 28 hours 6 hours

  6. Compared Approaches: BRIN-Spatial SELECT * • Block Range Index (BRIN-Spatial): FROM NYCtaxi N WHERE ST_WITHIN • PostgreSQL 9.5, PostGIS 2.3 (QueryWindow, N.pickuppoint) • Index heap file pages • Index search: • 1. Serial search by checking Query Window with MBR Heap file • 2. Filter false positive pages Disk pages • Index maintenance: • Update MBR for Insertion Data table • No update for deletion Filter false positive pages Disk pages

  7. Compared Approaches: BRIN-Spatial • Summary of BRIN-Spatial • Index heap file pages • Very small • Fast maintenance Heap file • Not good at queries Disk pages Data table

  8. Compared Approaches: Hippo-Spatial • Hippo-Spatial: PVLDB 2016 • Index heap file pages • Index entry: dynamic page range, partial histogram • Index search: Page False positive 1 √ • 1. Serial search by finding overlapped buckets between 2 √ Query Window and partial histogram 3 Got results! • 2. Filter false positive pages 4 Got results! Page Range Histogram Bucket ID (X,Y) Histograms on X and Y 5 √ 1,1 1,2 … 4,3 4,4 Start End 6 Got results! 4 7 √ 3 1 10 1 0 … 1 0 8 Got results! Page 1 - 10 2 9 Got results! 26 30 0 1 … 0 1 10 √ 1 Y X 1 2 3 4 11 25 0 1 … 0 0

  9. Compared Approaches: Hippo-Spatial • Hippo-Spatial: • Index maintenance • Data insertion: eager update on partial histogram • Data deletion: lazy update on partial histogram Traverse Page Range Histogram Bucket ID (X,Y) 1,1 1,2 … 4,3 4,4 Start End Each entry Histograms on X and Y Out of date? YES. 1 10 1 0 … 1 0 4 Page 26 3 26 30 0 1 … 0 1 2 Resummarize 11 25 0 1 … 0 0 Page Range Histogram Bucket ID (X,Y) 1 26 30 0 1 … 0 1 Y X 1 2 3 4

  10. Compared Approaches: Hippo-Spatial • Summary of Hippo-Spatial Histograms on X and Y 4 Page Range Histogram Bucket ID (X,Y) • Index heap file pages 3 1,1 1,2 … 4,3 4,4 Start End • Still small 2 • Fast maintenance 1 1 10 1 0 … 1 0 • Good at common queries Y X 1 2 3 4 26 30 0 1 … 0 1 Sorted List (Start Page# ↓ ) Pointer 11 25 0 1 … 0 0 Pointer Pointer

  11. Experimental Environment • Datasets • NYC Taxi Trips 175 GB • Parameter setting • Hippo: Histogram bucket (H) 400, Partial histogram density(D) 20% • BRIN: Page per range (P) 128

  12. Indexing Overhead • Index size • Hippo: 100x < GiST • BRIN: 100x < Hippo • Reason Log. scale • Index pages not tuples • Partial histogram > MBR Log. scale

  13. Indexing Overhead (cont.) • Index initialization time • Hippo, BRIN-Spatial 100x < GiST • Hippo takes 60% time of BRIN • Reason Log. scale • Hierarchy > flat index structure • GiST writes lots of temporary disk files • BRIN in-memory entry is updated frequently

  14. Indexing Overhead (cont.) Non-leaf GiST … Node Node Node Node Page … 1 Node Node Node 2 3 DBMS 4 5 Hippo 6 7 8 StartPageID EndPageID Bit 1 Bit 2 … Bit b DBMS 9 10 11 StartPageID EndPageID Bit 1 Bit 2 … Bit b 12 13 BRIN 14 15 DBMS StartPageID EndPageID Xmin Ymin Xmax Ymax StartPageID EndPageID Xmin Ymin Xmax Ymax

  15. Query Response Time: vary query selectivity factor • Hippo ≈ GiST at 0.1% and 1% selectivity • BRIN is always the worst

  16. Index Probe Time: vary query selectivity factor • Hippo and BRIN have constant index probe time • Search all index entries for a given query • GiST index probe time increases along with selectivity factor Log. scale

  17. Inspected Pages: vary query selectivity factor • Hippo inspects 5 times less disk pages than BRIN • BRIN searches too many pages with 32, 128, 512 pages per range • Higher density makes Hippo inspect more pages

  18. Query Response Time: vary query areas • Setting • Area: percent of NYC region area • Dense locations, Time Square, JFK,…; Random locations, random within NYC • Hippo works better in dense locations, medium selectivity factors • GiST works better in random locations, highly selective queries

  19. Maintenance time: vary update ratio • Insertion: • Hippo 100x < GiST, flat index structure • BRIN 50x < Hippo, Hippo updates partial histogram • Deletion: Hippo 100x < GiST; BRIN > Hippo, BRIN has to re-build Log. scale

  20. Throughput: Hybrid workloads • Queries + Updates • Update-intensive workloads (10%-50%), Hippo is 100x > GiST • Query-intensive workloads (70%-90%), Hippo ≈ GiST

  21. Summary of Results Metric GiST-Spatial Hippo-Spatial BRIN-Spatial Storage overhead 84 GB 2 GB 10 MB Initialization time 28 hours 30 minutes 45 minutes selectivity between Favored selectivity query 0.001% selectivity X 0.01% and 1% 10 -5 % range query range query area ≥ Favored dense area query X area 10 −4 % 6 minutes for 4 seconds for 1 second for Index insertion inserting 10 −4 % data inserting 10 −4 % data inserting 10 −4 % data 2 hours for deleting 2 min for deleting Index deletion Index rebuilt 10 −4 % data 10 −4 % data Balanced Workload Hybrid workload Query-intensive Update-intensive and Update-intensive

  22. Take-home Lesson • Do not use GiST (spatial tree index) if limited storage • Do not use BRIN or Hippo for Yelp-like applications. • Use Hippo for spatial analytics applications over dynamic and dense spatial data. • query selectivity is 0.1% - 1%, update-intensive workloads Use GiST Use Hippo

  23. Questions? 9.6.1 kernel https://github.com/DataSystemsLab/hippo-postgresql Build Hippo CREATE INDEX hippo_idx ON hippo_tbl USING hippo (randomNumber) WITH (density = 20);

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend