two birds one stone a fast yet lightweight indexing
play

Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for - PowerPoint PPT Presentation

Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for Modern Database Systems Jia Yu and Mohamed Sarwat Arizona State University Motivation Tree index Fast index search Large storage overhead: 5% - 15% additional


  1. Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for Modern Database Systems Jia Yu and Mohamed Sarwat Arizona State University

  2. Motivation • Tree index • Fast index search • Large storage overhead: 5% - 15% additional overhead • Slow maintenance: Split, merge, redistribute nodes Tree index Root … Non-leaf Non-leaf … … Tuple Tuple Tuple Tuple Tuple Tuple B+ Tree on TPC-H Index size Initialization time Insertion time (0.1%) 2 GB 0.25 GB 30 sec 10 sec 20 GB 2.51 GB 500 sec 1180 sec 200 GB 25 GB 8000 sec 42000 sec

  3. Motivation (cont.) • Bitmap index : Fast, small, low cardinality, read-only • Compressed index : Small, slow search, slow maintenance • Approximate index : Inaccurate result • Sparse index (aggregated page info: min, max): Very small, not good at queries Data table Sparse index Disk page range Min Max SELECT * 1 - 10 5 100 Disk FROM Lineitem pages WHERE Price = 55 11 - 20 2 80 21 - 30 10 90 Disk Heap file pages

  4. Design goals • Low indexing overhead (size and initialization time) • Fast index maintenance (insertion and deletion) • Competitive query performance • No need to be fast on all selectivity scenarios • Works for “ not too selective ” but still “ selective ” query TPC-H Lineitem Table ID QUANTITY PRICE SHIPDATE Comment 1 4 $50 1995-08-01 N/A 2 22 $90 1994-11-27 Refurbished … … … … … Highly selective, <0.001% Still selective 0.01% - 1% Unselective > 1% SELECT average(*) SELECT * SELECT * FROM Lineitem FROM Lineitem FROM Lineitem WHERE Price > 55 AND Price < 65 WHERE Price > 0.00 AND Price < 0.01 WHERE Price > 10 Use B+ Tree Use our index Full table scan

  5. Hippo Index Structure Bucket Price • Sparse index 2 21 - 40 3 41 - 60 Complete load balanced histogram 4 61 - 90 Hippo logical structure Partial histogram 1 Bucket Price Partial Bucket Price 1 1 - 20 Disk Page # histogram Internal data 2 21 - 40 2 21 - 40 1 - 10 2,3,4 21,22,55,75,77 4 61 - 90 11 – 25 2,4,5 23,24,62,91,92 3 41 - 60 … 5 91 - 120 26 - 30 1,2,5 11,12,25,101,110 4 61 - 90 Partial histogram 2 5 91 - 120 Bucket Price Manage 1 1 - 20 Grouped by partial 2 21 - 40 DBMS histogram similarity 5 91 - 120 Partial histogram 3 query optimizer No initial. overhead ü Load-balanced, handle data ü skewness Global data distribution won’t ü change frequently

  6. Hippo Index Structure (cont.) • Hippo disk structure Index Entries Sorted List Why Hippo has small size StartPageID Index disk pages StartPageID EndPageID Bit 1 Bit 2 … Bit b 1-10 Use dynamic page grouping StartPageID EndPageID Bit 1 Bit 2 … Bit b 11-25 Partial histogram in compressed bitmap . . . . . . StartPageID EndPageID Bit 1 Bit 2 … Bit b n-n+3 After updates Why Hippo has fast maintenance Flat index structure, no internal Index Entries dependency Sorted List Sorted list enables binary search StartPageID StartPageID EndPageID Bit 1 Bit 2 … Bit b 1-10 StartPageID EndPageID Bit 1 Bit 2 … Bit b 11-25 . . . . . . StartPageID EndPageID Bit 1 Bit 2 … Bit b n-n+3

  7. Hippo Index Structure (cont.) • Hippo initialization Why Hippo has fast initialization Only one full table scan Page 1 2 Complete histogram Partial histogram density = 3/5 3 Bucket Price 4 1 1 - 20 5 Disk Page # Partial histogram 2 21 - 40 6 1 - 10 2,3,4 Refer 3 41 - 60 to 7 4 61 - 90 8 5 91 - 120 9 10 11 12 Disk Page # Partial histogram 13 11 - 15 2,4,5 14 15

  8. Hippo Index Search • Step 1: Scanning index entries SELECT * SELECT * SELECT * FROM Lineitem FROM Lineitem FROM Lineitem WHERE Price > 55 WHERE Price > 55 AND Price < 65 WHERE Price = 55 Bucket Price Bucket Price Bucket Price 1 1 - 20 1 1 - 20 1 1 - 20 2 21 - 40 2 21 - 40 2 21 - 40 3 41 - 60 3 41 - 60 3 41 - 60 4 61 - 90 4 61 - 90 4 61 - 90 5 91 - 120 5 91 - 120 5 91 - 120

  9. Hippo Index Search • Step 1: Scanning index entries (continued) Bucket Price 2 21 - 40 3 41 - 60 SELECT * 4 61 - 90 FROM Lineitem Hippo Partial histogram 1 WHERE Price = 55 Disk Page # Partial histogram Bucket Price 1 - 10 2,3,4 Bucket Price 2 21 - 40 11 – 25 2,4,5 1 1 - 20 4 61 - 90 26 - 30 1,2,5 … 5 91 - 120 2 21 - 40 Partial histogram 2 3 41 - 60 4 61 - 90 Bucket Price 5 91 - 120 1 1 - 20 2 21 - 40 Page 1 - 10 5 91 - 120 Why Hippo has fast query performance Partial histogram 3 Use partial histogram to prune useless disk pages

  10. Hippo Index Search • Step 2: Filtering false positive pages Page Content False positive SELECT * FROM Lineitem 1 21,22,77 √ WHERE Price = 55 2 22,75,77 √ Query 3 22, 55 ,75 Got results! result User 4 21, 55 ,75 Got results! 5 √ 21,22,75 Page 1 - 10 6 22, 55 ,77 Got results! 7 21,22,77 √ 8 21, 55 ,77 Got results! 9 55 ,75,77 Got results! 10 √ 21,75,77

  11. Hippo Maintenance: Insertion - Eager • Check index entries right after each data insertion • Check the complete histogram and each Hippo partial histogram • Update index entries right after each check • S1. No need to update the corresponding index entry • S2. Need to update the corresponding index entry at once

  12. Hippo Maintenance: Insertion - Eager (cont.) • Scenario 1: No need to update the entry Why Hippo has fast maintenance Sorted list enables binary search on INSERT INTO Lineitem (Quantity, Price,Shipdate) index entries VALUES (3, 23 ,1997-05-30); Hippo Complete histogram Bucket Price Disk Page # Partial histogram Binary search 1 1 - 20 1 - 10 2,3,4 Page 1 2 21 - 40 11 – 25 2,4,5 3 41 - 60 26 - 30 1,2,5 4 61 - 90 … 5 91 - 120 Disk Page # Partial histogram Bucket Price 2 2 ,3,4 1 - 10 21 - 40 No index updates

  13. Hippo Maintenance: Insertion - Eager (cont.) • Scenario 2: Need to update the entry INSERT INTO Lineitem (Quantity, Price,Shipdate) VALUES (10, 61 ,1995-01-01); Hippo Complete histogram Disk Page # Partial histogram 1 - 10 2,3,4 Bucket Price 11 – 25 2,4,5 Page 1 1 - 20 Binary search Why Hippo has fast maintenance 2 21 - 40 26 - 30 1,2,5 26 … 3 41 - 60 Flat index structure, no internal dependency 4 61 - 90 Only update one index entry 5 91 - 120 Update Disk Page # Partial histogram Disk Page # Partial histogram Bucket Price 4 26 - 30 1,2,5 1,2, 4 ,5 26 - 30 61 - 90

  14. Hippo Maintenance: Deletion - Lazy • Don’t update index entries after each data deletion • Deleted data tuples are marked as “deleted” and may be removed at any time • Still guarantee accurate query results because of “filtering false positive pages” • Update index periodically or invoked by the user • Slow but can run at DBMS idle time • S1. No need to update the corresponding index entry • S2. Need to update the corresponding index entry

  15. Hippo Maintenance: Deletion – Lazy (cont.) • Scenario 1: No need to update the entry Each page Page 1 Out of date? No 2 Hippo 3 Each entry . Disk Page # Partial histogram . Traverse 1 - 10 2,3,4 . 11 – 25 2,4,5 26 - 30 1,2,5 Page … 11 12 13 . . . . . .

  16. Hippo Maintenance: Deletion – Lazy (cont.) • Scenario 2: Need to update the entry Each page Page Out of date? YES. 1 2 Hippo 3 Each entry . Disk Page # Partial histogram . Traverse 1 - 10 2,3,4 . 11 – 25 2,4,5 Re-summarize this index entry. 26 - 30 1,2,5 Page Get a new partial histogram. … 11 12 13 . . . . . .

  17. Experiments • Datasets (all around 200 GB): 1. TPC-H: Lineitem table 1. PartKey: Uniform distribution. 40 million distinct values 2. SuppKey: Uniform distribution. 2 million distinct values 3. OrderKey: sorted 2. Exponential distribution data: Skewed distribution 3. Wikipedia page hit rates: Skewed distribution 4. NYC Taxi trip records: pickup point, dimension reduction using Hilbert Curve • Compared indexes: 1. B+ Tree 2. Hippo 3. Block Range Index - BRIN (A sparse index)

  18. Experiments (cont.) • Indexing overhead • Size: Hippo 40x < B+ Tree • Initial. time: Hippo 2.5x < B+ Tree Figure: Index size on different datasets Figure: Initial. time on different datasets (logarithmic scale)

  19. Experiments (cont.) • Query time • Hippo ≈ B+ Tree at 0.1% and 1% selectivity • BRIN always scans lots of disk pages Figure: Query time on TPC-H Figure: Query time on NYC Trips

  20. Experiments (cont.) • Hybrid workload • Update-intensive workloads (10%-50%), Hippo is 100x > B+ Tree • Query-intensive workloads (70%-90%), Hippo ≈ B+ Tree Figure: Throughput on TPC-H Figure: Throughput on NYC Trips

  21. Take-home Lesson • Use Hippo if you have limited storage budget • Use Hippo if your query selectivity is 0.1% - 1% • Use Hippo if you have update-intensive workloads

  22. Questions? 9.6.1 kernel https://github.com/DataSystemsLab/hippo-postgresql Build Hippo CREATE INDEX hippo_idx ON hippo_tbl USING hippo (randomNumber) WITH (density = 20);

  23. Experiments (cont.) • Maintenance time • Hippo 10x – 1000x < B+ Tree • Hippo 10x < BRIN. BRIN doesn’t support deletion. Figure: Update time on NYC Trips Figure: Update time on TPC-H (logarithmic scale) (logarithmic scale)

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend