Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for - - PowerPoint PPT Presentation

two birds one stone a fast yet lightweight indexing
SMART_READER_LITE
LIVE PREVIEW

Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for - - PowerPoint PPT Presentation

Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for Modern Database Systems Jia Yu and Mohamed Sarwat Arizona State University Motivation Tree index Fast index search Large storage overhead: 5% - 15% additional


slide-1
SLIDE 1

Two Birds, One Stone: A Fast, yet Lightweight, Indexing Scheme for Modern Database Systems

Jia Yu and Mohamed Sarwat Arizona State University

slide-2
SLIDE 2
  • Tree index

Motivation

Tuple Non-leaf Non-leaf Root

Tree index

Tuple Tuple Tuple Tuple Tuple

… … …

  • Fast index search
  • Large storage overhead: 5% - 15% additional overhead
  • Slow maintenance: Split, merge, redistribute nodes

B+ Tree on TPC-H Index size Initialization time Insertion time (0.1%) 2 GB 0.25 GB 30 sec 10 sec 20 GB 2.51 GB 500 sec 1180 sec 200 GB 25 GB 8000 sec 42000 sec

slide-3
SLIDE 3

Motivation (cont.)

  • Bitmap index: Fast, small, low cardinality, read-only
  • Compressed index: Small, slow search, slow maintenance
  • Approximate index: Inaccurate result
  • Sparse index (aggregated page info: min, max): Very small, not good

at queries

Disk page range Min Max 1 - 10 5 100 11 - 20 2 80 21 - 30 10 90

SELECT * FROM Lineitem WHERE Price = 55

Disk pages

Sparse index Data table

Disk pages

Heap file

slide-4
SLIDE 4

Design goals

  • Low indexing overhead (size and initialization time)
  • Fast index maintenance (insertion and deletion)
  • Competitive query performance
  • No need to be fast on all selectivity scenarios
  • Works for “not too selective” but still “selective” query

SELECT * FROM Lineitem WHERE Price > 0.00 AND Price < 0.01 SELECT average(*) FROM Lineitem WHERE Price > 55 AND Price < 65 SELECT * FROM Lineitem WHERE Price > 10 Highly selective, <0.001% Still selective 0.01% - 1% Unselective > 1% Use B+ Tree Use our index Full table scan

ID QUANTITY PRICE SHIPDATE Comment 1 4 $50 1995-08-01 N/A 2 22 $90 1994-11-27 Refurbished … … … … …

TPC-H Lineitem Table

slide-5
SLIDE 5
  • Sparse index

Hippo Index Structure

Partial histogram 1

Hippo logical structure

Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120 Complete load balanced histogram

Partial histogram 2 Partial histogram 3 Disk Page # Partial histogram Internal data 1 - 10 2,3,4 21,22,55,75,77 11 – 25 2,4,5 23,24,62,91,92 26 - 30 1,2,5 11,12,25,101,110

Bucket Price 2 21 - 40 3 41 - 60 4 61 - 90 Bucket Price 2 21 - 40 4 61 - 90 5 91 - 120

DBMS

query optimizer

Manage

Bucket Price 1 1 - 20 2 21 - 40 5 91 - 120

Grouped by partial histogram similarity

ü No initial. overhead ü Load-balanced, handle data skewness ü Global data distribution won’t change frequently

slide-6
SLIDE 6

Hippo Index Structure (cont.)

  • Hippo disk structure

After updates

StartPageID EndPageID Bit 1 Bit 2 … Bit b StartPageID EndPageID Bit 1 Bit 2 … Bit b StartPageID EndPageID Bit 1 Bit 2 … Bit b

StartPageID 1-10 11-25 . . . n-n+3

Index Entries Sorted List

. . .

StartPageID EndPageID Bit 1 Bit 2 … Bit b StartPageID EndPageID Bit 1 Bit 2 … Bit b StartPageID EndPageID Bit 1 Bit 2 … Bit b

StartPageID 1-10 11-25 . . . n-n+3

. . .

Index Entries Sorted List

Why Hippo has small size

Index disk pages Use dynamic page grouping Partial histogram in compressed bitmap

Why Hippo has fast maintenance

Flat index structure, no internal dependency Sorted list enables binary search

slide-7
SLIDE 7

Hippo Index Structure (cont.)

  • Hippo initialization

Page 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120

Complete histogram

Refer to Disk Page # Partial histogram 1 - 10 2,3,4 Disk Page # Partial histogram 11 - 15 2,4,5

Why Hippo has fast initialization Only one full table scan Partial histogram density = 3/5

slide-8
SLIDE 8

Hippo Index Search

  • Step 1: Scanning index entries

SELECT * FROM Lineitem WHERE Price = 55 SELECT * FROM Lineitem WHERE Price > 55 SELECT * FROM Lineitem WHERE Price > 55 AND Price < 65

Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120 Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120 Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120

slide-9
SLIDE 9

Hippo Index Search

  • Step 1: Scanning index entries (continued)

Partial histogram 1

Hippo

Partial histogram 2 Partial histogram 3 Disk Page # Partial histogram 1 - 10 2,3,4 11 – 25 2,4,5 26 - 30 1,2,5

Bucket Price 2 21 - 40 3 41 - 60 4 61 - 90 Bucket Price 2 21 - 40 4 61 - 90 5 91 - 120 Bucket Price 1 1 - 20 2 21 - 40 5 91 - 120

Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120

Page 1 - 10 SELECT * FROM Lineitem WHERE Price = 55

Why Hippo has fast query performance

Use partial histogram to prune useless disk pages

slide-10
SLIDE 10

Hippo Index Search

  • Step 2: Filtering false positive pages

Page 1 - 10

Page Content False positive 1

21,22,77

√ 2

22,75,77

√ 3

22,55,75

Got results! 4

21,55,75

Got results! 5

21,22,75

√ 6

22,55,77

Got results! 7

21,22,77

√ 8

21,55,77

Got results! 9

55,75,77

Got results! 10

21,75,77

Query result User

SELECT * FROM Lineitem WHERE Price = 55

slide-11
SLIDE 11

Hippo Maintenance: Insertion - Eager

  • Check index entries right after each data insertion
  • Check the complete histogram and each Hippo partial histogram
  • Update index entries right after each check
  • S1. No need to update the corresponding index entry
  • S2. Need to update the corresponding index entry at once
slide-12
SLIDE 12
  • Scenario 1: No need to update the entry

… Hippo

Disk Page # Partial histogram 1 - 10 2,3,4 11 – 25 2,4,5 26 - 30 1,2,5

Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120

Complete histogram

INSERT INTO Lineitem (Quantity, Price,Shipdate) VALUES (3, 23,1997-05-30);

Page 1

Disk Page # Partial histogram 1 - 10

2,3,4

Bucket Price

2

21 - 40

No index updates

Hippo Maintenance: Insertion - Eager (cont.)

Why Hippo has fast maintenance

Sorted list enables binary search on index entries

Binary search

slide-13
SLIDE 13

Hippo Maintenance: Insertion - Eager (cont.)

  • Scenario 2: Need to update the entry

… Hippo

Disk Page # Partial histogram 1 - 10 2,3,4 11 – 25 2,4,5 26 - 30 1,2,5

Bucket Price 1 1 - 20 2 21 - 40 3 41 - 60 4 61 - 90 5 91 - 120

Complete histogram

INSERT INTO Lineitem (Quantity, Price,Shipdate) VALUES (10, 61,1995-01-01);

Page 26

Disk Page # Partial histogram 26 - 30 1,2,5

Bucket Price

4

61 - 90

Disk Page # Partial histogram 26 - 30 1,2,4,5

Update Binary search Why Hippo has fast maintenance

Flat index structure, no internal dependency Only update one index entry

slide-14
SLIDE 14

Hippo Maintenance: Deletion - Lazy

  • Don’t update index entries after each data deletion
  • Deleted data tuples are marked as “deleted” and may be removed at any time
  • Still guarantee accurate query results because of “filtering false positive

pages”

  • Update index periodically or invoked by the user
  • Slow but can run at DBMS idle time
  • S1. No need to update the corresponding index entry
  • S2. Need to update the corresponding index entry
slide-15
SLIDE 15

Hippo Maintenance: Deletion – Lazy (cont.)

  • Scenario 1: No need to update the entry

… Hippo

Disk Page # Partial histogram 1 - 10 2,3,4 11 – 25 2,4,5 26 - 30 1,2,5

Traverse

Page 1 2 3 . . .

Out of date? No

Page 11 12 13 . . .

Each entry Each page . . .

slide-16
SLIDE 16

Hippo Maintenance: Deletion – Lazy (cont.)

  • Scenario 2: Need to update the entry

… Hippo

Disk Page # Partial histogram 1 - 10 2,3,4 11 – 25 2,4,5 26 - 30 1,2,5

Traverse

Page 1 2 3 . . .

Out of date? YES.

Page 11 12 13 . . .

Each entry Each page Re-summarize this index entry. Get a new partial histogram. . . .

slide-17
SLIDE 17

Experiments

  • Datasets (all around 200 GB):
  • 1. TPC-H: Lineitem table
  • 1. PartKey: Uniform distribution. 40 million distinct values
  • 2. SuppKey: Uniform distribution. 2 million distinct values
  • 3. OrderKey: sorted
  • 2. Exponential distribution data: Skewed distribution
  • 3. Wikipedia page hit rates: Skewed distribution
  • 4. NYC Taxi trip records: pickup point, dimension reduction using Hilbert Curve
  • Compared indexes:
  • 1. B+ Tree
  • 2. Hippo
  • 3. Block Range Index - BRIN (A sparse index)
slide-18
SLIDE 18

Experiments (cont.)

  • Indexing overhead
  • Size: Hippo 40x < B+ Tree
  • Initial. time: Hippo 2.5x < B+ Tree

Figure: Index size on different datasets (logarithmic scale) Figure: Initial. time on different datasets

slide-19
SLIDE 19

Experiments (cont.)

  • Query time
  • Hippo ≈ B+ Tree at 0.1% and 1% selectivity
  • BRIN always scans lots of disk pages

Figure: Query time on NYC Trips Figure: Query time on TPC-H

slide-20
SLIDE 20

Experiments (cont.)

  • Hybrid workload
  • Update-intensive workloads (10%-50%), Hippo is 100x > B+ Tree
  • Query-intensive workloads (70%-90%), Hippo ≈ B+ Tree

Figure: Throughput on NYC Trips Figure: Throughput on TPC-H

slide-21
SLIDE 21

Take-home Lesson

  • Use Hippo if you have limited storage budget
  • Use Hippo if your query selectivity is 0.1% - 1%
  • Use Hippo if you have update-intensive workloads
slide-22
SLIDE 22

Questions?

Build Hippo

CREATE INDEX hippo_idx ON hippo_tbl USING hippo (randomNumber) WITH (density = 20);

9.6.1 kernel https://github.com/DataSystemsLab/hippo-postgresql

slide-23
SLIDE 23

Experiments (cont.)

  • Maintenance time
  • Hippo 10x – 1000x < B+ Tree
  • Hippo 10x < BRIN. BRIN doesn’t support deletion.

Figure: Update time on NYC Trips (logarithmic scale) Figure: Update time on TPC-H (logarithmic scale)