Optimally Leveraging Density and Locality for Exploratory Browsing and Sampling
Albert Kim1*, Liqi Xu2*, Tarique Siddiqui1, Silu Huang2, Samuel Madden1, Aditya Parameswaran2
1MIT 2University of Illinois (UIUC)
Summarization and Browsing
- Subset of voters who reside in Paris and voted for a specific candidate
- Some of the genes that get positively induced after a clinical trial
- Example sessions on a given website on an iPhone X
“Although big data demands aggregations, analysts wanted to see individual records to spotcheck their results, and to get a sense of what sat in a bucket.” [1]
[1] Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data. Moritz et al.
Any-k Problem: How to quickly return a small subset of records that satisfy arbitrary user-specified predicates?
Airline Dataset
…  Origin  Mon
…  ORD     1
…  ORD     2
…  CMI     2
…  CMI     3
…  ORD     1
…  ORD     2
…  CMI     1
…  ORD     1
Bitmap Indices (one row per record of the Airline Dataset)
Mon = 1  Mon = 2  Mon = 3
1        0        0
0        1        0
0        1        0
0        0        1
1        0        0
0        1        0
1        0        0
1        0        0
Bitmaps for the Any-k Problem
- Effective for traditional OLAP-style workloads
- One bitmap per attribute value
- Index at the record level
- Inefficient for the any-k problem
- High storage cost
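The record-level indexing described above can be sketched in a few lines. This is only an illustration over the eight example rows of the airline table; `build_bitmaps` is a hypothetical helper name, not part of any system named in the slides.

```python
from collections import defaultdict

# The eight example rows from the airline table: (Origin, Mon).
ROWS = [("ORD", 1), ("ORD", 2), ("CMI", 2), ("CMI", 3),
        ("ORD", 1), ("ORD", 2), ("CMI", 1), ("ORD", 1)]


def build_bitmaps(rows, column):
    """Hypothetical helper: one bitmap (list of 0/1) per distinct value."""
    bitmaps = defaultdict(lambda: [0] * len(rows))
    for i, row in enumerate(rows):
        bitmaps[row[column]][i] = 1  # set the bit for this record
    return dict(bitmaps)


origin = build_bitmaps(ROWS, 0)
mon = build_bitmaps(ROWS, 1)

# Records matching Mon = 1 AND Origin = "ORD": bitwise AND of two bitmaps.
matches = [a & b for a, b in zip(mon[1], origin["ORD"])]
print(matches)
```

Every distinct value costs one bit per record, which is where the high storage cost comes from once the table has many rows and many attribute values.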
Density Maps
# of tuples per block: 2 (one row per block of the Airline Dataset)
Mon = 1  Mon = 2  Mon = 3
0.5      0.5      0.0
0.0      0.5      0.5
0.5      0.5      0.0
1.0      0.0      0.0
- Index at the block level
- Read/write in units of sectors (e.g., 4 KB)
- Consume less memory
- Store the frequency of set bits per block
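As a minimal sketch of the idea, the per-block frequencies for the example table can be computed directly from the Mon column with 2 tuples per block. The function name `density_map` is an assumption made for illustration.

```python
def density_map(bitmap, block_size):
    """Fraction of set bits in each consecutive block of `block_size` records."""
    densities = []
    for start in range(0, len(bitmap), block_size):
        block = bitmap[start:start + block_size]
        densities.append(sum(block) / len(block))
    return densities


# Mon column of the eight example rows.
MON = [1, 2, 2, 3, 1, 2, 1, 1]

# Record-level bitmaps, one per value, then one density map per value.
bitmaps = {v: [int(m == v) for m in MON] for v in (1, 2, 3)}
maps = {v: density_map(b, block_size=2) for v, b in bitmaps.items()}

for value, densities in maps.items():
    print(f"Mon = {value}: {densities}")
```

Each value now needs one number per block instead of one bit per record, so the index shrinks as the block size grows, at the price of no longer knowing which records inside a block match.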
SELECT ANY-K(*) FROM T WHERE Month = 1 AND Origin = "ORD"
[Figure: Month = 1 (sorted); Origin = "ORD" (sorted); Month = 1 AND Origin = "ORD"]
Density-Optimal vs. Locality-Optimal?
SELECT ANY-K(*) FROM T WHERE Month = 1 AND Origin = "ORD"
[Figure: Month = 1; Origin = "ORD"; Month = 1 AND Origin = "ORD"]
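The trade-off in the question above can be made concrete with a small sketch. It is not the algorithm from the talk: the function names are hypothetical, and the combined density is estimated as the product of the per-predicate densities, which assumes the two predicates are independent within a block. A real system would still have to check the fetched records and read further blocks if fewer than k actually match.

```python
def estimate_and(*maps):
    """Estimated per-block density of a conjunction (assumes independence)."""
    combined = []
    for densities in zip(*maps):
        d = 1.0
        for x in densities:
            d *= x
        combined.append(d)
    return combined


def density_first(density, block_size, k):
    """Pick the densest blocks until about k matching records are expected."""
    order = sorted(range(len(density)), key=lambda b: (-density[b], b))
    chosen, expected = [], 0.0
    for b in order:
        if expected >= k or density[b] == 0.0:
            break
        chosen.append(b)
        expected += density[b] * block_size
    return sorted(chosen)


def locality_first(density, block_size, k):
    """Pick the shortest contiguous run of blocks expected to hold k matches."""
    best, lo, expected = None, 0, 0.0
    for hi in range(len(density)):
        expected += density[hi] * block_size
        # Shrink from the left while the window still covers k matches.
        while expected - density[lo] * block_size >= k:
            expected -= density[lo] * block_size
            lo += 1
        if expected >= k and (best is None or hi - lo < best[1] - best[0]):
            best = (lo, hi)
    return list(range(best[0], best[1] + 1)) if best else []


# Density maps for the example table with 2 tuples per block.
month_1 = [0.5, 0.0, 0.5, 1.0]
origin_ord = [1.0, 0.0, 1.0, 0.5]
both = estimate_and(month_1, origin_ord)

print(density_first(both, block_size=2, k=2))
print(locality_first(both, block_size=2, k=2))
```

On this toy input the density-first choice skips over an empty block, while the locality-first choice reads two adjacent blocks. Which one is cheaper depends on how expensive a seek is relative to reading an extra block.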
I/O Cost Model on HDDs
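The slide's actual cost model is not recoverable from this transcript. As a rough illustration only, a common simplification charges one seek per non-contiguous run of blocks plus a fixed transfer time per block. The function name `io_cost_ms` and the parameter values below are assumptions chosen for the example, not measurements.

```python
def io_cost_ms(blocks, seek_ms=10.0, transfer_ms=0.1):
    """Illustrative HDD cost: one seek per contiguous run plus per-block transfer.

    `seek_ms` and `transfer_ms` are assumed values for illustration only.
    """
    blocks = sorted(set(blocks))
    if not blocks:
        return 0.0
    # A new run starts wherever consecutive block ids are not adjacent.
    runs = 1 + sum(1 for prev, cur in zip(blocks, blocks[1:]) if cur != prev + 1)
    return runs * seek_ms + len(blocks) * transfer_ms


scattered = [0, 2]    # two dense blocks that are not adjacent
contiguous = [2, 3]   # two adjacent blocks

print(io_cost_ms(scattered), io_cost_ms(contiguous))
```

Under these assumed parameters, seeks dominate, so a contiguous set of blocks is cheaper than a scattered set of the same size. With a much smaller seek penalty the ranking is driven by the number of blocks read instead.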
Hybrid
- 123 million rows and 11 attributes with a total size of 11 GB

- Bitmap-Scan
- Lossy-Bitmap
- EWAH
Query runtimes for the airline workload on an HDD (figure legend: CPU, I/O)
- Hybrid: 4x faster
- I/O: 90% of the runtime
Memory consumption of index structures
- Uncompressed bitmaps: 47x more memory
- EWAH: 3x more memory
- Lossy: slower query performance due to high false positives
Needletail Architecture
✓ Density Maps
✓ Any-k algorithms
- Aggregation Estimation
- Grouping + Join
- More experimental results
Technical Report: http://data-people.cs.illinois.edu/needletail.pdf