Optimally Leveraging Density and Locality for Exploratory Browsing and Sampling



SLIDE 1

Optimally Leveraging Density and Locality for Exploratory Browsing and Sampling

Albert Kim¹*, Liqi Xu²*, Tarique Siddiqui¹, Silu Huang², Samuel Madden¹, Aditya Parameswaran²

¹MIT ²University of Illinois (UIUC)

SLIDE 2

Motivation

- A subset of the voters who reside in Paris and voted for a specific candidate
- Some of the genes that are positively induced after a clinical trial
- Example sessions on a given website on an iPhone X

[Figure labels: Summarization, Browsing]

SLIDE 3

Motivation

“Although big data demands aggregations, analysts wanted to see individual records to spotcheck their results, and to get a sense of what sat in a bucket.” [1]

[1] Trust, but Verify: Optimistic Visualizations of Approximate Queries for Exploring Big Data. Moritz et al.

Any-k problem: how can we quickly return a small subset of the records that satisfy arbitrary user-specified predicates?

SLIDE 4

Existing Approach: Bitmap Index

Airline Dataset

…   Origin  Mon
…   ORD     1
…   ORD     2
…   CMI     2
…   CMI     3
…   ORD     1
…   ORD     2
…   CMI     1
…   ORD     1

Bitmap Indices

Mon = 1  Mon = 2  Mon = 3
1        0        0
0        1        0
0        1        0
0        0        1
1        0        0
0        1        0
1        0        0
1        0        0

Bitmap indices
- Effective for traditional OLAP-style workloads
- One bitmap per attribute value
- Index at the record level

Bitmaps for the any-k problem
- Inefficient for the any-k problem
- High storage cost
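As a rough illustration of the record-level bitmap index described on this slide, the sketch below builds one bitmap per attribute value for the eight airline rows and answers an any-k query by ANDing bitmaps. The function names `build_bitmap_index` and `any_k` are hypothetical, chosen only for this example, and plain Python lists stand in for real compressed bitmaps.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Return {attribute value: list of 0/1 bits, one bit per record}."""
    index = defaultdict(lambda: [0] * len(values))
    for row, value in enumerate(values):
        index[value][row] = 1
    return dict(index)


def any_k(bitmaps, k):
    """AND the given bitmaps and return the ids of the first k matching rows."""
    matches = []
    for row, bits in enumerate(zip(*bitmaps)):
        if all(bits):
            matches.append(row)
            if len(matches) == k:
                break
    return matches


# The eight rows of the airline example (Origin and Mon columns).
mon = [1, 2, 2, 3, 1, 2, 1, 1]
origin = ["ORD", "ORD", "CMI", "CMI", "ORD", "ORD", "CMI", "ORD"]

mon_index = build_bitmap_index(mon)
origin_index = build_bitmap_index(origin)

print(mon_index[1])                                      # [1, 0, 0, 0, 1, 0, 1, 1]
print(any_k([mon_index[1], origin_index["ORD"]], k=2))   # [0, 4]
```

Every record costs one bit per attribute value here, which is where the high storage cost noted on the slide comes from.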

SLIDE 5

Our Approach: Density Map Index

Airline Dataset

…   Origin  Mon
…   ORD     1
…   ORD     2
…   CMI     2
…   CMI     3
…   ORD     1
…   ORD     2
…   CMI     1
…   ORD     1

Bitmap Indices

Mon = 1  Mon = 2  Mon = 3
1        0        0
0        1        0
0        1        0
0        0        1
1        0        0
0        1        0
1        0        0
1        0        0

Density Maps (# of tuples per block: 2)

Mon = 1  Mon = 2  Mon = 3
0.5      0.5      0.0
0.0      0.5      0.5
0.5      0.5      0.0
1.0      0.0      0.0

- Index at the block level
- Read and write in units of a sector (e.g., 4 KB)
- Consume less memory
- Store the frequency of set bits per block
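The density values on this slide can be reproduced from the Mon column with a few lines of code. The sketch below is only an illustration: `build_density_map` is a hypothetical helper name, and the block size of 2 tuples is the one stated on the slide.

```python
def build_density_map(bitmap, tuples_per_block):
    """Return the fraction of set bits in each block of the bitmap."""
    densities = []
    for start in range(0, len(bitmap), tuples_per_block):
        block = bitmap[start:start + tuples_per_block]
        densities.append(sum(block) / len(block))
    return densities


# Mon column of the eight airline rows.
mon = [1, 2, 2, 3, 1, 2, 1, 1]

density_maps = {
    value: build_density_map([int(m == value) for m in mon], tuples_per_block=2)
    for value in (1, 2, 3)
}

for value, densities in density_maps.items():
    print(f"Mon = {value}: {densities}")
# Mon = 1: [0.5, 0.0, 0.5, 1.0]
# Mon = 2: [0.5, 0.5, 0.5, 0.0]
# Mon = 3: [0.0, 0.5, 0.0, 0.0]
```

Each attribute value now needs one number per block rather than one bit per record, which is why the index shrinks as blocks grow.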

SLIDE 6

Our Approach: Density-Optimal

Observation #1 [Density: Denser is better]

SELECT ANY-K(*) FROM T WHERE Month = 1 AND Origin = "ORD"

[Figure: density maps for Month = 1 (sorted), Origin = "ORD" (sorted), and Month = 1 AND Origin = "ORD"]
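A minimal sketch of the "denser is better" idea follows. It rests on two assumptions that the slide does not state: the density of a conjunction is estimated as the product of the per-predicate densities (an independence assumption), and blocks are taken greedily in order of decreasing estimated density until the expected number of matching records reaches k. The names `combine_and` and `density_optimal` are hypothetical, and the density values are the ones obtained from the eight airline rows with 2 tuples per block.

```python
import math


def combine_and(*density_maps):
    """Estimate per-block density of a conjunction as a product of densities."""
    return [math.prod(ds) for ds in zip(*density_maps)]


def density_optimal(densities, tuples_per_block, k):
    """Greedily pick the densest blocks until about k matches are expected."""
    order = sorted(range(len(densities)), key=lambda b: densities[b], reverse=True)
    chosen, expected = [], 0.0
    for block in order:
        if expected >= k or densities[block] == 0.0:
            break
        chosen.append(block)
        expected += densities[block] * tuples_per_block
    return chosen, expected


month_1 = [0.5, 0.0, 0.5, 1.0]      # density map for Month = 1
origin_ord = [1.0, 0.0, 1.0, 0.5]   # density map for Origin = "ORD"

combined = combine_and(month_1, origin_ord)
blocks, expected = density_optimal(combined, tuples_per_block=2, k=2)

print(combined)          # [0.5, 0.0, 0.5, 0.5]
print(blocks, expected)  # [0, 2] 2.0
```

The chosen blocks are dense but not adjacent, so fetching them from a disk may require an extra seek.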

SLIDE 7

Our Approach: Locality-Optimal

Observation #2 [Locality: Closer is better]

SELECT ANY-K(*) FROM T WHERE Month = 1 AND Origin = "ORD"

[Figure: density maps for Month = 1, Origin = "ORD", and Month = 1 AND Origin = "ORD"]

Density-Optimal vs. Locality-Optimal?
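One way to read "closer is better" is to look for the shortest run of consecutive blocks that is expected to contain at least k matching records. The sliding-window sketch below follows that reading. It is an assumption made for illustration rather than the algorithm from the paper, and `locality_optimal` is a hypothetical name. The input is the estimated per-block density of the conjunction for the eight airline rows (2 tuples per block, product of the two density maps).

```python
def locality_optimal(densities, tuples_per_block, k):
    """Return (start, end) of the shortest run of consecutive blocks whose
    expected number of matches is at least k, with end exclusive.

    Returns None when even the whole table is not expected to hold k matches.
    Assumes k > 0.
    """
    best = None
    start, expected = 0, 0.0
    for end, density in enumerate(densities):
        expected += density * tuples_per_block
        # Drop blocks on the left while the window still covers k matches.
        while expected - densities[start] * tuples_per_block >= k:
            expected -= densities[start] * tuples_per_block
            start += 1
        if expected >= k and (best is None or end + 1 - start < best[1] - best[0]):
            best = (start, end + 1)
    return best


combined = [0.5, 0.0, 0.5, 0.5]  # estimated density of Month = 1 AND Origin = "ORD"

print(locality_optimal(combined, tuples_per_block=2, k=2))   # (2, 4)
print(locality_optimal(combined, tuples_per_block=2, k=10))  # None
```

For k = 2 the window covers blocks 2 and 3, which sit next to each other on disk, whereas a purely density-driven choice may pick blocks that are far apart.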

SLIDE 8

Our Approach: I/O Optimal

- Leverages both density and locality
- Uses dynamic programming
- High computation cost

I/O Cost Model on HDDs

[Figure labels: Blocks, # of samples]

Hybrid

- Run both Density-Optimal and Locality-Optimal
- Choose the set of blocks with the smaller estimated I/O cost
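The Hybrid strategy can be sketched as follows. Everything about the cost model here is an assumption made for illustration: `SEEK_COST`, `TRANSFER_COST`, and `io_cost` are hypothetical stand-ins for the HDD cost model mentioned on the slide, and the two block-selection functions are simplified greedy and sliding-window versions rather than the paper's algorithms.

```python
SEEK_COST = 10.0      # hypothetical cost of moving to a non-adjacent block
TRANSFER_COST = 1.0   # hypothetical cost of reading one block


def io_cost(blocks):
    """Toy HDD cost: one seek per contiguous run plus one transfer per block."""
    if not blocks:
        return 0.0
    ordered = sorted(blocks)
    seeks = 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)
    return seeks * SEEK_COST + len(ordered) * TRANSFER_COST


def density_optimal(densities, tuples_per_block, k):
    """Greedily take the densest blocks until about k matches are expected."""
    order = sorted(range(len(densities)), key=lambda b: densities[b], reverse=True)
    chosen, expected = [], 0.0
    for block in order:
        if expected >= k or densities[block] == 0.0:
            break
        chosen.append(block)
        expected += densities[block] * tuples_per_block
    return sorted(chosen)


def locality_optimal(densities, tuples_per_block, k):
    """Shortest run of consecutive blocks expected to hold at least k matches."""
    best, start, expected = None, 0, 0.0
    for end, density in enumerate(densities):
        expected += density * tuples_per_block
        while expected - densities[start] * tuples_per_block >= k:
            expected -= densities[start] * tuples_per_block
            start += 1
        if expected >= k and (best is None or end + 1 - start < len(best)):
            best = list(range(start, end + 1))
    return best


def hybrid(densities, tuples_per_block, k):
    """Run both strategies and keep the plan with the lower estimated cost."""
    plans = [
        density_optimal(densities, tuples_per_block, k),
        locality_optimal(densities, tuples_per_block, k),
    ]
    plans = [plan for plan in plans if plan]
    return min(plans, key=io_cost) if plans else []


combined = [0.5, 0.0, 0.5, 0.5]  # estimated density of Month = 1 AND Origin = "ORD"
print(hybrid(combined, tuples_per_block=2, k=2))  # [2, 3]
```

With these made-up constants the density-driven plan [0, 2] costs 22.0 and the locality-driven plan [2, 3] costs 12.0, so Hybrid keeps the latter. Different constants could flip the choice.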

SLIDE 9

Experimental Setting

- Airline Dataset
  - 123 million rows and 11 attributes with a total size of 11 GB
- Baselines:
  - Bitmap-Scan
  - Lossy-Bitmap
  - EWAH
- Queries

SLIDE 10

Experimental Results

Query runtimes for the airline workload on an HDD.

[Figure legend: CPU, I/O]

- Hybrid: 4x faster
- I/O: 90% of the runtime

SLIDE 11

Experimental Results

Memory consumption of index structures

- Uncompressed bitmaps: 47x more memory
- EWAH: 3x more memory
- Lossy: slower query performance due to a high rate of false positives

SLIDE 12

More in the paper!

Needletail Architecture

✓ Density Maps
✓ ANY-K algorithms
- Aggregation Estimation
- Grouping + Join
- More experimental results

Technical Report: http://data-people.cs.illinois.edu/needletail.pdf