BlinkDB: Queries with Bounded Error and Bounded Response Times on - PowerPoint PPT Presentation

BlinkDB: Queries with Bounded Error and Bounded Response Times on Very Large Data Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica Presented by Liqi Xu

SELECT AVG(SessionTime) Problem: very large data FROM Sessions WHERE City = ‘New York’ ● 100 million tuples for ‘New York’ ● Problem: High cost in execution time and space ○ ● Idea: trade result accuracy for response time and space ● Sampling: ○ 10,000 tuples for ‘New York ’ ○ return an approximate result (with error bound) ■ E.g. appox. avg 234.23 ± 5.32

SELECT AVG(SessionTime) Problems: approx. techniques FROM Sessions efficiency v.s. flexibility of the queries WHERE City = ‘New York’ All future queries Frequencies of group Frequencies of set of No future queries are are known in and filter predicates do columns used for known in advance advance not change over time group and filter predicates do not change over time

SELECT AVG(SessionTime) Problems: approx. techniques FROM Sessions efficiency v.s. flexibility of the queries WHERE City = ‘Urbana’ All future queries Frequencies of group Frequencies of set of No future queries are are known in and filter predicates do columns used for known in advance advance not change over time group and filter predicates do not Online Aggregation ‘current’ sampling change over time

BlinkDB ● “a distributed sampling-based approximate query processing system” ● Efficient ○ ~TBs data in seconds ○ with meaningful error bounds SELECT COUNT(*) SELECT COUNT(*) FROM Sessions FROM Sessions WHERE Genre = ‘western’ WHERE Genere = ‘western’ GROUP BY OS GROUP BY OS ERROR WITHIN 10% AT CONFIDENCE 95% WITHIN 5 SECONDS

BlinkDB ● “a distributed sampling-based approximate query processing system” ● Efficient ○ ~TBs data in seconds ○ with meaningful error bounds ● More general queries ○ Only assumption: ■ “query column sets” (QCSs) are stable ■ QCSs: columns used for grouping and filtering (ie. in WHERE, GROUP BY, and HAVING)

BlinkDB Architecture run-time offline

Sample creation ● Construct stratified samples

1. higher possibility of missing Problem with Uniform Samples under-representing groups SELECT AVG(SessionTime) FROM Sessions ID City Age Session_Time WHERE City = ‘Urbana’’ 1 NYC 20 212 2 Urbana 40 532 Sampling_rate = ⅓ ID City Age Session_Time 3 NYC 30 243 3 NYC 30 243 4 Urbana 40 291 5 NYC 20 453 5 NYC 20 453 6 NYC 30 293

1. higher possibility of missing Problem with Uniform Samples under-representing groups 2. Error of each aggregate is NOT equal ID City Age Session_Time 1 NYC 20 212 2 Urbana 40 532 ID City Age Session_Time Sampling_rate = ⅔ 1 NYC 20 212 3 NYC 30 243 3 NYC 30 243 4 Urbana 40 291 4 Urbana 40 291 5 NYC 20 453 6 NYC 30 293 6 NYC 30 293

Stratified Samples (on City ) ID City Age Session_Time 1 NYC 20 212 Sampling_rate(NYC) = 1/4 2 Urbana 40 532 Sampling_rate(Urbana) = 1/2 3 NYC 30 243 ID City Age Session_Time 4 Urbana 40 291 3 NYC 30 243 4 Urbana 40 291 5 NYC 20 453 6 NYC 30 293 Assign equal sample size to each groups

Stratified Samples (on City ) ID City Age Session_Time 1 NYC 20 212 Sampling_rate(NYC) = 3/4 Sampling_rate(Urbana) = 2/2 2 Urbana 40 532 ID City Age Session_Time 3 NYC 30 243 1 NYC 20 212 4 Urbana 40 291 3 NYC 30 243 4 Urbana 40 291 5 NYC 20 453 5 NYC 20 453 6 NYC 30 293 6 NYC 30 293

Storage cost of stratified samples ● Build several multi-dimensional stratified samples ○ increase query accuracy and latency ● n columns 2^n possible stratified samples ID City Age Session_Time [City] [Age] 1 NYC 20 212 [Session_Time] 2 Urbana 40 532 [City, Age] [City, Session_Time] 3 NYC 30 243 [Age, Session_Time] [City, Age, Session_Time] 4 Urbana 40 291 5 NYC 20 453 6 NYC 30 293

Storage cost of stratified samples ● Build several multi-dimensional stratified samples ○ increase query accuracy and latency ● n columns 2^n possible stratified samples ● Solution: ○ Find subsets of column sets that maximize the weighted sum of coverage of the QCSs of the queries q_j

Optimization formulation probability of a Coverage probability query type in of each query type workload Sparsity of the data Overall storage capacity budget storage cost of all samples

System Overview

Online sample selection ● Given a Query Q with specified time/error constraints ○ BlinkDB generate different query plans for the same query Q ● How to pick the plan that best satisfies the time/error constraints?

Strategy ● Select appropriate sample(s) ● execute the query Q on small samples of those appropriate samples(s), in order to gather statistics about ○ query’s selectivity ○ complexity ○ underlying distribution of its query ● For each candidate sample ○ construct an Error Latency Profile (ELP) ○ statistically predict for larger samples

SELECT AVG(SessionTime) Example FROM Sessions ● System has 3 stratified samples WHERE City = Galena’ ○ [date, country] ○ [date designated media area for a video ○ [date, ended_flag] ● Construct an ELP for each of the samples

Implementation enable queries with response time and error bounds assign query sized return error bars samples iteratively and confidence interval create/update the set of random and multi- dimensional samples

Evaluation Setting ● Conviva Workload ○ 17 TB in size ○ log of media accessed by Conviva users across 30 days ○ A sige big fact table with ~ 5.5 billion rows & 104 columns ○ raw query log constitutes 19,296 queries ● TPC-H workload ○ 1 TB of data ○ 22 benchmark queries ● For both of the workloads ○ partitioned data across 100 nodes ○ 50% storage budget

SELECT AVG(Session_Time) FROM Sessions BlinkDB v.s. No Sampling WHERE date = … GROUP BY City

● Uniform samples: 50% of entire data Response time v.s. Error ● Single Column: stratified on 1 column ● Multi-Column: stratifies on <= 3 columns

sample of 20 Conviva queries Time Guarantees ran each of them 10 times on 17 TB data set

sample of 20 Conviva queries Error Guarantees ran each of them 10 times on 17 TB data set

BlinkDB: Queries with Bounded Error and Bounded Response Times on - PowerPoint PPT Presentation

BlinkDB: Queries with Bounded Error and Bounded Response Times on Very Large Data Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica Presented by Liqi Xu SELECT AVG(SessionTime) Problem: very large data

BlinkDB (some figures were poached from the Eurosys conference talk) The Holy Grail Support

Queries in PSM The following rules apply to the use of queries: CS 235: 1. Queries

Chapter 11: The R.M.S. Error for Regression Errors: A has a large positive error B has a large

6.02 Fall 2012 Lecture #12 Bounded-input, bounded-output stability Frequency response 6.02

ERROR DETECTON & CORRECTION Error Detection EDC= Error Detection and Correction bits

Range Minimum and Lowest Common Ancestor Queries Slides by Solon P. Pissis November 15, 2019

Top- -k k Queries Queries on SQL on SQL Databases Databases Top Top-k Queries on SQL

Middleware Queries Queries Middleware Middleware Queries Prof. Paolo Ciaccia Prof. Paolo

Human Error and Human Error Identification Techniques adapted from an IE 545 presentaton by

An Overview of Human Error Drawn f rom J . Reason, Human Error , Cambridge, 1990 Aaron Brown CS

Questions From Chapter 1 Figure 1.1: Testing life cycle Ch 12 Error vocabulary 1

Error Detection Codes Error Detection Two types Nave scheme Error Detection Codes

llvm::Error Rich Error Handling in LLVM Error Handling History LLVMs APIs historically

Module 14: Analyzing Queries Overview Queries That Use the AND Operator the OR

Geometric Algorithms Range & windowing queries (2 lectures) Database queries 2/180 G.

Computational Geometry Lecture 14: Windowing queries Computational Geometry Lecture 14:

Welcome! Office Hours will start at 2pm and run until 3pm Please mute your microphone to cut

Operational Compliance Report March 2020 Frank Gresh Chief Information Officer To serve our

Refinement-based Exact Response-Time Analysis Martin Stigge Uppsala University, Sweden Joint

Making OpenVX Really Real Time Ming Yang 1 , Tanya Amert 1 , Kecheng Yang 1,2 , Nathan

The Quest for Average Response Time Tom Henzinger IST Austria Joint work with Krishnendu

Emulating Goliath Storage Systems with David Nitin Agrawal, NEC Labs Leo Arulraj, Andrea C.

Joint Probability Distributions In many random experiments, more than one quantity is measured,

PET Semetary: Privacys return from the dead and the Future of Privacy Engineering Seda

BlinkDB: Queries with Bounded Error and Bounded Response Times on - PowerPoint PPT Presentation

BlinkDB: Queries with Bounded Error and Bounded Response Times on Very Large Data Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica Presented by Liqi Xu SELECT AVG(SessionTime) Problem: very large data

BlinkDB (some figures were poached from the Eurosys conference talk) The Holy Grail Support

Queries in PSM The following rules apply to the use of queries: CS 235: 1. Queries

Chapter 11: The R.M.S. Error for Regression Errors: A has a large positive error B has a large

6.02 Fall 2012 Lecture #12 Bounded-input, bounded-output stability Frequency response 6.02

ERROR DETECTON &amp; CORRECTION Error Detection EDC= Error Detection and Correction bits

Range Minimum and Lowest Common Ancestor Queries Slides by Solon P. Pissis November 15, 2019

Top- -k k Queries Queries on SQL on SQL Databases Databases Top Top-k Queries on SQL

Middleware Queries Queries Middleware Middleware Queries Prof. Paolo Ciaccia Prof. Paolo

Human Error and Human Error Identification Techniques adapted from an IE 545 presentaton by

An Overview of Human Error Drawn f rom J . Reason, Human Error , Cambridge, 1990 Aaron Brown CS

Questions From Chapter 1 Figure 1.1: Testing life cycle Ch 12 Error vocabulary 1

Error Detection Codes Error Detection Two types Nave scheme Error Detection Codes

llvm::Error Rich Error Handling in LLVM Error Handling History LLVMs APIs historically

Module 14: Analyzing Queries Overview Queries That Use the AND Operator the OR

Geometric Algorithms Range &amp; windowing queries (2 lectures) Database queries 2/180 G.

Computational Geometry Lecture 14: Windowing queries Computational Geometry Lecture 14:

Welcome! Office Hours will start at 2pm and run until 3pm Please mute your microphone to cut

Operational Compliance Report March 2020 Frank Gresh Chief Information Officer To serve our

Refinement-based Exact Response-Time Analysis Martin Stigge Uppsala University, Sweden Joint

Making OpenVX Really Real Time Ming Yang 1 , Tanya Amert 1 , Kecheng Yang 1,2 , Nathan

The Quest for Average Response Time Tom Henzinger IST Austria Joint work with Krishnendu

Emulating Goliath Storage Systems with David Nitin Agrawal, NEC Labs Leo Arulraj, Andrea C.

Joint Probability Distributions In many random experiments, more than one quantity is measured,

PET Semetary: Privacys return from the dead and the Future of Privacy Engineering Seda

ERROR DETECTON & CORRECTION Error Detection EDC= Error Detection and Correction bits

Geometric Algorithms Range & windowing queries (2 lectures) Database queries 2/180 G.