Taster: Self-Tuning, Elastic and Online Approximate Query Processing - PowerPoint PPT Presentation



SLIDE 1

Taster: Self-Tuning, Elastic and Online Approximate Query Processing

Matthaios Olma, Odysseas Papapetrou, Raja Appuswamy, Anastasia Ailamaki


SLIDE 2

Enable AQP with minimal pre-processing


Challenges of interactive data exploration

Exploratory Applications

  • Dynamic & data-driven
  • Instant access to data
  • Interactive response time

Scientific exploration, “Internet of Things” analytics

Building data summaries is expensive; reduce result precision instead – use AQP

Data Exploration vs. Data Preparation

SLIDE 3

Performance vs. flexibility


Online vs. offline sampling

Online AQP (e.g., Quickr)
  • Injects sampling operators into the query plan
  • Reduces intermediate data and CPU load
  • No preprocessing, no workload knowledge, no storage overhead
  • ~2x performance improvement

Offline AQP (e.g., BlinkDB)
  • Pre-samples the data based on the query workload
  • Reduces I/O
  • Workload knowledge required, 0.5-2x storage overhead
  • ~10x performance improvement
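To make the online approach concrete, here is a minimal sketch of what an injected sampler operator does: take a uniform row-level sample during execution and scale the aggregate back up. This is an illustration only, not Quickr's or Taster's actual code; the function and column names are hypothetical.

```python
import random

def approx_sum(rows, key, rate, seed=42):
    """Estimate SUM(key) from a uniform row-level sample taken at `rate`,
    scaling the sampled sum by 1/rate (Horvitz-Thompson estimator)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = [r[key] for r in rows if rng.random() < rate]
    return sum(sampled) / rate

# True SUM(price) over 1..10000 is 50_005_000; the estimate lands close
# to it while touching only ~20% of the rows.
rows = [{"price": p} for p in range(1, 10_001)]
estimate = approx_sum(rows, "price", rate=0.2)
```

In a real engine the sampler sits below the aggregation in the plan, so every downstream operator sees less intermediate data – which is where the CPU and shuffle savings come from.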

SLIDE 4

[Chart: cumulative time (hours) over a 200-query sequence for Baseline, Offline AQP (BlinkDB), and Online AQP (Quickr)]

Reducing pre-processing time

Ideal: No sampling preparation cost & interactive access


Sampling pays off after 85 queries

11-node SparkSQL cluster, TPC-H (300GB), 200 queries (18 TPC-H query templates)

Sampling pays off after 159 queries

  • Improve over online AQP
SLIDE 5
  • Reduce the amount of data accessed

– Materialize and re-use summaries generated as intermediate results

  • Adapt materialized summaries to workload and storage budget
  • Use a variety of summaries other than samples


Enhancing Online Approx. Query Processing

  • What to materialize
  • If/when to materialize
  • If/when to evict

SLIDE 6


Materialize and re-use synopses

[Query plan diagram: an aggregation (Γ) over sampler operators and selections (σ) on inputs A and B; the sampler output is materialized as summary S1]

  • Store all subplans and statistics in hitmap

– Update when subplans re-appear

  • Calculate prospective gains (cost:benefit)
  • Performance gains over future workload
  • Storage cost
  • Maximize benefit – Knapsack constraint problem
  • Decide to materialize
  • Inject materializer operator
  • Store intermediate result in-memory and flush offline

Summary warehouse
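The materialize-or-not decision above is framed as a knapsack problem: maximize expected benefit under a storage budget. A minimal 0/1-knapsack sketch of that choice, with made-up synopsis sizes and benefit scores (not Taster's actual cost model):

```python
def choose_synopses(candidates, budget):
    """0/1 knapsack: pick the subset of candidate synopses whose total
    size fits the storage budget and maximizes expected benefit.
    candidates: list of (name, size, benefit) with integer sizes."""
    # dp[b] = (best_benefit, chosen_names) using at most b storage units
    dp = [(0, [])] * (budget + 1)
    for name, size, benefit in candidates:
        for b in range(budget, size - 1, -1):  # descending: each item used once
            cand = (dp[b - size][0] + benefit, dp[b - size][1] + [name])
            if cand[0] > dp[b][0]:
                dp[b] = cand
    return dp[budget]

# Hypothetical candidates: (synopsis, storage units, predicted benefit)
cands = [("S1", 4, 30), ("S2", 3, 14), ("S4", 2, 16)]
best, picked = choose_synopses(cands, budget=6)  # -> picks S1 and S4
```

The real tuner replaces the static `benefit` numbers with prospective gains predicted over the future workload, as the bullets above describe.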

SLIDE 7


Adapting materialized summaries

[Timeline: queries Q1-Q6 and the summaries each one uses (S1, S2, S4)]

  • Window-based prediction
  • Ideal window size depends on: user, task, data
  • Keep statistics for windows (1-a)w, w, (1+a)w
  • Adapt window size based on quality of predictions

Summary warehouse

[Diagram: with window w = 2, summaries S1 and S2 are materialized; S4 is later materialized and re-used]

Online tuning of the window size improves forecast efficiency
Abides by storage requirements despite workload shifts
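A toy sketch of the window-based prediction idea: score the candidate window sizes (1-a)w, w, (1+a)w against the recent history and keep the one that would have predicted the used summaries best. The scoring scheme here is an assumption for illustration, not Taster's actual tuner.

```python
def predict(history, w):
    """Predict the next query's useful summaries as the union of the
    summary sets used by the last w queries."""
    recent = history[-w:]
    return set().union(*recent) if recent else set()

def adapt_window(history, w, a=0.5):
    """Score candidate windows (1-a)w, w, (1+a)w by how many of the
    summaries they would have predicted were actually used next, and
    return the best-scoring window size."""
    candidates = sorted({max(1, round((1 - a) * w)), w, round((1 + a) * w)})
    def score(cand):
        return sum(len(predict(history[:i], cand) & history[i])
                   for i in range(cand, len(history)))
    return max(candidates, key=score)

# Alternating workload: w = 1 mispredicts every query, w = 2 covers it.
history = [{"S1"}, {"S2"}, {"S1"}, {"S2"}, {"S1"}, {"S2"}]
```

Because the three windows are scored continuously, a shift in the workload shows up as a drop in the current window's score, and w drifts toward the better-predicting size.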

SLIDE 8

Utilize each summary where useful


Sampling: all queries on a subset of the data
  • Keeps the schema of the original table
  • Precision depends on the query
  • Uniform/stratified sampling
  • Answers a large subset of queries
  • Large size (~10% of input); I/O cost depends on size

Sketches: some queries on all of the data
  • Count/Sum/Avg aggregations
  • Single grouping attribute
  • Answer specific queries
  • Compact (~KB); constant access time

Combining different data summaries
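To illustrate why sketches answer Count/Sum queries over a single grouping attribute in a few KB, here is a minimal Count-Min sketch. This is the generic textbook structure, not Taster-specific code; the group keys are made up.

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: point COUNT/SUM estimates per group key
    in fixed space, with one-sided (over-)estimates from hash collisions."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):  # one independent hash per row
            h = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, key, value=1):
        for row, col in self._buckets(key):
            self.table[row][col] += value

    def estimate(self, key):
        # min over rows bounds the overcounting from collisions
        return min(self.table[row][col] for row, col in self._buckets(key))

cms = CountMinSketch()
for _ in range(100):
    cms.add("nation=FRANCE")       # e.g., COUNT per nation
cms.add("nation=GERMANY", 5)
```

The whole table is `width x depth` integers regardless of data size, which is the "compact ~KB, constant access time" property on the slide; the price is that it answers only the aggregates it was built for.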

SLIDE 9
  • Inject approximation operators into plans
  • Re-use existing materialized synopses
  • Choose which synopsis to generate
  • Store statistics about the historical plans
  • Store the synopsis over HDFS

Taster Architecture

[Architecture diagram: SQL query → Query Optimization → Query Execution over Data, coordinated by an Online tuner, with a Synopsis warehouse and a Metadata store]

SLIDE 10

Datasets

  • TPC-H: sf300 (300GB), 18 query templates

Systems

  • SparkSQL (2.1.0)
  • BlinkDB, Quickr, Taster over SparkSQL (2.1.0)

Hardware

  • 11 nodes, each with 2x Intel Xeon X5660 CPU @ 2.80GHz, 48GB RAM, 10GbE


Experimental Setup

SLIDE 11

[Chart: end-to-end execution time (min), split into offline sampling and query execution, for Baseline, Quickr, BlinkDB (50%), Taster (50%), BlinkDB (100%), Taster (100%)]

11-node SparkSQL cluster, TPC-H sf300, 200 queries (18 TPC-H templates)

End-to-end execution time: Taster offers execution time comparable to the state of the art

SLIDE 12


[Chart: per-query execution time (min) over an 80-query sequence]

Adapting to shifting workload: Taster adapts efficiently to changes in workload

11-node SparkSQL cluster, TPC-H sf300, 80 queries (18 TPC-H templates)

SLIDE 13
  • Piggy-back the creation of summaries onto query execution

– In the context of distributed approximate query processing

  • Adapt data summaries to workload shifts while respecting the storage budget
  • Provide query performance comparable to offline AQP approaches

– With reduced building and storage cost


Take home message

Thank you!