Taster: Self-Tuning, Elastic and Online Approximate Query Processing


  1. Taster: Self-Tuning, Elastic and Online Approximate Query Processing
     Matthaios Olma, Odysseas Papapetrou, Raja Appuswamy, Anastasia Ailamaki

  2. Data Exploration vs. Data Preparation. Challenges of interactive data exploration:
     • Exploratory applications are dynamic & data-driven (scientific exploration, "Internet of Things" analytics)
     • They require interactive response time and instant access to the data
     • Reducing result precision is acceptable: use AQP
     • But building data summaries is expensive, so the goal is to enable AQP with minimal pre-processing

  3. Performance vs. flexibility. Two families of AQP:
     • Offline AQP (e.g., BlinkDB): pre-sample the data before querying. Requires workload knowledge and incurs 0.5-2x storage overhead, but delivers ~10x performance.
     • Online AQP (e.g., Quickr): inject sampling operators into the query plan to reduce intermediate data, I/O, and CPU load. No preprocessing, no workload knowledge, and no storage overhead, but only ~2x performance. (A toy plan-rewrite sketch follows.)
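The sampler injection that online AQP performs can be pictured with a toy plan rewrite. A minimal sketch, assuming made-up Scan/Sampler plan nodes (not Quickr's actual operators, which rewrite Catalyst plans inside SparkSQL): placing a sampler directly above the scan shrinks every intermediate result in the plan with no pre-processing of the base table.

```python
import random

# Hypothetical, minimal logical-plan nodes for illustration only.
class Scan:
    def __init__(self, rows):
        self.rows = rows
    def execute(self):
        yield from self.rows

class Sampler:
    """Bernoulli sampler injected into the plan: every operator above it
    now processes roughly `rate` of the input (less I/O and CPU)."""
    def __init__(self, child, rate):
        self.child, self.rate = child, rate
    def execute(self):
        for row in self.child.execute():
            if random.random() < self.rate:
                yield row

# Inject the sampler right above the scan, then scale up the estimate.
plan = Sampler(Scan(range(1_000_000)), rate=0.01)
approx_count = sum(1 for _ in plan.execute()) / 0.01
print(f"estimated rows: {approx_count:.0f}")
```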

  4. Reducing pre-processing time. Setup: 11-node SparkSQL cluster, TPC-H (300GB), 200 queries drawn from 18 TPC-H query templates. [Figure: cumulative time (hours) over the query sequence for the baseline, Offline AQP (BlinkDB), and Online AQP (Quickr).] Offline sampling pays off only after 85 queries, and only after 159 queries over online AQP. Ideal: no sampling preparation cost & interactive access.

  5. Enhancing Online Approximate Query Processing
     • Reduce the amount of data accessed: materialize and re-use intermediate generated summaries
     • Adapt materialized summaries to the workload and the storage budget
     • Use a variety of summaries other than samples
     Three decisions: what to materialize, if/when to materialize, if/when to evict.

  6. Materialize and re-use synopses
     • Store all subplans and their statistics in a hitmap; update the statistics when subplans re-appear
     • Calculate prospective gains (cost vs. benefit): expected performance gains over the future workload weighed against storage cost; maximizing total benefit under the storage budget is a knapsack constraint problem (see the sketch below)
     • Decide to materialize: inject a materializer operator, store the intermediate result in memory, and flush it offline to the synopsis warehouse
     [Figure: query plan with injected sampler operators below a join and an aggregation; the intermediate result is materialized as summary S1 into the synopsis warehouse.]
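Selecting which synopses to materialize under a storage budget is the 0/1 knapsack problem the slide names. A minimal sketch, with made-up candidate names, costs, and benefit numbers (the real system estimates benefit from predicted gains over the future workload):

```python
def choose_synopses(candidates, budget):
    """0/1 knapsack over candidate synopses: maximize expected benefit
    subject to an integer storage budget."""
    # dp[b] holds (best_benefit, chosen_names) achievable within budget b
    dp = [(0.0, [])] * (budget + 1)
    for name, cost, benefit in candidates:
        # iterate budgets downward so each candidate is picked at most once
        for b in range(budget, cost - 1, -1):
            gain = dp[b - cost][0] + benefit
            if gain > dp[b][0]:
                dp[b] = (gain, dp[b - cost][1] + [name])
    return dp[budget]

# Illustrative numbers only: (synopsis, storage cost in MB, expected benefit,
# e.g. estimated seconds saved over the predicted workload window).
candidates = [("S1", 400, 90.0), ("S2", 250, 60.0), ("S4", 700, 120.0)]
print(choose_synopses(candidates, budget=1000))  # -> (180.0, ['S2', 'S4'])
```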

  7. Adapting materialized summaries
     • Window-based prediction: with window size w = 2, the summaries used by the last w queries (e.g., S1, S2, S4 across queries Q1..Q6) determine which summaries are kept in the synopsis warehouse
     • The ideal window size depends on the user, the task, and the data
     • Keep statistics for window sizes (1-a)w, w, and (1+a)w, and adapt the window size based on the quality of its predictions (see the sketch below)
     • Abide by the storage budget despite workload shifts
     Online tuning of the window size improves forecast efficiency.
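A toy version of that feedback loop: score the three candidate sizes (1-a)w, w, (1+a)w against the summaries each arriving query actually uses, and move w toward the best scorer. The scoring rule (hits minus wasted materializations) and the example stream below are assumptions for illustration:

```python
from collections import Counter

def predict(history, w):
    """Window-based prediction: a summary is expected to be useful
    if any of the last w queries used it."""
    return {s for q in history[-w:] for s in q}

def tune_window(stream, w=2, a=0.5):
    """Score windows (1-a)w, w, (1+a)w online and adapt w toward the best."""
    history, scores = [], Counter()
    for used in stream:            # 'used': summaries this query touched
        lo = max(1, round((1 - a) * w))
        hi = max(lo + 1, round((1 + a) * w))
        for label, cand in (("lo", lo), ("mid", w), ("hi", hi)):
            predicted = predict(history, cand)
            # reward correct predictions, penalize wasted materializations
            scores[label] += len(predicted & used) - len(predicted - used)
        history.append(used)
        if scores["lo"] > scores["mid"]:
            w = lo
        elif scores["hi"] > scores["mid"]:
            w = hi
    return w

# Example usage with an assumed stream of summary usage per query.
stream = [{"S1"}, {"S2"}, {"S4"}, {"S2"}, {"S1", "S2"}, {"S4"}]
print(tune_window(stream))
```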

  8. Combining different data summaries
     Sampling (all queries on a subset of the data):
     - Keeps the schema of the original table; uniform/stratified sampling
     - Precision depends on the query
     - Answers a large subset of queries
     - Large size, ~10% of the input; I/O cost depends on the size
     Sketches (some queries on all of the data):
     - Count/Sum/Avg aggregations over a single grouping attribute
     - Answer specific queries
     - Compact, ~KB; constant access time
     Utilize each summary where it is useful (see the sketch below).
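The trade-off can be made concrete with two tiny summaries (plain illustrative Python, not Taster's implementations): a uniform reservoir sample keeps whole rows and so serves many query shapes approximately, while a per-group aggregate summary is tiny but answers only Count/Sum/Avg on one grouping attribute.

```python
import random
from collections import defaultdict

def reservoir_sample(rows, k):
    """Uniform sample of k rows (reservoir algorithm): keeps the original
    schema, so it can approximate many queries, but stores k full rows."""
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        elif (j := random.randint(0, i)) < k:
            sample[j] = row
    return sample

def sum_sketch(rows, group_col, agg_col):
    """Compact per-group (count, sum) pairs: answer Count/Sum/Avg for this
    one grouping attribute exactly, and nothing else."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in rows:
        c = acc[row[group_col]]
        c[0] += 1
        c[1] += row[agg_col]
    return dict(acc)

rows = [{"region": r, "price": p} for r, p in
        [("EU", 10.0), ("US", 20.0), ("EU", 30.0), ("US", 40.0)]]
print(reservoir_sample(rows, k=2))
print(sum_sketch(rows, "region", "price"))  # {'EU': [2, 40.0], 'US': [2, 60.0]}
```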

  9. Taster Architecture
     - Online tuner: injects approximation operators into plans, re-uses existing materialized synopses, and chooses which synopsis to generate during query optimization and execution
     - Metadata store: keeps statistics about the historical plans
     - Synopsis warehouse: stores the synopses over HDFS
     (A sketch of the implied control loop follows.)
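A deliberately simplified, runnable sketch of that control loop; the function and the materialization rule here are hypothetical stand-ins, not Taster's real interfaces:

```python
def taster_step(plan, warehouse, stats, budget):
    """One query through the tuner. plan: list of subplan keys;
    warehouse: subplan -> synopsis; stats: subplan -> times seen.
    All names and the materialization rule are hypothetical."""
    actions = []
    for sub in plan:
        stats[sub] = stats.get(sub, 0) + 1        # metadata store update
        if sub in warehouse:
            actions.append(("reuse", sub))        # answer from the synopsis
        elif stats[sub] >= 2 and budget > 0:      # toy stand-in for knapsack
            warehouse[sub] = f"synopsis({sub})"   # piggy-backed on execution
            budget -= 1
            actions.append(("materialize", sub))
        else:
            actions.append(("scan", sub))
    return actions, budget

stats, warehouse, budget = {}, {}, 1
for q in [["A JOIN B", "FILTER C"], ["A JOIN B"], ["A JOIN B", "FILTER C"]]:
    actions, budget = taster_step(q, warehouse, stats, budget)
    print(actions)
```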

  10. Experimental Setup
     Datasets: TPC-H sf300 (300GB), 18 query templates
     Systems: SparkSQL (2.1.0); BlinkDB, Quickr, and Taster implemented over SparkSQL (2.1.0)
     Hardware: 11 nodes, each with 2 x Intel Xeon X5660 CPUs @ 2.80GHz and 48GB RAM, connected over 10GbE

  11. End-to-end execution time. 11-node SparkSQL cluster, TPC-H sf300, 200 queries (18 TPC-H templates). [Figure: total execution time (min), split into offline sampling and query execution, for Baseline, Quickr, BlinkDB (50%), Taster (50%), BlinkDB (100%), and Taster (100%).] Taster offers execution time comparable to the state of the art.

  12. Adapting to shifting workload. 11-node SparkSQL cluster, TPC-H sf300, 80 queries (18 TPC-H templates). [Figure: per-query execution time (min) over the query sequence.] Taster adapts efficiently to changes in the workload.

  13. Take-home message
     • Piggy-back the creation of summaries on query execution, in the context of distributed approximate query processing
     • Adapt data summaries to workload shifts and reduced storage budgets
     • Provide query performance comparable to offline AQP approaches, with reduced building and storage cost
     Thank you!
