
A Robust Partitioning Scheme for Ad-Hoc Query Workloads
Anil Shanbhag (MIT)
Joint work with Alekh Jindal (Microsoft), Sam Madden (MIT), Jorge Quiane (QCRI), Aaron J. Elmore (Univ. Chicago)

Today
Data collection is cheap => lots of data!

Data Partitioning
Example query: find the average order size for all orders between Sept 10 and Sept 11, 2017 (data partitioned on order date).
Data skipping: skip the data blocks that the query does not need.
10% selectivity query => 10x faster if the data is partitioned on the selection predicate.
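A minimal sketch of data skipping for the order-date example above. It assumes each block stores the minimum and maximum order date of its rows; the block layout, `make_block`, and `average_order_size` are hypothetical names for illustration, not the system's API.

```python
from datetime import date

def make_block(rows):
    """Build a block from (order_date, order_size) rows and record its date range."""
    dates = [d for d, _ in rows]
    return {"min": min(dates), "max": max(dates), "rows": rows}

def average_order_size(blocks, lo, hi):
    """Average order size for lo <= order_date <= hi, plus the number of blocks read."""
    scanned = 0
    sizes = []
    for block in blocks:
        # Skip blocks whose date range cannot overlap the query range.
        if block["max"] < lo or block["min"] > hi:
            continue
        scanned += 1
        sizes.extend(size for d, size in block["rows"] if lo <= d <= hi)
    avg = sum(sizes) / len(sizes) if sizes else None
    return avg, scanned

# Illustrative data only: three blocks partitioned on order date.
blocks = [
    make_block([(date(2017, 9, 1), 4), (date(2017, 9, 5), 6)]),
    make_block([(date(2017, 9, 10), 10), (date(2017, 9, 11), 20)]),
    make_block([(date(2017, 9, 12), 8), (date(2017, 9, 20), 2)]),
]
avg, scanned = average_order_size(blocks, date(2017, 9, 10), date(2017, 9, 11))
```

Because the data is partitioned on the selection attribute, only one of the three blocks is read here; with a layout that ignores order date, every block could contain matching rows.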

The Problem
Analytics = Ad-Hoc/Exploratory Analysis + Recurring Workloads
Focus of existing work: given a workload => return a partitioning layout
Problems:
1. Tedious to collect the workload
2. The workload may not be known upfront
3. The workload changes over time
How do we get the benefits of partitioning in this case?

Our Approach
Do everything adaptively! A two-step process:
1. Load the dataset partitioned upfront
2. As users query, incrementally improve the partitioning of the data

Upfront Partitioning
In distributed storage systems like HDFS, files are broken into blocks (128 MB chunks).
> Instead of partitioning by size, partition by attributes.
> The same number of blocks is created as in HDFS.
> Each block now has additional metadata, e.g. A <= 5 and B <= 7.
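The idea of partitioning by attributes can be sketched with a small binary partitioning tree. This is an illustration under assumptions: the `Node`/`Leaf` classes, the `route` helper, and the cut points are hypothetical, and the real system's tree format may differ. The path taken to a leaf is exactly the kind of metadata ("A <= 5 and B <= 7") attached to a block.

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class Leaf:
    block_id: int
    rows: list = field(default_factory=list)

@dataclass
class Node:
    attr: str
    cut: float
    left: Union["Node", Leaf]   # rows with row[attr] <= cut
    right: Union["Node", Leaf]  # rows with row[attr] > cut

def route(tree, row):
    """Walk the tree and return the leaf (block) for a row plus the predicates on the path."""
    node, path = tree, []
    while isinstance(node, Node):
        if row[node.attr] <= node.cut:
            path.append(f"{node.attr} <= {node.cut}")
            node = node.left
        else:
            path.append(f"{node.attr} > {node.cut}")
            node = node.right
    return node, path

# Hypothetical tree: split on A at 5, then on B at 7 in both halves.
tree = Node("A", 5,
            Node("B", 7, Leaf(0), Leaf(1)),
            Node("B", 7, Leaf(2), Leaf(3)))

# Loading: every row is appended to the block its attribute values lead to.
rows = [{"A": 2, "B": 3}, {"A": 4, "B": 9}, {"A": 8, "B": 1}, {"A": 9, "B": 9}]
for row in rows:
    leaf, _ = route(tree, row)
    leaf.rows.append(row)

leaf, path = route(tree, {"A": 2, "B": 3})
```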

Adaptive Re-Partitioning
When a user submits a query, the optimizer tries to improve the partitioning by reorganizing the partitioning tree.
For example, if queries ask for A <= 3 many times, replace the B <= 7 node with A <= 3.
This has been done on datasets on the order of 1 TB with partitioning trees of ~8000 nodes.

System Architecture
Predicated scan query example: FIND employees WITH Age < 30 AND 20k < Salary < 40k
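A rough sketch of how a predicated scan could use a partitioning tree to decide which blocks to read. The tuple-based tree encoding, the cut points, and the `lookup` function are assumptions made for illustration; predicates are modeled as open intervals per attribute.

```python
import math

# Hypothetical tree: internal node = (attr, cut, left, right), leaf = block id.
# The left subtree holds values <= cut, the right subtree values > cut.
tree = ("Age", 30,
        ("Salary", 20_000, 0, ("Salary", 40_000, 1, 2)),
        ("Salary", 30_000, 3, 4))

def lookup(node, ranges):
    """Return ids of blocks that may hold rows satisfying lo < value < hi per attribute."""
    if isinstance(node, int):
        return [node]
    attr, cut, left, right = node
    lo, hi = ranges.get(attr, (-math.inf, math.inf))
    blocks = []
    if lo < cut:   # some qualifying value may be <= cut
        blocks += lookup(left, ranges)
    if hi > cut:   # some qualifying value may be > cut
        blocks += lookup(right, ranges)
    return blocks

# Age < 30 AND 20k < Salary < 40k
query = {"Age": (-math.inf, 30), "Salary": (20_000, 40_000)}
candidates = lookup(tree, query)
```

The check is conservative: it may keep a block that turns out to hold no matching rows, but it never drops a block that could hold one, so the scan still has to apply the predicates to the rows it reads.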

1. Upfront Partitioner
Goal: Generate a partitioning tree WITHOUT an upfront query workload
> Generates a tree with heterogeneous branching
> Balances the partitioning benefit across all attributes

Allocation
Goal: Balance the partitioning benefit across attributes
Allocation of attribute i ~ average partitioning of the attribute:
$\mathrm{Allocation}_i = \sum_{j} n_{ij}\, c_{ij}$, summed over all nodes $j$
Attribute Allocations -> Upfront Partitioning Algorithm -> Partitioning Tree
Uniform allocation if there is no workload information; weighted if we have prior workload information
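The allocation sum can be illustrated with a short calculation. The slide does not define the two factors, so this sketch assumes that n_ij is the number of ways node j splits the data on attribute i (2 for a binary split) and that c_ij is the fraction of the dataset reaching that node; both readings, and the `allocations` helper, are assumptions.

```python
from collections import defaultdict

def allocations(nodes):
    """Sum n_ij * c_ij per attribute.

    nodes: iterable of (attribute, fanout n_ij, data_fraction c_ij) tuples,
    one per internal node of the partitioning tree (assumed interpretation).
    """
    alloc = defaultdict(float)
    for attr, fanout, fraction in nodes:
        alloc[attr] += fanout * fraction
    return dict(alloc)

# Hypothetical tree: the root splits on A over all the data,
# and its two children split on B and C over half the data each.
nodes = [("A", 2, 1.0), ("B", 2, 0.5), ("C", 2, 0.5)]
result = allocations(nodes)
```

Under these assumptions A receives twice the allocation of B or C, so a balancing algorithm would favor B and C when choosing attributes for further splits.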

2. Adaptive Query Executor
Goal: Return matching tuples + check whether the partitioning layout can be improved
Alternatives are found via transformations on the partitioning tree:
1. Swap Rule
2. Pushup Rule
3. Rotate Rule

Getting a plan

Cost Model
The system maintains a window W of past queries.
It computes the benefit and the repartitioning cost for the best plan.
Repartitioning happens ONLY when the reduction in the total cost of the query workload is greater than the repartitioning cost.
This avoids constant repartitioning due to random query sequences and bounds the worst-case impact.
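The repartitioning rule can be sketched as a comparison over the query window. The cost functions here are stand-ins (for example, blocks read per query), and `should_repartition`, the window size, and the numbers are hypothetical values chosen for illustration.

```python
from collections import deque

def should_repartition(window, cost_now, cost_after, repartition_cost):
    """Repartition only if the workload cost reduction exceeds the repartitioning cost."""
    benefit = sum(cost_now(q) - cost_after(q) for q in window)
    return benefit > repartition_cost

# Window W of the most recent queries (here W = 3); older queries fall out.
window = deque(maxlen=3)
for q in ["A<=3", "A<=3", "B<=7", "A<=3"]:
    window.append(q)

# Assumed per-query costs under the current tree and under a candidate tree.
cost_now = {"A<=3": 4, "B<=7": 2}.get
cost_after = {"A<=3": 2, "B<=7": 4}.get

cheap = should_repartition(window, cost_now, cost_after, repartition_cost=1)
expensive = should_repartition(window, cost_now, cost_after, repartition_cost=5)
```

In this example the candidate tree saves 2 cost units over the window, so it is adopted when repartitioning costs 1 unit and rejected when it costs 5, which is how a single unusual query is kept from triggering a costly reorganization.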

Performance
4 metrics:
1) Load time
2) Time taken by first query
3) Aggregate runtime over a workload
4) Incremental improvement with workload hints

Load Time
TPC-H: Scale Factor 200, de-normalized. Data size: 1.4 TB
Loading performance: 1.38 times slower than HDFS
Load time scales almost linearly with data size and is independent of the number of columns

Time taken by first query
On average: 45% better than full scan, 20% better than k-d tree

Aggregate Workload Runtime
Workload: 200 queries generated from random initializations of 8 query templates of the TPC-H benchmark
[Figure: time taken (in s) vs. query no. for full scan, range, range2, and Amoeba]
full scan – baseline
range – partitions on orderdate (1 per date): 1.88x better
range2 – partitions on orderdate(64), r_name(4), c_mktsegment(4), quantity(8): 3.48x better
Amoeba – 3.84x better than baseline

Workload Hints
[Figure: time taken (in s) vs. query no. for default and better init]
Better init: starts with a custom allocation to mimic range2; 6.67x better than full scan
Filtering ratio:
default: 0.81
better init: 0.9

Conclusion
• Amoeba is a distributed storage system based on an adaptive data partitioning scheme
• Low loading overhead
• Improved first query performance
• Adapts to changes and significantly improves workload runtime
• Can exploit workload hints
• Allows analysts to get started right away and reap the benefits of partitioning without an upfront workload
