SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research - PowerPoint PPT Presentation

1 SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research

Background 2  Query optimization remains challenging despite of decades of efforts and progresses.  Cardinality estimation is the key challenge.  Selectivity of join predicates  Correlation of columns

Histogram vs. Sampling 3  Single-column histograms cannot capture data correlations between columns.  Use the attribute-value-independence (AVI) assumption.  Sampling is better than histograms on capturing data correlations.  We run query over exact rather than summarized data.

But Why are Histograms Dominant? 4  The overhead is much smaller, compared with other cardinality estimation approaches.  Sampling incurs additional overhead and should be used conservatively.  A naïve idea: use sampling for all plans considered by the optimizer.

Cost-Based Query Optimization 5 Pick the best plan from N candidates: P1 P N Merge Hash Join Join N could be large! (10 2 or even 10 3 ) C Hash Nested B … Join Loop A B A C For large N , sampling is not affordable to be used for every plan.

Our Idea 6  Use sampling as a post-processing validation step.  Detect cardinality estimation errors for the final plan returned by the optimizer.  Re-optimize the query if cardinality estimation errors are detected. Catch big mistakes of the optimizer before the plan runs!

The Re-optimization Algorithm 7 Query q Plan P q Final Plan Query Optimizer Update Cardinalities Sampling-based Refined Validation Cardinality Estimates Γ

The Re-optimization Algorithm (Cont.) 8  Example: 𝑟 = 𝐵 ⋈ 𝐶 ⋈ 𝐷 (Final Plan) P 1 P 2 Query q Query Optimizer Join Cardinality 100 𝐵 ⋈ 𝐶 300 𝐶 ⋈ 𝐷 Update 500 𝐵 ⋈ 𝐷 𝐵 ⋈ 𝐶 : 1000 Sampling-based Validation

Efficiency of Re-optimization 9  The worst-case expected number of iterations: 𝑂 N is the number of join 𝑙 ⋅ (1 − 1 𝑂) ⋅⋅⋅ (1 − 𝑙 − 1 ) ⋅ 𝑙 𝑇 𝑂 = ෍ trees in the search space. 𝑂 𝑂 𝑙=1  𝑇 𝑂 ∼ 𝑃( 𝑂 ) .

Quality of Re-optimized Plans 10  If sampling-based cost estimates are consistent with the actual costs, that is, cost_est(P1) < cost_est(P2) => cost_act(P1) < cost_act(P2), then the final re-optimized plan is locally optimal : cost_act(P final ) <= cost_act(P), for any P in re-optimization.  However, cost models are imperfect, and cardinality estimates based on sampling are imperfect, too.  See experimental results.

Experimental Evaluation 11  We implemented the re-optimization procedure in PostgreSQL 9.0.4.  We have two goals:  Test the approach for “common” cases.  Test the approach for “corner” cases.

Experimental Evaluation (Cont.) 12  “Common” cases  10GB TPC-H benchmark  “Corner” cases  (Homegrown) Optimizer “Torture Test ” (OTT) Specially designed database and queries with high data correlation that can challenge query optimizers.

Experimental Evaluation (Cont.) 13  Results on the 10GB TPC-H database

Experimental Evaluation (Cont.) 14  Results of the “torture test” (5 -join queries, log-scale)

Details of OTT 15  More details about OTT:  K tables R 1 , …, R K , with R k ( A k , B k )  Each R k is generated independently, with B k = A k .  A k (and thus B k ) is uniformly distributed.  The queries look like: 𝜏 𝐵 1 =𝑑 1 ∧⋅⋅⋅∧𝐵 𝐿 =𝑑 𝐿 ∧𝐶 1 =𝐶 2 ∧⋅⋅⋅∧𝐶 𝐿−1 =𝐶 𝐿 (𝑆 1 ×⋅⋅⋅× 𝑆 𝐿 ) Property : These queries are not empty if and only if A 1 = … = A K !

Details of OTT (Cont.) 16  An instance of OTT used in our experiments:  Use 6 TPC-H tables (excluding “ nation ” and “ region ”) .  Use a set of empty queries with non-empty sub-queries. Non-empty Empty! Bad Plan Good Plan

Summary 17 Plan P q Query Sampling-based q Optimizer Validation Improved Query Plan Feedback Sampling as post-processing : efficiency/effectiveness tradeoff !

Q & A 18  Thank you 

Cardinality Estimation Methods 19  Histograms  Single-column histograms (dominant in current DBMS)  Multi-column histograms  Other methods  Offline approaches: sampling, sketch, graphical models  Online approaches: dynamic query plans, parametric query optimization, query feedback, mid-query re- optimization, plan bouquets

A Sampling-Based Estimator 20  Estimate the selectivity 𝜍 𝑟 of a join query 𝑟 = 𝑆 1 ⋈ 𝑆 2 . [Haas et al., J. Comput. Syst. Sci. 1996] Do a “ cross product ” over the samples: 𝜍 𝑗, 𝑘 = 0 𝑝𝑠 1 . 𝜍(1, 1) ⋈ 𝑠 𝑠 | 𝑆 𝑡 1 ⋈ 𝑆 𝑡 2 | 11 21 … …… 𝑠 𝑠 11 21 ⋈ 𝜍(1, 𝑂 2 ) 𝑠 𝑠 𝑠 𝑠 2𝑂 2 11 12 22 σ 𝑗,𝑘 𝜍(𝑗, 𝑘) 𝜍 𝑟 = ො …… …… 𝑂 1 𝑂 2 𝑠 ⋈ 𝜍(𝑂 1 , 1) 𝑠 𝑠 𝑠 1𝑂 1 2𝑂 2 1𝑂 1 21 … …… | 𝑆 𝑡1 | × |𝑆 𝑡2 | R s R s 2 1 ⋈ 𝜍(𝑂 1 , 𝑂 2 ) 𝑠 𝑠 2𝑂 2 1𝑂 1 The estimator ො 𝜍 𝑟 is unbiased and strongly consistent .

Other Sampling-Based Methods 21  Sampling-Based Estimation of the Number of Distinct Values of an Attribute, VLDB’95  Towards Estimation Error Guarantees for Distinct Values, PODS’00  End-biased Samples for Join Cardinality Estimation, ICDE’06  Join Size Estimation Subject to Filter Conditions, VLDB’15

Convergence of Re-optimization 22  Convergence Condition of Re-optimization Theorem : The re-optimization procedure terminates when all the joins in the returned query plan have been observed in previous rounds of iteration. For example, re-optimization will terminate after T 1 ’ is returned.

Convergence of Re-optimization (Cont.) 23  The previous convergence condition is sufficient but not necessary .  Re-optimization could terminate even before it meets the previous condition. Termination  To understand re-optimization better, we need the notion of local/global transformations.

Local/Global Transformations 24  Local transformation of query plans Local transformations are those plans that share the same joins. They only differ in choices of specific physical operators.

Characterization of Re-optimization 25  The three possible cases in re-optimization:  (1) It terminates in two steps with P2 = P1.  (2) It terminates in n + 1 steps ( n > 1) where all plan transitions are global transformations.  (3) It terminates in n + 1 steps ( n > 1) where only the last transition is a local transformation: the others are all global transformations.

Characterization of Re-optimization (Cont.) 26  An illustration of Case (2) and (3): The number of iterations thus depends on the number of global transformations!

Analysis of Efficiency 27  A probabilistic model for analysis of expected number of steps in re-optimization:  We have N balls in a queue , initially unmarked . Yes … Is b 1 marked? Exit b 1 b N No b 1 Insert Back Mark b 1  The probability that the ball will be inserted at any position in the queue is uniformly 1/N.

Analysis of Efficiency (Cont.) 28  The expected number of steps of the previous procedure is: 𝑂 𝑙 ⋅ (1 − 1 𝑂) ⋅⋅⋅ (1 − 𝑙 − 1 ) ⋅ 𝑙 𝑇 𝑂 = ෍ 𝑂 𝑂 𝑙=1  How is it related to query optimizations?  Think of query plans (or, globally different join trees) as balls!  The uniform distribution employed in the model may be invalid in practice.  We have more analysis for situations where underestimation or overestimation is dominant. (And more analysis could be done in the future.)

SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research - PowerPoint PPT Presentation

1 SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research Background 2 Query optimization remains challenging despite of decades of efforts and progresses. Cardinality estimation is the key challenge. Selectivity of

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

Sampling Methods Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 11 Sampling Rejection Sampling

Chapter 7. Sampling Chapter 7. Sampling methods? methods? Two types of sampling methods Two

Multiple importance sampling Slides for CS6630 lecture 6 sampling the BRDF sampling the

What is the strengths and weakness of these sampling methods? Sampling Strengths /

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Sampling Overview R toy sampling Non-probability sampling Probability Methods (AKA random)

Sampling Sediment and Sampling Sediment and Sampling Sediment and Porewater Sampling Sediment

Sampling Methods CMSC 678 UMBC Outline Recap Monte Carlo methods Sampling Techniques Uniform

Introduction Query Execution Engine Implements a set of physical operators 2 key

Query Optimization Through the Looking Glass Some Lessons From Building an LLVM-Based Query

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

A Generic Mapping-based Query Translation A Generic Mapping-based Query Translation from SPARQL

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

It's a magical world :-) Let's begin with incantations Abracadabra Aramaic phrase meaning

Complicated benefit eligibility rules. Added complexity for immigrant eligibility

IBHC for Refugees: A Qualitative Inquiry Jennifer J. Esala, Ph.D. Alison Beckman, M.S.W.,

Torture JUSTICE Student Human Rights Network conference 8 March 2008 Roger Smith Director

Height-Balanced Trees Doublets Milestone 1 due next Tuesday night Exams redux now and

Automated Bug Detection for Pointers and Memory Accesses in High-Level Synthesis Compilers Pietro

Broadening CS at the Entry Level Broadening CS at the Entry Level Interdisciplinary Science &

The Global Impact of New U.S. Hong Kong-Related Sanctions 30th September 2020 The Impact of New

SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research - PowerPoint PPT Presentation

1 SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research Background 2 Query optimization remains challenging despite of decades of efforts and progresses. Cardinality estimation is the key challenge. Selectivity of

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

Sampling Methods Oliver Schulte - CMPT 419/726 Bishop PRML Ch. 11 Sampling Rejection Sampling

Chapter 7. Sampling Chapter 7. Sampling methods? methods? Two types of sampling methods Two

Multiple importance sampling Slides for CS6630 lecture 6 sampling the BRDF sampling the

What is the strengths and weakness of these sampling methods? Sampling Strengths /

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Sampling Overview R toy sampling Non-probability sampling Probability Methods (AKA random)

Sampling Sediment and Sampling Sediment and Sampling Sediment and Porewater Sampling Sediment

Sampling Methods CMSC 678 UMBC Outline Recap Monte Carlo methods Sampling Techniques Uniform

Introduction Query Execution Engine Implements a set of physical operators 2 key

Query Optimization Through the Looking Glass Some Lessons From Building an LLVM-Based Query

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

A Generic Mapping-based Query Translation A Generic Mapping-based Query Translation from SPARQL

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

It's a magical world :-) Let's begin with incantations Abracadabra Aramaic phrase meaning

Complicated benefit eligibility rules. Added complexity for immigrant eligibility

IBHC for Refugees: A Qualitative Inquiry Jennifer J. Esala, Ph.D. Alison Beckman, M.S.W.,

Torture JUSTICE Student Human Rights Network conference 8 March 2008 Roger Smith Director

Height-Balanced Trees Doublets Milestone 1 due next Tuesday night Exams redux now and

Automated Bug Detection for Pointers and Memory Accesses in High-Level Synthesis Compilers Pietro

Broadening CS at the Entry Level Broadening CS at the Entry Level Interdisciplinary Science &amp;

The Global Impact of New U.S. Hong Kong-Related Sanctions 30th September 2020 The Impact of New

Broadening CS at the Entry Level Broadening CS at the Entry Level Interdisciplinary Science &