SAMPLING-BASED QUERY RE- OPTIMIZATION
Wentao Wu Microsoft Research
1
SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research - - PowerPoint PPT Presentation
1 SAMPLING-BASED QUERY RE- OPTIMIZATION Wentao Wu Microsoft Research Background 2 Query optimization remains challenging despite of decades of efforts and progresses. Cardinality estimation is the key challenge. Selectivity of
1
2
Query optimization remains challenging despite of
Cardinality estimation is the key challenge.
Selectivity of join predicates Correlation of columns
3
Single-column histograms cannot capture data
Use the attribute-value-independence (AVI) assumption.
Sampling is better than histograms on capturing
We run query over exact rather than summarized data.
4
The overhead is much smaller, compared with other
Sampling incurs additional overhead and should be
A naïve idea: use sampling for all plans considered by
5
Merge Join Hash Join A B C
Hash Join Nested Loop A C B
6
Use sampling as a post-processing validation step.
Detect cardinality estimation errors for the final plan
Re-optimize the query if cardinality estimation errors
7
Sampling-based Validation
Query Optimizer
Update Cardinalities
8
Example: 𝑟 = 𝐵 ⋈ 𝐶 ⋈ 𝐷
Join Cardinality 𝐵 ⋈ 𝐶 100 𝐶 ⋈ 𝐷 300 𝐵 ⋈ 𝐷 500
Update
Query Optimizer
Sampling-based Validation
𝐵 ⋈ 𝐶: 1000
9
The worst-case expected number of iterations:
𝑇𝑂 ∼ 𝑃( 𝑂).
𝑇𝑂 =
𝑙=1 𝑂
𝑙 ⋅ (1 − 1 𝑂) ⋅⋅⋅ (1 − 𝑙 − 1 𝑂 ) ⋅ 𝑙 𝑂
If sampling-based cost estimates are consistent with
However, cost models are imperfect, and cardinality
See experimental results.
10
11
We implemented the re-optimization procedure in
We have two goals:
Test the approach for “common” cases. Test the approach for “corner” cases.
12
“Common” cases
10GB TPC-H benchmark
“Corner” cases
(Homegrown) Optimizer “Torture Test” (OTT)
13
Results on the 10GB TPC-H database
14
Results of the “torture test” (5-join queries, log-scale)
15
More details about OTT:
K tables R1, …, RK, with Rk(Ak, Bk) Each Rk is generated independently, with Bk = Ak. Ak (and thus Bk) is uniformly distributed. The queries look like:
𝐵1=𝑑1∧⋅⋅⋅∧𝐵𝐿=𝑑𝐿∧𝐶1=𝐶2∧⋅⋅⋅∧𝐶𝐿−1=𝐶𝐿(𝑆1 ×⋅⋅⋅× 𝑆𝐿)
16
An instance of OTT used in our experiments:
Use 6 TPC-H tables (excluding “nation” and “region”). Use a set of empty queries with non-empty sub-queries.
Non-empty Empty!
17
Query Optimizer Sampling-based Validation
Improved Query Plan
18
Thank you
19
Histograms
Single-column histograms (dominant in current DBMS) Multi-column histograms
Other methods
Offline approaches: sampling, sketch, graphical models Online approaches: dynamic query plans, parametric
20
Estimate the selectivity 𝜍𝑟 of a join query 𝑟 = 𝑆1 ⋈ 𝑆2.
[Haas et al., J. Comput. Syst. Sci. 1996] 𝑠
11
𝑠
12
…… 𝑠
1𝑂1
𝑠
21
𝑠
22
…… 𝑠
2𝑂2
2
1
𝜍𝑟 is unbiased and strongly consistent. Do a “cross product” over the samples: 𝜍 𝑗, 𝑘 = 0 𝑝𝑠 1. 𝑠
11
𝑠
21
𝑠
2𝑂2
𝑠
11
𝑠
1𝑂1
𝑠
21
𝑠
2𝑂2
𝑠
1𝑂1
𝜍(1, 1) 𝜍(1, 𝑂2) 𝜍(𝑂1, 1) 𝜍(𝑂1, 𝑂2) ො 𝜍𝑟 = σ𝑗,𝑘 𝜍(𝑗, 𝑘) 𝑂1𝑂2 |𝑆𝑡1 ⋈ 𝑆𝑡2| |𝑆𝑡1| × |𝑆𝑡2|
21
Sampling-Based Estimation of the Number of Distinct
Towards Estimation Error Guarantees for Distinct Values,
End-biased Samples for Join Cardinality Estimation,
Join Size Estimation Subject to Filter Conditions,
Convergence Condition of Re-optimization
22
The previous convergence condition is sufficient but
Re-optimization could terminate even before it meets the
23
Local transformation of query plans
24
The three possible cases in re-optimization:
25
(1) It terminates in two steps with P2 = P1. (2) It terminates in n + 1 steps (n > 1) where all plan
(3) It terminates in n + 1 steps (n > 1) where only the last
An illustration of Case (2) and (3):
26
A probabilistic model for analysis of expected
We have N balls in a queue, initially unmarked.
27
The probability that the ball will be inserted at any position
The expected number of steps of the previous procedure is: How is it related to query optimizations? Think of query plans (or, globally different join trees) as balls! The uniform distribution employed in the model may be
We have more analysis for situations where underestimation or
the future.)
28
𝑇𝑂 =
𝑙=1 𝑂
𝑙 ⋅ (1 − 1 𝑂) ⋅⋅⋅ (1 − 𝑙 − 1 𝑂 ) ⋅ 𝑙 𝑂