cardinality estimation done right index based join
play

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor - PowerPoint PPT Presentation

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann Technische Universitt Mnchen Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right:


  1. Cardinality Estimation Done Right: Index-Based Join Sampling Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann Technische Universität München Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  2. Join Ordering ◮ finding a good join order is arguably the most important problem in query optimization ◮ the costs of different join orders often vary by orders of magnitude ◮ to distinguish good plans from bad ones, the cost of plans must be estimated using cardinality estimates for intermediate results ◮ example: to find the optimal join order for A ⊲ ⊳ A . id = B . aid B ⊲ ⊳ B . cid = C . id C , one needs estimates for A ⊲ ⊳ B and B ⊲ ⊳ C Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  3. ← → The State of the (Industrial) Art in Cardinality Estimation ◮ in virtually all systems estimation is based on: 1. histograms and unique value counts 2. strong assumptions: uniformity, independence, inclusion ◮ results with real-world data are terrible (Join Order Benchmark [Leis et al., VLDB 2016]): Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  4. The State of the (Industrial) Art in Cardinality Estimation ◮ in virtually all systems estimation is based on: 1. histograms and unique value counts 2. strong assumptions: uniformity, independence, inclusion ◮ results with real-world data are terrible (Join Order Benchmark [Leis et al., VLDB 2016]): PostgreSQL ← underest. [log scale] overest. → 1e4 1e2 1 1e2 1e4 1e6 1e8 0 1 2 3 4 5 6 number of joins Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  5. The State of the (Industrial) Art in Cardinality Estimation ◮ in virtually all systems estimation is based on: 1. histograms and unique value counts 2. strong assumptions: uniformity, independence, inclusion ◮ results with real-world data are terrible (Join Order Benchmark [Leis et al., VLDB 2016]): PostgreSQL PostgreSQL normalized plan cost [log scale] ← underest. [log scale] overest. → 1e4 30 1e2 1 10 1e2 1e4 3 1e6 1e8 1 0 1 2 3 4 5 6 0% 25% 50% 75%100% number of joins percentile Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  6. Sampling to the Rescue? ◮ sampling, which does not rely on strong assumptions, is a highly promising alternative ◮ some systems (e.g., HyPer) use sampling to estimate base table selections: ◮ keep random samples for each table (e.g., 10,000 rows) and execute selection on sample ◮ produces accurate estimates for arbitrary predicates (correlations etc. are not a problem) ◮ How to use sampling for joins, which are the main source of errors? Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  7. Using Samples for Joins We could pretend that samples are base tables and compute the join result ◮ would allows use to estimate the result size ◮ could become expensive, too (unlikely, but possible) ◮ usually, joining samples will produce small/empty results ◮ need a mechanism to sample the real join result ◮ but we cannot sample everything ◮ number of join candidates is exponential Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  8. Related (Sampling-Based) Work ◮ CS2 [Yu et al., SIGMOD 2013] ◮ pre-materialize correlated samples to avoid joining independent samples ◮ works well for star queries, but it is unknown how to apply this idea automatically for arbitrary queries ◮ ROX [Kader et al., SIGMOD 2009] ◮ greedy heuristics that uses sampling through indexes to make more informed decisions ◮ does not enumerate all join orders ◮ Sampling-Based Re-Optimization [Wu et al., SIGMOD 2016] ◮ get plan from traditional optimizer ◮ repeat until plan does not change: execute plan using 5% samples of each table ◮ high overhead (large samples), avoids some bad plans but often misses optimal plan (no systematic exploration) ◮ sampling-based approaches proposed so far have weaknesses that preclude their use in industrial-strength systems Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  9. Index-based Join Sampling: Main Ideas 1. use existing index structures and fixed-size samples (1000) to get samples for larger intermediate results 2. systematically explore intermediate results in a bottom-up fashion (2-way joins, 3-way joins, . . . ) 3. inject cardinalities computed in step 2 and run exhaustive join ordering algorithm (e.g., dynamic programming) Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  10. 1. Cheap Sampling Using Indexes ◮ given a sample S obtain a sample for S ⊲ ⊳ S . id = A . id A using an existing index on A . id : 1. count the number of join partners for each tuple in S 2. draw tuples at random to obtain desired number of results S sample of T count per tuple ( cpt ) sum : 15 6 1 3 5 S A B (not materialized) Sout sample of sid : {2,4,8,13} 2 4 8 13 T B A chosen : 0 chosen : 0 chosen : 2 chosen : 3 o ff set : 2 o ff set : 4 o ff set : 1 o ff set : 3 ◮ makes each sampling step cheap by avoiding “exploding” intermediate result sizes ◮ O ( | S | ) or O ( | S | log | A | ), | S | ≤ 1000 Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  11. 2. Systematic Bottom-Up Exploration ◮ generate samples for base tables ◮ join using existing indexes obtaining results for all 2-way joins, only then proceed to 3-way joins ◮ avoids “fleeing to ignorace” of the optimizer ◮ we need only one estimate per equivalence class ◮ stop early if sampling budget (e.g., 100K lookups) runs out ◮ budget is a parameter that determines how much time is spent in the additional sampling phase Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  12. 3. Join Ordering ◮ inject cardinalities from step 2 into traditional optimizer ◮ fall back to traditional estimation if a result could not be estimated using sampling (due to missing indexes, a too small budget, or very high selectivities) ◮ run traditional (typically exhaustive) join enumeration algorithm, which now has much more accurate information ◮ execute resulting plan Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  13. Does Sampling Improve Estimation? PostgreSQL index-based (10K) index-based (100K) index-based (no budget) 1e4 ← underestimation [log scale] overest. → 1e2 1 1e2 1e4 95th percentile 1e6 75th percentile median 25th percentile 1e8 5th percentile 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 0 1 2 3 4 5 6 number of joins Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  14. How Expensive is Sampling? sampling-based re-optimization index-based sampling (100K) index-based sampling (10K) index-based sampling (no budget) sampling time [ms] [log scale] 1000 100 10 4 8 12 16 number of joins Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  15. Does Sampling Improve Plan Quality? PostgreSQL normalized plan cost [log scale] 30 sampling-based re-opt. index-based samp. (10K) 10 index-based samp. (100K) index-based samp. (no budget) 3 1 0.3 0% 25% 50% 75% 100% percentile Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

  16. Summary ◮ index-based join sampling is an effective approach for cardinality estimation in main-memory database systems ◮ considerably improves estimation and plan quality ◮ low and configurable overhead ◮ easy to integrate into existing systems ◮ also possible as an optional phase (e.g., triggered by the user) for hard, long-running queries Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend