Cardinality Estimation Done Right: Index-Based Join Sampling Viktor - - PowerPoint PPT Presentation

cardinality estimation done right index based join
SMART_READER_LITE
LIVE PREVIEW

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor - - PowerPoint PPT Presentation

Cardinality Estimation Done Right: Index-Based Join Sampling Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann Technische Universitt Mnchen Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right:


slide-1
SLIDE 1

Cardinality Estimation Done Right: Index-Based Join Sampling

Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, Thomas Neumann

Technische Universität München

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-2
SLIDE 2

Join Ordering

◮ finding a good join order is arguably the most important

problem in query optimization

◮ the costs of different join orders often vary by orders of

magnitude

◮ to distinguish good plans from bad ones, the cost of plans

must be estimated using cardinality estimates for intermediate results

◮ example: to find the optimal join order for

A ⊲ ⊳A.id=B.aid B ⊲ ⊳B.cid=C.id C, one needs estimates for A ⊲ ⊳ B and B ⊲ ⊳ C

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-3
SLIDE 3

The State of the (Industrial) Art in Cardinality Estimation

◮ in virtually all systems estimation is based on:

  • 1. histograms and unique value counts
  • 2. strong assumptions: uniformity, independence, inclusion

◮ results with real-world data are terrible (Join Order

Benchmark [Leis et al., VLDB 2016]):

← →

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-4
SLIDE 4

The State of the (Industrial) Art in Cardinality Estimation

◮ in virtually all systems estimation is based on:

  • 1. histograms and unique value counts
  • 2. strong assumptions: uniformity, independence, inclusion

◮ results with real-world data are terrible (Join Order

Benchmark [Leis et al., VLDB 2016]):

PostgreSQL 0 1 2 3 4 5 6 1e8 1e6 1e4 1e2 1 1e2 1e4 number of joins ← underest. [log scale] overest. →

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-5
SLIDE 5

The State of the (Industrial) Art in Cardinality Estimation

◮ in virtually all systems estimation is based on:

  • 1. histograms and unique value counts
  • 2. strong assumptions: uniformity, independence, inclusion

◮ results with real-world data are terrible (Join Order

Benchmark [Leis et al., VLDB 2016]):

PostgreSQL 0 1 2 3 4 5 6 1e8 1e6 1e4 1e2 1 1e2 1e4 number of joins ← underest. [log scale] overest. → PostgreSQL 0% 25% 50% 75%100% 1 3 10 30 percentile normalized plan cost [log scale]

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-6
SLIDE 6

Sampling to the Rescue?

◮ sampling, which does not rely on strong assumptions, is a

highly promising alternative

◮ some systems (e.g., HyPer) use sampling to estimate base

table selections:

◮ keep random samples for each table (e.g., 10,000 rows) and

execute selection on sample

◮ produces accurate estimates for arbitrary predicates

(correlations etc. are not a problem)

◮ How to use sampling for joins, which are the main source of

errors?

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-7
SLIDE 7

Using Samples for Joins

We could pretend that samples are base tables and compute the join result

◮ would allows use to estimate the result size ◮ could become expensive, too (unlikely, but possible) ◮ usually, joining samples will produce small/empty results ◮ need a mechanism to sample the real join result ◮ but we cannot sample everything ◮ number of join candidates is exponential

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-8
SLIDE 8

Related (Sampling-Based) Work

◮ CS2 [Yu et al., SIGMOD 2013]

◮ pre-materialize correlated samples to avoid joining independent

samples

◮ works well for star queries, but it is unknown how to apply this

idea automatically for arbitrary queries

◮ ROX [Kader et al., SIGMOD 2009]

◮ greedy heuristics that uses sampling through indexes to make

more informed decisions

◮ does not enumerate all join orders

◮ Sampling-Based Re-Optimization [Wu et al., SIGMOD 2016]

◮ get plan from traditional optimizer ◮ repeat until plan does not change: execute plan using 5%

samples of each table

◮ high overhead (large samples), avoids some bad plans but

  • ften misses optimal plan (no systematic exploration)

◮ sampling-based approaches proposed so far have weaknesses

that preclude their use in industrial-strength systems

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-9
SLIDE 9

Index-based Join Sampling: Main Ideas

  • 1. use existing index structures and fixed-size samples (1000) to

get samples for larger intermediate results

  • 2. systematically explore intermediate results in a bottom-up

fashion (2-way joins, 3-way joins, . . . )

  • 3. inject cardinalities computed in step 2 and run exhaustive join
  • rdering algorithm (e.g., dynamic programming)

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-10
SLIDE 10
  • 1. Cheap Sampling Using Indexes

◮ given a sample S obtain a sample for S ⊲

⊳S.id=A.id A using an existing index on A.id:

  • 1. count the number of join partners for each tuple in S
  • 2. draw tuples at random to obtain desired number of results

S

6 count per tuple (cpt) sum: 15 (not materialized)

Sout

chosen: 0

  • ffset: 2

chosen: 0

  • ffset: 4

chosen: 2

  • ffset: 1

chosen: 3

  • ffset: 3

2 4 8 13

1 3 5

sid: {2,4,8,13}

sample of sample of B T A T

B S A

◮ makes each sampling step cheap by avoiding “exploding”

intermediate result sizes

◮ O(|S|) or O(|S| log |A|), |S| ≤ 1000

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-11
SLIDE 11
  • 2. Systematic Bottom-Up Exploration

◮ generate samples for base tables ◮ join using existing indexes obtaining results for all 2-way joins,

  • nly then proceed to 3-way joins

◮ avoids “fleeing to ignorace” of the optimizer ◮ we need only one estimate per equivalence class ◮ stop early if sampling budget (e.g., 100K lookups) runs out

◮ budget is a parameter that determines how much time is spent

in the additional sampling phase

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-12
SLIDE 12
  • 3. Join Ordering

◮ inject cardinalities from step 2 into traditional optimizer

◮ fall back to traditional estimation if a result could not be

estimated using sampling (due to missing indexes, a too small budget, or very high selectivities)

◮ run traditional (typically exhaustive) join enumeration

algorithm, which now has much more accurate information

◮ execute resulting plan

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-13
SLIDE 13

Does Sampling Improve Estimation?

index-based (10K) index-based (100K) index-based (no budget) PostgreSQL 1e8 1e6 1e4 1e2 1 1e2 1e4 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 number of joins ← underestimation [log scale] overest. →

5th percentile 25th percentile median 75th percentile 95th percentile

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-14
SLIDE 14

How Expensive is Sampling?

10 100 1000 4 8 12 16 number of joins sampling time [ms] [log scale]

sampling-based re-optimization index-based sampling (10K) index-based sampling (100K) index-based sampling (no budget)

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-15
SLIDE 15

Does Sampling Improve Plan Quality?

0.3 1 3 10 30 0% 25% 50% 75% 100% percentile normalized plan cost [log scale]

PostgreSQL sampling-based re-opt. index-based samp. (10K) index-based samp. (100K) index-based samp. (no budget)

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling

slide-16
SLIDE 16

Summary

◮ index-based join sampling is an effective approach for

cardinality estimation in main-memory database systems

◮ considerably improves estimation and plan quality ◮ low and configurable overhead ◮ easy to integrate into existing systems ◮ also possible as an optional phase (e.g., triggered by the user)

for hard, long-running queries

Leis, Radke, Gubichev, Kemper, Neumann Cardinality Estimation Done Right: Index-Based Join Sampling