Independence Assumptions Kostas Tzoumas, Amol Deshpande, Christian - - PowerPoint PPT Presentation

independence assumptions
SMART_READER_LITE
LIVE PREVIEW

Independence Assumptions Kostas Tzoumas, Amol Deshpande, Christian - - PowerPoint PPT Presentation

Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions Kostas Tzoumas, Amol Deshpande, Christian S. Jensen Presented by Guozhang Wang DB Lunch, Nov 23rd, 2011 Lightweight Graphical Models for Selectivity


slide-1
SLIDE 1

Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions

Kostas Tzoumas, Amol Deshpande, Christian S. Jensen Presented by Guozhang Wang DB Lunch, Nov 23rd, 2011

slide-2
SLIDE 2

Lightweight Graphical Models for Selectivity Estimation Without With Little Independence Assumptions

Kostas Tzoumas, Amol Deshpande, Christian S. Jensen Presented by Guozhang Wang DB Lunch, Nov 23rd, 2011

slide-3
SLIDE 3

Motivation: Query Optimization

 The “best” join plan

  • Cost = # intermediate tuples along the path

 Errors in estimates lead to wrong plan

slide-4
SLIDE 4

Selectivity Estimation

𝑀𝑃 = 𝑀 ∗ 𝑃 ∗ 𝑄𝑠⁡ (⁡𝑚. 𝑝𝑠𝑒𝑓𝑠𝑙𝑓𝑧 = 𝑝. 𝑝𝑠𝑒𝑓𝑠𝑙𝑓𝑧⁡ 𝑝. 𝑢𝑝𝑢𝑏𝑚𝑞𝑠𝑗𝑑𝑓⁡ 𝜗⁡ 𝑢1, 𝑢2 𝑚. 𝑓𝑦𝑢𝑓𝑜𝑒𝑓𝑒𝑞𝑠𝑗𝑑𝑓⁡𝜗⁡ 𝑓1, 𝑓2 )

 Equal to distribution estimation

  • Estimation based on histograms
  • How complicated histograms we need?
slide-5
SLIDE 5

Correlations Matter in Estimation

 Need multi-dim. histograms to capture

correlation between attributes

slide-6
SLIDE 6

Idea #1: Full Independence

 Assume attributes are mutually indept.

  • Only need 1-D histograms, one for each

attribute

 Estimates done by multiplication

𝑄𝑠 ⁡𝑚. 𝑝𝑙 = 𝑝. 𝑝𝑙⁡ 𝑝. 𝑢𝑞⁡ 𝜗⁡ 𝑢1, 𝑢2 𝑚. 𝑓𝑞⁡𝜗⁡ 𝑓1, 𝑓2 = 𝑄𝑠 ⁡𝑚. 𝑝𝑙 = 𝑝. 𝑝𝑙⁡ * 𝑄𝑠 ⁡𝑝. 𝑢𝑞⁡𝜗⁡ 𝑢1, 𝑢2 ⁡ * 𝑄𝑠 ⁡𝑚. 𝑓𝑞⁡𝜗⁡ 𝑓1, 𝑓2 ⁡

Possible Big Error!

slide-7
SLIDE 7

Idea #2: No Independence

 Any subsets of attributes could be

correlated

  • Construct one n-Dim histogram, n = # total

attributes in the database

 Estimates done by marginalization

𝑄𝑠 ⁡𝑚. 𝑝𝑙 = 𝑝. 𝑝𝑙⁡ 𝑝. 𝑢𝑞⁡ 𝜗⁡ 𝑢1, 𝑢2 𝑚. 𝑓𝑞⁡𝜗⁡ 𝑓1, 𝑓2 =⁡ 𝑄𝑠⁡ (. . )

!𝑚.𝑝𝑙…

Storage Blowup!

slide-8
SLIDE 8

Idea #3: Cond. Independence

[SIGMOD’01]

 Capture correlation in a Bayes network

  • BN model is constructed at start
  • Estimates done by computing joint dist’n

Model Constructor Database

  • ffline

Selectivity Estimator

execution time

Query Q Size(Q)

slide-9
SLIDE 9

Bayesian Network

 Each node x has a conditional probability

distribution P(x |Pa(x))

 Encodes independence in directed graph

Job Done Ride to Work Happy Good Lunch Sunny Day

0.8 f 0.2 t f f 0.6 0.4 0.01 0.99 0.2 0.8 t t t f G S P(R|S, G )

slide-10
SLIDE 10

BN Construct

 Nodes

  • Each attribute: a
  • Each foreign key: join indicator Jf

 Edges

  • Find the model that maximize log-likelihood

given data (greedy local structure search)

  • Parameters estimation after the structure is

decided

slide-11
SLIDE 11

Selectivity Estimation in BN

 Extend the query to include all the

involved nodes’ parents

 Multiplication along the graph to get the

joint distribution

select * from P , S where P.Strain = Strain-ID ..

slide-12
SLIDE 12
  • Cons. of BN

 Model is still too complicated…

  • Construction is expensive
  • Selectivity estimation is expensive

Model Constructor Database

  • ffline

Selectivity Estimator

execution time

Query Q Size(Q)

slide-13
SLIDE 13

The Tradeoff Spectrum

Full Independence No Independence Cond. Independence

 Red: better efficiency, worse accuracy  Blue: better accuracy, worse efficiency

?

slide-14
SLIDE 14

Idea #4: Constraint BN Dep.

 Further restrict the structure of BN:

  • Acyclic [SIGMOD’01]
  • Fixed structure [this paper]

 Challenge: how to choose the fixed

structure to get the good tradeoff in the spectrum?

  • Model simple enough for efficient algorithms
  • Model still capture important correlations
slide-15
SLIDE 15

Fixed Structure BN

 Within a table

  • Attributes have at most one parent: tree

structure

 Across a table

  • Joint indicators have at most two parents
  • No other cross-table edges

 3D histograms only, scalable construction

slide-16
SLIDE 16

Fixed Structure BN Construct

  • sssssssssssssssssssssssssssssssssssssssssssssssss

ssssssssssssssssssssssssss

 Edges

  • Within a table: maximum spanning tree
  • Across tables: best parent from each table
  • Weights based on mutual information

 Nodes

  • Each attribute a
  • Each join indicator Jf

based on workload

slide-17
SLIDE 17

Selectivity Est. in Fixed BN

 Transform BN into a junction tree

  • Moralization
slide-18
SLIDE 18

Selectivity Est. in Fixed BN

 Transform BN into a junction tree

slide-19
SLIDE 19

Selectivity Est. in Fixed BN

 Transform BN into a junction tree

slide-20
SLIDE 20

Selectivity Est. in Fixed BN

 Transform BN into a junction tree

slide-21
SLIDE 21

Selectivity Est. in Fixed BN

 Extract the subtree related to the query

select c_name, c_address from lineitem,orders,customer where l_orderkey=o_orderkey and

  • _custkey=c_custkey and

l_sdate<=“25/7/2011” and c_acctbal<=200000

slide-22
SLIDE 22

Selectivity Est. in Fixed BN

 Tree algorithms for joint distribution

  • Sum-product
  • DP

select c_name, c_address from lineitem,orders,customer where l_orderkey=o_orderkey and

  • _custkey=c_custkey and

l_sdate<=“25/7/2011” and c_acctbal<=200000

slide-23
SLIDE 23

Experiments

 Implementation

  • Model construction outside DBMS, use

queries to get distributions

  • Stores junction tree as tables in catalog
  • Replace selectivity estimation procedure

 Compare with PostgresSQL

slide-24
SLIDE 24

Efficiency and Accuracy

Execution & optimization times Cost of plans (intermediate tuples)

slide-25
SLIDE 25

Scalability

Error: max(real,estimate) / min(real,estimate)

slide-26
SLIDE 26

Conclusion

 Inaccurate selectivity estimation leads to

bad plan

 Fixed structure BN is a good tradeoff

between efficiency and accuracy (?) Thank Y hank You!

  • u!