

SLIDE 1

Statistical Learning Techniques for Costing XML Queries

Ning Zhang¹, Peter J. Haas², Vanja Josifovski², Guy M. Lohman², Chun Zhang²

¹University of Waterloo  ²IBM Almaden Research Center

VLDB 2005

1 Ning Zhang


SLIDE 3

COMET: A New Cost-Modeling Approach

Traditional analytical approach:

  • Identify features (# cache misses, selectivity, …) from catalog statistics and the production query
  • Collect statistics (RUNSTATS) and estimate feature values
  • Develop an analytical cost model, then apply the cost function to get a cost estimate

Example analytical model:

  cost = selectivity * hash_cost(CPU_speed), where selectivity = |R| / ColCard

COMET approach: replace the hand-derived analytical model with a cost model learned from training queries:

  cost = f̂(v̂1, . . . , v̂n)

where f̂ is the learned cost function and v̂1, . . . , v̂n are the estimated feature values.
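The contrast between the hand-derived analytical formula and the learned cost model can be sketched in Python. A one-feature least-squares fit stands in for COMET's actual learning algorithm, and the training data (selectivity, measured cost) are invented for illustration.

```python
# Sketch of COMET's core idea: instead of a hand-derived formula like
#   cost = selectivity * hash_cost(CPU_speed),
# learn cost = f(v1, ..., vn) from training queries. A one-feature
# least-squares fit stands in for the real learning algorithm.

def fit_linear(features, costs):
    """Fit cost ~ w0 + w1 * feature by ordinary least squares."""
    n = len(features)
    mean_v = sum(features) / n
    mean_c = sum(costs) / n
    cov = sum((v - mean_v) * (c - mean_c) for v, c in zip(features, costs))
    var = sum((v - mean_v) ** 2 for v in features)
    w1 = cov / var
    w0 = mean_c - w1 * mean_v
    return lambda v: w0 + w1 * v

# Invented training data: (estimated selectivity, measured cost in msec).
train_selectivity = [0.1, 0.2, 0.4, 0.8]
train_cost = [12.0, 21.0, 41.0, 79.0]

cost_model = fit_linear(train_selectivity, train_cost)
print(round(cost_model(0.5), 1))
```

At optimization time the learned `cost_model` is applied to estimated feature values, exactly as the analytical cost function would have been.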


SLIDE 5

Advantages of COMET Approach

Can handle complex operators using statistical learning

  • Operators not decomposable into simple scans, joins, etc.
  • Operators with highly non-sequential data access patterns
  • Used successfully to cost UDFs, remote DB systems

(Lee et al. 2004, He et al. 2004, Rahal et al. 2004)

Simplifies cost-model development

  • Reduces need for painstaking code analysis used in analytical modeling
  • Easier to incorporate new operators into optimizer
  • Helps avoid brittle simplifying assumptions
  • Avoids need to explicitly incorporate HW parameters


SLIDE 6

COMET Permits Optimizer to be Self Tuning

[Diagram: the Optimizer sends execution plans to the Execution Engine; operator calls made while running training queries and user queries generate training data for COMET, which returns an updated cost model to the Optimizer.]


SLIDE 7

Our Motivation: XML Query Optimization

Query q1:

<bib> {
  for $b in doc("bib.xml")/bib/book
  where $b/authors//last = "Stevens" and $b/@year > 1991
  return <book> { $b/title } </book>
} </bib>

Need to cost candidate execution plans:

1. Navigational plan:

  • navigate the bib.xml tree
  • check predicates for each book

2. Value-based index plan:

  • find elements with "Stevens" or "1991" using a value-based index
  • navigate up to book and check remaining conditions

3. Structure-based index plan:

  • look up matching tree structures using a path/twig index
  • check predicates for each book
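As a rough illustration of plan 1, the navigational strategy can be sketched with Python's ElementTree standing in for a paged XML store; the sample document below is invented.

```python
# Minimal sketch of the navigational plan for query q1: walk bib/book,
# checking both predicates for each book. ElementTree stands in for a
# paged XML store; the sample document is invented.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title>
    <authors><author><last>Stevens</last></author></authors></book>
  <book year="1990"><title>Old Book</title>
    <authors><author><last>Stevens</last></author></authors></book>
</bib>""")

result = []
for b in doc.findall("book"):
    # $b/authors//last = "Stevens"  (iter() covers the //-axis descent)
    has_stevens = any(last.text == "Stevens" for last in b.iter("last"))
    # $b/@year > 1991
    if has_stevens and int(b.get("year")) > 1991:
        result.append(b.find("title").text)

print(result)
```

Only the 1994 book satisfies both predicates, so `result` holds its title alone.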



SLIDE 9

Today’s Talk: Application of COMET Approach to an XML Operator

XML operator to be modeled:

  • XNAV operator (complex and dynamic, so hard to model)
  • Adaptation of TurboXPath (Josifovski et al. 2005)
  • Will model CPU costs (nontrivial component of overall cost)
  • prior work has focused primarily on cardinality estimation

Nontrivial steps in applying COMET methodology:

Step 1: Identify XNAV features
Step 2: Determine statistics for estimating feature values
Step 3: Determine formulas for feature-value estimation
Step 4: Identify appropriate statistical learning algorithm for fitting cost model


SLIDE 10

XNAV: A Complex XML Navigational Operator

What is XNAV?

  • XNAV: XPath(XML tree) → list of matching XML nodes
  • XNAV is complex:
  • equivalent to a non-decomposable N-way join
  • data stored as a paged tree

High-level description of XNAV algorithm:

  • XNAV traverses the XML tree in a single pass, with possible skipping of nodes
  • XNAV maintains internal states and buffers for matching the query tree during the traversal



SLIDE 13

Step 1: Identifying XNAV Features

Basis for feature identification

  • Knowledge of XNAV algorithm (involves human interaction)
  • Trial and error experimentation (with cross-validation)

Learning algorithm automatically removes redundant features

  • Just need to find “at least enough” features

Some features for XNAV:

  • #visits : # of XML nodes actually traversed
  • #page requests : # of pages read
  • . . . more features given in the paper



SLIDE 15

Step 2: Novel Statistics for Estimating Features

How to choose statistics ?

  • “As simple as possible, but not simpler”
  • Easy to collect and maintain, less error-prone
  • Need to balance space and time requirements
  • Storing redundant stats can speed up feature-value estimation

Example — Simple Path (SP) Statistics

  • cardinality: |p|, where p is a "simple" path (no branching, no wildcards, etc.)
  • children and descendant cardinality: |p/∗| and |p//∗|
  • page cardinality: the number of distinct pages occupied by nodes matching p
  • . . . more in the paper

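A minimal sketch of gathering SP statistics in a single pass over the data: for each simple path p, count matching nodes (|p|), their children (|p/∗|), and their descendants (|p//∗|). ElementTree stands in for the paged store and the document is invented.

```python
# One-pass collection of Simple Path (SP) statistics over an XML tree.
import xml.etree.ElementTree as ET
from collections import defaultdict

doc = ET.fromstring(
    "<bib><book><title/><authors><author/><author/></authors></book>"
    "<book><title/></book></bib>")

card = defaultdict(int)        # |p|    : nodes reachable by simple path p
child_card = defaultdict(int)  # |p/*|  : children of nodes on p
desc_card = defaultdict(int)   # |p//*| : descendants of nodes on p

def walk(node, path):
    card[path] += 1
    children = list(node)
    child_card[path] += len(children)
    desc_card[path] += sum(1 for _ in node.iter()) - 1  # iter() includes node itself
    for c in children:
        walk(c, path + "/" + c.tag)

walk(doc, "/bib")
print(card["/bib/book"], child_card["/bib/book"], desc_card["/bib/book"])
```

For the tiny document above there are 2 book nodes with 3 children and 5 descendants in total.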


SLIDE 17

Step 3: Feature-Value Estimation Using Stats

Can estimate all needed feature values using SP stats

  • Analysis required, but much easier than analyzing entire

XNAV operator

  • See paper for detailed formulas (algorithms)
  • Formulas tend to overestimate feature values, but COMET automatically compensates for bias (see below)

Example

  • #visits = Σ_{p∈S} |p/∗| + Σ_{q∈C} |q//∗|

where S is the set of root-to-non-leaf simple paths in the query tree whose next step is connected by a /-axis, and C is the set of root-to-non-leaf simple paths whose next step is connected by a //-axis

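The #visits formula translates directly into code: sum |p/∗| over the /-connected paths S and |q//∗| over the //-connected paths C. The statistic values and paths below are invented for illustration.

```python
# Direct transcription of the #visits estimate from SP statistics:
#   #visits = sum_{p in S} |p/*|  +  sum_{q in C} |q//*|

def estimate_visits(S, C, child_card, desc_card):
    """S: simple paths whose next step uses /; C: paths whose next step uses //."""
    return sum(child_card[p] for p in S) + sum(desc_card[q] for q in C)

# Invented SP statistics.
child_card = {"/bib": 100, "/bib/book": 450}   # |p/*|
desc_card = {"/bib/book/authors": 900}         # |q//*|

S = ["/bib", "/bib/book"]
C = ["/bib/book/authors"]
print(estimate_visits(S, C, child_card, desc_card))  # 100 + 450 + 900 = 1450
```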

SLIDE 18

Step 4: Fitting The Cost Model

Use Transform Regression (Pednault 2004)

  • “Linear regression on steroids”
  • Handles discontinuities and nonlinearities in cost function
  • Fully automated (no statistician needed) and highly efficient
  • Seamlessly handles both numerical and categorical features

Uses a 1-level linear regression tree (LRT) to "linearize" each feature:

  w_j = h_j(v_j)

[Figure: cost vs. feature value v_j with LRT-based partitions; cost vs. linearized feature w_j falls along the 45° line]
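Under one plausible reading of the 1-level LRT (a single split point, a least-squares line fit on each side, the tree's prediction used as the linearized feature w_j = h_j(v_j)), the linearization can be sketched as follows; the piecewise-linear cost data is invented.

```python
# "Linearize" a feature with a 1-level linear regression tree: choose the
# split minimizing squared error, fit a line per side, and use the tree's
# prediction h(v) as the new feature w = h(v), so cost vs. w hugs the
# 45-degree line. A sketch under stated assumptions; data invented.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return (0.0, my)
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return (w1, my - w1 * mx)

def sse(xs, ys, line):
    w1, w0 = line
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

def fit_lrt(xs, ys):
    """1-level linear regression tree: one split, one line per side."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        lx = [xs[i] for i in order[:k]]; ly = [ys[i] for i in order[:k]]
        rx = [xs[i] for i in order[k:]]; ry = [ys[i] for i in order[k:]]
        ll, rl = fit_line(lx, ly), fit_line(rx, ry)
        err = sse(lx, ly, ll) + sse(rx, ry, rl)
        if best is None or err < best[0]:
            best = (err, (lx[-1] + rx[0]) / 2, ll, rl)
    _, split, ll, rl = best
    def h(v):  # linearized feature w = h(v)
        w1, w0 = ll if v <= split else rl
        return w0 + w1 * v
    return h

# Cost whose slope changes at v = 3 (invented data).
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [0, 1, 2, 3, 13, 23, 33]
h = fit_lrt(xs, ys)
print(round(h(1), 6), round(h(5), 6))
```

Both segments are fit exactly here, so the linearized feature reproduces the cost on each side of the split.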


SLIDE 22

Step 4: Transform Regression—Continued

Uses multivariate linear regression on linearized features

  • Greedy forward stepwise-regression
  • Handles redundant features (multicollinearity)

Uses “gradient boosting” to capture feature interactions

  • First-order model: models the cost
  • ith-order model: models the error in (i − 1)st-order model

Uses other tricks to speed up convergence and improve the fit

  • See paper for details

Model learned from estimated feature values

  • So COMET is robust to systematic bias in feature-value estimation

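The boosting scheme above can be sketched as follows. Each stage here is a simple one-feature linear fit (a stand-in for the real per-stage models), the first stage models the cost, and every later stage models the previous stage's error; the data, with a v1·v2 interaction, is invented.

```python
# Sketch of gradient boosting as used in transform regression: the
# first-order model fits the cost, the ith-order model fits the error
# of the (i-1)st-order model, and predictions are summed. Each stage is
# a one-feature least-squares fit (features alternate). Data invented.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var) if var else 0.0
    w0 = my - w1 * mx
    return lambda x: w0 + w1 * x

def boost(rows, costs, n_stages):
    """rows: feature vectors; stage i regresses the residual on feature i mod n."""
    n_feat = len(rows[0])
    stages, resid = [], list(costs)
    for i in range(n_stages):
        j = i % n_feat
        model = fit_linear([r[j] for r in rows], resid)
        stages.append((j, model))
        resid = [e - model(r[j]) for r, e in zip(rows, resid)]
    predict = lambda row: sum(m(row[j]) for j, m in stages)
    return predict, sum(e * e for e in resid)   # model + training SSE

rows = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)]
costs = [1.0, 2.0, 2.0, 4.0, 9.0]              # cost = v1 * v2 (interaction)
_, err2 = boost(rows, costs, 2)
_, err6 = boost(rows, costs, 6)
print(err6 < err2)  # True: later stages keep reducing training error
```

No single linear stage can capture the v1·v2 interaction, but successive residual fits chip away at it, which is the point of the higher-order models.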


SLIDE 24

Experimental Study

Training data and queries:

  • Synthetic and real-world data sets

(Including TPC-H, XMark, NASA, and XBench)

  • Randomly generated queries:
  • Simple linear paths (e.g., /a/b/c)
  • Branching paths (e.g., /a[b][c]/d)
  • Complex paths (e.g., /a[.//b][c//d]//e)

Model evaluation:

  • Use 5-fold cross-validation
  • Plot predicted vs. actual costs
  • Calculate accuracy measurements


SLIDE 25

Evaluating COMET’s Accuracy

Error metrics:

  • NRMSE (Normalized Root-Mean-Squared Error):

measures the average (relative) prediction error:

  NRMSE = (1/c̄) · [ (1/n) Σ_{i=1}^{n} (c_i − ĉ_i)² ]^{1/2}

where c_i and ĉ_i are the actual and estimated costs for the ith query, and c̄ = average(c_1, c_2, . . . , c_n)

  • Other metrics discussed in paper: R², OPD, MUP
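The NRMSE formula is a direct computation: root-mean-squared prediction error normalized by the mean actual cost. The cost values below are invented.

```python
# Direct transcription of NRMSE: RMSE of (actual - predicted) divided by
# the mean actual cost. Cost values are invented for illustration.
import math

def nrmse(actual, predicted):
    n = len(actual)
    mean_c = sum(actual) / n
    mse = sum((c - p) ** 2 for c, p in zip(actual, predicted)) / n
    return math.sqrt(mse) / mean_c

print(round(nrmse([2.0, 4.0], [1.0, 5.0]), 4))  # sqrt(1)/3 -> 0.3333
```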


SLIDE 26

Accuracy of COMET

COMET does a decent-to-excellent job in most cases:

[Figure: predicted vs. actual cost (msec.) scatter plots]

(a) XMark (Mixed Queries): NRMSE = 0.084, R² = 0.997, OPD = 0.972, MUP = 1000.110 (14.6%)
(b) TPC-H (Mixed Queries): NRMSE = 0.099, R² = 0.980, OPD = 0.948, MUP = 6428.379 (14.3%)

Add query type (simple, branching, complex) as feature?


SLIDE 27

Effect of Errors in SP Statistics

COMET is not sensitive to systematic errors in SP stats:

[Figure: COMET accuracy metrics (NRMSE, R², OPD) vs. bias factor in SP stats]


SLIDE 28

Effect of Training-set Size

A training set of moderate size suffices for reasonable accuracy:

[Figure: NRMSE vs. number of training queries (1000–3000)]

Model build time for 1000 training queries: < 1 second



SLIDE 30

Conclusion

Summary

  • Statistical learning is increasingly needed as data and its management grow more complicated
  • COMET can accurately model XNAV cost
  • COMET cost model is fast to construct and adaptable to a changing environment
  • A promising approach for costing complex query operators

Future Work

  • Automatic identification of features
  • Smarter generation of training queries
  • Extensions to handle I/O costs and multi-user environments (will identify appropriate features)
  • Incorporation of selectivity-estimation technology
  • Improved dynamic model maintenance (incremental model building)
  • Apply to other operators (XML, relational, text)
