Database Systems II Query Compiler CMPT 454, Simon Fraser - - PDF document


CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 183

Database Systems II Query Compiler


Introduction

The Query Compiler translates an SQL query into a physical query plan, which can be executed, in three steps:

  • The query is parsed and represented as a parse tree.
  • The parse tree is converted into a relational algebra expression tree (logical query plan).
  • The logical query plan is refined into a physical query plan, which also specifies the algorithms used in each step and the way in which data is obtained.


Introduction

SQL query
  → parse → parse tree
  → convert → logical query plan (l.q.p.)
  → query rewrite → "improved" l.q.p.
  → estimate result sizes → l.q.p. + sizes
  → consider physical plans → {P1, P2, ...}
  → estimate costs → {(P1,C1), (P2,C2), ...}
  → pick best → Pi
  → execute → answer


Introduction

Example

SELECT B, D FROM R, S WHERE R.A = "c" AND S.E = 2 AND R.C = S.C

Conceptual evaluation strategy: Perform cartesian product, Apply selection, and Project to specified attributes. Use as starting point for optimization.


Introduction

Example

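Expression tree, from the root to the leaves: $\pi_{B,D}$, then $\sigma_{R.A=\text{"c"} \wedge S.E=2 \wedge R.C=S.C}$, then $\times$ with children $R$ and $S$.

$\pi_{B,D}\,[\,\sigma_{R.A=\text{"c"} \wedge S.E=2 \wedge R.C=S.C}\,(R \times S)\,]$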


Introduction

Example

$\pi_{B,D}\,[\,\sigma_{R.A=\text{"c"}}(R) \bowtie \sigma_{S.E=2}(S)\,]$

Here $\bowtie$ is the natural join of the two selected relations. This logical query plan is equivalent. It is more efficient, since it reduces the sizes of the intermediate tables.


Introduction

Example

The logical query plan needs to be refined into a physical query plan. E.g., use the R.A and S.C indexes as follows:

(1) Use the R.A index to select R tuples with R.A = "c".
(2) For each R.C value found, use the S.C index to find matching S tuples.
(3) Eliminate S tuples with S.E ≠ 2.
(4) Join matching R, S tuples, project onto the B, D attributes and place in the result.
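The four steps above can be sketched in Python. This is only an illustration under stated assumptions: the slides do not give the schemas, so R(A, B, C) and S(C, D, E), the sample tuples, and the helper names `build_index` and `index_join_plan` are all hypothetical, and in-memory dictionaries stand in for the R.A and S.C indexes.

```python
from collections import defaultdict

# Hypothetical sample data: R(A, B, C) and S(C, D, E).
R = [("c", 1, 10), ("x", 2, 10), ("c", 3, 20)]
S = [(10, "u", 2), (10, "v", 5), (20, "w", 2), (30, "z", 2)]


def build_index(rows, pos):
    """Map each value in column `pos` to the list of rows holding it."""
    index = defaultdict(list)
    for row in rows:
        index[row[pos]].append(row)
    return index


def index_join_plan(r_a_index, s_c_index):
    result = []
    # (1) Use the R.A index to select R tuples with R.A = "c".
    for (_, b, c) in r_a_index.get("c", []):
        # (2) For each R.C value found, probe the S.C index.
        for (_, d, e) in s_c_index.get(c, []):
            # (3) Eliminate S tuples with S.E != 2.
            if e != 2:
                continue
            # (4) Join the matching pair and project onto B, D.
            result.append((b, d))
    return result


r_a_index = build_index(R, 0)  # stands in for the index on R.A
s_c_index = build_index(S, 0)  # stands in for the index on S.C
plan_result = index_join_plan(r_a_index, s_c_index)
```

Note that the plan never materializes R × S: only R tuples with A = "c" are touched, and S is accessed only through the index probes.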


Parsing

Parse Trees

Nodes correspond to either atoms (terminal symbols) or syntactic categories (non-terminal symbols).

An atom is a lexical element such as a keyword, the name of an attribute or relation, a constant, an operator, or a parenthesis. A syntactic category denotes a family of query subparts that all play the same role within a query, e.g. Condition. Syntactic categories are enclosed in angle brackets, e.g. <Condition>.


Parsing

Example

SELECT title FROM StarsIn WHERE starName IN (SELECT name FROM MovieStar WHERE birthdate LIKE '%1960');


Parsing

Parse tree of the example query (children are indented below their parent):

<Query>
  <SFW>
    SELECT
    <SelList>
      <Attribute>
        title
    FROM
    <FromList>
      <RelName>
        StarsIn
    WHERE
    <Condition>
      <Tuple>
        <Attribute>
          starName
      IN
      <Query>
        (
        <Query>
          <SFW>
            SELECT
            <SelList>
              <Attribute>
                name
            FROM
            <FromList>
              <RelName>
                MovieStar
            WHERE
            <Condition>
              <Attribute>
                birthDate
              LIKE
              <Pattern>
                '%1960'
        )


Parsing

Grammar for SQL

The following grammar describes a simple subset of SQL.

Queries
<Query> ::= SELECT <SelList> FROM <FromList> WHERE <Condition> ;

Selection lists
<SelList> ::= <Attribute>, <SelList>
<SelList> ::= <Attribute>

From lists
<FromList> ::= <Relation>, <FromList>
<FromList> ::= <Relation>


Parsing

Grammar for SQL

Conditions
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Attribute> IN (<Query>)
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>

The syntactic categories Relation and Attribute are not defined by grammar rules, but by the database schema. The syntactic category Pattern is defined as some regular expression.
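To make the grammar concrete, here is a minimal recursive-descent parser sketch for this subset. It is an illustration, not the parser of any real system: the tuple-based tree shape and the names `tokenize`, `Parser`, and `parse` are assumptions of this sketch. It also deviates from the grammar as written in two ways, both labeled in the comments: the left-recursive AND rule is handled with a loop, and the terminating `;` is accepted only (and optionally) at the top level, so that a parenthesized subquery need not contain one. Names are not checked against a schema.

```python
import re

KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "IN", "LIKE"}
# A token is a quoted pattern, a name or keyword, or one punctuation character.
TOKEN = re.compile(r"\s*('[^']*'|[A-Za-z_]\w*|[(),=;])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def accept(self, atom):
        """Consume the next token if it equals `atom` (case-insensitive)."""
        tok = self.peek()
        if tok is not None and tok.upper() == atom:
            self.pos += 1
            return True
        return False

    def expect(self, atom):
        if not self.accept(atom):
            raise SyntaxError(f"expected {atom!r}, found {self.peek()!r}")

    def name(self):
        # <Attribute> / <Relation>: any non-keyword identifier (no schema check).
        tok = self.peek()
        if (tok is None or not re.fullmatch(r"[A-Za-z_]\w*", tok)
                or tok.upper() in KEYWORDS):
            raise SyntaxError(f"expected a name, found {tok!r}")
        self.pos += 1
        return tok

    def name_list(self):
        # <SelList> and <FromList>: one or more names separated by commas.
        names = [self.name()]
        while self.accept(","):
            names.append(self.name())
        return names

    def query(self):
        self.expect("SELECT")
        sel_list = self.name_list()
        self.expect("FROM")
        from_list = self.name_list()
        self.expect("WHERE")
        return ("Query", sel_list, from_list, self.condition())

    def condition(self):
        # Deviation: the left-recursive AND rule is parsed with a loop.
        left = self.simple_condition()
        while self.accept("AND"):
            left = ("AND", left, self.simple_condition())
        return left

    def simple_condition(self):
        attribute = self.name()
        if self.accept("IN"):
            self.expect("(")
            subquery = self.query()
            self.expect(")")
            return ("IN", attribute, subquery)
        if self.accept("="):
            return ("=", attribute, self.name())
        if self.accept("LIKE"):
            pattern = self.peek()
            if pattern is None or not pattern.startswith("'"):
                raise SyntaxError(f"expected a pattern, found {pattern!r}")
            self.pos += 1
            return ("LIKE", attribute, pattern)
        raise SyntaxError(f"expected IN, = or LIKE, found {self.peek()!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.query()
    parser.accept(";")  # Deviation: ';' is optional and top-level only.
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree


example_tree = parse(
    "SELECT title FROM StarsIn WHERE starName IN "
    "(SELECT name FROM MovieStar WHERE birthdate LIKE '%1960');"
)
```

The result is a nested tuple whose outer node is a query with an IN condition, and whose inner node is the subquery with its LIKE condition.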


Conversion to Query Plan

How to convert a parse tree into a logical query plan, i.e. a relational algebra expression?

Queries with conditions without subqueries are easy:

  • Form the Cartesian product of all relations in <FromList>.
  • Apply a selection $\sigma_C$ where C is given by <Condition>.
  • Finally apply a projection $\pi_L$ where L is the list of attributes in <SelList>.

Queries involving subqueries are more difficult. Remove subqueries from conditions and represent them by a two-argument selection in the logical query plan. See the textbook for details.
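For the subquery-free case, the conversion can be sketched as a small Python function. The nested-tuple representation and the name `to_logical_plan` are assumptions made for illustration; the condition is carried along as an opaque string.

```python
from functools import reduce


def to_logical_plan(sel_list, from_list, condition):
    # Cartesian product of all relations in <FromList>, built left to right.
    product = reduce(lambda left, right: ("product", left, right), from_list)
    # Selection sigma_C on the product, then projection pi_L on top.
    return ("project", tuple(sel_list), ("select", condition, product))


# The query from the introduction.
plan = to_logical_plan(
    ["B", "D"], ["R", "S"], 'R.A = "c" AND S.E = 2 AND R.C = S.C'
)
```

The resulting tree is the unoptimized starting point; the algebraic laws are applied to it afterwards.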


Algebraic Laws for Query Plans

Introduction Algebraic laws allow us to transform a Relational Algebra (RA) expression into an equivalent one. Two RA expressions are equivalent if, for all database instances, they produce the same answer. The resulting expression may have a more efficient physical query plan. Algebraic laws are used in the query rewrite phase.


Algebraic Laws for Query Plans

Introduction

Commutative law: Order of arguments does not matter.
x + y = y + x

Associative law: May group two uses of the operator either from the left or the right.
(x + y) + z = x + (y + z)

Operators that are commutative and associative can be grouped and ordered arbitrarily.


Algebraic Laws for Query Plans

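Natural Join, Cartesian Product and Union

$R \times S = S \times R$
$(R \times S) \times T = R \times (S \times T)$

$R \cup S = S \cup R$
$R \cup (S \cup T) = (R \cup S) \cup T$

$R \bowtie S = S \bowtie R$
$(R \bowtie S) \bowtie T = R \bowtie (S \bowtie T)$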


Algebraic Laws for Query Plans

Natural Join, Cartesian Product and Union

$R \bowtie S = S \bowtie R$

To prove this law, we need to show that any tuple resulting from the left side expression is also produced by the right side expression, and vice versa. Suppose tuple t is in $R \bowtie S$. There must be tuples r in R and s in S that agree with t on all shared attributes. If we evaluate $S \bowtie R$, tuples s and r will again result in t.


Algebraic Laws for Query Plans

Natural Join, Cartesian Product and Union

$R \bowtie S = S \bowtie R$

Note that the order of attributes within a tuple does not matter (carry attribute names along). If a relation is treated as a bag of tuples, then according to the same reasoning the number of copies of t must be identical on both sides. The other direction of the proof is essentially the same, given the symmetry of S and R.


Algebraic Laws for Query Plans

Selection

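$\sigma_{p_1 \wedge p_2}(R) = \sigma_{p_1}[\sigma_{p_2}(R)]$
$\sigma_{p_1 \vee p_2}(R) = [\sigma_{p_1}(R)] \cup [\sigma_{p_2}(R)]$
$\sigma_{p_1}[\sigma_{p_2}(R)] = \sigma_{p_2}[\sigma_{p_1}(R)]$

Simple conditions $p_1$ or $p_2$ may be pushed down further than the complex condition.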


Algebraic Laws for Query Plans

Bag Union

What about the union of relations with duplicates (bags)?

R = {a,a,b,b,b,c}
S = {b,b,c,c,d}
$R \cup S$ = ?

The number of occurrences of a tuple in the result is either the SUM or the MAX of its numbers of occurrences in the input relations.

SUM: $R \cup S$ = {a,a,b,b,b,b,b,c,c,c,d}
MAX: $R \cup S$ = {a,a,b,b,b,c,c,d}
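The two definitions can be tried out with Python's `collections.Counter`, used here only as a convenient stand-in for a bag: its `+` operator adds multiplicities and its `|` operator takes their maximum. The helper `as_bag` is a hypothetical name introduced for display purposes.

```python
from collections import Counter

R = Counter("aabbbc")  # the bag {a,a,b,b,b,c}
S = Counter("bbccd")   # the bag {b,b,c,c,d}

sum_union = R + S  # SUM: multiplicities are added
max_union = R | S  # MAX: the larger multiplicity is kept


def as_bag(counter):
    """List the elements of a bag in sorted order, with repetitions."""
    return sorted(counter.elements())
```

Here `as_bag(sum_union)` lists b five times and c three times, while `as_bag(max_union)` lists b three times and c twice, matching the two results above.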


Algebraic Laws for Query Plans

Selection

$\sigma_{p_1 \vee p_2}(R) = \sigma_{p_1}(R) \cup \sigma_{p_2}(R)$

The MAX implementation of union makes this rule work.

R = {a,a,b,b,b,c}, $p_1$ satisfied by a, b; $p_2$ satisfied by b, c

$\sigma_{p_1 \vee p_2}(R)$ = {a,a,b,b,b,c}
$\sigma_{p_1}(R)$ = {a,a,b,b,b}
$\sigma_{p_2}(R)$ = {b,b,b,c}
$\sigma_{p_1}(R) \cup \sigma_{p_2}(R)$ = {a,a,b,b,b,c}
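The example can be checked mechanically. In this sketch, bags are again modeled with `collections.Counter`, and `select` is a hypothetical helper that keeps the tuples satisfying a predicate together with their multiplicities.

```python
from collections import Counter

R = Counter("aabbbc")  # the bag {a,a,b,b,b,c}


def select(bag, predicate):
    """Bag selection: keep qualifying values with their multiplicities."""
    return Counter({v: n for v, n in bag.items() if predicate(v)})


def p1(v):
    return v in ("a", "b")


def p2(v):
    return v in ("b", "c")


lhs = select(R, lambda v: p1(v) or p2(v))
rhs_max = select(R, p1) | select(R, p2)  # union with MAX semantics
rhs_sum = select(R, p1) + select(R, p2)  # union with SUM semantics
```

With MAX, both sides are the same bag. With SUM, the right side counts b six times instead of three, so the law fails for tuples that satisfy both predicates.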

Algebraic Laws for Query Plans

Selection

$\sigma_{p_1 \vee p_2}(R) = \sigma_{p_1}(R) \cup \sigma_{p_2}(R)$

In other cases, the SUM implementation of union makes more sense.

Senators (……)    Reps (……)
T1 = $\pi_{yr,state}$ Senators, T2 = $\pi_{yr,state}$ Reps

T1           T2
Yr  State    Yr  State
97  CA       99  CA
99  CA       99  CA
98  AZ       98  CA

Use the SUM implementation, but then some laws do not hold. Union?


Algebraic Laws for Query Plans

Selection and Set Operations

$\sigma_p(R \cup S) = \sigma_p(R) \cup \sigma_p(S)$
$\sigma_p(R - S) = \sigma_p(R) - S = \sigma_p(R) - \sigma_p(S)$

Algebraic Laws for Query Plans

Selection and Join

p: predicate with only R attributes
q: predicate with only S attributes
m: predicate with attributes from R and S

$\sigma_p(R \bowtie S) = [\sigma_p(R)] \bowtie S$
$\sigma_q(R \bowtie S) = R \bowtie [\sigma_q(S)]$


Algebraic Laws for Query Plans

Selection and Join

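$\sigma_{p \wedge q}(R \bowtie S) = [\sigma_p(R)] \bowtie [\sigma_q(S)]$
$\sigma_{p \wedge q \wedge m}(R \bowtie S) = \sigma_m[(\sigma_p R) \bowtie (\sigma_q S)]$
$\sigma_{p \vee q}(R \bowtie S) = [(\sigma_p R) \bowtie S] \cup [R \bowtie (\sigma_q S)]$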


Algebraic Laws for Query Plans

Selection and Join

$\sigma_{p \wedge q}(R \bowtie S) = \sigma_p[\sigma_q(R \bowtie S)] = \sigma_p[R \bowtie \sigma_q(S)] = [\sigma_p(R)] \bowtie [\sigma_q(S)]$


Algebraic Laws for Query Plans

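Projection

X: set of attributes
Y: set of attributes
XY: $X \cup Y$

$\pi_{XY}(R) = \pi_X[\pi_Y(R)]$

Caution: read literally, this identity does not hold in general, because the inner projection $\pi_Y$ discards the attributes of X that are not in Y. What does hold is $\pi_X[\pi_Y(R)] = \pi_X(R)$ whenever $X \subseteq Y$.

We may introduce a projection anywhere in an expression tree as long as it eliminates no attributes needed by an operator above and no attributes that are in the result.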


Algebraic Laws for Query Plans

Projection and Selection

X: subset of R attributes
Z: attributes in predicate P (subset of R attributes)

$\pi_X(\sigma_P R) = \pi_X\{\sigma_P[\pi_{XZ}(R)]\}$

The inner projection is onto XZ rather than X alone: we need to keep the attributes for the selection and for the result.


Algebraic Laws for Query Plans

Projection, Selection and Join

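$\pi_{XY}\{\sigma_P(R \bowtie S)\} = \pi_{XY}\{\sigma_P[\pi_{XZ}(R) \bowtie \pi_{YZ'}(S)]\}$

X: subset of R attributes
Y: subset of S attributes
Z: subset of R attributes used in P
Z': subset of S attributes used in P

If $\bowtie$ is a natural join, the inner projections must also keep the attributes shared by R and S, since the join condition depends on them.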


Improving Logical Query Plans

Introduction How to apply the algebraic laws to improve a logical query plan? Goal: minimize the size (number of tuples, number of attributes) of intermediate results. Push selections down in the expression tree as far as possible. Push down projections, or add new projections where applicable.


Improving Logical Query Plans

Pushing Selections

Replace the left side of one of these (and similar) rules by the right side:

$\sigma_{p_1 \wedge p_2}(R) \rightarrow \sigma_{p_1}[\sigma_{p_2}(R)]$
$\sigma_p(R \bowtie S) \rightarrow [\sigma_p(R)] \bowtie S$

This can greatly reduce the number of tuples of intermediate results.


Improving Logical Query Plans

Pushing Projections

Replace the left side of this (and similar) rules by the right side:

$\pi_X[\sigma_P(R)] \rightarrow \pi_X\{\sigma_P[\pi_{XZ}(R)]\}$

This reduces the number of attributes of intermediate results and possibly also the number of tuples.


Improving Logical Query Plans

Pushing Projections

Consider the following example:

R(A,B,C,D,E)
P: (A = 3) $\wedge$ (B = "cat")

Compare $\pi_E\{\sigma_P(R)\}$ vs. $\pi_E\{\sigma_P\{\pi_{ABE}(R)\}\}$

Improving Logical Query Plans

Pushing Projections

What if we have indexes on A and B? Look up A = 3 in the index on A and B = "cat" in the index on B, then intersect the two sets of pointers to get pointers to the matching tuples.

The efficiency of a logical query plan may depend on choices made during refinement to a physical plan. No transformation is always good!


Improving Logical Query Plans

Grouping Associative / Commutative Operators For operators which are commutative and associative, we can order and group their arguments arbitrarily. In particular: natural join, union, intersection. As the last step to produce the final logical query plan, group nodes with the same (associative and commutative) operator into one n-ary node. Best grouping and ordering determined during the generation of physical query plan.


Improving Logical Query Plans

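Grouping Associative / Commutative Operators

[Figure: an expression tree over the arguments A, B, C, D, E before grouping, and the same tree after nodes with the same operator (e.g. $\cup$) have been grouped into one n-ary node.]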


From Logical to Physical Plans

So far, we have parsed and transformed an SQL query into an optimized logical query plan. In order to refine the logical query plan into a physical query plan, we consider alternative physical plans, estimate their cost, and pick the plan with the least (estimated) cost. We have to estimate the cost of a plan without executing it. And we have to do that efficiently!


From Logical to Physical Plans

When creating a physical query plan, we have to decide on the following issues:

  • the order and grouping of operations that are associative and commutative,
  • the algorithm for each operator in the logical plan,
  • additional operators which are not represented in the logical plan, e.g. sorting,
  • the way in which intermediate results are passed from one operator to the next, e.g. by storing on disk or passing one tuple at a time.


Estimating the Cost of Operations

Intermediate relations are the output of some relational operator and the input of another one. The size of intermediate relations has a major impact on the cost of a physical query plan. It impacts in particular

  • the choice of an implementation for the various operators and
  • the grouping and order of commutative / associative operators.


A method for estimating the size of an intermediate relation should

  • be reasonably accurate,
  • be efficiently computable,
  • not depend on how that relation is computed.

We want to rank alternative query plans w.r.t. their estimated costs. The accuracy of the absolute values of the estimates is not as important as the accuracy of their ranks.

Estimating the Cost of Operations


Size estimates make use of the following statistics for relation R:

T(R): # of tuples in R
S(R): # of bytes in each R tuple
B(R): # of blocks to hold all R tuples
V(R,A): # of distinct values for attribute A in R
MIN(R,A): minimum value of attribute A in R
MAX(R,A): maximum value of attribute A in R
HIST(R,A): histogram for attribute A in R

Statistics need to be maintained up-to-date under database modifications!

Estimating the Cost of Operations


Example: relation R with attributes

A: 20 byte string
B: 4 byte integer
C: 8 byte date
D: 5 byte string

A    B  C   D
cat  1  10  a
cat  1  20  b
dog  1  30  a
dog  1  40  c
bat  1  50  d

T(R) = 5    S(R) = 37
V(R,A) = 3  V(R,B) = 1
V(R,C) = 5  V(R,D) = 4
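These statistics can be recomputed from the example table with a few lines of Python. The function names (`tuple_count`, `tuple_size`, `distinct_values`) are hypothetical; a real system maintains such statistics incrementally or by sampling rather than by scanning the relation on demand.

```python
# Attribute widths in bytes, taken from the schema above.
WIDTHS = {"A": 20, "B": 4, "C": 8, "D": 5}

ROWS = [
    {"A": "cat", "B": 1, "C": 10, "D": "a"},
    {"A": "cat", "B": 1, "C": 20, "D": "b"},
    {"A": "dog", "B": 1, "C": 30, "D": "a"},
    {"A": "dog", "B": 1, "C": 40, "D": "c"},
    {"A": "bat", "B": 1, "C": 50, "D": "d"},
]


def tuple_count(rows):
    """T(R): number of tuples."""
    return len(rows)


def tuple_size(widths):
    """S(R): bytes per tuple, assuming fixed-width attributes."""
    return sum(widths.values())


def distinct_values(rows, attribute):
    """V(R, A): number of distinct values of the attribute."""
    return len({row[attribute] for row in rows})


stats = {
    "T": tuple_count(ROWS),
    "S": tuple_size(WIDTHS),
    "V": {attr: distinct_values(ROWS, attr) for attr in WIDTHS},
}
```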

Estimating the Cost of Operations


Size estimate for $W = R_1 \times R_2$

$T(W) = T(R_1) \cdot T(R_2)$
$S(W) = S(R_1) + S(R_2)$

Size estimate for $W = \sigma_{A=a}(R)$

Assumption: values of A are uniformly distributed over the attribute domain.

$T(W) = T(R) / V(R,A)$
$S(W) = S(R)$

Estimating the Cost of Operations


Size estimate for $W = \sigma_{Z \geq val}(R)$

Solution 1: on average, half of the tuples will satisfy an inequality condition.
$T(W) = T(R) / 2$

Solution 2: more selective queries are more frequent, e.g. professors who earn more than $200,000 (rather than less than $200,000).
$T(W) = T(R) / 3$

Estimating the Cost of Operations


Solution 3: estimate the number of attribute values in the query range.

Use the minimum and maximum value to define the range of the attribute domain. Assume a uniform distribution of values over the attribute domain.

The estimate is the fraction of the domain that falls into the query range.

Estimating the Cost of Operations


Example: attribute Z of R with MIN(R,Z) = 1, MAX(R,Z) = 20, V(R,Z) = 10, and $W = \sigma_{Z \geq 15}(R)$.

$f = \frac{20 - 15 + 1}{20 - 1 + 1} = \frac{6}{20}$ (fraction of range)

$T(W) = f \cdot T(R)$
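A small sketch of Solution 3 for a condition of the form Z ≥ val over an integer domain. The function name `range_fraction` and the value T(R) = 1000 are assumptions for illustration; the slide does not state T(R).

```python
from fractions import Fraction


def range_fraction(minimum, maximum, val):
    """Fraction of the integer domain [minimum, maximum] with Z >= val."""
    if val <= minimum:
        return Fraction(1)
    if val > maximum:
        return Fraction(0)
    return Fraction(maximum - val + 1, maximum - minimum + 1)


f = range_fraction(1, 20, 15)  # (20 - 15 + 1) / (20 - 1 + 1) = 6/20

T_R = 1000                     # hypothetical tuple count, not from the slides
T_W = f * T_R                  # estimated size of the selection result
```

Note that V(R,Z) = 10 plays no role in this estimate: only MIN and MAX define the range.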

Estimating the Cost of Operations


Size estimate for $W = R_1 \bowtie R_2$

Consider only the natural join of R1(X,Y) and R2(Y,Z). We do not know how the Y values in R1 and R2 relate:

  • they may be disjoint, i.e. $T(R_1 \bowtie R_2) = 0$,
  • Y may be a foreign key of R1 and the primary key of R2, i.e. $T(R_1 \bowtie R_2) = T(R_1)$,
  • all the R1 and all the R2 tuples may have the same Y value, i.e. $T(R_1 \bowtie R_2) = T(R_1) \cdot T(R_2)$.

Estimating the Cost of Operations


Make several simplifying assumptions.

Containment of value sets:

  • $V(R_1,Y) \leq V(R_2,Y) \Rightarrow$ every Y value in R1 is in R2
  • $V(R_2,Y) \leq V(R_1,Y) \Rightarrow$ every Y value in R2 is in R1

This assumption is satisfied when Y is a foreign key in R1 and the primary key in R2. It is also approximately true in many other cases.

Estimating the Cost of Operations


Preservation of value sets: If A is an attribute of R1 but not of R2, then $V(R_1 \bowtie R_2, A) = V(R_1, A)$. Again, this holds if the join attribute Y is a foreign key in R1 and the primary key in R2. It can only be violated if there are "dangling tuples" in R1, i.e. R1 tuples that have no matching partner in R2.

Estimating the Cost of Operations


Uniform distribution of attribute values: the values of attribute A are uniformly distributed over their domain, i.e. P(A=a1) = P(A=a2) = . . . = P(A=ak). This assumption is necessary to make cost estimation tractable. It is often violated, but nevertheless allows reasonably accurate ranking of query plans.

Estimating the Cost of Operations


Independence of attributes: the values of attributes A and B are independent from each other, i.e. P(A=a|B=b) = P(A=a) and P(B=b|A=a) = P(B=b) . This assumption is necessary to make cost estimation tractable. Again, often violated, but nevertheless allows reasonably accurate ranking of query plans.

Estimating the Cost of Operations


Suppose that t1 is some tuple in R1 and t2 some tuple in R2. What is the probability that t1 and t2 agree on the join attribute Y?

If $V(R_1,Y) \leq V(R_2,Y)$, then the Y value of t1 appears in R2, because of the containment of value sets. Assuming uniform distribution of the Y values in R2 over their domain, the probability of t2 having the same Y value as t1 is $1 / V(R_2,Y)$.

Estimating the Cost of Operations


If $V(R_2,Y) \leq V(R_1,Y)$, then the Y value of t2 appears in R1, and the probability of t1 having the same Y value as t2 is $1 / V(R_1,Y)$.

T(W) = number of pairs of tuples from R1 and R2 times the probability that an arbitrary pair agrees on Y:

$T(R_1 \bowtie R_2) = T(R_1) \cdot T(R_2) / \max(V(R_1,Y), V(R_2,Y))$

Estimating the Cost of Operations


For complex query expressions, we need to estimate the T, S, V values for intermediate results.

For example, $W = [\sigma_{A=a}(R_1)] \bowtie R_2$. Treat $\sigma_{A=a}(R_1)$ as a relation U:

$T(U) = T(R_1) / V(R_1,A)$
$S(U) = S(R_1)$

We also need V(U, *) for all attributes of U (i.e. of R1)!

Estimating the Cost of Operations


Example: relation R1 and $U = \sigma_{A=a}(R_1)$

A    B  C   D
cat  1  10  10
cat  1  20  20
dog  1  30  10
dog  1  40  30
bat  1  50  10

V(R1,A) = 3  V(R1,B) = 1  V(R1,C) = 5  V(R1,D) = 3

V(U,A) = 1
V(U,B) = 1
V(U,C) = T(R1) / V(R1,A)
V(U,D) ... somewhere in between

Estimating the Cost of Operations


R1(A,B), R2(A,C). Consider the join $U = R_1 \bowtie R_2$. Estimate the V values for U.

$V(U,A) = \min\{V(R_1,A), V(R_2,A)\}$
Holds due to containment of value sets.

$V(U,B) = V(R_1,B)$
$V(U,C) = V(R_2,C)$
Holds due to preservation of value sets.

Estimating the Cost of Operations


Consider the following example: $Z = R_1(A,B) \bowtie R_2(B,C) \bowtie R_3(C,D)$

T(R1) = 1000  V(R1,A) = 50   V(R1,B) = 100
T(R2) = 2000  V(R2,B) = 200  V(R2,C) = 300
T(R3) = 3000  V(R3,C) = 90   V(R3,D) = 500

Group and order as $(R_1 \bowtie R_2) \bowtie R_3$.

Estimating the Cost of Operations


Partial result: $U = R_1 \bowtie R_2$

$T(U) = 1000 \cdot 2000 / 200 = 10{,}000$
V(U,A) = 50
V(U,B) = 100
V(U,C) = 300

Estimating the Cost of Operations


Final result: $Z = U \bowtie R_3$

$T(Z) = 1000 \cdot 2000 \cdot 3000 / (200 \cdot 300) = 100{,}000$
V(Z,A) = 50
V(Z,B) = 100
V(Z,C) = 90
V(Z,D) = 500
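The whole example can be replayed with a short Python sketch that applies the join size formula together with the containment and preservation rules. The dictionary layout and the name `estimate_join` are assumptions of this sketch. If two relations shared more than one attribute, the loop would divide by the maximum V value once per shared attribute, which is an additional assumption that this example does not exercise.

```python
def estimate_join(left, right):
    """Estimate T and V for the natural join of two relations.

    Each relation is a dict with "T" (tuple count) and "V" (a dict
    mapping attribute names to numbers of distinct values).
    """
    shared = set(left["V"]) & set(right["V"])
    denominator = 1
    values = {}
    for attr in shared:
        # Join size formula: divide by the larger number of distinct values.
        denominator *= max(left["V"][attr], right["V"][attr])
        # Containment of value sets: the smaller value set survives.
        values[attr] = min(left["V"][attr], right["V"][attr])
    # Preservation of value sets for all non-join attributes.
    for side in (left, right):
        for attr, count in side["V"].items():
            if attr not in shared:
                values[attr] = count
    return {"T": left["T"] * right["T"] / denominator, "V": values}


R1 = {"T": 1000, "V": {"A": 50, "B": 100}}
R2 = {"T": 2000, "V": {"B": 200, "C": 300}}
R3 = {"T": 3000, "V": {"C": 90, "D": 500}}

U = estimate_join(R1, R2)  # partial result R1 join R2
Z = estimate_join(U, R3)   # final result (R1 join R2) join R3
```

Running the same function with a different grouping, e.g. `estimate_join(R1, estimate_join(R2, R3))`, gives the intermediate sizes for that alternative order, which is what a plan enumerator would compare.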

Estimating the Cost of Operations