Data Management Systems – Query Processing – Optimization II (PowerPoint presentation)


SLIDE 1

Data Management Systems

  • Query Processing
  • Execution models
  • Optimization I – heuristics & rewriting
  • Optimization II – cost models
  • Optimization III – Operators

Gustavo Alonso, Institute of Computing Platforms, Department of Computer Science, ETH Zürich

Query processing Optimization II

  • Understanding the data
  • Calculating Costs
  • Rule Based Optimizer
  • Cost Based Optimizer

SLIDE 2

Query optimization

  • Since SQL is declarative, a database engine has many options to translate a query into an executable program
  • Problems to solve:
  • Which starting point? Definitely not the query as provided by the user
  • Queries are normalized, put in capital letters, syntactic sugar is removed, etc.
  • Queries are rewritten
  • How to generate possible plans for the same query
  • How to decide which plans are best
  • Rule based (use heuristics)
  • Cost based (estimate the cost of the plans and choose the cheapest one)

SLIDE 3

What are the choices?

  • There are many choices involved in coming up with the best plan:
  • Access method for each table (leaves of the query tree)
  • Is there an index on the table?
  • Is there a predicate on the query? Is the predicate on a key?
  • Are the tables clustered (on the same extent)?
  • What implementation of the operator is chosen
  • Nested loop join, hash join
  • Is the input data sorted?
  • Is it useful if the result is sorted?
  • Shape and form of the query tree
  • Pushdown selection and projection
  • In which order are joins executed

SLIDE 4

Cost based query optimization

  • The key to running queries faster is being able to estimate their performance without actually executing them
  • That way we can choose which plan is best
  • Complex problem:
  • Many components involved: CPU, memory, I/O, network
  • Typically, focus on the major bottlenecks: memory, I/O
  • The structured processing of queries helps by providing some relevant parameters (number of tuples, distribution, selectivity, etc.)
  • Try to make clever decisions by using as much information as is available

SLIDE 5

Understanding the data


https://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i82005

SLIDE 6

The basics for optimization

  • The main information source for query optimization is statistics on the data
  • These statistics are constantly collected on tables, indexes, buffers, and the system, and made available (in Oracle, through the “Dictionary”)
  • The statistical data is the basis for the decisions the query optimizer makes when choosing one plan over another, and also regarding which operator implementation to use

SLIDE 7

Typical statistics

Table statistics

  • Number of rows
  • Number of blocks
  • Average row length

Column statistics

  • Number of distinct values (NDV) in column
  • Number of nulls in column
  • Data distribution (histogram)

Extended statistics

  • Index statistics
  • Number of leaf blocks
  • Levels
  • Clustering factor

System statistics

  • I/O performance and utilization
  • CPU performance and utilization


https://docs.oracle.com/cd/B28359_01/server.111/b28274/stats.htm#i37048

SLIDE 8

Why such statistics? Examples

  • Number of rows:
  • Tells you the size of tables
  • Helps decide how to run a join
  • Number of distinct values:
  • Helps to estimate the selectivity of a predicate (how many results will be produced)
  • Helps to decide on join order
  • Histograms:
  • Help to estimate the selectivity of predicates on skewed tables
  • Help to decide on join order

SLIDE 9

Table sizes and joins


In a join, there is an outer table and an inner table (do not confuse with inner and outer joins):

  • The outer table is typically the smallest one

Nested Loop join: the smaller table is the outer loop, the big table is the inner loop

  • Access to the big table is sequential

Hash Join: the smaller table is used to build the hash table, the big table is used for probing

  • Smaller hash table to maintain
  • Sequential access to the big table
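As a sketch of why the smaller input goes on the build side, here is a minimal hash join in Python (illustrative only; the relations and column names are made up, and real engines work on blocks, not Python lists):

```python
from collections import defaultdict

def hash_join(r, s, key_r, key_s):
    """Illustrative hash join: build on the smaller input, probe with the larger.

    r and s are lists of dicts; key_r / key_s name the join columns."""
    # Pick the smaller relation as the build side: smaller hash table to maintain.
    if len(r) <= len(s):
        build, probe, bk, pk = r, s, key_r, key_s
    else:
        build, probe, bk, pk = s, r, key_s, key_r
    table = defaultdict(list)
    for row in build:
        table[row[bk]].append(row)
    out = []
    for row in probe:  # one sequential pass over the big table
        for match in table.get(row[pk], []):
            out.append({**match, **row})
    return out
```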
SLIDE 10

Histograms

  • Histograms are widely used in all major databases:
  • “Cardinality estimation (CE) in SQL Server is derived primarily from histograms that are created when indexes or statistics are created, either manually or automatically” (SQL Server Manual)
  • “By default the optimizer assumes a uniform distribution of rows across the distinct values in a column. For columns that contain data skew (a non-uniform distribution of data within the column), a histogram enables the optimizer to generate accurate cardinality estimates for filter and join predicates that involve these columns.” (Oracle Manual)
  • “… in MySQL we have chosen to support two different types: The “singleton” histogram and the “equi-height” histogram.” (MySQL Server Blog)

SLIDE 11

Histograms

SELECT * FROM person WHERE 25 < age < 40;

[Figure: equi-width buckets with boundaries 10, 20, 30, 40, 50, 60 vs. equi-depth buckets 20–42, 42–48, 48–53, 53–59, 59–70]

EQUI-WIDTH

  • Ranges of values are fixed and equal
  • Tells how many values are in each range
  • Helps identifying hot-spots
  • May store distinct values and min/max, etc.

EQUI-DEPTH (EQUI-HEIGHT)

  • Same number of tuples per bucket
  • Helps to partition data evenly
  • The size of a range helps with cardinality estimates
  • May store distinct values and min/max, etc.
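The two bucket schemes can be sketched as follows (a minimal illustration, not any engine's implementation; the value ranges and bucket counts are arbitrary):

```python
def equi_width(values, lo, hi, nbuckets):
    """Equi-width histogram: fixed, equal ranges; returns a count per range."""
    width = (hi - lo) / nbuckets
    counts = [0] * nbuckets
    for v in values:
        i = min(int((v - lo) / width), nbuckets - 1)  # clamp hi into last bucket
        counts[i] += 1
    return counts

def equi_depth(values, nbuckets):
    """Equi-depth (equi-height) histogram: same number of tuples per bucket;
    returns the bucket boundaries instead of the counts."""
    vals = sorted(values)
    per = len(vals) // nbuckets
    return [vals[0]] + [vals[(i + 1) * per - 1] for i in range(nbuckets)]
```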

SLIDE 12

Singleton or frequency histogram

  • The frequency histogram plots the frequency of every distinct item in a table
  • In essence, how often each value appears in the table
  • Very useful to compute the selectivity of queries
  • Highly accurate as it gives counts for every possible value
  • Can be done if the number of distinct values is not too high
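A frequency histogram and the selectivity estimate it yields can be sketched in a few lines (illustrative; the column data is made up):

```python
from collections import Counter

def frequency_histogram(column):
    """Singleton/frequency histogram: exact count of every distinct value.
    Only feasible when the number of distinct values (NDV) is small."""
    return Counter(column)

def selectivity_eq(hist, total, value):
    """Selectivity of an equality predicate, read directly off the histogram."""
    return hist.get(value, 0) / total
```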

SLIDE 13

Selecting a type of histogram (example)

  • NDV: This represents the number of distinct values in a column. For example, if a column only contains the values 100, 200, and 300, then the NDV for this column is 3.
  • n: This variable represents the number of histogram buckets. The default is 254.
  • p: This variable represents an internal percentage threshold that is equal to (1 – (1/n)) * 100. For example, if n = 254, then p is 99.6.


https://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm#TGSQL-GUID-FFA0C0AF-3761-4829-995E-9AFA524F96CE

SLIDE 14

Zone Maps

  • A zone map is a combination of a coarse index and statistics
  • For every block of a table:
  • Keep the max and min values for some or all columns
  • Before reading a block, check the zone map:
  • If the range does not match the predicate, do not read the block
  • It can significantly reduce I/O cost
  • In some cases it can replace an index
  • Other statistics can be kept in a zone map
  • An example of use of the Zone Maps concept is Snowflake (see chapter on Storage)
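The block-skipping idea can be sketched as follows (a minimal illustration; the block layout and column name are assumptions for the example):

```python
def build_zone_map(blocks, column):
    """Zone map: per-block (min, max) for a column – a coarse index plus statistics."""
    return [(min(r[column] for r in b), max(r[column] for r in b)) for b in blocks]

def scan_with_zone_map(blocks, zone_map, column, lo, hi):
    """Read only the blocks whose [min, max] range overlaps the predicate range."""
    out = []
    for block, (bmin, bmax) in zip(blocks, zone_map):
        if bmax < lo or bmin > hi:
            continue  # the predicate cannot match: skip the block, saving I/O
        out.extend(r for r in block if lo <= r[column] <= hi)
    return out
```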

SLIDE 15

Calculating costs

SLIDE 16

Cardinality I

  • WARNING: There are several definitions of cardinality going around …
  • The cardinality of an attribute: how many distinct values there are for that attribute
  • Table cardinality (Oracle): the number of tuples in the table
  • Operator cardinality: the number of tuples that must be processed to get the result
  • Predicate cardinality: how many tuples match the predicate
  • All these definitions are related and can sometimes actually be the same, but they mean different things:
  • Attribute cardinality is used to determine selectivity (how many tuples will be produced after applying an operator) => the output of the operator
  • Operator cardinality is used to determine the cost of running an operator (how many tuples need to be read and processed) => the input of the operator
  • Cardinality of an attribute and predicate cardinality are often related

SLIDE 17

Cardinality II

  • The number of tuples to be processed can be estimated from:
  • Size of the table or input
  • Type of operator and access method
  • Full table scan
  • Index scan
  • Range predicates
  • Equality predicates
  • The attribute cardinality is estimated using statistics
  • Both are easier to determine for base tables. The difficulty is how to estimate them for intermediate results

SLIDE 18

Selectivity

  • The selectivity of an operator is how much data it will produce. Typically expressed as a fraction over the data in the table
  • Selectivity 1 = all the data will be selected
  • Selectivity 0 = none of the data will be selected
  • If we can guess the selectivity, we know how much data it will produce for the next operator to process
  • The selectivity is related to the cardinality of an attribute, as that can tell us how many results will be produced for a given selection predicate

SLIDE 19

Example

  • Table (T) with 1000 rows
  • SELECT * FROM T
  • Operator Cardinality = 1000 rows
  • Query Selectivity = 1
  • SELECT * FROM T WHERE T.id = 123456 (T.id is the key)
  • Operator Cardinality = depends on access method
  • Query Selectivity = 1/1000 = 0.001
  • SELECT * FROM T WHERE T.price > 57
  • Operator Cardinality = depends on data distribution and access method
  • Query Selectivity = depends on data distribution
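The selectivities in this example can be reproduced with a small sketch (the table contents, in particular the price distribution, are invented for illustration):

```python
def selectivity(rows, predicate):
    """Selectivity = matching tuples / total tuples."""
    return sum(1 for r in rows if predicate(r)) / len(rows)

# Hypothetical table T with 1000 rows; id is the key, price cycles over 0..99.
T = [{"id": i, "price": i % 100} for i in range(1000)]

sel_all = selectivity(T, lambda r: True)             # SELECT * FROM T
sel_key = selectivity(T, lambda r: r["id"] == 123)   # equality on the key
sel_rng = selectivity(T, lambda r: r["price"] > 57)  # depends on the distribution
```

With this made-up distribution, the range predicate matches 42 of every 100 prices, so its selectivity is 0.42; with real data only a histogram could tell.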

SLIDE 20

Using selectivity, simple example

SELECT * FROM R, S WHERE R.x > 100 AND S.y < 10

  • Basic statistics on R and S do not help to determine which table is bigger if we push the selections down
  • If we know the selectivity of R.x > 100 and S.y < 10, then we can decide
  • This is a very simple example: it can be done at run time (run the predicates, see the size of the results, then decide)

SLIDE 21

Using selectivity, a less simple example

SELECT * FROM T,R,S WHERE T.id = R.id AND R.a = S.a AND R.x>100 AND S.y<10 AND T.w=3

  • Selectivity now plays a role in deciding in which order to do the joins
  • We want to join first the tables with the least data (highly selective)
  • The hope is that the intermediate join will also contain less data (but the selectivity alone does not tell us that)

SLIDE 22

Calculating selectivity, simple predicates

  • For an attribute that is the key:
  • An equality predicate selects just one tuple
  • For other attributes:
  • An equality predicate:
  • Uniform data: rough approximation: number of tuples / number of distinct values
  • Skewed data: use a histogram to get an estimate
  • A range predicate: use a histogram to get an estimate (or assume a uniform distribution)
  • Negation predicate: 1 – predicate (e.g., R.x != 3 => 1 – selectivity(R.x = 3))
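These rules can be written down directly as small estimator functions (a sketch; the function names and the uniform-range formula are illustrative conventions, not any engine's API):

```python
def sel_eq_key(num_rows):
    """Equality on a key: exactly one tuple, as a fraction of the table."""
    return 1 / num_rows

def sel_eq_uniform(ndv):
    """Equality on a non-key attribute, uniform data: 1 / number of distinct values."""
    return 1 / ndv

def sel_range_uniform(lo, hi, minval, maxval):
    """Range predicate (lo, hi) assuming values uniform over [minval, maxval]."""
    return (min(hi, maxval) - max(lo, minval)) / (maxval - minval)

def sel_negation(sel):
    """Negation: 1 - selectivity of the positive predicate."""
    return 1 - sel
```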

SLIDE 23

Calculating selectivity, complex predicates

  • If the distribution of values for each predicate is independent:
  • For conjunctions of predicates: multiply the selectivity of each one
  • For disjunctions: add the selectivities minus the selectivity of the intersection (which is the selectivity of the conjunction)
  • If the predicates are over correlated attributes, things are more difficult:
  • Correlated attributes are likely to result in wrong estimates of the selectivity
  • Some systems try to guess correlation (columns of the same table) and adjust for it
  • This is why histograms are useful (if you have the right histogram)
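Under the independence assumption, the two combination rules are one line each (a sketch; the function names are made up):

```python
def sel_and(s1, s2):
    """Conjunction of independent predicates: multiply selectivities."""
    return s1 * s2

def sel_or(s1, s2):
    """Disjunction: add selectivities, subtract the intersection
    (which, under independence, is the selectivity of the conjunction)."""
    return s1 + s2 - sel_and(s1, s2)
```

With correlated attributes these formulas are exactly what goes wrong: the true intersection can be anywhere between 0 and min(s1, s2), far from s1 * s2.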

SLIDE 24

Simple example of correlated attributes

  • Assume a table of students (S) and another of courses (C)
  • Assume a course X that is mandatory in one department (D-INFK) but not offered in others (D-ITET)
  • A query looking for D-INFK students attending the mandatory course will produce all D-INFK students
  • A query looking for D-ITET students attending the mandatory course will return no results
  • Each one of the predicates (D-INFK, D-ITET, attending the mandatory course) has its own selectivity over the base data, but their correlation is not captured anywhere

SLIDE 25

What else?

  • Database engines use many other forms of cost based information:
  • Cost of memory accesses, cost of I/O (from benchmarks, usually in the form of ratios)
  • CPU cost of operations (from benchmarks, depending on the data types involved and the predicates to evaluate)
  • Space available (size of the buffers, degree of multiprogramming)
  • These parameters are less related to the data and more to the implementation of the engine itself, so they tend to be very engine specific.

SLIDE 26

How to compute cost?

  • Each operator has a cost function with a number of parameters (such as input cardinality)
  • For each operator, input the parameters and that gives a cost for the operator
  • For a query tree, combine all the costs of the operators in the tree and that gives a cost for the entire plan
  • The input parameters are a very important aspect in selecting good plans
  • This is why systems invest resources in maintaining statistics!
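Combining operator costs over a tree can be sketched recursively (the cost formulas below are invented for illustration and are not a real engine's cost model):

```python
def plan_cost(node):
    """Total plan cost: an operator's own cost plus the cost of its children.
    A node is a triple (cost_fn, params, children)."""
    cost_fn, params, children = node
    return cost_fn(**params) + sum(plan_cost(c) for c in children)

# Hypothetical cost functions, parameterized by input cardinalities.
scan_cost = lambda rows: rows                        # read every tuple
join_cost = lambda left, right: left * 0.5 + right   # made-up formula

plan = (join_cost, {"left": 1000, "right": 100},
        [(scan_cost, {"rows": 1000}, []),
         (scan_cost, {"rows": 100}, [])])
```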

SLIDE 27

Rule Based Optimizer

SLIDE 28

Using a rule based optimizer

  • Rule based optimizers do not look at statistics or the contents of the tables. They use only:
  • Transformation rules
  • Ranks of which methods are preferred for doing an operation
  • Schema information (keys, indexes, etc.)
  • The rules used are based on experience and have been developed over decades of observing the behavior of queries. They tend to be quite accurate
  • But a rule based optimizer ignores the actual data and, thus, can make mistakes
  • We will illustrate a rule based optimizer by using an example from an older version of Oracle (the goal is to get the idea, not the actual rules)

SLIDE 29

Access methods (example Oracle)

  • A database engine has a list of ways in which data is accessed:
  • Full table scan (read the entire table)
  • Sample table scan (read a sample of the table)
  • Table access by Row ID (use the row ID to locate the block where the tuple is and find the corresponding tuple => fastest way)
  • Cluster scans (read the blocks where the tuples with the same value for an attribute – the cluster key – are stored)
  • Hash scans (read the blocks storing the tuples hashed to the same value)
  • Index scan (use an index to retrieve one or more tuples, as needed)
  • … (many variations depending on data organization and index type)

SLIDE 30

Heuristics for access methods

  • A rule based optimizer will typically assign a ranking to each type of access method. The ranking indicates whether it is more or less expensive than other access methods
  • Example Oracle (https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20b_ops.htm#8157):

With the rule-based approach, the optimizer chooses whether to use an access path based on these factors:

  • the available access paths for the statement
  • the ranks of these access paths

To choose an access path, the optimizer first examines the conditions in the statement's WHERE clause to determine which access paths are available. The optimizer then chooses the most highly ranked available access path. Note that the full table scan is the lowest ranked access path on the list. This means that the rule-based approach always chooses an access path that uses an index if one is available, even if a full table scan might execute faster. The order of the conditions in the WHERE clause does not normally affect the optimizer's choice among access paths.

SLIDE 31

Example

Consider this SQL statement, which selects the employee numbers of all employees in the EMP table with an ENAME value of 'CHUNG' and with a SAL value greater than 2000:

SELECT empno FROM emp WHERE ename = 'CHUNG' AND sal > 2000;

Consider also that the EMP table has these integrity constraints and indexes:

  • There is a PRIMARY KEY constraint on the EMPNO column that is enforced by the index PK_EMPNO.
  • There is an index named ENAME_IND on the ENAME column.
  • There is an index named SAL_IND on the SAL column.

Based on the conditions in the WHERE clause of the SQL statement, the integrity constraints, and the indexes, these access paths are available:

  • A single-column index access path using the ENAME_IND index is made available by the condition ENAME = 'CHUNG'. This access path has rank 9.
  • An unbounded range scan using the SAL_IND index is made available by the condition SAL > 2000. This access path has rank 11.
  • A full table scan is automatically available for all SQL statements. This access path has rank 15.

Note that the PK_EMPNO index does not make the single row by primary key access path available because the indexed column does not appear in a condition in the WHERE clause. Using the rule-based approach, the optimizer chooses the access path that uses the ENAME_IND index to execute this statement. The optimizer chooses this path because it is the most highly ranked path available.


https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20b_ops.htm#8157

SLIDE 32

Heuristics for joins (example Oracle)

With the rule-based approach, the optimizer follows these steps to choose an execution plan for a statement that joins R tables:

  1. The optimizer generates a set of R join orders, each with a different table as the first table.
  2. The optimizer then chooses among the resulting set of execution plans. The goal of the optimizer's choice is to maximize the number of nested loops join operations in which the inner table is accessed using an index scan. Since a nested loops join involves accessing the inner table many times, an index on the inner table can greatly improve the performance of a nested loops join.


https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm

SLIDE 33

How to generate join orders

  • To fill each position in the join order, the optimizer chooses the table with the most highly ranked available access path. The optimizer repeats this step to fill each subsequent position in the join order.
  • For each table in the join order, the optimizer also chooses the operation with which to join the table to the previous table or row source in the order. The optimizer does this by "ranking" the sort-merge operation as access path 12 and applying these rules:
  • If the access path for the chosen table is ranked 11 or better, the optimizer chooses a nested loops operation using the previous table or row source in the join order as the outer table.
  • If the access path for the table is ranked lower than 12, and there is an equijoin condition between the chosen table and the previous table or row source in the join order, the optimizer chooses a sort-merge operation.
  • If the access path for the chosen table is ranked lower than 12, and there is not an equijoin condition, the optimizer chooses a nested loops operation with the previous table or row source in the join order as the outer table.


https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm

SLIDE 34

How to choose a plan

Usually, the optimizer does not consider the order in which tables appear in the FROM clause when choosing an execution plan. The optimizer makes this choice by applying the following rules in order:

a) The optimizer chooses the execution plan with the fewest nested-loops operations in which the inner table is accessed with a full table scan.
b) If there is a tie, the optimizer chooses the execution plan with the fewest sort-merge operations.
c) If there is still a tie, the optimizer chooses the execution plan for which the first table in the join order has the most highly ranked access path.
d) If there is a tie among multiple plans whose first tables are accessed by the single-column indexes access path, the optimizer chooses the plan whose first table is accessed with the most merged indexes.
e) If there is a tie among multiple plans whose first tables are accessed by bounded range scans, the optimizer chooses the plan whose first table is accessed with the greatest number of leading columns of the composite index.
f) If there is still a tie, the optimizer chooses the execution plan for which the first table appears later in the query's FROM clause.


https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm

SLIDE 35

Cost Based Optimizer

SLIDE 36

Cost based optimizer

  • Basic approach:
  • Consider all possible ways to access the base tables and their cost
  • Generate all valid ways to execute the query (typically, all possible join orderings)
  • Evaluate the cost of every possible plan
  • Pick the best
  • Sounds simple, but:
  • Query optimization is NP-hard (even just ordering Cartesian products)
  • Cost is only approximate
  • The space to explore can be huge
  • We do not have an arbitrary amount of time to optimize the query

SLIDE 37

Enumeration Algorithms

  • The most expensive part of query optimization is plan enumeration, which typically amounts to determining the join order
  • Overview of algorithms:
  • Dynamic programming (good plans, exponential complexity)
  • Greedy heuristics (e.g., highest selectivity join first)
  • Randomized algorithms (iterative improvement, simulated annealing, …)
  • Other heuristics (e.g., rely on hints by the programmer)
  • Smaller search space (e.g., deep plans, limited group-bys)
  • Products:
  • Dynamic programming is used by many systems
  • Some systems also use greedy heuristics in addition

SLIDE 38

Query optimization

SELECT a,b,c FROM R,S,T WHERE R.id = S.id AND S.id = T.id AND …

Step 1: find all possible ways to access tables R, S, T and estimate the cost as well as the selectivity
Step 2: pick the best access methods
Step 3: generate all possible join orders for the three tables: (R ⋈ S) ⋈ T; (R ⋈ T) ⋈ S; (T ⋈ S) ⋈ R
Step 4: estimate the cost for each join
Step 5: pick the best query tree

SLIDE 39

Simplified example

SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b ORDER BY R.c;

We will assign costs arbitrarily, just as an example …

SLIDE 40

Access Plans for R

  • SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b ORDER BY R.c;
  • Assume indexes on R.a, R.b, R.c, R.d
  • scan(R): cost = 100; order = none
  • idx(R.a): cost = 100; order = R.a
  • idx(R.b): cost = 1000; order = R.b
  • idx(R.c): cost = 1000; order = R.c
  • idx(R.d): cost = 1000; order = none
  • Keep the best plans (cost but also order)

SLIDE 41

Access Plans for S

  • SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b ORDER BY R.c;
  • Assume indexes on S.b, S.c, S.d
  • scan(S): cost = 1000; order = none
  • idx(S.b): cost = 10000; order = none
  • idx(S.c): cost = 10000; order = none
  • idx(S.d): cost = 10000; order = none

SLIDE 42

Access Plans for T

  • SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b ORDER BY R.c;
  • Assume indexes on T.a, T.b
  • scan(T): cost = 10; order = none
  • idx(T.a): cost = 100; order = none
  • idx(T.b): cost = 100; order = T.b

SLIDE 43

Join Plans for R join S

  • SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b ORDER BY R.c;
  • Consider all combinations of access plans
  • Consider all join algorithms (NL, IdxNL, SMJ, GHJ, …)
  • Consider all orders: R x S, S x R
  • Prune based on cost estimates, interesting orders
  • Some examples:
  • scan(R) NLJ scan(S): cost = 100; order = none
  • scan(S) IdxNL idx(R.a): cost = 1000; order = none
  • idx(R.b) GHJ scan(S): cost = 150; order = R.b
  • idx(R.b) NLJ scan(S): cost = 250; order = R.b

SLIDE 44

Join Plans for three-way joins

  • SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b ORDER BY R.c;
  • Consider all combinations of joins (associativity, commutativity)
  • e.g., (R ⋈ S) ⋈ T, S ⋈ (T ⋈ R), …
  • sometimes even enumerate Cartesian products
  • Use (pruned) plans of previous steps as building blocks
  • consider all combinations
  • Prune based on cost estimates and interesting orders

SLIDE 45

Interesting Orders in plans

  • Consider the expression (r1 ⋈ r2 ⋈ r3) ⋈ r4 ⋈ r5
  • An interesting sort order is a particular sort order of tuples that could be useful for a later operation.
  • Generating the result of r1 ⋈ r2 ⋈ r3 sorted on the attributes it has in common with r4 or r5 may be useful, but generating it sorted on the attributes common only to r1 and r2 is not useful.
  • Using merge-join to compute r1 ⋈ r2 ⋈ r3 may be costlier, but may provide an output sorted in an interesting order.
  • Usually, the number of interesting orders is quite small and doesn't affect time/space complexity significantly

SLIDE 46

Queries Over Multiple Relations

  • In principle, we need to consider all possible join orderings:
  • As the number of joins increases, the number of alternative plans grows rapidly; we need to restrict the search space.
  • System R: consider only left-deep join trees. Left-deep trees allow us to generate all fully pipelined plans: intermediate results are not written to temporary files (not all left-deep trees are fully pipelined, though).

[Figure: example join trees over tables A, B, C, D – left-deep vs. bushy shapes]

SLIDE 47

Enumeration Left-Deep Plans

  • Enumerated using N passes (if N relations are joined):
  • Pass 1: find the best 1-relation plan for each relation.
  • Pass 2: find the best way to join the result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)
  • Pass N: find the best way to join the result of an (N-1)-relation plan (as outer) to the N'th relation. (All N-relation plans.)
  • For each subset of relations, retain only:
  • The cheapest plan overall, plus
  • The cheapest plan for each interesting order of the tuples.
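The passes above can be sketched as a dynamic program over subsets of relations. This is a simplified, illustrative version: interesting orders are omitted, only the cheapest plan per subset is kept, and the cost functions are invented:

```python
from itertools import combinations

def left_deep_best_plans(relations, scan_cost, join_cost):
    """System-R style enumeration of left-deep plans, sketched.

    relations: dict name -> cardinality. scan_cost(card) and
    join_cost(outer_set, inner_name) are hypothetical cost models.
    best[subset] = (cost, plan), keeping only the cheapest plan per subset."""
    best = {frozenset([r]): (scan_cost(c), r) for r, c in relations.items()}
    names = list(relations)
    for k in range(2, len(names) + 1):          # pass k: all k-relation plans
        for subset in combinations(names, k):
            s = frozenset(subset)
            cands = []
            for r in subset:                     # r is joined last, as the inner
                ocost, oplan = best[s - {r}]
                cands.append((ocost + join_cost(s - {r}, r), (oplan, r)))
            best[s] = min(cands, key=lambda t: t[0])
    return best[frozenset(names)]
```

Each plan is a left-deep nested pair like `(("T", "S"), "R")`, i.e., the previous result is always the outer input of the next join.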
SLIDE 48

Enumeration of Plans (Contd.)

  • ORDER BY, GROUP BY, aggregates, etc. are handled as a final step, using either an “interestingly ordered” plan or an additional sorting operator.
  • An (n-1)-relation plan is not combined with an additional relation unless there is a join condition between them, unless all predicates in the WHERE clause have been used up.
  • i.e., avoid Cartesian products if possible.
  • In spite of pruning the plan space, this approach is still exponential in the number of tables.

slide-49
SLIDE 49

Ways to simplify the problem

  • Example from an old version of Oracle (for both cost based and rule based):
  • The optimizer first determines whether joining two or more of the tables definitely results in a row source containing at most one row. The optimizer recognizes such situations based on UNIQUE and PRIMARY KEY constraints on the tables. If such a situation exists, the optimizer places these tables first in the join order. The optimizer then optimizes the join of the remaining set of tables.
  • For join statements with outer join conditions, the table with the outer join operator must come after the other table in the condition in the join order. The optimizer does not consider join orders that violate this rule.


https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm

SLIDE 50

Example from systems

With the cost-based approach, the optimizer generates a set of execution plans based on the possible join orders, join operations, and available access paths. The optimizer then estimates the cost of each plan and chooses the one with the lowest cost. The optimizer estimates costs in these ways:

  • The cost of a nested loops operation is based on the cost of reading each selected row of the outer table and each of its matching rows of the inner table into memory. The optimizer estimates these costs using the statistics in the data dictionary.
  • The cost of a sort-merge join is based largely on the cost of reading all the sources into memory and sorting them.

The optimizer also considers other factors when determining the cost of each operation. For example:

  • A smaller sort area size is likely to increase the cost for a sort-merge join because sorting takes more CPU time and I/O in a smaller sort area. Sort area size is specified by the initialization parameter SORT_AREA_SIZE.
  • A larger multiblock read count is likely to decrease the cost for a sort-merge join in relation to a nested loops join. If a large number of sequential blocks can be read from disk in a single I/O, an index on the inner table for the nested loops join is less likely to improve performance over a full table scan. The multiblock read count is specified by the initialization parameter DB_FILE_MULTIBLOCK_READ_COUNT.
  • For join statements with outer join conditions, the table with the outer join operator must come after the other table in the condition in the join order. The optimizer does not consider join orders that violate this rule.


https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm

SLIDE 51

Complementary reading

  • Take a look at some of the manuals indicated in the slides, and look at the following documents to get an idea of how what we have discussed maps to real systems:
  • Oracle's query optimizer (Oracle 19): https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-optimizer-with-oracledb-19c-5324206.pdf
  • SQL Server query processing: https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide?view=sql-server-ver15
