Data Management Systems Query Processing Execution models Views - - PowerPoint PPT Presentation

data management systems
SMART_READER_LITE
LIVE PREVIEW

Data Management Systems Query Processing Execution models Views - - PowerPoint PPT Presentation

Data Management Systems Query Processing Execution models Views and Schemas Optimization I heuristics & Query rewriting rewriting Optimization II cost models Basic optimization heuristics Optimization III -


slide-1
SLIDE 1

Data Management Systems

  • Query Processing
  • Execution models
  • Optimization I – heuristics &

rewriting

  • Optimization II – cost models
  • Optimization III - Operators

Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich

1 Query processing Optimization I

Views and Schemas Query rewriting Basic optimization heuristics

slide-2
SLIDE 2

Views and Schemas

Query processing Optimization I 2

slide-3
SLIDE 3

Let’s take one step back

  • Remember “logical data independence” and “access controls”?
  • They are implemented using views
  • What is exactly a view?
  • Have you asked yourselves where queries come from?
  • Most often, from applications and programs
  • Heavy use of templates and parameterized queries
  • Database engines aim at making it easier to provide these
  • abstractions. In the process, there is a lot going on in the background

–mostly related to query processing

Query processing Optimization I 3

slide-4
SLIDE 4

A Database with a View

  • In the relational model:
  • with the exception of base tables, a query and a table are the same
  • the table contains the actual data, the query tells you how to get the data in

the table (i.e., the table itself)

  • both can be used to reason about queries
  • A view, when created, results in a virtual table being added to the

schema

  • A materialized view results in an actual table being added to the

schema with the data described in the view

Query processing Optimization I 4

slide-5
SLIDE 5

Example of a view

QUERY: SELECT employee_id, last_name, job_id, manager_id, dept_id FROM employees VIEW: CREATE VIEW staff AS SELECT employee_id, last_name, job_id, manager_id, dept_id FROM employees

Query processing Optimization I 5

https://docs.oracle.com/cd/B19306_01/network.102/b14266/ accessre.htm#i1007436

slide-6
SLIDE 6

Why views?

  • Views are extensively used
  • To implement logical data independence (create a schema different from the
  • riginal one)
  • To make it easier to write applications (writing SQL is programming) by

providing views that combine data in useful ways

  • To make it easier to write queries by writing intermediate stages as views
  • For access control by giving a user access to a view instead of to the base

tables

  • To speed up processing by materializing the view and having the data already

pre-processed for some queries (e.g., a join)

Query processing Optimization I 6

slide-7
SLIDE 7

Schemas

  • Schemas provide the basic
  • rganization of the data
  • Views allow to tailor that logical
  • rganization to the needs of

particular applications without changing the basic schema:

  • Orders in a given district
  • Orders in a period of time
  • Orders from a customer

Query processing Optimization I 7

TPC-C schema

slide-8
SLIDE 8

Analytics (TPC-H)

  • Databases for analytics often use a

start schema:

  • Fact table (very big) with the central

element of the schema

  • Dimension tables (typically smaller):

with more data on the different attributes that are mentioned in the fact table

  • It helps separate the relevant
  • perational data from information

that might be needed but it is not central to processing, e.g., an

  • rder

Query processing Optimization I 8

slide-9
SLIDE 9

Snowflake schema

  • In star schemas, the dimension tables and the fact table

are not normalized

  • A snowflake schema is a star schema where the

dimension tables are normalized (some or all of them). Normalization is applied to low cardinality attributes to remove redundancy

Query processing Optimization I 9

slide-10
SLIDE 10

Modern analytics (TPC-DS): Snow-storm

Query processing Optimization I 10

  • Multiple snow-flake schemas linked to each other
  • TPC-DS
  • Large number of tables (26)
  • Large number of columns per table (38)
  • Multiple fact tables to enable joins between large tables (fact to fact joins)
slide-11
SLIDE 11

Schemas and views

  • The more complex the schema (very common in serious applications),

the more extensive use of views:

  • Provide an application with the data it needs without having to understand

the whole schema

  • Since many applications will be running off the same schema (marketing,

sales, auditing, logistics, etc.), each one can get just the data it needs simplifying application development

  • Common operations over the schema can be captured with materialized

views

Query processing Optimization I 11

slide-12
SLIDE 12

Where do queries come from?

  • In a commercial setting, it is very rare that a database will be used

interactively with a user typing queries directly at a terminal

  • Databases are used programmatically
  • An application program contains queries that are run to extract the data the

program needs

  • A user interface generates queries after the user checks some options
  • A service can be provided by providing an RPC like interface to already written

queries that implement the service (users call a procedure, not a query)

  • Database engines aim at simplifying how to write such queries to be

used by programs and applications

Query processing Optimization I 12

slide-13
SLIDE 13

Query templates in user interfaces

Brand  Lenovo  HP  Apple  Samsung Price range  Less than 800  Between 800 and 1500  More than 1500

Query processing Optimization I 13

SELECT model, price, delivery_date FROM Laptops WHERE Brand = X AND price < Y AND price > Z

slide-14
SLIDE 14

Implications for query processing

  • The use of views and query templates affects query processing:
  • Query might be expressed over a view instead of over base tables
  • A query might contain many conjunctions and disjunctions of arbitrarily

complex predicates

  • Such queries might not be the most efficient way to express the query

Query processing Optimization I 14

slide-15
SLIDE 15

Query rewriting

Query processing Optimization I 15

slide-16
SLIDE 16

Anatomy of query processing

Query processing introduction 16

QUERY PARSER QUERY INTERMEDIATE REPRESENTATION (Abstract Syntax Tree) DB SCHEMA REWRITING OPERATOR TREE (PLAN) STATISTICS OPTIMIZATION CODE GENERATION QUERY EXECUTION QUERY PLAN CODE/ PLAN Validation, access control Check caches Interpretation Compilation

slide-17
SLIDE 17

IBM DB2

Query processing introduction 17

https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5 .0/com.ibm.db2.luw.admin.perf.doc/doc/c0005292.html

slide-18
SLIDE 18

Query rewritring

  • Query rewriting refers to transforming the original SQL query into an

equivalent SQL query that:

  • Is more efficient by removing operations
  • Gives the query optimizer more freedom to operate
  • Makes more explicit what the query wants to do (and the optimizer can then

act accordingly)

  • Maps the query to actual base tables and views as needed and for efficiency

reasons

  • Query rewriting is typically done just looking at the schema, without

paying attention to statistics or cost estimates

Query processing Optimization I 18

slide-19
SLIDE 19

Rewriting Predicates

  • Predicates indicate operations to be performed on a tuple or tuples

SELECT * FROM T WHERE T.salary > 50.000 AND T.age < 45 AND T.dept = D17

  • Think about it as a big “if … then … else” construct that must be run
  • n every tuple of table T
  • If we find ways to reduce the number of comparisons that need to be

made, the query will run faster

  • If we avoid having to go several times over the same tuple, each one

checking a different predicate, the query will run faster

Query processing Optimization I 19

slide-20
SLIDE 20

Predicate transformation

Query processing Optimization I 20

SELECT * FROM T WHERE (T.price > 50 AND T.price < 100) OR (T.price > 90 AND T.price < 200) SELECT * FROM T WHERE (T.price > 50 AND T.price < 200) SELECT * FROM employee WHERE deptno = 'D11' or deptno = 'D21' or deptno = 'E21' SELECT * FROM employee where deptno in ('D11', 'D21', 'E21') Preferred if there is an index over deptno (follow the index for the three values) Preferred if there is no index over deptno (match every tuple against set of values) https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0 /com.ibm.db2.luw.admin.perf.doc/doc/c0005296.html

slide-21
SLIDE 21

Predicate augmentation

Query processing Optimization I 21

SELECT empno, lastname, firstname, deptno, deptname FROM employee emp, department dept WHERE emp.workdept = dept.deptno AND dept.deptno > 'E00' SELECT empno, lastname, firstname, deptno, deptname FROM employee emp, department dept WHERE emp.workdept = dept.deptno AND dept.deptno > 'E00‘ AND emp.workdept > ’E00’ The query is telling us that emp.workdept and dept.deptno match and can be compared By adding the second predicate on emp.workdept, the optimizer can filter out tuples form the table emp before doing the join, making it cheaper https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0 /com.ibm.db2.luw.admin.perf.doc/doc/c0005296.html

slide-22
SLIDE 22

Predicate augmentation

Query processing Optimization I 22

SELECT * FROM T, R, S WHERE T.id = R.id AND R.id = S.id SELECT * FROM T, R, S WHERE T.id = R.id AND R.id = S.id AND T.id = S.id As written, the query join T and R, and R and S But that is not the only

  • ption

By adding T.id = S.id, we are telling the optimizer that it can choose any combination

  • f joins of those three tables
slide-23
SLIDE 23

Predicate augmentation

  • Sometimes, predicate augmentation is also called transitive closure:
  • If A=B and B=C then A=C
  • Translated to queries, if two of those equality predicates appear in the query,

the query rewriter will add the third one

  • Applies to joins with many tables:
  • If A=B and B=C and C=D … then A=C, A=D, B=D …
  • Query rewriter will add the transitive closure of those predicates
  • The same applies to predicates involving constants
  • If A=B and B> 1000 then A>1000
  • Using transitivity to include more predicates gives the optimizer more
  • ptions to consider

Query processing Optimization I 23

slide-24
SLIDE 24

Why giving the optimizer more freedom helps

  • Assume T R produces many results
  • Assume R S produces many results
  • Assume T S produces almost no results

Query processing Optimization I 24

T R S T R S T S R

slide-25
SLIDE 25

Arithmetic

Query processing Optimization I 25

SELECT sum(salary+bonus+comm) as osum, avg(salary+bonus+comm) as oavg, count(*) as ocount FROM employee SELECT osum, osum/ocount, ocount FROM ( SELECT sum(salary+bonus+comm) as osum, count(*) as ocount FROM employee ) as shared_agg As written, the query performs three calculations for each tuple: SUM, AVG, COUNT Now the query performs only two calculations for each tuple: SUM, COUNT and derives the AVG from the

  • thers

https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5 .0/com.ibm.db2.luw.admin.perf.doc/doc/c0005294.html

slide-26
SLIDE 26

Predicate pushdown

Query processing Optimization I 26

CREATE VIEW d11_employee (empno, firstnme, lastname, phoneno, salary, bonus, comm) AS SELECT empno, firstnme, lastname, phoneno, salary, bonus, comm FROM employee WHERE workdept = 'D11' SELECT firstnme, phoneno FROM d11_employee WHERE lastname = 'BROWN' SELECT firstnme, phoneno FROM employee WHERE lastname = 'BROWN' and workdept = 'D11' VIEW As written, the query computes the view and then filters for the lastname. It goes over the data twice The query works on the base table and does a single pass

  • ver the data
slide-27
SLIDE 27

View Folding (predicate pushdown)

Query processing Optimization I 27

CREATE VIEW sales_by_product AS SELECT product_key, product_name, SUM(quantity*amount) AS total FROM sales, product WHERE sales_product_key = product_key GROUP BY product_key, product_name; VIEW SELECT product_name FROM sales_by_product WHERE total > 50000; SELECT product.product_name FROM sales, product WHERE sales_product_key=product.product_key GROUP BY product.product_key, product.product_name HAVING (SUM(quantity * amount))>50000; https://docs.teradata.com/reader/Daz9Bt8GiwSdtthYFn~vdw/Oi0XrKuoru9IJlPR6dn9Cg

slide-28
SLIDE 28

Views and base tables

Query processing Optimization I 28

CREATE VIEW emp_education (empno, firstnme, lastname, edlevel) AS SELECT empno, firstnme, lastname, edlevel FROM employee WHERE edlevel > 17 CREATE VIEW emp_salaries (empno, firstname, lastname, salary) AS SELECT empno, firstnme, lastname, salary FROM employee WHERE salary > 35000 SELECT e1.empno, e1.firstnme, e1.lastname, e1.edlevel, e2.salary FROM emp_education e1, emp_salaries e2 WHERE e1.empno = e2.empno SELECT e1.empno, e1.firstnme, e1.lastname, e1.edlevel, e2.salary FROM employee e1, employee e2 WHERE e1.empno = e2.empno and e1.edlevel > 17 and e2.salary > 35000 Merge views by joining base tables Eliminate join (same base table) SELECT empno, firstnme, lastname, edlevel, salary FROM employee WHERE edlevel > 17 and salary > 35000 https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5 .0/com.ibm.db2.luw.admin.perf.doc/doc/c0005294.html

slide-29
SLIDE 29

Queries over views

  • Some of the query rewriting involving views might seem trivial but it

is necessary:

  • User might not have access to base tables (access controls)
  • Queries might be automatically generated and the easiest way to do it is to

treat the view as a table

  • Views can be optimized on their own and their plan embedded into

the global plan but that is unlikely to be optimal

  • Predicate push down to views
  • Pushing operations to underlying base table
  • Removing views and working directly on base table

Query processing Optimization I 29

slide-30
SLIDE 30

Unnesting of queries

Query processing Optimization I 30

SELECT empno, firstnme, lastname, phoneno FROM employee WHERE workdept in (SELECT deptno FROM department WHERE deptname = 'OPERATIONS') SELECT empno, firstnme, lastname, phoneno FROM employee emp, department dept WHERE emp.workdept = dept.deptno and dept.deptname = 'OPERATIONS' These are two queries and the check against the nested query is an expensive

  • peration (a scan)

There is only one query with a join, a more efficient way to execute the query https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5 .0/com.ibm.db2.luw.admin.perf.doc/doc/c0005294.html

slide-31
SLIDE 31

Why are there nested queries?

  • Nested queries might seem more complex but, in reality, make the

writing of complex SQL queries easier:

  • Divide the problem into several steps
  • Map each steps into a query
  • Use nesting to combine them
  • Rely on the query rewriter and optimizer to make the whole thing more

efficient

  • Nested queries are also a natural way to work with views (the view is

included in the query as a nested query)

Query processing Optimization I 31

slide-32
SLIDE 32

Optimization using views

  • If views are materialized, it pays off to use a view rather than going to

the base table. This can be done even for partial matches between the query and the view

  • An example from Apache Calcite:

MATERIALIZED VIEW (mv): SELECT a, b, c FROM T WHERE x = 5 QUERY: SELECT a, c FROM T WHERE x = 5 AND b=4 QUERY REWRITE: SELECT a,c FROM mv WHERE b=4

https://calcite.apache.org/docs/materialized_views.html

Query processing Optimization I 32

slide-33
SLIDE 33

A final example

Query processing Optimization I 33

SELECT distinct empno, firstnme, lastname FROM employee If in table employee, empno (employee number) is a key, then distinct is not needed:

  • Key attributes are already unique by definition

Query rewriting can look at the schema and see that empno is the key, so it can remove distinct: SELECT empno, firstnme, lastname FROM employee This matters because “distinct” is expensive to implement: it requires to, e.g., sort the results to remove tuples with the same empno. By removing “distinct” the query optimizer does not have to worry about it

https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5 .0/com.ibm.db2.luw.admin.perf.doc/doc/c0005295.html

slide-34
SLIDE 34

Basic optimization heuristics

Query processing Optimization I 34

slide-35
SLIDE 35

Why can we do all this?

  • Let’s go back to basics:
  • Relational model
  • Relational algebra
  • We have a formalism to prove equivalence among queries, thereby

enabling rewriting and transformations with the guarantee that the result will be the same

Query processing Optimization I 35

slide-36
SLIDE 36

Equivalence Rules

Conjunctive selection operations can be deconstructed into a sequence of individual selections.

)) ( ( ) (

2 1 2 1

E E

   

   

The rule allows to split a complex predicate into smaller parts that are applied separately SELECT * FROM T WHERE T.X > 5 AND T.Y > 15 AND … SELECT * FROM (SELECT * FROM (SELECT * FROM T WHERE …) WHERE T.Y > 15) WHERE T.X = 15)

slide-37
SLIDE 37

Equivalence Rules

Selection operations are commutative.

)) ( ( ) (

2 1 2 1

E E

   

   

Selection predicates can be applied in different orders Combined with the previous rule:

  • Complex conjunctive predicate can be split
  • We then apply the most selective predicate first (cost based optimization)
  • In row stores, we check a tuple starting with the most selective predicate
  • In column stores, we start with the column with the most selective predicate
slide-38
SLIDE 38

Equivalence Rules 3

Selections can be combined with Cartesian products and theta joins.

  • a. (E1 X E2) = E1

 E2

  • b. 1(E1

2 E2) = E1 1 2 E2

SELECT a, b, c, d, … FROM (SELECT * FROM R, S) WHERE a > b SELECT a, b, c, d, … FROM R, S WHERE a > b SELECT a, b, c, d, … FROM (SELECT * FROM R, S WHERE R.a > S.b) WHERE c = d SELECT a, b, c, d, … FROM R, S WHERE a > b and c = d

slide-39
SLIDE 39

Equivalence Rules 4

Theta-join operations (and natural joins) are commutative. E1  E2 = E2

 E1

Natural join operations are associative: (E1 E2) E3 = E1 (E2 E3) Theta joins are associative in the following manner: (E1 1 E2) 2  3 E3 = E1 1 3 (E2

2 E3)

where 2 involves attributes from only E2 and E3.

These rules allow to change the order in which joins are done

slide-40
SLIDE 40

Equivalence Rules 5

The selection operation distributes over the theta join operation under the following two conditions: (a) When all the attributes in 0 involve only the attributes of one of the expressions (E1) being joined. 0E1

 E2) = (0(E1))  E2

(b) When  1 involves only the attributes of E1 and 2 involves only the attributes of E2. 1E1

 E2) = (1(E1))  ( (E2))

This rule is what allows predicate pushdown

slide-41
SLIDE 41

Pictorial Depiction of Equivalence Rules

slide-42
SLIDE 42

Equivalence Rules

The projections operation distributes over the theta join operation as follows: (a) if L involves only attributes from L1  L2: (b) Consider a join E1  E2.

  • Let L1 and L2 be sets of attributes from E1 and E2, respectively.
  • Let L3 be attributes of E1 that are involved in join condition , but are

not in L1  L2, and

  • let L4 be attributes of E2 that are involved in join condition , but are

not in L1  L2.

)) ( ( )) ( ( ) (

2 ...... 1 2 ....... 1

2 1 2 1

E E E E

L L L L

   

  

))) ( ( )) ( (( ) ..... (

2 ...... 1 2 1

4 2 3 1 2 1 2 1

E E E E

L L L L L L L L    

    

 

This rule enables projection push down

slide-43
SLIDE 43

Applying these rules

slide-44
SLIDE 44

More Rules

The set operations union and intersection are commutative (set difference is not commutative). E1  E2 = E2  E1 E1  E2 = E2  E1 Set union and intersection are associative. (E1  E2)  E3 = E1  (E2  E3) (E1  E2)  E3 = E1  (E2  E3) The selection operation distributes over ,  and –.  (E1 – E2) =  (E1) – (E2) and similarly for  and  in place of – Also:  (E1 – E2) = (E1) – E2 and similarly for  in place of –, but not for  The projection operation distributes over union L(E1  E2) = (L(E1))  (L(E2))

slide-45
SLIDE 45

Back to query rewriting

  • These rules enable to transform a query in many different equivalent

queries

  • The role of the optimizer is to consider as many as possible of these

equivalent queries (plans) and figure out the best possible one

  • Query rewriting helps by
  • Simplifying the way queries are represented (e.g., view folding)
  • Given the optimizer more options (e.g., predicate augmentation, transitive

closure of predicates)

  • Removing syntactic sugar and programming artifacts to give the optimizer a

clearer view of what the query does (e.g., un-nesting of queries)

Query processing Optimization I 45