[PPT] - Query Optimization Lecture # 13 Database Systems Andy Pavlo AP PowerPoint Presentation

SLIDE 1

Database Systems 15-445/15-645 Fall 2018 Andy Pavlo Computer Science Carnegie Mellon Univ.

AP AP

Lecture # 13

Query Optimization

SLIDE 2

CMU 15-445/645 (Fall 2018)

ADM IN ISTRIVIA

Mid-term Exam is on Wednesday October 17th

→ See mid-term exam guide for more info.

Project #2 – Checkpoint #2 is due Friday October 19th @ 11:59pm.

2

SLIDE 3

CMU 15-445/645 (Fall 2018)

Q UERY O PTIM IZATIO N

Remember that SQL is declarative.

→ User tells the DBMS what answer they want, not how to get the answer.

There can be a big difference in performance based

n plan is used:

→ See last week: 1.3 hours vs. 0.45 seconds

4

SLIDE 4

CMU 15-445/645 (Fall 2018)

IBM SYSTEM R

First implementation of a query optimizer. People argued that the DBMS could never choose a query plan better than what a human could write. A lot of the concepts from System R’s optimizer are still used today.

5

SLIDE 5

CMU 15-445/645 (Fall 2018)

Q UERY O PTIM IZATIO N

Heuristics / Rules

→ Rewrite the query to remove stupid / inefficient things. → Does not require a cost model.

Cost-based Search

→ Use a cost model to evaluate multiple equivalent plans and pick the one with the lowest cost.

6

SLIDE 6

CMU 15-445/645 (Fall 2018)

Q UERY PLAN N IN G OVERVIEW

7

SQL Query

Parser

Abstract Syntax Tree Annotated AST

Query Plan Cost Model System Catalog Rewriter

(Optional)

Binder Optimizer

Annotated AST Name→Internal ID

SLIDE 7

CMU 15-445/645 (Fall 2018)

TO DAY'S AGEN DA

Relational Algebra Equivalences Plan Cost Estimation Plan Enumeration Nested Sub-queries Mid-Term Review

8

SLIDE 8

CMU 15-445/645 (Fall 2018)

RELATIO N AL ALGEBRA EQ UIVALEN CES

Two relational algebra expressions are equivalent if they generate the same set of tuples. The DBMS can identify better query plans without a cost model. This is often called query rewriting.

9

SLIDE 9

CMU 15-445/645 (Fall 2018)

PREDICATE PUSH DOWN

10

stud student ent enr enrolled lled

s.sid=e.sid grade='A' s.name,e.cid

⨝

s

p

stud student ent enr enrolled lled

s.sid=e.sid grade='A' s.name,e.cid

s

p

⨝

SELECT s.name, e.cid FROM student AS s, enrolled AS e WHERE s.sid = e.sid AND e.grade = 'A'

SLIDE 10

CMU 15-445/645 (Fall 2018)

RELATIO N AL ALGEBRA EQ UIVALEN CES

11

name, cid(σgrade='A'(student⋈enrolled)) name, cid(student⋈(σgrade='A'(enrolled)))

=

SELECT s.name, e.cid FROM student AS s, enrolled AS e WHERE s.sid = e.sid AND e.grade = 'A'

SLIDE 11

CMU 15-445/645 (Fall 2018)

RELATIO N AL ALGEBRA EQ UIVALEN CES

Selections:

→ Perform filters as early as possible. → Reorder predicates so that the DBMS applies the most selective one first. → Break a complex predicate, and push down

σp1∧p2∧…pn(R) = σp1(σp2(…σpn(R)))

Simplify a complex predicate

→ (X=Y AND Y=3) → X=3 AND Y=3

12

SLIDE 12

CMU 15-445/645 (Fall 2018)

RELATIO N AL ALGEBRA EQ UIVALEN CES

Projections:

→ Perform them early to create smaller tuples and reduce intermediate results (if duplicates are eliminated) → Project out all attributes except the ones requested or required (e.g., joining keys)

This is not important for a column store…

13

SLIDE 13

CMU 15-445/645 (Fall 2018)

PRO J ECTIO N PUSH DOWN

14

stud student ent enr enrolled lled

s.sid=e.sid grade='A' s.name,e.cid

⨝

s

p

stud student ent enr enrolled lled

s.sid=e.sid grade='A' s.name,e.cid

⨝

s

p

sid,cid

p

sid,namep

SELECT s.name, e.cid FROM student AS s, enrolled AS e WHERE s.sid = e.sid AND e.grade = 'A'

SLIDE 14

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Impossible / Unnecessary Predicates

15

Source: Lukas Eder

SELECT * FROM A WHERE 1 = 0;X

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 15

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Impossible / Unnecessary Predicates

15

Source: Lukas Eder

SELECT * FROM A WHERE 1 = 0; SELECT * FROM A WHERE 1 = 1;

X

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 16

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Impossible / Unnecessary Predicates

15

Source: Lukas Eder

SELECT * FROM A WHERE 1 = 0; SELECT * FROM A WHERE 1 = 1; SELECT * FROM A;

X

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 17

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Impossible / Unnecessary Predicates Join Elimination

15

Source: Lukas Eder

SELECT * FROM A WHERE 1 = 0; SELECT A1.* FROM A AS A1 JOIN A AS A2 ON A1.id = A2.id; SELECT * FROM A WHERE 1 = 1; SELECT * FROM A;

X

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 18

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Impossible / Unnecessary Predicates Join Elimination

15

Source: Lukas Eder

SELECT * FROM A WHERE 1 = 0; SELECT * FROM A WHERE 1 = 1; SELECT * FROM A; SELECT * FROM A;

X

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 19

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Ignoring Projections

16

Source: Lukas Eder

SELECT * FROM A AS A1 WHERE EXISTS(SELECT * FROM A AS A2 WHERE A1.id = A2.id);

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 20

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Ignoring Projections

16

Source: Lukas Eder

SELECT * FROM A;

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 21

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Ignoring Projections Merging Predicates

16

Source: Lukas Eder

SELECT * FROM A WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150; SELECT * FROM A;

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 22

CMU 15-445/645 (Fall 2018)

M O RE EXAM PLES

Ignoring Projections Merging Predicates

16

Source: Lukas Eder

SELECT * FROM A WHERE val BETWEEN 1 AND 150; SELECT * FROM A;

CREATE TABLE A ( id INT PRIMARY KEY, val INT NOT NULL );

SLIDE 23

CMU 15-445/645 (Fall 2018)

RELATIO N AL ALGEBRA EQ UIVALEN CES

Joins:

→ Commutative, associative

R⋈S = S⋈R (R⋈S)⋈T = R⋈(S⋈T) How many different orderings are there for an n- way join?

17

SLIDE 24

CMU 15-445/645 (Fall 2018)

RELATIO N AL ALGEBRA EQ UIVALEN CES

How many different orderings are there for an n- way join? Catalan number ≈4n

→ Exhaustive enumeration will be too slow.

We’ll see in a second how an optimizer limits the search space...

18

SLIDE 25

CMU 15-445/645 (Fall 2018)

CO ST ESTIM ATIO N

How long will a query take?

→ CPU: Small cost; tough to estimate → Disk: # of block transfers → Memory: Amount of DRAM used → Network: # of messages

How many tuples will be read/written? What statistics do we need to keep?

19

SLIDE 26

CMU 15-445/645 (Fall 2018)

STATISTICS

The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog. Different systems update them at different times. Manual invocations:

→ Postgres/SQLite: ANALYZE → Oracle/MySQL: ANALYZE TABLE → SQL Server: UPDATE STATISTICS → DB2: RUNSTATS

20

SLIDE 27

CMU 15-445/645 (Fall 2018)

STATISTICS

For each relation R, the DBMS maintains the following information:

→ NR: Number of tuples in R. → V(A,R): Number of distinct values for attribute A.

21

SLIDE 28

CMU 15-445/645 (Fall 2018)

DERIVABLE STATISTICS

The selection cardinality SC(A,R) is the average number of records with a value for an attribute A given NR / V(A,R) Note that this assumes data uniformity.

→ 10,000 students, 10 colleges – how many students in SCS?

22

SLIDE 29

CMU 15-445/645 (Fall 2018)

SELECTIO N STATISTICS

Equality predicates on unique keys are easy to estimate. What about more complex predicates? What is their selectivity?

23

SELECT * FROM people WHERE id = 123 SELECT * FROM people WHERE val > 1000 SELECT * FROM people WHERE age = 30 AND status = 'Lit'

SLIDE 30

CMU 15-445/645 (Fall 2018)

CO M PLEX PREDICATES

The selectivity (sel) of a predicate P is the fraction of tuples that qualify. Formula depends on type of predicate:

→ Equality → Range → Negation → Conjunction → Disjunction

24

SLIDE 31

CMU 15-445/645 (Fall 2018)

CO M PLEX PREDICATES

The selectivity (sel) of a predicate P is the fraction of tuples that qualify. Formula depends on type of predicate:

→ Equality → Range → Negation → Conjunction → Disjunction

24

SLIDE 32

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Assume that V(age,people) has five distinct values (0–4) and NR = 5 Equality Predicate: A=constant

→ sel(A=constant) = SC(P) / V(A,R) → Example: sel(age=2) =

25

1 2 3 4

count age

SELECT * FROM people WHERE age = 2

SLIDE 33

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Assume that V(age,people) has five distinct values (0–4) and NR = 5 Equality Predicate: A=constant

→ sel(A=constant) = SC(P) / V(A,R) → Example: sel(age=2) =

25

1 2 3 4

count age

V(age,people)=5

SELECT * FROM people WHERE age = 2

SLIDE 34

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Assume that V(age,people) has five distinct values (0–4) and NR = 5 Equality Predicate: A=constant

→ sel(A=constant) = SC(P) / V(A,R) → Example: sel(age=2) =

25

1 2 3 4

count age

V(age,people)=5 SC(age=2)=1

SELECT * FROM people WHERE age = 2 1/5

SLIDE 35

CMU 15-445/645 (Fall 2018)

1 2 3 4

count age

SELECTIO N S CO M PLEX PREDICATES

Range Query:

→ sel(A>=a) = (Amax – a) / (Amax – Amin) → Example: sel(age>=2)

26

= (4 – 2) / (4 – 0) = 1/2

agemin = 0

SELECT * FROM people WHERE age >= 2

agemax = 4

SLIDE 36

CMU 15-445/645 (Fall 2018)

1 2 3 4

count age

SELECTIO N S CO M PLEX PREDICATES

Negation Query:

→ sel(not P) = 1 – sel(P) → Example: sel(age != 2)

27

SC(age=2)=1

SELECT * FROM people WHERE age != 2

SLIDE 37

CMU 15-445/645 (Fall 2018)

1 2 3 4

count age

SELECTIO N S CO M PLEX PREDICATES

Negation Query:

→ sel(not P) = 1 – sel(P) → Example: sel(age != 2)

Observation: Selectivity ≈ Probability

27

= 1 – (1/5) = 4/5

SC(age!=2)=2 SC(age!=2)=2

SELECT * FROM people WHERE age != 2

SLIDE 38

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Conjunction:

→ sel(P1 ⋀ P2) = sel(P1) ∙ sel(P2) → sel(age=2 ⋀ name LIKE 'A%')

This assumes that the predicates are independent.

28

SELECT * FROM people WHERE age = 2 AND name LIKE 'A%'

P1 P2

SLIDE 39

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Conjunction:

→ sel(P1 ⋀ P2) = sel(P1) ∙ sel(P2) → sel(age=2 ⋀ name LIKE 'A%')

This assumes that the predicates are independent.

28

SELECT * FROM people WHERE age = 2 AND name LIKE 'A%'

P1 P2

SLIDE 40

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Conjunction:

→ sel(P1 ⋀ P2) = sel(P1) ∙ sel(P2) → sel(age=2 ⋀ name LIKE 'A%')

This assumes that the predicates are independent.

28

SELECT * FROM people WHERE age = 2 AND name LIKE 'A%'

P1 P2

SLIDE 41

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Disjunction:

→ sel(P1 ⋁ P2) = sel(P1) + sel(P2) – sel(P1⋁P2) = sel(P1) + sel(P2) – sel(P1) ∙ sel(P2) → sel(age=2 OR name LIKE 'A%')

This again assumes that the selectivities are independent.

29

SELECT * FROM people WHERE age = 2 OR name LIKE 'A%'

P1 P2

SLIDE 42

CMU 15-445/645 (Fall 2018)

SELECTIO N S CO M PLEX PREDICATES

Disjunction:

→ sel(P1 ⋁ P2) = sel(P1) + sel(P2) – sel(P1⋁P2) = sel(P1) + sel(P2) – sel(P1) ∙ sel(P2) → sel(age=2 OR name LIKE 'A%')

This again assumes that the selectivities are independent.

29

SELECT * FROM people WHERE age = 2 OR name LIKE 'A%'

P1 P2

SLIDE 43

CMU 15-445/645 (Fall 2018)

RESULT SIZE ESTIM ATIO N FO R J O IN S

Given a join of R and S, what is the range of possible result sizes in # of tuples? In other words, for a given tuple of R, how many tuples of S will it match?

30

SLIDE 44

CMU 15-445/645 (Fall 2018)

RESULT SIZE ESTIM ATIO N FO R J O IN S

General case: Rcols⋂Scols={A} where A is not a key for either table.

→ Match each R-tuple with S-tuples: estSize ≈ NR ∙ NS / V(A,S) → Symmetrically, for S: estSize ≈ NR ∙ NS / V(A,R)

Overall:

→ estSize ≈ NR ∙ NS / max({V(A,S), V(A,R)})

31

SLIDE 45

CMU 15-445/645 (Fall 2018)

CO ST ESTIM ATIO N S

Our formulas are nice but we assume that data values are uniformly distributed.

32

5 1 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 Uniform Approximation

Distinct values of attribute # of occurrences

SLIDE 46

CMU 15-445/645 (Fall 2018)

CO ST ESTIM ATIO N S

Our formulas are nice but we assume that data values are uniformly distributed.

33

Bucket #1 Count=8 Bucket #2 Count=4 Bucket #3 Count=1 5 Bucket #4 Count=3 Bucket #5 Count=1 4

5 1 1 5

1

3

4-6 7-9 1 0-1 2 1 3-1 5

Non-Uniform Approximation

Bucket Ranges

SLIDE 47

CMU 15-445/645 (Fall 2018)

H ISTO GRAM S WITH Q UAN TILES

A histogram type wherein the "spread" of each bucket is same.

34

5 1 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 1 4 1 5 Equi-width Histogram (Quantiles)

Bucket #1 Count=1 2 Bucket #2 Count=1 2 Bucket #3 Count=9 Bucket #4 Count=1 2

SLIDE 48

CMU 15-445/645 (Fall 2018)

H ISTO GRAM S WITH Q UAN TILES

A histogram type wherein the "spread" of each bucket is same.

34

5 1 1 5 1

5

6-8 9-1 3 1 4-1 5 Equi-width Histogram (Quantiles)

SLIDE 49

CMU 15-445/645 (Fall 2018)

SAM PLIN G

Modern DBMSs also collect samples from tables to estimate selectivities. Update samples when the underlying tables changes significantly.

35

⋮

1 billion tuples

SELECT AVG(age) FROM people WHERE age > 50

id name age status 1001 Obama 56 Rested 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit

SLIDE 50

CMU 15-445/645 (Fall 2018)

SAM PLIN G

Modern DBMSs also collect samples from tables to estimate selectivities. Update samples when the underlying tables changes significantly.

35

⋮

1 billion tuples

sel(age>50) = SELECT AVG(age) FROM people WHERE age > 50

id name age status 1001 Obama 56 Rested 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit 1001 Obama 56 Rested 1003 Tupac 25 Dead 1005 Andy 37 Lit

Table Sample

SLIDE 51

CMU 15-445/645 (Fall 2018)

SAM PLIN G

Modern DBMSs also collect samples from tables to estimate selectivities. Update samples when the underlying tables changes significantly.

35

⋮

1 billion tuples

1/3 sel(age>50) = SELECT AVG(age) FROM people WHERE age > 50

id name age status 1001 Obama 56 Rested 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit 1001 Obama 56 Rested 1003 Tupac 25 Dead 1005 Andy 37 Lit

Table Sample

SLIDE 52

CMU 15-445/645 (Fall 2018)

O BSERVATIO N

Now that we can (roughly) estimate the selectivity

f predicates, what can we actually do with them?

36

SLIDE 53

CMU 15-445/645 (Fall 2018)

Q UERY O PTIM IZATIO N

After performing rule-based rewriting, the DBMS will enumerate different plans for the query and estimate their costs.

→ Single relation. → Multiple relations. → Nested sub-queries.

It chooses the best plan it has seen for the query after exhausting all plans or some timeout.

37

SLIDE 54

CMU 15-445/645 (Fall 2018)

SIN GLE- RELATIO N Q UERY PLAN N IN G

Pick the best access method.

→ Sequential Scan → Binary Search (clustered indexes) → Index Scan

Simple heuristics are often good enough for this. OLTP queries are especially easy.

38

SLIDE 55

CMU 15-445/645 (Fall 2018)

O LTP Q UERY PLAN N IN G

Query planning for OLTP queries is easy because they are sargable.

→ Search Argument Able → It is usually just picking the best index. → Joins are almost always on foreign key relationships with a small cardinality. → Can be implemented with simple heuristics.

39

SLIDE 56

CMU 15-445/645 (Fall 2018)

M ULTI- RELATIO N Q UERY PLAN N IN G

As number of joins increases, number of alternative plans grows rapidly

→ We need to restrict search space.

Fundamental decision in System R: only left-deep join trees are considered.

→ Modern DBMSs do not always make this assumption anymore.

40

SLIDE 57

CMU 15-445/645 (Fall 2018)

M ULTI- RELATIO N Q UERY PLAN N IN G

Fundamental decision in System R: Only consider left-deep join trees.

41

⨝ ⨝ ⨝

A B C D

⨝ ⨝ ⨝

A B C D

⨝ ⨝ ⨝

A B C D

SLIDE 58

CMU 15-445/645 (Fall 2018)

M ULTI- RELATIO N Q UERY PLAN N IN G

Fundamental decision in System R: Only consider left-deep join trees.

41

⨝ ⨝ ⨝

A B C D

⨝ ⨝ ⨝

A B C D

⨝ ⨝ ⨝

A B C D

X X

SLIDE 59

CMU 15-445/645 (Fall 2018)

M ULTI- RELATIO N Q UERY PLAN N IN G

Fundamental decision in System R: Only consider left-deep join trees. Allows for fully pipelined plans where intermediate results are not written to temp files.

→ Not all left-deep trees are fully pipelined.

42

SLIDE 60

CMU 15-445/645 (Fall 2018)

M ULTI- RELATIO N Q UERY PLAN N IN G

Enumerate the orderings

→ Example: Left-deep tree #1, Left-deep tree #2…

Enumerate the plans for each operator

→ Example: Hash, Sort-Merge, Nested Loop…

Enumerate the access paths for each table

→ Example: Index #1, Index #2, Seq Scan…

Use dynamic programming to reduce the number of cost estimations.

43

SLIDE 61

CMU 15-445/645 (Fall 2018)

DYN AM IC PRO GRAM M IN G

45

R S T

SortMerge Join

R.a=S.a

SortMerge Join

T.b=S.b

Hash Join

T.b=S.b

R ⨝ S T T ⨝ S R R ⨝ S ⨝ T

Hash Join

R.a=S.a

SELECT * FROM R, S, T WHERE R.a = S.a AND S.b = T.b

SLIDE 62

CMU 15-445/645 (Fall 2018)

DYN AM IC PRO GRAM M IN G

45

R S T

Hash Join

T.b=S.b

R ⨝ S T T ⨝ S R R ⨝ S ⨝ T

Hash Join

R.a=S.a

SELECT * FROM R, S, T WHERE R.a = S.a AND S.b = T.b

SLIDE 63

CMU 15-445/645 (Fall 2018)

DYN AM IC PRO GRAM M IN G

45

R S T

Hash Join

T.b=S.b

R ⨝ S T T ⨝ S R R ⨝ S ⨝ T

Hash Join

R.a=S.a

Hash Join

S.b=T.b

SortMerge Join

S.b=T.b

SortMerge Join

S.a=R.a

Hash Join

S.a=R.a

SELECT * FROM R, S, T WHERE R.a = S.a AND S.b = T.b

SLIDE 64

CMU 15-445/645 (Fall 2018)

DYN AM IC PRO GRAM M IN G

45

R S T

Hash Join

T.b=S.b

R ⨝ S T T ⨝ S R R ⨝ S ⨝ T

Hash Join

R.a=S.a

Hash Join

S.b=T.b

SortMerge Join

S.a=R.a

SELECT * FROM R, S, T WHERE R.a = S.a AND S.b = T.b

SLIDE 65

CMU 15-445/645 (Fall 2018)

DYN AM IC PRO GRAM M IN G

45

R S T

Hash Join

T.b=S.b

R ⨝ S T T ⨝ S R R ⨝ S ⨝ T

SortMerge Join

S.a=R.a

SELECT * FROM R, S, T WHERE R.a = S.a AND S.b = T.b

SLIDE 66

CMU 15-445/645 (Fall 2018)

CAN DIDATE PLAN EXAM PLE

How to generate plans for search algorithm:

→ Enumerate relation orderings → Enumerate join algorithm choices → Enumerate access method choices

No real DBMSs does it this way. It’s actually more messy…

46

SELECT * FROM R, S, T WHERE R.a = S.a AND S.b = T.b

SLIDE 67

CMU 15-445/645 (Fall 2018)

CAN DIDATE PLAN S

Step #1: Enumerate relation orderings

47

⨝ ⨝

T R S

⨝ ⨝

S T R

× ⨝

R S T

⨝ ⨝

R S T

⨝ ⨝

S R T

× ⨝

S T R

SLIDE 68

CMU 15-445/645 (Fall 2018)

CAN DIDATE PLAN S

Step #1: Enumerate relation orderings

47

⨝ ⨝

T R S

⨝ ⨝

S T R

× ⨝

R S T

⨝ ⨝

R S T

⨝ ⨝

S R T

× ⨝

S T R

X X

Prune plans with cross- products immediately!

SLIDE 69

CMU 15-445/645 (Fall 2018)

CAN DIDATE PLAN S

Step #2: Enumerate join algorithm choices

48

⨝ ⨝

R S T

Do this for the other plans.

R S T

NLJ NLJ

R S T

HJ NLJ

R S T

NLJ HJ

R S T

HJ HJ

SLIDE 70

CMU 15-445/645 (Fall 2018)

CAN DIDATE PLAN S

Step #2: Enumerate join algorithm choices

48

⨝ ⨝

R S T

Do this for the other plans.

R S T

NLJ NLJ

R S T

HJ NLJ

R S T

NLJ HJ

R S T

HJ HJ

SLIDE 71

CMU 15-445/645 (Fall 2018)

CAN DIDATE PLAN S

Step #3: Enumerate access method choices

49

R S T

HJ HJ

Do this for the other plans.

HJ HJ SeqScan SeqScan SeqScan HJ HJ SeqScan IndexScan(S.b) SeqScan

SLIDE 72

CMU 15-445/645 (Fall 2018)

N ESTED SUB- Q UERIES

The DBMS treats nested sub-queries in the where clause as functions that take parameters and return a single value or set of values. Two Approaches:

→ Rewrite to de-correlate and/or flatten them → Decompose nested query and store result to temporary table

50

SLIDE 73

CMU 15-445/645 (Fall 2018)

N ESTED SUB- Q UERIES: REWRITE

51

SELECT name FROM sailors AS S WHERE EXISTS ( SELECT * FROM reserves AS R WHERE S.sid = R.sid AND R.day = '2018-10-15' ) SELECT name FROM sailors AS S, reserves AS R WHERE S.sid = R.sid AND R.day = '2018-10-15'

SLIDE 74

CMU 15-445/645 (Fall 2018)

N ESTED SUB- Q UERIES: DECO M PO SE

For each sailor with the highest rating (over all sailors) and at least two reservations for red boats, find the sailor id and the earliest date on which the sailor has a reservation for a red boat.

52

SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1

SLIDE 75

CMU 15-445/645 (Fall 2018)

DECO M PO SIN G Q UERIES

For harder queries, the optimizer breaks up queries into blocks and then concentrates on one block at a time. Sub-queries are written to a temporary table that are discarded after the query finishes.

53

SLIDE 76

CMU 15-445/645 (Fall 2018)

DECO M PO SIN G Q UERIES

54

SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1

Nested Block

SLIDE 77

CMU 15-445/645 (Fall 2018)

DECO M PO SIN G Q UERIES

54

SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1

Nested Block

SELECT MAX(rating) FROM sailors

###

SLIDE 78

CMU 15-445/645 (Fall 2018)

DECO M PO SIN G Q UERIES

54

SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1 SELECT MAX(rating) FROM sailors

###

SLIDE 79

CMU 15-445/645 (Fall 2018)

DECO M PO SIN G Q UERIES

54

SELECT S.sid, MIN(R.day) FROM sailors S, reserves R, boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red' AND S.rating = (SELECT MAX(S2.rating) FROM sailors S2) GROUP BY S.sid HAVING COUNT(*) > 1

Outer Block

SELECT MAX(rating) FROM sailors

###

SLIDE 80

CMU 15-445/645 (Fall 2018)

CO N CLUSIO N

Filter early as possible. Selectivity estimations

→ Uniformity → Independence → Histograms → Join selectivity

Dynamic programming for join orderings Rewrite nested queries Query optimization is really hard…

55

SLIDE 81

CMU 15-445/645 (Fall 2018)

M i d t er m Exam

Who: You What: Midterm Exam When: Wed Oct 17th 12:00pm ‐ 1:20pm Where: Posner Mellon Auditorium Why: https://youtu.be/xgMiaIPxSlc

56

SLIDE 82

CMU 15-445/645 (Fall 2018)

M IDTERM

What to bring:

→ CMU ID → Calculator → One 8.5x11" page of notes (double-sided)

What not to bring:

→ Live animals → Your wet laundry

57

SLIDE 83

CMU 15-445/645 (Fall 2018)

M IDTERM

Covers up to Joins (inclusive).

→ Closed book, one sheet of notes (double-sided) → Please email Andy if you need special accommodations.

https://15445.courses.cs.cmu.edu/fall2018/midter m-guide.html

58

SLIDE 84

CMU 15-445/645 (Fall 2018)

RELATIO N AL M O DEL

Integrity Constraints Relation Algebra

59

SLIDE 85

CMU 15-445/645 (Fall 2018)

SQ L

Basic operations:

→ SELECT / INSERT / UPDATE / DELETE → WHERE predicates → Output control

More complex operations:

→ Joins → Aggregates → Common Table Expressions

60

SLIDE 86

CMU 15-445/645 (Fall 2018)

STO RAGE

Buffer Management Policies

→ LRU / MRU / CLOCK

On-Disk File Organization

→ Heaps → Linked Lists

Page Layout

→ Slotted Pages → Log-Structured

61

SLIDE 87

CMU 15-445/645 (Fall 2018)

H ASH IN G

Static Hashing

→ Linear Probing → Robin Hood → Cuckoo Hashing

Dynamic Hashing

→ Extendible Hashing → Linear Hashing

Comparison with B+Trees

62

SLIDE 88

CMU 15-445/645 (Fall 2018)

TREE IN DEXES

B+Tree

→ Insertions / Deletions → Splits / Merges → Difference with B-Tree → Latch Crabbing / Coupling

Radix Trees Skip Lists

63

SLIDE 89

CMU 15-445/645 (Fall 2018)

SO RTIN G

Two-way External Merge Sort General External Merge Sort Cost to sort different data sets with different number of buffers.

64

SLIDE 90

CMU 15-445/645 (Fall 2018)

Q UERY PRO CESSIN G

Processing Models

→ Advantages / Disadvantages

Join Algorithms

→ Nested Loop → Sort-Merge → Hash

65

SLIDE 91

CMU 15-445/645 (Fall 2018)

N EXT CLASS

Parallel Query Execution

66