  1. Cost-Based Optimization (Database Systems: The Complete Book, Ch. 2.3, 6.1-6.4, 15, 16.4-16.5)

  2. Optimizing

  3. Optimizing • Some equivalence rules are always good… • Which?

  4. Optimizing • Some equivalence rules are always good… • Which? • Some equivalence rules are sometimes good • Which? • What do we do about it?

  5. Cost Estimation • Compare many different plans by… • … actually running the query • … estimating the plan’s “cost”

  6. Cost Estimation

  7. Costs

  8. Costs • Memory Cost (Working Set Size)

  9. Costs • Memory Cost (Working Set Size) • Compute Cost (“Big-O”)

  10. Costs • Memory Cost (Working Set Size) • Compute Cost (“Big-O”) • IO Cost (Pages read, Pages written)

  11. The variable in all of these costs is the arity (size) of a relation.

  12. How do you compute Arities? • Heuristic Assumptions (pick a “good enough” RF, i.e., reduction factor) • Summary Statistics About The Data… • Upper/Lower Bounds or Value Domains • Distribution Summaries (Histograms) • Data Sampling

  13. How do you compute Arities? There is no perfect solution (yet)!

  14. How do you compute Arities? There is no perfect solution (yet)! We don’t need a perfect solution… … we just need one that’s good enough

  15. Summary Statistics • Per-Attribute Bounds / Domain Statistics • Assume a Uniform Distribution. • Per-Attribute Histograms • Use the histogram to model the data distribution • Data Samples • Use the samples to measure the RF

  16. Uniform Distribution: A = 1

  17. Uniform Distribution: A = 1. Chance of Hit = 1 / (# of distinct values of A)

  18. Uniform Distribution: A ∈ {1, 2, …}

  19. Uniform Distribution: A ∈ {1, 2, …}. Chance of Hit = |{1, 2, …}| / (# of distinct values of A)

  20. Uniform Distribution: A < 3

  21. Uniform Distribution: A < 3. Chance of Hit = (3 - Low(A)) / (High(A) - Low(A))

  22. Uniform Distribution: R ⋈ S on R.A = S.B

  23. Uniform Distribution: R ⋈ S on R.A = S.B. Chance of Hit per B = 1 / (# distinct values of A); Chance of Hit per B = 1 (if B is a FK reference); Chance of Hit per A = 1 (if A is a FK reference)
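
These formulas drop straight into code. A minimal Python sketch, assuming per-attribute summary statistics are available (ColumnStats and its field names are illustrative, not from the slides):

    # Reduction-factor (RF) estimates under the uniform-distribution assumption.
    from dataclasses import dataclass

    @dataclass
    class ColumnStats:
        n_distinct: int   # number of distinct values of the attribute
        low: float        # smallest value
        high: float       # largest value

    def rf_equals(stats: ColumnStats) -> float:
        # A = c: one value out of n_distinct equally likely values.
        return 1.0 / stats.n_distinct

    def rf_in(stats: ColumnStats, values) -> float:
        # A ∈ {v1, v2, ...}: |{v1, v2, ...}| values out of n_distinct.
        return min(1.0, len(set(values)) / stats.n_distinct)

    def rf_less_than(stats: ColumnStats, c: float) -> float:
        # A < c: (c - Low(A)) / (High(A) - Low(A)), clamped to [0, 1].
        rf = (c - stats.low) / (stats.high - stats.low)
        return max(0.0, min(1.0, rf))

    def equijoin_size(n_r: int, n_s: int, a_stats: ColumnStats,
                      b_is_fk_to_a: bool = False) -> float:
        # R.A = S.B: each S tuple hits a given value of A with probability
        # 1 / n_distinct(A); if S.B is a foreign-key reference to R.A,
        # every S tuple matches exactly one R tuple.
        hits_per_s_tuple = 1.0 if b_is_fk_to_a else n_r / a_stats.n_distinct
        return n_s * hits_per_s_tuple

    # Example: A < 3 with made-up bounds Low(A) = 0, High(A) = 10.
    print(rf_less_than(ColumnStats(n_distinct=10, low=0, high=10), 3))   # 0.3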

  24. Let’s apply it:
        SELECT O.Rank, COUNT(*)
        FROM Officers O
        WHERE O.Rank >= 2 AND O.Age > 20 AND O.Age < 30
        GROUP BY O.Rank
        HAVING COUNT(DISTINCT O.Ship) > 2
      What is the relational algebra plan for this expression?

  25. Stats:
        O.Rank: 0-5 (increments of 0.5; 11 total values)
        O.Age: 16-100 (increments of 1; 85 total values)
        Officers: 40,000 tuples (over 500 pages)
        Tree indexes available over O.Age, O.Rank
      What is the total cost in IOs? What is the total cost in CPU/tuples?
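
Not the slides' worked answer, but a rough sketch of how the uniform-distribution formulas combine on these stats (independence between the two predicates is assumed):

    n_tuples, n_pages = 40_000, 500

    # O.Rank >= 2 over the 11 values {0, 0.5, ..., 5}: 7 of the 11 values qualify.
    rf_rank = 7 / 11                              # ~0.64

    # 20 < O.Age < 30 over the 85 values {16, ..., 100}: ages 21..29 qualify.
    rf_age = 9 / 85                               # ~0.11

    est_survivors = n_tuples * rf_rank * rf_age   # ~2,700 tuples pass the WHERE clause
    full_scan_io = n_pages                        # a full table scan reads all 500 pages
    print(round(est_survivors), full_scan_io)

Whether the tree indexes beat the 500-page full scan depends on how many of those pages the matching tuples actually touch.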

  26. Histograms: A uniform distribution is a strong assumption! (data is often skewed)

  27. Histograms
        Query: SELECT Name FROM People WHERE Rank = 3 AND Age = 20   (vs. … AND Age = 19)
        People(Name, Age, Rank): <“Alice”, 21, 1>, <“Bob”, 20, 2>, <“Carol”, 21, 1>, <“Dave”, 19, 3>, <“Eve”, 20, 2>, <“Fred”, 20, 3>, <“Gwen”, 22, 1>, <“Harry”, 20, 3>
        RF(Age) = 1 / nkeys = 1 / 4;  RF(Rank) = 1 / nkeys = 1 / 3  →  Age is best!

  28. Histograms (same query and table)
        RF(Age = 20) = 1 / 2;  RF(Rank) = 1 / 3  →  Age is worst!

  29. Histograms (same query and table)
        RF(Age = 19) = 1 / 8;  RF(Rank) = 1 / 3  →  Age is best!
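
With a table this small, both estimates can be checked directly. A quick sketch (rows copied from the slide):

    from collections import Counter

    people = [("Alice", 21, 1), ("Bob", 20, 2), ("Carol", 21, 1), ("Dave", 19, 3),
              ("Eve", 20, 2), ("Fred", 20, 3), ("Gwen", 22, 1), ("Harry", 20, 3)]

    ages  = Counter(age  for _, age, _  in people)   # counts: 19->1, 20->4, 21->2, 22->1
    ranks = Counter(rank for _, _, rank in people)   # counts: 1->3, 2->2, 3->3

    # Naive uniform estimates: 1 / number of distinct values.
    print(1 / len(ages), 1 / len(ranks))                    # 1/4 and 1/3

    # Actual reduction factors for the two Age predicates.
    print(ages[20] / len(people), ages[19] / len(people))   # 1/2 and 1/8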

  30. Histograms
        [Figure: per-value histogram of Age for the People table: 19→1, 20→4, 21→2, 22→1]

  31. Histograms
        [Figure: the same data in two buckets: values 19-20 average 2.5 tuples each, values 21-22 average 1.5]

  32. Histograms
        [Figure: the same data in a single 19-22 bucket, averaging 2 tuples per value]

  33. Histograms
        [Figure: equi-width histogram over A with eight width-10 buckets spanning 0-80; bucket counts 63, 30, 22, 20, 15, 10, 10, 0]
        SELECT … WHERE A = 33

  34. Histograms
        [Figure: the same histogram]
        SELECT … WHERE A > 33
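
A sketch of how an equi-width histogram like this turns into RF estimates for the two queries above (the mapping of counts to buckets and the guess of roughly 10 distinct values per bucket are assumptions for illustration):

    # Selectivity estimation from an equi-width histogram: assume uniformity
    # *within* each bucket, while the bucket counts capture the overall skew.
    class EquiWidthHistogram:
        def __init__(self, low, width, counts):
            self.low, self.width, self.counts = low, width, counts
            self.total = sum(counts)

        def _bucket(self, value):
            return int((value - self.low) // self.width)

        def rf_equals(self, value, distinct_per_bucket):
            # A = value: the bucket's share of tuples, spread evenly over the
            # distinct values assumed to live in that bucket.
            b = self._bucket(value)
            return (self.counts[b] / self.total) / distinct_per_bucket

        def rf_greater_than(self, value):
            # A > value: all buckets above, plus a prorated share of the bucket
            # that `value` falls into.
            b = self._bucket(value)
            bucket_high = self.low + (b + 1) * self.width
            partial = self.counts[b] * (bucket_high - value) / self.width
            return (partial + sum(self.counts[b + 1:])) / self.total

    hist = EquiWidthHistogram(low=0, width=10,
                              counts=[63, 30, 22, 20, 15, 10, 10, 0])
    print(hist.rf_equals(33, distinct_per_bucket=10))   # WHERE A = 33  -> ~0.012
    print(hist.rf_greater_than(33))                     # WHERE A > 33  -> ~0.29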

  35. Using Constraints • A Key attribute has one distinct value per row (equality selects exactly one row) • Foreign Key joins generate one row for each row in the referencing relation. • Cascade relationship guarantees EXACTLY one row per reference.

  36. Sampling • Take a bunch of tuples from each relation. • Run 2-3 different query plans on these tuples. • Estimate the reduction factor for each operator in the plan based on how many tuples survive.
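
A minimal sketch of the per-operator version of this idea (the relation, predicate, and sample size below are placeholders):

    import random

    def estimate_rf(relation, predicate, sample_size=1000, seed=0):
        # Estimate the fraction of rows in `relation` that satisfy `predicate`
        # by running the predicate over a random sample.
        rng = random.Random(seed)
        sample = rng.sample(relation, min(sample_size, len(relation)))
        survivors = sum(1 for row in sample if predicate(row))
        return survivors / len(sample)

    # Example: selectivity of "20 < Age < 30" over 40,000 synthetic rows.
    rows = [{"Age": random.randint(16, 100)} for _ in range(40_000)]
    print(estimate_rf(rows, lambda r: 20 < r["Age"] < 30))   # near the true 9/85 ≈ 0.106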

  37. Sampling: How big is a “bunch”?

  38. Sampling • Problem: Very Selective Predicates • Problem: Joins and the Birthday Paradox • Problem: Counting Aggregate Groups

  39. Very Selective Predicates
        [Figure: a join plan over R, S, and T; 100 tuples are sampled from R]

  40. Very Selective Predicates
        [Figure: the same plan; the 100 sampled tuples feed the operator above R]

  41. Very Selective Predicates
        [Figure: the same plan; 0 of the sampled tuples survive the selective predicate]

  42. Join Conditions
        [Image: Wikipedia]

  43. Join Conditions
        Birthday Paradox: you need O(√(|R| + |S|)) tuples to reliably guess the RF of an equijoin
        [Image: Wikipedia]
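
A quick simulation of why small samples rarely witness an equijoin match, even when the full relations share many keys (the key domain and relation sizes are made up):

    import random

    rng = random.Random(1)
    R = [rng.randrange(1_000_000) for _ in range(100_000)]   # join keys of R
    S = [rng.randrange(1_000_000) for _ in range(100_000)]   # join keys of S

    true_common_keys = len(set(R) & set(S))                  # thousands of keys actually join

    sample_r = set(rng.sample(R, 100))
    sample_s = set(rng.sample(S, 100))
    print(true_common_keys, len(sample_r & sample_s))        # the 100-tuple samples see ~0 of them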

  44. Estimating Join Costs: How many query plans are there?  R ⋈ S ⋈ T ⋈ U

  45. Estimating Join Costs: There are (N-1)! (factorial) different ways (plans) to evaluate this join. Computing costs for all of these plans is expensive!

  46. Left-Deep Plans
        The RHS join input is always a relation: ((U ⋈ S) ⋈ T) ⋈ R
        1) Shrinks the join search space
        2) Allows index scans/lookups
        Technique pioneered by the System R Optimizer
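
A sketch of how the restriction pays off: a tiny System-R-flavored dynamic program that only ever places a base relation on the RHS. The cardinalities and the "/ 10" join-size rule are placeholders, not a real cost model:

    from itertools import combinations

    def best_left_deep_plan(card):
        # card: dict mapping relation name -> estimated cardinality.
        rels = sorted(card)
        # For each set of joined relations: (estimated size, cost so far, plan string).
        best = {frozenset([r]): (card[r], 0.0, r) for r in rels}

        for k in range(2, len(rels) + 1):
            for subset in combinations(rels, k):
                key = frozenset(subset)
                for rhs in subset:                        # RHS is always a base relation
                    lhs = key - {rhs}
                    lhs_size, lhs_cost, lhs_plan = best[lhs]
                    out_size = lhs_size * card[rhs] / 10  # placeholder join-size estimate
                    cost = lhs_cost + out_size            # cost = sum of intermediate sizes
                    if key not in best or cost < best[key][1]:
                        best[key] = (out_size, cost, f"({lhs_plan} ⋈ {rhs})")
        return best[frozenset(rels)]

    size, cost, plan = best_left_deep_plan({"R": 1000, "S": 200, "T": 50, "U": 5000})
    print(plan, round(cost))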

  47. In Practice: Heuristics, Histograms and Sampling are “good enough” to optimize the common cases.

  48. In Practice: Heuristics, Histograms and Sampling are “good enough” to optimize the common cases. Some relational databases have manual overrides.

  49. Oracle
        SELECT /*+ INDEX (employees emp_department_ix) */ employee_id, department_id
        FROM employees
        WHERE department_id > 50;

  50. Postgres
        SELECT attname, inherited, n_distinct,
               array_to_string(most_common_vals, E'\n') as most_common_vals
        FROM pg_stats
        WHERE tablename = 'road';

         attname | inherited | n_distinct | most_common_vals
        ---------+-----------+------------+--------------------
         name    | f         | -0.363388  | I- 580 Ramp+
                 |           |            | I- 880 Ramp+
                 |           |            | Sp Railroad +
                 |           |            | I- 580 +
                 |           |            | I- 680 Ramp
         name    | t         | -0.284859  | I- 880 Ramp+
                 |           |            | I- 580 Ramp+
                 |           |            | I- 680 Ramp+
                 |           |            | I- 580 +
                 |           |            | State Hwy 13 Ramp

  51. In Practice: Heuristics, Histograms and Sampling are “good enough” to optimize the common cases.

  52. In Practice: Heuristics, Histograms and Sampling are “good enough” to optimize the common cases. Some relational databases have manual overrides.

  53. In Practice: Heuristics, Histograms and Sampling are “good enough” to optimize the common cases. Some relational databases have manual overrides. All relational databases have an “EXPLAIN” operator.

  54. Postgres
        EXPLAIN SELECT sum(i) FROM foo WHERE i < 10;
                                      QUERY PLAN
        ---------------------------------------------------------------------
         Aggregate  (cost=23.93..23.93 rows=1 width=4)
           ->  Index Scan using fi on foo  (cost=0.00..23.92 rows=6 width=4)
                 Index Cond: (i < 10)

  55. Backup Slides

  56. Join Algorithm Comparison
        Algorithm           | Can Support Pipelining? | But?
        --------------------+-------------------------+-------------------------------------------------
        Hybrid Hash         | Yes                     | RHS hash table needs to fit in memory
        Index Nested Loop   | Yes                     | RHS table needs an index on the join key
        Sort/Merge Join     | Yes                     | LHS and RHS must both be sorted on the join key
        (Block) Nested Loop | Yes                     | RHS table needs to fit in memory
        Hash Join           | No                      | No buts. Hash join always materializes

  57. Join Algorithm IO Costs (IO cost of R ⋈ S)
        Hybrid Hash       : [#pages of S] (if it fits in memory)
        Index Nested Loop : |R| * [cost of one scan/lookup on S]
        Sort/Merge Join   : [#pages of S] (+ sorting costs)
        Nested Loop       : [#pages of S] (if it fits in memory)
        Block Nested Loop : [#pages of R] + [# of block pairs] * ([#pages per block of R] + [#pages per block of S])
        Hash Join         : 2 * ([#pages of R] + [#pages of S]) + [#pages of S]
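
The same formulas, transcribed into Python as the table states them (the example page counts and block sizes below are made up):

    import math

    def hybrid_hash(pages_s):
        return pages_s                      # assumes the S hash table fits in memory

    def index_nested_loop(tuples_r, lookup_cost_s):
        return tuples_r * lookup_cost_s     # one scan/lookup on S per R tuple

    def sort_merge(pages_s, sort_cost=0):
        return pages_s + sort_cost          # plus sorting costs if the inputs are unsorted

    def nested_loop(pages_s):
        return pages_s                      # assumes S fits in memory

    def block_nested_loop(pages_r, pages_s, block_r, block_s):
        block_pairs = math.ceil(pages_r / block_r) * math.ceil(pages_s / block_s)
        return pages_r + block_pairs * (block_r + block_s)

    def hash_join(pages_r, pages_s):
        return 2 * (pages_r + pages_s) + pages_s

    print(block_nested_loop(pages_r=500, pages_s=1000, block_r=100, block_s=100),
          hash_join(pages_r=500, pages_s=1000))   # 10500 4000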

  58. Data Access IO Costs
        Structure             | Full Scan         | Range Scan                 | Lookup
        ----------------------+-------------------+----------------------------+------------
        Raw File              | N                 | N                          | N
        Sorted File           | N                 | log_2(N) + |R|             | log_2(N)
        Static Hash Index     | >N                | >N                         | ~1
        Extendible Hash Index | >N + |D| (random) | >N + |D| (random)          | 2
        Linear Hash Index     | >N                | >N                         | ~1
        ISAM Tree Index       | ~N                | ~log_|T|(N) + |R|          | ~log_|T|(N)
        B+ Tree Index         | N (random)        | log_|T|(N) + |R| (random)  | log_|T|(N)
