Why Is This Important? Now that we know about the benefits of - PDF document

Why Is This Important?  Now that we know about the benefits of indexes, how does the DBMS know when to use them? Overview of Query Evaluation  An SQL query can be implemented in many ways, but which one is best?  Perform selection before or after join etc. Chapter 12  Many ways of physically implementing a join (or other relational operator), how to choose the right one?  The DBMS does this automatically, but we need to understand it to know what performance to expect 1 2 Overview of Query Evaluation Some Common Techniques  SQL query is implemented by a query plan  Algorithms for evaluating relational operators use  Tree of relational operators some simple ideas extensively: • `Pull ’ interface: when an operator is `pulled’ for the next output tuples , it `pulls’ on its inputs and computes them.  Indexing: Can use WHERE conditions to retrieve small set • Can change structure of tree of tuples (selections, joins) • Can choose different operator implementations  Iteration: Sometimes, faster to scan all tuples even if there  Two main issues in query optimization:  For a given query, what plans are considered? is an index. (And sometimes, we can scan the data entries • Algorithm to search plan space for cheapest (estimated) plan. in an index instead of the table itself.)  How is the cost of a plan estimated?  Partitioning: By using sorting or hashing, we can partition  Ideally: Want to find best plan. the input tuples and replace an expensive operation by  Practically: Avoid worst plans! similar operations on smaller inputs.  We will study the System R approach. * Watch for these techniques as we discuss query evaluation! 3 4 Statistics and Catalogs Access Paths  Need information about the relations and indexes  Access path = way of retrieving tuples: involved. Catalog typically contains:  File scan, or index that matches a selection (in the query)  #tuples (NTuples) and #pages (NPages) for each relation.  Cost depends heavily on access path selected  #distinct key values (NKeys), INPages, and low/high key values  A tree index matches (a conjunction of) conditions that (ILow/IHigh) for each index. involve only attributes in a prefix of the search key.  Index height (IHeight) for each tree index.  E.g., Tree index on <a, b, c> matches “a=5 AND b=3” and “a=5  Catalog data stored in tables; can be queried AND b>6”, but not “b=3”.  Catalogs updated periodically.  A hash index matches (a conjunction of) conditions that  Updating whenever data changes is too expensive; costs are has a term attribute = value for every attribute in the approximate anyway, so slight inconsistency ok. search key of the index.  More detailed information (e.g., histograms of the values  E.g., Hash index on <a, b, c> matches “a=5 AND b=3 AND c=5”; in some field) sometimes stored. but not “b=3”, “a=5 AND b=3”, or “a>5 AND b=3 AND c=5”. 5 6

A Note on Complex Selections Selectivity of Access Paths (day<8/9/94 AND rname =‘Paul’) OR bid=5 OR sid=3  Selectivity = #pages retrieved (index + data pages)  Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don’t match the index:  Selection conditions are first converted to  Terms that match the index reduce the number of tuples conjunctive normal form (CNF): retrieved  E.g., (day<8/9/94 OR bid=5 OR sid=3 ) AND  Other terms are used to discard some retrieved tuples, but do (rname =‘Paul’ OR bid=5 OR sid=3) not affect number of tuples fetched.  We only discuss case with no ORs; see text if you are  Consider “day < 8/9/94 AND bid=5 AND sid =3”. curious about the general case. • Can use B+ tree index on day; then check bid=5 and sid=3 for each retrieved tuple • Could similarly use a hash index on <bid,sid>; then check day < 8/9/94 7 8 SELECT DISTINCT Using an Index for Selections Projection R.sid, R.bid Reserves R FROM  Without index on R.rname, have to scan  The expensive part is removing duplicates.  DBMS does not remove duplicates by default.  Index cost depends on #qualifying tuples and clustering.  Sorting Approach  Cost of finding qualifying data entries (small) plus cost of  Sort on <sid, bid> and remove duplicates: scan of Reserves retrieving records (could be large). (1000 pages), plus 2-3 more passes of projected data set (~1000  Data: 100K tuples on 1000 pages pages)  Assuming uniform distribution of names, about 10% of tuples  Hashing Approach qualify (100 pages, 10000 tuples).  Hash on <sid, bid> to create partitions.  Clustered index on rname: little more than 100 I/O  Load partitions into memory one at a time  Unclustered index on rname: up to 10,000 I/O! • Build in-memory hash structure, eliminate duplicates.  Scan of Reserves (1000 pages), plus write and read projected SELECT * data (~500 I/O); but could be more Reserves R  If there is an index with all selected attributes in the FROM search key, use index-only access on index leaves. WHERE R.rname < ‘C%’ 9 10 Join: Index Nested Loops Examples of Index Nested Loops foreach tuple r in R do foreach tuple s in S where r i == s j do  Join Sailors and Reserves on sid add <r, s> to result  Assumption: R has 100K tuples on 1000 pages; S has 40K tuples on 500 pages  Naïve implementation: scan S for each tuple in R  Hash-index (Alt. 2) on sid of Sailors (as inner):  #pages(R) + |R| * #pages(S) page accesses  Scan Reserves: 1000 page I/Os, 100K tuples.  Improved by block nested loops: foreach block of R, process  For each Reserves tuple: 1.2 I/Os to get data entry in hash index, plus each block from S 1 I/O to get (the exactly one) matching Sailors tuple. Total: 220,000  #pages(R) + #pages(R) * #pages(S) page accesses I/Os.  Can do even better with an index on the join column of one  Hash-index (Alt. 2) on sid of Reserves (as inner): relation (say S) by making it the inner.  Scan Sailors: 500 page I/Os, 40K tuples.  Cost: #pages(R) + ( |R| * costOfFindingMatchingStuples )  For each Sailors tuple: 1.2 I/Os to find index page with data entries,  For each R tuple, cost of probing S index is about 1.2 I/O for hash plus cost of retrieving matching Reserves tuples. Assuming uniform index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of or (3) for data entries) depends on clustering. retrieving them from heap file is 2.5 I/Os. Total: 148,000 I/Os. • Clustered index: 1 I/O (typical), unclustered: upto 1 I/O per matching S tuple. 11 12

Why Is This Important? Now that we know about the benefits of - PDF document

Why Is This Important? Now that we know about the benefits of indexes, how does the DBMS know when to use them? Overview of Query Evaluation An SQL query can be implemented in many ways, but which one is best? Perform selection before

Why Im NOT Why Im NOT Why Im NOT Why Im NOT a Hindu Why Im NOT a Hindu

Why is this important? Why is this important? Anyone know what song this is? Yes thats right,

Why is shipping important? CARGO TRINITY HOUSE // KEY STAGE 1 WHY IS SHIPPING IMPORTANT?

The Worlds First SMART LASER The Platform That Redefines the Therapy Laser Industry. Why is

Smart City Wonderful Life Guo

Why is there a price to pay? Why is there a price to pay? Why cant God just

Crises Important and urgent Goals and planning Important but not urgent Assess tasks using

Occupational Hygiene Why is it important? And why ignoring it is bad business? Julian Dowson

WHAT IS HEATING VALUE WHY IS IT IMPORTANT WHAT IS HEATING VALUE WHY IS IT IMPORTANT

Deborah Grove, Principal Grove-Associates Why Green IT is Important Why Green IT is Important

WHY IS MIGRATION SO IMPORTANT? Why migration? German National Team 2014 Why migration?

G 98 7 Artist Workshop Spring 2014 Why We Do What We Do What? Why? The Reason Why what we

Property-Based Testing Matt Bachmann @mattbachmann Testing is Important Testing is Important

Important Concepts Important Concepts Some important concepts in financial and derivative

WELCOME Building Envelope Problems AGENDA WHY IMPORTANT CONDENSATION STONE

WELCOME VENTED and UNVENTED Roof/Deck Systems AGENDA WHY IMPORTANT CODE THEORY

Evaluating Relational Operations: Part I Database Management Systems 3ed, R. Ramakrishnan and

Generic vs Alternative Specific Coefficients in Conditional Logits: An Application to Party

Application of Multi-criteria Decision Analysis Methods to Comparative Evaluation of Nuclear

Alternatives EVALUATION BASED ON MULTIPLE CRITERIA Many problems associated with economic methods

Evaluating Relational Operations: Part I (From Chapter 14)

e g r o e G l e The Alternative Lending Risk a h Premium in the Capital Markets c i M

0 James S. Riepe Non-Executive Chairman of the Board 1 G. Kent Conrad Melina E. Higgins

Eastern San Joaquin Subbasin Groundwater Sustainability Workgroup February 13, 2019