Query Processing Review Support for data retrieval at the - PowerPoint PPT Presentation

Query Processing

Review  Support for data retrieval at the physical level:  Indices : data structures to help with some query evaluation:  SELECTION queries (ssn = 123)  RANGE queries (100 <= ssn <=200)  Index choices : Primary vs secondary, dense vs sparse, ISAM vs B+-tree vs Extendible Hashing vs Linear Hashing  Sometimes, indexes not useful, even for SELECTION queries.  And what about join queries or other queries not directly supported by the indices? How do we evaluate these queries?  What decides these implementation choices? Ans: Query Processor(one of the most complex components of a database system)

QP & O SQL Query Query Processor Data: result of the query

QP & O SQL Query Query Processor Parser Algebraic Expression Query Optimizer Execution plan Evaluator Data: result of the query

QP & O Query Algebraic Optimizer Query Rewriter Representation Algebraic Representation Data Stats Plan Generator Query Execution Plan

Query Processing and Optimization Parser / translator (1 st step) Input: SQL Query (or OQL, …) Output: Algebraic representation of query (relational algebra expression)  balance (  balance  2500 ( account)) Eg SELECT balance FROM account or WHERE balance < 2500  balance  balance  2500 account

Query Processing & Optimization Plan Generator produces: Query execution plan  Algorithms of operators that read from disk:  Sequential scan  Index scan  Merge-sort join  Nested loop join  ….. Plan Evaluator (last step) Input: Query Execution Plan Output: Data (Query results)

Query Processing & Optimization Query Rewriting Input: Algebraic representation of query Output: Algebraic representation of query Idea: Apply heuristics to generate equivalent expression that is likely to lead to a better plan e.g.:  amount > 2500 (borrower loan) borrower (  amount > 2500 (loan)) Why is 2 nd better than 1 st ?

Equivalence Rules 1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. s = s s E E ( ) ( ( )) q Ù q q q 1 2 1 2 2. Selection operations are commutative. s s = s s E E ( ( )) ( ( )) q q q q 1 2 2 1 3. Only the last in a sequence of projection operations is needed, the others can be omitted.        ( ( ( ( E )) )) ( E ) L L Ln L 1 2 1 4. Selections can be combined with Cartesian products and theta joins. a.   (E 1 X E 2 ) = E 1  E 2 b.   1 (E 1  2 E 2 ) = E 1  1   2 E 2

Query Processing & Optimization Plan Generator Input: Algebraic representation of query Output: Query execution plan Idea: 1) generate alternative plans for evaluating a query   amount > 2500 Sequential scan Index scan 2) Estimate cost for each plan 3) Choose the plan with the lowest cost COST: approx., counts sources of latency

Query Processing & Optimization Goal: generate plan with minimum cost (i.e., fast as possible) Cost factors: 1. CPU time (trivial compared to disk time) 2. Disk access time  main cost in most DBs 3. Network latency  Main concern in distributed DBs Our metric: count disk accesses

Cost Model How do we predict the cost of a plan? Ans: Cost model  For each plan operator and each algorithm we have a cost formula  Inputs to formulas depend on relations, attributes, etc.  Database maintains statistics about relations for this (Metadata)

Metadata  Given a relation r, DBMS likely maintains the following metadata: 1. Size (# tuples in r) n r 2. Size (# blocks in r) b r 3. Block size (# tuples per block) f r (typically b r =  n r / f r  ) 4. Tuple size (in bytes) s r 5. Attribute Values V(att, r) (for each attribute att in r , # of different values) 6. Selection Cardinality SC(att, r) (for each attribute att in r , expected size of a selection:  att = K (r ) )

Example account n account = 6 bname acct_no balance Dntn A-101 500 s account = 33 bytes Mianus A-215 700 Perry A-102 500 R.H. A-305 900 f account =  4K/33  Dntn A-200 700 Perry A-301 500 V(balance, account) = 3 V(acct_no, account) = 6 SC (balance, account) = 2 ( n r / V(att, r))

Some typical plans and their costs Query :  att = K (r )  A1 ( linear search ). Scan each file block and test all records to see whether they satisfy the selection condition.  Cost estimate (number of disk blocks scanned) = b r  b r denotes number of blocks containing records from relation r  If selection is on a key attribute, cost = ( b r /2)  stop on finding record (on average in the middle of the file)  Linear search can be applied regardless of  selection condition or  ordering of records in the file, or  availability of indices E A1 = b r (if attr is not a key) b r /2 (if attr is a key)

Selection Operation (Cont.) Query :  att = K (r )  A2 (binary search). Applicable if selection is an equality comparison on the attribute on which file is sorted.  Requires that the blocks of a relation are sorted and stored contiguously.  Cost estimate:   log 2 ( b r )  — cost of locating the first tuple by a binary search on the blocks  Plus number of blocks containing records that satisfy selection condition E A2 =  log 2 ( b r )  +  SC (att, r) / f r  - 1 What is the cost if att is a key? less the one you E A2 =  log 2 ( b r )  already found

Example Account (bname, acct_no, balance)  bname = “ Perry ” ( Account ) Query: n account = 10,000 f account = 20 tuples/block b account= = 10,000 / 20 = 500 V(bname, Account) = 50 SC(bname, Account) =  10,000 / 50  = 200 Assume sorted on bname Cost Estimates: A1: E A1 =  n Account / f Account  =  10,000 / 20  = 500 I/O ’ s A2: E A2 =  log 2 ( b Account )  +  SC(bname, Account) / f account  -1 = 9 + 9 = 18 I/O ’ s

More Plans for selection  What if there’s an index (B+Tree) on att? We need metadata on size of index (i). DBMS keeps track of: HT i 1. Index height: 2. Index “ Fan Out ” : f i Average # of children per node (not same as order.) 3. Index leaf nodes: LB i Note : HT i ~  log fi (LB i ) 

B+-trees REMINDER B+-tree of order 3: This is a primary index root: internal node 6 9 < 6 ≥ 9 ≥ 6 < 9 leaf node 4 3 6 7 9 13 (3, Joe, 23) (4, John, 23) Data File (3, Bob, 23) ………… ………… …………

B+Tree, Primary Index HTi i Leaf nodes 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 2 2 2 2 3 3 3 Data File  SC(att, r) / f r  = 14 / 7 14 tuples / 7 tuples per block = 2 blocks

More Plans for selection Query :  att = K (r )  A3: Index scan, Primary (B + -Tree) Index What: Follow primary index, searching for key K Prereq: Primary index on att, i Cost: EA3 = HT i + 1, if att is a candidate key EA3 = HT i +1+  SC(att, r) / f r  , if not Remember for primary index, data file is sorted => sparse index Number of blocks containing att=K in data file

Secondary Indexing REMINDER secondary index: typically, with ‘ postings lists ’ Postings lists STUDENT forbes ave Ssn Name Address 123 smith main str main str 234 jones forbes ave 345 tomson main str 456 stevens forbes ave 567 smith forbes ave

B+Tree, Primary Index HTi i 3 Posting List 1 2 3 3 4 9 4 4 3 2 1 7 3 3 2 5 3 4 1 1 3 The number of records SC(att, r) with value = 3 = # blocks read

Query :  att = K (r )  A4: Index scan, Secondary Index Prereq: Secondary index on att, i What: Follow index, searching for key K Cost: bucket read for posting list If att not a key: E A4 = HT i + 1 + SC( att , r ) YIKES! Index block File block reads reads (in worst case, each tuple on different block) Else, if att is a candidate key: E A4 = HT i + 1

How Big is the Posting List BPL r = Blocks in Posting List P r = size of a pointer  NEW f r = size of a block BPL r = SC ( attr , r ) Num occurrences f r P Num pointers in a block r

Selections Involving Comparisons Query:  Att  K (r )  A5 ( primary index, comparison). (Relation is sorted on Att )  For  Att  K (r) use index to find first tuple  v and scan relation sequentially from there  For  Att  K ( r ) just scan relation sequentially until first tuple > K; do not use index E A5 = HT i +  c / f r  (where c is the cardinality of result) Cost of first: HT i Leaf nodes k . SORTED on Att k Data File

Selections Involving Comparisons (cont.) Query:  Att  K (r ) Cardinality: More metadata on r is needed: min ( att , r ) : minimum value of att in r max( att , r ): maximum value of att in r Then the selectivity of  Att  K (r ) is estimated as:  max( attr , r ) K n r (or n r /2 if  max( att , r ) min( att , r ) min, max unknown) Intuition: assume uniform distribution of values between min and max min(attr, r) K max(attr, r)

Plan generation: Range Queries  Att  K (r ) A6: ( secondary index, comparison ). Att is a candidate key HT i ... Leaf nodes k+m k, k+1 k+1 ... File k k+m Cost: E A6 = HT i -1+ #of leaf nodes to read + # of file blocks to read = HT i -1+  LB i * (c / n r )  + c, if att is a candidate key – There will be c file pointers for a key.

Query Processing Review Support for data retrieval at the - PowerPoint PPT Presentation

Query Processing Review Support for data retrieval at the physical level: Indices : data structures to help with some query evaluation: SELECTION queries (ssn = 123) RANGE queries (100 <= ssn <=200) Index choices :

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

Query Op)miza)on 1 Query op)miza)on Given an SQL query,

Online Query Processing Exposure to online query processing algorithms and fundamentals A

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit & Continue 5.

Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking

Query Processing Query Processing Steps balance < 2500 ( balance ( account)) balance

A Generic Mapping-based Query Translation A Generic Mapping-based Query Translation from SPARQL

Information Retrieval > Query Us User er Query Words Query Words Search Personalization

Module 13: Optimizing Query Performance Overview Introduction to the Query Optimizer

CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query

Worst-Case Ef fi cient External-Memory Priority Queues Gerth Stlting Brodal

Comparison Sorting Alphabetical list of people List of countries ordered by population

Chapter 7 External Sorting Sorting Tables Larger Than Main Memory Query Processing Sorting

on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of

Data Management Systems Query Processing Execution models Optimization heuristics

Tournament Trees Winner trees. Loser Trees. Winner Trees Complete binary tree with n external

For Monday Read Weiss, chapter 4.3-4.4 Quicksort homework described on Blackboard

2-Way Sort Problem You have some large number (e.g., 3072) pages of data to sort You only have a

Query Processing Review Support for data retrieval at the - PowerPoint PPT Presentation

Query Processing Review Support for data retrieval at the physical level: Indices : data structures to help with some query evaluation: SELECTION queries (ssn = 123) RANGE queries (100 <= ssn <=200) Index choices :

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

Query Op)miza)on 1 Query op)miza)on Given an SQL query,

Online Query Processing Exposure to online query processing algorithms and fundamentals A

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit &amp; Continue 5.

Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking

Query Processing Query Processing Steps balance &lt; 2500 ( balance ( account)) balance

A Generic Mapping-based Query Translation A Generic Mapping-based Query Translation from SPARQL

Information Retrieval &gt; Query Us User er Query Words Query Words Search Personalization

Module 13: Optimizing Query Performance Overview Introduction to the Query Optimizer

CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query

Worst-Case Ef fi cient External-Memory Priority Queues Gerth Stlting Brodal

Comparison Sorting Alphabetical list of people List of countries ordered by population

Chapter 7 External Sorting Sorting Tables Larger Than Main Memory Query Processing Sorting

on Many-Core GPUs Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne Key Laboratory of

Data Management Systems Query Processing Execution models Optimization heuristics

Tournament Trees Winner trees. Loser Trees. Winner Trees Complete binary tree with n external

For Monday Read Weiss, chapter 4.3-4.4 Quicksort homework described on Blackboard

2-Way Sort Problem You have some large number (e.g., 3072) pages of data to sort You only have a

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit & Continue 5.

Query Processing Query Processing Steps balance < 2500 ( balance ( account)) balance

Information Retrieval > Query Us User er Query Words Query Words Search Personalization