evaluation of join operations
play

Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A - PowerPoint PPT Presentation

Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A (Joins) 340151 Big Data & Cloud Computing (P. Baumann) 1 Relational Operations: Join Definition Natural join: R S := L ( C ( R S ) ) Where C:


  1. Evaluation of Join Operations Ramakrishnan/Gehrke Chapter 14, Part A (Joins) 340151 Big Data & Cloud Computing (P. Baumann) 1

  2. Relational Operations: Join Definition  Natural join: • R  S := L ( C ( R S ) )  Where • C: condition that equates all pairs of attributes of R and S that have the same name • Ex: R has x, S has x " R.x=S.x and… " in C • L: list of all attributes of R and S, except equate duplicates • Ex: C contains "R.x=S.x" only one x chosen for L  Example: R(a,b) and S(b,c) relations • Then, R  S = …  Hence: join is shorthand (but more efficient to compute in 1 step) 340151 Big Data & Cloud Computing (P. Baumann) 2

  3. Relational Operations: More Joins  Theta join • Let R(a,b) and S(b,c) be relations • R  S := C ( R S ) C  Why "theta"? • Historically: R  S where { =, , >, , <, } x y • Today: can be any condition  Special case: C = "R.x = S.y" equijoin  No projection! 340151 Big Data & Cloud Computing (P. Baumann) 3

  4. Schema for Examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer, day: dates, rname: string)  Similar to old schema; rname added for variations  Reserves: tuple 40 bytes, 100 tuples per page, 1000 pages  Sailors: tuple 50 bytes, 80 tuples per page, 500 pages 340151 Big Data & Cloud Computing (P. Baumann) 4

  5. Equality Joins With One Join Column SELECT * FROM Reserves R, Sailors S WHERE R.sid=S.sid  R S large R S followed by selection inefficient  Assume: M #pages of R, p R tuples per page, N #pages of S, p S tuples per page  Cost metric: # of I/Os • will ignore output costs, disk access patterns 340151 Big Data & Cloud Computing (P. Baumann) 5

  6. Simple Nested Loops Join foreach tuple r in R do Reserves:  foreach tuple s in S do M = 1000; p R = 100 if r i == s j Sailors:  then add <r,s> to result N = 500; p S = 80  For each tuple in outer relation R, scan entire inner relation S • Cost: M + p R * M * N = 1,000 + 100*1000*500 I/Os = 50,001,000 I/Os  Page-oriented Nested Loops join ( Block-Nested Loop Join ) : For each page of R: get each page of S, write out matching pairs of tuples <r,s>, where r in R-page and s in S-page • Cost: M + M*N = 1000 + 1000*500 = 501,000  • If smaller relation (S) is outer, cost = 500 + 500*1000 = 500,500 340151 Big Data & Cloud Computing (P. Baumann) 6

  7. Index Nested Loops Join  Assume index on join column of one relation (say S) can make it inner and exploit index • Cost: M + ( M*p R * cost of finding matching S tuples )  For each R tuple, cost of probing S index ~1.2 for hash index  For each R tuple, cost of probing S index 2 … 4 for B+ tree • Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering: • Clustered index: 1 I/O for all tuples (typical), unclustered: up to 1 I/O per matching S tuple 340151 Big Data & Cloud Computing (P. Baumann) 7

  8. Examples of Index Nested Loops SELECT *  Hash-index on sid of Sailors (as inner): FROM Reserves R, Sailors S WHERE R.sid=S.sid • Scan Reserves: 1000 page I/Os, 100*1000 tuples • For each Reserves tuple: 1.2 I/Os to get data entry in index + 1 I/O to get (the exactly one) matching Sailors tuple • Total: 220,000 I/Os  Hash-index on sid of Reserves (as inner): • Scan Sailors: 500 page I/Os, 80*500 tuples • For each Sailors tuple: 1.2 I/Os to find index page with data entries + cost of retrieving matching Reserves tuples (*) • (*) Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000) Cost is 1 or 2.5 I/Os, depending on whether index is clustered • Total: … 4,000 * (1.2 + 2.5) = 148,000 I/Os 340151 Big Data & Cloud Computing (P. Baumann) 8

  9. Block Nested Loops Join  1 page as input buffer for scanning inner S  1 page as output buffer Earlier: block = page Now: block = sequence of pages  all remaining pages hold "block" of outer R R & S Join Result Hash table for block of R (k < B-1 pages) . . . . . . . . . Input buffer for S Output buffer 340151 Big Data & Cloud Computing (P. Baumann) 9

  10. Examples of Block Nested Loops  Cost: Scan of outer + #outer blocks * scan of inner • #outer blocks = # of pages of outer / blocksize  With Reserves (R) as outer, and 100 pages of R per block: • Cost of scanning R is 1000 I/Os; total of 10 blocks • Per block of R, scan Sailors (S); 10*500 I/Os  With 100-page block of Sailors as outer: • Cost of scanning S is 500 I/Os; total of 5 blocks • Per block of S, scan Reserves; 5*1000 I/Os 340151 Big Data & Cloud Computing (P. Baumann) 10

  11. Sort-Merge Join  Approach: • Sort R & S on join column • scan them to do a ``merge’’ (on join column) • output result tuples  Efficiency: • R scanned once; each S group scanned once per matching R tuple • Multiple scans of an S group likely to find needed pages in buffer  Cost: M log M + N log N + (M+N) • Ex: with 35, 100 or 300 buffer pages, Reserves & Sailors sorted in 2 passes • total join cost: 7500 ( BNL cost: 2,500 to 15,000 I/Os )  In practice, cost of sort-merge join linear 340151 Big Data & Cloud Computing (P. Baumann) 11

  12. Refinement of Sort-Merge Join  combine merging phases in sorting of R & S with merging required for join • With B > , where L is size of larger relation, L using sorting refinement that produces runs of length 2B in Pass 0: #runs of each relation is < B/2 • Allocate 1 page per run of each relation, `merge’ while checking join condition • Cost: read+write each relation in Pass 0 + read each relation in (only) merging pass (+ writing of result tuples) • In example: cost goes down from 7,500 to 4,500 I/Os  In practice, cost of sort-merge join is linear • like cost of external sorting 340151 Big Data & Cloud Computing (P. Baumann) 12

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend