Join Algorithms Lecture # 12 Database Systems Andy Pavlo AP AP - PowerPoint PPT Presentation

Join Algorithms Lecture # 12 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018

2 ADM IN ISTRIVIA Project #2 – Checkpoint #1 is due TODAY No class on Wednesday October 10th Mid-term Exam is on Wednesday October 17 th → Will cover up to and including this lecture (L12). → Study guide will be posted on Piazza later this week. → One sheet of handwritten notes (double-sided). CMU 15-445/645 (Fall 2018)

3 WH Y DO WE N EED TO J O IN ? We normalize tables in a relational database to avoid unnecessary repetition of information. We use the join operate to reconstruct the original tuples without any information loss. CMU 15-445/645 (Fall 2018)

4 N O RM ALIZED TABLES Artist (id, name, year, country) id name year country 123 Wu Tang Clan 1992 USA ArtistAlbum (artist_id, album_id) 456 Notorious BIG 1992 USA 789 Ice Cube 1989 USA artist_id album_id 123 11 Album (id, name, year) 123 22 id name year 789 22 11 Enter the Wu Tang 1993 456 22 22 St.Ides Mix Tape 1994 33 AmeriKKKa's Most Wanted 1990 CMU 15-445/645 (Fall 2018)

5 J O IN ALGO RITH M S We will focus on joining two tables at a time. In general, we want the smaller table to always be the outer table. Things we need to discuss first: → Output → Cost Analysis Criteria CMU 15-445/645 (Fall 2018)

6 J O IN O PERATO R O UTPUT SELECT R.id, S.cdate For a tuple r ∈ R and a tuple s ∈ S FROM R, S that match on join attributes, WHERE R.id = S.id AND S.value > 100 concatenate r and s together into a p new tuple. R.id, S.cdate Contents can vary: ⨝ R.id=S.id → Depends on processing model s → Depends on storage model value>100 → Depends on the query R S CMU 15-445/645 (Fall 2018)

7 J O IN O PERATO R O UTPUT: DATA SELECT R.id, S.cdate Copy the values for the attributes in FROM R, S outer and inner tuples into a new WHERE R.id = S.id AND S.value > 100 output tuple. R(id,name) S(id,value,cdate) id name id value cdate ⨝ 123 abc 123 1000 10/9/2018 123 2000 10/9/2018 R.id R.name S.id S.value S.cdate 123 abc 123 1000 10/9/2018 123 abc 123 2000 10/9/2018 CMU 15-445/645 (Fall 2018)

7 J O IN O PERATO R O UTPUT: DATA SELECT R.id, S.cdate Copy the values for the attributes in FROM R, S outer and inner tuples into a new WHERE R.id = S.id AND S.value > 100 output tuple. p R.id, S.cdate R.id R.name S.id S.value S.cdate ⨝ 123 abc 123 1000 10/9/2018 R.id=S.id 123 abc 123 2000 10/9/2018 s value>100 R S CMU 15-445/645 (Fall 2018)

7 J O IN O PERATO R O UTPUT: DATA SELECT R.id, S.cdate Copy the values for the attributes in FROM R, S outer and inner tuples into a new WHERE R.id = S.id AND S.value > 100 output tuple. p R.id, S.cdate Subsequent operators in the query plan never need to go back to the base ⨝ R.id=S.id tables to get more data. s value>100 R S CMU 15-445/645 (Fall 2018)

8 J O IN O PERATO R O UTPUT: RECO RD IDS SELECT R.id, S.cdate Only copy the joins keys along with FROM R, S the record ids of the matching tuples. WHERE R.id = S.id AND S.value > 100 A(id,name) S(id,value,cdate) id name id value cdate ⨝ 123 abc 123 1000 10/9/2018 123 2000 10/9/2018 R.id R.RID S.id S.RID 123 R.### 123 S.### 123 R.### 123 S.### CMU 15-445/645 (Fall 2018)

8 J O IN O PERATO R O UTPUT: RECO RD IDS SELECT R.id, S.cdate Only copy the joins keys along with FROM R, S the record ids of the matching tuples. WHERE R.id = S.id AND S.value > 100 p R.id, S.cdate R.id R.RID S.id S.RID ⨝ 123 R.### 123 S.### R.id=S.id 123 R.### 123 S.### s value>100 R S CMU 15-445/645 (Fall 2018)

8 J O IN O PERATO R O UTPUT: RECO RD IDS SELECT R.id, S.cdate Only copy the joins keys along with FROM R, S the record ids of the matching tuples. WHERE R.id = S.id AND S.value > 100 p Ideal for column stores because the R.id, S.cdate DBMS does not copy data that is not need for the query. ⨝ R.id=S.id s This is called late materialization . value>100 R S CMU 15-445/645 (Fall 2018)

9 I/ O CO ST AN ALYSIS SELECT R.id, S.cdate Assume: FROM R, S → M pages in table R , m tuples total WHERE R.id = S.id → N pages in S , n tuples total AND S.value > 100 Cost Metric: # of IOs to compute join We will ignore output costs since that depends on the data and we cannot compute that yet. CMU 15-445/645 (Fall 2018)

10 J O IN VS CRO SS- PRO DUCT R ⨝ S is the most common operation and thus must be carefully optimized. R×S followed by a selection is inefficient because the cross-product is large. There are many algorithms for reducing join cost, but no particular algorithm works well in all scenarios. CMU 15-445/645 (Fall 2018)

11 J O IN ALGO RITH M S Nested Loop Join → Simple → Block → Index Sort-Merge Join Hash Join CMU 15-445/645 (Fall 2018)

12 SIM PLE N ESTED LO O P J O IN foreach tuple r ∈ R : Outer foreach tuple s ∈ S : Inner emit , if r and s match R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 200 GZA 500 7777 10/9/2018 100 Andy 400 6666 10/9/2018 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

13 SIM PLE N ESTED LO O P J O IN Why is this algorithm bad? → For every tuple in R , it scans S once Cost: M + (m ∙ N) R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 N pages 200 GZA 500 7777 10/9/2018 M pages n tuples 100 Andy 400 6666 10/9/2018 m tuples 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

14 SIM PLE N ESTED LO O P J O IN Example database: → M = 1000, m = 100,000 → N = 500, n = 40,000 Cost Analysis: → M + ( m ∙ N ) = 1000 + (100000 ∙ 500) = 50,000,100 IOs → At 0.1 ms/IO, Total time ≈ 1.3 hours What if smaller table ( S ) is used as the outer table? → N + ( n ∙ M ) = 500 + (40000 ∙ 1000) = 40,000,500 Ios → At 0.1 ms/IO, Total time ≈ 1.1 hours CMU 15-445/645 (Fall 2018)

15 BLO CK N ESTED LO O P J O IN foreach block B R ∈ R : foreach block B S ∈ S : foreach tuple r ∈ B R : foreach tuple s ∈ B s : emit , if r and s match R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 N pages 200 GZA 500 7777 10/9/2018 M pages n tuples 100 Andy 400 6666 10/9/2018 m tuples 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

16 BLO CK N ESTED LO O P J O IN This algorithm performs fewer disk accesses. → For every block in R , it scans S once Cost: M + ( M ∙ N ) R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 N pages 200 GZA 500 7777 10/9/2018 M pages n tuples 100 Andy 400 6666 10/9/2018 m tuples 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

17 BLO CK N ESTED LO O P J O IN Which one should be the outer table? → The smaller table in terms of # of pages R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 N pages 200 GZA 500 7777 10/9/2018 M pages n tuples 100 Andy 400 6666 10/9/2018 m tuples 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

18 BLO CK N ESTED LO O P J O IN Example database: → M = 1000, m = 100,000 → N = 500, n = 40,000 Cost Analysis: → M + ( M ∙ N ) = 1000 + (1000 ∙ 500) = 501,000 IOs → At 0.1 ms/IO, Total time ≈ 50 seconds CMU 15-445/645 (Fall 2018)

19 BLO CK N ESTED LO O P J O IN What if we have B buffers available? → Use B -2 buffers for scanning the outer table. → Use one buffer for the inner table, one buffer for storing output. R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 N pages 200 GZA 500 7777 10/9/2018 M pages n tuples 100 Andy 400 6666 10/9/2018 m tuples 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

19 BLO CK N ESTED LO O P J O IN foreach B - 2 blocks b R ∈ R : foreach block b S ∈ S : foreach tuple r ∈ b R : foreach tuple s ∈ b s : emit , if r and s match R(id,name) S(id,value,cdate) id name id value cdate 600 MethodMan 100 2222 10/9/2018 N pages 200 GZA 500 7777 10/9/2018 M pages n tuples 100 Andy 400 6666 10/9/2018 m tuples 300 ODB 100 9999 10/9/2018 500 RZA 200 8888 10/9/2018 700 Ghostface 400 Raekwon CMU 15-445/645 (Fall 2018)

20 BLO CK N ESTED LO O P J O IN This algorithm uses B -2 buffers for scanning R . Cost: M + (  M / ( B -2)  ∙ N ) What if the outer relation completely fits in memory ( B > M +2 )? → Cost: M + N = 1000 + 500 = 1500 IOs → At 0.1ms/IO, Total time ≈ 0.15 seconds CMU 15-445/645 (Fall 2018)

21 IN DEX N ESTED LO O P J O IN Why do basic nested loop joins suck ass? → For each tuple in the outer table, we have to do a sequential scan to check for a match in the inner table. Can we accelerate the join using an index? Use an index to find inner table matches. → We could use an existing index for the join. → Or even build one on the fly. CMU 15-445/645 (Fall 2018)

Join Algorithms Lecture # 12 Database Systems Andy Pavlo AP AP - PowerPoint PPT Presentation

Join Algorithms Lecture # 12 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018 2 ADM IN ISTRIVIA Project #2 Checkpoint #1 is due TODAY No class on Wednesday October 10th Mid-term Exam is

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

When to Optimize Enumerating all possible plans Selection Pushdown Join Conversion Join

CS411 Database Systems Join Expressions 06: SQL Kazuhiro Minami Join Expressions Products and

How does Hash Join work in PostgreSQL and its derivates Yandong Yao Pivotal Greenplum team

Graph Algorithms Chapter 22 1 CPTR 430 Algorithms Graph Algorithms Why Study Graph Algorithms?

Greedy Algorithms Chapter 16 1 CPTR 430 Algorithms Greedy Algorithms Greedy Algorithms For

Algorithms Chapter 3 Chapter Summary Algorithms n Example Algorithms n Algorithmic Paradigms

Please Join Us 2009 Consumer Attitudes Toward If you have not called in to join the audio

50 th MedCruise General Assembly 14-17 June 2017 Gibraltar Join @ slido.com #50MedCruiseGA Join

enabled by Circular Economy Service Platform join CEP Be part of the future transformation join

JOIN 2 . Alexander Wagner, for JOIN 2 Overview > Fraunhofer requirements > What happened in

Invenio @ JOIN 2 . Alexander Wagner, for JOIN 2 4 th IUGW Overview > Partner > What happened

I/O-EFFICIENT SIMILARITY JOIN R. Pagh, N. Pham, F. Silvestri, M. Stckel Similarity Join R = Q

CS 327E Class 3 September 23, 2019 1) Which SQL join type does this query contain? S(a: int, b:

Optimal Join Algorithms meet Top- Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

General remarks Algorithms Algorithms Oliver Oliver Week 8 Kullmann Kullmann Greedy Greedy

1 Alphabets and Languages Look at handout 1 (inference rules for sets) and use the rules on some

Computer Graphics 543 Lecture 4 (Part 1): Rotations and Matrix Concatenation Prof Emmanuel Agu

SORTING Review of Sorting Merge Sort Sets sorting 1 Sorting Algorithms

Languages and Regular expressions Lecture 2 1 Strings, Sets of Strings, Sets of Sets of

Kolmogorov Complexity Suppose I say I tossed a coin 40 times and got:

Compile-time type transformation Meeting C++ 2019, Berlin dr Ivan uki KDAB

Announcements Extra office hours today (instead of DIS sections); Zoom links on Canvas P6

Word Embeddings in Feedforward Networks; Tagging and Dependency Parsing using Feedforward

Join Algorithms Lecture # 12 Database Systems Andy Pavlo AP AP - PowerPoint PPT Presentation

Join Algorithms Lecture # 12 Database Systems Andy Pavlo AP AP Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018 2 ADM IN ISTRIVIA Project #2 Checkpoint #1 is due TODAY No class on Wednesday October 10th Mid-term Exam is

JOINS IN SQL By Rohit Dhanwani OBJECTIVES Define and use different types of joins INNER

When to Optimize Enumerating all possible plans Selection Pushdown Join Conversion Join

CS411 Database Systems Join Expressions 06: SQL Kazuhiro Minami Join Expressions Products and

How does Hash Join work in PostgreSQL and its derivates Yandong Yao Pivotal Greenplum team

Graph Algorithms Chapter 22 1 CPTR 430 Algorithms Graph Algorithms Why Study Graph Algorithms?

Greedy Algorithms Chapter 16 1 CPTR 430 Algorithms Greedy Algorithms Greedy Algorithms For

Algorithms Chapter 3 Chapter Summary Algorithms n Example Algorithms n Algorithmic Paradigms

Please Join Us 2009 Consumer Attitudes Toward If you have not called in to join the audio

50 th MedCruise General Assembly 14-17 June 2017 Gibraltar Join @ slido.com #50MedCruiseGA Join

enabled by Circular Economy Service Platform join CEP Be part of the future transformation join

JOIN 2 . Alexander Wagner, for JOIN 2 Overview &gt; Fraunhofer requirements &gt; What happened in

Invenio @ JOIN 2 . Alexander Wagner, for JOIN 2 4 th IUGW Overview &gt; Partner &gt; What happened

I/O-EFFICIENT SIMILARITY JOIN R. Pagh, N. Pham, F. Silvestri, M. Stckel Similarity Join R = Q

CS 327E Class 3 September 23, 2019 1) Which SQL join type does this query contain? S(a: int, b:

Optimal Join Algorithms meet Top- Nikolaos Tziavelis, Wolfgang Gatterbauer, Mirek Riedewald

General remarks Algorithms Algorithms Oliver Oliver Week 8 Kullmann Kullmann Greedy Greedy

1 Alphabets and Languages Look at handout 1 (inference rules for sets) and use the rules on some

Computer Graphics 543 Lecture 4 (Part 1): Rotations and Matrix Concatenation Prof Emmanuel Agu

SORTING Review of Sorting Merge Sort Sets sorting 1 Sorting Algorithms

Languages and Regular expressions Lecture 2 1 Strings, Sets of Strings, Sets of Sets of

Kolmogorov Complexity Suppose I say I tossed a coin 40 times and got:

Compile-time type transformation Meeting C++ 2019, Berlin dr Ivan uki KDAB

Announcements Extra office hours today (instead of DIS sections); Zoom links on Canvas P6

Word Embeddings in Feedforward Networks; Tagging and Dependency Parsing using Feedforward

JOIN 2 . Alexander Wagner, for JOIN 2 Overview > Fraunhofer requirements > What happened in

Invenio @ JOIN 2 . Alexander Wagner, for JOIN 2 4 th IUGW Overview > Partner > What happened