CS4224/CS5424 Lecture 9 Distributed Query Processing Query - - PowerPoint PPT Presentation

cs4224 cs5424 lecture 9 distributed query processing
SMART_READER_LITE
LIVE PREVIEW

CS4224/CS5424 Lecture 9 Distributed Query Processing Query - - PowerPoint PPT Presentation

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a query plan that minimizes some cost function Minimize total cost CPU cost, I/O cost, & communication cost Minimize response time


slide-1
SLIDE 1

CS4224/CS5424 Lecture 9 Distributed Query Processing

slide-2
SLIDE 2

Query Processing

  • Translates query into a query plan that

minimizes some cost function

◮ Minimize total cost ⋆ CPU cost, I/O cost, & communication cost ◮ Minimize response time ⋆ Time elapsed for query execution CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 2

slide-3
SLIDE 3

Example

  • Site A: Relations R(a, c, · · · ) & S(a, · · · )
  • Site B: Relations T(b, c, · · · ) & U(b, · · · )
  • Query at Site B:

SELECT * FROM R, S, T, U WHERE R.a = S.a AND T.b = U.b AND R.c = T.c

  • Query Plans:

⊲ ⊳ ⊲ ⊳ R S ⊲ ⊳ T U

Site B

⊲ ⊳ ⊲ ⊳ ⊲ ⊳ R S T U

Site B

Plan 1 Plan 2

CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 3

slide-4
SLIDE 4

Example (cont.)

  • Cost model:

◮ JC(X,Y) = CPU & I/O cost of joining relations X & Y ◮ CC(X) = Communication cost of sending relation X from one

site to another site

◮ JC(R, S) = 2000,

JC(T, U) = 2000

◮ JC(R ⊲

⊳ S, T) = 1000, JC(R ⊲ ⊳ S ⊲ ⊳ T, U) = 600

◮ JC(R ⊲

⊳ S, T ⊲ ⊳ U) = 100, CC(R ⊲ ⊳ S) = 200

⊲ ⊳ ⊲ ⊳ R S ⊲ ⊳ T U

Site B

⊲ ⊳ ⊲ ⊳ ⊲ ⊳ R S T U

Site B

Plan 1 Plan 2

CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 4

slide-5
SLIDE 5

Example (cont.)

  • Cost model:

◮ JC(X,Y) = CPU & I/O cost of joining relations X & Y ◮ CC(X) = Communication cost of sending relation X from one

site to another site

◮ JC(R, S) = 2000,

JC(T, U) = 2000

◮ JC(R ⊲

⊳ S, T) = 1000, JC(R ⊲ ⊳ S ⊲ ⊳ T, U) = 600

◮ JC(R ⊲

⊳ S, T ⊲ ⊳ U) = 100, CC(R ⊲ ⊳ S) = 200

100 2000

R S

2000

T U

200

600 1000 2000

R S T U

200

Plan 1: Total Cost = 4300 Plan 2: Total Cost = 3800

CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 5

slide-6
SLIDE 6

Example (cont.)

  • Cost model:

◮ JC(X,Y) = CPU & I/O cost of joining relations X & Y ◮ CC(X) = Communication cost of sending relation X from one

site to another site

◮ JC(R, S) = 2000,

JC(T, U) = 2000

◮ JC(R ⊲

⊳ S, T) = 1000, JC(R ⊲ ⊳ S ⊲ ⊳ T, U) = 600

◮ JC(R ⊲

⊳ S, T ⊲ ⊳ U) = 100, CC(R ⊲ ⊳ S) = 200

100 2000

R S

2000

T U

200

600 1000 2000

R S T U

200

Plan 1: Response Time = 2300 Plan 2: Response Time = 3800

CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 6

slide-7
SLIDE 7

Query Processing Steps

  • Query rewriting

◮ Query decomposition ⋆ Translates query into relational algebra query ◮ Data localization ⋆ Rewrites distributed query into a fragment query

  • Global query optimization

◮ Finds an optimal execution plan for query

  • Distributed query execution

◮ Executes query plan to compute query result CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 7

slide-8
SLIDE 8

Query Decomposition

SELECT * FROM R, S WHERE R.a = S.a AND R.b = 20

− →

⊲ ⊳R.a=S.a σb=20 R S

  • Normalization

◮ Rewrites query into some normal form

  • Semantic Analysis

◮ Checks that query is semantically correct

  • Simplification & Restructuring

◮ Rewrites query into simpler form (e.g., eliminates redundancy) CS4224/CS5424: Sem 1, 2019/20 Query Decomposition 8

slide-9
SLIDE 9

Normalization

  • A simple predicate defined on a relation R is of

the form “Ai op v” where Ai is an attribute of R,

  • p ∈ {=, =, <, ≤, >, ≥} and v ∈ Domain(Ai)
  • Conjunctive Normal Form (CNF)

(p11 ∨p12 · · ·∨p1n1) ∧ · · ·∧ (pm1 ∨pm2 · · ·∨pmnm)

  • Disjunctive Normal Form (DNF)

(p11 ∧p12 · · ·∧p1n1) ∨ · · ·∨ (pm1 ∧pm2 · · ·∧pmnm)

  • Each pij is a simple predicate

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Normalization 9

slide-10
SLIDE 10

Review of RA Equivalence Rules

attributes(R) = Set of attributes in schema of relation R attributes(p) = Set of attributes in predicate p

  • 1. Commutativity of binary operators

1.1 R × S ≡ S × R 1.2 R ⊲ ⊳ S ≡ S ⊲ ⊳ R

  • 2. Associativity of binary operators

2.1 (R × S) × T ≡ R × (S × T) 2.2 (R ⊲ ⊳ S) ⊲ ⊳ T ≡ R ⊲ ⊳ (S ⊲ ⊳ T)

  • 3. Idempotence of unary operators

3.1 πL′(πL(R)) ≡ πL′(R) if L′ ⊆ L ⊆ attributes(R) 3.2 σp1(σp2(R)) ≡ σp1∧p2(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 10

slide-11
SLIDE 11

Review of RA Equivalence Rules (cont.)

  • 4. Commutating selection with projection

4.1 πL(σp(R)) ≡ πL(σp(πL∪attributes(p)(R)))

  • 5. Commutating selection with binary
  • perators

5.1 σp(R × S) ≡ σp(R) × S if attributes(p) ⊆ attributes(R) 5.2 σp(R ⊲ ⊳p′ S) ≡ σp(R) ⊲ ⊳p′ S if attributes(p) ⊆ attributes(R) 5.3 σp(R ∪ S) ≡ σp(R) ∪ σp(S)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 11

slide-12
SLIDE 12

Review of RA Equivalence Rules (cont.)

  • 6. Commutating projection with binary
  • perators

Let L = LR ∪ LS, where LR ⊆ attributes(R) and LS ⊆ attributes(S)

6.1 πL(R × S) ≡ πLR(R) × πLS(S) 6.2 πL(R ⊲ ⊳p S) ≡ πLR(R) ⊲ ⊳p πLS(S) if attributes(p) ∩ attributes(R) ⊆ LR and attributes(p) ∩ attributes(S) ⊆ LS 6.3 πL(R ∪ S) ≡ πL(R) ∪ πL(S)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 12

slide-13
SLIDE 13

Example

Student (sid, sname, major) Course (cid, cname, area) Enrol (sid,cid, grade)

SELECT sname, cname FROM Student S, Course C, Enrol E WHERE E.sid = S.sid AND E.cid = C.cid AND major = "CS" AND area = "DB" AND grade = "F"

− →

πsname, cname σ(major=”CS”)∧(area=”DB”)∧(grade=”F”) ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid Enrol E Student S

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 13

slide-14
SLIDE 14

Example (cont.)

πsname, cname σ(major=”CS”)∧(area=”DB”)∧(grade=”F”) ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid Enrol E Student S

3.2

− →

πsname, cname σmajor=”CS” σarea=”DB” σgrade=”F” ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid Enrol E Student S

σp1(σp2(R)) ≡ σp1∧p2(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-15
SLIDE 15

Example (cont.)

πsname, cname σmajor=”CS” σarea=”DB” ⊲ ⊳C.cid=E.cid Course C σgrade=”F” ⊲ ⊳E.sid=S.sid Enrol E Student S

5.2

← −

πsname, cname σmajor=”CS” σarea=”DB” σgrade=”F” ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid Enrol E Student S

σp(R ⊲ ⊳p′ S) ≡ σp(R) ⊲ ⊳p′ S if attributes(p) ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-16
SLIDE 16

Example (cont.)

πsname, cname σmajor=”CS” σarea=”DB” ⊲ ⊳C.cid=E.cid Course C σgrade=”F” ⊲ ⊳E.sid=S.sid Enrol E Student S

5.2

− →

πsname, cname σmajor=”CS” σarea=”DB” ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E Student S

σp(R ⊲ ⊳p′ S) ≡ σp(R) ⊲ ⊳p′ S if attributes(p) ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-17
SLIDE 17

Example (cont.)

πsname, cname σmajor=”CS” ⊲ ⊳C.cid=E.cid σarea=”DB” Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E Student S

5.2

← −

πsname, cname σmajor=”CS” σarea=”DB” ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E Student S

σp(R ⊲ ⊳p′ S) ≡ σp(R) ⊲ ⊳p′ S if attributes(p) ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-18
SLIDE 18

Example (cont.)

πsname, cname σmajor=”CS” ⊲ ⊳C.cid=E.cid σarea=”DB” Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E Student S

5.2

− →

πsname, cname ⊲ ⊳C.cid=E.cid σarea=”DB” Course C σmajor=”CS” ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E Student

σp(R ⊲ ⊳p′ S) ≡ σp(R) ⊲ ⊳p′ S if attributes(p) ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-19
SLIDE 19

Example (cont.)

πsname, cname ⊲ ⊳C.cid=E.cid σarea=”DB” Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=”CS” Student S

5.2

← −

πsname, cname ⊲ ⊳C.cid=E.cid σarea=”DB” Course C σmajor=”CS” ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E Student

σp(R ⊲ ⊳p′ S) ≡ σp(R) ⊲ ⊳p′ S if attributes(p) ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-20
SLIDE 20

Example (cont.)

πsname, cname ⊲ ⊳C.cid=E.cid σarea=”DB” Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=”CS” Student S

3.1

− →

πsname, cname πsname, cname, cid ⊲ ⊳C.cid=E.cid σarea=”DB” Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=” Student

πL′(πL(R)) ≡ πL′(R) if L′ ⊆ L ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-21
SLIDE 21

Example (cont.)

πsname, cname ⊲ ⊳C.cid=E.cid πcname, cid σarea=”DB” Course C πsname, cid ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=”CS” Student S

6.2

← −

πsname, cname πsname, cname, cid ⊲ ⊳C.cid=E.cid σarea=”DB” Course C ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=” Student

πL(R ⊲ ⊳p S) ≡ πLR(R) ⊲ ⊳p πLS(S)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-22
SLIDE 22

Example (cont.)

πsname, cname ⊲ ⊳C.cid=E.cid πcname, cid σarea=”DB” Course C πsname, cid ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=”CS” Student S

3.1

− →

πsname, cname ⊲ ⊳C.cid=E.cid πcname, cid σarea=”DB” Course C πsname, cid πsname, cid, sid ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=” Student

πL′(πL(R)) ≡ πL′(R) if L′ ⊆ L ⊆ attributes(R)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-23
SLIDE 23

Example (cont.)

πsname, cname ⊲ ⊳C.cid=E.cid πcname, cid σarea=”DB” Course C πsname, cid ⊲ ⊳E.sid=S.sid πcid, sid σgrade=”F” Enrol E πsname, sid σmajor=”CS” Student S

6.2

← −

πsname, cname ⊲ ⊳C.cid=E.cid πcname, cid σarea=”DB” Course C πsname, cid πsname, cid, sid ⊲ ⊳E.sid=S.sid σgrade=”F” Enrol E σmajor=” Student

πL(R ⊲ ⊳p S) ≡ πLR(R) ⊲ ⊳p πLS(S)

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 14

slide-24
SLIDE 24

Example (cont.)

πsname, cname σ(major=”CS”)∧(area=”DB”)∧(grade=”F”) ⊲ ⊳C.cid=E.cid Course C ⊲ ⊳E.sid=S.sid Enrol E Student S πsname, cname ⊲ ⊳C.cid=E.cid πcname, cid σarea=”DB” Course C πsname, cid ⊲ ⊳E.sid=S.sid πcid, sid σgrade=”F” Enrol E πsname, sid σmajor=”CS” Student S

Original Query Final Query

CS4224/CS5424: Sem 1, 2019/20 Query Decomposition: Restructuring 15

slide-25
SLIDE 25

Query Localization

  • Rewrites distributed query into a fragment query
  • Uses data distribution information to determine

which fragments are involved

⊲ ⊳R.a=S.a σb=20 R S

− →

∪ ⊲ ⊳ S1 σb=20 R1 ⊲ ⊳ S2 σb=20 R2

Distributed Query Fragment Query

CS4224/CS5424: Sem 1, 2019/20 Query Localization 16

slide-26
SLIDE 26

Localization Program

  • A localization program for a fragmented relation

R is a reconstruction rule for R in terms of its fragments

  • Example:

◮ Let {R1, · · · , Rn} be a complete & disjoint partitioning of R ◮ If each Ri = σFi(R), then localization program for R is

R1 ∪ · · · ∪ Rn

◮ If each Ri = πLi(R) & key(R) ∈ Li, then localization program for

R is R1 ⊲ ⊳ · · · ⊲ ⊳ Rn

CS4224/CS5424: Sem 1, 2019/20 Query Localization 17

slide-27
SLIDE 27

Localized Query

  • Localized query = query with each fragmented

relation replaced by its localization program

R1 = σa≤10(R), R2 = σa>10(R) S1 = σa≤10(S), S2 = σa>10(S) ⊲ ⊳R.a=S.a σb=20 R S

− →

⊲ ⊳R.a=S.a σb=20 ∪ R1 R2 ∪ S1 S2

Distributed Query Localized Query

CS4224/CS5424: Sem 1, 2019/20 Query Localization 18

slide-28
SLIDE 28

Reduction Techniques

  • Reduction techniques = rewriting techniques to

simplify localized queries

  • Identify & eliminate query expressions on

fragments that do not contribute to query results

  • Techniques:

◮ Reduction for primary horizontal fragmentation ⋆ Reduction with selection ⋆ Reduction with join ◮ Reduction for derived horizontal fragmentation ◮ Reduction for vertical fragmentation ◮ Reduction for hybrid fragmentation CS4224/CS5424: Sem 1, 2019/20 Query Localization 19

slide-29
SLIDE 29

Reduction with Selection

Rule 1: σp(Ri) = ∅ if Ri = σFi(R) and Fi ∧ p = false

R = R1 ∪ R2 ∪ R3 R1 = σa<10(R) R2 = σa∈[10,70](R) R3 = σa>70(R) Q1 = σa=12(R) = σa=12(R1 ∪ R2 ∪ R3) = σa=12(R1) ∪ σa=12(R2) ∪ σa=12(R3) = σa=12(R2)

CS4224/CS5424: Sem 1, 2019/20 Query Localization 20

slide-30
SLIDE 30

Reduction with Join

Rule 2: Ri ⊲ ⊳a Sj = ∅ if Ri = σFa∧F(R), Sj = σF ′

a∧F ′(S),

Fa&F ′

a are predicates on attribute a, and

Fa ∧ F ′

a = false

R = R1 ∪ R2 ∪ R3 R1 = σa<10(R) R2 = σa∈[10,70](R) R3 = σa>70(R) S = S1 ∪ S2 S1 = σa<10(S) S2 = σa≥10(S)

Q = R ⊲ ⊳a S = (R1 ∪ R2 ∪ R3) ⊲ ⊳a (S1 ∪ S2) = (R1 ⊲ ⊳a S1) ∪ (R1 ⊲ ⊳a S2) ∪ (R2 ⊲ ⊳a S1) ∪ (R2 ⊲ ⊳a S2) ∪ (R3 ⊲ ⊳a S1) ∪ (R3 ⊲ ⊳a S2) = (R1 ⊲ ⊳a S1) ∪ (R2 ⊲ ⊳a S2) ∪ (R3 ⊲ ⊳a S2)

CS4224/CS5424: Sem 1, 2019/20 Query Localization 21

slide-31
SLIDE 31

Reduction for Derived Fragmentation

Rule 3: Si ⊲ ⊳a Rj = ∅ if Si = S ⋉a Ri is a derived horizontal fragmentation of S wrt R, and i = j Consider R(a, b, c), S(x, y, a) where S.a is a foreign key of R

  • R1 = σb=10(R),

R2 = σb=10(R)

  • S1 = S ⋉a R1,

S2 = S ⋉a R2 Q = σb=20(R) ⊲ ⊳a S = σb=20(R1 ∪ R2) ⊲ ⊳a (S1 ∪ S2) = σb=20(R2) ⊲ ⊳a (S1 ∪ S2) = (σb=20(R2) ⊲ ⊳a S1) ∪ (σb=20(R2) ⊲ ⊳a S2) = σb=20(R2) ⊲ ⊳a S2

CS4224/CS5424: Sem 1, 2019/20 Query Localization 22

slide-32
SLIDE 32

Reduction for Vertical Fragmentation

Rule 4: πL(R1 ⊲ ⊳ R2 ⊲ ⊳ · · · ⊲ ⊳ Rn) = πL(R2 ⊲ ⊳ · · · ⊲ ⊳ Rn) if R1, · · · , Rn are vertical fragments of R and (attributes(R1) − key(R)) ∩ L = ∅ Consider R(a,b,c) where R = R1 ⊲ ⊳a R2 R1 = πa,b(R) R2 = πa,c(R) Q = πc(R) = πc(R1 ⊲ ⊳a R2) = πc(R2)

CS4224/CS5424: Sem 1, 2019/20 Query Localization 23

slide-33
SLIDE 33

Reduction for Hybrid Fragmentation

R = (R1 ∪ R2) ⊲ ⊳a R3 R1 = πa,b(σa<10(R)) R2 = πa,b(σa≥10(R)) R3 = πa,c(R) Q = πb(σa=20(R)) = πb(σa=20((R1 ∪ R2) ⊲ ⊳a R3) = πb(σa=20(R1 ∪ R2)) = πb(σa=20(R1) ∪ σa=20(R2)) = πb(σa=20(R2))

CS4224/CS5424: Sem 1, 2019/20 Query Localization 24

slide-34
SLIDE 34

Distributed Join Strategies for R ⊲ ⊳A S

  • Assume that both R & S have been partitioned
  • ver all the nodes
  • Collocated/Local join Both R and S have been

partitioned on join key.

  • Directed join Only table R has been partitioned
  • n join key. Dynamically repartition S on join

key.

  • Broadcast join Replicate table S to all nodes.
  • Repartitioned join Neither of the tables has been

partitioned on join key. Repartition both tables

  • ver all the nodes using the join key as

partitioning key.

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 25

slide-35
SLIDE 35

Join Strategies: Example

  • Customers (cust#, cname, city)
  • Orders (order#, cust#, odate)
  • Suppliers (supp#, sname, city)

Customers cust# cname city 1 Alice Singapore 2 Bob Penang 3 Carol Bangkok 4 Dave Singapore 5 Eve Singapore 6 Fred Penang 7 George Bangkok Orders

  • rder#

cust#

  • date

302 1 June 2013 304 2 May 2013 307 3 Nov 2013 308 1 April 2013 309 5 May 2013 311 6 Dec 2013 312 3 July 2013 Suppliers supp# sname city 32 A Bangkok 33 B Singapore 34 C Singapore 36 D Bangkok 37 E Penang 38 F Penang 39 G Singapore CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 26

slide-36
SLIDE 36

Collocated Join: Example

  • Customers hash partitioned on cust# using hcust
  • Orders hash partitioned on cust# using hcust
  • hcust(c) = (c mod 3) + 1

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Orders1

  • rder#

cust#

  • date

307 3 Nov 2013 311 6 Dec 2013 312 3 July 2013 Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Orders2

  • rder#

cust#

  • date

302 1 June 2013 308 1 April 2013 Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Orders3

  • rder#

cust#

  • date

304 2 May 2013 309 5 May 2013 CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 27

slide-37
SLIDE 37

Collocated Join: Example (cont.)

  • Query 1: Customers ⊲

⊳cust# Orders

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Orders1

  • rder#

cust#

  • date

307 3 Nov 2013 311 6 Dec 2013 312 3 July 2013 Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Orders2

  • rder#

cust#

  • date

302 1 June 2013 308 1 April 2013 Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Orders3

  • rder#

cust#

  • date

304 2 May 2013 309 5 May 2013

3

  • i=1

(Customersi ⊲ ⊳ Ordersi)

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 28

slide-38
SLIDE 38

Directed Join: Example

  • Customers hash partitioned on cust# using hcust
  • Orders hash partitioned on order# using horder
  • hcust(c) = (c mod 3) + 1
  • horder(o) = (o mod 3) + 1

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Orders1

  • rder#

cust#

  • date

309 5 May 2013 312 3 July 2013 Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Orders2

  • rder#

cust#

  • date

304 2 May 2013 307 3 Nov 2013 Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Orders3

  • rder#

cust#

  • date

302 1 June 2013 308 1 April 2013 311 6 Dec 2013 CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 29

slide-39
SLIDE 39

Directed Join: Example (cont.)

  • Query 1: Customers ⊲

⊳cust# Orders

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Orders1

  • rder#

cust#

  • date

309 5 May 2013 312 3 July 2013 Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Orders2

  • rder#

cust#

  • date

304 2 May 2013 307 3 Nov 2013 Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Orders3

  • rder#

cust#

  • date

302 1 June 2013 308 1 April 2013 311 6 Dec 2013 CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 30

slide-40
SLIDE 40

Directed Join: Example (cont.)

  • Query 1: Customers ⊲

⊳cust# Orders

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Orders1

  • rder#

cust#

  • date

309 5 May 2013 312 3 July 2013 Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Orders2

  • rder#

cust#

  • date

304 2 May 2013 307 3 Nov 2013 Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Orders3

  • rder#

cust#

  • date

302 1 June 2013 308 1 April 2013 311 6 Dec 2013

Repartition Orders on cust# : Orders′

i = σhcust(cust#)=i(Orders)

3

  • i=1

(Customersi ⊲ ⊳ Orders′

i )

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 31

slide-41
SLIDE 41

Broadcast Join: Example

  • Customers hash partitioned on cust# using hcust
  • Suppliers hash partitioned on supp# using hsupp
  • hcust(c) = (c mod 3) + 1
  • hsupp(s) = (s mod 3) + 1

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 32

slide-42
SLIDE 42

Broadcast Join: Example (cont.)

  • Query 2: Customers ⊲

⊳city Suppliers

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 33

slide-43
SLIDE 43

Broadcast Join: Example (cont.)

Query 2: Customers ⊲ ⊳city Suppliers

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang

Option 1: Broadcast Suppliers

3

  • i=1

(Customersi ⊲ ⊳ Suppliers)

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 34

slide-44
SLIDE 44

Broadcast Join: Example (cont.)

  • Query 2: Customers ⊲

⊳city Suppliers

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang

Option 2: Broadcast Customers

3

  • i=1

(Customers ⊲ ⊳ Suppliersi)

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 35

slide-45
SLIDE 45

Repartitioned Join: Example

  • Customers hash partitioned on cust# using hcust
  • Suppliers hash partitioned on supp# using hsupp
  • hcust(c) = (c mod 3) + 1
  • hsupp(s) = (s mod 3) + 1

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 36

slide-46
SLIDE 46

Repartitioned Join: Example (cont.)

  • Query 2: Customers ⊲

⊳city Suppliers

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 37

slide-47
SLIDE 47

Repartitioned Join: Example (cont.)

  • Query 2: Customers ⊲

⊳city Suppliers

  • Repartition both tables using hcity

c hcity(c) Singapore 1 Penang 2 Bangkok 3

  • Customers′

i = σhcity(city)=i(Customers)

  • Suppliers′

i = σhcity(city)=i(Suppliers)

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 38

slide-48
SLIDE 48

Repartitioned Join: Example (cont.)

Query 2: Customers ⊲ ⊳city Suppliers

Site 1 Customers1 cust# cname city 3 Carol Bangkok 6 Fred Penang Suppliers1 supp# sname city 33 B Singapore 36 D Bangkok 39 G Singapore Site 2 Customers2 cust# cname city 1 Alice Singapore 4 Dave Singapore 7 George Bangkok Suppliers2 supp# sname city 34 C Singapore 37 E Penang Site 3 Customers3 cust# cname city 2 Bob Penang 5 Eve Singapore Suppliers3 supp# sname city 32 A Bangkok 38 F Penang

3

  • i=1

(Customers′

i ⊲

⊳ Suppliers′

i )

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 39

slide-49
SLIDE 49

Comparison of join strategies for R ⊲ ⊳ S

Let the relations be partitioned over n nodes Join Strategy Communication Cost Collocated Directed size(R) if R is being repartitioned Broadcast (n − 1) × size(R) if R is being broadcast Repartitioned size(R) + size(S)

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies 40

slide-50
SLIDE 50

Query Processing in Google’s F1

  • Google’s early NewSQL system

◮ Distributed relational database system ◮ Hybrid of NoSQL & RDBMS ◮ NoSQL: High availability, scalability ◮ RDBMS: Consistency, usability

  • Used by Google’s AdWords system since 2012

◮ 100s of applications & 1000s of users ◮ Database is over 100 TB, 105 requests/sec CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies: Google’s F1 41

slide-51
SLIDE 51

Query Processing in Google’s F1

SELECT agc.CampaignId, ac.Region, c.Language, SUM(ac.Clicks) FROM AdClick ac JOIN AdGroupCreative agc USING (AdGroupId, CreativeId) JOIN Creative c USING (CustomerId, CreativeId) WHERE c.Date = ’2013-02-23’ GROUP BY agc.CampaignId, ac.Region, c.Language

Creative (CreativeId, CustomerId, Language, ...) AdgroupCreative (AdGroupId, CreativeId, CampaignId, CustomerId, ...) Adclick (AdGroupId, CreativeId, Region, Date, Clicks, ...)

(Shute, et al., 2013)

CS4224/CS5424: Sem 1, 2019/20 Distributed Join Strategies: Google’s F1 42

slide-52
SLIDE 52

References

  • T. Özsu & P

. Valdureiz, Overview of Query Processing, Chapter 6, Principles of Distributed Database Systems, 3rd Edition, 2011

  • T. Özsu & P

. Valdureiz, Query Decomposition and Data Localization, Chapter 7, Principles of Distributed Database Systems, 3rd Edition, 2011

  • C. Baru, et al. DB2 Parallel Edition, IBM Systems Journal,

34(2), 1995.

  • Google goes back to the future with SQL F1 database,

The Register, August 2013, https://www.theregister.co.uk/2013/08/30/google_f1_deepdive/

  • F1: A Distributed SQL Database That Scales, VLDB

2013, https://research.google.com/pubs/pub41344.html

CS4224/CS5424: Sem 1, 2019/20 Distributed Query Processing 43