

SLIDE 1

Query Processing with Optimal Communication Cost

Magdalena Balazinska and Dan Suciu University of Washington

AITF 2017

SLIDE 2

Context

Past: NSF Big Data grant

  • PhD student Paris Koutris received the ACM SIGMOD Jim Gray Dissertation Award

Current: AiTF Grant

  • PIs: Magda Balazinska, Dan Suciu
  • Student: Walter Cai

SLIDE 3

Basic Question

  • How much communication is needed to compute a query Q on p servers?

  • Parallel data processing
    – Gamma, MapReduce, Hive, Teradata, Aster Data, Spark, Impala, Myria, TensorFlow
    – See Magda Balazinska’s current class

SLIDE 4

Background

  • Q conjunctive query; ρ* = its fractional edge covering number

  • Thm. [Atserias, Grohe, Marx 2011] If every input relation has size ≤ m, then |Output(Q)| ≤ m^{ρ*}

  • Example: Q(x,y,z) :- R(x,y) ∧ S(y,z) ∧ T(z,x)
    If |R|, |S|, |T| ≤ m then |Output(Q)| ≤ m^{3/2}

[Figure: triangle hypergraph on variables x, y, z, weight ½ on each edge; ρ* = 3/2]
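The bound on this example can be sanity-checked in a few lines. A minimal sketch in plain Python (the join routine, domain size, and relation sizes are illustrative choices, not from the slides):

```python
import random

def triangle_join(R, S, T):
    # Q(x,y,z) :- R(x,y) ∧ S(y,z) ∧ T(z,x), via index nested loops
    S_by_y = {}
    for (y, z) in S:
        S_by_y.setdefault(y, []).append(z)
    T_set = set(T)
    return [(x, y, z)
            for (x, y) in R
            for z in S_by_y.get(y, ())
            if (z, x) in T_set]

random.seed(0)
dom = 50  # small domain so that triangles actually occur
R = {(random.randrange(dom), random.randrange(dom)) for _ in range(1000)}
S = {(random.randrange(dom), random.randrange(dom)) for _ in range(1000)}
T = {(random.randrange(dom), random.randrange(dom)) for _ in range(1000)}
m = max(len(R), len(S), len(T))
out = triangle_join(R, S, T)
# AGM with ρ* = 3/2: the output never exceeds m^{3/2}
assert len(out) <= m ** 1.5
```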

SLIDE 5

Massively Parallel Communication Model (MPC)

[Figure: servers 1 … p, each holding O(m/p) of the input (size = m)]

Input data = size m
Number of servers = p
Extends BSP [Valiant]

SLIDE 6

Massively Parallel Communication Model (MPC)

[Figure: Round 1 — servers 1 … p exchange data, ≤ L per server]

Input data = size m
Number of servers = p
One round = Compute & communicate
Extends BSP [Valiant]

SLIDE 7

Massively Parallel Communication Model (MPC)

[Figure: Rounds 1, 2, 3, … — servers 1 … p exchange ≤ L per server per round]

Input data = size m
Number of servers = p
One round = Compute & communicate
Algorithm = Several rounds
Extends BSP [Valiant]

SLIDE 8

Massively Parallel Communication Model (MPC)

[Figure: Rounds 1, 2, 3, … — servers 1 … p exchange ≤ L per server per round]

Input data = size m
Number of servers = p
One round = Compute & communicate
Algorithm = Several rounds
Max communication load / round / server = L
Extends BSP [Valiant]

SLIDE 9

Massively Parallel Communication Model (MPC)

[Figure: Rounds 1, 2, 3, … — servers 1 … p exchange ≤ L per server per round]

Input data = size m
Number of servers = p
One round = Compute & communicate
Algorithm = Several rounds
Max communication load / round / server = L
Extends BSP [Valiant]

Cost:
             Ideal    Practical (ε∈(0,1))   Naïve 1   Naïve 2
  Load L     m/p      m/p^{1-ε}             m         m/p
  Rounds r   1        O(1)                  1         p

SLIDE 14

A Naïve Lower Bound

  • Query Q
  • Inputs R, S, T, … s.t. |Output(Q)| = m^{ρ*}
  • Algorithm with load L
  • After r rounds, one server “knows” ≤ L·r tuples: it can output ≤ (L·r)^{ρ*} tuples (AGM)
  • The p servers together compute |Output(Q)| = m^{ρ*}, hence p·(L·r)^{ρ*} ≥ m^{ρ*}

  • Thm. Any r-round algorithm has L ≥ m / (r·p^{1/ρ*})
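The algebra in the last step can be checked mechanically; a tiny sketch (the function name is illustrative):

```python
def load_lower_bound(m, p, r, rho_star):
    # Solve p * (L*r)**rho_star >= m**rho_star for L
    return m / (r * p ** (1.0 / rho_star))

# Triangle query: ρ* = 3/2, so one round on p = 64 servers needs
# L ≥ m / p^{2/3} = m / 16.
bound = load_lower_bound(m=10**6, p=64, r=1, rho_star=1.5)
```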
SLIDE 15

Speedup

[Figure: speed = O(1/L) vs. # processors (= p)]

A load of L = m/p corresponds to linear speedup.
A load of L = m/p^{1-ε} corresponds to sub-linear speedup.
What is the theoretically optimal load L = f(m,p)? Is this the right question in the field?

SLIDE 16

Join of Two Tables

Join(x,y,z) = R(x,y) ∧ S(y,z)    |R| = |S| = m tuples

In the field:
  • Hash-join on y: L = m/p (w/o skew)
  • Broadcast-join: L ≈ m

In theory: L ≥ m / p^{1/2}

[Figure: path hypergraph x–y–z, weight 1 on each edge; ρ* = 2]
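The first strategy is easy to simulate in a single process. A sketch of hash-join on y (the name `hash_join` and the per-server bookkeeping are illustrative, not from the talk):

```python
from collections import defaultdict

def hash_join(R, S, p):
    # Route R(x,y) and S(y,z) to server hash(y) % p, then join locally.
    frags = [([], []) for _ in range(p)]
    for (x, y) in R:
        frags[hash(y) % p][0].append((x, y))
    for (y, z) in S:
        frags[hash(y) % p][1].append((y, z))
    out = []
    for R_loc, S_loc in frags:
        S_by_y = defaultdict(list)
        for (y, z) in S_loc:
            S_by_y[y].append(z)
        for (x, y) in R_loc:
            out.extend((x, y, z) for z in S_by_y[y])
    # L = max number of tuples received by any one server
    load = max(len(R_loc) + len(S_loc) for R_loc, S_loc in frags)
    return out, load
```

On skew-free inputs each server receives about 2m/p tuples, i.e. L = O(m/p); a single heavy y value drives the load up to m, which is what the skew slides later address.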

SLIDE 17

Triangles

Triangles(x,y,z) = R(x,y) ∧ S(y,z) ∧ T(z,x)    |R| = |S| = |T| = m tuples

State of the art:
  • Hash-join, two rounds
    – Problem: intermediate result too big!
  • Broadcast S, T, one round
    – Problem: two local tables are huge!

SLIDE 18

Triangles in One Round

Triangles(x,y,z) = R(x,y) ∧ S(y,z) ∧ T(z,x)    |R| = |S| = |T| = m tuples
[Afrati&Ullman’10] [Beame’13,’14]

  • Place servers in a cube: p = p^{1/3} × p^{1/3} × p^{1/3}
  • Each server identified by coordinates (i,j,k)

[Figure: cube of side p^{1/3}; server (i,j,k)]

SLIDE 19

Triangles in One Round

Triangles(x,y,z) = R(x,y) ∧ S(y,z) ∧ T(z,x)    |R| = |S| = |T| = m tuples

Round 1:
  Send R(x,y) to all servers (h1(x), h2(y), *)
  Send S(y,z) to all servers (*, h2(y), h3(z))
  Send T(z,x) to all servers (h1(x), *, h3(z))
Output: each server computes R(x,y) ∧ S(y,z) ∧ T(z,x) locally

[Figure: example tuples of R, S, T (Fred, Jim, Jack, Alice, Carol, …) routed to the cube of side p^{1/3}]
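The round above can be simulated in one process. A sketch assuming simple modular hashes for h1, h2, h3 (the function and variable names are illustrative):

```python
from collections import defaultdict

def hypercube_triangles(R, S, T, p_side):
    # p = p_side^3 servers, one per coordinate (i,j,k) in the cube
    h1 = h2 = h3 = lambda v: hash(v) % p_side
    frag = defaultdict(lambda: ([], [], []))
    for (x, y) in R:                     # R(x,y) -> (h1(x), h2(y), *)
        for k in range(p_side):
            frag[(h1(x), h2(y), k)][0].append((x, y))
    for (y, z) in S:                     # S(y,z) -> (*, h2(y), h3(z))
        for i in range(p_side):
            frag[(i, h2(y), h3(z))][1].append((y, z))
    for (z, x) in T:                     # T(z,x) -> (h1(x), *, h3(z))
        for j in range(p_side):
            frag[(h1(x), j, h3(z))][2].append((z, x))
    out = set()
    for (R_loc, S_loc, T_loc) in frag.values():  # local joins
        S_by_y = defaultdict(list)
        for (y, z) in S_loc:
            S_by_y[y].append(z)
        T_set = set(T_loc)
        for (x, y) in R_loc:
            for z in S_by_y.get(y, ()):
                if (z, x) in T_set:
                    out.add((x, y, z))
    return out

R = {(1, 2), (2, 3), (4, 2)}
S = {(2, 3), (3, 1), (2, 5)}
T = {(3, 1), (5, 4), (1, 2)}
assert hypercube_triangles(R, S, T, p_side=2) == {(1, 2, 3), (2, 3, 1), (4, 2, 5)}
```

Each triangle is produced at exactly one server, (h1(x), h2(y), h3(z)), and each input tuple is replicated to only p^{1/3} servers, which is where the m/p^{2/3} load on the next slide comes from.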

SLIDE 20

Communication load per server

Triangles(x,y,z) = R(x,y) ∧ S(y,z) ∧ T(z,x)    |R| = |S| = |T| = m tuples

Can we compute Triangles with L = m/p? No!

Theorem. Assuming “no skew”, HyperCube computes Triangles with L = O(m/p^{2/3}) w.h.p.
Theorem. Any 1-round algorithm has L = Ω(m/p^{2/3}), even on inputs with no skew.

SLIDE 21

1.1M triples of Twitter data → 220k triangles; p = 64

Triangles(x,y,z) = R(x,y) ∧ S(y,z) ∧ T(z,x)    |R| = |S| = |T| = 1.1M

[Figure: wall clock time, total CPU time, and number of tuples shuffled for: 2-round hash-join; 1-round broadcast; 1-round HyperCube with local 1- or 2-step hash-join; 1-round HyperCube with local 1-step Leapfrog Trie-join (a.k.a. Generic-Join)]

SLIDE 22

1.1M triples of Twitter data → 220k triangles; p = 64

Triangles(x,y,z) = R(x,y) ∧ S(y,z) ∧ T(z,x)    |R| = |S| = |T| = 1.1M

SLIDE 23

General Case

Theorem. The optimal load for computing Q in one round on skew-free data is L = O(m / p^{1/τ*}),
where τ* = fractional vertex cover number of Q’s hypergraph.
  – Triangle query: τ* = 3/2 (weight ½ per vertex), L = m/p^{2/3}
  – Two-table join: τ* = 1, L = m/p

Thm. Any r-round algorithm has L ≥ m / (r·p^{1/ρ*}),
where ρ* = fractional edge cover number of Q’s hypergraph.
  – Two-table join: ρ* = 2, L ≥ m/p^{1/2}
  – Triangle query: ρ* = 3/2, L ≥ m/p^{2/3}

SLIDE 24

Skew

  • Skewed data is a major impediment to parallel data processing

  • Practical solutions:
    – Deal with stragglers, hope they eventually terminate
    – Remove heavy hitters from the computation

  • Our approach:
    – Query → Residual Query
    – Join R(x,y) ∧ S(y,z) → Cartesian Product R(x) ∧ S(z)

SLIDE 25

Skewed Values → New Query

Join(x,y,z) = R(x,y) ∧ S(y,z)
  • No skew: τ* = 1, L = m/p
  • Skewed (y = a single value of degree m): Join becomes Product(x,z) = R(x) ∧ S(z); τ* = 2, L = m/p^{1/2}

[Figure: R(x) × S(z) on a p^{1/2} × p^{1/2} grid of servers]
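The rewriting can be sketched as a preprocessing step: count degrees, peel off heavy y-values, and hand each one off as a residual product query. A sketch (`split_by_skew` and `threshold` are illustrative names, not the talk’s exact algorithm):

```python
from collections import Counter

def split_by_skew(R, S, threshold):
    # Split Join(x,y,z) = R(x,y) ∧ S(y,z) on heavy y-values.
    # Light y's keep the ordinary hash-join; each heavy y leaves the
    # residual query Product(x,z) = R'(x) ∧ S'(z).
    deg_R = Counter(y for (_, y) in R)
    deg_S = Counter(y for (y, _) in S)
    heavy = {y for y in deg_R.keys() | deg_S.keys()
             if max(deg_R[y], deg_S[y]) > threshold}
    light_R = [(x, y) for (x, y) in R if y not in heavy]
    light_S = [(y, z) for (y, z) in S if y not in heavy]
    residual = {y: ([x for (x, yy) in R if yy == y],   # R'(x)
                    [z for (yy, z) in S if yy == y])   # S'(z)
                for y in heavy}
    return light_R, light_S, residual
```

The light part proceeds with the usual hash-join (τ* = 1, L = m/p); each residual pair is a Cartesian product, computed on a p^{1/2} × p^{1/2} grid with load m/p^{1/2}.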

SLIDE 26

Summary of Results so Far

  • One round
    – No skew: optimal load = m / p^{1/τ*}
    – Skew: provably higher

  • Multiple rounds
    – Lower bound: load ≥ m / p^{1/ρ*}
    – All relations binary: optimal load = m / p^{1/ρ*} [PODS’2017a]
    – Arbitrary relations: optimal load = ?? Open

  • Additional statistics: keys, degree constraints [PODS’2017b]

Thank you!