SCALABLE DISTRIBUTED SUBGRAPH ENUMERATION


slide-1
SLIDE 1

SCALABLE DISTRIBUTED SUBGRAPH ENUMERATION

AUTHORS: LONGBIN LAI LU QIN XUEMIN LIN YING ZHANG LIJUN CHANG

slide-2
SLIDE 2

OUTLINE

  • PROBLEM DEFINITION
  • ALGORITHM FRAMEWORK
  • TWINTWIG JOIN - VLDB'15
  • SEED
  • EXPERIMENTS
  • CONCLUSION

slide-3
SLIDE 3

PROBLEM

slide-4
SLIDE 4

SUBGRAPH ENUMERATION

PROBLEM DEFINITION

  • Given a data graph $G$ and a pattern graph $P$, subgraph enumeration aims to find all subgraphs $g \subseteq G$ (matches) that are isomorphic to $P$.

Figure: pattern graph P (nodes v1 to v4) and data graph G (nodes u1 to u6), with the match (v1, v2, v3, v4) → (u1, u2, u5, u3) highlighted.

slide-5
SLIDE 5
SUBGRAPH ENUMERATION

PROBLEM DEFINITION

  • Given a data graph $G$ and a pattern graph $P$, subgraph enumeration aims to find all subgraphs $g \subseteq G$ (matches) that are isomorphic to $P$.

Figure: pattern graph P (nodes v1 to v4) and data graph G (nodes u1 to u6), with the match (v1, v2, v3, v4) → (u4, u2, u3, u5) highlighted.

slide-6
SLIDE 6
SUBGRAPH ENUMERATION

PROBLEM DEFINITION

  • Given a data graph $G$ and a pattern graph $P$, subgraph enumeration aims to find all subgraphs $g \subseteq G$ (matches) that are isomorphic to $P$.

Figure: pattern graph P (nodes v1 to v4) and data graph G (nodes u1 to u6), with the match (v1, v2, v3, v4) → (u6, u3, u2, u5) highlighted.
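
The problem definition can be made concrete with a brute-force sketch. It is not the method of the slides, only a centralised baseline: every injective assignment of pattern nodes to data nodes is tried, and an assignment is kept when each pattern edge lands on a data edge. The function name `enumerate_matches` and the toy graphs are hypothetical, since the edges of the slide's figure are not recoverable.

```python
from itertools import permutations

def enumerate_matches(pattern_edges, data_edges):
    """Return every injective mapping (pattern node -> data node)
    under which each pattern edge is also an edge of the data graph."""
    p_nodes = sorted({v for e in pattern_edges for v in e})
    d_nodes = sorted({u for e in data_edges for u in e})
    d_set = {frozenset(e) for e in data_edges}  # undirected edge lookup
    found = []
    for image in permutations(d_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(frozenset((f[a], f[b])) in d_set for a, b in pattern_edges):
            found.append(f)
    return found

# Hypothetical toy input: a triangle pattern and a small data graph.
triangle = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]
data = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"), ("u3", "u4")]

matches = enumerate_matches(triangle, data)
subgraphs = {frozenset(m.values()) for m in matches}
print(len(matches), len(subgraphs))
```

The same subgraph is reported once per automorphism of the pattern (six times for a triangle), which is why the queries later in the deck carry order constraints such as v1 < v2 < v3 < v4.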

slide-7
SLIDE 7

FRAMEWORK

slide-8
SLIDE 8

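PATTERN DECOMPOSITION

Figure: pattern graph P (nodes v1 to v4) and data graph G (nodes u1 to u6); P is split into three join units p0 (nodes v1, v2, v4), p1 (nodes v2, v4, v3) and p2 (nodes v4, v3).

Join Units

$P = p_0 \cup p_1 \cup p_2$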

slide-9
SLIDE 9
WHAT CAN BE JOIN UNITS

  • Graph Storage: $\Phi(G) = \{G_u \mid u \in V(G)\}$
  • Stored as $(u; G_u)$ for each data node $u$
  • $G_u$: Local Graph of $u$ s.t.
  • (1) Connected
  • (2) $u \in V(G_u)$
  • (3) $\bigcup_{u \in V(G)} E(G_u) = E(G)$
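
The three conditions on a local graph can be checked mechanically. The sketch below is illustrative only: the representation (a dict from node to a pair of node set and edge set) and the names `is_connected`, `is_valid_storage` and `star_storage` are assumptions of this example, and the toy graph is hypothetical.

```python
from collections import deque

def is_connected(nodes, edges):
    """Breadth-first search over an undirected graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for y in adj[queue.popleft()] - seen:
            seen.add(y)
            queue.append(y)
    return seen == set(nodes)

def is_valid_storage(graph_edges, storage):
    """storage maps u -> (node set of G_u, edge set of G_u)."""
    covered = set()
    for u, (nodes, edges) in storage.items():
        if u not in nodes:                  # condition (2)
            return False
        if not is_connected(nodes, edges):  # condition (1)
            return False
        covered |= {frozenset(e) for e in edges}
    return covered == {frozenset(e) for e in graph_edges}  # condition (3)

def star_storage(graph_edges):
    """One hypothetical storage: each node together with its incident edges."""
    nbrs = {}
    for a, b in graph_edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return {u: ({u} | ns, {(u, w) for w in ns}) for u, ns in nbrs.items()}

G = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u4")]
print(is_valid_storage(G, star_storage(G)))
print(is_valid_storage(G, {"u1": ({"u1"}, set())}))
```

The second call fails condition (3): a storage holding only the isolated node u1 covers none of the edges of G.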

slide-10
SLIDE 10
WHAT CAN BE JOIN UNITS

  • A structure $p$ can be a join unit iff $R_G(p) = \bigcup_{u \in V(G)} R_{G_u}(p)$
  • $R_G(p)$ stands for the matches of $p$ in $G$
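
The join-unit condition can be tested on a single data graph by comparing the matches found in G with the union of the matches found in every local graph. The sketch below does this by brute force. It assumes star-shaped local graphs (each node with its incident edges) as the storage, and the function names are hypothetical. Passing the test on one graph does not prove the condition in general, but failing it is a counterexample.

```python
from itertools import permutations

def matches(pattern_edges, data_edges):
    """All edge-preserving injective mappings, as hashable sorted tuples."""
    p_nodes = sorted({v for e in pattern_edges for v in e})
    d_nodes = sorted({x for e in data_edges for x in e})
    d_set = {frozenset(e) for e in data_edges}
    found = set()
    for image in permutations(d_nodes, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(frozenset((f[a], f[b])) in d_set for a, b in pattern_edges):
            found.add(tuple(sorted(f.items())))
    return found

def satisfies_join_unit_condition(pattern_edges, data_edges):
    """Compare R_G(p) with the union of R_{G_u}(p) over star-shaped local graphs."""
    nodes = {x for e in data_edges for x in e}
    local_union = set()
    for u in nodes:
        local_edges = [e for e in data_edges if u in e]  # assumed local graph of u
        local_union |= matches(pattern_edges, local_edges)
    return local_union == matches(pattern_edges, data_edges)

G = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]          # hypothetical data graph
star = [("v1", "v2"), ("v1", "v3")]                      # two-edge star
triangle = [("v1", "v2"), ("v2", "v3"), ("v1", "v3")]

print(satisfies_join_unit_condition(star, G))
print(satisfies_join_unit_condition(triangle, G))
```

With star-shaped local graphs the star passes and the triangle fails, because no single local graph contains the edge between two neighbours.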

slide-11
SLIDE 11

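JOIN PLAN (TREE)

  • Decomposing: $P = p_0 \cup p_1 \cup p_2 \cup p_3$
  • Solving: $R(P) = R(p_0) \bowtie R(p_1) \bowtie R(p_2) \bowtie R(p_3)$

Figure: join tree with root $R(P)$, intermediate results $R(P'_1)$ and $R(P'_2)$, and leaves $R(p_0)$, $R(p_1)$, $R(p_2)$, $R(p_3)$, combined by three joins ($\bowtie$).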
slide-12
SLIDE 12

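JOIN PLAN (TREE)

  • Decomposing: $P = p_0 \cup p_1 \cup p_2 \cup p_3$
  • Solving: $R(P) = R(p_0) \bowtie R(p_1) \bowtie R(p_2) \bowtie R(p_3)$

Figure: join tree with root $R(P)$, intermediate results $R(P'_1)$ and $R(P'_2)$, and leaves $R(p_0)$, $R(p_1)$, $R(p_2)$, $R(p_3)$, combined by three joins ($\bowtie$).

The matches of each join unit can be computed online, independently in each local graph.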

slide-13
SLIDE 13

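JOIN PLAN (TREE)

  • Decomposing: $P = p_0 \cup p_1 \cup p_2 \cup p_3$
  • Solving: $R(P) = R(p_0) \bowtie R(p_1) \bowtie R(p_2) \bowtie R(p_3)$

Figure: two join trees for the same decomposition, a left-deep tree and a bushy tree. Each has root $R(P)$, intermediate results $R(P'_1)$ and $R(P'_2)$, and leaves $R(p_0)$, $R(p_1)$, $R(p_2)$, $R(p_3)$, combined by three joins ($\bowtie$).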
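
A minimal in-memory sketch of the join step follows. It is not the distributed implementation: matches are plain dicts from pattern node to data node, `join` is a nested-loop natural join on the shared pattern nodes, and the four one-edge join units of a 4-cycle pattern are a hypothetical decomposition chosen to keep the example small. It shows that a left-deep and a bushy arrangement of the same leaves produce the same final result.

```python
def join(r1, r2):
    """Natural join of two match lists on their shared pattern nodes."""
    out = []
    for m1 in r1:
        for m2 in r2:
            shared = m1.keys() & m2.keys()
            if any(m1[v] != m2[v] for v in shared):
                continue
            merged = {**m1, **m2}
            # A match must map distinct pattern nodes to distinct data nodes.
            if len(set(merged.values())) == len(merged):
                out.append(merged)
    return out

def edge_matches(a, b, data_edges):
    """Matches of the one-edge unit (a, b): every data edge in both directions."""
    return [m for x, y in data_edges for m in ({a: x, b: y}, {a: y, b: x})]

# Hypothetical data graph: a square u1-u2-u3-u4 with a pendant node u5.
data = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u1"), ("u4", "u5")]

r0 = edge_matches("v1", "v2", data)
r1 = edge_matches("v2", "v3", data)
r2 = edge_matches("v3", "v4", data)
r3 = edge_matches("v4", "v1", data)

left_deep = join(join(join(r0, r1), r2), r3)
bushy = join(join(r0, r1), join(r2, r3))

canon = lambda rs: {tuple(sorted(m.items())) for m in rs}
print(canon(left_deep) == canon(bushy), len(left_deep))
```

The two plans agree on the output but not on the intermediate results they materialise, which is what a cost-based choice between them has to weigh.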

slide-14
SLIDE 14

DESCRIBE THE ALGORITHMS

  • Graph storage mechanism
  • Determine the join units, and thereby the pattern decomposition
  • Join structure
  • Left-deep tree vs bushy tree

slide-15
SLIDE 15

TWINTWIG JOIN - VLDB'15

slide-16
SLIDE 16

SIMPLE GRAPH STORAGE

TWINTWIG JOIN - VLDB2015

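  • The simple graph storage, each local graph $G_u$:

$V(G_u) = \{u\} \cup N(u)$

$E(G_u) = \{(u, u') \mid u' \in N(u)\}$

Figure: data graph G (nodes u1 to u6) with two of its local graphs, $G_{u_1}$ on nodes u1, u2, u3 and $G_{u_2}$ on nodes u2, u1, u3, u4, u5.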

slide-17
SLIDE 17

SIMPLE GRAPH STORAGE

TWINTWIG JOIN - VLDB2015

  • The simple graph storage, where

$V(G_u) = \{u\} \cup N(u)$

$E(G_u) = \{(u, u') \mid u' \in N(u)\}$

… Star as the join unit

slide-18
SLIDE 18

SIMPLE GRAPH STORAGE

TWINTWIG JOIN - VLDB2015

  • The simple graph storage, where

$V(G_u) = \{u\} \cup N(u)$

$E(G_u) = \{(u, u') \mid u' \in N(u)\}$

… Star as the join unit

A node with degree 1,000,000 will generate $10^{18}$ 3-stars
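
The order of magnitude can be reproduced with a short calculation. The sketch assumes the slide counts ordered choices of three distinct neighbours as leaves, which gives roughly $d^3$ for degree $d$; if leaf order is ignored the count is six times smaller. Variable names are illustrative.

```python
from math import comb, perm

degree = 1_000_000

# Ordered choices of three distinct leaves around one centre node.
ordered_3_stars = perm(degree, 3)
# The same stars with leaf order ignored.
unordered_3_stars = comb(degree, 3)

print(f"{ordered_3_stars:.2e}")
print(f"{unordered_3_stars:.2e}")
```

Either way, a single high-degree node produces far more star matches than can reasonably be shuffled between machines.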

slide-19
SLIDE 19

SIMPLE GRAPH STORAGE

TWINTWIG JOIN

  • Using twintwigs as the join units
  • Instance Optimality
  • Given any join plan involving general stars, we can solve it using twintwigs with at most the same (often much lower) cost

slide-20
SLIDE 20

LEFT-DEEP JOIN PLAN

TWINTWIG JOIN

  • An optimal left-deep join plan with minimum estimated cost

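Figure: left-deep plan $(p_0 \bowtie p_1) \bowtie p_2$ with join units p0 (nodes v1, v2, v4), p1 (nodes v2, v4, v3) and p2 (nodes v4, v3). The first join yields a partial result on v1, v2, v3, v4; the second join yields the matches of the full pattern on v1, v2, v3, v4.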

slide-21
SLIDE 21

DRAWBACKS

TWINTWIG JOIN

  • The simple storage mechanism only supports stars as join units, which produces too many intermediate results
  • Twintwig: confined to at most two edges
  • A node with degree 1,000,000 still has $10^{12}$ two-edge twintwigs
  • Too many execution rounds
  • A clique of 6 nodes (15 edges): seven rounds of TwinTwigJoin
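
Both numbers on this slide follow from simple counting. The sketch below is one way to arrive at them, under stated assumptions: join units are edge-disjoint and hold at most two edges each, a left-deep plan over k units needs k - 1 joins, and each join takes one round. The variable names are illustrative.

```python
from math import ceil, comb, perm

# Rounds for a 6-clique when every join unit holds at most two edges.
n = 6
clique_edges = comb(n, 2)            # 15 edges
join_units = ceil(clique_edges / 2)  # at least 8 twintwigs are needed
join_rounds = join_units - 1         # a left-deep plan joins them one by one

# Two-edge twintwigs centred on a node of degree 1,000,000
# (ordered pairs of distinct neighbours).
degree = 1_000_000
two_edge_twintwigs = perm(degree, 2)

print(clique_edges, join_units, join_rounds)
print(f"{two_edge_twintwigs:.2e}")
```

Restricting units to two edges lowers the per-node blow-up from about $d^3$ to about $d^2$, at the price of more units and therefore more rounds.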

slide-22
SLIDE 22

DRAWBACKS

TWINTWIG JOIN

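  • Left-deep join: may result in sub-optimal results

Figure: a 6-node pattern (v1 to v6) decomposed into join units p0 (nodes v1, v2, v3), p1 (nodes v1, v3, v4), p2 (nodes v1, v4, v5) and p3 (nodes v1, v5, v6), solved by two plans.

Left-deep plan: $((R(p_0) \bowtie R(p_1)) \bowtie R(p_2)) \bowtie R(p_3)$, with intermediate results on (v1, v2, v3, v4) and (v1, v2, v3, v4, v5).

Bushy plan: $(R(p_0) \bowtie R(p_1)) \bowtie (R(p_2) \bowtie R(p_3))$, with intermediate results on (v1, v2, v3, v4) and (v1, v4, v5, v6).

Optimal solution is a bushy join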

slide-23
SLIDE 23

MOTIVATIONS

SEED - VLDB'17

  • Subgraph EnumEration in Distributed Context
  • SCP (Star-Clique-Preserved) graph storage: use star and clique as the join units
  • We can avoid using star if clique is an alternative
  • Shorter execution. The 6-clique can now be processed in one single round, instead of 7 rounds in TwinTwigJoin
  • Bushy join plan: optimality guarantee
  • Much better performance

slide-24
SLIDE 24

SEED

slide-25
SLIDE 25

SCP GRAPH STORAGE

SEED

  • The SCP Graph Storage, where each local graph $G_u^+$:

$V(G_u^+) = V(G_u) = \{u\} \cup N(u)$

$E(G_u^+) = E(G_u) \cup \{(u', u'') \mid (u', u'') \in E(G) \wedge u', u'' \in N(u)\}$

slide-26
SLIDE 26

SCP GRAPH STORAGE

SEED

  • The SCP Graph Storage, where each local graph $G_u^+$:

$V(G_u^+) = V(G_u) = \{u\} \cup N(u)$

$E(G_u^+) = E(G_u) \cup \{(u', u'') \mid (u', u'') \in E(G) \wedge u', u'' \in N(u)\}$

NEIGHBOUR EDGES: $E(G_u)$

slide-27
SLIDE 27

SCP GRAPH STORAGE

SEED

  • The SCP Graph Storage, where each local graph $G_u^+$:

$V(G_u^+) = V(G_u) = \{u\} \cup N(u)$

$E(G_u^+) = E(G_u) \cup \{(u', u'') \mid (u', u'') \in E(G) \wedge u', u'' \in N(u)\}$

TRIANGLE EDGES: $\{(u', u'') \mid (u', u'') \in E(G) \wedge u', u'' \in N(u)\}$

slide-28
SLIDE 28

SCP GRAPH STORAGE

SEED

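  • The SCP Graph Storage, where each local graph $G_u^+$:

$V(G_u^+) = V(G_u) = \{u\} \cup N(u)$

$E(G_u^+) = E(G_u) \cup \{(u', u'') \mid (u', u'') \in E(G) \wedge u', u'' \in N(u)\}$

NEIGHBOUR EDGES: $E(G_u)$

TRIANGLE EDGES: $\{(u', u'') \mid (u', u'') \in E(G) \wedge u', u'' \in N(u)\}$

Figure: data graph G (nodes u1 to u6) with two of its SCP local graphs, $G_{u_1}^+$ on nodes u1, u2, u3 and $G_{u_2}^+$ on nodes u2, u1, u3, u4, u5.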
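
The definition of $G_u^+$ translates directly into code. The sketch below builds one SCP local graph from an edge list: the neighbour edges are the edges incident to u, and the triangle edges are the edges of G whose two endpoints are both neighbours of u. The function name `scp_local_graph` and the example graph are hypothetical; the graph is chosen only so that the neighbourhoods of u1 and u2 agree with the node sets named in the figure.

```python
def scp_local_graph(graph_edges, u):
    """Return (nodes, edges) of G_u^+ for an undirected simple graph."""
    nbrs = {b if a == u else a for a, b in graph_edges if u in (a, b)}
    neighbour_edges = {frozenset((u, w)) for w in nbrs}
    # Edges of G whose two endpoints are both neighbours of u.
    triangle_edges = {frozenset(e) for e in graph_edges if set(e) <= nbrs}
    return {u} | nbrs, neighbour_edges | triangle_edges

# Hypothetical data graph on u1..u6 (the slide's actual edges are not recoverable).
G = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"),
     ("u2", "u4"), ("u2", "u5"), ("u4", "u5"), ("u5", "u6")]

nodes1, edges1 = scp_local_graph(G, "u1")
nodes2, edges2 = scp_local_graph(G, "u2")
print(sorted(nodes1), len(edges1))
print(sorted(nodes2), len(edges2))
```

Because every triangle through u is fully contained in $G_u^+$, cliques containing u can be matched locally, which is not possible when only the neighbour edges are stored.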

slide-29
SLIDE 29

SCP GRAPH STORAGE

SEED

  • We show that SCP graph storage supports using both star and clique as the join units
  • A more compact version which has bounded size for each local graph

slide-30
SLIDE 30

OPTIMAL BUSHY JOIN PLAN

SEED

  • Notations
  • $E_P$: The join plan to solve $P$
  • $C(E_P)$: The cost of the join plan
  • $C(P)$: Estimated # matches of $P$ in $G$
  • We aim at finding a join plan $E_P$ for $P$, s.t. $C(E_P)$ is minimised

slide-31
SLIDE 31

OPTIMAL BUSHY JOIN PLAN

SEED

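  • A dynamic programming transform function
  • e.g. $E_{P'}$
  • (1) $E_{P'_l}$
  • (2) $E_{P'_r}$
  • (3) $R(P') = R(P'_l) \bowtie R(P'_r)$

Figure: a sub-pattern $P'$ split into $P'_l$ and $P'_r$ joined by $\bowtie$, and the corresponding plan $E_{P'}$ built from $E_{P'_l}$ and $E_{P'_r}$.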

slide-32
SLIDE 32

OPTIMAL BUSHY JOIN PLAN

SEED

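  • A dynamic programming transform function
  • e.g. $E_{P'}$
  • (1) $E_{P'_l}$
  • (2) $E_{P'_r}$
  • (3) $R(P') = R(P'_l) \bowtie R(P'_r)$

Figure: a sub-pattern $P'$ split into $P'_l$ and $P'_r$ joined by $\bowtie$, and the corresponding plan $E_{P'}$ built from $E_{P'_l}$ and $E_{P'_r}$.

$C(E_{P'}) = \min_{P'_l \subset P' \wedge P'_r = P' \setminus P'_l} \{ C(E_{P'_l}) + C(P'_l) + C(E_{P'_r}) + C(P'_r) \}$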
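
The recurrence can be evaluated by memoised recursion over edge subsets of the pattern. The sketch below is a small illustration, not SEED's optimiser, and several parts are assumptions: a sub-pattern that is a join unit is a leaf with plan cost 0, only stars are accepted as join units (`is_star`), and the cardinality estimate `estimate` is a made-up function that grows with the number of pattern nodes. It enumerates every split, so it is only practical for small patterns.

```python
from functools import lru_cache
from itertools import combinations

def is_star(edges):
    """Assumed join-unit test: all edges share at least one common node."""
    return bool(set.intersection(*(set(e) for e in edges)))

def optimal_plan(pattern_edges, estimate, is_join_unit=is_star):
    """Return (cost, plan tree) minimising the slide's recurrence."""
    @lru_cache(maxsize=None)
    def best(sub):
        if is_join_unit(sub):
            return 0.0, sub              # leaf: matched directly in local graphs
        best_cost, best_tree = float("inf"), None
        items = sorted(sub)
        for k in range(1, len(items)):
            for left in map(frozenset, combinations(items, k)):
                right = sub - left       # P'_r = P' \ P'_l
                cost_l, tree_l = best(left)
                cost_r, tree_r = best(right)
                cost = cost_l + estimate(left) + cost_r + estimate(right)
                if cost < best_cost:
                    best_cost, best_tree = cost, (tree_l, tree_r)
        return best_cost, best_tree
    return best(frozenset(pattern_edges))

# Hypothetical estimator: 10 ** (number of pattern nodes in the sub-pattern).
estimate = lambda sub: 10 ** len({v for e in sub for v in e})

square = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v1")]
cost, plan = optimal_plan(square, estimate)
print(cost, plan)
```

For the square pattern the cheapest plan under this estimator joins two two-edge stars, since each side covers only three pattern nodes; splitting off a single edge leaves a three-edge path whose estimate dominates the total.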

slide-33
SLIDE 33

EXPERIMENTS

slide-34
SLIDE 34

SETUP

EXPERIMENTS

  • Queries
  • Algorithms
  • SEED+O (The most optimised SEED)
  • TT (The most optimised TwinTwigJoin, VLDB 2015)
  • PSgL (Shao et al., SIGMOD 2014)

Figure: the seven query patterns q1 to q7 (four to six nodes each), each annotated with partial-order constraints on its nodes. Constraints shown: v1 < v2, v1 < v3, v1 < v4, v2 < v4; v1 < v3, v2 < v4; v1 < v2 < v3 < v4; v2 < v5; v2 < v5, v3 < v4; v1 < v2 < v3, v3 < v4 < v5; v3 < v5.

slide-35
SLIDE 35

SETUP

EXPERIMENTS

  • Cluster
  • Amazon EC2: 1 master node, 10 slave nodes
  • Hadoop 2.6.2
  • JVM heap space: mapper 1524MB, reducer 2848MB
  • 6 mappers and 6 reducers on each machine

Node    Instance    vCPU  Memory  Disk
master  m3.xlarge   4     15GB    2 x 40GB SSD
slave   c3.4xlarge  16    30GB    2 x 160GB SSD

slide-36
SLIDE 36

RESULTS

EXPERIMENTS

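Figure: two bar charts of running time (s) for SEED+O, TT and PSgL on the datasets yt and lj, with axis ticks $10^1$, $10^2$, $10^3$, $10^4$ and INF. Bar labels in the first chart: 134, 134, 220, 220. Bar labels in the second chart: 29, 612, 107, 5206.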

slide-37
SLIDE 37

RESULTS

EXPERIMENTS

Figure: two bar charts of running time (s) for SEED+O, TT and PSgL on the datasets yt and lj, with axis ticks $10^1$, $10^2$, $10^3$, $10^4$ and INF.

q3 (nodes v1 to v4, v1 < v2 < v3 < v4). Bar labels: 28, 63, 279, 60, 1281, 5071.

q4 (nodes v1 to v5, v2 < v5). Bar labels: 780, 3282, 1686.

slide-38
SLIDE 38

RESULTS

EXPERIMENTS

Figure: two bar charts of running time (s) for SEED+O, TT and PSgL on the datasets yt and lj, with axis ticks $10^1$, $10^2$, $10^3$, $10^4$ and INF.

q5 (nodes v1 to v6, v3 < v5). Bar labels: 306, 5814.

q6 (nodes v1 to v5, v2 < v5, v3 < v4). Bar labels: 66, 229, 850, 1013, 6968.

slide-39
SLIDE 39

RESULTS

EXPERIMENTS

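Figure: bar chart of running time (s) for SEED+O, TT and PSgL on the datasets yt and lj, with axis ticks $10^1$, $10^2$, $10^3$, $10^4$ and INF. Bar labels: 29, 129, 493, 1206.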

slide-40
SLIDE 40

CONCLUSION

  • A general decompose-and-join framework to solve subgraph enumeration
  • TwinTwigJoin = Simple graph storage (twintwigs as the join units) + Optimal left-deep join
  • SEED = SCP graph storage (star and clique as the join units) + Optimal bushy join

slide-41
SLIDE 41

Q & A THANK YOU!