Arabesque.io: A system for distributed graph mining


slide-1
SLIDE 1

Arabesque.io

A system for distributed graph mining

Carlos Teixeira, Alexandre Fonseca, Marco Serafini, Georgos Siganos, Mohammed Zaki, Ashraf Aboulnaga


slide-2
SLIDE 2

Graphs are ubiquitous

slide-3
SLIDE 3

Graph Mining - Concepts

  • Label: a distinguishing property of a vertex (e.g. color).
  • Pattern: a "meta" subgraph that captures subgraph structure and labelling.
  • Embedding: an instance of a pattern, made of actual vertices and edges.

[Figure: an input graph, a pattern, and the embeddings of that pattern]

slide-4
SLIDE 4

Graph Mining: Cliques

Property: fully connected subgraphs

slide-5
SLIDE 5

Graph Mining: Motifs

[Figure: motifs of size 3 and motifs of size 4]

slide-6
SLIDE 6

Graph Mining: FSM

  • Frequent subgraph mining (FSM) in a single large graph.
  • Find subgraphs that have at least a minimum number of embeddings.

[Figure: example input graph]
slide-7
SLIDE 7

Applications

  • Web: community detection, link spam detection
  • Semantic data: attributed patterns in RDF
  • Biology: characterizing protein-protein or gene interactions
slide-8
SLIDE 8

Challenges

  • Exponential number of embeddings

Size of embedding   # unique embeddings (plotted on a log scale)
1                   4K
2                   22K
3                   335K
4                   7.8M
5                   117M
6                   1.7B

slide-9
SLIDE 9

Challenges

  • No standard way to solve these problems.
  • No way to distribute the processing easily.
  • Way too complicated for programmers (many "…isms"):
      • Detect and identify repeated subgraphs – automorphisms
      • Aggregate to pattern – isomorphism
  • Above all, not all problems are tractable: no cluster grows exponentially.

slide-10
SLIDE 10

State of the Art: Custom Algorithms

                      Easy to Code   Efficient Implementation   Transparent Distribution
Custom Algorithms     ✗              ✓                          ✗

slide-11
SLIDE 11

State of the Art: Think Like a Vertex

                      Easy to Code   Efficient Implementation   Transparent Distribution
Custom Algorithms     ✗              ✓                          ✗
Think Like a Vertex   ✗              ✗                          ✓

slide-12
SLIDE 12

Arabesque

  • New execution model & system
      • Think Like an Embedding
      • Purpose-built for distributed graph mining
      • Hadoop-based
  • Contributions:
      • Simple & generic API
      • High performance
      • Distributed & scalable by design

slide-13
SLIDE 13

Arabesque

                      Easy to Code   Efficient Implementation   Transparent Distribution
Custom Algorithms     ✗              ✓                          ✗
Think Like a Vertex   ✗              ✗                          ✓
Arabesque             ✓              ✓                          ✓

slide-14
SLIDE 14

Graph Mining - Concepts

  • Label: a distinguishing property of a vertex (e.g. color).
  • Pattern: a "meta" subgraph that captures subgraph structure and labelling.
  • Embedding: an instance of a pattern, made of actual vertices and edges.

[Figure: an input graph, a pattern, and the embeddings of that pattern]

slide-15
SLIDE 15

API Example: Clique finding

    boolean filter(Embedding e) {
        return isClique(e);
    }

    void process(Embedding e) {
        output(e);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }

    boolean isClique(Embedding e) {
        return e.getNumEdgesAddedWithExpansion() == e.getNumberOfVertices() - 1;
    }

State of the art (Mace, centralized): 4,621 LOC
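The isClique test on this slide relies on incremental growth: if the embedding was a clique before the expansion, it stays one exactly when the newly added vertex is connected to every earlier vertex, that is, when the expansion added (number of vertices - 1) edges. The following is a minimal, self-contained sketch of that counting argument. The adjacency-map graph and the helper names are hypothetical stand-ins for illustration, not Arabesque classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a toy undirected graph, not the Arabesque Embedding API.
class CliqueCheckSketch {
    private final Map<Integer, Set<Integer>> adj = new HashMap<>();

    void addEdge(int u, int v) {
        adj.computeIfAbsent(u, k -> new HashSet<>()).add(v);
        adj.computeIfAbsent(v, k -> new HashSet<>()).add(u);
    }

    // Number of edges between the last vertex of the embedding and the earlier ones.
    int edgesAddedWithExpansion(List<Integer> embedding) {
        int last = embedding.get(embedding.size() - 1);
        Set<Integer> neighbours = adj.getOrDefault(last, Collections.<Integer>emptySet());
        int count = 0;
        for (int i = 0; i < embedding.size() - 1; i++) {
            if (neighbours.contains(embedding.get(i))) count++;
        }
        return count;
    }

    // Same comparison as on the slide: the new vertex must touch all previous vertices.
    // Only meaningful when the prefix (all but the last vertex) is already a clique.
    boolean isClique(List<Integer> embedding) {
        return edgesAddedWithExpansion(embedding) == embedding.size() - 1;
    }

    public static void main(String[] args) {
        CliqueCheckSketch g = new CliqueCheckSketch();
        g.addEdge(1, 2); g.addEdge(1, 3); g.addEdge(2, 3); g.addEdge(3, 4);
        System.out.println(g.isClique(Arrays.asList(1, 2, 3)));    // triangle
        System.out.println(g.isClique(Arrays.asList(1, 2, 3, 4))); // 4 only touches 3
    }
}
```

Because filter is applied at every expansion, the prefix is always a clique by the time the check runs, which is why a single edge count per step is enough.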

slide-16
SLIDE 16

API Example: Motif Counting

    boolean filter(Embedding e) {
        return true;
    }

    void process(Embedding embedding) {
        output(embedding);
        map(AGG_MOTIFS, embedding.getPattern(), reusableLongWritableUnit);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }

State of the art (GTrieScanner, centralized): 3,145 LOC

slide-17
SLIDE 17

API Example: FSM

  • Ours was the first distributed implementation
  • 280 lines of Java code, of which 212 compute the frequency metric
  • Baseline (GRAMI): 5,443 lines of Java code

slide-18
SLIDE 18

Arabesque: An Efficient System

  • As efficient as centralized state of the art

Application - Graph       Centralized Baseline   Arabesque 1 thread
Motifs - MiCo (MS=3)      50s                    37s
Cliques - MiCo (MS=4)     281s                   385s
FSM - CiteSeer (S=300)    4.8s                   5s

18 77s

slide-19
SLIDE 19

Arabesque: A Scalable System

  • Scalable to thousands of workers
  • Hours/days → Minutes

Application - Graph   Centralized Baseline   Arabesque 640 cores
Motifs - MiCo         2 hours 24 minutes     25 seconds
Cliques - MiCo        4 hours 8 minutes      1 minute 10 seconds
FSM - Patents         > 1 day                1 minute 28 seconds (first distributed implementation)

slide-20
SLIDE 20

How: Arabesque Optimizations

  • Avoid redundant work: efficient canonicality checking
  • Subgraph compression: Overapproximating Directed Acyclic Graphs (ODAGs)
  • Efficient aggregation: 2-level pattern aggregation

slide-21
SLIDE 21

Outline

  • Graph mining exploration & Arabesque fundamentals
  • System Architecture & Optimizations
  • Evaluation of System
  • How to Run & Code


slide-22
SLIDE 22

Graph mining exploration & Arabesque fundamentals

slide-23
SLIDE 23

Graph Mining - Exploration

  • Iterative expansion
  • Subgraph size n → subgraph size n + 1
  • Connect to neighbours, one vertex at a time.

Input graph: vertices 1, 2, 3, 4
Depth 1: 1, 2, 3, 4
Depth 2: 1-2, 1-3, 2-1, 2-3, 2-4, 3-1, 3-2, 3-4, 4-2, 4-3
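The expansion step on this slide (size n → size n + 1, one neighbouring vertex at a time) can be sketched in a few lines. This is an illustrative toy, not Arabesque code. The edge set used here (1-2, 1-3, 2-3, 2-4, 3-4) is an assumption inferred from the depth-2 pairs listed on the slide, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: embeddings are ordered lists of vertex ids.
class ExpansionSketch {
    private final Map<Integer, Set<Integer>> adj = new TreeMap<>();

    void addEdge(int u, int v) {
        adj.computeIfAbsent(u, k -> new TreeSet<>()).add(v);
        adj.computeIfAbsent(v, k -> new TreeSet<>()).add(u);
    }

    // Assumed input graph, inferred from the depth-2 list on the slide.
    static ExpansionSketch slideGraph() {
        ExpansionSketch g = new ExpansionSketch();
        g.addEdge(1, 2); g.addEdge(1, 3); g.addEdge(2, 3); g.addEdge(2, 4); g.addEdge(3, 4);
        return g;
    }

    // Depth 1: every vertex is an embedding of size 1.
    List<List<Integer>> initialEmbeddings() {
        List<List<Integer>> result = new ArrayList<>();
        for (int v : adj.keySet()) result.add(Arrays.asList(v));
        return result;
    }

    // Size n -> size n + 1: append each neighbouring vertex that is not already present.
    List<List<Integer>> expand(List<List<Integer>> embeddings) {
        List<List<Integer>> result = new ArrayList<>();
        for (List<Integer> e : embeddings) {
            Set<Integer> candidates = new TreeSet<>();
            for (int v : e) candidates.addAll(adj.get(v));
            candidates.removeAll(e);
            for (int c : candidates) {
                List<Integer> bigger = new ArrayList<>(e);
                bigger.add(c);
                result.add(bigger);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ExpansionSketch g = slideGraph();
        List<List<Integer>> depth2 = g.expand(g.initialEmbeddings());
        System.out.println(depth2);
    }
}
```

Under the assumed edge set, one call to expand on the depth-1 embeddings yields the ten ordered pairs shown for depth 2. The sketch performs no canonicality filtering, so repeated calls also produce redundant orderings of the same subgraph.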

slide-24
SLIDE 24

Graph Mining - Exploration

Input graph: vertices 1, 2, 3, 4
Depth 3: 1-2-3, 1-2-4, 1-3-2, 1-3-4, 4-2-3, 4-2-1, 4-3-2, 4-3-1, 2-1-3, 2-3-1, 2-3-4, 2-4-3, 3-1-2, 3-2-1, 3-2-4, 3-4-2

slide-25
SLIDE 25

Arabesque: Fundamentals

  • Embeddings as 1st class citizens:
  • Think Like an Embedding model

Arabesque responsibilities: graph exploration, load balancing, aggregation (isomorphism), automorphism detection
User responsibilities: filter, process

slide-26
SLIDE 26

Model - Think Like an Embedding

  • 1. Start from a set of initial embeddings
  • 2. Candidates: expand by 1 vertex/edge
  • 3. Filter uninteresting candidates (user-defined filter: false → discard)
  • 4. Produce outputs (filter true → user-defined process → save)

[Figure: exploration steps 1, 2 and 3; each step takes a set of embeddings as input and outputs the expanded embeddings that pass the filter]
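The four numbered steps of the Think Like an Embedding model can be read as a simple loop around the user-defined functions. The sketch below is a single-threaded illustration under stated assumptions: the generic explore method, the functional-interface parameters, and the toy "embedding" (an increasing list of integers) are all hypothetical and are not part of the Arabesque API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: a sequential stand-in for the distributed exploration loop.
class ExplorationLoopSketch {
    // Runs exploration steps until no embedding is left to expand.
    // Returns every embedding that passed the filter (the "saved" outputs).
    static <E> List<E> explore(List<E> initial,
                               Function<E, List<E>> expandByOne,
                               Predicate<E> filter,
                               Predicate<E> shouldExpand) {
        List<E> outputs = new ArrayList<>();
        List<E> frontier = initial;                        // 1. initial embeddings
        while (!frontier.isEmpty()) {
            List<E> next = new ArrayList<>();
            for (E e : frontier) {
                for (E candidate : expandByOne.apply(e)) { // 2. candidates
                    if (!filter.test(candidate)) continue; // 3. discard if filter is false
                    outputs.add(candidate);                // 4. process: save the output
                    if (shouldExpand.test(candidate)) next.add(candidate);
                }
            }
            frontier = next;
        }
        return outputs;
    }

    // Toy run: embeddings are increasing lists over 1..4, kept while their sum is <= 5.
    static List<List<Integer>> toyRun() {
        List<List<Integer>> initial = new ArrayList<>();
        for (int v = 1; v <= 4; v++) initial.add(Arrays.asList(v));
        Function<List<Integer>, List<List<Integer>>> expand = e -> {
            List<List<Integer>> out = new ArrayList<>();
            for (int v = e.get(e.size() - 1) + 1; v <= 4; v++) {
                List<Integer> bigger = new ArrayList<>(e);
                bigger.add(v);
                out.add(bigger);
            }
            return out;
        };
        Predicate<List<Integer>> filter =
            e -> e.stream().mapToInt(Integer::intValue).sum() <= 5;
        Predicate<List<Integer>> shouldExpand = e -> e.size() < 3;
        return explore(initial, expand, filter, shouldExpand);
    }

    public static void main(String[] args) {
        System.out.println(toyRun());
    }
}
```

The toy filter (sum at most 5) can never become true again once it is false, because expansion only adds positive numbers. That property is what makes discarding at step 3 safe.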

slide-27
SLIDE 27

API Example: Clique finding

    boolean filter(Embedding e) {
        return isClique(e);
    }

    void process(Embedding e) {
        output(e);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }

    boolean isClique(Embedding e) {
        return e.getNumEdgesAddedWithExpansion() == e.getNumberOfVertices() - 1;
    }

slide-28
SLIDE 28

Guarantee: Completeness

  • For each e, if filter(e) == true then process(e) is executed.
  • Requirement: anti-monotonicity.
  • We can prune and be sure that we won't ignore desired embeddings.

[Figure: expansion tree over the embeddings 1-2-6, 1-2-6-3 and 1-2-6-4; embeddings with filter = true keep expanding, embeddings with filter = false are pruned]
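The anti-monotonicity requirement can be stated a little more formally. The notation is introduced here for illustration and does not appear on the slides: $\phi$ stands for the filter function, and $e \sqsubseteq e'$ means that $e'$ is obtained from $e$ by one or more expansions.

```latex
\neg\,\phi(e) \;\wedge\; e \sqsubseteq e' \;\Longrightarrow\; \neg\,\phi(e')
```

Equivalently, $\phi(e')$ implies $\phi(e)$ for every embedding $e$ that $e'$ was expanded from, so discarding $e$ as soon as filter(e) is false cannot lose an embedding that would have passed the filter later.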

slide-29
SLIDE 29

Aggregation during expansion

  • Filter might need aggregated values
      • E.g.: frequent subgraph mining
      • Frequency calculation → look at all candidates
  • Aggregation in parallel with exploration step
      • Embeddings filtered as soon as aggregated values are ready.

slide-30
SLIDE 30

Aggregation during expansion

  • Filter function may depend on aggregated data
  • E.g.: frequent subgraph mining
  • Frequency requires looking at all candidates

Within each exploration step:

  • 1. Initial embeddings and aggr. values (aggr. key-value pairs from the previous step)
  • 1-1. Filter using aggr. values (aggregation filter: save or discard)
  • 1-2. Process using aggr. values (aggregation process)
  • 2. Candidates: expand by 1 vertex/edge
  • 4. Produce outputs, and emit aggr. key-value pairs for the next step with map(k, v)

[Figure: exploration steps 1 and 2, with the user-defined functions Agg Filter, Agg Process, Filter and Process applied to an embedding e and its expansion e']
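The hand-off between steps can be sketched as follows: one exploration step aggregates a value per pattern, and the next step filters embeddings using that aggregate. This is a deliberately simplified illustration. Patterns are plain strings, the frequency is a raw embedding count (the FSM frequency metric in a real implementation is more involved), and every name in the block is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: each embedding is a {embedding, pattern} pair of labels.
class AggregationFilterSketch {
    // Exploration step k: the equivalent of map(pattern, 1) for every embedding,
    // with the values summed per key.
    static Map<String, Integer> aggregate(List<String[]> embeddings) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] e : embeddings) counts.merge(e[1], 1, Integer::sum);
        return counts;
    }

    // Exploration step k + 1: drop embeddings whose pattern did not reach the support.
    static List<String[]> aggregationFilter(List<String[]> embeddings,
                                            Map<String, Integer> countsFromPreviousStep,
                                            int support) {
        List<String[]> kept = new ArrayList<>();
        for (String[] e : embeddings) {
            if (countsFromPreviousStep.getOrDefault(e[1], 0) >= support) kept.add(e);
        }
        return kept;
    }

    // Hypothetical sample data: two embeddings of one pattern, one of another.
    static List<String[]> sample() {
        return Arrays.asList(
            new String[] {"1-2", "red-blue"},
            new String[] {"3-4", "red-blue"},
            new String[] {"5-6", "red-red"});
    }

    public static void main(String[] args) {
        List<String[]> embeddings = sample();
        Map<String, Integer> counts = aggregate(embeddings);
        for (String[] e : aggregationFilter(embeddings, counts, 2)) {
            System.out.println(e[0] + " (" + e[1] + ")");
        }
    }
}
```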

slide-31
SLIDE 31

Arabesque API

  • Main app-defined functions:

    boolean filter(E embedding);
    void process(E embedding);
    boolean shouldExpand(E newEmbedding);         // Terminate early if max depth defined
    boolean aggregationFilter(E embedding);       // Ignore embedding
    boolean aggregationFilter(Pattern pattern);   // Ignore pattern (e.g. not frequent)
    void aggregationProcess(E embedding);
    void handleNoExpansions(E embedding);

  • Performance improvements:

    void filter(E existingEmbedding, IntCollection extensionPoints);   // Prune extensions
    boolean filter(E existingEmbedding, int newWord);                  // Canonicality check

  • Functions provided by Arabesque:

    void output(String outputString);
    void map(String name, K key, V value);
    AggregationStorage<K, V> readAggregation(String name);

slide-32
SLIDE 32

System Architecture & Optimizations

slide-33
SLIDE 33

Arabesque Architecture

[Figure: the input embeddings of size n (from the previous step) are divided into splits 1 to 9 across Worker 1, Worker 2 and Worker 3; the output embeddings of size n + 1 are again divided into splits 1 to 9 for the next step]

slide-34
SLIDE 34

Avoiding redundant work

  • Problem: automorphic embeddings
  • Automorphisms == subgraph equivalences
  • Redundant work

[Figure: Worker 1 holds the embedding 1-2-3 and Worker 2 holds 3-2-1; the two are the same subgraph]

slide-35
SLIDE 35

Avoiding redundant work

  • Solution: decentralized embedding canonicality
  • No coordination
  • Efficient

[Figure: Worker 1 holds 1-2-3, where isCanonical(e) → true; Worker 2 holds the equivalent 3-2-1, where isCanonical(e) → false]
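A coordination-free canonicality test has to pick, among all visiting orders of the same vertices, exactly one order that every worker agrees on using only local information. The sketch below shows one way such an incremental check can look: start from the smallest vertex id, and reject a new vertex if a larger id was added after the point where the new vertex first became reachable. Treat the details as assumptions for illustration rather than as the exact check Arabesque implements. The toy graph class and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: embeddings are ordered lists of vertex ids.
class CanonicalitySketch {
    private final Map<Integer, Set<Integer>> adj = new HashMap<>();

    void addEdge(int u, int v) {
        adj.computeIfAbsent(u, k -> new HashSet<>()).add(v);
        adj.computeIfAbsent(v, k -> new HashSet<>()).add(u);
    }

    boolean adjacent(int u, int v) {
        return adj.getOrDefault(u, Collections.<Integer>emptySet()).contains(v);
    }

    // Incremental test: may vertex v be appended to an already-canonical prefix?
    boolean canAppend(List<Integer> prefix, int v) {
        if (prefix.isEmpty()) return true;
        if (prefix.get(0) > v) return false;   // must start at the smallest id
        boolean foundNeighbour = false;
        for (int u : prefix) {
            if (!foundNeighbour && adjacent(u, v)) {
                foundNeighbour = true;         // v is reachable from here on
            } else if (foundNeighbour && u > v) {
                return false;                  // v should have been added before u
            }
        }
        return foundNeighbour;                 // also rejects a disconnected v
    }

    // Checks every prefix, so it needs no information from other workers.
    boolean isCanonical(List<Integer> embedding) {
        for (int i = 0; i < embedding.size(); i++) {
            if (!canAppend(embedding.subList(0, i), embedding.get(i))) return false;
        }
        return true;
    }

    // A triangle on vertices 1, 2, 3, used as the example graph.
    static CanonicalitySketch triangle() {
        CanonicalitySketch g = new CanonicalitySketch();
        g.addEdge(1, 2); g.addEdge(2, 3); g.addEdge(1, 3);
        return g;
    }

    public static void main(String[] args) {
        CanonicalitySketch g = triangle();
        System.out.println(g.isCanonical(Arrays.asList(1, 2, 3)));
        System.out.println(g.isCanonical(Arrays.asList(3, 2, 1)));
    }
}
```

On the triangle, only the order 1-2-3 passes the check, so the worker holding 3-2-1 can drop its copy without talking to the worker holding 1-2-3.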

slide-36
SLIDE 36

Efficient Pattern Aggregation

  • Goal: aggregate automorphic patterns to single key
  • Find canonical pattern
  • No known polynomial solution

[Figure: three embeddings each go through expensive graph canonization (3x) to reach the canonical pattern]

slide-37
SLIDE 37

Efficient Pattern Aggregation

  • Solution: 2-level pattern aggregation
  • 1. Embeddings → quick patterns
  • 2. Quick patterns → canonical pattern

[Figure: 1) three embeddings are matched linearly to quick patterns (3x); 2) the quick patterns go through expensive graph canonization (2x) to reach the canonical pattern]
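The idea behind the two levels is to pay the expensive canonization once per distinct quick pattern instead of once per embedding. The sketch below illustrates this on unlabeled patterns. The quick pattern renames vertices by their position in the embedding, and the canonical pattern is the smallest string over all renamings, found by brute force. The string encoding, the brute-force canonization, and all names are assumptions made for this example. They are not Arabesque's data structures, and the quick-pattern step here is not tuned to be linear.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: patterns are encoded as "<numVertices>:<sorted edge list>".
class TwoLevelAggregationSketch {
    // Level 1 (cheap, once per embedding): rename vertices by position in the embedding.
    static String quickPattern(int[] embedding, int[][] graphEdges) {
        List<String> edges = new ArrayList<>();
        for (int[] edge : graphEdges) {
            int a = indexOf(embedding, edge[0]);
            int b = indexOf(embedding, edge[1]);
            if (a >= 0 && b >= 0) edges.add(Math.min(a, b) + "-" + Math.max(a, b));
        }
        Collections.sort(edges);
        return embedding.length + ":" + String.join(",", edges);
    }

    // Level 2 (expensive, once per distinct quick pattern): smallest string over all renamings.
    static String canonicalPattern(String quick) {
        String[] parts = quick.split(":", -1);
        int n = Integer.parseInt(parts[0]);
        List<int[]> edges = new ArrayList<>();
        if (!parts[1].isEmpty()) {
            for (String e : parts[1].split(",")) {
                String[] ab = e.split("-");
                edges.add(new int[] {Integer.parseInt(ab[0]), Integer.parseInt(ab[1])});
            }
        }
        String best = null;
        for (int[] perm : permutations(n)) {
            List<String> renamed = new ArrayList<>();
            for (int[] e : edges) {
                int a = perm[e[0]], b = perm[e[1]];
                renamed.add(Math.min(a, b) + "-" + Math.max(a, b));
            }
            Collections.sort(renamed);
            String candidate = n + ":" + String.join(",", renamed);
            if (best == null || candidate.compareTo(best) < 0) best = candidate;
        }
        return best;
    }

    // Counts embeddings per quick pattern first, then merges counts per canonical pattern.
    static Map<String, Long> aggregate(List<int[]> embeddings, int[][] graphEdges) {
        Map<String, Long> quickCounts = new TreeMap<>();
        for (int[] e : embeddings) quickCounts.merge(quickPattern(e, graphEdges), 1L, Long::sum);
        Map<String, Long> canonicalCounts = new TreeMap<>();
        for (Map.Entry<String, Long> q : quickCounts.entrySet()) {
            canonicalCounts.merge(canonicalPattern(q.getKey()), q.getValue(), Long::sum);
        }
        return canonicalCounts;
    }

    private static int indexOf(int[] values, int wanted) {
        for (int i = 0; i < values.length; i++) if (values[i] == wanted) return i;
        return -1;
    }

    private static List<int[]> permutations(int n) {
        List<int[]> out = new ArrayList<>();
        permute(new int[n], new boolean[n], 0, out);
        return out;
    }

    private static void permute(int[] cur, boolean[] used, int pos, List<int[]> out) {
        if (pos == cur.length) { out.add(cur.clone()); return; }
        for (int v = 0; v < cur.length; v++) {
            if (used[v]) continue;
            used[v] = true;
            cur[pos] = v;
            permute(cur, used, pos + 1, out);
            used[v] = false;
        }
    }

    public static void main(String[] args) {
        int[][] path = {{1, 2}, {2, 3}};  // hypothetical graph: the path 1-2-3
        List<int[]> embeddings = new ArrayList<>();
        embeddings.add(new int[] {1, 2, 3});
        embeddings.add(new int[] {2, 1, 3});
        embeddings.add(new int[] {3, 2, 1});
        System.out.println(aggregate(embeddings, path));
    }
}
```

In the example, three embeddings of the same path produce two distinct quick patterns, so canonicalPattern runs twice instead of three times, and both quick patterns collapse to a single canonical key.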

slide-38
SLIDE 38

Handling Exponential growth

  • Goal: handle trillions+ different embeddings?
  • Solution: Overapproximating DAGs (ODAGs)
  • Compress into less restrictive superset
  • Deal with spurious embeddings

Input graph: vertices 1, 2, 3, 4, 5
Canonical embeddings (embedding list): 1-4-2, 1-4-3, 1-4-5, 2-3-4, 2-4-5, 3-4-5
ODAG (one set of vertex ids per position): {1, 2, 3}, {3, 4}, {2, 3, 4, 5}
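The compression idea can be illustrated with a small sketch: store, for each position, which vertex ids were seen there and which ids followed them at the next position, then recover embeddings by walking all paths. The walk yields a superset of what was stored, which is where the spurious embeddings come from. This is a simplified model under stated assumptions (all embeddings have the same length, and the class and method names are hypothetical), not Arabesque's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a per-position overapproximation of a list of embeddings.
class OdagSketch {
    // links.get(i) maps a vertex at position i to the vertices seen right after it.
    private final List<Map<Integer, Set<Integer>>> links = new ArrayList<>();
    private final Set<Integer> firstPosition = new TreeSet<>();

    void add(List<Integer> embedding) {
        firstPosition.add(embedding.get(0));
        for (int i = 0; i + 1 < embedding.size(); i++) {
            while (links.size() <= i) links.add(new TreeMap<>());
            links.get(i).computeIfAbsent(embedding.get(i), k -> new TreeSet<>())
                 .add(embedding.get(i + 1));
        }
    }

    // Every path through the structure: a superset of the embeddings that were added.
    List<List<Integer>> enumerate() {
        List<List<Integer>> paths = new ArrayList<>();
        for (int v : firstPosition) {
            List<Integer> start = new ArrayList<>();
            start.add(v);
            paths.add(start);
        }
        for (Map<Integer, Set<Integer>> level : links) {
            List<List<Integer>> longer = new ArrayList<>();
            for (List<Integer> p : paths) {
                Set<Integer> nexts = level.get(p.get(p.size() - 1));
                if (nexts == null) continue;
                for (int n : nexts) {
                    List<Integer> q = new ArrayList<>(p);
                    q.add(n);
                    longer.add(q);
                }
            }
            paths = longer;
        }
        return paths;
    }

    // The six canonical embeddings listed on the slide.
    static List<List<Integer>> slideEmbeddings() {
        return Arrays.asList(
            Arrays.asList(1, 4, 2), Arrays.asList(1, 4, 3), Arrays.asList(1, 4, 5),
            Arrays.asList(2, 3, 4), Arrays.asList(2, 4, 5), Arrays.asList(3, 4, 5));
    }

    public static void main(String[] args) {
        OdagSketch odag = new OdagSketch();
        for (List<Integer> e : slideEmbeddings()) odag.add(e);
        System.out.println(odag.enumerate());
    }
}
```

With the six embeddings from the slide, the enumeration returns all six plus extra paths such as 3-4-2, which were never stored. Those extras are the spurious embeddings that have to be filtered out again after decompression, for example by re-applying the canonicality check and the user-defined filter.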

slide-39
SLIDE 39

Performance

slide-40
SLIDE 40

Evaluation - Setup

  • 20 servers: 32 threads @ 2.67 GHz, 256GB RAM
  • 10 Gbps network
  • 3 algorithms: Frequent Subgraph Mining, Counting Motifs and Clique Finding
  • Input graphs:

            # Vertices     # Edges        # Labels   Avg. Degree
CiteSeer    3,312          4,732          6          3
MiCo        100,000        1,080,298      29         22
Patents     2,745,761      13,965,409     37         10
Youtube     4,589,876      43,968,798     80         19
SN          5,022,893      198,613,776    -          79
Instagram   179,527,876    887,390,802    -          10

slide-41
SLIDE 41

Evaluation - Scalability

slide-42
SLIDE 42

Evaluation - Scalability

Application - Graph   Centralized   Arabesque - Num. Servers (32 threads)
                      Baseline      1         5         10        15      20
Motifs - MiCo         8,664s        328s      74s       41s       31s     25s
FSM - Citeseer        1,813s        431s      105s      65s       52s     41s
Cliques - MiCo        14,901s       1,185s    272s      140s      91s     70s
Motifs - Youtube      Fail          8,995s    2,218s    1,167s    900s    709s
FSM - Patents         >19h          548s      186s      132s      102s    88s
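As a reading aid, the speedup from 1 to 20 servers can be computed directly from the run times in this table. The notation $T_k$ (run time on $k$ servers) and $S(k)$ is introduced here and is not on the slide.

```latex
S(k) = \frac{T_{1}}{T_{k}}, \qquad
S_{\text{Motifs-MiCo}}(20) = \frac{328\,\mathrm{s}}{25\,\mathrm{s}} \approx 13.1, \qquad
S_{\text{Cliques-MiCo}}(20) = \frac{1185\,\mathrm{s}}{70\,\mathrm{s}} \approx 16.9
```

Against the centralized baseline, the same 20-server Motifs - MiCo run is about $8664 / 25 \approx 347$ times faster.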

slide-43
SLIDE 43

Evaluation - ODAGs Compression

  • 4000 vertices, 1.7 billion embeddings
  • 44 GB → 60 MB

slide-44
SLIDE 44

Evaluation - Speedup w ODAGs

slide-45
SLIDE 45

Evaluation - 2-level aggregation

                     Motifs MiCo (MS=4)   Motifs Youtube (MS=4)   FSM CiteSeer (S=220, MS=7)   FSM Patents (S=24k)
Embeddings           10,957,439,024       218,909,854,429         1,680,983,703                1,910,611,704
Quick Patterns       21                   21                      1433                         1800
Canonical Patterns   6                    6                       97                           1348
Reduction Factor     521,782,810x         10,424,278,782x         1,173,052x                   1,061,451x
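The reduction factors in this table are consistent with dividing the number of embeddings by the number of quick patterns, that is, with counting how many expensive canonizations the first aggregation level saves. Two of the columns worked out:

```latex
\frac{10{,}957{,}439{,}024}{21} \approx 521{,}782{,}810
\qquad\qquad
\frac{1{,}680{,}983{,}703}{1433} \approx 1{,}173{,}052
```

The second level then only has to canonize the remaining quick patterns, for example 21 quick patterns down to 6 canonical patterns for Motifs MiCo.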

slide-46
SLIDE 46

Evaluation - 2-level aggregation

slide-47
SLIDE 47

How to Run & Code

slide-48
SLIDE 48

Requirements

  • Hadoop installation: runs a map-reduce job (Giraph based)
  • To develop: Java 7
slide-49
SLIDE 49

Input Graph

  • Graphs:
  • labels on vertices
  • labels on edges
  • Multiple edges with labels between two vertices
  • Graph should have sequential vertex ids, and it should be ordered

slide-50
SLIDE 50

How to Run?

    ./run_arabesque.sh cluster.yaml application.yaml

slide-51
SLIDE 51

Cluster.yaml

    num_workers: 10
    num_compute_threads: 16
    output_active: yes

    # Giraph configuration
    #giraph.nettyClientThreads: 32
    #giraph.nettyServerThreads: 32
    #giraph.nettyClientExecutionThreads: 32
    #giraph.channelsPerServer: 4
    #giraph.useBigDataIOForMessages: true
    #giraph.useNettyPooledAllocator: true
    #giraph.useNettyDirectMemory: true
    #giraph.nettyRequestEncoderBufferSize: 1048576

slide-52
SLIDE 52

Fsm.yaml

    computation: io.arabesque.examples.fsm.FSMComputation
    master_computation: io.arabesque.examples.fsm.FSMMasterComputation
    input_graph_path: citeseer.graph
    output_path: FSM_Output
    #communication_strategy: embeddings

    # Custom parameters
    arabesque.fsm.support: 300
    #arabesque.fsm.maxsize: 7

    # Split all aggregations in 10 parts for parallel aggregation
    # (use only with heavy aggregations)
    # arabesque.aggregators.default_splits: 10

slide-53
SLIDE 53

Cliques.yaml

    computation: io.arabesque.examples.clique.CliqueComputation
    input_graph_path: citeseer-single-label.graph
    output_path: Cliques_Output
    #communication_strategy: embeddings
    optimizations:
      - io.arabesque.optimization.CliqueOptimization

    # Custom parameters
    arabesque.clique.maxsize: 4

slide-54
SLIDE 54

https://github.com/Qatar-Computing-Research-Institute/Arabesque

http://arabesque.io

slide-55
SLIDE 55

Conclusion

  • Graph mining is complex
  • Existing approaches are not ideal
  • Arabesque facilitates distributed graph mining algorithms
  • General & simple API
  • Efficient & scalable
  • Just the beginning!!!

slide-56
SLIDE 56

Graph Exploration with TLV

  • 1. Receive embeddings
  • 2. Expand by adding neighboring vertices
  • 3. Send canonical embeddings to their constituting vertices

[Figure: superstep 2 for vertex 4 of the input graph, showing the receive, expand and send phases. Received: 1-4, 2-4, 3-4. Expanded: 1-4-2, 1-4-3, 2-4-1, 2-4-3, 3-4-1, 3-4-2.]

slide-57
SLIDE 57

Evaluation - TLP & TLV

  • Use case: frequent subgraph mining
  • No scalability. Bottlenecks:
  • TLV: Replication of embeddings, hotspots
  • TLP: very few patterns do all the work

[Figure: speedup versus number of nodes (32 threads each) for TLP and TLV, compared with the ideal speedup]