
Arabesque.io: A system for distributed graph mining



  1. Arabesque.io: A system for distributed graph mining
      Carlos Teixeira, Alexandre Fonseca, Marco Serafini, Georgos Siganos, Mohammed Zaki, Ashraf Aboulnaga

  2. Graphs are ubiquitous

  3. Graph Mining - Concepts
      • Label: distinguishable property of a vertex (e.g. color).
      • Pattern: “meta” sub-graph that captures subgraph structure and labelling.
      • Embedding: instance of a pattern, made of actual vertices and edges.
      [Figure: input graph, pattern, embeddings]

  4. Graph Mining: Cliques
      Property: fully connected subgraphs

  5. Graph Mining: Motifs
      [Figure: motifs of size 3; motifs of size 4]

  6. Graph Mining: FSM
      • Frequent subgraph mining in a single large graph.
      • Find subgraphs that have a minimum embedding count.
      [Figure: example input graph with vertices 1 to 14]

  7. Applications
      • Web: community detection, link spam detection
      • Semantic data: attributed patterns in RDF
      • Biology: characterize protein-protein or gene interaction

  8. Challenges
      • Exponential number of embeddings
      [Chart: number of unique embeddings (log scale) vs. size of embedding, for sizes 1 to 6; value labels: 4K, 22K, 335K, 7.8M, 117M, 1.7B]

  9. Challenges
      • No standard way to solve these problems.
      • No way to distribute the processing easily.
      • Way too complicated for programmers (many “…isms”):
        • Detect and identify repeated subgraphs: automorphisms
        • Aggregate to pattern: isomorphism
      • Above all, not all problems are tractable: the number of embeddings grows exponentially, but no cluster does.

  10. State of the Art: Custom Algorithms
      Approach              Easy to Code   Efficient Implementation   Transparent Distribution
      Custom Algorithms     ✗              ✓                          ✗

  11. State of the Art: Think Like a Vertex
      Approach              Easy to Code   Efficient Implementation   Transparent Distribution
      Custom Algorithms     ✗              ✓                          ✗
      Think Like a Vertex   ✗              ✗                          ✓

  12. Arabesque
      • New execution model & system
        • Think Like an Embedding
        • Purpose-built for distributed graph mining
        • Hadoop-based
      • Contributions:
        • Simple & generic API
        • High performance
        • Distributed & scalable by design

  13. Arabesque
      Approach              Easy to Code   Efficient Implementation   Transparent Distribution
      Custom Algorithms     ✗              ✓                          ✗
      Think Like a Vertex   ✗              ✗                          ✓
      Arabesque             ✓              ✓                          ✓

  14. Graph Mining - Concepts
      • Label: distinguishable property of a vertex (e.g. color).
      • Pattern: “meta” sub-graph that captures subgraph structure and labelling.
      • Embedding: instance of a pattern, made of actual vertices and edges.
      [Figure: input graph, pattern, embeddings]

  15. API Example: Clique finding
      State of the art (Mace, centralized): 4,621 LOC
          // Keep only embeddings that are cliques.
          boolean filter(Embedding e) {
              return isClique(e);
          }
          // Every embedding that passes the filter is emitted.
          void process(Embedding e) {
              output(e);
          }
          // Stop expanding once the maximum clique size is reached.
          boolean shouldExpand(Embedding embedding) {
              return embedding.getNumVertices() < maxsize;
          }
          // The newest vertex must connect to all previously added vertices.
          boolean isClique(Embedding e) {
              return e.getNumEdgesAddedWithExpansion() == e.getNumVertices() - 1;
          }
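
The `isClique` test on this slide is incremental: if the parent embedding is already a clique, the expanded embedding is a clique exactly when the new vertex is adjacent to every existing vertex, so the expansion adds one edge per existing vertex. The following is a minimal stand-alone sketch of that reasoning. It does not use the Arabesque classes; the adjacency matrix, the class name, and the method names are assumptions made for illustration.

```java
// Hypothetical stand-alone sketch; not part of the Arabesque API.
final class CliqueExpansionSketch {
    // Counts edges between the new vertex and the vertices already in the embedding.
    static int edgesAddedWithExpansion(boolean[][] adj, int[] parent, int newVertex) {
        int added = 0;
        for (int v : parent) {
            if (adj[v][newVertex]) {
                added++;
            }
        }
        return added;
    }

    // Assuming `parent` is already a clique, the expanded embedding is a clique
    // iff the new vertex adds (number of vertices after expansion) - 1 edges.
    static boolean staysClique(boolean[][] adj, int[] parent, int newVertex) {
        int numVertices = parent.length + 1;
        return edgesAddedWithExpansion(adj, parent, newVertex) == numVertices - 1;
    }

    // Vertices 0, 1, 2 form a triangle; vertex 3 hangs off vertex 2.
    static boolean[][] sampleGraph() {
        boolean[][] adj = new boolean[4][4];
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {2, 3}};
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        return adj;
    }

    public static void main(String[] args) {
        boolean[][] adj = sampleGraph();
        System.out.println(staysClique(adj, new int[] {0, 1}, 2)); // true: closes the triangle
        System.out.println(staysClique(adj, new int[] {0, 1}, 3)); // false: vertex 3 touches neither
    }
}
```

The check is only valid because the filter already rejected every non-clique parent, which is why it can look at the last expansion alone.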

  16. API Example: Motif Counting
      State of the art (GTrieScanner, centralized): 3,145 LOC
          boolean filter(Embedding e) {
              return true;
          }
          void process(Embedding embedding) {
              output(embedding);
              map(AGG_MOTIFS, embedding.getPattern(), reusableLongWritableUnit);
          }
          boolean shouldExpand(Embedding embedding) {
              return embedding.getNumVertices() < maxsize;
          }
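
To make the aggregation in `process` concrete, here is a minimal stand-alone sketch that counts motifs of size 3 by brute force. It assumes that `reusableLongWritableUnit` stands for a count of 1 per embedding, which the slide does not state. The class, the pattern names "path" and "triangle", and the sample graph are hypothetical, and the triple loop replaces Arabesque's distributed exploration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the Arabesque API.
final class MotifCountSketch {
    // Classifies every connected 3-vertex induced subgraph as "path" (2 edges)
    // or "triangle" (3 edges) and adds 1 to the count stored under that pattern.
    static Map<String, Long> countSize3Motifs(boolean[][] adj) {
        Map<String, Long> counts = new TreeMap<>();
        int n = adj.length;
        for (int a = 0; a < n; a++) {
            for (int b = a + 1; b < n; b++) {
                for (int c = b + 1; c < n; c++) {
                    int edges = (adj[a][b] ? 1 : 0) + (adj[a][c] ? 1 : 0) + (adj[b][c] ? 1 : 0);
                    if (edges < 2) {
                        continue; // fewer than 2 edges on 3 vertices means disconnected
                    }
                    counts.merge(edges == 3 ? "triangle" : "path", 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Vertices 0, 1, 2 form a triangle; vertex 3 hangs off vertex 2.
    static boolean[][] sampleGraph() {
        boolean[][] adj = new boolean[4][4];
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {2, 3}};
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        return adj;
    }

    public static void main(String[] args) {
        System.out.println(countSize3Motifs(sampleGraph())); // {path=2, triangle=1}
    }
}
```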

  17. API Example: FSM
      • Ours was the first distributed implementation
      • 280 lines of Java code
        • … of which 212 compute the frequency metric
      • Baseline (GRAMI): 5,443 lines of Java code

  18. Arabesque: An Efficient System
      • As efficient as centralized state of the art
      Application - Graph       Centralized Baseline   Arabesque (1 thread)
      Motifs - MiCo (MS=3)      50s                    37s
      Cliques - MiCo (MS=4)     281s                   385s                   77s
      FSM - CiteSeer (S=300)    4.8s                   5s

  19. Arabesque: A Scalable System
      • Scalable to thousands of workers
      • Hours/days → Minutes
      Application - Graph   Centralized Baseline   Arabesque (640 cores)
      Motifs - MiCo         2 hours 24 minutes     25 seconds
      Cliques - MiCo        4 hours 8 minutes      1 minute 10 seconds
      FSM - Patents         > 1 day                1 minute 28 seconds
      Callout: First Distributed Implementation

  20. How: Arabesque Optimizations
      • Avoid redundant work
        • Efficient canonicality checking
      • Subgraph compression
        • Overapproximating Directed Acyclic Graphs (ODAGs)
      • Efficient aggregation
        • 2-level pattern aggregation

  21. Outline
      • Graph mining exploration & Arabesque fundamentals
      • System Architecture & Optimizations
      • Evaluation of System
      • How to Run & Code

  22. Graph mining exploration & Arabesque fundamentals

  23. Graph Mining - Exploration
      • Iterative expansion
        • Subgraph size n → Subgraph size n + 1
        • Connect to neighbours, one vertex at a time.
      [Figure: input graph and the embeddings explored at depth 1 and depth 2]
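
One expansion step can be sketched as follows: the candidates for growing an embedding from size n to size n + 1 are the vertices adjacent to it that it does not already contain. This is a minimal illustration with hypothetical names and a made-up sample graph stored as adjacency lists. It covers vertex-based expansion only and is not Arabesque code.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; not part of the Arabesque API.
final class ExpansionSketch {
    // Returns every vertex that neighbours the embedding but is not part of it.
    static Set<Integer> extensions(int[][] adjacency, int... embedding) {
        Set<Integer> members = new TreeSet<>();
        for (int v : embedding) {
            members.add(v);
        }
        Set<Integer> candidates = new TreeSet<>();
        for (int v : embedding) {
            for (int neighbour : adjacency[v]) {
                if (!members.contains(neighbour)) {
                    candidates.add(neighbour);
                }
            }
        }
        return candidates;
    }

    // Vertices 0, 1, 2 form a triangle; vertex 3 hangs off vertex 2.
    static int[][] sampleAdjacency() {
        return new int[][] {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
    }

    public static void main(String[] args) {
        int[][] g = sampleAdjacency();
        System.out.println(extensions(g, 0));    // [1, 2]
        System.out.println(extensions(g, 0, 2)); // [1, 3]
    }
}
```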

  24. Graph Mining - Exploration
      [Figure: input graph and the embeddings explored at depth 3]

  25. Arabesque: Fundamentals
      • Embeddings as 1st class citizens: Think Like an Embedding model
      Arabesque responsibilities    User responsibilities
      Graph Exploration             Filter
      Aggregation (Isomorphism)     Process
      Load Balancing
      Automorphism Detection

  26. Model - Think Like an Embedding
      1. Start from a set of initial embeddings.
      2. Candidates: expand by 1 vertex/edge.
      3. Filter uninteresting candidates.
      4. Produce outputs.
      User-defined functions: Filter (true → Process → Save; false → Discard).
      [Figure: exploration steps 1, 2, and 3, each taking input embeddings and producing output embeddings]
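
The four steps above can be sketched as a single-machine loop that calls the user-defined functions. This is a simplified illustration under several assumptions: embeddings are plain sorted vertex sets, duplicates are removed with an in-memory set rather than a canonicality check, and the `Computation` interface and all other names are hypothetical. The example application plugs in the clique filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch of the filter-process loop; not Arabesque code.
final class ExplorationLoopSketch {
    // The user-defined functions, over embeddings simplified to sorted vertex sets.
    interface Computation {
        boolean filter(TreeSet<Integer> embedding);
        void process(TreeSet<Integer> embedding);
        boolean shouldExpand(TreeSet<Integer> embedding);
    }

    static boolean adjacent(int[][] adjacency, int a, int b) {
        for (int u : adjacency[a]) {
            if (u == b) {
                return true;
            }
        }
        return false;
    }

    static void explore(int[][] adjacency, Computation app) {
        // Step 1: the initial embeddings are the single vertices.
        Set<TreeSet<Integer>> current = new LinkedHashSet<>();
        for (int v = 0; v < adjacency.length; v++) {
            TreeSet<Integer> single = new TreeSet<>();
            single.add(v);
            current.add(single);
        }
        while (!current.isEmpty()) {
            Set<TreeSet<Integer>> next = new LinkedHashSet<>();
            for (TreeSet<Integer> e : current) {
                if (!app.filter(e)) {
                    continue; // Step 3: discard uninteresting candidates.
                }
                app.process(e); // Step 4: produce outputs.
                if (!app.shouldExpand(e)) {
                    continue;
                }
                // Step 2, for the next round: expand by one neighbouring vertex.
                for (int v : e) {
                    for (int u : adjacency[v]) {
                        if (!e.contains(u)) {
                            TreeSet<Integer> bigger = new TreeSet<>(e);
                            bigger.add(u);
                            next.add(bigger); // the set drops repeated vertex sets
                        }
                    }
                }
            }
            current = next;
        }
    }

    // Example application: list all cliques with at most maxSize vertices.
    static List<String> cliques(int[][] adjacency, int maxSize) {
        List<String> found = new ArrayList<>();
        explore(adjacency, new Computation() {
            public boolean filter(TreeSet<Integer> e) {
                for (int a : e) {
                    for (int b : e) {
                        if (a < b && !adjacent(adjacency, a, b)) {
                            return false;
                        }
                    }
                }
                return true;
            }
            public void process(TreeSet<Integer> e) {
                found.add(e.toString());
            }
            public boolean shouldExpand(TreeSet<Integer> e) {
                return e.size() < maxSize;
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Vertices 0, 1, 2 form a triangle; vertex 3 hangs off vertex 2.
        int[][] g = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
        System.out.println(cliques(g, 3));
    }
}
```

On the sample graph the loop reports the four single vertices, the four edges, and the triangle `[0, 1, 2]`; the candidates `[0, 2, 3]` and `[1, 2, 3]` are generated but discarded by the filter.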

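  27. API Example: Clique finding
          // Keep only embeddings that are cliques.
          boolean filter(Embedding e) {
              return isClique(e);
          }
          // Every embedding that passes the filter is emitted.
          void process(Embedding e) {
              output(e);
          }
          // Stop expanding once the maximum clique size is reached.
          boolean shouldExpand(Embedding embedding) {
              return embedding.getNumVertices() < maxsize;
          }
          // The newest vertex must connect to all previously added vertices.
          boolean isClique(Embedding e) {
              return e.getNumEdgesAddedWithExpansion() == e.getNumVertices() - 1;
          }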

  28. Guarantee: Completeness
      For each e, if filter(e) == true then process(e) is executed.
      Requirement: Anti-monotonicity (if filter(e) is false, it is false for every expansion of e)
      • Filter = true: keep expanding.
      • Filter = false: we can prune and be sure that we won’t ignore desired embeddings.
      [Figure: an embedding expanded step by step; branches where the filter returns false are pruned]
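
Whether a given filter satisfies the anti-monotonicity requirement can be checked by brute force on a small graph. The sketch below tests a slightly stronger condition than the slide needs: whenever a vertex set passes the filter, every non-empty subset must pass too. Vertex sets are encoded as bitmasks, and all names are hypothetical. The clique filter passes the check, while a filter that accepts only sets of exactly 3 vertices does not.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-alone sketch; not part of the Arabesque API.
final class AntiMonotoneSketch {
    // Returns true if every non-empty subset of an accepted vertex set is accepted.
    static boolean isAntiMonotone(int numVertices, IntPredicate filter) {
        for (int set = 1; set < (1 << numVertices); set++) {
            if (!filter.test(set)) {
                continue;
            }
            // Enumerate all proper, non-empty submasks of `set`.
            for (int sub = (set - 1) & set; sub > 0; sub = (sub - 1) & set) {
                if (!filter.test(sub)) {
                    return false;
                }
            }
        }
        return true;
    }

    // True if all pairs of vertices selected by the bitmask are adjacent.
    static boolean isClique(boolean[][] adj, int set) {
        for (int a = 0; a < adj.length; a++) {
            for (int b = a + 1; b < adj.length; b++) {
                boolean bothSelected = ((set >> a) & 1) == 1 && ((set >> b) & 1) == 1;
                if (bothSelected && !adj[a][b]) {
                    return false;
                }
            }
        }
        return true;
    }

    // Vertices 0, 1, 2 form a triangle; vertex 3 hangs off vertex 2.
    static boolean[][] sampleGraph() {
        boolean[][] adj = new boolean[4][4];
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {2, 3}};
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        return adj;
    }

    public static void main(String[] args) {
        boolean[][] adj = sampleGraph();
        System.out.println(isAntiMonotone(4, s -> isClique(adj, s)));        // true
        System.out.println(isAntiMonotone(4, s -> Integer.bitCount(s) == 3)); // false
    }
}
```

A filter like the second one would make pruning unsafe: the size-1 and size-2 embeddings fail it, so exploration would stop before any size-3 embedding is reached.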

  29. Aggregation during expansion
      • Filter might need aggregated values
        • E.g.: Frequent subgraph mining
        • Frequency calculation → look at all candidates
      • Aggregation in parallel with exploration step
        • Embeddings filtered as soon as aggregated values are ready.

  30. Aggregation during expansion
      • Filter function may depend on aggregated data
        • E.g.: Frequent subgraph mining
        • Frequency requires looking at all candidates
      1. Initial embeddings and aggr. values
         1-1. Filter using aggr. values
         1-2. Process using aggr. values
      2. Candidates: expand by 1 vertex/edge
      ...
      4. Produce outputs
      [Figure: exploration steps 1 and 2; each step reads aggr. key-value pairs from the previous step and emits aggr. key-value pairs for the next step via map(k, v); Filter and Process are user-defined functions that Save or Discard embeddings]
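
The dependency between aggregation and filtering can be sketched in two passes: one step aggregates a count per pattern, and the following step filters embeddings using those counts. The sketch below is a strong simplification. Embeddings are single labelled edges, the pattern key is the sorted pair of endpoint labels, and the frequency is a plain embedding count, whereas the metric used by a real FSM implementation is more involved. All names and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the Arabesque API.
final class AggregationFilterSketch {
    // Pattern key of a single-edge embedding: the two endpoint labels, sorted.
    static String pattern(char[] labels, int[] edge) {
        char a = labels[edge[0]];
        char b = labels[edge[1]];
        return a <= b ? "" + a + b : "" + b + a;
    }

    // First step: aggregate one count per embedding under its pattern key.
    static Map<String, Integer> aggregate(char[] labels, int[][] edges) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int[] e : edges) {
            counts.merge(pattern(labels, e), 1, Integer::sum);
        }
        return counts;
    }

    // Next step: keep only embeddings whose pattern reached the threshold.
    static List<int[]> filterFrequent(char[] labels, int[][] edges,
                                      Map<String, Integer> counts, int minCount) {
        List<int[]> kept = new ArrayList<>();
        for (int[] e : edges) {
            if (counts.getOrDefault(pattern(labels, e), 0) >= minCount) {
                kept.add(e);
            }
        }
        return kept;
    }

    static char[] sampleLabels() {
        return new char[] {'A', 'B', 'A', 'B', 'C'};
    }

    // A path 0-1-2-3-4 over the labels above.
    static int[][] sampleEdges() {
        return new int[][] {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
    }

    public static void main(String[] args) {
        char[] labels = sampleLabels();
        int[][] edges = sampleEdges();
        Map<String, Integer> counts = aggregate(labels, edges);
        System.out.println(counts); // {AB=3, BC=1}
        System.out.println(filterFrequent(labels, edges, counts, 2).size()); // 3
    }
}
```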

  31. Arabesque API
      • Main app-defined functions:
          boolean filter(E embedding);
          void process(E embedding);
          boolean shouldExpand(E newEmbedding);        // Terminate early if max depth defined
          boolean aggregationFilter(E embedding);      // Ignore embedding
          boolean aggregationFilter(Pattern pattern);  // Ignore pattern (e.g. not frequent)
          void aggregationProcess(E embedding);
          void handleNoExpansions(E embedding);
      • Performance improvements:
          void filter(E existingEmbedding, IntCollection extensionPoints);  // Prune extensions
          boolean filter(E existingEmbedding, int newWord);                 // Canonicality check
      • Functions provided by Arabesque:
          void output(String outputString);
          void map(String name, K key, V value);
          AggregationStorage<K, V> readAggregation(String name);

  32. System Architecture & Optimizations

  33. Arabesque Architecture
      [Figure: each step turns input embeddings of size n into output embeddings of size n + 1. The embeddings are divided into splits across workers in both the previous step and the next step: Worker 1 holds splits 1, 4, 7; Worker 2 holds splits 2, 5, 8; Worker 3 holds splits 3, 6, 9.]

  34. Avoiding redundant work
      • Problem: Automorphic embeddings
        • Automorphisms == subgraph equivalences
        • Redundant work
      [Figure: Worker 1 holds embedding 1-2-3 and Worker 2 holds embedding 3-2-1; the two are equivalent (==)]

  35. Avoiding redundant work
      • Solution: Decentralized embedding canonicality
        • No coordination
        • Efficient
      [Figure: Worker 1 holds embedding 1-2-3, isCanonical(e) → true; Worker 2 holds the equivalent embedding 3-2-1, isCanonical(e) → false]
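
A canonicality check can be decided locally because it only inspects the embedding itself and the edges among its vertices. The sketch below shows one possible rule of this kind for embeddings stored as the ordered list of visited vertex ids: the first vertex must have the smallest id, and no vertex visited after the new vertex's first neighbour may have a larger id than the new vertex. It is an illustrative rule with hypothetical names, not a statement of how Arabesque implements `isCanonical`.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of the Arabesque API.
final class CanonicalitySketch {
    static boolean adjacent(int[][] adjacency, int a, int b) {
        for (int u : adjacency[a]) {
            if (u == b) {
                return true;
            }
        }
        return false;
    }

    // Incremental check: `parent` is assumed canonical already; decides whether
    // appending `v` keeps the visiting order canonical.
    static boolean isCanonicalExtension(int[][] adjacency, int[] parent, int v) {
        if (parent[0] > v) {
            return false; // the first vertex must have the smallest id
        }
        boolean foundNeighbour = false;
        for (int u : parent) {
            if (!foundNeighbour && adjacent(adjacency, u, v)) {
                foundNeighbour = true;
            } else if (foundNeighbour && u > v) {
                return false; // a larger vertex was visited after v's first neighbour
            }
        }
        return true;
    }

    // Checks a whole visiting order by replaying it one vertex at a time.
    static boolean isCanonical(int[][] adjacency, int[] order) {
        for (int i = 1; i < order.length; i++) {
            if (!isCanonicalExtension(adjacency, Arrays.copyOf(order, i), order[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Vertices 0, 1, 2 form a triangle; vertex 3 hangs off vertex 2.
        int[][] g = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};
        System.out.println(isCanonical(g, new int[] {0, 1, 2})); // true
        System.out.println(isCanonical(g, new int[] {2, 1, 0})); // false
    }
}
```

On the triangle in the sample graph, exactly one of the six visiting orders of vertices 0, 1, 2 passes the check, so two workers that reach the same subgraph in different orders agree on which copy to keep without exchanging messages.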

  36. Efficient Pattern Aggregation
      • Goal: Aggregate automorphic patterns to single key
        • Find canonical pattern
        • No known polynomial solution
      [Figure: patterns reduced to a canonical pattern through expensive graph canonization (3x)]
