
Arabesque.io
A system for distributed graph mining
Carlos Teixeira, Alexandre Fonseca, Marco Serafini, Georgos Siganos, Mohammed Zaki, Ashraf Aboulnaga

Graphs are ubiquitous

Graph Mining - Concepts
• Label
  • Distinguishable property of a vertex (e.g. color).
• Pattern - "Meta" sub-graph.
  • Captures subgraph structure and labelling
• Embedding - Instance of a pattern.
  • Actual vertices and edges
[Figure: an input graph, a pattern, and the embeddings of that pattern]

Graph Mining: Cliques
Property: Fully connected subgraphs

Graph Mining: Motifs
[Figure: motifs of size 3 and motifs of size 4]

Graph Mining: FSM
• Frequent subgraph mining in a single large graph.
• Find subgraphs that have a minimum embedding count
[Figure: example input graph with vertices numbered 1 to 14]

Applications
• Web:
  • Community detection, link spam detection
• Semantic data:
  • Attributed patterns in RDF
• Biology:
  • Characterize protein-protein or gene interaction

Challenges
• Exponential number of embeddings
[Figure: number of unique embeddings (log scale) versus size of embedding, for sizes 1 to 6; the plotted values are 4K, 22K, 335K, 7.8M, 117M and 1.7B]

Challenges
• No standard way to solve these problems.
• No way to distribute the processing easily.
• Way too complicated for programmers (many "isms")
  • Detect and identify repeated subgraphs: automorphisms
  • Aggregate to pattern: isomorphism
• Above all, not all problems are tractable. No cluster grows exponentially.

State of the Art: Custom Algorithms

                      Easy to Code   Efficient Implementation   Transparent Distribution
Custom Algorithms     ✗              ✓                          ✗

State of the Art: Think Like a Vertex

                      Easy to Code   Efficient Implementation   Transparent Distribution
Custom Algorithms     ✗              ✓                          ✗
Think Like a Vertex   ✗              ✗                          ✓

Arabesque
• New execution model & system
  • Think Like an Embedding
  • Purpose-built for distributed graph mining
  • Hadoop-based
• Contributions:
  • Simple & Generic API
  • High performance
  • Distributed & Scalable by design

Arabesque

                      Easy to Code   Efficient Implementation   Transparent Distribution
Custom Algorithms     ✗              ✓                          ✗
Think Like a Vertex   ✗              ✗                          ✓
Arabesque             ✓              ✓                          ✓


API Example: Clique finding
State of the Art (Mace, centralized): 4,621 LOC

    boolean filter(Embedding e) {
        return isClique(e);
    }

    void process(Embedding e) {
        output(e);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }

    boolean isClique(Embedding e) {
        // The newly added vertex must be connected to every earlier vertex.
        return e.getNumEdgesAddedWithExpansion() == e.getNumVertices() - 1;
    }
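The one-line clique test works because the filter is applied at every expansion step: if the parent embedding was already a clique, the child is a clique exactly when the new vertex is adjacent to all earlier vertices. The following is a minimal standalone sketch of that reasoning, not Arabesque code. The class name, the adjacency-map representation and the helper `edgesAddedWithExpansion` are assumptions made for illustration.

```java
import java.util.*;

// Hypothetical standalone sketch; it does not use the Arabesque classes.
final class CliqueFilterSketch {

    // Builds an undirected graph as sorted adjacency sets (assumed representation).
    static Map<Integer, Set<Integer>> graph(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    // Counts the edges between the last vertex of the embedding and all earlier ones.
    static int edgesAddedWithExpansion(Map<Integer, Set<Integer>> adj, List<Integer> vertices) {
        int last = vertices.get(vertices.size() - 1);
        Set<Integer> neighbours = adj.getOrDefault(last, Collections.<Integer>emptySet());
        int added = 0;
        for (int i = 0; i < vertices.size() - 1; i++) {
            if (neighbours.contains(vertices.get(i))) {
                added++;
            }
        }
        return added;
    }

    // Valid only when the embedding without its last vertex is already a clique.
    static boolean isClique(Map<Integer, Set<Integer>> adj, List<Integer> vertices) {
        return edgesAddedWithExpansion(adj, vertices) == vertices.size() - 1;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> g = graph(new int[][] {{1, 2}, {1, 3}, {2, 3}, {3, 4}});
        System.out.println(isClique(g, Arrays.asList(1, 2, 3))); // true
        System.out.println(isClique(g, Arrays.asList(2, 3, 4))); // false
    }
}
```

Checking only the last vertex keeps the per-embedding cost linear in the embedding size instead of quadratic.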

API Example: Motif Counting
State of the Art (GTrieScanner, centralized): 3,145 LOC

    boolean filter(Embedding e) {
        return true;
    }

    void process(Embedding embedding) {
        output(embedding);
        map(AGG_MOTIFS, embedding.getPattern(), reusableLongWritableUnit);
    }

    boolean shouldExpand(Embedding embedding) {
        return embedding.getNumVertices() < maxsize;
    }
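In the motif example, `process` emits one unit per embedding keyed by its pattern, and the system sums the units per key. The sketch below illustrates only that counting idea for unlabelled motifs of size 3. It enumerates vertex triples by brute force instead of using Arabesque's exploration, and the class name and the "path"/"triangle" keys are assumptions made for illustration.

```java
import java.util.*;

// Hypothetical sketch of per-pattern counting; brute force, not Arabesque's exploration.
final class MotifCountSketch {

    // Returns how many connected 3-vertex subgraphs of each shape the graph contains.
    static Map<String, Long> countSize3(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        List<Integer> vs = new ArrayList<>(adj.keySet());
        Map<String, Long> aggregation = new TreeMap<>();
        for (int i = 0; i < vs.size(); i++) {
            for (int j = i + 1; j < vs.size(); j++) {
                for (int k = j + 1; k < vs.size(); k++) {
                    int a = vs.get(i), b = vs.get(j), c = vs.get(k);
                    int m = (adj.get(a).contains(b) ? 1 : 0)
                          + (adj.get(a).contains(c) ? 1 : 0)
                          + (adj.get(b).contains(c) ? 1 : 0);
                    if (m < 2) {
                        continue; // fewer than 2 edges: the triple is not connected
                    }
                    // Plays the role of map(AGG_MOTIFS, pattern, 1) with a sum reduction.
                    aggregation.merge(m == 3 ? "triangle" : "path", 1L, Long::sum);
                }
            }
        }
        return aggregation;
    }

    public static void main(String[] args) {
        System.out.println(countSize3(new int[][] {{1, 2}, {1, 3}, {2, 3}, {3, 4}}));
    }
}
```

For the four-edge graph in `main`, the triple {1, 2, 3} is the only triangle, and {1, 3, 4} and {2, 3, 4} are the two paths.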

API Example: FSM
• Ours was the first distributed implementation
• 280 lines of Java code
  • … of which 212 compute the frequency metric
• Baseline (GRAMI): 5,443 lines of Java code.

Arabesque: An Efficient System
• As efficient as centralized state of the art

Application - Graph        Centralized Baseline   Arabesque (1 thread)
Motifs - MiCo (MS=3)       50s                    37s
Cliques - MiCo (MS=4)      281s                   385s
FSM - CiteSeer (S=300)     4.8s                   5s

[The Cliques row of the slide also carries a third value, 77s, whose column is not recoverable.]

Arabesque: A Scalable System
• Scalable to thousands of workers
• Hours/days → Minutes

Application - Graph   Centralized Baseline   Arabesque (640 cores)
Motifs - MiCo         2 hours 24 minutes     25 seconds
Cliques - MiCo        4 hours 8 minutes      1 minute 10 seconds
FSM - Patents         > 1 day                1 minute 28 seconds

First Distributed Implementation

How: Arabesque Optimizations
• Avoid Redundant Work
  • Efficient canonicality checking
• Subgraph Compression
  • Overapproximating Directed Acyclic Graphs (ODAGs)
• Efficient Aggregation
  • 2-level pattern aggregation

Outline
• Graph mining exploration & Arabesque fundamentals
• System Architecture & Optimizations
• Evaluation of System
• How to Run & Code

Graph mining exploration & Arabesque fundamentals

Graph Mining - Exploration
• Iterative expansion
  • Subgraph size n → Subgraph size n + 1
  • Connect to neighbours, one vertex at a time.
[Figure: input graph and the embeddings explored at depth 1 and depth 2]

Graph Mining - Exploration
[Figure: input graph and the embeddings explored at depth 3]

Arabesque: Fundamentals
• Embeddings as 1st class citizens:
  • Think Like an Embedding model

Arabesque responsibilities: Graph Exploration, Load Balancing, Aggregation (Isomorphism), Automorphism Detection
User responsibilities: Filter, Process

Model - Think Like an Embedding
1. Start from a set of initial embeddings
2. Candidates: Expand by 1 vertex/edge
3. Filter uninteresting candidates
4. Produce outputs
Filter and Process are user-defined functions: a candidate for which Filter returns true is passed to Process and saved, and a candidate for which it returns false is discarded.
[Figure: exploration steps 1, 2 and 3, each taking the output embeddings of the previous step as input]
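The four steps of the model can be sketched as a small single-machine loop. This is an illustrative assumption-laden sketch, not the Arabesque engine: embeddings are plain vertex lists, the class and method names are hypothetical, and duplicates are removed with a shared set of vertex sets, which stands in for the decentralized canonicality check that Arabesque uses.

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical single-machine sketch of the filter/process/expand loop.
final class ThinkLikeAnEmbeddingSketch {

    static Map<Integer, Set<Integer>> graph(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    // Example user filter: true when every pair of vertices is adjacent.
    static boolean allAdjacent(Map<Integer, Set<Integer>> adj, List<Integer> e) {
        for (int a : e) {
            for (int b : e) {
                if (a != b && !adj.get(a).contains(b)) return false;
            }
        }
        return true;
    }

    // Returns every embedding that passed the filter, in exploration order.
    static List<List<Integer>> explore(Map<Integer, Set<Integer>> adj, int maxSize,
                                       Predicate<List<Integer>> filter) {
        List<List<Integer>> outputs = new ArrayList<>();
        List<List<Integer>> frontier = new ArrayList<>();
        for (int v : adj.keySet()) {
            frontier.add(Collections.singletonList(v)); // step 1: initial embeddings
        }
        while (!frontier.isEmpty()) {
            List<List<Integer>> next = new ArrayList<>();
            Set<Set<Integer>> seen = new HashSet<>();   // stand-in for canonicality
            for (List<Integer> e : frontier) {
                if (!filter.test(e)) continue;          // step 3: discard
                outputs.add(e);                         // step 4: process / output
                if (e.size() >= maxSize) continue;      // shouldExpand
                for (int u : e) {                       // step 2: expand by 1 vertex
                    for (int w : adj.get(u)) {
                        if (e.contains(w)) continue;
                        List<Integer> candidate = new ArrayList<>(e);
                        candidate.add(w);
                        if (seen.add(new TreeSet<Integer>(candidate))) next.add(candidate);
                    }
                }
            }
            frontier = next;
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> g = graph(new int[][] {{1, 2}, {1, 3}, {2, 3}, {3, 4}});
        System.out.println(explore(g, 3, e -> allAdjacent(g, e)));
    }
}
```

With the clique filter, the size-3 candidates {1, 3, 4} and {2, 3, 4} are discarded and never expanded, so the loop only pays for embeddings that can still lead to a result.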


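Guarantee: Completeness
For each e, if filter(e) == true then process(e) is executed.
Requirement: Anti-monotonicity. If filter(e) is false, it must also be false for every embedding obtained by expanding e.
• Filter = true: keep expanding.
• Filter = false: we can prune and be sure that we won't ignore desired embeddings.
[Figure: an exploration tree in which embeddings with Filter = true are expanded and embeddings with Filter = false are pruned]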

Aggregation during expansion
• Filter might need aggregated values
  • E.g.: Frequent subgraph mining
  • Frequency calculation → look at all candidates
• Aggregation in parallel with exploration step
• Embeddings filtered as soon as aggregated values are ready.

Aggregation during expansion
• Filter function may depend on aggregated data
  • E.g.: Frequent subgraph mining
  • Frequency requires looking at all candidates
[Figure: two consecutive exploration steps. In each step, candidates are expanded by 1 vertex/edge, filtered using aggregated values, and processed using aggregated values. The user-defined functions call map(k, v) to emit aggregation key-value pairs, which are carried from the previous step into the next step.]

Arabesque API
• Main app-defined functions:

    boolean filter(E embedding);
    void process(E embedding);
    boolean shouldExpand(E newEmbedding);        // Terminate early if max depth defined
    boolean aggregationFilter(E embedding);      // Ignore embedding
    boolean aggregationFilter(Pattern pattern);  // Ignore pattern (e.g. not frequent)
    void aggregationProcess(E embedding);
    void handleNoExpansions(E embedding);

• Performance improvements:

    void filter(E existingEmbedding, IntCollection extensionPoints);  // Prune extensions
    boolean filter(E existingEmbedding, int newWord);                 // Canonicality check

• Functions provided by Arabesque:

    void output(String outputString);
    void map(String name, K key, V value);
    AggregationStorage<K, V> readAggregation(String name);

System Architecture & Optimizations

Arabesque Architecture
[Figure: three workers move from the previous step to the next step. Each worker takes splits of the input embeddings of size n and produces splits of the output embeddings of size n + 1. Worker 1 holds splits 1, 4 and 7; Worker 2 holds splits 2, 5 and 8; Worker 3 holds splits 3, 6 and 9.]

Avoiding redundant work
• Problem: Automorphic embeddings
  • Automorphisms == subgraph equivalences
  • Redundant work
[Figure: Worker 1 holds the embedding 1 2 3 and Worker 2 holds 3 2 1; the two are the same subgraph]

Avoiding redundant work
• Solution: Decentralized Embedding Canonicality
  • No coordination
  • Efficient
[Figure: Worker 1 holds the embedding 1 2 3, for which isCanonical(e) → true; Worker 2 holds 3 2 1, for which isCanonical(e) → false]
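The slides do not spell out the canonicality rule, so the following sketch assumes one possible incremental rule: an extension by vertex v is accepted only if the first vertex of the embedding has a smaller ID than v and no vertex placed after v's first neighbour has a larger ID than v. The class and method names are hypothetical. The point it illustrates is that each worker can decide locally, from the embedding and the graph alone, which of several automorphic vertex orders to keep.

```java
import java.util.*;

// Hypothetical sketch of a coordination-free canonicality check (assumed rule).
final class CanonicalitySketch {

    static Map<Integer, Set<Integer>> graph(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    // Decides whether appending v to an already canonical parent stays canonical.
    static boolean isCanonicalExtension(Map<Integer, Set<Integer>> adj,
                                        List<Integer> parent, int v) {
        if (parent.isEmpty()) return true;
        if (parent.get(0) > v) return false; // the first vertex must have the smallest ID
        Set<Integer> neighbours = adj.getOrDefault(v, Collections.<Integer>emptySet());
        boolean foundNeighbour = false;
        for (int u : parent) {
            if (!foundNeighbour && neighbours.contains(u)) {
                foundNeighbour = true;       // earliest position v could attach to
            } else if (foundNeighbour && u > v) {
                return false;                // v should have been added before u
            }
        }
        return true;
    }

    // Checks a whole embedding by replaying its expansions one vertex at a time.
    static boolean isCanonical(Map<Integer, Set<Integer>> adj, List<Integer> e) {
        for (int i = 1; i < e.size(); i++) {
            if (!isCanonicalExtension(adj, e.subList(0, i), e.get(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> triangle = graph(new int[][] {{1, 2}, {1, 3}, {2, 3}});
        System.out.println(isCanonical(triangle, Arrays.asList(1, 2, 3))); // true
        System.out.println(isCanonical(triangle, Arrays.asList(3, 2, 1))); // false
    }
}
```

On the triangle in `main`, only the order 1 2 3 passes, so the five other orders of the same three vertices are dropped without any communication between workers.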

Efficient Pattern Aggregation
• Goal: Aggregate automorphic patterns to a single key
  • Find canonical pattern
  • No known polynomial solution
[Figure: patterns (labelled "3x") are mapped to one canonical pattern through expensive graph canonization]
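To make the cost of canonization concrete, here is a brute-force sketch that maps every unlabelled pattern of the same shape to one key. It tries all vertex orderings and keeps the lexicographically smallest upper-triangle adjacency string. This is an illustrative stand-in, not the canonization routine Arabesque uses; the class name and the string encoding are assumptions.

```java
// Hypothetical brute-force canonization of small unlabelled patterns.
final class CanonicalPatternSketch {

    // Pattern vertices are 0..n-1; returns the same key for isomorphic patterns.
    static String canonicalKey(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        String[] best = {null};
        permute(perm, 0, adj, best);
        return best[0];
    }

    // Visits all n! orderings of the vertices and tracks the smallest encoding.
    private static void permute(int[] perm, int k, boolean[][] adj, String[] best) {
        if (k == perm.length) {
            String key = encode(perm, adj);
            if (best[0] == null || key.compareTo(best[0]) < 0) best[0] = key;
            return;
        }
        for (int i = k; i < perm.length; i++) {
            swap(perm, k, i);
            permute(perm, k + 1, adj, best);
            swap(perm, k, i);
        }
    }

    // Writes the upper triangle of the permuted adjacency matrix as a 0/1 string.
    private static String encode(int[] perm, boolean[][] adj) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < perm.length; i++) {
            for (int j = i + 1; j < perm.length; j++) {
                sb.append(adj[perm[i]][perm[j]] ? '1' : '0');
            }
        }
        return sb.toString();
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        // Two differently numbered 3-vertex paths share one key; the triangle differs.
        System.out.println(canonicalKey(3, new int[][] {{0, 1}, {1, 2}}));
        System.out.println(canonicalKey(3, new int[][] {{0, 2}, {2, 1}}));
        System.out.println(canonicalKey(3, new int[][] {{0, 1}, {1, 2}, {0, 2}}));
    }
}
```

The factorial number of orderings is why it pays to run canonization once per distinct pattern rather than once per embedding.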
