CS 744: Big Data Systems Shivaram Venkataraman Fall 2018

ADMINISTRIVIA
- Midterm grades up today
- Pick up papers at office hours today or in Tuesday's class
- Course projects: round 2 meetings

Graph Mining

WHAT'S DIFFERENT?
Graph Analytics examples
- PageRank
- Shortest path
- Connected components
- …
Graph Mining examples
- Counting motifs
- Frequent sub-graph mining
- Finding cliques
- …

GRAPH MINING: DEFINITIONS
- Graph G = (V, E). Vertices and edges have unique ids.
- Embedding: a sub-graph of G, i.e., a subset of its vertices and edges
  - Vertex-induced: start from a set of vertices, include all edges between those vertices
  - Edge-induced: start from a set of edges, include all endpoint vertices
- Pattern: any arbitrary graph
- A pattern is a template; an embedding is an instance

AUTOMORPHISM, ISOMORPHISM
- An embedding is isomorphic to a pattern iff there is a one-to-one mapping between their vertices and edges such that
  - each mapped vertex has the same label
  - edges connect matching vertices
- Embedding e is automorphic to e’ iff they contain the same edges and vertices

EXAMPLE: MOTIF COUNTING
- Motifs: connected patterns that are pairwise non-isomorphic
- k=3: two patterns
- k=4: six patterns
- Goal: find the count of each pattern in the graph
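To make the k=3 case concrete, here is a minimal brute-force sketch that counts the two connected 3-vertex patterns (open path and triangle) among vertex-induced embeddings. It is only an illustration, not how Arabesque explores embeddings; the class name `MotifCount3` and the edge-list input format are assumptions made for this example.

```java
class MotifCount3 {
    // Returns {paths, triangles} for an undirected graph with vertices 0..n-1.
    // Each vertex triple is inspected once; triples with fewer than two
    // edges are disconnected and therefore not motifs.
    static int[] count(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        int paths = 0, triangles = 0;
        for (int a = 0; a < n; a++)
            for (int b = a + 1; b < n; b++)
                for (int c = b + 1; c < n; c++) {
                    int m = (adj[a][b] ? 1 : 0) + (adj[a][c] ? 1 : 0) + (adj[b][c] ? 1 : 0);
                    if (m == 3) triangles++;      // all three edges present
                    else if (m == 2) paths++;     // two edges: an open path
                }
        return new int[] { paths, triangles };
    }
}
```

The cubic loop is only workable for tiny graphs, which is the motivation for the exploration and pruning techniques on the later slides.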

FILTER PROCESS MODEL
- Two UDFs: filter embedding Φ and process embedding π
Algorithm (BSP execution by parallelizing this loop)
- Start with initial embedding set I
- For each embedding in I: C <- generate embeddings (add one vertex)
- For each embedding e in C: if Φ(e), then F <- F ∪ π(e)
- Terminate if F is empty, else loop with I <- F
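The loop above can be sketched sequentially as follows. This is a single-threaded illustration under stated assumptions, not the Arabesque implementation: `FilterProcess`, `explore`, and the adjacency-map input are hypothetical names, an embedding is represented as the sorted set of its vertex ids (so automorphic copies collapse in a hash set, standing in for a real canonicality check), and "F ∪ π(e)" is read as "call π(e) and keep e in the next frontier".

```java
import java.util.*;
import java.util.function.*;

class FilterProcess {
    // Level-by-level exploration of vertex-induced embeddings.
    // filter plays the role of the UDF Phi, process the role of pi.
    static void explore(Map<Integer, Set<Integer>> adj,
                        Predicate<Set<Integer>> filter,
                        Consumer<Set<Integer>> process) {
        Set<Set<Integer>> frontier = new HashSet<>();          // I
        for (int v : adj.keySet()) frontier.add(new TreeSet<>(List.of(v)));
        while (!frontier.isEmpty()) {
            Set<Set<Integer>> next = new HashSet<>();          // F
            for (Set<Integer> e : frontier)
                for (int v : e)
                    for (int u : adj.get(v)) {
                        if (e.contains(u)) continue;
                        Set<Integer> c = new TreeSet<>(e);     // candidate: add one vertex
                        c.add(u);
                        // process each distinct vertex set once
                        if (filter.test(c) && next.add(c)) process.accept(c);
                    }
            frontier = next;                                   // I <- F
        }
    }
}
```

The loop terminates once the filter rejects every candidate, for example with a size bound such as `e -> e.size() <= 3`. In a BSP system the iteration over `frontier` is the part that would be split across workers.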

AGGREGATION FUNCTIONS
- Aggregation functions: filter function α, aggregate function β
- Similar to groupByWindowAndApply?
Consistency properties
- If embeddings are automorphic, all UDFs return the same value
- Anti-monotonicity: if the filter returns false for an embedding, it returns false for all of its extensions

OTHER APPROACHES
Think like Vertex
- Vertex has local embedding
- “Push” message to border vertex
Cons
- Highly connected vertex → hotspot
- Duplicate messages, one per border
Think like Pattern
- Don’t materialize embeddings
- Store patterns, recompute embeddings on the fly
Cons
- Partition by pattern (fewer?)
- Popular pattern, load imbalance

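ARABESQUE API: EXAMPLE
Counting motifs

    // Keep exploring only while the embedding is within the size limit.
    boolean filter(Embedding e) {
        return (numVertices(e) <= MAX_SIZE);
    }

    // Emit one count for the pattern of each embedding.
    void process(Embedding e) {
        mapOutput(pattern(e), 1);
    }

    // Sum the counts emitted for each pattern.
    Pair<Pattern,Integer> reduceOutput(Pattern p, List<Integer> counts) {
        return Pair(p, sum(counts));
    }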

DISTRIBUTED EXECUTION
- Apache Giraph based distributed implementation
- Synchronous super-steps (BSP)
  - Workers receive messages sent previously [Embeddings]
  - Process messages [Filter, Process]
  - Send new messages to be delivered [Aggregate output?]
- Can be implemented in any BSP system? (e.g., Spark)

EXPLORATION STRATEGY
Goals
- Prune embeddings that are “identical” (i.e., automorphic)
- Need to do this without coordination (why?)
Approach
- Determine a “canonical” embedding (unique and extensible)
- Canonical property
  - Start with smallest id
  - Add the neighbor with smallest id not visited yet
- Incremental check while adding vertex to embedding
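One way to read the incremental check is sketched below. The rule encoded here (first vertex has the smallest id, and no vertex added after the new vertex's first neighbor has a larger id) is an interpretation of the canonical property on this slide, not code taken from Arabesque; `Canonical`, `canExtend`, and `enumerate` are hypothetical names.

```java
import java.util.*;

class Canonical {
    // May vertex v extend the canonical vertex sequence `parent`?
    // Needs only the parent and the graph, so no coordination is required.
    static boolean canExtend(List<Integer> parent, int v, Map<Integer, Set<Integer>> adj) {
        if (parent.get(0) > v) return false;               // must start with the smallest id
        boolean foundNeighbor = false;
        for (int u : parent) {
            if (!foundNeighbor && adj.get(u).contains(v)) foundNeighbor = true;
            else if (foundNeighbor && u > v) return false;  // v should have been added before u
        }
        return foundNeighbor;                               // also rejects non-neighbors
    }

    // All canonical sequences with exactly k vertices, built one vertex at a time.
    static List<List<Integer>> enumerate(Map<Integer, Set<Integer>> adj, int k) {
        List<List<Integer>> level = new ArrayList<>();
        for (int v : adj.keySet()) level.add(List.of(v));
        for (int size = 1; size < k; size++) {
            List<List<Integer>> next = new ArrayList<>();
            for (List<Integer> e : level) {
                Set<Integer> candidates = new TreeSet<>();
                for (int u : e) candidates.addAll(adj.get(u));
                candidates.removeAll(e);
                for (int v : candidates)
                    if (canExtend(e, v, adj)) {
                        List<Integer> child = new ArrayList<>(e);
                        child.add(v);
                        next.add(child);
                    }
            }
            level = next;
        }
        return level;
    }
}
```

On the 4-vertex graph with edges 0-1, 1-2, 0-2, 2-3, each connected vertex set is reached through exactly one sequence, e.g. {0, 1, 2} only as [0, 1, 2] and never as [0, 2, 1].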

EFFICIENT STORAGE: ODAG
- Storage model: ordered lists of vertex / edge ids (integers)
ODAG format
- Store all first elements of embeddings in one array (and so on for each position)
- Link two entries in consecutive arrays if some embedding has such a link
- Could lead to spurious embeddings
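A small sketch of the idea, showing how compression introduces spurious embeddings. The `Odag` class below is a hypothetical in-memory stand-in written for this example (one link map per position), not Arabesque's actual data structure.

```java
import java.util.*;

class Odag {
    // links.get(i) maps an id stored at position i to the ids it links to at position i + 1.
    final List<Map<Integer, Set<Integer>>> links = new ArrayList<>();
    final Set<Integer> first = new TreeSet<>();   // ids seen at position 0
    final int k;                                  // embedding size

    Odag(int k) {
        this.k = k;
        for (int i = 0; i + 1 < k; i++) links.add(new TreeMap<>());
    }

    // Store one embedding given as an ordered list of k ids.
    void add(List<Integer> e) {
        first.add(e.get(0));
        for (int i = 0; i + 1 < k; i++)
            links.get(i).computeIfAbsent(e.get(i), x -> new TreeSet<>()).add(e.get(i + 1));
    }

    // Every path through the arrays, including spurious ones.
    List<List<Integer>> paths() {
        List<List<Integer>> out = new ArrayList<>();
        for (int v : first) walk(new ArrayList<>(List.of(v)), out);
        return out;
    }

    private void walk(List<Integer> prefix, List<List<Integer>> out) {
        if (prefix.size() == k) { out.add(new ArrayList<>(prefix)); return; }
        int i = prefix.size() - 1;
        for (int next : links.get(i).getOrDefault(prefix.get(i), Set.of())) {
            prefix.add(next);
            walk(prefix, out);
            prefix.remove(prefix.size() - 1);     // backtrack (removes by index)
        }
    }
}
```

Storing [1, 2, 3] and [4, 2, 5] shares the middle entry 2, so reading the structure back also yields [1, 2, 5] and [4, 2, 3], which were never stored. Those are the spurious embeddings that have to be filtered out on extraction.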

EFFICIENT STORAGE: ODAG
ODAG benefits
- N vertices can have up to $N^k$ embeddings of size k
- ODAG upper bound: $O(k \cdot N^2)$ ($k \ll N$)
Using ODAGs
- Avoid spurious embeddings using filter, aggregateFilter
Merging ODAGs
- Every worker creates ODAG outputs
- Use map-reduce to do the merge!
- Map each entry based on position to worker
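A rough way to see where the bound comes from, assuming each of the k arrays holds at most N ids and links only join consecutive arrays (this counting argument and the numbers N = 1000, k = 4 are illustrative assumptions, not taken from the slides):

```latex
\begin{align*}
\text{entries} &\le k \cdot N, \qquad
\text{links} \le (k-1) \cdot N^2, \\
\text{ODAG size} &\le k N + (k-1) N^2 = O(k \cdot N^2). \\[4pt]
\text{For } N = 10^3,\ k = 4:\quad
N^k &= 10^{12}, \qquad
k N + (k-1) N^2 = 4 \cdot 10^3 + 3 \cdot 10^6 = 3{,}004{,}000.
\end{align*}
```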

OTHER OPTIMIZATIONS
Partitioning embeddings
- Performed at start of every iteration
- Round-robin scheme with block size b
- Estimate embeddings that start from a vertex
Two-level aggregation
- Need to aggregate by pattern; equality requires an isomorphism check
- Quick pattern calculated locally and aggregated
- Use canonical pattern to do second-level aggregation
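The two-level aggregation can be sketched as follows for small unlabeled patterns. Everything here is an assumption made for illustration: the quick pattern is taken to be the embedding's edges named by position in the vertex sequence, and the canonical pattern is the smallest string encoding over all relabelings (brute force, so only suitable for tiny patterns). Arabesque's actual quick and canonical pattern computations may differ.

```java
import java.util.*;

class TwoLevelAggregation {
    // Quick pattern: edges among the embedding's vertices, named by position (i < j).
    static List<int[]> quickPattern(List<Integer> emb, Map<Integer, Set<Integer>> adj) {
        List<int[]> edges = new ArrayList<>();
        for (int i = 0; i < emb.size(); i++)
            for (int j = i + 1; j < emb.size(); j++)
                if (adj.get(emb.get(i)).contains(emb.get(j))) edges.add(new int[] { i, j });
        return edges;
    }

    // Sorted edge list after relabeling position p to relabel[p].
    static String encode(List<int[]> edges, int[] relabel) {
        List<String> parts = new ArrayList<>();
        for (int[] e : edges) {
            int a = Math.min(relabel[e[0]], relabel[e[1]]);
            int b = Math.max(relabel[e[0]], relabel[e[1]]);
            parts.add(a + "-" + b);
        }
        Collections.sort(parts);
        return String.join(",", parts);
    }

    // Canonical pattern: smallest encoding over all relabelings (the expensive step).
    static String canonical(List<int[]> edges) {
        int n = 0;
        for (int[] e : edges) n = Math.max(n, e[1] + 1);   // assumes a connected pattern
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        String[] best = { null };
        permute(perm, 0, edges, best);
        return best[0];
    }

    private static void permute(int[] p, int pos, List<int[]> edges, String[] best) {
        if (pos == p.length) {
            String s = encode(edges, p);
            if (best[0] == null || s.compareTo(best[0]) < 0) best[0] = s;
            return;
        }
        for (int i = pos; i < p.length; i++) {
            int t = p[pos]; p[pos] = p[i]; p[i] = t;
            permute(p, pos + 1, edges, best);
            t = p[pos]; p[pos] = p[i]; p[i] = t;           // undo the swap
        }
    }

    static Map<String, Integer> count(List<List<Integer>> embs, Map<Integer, Set<Integer>> adj) {
        Map<String, Integer> quickCounts = new HashMap<>();
        Map<String, List<int[]>> quickEdges = new HashMap<>();
        for (List<Integer> emb : embs) {                   // level 1: cheap, once per embedding
            List<int[]> qp = quickPattern(emb, adj);
            int[] id = new int[emb.size()];
            for (int i = 0; i < id.length; i++) id[i] = i;
            String key = encode(qp, id);
            quickCounts.merge(key, 1, Integer::sum);
            quickEdges.putIfAbsent(key, qp);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> q : quickCounts.entrySet())  // level 2: once per quick pattern
            result.merge(canonical(quickEdges.get(q.getKey())), q.getValue(), Integer::sum);
        return result;
    }
}
```

The design point is that the isomorphism-style work in `canonical` runs once per distinct quick pattern rather than once per embedding, which matters when there are far more embeddings than patterns.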

SUMMARY
- Graph mining: new workload that is compute and data intensive
- Arabesque: first system to do distributed graph mining
- Challenges: lots of intermediate state (trillions of embeddings)
- Key ideas
  - Filter / prune embeddings using canonical definition
  - Efficient storage using ODAGs
