

SLIDE 1

Lawrence Livermore National Laboratory

Evaluating Use of Data Flow Systems for Large Graph Analysis

Andy Yoo and Ian Kaplan

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344

SLIDE 2

Graph mining techniques have been widely used in many important applications in recent years

Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs)

So-called “scale-free” graphs can carry rich information


SLIDE 3

Graph Mining Applications: Web Search

Google’s PageRank uses a web graph to rank web pages for given queries.

Related applications and techniques:
– Personalized web search
– People search
– Eigenvalue/eigenvector computation
– Random walk with restart

SLIDE 4

Graph Mining Applications: Social Network Analysis

Zachary’s Karate Club (1977): divided into two groups centered around two individuals, 1 and 34.

Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002).

Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005).

SLIDE 5

Graph Mining Applications: Protein Clustering

Can discover proteins with similar functions by clustering protein modules in protein-protein interaction graphs.

Protein-protein interaction network of yeast (Adamcsek et al., Bioinformatics, 22:1021, 2006)

SLIDE 6

Graph Mining Applications: National Security

Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. Ullman, 1976; T. Coffman, S. Greenblatt, and S. Marcus, “Graph-based technologies for intelligence analysis,” ACM, 2004).

Other related applications:
– Exact and inexact pattern discovery
– Fraud detection
– Cyber security
– Behavioral prediction

SLIDE 7

Challenges

High complexity of graph mining algorithms
– Common graph mining algorithms have high-order computational complexity
  • High-order algorithms (O(N^2) or worse): PageRank, community finding, path traversal
  • NP-complete problems: maximal cliques, subgraph pattern matching

Large data size requires out-of-core approaches
– Graphs with 10^9+ nodes and edges are increasingly common
– Intermediate results grow exponentially in many cases

SLIDE 8

Traditional relational databases have been used in large graph analysis

Due to their prevalence and ease of use, conventional database systems have been used in graph analysis, even though they are designed for transaction processing.

Poor performance and scalability:
– 300B-node graph search on Netezza on a 700-node NPS (SC’06)
– 120B-node graph search on a 60-node MSSG (Cluster ’06)

Distribution of response time for 100 bi-directional searches:
< 1 minute: 26%, 1–2 minutes: 37%, 2–5 minutes: 30%, 5–10 minutes: 2%, 10+ minutes: 5%

SLIDE 9

Many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce

Map/Reduce is a popular many-tasks model used for a wide range of applications.

Map/Reduce model:
– A M/R program consists of many map and reduce tasks
– Each task works independently
– Data passes between mappers and reducers via intermediate files
– Processes lists of (key, value) pairs

Is Map/Reduce for everything?
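The (key, value) flow above can be sketched in a few lines of Python — a minimal, hypothetical in-memory word count, not code from the talk; a real framework would write the intermediate pairs to files between the phases:

```python
from itertools import groupby
from operator import itemgetter

def map_task(doc):
    # Each mapper works independently, emitting (key, value) pairs.
    return [(word, 1) for word in doc.split()]

def reduce_task(key, values):
    # Each reducer sees all values for one key.
    return (key, sum(values))

def map_reduce(docs):
    # Shuffle phase: group the intermediate pairs by key
    # (a real framework spills these to intermediate files).
    pairs = sorted((p for d in docs for p in map_task(d)), key=itemgetter(0))
    return dict(reduce_task(k, [v for _, v in g])
                for k, g in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["graph mining", "graph analysis"]))
# {'analysis': 1, 'graph': 2, 'mining': 1}
```

Because each map and reduce task touches only its own inputs, the tasks parallelize trivially — which is exactly the embarrassingly parallel shape the next slide calls out as a limitation for graph work.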

SLIDE 10

Map/Reduce model is too limited for large complex graph analysis

Map/Reduce has been used successfully for some applications:
– Inverted index construction
– Distributed sort
– Term-vector calculation
– PageRank

Drawbacks:
– Model limited to embarrassingly parallel applications
– Poor performance and scalability (due to poor handling of intermediate results)

BFS search results:
System       Platform                Time (sec)
Map/Reduce   20-node Fenix cluster   1068
SGRACE       64-node Tuson cluster   221

The full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used.

333.75 sec / 64 nodes

SLIDE 11

Dataflow model is a promising alternative to address these issues

Many independent tasks access external data in parallel, realizing data parallelism:
– Tasks triggered by the availability of data
– No flow of control
– Data parallel and independent

More flexible and complex than Map/Reduce (Map/Reduce on steroids!)

We evaluated the use of the dataflow model for large graph analysis in this work.

[Figure: Dryad dataflow diagram]

SLIDE 12

We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer

DAS vs. RDBMS

DAS:
  • Parallel dataflow engine on commodity clusters
  • Specialized high-performance library
  • Streaming data pipelined for maximum in-memory processing
  • Sequentialized disk accesses
  • Optimized for SORT and JOIN operations
  • Offers great flexibility for optimization

RDBMS:
  • Sequential or parallel relational database systems on commodity HW
  • Optimized for transaction processing
  • Ubiquitous
  • Relatively easy to use
  • Relies on SQL compiler for optimization
SLIDE 13

DAS programming and execution environment

Uses ECL, a proprietary dataflow language.

Built-in ECL data manipulation constructs (JOIN, SORT, MERGE, etc.) are implemented in a highly optimized library. Unlike SQL, these low-level constructs are suitable for complex graph operations.

Toolchain (from the slide): ECL Code → ECL Compiler + ECL Library → C++ Code → Executable → CE CE CE …

SLIDE 14

An example ECL code

[The ECL code listing on this slide did not survive extraction.]

SLIDE 15

We evaluated some of the most commonly used applications in our experiments

Applications evaluated on the DAS system:

– Path Traversal: uni- and bi-directional BFS
– Pattern Matching: find subgraphs that match a given template
– TeraByte (TB) Sort: Jim Gray’s SORT benchmark
– PageRank: eigenvector computation using the power method
– Disambiguation: binning-based coreference resolution

SLIDE 16

Real-world graphs are used in our performance experiments

[PubMed graph schema] Node types: Article, Author, Journal, Journal Issue, Grant, Grant Agency, Chemical, Keyword, ContactInfo, MeshHeading. Edge types: IsAuthorOf, PublishedIn, IsIssueOf, FundedByGrant, IssuedGrant, HasChemical, HasKeyword, HasContactInfo, HasMeshHeading.

              PubMed Sm   PubMed Lg
|V|           1M          29M
|E|           2M          270M
Raw data size 400 MB      127 GB

SLIDE 17

Path Traversal: Breadth-first search (BFS) on DAS

[BFS illustration: Source → … → Destination]

Improved performance by constructing an adjacency list via denormalization, which reduces the number of rows to join.

BFS times on the large PubMed data (seconds):
                Edge List   Adjacency List (Denormalized)
Unidirectional  287.926     120.359
Bidirectional   204.90      56.431
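The denormalization idea can be sketched in Python (an illustrative stand-in, not the ECL run on DAS): pre-grouping the edge list into one adjacency row per source vertex turns each frontier expansion into a single lookup per frontier vertex, instead of a join against every edge row.

```python
from collections import defaultdict

def denormalize(edge_list):
    # Collapse the edge list into one adjacency row per source vertex.
    adj = defaultdict(list)
    for src, dst in edge_list:
        adj[src].append(dst)
    return adj

def bfs_levels(adj, source):
    # Level-synchronous BFS: expand the whole frontier each round.
    level, frontier, seen = {source: 0}, {source}, {source}
    depth = 0
    while frontier:
        depth += 1
        nxt = {v for u in frontier for v in adj.get(u, []) if v not in seen}
        for v in nxt:
            level[v] = depth
        seen |= nxt
        frontier = nxt
    return level

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(bfs_levels(denormalize(edges), 1))
# == {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
```

Denormalization trades storage for join fan-out, which matches the 2–4X improvement the table reports.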

SLIDE 18

DAS system is ideal for handling complex subgraph pattern queries on large data sets

[The example subgraph pattern query descriptions on this slide did not survive extraction.]
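Although the query texts were lost in extraction, the general technique — expressing a subgraph template as joins over the edge relation — can be sketched in Python (an illustrative stand-in, not the ECL queries run on DAS). Here a hypothetical two-hop path template a → b → c is matched by self-joining the edge table on the middle vertex:

```python
def join_path2(edges):
    # Match the template a -> b -> c by self-joining the edge table on b.
    # A dataflow engine would SORT both inputs on the join key and stream them.
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, []).append(dst)
    return [(a, b, c) for a, b in edges for c in by_src.get(b, [])]

edges = [("x", "y"), ("y", "z"), ("y", "w")]
print(sorted(join_path2(edges)))
# [('x', 'y', 'w'), ('x', 'y', 'z')]
```

Larger templates chain more joins, which is why a system optimized for SORT and JOIN handles these queries well.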

SLIDE 19

DAS system is ideal for handling complex subgraph pattern queries on large data sets (Cont’d)

[Further example query descriptions on this slide did not survive extraction.]

SLIDE 20

Query Performance for Large PubMed (30M nodes)

Query times (seconds):

         DAS (20 nodes)   Netezza (54 nodes)   YADM (4 nodes)
Query 1  9.422            27.3                 120.00
Query 2  142.099          834.47               930.00
Query 3  469.511          15392.96             10188.00
Query 4  37.803           741.42               667.00
Query 5  44.600           496.48               N/A

~250 – ~300X speedup

SLIDE 21

DAS system still outperforms other SQL machines in price/performance

Metric = Time / (#Spindles × Cost):

         DAS           Netezza       YADM
Query 1  2.51253E-05   6.66144E-05   0.0004
Query 2  0.000378931   0.00203618    0.0031
Query 3  0.001252029   0.037560164   0.03396
Query 4  0.000100808   0.001809129   0.002223333
Query 5  0.000118933   0.001211454   N/A

SLIDE 22

LNSSI/LLNL measured Terabyte Sort (TB Sort) performance

  • One of the sort benchmarks that measures the elapsed time to sort 10^12 bytes of data
  • Yahoo holds the current record (as of March 2009): 3.48 minutes on 910 nodes (4 dual-core processors, 4 disks, 8 GB memory), using Hadoop Map/Reduce
  • Performed TB sort on a 20-node DAS system
  • Achieved a 2X speedup by:
    – Radix-based distribution and local sort, which makes tasks independent
    – Optimized SORT operation

Apache Hadoop: 03:20:44
DAS: 01:39:26
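The radix-based distribution can be sketched as follows (a hypothetical Python illustration, not the DAS implementation): records are partitioned by their leading byte, so each bucket can be sorted locally and independently, and concatenating the buckets in radix order is already globally sorted.

```python
def radix_distribute_sort(records, buckets=4):
    # Distribute: bucket i receives only keys whose leading byte falls in
    # range i, so every key in bucket i sorts before every key in bucket i+1.
    bins = [[] for _ in range(buckets)]
    for rec in records:
        bins[rec[0] * buckets // 256].append(rec)
    # Local sort: each bucket is sorted independently (a separate task);
    # concatenation in bucket order yields a globally sorted output.
    return [rec for b in bins for rec in sorted(b)]

data = [b"pear", b"apple", b"zebra", b"fig"]
print(radix_distribute_sort(data))
# [b'apple', b'fig', b'pear', b'zebra']
```

Because no bucket ever needs data from another, there is no merge phase — which is the independence the slide credits for the speedup.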

SLIDE 23

Found some key people in the large Enron email graph by running the PageRank algorithm

The data set has 4,022 Enron employees and 51,078 emails. The top 30 high scorers include some notable names:
– Jeff Dasovich
– Louise Kitchen
– Tana Jones
– John Lavorato

Took 7 seconds to run on DAS.
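The power-method PageRank used here can be sketched on a toy graph (illustrative Python — the graph, function name, and damping factor are assumptions, not the ECL implementation or the Enron data):

```python
def pagerank(adj, damping=0.85, iters=50):
    # Power method: repeatedly apply the PageRank update until the
    # rank vector (the dominant eigenvector) stabilizes.
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        nxt = {v: (1 - damping) / n for v in adj}
        for u, outs in adj.items():
            share = damping * rank[u] / len(outs) if outs else 0.0
            for v in outs:
                nxt[v] += share
        rank = nxt
    return rank

# Toy graph: 'a' and 'b' both point at 'c', which points back at 'a'.
adj = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(adj)
print(max(ranks, key=ranks.get))
# c
```

Each iteration is one sparse matrix–vector product, expressible as a JOIN of the rank table with the edge table followed by an aggregation — a natural fit for the dataflow operators above.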

SLIDE 24

Developed scalable algorithm for author disambiguation in many-tasks paradigm

LLNL has developed an entity resolution algorithm based on a binning (or blocking) algorithm.

The original algorithm could not complete on the full PubMed data set:
– Only 67% completed (in 2 months)
– Could not resolve bins with 300+ names

Uses DAS as an active-disk system: “bring computation to where the data is, instead of moving data from the data store.”

Achieved orders-of-magnitude performance improvement.

SLIDE 25

Distributed disambiguation algorithm: DAS as an Active Disk

[Diagram: original (sequential) algorithm vs. many-tasks disambiguation algorithm]

Original (sequential) algorithm: a Reader pulls author info (abstracts, titles) from a MySQL RDBMS via JDBC, and the Binning Algorithm (BA) bins it by coauthors and keywords; the MySQL RDBMS is the performance bottleneck.

Many-tasks disambiguation algorithm: Binner → Resolver → disambiguated authors. The binning algorithm works only on local data, run by many tasks in parallel. Able to process 45,529,883 bins in 20 hours!
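The binning (blocking) step can be sketched in Python (a hypothetical illustration of blocking in general — the bin key and matching rule are invented, not LLNL's resolver): records are grouped by a cheap key, and the expensive pairwise comparison runs only within each bin, so the bins are independent tasks.

```python
from collections import defaultdict
from itertools import combinations

def bin_records(records):
    # Cheap blocking key: last name + first initial.
    bins = defaultdict(list)
    for rec in records:
        last, first = rec["name"].split(",")
        bins[(last.strip().lower(), first.strip()[0].lower())].append(rec)
    return bins

def resolve_bin(recs):
    # Expensive comparison, confined to one bin: a toy rule that
    # two records co-refer if they share a coauthor.
    return [(a["id"], b["id"]) for a, b in combinations(recs, 2)
            if set(a["coauthors"]) & set(b["coauthors"])]

records = [
    {"id": 1, "name": "Smith, John", "coauthors": ["Lee"]},
    {"id": 2, "name": "Smith, J.", "coauthors": ["Lee", "Kim"]},
    {"id": 3, "name": "Jones, Tana", "coauthors": ["Kim"]},
]
matches = [m for recs in bin_records(records).values() for m in resolve_bin(recs)]
print(matches)
# [(1, 2)]
```

Since every comparison stays inside one bin, bins can be shipped to wherever their data lives — the active-disk idea on the previous slide.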

SLIDE 26

Many-tasks model has enabled efficient large graph analysis

High performance and scalability are feasible with the many-tasks approach.

Benefits:
– Enables data parallelism on large-scale data
– Reduces communication via independent, localized tasks
– Enables optimization of tasks for built-in constructs
– Combines complexity and flexibility

SLIDE 27

Conclusions

Studied the use of the many-tasks model for large, complex graph analysis.

Evaluated the performance of a comprehensive set of graph applications, including subgraph pattern queries, on an actual dataflow system.

The many-tasks paradigm is a very promising approach for graph mining applications and offers many advantages over contemporary methods like RDBMS and Map/Reduce.

SLIDE 28

Thank you
