SCALABLE GRAPH ANALYTICS WITH GRADOOP

ERHARD RAHM, MARTIN JUNGHANNS, ANDRE PETERMANN, KEVIN GOMEZ, ERIC PEUKERT
www.scads.de

GERMAN CENTERS FOR BIG DATA
Two Centers of Excellence for Big Data in Germany
• ScaDS Dresden/Leipzig
• Berlin Big Data Center (BBDC)
ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions Dresden/Leipzig)
• scientific coordinators: Nagel (TUD), Rahm (UL)
• start: Oct. 2014
• duration: 4 years (option for 3 more years)
• initial funding: ca. 5.6 million euros

GOALS  Bundling and advancement of existing expertise on Big Data  Development of Big Data Services and Solutions  Big Data Innovations 3 FUNDED INSTITUTES Univ. Leipzig TU Dresden Leibniz Institute of Max-Planck Institute for Ecological Urban and Regional Molecular Cell Biology Development and Genetics 4

ASSOCIATED PARTNERS  Avantgarde-Labs GmbH  Hochschule für Telekommunikation Leipzig  Data Virtuality GmbH  Institut für Angewandte Informatik  E-Commerce Genossenschaft e. G. e. V.  European Centre for Emerging  Landesamt für Umwelt, Landwirtschaft Materials and Processes Dresden und Geologie  Fraunhofer-Institut für Verkehrs- und  Netzwerk Logistik Leipzig-Halle e. V. Infrastruktursysteme  Sächsische Landesbibliothek – Staats-  Fraunhofer-Institut für Werkstoff- und und Universitätsbibliothek Dresden Strahltechnik  Scionics Computer Innovation GmbH  GISA GmbH  Technische Universität Chemnitz  Helmholtz-Zentrum Dresden - Rossendorf  Universitätsklinikum Carl Gustav Carus 5 STRUCTURE OF THE CENTER Life sciences Service Material and Engineering sciences Environmental / Geo sciences center Digital Humanities Business Data Big Data Life Cycle Management and Workflows Data Quality / Knowledge Visual Data Integration Extraktion Analytics Efficient Big Data Architectures 6

RESEARCH PARTNERS  Data-intensive computing W.E. Nagel  Data quality / Data integration E. Rahm  Databases W. Lehner, E. Rahm  Knowledge extraction/Data mining C. Rother, P. Stadler, G. Heyer  Visualization S. Gumhold, G. Scheuermann  Service Engineering, Infrastructure K.-P. Fähnrich, W.E. Nagel, M. Bogdan 7 APPLICATION COORDINATORS  Life sciences G. Myers  Material / Engineering sciences M. Gude  Environmental / Geo sciences J. Schanze  Digital Humanities G. Heyer  Business Data B. Franczyk 8

AGENDA  ScaDS Dresden/Leipzig  Big Graph Data  Graph-based Business Intelligence with BIIIG  basic approaches for graph data management/analysis  GraDoop: Hadoop-based graph data management and analysis  Gradoop characteristics and architecture  Extended Property Graph Data Model (EPGM) / Graph operators  Distributed graph store  Sample workflows  Summary and outlook 9 „GRAPHS ARE EVERYWHERE“ Social science Engineering Life science Information science Facebook Internet Gene (human) World Wide Web ca. 1.3 billion users ca. 2.9 billion users 20,000-25,000 ca. 1 billion Websites ca. 340 friends per user ca. 4 million individuals LOD-Cloud Twitter Patients ca. 31 billion triples ca. 300 million users > 18 millions (Germany) ca. 500 million tweets per day Illnesses > 30.000 10

USE CASE: GRAPH-BASED BUSINESS INTELLIGENCE
• Business intelligence is usually based on relational data warehouses
  • enterprise data is integrated within a dimensional schema
  • analysis is limited to predefined relationships
  • no support for relationship-oriented data mining
• Graph-based approach (BIIIG)
  • integrate data sources within an instance graph by preserving original relationships between data objects (transactional and master data)
  • determine subgraphs (business transaction graphs) related to business activities
  • analyze subgraphs or entire graphs with aggregation queries, mining relationship patterns, etc.

SAMPLE GRAPH
[Figure: sample graph]

BIIIG DATA INTEGRATION AND ANALYSIS WORKFLOW
"Business Intelligence on Integrated Instance Graphs" (PVLDB 2014)

SCREENSHOT FOR NEO4J IMPLEMENTATION
[Figure: screenshot of the Neo4J implementation]

GRAPH DATA MANAGEMENT  Relational database systems  store vertices and edges in tables  utilize indexes, column stores, etc.  could be used as a basis (graph store) to implement graph operators  Graph database system, e.g. Neo4J  use of property graph data model: vertices and edges have arbitrary set of properties ( represented as key-value pairs )  focus on simple transactions and queries  insufficient scalability  insufficient support for graph mining 15 GRAPH DATA MANAGEMENT (2)  Parallel graph processing systems, e.g., Google Pregel, Apache Giraph, GraphX, etc.  in-memory storage of graphs in Shared Nothing cluster  parallel processing of general graph algorithms, e.g. page rank, connected components, …  newer approaches (Spark, Flink): analysis workflow with graph operators  little support for semantically expressive graphs  no end-to-end approach with data integration and persistent graph storage 16

WHAT'S MISSING?
An end-to-end framework and research platform for efficient, distributed and domain independent graph data management and analytics.

GRADOOP CHARACTERISTICS  Hadoop-based framework for graph data management and analysis  Graph storage in scalable distributed store, e.g., HBase  Extended property graph data model  operators on graphs and sets of (sub) graphs  support for semantic graph queries and mining  Leverages powerful components of Hadoop ecosystem  MapReduce, Giraph, Spark, Pig, Drill …  New functionality for graph-based processing workflows and graph mining 19 END-TO-END GRAPH ANALYTICS Data Integration Graph Analytics Representation  Int Integr grate ate dat ata from one or more sources into a dedicated gr graph aph storage with common sto common gr graph aph dat ata model odel  Definition of analytical analytical wor orkf kflows lows from oper operator ator algebr algebra  Result representation in meaningful meaningful way

HIGH LEVEL ARCHITECTURE
[Figure: architecture layers. Workflow Declaration (Visual, GrALa DSL), Workflow Execution and Representation; Operator Implementations (Data Integration, Graph Analytics, Representation); Extended Property Graph Model; HBase Distributed Graph Store; HDFS Cluster. Arrows distinguish data flow and control flow.]

DATA MODEL - REQUIREMENTS
1. Simple but powerful
   • intuitive graphs are flat structures of vertices and binary edges
2. Logical graphs
   • support of multiple, possibly overlapping graphs in one database is advantageous for analytical applications
3. Attributes and type labels
   • type labels and custom properties for vertices, edges and graphs
4. Parallel edges and loops
   • allow multiple relations between two vertices and self-connected relations

EXTENDED PROPERTY GRAPH MODEL
$DB = \langle V, E, G, T, \tau, K, A, \kappa \rangle$
• Vertex space: $V = \{v_1, v_2, \dots, v_n\}$
• Edge space: $E = \{e_1, \dots, e_m\}$ with $e_k = \langle v_i, v_j \rangle$, $v_i, v_j \in V$
• Logical graphs: $G = \{G_1, \dots, G_p\}$ with $G_q = \langle V_q, E_q \rangle$, $V_q \subseteq V \wedge E_q \subseteq E$
• Type labels: $\tau : V \cup E \cup G \rightarrow T$
• Properties: $\kappa : (V \cup E \cup G) \times K \rightarrow A$
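A minimal in-memory sketch of the model may help make the requirements concrete: type labels and properties on vertices, edges and graphs, logical graphs that may overlap, and parallel edges and loops. The class and method names (`Element`, `Edge`, `LogicalGraph`, `Database`, `tau`, `kappa`) are hypothetical and are not Gradoop's API; the sample data is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Shared shape of vertices, edges and logical graphs: label plus properties."""
    label: str
    props: dict = field(default_factory=dict)


@dataclass
class Edge(Element):
    source: int = 0
    target: int = 0


@dataclass
class LogicalGraph(Element):
    vertex_ids: set = field(default_factory=set)
    edge_ids: set = field(default_factory=set)


class Database:
    """Vertex space V, edge space E and logical graphs G, keyed by identifier."""

    def __init__(self):
        self.V, self.E, self.G = {}, {}, {}

    def add_vertex(self, vid, label, **props):
        self.V[vid] = Element(label, props)

    def add_edge(self, eid, source, target, label, **props):
        # Edges are binary and must connect vertices of the vertex space.
        if source not in self.V or target not in self.V:
            raise KeyError("edge endpoints must be in the vertex space")
        self.E[eid] = Edge(label, props, source, target)

    def add_graph(self, gid, label, vertex_ids, edge_ids, **props):
        vertex_ids, edge_ids = set(vertex_ids), set(edge_ids)
        # A logical graph is a pair of subsets of V and E.
        if not vertex_ids.issubset(self.V) or not edge_ids.issubset(self.E):
            raise KeyError("logical graphs must reference existing elements")
        self.G[gid] = LogicalGraph(label, props, vertex_ids, edge_ids)

    def tau(self, element):
        """Type label of a vertex, edge or logical graph."""
        return element.label

    def kappa(self, element, key):
        """Property value for a key, or None if the key is not set."""
        return element.props.get(key)


db = Database()
db.add_vertex(1, "Person", name="Alice")
db.add_vertex(2, "Person", name="Bob")
db.add_edge(10, 1, 2, "knows")
db.add_edge(11, 1, 2, "worksWith")  # parallel edge between the same vertices
db.add_edge(12, 1, 1, "follows")    # loop
db.add_graph(100, "Community", {1, 2}, {10, 11}, topic="graphs")
db.add_graph(101, "Community", {1}, {12})  # overlaps graph 100 in vertex 1
```

Because logical graphs only hold identifiers, a vertex or edge is stored once even when several graphs contain it.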

GRAPH OPERATORS
Unary operators (GrALa notation)
• Pattern Matching: graph.match(patternGraph, predicate) : Collection
• Aggregation: graph.aggregate(propertyKey, aggregateFunction) : Graph
• Projection: graph.project(vertexFunction, edgeFunction) : Graph
• Summarization: graph.summarize(vertexGroupKeys, vertexAggregateFunction, edgeGroupKeys, edgeAggregateFunction) : Graph
Binary operators (GrALa notation)
• Combination (⊔): graph.combine(otherGraph) : Graph
• Overlap (⊓): graph.overlap(otherGraph) : Graph
• Exclusion: graph.exclude(otherGraph) : Graph

PATTERN MATCHING
    pattern = new Graph("(a)<-d-(b)-e->(c)")
    predicate = (Graph g =>
        g.V[$a][:type] == "Person" &&
        g.V[$b][:type] == "Forum" &&
        g.V[$c][:type] == "Person" &&
        g.E[$d][:type] == "hasMember" &&
        g.E[$e][:type] == "hasMember")
    result = db.match(pattern, predicate)
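To make the semantics of the pattern (a)<-d-(b)-e->(c) concrete, here is a brute-force sketch in Python. It is not how Gradoop evaluates matches. The function name `match_forum_members`, the dictionary-based graph encoding and the sample data are hypothetical, and the sketch assumes that distinct pattern elements bind to distinct graph elements, which the slides do not state.

```python
from itertools import permutations


def match_forum_members(vertices, edges):
    """Find bindings for (a)<-d-(b)-e->(c) with the type predicate applied.

    vertices: {vertex_id: type_label}
    edges:    {edge_id: (source_id, target_id, type_label)}
    """
    results = []
    # Try every ordered pair of distinct edges as candidates for d and e.
    for d, e in permutations(edges, 2):
        d_src, a, d_label = edges[d]
        e_src, c, e_label = edges[e]
        # Both edges must start at the same vertex b; a and c must differ.
        if d_src != e_src or a == c:
            continue
        b = d_src
        if (vertices[a] == "Person" and vertices[b] == "Forum"
                and vertices[c] == "Person"
                and d_label == "hasMember" and e_label == "hasMember"):
            results.append({"a": a, "b": b, "c": c, "d": d, "e": e})
    return results


# Illustrative data, not taken from the slides.
vertices = {1: "Person", 2: "Person", 3: "Forum", 4: "Person"}
edges = {
    10: (3, 1, "hasMember"),
    11: (3, 2, "hasMember"),
    12: (1, 2, "knows"),
}
matches = match_forum_members(vertices, edges)
```

Each result is one binding of the pattern variables. Since a and c are interchangeable in this pattern, every pair of forum members appears twice, once per assignment.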

SUMMARIZATION
    personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
    vertexGroupingKeys = {:type, "city"}
    edgeGroupingKeys = {:type}
    vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
    edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
    sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc,
                                     edgeGroupingKeys, edgeAggFunc)
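The summarization workflow can be approximated in plain Python to show what the grouping produces. This is a sketch under stated assumptions, not Gradoop code: graphs are encoded as a pair of dictionaries, the type label is stored under the property key "type", the aggregate is fixed to a count as in the example, and the helper names `combine` and `summarize` as well as the sample data are hypothetical.

```python
from collections import Counter


def combine(g1, g2):
    """Combination: union of the vertex sets and edge sets of two graphs."""
    vertices = {**g1[0], **g2[0]}
    edges = {**g1[1], **g2[1]}
    return vertices, edges


def summarize(graph, vertex_keys, edge_keys):
    """Group vertices and edges by the given keys and count group members.

    graph: (vertices, edges) with
      vertices: {vertex_id: properties}
      edges:    {edge_id: (source_id, target_id, properties)}
    """
    vertices, edges = graph
    # Each vertex is assigned to the group identified by its grouping values.
    group_of = {
        vid: tuple(props.get(k) for k in vertex_keys)
        for vid, props in vertices.items()
    }
    vertex_counts = Counter(group_of.values())
    # Edges are grouped by the groups of their endpoints and their own keys.
    edge_counts = Counter(
        (group_of[src], group_of[dst], tuple(props.get(k) for k in edge_keys))
        for src, dst, props in edges.values()
    )
    return vertex_counts, edge_counts


# Illustrative data: three overlapping logical graphs.
alice = {"type": "Person", "city": "Leipzig"}
bob = {"type": "Person", "city": "Leipzig"}
carol = {"type": "Person", "city": "Dresden"}
g0 = ({1: alice, 2: bob}, {10: (1, 2, {"type": "knows"})})
g1 = ({2: bob, 3: carol}, {11: (2, 3, {"type": "knows"})})
g2 = ({3: carol, 1: alice}, {12: (3, 1, {"type": "knows"})})

person_graph = combine(combine(g0, g1), g2)
vertex_counts, edge_counts = summarize(person_graph, ["type", "city"], ["type"])
```

Every key of `vertex_counts` corresponds to one summarized vertex and every key of `edge_counts` to one summarized edge between two vertex groups, each carrying its count.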
