
Distributed Graph Storage Veronika Molnár, UZH

Overview
- Graphs and Social Networks
- Criteria for Graph Processing Systems
- Current Systems
  - Storage
  - Computation
  - Large scale systems
- Comparison / Best systems
- Questions

Graphs and Social Networks 1
- Graph = collection of nodes + edges connecting nodes to each other
- Social network = collection of individuals and social relations
- A social network is also a graph! (node = person, edge = relation)
[Figure: social network graph (image source: thenextweb.com)]
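
The node/edge definition above can be sketched in a few lines of Python. This is only an illustrative in-memory representation; the people and relations are hypothetical, and `build_graph` is a name chosen for this example.

```python
from collections import defaultdict


def build_graph(relations):
    """Build an undirected graph as {person: set of neighbours}."""
    graph = defaultdict(set)
    for a, b in relations:
        # A social relation is an edge, so it is stored in both directions.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


# Hypothetical people and relations, for illustration only.
relations = [("Ann", "Bob"), ("Bob", "Cem"), ("Cem", "Dia"), ("Ann", "Cem")]
graph = build_graph(relations)
print(sorted(graph["Cem"]))  # ['Ann', 'Bob', 'Dia']
```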

Graphs and Social Networks 2
- Social network graph properties (SNA = Social Network Analysis)
  - Limited number of connections at each node (person), e.g. Facebook: max 5000
  - Distribution not uniform
    - Most people: an average number of connections
    - But: a few people have a lot of connections (power law distribution)
  - Small degree of separation = “Small World” (length of shortest paths)
  - Centrality
  - Constantly changing, but very large graph! (7 billion people = 7 billion nodes)
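
As a rough illustration of the non-uniform degree distribution, the sketch below counts how many nodes have each degree. The toy network is hypothetical and far too small to show a real power law; it only shows the shape of the computation (one hub, several low-degree nodes).

```python
from collections import Counter


def degree_histogram(graph):
    """Map each degree to the number of nodes that have that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


# Hypothetical toy network: one well-connected hub, four low-degree nodes.
toy = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
}
hist = degree_histogram(toy)
for degree in sorted(hist):
    print(degree, hist[degree])
```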

Graphs and Social Networks 3
- Shortest path
- Centrality: betweenness, closeness, PageRank, degree
[Figure: graph diagram; node labels VM and BP]
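
Two of the metrics named on this slide, degree and closeness centrality, can be sketched with a breadth-first search over shortest paths. This is a minimal version that assumes an unweighted, undirected, connected graph with at least two nodes; the function names and the star-shaped example graph are made up for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path lengths to them."""
    result = {}
    for v in graph:
        dist = bfs_distances(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


# Hypothetical star graph: H is connected to every other node.
star = {"H": {"a", "b", "c"}, "a": {"H"}, "b": {"H"}, "c": {"H"}}
degree = degree_centrality(star)
closeness = closeness_centrality(star)
print(degree["H"], closeness["H"])
```

In the star graph the hub scores highest on both metrics, which matches the intuition that central people reach everyone in few steps.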

Graphs and Social Networks 4
- A social network can be…
  - Facebook
  - Emails
  - Mailing lists
  - Academic networks

Criteria for Graph Processing Systems 1
- Modes:
  - Distributed processing
  - Research and industry use
  - Interactive and noninteractive modes
  - Storage of static and dynamic information
[Figure: e-mail connectivity graph (image source: research.microsoft.com)]

Criteria for Graph Processing Systems 2
- Properties:
  - Scalability (social networks are large!)
  - Speed
- Features:
  - SNA (Social Network Analysis) metrics: PageRank, centrality, shortest paths, ...
  - Extensibility
[Figure: e-mail connectivity graph (image source: research.microsoft.com)]

Current Systems 1
Storage:
- Apache Hive (and Hadoop)
- Titan Graph Database
- Neo4j

Current Systems - Storage 2
Apache Hive (and Hadoop)
- Hadoop: MapReduce architecture
- Hive: high-level operations on large data sets
  - HiveQL (similar to SQL)
  - Converted to MapReduce jobs
- Not graph-specific
- Supports custom data formats
- Can be used as a backend for other systems
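
To make the MapReduce idea concrete, here is a small in-process sketch of how a simple graph statistic (node degree) could be phrased as a map step, a shuffle, and a reduce step. It does not use Hadoop or Hive; all function names and the edge list are hypothetical, and a real job would run the same phases across many machines.

```python
from collections import defaultdict


def map_edge(edge):
    """Map phase: emit (node, 1) for both endpoints of an edge."""
    src, dst = edge
    yield (src, 1)
    yield (dst, 1)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(key, values):
    """Reduce phase: sum the counts for one node."""
    return key, sum(values)


def degree_count(edges):
    mapped = (pair for edge in edges for pair in map_edge(edge))
    grouped = shuffle(mapped)
    return dict(reduce_degree(key, values) for key, values in grouped.items())


# Hypothetical edge list, for illustration only.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
degrees = degree_count(edges)
print(degrees)
```

Degree counting fits this model well because each edge can be processed independently; multi-step graph algorithms need one such job per iteration, which is part of why the slide calls Hive "not graph-specific".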

Current Systems - Storage 3
Titan Graph Database
- Store and query large graphs
- Graph schemas
  - edge and vertex labels
- Gremlin query language
  - transactional query model
  - high level operations
- Two backends: Cassandra and HBase

Current Systems - Storage 4
Neo4j
- Cost: €12K for startups (more for large companies), free for personal use
- Graph database management
- ACID compliant (Atomicity, Consistency, Isolation, Durability)
- Graphs are stored as edges, nodes, attributes
- Focus on finding and querying data
- Graph analytics with igraph or GraphX
- Community!
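
The "edges, nodes, attributes" model can be illustrated with a tiny in-memory property graph. This is a sketch of the data model only, not of how Neo4j stores or queries data; the class names, labels, and sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int
    target: int
    label: str
    attributes: dict = field(default_factory=dict)


class PropertyGraph:
    """Nodes and labelled, directed edges, each carrying key/value attributes."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, **attributes):
        self.nodes[node_id] = Node(node_id, attributes)

    def add_edge(self, source, target, label, **attributes):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, label, attributes))

    def neighbours(self, node_id, label):
        """Nodes reached from node_id over outgoing edges with the given label."""
        return [
            self.nodes[e.target]
            for e in self.edges
            if e.source == node_id and e.label == label
        ]


# Hypothetical sample data.
g = PropertyGraph()
g.add_node(1, name="Ann")
g.add_node(2, name="Bob")
g.add_node(3, name="Cem")
g.add_edge(1, 2, "FRIEND", since=2012)
g.add_edge(1, 3, "COLLEAGUE")
friends = [n.attributes["name"] for n in g.neighbours(1, "FRIEND")]
print(friends)  # ['Bob']
```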

[Figure: Neo4j]

Current Systems 5
Computation:
- igraph
- Spark GraphX
- GraphLab

Current Systems - Computation 6
igraph
- Network analysis / network research
- Portable and efficient
- Python, R, C, C++
- Built-in, optimized SNA metrics (centrality, diameter, connected components)
- Stand-alone or grid
- Extensible, 3 layer API

Current Systems - Computation 7
Spark GraphX
- Graphs and parallel graph computations
- User-defined parallel operations
- Stored in-memory for faster processing
- Very good end-to-end performance
- Graphs are immutable; all operations create a new graph
- Prebuilt graph algorithms, e.g. PageRank
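
PageRank is named here as a typical prebuilt algorithm. The sketch below is a plain single-machine power iteration, not GraphX code; the damping factor of 0.85, the fixed iteration count, and the example link graph are assumptions made for illustration. It expects every link target to also appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank. graph: {node: list of nodes it links to}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for w in targets:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank


# Hypothetical link graph: most links point at "c".
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
best = max(ranks, key=ranks.get)
print(best)  # c
```

Each iteration builds a fresh `new_rank` dictionary instead of updating in place, which loosely mirrors the "operations create a new graph" style described on the slide.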

Current Systems - Computation 8
GraphLab
- Cost: $4,000/machine/year, or free 1 year student subscription
- Graph computations: processing & analytics
- Visualization (GraphLab Canvas)
- Machine learning
- Common graph algorithms + API

[Figure: GraphLab]

Current Systems 9
Used by Facebook/Google:
- Pregel/Pregelix
- Apache Giraph

Current Systems - Large Scale 10
Pregel/Pregelix
- Pregel: Google-only, Pregelix: open-source
- BSP (Bulk Synchronous Parallel) model
  - User defined edge, vertex, message types
  - Supersteps
- Extremely large graphs
- In-memory/out-of-core operation models
- Vertex-based API, libraries with graph algorithms
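
The superstep idea can be sketched on a single machine: in each superstep every vertex that received messages runs the same compute logic, sends messages to its neighbours, and all messages are delivered only after a barrier. The example below computes hop counts from a source vertex in that style. It is a simulation of the model under simplifying assumptions, not the Pregel or Pregelix API; the function name and the example graph are hypothetical.

```python
def run_supersteps(graph, source):
    """Hop-count shortest paths in a vertex-centric, BSP-like style.

    graph: {vertex: list of out-neighbours}
    Returns (distance per vertex, number of supersteps executed).
    """
    infinity = float("inf")
    value = {v: infinity for v in graph}
    inbox = {source: [0]}  # messages delivered at the start of superstep 0
    superstep = 0
    while inbox:  # the computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            # Per-vertex compute step: only sees its own value and messages.
            candidate = min(messages)
            if candidate < value[vertex]:
                value[vertex] = candidate
                for neighbour in graph[vertex]:
                    outbox.setdefault(neighbour, []).append(candidate + 1)
        # Barrier: messages sent now become visible in the next superstep.
        inbox = outbox
        superstep += 1
    return value, superstep


# Hypothetical directed graph; "e" cannot be reached from "a".
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
distances, steps = run_supersteps(g, "a")
print(distances["d"], steps)  # 2 3
```

Because vertices only communicate through messages between supersteps, the per-vertex work inside one superstep can be distributed over many machines without locking.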

Current Systems - Large Scale 11
Apache Giraph
- BSP model
- Graph-wide metrics via global operations
- Built on Hadoop, 5-26 times faster than Hive
- Highly parallel, keeps all data in memory
- Scales linearly with number of edges, can make efficient use of large clusters
- Used for PageRank, popularity rank, shortest paths
- No built-in graph metrics

Comparison

System      Focus                     Scalability  SNA  Extensibility    Used for
Hive        parallel computations     any size     no   Java             generic
Titan       storage                   ~100 B       no   Python, Java     graph queries
Neo4j       transactional DB          ~1 B         yes  Java, Python, R  recommender systems
igraph      efficiency, portability   ~1 M         yes  R, Python, C++   research
GraphX      parallel computations     ~1 B         yes  Java, Python, R  graph processing
GraphLab    processing, analytics     ~1 B         yes  C++              recommender systems
Giraph      large scale, BSP          any size     no   Java, Python     Facebook
Pregel(ix)  large scale, BSP          any size     yes  Java             Google

Which is the best?
Depends on the network and intended use.
- Very large social networks:
  - High-performance, customizable systems, such as Pregelix
- Research:
  - igraph and GraphX support R and Python integration
- Analysis and visualisation of social networks:
  - GraphLab with built-in interactive analysis and plotting features
  - Neo4j contains vast amounts of community resources for these tasks
- Custom use cases:
  - Existing systems might not support these
  - Instead: use Hadoop/Hive and write the rest yourself!

Thank you! And stay for some questions.

Questions 1
- Why do we analyse social data?
- What are the possible uses of analysing social data?

Questions 2
- Can visualisation help to understand graphs? (connections can be viewed, subset of graph can be analysed, …)

Questions 3
- Have you ever used such a system? Which one?

Questions 4
- What are the advantages and disadvantages of distributed graph processing?
- What is the value of graph processing?

Questions 5
- How can social metric calculations deal with fake accounts?

The End
