SLIDE 1

Distributed Graph Storage

Veronika Molnár, UZH

SLIDE 2

Overview

  • Graphs and Social Networks
  • Criteria for Graph Processing Systems
  • Current Systems
  • Storage
  • Computation
  • Large scale systems
  • Comparison / Best systems
  • Questions

SLIDE 3

Graphs and Social Networks 1

  • Graph = collection of nodes + edges connecting nodes to each other
  • Social Network = collection of individuals and social relations
  • A Social Network is also a Graph! (node = person, edge = relation)

Social Network graph

(image source : thenextweb.com)
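
A minimal sketch of the "node = person, edge = relation" idea, assuming an undirected network stored as adjacency sets. The names `build_graph`, `relations` and the people in the example are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def build_graph(relations):
    """Build an undirected graph: node = person, edge = social relation."""
    adjacency = defaultdict(set)
    for a, b in relations:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical example data.
relations = [("Ann", "Bob"), ("Bob", "Cem"), ("Ann", "Dee")]
network = build_graph(relations)
print(sorted(network["Ann"]))  # ['Bob', 'Dee']
```

Adjacency sets keep neighbour lookups cheap, which is what most of the metrics on the following slides rely on.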

SLIDE 4

Graphs and Social Networks 2

  • Social Network graph properties (SNA = Social Network Analysis):
      • Limited number of connections at each node (person), e.g. Facebook: max. 5000
      • Distribution of connections is not uniform: most people have an average number of connections, but a few people have a lot (power-law distribution)
      • Small degree of separation = “Small World” (shortest paths are short)
      • Centrality
      • Constantly changing and very large graph! (7 billion people = 7 billion nodes)
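
The degree properties above can be sketched in a few lines, assuming the network is held as a dict of adjacency sets. The 5000 cap is the Facebook figure from the slide; the function names and the toy network are hypothetical.

```python
from collections import Counter

MAX_CONNECTIONS = 5000  # cap quoted on the slide for Facebook


def degree_distribution(adjacency):
    """Map each degree to the number of nodes that have that degree."""
    return Counter(len(neighbours) for neighbours in adjacency.values())


def can_add_connection(adjacency, node, limit=MAX_CONNECTIONS):
    """True while the node is still below the connection limit."""
    return len(adjacency.get(node, ())) < limit


# Hypothetical toy network: "hub" is connected to everyone else.
toy = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a"},
    "c": {"hub"},
    "d": {"hub"},
}
dist = degree_distribution(toy)
print(dict(dist))
```

On a real social network the same histogram would show the long tail: many low-degree nodes and a handful of hubs.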

SLIDE 5

Graphs and Social Networks 3

Shortest Path

(figure: example graph, node labels VM and BP)

Centrality

  • Betweenness
  • Closeness
  • PageRank
  • Degree
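
As a rough illustration of two of these metrics, the sketch below computes shortest-path lengths with breadth-first search and derives degree and closeness centrality from them. It assumes an unweighted, undirected, connected graph; the function names and the four-node example are hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(adjacency, node):
    """Fraction of the other nodes that this node is directly connected to."""
    return len(adjacency[node]) / (len(adjacency) - 1)


def closeness_centrality(adjacency, node):
    """Reachable nodes divided by the sum of their distances from this node."""
    dist = bfs_distances(adjacency, node)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical path graph A - B - C - D.
line = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(bfs_distances(line, "A")["D"])    # 3
print(closeness_centrality(line, "B"))  # 0.75
```

Betweenness and PageRank need more machinery (all-pairs shortest paths and an iterative computation, respectively) and are left out of this sketch.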

SLIDE 6

Graphs and Social Networks 4

  • Social Network can be…
  • Facebook
  • Emails
  • Mailing lists
  • Academic networks

SLIDE 7

Criteria for Graph Processing Systems 1

  • Modes:
  • Distributed processing
  • Research and industry use
  • Interactive and noninteractive modes
  • Storage of static and dynamic information

E-mail connectivity graph

(image source: research.microsoft.com)

SLIDE 8

Criteria for Graph Processing Systems 2

  • Properties:
  • Scalability (social networks are large!)
  • Speed
  • Features:
  • SNA (Social Network Analysis) metrics: PageRank, Centrality, Shortest paths, ...
  • Extensibility

E-mail connectivity graph

(image source: research.microsoft.com)
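
To make the PageRank metric concrete, here is a minimal power-iteration sketch. It assumes a directed graph given as a dict of out-link lists, a damping factor of 0.85 and a fixed number of iterations; these parameter choices and the example links are assumptions, not values from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power iteration; every link target must also be a key of out_links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without out-links spreads its rank evenly over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical link structure: C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

A graph processing system would distribute the inner loop over many machines; the arithmetic per node stays the same.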

SLIDE 9

Current Systems 1

Storage:

  • Apache Hive (and Hadoop)
  • Titan Graph Database
  • Neo4j

SLIDE 10

Current Systems - Storage 2

Apache Hive (and Hadoop)

  • Hadoop: MapReduce architecture
  • Hive: High-level operations on large data sets
  • HiveQL (similar to SQL)
  • Converted to MapReduce jobs
  • Not graph-specific
  • Supports custom data formats
  • Can be used as a backend for other systems
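
The sketch below imitates, in plain single-process Python, the map, shuffle and reduce steps that a grouped count over an edge list would go through. It does not use Hadoop or Hive; the function names and the edge data are hypothetical, and the HiveQL in the comment refers to a made-up table.

```python
from collections import defaultdict

# Roughly the shape of a grouped count such as
#   SELECT person, COUNT(*) FROM endpoints GROUP BY person
# where "endpoints" is a hypothetical table with one row per edge endpoint.


def map_phase(edges):
    """Map: emit (person, 1) for both endpoints of every edge."""
    for a, b in edges:
        yield a, 1
        yield b, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values of each key, giving the degree per person."""
    return {key: sum(values) for key, values in groups.items()}


edges = [("Ann", "Bob"), ("Bob", "Cem"), ("Ann", "Cem"), ("Cem", "Dee")]
degrees = reduce_phase(shuffle(map_phase(edges)))
print(degrees["Cem"])  # 3
```

Because the model only sees key/value pairs, it is not graph-specific: multi-hop questions such as shortest paths need one job per hop.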

SLIDE 11

Current Systems - Storage 3

Titan Graph Database

  • Store and Query large graphs
  • Graph schemas
  • edge and vertex labels
  • Gremlin query language
  • transactional query model
  • high level operations
  • Two backends: Cassandra and HBase
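
To illustrate what vertex and edge labels buy you, here is a small plain-Python stand-in for a labelled graph with one traversal step. It is not the Titan or Gremlin API; the class `LabelledGraph` and its methods are hypothetical, and `out` only loosely mirrors the idea of Gremlin's `out` step (follow outgoing edges with a given label).

```python
from dataclasses import dataclass, field


@dataclass
class LabelledGraph:
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)          # (source, edge label, target)

    def add_vertex(self, vertex_id, label):
        self.vertex_labels[vertex_id] = label

    def add_edge(self, source, label, target):
        # Schema-style check: both endpoints must already exist.
        if source not in self.vertex_labels or target not in self.vertex_labels:
            raise KeyError("unknown vertex")
        self.edges.append((source, label, target))

    def out(self, vertex_id, edge_label):
        """Targets reached from vertex_id over edges with the given label."""
        return [t for s, l, t in self.edges if s == vertex_id and l == edge_label]


g = LabelledGraph()
g.add_vertex("ann", "person")
g.add_vertex("bob", "person")
g.add_vertex("uzh", "university")
g.add_edge("ann", "knows", "bob")
g.add_edge("ann", "studies_at", "uzh")
print(g.out("ann", "knows"))  # ['bob']
```

Labels let a query select only the relation it cares about, which is the basis for the high-level operations listed above.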

SLIDE 12

Current Systems - Storage 4

Neo4j

  • Cost: €12K for startups (more for large companies), free for personal use
  • Graph Database Management
  • ACID compliant (Atomicity, Consistency, Isolation, Durability)
  • Graphs are stored as Edges, Nodes, Attributes
  • Focus on finding and querying data
  • Graph analytics with igraph or GraphX
  • Community!
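
A toy sketch of two points above: storing a graph as nodes, edges and attributes, and the atomicity part of ACID. It is an in-memory illustration only and not how Neo4j is implemented; `TinyGraphStore` and its transaction class are hypothetical, and consistency, isolation and durability are not modelled.

```python
import copy


class TinyGraphStore:
    def __init__(self):
        self.nodes = {}  # node id -> attribute dict
        self.edges = {}  # (source, target) -> attribute dict

    def transaction(self):
        return _Transaction(self)


class _Transaction:
    def __init__(self, store):
        self.store = store

    def __enter__(self):
        # Work on private copies; the store only changes on a clean exit.
        self._nodes = copy.deepcopy(self.store.nodes)
        self._edges = copy.deepcopy(self.store.edges)
        return self

    def add_node(self, node_id, **attrs):
        self._nodes[node_id] = attrs

    def add_edge(self, source, target, **attrs):
        if source not in self._nodes or target not in self._nodes:
            raise KeyError("edge endpoint does not exist")
        self._edges[(source, target)] = attrs

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.store.nodes, self.store.edges = self._nodes, self._edges
        return False  # re-raise errors; a failed transaction changes nothing


store = TinyGraphStore()
with store.transaction() as tx:
    tx.add_node("ann", age=30)
    tx.add_node("bob", age=25)
    tx.add_edge("ann", "bob", kind="friend")

try:
    with store.transaction() as tx:
        tx.add_node("cem")
        tx.add_edge("cem", "nobody")  # fails, so "cem" is not stored either
except KeyError:
    pass

print(sorted(store.nodes))  # ['ann', 'bob']
```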

SLIDE 13

Neo4j

SLIDE 14

Current Systems 5

Computation:

  • igraph
  • Spark GraphX
  • GraphLab

SLIDE 15

Current Systems - Computation 6

igraph

  • Network analysis / network research
  • Portable and efficient
  • Python, R, C, C++
  • Built-in, optimized SNA metrics (centrality, diameter, connected components)
  • Stand-alone or Grid
  • Extensible, 3 layer API
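
igraph ships optimized built-in versions of metrics such as diameter and connected components. The sketch below is a plain-Python stand-in that shows what those two metrics compute; it does not use the igraph API, and the function names and example graph are hypothetical.

```python
from collections import deque


def _bfs(adjacency, source):
    """Hop distances from source to every node in its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def connected_components(adjacency):
    """List of node sets, one per connected component."""
    seen, components = set(), []
    for node in adjacency:
        if node not in seen:
            component = set(_bfs(adjacency, node))
            seen |= component
            components.append(component)
    return components


def diameter(adjacency):
    """Longest shortest path, taken over pairs that are connected."""
    return max(max(_bfs(adjacency, node).values()) for node in adjacency)


# Hypothetical undirected graph with two components: 1-2-3 and 4-5.
graph = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}}
print(connected_components(graph))  # [{1, 2, 3}, {4, 5}]
print(diameter(graph))              # 2
```

Running one BFS per node is quadratic in the worst case, which is why a tuned library matters once graphs grow beyond toy size.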

SLIDE 16

Current Systems - Computation 7

Spark GraphX

  • Graphs and parallel graph computations
  • User-defined parallel operations
  • stored in-memory for faster processing
  • very good end-to-end performance
  • graphs are immutable; all operations create a new graph
  • Prebuilt graph algorithms, e.g. PageRank
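
The immutability point can be sketched with a frozen dataclass: every operation returns a new graph and leaves the original untouched. This is plain Python, not Spark or GraphX code; the class `ImmutableGraph` is hypothetical, and its method names are only loosely modelled on the GraphX operators `mapVertices` and `subgraph`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImmutableGraph:
    vertices: tuple  # ((vertex id, attribute), ...)
    edges: tuple     # ((source id, target id), ...)

    def map_vertices(self, fn):
        """New graph with fn(vertex id, attribute) as each vertex attribute."""
        mapped = tuple((vid, fn(vid, attr)) for vid, attr in self.vertices)
        return ImmutableGraph(mapped, self.edges)

    def subgraph(self, keep):
        """New graph with only the vertices accepted by keep, and edges between them."""
        kept = tuple((vid, attr) for vid, attr in self.vertices if keep(vid, attr))
        ids = {vid for vid, _ in kept}
        kept_edges = tuple((s, t) for s, t in self.edges if s in ids and t in ids)
        return ImmutableGraph(kept, kept_edges)


g = ImmutableGraph(((1, "ann"), (2, "bob"), (3, "cem")), ((1, 2), (2, 3)))
upper = g.map_vertices(lambda vid, name: name.upper())
small = g.subgraph(lambda vid, name: vid != 3)

print(g.vertices[0])      # (1, 'ann')  original is unchanged
print(upper.vertices[0])  # (1, 'ANN')
print(small.edges)        # ((1, 2),)
```

Immutability makes it safe to share a graph between parallel tasks and to recompute lost partitions, at the cost of creating new objects on every change.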

SLIDE 17

Current Systems - Computation 8

GraphLab

  • Cost: $4,000/machine/year, or free 1 year student subscription
  • Graph computations: processing & analytics
  • Visualization (GraphLab Canvas)
  • Machine learning
  • Common graph algorithms + API

SLIDE 18

GraphLab

SLIDE 19

Current Systems 9

Used by Facebook/Google:

  • Pregel/Pregelix
  • Apache Giraph

SLIDE 20

Current Systems - Large Scale 10

Pregel/Pregelix

  • Pregel: Google-only, Pregelix: open-source
  • BSP (Bulk Synchronous Parallel) model
  • User defined edge, vertex, message types
  • Supersteps
  • Extremely large graphs
  • in-memory/out-of-core operation models
  • Vertex-based API, libraries with graph algorithms
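
A single-process sketch of the vertex-centric BSP idea, using single-source shortest paths as the example. In each superstep, only vertices that received messages do work, and the messages they send become visible in the next superstep. This simulates the model in plain Python and is not the Pregel or Pregelix API; the function name and the weighted example graph are hypothetical.

```python
def bsp_shortest_paths(out_edges, source):
    """out_edges: vertex -> list of (target, weight); every target must be a key."""
    distance = {v: float("inf") for v in out_edges}
    inbox = {source: [0]}  # messages delivered at the start of the superstep
    supersteps = 0
    while inbox:  # the computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # Tell the neighbours about the improved distance.
                for target, weight in out_edges[vertex]:
                    outbox.setdefault(target, []).append(best + weight)
        inbox = outbox  # barrier: messages are only seen in the next superstep
        supersteps += 1
    return distance, supersteps


# Hypothetical weighted directed graph.
edges = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 1)],
    "C": [("D", 2)],
    "D": [],
}
distance, supersteps = bsp_shortest_paths(edges, "A")
print(distance)    # {'A': 0, 'B': 1, 'C': 2, 'D': 4}
print(supersteps)  # 4
```

In a distributed setting the vertices are partitioned across workers and the barrier is a cluster-wide synchronization point, but the per-vertex logic keeps this shape.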

SLIDE 21

Current Systems - Large Scale 11

Apache Giraph

  • BSP model
  • Graph-wide metrics via global operations
  • Built on Hadoop, 5-26 times faster than Hive
  • Highly parallel, keeps all data in memory
  • Scales linearly with number of edges, can make efficient use of large clusters
  • Used for PageRank, popularity rank, shortest paths
  • No built-in graph metrics

SLIDE 22

Comparison

System      Focus                    Scalability  SNA  Extensibility    Used for
Hive        parallel computations    any size     no   Java             generic
Titan       storage                  ~100 B       no   Python, Java     graph queries
Neo4j       transactional DB         ~1 B         yes  Java, Python, R  recommender systems
igraph      efficiency, portability  ~1 M         yes  R, Python, C++   research
GraphX      parallel computations    ~1 B         yes  Java, Python, R  graph processing
GraphLab    processing, analytics    ~1 B         yes  C++              recommender systems
Giraph      large scale, BSP         any size     no   Java, Python     Facebook
Pregel(ix)  large scale, BSP         any size     yes  Java             Google

SLIDE 23

Which is the best?

Depends on the network and intended use...

  • Very large Social Networks:
      • High-performance, customizable systems, such as Pregelix
  • Research:
      • igraph and GraphX support R and Python integration
  • Analysis and Visualisation of Social Networks:
      • GraphLab with built-in interactive analysis and plotting features
      • Neo4j has a large body of community resources for these tasks
  • Custom use cases...
      • Existing systems might not support these
      • Instead: use Hadoop/Hive and write the rest yourself!

SLIDE 24

Thank You! aaaaaand Stay for some questions

SLIDE 25

Questions 1

Why do we analyse social data? What are the possible uses of analysing social data?

SLIDE 26

Questions 2

Can visualisation help to understand graphs? (connections can be viewed, subset of graph can be analysed, …)

SLIDE 27

Questions 3

Have you ever used such a system? Which one?

SLIDE 28

Questions 4

What are the advantages and disadvantages of distributed graph processing? What is the value of graph processing?

SLIDE 29

Questions 5

How can social metric calculations deal with fake accounts?

SLIDE 30

The End ...
