End-to-End In-memory Graph Analytics
Jure Leskovec (@jure)
Including joint work with Rok Sosic, Deepak Narayanan, Yonathan Perez, et al.
Jure Leskovec, Stanford 1
Background & Motivation
My research at Stanford: Mining large social and
Data → Graph analytics → New knowledge and insights
[Figure: example workflow for finding experts in Python Q&A. Tables: Posts, Questions, Answers, Users. Operations: Select, Select, Join, Construct Graph, PageRank Algorithm, Join. Outputs: Scores, Experts.]
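The expert-finding workflow (select, join, construct graph, PageRank) can be sketched in plain Python. This is a minimal illustration, not the SNAP or Ringo API: the toy `posts` rows, the column layout, and the helper names `select`, `build_graph`, and `pagerank` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy "Posts" table: (post_id, type, user, tag, parent_question_id)
posts = [
    (1, "question", "alice", "python", None),
    (2, "answer", "bob", "python", 1),
    (3, "question", "carol", "python", None),
    (4, "answer", "bob", "python", 3),
    (5, "answer", "dave", "python", 3),
    (6, "question", "alice", "java", None),
    (7, "answer", "carol", "java", 6),
]

def select(rows, post_type, tag):
    """Select rows of one post type carrying one tag."""
    return [r for r in rows if r[1] == post_type and r[3] == tag]

def build_graph(questions, answers):
    """Join answers to their questions; add an edge asker -> answerer."""
    asker = {q[0]: q[2] for q in questions}
    return {(asker[a[4]], a[2]) for a in answers if a[4] in asker}

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = sorted({n for e in edges for n in e})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank

questions = select(posts, "question", "python")
answers = select(posts, "answer", "python")
edges = build_graph(questions, answers)
scores = pagerank(edges)
experts = sorted(scores, key=scores.get, reverse=True)
```

Here `experts` lists users by descending score; joining it back with a Users table would be the final step of the workflow.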
[Figure: Raw data (video, text, sound, events, sensor data, gene sequences, documents, …) → Hadoop MapReduce → Structured data (relational tables) → Graph analytics]
SNAP: A General Purpose Network Analysis and Graph Mining Library.
RINGO: Interactive Graph Analytics on Big-Memory Machines Y. Perez, R. Sosic,
§ C++, Python (BSD, open source)
§ http://snap.stanford.edu
Data → Graph analytics → New knowledge and insights
§ Common high-level programming language
§ Interactive use (as opposed to batch use)
§ Billions of edges
§ Transformations between tables and graphs
§ Straightforward to use
§ Provenance
§ Networks in Stanford Large Network Collection
§ http://snap.stanford.edu
§ Common benchmark: the Twitter2010 graph has 1.5B edges and requires 13.2GB RAM in SNAP
Number of Edges    Number of Graphs
<0.1M              16
0.1M – 1M          25
1M – 10M           17
10M – 100M         7
100M – 1B          5
>1B                1
Entity          #Items    Size
Papers          122.7M    32.4GB
Authors         123.1M    3.1GB
References      757.5M    14.4GB
Affiliations    325.4M    15.3GB
Keywords        176.8M    5.9GB
Total           1.9B      104.1GB
Dataset                      #Items    Raw Size
DisGeNet                     30K       10MB
STRING                       10M       1TB
OMIM                         25K       100MB
CTD                          55K       1.2GB
HPRD                         30K       30MB
BioGRID                      64K       100MB
DrugBank                     7K        60MB
Disease Ontology             10K       5MB
Protein Ontology             200K      130MB
Mesh Hierarchy               30K       40MB
PubChem                      90M       1GB
DGIdb                        5K        30MB
Gene Ontology                45K       10MB
MSigDB                       14K       70MB
Reactome                     20K       100MB
GEO                          1.7M      80GB
ICGC (66 cancer projects)    40M       1TB
GTEx                         50M       100GB
Standard SQL database vs. custom representations
Separate systems for tables and graphs vs. integrated system for tables and graphs
Single representation for tables and graphs vs. separate table and graph representations
Distributed system vs. single machine system
Disk-based structures vs. in-memory structures
[Figure: workflow from unstructured data to results. Stages and operations: Unstructured data, Specify entities, Relational tables, Specify relationships, Network representation, Optimize representation, Tabular networks, Perform graph analytics, Results, Integrate results]
RINGO: Interactive Graph Analytics on Big-Memory Machines Y. Perez, R. Sosic,
[Figure: system architecture]
High-Level Language User Front-End
Interface with Graph Processing Engine
SNAP: In-memory Graph Processing Engine (Table Objects, Graph Containers, Graph Methods, Graph, Table Conversions, Filters)
Provenance Script, Metadata (Provenance)
Secondary Storage
Table data structure:
Src    Dst    …
v1     v2     …
v2     v3     …
v3     v4     …
v1     v3     …
v1     v4     …
Graph data structure: nodes v1, v2, v3, v4
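A conversion between the edge table and a graph structure might look like the following. This is a plain-Python sketch under assumptions (the dict-of-lists layout and the function names `table_to_graph` and `graph_to_table` are illustrative, not the SNAP API); it uses the five rows shown in the slide's table.

```python
from collections import defaultdict

# Edge table from the slide: one (Src, Dst) row per edge; other columns omitted.
edge_table = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v1", "v3"), ("v1", "v4")]

def table_to_graph(rows):
    """Build per-node sorted lists of in- and out-neighbors from edge rows."""
    out_nbrs = defaultdict(list)
    in_nbrs = defaultdict(list)
    for src, dst in rows:
        out_nbrs[src].append(dst)
        in_nbrs[dst].append(src)
    nodes = sorted(set(out_nbrs) | set(in_nbrs))
    return {n: {"in": sorted(in_nbrs[n]), "out": sorted(out_nbrs[n])} for n in nodes}

def graph_to_table(graph):
    """Emit one (Src, Dst) row per out-edge."""
    return [(src, dst) for src in sorted(graph) for dst in graph[src]["out"]]

graph = table_to_graph(edge_table)
```

Converting back with `graph_to_table(graph)` yields the same set of rows, which is the round trip an integrated table and graph system needs to support.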
§ Euclidean, Haversine, Jaccard distance
§ SimJoin, connect if data points are closer than some threshold
§ How to get around quadratic complexity
– Locality Sensitive Hashing
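As an illustration of how locality sensitive hashing can avoid comparing all pairs, here is a minimal sketch of a similarity join under Jaccard distance using MinHash signatures with banding. Everything in it is an assumption for the example (the name `sim_join`, integer items, the band and row counts); it is not the Ringo implementation. Only records sharing a bucket are compared exactly, so the result can miss some close pairs.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # modulus for the illustrative hash family

def make_hashes(num_hashes, seed=0):
    """Random (a, b) pairs defining hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash(items, hashes):
    """MinHash signature of a non-empty set of integers."""
    return tuple(min((a * x + b) % PRIME for x in items) for a, b in hashes)

def jaccard_distance(s, t):
    return 1.0 - len(s & t) / len(s | t)

def sim_join(sets, threshold, bands=8, rows=2, seed=0):
    """Return index pairs (i, j), i < j, whose Jaccard distance is <= threshold."""
    hashes = make_hashes(bands * rows, seed)
    buckets = defaultdict(list)
    for idx, s in enumerate(sets):
        sig = minhash(s, hashes)
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(idx)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))
    # Exact check only on candidate pairs, not on all pairs.
    return {(i, j) for i, j in candidates
            if jaccard_distance(sets[i], sets[j]) <= threshold}

records = [{1, 2, 3, 4}, {1, 2, 3, 4}, {100, 101, 102}]
edges = sim_join(records, threshold=0.3)
```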
§ Events selected per user, ordered by timestamps
§ NextK, connect K successors
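A NextK construction can be sketched as follows: group events by user, order each group by timestamp, and connect every event to its next K successors. The function name `next_k` and the `(event_id, user, timestamp)` row layout are assumptions for this example.

```python
from collections import defaultdict

def next_k(events, k):
    """events: iterable of (event_id, user, timestamp). Returns directed edges."""
    per_user = defaultdict(list)
    for event_id, user, ts in events:
        per_user[user].append((ts, event_id))
    edges = []
    for user in sorted(per_user):
        ordered = [eid for _, eid in sorted(per_user[user])]
        for i, src in enumerate(ordered):
            # Connect src to at most k events that follow it for the same user.
            for dst in ordered[i + 1:i + 1 + k]:
                edges.append((src, dst))
    return edges

events = [("e1", "u1", 10), ("e2", "u1", 30), ("e3", "u1", 20),
          ("e4", "u2", 5), ("e5", "u2", 7)]
edges = next_k(events, k=1)
```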
§ Partition users to groups
§ Identify interactions within each group
§ Compute a score for each group based on interactions
Graph containers: graphs, networks
Graph methods: generation, manipulation, analytics
Directed graphs in SNAP: nodes table (figure shows node ids 1, 3, 6, 4); each node has sorted vectors of in- and out-neighbors
Directed multigraphs in SNAP: nodes table (1, 3, 6, 4); each node has sorted vectors of in- and out-edges; edges table (figure values: 2 3 8 5 and 7 9 7 1)
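The directed-graph layout (a node table whose entries hold sorted neighbor vectors) can be approximated in Python as below. This is a sketch of the idea only: the class `DirectedGraph` and its methods are hypothetical names, a dict stands in for the node hash table, and sorted lists stand in for the vectors. Sorted vectors allow edge lookups by binary search.

```python
from bisect import bisect_left, insort

class DirectedGraph:
    def __init__(self):
        # node id -> (sorted in-neighbors, sorted out-neighbors)
        self.nodes = {}

    def add_node(self, nid):
        self.nodes.setdefault(nid, ([], []))

    @staticmethod
    def _contains(sorted_list, value):
        pos = bisect_left(sorted_list, value)
        return pos < len(sorted_list) and sorted_list[pos] == value

    def add_edge(self, src, dst):
        self.add_node(src)
        self.add_node(dst)
        out_nbrs = self.nodes[src][1]
        if not self._contains(out_nbrs, dst):  # simple graph: no parallel edges
            insort(out_nbrs, dst)
            insort(self.nodes[dst][0], src)

    def is_edge(self, src, dst):
        """Binary search in the sorted out-neighbor vector of src."""
        return src in self.nodes and self._contains(self.nodes[src][1], dst)

g = DirectedGraph()
for s, d in [(1, 3), (1, 6), (6, 4), (1, 4), (1, 3)]:
    g.add_edge(s, d)
```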
Dataset             LiveJournal    Twitter2010
Nodes               4.8M           42M
Edges               69M            1.5B
Text Size (disk)    1.1GB          26.2GB
Graph Size (RAM)    0.7GB          13.2GB
Table Size (RAM)    1.1GB          23.5GB
Algorithm (graph): PageRank (LiveJournal), PageRank (Twitter2010), Triangles (LiveJournal), Triangles (Twitter2010)
Giraph: 45.6s, 439.3s, N/A, N/A
GraphX: 56.0s
54.0s, 595.3s, 66.5s
27.5s, 251.7s, 5.4s, 706.8s
SNAP: 2.6s, 72.0s, 13.7s, 284.1s
Hardware: 4x Intel CPU, 64 cores, 1TB RAM, $35K
System        Hosts    CPUs/host    Host Configuration          Time
GraphChi      1        4            8x core AMD, 64GB RAM       158s
TurboGraph    1        1            6x core Intel, 12GB RAM     30s
Spark         50       2                                        97s
GraphX        16       1            8x core Intel, 68GB RAM     15s
PowerGraph    64       2            8x hyper Intel, 23GB RAM    3.6s
SNAP          1        4            20x hyper Intel, 1TB RAM    6.0s
Twitter2010, one iteration of PageRank
Algorithm     Time (s)    Implementation
In-degree     14          1 core
Out-degree    8           1 core
PageRank      115         64 cores
Triangles     107         64 cores
WCC           1,716       1 core
K-core        2,325       1 core
13.0 MEdges/s
18.0 MEdges/s
46.0 MEdges/s
50.4 MEdges/s
Hardware: 4x Intel CPU, 80 cores, 1TB RAM, $35K
575.0 MRows/s
917.7 MRows/s
109.5 MRows/s
348.8 MRows/s
Hardware: 4x Intel CPU, 80 cores, 1TB RAM, $35K
[Figure: multimodal network. Labels: Mode, Node, In-mode links, Cross-mode links]
§ Create subgraphs, dynamic graphs, …
Nodes in modes 0 to 9 are fully connected to each other
Each node in modes 0 to 9 is connected to all nodes in mode 10
10 … each and 100M edges each
[Figure: Mode 0, Mode 9, Mode 10]
[Figure: time (in seconds, log scale from 0.001 to 1000) by workload: SG(0,1), SG(0,1,4), SG(0 to 9), GNIds(0,1,3); series: x=1000, x=10000, x=100000, x=1000000]
Extract subgraph
For X=1M, graph has 10.1B edges
§ I.e., we want good memory locality
k(k+1)/2 bipartite graphs, each bipartite graph has its own node hash table
Nodes can be repeated across different graphs
Each node object in a node hash table maps to a list of in- and out-neighbors
k node hash tables
Each node object in a node hash table maps to k lists of in- and out-neighbors sorted by node-id
Nodes only appear in a single node hash table
k node hash tables
Nodes appear in a single node hash table
Each node object in a node hash table maps to a consolidated list of in- and out-neighbors sorted by (mode-id, node-id)
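The consolidated-list layout can be sketched as follows: one node hash table per mode, and for each node a single neighbor list kept sorted by (mode-id, node-id). The class `MultimodalGraph` and its method names are hypothetical, and only out-neighbors are stored to keep the example short. Because the list is sorted by mode first, all neighbors in a given mode occupy one contiguous slice that two binary searches can locate.

```python
from bisect import bisect_left

class MultimodalGraph:
    def __init__(self, num_modes):
        # One node hash table per mode: node id -> sorted [(mode, node id), ...]
        self.tables = [dict() for _ in range(num_modes)]

    def add_node(self, mode, nid):
        self.tables[mode].setdefault(nid, [])

    def add_edge(self, src_mode, src, dst_mode, dst):
        self.add_node(src_mode, src)
        self.add_node(dst_mode, dst)
        nbrs = self.tables[src_mode][src]
        key = (dst_mode, dst)
        pos = bisect_left(nbrs, key)
        if pos == len(nbrs) or nbrs[pos] != key:
            nbrs.insert(pos, key)  # keep the list sorted by (mode-id, node-id)

    def out_neighbors(self, mode, nid):
        """All adjacent nodes, across modes."""
        return list(self.tables[mode][nid])

    def out_neighbors_in_mode(self, mode, nid, target_mode):
        """Neighbors in one mode: a contiguous slice of the consolidated list."""
        nbrs = self.tables[mode][nid]
        lo = bisect_left(nbrs, (target_mode,))
        hi = bisect_left(nbrs, (target_mode + 1,))
        return [n for _, n in nbrs[lo:hi]]

g = MultimodalGraph(num_modes=3)
g.add_edge(0, 1, 1, 7)
g.add_edge(0, 1, 0, 2)
g.add_edge(0, 1, 1, 3)
g.add_edge(0, 1, 2, 9)
```

The contiguous per-mode slice is one way to get memory locality for both all-neighbor scans and per-mode scans from the same structure.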
[Figure: time (in seconds, log scale from 0.001 to 1000) by workload: SG(0,1), SG(0,1,4), SG(0 to 9), GNIds(0,1,3); series: Naive, BGC, MNCA, Hybrid]
3.5x order of magnitude improvement!
11 modes in total
§ 10k nodes in modes 0-9; edges between all nodes
§ 1M nodes in mode 10; edges between every node in mode 10 and all other nodes (total of 110B edges)
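One way to arrive at the stated total, assuming 10k nodes in each of modes 0-9, counting ordered node pairs, and ignoring the small self-pair correction (all three are assumptions, not stated on the slide):

\[
\underbrace{(10 \times 10^{4})^{2}}_{\text{within modes 0-9}} + \underbrace{10^{6} \times (10 \times 10^{4})}_{\text{mode 10 to modes 0-9}} = 10^{10} + 10^{11} = 1.1 \times 10^{11} \approx 110\text{B edges}
\]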
[Figure: BGC, Hybrid, and MNCA compared on per-mode NodeId lookups, all-adjacent NodeId accesses, per-mode adjacent NodeId accesses, and mode-pair SubGraph accesses]
[Figure: BGC, Hybrid, and MNCA by number of out-neighbors, from sparser graphs to denser graphs]
node2vec: Scalable Feature Learning for Networks
Raw Data → Structured Data → Learning Algorithm → Model → Downstream prediction task
Feature Engineering: automatically learn the features
$N_S(u)$ … neighbourhood of u obtained by
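\[ \Pr(N_S(u) \mid f(u)) = \prod_{n_i \in N_S(u)} \Pr(n_i \mid f(u)) \]
\[ \max_f \sum_{u \in V} \log \Pr(N_S(u) \mid f(u)) \]
\[ \Pr(n_i \mid f(u)) = \frac{\exp(f(n_i) \cdot f(u))}{\sum_{v \in V} \exp(f(v) \cdot f(u))} \]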
Estimate f(u) using stochastic gradient descent.
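To make the softmax term concrete, the sketch below evaluates Pr(n_i | f(u)) and the log-likelihood of a neighbourhood for a tiny, made-up embedding table. The node names, the 2-dimensional vectors in `f`, and the function names are assumptions for illustration; this evaluates the objective for fixed embeddings and does not perform the gradient descent itself.

```python
import math

def neighbor_probability(f, n_i, u):
    """Pr(n_i | f(u)) = exp(f(n_i) . f(u)) / sum over v in V of exp(f(v) . f(u))."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = {v: dot(f[v], f[u]) for v in f}
    shift = max(scores.values())  # subtract the max for numerical stability
    denom = sum(math.exp(s - shift) for s in scores.values())
    return math.exp(scores[n_i] - shift) / denom

def neighborhood_log_likelihood(f, u, neighborhood):
    """log Pr(N_S(u) | f(u)), using the product over neighbourhood members."""
    return sum(math.log(neighbor_probability(f, n, u)) for n in neighborhood)

# Hypothetical embeddings f: node -> feature vector
f = {"u": [1.0, 0.0], "a": [0.9, 0.1], "b": [-1.0, 0.0], "c": [0.0, 1.0]}
p_a = neighbor_probability(f, "a", "u")
p_b = neighbor_probability(f, "b", "u")
ll = neighborhood_log_likelihood(f, "u", ["a", "c"])
```

The denominator sums over every node in V, which is the part that becomes expensive on large graphs.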
[Figure: example graph around node u, with nodes s1 to s9]
Structural equivalence (structural roles)
Homophily (network communities)
[Figure: the walk is at node v; candidate next nodes are t, x1, x2, x3, with search biases α=1, α=1/q, α=1/q, α=1/p]
The walk just traversed (t, v) and aims to make a next step.
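The biased second-order walk can be sketched as below for an unweighted graph. The bias follows the usual node2vec rule: 1/p for returning to t, 1 for a node adjacent to t, and 1/q for a node farther from t. The toy graph wiring, the function names, and the use of `random.Random` are assumptions for this example, not the authors' implementation (which precomputes transition probabilities).

```python
import random

def search_bias(graph, t, x, p, q):
    """Unnormalized bias for stepping from v to x, having arrived from t."""
    if x == t:          # distance 0 from t: return to the previous node
        return 1.0 / p
    if x in graph[t]:   # distance 1 from t: stay in t's neighbourhood
        return 1.0
    return 1.0 / q      # distance 2 from t: move outward

def biased_walk(graph, start, length, p, q, rng):
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        nbrs = sorted(graph[v])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step has no previous node
            continue
        t = walk[-2]
        weights = [search_bias(graph, t, x, p, q) for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Hypothetical undirected toy graph using the figure's node labels.
graph = {
    "t": {"v", "x1"},
    "v": {"t", "x1", "x2", "x3"},
    "x1": {"t", "v"},
    "x2": {"v"},
    "x3": {"v"},
}
walk = biased_walk(graph, "t", length=10, p=2.0, q=0.5, rng=random.Random(0))
```

A small q favours outward moves and a small p favours returning to the previous node.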
§ Spectral embedding
§ DeepWalk [B. Perozzi et al., KDD ’14]
§ LINE [J. Tang et al., WWW ’15]
Algorithm / Dataset        BlogCatalog    PPI       Wikipedia
Spectral Clustering        0.0405         0.0681    0.0395
DeepWalk                   0.2110         0.1768    0.1274
LINE                       0.0784         0.1447    0.1164
node2vec                   0.2581         0.1791    0.1552
node2vec settings (p,q)    0.25, 0.25     4, 1      4, 0.5
Gain of node2vec [%]       22.3           1.3       21.8
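The gain row appears to be the relative improvement of node2vec over the best baseline on each dataset. That reading is an assumption, but it reproduces the printed percentages, as this short check using the scores from the table shows.

```python
# Scores copied from the results table above.
scores = {
    "BlogCatalog": {"Spectral Clustering": 0.0405, "DeepWalk": 0.2110,
                    "LINE": 0.0784, "node2vec": 0.2581},
    "PPI": {"Spectral Clustering": 0.0681, "DeepWalk": 0.1768,
            "LINE": 0.1447, "node2vec": 0.1791},
    "Wikipedia": {"Spectral Clustering": 0.0395, "DeepWalk": 0.1274,
                  "LINE": 0.1164, "node2vec": 0.1552},
}

def gain_percent(row):
    """Relative improvement of node2vec over the best baseline, in percent."""
    best_baseline = max(v for k, v in row.items() if k != "node2vec")
    return 100.0 * (row["node2vec"] - best_baseline) / best_baseline

gains = {dataset: round(gain_percent(row), 1) for dataset, row in scores.items()}
```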
[Figure, two panels: Macro-F1 score (0.00 to 0.20) versus fraction of missing edges (0.0 to 0.6), and Macro-F1 score (0.00 to 0.20) versus fraction of additional edges (0.0 to 0.6)]
Relational tables → graph construction → Graphs and networks → graph analytics
§ Papers:
§ SNAP: A General Purpose Network Analysis and Graph Mining Library.
§ Ringo: Interactive Graph Analytics on Big-Memory Machines. Y. Perez, R. Sosic, A. Banerjee, R. Puttagunta, … Shah, J. Leskovec. SIGMOD 2015.
§ node2vec: Scalable Feature Learning for Networks. A. Grover, J. Leskovec. KDD 2016.
§ Software:
§ http://snap.stanford.edu/ringo/
§ http://snap.stanford.edu/snappy
§ https://github.com/snap-stanford/snap