SLIDE 1
Processing Massive Sized Graphs using Sector/Sphere
Yunhong Gu, Li Lu: University of Illinois at Chicago Robert Grossman: University of Chicago and Open Data Group Andy Yoo: Lawrence Livermore National Laboratory
SLIDE 2 Background
Processing very large graphs (billions of vertices) is important
in many real-world applications (e.g., social networks)
Traditional systems are often complicated to use and/or
expensive to build
Processing graphs in a distributed setting requires shared data access or
complicated data movement
This paper investigates how to support large graph processing
with “cloud” style compute system
Data-centric model with a simplified API (e.g., MapReduce)
SLIDE 3
Overview
Sector/Sphere
In-Storage Data Processing Framework
Graph Breadth-First Search
Experimental Results
Conclusion
SLIDE 4 Sector/Sphere
Sector: distributed file system
Running on clusters of commodity computers
Software fault tolerance with replication
Application aware
Sphere: parallel data processing framework
In-storage processing
User-defined functions on data segments in parallel
Load balancing and fault tolerance
SLIDE 5 Parallel Data Processing Framework
Data Storage
Locality-aware distributed file system
Data Processing
MapReduce, user-defined functions
Data Exchanging
Hash, Reduce
[Diagram: input segments on disks are processed by UDFs in parallel; results flow to bucket writers, which write output segments back to disks]
SLIDE 6
Key Performance Factors
Input locality
Data is processed on the node where it resides, or on nearest
nodes
Output locality
Output data can be placed at locations such that data
movement in further processing is reduced
In-memory objects
Frequently accessed data may be stored in memory
SLIDE 7 Output Locality: An Example
Join two datasets: scan each one
independently, sending their results to common buckets
Merge the result buckets
[Diagram: UDF 1 instances scan DataSet 1 and UDF 2 instances scan DataSet 2; their outputs feed shared buckets, each processed by a UDF-Join instance]
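The join above can be sketched as follows. This is an illustrative single-machine sketch, not the Sphere API: each scan UDF hashes a record's join key to a bucket ID, so matching records from both datasets land in the same bucket, and the join UDF can then process each bucket independently with no further data movement. All function names here are assumptions for illustration.

```python
# Sketch of output-locality join: scan both datasets into shared hash
# buckets, then join each bucket independently (illustrative names).

def scan_udf(dataset, tag, n_buckets, buckets):
    """Scan one dataset; route each record to a bucket by its join key."""
    for key, value in dataset:
        buckets[hash(key) % n_buckets].append((tag, key, value))

def join_udf(bucket):
    """Join one bucket: match records tagged 1 with records tagged 2."""
    left = {}
    for tag, key, value in bucket:
        if tag == 1:
            left.setdefault(key, []).append(value)
    return [(key, lv, value)
            for tag, key, value in bucket if tag == 2
            for lv in left.get(key, [])]
```

Because both scans use the same hash function and bucket count, the join never needs to look outside its own bucket.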
SLIDE 8 Graph BFS
[Diagram: example graph showing a breadth-first search from vertex a to vertex b]
SLIDE 9
Data Segmentation
Adjacency list representation
Each segment contains approximately the same number of
edges
The edges belonging to the adjacency list of one vertex are
never split across two segments
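The segmentation rule above can be sketched as a greedy packing of whole adjacency lists. This is a minimal sketch with assumed names, not the actual Sector/Sphere implementation.

```python
# Sketch: pack adjacency lists into segments of roughly `target_edges`
# edges each, never splitting one vertex's list across segments.

def segment_adjacency_lists(adj, target_edges):
    segments = []
    current, count = {}, 0
    for vertex, neighbors in adj.items():
        # Close the current segment once it would exceed the edge budget,
        # but always keep a vertex's full adjacency list in one segment.
        if current and count + len(neighbors) > target_edges:
            segments.append(current)
            current, count = {}, 0
        current[vertex] = neighbors
        count += len(neighbors)
    if current:
        segments.append(current)
    return segments
```

Segments end up approximately balanced by edge count, which balances UDF work, while keeping each vertex's neighbors co-located.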
SLIDE 10 Sphere UDF for Graph BFS
Basic idea: scan each data segment to find the neighbors of
the current level's vertices; the next level is the union
of the neighbors of all vertices in the current level. Repeat
this until the destination is found.
Sphere UDF for unidirectional BFS
Input: graph data segment x and current level segment l_x. If a
vertex appears in level segment l_x, then it must exist in graph data segment x
For each vertex in level segment l_x, find its neighbor vertices
in data segment x, and label each neighbor vertex with a bucket ID so that it satisfies the above relationship
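The loop above can be sketched as follows. In Sphere, each segment's scan runs as a UDF in parallel and neighbors are routed to buckets co-located with their graph segments; this single-machine sketch iterates over the segments sequentially instead, and all names are illustrative rather than the Sphere API.

```python
# Minimal sketch of level-synchronous BFS over graph segments.

def bfs_udf(segment, level):
    """Scan one graph segment: for every current-level vertex whose
    adjacency list lives in this segment, emit its neighbors."""
    next_level = set()
    for v in level:
        if v in segment:              # level vertices are co-located
            next_level.update(segment[v])
    return next_level

def bfs(segments, source, dest):
    """Repeat the scan until dest is found; return the path length."""
    visited = {source}
    level, depth = {source}, 0
    while level:
        if dest in level:
            return depth
        # The union of per-segment UDF outputs forms the next level.
        nxt = set()
        for seg in segments:
            nxt |= bfs_udf(seg, level)
        level = nxt - visited
        visited |= level
        depth += 1
    return None
```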
SLIDE 11
Experiments Setup
Data
PubMed: 28M vertices, 542M edges, 6 GB data
PubMedEx: 5B vertices, 118B edges, 1.3 TB data
Testbed
Open Cloud Testbed: JHU, UIC, StarLight, Calit2
4 racks, 120 nodes, 10 GbE interconnection
SLIDE 12
Average Time Cost (seconds) on PubMed using 20 Servers
Length  Count  Percent  Avg Time (Uni-BFS)  Avg Time (Bi-BFS)
2       28     10.8     21                  25
3       85     32.7     26                  29
4       88     33.8     38                  33
5       34     13.1     70                  42
6       13     5.0      69                  42
7       7      2.7      88                  51
8       5      1.9      84                  54
Total   260             40                  33
SLIDE 13 Performance Impact of Various Components in Sphere
Component change                                    Time cost change
Without in-memory objects                           117%
Without bucket location                             146%
With bucket combiner                                106%
With bucket fault tolerance                         110%
Data segmentation by the same number of vertices    118%
SLIDE 14
The Average Time Cost (seconds) on PubMedEx using 60 Servers
Length  Count  Percent  Avg Time
2       11     4.2      56
3       1      0.4      82
4       60     23.2     79
5       141    54.2     197
6       45     17.3     144
7       2      0.7      201
Total   260             156
SLIDE 15
The Average Time Cost (seconds) on PubMedEx on 19, 41, 61, 83 Servers
Path length group:  3    4    5    6    7
Count:              1    24   58   16   1

Servers#  3    4    5    6    7    AVG
19        112  257  274  327  152  275
41        153  174  174  165  280  174
59        184  150  157  140  124  153
83        214  145  146  138  192  147
SLIDE 16 Conclusion
We can process very large graphs with the “cloud”
compute model such as Sphere
Performance is comparable to traditional systems, while
requiring only modest development effort (less than 1,000 lines of code)
A BFS-type query can be done in a few minutes
Future work: concurrent queries