SLIDE 1

Processing Massive Sized Graphs using Sector/Sphere

Yunhong Gu, Li Lu: University of Illinois at Chicago
Robert Grossman: University of Chicago and Open Data Group
Andy Yoo: Lawrence Livermore National Laboratory

SLIDE 2

Background

• Very large graph (billions of vertices) processing is important in many real-world applications (e.g., social networks)
• Traditional systems are often complicated to use and/or expensive to build
• Processing graphs in a distributed setting requires shared data access or complicated data movement
• This paper investigates how to support large graph processing with a "cloud"-style compute system
  • Data-centric model, simplified API (e.g., MapReduce)

SLIDE 3

Overview

• Sector/Sphere
• In-Storage Data Processing Framework
• Graph Breadth-First Search
• Experimental Results
• Conclusion

SLIDE 4

Sector/Sphere

• Sector: distributed file system
  • Running on clusters of commodity computers
  • Software fault tolerance with replication
  • Topology aware
  • Application aware
• Sphere: parallel data processing framework
  • In-storage processing
  • User-defined functions (UDFs) applied to data segments in parallel (see the sketch below)
  • Load balancing and fault tolerance
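To make the UDF model concrete, here is a minimal C++ sketch of what a Sphere-style UDF looks like: a function applied to one locally stored data segment that emits (bucket ID, record) pairs for the next stage. The Record, Segment, and Emitter types are illustrative assumptions for this sketch, not the actual Sector/Sphere API.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-ins for Sphere's segment and output types
// (assumed for this sketch; not the real Sector/Sphere interfaces).
struct Record { std::string data; };
struct Segment { std::vector<Record> records; };

struct Emitter {
    // Each emitted record carries a bucket ID; records with the same ID are
    // collected by the same bucket writer and form one input of the next stage.
    std::vector<std::pair<int, Record>> out;
    void emit(int bucket, const Record& r) { out.push_back({bucket, r}); }
};

// A Sphere-style UDF: it runs on the node that stores the segment, scans the
// segment once, and routes each result record to a bucket.
void hash_partition_udf(const Segment& seg, Emitter& out, int num_buckets) {
    std::hash<std::string> h;
    for (const Record& r : seg.records) {
        int bucket = static_cast<int>(h(r.data) % num_buckets);
        out.emit(bucket, r);
    }
}
```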

SLIDE 5

Parallel Data Processing Framework

• Data Storage
  • Locality-aware distributed file system
• Data Processing
  • MapReduce
  • User-defined functions
• Data Exchanging
  • Hash
  • Reduce

[Slide diagram: input segments (Seg. x, Seg. y, Seg. z) stored on local disks are processed by UDFs, routed through bucket writers, and written as output segments (Seg. 1 through Seg. n).]
SLIDE 6

Key Performance Factors

• Input locality
  • Data is processed on the node where it resides, or on a nearby node
• Output locality
  • Output data can be placed at locations chosen so that data movement is reduced in further processing
• In-memory objects
  • Frequently accessed data may be stored in memory

SLIDE 7

Output Locality: An Example

• Join two datasets
  • Scan each dataset independently and put their results together (a sketch follows the diagram)
  • Merge the result buckets

[Slide diagram: DataSet 1 is scanned by UDF 1 instances and DataSet 2 by UDF 2 instances; their outputs are merged by UDF-Join instances.]
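A minimal sketch of this pattern, with assumed types and helpers (Rec, bucket_of, join_bucket are illustrative, not the paper's code): both scan UDFs hash the join key with the same hash function and bucket count, so matching records from the two datasets land in the same bucket, and the join UDF only merges data that is already co-located.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed record type for this sketch: a join key plus a payload.
struct Rec { std::string key; std::string value; };

// Scan step: each dataset routes a record to a bucket by hashing its join key.
// Using the same hash and bucket count for both datasets means records that
// share a key always land in the same bucket (this is the output locality).
int bucket_of(const std::string& key, int num_buckets) {
    return static_cast<int>(std::hash<std::string>{}(key) % num_buckets);
}

// Join step: runs once per bucket, entirely on already co-located data.
void join_bucket(const std::vector<Rec>& from_dataset1,
                 const std::vector<Rec>& from_dataset2) {
    std::unordered_map<std::string, std::vector<std::string>> left;
    for (const Rec& r : from_dataset1) left[r.key].push_back(r.value);
    for (const Rec& r : from_dataset2) {
        auto it = left.find(r.key);
        if (it == left.end()) continue;
        for (const std::string& v : it->second)
            std::cout << r.key << ": " << v << " | " << r.value << "\n";
    }
}
```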

SLIDE 8

Graph BFS

[Slide diagram: example graph illustrating breadth-first search between vertices a and b.]

SLIDE 9

Data Segmentation

• The graph is stored as an adjacency list
• Each segment contains approximately the same number of edges
• Edges belonging to the adjacency list of one vertex will not be split into two segments (see the sketch below)
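The segmentation rule can be sketched as follows. This is a simplified single-node illustration with assumed types (VertexEdges, segment_graph), not the Sector implementation: it walks the adjacency list vertex by vertex, closes a segment once it reaches the target edge count, and only ever cuts at a vertex boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Adjacency list entry for one vertex: the vertex ID and its out-edges.
struct VertexEdges {
    std::uint64_t vertex;
    std::vector<std::uint64_t> neighbors;
};

// One data segment holds whole adjacency lists only, roughly target_edges edges.
using Segment = std::vector<VertexEdges>;

std::vector<Segment> segment_graph(const std::vector<VertexEdges>& adj,
                                   std::size_t target_edges) {
    std::vector<Segment> segments(1);
    std::size_t edges_in_current = 0;
    for (const VertexEdges& v : adj) {
        // Close the current segment once it holds enough edges, but only at a
        // vertex boundary, so one vertex's adjacency list is never split.
        if (edges_in_current >= target_edges && !segments.back().empty()) {
            segments.emplace_back();
            edges_in_current = 0;
        }
        segments.back().push_back(v);
        edges_in_current += v.neighbors.size();
    }
    return segments;
}
```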

SLIDE 10

Sphere UDF for Graph BFS

• Basic idea: scan each data segment, find the neighbors of the current level, and generate the next level, which is the union of the neighbors of all vertices in the current level. Repeat until the destination is found.
• Sphere UDF for unidirectional BFS
  • Input: graph data segment x and current-level segment l_x. If a vertex appears in level segment l_x, then it must exist in graph data segment x
  • For each vertex in level segment l_x, find its neighbor vertices in data segment x and label each neighbor vertex with a bucket ID so that the above relationship is preserved (a sketch of one round follows)
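A compact sketch of one BFS round in this style; the types GraphSegment and OwnerOf and the function bfs_round are assumptions for illustration, not the paper's code. The UDF expands the current-level vertices that live in segment x and labels each discovered neighbor with the bucket ID of the segment that owns it, preserving the invariant that a level vertex always accompanies the graph segment storing it.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Assumed layout for this sketch: graph segment x maps each vertex it owns to
// that vertex's neighbor list.
using GraphSegment =
    std::unordered_map<std::uint64_t, std::vector<std::uint64_t>>;

// Maps any vertex to the bucket ID of the graph segment that owns it
// (in Sphere this follows from the segmentation step; here it is a table).
using OwnerOf = std::unordered_map<std::uint64_t, int>;

// One round of the unidirectional BFS UDF over segment x: expand the vertices
// of the current level that live in x, and route every discovered neighbor to
// the bucket of the segment that owns it, so the next round again pairs each
// level vertex with the graph segment that stores it.
std::unordered_map<int, std::unordered_set<std::uint64_t>>
bfs_round(const GraphSegment& x,
          const std::vector<std::uint64_t>& current_level_in_x,
          const OwnerOf& owner) {
    std::unordered_map<int, std::unordered_set<std::uint64_t>> next_level;
    for (std::uint64_t v : current_level_in_x) {
        auto it = x.find(v);          // v is guaranteed to be stored in x
        if (it == x.end()) continue;
        for (std::uint64_t nbr : it->second)
            next_level[owner.at(nbr)].insert(nbr);
    }
    return next_level;
}
```

A driver would repeat such rounds, keeping a visited set and stopping as soon as the destination vertex appears in some bucket.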

SLIDE 11

Experiments Setup

• Data
  • PubMed: 28M vertices, 542M edges, 6 GB of data
  • PubMedEx: 5B vertices, 118B edges, 1.3 TB of data
• Testbed
  • Open Cloud Testbed: JHU, UIC, StarLight, Calit2
  • 4 racks, 120 nodes, 10GE interconnection

SLIDE 12

Average Time Cost (seconds) on PubMed using 20 Servers

Length   Count   Percent   Avg Time Uni-BFS   Avg Time Bi-BFS
2        28      10.8      21                 25
3        85      32.7      26                 29
4        88      33.8      38                 33
5        34      13.1      70                 42
6        13      5.0       69                 42
7        7       2.7       88                 51
8        5       1.9       84                 54
Total    260               40 (avg)           33 (avg)

SLIDE 13

Performance Impact of Various Components in Sphere

Component Change                                    Time Cost Change
Without in-memory objects                           117%
Without bucket location optimization                146%
With bucket combiner                                106%
With bucket fault tolerance                         110%
Data segmentation by the same number of vertices    118%

SLIDE 14

The Average Time Cost (seconds) on PubMedEx using 60 Servers

Length   Count   Percent   Avg Time
2        11      4.2       56
3        1       0.4       82
4        60      23.2      79
5        141     54.2      197
6        45      17.3      144
7        2       0.7       201
Total    260               156 (avg)

SLIDE 15

The Average Time Cost (seconds) on PubMedEx on 19, 41, 61, 83 Servers

Group         3     4     5     6     7     AVG
Count         1     24    58    16    1
19 servers    112   257   274   327   152   275
41 servers    153   174   174   165   280   174
59 servers    184   150   157   140   124   153
83 servers    214   145   146   138   192   147

SLIDE 16

Conclusion

• We can process very large graphs with a "cloud" compute model such as Sphere
• Performance is comparable to traditional systems, while requiring only a simple development effort (less than 1,000 lines of code for BFS)
• A BFS-type query can be answered in a few minutes
• Future work: concurrent queries