X-Stream: Edge-centric Graph Processing using Streaming Partitions


slide-1
SLIDE 1

X-Stream: Edge-centric Graph Processing using Streaming Partitions

AMITABHA ROY, IVO MIHAILOVIC, WILLY ZWAENEPOEL PRESENTED BY: MAREK STRELEC

slide-2
SLIDE 2

Motivation

- Large graphs: billions of vertices and edges
- Process on large clusters
  - Pregel, GraphLab, PowerGraph, Naiad
  - Complexity and cost
- Process on a single machine
  - GraphChi, X-Stream
  - 64 GB RAM, 32 cores, 2 x 200 GB SSD, 3 x 3 TB drive

slide-3
SLIDE 3

Vertex-centric processing model

- “Think like a vertex”
- Popularized by the Pregel and GraphLab projects
- Mutable state stored in vertices
- Scatter-Gather model
  - Scatter updates along outgoing edges
  - Gather updates from incoming edges
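
The scatter-gather loop above can be sketched for BFS as follows. This is a minimal illustration, not code from Pregel or GraphLab: the function name `vertex_centric_bfs`, the `out_edges` adjacency dictionary, and the `inbox` structure are all assumptions made for the example.

```python
from collections import defaultdict

def vertex_centric_bfs(out_edges, num_vertices, root):
    """Compute BFS levels with a vertex-centric scatter-gather loop (illustrative)."""
    INF = float("inf")
    level = [INF] * num_vertices      # mutable state stored per vertex
    level[root] = 0
    frontier = [root]
    while frontier:
        # Scatter: every active vertex sends an update along its outgoing edges.
        # Looking up out_edges[v] is an index lookup, i.e. random access.
        inbox = defaultdict(list)
        for v in frontier:
            for dst in out_edges.get(v, []):
                inbox[dst].append(level[v] + 1)
        # Gather: every vertex folds in the updates that arrived on incoming edges.
        frontier = []
        for dst, updates in inbox.items():
            best = min(updates)
            if best < level[dst]:
                level[dst] = best
                frontier.append(dst)
    return level

# Hypothetical 5-vertex graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4
out_edges = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
levels = vertex_centric_bfs(out_edges, 5, 0)
```

Note that the loop is driven by the set of active vertices, so the edges it touches are scattered across the adjacency structure.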

slide-4
SLIDE 4

Vertex-centric BFS

slide-5
SLIDE 5

Vertex-centric BFS

slide-6
SLIDE 6

Vertex-centric BFS

slide-7
SLIDE 7

Vertex-centric BFS

slide-8
SLIDE 8

Sequential vs. Random access

- Graph traversal = random access
- For all storage media (RAM, SSD, and HDD)
  - Sequential bandwidth >> random access bandwidth
  - HDD: 300x higher
  - SSD: 30x higher
  - RAM (1 core): 4.6x higher
  - RAM (16 cores): 1.8x higher

slide-9
SLIDE 9

X-Stream processing model: Edge-centric

- Input to X-Stream is an unordered set of directed edges
  - For undirected graphs: a pair of directed edges
- Scatter and Gather phases iterate over edges rather than vertices
- X-Stream makes graph access sequential
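
As a rough sketch of the edge-centric model, BFS can be written so that each iteration streams the entire, unordered edge list instead of following per-vertex adjacency. The function name `edge_centric_bfs`, the in-memory `updates` list, and the `passes` counter are assumptions for illustration and are not taken from the X-Stream code.

```python
def edge_centric_bfs(edges, num_vertices, root):
    """Compute BFS levels by streaming an unordered edge list (illustrative)."""
    INF = float("inf")
    level = [INF] * num_vertices
    level[root] = 0
    active = [False] * num_vertices
    active[root] = True
    passes = 0                         # number of full scans of the edge list
    while any(active):
        passes += 1
        # Scatter: scan ALL edges sequentially; emit an update for each edge
        # whose source vertex changed in the previous iteration.
        updates = []
        for src, dst in edges:
            if active[src]:
                updates.append((dst, level[src] + 1))
        # Gather: scan the update list sequentially and apply it to vertex state.
        active = [False] * num_vertices
        for dst, new_level in updates:
            if new_level < level[dst]:
                level[dst] = new_level
                active[dst] = True
    return level, passes

# Same hypothetical graph as an edge list in arbitrary order.
edges = [(3, 4), (0, 2), (2, 3), (0, 1), (1, 3)]
levels, passes = edge_centric_bfs(edges, 5, 0)
```

The edge and update lists are only ever read front to back; the remaining random access is the lookup into `level` and `active`, which is the vertex state.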

slide-10
SLIDE 10

Edge-centric BFS

slide-11
SLIDE 11

Edge-centric BFS

slide-12
SLIDE 12

Edge-centric BFS

slide-13
SLIDE 13

Edge-centric BFS

slide-14
SLIDE 14

Edge-centric properties

- Many sequential scans of the edge list
- The order of edges is irrelevant
- Tradeoff
  - Sequential access is faster
  - More Scatter/Gather iterations
- The number of iterations might be fewer if the edge set >> vertex set
- Problem: still have random access to vertex set

slide-15
SLIDE 15

Streaming partitions

- Partition the graph into streaming partitions
  - vertex set: a subset of vertices that fit into RAM
  - edge list: all edges whose source vertex is in the partition’s vertex set
  - update list: all updates whose destination vertex is in the partition’s vertex set
- Streaming partitions can be processed in parallel
- Vertices (random access) => fast storage, Edges (sequential access) => slow storage
- The number of partitions is crucial for performance
- Shuffle phase: updates must be re-arranged after the scatter phase
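
The partition layout and the shuffle step can be sketched as below. This is a simplified, single-threaded illustration under stated assumptions: vertices are split into `k` contiguous ranges, everything lives in Python lists rather than on disk, and the names `make_partitions` and `partitioned_bfs` are hypothetical.

```python
def make_partitions(edges, num_vertices, k):
    """Assign each edge to the partition that owns its SOURCE vertex."""
    size = -(-num_vertices // k)          # ceiling division: vertices per partition
    def part_of(v):
        return v // size
    edge_lists = [[] for _ in range(k)]
    for src, dst in edges:
        edge_lists[part_of(src)].append((src, dst))
    return edge_lists, part_of

def partitioned_bfs(edges, num_vertices, root, k):
    """BFS over streaming partitions: scatter, shuffle, gather (illustrative)."""
    INF = float("inf")
    level = [INF] * num_vertices
    level[root] = 0
    active = [False] * num_vertices
    active[root] = True
    edge_lists, part_of = make_partitions(edges, num_vertices, k)
    while any(active):
        # Scatter: stream each partition's edge list; only that partition's
        # vertex set is read, because every source vertex belongs to it.
        scattered = []
        for p in range(k):
            for src, dst in edge_lists[p]:
                if active[src]:
                    scattered.append((dst, level[src] + 1))
        # Shuffle: re-arrange updates by the partition of their DESTINATION vertex.
        update_lists = [[] for _ in range(k)]
        for dst, value in scattered:
            update_lists[part_of(dst)].append((dst, value))
        # Gather: stream each partition's update list into its own vertex set.
        active = [False] * num_vertices
        for p in range(k):
            for dst, value in update_lists[p]:
                if value < level[dst]:
                    level[dst] = value
                    active[dst] = True
    return level

edges = [(3, 4), (0, 2), (2, 3), (0, 1), (1, 3)]
edge_lists, _ = make_partitions(edges, 5, 2)
levels = partitioned_bfs(edges, 5, 0, 2)
```

In this toy run partition 0 receives four edges and partition 1 receives one, which shows how partitions of equal vertex count can still carry unequal edge counts.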

slide-16
SLIDE 16

Scalability

Traversal algorithms: BFS, WCC
Multiplication algorithms: PageRank, SpMV

- Increasing thread count
- Increasing number of I/O devices
- Across devices

slide-17
SLIDE 17

Comparison with Other Systems: Ligra

- Ligra
  - In-memory graph processing system
  - Requires pre-processing

slide-18
SLIDE 18

Comparison with Other Systems: GraphChi

- GraphChi
  - Traditional vertex-centric approach
  - Out-of-core data structure, parallel sliding windows, to reduce the amount of random access to disk
  - Needs time to pre-sort the graph into shards

slide-19
SLIDE 19

Criticism

- Assumes that the number of edges is larger than the number of vertices
- Performs well only on graphs with a low diameter
- Workload imbalance, as the partitions can have different numbers of edges assigned to them
  - Is work stealing sufficient?

slide-20
SLIDE 20

Thank you!