X-Stream: Edge-centric Graph Processing using Streaming Partitions

  1. X-Stream: Edge-centric Graph Processing using Streaming Partitions
     - Authors: Amitabha Roy, Ivo Mihailovic, Willy Zwaenepoel
     - Presented by: Marek Strelec

  2. Motivation
     - Large graphs: billions of vertices and edges
     - Process on large clusters
       - Pregel, GraphLab, PowerGraph, Naiad
       - Complexity and cost
     - Process on a single machine
       - GraphChi, X-Stream
       - 64 GB RAM, 32 cores, 2 x 200 GB SSD, 3 x 3 TB drive

  3. Vertex-centric processing model
     - "Think like a vertex"
     - Popularized by the Pregel and GraphLab projects
     - Mutable state stored in vertices
     - Scatter-Gather model
       - Scatter updates along outgoing edges
       - Gather updates from incoming edges

  4. Vertex-centric BFS

  5. Vertex-centric BFS

  6. Vertex-centric BFS

  7. Vertex-centric BFS
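
The figures for the vertex-centric BFS slides did not survive in this transcript. As a rough stand-in, the scatter-gather loop they walk through might look like the sketch below. The function name `vertex_centric_bfs`, the `(src, dst)` tuple edge representation, and the sample graph are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


def vertex_centric_bfs(edges, root):
    """Scatter-gather BFS driven by vertices; returns {vertex: parent}."""
    # Per-vertex index of outgoing edges (an assumed in-memory layout).
    # Looking up a vertex's edges here is the random access the talk
    # identifies as the cost of the vertex-centric model.
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    parent = {root: root}  # mutable state stored per vertex
    frontier = [root]      # vertices whose state changed last iteration
    while frontier:
        # Scatter: each active vertex sends an update along its outgoing edges.
        updates = [(dst, src) for src in frontier for dst in out_edges[src]]
        # Gather: each destination vertex applies the updates it received.
        frontier = []
        for dst, src in updates:
            if dst not in parent:
                parent[dst] = src
                frontier.append(dst)
    return parent


sample_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
print(vertex_centric_bfs(sample_edges, 1))
```

Only the edges of frontier vertices are touched in each iteration, which keeps the work small but makes the access pattern depend on the graph structure.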

  8. Sequential vs. Random access
     - Graph traversal = random access
     - For all storage media (RAM, SSD, and HDD): sequential bandwidth >> random access bandwidth
       - HDD: 300x higher
       - SSD: 30x higher
       - RAM (1 core): 4.6x higher
       - RAM (16 cores): 1.8x higher

  9. X-Stream processing model: Edge-centric
     - Input to X-Stream is an unordered set of directed edges
     - For undirected graphs: a pair of directed edges
     - Scatter and Gather phases iterate over edges, not vertices
     - X-Stream makes graph access sequential

  10. Edge-centric BFS

  11. Edge-centric BFS

  12. Edge-centric BFS

  13. Edge-centric BFS
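
The edge-centric BFS figures are likewise missing from this transcript. A minimal sketch of the idea, assuming the same `(src, dst)` tuple edges as a plain Python list, is given below. The name `edge_centric_bfs` and the returned pass counter are hypothetical additions for illustration.

```python
def edge_centric_bfs(edges, root):
    """Edge-centric BFS; returns ({vertex: parent}, number of edge-list scans)."""
    parent = {root: root}
    active = {root}  # vertices whose state changed in the previous pass
    passes = 0
    while active:
        passes += 1
        # Scatter: stream the entire edge list in stored order and emit an
        # update for every edge whose source vertex is active.
        updates = [(dst, src) for src, dst in edges if src in active]
        # Gather: stream the update list and apply each update to its
        # destination vertex.
        active = set()
        for dst, src in updates:
            if dst not in parent:
                parent[dst] = src
                active.add(dst)
    return parent, passes


sample_edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
tree, scans = edge_centric_bfs(sample_edges, 1)
print(tree, scans)
```

Every pass reads the whole edge list, including edges whose source is inactive, so the sketch does more work per iteration than the vertex-centric one. In exchange, the edge list is only ever read front to back, and reordering it changes neither the set of reached vertices nor the number of passes.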

  14. Edge-centric properties
     - Many sequential scans of the edge list
     - The order of edges is irrelevant
     - Tradeoff
       - Sequential access is faster
       - More Scatter/Gather iterations
       - The number of iterations might be fewer if the edge set >> vertex set
     - Problem: still have random access to the vertex set

  15. Streaming partitions
     - Partition the graph into streaming partitions
       - Vertex set: a subset of vertices that fits into RAM
       - Edge list: all edges whose source vertex is in the partition's vertex set
       - Update list: all updates whose destination vertex is in the partition's vertex set
     - Streaming partitions can be processed in parallel
     - Vertices (random access) => fast storage, edges (sequential access) => slow storage
     - The number of partitions is crucial for performance
     - Shuffle phase: updates must be re-arranged after the scatter phase
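
To make the three parts of a streaming partition concrete, here is a small in-memory sketch of BFS over partitions. It assumes vertices are numbered `0..num_vertices-1` and split into `k` contiguous ranges. That partitioning rule, the helper names, and the way the shuffle is folded into the scatter loop are all simplifying assumptions for illustration, not a description of the X-Stream implementation.

```python
def make_partitions(edges, num_vertices, k):
    """Assign each edge to the partition that owns its source vertex."""
    size = -(-num_vertices // k)  # ceiling division: vertices per partition

    def part_of(v):
        return v // size  # assumed rule: contiguous vertex ranges

    edge_lists = [[] for _ in range(k)]
    for src, dst in edges:
        edge_lists[part_of(src)].append((src, dst))
    return part_of, edge_lists


def partitioned_bfs(edges, num_vertices, root, k):
    """BFS over streaming partitions; returns a parent list indexed by vertex."""
    part_of, edge_lists = make_partitions(edges, num_vertices, k)
    parent = [None] * num_vertices
    parent[root] = root
    active = {root}
    while active:
        # Scatter: stream each partition's edge list. Only source vertices
        # are inspected, and they all belong to that partition's vertex set.
        # Shuffle (folded in here): each update is appended to the update
        # list of the partition that owns its destination vertex.
        update_lists = [[] for _ in range(k)]
        for p in range(k):
            for src, dst in edge_lists[p]:
                if src in active:
                    update_lists[part_of(dst)].append((dst, src))
        # Gather: stream each partition's update list. Only destination
        # vertices are written, again all inside that partition's vertex set.
        active = set()
        for p in range(k):
            for dst, src in update_lists[p]:
                if parent[dst] is None:
                    parent[dst] = src
                    active.add(dst)
    return parent


sample_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(partitioned_bfs(sample_edges, num_vertices=5, root=0, k=2))
```

Within one partition, both loops touch vertex state only for that partition's vertex set, which is why the vertex set is the part that needs to fit in fast storage while edge and update lists can be streamed from slower storage.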

  16. Scalability
     - Increasing thread count
     - Increasing number of I/O devices
     - Across devices
     - Traversal algorithms: BFS, WCC
     - Multiplication algorithms: PageRank, SpMV

  17. Comparison with Other Systems: Ligra
     - In-memory graph processing system
     - Requires pre-processing

  18. Comparison with Other Systems: GraphChi
     - Traditional vertex-centric approach
     - Out-of-core data structure, parallel sliding windows, to reduce the amount of random access to disk
     - Needs time to pre-sort the graph into shards

  19. Criticism
     - Assumes that the number of edges is larger than the number of vertices
     - Performs well only on graphs with a low diameter
     - Workload imbalance, as the partitions can have different numbers of edges assigned to them
     - Is work stealing sufficient?

  20. Thank you!
