GraFBoost: Using accelerated flash storage for external graph analytics

  1. GraFBoost: Using accelerated flash storage for external graph analytics. Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind, MIT CSAIL. Funded by: [sponsor logos omitted]

  2. Large Graphs are Found Everywhere in Nature. Examples: the human neural network, the structure of the internet, social networks. TB to 100s of TB in size! Image sources: 1) Connectomics graph of the brain: “Connectomics and Graph Theory”, Jon Clayden, UCL; 2) (Part of) the internet: Criteo Engineering Blog; 3) The graph of a social network: Griff’s Graphs.

  3. Storage for Graph Analytics. Graphs are extremely sparse, irregular in structure, and terabytes in size. TB of DRAM: $$$ ($8000/TB, 200W). Our goal: $ ($500/TB, 10W).

  4. Random Access Challenge in Flash Storage. Bandwidth: flash 3.6 GB/s, DRAM ~100 GB/s. Access granularity: flash 8192 bytes, DRAM 128 bytes. Random access wastes performance by not using most of each fetched page: using 8 bytes of an 8192-byte page uses 1/1024 of the bandwidth!
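The 1/1024 figure follows directly from the page size. A minimal Python sketch of the arithmetic, using the bandwidth and granularity numbers from this slide; the function names are illustrative, and treating 1 GB as 1000 MB is an assumption:

```python
# Numbers taken from the slide.
FLASH_BANDWIDTH_GB_S = 3.6   # flash bandwidth
PAGE_BYTES = 8192            # flash access granularity
USEFUL_BYTES = 8             # bytes actually needed per random access


def useful_fraction(useful_bytes: int, page_bytes: int) -> float:
    """Fraction of each fetched page that the application actually uses."""
    return useful_bytes / page_bytes


def effective_bandwidth_mb_s(bandwidth_gb_s: float, fraction: float) -> float:
    """Useful bandwidth in MB/s (assumes 1 GB = 1000 MB)."""
    return bandwidth_gb_s * 1000 * fraction


fraction = useful_fraction(USEFUL_BYTES, PAGE_BYTES)
effective = effective_bandwidth_mb_s(FLASH_BANDWIDTH_GB_S, fraction)
print(f"useful fraction of each page: 1/{PAGE_BYTES // USEFUL_BYTES}")
print(f"effective bandwidth: {effective:.2f} MB/s")
```

Under these assumptions, fine-grained random reads leave only about 3.5 MB/s of useful bandwidth out of 3.6 GB/s.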

  5. Two Pillars for Flash in GraFBoost. Flash storage rests on two pillars: (1) the Sort-Reduce Algorithm, which sequentializes random access; (2) Hardware Acceleration, which reduces the overhead of sort-reduce in system resources, power consumption, and performance.

  6. Vertex-Centric Programming Model. Popular model for efficient parallelism and distribution. A “vertex program” only sees neighbors. The algorithm is executed in terms of disjoint iterations. The vertex program is executed on one or more “Active Vertices”. [Figure: example graph with vertex values and edge weights, active vertices marked.]

  7. Algorithmic Representation of a Vertex Program Iteration. for each v_src in ActiveList do: for each e(v_src, v_dst) in G do: ev = edge_program(v_src.val, e.weight); v_dst.next_val = vertex_update(v_dst.next_val, ev); end for; end for. The vertex_update step is a random read-modify update into vertex data. The sort-reduce algorithm solves this issue.
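One iteration of the vertex program on this slide can be sketched in Python. This is an illustrative sketch, not the authors' code: the adjacency-dict graph layout, the `run_iteration` helper, the rule for choosing the next active list, and the shortest-path choice of `edge_program` and `vertex_update` are all assumptions made for the example.

```python
import math


def run_iteration(graph, values, active, edge_program, vertex_update):
    """Run one vertex-program iteration over the active vertices.

    graph:  dict mapping v_src -> list of (v_dst, weight) edges (assumed layout)
    values: dict mapping vertex -> current value
    """
    next_values = dict(values)
    for v_src in active:
        for v_dst, weight in graph.get(v_src, []):
            ev = edge_program(values[v_src], weight)
            # Random read-modify-update into vertex data.
            next_values[v_dst] = vertex_update(next_values[v_dst], ev)
    # Assumption: a vertex becomes active when its value changed.
    next_active = [v for v in values if next_values[v] != values[v]]
    return next_values, next_active


# Hypothetical example: single-source shortest paths from vertex 0.
graph = {0: [(1, 5), (2, 2)], 1: [], 2: [(1, 1)]}
values = {0: 0, 1: math.inf, 2: math.inf}
active = [0]

while active:
    values, active = run_iteration(
        graph, values, active,
        edge_program=lambda src_val, weight: src_val + weight,
        vertex_update=min,
    )

print(values)
```

The write to `next_values[v_dst]` lands on an arbitrary vertex for each edge, which is the random access pattern that is expensive on flash.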

  8. General Problem of Irregular Array Updates. Updating an array x with a stream of update requests xs and an update function f: for each <idx, arg> in xs: x[idx] = f(x[idx], arg). [Figure: the entries of xs are applied through f as random updates into x.]

  9. Solution Part One - Sort. Sort xs according to index, then apply the sorted xs to x as sequential updates. Much better than naïve random updates. But terabyte graphs can generate terabyte logs: significant sorting overhead.
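The difference between the naïve update loop and the sorted variant can be sketched in Python. This is a small in-memory illustration under assumed names (`apply_random`, `apply_sorted`) with addition as the update function; it shows only the access order, not any flash I/O.

```python
from operator import add


def apply_random(x, xs, f):
    """Apply updates in arrival order: x is touched at arbitrary positions."""
    for idx, arg in xs:
        x[idx] = f(x[idx], arg)
    return x


def apply_sorted(x, xs, f):
    """Sort updates by index first, so x is then walked front to back."""
    for idx, arg in sorted(xs, key=lambda update: update[0]):
        x[idx] = f(x[idx], arg)
    return x


# Hypothetical update stream of <idx, arg> pairs.
xs = [(3, 1), (0, 2), (3, 4), (1, 5), (0, 1)]

random_result = apply_random([0, 0, 0, 0], xs, add)
sorted_result = apply_sorted([0, 0, 0, 0], xs, add)
print(random_result, sorted_result)
```

Both orders give the same array. Python's `sorted` is stable, so updates to the same index keep their relative order and the result matches even for a non-commutative `f`.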

  10. Solution Part Two - Reduce. An associative update function f can be interleaved with the sort, e.g., (A + B) + C = A + (B + C). [Figure: the update stream xs is merged in stages, and entries with the same index are combined at each merge, giving reduced overhead.]
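Interleaving the reduction with the sort can be sketched in Python as follows. This is a simplified illustration, not the GraFBoost implementation: the helper names, the chunk size, and the single merge pass over all runs are assumptions, and everything stays in memory.

```python
import heapq
from functools import reduce
from itertools import groupby
from operator import add, itemgetter


def sort_reduce_chunk(chunk, f):
    """Sort one chunk by index and combine entries that share an index."""
    ordered = sorted(chunk, key=itemgetter(0))
    return [
        (idx, reduce(f, (arg for _, arg in group)))
        for idx, group in groupby(ordered, key=itemgetter(0))
    ]


def merge_reduce(runs, f):
    """Merge sorted runs, reducing equal indices as they meet."""
    out = []
    for idx, arg in heapq.merge(*runs, key=itemgetter(0)):
        if out and out[-1][0] == idx:
            out[-1] = (idx, f(out[-1][1], arg))
        else:
            out.append((idx, arg))
    return out


def sort_reduce(xs, f, chunk_size):
    """Sort-reduce a stream of <idx, arg> updates; f must be associative."""
    runs = [
        sort_reduce_chunk(xs[i:i + chunk_size], f)
        for i in range(0, len(xs), chunk_size)
    ]
    return merge_reduce(runs, f)


# Hypothetical update stream of <idx, arg> pairs.
xs = [(1, 3), (1, 8), (3, 8), (1, 3), (7, 0), (3, 1), (5, 2), (9, 2)]
reduced = sort_reduce(xs, add, chunk_size=4)
print(len(xs), "updates reduced to", len(reduced))
```

Because `f` is associative, combining entries early inside each chunk and again at each merge gives the same final values as applying every update one by one, while each later phase has less data to copy.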

  11. Big Benefits from Interleaving Reduction. Ratio of data copied at each sort phase. [Figure: normalized size of the update stream (0 to 1) against merge iteration (0 to 3) for Kron32, Kron30, WDC, and Twitter; annotated “90%”.]

  12. Accelerated GraFBoost Architecture. In-storage accelerator reduces data movement and cost. Host (Server/PC/Laptop): software for accelerator-aware flash management. FPGA: multirate 16-to-1 merge-sorter, multirate aggregator, wire-speed on-chip sorter. Flash: edge data, vertex data, partially sort-reduced files, active vertices.

  13. Accelerated GraFBoost Architecture. In-storage accelerator reduces data movement and cost. Diagram labels: Host (Server/PC/Laptop) with software; FPGA with the edge program (inputs: vertex value, edge property), the update log (xs), the sort-reduce accelerator, and 1GB DRAM; Flash holding edge data, vertex data, partially sort-reduced files, and active vertices.

  14. Evaluated Graph Analytic Systems. In-memory: GraphLab (IN). Semi-external: FlashGraph (SE1), X-Stream (SE2). External: GraphChi (EX), GraFSoft, GraFBoost, and GraFBoost2 (projected system with 2x memory bandwidth). References: “Distributed GraphLab: a framework for machine learning and data mining in the cloud,” VLDB 2012; “FlashGraph: Processing billion-node graphs on an array of commodity SSDs,” FAST 2015; “X-Stream: edge-centric graph processing using streaming partitions,” SOSP 2013; “GraphChi: Large-scale graph computation on just a PC,” USENIX 2012.

  15. Evaluation Environment. All software experiments: 32-core Xeon, 128 GB RAM ($8000) + 5x 0.5TB PCIe flash ($1000). GraFBoost: 4-core i5, 4 GB RAM ($400) + Virtex 7 FPGA with 1TB custom flash and 1GB on-board RAM ($???).

  16. Evaluation Result Overview. Large graphs: in-memory Fail, semi-external Fail, external Slow, GraFBoost Fast. Medium graphs: in-memory Fail, semi-external Fast, external Slow, GraFBoost Fast. Small graphs: in-memory Fast, semi-external Fast, external Slow, GraFBoost Fast. GraFBoost has very low resource requirements: memory, CPU, power.

  17. Results with a Large Graph: Synthetic Scale 32 Kronecker Graph. 0.5 TB in text, 4 billion vertices. GraphLab (IN): out of memory. GraphChi (EX): did not finish. [Figure: performance normalized to GraFSoft for PR, BFS, and BC across SE1, SE2, GraFBoost, and GraFBoost2; annotations: 1.7x, 10x, 2.8x.]

  18. Results with a Large Graph: Web Data Commons Web Crawl. 2 TB in text, 3 billion vertices. GraphLab (IN): out of memory. GraphChi (EX): did not finish. [Figure: performance normalized to GraFSoft for PR, BFS, and BC across SE1, SE2, GraFBoost, and GraFBoost2.]

  19. Results with a Large Graph: Web Data Commons Web Crawl. 2 TB in text, 3 billion vertices. GraphLab (IN): out of memory. GraphChi (EX): did not finish. Only GraFBoost succeeds in both graphs. GraFBoost can run still larger graphs! [Figure: performance normalized to GraFSoft for PR, BFS, and BC across SE1, SE2, GraFBoost, and GraFBoost2.]

  20. Results with Smaller Graphs: Breadth-First Search. Graphs: twitter (0.03 TB, 0.04 billion vertices), kron28 (0.09 TB, 0.3 billion vertices), kron30 (0.3 TB, 1 billion vertices). [Figure: performance normalized to GraFSoft for IN, SE1, SE2, EX, GraFBoost, and GraFBoost2 on each graph; annotations: “Fastest!” once and “Slowest” three times.]

  21. Results with a Medium Graph: Against an In-Memory Cluster. Synthesized Kronecker Scale 28: 0.09 TB in text, 0.3 billion vertices. [Figure: performance normalized to GraFSoft for PR and BFS across 5xIN, SE1, SE2, EX, and GraFBoost.]

  22. GraFBoost Reduces Resource Requirements. [Figure: three bar charts comparing Conventional and GraFBoost in threads, memory (GB), and power (Watts); captions: “External Analytics” and “External Analytics + Hardware Acceleration”.]

  23. Future Work. Open-source GraFBoost: cleaning up code for users! Acceleration using Amazon F1. Commercial accelerated storage: collaborating with Samsung. More applications using sort-reduce: bioinformatics collaboration with Barcelona Supercomputing Center.

  24. Thank You. Flash Storage: Sort-Reduce Algorithm, Hardware Acceleration.

  26. BlueDBM Cluster Architecture. [Diagram: a cluster of nodes, each a host server (24-core) attached to an FPGA (VC707) with minFlash cards; labels: 1GB RAM, 1 TB, PCIe, FMC, 4GB/s, 10Gbps network ×8, Ethernet.] Uniform latency of 100 µs!

  27. The BlueDBM Cluster. [Figure: BlueDBM storage device.]