GraFBoost: Using accelerated flash storage for external graph analytics
Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind
MIT CSAIL

Large Graphs are Found Everywhere in Nature
• Human neural network
• Structure of the internet
• Social networks
TB to 100s of TB in size!
Image sources:
1) Connectomics graph of the brain - Source: "Connectomics and Graph Theory", Jon Clayden, UCL
2) (Part of) the internet - Source: Criteo Engineering Blog
3) The Graph of a Social Network - Source: Griff's Graphs

Storage for Graph Analytics
• Graphs: extremely sparse, irregular structure, terabytes in size
• TB of DRAM: $$$ ($8000/TB, 200W)
• Our goal: $ ($500/TB, 10W)

Random Access Challenge in Flash Storage
                     Flash        DRAM
Bandwidth:           3.6 GB/s     ~100 GB/s
Access Granularity:  8192 Bytes   128 Bytes
• Random access wastes performance by not using most of each fetched page
• Using 8 bytes in an 8192 Byte page uses 1/1024 of the bandwidth!
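The arithmetic behind the 1/1024 figure can be checked with a short sketch. The bandwidth and granularity numbers are the ones quoted on the slide; the function names are hypothetical, and the model (useful bytes divided by fetched bytes) is a simplification that ignores caching and request overlap.

```python
# Figures quoted on the slide; the 8-byte payload is the slide's own example.
FLASH_BANDWIDTH_GBPS = 3.6    # flash bandwidth, GB/s
FLASH_PAGE_BYTES = 8192       # flash access granularity
DRAM_BANDWIDTH_GBPS = 100.0   # approximate DRAM bandwidth, GB/s
DRAM_LINE_BYTES = 128         # DRAM access granularity
USEFUL_BYTES = 8              # bytes actually needed per random access


def useful_fraction(useful_bytes, granularity_bytes):
    """Fraction of each fetched unit that carries useful data."""
    return min(useful_bytes, granularity_bytes) / granularity_bytes


def effective_bandwidth_mbps(raw_gbps, useful_bytes, granularity_bytes):
    """Useful MB/s when every access fetches a whole unit (simplified model)."""
    return raw_gbps * 1000 * useful_fraction(useful_bytes, granularity_bytes)


flash_effective = effective_bandwidth_mbps(
    FLASH_BANDWIDTH_GBPS, USEFUL_BYTES, FLASH_PAGE_BYTES)
dram_effective = effective_bandwidth_mbps(
    DRAM_BANDWIDTH_GBPS, USEFUL_BYTES, DRAM_LINE_BYTES)

print(f"flash: {flash_effective:.2f} MB/s useful")
print(f"DRAM:  {dram_effective:.2f} MB/s useful")
```

Under this model, fine-grained random access leaves flash with only a few MB/s of useful bandwidth, which is why the later slides turn random updates into sequential ones.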

Two Pillars for Flash in GraFBoost
Flash storage is supported by two pillars:
• Sort-Reduce Algorithm: sequentialize random access
• Hardware Acceleration: reduce the overhead of sort-reduce
  • System resources
  • Power consumption
  • Performance

Vertex-Centric Programming Model
• Popular model for efficient parallelism and distribution
• "Vertex program" only sees neighbors
• Algorithm executed in terms of disjoint iterations
• Vertex program is executed on one or more "Active Vertices"
[Figure: example graph with vertex values and edge weights; the active vertices are highlighted]

Algorithmic Representation of a Vertex Program Iteration
    for each v_src in ActiveList do
        for each e(v_src, v_dst) in G do
            ev = edge_program(v_src.val, e.weight)
            v_dst.next_val = vertex_update(v_dst.next_val, ev)
        end for
    end for
• The inner assignment is a random read-modify update into vertex data
• Sort-reduce algorithm solves this issue
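A minimal in-memory sketch of one such iteration, assuming a dictionary-based graph. The names `run_iteration` and `shortest_paths`, the adjacency-list layout, and the rule that a vertex becomes active when its value changes are illustrative assumptions, not GraFBoost's actual interface. The example plugs in a shortest-path style program: `edge_program` adds the edge weight and `vertex_update` keeps the minimum.

```python
import math


def run_iteration(graph, values, active, edge_program, vertex_update):
    """Run one vertex-centric iteration over the active list.

    graph:  dict mapping src -> list of (dst, weight) edges
    values: dict mapping vertex -> current value
    Returns (next_values, next_active).
    """
    next_values = dict(values)
    for v_src in active:
        for v_dst, weight in graph.get(v_src, []):
            ev = edge_program(values[v_src], weight)
            # Random read-modify-update into the vertex data.
            next_values[v_dst] = vertex_update(next_values[v_dst], ev)
    # Assumption: a vertex is active next iteration if its value changed.
    next_active = [v for v in values if next_values[v] != values[v]]
    return next_values, next_active


def shortest_paths(graph, source):
    """Iterate until no vertex is active (hypothetical driver loop)."""
    values = {v: math.inf for v in graph}
    values[source] = 0
    active = [source]
    while active:
        values, active = run_iteration(
            graph, values, active,
            edge_program=lambda val, weight: val + weight,
            vertex_update=min,
        )
    return values


example_graph = {"a": [("b", 5), ("c", 2)], "b": [], "c": [("b", 1)]}
distances = shortest_paths(example_graph, "a")
print(distances)
```

The write to `next_values[v_dst]` is the step that lands on an unpredictable vertex each time, which is cheap in a Python dictionary but expensive when the vertex data lives on flash.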

General Problem of Irregular Array Updates
Updating an array x with a stream of update requests xs and update function f:
    for each <idx, arg> in xs: x[idx] = f(x[idx], arg)
[Figure: the update stream xs is applied to array x through f as random updates]

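Solution Part One - Sort
• Sort xs according to index
• Sorted xs can be applied to x as sequential updates
• Much better than naïve random updates
• Terabyte graphs can generate terabyte logs
• Significant sorting overhead
[Figure: xs is sorted, then the sorted xs is applied to x as sequential updates]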
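The difference between the naive and sorted update orders can be sketched in a few lines. This is an in-memory illustration only, with hypothetical function names and made-up sample data. On a Python list both versions cost the same, so the point is the access pattern: the sorted version touches x in non-decreasing index order, which is what makes the accesses sequential on flash.

```python
from operator import add


def apply_updates_random(x, xs, f):
    """Naive version: apply each <idx, arg> in arrival order (random access into x)."""
    for idx, arg in xs:
        x[idx] = f(x[idx], arg)
    return x


def apply_updates_sorted(x, xs, f):
    """Sort xs by index first so x is touched in non-decreasing index order.

    sorted() is stable, so updates to the same index keep their arrival order
    and the result matches the naive version for any update function f.
    """
    for idx, arg in sorted(xs, key=lambda update: update[0]):
        x[idx] = f(x[idx], arg)
    return x


sample_xs = [(5, 1), (0, 4), (5, 2), (2, 7), (0, 3)]
naive = apply_updates_random([0] * 6, sample_xs, add)
sequential = apply_updates_sorted([0] * 6, sample_xs, add)
print(naive == sequential)
```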

Solution Part Two - Reduce
• Associative update function f can be interleaved with sort
  e.g., (A + B) + C = A + (B + C)
• Reduced overhead
[Figure: an update stream xs of index:value pairs is merged in stages, with values for equal indices added at every merge]
    xs:            1:7  3:0  1:3  8:1  3:5  8:2  1:9  3:2
    after merges:  1:19  3:7  8:3
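A minimal in-memory sketch of interleaving reduction with a merge sort. The function names and the `chunk_size` parameter are hypothetical, and the sample pairs are chosen for illustration. The system described in these slides works on flash-resident files with a hardware merge-sorter, so this shows the idea rather than that implementation. Each chunk is sorted and reduced on its own, then the sorted runs are merged and equal indices are combined again during the merge.

```python
import heapq
from functools import reduce
from itertools import groupby
from operator import add, itemgetter


def sort_reduce_run(chunk, f):
    """Sort one chunk by index and combine values of equal indices with f."""
    ordered = sorted(chunk, key=itemgetter(0))  # stable sort
    return [
        (idx, reduce(f, (arg for _, arg in group)))
        for idx, group in groupby(ordered, key=itemgetter(0))
    ]


def merge_reduce(runs, f):
    """Merge sorted runs, applying f whenever adjacent entries share an index."""
    merged = []
    for idx, arg in heapq.merge(*runs, key=itemgetter(0)):
        if merged and merged[-1][0] == idx:
            merged[-1] = (idx, f(merged[-1][1], arg))
        else:
            merged.append((idx, arg))
    return merged


def sort_reduce(xs, f, chunk_size):
    """Sort-reduce xs: per-chunk sort and reduce, then a reducing merge."""
    runs = [
        sort_reduce_run(xs[start:start + chunk_size], f)
        for start in range(0, len(xs), chunk_size)
    ]
    return merge_reduce(runs, f)


updates = [(1, 7), (3, 0), (1, 3), (8, 1), (3, 5), (8, 2), (1, 9), (3, 2)]
runs_of_four = [sort_reduce_run(updates[i:i + 4], add) for i in (0, 4)]
reduced = sort_reduce(updates, add, chunk_size=4)
print(runs_of_four)
print(reduced)
```

Because f is associative, combining values early inside each run gives the same final result as sorting everything first, and every later merge stage has fewer entries to copy.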

Big Benefits from Interleaving Reduction
Ratio of data copied at each sort phase
[Figure: normalized size of the update stream (y-axis, 0 to 1) versus merge iteration (x-axis, 0 to 3) for the Kron32, WDC, Kron30 and Twitter graphs; one annotation reads "90%"]

Accelerated GraFBoost Architecture
In-storage accelerator reduces data movement and cost
[Diagram labels]
• Host (Server/PC/Laptop): Software
• Accelerator-Aware Flash Management
• FPGA: Multirate 16-to-1 Merge-Sorter, Multirate Aggregator, Wire-Speed On-chip Sorter
• Flash: Edge Data, Vertex Data, Partially Sort-Reduced Files, Active Vertices

Accelerated GraFBoost Architecture
In-storage accelerator reduces data movement and cost
[Diagram labels]
• Host (Server/PC/Laptop): Software
• FPGA: 1GB DRAM, Vertex Value, Update Log (xs), Sort-Reduce Accelerator, Edge Program, Edge Property
• Flash: Edge Data, Vertex Data, Partially Sort-Reduced Files, Active Vertices

Evaluated Graph Analytic Systems
• In-memory: GraphLab (IN)
• Semi-External: FlashGraph (SE1), X-Stream (SE2)
• External: GraphChi (EX), GraFSoft, GraFBoost
• GraFBoost2: projected system with 2x memory bandwidth
References:
"Distributed GraphLab: a framework for machine learning and data mining in the cloud," VLDB 2012
"FlashGraph: Processing billion-node graphs on an array of commodity SSDs," FAST 2015
"X-Stream: edge-centric graph processing using streaming partitions," SOSP 2013
"GraphChi: Large-scale graph computation on just a PC," USENIX 2012

Evaluation Environment
• 32-core Xeon, 128 GB RAM ($8000) + 5x 0.5TB PCIe Flash ($1000): all software experiments
• 4-core i5, 4 GB RAM ($400) + Virtex 7 FPGA, 1TB custom flash, 1GB on-board RAM ($???)

Evaluation Result Overview
                In-memory   Semi-External   External   GraFBoost
Large graphs:   Fail        Fail            Slow       Fast
Medium graphs:  Fail        Fast            Slow       Fast
Small graphs:   Fast        Fast            Slow       Fast
GraFBoost has very low resource requirements: memory, CPU, power

Results with a Large Graph: Synthetic Scale 32 Kronecker Graph
0.5 TB in text, 4 Billion vertices
• GraphLab (IN): out of memory
• GraphChi (EX): did not finish
[Figure: normalized performance on PR, BFS and BC for SE1, SE2, GraFBoost and GraFBoost2, with a reference line at 1 labeled GraFSoft; annotations read 1.7x, 10x and 2.8x]

Results with a Large Graph: Web Data Commons Web Crawl
2 TB in text, 3 Billion vertices
• GraphLab (IN): out of memory
• GraphChi (EX): did not finish
• Only GraFBoost succeeds in both graphs
• GraFBoost can run still larger graphs!
[Figure: normalized performance on PR, BFS and BC for SE1, SE2, GraFBoost and GraFBoost2, with a reference line at 1 labeled GraFSoft]

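Results with Smaller Graphs: Breadth-First Search
Graph     Size      Vertices
twitter   0.03 TB   0.04 Billion
kron28    0.09 TB   0.3 Billion
kron30    0.3 TB    1 Billion
[Figure: normalized performance on twitter, kron28 and kron30 for IN, SE1, SE2, EX, GraFBoost and GraFBoost2, with a reference line at 1 labeled GraFSoft; annotations mark one result "Fastest!" and three results "Slowest"]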

Results with a Medium Graph: Against an In-Memory Cluster
Synthesized Kronecker Scale 28: 0.09 TB in text, 0.3 Billion vertices
[Figure: normalized performance on PR and BFS for 5xIN, SE1, SE2, EX and GraFBoost, with a reference line at 1 labeled GraFSoft]

GraFBoost Reduces Resource Requirements
[Figure: three bar charts comparing conventional external analytics with GraFBoost (external analytics + hardware acceleration)]
• Threads: 32 (conventional) vs 2 (GraFBoost)
• Memory (GB): 80 (conventional) vs 2 (GraFBoost)
• Power (Watts): 410 (conventional) vs 40 (GraFBoost)

Future Work
• Open-source GraFBoost
  • Cleaning up code for users!
  • Acceleration using Amazon F1
• Commercial accelerated storage
  • Collaborating with Samsung
• More applications using sort-reduce
  • Bioinformatics collaboration with Barcelona Supercomputing Center

Thank You
Flash Storage: Sort-Reduce Algorithm + Hardware Acceleration

BlueDBM Cluster Architecture
[Diagram labels]
• Host Server (24-Core), connected to the FPGA over PCIe
• FPGA (VC707) with 1GB RAM
• minFlash, 1 TB, attached over FMC
• Ethernet; 4GB/s; 10Gbps network ×8
Uniform latency of 100 µs!

The BlueDBM Cluster
[Photos: the BlueDBM cluster and a BlueDBM storage device]
