GraFBoost: Using accelerated flash storage for external graph analytics


SLIDE 1

GraFBoost: Using accelerated flash storage for external graph analytics

Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind

MIT CSAIL

Funded by: [sponsor logos]

SLIDE 2

Large Graphs are Found Everywhere in Nature

  • Human neural network
  • Structure of the internet
  • Social networks

TB to 100s of TB in size!

Image sources:
1) Connectomics graph of the brain. Source: “Connectomics and Graph Theory”, Jon Clayden, UCL
2) (Part of) the internet. Source: Criteo Engineering Blog
3) The graph of a social network. Source: Griff’s Graphs

SLIDE 3

Storage for Graph Analytics

Graphs are:
  • Terabytes in size
  • Irregular in structure
  • Extremely sparse

Storage options:
  • TB of DRAM ($$$): $8000/TB, 200W
  • Our goal ($): $500/TB, 10W

SLIDE 4

Random Access Challenge in Flash Storage

                      Flash         DRAM
Access granularity    8192 bytes    128 bytes
Bandwidth             3.6 GB/s      ~100 GB/s

  • Using 8 bytes of an 8192-byte page uses 1/1024 of the bandwidth!
  • Performance is wasted because most of each fetched page goes unused
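The 1/1024 figure follows directly from the access granularities above. A minimal sketch of the arithmetic, assuming each random access needs only 8 bytes and the device is otherwise running at full bandwidth (the function and variable names are illustrative, not from the talk):

```python
def useful_fraction(bytes_needed: int, access_granularity: int) -> float:
    """Fraction of each fetched unit that a random access actually uses."""
    return bytes_needed / access_granularity


# Figures from the slide: 8192-byte flash pages, 128-byte DRAM accesses.
flash_fraction = useful_fraction(8, 8192)
dram_fraction = useful_fraction(8, 128)

# Useful bandwidth in bytes/s when only 8 bytes per access are needed.
flash_effective = 3.6e9 * flash_fraction
dram_effective = 100e9 * dram_fraction

print(f"flash: 1/{round(1 / flash_fraction)} of bandwidth, "
      f"{flash_effective / 1e6:.1f} MB/s useful")
print(f"DRAM:  1/{round(1 / dram_fraction)} of bandwidth, "
      f"{dram_effective / 1e9:.2f} GB/s useful")
```

Under these assumptions, fine-grained random access leaves flash with only a few MB/s of useful bandwidth, which is why the rest of the talk focuses on turning random accesses into sequential ones.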

SLIDE 5

Two Pillars for Flash in GraFBoost

Flash storage rests on two pillars:

Sort-Reduce Algorithm: sequentializes random access

Hardware Acceleration: reduces the overhead of sort-reduce

  • System resources
  • Power consumption
  • Performance

SLIDE 6

Vertex-Centric Programming Model

  • Popular model for efficient parallelism and distribution
  • A “vertex program” only sees neighbors
  • The vertex program is executed on one or more “active vertices”
  • The algorithm is executed in terms of disjoint iterations

[Figure: example graph with vertex values (some still ∞) and edge weights, with the active vertices highlighted]

SLIDE 7

Algorithmic Representation of a Vertex Program Iteration

    for each vsrc in ActiveList do
        for each e(vsrc, vdst) in G do
            ev = edge_program(vsrc.val, e.weight)
            vdst.next_val = vertex_update(vdst.next_val, ev)
        end for
    end for

  • Random read-modify update into vertex data
  • Sort-reduce algorithm solves this issue
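The iteration above can be sketched in a few lines of Python. This is an illustrative in-memory version, not GraFBoost's implementation: the dictionary-based graph layout, the `run_iteration` helper, and the shortest-path example are all assumptions made for the sketch.

```python
import math


def run_iteration(graph, values, active, edge_program, vertex_update):
    """Run one vertex-program iteration over the active vertices.

    graph maps a source vertex to a list of (destination, weight) edges.
    Returns the next vertex values and the set of vertices that changed.
    """
    next_values = dict(values)
    for vsrc in active:
        for vdst, weight in graph.get(vsrc, []):
            ev = edge_program(values[vsrc], weight)
            # Random read-modify-write into vertex data.
            next_values[vdst] = vertex_update(next_values[vdst], ev)
    changed = {v for v in next_values if next_values[v] != values[v]}
    return next_values, changed


# Hypothetical example: single-source shortest paths starting from "a".
graph = {"a": [("b", 2), ("c", 5)], "b": [("c", 1)], "c": []}
values = {"a": 0, "b": math.inf, "c": math.inf}
active = {"a"}
while active:
    values, active = run_iteration(
        graph, values, active,
        edge_program=lambda val, weight: val + weight,
        vertex_update=min,
    )
print(values)
```

Vertices whose value changed become the active vertices of the next iteration, so the loop stops once an iteration changes nothing.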

SLIDE 8

General Problem of Irregular Array Updates

Updating an array x with a stream of update requests xs and update function f:

    for each <idx, arg> in xs:
        x[idx] = f(x[idx], arg)

[Figure: the update stream xs passes through f and lands on array X as random updates]
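As a point of reference, the naive update loop can be written directly. A minimal sketch, assuming x fits in a Python list and f is addition (both are assumptions chosen for illustration):

```python
from operator import add


def apply_updates(x, xs, f):
    """Apply each <idx, arg> request to x in arrival order (random access)."""
    for idx, arg in xs:
        x[idx] = f(x[idx], arg)
    return x


# Hypothetical update stream: indices arrive in no particular order.
updates = [(9, 1), (0, 4), (8, 2), (1, 3), (9, 5), (0, 1)]
naive_result = apply_updates([0] * 12, updates, add)
print(naive_result)
```

Each request may touch a different region of x, which is the access pattern that flash handles poorly.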

SLIDE 9

Solution Part One - Sort

Sort xs according to index, then apply the sorted xs to X as sequential updates.

  • Much better than naïve random updates
  • Significant sorting overhead
  • Terabyte graphs can generate terabyte logs
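A small sketch of why sorting helps, under a deliberately simplified cost model: a page is fetched whenever an update touches a different page than the previous update did. The page size of 4 elements, the helper names, and the update stream are illustrative assumptions.

```python
from operator import add


def page_fetches(indices, page_size):
    """Count page fetches, assuming only the most recent page stays cached."""
    fetches, current = 0, None
    for idx in indices:
        page = idx // page_size
        if page != current:
            fetches += 1
            current = page
    return fetches


def sorted_update(x, xs, f):
    """Sort requests by index (stable), then apply them in one sequential pass."""
    for idx, arg in sorted(xs, key=lambda req: req[0]):
        x[idx] = f(x[idx], arg)
    return x


xs = [(9, 1), (0, 4), (8, 2), (1, 3), (9, 5), (0, 1)]
random_fetches = page_fetches([idx for idx, _ in xs], page_size=4)
sorted_fetches = page_fetches(sorted(idx for idx, _ in xs), page_size=4)
result = sorted_update([0] * 12, xs, add)
print(random_fetches, sorted_fetches, result)
```

Because Python's sort is stable, requests to the same index keep their arrival order, so the sorted pass produces the same array as the naive loop while visiting each page only once.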

SLIDE 10

Solution Part Two - Reduce

  • An associative update function f can be interleaved with the sort, e.g., (A + B) + C = A + (B + C)
  • Reduced overhead: updates to the same index are combined at each merge step, so the stream shrinks as it is sorted

[Figure: key-value pairs from xs being merged and reduced over two merge steps]
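The idea of interleaving reduction with sorting can be sketched as a merge sort that combines equal indices during every merge. This is an illustrative in-memory version, not the GraFBoost implementation; the function names and the sample stream are assumptions.

```python
from operator import add


def merge_reduce(left, right, f):
    """Merge two index-sorted runs, combining entries that share an index."""
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        if j == len(right) or (i < len(left) and left[i][0] <= right[j][0]):
            idx, arg = left[i]
            i += 1
        else:
            idx, arg = right[j]
            j += 1
        if out and out[-1][0] == idx:
            # Same index as the previous entry: reduce instead of appending.
            out[-1] = (idx, f(out[-1][1], arg))
        else:
            out.append((idx, arg))
    return out


def sort_reduce(xs, f):
    """Merge sort over update requests that reduces duplicates at every merge."""
    runs = [[req] for req in xs]
    while len(runs) > 1:
        merged = [merge_reduce(runs[k], runs[k + 1], f)
                  for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []


# Hypothetical update stream of <idx, arg> pairs.
stream = [(3, 7), (1, 3), (1, 1), (8, 3), (5, 2), (8, 3), (9, 1), (2, 3)]
reduced = sort_reduce(stream, add)
print(reduced)
```

The output holds one entry per index, so the final pass over x is both sequential and shorter than the original stream. Intermediate runs also shrink whenever duplicates meet, which is where the reduced overhead comes from.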

SLIDE 11

Big Benefits from Interleaving Reduction

[Figure: ratio of data copied at each sort phase. x-axis: merge iteration (1 to 3); y-axis: normalized size of update stream (0.2 to 1); series: Kron32, WDC, Kron30, Twitter; callout: 90%]

SLIDE 12

Accelerated GraFBoost Architecture

System layers: host (server/PC/laptop) software, FPGA, flash

Data files:
  • Vertex data
  • Edge data
  • Active vertices
  • Partially sort-reduced files

Accelerator components:
  • Accelerator-aware flash management
  • Multirate 16-to-1 merge-sorter
  • Multirate aggregator
  • Wire-speed on-chip sorter

In-storage accelerator reduces data movement and cost
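As a rough software analogy for the 16-to-1 merge-sorter followed by an aggregator, the sketch below merges up to 16 index-sorted runs at a time and reduces equal indices in the merged output. It says nothing about how the FPGA pipeline is built; the function name, the `fan_in` parameter, and the sample runs are assumptions, and only Python standard-library calls are used.

```python
import heapq
from functools import reduce
from itertools import groupby
from operator import add, itemgetter


def kway_merge_reduce(runs, f, fan_in=16):
    """Merge up to fan_in index-sorted runs at a time, reducing equal indices.

    Each input run is assumed to be sorted by index already.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [list(run) for run in runs]
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), fan_in):
            # Merge stage: one sorted stream out of up to fan_in sorted runs.
            merged = heapq.merge(*runs[k:k + fan_in], key=itemgetter(0))
            # Aggregation stage: fold consecutive entries with the same index.
            next_runs.append([
                (idx, reduce(f, (arg for _, arg in group)))
                for idx, group in groupby(merged, key=itemgetter(0))
            ])
        runs = next_runs
    return runs[0] if runs else []


# Hypothetical input: 20 single-entry runs spread over 5 indices.
sample_runs = [[(i % 5, 1)] for i in range(20)]
merged_result = kway_merge_reduce(sample_runs, add)
print(merged_result)
```

A higher fan-in means fewer passes over the data, and each pass is a sequential read and write, which suits flash.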

SLIDE 13

Accelerated GraFBoost Architecture

[Figure: the same host software / FPGA / flash layout, with the data path labeled: edge property and vertex value, edge program, update log (xs), sort-reduce accelerator with 1GB DRAM]

In-storage accelerator reduces data movement and cost

SLIDE 14

Evaluated Graph Analytic Systems

  • In-memory: GraphLab (IN)
  • Semi-external: FlashGraph (SE1), X-Stream (SE2)
  • External: GraphChi (EX), GraFSoft, GraFBoost, GraFBoost2 (projected system with 2x memory bandwidth)

References:
  • “Distributed GraphLab: a framework for machine learning and data mining in the cloud,” VLDB 2012
  • “FlashGraph: Processing billion-node graphs on an array of commodity SSDs,” FAST 2015
  • “X-Stream: edge-centric graph processing using streaming partitions,” SOSP 2013
  • “GraphChi: Large-scale graph computation on just a PC,” USENIX 2012

SLIDE 15

Evaluation Environment

  • All software experiments: 32-core Xeon, 128 GB RAM + 5x 0.5TB PCIe flash
  • Hardware-accelerated system: 4-core i5, 4 GB RAM + Virtex 7 FPGA, 1TB custom flash, 1GB on-board RAM
  • Cost labels on the slide (pairing with components not recoverable): $8000, $400, $???, $1000

SLIDE 16

Evaluation Result Overview

Systems compared, in the order listed on the slide: In-memory, External, Semi-External, GraFBoost

  • Large graphs: Fail, Fail, Slow, Fast
  • Medium graphs and small graphs: Fast, Fail, Fast, Slow, Slow, Fast, Fast, Fast (row and column pairing not recoverable)

GraFBoost has very low resource requirements: memory, CPU, power

SLIDE 17

Results with a Large Graph: Synthetic Scale 32 Kronecker Graph

0.5 TB in text, 4 Billion vertices

  • GraphLab (IN): out of memory
  • GraphChi (EX): did not finish

[Figure: normalized performance on PR, BFS and BC for SE1, SE2, GraFSoft, GraFBoost and GraFBoost2; callouts: 1.7x, 2.8x, 10x]

SLIDE 18

Results with a Large Graph: Web Data Commons Web Crawl

2 TB in text, 3 Billion vertices

  • GraphLab (IN): out of memory
  • GraphChi (EX): did not finish

[Figure: normalized performance on PR, BFS and BC for SE1, SE2, GraFSoft, GraFBoost and GraFBoost2]

SLIDE 19

Results with a Large Graph: Web Data Commons Web Crawl

2 TB in text, 3 Billion vertices

  • GraphLab (IN): out of memory
  • GraphChi (EX): did not finish
  • Only GraFBoost succeeds in both graphs
  • GraFBoost can run still larger graphs!

[Figure: normalized performance on PR, BFS and BC for SE1, SE2, GraFSoft, GraFBoost and GraFBoost2]

SLIDE 20

Results with Smaller Graphs: Breadth-First Search

Graph      Size       Vertices
twitter    0.03 TB    0.04 Billion
kron28     0.09 TB    0.3 Billion
kron30     0.3 TB     1 Billion

[Figure: normalized performance (axis ticks 1 to 5) on twitter, kron28 and kron30 for IN, SE1, SE2, EX, GraFSoft, GraFBoost and GraFBoost2; callouts: “Fastest!” once and “Slowest” three times]

SLIDE 21

Results with a Medium Graph: Against an In-Memory Cluster

Synthesized Kronecker Scale 28: 0.09 TB in text, 0.3 Billion vertices

[Figure: normalized performance (axis ticks 1 to 6) on PR and BFS for 5xIN, SE1, SE2, EX, GraFSoft and GraFBoost]

SLIDE 22

GraFBoost Reduces Resource Requirements

Resource         Conventional    GraFBoost    Reduction comes from
Memory (GB)      80              2            External analytics
Threads          32              2            Hardware acceleration
Power (Watts)    410             40           External analytics + hardware acceleration

SLIDE 23

Future Work

  • Open-source GraFBoost: cleaning up code for users; acceleration using Amazon F1
  • Commercial accelerated storage: collaborating with Samsung
  • More applications using sort-reduce: bioinformatics collaboration with Barcelona Supercomputing Center

SLIDE 24

Thank You

  • Hardware Acceleration
  • Sort-Reduce Algorithm
  • Flash Storage

SLIDE 26

BlueDBM Cluster Architecture

  • Each node: host server (24-core) attached to an FPGA (VC707) with two minFlash cards
  • Diagram labels: PCIe, 4GB/s, FMC, Ethernet, 10Gbps network ×8, 1 TB, 1GB RAM
  • Uniform latency of 100 µs!

SLIDE 27

The BlueDBM Cluster

[Photo: BlueDBM storage device]