  1. On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-based FPGAs. Xinyu Chen¹, Ronak Bajaj¹, Yao Chen², Jiong He³, Bingsheng He¹, Weng-Fai Wong¹, Deming Chen⁴. ¹National University of Singapore, ²Advanced Digital Sciences Center, ³Alibaba Group, ⁴University of Illinois at Urbana-Champaign

  2. Graph processing on FPGAs
  • Graph processing is widely used in a variety of application domains:
    • Social networks
    • Cybersecurity
    • Machine learning
  • Accelerating graph processing on FPGAs has attracted a lot of attention, benefiting from:
    • Fine-grained parallelism
    • Low power consumption
    • Extreme configurability

  3. Graph processing on HLS-based FPGAs
  • FPGA development has traditionally been RTL-based:
    • Time-consuming
    • Requires a deep understanding of hardware
  • To ease the use of FPGAs, HLS tools have been proposed:
    • High-level programming model
    • Hide hardware details
    • Both Intel and Xilinx provide HLS tools
  • This work: graph processing on OpenCL-based FPGAs.

  4. GAS model for graph processing
  • Scatter: for each edge, an update tuple is generated in the format <destination, value>.
    • E.g., <2, x> and <7, y> for vertex 1 of the example graph.
  • Gather: accumulate each value to its destination vertex.
    • E.g., Op(P2, x), Op(P7, y).
  • Apply: apply a function to all the vertices (a sequential sketch follows).
  [Figure: example graph (vertices 1, 2, 3, 7) and the resulting reads and writes to the property array P0..P7, using vertex 1 as the example.]
  A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel, "Chaos: Scale-out graph processing from secondary storage," in SOSP, 2015.
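As a point of reference for the three phases above, here is a minimal sequential sketch of one GAS iteration in C. It is an illustration only, not the paper's code; the type and function names (edge_t, update_t, Op, Apply) and the choice of scattering the source vertex's property as the value are assumptions.

```c
/* Minimal sequential sketch of one GAS iteration (illustrative, not the
 * paper's code). Types and names are assumptions made for clarity. */
#include <stddef.h>

typedef struct { int src, dst; } edge_t;
typedef struct { int dst; float value; } update_t;      /* <destination, value> */

static float Op(float old, float v) { return old + v; } /* example accumulator  */
static float Apply(float p)         { return p; }       /* placeholder apply fn */

void gas_iteration(const edge_t *edges, size_t num_edges,
                   float *prop, size_t num_vertices, update_t *updates) {
    /* Scatter: one update tuple per edge, e.g. <2, x> and <7, y> for vertex 1. */
    for (size_t e = 0; e < num_edges; e++) {
        updates[e].dst   = edges[e].dst;
        updates[e].value = prop[edges[e].src];
    }
    /* Gather: accumulate each value into its destination vertex's property.   */
    for (size_t e = 0; e < num_edges; e++)
        prop[updates[e].dst] = Op(prop[updates[e].dst], updates[e].value);
    /* Apply: run an apply function on every vertex.                           */
    for (size_t v = 0; v < num_vertices; v++)
        prop[v] = Apply(prop[v]);
}
```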

  5. GAS model on FPGAs
  • BRAM caching: avoids random memory accesses to the property array.
  • Multiple PEs: each PE processes a part of the cached data and runs independently (sketched below).
  • Data shuffling dispatches the update tuples (e.g., <2, x> and <7, y> for vertex 1) to the PE that caches the corresponding destination vertex.
  [Figure: memory accesses for vertex 1 of the example graph. The property array P0..P7 is partitioned across PEs: P0..P3 are cached in the BRAM of PE 0 and P4..P7 in the BRAM of PE 1.]
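The caching scheme can be pictured with a short behavioural sketch: the property array is range-partitioned across PEs and each partition is held in that PE's BRAM, so a PE only ever touches its own slice. The constants and names below (NUM_PE, VERTS_PER_PE, owner_pe) are illustrative assumptions matching the 2-PE, 8-vertex example, not the paper's implementation.

```c
/* Behavioural sketch of BRAM caching with multiple PEs (assumed names,
 * matching the 2-PE / 8-vertex example on the slide). */
#define NUM_PE       2
#define VERTS_PER_PE 4                         /* P0..P3 on PE 0, P4..P7 on PE 1 */

static float bram_cache[NUM_PE][VERTS_PER_PE]; /* stands in for per-PE BRAM      */

static int owner_pe(int dst)  { return dst / VERTS_PER_PE; }
static int local_idx(int dst) { return dst % VERTS_PER_PE; }

/* Each PE independently gathers only the update tuples routed to it. */
void pe_gather(int pe_id, const int *dst, const float *val, int n) {
    for (int i = 0; i < n; i++)
        if (owner_pe(dst[i]) == pe_id)         /* tuple belongs to this PE       */
            bram_cache[pe_id][local_idx(dst[i])] += val[i];
}
```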

  6. Data shuffling
  • Widely used for irregular applications.
  • Data tuples generated in the format <dst, value> are dispatched to PE 'dst' for processing (a reference dispatcher is sketched below).
  • Challenges:
    • Run-time data dependency
    • Parallelism
  [Figure: a set of eight data tuples D0..D7 produced by the PEs at stage 0 is shuffled to the PEs at stage 1; arrows with different colours show a few shuffling examples.]
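The routing itself is simple to state but data-dependent, which is exactly what makes it hard to pipeline: the output slot of every tuple is only known once the tuple's destination has been read. A minimal reference dispatcher (all names assumed) is sketched below; the modulo hash is just one possible PE-selection function.

```c
/* Reference dispatcher (illustrative only): route a set of 8 tuples to
 * per-PE output buffers. The caller must zero-initialise out_cnt[]. */
#define SET_SIZE 8
#define NUM_PE   8

typedef struct { int dst; float value; } tuple_t;

void dispatch_set(const tuple_t in[SET_SIZE],
                  tuple_t out[NUM_PE][SET_SIZE], int out_cnt[NUM_PE]) {
    for (int i = 0; i < SET_SIZE; i++) {
        int pe = in[i].dst % NUM_PE;           /* run-time data dependency          */
        out[pe][out_cnt[pe]++] = in[i];        /* write position varies at run time */
    }
}
```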

  7. OpenCL does not natively support shuffling
  • Fine-grained control logic is not available in OpenCL.
  • There is no vendor-specific extension for shuffling [1].
  • OpenCL compilers only perform static analysis at compile time, and thus cannot extract parallelism from functions with run-time dependencies [2].
  [1] N. Kapre and H. Patel, "Applying models of computation to OpenCL pipes for FPGA computing," in IWOCL, 2017.
  [2] Z. Li, L. Liu, Y. Deng, S. Yin, Y. Wang, and S. Wei, "Aggressive pipelining of irregular applications on reconfigurable hardware," in ISCA, 2017.

  8. Potential shuffling solutions with OpenCL: polling
  • Each PE checks the tuples of a set serially (sketched below).
  • 'Bubbles' are introduced: a cycle is spent on every slot, wanted or not.
  • 8 cycles are needed to dispatch a set of 8 tuples.
  [Figure: each PE at stage 1 scans the tuple set D0..D7 from stage 0 serially; the slots holding tuples for other PEs become bubbles.]
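From the perspective of a single PE, polling looks like the sketch below (names assumed): the PE walks over all eight slots of the set serially, so a full set always costs eight iterations even if only one or two tuples are wanted; the skipped slots are the 'bubbles'.

```c
/* Polling sketch (illustrative): PE `pe_id` scans every slot of the set,
 * spending one iteration per slot whether the tuple is wanted or not. */
#define SET_SIZE 8
#define NUM_PE   8

typedef struct { int dst; float value; } tuple_t;

void pe_polling(int pe_id, const tuple_t set[SET_SIZE],
                tuple_t local[SET_SIZE], int *local_cnt) {
    for (int i = 0; i < SET_SIZE; i++) {       /* always 8 iterations per set    */
        if (set[i].dst % NUM_PE == pe_id)      /* wanted tuple: keep it          */
            local[(*local_cnt)++] = set[i];
        /* else: a wasted iteration, i.e. a 'bubble'                             */
    }
}
```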

  9. Potential shuffling solutions with OpenCL: convergence kernel from [1]
  • Each PE writes the tuples it wants into local BRAM in parallel.
  • The run-time data dependency is not resolved, so writes conflict.
  • The initiation interval (II) of the pipeline is 284 cycles.
  [Figure: the PEs at stage 0 produce the tuple set D0..D7; the processing logic of each stage-1 PE suffers from write conflicts.]
  [1] Z. Wang et al., "Multikernel data partitioning with channel on OpenCL-based FPGAs," IEEE TVLSI, vol. 25, no. 6, pp. 1906-1918, 2017.
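The pattern behind this approach can be sketched as follows (a simplification, not the code of [1]): the per-set checks are conceptually unrolled so all eight slots are examined at once, but the shared write counter and the data-dependent BRAM addresses form a dependency carried from one set to the next, which is what prevents the HLS compiler from overlapping iterations and drives the initiation interval up.

```c
/* Simplified sketch of the convergence-kernel pattern (not the code of [1]).
 * The write counter `cnt` and the data-dependent BRAM addresses are carried
 * across sets, so successive sets cannot be fully overlapped in the pipeline. */
#define SET_SIZE   8
#define NUM_PE     8
#define BRAM_DEPTH 1024

typedef struct { int dst; float value; } tuple_t;

void pe_convergence(int pe_id, const tuple_t *sets, int num_sets,
                    tuple_t local_bram[BRAM_DEPTH]) {
    int cnt = 0;                               /* loop-carried dependency        */
    for (int s = 0; s < num_sets; s++) {       /* pipelined loop over sets       */
        for (int i = 0; i < SET_SIZE; i++) {   /* conceptually fully unrolled    */
            tuple_t t = sets[s * SET_SIZE + i];
            if (t.dst % NUM_PE == pe_id)
                local_bram[cnt++] = t;         /* conflicting, variable writes   */
        }
    }
}
```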

  10. Insights
  • Polling introduces 'bubbles'.
  • Convergence kernel: the run-time dependency is still there.
  • What if we knew the positions and the number of the wanted tuples in advance?
    • PEs could directly access the wanted tuples.
    • The number of cycles needed would equal the number of wanted tuples.
  • How to know the positions and the number of the wanted tuples?
    • A decoder-based solution: for a set of 8 tuples there are only 2^8 possibilities, since each tuple has only two statuses, wanted or not (an illustrative table is sketched below).
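One straightforward way to realise this insight, shown here as an illustration (the paper's decoder may well be implemented differently, e.g. as combinational logic), is a 256-entry lookup table indexed by the 8-bit mask that stores, for every possible mask, how many bits are set and at which positions.

```c
/* Decoder-table sketch (an assumption about one possible realisation; the
 * actual decoder on the FPGA may be combinational logic instead). For each
 * of the 2^8 possible masks, precompute the count and positions of set bits. */
#define SET_SIZE 8

typedef struct {
    unsigned char num;                 /* number of wanted tuples              */
    unsigned char pos[SET_SIZE];       /* their positions within the set       */
} decode_entry_t;

decode_entry_t decode_table[1 << SET_SIZE];

void build_decode_table(void) {
    for (int mask = 0; mask < (1 << SET_SIZE); mask++) {
        decode_table[mask].num = 0;
        for (int bit = 0; bit < SET_SIZE; bit++)
            if (mask & (1 << bit))
                decode_table[mask].pos[decode_table[mask].num++] =
                    (unsigned char)bit;
    }
}
```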

  11. Proposed shuffling (example: a set of 8 tuples on PE 0)
  • Calculate the destination PE of each tuple.
  • Validate: compute an 8-bit MASK by comparing each tuple's destination PE with the id of the current PE: hash_val == PE_ID ? 1 : 0. E.g., MASK = 01000100 for PE 0.
  • Decode: obtain the positions and the number of the wanted tuples from the MASK. E.g., Num = 2; Pos = 2, 6.
  • Filter: collect the wanted tuples without 'bubbles'.
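Putting the three steps together for one PE gives the behavioural sketch below: validate builds the mask, decode looks up the number and positions of the wanted tuples, and filter pulls exactly those tuples, one per cycle. The names, the modulo hash, and the table-based decoder are assumptions carried over from the previous sketches; the real design is a set of pipelined OpenCL kernels, not a C function.

```c
/* Behavioural sketch of the proposed validate-decode-filter shuffle for one
 * PE (assumed names; the real design consists of pipelined OpenCL modules). */
#define SET_SIZE 8
#define NUM_PE   8

typedef struct { int dst; float value; } tuple_t;
typedef struct { unsigned char num; unsigned char pos[SET_SIZE]; } decode_entry_t;
extern decode_entry_t decode_table[1 << SET_SIZE]; /* see build_decode_table()   */

int pe_shuffle(int pe_id, const tuple_t set[SET_SIZE], tuple_t wanted[SET_SIZE]) {
    /* 1. Validate: build the 8-bit MASK (hash_val == PE_ID ? 1 : 0 per slot),
     *    e.g. MASK = 01000100 for PE 0 in the slide's example.                 */
    unsigned mask = 0;
    for (int i = 0; i < SET_SIZE; i++)
        if (set[i].dst % NUM_PE == pe_id)
            mask |= 1u << i;

    /* 2. Decode: number and positions of the wanted tuples,
     *    e.g. Num = 2, Pos = 2 and 6 for MASK 01000100.                        */
    const decode_entry_t *d = &decode_table[mask];

    /* 3. Filter: collect exactly d->num tuples, with no bubbles.               */
    for (int i = 0; i < d->num; i++)
        wanted[i] = set[d->pos[i]];
    return d->num;                             /* cycles spent == tuples wanted  */
}
```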

  12. Proposed shuffling
  • No 'bubbles': no cycles are wasted on unwanted tuples.
  • The run-time dependency is resolved.
  • All the modules are pipelined.

  13. Proposed graph processing framework with shuffle
  [Figure: overall architecture. Scatter PEs (sPE 0..sPE N-1) read the graph from DDR and emit update tuples (<D0,V0>, ..., <DN-1,VN-1>); an N-way PE-selection stage extends each tuple with its destination PE id (<D0,V0,H0>, ..., <DN-1,VN-1,HN-1>); the shuffle stage duplicates the data and runs a validation, decoder, and filter unit per gather PE; gather PEs (gPE 0..gPE 2N-1) accumulate the values into BRAM-cached vertex properties; apply PEs (aPE 0..aPE x-1) update all vertices and write the results back to DDR.]

  14. Experimental configuration
  • Experiments are conducted on a Terasic DE5-Net board.
  • BFS, SSSP, PageRank and SpMV are used as applications.
  • Graph datasets are taken from [34] and [35].
  [34] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, "Kronecker graphs: An approach to modeling networks," JMLR, 2010.
  [35] R. A. Rossi and N. K. Ahmed, "The network data repository with interactive graph analytics and visualization," in AAAI, 2015.

  15. Efficiency of shuffle
  • Theoretical throughput = memory_bandwidth / tuple_size (a worked example follows).
  • The measured throughput is close to the theoretical throughput.
  [Figure: measured throughput, theoretical throughput (million tuples/s) and bandwidth utilization for tuple sizes of 64B (1 tuple per cycle), 32B (2), 16B (4), 8B (8) and 4B (16).]
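To make the formula concrete, here is a hedged, illustrative calculation; the bandwidth figure is an assumption, not a number taken from the slide. With an assumed memory bandwidth of 12.8 GB/s, an 8 B tuple would give a theoretical throughput of 12.8 GB/s / 8 B = 1600 million tuples/s, and a 4 B tuple would give 3200 million tuples/s: for a fixed bandwidth, halving the tuple size doubles the theoretical tuple rate.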

  16. Efficiency of shuffle
  • The throughput of our shuffle is much higher than that of the existing shuffling solutions.
  [Figure: throughput (million tuples/s) of [1], polling, and this paper for tuple sizes of 8B (8 tuples per cycle), 16B (4), 32B (2) and 64B (1).]
  [1] Z. Wang et al., "Multikernel data partitioning with channel on OpenCL-based FPGAs," IEEE TVLSI, vol. 25, no. 6, pp. 1906-1918, 2017.

  17. End-to-end performance
  • Compare the performance of graph processing frameworks built with the different shuffling solutions.
  • The speedup of PageRank is up to 100× over [1] and 6× over polling.
  [Figure: speedup of polling and of this paper over [1] (normalized to 1) on the graphs R21, R19, PK, LJ, MG, TW, GG and WT.]
  [1] Z. Wang et al., "Multikernel data partitioning with channel on OpenCL-based FPGAs," IEEE TVLSI, vol. 25, no. 6, pp. 1906-1918, 2017.

  18. Resource utilization
  • BRAMs are well utilized for vertex caching.
  • PR and SpMV consume DSPs.

  19. Comparison with RTL-based works
  • Our approach achieves throughput that is comparable to, or even better than, RTL-based graph processing designs.
  [11] S. Zhou, C. Chelmis, and V. K. Prasanna, "Optimizing memory performance for FPGA implementation of PageRank," in ReConFig, 2015.
  [13] S. Zhou, C. Chelmis, and V. K. Prasanna, "High-throughput and energy-efficient graph processing on FPGA," in FCCM, 2016.
  [14] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang, "ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture," in FPGA, 2017.
  [38] T. Oguntebi and K. Olukotun, "GraphOps: A dataflow library for graph analytics acceleration," in FPGA, 2016.

  20. Conclusion
  • Data shuffling on OpenCL-based FPGAs is challenging due to the run-time data dependency.
  • We propose an efficient OpenCL-based data shuffling method.
  • The performance of a graph processing framework integrated with our shuffling is comparable to that of state-of-the-art RTL-based works.

  21. Acknowledgement
  • This work is supported by a MoE AcRF Tier 1 grant (T1 251RES1824) and a Tier 2 grant (MOE2017-T2-1-122) in Singapore. It is also partly supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme, and the SenseTime Young Scholars Research Fund.
  • We also thank Intel for hardware access and donations.

