On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-based FPGAs



SLIDE 1

On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-based FPGAs

Xinyu Chen¹, Ronak Bajaj¹, Yao Chen², Jiong He³, Bingsheng He¹, Weng-Fai Wong¹, Deming Chen⁴

¹National University of Singapore, ²Advanced Digital Sciences Center, ³Alibaba Group, ⁴University of Illinois at Urbana-Champaign

SLIDE 2

Graph processing on FPGAs

  • Graph processing is widely used in a variety of application domains:
      • Social networks
      • Cybersecurity
      • Machine learning
  • Accelerating graph processing on FPGAs has attracted a lot of attention, thanks to their:
      • Fine-grained parallelism
      • Low power consumption
      • Extreme configurability

SLIDE 3

Graph processing on HLS-based FPGAs

  • Previous FPGA development was RTL-based:
      • Time-consuming
      • Requires a deep understanding of hardware
  • To ease the use of FPGAs, HLS tools have been proposed:
      • High-level programming model
      • Hide hardware details
      • Both Intel and Xilinx provide HLS tools
  • This work: graph processing on OpenCL-based FPGAs.

SLIDE 4

GAS model for graph processing

  • Scatter: for each edge, an update tuple is generated in the format <destination, value>.
      • E.g. <2, x>, <7, y> for vertex 1
  • Gather: accumulate the value into the destination vertices.
      • E.g. Op(P2, x), Op(P7, y)
  • Apply: run an apply function on all the vertices.

A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel, “Chaos: Scale-out graph processing from secondary storage,” in SOSP, 2015.

[Figure: example graph (vertices 1, 2, 3 and 7) with the property array P0 to P7, showing the memory reads and writes performed for vertex 1.]
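The three GAS phases can be sketched in software as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the edge layout `(source, destination, weight)`, the function name `gas_iteration`, the temporary accumulator, and the SSSP-style operators are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int, float]  # (source, destination, weight): assumed layout


def gas_iteration(
    edges: List[Edge],
    prop: List[float],
    scatter: Callable[[float, float], float],
    gather: Callable[[float, float], float],
    apply: Callable[[float, float], float],
) -> List[float]:
    """Run one scatter-gather-apply pass and return the new property array."""
    # Scatter: one <destination, value> update tuple per edge.
    updates = [(dst, scatter(prop[src], w)) for src, dst, w in edges]

    # Gather: accumulate each value into its destination vertex.
    acc: Dict[int, float] = {}
    for dst, value in updates:
        acc[dst] = gather(acc[dst], value) if dst in acc else value

    # Apply: run the apply function on every vertex that received an update.
    return [apply(p, acc[v]) if v in acc else p for v, p in enumerate(prop)]


# Vertex 1 has edges to vertices 2 and 7, matching the <2, x>, <7, y> example.
INF = float("inf")
edges = [(1, 2, 1.0), (1, 7, 4.0)]
prop = [INF, 0.0, INF, INF, INF, INF, INF, INF]  # P0..P7, source is vertex 1
new_prop = gas_iteration(
    edges,
    prop,
    scatter=lambda p, w: p + w,  # assumed SSSP-style relaxation
    gather=min,
    apply=min,
)
print(new_prop[2], new_prop[7])  # prints "1.0 4.0"
```

Note that the gather loop writes to whichever destination the tuple names, which is the random access pattern into the property array that the slides go on to address.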

SLIDE 5

GAS model on FPGAs

  • BRAM caching
      • Avoids random memory accesses to the property array.
  • Multiple PEs
      • Each PE processes a part of the cached data and runs independently.

[Figure: the example graph with the property array P0 to P7 held in BRAM and divided between PE 0 and PE 1. The update tuples <2, x> and <7, y> generated for vertex 1 reach the PEs through data shuffling.]

SLIDE 6

Data shuffling

  • Widely used in irregular applications.
  • Data generated in the format <dst, value> is dispatched to the ‘dst’ PE for processing.
  • Challenges:
      • Run-time data dependency
      • Parallelism

[Figure: data tuples D0 to D7 from PE0 to PE7 in Stage 0 are dispatched to PE0 to PE7 in Stage 1. Arrows with different colours show a few shuffling examples.]
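As a software analogy for the dispatch step, the sketch below routes <dst, value> tuples to per-PE queues. It is illustrative only: the modulo mapping in `owner_pe` and the names `shuffle` and `NUM_PES` are assumptions, since the slides do not specify how a destination is mapped to a PE.

```python
from typing import Any, List, Tuple

NUM_PES = 8  # matches the 8 PEs drawn on the slide

UpdateTuple = Tuple[int, Any]  # <dst, value>


def owner_pe(dst: int, num_pes: int = NUM_PES) -> int:
    """Map a destination vertex to a PE (modulo hashing is an assumption)."""
    return dst % num_pes


def shuffle(tuples: List[UpdateTuple], num_pes: int = NUM_PES) -> List[List[UpdateTuple]]:
    """Dispatch every tuple to the queue of the PE that owns its destination."""
    queues: List[List[UpdateTuple]] = [[] for _ in range(num_pes)]
    for dst, value in tuples:
        queues[owner_pe(dst, num_pes)].append((dst, value))
    return queues


queues = shuffle([(2, "x"), (7, "y"), (10, "z")])
print(queues[2])  # prints "[(2, 'x'), (10, 'z')]"
print(queues[7])  # prints "[(7, 'y')]"
```

How many tuples land in each queue is only known once the data is seen, which is the run-time data dependency listed under the challenges.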

SLIDE 7

OpenCL does not natively support shuffling

  • Fine-grained control logic is not available in OpenCL.
  • There is no vendor-specific extension for shuffling [1].
  • OpenCL compilers only perform static analysis at compile time, so they cannot extract parallelism from functions with run-time dependencies [2].

[1] N. Kapre and H. Patel, “Applying Models of Computation to OpenCL Pipes for FPGA Computing,” in Proceedings of the 5th International Workshop on OpenCL, ACM, 2017.
[2] Z. Li, L. Liu, Y. Deng, S. Yin, Y. Wang, and S. Wei, “Aggressive pipelining of irregular applications on reconfigurable hardware,” in ISCA, 2017.

SLIDE 8

Potential shuffling solutions with OpenCL

  • Polling
      • Each PE checks the tuples serially.
      • ‘Bubbles’ are introduced.
      • Dispatching a set of 8 tuples takes 8 cycles.

[Figure: data tuples D0 to D7 from Stage 0 (PE0 to PE7) are polled by Stage 1 (PE0 to PE7); cycles spent on unwanted tuples appear as bubbles.]
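The cost of polling can be illustrated with a small cycle-counting model. This is a sketch, not a hardware description: the function name `polling_cycles` and the destination values are hypothetical, and one loop iteration stands in for one cycle.

```python
from typing import List, Tuple


def polling_cycles(dest_pes: List[int], pe_id: int) -> Tuple[int, List[int]]:
    """Return (cycles spent, indices collected) for one PE polling one tuple set."""
    cycles = 0
    collected: List[int] = []
    for index, dest in enumerate(dest_pes):
        cycles += 1  # one cycle per tuple, whether it is wanted or not
        if dest == pe_id:
            collected.append(index)
    return cycles, collected


# Hypothetical destination PEs for tuples D0..D7; only D2 and D6 target PE0.
dest_pes = [1, 3, 0, 5, 2, 7, 0, 4]
cycles, collected = polling_cycles(dest_pes, pe_id=0)
bubbles = cycles - len(collected)
print(cycles, collected, bubbles)  # prints "8 [2, 6] 6"
```

In this example PE0 spends 8 cycles to collect 2 tuples, so 6 of the 8 cycles are bubbles.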

SLIDE 9

Potential shuffling solutions with OpenCL

  • Convergence kernel from [1]
      • Each PE writes its wanted tuples to local BRAM in parallel.
      • The run-time data dependency is not resolved.
      • The initiation interval (II) is 284 cycles.

[Figure: data tuples D0 to D7 from Stage 0 (PE0 to PE7) pass through processing logic to Stage 1 (PE0 to PE7); conflicts arise.]

[1] Wang, Zeke, et al. "Multikernel Data Partitioning With Channel on OpenCL-Based FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.6 (2017): 1906-1918.

SLIDE 10

Insights

  • Polling: introduces ‘bubbles’.
  • Convergence kernel: the run-time dependency is still there.
  • What if we know the positions and number of wanted tuples?
      • PEs can directly access the wanted tuples.
      • The cycles needed equal the number of wanted tuples.
  • How can we know the positions and number of wanted tuples?
      • Decoder-based solution.
      • E.g. 2^8 possibilities for a set of 8 tuples, since each tuple has only two statuses.
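The decoder idea can be made concrete by enumerating every wanted/unwanted pattern for a set of 8 tuples. The sketch below is an assumption-laden illustration (the names `decode` and `DECODER_TABLE` are hypothetical, and a Python list stands in for a hardware lookup table): each of the 2^8 = 256 masks maps to the number and positions of the wanted tuples.

```python
from typing import List, Tuple

SET_SIZE = 8  # tuples per set, as on the slide


def decode(mask: int) -> Tuple[int, List[int]]:
    """Return (number of wanted tuples, their positions) for one mask."""
    positions = [i for i in range(SET_SIZE) if (mask >> i) & 1]
    return len(positions), positions


# One entry per possible mask: each tuple is either wanted (1) or not (0).
DECODER_TABLE = [decode(mask) for mask in range(1 << SET_SIZE)]

print(len(DECODER_TABLE))         # prints "256"
print(DECODER_TABLE[0b01000100])  # prints "(2, [2, 6])"
```

With such a table, a PE that wants 2 of 8 tuples can fetch them in 2 steps instead of scanning all 8.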

SLIDE 11

Proposed shuffling

  • Calculate the destination PEs.
  • Compute an 8-bit MASK by comparing the destination PEs with the ID of the current PE (PE 0 in this example).
  • Decode the positions and number of wanted tuples.
  • Collect the wanted tuples without ‘bubbles’.

[Figure: an example for a set of 8 tuples (index 7 down to 0) on PE0.
  Validate: each tuple's destination PE is compared with the PE ID (hash_val == PE_ID ? 1 : 0) to produce the MASK.
  Decoder: Num = 2; Pos = 2, 6.
  Filter: collects the tuples at the decoded positions.]

Decoder entries (Number; Positions), with positions packed as base-8 digits:

  0; 00000000₈
  1; 00000000₈
  1; 00000001₈
  …
  7; 06543210₈
  8; 76543210₈

E.g. MASK (01000100) decodes to (2; 00000062₈).
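The validate, decode and filter steps for PE0 can be sketched as below. This is a software illustration under assumptions, not the authors' OpenCL kernels: the function names are hypothetical, the destination PEs are made up so that tuples 2 and 6 target PE0 (giving the slide's MASK 01000100), and the positions are packed as 3-bit digits to mirror the base-8 notation on the slide.

```python
from typing import List, Tuple


def validate(dest_pes: List[int], pe_id: int) -> int:
    """Build the 8-bit MASK: bit i is 1 when tuple i targets this PE."""
    mask = 0
    for index, dest in enumerate(dest_pes):
        if dest == pe_id:
            mask |= 1 << index
    return mask


def decode(mask: int, set_size: int = 8) -> Tuple[int, int]:
    """Return (number, positions) with positions packed as octal digits."""
    number = 0
    packed = 0
    for index in range(set_size):
        if (mask >> index) & 1:
            packed |= index << (3 * number)  # next free 3-bit digit
            number += 1
    return number, packed


def filter_tuples(tuples: List[str], number: int, packed: int) -> List[str]:
    """Collect only the wanted tuples: one step per wanted tuple, no bubbles."""
    return [tuples[(packed >> (3 * k)) & 0b111] for k in range(number)]


tuples = [f"D{i}" for i in range(8)]
dest_pes = [1, 3, 0, 5, 2, 7, 0, 4]  # hypothetical; D2 and D6 target PE0

mask = validate(dest_pes, pe_id=0)
number, packed = decode(mask)
print(format(mask, "08b"))                    # prints "01000100"
print(number, format(packed, "08o"))          # prints "2 00000062"
print(filter_tuples(tuples, number, packed))  # prints "['D2', 'D6']"
```

The filter loop runs `number` times rather than 8 times, which corresponds to the slide's claim that the cycles needed equal the number of wanted tuples.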

SLIDE 12

Proposed shuffling

  • No ‘bubbles’: no cycles are wasted on unwanted tuples.
  • Resolves the run-time dependency.
  • All the modules are pipelined.

SLIDE 13

Proposed graph processing framework with shuffle

[Figure: block diagram of the framework.
  Scatter: sPE0 to sPEN-1 read from DDR (2N*32-bit per read) and produce tuples (<D0,V0>, …, <DN,VN>).
  Shuffle: data duplication and N-way PE selection over (<D0,V0,H0>, …, <DN-1,VN-1,HN-1>), followed by Validation, Decoder and Filter modules 0 to 2N-1.
  Gather: gPE0 to gPE2N-1 with Func0 to Func2N-1.
  Apply: aPE0 to aPEx-1, connected to DDR.]

SLIDE 14

Experimental configuration

  • Our experiments are conducted on a Terasic DE5-Net board.
  • BFS, SSSP, PageRank and SpMV are used as applications.

[34] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, “Kronecker graphs: An approach to modeling networks,” JMLR, 2010.
[35] R. A. Rossi and N. K. Ahmed, “The network data repository with interactive graph analytics and visualization,” in AAAI, 2015.

SLIDE 15

Efficiency of shuffle

  • Theoretical throughput = memory_bandwidth / tuple_size
  • The measured performance is close to the theoretical throughput.

[Figure: measured throughput and theoretical throughput (million tuples/s) and bandwidth utilization (%) for tuple sizes 64B (1), 32B (2), 16B (4), 8B (8) and 4B (16); the number in parentheses is the number of tuples per cycle.]
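The throughput formula can be worked through numerically. In the sketch below, the bandwidth figure `ASSUMED_BANDWIDTH` is a hypothetical value chosen for illustration, not a number taken from the slides or from the board's specification, and the function names are likewise assumptions.

```python
def theoretical_throughput(bandwidth_bytes_per_s: float, tuple_size_bytes: int) -> float:
    """Tuples per second if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / tuple_size_bytes


def bandwidth_utilization(measured_tuples_per_s: float, theoretical_tuples_per_s: float) -> float:
    """Fraction of the theoretical throughput that was actually achieved."""
    return measured_tuples_per_s / theoretical_tuples_per_s


ASSUMED_BANDWIDTH = 12.8e9  # bytes/s: hypothetical, for illustration only

for size in (64, 32, 16, 8, 4):
    rate = theoretical_throughput(ASSUMED_BANDWIDTH, size)
    print(f"{size:>2} B tuples: {rate / 1e6:.0f} million tuples/s")
```

Halving the tuple size doubles the theoretical tuple rate, so smaller tuples require the shuffle to handle more tuples per cycle to keep up with memory.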

SLIDE 16

Efficiency of shuffle

  • The throughput of our shuffle is much higher than that of existing shuffling solutions.

[Figure: throughput (million tuples/s) of [1], polling and this paper for tuple sizes 8B (8), 16B (4), 32B (2) and 64B (1); the number in parentheses is the number of tuples per cycle.]

[1] Wang, Zeke, et al. "Multikernel Data Partitioning With Channel on OpenCL-Based FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.6 (2017): 1906-1918.

SLIDE 17

End to end performance

  • We compare the performance of graph frameworks with different shuffling solutions.
  • The speedup for PageRank is up to 100× over [1] and 6× over polling.

[Figure: speedup of [1], polling and this paper on the datasets R21, R19, PK, LJ, MG, TW, GG and WT.]

[1] Wang, Zeke, et al. "Multikernel Data Partitioning With Channel on OpenCL-Based FPGAs." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25.6 (2017): 1906-1918.

SLIDE 18

Resource utilization

  • BRAMs are well utilized for vertex caching.
  • PageRank (PR) and SpMV consume DSPs.

SLIDE 19

Compare with RTL-based works

  • Our approach achieves throughput that is comparable to, or even better than, that of RTL-based graph processing designs.

[11] S. Zhou, C. Chelmis, and V. K. Prasanna, “Optimizing memory performance for FPGA implementation of pagerank,” in ReConFig, 2015.
[13] S. Zhou, C. Chelmis, and V. K. Prasanna, “High-throughput and energy-efficient graph processing on FPGA,” in FCCM, 2016.
[14] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang, “Foregraph: Exploring large-scale graph processing on multi-FPGA architecture,” in FPGA, 2017.
[38] T. Oguntebi and K. Olukotun, “Graphops: A dataflow library for graph analytics acceleration,” in FPGA, 2016.

SLIDE 20

Conclusion

  • Data shuffling on OpenCL-based FPGAs is challenging due to the run-time data dependency.
  • We propose an efficient OpenCL-based data shuffling method.
  • The performance of the graph processing framework that integrates our shuffling is comparable to state-of-the-art RTL-based works.

SLIDE 21

Acknowledgement

  • This work is supported by a MoE AcRF Tier 1 grant (T1 251RES1824) and Tier 2 grant (MOE2017-T2-1-122) in Singapore. This work is also partly supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme, and the SenseTime Young Scholars Research Fund.
  • We also thank Intel for hardware access and donations.

SLIDE 23

Data shuffling on RTL-based FPGAs

  • NoCs based on fine-grained control logic.

[Figure: data tuples D0 to D7 from Stage 0 (PE0 to PE7) are delivered to Stage 1 (PE0 to PE7) through a routing network.]

SLIDE 24

Outline

  • Introduction to data shuffling
  • Data shuffling on OpenCL-based FPGAs
      • Motivations
      • Design and implementation
  • Graph processing framework with proposed shuffling
  • Evaluation
  • Conclusion
  • Acknowledgement