
SLIDE 1

Performance-driven system generation for distributed vertex-centric graph processing on multi-FPGA systems

Nina Engelhardt, C.-H. Dominic Hung, Hayden K.-H. So

The University of Hong Kong

28th August 2018

SLIDE 2

The GraVF graph processing framework

[Figure: array of PEs connected by an on-chip network, with the user kernel inserted into each PE]

  • Vertex-centric graph processing framework on FPGA
  • User provides a kernel, which is inserted into the framework architecture
  • PEs exchange messages over an on-chip network


SLIDE 3

Now extended to multiple FPGAs

  • Allow PEs to exchange messages with PEs on different FPGAs
  • Extend the network by adding an external interface and routing messages destined for PEs on other FPGAs over 10GbE (see the routing sketch below)
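
As a rough illustration of this routing decision, here is a minimal sketch in Python; the block PE-to-board mapping, the message layout, and N_PE_PER_FPGA are assumptions for illustration, not the framework's actual implementation.

# Hypothetical sketch of the inter-FPGA routing decision described above.
N_PE_PER_FPGA = 4  # assumed number of PEs per board

def route(message, local_board):
    """Keep a message on the local on-chip network, or return the remote
    board id so it is forwarded over the 10GbE interface."""
    dest_board = message["dest_pe"] // N_PE_PER_FPGA  # assumed block mapping
    if dest_board == local_board:
        return "local"      # deliver over the on-chip network
    return dest_board       # send via the external interface and switch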

[Figure: PEs on FPGA 1 through FPGA n, each board's on-chip network extended with a 10 GbE link to a shared Ethernet switch]

SLIDE 4

What’s the performance?

  • The vertex kernel has a well-defined interface
  • Can calculate the resources necessary to process one edge

[Figure: PE and its kernel interface: receive message, send message, update vertex data, read edge]

  • Build a roofline-style performance model based on platform resources
  • Use the model to automatically pick a configuration when generating the system (a sketch of the combined model follows slide 9)


SLIDE 5

Limiting factors

4 limits considered:

  • Processing element throughput
  • Memory bandwidth
  • Network interface bandwidth
  • Total network bandwidth


SLIDE 6

Processing element throughput

T_sys ≤ n_FPGA × n_PE/FPGA × f_clk / C_PE    (L_PE)

C_PE, cycles per edge: analogous to a processor's CPI, used together with the clock frequency to determine individual PE throughput, which is then multiplied by the number of PEs in the system. C_PE is affected by:

  • PE architecture
  • data hazards
  • kernel implementation
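
For a rough feel of the numbers, a worked example under assumed values (only C_PE = 1.6 comes from the results slides; f_clk and the PE/FPGA counts are illustrative): with 4 FPGAs, 4 PEs per FPGA, and f_clk = 125 MHz, L_PE = 4 × 4 × 125 MHz / 1.6 = 1250 MTEPS.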


SLIDE 7

Memory bandwidth

T_sys ≤ n_FPGA × BW_mem / m_edge    (L_mem)

(m_edge: size of one edge in memory)

  • Edges can be stored off-chip to increase the processable graph size
  • Edges can only be processed as fast as they can be loaded
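
As an illustration, assuming 64-bit edges (m_edge is not specified here) and the BW_mem = 2.5 Gbps random-access figure from slide 11: with one FPGA, L_mem = 2.5 Gbps / 64 bit ≈ 39 MTEPS, the same order as the DDR-limited results shown later.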


SLIDE 8

Network interface bandwidth

T_sys ≤ n_FPGA × BW_if / (2 × (n_FPGA − 1)/n_FPGA × m_message)    (L_if)

  • When using multiple FPGAs, messages need to be transferred over the external network interface
  • Assuming an equal distribution of vertices, a message has a (n_FPGA − 1)/n_FPGA chance of being sent to a different board
  • Each message traverses an interface twice: once sending, once receiving
  • Really a per-board limit; the extra factor n_FPGA on both sides gives system throughput
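
A worked example under assumed values (nominal 10GbE bandwidth BW_if = 10 Gbps and 64-bit messages, both illustrative): with n_FPGA = 4, L_if = 4 × 10 Gbps / (2 × 3/4 × 64 bit) ≈ 417 MTEPS.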


SLIDE 9

Total network bandwidth

T_sys ≤ BW_network / ((n_FPGA − 1)/n_FPGA × m_message)    (L_network)

  • Total amount of messages transferrable by the external network
  • Again, a fraction (n_FPGA − 1)/n_FPGA of messages needs to cross the external network
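
Putting the four limits together: the following is a minimal sketch, in Python, of the roofline-style model promised on slide 4, not the actual generator code. Every parameter value in the example call is an illustrative assumption (only C_PE = 1.6, BW_mem = 2.5 Gbps, and BW_network = 6.7 Gbps echo numbers quoted on the results slides).

# Minimal sketch of the four-limit roofline model (slides 6-9).
# All parameter values passed in the example below are assumptions.
def system_throughput(n_fpga, n_pe_per_fpga, f_clk, c_pe,
                      bw_mem, m_edge, bw_if, bw_network, m_message):
    """Predicted system throughput in edges/s: the tightest of the four limits."""
    l_pe = n_fpga * n_pe_per_fpga * f_clk / c_pe        # L_PE  (slide 6)
    l_mem = n_fpga * bw_mem / m_edge                    # L_mem (slide 7)
    if n_fpga > 1:
        remote = (n_fpga - 1) / n_fpga  # fraction of messages leaving the board
        l_if = n_fpga * bw_if / (2 * remote * m_message)  # L_if      (slide 8)
        l_net = bw_network / (remote * m_message)         # L_network (slide 9)
    else:
        l_if = l_net = float("inf")     # single board: no external traffic
    return min(l_pe, l_mem, l_if, l_net)

# Assumed example: 4 FPGAs x 4 PEs at 125 MHz, C_PE = 1.6, 64-bit edges and
# messages, 2.5 Gb/s memory, 10 Gb/s per interface, 6.7 Gb/s network total.
print(system_throughput(4, 4, 125e6, 1.6, 2.5e9, 64, 10e9, 6.7e9, 64) / 1e6,
      "MTEPS")  # with these assumed numbers the network limit (~140 MTEPS) binds

Automatic configuration picking then amounts to searching over the free parameters (e.g. n_PE/FPGA) for the setting that maximizes this minimum.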


SLIDE 10

Results

[Figure: three plots of system throughput (MTEPS) vs. number of PEs (1–16), each with its computation limit: PR (C_PE = 1.6), BFS (C_PE = 1.3), CC (C_PE = 1.2), on RMAT and uniform graphs]

  • Everything on-chip: no external memory, no network
  • PE throughput is the limiting factor
  • Close to the limit for uniform graphs; slowdown due to imbalance for RMAT graphs


SLIDE 11

Results

[Figure: system throughput (MTEPS) vs. number of PEs (1–8), showing the DDR limit and the computation limit (C_PE = 1.6), for BFS, PR, and CC on RMAT and uniform graphs]

  • Using external memory, but only one FPGA
  • The Xilinx MIG DDR3 controller's random-access performance (BW_mem = 2.5 Gbps) is limiting
  • Better performance at 1 PE, as accesses are more sequential


SLIDE 12

Results

[Figure: system throughput (MTEPS) vs. number of FPGAs (1–4), showing the network limit, for BFS and CC on RMAT graphs]

  • Using 4 FPGAs, no external memory
  • Network interface bandwidth (BW_network = 6.7 Gbps) is limiting
  • Imbalance has a greater impact here, further degrading performance


SLIDE 13

Conclusion

  • Graph algorithms are very communication-intensive: need to optimize interfaces
  • Performance can be predicted reasonably accurately, except for imbalance (which depends on input properties)


SLIDE 14

Thank you for listening! Questions? Visit poster board 8!
