
SLIDE 1

Performance-driven system generation for distributed vertex-centric graph processing on multi-FPGA systems

Nina Engelhardt, C.-H. Dominic Hung, Hayden K.-H. So

The University of Hong Kong

28th August 2018

SLIDE 2

The GraVF graph processing framework

[Figure: array of PEs connected by an on-chip network, with the user kernel inserted into each PE]

  • Vertex-centric graph processing framework on FPGA
  • User provides a kernel, which is inserted into the framework architecture
  • PEs exchange messages over an on-chip network


SLIDE 3

Now extended to multiple FPGAs

  • Allow PEs to exchange messages with PEs on different FPGAs
  • Extend the network by adding an external interface and routing messages destined for PEs on other FPGAs over 10GbE (see the routing sketch below)
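
As a rough illustration of this routing decision, here is a minimal sketch in Python; the block PE-to-board mapping, the message layout, and N_PE_PER_FPGA are assumptions for illustration, not the framework's actual implementation.

# Hypothetical sketch of the inter-FPGA routing decision described above.
N_PE_PER_FPGA = 4  # assumed number of PEs per board

def route(message, local_board):
    """Keep a message on the local on-chip network, or return the remote
    board id so it is forwarded over the 10GbE interface."""
    dest_board = message["dest_pe"] // N_PE_PER_FPGA  # assumed block mapping
    if dest_board == local_board:
        return "local"      # deliver over the on-chip network
    return dest_board       # send via the external interface and switch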

[Figure: PEs on FPGA 1 through FPGA n, each board's on-chip network extended with a 10 GbE link to a shared Ethernet switch]

SLIDE 4

What’s the performance?

  • The vertex kernel has a well-defined interface
  • Can calculate the resources necessary to process one edge

[Figure: PE and its kernel interface: receive message, send message, update vertex data, read edge]

  • Build a roofline-style performance model based on platform resources
  • Use the model to automatically pick a configuration when generating the system (a sketch of the combined model follows slide 9)


SLIDE 5

Limiting factors

4 limits considered:

  • Processing element throughput
  • Memory bandwidth
  • Network interface bandwidth
  • Total network bandwidth


SLIDE 6

Processing element throughput

T_sys ≤ n_FPGA × n_PE/FPGA × f_clk / C_PE    (L_PE)

C_PE, cycles per edge: analogous to a processor's CPI, used together with the clock frequency to determine individual PE throughput, which is then multiplied by the number of PEs in the system. C_PE is affected by:

  • PE architecture
  • data hazards
  • kernel implementation
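
For a rough feel of the numbers, a worked example under assumed values (only C_PE = 1.6 comes from the results slides; f_clk and the PE/FPGA counts are illustrative): with 4 FPGAs, 4 PEs per FPGA, and f_clk = 125 MHz, L_PE = 4 × 4 × 125 MHz / 1.6 = 1250 MTEPS.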


SLIDE 7

Memory bandwidth

T_sys ≤ n_FPGA × BW_mem / m_edge    (L_mem)

(m_edge: size of one edge in memory)

  • Edges can be stored off-chip to increase the processable graph size
  • Edges can only be processed as fast as they can be loaded
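
As an illustration, assuming 64-bit edges (m_edge is not specified here) and the BW_mem = 2.5 Gbps random-access figure from slide 11: with one FPGA, L_mem = 2.5 Gbps / 64 bit ≈ 39 MTEPS, the same order as the DDR-limited results shown later.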


SLIDE 8

Network interface bandwidth

T_sys ≤ n_FPGA × BW_if / (2 × (n_FPGA − 1)/n_FPGA × m_message)    (L_if)

  • When using multiple FPGAs, messages need to be transferred over the external network interface
  • Assuming an equal distribution of vertices, a message has a (n_FPGA − 1)/n_FPGA chance of being sent to a different board
  • Each message traverses an interface twice: once sending, once receiving
  • Really a per-board limit; the extra factor n_FPGA on both sides gives system throughput
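
A worked example under assumed values (nominal 10GbE bandwidth BW_if = 10 Gbps and 64-bit messages, both illustrative): with n_FPGA = 4, L_if = 4 × 10 Gbps / (2 × 3/4 × 64 bit) ≈ 417 MTEPS.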


SLIDE 9

Total network bandwidth

T_sys ≤ BW_network / ((n_FPGA − 1)/n_FPGA × m_message)    (L_network)

  • Total amount of messages transferrable by the external network
  • Again, a fraction (n_FPGA − 1)/n_FPGA of messages needs to cross the external network
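
Putting the four limits together: the following is a minimal sketch, in Python, of the roofline-style model promised on slide 4, not the actual generator code. Every parameter value in the example call is an illustrative assumption (only C_PE = 1.6, BW_mem = 2.5 Gbps, and BW_network = 6.7 Gbps echo numbers quoted on the results slides).

# Minimal sketch of the four-limit roofline model (slides 6-9).
# All parameter values passed in the example below are assumptions.
def system_throughput(n_fpga, n_pe_per_fpga, f_clk, c_pe,
                      bw_mem, m_edge, bw_if, bw_network, m_message):
    """Predicted system throughput in edges/s: the tightest of the four limits."""
    l_pe = n_fpga * n_pe_per_fpga * f_clk / c_pe        # L_PE  (slide 6)
    l_mem = n_fpga * bw_mem / m_edge                    # L_mem (slide 7)
    if n_fpga > 1:
        remote = (n_fpga - 1) / n_fpga  # fraction of messages leaving the board
        l_if = n_fpga * bw_if / (2 * remote * m_message)  # L_if      (slide 8)
        l_net = bw_network / (remote * m_message)         # L_network (slide 9)
    else:
        l_if = l_net = float("inf")     # single board: no external traffic
    return min(l_pe, l_mem, l_if, l_net)

# Assumed example: 4 FPGAs x 4 PEs at 125 MHz, C_PE = 1.6, 64-bit edges and
# messages, 2.5 Gb/s memory, 10 Gb/s per interface, 6.7 Gb/s network total.
print(system_throughput(4, 4, 125e6, 1.6, 2.5e9, 64, 10e9, 6.7e9, 64) / 1e6,
      "MTEPS")  # with these assumed numbers the network limit (~140 MTEPS) binds

Automatic configuration picking then amounts to searching over the free parameters (e.g. n_PE/FPGA) for the setting that maximizes this minimum.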


SLIDE 10

Results

[Figure: three plots of system throughput (MTEPS) vs. number of PEs (1–16), each with its computation limit: PR (C_PE = 1.6), BFS (C_PE = 1.3), CC (C_PE = 1.2), on RMAT and uniform graphs]

  • Everything on-chip: no external memory, no network
  • PE throughput is the limiting factor
  • Close to the limit for uniform graphs; slowdown due to imbalance for RMAT graphs


SLIDE 11

Results

[Figure: system throughput (MTEPS) vs. number of PEs (1–8), showing the DDR limit and the computation limit (C_PE = 1.6), for BFS, PR, and CC on RMAT and uniform graphs]

  • Using external memory, but only one FPGA
  • The Xilinx MIG DDR3 controller's random-access performance (BW_mem = 2.5 Gbps) is limiting
  • Better performance at 1 PE, as accesses are more sequential


SLIDE 12

Results

[Figure: system throughput (MTEPS) vs. number of FPGAs (1–4), showing the network limit, for BFS and CC on RMAT graphs]

  • Using 4 FPGAs, no external memory
  • Network interface bandwidth (BW_network = 6.7 Gbps) is limiting
  • Imbalance has a greater impact here, further degrading performance


SLIDE 13

Conclusion

  • Graph algorithms are very communication-intensive: need to optimize interfaces
  • Performance can be predicted reasonably accurately, except for imbalance (which depends on input properties)


SLIDE 14

Thank you for listening! Questions? Visit poster board 8!
