A distributed model of computation for reconfigurable devices based on a streaming architecture



  1. A distributed model of computation for reconfigurable devices based on a streaming architecture
     Paolo Cretaro, National Institute for Nuclear Physics
     FPL 2019, Barcelona, September 2019

  2. The ExaNeSt project: hardware highlights
     Unit: Xilinx Zynq UltraScale+ FPGA
      - Four 64-bit ARM Cortex-A53 cores @ 1.5 GHz
      - Programmable logic
      - 16 high-speed serial links @ 16 Gbps
     Node: Quad-FPGA Daughter-Board (QFDB)
      - All-to-all internal connectivity
      - 10 HSS links to remote QFDBs (through the network FPGA)
      - 64 GB DDR4 RAM (16 GB per FPGA)
      - 512 GB NVMe SSD on the storage FPGA
     Blade/mezzanine
      - 4 QFDBs in Track 1
      - 2 HSS links per edge (local direct network)
      - 32 SFP+ connectors for the inter-mezzanine hybrid network
     I worked in the team that built the 3D torus network, based on a custom Virtual Cut-Through protocol.
     Paolo Cretaro - FPL2019, 10/09/2019

  3. Mixing acceleration and network
      - With High-Level Synthesis tools, FPGAs are becoming a viable way to accelerate tasks
      - Accelerators must be able to access the network directly to achieve low-latency communication among themselves and with other remote hosts
      - A dataflow programming paradigm could take advantage of this feature to optimize communication patterns and loads
     [Diagram: an accelerator, CPUs, and DDR memory attached to a system memory-mapped bus, with the accelerator also coupled directly to the network interface]

  4. Kahn process network advantages
     A group of sequential processes communicating through FIFO channels:
      - Determinism: for the same input history the network produces exactly the same output
      - No shared memory: processes can run concurrently and synchronize through blocking reads on input channel FIFOs
      - Distributing tasks over multiple devices is easy
     [Diagram: a small process network with processes A, B, C and P connected by FIFO channels]
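The two properties above can be illustrated with a minimal software sketch. This is a hypothetical Python model of a Kahn process network (not the talk's HLS implementation): two producers and an adder run as concurrent processes that communicate only through FIFO channels, so the same input history always yields the same output.

```python
# Minimal Kahn-process-network sketch: sequential processes communicating
# only through FIFO channels; blocking reads provide synchronization.
# All names here (producer, adder, run_network) are illustrative.
import threading
import queue

def producer(out_ch, values):
    # Emit a fixed input history, then a None sentinel to signal end-of-stream.
    for v in values:
        out_ch.put(v)
    out_ch.put(None)

def adder(in_a, in_b, out_ch):
    # Blocking reads on both input FIFOs synchronize the processes:
    # the adder cannot race ahead of its producers.
    while True:
        a = in_a.get()   # blocks until data is available
        b = in_b.get()
        if a is None or b is None:
            out_ch.put(None)
            return
        out_ch.put(a + b)

def run_network(xs, ys):
    ch_a, ch_b, ch_out = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=producer, args=(ch_a, xs)),
               threading.Thread(target=producer, args=(ch_b, ys)),
               threading.Thread(target=adder, args=(ch_a, ch_b, ch_out))]
    for t in threads:
        t.start()
    results = []
    while (v := ch_out.get()) is not None:
        results.append(v)
    for t in threads:
        t.join()
    return results

print(run_network([1, 2, 3], [10, 20, 30]))  # deterministic: [11, 22, 33]
```

Although thread scheduling is nondeterministic, the output is not: each channel preserves FIFO order and the adder blocks until both operands arrive, which is exactly the determinism property claimed on the slide.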

  5. Accelerator hardware interface
      - Virtual input/output channels for each source/destination
      - Direct host memory access for buffering and configuration (a device driver is needed)
      - Direct coupling with the network
     [Diagram: acceleration cores connected to network adapters and to host memory]
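The virtual-channel idea can be sketched in software as follows. This is a hypothetical Python model (names like NetworkAdapter are illustrative, not the project's actual interface): packets arriving on one physical link are tagged with their source and channel, and the adapter demultiplexes them into one FIFO per virtual input channel, so the acceleration core can perform KPN-style per-source blocking reads.

```python
# Hypothetical sketch of per-source/destination virtual channels:
# one physical link, demultiplexed into per-(source, channel) FIFOs.
from collections import deque

class NetworkAdapter:
    def __init__(self):
        self.vchannels = {}  # (source, channel) -> FIFO of payloads

    def receive(self, packet):
        # packet = (source, channel, payload), as arriving on the physical link
        source, channel, payload = packet
        self.vchannels.setdefault((source, channel), deque()).append(payload)

    def read(self, source, channel):
        # In hardware this would be a blocking FIFO read; here we just pop.
        return self.vchannels[(source, channel)].popleft()

adapter = NetworkAdapter()
# Interleaved traffic from two remote tasks sharing one link:
for pkt in [("Device0.Task0", 0, "x0"),
            ("Device0.Task1", 0, "y0"),
            ("Device0.Task0", 0, "x1")]:
    adapter.receive(pkt)

print(adapter.read("Device0.Task0", 0))  # "x0": per-channel order preserved
print(adapter.read("Device0.Task1", 0))  # "y0"
```

The key point is that interleaving on the shared link does not disturb per-channel FIFO order, which is what makes the Kahn semantics of the previous slide hold across the network.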

  6. Steps description
     1. Write kernels in HLS
     2. A config file delineates tasks and data dependencies
     3. A directed graph is built and mapped onto the network topology
     4. Accelerator blocks are flashed on the targeted nodes
     5. Data is fed into entry points and tasks are started
     6. Each task consumes its data and sends the results to the next ones
     [Diagram: a task graph with nodes A-E mapped onto a grid of computing units (CU 0-8), e.g. A on CU 0, B on CU 1, C on CU 3, D and E on CU 4]
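Steps 3, 5 and 6 can be expressed in software terms with a hypothetical sketch (the function and names here are illustrative, not the real toolchain): build a directed task graph from declared dependencies, feed data into the entry points, then fire each task once its inputs are complete and forward its result to its successors.

```python
# Hypothetical sketch of graph construction and data-driven execution:
# edges declare dependencies, entry tasks are fed, tasks fire when ready.
def run_graph(edges, kernels, entry_inputs):
    # edges: list of (producer, consumer); kernels: task -> function(inputs)
    consumers, indegree, inbox = {}, {}, {}
    tasks = set(kernels)
    for src, dst in edges:
        consumers.setdefault(src, []).append(dst)
        indegree[dst] = indegree.get(dst, 0) + 1
    for t in tasks:
        indegree.setdefault(t, 0)
        inbox[t] = []
    for task, value in entry_inputs.items():   # step 5: feed entry points
        inbox[task].append(value)
    ready = [t for t in tasks if indegree[t] == 0]
    results = {}
    while ready:                               # step 6: data-driven firing
        task = ready.pop()
        results[task] = kernels[task](inbox[task])
        for nxt in consumers.get(task, []):
            inbox[nxt].append(results[task])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return results

# A tiny graph: A and B feed C, which sums its inputs.
res = run_graph([("A", "C"), ("B", "C")],
                {"A": lambda ins: ins[0] + 1,
                 "B": lambda ins: ins[0] * 2,
                 "C": sum},
                {"A": 10, "B": 10})
print(res["C"])  # 31
```

On the real system each kernel would be an HLS block flashed on a node (step 4) and the edges would be the network channels, but the firing rule is the same: a task runs when all of its input data has arrived.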

  7. Simplified task graph configuration example

     Device0 {
         Type: FPGA
         Task0 {
             Impl: source_task.c
             Input_channels: 0
             Output_channels {
                 Ch0: Device1.Task0.Ch1
             }
         }
         Task1 {
             Impl: source_task.c
             Input_channels: 0
             Output_channels {
                 Ch0: Device1.Task0.Ch0
             }
         }
     }
     Device1 {
         Type: FPGA
         Task0 {
             Impl: example_task.c
             Input_channels: 2
             Output_channels {
                 Ch0: Device1.Task1.Ch0
             }
         }
         Task1 {
             Impl: sink_task.c
             Input_channels: 1
         }
     }
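A format like the one on this slide, nested "Name { ... }" blocks containing "Key: value" lines, is easy to turn into a task/dependency structure. The parser below is a hypothetical sketch (not the project's actual tool) that converts such text into plain nested dictionaries.

```python
# Hypothetical parser sketch for the slide's configuration format:
# nested "Name { ... }" blocks and "Key: value" lines become nested dicts.
import re

def parse_config(text):
    # Tokens are either structural characters ({, }, :) or bare words,
    # where words may contain dots, e.g. "Device1.Task0.Ch1".
    tokens = re.findall(r'[^\s{}:]+|[{}:]', text)
    pos = 0

    def parse_block():
        nonlocal pos
        block = {}
        while pos < len(tokens) and tokens[pos] != '}':
            name = tokens[pos]; pos += 1
            if tokens[pos] == '{':
                pos += 1                     # consume '{'
                block[name] = parse_block()
                pos += 1                     # consume matching '}'
            elif tokens[pos] == ':':
                pos += 1                     # consume ':'
                block[name] = tokens[pos]; pos += 1
        return block

    return parse_block()

cfg = parse_config("""
Device0 {
    Type: FPGA
    Task0 {
        Impl: source_task.c
        Input_channels: 0
        Output_channels { Ch0: Device1.Task0.Ch1 }
    }
}
""")
print(cfg["Device0"]["Task0"]["Output_channels"]["Ch0"])  # Device1.Task0.Ch1
```

From such a structure, the channel references (e.g. Device1.Task0.Ch1) give exactly the directed edges needed to build and map the task graph of the previous slide.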

  8. Thank you!
