

SLIDE 1

A distributed model of computation for reconfigurable devices based on a streaming architecture

Paolo Cretaro National Institute for Nuclear Physics FPL 2019 Barcelona, September 2019

SLIDE 2

The ExaNeSt project: hardware highlights

Unit: Xilinx Zynq UltraScale+ FPGA
 Four 64-bit ARM Cortex-A53 cores @ 1.5 GHz
 Programmable logic
 16 high-speed serial links @ 16 Gbps

Node: Quad-FPGA Daughter-Board (QFDB)
 All-to-all internal connectivity
 10 HSS links to remote QFDBs (through the network FPGA)
 64 GB DDR4 RAM (16 GB per FPGA)
 512 GB NVMe SSD on the storage FPGA

Blade/mezzanine
 4 QFDBs in Track 1
 2 HSS links per edge (local direct network)
 32 SFP+ connectors for the inter-mezzanine hybrid network

10/09/2019 Paolo Cretaro - FPL2019 2

I worked on the team that built the 3D torus network, based on a custom Virtual Cut-Through protocol

SLIDE 3

Mixing acceleration and network

 With High-Level Synthesis tools, FPGAs are becoming a viable way to accelerate tasks
 Accelerators must be able to access the network directly to achieve low-latency communication among themselves and with other remote hosts
 A dataflow programming paradigm could take advantage of this feature to optimize communication patterns and loads


[Diagram: two hosts (CPU + DDR) on a system memory-mapped bus; an accelerator and a network interface sit on the bus, with the network interface attached directly to the network]

SLIDE 4

Kahn process networks: advantages

A group of sequential processes communicating through FIFO channels
 Determinism: for the same input history, the network produces exactly the same output
 No shared memory: processes run concurrently and synchronize through blocking reads on input channel FIFOs
 Distributing tasks over multiple devices is easy
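The properties above can be sketched in software: a minimal Kahn process network with two sequential processes connected by a FIFO channel, where the consumer synchronizes only through blocking reads (all names here are illustrative, not part of the actual framework).

```python
import threading
import queue

def producer(out_ch, values):
    # Sequential process: writes its tokens to the output FIFO channel
    for v in values:
        out_ch.put(v)
    out_ch.put(None)  # end-of-stream marker

def doubler(in_ch, out_ch):
    # Blocking read on the input FIFO: get() waits until a token arrives,
    # which is the only synchronization mechanism (no shared memory)
    while True:
        v = in_ch.get()
        if v is None:
            out_ch.put(None)
            break
        out_ch.put(2 * v)

def run_network(values):
    a_to_b = queue.Queue()   # FIFO channel: producer -> doubler
    b_to_sink = queue.Queue()  # FIFO channel: doubler -> sink
    ta = threading.Thread(target=producer, args=(a_to_b, values))
    tb = threading.Thread(target=doubler, args=(a_to_b, b_to_sink))
    ta.start()
    tb.start()
    result = []
    while True:  # the sink collects tokens until end-of-stream
        v = b_to_sink.get()
        if v is None:
            break
        result.append(v)
    ta.join()
    tb.join()
    return result

print(run_network([1, 2, 3]))  # -> [2, 4, 6]
```

Because each process only blocks on its own input channel, the output depends solely on the input history, regardless of how the threads are scheduled.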


[Diagram: a Kahn process network with processes A, B, C connected by FIFO channels]

SLIDE 5

Accelerator hardware interface

 Virtual input/output channels for each source/destination
 Direct host memory access for buffering and configuration (a device driver is needed)
 Direct coupling with the network
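One way to picture the adapter's role: several virtual channels share one physical link by tagging each token with its channel id, and the receiving side demultiplexes tokens into per-channel FIFOs. This is a behavioral sketch only (class and method names are hypothetical, not the hardware interface).

```python
from collections import deque

class ChannelAdapter:
    """Multiplexes virtual I/O channels over a single physical link."""

    def __init__(self, n_channels):
        self.link = deque()  # models the shared physical link
        self.rx = [deque() for _ in range(n_channels)]  # per-channel FIFOs

    def send(self, channel, token):
        # Tag the token with its virtual-channel id before transmission
        self.link.append((channel, token))

    def deliver(self):
        # Demultiplex: route each tagged token to its destination FIFO
        while self.link:
            ch, token = self.link.popleft()
            self.rx[ch].append(token)

    def recv(self, channel):
        # Consume the next token from one virtual channel's FIFO
        return self.rx[channel].popleft()

adapter = ChannelAdapter(2)
adapter.send(0, "a")
adapter.send(1, "b")
adapter.deliver()
print(adapter.recv(1))  # -> b
```

Tokens on different virtual channels never block each other after demultiplexing, which is what lets one physical link serve many source/destination pairs.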


[Diagram: two acceleration cores, each connected through adapters to host memory and to the network]

SLIDE 6

Steps description

1. Write kernels in HLS
2. A config file delineates tasks and data dependencies
3. A directed graph is built and mapped onto the network topology
4. Accelerator blocks are flashed on targeted nodes
5. Data is fed into entry points and tasks are started
6. Each task consumes its data and sends the results to the next ones
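Steps 2 and 3 can be sketched as follows: parse the task/dependency description into a directed graph, then place tasks on compute units. The config encoding and the round-robin placement here are deliberately simplified assumptions, not the project's actual mapping algorithm.

```python
def build_graph(config):
    # config: {device: {task: [(dst_device, dst_task), ...]}}
    # Returns the directed edges of the task graph.
    edges = []
    for dev, tasks in config.items():
        for task, outputs in tasks.items():
            for dst in outputs:
                edges.append(((dev, task), dst))
    return edges

def map_to_nodes(edges, compute_units):
    # Naive placement: assign tasks to compute units round-robin,
    # in the order they first appear in the graph.
    tasks = []
    for src, dst in edges:
        for t in (src, dst):
            if t not in tasks:
                tasks.append(t)
    return {t: compute_units[i % len(compute_units)]
            for i, t in enumerate(tasks)}

# Mirrors the simplified configuration on the next slide
config = {
    "Device0": {"Task0": [("Device1", "Task0")],
                "Task1": [("Device1", "Task0")]},
    "Device1": {"Task0": [("Device1", "Task1")],
                "Task1": []},
}
edges = build_graph(config)
placement = map_to_nodes(edges, ["CU0", "CU1", "CU2"])
```

A real mapper would also weigh link bandwidth and topology distance; this sketch only shows the graph-then-placement structure of the flow.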


[Diagram: task graph A, B, C, D, E mapped onto a grid of compute units: CU 0/A, CU 1/B, CU 2, CU 3/C, CU 4/D,E, CU 5, CU 6, CU 7, CU 8]

SLIDE 7

Simplified task graph configuration example


Device0 {
  Type: FPGA
  Task0 {
    Impl: source_task.c
    Input_channels: 0
    Output_channels {
      Ch0: Device1.Task0.Ch1
    }
  }
  Task1 {
    Impl: source_task.c
    Input_channels: 0
    Output_channels {
      Ch0: Device1.Task0.Ch0
    }
  }
}
Device1 {
  Type: FPGA
  Task0 {
    Impl: example_task.c
    Input_channels: 2
    Output_channels {
      Ch0: Device1.Task1.Ch0
    }
  }
  Task1 {
    Impl: sink_task.c
    Input_channels: 1
  }
}

SLIDE 8

Thank you!
