
An introduction to FPGA-based acceleration of neural networks

Marco Pagani


What is an FPGA?

  • Field-programmable gate arrays (FPGAs) are integrated circuits designed to be configured after manufacturing to implement arbitrary logic functions in hardware.
    ○ Logic functions are implemented by means of configurable logic cells;
    ○ These logic cells are distributed in a 2D grid structure and interconnected using configurable routing resources.


[Figure: 2D grid of logic cells interconnected by switch matrices]


What is an FPGA?

  • Logic cells are typically built around a look-up table (LUT).
    ○ An n-input LUT is a logic circuit that can be configured to implement any combinational logic function with n inputs and a single output (see the sketch below).


[Figure: a 2-input LUT implemented as a multiplexer selecting among four SRAM configuration bits]
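As a rough software analogy (not an HDL description), the behavior of a small LUT can be modeled as a truth table stored in configuration memory and indexed by the inputs. The C++ sketch below models a 2-input LUT; the names are purely illustrative.

    #include <array>

    // Conceptual model of a 2-input LUT: the SRAM configuration memory is a
    // 4-entry truth table, and the inputs act as the select lines of a 4:1 MUX.
    struct Lut2 {
        std::array<bool, 4> config;  // contents of the configuration cells

        bool eval(bool in0, bool in1) const {
            return config[(in1 << 1) | in0];  // inputs select one stored bit
        }
    };

    int main() {
        // Configuring the LUT with the truth table of XOR: 0, 1, 1, 0.
        Lut2 xor_lut{{false, true, true, false}};
        return xor_lut.eval(true, false) ? 0 : 1;  // eval() returns true here
    }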

  • Logic cells are built by combining one or more look-up tables with flip-flops (FFs) capable of storing a state.
    ○ In this way, they can also implement sequential logic.

  • In modern state-of-the-art FPGAs, logic cells include several LUTs, FFs, multiplexers, arithmetic and carry chains, and shift registers.


[Figure: a logic cell combining a 2-input LUT with a flip-flop on the output]


What is an FPGA?

  • Besides logic cells, FPGAs also include special-purpose cells:
    ○ Digital signal processing (DSP) / arithmetic cells;
    ○ Dense memory cells (BRAMs);
    ○ Input/output cells.

  • Hence, the fabric of modern FPGAs is a heterogeneous structure:


[Figure: heterogeneous fabric mixing logic cells (LC), memory cells (MC), arithmetic cells (AC), and I/O cells]

FPGAs as computing devices


  • FPGAs are different from software-programmable processors.

    ○ They do not rely on a predefined microarchitecture or an ISA to be programmed;
    ○ "Programming" an FPGA means designing the datapath itself and the control logic that will be used to process the data.

[Figure: a software-programmable processor stack (hardware microarchitecture, ISA, software) compared with an FPGA stack (fabric, logic design, optional ISA and software)]


FPGAs as computing devices


  • Traditionally, "hardware programming" has been carried out using hardware description languages (HDLs) such as VHDL or Verilog.
    ○ A different paradigm with respect to software programming;
    ○ HDLs allow specifying sets of operations that are performed (or evaluated) concurrently when their activation conditions are met, eventually triggering other operations;
    ○ Costly and time-consuming design and debugging.


FPGAs as computing devices


  • In the last decades, high-level synthesis (HLS) tools have emerged.
    ○ HLS tools translate an algorithmic description of a procedure, written in a system programming language such as C or C++, into an HDL implementation (see the sketch below);
    ○ Less efficient than native HDL, but productivity is higher in both coding and testing.

[Chart: ease of development versus performance for FPGA HDL and FPGA HLS design]
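As an illustration of the HLS coding style, the sketch below describes a small multiply-accumulate kernel in C++ annotated with tool directives; the pragma syntax follows the AMD/Xilinx Vitis HLS convention, and the function and array names are purely illustrative.

    #include <cstdint>

    // HLS-style C++ kernel: a dot product over fixed-size integer arrays.
    // The synthesis tool turns the annotated loop into a pipelined datapath.
    constexpr int N = 64;

    int32_t mac_kernel(const int8_t a[N], const int8_t b[N]) {
        int32_t acc = 0;
        for (int i = 0; i < N; ++i) {
    #pragma HLS PIPELINE II=1
            acc += static_cast<int32_t>(a[i]) * b[i];
        }
        return acc;
    }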


FPGA design flow


  • A modern FPGA design flow consists of a design and integration phase followed by a synthesis and implementation phase.
    ○ Logic design units (HDL) are often packaged into IPs to promote portability and reusability;
    ○ At the end, the process produces one or more bitstreams.

[Figure: design flow from HLS sources (through an HLS tool) and HDL sources to synthesis, implementation, and bitstream generation]

FPGA design flow


  • A bitstream is a configuration file containing the data used to configure the SRAM memory cells that define the function of the logic cells and the routing resources.
    ○ Note: unlike a CPU program, an FPGA configuration does not specify a set of operations that will be interpreted and executed by the FPGA, but rather contains the data for configuring the fabric.

[Figure: a bitstream loading the SRAM configuration memory of a logic cell]


Heterogeneous SoC-FPGA


  • The price paid for FPGA hardware programmability is higher silicon area consumption and lower clock frequency compared to "hard" CPUs.
    ○ SoC-FPGAs are heterogeneous platforms that have emerged to take the best of both worlds;
      ■ Tightly coupling a SoC platform with an FPGA fabric.

[Figure: SoC-FPGA architecture: CPUs with last-level cache and memory controller connected to DRAM, and hardware accelerators on the FPGA fabric, sharing an interconnect]

Why use FPGAs for accelerating neural networks?


  • FPGAs have been used to accelerate the inference process for artificial neural networks (NNs).
    ○ From data centers to cyber-physical systems applications;
    ○ Currently, the training phase is still typically performed on GPUs.
  • Amazon AWS FPGA instances;
  • Microsoft Azure FPGAs;
  • Many industrial players are moving towards heterogeneous SoC-FPGAs for NN acceleration and parallel I/O processing.


Why use FPGAs for accelerating neural networks?


  • The inference process for neural networks can be performed using different computing devices;
    ○ These alternatives can be classified along different dimensions:
      ■ Performance, energy efficiency, ease of programming.

[Figure: spectrum of computing devices, from easy to program / less efficient to more efficient / less flexible]
  • CPU and GPU: general-purpose devices implementing an ISA and having a programmable datapath; GPUs are oriented towards data-parallel (SIMD/SIMT) processing.
  • FPGA: reconfigurable fabric implementing digital logic; limited clock frequency.
  • ASIC (TPU): specialized digital logic having the highest efficiency; cannot be reconfigured when the needs change.

Why use FPGAs for accelerating neural networks?


  • The main advantage of FPGAs is their flexibility:
    ○ Potentially, it is possible to implement the inference process using a hardware accelerator custom-designed for the specific NN;
    ○ Efficient datapath tailored to the specific inference process;
    ○ If the network changes, the accelerator can be redesigned;
      ■ The same chip fits different types of machine learning models.
  • As a result, FPGA-accelerated inference of NNs is typically more energy-efficient compared to CPUs and GPUs;
    ○ High throughput without ramping up the clock frequency.

[Chart: ease of development versus performance per Watt for CPU, GPU, FPGA HLS, and FPGA HDL]


Why use FPGAs for accelerating neural networks?


  • Another important advantage of FPGA-based hardware acceleration is time predictability;
    ○ This is especially important in the context of real-time systems.
  • Currently, only minimal information on the internal architecture and resource management logic is publicly available for state-of-the-art GPUs.
  • FPGA-based hardware accelerators are intrinsically more open, offering the possibility to simulate and even describe their internal logic at clock-level granularity.

[Figure: logic simulation view (GTKWave)]

Accelerating neural networks on FPGA


  • The main drawback of FPGA-based NN acceleration is complexity:
    ○ Custom designing an accelerator in HDL, or even with HLS, is an expensive and time-consuming process.

[Figure: a trained network model (convolutional, pooling, and fully connected layers) mapped onto HW accelerators on the FPGA, driven by control code (load(); start(); wait();) running on the CPUs]
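The load(); start(); wait(); pattern in the figure can be made concrete with a minimal CPU-side sketch driving a memory-mapped accelerator. The register offsets and their meaning below are hypothetical, chosen only to illustrate the pattern.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical register map of a memory-mapped accelerator.
    constexpr std::size_t CTRL     = 0;  // write 1 to start
    constexpr std::size_t STATUS   = 1;  // bit 0 set when done
    constexpr std::size_t IN_ADDR  = 2;  // physical address of the input buffer
    constexpr std::size_t OUT_ADDR = 3;  // physical address of the output buffer

    // load(): tell the accelerator where its buffers are.
    void load(volatile uint32_t* regs, uint32_t in, uint32_t out) {
        regs[IN_ADDR] = in;
        regs[OUT_ADDR] = out;
    }

    // start(): kick off the computation.
    void start(volatile uint32_t* regs) { regs[CTRL] = 1; }

    // wait(): busy-wait until the accelerator signals completion.
    void wait(volatile uint32_t* regs) {
        while ((regs[STATUS] & 0x1u) == 0) { /* spin */ }
    }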


Accelerating neural networks on FPGA


  • Hence, as for GPUs, specific frameworks have been developed to ease the development of FPGA-based hardware acceleration for NNs.

[Figure: a framework core, backed by a framework library of code templates and accelerators, generating the CPU control code and the FPGA HW accelerators from the trained network model]

Accelerating neural networks on FPGA


  • In order to achieve effective HW acceleration, some characteristics of the FPGA fabric must be taken into account:
    ○ Floating-point operations are expensive on current-generation FPGA fabric (this may change in the future);
      ■ Integer and binary operations are far more efficient.
    ○ On-fabric memory (BRAMs and LUT-RAMs) has a larger throughput and a lower latency compared to external memory, such as external DDR memory;
      ■ Otherwise, the parallel processing potential of the FPGA hardware may be hampered, resulting in a memory-bound design.



Accelerating neural networks on FPGA


  • According to these characteristics, some general design principles for accelerating NNs on FPGAs can be derived:
    ○ Whenever possible, use integer or binary operations;
    ○ Whenever possible, use on-chip memory to store the parameters.
  • NN parameters can contain a lot of redundant information.
    ○ This can be exploited using quantization, i.e., moving from floating-point to low-precision integer arithmetic (see the sketch below).
  • Moving to a quantized representation fits both design principles:
    ○ Integer operations require significantly fewer logic resources, allowing a higher compute density;
    ○ The network parameters may entirely fit into on-chip memory.
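As a rough illustration of the idea (not the scheme of any particular framework), the sketch below quantizes a floating-point weight tensor to 8-bit integers with a single per-tensor scale; the symmetric scheme and the names are assumptions made for the example.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor quantization: w_q = round(w / scale), with the scale
    // chosen so that the largest-magnitude weight maps to the int8 range.
    struct QuantizedTensor {
        std::vector<int8_t> values;
        float scale;  // dequantize with: w ~= values[i] * scale
    };

    QuantizedTensor quantize_int8(const std::vector<float>& weights) {
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));

        QuantizedTensor q;
        q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        q.values.reserve(weights.size());
        for (float w : weights) {
            long v = std::lround(w / q.scale);
            v = std::min(127L, std::max(-127L, v));  // clamp to the int8 range
            q.values.push_back(static_cast<int8_t>(v));
        }
        return q;
    }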

Accelerating neural networks on FPGA


  • Moving from floating-point arithmetic to low-precision integer arithmetic has a limited impact on the NN accuracy.
    ○ The network can be retrained to recover accuracy.
  • Advanced frameworks for FPGA-based acceleration of NNs manage the quantization and retraining steps.

[Figure: framework flow: network model transformation, quantization and retraining, and hardware/code generation, supported by the framework library]


Accelerating neural networks on FPGA


  • Current NNs are typically composed of predefined layers:
    ○ Convolutional, pooling, fully connected.
  • Frameworks for FPGA-based acceleration provide full implementations, or even design templates, for these layers (see the sketch below).
    ○ Templates are parametrized with respect to the size of the layer (e.g., number of neurons, input size, etc.);
    ○ Knowing the size of the layer is useful for dimensioning the logic design and "unrolling" the operations.

[Figure: convolutional, pooling, and fully connected layers of the network model mapped to the corresponding HW-accelerator templates]
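To give an idea of what such a design template can look like, the HLS-style C++ sketch below parametrizes a fully connected layer on its dimensions, so the generated logic is sized for the specific layer and the inner loop can be unrolled. The pragma usage follows the Vitis HLS convention; the sizes and names are illustrative.

    #include <cstdint>

    // Layer template: the dimensions are compile-time parameters, so the tool
    // can size the datapath and unroll the inner multiply-accumulate loop.
    template <int IN, int OUT>
    void fully_connected(const int8_t in[IN], const int8_t weights[OUT][IN],
                         const int32_t bias[OUT], int32_t out[OUT]) {
        for (int o = 0; o < OUT; ++o) {
    #pragma HLS PIPELINE II=1
            int32_t acc = bias[o];
            for (int i = 0; i < IN; ++i) {
    #pragma HLS UNROLL
                acc += static_cast<int32_t>(weights[o][i]) * in[i];
            }
            out[o] = acc;
        }
    }

    // Instantiating the template for a hypothetical 256-input, 64-output layer.
    void layer3(const int8_t in[256], const int8_t w[64][256],
                const int32_t b[64], int32_t out[64]) {
        fully_connected<256, 64>(in, w, b, out);
    }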

Tentative taxonomy of FPGA-based NN acceleration


  • Due to the flexibility of FPGA-based computing, the inference process for a NN can be accelerated using different approaches and architectures.
    ○ Some approaches are more popular than others.

[Taxonomy dimensions]
  • Entire NN in HW vs. only some layers in HW;
  • NN instance-specific HW-accelerators vs. NN layer-generic HW-accelerators;
  • Programmable (ISA) HW-accelerators vs. fixed-function HW-accelerators.


Popular approaches - programmable accelerators


  • One possible solution is to use a highly optimized programmable HW-accelerator, usually designed with an HDL.
    ○ Similar to an ASIC TPU/NPU: less efficient but more flexible;
    ○ If the NN model changes, the HW-accelerator can be swapped without changing the chip or the platform.

[Figure: a programmable NN HW-accelerator on the FPGA executing accelerator code generated from the network model, driven by control code (load(); start(); wait();) on the CPU]


Popular approaches - programmable accelerators


  • Easy development process for NNs through a framework.
    ○ Only software development may be required.
  • Resource consumption typically does not depend on the NN model.
    ○ Inference time may depend on the NN model.
  • The code can be easily ported to other platforms implementing the same NN accelerator.


Popular approaches - fixed-function accelerators


  • Another possible approach is to use multiple HW modules to accelerate in hardware only the most demanding layers.
  • The computational load of NN inference is not constant across the layers of the network.
    ○ In a CNN, the initial convolutional layers may require more computing resources than the last layers (a rough estimate is sketched below).

[Figure: the network model with a convolutional HW-accelerator and a fully connected HW-accelerator; over time, the CPU task is HW-assisted for L0/L1, runs L2 on the CPU, and is HW-assisted for L3/L4]
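A back-of-the-envelope count of multiply-accumulate (MAC) operations makes this imbalance concrete; the layer sizes below are hypothetical and serve only to illustrate the calculation.

    #include <cstdint>
    #include <cstdio>

    // MACs in a convolutional layer: out_h * out_w * out_ch * in_ch * k * k.
    uint64_t conv_macs(uint64_t out_h, uint64_t out_w, uint64_t in_ch,
                       uint64_t out_ch, uint64_t k) {
        return out_h * out_w * out_ch * in_ch * k * k;
    }

    // MACs in a fully connected layer: inputs * outputs.
    uint64_t fc_macs(uint64_t inputs, uint64_t outputs) { return inputs * outputs; }

    int main() {
        uint64_t conv = conv_macs(112, 112, 3, 64, 3);  // ~21.7 million MACs
        uint64_t fc   = fc_macs(4096, 1000);            // ~4.1 million MACs
        std::printf("conv: %llu MACs, fully connected: %llu MACs\n",
                    (unsigned long long)conv, (unsigned long long)fc);
        return 0;
    }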


Popular approaches - fixed-function accelerators


  • The accelerators typically have a fixed-function datapath optimized for the specific operations.
    ○ Often not customized for the specific NN (size, parameters, etc.).
  • Resource consumption may not depend on the NN model.
    ○ Inference time may depend on the NN model.
  • More complex development process; performance may improve.
    ○ Hardware support design may be required.


Popular approaches - dataflow accelerators


  • Accelerating the entire NN inference with a custom dataflow-architecture accelerator (a minimal sketch follows the figure).
    ○ Each layer can be implemented with a dedicated functional unit (FU);
      ■ Typically using a fixed-function architecture;
      ■ Typically, FUs are tailored to the specific NN (size, parameters, etc.);
      ■ Weights must be stored on-chip.

[Figure: the network model mapped onto a pipeline of dedicated functional units (L0 FU through L4 FU), with data streaming from one layer's FU to the next]
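A minimal HLS-style sketch of the dataflow idea is shown below, assuming the Vitis HLS stream library (hls_stream.h): each functional unit reads from an input stream and writes to an output stream, and the DATAFLOW directive lets the units run concurrently. The layer bodies are placeholders, and all names and sizes are illustrative.

    #include <cstdint>
    #include "hls_stream.h"  // assumes the Vitis HLS stream library

    constexpr int N = 256;

    // Functional unit for one layer: consumes a stream, produces a stream.
    void layer0_conv(hls::stream<int8_t>& in, hls::stream<int8_t>& out) {
        for (int i = 0; i < N; ++i) {
    #pragma HLS PIPELINE II=1
            out.write(in.read());  // placeholder for the convolution arithmetic
        }
    }

    void layer1_pool(hls::stream<int8_t>& in, hls::stream<int8_t>& out) {
        for (int i = 0; i < N; ++i) {
    #pragma HLS PIPELINE II=1
            out.write(in.read());  // placeholder for the pooling arithmetic
        }
    }

    // Top level: the functional units form a streaming pipeline.
    void nn_top(hls::stream<int8_t>& in, hls::stream<int8_t>& out) {
    #pragma HLS DATAFLOW
        hls::stream<int8_t> l0_to_l1;
        layer0_conv(in, l0_to_l1);
        layer1_pool(l0_to_l1, out);
    }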

Popular approaches - dataflow accelerators


  • High throughput and low latency;
  • Complex implementation process.
    ○ The hardware design depends on the specific NN model.
  • Resource consumption depends on the size of the NN.
    ○ If there are not enough resources, some layers can be folded together, reducing the level of parallelism;
    ○ If there are more resources available, the level of parallelism within the FUs can be increased.

[Figure: layers L0 and L1 folded onto a shared L0/L1 FU when resources are scarce, versus separate L0 and L1 FUs when resources allow]


Dynamic partial reconfiguration


  • An interesting feature of FPGAs is dynamic partial reconfiguration (DPR).
    ○ DPR allows dynamically reconfiguring a subset of the FPGA fabric while the remaining resources continue to operate without interruption.
  • It can be used to host multiple HW-accelerators in time-sharing.
    ○ Implementing the inference for multiple NN models;
    ○ Or even for the same NN.

[Figure: a reconfigurable partition (RP) on the FPGA hosting, in time-sharing, the HW-accelerators for NN A and NN B, with reconfiguration (RCFG) phases in between]

Thank you for your attention


marco.pagani@santannapisa.it