Accelerated Data Processing on SoC with FPGA, Marek Vašut

SLIDE 1

Accelerated Data Processing on SoC with FPGA

Marek Vašut <marex@denx.de>, June 3, 2015

Marek Vašut <marex@denx.de> Accelerated Data Processing on SoC with FPGA

SLIDE 2

Marek Vasut

◮ Software engineer at DENX S.E. since 2011
◮ Embedded and Real-Time Systems Services, Linux kernel and driver development, U-Boot development, consulting, training.
◮ Versatile Linux kernel hacker
◮ Custodian at U-Boot bootloader

SLIDE 3

Structure of the talk

◮ Motivation
◮ Introduction to FPGAs
◮ Your first FPGA data cruncher
◮ Interfacing with Linux
◮ Speeding things up

SLIDE 4

Why listen to this talk

◮ Get fresh ideas
◮ Learn something new
◮ Reduce energy envelope of your device
◮ Process data quickly and efficiently

You won’t learn marketing stuff or random benchmark numbers

SLIDE 5

FPGA

◮ Abbr. for Field Programmable Gate Array
◮ Programmable logic
◮ Usually used for:
    ◮ Digital Signal Processing (DSP)
    ◮ Data crunching
    ◮ Custom hardware interfaces
    ◮ ASIC prototyping
    ◮ . . .
◮ Common vendors – Xilinx, Altera, Lattice, Microsemi . . .

SLIDE 6

Internal structure

[Figure: FPGA internal structure. Image: W. T. Freeman, http://www.vision.caltech.edu/CNS248/Fpga/fpga1a.gif, CC BY 2.5: http://creativecommons.org/licenses/by/2.5/]

SLIDE 7

FPGA and the outside

◮ FPGA has plenty of I/O options:
    ◮ Regular I/O with configurable voltage levels
    ◮ Differential I/O
    ◮ High-speed SerDes
    ◮ . . .
◮ Usual interface with host:
    ◮ Stand-alone FPGA, usually PCIe, USB, . . .
    ◮ FPGA on a CPU bus (PowerPCs, e.g. ML507)
    ◮ Built into CPU (SoCFPGA/Zynq), usually AMBA/AXI

SLIDE 8

Programming the FPGA

◮ Each vendor has its own tools – Altera Quartus, Xilinx Vivado
◮ FPGA tools are often closed source :-(
◮ FPGA bitstream format is closed :-(
◮ Basic vendor tools are available free of charge
◮ They offer sufficient functionality to implement a data cruncher
◮ Vendor tools are needed for place-and-route and the assembler
◮ Third-party tools for synthesis are available

SLIDE 9

Comparison to a GPU – I.

            CPU          GPU          FPGA
Toolchain   Open         Closed       Closed
HW design   Proprietary  Proprietary  Your own
HW units    Fixed        Fixed        As needed
I/O         Limited      None         As needed

SLIDE 10

HDL – Hardware Description Language

◮ FPGA content is written in HDLs
◮ HDL – Hardware Description Language
◮ HDLs are used to model the behavior of logic blocks
◮ Two major HDLs – VHDL and Verilog
◮ Tools often allow seamless mixing of HDLs
◮ Many readily-available cores under acceptable licenses:
    OpenCores            http://opencores.org/
    OpenCores projects   http://opencores.org/projects
    CERN Open HW Repo    http://www.ohwr.org/

SLIDE 11

Modeling behavior

HW Behavior modeling vs. Writing CPU code:

◮ Vastly different and confusing to software people :-)
◮ CPU: Programmer implements an algorithm
◮ FPGA: Programmer implements hardware to run the algorithm

SLIDE 12

Implicit parallelism

◮ Everything in a block is executed in parallel
◮ All conditions in a conditional statement are tested in parallel
    if, case – differs from C

    if (foo == 1)
        bar <= 1'b0;
    else
        bar <= 1'b1;

◮ Blocks are executed in parallel

    begin
        x <= 1'b0;
        y <= 1'b1;
    end
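
The parallel semantics are easiest to see with a register swap. The following is a minimal sketch, not taken from the talk: the module name `swap_demo` and its signals are made up for illustration. Both nonblocking assignments sample their right-hand sides before either register is updated, so the two registers exchange values on every clock edge without a temporary variable.

```verilog
// Illustrative sketch: module and signal names are assumptions.
module swap_demo (
    input            clk,
    input            reset,
    output reg [7:0] x,
    output reg [7:0] y
);
    always @(posedge clk) begin
        if (reset) begin
            x <= 8'd1;
            y <= 8'd2;
        end else begin
            // Both right-hand sides are sampled first, then both
            // registers update together, so x and y trade values.
            x <= y;
            y <= x;
        end
    end
endmodule
```

The same two statements written in C would leave both variables holding the same value, which is the kind of difference the slide warns about.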

SLIDE 13

Combinatorial vs. Sequential logic

◮ Combo – the immediate value of a variable is the product of the immediate inputs of the function:

    assign Z = X ^ Y;

◮ Seq logic is synchronous to the clock (involves a flip-flop):

    always @(posedge clk) Z <= DAT;
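
To make the contrast concrete, here is a small sketch that places both styles side by side in one module. The module name `comb_vs_seq` and its ports are hypothetical and chosen only for this example.

```verilog
// Illustrative sketch: module and signal names are assumptions.
module comb_vs_seq (
    input      clk,
    input      x,
    input      y,
    output     z_comb,   // follows x ^ y immediately
    output reg z_seq     // updates only on the rising clock edge
);
    // Combinatorial: a continuous assignment, no storage element.
    assign z_comb = x ^ y;

    // Sequential: a flip-flop samples the value at each posedge.
    always @(posedge clk)
        z_seq <= x ^ y;
endmodule
```

In a waveform viewer, `z_comb` changes as soon as either input changes, while `z_seq` only changes at a rising edge of `clk`.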

SLIDE 14

Verilog example

◮ Looks like C, based on C, but behaves differently
◮ Used a lot in Europe
◮ Example: CRC5, polynomial x^5 + x^2 + x^0
◮ Example modified from: http://www.asic-world.com/examples/verilog/serial_crc.html

    module crc5 (
        /* SYSTEM I/O */
        input reset,
        input clk,
        /* CRC5 I/O */
        input data,
        output reg [4:0] crc
    );

SLIDE 15

Verilog example II

    always @(posedge clk) begin
        if (reset) begin
            crc <= 5'b00000;
        end else begin
            crc[0] <= data ^ crc[4];
            crc[1] <= crc[0];
            crc[2] <= crc[1] ^ data ^ crc[4];
            crc[3] <= crc[2];
            crc[4] <= crc[3];
        end
    end
    endmodule

SLIDE 16

VHDL example

◮ Distinctive syntax based on Ada
◮ More explicit typing system than Verilog
◮ Used a lot in the USA
◮ Example: CRC5, polynomial x^5 + x^2 + x^0
◮ Example from http://outputlogic.com/?page_id=321

    library ieee;
    use ieee.std_logic_1164.all;

    entity crc is
        port ( data_in  : in  std_logic_vector (0 downto 0);
               rst, clk : in  std_logic;
               crc_out  : out std_logic_vector (4 downto 0));
    end crc;

SLIDE 17

VHDL example II

    architecture imp_crc of crc is
        signal lfsr_q: std_logic_vector (4 downto 0);
        signal lfsr_c: std_logic_vector (4 downto 0);
    begin
        crc_out <= lfsr_q;
        lfsr_c(0) <= lfsr_q(4) xor data_in(0);
        lfsr_c(1) <= lfsr_q(0);
        lfsr_c(2) <= lfsr_q(1) xor lfsr_q(4) xor data_in(0);
        lfsr_c(3) <= lfsr_q(2);
        lfsr_c(4) <= lfsr_q(3);

        process (clk,rst) begin
            if (rst = '1') then
                lfsr_q <= b"11111";
            elsif (clk'EVENT and clk = '1') then
                lfsr_q <= lfsr_c;
            end if;
        end process;
    end architecture imp_crc;

SLIDE 18

Comparison to a GPU – II.

                     CPU          GPU           FPGA
Languages            All          OpenCL, CUDA  OpenCL, HDLs
Design paradigm      Sequential   Seq/Par       Parallel
Design granularity   Instruction  Instruction   Gate
Opt. possibility     Low          Low           High
Opt. difficulty      Low          Low           High

SLIDE 19

Development and debugging

◮ Simulation (on developer's system)
◮ Probing (on-target)

SLIDE 20

Simulation

◮ Simulation tools:
    Icarus Verilog   http://iverilog.icarus.com/
    ghdl             http://home.gna.org/ghdl/
    ModelSim         http://en.wikipedia.org/wiki/ModelSim/
◮ Write a testcase for a module in an augmented HDL
◮ Execute the testcase
◮ Observe the results
    ◮ View waveforms
    ◮ Decode and inspect buses
    ◮ Trigger on complex conditions
    ◮ . . .
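
As an illustration of the testcase workflow, the sketch below pairs the CRC5 module from the Verilog example slides with a small testbench. The testbench name `crc5_tb`, the input pattern, the clock period, and the dump file name are all assumptions made for this example, not part of the talk.

```verilog
`timescale 1ns/1ps

// Device under test: the CRC5 module from the Verilog example slides.
module crc5 (
    input            reset,
    input            clk,
    input            data,
    output reg [4:0] crc
);
    always @(posedge clk) begin
        if (reset) begin
            crc <= 5'b00000;
        end else begin
            crc[0] <= data ^ crc[4];
            crc[1] <= crc[0];
            crc[2] <= crc[1] ^ data ^ crc[4];
            crc[3] <= crc[2];
            crc[4] <= crc[3];
        end
    end
endmodule

// Hypothetical testbench: names and the test pattern are assumptions.
module crc5_tb;
    reg        clk     = 1'b0;
    reg        reset   = 1'b1;
    reg        data    = 1'b0;
    reg  [7:0] pattern = 8'b1011_0010;  // arbitrary test input
    wire [4:0] crc;
    integer    i;

    crc5 dut (.reset(reset), .clk(clk), .data(data), .crc(crc));

    always #5 clk = ~clk;               // 10 ns simulation clock period

    initial begin
        $dumpfile("crc5_tb.vcd");       // waveform file for a viewer
        $dumpvars(0, crc5_tb);

        // Hold reset across two rising edges, then release it.
        @(negedge clk);
        @(negedge clk);
        reset = 1'b0;

        // Shift the pattern in MSB first, one bit per clock.
        for (i = 7; i >= 0; i = i - 1) begin
            data = pattern[i];
            @(negedge clk);
            $display("bit=%b crc=%b", data, crc);
        end
        $finish;
    end
endmodule
```

Assuming the block is saved as `crc5_tb.v`, Icarus Verilog can compile and run it with `iverilog -o crc5_tb crc5_tb.v` followed by `vvp crc5_tb`. The resulting VCD file can then be opened in a waveform viewer.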

SLIDE 21

Probing

◮ Used to observe the design on target
◮ Think of this as a bus analyzer in the FPGA
◮ Probing tools (e.g. SignalTap)
◮ Design is augmented with a probing IP, FPGA is reprogrammed
◮ Probing is controlled through a debug probe attached to the FPGA (JTAG or similar)
◮ Probe internal signals, observe waveforms, trigger on complex conditions . . .

SLIDE 22

Structuring the design

◮ HDL files – lowest in the hierarchy
◮ IP block – collection of HDL files with an interface
◮ FPGA design – collection of IP blocks
◮ Vendor tools contain tools to assemble IP blocks into an FPGA design, e.g. Altera Qsys.

SLIDE 23

Comparison to a GPU – III.

             CPU    GPU                FPGA
Simulation   QEMU   ?                  Icarus, ModelSim
Debugger     GDB    CUDA-GDB, CodeXL   SignalTap

SLIDE 24

Linux interface

◮ No standard in-kernel FPGA interface due to variance of designs
◮ Attempts do exist:
    ◮ Device Tree Overlay(s) stored in FPGA
    ◮ SDB – http://www.ohwr.org/projects/fpga-config-space
◮ Usually there are control registers in the FPGA design
◮ Usually the DMA is involved (either on FPGA or CPU side)
◮ Two options for controlling the FPGA:
    ◮ Custom Linux kernel driver
    ◮ Userspace utility
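
The control registers mentioned above might look roughly like the following on the FPGA side. This is a sketch under stated assumptions: the bus signals (`wr_en`, `addr`, `wr_data`, `rd_data`) are a generic illustration rather than Avalon or AXI, and the register map (CONTROL, source address, length, STATUS) is invented for the example.

```verilog
// Hypothetical register block: bus signals and register map are
// assumptions for illustration, not a real Avalon or AXI interface.
module ctrl_regs (
    input             clk,
    input             reset,
    input             wr_en,       // write strobe from the host bus
    input      [1:0]  addr,        // word address of the register
    input      [31:0] wr_data,
    output reg [31:0] rd_data,
    input             busy,        // status from the data cruncher
    output reg        start,       // one-cycle start pulse
    output reg [31:0] src_addr,    // DMA source address
    output reg [31:0] length       // transfer length in bytes
);
    always @(posedge clk) begin
        if (reset) begin
            start    <= 1'b0;
            src_addr <= 32'd0;
            length   <= 32'd0;
        end else begin
            start <= 1'b0;                        // default: no pulse
            if (wr_en) begin
                case (addr)
                    2'd0: start    <= wr_data[0]; // CONTROL, bit 0 = start
                    2'd1: src_addr <= wr_data;
                    2'd2: length   <= wr_data;
                    default: ;                    // STATUS is read-only
                endcase
            end
        end
    end

    // Read path: combinatorial mux over the registers.
    always @(*) begin
        case (addr)
            2'd1:    rd_data = src_addr;
            2'd2:    rd_data = length;
            2'd3:    rd_data = {31'd0, busy};     // STATUS, bit 0 = busy
            default: rd_data = 32'd0;
        endcase
    end
endmodule
```

Whichever side controls the FPGA, kernel driver or userspace utility, it would then program the address and length, write the start bit, and poll the busy bit or wait for an interrupt.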

SLIDE 25

Custom kernel driver

◮ Driver written to match the particular FPGA bitstream
◮ Driver can crash the host machine if written badly :-(
◮ Driver usually exports custom userland I/O
◮ splice(2)

SLIDE 26

Userland approach

◮ Userland accesses the FPGA registers via uio
◮ The uio is like a restricted devmem
◮ If DMA is involved, a kernel module is needed to prepare the data for the DMA (e.g. to ensure cache coherency).
◮ CMA might be used to export a large slab of custom kernel memory to the user

SLIDE 27

Performance tuning

◮ FPGA is clocked at 50...200 MHz, not much
◮ The fabric is usually rated at much more!
◮ Synthesize a PLL, which generates a faster clock
◮ Clock your design from the PLL

SLIDE 28

Design tricks

◮ Use combo logic where possible
◮ Create pipelined designs and make sure the pipeline is saturated
◮ Synthesize multiple units and compute in parallel
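
A pipelined design, as recommended above, can be sketched with a two-stage unit that computes (a + b) * c. The module name `pipe_mac`, the port names, and the bit widths are assumptions chosen for illustration. Once the pipeline is full, a new result appears every clock cycle, as long as a new input is supplied every cycle.

```verilog
// Illustrative two-stage pipeline computing (a + b) * c.
// Names and widths are assumptions made for this sketch.
module pipe_mac (
    input             clk,
    input             reset,
    input             in_valid,
    input      [7:0]  a,
    input      [7:0]  b,
    input      [7:0]  c,
    output reg        out_valid,
    output reg [16:0] result
);
    // Stage 1 registers: the sum, plus c delayed to stay aligned with it.
    reg       s1_valid;
    reg [8:0] s1_sum;
    reg [7:0] s1_c;

    always @(posedge clk) begin
        if (reset) begin
            s1_valid  <= 1'b0;
            out_valid <= 1'b0;
        end else begin
            // Stage 1: add the current inputs.
            s1_valid <= in_valid;
            s1_sum   <= a + b;
            s1_c     <= c;

            // Stage 2: multiply the previous cycle's sum.
            out_valid <= s1_valid;
            result    <= s1_sum * s1_c;
        end
    end
endmodule
```

Keeping `in_valid` asserted on every cycle is what saturates the pipeline. Instantiating several such units side by side is one way to compute in parallel.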

SLIDE 29

Altera OpenCL

◮ OpenCL EP 1.0 implementation for Altera FPGAs
◮ Easier for SW developers
◮ Only a thin shim must be ported to the FPGA
◮ Closed source compiler :-(
◮ Even needs a license :-C

SLIDE 30

Conclusion

◮ FPGAs are strong in parallel, pipelined workloads
◮ FPGAs give the user almost gate-level performance
◮ FPGAs are manufactured using bleeding-edge processes
◮ FPGAs deliver excellent performance per watt
◮ There is no simple unified Linux interface
◮ Developing FPGA content can be difficult
◮ The FPGA ecosystem is still rather closed

SLIDE 31

The End

Thank you for your attention!

Contact: Marek Vasut <marex@denx.de>
