Accelerated Data Processing on SoC with FPGA
Marek Vašut <marex@denx.de> June 3, 2015
◮ Software engineer at DENX S.E. since 2011
◮ Embedded and Real-Time Systems Services, Linux kernel and
driver development, U-Boot development, consulting, training.
◮ Versatile Linux kernel hacker
◮ Custodian at U-Boot bootloader
◮ Motivation
◮ Introduction to FPGAs
◮ Your first FPGA data cruncher
◮ Interfacing with Linux
◮ Speeding things up
◮ Get fresh ideas
◮ Learn something new
◮ Reduce energy envelope of your device
◮ Process data quickly and efficiently
You won’t learn marketing stuff or random benchmark numbers
◮ Abbr. for Field Programmable Gate Array
◮ Programmable logic
◮ Usually used for:
  ◮ Digital Signal Processing (DSP)
  ◮ Data crunching
  ◮ Custom hardware interfaces
  ◮ ASIC prototyping
  ◮ ...
◮ Common vendors – Xilinx, Altera, Lattice, Microsemi...
W.T.Freeman http://www.vision.caltech.edu/CNS248/Fpga/fpga1a.gif CC BY 2.5: http://creativecommons.org/licenses/by/2.5/
◮ FPGA has plenty of I/O options:
  ◮ Regular I/O with configurable voltage levels
  ◮ Differential I/O
  ◮ High-speed SerDes
  ◮ ...
◮ Usual interface with host:
  ◮ Stand-alone FPGA, usually PCIe, USB, ...
  ◮ FPGA on a CPU bus (PowerPCs, e.g. ML507)
  ◮ Built into CPU (SoCFPGA/Zynq), usually AMBA/AXI
◮ Each vendor has its own tools – Altera Quartus, Xilinx Vivado
◮ FPGA tools are often closed source :-(
◮ FPGA bitstream format is closed :-(
◮ Basic vendor tools are available free of charge
◮ Sufficient amount of functionality to implement a data cruncher
◮ Vendor tools are needed for place-and-route and assembler
◮ Third-party tools for synthesis are available
            CPU          GPU          FPGA
Toolchain   Open         Closed       Closed
HW design   Proprietary  Proprietary  Your own
HW units    Fixed        Fixed        As needed
I/O         Limited      None         As needed
◮ FPGA content is written in HDLs
◮ HDL – Hardware Description Language
◮ HDLs are used to model the behavior of logic blocks
◮ Two major HDLs – VHDL and Verilog
◮ Tools often allow seamless mixing of HDLs
◮ Many readily-available cores under acceptable license:
  OpenCores           http://opencores.org/
  OpenCores projects  http://opencores.org/projects
  CERN Open HW Repo   http://www.ohwr.org/
HW Behavior modeling vs. Writing CPU code:
◮ Vastly different and confusing to software people :-)
◮ CPU: Programmer implements an algorithm
◮ FPGA: Programmer implements hardware to run the algorithm
◮ Everything in a block is executed in parallel
◮ All conditions in a conditional statement are tested in parallel
  if, case – differs from C

    if (foo == 1)
        bar <= 1'b0;
    else
        bar <= 1'b1;

◮ Blocks are executed in parallel

    begin
        x <= 1'b0;
        y <= 1'b1;
    end
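The difference from C is easiest to see with a register swap. The following is a minimal sketch; the module name, signal names and reset values are hypothetical and not taken from the talk. Both nonblocking assignments sample their right-hand sides at the clock edge before either register is updated, so the two registers exchange values. The same two statements in C would leave both variables equal.

```verilog
// Hypothetical example: swap two registers on every rising clock edge.
module swap_regs (
    input  wire       clk,
    input  wire       reset,
    output reg  [7:0] a,
    output reg  [7:0] b
);
    always @(posedge clk) begin
        if (reset) begin
            // Arbitrary, distinct reset values so the swap is visible
            a <= 8'h11;
            b <= 8'h22;
        end else begin
            // Both right-hand sides are sampled before either update
            // takes effect, so no temporary register is needed.
            a <= b;
            b <= a;
        end
    end
endmodule
```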
◮ Combinational logic – the immediate value of a signal is a function of the immediate inputs

    assign Z = X ^ Y;

◮ Sequential logic is synchronous to a clock (involves a flip-flop)

    always @(posedge clk) Z <= DAT;
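To put the two styles side by side, here is a small sketch that computes the same XOR both ways. The module and port names are assumptions made for illustration. The combinational output follows the inputs continuously, while the registered output changes only on a rising clock edge and holds its value in between.

```verilog
// Hypothetical example: the same function as combinational and as
// sequential (registered) logic.
module xor_comb_vs_seq (
    input  wire clk,
    input  wire x,
    input  wire y,
    output wire z_comb,  // follows x ^ y continuously
    output reg  z_seq    // updated only on the rising edge of clk
);
    // Combinational: a continuous assignment, no storage element
    assign z_comb = x ^ y;

    // Sequential: the result is captured in a flip-flop
    always @(posedge clk)
        z_seq <= x ^ y;
endmodule
```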
◮ Looks like C, based on C, but behaves differently
◮ Used a lot in Europe
◮ Example: CRC5, polynomial x^5 + x^2 + x^0
◮ Example modified from:
  http://www.asic-world.com/examples/verilog/serial_crc.html
module crc5 (
    /* SYSTEM I/O */
    input            reset,
    input            clk,
    /* CRC5 I/O */
    input            data,
    output reg [4:0] crc
);

always @(posedge clk) begin
    if (reset) begin
        crc <= 5'b00000;
    end else begin
        /* Serial LFSR: feedback (data ^ crc[4]) taps bits 0 and 2 */
        crc[0] <= data ^ crc[4];
        crc[1] <= crc[0];
        crc[2] <= crc[1] ^ data ^ crc[4];
        crc[3] <= crc[2];
        crc[4] <= crc[3];
    end
end
endmodule
◮ Distinctive syntax based on Ada
◮ More explicit typing system than Verilog
◮ Used a lot in the USA
◮ Example: CRC5, polynomial x^5 + x^2 + x^0
◮ Example from http://outputlogic.com/?page_id=321
library ieee;
use ieee.std_logic_1164.all;

entity crc is
    port ( data_in  : in  std_logic_vector (0 downto 0);
           rst, clk : in  std_logic;
           crc_out  : out std_logic_vector (4 downto 0));
end crc;

architecture imp_crc of crc is
    signal lfsr_q: std_logic_vector (4 downto 0);
    signal lfsr_c: std_logic_vector (4 downto 0);
begin
    crc_out <= lfsr_q;

    lfsr_c(0) <= lfsr_q(4) xor data_in(0);
    lfsr_c(1) <= lfsr_q(0);
    lfsr_c(2) <= lfsr_q(1) xor lfsr_q(4) xor data_in(0);
    lfsr_c(3) <= lfsr_q(2);
    lfsr_c(4) <= lfsr_q(3);

    process (clk,rst) begin
        if (rst = '1') then
            lfsr_q <= b"11111";
        elsif (clk'EVENT and clk = '1') then
            lfsr_q <= lfsr_c;
        end if;
    end process;
end architecture imp_crc;
                     CPU          GPU           FPGA
Languages            All          OpenCL, CUDA  OpenCL, HDLs
Design paradigm      Sequential   Seq/Par       Parallel
Design granularity   Instruction  Instruction   Gate
                     Low          Low           High
                     Low          Low           High
◮ Simulation (on developer's system)
◮ Probing (on-target)
◮ Simulation tools:
  Icarus Verilog  http://iverilog.icarus.com/
  ghdl            http://home.gna.org/ghdl/
  ModelSim        http://en.wikipedia.org/wiki/ModelSim/
◮ Write testcase for a module in an augmented HDL
◮ Execute testcase
◮ Observe results
  ◮ View waveforms
  ◮ Decode and inspect buses
  ◮ Trigger on complex conditions
  ◮ ...
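The write, execute, observe loop can be sketched with a small testbench for the serial CRC5 module shown earlier. This is an illustrative sketch only: the testbench name, the test pattern, the timing and the reference function are assumptions, and the crc5 module is repeated (with a 5-bit registered crc output) so the file stands alone.

```verilog
`timescale 1ns/1ps

// Copy of the serial CRC5 module, repeated so this file is self-contained.
module crc5 (
    input            reset,
    input            clk,
    input            data,
    output reg [4:0] crc
);
    always @(posedge clk) begin
        if (reset) begin
            crc <= 5'b00000;
        end else begin
            crc[0] <= data ^ crc[4];
            crc[1] <= crc[0];
            crc[2] <= crc[1] ^ data ^ crc[4];
            crc[3] <= crc[2];
            crc[4] <= crc[3];
        end
    end
endmodule

// Hypothetical testbench: names, pattern and timing are assumptions.
module crc5_tb;
    reg        clk     = 1'b0;
    reg        reset   = 1'b1;
    reg        data    = 1'b0;
    wire [4:0] crc;

    reg [7:0]  pattern = 8'b1011_0010;  // arbitrary input, sent MSB first
    reg [4:0]  expected;                // value predicted by the reference
    integer    i;

    crc5 dut (.reset(reset), .clk(clk), .data(data), .crc(crc));

    always #5 clk = ~clk;               // 10 ns clock period

    // Reference model: one shift of the LFSR for x^5 + x^2 + 1,
    // written as shift-left plus conditional XOR with the polynomial.
    function [4:0] crc5_step(input [4:0] c, input d);
        begin
            crc5_step = {c[3:0], 1'b0} ^ ((c[4] ^ d) ? 5'b00101 : 5'b00000);
        end
    endfunction

    initial begin
        $dumpfile("crc5_tb.vcd");       // waveform dump for later viewing
        $dumpvars(0, crc5_tb);

        @(posedge clk);                 // reset is sampled on this edge
        #1 reset = 1'b0;
        expected = 5'b00000;

        for (i = 7; i >= 0; i = i - 1) begin
            data     = pattern[i];      // drive input away from the edge
            expected = crc5_step(expected, pattern[i]);
            @(posedge clk);             // DUT shifts the bit in
            #1;                         // let the register update settle
            if (crc !== expected)
                $display("MISMATCH at bit %0d: crc=%b expected=%b",
                         i, crc, expected);
        end

        $display("final crc=%b expected=%b", crc, expected);
        $finish;
    end
endmodule
```

With Icarus Verilog, a file like this (assumed here to be saved as crc5_tb.v) can be compiled with `iverilog -o crc5_tb crc5_tb.v` and run with `vvp crc5_tb`; the dumped VCD file can then be opened in a waveform viewer.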
◮ Used to observe design on target
◮ Think of this as a bus analyzer in the FPGA
◮ Probing tools (e.g. SignalTap)
◮ Design is augmented with a probing IP, FPGA is
  reprogrammed
◮ Probing is controlled through a debug probe attached to the
  FPGA (JTAG or similar)
◮ Probe internal signals, observe waveforms, trigger on complex
  conditions
◮ HDL files – lowest in the hierarchy
◮ IP block – collection of HDL files with an interface
◮ FPGA design – collection of IP blocks
◮ Vendor tools contain tools to assemble IP blocks into an FPGA
  design – e.g. Altera Qsys.
            CPU   GPU               FPGA
Simulation  QEMU  ?                 Icarus, ModelSim
Debugger    GDB   CUDA-GDB, CodeXL  SignalTap
◮ No standard in-kernel FPGA interface due to variance of
  designs
◮ Attempts do exist:
  ◮ Device Tree Overlay(s) stored in FPGA
  ◮ SDB – http://www.ohwr.org/projects/fpga-config-space
◮ Usually there are control registers in the FPGA design
◮ Usually DMA is involved (either on FPGA or CPU side)
◮ Two options for controlling the FPGA:
  ◮ Custom Linux kernel driver
  ◮ Userspace utility
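On the FPGA side, the control registers mentioned above might look like the following sketch. Everything here is an assumption made for illustration: the register map, the module and signal names, and in particular the bus, which is a generic single-cycle stand-in rather than the Avalon or AXI interface a real SoC bridge would present.

```verilog
// Hypothetical control/status register block for a data cruncher.
// The bus signals (wr_en, addr, wdata, rdata) are a simplified,
// assumed interface, not a real Avalon or AXI slave.
module ctrl_regs (
    input  wire        clk,
    input  wire        reset,
    // simple bus from the CPU side (assumed)
    input  wire        wr_en,
    input  wire [1:0]  addr,      // word address
    input  wire [31:0] wdata,
    output reg  [31:0] rdata,
    // connection to the processing logic
    output reg         start,     // one-cycle pulse
    input  wire        busy,
    output reg  [31:0] src_addr,  // e.g. DMA source address
    output reg  [31:0] length     // e.g. transfer length
);
    localparam REG_CTRL   = 2'd0,  // write bit 0 = 1 to start
               REG_STATUS = 2'd1,  // read-only, bit 0 = busy
               REG_SRC    = 2'd2,
               REG_LEN    = 2'd3;

    // Register writes
    always @(posedge clk) begin
        start <= 1'b0;                      // default keeps the pulse short
        if (reset) begin
            src_addr <= 32'd0;
            length   <= 32'd0;
        end else if (wr_en) begin
            case (addr)
                REG_CTRL: start    <= wdata[0];
                REG_SRC:  src_addr <= wdata;
                REG_LEN:  length   <= wdata;
                default: ;                  // STATUS ignores writes
            endcase
        end
    end

    // Combinational read multiplexer
    always @(*) begin
        case (addr)
            REG_STATUS: rdata = {31'd0, busy};
            REG_SRC:    rdata = src_addr;
            REG_LEN:    rdata = length;
            default:    rdata = 32'd0;      // CTRL reads back as zero
        endcase
    end
endmodule
```

A kernel driver or userspace utility would then program the source address and length, write the start bit, and poll or wait for the busy flag to clear.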
◮ Driver written to match the particular FPGA bitstream
◮ Driver can crash the host machine if written badly :-(
◮ Driver usually exports custom userland I/O
◮ splice(2)
◮ Userland accesses the FPGA registers via uio
◮ The uio is like a restricted devmem
◮ In case DMA is involved, a kernel module is needed to prepare
  the data for the DMA (e.g. to ensure cache coherency).
◮ CMA might be used to export a large slab of custom kernel
  memory to the user
◮ FPGA is clocked at 50...200 MHz, not much
◮ The fabric is usually rated at much more!
◮ Synthesize a PLL, which generates a faster clock
◮ Clock your design from the PLL
◮ Use combinational logic where possible
◮ Create pipelined designs and make sure the pipeline is
  saturated
◮ Synthesize multiple units and compute in parallel
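Pipelining and replication can be sketched together. The example below is hypothetical (module names, widths and the multiply-add operation are all assumptions): a two-stage pipeline that accepts a new operand set every clock cycle, and a wrapper that instantiates several such units side by side with a generate loop.

```verilog
// Hypothetical two-stage pipeline computing a*b + c.
// A new input can enter every cycle; each result appears two cycles later.
module mac_pipe (
    input  wire        clk,
    input  wire        reset,
    input  wire        in_valid,
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [31:0] c,
    output reg         out_valid,
    output reg  [32:0] result      // one extra bit for the adder carry
);
    reg        s1_valid;           // stage 1 holds valid data
    reg [31:0] s1_prod;            // stage 1: product
    reg [31:0] s1_c;               // stage 1: addend delayed to match

    always @(posedge clk) begin
        if (reset) begin
            s1_valid  <= 1'b0;
            out_valid <= 1'b0;
        end else begin
            s1_valid  <= in_valid;
            out_valid <= s1_valid;
        end
        s1_prod <= a * b;             // stage 1: multiply
        s1_c    <= c;
        result  <= s1_prod + s1_c;    // stage 2: add
    end
endmodule

// Hypothetical wrapper: LANES independent units working in parallel.
module mac_lanes #(parameter LANES = 4) (
    input  wire                clk, reset, in_valid,
    input  wire [LANES*16-1:0] a, b,
    input  wire [LANES*32-1:0] c,
    output wire [LANES-1:0]    out_valid,
    output wire [LANES*33-1:0] result
);
    genvar i;
    generate
        for (i = 0; i < LANES; i = i + 1) begin : lane
            mac_pipe u (
                .clk(clk), .reset(reset), .in_valid(in_valid),
                .a(a[i*16 +: 16]), .b(b[i*16 +: 16]), .c(c[i*32 +: 32]),
                .out_valid(out_valid[i]), .result(result[i*33 +: 33])
            );
        end
    endgenerate
endmodule
```

Keeping in_valid asserted on every cycle is what keeps such a pipeline saturated; any idle cycle at the input shows up as an idle cycle at the output two clocks later.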
◮ OpenCL EP 1.0 implementation for Altera FPGAs
◮ Easier for SW developers
◮ Only a thin shim must be ported to the FPGA
◮ Closed source compiler :-(
◮ Even needs a license :-C
◮ FPGAs are strong in parallel, pipelined workloads
◮ FPGAs give the user almost gate-level performance
◮ FPGAs are manufactured using bleeding-edge processes
◮ FPGAs deliver excellent performance per watt
◮ There is no simple unified Linux interface
◮ Developing FPGA content can be difficult
◮ The FPGA ecosystem is still rather closed
Contact: Marek Vasut <marex@denx.de>