
Accelerated Data Processing on SoC with FPGA
Marek Vašut <marex@denx.de>, June 3, 2015


  1. Accelerated Data Processing on SoC with FPGA
     Marek Vašut <marex@denx.de>
     June 3, 2015

  2. Marek Vašut
     ◮ Software engineer at DENX S.E. since 2011
     ◮ Embedded and Real-Time Systems Services, Linux kernel and driver development, U-Boot development, consulting, training.
     ◮ Versatile Linux kernel hacker
     ◮ Custodian at U-Boot bootloader

  3. Structure of the talk
     ◮ Motivation
     ◮ Introduction to FPGAs
     ◮ Your first FPGA data cruncher
     ◮ Interfacing with Linux
     ◮ Speeding things up

  4. Why listen to this talk
     ◮ Get fresh ideas
     ◮ Learn something new
     ◮ Reduce the energy envelope of your device
     ◮ Process data quickly and efficiently
     You won’t learn marketing stuff or random benchmark numbers.

  5. FPGA
     ◮ Abbr. for Field Programmable Gate Array
     ◮ Programmable logic
     ◮ Usually used for:
        ◮ Digital Signal Processing (DSP)
        ◮ Data crunching
        ◮ Custom hardware interfaces
        ◮ ASIC prototyping
        ◮ ...
     ◮ Common vendors: Xilinx, Altera, Lattice, Microsemi...

  6. Internal structure
     [Figure: FPGA internal structure. Image by W.T.Freeman, http://www.vision.caltech.edu/CNS248/Fpga/fpga1a.gif, CC BY 2.5: http://creativecommons.org/licenses/by/2.5/]

  7. FPGA and the outside
     ◮ FPGA has plenty of I/O options:
        ◮ Regular I/O with configurable voltage levels
        ◮ Differential I/O
        ◮ High-speed SerDes
        ◮ ...
     ◮ Usual interface with host:
        ◮ Stand-alone FPGA, usually PCIe, USB, ...
        ◮ FPGA on a CPU bus (PowerPCs, e.g. ML507)
        ◮ Built into CPU (SoCFPGA/Zynq), usually AMBA/AXI

  8. Programming the FPGA
     ◮ Each vendor has its own tools: Altera Quartus, Xilinx Vivado
     ◮ FPGA tools are often closed source :-(
     ◮ FPGA bitstream format is closed :-(
     ◮ Basic vendor tools are available free of charge
     ◮ They offer enough functionality to implement a data cruncher
     ◮ Vendor tools are needed for place-and-route and the assembler
     ◮ Third-party tools for synthesis are available

  9. Comparison to a GPU – I.

                  CPU          GPU          FPGA
     Toolchain    Open         Closed       Closed
     HW design    Proprietary  Proprietary  Your own
     HW units     Fixed        Fixed        As needed
     I/O          Limited      None         As needed

  10. HDL – Hardware Description Language
      ◮ FPGA content is written in HDLs
      ◮ HDL – Hardware Description Language
      ◮ HDLs are used to model the behavior of logic blocks
      ◮ Two major HDLs: VHDL and Verilog
      ◮ Tools often allow seamless mixing of HDLs
      ◮ Many readily-available cores under acceptable license:
         OpenCores           http://opencores.org/
         OpenCores projects  http://opencores.org/projects
         CERN Open HW Repo   http://www.ohwr.org/

  11. Modeling behavior
      HW behavior modeling vs. writing CPU code:
      ◮ Vastly different and confusing to software people :-)
      ◮ CPU: Programmer implements an algorithm
      ◮ FPGA: Programmer implements hardware to run the algorithm

  12. Implicit parallelism
      ◮ Everything in a block is executed in parallel
      ◮ All conditions in a conditional statement (if, case) are tested in parallel, which differs from C:

          if (foo == 1)
              bar <= 1'b0;
          else
              bar <= 1'b1;

      ◮ Blocks are executed in parallel:

          begin
              x <= 1'b0;
              y <= 1'b1;
          end
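
For readers coming from software, the difference from C can be modeled in a few lines of Python. This is only an illustrative sketch, not how an HDL simulator is implemented: the helper `clock_edge` and the register names `a` and `b` are hypothetical. It mimics Verilog non-blocking assignments (`<=`), where every right-hand side is sampled before any register changes.

```python
def clock_edge(regs, assignments):
    """Model one clock edge with non-blocking (<=) semantics.

    Every right-hand side is evaluated against the OLD register values,
    then all registers are updated together.
    """
    new_values = {name: rhs(regs) for name, rhs in assignments.items()}
    return {**regs, **new_values}


def c_style(regs):
    """The same two assignments executed one after another, as C would."""
    r = dict(regs)
    r["a"] = r["b"]
    r["b"] = r["a"]  # sees the value written on the previous line
    return r


# Hypothetical Verilog equivalent, inside always @(posedge clk):
#     a <= b;
#     b <= a;
swap = {"a": lambda r: r["b"], "b": lambda r: r["a"]}

start = {"a": 0, "b": 1}
parallel_result = clock_edge(start, swap)  # the two registers exchange values
sequential_result = c_style(start)         # both end up holding the old b
```

In the parallel model the two registers swap; in the C-style model the second statement reads the value the first one just wrote, so the old content of `a` is lost.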

  13. Combinatorial vs. Sequential logic
      ◮ Combinatorial logic: the immediate value of a variable is a function of the immediate inputs:

          assign Z = X ^ Y;

      ◮ Sequential logic is synchronous to a clock (involves a clocked storage element, i.e. a flip-flop):

          always @(posedge clk)
              Z <= DAT;
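
The distinction can be mimicked in software. The sketch below is an assumption-laden toy model (the names `xor_gate` and `Register` are made up for illustration): the combinatorial function answers immediately from its inputs, while the register keeps its old value until a clock edge is signalled.

```python
def xor_gate(x, y):
    """Combinatorial: the output is a pure function of the current inputs."""
    return x ^ y


class Register:
    """Sequential: the stored value changes only when posedge() is called."""

    def __init__(self, value=0):
        self.q = value

    def posedge(self, d):
        self.q = d


z = Register()
dat = 1
held = z.q        # input is already 1, but no clock edge yet: still 0
z.posedge(dat)    # rising clock edge: the register captures the input
captured = z.q
```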

  14. Verilog example
      ◮ Looks like C, based on C, but behaves differently
      ◮ Used a lot in Europe
      ◮ Example: CRC5, polynomial x^5 + x^2 + x^0
      ◮ Example modified from: http://www.asic-world.com/examples/verilog/serial_crc.html

          module crc5 (
              /* SYSTEM I/O */
              input reset,
              input clk,
              /* CRC5 I/O */
              input data,
              output reg [4:0] crc
          );

  15. Verilog example II

          always @(posedge clk) begin
              if (reset) begin
                  crc <= 5'b00000;
              end else begin
                  crc[0] <= data ^ crc[4];
                  crc[1] <= crc[0];
                  crc[2] <= crc[1] ^ data ^ crc[4];
                  crc[3] <= crc[2];
                  crc[4] <= crc[3];
              end
          end
          endmodule

  16. VHDL example
      ◮ Distinctive syntax based on Ada
      ◮ More explicit typing system than Verilog
      ◮ Used a lot in the USA
      ◮ Example: CRC5, polynomial x^5 + x^2 + x^0
      ◮ Example from http://outputlogic.com/?page_id=321

          library ieee;
          use ieee.std_logic_1164.all;

          entity crc is
              port (
                  data_in  : in  std_logic_vector (0 downto 0);
                  rst, clk : in  std_logic;
                  crc_out  : out std_logic_vector (4 downto 0));
          end crc;

  17. VHDL example II

          architecture imp_crc of crc is
              signal lfsr_q: std_logic_vector (4 downto 0);
              signal lfsr_c: std_logic_vector (4 downto 0);
          begin
              crc_out <= lfsr_q;
              lfsr_c(0) <= lfsr_q(4) xor data_in(0);
              lfsr_c(1) <= lfsr_q(0);
              lfsr_c(2) <= lfsr_q(1) xor lfsr_q(4) xor data_in(0);
              lfsr_c(3) <= lfsr_q(2);
              lfsr_c(4) <= lfsr_q(3);

              process (clk,rst) begin
                  if (rst = '1') then
                      lfsr_q <= b"11111";
                  elsif (clk'EVENT and clk = '1') then
                      lfsr_q <= lfsr_c;
                  end if;
              end process;
          end architecture imp_crc;
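
Both HDL examples describe the same shift register with feedback taps at bits 0 and 2; they differ in the reset value (all zeros in the Verilog version, all ones in the VHDL version). A software golden model makes that easy to see. The following is a sketch under assumptions: the function names `crc5_step` and `crc5_serial` are hypothetical, and the model processes one input bit per call, just as the hardware processes one bit per clock.

```python
def crc5_step(crc, data_bit):
    """Advance the 5-bit CRC register by one input bit (x^5 + x^2 + 1)."""
    feedback = data_bit ^ ((crc >> 4) & 1)   # data xor crc[4]
    crc = (crc << 1) & 0x1F                  # crc[n] takes the old crc[n-1]
    if feedback:
        crc ^= 0b00101                       # feedback enters bits 0 and 2
    return crc


def crc5_serial(bits, init=0):
    """Feed a sequence of bits through the register, starting from init."""
    crc = init
    for bit in bits:
        crc = crc5_step(crc, bit)
    return crc


verilog_style = crc5_serial([1, 0, 0, 0])          # reset value 5'b00000
vhdl_style = crc5_serial([1, 0, 0, 0], init=0x1F)  # reset value b"11111"
```

Passing the reset value as `init` keeps one model usable against either implementation.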

  18. Comparison to a GPU – II.

                           CPU          GPU           FPGA
      Languages            All          OpenCL, CUDA  OpenCL, HDLs
      Design paradigm      Sequential   Seq/Par       Parallel
      Design granularity   Instruction  Instruction   Gate
      Opt. possibility     Low          Low           High
      Opt. difficulty      Low          Low           High

  19. Development and debugging
      ◮ Simulation (on developer’s system)
      ◮ Probing (on-target)

  20. Simulation
      ◮ Simulation tools:
         Icarus Verilog  http://iverilog.icarus.com/
         ghdl            http://home.gna.org/ghdl/
         ModelSim        http://en.wikipedia.org/wiki/ModelSim/
      ◮ Write a testcase for a module in an augmented HDL
      ◮ Execute the testcase
      ◮ Observe results
      ◮ View waveforms
      ◮ Decode and inspect busses
      ◮ Trigger on complex conditions
      ◮ ...
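
The testcase flow listed on this slide (write a testcase, execute it, observe the results) would normally be written in HDL and run in one of the simulators above. As a rough stand-in, the sketch below plays the same roles in plain Python: a cycle model of a CRC5 shift register acts as the design under test, a polynomial division acts as an independent reference, and a recorded trace stands in for the waveform view. All function names here are hypothetical.

```python
def crc5_dut(crc, data, reset):
    """Cycle model of a CRC5 shift register (x^5 + x^2 + 1).

    crc is a list of five bits, index = bit number. Returns the register
    contents after one rising clock edge.
    """
    if reset:
        return [0, 0, 0, 0, 0]
    feedback = data ^ crc[4]
    return [feedback, crc[0], crc[1] ^ feedback, crc[2], crc[3]]


def to_int(crc):
    """Pack the bit list into an integer, bit 0 being the least significant."""
    return sum(bit << index for index, bit in enumerate(crc))


def reference_crc5(bits):
    """Independent reference: (message * x^5) modulo x^5 + x^2 + 1."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    value <<= 5
    for shift in range(value.bit_length() - 6, -1, -1):
        if value & (1 << (shift + 5)):
            value ^= 0b100101 << shift
    return value


def run_testcase(stimulus):
    """Apply one reset cycle, then one data bit per clock; return the trace."""
    crc = crc5_dut([0] * 5, data=0, reset=1)
    trace = [(0, None, to_int(crc))]
    for cycle, bit in enumerate(stimulus, start=1):
        crc = crc5_dut(crc, data=bit, reset=0)
        trace.append((cycle, bit, to_int(crc)))
    return trace


stimulus = [1, 0, 1, 1, 0, 0, 1, 0]
trace = run_testcase(stimulus)
for cycle, bit, crc in trace:
    print(f"cycle {cycle}: data={bit} crc={crc:05b}")   # poor man's waveform
passed = trace[-1][2] == reference_crc5(stimulus)
```

Comparing against a reference written in a different style than the design is the point of the exercise: a mistake in the shift-register equations is unlikely to be repeated in the division.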

  21. Probing
      ◮ Used to observe the design on target
      ◮ Think of this as a bus analyzer in the FPGA
      ◮ Probing tools (e.g. SignalTap)
      ◮ Design is augmented with a probing IP, FPGA is reprogrammed
      ◮ Probing is controlled through a debug probe attached to the FPGA (JTAG or similar)
      ◮ Probe internal signals, observe waveforms, trigger on complex conditions...

  22. Structuring the design
      ◮ HDL files: lowest in the hierarchy
      ◮ IP block: collection of HDL files with an interface
      ◮ FPGA design: collection of IP blocks
      ◮ Vendor toolchains include tools to assemble IP blocks into an FPGA design, e.g. Altera QSys.

  23. Comparison to a GPU – III.

                   CPU   GPU               FPGA
      Simulation   QEMU  ?                 Icarus, ModelSim
      Debugger     GDB   CUDA-GDB, CodeXL  SignalTap

  24. Linux interface
      ◮ No standard in-kernel FPGA interface due to variance of designs
      ◮ Attempts do exist:
         ◮ Device Tree Overlay(s) stored in FPGA
         ◮ SDB – http://www.ohwr.org/projects/fpga-config-space
      ◮ Usually there are control registers in the FPGA design
      ◮ Usually DMA is involved (either on the FPGA or the CPU side)
      ◮ Two options for controlling the FPGA:
         ◮ Custom Linux kernel driver
         ◮ Userspace utility

  25. Custom kernel driver
      ◮ Driver written to match the particular FPGA bitstream
      ◮ Driver can crash the host machine if written badly :-(
      ◮ Driver usually exports custom userland I/O
      ◮ splice(2)

  26. Userland approach
      ◮ Userland accesses the FPGA registers via uio
      ◮ The uio is like a restricted devmem
      ◮ If DMA is involved, a kernel module is needed to prepare the data for the DMA (e.g. to ensure cache coherency).
      ◮ CMA might be used to export a large slab of custom kernel memory to the user
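
Register access through uio boils down to an mmap() of the device node. The sketch below shows the idea, with several assumptions that are not from the talk: the class name `RegisterWindow`, the device path `/dev/uio0`, little-endian 32-bit registers, and the register offsets in the usage comment are all hypothetical. Python also gives no guarantee that each access becomes a single 32-bit bus transaction, so on real hardware a C implementation using volatile pointers is the safer choice; treat this as a prototype-level illustration.

```python
import mmap
import os
import struct


class RegisterWindow:
    """32-bit register access through a memory-mapped file or UIO device."""

    def __init__(self, path, size=mmap.PAGESIZE, map_index=0):
        self._fd = os.open(path, os.O_RDWR | os.O_SYNC)
        # UIO selects mapping N by passing N * page size as the mmap offset.
        self._mem = mmap.mmap(self._fd, size,
                              flags=mmap.MAP_SHARED,
                              prot=mmap.PROT_READ | mmap.PROT_WRITE,
                              offset=map_index * mmap.PAGESIZE)

    def read32(self, reg_offset):
        """Read a little-endian 32-bit value at the given byte offset."""
        return struct.unpack_from("<I", self._mem, reg_offset)[0]

    def write32(self, reg_offset, value):
        """Write a little-endian 32-bit value at the given byte offset."""
        struct.pack_into("<I", self._mem, reg_offset, value & 0xFFFFFFFF)

    def close(self):
        self._mem.close()
        os.close(self._fd)


# Usage on a target (device path and register layout are hypothetical):
#     fpga = RegisterWindow("/dev/uio0")
#     fpga.write32(0x00, 1)        # e.g. a control register
#     status = fpga.read32(0x04)   # e.g. a status register
#     fpga.close()
```

Because the class only needs something mmap-able, it can be exercised against an ordinary file before any hardware is available.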
