SLIDE 1
Parallel Programming and Heterogeneous Computing FPGA Accelerators Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze Operating Systems and Middleware Group FPGA Hardware Characteristics Application Specific Integrated
SLIDE 2
SLIDE 3
■ Hardware primitives include:
  □ Logic Blocks (CLB) with flip-flops, lookup tables, multiplexers, …
  □ Memory Blocks (BRAM) that act as single-port, dual-port, or FIFO memories
  □ Arithmetic Blocks (DSP) with hardware multipliers, adders, shifters, …
  □ Clock Management Blocks (MMCM) that derive clock signals with specific frequency and phase relations
  □ IO Banks with logic for various signaling standards
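To illustrate how a lookup table implements arbitrary logic, the following is a minimal C++ sketch of a 4-input LUT modeled as a 16-bit truth table. The names `Lut4` and `configure` are hypothetical, and the input count of four is an assumption chosen for brevity; it does not describe the LUT size of any particular device.

```cpp
#include <cassert>
#include <cstdint>

// Model of a 4-input lookup table (assumed size, for illustration only).
// Bit i of truth_table holds the output for the input pattern with index i.
struct Lut4 {
    std::uint16_t truth_table;

    bool eval(bool a, bool b, bool c, bool d) const {
        // The four inputs select one of the 16 stored bits.
        unsigned index = (a ? 1u : 0u) | (b ? 2u : 0u) | (c ? 4u : 0u) | (d ? 8u : 0u);
        return ((truth_table >> index) & 1u) != 0;
    }
};

// "Configuring" the LUT means enumerating all input patterns once
// and storing the desired output bit for each of them.
template <typename F>
Lut4 configure(F f) {
    std::uint16_t table = 0;
    for (unsigned i = 0; i < 16; ++i) {
        bool a = (i & 1u) != 0, b = (i & 2u) != 0, c = (i & 4u) != 0, d = (i & 8u) != 0;
        if (f(a, b, c, d)) {
            table = static_cast<std::uint16_t>(table | (1u << i));
        }
    }
    assert(table == (table & 0xFFFFu));
    return Lut4{table};
}
```

Calling `configure` with a function that XORs its four inputs yields a table that `eval` then reproduces by lookup alone, which is why any 4-input Boolean function costs the same in this model.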
Lukas Wenzel EnSem 2019 FPGA Accelerators Chart 3
FPGA Hardware Characteristics
From: Xilinx UG 474, Figure 5-1
SLIDE 4
Floorplan of a Xilinx Kintex UltraScale XCKU060 FPGA
FPGA Hardware Characteristics
SLIDE 5
■ ASICs are rated by maximum operating clock frequency
■ FPGAs have no uniform clock frequency rating
  Ø Maximum clock frequency is design-specific and constrained by the longest combinatorial path delay
■ Individual logic delays range from 0.1 ns to 0.5 ns
  Ø Small and tightly coupled design sections may run at 1 GHz
  Ø A common frequency is 250 MHz
■ Specific blocks like BRAMs may have maximum clock frequency ratings
  □ BRAMs on current Xilinx FPGAs can run at 800 MHz
FPGA Performance Characteristics
[Figure: example logic network annotated with element delays of 0.1 ns to 0.4 ns; accumulated path delays of 0.4 ns, 0.6 ns, 0.8 ns, 1.1 ns, 1.4 ns and 1.7 ns; resulting clock frequency ~ 550 MHz]
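The relation between the longest combinatorial path delay and the maximum clock frequency can be sketched in C++ as follows. This is a simplified model under stated assumptions: each path is given as a list of element delays in nanoseconds, and register setup time, clock uncertainty and routing detail are ignored. The function names `path_delay_ns` and `max_clock_mhz` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sum of the element delays along one combinatorial path, in nanoseconds.
double path_delay_ns(const std::vector<double>& element_delays_ns) {
    double total = 0.0;
    for (double d : element_delays_ns) {
        total += d;
    }
    return total;
}

// Upper bound on the clock frequency in MHz for a design with the given
// register-to-register paths. The longest path must settle within one
// clock period, so f_max = 1 / t_longest.
double max_clock_mhz(const std::vector<std::vector<double>>& paths) {
    assert(!paths.empty());
    double longest = 0.0;
    for (const auto& path : paths) {
        longest = std::max(longest, path_delay_ns(path));
    }
    assert(longest > 0.0);
    return 1000.0 / longest;  // 1 / ns is GHz; times 1000 gives MHz
}
```

Under this model a 1.7 ns path gives roughly 588 MHz. Real timing analysis also accounts for register setup time and clock uncertainty, so achievable frequencies are somewhat lower than this bound.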
SLIDE 6
FPGA designs operate at clock frequencies up to an order of magnitude lower than ASIC accelerators!
How do FPGAs achieve speedups over fixed-function ASIC implementations?
Ø Avoid overheads of general-purpose hardware:
  □ CPUs invest a large amount of logic and cycles into fetching and decoding a general-purpose instruction stream
  □ CPUs must accommodate a wide variety of applications by providing a compromise set of execution facilities (function units, forwarding paths, …)
Ø FPGAs permit application-specific microarchitectures, leveraging:
FPGA Performance Characteristics
[Figure: a task shown against the clock, once split across parallel units (Parallelization) and once split into stages (Pipelining)]
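The effect of parallelization and pipelining on throughput can be estimated with a simple cycle-count model. The sketch below is an idealized illustration, not a description of any specific FPGA design: it assumes every item needs the same number of cycles, that parallel units never contend for resources, and that the pipeline accepts one new item per cycle. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// One unit processes the items back to back; each item takes `latency` cycles.
std::uint64_t sequential_cycles(std::uint64_t items, std::uint64_t latency) {
    return items * latency;
}

// Parallelization: `units` identical copies of the unit work on different
// items at the same time, so the items are processed in rounds.
std::uint64_t parallel_cycles(std::uint64_t items, std::uint64_t latency, std::uint64_t units) {
    assert(units > 0);
    std::uint64_t rounds = (items + units - 1) / units;  // ceiling division
    return rounds * latency;
}

// Pipelining: the unit is split into `latency` one-cycle stages and accepts
// a new item every cycle. The first result appears after `latency` cycles,
// every further result one cycle later.
std::uint64_t pipelined_cycles(std::uint64_t items, std::uint64_t latency) {
    if (items == 0) {
        return 0;
    }
    return latency + items - 1;
}
```

For 1000 items with a latency of 4 cycles, the model gives 4000 cycles sequentially, 1000 cycles with four parallel units, and 1003 cycles with a four-stage pipeline. The pipeline approaches the throughput of four parallel units while reusing a single copy of the logic.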
SLIDE 7
Hardware development toolchains and steps are significantly different from software development, as final artifacts are not executable binaries but hardware configurations.
FPGA Design Process
SLIDE 8
High Level Synthesis and Data Streams
void hls_operator(stream & in, stream & out, stream_data offset) {
    // offset is declared stable: it does not change while the operator runs
    #pragma HLS interface ap_stable port=offset
    stream_element in_element, out_element;
    do {
        // overlap successive loop iterations in a hardware pipeline
        #pragma HLS pipeline
        in_element = in.read();
        out_element.tdata = in_element.tdata + offset;
        out_element.tlast = in_element.tlast;  // forward the end-of-stream marker
        out.write(out_element);
    } while (!in_element.tlast);
}
[Figure: hls_operator block with input stream in (data, last, valid, ready), output stream out (data, last, valid, ready) and input offset]
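The behavior of the stream operator can be checked on a host machine with a plain C++ model. The sketch below is an assumption-laden stand-in: the slide does not define `stream`, `stream_element` or `stream_data`, so they are modeled here as a FIFO wrapper, a data/last pair and a 64-bit integer. The HLS pragmas are omitted because they only affect synthesis, and `run_operator` is a hypothetical test helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

using stream_data = std::uint64_t;  // assumed data width

struct stream_element {
    stream_data tdata;  // payload
    bool tlast;         // marks the final element of the stream
};

// Hypothetical software stand-in for the hardware stream interface.
class stream {
public:
    stream_element read() {
        // Hardware would wait for valid data; this model requires it to be present.
        assert(!fifo_.empty());
        stream_element e = fifo_.front();
        fifo_.pop();
        return e;
    }
    void write(const stream_element& e) { fifo_.push(e); }
    bool empty() const { return fifo_.empty(); }

private:
    std::queue<stream_element> fifo_;
};

// Same logic as the slide's operator: add offset to each element until tlast.
void hls_operator(stream& in, stream& out, stream_data offset) {
    stream_element in_element, out_element;
    do {
        in_element = in.read();
        out_element.tdata = in_element.tdata + offset;
        out_element.tlast = in_element.tlast;
        out.write(out_element);
    } while (!in_element.tlast);
}

// Feeds a vector through the operator and collects the results.
std::vector<stream_data> run_operator(const std::vector<stream_data>& values, stream_data offset) {
    std::vector<stream_data> result;
    if (values.empty()) {
        return result;  // the do-while loop expects at least one element
    }
    stream in, out;
    for (std::size_t i = 0; i < values.size(); ++i) {
        in.write({values[i], i + 1 == values.size()});
    }
    hls_operator(in, out, offset);
    while (!out.empty()) {
        result.push_back(out.read().tdata);
    }
    return result;
}
```

Because the loop is a do-while, the operator always consumes at least one element; a stream without a final `tlast` element would never terminate in hardware and trips the assertion in this model.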
SLIDE 9
CAPI + MetalFS on Nallatech N250S / POWER8
https://github.com/osmhpi/metal_fs/
https://github.com/open-power/snap/
Zynq SoC + PYNQ on Ultra96 Boards
https://ultra96-pynq.readthedocs.io/
Links
SLIDE 10
Roadmap
15. Oct   Introduction
22. Oct   Terminology, OpenMP
29. Oct   <No lecture>
05. Nov   SIMD, Profiling I
12. Nov   Heatmap Discussion
19. Nov   FPGA Accelerators
26. Nov   Bring Five Simple [Heatmap|MatMul|FFT] Implementations
03. Dec   Study GPU Literature
10. Dec
17. Dec   Work/Consultation
24. Dec   <No lecture>
31. Dec   <No lecture>
07. Jan   <maybe no lecture>
14. Jan
21. Jan
28. Jan
03. Feb   Final Presentation