Parallel Programming and Heterogeneous Computing: FPGA Accelerators

SLIDE 1

Parallel Programming and Heterogeneous Computing

FPGA Accelerators

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze

Operating Systems and Middleware Group

SLIDE 2

FPGA Hardware Characteristics

Application Specific Integrated Circuits (ASIC) implement a single, fixed and usually highly optimized hardware architecture (e.g. CPUs, GPUs, …).

Field Programmable Gate Arrays (FPGA) can be configured to implement a variety of hardware designs.

➢ The FPGA fabric consists of a regular structure of hardware primitives, signal lines and routers.

[Figure: FPGA fabric with Programmable Interconnect, Logic Blocks, IO Blocks and RAM/ALU/... Blocks]

Lukas Wenzel EnSem 2019 FPGA Accelerators Chart 2

SLIDE 3

FPGA Hardware Characteristics

Hardware primitives include:

  • Logic Blocks (CLB) with flip-flops, lookup tables, multiplexers, …
  • Memory Blocks (BRAM) that act as single-port, dual-port or FIFO memories
  • Arithmetic Blocks (DSP) with hardware multipliers, adders, shifters, …
  • Clock Management Blocks (MMCM) that derive clock signals with specific frequency and phase relations
  • IO Banks with logic for various signaling standards

From: Xilinx UG 474, Figure 5-1
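The lookup tables mentioned in the list above are what make logic blocks configurable: a k-input lookup table stores one output bit per input combination and can therefore implement any boolean function of k inputs. The following is a minimal software sketch of that idea, not a description of a specific vendor primitive; the `Lut4` type, its member names and the choice of four inputs are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical software model of a 4-input lookup table (LUT).
// The 16-bit truth table plays the role of the configuration bits.
struct Lut4 {
    std::uint16_t truth_table = 0;

    // Pack the four input signals into a table address (input a is bit 0).
    static unsigned address(bool a, bool b, bool c, bool d) {
        return (a ? 1u : 0u) | (b ? 2u : 0u) | (c ? 4u : 0u) | (d ? 8u : 0u);
    }

    // "Configure" the LUT by evaluating a boolean function for all 16 input patterns.
    static Lut4 configure(const std::function<bool(bool, bool, bool, bool)>& f) {
        Lut4 lut;
        for (unsigned addr = 0; addr < 16; ++addr) {
            bool a = (addr & 1u) != 0, b = (addr & 2u) != 0;
            bool c = (addr & 4u) != 0, d = (addr & 8u) != 0;
            if (f(a, b, c, d)) {
                lut.truth_table = static_cast<std::uint16_t>(lut.truth_table | (1u << addr));
            }
        }
        return lut;
    }

    // Evaluation is a plain table lookup: the inputs form the address.
    bool lookup(unsigned addr) const {
        assert(addr < 16);
        return ((truth_table >> addr) & 1u) != 0;
    }

    bool eval(bool a, bool b, bool c, bool d) const {
        return lookup(address(a, b, c, d));
    }
};
```

Reconfiguring the same structure with a different function only changes the stored bits, which mirrors how one logic block can serve many different designs.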

SLIDE 4

FPGA Hardware Characteristics

[Figure: Floorplan of a Xilinx Kintex UltraScale XCKU060 FPGA]

SLIDE 5

FPGA Performance Characteristics

ASICs are rated by their maximum operating clock frequency.

FPGAs have no uniform clock frequency rating.

➢ The maximum clock frequency is design specific and constrained by the longest combinatorial path delay.

Individual logic delays range from 0.1 ns to 0.5 ns.

➢ Small and tightly coupled design sections may run at 1 GHz.

➢ A common frequency is 250 MHz.

Specific blocks like BRAMs may have maximum clock frequency ratings.

  • BRAMs on current Xilinx FPGAs can run at 800 MHz.

[Figure: example circuit annotated with element delays (0.1 to 0.4), path delays of 0.4 ns, 0.6 ns, 0.8 ns, 1.1 ns, 1.4 ns and 1.7 ns, and a resulting clock frequency of ~ 550 MHz]
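The relation between the longest combinatorial path and the clock frequency can be sketched in code: the clock period must be at least as long as the largest accumulated delay between two registers, so the maximum frequency is the reciprocal of that delay. The model below is illustrative only. The `LogicNetwork` type, the function names and the topological-order requirement are assumptions, and real timing analysis also accounts for effects such as setup times, routing and clock skew.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model of a combinatorial network between two register stages.
// Nodes must be numbered in topological order: every edge goes from a lower
// to a higher index.
struct LogicNetwork {
    std::vector<double> delay_ns;                            // delay of each logic element
    std::vector<std::pair<std::size_t, std::size_t>> edges;  // signal connections (from, to)
};

// Longest accumulated delay over all paths, in nanoseconds.
double critical_path_ns(const LogicNetwork& net) {
    // arrival[i] = latest time at which the output of node i is stable
    std::vector<double> arrival = net.delay_ns;
    // Processing edges sorted by source node guarantees that the arrival time
    // of a node is final before it is propagated to its successors.
    std::vector<std::pair<std::size_t, std::size_t>> edges = net.edges;
    std::sort(edges.begin(), edges.end());
    for (const auto& e : edges) {
        assert(e.first < e.second && e.second < arrival.size());
        arrival[e.second] = std::max(arrival[e.second],
                                     arrival[e.first] + net.delay_ns[e.second]);
    }
    return arrival.empty() ? 0.0 : *std::max_element(arrival.begin(), arrival.end());
}

// Maximum clock frequency in MHz for a given critical path delay in ns.
double fmax_mhz(double critical_ns) {
    return 1000.0 / critical_ns;
}
```

In this model, a critical path of 4 ns gives `fmax_mhz(4.0)`, i.e. 250 MHz, while reaching 1 GHz requires the whole path to fit into 1 ns, which with element delays of 0.1 ns to 0.5 ns leaves room for only a few logic levels.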

SLIDE 6

FPGA Performance Characteristics

FPGA designs operate at up to an order of magnitude lower clock frequencies than ASIC accelerators!

How do FPGAs achieve speedups over fixed-function ASIC implementations?

➢ Avoid overheads of general-purpose hardware:

  • CPUs invest a large amount of logic and cycles into fetching and decoding a general-purpose instruction stream
  • CPUs must accommodate a wide variety of applications by providing a compromise set of execution facilities (function units, forwarding paths, …)

➢ FPGAs permit application-specific microarchitectures, leveraging:

  • Task Parallelization
  • Pipelining

[Figure: task and clock diagrams illustrating Task Parallelization and Pipelining]
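To see why a lower clock frequency need not mean lower throughput, the two techniques can be compared in a simple idealized model. This is a sketch under assumed conditions (perfectly divisible tasks, no overhead for pipeline registers or replication); the function names and parameters are illustrative and not taken from the lecture.

```cpp
#include <cassert>

// Idealized timing model; all names and parameters are illustrative assumptions.

// One unit processes one task at a time; a task takes task_ns nanoseconds.
double sequential_throughput(double task_ns) {
    return 1.0 / task_ns;  // tasks per ns
}

// Task parallelization: `units` independent copies of the same hardware.
// The latency per task stays the same, throughput scales with the number of copies.
double parallel_throughput(double task_ns, unsigned units) {
    assert(units > 0);
    return units / task_ns;
}

// Pipelining: the task is cut into `stages` equally long steps separated by registers.
// The clock period shrinks to task_ns / stages, and once the pipeline is full,
// one task completes every clock cycle.
double pipelined_throughput(double task_ns, unsigned stages) {
    assert(stages > 0);
    double clock_period_ns = task_ns / stages;
    return 1.0 / clock_period_ns;
}

// Total time for n tasks through the pipeline: fill time plus one cycle per further task.
double pipelined_total_ns(double task_ns, unsigned stages, unsigned n) {
    assert(stages > 0);
    if (n == 0) return 0.0;
    double clock_period_ns = task_ns / stages;
    return (stages + n - 1) * clock_period_ns;
}
```

For an 8 ns task, four parallel units and a four-stage pipeline both reach four times the sequential throughput in this model; the pipeline gets there by reusing one copy of the hardware at a shorter clock period, while parallelization replicates the hardware.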

SLIDE 7

FPGA Design Process

Hardware development toolchains and steps differ significantly from software development, because the final artifacts are not executable binaries but hardware configurations.

SLIDE 8

High Level Synthesis and Data Streams

    // stream, stream_data and stream_element are assumed to be defined elsewhere.
    void hls_operator(stream & in, stream & out, stream_data offset) {
        // offset is declared as stable while the operator runs
        #pragma HLS interface ap_stable port=offset
        stream_element in_element, out_element;
        do {
            // request a pipelined implementation of the loop body
            #pragma HLS pipeline
            in_element = in.read();
            out_element.tdata = in_element.tdata + offset;
            out_element.tlast = in_element.tlast;  // forward the end-of-stream marker
            out.write(out_element);
        } while (!in_element.tlast);
    }

[Figure: hls_operator block with stream ports in and out, each carrying data, last, valid and ready signals, and an offset input]
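For experimenting with the operator's control flow outside an HLS toolchain, the stream types can be replaced by simple stand-ins. The sketch below is a plain software model and not the HLS library API: `SimStream`, `StreamElement`, `sim_operator` and the `int` payload are hypothetical replacements for the `stream`, `stream_element` and `stream_data` types used in the slide.

```cpp
#include <cassert>
#include <queue>

// Hypothetical stand-ins for the stream types of the HLS example.
struct StreamElement {
    int  tdata;  // payload
    bool tlast;  // marks the final element of a stream
};

class SimStream {
public:
    void write(const StreamElement& e) { fifo_.push(e); }

    StreamElement read() {
        // A hardware stream would wait for data; this model requires it to be present.
        assert(!fifo_.empty());
        StreamElement e = fifo_.front();
        fifo_.pop();
        return e;
    }

    bool empty() const { return fifo_.empty(); }

private:
    std::queue<StreamElement> fifo_;
};

// Same control flow as hls_operator, without the HLS pragmas:
// add offset to every element until the element marked tlast has been forwarded.
void sim_operator(SimStream& in, SimStream& out, int offset) {
    StreamElement in_element, out_element;
    do {
        in_element = in.read();
        out_element.tdata = in_element.tdata + offset;
        out_element.tlast = in_element.tlast;
        out.write(out_element);
    } while (!in_element.tlast);
}
```

Feeding the model three elements with `tlast` set on the third and an offset of 10 yields three output elements whose payloads are each increased by 10, with `tlast` preserved on the final one.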

SLIDE 9

Links

CAPI + MetalFS on Nallatech N250S / POWER8

  • https://github.com/osmhpi/metal_fs/
  • https://github.com/open-power/snap/

Zynq SoC + PYNQ on Ultra96 Boards

  • https://ultra96-pynq.readthedocs.io/

SLIDE 10

Roadmap

  • 15 Oct: Introduction
  • 22 Oct: Terminology, OpenMP
  • 29 Oct: <No lecture>
  • 05 Nov: SIMD, Profiling I
  • 12 Nov: Heatmap Discussion
  • 19 Nov: FPGA Accelerators
  • 26 Nov: Bring Five Simple [Heatmap|MatMul|FFT] Implementations
  • 03 Dec: Study GPU Literature
  • 10 Dec: Roadmap Presentation
  • 17 Dec: Work/Consultation
  • 24 Dec: <No lecture>
  • 31 Dec: <No lecture>
  • 07 Jan: <maybe no lecture>
  • 14 Jan
  • 21 Jan
  • 28 Jan
  • 03 Feb: Final Presentation