OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens - - PowerPoint PPT Presentation

openmp device offloading to fpga accelerators
SMART_READER_LITE
LIVE PREVIEW

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens - - PowerPoint PPT Presentation

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth,


slide-1
SLIDE 1

OpenMP Device Offloading to FPGA Accelerators

Lukas Sommer, Jens Korinth, Andreas Koch

slide-2
SLIDE 2

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 2

Motivation

  • Increasing use of heterogeneous systems to overcome

CPU power limitations

2

slide-3
SLIDE 3

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 3

Motivation

  • Increasing use of heterogeneous systems to overcome

CPU power limitations

  • FPGAs increasingly used for implementation of

accelerators in HPC systems (e.g. Microsoft Azure)

2

slide-4
SLIDE 4

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 4

Motivation

  • Increasing use of heterogeneous systems to overcome

CPU power limitations

  • FPGAs increasingly used for implementation of

accelerators in HPC systems (e.g. Microsoft Azure)

  • Programming of heterogeneous systems is non-trivial

2

slide-5
SLIDE 5

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 5

Motivation

  • Increasing use of heterogeneous systems to overcome

CPU power limitations

  • FPGAs increasingly used for implementation of

accelerators in HPC systems (e.g. Microsoft Azure)

  • Programming of heterogeneous systems is non-trivial
  • Desirable: Programming with a single, portable code

base

2

slide-6
SLIDE 6

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 6

OpenMP Device Offloading

  • Denote target regions to

execute on device

#pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...]

for(i=0; i<SIZE; i++){

y[i] = a*x[i]+y[i]; } }

Target Region

3

slide-7
SLIDE 7

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 7

OpenMP Device Offloading

  • Denote target regions to

execute on device

  • Specify which and how

data is transferred to device memory

#pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...]

for(i=0; i<SIZE; i++){

y[i] = a*x[i]+y[i]; } }

Target Region

3

slide-8
SLIDE 8

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 8

OpenMP Device Offloading

  • Denote target regions to

execute on device

  • Specify which and how

data is transferred to device memory

  • Use additional parallel

constructs inside target region (also target- specific, e.g. teams, distribute,...)

#pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...]

for(i=0; i<SIZE; i++){

y[i] = a*x[i]+y[i]; } }

3

slide-9
SLIDE 9

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 9

Goal

  • Implement mapping of target regions to FPGA

accelerators in LLVM Clang

– Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input

program to FPGA bitstream + software executable

4

slide-10
SLIDE 10

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 10

Goal

  • Implement mapping of target regions to FPGA

accelerators in LLVM Clang

– Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input

program to FPGA bitstream + software executable

  • Extend LLVM OpenMP Runtime

– Manage data-transfers between host and FPGA – Control device execution on the FPGA accelerator

4

slide-11
SLIDE 11

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 11

ThreadPoolComposer

  • Toolchain to fast-track

implementation of FPGA-based accelerators in heterogeneous systems

  • Synthesize accelerator from

kernel code

TPC is available as open source: https://goo.gl/qTsU3B

5

slide-12
SLIDE 12

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 12

ThreadPoolComposer

  • Toolchain to fast-track

implementation of FPGA-based accelerators in heterogeneous systems

  • Assemble (multiple) instances
  • f different kernels in top-level

design, combined with standardized host- and memory connection

TPC is available as open source: https://goo.gl/qTsU3B

5

slide-13
SLIDE 13

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 13

ThreadPoolComposer

  • Toolchain to fast-track

implementation of FPGA-based accelerators in heterogeneous systems

  • Control execution and data-

transfer using two-layered API

– Higher-level TPC API is

device/platform-agnostic, allows for portable implementation (write once, run everywhere)

TPC is available as open source: https://goo.gl/qTsU3B

5

slide-14
SLIDE 14

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 14

Compilation Flow

  • Start from a single,

portable source file

6

slide-15
SLIDE 15

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 15

Compilation Flow

  • Start from a single,

portable source file

  • Standard host-

compilation, including fallback if offloading fails

6

slide-16
SLIDE 16

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 16

Compilation Flow

  • Start from a single,

portable source file

  • Standard host-

compilation, including fallback if offloading fails

  • One device-specific

compilation flow per device type

– Limited to extracted

target regions

6

slide-17
SLIDE 17

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 17

Compilation Flow

  • Custom Clang toolchain

for TPC-based offloading to FPGA accelerators

– Identified with new

LLVM target triple

– Preserves FPGA-

specific pragmas, e.g. Vivado HLS pragmas

– Yields three artifacts

6

slide-18
SLIDE 18

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 18

Compilation Flow

  • TPC-specific software

binary

– Entry point for FPGA

device execution

– Transfers kernel

arguments and launches hardware execution using TPC API

6

slide-19
SLIDE 19

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 19

Compilation Flow

  • TPC-specific software

binary

– Entry point for FPGA

device execution

– Transfers kernel

arguments and launches hardware execution using TPC API

– Included in the

combined binary

6

slide-20
SLIDE 20

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 20

Compilation Flow

  • Hardware kernel code

extracted from target region

6

slide-21
SLIDE 21

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 21

Compilation Flow

  • Hardware kernel code

extracted from target region

  • Description of input

argument types

6

slide-22
SLIDE 22

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 22

Compilation Flow

  • Hardware kernel code

extracted from target region

  • Description of input

argument types

  • TPC automates synthesis

from kernel code and description to full FPGA design

No additional user input required!

6

slide-23
SLIDE 23

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 23

Compilation Flow

  • Hardware kernel code

extracted from target region

  • Description of input

argument types

  • TPC automates synthesis

from kernel code and description to full FPGA design

  • Resulting bitstream

features standardized host- and memory connection

6

slide-24
SLIDE 24

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 24

Host

Runtime Flow

Components:

7

slide-25
SLIDE 25

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 25

Host

Runtime Flow

Components:

  • LLVM OpenMP Runtime

Infrastructure

7

slide-26
SLIDE 26

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 26

Host

Runtime Flow

Components:

  • LLVM OpenMP Runtime

Infrastructure

  • TPC-based plugin for

LLVM OpenMP Runtime

7

slide-27
SLIDE 27

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 27

Host

Runtime Flow

Components:

  • LLVM OpenMP Runtime

Infrastructure

  • TPC-based plugin for

LLVM OpenMP Runtime

  • TPC-specific software

binary resulting from compilation

7

slide-28
SLIDE 28

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 28

Host

Runtime Flow

Components:

  • LLVM OpenMP Runtime

Infrastructure

  • TPC-based plugin for

LLVM OpenMP Runtime

  • TPC-specific software

binary resulting from compilation

  • FPGA abstraction as

provided by TPC

7

slide-29
SLIDE 29

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 29

Host

Runtime Flow

  • Host-centric: Execution starts
  • n the host

If target region is encountered:

7

slide-30
SLIDE 30

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 30

Host

Runtime Flow

  • Host-centric: Execution starts
  • n the host

If target region is encountered:

  • Transfer data to FPGA

memory

7

slide-31
SLIDE 31

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 31

Host

Runtime Flow

  • Host-centric: Execution starts
  • n the host

If target region is encountered:

  • Transfer data to FPGA

memory

  • Invoke binary
  • Sets kernel arguments
  • Launches hardware

execution

7

slide-32
SLIDE 32

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 32

Host

Runtime Flow

  • Host-centric: Execution starts
  • n the host

If target region is encountered:

  • Transfer data to FPGA

memory

  • Invoke binary
  • Sets kernel arguments
  • Launches hardware

execution

  • Transfer data back to host

7

slide-33
SLIDE 33

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 33

Evaluation

  • Proof-of-concept implementation based on development

version of LLVM/Clang/LLVM OpenMP Runtime

  • Evaluation using integer BLAS kernels from Adept

benchmark suite

– AXPY – Vector scaling – Vector dot product – Dense matrix vector multiplication – Euclidean norm for vector

8

slide-34
SLIDE 34

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 34

Evaluation

  • Using Xilinx Vivado HLS 2016.4

– Kernels annotated with Vivado HLS pipeline pragma – 250 MHz kernel operation frequency

9

#pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...]

for(i=0; i<SIZE; i++){

#pragma HLS PIPELINE II=1 y[i] = a*x[i]+y[i]; } }

slide-35
SLIDE 35

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 35

Evaluation

  • FPGA-board: VC709 (Virtex 7), 4 GiB on-board memory
  • Single instance of each kernel (“single-core“)
  • Comparison against x86-CPU (i7-6700K, 16 GiB DDR4-

RAM) running 4 OpenMP threads

10

slide-36
SLIDE 36

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 36

Insights

  • Fully functional implementation of OpenMP offloading to

FPGAs

– Custom compilation flow mapping target region to

hardware kernel without additional user input

– Extension of runtime library based on

ThreadPoolComposer API

11

slide-37
SLIDE 37

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 37

Insights

  • Fully functional implementation of OpenMP offloading to

FPGAs

– Custom compilation flow mapping target region to

hardware kernel without additional user input

– Extension of runtime library based on

ThreadPoolComposer API

  • Program a FPGA-based heterogeneous system with a

single, portable code-base

11

slide-38
SLIDE 38

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 38

Insights

  • Fully functional implementation of OpenMP offloading to

FPGAs

– Custom compilation flow mapping target region to

hardware kernel without additional user input

– Extension of runtime library based on

ThreadPoolComposer API

  • Program a FPGA-based heterogeneous system with a

single, portable code-base

  • Offloading overhead primarily dependent on size of data

transfered to/from device memory

11

slide-39
SLIDE 39

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 39

Insights

  • Fully functional implementation of OpenMP offloading to

FPGAs

– Custom compilation flow mapping target region to

hardware kernel without additional user input

– Extension of runtime library based on

ThreadPoolComposer API

  • Program a FPGA-based heterogeneous system with a

single, portable code-base

  • Offloading overhead primarily dependent on size of data

transfered to/from device memory

  • Pipelining results in 2x speedup over non-pipelined

kernel

11

slide-40
SLIDE 40

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 40

Insights

  • Single PE execution slower

than quad-core X86 (6.7x/3.4x)

12

slide-41
SLIDE 41

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 41

Insights

  • Single PE execution slower

than quad-core X86 (6.7x/3.4x)

  • Kernels very area-efficient

(1-5% logic utilization)

12

slide-42
SLIDE 42

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 42

Insights

  • Single PE execution slower

than quad-core X86 (6.7x/3.4x)

  • Kernels very area-efficient

(1-5% logic utilization)

Future work:

– Make use of coarse-grain

parallelism (e.g., teams distribute)

– Distribute computation

across multiple PEs

12

slide-43
SLIDE 43

2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 43

Questions?

  • Stop by at our poster
  • Get in touch: sommer@esa.tu-darmstadt.de

Download open-source release of ThreadPoolComposer: https://goo.gl/qTsU3B

13