OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens - PowerPoint PPT Presentation

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch

Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 2 2

Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 3 2

Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) ● Programming of heterogeneous systems is non-trivial 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 4 2

Motivation ● Increasing use of heterogeneous systems to overcome CPU power limitations ● FPGAs increasingly used for implementation of accelerators in HPC systems (e.g. Microsoft Azure) ● Programming of heterogeneous systems is non-trivial ● Desirable: Programming with a single, portable code base 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 5 2

OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ map(tofrom:y[0:SIZE]) { #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ Target Region y[i] = a*x[i]+y[i]; } } 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 6 3

OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ ● Specify which and how map(tofrom:y[0:SIZE]) data is transferred to { device memory #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ Target Region y[i] = a*x[i]+y[i]; } } 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 7 3

OpenMP Device Offloading ● Denote target regions to execute on device #pragma omp target \ map(to:x[0:SIZE]) \ ● Specify which and how map(tofrom:y[0:SIZE]) data is transferred to { device memory #pragma omp parallel for[...] for(i=0; i<SIZE; i++){ ● Use additional parallel y[i] = a*x[i]+y[i]; } constructs inside target } region (also target- specific, e.g. teams, distribute ,...) 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 8 3

Goal ● Implement mapping of target regions to FPGA accelerators in LLVM Clang – Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 9 4

Goal ● Implement mapping of target regions to FPGA accelerators in LLVM Clang – Preserve FPGA-specific pragmas (e.g. Vivado HLS) – Automated flow from OpenMP-annotated input program to FPGA bitstream + software executable ● Extend LLVM OpenMP Runtime – Manage data-transfers between host and FPGA – Control device execution on the FPGA accelerator 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 10 4

ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Synthesize accelerator from kernel code TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 11 5

ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Assemble (multiple) instances of different kernels in top-level design, combined with standardized host- and memory connection TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 12 5

ThreadPoolComposer ● Toolchain to fast-track implementation of FPGA-based accelerators in heterogeneous systems ● Control execution and data- transfer using two-layered API – Higher-level TPC API is device/platform-agnostic, allows for portable implementation (write once, run everywhere) TPC is available as open source: https://goo.gl/qTsU3B 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 13 5

Compilation Flow ● Start from a single, portable source file 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 14 6

Compilation Flow ● Start from a single, portable source file ● Standard host- compilation, including fallback if offloading fails 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 15 6

Compilation Flow ● Start from a single, portable source file ● Standard host- compilation, including fallback if offloading fails ● One device-specific compilation flow per device type – Limited to extracted target regions 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 16 6

Compilation Flow ● Custom Clang toolchain for TPC-based offloading to FPGA accelerators – Identified with new LLVM target triple – Preserves FPGA- specific pragmas, e.g. Vivado HLS pragmas – Yields three artifacts 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 17 6

Compilation Flow ● TPC-specific software binary – Entry point for FPGA device execution – Transfers kernel arguments and launches hardware execution using TPC API 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 18 6

Compilation Flow ● TPC-specific software binary – Entry point for FPGA device execution – Transfers kernel arguments and launches hardware execution using TPC API – Included in the combined binary 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 19 6

Compilation Flow ● Hardware kernel code extracted from target region 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 20 6

Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 21 6

Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types ● TPC automates synthesis from kernel code and description to full FPGA design No additional user input required! 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 22 6

Compilation Flow ● Hardware kernel code extracted from target region ● Description of input argument types ● TPC automates synthesis from kernel code and description to full FPGA design ● Resulting bitstream features standardized host- and memory connection 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 23 6

Runtime Flow Components: Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 24 7

Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 25 7

Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 26 7

Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime ● TPC-specific software binary resulting from compilation 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 27 7

Runtime Flow Components: ● LLVM OpenMP Runtime Infrastructure ● TPC-based plugin for Host LLVM OpenMP Runtime ● TPC-specific software binary resulting from compilation ● FPGA abstraction as provided by TPC 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 28 7

Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: Host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 29 7

Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 30 7

Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory ● Invoke binary ● Sets kernel arguments ● Launches hardware execution 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 31 7

Runtime Flow ● Host-centric: Execution starts on the host If target region is encountered: ● Transfer data to FPGA Host memory ● Invoke binary ● Sets kernel arguments ● Launches hardware execution ● Transfer data back to host 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth, A. Koch | TU Darmstadt 32 7

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens - PowerPoint PPT Presentation

OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 | OpenMP FPGA Device Offloading | L. Sommer, J. Korinth,

Recommended Reading A Brief Introduction to OpenMP OpenMP FAQ http://openmp.org/openmp-faq.html

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Nquire ask anything Anis Abboud, Chris Snyder, Mario Finelli Device 1 Device 2 Device 1

Introduction to OpenMP Lecture 2: OpenMP fundamentals Overview Basic Concepts in OpenMP

OpenMP Paolo Burgio paolo.burgio@unimore.it A history of OpenMP 1997 OpenMP for

Threaded Programming Lecture 2: OpenMP fundamentals Overview Basic Concepts in OpenMP

Advanced OpenMP Lecture 11: OpenMP 4.0 OpenMP 4.0 Version 4.0 was released in July 2013

Parallel Programming with OpenMP CS240A, T. Yang 1 A Programmer s View of OpenMP What

OpenMP 4.0 and Beyond! Aidan Chalk, Hartree Centre, STFC What is OpenMP? OpenMP is an API

Shared Memory Programming Introduction to OpenMP Overview Shared memory systems Basic

Application Accelerators: Application Accelerators: Application Accelerators: Application

Contents Introduction Pipelined FPGA DNN accelerators Roof-line Model and optimizing

Confidential Accelerators Stavros Volos Microsoft Research Accelerators Play Pivotal Role in

Networks Computer-Computer Comm CPU CPU CPU CPU Memory Device Device Memory Memory

Targeting GPUs with OpenMP 4.5 Device Directives James Beyer, NVIDIA Jeff Larkin, NVIDIA OpenMP

Speeding Up Reactive Transport Code Using OpenMP By Jared McLaughlin OpenMP A standard for

State of Kotlin in Android Florina Muntenescu FMuntenescu October, 2020 Why Android

Intro to TensorFlow 2.0 MBL, August 2019 Josh Gordon (@random_forests) 1 Agenda 1 of 2

Numerical Computing with Spark Hossein Falaki Challenges of numerical computation over big data

Algebraic models of dependent type theory Clive Newstead HoTT/UF Workshop 2018 Oxford, UK (in

Introduction to Matlab CSC420 Spring 2017 Introduction to Image Understanding Instructor: Sanja

Can We Containerize Internet Measurements? Chris Misa (University of Oregon) Sudarsun Kannan

16623 - Advanced Computer Vision Apps Final Project (50 % of total grade) Project Proposal Due -

Prioritizing Enterprise Customer Needs with Constructed, Augmented MaxDiff EARL London These