Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite


SLIDE 1

Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite

Marius Meyer, Tobias Kenter, Christian Plessl

Paderborn University, Germany / Paderborn Center for Parallel Computing

H2RC'20, everywhere, November 13, 2020

SLIDE 2

HPC Challenge for FPGA

An FPGA-adapted implementation of HPCC

  • OpenCL kernels and C++ host code measure both the hardware and the tools
  • Support for Intel and Xilinx FPGAs
  • Configuration options to adapt the benchmarks to device resources and architecture
  • Open source and already available on GitHub!
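All configuration options are fixed when the kernels are synthesized. As a minimal sketch of that mechanism (not code from the suite; the kernel name and default values are illustrative), an option such as DATA_TYPE can reach the OpenCL kernel as a preprocessor definition:

    /* Hypothetical sketch: configuration options arrive as -D preprocessor
     * definitions at synthesis time, so every bitstream is built for one
     * fixed configuration. */
    #ifndef DATA_TYPE
    #define DATA_TYPE float   /* e.g. overridden with -DDATA_TYPE=double */
    #endif

    __kernel void scale(__global const DATA_TYPE *restrict in,
                        __global DATA_TYPE *restrict out,
                        const DATA_TYPE k, const uint n) {
        for (uint i = 0; i < n; i++)
            out[i] = k * in[i];
    }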

SLIDE 3


The HPC Challenge Suite

Idea: The memory access patterns of other applications will always be a combination of the patterns implemented by these benchmarks

Synthetic Benchmarks

  • STREAM
  • RandomAccess
  • b_eff

Benchmark Applications

  • GEMM
  • PTRANS
  • FFT
  • HPL

Base runs: Use the provided benchmark implementations unmodified
Optimized runs: Modifications allowed with respect to the benchmark rules

SLIDE 4

HPCC FPGA Base Implementations

We focus on base implementations for now… Two main concepts increase resource utilization and performance:

Scaling (widen a single compute unit)

  • Match the data width of fixed interfaces
  • Increase parallelism to make use of more resources
  • Individual options for every benchmark

Replication (instantiate multiple compute units CU 1, CU 2, …)

  • Utilize all available interfaces
  • Increase resource usage
  • Option: NUM_REPLICATIONS

A sketch of both concepts follows below.
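The sketch below contrasts the two concepts (illustrative only; the suite's kernels differ per benchmark). Scaling widens the datapath of a single compute unit, while replication instantiates the same kernel several times so the host can bind each copy to its own memory interface.

    /* Scaling: one compute unit moves W elements per clock cycle to match
     * the width of the memory interface. W is an assumed value; the unroll
     * pragma uses Intel-style syntax. */
    #define W 16

    __kernel void scaled_copy(__global const float *restrict in,
                              __global float *restrict out,
                              const uint n) {
        /* n is assumed to be a multiple of W */
        for (uint i = 0; i < n; i += W) {
            #pragma unroll
            for (uint w = 0; w < W; w++)
                out[i + w] = in[i + w];
        }
    }

    /* Replication: the build emits NUM_REPLICATIONS copies of this kernel
     * (e.g. scaled_copy_0, scaled_copy_1, ...), and the host assigns each
     * copy to a different memory bank so all interfaces run in parallel. */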
SLIDE 5

Experimental Setup

Nallatech 520N

  • Intel Stratix 10 GX 2800
  • 4x 8 GB DDR4 SDRAM
  • x8 PCIe 3.0

Intel PAC D5005

  • Intel Stratix 10 SX 2800
  • Direct access to host memory using SVM
  • x16 PCIe 3.0

Xilinx Alveo U280

  • XCU280
  • 32x 256 MB HBM2 on FPGA
  • 2x 16 GB DDR4 SDRAM
  • x8 PCIe 4.0
SLIDE 6

Benchmark Implementation

SLIDE 7

STREAM Implementation

Operations measured by STREAM for FPGA:

  Operation name | Kernel logic
  PCIe write     | Write arrays to device
  Copy           | C[i] = A[i]
  Scale          | B[i] = k ⋅ C[i]
  Add            | C[i] = A[i] + B[i]
  Triad          | A[i] = B[i] + k ⋅ C[i]
  PCIe read      | Read arrays from device

Configuration options:

  • DATA_TYPE: Define the data type
  • VECTOR_COUNT
  • GLOBAL_MEM_UNROLL: Unroll the loops
  • DEVICE_BUFFER_SIZE: Size of the local memory buffer
  • NUM_REPLICATIONS: One kernel per memory bank
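A hedged sketch of how these options could interact in the Triad operation (not the suite's actual kernel; buffer handling is simplified and all values are illustrative):

    /* Sketch: Triad (A[i] = B[i] + k * C[i]) staged through an on-chip
     * buffer of DEVICE_BUFFER_SIZE elements, processing GLOBAL_MEM_UNROLL
     * values per clock cycle (Intel-style unroll pragma). */
    #define DATA_TYPE float
    #define GLOBAL_MEM_UNROLL 16
    #define DEVICE_BUFFER_SIZE 4096

    __kernel void triad(__global const DATA_TYPE *restrict b,
                        __global const DATA_TYPE *restrict c,
                        __global DATA_TYPE *restrict a,
                        const DATA_TYPE k, const uint n) {
        /* n is assumed to be a multiple of DEVICE_BUFFER_SIZE */
        for (uint off = 0; off < n; off += DEVICE_BUFFER_SIZE) {
            DATA_TYPE buf[DEVICE_BUFFER_SIZE];
            for (uint i = 0; i < DEVICE_BUFFER_SIZE; i += GLOBAL_MEM_UNROLL) {
                #pragma unroll
                for (uint u = 0; u < GLOBAL_MEM_UNROLL; u++)
                    buf[i + u] = b[off + i + u] + k * c[off + i + u];
            }
            for (uint i = 0; i < DEVICE_BUFFER_SIZE; i += GLOBAL_MEM_UNROLL) {
                #pragma unroll
                for (uint u = 0; u < GLOBAL_MEM_UNROLL; u++)
                    a[off + i + u] = buf[i + u];
            }
        }
    }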

SLIDE 8

STREAM Synthesis

Observations

  • The benchmark needs to support two different kernel designs to work best with all global memory types
  • STREAM achieves a high memory efficiency independent of the operation for half-duplex memory interfaces

SLIDE 9

RandomAccess Implementation

Description: Update values in a large data array in pseudo-random order. Update errors are allowed!

Configuration options:

  • DEVICE_BUFFER_SIZE: Size of the local memory buffer
  • NUM_REPLICATIONS: One kernel per memory bank

[Figure: the data array D is distributed over the memory banks; a local memory buffer holds the pseudo-random numbers R that index the next value D_k to update]

Every kernel:

  • Calculates the same pseudo-random number sequence
  • Updates a value only if its address is in the kernel's memory bank
  • Uses two pipelines to remove dependencies between reads and writes
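The bank filter can be sketched as follows (hedged: the recurrence is the standard HPCC RandomAccess generator, but the seed handling, block distribution, and parameter names are assumptions, not the suite's code):

    /* Sketch: every replication walks the same pseudo-random sequence but
     * commits an update only when the address falls into its own bank. */
    #define POLY 0x0000000000000007UL

    __kernel void random_access(__global ulong *restrict bank_data,
                                const ulong n,      /* total array size, power of two */
                                const ulong updates,
                                const uint bank_id,
                                const uint num_banks) {
        ulong s = 1;                      /* seed handling simplified */
        const ulong bank_size = n / num_banks;
        for (ulong u = 0; u < updates; u++) {
            /* HPCC RandomAccess recurrence */
            s = (s << 1) ^ (((long)s < 0) ? POLY : 0UL);
            ulong addr = s & (n - 1);
            /* assumed block distribution of the data array over the banks */
            if (addr / bank_size == bank_id)
                bank_data[addr % bank_size] ^= s;   /* update errors tolerated */
        }
    }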

SLIDE 10

RandomAccess Results

  Board      | MUOP/s | Error
  520N DDR   | 245.0  | 0.0099%
  U280 DDR   | 40.3   | 0.0106%
  U280 HBM2  | 128.1  | 0.0106%
  PAC SVM    | 0.5    | 0.0106%

  Option             | 520N DDR | U280 DDR | U280 HBM2 | PAC SVM
  NUM_REPLICATIONS   | 4        | 2        | 32        | 1
  DEVICE_BUFFER_SIZE | 1        |          |           | 1,024

Observations:

  • Compiler support for ignoring data dependencies has a huge impact on performance
  • The number of kernel replications has a negative impact on performance

SLIDE 11

FFT Implementation

Description: Batched calculation of 1d FFTs

Configuration options:

  • LOG_FFT_SIZE: Log2 of the 1d FFT size
  • NUM_REPLICATIONS: One kernel for two memory banks

[Figure: a Fetch kernel reads from Memory Bank 1 into a buffer and feeds the FFT kernel over pipes; the FFT kernel is a chain of FFT stages with shift registers; a Store kernel writes the results to Memory Bank 2]

Performance model:

  p_FFT = 5 ⋅ LOG_FFT_SIZE ⋅ 8 ⋅ f_mem ⋅ NUM_REPLICATIONS

  • Implementation is fully pipelined
  • Fetch: BRAM
  • FFT: BRAM/Logic, DSPs
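Plugging numbers into the model makes it concrete; the small check below assumes f_mem = 300 MHz (an assumed clock, not a measured value) together with the U280 DDR configuration from the next slide:

    /* Worked instance of the performance model: 8 samples enter the
     * pipeline per cycle, each amortizing 5 * LOG_FFT_SIZE floating-point
     * operations. */
    #include <stdio.h>

    int main(void) {
        const double f_mem = 300e6;     /* assumed memory-limited clock */
        const int log_fft_size = 9;     /* U280 DDR configuration */
        const int num_replications = 1;
        double p_fft = 5.0 * log_fft_size * 8 * f_mem * num_replications;
        printf("predicted: %.1f GFLOP/s\n", p_fft / 1e9);  /* 108.0 */
        return 0;
    }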
SLIDE 12

FFT Results

[Chart: global memory bandwidth efficiency of FFT [%] for 520N DDR, U280 DDR, U280 HBM2, and PAC SVM]

  Option           | 520N DDR | U280 DDR | U280 HBM2 | PAC SVM
  NUM_REPLICATIONS | 2        | 1        | 15        | 1
  LOG_FFT_SIZE     | 17       | 9        | 5         | 17

Observations

  • The design allows high utilization of the global memory for a broad range of FFT sizes
  • Performance can be achieved equally well through both configuration options

SLIDE 13

GEMM Implementation

Description: Multiply square matrices C′ = α ⋅ A ⋅ B + β ⋅ C where A, B, C, C′ ∈ ℝ^(n×n) and α, β ∈ ℝ

Configuration parameters:

  • DATA_TYPE: Used data type
  • GLOBAL_MEM_UNROLL: Number of values that are loaded into local memory per clock cycle (u)
  • BLOCK_SIZE: Size of the local memory block (b)
  • GEMM_SIZE: Size of the register block (g)
  • NUM_REPLICATIONS: Used to fill FPGA resources

Performance model (f_mem: memory-limited frequency, f_k: kernel frequency):

  t_exe = b² / (u ⋅ f_mem) + b³ / (g³ ⋅ f_k) + (b² / u) ⋅ (n / b) / f_mem
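To make the roles of b and g concrete, the plain-C sketch below shows the blocking scheme (hedged: the suite's OpenCL kernel differs, and the sizes are illustrative). The three outer loops execute b³/g³ times, matching the compute term of the model, while the inner g-loops are fully unrolled into the in-register multiplier on the FPGA:

    /* Sketch: multiply one pair of local-memory blocks (b x b) using a
     * g x g register block. c must be pre-initialized by the caller
     * (e.g. with beta * C). */
    #define BLOCK_SIZE 256   /* b */
    #define GEMM_SIZE 8      /* g */

    void block_multiply(const float a[BLOCK_SIZE][BLOCK_SIZE],
                        const float b[BLOCK_SIZE][BLOCK_SIZE],
                        float c[BLOCK_SIZE][BLOCK_SIZE]) {
        for (int i = 0; i < BLOCK_SIZE; i += GEMM_SIZE)
            for (int j = 0; j < BLOCK_SIZE; j += GEMM_SIZE)
                for (int k = 0; k < BLOCK_SIZE; k += GEMM_SIZE)
                    /* on the FPGA, the three loops below are fully
                     * unrolled, forming the register-level multiplier */
                    for (int ii = 0; ii < GEMM_SIZE; ii++)
                        for (int jj = 0; jj < GEMM_SIZE; jj++)
                            for (int kk = 0; kk < GEMM_SIZE; kk++)
                                c[i + ii][j + jj] +=
                                    a[i + ii][k + kk] * b[k + kk][j + jj];
    }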

SLIDE 14

GEMM Results

  Option            | 520N DDR | U280 DDR | U280 HBM2 | PAC SVM
  DATA_TYPE         | float    | float    | float     | float
  GLOBAL_MEM_UNROLL | 16       | 16       | 16        | 16
  GEMM_SIZE         | 8        | 8        | 8         | 8
  BLOCK_SIZE        | 512      | 256      | 256       | 512
  NUM_REPLICATIONS  | 5        | 3        | 3         | 5

[Charts: kernel frequency [MHz] and GFLOP/s for 520N DDR, U280 DDR, U280 HBM2, and PAC SVM; performance normalized to 100 MHz and a single kernel replication]

Observations

  • Large in-register multiplication leads to low kernel frequencies
  • HBM2 can also improve the performance of mainly compute-bound applications

SLIDE 15
Conclusion

  • It is a challenging task to create unbiased base implementations
  • The implementations show a similar performance efficiency on the tested devices
  • The implementations allow adjusting the utilization of relevant resources for a broad range of FPGAs

Next steps:

  • Implement the remaining base implementations
  • Offer support for multi-FPGA execution of the benchmarks
  • Utilize inter-FPGA networks