
HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration
Jiajie Li (1,2)*, Yuze Chi (2), Jason Cong (2)
(1) Tsinghua University, (2) University of California, Los Angeles
li-jj16@mails.tsinghua.edu.cn, {chiyuze,cong}@cs.ucla.edu
*Work mainly done at UCLA during Jiajie’s research internship in Summer 2019.

Background
◆ Halide [SIGGRAPH’12]: a popular image processing DSL
◆ Decoupled algorithm & schedule
  ▪ Same algorithm, schedule everywhere (?)
    • CPU: x64/ARM/PPC/…
    • GPU: CUDA/OpenCL/…
    • FPGA?
Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines, Jonathan Ragan-Kelley et al., SIGGRAPH’12

Motivation
◆ Existing effort synthesizing Halide to FPGA: Halide-HLS [TACO’17]
  ▪ Vendor-specific
    • When vendor tool behavior changes/switching vendor…
    • Portability
  ▪ Microarchitecture-specific
    • When better microarchitectures are found…
    • Maintainability
    • Performance
[Figure: Halide (algorithm, schedule) → Halide-HLS → line-buffered μarchitecture → Xilinx HLS]
Programming Heterogeneous Systems from an Image Processing DSL, Jing Pu et al., TACO’17

HeteroHalide: Our Approach
◆ Leverage HeteroCL as an intermediate representation
  ▪ Vendor-neutral: Portability
  ▪ Microarchitecture-neutral: Maintainability
  ▪ Semantics-preserving: Performance
[Figure: Halide (algorithm, schedule) → HeteroHalide → HeteroCL (algorithm, schedule, μarchitecture) → backends: general, stencil (SODA), systolic array (PolySA) → Xilinx HLS / Intel OpenCL]
HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, Yi-Hsiang Lai et al., FPGA’19
SODA: Stencil with Optimized Dataflow Architecture, Yuze Chi et al., ICCAD’18
PolySA: Polyhedral-Based Systolic Array Auto-Compilation, Jason Cong and Jie Wang, ICCAD’18

Algorithm Transformation
◆ C++-based Halide syntax → Python-based HeteroCL syntax

Halide (C++):

    Func blur_x("blur_x");
    blur_x(x, y) = (input(x, y) + input(x + 1, y) +
                    input(x + 2, y)) / 3;
    Func blur_y("blur_y");
    blur_y(x, y) = (blur_x(x, y) + blur_x(x, y + 1) +
                    blur_x(x, y + 2)) / 3;

HeteroCL (Python):

    def top(input_hcl):
        with heterocl.Stage("blur_x"):
            with heterocl.for_(y_min, y_max) as y:
                with heterocl.for_(x_min, x_max) as x:
                    tensor_blur_x[x, y] = (
                        input_hcl[x, y] +
                        input_hcl[x + 1, y] +
                        input_hcl[x + 2, y]) / 3
        with heterocl.Stage("blur_y"):
            with heterocl.for_(y_min, y_max) as y:
                with heterocl.for_(x_min, x_max) as x:
                    tensor_blur_y[x, y] = (
                        tensor_blur_x[x, y] +
                        tensor_blur_x[x, y + 1] +
                        tensor_blur_x[x, y + 2]) / 3
        return tensor_blur_y
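As a reading aid for the two listings above, here is a minimal plain-Python reference of what the two-stage blur computes. It is a sketch only and uses neither Halide nor HeteroCL. The function names, the `inp[y][x]` list-of-rows layout, and the use of integer division (assuming unsigned integer pixels) are assumptions made for illustration.

```python
# Illustrative reference only: plain Python, not Halide or HeteroCL.
# Images are lists of rows, indexed as img[y][x] (an assumption for this sketch).

def blur_x(inp, width, height):
    """Horizontal 3-tap average over x, x+1, x+2."""
    return [[(inp[y][x] + inp[y][x + 1] + inp[y][x + 2]) // 3
             for x in range(width)]
            for y in range(height)]


def blur_y(bx, width, height):
    """Vertical 3-tap average over y, y+1, y+2."""
    return [[(bx[y][x] + bx[y + 1][x] + bx[y + 2][x]) // 3
             for x in range(width)]
            for y in range(height)]


def blur(inp):
    """Run both stages; the output shrinks by 2 pixels in each dimension."""
    in_h, in_w = len(inp), len(inp[0])
    out_w, out_h = in_w - 2, in_h - 2
    # blur_x is evaluated on every input row because blur_y reads rows y+1, y+2.
    bx = blur_x(inp, out_w, in_h)
    return blur_y(bx, out_w, out_h)
```

For example, `blur([[0, 3, 6, 9]] * 3)` returns `[[3, 6]]`: each row averages to `[3, 6]` horizontally, and three identical rows average to the same values vertically.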

Schedule Transformation

Immediate transformation

  Halide:
    blur_x(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3
    blur_x.unroll(x, 4)

  Halide IR:
    for y [min = ...; extent = ...; stride = 1]:
      for x [min = ...; extent = ...; stride = 4]:
        blur_x(y, x) = ...
        blur_x(y, x + 1) = ...
        blur_x(y, x + 2) = ...
        blur_x(y, x + 3) = ...

  Merlin C:
    for (int y = ...; y < ...; y++)
      for (int x = ...; x < ...; x += 4)
        blur_x[y][x] = ...
        blur_x[y][x+1] = ...
        blur_x[y][x+2] = ...
        blur_x[y][x+3] = ...

Lazy transformation

  Halide:
    blur_x(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3
    blur_x.lazy_unroll(x, 4)

  Halide IR:
    for y [min = ...; extent = ...; stride = 1]:
      for x [min = ...; extent = ...; stride = 1; unrolled; factor = 4]:
        blur_x(y, x) = ...

  Merlin C:
    for (int y = ...; y < ...; y++)
    #pragma ACCEL parallel factor = 4 flatten
      for (int x = ...; x < ...; x++)
        blur_x[y][x] = ...
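The difference between the two columns can be sketched with a toy loop IR in Python. This is not Halide's IR or HeteroHalide's code generator. The `Loop` class and the helpers `immediate_unroll`, `lazy_unroll`, and `emit_c` are hypothetical names introduced for illustration. Only the pragma text and the shape of the generated loops are taken from the slide, and loop bounds stay elided as `...` as in the original.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Loop:
    var: str
    body: str               # statement template; "{i}" is the index expression
    stride: int = 1
    copies: int = 1         # body copies already materialized in the IR
    pragma_factor: int = 1  # > 1: unrolling is only annotated, left to the backend


def immediate_unroll(loop, factor):
    """Expand now: replicate the body `factor` times and widen the stride."""
    return replace(loop, stride=loop.stride * factor, copies=loop.copies * factor)


def lazy_unroll(loop, factor):
    """Defer: keep the loop intact and record the factor as an annotation."""
    return replace(loop, pragma_factor=factor)


def emit_c(loop):
    """Emit a C-like loop; bounds are left as '...' as on the slide."""
    lines = []
    if loop.pragma_factor > 1:
        lines.append(f"#pragma ACCEL parallel factor = {loop.pragma_factor} flatten")
    step = f"{loop.var}++" if loop.stride == 1 else f"{loop.var} += {loop.stride}"
    lines.append(f"for (int {loop.var} = ...; {loop.var} < ...; {step})")
    for k in range(loop.copies):
        index = loop.var if k == 0 else f"{loop.var}+{k}"
        lines.append("  " + loop.body.format(i=index))
    return "\n".join(lines)


base = Loop(var="x", body="blur_x[y][{i}] = ...")
immediate_code = emit_c(immediate_unroll(base, 4))
lazy_code = emit_c(lazy_unroll(base, 4))
```

Printing `immediate_code` gives a stride-4 loop with four body statements, while `lazy_code` keeps a single statement under the `#pragma ACCEL parallel` annotation, so the backend tool decides how to realize the parallelism.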

Evaluation: Productivity
◆ xfOpenCV
  ▪ An HLS library for image processing
◆ For new applications
  ▪ HeteroHalide is more compact
◆ For existing Halide programs
  ▪ HeteroHalide requires minimal changes

Lines of Code

  Application   HeteroHalide (algorithm + schedule)   xfOpenCV (ratio to HeteroHalide)
  Harris        26 + 14                               117 (2.9×)
  Gaussian      8 + 3                                 104 (9.5×)
  Dilation      2 + 1                                 80 (26.7×)
  Erosion       2 + 1                                 79 (26.3×)
  Median Blur   2 + 1                                 81 (27.0×)
  Sobel         3 + 2                                 208 (41.6×)
  Geo. Mean     -                                     (16.7×)

Xilinx xfOpenCV Library: https://github.com/Xilinx/xfopencv
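The ratios in the table can be reproduced from the line counts. The sketch below assumes each ratio is the xfOpenCV line count divided by the HeteroHalide total (algorithm plus schedule), and that the last row is the geometric mean of the six per-application ratios. The variable names are illustrative.

```python
import math

# (HeteroHalide algorithm LoC, HeteroHalide schedule LoC, xfOpenCV LoC),
# copied from the slide's table.
loc = {
    "Harris": (26, 14, 117),
    "Gaussian": (8, 3, 104),
    "Dilation": (2, 1, 80),
    "Erosion": (2, 1, 79),
    "Median Blur": (2, 1, 81),
    "Sobel": (3, 2, 208),
}

# Per-application ratio: xfOpenCV lines per HeteroHalide line.
ratios = {app: xf / (alg + sched) for app, (alg, sched, xf) in loc.items()}

# Geometric mean, computed in log space.
geo_mean = math.exp(sum(math.log(r) for r in ratios.values()) / len(ratios))

for app, r in ratios.items():
    print(f"{app:12s} {r:5.1f}x")
print(f"{'Geo. Mean':12s} {geo_mean:5.1f}x")
```

Rounded to one decimal, the computed values match the slide, including the 16.7× geometric mean.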

Evaluation: Comparison with Prior Work

Throughput (pixel/cycle)

  Application   Data Size & Type       Halide-HLS   HeteroHalide   Speedup
  Harris        640 × 640, uint8       2            4              2
  Gaussian      640 × 640, uint8       2            8              4
  Unsharp       640 × 640 × 3, uint8   1            4              4
  Geo. Mean     -                      -            -              3.2

◆ FPGA: Zynq 7020
◆ HeteroHalide scales better by leveraging state-of-the-art microarchitecture

Evaluation: Comparison w/ Original Halide on CPU
◆ Different platforms × different backends
◆ Energy efficient & performant on both platforms and all backends

                                             VU9P (AWS F1)           Stratix 10 MX
  Benchmark       Data Size & Type           Energy Eff.   Speedup   Energy Eff.   Speedup   Pattern (Backend)
  Harris          2448 × 3264, UInt8         29.11         10.31     12.36         9.89      Stencil (SODA)
  Blur            648 × 482, UInt16          10.98         3.89      9.34          7.47      Stencil (SODA)
  Linear Blur     768 × 1280 × 3, Float32    12.65         4.48      10.75         8.60      Stencil (SODA)
  Stencil Chain   1536 × 2560, UInt16        4.29          1.52      3.64          2.91      Stencil (SODA)
  Dilation        6480 × 4820, UInt16        4.69          1.66      1.99          1.59      Stencil (SODA)
  Median Blur     6480 × 4820, UInt16        12.51         4.43      5.30          4.24      Stencil (SODA)
  GEMM            1024³, Int16               9.97          3.53      -             -         Systolic Array (PolySA)
  K-Means         320 × 32, k=15, Int32      29.00         10.27     -             -         General (Merlin Compiler)
  Geo. Mean       -                          11.44         4.05      6.02          4.82      -

CPU: dual Xeon 2680v4, 14nm, 2.4GHz, 240W; VU9P on AWS F1, 16nm, 250MHz, 85W; Stratix 10 MX, 14nm, 480MHz, 192W
Not to serve as a fair comparison between the two FPGAs
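The slide does not state how the energy-efficiency column is derived. The tabulated values are consistent with the assumption that energy-efficiency gain equals speedup multiplied by the ratio of CPU power to FPGA power, using the power figures quoted in the footnote. The Python sketch below checks that assumption against a few rows; the function and variable names are illustrative.

```python
# Power figures from the slide's footnote (watts).
CPU_POWER_W = 240.0
FPGA_POWER_W = {"VU9P": 85.0, "Stratix 10 MX": 192.0}


def energy_efficiency_gain(speedup, platform):
    """Assumed relation: gain = speedup * (CPU power / FPGA power)."""
    return speedup * CPU_POWER_W / FPGA_POWER_W[platform]


# (benchmark, platform, speedup from slide, energy efficiency from slide)
rows = [
    ("Harris", "VU9P", 10.31, 29.11),
    ("Harris", "Stratix 10 MX", 9.89, 12.36),
    ("GEMM", "VU9P", 3.53, 9.97),
    ("K-Means", "VU9P", 10.27, 29.00),
    ("Median Blur", "Stratix 10 MX", 4.24, 5.30),
]

for name, platform, speedup, reported in rows:
    derived = energy_efficiency_gain(speedup, platform)
    print(f"{name:12s} {platform:14s} derived {derived:6.2f}  reported {reported:6.2f}")
```

Under this assumption the VU9P column is the speedup scaled by 240/85 (about 2.82) and the Stratix 10 MX column is the speedup scaled by 240/192 (1.25), which explains why the VU9P shows larger energy-efficiency gains even where its speedup is lower.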

Conclusion
◆ HeteroHalide
  ▪ Enables end-to-end compilation from Halide to FPGA
    • Simplified flow from Halide to accelerators
    • Minimal modifications on existing Halide programs
  ▪ Extends the existing Halide schedules
    • Generate efficient code for the backend tools
  ▪ Produces efficient accelerators by leveraging HeteroCL
    • 4.82× average speedup over 28 CPU cores
    • 2-4× speedup over existing work

References
◆ Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines, Jonathan Ragan-Kelley et al., SIGGRAPH’12
◆ Programming Heterogeneous Systems from an Image Processing DSL, Jing Pu et al., TACO’17
◆ SODA: Stencil with Optimized Dataflow Architecture, Yuze Chi et al., ICCAD’18
◆ PolySA: Polyhedral-Based Systolic Array Auto-Compilation, Jason Cong and Jie Wang, ICCAD’18
◆ HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing, Yi-Hsiang Lai et al., FPGA’19

Thank you
See you in the poster session!

Acknowledgments
This work is supported by the Intel and NSF joint research programs for Computer Assisted Programming for Heterogeneous Architectures (CAPA), Tsinghua Academic Fund for Undergraduate Overseas Studies, and Beijing National Research Center for Information Science and Technology (BNRist). We thank Prof. Zhiru Zhang (Cornell) and his research group for their help on HeteroCL and Prof. Mark Horowitz (Stanford) and his research group for their help on Halide-HLS. We also thank Amazon for providing AWS F1 credits.
