Throughput Optimization for High-Level Synthesis Using Resource - PowerPoint PPT Presentation

Throughput Optimization for High-Level Synthesis Using Resource Constraints
Peng Li (1, 2), Louis-Noël Pouchet (3, 2), Jason Cong (3, 1, 2)
1 Center for Energy-efficient Computing and Application, Peking University
2 PKU/UCLA Joint Research Institution for Science and Engineering
3 University of California, Los Angeles
January 20, 2014
Fourth International Workshop on Polyhedral Compilation Techniques, Vienna, Austria

Overview: IMPACT’14 (Very) High Level Picture
1 FPGAs: Field-Programmable Gate Arrays
2 HLS: High-Level Synthesis (from C to RTL)
3 Synthesis: “from RTL to FPGA”
4 => A toolchain from C to hardware! (ex: Xilinx Vivado ISE)
◮ Our job: C to FPGA, using source-to-source C transformations
◮ We focus on affine C programs :-)

Overview: A Previous Work: PolyOpt/HLS
The current situation:
◮ Tremendous improvements in FPGA capacity/speed/energy
◮ But off-chip communication remains very costly, and on-chip memory is scarce
⇒ Our solution: automatic, resource-aware data reuse optimization framework (combining loop transformations, on-chip buffers, and communication generation)
◮ HLS/ESL tools have made great progress (ex: AutoESL/Vivado)
◮ But extensive manual effort is still needed for best performance
⇒ Our solution: complete HLS-focused source-to-source compiler
◮ Numerous previous research works on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations
◮ But (strong) limitations in applicability / transformations supported / performance achieved
⇒ Our solution: unleash the power of the polyhedral framework (loop transformations, communication scheduling, etc.)

Overview: Performance Results
[Figure: three Pareto-optimal plots (Denoise, Segmentation, DGEMM). x-axis: total execution time (in cycles); y-axis: total BRAMs (in 16kB blocks).]

Benchmark    | Description                              | basic off-chip | PolyOpt    | hand-tuned [17]
denoise      | 3D Jacobi+Seidel-like 7-point stencils   | 0.02 GF/s      | 4.58 GF/s  | 52.0 GF/s
segmentation | 3D Jacobi-like 7-point stencils          | 0.05 GF/s      | 24.91 GF/s | 23.39 GF/s
DGEMM        | matrix-multiplication                    | 0.04 GF/s      | 22.72 GF/s | N/A
GEMVER       | sequence of matrix-vector                | 0.10 GF/s      | 1.07 GF/s  | N/A

◮ Convey HC-1 (4 Xilinx Virtex-6 FPGAs), total bandwidth up to 80GB/s
◮ AutoESL version 2011.1, using the memory/control interfaces provided by Convey
◮ Core design frequency: 150MHz, off-chip memory frequency: 300MHz

Overview: Context of This Work
How to get good throughput?
1 Good management of off-chip communications, and on-chip data reuse
2 Effective on-chip computation module
◮ Previous work focused on tiling, communication optimization, localization, and “coarse-grain” parallelism exposure
◮ This work: focus on improving the computation module (assume data is on-chip)
◮ Question: are previous techniques enough?
◮ Question: can we design techniques to improve pipelining efficiency?

Loop Pipelining: Loop Pipelining [1/3]
◮ Depth: number of cycles needed to complete one iteration
◮ Initiation Interval (II): number of cycles to wait before the next iteration can start
[Figure: pipelined loop schedule with Depth=8, II=3]
◮ Total cycles: (LoopTripCount - 1) * II + Depth
◮ Reasons for II > 1
  ◮ Data dependence (typically loop-carried dependence)
  ◮ Resource constraints (typically the resource needed is still in use)
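The total-cycles formula above can be checked with a few lines of C. This is a minimal sketch: the function names `pipelined_cycles` and `sequential_cycles` are hypothetical, and the non-pipelined count (trip count times depth) is an assumption added here for comparison, not a figure from the slides.

```c
#include <assert.h>

/* Cycle count of a pipelined loop, using the slide's formula:
 * (LoopTripCount - 1) * II + Depth. */
static long pipelined_cycles(long trip_count, long ii, long depth)
{
    assert(trip_count >= 1 && ii >= 1 && depth >= 1);
    return (trip_count - 1) * ii + depth;
}

/* Assumed baseline for comparison: without pipelining, each iteration
 * runs to completion before the next one starts. */
static long sequential_cycles(long trip_count, long depth)
{
    assert(trip_count >= 1 && depth >= 1);
    return trip_count * depth;
}
```

With the slide's Depth=8 and II=3 and a hypothetical trip count of 100, `pipelined_cycles(100, 3, 8)` gives 99 * 3 + 8 = 305 cycles, against 800 for the assumed sequential baseline. Lowering II to 1 would give 107 cycles.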

Loop Pipelining: Loop Pipelining [2/3]
Example (dgemm)
    for (i = 0; i < ni; i++)
      for (j = 0; j < nj; j++)
    #pragma AP pipeline II=1
        for (k = 0; k < nk; ++k)
          C[i][j] += alpha * A[i][k] * B[k][j];
This code has:
◮ inner loop marked for pipelining, target II is 1
◮ but a loop-carried dependence (every k iteration updates the same C[i][j])
◮ Vivado finally uses II=6

Loop Pipelining: Loop Pipelining [2/3] (after interchanging the j and k loops)
Example (dgemm)
    for (i = 0; i < ni; i++)
      for (k = 0; k < nk; k++)
    #pragma AP pipeline II=1
        for (j = 0; j < nj; ++j)
          C[i][j] += alpha * A[i][k] * B[k][j];
This code has:
◮ inner loop marked for pipelining, target II is 1
◮ no loop-carried dependence (each j iteration updates a different C[i][j])
◮ Vivado finally uses II=1, a 6x speedup
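The two dgemm variants above differ only in loop order, so they should compute the same result. The sketch below checks this on a small case in plain C. The sizes `NI`, `NJ`, `NK`, the input values, and the function names are all hypothetical, and the HLS pragma appears only as a comment so the code compiles with an ordinary C compiler.

```c
#include <assert.h>

#define NI 4  /* hypothetical sizes, chosen only for the check */
#define NJ 5
#define NK 6

/* (i, j, k) order: the innermost k loop accumulates into the same
 * C[i][j] on every iteration, a loop-carried dependence. */
static void dgemm_ijk(double C[NI][NJ], double A[NI][NK],
                      double B[NK][NJ], double alpha)
{
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            /* #pragma AP pipeline II=1 would go here */
            for (int k = 0; k < NK; k++)
                C[i][j] += alpha * A[i][k] * B[k][j];
}

/* (i, k, j) order: each innermost j iteration writes a different
 * C[i][j], so the inner loop carries no dependence. */
static void dgemm_ikj(double C[NI][NJ], double A[NI][NK],
                      double B[NK][NJ], double alpha)
{
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++)
            /* #pragma AP pipeline II=1 would go here */
            for (int j = 0; j < NJ; j++)
                C[i][j] += alpha * A[i][k] * B[k][j];
}

/* Run both orders on identical inputs; return the largest absolute
 * difference between the two results. */
static double max_abs_diff_between_orders(void)
{
    double A[NI][NK], B[NK][NJ], C1[NI][NJ], C2[NI][NJ];
    for (int i = 0; i < NI; i++)
        for (int k = 0; k < NK; k++)
            A[i][k] = (i + 2 * k) * 0.25;
    for (int k = 0; k < NK; k++)
        for (int j = 0; j < NJ; j++)
            B[k][j] = (k - j) * 0.5;
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++)
            C1[i][j] = C2[i][j] = i + j;

    dgemm_ijk(C1, A, B, 1.5);
    dgemm_ikj(C2, A, B, 1.5);

    double worst = 0.0;
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NJ; j++) {
            double d = C1[i][j] - C2[i][j];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

The interchange is legal here because, for each C[i][j], the contributions are still added in increasing k order in both versions. Only the order in which different elements of C are visited changes.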

Loop Pipelining: Loop Pipelining [3/3]
Loop pipelining in our work:
◮ Critical performance impact on loop-dominated codes
◮ We focus on pipelining inner loops only
  ◮ Each inner loop is marked for pipelining
◮ Our goal: reach II=1 through loop transformations
  ◮ Parallelization (affine scheduling and ISS)
  ◮ Split loops with resource conflicts into multiple loops
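As one possible illustration of splitting a loop's iteration set (ISS, index-set splitting), the sketch below separates boundary iterations from interior ones so that the remaining inner loop has a uniform body. The stencil, the size `N`, and the function names are hypothetical; the slides do not show the actual transformation the tool applies, so this is not the authors' algorithm.

```c
#include <assert.h>

#define N 16  /* hypothetical size */

/* Single loop: every iteration goes through the boundary test, so the
 * loop body is not uniform across the iteration space. */
static void smooth_single_loop(const int in[N], int out[N])
{
    for (int i = 0; i < N; i++) {
        if (i == 0 || i == N - 1)
            out[i] = in[i];
        else
            out[i] = in[i - 1] + in[i] + in[i + 1];
    }
}

/* After splitting the index set {0}, {1..N-2}, {N-1}: the interior
 * loop has a single branch-free body and is the one to pipeline. */
static void smooth_split(const int in[N], int out[N])
{
    out[0] = in[0];
    for (int i = 1; i < N - 1; i++)
        out[i] = in[i - 1] + in[i] + in[i + 1];
    out[N - 1] = in[N - 1];
}

/* Returns 1 when both versions produce identical output. */
static int split_matches_single_loop(void)
{
    int in[N], a[N], b[N];
    for (int i = 0; i < N; i++)
        in[i] = i * i - 3 * i;
    smooth_single_loop(in, a);
    smooth_split(in, b);
    for (int i = 0; i < N; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

The split version computes exactly the same values. What changes is that the special-case iterations no longer share a loop body with the common case.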

Affine Scheduling: Reminder: Tiling + Parallelization
First scheme: “Pluto” plus vectorization-like transformations
1 Schedule/transform the code for maximal locality + tilability
2 Move one of the parallel dimensions innermost
  ◮ integrated in pluto
  ◮ complemented by a post-pass to perform loop permutation
3 Implemented in PolyOpt/HLS [FPGA’13]
What’s special for FPGAs?
◮ inner loop parallelization is NOT vectorization (simpler problem)
◮ trade-off latency vs. resource
◮ Tile size drives the (scarce!) on-chip BRAM usage
◮ Resource sharing happens when statements are fused
◮ Conservative scheduling: a single slow iteration slows the whole loop
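To make the link between tile size and on-chip storage concrete, here is a hand-written sketch of a tiled dgemm with a parallel dimension (j) innermost. It is not output from Pluto or PolyOpt/HLS. The sizes `N` and `T`, the local arrays standing in for on-chip buffers, and all function names are assumptions made for illustration.

```c
#include <assert.h>

#define N 8  /* problem size: hypothetical */
#define T 4  /* tile size: hypothetical, must divide N */

/* Untiled reference, j innermost. */
static void dgemm_reference(double C[N][N], double A[N][N],
                            double B[N][N], double alpha)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += alpha * A[i][k] * B[k][j];
}

/* Bytes buffered at any time: three T x T tiles of doubles. */
static unsigned long tile_buffer_bytes(void)
{
    return 3UL * T * T * sizeof(double);
}

static void dgemm_tiled(double C[N][N], double A[N][N],
                        double B[N][N], double alpha)
{
    double bufA[T][T], bufB[T][T], bufC[T][T]; /* stand-ins for on-chip buffers */

    for (int it = 0; it < N; it += T)
        for (int jt = 0; jt < N; jt += T) {
            for (int i = 0; i < T; i++)        /* load the C tile once */
                for (int j = 0; j < T; j++)
                    bufC[i][j] = C[it + i][jt + j];
            for (int kt = 0; kt < N; kt += T) {
                for (int i = 0; i < T; i++)    /* load the A tile */
                    for (int k = 0; k < T; k++)
                        bufA[i][k] = A[it + i][kt + k];
                for (int k = 0; k < T; k++)    /* load the B tile */
                    for (int j = 0; j < T; j++)
                        bufB[k][j] = B[kt + k][jt + j];
                for (int i = 0; i < T; i++)    /* compute on buffered data */
                    for (int k = 0; k < T; k++)
                        for (int j = 0; j < T; j++)
                            bufC[i][j] += alpha * bufA[i][k] * bufB[k][j];
            }
            for (int i = 0; i < T; i++)        /* store the C tile */
                for (int j = 0; j < T; j++)
                    C[it + i][jt + j] = bufC[i][j];
        }
}

/* Largest absolute difference between the tiled and untiled results. */
static double tiled_max_abs_diff(void)
{
    static double A[N][N], B[N][N], C1[N][N], C2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (i + 2 * j) * 0.25;
            B[i][j] = (i - j) * 0.5;
            C1[i][j] = C2[i][j] = i + j;
        }
    dgemm_reference(C1, A, B, 1.5);
    dgemm_tiled(C2, A, B, 1.5);
    double worst = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double d = C1[i][j] - C2[i][j];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

In this sketch the buffered footprint grows with the square of `T`, which is the sense in which tile size drives buffer usage. How those bytes map onto physical BRAM blocks depends on the device and the tool and is not modeled here.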
