
SODA: Stencil with Optimized Dataflow Architecture
Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou
University of California, Los Angeles

What is stencil computation?

What is Stencil Computation?
◆ A sliding window applied on an array
  ▪ Compute output according to some fixed pattern using the stencil window
◆ Extensively used in many areas
  ▪ Image processing, solving PDEs, cellular automata, etc.
◆ Example: a 5-point blur filter with uniform weights

    // N rows by M columns; border elements of output are left untouched.
    void blur(float input[N][M], float output[N][M]) {
      for (int j = 1; j < N-1; ++j) {
        for (int i = 1; i < M-1; ++i) {
          // Average the centre element and its four direct neighbours.
          output[j][i] = (
              input[j-1][i  ] +
              input[j  ][i-1] +
              input[j  ][i  ] +
              input[j  ][i+1] +
              input[j+1][i  ]
          ) * 0.2f;
        }
      }
    }

How do people do stencil computation?

Stencil Optimization #1: Data Reuse
◆ Non-uniform partitioning-based line buffer (DAC’14)
  ▪ Full data reuse, 1 PE
  ▪ Optimal size of reuse buffer
  ▪ Optimal number of memory banks
◆ But how to parallelize?
DAC’14: An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers
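To make the data-reuse idea concrete, here is a minimal software sketch of a single PE fed through one reuse buffer. It is not the DAC’14 microarchitecture (no non-uniform partitioning or memory banking is modeled); it only shows that the 5-point blur can be computed while holding the 2*M + 1 most recent inputs. The function name `blur_streaming` and the flat row-major layout are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Streams a row-major N x M image through a circular reuse buffer that holds
// only the last 2*M + 1 inputs, i.e. the span from the first to the last
// element touched by the 5-point window. Border outputs are left at 0,
// matching the loop bounds of the blur example.
std::vector<float> blur_streaming(const std::vector<float>& input,
                                  std::size_t N, std::size_t M) {
  assert(input.size() == N * M);
  std::vector<float> output(N * M, 0.0f);
  const std::size_t depth = 2 * M + 1;      // elements kept on chip
  std::vector<float> buffer(depth, 0.0f);   // circular reuse buffer

  for (std::size_t t = 0; t < N * M; ++t) {
    buffer[t % depth] = input[t];           // exactly one new element per step
    if (t < 2 * M) continue;                // window not filled yet
    const std::size_t c = t - M;            // centre of the window ending at t
    const std::size_t j = c / M, i = c % M;
    if (j == 0 || j + 1 >= N || i == 0 || i + 1 >= M) continue;  // border
    auto at = [&](std::size_t idx) { return buffer[idx % depth]; };
    output[c] = (at(c - M) + at(c - 1) + at(c) + at(c + 1) + at(c + M)) * 0.2f;
  }
  return output;
}
```

Each input element is read from the source exactly once, which is what "full data reuse" refers to; every other access is served from the buffer.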

Stencil Optimization #2: Temporal Parallelization
◆ Multiple iterations / stages chained together (ICCAD’16)
  ▪ More iterations ⇒ better throughput
  ▪ Communication-bounded ⇒ Computation-bounded
[Figure: Input → Iteration 1 → Iteration 2 → Output, with the iterations chained on chip]
  ▪ Parallelization within each iteration?
ICCAD’16: A Polyhedral Model-Based Framework for Dataflow Implementation on FPGA Devices of Iterative Stencil Loops
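A rough way to see why chaining iterations moves a design from communication-bounded toward computation-bounded is to count arithmetic operations per byte of off-chip traffic. The sketch below assumes each element is read from external memory once and written back once regardless of how many stages are chained, and it ignores halo and tiling overheads; the function name and parameters are illustrative, not taken from the ICCAD’16 framework.

```cpp
#include <cassert>

// Arithmetic operations performed per byte of off-chip traffic when
// `iterations` stencil stages are chained on chip.
// Assumption: one external read and one external write per element,
// independent of the chain length.
double ops_per_external_byte(double ops_per_point, double bytes_per_element,
                             int iterations) {
  assert(iterations >= 1 && bytes_per_element > 0.0);
  const double external_bytes = 2.0 * bytes_per_element;  // read + write
  return iterations * ops_per_point / external_bytes;
}
```

For the 5-point blur on `float` data (4 additions and 1 multiplication per point, 4 bytes per element), `ops_per_external_byte(5, 4, 1)` gives 0.625 and `ops_per_external_byte(5, 4, 4)` gives 2.5, so under these assumptions the intensity grows linearly with the number of chained iterations.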

Stencil Optimization #3: Spatial Parallelization
◆ Element-Level Parallelization (FPGA’18)
  ▪ Fine-grained parallelism
  ▪ Private reuse buffers w/ duplication
◆ Tile-Level Parallelization (DAC’17)
  ▪ Coarse-grained parallelism
  ▪ Private reuse buffers
DAC’17: A Comprehensive Framework for Synthesizing Stencil Algorithms on FPGAs using OpenCL Model
FPGA’18: Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Stencil Optimization: Parallelization
◆ Previous works use private reuse buffers
  ▪ $k$ PEs require $S_r \times k$ storage
    • $S_r$: reuse distance, the distance from the first data element to the last data element
  ▪ Sub-optimal buffer size
  ▪ Not scalable when $k$ increases

Can we do better?

SODA as a Microarchitecture: Data Reuse
◆ For $k = 3$ PEs
  ▪ $k$ PEs only require $S_r + k - 1$ storage
  ▪ Full data reuse
  ▪ Optimal buffer size
  ▪ Scalable when $k$ increases
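The two storage figures quoted on the slides, reuse distance times the number of PEs for private buffers versus reuse distance plus the number of PEs minus one for SODA, can be compared with a few lines of arithmetic. Writing k for the number of PEs, the helper names below are hypothetical, and the reuse distance is passed in as a plain element count rather than derived from a stencil.

```cpp
#include <cassert>
#include <cstddef>

// Private reuse buffers: each of the k PEs keeps its own copy of the window.
std::size_t private_buffer_elements(std::size_t reuse_distance, std::size_t k) {
  assert(k >= 1);
  return reuse_distance * k;
}

// Shared reuse buffer with the size stated on the slide: k PEs need only
// k - 1 elements beyond a single reuse distance.
std::size_t shared_buffer_elements(std::size_t reuse_distance, std::size_t k) {
  assert(k >= 1);
  return reuse_distance + k - 1;
}
```

With an assumed reuse distance of 2049 elements and k = 3, the private scheme needs 6147 elements while the shared scheme needs 2051, and the gap widens linearly as k grows.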

SODA as a Microarchitecture: Spatial Parallelization
[Figure: reuse buffer]

SODA as a Microarchitecture: Temporal Parallelization

How do you program such a fancy architecture?

Stencil Optimization #4: Domain-Specific Language Support
◆ Complex hardware architecture
◆ How to program?
  ▪ Template-based
    • DAC’14, ICCAD’16, FPGA’18
  ▪ Domain-specific language (DSL)
    • Darkroom, Halide, Hipacc …
◆ SODA uses a DSL
  ▪ Flexible
  ▪ Programmable

SODA as an Automation Framework
[Figure: compilation flow. A user-defined SODA DSL kernel is compiled by sodac (SODA) into intermediate code, a dataflow HLS kernel, which xocc (SDAccel) turns into the FPGA bitstream. A user-defined C++ host application is compiled by g++ (GCC) against the Xilinx OpenCL API into the host program executable. The executable takes user-defined input and produces results.]
◆ Design-space exploration: how to explore?
  ▪ #PEs: up to 10^2
  ▪ #Iterations: up to 10^2
  ▪ Tile size: up to 10^6
  ▪ Large design space: up to 10^10

How do you explore such a huge design space?

SODA as an Exploration Engine: Resource Modeling
◆ Modularized design enabling accurate architecture-specific modeling
[Figure: Resource Modeling Flow. sodac takes the SODA DSL input and produces the HLS code of each module and the number of each module. For each module, the module model database is checked: if it has no resource model for the module, HLS is run for that module. Once every module has a model, the complete resource model is assembled.]
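The resource modeling flow can be sketched as a cache in front of an expensive HLS run. Everything named here (`Resources`, `ModuleDatabase`, `estimate`, the `run_hls` callback) is hypothetical and stands in for whatever sodac does internally; the sketch also assumes the complete model is the sum of per-module usage weighted by module count, which the slide suggests but does not spell out.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

struct Resources {
  int bram = 0, dsp = 0, lut = 0, ff = 0;
};

// Hypothetical module database: caches one resource model per module type so
// that the expensive HLS run happens at most once per module.
class ModuleDatabase {
 public:
  explicit ModuleDatabase(std::function<Resources(const std::string&)> run_hls)
      : run_hls_(std::move(run_hls)) {}

  const Resources& lookup(const std::string& module) {
    auto it = models_.find(module);
    if (it == models_.end()) {  // no model yet: run HLS for this module
      it = models_.emplace(module, run_hls_(module)).first;
    }
    return it->second;
  }

 private:
  std::function<Resources(const std::string&)> run_hls_;
  std::map<std::string, Resources> models_;
};

// Assumed aggregation: total usage = sum of (module count x module model).
Resources estimate(ModuleDatabase& db,
                   const std::map<std::string, int>& module_counts) {
  Resources total;
  for (const auto& entry : module_counts) {
    assert(entry.second >= 0);
    const Resources& r = db.lookup(entry.first);
    total.bram += entry.second * r.bram;
    total.dsp += entry.second * r.dsp;
    total.lut += entry.second * r.lut;
    total.ff += entry.second * r.ff;
  }
  return total;
}
```

Because the database persists across calls, evaluating many design points that reuse the same module types triggers HLS only for modules that have not been seen before.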

SODA as an Exploration Engine: Performance Modeling
[Figure: Performance Roofline Model. x-axis: #PEs / stage; y-axis: throughput. Curves: throughput limited by external bandwidth, throughput achieved.]
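A roofline-style estimate of the kind the figure depicts can be written as a single `min`. This is a generic sketch, not SODA's actual performance model: it assumes throughput scales linearly with the number of PEs per stage until the external-bandwidth ceiling is reached, and the function name and units are placeholders.

```cpp
#include <algorithm>
#include <cassert>

// Roofline-style estimate: compute-side throughput grows with the number of
// PEs per stage (assumed linear) until external memory bandwidth caps it.
double roofline_throughput(int pes_per_stage, double per_pe_throughput,
                           double bandwidth_limit) {
  assert(pes_per_stage >= 1);
  return std::min(pes_per_stage * per_pe_throughput, bandwidth_limit);
}
```

For example, with a per-PE throughput of 1.0 and a bandwidth ceiling of 8.0 (arbitrary units), 2 PEs per stage yield 2.0 while 16 PEs per stage are capped at 8.0, so adding PEs beyond the ceiling only spends resources.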

SODA as an Exploration Engine: Design-Space Pruning
◆ Unroll factor $k$
  ▪ Only powers of 2 make sense due to the memory port
◆ Iteration factor $q$
  ▪ Bounded by available resources, $kq \le 10^2$
◆ Tile size $T_0, T_1, \ldots$
  ▪ Bounded by available on-chip storage
  ▪ Searched via branch-and-bound
◆ Can finish exploration in up to 3 minutes
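The first two pruning rules, unroll factors restricted to powers of 2 and the product of unroll and iteration factors bounded by about 10^2, can be illustrated with a small enumeration. The function name is hypothetical, and the branch-and-bound search over tile sizes is deliberately left out of this sketch.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Enumerates (unroll factor, iteration factor) candidates after pruning:
// the unroll factor takes only powers of two, and the product of the two
// factors may not exceed max_product (the slide gives 10^2 as the bound).
std::vector<std::pair<int, int>> candidate_factors(int max_product) {
  assert(max_product >= 1);
  std::vector<std::pair<int, int>> candidates;
  for (int unroll = 1; unroll <= max_product; unroll *= 2) {
    for (int iter = 1; unroll * iter <= max_product; ++iter) {
      candidates.emplace_back(unroll, iter);
    }
  }
  return candidates;
}
```

Calling `candidate_factors(100)` returns 197 pairs, from (1, 1) up to (64, 1), each of which would still need a tile-size search and a model evaluation.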

What does your result look like?

Experimental Results: Model Accuracy
◆ Model prediction targets
  ▪ Resource modeling target: post-synthesis resource utilization
  ▪ Performance modeling target: on-board execution throughput

Prediction Item   BRAM    DSP   LUT     FF      Throughput
Average Error     1.84%   0%    6.23%   7.58%   4.22%

Experimental Results: Performance Comparison
[Figure: two bar charts of normalized performance. Non-Iterative Stencil panel: 24t-CPU, DAC'14, SODA. Iterative Stencil panel: 24t-CPU, ICCAD'16, FPGA'18, SODA. Benchmarks on the x-axes, in order: SOBEL 2D, DENOISE 2D, DENOISE 3D, JACOBI 2D, JACOBI 3D, SEIDEL 2D, HEAT 3D.]
◆ Synthesis Tool: SDAccel / Vivado HLS 2017.2
◆ FPGA: ADM-PCIE-KU3 w/ XCKU060
◆ CPU: Intel Xeon E5-2620 v3 x2

What are the takeaways?

SODA: Stencil with Optimized Dataflow Architecture
◆ SODA is a Microarchitecture
  ▪ Flexible & scalable reuse buffers for multiple PEs
◆ SODA is an Automation Framework
  ▪ From DSL to hardware, end-to-end automation
◆ SODA is an Exploration Engine
  ▪ Optimal parameters via model-driven exploration

References
  ▪ DAC’14: An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers, Cong et al.
  ▪ ICCAD’16: A Polyhedral Model-Based Framework for Dataflow Implementation on FPGA Devices of Iterative Stencil Loops, Natale et al.
  ▪ DAC’17: A Comprehensive Framework for Synthesizing Stencil Algorithms on FPGAs using OpenCL Model, Wang and Liang
  ▪ FPGA’18: Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL, Zohouri et al.

Thank you! Q&A

Acknowledgments
This work is partially supported by the Intel and NSF joint research program for Computer Assisted Programming for Heterogeneous Architectures (CAPA), and the contributions from Fujitsu Labs, Huawei, and Samsung under the CDSC industrial partnership program. We thank Amazon for providing AWS F1 credits.
