GPRM: Towards Automated Design Space Exploration and Code Generation using Type Transformations

First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'15), Sunday, November 15, 2015, Austin, TX


SLIDE 1

GPRM

Towards Automated Design Space Exploration and Code Generation using Type Transformations

S Waqar Nabi & Wim Vanderbauwhede

First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'15) Sunday, November 15, 2015 Austin, TX

www.tytra.org.uk

SLIDE 2

Using Safe Transformations and a Cost-Model for HPC on FPGAs

The TyTra project context
  • Our approach, blue-sky target, down-to-earth target, where we are now, how we are different

Key contributions
  • (1) Type transformations to create design-variants, (2) a new Intermediate Language, and (3) an FPGA cost model

The cost model
  • Performance and resource-usage estimates, some results

Using safe transformations and an associated light-weight cost-model opens the route to a fully automated design-space exploration flow

SLIDE 3

THE CONTEXT

Our approach, blue-sky target, down-to-earth target, where we are now, how we are different

SLIDE 4

Blue Sky Target

SLIDE 5

Blue Sky Target

(Diagram: Legacy Scientific Code and a Heterogeneous HPC Target Description feed the Cost Model, yielding an Optimized HPC solution!)

The goal that keeps us motivated! (The pragmatic target is somewhat more modest…)

SLIDE 6

The Short-Term Target

Our focus is on FPGA targets, and we currently require design entry in a Functional Language using High-Level Functions (maps, folds) [a kind of DSL]

SLIDE 7

The cunning plan…

1. Use the functional programming paradigm to (auto) generate program-variants which translate to design-variants on the FPGA.

2. Create an Intermediate Language that:
  • Is able to capture points in the entire design-space
  • Allows a light-weight cost-model to be built around it
  • Is a convenient target for a front-end compiler

3. Create a light-weight cost-model that can estimate the performance and resource-utilization of each variant.

A performance portable code-base that builds on a purely software programming paradigm.

SLIDE 8

And you may very well ask…


The jury is still out…

SLIDE 9

How our work is different

Our observations on limitations of current tools and flows:

1. Design-entry in a custom high-level language which nevertheless has hardware-specific semantics
2. Architecture of the FPGA solution is specified by the programmer; compilers cannot optimize it
3. Solutions create soft-processors on the FPGA, not optimized for HPC (orientation towards embedded applications)
4. Design-space exploration requires prohibitively long times
5. Compiler is application-specific (e.g. DSP applications)

We are not there yet, but in principle, our approach entirely eliminates the first four, and mitigates the fifth.

SLIDE 10

KEY CONTRIBUTIONS

(1) Type transformations for generating program variants, (2) a new Intermediate Language, and (3) a light-weight Cost Model

SLIDE 11
1. Type Transformations to Generate Program Variants

Functional programming types
  • More general than types in C
  • Our focus is on types of functions that perform array operations
  • reshape, maps and folds

Type transformations
  • Can be derived automatically
  • Provably correct
  • Essentially reshape the arrays

A functional paradigm with high-level functions allows creation of design-variants that are correct-by-construction.

SLIDE 12

Illustration of Variant Generation through Type-Transformation

  • typeA : Vect (im*jm*km) dataType                -- 1D data (single execution thread)
  • typeB : Vect km (Vect (im*jm) dataType)         -- transformed 2D data (km concurrent execution threads)
  • output = mappipe kernel_func input              -- original program
  • inputTr = reshapeTo km input                    -- reshaping data
  • output = mappar (mappipe kernel_func) inputTr   -- new program

Simple and provably correct transformations in a high-level functional language translate to design-variants on the FPGA.
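The same reshaping can be mimicked in plain software to see why the transformation is safe; a minimal Python sketch (the names kernel_func and reshape_to mirror the slide's functional pseudocode and are illustrative, not the TyTra API):

```python
def kernel_func(x):
    # Stand-in for the kernel applied to each element of the array.
    return 2 * x + 1

def reshape_to(km, xs):
    # Split a flat list into km equal-sized chunks: the move from
    # Vect (im*jm*km) dataType to Vect km (Vect (im*jm) dataType).
    chunk = len(xs) // km
    return [xs[i * chunk:(i + 1) * chunk] for i in range(km)]

im, jm, km = 2, 3, 4
flat = list(range(im * jm * km))

# Original program: one map over the whole index space (a single pipeline).
output = [kernel_func(x) for x in flat]

# Transformed program: km independent maps (km concurrent pipelines on the FPGA).
output_tr = [[kernel_func(x) for x in c] for c in reshape_to(km, flat)]

# Flattening the transformed output recovers the original result, so the
# transformation is semantics-preserving: safe to apply automatically.
assert [x for c in output_tr for x in c] == output
```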

SLIDE 13
2. A New Intermediate Language

Manage-IR, dealing with:
  • memory objects (arrays)
  • streams (loops over arrays)
  • offset streams
  • loops over the work-unit
  • block-memory transfers

Compute-IR:
  • Streaming model
  • SSA instructions define the datapath

Strongly and statically typed; all computations expressed in SSA (Static Single Assignment) form; largely (and deliberately) based on the LLVM-IR
SLIDE 14

2. A New Intermediate Language
SLIDE 15

Design Space Estimation: The Cost Model

3. Cost Model
SLIDE 16

THE FPGA COST-MODEL

Performance Estimates, Resource-Utilization Estimates, Experimental Results

SLIDE 17

The Cost-Model Use-Case

A set of standardized experiments feeds target-specific empirical data to the cost model; the rest comes from the IR description.

SLIDE 18

Two Types of Estimates

Resource-Utilization Estimates
  • ALUTs, REGs, DSPs

Performance Estimates
  • Estimating memory-access bandwidth for specific data patterns
  • Estimating FPGA operating frequency

Both estimates are needed to allow the compiler to choose the best design variant.

SLIDE 19
1. Resource Estimates

Observation
  • The regularity of the FPGA fabric allows some very simple first- or second-order expressions to be built for most instructions, based on a few experiments.

Key Determinants
  • Primitive (SSA) instructions used in the IR of the kernel functions
  • Data-types
  • Structure of the various functions (pipe, comb, par, seq)
  • Control-logic overhead

A set of one-time simple synthesis experiments on the target device helps us create a very accurate resource-utilization cost model
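The determinants above can be sketched as a toy first-order resource model: one linear expression per (instruction, resource) pair, with coefficients that would in practice be fitted from the one-time synthesis experiments. The instruction names and coefficients below are illustrative, not measured TyTra data:

```python
# cost(width) = a * width + b, one (a, b) pair per (instruction, resource).
# Coefficients are invented placeholders standing in for fitted values.
COST_TABLE = {
    ("add", "ALUT"): (1.0, 2.0),   # adders grow roughly linearly with bit-width
    ("add", "REG"):  (1.0, 0.0),
    ("mul", "DSP"):  (0.0, 1.0),   # one DSP block per multiply (illustrative)
    ("mul", "ALUT"): (0.5, 4.0),
}

def instr_cost(op, width, resource):
    # First-order expression for one SSA instruction; unknown pairs cost 0.
    a, b = COST_TABLE.get((op, resource), (0.0, 0.0))
    return a * width + b

def kernel_cost(instrs, resource):
    # Sum the per-instruction estimates over the kernel's SSA instructions.
    return sum(instr_cost(op, w, resource) for op, w in instrs)

# A toy kernel: three 32-bit adds and one 32-bit multiply.
kernel = [("add", 32)] * 3 + [("mul", 32)]
print(kernel_cost(kernel, "ALUT"))  # 3*(32+2) + (0.5*32+4) = 122.0
print(kernel_cost(kernel, "DSP"))   # 1.0
```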

SLIDE 20

Resource Estimates - Example

(Example plots: cost expressions for integer division and integer multiplication.)

Light-weight cost expressions associated with every legal SSA instruction in the TyTra-IR

SLIDE 21
2. Performance Estimate

Effective Work-Unit Throughput (EWUT)
  • Work-Unit = executing the kernel over the entire index-space

Key Determinants
  • Memory execution model
  • Sustained memory bandwidth for the target architecture and design-variant
  • Data-access pattern
  • Design configuration of the FPGA
  • Operating frequency of the FPGA
  • Compute-bound or IO-bound?

The performance model is trickier, especially calculating estimates of sustained memory bandwidth and FPGA operating frequency.
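As a rough illustration of how these determinants combine, an EWUT-style estimate can be sketched as the bottleneck of a compute rate and an IO rate. This is a hypothetical simplification for intuition, not the model's actual expressions; all names and numbers are invented:

```python
def ewut(freq_hz, words_per_cycle, sustained_bw_words_s, work_unit_words):
    # Compute-side rate: a full pipeline drains words_per_cycle results per cycle.
    compute_rate = freq_hz * words_per_cycle
    # IO-side rate: memory can only feed the pipeline at the sustained bandwidth.
    io_rate = sustained_bw_words_s
    # The design is compute-bound or IO-bound; the bottleneck wins.
    rate = min(compute_rate, io_rate)
    # Work-units completed per second over the whole index-space.
    return rate / work_unit_words

# A 200 MHz pipeline producing 1 word/cycle, memory sustaining 150 Mwords/s,
# and a work-unit of 1e6 words: this design-variant is IO-bound.
print(ewut(200e6, 1, 150e6, 1e6))  # 150.0 work-units/s
```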


SLIDE 23

Performance Estimate Dependence on Memory Execution Model

(Timeline diagram: activities are Host → Device-DRAM, Device-DRAM → Device-Buffers, Device-Buffers → Offset-Buffers, and Kernel Pipeline Execution.)

Three types of memory execution. A given design-variant can be categorized based on:

  • Architectural description
  • IR description
SLIDE 25

Performance Estimate Dependence on Memory Execution Model

(Timeline diagram, Type A: every work-unit iteration performs all memory-transfer activities and kernel pipeline execution.)

SLIDE 26

Performance Estimate Dependence on Memory Execution Model

(Timeline diagram, Type B: certain transfer activities occur in the first iteration only and the last iteration only; all other iterations run the remaining activities.)

SLIDE 27

Performance Estimate Dependence on Memory Execution Model

(Timeline diagram, Type C: certain transfer activities occur in the first iteration only and the last iteration only; all other iterations overlap the remaining activities with kernel pipeline execution.)

Once a design-variant is categorized, performance can be estimated accordingly
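The effect of the categorization can be sketched numerically. The timings below are invented, and the overlap assumptions are a simplified reading of the timelines: Type A pays every transfer in every iteration, while a Type-C-like variant pays host transfers only at the ends and overlaps the rest:

```python
def time_type_a(n_iters, t_host, t_dram, t_buf, t_kernel):
    # Type A: every iteration performs all transfers and the kernel, in sequence.
    return n_iters * (t_host + t_dram + t_buf + t_kernel)

def time_type_c(n_iters, t_host, t_dram, t_buf, t_kernel):
    # Type-C-like: host transfers only in the first and last iterations;
    # the remaining activities overlap, so the steady-state cost per
    # iteration is the slowest overlapped activity, not the sum.
    steady = max(t_dram, t_buf, t_kernel)
    return t_host + (n_iters * steady) + t_host

a = time_type_a(1000, t_host=5.0, t_dram=2.0, t_buf=1.0, t_kernel=4.0)
c = time_type_c(1000, t_host=5.0, t_dram=2.0, t_buf=1.0, t_kernel=4.0)
print(a, c)  # 12000.0 4010.0 -- same kernel, very different estimates
```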


SLIDE 29

Performance Estimate Dependence on Data Access Pattern

We have defined a rho (ρ) factor, a scaling factor on the peak memory bandwidth:
  • Varies from 0 to 1
  • Based on the data-access pattern
  • Derived empirically through one-time standardized experiments on the target node
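A sketch of how the ρ factor would be used; the pattern names and ρ values below are illustrative placeholders, not measured data:

```python
# One empirically derived rho per data-access pattern (invented values).
RHO = {
    "sequential": 0.90,   # streaming reads come close to peak bandwidth
    "strided":    0.40,
    "random":     0.05,
}

def sustained_bandwidth(peak_gbs, pattern):
    # Sustained bandwidth is modeled as rho * peak, with rho in [0, 1].
    rho = RHO[pattern]
    assert 0.0 <= rho <= 1.0
    return rho * peak_gbs

print(sustained_bandwidth(25.6, "sequential"))  # roughly 23 GB/s sustained
print(sustained_bandwidth(25.6, "random"))      # a small fraction of peak
```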

SLIDE 30

2. Performance Estimate

(Same key determinants as Slide 21.) Determined from the IR description of the design-variant

SLIDE 31

Performance Estimates: The Parameters and their Evaluation

SLIDE 32

Performance Estimates: Parameters from the Architecture Description

SLIDE 33

Performance Estimates: Parameters Calculated Empirically

SLIDE 34

Performance Estimates: Parameters Derived from the IR Description of the Kernel

SLIDE 35

Performance Estimates: The Expressions


SLIDE 41

Performance Estimates: Experimental Results (Type C)

Estimated (E) vs actual (A) cost and throughput for the C2 and C1 configurations of a successive over-relaxation kernel.

[Note that the cycles/kernel are estimated very accurately, but the Effective Work-Unit Throughput (EWUT) is off because of the inaccuracy of the frequency estimate for the FPGA]

SLIDE 42

Does the TyTra Approach Work?

SLIDE 43

Design-Space Exploration?

SLIDE 44

CONCLUSION

SLIDE 45

The Route to Automated Design Space Exploration on FPGAs for HPC Applications

The larger aim is to create a turn-key compiler for: legacy scientific code → heterogeneous HPC platform
  • Current focus is on FPGAs, and on using Functional-Language design entry

Our main contributions are:
  • Type transformations to create design-variants,
  • a new Intermediate Language, and
  • an FPGA cost model

Our FPGA Cost Model
  • Works on the TyTra-UIR; is light-weight, accurate (enough), and allows us to evaluate design-variants

Using safe transformations on a functional-language paradigm and a light-weight cost-model brings us closer to a turn-key HPC compiler for legacy code

SLIDE 46


Acknowledgement

We wish to acknowledge support by EPSRC through grant EP/L00058X/1.

The woods are lovely, dark and deep,
But I have promises to keep,
And lines to code before I sleep,
And lines to code before I sleep.

SLIDE 47

EXTRAS

SLIDE 48

Parallel Approaches

  • What we do is very similar to:
  • Loop optimizations to accelerate a scientific application
  • Using skeletons to create a high-level abstraction for parallel programming
  • Tools that automatically explore the design-space
SLIDE 49

Our Approach to a Light-Weight Cost Model

An IR sufficiently low-level to expose the parameters needed for the cost model
  • The TyTra-IR has sufficient structural information to associate it directly with resources on an FPGA

Because TyTra-IR is a customized language, we can ensure that:
  • All legal instructions (and structures) have a cost associated with them
  • As long as the front-end compiler can target a HLL on the TyTra-IR, we can cost high-level program variants

Costing resources on specific FPGA devices, and estimating memory bandwidth for various patterns on the target node, requires some empirical data
  • We are working on creating a set of standardized experiments that provide this data

We are not there yet, but in principle, our approach entirely eliminates all these limitations.

SLIDE 50

Quite a few avenues…

  • Experiment with more kernels, their program-variants, estimated vs actual costs, (correct) code-generation; use (CHStone) benchmarks
  • Computation-aware caches, optimized for halo-based scientific computations
  • Integrate with the Altera-OpenCL platform for host-device communication
  • Back-end optimizations, LLVM passes, LLVM → TyTra-IR translation
  • Route to TyTra-IR from SAC
  • Integrate the TyTra-FPGA flow with the SAC GPU (OpenCL) flow for heterogeneous targets
  • Use of Multiparty Session Types to ensure correctness of transformations
  • Even code-generation for clusters?
  • Abstract descriptions of target hardware
  • SystemC-TLM model to profile applications and high-level partitioning in a heterogeneous environment


SLIDE 52

The platform model for TyTra (FPGA)


SLIDE 53

The Manage-IR; Memory Objects

TyTra-IR  | Description                                  | OpenCL view     | LLVM-SPIR view | Hardware (FPGA)
Cmem      | Constant Memory                              | Constant Memory | 3: Constant    |
Imem      | Instruction Memory                           | Constant Memory |                | DistRAM / BRAM
Pipemem   | Pipeline registers                           |                 |                | DistRAM
Pmem      | Private Memory (Data Mem for Instruc’ Proc’) | Private Memory  | 0: Private     | DistRAM
Cachemem  | Data (and Constant) Cache                    |                 |                | DistRAM / BRAM
Lmem      | Local (shared) memory                        | Local Memory    | 4: Local       | M20K (BRAM) or DistRAM
Gmem      | Global memory                                | Global Memory   | 1: Global      | On-board DRAM
Hmem      | Host memory                                  | Host Memory     |                | Host communication

SLIDE 54

The Manage-IR; Stream Objects

Stream objects:
  • Can have a 1-1 or many-1 relation with memory objects
  • Have a 1-1 relation with arguments to pipe functions (i.e. port connections to compute-cores)

SLIDE 55

The Manage-IR; repeat blocks

  • Repeatedly call a kernel without referring back to the host (outer loop)
  • May involve block memory transfers between iterations

SLIDE 56

The Manage-IR; stream windows

  • Access offsets in streams
  • Use on-chip buffers for storing data read from memory


SLIDE 57

The Compute-IR

  • Structural semantics
  • @function_name (…args…) par
  • @function_name (…args…) seq
  • @function_name (…args…) pipe
  • @function_name (…args…) comb
  • Nesting these functions gives us the expressiveness to explore various parallelism configurations
  • Streaming ports
  • Counters and nested counters
  • SSA data-path instructions
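How such nested par/pipe/seq structures could drive costing can be sketched as a small tree walk: resources add up across all nodes, while throughput composes differently per structure. The composition rules and numbers here are an illustration of the idea, not the TyTra compiler's actual algorithm:

```python
def analyze(node):
    # node is ("leaf", aluts, throughput) or (kind, [children])
    kind = node[0]
    if kind == "leaf":
        _, aluts, tput = node
        return aluts, tput
    _, children = node
    parts = [analyze(c) for c in children]
    aluts = sum(a for a, _ in parts)          # resources always add up
    if kind == "par":                         # independent units: rates add
        tput = sum(t for _, t in parts)
    elif kind == "pipe":                      # pipeline: slowest stage limits rate
        tput = min(t for _, t in parts)
    else:                                     # seq: stages share time, rates
        tput = 1.0 / sum(1.0 / t for _, t in parts)  # combine harmonically
    return aluts, tput

# A 3-stage pipeline core, then four such cores in parallel
# (roughly the Version 1 -> Version 2 move in the example slides).
core = ("pipe", [("leaf", 34, 200e6), ("leaf", 20, 200e6), ("leaf", 34, 200e6)])
design = ("par", [core, core, core, core])
print(analyze(design))  # (352, 800000000.0): 4x resources, 4x throughput
```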

SLIDE 58

Example: Simple Vector Operation The Kernel

SLIDE 59

Version 1 – Single Pipeline (C2)

SLIDE 60

Version 1 – Single Pipeline (C2)

(Block diagram: a single Core_Compute pipeline with Rd, add, mul, add, and Wr stages, connected through Stream Control units to local memories lmem a, lmem b, lmem c, and lmem y.)

SLIDE 61

Version 1 – Single Pipeline

(Block diagram: the same single Core_Compute pipeline as Slide 60.)

The parser can also automatically find ILP and schedule in an ASAP fashion

SLIDE 62

Version 2 – 4 Parallel Pipelines (C1)

SLIDE 63

Version 2 – 4 Parallel Pipelines

(Block diagram: four Core_Compute pipelines, each with Rd, add, mul, add, and Wr stages, connected through Stream Control units to local memories lmem a, lmem b, lmem c, and lmem y.)


SLIDE 65

Version 3 – Scalar Instruction Processor (C4)

SLIDE 66

Version 3 – Scalar Instruction Processor (C4)

(Block diagram: a Core_Compute containing a PE (Instruction Processor) whose ALU runs the instruction sequence {add, add, mul, add}, connected through Stream Control units to local memories lmem a, lmem b, lmem c, and lmem y.)

The ALU would be customized for the instructions mapped to this PE at compile-time

SLIDE 67

Version 3 – Single Sequential Processor

(Block diagram: as Slide 66, with a Generic PE.)

SLIDE 68

Version 4 – Multiple Processors / Vectorization (C5)

SLIDE 69

Version 4 – Multiple Processors / Vectorization (C5)

(Block diagram: four Core_Compute units, each containing a Generic PE whose ALU runs {add, add, mul, add}, connected through Stream Control units to local memories lmem a, lmem b, lmem c, and lmem y.)

SLIDE 70

Version 4 – Multiple Sequential Processors (Vectorization)

(Block diagram: same as Slide 69.)

Note the continued use of stream abstractions even though the PEs are Instruction Processors now