GPRM Towards Automated Design Space Exploration and Code Generation

First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'15), Sunday, November 15, 2015, Austin, TX. GPRM Towards Automated Design Space Exploration and Code Generation using Type Transformations


  1. First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'15), Sunday, November 15, 2015, Austin, TX. GPRM Towards Automated Design Space Exploration and Code Generation using Type Transformations. www.tytra.org.uk. S Waqar Nabi & Wim Vanderbauwhede

  2. Using Safe Transformations and a Cost-Model for HPC on FPGAs
     • The TyTra project context
         o Our approach, blue-sky target, down-to-earth target, where we are now, how we are different
     • Key contributions
         o (1) Type transformations to create design-variants, (2) a new Intermediate Language, and (3) an FPGA Cost model
     • The cost model
         o Performance and resource-usage estimates, some results
     Using safe transformations and an associated light-weight cost-model opens the route to a fully automated design-space exploration flow.

  3. THE CONTEXT Our approach, blue-sky target, down-to-earth target, where we are now, how we are different

  4. Blue Sky Target

  5. Blue Sky Target
     [Figure labels: Legacy Scientific Code; Heterogeneous HPC Target Description; Cost Model; Optimized HPC solution!]
     The goal that keeps us motivated! (The pragmatic target is somewhat more modest…)

  6. The Short-Term Target Our focus is on FPGA targets, and we currently require design entry in a Functional Language using High-Level Functions (maps, folds) [a kind of DSL]

  7. The cunning plan…
     1. Use the functional programming paradigm to (auto) generate program-variants which translate to design-variants on the FPGA.
     2. Create an Intermediate Language that:
        • Is able to capture points in the entire design-space
        • Allows a light-weight cost-model to be built around it
        • Is a convenient target for a front-end compiler
     3. Create a light-weight cost-model that can estimate the performance and resource-utilization for each variant.
     A performance-portable code-base that builds on a purely software programming paradigm.

  8. And you may very well ask… The jury is still out…

  9. How our work is different
     Our observations on limitations of current tools and flows:
     1. Design-entry in a custom high-level language which nevertheless has hardware-specific semantics
     2. Architecture of the FPGA-solution specified by programmer; compilers cannot optimize it
     3. Solutions create soft-processors on the FPGA; not optimized for HPC (orientation towards embedded applications)
     4. Design-space exploration requires prohibitively long time
     5. Compiler is application specific (e.g. DSP applications)
     We are not there yet, but in principle, our approach entirely eliminates the first four, and mitigates the fifth.

  10. KEY CONTRIBUTIONS (1) Type transformations for generating program variants, (2) a new Intermediate Language, and (3) a light-weight Cost Model

  11. 1. Type Transformations to Generate Program Variants
      • Functional Programming
      • Types
          o More general than types in C
          o Our focus is on types of functions that perform array operations
          o reshape, maps and folds
      • Type transformations
          o Can be derived automatically
          o Provably correct
          o Essentially reshape the arrays
      A functional paradigm with high-level functions allows creation of design-variants that are correct-by-construction.

  12. Illustration of Variant Generation through Type-Transformation
      • typeA : Vect (im*jm*km) dataType  -- 1D data
          • Single execution thread
      • typeB : Vect km (Vect im*jm dataType)  -- transformed 2D data
          • (km concurrent execution threads)
      • output = map pipe kernel_func input  -- original program
      • inputTr = reshapeTo km input  -- reshaping data
      • output = map par (map pipe kernel_func) inputTr  -- new program
      Simple and provably correct transformations in a high-level functional language translate to design-variants on the FPGA.
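
The reshape-then-map transformation on slide 12 can be mimicked in ordinary Python to see why it preserves the result. This is only a sketch: `reshape_to`, `flatten`, and the body of `kernel_func` are hypothetical stand-ins, not TyTra code, and the `pipe`/`par` annotations (pipelined versus parallel execution on the FPGA) have no counterpart here, so both maps simply run sequentially.

```python
def reshape_to(km, xs):
    """Split a flat list into km equal-length rows (stand-in for reshapeTo)."""
    if km <= 0 or len(xs) % km != 0:
        raise ValueError("input length must be a positive multiple of km")
    row_len = len(xs) // km
    return [xs[i * row_len:(i + 1) * row_len] for i in range(km)]


def flatten(rows):
    """Inverse of reshape_to: concatenate the rows back into one flat list."""
    return [x for row in rows for x in row]


def kernel_func(x):
    """Placeholder kernel; any pure per-element function would do."""
    return 2 * x + 1


def original(xs):
    # output = map pipe kernel_func input
    return [kernel_func(x) for x in xs]


def transformed(km, xs):
    # output = map par (map pipe kernel_func) (reshapeTo km input)
    return [[kernel_func(x) for x in row] for row in reshape_to(km, xs)]


data = list(range(12))
# The variant computes the same values, only arranged as km rows.
assert flatten(transformed(3, data)) == original(data)
```

Because the kernel is applied element-wise, any `km` that divides the input length yields a variant with the same output, which is what makes the family of variants safe to enumerate automatically.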

  13. 2. A New Intermediate Language
      • Strongly and statically typed
      • All computations expressed in SSA (Static Single Assignment) form
      • Largely (and deliberately) based on the LLVM-IR
      • Manage-IR
          • Deals with
              • memory objects (arrays)
              • streams (loops over arrays)
              • offset streams
              • loops over work-unit
              • block-memory transfers
      • Compute-IR
          • Streaming model
          • SSA instructions define the datapath

  14. 2. A New Intermediate Language

  15. 3. Cost Model
      [Figure labels: Design Space; The Cost Model; Estimation Space]

  16. THE FPGA COST-MODEL Performance Estimate, Resource-utilization Estimate, Experimental Results

  17. The Cost-Model Use-Case
      A set of standardized experiments feeds target-specific empirical data to the cost model, and the rest comes from the IR description.

  18. Two Types of Estimates
      • Resource-Utilization Estimates
          o ALUTs, REGs, DSPs
      • Performance Estimates
          o Estimating memory-access bandwidth for specific data patterns
          o Estimating FPGA operating frequency
      Both estimates are needed to allow the compiler to choose the best design variant.

  19. 1. Resource Estimates
      • Observation
          o Regularity of FPGA fabric allows some very simple first or second order expressions to be built up for most instructions based on a few experiments.
      • Key Determinants
          o Primitive (SSA) instructions used in IR of the kernel functions
          o Data-types
          o Structure of various functions (pipe, par, comb, seq)
          o Control logic over-head
      A set of one-time simple synthesis experiments on the target device helps us create a very accurate resource-utilization cost model.

  20. Resource Estimates - Example
      [Plots: Integer Division; Integer Multiplication]
      Light-weight cost expressions are associated with every legal SSA instruction in the TyTra-IR.
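
To make the idea of per-instruction cost expressions concrete, the sketch below fits a second-order expression in the operand bit-width through three calibration points and then sums the expressions over a kernel's instructions. Everything here is an illustrative assumption: the function names, the choice of bit-width as the only variable, and the calibration numbers, which are synthetic and not measurements from any device or from the TyTra experiments.

```python
from fractions import Fraction


def fit_quadratic(points):
    """Return (a, b, c) so that a*w*w + b*w + c passes through three (width, cost) points."""
    (x0, y0), (x1, y1), (x2, y2) = [(Fraction(x), Fraction(y)) for x, y in points]
    # Lagrange interpolation, expanded into monomial coefficients.
    d0 = (x0 - x1) * (x0 - x2)
    d1 = (x1 - x0) * (x1 - x2)
    d2 = (x2 - x0) * (x2 - x1)
    a = y0 / d0 + y1 / d1 + y2 / d2
    b = -(y0 * (x1 + x2) / d0 + y1 * (x0 + x2) / d1 + y2 * (x0 + x1) / d2)
    c = y0 * x1 * x2 / d0 + y1 * x0 * x2 / d1 + y2 * x0 * x1 / d2
    return a, b, c


def eval_cost(coeffs, width):
    """Evaluate a fitted cost expression at a given bit-width."""
    a, b, c = coeffs
    return a * width * width + b * width + c


def estimate_kernel(instructions, cost_table):
    """Sum the per-instruction estimates for a list of (opcode, bit-width) pairs."""
    return sum(eval_cost(cost_table[op], width) for op, width in instructions)


# Synthetic calibration points (bit-width, resource count); NOT real synthesis data.
cost_table = {
    "mul": fit_quadratic([(8, 64), (16, 256), (32, 1024)]),  # grows quadratically
    "add": fit_quadratic([(8, 8), (16, 16), (32, 32)]),      # grows linearly
}

# Estimate a toy kernel with one 18-bit multiply and one 18-bit add.
estimate = estimate_kernel([("mul", 18), ("add", 18)], cost_table)
assert estimate == 18 * 18 + 18
```

Exact interpolation through three points is used only to keep the sketch dependency-free; with more calibration experiments a least-squares fit would be the more natural choice.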

  21. 2. Performance Estimate
      • Effective Work-Unit Throughput (EWUT)
          o Work-Unit = Executing the kernel over the entire index-space
      • Key Determinants
          o Memory execution model
          o Sustained memory bandwidth for the target architecture and design-variant
              • Data-access pattern
          o Design configuration of the FPGA
          o Operating frequency of the FPGA
          o Compute-bound or IO-bound?
      The performance model is trickier, especially calculating estimates of sustained memory bandwidth and FPGA operating frequency.

  23. Performance Estimate Dependence on Memory Execution Model
      Three types of memory execution. A given design-variant can be categorized based on:
      - Architectural description
      - IR description
      [Figure: activity-versus-time diagram. Activities: Kernel Pipeline Execution; Device-Buffers → Offset-Buffers; Device-DRAM → Device-Buffers; Host → Device-DRAM. Horizontal axis: Time.]

  25. Performance Estimate Dependence on Memory Execution Model: Type A
      [Figure: activity-versus-time diagram over the work-unit iterations. Activities: Kernel Pipeline Execution; Device-Buffers → Offset-Buffers; Device-DRAM → Device-Buffers; Host → Device-DRAM. Span label: "All iterations".]

  26. Performance Estimate Dependence on Memory Execution Model: Type B
      [Figure: activity-versus-time diagram over the work-unit iterations. Activities: Kernel Pipeline Execution; Device-Buffers → Offset-Buffers; Device-DRAM → Device-Buffers; Host → Device-DRAM. Span labels: "First Iteration only", "All other iterations", "Last Iteration only".]

  27. Performance Estimate Dependence on Memory Execution Model: Type C
      [Figure: activity-versus-time diagram over the work-unit iterations. Activities: Kernel Pipeline Execution; Device-Buffers → Offset-Buffers; Device-DRAM → Device-Buffers; Host → Device-DRAM. Span labels: "First Iteration only", "All other iterations", "Last Iteration only".]
      Once a design-variant is categorized, performance can be estimated accordingly.
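
One possible way to turn the categorization into an estimate is sketched below. The slide text does not spell out which activities each type confines to the first and last iterations, so the mapping here is an assumed reading of the diagram labels: Type A repeats every transfer in every iteration, Type B confines the Host/Device-DRAM transfer to the first and last iterations, and Type C additionally confines the Device-DRAM/Device-Buffers transfer to them. The stage times, the names, and the no-overlap summation are likewise hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    host_dram: float     # one Host <-> Device-DRAM transfer, in seconds
    dram_buffers: float  # one Device-DRAM <-> Device-Buffers transfer, in seconds
    kernel: float        # offset-buffer fill plus kernel pipeline, per iteration, in seconds


def stage_runs(kind, n_iter):
    """Number of iterations in which each stage runs, under the assumed reading."""
    edge = min(n_iter, 2)  # "first iteration only" plus "last iteration only"
    if kind == "A":
        return {"host_dram": n_iter, "dram_buffers": n_iter, "kernel": n_iter}
    if kind == "B":
        return {"host_dram": edge, "dram_buffers": n_iter, "kernel": n_iter}
    if kind == "C":
        return {"host_dram": edge, "dram_buffers": edge, "kernel": n_iter}
    raise ValueError(f"unknown memory execution type: {kind!r}")


def total_time(kind, n_iter, t):
    """Total time for n_iter work-unit iterations, assuming stages never overlap."""
    runs = stage_runs(kind, n_iter)
    return (runs["host_dram"] * t.host_dram
            + runs["dram_buffers"] * t.dram_buffers
            + runs["kernel"] * t.kernel)


times = StageTimes(host_dram=4.0, dram_buffers=2.0, kernel=1.0)  # made-up values
for kind in "ABC":
    print(kind, total_time(kind, 10, times))
```

Under these assumptions the three types differ only in how often the slower transfers are paid for, which is why the category has to be known before a throughput figure can be computed.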

  29. Performance Estimate Dependence on Data Access Pattern
      • We have defined a rho (ρ) factor as a scaling factor of the peak memory bandwidth
      • Varies from 0 to 1
      • Based on data-access pattern
      • Derived empirically through one-time standardized experiments on target node
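
Read as a formula, the ρ factor gives sustained bandwidth = ρ × peak bandwidth. The sketch below applies that to a transfer-time estimate. The function names, the access-pattern labels, and the ρ values in the table are hypothetical placeholders; real values would come from the one-time experiments on the target node.

```python
def sustained_bandwidth(peak_bw, rho):
    """Scale the peak memory bandwidth (bytes/s) by rho, which must lie in [0, 1]."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    return rho * peak_bw


def transfer_time(n_bytes, peak_bw, rho):
    """Estimated time in seconds to move n_bytes at the sustained bandwidth."""
    bw = sustained_bandwidth(peak_bw, rho)
    if bw == 0:
        return float("inf")  # no usable bandwidth for this pattern
    return n_bytes / bw


# Hypothetical lookup from data-access pattern to rho (placeholder values).
RHO_BY_PATTERN = {"contiguous": 0.9, "strided": 0.3}

peak = 10e9  # assumed peak bandwidth of 10 GB/s
for pattern, rho in RHO_BY_PATTERN.items():
    print(pattern, transfer_time(1e9, peak, rho))
```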

  30. 2. Performance Estimate
      • Effective Work-Unit Throughput (EWUT)
          o Work-Unit = Executing the kernel over the entire index-space
      • Key Determinants
          o Memory execution model
          o Sustained memory bandwidth for the target architecture and design-variant
              • Data-access pattern
          o Design configuration of the FPGA
          o Operating frequency of the FPGA
          o Compute-bound or IO-bound?
      [Slide annotation beside the FPGA configuration and frequency items: Determined from the IR description of design-variant]
      The performance model is trickier, especially calculating estimates of sustained memory bandwidth and FPGA operating frequency.
