Programming Soft Processors in High Performance Reconfigurable Computing
Andrew W. H. House & Paul Chow, University of Toronto
Workshop on Soft Processor Systems, 26 October 2008
Outline
- Introduction
- High Performance Reconfigurable Computing
- Programming Models for HPRC
- Our Proposed Programming Model
- Implications for Soft Processors
- Conclusion
Introduction
- Traditional microprocessors are starting to
reach a performance plateau.
– In HPC there has been much interest in using
accelerators (like GPUs) for computation.
- Reconfigurable hardware has much to offer.
– Application speedups, lower power
consumption, flexibility all possible.
- Soft processors can have an important role in
high performance reconfigurable computing systems.
High Performance Reconfigurable Computing
- HPRC is a branch of high performance
computing (HPC) that uses reconfigurable hardware (typically FPGAs) to accelerate computation.
- HPRC shares a number of characteristics with
other accelerator-based HPC solutions.
- Let's start with some definitions....
High Performance Computing
- Massively parallel
multiprocessor systems
- Shared or distributed
memory
- High-end systems
have specialized interconnection networks
HPRC Classifications
- We can classify HPRC systems into three
categories:
– Accelerated Reconfigurable Multiprocessors
– Application-Specific Reconfigurable Multiprocessors
– Heterogeneous Peer Reconfigurable Multiprocessors
- These mirror our more general definitions for
accelerator-based systems.
Accelerated Reconfigurable Multiprocessor
- Reconfigurable
hardware is co- processor
- Subordinate to CPU
- May have its own
memory, but CPU controls access to main memory and interconnect
Application-Specific Reconfigurable Multiprocessor
- Uses only FPGAs or
other reconfigurable
hardware as computing elements
- FPGAs have direct
connections to main memory and interconnect
Heterogeneous Peer Reconfigurable Multiprocessor
- FPGAs are first-
class computing elements, having equal access to system resources as CPUs and other accelerators
- This is the most likely
scenario for the future
of reconfigurable
computing(?)
Soft Processors in HPRC
- Can soft processors compete with hard ones?
Maybe.
– Many soft processors in parallel can exploit
massive FPGA on-chip memory bandwidth.
– Application-specific soft processors can offer
significant performance (Mitrion, Tensilica...).
- Soft processors can also be used for:
– Controlling interaction between hardware
kernels.
– Interfacing hardware with host CPUs.
Programming Models for HPRC
- Consider existing programming models:
– Parallel programming for HPC
– Programming for reconfigurable computing
– Hardware-Software Co-design
- Effectively integrating soft processors into
HPRC provides challenges to all of these paradigms.
Challenges in Programming HPRC
- Handling Multiple FPGAs
– Multiple processors in each?
- Heterogeneity
– Both hard and soft processors, plus other
processing elements.
- Application Partitioning
– Between processing elements, or processors
and hardware kernels.
- Synthesis
– Generation of ASPs/kernels from a high-level description.
Parallel Programming for HPC
- Good at handling massive data parallelism.
- Shared memory programming and message
passing are dominant.
– Single-program multiple-data (SPMD) most
scalable paradigm for message passing.
– Partitioned global address space (PGAS)
offers a shared memory abstraction for
distributed memory machines.
- SPMD/PGAS approaches not as effective in
heterogeneous environments.
Application to Soft Processors
- Program them in the same way!
– TMD-MPI is an MPI implementation for Xilinx
MicroBlaze and PPC processors.
– Couples processors and hardware kernels.
– Regular MPI applications ported easily.
- But standard models don't scale in a
heterogeneous environment.
– Runtime partitioning systems like RapidMind
or Intel Ct are not currently relevant for
FPGA-based systems.
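The SPMD pattern behind these MPI approaches can be illustrated with a toy Python sketch: every "rank" runs the same program on its own slice of the data, then partial results are reduced. The sequential simulation of ranks and the names `rank_program`/`spmd_run` are illustrative, not TMD-MPI or a real MPI binding.

```python
# Toy SPMD illustration: one program, many data partitions, one reduction.

def rank_program(rank, nranks, data):
    """The single program each rank executes on its own partition."""
    chunk = data[rank::nranks]          # cyclic slice owned by this rank
    return sum(x * x for x in chunk)    # local partial sum of squares

def spmd_run(nranks, data):
    """Sequentially simulate the parallel ranks, then reduce their results
    (the equivalent of an MPI all-reduce)."""
    partials = [rank_program(r, nranks, data) for r in range(nranks)]
    return sum(partials)

total = spmd_run(4, list(range(10)))    # same answer for any rank count
```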
Programming Model Requirements
- For these emerging HPRC systems, the
programming model must:
– Support heterogeneity,
– Be scalable,
– Be synthesizable,
– Assume limited system services,
– Support varied types of computation,
– Expose coarse- and fine-grain parallelism,
– Separate algorithm and implementation,
– Be architecture-independent,
– Provide execution model transparency.
Comparison of Programming Models vs. Requirements
[Comparison matrix slide; per-model ratings are not recoverable from this export.]
- Data Parallel: HPF, C*, Lustre, SISAL, SAC
- Stream Computing / CSP: Occam, MPI, PVM, Active Messages, Handel-C
- Shared Memory: PRAM, SHMEM, Linda
- PGAS: Split-C, UPC, Titanium, Co-Array Fortran, ZPL, Fortress, X10, Chapel
- Parallel Objects: CAL, Dataparallel C, mpC, RapidMind
- Dataflow and Functional: Simulink, LabVIEW, Prograph, Multilisp, Cilk, CellSs, Mitrion-C, StreamIT
- HDLs: Verilog/VHDL
- Other: OpenMP, Orca
Criteria (matrix columns): heterogeneous processing, scalable, synthesizable, limited system resources, many types of computation, expose parallelism, separate algorithm and implementation, architecture independent, execution model transparency.
So what now?
- None of the existing models meet our nine
general criteria.
- Making the best use of soft processors will
also require a tool that can automatically:
– Partition applications.
– Identify parts of the application for soft processors.
– Generate soft ASPs.
– Generate software for the whole system.
– Generate hardware kernels where needed.
– Manage communication.
- We need a new programming model.
Our Proposed Programming Model
A New Programming Model
- New language (or adaptation of existing
language) designed to meet the nine requirements.
- Include features such as:
– Data-parallel operations.
– Region-based array management.
– Global view of memory.
– High-level, implicit communication and synchronization.
– Emphasis on the use of libraries and functions.
- Compatible with soft processor design flow.
The Armada Language
- A high-level language for writing simulations.
- Data-parallel, PGAS-style language.
- Provides high-level operators and built-in
functions (e.g., matrix multiply).
- Functions are free from side effects.
- Program interpreted as dataflow.
- No pointers/direct memory manipulation.
- Region-based array management.
Sample Code
for(step := 1 TO NUMSTEPS) {
    // Calculate forces
    foreach(i,j in i := 1 TO NP : j := i+1 TO NP) {
        var real forceij := calculateForce(pMass[i], pMass[j],
                                           pPos[i], pPos[j]);
        allForces[i,j] := forceij;
        allForces[j,i] := -forceij;
    }
    // sum columns of allForces array
    summedForces[] := sum(allForces[1 TO NP, 1 ACROSS NP]);
    // update velocities and positions
    var array[1 TO NP] of triple a := summedForces[]/pMass[];
    pPos[] := pPos[] + (pVel[]*TIMESTEP) + (0.5 * (TIMESTEP^2) * a[]);
    pVel[] := pVel[] + (a[]*TIMESTEP);
}
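The same time step can be sketched in plain Python to show the update structure. This is an illustration only, not Armada semantics: positions are 1-D and a toy linear force law stands in for the real calculateForce().

```python
# Plain-Python sketch of one n-body time step, mirroring the Armada sample:
# pairwise forces computed once per pair and applied symmetrically,
# then velocities and positions updated from the summed forces.

TIMESTEP = 0.1  # illustrative step size

def calculate_force(mi, mj, pi, pj):
    # Toy pairwise force: proportional to both masses and separation.
    return mi * mj * (pj - pi)

def nbody_step(mass, pos, vel):
    n = len(mass)
    forces = [0.0] * n
    # Calculate forces (each pair once; Newton's third law for the other half).
    for i in range(n):
        for j in range(i + 1, n):
            f = calculate_force(mass[i], mass[j], pos[i], pos[j])
            forces[i] += f
            forces[j] -= f
    # Update velocities and positions from the summed forces.
    acc = [forces[k] / mass[k] for k in range(n)]
    new_pos = [pos[k] + vel[k] * TIMESTEP + 0.5 * acc[k] * TIMESTEP ** 2
               for k in range(n)]
    new_vel = [vel[k] + acc[k] * TIMESTEP for k in range(n)]
    return new_pos, new_vel

pos, vel = nbody_step([1.0, 1.0], [0.0, 1.0], [0.0, 0.0])
```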
Comments on Armada
- Meant for writing algorithms and simulations, not
systems programming.
- Designed to expose as much parallelism as
possible, while reducing or eliminating troublesome features (e.g., pointer indirection).
- But it is still just a language.
– Effectiveness will be heavily dependent on the
back-end tools and design flow.
Planned Armada Design Flow
Inputs: Armada program file(s), Platform Description File(s), Application Description File
→ Front-end Compiler
→ Back-end Compiler (design partitioning, code generation, interfacing with other tools)
→ Outputs: C++ code; C++ code using MPI; HDL code for a single FPGA; HDL code & scripts for multi-FPGA
Armada Back-End System
- Intended to compile a single set of source
code to multiple platforms.
- Given a description of the target platform, it:
– Partitions the available parallelism onto system
resources and processing elements (PEs).
– Generates source code appropriate to each
PE, including necessary communication with
other PEs.
– Generates necessary scripts to invoke
downstream compilation tools.
Modeling HPRC Systems
- Armada back-end requires description of
target platforms.
- Need information about:
– Number and types of PEs.
– Memory and communication bandwidth.
– Interconnection of PEs and memories.
- Use heavily-annotated weighted directed
graph.
– Each PE annotated with heuristic
“computational capacity”.
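Such a platform description might be sketched as a plain Python structure. The attribute names and numbers below are illustrative assumptions, not the actual Armada platform file format:

```python
# Sketch of an HPRC platform model as an annotated weighted digraph:
# nodes are PEs/memories with heuristic annotations, edges are
# interconnect links weighted by bandwidth. All values are illustrative.

platform = {
    "nodes": {
        "cpu0":  {"type": "CPU",  "capacity": 100},  # heuristic "computational capacity"
        "fpga0": {"type": "FPGA", "capacity": 400},
        "mem0":  {"type": "MEM",  "size_mb": 4096},  # memory node, no compute capacity
    },
    "edges": [
        # (src, dst, bandwidth_mb_s) -- interconnect between PEs and memories
        ("cpu0",  "mem0",  6400),
        ("fpga0", "mem0",  3200),
        ("cpu0",  "fpga0", 2000),
    ],
}

def total_capacity(plat):
    """Sum the heuristic computational capacity over all PEs (memories count 0)."""
    return sum(n.get("capacity", 0) for n in plat["nodes"].values())
```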
Comments on System Model
- Not intended to be an architectural simulator,
but rather a guide for partitioning.
– Can include PEs that don't actually exist in the
system (e.g., a switch node to simulate shared memory access).
– Specialized mapping techniques would be
used to optimize for each PE.
- Armada source is not directly mapped to PEs –
instead, we use the intermediate representation.
Armada Intermediate Representation (AIR)
- Load balancing of HPRC applications most
easily implemented at compile time.
- Need to map algorithm onto architecture.
- To separate front-end language from back-
end tools, we use an intermediate representation.
– Three-address code embeds assumptions
about the architecture.
– Instead, we use a dataflow graph.
AIR Features
- Memory operations are removed from AIR
– The memory hierarchy of HPRC systems is
complex; mapping must remain flexible.
– Data movements and variable names stored
on edges of the graph.
– Back-end tools map variables to actual
memories.
- AIR dataflow graph thus consists of:
– Nodes, which perform operations, and
– Edges, which represent movement of data.
AIR DFG Example
c[] := a[] + b[] / 2;
foreach(i,j in i := 1 TO 10 : j := i TO 10) {
    d[i,j] := d[i,j] + c[i] * c[j];
}
- Nodes/edges annotated with
type and size information.
- “Computational density” of
operations can be
determined to guide mapping.
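A minimal sketch of such a graph for the first statement above, with a toy density heuristic, could look like this (the node names, cost model, and array size are illustrative assumptions, not the actual AIR format):

```python
# Sketch of an AIR-style dataflow graph for: c[] := a[] + b[] / 2;
# Nodes perform operations; edges carry variable names plus size
# annotations (no explicit memory operations appear in the graph).

dfg_nodes = {
    "div": {"op": "div", "cost": 1},   # b[] / 2
    "add": {"op": "add", "cost": 1},   # a[] + (b[] / 2)
}
dfg_edges = [
    # (src, dst, variable, element_count) -- data movement, not memory ops
    ("input_b", "div", "b",   10),
    ("div",     "add", "tmp", 10),
    ("input_a", "add", "a",   10),
    ("add", "output_c", "c",  10),
]

def computational_density(node_name, elements=10):
    """Toy heuristic: operations performed per element of data flowing in.
    Higher density suggests the node is a better hardware-kernel candidate."""
    inputs = sum(n for (_, dst, _, n) in dfg_edges if dst == node_name)
    return dfg_nodes[node_name]["cost"] * elements / inputs
```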
Advantages of AIR
- Keeps algorithm high level.
– Operational nodes include big operations like
matrix multiply.
- Allows mapping flexibility.
– For example, first time a variable is used, it
might be read from main memory; thereafter, it could be kept in FPGA local memory for faster access, until computation is done.
– Parts with high computational density can be
mapped to kernels; lower-density parts to software.
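That density-guided placement could be sketched as follows (the threshold, node names, and density values are illustrative assumptions):

```python
# Sketch of density-guided mapping: DFG nodes whose computational
# density meets a threshold are placed in hardware kernels, the rest
# in (soft-processor) software. The threshold is an illustrative knob.

def map_nodes(densities, threshold=2.0):
    """densities: {node_name: computational_density}. Returns a placement."""
    return {name: ("hw_kernel" if d >= threshold else "software")
            for name, d in densities.items()}

placement = map_nodes({"matmul": 8.0, "sum": 1.5, "control": 0.2})
```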
Implications for Soft Processors
- Armada provides an implementation-neutral
front-end programming language.
- Back-end does a high-level partitioning of AIR
to map portions of application to resources.
- “Code” is generated for each PE.
– But for FPGAs, is “code” just HDL, soft
processor + program, or some mixture thereof?
– We want to automate this decision-making.
Options for using FPGAs
- Only hardware kernels.
– Might offer the best performance?
– But is difficult, and success depends on the type of
application.
- Only soft processors.
– “Standard” soft processors wouldn't offer
enough performance in general.
– Auto-generated application-specific soft
processors may be a viable approach.
A Hybrid Solution?
- Might offer best mix of high performance and
ease of implementation.
– But introduces a non-trivial partitioning
problem – what goes where?
- Routes to explore:
– Adapt techniques from hardware software co-
design.
– Implementing the critical path in hardware, with
one or more soft processors for support.
– Just for control and interfacing?
Conclusions
- Soft processors have a role in HPRC:
– Application acceleration.
– Control and interfacing.
- As with many parts of the HPRC ecosystem,
the tools are a problem.
– Need to provide application experts with a high
level of abstraction.
– Back-end tools need to automatically provide
overall performance improvement.
- Armada platform is being designed to
demonstrate validity of this approach.
Acknowledgements
- This work has been supported by:
– NSERC
– Xilinx
– Walter C. Sumner Memorial Fellowship