Programming Soft Processors in High Performance Reconfigurable Computing (PowerPoint PPT Presentation)



SLIDE 1

Programming Soft Processors in High Performance Reconfigurable Computing

Andrew W. H. House & Paul Chow University of Toronto Workshop on Soft Processor Systems 26 October 2008

SLIDE 2

Outline

  • Introduction
  • High Performance Reconfigurable Computing
  • Programming Models for HPRC
  • Our Proposed Programming Model
  • Implications for Soft Processors
  • Conclusion
SLIDE 3

Introduction

  • Traditional microprocessors are starting to reach a performance plateau.
    – In HPC there has been much interest in using accelerators (such as GPUs) for computation.
  • Reconfigurable hardware has much to offer.
    – Application speedups, lower power consumption, and flexibility are all possible.
  • Soft processors can play an important role in high performance reconfigurable computing systems.

SLIDE 4

High Performance Reconfigurable Computing

SLIDE 5

High Performance Reconfigurable Computing

  • HPRC is a branch of high performance computing (HPC) that uses reconfigurable hardware (typically FPGAs) to accelerate computation.
  • HPRC shares a number of characteristics with other accelerator-based HPC solutions.
  • Let's start with some definitions....
SLIDE 6

High Performance Computing

  • Massively parallel multiprocessor systems
  • Shared or distributed memory
  • High-end systems have specialized interconnection networks

SLIDE 7

HPRC Classifications

  • We can classify HPRC systems into three categories:
    – Accelerated Reconfigurable Multiprocessors
    – Application-Specific Reconfigurable Multiprocessors
    – Heterogeneous Peer Reconfigurable Multiprocessors
  • These mirror our more general definitions for accelerator-based systems.

SLIDE 8

Accelerated Reconfigurable Multiprocessor

  • Reconfigurable hardware is a co-processor
  • Subordinate to the CPU
  • May have its own memory, but the CPU controls access to main memory and the interconnect

SLIDE 9

Application-Specific Reconfigurable Multiprocessor

  • Uses only FPGAs or other reconfigurable hardware as computing elements
  • FPGAs have direct connections to main memory and the interconnect

SLIDE 10

Heterogeneous Peer Reconfigurable Multiprocessor

  • FPGAs are first-class computing elements, with equal access to system resources as CPUs and other accelerators
  • This is the most likely scenario for the future of reconfigurable computing(?)

SLIDE 11

Soft Processors in HPRC

  • Can soft processors compete with hard ones? Maybe.
    – Many soft processors in parallel can exploit massive FPGA on-chip memory bandwidth.
    – Application-specific soft processors can offer significant performance (Mitrion, Tensilica...).
  • Soft processors can also be used for:
    – Controlling interaction between hardware kernels.
    – Interfacing hardware with host CPUs.

SLIDE 12

Programming Models for HPRC

SLIDE 13

Programming Models for HPRC

  • Consider existing programming models:
    – Parallel programming for HPC
    – Programming for reconfigurable computing
    – Hardware-software co-design
  • Effectively integrating soft processors into HPRC poses challenges for all of these paradigms.

SLIDE 14

Challenges in Programming HPRC

  • Handling multiple FPGAs
    – Multiple processors in each?
  • Heterogeneity
    – Both hard and soft processors, plus other processing elements.
  • Application partitioning
    – Between processing elements, or between processors and hardware kernels.
  • Synthesis
    – Generation of ASPs/kernels from a high-level description.

SLIDE 15

Parallel Programming for HPC

  • Good at handling massive data parallelism.
  • Shared memory programming and message passing are dominant.
    – Single-program multiple-data (SPMD) is the most scalable paradigm for message passing.
    – Partitioned global address space (PGAS) offers a shared-memory abstraction for distributed-memory machines.
  • SPMD/PGAS approaches are not as effective in heterogeneous environments.
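The SPMD idea, every rank running the same program on its own slice of the data, can be illustrated in plain Python (a sequential toy, not real MPI; all names here are invented for illustration):

```python
# Toy SPMD illustration: each "rank" executes the same program body on
# its own partition of the data, then the partial results are combined.
def spmd_body(rank, nranks, data):
    """The single program every rank runs; behavior depends only on rank."""
    chunk = data[rank::nranks]          # this rank's slice of the global data
    return sum(x * x for x in chunk)    # purely local computation

def run_spmd(nranks, data):
    # A real system would launch nranks processes; here we loop sequentially.
    partials = [spmd_body(r, nranks, data) for r in range(nranks)]
    return sum(partials)                # stands in for a reduce operation

data = list(range(100))
assert run_spmd(4, data) == sum(x * x for x in data)
```

The key property the sketch shows is that the result is independent of the number of ranks, which is what makes SPMD scale.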

SLIDE 16

Application to Soft Processors

  • Program them in the same way!
    – TMD-MPI is an MPI implementation for Xilinx MicroBlaze and PPC processors.
    – Couples processors and hardware kernels.
    – Regular MPI applications port easily.
  • But standard models don't scale in a heterogeneous environment.
    – Runtime partitioning systems like RapidMind or Intel Ct are not currently relevant for FPGA-based systems.

SLIDE 17

Programming Model Requirements

  • For these emerging HPRC systems, the programming model must:
    – Support heterogeneity,
    – Be scalable,
    – Be synthesizable,
    – Assume limited system services,
    – Support varied types of computation,
    – Expose coarse- and fine-grain parallelism,
    – Separate algorithm and implementation,
    – Be architecture-independent,
    – Provide execution model transparency.

SLIDE 18

[Table: parallel programming models and languages rated against the nine criteria (heterogeneous processing; scalable; synthesizable; limited system resources; many types of computation; exposes parallelism; separates algorithm and implementation; architecture independent; execution model transparency). Models and languages surveyed include: Data Parallel (HPF, C*, Lustre, SISAL, SAC); Stream Computing (Mitrion-C, StreamIT); CSP (Occam, MPI, PVM, Active Messages, Handel-C); Shared Memory (PRAM, SHMEM, OpenMP, Linda, Orca); PGAS (Split-C, UPC, Titanium, Co-Array Fortran, ZPL, Fortress, X10, Chapel); Parallel Objects (CAL, Dataparallel C, mpC, RapidMind); Dataflow and Functional (Simulink, LabVIEW, Prograph, Multilisp, Cilk, CellSs); and HDLs (Verilog/VHDL). The per-cell ratings are not recoverable from the extracted text.]

SLIDE 19

So what now?

  • None of the existing models meet our nine general criteria.
  • Making the best use of soft processors will also require a tool that can automatically:
    – Partition applications.
    – Identify the parts of an application suited to soft processors.
    – Generate soft ASPs.
    – Generate software for the whole system.
    – Generate hardware kernels where needed.
    – Manage communication.
  • We need a new programming model.
SLIDE 20

Our Proposed Programming Model

SLIDE 21

A New Programming Model

  • A new language (or adaptation of an existing language) designed to meet the nine requirements.
  • Includes features such as:
    – Data-parallel operations.
    – Region-based array management.
    – Global view of memory.
    – High-level, implicit communication and synchronization.
    – Emphasis on libraries and functions.
  • Compatible with the soft processor design flow.
SLIDE 22

The Armada Language

  • A high-level language for writing simulations.
  • Data-parallel, PGAS-style language.
  • Provides high-level operators and built-in functions (e.g. matrix multiply).
  • Functions are free from side effects.
  • Programs are interpreted as dataflow.
  • No pointers or direct memory manipulation.
  • Region-based array management.

SLIDE 23
SLIDE 23

Sample Code

    for(t := 1 to NUMSTEPS) {
        // Calculate forces
        foreach(i,j in i := 1 TO NP : j := i+1 TO NP) {
            var real forceij := calculateForce(pMass[i], pMass[j], pPos[i], pPos[j]);
            allForces[i,j] := forceij;
            allForces[j,i] := -forceij;
        }
        // sum columns of allForces array
        summedForces[] := sum(allForces[1 TO NP, 1 ACROSS NP]);
        // update velocities and positions
        var array[1 TO NP] of triple a := summedForces[]/pMass[];
        pPos[] := pPos[] + (pVel[]*TIMESTEP) + (0.5 * (TIMESTEP^2) * a[]);
        pVel[] := pVel[] + (a[]*TIMESTEP);
    }
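The same time step can be sketched in plain Python to illustrate the intended semantics (a 1-D toy: `calculate_force` below is a stand-in inverse-square law, not the routine from the talk):

```python
# Plain-Python sketch of the N-body step above: each (i, j) pair's force
# is computed once, mirrored with opposite sign, summed into a net force
# per particle, then used to update velocities and positions.
NP = 4
TIMESTEP = 0.01

def calculate_force(mi, mj, pi, pj):
    # Stand-in pairwise force: 1-D gravity-like attraction with G = 1.
    d = pj - pi
    return mi * mj * d / (abs(d) ** 3 + 1e-9)

def nbody_step(p_mass, p_pos, p_vel):
    all_forces = [[0.0] * NP for _ in range(NP)]
    for i in range(NP):
        for j in range(i + 1, NP):                 # j := i+1 TO NP
            f = calculate_force(p_mass[i], p_mass[j], p_pos[i], p_pos[j])
            all_forces[i][j] = f
            all_forces[j][i] = -f                  # Newton's third law
    summed = [sum(row) for row in all_forces]      # net force on each particle
    a = [summed[i] / p_mass[i] for i in range(NP)]
    new_pos = [p_pos[i] + p_vel[i] * TIMESTEP + 0.5 * TIMESTEP ** 2 * a[i]
               for i in range(NP)]
    new_vel = [p_vel[i] + a[i] * TIMESTEP for i in range(NP)]
    return new_pos, new_vel
```

Because the pairwise forces are mirrored exactly, total momentum is conserved up to floating-point error, which makes a convenient sanity check.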

SLIDE 24

Comments on Armada

  • Meant for writing algorithms and simulations, not systems programming.
  • Designed to expose as much parallelism as possible, while reducing or eliminating troublesome features (e.g. pointer indirection).
  • But it is still just a language.
    – Its effectiveness will depend heavily on the back-end tools and design flow.

SLIDE 25

Planned Armada Design Flow

[Diagram: Armada program file(s), platform description file(s), and an application description file feed a front-end compiler, then a back-end compiler (design partitioning, code generation, interfacing with other tools), which emits C++ code, C++ code using MPI, HDL code for a single FPGA, or HDL code and scripts for multi-FPGA systems.]

SLIDE 26

Armada Back-End System

  • Intended to compile a single set of source code to multiple platforms.
  • Given a description of the target platform, it:
    – Partitions the available parallelism onto system resources and processing elements (PEs).
    – Generates source code appropriate to each PE, including the necessary communication with other PEs.
    – Generates the necessary scripts to invoke downstream compilation tools.

SLIDE 27

Modeling HPRC Systems

  • The Armada back-end requires a description of the target platforms.
  • Needs information about:
    – Number and types of PEs.
    – Memory and communication bandwidth.
    – Interconnection of PEs and memories.
  • Uses a heavily-annotated weighted directed graph.
    – Each PE is annotated with a heuristic “computational capacity”.
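One minimal way to encode such a platform model is a dictionary-based weighted digraph (a sketch only; the attribute names like `capacity` and the bandwidth units are assumptions, not Armada's actual schema):

```python
# Sketch of an annotated weighted digraph describing an HPRC platform.
# Nodes are PEs and memories with attributes; edges carry bandwidth weights.
platform = {
    "nodes": {
        "cpu0":  {"kind": "cpu",    "capacity": 10.0},
        "fpga0": {"kind": "fpga",   "capacity": 40.0},
        "mem0":  {"kind": "memory", "size_mb": 4096},
        # A "virtual" switch node can model shared-memory access even
        # though no such PE physically exists in the system.
        "sw0":   {"kind": "switch", "capacity": 0.0},
    },
    "edges": {  # (src, dst) -> link bandwidth in GB/s
        ("cpu0", "sw0"): 8.0, ("sw0", "mem0"): 8.0,
        ("fpga0", "sw0"): 4.0, ("sw0", "fpga0"): 4.0,
    },
}

def total_capacity(model, kind=None):
    """Sum the heuristic computational capacity over PEs, optionally by kind."""
    return sum(attrs.get("capacity", 0.0)
               for attrs in model["nodes"].values()
               if kind is None or attrs["kind"] == kind)

assert total_capacity(platform) == 50.0
assert total_capacity(platform, kind="fpga") == 40.0
```

A partitioner can then weigh each candidate mapping by the capacity of the PE and the bandwidth of the edges the data must cross.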

SLIDE 28

SLIDE 29

Comments on System Model

  • Not intended to be an architectural simulator, but rather a guide for partitioning.
    – Can include PEs that don't actually exist in the system (e.g. a switch node to simulate shared memory access).
    – Specialized mapping techniques would be used to optimize for each PE.
  • Armada source is not directly mapped to PEs; naturally, we use the intermediate representation.

SLIDE 30

Armada Intermediate Representation (AIR)

  • Load balancing of HPRC applications is most easily implemented at compile time.
  • Need to map the algorithm onto the architecture.
  • To separate the front-end language from the back-end tools, we use an intermediate representation.
    – Three-address code embeds assumptions about the architecture.
    – We use a dataflow graph instead.

SLIDE 31

AIR Features

  • Memory operations are removed from AIR.
    – The memory hierarchy of HPRC systems is complex; mapping flexibility is needed.
    – Data movements and variable names are stored on the edges of the graph.
    – Back-end tools map variables to actual memories.
  • The AIR dataflow graph thus consists of:
    – Nodes, which perform operations, and
    – Edges, which represent movement of data.
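A minimal node/edge encoding of such a graph might look like the following (a sketch; AIR's real structure is not shown in the talk, so every field name here is an assumption). It builds the dataflow for `c[] := a[] + b[] / 2`:

```python
# Sketch of a dataflow-graph IR: nodes are operations, edges carry the
# variable name plus type/size annotations for the data moving between them.
nodes = {
    "n_div": {"op": "div"},   # b[] / 2
    "n_add": {"op": "add"},   # a[] + (b[] / 2)
}
edges = [
    # (src, dst, annotation); "input"/"output" mark the graph boundary.
    ("input", "n_div", {"var": "b",  "type": "real[]", "size": 10}),
    ("input", "n_div", {"var": "2",  "type": "real",   "size": 1}),
    ("input", "n_add", {"var": "a",  "type": "real[]", "size": 10}),
    ("n_div", "n_add", {"var": "t0", "type": "real[]", "size": 10}),
    ("n_add", "output", {"var": "c", "type": "real[]", "size": 10}),
]

def inputs_of(node):
    """Variables flowing into a node; the back-end maps each to a memory."""
    return [ann["var"] for src, dst, ann in edges if dst == node]

assert inputs_of("n_add") == ["a", "t0"]
```

Note that no edge says *where* `a`, `t0`, or `c` live; that binding is deferred to the back-end, which is exactly the flexibility the slide describes.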

SLIDE 32

AIR DFG Example

    c[] := a[] + b[] / 2;
    foreach(i,j in i := 1 TO 10 : j := i TO 10) {
        d[i,j] := d[i,j] + c[i] * c[j];
    }

  • Nodes/edges are annotated with type and size information.
  • The “computational density” of operations can be determined to guide mapping.
slide-33
SLIDE 33

Advantages of AIR

  • Keeps algorithm high level.

– Operational nodes include big operations like

matrix multiply.

  • Allows mapping flexibility.

– For example, first time a variable is used, it

might be read from main memory; thereafter, it could be kept in FPGA local memory for faster access, until computation is done.

– Parts with high computational density can be

mapped to kernels; lower density to software

slide-34
SLIDE 34

Implications for Soft Processors

SLIDE 35

Implications for Soft Processors

  • Armada provides an implementation-neutral front-end programming language.
  • The back-end does a high-level partitioning of AIR to map portions of the application to resources.
  • “Code” is generated for each PE.
    – But for FPGAs, is “code” just HDL, a soft processor plus its program, or some mixture thereof?
    – We want to automate this decision-making.

SLIDE 36

Options for using FPGAs

  • Only hardware kernels.
    – Might offer the best performance?
    – But this is difficult, and success depends on the type of application.
  • Only soft processors.
    – “Standard” soft processors wouldn't offer enough performance in general.
    – Auto-generated application-specific soft processors may be a viable approach.

SLIDE 37

A Hybrid Solution?

  • Might offer the best mix of high performance and ease of implementation.
    – But it introduces a non-trivial partitioning problem: what goes where?
  • Routes to explore:
    – Adapt techniques from hardware-software co-design.
    – Implement the critical path in hardware, with one or more soft processors for support.
    – Use soft processors just for control and interfacing?

SLIDE 38

Conclusions

SLIDE 39

Conclusions

  • Soft processors have a role in HPRC:
    – Application acceleration.
    – Control and interfacing.
  • As with many parts of the HPRC ecosystem, the tools are a problem.
    – Need to provide application experts with a high level of abstraction.
    – Back-end tools need to automatically provide overall performance improvement.
  • The Armada platform is being designed to demonstrate the validity of this approach.

SLIDE 40

Acknowledgements

  • This work has been supported by:
    – NSERC
    – Xilinx
    – Walter C. Sumner Memorial Fellowship