An Approach to Programming Configurable Computers for Numeric Applications


  1. An Approach to Programming Configurable Computers for Numeric Applications*
     A Simple Language for Algorithms on the Reals
     F. Mayer-Lindenberg, TUHH

     * Computer components with applications of their own: (1) compute units (ALUs), (2) memory, (3) control automata.

  2. von Neumann architecture:
     - a memory holding the instruction list (M-program) and the data
     - ALUs for arithmetic on number codes
     - a control automaton (CA)
     - arbitrary memory contents → (limited) universality

     FPGA architecture / digital circuits / embedded computing:
     - chips integrating sets of micro modules: ALU primitives (e.g. full adder + cfgb.) and memory
     - a configurable interconnection network for the micro modules (configurability → universality)
     - permits configurations of ALU circuits for various number codes, and of CAs

     'Programming': the formal definition of computations composed of many* arithmetic operations ('algorithms'), to be transformed automatically to control the processing on a machine, including the data flow.

     * A finite number, possibly unbounded if it depends on the data.

  3. Levels of programming: vN computer vs. FPGA

     Machine resources:
     - vN computer: (not programmable); a few ALUs for a few number codes; large memories (hidden hierarchy), automatic caching.
     - FPGA: can implement many types of compute units of ALUs for diverse number codes, maybe of different types (mandatory for efficient use of resources); memory, I/O (input/output), control automata/soft processors, internal/external interfaces, network functions; the configuration could change dynamically.

     Application programming:
     - vN computer: using libraries and the OS; fixed number types; HW oriented types.
     - FPGA: using multiple number codes may need simulation (number codes, I/O); algorithms of mathematical origin, and error handling.

     Execution control (distributing operations to ALUs and data to memory):
     - vN computer: SIMD/MIMD with control by the OS executing threads; processor and threads, memory allocation, communications.
     - FPGA: available ALU circuits work in parallel; control of parallelism/timing at high rates; PL/PS control for a single FPGA needs comm.

  4. Approach to the application programming of networks of standard and FPGA based processors
     (→ requirements on a programming language for numeric applications .. embedded and HPC)

     1) The implementation of new number codes and of ALU circuits/soft processors, interfaces and I/O automata based on FPGA shall not be supported. Number codes are simply selected; ALUs, I/O automata, and networking functions are separately developed and configured system components. FPGA applications build on libraries of predefined configurations.
     2) Compiling an application requires the specification of the available programmable processors and automata. Target systems are heterogeneous networks of processors.
     3) The multi-threading and the distribution of data and operations to the ALUs and memory are specified, allowing for automatic optimizations. Timing conditions must be explicit.
     4) Apart from the specification of the target networks and the usage of their resources, 'only' the formal definition of numerical algorithms is needed. It should preferably be given abstractly and in a notation close to the mathematical one. For embedded applications any PL will be compared to PLs like 'C' regarding simplicity and compilation.
        4a) The diverse number codes are not treated as individual types with operations and conversions of their own. Instead, a single abstract type of real number is used with the error-free arithmetic operations. The diverse number codes are only represented by the corresponding rounding operations on the reals. There is no need for pointers, bit fields or Boolean data.
        4b) As algorithms encompassing many numeric operations have to be supported, the tuple sets $\mathbb{R}^n$ are available with extra tuple operations and optional roundings. Tuple operations are useful to eliminate loops, and can be implemented efficiently.
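
Requirements 4a and 4b can be illustrated with a minimal Python sketch. It assumes Python floats as a stand-in for the abstract reals and IEEE binary32 as the one selected number code. The names `rnd_f32`, `t_add`, `t_scale` and `t_dot` are hypothetical and are not taken from the language described in these slides. The number code appears only as a rounding function passed to the tuple operations, and the caller writes no element loops.

```python
import struct

def rnd_f32(x):
    """Round a float (stand-in for a real) to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def exact(x):
    """No rounding: the error-free operation on the stand-in reals."""
    return x

def t_add(u, v, rnd=exact):
    """Element-wise tuple addition with an optional rounding per element."""
    return tuple(rnd(a + b) for a, b in zip(u, v))

def t_scale(c, u, rnd=exact):
    """Scale every element of a tuple by c, with an optional rounding."""
    return tuple(rnd(c * a) for a in u)

def t_dot(u, v, rnd=exact):
    """Dot product; every product and every partial sum is rounded (not fused)."""
    acc = 0.0
    for a, b in zip(u, v):
        acc = rnd(acc + rnd(a * b))
    return acc

# Hypothetical use: y <- 2*x + y without a loop at the call site.
x = (1.0, 2.0, 3.0)
y = (0.5, 0.25, 0.125)
y = t_add(t_scale(2.0, x, rnd_f32), y, rnd_f32)
```

Switching to another number code would, under these assumptions, only mean passing a different rounding function; the algorithm text stays the same.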

  5. A large group of projects: efficiently usable computer systems (.. on CS)
     - efficiency in the usage of the HW (.. eng/sci)
     - efficient application programming (.. e:mult.sol., .. std/acc.)
     - also for the usage of FPGA components

     → Modular processor based on FPGA (HW architecture)
       - standardized control automaton
       - ALU modules for various number codes
       - composite operations, vector data

     → Programming language –Nets (small/simple) (F&PL&Compiler)
       - for numeric applications on processor networks
       - parallelism supported by processes, realtime functions
       - implementation of a compiler and a prog/sim environment

     → FPGA based heterogeneous processor networks (S-Archit., OS)
       - network/system architecture, infrastructure includes ser.ctl.
       - experimental platforms for parallel and distributed computing and for evaluating the system architecture and the PL

  6. Soft controllers / modular soft processors

     Required to:
     - support non-standard ALUs and wide data codes
     - provide maximum ALU efficiency, with parallel CF and memory access
     - support multiple threads
     - be a low complexity circuit, to allow for large MIMD sub nets
     - have a simple memory architecture

     [Block diagram: IMEM and DMEM (both on-chip BRAM), DMA port, data and address registers, control circuit, host and I/O interfaces, ALU with arithmetic pipelines]

     - The controller performs instruction sequencing, memory control and I/O, for 4 threads.
     - The ALU data word size is independent of the controller address/index word sizes.
     - VLIW type instructions: ALU operations are executed in parallel with controller operations.
     - No memory bus, no cache/MMU; DMA supported I/O to external memory (SW caching).
     - The controller design is adapted to FPGA resources.

  7. Example: floating point ALU / data path attaching to the soft controller (Spartan-6)
     - 45-bit number codes: 34-bit mantissa + sign, 9-bit exponent, no non-normals, round → 0
     - supports parallel chained +/* operations and dual memory accesses (efficient dot product)
     - registers and data RAM are 45 bits wide; data RAM with one 'rw' and one 'r' port; 4 threads

     [Data path diagram: registers D0 to D15, '+' pipeline, '*' pipeline, input selectors, 45-bit DMEM with read (r) and write (w) ports, flag, data conversion (cvt), controller]
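
The 'round → 0' behaviour and a chained multiply-add can be sketched in Python with exact rationals. This is a minimal sketch under stated assumptions: the significand is taken to have 34 significant bits (the slide does not spell out the exact bit layout), the 9-bit exponent range is ignored, and each chained operation is assumed to round its own result. The names `round_toward_zero` and `dot_chained` are hypothetical.

```python
import math
from fractions import Fraction

MANT_BITS = 34  # assumption: 34 significant bits; exponent range is not modelled

def round_toward_zero(x, mant_bits=MANT_BITS):
    """Truncate an exact rational to mant_bits significant bits (round toward 0)."""
    x = Fraction(x)
    if x == 0:
        return x
    ax = abs(x)
    # Find e with 2**(e-1) <= ax < 2**e from the integer bit lengths.
    e = ax.numerator.bit_length() - ax.denominator.bit_length()
    if Fraction(2) ** e <= ax:
        e += 1
    scale = Fraction(2) ** (mant_bits - e)
    n = math.trunc(ax * scale)  # integer mantissa with exactly mant_bits bits
    result = Fraction(n) / scale
    return result if x > 0 else -result

def dot_chained(xs, ys, mant_bits=MANT_BITS):
    """Dot product where every '*' and every '+' result is truncated separately."""
    acc = Fraction(0)
    for a, b in zip(xs, ys):
        prod = round_toward_zero(Fraction(a) * Fraction(b), mant_bits)
        acc = round_toward_zero(acc + prod, mant_bits)
    return acc

third = round_toward_zero(Fraction(1, 3))
```

Exact rationals are used here instead of floats so that the truncation is not disturbed by an intermediate rounding to binary64.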

  8. V144-ALU: 144-bit data size (4-vectors), fixed/BFP, 16 regs/ctx, separate exponents

     Arithmetic operations: 18-bit instruction codes for the data path (SP-6)

     110 010 0rrr tttt ssss   dr=(dt,ds)2       .. 2-f.SIMD dbl. .. no par.transfer, n.f.
     110 011 0rrr tttt ssss   dr+=(dt,ds)2      .. 2-f.SIMD dbl. .. no par.transfer, n.f.
     110 000 0rrr tttt ssss   dr=½(dt,ds)       .. dbl. .. no par.transfer, n.f.
     *** 000 *rrr tttt 1*01   dr=bfly(dt)       .. uses extra pars from ctrl word
     *** 000 *rrr tttt 1*10   drl=½(dtl+dth)
     *** 000 *rrr tttt 1*11   drl=dsum(dt)      .. double prec. add
     *** 001 *rrr tttt ssss   dr=ds*dt          .. SIMD
     *** 010 *rrr tttt ssss   dr=ds+dt          .. SIMD
     *** 011 *rrr tttt ssss   dr=dt-ds          .. SIMD
     *** 100 *rrr tttt ssss   drl=dtl*dsl       .. cmpl. mpy .. not fused on SP-6
     *** 101 *rrr tttt ssss   drh=dtl*dsh       .. cmpl. mpy .. not fused on SP-6
     *** 110 *rrr tttt ssss   drh+/2=dth*dsl    .. quat. mpy .. not fused on SP-6
     *** 111 *rrr tttt ssss   drl+/2=dth*dsh    .. quat. mpy .. not fused on SP-6

     Parallel transfer operations, using reg codes from the controller instruction:

     001 *** 1 *** **** ****  shift             .. normalize
     010 *** 1 *** **** ****  copy (dr=dt)
     011 *** 1 *** **** ****  conj (dr=conj(dt))
                              .. shift direction from controller instr
     100 *** 1 *** **** ****  rsh
     110 *** 1 *** **** ****  wshc (r/w sh cnt) .. access associated 9-bit count regs

     .. ALU w/o BF uses 12 multipliers, 5500 LUTs incl. controller, fits into XC6SLX9
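
The field layout of these 18-bit codes can be made concrete with a small Python decoder. It is only a sketch: it assumes the bit patterns are written most significant bit first, so that the groups are 3 + 3 + 1 + 3 + 4 + 4 bits, and the field names `transfer`, `op`, `flag`, `r`, `t`, `s` as well as the `decode` function are hypothetical labels, not names from the slides. Only the three plain SIMD operations are given mnemonics.

```python
from typing import NamedTuple

class DataPathInstr(NamedTuple):
    transfer: int  # bits 17..15: parallel transfer code (assumed position)
    op: int        # bits 14..12: arithmetic operation code
    flag: int      # bit 11: set in the parallel transfer patterns
    r: int         # bits 10..8: destination register field 'rrr'
    t: int         # bits 7..4: source register field 'tttt'
    s: int         # bits 3..0: source register field 'ssss'

# Mnemonics copied from the table for the plain SIMD rows only.
SIMD_OPS = {0b001: "dr=ds*dt", 0b010: "dr=ds+dt", 0b011: "dr=dt-ds"}

def decode(word):
    """Split an 18-bit data path instruction word into its fields."""
    if not 0 <= word < (1 << 18):
        raise ValueError("instruction word must fit into 18 bits")
    return DataPathInstr(
        transfer=(word >> 15) & 0b111,
        op=(word >> 12) & 0b111,
        flag=(word >> 11) & 0b1,
        r=(word >> 8) & 0b111,
        t=(word >> 4) & 0b1111,
        s=word & 0b1111,
    )

# Pattern '*** 001 *rrr tttt ssss' with wildcards set to 0, r=3, t=5, s=9.
instr = decode(0b000_001_0_011_0101_1001)
mnemonic = SIMD_OPS.get(instr.op)
```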

  9. –Nets supports the various number codes by a unique type of real number

     Coding data and operations on codes to be evaluated by processors:
     Numbers need to be encoded by bit strings before they can be digitally computed with.

     - $\mathrm{enc}: \mathbb{R} \to B^*$, the encoding function (partially defined)
     - $\mathrm{dec}: B^* \to \mathbb{R}$, the decoding function (partially defined)

     such that the 'rounding' function $\mathrm{rnd} = \mathrm{dec} \circ \mathrm{enc}: \mathbb{R} \to \mathbb{R}$ fulfils $\mathrm{enc}(r) = \mathrm{enc}(\mathrm{rnd}(r))$, hence $\mathrm{rnd} \circ \mathrm{rnd} = \mathrm{rnd}$.

     Operations $\mathrm{op}: \mathbb{R} \to \mathbb{R}$ are substituted by $\mathrm{op}' = \mathrm{enc} \circ \mathrm{op} \circ \mathrm{dec}: B^* \to B^*$ on the machine:

     $r = \mathrm{rnd}(r) \;\Rightarrow\; \mathrm{dec}(\mathrm{op}'(\mathrm{enc}(r))) = \mathrm{rnd}(\mathrm{op}(r))$ (substitution inserts rounding)

     Operations $\mathrm{op}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ etc. are handled similarly.

     Algorithms (compositions of operations) are executed on the machine by substituting every operation op on the reals by the corresponding op' on number codes.
     - Tuple codes can be different from tuples of codes.
     - Selected op's can add extra approximation errors.
     - Certain composite operations can be implemented as 'fused' operations w/o intermediate roundings.

     –Nets supports several standard and non-standard encodings, including I32, X16, X35, V144, F32, F64, G45 (.. to be expanded; rnd'd from int'l num), by their (unique) rounding operations only, as attributes to the abstract computations performed by its processes, telling the compiler how to substitute operations.
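
The relations between enc, dec, rnd and op' can be checked on a toy number code in Python. The code below is a sketch only: it assumes a hypothetical 16-bit two's complement fixed-point format with 8 fractional bits and rounding toward minus infinity, which is not claimed to match any encoding named on the slide. Exact rationals (`Fraction`) stand in for the reals, and `lift` is a hypothetical helper that builds op' from op.

```python
import math
from fractions import Fraction

WORD_BITS = 16  # hypothetical toy format, not one of the slide's encodings
FRAC_BITS = 8

def enc(r):
    """Partial encoding R -> B*: fails for values outside the representable range."""
    n = math.floor(Fraction(r) * 2**FRAC_BITS)
    if not -(2 ** (WORD_BITS - 1)) <= n < 2 ** (WORD_BITS - 1):
        raise ValueError("value is not encodable in this number code")
    return format(n & (2**WORD_BITS - 1), f"0{WORD_BITS}b")

def dec(bits):
    """Decoding B* -> R: interpret the bit string as two's complement fixed point."""
    n = int(bits, 2)
    if n >= 2 ** (WORD_BITS - 1):
        n -= 2**WORD_BITS
    return Fraction(n, 2**FRAC_BITS)

def rnd(r):
    """The rounding function rnd = dec after enc."""
    return dec(enc(r))

def lift(op):
    """Build op' = enc after op after dec, the operation on codes."""
    return lambda bits: enc(op(dec(bits)))

def square(x):
    return x * x

square_code = lift(square)

r = rnd(Fraction(1, 3))               # r is a fixed point of rnd
lhs = dec(square_code(enc(r)))        # computed on codes
rhs = rnd(square(r))                  # exact operation followed by rounding
```

For this toy code both sides agree, which is the 'substitution inserts rounding' property for values that are already rounded.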

  10. Target architecture (example):
      - red blocks: application specific ALUs
      - fixed networking infrastructure/RTS in the FPGA (includes CAs, memory control, data communication/protocol implementation, HIF)
      - disjoint memory partitions

      [Block diagram: a PC connected via LAN to two FPGAs. Each FPGA shows a NoC, NW, an ARM (SoC), memory controllers (MC) with DRAM, blocks C+PR, and ALU blocks A1+DR to A4+DR with data links.]
