SLIDE 1 An Approach to Programming Configurable Computers for Numeric Applications* A simple Language for Algorithms on the Reals
- F. Mayer-Lindenberg, TUHH
* computer components with applications of their own:
- 1. compute units (ALUs) 2. memory 3. control automata
SLIDE 2
von Neumann architecture vs. FPGA architecture / digital circuits / embedded computing. 'Programming':
formal definition of computations composed of many* arithmetic operations ('algorithms'), to be automatically transformed to control the processing on a machine, including the data flow
* a finite number, maybe unbounded if depending on the data
[Diagram, von Neumann architecture: memory holding the instruction list (machine program), a control automaton (CA), ALUs for arithmetic on number codes, and data memory; arbitrary memory contents → (limited) universality]
[Diagram, FPGA architecture: chips integrating sets of micro modules (ALU primitives and memory, configurable, e.g. full adders) with a configurable interconnection network for the micro modules; configurability → universality; permits configurations of ALU circuits for various number codes and CAs]
SLIDE 3 Levels of programming
machine resources: compute units of maybe different types; memory, I/O (input/output)
application programming: algorithms
execution control: control of parallelism/timing, distributing operations to ALUs and data to memory
vN computer:
  a few ALUs for a few number codes, fixed number types (not programmable)
  large memories (hidden hierarchy), automatic caching
  HW oriented types, SIMD/MIMD with control by the OS
  using libraries, OS; PL/PS control for threads, comm.
FPGA:
  can implement many types of ALUs for diverse number codes (mandatory for eff. use of resources)
  control automata/soft processors, int./ext. interfaces, network functions
  configuration could change dynamically; using multiple number codes
  may need simulation (number codes, I/O) and error handling
  available ALU circuits work in parallel at high rates executing threads
  single FPGA needs processor and memory allocation, communications
SLIDE 4 Approach to the application programming of networks of standard and FPGA based procs:
( → requirements on a programming language for numeric applications .. emb'd. and HPC)
1) The implementation of new number codes and of ALU circuits/soft processors, interfaces and IO automata based on FPGA shall not be supported. Number codes are simply selected; ALUs, IO automata, and networking functions are separately developed and configured system components. FPGA applications build on libraries of predefined configurations.
2) Compiling an application requires the specification of the available programmable processors and automata. Target systems are heterogeneous networks of processors.
3) The multi-threading and the distribution of data and operations to the ALUs and memory are specified, allowing for automatic optimizations. Timing conditions must be explicit.
4) Apart from the specification of the target networks and the usage of their resources, 'only' the formal definition of numerical algorithms is needed, preferably given abstractly and in some notation close to the mathematical one. For embedded applications any PL will be compared to PLs like 'C' regarding simplicity and compilation.
4a) The diverse number codes are not treated as individual types with operations and conversions of their own. Instead, a single abstract type of real number is used with the error-free arithmetic operations. The diverse number codes are only represented by the corresponding rounding operations on the reals. There is no need for pointers, bit fields or Boolean data.
4b) As algorithms encompassing many numeric operations have to be supported, the tuple sets IRn are available with extra tuple operations and optional roundings. Tuple ops are useful to eliminate loops, and can be implemented efficiently.
SLIDE 5 A large group of projects – efficiently usable computer systems
  efficiency in the usage of the HW .. eng/sci
  efficient application programming .. e:mult.sol.
  .. also for the usage of FPGA components .. std/acc.
→ modular processor based on FPGA (HW architecture)
  standardized control automaton
  ALU modules for various number codes
  composite operations, vector data
→ programming language π-Nets (small/simple) (F&PL&Compiler)
  for numeric applications on processor networks
  parallelism supported by processes, realtime functions
  implementation of a compiler and a prog/sim environment
→ FPGA based heterogeneous processor networks (S-Archit., OS)
  network/system architecture, infrastructure includes ser.ctl.
  exper. platforms for parallel and distributed computing
  and for evaluating the system architecture and the PL
SLIDE 6 Soft controllers / modular soft processors
Required to:
- support non-standard ALUs, wide data codes
- provide maximum ALU efficiency, parallel CF and mem. acc.
- support multiple threads
- be a low complexity circuit to allow for large MIMD sub nets
- have a simple memory architecture
[Block diagram, soft controller: control circuit with IMEM (on-chip BRAM), address registers, registers & interface, arithmetic pipelines, DMEM (on-chip BRAM); the ALU I/O data path attaches to the soft controller; host I/O via DMA port]
- controller performs instruction sequencing, memory control and I/O, 4 threads
- ALU data word size independent from controller address/index word sizes
- VLIW type instructions for ALU ops. executed in parallel with controller ops.
- no memory bus, no cache/MMU, DMA supported I/O to ext.mem.(SW caching)
- controller design adapted to FPGA resources
SLIDE 7
Example: Floating point ALU / data path attaching to soft controller (Spartan-6)
- 45-bit number codes: 34-bit mantissa+sign, 9-bit exponent, no non-normals, round → 0
- supporting parallel chained +/* operations and dual memory accesses (effic. dot product)
- registers and data RAM are 45-bit wide, data RAM with one 'rw', one 'r' port, 4 threads
[Data path diagram: registers D0..D15, a 45-bit '+' pipeline and a '*' pipeline, DMEM read and write ports with select logic, conversion (cvt) units, and flag data to the controller]
SLIDE 8
V144-ALU: 144-bit data size (4-vectors), fixed/BFP, 16 regs/ctx, separate exponents
Arithmetic operations: 18-bit instruction codes for data path (SP-6)
  110 010 0rrr tttt ssss   dr=(dt,ds)2      .. 2-f.SIMD dbl. .. no par.transfer, n.f.
  110 011 0rrr tttt ssss   dr+=(dt,ds)2     .. 2-f.SIMD dbl. .. no par.transfer, n.f.
  110 000 0rrr tttt ssss   dr=½(dt,ds)      .. dbl. .. no par.transfer, n.f.
  *** 000 *rrr tttt 1*01   dr=bfly(dt)      .. uses extra pars from ctrl word
  *** 000 *rrr tttt 1*10   drl=½(dtl+dth)
  *** 000 *rrr tttt 1*11   drl=dsum(dt)     .. double prec. add
  *** 001 *rrr tttt ssss   dr=ds*dt         .. SIMD
  *** 010 *rrr tttt ssss   dr=ds+dt         .. SIMD
  *** 011 *rrr tttt ssss   dr=dt–ds         .. SIMD
  *** 100 *rrr tttt ssss   drl=dtl*dsl      .. cmpl. mpy .. not fused on SP-6
  *** 101 *rrr tttt ssss   drh=dtl*dsh      .. cmpl. mpy .. not fused on SP-6
  *** 110 *rrr tttt ssss   drh+/2=dth*dsl   .. quat. mpy .. not fused on SP-6
  *** 111 *rrr tttt ssss   drl+/2=dth*dsh   .. quat. mpy .. not fused on SP-6
Parallel transfer operations, using reg codes from controller instruction:
  001 *** 1 *** **** ****  shift            .. normalize
  010 *** 1 *** **** ****  copy (dr=dt)
  011 *** 1 *** **** ****  conj (dr=conj(dt))
  100 *** 1 *** **** ****  rsh              .. shift direction from controller instr
  110 *** 1 *** **** ****  wshc (r/w sh cnt) .. access associated 9-bit count regs
.. ALU w/o BF uses 12 multipliers, 5500 LUTs incl. controller, fits into XC6SLX9
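The field layout read off the table above (3+3+4+4+4 = 18 bits) can be sketched as a small decoder. This is an illustration only; the field names (`op_hi`, `op_lo`, `r`, `t`, `s`) are ours, not part of the actual tool chain.

```python
# Minimal sketch: unpack an 18-bit V144 instruction word into the field
# groups shown in the table above (3+3+4+4+4 bits). Field names are
# illustrative, not taken from the real assembler.
def decode_v144(word):
    assert 0 <= word < (1 << 18)
    return {
        "op_hi": (word >> 15) & 0b111,   # first 3-bit opcode group
        "op_lo": (word >> 12) & 0b111,   # second 3-bit opcode group
        "r":     (word >> 8)  & 0b1111,  # flag bit + rrr destination reg
        "t":     (word >> 4)  & 0b1111,  # tttt source reg
        "s":     word         & 0b1111,  # ssss source reg / sub-opcode
    }

# e.g. '110 010 0rrr tttt ssss' (dr=(dt,ds)2) with r=3, t=5, s=7:
w = (0b110 << 15) | (0b010 << 12) | (0b0011 << 8) | (0b0101 << 4) | 0b0111
fields = decode_v144(w)
```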
SLIDE 9
π-Nets supports the various number codes by a unique type of real number
Coding data and operations on codes to be evaluated by processors:
Numbers need to be encoded by bit strings before they can be digitally computed with.
  enc: IR → B*   encoding function, partially defined
  dec: B* → IR   decoding function, partially defined
such that rnd = dec∘enc: IR → IR, the 'rounding' function, fulfils enc(r) = enc(rnd(r)), hence rnd∘rnd = rnd
Operations op: IR → IR are substituted by op' = enc∘op∘dec: B* → B* on the machine:
  for r = rnd(r):  dec(op'(enc(r))) = rnd(op(r))   (substitution inserts rounding)
Operations op: IR×IR → IR etc. are handled similarly.
Algorithms (compositions of operations) are executed on the machine by substituting every operation op on the reals by the corresponding op' on number codes.
Tuple codes can be different from tuples of codes. Selected op's can add extra approximation errors. Certain composite operations can be implemented as 'fused' operations w/o intermediate roundings.
π-Nets supports several standard and non-standard encodings including I32, X16, X35, V144, F32, F64, G45 .. to be expanded (rnd'd from int'l num) by their (unique) rounding operations only, and as attributes to the abstract computations performed by its processes, telling the compiler how to substitute operations.
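The enc/dec/rnd scheme can be illustrated outside π-Nets, e.g. in Python with the IEEE float32 code standing in for B* (the choice of code is ours, for illustration only):

```python
import struct

# Sketch of the enc/dec/rnd scheme above, using the IEEE float32 code
# as the bit-string set B* (illustrative; π-Nets targets other codes too).
def enc(r):                      # enc: IR -> B* (4 bytes)
    return struct.pack('<f', r)

def dec(b):                      # dec: B* -> IR
    return struct.unpack('<f', b)[0]

def rnd(r):                      # rnd = dec o enc, the rounding function
    return dec(enc(r))

# rnd is idempotent: rnd o rnd = rnd
assert rnd(rnd(0.1)) == rnd(0.1)
assert rnd(0.1) != 0.1           # 0.1 is not a float32 code point

# substituting an operation: op' = enc o op o dec acts on codes, and on
# already-rounded inputs r = rnd(r): dec(op'(enc(r))) = rnd(op(r))
def op(r):  return r * 3.0       # an operation on the reals
def op_(b): return enc(op(dec(b)))

r = rnd(0.1)
assert dec(op_(enc(r))) == rnd(op(r))
```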
SLIDE 10
Target architecture (example):
- red blocks: application specific ALUs
- fixed networking infrastructure/RTS in the FPGA
- disjoint memory partitions
(includes CAs, memory control, data commun./protocol impl., HIF)
[System diagram: PC connected via LAN to two FPGA SoCs; each SoC contains a NoC linking an ARM node, two memory controllers (MC) to DRAM, four controller + program RAM nodes (C+PR), and four application specific ALUs with data RAM (A1+DR .. A4+DR); the SoCs are also coupled by data networks]
SLIDE 11
Modeling an FPGA as a composite type of component (ntype/node statements of π-Nets)
hierarchic definition of the target network by the equivalent of a netlist; the FPGA program can be derived
.. defining the FPGA SoC with the standard infrastructure and its ARM nodes
ntype CSC (E) 8: E,              .. 8 E nodes (external links)
  (B) DB, NoC ↔ E DB             .. NoC connects all E nodes
  (M) M0 ↔ NoC, M1 ↔ NoC         .. 2 memory controllers
  (A) VGA ↔ DB                   .. fixed interface automaton
  (CA9) A0 ↔ M0 M1 DB, A1 ↔ A0   .. ARM type processor nodes
.. defining a single FPGA SoC target without soft processors, only its ARM nodes:
node (CSC) Q ↔ 0 LAN             .. predefined LAN bus links to Q A0 and predefined 'host'
.. extending the FPGA type by soft procs identified by the number codes they implement
.. focus on ALU .. serial ctrl is standard
ntype (CSC) CSC8 (g45) 4:P ↔ np P DB M0 ,   .. 4 soft procs implement g45
  (v144) 4:R ↔ nr R DB M1 npr P             .. 4 soft processors for v144
                                            .. np, nr, npr are link lists (tuples)
.. defining the dual FPGA system: .. can optionally use the 'host'
node (CSC8) Q0 ↔ 0 LAN, Q1 ↔ 0 0 Q0         .. Q1 linked to Q0 by E nodes
SLIDE 12 The π-Nets Language for numeric processing on the reals: Paradigms
close to everyday's mathematical syntax and semantics if applicable
- π-Nets distinguishes between 'data' and 'algorithms'
  data: finite function tables (tuples of numbers), sizes statically defined
  algorithms: compositions of elementary operations on numbers/tuples, statically defined
  constant function tables can be defined through algorithms
  both function tables and functions defined by algorithms can be partially defined
  tuple entries are numbers or 'invalid'; numeric ops deliver 'inv' on 'inv' input
  functions/algorithms always return a value
- π-Nets distinguishes between computational and non-computational operations
  computational: elementary operations on table entries (numbers) and tuples
  non-computational: evaluation and composition of function tables (index operations)
  interpolation from a function table is computational
- 'programs' define networks of automata to evaluate computational operations
  variables are special sub automata for storing tuples, restricted access
- no attempt to automate verification or to transform programs beyond const. folding
SLIDE 13 The built-in π-Nets data type(s) .. including the choice of built-in number/tuple operations
.. bias on modeling, robotics and DSP applications
Data set: IR – the real numbers
Literals: -1 = -1e0 = -1.00   .. plus symbolic/comp. cst.
Operations:
  + – * / % //              arithmetic w/o rounding errors/overflows
  >                         comparisons (no result, just select branches)
  sqrt ld exp sin cos atan  special functions, no approximation errors
  i32 x16 g45 etc.          rounding operations .. and a few more
.. scalar ops extend to tuples!
Data sets: IRn – n-tuples of numbers
Literals: (1,2,4,7,9) = (1.0, 2.00e0, 4, 7, 9)
Note: Tuples are not devices storing numbers but are abstract data
Operations:
  x.i x:m.i x.y x,y        component access, concatenation (nc) .. x.y:m.i
                           tuples x 'are' fcts on {0,..,n-1} applied by the x.i operation
  + – *, sum prod min max = vector operations and comparisons, roundings
  x y, A y                 dot product, matrix by vector, tensor product
  x /\ y                   exterior product by a vector, 3D vector prod.
  x.P y                    apply tuple as polynomial function, etc.
  x.fc                     interpolate from tuple (c) .. convert c ↔ nc
  some set operations (sets enum. as tuples)
- Comp. functions can be defined to add composite operations (pure fcts of multiple tuples)
Data type definitions can redefine '+', '-', '*', which carries over to the vector/polynomial ops .. substitution of operations again! .. to support e.g. complex arithmetic, finite fields, bit vectors
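How redefined '+' and '*' carry over to the vector operations can be sketched in Python (not π-Nets): a generic dot product written against '+' and '*' works unchanged for a user-defined complex type. The class and helper names are ours.

```python
# Sketch (Python, not π-Nets): redefining '+' and '*' for a type carries
# over to generic vector operations such as the dot product.
class Cpx:
    def __init__(self, a, b): self.a, self.b = a, b
    def __add__(self, o): return Cpx(self.a + o.a, self.b + o.b)
    def __mul__(self, o): return Cpx(self.a*o.a - self.b*o.b,
                                     self.a*o.b + self.b*o.a)
    def __eq__(self, o):  return (self.a, self.b) == (o.a, o.b)

def dot(x, y):
    """Generic dot product: uses only '+' and '*', so it works for any
    type that redefines them (reals, Cpx, finite fields, ...)."""
    acc = x[0] * y[0]
    for xi, yi in zip(x[1:], y[1:]):
        acc = acc + xi * yi
    return acc

assert dot([1, 2, 3], [4, 5, 6]) == 32        # plain reals
i = Cpx(0, 1)
assert i * i == Cpx(-1, 0)                    # i*i = -1
assert dot([i, Cpx(1, 0)], [i, i]) == Cpx(-1, 1)   # i*i + 1*i
```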
SLIDE 14
Mathematical notation (operations/expressions), indexing
  f x        function call
  f g x      nested call ( = f (g x) )
  f(x,y,z)   same for several args
  x.i        apply tuple as fct. (= xi)
  x.y.i      compose tuple fcts, f x.y = f(x.y)
  x:m.i      apply x as tuple valued fct, f(x:m,y:m).i = f(x.i,y.i)
  .. distinction of c. functions and nc. data is by an attribute to the name only
  x y        apply tuple as lin. fct, f x y = f(x y)
  A x        matrix*vector
  x A        Aᵀ * vector, f A x = f(A x)
  c x        multiply tuple by scalar
  x+y, x*y   sum/product by components
  x=y        is neither an assignment nor a store operation but a comparison
  a<b<c      equivalent to a<b, b<c (chained comparison)
  2x         equivalent to 2 x, 2*x for numbers and vectors
  x²+y²      equivalent to x^2 + y^2
  |x|        equivalent to abs x
  √x         equivalent to sqrt x
  Σx         equivalent to sum x for vector valued tuples x
  f*g.G      convolution for a group product table G
Expressions bound by prefix/infix operators, terminated by ')', '}', '→', '>>', eol, ','.
+,* are associative, + (optionally *) commutative (need to be if redefined), [,] bilinear
SLIDE 16 Options for further predefined tuple operations (while keeping the PL small):
0) A hierarchy of composite operations / functions extends the basic operations. The existing tuple operations suffice to define data/op types s.a. 'complex', 'quaternion'. The available tuple operations are well suited to implement e.g. the operations of a discrete exterior calculus (DEC) for the numeric treatment of field equations etc. Differential forms are not sampled on the same discrete set of points as functions are, but by their integrals over simplexes defined by points of that set.
1) Automatic differentiation of tuples as functions sampled on a discrete set {u0,u1,…,un-1}: a tuple f.u: i → f(ui) is replaced by F.u: i → (f(ui), f'(ui)) (discrete differential form) → algorithms with arguments u, f(u) automatically extend to u, F.u
  (by applying the rules of differentiation for +,*,∘ and special functions)
  - no extra syntax required, transparent integration into a tuple language
  - the f'(ui) are sometimes available as signal samples, are estimated otherwise
  - extended samples can be used to support interpolation
2) Automatic application of standard methods to deduce a complex computation from a simpler input specification, probably using 1), e.g. the derivation of the Hamiltonian vector field from a Hamilton function before solving the differential equation.
3) Geometric structures with multiple alternative (indexed) systems of coordinates (vector spaces, affine spaces, manifolds, bundles etc.) can be treated independently of a special choice of coordinates by extending coordinate tuples by a coordinate index and using change-of-coordinate functions and some function selecting a new index.
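Point 1) is forward-mode automatic differentiation; a minimal Python sketch (not π-Nets) carries (f(u), f'(u)) pairs through '+', '*' and a special function by the usual differentiation rules. All names here are illustrative.

```python
import math

# Sketch of point 1): forward-mode automatic differentiation, carrying
# (value, derivative) pairs through +, * and special functions.
class Dual:
    def __init__(self, val, dot): self.val, self.dot = val, dot
    def __add__(self, o): return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o): return Dual(self.val * o.val,
                                      self.dot * o.val + self.val * o.dot)  # product rule

def sin(x):  # chain rule for a special function
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# an algorithm written for plain arguments extends automatically:
def f(x): return x * x + sin(x)       # f(u) = u² + sin u

F = f(Dual(0.0, 1.0))                 # seed du/du = 1
assert F.val == 0.0                   # f(0)  = 0
assert F.dot == 1.0                   # f'(0) = 2·0 + cos 0 = 1
```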
SLIDE 17 Dealing with failures
1) Conditions like 'a=b' are tested to either hold or fail, or are asserted.
… IF a=b THEN … ELSE …
… $$ a=b …
2) Operations/functions like 'x/y' or 'I32 x' deliver a result tuple or fail.
… IF I32 x, x/y → h THEN … ELSE …
… $$ x/y → h …
- execution breaks in case of failure if the operation is de-asserted not to fail
- operations are asserted not to fail by default, producing an invalid result otherwise
.. functions always return a result
Invalid data: - added to the data set IR and allowed as tuple components
- component access 'x.i' can deliver invalid result
- operations and functions produce invalid output from invalid input
- invalid data can be sent to another process
- strings are literals for invalid data and can be output to a terminal
- numbers (valid data) output to a terminal process are displayed
- can be used to define multi-valued functions and set operations
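The 'invalid' element can be modeled outside π-Nets, e.g. in Python with `None` standing in for 'invalid' (a sketch under that assumption; names are ours):

```python
# Sketch (Python, not π-Nets): 'invalid' added to the data set, propagated
# through numeric operations and delivered by failing component access.
INV = None

def add(x, y):
    # numeric ops produce invalid output from invalid input
    return INV if x is INV or y is INV else x + y

def get(t, i):
    # component access x.i may deliver an invalid result
    return t[i] if 0 <= i < len(t) else INV

t = (1.0, INV, 3.0)
assert add(get(t, 0), get(t, 2)) == 4.0
assert add(get(t, 0), get(t, 1)) is INV   # invalid input -> invalid output
assert get(t, 7) is INV                   # failed access -> invalid
```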
SLIDE 18 Algorithms: notation for expressions, references, branches, block structure
- data can be named and then be referenced:
expression → name
  .. single Greek characters allowed as names
  .. subsequent expressions reference them .. no forward refs
  .. cannot redefine valid names
  .. data names do not represent storage .. functional style
  no extra termination character to end a statement, no 'return' statement
  can define/use computed symbolic constants (evaluated at compile time)
- algorithms list the data expressions to be evaluated in a block {→ args ..expr.. res }
  can use an open if .. then .. else for branching and for checking for errors
  blocks can have arguments, be indexed, and be named and called as functions
  functions can be recursive up to a limited depth (recursions always stop, verify!)
- variables are only supported within automata type definitions and for processes
the write operation to a variable x is '>>': expression >> x
- applications presented as sets of processes all active from start
  processes use a control block calling to variables/automata/IO, may be cyclic
  processes can have private sub automata/processes, send/receive/sample
  processes can be subdivided into threads sharing control
  threads can be given a prescribed timing through delay statements
  each variable of a process is written by a unique thread
SLIDE 19 Some details
Defining a function:  fct name { → args … … ress } .. options
Computed constants:   const twid 256:{→ k (cos(πk/512), sin(πk/512))} .. special fct
Data type definition:
  fct cpx * { → a,b,c,d  a·c – b·d, a·d + b·c } .. etc.
Automata type definition:
  atype atn, 2 x, 10: y                   .. variable y is idxd, opt. init. vals. 0
  fct atn f { → args … can r/w vars. … ress }   .. ress opt.
  .. oo style access rules for variables, inheritance
Process definition:
  apc pnm, r, s, (atn) q { … control block … }  .. vars, sub aut.
  .. access rules for var's .. can be given for simulation only
Communicating threads:
  { … f q.x → y … # … g y >> p … }        .. read q.x .. send to apc p
Threads in a group share the top level control flow:
  apc pn {#T1 … if cond then … #T2 … else #T1 … … #T2 … … }        .. th. labels
  apc pn {#T1 … {… IF cond THEN … #T2 … ELSE … #T3 … }° … #T4 …}   .. °T1 cont.
  .. 100:{#T … … }   .. parallel SIMD 'loop' defining indexed threads T.0,…,T.99
Threads using diff. codes:   { # f64 … → y ... # i32 … y … }   .. aut. conversion
Threads on different P's:    apc pnm #1 on P1, #2 on P2 {#1 … … #2 … … }
SLIDE 20 Timing within a thread and resulting performance requirements for realtime applications
{ ..S0.. $$+d1 ..S1.. $$+d2 ..S2.. { if-then .. $$+d21 .. else .. $$+d22 .. } .. $$+d3 ..S3.. }
.. section start times: t0 | t1=t0+d1 | t2=t1+d2 | then-branch: t2+d21, else-branch: t2+d22 | t3=t2+d3
- outputs from every section Si occur at a specific time ti from the starting time t0
(assuming no waiting to occur apart from the $$+ operators)
- branches within a thread may exhibit different time patterns for their outputs
.. must make sure that d3 > max(d21,d22)
- output data from Si can be prepared in Si-1; then di is the time that is available to do this. From the number of operations in Si-1 one can estimate the required performance in terms of operations/time for the thread.
- if two threads in a group are unified into one, the $$+ operators must be interleaved such that the original output times are reproduced. The combined thread will require a higher performance than the original ones. The timing of a thread must take into account data dependencies from other threads.
- if a block is shared by several threads, they synchronize on entry and every re-entry
- $$+ also serves to define timeout condition, '$$' for more general time conditions
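The output times of the sections above can be computed mechanically from the '$$+d' delays; a small Python sketch (function name ours) also checks the constraint that d3 must exceed max(d21, d22):

```python
# Sketch: computing the section output times t_i of a thread from its
# '$$+d' delay operators, as in the timeline on this slide.
def section_times(t0, d1, d2, d21, d22, d3):
    t1 = t0 + d1
    t2 = t1 + d2
    t_then = t2 + d21        # output time in the 'then' branch
    t_else = t2 + d22        # output time in the 'else' branch
    t3 = t2 + d3             # S3 starts d3 after t2 in either branch
    # both branch delays must fit into d3, otherwise S3 would start
    # before one branch has produced its output:
    assert d3 > max(d21, d22), "need d3 > max(d21, d22)"
    return t1, t2, t_then, t_else, t3

times = section_times(t0=0.0, d1=1.0, d2=2.0, d21=0.5, d22=0.75, d3=1.0)
```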
SLIDE 21 Summary of techniques applied to keep π-Nets simple (in spite of the complex target HW):
- no basic types to be distinguished, no Boolean type or pointer type
- tuple types are predefined, no data set constructions in type definitions
- small set of predefined operations, to be extended by functions
- type definitions through sets of operations only, sufficient to support overloading
- automatic overloading by mapping functions/operations to tuples
- same syntax for process control blocks, function bodies and sub blocks
- common branching structure for algorithms and for treating run time errors
- replacement of program loops by tuple operations, autoindexed operations
- memory operations packed into definitions of automata types and processes
- processes optionally break up into threads and time sections
- delay statements for IO also shared for constraining execution times
- I/O through communications with predefined or external processes
- assignment of process/thread → coding/ALU/processor (unique coding per thread)
- two types of statements only for the structural definition of the target networks
- complete simulation or partial execution of applications on a PC
- static allocation of memory, processes, processors
- interactive environment, textual libs, selective compilation, SW caching/reconfiguration
SLIDE 22
const dt 1/256
fct 2 2 chg {→ gh,k  if gh>3 then (gh-π,1-k) else if gh< -3 then (gh+π,1-k) else (gh,k) }
fct 2 2 pos {→ s,k  if k=0 then (cos s, sin s) else –(cos s, sin s) }
fct 2 1 ang {→ g,h  {→ d  if d>π then d-2π ← else if d< -π then d+2π ← else d }(g-h) → e  e/2 }
apc part on host, t 0, g 0, g' 0.5, gk 0, h 0, h' 1.5, hk 1 , (gc) gd   .. gc: pd. type of display automata
{ $$+ dt                                        .. real time! .. unit is 'sec'
  pos(g,gk) → gx,gy  pos(h,hk) → hx,hy          .. transform angles into IR² coordinates
  ang((g+π*gk),(h+π*hk)) → a
  cos(a)/((hx-gx)²+(hy-gy)²) → f                .. repelling force in IR²
  { if a<0 then -f else f } → gh''
  g + g' dt + gh'' dt²/2 >> g                   .. differential equation
  g' + gh'' dt >> g'
  h + h' dt – gh'' dt²/2 >> h
  h' – gh'' dt >> h'                            .. same for both coordinates
  chg(g,gk) >> g,gk  chg(h,hk) >> h,hk          .. change of coordinates
  0,0,0 >> gd.r                                 .. re-display g,h:
  100(gx,gy)+(110,110) >> gd.p  0,0,0 >> gd.b   .. set paint position, paint g as block
  100(hx,hy)+(110,110) >> gd.p  0,0,0 >> gd.b   .. same for h
←}
Two charged particles moving on the unit circle
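A rough Python re-implementation of the example can serve as a cross-check (an illustration, not generated π-Nets output): the π-Nets version keeps angles small via two coordinate charts (the k index and fct chg); the sketch below uses plain angles instead, with the second particle starting at its effective angle π.

```python
import math

# Rough sketch of the example above: two like charges on the unit circle,
# starting at angles 0 and π, with the tangential part of a 1/r² repulsion.
dt = 1/256
g, gv = 0.0, 0.5              # angle / angular velocity of particle 1
h, hv = math.pi, 1.5          # angle / angular velocity of particle 2 (hk=1 -> +π)

def wrap(d):                  # wrap an angle difference into (-π, π]
    if d > math.pi:  return d - 2*math.pi
    if d < -math.pi: return d + 2*math.pi
    return d

for _ in range(256):          # one simulated second
    gx, gy = math.cos(g), math.sin(g)
    hx, hy = math.cos(h), math.sin(h)
    a = wrap(g - h) / 2
    f = math.cos(a) / ((hx - gx)**2 + (hy - gy)**2)   # repelling force
    if a < 0: f = -f
    g, gv = g + gv*dt + f*dt*dt/2, gv + f*dt          # integrate the ODE
    h, hv = h + hv*dt - f*dt*dt/2, hv - f*dt

# the symmetric ±f updates conserve the total angular momentum
assert abs((gv + hv) - (0.5 + 1.5)) < 1e-9
```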