- J. Henkel, Univ. of Karlsruhe, WS 04/05, 2004
http://ces.univ-karlsruhe.de
Design and Architectures for Embedded Systems
Prof. Dr. J. Henkel
CES - Chair for Embedded Systems, University of Karlsruhe, Karlsruhe, Germany
Today: Code Optimization for Embedded Processors
[Design-flow overview: system specification -> design space exploration -> system partitioning -> estimation & simulation -> refine; embedded processor design, hardware design, middleware/RTOS, and integration of embedded IP in the target IC technology; prototyping -> tape-out; optimization applies throughout the flow.]
Standard compiler: Source Program P + Compiler for Processor Architecture A -> Machine Code for executing P on Architecture A.
Retargetable compiler: Source Program P + Processor Model for Architecture A_i (one of A_0 ... A_(n-1)) -> Machine Code for executing P on Architecture A_i.
Each processor family may have many derivates (architectural distinctions).
ASIPs have three classes of parameters:
- Extensible instructions (user-defined and completely customized)
- Parameterizations (cache size/policy etc.)
- In/exclusion of predefined blocks (e.g. special-purpose registers, test logic etc.)
=> it is cumbersome to write a new compiler for each derivate (and have the compiler make use of all its features)
=> retargetable code generation techniques
Note: retargetable code generation for embedded processors is different from general-purpose retargetable code generation (ex: GNU compiler)!
Standard compiler: mainly user-supported machine model. Parts of the compiler tool suite can be reused for a new architecture, but significant implementation work needs to be done.
Retargetable compiler: mainly automated process driven by a machine model. The user of the compiler tool suite can configure the compiler for retargeting to a new architecture; the vendor of the tool suite does not need to make any changes.
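The contrast can be made concrete with a toy sketch: code emission driven entirely by an exchangeable machine model, so that only the model (not the compiler) changes per target. All names, mnemonics, and the IR tuple format below are invented for illustration; real retargetable compilers driven by machine description languages are far richer.

```python
# Minimal sketch of model-driven (retargetable) code emission.
# Processor models and mnemonics are invented for illustration.

def emit(ir, model):
    """Translate three-address IR tuples (op, dst, src1, src2) into
    assembly for whichever processor `model` describes."""
    asm = []
    for op, dst, a, b in ir:
        mnemonic = model["ops"][op]   # retargeting: only the model changes
        asm.append(f"{mnemonic} {dst}, {a}, {b}")
    return asm

ir = [("add", "r1", "a", "b"), ("mul", "r2", "r1", "c")]

model_A = {"ops": {"add": "ADD", "mul": "MUL"}}    # a RISC-like target
model_B = {"ops": {"add": "add.w", "mul": "mpy"}}  # a DSP-like target

print(emit(ir, model_A))  # -> ['ADD r1, a, b', 'MUL r2, r1, c']
print(emit(ir, model_B))  # -> ['add.w r1, a, b', 'mpy r2, r1, c']
```

The same `emit` function serves both targets; retargeting means writing a new model, not a new compiler.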
- Designed for arithmetic-intensive applications; reoccurring algorithms and transformations like FFT etc.
- Dedicated hardware multipliers; AGUs (address generation units), since DSPs perform memory-intensive operations
- Special-purpose registers: bound to certain instructions/addressing modes
- Some ILP (Instruction-Level Parallelism) and special reoccurring instruction patterns/sequences, e.g. MAC (Multiply-and-Accumulate)
- Architectural features make code generation quite complex
- Major vendors: TI, Motorola, Analog Devices, NEC
Typically: VLIW (very-long-instruction-word) architecture
- Instruction word controls the level of parallelism that can be achieved
- Multiple FUs operating in parallel => high peak performance
- FUs not needed in a certain cycle have to be set idle (NOP)
- Optimization goal: keep FUs busy
=> disadvantage of VLIWs: large code size; addressed by:
- Code compression (like Philips' Trimedia, which has on-chip HW for CC)
- Variable-length VLIW: variable lengths to suppress NOPs
- Differential encoding: only encode the diff to the next VLIW instruction
- Multiple instruction formats (like Infineon Carmel): dynamically switches between 24/48/144-bit instructions (depends on the amount of ILP possible; done per code segment)
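As a rough illustration of why NOP suppression pays off, the sketch below compares a fixed-width VLIW encoding with a variable-length one. The slot count, operation width, and per-cycle header size are assumed values for illustration, not those of any real architecture.

```python
# Sketch: code-size effect of variable-length VLIW encoding.
# 8 issue slots, 32-bit operations, 8-bit per-cycle header: assumptions.

SLOTS, OP_BITS, HEADER_BITS = 8, 32, 8

def fixed_size(schedule):
    # every cycle encodes all slots, idle ones as explicit NOPs
    return len(schedule) * SLOTS * OP_BITS

def variable_size(schedule):
    # only busy slots are encoded, plus a per-cycle header marking them
    return sum(len(busy) * OP_BITS + HEADER_BITS for busy in schedule)

# toy schedule: the set of busy functional-unit slots in each cycle
schedule = [{0, 1, 2, 3}, {0}, {0, 5}, {2}]
print(fixed_size(schedule), variable_size(schedule))  # -> 1024 288
```

With low average ILP, most fixed-width slots hold NOPs, so the variable-length form is much denser.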
Special features:
- conditional instructions (support fast execution of if-then-else)
- SIMD (single-instruction-multiple-data) instructions
- Trimedia TM 3260: 31 FUs (functional units); 128 general-purpose registers
- Others: TI TMS320C6201 => not as many irregularities as a 'typical' DSP
- But: many features and hence a large 'design space'
- An ASIP is designed for one specific application; a DSP is designed for a class of applications (e.g. video processing, audio processing, ...)
- An ASIP has many more degrees of freedom for configuring => large and difficult-to-handle design space
- Retargetability is crucial for marketing modern ASIP tool suites
- Still nowadays, DSPs are often programmed in assembly language (efficiency) => this is unacceptable with the rising importance of embedded software (e.g. millions of lines of code; SW often more costly than hardware design)
- Code generation is one thing; efficient code generation is another thing
- Compilers compromise performance (unacceptable for high-performance applications; has to be traded off against manual coding)
- Performance can be up to 7x worse than manual coding!
- Implies larger memory. Note: at least on-chip memory is very expensive
- Interesting: some optimization techniques increase performance and reduce code size, whereas others benefit only one criterion and worsen the other
Dense code:
- ARM Thumb architecture
- Code compression: like in [LeHe02]
Low power: has been investigated, but there are hardly any compilers with low-power options.
Compilation time: important for general-purpose systems. But: in embedded systems, only one application will run on that system => if benefits can be achieved through longer code generation phases, this will be exploited.
=> leads to unusual steps in code generation for embedded systems: time-intensive heuristics like SA (Simulated Annealing) and GA (Genetic Algorithms) have been used.
Code generation flow: Source code -> Source-level transformations -> IR -> IR Optimizations -> Code selection -> Register allocation -> Instruction scheduling -> Address code generation -> Assembly code (the steps from code selection onward are architecture-dependent).
Note: even though the steps are serialized, an optimum solution would take all steps into consideration at once. This is not feasible because of the complexity of the problem, which is therefore split into serialized optimization tasks.
Many problems shown in the following: refer to [Leu00].
(Code generation flow - current step: code selection)
[Figure: a sample DFG with two multiplications, one addition, and four memory accesses (MEM).]
Shown: a sample DFG for which a code assignment has to be found such that the overall execution time is minimal. (from [Marw03])
[Figure: the same DFG annotated with alternative covers (e.g. reg1:add:13 vs. reg1:mac:12) and a register-transfer operation mov2.]
A possible code assignment includes routing of values between various registers. (from [Marw03])
(Code generation flow - current step: instruction scheduling)
Constraints for instruction scheduling:
- Dependencies between instructions (data, control) govern the sequence
- Degree of possible ILP (Instruction-Level Parallelism), due to, for example, limited processor resources like registers, ALUs
- Micro-architectural issues, for example pipelining effects
Goal in most cases: minimize the total execution time (here this means # of cycles; this is different from scheduling in synthesis) of a given code sequence.
Example: instruction scheduling for clustered VLIW => constraint: registers are local to certain FUs.
Further constraints: FUs are typically specialized (non-interchangeable for the execution of a certain operation).
[Figure a): orthogonal instruction slots - all FUs share one global register file.]
[Figure b): non-orthogonal instruction slots, i.e. a clustered VLIW data path - clusters 1..4, each with two FUs (FU1..FU8) and a local register file, connected by an interconnection network.]
[Leu00]
[Figure: TI C6x-style data path - clusters A and B with register files; FUs L1, S1, M1, D1, X2 on the A side and L2, S2, M2, D2, X1 on the B side; address and data buses.] [Leu00]
Features:
- Two symmetric clusters A and B, each with a 16x16 reg file and FUs named L, S, M, D
- Each FU is only capable of executing a subset of instructions
- Delay for execution may be 1 unit or more (most instructions have a 1-unit delay; longer delays appear as delay slots)
- FUs work mainly on their local reg file; in one instruction cycle, at most one FU may read from the other reg file (either copied or directly consumed)
Problem definition ...
http://ces.univ-karlsruhe.de
algorithm PARTITION
input: DFG G with n nodes
output: P : [1..n] of {A, B}   // partitioning
begin
    P := RANDOM_PARTITIONING(G);
    mincost := LIST_SCHEDULE(G, P);
    temp := 10;
    while temp > 0.01 do
        for i := 1 to 50 do
            r := RANDOM(1, n);
            P[r] := (P[r] = A) ? B : A;        // move node r to the other cluster
            cost := LIST_SCHEDULE(G, P);
            delta := cost - mincost;
            if delta < 0 or RANDOM(0,1) < exp(-delta/temp) then
                mincost := cost;               // accept the move
            else
                P[r] := (P[r] = A) ? B : A;    // undo the move
            end if
        end for
        temp := 0.9 * temp;
    end while
    return P;
end algorithm
[Leu00]
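A runnable sketch of the annealing loop above, with a toy stand-in for LIST_SCHEDULE: cycles on the busier of two clusters (2 FUs per cluster assumed) plus one penalty per DFG edge that crosses clusters. The cost model and parameters are illustrative only.

```python
import math
import random

# Simulated-annealing cluster partitioning, mirroring the pseudocode.
# cost() is an assumed, simplified substitute for list scheduling.

def cost(part, edges):
    n_a = sum(1 for c in part if c == "A")
    n_b = len(part) - n_a
    cycles = max(math.ceil(n_a / 2), math.ceil(n_b / 2))   # 2 FUs per cluster
    moves = sum(1 for u, v in edges if part[u] != part[v])  # cross-cluster values
    return cycles + moves

def partition(n, edges, seed=0):
    rng = random.Random(seed)
    part = [rng.choice("AB") for _ in range(n)]   # RANDOM_PARTITIONING
    mincost = cost(part, edges)                   # LIST_SCHEDULE stand-in
    temp = 10.0
    while temp > 0.01:
        for _ in range(50):
            r = rng.randrange(n)
            part[r] = "B" if part[r] == "A" else "A"      # flip one node
            c = cost(part, edges)
            delta = c - mincost
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                mincost = c                               # accept move
            else:
                part[r] = "B" if part[r] == "A" else "A"  # undo move
        temp *= 0.9
    return part, mincost

# a small chain-shaped DFG: 8 nodes, edges i -> i+1
edges = [(i, i + 1) for i in range(7)]
part, c = partition(8, edges, seed=1)
print(part, c)
```

For this chain the best achievable cost under the toy model is 3 (two contiguous halves, one crossing edge); annealing typically finds it or comes close.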
Representative results for TI C6201 code compared to the standard compiler approach.
Shown is the normalized execution time decrease for various applications.
[Figure: bar chart (10-100%) over the benchmarks dct_i, dct_j, jpeg, lattice, h263, fft1, fir, viterbi, test, fft2, iir.] [Leu00]
(Code generation flow - current step: register allocation)
Register allocation: assign variables to registers, thereby taking into consideration the architectural peculiarities (see later) of an embedded processor.
Some definitions:
- DFT = Data Flow Tree: the special case of a DFG that is a) connected and b) all nodes have a fanout <= 1
- IR = Intermediate Representation: typically a three-address code using auxiliary variables. High-level constructs like if-then-else etc. are replaced by jumps/returns. Advantage of IR: independent of the source language.
- Basic Block: a sequence of IR instructions; no jumps in/out except at the beginning or end of the sequence
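The basic-block definition translates into a small leader-detection sketch (the string-based IR format is invented for illustration; a real compiler works on structured IR):

```python
# Sketch: splitting linear IR into basic blocks via "leaders":
# the first instruction, any labelled instruction, and any
# instruction following a jump.

def basic_blocks(ir):
    leaders = {0}
    for i, instr in enumerate(ir):
        if instr.endswith(":"):                       # labelled -> leader
            leaders.add(i)
        if instr.startswith(("goto", "if")) and i + 1 < len(ir):
            leaders.add(i + 1)                        # after a jump -> leader
    starts = sorted(leaders)
    return [ir[s:e] for s, e in zip(starts, starts[1:] + [len(ir)])]

# condensed version of the example function's IR
ir = ["t1 = a + b", "t4 = x > 10", "if t4 goto L1",
      "y = x >> t9", "goto L2",
      "L1:", "y = x >> t6",
      "L2:", "return y"]
blocks = basic_blocks(ir)
print(len(blocks))  # -> 4, matching basic blocks B1..B4
```

Applied to the example IR, it recovers the four blocks B1..B4 marked in the source comments.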
source:
    int f (int a, int b, int c) {
        int x, y;
        x = a + b - 3 * c;
        if (x > 10) { y = x >> a + b - c; }
        else        { y = x >> b - 10 * c; }
        return y;
    }

IR:
    int f (int a, int b, int c) {
        int x, y;
        int t1, t2, t3, t4, t5, t6, t7, t8, t9, t10;
        /* basic block B1 */
        t1 = a + b; t2 = 3 * c; t3 = t1 - t2; x = t3;
        t4 = x > 10; if (t4) goto L1;
        /* basic block B2 */
        t8 = 10 * c; t9 = b - t8; t10 = x >> t9; y = t10; goto L2;
        /* basic block B3 */
        L1: t5 = a + b; t6 = t5 - c; t7 = x >> t6; y = t7;
        /* basic block B4 */
        L2: return y;
    }

[Figure: basic block B1 as a DFG - LOAD a and LOAD b feed '+'; LOAD c and the constant 3 feed '*'; their difference feeds the comparison '>' and the conditional goto L1.]
A basic block represented as a DFG; the variables a, b, c are loaded from memory first. [Leu00]
Problem: DSPs typically have special-purpose registers (they can only be utilized for certain operations).
- Special-purpose registers enable some kind of pipelining of the data path, thereby reducing combinational delay -> higher clock frequency
- For processors with ILP: multiple read/write ports can be avoided
- The instruction word can be reduced since registers are addressed implicitly
- Data flow graphs (DFGs) are often used to represent expressions
- A major problem: register allocation for common sub-expression elimination
[Figure: TI C25 data path - ALUs, accumulator ACCU, a multiplier with the special-purpose registers TR (holds the left multiplier input) and PR (stores the result of the multiplication), MEM, data bus.] [Leu00]
The right figure shows a DFG split into DFTs (triangles).
- Typically, each DFT would receive a separate register allocation and would store the CSEs in memory, i.e. definitions and uses have to be mapped to certain locations
- This mapping is most likely different when viewing all DFTs at once instead of mapping one certain DFT at a time
- CSEs should probably also not reside in special-purpose registers, because those might be required by other operations
- But storing in memory is too expensive: execution time, power, code size etc.
- Example: next slide ...
[Figure: DFTs T1, T2, T3; R1 holds the CSE defined in T1; routes 1-3 transfer the value via R2/R3 to the uses u2 (in T2) and u3 (in T3).]
Possible instructions according to the data path of the TI C25: "lac", "addk", "add", "pac", "apac", "mpy", "lt", "sacl", "spl".
[Example DFG: load a, load b -> '*'; the product is a CSE; product + c -> store a; product + 42 -> store b.]
Two possible code sequences:
    (1) lt a; mpy b; spl temp; pac; add c; sacl a; lac temp; addk 42; sacl b
    (2) lt a; mpy b; pac; add c; sacl a; pac; addk 42; sacl b
- That PR can be reused here is due to the special structure of this DFG!
- The decision where to store a CSE cannot be made locally -> it depends on the whole DFG
=> DFT-by-DFT code generation is of little use for embedded processors like DSPs
- Two implementations are shown above; the left one needs one more operation (it spills the CSE to memory via spl/lac instead of keeping it in PR)
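The choice between covers interacts with code selection, which is classically done by cost-based tree covering. A minimal dynamic-programming sketch follows; the patterns and costs are invented and far simpler than the TI C25 data path, showing only how a composite MAC pattern can beat separate MUL and ADD covers:

```python
# Sketch: cost-based tree covering for code selection. A MAC pattern
# covers a '+' whose left child is '*' in a single instruction.
# Patterns and unit costs are illustrative assumptions.

def cover(node):
    """node = (op, left, right) or a leaf string; returns (cost, instrs)."""
    if isinstance(node, str):
        return 0, []                                  # operand already available
    op, l, r = node
    lc, li = cover(l)
    rc, ri = cover(r)
    best = (lc + rc + 1, li + ri + [op.upper()])      # plain 1-op cover
    if op == "+" and not isinstance(l, str) and l[0] == "*":
        mc, mi = cover(l[1])                          # MAC covers '*' and '+'
        nc, ni = cover(l[2])                          # together for cost 1
        cand = (mc + nc + rc + 1, mi + ni + ri + ["MAC"])
        if cand[0] < best[0]:
            best = cand
    return best

tree = ("+", ("*", "a", "b"), "c")    # a * b + c
print(cover(tree))                    # -> (1, ['MAC']) instead of MUL + ADD
```

Real code selectors (e.g. iburg-style tree parsers) generalize this bottom-up minimum-cost matching to full pattern grammars.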
Example: a DFG separated into three DFTs, T1, T2, T3
- R1 stores a CSE defined in T1
- R2, R3 store the values of the CSE used in variables u2, u3, which are used in T2 and T3, respectively
- The total cost of the complete DFG needs to be minimized
- That may lead to a different allocation than the one given on the right side!
[Figure: DFTs T1, T2, T3 with routes 1-3 from R1 via R2/R3 to the uses u2, u3.]
algorithm CSE_REGISTER_ALLOCATION
input: DFG G with k CSEs; m registers
output: sequential assembly code for G
begin
    G' := DECOMPOSE(G);                 // split G into DFTs
    A[1..k] := m + 1;                   // assign all CSEs to memory
    current_cost := INITIAL_COST(G');
    temp := 50;
    while temp > 0.1 do
        for count := 1 to 10 do
            DO_MODIFICATION(G', A);
            schedule := TOPOLOGICAL_SORT(G');
            new_cost := 0;
            for all trees T in schedule do
                new_cost += COVER_COST(T);
            end for
            new_cost += ADDR_COST(schedule);
            if REGISTER_CONFLICT(schedule) then
                new_cost := infinite;
            end if
            delta := new_cost - current_cost;
            if delta < 0 or RANDOM(0,1) < exp(-delta/temp) then
                current_cost := new_cost;
            else
                UNDO_MODIFICATION(G', A);
            end if
        end for
        temp := 0.9 * temp;
    end while
    for all trees T in schedule do
        EMIT_ASSEMBLY_CODE(T);
    end for
end algorithm
(Code generation flow - current step: address code generation)
Special address generation unit (AGU) for parallelizing computation and address generation.
Supports:
- Immediate load
- Immediate modify
- Auto-increment
- Auto-modify
Differing for diverse DSPs:
- # ARs (address registers)
- # MRs (modify registers)
- r - auto-increment range
=> these can be used as parameters in a retargetable compiler.
[Figure: AGU - address register file (AR pointer p), modify register file (MR pointer q), immediate constant c; the effective address is formed via +/- d.]
Problem: restricted # of addressing modes in DSPs.
- Typically less of a problem at assembly level
- But: high-level languages have the concept of function calls and local variables, which only exist during the execution of a particular function
=> these need to reside on a stack, because a DSP has a very limited number of registers.
Left example: stack layout during execution: function parameters (pushed by the calling function); return address; local variables; spill space.
Right example: floating frame pointer FP, relatively addressed.
[Figure: a) stack frame with SP - spill space, return addr, variables y and x, direction of growth; b) frame pointer FP at offset n, SP at n+c, updated via FP += c.]
- All data needs to be addressed relative to SP (Stack Pointer)
- Helpful: an addressing mode "SP + offset"
- But: not always available on DSPs!
- Solution: an additional frame pointer FP (should move through the stack frame)
How does it work?
1. Initialize FP with the effective address of the first local variable
2. For each access, increment or decrement FP
3. Therefore: keep FP in an address register AR (if available)
What can be done?
- Cannot: the positions of function parameters and the return address are fixed
- Can: the positions of local variables can be switched by the compiler
=> can be used for optimization in such a way that FP updates can be conducted with AGU auto-increment/decrement (this is fast since there is dedicated hardware for it)
Note: FP updates that need to add/subtract a number > 1 are expensive.
AR: address register
Goal: assign variables such that address generation can mostly be conducted by using auto-increment/decrement.
- Left: only 4 out of 13 operations are auto-increment/decrement, since the assignment was simply done in alphabetical order
- Right: optimized
- Note: the optimization comes for free! It does not need any additional registers
[Figure: access sequence b d a c d a c b a d a c d. Layout M1 = a b c d (alphabetical): AR update sequence AR=1, AR+=2, AR-=3, AR+=2, AR++, AR-=3, AR+=2, AR--, AR--, AR+=3, AR-=3, AR+=2, AR++; cost C(M1) = 9. Layout M2 = c a d b (optimized): mostly AR++/AR-- steps; cost C(M2) = 5.]
[Leu00]
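The costs C(M1) = 9 and C(M2) = 5 can be reproduced with a small sketch of the addressing-cost model: one unit for the initial AR load and one for every access more than one word away from the previous one (those need an explicit AR update), while auto-increment/decrement steps are free. A single AR and auto-increment range r = 1 are assumed.

```python
# Sketch: addressing cost of a memory layout for a given access
# sequence, under a single-AR AGU with auto-increment range 1.

def agu_cost(layout, accesses):
    addr = {v: i for i, v in enumerate(layout)}
    cost, pos = 1, addr[accesses[0]]       # 1 for the initial "AR = ..."
    for v in accesses[1:]:
        if abs(addr[v] - pos) > 1:
            cost += 1                      # explicit AR += d / AR -= d
        pos = addr[v]                      # auto-inc/dec steps are free
    return cost

seq = list("bdacdacbadacd")                # access sequence from the slide
print(agu_cost(list("abcd"), seq))         # alphabetical layout M1 -> 9
print(agu_cost(list("cadb"), seq))         # optimized layout M2    -> 5
```

Minimizing this cost over all layouts is the simple offset assignment problem, which the access-graph formulation on the next slide addresses.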
[Figure: access graph for the sequence - nodes a, b, c, d connected by edges weighted with access-adjacency counts; the optimized layout c a d b corresponds to a maximum-weight path in this graph.]
[Figure: GA encoding - a chromosome g_1 ... g_(n-1+k-1), e.g. 3 7 9 8 4 5, encodes the variable permutation v3 v7 v9 v8 v4 v5; separator genes split it into the memory layouts assigned to the address registers AR1 ... ARk.]
[Figure: crossover of two parent permutations A = 4 5 6 1 8 9 2 7 3 and B = 8 7 3 4 2 6 5 1 9 in four steps - segments A1 A2 A3 / B1 B2 B3 are exchanged and repaired so that the offspring A' and B' are again valid permutations.]
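The exchange-and-repair idea can be sketched with a generic order-preserving crossover for permutations (this is textbook OX; the exact repair scheme on the slide may differ):

```python
# Sketch: order crossover (OX) for permutation chromosomes, as needed
# when GA individuals encode memory layouts: a slice of parent A is
# kept, the remaining genes are filled in parent-B order, so the
# offspring is again a valid permutation.

def order_crossover(a, b, lo, hi):
    child = [None] * len(a)
    child[lo:hi] = a[lo:hi]                    # step 1: keep a slice of A
    fill = [g for g in b if g not in child]    # step 2: rest in B's order
    for i in range(len(child)):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

A = [4, 5, 6, 1, 8, 9, 2, 7, 3]
B = [8, 7, 3, 4, 2, 6, 5, 1, 9]
child = order_crossover(A, B, 3, 6)
print(child)                                   # -> [7, 3, 4, 1, 8, 9, 2, 6, 5]
assert sorted(child) == sorted(A)              # still a permutation
```

A plain two-point crossover would duplicate and drop genes; the fill step is what keeps the offspring a legal layout.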
algorithm OFFSET_ASSIGNMENT
input: variable set V, access sequence S, AGU parameters k, m, r
begin
    GENERATE_INITIAL_POPULATION();
    EVALUATE_FITNESS();
    for G1 generations do
        SELECT_PARENTS();
        GENERATE_OFFSPRING();
        MUTATE_OFFSPRING();
        EVALUATE_FITNESS();
        REPLACE_POPULATION();
        if no max-fitness improvement in the last G2 generations then
            break;                      // exit loop
        end if
    end for
    return best individual;
end algorithm
(Code generation flow - current step: source-level transformations)
Loops:
Loop permutations
Loop fusion
Loop fission
Loop unrolling
Loop tiling
…
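As a reminder of what two of these transformations do, here is a sketch of loop fusion and loop unrolling on a toy array computation (the conditions under which they preserve semantics, e.g. absence of conflicting dependences, are glossed over):

```python
# Sketch: loop fusion and loop unrolling. All three variants must
# compute the same result; they differ in loop overhead and locality.

def original(a, b):
    c = [0] * len(a)
    d = [0] * len(a)
    for i in range(len(a)):        # loop 1
        c[i] = a[i] + b[i]
    for i in range(len(a)):        # loop 2
        d[i] = c[i] * 2
    return d

def fused(a, b):
    c = [0] * len(a)
    d = [0] * len(a)
    for i in range(len(a)):        # fusion: one loop body, fewer
        c[i] = a[i] + b[i]         # loop-control instructions,
        d[i] = c[i] * 2            # better locality for c[i]
    return d

def unrolled(a, b):
    d = [0] * len(a)
    for i in range(0, len(a), 2):  # unrolling by 2 (assumes even length):
        d[i] = (a[i] + b[i]) * 2   # halves the branch/increment overhead
        d[i + 1] = (a[i + 1] + b[i + 1]) * 2
    return d

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(original(a, b), fused(a, b), unrolled(a, b))
```

On DSPs these transformations matter doubly: they reduce loop overhead and expose instruction patterns (e.g. MAC chains) to the later code-selection phase.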
References:
[Leu00] R. Leupers, "Code Optimization Techniques for Embedded Processors", Kluwer, 2000.
[Marw03] P. Marwedel, "Embedded Systems Design", Kluwer Academic Publishers, 2003.
[LeHe02] H. Lekatsas, J. Henkel, V. Jakkula, "1-cycle code decompression circuitry for performance increase of Xtensa-1040-based embedded systems", Proceedings of the IEEE 2002 Custom Integrated Circuits Conference, pp. 9-12, 12-15 May 2002.
Set Processors (ASIPs) Using a Machine description Language”, IEEE Tr on CAD, Vol. 20, No. 11, Nov 2001.