Design and Architectures for Embedded Systems, Prof. Dr. J. Henkel

SLIDE 1
  • J. Henkel, Univ. of Karlsruhe, WS 04/05, 2004

http://ces.univ-karlsruhe.de

Design and Architectures for Embedded Systems

  • Prof. Dr. J. Henkel
  • Chair for Embedded Systems (CES), University of Karlsruhe, Germany

Today: Code Generation Issues for Embedded Processors

SLIDE 2

Where are we?

Overview of the design flow (this lecture: embedded software):

  • Embedded software: optimization for low power, performance, area, …
  • Embedded processor design: extensible instructions, parameterization
  • Integration / hardware design: synthesis
  • Middleware, RTOS: scheduling
  • System specification, design space exploration: low power, performance, area
  • System partitioning: models of computation, spec languages
  • Estimation & simulation: low power, performance, area, …
  • Embedded IP: PEs, memories, communication, peripherals
  • IC technology
  • Optimization: low power, performance, area, …
  • Tape-out, prototyping
  • Refinement loops ("refine") connect the steps

SLIDE 3

Outline

  • Intro to code generation for embedded processors
  • Code generation approaches (parameterizable) for machine-dependent steps:
    • Code selection
    • Instruction scheduling
    • Register allocation
    • Address code generation

SLIDE 4

Retargetable Code Generation

Conventional compilation: source program P -> compiler for processor architecture A -> machine code for executing P on architecture A.
Retargetable compilation: source program P plus a processor model for architecture A_i (one of A_0 … A_n-1) -> retargetable compiler -> machine code for executing P on architecture A_i.

  • Each processor family may have many derivates (architectural distinctions)
  • ASIPs have three classes of parameters:
    • Extensible instructions (user-defined and completely customized)
    • Parameterizations (cache size/policy etc.)
    • In-/exclusion of predefined blocks (e.g. special-purpose registers, test etc.)
  • => it is cumbersome to write a new compiler for each derivate (and have the compiler make use of the distinctions) => retargetable code generation techniques
  • Note: retargetable code generation for embedded processors is different from general-purpose retargetable code generation (ex: GNU compiler)!
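The idea above can be sketched in a few lines: the compiler core stays fixed, and a processor model (here just a pattern/cost table, with costs and names invented for illustration, not taken from any real tool) is supplied as data. Retargeting to a new derivate then means editing the model, not rewriting the compiler.

```python
# Toy sketch of retargetable code selection: one selector, two machine models.

def select(expr, model):
    """Cover a tiny expression tree (op, left, right) or a leaf string."""
    if isinstance(expr, str):                       # leaf: a variable
        return [f"load {expr}"], model["load"]
    op, l, r = expr
    # use a MAC pattern first, if the modeled derivate has one
    if op == "+" and "mac" in model and isinstance(l, tuple) and l[0] == "*":
        code_a, c_a = select(l[1], model)
        code_b, c_b = select(l[2], model)
        code_r, c_r = select(r, model)
        return code_a + code_b + code_r + ["mac"], c_a + c_b + c_r + model["mac"]
    code_l, c_l = select(l, model)
    code_r, c_r = select(r, model)
    return code_l + code_r + [op], c_l + c_r + model[op]

dsp  = {"load": 1, "+": 1, "*": 2, "mac": 2}        # derivate with a MAC unit
risc = {"load": 1, "+": 1, "*": 3}                  # derivate without one

expr = ("+", ("*", "a", "b"), "c")                  # a*b + c
print(select(expr, dsp))    # uses mac
print(select(expr, risc))   # falls back to mul/add
```

Swapping `dsp` for `risc` changes the emitted code and cost without touching `select`, which is exactly the separation a retargetable compiler aims for.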

SLIDE 5

Levels of Retargetability

Standard compiler: the machine model is mainly user-supported (hand-written back end).
Retargetable compiler: the machine model drives a mainly automated retargeting process.

"Developer retargetability":
  • Parts of the compiler tool suite can be reused for a new architecture, but significant implementation work needs to be done

"User retargetability":
  • The user of the compiler tool suite can configure the compiler for retargeting to a new architecture; the vendor of the tool suite does not need to make any changes

SLIDE 6

Peculiarities of Embedded Processors w/r to Compilation

  • DSPs:
    • Designed for arithmetic-intensive applications; reoccurring algorithms and transformations like FFT etc.
    • Dedicated hardware multipliers; AGUs (address generation units), since DSPs perform memory-intensive operations
    • Special-purpose registers: bound to certain instructions/address modes
    • Some ILP (Instruction-Level Parallelism) and special reoccurring instruction patterns/sequences, e.g. MAC (Multiply-and-Accumulate)
  • Architectural features make code generation quite complex
  • Major vendors: TI, Motorola, Analog Devices, NEC
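The MAC pattern mentioned above is worth a concrete illustration: in three-address IR, the adjacent statements "t = x * y; s = s + t" (the inner loop of an FIR filter, for instance) can be fused into one MAC instruction. The following is my own minimal sketch with invented IR tuple names, not the pattern matcher of any particular compiler:

```python
# Fuse a mul whose result feeds the next add into a single MAC operation.

def fuse_mac(ir):
    """ir: list of ('mul', dst, a, b) / ('add', dst, a, b) tuples."""
    out, i = [], 0
    while i < len(ir):
        cur = ir[i]
        nxt = ir[i + 1] if i + 1 < len(ir) else None
        if (cur[0] == "mul" and nxt and nxt[0] == "add"
                and cur[1] in nxt[2:]):            # the product feeds the add
            other = nxt[3] if nxt[2] == cur[1] else nxt[2]
            out.append(("mac", nxt[1], cur[2], cur[3], other))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

# inner loop body of an FIR filter: acc += x[i] * h[i]
ir = [("mul", "t1", "x_i", "h_i"), ("add", "acc", "acc", "t1")]
print(fuse_mac(ir))   # one mac instead of mul + add
```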

SLIDE 7

Multi-Media Processors

  • Typically: VLIW (very-long-instruction-word) architecture
    • The instruction word controls the level of parallelism that can be achieved
    • Multiple FUs operating in parallel => high peak performance
    • FUs not needed in a certain cycle have to be set idle (NOP)
    • Optimization goal: keep the FUs busy
  • => disadvantage of VLIWs: large code size; addressed by:
    • Code compression (like Philips' Trimedia, which has on-chip HW for CC)
    • Variable-length VLIW: variable lengths to suppress NOPs
    • Differential encoding: only encode the diff to the next VLIW instruction
    • Multiple instruction formats (like Infineon Carmel): dynamically switches between 24/48/144-bit instructions (depends on the amount of ILP possible; done per code segment)
  • Special features:
    • Conditional instructions (support fast execution of if-then-else)
    • SIMD (single-instruction-multiple-data) instructions
  • Trimedia TM 3260: 31 FUs (functional units); 128 general-purpose registers
  • Others: TI TMS320C6201 => not as many irregularities as a 'typical' DSP
  • But: many features and hence a large 'design space'
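The code-size issue above can be made tangible with a toy bundler: each cycle issues at most one operation per FU, unused slots become NOPs in a fixed-length encoding, and a variable-length encoding could simply omit them. FU names and the op sequence are made up, and data dependencies are ignored for brevity:

```python
# Greedy VLIW bundling: start a new instruction word when an FU is reused.

def bundle(ops):
    """ops: list of (fu, op) in issue order; returns one dict per cycle."""
    words, current = [], {}
    for fu, op in ops:
        if fu in current:              # this FU already has work this cycle
            words.append(current)
            current = {}
        current[fu] = op
    if current:
        words.append(current)
    return words

ops = [("L", "ld a"), ("M", "mul"), ("L", "ld b"), ("S", "add"), ("M", "mac")]
fus = ["L", "S", "M", "D"]
words = bundle(ops)
fixed = len(words) * len(fus)          # slots a fixed-length VLIW must encode
packed = sum(len(w) for w in words)    # slots left after NOP suppression
print(fixed, packed)                   # the difference is pure NOP padding
```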

SLIDE 8

Comparison ASIP <-> DSP/VLIW

  • An ASIP is designed for one specific application; a DSP is designed for a class of applications (e.g. video processing, audio processing, …)
  • An ASIP has many more degrees of freedom for configuring => a large and difficult-to-handle design space
  • Retargetability is crucial for marketing modern ASIP tool suites
  • Note:
    • Even nowadays, DSPs are often programmed in assembly language (efficiency) => this is unacceptable with the rising importance of embedded software (e.g. millions of lines of code; SW often more costly than hardware design)
    • Code generation is one thing; efficient code generation another

SLIDE 9

What matters in embedded code generation?

  • Performance (#1 goal)
    • Compilers compromise performance (unacceptable for high-performance applications)
    • Has to be traded off against smaller design time (compared to manual coding)
    • Performance can be up to 7x worse than manual coding!
  • Code size
    • Implies larger memory. Note: at least on-chip memory is very expensive
    • Interesting: some optimization techniques increase perf. and reduce code size, whereas others benefit one criterion and worsen the other
    • Dense code:
      • ARM Thumb architecture
      • Code compression: like in [LeHe02]

SLIDE 10

What matters in embedded code generation? (cont'd)

  • Power
    • Has been investigated, but there are hardly any compilers with low-power options
  • Compilation speed
    • Important for general-purpose systems
    • But: in embedded systems only one application will run on that system
    • If benefits can be achieved through longer code generation phases, they will be exploited
    • => leads to unusual steps in code generation for embedded systems: time-intensive heuristics like SA (Simulated Annealing) and GA (Genetic Algorithms) have been used for optimization in code generation
SLIDE 11

Steps in Code Generation for Embedded Processors

Flow: source code -> source-level transformations -> IR optimizations -> architecture-dependent steps (code selection -> instruction scheduling -> register allocation -> address code generation) -> assembly code

  • Note: even though the steps are serialized, an optimum solution would take all steps into consideration at once. This is not feasible because of the complexity of the problem, which is therefore split into serialized optimization tasks
  • Many of the problems shown in the following: refer to [Leu00]

SLIDE 12

Code Selection

(Roadmap: code selection, the first of the architecture-dependent steps in the code generation flow.)

SLIDE 13

Code Selection: short overview

[Figure: a sample DFG with two multiplications, an addition, and four MEM (memory load) leaves, for which a code assignment has to be found such that the overall execution time is minimal. from: [Marw03]]

SLIDE 14

Code Selection: short overview (cont'd)

[Figure: a possible cover of the sample DFG with load, mul, mac and register-transfer instructions (mov2, mov3); covering the add node via "reg1:add" costs 13, via "reg1:mac" only 12. A possible code assignment includes routing of values between various registers. from: [Marw03]]
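The cover-selection idea behind the [Marw03] example is usually implemented as dynamic programming over the expression tree: every node gets its cheapest cover, and a MAC pattern covering "+(*, _)" competes with a plain add. A hedged sketch with invented costs (not the figure's numbers):

```python
# Tree-based code selection as bottom-up dynamic programming.

COSTS = {"load": 1, "mul": 2, "add": 1, "mac": 2}   # assumed instruction costs

def cover(node):
    """node: leaf string or (op, left, right); returns (cost, instrs)."""
    if isinstance(node, str):
        return COSTS["load"], [f"load {node}"]
    op, l, r = node
    cl, il = cover(l)
    cr, ir_ = cover(r)
    if op == "*":
        return cl + cr + COSTS["mul"], il + ir_ + ["mul"]
    best = (cl + cr + COSTS["add"], il + ir_ + ["add"])
    if isinstance(l, tuple) and l[0] == "*":        # alternative: mac covers +(*)
        ca, ia = cover(l[1])
        cb, ib = cover(l[2])
        alt = (ca + cb + cr + COSTS["mac"], ia + ib + ir_ + ["mac"])
        best = min(best, alt)
    return best

dfg = ("+", ("*", "a", "b"), ("*", "c", "d"))       # shaped like the sample DFG
print(cover(dfg))                                   # the mac cover wins here
```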

SLIDE 15

Instruction Scheduling

(Roadmap: instruction scheduling, the next architecture-dependent step in the code generation flow.)

SLIDE 16

Instruction Scheduling

  • Decides which instruction is executed at which time (time in this context is the cycle #)
  • Some factors that influence instruction scheduling:
    • Dependencies between instructions (data, control) govern the sequence
    • Degree of possible ILP (Instruction-Level Parallelism), due to, for example, limited processor resources like registers, ALUs
    • Micro-architectural issues, for example pipelining effects
  • Goal:
    • In most cases: minimize the total execution time of a given program (here this means # of cycles; this is different from scheduling in synthesis). Sometimes: minimize power
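The two factors above, dependencies and limited FUs, can be sketched with a minimal list scheduler (unit latencies assumed; this is illustrative, not the scheduler of [Leu00]):

```python
# List scheduling: each cycle, issue as many ready ops as there are FUs.

def list_schedule(deps, n_fus):
    """deps: {op: set of ops it depends on}; returns the ops of each cycle."""
    remaining = set(deps)
    done, cycles = set(), []
    while remaining:
        ready = sorted(op for op in remaining if deps[op] <= done)
        issue = ready[:n_fus]              # resource limit: one op per FU
        cycles.append(issue)
        done |= set(issue)
        remaining -= set(issue)
    return cycles

deps = {"ld_a": set(), "ld_b": set(), "ld_c": set(),
        "mul": {"ld_a", "ld_b"}, "add": {"mul", "ld_c"}}
print(list_schedule(deps, n_fus=2))        # 3 cycles on 2 FUs
```

With `n_fus=1` the same DAG needs 5 cycles, which is exactly the ILP limit the slide refers to.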
SLIDE 17

Example for Instruction Scheduling

  • Example: instruction scheduling for clustered (non-orthogonal instruction slots) VLIW => constraint: registers are local to certain FUs
  • Further constraints: FUs are typically specialized (non-interchangeable for execution of a certain operation)

[Figure: top, orthogonal instruction slots: eight FUs share one global register file. Bottom, non-orthogonal instruction slots, i.e. a clustered VLIW data path: clusters 1-4, each with two FUs (FU1/FU2 … FU7/FU8) and a local register file 1-4, coupled through an interconnection network. [Leu00]]

SLIDE 18

Example for Instruction Scheduling (cont'd)

  • Example for clustered VLIW: the data path of the TI C6201

[Figure: two symmetric clusters A and B; each has FUs L, S, M, D, its own register file, address/data buses, and cross paths X1/X2 into the other cluster's register file. [Leu00]]

  • Features:
    • Two symmetric clusters A and B, each with a 16x16 reg file and ALUs named L, S, M, D
    • Each FU is only capable of executing a subset of the instructions; the delay for execution may be 1 unit or more (most instructions have a 1-unit delay, i.e. delay slots)
    • FUs work mainly on their local reg file; in one instr. cycle at most one FU may read from the other reg file (either copied or directly consumed)
  • Problem definition …

SLIDE 19

Combined Partitioning and Scheduling with SA

  • Simulated Annealing for partitioning and a list scheduler for scheduling [Leu00]
  • Other approaches: fix the partitioning first and schedule afterwards, …

  algorithm PARTITION
  input:  DFG G with N nodes;
  output: P: array [1..N] of {A, B};   // partitioning
  begin
    temp := 10;
    P := RANDOMPARTITIONING();
    mincost := LISTSCHEDULE(G, P);
    while temp > 0.01 do
      for i = 1 to 50 do
        r := RANDOM(1, N);
        P[r] := (P[r] = A) ? B : A;    // move node r to the other cluster
        cost := LISTSCHEDULE(G, P);
        delta := cost - mincost;
        if delta < 0 or RANDOM(0,1) < exp(-delta/temp) then
          mincost := cost;             // accept the move
        else
          P[r] := (P[r] = A) ? B : A;  // reject: undo the move
        end if
      end for
      temp := 0.9 * temp;
    end while
    return P;
  end algorithm

[Leu00]
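The loop above runs as-is once the two subroutines are supplied. In the following sketch the list scheduler is stubbed by a cost function that merely counts cross-cluster data transfers on a made-up five-node DFG (so this is a simplified rendering of the slide's algorithm, not [Leu00]'s implementation):

```python
# Simulated-annealing cluster partitioning with the slide's cooling schedule.
import math
import random

def anneal(n, schedule_cost, temp=10.0, inner=50):
    part = [random.choice("AB") for _ in range(n)]        # RANDOMPARTITIONING
    mincost = schedule_cost(part)
    while temp > 0.01:
        for _ in range(inner):
            r = random.randrange(n)
            part[r] = "B" if part[r] == "A" else "A"      # move node r
            cost = schedule_cost(part)
            delta = cost - mincost
            if delta < 0 or random.random() < math.exp(-delta / temp):
                mincost = cost                            # accept
            else:
                part[r] = "B" if part[r] == "A" else "A"  # undo
        temp *= 0.9
    return part, mincost

# toy DFG as edges (producer, consumer); the "list scheduler" is stubbed
# by a cost that charges each cross-cluster data transfer one unit
edges = [(0, 2), (1, 2), (2, 3), (3, 4), (1, 4)]
cost = lambda p: sum(p[u] != p[v] for u, v in edges)
part, c = anneal(5, cost)
print(part, c)
```

On this tiny connected graph the annealer reliably ends with all nodes in one cluster (zero transfers); with a real list scheduler as the cost function, keeping both clusters busy becomes worthwhile and the optimum is a genuine split.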

SLIDE 20

Results due to Scheduling/Partitioning

  • Representative results for TI C6201 code compared to the standard compiler approach
  • Shown is the normalized execution time decrease for various applications

[Figure: bar chart (normalized execution time, scale 10-100) for the benchmarks dct, idct, jpeg, lattice, h263, fft1, fir, viterbi, test, fft2, iir. [Leu00]]

SLIDE 21

Register Allocation

(Roadmap: register allocation, the next architecture-dependent step in the code generation flow.)

SLIDE 22

What is Register Allocation?

  • Assign variables to registers, thereby taking into consideration the architectural peculiarities (see later) of an embedded processor
  • Some definitions:
    • DFT = Data Flow Tree: the special case of a DFG that is
      a) connected, and
      b) all nodes have a fanout <= 1
    • IR = Intermediate Representation:
      • Typically a three-address code using auxiliary variables
      • Also, high-level constructs like if-then-else etc. are replaced by jumps/returns
      • Advantage of IR: independent of the source language
    • Basic block: a sequence of IR instructions; no jumps in/out except at the beginning or end of the sequence

SLIDE 23

Register Allocation: DFG representation

Source:

  int f (int a, int b, int c)
  {
    int x, y;
    x = a + b - 3 * c;
    if (x > 10) {
      y = x >> a + b - c;
    } else {
      y = x >> b - 10 * c;
    }
    return y;
  }

IR (three-address code, split into basic blocks):

  int f (int a, int b, int c)
  {
    int x, y;
    int t1, t2, t3, t4, t5, t6, t7, t8, t9, t10;
    /* basic block B1 */
    t1 = a + b;
    t2 = 3 * c;
    t3 = t1 - t2;
    x = t3;
    t4 = x > 10;
    if (t4) goto L1;
    /* basic block B2 */
    t8 = 10 * c;
    t9 = b - t8;
    t10 = x >> t9;
    y = t10;
    goto L2;
    /* basic block B3 */
  L1:
    t5 = a + b;
    t6 = t5 - c;
    t7 = x >> t6;
    y = t7;
    /* basic block B4 */
  L2:
    return y;
  }

[Figure: basic block B1 represented as a DFG: LOAD a and LOAD b feed "+"; LOAD c and the constant 3 feed "*"; both results feed the subtraction (x), which is compared "> 10" and controls "goto L1". Variables a, b, c are loaded from memory before operations can take place. [Leu00]]

SLIDE 24

Register Allocation: problem

  • DSPs typically have special-purpose registers (which can only be utilized for certain operations). Advantages:
    • Special-purpose registers enable some kind of pipelining of the data path and thereby reduce combinational delay -> higher clock frequency
    • For processors with ILP: multiple read/write ports can be avoided
    • The instruction word can be reduced since registers are addressed implicitly
  • Data flow graphs (DFGs) are often used to represent expressions
  • A major problem: register allocation for common subexpression elimination

[Figure: TI C25 datapath: the data bus feeds a multiplier with the special-purpose registers TR (holds the left multiplier input) and PR (stores the result of the multiplication), plus an ALU with accumulator ACCU and memory MEM. [Leu00]]

SLIDE 25

Register Allocation for CSEs

  • The figure shows a DFG split into DFTs (triangles)
  • Typically, each DFT would receive a separate register allocation and would store the CSE in memory, i.e. definitions and uses have to be mapped to certain locations
  • This mapping is most likely different when viewing all DFTs at once instead of mapping for one certain DFT at a time only
  • A CSE should probably also not reside in special-purpose registers, because these might be overwritten
  • But storing in memory is too expensive: execution time, power, code size etc.
  • Example: next slide …

[Figure: DFTs T1, T2, T3; register R1 holds a CSE defined in T1, and routes 1-3 carry its value via registers R2 and R3 to the uses u2 (in T2) and u3 (in T3).]

SLIDE 26

Register allocation for CSEs: example

Possible instructions according to the data path of the TI C25: "lac", "addk", "add", "pac", "apac", "mpy", "lt", "sacl", "spl"

[Figure: example DFG: load a and load b feed "*"; the product and load c feed a "+" whose result is stored to a; the product and the constant 42 feed a second "+" whose result is stored to b. The product a*b is the CSE.]

Two implementations of the example DFG; the left one needs one more operation:

  lt   a          lt   a
  mpy  b          mpy  b
  spl  temp       pac
  pac             add  c
  add  c          sacl a
  sacl a          pac
  lac  temp       addk 42
  addk 42         sacl b
  sacl b

  • That the CSE can stay in PR (right version) is due to the special structure of this DFG!
  • The decision where to store a CSE cannot be made locally -> it depends on the whole DFG
  • => DFT-by-DFT code generation is of little use for embedded processors like DSPs

SLIDE 27

Problem Definition for CSE Register Allocation

  • Example: a DFG separated into three DFTs, T1, T2, T3
  • R1 stores a CSE defined in T1
  • R2, R3 store the values of the CSEs in variables u2, u3, used in T2 and T3, respectively
  • The total cost of the complete DFG needs to be minimized
  • That may lead to a different allocation than the one given in the figure!

[Figure: as on the previous slide: DFTs T1, T2, T3 with registers R1, R2, R3 and routes 1-3 for the CSE values u2, u3.]

SLIDE 28

SA Algorithm for Common Subexpressions

  • Exhaustive search is infeasible
  • Therefore: use SA … see [Leu00]

  algorithm CSE_REGISTERALLOCATION
  input:  DFG G with CSEs;
  output: sequential assembly code for G;
  begin
    G' := DECOMPOSE(G);              // split G into DFTs
    A[1..k] := m + 1;                // assign all CSEs to memory
    current_cost := INITIALCOST(G');
    temp := 50;
    while temp > 0.1 do
      for count = 1 to 10 do
        DOMODIFICATION(G', A);
        schedule := TOPOLOGICALSORT(G');
        new_cost := 0;
        for all trees T in schedule do
          new_cost += COVERCOST(T);
        end for
        new_cost += ADDRCOST(schedule);
        if REGISTERCONFLICT(schedule) then
          new_cost := infinity;
        end if
        delta := new_cost - current_cost;
        if delta < 0 or RANDOM(0,1) < exp(-delta/temp) then
          current_cost := new_cost;
        else
          UNDOMODIFICATION(G', A);
        end if
      end for
      temp := 0.9 * temp;
    end while
    for all trees T in schedule do
      EMITASSEMBLYCODE(T);
    end for
  end algorithm

SLIDE 29

Memory address computation in DSPs

(Roadmap: address code generation, the last architecture-dependent step in the code generation flow.)

SLIDE 30

Address Generation for DSPs

  • Special address generation unit (AGU) for parallelizing computation and address generation
  • Supports:
    • Immediate load
    • Immediate modify
    • Auto-increment
    • Auto-modify
  • Differing for diverse DSPs:
    • # ARs (address registers)
    • # MRs (modify registers)
    • r: the auto-increment range
  • => these can be used as parameters in a retargetable compiler

[Figure: AGU: an immediate constant c, an AR pointer p into the address register file, and an MR pointer q into the modify register file feed an adder/subtractor (+/- d) that produces the effective address.]

SLIDE 31

Offset Assignment: problem

  • Problem: a restricted # of addressing modes in DSPs
  • Typically less of a problem at assembly level
  • But: high-level languages have the concept of function calls and local variables, which only exist during execution of a particular function
  • => these need to reside on a stack, because a DSP has a very limited amount of registers
  • Left example (a): stack layout during execution: function parameters (pushed by the calling function); return address; local variables; spill space
  • Right example (b): a floating frame pointer FP, relatively addressed

[Figure: two stack frames in the direction of stack growth. a) SP-based: variables x and y, return addr, local variables, and spill space below the stack pointer SP. b) FP-based: a frame pointer FP at address n is moved through the frame (FP += c reaches address n+c) while SP marks the frame top.]

SLIDE 32

Offset assignment – problem (cont’d)

  • All data needs to be addressed relative to SP (Stack Pointer)
  • Helpful: an addressing mode "SP + offset"
  • But: not always available on DSPs!
  • Solution: an additional frame pointer FP (should move through the stack frame)
  • How does it work?
    1. Initialize FP with the effective address of the first local variable
    2. For each access, increment or decrement FP
    3. Therefore: keep FP in an address register AR (if available)
  • What can be done?
  • Cannot: positions of function parameters and return address are fixed
  • Can: positions of local variables can be switched by the compiler
  • => this can be used for optimization in such a way that FP updates can be conducted with AGU auto increment/decrement (this is fast since there is dedicated hardware for it)
  • Note: FP updates that need to add/subtract a number > 1 are expensive
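The three steps above can be sketched under an assumed AGU model (function and variable names are illustrative, not from the slides): steps of ±1 map to the fast auto increment/decrement, while larger steps need an explicit, expensive FP update.

```python
# Sketch (assumed AGU model): generate FP update instructions for an
# access sequence, given a layout of the local variables.
def fp_updates(layout, accesses):
    addr = {v: i for i, v in enumerate(layout)}  # variable -> frame offset
    code = [f"FP = {addr[accesses[0]]}"]         # 1. initialize FP
    pos = addr[accesses[0]]
    for v in accesses[1:]:                       # 2. move FP per access
        step = addr[v] - pos
        if step == 1:
            code.append("FP++")                  # AGU auto increment (free)
        elif step == -1:
            code.append("FP--")                  # AGU auto decrement (free)
        else:
            code.append(f"FP += {step}")         # expensive: |step| > 1
        pos = addr[v]
    return code

print(fp_updates(["x", "y", "z"], ["x", "y", "z", "x"]))
# ['FP = 0', 'FP++', 'FP++', 'FP += -2']
```

Because the compiler may reorder the local variables (point "Can" above), it can choose a layout that turns most FP updates into the free ±1 form.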

slide-33
SLIDE 33

Offset assignment example

  • AR: address register
  • Goal: assign variables to registers such that address generation can mostly be conducted by using auto increment/decrement
  • Left: only 4 out of 13 operations are auto-increment/decrement since the assignment was simply done in alphabetical order
  • Right: optimized
  • Note: the optimization comes for free! It does not need any additional reg.

Access sequence: b d a c d a c b a d a c d

Layout M1 (alphabetical: a=0, b=1, c=2, d=3):
AR = 1; AR += 2; AR -= 3; AR += 2; AR++; AR -= 3; AR += 2; AR--; AR--; AR += 3; AR -= 3; AR += 2; AR++   =>  C(M1) = 9

Layout M2 (optimized: c=0, a=1, d=2, b=3):
AR = 3; AR--; AR--; AR--; AR += 2; AR--; AR--; AR += 3; AR -= 2; AR++; AR--; AR--; AR += 2   =>  C(M2) = 5

[Leu00]
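The cost measure C can be reproduced with a small sketch (not from the slides), assuming C counts the initial address load plus every AR update by more than ±1, since auto increment/decrement comes for free:

```python
# Sketch: cost of an offset assignment under the AGU auto inc/dec model.
def offset_cost(layout, accesses):
    addr = {v: i for i, v in enumerate(layout)}  # variable -> memory offset
    cost = 1                                     # initial "AR = ..." load
    pos = addr[accesses[0]]
    for v in accesses[1:]:
        if abs(addr[v] - pos) > 1:               # cannot use auto inc/dec
            cost += 1
        pos = addr[v]
    return cost

seq = "b d a c d a c b a d a c d".split()
print(offset_cost(list("abcd"), seq))  # layout M1 (alphabetical) -> 9
print(offset_cost(list("cadb"), seq))  # layout M2 (optimized)    -> 5
```

The two calls reproduce the slide's values C(M1) = 9 and C(M2) = 5.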

slide-34
SLIDE 34

Offset Assignment

  • – access graph model –

[Figure: access graph for the sequence b d a c d a c b a d a c d. Vertices a, b, c, d; an edge weight counts how often two variables are accessed consecutively: w(a,d) = 4, w(a,c) = 3, w(c,d) = 2, w(a,b) = w(b,c) = w(b,d) = 1. A maximum-weight path, here c-a-d-b with weight 8, directly gives the optimized offset assignment (layout M2).]
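Building the access graph from the access sequence can be sketched as follows (the helper name is illustrative): the weight of an edge {u, v} is the number of adjacent accesses of u and v.

```python
# Sketch: construct the access graph's edge weights from an access sequence.
from collections import Counter

def access_graph(accesses):
    weights = Counter()
    for u, v in zip(accesses, accesses[1:]):  # all adjacent access pairs
        if u != v:
            weights[frozenset((u, v))] += 1   # undirected edge {u, v}
    return weights

g = access_graph("b d a c d a c b a d a c d".split())
print(g[frozenset("ad")])  # 4: the heaviest edge, so a and d should be adjacent
```

The heavier an edge, the more is gained by placing its two variables at neighboring offsets, which is why the offset assignment corresponds to a maximum-weight path through this graph.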
slide-35
SLIDE 35

Offset Assignment

  • – solving using GA –

[Figure: chromosome encoding. A chromosome is a sequence of genes g1 … gn-1+k-1 holding a permutation of the variables (e.g. v3 v7 v9 v8 v4 v5): the first n-1 genes encode the memory layout, the remaining k-1 genes partition the variables among the address registers AR1 … ARk.]

slide-36
SLIDE 36

Offset Assignment

  • – solving using GA –

[Figure: crossover example in four steps. Parents A = (4 5 6 1 8 9 2 7 3) and B = (8 7 3 4 2 6 5 1 9) exchange segments at two crossover points (steps 1-2), and the remaining positions are repaired so that valid permutations result (steps 3-4), yielding offspring A' and B'.]

slide-37
SLIDE 37

Offset Assignment

  • – solving using GA –

algorithm OffsetAssignment
input:  variable set V, access sequence S, AGU parameters k, m, r
output: offset assignment
begin
    GenerateInitialPopulation();
    EvaluateFitness();
    for G1 generations do
        SelectParents();
        GenerateOffspring();
        MutateOffspring();
        EvaluateFitness();
        ReplacePopulation();
        if no max fitness improvement in the last G2 generations then
            break  // exit loop
        end if
    end for
    return best individual;
end algorithm
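The GA loop can be sketched for the simplest case, a single address register (k = 1), where a chromosome is just a variable permutation and fitness is the negated offset cost. Population size, generation count, and the swap-mutation operator below are illustrative assumptions, not the parameters from the slides:

```python
# Sketch: a minimal GA for offset assignment with one address register.
import random

def cost(layout, accesses):
    addr = {v: i for i, v in enumerate(layout)}
    c, pos = 1, addr[accesses[0]]          # initial load + non-unit updates
    for v in accesses[1:]:
        if abs(addr[v] - pos) > 1:
            c += 1
        pos = addr[v]
    return c

def ga_offset_assignment(variables, accesses, pop=20, generations=50, seed=0):
    rng = random.Random(seed)
    population = [rng.sample(variables, len(variables)) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=lambda lay: cost(lay, accesses))
        parents = population[: pop // 2]   # select the fittest half (elitist)
        children = []
        for p in parents:                  # mutate: swap two random positions
            child = p[:]
            i, j = rng.sample(range(len(child)), 2)
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = parents + children    # replace population
    return min(population, key=lambda lay: cost(lay, accesses))

seq = "b d a c d a c b a d a c d".split()
best = ga_offset_assignment(list("abcd"), seq)
```

For this small example, exhaustive search shows the optimum cost is 5 (e.g. the layout c a d b), so the GA has an easy target; the point of the GA is that it also scales to variable counts where exhaustive search is infeasible.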

slide-38
SLIDE 38

IR Optimizations

[Figure: compiler flow. Source code → source-level transformations → IR optimizations → code selection → instruction scheduling → register allocation → address code generation → assembly code; the last four are the architecture-dependent steps.]

slide-39
SLIDE 39

IR Optimizations

  • Independent from the architecture
  • Some examples, for loops:
    • Loop permutations
    • Loop fusion
    • Loop fission
    • Loop unrolling
    • Loop tiling
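As an illustration of one of these (not from the slides), loop fusion merges two loops over the same index range into one, so each element of `a` is touched only once per traversal, which can improve data locality:

```python
# Sketch: loop fusion. Both versions compute the same result; fusion is
# legal here because the second loop only reads a[i] produced by the same
# iteration i of the first loop.
def unfused(a, b):
    for i in range(len(a)):   # loop 1: scale a
        a[i] = 2 * a[i]
    for i in range(len(a)):   # loop 2: accumulate into b
        b[i] = b[i] + a[i]

def fused(a, b):
    for i in range(len(a)):   # fused loop: one traversal, a[i] stays "hot"
        a[i] = 2 * a[i]
        b[i] = b[i] + a[i]
```

Loop fission is the inverse transformation, splitting one loop into several, e.g. to shrink the working set of each loop.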

slide-40
SLIDE 40

Summary

  • Architecture-dependent code generation consists of four major steps that might be interwoven:
    - code selection,
    - instruction scheduling,
    - register allocation,
    - address code generation
  • Many of these tasks are computationally infeasible => heuristics are needed to solve/optimize them
  • Embedded processors have specific architectural features that need to be exploited by a compiler
  • Retargetable compilers are useful for families of processors in order to produce high-quality code for embedded processors
  • With new embedded processors emerging continuously, this area is subject to heavy research activities

slide-41
SLIDE 41

References and Sources

[Leu00] R. Leupers; Code Optimization Techniques for Embedded Processors, Kluwer, 2000.
[Marw03] P. Marwedel; Embedded Systems Design, Kluwer Academic Publishers, 2003.
[LeHe02] H. Lekatsas, J. Henkel, V. Jakkula; "1-cycle code decompression circuitry for performance increase of Xtensa-1040-based embedded systems", Proceedings of the IEEE 2002 Custom Integrated Circuits Conference, pp. 9-12, 12-15 May 2002.
A. Hoffman et al.; "A Novel Methodology for the Design of Application-Specific Instruction Set Processors (ASIPs) Using a Machine Description Language", IEEE Trans. on CAD, Vol. 20, No. 11, Nov. 2001.