
Impact of VLSI Scaling and Synthesis on Multimedia Processor Cores

TOMÁS BAUTISTA AND ANTONIO NÚÑEZ

CAD Division, IUMA – Applied Microelectronics Research Institute. University of Las Palmas de Gran Canaria. E-35017 Las Palmas de Gran Canaria, Canary Islands, Spain. E-mail: bautista@cma.ulpgc.es

Abstract: In this paper we present experimental results obtained during the modelling, design and implementation of a full set of versions of a SPARC v8 Integer Unit core aimed at embedded applications in digital media products. VHDL has been the description language, Synopsys tools have been used for logic synthesis, and Duet Technologies’ Epoch has been used for the physical layout of the final circuits. The designs have been mapped to 0.50 and 0.35 µm three-metal-layer processes in order to study the impact of VLSI scaling on SPARC microarchitectural features. The quantitative results given characterize suitable points in the design space. They show how much microarchitecture, design, datapath granularity and module decisions affect performance and cost functions. Design space exploration down to physical layouts is made possible by modelling techniques based on configurable VHDL descriptions.
I. INTRODUCTION

As the feature size of technological processes approaches deep submicron dimensions, and as metal layers are no longer a bottleneck, the integration density available on chip is becoming extremely high. The natural trend is to take this density for free. However, deeper submicron technologies also bring along new problems, especially related to wire delay and power consumption. A new design paradigm has emerged: the synthesis of large cores (in-house proprietary or outsourced intellectual property), ultimately building very large and complete systems on a chip. This paradigm also calls for a synthesis approach relying on architectural, logic and layout synthesis tools. This is in contrast to mainstream design approaches relying on full-custom cores.

Complexity issues in processor architectures related to feature size evolution have been studied, among others, by Palacharla et al. [fro]. They have studied the tradeoff between hardware and clock speed from an architectural point of view by using key pieces of full-custom layouts suitable for estimating the clock cycle in superscalar processors, for geometries ranging from 0.8 µm to 0.18 µm. The layouts for the 0.35 and 0.18 µm processes were obtained by appropriately shrinking the layouts for the 0.80 µm process. We set the goal of conducting a similar study but under a “synthesis-based approach” to the design rather than a “full-custom-based approach”, and of also analyzing the effect of the different synthesis steps on the various levels of description and design options of an architecture. In our case we developed complete processor layouts (over one hundred implementations) for 0.50 and 0.35 µm technologies. A study including the 0.18 µm process is also underway.

One of the industrial fields demanding this dense design paradigm is digital media processing, especially for medium and low bit-rate video decoding. In the digital media domain, processor workload is dominated by video processing tasks [Pir98], [Ack94]. In order to cope with this load, in particular for high bit-rate video coding, high-end architectures are being conceived and developed using superscalar, vector, and parallel processors [RS98], [Ses98], [Pur98], [RK96].

A. Limits of specialization

The vector-microprocessor paradigm has been explored in depth in [Asa98] as a result of assessing the vectorizability of SPECint programs. A quantitative analysis of extending the short-vector microprocessor approach to long-vector microprocessors has been given recently in [LS98], demonstrating a clear performance advantage for multimedia applications over simple scalar and superscalar processors, up to a three-fold improvement factor. It also shows layout-area costs which can be up to one order of magnitude higher than those of simple scalar processors with multimedia extensions.

Another related architectural trend is represented by VLIW approaches aimed at finding and automatically generating efficient architectures through processor specialization. Relying on the power of the highly optimizing HP Labs Cambridge C Compiler, quantitative results reported in [FFD96] show performance gains for high-cost, tightly targeted VLIW architectures, but also show dramatic performance losses in low and medium cost VLIW architectures if custom-fit processors of too narrow a scope are derived from the application. After running 5730 experiments with 191 VLIW architectures tailored to fit, in a wide range, 10 multimedia benchmarks, the authors conclude: “If and when the cost of individual chip design becomes very much lower than it is today, it will make a lot of sense to build chips for the narrowest of embedded applications. Today, that seems like a dangerous route to attempt”.

In recent years the advantages of standard, mainstream, programmable solutions have also been highlighted. These solutions rely on standard processors available as cores for embedded systems. This approach helps in software development since the cores are based on well established processor architectures and efficient optimising compilers. Process technology advances are also bringing these processors to speed marks that make software solutions ever more attractive.


B. Mainstream processors with architecture enhancements

As a combination of these trends, general-purpose processor architectures are evolving by introducing family derivations that include better support for domain-specific problems. One of these domains is signal processing, and in particular the digital media domain, including audio, video, graphics and signal processing. Variations and derivations from established architectures include additional scalar units, additional extended precision accumulators, multiply and accumulate (MAC) units, additional floating point (FP) units, different data type support and formats in registers and register files (for instance to support limited SIMD ‘short-vector’ operations), and architectural extensions to the instruction set supported by those units. In the digital media domain, examples of this are provided by the MIPS, SPARC, HP PA-RISC, HITACHI and ARM processor families, among others. A comparison of the state-of-the-art features offered by these processors is given in <http://www.mips.com/Documentation/isa5_tech_brf.pdf>, which also includes the Intel Pentium family MMX extension. See [BJER98] as well.

C. Processor core optimization

Pentium and HP PA-RISC are proprietary technologies. MIPS and ARM are open technologies in the sense that the architecture is intended to be licensed to different manufacturers. SPARC is an open architecture based on architecture compliance and design certification. MIPS has proven its success in the digital media domain, for low and medium end applications. SPARC architectures have proven an equivalent potential to MIPS architectures in their respective evolution.

Optimizing the microarchitecture and implementation parameters of these architectures will remain a key aspect for success, for given technologies. Embedded systems and many low-end general purpose systems are very cost sensitive. It is critical to find ways to reduce cost while increasing performance. Performance must not only be measured in cycles/task, often optimised at pipeline and architectural levels, but must also include the clock frequency and power consumption achieved. Physical aspect ratio is also essential for optimising embedded cores together with the application-specific memory, coprocessor or peripheral device modules that have to fit each other on-chip.

D. This research

The purpose of our research has been to study, for the digital media domain, the impact on real estate, clock speed, power consumption and other physical implementation functions of the following variables and parameters:

• ISA extensions including MAC, FP, register types, and special operators (e.g. vector processing elements as accelerators)
• Microarchitectural and design decisions for a given ISA
• Control of physical design by tool management, with emphasis on module and datapath granularity decisions at synthesis time and at placement and routing time
• Technological scaling (in particular for deeper submicron technologies)

With this aim the open SPARC architecture has been chosen. The stages conducted have been:

• Development of VHDL versions of the SPARC v8 ISA with IU, FPU and VIS extensions
• Development of a VHDL-based synthesis and compilation technique with easily configurable VHDL options, for flexible synthesis of experimental versions of each design and its parameterized variations

This paper describes relevant microarchitectural modelling and design decisions adopted in the implementation of a full set of versions of the SPARC v8 Integer Unit (IU) [SPA91] as processor cores for multimedia embedded systems. Complete results including FPU, VIS extensions and other blocks are given in [Bau99]. Section II comments on pertinent SPARC features. Section III addresses the strategy and technique used in modelling these processor cores. Section IV explains some of the most relevant decisions made in the microarchitecture design. Section V, under Experiments, presents different microarchitectural alternatives that can be evaluated more precisely. The paper ends with section VI, where some results are given and discussed, and a final section summarizes some conclusions.

II. SPARC ARCHITECTURE, COMPILERS AND KERNELS

SPARC is a RISC-style instruction-set architecture that defines the instructions, register structure and data types for the integer unit (IU) and an IEEE 754 standard floating point unit. It allocates opcodes and defines a standard interface for a coprocessor unit. It assumes a linear 32-bit virtual address space for user application programs [SPA91], [New91].

The choice of SPARC as a core stems from our purpose of using it in low-cost multimedia applications [BMCN96]. With these applications in mind, there were several reasons for making this decision:

1. Architecture-related reasons:

• SPARC is an open and scalable architecture, which leaves the designer a great degree of freedom. There are many possible design alternatives depending on the performance desired.
• SPARC is very well suited for real-time and embedded applications. There are even specific extensions to the basic SPARC specification developed especially for using SPARC cores in embedded systems.
• Another interesting feature for using SPARC in embedded systems is that the architecture presumes the presence of a windowed register file. This is attractive in applications where processes of different nature coexist and where fast trap handling is required, since the windowed register file allows fast context switching.

2. Compiler and kernel related reasons:

• For the processor cores in embedded systems, software must be resident in order to handle proper start-up mechanisms, device intercommunication, interrupt management, and main process execution. The GNU C compiler [Sta91], [Dar96] has been used to produce machine code for the processor.
• There are software development tools already created for SPARC that can help in the design, efficient compilation and debugging of the software to be run on these cores.
• The introduction of Windows CE opens the area to new multimedia applications for hand-held devices based on non-Intel processors. There is extensive support for the SPARC platform from other operating systems, including real-time ones.

III. MODELLING STRATEGY FOR THE PROCESSOR CORE

For a processor core, the first partition to be made corresponds to the traditional view of a processor as split into a Data Path and a Control Unit.

In order to better support this structural partition and view of the model, extensive use is made of a graphical schematic capture environment. The graphical environment chosen on this occasion has been Synopsys’ SGE, but any other graphical environment could be equally useful.

As shown in [BMCN97], a centralized alternative is better suited than one based on distributed FSMs because it allows better control of the instruction set in a flexible design workbench. This strategy offers additional significant benefits, resulting in an environment well suited for making any needed modifications to the instruction set as far as the decoding logic is concerned.

Modifying the instruction set can imply the modification, removal or addition of specific processor resources in the architecture. This operation is made easier by the facilities provided by the graphical environment and the features of the hardware description language. This also applies to implementation and architectural decisions on localized elements of the architecture. Other parameterization options are also possible thanks to the capabilities of the description language, in which, by means of constants, specific behaviour options can be enabled or disabled in the synthesis steps.

The whole processor is defined by an RTL-like block diagram. The core is split into smaller units, and these units are subdivided again if necessary. This top-down approach reaches its limit when encountering simple elements or elements whose function is very design-specific and well-defined. At the bottom level these simplest building units are easy to maintain, debug and modify if desired. This way precise and detailed control of every element can be achieved without losing visibility of its interrelations with the rest of the elements in the core.

With instances of these elements a graphical model of every hierarchical level of the core can be built. With this model, any desired microarchitectural modification can be easily achieved by reorganizing the structure of the design, by modifying the descriptions of specific elements, or both. This way a complete RTL description of the new modified core can be readily obtained. The debugging process can then be started from this description, and any microarchitectural decisions are easily re-evaluated. Since the building blocks have been developed following the standard IEEE 1076.3 VHDL subset for synthesis, the resulting model is completely synthesizable.

The model of the Data Path Unit is based on the RTL view given in fig. 1. The pipeline registers are always edge triggered.

[Figure: block diagram with the MUX-A, MUX-B, ALU, Shifter, Aligner, Special Registers and StReg blocks, operands OP 1 and OP 2, external data, result and data I/O, across the Decode, Execution and Write-back stages.]

Fig. 1. IU Data Path architecture

The Control Unit is split into two subunits. The first one manages the decoding and dispatching of instructions for execution, as well as sequencing. The second one is the heart of the machine, with the central Finite State Machine (FSM), which rules all the actions performed, and the elements that produce the proper control signals for the different elements of the whole architecture. These signals depend on the FSM, the Instruction Register contents, traps management, external signals and other elements.

• The organization of the first subunit, as far as sequencing management is concerned, can be critical under some circumstances, and its design and final mapping certainly are. A structure which uses only one adder for the computation of the next PC produces slow versions. Local optimizations for speeding up this operation could be achieved either by working with synthesis constraints or by writing a description close to the desired result. With the latter approach, and in the case of this IU v8 core, the structure of fig. 2 has been chosen. It is built with an adder and an incrementer. The key to the structure lies in the simultaneous computation of the two possible values to be assigned next to the PC. Then, depending on the sequencing decisions made at a certain step, the chosen value is given to the PC and the unchosen value is saved in a register. If the value previously given to the PC was wrong, the previously unchosen value is recovered one cycle later.

[Figure: block diagram with two adders and the signals PCFetch, Offset, PCDecode, PCExec, IncPC@t0, NextPC@t2, Clk and Saved Alt NPC.]

Fig. 2. Removing the NextPC calculation from the critical path

• The second subunit, which controls the whole architecture, contains the critical elements that produce the control signals. The most critical signals, on which most of the effort needs to be spent, are those used in the same stage in which they are generated. By making this effort we can keep the critical path out of the decode stage, or at least balance the delay of this stage with the delays found in the other stages of the architecture.
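The two-candidate sequencing structure described for the first subunit can be sketched in software as follows. This is only a minimal behavioural model in Python, not the VHDL of the core: the class name `Sequencer`, the field `saved_alt`, the 4-byte step, and the omission of the delay slot and of the PC pipeline registers are all assumptions made for illustration.

```python
from dataclasses import dataclass

WORD = 4             # assumed instruction width in bytes
MASK32 = 0xFFFFFFFF  # keep addresses within 32 bits


@dataclass
class Sequencer:
    """Toy model of a next-PC unit that computes both candidates at once."""
    pc: int
    saved_alt: int = 0  # register holding the candidate that was not chosen

    def step(self, offset: int, take_branch: bool) -> int:
        # Both candidates are produced in the same cycle: the incrementer
        # gives the sequential address, the adder gives the branch target.
        sequential = (self.pc + WORD) & MASK32
        target = (self.pc + offset) & MASK32
        if take_branch:
            chosen, other = target, sequential
        else:
            chosen, other = sequential, target
        self.saved_alt = other  # keep the unchosen value for a later recovery
        self.pc = chosen
        return chosen

    def recover(self) -> int:
        # A wrong choice is undone by loading the saved alternative.
        self.pc = self.saved_alt
        return self.pc


seq = Sequencer(pc=0x1000)
print(hex(seq.step(offset=0x40, take_branch=True)))  # 0x1040
print(hex(seq.recover()))                            # 0x1004
```

Because neither candidate waits for the sequencing decision, the decision only drives a final selection rather than an addition, which is the point the text makes about removing the next-PC calculation from the critical path.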

IV. MICROARCHITECTURAL DECISIONS
A. Branches

Branches may cause problems for an overlapped instruction execution scheme since they can change the control flow. These instructions account for a significant share of the instructions executed by the processor (see fig. 3).

[Figure: bar chart of the ratio (%) of conditional and unconditional branches over all instructions executed, for GCC and LaTeX (general purpose) and for MPEG and xv (application specific).]

Fig. 3. Ratio of branch instructions executed in programs

Therefore, the way these instructions are executed can have an important impact on the overall performance of a processor core, so we give more details about the decisions made for branch processing.

Taking fig. 4 as a reference, if a branch instruction is fetched in cycle ‘b’, it will usually be decoded in cycle ‘c’. This is true when the previous instruction does not need further steps to complete. In this cycle the instruction in the delay slot is fetched, and the target address should be calculated in order to fetch the proper instruction in cycle ‘d’. The only problem is that instruction 1, fetched in cycle ‘a’, could be modifying the condition codes close to the end of cycle ‘c’, which could delay the decision on whether or not to take the branch. This delay has turned out to be about 2 ns in the 0.35 µm technology. The branch decision then lies on the critical path, because the final condition codes depend on the operation carried out in the ALU.

If prediction is introduced, treating all conditional branches as if they were going to be taken, the original cycle time of 20 ns is reduced to 18 ns. In this case the misprediction figure is 5.60% for xv, with 3.66% more cycles, and 5.90% for mpeg, with 4.40% more cycles. In the worst case, this gives a 7% increase in executed cycles, but a 1% decrease in execution time. That means that the version with prediction will run a little faster. These figures turn out to be more significant:

• as versions with prediction with cycle times down to 11 ns have been achieved,
• as a slightly better prediction scheme can be built by studying fig. 5, in which forward conditional branches are predicted as untaken and backward conditional branches as taken,
• if only domain-specific applications are considered.

[Figure: pipeline timing diagram over cycles a to h, showing the fetch (F.), decode (Dec.), execution (Ex.) and write-back (WB.) stages of Instruction 1, Instruction 2 (Bicc), Instruction 3 (delay slot), the predicted instruction (D) and the post-prediction instruction (D+1).]

Fig. 4. Timing of conditional branches

[Figure: bar chart (0% to 60%) of conditional branch classes (backwards taken, backwards untaken, forwards taken, forwards untaken) for gcc, LaTeX, xv and MPEG.]

Fig. 5. Ratio of classes of conditional branches execution

Under these circumstances, conditional branches that are mispredicted and need recovery amount to 3.94% for xv and 4.45% for mpeg. In the worst case, this gives a 2.27% increase in executed cycles, but a 14% decrease in execution time, which represents, following the Amdahl’s Law expression, a speedup of 1.16.
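The static prediction rule and the speedup arithmetic quoted above can be checked with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and the rule is expressed on the sign of the branch displacement, which is one simple way of telling backward from forward branches.

```python
def predict_taken(displacement: int) -> bool:
    """Static rule: backward branches (negative displacement) are predicted
    taken, forward branches are predicted untaken."""
    return displacement < 0


def speedup(time_decrease: float) -> float:
    """Speedup that corresponds to a fractional decrease in execution time."""
    return 1.0 / (1.0 - time_decrease)


print(predict_taken(-32))         # True: a loop-closing backward branch
print(predict_taken(16))          # False: a forward branch
print(round(speedup(0.14), 2))    # 1.16
```

The last line reproduces the figure in the text: an execution time reduced by 14% is 0.86 of the original, and 1/0.86 rounds to 1.16.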

B. Bypass

A bypass mechanism is needed to avoid read-after-write hazards, i.e., an instruction operating on wrong operands because a previous instruction cannot write the new value of one of them in time. This bypass mechanism can be set:

1. from the end of the execution stage of the previous instruction to the decode stage (which is shown in fig. 1), or
2. from the beginning of the write-back stage of the previous instruction to the execution stage.

Both alternatives seem equivalent at first sight. However, in this architecture the operands are also needed for checking possible trap conditions. The trap management circuitry in this architecture is rather time consuming, so operands should be available as soon as possible. Therefore the first alternative has been chosen.
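As an illustration of the first alternative, the operand read in the decode stage can be modelled as below. This is a hypothetical sketch: the function `read_operand`, the dictionary register file and the `(destination, value)` tuple are assumptions for illustration only; the one architectural fact used is that SPARC register %g0 always reads as zero.

```python
from typing import Dict, Optional, Tuple


def read_operand(reg: int,
                 regfile: Dict[int, int],
                 ex_result: Optional[Tuple[int, int]]) -> int:
    """Operand read in decode, with a bypass from the end of execution.

    ex_result is (destination register, value) produced by the previous
    instruction, or None if that instruction writes no register.
    """
    if reg == 0:
        return 0  # %g0 always reads as zero
    if ex_result is not None and ex_result[0] == reg:
        return ex_result[1]  # forwarded value, not yet written back
    return regfile[reg]      # no hazard: the register file is up to date


regfile = {1: 10, 2: 20}
print(read_operand(1, regfile, (1, 99)))  # 99, bypassed from the ALU result
print(read_operand(2, regfile, (1, 99)))  # 20, read from the register file
```

With this arrangement the operand value is already known in decode, so it is available early for anything else that needs it, such as the trap condition checks mentioned above.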


C. TADDccTV and TSUBccTV instructions

SPARC defines two instructions that cause a trap when the operation on the two operands produces overflow. The overflow signal comes from the ALU; it has to be checked, and the trap it might produce has to be weighed against any other traps that could be requested in the same cycle. This is needed to comply with the architectural trap priorities. This means that the delay associated with the flow of data through the ALU is increased by trap checking and by sequencing management.

In order to avoid such a long path, these instructions have been made to occupy an extra cycle, so that in the first execution step the operation needed in the ALU is carried out and the overflow signal is stored in a special register. In the second execution step this previously stored overflow flag is checked for possible trap management, and the instruction is finally allowed to complete or not. This strategy reduces the cycle time.

This decision is made for SPARC compliance, and is application-domain specific. Many Digital Media algorithms do not use tagged arithmetic.
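The two-step execution can be sketched as follows for the tagged add. This is a behavioural illustration in Python with hypothetical function names, not the core's VHDL. It assumes 32-bit unsigned operand encodings and, following our reading of the SPARC v8 definition, treats a nonzero value in the two low-order tag bits of either operand as overflow in addition to ordinary arithmetic overflow.

```python
MASK32 = 0xFFFFFFFF


def sign_bit(x: int) -> int:
    return (x >> 31) & 1


def tagged_add(a: int, b: int):
    """Step 1: ALU operation, returning the result and the overflow flag
    that would be latched in the special register."""
    result = (a + b) & MASK32
    arith_ovf = sign_bit(a) == sign_bit(b) and sign_bit(result) != sign_bit(a)
    tag_ovf = ((a | b) & 0x3) != 0  # a nonzero 2-bit tag in either operand
    return result, (arith_ovf or tag_ovf)


def execute_taddcctv(a: int, b: int):
    result, latched_ovf = tagged_add(a, b)  # first execution step
    if latched_ovf:                         # second step: check stored flag
        return ("trap", None)               # instruction does not complete
    return ("ok", result)                   # instruction completes


print(execute_taddcctv(4, 8))           # ('ok', 12)
print(execute_taddcctv(1, 4))           # ('trap', None): nonzero tag bits
print(execute_taddcctv(0x7FFFFFFC, 4))  # ('trap', None): arithmetic overflow
```

Splitting the work this way means the trap check reads a registered flag instead of the live ALU output, at the price of the extra cycle listed for tagged instructions in table I.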

D. Auxiliary instructions

Several implementations of SPARC use internal auxiliary instructions [Cat91]. These are used when certain instructions need additional execution steps to complete. When one of these instructions is decoded, a new internal instruction is generated and fed into the processor to be executed next. This poses two problems:

• An additional bit has to be generated and later checked, as internal instructions need different management. For instance, if an internal instruction has an opcode which does not match any external instruction, and an instruction from outside comes into the processor with the same opcode, then this erroneous situation should be detected. The only way to achieve this is with an additional signal.
• Additional circuitry is needed for generating the new internal instructions.

One possible solution is to have additional bits identifying how many additional execution steps are needed. Additional registers for storing this information have to be included. With this alternative, when an instruction needs additional execution steps it is frozen in the instruction register, and the control signals derived from its decoding are generated depending on the execution step to be performed. At each of these decoding steps the information from the step-counting registers is used, and a new value for these counters is fed back to indicate which step should be performed next.

With this alternative, which has been adopted, the circuitry for building new internal instructions has been removed and just an additional bit has been needed. This is so because the maximum number of steps to be carried out in sequence is three. This decision also has the advantage that the models gain in clarity.
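The step-counting scheme can be illustrated with a small Python generator. The instruction class names and the function `control_sequence` are hypothetical; the numbers of additional steps are the ones listed in table I.

```python
# Additional execution steps per instruction class, as listed in table I.
EXTRA_STEPS = {
    "load": 1,
    "load_double": 2,
    "store": 2,
    "store_double": 3,
    "atomic_load_store": 2,
    "jump": 1,
    "tagged_with_trap_check": 1,
}


def control_sequence(instr: str):
    """Yield (instruction, step) pairs while the instruction stays frozen
    in the instruction register and the step counter advances."""
    remaining = EXTRA_STEPS.get(instr, 0)
    step = 0
    yield instr, step          # first decoding step
    while remaining > 0:       # same instruction, next step of its sequence
        step += 1
        remaining -= 1
        yield instr, step


print(list(control_sequence("store_double")))
# [('store_double', 0), ('store_double', 1), ('store_double', 2), ('store_double', 3)]
print(list(control_sequence("add")))
# [('add', 0)]
```

The decoder sees the same instruction word at every step and selects control signals from the pair (instruction, step), so no internal instruction ever has to be synthesized.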

E. CPI figures of the microarchitecture

For MPEG multimedia applications, an instruction mix of about 16.2% loads, 8.1% stores, 3.1% branches with wrong prediction and 72.6% other one-execution-step instructions (which include branches with right prediction) is observed. Based on table I, the contribution of loads, stores, branches, annulled instructions, loads with interlocks (assuming 50% of the loads cause an interlock with the next instruction), and jumps to the CPI, excluding cache misses, results in approximately 1.436 for our microarchitecture. As another convenient reference, although outside the scope of our application domain, in a general-purpose environment the microarchitectural decisions adopted produce a CPI of 1.4545 for the execution of the GCC compiler.

TABLE I
INSTRUCTION EXECUTION CYCLES

Instruction type                      No. of cycles   No. of additional instruction steps
Load (word/halfword/byte)¹            2               1
Load (double)¹                        3               2
Store (word/halfword/byte)            3               2
Store (double)                        4               3
Atomic load-store                     3               2
Jump and Rett                         2               1
Branch (with wrong prediction)        2               -
Branch (with right prediction)        1               -
Tagged instruction with trap check    2               1
All other instructions                1               -

¹ One additional cycle and one additional instruction step are included when interlock is needed.
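The CPI figure for the MPEG mix can be reproduced from the cycle counts of table I. The short Python sketch below assumes that all loads and stores in the mix are single-word accesses and that an interlock costs one extra cycle, which is how we read the table footnote; the variable and function names are illustrative.

```python
# Instruction mix observed for MPEG (fractions of executed instructions).
mix = {"load": 0.162, "store": 0.081, "branch_mispredicted": 0.031, "other": 0.726}

# Cycles per instruction class, from table I (single-word loads and stores).
cycles = {"load": 2, "store": 3, "branch_mispredicted": 2, "other": 1}

LOAD_INTERLOCK_RATE = 0.5  # half of the loads interlock with the next instruction
INTERLOCK_PENALTY = 1      # one additional cycle per interlock


def cpi(mix, cycles, interlock_rate, penalty):
    base = sum(mix[k] * cycles[k] for k in mix)
    return base + mix["load"] * interlock_rate * penalty


print(round(cpi(mix, cycles, LOAD_INTERLOCK_RATE, INTERLOCK_PENALTY), 3))  # 1.436
```

Term by term this is 0.324 + 0.243 + 0.062 + 0.726 for the base mix plus 0.081 for load interlocks, which gives the 1.436 quoted in the text.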

V. EXPERIMENTS

With the methodology followed, based on a defined Instruction Set Architecture, a large number of processors can be built. The impact of a specific feature can be easily evaluated by comparing two versions, one with the feature and the other without it, as long as the rest of the characteristics are not altered.

The different design alternatives are introduced into the models by working on special-purpose elements of the architecture or by reorganising some of them and their interconnections.

Taking this into account, a set of versions has been developed with the following microarchitectural parameters:

1. Number of windows in the register file.
2. Branch prediction:
   (a) always predict taken,
   (b) predict taken only if backwards.
3. Checking of possible modification of the condition codes by the previous instruction on branches. As seen in fig. 4, at the decode stage of the branch instruction in cycle ‘c’, when fetching the instruction in the delay slot, the address of the next instruction to fetch has to be calculated. The problem is that Instruction 1 could also change the condition codes in ‘c’, so the address calculated in this cycle is predicted¹. A parameter to evaluate when deciding whether or not to make a prediction is whether the condition codes could be modified by Instruction 1.
4. Calculation of the address for the next fetch:
   (a) with just one add operation in a cycle,
   (b) with two additions. This way, the two alternatives are calculated at the same time, the proper one is chosen depending on the sequencing decisions and the other one is saved in a register. If at any time the decision was wrong (for instance, in mispredictions), then the value saved previously is retrieved.
5. Bypass mechanism set:
   (a) from the end of the execution stage of the previous instruction to the decode stage, or
   (b) from the beginning of the write-back stage of the previous instruction to the execution stage.

¹ If we wait for the condition codes to be set by Exec. 1 before deciding the address to set in ‘d’, the cycle time increases.

The layout synthesis of these design options has been carried out with a further set of logic and physical synthesis parameters:

• Logic granularity (ratio of datapath modules vs. standard cells).
• Physical granularity (options for guiding the physical synthesis):
  - Allowing for physical planning in megacells.
  - Allowing for the definition of groups of standard cells for placement.
• Plasticity in shape and connectivity of the cores.
• Technology scaling from 0.5 µm to 0.35 µm, with the same metal layers and VDD = 3.3 V.

The impact of the synthesis process on every microarchitectural option has been measured in terms of area, speed, power and transistor count as performance figures, as well as by a figure of merit for area, shape and plasticity for core integration with other blocks.
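One way to keep track of such a set of versions is to describe each one as a small configuration record. The Python sketch below is purely illustrative: the class `CoreConfig` and its field names are hypothetical and are not the configurable VHDL constants used in the actual models. It only shows how the numbered microarchitectural parameters above combine into option labels such as ‘2.a, 3, 4.b, 5.a’.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    windows: int      # parameter 1: number of register windows
    prediction: str   # parameter 2: "a" always taken, "b" taken only if backwards
    check_cc: bool    # parameter 3: check condition-code modification
    next_pc: str      # parameter 4: "a" one addition per cycle, "b" two additions
    bypass: str       # parameter 5: "a" from execution end, "b" from write-back start

    def options_label(self) -> str:
        parts = [f"2.{self.prediction}"]
        if self.check_cc:
            parts.append("3")
        parts += [f"4.{self.next_pc}", f"5.{self.bypass}"]
        return ", ".join(parts)


print(CoreConfig(7, "a", True, "b", "a").options_label())   # 2.a, 3, 4.b, 5.a
print(CoreConfig(7, "b", False, "b", "a").options_label())  # 2.b, 4.b, 5.a
```

Comparing two records that differ in a single field corresponds to the pairwise evaluation described at the start of this section, where one version has the feature and the other does not.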
VI. RESULTS

For the design and microarchitectural options established, and two technologies, more than one hundred implementations have been carried out. As mentioned, the technologies chosen have been 3-metal 1-poly 0.5 and 0.35 µm CMOS processes, with VDD = 3.3 V.

TABLE II
‘LOWEST’ AMOUNT OF MODULES

Ref.   Options             Area (sq. mm)   Power (mW/MHz)   Freq. (MHz)   Transistors
a702   2.a, 4.a, 5.a        7.9841          9.30268         59.97         183109
a712   2.a, 3, 4.a, 5.a    18.5920         10.29729         58.09         187031
a722   2.a, 4.b, 5.a       11.5999         10.00482         64.16         185580
a732   2.a, 3, 4.b, 5.a    15.3036         10.13658         63.78         186515
a742   2.a, 4.b, 5.b       12.4070          9.48736         66.87         183181
a752   2.b, 4.b, 5.a       11.9607          9.97601         69.97         185273
a762   2.b, 3, 4.b, 5.a    13.3563         10.03715         72.64         185416

TABLE III
‘MEAN’ AMOUNT OF MODULES

Ref.   Options             Area (sq. mm)   Power (mW/MHz)   Freq. (MHz)   Transistors
a703   2.a, 4.a, 5.a        9.0754          9.59213         60.00         171217
a713   2.a, 3, 4.a, 5.a     9.5424          8.84838         55.60         169475
a723   2.a, 4.b, 5.a       13.4839          9.64224         80.95         172032
a733   2.a, 3, 4.b, 5.a     9.1589          9.56397         89.90         173087
a743   2.a, 4.b, 5.b       10.6412          9.34569         74.36         169630
a753   2.b, 4.b, 5.a        9.6249          9.20074         93.60         172463
a763   2.b, 3, 4.b, 5.a    13.1945          9.92848         84.75         173774

Tables II, III and IV summarize quantitative data from the experiments for cores with 7 windows with the 0.35

✁ m pro-

TABLE IV ‘HIGHEST’ AMOUNT OF MODULES Ref. Options Area Power Freq. Transistors (sq. mm) (mW/MHz) (MHz) a704 2.a, 4.a, 5.a 14.7899 10.15464 54.60 172235 a714 2.a, 3, 4.a, 5.a 9.7460 9.81435 53.87 171983 a724 2.a, 4.b, 5.a 13.2781 10.50044 78.19 176022 a734 2.a, 3, 4.b, 5.a 13.0209 9.98942 87.37 174922 a744 2.a, 4.b, 5.b 16.0621 10.39429 79.57 174973 a754 2.b, 4.b, 5.a 14.9878 10.48167 79.40 176069 a764 2.b, 3, 4.b, 5.a 12.7523 10.03864 80.06 174623 (a) a223 (b) a252 (c) a424 (d) a733 (e) a734 (f) a753

Fig. 6. Six different versions of the SPARC v8 IU
cess. Table V summarizes similar data for cores with 2 and 4
windows. The full set of implementations, their physical lay-
uts, and other measurements and ratios obtained can be seen in

[Bau99]. Fig. 6 gives six sample bare layouts, not including pe- ripheral blocks. Fig. 7 gives as an example a processor core including the VIS extension. This particular core runs at 80 MHz, takes 15 mm

✕ , and the VIS extension amounts to 7.34% of total

transistor count. The names of the tables summarize combinations of several

SLIDE 7

TABLE V WITH VARIABLE WINDOWS NUMBER Windows Prediction Modules Ref. Area Power Freq. Transistors Scheme Ratio (sq. mm) (mW/MHz) (MHz) 2 Simple — with

ptions 2.a, 4.b, 5.a

Low a222 7.7270 4.87396 61.46 122780 Mean a223 7.1537 4.66823 86.56 109650 High a224 11.2015 5.66391 85.16 112879 Enhanced — with

ptions 2.b, 4.b, 5.a

Low a252 8.0454 4.85731 61.17 122699 Mean a253 7.4045 4.85172 88.91 109763 High a254 11.3917 5.62138 78.70 112611 4 Simple Low a422 8.8574 6.48212 59.63 148278 Mean a423 11.4606 6.89787 88.90 136056 High a424 9.5769 6.80150 85.73 137870 Enhanced Low a452 11.0749 6.89198 61.14 148787 Mean a453 10.2297 6.76384 79.11 136042 High a454 9.7595 6.91100 82.79 137776

Fig. 7. Sample complete layout of a SPARC core with VIS extension

design parameters. Option ‘2.a, 4.a, 5.a’ in table II denotes an implementation with ‘predict taken’, ‘one add per cycle’ and ‘by-pass from end of execution stage’ (see section V).

Many qualitative conclusions can be drawn from these quantitative data, since they show significant variations in area/shape, power and speed. The first noticeable fact is that the synthesis and physical design tools produce a large spread of performance. A reasonable ‘mean’ combination of megacell modules inside the design improves all the performance functions. If adders and ALUs are not built with megacells, a larger number of transistors is used; if registers and multiplexers are included as megacells, the tools have trouble achieving placement and routing, which lengthens the global interconnection and therefore increases power consumption, transistor count, and cycle time.

If groups of standard cells are defined for placement in the physical synthesis according to logic design criteria, the power increases relative to that obtained with fully automatic synthesis. This is because with fully automatic synthesis the tool tries to minimize the global interconnection length, whereas forcing a particular placement increases this length and therefore the power consumed.

The effect of another physical synthesis option, namely allowing or not allowing physical planning within megacells, can be observed in the example given in fig. 8. Fig. 8.a, for the ‘lowest’ amount of modules, shows that power consumption increases without physical planning. On the contrary, fig. 8.c, for the ‘highest’ amount of modules, shows that power consumption increases with physical planning.
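The link between interconnection length and power described above can be illustrated with the usual first-order model of dynamic power, P = a · C · Vdd² · f, where the switched capacitance C grows with the total wire length. The following Python sketch is only an illustration under that assumption: the function name and every numeric value (gate capacitance, wire capacitance per millimetre, supply voltage, clock frequency, activity factor, wire lengths) are hypothetical and are not taken from the measurements reported here.

```python
def dynamic_power_mw(
    wire_length_mm: float,
    gate_cap_pf: float = 200.0,        # hypothetical total gate capacitance
    wire_cap_pf_per_mm: float = 0.2,   # hypothetical wire capacitance per mm
    vdd_v: float = 3.3,                # hypothetical supply voltage
    freq_mhz: float = 100.0,           # hypothetical clock frequency
    activity: float = 0.15,            # hypothetical switching activity factor
) -> float:
    """First-order dynamic power, P = activity * C * Vdd^2 * f, in milliwatts."""
    total_cap_pf = gate_cap_pf + wire_cap_pf_per_mm * wire_length_mm
    # pF * V^2 * MHz gives microwatts; divide by 1000 to get milliwatts.
    power_uw = activity * total_cap_pf * vdd_v ** 2 * freq_mhz
    return power_uw / 1000.0


if __name__ == "__main__":
    # Same logic, two placements: automatic (shorter wires) vs. forced grouping.
    placements = (("automatic placement", 1500.0), ("forced grouping", 1900.0))
    for label, length_mm in placements:
        print(f"{label}: {dynamic_power_mw(length_mm):.1f} mW")
```

In this model the extra power is proportional to the added wire length, which is why a placement constraint that lengthens the interconnect raises power even though the logic itself is unchanged.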

Fig. 8. Effect on power of the planning in modules. Panels: (a) ‘Lowest’ amount of modules; (b) ‘Mean’ amount of modules; (c) ‘Highest’ amount of modules. Curves: with planning, without planning.

Fig. 9. Effect of the technology on the speed. Curves: versions xa723 and xa753.

The effect of the sequencing strategy on performance is shown in fig. 10. Microarchitectural option 2.b (predicting that a branch will be taken when it is backwards) performs better than option 2.a (predicting that a branch will always be taken). However, if we try to improve the prediction scheme further by introducing a design option that complicates the decision logic (checking whether or not the previous instruction can modify the condition flags), the behaviour is the opposite: option 2.a tends to give a better cycle time than option 2.b. In particular, since options 2.a and 2.b differ not only in the cycle time shown but also in execution cycles, fig. 10.b shows clearly how an implementation decision depends on the dynamic instruction counts of the application workload. This effect also scales differently with different technologies.

When the number of windows is varied, the speed stays approximately the same, because the register file is not on the critical path in this case. Nonetheless, the power and the transistor count increase linearly with the number of windows. However, as feature size decreases and the register file grows in word length and in access lanes or ports, as demanded by emerging processor architectures, this experiment is worth revisiting.

The introduction of SIMD structures to support instruction set extensions such as VIS has been measured to slow down the clock speed by at least 15% if the SIMD structure is placed in the datapath. This means that the benefit of SIMD support depends heavily on the frequency of execution of these special instructions. That is, it works out fully only for applications of rather narrow scope.
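The dependence on instruction frequency can be made concrete with a simple break-even estimate. The sketch below rests on modelling assumptions that are not results from this work: the 15% slowdown is read as the clock frequency dropping to 85% of its original value, a fraction of the baseline cycles can be moved to SIMD instructions, and those cycles shrink by a constant hypothetical factor `simd_speedup`. The function names are illustrative.

```python
def relative_runtime(simd_fraction: float, clock_slowdown: float,
                     simd_speedup: float) -> float:
    """Runtime of the SIMD-extended core divided by the baseline runtime.

    simd_fraction  -- share of baseline cycles that SIMD instructions can replace
    clock_slowdown -- relative loss of clock frequency (0.15 for 15%)
    simd_speedup   -- assumed cycle reduction factor on the SIMD-able share
    """
    if not 0.0 <= simd_fraction <= 1.0:
        raise ValueError("simd_fraction must lie in [0, 1]")
    if not 0.0 <= clock_slowdown < 1.0:
        raise ValueError("clock_slowdown must lie in [0, 1)")
    if simd_speedup <= 1.0:
        raise ValueError("simd_speedup must exceed 1")
    relative_cycles = (1.0 - simd_fraction) + simd_fraction / simd_speedup
    # A slower clock stretches every remaining cycle by 1 / (1 - slowdown).
    return relative_cycles / (1.0 - clock_slowdown)


def simd_break_even_fraction(clock_slowdown: float, simd_speedup: float) -> float:
    """SIMD-able share of cycles at which the extended core matches the baseline.

    A result above 1.0 means the extension can never pay off under this model.
    """
    if simd_speedup <= 1.0:
        raise ValueError("simd_speedup must exceed 1")
    return clock_slowdown / (1.0 - 1.0 / simd_speedup)


if __name__ == "__main__":
    for speedup in (2.0, 4.0, 8.0):
        threshold = simd_break_even_fraction(0.15, speedup)
        print(f"speedup x{speedup:.0f}: break-even SIMD share = {threshold:.1%}")
```

Under these assumptions, a hypothetical fourfold SIMD speedup requires about 20% of the baseline cycles to be SIMD-able just to recover the 15% clock penalty, and a twofold speedup raises the threshold to about 30%.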

Fig. 10. Effect on the speed of the sequencing strategy. Panels: (a) ‘Lowest’ amount of modules; (b) ‘Mean’ amount of modules. Curves: without checking option, with checking option.

When the core coexists with other external elements, it is in general beneficial to fix the core physically beforehand. If fully automatic physical synthesis is performed, the tools can significantly alter the properties of the core.

Also, the effects of some design options do not scale equally well with deeper submicron technologies; that is, the influence of mapping and synthesis on different technologies is very important for microarchitectural selection, as the relative weight of the improvements of a specific set of options over another one may not be preserved when scaling to another technology. This has been observed in several specific versions, by comparing pairs of versions matched in all design decisions except the technology used. For instance, the option of calculating the next instruction address with an adder and an incrementer has been dominant in speed in all cases. However, predicting branches as taken if backwards has always been beneficial in CPI, but not always in cycle time. It slows down the cycle for the 0.50 µm technology and speeds it up for the 0.35 µm technology (see fig. 9). This is related to the increased wire delay compared to gate delay in a deeper submicron process.

It can also be observed that, for exactly the same set of options in two different technologies, more transistors are needed in the 0.50 µm process than in the 0.35 µm one. This is due to the larger area occupied, which imposes longer interconnection lengths and bigger loads.

Area is an item that shows a larger statistical variance when modules and standard cells coexist in the same design. The detailed results show when and how much designer guidance is advisable if an area/shape-optimized design is desired. The following general tradeoff holds: the more standard cells are used, the less optimal the design is in area; however, more shape plasticity is obtained for logic blocks and modules.
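The observation that a prediction scheme can always help CPI and still lose on cycle time can be illustrated with the relation time per instruction = CPI × cycle time. The Python sketch below is a simplified model, not the evaluation method used in this work: it assumes a base CPI of 1, a fixed one-cycle misprediction penalty, and a workload described by four hypothetical probabilities. All cycle times in the usage part are invented to show the effect and are not the measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    branch_fraction: float    # branches per executed instruction
    backward_fraction: float  # share of branches that jump backwards
    backward_taken: float     # probability that a backward branch is taken
    forward_taken: float      # probability that a forward branch is taken


def mispredict_rate(w: Workload, scheme: str) -> float:
    """Fraction of branches mispredicted by a static scheme."""
    forward_fraction = 1.0 - w.backward_fraction
    # Both schemes predict backward branches as taken.
    backward_miss = w.backward_fraction * (1.0 - w.backward_taken)
    if scheme == "always_taken":          # corresponds to option 2.a
        forward_miss = forward_fraction * (1.0 - w.forward_taken)
    elif scheme == "taken_if_backward":   # corresponds to option 2.b
        forward_miss = forward_fraction * w.forward_taken
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return backward_miss + forward_miss


def cpi(w: Workload, scheme: str, penalty_cycles: float = 1.0) -> float:
    """Cycles per instruction, assuming a base CPI of 1 (hypothetical)."""
    return 1.0 + w.branch_fraction * mispredict_rate(w, scheme) * penalty_cycles


def time_per_instruction_ns(w: Workload, scheme: str, cycle_time_ns: float) -> float:
    return cpi(w, scheme) * cycle_time_ns


if __name__ == "__main__":
    workload = Workload(branch_fraction=0.2, backward_fraction=0.4,
                        backward_taken=0.9, forward_taken=0.3)
    # Invented cycle times (ns) per scheme for two hypothetical processes.
    cycle_times = {
        "process A": {"always_taken": 10.0, "taken_if_backward": 10.6},
        "process B": {"always_taken": 7.0, "taken_if_backward": 6.9},
    }
    for process, per_scheme in cycle_times.items():
        for scheme, cycle_ns in per_scheme.items():
            t = time_per_instruction_ns(workload, scheme, cycle_ns)
            print(f"{process}, {scheme}: CPI={cpi(workload, scheme):.3f}, {t:.3f} ns/instr")
```

With these invented numbers the backward-taken scheme has the lower CPI in both processes, yet it is slower per instruction in process A, where its cycle time is 6% longer, and faster in process B. A different branch mix in the workload moves the crossover point, which is why the choice depends on dynamic instruction counts.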

VII. CONCLUSIONS

In this paper we have shown some experimental results from the modelling, design and implementation of a complete set of versions of the SPARC v8 Integer Unit, its VIS extensions, and a standard FPU. As has been demonstrated, in addition to the many architectural decisions that can be incorporated into large embedded processor designs at different levels, the impact of microarchitecture and design features, as well as the impact of the use of custom modules, can be decisive for the performance obtained from a design. Furthermore, the relative weight of design options varies significantly with the feature size of the technology. These facts call for a greater integration of high-level architectural tools and lower-level design tools, especially for deeper submicron technologies and for synthesis-based system-on-chip design.

VIII. ACKNOWLEDGEMENTS

The authors acknowledge the support obtained from project DSIPS, CICYT no. TIC970953, and from the Department of Electronics Engineering, ULPGC, as well as the support of O. Quintero with the VIS extensions and V. de Armas with the Floating Point Units.