- J. Henkel, Univ. of Karlsruhe, WS04/05, 2004
http://ces.univ-karlsruhe.de
Design and Architectures for Embedded Systems
- Prof. Dr. J. Henkel
- Chair for Embedded Systems
University of Karlsruhe, Germany
Today: Embedded
Optimization for:
Embedded Processor Design
Integration Hardware Design
Middleware, RTOS
System specification Design space exploration
System partitioning
Estimation&Simulation
Tape out Prototyping
embedded IP:
IC technology
Optimization
refine
LISATek (CoWare)
Tensilica's Xtensa
Improv
ARC
HP's PICO
… others
Architectural Exploration Implementing the Architecture Designing SW
Integration and Verification
Tasks are interdependent
Improvement through iteration
Each task is customized for one specific implementation of an embedded processor
Many steps are manual since it is a one-time effort
But product life times are short: can these tasks be combined and automated?
Architectural Exploration Implementing the Architecture Designing SW
Integration and Verification Embedded Processor Tool-suite
Iterative Improvement
There is only one generic tool-suite that generates all other parts -> a) min. manual support b) higher flexibility c) re-use for next-gen embedded processor
Iterative improvement is done without manually re-designing the tools
Instruction set:
Fully customized instructions (none predefined); but the instruction set might be domain-specific (e.g. DSP-type)
Core instruction set is fixed; the instruction set can be enhanced:
The "bottlenecks" of an application are hard-wired as application-specific instructions (might be re-used, e.g. FFT, but might be specific to one application only); tool-suite provides a language to define these instructions
Processor components:
The basic (general) core can be enhanced by pre-defined, fixed, specialized cores, e.g. a DSP core
System components (to be added/omitted and parameterized):
A) on-chip cache: size, policy, …
B) MMU
C) …
On-chip communication infrastructure:
Buses and hierarchy of buses (processor core, inter-core, peripheral) -> typically fixed
Paradigm
The LISA language
Design Flow and Tools
Simulation
Behavior: C/C++ description
Resources: registers, pipelines etc.
Timing information
Pipeline model
Instruction-set description
LISA operations
abstraction of time (instruction/cycle accurate)
abstraction of architecture
Memory model
registers, memories
bit widths, ranges
Resource model
hardware resources
resource requirements
Behavioral Model
abstracted hardware activities (various levels)
changing the system state
composed of valid HW
assembly syntax instruction word coding instruction semantics
activation sequence of
hardware operations
pipeline
RTL accurate hardware
behavior
hardware operation
grouping
(source: LISATek)
Basic idea: closing the gap between structure-oriented languages (HDL, Verilog) and instruction set languages
Memory model:
registers, memories with widths, ranges etc.
Resource model:
specifies available hardware (like FUs, …)
Instruction set model:
instruction word coding, spec. of valid operands and addressing modes; written in assembly syntax
Behavioral model:
abstraction of hardware structures; notion of state (for simulation); abstraction level can vary
Timing model:
specifying the sequence of hardware operations and units
Micro-architectural model:
grouping of hardware operations to FUs; describes the details of the micro-architectural implementation of RTL components
Target Architecture
LISA Description
Language Compiler
LISA C Compiler, LISA Assembler, LISA Linker, LISA Simulator
Output/Results: Profiling Data, Performance Data
VHDL Description, Synthesis, Gate-level Model
Output/Results: Area, Power Consumption, Clock Speed
Exploration, Implementation
Simulation Library
Mixed behavioral/structural model:
based on C/C++
VLIW data types
strong memory modeling capabilities (incl. caches)
include external IP (libraries)
Enriched by timing information:
clocked register behavior
extensive pipeline model with predefined functions: stall, flush
(source: LISATek)
Instruction word coding
variable widths
multiple words
distributed coding
Assembly syntax
mnemonic-based syntax
algebraic (C-like) syntax
Instruction semantics
compiler semantics
Configurable instruction set information (power, etc.)
(source: LISATek)
LISA: Resource Model, Memory Model, Instruction Set, Behavioral Model, Timing Model, Micro Architecture
HDL: Memory Configuration, Structure, Functional Units, Decoder, Pipeline Controller
Model does not consist of predefined components -> must be generated from description:
Memory: directly derived
Structure (e.g. pipeline stages): derived from resource, behavioral and micro-architectural model
FUs: derived from architectural model (fully functional or empty entities)
Decoder: derived from info in instruction set model
OPERATION Decode IN pipe.DC {
  ENUM InsnType=Type1, Type2, Type3;
  SWITCH (InsnType)
    CASE Type1: CODING {Decode==Decode_16}
    CASE Type2: CODING {(Decode==Decode_32) && (Fetch==Operand)}
    CASE Type3: CODING {(Decode==Decode_48) && (Fetch==Operand1) && (Prefetch==Operand2)}
}
Example for multiple instruction words and its implementation in LISA
OPERATION add {
  DECLARE { REFERENCE mode; }
  IF (mode==short) {
    BEHAVIOR { dest_lo=src1_lo+src2_lo; }
  }
  ELSE {
    BEHAVIOR {
      dest_lo=src1_lo+src2_lo;
      carry=dest_lo >> 16;
      dest_lo &= 0xFFFF;
      dest_hi=src1_hi+src2_hi+carry;
    }
  }
}
Instruction word: instruct | cond | mode | dest_reg | src_reg1 | src_reg2
instruct: add, sub, mul, ld, sto; mode: short, long
Non-orthogonal coding elements
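The mode-dependent behavior of the add operation above can be mirrored in plain C. This is only a sketch: the names reg_pair, add_mode and lisa_add are hypothetical, and it assumes 16-bit register halves whose results wrap to 16 bits, which the slide does not state.

```c
#include <stdint.h>

/* Hypothetical register pair: one value held as two 16-bit halves. */
typedef struct {
    uint16_t lo;
    uint16_t hi;
} reg_pair;

/* Assumed encoding of the "mode" field of the instruction word. */
typedef enum { MODE_SHORT, MODE_LONG } add_mode;

/* Short mode adds only the low halves; long mode adds both halves and
 * propagates the carry out of bit 15 into the high half. */
reg_pair lisa_add(add_mode mode, reg_pair src1, reg_pair src2, reg_pair dest)
{
    if (mode == MODE_SHORT) {
        dest.lo = (uint16_t)(src1.lo + src2.lo);
    } else {
        uint32_t sum = (uint32_t)src1.lo + src2.lo;
        uint32_t carry = sum >> 16;          /* 0 or 1 */
        dest.lo = (uint16_t)(sum & 0xFFFFu);
        dest.hi = (uint16_t)(src1.hi + src2.hi + carry);
    }
    return dest;
}
```

Because one operation covers both modes, the decoder needs only one entry for add; the mode bit selects the behavior at execution time.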
Features:
dynamic address mapping
user-defined memory modules
different levels of abstraction
C++ and SystemC simulation models
bus redirect allows external memory access
access statistics
SYSTEM BUS, On-Chip RAM, D$, I$, Write Buffer, Off-Chip RAM, L2 Cache
Memory Architecture Spec, LISA Spec., Memory Template Lib
(source: LISATek)
Combining both paradigms in LISATek:
Just-In-Time Cache-Compiled Simulation (JIT-CCS)
Interpretive simulation: Application, Memory, Instruction Decode, Execute (all at run-time)
Compiled simulation: Application, Simulation Compiler (compile-time); Instruction Behavior, Execute (run-time)
(source: LISATek)
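The idea behind JIT-CCS can be illustrated with a decode cache: an instruction is decoded the first time its address is reached and the decoded form is re-used on every later visit. The sketch below is not the LISATek implementation; the toy 16-bit instruction format (opcode in the upper byte, operand in the lower byte) and all names are assumptions made for illustration.

```c
#include <stdint.h>
#include <string.h>

#define PROG_MAX 16

typedef struct cpu cpu;
typedef void (*behavior_fn)(cpu *c, uint8_t arg);

/* One cache entry per program address: the decoded behavior and operand. */
typedef struct {
    int valid;
    behavior_fn exec;
    uint8_t arg;
} decoded_insn;

struct cpu {
    uint32_t pc, acc, counter;
    int halted;
    unsigned decodes;                 /* how often the decoder actually ran */
    decoded_insn cache[PROG_MAX];
};

/* Toy behaviors (hypothetical ISA): 0 = halt, 1 = add immediate,
 * 2 = decrement counter and jump to arg while the counter is non-zero. */
static void do_halt(cpu *c, uint8_t arg) { (void)arg; c->halted = 1; }
static void do_add(cpu *c, uint8_t arg)  { c->acc += arg; c->pc++; }
static void do_loop(cpu *c, uint8_t arg)
{
    if (--c->counter != 0) c->pc = arg; else c->pc++;
}

/* Interpretive step: done at run time, but only once per address. */
static void decode(cpu *c, uint16_t word, decoded_insn *d)
{
    static const behavior_fn table[] = { do_halt, do_add, do_loop };
    unsigned op = word >> 8;
    d->exec = table[op < 3 ? op : 0]; /* unknown opcodes halt */
    d->arg = (uint8_t)(word & 0xFF);
    d->valid = 1;
    c->decodes++;
}

/* Runs prog (n <= PROG_MAX words) until halt or until pc leaves the program. */
void jit_ccs_run(cpu *c, const uint16_t *prog, uint32_t n, uint32_t counter)
{
    memset(c, 0, sizeof *c);
    c->counter = counter;
    while (!c->halted && c->pc < n) {
        decoded_insn *d = &c->cache[c->pc];
        if (!d->valid)
            decode(c, prog[c->pc], d);   /* cache miss: decode just in time */
        d->exec(c, d->arg);              /* cache hit: execute compiled form */
    }
}
```

For a loop body that runs many times, the decoder is invoked once per instruction address, so the run-time cost approaches that of compiled simulation while the program can still be loaded or changed at run time as in interpretive simulation.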
Texas Instruments TMS320C6201
Analog Devices ADSP21xx
(neither knowledge on DSP nor on LISA)
Advanced RISC Machines ARM7 core
(source: LISATek)
Paradigm
Micro-Architectural Parameterization
Using TIE (Tensilica Instruction Extension)
Tools
IP (cores): parameterizable; TIE: instruction set extensions; customized, generated software tool flow
Combines the core-based design paradigm on the one side with ASIP features (application specific instruction set processor) on the other side
User can adapt core parameters and define own instructions (if necessary): two levels of customization
Status: commercial product
2 @ 0.18 micron
2 @ 0.25 micron
Base ISA
Optional function extensible
CoPro Reg file, CoPro Exec Unit, TIE Instructions
Window reg. file, ALU & address generation, MAC 16, Align & decode, Processor Controls, Instruction memory or Cache & tag, Branch logic & Instruction fetch, Data memory or Cache & tag, Memory protection, Write buffer
Special function registers, Timers (0 to n), Data & instruction Address watch (0 to n), Exception support, Interrupt control
Mixed level of configurability:
Fixed options that can be added or omitted (y/n)
Configuration of device parameters: sizes of caches, …
Fully customized extensions to the instruction set: TIE
Target options: geometry/process; frequency [MHz]; power saving (y/n); register file impl.; …
Instruction options: 16-bit MAC (y/n); 16-bit multiplier (y/n); …
Interrupts: types and # of interrupts; # of timers
Byte ordering: b/l endian
Registers for call windows: #
Processor interface: (r/w) width
Instruction cache: associativity (e.g. direct); cache organization (e.g. 4096x32); tag RAM addr x data width (e.g. 512x19)
Debugging: full scan (y/n); instruction ads break reg. (#)
TIE: Xtension (yes/no); TIE source (e.g. ./sample.tie)
Board support: …
Application in C/C++
Profiling
Identify potential new instructions
Implement TIE
Translate TIE to C/C++
Profile and analyze
OK?
Re-compile source with new instructions instead of function calls
Run ISS (cycle-accurate)
Build processor
Run on evaluation board
OK?
xtensa native
state ANS2 32
user_register ans2 0 {ANS2}
iclass frexp {FREXP} {out arr, in ars} {out ANS2}
iclass ldexp {LDEXP} {out arr, in art, in ars}
reference FREXP {
  wire [31:0] temparr;
  wire [31:0] tempans2;
  assign temparr = (ars[30:23] == 0 && ars[22:0] == 0) ? 32'b0 : {ars[31], 8'b01111110, ars[22:0]};
  assign tempans2 = (ars[30:23] == 0 && ars[22:0] == 0) ? 32'b0 : {24'b0, ars[30:23] - 127 + 1};
  assign arr = (tempans2[0] == 1) ? {temparr[31], temparr[30:23] + 1'b1, temparr[22:0]} : temparr;
  assign ANS2 = (tempans2[0] == 1) ? (tempans2 - 1) >> 1 : tempans2 >> 1;
}
reference LDEXP {
  assign arr = {art[31], art[30:23] + ars, art[22:0]};
}
sqrt.tie
After coding TIE, the compiler generates:
C functions equivalent to TIE -> functional verification through usage in C with native software development environment
C function declarations -> allow new instructions to be coded as functions in application code
Dynamic shared libs to be used by other Xtensa SW
HDL description (Verilog) -> hardware needed to support TIE instructions (also gives a measure of HW costs and performance)
Synthesis scripts (for DC): allow the hardware to be synthesized automatically from the HDL description
Single-cycle access to memory
5-stage pipeline: "memory" is often the critical path when it comes to high clock rates
User can choose to avoid placing logic after the memory result is read, to avoid creating a critical path -> delay result assignment by one cycle using multi-cycle instructions
Pipeline: fetch, decode, execute, memory, write-back (critical path: memory)
Schedule sections:
Specifies implementation at the micro-architectural level (all others are ISA related)
Technique to define instructions that use more than one cycle (important for relaxing cycle time)
Example: one or more opcodes with the same I/O spec can be grouped into one schedule
Instruction Set Simulator (ISS), Bus Functional Model Program (assembly)
Processor Interface Model (HDL) Processor Interface (PIF) lib Bus Interface Verilog/VHDL SRAM Mem. HDL External Co-verification Console
External SRAM (compressed Code) Cache CPU Core: Xtensa PIF Tag
Code decompression Core
Tensilica NEC add-on
Compress code and store in main memory
Decompress on-the-fly in just 1 cycle
[LeHe02]
Gather statistics Compilation
Source Code Object Code
Compressed Software SRAM
Compress & build table; patch branch offsets
Compression stages Table Interface Tree logic
Cache
[LeHe02]
Summary of main Features
Platform
DSP core
Misc
IP (cores) parameterizable add new instructions to core
(with standard ASIC design flow)
Modify/extend Instruction Set architecture
Targets DSP-oriented applications
The components of the platforms
Composer:
Facilitates instruction extension and adding new instructions; interrupts; memory …
Generator:
Interprets configuration from composer
Generates RTL Verilog description as input for a standard ASIC design flow
The RTL instances generated are verified and read from a database and not automatically generated
(source: Improv Systems)
software hardware
Configurable VLIW architecture
user chooses data path (16-bit or 32-bit wide)
User can extend the ISA: either through an option (like inclusion or non-inclusion of multiplier, MAC etc.) or by defining custom instructions and custom functional units; dual-bit accumulator; …
Memory: instruction and data memories can be configured
Furthermore: interrupts (number and priority levels), system memory addresses etc.
Features:
Power: < 0.1 mW/MHz @ 0.13 micron
Chip size: < 0.25 mm2 @ 0.13 micron
Performance: > 1000 MOPS @ 100 MHz
Misc architectural features:
Distributed register files to avoid I/O bottleneck from and to FUs
2-stage instruction pipeline
Single-cycle execution units
(source: Improv Systems)
extensible
Bus wrapper
Memory Interface Control Interface Host Bus Interface
data and control signal
Jazz system
interfacing to standard bus systems like AHB, PCI, …
Qbus within the Jazz subsystem
host bus IF local IF
channels
division-multiplexed PCM highways
type/speed/width of each highway
ARCTangent A4 processor core
SW extension
Configuration of the ARC RISC core add new instructions to core
Extend ISS
Modify/extend Instruction Set architecture
Configure/extend the core
extension => 98
extension => 64
extension => 2^22
extension => 32
Host interface, Interrupt controller, A4 processor core, Load Store, fetch, Core Reg., Ext. Reg., Aux. reg, Extensible Instruct., Cond code
User extensions: Extension registers, Auxiliary Registers, Extension Instructions, Ext. cond code
(source: ARC)
to the decode stage
connects to the two input operands
connections are required:
instruction decode entry
#include <stdio.h>
#include <stdlib.h>
typedef int Integer;
typedef unsigned long ulong;
extern int lookup(int, int);
pragma Intrinsic(lookup, opcode=>0x1F, flags => "zncv");
volatile int lookup_value;
ulong result;
void main() {
  lookup_value = 0xFFFFFFFF;
  result = lookup(4, 0xFFFFFFFF);
}

;-------------| main |-----------------
main:
..LN2:
  mov %r0, -1
  st.di %r0, [%gp, lookup_value@sda]
..LN3:
  .extInstruction lookup,0x1f,0x00,SUFFIX_COND|SUFFIX_FLAG,SYNTAX_3OP
  lookup %r0, 4,-1
Call from C program Call from C assembly code
(source: ARC)
code
can simulate a program that contains the new op-codes
Design paradigm
Front-end optimization
Architecture
Synthesis and design flow
Software Program in C Non-programmable Architecture (NPA)
PICO: Program In, Chip Out
Nested loops are identified and the hot spots are synthesized as hardwired, non-programmable hardware
Output is a co-processor that can be used in conjunction with a standard processor
Aim is a low-cost design, low-cost production and high performance
Status: research project
Workload & requirement spec, Design space exploration, Parameter ranges, Architecture framework, HW and SW simulators, Evaluation, Design, Design specification (parameters), Component lib, Parameterized design space, Pareto-optimal solutions
[Plot: Pareto-optimal solutions, exec time vs. area]
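Selecting the Pareto-optimal solutions from a set of evaluated designs can be sketched as follows. The two cost axes (exec time, area) come from the plot; the type design_point and the function pareto_front are hypothetical names, and the quadratic-time filter is a simple stand-in rather than the method PICO uses.

```c
#include <stddef.h>

/* One evaluated design; smaller is better on both axes. */
typedef struct {
    double exec_time;
    double area;
} design_point;

/* a dominates b if it is no worse on both axes and strictly better on one. */
static int dominates(design_point a, design_point b)
{
    return a.exec_time <= b.exec_time && a.area <= b.area &&
           (a.exec_time < b.exec_time || a.area < b.area);
}

/* Copies the non-dominated points of pts[0..n-1] into out (capacity >= n),
 * preserving input order, and returns how many were kept. */
size_t pareto_front(const design_point *pts, size_t n, design_point *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int dominated = 0;
        for (size_t j = 0; j < n && !dominated; j++)
            if (j != i && dominates(pts[j], pts[i]))
                dominated = 1;
        if (!dominated)
            out[kept++] = pts[i];
    }
    return kept;
}
```

A design that is slower and larger than some other design is dropped; the remaining points are the trade-off curve from which the designer picks.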
for (j1=0;j1<8192;j1++) {
  y[j1]=0;
  for (j2=0;j2<16;j2++)
    y[j1]=y[j1]+w[j2]*x[j1+j2];
}
…
for (j1=0;j1<8192;j1++) {
  for (j2=0;j2<16;j2++)
    y[j1]=y[j1]+w[j2]*x[j1+j2];
}
…
body of any but the innermost loop
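A perfect loop nest has no statements in the body of any loop but the innermost one. One way to get there for the filter loop above is to move the initialization of y[j1] into the inner loop under a guard. The guard is an assumption of this sketch; the second listing on the slide does not show how the initialization is handled. Function names and the size parameters are hypothetical.

```c
/* Imperfect nest: the initialization sits in the outer loop body.
 * x must hold at least n_out + n_taps - 1 elements. */
void fir_imperfect(int *y, const int *w, const int *x, int n_out, int n_taps)
{
    for (int j1 = 0; j1 < n_out; j1++) {
        y[j1] = 0;
        for (int j2 = 0; j2 < n_taps; j2++)
            y[j1] = y[j1] + w[j2] * x[j1 + j2];
    }
}

/* Perfect nest: all work is in the innermost body; the first inner
 * iteration starts the accumulation from zero instead of reading y[j1]. */
void fir_perfect(int *y, const int *w, const int *x, int n_out, int n_taps)
{
    for (int j1 = 0; j1 < n_out; j1++)
        for (int j2 = 0; j2 < n_taps; j2++) {
            int acc = (j2 == 0) ? 0 : y[j1];
            y[j1] = acc + w[j2] * x[j1 + j2];
        }
}
```

With n_out = 8192 and n_taps = 16 this corresponds to the loop bounds on the slide; both functions produce the same y.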
Generic architecture: VLIW, cache, bus system are fixed
Processor array, array controller, local memories, interface to global memory, control & data interface to host are synthesized
System generates structural, synthesizable RTL
Figure labels: Cache, Memory Contr., Systolic Array, LocMem, Interface, VLIW Proc.
Analysis Phase
Search for array access and dependencies
Mapping and Scheduling
Nested loops are mapped to processors and scheduled
Loop Transformations
Transform to an outer sequential loop and inner parallel loops
Optimization at operation-level
Word-width minimization and classical optimizations
Processor Synthesis
Allocation of FUs and scheduling of operations relative to loop start time
System Synthesis
Allocation of processors and their interconnect; controller and data interfaces
Output:
HDL description and cost estimation
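Word-width minimization, one of the operation-level optimizations listed above, picks the narrowest datapath that still holds every value a variable can take. The sketch below assumes the value range is already known from analysis; the function names are hypothetical and this is not PICO's algorithm.

```c
#include <stdint.h>

/* Smallest number of bits that represents every unsigned value in
 * [0, max_value]. A variable that is always 0 still gets 1 bit. */
unsigned min_unsigned_width(uint64_t max_value)
{
    unsigned bits = 1;
    while (max_value >>= 1)
        bits++;
    return bits;
}

/* Smallest two's complement width n such that
 * -2^(n-1) <= lo and hi <= 2^(n-1) - 1. */
unsigned min_signed_width(int64_t lo, int64_t hi)
{
    for (unsigned n = 1; n < 64; n++) {
        int64_t half = (int64_t)1 << (n - 1);
        if (lo >= -half && hi <= half - 1)
            return n;
    }
    return 64;
}
```

For the filter loop shown earlier, an index running from 0 to 8191 needs 13 bits and one running from 0 to 15 needs 4 bits, so neither counter has to be a full 32-bit register in the synthesized hardware.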
C-code VLIW code Computation intensive code VLIW compiler VLIW synthesis VLIW des-space exploration NPA synthesis Cache exploration Cache synthesis NPA des-space exploration NPA compiler Cache hierarchy VLIW SW NPA Interface NPA param
Cache param Arch spec
Power: 0.1mW/MHz
Area: 0.3mm^2
~300MHz
0.13 micron process
Automating the process of selecting an appropriate instruction set given an embedded application
Tasks: a) Autom. selecting appropriate code segments b) Autom. matching code segments to application-specific instructions c) Autom. adapting parameters of embedded processors to embedded applications d) …
Sun/Raghunathan/Jha => automated design space exploration with custom-designed application specific instructions [Fei03]
Cheung/Parameswaran/Henkel => Library-based approach to automatically selecting application-specific instructions given an embedded application [INS03]
… and many others (see following conferences: DATE, DAC, ICCAD as well as WASP workshop)
Configurable processor cores: parameters
Extensible instruction set
Fully customized instruction set
Integration, optimization, estimation
Some platforms offer customized high-level tools that allow immediate evaluation of new parameters/instructions
masks as opposed to FPGA-based platforms but are not limited in silicon size
multiprocessors on a single chip
increase of Xtensa-1040-based embedded systems, Custom Integrated Circuits Conference, 2002. Proceedings of the IEEE 2002 , Pages:9 – 12, 12-15 May 2002.
Processors (ASIPs) Using a Machine description Language”, IEEE Tr on CAD, Vol. 20, No. 11, Nov 2001.
Machine Description Language LISA”, Proc of 15th Int. Conference on VLSI Design, 2002.
DATE, 2003.
design exploration for extensible processors, Computer Aided Design, 2003. ICCAD-2003. International Conference on , 9-13 Nov. 2003, Pages:291 – 297.
Hardware Accelerators”, HPL-2001-249, Oct., 2001.