Design and Architectures for Embedded Systems, Prof. Dr. J. Henkel


slide-1
SLIDE 1
  • J. Henkel, Univ. of Karlsruhe, WS04/05, 2004

http://ces.univ-karlsruhe.de

Design and Architectures for Embedded Systems

Prof. Dr. J. Henkel
CES - Chair for Embedded Systems
University of Karlsruhe, Germany

Today: Embedded Processor Platforms

slide-2
SLIDE 2
Where are we?

  • Emb. Software

Optimization for:

  • low power
  • Performance
  • Area, …

Embedded Processor Design

  • extens. Instruction
  • Parameterization

Integration Hardware Design

  • synthesis

Middleware, RTOS

  • Scheduling

System specification Design space exploration

  • low power
  • Performance
  • Area

System partitioning

  • models of computation
  • Spec languages

Estimation&Simulation

  • low power
  • performance
  • Area, …

Tape out Prototyping

embedded IP:

  • PEs
  • Memories
  • Communication
  • Peripherals

IC technology

Optimization

  • low power
  • performance
  • Area, …

refine

slide-3
SLIDE 3
Outline

  • Intro
  • Platforms
  • LisaTek (CoWare)
  • Tensilica's Xtensa
  • Improv
  • ARC
  • HP's PiCo
  • … others

slide-4
SLIDE 4

Platforms

slide-5
SLIDE 5
Designing an embedded processor: tasks

Tasks: Architectural Exploration; Implementing the Architecture; Designing SW Development Tools; Integration and Verification

  • Tasks are interdependent
  • Improvement through iteration
  • Each task is customized for one specific implementation of an embedded processor
  • Many steps are manual since it is a one-time effort
  • But product lifetimes are short: can these tasks be combined and automated?

slide-6
SLIDE 6
Designing an embedded processor: the alternative way

Embedded Processor Tool-suite: Architectural Exploration; Implementing the Architecture; Designing SW Development Tools; Integration and Verification; Iterative Improvement

  • There is only one generic tool-suite that generates all other parts -> a) minimal manual support, b) higher flexibility, c) re-use for the next-gen embedded processor
  • Iterative improvement is done without manually re-designing the tools

slide-7
SLIDE 7
Designing a customized embedded processor: approaches

  • Instruction set:
      • Fully customized instructions (none predefined); but the instruction set might be domain-specific (e.g. DSP-type)
      • Core instruction set is fixed; the instruction set can be enhanced:
      • The "bottlenecks" of an application are hard-wired as application-specific instructions (might be re-used, e.g. FFT, but might be specific to one application only); the tool-suite provides a language to define these instructions
  • Processor components:
      • The basic (general) core can be enhanced by pre-defined, fixed, specialized cores: e.g. a DSP core
  • System components (to be added/omitted and parameterized):
      • A) on-chip cache: size, policy, …
      • B) MMU
      • C) …
  • On-chip communication infrastructure:
      • Buses and hierarchy of buses (processor core, inter-core, peripheral) -> typically fixed

slide-8
SLIDE 8
The LisaTek Platform (LisaTek: TH Aachen)

  • Overview
  • Paradigm
  • The LISA language
  • Design Flow and Tools
  • Simulation

slide-9
SLIDE 9
Paradigm and Features

  • Combining architectural exploration and implementation in one tool suite
  • Software development tools are derived (generated) from the description
  • Not using a standard core; instead, the whole Instruction Set Architecture (ISA) is customized
  • Status: commercial product

slide-10
SLIDE 10
Features at a glance

  • Textual description of target architecture
  • Hardware Model
      • Behavior: C/C++ description
      • Resources: registers, pipelines etc.
      • Timing information
      • Pipeline model
  • Software Model
      • Instruction-set description
  • Hierarchical description style
      • LISA operations
  • Different levels of abstraction
      • abstraction of time (instruction/cycle accurate)
      • abstraction of architecture

slide-11
SLIDE 11
Tools and Models

  • Memory model
      • registers, memories
      • bit widths, ranges
  • Resource model
      • hardware resources
      • resource requirements of operations
  • Behavioral model
      • abstracted hardware activities (various levels)
      • changing the system state
  • Instruction-set model
      • composed of valid HW operations
      • assembly syntax
      • instruction word coding
      • instruction semantics
  • Timing model
      • activation sequence of hardware operations
      • pipeline
  • Micro-architecture model
      • RTL accurate hardware behavior
      • hardware operation grouping

(source: LISATek)

slide-12
SLIDE 12
LISA language: features

  • Basic idea: closing the gap between structural oriented languages (HDL, Verilog) and instruction set languages
  • Memory model:
      • registers, memories with width ranges etc.
  • Resource model:
      • specifies available hardware (like FUs, …)
  • Instruction set model:
      • instruction word coding, spec. of valid operands and addressing modes; written in assembly syntax
  • Behavioral model:
      • abstraction of hardware structures; notion of state (for simulation); abstraction level can vary
  • Timing model:
      • specifying the sequence of hardware operations and units
  • Micro-architectural model:
      • grouping of hardware operations to FUs; describes the details of the micro-architectural implementation of RTL components

slide-13
SLIDE 13
Design Flow and Tools

(Flow diagram; recoverable labels:)

  • Target Architecture, LISA Description, Language Compiler
  • Exploration: LISA C Compiler, LISA Assembler, LISA Linker, LISA Simulator; output/results: profiling data, performance data
  • Implementation: VHDL Description, Synthesis, Gate-level Model; output/results: area, power consumption, clock speed
  • Simulation Library

slide-14
SLIDE 14
Processor Architecture

  • Mixed behavioral/structural model:
      • based on C/C++
      • VLIW data-types
      • strong memory modeling capabilities (incl. caches)
      • include external IP (libraries)
  • Enriched by timing information:
      • clocked register behavior
      • operation scheduling
      • extensive pipeline model with predefined functions (stall, flush)

(source: LISATek)

slide-15
SLIDE 15
Instruction Set Description

  • Instruction word coding
      • variable widths
      • multiple words
      • distributed coding
  • Assembly syntax
      • mnemonic based syntax
      • algebraic (C-like) syntax
  • Instruction semantics
      • compiler semantics
  • Configurable instruction set information (power, etc.)

(source: LISATek)

slide-16
SLIDE 16
Generating HDL from LISA

(Diagram: the LISA models (Resource Model, Memory Model, Instruction Set, Behavioral Model, Timing Model, Micro Architecture) on one side; the HDL parts (Memory Configuration, Structure, Functional Units, Decoder, Pipeline Controller) on the other)

The model does not consist of predefined components -> it must be generated from the description:

  • Memory: directly derived
  • Structure (e.g. pipeline stages): derived from resource, behavioral and micro-architectural model
  • FUs: derived from architectural model (fully functional or empty entities)
  • Decoder: derived from info in instruction set model

slide-17
SLIDE 17
Architectural Features

Example for multiple instruction words and its implementation in LISA:

OPERATION Decode IN pipe.DC {
  ENUM InsnType = Type1, Type2, Type3;
  SWITCH (InsnType)
    CASE Type1: CODING { Decode == Decode_16 }
    CASE Type2: CODING { (Decode == Decode_32) && (Fetch == Operand) }
    CASE Type3: CODING { (Decode == Decode_48) && (Fetch == Operand1) && (Prefetch == Operand2) }
}

Non-orthogonal coding elements:

Instruction word: instruct | cond | mode | dest-reg | src_reg1 | src_reg2
Instruction: add, sub, mul, ld, sto
mode: short, long

OPERATION add {
  DECLARE { REFERENCE mode; }
  IF (mode == short) {
    BEHAVIOR { dest_lo = src1_lo + src2_lo; }
  } ELSE {
    BEHAVIOR {
      dest_lo = src1_lo + src2_lo;
      carry = dest_lo >> 16;
      dest_lo &= 0xFFFF;
      dest_hi = src1_hi + src2_hi + carry;
    }
  }
}
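
The short/long behavior of the add operation can be mirrored in plain C. The following is only a sketch: the `reg_pair` type and the function names are hypothetical, and leaving the high half untouched in short mode is an assumption (the slide does not say what happens to it).

```c
#include <stdint.h>

/* Hypothetical register pair: a 32-bit value held as two 16-bit halves. */
typedef struct {
    uint16_t lo;
    uint16_t hi;
} reg_pair;

/* Short mode: only the low halves are added. The high half of the
 * destination is left untouched (an assumption, not stated on the slide). */
static reg_pair add_short(reg_pair dest, reg_pair src1, reg_pair src2)
{
    dest.lo = (uint16_t)(src1.lo + src2.lo);
    return dest;
}

/* Long mode: add the low halves in a wider temporary, take the carry from
 * bit 16, mask the low half, then add the high halves plus the carry. */
static reg_pair add_long(reg_pair src1, reg_pair src2)
{
    uint32_t sum_lo = (uint32_t)src1.lo + (uint32_t)src2.lo;
    uint32_t carry  = sum_lo >> 16;
    reg_pair dest;
    dest.lo = (uint16_t)(sum_lo & 0xFFFFu);
    dest.hi = (uint16_t)(src1.hi + src2.hi + carry);
    return dest;
}

/* Reassemble the halves, e.g. to compare against a native 32-bit add. */
static uint32_t reg_pair_value(reg_pair r)
{
    return ((uint32_t)r.hi << 16) | r.lo;
}
```

For example, adding 0x0001FFFF and 0x00020001 in long mode carries out of the low half and yields 0x00040000, the same as a native 32-bit addition.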

slide-18
SLIDE 18
Modeling Memory

Features:

  • dynamic address mapping
  • user-defined memory modules
  • different levels of abstraction
  • C++ and SystemC simulation models
  • bus redirect allows external memory access
  • access statistics

(Block diagram: System Bus, On-Chip RAM, D$, I$, Write Buffer, Off-Chip RAM, L2 Cache; Memory Architecture Spec, LISA Spec., Memory Template Lib)

(source: LISATek)

slide-19
SLIDE 19
Simulation Approaches

  • Interpretive simulation: Application -> Memory -> Instruction Decode -> Execute (all at run-time)
  • Compiled simulation: Application -> Simulation Compiler -> Instruction Behavior (compile-time) -> Execute (run-time)
  • Combining both paradigms in LISATek: Just-In-Time Cache-Compiled Simulation (JIT-CCS)

(source: LISATek)
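
The idea of combining the two paradigms can be illustrated with a decode cache: an instruction is decoded the first time it is reached at run-time, and the decoded form is reused on every later visit. The sketch below is not LISATek's implementation; the toy 16-bit ISA, the `sim_state` layout and all names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PROG_WORDS 64  /* hypothetical program memory size */

/* Toy ISA (an assumption for illustration, not a LISA model):
 * bits 15..12 opcode, bits 11..8 destination register, bits 7..0 immediate. */
typedef enum { OP_LOADI = 0, OP_ADDI = 1, OP_HALT = 2 } opcode;

typedef struct {
    opcode  op;
    uint8_t rd;
    uint8_t imm;
} decoded_insn;

typedef struct {
    uint16_t     mem[PROG_WORDS];     /* program memory (instruction words) */
    decoded_insn cache[PROG_WORDS];   /* decoded form, filled on first use  */
    uint8_t      cached[PROG_WORDS];  /* 1 if cache[pc] is valid            */
    uint32_t     reg[16];
    unsigned     decode_count;        /* how often the decoder actually ran */
} sim_state;

/* The step an interpretive simulator repeats for every executed
 * instruction. Unknown opcodes are treated as HALT. */
static decoded_insn decode(uint16_t word)
{
    decoded_insn d;
    unsigned op = (unsigned)(word >> 12);
    d.op  = (op <= (unsigned)OP_HALT) ? (opcode)op : OP_HALT;
    d.rd  = (uint8_t)((word >> 8) & 0xFu);
    d.imm = (uint8_t)(word & 0xFFu);
    return d;
}

/* Run from address `start` until HALT or the end of program memory. */
static void sim_run(sim_state *s, size_t start)
{
    for (size_t pc = start; pc < PROG_WORDS; pc++) {
        if (!s->cached[pc]) {              /* first visit: decode just in time */
            s->cache[pc]  = decode(s->mem[pc]);
            s->cached[pc] = 1;
            s->decode_count++;
        }
        const decoded_insn *d = &s->cache[pc];   /* later visits: reuse */
        if (d->op == OP_HALT)
            return;
        if (d->op == OP_LOADI)
            s->reg[d->rd] = d->imm;
        else
            s->reg[d->rd] += d->imm;
    }
}
```

Running the same program twice on a zero-initialized `sim_state` leaves `decode_count` unchanged after the first run, which is the saving over purely interpretive simulation. A simulator built this way would also have to invalidate cache entries whenever program memory is written; that is omitted here.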

slide-20
SLIDE 20
Case Studies (by LISATek)

Texas Instruments TMS320C6201

  • cycle-accurate model
  • 9978 lines of LISA 2.0 and C code
  • design effort: 6 weeks

Analog Devices ADSP21xx

  • cycle-accurate model
  • 11000 lines of LISA 2.0 and C code
  • inexperienced, undergraduate student (knowledge of neither DSP nor LISA)
  • design effort: 8 weeks

Advanced RISC Machines ARM 7 Core

  • instruction-accurate model
  • 4000 lines of LISA 2.0 and C code
  • design effort: 2 weeks

(source: LISATek)

slide-21
SLIDE 21
The Tensilica Platform

  • Overview
  • Paradigm
  • Micro-Architectural Parameterization
  • Using TIE (Tensilica Instruction Extension)
  • Tools

slide-22
SLIDE 22
Paradigm and Main Features

IP (cores), parameterizable; TIE Instruction Set Extensions; customized, generated software tool flow

  • Integration
  • Simulation
  • Synthesis-ready

  • Combines the core-based design paradigm on the one side with ASIP features (application specific instruction set processor) on the other side
  • User can adapt core parameters and define own instructions (if necessary)
  • two levels of customization
  • Status: commercial product

slide-23
SLIDE 23
Processor Core Hardware Features (based on T1040)

  • 16/24-bit Instruction Set Architecture (ISA): non-fixed length provides dense code
  • 200 MHz to 320 MHz performance @ 0.18 micron
  • Core: 0.7 mm² @ 0.18 micron
  • Power consumption: 0.4 mW/MHz @ 0.18 micron, 1.8 V
  • Performance @ 0.25 micron: 150 MHz … 250 MHz
  • Core: 1.0 mm² @ 0.25 micron
  • Power consumption: 0.8 mW/MHz @ 0.25 micron
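
A power figure given in mW/MHz translates into absolute power by multiplying with the clock frequency. The helper below is a trivial sketch of that arithmetic (the function name is hypothetical); it treats the per-MHz figure as constant over the frequency range, which is a simplifying assumption.

```c
/* Estimated core power in mW from a mW/MHz figure and a clock in MHz.
 * Assumes the per-MHz figure is constant over the frequency range. */
static double core_power_mw(double mw_per_mhz, double clock_mhz)
{
    return mw_per_mhz * clock_mhz;
}
```

With the 0.25 micron figures above, `core_power_mw(0.8, 150.0)` gives about 120 mW and `core_power_mw(0.8, 250.0)` about 200 mW.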

slide-24
SLIDE 24
Xtensa Architecture: base ISA, configurable, optional, extensible

Legend: Base ISA, configurable function, optional function, extensible

Blocks in the diagram:

  • Trace port, JTAG tap ctrl, on-chip debug
  • CoPro reg file, CoPro exec unit, TIE instructions
  • Window reg. file, ALU & address generation, MAC 16
  • Align & decode, processor controls, branch logic & instruction fetch
  • Instruction memory or cache & tag, data memory or cache & tag
  • Memory protection, write buffer, proc. interface
  • Special function registers, timers (0 to n), data & instruction address watch (0 to n), exception support, interrupt control

slide-25
SLIDE 25
Example: configuring Xtensa

  • Mixed level of configurability:
      • Fixed options that can be added or omitted (y/n)
      • Configuration of device parameters: sizes of caches, …
      • Fully customized extensions to the instruction set: TIE

Configuration form (excerpt):

  • Target options: geometry/process; frequency [MHz]; power saving y/n; register file impl.; …
  • Instruction options: 16-bit MAC y/n; 16-bit multiplier y/n; …
  • Types and # of interrupts; # of timers
  • Byte ordering: big/little endian
  • Registers for call windows: #
  • Processor interface (r/w): width
  • Instruction cache: associativity (e.g. direct); cache organization (e.g. 4096x32); tag RAM addr x data width (e.g. 512x19)
  • Debugging: full scan y/n; instruction addr. break reg. #
  • TIE extension: yes/no; TIE source (e.g. ./sample.tie)
  • Board support: …
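
The example cache figures (4096x32 data organization, 512x19 tag RAM) are consistent with each other under a few assumptions that the slide does not state: a direct-mapped cache, 32-byte lines, 32-bit addresses and one status (valid) bit per tag entry. The following sketch, with hypothetical names throughout, shows the arithmetic.

```c
#include <stdint.h>

typedef struct {
    uint32_t lines;       /* tag RAM depth: one entry per cache line */
    uint32_t tag_bits;    /* address bits stored as tag              */
    uint32_t entry_bits;  /* tag bits plus status bits               */
} tag_ram_geometry;

/* Integer log2 for powers of two. */
static uint32_t log2_u32(uint32_t x)
{
    uint32_t n = 0;
    while (x > 1u) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Tag RAM geometry of a direct-mapped cache. line_bytes, addr_bits and
 * status_bits are assumptions supplied by the caller; the slide only
 * gives the data organization and the resulting tag RAM size. */
static tag_ram_geometry direct_mapped_tag_ram(uint32_t data_words,
                                              uint32_t word_bits,
                                              uint32_t line_bytes,
                                              uint32_t addr_bits,
                                              uint32_t status_bits)
{
    uint32_t cache_bytes = data_words * (word_bits / 8u);
    tag_ram_geometry g;
    g.lines      = cache_bytes / line_bytes;
    g.tag_bits   = addr_bits - log2_u32(g.lines) - log2_u32(line_bytes);
    g.entry_bits = g.tag_bits + status_bits;
    return g;
}
```

`direct_mapped_tag_ram(4096, 32, 32, 32, 1)` yields 512 lines with 18 tag bits plus 1 status bit, i.e. a 512x19 tag RAM for a 16 KB cache.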

slide-26
SLIDE 26
Design Flow for building TIE instructions

(Flow chart, steps in the order given; diagram labels: native, xtensa)

  • Application in C/C++
  • Profiling
  • Identify potential new instructions
  • Implement TIE
  • Translate TIE to C/C++
  • Profile and analyze; OK?
  • Re-compile source with new instructions instead of function calls
  • Run ISS (cycle-accurate)
  • Build processor
  • Run on evaluation board; OK?

slide-27
SLIDE 27
Example for TIE coding

state ANS2 32
user_register ans2 0 {ANS2}

opcode FREXP op2 = 4'b0001 CUST0
opcode LDEXP op2 = 4'b0010 CUST0

iclass frexp {FREXP} {out arr, in ars} {out ANS2}
iclass ldexp {LDEXP} {out arr, in art, in ars}

reference FREXP {
  wire [31:0] temparr;
  wire [31:0] tempans2;
  assign temparr = (ars[30:23] == 0 && ars[22:0] == 0) ? 32'b0 : {ars[31], 8'b01111110, ars[22:0]};
  assign tempans2 = (ars[30:23] == 0 && ars[22:0] == 0) ? 32'b0 : {24'b0, ars[30:23] - 127 + 1};
  assign arr = (tempans2[0] == 1) ? {temparr[31], temparr[30:23] + 1'b1, temparr[22:0]} : temparr;
  assign ANS2 = (tempans2[0] == 1) ? (tempans2 - 1) >> 1 : tempans2 >> 1;
}

reference LDEXP {
  assign arr = {art[31], art[30:23] + ars, art[22:0]};
}

sqrt.tie
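
The two reference sections operate on the bit fields of an IEEE-754 single-precision value: FREXP rescales the input so that the remaining exponent is even and stores half of it in ANS2, and LDEXP adds a value back onto the exponent field. Judging by the file name, this appears to serve a square-root routine. The C model below is a sketch for following the bit manipulation, not generated TIE output; the function names are hypothetical, and treating the exponent arithmetic as an 8-bit field is an assumption, so it is only meant for inputs of 0.5 and above.

```c
#include <stdint.h>

typedef struct {
    uint32_t arr;   /* result register: rescaled float bit pattern */
    uint32_t ans2;  /* ANS2 state: halved exponent                 */
} frexp_result;

/* C model of "reference FREXP". Exponent arithmetic is masked to 8 bits
 * (an assumption about the field width). */
static frexp_result frexp_model(uint32_t ars)
{
    uint32_t sign = ars & 0x80000000u;
    uint32_t exp  = (ars >> 23) & 0xFFu;
    uint32_t mant = ars & 0x007FFFFFu;
    frexp_result r = {0u, 0u};

    if (exp == 0u && mant == 0u)
        return r;                      /* zero input: both outputs are zero */

    uint32_t temparr  = sign | (126u << 23) | mant;  /* exponent 8'b01111110 */
    uint32_t tempans2 = (exp - 127u + 1u) & 0xFFu;   /* unbiased exponent + 1 */

    if (tempans2 & 1u) {
        /* odd exponent: bump the exponent field by one, halve the rest */
        uint32_t e = (((temparr >> 23) & 0xFFu) + 1u) & 0xFFu;
        r.arr  = (temparr & 0x807FFFFFu) | (e << 23);
        r.ans2 = (tempans2 - 1u) >> 1;
    } else {
        r.arr  = temparr;
        r.ans2 = tempans2 >> 1;
    }
    return r;
}

/* C model of "reference LDEXP": add ars to the exponent field of art. */
static uint32_t ldexp_model(uint32_t art, uint32_t ars)
{
    uint32_t e = (((art >> 23) & 0xFFu) + ars) & 0xFFu;
    return (art & 0x807FFFFFu) | (e << 23);
}
```

For the input 8.0 (bit pattern 0x41000000) the model returns 0x3F000000 (0.5) with ANS2 = 2, matching 8 = 0.5 · 2^4; for 4.0 (0x40800000) it returns 0x3F800000 (1.0) with ANS2 = 1. `ldexp_model(0x3F800000, 1)` gives 0x40000000, i.e. 2.0.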

slide-28
SLIDE 28
TIE Compiler

  • After coding TIE, the compiler generates:
      • C functions equivalent to TIE -> functional verification through usage in C with the native software development environment
      • C function declarations -> allow new instructions to be coded as functions in application code
      • Dynamic shared libs to be used by other Xtensa SW
      • HDL description (Verilog) -> hardware needed to support TIE instructions (also gives a measure of HW costs and performance)
      • Synthesis scripts (for DC): allow the hardware to be synthesized automatically from the HDL description

slide-29
SLIDE 29
Pipeline and TIE instructions

  • Single-cycle access to memory
  • 5 stage pipeline: "memory" is often the critical path when it comes to high clock rates
  • User can choose to avoid placing logic after the memory result is read, to avoid creating a critical path -> delay result assignment by one cycle using multi-cycle instructions

(Pipeline diagram: fetch, decode, execute, memory, write-back; critical path marked at memory)

slide-30
SLIDE 30
Some TIE features

  • Schedule sections:
      • Specify implementation at the micro-architectural level (all others are ISA related)
      • Technique to define instructions that use more than one cycle (important for relaxing cycle time)
      • Example: one or more opcodes with the same I/O spec can be grouped into one schedule

slide-31
SLIDE 31
Co-verification capabilities

  • An external co-verification tool links SW and HW simulation models:
  • ISS: cycle-accurate, models pipeline, handles interrupts/exceptions
  • PIM: models processor interface signals; needs a standard HDL simulator
  • Third-party co-verification tool: e.g. Synopsys Eagle, …

(Block diagram: Program (assembly); Instruction Set Simulator (ISS), Bus Functional Model; Processor Interface Model (HDL); Processor Interface (PIF) lib; Bus Interface; Verilog/VHDL; SRAM Mem. HDL; External Co-verification Console)

slide-32
SLIDE 32
Example: Code Compression Project Henkel/Lekatsas using Tensilica's Xtensa

  • NEC project: code compression
      • Compress code and store in main memory
      • Decompress on-the-fly in just 1 cycle
  • Use Tensilica's framework: IP cores, simulation and synthesis capabilities

(Block diagram: External SRAM (compressed code), code decompression core, Cache, Tag, CPU core: Xtensa, PIF; parts labelled "Tensilica" and "NEC add-on")

[LeHe02]

slide-33
SLIDE 33
Example: Code Compression Project Henkel/Lekatsas using Tensilica's Xtensa

(Flow diagram; recoverable labels:)

  • Offline: Source Code, Compilation, Object Code, gather statistics, compress & build table, patch branch offsets, Compressed Software
  • Runtime: SRAM, compression stages (table, interface, tree logic), Cache, Xtensa

[LeHe02]
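
Branch offsets have to be patched because instructions no longer sit at their original addresses once the code is compressed. A minimal sketch of that bookkeeping follows; it is not the scheme of [LeHe02]. It assumes, hypothetically, that each instruction has a known compressed size in bytes, that branches are identified by instruction index, and that an offset is measured from the start of the branch to the start of its target.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill new_addr[0..n] with the start address of every instruction in the
 * compressed image; new_addr[n] receives the total compressed size.
 * new_addr must have room for n + 1 entries. */
static void build_address_map(const uint8_t *compressed_size, size_t n,
                              uint32_t *new_addr)
{
    uint32_t addr = 0;
    for (size_t i = 0; i < n; i++) {
        new_addr[i] = addr;
        addr += compressed_size[i];
    }
    new_addr[n] = addr;
}

/* Offset a branch at instruction index branch_index needs in order to
 * reach target_index in the compressed address space (assumed to be
 * measured between the start addresses of the two instructions). */
static int32_t patched_branch_offset(const uint32_t *new_addr,
                                     size_t branch_index,
                                     size_t target_index)
{
    return (int32_t)new_addr[target_index] - (int32_t)new_addr[branch_index];
}
```

With compressed sizes {2, 1, 3, 2} the address map is {0, 2, 3, 6, 8}, so a backward branch from instruction 3 to instruction 1 needs an offset of -4. Patching an offset changes the branch's own bits and may therefore change how well it compresses, which this sketch ignores.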

slide-34
SLIDE 34
The Improv Platform

  • Contents
  • Summary of main Features
  • Platform
  • DSP core
  • Misc

slide-35
SLIDE 35
Paradigm of the Improv Platform

IP (cores), parameterizable; add new instructions to core

  • Integration
  • Synthesis-ready (with standard ASIC design flow)

  • Modify/extend Instruction Set Architecture
  • Targets DSP oriented applications

slide-36
SLIDE 36
Improv Platform

  • The components of the platform
  • Composer:
      • Facilitates instruction extension and adding new instructions; interrupts; memory …
  • Generator:
      • Interprets configuration from composer
      • Generates RTL Verilog description as input for a standard ASIC design flow
      • The RTL instances generated are verified and read from a data base and not automatically generated

(Figure labels: software, hardware)

(source: Improv Systems)

slide-37
SLIDE 37
DSP Processor ("Jazz DSP")

  • Configurable VLIW architecture
      • user chooses data path (16-bit or 32-bit wide)
      • User can extend the ISA: either through an option (like inclusion or non-inclusion of multiplier, MAC etc.) or by defining custom instructions and custom functional units; dual-operand load instructions; 40/64-bit accumulator; …
      • Memory: instruction and data memories can be configured
      • Furthermore: interrupts (number and priority levels), system memory addresses etc.
  • Features:
      • Power: < 0.1 mW/MHz @ 0.13 micron
      • Chip size: < 0.25 mm² @ 0.13 micron
      • Performance: > 1000 MOPS @ 100 MHz
  • Misc architectural features:
      • Distributed register files to avoid I/O bottleneck from and to FUs
      • 2-stage instruction pipeline
      • Single-cycle execution units

slide-38
SLIDE 38
Jazz DSP architectural block diagram

(Block diagram not recoverable; one part is labelled "extensible")

(source: Improv Systems)

slide-39
SLIDE 39
Further IP of the Improv platform

(Block diagram labels: Bus wrapper, Memory Interface, Control Interface, Host Bus Interface)

  • Host Bus interface
      • memory mapped interface with address, data and control signals
      • Interfaces between host processor and Jazz system
      • Wrapper (bus specific) facilitates interfacing to standard bus systems like AHB, PCI, …
      • Control IF: manages task queuing via Qbus within the Jazz subsystem
  • Data Channel Interface
      • Provides data flow management between host bus IF and local IF
      • Contains configurable 1 to N full-duplex channels
      • User-defined data filters
      • Interfaces to standard buses like AMBA
  • Time slot interchange block
      • Managing voice data; interfaces to time-division-multiplexed PCM highways
      • Configurable: # channels, type/speed/width of each highway

slide-40
SLIDE 40
The ARC Platform

  • Contents
  • ARCTangent A4 processor core
  • SW extension

slide-41
SLIDE 41
Paradigm of the ARC Platform

Configuration of the ARC RISC core; add new instructions to core; extend ISS

  • Integration
  • Synthesis
  • simulation with ISS

  • Modify/extend Instruction Set Architecture
  • Configure/extend the core

slide-42
SLIDE 42
ARC core Architecture and Extensions

  • Instructions: 29 base, 69 extension => 98
  • Core registers: 36 base, 28 extension => 64
  • Auxiliary registers: 6 base, 2^32 extension => 2^22
  • Condition codes: 16 base, 16 extension => 32

(Block diagram: A4 processor core with Host interface, Interrupt controller, Load/Store, fetch, Core Reg., Ext. Reg., Aux. reg, Extensible Instruct., Cond code; user extensions: Extension registers, Auxiliary Registers, Extension Instructions, Ext. cond code)

slide-43
SLIDE 43
Adding instructions to the ARC RISC core

  • New opcode is added to the decode stage
  • New instruction connects to the two input operands
  • Altogether, four connections are required:
      • Two input ops, the output result, an instruction decode entry

(source: ARC)

slide-44
SLIDE 44
Extension from SW side

Call from C program:

#include <stdio.h>
#include <stdlib.h>

typedef int Integer;
typedef unsigned long ulong;

extern int lookup(int, int);
pragma Intrinsic(lookup, opcode=>0x1F, flags => "zncv");

volatile lookup_value;
ulong result;

void main()
{
    lookup_value = 0xFFFFFFFF;
    result = lookup(4, 0xFFFFFFFF);
}

Call from assembly code:

;-------------| main |-----------------
    .global main
main:
..LN2:
    mov %r0, -1
    st.di %r0, [%gp, lookup_value@sda]
..LN3:
    .extInstruction lookup,0x1f,0x00,SUFFIX_COND|SUFFIX_FLAG,SYNTAX_3OP
    lookup %r0, 4,-1

(source: ARC)

  • Metaware C/C++ compiler and assembler recognize extension instructions
  • From a user point of view it is integrated like an instruction call
  • During compilation the compiler replaces the intrinsic function with the new op-code
  • Compiler can optimize through sections that contain custom op-codes
  • The dynamic link library extends the ISS (instruction set simulator) => the ISS can simulate a program that contains the new op-codes
  • C/C++ compiler allows access to the auxiliary and core registers
slide-45
SLIDE 45
HP's Pico Platform

  • Overview
  • Design paradigm
  • Front-end optimization
  • Architecture
  • Synthesis and design flow

slide-46
SLIDE 46
Paradigm of Pico (NPA)

Software program in C -> Non-programmable Architecture (NPA)

  • identifying nested loops
  • Optimizations
  • Synthesis

  • PICO: Program In, Chip Out
  • Nested loops are identified and the hot spots are synthesized as hardwired, non-programmable hardware
  • Output is a co-processor that can be used in conjunction with a standard processor
  • Aim is a low-cost design, low-cost production and high performance
  • Status: research project

slide-47
SLIDE 47
  • J. Henkel, Univ. of Karlsruhe, WS04/05, 2004

http://ces.univ-karlsruhe.de

Design Paradigm

[Figure: design space exploration flow. Blocks: Workload & requirement spec; Design space exploration; Parameter ranges; Architecture framework; HW and SW simulators; Evaluation; Design; Design specification (parameters); Component lib; Parameterized design space; Pareto-optimal solutions. Inset plot: design points marked "x" on exec time vs. area axes]
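
The Pareto-optimal solutions of the exec time/area trade-off can be illustrated with a small filter. This is only a sketch: the `design_point` type, the function names, and any numbers fed into it are assumptions for illustration, not part of Pico. A point is kept if no other point is at least as good in both metrics and strictly better in one.

```c
#include <assert.h>
#include <stddef.h>

/* One evaluated design point; both metrics are "smaller is better". */
typedef struct {
    double exec_time;
    double area;
} design_point;

/* p dominates q if p is no worse in both metrics and strictly better in one. */
static int dominates(design_point p, design_point q) {
    return p.exec_time <= q.exec_time && p.area <= q.area &&
           (p.exec_time < q.exec_time || p.area < q.area);
}

/* Copies the Pareto-optimal points of in[0..n) to out (capacity n) and
 * returns how many were kept. O(n^2), acceptable for small sweeps. */
static size_t pareto_filter(const design_point *in, size_t n,
                            design_point *out) {
    assert(in != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int dominated = 0;
        for (size_t j = 0; j < n && !dominated; j++)
            if (j != i && dominates(in[j], in[i]))
                dominated = 1;
        if (!dominated)
            out[kept++] = in[i];
    }
    return kept;
}
```

In an exploration loop, each evaluated parameter set would contribute one `design_point`, and only the filtered set needs to be shown to the designer.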

slide-48
SLIDE 48

Front-end transformations & optim.

Original loop nest:

    for (j1 = 0; j1 < 8192; j1++) {
        y[j1] = 0;
        for (j2 = 0; j2 < 16; j2++)
            y[j1] = y[j1] + w[j2] * x[j1 + j2];
    }
    …

Loop nest with no code other than the inner "for" in the outer loop body:

    for (j1 = 0; j1 < 8192; j1++) {
        for (j2 = 0; j2 < 16; j2++)
            y[j1] = y[j1] + w[j2] * x[j1 + j2];
    }
    …

  • Create loops such that there is no code other than a single embedded “for” in the body of any but the innermost loop

  • Abstract architecture specification
  • Pipelined implementation of loop nest; several iterations may be active
  • Tiling and mapping
  • Dependence analysis
  • Iteration mapping
  • Iteration schedules
  • Load/store elimination for uniform dependence arrays
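
The slide does not show where the initialization of `y[j1]` goes once the nest is made perfect. One possible way, assumed here for illustration and not necessarily what Pico does, is to sink it into the innermost loop under a guard. The sketch below uses runtime sizes instead of the slide's 8192 and 16, and the helper `fir_nests_agree` is a hypothetical self-check.

```c
#include <assert.h>
#include <string.h>

/* Imperfect nest: y[j1] = 0 sits between the two loop headers.
 * x must hold n + taps - 1 elements. */
static void fir_imperfect(int *y, const int *w, const int *x,
                          int n, int taps) {
    assert(n > 0 && taps > 0);
    for (int j1 = 0; j1 < n; j1++) {
        y[j1] = 0;
        for (int j2 = 0; j2 < taps; j2++)
            y[j1] = y[j1] + w[j2] * x[j1 + j2];
    }
}

/* Perfect nest: the initialization is sunk into the innermost loop and
 * guarded so that it runs only on the first inner iteration. */
static void fir_perfect(int *y, const int *w, const int *x,
                        int n, int taps) {
    assert(n > 0 && taps > 0);
    for (int j1 = 0; j1 < n; j1++)
        for (int j2 = 0; j2 < taps; j2++) {
            if (j2 == 0)
                y[j1] = 0;
            y[j1] = y[j1] + w[j2] * x[j1 + j2];
        }
}

/* Returns 1 if both nests produce the same y for a small test signal. */
static int fir_nests_agree(void) {
    enum { N = 32, TAPS = 16 };
    int w[TAPS], x[N + TAPS - 1], ya[N], yb[N];
    for (int i = 0; i < TAPS; i++) w[i] = i - 7;
    for (int i = 0; i < N + TAPS - 1; i++) x[i] = (i * 3) % 11 - 5;
    memset(ya, 0xAB, sizeof ya);   /* poison: a missing init would show up */
    memset(yb, 0xCD, sizeof yb);
    fir_imperfect(ya, w, x, N, TAPS);
    fir_perfect(yb, w, x, N, TAPS);
    return memcmp(ya, yb, sizeof ya) == 0;
}
```

The guarded form keeps all work inside the innermost body, which is the property the bullet above asks for.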
slide-49
SLIDE 49

Generic Architecture of Pico

  • Generic architecture: VLIW, cache, and bus system are fixed
  • Processor array, array controller, local memories, interface to global memory, and control & data interface to host are synthesized
  • System generates structural, synthesizable RTL

[Figure: block diagram with VLIW Proc., Cache, Memory Contr., Systolic Array (elements 1 2 3 4 5), LocMem blocks, and Interface]

slide-50
SLIDE 50

Synthesis Phases: from SW loop to HDL

  • Analysis Phase: search for array accesses and dependencies
  • Mapping and Scheduling: nested loops are mapped to processors and scheduled
  • Loop Transformations: transform to an outer sequential loop and inner parallel loops
  • Optimization at operation-level: word-width minimization and classical optimizations
  • Processor Synthesis: allocation of FUs and scheduling of operations relative to loop start time
  • System Synthesis: allocation of processors and their interconnect; controller and data interfaces
  • Output: HDL description and cost estimation
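
Word-width minimization can be illustrated with a worst-case bound for an accumulation of products, such as a 16-tap multiply-accumulate loop. This is a simplified sketch under stated assumptions: operands are unsigned with known bit widths (the slides give no operand widths), and the function names are hypothetical. A real tool would typically use a more refined analysis.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent the unsigned value v (0 needs 1 bit). */
static unsigned bits_for(uint64_t v) {
    unsigned bits = 1;
    while (v >>= 1)
        bits++;
    return bits;
}

/* Worst-case accumulator width for a sum of `taps` products of an
 * a_bits-wide and a b_bits-wide unsigned operand. */
static unsigned accumulator_bits(unsigned a_bits, unsigned b_bits,
                                 unsigned taps) {
    /* The width limit keeps the 64-bit product below from overflowing. */
    assert(a_bits >= 1 && b_bits >= 1 && a_bits + b_bits <= 32 && taps >= 1);
    uint64_t max_a = (UINT64_C(1) << a_bits) - 1;
    uint64_t max_b = (UINT64_C(1) << b_bits) - 1;
    return bits_for(max_a * max_b * taps);
}
```

With assumed 8-bit operands and 16 taps, the largest possible sum is 16 × 255 × 255 = 1,040,400, so `accumulator_bits(8, 8, 16)` yields 20 bits rather than a default 32-bit datapath.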

slide-51
SLIDE 51

The whole Pico design flow

[Figure: the whole Pico design flow, with blocks for C-code, computation intensive code, VLIW code, VLIW compiler, VLIW synthesis, VLIW des-space exploration, NPA compiler, NPA synthesis, NPA des-space exploration, cache exploration, cache synthesis, cache hierarchy, VLIW SW, NPA interface, NPA param, cache param, and arch spec]

slide-52
SLIDE 52

Others: MIPS32 M4K

  • Recently announced 32-bit synthesizable RISC core
  • Aimed at future SOCs that integrate multiple cores on one chip to satisfy higher bandwidth in networking/broadband
  • Features instruction set extension (first in terms of an extension for an already existing industry standard)
  • Based on the enhanced MIPS32 architecture (faster and more flexible packet processing, …)
  • More features:
  • Power: 0.1 mW/MHz
  • Area: 0.3 mm^2
  • ~300 MHz
  • 0.13 micron process

slide-53
SLIDE 53

Research Activities Extensible Processor

  • Focal points
  • Automating the process of selecting an appropriate instruction set given an embedded application
  • Tasks: a) automatically selecting appropriate code segments b) automatically matching code segments to application-specific instructions c) automatically adapting parameters of embedded processors to embedded applications d) …
  • Some research approaches
  • Sun/Raghunathan/Jha => automated design space exploration with custom-designed application-specific instructions [Fei03]
  • Cheung/Parameswaran/Henkel => library-based approach to automatically selecting application-specific instructions given an embedded application [INS03]

… and many others (see the following conferences: DATE, DAC, ICCAD as well as the WASP workshop)

slide-54
SLIDE 54

Summary: Embedded Processor Platforms

  • Silicon complexity allows for complex, whole SOCs
  • Customizable Processor HW platforms come in different flavors:
      • Configurable processor cores: parameters
      • Extensible instruction set
      • Fully customized instruction set
  • Customizable Processor SW platforms:
      • Integration, optimization, estimation
      • Some platforms offer customized high-level tools that allow immediate evaluation of new parameters/instructions
  • Customizable Processor platforms typically require new silicon masks as opposed to FPGA-based platforms but are not limited in silicon size
  • Future: more complex platforms allowing heterogeneous multiprocessors on a single chip

slide-55
SLIDE 55

References and Sources

  • [Leu00] Leupers, R.; Code Optimization Techniques for Embedded Processors, Kluwer, 2000.
  • [LeHe02] Lekatsas, H.; Henkel, J.; Jakkula, V.; 1-cycle code decompression circuitry for performance increase of Xtensa-1040-based embedded systems, Proceedings of the IEEE 2002 Custom Integrated Circuits Conference, Pages: 9–12, 12-15 May 2002.
  • A. Hoffmann et al., “A Novel Methodology for the Design of Application-Specific Instruction Set Processors (ASIPs) Using a Machine Description Language”, IEEE Tr on CAD, Vol. 20, No. 11, Nov 2001.
  • O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, H. Meyr, “Architecture Implementation Using the Machine Description Language LISA”, Proc of 15th Int. Conference on VLSI Design, 2002.
  • [Fei03] Y. Fei, S. Ravi, A. Raghunathan, and N. Jha, “Energy estimation for extensible processors,” in DATE, 2003.
  • [INS03] Cheung, N.; Parameswaran, S.; Henkel, J.; INSIDE: INstruction selection/identification & design exploration for extensible processors, International Conference on Computer Aided Design (ICCAD-2003), 9-13 Nov. 2003, Pages: 291–297.
  • R. Schreiber, S. Aditya, S. Mahlke et al., “Pico-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators”, HPL-2001-249, Oct. 2001.
  • Improv Systems, http://www.improvsys.com
  • ARC, http://www.arccores.com
  • Tensilica, http://www.tensilica.com
  • LisaTek, http://www.lisatec.com
  • HP Pico: http://www.hpl.hp.com/research
  • http://www.siliconstrategies.com