- M. Scheer, Univ. of Karlsruhe, WS04/05, 2005
http://ces.univ-karlsruhe.de
Design and Architectures for Embedded Systems
Maik Scheer (Chair of Prof. Dr. J. Henkel)
CES - Chair for Embedded Systems
University of Karlsruhe, Germany
Optimization for:
Embedded Processor Design
Integration Hardware Design
Middleware, RTOS
System specification Design space exploration
System partitioning
Estimation&Simulation
Tape out Prototyping
embedded IP:
IC technology
Optimization
refine
PLDs
FPGAs
XPP
Dynamically Reconfigurable SoC Based on LEON + XPP
HoneyComb
8 inputs at address 0x1000
Output at bit 0 of address 0x2000
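#define INPORT_ADR  0x1000
#define OUTPORT_ADR 0x2000

/* volatile: memory-mapped ports must be re-read on every access */
volatile unsigned char *a = (volatile unsigned char *)INPORT_ADR;
volatile unsigned char *b = (volatile unsigned char *)OUTPORT_ADR;

int main(void)
{
    while (1) {
        if (*a == 0)
            *b = 1;   /* all 8 inputs are 0: set output bit 0 */
        else
            *b = 0;   /* at least one input is 1: clear output bit 0 */
    }
}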
Minimum 6 cycles for computation (ELSE)
Maximum 7 cycles for computation (IF)
Input changes are propagated to the output in minimum 5, maximum 12 cycles
L0: MOV R1, #a      ; address of a in R1
    MOV R2, #b      ; address of b in R2
L1: LD  R3, (R1)    ; inport bits in R3
    CMP R3, #0      ; IF condition
    BNE L3
L2: MOV R4, #1      ; IF branch
    JMP L4
L3: MOV R4, #0      ; ELSE branch
L4: ST  (R2), R4
    JMP L1
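The cycle figures for this loop can be reproduced by counting instructions. The sketch below is only a model under stated assumptions that the slides do not spell out: every instruction takes one cycle, the input is sampled at the start of LD, and the output becomes visible at the end of ST. The function names are hypothetical.

```c
/* Assumption: every instruction of the loop L1..L4 takes exactly one cycle. */
enum { LD = 1, CMP = 1, BNE = 1, MOV = 1, JMP = 1, ST = 1 };

/* One loop iteration through the IF branch: LD CMP BNE MOV JMP ST JMP. */
int cycles_if_path(void)
{
    return LD + CMP + BNE + MOV + JMP + ST + JMP;
}

/* One loop iteration through the ELSE branch: LD CMP BNE MOV ST JMP. */
int cycles_else_path(void)
{
    return LD + CMP + BNE + MOV + ST + JMP;
}

/* Best case: the input changes just before LD samples it and the shorter
   ELSE path runs. The output is written by ST, so the final JMP of the
   iteration is not part of the latency. */
int min_latency(void)
{
    return cycles_else_path() - JMP;
}

/* Worst case: the input changes just after LD sampled the old value on
   the longer IF path. That whole iteration passes with the stale value,
   then the next iteration runs the ELSE path up to and including ST. */
int max_latency(void)
{
    return cycles_if_path() + (cycles_else_path() - JMP);
}
```

Under these assumptions the model gives 7 and 6 cycles per iteration and a latency between 5 and 12 cycles; a different timing model for LD or ST would shift the bounds.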
Computation time depends on propagation delay time
Computation time is constant and predictable
Inport_0 Inport_1 Inport_2 Inport_3 Inport_4 Inport_5 Inport_6 Inport_7 Outport_0
Computation time depends on algorithm + available instruction set
Computation time can vary depending on the actual program path
New algorithms can be implemented as software functions
Available instruction set is fixed
Used silicon area is fixed
Each algorithm is implemented in hardware
Modifications of algorithms are not possible
Computation time depends on critical path
No fixed instruction set
Used silicon area depends on the number of implemented instructions (algorithms) and on the instructions themselves
Computation in time
Dimension of algorithms is time
Complex algorithms need more computation time than simple ones
Computation in space
Dimension of algorithms is space (silicon area)
Complex algorithms require more silicon area than simple ones
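Algorithm is implemented during: design time (hardware) or computation time (software)
Computation is distributed in: space or time
ASIC: implemented at design time, computation distributed in space
General purpose processor: implemented at computation time, computation distributed in time
Reconfigurable computing: implemented at computation time, computation distributed in space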
Reconfigurable computing involves chips or systems capable of modifying themselves on the fly, while running, to meet different application needs.
[Figure: flexibility versus performance for ASICs, CPUs & DSPs, and reconfigurable computing]
High computation power (near ASIC)
Better ratio between power consumption and computing power compared to general purpose processors
Flexible like general purpose processors
Suitable for data-flow oriented algorithms
Reconfiguration overhead
Utilization of hardware may be low, depending on the actual configuration
Difficult to map control-flow dominant structures
Silicon bridges can be destroyed through current
One time programmable
Two metal layers with dielectric in between (capacitor)
High resistance when not programmed
Dielectric layer can be destroyed through high voltage -> the metal layers are connected
One time programmable
CMOS transistor with isolated gate
High voltage tunnels electrons to the gate
Erasure is done through UV light
About 100 times reprogrammable
Same technique as EPROM
Erasure is done electrically
About 10,000 times reprogrammable
Flip-flops are used for storage
Stored configuration / data is lost when the system is powered down
-> External memory needed to store configuration
Can be reprogrammed an unlimited number of times
Only SRAM-based devices are suitable for reconfigurable computing
Granularity of reconfigurable logic is the size of the smallest functional unit that is addressed by the mapping tools.
Mapping tools are used to generate configurations for the reconfigurable device
Often granularity is defined through the data-path width
[Figure: an 8-bit adder, fine-grained versus coarse-grained]
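The 8-bit adder comparison can be illustrated in software. The sketch below is an analogy, not a description of any particular device: the fine-grained version chains eight 1-bit full adders (the sort of function a small logic cell could hold), while the coarse-grained version uses a single 8-bit operation, as an ALU-like unit would. All function names are hypothetical.

```c
#include <stdint.h>

/* Fine-grained unit: a 1-bit full adder.
   Returns the sum in bit 0 and the carry-out in bit 1. */
static unsigned full_adder(unsigned a, unsigned b, unsigned cin)
{
    unsigned sum   = a ^ b ^ cin;
    unsigned carry = (a & b) | (cin & (a ^ b));
    return sum | (carry << 1);
}

/* Fine-grained mapping: eight 1-bit units chained through their carries.
   The mapping tool has to place and connect every one of them. */
uint8_t add8_fine(uint8_t x, uint8_t y)
{
    unsigned carry = 0;
    uint8_t result = 0;
    for (int i = 0; i < 8; i++) {
        unsigned r = full_adder((x >> i) & 1u, (y >> i) & 1u, carry);
        result |= (uint8_t)((r & 1u) << i);
        carry = r >> 1;
    }
    return result;
}

/* Coarse-grained mapping: one 8-bit unit performs the whole addition. */
uint8_t add8_coarse(uint8_t x, uint8_t y)
{
    return (uint8_t)(x + y);
}
```

Both functions return the same result for every input pair; the difference lies in how many units must be configured and routed to obtain it.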
High reconfiguration overhead
Provides more flexibility in adapting it to the algorithm structure.
Low reconfiguration overhead
Combination of coarse- and fine-grained functional units
Combination of microprocessor(s) and reconfigurable logic
EEPROM-based
600 – 5000 usable gates
Up to 175 MHz
Up to 164 I/O pins
Up to 100 times reprogrammable
Logic Array Blocks (LAB)
Each LAB consists of 16 macrocells
Programmable Interconnect Array (PIA)
I/O Blocks [MAX03]
SRAM-based
5000 – 40,000 usable gates
Up to 224 I/O pins
Up to 784 CLBs
Configurable logic blocks (CLB)
Logic implemented in look-up tables (LUT)
Programmable switch matrix (PSM)
I/O Blocks [Spa02]
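How a look-up table implements logic can be sketched in a few lines. The model below assumes a 4-input LUT purely for illustration (the slides do not state the LUT width): the configuration word holds one output bit per input pattern, much as the SRAM cells of the device would. The type, function, and constant names are hypothetical.

```c
#include <stdint.h>

/* A 4-input LUT stores one output bit for each of the 16 input patterns.
   The 16-bit configuration word stands in for the configuration SRAM. */
typedef struct {
    uint16_t config;
} lut4_t;

/* The four inputs form an index that selects one configuration bit. */
unsigned lut4_eval(const lut4_t *lut,
                   unsigned in3, unsigned in2, unsigned in1, unsigned in0)
{
    unsigned index = ((in3 & 1u) << 3) | ((in2 & 1u) << 2) |
                     ((in1 & 1u) << 1) | (in0 & 1u);
    return (lut->config >> index) & 1u;
}

/* Example configurations: the same hardware, three different functions. */
static const lut4_t LUT_AND4 = { 0x8000 }; /* only pattern 1111 yields 1 */
static const lut4_t LUT_NOR4 = { 0x0001 }; /* only pattern 0000 yields 1 */
static const lut4_t LUT_XOR4 = { 0x6996 }; /* odd parity of the inputs   */
```

Changing the function means writing a different configuration word; the evaluation logic itself never changes.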
Six transistors required per interconnect point
Ten programmable interconnect points per PSM
Transistor state is stored in the configuration RAM
Single lines: routing between adjacent CLBs
Double lines: routing between next-but-one CLBs
Long lines: routing across the entire length or width of the array
I/O-Element
RAM-PAE
ALU-PAE
Configuration Manager
RAM-PAE contains RAM-Object instead of ALU-Object
Straightforward communication and data flow between ALL elements
IP model very modular and scalable
Events allow intermediate results within an algorithm to determine subsequent configurations
Multiple code sections are computed sequentially
Operation 2
Section 1 Section 2 Section 3
Scaling the array size allows the computation of several code sections simultaneously.
Performance increases linearly with array size
Section 1 Section 2 Section 3
Native Mapping Language (NML) compiler
Language optimized for parallel data flow
NML uses constructs similar to C
XPP Software Simulator
Simulates functionality of XPP device
Visualizer
Graphically shows status of data, events, and all array elements on a clock-cycle-by-clock-cycle basis
NML and XPP tutorials
Vectorizing C compiler
Design flow: Algorithm Design -> NML Coding -> Compiler, Router, Placer (XMAP) -> Software Simulation (XSIM) -> Visualizer / Debugger (XVIS) -> Result OK? -> Upload to XPP device
Displays Routing of Data and Events
Clock Accurate Data Flow Visualization
Interactively Step through Simulated Data Flow
[Figure: configurable SoC block diagram. Labels: ASIC, program ROM, LEON µC (LEON SPARC core), RAM, global SoC RAM, AMBA bus, FIFO bridge, PACT XPP processing array, local XPP RAM]
Management and upload of XPP configurations
Computation of control-flow dominant algorithm parts
LEON is not halted during calculation of XPP (XPP works decoupled from LEON)
Decoupling through dual-clocked FIFOs
Different clock frequencies possible
XPP clock can be programmed by LEON dynamically
System power consumption can be reduced
Number and width of inputs and outputs are scalable before synthesis [BTVB03]; [BTS03]; [BTS04]
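The decoupling idea can be pictured with a software ring buffer. This is only an analogy and not the hardware design from the cited papers: one side (standing in for LEON) advances the write index, the other side (standing in for XPP) advances the read index, and neither has to wait for the other unless the FIFO is full or empty. The depth and all names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 8u   /* hypothetical depth; must be a power of two */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned wr;   /* advanced only by the writer side */
    unsigned rd;   /* advanced only by the reader side */
} fifo_t;

void fifo_init(fifo_t *f)
{
    f->wr = 0;
    f->rd = 0;
}

/* The indices count freely; their difference is the fill level. */
bool fifo_full(const fifo_t *f)  { return f->wr - f->rd == FIFO_DEPTH; }
bool fifo_empty(const fifo_t *f) { return f->wr == f->rd; }

/* Writer side: returns false instead of blocking when the FIFO is full. */
bool fifo_push(fifo_t *f, uint32_t word)
{
    if (fifo_full(f))
        return false;
    f->data[f->wr % FIFO_DEPTH] = word;
    f->wr++;
    return true;
}

/* Reader side: returns false instead of blocking when the FIFO is empty. */
bool fifo_pop(fifo_t *f, uint32_t *word)
{
    if (fifo_empty(f))
        return false;
    *word = f->data[f->rd % FIFO_DEPTH];
    f->rd++;
    return true;
}
```

Because each index has a single owner, the two sides can run at different rates; a real dual-clocked FIFO additionally has to synchronize the indices across the two clock domains, which this sketch does not model.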
[Figure: five-stage integer pipeline (Fetch, Decode, Execute, Memory, Write) with I-cache, register file, alu/shift and mul/div units, and D-cache, extended with XPP config, XPP data, XPP event, and XPP-CLK/-IRQ connections to a 4x4 XPP array]
Additional load/store and read/write instructions are added to the pipeline
Instructions are implemented in assembler and can be used via C macros
C-code Host System XPP
RAM, I/O Host Interface
Standard C-Compiler + XPP Interface Calls
XPP Vectorizing C-Compiler
Irregular Code Sections Streaming Code Sections
Selection by annotation Sequence of Configurations
AMURHA
[TZB04]; [TOB04]; [TBE04]
Increased number of routable directions (vertical, 2x diagonal)
Increased bandwidth and reachability within the array
DPHC: Datapath Honeycomb cell
MEMHC: Memory Honeycomb cell
IOHC: Input/Output Honeycomb cell
[Figure: logical view and technological view of the honeycomb array, cells labelled with coordinates 0,0 to 4,6; legend: MEMHC, DPHC, IOHC, AMURHA]
Dedicated coarse-grained links
Extended multi-grained links
Routing-Unit
Modular cell structure
Cell type definition through functional module
Routing task: connecting selected output ports and the input ports of the functional modules
[Figure: honeycomb cell structure. Labels: routing unit with input links, output links, multiplexers, intermediate registers (R), and control FSM + registers; functional module with datapath / memory input and output; input ports and output ports]
Gathering of physical sub-channels to logical groups is possible (1 up to 32 bits)
Remaining sub-channels still usable
Partial usability of the same link (different sub-channels) by different configurations
Supported by adaptive runtime routing
Coarse-grained link Fine-grained link
Used output strips Input strips
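The grouping of sub-channels can be pictured as bit-mask bookkeeping. The sketch below assumes, for illustration only, that a link consists of 32 one-bit sub-channels and that a group may use any free sub-channels; the real allocation policy is not given on the slides, and all names are hypothetical.

```c
#include <stdint.h>

/* Assumption: one link has 32 one-bit sub-channels. Bit i of `used`
   is 1 while sub-channel i is occupied by some configuration. */
typedef struct {
    uint32_t used;
} link_t;

/* Reserve `width` (1..32) free sub-channels as one logical group.
   Returns a mask of the reserved sub-channels, or 0 if too few are free. */
uint32_t link_alloc(link_t *link, unsigned width)
{
    uint32_t group = 0;
    unsigned found = 0;

    if (width == 0 || width > 32)
        return 0;
    for (unsigned i = 0; i < 32 && found < width; i++) {
        uint32_t bit = (uint32_t)1 << i;
        if (!(link->used & bit)) {
            group |= bit;
            found++;
        }
    }
    if (found < width)
        return 0;              /* the link is left unchanged on failure */
    link->used |= group;
    return group;
}

/* Release a group so that its sub-channels become usable again. */
void link_free(link_t *link, uint32_t group)
{
    link->used &= ~group;
}
```

Two configurations can share the same link as long as their masks do not overlap, and whatever is left over stays available for a third.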
Simplified problem because of the static array configuration
Source Sink
Dynamic routing: bypass of used/defective HCs
Advantages: no static configurations (at compile time!); increased chip yield
[Figure: routes without and with a defect; legend: faulty cell, used cell, error-free functional unit, faulty functional unit, used functional unit, free functional unit; functional units A B C D before and after a new configuration]
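Bypassing used or defective cells at run time amounts to a path search over the cells that are still free. The sketch below uses a plain breadth-first search on a hexagonal grid; it is not the routing algorithm of the cited papers, and the array size, the axial coordinate scheme, and all names are assumptions made for illustration.

```c
#include <string.h>

#define W 5   /* hypothetical array width  */
#define H 7   /* hypothetical array height */

/* Six neighbours of a hexagonal cell in axial coordinates (an assumed
   scheme, not necessarily the numbering shown on the slides). */
static const int DQ[6] = { 1, -1, 0,  0,  1, -1 };
static const int DR[6] = { 0,  0, 1, -1, -1,  1 };

/* Breadth-first search from cell (sq,sr) to cell (tq,tr).
   blocked[r][q] != 0 marks a used or defective cell.
   Returns the number of hops, or -1 if no route exists. */
int route_hops(unsigned char blocked[H][W], int sq, int sr, int tq, int tr)
{
    int dist[H][W];
    int queue[W * H][2];
    int head = 0, tail = 0;

    memset(dist, -1, sizeof dist);          /* -1 means "not visited" */
    if (blocked[sr][sq] || blocked[tr][tq])
        return -1;
    dist[sr][sq] = 0;
    queue[tail][0] = sq;
    queue[tail][1] = sr;
    tail++;

    while (head < tail) {
        int q = queue[head][0];
        int r = queue[head][1];
        head++;
        if (q == tq && r == tr)
            return dist[r][q];
        for (int d = 0; d < 6; d++) {
            int nq = q + DQ[d];
            int nr = r + DR[d];
            if (nq < 0 || nq >= W || nr < 0 || nr >= H)
                continue;                   /* outside the array */
            if (blocked[nr][nq] || dist[nr][nq] != -1)
                continue;                   /* unusable or already seen */
            dist[nr][nq] = dist[r][q] + 1;
            queue[tail][0] = nq;
            queue[tail][1] = nr;
            tail++;
        }
    }
    return -1;
}
```

Marking a cell as blocked and searching again yields a detour when one exists, which is the effect the slide attributes to dynamic routing: a defective cell costs a longer route rather than the whole chip.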
http://www.altera.com/literature/ds/m7000.pdf
http://direct.xilinx.com/bvdocs/publications/ds060.pdf
system-on-chip project (CSoC): coarse-grain XPP-/Leon-based architecture integration; Design, Automation and Test in Europe Conference and Exhibition; 2003
Reconfigurable XPP-Arrays into Pipelined RISC Processors. VLSI-SOC 2003
asynchronous reconfigurable datapath integration; Integrated Circuits and Systems Design (SBCCI); 2003
Handling in Multi-grained Reconfigurable Hardware Architectures; Symposium on Integrated Circuits and Systems Design, 2004
Reconfigurable Hardware Architectures; Field-programmable Logic and its applications (FPL); 2004
rekonfigurierbaren Hardwarearchitektur; 17th International Conference on Architecture of Computing Systems (ARCS); 2004