Reconfigurable and Adaptive Systems (RAS) Lars Bauer, Jörg Henkel



SLIDE 1

Institut für Technische Informatik Chair for Embedded Systems - Prof. Dr. J. Henkel

Lars Bauer, Jörg Henkel

Lecture, Summer Semester (SS) 2014

Reconfigurable and Adaptive Systems (RAS)

2. Overview and Definitions
SLIDE 2
  • L. Bauer, CES, KIT, 2014

Definition: ‘Reconfigurable System’

“Computing System that can (partially) change the functionality of its hardware”

This definition implies the use of so-called Reconfigurable Hardware.

Other definitions exist (e.g. changing the software, or changing the task mapping and task scheduling), but within the scope of this lecture we focus on those approaches that use reconfigurable hardware.

Definition: ‘Time’

Here, we do not mean ‘time’ as a physical unit or a continuous flow.

  • Rather, in the following we need to distinguish three distinct points in time:

Design time: The system is specified, the architecture is designed, and the IC is taped out and sold to customers.

Compile time: Software (e.g. an application) is compiled for the IC that was fixed at design time; it can be simulated, profiled, etc.

Run time: The application executes on the IC and faces varying situations (input data, other applications, etc.).

  • Startup time: when the system boots, an application starts, etc.

src: Einstein, wissen.de
SLIDE 3
Definition: ‘Adaptive System’

“Adaptive computing refers to the capability of a computing system to autonomously adapt one or more of its properties (e.g. performance) during run time.”

Reconfigurable Hardware is one of the key paradigms that enable Adaptive Systems.

Not all reconfigurable systems are adaptive:

  • they don’t need to perform run-time reconfiguration
  • or they might only perform run-time reconfigurations that were predetermined at compile time

Not all adaptive systems rely on reconfigurable hardware (e.g. they might use clever software or OS/middleware to adapt their properties).

Definition: ‘Application Scenario’

A description of the particular workload of the system at a particular time:

Which tasks are executing?

How do these tasks depend on each other?

  • Data dependencies in a task graph
  • Resource conflicts, e.g. cache or peripherals

What are the deadlines for the tasks?

What are the priorities for the tasks?

What is the input data for the tasks?

What are the requirements of the tasks (computational power, energy consumption, demand for hardware accelerators, etc.)?
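The questions that make up an application scenario can be captured in a small data structure. The following is only a minimal sketch; the class names, field names, units and the priority convention are hypothetical and are not taken from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    deadline_ms: float                 # assumed unit: milliseconds
    priority: int                      # assumed convention: larger value = more important
    depends_on: list = field(default_factory=list)        # data dependencies (task names)
    shared_resources: list = field(default_factory=list)  # e.g. "cache", "uart"
    needs_accelerator: bool = False    # demand for a hardware accelerator


@dataclass
class ApplicationScenario:
    tasks: list

    def ready_tasks(self, finished):
        """Names of unfinished tasks whose data dependencies are all satisfied."""
        return [
            t.name
            for t in self.tasks
            if t.name not in finished and all(d in finished for d in t.depends_on)
        ]


# Hypothetical three-task video pipeline
scenario = ApplicationScenario(tasks=[
    Task("decode", deadline_ms=10, priority=2, needs_accelerator=True),
    Task("filter", deadline_ms=20, priority=1, depends_on=["decode"]),
    Task("display", deadline_ms=33, priority=3, depends_on=["filter"]),
])
```

A run-time system could query such a description to decide which accelerators to load next; how the lecture's systems represent scenarios internally is not stated here.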
SLIDE 4
Example for Using Reconfigurable Hardware to Accelerate Applications

Integrate the reconfigurable HW into the pipeline of a processor

Use it as a reconfigurable functional unit (RFU)

Further possibilities exist

[Figure: processor pipeline (PC, instruction memory, register file, sign extend, ALU, control, data memory access; pipeline registers IF/ID, ID/EXE, EXE/MEM, MEM/WB) extended by dynamic hardware: reconfigurable hardware with a Reconf. Manager, an interconnect bus, temporary storage for software emulation, and an arbiter in front of the data memory hierarchy]

Examples for Accelerators

The reconfigurable hardware can be used to implement application-specific accelerators on demand

The accelerators exploit:

  • parallelism (multiple independent operations are executed at the same time) and
  • operator chaining (multiple data-dependent operations are executed right after each other in the same cycle)

to achieve speedup

[Figure: two example accelerator data paths. The first processes IN0 and IN1 side by side, each through "+ 16", ">> 5" and comparisons with 255 and 0, producing Out0 and Out1. The second combines the inputs p0, p1, p2, q0, q1, q2, α, β, BS, UV, Ba, Bb using ABS, "<", ">> 1" and "+ 2" operators and produces X1 and X2.]
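The two effects can be illustrated with a software model of the first example data path. This is only a sketch: it assumes that the figure's labels (+ 16, >> 5, > 255, < 0) describe a rounding shift followed by clipping to the range 0..255, which the slide does not state explicitly; the function names are hypothetical.

```python
def clip_chain(x):
    """Four data-dependent operations: add, shift, upper clip, lower clip.

    On a CPU each step costs at least one instruction; in an accelerator
    they can be chained combinationally and finish within one cycle.
    """
    y = (x + 16) >> 5   # add rounding offset, then arithmetic shift right
    if y > 255:         # clip to the upper bound
        return 255
    if y < 0:           # clip to the lower bound
        return 0
    return y


def accelerator(in0, in1):
    """Two independent chains: hardware can evaluate them in parallel."""
    return clip_chain(in0), clip_chain(in1)
```

In software the two calls run one after the other; the point of the accelerator is that both chains exist side by side in hardware.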
SLIDE 5
Open Issue: How to perform hardware reconfiguration?

It is hardly possible to physically change the transistors (N-P doping etc.) and the metal layers after fabrication

Changing them fast (for run-time reconfiguration) and without manual effort (for self-reconfiguration) can be considered impossible

So, that’s it??

src: Fujitsu SuperSPARC II-85, cpu-world.com; Weller, pkelektronik.com

Fine-grained Reconfigurable Hardware: Look-up Table (LUT)

[Figure: look-up table with user I/O and configuration data]

src: Kalenteridis et al. “A complete platform and toolset for system implementation on fine-grained reconfigurable hardware”, Microprocessors and Microsystems 2004
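A look-up table can be modelled as a small memory: the configuration data fills the table, and the user inputs select one entry. The sketch below illustrates this idea in Python; the class name and the bit ordering (input 0 is the least significant index bit) are assumptions made for this example and do not describe a particular FPGA.

```python
class LUT:
    """k-input look-up table: 2**k configuration bits, one output bit."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    def evaluate(self, inputs):
        """Use the input bits as an address into the configuration bits."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = sum(bit << i for i, bit in enumerate(inputs))
        return self.config_bits[index]


# The same hardware structure implements different functions,
# depending only on the configuration data:
xor_lut = LUT(2, [0, 1, 1, 0])
and_lut = LUT(2, [0, 0, 0, 1])
```

Reconfiguring the LUT means nothing more than overwriting `config_bits`; the structure around it stays fixed.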
SLIDE 6
Building Larger Reconfigurable Blocks, so-called Slices and CLBs

src: Xilinx Virtex-II User Guide

Reconfigurable Connections

Two crossing lines are either connected or not

  • A control bit decides

Fine-grained: Each bit line can be configured independently

Coarse-grained: Multiple bit lines (a bus) are configured together

src: T.J. Todman et al.: “Reconfigurable computing: architectures and design methods”, IEEE Proc.-Comput. Digit. Tech., Vol. 152, No. 2, March 2005
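The granularity of the connections directly affects the amount of configuration data. The following sketch counts the control bits for a number of crossings; it assumes one control bit per switched line in the fine-grained case and one control bit per whole bus in the coarse-grained case, which is a simplification introduced for this example.

```python
def switch_config_bits(bus_width, num_crossings, coarse_grained):
    """Control bits needed to configure `num_crossings` crossing points.

    Fine-grained: every bit line has its own control bit.
    Coarse-grained: one control bit switches the whole bus (assumption).
    """
    bits_per_crossing = 1 if coarse_grained else bus_width
    return bits_per_crossing * num_crossings


fine = switch_config_bits(bus_width=16, num_crossings=10, coarse_grained=False)
coarse = switch_config_bits(bus_width=16, num_crossings=10, coarse_grained=True)
```

Under these assumptions the fine-grained variant needs 16 times as many control bits for 16-bit data, in exchange for being able to route every bit individually.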
SLIDE 7
Array of reconfigurable logic gates

  • CLB: Configurable Logic Block
  • PSM: Programmable Switch Matrix
  • Additionally: I/O Blocks, RAM Blocks, Multipliers, CPUs, …
  • Virtex-II 6000: 96x88 CLBs = 8,448 CLBs = 67,584 LUTs
  • Virtex-4 LX160: 192x88 CLBs = 16,896 CLBs = 135,168 LUTs

src: Xilinx Data Sheet 060 “Spartan and Spartan-XL Families […]”
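The CLB and LUT counts of the two devices follow from the array dimensions. The short check below assumes 8 LUTs per CLB; this value is derived from the quoted numbers (67,584 / 8,448) rather than stated on the slide.

```python
LUTS_PER_CLB = 8  # assumption: derived from the quoted figures, 67,584 / 8,448


def array_size(columns, rows):
    """Return (number of CLBs, number of LUTs) for a columns x rows CLB array."""
    clbs = columns * rows
    return clbs, clbs * LUTS_PER_CLB


virtex2_6000 = array_size(96, 88)
virtex4_lx160 = array_size(192, 88)
```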

Partial Run-time Reconfiguration

Logic Layer: performs the actual computation

Configuration Layer: determines the kind of computation that shall be performed

  • Is typically configured from external (off-chip) configuration memory
  • May also provide some configuration cache inside the FPGA

May allow reconfiguration of parts of the area → partial reconfiguration

  • This allows placing logic inside the FPGA that reconfigures another part of the FPGA → self-reconfiguration

[Figure: configuration layer, fed from off-chip configuration memory, on top of several logic-layer regions]
SLIDE 8
Internal Configuration Memory

PROM based (Fuse, Anti-Fuse)

  - Only writeable one time

(E)EPROM/Flash based (Floating-Gate)

  + Non-volatile → immediately configured after boot up
  + Configuration data not (necessarily) readable outside the FPGA → security; Intellectual Property (IP) protection
  + Low power consumption
  - Limited re-writeability (i.e. only good for a limited number of reconfigurations)
  - Slow write access → not suitable for run-time reconfiguration / self-reconfiguration

Internal Configuration Memory (cont’d)

SRAM based

  + Allows an arbitrary number of reconfigurations → good for prototyping
  + Fast reconfiguration → allows for run-time reconfiguration and self-reconfiguration
  - Needs to be reconfigured after every boot up → high power consumption
  - Security problem, as everyone can observe the configuration data (possible solution: bitstream encryption)

Hybrid (both EEPROM and SRAM on the die / in the package)

  + Allows fast run-time reconfiguration (SRAM) and does not need external configuration data after boot up (the EEPROM is automatically copied to the SRAM)
  - Still high power consumption during boot up
  - Needs larger chip area
SLIDE 9
Reconfiguration Time

Def.: ‘Bitstream’: configuration data that is copied to the configuration layer

Def.: ‘Partial Bitstream’ / ‘Full Bitstream’: a Bitstream that configures ‘only certain parts of’ / ‘the entire’ FPGA

A Bitstream can become rather large:

  • The size of a Full Bitstream depends on the FPGA, e.g. 2-20 MB for Virtex-6
  • The size of a Partial Bitstream depends on the design, e.g. 100 KB - 1 MB

Def.: ‘Reconfiguration Bandwidth’: the average bandwidth to copy the Bitstream from the external memory to the Configuration Layer (MB/s)

  • Virtex-II was specified for 50 MB/s and was demonstrated to work at 100 MB/s
  • More recent FPGAs allow faster reconfiguration bandwidths (e.g. 32 bit @ 100 MHz = 400 MB/s), but memory becomes the bottleneck
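The quoted figure of 400 MB/s follows from the port width and the clock frequency. The sketch below reproduces the arithmetic and models the bottleneck as the minimum of port and memory bandwidth; the function names and the memory bandwidth used in the example are illustrative assumptions.

```python
def port_bandwidth_mb_per_s(width_bits, clock_mhz):
    """Peak bandwidth of a configuration port that transfers one word per cycle."""
    return width_bits / 8 * clock_mhz   # bytes per cycle * million cycles per second


def effective_bandwidth_mb_per_s(port_mb_per_s, memory_mb_per_s):
    """The slower of the two sides limits the reconfiguration bandwidth."""
    return min(port_mb_per_s, memory_mb_per_s)


port = port_bandwidth_mb_per_s(32, 100)               # 32 bit @ 100 MHz
effective = effective_bandwidth_mb_per_s(port, 36.0)  # illustrative external memory speed
```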

Reconfiguration Time (cont’d)

Practically, the bandwidth is limited by the external memory

  • In the CES demonstrator for the RISPP project we used an external EEPROM that provides 36 MB/s on average
  • Alternatively, the system DDR RAM might be used to store the partial Bitstreams

→ Reduces the system’s memory performance during reconfiguration
SLIDE 10
Reconfiguration Time (cont’d)

Resulting reconfiguration time

  • Typically 1 ms - 10 ms if fast configuration ports are used

Note: 1 MB/s corresponds to 1 KB/ms

  • 100 KB @ 100 MB/s → 1 ms
  • 1 MB @ 200 MB/s → 5 ms
  • In the CES demonstrator typically 30-40 KB @ 36 MB/s → 0.8 - 1.1 ms

How long is 1 ms?

  • 100,000 cycles of a 100 MHz CPU
  • 1 million cycles of a 1 GHz CPU
  • Task switch time of a Linux 2.6 Kernel: ~1-10 ms (configurable)

→ It’s a rather long time for a CPU
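The numbers on this slide can be reproduced with two one-line formulas. The sketch below uses decimal units (1 MB = 1000 KB), matching the note that 1 MB/s corresponds to 1 KB/ms; the function names are chosen for this example only.

```python
def reconfiguration_time_ms(bitstream_kb, bandwidth_mb_per_s):
    """Time to copy a bitstream; 1 MB/s equals 1 KB/ms, so the units cancel."""
    return bitstream_kb / bandwidth_mb_per_s


def cpu_cycles(time_ms, clock_mhz):
    """CPU cycles that pass in the given time (1 ms at 1 MHz = 1000 cycles)."""
    return round(time_ms * clock_mhz * 1000)


t_small = reconfiguration_time_ms(100, 100)    # 100 KB @ 100 MB/s
t_large = reconfiguration_time_ms(1000, 200)   # 1 MB @ 200 MB/s
t_ces_min = reconfiguration_time_ms(30, 36)    # CES demonstrator, lower bound
t_ces_max = reconfiguration_time_ms(40, 36)    # CES demonstrator, upper bound
```

Multiplying such a time by the clock frequency shows how many instructions the CPU could have executed in software while waiting for the accelerator to be loaded.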

Reconfiguration Time (cont’d)

The rather slow reconfiguration time is due to the large amount of configuration data

For instance, the examples shown in the figure demand 8-16 bit Adds, Subs & Mults

Many LUTs need to be configured and connected to implement an Adder etc.

  • This leads to rather large partial bitstreams
  • It also affects the area requirements and the maximal frequency

[Figure: example accelerator data paths built from adders, subtractors, multipliers and shifters: a data path over the inputs IN0 to IN7 producing OUT; a Loop Filter with 32-bit inputs P and Q (p0 to p3, q0 to q3), 1-bit inputs X1 and X2, and 32-bit outputs P' and Q' (p0' to p3', q0' to q3'); DCT and HT data paths with the interface X00, X10, X20, X30 and Y00, Y10, Y20, Y30]
SLIDE 11
Alternative for LUTs: Arithmetic and Logic Units

If multiple arithmetic and/or logical operations need to be performed, then an ALU might outperform LUTs

Differences:

  + Significantly less configuration data
  + Smaller area footprint (for the operations it implements)
  + Higher frequency
  - Reduced efficiency when facing non-arithmetic operations or bit-level operations (resulting in increased area requirements and/or increased latency)

E.g. bit shuffling: how many cycles are needed to perform the operation shown in the figure with one ALU? Or: how many ALUs are needed to pipeline the operation?

[Figure: ALU with a control input (ctrl), a 16 bit input and a 16 bit output; bit-shuffling example]
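To get a feeling for the bit-shuffling question, the sketch below counts the ALU operations of a straightforward software bit permutation. The exact permutation from the slide's figure is not recoverable here, so a 16-bit bit reversal is used as a stand-in; the operation count assumes an ALU that performs one shift, AND or OR per cycle and uses the naive per-bit method, not an optimized one.

```python
def reverse_bits_alu(value, width=16):
    """Reverse the bit order of `value` using only shift, AND and OR.

    Returns (result, number of ALU operations used).
    """
    result = 0
    alu_ops = 0
    for i in range(width):
        bit = (value >> i) & 1               # shift + AND: 2 operations
        result |= bit << (width - 1 - i)     # shift + OR:  2 operations
        alu_ops += 4
    return result, alu_ops


reversed_value, ops = reverse_bits_alu(0x00F0)
```

In a fine-grained fabric a fixed permutation like this is just wiring between bit lines and needs no logic at all, which is the kind of inefficiency of ALUs the slide refers to.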

Coarse-grained Reconfigurable Array

2-D array of connected ALUs

Connections often limited to direct neighbors

Sometimes data may only move downwards (starting at the top of the array and ending at the bottom)

src: PACT’s XPP 64-A1 architecture
SLIDE 12
Connection Topology

Different connection topologies: Performance vs. Area

  • 2D Mesh (1-step Manhattan neighborhood, also called von Neumann neighborhood)
  • Extended Mesh (2-step orthogonal Manhattan neighborhood)
  • Full orthogonal neighborhood (each FU can access all other FUs in the same column and the same row)

src: B. Mei et al. “Architecture Exploration for a Reconfigurable Architecture Template”, IEEE Design and Test of Computers, vol. 22, no. 2, pp. 90-101, 2005
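The three topologies differ in how many other functional units (FUs) each FU can reach directly, which is what drives the performance/area trade-off. The sketch below enumerates the reachable FUs; it interprets the extended mesh as "up to 2 steps along the same row or column", which is an assumption about the slide's wording, and the topology names are identifiers chosen for this example.

```python
def neighbors(row, col, rows, cols, topology):
    """Set of (row, col) positions directly connected to the FU at (row, col)."""
    if topology == "mesh":
        steps = [1]
    elif topology == "extended_mesh":
        steps = [1, 2]                        # assumption: 1 or 2 orthogonal steps
    elif topology == "full_orthogonal":
        steps = range(1, max(rows, cols))     # whole row and whole column
    else:
        raise ValueError("unknown topology: " + topology)

    reachable = set()
    for s in steps:
        for dr, dc in ((s, 0), (-s, 0), (0, s), (0, -s)):
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:   # stay inside the array
                reachable.add((r, c))
    return reachable


# Number of direct connections of an inner FU in an 8x8 array
fan_out = {t: len(neighbors(3, 3, 8, 8, t))
           for t in ("mesh", "extended_mesh", "full_orthogonal")}
```

More direct connections shorten the routes between operations but require more wires and multiplexer inputs per FU.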

Heterogeneity: Special Units

Left side: number and placement of multipliers (more expensive than ALUs)

Bottom side: number and placement of load/store units

src: B. Mei et al. “Architecture Exploration for a Reconfigurable Architecture Template”, IEEE Design and Test of Computers, vol. 22, no. 2, pp. 90-101, 2005
SLIDE 13
Summary

Reconfigurable hardware can be used to implement accelerators

  • Connected to the CPU
  • Used by applications

Reconfigurable hardware is implemented by

  • Fine-grained structures (LUT array) or
  • Coarse-grained structures (ALU array)
  • They differ in their efficiency, depending on the required operations (bit/byte level vs. word level)

Configuration data and configuration time have to be kept in mind to exploit the advantages of run-time reconfiguration