
SLIDE 1

Institut für Technische Informatik Chair for Embedded Systems - Prof. Dr. J. Henkel

Lars Bauer, Jörg Henkel

Vorlesung im SS 2014

Reconfigurable and Adaptive Systems (RAS)

2. Overview and Definitions

SLIDE 2

L. Bauer, CES, KIT, 2014

Definition: ‘Reconfigurable System’

“Computing System that can (partially) change the functionality of its hardware”

This definition implies the use of so-called Reconfigurable Hardware

Other definitions exist (e.g. changing the software, changing the task mapping & task scheduling etc.), but in the scope of this lecture we will focus on those approaches that use reconfigurable hardware

Definition: ‘Time’

Here, we do not mean ‘time’ as a physical unit or a continuous flow

Rather, in the following we need to distinguish three distinct points in time

Design time: The system is specified, the architecture is designed and the IC is taped out and sold to customers

Compile time: Software (e.g. application) is compiled for the design-time fixed IC; it can be simulated, profiled etc.

Run time: The application executes on the IC and faces varying situations (input data, other applications etc.)

Startup time: when the system boots, application starts etc.

src: Einstein, wissen.de

SLIDE 3

Definition: ‘Adaptive System’

“Adaptive computing refers to the capability of a computing system to autonomously adapt one or more of its properties (e.g. performance) during run time.”

Reconfigurable Hardware is one of the key paradigms that enable Adaptive Systems

Not all reconfigurable systems are adaptive
  they don’t need to perform run-time reconfiguration
  or they might only perform compile-time predetermined run-time reconfigurations

Not all adaptive systems rely on reconfigurable hardware (e.g. they might use clever software or OS/middleware to adapt their properties)

Definition: ‘Application Scenario’

A description of the particular workload of the system at a particular time

Which tasks are executing?

How do these tasks depend on each other?
  Data dependencies in a task graph
  Resource conflicts, e.g. cache or periphery

What are the deadlines for the tasks?

What are the priorities for the tasks?

What is the input data for the tasks?

What are the requirements of the tasks (computational power, energy consumption, demand for hardware accelerators etc.)?
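
The questions that make up an application scenario can be read as the fields of a record. The following is only a sketch of that idea: all class and field names (`Task`, `ApplicationScenario`, `deadline_ms`, `data_deps`, ...) and the example tasks are hypothetical and not taken from the lecture.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    deadline_ms: float   # deadline of the task
    priority: int        # larger value = more important (assumed convention)
    requirements: dict = field(default_factory=dict)  # e.g. needed accelerators


@dataclass
class ApplicationScenario:
    tasks: dict = field(default_factory=dict)               # name -> Task
    data_deps: list = field(default_factory=list)           # (producer, consumer) edges
    resource_conflicts: list = field(default_factory=list)  # (task, task, resource)

    def add_task(self, task):
        self.tasks[task.name] = task

    def predecessors(self, name):
        """Tasks whose output the given task depends on (task graph edges)."""
        return [p for (p, c) in self.data_deps if c == name]


# Hypothetical scenario: two tasks with one data dependency and one conflict.
scenario = ApplicationScenario()
scenario.add_task(Task("decode", deadline_ms=33.0, priority=2))
scenario.add_task(Task("filter", deadline_ms=33.0, priority=1,
                       requirements={"accelerators": ["loop_filter"]}))
scenario.data_deps.append(("decode", "filter"))
scenario.resource_conflicts.append(("decode", "filter", "cache"))
```

A run-time system could compare such records for different points in time to decide whether an adaptation is worthwhile.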

SLIDE 4

Example for Using Reconfigurable Hardware to Accelerate Applications

Integrate the reconfigurable HW into the pipeline of a processor

Use it as a reconfigurable functional unit (RFU)

Further possibilities exist

[Figure: processor pipeline (PC, instruction memory, register file, sign extend, ALU, data memory access; pipeline registers IF/ID, ID/EXE, EXE/MEM, MEM/WB) extended with dynamic hardware: reconf. manager, reconf. hardware on an interconnect bus, arbiter to the data memory hierarchy, temporary storage for sw-emul.]

Examples for Accelerators

The reconf. hardware can be used to implement application-specific accelerators on demand

The accelerators exploit:
  parallelism (multiple independent operations are executed at the same time in parallel) and
  operator chaining (multiple data-dependent operations are executed right after each other in the same cycle)
to achieve speedup

[Figure: two example accelerator data paths. The first processes IN0 and IN1 in parallel (+ 16, >> 5, comparisons with 255 and 0) to produce Out0 and Out1. The second combines ABS operations on p0, p1, p2, q0, q1, q2 with comparisons against α and β.]
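
To make parallelism and operator chaining concrete, here is a software model of the first data path as far as it can be read from the figure labels (+ 16, >> 5, compare with 255 and 0). Reading it as "round, scale and clip to 0..255" is an assumption, and the function names are hypothetical.

```python
def clip_to_byte(x):
    """One chain of data-dependent operations: add, shift, two comparisons."""
    y = (x + 16) >> 5      # add offset, then arithmetic shift right by 5
    if y > 255:            # upper clipping bound
        return 255
    if y < 0:              # lower clipping bound
        return 0
    return y


def accelerator(in0, in1):
    # Software executes the two chains one after the other. In the
    # reconfigurable hardware both chains would sit side by side
    # (parallelism), and the operations inside each chain would be wired
    # back to back without intermediate register stages (operator chaining).
    return clip_to_byte(in0), clip_to_byte(in1)
```

For example, `accelerator(16, 10000)` yields `(1, 255)`: the first value is scaled, the second one is clipped at the upper bound.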

SLIDE 5

Open Issue: How to perform hardware reconfiguration?

It is hardly possible to physically change the transistors (N-P doping etc.) and the metal layers after fabrication

Changing them fast (for run-time reconfiguration) and without manual effort (for self-reconfiguration) can be considered impossible

So, that’s it??

src: Fujitsu SuperSPARC II-85, cpu-world.com; Weller, pkelektronik.com

Fine-grained Reconfigurable Hardware: Look-up Table (LUT)

[Figure: look-up table with user I/O and configuration data]

src: Kalenteridis et al. “A complete platform and toolset for system implementation on fine-grained reconfigurable hardware”, Microprocessors and Microsystems 2004
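
A look-up table can be modelled as a small memory: the configuration data is the truth table, and the user inputs form the address. The sketch below assumes a 4-input LUT; the class and method names are hypothetical.

```python
class LUT:
    """k-input look-up table: the configuration data is its truth table."""

    def __init__(self, k):
        self.k = k
        self.config = [0] * (2 ** k)   # one configuration bit per input combination

    def configure(self, bits):
        if len(bits) != 2 ** self.k:
            raise ValueError("need exactly 2**k configuration bits")
        self.config = list(bits)

    def evaluate(self, *inputs):
        # The user inputs form the address into the table (first input = LSB).
        address = sum(bit << i for i, bit in enumerate(inputs))
        return self.config[address]


lut = LUT(4)

# Configure as 4-input AND: only address 0b1111 holds a 1.
lut.configure([0] * 15 + [1])
and_result = lut.evaluate(1, 1, 1, 1)

# Reconfigure the same LUT as 4-input XOR (parity of the address bits).
lut.configure([bin(a).count("1") % 2 for a in range(16)])
xor_result = lut.evaluate(1, 0, 1, 1)
```

The same hardware computes a different function after `configure` is called again, which is the essence of reconfiguration at the LUT level.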

SLIDE 6


Building Larger Reconfigurable Blocks, so-called Slices and CLBs

src: Xilinx Virtex-II User Guide

Reconfigurable Connections

Two crossing lines are either connected or not
  A control bit decides

Fine Grained: Each bit line can be configured independently

Coarse Grained: Multiple bit lines (a bus) are configured together

src: T.J. Todman et al.: “Reconfigurable computing: architectures and design methods”, IEEE Proc.-Comput. Digit. Tech., Vol. 152, No. 2, March 2005
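
A rough model of such programmable crossings: a matrix of control bits decides which horizontal line drives which vertical line. The control bit count below is a simplification for illustration (one bit per crossing per independently configurable line); both function names are hypothetical.

```python
def route(control_bits, horizontal):
    """Vertical line i is driven by horizontal line j if control_bits[i][j]
    is set; None marks an unconnected vertical line."""
    out = []
    for row in control_bits:
        driven = [horizontal[j] for j, bit in enumerate(row) if bit]
        out.append(driven[0] if driven else None)
    return out


def control_bits_needed(bus_width, crossings, fine_grained):
    # Fine-grained: every bit line of the bus has its own control bit.
    # Coarse-grained: one control bit switches the whole bus at a crossing.
    return crossings * (bus_width if fine_grained else 1)


swapped = route([[0, 1], [1, 0]], ["a", "b"])      # crossed connection
fine = control_bits_needed(16, 4, fine_grained=True)
coarse = control_bits_needed(16, 4, fine_grained=False)
```

Under this model a 16-bit bus with 4 crossings needs 64 control bits when configured per bit line and 4 when configured as a bus, which hints at why coarse-grained fabrics need less configuration data.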

SLIDE 7

Array of reconfigurable logic gates

CLB: Configurable Logic Block

PSM: Programmable Switch Matrix

Additionally: I/O Blocks, RAM Blocks, Multipliers, CPUs, …

Virtex-II 6000: 96 x 88 CLBs = 8,448 CLBs, 67,584 LUTs

Virtex-4 LX160: 192 x 88 CLBs = 16,896 CLBs, 135,168 LUTs

src: Xilinx Data Sheet 060 “Spartan and Spartan-XL Families […]”
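
The device figures can be cross-checked with a few lines of arithmetic. The array dimensions and LUT totals are the ones given on the slide; the LUTs-per-CLB ratio is derived from them and not stated there.

```python
# (array dimension 1, array dimension 2, total LUTs) as given on the slide
devices = {
    "Virtex-II 6000": (96, 88, 67_584),
    "Virtex-4 LX160": (192, 88, 135_168),
}


def clb_stats(dim_a, dim_b, luts):
    clbs = dim_a * dim_b          # number of CLBs in the array
    return clbs, luts // clbs     # (CLB count, LUTs per CLB)


stats = {name: clb_stats(*d) for name, d in devices.items()}
```

Both devices come out at 8 LUTs per CLB, with 8,448 and 16,896 CLBs respectively.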

Partial Run-time Reconfiguration

Logic Layer: performs the actual computation

Configuration Layer: determines the kind of computation that shall be performed
  Is typically configured from external memory
  May also provide some configuration cache inside the FPGA
  May allow reconfiguration of parts of the area: partial reconfiguration
  This allows placing logic inside the FPGA that reconfigures another part of the FPGA: self-reconfiguration

[Figure: configuration memory (off-chip), configuration layer and logic layer]

SLIDE 8

Internal Configuration memory

PROM based (Fuse, Anti-Fuse)
  - Only writeable one time

(E)EPROM/Flash based (Floating-Gate)
  + Non-volatile: immediately configured after boot up
  + Configuration data not (necessarily) readable outside the FPGA: security; Intellectual Property (IP) protection
  + Low power consumption
  - Limited re-writeability (i.e. only good for a limited number of reconfigurations)
  - Slow write access: not suitable for run-time reconfiguration / self-reconfiguration

Internal Configuration memory (cont’d)

SRAM based
  + Allows arbitrary number of reconfigurations: good for prototyping
  + Fast reconfiguration: allows for run-time reconfiguration and self-reconfiguration
  - Needs to be reconfigured after every boot up: high power consumption
  - Security problem, as everyone can observe the configuration data (possible solution: bitstream encryption)

Hybrid (both EEPROM and SRAM on the die / in the package)
  + Allows fast run-time reconfiguration (SRAM) and does not need external configuration data after boot up (automatically copying EEPROM to SRAM)
  - Still high power consumption during boot up
  - Needs larger chip area

SLIDE 9

Reconfiguration Time

Def.: ‘Bitstream’: configuration data that is copied to the configuration layer

Def.: ‘Partial Bitstream’ / ‘Full Bitstream’: a Bitstream that configures ‘only certain parts of’ / ‘the entire’ FPGA

Bitstreams can become rather large:
  The size of a Full Bitstream depends on the FPGA, e.g. 2-20 MB for Virtex-6
  The size of a Partial Bitstream depends on the design, e.g. 100 KB - 1 MB

Def.: ‘Reconfiguration Bandwidth’: the average bandwidth to copy the Bitstream from the external memory to the Configuration Layer (MB/s)
  Virtex-II was specified for 50 MB/s and was demonstrated to work at 100 MB/s
  More recent FPGAs allow faster reconf. bandwidths (e.g. 32 bit @ 100 MHz = 400 MB/s), but memory becomes the bottleneck
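
The "32 bit @ 100 MHz = 400 MB/s" figure and the memory bottleneck can be reproduced with a short calculation. This is a sketch with hypothetical function names; it assumes 1 MB = 10^6 bytes and models the bottleneck simply as the minimum of port and memory bandwidth.

```python
def port_bandwidth_mb_s(width_bits, freq_mhz):
    """Peak bandwidth of a configuration port in MB/s (1 MB = 10**6 bytes)."""
    return width_bits / 8 * freq_mhz


def effective_bandwidth_mb_s(port_mb_s, memory_mb_s):
    # The slower of configuration port and bitstream memory limits the transfer.
    return min(port_mb_s, memory_mb_s)


port = port_bandwidth_mb_s(32, 100)
# 36 MB/s is the external EEPROM bandwidth reported for the CES demonstrator.
effective = effective_bandwidth_mb_s(port, 36)
```

Here `port` evaluates to 400 MB/s, while `effective` drops to 36 MB/s once the external memory is taken into account.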

Reconfiguration Time (cont’d)

Practically, the bandwidth is limited by the external memory

In the CES demonstrator for the RISPP project we used an external EEPROM that provides on avg. 36 MB/s

Alternatively, the system DDR RAM might be used to store the partial Bitstreams
  Reduces the system’s memory performance during reconfiguration

SLIDE 10

Reconfiguration Time (cont’d)

Resulting Reconfiguration time

Typically 1 ms - 10 ms if fast configuration ports are used
  Note: 1 MB/s corresponds to 1 KB/ms
  100 KB @ 100 MB/s: 1 ms
  1 MB @ 200 MB/s: 5 ms
  In CES demonstrator typically 30-40 KB @ 36 MB/s: 0.8 - 1.1 ms

How long is 1 ms?
  100,000 cycles of a 100 MHz CPU
  1 million cycles of a 1 GHz CPU
  Task switch time of a Linux 2.6 Kernel: ~1-10 ms (configurable)
  It’s a rather long time for a CPU
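
The reconfiguration times and cycle counts follow directly from the rule "1 MB/s corresponds to 1 KB/ms". A small sketch (function names are hypothetical, 1 MB is taken as 1000 KB):

```python
def reconfig_time_ms(bitstream_kb, bandwidth_mb_s):
    # 1 MB/s corresponds to 1 KB/ms, so KB divided by MB/s yields ms.
    return bitstream_kb / bandwidth_mb_s


def cpu_cycles(time_ms, freq_mhz):
    # 1 ms at 1 MHz is 1000 cycles.
    return time_ms * freq_mhz * 1000


t_small = reconfig_time_ms(100, 100)     # 100 KB @ 100 MB/s
t_large = reconfig_time_ms(1000, 200)    # 1 MB @ 200 MB/s
t_ces = (reconfig_time_ms(30, 36), reconfig_time_ms(40, 36))  # CES demonstrator
cycles_100mhz = cpu_cycles(1, 100)
cycles_1ghz = cpu_cycles(1, 1000)
```

The results are 1 ms, 5 ms and roughly 0.8 to 1.1 ms, and 1 ms corresponds to 100,000 cycles at 100 MHz or 1,000,000 cycles at 1 GHz.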

Reconfiguration Time (cont’d)

The rather slow reconfiguration time is due to the large amount of configuration data

For instance, the examples on the right demand 8-16 bit Adds, Subs & Mults

Many LUTs need to be configured and connected to implement an Adder etc.

This leads to rather large partial bitstreams

It also affects the area requirements and the maximal frequency

[Figure: example accelerator data paths built from +, −, x, << and >> operators: a data path combining IN0-IN7 into OUT; a Loop Filter with 32-bit inputs P and Q and 1-bit inputs X1 and X2 producing P' and Q'; DCT and HT data paths with interface X00, X10, X20, X30 / Y00, Y10, Y20, Y30]
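
To see why even a simple adder needs many configured LUTs, consider a ripple-carry adder built only from 3-input LUTs, one for the sum and one for the carry of each bit. This is a deliberately simplified model: real FPGAs provide dedicated carry logic, and the configuration bits for the routing between the LUTs are not counted. All names are hypothetical.

```python
def lut_eval(table, *inputs):
    """Look up the output bit; the inputs form the address (first input = LSB)."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return table[address]


# Truth tables of a full adder, indexed by a | b << 1 | cin << 2.
SUM_TABLE = [a ^ b ^ c for c in (0, 1) for b in (0, 1) for a in (0, 1)]
CARRY_TABLE = [(a & b) | (c & (a ^ b)) for c in (0, 1) for b in (0, 1) for a in (0, 1)]


def add(x, y, width):
    """Ripple-carry addition using two LUT look-ups per bit position."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= lut_eval(SUM_TABLE, a, b, carry) << i
        carry = lut_eval(CARRY_TABLE, a, b, carry)
    return result, carry


def lut_config_bits(width, luts_per_bit=2, bits_per_lut=8):
    # Only the LUT contents; routing configuration comes on top.
    return width * luts_per_bit * bits_per_lut
```

In this model a 16-bit adder already occupies 32 LUTs and 256 bits of LUT contents, before any routing configuration is counted.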

SLIDE 11

Alternative for LUTs: Arithmetic and Logic Units

If multiple arithmetic and/or logical operations need to be performed, then an ALU might outperform LUTs

Differences:
  + Significantly less configuration data
  + Smaller area footprint (for the operations it implements)
  + Higher frequency
  - Reduced efficiency when facing non-arithmetic operations or bit-level operations (resulting in increased area requirements and/or increased latency)

E.g. bit shuffling: how many cycles are needed to perform the operation shown on the right side with one ALU? Or: How many ALUs are needed to pipeline the operation?

[Figure: ALU with ctrl input, 16 bit input and 16 bit output]
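
The bit shuffling question can be explored with a small model. The concrete operation from the slide's figure is not recoverable here, so the sketch uses a 16-bit bit reversal as a stand-in permutation; the `CountingALU` class and its one-operation-per-cycle cost model are assumptions.

```python
class CountingALU:
    """Word-level ALU model that counts the operations it executes."""

    def __init__(self, width=16):
        self.mask = (1 << width) - 1
        self.ops = 0

    def _do(self, value):
        self.ops += 1
        return value & self.mask

    def shl(self, a, n): return self._do(a << n)
    def shr(self, a, n): return self._do(a >> n)
    def and_(self, a, b): return self._do(a & b)
    def or_(self, a, b): return self._do(a | b)


def reverse_bits_alu(alu, x):
    # Swap neighbouring bits, then bit pairs, nibbles and bytes.
    for shift, mask in ((1, 0x5555), (2, 0x3333), (4, 0x0F0F), (8, 0x00FF)):
        lo = alu.shl(alu.and_(x, mask), shift)
        hi = alu.and_(alu.shr(x, shift), mask)
        x = alu.or_(lo, hi)
    return x


def reverse_bits_wiring(x, width=16):
    # In a fine-grained fabric the same permutation is just wiring:
    # output bit (width - 1 - i) is connected to input bit i.
    return sum(((x >> i) & 1) << (width - 1 - i) for i in range(width))


alu = CountingALU()
reversed_word = reverse_bits_alu(alu, 0x0001)
```

Under this cost model a single ALU spends 20 operations on a permutation that a fine-grained fabric realizes with routing alone.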

Coarse-grained Reconfigurable Array

2-D array of connected ALUs

Connections often limited to direct neighbors

Sometimes data may only move downwards (starting at the top of the array and ending at the bottom)

src: PACT’s XPP 64-A1 architecture

SLIDE 12

Connection Topology

Different connection topologies: Performance vs. Area
  2D Mesh (1 step Manhattan neighborship, also called von Neumann neighborship)
  Extended Mesh (2 step orthogonal Manhattan neighborship)
  Full orthogonal neighborship (each FU can access all other FUs in the same column and the same row)

src: B. Mei et al. “Architecture Exploration for a Reconfigurable Architecture Template”, IEEE Design and Test of Computers, vol. 22, no. 2, pp. 90-101, 2005
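
The three topologies differ in how many functional units (FUs) one FU can reach directly. The sketch below counts these neighbors; the extended mesh is interpreted here as "up to two steps along the same row or column", which is an assumption about the slide's wording, and the function name and topology labels are hypothetical.

```python
def neighbors(row, col, rows, cols, topology):
    """Set of FUs that FU (row, col) can reach directly."""
    if topology == "mesh":
        steps = (-1, 1)
        cand = [(row + d, col) for d in steps] + [(row, col + d) for d in steps]
    elif topology == "extended_mesh":
        steps = (-2, -1, 1, 2)
        cand = [(row + d, col) for d in steps] + [(row, col + d) for d in steps]
    elif topology == "full_orthogonal":
        cand = [(r, col) for r in range(rows) if r != row]
        cand += [(row, c) for c in range(cols) if c != col]
    else:
        raise ValueError(topology)
    # Drop candidates that fall outside the array boundary.
    return {(r, c) for r, c in cand if 0 <= r < rows and 0 <= c < cols}


# Direct neighbors of an inner FU in an 8 x 8 array.
counts = {t: len(neighbors(3, 3, 8, 8, t))
          for t in ("mesh", "extended_mesh", "full_orthogonal")}
```

For an inner FU of an 8 x 8 array this gives 4, 8 and 14 direct neighbors. More connections shorten data paths (performance) but cost wiring and multiplexers (area).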

Heterogeneity: Special Units

Left side: Number and placement of Multipliers (more expensive than ALUs)

Bottom side: Number and placement of load/store units

src: B. Mei et al. “Architecture Exploration for a Reconfigurable Architecture Template”, IEEE Design and Test of Computers, vol. 22, no. 2, pp. 90-101, 2005

SLIDE 13

Summary

Reconfigurable Hardware can be used to implement accelerators
  Connected to CPU
  Used by applications

Reconfigurable hardware is implemented by
  Fine-grained structures (LUT array) or
  Coarse-grained structures (ALU array)
  They differ in their efficiency, depending on the required operations (bit/byte level vs. word level)

Configuration data and configuration time have to be kept in mind to exploit the advantages of run-time reconfiguration