Reconfigurable Computing – Reconfigurable Architectures, Chapter 3.2. Prof. Dr.-Ing. Jürgen Teich, Lehrstuhl für Hardware-Software-Co-Design.


SLIDE 1 – Reconfigurable Computing: Reconfigurable Architectures, Chapter 3.2

Prof. Dr.-Ing. Jürgen Teich
Lehrstuhl für Hardware-Software-Co-Design

SLIDE 2 – Coarse-Grained Reconfigurable Devices

SLIDE 3 – Recall:

1. Brief historical development (Estrin's Fix-Plus and the Rammig machine)
2. Programmable logic
   1. PALs and PLAs
   2. CPLDs
   3. FPGAs
      1. Technology
      2. Architecture by means of examples
         1. Actel
         2. Xilinx
         3. Altera
SLIDE 4 – Once again: General purpose vs. special purpose

With LUTs as function generators, FPGAs can be seen as general-purpose devices. Like any general-purpose device, they are flexible but inefficient: flexible because any n-variable Boolean function can be implemented using an n-input LUT; inefficient because complex functions must be implemented in many LUTs at different locations. The connections among the LUTs go through the routing matrix, which increases the signal delays, and a LUT implementation is usually slower than direct wiring.
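To make the LUT idea concrete, the following minimal Python sketch models an n-input LUT as a truth table of 2^n configuration bits; the class and helper names are illustrative, not taken from the slides.

```python
# Minimal model of an n-input LUT: the configuration is a truth table of 2^n bits.
# Illustrative sketch only; names (Lut, from_function) are not from the slides.

class Lut:
    def __init__(self, n, config_bits):
        assert len(config_bits) == 2 ** n, "a LUT with n inputs needs 2^n configuration bits"
        self.n = n
        self.config_bits = config_bits          # truth table, indexed by the input word

    @classmethod
    def from_function(cls, n, f):
        """Configure the LUT to realize an arbitrary n-variable Boolean function f."""
        table = []
        for word in range(2 ** n):
            inputs = [(word >> i) & 1 for i in range(n)]
            table.append(1 if f(*inputs) else 0)
        return cls(n, table)

    def eval(self, *inputs):
        """Look up the output: no logic is computed, only a memory read."""
        word = sum(bit << i for i, bit in enumerate(inputs))
        return self.config_bits[word]

# Any 3-variable function fits into a single 3-input LUT:
majority = Lut.from_function(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
print(majority.eval(1, 1, 0))   # -> 1
```

Configuring the LUT means filling the table; evaluating it is just a memory lookup, which is why any n-variable function fits, but each lookup costs more than a hard-wired gate.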

SLIDE 5 – Once again: General purpose vs. special purpose

Example: Implement a four-input Boolean function F, given as a sum of three product terms over A, B, C, and D, using 2-input LUTs (a minimal sketch follows below).

LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB. Connections inside an LB are efficient (direct); connections outside the LBs are slow (they go through the connection matrix).

[Figure: LUT implementation of F; the product terms built from (A, B, D), (A, C, D), and (A, B, C) are combined through the connection matrix]
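Continuing the sketch above, this illustrative Python fragment builds a sum-of-products function out of 2-input LUTs only. The concrete function F = A·B·D + A·C·D + A·B·C is an assumed stand-in for the slide's example; the point is that each product term occupies its own logic block, and the partial results must cross the slow connection matrix before they can be ORed together.

```python
# Each 2-input LUT is just a 4-entry truth table; AND and OR are two possible configurations.
LUT2_AND = [0, 0, 0, 1]        # output for inputs (a, b) = 00, 01, 10, 11
LUT2_OR  = [0, 1, 1, 1]

def lut2(table, a, b):
    return table[(a << 1) | b]

def F(a, b, c, d):
    # Assumed example function: F = A*B*D + A*C*D + A*B*C
    # Each 3-input product term needs two cascaded 2-input LUTs (one logic block = 2 LUTs).
    t1 = lut2(LUT2_AND, lut2(LUT2_AND, a, b), d)   # A*B*D  (LB 1, direct internal wiring)
    t2 = lut2(LUT2_AND, lut2(LUT2_AND, a, c), d)   # A*C*D  (LB 2)
    t3 = lut2(LUT2_AND, lut2(LUT2_AND, a, b), c)   # A*B*C  (LB 3)
    # The three partial terms come from different logic blocks, so combining them
    # requires routing through the connection matrix -- the slow part on a real FPGA.
    return lut2(LUT2_OR, lut2(LUT2_OR, t1, t2), t3)

print(F(1, 1, 0, 1))   # A*B*D = 1 -> F = 1
```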

SLIDE 6 – Once again: General purpose vs. special purpose

Idea: Implement frequently used blocks as hard-core modules in the device.

[Figure: the same function F realized once with 2-input LUTs and the connection matrix, and once as a hard-wired block with inputs A, B, C, D]

SLIDE 7 – Coarse-grained reconfigurable devices

Coarse-grained devices overcome the inefficiency of FPGAs by providing efficiently implemented coarse-grained functional units (adders, multipliers, integrators, etc.).

  • Advantage: very efficient in terms of speed, since basic operators need no connections over connection matrices.
  • Advantage: direct wiring instead of LUT implementation.

A coarse-grained device is usually an array of identical programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication. Depending on the manufacturer, the functional units communicate via buses or are directly connected through programmable routing matrices.

SLIDE 8 – Coarse-grained reconfigurable devices

Memory exists between and inside the PEs, along with several other functional units depending on the manufacturer. A PE is usually a tiny 8-bit, 16-bit, or 32-bit ALU that can be configured to execute only one operation for a given period, i.e. until the next configuration (see the sketch below). Communication among the PEs can be either packet-oriented (on buses) or point-to-point (using crossbar switches). Since each vendor has its own implementation approach, the study is done by means of a few examples. Considered are: PACT XPP, Quicksilver ACM, NEC DRP, picoChip, and IPflex DAP/DNA.
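As a rough illustration of the PE concept described above, here is a minimal Python sketch of a word-wide processing element that is configured to a single operation and keeps it until the next configuration; the operation set, names, and 16-bit width are assumptions chosen for the example, not a vendor specification.

```python
# Minimal model of a coarse-grained processing element: a small ALU that is
# configured to exactly ONE operation and keeps it until the next configuration.
import operator

class ProcessingElement:
    OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

    def __init__(self, width=16):
        self.mask = (1 << width) - 1   # 8-, 16- or 32-bit datapath
        self.op = None

    def configure(self, op_name):
        """Reconfiguration step: select the single operation the PE will perform."""
        self.op = self.OPS[op_name]

    def execute(self, a, b):
        """Word-level execution -- no LUTs, no bit-level routing involved."""
        return self.op(a, b) & self.mask

pe = ProcessingElement(width=16)
pe.configure("mul")
print(pe.execute(300, 200))    # 60000, computed by the word-wide functional unit
# Until the next call to configure(), every invocation performs a multiplication.
```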

SLIDE 9 – The PACT XPP: Overall structure

XPP (Extreme Processing Platform) is a hierarchical structure consisting of:

  • An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs)
  • Processing Array Clusters: PAC = Processing Array (PA) + Configuration Manager (CM)
  • A hierarchical configuration tree: local CMs manage the configuration at the PA level
  • The local CMs access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device

A structural sketch of this hierarchy follows below.
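The following short Python sketch only mirrors the hierarchy listed above (PAEs grouped into a PA, a PAC pairing a PA with its CM, and an SCM on top); all class names and sizes are illustrative assumptions, not part of the XPP specification.

```python
# Rough structural sketch of the XPP hierarchy described above (names are illustrative).
from dataclasses import dataclass, field
from typing import List

@dataclass
class PAE:                      # Processing Array Element
    kind: str                   # e.g. "ALU" or "RAM"

@dataclass
class ConfigurationManager:     # local CM: configures its PA from local config memory
    local_config_memory: dict = field(default_factory=dict)

@dataclass
class PAC:                      # Processing Array Cluster = PA (array of PAEs) + CM
    pa: List[PAE]
    cm: ConfigurationManager

@dataclass
class XPP:                      # supervisor CM (SCM) + clusters; SCM loads external configs
    scm: ConfigurationManager
    pacs: List[PAC]

device = XPP(scm=ConfigurationManager(),
             pacs=[PAC(pa=[PAE("ALU") for _ in range(4)] + [PAE("RAM") for _ in range(2)],
                       cm=ConfigurationManager()) for _ in range(4)])
print(len(device.pacs), "PACs,", len(device.pacs[0].pa), "PAEs per PA")
```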

SLIDE 10 – The PACT XPP: The Processing Array Elements

The XPP structure also includes:
  • A communication network
  • Memory elements beside the PACs
  • A set of I/Os

The PAE: There are two types of PAEs, the ALU PAE and the RAM PAE.

The ALU PAE contains an ALU which can be configured to perform basic operations. The back register (BREG) provides routing channels for data and events from bottom to top; the forward register (FREG) provides routing channels from top to bottom.

[Figure: The ALU PAE]

SLIDE 11 – The PACT XPP: The Processing Array Elements

The dataflow register (DF-REG) can be used at the object outputs for buffering data. Input registers can be preloaded with configuration data.

The RAM PAE:

1. Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM.
2. Useful for data storage.
3. Data is written or read after an address is read at the RAM inputs.
4. BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE.

[Figure: The RAM PAE]

SLIDE 12 – The PACT XPP: Routing

Routing in PACT XPP:

  • Two independent networks: one for data transmission, the other for event transmission.
  • A configuration bus exists besides the data and event networks (very little information is available about it).
  • All objects can be connected to horizontal routing channels using switch objects.
  • Vertical routing channels are provided by the BREG and FREG: BREGs route from bottom to top, FREGs from top to bottom.

[Figure: vertical and horizontal routing channels]

SLIDE 13 – The PACT XPP: Interfaces

Interfaces are available inside the chip. Their number and type vary from device to device.

On the XPP42-A1 there are 6 internal interfaces consisting of:

  • 4 identical general-purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right)
  • One configuration manager (not shown in the picture)
  • One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes

[Figure: Interfaces]

SLIDE 14 – The PACT XPP: Interfaces

The I/O interfaces can operate independently of each other. There are two operation modes:
  • The RAM mode
  • The streaming mode

RAM mode:
  • Each port can access external static RAM (SRAM).
  • Control signals for the SRAM transactions are available.
  • No additional logic is required.
SLIDE 15 – The PACT XPP: Interfaces

Streaming mode:

  1. For high-speed streaming of data to and from the device.
  2. Each I/O element provides two bidirectional ports for data streaming.
  3. Handshake signals are used to synchronize data packets with the external port (a minimal handshake sketch follows below).
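The slides do not detail the streaming handshake itself, so the sketch below shows a generic valid/ready-style handshake of the kind such I/O ports typically use to synchronize packets with an external consumer; every name in it is an assumption for illustration.

```python
# Generic valid/ready handshake between an I/O port and an external sink.
# The actual XPP streaming protocol is not detailed on the slides; this only
# illustrates why handshaking synchronizes data packets with the external port.

def stream_out(packets, sink_ready):
    """Send packets only in cycles where the external sink asserts 'ready'."""
    cycle, sent = 0, []
    it = iter(packets)
    word = next(it, None)
    while word is not None:
        valid = True                       # producer has data -> assert valid
        if valid and sink_ready(cycle):    # transfer happens only when both agree
            sent.append((cycle, word))
            word = next(it, None)
        cycle += 1                         # otherwise the word is held (back-pressure)
    return sent

# External port that can accept a packet only every other cycle:
print(stream_out([0xA, 0xB, 0xC], sink_ready=lambda cyc: cyc % 2 == 0))
# -> [(0, 10), (2, 11), (4, 12)]
```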

SLIDE 16 – The Quicksilver ACM: Architecture

Structure: fractal-like
  • Hierarchical groups of four nodes with full communication among the nodes
  • 4 lower-level nodes are grouped into a higher-level node
  • The lowest level consists of 4 heterogeneous processing nodes
  • The connection is done via a Matrix Interconnect Network (MIN)
  • A system controller
  • Various I/Os
SLIDE 17 – The Quicksilver ACM: The processing node

An ACM processing node consists of:

  • An algorithmic engine, unique to each node type; it defines the operations the node performs.
  • The node memory, for data storage at the node level.
  • A node wrapper, common to all nodes; it hides the complexity of the heterogeneous architecture.

SLIDE 18 – The Quicksilver ACM: The processing node

Four types of nodes exist:
  • The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers.
  • The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations.
  • The Domain Bit Manipulation (DBM) node provides bit-manipulation and byte-oriented operations.
  • The External Memory Controller node provides DDR RAM, SRAM, memory random access, and DMA control interfaces.

[Figure: ACM PSN node]

SLIDE 19 – The Quicksilver ACM: The processing node

[Figures: ACM DBM node and ACM AXN node]

SLIDE 20 – The Quicksilver ACM: The processing node

The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes (a minimal interface sketch follows below). It features:

1. A MIN interface to support communication among nodes via the MIN network
2. A hardware task manager for task management at the node level
3. A DMA engine
4. Dedicated I/O circuitry
5. Memory controllers
6. Data distributors and aggregators

[Figure: The ACM node wrapper]
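As a software analogy for the node wrapper, the sketch below hides two different (hypothetical) algorithmic engines behind one identical entry point; the classes and methods are illustrative, not QuickSilver's actual API.

```python
# Sketch of the node-wrapper idea: heterogeneous algorithmic engines hidden
# behind one identical interface. Names are illustrative only.

class AlgorithmicEngine:
    def run(self, task, data): ...

class RiscEngine(AlgorithmicEngine):           # e.g. the PSN's scalar engine
    def run(self, task, data):
        return f"RISC executed {task} on {len(data)} words"

class MacEngine(AlgorithmicEngine):            # e.g. the AXN's MAC/ALU engine
    def run(self, task, data):
        return f"MAC pipeline executed {task} on {len(data)} words"

class NodeWrapper:
    """Common to all nodes: MIN interface, task manager, DMA -- only the engine differs."""
    def __init__(self, engine: AlgorithmicEngine):
        self.engine = engine
        self.node_memory = []                  # node-level data storage

    def receive_from_min(self, task, data):    # identical entry point for every node
        self.node_memory.extend(data)
        return self.engine.run(task, data)

for node in (NodeWrapper(RiscEngine()), NodeWrapper(MacEngine())):
    print(node.receive_from_min("fir_filter", [1, 2, 3, 4]))
```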

SLIDE 21 – The Quicksilver ACM: The MIN

The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip:

1. Hierarchically organized: the MIN at a given level connects many lower-level MINs.
2. The MIN root is used for off-chip communication and configuration.
3. Supports the communication among nodes.
4. Provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc.

[Figure: Example of an ACM chip configuration]

SLIDE 22 – The Quicksilver ACM: The System Controller

The system controller is in charge of system management:
  • Loads tasks into the nodes' ready-to-run queues for execution
  • Statically or dynamically sets the communication channels between the processing nodes
  • Carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis

The ACM chip also features a set of I/O interface controllers, such as PCI, PLL, SDRAM, and SRAM.

[Figure: the system controller and the interface controllers]

SLIDE 23 – The NEC DRP: Architecture

The NEC Dynamically Reconfigurable Processor (DRP) consists of:

  • A set of byte-oriented processing elements (PEs)
  • A programmable interconnection network for communication among the PEs
  • A sequencer that can be programmed as a finite state machine (FSM) to control the reconfiguration process
  • Memory around the device for storing configuration and computation data
  • Various interfaces
SLIDE 24 – The NEC DRP: The Processing Element

  • ALU: ordinary byte arithmetic/logic operations.
  • DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations.
  • An instruction dictates the ALU/DMU operations and the inter-PE connections.
  • Source/destination operands can come from/go to either the PE's own register file or other PEs (i.e., flow-through).
  • The instruction pointer (IP) is provided by the STC (state transition controller).

SLIDE 25 – The NEC DRP: Reconfiguration Process

  • The instruction pointer (IP) from the STC identifies a datapath plane.
  • Spatial computation using a customized datapath plane.
  • When the IP changes, the datapath plane switches instantaneously.
  • The PE instructions, taken as a collection, behave like an extreme VLIW.
  • Sequencing through instructions => dynamic reconfiguration (a minimal model follows below).

[Figure: multiple datapath planes (AES, 3DES, MD5, SHA-1, Compress); control selects a task by descriptor, data streams in and out, and dynamic reconfiguration switches between the planes]
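A minimal Python model of this mechanism is sketched below: each "datapath plane" is a set of per-PE operations, and changing the instruction pointer switches the whole array to another plane in one step. The operations and array size are assumptions chosen for illustration.

```python
# Minimal model of the DRP reconfiguration mechanism: each PE holds several
# instructions ("datapath planes"); the STC's instruction pointer selects which
# plane is active, and changing the IP switches the whole array at once.
import operator

PLANES = {                      # instruction pointer -> per-PE operations (one "plane")
    1: [operator.add, operator.add, min],     # e.g. an add/compare datapath
    2: [operator.sub, operator.mul, max],     # a different datapath, same PEs
}

class DrpArray:
    def __init__(self, planes):
        self.planes = planes
        self.ip = 1             # instruction pointer provided by the STC

    def set_ip(self, ip):
        """Changing the IP reconfigures every PE instantaneously."""
        self.ip = ip

    def execute(self, a, b, c, d):
        op0, op1, op2 = self.planes[self.ip]   # the active datapath plane
        return op2(op0(a, b), op1(c, d))       # spatial computation across the PEs

drp = DrpArray(PLANES)
print(drp.execute(1, 2, 3, 4))  # plane 1: min(1+2, 3+4) = 3
drp.set_ip(2)                   # dynamic reconfiguration in one step
print(drp.execute(1, 2, 3, 4))  # plane 2: max(1-2, 3*4) = 12
```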

SLIDE 26 – The NEC DRP: Reconfiguration Process

[Figure: a PE array with per-PE ALU/DMU instruction memories and IP = 1, configured with Add, Sel, and Cmp operations. Steps: (1) identify the instruction to be executed, (2) decode the instruction, (3) configure the ALU plane according to the instruction, (4) execute the operation.]

SLIDE 27 – The IPflex DAP/DNA: Structure

The IPflex DAP/DNA has the structure of a System on Chip (SoC) with an embedded FPGA. It features:
  • An integrated RISC core that carries out some computation and controls the reconfiguration process
  • A Distributed Network Architecture (DNA) matrix (a matrix of configurable operation units)
  • Communication over an internal bus
  • Different caches for data, instructions, and configuration
  • I/O and memory interface controllers

SLIDE 28 – The picoChip: Architecture

Hundreds of array elements, each with a versatile 16-bit processor and local data memory; a heterogeneous architecture with four types of elements optimized for different tasks (DSP or wireless functions). Interfaces for:
  • SRAM
  • Host communication
  • External systems
  • Inter-picoChip systems

SLIDE 29 – Device size

Device size is usually measured in terms of the number of transistors used in the device. This is not very helpful for reconfigurable devices, since the number of transistors is not equal to the number of usable resources on the chip. For example, FPGAs are among the most complex chips (more complex than Pentium processors), but their capacity is smaller than that of their ASIC counterparts. The capacity of an FPGA is therefore usually measured in terms of the number of gate equivalents a design needs in order to be implemented. A gate equivalent is a unit of measure: 1 gate equivalent = one 2-input NAND gate. A one-million-gate FPGA is able to implement the equivalent of a circuit containing one million 2-input NAND gates (a small capacity-check sketch follows below).
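To make the gate-equivalent measure concrete, here is a small Python sketch that estimates how much of a one-million-gate FPGA a design would occupy; the per-component NAND-gate counts are rough illustrative assumptions, not figures from the slides.

```python
# Gate-equivalent arithmetic: 1 gate equivalent (GE) = one 2-input NAND gate.
# The per-component GE figures below are rough illustrative assumptions.

GE_PER_COMPONENT = {            # assumed cost of some typical blocks, in GEs
    "full_adder": 9,
    "flip_flop": 6,
    "mux2": 4,
}

def design_size_in_ge(component_counts):
    """Total gate equivalents a design needs."""
    return sum(GE_PER_COMPONENT[name] * n for name, n in component_counts.items())

design = {"full_adder": 32, "flip_flop": 64, "mux2": 32}   # e.g. a small 32-bit datapath
needed = design_size_in_ge(design)

FPGA_CAPACITY_GE = 1_000_000    # "one-million-gate" FPGA
print(f"design needs {needed} GE "
      f"({100 * needed / FPGA_CAPACITY_GE:.3f}% of a 1M-gate FPGA)")
# -> design needs 800 GE (0.080% of a 1M-gate FPGA)
```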