
Institut für Technische Informatik Chair for Embedded Systems - Prof. Dr. J. Henkel

Lars Bauer, Jörg Henkel

Lecture in Summer Semester (SS) 2012

Reconfigurable and Adaptive Systems (RAS)

  • 4. Fine-Grained Reconfigurable Processors
  • L. Bauer, CES, KIT, 2012

RAS Topic Overview

  • 1. Introduction
  • 2. Overview
  • 3. Special Instructions
  • 4. Fine-Grained Reconfigurable Processors
    • PRISM
    • PRISM-II
    • Garp
    • MOLEN
    • PRISC
    • OneChip
    • OneChip98
    • XiRISC
    • XiSystem
    • New FPGA Architectures
  • 5. Configuration Prefetching
  • 6. Coarse-Grained Reconfigurable Processors
  • 7. Adaptive Reconfigurable Processors
  • 8. Fault-tolerance by Reconfiguration


4.1 PRISM: Processor Reconfiguration through Instruction Set Metamorphosis


PRISM-I system: external stand-alone processing unit
  • Two boards that are interconnected by a 16-bit bus
  • Processor board: Motorola 68010 processor running at 10 MHz
  • Accelerator board: four Xilinx 3090 FPGAs

Hardly run-time reconfigurable, i.e. it takes a little less than one second to reconfigure the FPGAs

PRISM Overview

src: [WAL+93]

  • Observation: an adaptive micro-architecture cannot be designed by the high-level programmer (limited expertise)
  • Solution: a High-Level Language compiler, the so-called configuration compiler
  • “The configuration compiler […] is a special compiler that accepts a high-level language program as input, and produces both a hardware image and a software image” [WAL+93]
  • Identifying hot spots (with manual interaction)
  • HW/SW partitioning
  • Generating SIs

PRISM Tool Chain

src: [WAL+93]

Hardware Limitations:

  • PRISM-I is the first implementation of the PRISM concept, i.e. it is a proof-of-concept
  • Slow reconfiguration speed (a little less than one second) under software control
  • FPGAs provide only low overall speed and capacity
  • Slow communication: between 45 and 75 clock cycles (at 10 MHz) to move operands to an SI and to collect the results
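Those cycle counts translate into a noticeable per-invocation overhead. A quick back-of-envelope check (cycle counts and clock rate are from the slide; the helper function is my own):

```python
def transfer_latency_us(cycles, clock_mhz=10.0):
    """Operand-transfer latency in microseconds at the given clock rate."""
    return cycles / clock_mhz

# 45-75 cycles at 10 MHz: 4.5-7.5 us spent just moving operands
# to the SI and collecting the results
best_case = transfer_latency_us(45)
worst_case = transfer_latency_us(75)
```

So even before the SI computes anything, each invocation pays several microseconds of communication cost.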

PRISM Limitations

Tool Chain Limitations:

  • State and global variables are not supported
  • At most 32 input bits and 32 output bits, respectively (may be distributed among multiple variables)
  • No support for variable loop counts (i.e. not supporting “for (i=0 to n)”, where n is variable)
  • Only single-cycle SI implementations
  • Limited support for C data types (e.g. no ‘float’) and C constructs (e.g. no ‘do-while’ or ‘switch-case’)

PRISM Limitations (cont’d)

Improved System: PRISM-II
  • Supports larger parts of the C language specification
  • Supports synthesis of sequential logic for execution of loops with variable loop counts (i.e. unknown at compile time)

4.2 PRISM-II

src: [WAL+93]

The parsing and optimization stage builds on top of GCC
  • GCC used a variation of a register transfer language at that time

The synthesis is done using ‘VHDL Designer’ or ‘X-BLOX’

PRISM-II Tool Chain

src: [WAL+93]

AMD Am29050 at 33 MHz, 28 MIPS

Coprocessor-like reconfigurable fabric

64-bit bus
  • Using the Address Bus and the Data Bus at the same time
  • Only 32-bit results are allowed

Tighter coupling
  • Only 30 ns data movement cost

PRISM-II Architecture

src: [WAL+93]

3 Xilinx 4010 FPGAs
  • An SI may use all 3 FPGAs

By utilizing data buffers, the FPGAs can work together or perform individual tasks

The global bus provides control signals to be shared between the FPGAs
  • used for providing global clocks or transferring state information between the FPGAs

Reported speedup:
  • 86x for simple bit reversal
  • 10x for computing a Hamming code

PRISM-II Architecture (cont’d)

src: [WAL+93]

Very early approach (1993) for a loosely coupled reconfigurable component
  • PRISM-I: external processing unit
  • PRISM-II: external coprocessor (to some degree)
  • Very slow coupling
  • Very slow reconfiguration time (range of seconds, not milliseconds)

Relies on very old FPGAs (from today's perspective)
  • Multiple FPGAs are combined to obtain a reasonable amount of reconfigurable fabric

PRISM Summary

4.3 Garp

  • Research effort on overcoming the limitations of reconfigurable HW
    • Reconfiguration overhead
    • Memory access from reconfigurable hardware
    • Binary compatibility of executables across versions of reconfigurable hardware
  • Core processor and reconf. fabric on the same die
    • Core processor: a single-issue MIPS-II
    • Reconfigurable fabric as coprocessor, but needs some modifications in the core processor
    • However, no actual chip was produced
  • Core processor and reconf. fabric share the same memory hierarchy
  • SW-controlled run-time reconfiguration
  • Reconfigurable fabric runs asynchronously to the core processor
  • Reconfigurable fabric: estimated 133 MHz

Garp Overview

src: [HW97]

Garp Reconfigurable Fabric

  • The reconf. fabric is a 2D mesh composed of entities called blocks
  • The number of columns is fixed to 24 (1 control and 23 logic blocks)
  • Some special-purpose blocks
  • The number of rows is implementation-specific and can grow in an upward-compatible fashion (expected to be at least 32)

src: [HW97]

Partially reconfiguring the reconfigurable fabric is supported
  • The basic reconfigurable unit is a row of 24 blocks, a so-called reconfigurable ALU
  • SI size is defined by the number of rows (1D structure)
  • A row is exclusively used by at most one SI, i.e. it is not allowed that some logic blocks in a row are used for SIi and some others in the same row are used for SIj
  • The fabric is blocked during reconfiguration
  • Supports run-time relocation (hardware translates from logical to physical row numbers)
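The logical-to-physical row translation can be pictured with a small software model (the class and method names are my own, not taken from [HW97]):

```python
class RowRelocator:
    """Toy model of Garp-style row relocation: each SI occupies a
    contiguous block of physical rows, and a translation table maps
    its logical rows (0-based within the SI) to physical rows."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.base = {}          # SI name -> first physical row
        self.next_free = 0

    def load(self, si, rows):
        """Place an SI into the next free block of physical rows."""
        if self.next_free + rows > self.num_rows:
            raise MemoryError("reconfigurable fabric is full")
        self.base[si] = self.next_free
        self.next_free += rows

    def translate(self, si, logical_row):
        """Logical row within the SI -> physical row in the fabric."""
        return self.base[si] + logical_row

fabric = RowRelocator(num_rows=32)
fabric.load("si_a", rows=4)
fabric.load("si_b", rows=6)   # si_b's logical row 0 lands on physical row 4
```

Because SIs only ever see logical rows, the same configuration bitstream can be placed at any free physical position, which is what makes relocation and binary compatibility across fabric sizes possible.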

Garp Reconfigurable Fabric (cont’d)

Memory accesses can be initiated by the reconfigurable fabric, but only through the central 16 columns

Extra blocks for overflow checking, rounding, control functions, wider data sizes, etc.

Garp Reconfigurable Fabric (cont’d)

src: [HW97]

Each logic block takes as many as four 2-bit inputs and produces up to two 2-bit outputs

Routing architecture:
  • 2-bit buses in horizontal and vertical columns
  • global & semi-global lines

Reconfigurable Blocks

src: [HW97]


Each logic block can be configured to perform

  • an arbitrary 4-input bitwise logical function,
  • a variable shift of up to 15 bits,
  • a 4-way select (multiplexer) function, or
  • a 3-input add/subtract/comparison function
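The first mode is essentially a lookup table applied independently to each bit lane. A minimal model of evaluating an arbitrary 4-input logical function on 2-bit values (my own sketch, not Garp's actual configuration encoding):

```python
def lut4(truth_table, a, b, c, d, width=2):
    """Apply an arbitrary 4-input logical function, given as a 16-entry
    truth table (bit i = output for input pattern i), to each of the
    `width` bit lanes independently."""
    out = 0
    for bit in range(width):
        idx = (((a >> bit) & 1) << 3) | (((b >> bit) & 1) << 2) \
            | (((c >> bit) & 1) << 1) | ((d >> bit) & 1)
        out |= ((truth_table >> idx) & 1) << bit
    return out

# Truth table for "a AND b" (c, d ignored): the output bit is 1 exactly
# for the input patterns where both the a-bit and the b-bit are set
AND_AB = sum(1 << i for i in range(16) if (i >> 3) & 1 and (i >> 2) & 1)
```

Any of the 2^16 possible 4-input functions can be configured this way by choosing a different 16-bit truth table.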

Garp made a first step to integrate specialized hardware blocks into a partially reconfigurable processor (not only LUTs)
  • Multi-bit adders, shifters, etc. are designed with ‘more hardware’ than in typical FPGAs of that time

Each logic block includes four bits of data state (i.e. registers), totaling 92 bits per row

Reconfigurable Blocks (cont’d)


src: [HW97]

Reconfigurable Routing

The routing architecture includes 2-bit horizontal and vertical lines of different lengths, segmented in a non-uniform way
  • Short horizontal segments spanning 11 blocks are tailored to multi-bit shifts across a row
  • Note: the figures show the routing for one row/column of logic blocks, respectively

Data input/output
  • Up to 128 bits per cycle to/from any 4 rows in the fabric
  • Up to 64 bits per cycle from the MIPS core register file to any 2 rows
  • Up to 32 bits per cycle from any row back to the MIPS core register file

Dedicated queues
  • Allowing read-ahead and write-behind

Data Access

src: [CHW00]

For fast reconfiguration, the reconfigurable fabric features a transparent distributed configuration cache
  • Holds the equivalent of 128 total rows of configurations
  • Distributed as 4 cached configuration rows for each physical row
  • Replacement follows a least-recently-used policy
  • Content can be pre-fetched

Reconfiguration time from external memory is 12 external bus cycles per row plus some startup time

Reconfiguration time from the integrated cache is 4 cycles (independent of the number of rows)
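These two latency figures can be captured in a small cost model (the per-row and cache constants are from the slide; the startup term is left as a parameter since its exact value is not given here):

```python
def reconfig_cycles(rows, cached, startup=0):
    """Garp reconfiguration latency: 12 external bus cycles per row
    (plus startup) from memory, or a flat 4 cycles from the
    distributed configuration cache."""
    if cached:
        return 4                 # independent of the number of rows
    return startup + 12 * rows   # external bus cycles

# A full 32-row configuration from external memory vs. from the cache:
from_memory = reconfig_cycles(32, cached=False)   # 384 bus cycles + startup
from_cache = reconfig_cycles(32, cached=True)     # 4 cycles
```

The roughly two-orders-of-magnitude gap is what makes the cache attractive for frequently re-loaded SIs.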

Reconfiguration Management

Reconfiguration
  • A block requires 64 configuration bits
  • Configuring 32 rows: 8 [Bytes/block] x 24 [blocks/row] x 32 [rows] = 6144 Bytes
  • Assuming 128-bit memory accesses, 384 sequential accesses are required
  • Approx. 50 microseconds (depending on the bus)
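The slide's arithmetic is easy to re-check:

```python
# Configuration volume for a full 32-row Garp configuration,
# using the constants stated on the slide
BITS_PER_BLOCK = 64
BLOCKS_PER_ROW = 24
ROWS = 32

total_bytes = (BITS_PER_BLOCK // 8) * BLOCKS_PER_ROW * ROWS   # 8 x 24 x 32
accesses_128bit = total_bytes * 8 // 128    # sequential 128-bit accesses
```

This yields 6144 bytes and 384 accesses, matching both the figure above and the 12-bus-cycles-per-row reconfiguration time stated on the previous slide.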

To accelerate context switching, the Garp array does not contain a large amount of embedded memory (if an SI needs some data twice, it typically has to load it twice)

Supports virtual memory, supervisor mode, and protected execution of multiple processes

Reported speedup (for hand-coded functions) compared to a 4-way superscalar UltraSparc 170:
  • 43x for an image median filter
  • 18.7x for DES encryption

Reconfiguration Management (cont’d)

The host has instruction set extensions (ISEs) to configure and control the reconfigurable fabric
  • Some instructions stall (interlock) until completion
  • Array execution is initialized by the number of clock cycles that shall be performed

Garp Programming

src: [HW97]


Example for loading and executing an SI:

Garp Programming (cont’d)

add3:   la     v0,config_add3   # v0 now contains pointer
                                # to config_add3 array
        gaconf v0               # Configure
        mtga   a0,$z0           # Transfer input data
        mtga   a1,$d0
        mtga   a2,$d1,2         # Step array 2 cycles;
                                # this implicitly starts the SI
        mfga   v0,$z1           # Collect result
        j      ra               # Return from subroutine

src: [HW97]

Uses the SUIF C compiler for the front-end

Accelerates non-nested loops

The compiler performs the following tasks:
  • Kernel identification for executing on reconfigurable hardware
  • Design of the optimum hardware for the kernels: this includes module selection, placement, and routing for the kernels
  • Modification of the application to organize the interaction between processor instructions and the reconfigurable instructions

Garp Tool Chain

The compiler uses a technique first developed for VLIW architectures, called hyperblock scheduling
  • These transformations can increase the available instruction-level parallelism (ILP)
  • A contiguous group of basic blocks is converted into a hyperblock
    • Potentially from alternative (if-then-else) control paths
    • Control flow inside a hyperblock is converted to predicated execution
  • Optimizes for ILP across common paths
    • Often-executed paths are synthesized to the reconf. fabric
    • Infeasible or rare paths are implemented on the core processor
  • The resulting reduced hyperblock is the basis for mapping
  • When execution takes an excluded path (i.e. not part of the synthesized logic), an exceptional exit occurs
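If-conversion, the core of hyperblock formation, can be illustrated in miniature: both sides of an if-then-else are always computed and a predicate selects the surviving result (a toy example of my own, not actual compiler output):

```python
def branchy(x):
    """Original control flow: only one path executes."""
    if x & 1:
        y = x + 3
    else:
        y = x >> 1
    return y

def predicated(x):
    """Hyperblock form: both paths execute unconditionally and a
    predicate selects which result is kept."""
    p = x & 1       # predicate
    t = x + 3       # then-path result, always computed
    f = x >> 1      # else-path result, always computed
    return t if p else f
```

In hardware, computing both paths plus a select removes the branch entirely, which is what lets the common path be synthesized as a straight-line datapath on the reconfigurable fabric.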

Garp Tool Chain (cont’d)

After hyperblock scheduling, interfacing instructions for the core processor are generated and the reduced hyperblock is transformed into a data flow graph (DFG)

The proprietary Gamma tool maps the DFG onto Garp using a tree-covering algorithm which preserves the datapath structure and supports features like carry chains
  • Gamma first splits the DFG into subtrees and then matches subtrees with module patterns which fit in one Garp row
  • During tree covering, the modules are also placed in the array
  • After some optimizations the configuration code is generated and assembled into binary form
  • Configuration bits are included and linked as constants with ordinary C compiled programs

Garp Tool Chain (cont’d)

  • Attaching the reconf. fabric as coprocessor (runs asynchronously), but needs some modifications in the core processor
    • Dedicated instructions for reconfig., data exchange, and SI execution
  • Proposed a dedicated fine-grained reconfigurable fabric (2-bit granularity) that is optimized for run-time reconfiguration
    • Partially reconfigurable in a 1D row structure
    • Optimized for 32-bit operations (size of a row)
    • Only 12 external memory requests per row
    • Configuration relocation
    • Binary compatibility for larger reconfigurable fabrics
    • Distributed configuration cache
  • Carefully designed data memory access, including cache access and dedicated memory queues for streaming
  • Tool chain that automatically creates configurations and interfaces
    • Based on hyperblocks, optimization of common paths, and predicated execution

Garp Summary

4.4 The MOLEN Polymorphic Processor

src: http://ce.et.tudelft.nl/MOLEN/

Reconfigurable coprocessor using a one-time instruction set extension
  • Inspired by Garp

Using a reconfigurable microcode (ρμ-code)
  • Difference to traditional microcode: it does not execute on fixed hardware facilities, but operates on facilities that the ρμ-code itself creates (i.e. reconfigures) to operate upon
  • Microcode to control the reconfiguration
  • Microcode to control the SI execution

Prototype uses a Virtex-II Pro FPGA
  • Using the embedded PowerPC as core processor

Overview

Loading and prefetching the configurations
  • p-set (partial set), c-set (complete set), set-prefetch

Executing and prefetching the microcode for SIs
  • execute, execute-prefetch

Load/store instructions
  • movtx (move to eXchange registers), movfx

Synchronisation instruction
  • break

Overview: One-time instruction-set extension

p-set <address>
  • Performs the configuration of common parts / frequently used functions

c-set <address>
  • Performs configuration of the remaining area that was not covered by p-set
  • Note: c-set is executed more often than p-set

In case no partially reconfigurable hardware is present, the c-set instruction alone can be utilized to perform all the necessary configurations

One-time instruction-set extension

Example (see figure): reconfigured once using p-set; reconfigured often using c-set

The <address> points to a memory location that contains Microcode
  • Controlling the reconfiguration or the SI execution
  • For reconfiguration, the Microcode corresponds to a bitstream (i.e. configuration data); execution is terminated when a specific end address (provided at the beginning of the Microcode) is reached
  • For SI execution, the Microcode could correspond to a state machine that controls the execution (not further specified/explained by the authors); terminated by a special ‘end_op’ Microcode

One-time instruction-set extension (cont’d)

set-prefetch <address>
  • Prefetches the Microcode that is responsible for reconfigurations into a local on-chip storage facility (the ρμ-code unit)
  • Diminishes microcode loading times

execute <address>
  • Triggers the execution of an SI

execute-prefetch <address>
  • Prefetches the Microcode that is responsible for SI execution

One-time instruction-set extension (cont’d)

An exchange register file is used for explicit parameter passing
  • Size is implementation-specific
    • 512 entries for their Virtex-II Pro prototype
  • The compiler performs the register allocation

movtx XREGa, Rb
  • Move the content of general-purpose register Rb to XREGa

movfx Ra, XREGb
  • Move the content of exchange register XREGb to general-purpose register Ra

The Virtex-II Pro prototype uses the dedicated PowerPC interface to the so-called Device Control Registers (DCR) to implement movtx and movfx

One-time instruction-set extension (cont’d)

break
  • Utilized to facilitate the parallel execution of the reconfigurable processor and the core processor
  • Synchronization mechanism that stalls the core processor until the parallel execution of the reconfigurable processor is completed

One-time instruction-set extension (cont’d)

Figure: implicit vs. explicit synchronization

src: [VWG+04]

All instructions use a simple format
  • Opcode: specifies the one-time instruction-set extensions (ensures that it does not overlap with the instructions of the core processor)
  • Address: the start address of the Microcode
  • R/P bit: interpretation of the address
    • Resident: an on-chip ROM for often-used Microcodes
    • Pageable: the off-chip RAM for other Microcodes

One-time instruction-set extension (cont’d)

src: [VWG+04]

The minimal ISA:
  • c-set, execute, movtx, movfx

This is essentially the smallest set of MOLEN instructions needed to provide a working scenario

By implementing the first two instructions (set/execute), any suitable SI implementation can be loaded and executed in the reconfigurable processor
  • Reconfiguration latencies can be hidden by scheduling the set instruction considerably earlier than the execute instruction

The movtx and movfx instructions are needed to provide the input/output interface between the SI and the remaining application code
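A toy software model of this set / movtx / execute / movfx protocol (register names and the 'add2' SI are hypothetical, purely for illustration of the instruction ordering):

```python
class MolenModel:
    """Minimal sketch of MOLEN-style parameter passing through an
    exchange register file; not the actual hardware interface."""

    def __init__(self, n_xregs=512):   # 512 XREGs in the Virtex-II Pro prototype
        self.gpr = {}
        self.xreg = [0] * n_xregs
        self.configured = None

    def movtx(self, xa, rb):
        self.xreg[xa] = self.gpr[rb]   # GPR -> exchange register

    def movfx(self, ra, xb):
        self.gpr[ra] = self.xreg[xb]   # exchange register -> GPR

    def c_set(self, si):
        self.configured = si           # load the SI's configuration

    def execute(self):
        # hypothetical SI: adds XREG0 and XREG1 into XREG2
        assert self.configured == "add2", "SI must be configured first"
        self.xreg[2] = self.xreg[0] + self.xreg[1]

m = MolenModel()
m.gpr["r4"], m.gpr["r5"] = 40, 2
m.c_set("add2")                 # scheduled early to hide reconfig latency
m.movtx(0, "r4")
m.movtx(1, "r5")
m.execute()
m.movfx("r3", 2)                # result back into a general-purpose register
```

The ordering in the usage lines mirrors the scheduling rule above: the set is issued well before the execute, with the movtx transfers filling the gap.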

Instruction Set Alternatives

The preferred ISA:
  • p-set, c-set, set-prefetch, execute, execute-prefetch, movtx, movfx

In order to reduce reconfiguration latencies, both p-set and c-set instructions are utilized
  • Then, the loading time of microcode will play an increasingly important role
  • Thus, the two prefetch instructions provide a way to diminish the microcode loading times by scheduling them well ahead of the moment that the microcode is needed

Parallel execution is initiated by a set/execute instruction and ended by a general-purpose instruction (same for the minimal ISA)

Instruction Set Alternatives (cont’d)

The complete ISA:
  • p-set, c-set, set-prefetch, execute, execute-prefetch, movtx, movfx, break

Involves all ISA instructions including the break instruction

In some applications, it might be performance-wise beneficial to execute instructions on the core processor and the reconfigurable processor in parallel
  • The break instruction provides a mechanism to synchronize the parallel execution of instructions by halting the execution of instructions following the break instruction
  • The sequence of instructions performed in parallel is initiated by an execute instruction or a set instruction

Instruction Set Alternatives (cont’d)

MOLEN Architecture Overview

src: [VWG+04]

Note: CCU means ‘Custom Configured Unit’, i.e. the reconfigurable fabric

An instruction is either issued to the core processor or to the reconfigurable coprocessor (the Arbiter decides)

Instruction Arbiter

src: [VWG+04]

In case of an exec/set etc. instruction, control signals from the Decode block are transmitted to the Control block, which performs the following steps:

1. Redirect the microcode location address to the ρμ-code unit
2. Generate an internal code representing either an execute or set instruction (“Ex/Set” signal) and deliver it to the ρμ-code unit
3. Initiate the operation by generating a “start reconf. operation” signal to the ρμ-code unit
4. Reserve the data memory control for the ρμ-code unit by generating a memory-occupy signal to the (data) memory controller
5. Enter a wait state until the signal “end of reconf. operation” arrives

Instruction Arbiter (cont’d)

An active “end of reconf. operation” signal initiates the following actions:

1. Data memory control is released back to the core processor
2. An instruction sequence is generated to ensure proper exiting of the core processor from the wait state
3. After exiting the wait state, the program execution continues with the instruction immediately following the last executed reconfigurable processor instruction
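The reserve/stall and release/resume handshake described in these steps can be condensed into a toy state machine (the state and signal names are my own shorthand, not from [VWG+04]):

```python
class Arbiter:
    """Toy model of the Arbiter's wait-state handshake."""

    def __init__(self):
        self.state = "core"           # core processor is running
        self.memory_owner = "core"    # who controls the data memory

    def issue_reconf_op(self):
        # steps 4-5 above: reserve data memory for the rho-mu-code
        # unit and send the core into the wait state
        self.memory_owner = "rmu"
        self.state = "wait"

    def end_of_operation(self):
        # release memory back to the core and resume execution after
        # the last reconfigurable processor instruction
        self.memory_owner = "core"
        self.state = "core"

arb = Arbiter()
arb.issue_reconf_op()
during = (arb.state, arb.memory_owner)   # core stalled, memory reserved
arb.end_of_operation()
after = (arb.state, arb.memory_owner)    # core running again
```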

Instruction Arbiter (cont’d)

  • “Arbiter Emulation Instructions” are multiplexed to the core processor instruction bus when the actual instruction is issued to the reconfigurable processor
    • Essentially drives the processor into a wait state
  • The Virtex-II Pro prototype uses the blr (branch to link register) instruction to activate the wait state
  • Before a set/execute/etc. instruction, the link register is initialized (using the bl (branch and link) instruction) to point to that instruction

Instruction Arbiter (cont’d)

Code Example:

        bl label
label:  c-set <addr>
        nop
        add
        ...

Executed as:

        bl label
label:  blr
        nop
        add
        ...

Control flow during the wait state:

label-4:  bl     # LinkReg <- label; branch to label
label:    blr    # branch target
label:    blr    # branch target
label:    blr    # branch target
          ...

To exit the wait state, the blrl (branch to link register and link) instruction is used
  • Updates the link register to point to the instruction that follows the branch
  • More care has to be taken when multiple set/exec instructions shall be executed in parallel

Instruction Arbiter (cont’d)

Executed as:

        bl label
label:  blrl
        nop
        add
        ...

Control flow when exiting the wait state:

label:    blrl   # branch target; LinkReg <- label+4
label:    blrl   # branch target
label:    blrl   # branch target (old link reg)
label+4:  nop    # branch target
label+8:  add
label+12: ...

The Sequencer mainly determines the microcode execution sequence

The ρ-Control Store is used as a storage facility for microcode

The ρμ-code loading unit is responsible for loading the reconfigurable microcode from the memory

ρμ-code Unit

src: [VWG+04]


The Sequencer is used to translate addresses of Microcode into internal addresses that are then sent to the ρ-Control Store Address Register (ρCSAR)
  • If the Microcode is stored in internal ROM (resident, fixed), then the address is just passed through

The Residence Table in the Sequencer is used to translate addresses and to manage which Microcode is available in the Control Store
  • Triggers a memory access in case a required Microcode is not available
  • Uses an LRU replacement scheme to overwrite existing entries

ρμ-code Unit (cont’d)

src: [VWG+04]

  • The ρ-Control Store contains different entries for set and execute Microcodes
    • Both contain different entries for fixed (ROM) and dynamic (RAM) Microcode
  • The actual Microcode is decoded into Microinstructions that are stored in the Microinstruction Register (MIR) to control the reconfigurable fabric
  • The MIR value together with the return status of the reconfigurable fabric determines the next Microcode

ρμ-code Unit (cont’d)

src: [VWG+04]

The compiler relies on the Stanford SUIF2 Compiler Infrastructure for the front-end and on the Harvard Machine SUIF framework for the back-end

Typically, pragmas denote a function that shall be implemented using the reconfigurable fabric
  • This implicitly specifies the parameters that need to be passed, i.e. the signature of the function

Compiler Tool Chain

The following essential features for a compiler targeting custom computing machines are implemented:
  • Code identification: a special pass in the SUIF front-end, based on code annotation with special pragma directives; these function calls are marked for further modification
  • Instruction set extension: issuing set/execute instructions at the medium intermediate representation (IR) level and the low IR level
  • Register file extension: a register allocation algorithm allocates the XREGs in a distinct pass applied before the normal register allocation; introduced in Machine SUIF at the low IR level
  • Code generation: performed when translating SUIF to the Machine SUIF intermediate representation

Compiler Tool Chain (cont’d)

The authors state that multiple operations targeting the reconfigurable fabric may execute in parallel

Microcode bottleneck:
  • The Sequencer/ρ-Control Store can perform at most one operation (set or execute) at a time
  • Thus, the entire reconfigurable fabric is stalled during reconf.
  • And it is not possible to execute two SIs at the same time
  • ‘Self-controlled’ SI executions (i.e. not relying on Microcode) are mentioned but not explained any further

Problem: Limited Parallelism

src: [VWG+04]

src: [VWG+04]

Memory bottleneck:
  • During SI execution, the core CPU cannot access the main memory any more
  • Unclear whether different SIs can access memory at the same time

Problem: Limited Parallelism (cont’d)

Reconfigurable coprocessor with a one-time instruction set extension
  • Reconfiguring, parameter passing, SI execution, synchronization
  • Different ISA alternatives (minimal, preferred, complete)

Using a reconfigurable microcode (ρμ-code)
  • Controlling the reconfiguration
  • Controlling the SI execution
  • Sequencer, ρ-Control Store, ρμ-code Loading Unit

Prototype uses a PowerPC as core processor
  • Arbiter for the instructions (issues either to the core CPU or to the reconfigurable coprocessor and sends the core CPU into a ‘wait state’)

  • Multiplexer for memory access

Compiler Tool Chain

MOLEN Summary

4.5 PRISC: PRogrammable Instruction Set Computer

Tightly coupled functional unit
  • Using a rather small amount of reconfigurable logic

To some degree inspired by PRISM
  • Wanted to improve the communication delay, thus they propose a tight coupling

Many other projects are (directly or indirectly) inspired by PRISC

Overview

A simple Programmable Functional Unit (PFU) that only evaluates combinational functions
  • 2 inputs, 1 output

Carefully added to the microarchitecture such that it has only a minimal impact on the processor's cycle time
  • Some extra capacitive load on the source operand busses
  • Increases the size of the multiplexer for the result operand bus

Programmable Functional Unit

src: [RS94]

Programmable Functional Unit (cont’d)

src: [RS94]

Constraint: same delay as the delay of the already existing ALU etc.
  • Limiting the number of logic levels to bound the delay
  • Their PFU uses 3 logic layers (i.e. rows of LUTs) that should fit within a 200 MHz cycle time
  • Thus, the PFU can use the same synchronisation mechanisms as the other FUs

Small area footprint (less than 1 KB of on-chip SRAM)

One new instruction: ‘Execute PFU’

‘LPnum’ defines which out of the 2048 different SI types shall execute
  • The authors state that “fewer than 200 PFU functions per application” are used [RS94]
  • Note: this is quite a lot and reflects the small size of the functions/PFU
  • However, the small PFU size might allow for such frequent reconfigurations

Instruction Format

src: [RS94]

Instruction Format (cont’d)

Each PFU is associated with a special 11-bit register that contains the SI number that is currently reconfigured into the PFU

When an SI shall execute but is currently not reconfigured, an exception is raised and the handler reconfigures the PFU
  • Observation by the developers: typically less than 15% of the configuration bits need to be set to ‘1’
  • At first, a ‘hardware reset’ sets all PFU configuration bits to ‘0’; afterwards, only the ‘1’s are programmed
  • It takes 100-600 cycles to reconfigure 20% of the PFU memory bits
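This demand-driven scheme can be sketched in a few lines (class, names, and the two single-cycle SIs are hypothetical, for illustration only):

```python
class PFU:
    """Sketch of PRISC's exception-driven reconfiguration: a register
    holds the SI number currently configured into the PFU; executing a
    mismatching LPnum triggers a reconfiguration (in hardware, via an
    exception handled by the OS)."""

    def __init__(self):
        self.loaded = None            # 11-bit SI number currently configured
        self.reconfigurations = 0

    def execute(self, lpnum, a, b, functions):
        if self.loaded != lpnum:      # mismatch: the handler reconfigures
            self.reconfigurations += 1
            self.loaded = lpnum
        return functions[lpnum](a, b)

# two hypothetical combinational SIs
functions = {0: lambda a, b: a & b, 1: lambda a, b: a ^ b}
pfu = PFU()
r1 = pfu.execute(0, 6, 3, functions)   # miss: reconfigure to SI 0
r2 = pfu.execute(0, 5, 1, functions)   # hit: SI 0 already loaded
r3 = pfu.execute(1, 5, 1, functions)   # miss: reconfigure to SI 1
```

As the usage shows, repeated invocations of the same SI pay the 100-600-cycle reconfiguration cost only once.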

Special Instructions

src: [RS94]

The compiler targets data-dependent instructions
  • Works on control/data-flow graphs that can be implemented with logic functions
  • Supports everything except: memory access, floating point, wide arithmetic (not faster when executed on the PFU), mul/div, and variable-length shifts

Special Instructions (cont’d)

src: [RS94]

The complexity of an operation highly depends on its bit-width
  • Full bit-width (i.e. 32-bit) operations such as additions or multiplications are too complex for PFU resources
  • Try to identify the actually needed bit-width

Bit-width analysis:
  • A combination of forward and backward traversals on the control/data-flow graph
  • Exploits cases where only some of the bits in a word are initialized (e.g. ‘load byte’, ‘load immediate’, ‘and 0xFF’) or only some of the bits are used later on
  • The algorithm iterates until no further bit-changes are found
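A minimal sketch of the forward half of such an analysis on a straight-line sequence (the operation names and width rules are simplified illustrations; the real analysis also runs backward and iterates to a fixpoint over the full control/data-flow graph):

```python
def forward_widths(ops):
    """ops: list of (dst, op, srcs); returns a dict var -> bit-width.
    Unknown variables conservatively default to 32 bits."""
    width = {}
    for dst, op, srcs in ops:
        if op == "load_byte":
            width[dst] = 8                 # only the low 8 bits initialized
        elif op == "and_imm":              # src AND a constant mask
            var, mask = srcs
            width[dst] = min(width.get(var, 32), mask.bit_length())
        elif op == "add":
            a, b = srcs
            # a sum needs at most one bit more than its wider operand
            width[dst] = min(32, max(width.get(a, 32), width.get(b, 32)) + 1)
    return width

w = forward_widths([
    ("x", "load_byte", ("mem",)),   # x needs 8 bits
    ("y", "and_imm", ("x", 0xF)),   # y needs only 4 bits
    ("z", "add", ("y", "y")),       # z needs at most 5 bits
])
```

Narrowing a 32-bit addition down to a 5-bit one is exactly what makes the operation cheap enough to fit into the small PFU.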

PFU expression optimization: targets ‘logic’ instruction sequences

PFU table lookup: implements truth tables

PFU prediction optimization: targets if-then-else structures

PFU jump optimization: targets sequences of if-then-else instructions; calculates the final branch target to reduce the number of sequential branches

PFU loop optimization: loop unrolling to apply one of the above techniques

Supported SI candidates

4.6 OneChip

Inspired by PRISC

Tightly coupled functional unit

Using a MIPS-like core processor

Supports more complex SIs

Also supports I/O processing features

Overview

Can reuse pipeline-internal standard components
  • Data dependency analysis check
  • Data forwarding
  • Multicycle PFU latency

Binary compatibility to MIPS

Programmable Functional Unit

src: [WC96]


Envisioned System

src: [WC96]

Core CPU (estimated to be rather small) embedded into the reconfigurable fabric

Special PFU memory and configuration memory on the chip


PFU Configuration Memory

  • The entire reconfigurable fabric is used as one PFU
  • However, the configuration memory contains the configuration data of multiple PFU configurations
  • Reconfiguration from configuration memory is fast and performed on demand (i.e. when the SI is about to execute)

Circuit state and computational memory

  • General-purpose memory to be used by SI implementations
  • For instance to hold state variables (for multi-cycle state-machine based SIs)
  • Or for temporary data storage

The reconfigurable fabric has access to the I/O pins to implement the protocols of I/O standards, e.g. UART

Envisioned System (cont’d)


Prototype

src: [WC96]

Based on standard components

  • 4 Xilinx 4010 FPGAs
  • 2 Aptix AX1024 Field-Programmable Interconnect Chips (FPICs)
  • 4 32Kx9 SRAMs

Very limited functionality

  • Only 6 (out of 32) registers
  • Using time-division multiplexing to feed up to 8 signals across one physical wire
  • Results in 1.25 MHz operation frequency
  • Only configured during startup



4.7 OneChip98


Extension of the OneChip project

Providing memory access for PFUs

  • Note: PFU (Programmable Functional Unit) and RFU (Reconfigurable Functional Unit) are used interchangeably to denote an SI or the reconfigurable fabric (also called FPGA)

Providing multiple PFUs (each with potentially multiple contexts)

Support for superscalar execution

Support for out-of-order execution

Overview


Observation: SI execution latency is almost certainly greater than one CPU cycle

  • Either because the SI contains a state machine
  • or because the critical path of the SI is too long for the CPU frequency

What should the pipeline do during SI execution?

  • Simple solution: stall (i.e. wait until SI completion)
  • Alternative: continue executing other instructions in parallel (as is often done in the coprocessor approach)

Motivation


Scalar: one operation per cycle (can be pipelined)

SuperScalar (also called multiple-issue processor): potentially multiple instructions per cycle

  • VLIW: the compiler explicitly determines which instructions shall execute in parallel (Note: typically not called superscalar, but belongs to this category)
  • In-order SuperScalar: the sequence of instructions in the binary is respected, i.e. if a particular instruction cannot execute (e.g. due to a data dependency) then the instructions that follow it are not considered for execution (even if they could)
  • Out-of-order SuperScalar: dynamic re-scheduling of the instructions; potentially executing them in a different sequence than written in the binary

Classification: SuperScalar / Out-of-order

Problems may arise when executing SIs with memory access in parallel (or even out-of-order) with normal load/store instructions

Memory Inconsistency Problems

(Figure: memory and data cache; src: [CC01])


Memory Inconsistency Problems (cont’d)

Hazard 1 (SI rd after CPU wr):
  1. Flush SI source addresses from CPU cache when SI issues
  2. Prevent SI reads while CPU store instructions are pending

Hazard 2 (CPU rd after SI wr):
  3. Invalidate SI destination addresses in CPU cache when SI issues
  4. Prevent CPU reads from SI destination addresses until SI writes its destination block

Hazard 3 (SI wr after CPU rd):
  5. Prevent SI writes while CPU load instructions are pending

Hazard 4 (CPU wr after SI rd):
  6. Prevent CPU writes to SI source addresses until SI reads its source block

src: [CC01]


Memory Inconsistency Problems (cont’d)

Hazard 5 (SI wr after CPU wr):
  7. Prevent SI writes while CPU store instructions are pending

Hazard 6 (CPU wr after SI wr):
  8. Prevent CPU writes to SI destination addresses until SI writes its destination block

Hazard 7 (SI rd after SI wr):
  9. Prevent SI reads from locked SI destination addresses

Hazard 8 (SI wr after SI rd):
  10. Prevent SI writes to locked SI source addresses

Hazard 9 (SI wr after SI wr):
  11. Prevent SI writes to locked SI destination addresses

src: [CC01]


src: [CC01]

OneChip Out-of-order architecture


Fetch stage: fetches instructions from the I-Cache into the dispatch queue

Dispatch stage:

  • instruction decoding
  • register renaming
  • Move instructions from the dispatch queue to reservation stations for core ISA (BFU), memory, and SIs (RFU)
  • Add entries in the Block Lock Table (BLT, explained later) to lock memory blocks when SIs are dispatched
  • Until here, the instructions are handled in-order

Issue stage: identifies ready instructions from the reservation stations (considering data dependencies, memory consistency, etc.) and allows them to proceed in the pipeline

  • Performed out-of-order

OneChip Out-of-order architecture (cont’d)


Execute stage: executes the instructions in different parallel pipelines for core ISA, memory access, and SIs

Writeback stage:

  • Move completed operation results to a ‘register update unit’ (not the register file) and a ‘load/store queue’
  • Scan the dependency chain of the completing instructions and wake up any dependent instructions

Commit stage:

  • Retires instructions in-order (i.e. only the Issue, Execute, and Writeback stages operate out-of-order)
  • Commits ‘register update unit’ data to the register file
  • Commits ‘load/store queue’ data to the data cache
  • Releases the resources that were used by the instructions
  • Clears BLT entries to remove SI memory locks

OneChip Out-of-order architecture (cont’d)

  • RS: Reservation Station
  • RBT: Reconfiguration Bits Table
    • Acts as configuration manager
  • Multiple multi-context FPGAs
    • Each containing the configuration of one SI at a time
  • Local Storage used for temporary results etc.

RFU composition

src: [CC01]


Reconfiguration Bits Table (RBT)

src: [JC99]

‘FPGA function’ corresponds to a unique SI ID

‘Loaded’ denotes whether the configuration data is available in the context memory

  • If yes, then the ‘context ID’ shows which context it is

‘Active’ denotes whether or not this SI is the currently active configuration of an RFU

  • ‘Opcode’ indicates the special SI format
  • Rsource and Rdest point to registers that contain the source and destination address in data memory
  • ‘Source block size’ and ‘destination block size’ indicate the amount of data that will be read and written, respectively
    • Important for memory consistency
    • Alternative: when the amount of read data and written data is identical, one of the fields can be used to provide a third register that points to a second source address
  • Two ‘FPGA functions’ are reserved for manipulating the RBT and for preloading the bitstream into an FPGA context

Instruction Format

src: [JC99]


Each SI specifies which memory region will be read and which one will be written

  • Using the base address (32-bit via register)
  • And the size (5-bit via ‘source block size’)
  • Note: with 5 bits we can distinguish 32 different sizes

Observation: the address space is 2^32 bytes large

  • Idea: the block size must be a power of two (2, 4, 8, …, 1G, 2G, 4G; note: ‘1’ is omitted) and the base address must be aligned on a block boundary

Locking data memory


Example: given the block address (i.e. the base address of the block) below and a ‘block size’ of 00100₂ = 4₁₀

  • This indicates an ‘expanded’ block size of 2⁴ = 16₁₀ = 10000₂ bytes
  • Note the mismatch between expectation and calculation: block size ‘1’ = 2⁰ is not supported; ‘block size’ = 00000₂ gives an ‘expanded’ block size of 2⁰ = 1, yet its block mask reserves a block of size 2₁₀

Locking data memory (cont’d)

Block address:        0000 0000 0000 0000 0110 1010 0010 0000
Expanded block size:  0000 0000 0000 0000 0000 0000 0001 0000
Block mask:           0000 0000 0000 0000 0000 0000 0001 1111
Masked block address: 0000 0000 0000 0000 0110 1010 001x xxxx

Any access to the locked region uses the same tag

src: [JC99]
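The mask arithmetic of the worked example can be checked with a short sketch. Function and variable names are illustrative, not from [JC99]; the shift-by-one when building the mask follows the slide's example, where a 'block size' field of 4 masks 5 address bits (a 32-byte region).

```python
# Hedged sketch of the block-locking address arithmetic described above.

def lock_region(block_address: int, block_size_field: int):
    expanded = 1 << block_size_field     # 'expanded' block size in bytes
    mask = (expanded << 1) - 1           # e.g. 0b10000 -> 0b11111
    masked = block_address & ~mask       # base address aligned to the block
    return expanded, mask, masked

expanded, mask, masked = lock_region(0x00006A20, 0b00100)
print(f"expanded={expanded:#x} mask={mask:#x} masked={masked:#010x}")
# expanded=0x10 mask=0x1f masked=0x00006a20
```

The masked address 0x00006a20 matches the slide's "0110 1010 001x xxxx" pattern: the low five bits are wildcards within the locked region.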


The BLT stores the information required to determine the locked memory regions

Block Lock Table (BLT)

Masked Block Address    FPGA function   Source/Destination
0010 100x xxxx xxxx     2               Source
0011 0110 0xxx xxxx     2               Destination
0100 00xx xxxx xxxx     1               Destination
1001 00xx xxxx xxxx     1               Source

src: [JC99]


Instructions are entered into and removed from the BLT in-order

  • When an SI is dispatched, its corresponding entries are added to the BLT
  • When an SI commits, its entries are removed from the BLT
  • The (out-of-order) issue stage probes the BLT for memory locks to determine whether an instruction is ‘ready’ to execute
  • Also used to flush/invalidate cache lines, depending on the hazard type

Block Lock Table (cont’d)
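The dispatch/commit/probe behaviour above can be sketched as a small model. This is an illustration only, not the OneChip98 hardware: the class, its fields, and the use of the FPGA-function number as the lock owner are all assumptions made for the example.

```python
# Illustrative model of a Block Lock Table: SIs add masked regions at
# dispatch, release them at commit, and the issue stage probes for locks.

class BlockLockTable:
    def __init__(self):
        self.entries = []   # (masked base, mask, fpga_function, is_dest)

    def dispatch_si(self, base, mask, fpga_function, is_dest):
        # Added in-order at dispatch; base is aligned to the block.
        self.entries.append((base & ~mask, mask, fpga_function, is_dest))

    def commit_si(self, fpga_function):
        # Removed in-order at commit: the SI's locks are released.
        self.entries = [e for e in self.entries if e[2] != fpga_function]

    def is_locked(self, addr):
        # Probe: does the address fall into any locked region?
        return any(addr & ~mask == base for base, mask, _, _ in self.entries)

blt = BlockLockTable()
blt.dispatch_si(0x6A20, 0x1F, fpga_function=2, is_dest=True)
print(blt.is_locked(0x6A25))   # True: inside the 32-byte locked block
print(blt.is_locked(0x6A40))   # False: outside the block
blt.commit_si(2)
print(blt.is_locked(0x6A25))   # False: lock released at commit
```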


Simulated architecture

  • 4 instructions can be fetched, decoded, issued, and committed per cycle

  • 16-entry register update unit
  • 8-entry load/store queue
  • 4 integer ALUs
  • 1 integer mul/div unit
  • 2 memory ports
  • 4 floating point ALUs
  • 1 floating point mul/div unit

Note: rather hardware-rich architecture

  • Still they obtain speedup by adding the RFU

Benchmarks


Legend:

  A: in-order GPP
  B: in-order OneChip
  C: out-of-order GPP
  D: out-of-order OneChip

Benchmarks (cont’d)


Observed limited potential for executing core ISA instructions in parallel to SIs

  • 5 independent instructions for the JPEG decoder
  • 11 independent instructions for the JPEG encoder
  • The SI has a latency of 128 cycles

Only the JPEG application benefited from using the BLT

Recommendation of the authors: rather than using the memory consistency scheme (i.e. the BLT hardware), it could be sufficient to stall the CPU as soon as it is about to perform a memory access while an SI executes

  • However, their benchmarks never used more than one RFU

Benchmarks (cont’d)


Introduced superscalar out-of-order execution to reconfigurable SIs

  • Automatic management to avoid memory inconsistency problems

Reported speedup: 2x - 32x

  • MPEG-2 encode: 2x
  • MPEG-2 decode: 11x
  • ADPCM encode: 32x
  • Comparing out-of-order issue with RFU against in-order issue (still superscalar) without RFU
  • Based on simulation

Rather incomplete hardware prototype

Summary



4.8 XiRISC: eXtended Instruction-set RISC


A VLIW processor, enhanced with a tightly-coupled reconfigurable functional unit (RFU)

  • Processor fetches 2 instructions per cycle that are executed concurrently
  • Using a classic RISC 5-stage pipeline

Embedded in a System-on-Chip: XiSystem

  • Provides an additional eFPGA to handle I/O communication or to be used as a reconfigurable coprocessor

Developed in collaboration with STMicroelectronics, which provided an actual chip of the developed processor/system

Overview


src: [LTC+03]

Register file provides 4 read and 2 write ports

  • shared by the 2 instructions

32-bit load/store architecture

  • i.e. no direct data memory access for the RFU

Architecture


Fully bypassed architecture, i.e. data forwarding to reduce the effects of data dependencies

Hardwired FUs + an additional pipelined RFU

  • Called Pipelined Configurable Gate Array (pGA or PiCoGA)
  • Supports multi-cycle instructions
  • Can hold an internal state across several computations
  • Synchronization and consistency are realized by hardware stall logic based on a register-locking mechanism (for read-after-write hazards)

Architecture (cont’d)


Two-dimensional array of LUT-based Reconfigurable Logic Cells (RLCs)

  • 16x24 RLC array
  • An RLC contains two 4:2 LUTs that can be combined to form a 6:1, 5:2, or 4:4 logic function

PiCoGA

src: [LTC+03]


Each row implements a possible stage of a customized pipeline that executes in parallel to the normal FUs

  • 16 RLCs per row; 2-bit granularity
  • 8 horizontal channel pairs for communication within one row
  • 12 vertical channel pairs for communication between the rows

A sequence of SIs can be processed in a pipelined way

Up to 4x32-bit input data and up to 2x32-bit output data from/to the register file

PiCoGA (cont’d)


Reconfigurable Logic Cell (RLC)

src: [LTC+03]

RLC-internal loop-back to cascade the 2 LUTs or to hold a state (e.g. accumulate)

  • Constant input (selected by MUX) to initialize the state

Extra logic for carry chain

1 register per LUT output

An RLC can implement a 2-bit adder

  • The 2 LUTs compute both alternatives (carry-in 0 or 1) and the actual ‘carry in’ selects the result
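The carry-select trick of the last bullet can be modelled in a few lines. This is a behavioural toy, not the PiCoGA netlist: both alternatives are precomputed in parallel (as the two LUTs would), and the real carry-in only selects the result.

```python
# Toy model of the RLC 2-bit carry-select adder described above.

def rlc_2bit_adder(a: int, b: int, carry_in: int):
    """a, b are 2-bit values; returns (2-bit sum, carry_out)."""
    # Both LUTs evaluate their alternative in parallel:
    sum_if_0, carry_if_0 = (a + b) & 0b11, (a + b) >> 2
    sum_if_1, carry_if_1 = (a + b + 1) & 0b11, (a + b + 1) >> 2
    # The actual carry-in merely selects between the precomputed results:
    return (sum_if_1, carry_if_1) if carry_in else (sum_if_0, carry_if_0)

print(rlc_2bit_adder(0b11, 0b01, 1))   # (1, 1): 3 + 1 + 1 = 5 = 0b101
```

The benefit in hardware is that the addition does not wait for the incoming carry; only a multiplexer sits on the carry path.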

  • Row elaboration is activated by an embedded control unit
  • Execution enable signal for each pipeline stage

Example: Pipelined SI execution with preserved state

PiCoGA operation latency is dependent on the operation performed

  • pGA-load: load a configuration into the array
  • pGA-free: remove a configuration
  • pGA-op: execute an SI
    • 32-bit variant that allows executing a second instruction (VLIW) in parallel, but only offers 2 source registers
    • 64-bit variant that uses both VLIW slots, but therefore provides 4 source registers

Instruction-set extension

Instruction formats:

  pGA-load:      region specification | configuration specification
  32-bit pGA-op: operation specification | Source 1 | Source 2 | Dest 1 | Dest 2
  64-bit pGA-op: operation specification | Source 1 | Source 2 | Source 3 | Source 4 | Dest 1 | Dest 2

src: [CCG+03]


Storing 4 configurations for each RLC

  • Single-cycle context switch

Row-wise partial reconfiguration

Interface between core CPU and reconfigurable fabric buffers the ‘configuration load’ instructions, which are then performed one after the other

  • Thus, the core CPU does not need to be stalled until the reconfiguration completes
  • SIs may execute during partial reconfiguration

192-bit bus to a 2nd-level on-chip configuration cache

  • 16 cycles to receive a complete configuration
  • Later extended to 256-bit and attached to the AHB bus

Configuration Caching


TIC: standard Test Interface Controller

Embedded FPGA (eFPGA):

  • fine-grained (1-bit granularity)
  • Homogeneous
  • Single-context
  • Configurable pull-up/pull-down I/O pads

4.9 XiSystem

src: [LCB+06]


Connecting the eFPGA

src: [LCB+06]


The eFPGA is memory-mapped to 256 reserved addresses

  • i.e. an access to these particular addresses goes to the eFPGA instead of the memory

2 unidirectional FIFOs with 32 32-bit locations each

  • The ‘Write FIFO’ (i.e. towards the eFPGA) additionally stores the lower 8 bits of the address to identify which memory-mapped address was accessed

The eFPGA/FIFOs can generate interrupts to control the DMA unit

  • This allows the eFPGA to be used as an autonomous data-stream coprocessor

The eFPGA can be clocked from different sources (system clock, external clock, or both) to adapt to different critical paths on the eFPGA and to different bandwidth requirements

  • For instance, the eFPGA can use a slowed-down version of the system clock (using the clock dividers) to process data in parallel, then serialize the results and use a high-frequency external clock to provide the required bandwidth

Connecting the eFPGA (cont’d)
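The memory-mapped access scheme above can be sketched as an address decoder. Everything concrete here is an assumption for illustration: the base address, the dict-backed memory, and the FIFO representation are not from the XiSystem datasheet; only the behaviour (256 reserved addresses, low 8 address bits stored with the data, 32-entry FIFO) follows the slide.

```python
# Sketch of memory-mapped eFPGA access: stores into the reserved 256-byte
# window are diverted into the Write FIFO together with addr[7:0].

from collections import deque

EFPGA_BASE = 0xFFFF0000          # hypothetical base of the reserved region
memory = {}                      # ordinary data memory (toy model)
write_fifo = deque(maxlen=32)    # 32 entries of (addr[7:0], 32-bit data)

def store(addr: int, data: int):
    if EFPGA_BASE <= addr < EFPGA_BASE + 256:
        # Diverted to the eFPGA: keep the low 8 address bits as an ID.
        write_fifo.append((addr & 0xFF, data & 0xFFFFFFFF))
    else:
        memory[addr] = data

store(0x1000, 0xAB)              # ordinary store goes to memory
store(EFPGA_BASE + 0x14, 0x42)   # diverted to the eFPGA Write FIFO
print(memory)                    # {4096: 171}
print(list(write_fifo))          # [(20, 66)]
```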


src: [LCB+06]

Connecting the eFPGA (cont’d)


Fabricated Chip

src: [LCB+06]

eFPGA occupies 6 mm² for 15-Kgate capacity

PiCoGA occupies 11 mm² for 15.4-Kgate capacity

  • Mainly due to multiple contexts


Using an MPEG-2 encoder application

Comparing PiCoGA vs. eFPGA

src: [LCB+06]


Comparing XiSystem with other CPUs

src: [LCB+06]

Compared with ‘XiRisc without PiCoGA’

Compared with TI C6713: a VLIW running at 225 MHz, issuing up to 8 integer instructions per cycle


Used for I/O peripherals and for coprocessor computation

Performance of the eFPGA

src: [LCB+06]


XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable functional unit

PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells

Reported speedup ranges from 1.5x to 15.8x

Up to 89% energy consumption reduction

Embedded in a System-on-Chip: XiSystem

Developed in collaboration with STMicroelectronics, which provided an actual tape-out of the developed chip

Summary



4.10 New FPGA Architectures


Dedicated to a specific application or domain, e.g. arithmetic operations

Still programmable/reconfigurable

  • but typically operating at lower efficiency when targeting a different domain

Problem: design-space exploration

  • Which FPGA structure is suitable/best?
  • How can the tools (e.g. place & route) handle that structure?

Realized by Architecture Description Languages (ADLs)

  • Automatic generation of the physical layout that implements the FPGA (for an ASIC)
  • Automatic generation of HDL code that describes the FPGA (for simulation)
  • Automatic generation of place & route tools targeting the FPGA

4.10.1 Domain-optimized eFPGAs


Routing Resources

src: Prof. Noll, RWTH Aachen


Logic Elements

src: Prof. Noll, RWTH Aachen


Engineering samples available since end of 2011

28 nm technology

6-input LUTs

Dual 12-bit 1 MSample/s ADC

  • Incl. on-chip sensors for temperature and power supply (1.0 V or 0.9 V)

4.10.2 Xilinx Virtex-7


The ‘typical’ problems, i.e. those that are improved from generation to generation

  • Power, performance, …

Problem: yield

  • Especially large chips typically have yield problems
  • It takes a long time (in comparison to smaller FPGAs) until they are available in larger quantities (or: at reasonable prices)

Problem: maximum size

  • So far: limited by yield
  • Workaround: connect multiple FPGAs on a PCB
  • Problems: limited I/O pins, performance of inter-FPGA connections, distributing the clocks, larger power consumption, complicated PCB design, …

Solution: Stacked Silicon Interposer (also called 2.5D chips)

Addressed Problems


Stacked Silicon Interposer


Improved the yield of 28 nm FPGAs significantly (note: currently this is still a claim)

  • Disadvantage: stacking (technically complicated, but seems to work)

It is easy to create different FPGA families and sizes by combining different FPGA fabrics on one interposer

This technology also allows integrating state-of-the-art FPGAs with application-specific logic in a seamless and easy way

  • E.g. combining one Virtex-7 FPGA part with one customized CPU ASIC on the interposer gives a high-bandwidth, low-latency connection between FPGA and CPU

Advantages


Next Step: ‘real’ 3D chips

src: http://www.eetimes.com/design/eda-design/4230786/Building-3D-ICs--Tool-Flow-and-Design-Software-Part-2?cid=NL_ProgrammableLogic&Ecosystem=programmable-logic


A startup company that developed a 3D Programmable Logic Device (3PLD; basically an FPGA)

  • Uses time as the third dimension (rather than waiting for 3D-stacked chips to become mainstream)

Achieved by dynamically reconfiguring logic, memory, and interconnect at multi-GHz rates

  • Executing each portion of a design in an automatically defined sequence of steps

The Spacetime compiler (i.e. synthesis, place & route tools) manages ‘ultra-rapid’ reconfiguration transparently

4.10.3 Tabula

src: Tabula “ABAX Product Family Overview”


3D device with multiple layers (so-called ‘folds’) in which computation and signal transmission can occur

  • Each ‘fold’ performs a portion of the desired functionality and stores the result in place
  • When some or all of a fold is reconfigured, it uses the locally stored data to perform the next portion of the function
  • The data is not moving (at least not far), but the hardware is changing (data can stay local)
  • Lower demand for interconnect resources

Configuration contexts: ‘Folds’


Folds (cont’d)

Configuration data is stored locally to the resources it controls

  • Organized like a stack that cycles through the folds
src: Tabula “Spacetime Architecture White Paper”

slide-33
SLIDE 33

The user clock is divided into sub-cycles which form the folds

  • The device core operates at up to 1.6 GHz
  • The user clock depends upon the number of folds
  • E.g. 200 MHz for 8 folds (figure shows an 8-fold spacetime clock) or 400 MHz for 4 folds
  • Up to 8 folds are supported

Spacetime clock

src: Tabula “Spacetime Architecture White Paper”
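The fold/clock arithmetic above is simply a division of the core clock by the number of folds; a tiny check (function name invented for the example):

```python
# Arithmetic check of the spacetime clock relationship: the user clock is
# the core clock divided by the number of folds (up to 8 folds supported).

CORE_CLOCK_HZ = 1.6e9                 # device core runs at up to 1.6 GHz

def user_clock_hz(folds: int) -> float:
    assert 1 <= folds <= 8            # up to 8 folds are supported
    return CORE_CLOCK_HZ / folds

print(user_clock_hz(8) / 1e6)         # 200.0 (MHz), as on the slide
print(user_clock_hz(4) / 1e6)         # 400.0 (MHz)
```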


All resources can be modified and reused when going from one fold to the next

  • Memory ports, LUTs, routing, …
  • For instance, an 8-bit wide path in 8 folds delivers 64 bits per user clock cycle
  • A single-port memory appears to have 8 ports that can access arbitrary addresses
    • or an 8-fold wider memory port
    • or 8 independent memories (total capacity must not exceed the capacity of the original single-port memory)

Resource reuse


Tabula ABAX resources

src: Tabula “ABAX Product Family Overview”

‘Spacetime’ fabric features:

  • Logic: 0.22M - 0.63M LUTs (4-input LUT equivalent), operating at 1.6 GHz
  • DSP blocks: up to 1280 1.6 GHz 18x18 multiplier/accumulators
  • Memory: 5.5 MBytes (!) @ 1.6 GHz, featuring 8 and 16 ports and built-in ECC and FIFO controllers

Manufactured using TSMC’s 40 nm process


2.5x logic density (LUTs/mm² for 40 nm devices) and 2.0x memory density (due to single-port memories)

  • Logic, memory, and routing resources are all re-used multiple times per user cycle, giving higher density and shorter interconnect

2.9x more memory ports (in total, i.e. over all memory ports on the device)

‘Marketing’ comparison with 2D FPGA


Spacetime Compiler ‘Stylus’

Automatically maps, places, and routes existing designs into an ABAX device

All control of the hardware reconfiguration is automatically and invisibly managed by the Spacetime compiler

src: Tabula “ABAX Product Family Overview”


View placement and routing

Visualize timing-critical paths and slack histograms

Cross-probe between HDL source, schematic, and place-and-route views

Design Analysis

src: Tabula “Stylus Software Overview”


Development Kit

src: Tabula “3PLD Development Kit”


Development Kit (cont’d)

src: Tabula “3PLD Development Kit”


The next generation is announced to use Intel’s 22 nm FinFET technology (called Tri-Gate)

Outlook

src: tabula.com


References and Sources

[WAL+93] M. Wazlowski, L. Agarwal, T. Lee, A. Smith, E. Lam, P. Athanas, H. Silverman, S. Ghosh: “PRISM-II Compiler and Architecture”, IEEE Workshop on FPGAs, 1993.

[HW97] J. R. Hauser, J. Wawrzynek: “Garp: A MIPS Processor with a Reconfigurable Coprocessor”, IEEE Symposium on FPGA-Based Custom Computing Machines, pp. 24-33, 1997.

[CHW00] T. J. Callahan, J. R. Hauser, J. Wawrzynek: “The Garp Architecture and C Compiler”, IEEE Computer, vol. 33, no. 4, pp. 62-69, 2000.

[VWG+04] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, E. M. Panainte: “The MOLEN Polymorphic Processor”, IEEE Transactions on Computers, vol. 52, no. 11, pp. 1363-1375, 2004.

[RS94] R. Razdan, M. D. Smith: “A High-Performance Microarchitecture with Hardware-programmable Functional Units”, International Symposium on Microarchitecture, pp. 172-180, 1994.

[WC96] R. D. Wittig, P. Chow: “OneChip: an FPGA processor with reconfigurable logic”, IEEE Symposium on FPGAs for Custom Computing Machines, pp. 126-135, 1996. (Note: actual screenshots taken from Wittig’s 1995 dissertation of the same name due to their better visual quality.)

[JC99] J. A. Jacob, P. Chow: “Memory interfacing and instruction specification for reconfigurable processors”, International Symposium on Field Programmable Gate Arrays, pp. 145-154, 1999.

[CC01] J. E. Carrillo, P. Chow: “The effect of reconfigurable units in superscalar processors”, International Symposium on Field Programmable Gate Arrays, pp. 141-150, 2001.

[LTC+03] A. Lodi, M. Toma, F. Campi, A. Cappelli, R. Canegallo, R. Guerrieri: “A VLIW Processor with Reconfigurable Instruction Set for Embedded Application”, IEEE Journal of Solid-State Circuits, vol. 38, no. 11, pp. 1876-1886, Nov. 2003.

[CCG+03] F. Campi, A. Cappelli, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa, L. Lavagno, C. Passerone, R. Canegallo: “A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems”, 17th International Symposium on Parallel and Distributed Processing, pp. 171.1, 2003.

[LCB+06] A. Lodi, A. Cappelli, M. Bocchi, C. Mucci, M. Innocenti, C. De Bartolomeis, L. Ciccarelli, R. Giansante, A. Deledda, F. Campi, M. Toma, R. Guerrieri: “XiSystem: A XiRisc-Based SoC with Reconfigurable IO Module”, IEEE Journal of Solid-State Circuits, vol. 41, no. 1, pp. 85-96, 2006.