Reconfigurable and Adaptive Systems (RAS) - Lars Bauer, Jörg Henkel

Institut für Technische Informatik, Chair for Embedded Systems - Prof. Dr. J. Henkel
Lecture, summer semester 2014
Reconfigurable and Adaptive Systems (RAS)
Lars Bauer, Jörg Henkel

5. Configuration Prefetching

RAS Topic Overview
1. Introduction
2. Overview
3. Special Instructions
4. Fine-Grained Reconfigurable Processors
5. Configuration Prefetching
   • Motivation and Definition
   • Static Prefetching
   • Clock Frequency Variation
   • Dynamic Prefetching
   • Area Models
6. Coarse-Grained Reconfigurable Processors
7. Adaptive Reconfigurable Processors
8. Fault-tolerance by Reconfiguration

L. Bauer, CES, KIT, 2014

5.1 Motivation and Definition

Recall: Performing Run-time Reconfigurations
• PRISC
  ◦ Reconfiguration is triggered implicitly by SI execution
  ◦ Reconfiguration time: 100-600 cycles
  ◦ Fast reconfiguration time at the cost of very limited SI complexity
• XiRISC
  ◦ pGA-load: load a configuration into the array
  ◦ pGA-free: remove a configuration
  ◦ 16 cycles to receive a complete configuration if it is available in the 2nd-level configuration cache
  ◦ Approx. '128+startup' cycles (not explicitly stated by the authors) to receive it from external memory: 8 times slower because the 2nd-level cache bus (256 bit) is 8 times wider than the memory bus (32 bit)
• Garp
  ◦ gaconf reg: Load (or switch to) the configuration at the address given by reg
  ◦ gasave reg: Save all array data state to memory at the address given by reg
  ◦ garestore reg: Restore previously saved data state from memory at the address given by reg
  ◦ Approx. 50 μs to reconfigure 32 rows (12 bus cycles per row plus some startup time)
• MOLEN
  ◦ p-set address: reconfigure those parts that seldom change
  ◦ c-set address: reconfigure those parts not addressed by p-set
  ◦ set-prefetch address: prefetches the microcode that is responsible for a p-set or c-set operation
  ◦ Reconfiguration time between 2 and 12 ms
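The XiRISC estimate of about 128 cycles can be re-derived from the bus widths. The sketch below is only a back-of-the-envelope check: it assumes one bus word is transferred per cycle on both buses and ignores the startup time, and the function and constant names are made up for illustration.

```python
# Figures taken from the XiRISC bullet points above.
CACHE_BUS_BITS = 256     # width of the 2nd-level configuration cache bus
MEMORY_BUS_BITS = 32     # width of the external memory bus
CACHE_LOAD_CYCLES = 16   # cycles to load one configuration from the cache


def external_load_cycles(cache_cycles: int = CACHE_LOAD_CYCLES,
                         cache_bus_bits: int = CACHE_BUS_BITS,
                         memory_bus_bits: int = MEMORY_BUS_BITS) -> int:
    """Estimate the transfer cycles from external memory (startup excluded).

    Assumption: one bus word per cycle on either bus.
    """
    config_bits = cache_cycles * cache_bus_bits   # size of one configuration
    return config_bits // memory_bus_bits


print(external_load_cycles())  # 128
```

Under these assumptions a configuration holds 16 * 256 = 4096 bits, which takes 4096 / 32 = 128 cycles over the 32-bit memory bus.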

Recall: Performing Run-time Reconfigurations (cont'd)
• Reconfiguration can take from a few cycles (if available in cache) through microseconds (for limited SI complexity) to milliseconds (powerful FPGA fabrics)
• If a configuration is not available when the SI shall execute, the system performance is significantly affected
  ◦ Either stall the execution until the reconfiguration completes
  ◦ Or use the core ISA to implement the SI functionality (trap handler or conditional branch)
• Solution: if some region of the reconfigurable fabric is currently free (i.e. not occupied by another configuration), then configuration prefetching can be used to perform the reconfiguration before the SI is needed

Configuration Prefetching
• Definition: Start loading the configuration data of a particular SI before that SI is actually used
• Goal: Minimize performance loss due to pending reconfigurations
  ◦ Typically: try to finish the reconfiguration before the SI is executed
  ◦ Note: sometimes it is better to avoid reconfiguration for an SI and to execute it with the core ISA instead (to avoid thrashing)
  ◦ Configuration prefetching can be used to transfer configuration data from external memory to the configuration cache (preparing a reconfiguration) or to perform the reconfiguration right away
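The trade-off between stalling, falling back to the core ISA, and prefetching early enough can be illustrated with a very small timing model. This is a sketch under simplifying assumptions (a single SI, all times in cycles, no contention on the fabric); the function names and the numbers in the usage lines are hypothetical.

```python
def stall_cycles(reconfig_cycles: int, prefetch_distance: int) -> int:
    """Cycles the core must wait if it stalls until the SI is loaded.

    prefetch_distance is the number of cycles between starting the
    prefetch and the first execution of the SI.
    """
    return max(0, reconfig_cycles - prefetch_distance)


def first_execution_cost(reconfig_cycles: int, prefetch_distance: int,
                         hw_cycles: int, sw_cycles: int) -> int:
    """Cost of the first SI execution, taking the cheaper of two options:
    stall and then run in hardware, or run the core-ISA implementation."""
    wait_then_hw = stall_cycles(reconfig_cycles, prefetch_distance) + hw_cycles
    return min(wait_then_hw, sw_cycles)


# Prefetch started early enough: no stall, the SI runs in hardware.
print(first_execution_cost(600, 1000, hw_cycles=10, sw_cycles=150))  # 10
# Prefetch started too late: the core-ISA fallback is cheaper than waiting.
print(first_execution_cost(600, 200, hw_cycles=10, sw_cycles=150))   # 150
```

The model shows why the goal is to finish the reconfiguration before the SI is executed: once the prefetch distance covers the reconfiguration time, the stall term vanishes.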

Configuration Prefetching (cont'd)
• Example control-flow graph
  ◦ Each node is a Basic Block (BB)
  ◦ Color indicates the execution frequency
  ◦ Edges show the control flow
  ◦ Red edges are function calls (dashed lines) or returns (solid lines)
[Figure: example control-flow graph with the labels "prefetch SI", "Time for reconfiguration", "Execute SI" and "Return from subroutine"]

Relevant Parameters for Prefetching
• Temporal distance between starting the prefetching operation and the SI execution
  ◦ Depends on control flow
  ◦ Starting too late → SI is demanded before prefetching completes
  ◦ Starting too early → Potential conflicts between currently demanded SIs and SIs that shall be prefetched
• Probability that the SI executions are reached
  ◦ Depends on control flow
  ◦ Typically: the earlier prefetching is started, the higher the uncertainty whether the SI execution is eventually reached
• Number of SI executions
  ◦ Depends on control flow
  ◦ If the SI is executed rather seldom, then it might be better to execute it using the core ISA rather than speculating on a prefetch operation

Aborting prefetching operations
• False Prefetching: Due to control-flow uncertainty, it can happen that prefetching for an SI was triggered and, even before it finishes, it becomes clear that the SI is not going to execute at all
• 'Still Pending' False Prefetching:
  ◦ The prefetching was added to a queue and has not started yet
  ◦ Simply remove it from that queue
• 'Already Running' False Prefetching:
  ◦ The prefetching operation is currently running
  ◦ For line-based reconfigurable fabrics (e.g. Garp or XiRISC), finish prefetching the current line and abort it afterwards (short delay)
  ◦ For FPGA-based reconfigurable fabrics, aborting may not be possible (unless prefetching to a cache)
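The two abort cases can be sketched as a small prefetch controller. This is an illustrative model only, not the mechanism of any of the processors named above: the class, its methods and the `line_based` flag are assumptions, and an FPGA-based fabric is modelled simply as "cannot abort a running prefetch".

```python
from collections import deque


class PrefetchController:
    """Toy model of a prefetch queue with abort handling (hypothetical API)."""

    def __init__(self, line_based: bool):
        self.line_based = line_based   # True for Garp/XiRISC-like fabrics
        self.queue = deque()           # prefetches that have not started yet
        self.running = None            # SI whose configuration is loading
        self.abort_after_line = False  # set when a running prefetch is cancelled

    def request(self, si: str) -> None:
        """Start the prefetch if the port is idle, otherwise enqueue it."""
        if self.running is None:
            self.running = si
        else:
            self.queue.append(si)

    def cancel(self, si: str) -> str:
        """Handle a false prefetch for the given SI."""
        if si in self.queue:                      # 'still pending'
            self.queue.remove(si)
            return "removed from queue"
        if si == self.running:                    # 'already running'
            if self.line_based:
                self.abort_after_line = True      # finish the current line first
                return "abort after current line"
            return "cannot abort"
        return "not found"

    def line_finished(self) -> None:
        """Called when one configuration line has been written."""
        if self.abort_after_line:
            self.abort_after_line = False
            self.running = self.queue.popleft() if self.queue else None


ctrl = PrefetchController(line_based=True)
for si in ("SI1", "SI3", "SI4"):
    ctrl.request(si)
print(ctrl.cancel("SI4"))   # removed from queue
print(ctrl.cancel("SI1"))   # abort after current line
ctrl.line_finished()
print(ctrl.running)         # SI3
```

A real controller would also have to track how many lines remain and whether a partially loaded configuration must be invalidated; those details are left out here.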

Aborting prefetching operations (cont'd)
• In nodes I4 and I3, all 4 SIs may be prefetched
• When the control flow moves to I1, it is clear that SIs 3 and 4 are not demanded
  ◦ Their prefetching may be stopped (if possible)
[Figure: SI-centric control-flow graph with nodes I1 to I4. Circle: set of instructions (potentially with embedded control flow, subroutine calls etc.). Rectangle: usage of an SI. src: [LH02]]


5.2 Static Prefetching

Static Prefetching
• Idea: At compile time, analyze the control-flow graph and embed prefetching instructions into the code that statically decide which SIs shall be prefetched
• Required: the probability with which each branch will be taken
  ◦ From profiling; shown as edge labels (in percent)
• Next step: At each node, establish a list of all reachable SIs, sorted by the probability of reaching them
src: [LH02]

Static Prefetching (cont'd)
• Probability of a node n to reach SI s:
  $$\mathrm{Prob}(n, s) := \sum_{p \,\in\, \text{Paths from node } n \text{ to SI } s} \;\; \prod_{e \,\in\, \text{Edges on path } p} e$$
  where each edge e stands for its branch probability
• Example: 3 paths to reach SI 3 from I10
  ◦ 0.3 * 0.4 * 0.4 +
  ◦ 0.3 * 0.6 * 0.4 +
  ◦ 0.2 * 0.8 * 0.4
  ◦ Probability to reach SI 3 from node I10: 18.4%
src: [LH02]

Static Prefetching (cont'd)
• Algorithm moving backwards through the graph
  ◦ Initialize a queue with all 'SI-nodes' (squares, 100% reachability)
  ◦ Remove a node n from the queue and update the probability information of all its predecessors: they can also reach the SIs that can be reached from node n
  ◦ When all successors of a node are processed, add it to the queue
  ◦ Iterate until the queue is empty
[Figure: control-flow graph in which each node is annotated with all SI nodes that can be reached, in decreasing probability: P1,4,3,2; P3,4,1,2; P4,3,2; P1,3,2; P1,2; P4,3; and P1, P2, P3, P4 at the SI nodes. src: [LH02]]
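The backward propagation described above can be sketched as follows. The graph in this example is not the one from [LH02]: it is a made-up acyclic graph whose node names (A to E) and edge probabilities were chosen only so that the three paths from I10 to SI 3 reproduce the products of the worked example. The sketch also assumes that every SI node is a sink and that there are no loops.

```python
from collections import deque


def reach_probabilities(edges, si_nodes):
    """Propagate SI reach probabilities backwards through an acyclic CFG.

    edges maps (source, target) to the branch probability of that edge.
    Returns {node: {si: probability of reaching si from node}}.
    """
    succ_count, preds, nodes = {}, {}, set(si_nodes)
    for (u, v), p in edges.items():
        succ_count[u] = succ_count.get(u, 0) + 1
        preds.setdefault(v, []).append((u, p))
        nodes.update((u, v))

    reach = {n: {} for n in nodes}
    for s in si_nodes:
        reach[s][s] = 1.0               # an SI node reaches itself with 100%

    queue = deque(si_nodes)             # start from the SI nodes
    while queue:
        n = queue.popleft()
        for u, p in preds.get(n, []):
            for s, q in reach[n].items():
                reach[u][s] = reach[u].get(s, 0.0) + p * q
            succ_count[u] -= 1
            if succ_count[u] == 0:      # all successors processed
                queue.append(u)
    return reach


def ranked_sis(reach, node):
    """Reachable SIs of a node, most probable first."""
    return sorted(reach[node].items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical graph; only the three paths to SI3 mirror the slide example.
edges = {
    ("I10", "A"): 0.3, ("I10", "B"): 0.2, ("I10", "E"): 0.5,
    ("A", "C"): 0.4, ("A", "D"): 0.6,
    ("B", "D"): 0.8, ("B", "SI4"): 0.2,
    ("C", "SI3"): 0.4, ("C", "SI4"): 0.6,
    ("D", "SI3"): 0.4, ("D", "SI4"): 0.6,
    ("E", "SI4"): 1.0,
}
reach = reach_probabilities(edges, ["SI3", "SI4"])
print(round(reach["I10"]["SI3"], 3))               # 0.184
print([si for si, _ in ranked_sis(reach, "I10")])  # ['SI4', 'SI3']
```

Processing a node only after all of its successors guarantees that its reach probabilities are final before they are propagated further, which is the purpose of the queue discipline in the slide. Control-flow graphs with loops would need additional handling that this sketch does not attempt.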

Static Prefetching (cont'd)
• Depending on the capacity of the FPGA, limit the prefetches to e.g. the 2 most probable SIs (this affects I7, I8, I9, I10)
• Some prefetch instructions are redundant (e.g. due to previously executed prefetches; I1, I4)
  ◦ The prefetch at I2 cannot be removed, because the control flow may come from node I8 (no SI 2 at I8) or from I5 (SI 1 might have started first and needs to be aborted now)
  ◦ The prefetch at I6 may be removed even though I9 prefetches P3 first; here, it is not beneficial to abort P3 to start P4 from scratch
[Figure: annotated control-flow graph with the per-node prefetch lists P1,4,3,2; P3,4,1,2; P4,3,2; P1,3,2; P1,2; P4,3; and P1, P2, P3, P4 at the SI nodes. src: [LH02]]

5.3 Clock Frequency Variation
