SLIDE 1 The Future Directions of Dataflow-Based Reconfigurable Hardware Accelerators
Francesca Palumbo1, Claudio Rubattu1,2, Carlo Sau3, Tiziana Fanni3, Luigi Raffo3
1University of Sassari, PolComIng – Information Engineering Group 2University of Rennes, INSA Group 3University of Cagliari, Diee – Microelectronics and Bioengineering Group
Rennes, 12-14 December 2017
SLIDE 2 Outline
– Motivation and Approach – Current Functionalities and Future Directions
- Hardware-Software Partitioning
– Co-Processing Support and Automated Characterization
- Enhancing the MDC High-Level Synthesis Support
– Integration with the CAPH HLS engine
- Run-time Monitoring of CGR Accelerators
– Extension of PAPI for dataflow in CGR hardware
- Providing Further Degrees of Reconfigurability
– Mixed-Grain Reconfiguration Possibilities
SLIDE 4
MDC tool Summary
Motivations HIGH PERFORMANCE
real time, portability, long battery life
UP-TO-DATE SOLUTIONS
latest audio/video codecs, file formats...
MORE INTEGRATED FEATURES
MP3, Camera, Video, GPS...
MARKET DEMAND
convenient form factor, affordable price, fashion
SLIDE 5 MDC tool Summary
Approach
coarse grained substrate
C D A B C D A B
1:1
SLIDE 6 MDC tool Summary
Approach
coarse grained substrate
C D A B C D A B
1:1
coarse grained reconfigurable substrate
C D A B E D A
SB
E
SB
C D A B
2:1
SLIDE 7 Dynamic Power Manager Multi Dataflow Composer Tool Structural Profiler Co-Processor Generator
http://sites.unica.it/rpct/
MDC design suite
MDC tool Summary
Current Functionalities
SLIDE 8 Dynamic Power Manager Multi Dataflow Composer Tool Structural Profiler Co-Processor Generator
Functional Complexity Time to Market: Design & Mapping Automation
http://sites.unica.it/rpct/
MDC design suite
MDC tool Summary
Current Functionalities
SLIDE 9 Dynamic Power Manager Multi Dataflow Composer Tool Structural Profiler Co-Processor Generator
Functional Complexity Time to Market: Design & Mapping Automation Constraint Driven Optimisation
http://sites.unica.it/rpct/
MDC design suite
MDC tool Summary
Current Functionalities
SLIDE 10 Dynamic Power Manager Multi Dataflow Composer Tool Structural Profiler Co-Processor Generator
Power Efficiency Functional Complexity Time to Market: Design & Mapping Automation Constraint Driven Optimisation
http://sites.unica.it/rpct/
MDC design suite
MDC tool Summary
Current Functionalities
SLIDE 11 Dynamic Power Manager Multi Dataflow Composer Tool Structural Profiler Co-Processor Generator
Power Efficiency Functional Complexity Time to Market: Design & Mapping Automation Constraint Driven Optimisation
http://sites.unica.it/rpct/
Fast Integration and Prototyping
MDC design suite
MDC tool Summary
Current Functionalities
SLIDE 12
MDC tool Summary:
Future Directions MDC design suite
Dynamic Power Manager Baseline MDC Tool Structural Profiler Co-Processor Generator
SLIDE 13 MDC tool Summary:
Future Directions MDC design suite
Dynamic Power Manager Baseline MDC Tool Structural Profiler Co-Processor Generator
HW/SW Partitioning
SLIDE 14 MDC tool Summary:
Future Directions MDC design suite
Dynamic Power Manager Baseline MDC Tool Structural Profiler Co-Processor Generator
Enhancing HLS HW/SW Partitioning
SLIDE 15 Runtime Monitoring
MDC tool Summary:
Future Directions MDC design suite
Dynamic Power Manager Baseline MDC Tool Structural Profiler Co-Processor Generator
Enhancing HLS HW/SW Partitioning
SLIDE 16 Runtime Monitoring
MDC tool Summary:
Future Directions MDC design suite
Dynamic Power Manager Baseline MDC Tool Structural Profiler Co-Processor Generator
Enhancing HLS Reconfiguration Degrees HW/SW Partitioning
SLIDE 17 Outline
– Motivations and Approach – Current Functionalities and Future Directions
- Hardware-Software Partitioning
– Co-Processing Support and Automated Characterization
- Enhancing the MDC High-Level Synthesis Support
– Integration with the CAPH HLS engine
- Run-time Monitoring of CGR Accelerators
– Extension of PAPI for dataflow in CGR hardware
- Providing Further Degrees of Reconfigurability
– Mixed-Grain Reconfiguration Possibilities
SLIDE 18 Hardware-Software Partitioning
Co-Processing Support
MDC design suite Dynamic Power Manager Baseline MDC Tool (MDG+PC) Structural Profiler Co-Processor Generator
MDC is a dataflow-based design suite for the development of coarse-grained reconfigurable systems, with the capability of generating co-processing units.
SLIDE 19 Hardware-Software Partitioning
Co-Processing Support
MDC design suite Dynamic Power Manager Baseline MDC Tool (MDG+PC) Structural Profiler Co-Processor Generator
MDC is a dataflow-based design suite for the development of coarse-grained reconfigurable systems, with the capability of generating co-processing units.
- MDC assembles ready-to-use platform-dependent IPs
SLIDE 20 Hardware-Software Partitioning
Co-Processing Support
MDC design suite Dynamic Power Manager Baseline MDC Tool (MDG+PC) Structural Profiler Co-Processor Generator
MDC is a dataflow-based design suite for the development of coarse-grained reconfigurable systems, with the capability of generating co-processing units.
- MDC assembles ready-to-use platform-dependent IPs
- The designer can choose either memory-mapped or stream-based coupling.
SLIDE 21 Hardware-Software Partitioning
Automated Characterization
PREESM is a rapid prototyping tool that generates code for heterogeneous multi/many-core embedded systems. It provides mapping of actors to multiple processing cores, optimizing execution latency and balancing loads.
SLIDE 22 Hardware-Software Partitioning
Automated Characterization
PREESM is a rapid prototyping tool that generates code for heterogeneous multi/many-core embedded systems. It provides mapping of actors to multiple processing cores, optimizing execution latency and balancing loads.
- Model the costs of the available communication schemes and
co-processing units
SLIDE 23 Hardware-Software Partitioning
Automated Characterization
PREESM is a rapid prototyping tool that generates code for heterogeneous multi/many-core embedded systems. It provides mapping of actors to multiple processing cores, optimizing execution latency and balancing loads.
- Model the costs of the available communication schemes and
co-processing units
- Connect PREESM and MDC to delegate specific computations (an
actor, a network of actors or a set of networks) to the most suitable co-processing units
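A minimal sketch of the kind of cost model the two points above call for, assuming a simple "setup + per-token" cycle cost. The target names, the cost fields and every number below are hypothetical placeholders for illustration; they are not PREESM or MDC data structures or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One execution option for an actor (all costs in clock cycles, hypothetical)."""
    name: str
    compute_cycles_per_token: float
    setup_cycles: float               # fixed cost paid once per invocation
    transfer_cycles_per_token: float  # cost of moving one token to/from the unit


def total_cost(target: Target, tokens: int) -> float:
    """Estimated cycles to process `tokens` tokens on `target`."""
    per_token = target.compute_cycles_per_token + target.transfer_cycles_per_token
    return target.setup_cycles + tokens * per_token


def best_target(targets, tokens: int) -> Target:
    """Pick the cheapest option for the given workload size."""
    return min(targets, key=lambda t: total_cost(t, tokens))


# Illustrative numbers only: software has no transfer cost, the memory-mapped
# co-processor has a cheap setup but a costly per-token transfer, and the
# stream-based co-processor has an expensive setup but cheap streaming.
TARGETS = [
    Target("software core", 40.0, 0.0, 0.0),
    Target("co-processor, memory-mapped", 4.0, 50.0, 8.0),
    Target("co-processor, stream-based", 4.0, 400.0, 1.0),
]

for n in (1, 10, 1000):
    print(n, "->", best_target(TARGETS, n).name)
```

With these placeholder costs the preferred option shifts from software, to memory-mapped coupling, to stream-based coupling as the number of tokens grows, which is the trade-off an automated characterization would have to capture.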
SLIDE 24 Outline
– Approach – Baseline Functionality and Extensions
- Hardware-Software Partitioning
– Co-Processing Support and Automated Characterization
- Enhancing the MDC High-Level Synthesis Support
– Integration with the CAPH HLS engine
- Run-time Monitoring of CGR Accelerators
– Extension of PAPI for dataflow in CGR hardware
- Providing Further Degrees of Reconfigurability
– Mixed-Grain Reconfiguration Possibilities
SLIDE 25 Enhancing MDC High-Level Synthesis Support
Previous Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf TURNUS causation trace analysis worst case parsing script generation XRONOS high level synthesis
MDC back-end
IR.java multi-dataflow action weights
size per IR RVC-CAL dataflows multi-dataflow
HDL components library
RVC-CAL hardware protocol
CGR substrate S B
SLIDE 28 Enhancing MDC High-Level Synthesis Support
Previous Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf TURNUS causation trace analysis worst case parsing script generation XRONOS high level synthesis
MDC back-end
IR.java multi-dataflow action weights
size per IR RVC-CAL dataflows multi-dataflow
HDL components library
RVC-CAL hardware protocol
CGR substrate S B
- High-level synthesis supports FPGAs from a single vendor only (Xilinx)
SLIDE 29
Enhancing MDC High-Level Synthesis Support
CAPH
CAPH is a domain-specific language for describing and implementing stream-processing applications.
SLIDE 30
Enhancing MDC High-Level Synthesis Support
CAPH
CAPH is a domain-specific language for describing and implementing stream-processing applications.
- It relies upon the actor/dataflow model of computation
SLIDE 31
Enhancing MDC High-Level Synthesis Support
CAPH
CAPH is a domain-specific language for describing and implementing stream-processing applications.
- It relies upon the actor/dataflow model of computation
- It is capable of generating VHDL code
SLIDE 32
Enhancing MDC High-Level Synthesis Support
CAPH
CAPH is a domain-specific language for describing and implementing stream-processing applications.
- It relies upon the actor/dataflow model of computation
- It is capable of generating VHDL code
- It is platform agnostic
SLIDE 33 Enhancing MDC High-Level Synthesis Support
Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf generation
MDC back-end
IR.java multi-dataflow RVC-CAL dataflows HDL components library CGR substrate S B
.cph
CAPH dataflows
SLIDE 34 Enhancing MDC High-Level Synthesis Support
Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf generation
MDC back-end
IR.java multi-dataflow RVC-CAL dataflows HDL components library CGR substrate S B
CAPH-to-RVC-CAL .cph
CAPH dataflows
SLIDE 35 Enhancing MDC High-Level Synthesis Support
Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf
1
generation
MDC back-end
IR.java multi-dataflow RVC-CAL dataflows
FIFOs size per dataflow
HDL components library CGR substrate S B
CAPH SystemC synthesis and simulation CAPH-to-RVC-CAL .cph
CAPH dataflows
SLIDE 36 Enhancing MDC High-Level Synthesis Support
Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf
1
generation
MDC back-end
IR.java multi-dataflow RVC-CAL dataflows
FIFOs size per dataflow
HDL components library CGR substrate S B
CAPH SystemC synthesis and simulation worst case parsing script CAPH-to-RVC-CAL
multi-dataflow
.cph
CAPH dataflows
SLIDE 37 Enhancing MDC High-Level Synthesis Support
Fully Automated Flow
composition Orcc front-end .cal
MDC front-end
.xdf
1
generation CAPH High-Level Synthesis
MDC back-end
IR.java multi-dataflow RVC-CAL dataflows
FIFOs size per dataflow
HDL components library CAPH protocol CGR substrate S B
CAPH SystemC synthesis and simulation worst case parsing script CAPH-to-RVC-CAL
multi-dataflow
.cph
CAPH dataflows
SLIDE 39 Enhancing MDC High-Level Synthesis Support
Protocol Generalization
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>
CGR substrate CGR substrate
A B A B
SLIDE 40 Enhancing MDC High-Level Synthesis Support
Protocol Generalization
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>
CGR substrate CGR substrate
A
clk rst
B A B
SLIDE 41 reset
Enhancing MDC High-Level Synthesis Support
Protocol Generalization
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>
CGR substrate CGR substrate
A
clk rst
B
reset reset clock clock
A B
SLIDE 42 reset
Enhancing MDC High-Level Synthesis Support
Protocol Generalization
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>
CGR substrate CGR substrate
A
clk rst
B
reset reset clock clock
A B
clock reset FIFO_B
SLIDE 43 reset
Enhancing MDC High-Level Synthesis Support
Protocol Generalization
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>
CGR substrate CGR substrate
A
clk rst
B
reset reset clock clock
A B
clock reset FIFO_B FANOUT_A
SLIDE 44 reset din
Enhancing MDC High-Level Synthesis Support
Protocol Generalization
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>
CGR substrate CGR substrate
A
clk rst
B
reset reset clock clock
A B
dout wr full din wr full dout wr full din wr full dout rd empty din rd empty clock reset FIFO_B FANOUT_A
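A minimal sketch of how a back-end could read such a protocol description to learn which actor port maps to which network-level signal or channel. The XML below is a completed toy instance: the attribute names follow the slide, but the elided attributes are simply dropped, the value is_clock="true" and the reset entries are assumptions, and the parsing function is hypothetical rather than MDC code.

```python
import xml.etree.ElementTree as ET

# Toy protocol file. Attribute names come from the slide; the concrete values
# filled in here (is_clock="true", the reset signal) are illustrative assumptions.
PROTOCOL_XML = """
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock="true"/>
    <signal id="1" net_port="reset"/>
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock"/>
      <signal id="1" port="rst" net_port="reset"/>
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"/>
      <signal id="1" port="dout" channel="data"/>
      <signal id="2" port="wr" channel="en"/>
    </comm_signals>
  </actor>
</protocol>
""".strip()


def actor_port_map(xml_text: str):
    """Return (system map, communication map) for the <actor> section.

    system map:        actor port -> network-level port (e.g. clk -> clock)
    communication map: actor port -> logical channel    (e.g. din -> data)
    """
    root = ET.fromstring(xml_text)
    actor = root.find("actor")
    sys_map = {s.get("port"): s.get("net_port") for s in actor.find("sys_signals")}
    comm_map = {s.get("port"): s.get("channel") for s in actor.find("comm_signals")}
    return sys_map, comm_map


sys_map, comm_map = actor_port_map(PROTOCOL_XML)
print(sys_map)
print(comm_map)
```

Keeping the port-to-signal mapping in a data file rather than in the generator is what lets the same back-end wire actors that follow different handshake conventions.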
SLIDE 45
Enhancing MDC High-Level Synthesis Support
Prewitt/Sobel Multi-Flow Network
SLIDE 46 INPUT NETWORKS (provided by CAPH)
Enhancing MDC High-Level Synthesis Support
Prewitt/Sobel Multi-Flow Network
PREWITT NETWORK SOBEL NETWORK
SLIDE 47 OUTPUT NETWORK (provided by MDC) INPUT NETWORKS (provided by CAPH)
Enhancing MDC High-Level Synthesis Support
Prewitt/Sobel Multi-Flow Network
PREWITT NETWORK SOBEL NETWORK MERGED NETWORK
SLIDE 48
Enhancing MDC High-Level Synthesis Support
Preliminary Results
SLIDE 49 RESOURCES MDC+CAPH MDC+XRONOS XRONOS vs CAPH Altera Xilinx Altera Xilinx Altera Xilinx REG 1484 780
LOGIC 1047 2347
RAM 15
DSP 36 36
MAX FREQ [MHz] 105,80 93,69
EXEC TIME [cck] 15340 15340
Enhancing MDC High-Level Synthesis Support
Preliminary Results
FPGA - Altera (5SGSMD5) and Xilinx (XC7VX485T)
SLIDE 50 RESOURCES MDC+CAPH MDC+XRONOS XRONOS vs CAPH Altera Xilinx Altera Xilinx Altera Xilinx REG 1484 780
LOGIC 1047 2347
RAM 15
DSP 36 36
MAX FREQ [MHz] 105,80 93,69
EXEC TIME [cck] 15340 15340
Enhancing MDC High-Level Synthesis Support
Preliminary Results
Prewitt/Sobel Multi-Flow AREA [kGE] 269,82 466,90 (+73%) Max Freq [MHz] 417,36 399,04 (-4,4%)
ASIC - TSMC 45 nm CMOS technology FPGA - Altera (5SGSMD5) and Xilinx (XC7VX485T)
SLIDE 51 RESOURCES MDC+CAPH MDC+XRONOS XRONOS vs CAPH Altera Xilinx Altera Xilinx Altera Xilinx REG 1484 780
LOGIC 1047 2347
RAM 15
DSP 36 36
MAX FREQ [MHz] 105,80 93,69
EXEC TIME [cck] 15340 15340
Enhancing MDC High-Level Synthesis Support
Preliminary Results
Prewitt/Sobel Multi-Flow AREA [kGE] 269,82 466,90 (+73%) Max Freq [MHz] 417,36 399,04 (-4,4%)
ASIC - TSMC 45 nm CMOS technology FPGA - Altera (5SGSMD5) and Xilinx (XC7VX485T)
COMING SOON:
EXPLORATION ON THE BENEFITS OF DATAFLOW-BASED HLS IN CGR ARCHITECTURES ON THE ROAD
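The ASIC percentages above can be re-derived from the two reported values in each row. The sketch below assumes the first value in each pair is the baseline and the second is the one being compared against it; the extracted text does not state which flow each value belongs to, so that reading is an assumption.

```python
def relative_change_percent(baseline: float, other: float) -> float:
    """Relative difference of `other` with respect to `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


# Values from the ASIC row (decimal commas written as decimal points).
area_change = relative_change_percent(269.82, 466.90)  # area in kGE
freq_change = relative_change_percent(417.36, 399.04)  # max frequency in MHz

print(f"area: {area_change:+.1f}%")
print(f"max freq: {freq_change:+.1f}%")
```

Under that reading the rounded results match the +73% and -4,4% shown on the slide.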
SLIDE 52 Outline
– Motivations and Approach – Current Functionalities and Future Directions
- Hardware-Software Partitioning
– Co-Processing Support and Automated Characterization
- Enhancing the MDC High-Level Synthesis Support
– Integration with the CAPH HLS engine
- Run-time Monitoring of CGR Accelerators
– Extension of PAPI for dataflow in CGR hardware
- Providing Further Degrees of Reconfigurability
– Mixed-Grain Reconfiguration Possibilities
SLIDE 53 Run-time Monitoring of CGR Accelerators
PAPI for dataflow in software
PROCESSOR
C code processing
C D A B
dataflow application (RVC-CAL)
C code generation
SLIDE 54 Run-time Monitoring of CGR Accelerators
PAPI for dataflow in software
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
P M C
C D A B
dataflow application (RVC-CAL)
C code generation C code generation with PAPI
SLIDE 55 Run-time Monitoring of CGR Accelerators
PAPI for dataflow in software
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
[Plot: PAPI estimation, Power vs. Workpoint, series Est and Real]
P M C
C D A B
dataflow application (RVC-CAL)
@design time @run time
C code generation C code generation with PAPI
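A minimal sketch of how counter readings could be turned into an energy estimate, assuming a linear model whose per-event coefficients come from the board characterization step. The counter names, coefficient values and the linear form itself are illustrative assumptions; they are not taken from the slides or from the PAPI API.

```python
def estimate_energy_nj(counters: dict, coeffs_nj: dict, static_nj: float = 0.0) -> float:
    """Linear energy model: static term plus a weighted sum of event counts.

    counters  -- event name -> count read at run time
    coeffs_nj -- event name -> energy per event in nanojoules (from characterization)
    static_nj -- fixed energy contribution for the measured interval
    """
    missing = set(counters) - set(coeffs_nj)
    if missing:
        raise KeyError(f"no characterization data for: {sorted(missing)}")
    return static_nj + sum(count * coeffs_nj[name] for name, count in counters.items())


# Hypothetical readings for one actor firing and hypothetical board coefficients.
readings = {"instructions": 1000, "fp_ops": 200, "mem_accesses": 50}
board = {"instructions": 0.5, "fp_ops": 1.5, "mem_accesses": 4.0}

energy = estimate_energy_nj(readings, board, static_nj=100.0)
print(energy, "nJ")
```

The split mirrors the slide: the coefficients are fixed at design time, while the counters are read at run time.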
SLIDE 56 C D A B
Run-time Monitoring of CGR Accelerators
Extension of PAPI for dataflow in CGR hardware
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
[Plot: PAPI estimation, Power vs. Workpoint, series Est and Real]
P M C
C D A B
dataflow applications (RVC-CAL)
@run time
C code generation with PAPI
@design time
SLIDE 57 C D A B
Run-time Monitoring of CGR Accelerators
Extension of PAPI for dataflow in CGR hardware
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
[Plot: PAPI estimation, Power vs. Workpoint, series Est and Real]
P M C
C D A B
dataflow applications (RVC-CAL)
@run time CGR accelerator
SB SB 2
A B D
SB 1
F E C
configurator
sel0 sel1 sel2
ID
1 1 1
FIFO_A FIFO_B FIFO_E FIFO_C FIFO_F
MDC CGR accelerator generation C code generation with PAPI
@design time
SLIDE 58 C D A B
Run-time Monitoring of CGR Accelerators
Extension of PAPI for dataflow in CGR hardware
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
[Plot: PAPI estimation, Power vs. Workpoint, series Est and Real]
P M C
C D A B
dataflow applications (RVC-CAL)
@run time CGR accelerator
SB SB 2
A B D
SB 1
F E C
configurator
sel0 sel1 sel2
ID
1 1 1
FIFO_A FIFO_B FIFO_E FIFO_C FIFO_F
C code generation with PAPI
MDC CGR accelerator generation with PAPI
@design time
SLIDE 59 C D A B
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
P M C
C D A B
dataflow applications (RVC-CAL)
CGR accelerator
SB SB 2
A B D
SB 1
F E C
configurator
sel0 sel1 sel2
ID
1 1 1
FIFO_A FIFO_B FIFO_E FIFO_C FIFO_F
MDC CGR accelerator generation C code generation with PAPI
MDC CGR accelerator generation with PAPI
@design time
Run-time Monitoring of CGR Accelerators
Extension of PAPI for dataflow in CGR hardware
Configuration Manager
SLIDE 60 CGR accelerator
SB 2
A B D
SB 1
F E C
configurator
sel0 sel1 sel2
ID
1 1 1
FIFO_A FIFO_B FIFO_E FIFO_C FIFO_F
SB
MDC CGR accelerator generation MDC CGR accelerator generation with PAPI
C D A B
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
P M C
C D A B
dataflow applications (RVC-CAL)
C code generation with PAPI
@design time
Configuration Manager
α
Run-time Monitoring of CGR Accelerators
Extension of PAPI for dataflow in CGR hardware
SLIDE 61 CGR accelerator
SB 2
A B D
SB 1
F E C
configurator
sel0 sel1 sel2
ID
1 1 1
FIFO_A FIFO_B FIFO_E FIFO_C FIFO_F
SB
MDC CGR accelerator generation MDC CGR accelerator generation with PAPI
C D A B
PROCESSOR
C code processing PAPI registers reading:
- Total instructions
- Type of operations
- Memory usage
Energy estimation
Based on board characterization
P M C
C D A B
dataflow applications (RVC-CAL)
C code generation with PAPI
@design time
β
Run-time Monitoring of CGR Accelerators
Extension of PAPI for dataflow in CGR hardware
Configuration Manager
SLIDE 62 Outline
– Motivations and Approach – Current Functionalities and Future Directions
- Hardware-Software Partitioning
– Co-Processing Support and Automated Characterization
- Enhancing the MDC High-Level Synthesis Support
– Integration with the CAPH HLS engine
- Run-time Monitoring of CGR Accelerators
– Extension of PAPI for dataflow in CGR hardware
- Providing Further Degrees of Reconfigurability
– Mixed-Grain Reconfiguration Possibilities
SLIDE 63 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
PU PU 2 PU 5 PU 4 PU 1 PU 3
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
SLIDE 64 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
LUT 4x2
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
bit-level reconfiguration word-level reconfiguration
SLIDE 65 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
LUT 4x2
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
bit-level reconfiguration
- very flexible (any kind of HDL-defined system)
word-level reconfiguration
- limited flexibility (fixed set of predefined configurations)
SLIDE 66 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
LUT 4x2
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
bit-level reconfiguration
- very flexible (any kind of HDL-defined system)
- slow to configure (many switches and LUTs)
- big memory footprint (long configuration bitstream)
word-level reconfiguration
- limited flexibility (fixed set of predefined configurations)
- fast to configure (few switches)
- negligible memory footprint (log₂(#config) bits)
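The log₂(#config) figure for the coarse-grain case is the width of the configuration ID needed to select one of the predefined configurations. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def config_id_bits(num_configs: int) -> int:
    """Bits needed to encode one of `num_configs` IDs, i.e. ceil(log2(num_configs))."""
    if num_configs < 1:
        raise ValueError("at least one configuration is required")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    return (num_configs - 1).bit_length()


for n in (1, 2, 3, 4, 5, 8, 9):
    print(n, "configurations ->", config_id_bits(n), "bit(s)")
```

A handful of selector bits per context switch is the reason the coarse-grain footprint is called negligible next to a configuration bitstream.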
SLIDE 67 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
bit-level reconfiguration
- very flexible (any kind of HDL-defined system)
- slow to configure (many switches and LUTs)
- big memory footprint (long configuration bitstream)
word-level reconfiguration
- limited flexibility (fixed set of predefined configurations)
- fast to configure (few switches)
- negligible memory footprint (log₂(#config) bits)
DYNAMIC PARTIAL RECONFIGURATION (DPR): runtime reconfiguration of a region of the FPGA
.bit
system configurations
SLIDE 68 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
word-level reconfiguration
- limited flexibility (fixed set of predefined configurations)
- fast to configure (few switches)
- negligible memory footprint (log₂(#config) bits)
bit-level reconfiguration
- flexible (previously implemented HDL systems)
- configuration time typically in the order of milliseconds
- memory footprint to be considered
DYNAMIC PARTIAL RECONFIGURATION (DPR): runtime reconfiguration of a region of the FPGA
.bit
system configurations
SLIDE 69 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
word-level reconfiguration
- limited flexibility (fixed set of predefined configurations)
- fast to configure (few switches)
- negligible memory footprint (log₂(#config) bits)
- power consumption due to reconfiguration
bit-level reconfiguration
- flexible (previously implemented HDL systems)
- configuration time typically in the order of milliseconds
- memory footprint to be considered
- power consumption peak during reconfiguration
DYNAMIC PARTIAL RECONFIGURATION (DPR): runtime reconfiguration of a region of the FPGA
.bit
system configurations
SLIDE 70 LB LB 1 LB 2 LB 3 LB 4 LB 5 LB 6 LB 7 LB 8
Providing Further Degrees of Reconfigurability
Fine-Grain and Partial Reconfiguration
PU PU 2 PU 5 PU 4 PU 1 PU 3 datapath (mul, sh)
control (fsm)
/ 16 / 16
FINE-GRAIN RECONFIGURATION (FPGA) COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)
word-level reconfiguration
- limited flexibility (fixed set of predefined configurations)
- fast to configure (few switches)
- negligible memory footprint (log₂(#config) bits)
- power consumption due to reconfiguration
bit-level reconfiguration
- flexible (previously implemented HDL systems)
- configuration time typically in the order of milliseconds
- memory footprint to be considered
- power consumption peak during reconfiguration
DYNAMIC PARTIAL RECONFIGURATION (DPR): runtime reconfiguration of a region of the FPGA
.bit
system configurations
COMPLEMENTARITY
DPR: BIG change, BIG overhead
CGR: SMALL change, SMALL overhead
SLIDE 71 CG reconfigurable substrate
Providing Further Degrees of Reconfigurability
FG into CG reconfiguration
PU0 PU2 PU5 PU4 PU1 PU3
SLIDE 72 FPGA FG reconfigurable substrate
CG reconfigurable substrate
Providing Further Degrees of Reconfigurability
FG into CG reconfiguration
PU0 PU2 PU5 PU4 PU1 PU3
SLIDE 73 FPGA FG reconfigurable substrate
CG reconfigurable substrate
Providing Further Degrees of Reconfigurability
FG into CG reconfiguration
PU0 PU2 PU5 PU4 PU1 PU3 DPR subjected region
SLIDE 74 FPGA FG reconfigurable substrate
CG reconfigurable substrate
Providing Further Degrees of Reconfigurability
FG into CG reconfiguration
PU0 PU2 PU5 PU4 PU1 PU3 DPR subjected region
PU1a PU1b PU1c PU1d
SLIDE 75 FPGA FG reconfigurable substrate
CG reconfigurable substrate
Providing Further Degrees of Reconfigurability
FG into CG reconfiguration
PU0 PU2 PU5 PU4 PU1 PU3 DPR subjected region
PU1a PU1b PU1c PU1d
To be stored into the FPGA internal memory PU1 a.bit PU1 b.bit PU1 c.bit PU1 d.bit
SLIDE 76 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate
host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
SLIDE 77 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate
host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
SLIDE 78 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region RUNTIME:
t0: config FG = CG0
execute CG0
SLIDE 79 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 4
RUNTIME:
t0: config FG = CG0 t1: config CG = α
PU 1
execute α α execute CG0
SLIDE 80 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 4
RUNTIME:
t0: config FG = CG0 t1: config CG = α … SMALL context change …
PU 1
α
SLIDE 81 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 2 PU 4 PU 3
RUNTIME:
t0: config FG = CG0 t1: config CG = α … SMALL context change … t2: config CG = β
α execute β β
SLIDE 82 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 2 PU 4 PU 3
RUNTIME:
t0: config FG = CG0 t1: config CG = α … SMALL context change … t2: config CG = β … SMALL context change …
α β
SLIDE 83 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 2 PU 3
RUNTIME:
t0: config FG = CG0 t1: config CG = α … SMALL context change … t2: config CG = β … SMALL context change … t3: config CG = γ
PU 1 PU
α β execute γ γ
SLIDE 84 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 2 PU 3
RUNTIME:
t0: config FG = CG0 t1: config CG = α … SMALL context change … t2: config CG = β … SMALL context change … t3: config CG = γ … BIG context change …
PU 1 PU
α β γ
SLIDE 85 Providing Further Degrees of Reconfigurability
CG into FG reconfiguration
FPGA
FG reconfigurable substrate CG reconfigurable substrate CG0
PU PU 2 PU 5 PU 4 PU 1 PU 3 host processor
DDR
||||||||||||||||| ||||||||||||||||| ||||||| |||||||
BUS if REGS
LOCAL MEM
DMA
Ethernet CTRL
DPR subjected region
PU 2 PU 3
RUNTIME:
t0: config FG = CG0 t1: config CG = α … SMALL context change … t2: config CG = β … SMALL context change … t3: config CG = γ … BIG context change … t4: config FG = CG1
PU 1 PU
α β γ execute CG1 ω
PU 2
CG reconfigurable substrate CG1
PU 2 PU 3
PU 4
PU PU 1
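The t0 to t4 runtime sequence can be sketched as a two-level manager: a fine-grain (DPR) step loads a whole coarse-grain substrate only when the requested configuration is not available on the current one, and a coarse-grain step selects the configuration inside it. The class, its method names and the mapping of α, β, γ to CG0 and ω to CG1 are assumptions made for illustration, not part of MDC.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class MixedGrainManager:
    """Toy two-level reconfiguration manager (hypothetical, for illustration)."""
    substrates: Dict[str, Set[str]]   # CG substrate -> configurations it offers
    loaded: Optional[str] = None      # substrate currently in the DPR region
    active: Optional[str] = None      # configuration currently selected
    log: List[str] = field(default_factory=list)

    def _substrate_for(self, config: str) -> str:
        for name, configs in self.substrates.items():
            if config in configs:
                return name
        raise ValueError(f"no substrate implements {config!r}")

    def run(self, config: str) -> None:
        target = self._substrate_for(config)
        if target != self.loaded:
            # BIG context change: reload the DPR region with another substrate.
            self.log.append(f"config FG = {target}")
            self.loaded, self.active = target, None
        if config != self.active:
            # SMALL context change: switch configuration inside the substrate.
            self.log.append(f"config CG = {config}")
            self.active = config
        self.log.append(f"execute {config}")


manager = MixedGrainManager({"CG0": {"alpha", "beta", "gamma"}, "CG1": {"omega"}})
for cfg in ["alpha", "beta", "gamma", "omega"]:
    manager.run(cfg)
print("\n".join(manager.log))
```

In this run only two fine-grain reconfigurations are needed for four executions, which is the complementarity argument: DPR is reserved for the changes that coarse-grain switching cannot cover.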
SLIDE 86 Providing Further Degrees of Reconfigurability
CG into Artico³
Artico3 is a DPR-supporting architecture in charge of smartly managing performance, consumption and dependability.
- hardware acceleration
- hierarchical memory
- bus based, DMA enabled communication
SLIDE 87 Providing Further Degrees of Reconfigurability
CG into Artico³
Artico3 is a DPR-supporting architecture in charge of smartly managing performance, consumption and dependability.
- enhance flexibility by enabling CGR
within Artico3 slots
- hardware acceleration
- hierarchical memory
- bus based, DMA enabled communication
SLIDE 88 Providing Further Degrees of Reconfigurability
CG into Artico³
Artico3 is a DPR-supporting architecture in charge of smartly managing performance, consumption and dependability.
dataflow to facilitate/ automate programmability
- enhance flexibility by enabling CGR
within Artico3 slots
- hardware acceleration
- hierarchical memory
- bus based, DMA enabled communication
SLIDE 89 Providing Further Degrees of Reconfigurability
The big picture within CERBERO
PREESM: dataflow based HW/SW and FGR/CGR partitioning
SLIDE 91 Providing Further Degrees of Reconfigurability
The big picture within CERBERO
PREESM: dataflow based HW/SW and FGR/CGR partitioning PAPI: dataflow based runtime monitoring of the system to trigger reconfiguration
SLIDE 92 Providing Further Degrees of Reconfigurability
The big picture within CERBERO
ALPHA BETA GAMMA MULTI-FLOW
PREESM: dataflow based HW/SW and FGR/CGR partitioning PAPI: dataflow based runtime monitoring of the system to trigger reconfiguration
SLIDE 93 Providing Further Degrees of Reconfigurability
The big picture within CERBERO
ALPHA BETA GAMMA MULTI-FLOW
Performance Monitor Fault Monitor Accelerators (fine/coarse grain)
Monitoring Counters
Evaluate Monitors Output Fine/Coarse-grained accelerator reconfiguration
PREESM: dataflow based HW/SW and FGR/CGR partitioning PAPI: dataflow based runtime monitoring of the system to trigger reconfiguration
SLIDE 94
Thanks to the EU Commission for funding the CERBERO (Cross-layer modEl-based fRamework for multi-oBjective dEsign of Reconfigurable systems in unceRtain hybRid envirOnments) project as part of the H2020 Programme under grant agreement No 732105.
Coordinator: Michal Masin (IBM), michaelm@il.ibm.com
Scientific Coordinator: Francesca Palumbo (UniSS), fpalumbo@uniss.it
Innovation Manager: Katiuscia Zedda (Abinsula), katiuscia.zedda@abinsula.com
Dissemination-Communication Manager: Francesco Regazzoni (USI), francesco.regazzoni@usi.ch
www.cerbero-h2020.eu info@cerbero-h2020.eu @CERBERO_h2020