The Future Directions of Dataflow-Based Reconfigurable Hardware Accelerators



slide-1
SLIDE 1

The Future Directions of Dataflow-Based Reconfigurable Hardware Accelerators

Francesca Palumbo (1), Claudio Rubattu (1,2), Carlo Sau (3), Tiziana Fanni (3), Luigi Raffo (3)

(1) University of Sassari, PolComIng – Information Engineering Group
(2) University of Rennes, INSA Group
(3) University of Cagliari, Diee – Microelectronics and Bioengineering Group

Rennes, 12-14 December 2017

slide-2
SLIDE 2

Outline

  • MDC Tool Summary

– Motivation and Approach
– Current Functionalities and Future Directions

  • Hardware-Software Partitioning

– Co-Processing Support and Automated Characterization

  • Enhancing the MDC High-Level Synthesis Support

– Integration with the CAPH HLS engine

  • Run-time Monitoring of CGR Accelerators

– Extension of PAPI for dataflow in CGR hardware

  • Providing Further Degrees of Reconfigurability

– Mixed-Grain Reconfiguration Possibilities

slide-4
SLIDE 4

MDC Tool Summary

Motivations

  • High performance: real time, portability, long battery life
  • Up-to-date solutions: latest audio/video codecs, file formats...
  • More integrated features: MP3, camera, video, GPS...
  • Market demand: convenient form factor, affordable price, fashion

slide-6
SLIDE 6

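MDC Tool Summary

Approach

[Figure: a dataflow with actors A, B, C, D mapped 1:1 onto a coarse-grained substrate; two dataflows (actors A, B, C, D and actors A, D, E) mapped 2:1 onto a single coarse-grained reconfigurable substrate that contains A, B, C, D, E and two SB blocks.]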
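The 2:1 mapping can be illustrated with a minimal sketch. It assumes each input dataflow is a simple chain of actors and that a switching element is needed wherever a shared actor has different neighbours in different dataflows. The topologies of `alpha` and `beta`, the function name `merge_chains` and the returned structure are hypothetical; this is not the MDC merging algorithm itself.

```python
def merge_chains(networks):
    """Merge actor chains and report where switching elements are needed.

    `networks` maps a configuration name to a list of (source, destination)
    edges. Returns (actors, switch_after, switch_before).
    """
    actors = set()
    successors = {}
    predecessors = {}
    for edges in networks.values():
        for src, dst in edges:
            actors.update((src, dst))
            successors.setdefault(src, set()).add(dst)
            predecessors.setdefault(dst, set()).add(src)

    # A shared actor that feeds different consumers needs a switch on its
    # output; one fed by different producers needs a switch on its input.
    switch_after = sorted(a for a, s in successors.items() if len(s) > 1)
    switch_before = sorted(a for a, p in predecessors.items() if len(p) > 1)
    return sorted(actors), switch_after, switch_before


# Hypothetical topologies for the two dataflows.
networks = {
    "alpha": [("A", "B"), ("B", "C"), ("C", "D")],
    "beta": [("A", "E"), ("E", "D")],
}
actors, switch_after, switch_before = merge_chains(networks)
print(actors, switch_after, switch_before)
```

With these assumed chains, the shared actors A and D are instantiated once and two switching elements are enough to select between the two paths.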

slide-11
SLIDE 11

MDC Tool Summary

Current Functionalities

MDC design suite (http://sites.unica.it/rpct/):

  • Multi Dataflow Composer Tool
  • Structural Profiler
  • Dynamic Power Manager
  • Co-Processor Generator

Design goals addressed by the suite:

  • Functional Complexity
  • Time to Market: Design & Mapping Automation
  • Constraint Driven Optimisation
  • Power Efficiency
  • Fast Integration and Prototyping

slide-16
SLIDE 16

MDC Tool Summary

Future Directions

MDC design suite: Baseline MDC Tool, Structural Profiler, Dynamic Power Manager, Co-Processor Generator

Planned extensions:

  • HW/SW Partitioning
  • Enhancing HLS
  • Runtime Monitoring
  • Reconfiguration Degrees

slide-17
SLIDE 17

Outline

  • MDC Tool Summary

– Motivations and Approach
– Current Functionalities and Future Directions

  • Hardware-Software Partitioning

– Co-Processing Support and Automated Characterization

  • Enhancing the MDC High-Level Synthesis Support

– Integration with the CAPH HLS engine

  • Run-time Monitoring of CGR Accelerators

– Extension of PAPI for dataflow in CGR hardware

  • Providing Further Degrees of Reconfigurability

– Mixed-Grain Reconfiguration Possibilities

slide-20
SLIDE 20

Hardware-Software Partitioning

Co-Processing Support

MDC design suite: Baseline MDC Tool (MDG+PC), Structural Profiler, Dynamic Power Manager, Co-Processor Generator

MDC is a dataflow-based design suite for the development of coarse-grained reconfigurable systems, with the capability of generating co-processing units.

  • MDC assembles ready-to-use, platform-dependent IPs.
  • The designer can choose between memory-mapped and stream-based coupling.

slide-23
SLIDE 23

Hardware-Software Partitioning

Automated Characterization

PREESM is a rapid prototyping tool that generates code for heterogeneous multi/many-core embedded systems. It maps actors to multiple processing cores, optimizing execution latency and balancing loads.

  • Model the costs of the available communication schemes and co-processing units.
  • Connect PREESM and MDC to delegate specific computations (an actor, a network of actors or a set of networks) to the most suitable co-processing units.
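The idea of modelling communication and co-processing costs to pick the most suitable unit can be sketched as follows. This is only an illustration under stated assumptions: the `Target` class, the additive cost formula and every cycle count are hypothetical and do not come from PREESM or MDC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One place where an actor could run, with an assumed cost model."""
    name: str
    compute_cycles: int     # cycles spent computing (hypothetical)
    setup_cycles: int       # fixed cost to start a transfer (hypothetical)
    cycles_per_word: float  # transfer cost per data word (hypothetical)

    def cost(self, words: int) -> float:
        return self.compute_cycles + self.setup_cycles + self.cycles_per_word * words


def best_target(targets, words):
    """Return the target with the lowest estimated cost for `words` words."""
    return min(targets, key=lambda t: t.cost(words))


targets = [
    Target("software core", 12000, 0, 0.0),
    Target("co-processor, memory-mapped", 1500, 50, 4.0),
    Target("co-processor, stream-based", 1500, 600, 1.0),
]

for words in (64, 1024):
    print(words, best_target(targets, words).name)
```

Under these assumed numbers the memory-mapped coupling wins for small transfers and the stream-based coupling wins once the per-word cost dominates, which is the kind of trade-off an automated characterization would need to capture.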

slide-24
SLIDE 24

Outline

  • MDC Tool Summary

– Approach
– Baseline Functionality and Extensions

  • Hardware-Software Partitioning

– Co-Processing Support and Automated Characterization

  • Enhancing the MDC High-Level Synthesis Support

– Integration with the CAPH HLS engine

  • Run-time Monitoring of CGR Accelerators

– Extension of PAPI for dataflow in CGR hardware

  • Providing Further Degrees of Reconfigurability

– Mixed-Grain Reconfiguration Possibilities

slide-28
SLIDE 28

Enhancing MDC High-Level Synthesis Support

Previous Fully Automated Flow

[Figure: tool flow. Blocks: Orcc front-end; MDC front-end (composition, optimisation); TURNUS (causation trace analysis); worst-case parsing script; XRONOS (high-level synthesis); MDC back-end (generation). Artefacts: RVC-CAL dataflows (.cal, .xdf); IR.java; multi-dataflow; action weights; optimal FIFO sizes per IR; optimal FIFO sizes for the multi-dataflow; HDL components library; RVC-CAL hardware protocol; CGR substrate with SB blocks.]

  • High-level synthesis supports only FPGAs from one specific vendor (Xilinx).

slide-32
SLIDE 32
Enhancing MDC High-Level Synthesis Support

CAPH

  • CAPH is a domain-specific language for describing and implementing stream-processing applications.
  • It relies upon the actor/dataflow model of computation.
  • It is capable of generating VHDL code.
  • It is platform agnostic.
slide-38
SLIDE 38

Enhancing MDC High-Level Synthesis Support

Fully Automated Flow

[Figure: tool flow with CAPH. Blocks: CAPH-to-RVC-CAL conversion; Orcc front-end; MDC front-end (composition, optimisation); CAPH SystemC synthesis and simulation; worst-case parsing script; CAPH High-Level Synthesis; MDC back-end (generation). Artefacts: CAPH dataflows (.cph); RVC-CAL dataflows (.cal, .xdf); IR.java; multi-dataflow; optimal FIFO sizes per dataflow; optimal FIFO sizes for the multi-dataflow; HDL components library; CAPH protocol; CGR substrate with SB blocks.]

slide-44
SLIDE 44

Enhancing MDC High-Level Synthesis Support

Protocol Generalization

<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock=""…></signal>
    …
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock" …></signal>
      …
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"…></signal>
      <signal id="1" port="dout" channel="data"…></signal>
      <signal id="2" port="wr" channel="en"…></signal>
      …
    </comm_signals>
  </actor>
  <predecessor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </predecessor>
  <successor>
    <sys_signals>…</sys_signals>
    <comm_signals>…</comm_signals>
  </successor>
</protocol>

[Figure: actors A and B on the CGR substrate, before and after applying the protocol description: the system signals (clock, reset) are wired to the actor ports (clk, rst), and the communication signals (din, dout, wr, rd, full, empty) are wired between the actors and the FIFO_B and FANOUT_A modules.]
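As a rough illustration of how such a protocol description could be consumed, the sketch below parses a small, complete XML document shaped like the one on the slide and extracts the actor's port mappings. The element and attribute names follow the slide; the `reset` signal entries, the `is_clock="true"` value and the helper name `actor_port_map` are assumptions made so that the example is self-contained, not the actual MDC file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical, fully specified instance of the protocol description.
PROTOCOL_XML = """
<protocol>
  <sys_signals>
    <signal id="0" net_port="clock" is_clock="true"/>
    <signal id="1" net_port="reset"/>
  </sys_signals>
  <actor>
    <sys_signals>
      <signal id="0" port="clk" net_port="clock"/>
      <signal id="1" port="rst" net_port="reset"/>
    </sys_signals>
    <comm_signals>
      <signal id="0" port="din" channel="data"/>
      <signal id="1" port="dout" channel="data"/>
      <signal id="2" port="wr" channel="en"/>
    </comm_signals>
  </actor>
</protocol>
"""


def actor_port_map(xml_text):
    """Return (system ports -> network ports, communication ports -> channels)."""
    root = ET.fromstring(xml_text.strip())
    actor = root.find("actor")
    sys_ports = {
        s.get("port"): s.get("net_port")
        for s in actor.find("sys_signals").findall("signal")
    }
    comm_ports = {
        s.get("port"): s.get("channel")
        for s in actor.find("comm_signals").findall("signal")
    }
    return sys_ports, comm_ports


sys_ports, comm_ports = actor_port_map(PROTOCOL_XML)
print(sys_ports)
print(comm_ports)
```

Keeping the port names in a description file rather than in the generator is what lets the same back-end wire actors that follow different handshake conventions.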

slide-47
SLIDE 47

Enhancing MDC High-Level Synthesis Support

Prewitt/Sobel Multi-Flow Network

[Figure: the input networks (provided by CAPH), PREWITT NETWORK and SOBEL NETWORK, and the output network (provided by MDC), MERGED NETWORK.]

slide-51
SLIDE 51

Enhancing MDC High-Level Synthesis Support

Preliminary Results

FPGA - Altera (5SGSMD5) and Xilinx (XC7VX485T)

RESOURCES                  MDC+CAPH            MDC+XRONOS          XRONOS vs CAPH
                           Altera    Xilinx    Altera    Xilinx    Altera    Xilinx
REG                        1484      780       -         632       -         -18.97%
LOGIC                      1047      2347      -         1533      -         -34.68%
RAM                        15        -         -         6.5       -         +100%
DSP                        36        36        -         -         -         100%
MAX FREQ [MHz]             105.80    93.69     -         142.86    -         +58.50%
EXEC TIME [clock cycles]   15340     15340     -         15348     -         +0.05%

(- = not reported)

ASIC - TSMC 45 nm CMOS technology

Prewitt/Sobel Multi-Flow   MDC+CAPH    MDC+XRONOS
AREA [kGE]                 269.82      466.90 (+73%)
MAX FREQ [MHz]             417.36      399.04 (-4.4%)

COMING SOON:

An exploration of the benefits of dataflow-based HLS in CGR architectures is on the road.
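The "XRONOS vs CAPH" figures can be re-derived as relative changes with respect to the CAPH value. The sketch below does this for a few rows, assuming that each pair is (MDC+CAPH, MDC+XRONOS) and that the FPGA pairs are the Xilinx columns; the function and dictionary names are illustrative only.

```python
def relative_change(reference, value):
    """Percentage change of `value` with respect to `reference`."""
    return (value - reference) / reference * 100.0


# (MDC+CAPH, MDC+XRONOS) pairs taken from the results slide.
xilinx_fpga = {
    "REG": (780, 632),
    "LOGIC": (2347, 1533),
    "EXEC TIME": (15340, 15348),
}
asic = {
    "AREA [kGE]": (269.82, 466.90),
    "MAX FREQ [MHz]": (417.36, 399.04),
}

for table in (xilinx_fpga, asic):
    for metric, (caph, xronos) in table.items():
        print(f"{metric}: {relative_change(caph, xronos):+.2f}%")
```

Read this way, the register and logic rows are reductions for XRONOS on the Xilinx device, while the ASIC area grows by roughly 73%.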

slide-52
SLIDE 52

Outline

  • MDC Tool Summary

– Motivations and Approach
– Current Functionalities and Future Directions

  • Hardware-Software Partitioning

– Co-Processing Support and Automated Characterization

  • Enhancing the MDC High-Level Synthesis Support

– Integration with the CAPH HLS engine

  • Run-time Monitoring of CGR Accelerators

– Extension of PAPI for dataflow in CGR hardware

  • Providing Further Degrees of Reconfigurability

– Mixed-Grain Reconfiguration Possibilities

slide-55
SLIDE 55

Run-time Monitoring of CGR Accelerators

PAPI for dataflow in software

@design time: the dataflow application (RVC-CAL, actors A, B, C, D) goes through C code generation, extended to C code generation with PAPI.

@run time: the PROCESSOR runs the C code and reads the PAPI registers, backed by the performance monitoring counters (PMC):

  • Total instructions
  • Type of operations
  • Memory usage

Energy estimation: based on board characterization.

[Figure: power per workpoint (workpoints 1 to 4, power axis ticks 8 to 18), PAPI estimation (Est) against the measured value (Real).]
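One common way to turn counter readings into an energy figure is a linear model whose coefficients come from a board characterization. The sketch below assumes such a model; it does not call PAPI (the counter values are passed in as a plain dictionary), and the event names, per-event energies and static power are all hypothetical.

```python
def estimate_energy_nj(counters, energy_per_event_nj, static_power_mw, elapsed_ms):
    """Estimate energy in nanojoules from event counts.

    dynamic part: sum over events of (count * energy per event)
    static part:  static power * elapsed time (1 mW * 1 ms = 1000 nJ)
    """
    dynamic_nj = sum(
        energy_per_event_nj[name] * count for name, count in counters.items()
    )
    static_nj = static_power_mw * elapsed_ms * 1000.0
    return dynamic_nj + static_nj


# Hypothetical counter readings and characterization coefficients.
counters = {"instructions": 1_000_000, "memory_accesses": 200_000}
energy_per_event_nj = {"instructions": 0.5, "memory_accesses": 2.0}

total_nj = estimate_energy_nj(
    counters, energy_per_event_nj, static_power_mw=100, elapsed_ms=10
)
print(f"{total_nj / 1e6:.2f} mJ")
```

The estimate is only as good as the characterization behind the coefficients, which is why the slide compares the estimated and measured power at several workpoints.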

slide-60
SLIDE 60

Run-time Monitoring of CGR Accelerators

Extension of PAPI for dataflow in CGR hardware

@design time, from the dataflow applications (RVC-CAL):

  • C code generation with PAPI
  • MDC CGR accelerator generation, extended to MDC CGR accelerator generation with PAPI
  • FIFOs operation

PROCESSOR: C code processing and PAPI registers reading, backed by the performance monitoring counters (PMC):

  • Total instructions
  • Type of operations
  • Memory usage

Energy estimation: based on board characterization.

[Figure: CGR accelerator with actors A, B, C, D, E, F, SB blocks (SB, SB 1, SB 2), FIFOs (FIFO_A, FIFO_B, FIFO_C, FIFO_E, FIFO_F) and a configurator that takes the configuration ID and drives sel0, sel1, sel2. The Configuration Manager selects configuration α.]

slide-61
SLIDE 61

Run-time Monitoring of CGR Accelerators

Extension of PAPI for dataflow in CGR hardware

[Figure: the same processor and CGR accelerator view (actors A to F, SB blocks, FIFO_A, FIFO_B, FIFO_C, FIFO_E, FIFO_F, configurator with ID and sel0, sel1, sel2, PAPI registers reading, energy estimation based on board characterization), with the Configuration Manager selecting configuration β.]
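To make "FIFOs operation" monitoring concrete, the sketch below models a bounded FIFO that counts the events a hardware monitor could expose to software. It is a behavioural illustration only: the class name, the counter names and the chosen events are assumptions, and no PAPI interface is shown.

```python
from collections import deque


class MonitoredFifo:
    """Bounded FIFO that counts writes, reads and stalls."""

    def __init__(self, name, depth):
        self.name = name
        self.depth = depth
        self._tokens = deque()
        self.counters = {
            "writes": 0,
            "reads": 0,
            "full_stalls": 0,
            "empty_stalls": 0,
        }

    def write(self, token):
        """Store a token; return False and count a stall if the FIFO is full."""
        if len(self._tokens) >= self.depth:
            self.counters["full_stalls"] += 1
            return False
        self._tokens.append(token)
        self.counters["writes"] += 1
        return True

    def read(self):
        """Return the oldest token, or None (counting a stall) if empty."""
        if not self._tokens:
            self.counters["empty_stalls"] += 1
            return None
        self.counters["reads"] += 1
        return self._tokens.popleft()


fifo_b = MonitoredFifo("FIFO_B", depth=2)
accepted = [fifo_b.write(token) for token in (10, 20, 30)]
drained = [fifo_b.read() for _ in range(3)]
print(fifo_b.name, fifo_b.counters)
```

Counters of this kind would let the same energy-estimation approach used for the processor be applied to the accelerator, since FIFO traffic reflects how much work each configuration performs.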

slide-62
SLIDE 62

Outline

  • MDC Tool Summary

– Motivations and Approach
– Current Functionalities and Future Directions

  • Hardware-Software Partitioning

– Co-Processing Support and Automated Characterization

  • Enhancing the MDC High-Level Synthesis Support

– Integration with the CAPH HLS engine

  • Run-time Monitoring of CGR Accelerators

– Extension of PAPI for dataflow in CGR hardware

  • Providing Further Degrees of Reconfigurability

– Mixed-Grain Reconfiguration Possibilities

slide-66
SLIDE 66

Providing Further Degrees of Reconfigurability

Fine-Grain and Partial Reconfiguration

FINE-GRAIN RECONFIGURATION (FPGA)

[Figure: array of logic blocks (LB, LB 1 to LB 8), with a 4x2 LUT shown as detail.]

  • bit-level reconfiguration
  • very flexible (any kind of HDL-defined system)
  • slow to configure (many switches and LUTs)
  • big memory footprint (long configuration bitstream)

COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)

[Figure: processing units (PU, PU 1 to PU 5) with datapath (mul, sh) and control (fsm) blocks, connected by 16-bit links.]

  • word-level reconfiguration
  • low flexibility (fixed set of predefined configurations)
  • fast to configure (few switches)
  • negligible memory footprint (log₂(#config) bits)
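The log₂(#config) footprint can be checked with a few lines of arithmetic. The sketch below computes the number of selection bits needed for a given number of predefined configurations and compares it with a bitstream size; the 1 MiB bitstream is a hypothetical figure chosen only to show the order-of-magnitude gap.

```python
def cg_config_bits(num_configs):
    """Bits needed to select one of `num_configs` predefined configurations.

    Equal to ceil(log2(num_configs)), computed with integers only.
    """
    if num_configs < 1:
        raise ValueError("at least one configuration is required")
    return (num_configs - 1).bit_length()


# Hypothetical fine-grain bitstream size, for comparison only.
BITSTREAM_BITS = 1 * 1024 * 1024 * 8

for n in (2, 3, 8, 9):
    bits = cg_config_bits(n)
    print(f"{n} configurations -> {bits} bit(s), "
          f"{BITSTREAM_BITS // bits}x smaller than the assumed bitstream")
```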

slide-70
SLIDE 70

Providing Further Degrees of Reconfigurability

Fine-Grain and Partial Reconfiguration

[Figure: logic blocks (LB, LB 1 to LB 8) for the FPGA; processing units (PU, PU 1 to PU 5) with datapath (mul, sh), control (fsm) and 16-bit links for the MDC accelerator.]

DYNAMIC PARTIAL RECONFIGURATION (DPR): runtime reconfiguration of only a well-defined region of the FPGA, using the stored system configurations (.bit).

FINE-GRAIN RECONFIGURATION (FPGA)

  • bit-level reconfiguration
  • flexible (HDL systems previously implemented)
  • configuration time typically on the order of ms
  • memory footprint to be considered
  • power consumption peak during reconfiguration

COARSE-GRAIN RECONFIGURATION (MDC ACCELERATOR)

  • word-level reconfiguration
  • low flexibility (fixed set of predefined configurations)
  • fast to configure (few switches)
  • negligible memory footprint (log₂(#config) bits)
  • power consumption due to reconfiguration

COMPLEMENTARITY

  • DPR: BIG change, BIG overhead
  • CGR: SMALL change, SMALL overhead

slide-75
SLIDE 75

FPGA FG reconfigurable substrate

CG reconfigurable substrate

Providing Further Degrees of Reconfigurability

FG into CG reconfiguration

PU0 PU2 PU5 PU4 PU1 PU3 DPR subjected region

PU1a PU1b PU1c PU1d

To be stored in the FPGA internal memory: PU1a.bit, PU1b.bit, PU1c.bit, PU1d.bit
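
A minimal sketch of how the stored variants could be tracked in software, assuming one partial bitstream per PU1 variant. The class, the store layout and the file names are illustrative assumptions; no real FPGA configuration API is called.

```python
# Hypothetical store: PU1 variant name -> partial bitstream file name.
PU1_BITSTREAMS = {v: f"{v}.bit" for v in ("PU1a", "PU1b", "PU1c", "PU1d")}


class DprRegion:
    """Models the DPR region that hosts PU1 (illustrative only)."""

    def __init__(self, bitstreams):
        self.bitstreams = dict(bitstreams)
        self.loaded = None   # variant currently configured in the region
        self.load_count = 0  # number of partial reconfigurations performed

    def load(self, variant: str) -> bool:
        """Load a variant; return True only if a reconfiguration was needed."""
        if variant not in self.bitstreams:
            raise KeyError(f"no bitstream stored for {variant}")
        if variant == self.loaded:
            return False     # already in place, no DPR overhead paid
        self.loaded = variant
        self.load_count += 1
        return True


region = DprRegion(PU1_BITSTREAMS)
for requested in ("PU1a", "PU1a", "PU1c"):
    region.load(requested)
print(region.loaded, region.load_count)
```

Skipping the load when the requested variant is already in place matters because each partial reconfiguration carries the time and power overhead discussed earlier.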

slide-76
SLIDE 76

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL

slide-77
SLIDE 77

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL

slide-78
SLIDE 78

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL

RUNTIME:

  • t0: config FG = CG0, execute CG0

slide-79
SLIDE 79

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 4, PU 1

RUNTIME:

  • t0: config FG = CG0, execute CG0
  • t1: config CG = α, execute α

Timeline: α

slide-80
SLIDE 80

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 4, PU 1

RUNTIME:

  • t0: config FG = CG0
  • t1: config CG = α
    … SMALL context change …

Timeline: α

slide-81
SLIDE 81

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 2, PU 4, PU 3

RUNTIME:

  • t0: config FG = CG0
  • t1: config CG = α
    … SMALL context change …
  • t2: config CG = β, execute β

Timeline: α, β

slide-82
SLIDE 82

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 2, PU 4, PU 3

RUNTIME:

  • t0: config FG = CG0
  • t1: config CG = α
    … SMALL context change …
  • t2: config CG = β
    … SMALL context change …

Timeline: α, β

slide-83
SLIDE 83

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 2, PU 3, PU 1, PU

RUNTIME:

  • t0: config FG = CG0
  • t1: config CG = α
    … SMALL context change …
  • t2: config CG = β
    … SMALL context change …
  • t3: config CG = γ, execute γ

Timeline: α, β, γ

slide-84
SLIDE 84

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 2, PU 3, PU 1, PU

RUNTIME:

  • t0: config FG = CG0
  • t1: config CG = α
    … SMALL context change …
  • t2: config CG = β
    … SMALL context change …
  • t3: config CG = γ
    … BIG context change …

Timeline: α, β, γ

slide-85
SLIDE 85

Providing Further Degrees of Reconfigurability

CG into FG reconfiguration

Figure labels: FPGA, FG reconfigurable substrate, DPR subjected region, CG reconfigurable substrate CG0 (PU 0, PU 2, PU 5, PU 4, PU 1, PU 3), host processor, DDR, BUS if, REGS, LOCAL MEM, DMA, Ethernet CTRL; additional PU labels: PU 2, PU 3, PU 1, PU; CG reconfigurable substrate CG1 (PU 2, PU 2, PU 3, PU 4, PU, PU 1)

RUNTIME:

  • t0: config FG = CG0
  • t1: config CG = α
    … SMALL context change …
  • t2: config CG = β
    … SMALL context change …
  • t3: config CG = γ
    … BIG context change …
  • t4: config FG = CG1, execute CG1

Timeline: α, β, γ, ω
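
The runtime sequence above can be sketched as a two-level manager that pays a BIG cost only when the coarse-grain substrate itself must be swapped through DPR, and a SMALL cost when only the configuration inside the current substrate changes. The class name, the cost values, the ASCII names for α, β, γ, ω and the assignment of ω to CG1 are assumptions made for illustration; the slides only distinguish SMALL from BIG context changes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical relative costs: the slides only say SMALL vs BIG.
SMALL_COST, BIG_COST = 1, 1000

# Assumed grouping: alpha, beta, gamma run on substrate CG0; omega on CG1.
SUBSTRATES = {"CG0": {"alpha", "beta", "gamma"}, "CG1": {"omega"}}


@dataclass
class MixedGrainManager:
    substrate: Optional[str] = None  # CG substrate currently loaded via DPR
    config: Optional[str] = None     # configuration selected inside it
    trace: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, config: str) -> int:
        """Reconfigure only as much as needed, execute, return the cost paid."""
        target = next((s for s, c in SUBSTRATES.items() if config in c), None)
        if target is None:
            raise ValueError(f"unknown configuration: {config}")
        cost = 0
        if target != self.substrate:   # BIG context change (FG level, DPR)
            self.substrate, self.config = target, None
            self.trace.append(("config FG", target))
            cost += BIG_COST
        if config != self.config:      # SMALL context change (CG level)
            self.config = config
            self.trace.append(("config CG", config))
            cost += SMALL_COST
        self.trace.append(("execute", config))
        return cost


manager = MixedGrainManager()
costs = [manager.run(c) for c in ("alpha", "beta", "gamma", "omega")]
print(costs)
```

In this model the first and last steps are expensive because they load a substrate, while switching among α, β and γ inside CG0 stays cheap, which is the complementarity the slides point to.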

slide-86
SLIDE 86

Providing Further Degrees of Reconfigurability

CG into Artico³

Artico³ is a DPR-supporting architecture in charge of smartly managing performance, consumption and dependability.

  • hardware acceleration
  • hierarchical memory
  • bus based, DMA enabled communication
slide-87
SLIDE 87

Providing Further Degrees of Reconfigurability

CG into Artico³

Artico³ is a DPR-supporting architecture in charge of smartly managing performance, consumption and dependability.

  • enhance flexibility by enabling CGR within Artico³ slots

  • hardware acceleration
  • hierarchical memory
  • bus based, DMA enabled communication
slide-88
SLIDE 88

Providing Further Degrees of Reconfigurability

CG into Artico³

Artico³ is a DPR-supporting architecture in charge of smartly managing performance, consumption and dependability.

  • exploit dataflow to facilitate/automate programmability
  • enhance flexibility by enabling CGR within Artico³ slots

  • hardware acceleration
  • hierarchical memory
  • bus based, DMA enabled communication
slide-89
SLIDE 89

Providing Further Degrees of Reconfigurability

The big picture within CERBERO

PREESM: dataflow-based HW/SW and FGR/CGR partitioning

slide-91
SLIDE 91

Providing Further Degrees of Reconfigurability

The big picture within CERBERO

  • PREESM: dataflow-based HW/SW and FGR/CGR partitioning
  • PAPI: dataflow-based runtime monitoring of the system to trigger reconfiguration

slide-92
SLIDE 92

Providing Further Degrees of Reconfigurability

The big picture within CERBERO

Figure labels: ALPHA, BETA, GAMMA, MULTI-FLOW

  • PREESM: dataflow-based HW/SW and FGR/CGR partitioning
  • PAPI: dataflow-based runtime monitoring of the system to trigger reconfiguration

slide-93
SLIDE 93

Providing Further Degrees of Reconfigurability

The big picture within CERBERO

Figure labels: ALPHA, BETA, GAMMA, MULTI-FLOW; Performance Monitor, Fault Monitor, Accelerators (fine/coarse grain), Monitoring Counters, Evaluate Monitors Output, Fine/Coarse-grained accelerator reconfiguration

  • PREESM: dataflow-based HW/SW and FGR/CGR partitioning
  • PAPI: dataflow-based runtime monitoring of the system to trigger reconfiguration
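
The monitoring loop (read counters, evaluate the monitors' output, trigger a fine- or coarse-grained reconfiguration) could look like the sketch below. The counter names, the threshold and the decision policy are purely illustrative assumptions; this is not the PAPI API and not the policy used in CERBERO.

```python
from typing import Dict, Optional


def evaluate_monitors(counters: Dict[str, float],
                      min_throughput: float) -> Optional[str]:
    """Return which kind of reconfiguration to trigger, if any.

    Assumed policy, for illustration only:
      - any reported fault     -> "fine"   (reload a region through DPR)
      - throughput below target -> "coarse" (switch CGR configuration)
      - otherwise              -> None     (keep the current setup)
    """
    if counters.get("faults", 0) > 0:
        return "fine"
    if counters.get("throughput", 0.0) < min_throughput:
        return "coarse"
    return None


# Hypothetical counter snapshots, as a monitor might report them.
snapshots = [
    {"faults": 0, "throughput": 12.0},
    {"faults": 0, "throughput": 3.5},
    {"faults": 1, "throughput": 12.0},
]
decisions = [evaluate_monitors(s, min_throughput=5.0) for s in snapshots]
print(decisions)
```

Keeping the evaluation a pure function of the counter snapshot makes the policy easy to swap or test independently of how the counters are collected.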

slide-94
SLIDE 94

Thanks To …

  • Coordinator: Michal Masin (IBM), michaelm@il.ibm.com
  • Scientific Coordinator: Francesca Palumbo (UniSS), fpalumbo@uniss.it
  • Innovation Manager: Katiuscia Zedda (Abinsula), katiuscia.zedda@abinsula.com
  • Dissemination-Communication Manager: Francesco Regazzoni (USI), francesco.regazzoni@usi.ch

www.cerbero-h2020.eu info@cerbero-h2020.eu @CERBERO_h2020

EU Commission for funding the CERBERO (Cross-layer modEl-based fRamework for multi-oBjective dEsign of Reconfigurable systems in unceRtain hybRid envirOnments) project as part of the H2020 Programme under grant agreement No 732105.

slide-95
SLIDE 95

The Future Directions of Dataflow-Based Reconfigurable Hardware Accelerators

Francesca Palumbo, Claudio Rubattu, Carlo Sau, Tiziana Fanni, Luigi Raffo Rennes, 12-14 December 2017