[PPT] - Model-based engineering of high-performance embedded applications on PowerPoint Presentation

SLIDE 1

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871669

15th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'20, part of ISC 2020)

Model-based engineering of high-performance embedded applications on heterogeneous hardware with real-time constraints and energy efficiency

Tommaso Cucinotta – Scuola Superiore Sant’Anna, Pisa (Italy)

SLIDE 2

Tommaso Cucinotta – Real-Time Systems Laboratory - Scuola Superiore Sant’Anna - VHPC 2020

Introduction & Motivations

CPSs have higher and higher computational performance & reliability requirements

Use of increasingly heterogeneous & interconnected, battery-operated platforms
– non-SMP multi-core
– GP-GPU/TPU acceleration
– FPGA

Heterogeneous platforms needed in soft and hard real-time use-cases
– automotive, railways, aerospace, robotics, gaming, multimedia, ...

[Figure: Heterogeneous Hardware: Non-SMP multi-core, GP-GPU, TPU, FPGA]

SLIDE 3

Problems & Challenges

Development of software for CPSs is cumbersome!
– Optimum usage of underlying hardware parallelism & acceleration
– Performance vs energy consumption trade-offs
– Real-time constraints
– Safety & certification

[Figure: software stack, top to bottom: App 1 ... App n; Operating System Services & Middleware; Operating System Kernel / Hypervisor (CPU Scheduler, I/O Scheduler, Drivers..., Power Management); Heterogeneous Hardware (Non-SMP multi-core, GP-GPU, TPU, FPGA)]

SLIDE 4

MDE & Formalisms in Embedded System Design

Model-Driven Engineering (MDE)
– Fill the gap between high-level specifications and actual system behavior

MDE embraces
– Formal specification language(s)
– Model transformation engine(s)
– Model refinements & composability
– Automatic code generator(s)
– Model verifiability

=> Correctness-by-construction

[Figure: MDE (e.g. CAPELLA, AMALTHEA, AUTOSAR) applied to a logic controller with sensors and actuators; a chain of Functional & Non-functional Requirements, System Specification, Architecture Definition and Implemented Software, with a gap between each consecutive pair]

SLIDE 5

MDE & Formalisms in Embedded System Design

Traditional MDE limitations
– Single-processor systems or very limited support for multi-core systems
– Struggles to cope with today's complex heterogeneous embedded boards

SLIDE 6

AMPERE Project Goal

Fill the gap between
– MDE techniques with no/limited parallelism support
– Parallel-programming models with efficient HW offloading (OpenMP, CUDA, ...)
– Heterogeneity in hardware

In presence of non-functional requirements
– High-Performance
– Real-Time Constraints
– Energy Efficiency
– Fault Tolerance

[Figure: MDE (e.g. CAPELLA, AMALTHEA, AUTOSAR) of a logic controller with sensors and actuators, on one side; Parallel Programming Models and run-time parallel frameworks (e.g. OpenMP, OpenCL, CUDA, COMPSs), a Parallel Execution Model and Parallel Units on top of Heterogeneous Hardware (Non-SMP multi-core, GP-GPU, TPU, FPGA), on the other]

SLIDE 7

Bridge the gap

1. Synthesis methods for an efficient generation of parallel source code, while keeping non-functional and composability guarantees

2. Run-time parallel frameworks that guarantee system correctness and exploit the performance capabilities of parallel architectures

3. Integration of parallel frameworks into MDE frameworks

[Figure: MDE (e.g. CAPELLA, AMALTHEA, AUTOSAR) of a logic controller with sensors and actuators, bridged to Parallel Programming Models and run-time parallel frameworks (e.g. OpenMP, OpenCL, CUDA, COMPSs), a Parallel Execution Model and Parallel Units]

AMPERE Vision

SLIDE 8

Bridge the gap

[Figure: the AMPERE MDE Framework bridges MDE (e.g. CAPELLA, AMALTHEA, AUTOSAR) of a logic controller with sensors and actuators to Parallel Programming Models and run-time parallel frameworks (e.g. OpenMP, OpenCL, CUDA, COMPSs), a Parallel Execution Model and Parallel Units. Its layers, top to bottom:]

DSMLs
– AUTOSAR: SW-C, Runnables, Client-server, ASIL
– AMALTHEA: Performance, Tasks, Scheduling, Platform
– CAPELLA: Functional components, Allocation of resources, Data models, View points, validation rules

Meta-model Driven Abstractions
– Components, Communications, Timing Characteristics, Integrity Assurance, ...

Model Transformation Engine

Meta-parallel Programming Abstraction
– Parallelism, Synchronization, Data Dependencies, Data Attributes, ...

Parallel Run-Time Frameworks
– OpenMP: Task construct, Dependencies, Parallel construct
– OpenCL: clgetDeviceId, clCreateBuffer, __kernel_exec
– COMPSs: Compute resource, Data movements, Task annotations

AMPERE MDE Framework

AMPERE Vision
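
To make the idea of a model transformation engine concrete, here is a minimal sketch of going from a component-level description to OpenMP task constructs with dependencies. It is not the AMPERE synthesis tool: the input format (a list of runnables with named inputs and outputs), the function name `emit_openmp_tasks` and the example runnables are all assumptions made for illustration. Only the emitted `#pragma omp task depend(in: ...) depend(out: ...)` syntax is standard OpenMP.

```python
def emit_openmp_tasks(runnables):
    """Emit C/OpenMP text for a list of (name, inputs, outputs) tuples.

    Hypothetical model format: each runnable names the data items it reads
    and writes; these become depend(in:) / depend(out:) clauses.
    """
    lines = ["#pragma omp parallel", "#pragma omp single", "{"]
    for name, inputs, outputs in runnables:
        clauses = []
        if inputs:
            clauses.append("depend(in: " + ", ".join(inputs) + ")")
        if outputs:
            clauses.append("depend(out: " + ", ".join(outputs) + ")")
        # rstrip() avoids a trailing space when a runnable has no clauses
        lines.append(("    #pragma omp task " + " ".join(clauses)).rstrip())
        lines.append("    " + name + "();")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical two-stage pipeline: a sensor reader feeding a controller.
generated = emit_openmp_tasks([
    ("read_sensor", [], ["sample"]),
    ("control", ["sample"], ["command"]),
])
```

The OpenMP runtime then orders `control` after `read_sensor` because of the matching `sample` dependency, so the ordering constraint comes from the model rather than from hand-written synchronization.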

SLIDE 9

Software Layer: Tool, Owner (License)

DSMLs
– AUTOSAR, AUTOSAR (Proprietary)
– AMALTHEA, BOSCH (Open-source)
– CAPELLA, TRT (Open-source)

Parallel programming models
– OpenMP, OpenMP ARB (Proprietary)
– CUDA, NVIDIA (Proprietary)
– OpenCL, Khronos (Proprietary)
– COMPSs, BSC (Open-source)

Artificial Intelligence
– TensorFlow, Google (Open-source)

Code synthesis tools
– Synthesis tools, AMPERE (Open-source)

Analysis and testing tools
– NFP analysis, AMPERE (Open-source)

Compilers and hardware synthesis tools
– Mercurium, BSC (Open-source)
– GCC/LLVM, GNU/LLVM (Open-source)
– Vivado, Xilinx (Proprietary)

Run-time libraries
– GOMP, GNU-GCC (Open-source)
– KMP, LLVM (Open-source)
– Vivado, Xilinx (Proprietary)

Operating systems
– Linux, Linux-Foundation (Open-source)
– ERIKA Enterp., EVI (Open-source/commercial)

Hypervisors
– PikeOS, SYSGO (Proprietary)

AMPERE Software Architecture

SLIDE 10

AMPERE Software Development Workflow Overview

[Figure: workflow diagram; recoverable labels:]

Model
– System description: Components/communication, Functional/NFP, Etc.
– Platform description: Accel. devices, Cores/clusters, Memory model, Etc.

Meta MDE abstraction

Code Synthesis + Multi-criteria Optimization
– Performance
– Time-predictability
– Energy-efficiency
– Resiliency

Meta PPM abstraction

Compiler (Correctness + Refined Parallel Structure)

Parallel code (e.g. OpenMP, CUDA graphs)

Resource Allocation (i.e., mapping/scheduling)

Runtime framework + OS + Hypervisor
– Monitoring
– Dynamic resource allocation

Execution Profile
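
As a toy illustration of what multi-criteria optimization over mappings could look like, the sketch below filters candidate mappings by a deadline (time-predictability) and then picks the one with the lowest weighted sum of normalized latency and energy. The slide does not describe the actual AMPERE algorithm; the `Candidate` type, the weighted-sum cost, the weights and all numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical mapping of a workload onto one kind of compute unit."""
    name: str
    latency_ms: float   # assumed worst-case latency, must be > 0
    energy_mj: float    # assumed energy per activation, must be > 0


def pick_mapping(candidates, deadline_ms, w_latency=0.5, w_energy=0.5):
    """Return the feasible candidate with the lowest weighted cost, or None."""
    feasible = [c for c in candidates if c.latency_ms <= deadline_ms]
    if not feasible:
        return None
    # Normalize each criterion by its maximum among feasible candidates
    max_latency = max(c.latency_ms for c in feasible)
    max_energy = max(c.energy_mj for c in feasible)

    def cost(c):
        return (w_latency * c.latency_ms / max_latency
                + w_energy * c.energy_mj / max_energy)

    return min(feasible, key=cost)


# Made-up numbers, not measurements from the project.
options = [
    Candidate("cpu", latency_ms=40.0, energy_mj=10.0),
    Candidate("gpu", latency_ms=8.0, energy_mj=30.0),
    Candidate("fpga", latency_ms=12.0, energy_mj=12.0),
]
best = pick_mapping(options, deadline_ms=20.0)
```

With a 20 ms deadline the CPU mapping is infeasible, and the FPGA mapping wins over the GPU one because its energy advantage outweighs its slightly higher latency under equal weights.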

SLIDE 11

Obstacle Detection and Avoidance System (ODAS)
– ADAS functionalities based on data fusion coming from tram vehicle sensors

Predictive Cruise Control (PCC)
– Extends Adaptive Cruise Control (ACC) functionality by calculating the vehicle’s future velocity curve using the data from the electronic horizon
– Improves fuel efficiency (in cooperation with the powertrain control) by configuring the driving strategy based on data analytics and AI

AMPERE Use-Cases

SLIDE 12

FPGA-based system-on-chips are a very promising solution to enable predictable HW acceleration of complex computing workloads

Multiprocessors can host multi-OS software systems
FPGA fabric can be used to deploy HW accelerators

[Figure: Zynq Ultrascale+, combining an asymmetric multiprocessor and a large FPGA fabric]

FPGA System-on-Chip (SoC)

SLIDE 13

Programmable logic exhibits very regular, clock-level behavior (unlike other HW accelerators, e.g., GP-GPUs)

Internal control logic of several HW accelerators is typically based on state machines

[Figure: FIR and Sobel filters from Xilinx IP library (screenshot from Vivado 2017.4)]

[Figure: FPGA fabric (PL) hosting HW accel #1 and HW accel #2 behind an interconnect, whose port leads towards the processing system (PS) and DRAM]

Possibility to deploy custom bus logic
– We can monitor & supervise bus transactions to shield the system from misbehaviors
– We can realize custom bus arbitration policies that help meet timing constraints

Bus/memory contention can be made predictable

HW Accelerators on FPGAs
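
One way to picture supervising bus transactions is a per-accelerator budget: each master may issue a bounded number of transactions per time window, and extra requests are denied until the next window. The sketch below is a software toy model of that idea, not the hardware logic presented in the talk. The class name `BusBudgetEnforcer`, the fixed-window replenishment and the integer time base are assumptions.

```python
class BusBudgetEnforcer:
    """Toy model: at most `budget` transactions are granted per `period`.

    Assumes `now` is a non-decreasing integer time stamp (e.g. clock cycles).
    """

    def __init__(self, budget, period):
        if budget <= 0 or period <= 0:
            raise ValueError("budget and period must be positive")
        self.budget = budget
        self.period = period
        self._window = 0   # index of the current replenishment window
        self._used = 0     # transactions granted in the current window

    def request(self, now):
        """Return True if a transaction issued at time `now` is granted."""
        window = now // self.period
        if window != self._window:
            # A new window started: replenish the budget
            self._window = window
            self._used = 0
        if self._used < self.budget:
            self._used += 1
            return True
        return False


# A master allowed 2 transactions every 10 cycles.
enforcer = BusBudgetEnforcer(budget=2, period=10)
grants = [enforcer.request(t) for t in (0, 1, 2, 10, 11, 12)]
```

Because a misbehaving accelerator can never exceed its budget within a window, the interference it causes to other bus masters is bounded, which is what makes contention analyzable.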

SLIDE 14

Modern FPGAs offer dynamic partial reconfiguration (DPR) capabilities
DPR allows reconfiguring a portion of the FPGA at runtime, while the rest of the device continues to operate

In essence, reconfiguration requires programming a memory
Simplifying, an image of the FPGA configuration (bitstream) is copied from one memory to another

[Figure: bitstreams loaded into a reconfigurable region of the FPGA fabric]

Dynamic Partial Reconfiguration
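
Since reconfiguration boils down to a memory copy, a first-order estimate of its duration is the bitstream size divided by the throughput of the configuration path. The sketch below shows that arithmetic. The function name and the example figures (a 4 MiB partial bitstream, a 400 MB/s configuration port) are hypothetical, not measurements from the talk.

```python
def reconfiguration_time_ms(bitstream_bytes, throughput_bytes_per_s):
    """First-order estimate of the time to load a partial bitstream, in ms."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 * bitstream_bytes / throughput_bytes_per_s


# Hypothetical figures: 4 MiB bitstream, 400 MB/s configuration throughput.
example_ms = reconfiguration_time_ms(4 * 1024 * 1024, 400e6)
```

Under these assumed figures the estimate is roughly 10.5 ms. A real analysis would also need to account for contention on the memory the bitstream is read from.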

SLIDE 15

Enable predictable HW acceleration on FPGA system-on-chips
Collection of technologies developed at the ReTiS Lab

http://fred.santannapisa.it/

Supported platforms
– Zynq-7000 series
– Zynq Ultrascale+

FRED Framework

SLIDE 16

TASK(myTask) {
    <prepare input data>
    EXECUTE_HW_TASK(myHWtask);    // suspend the execution until the completion of the HW-task
    <retrieve output data>
}

SW-Task (on the CPU)
– periodic/sporadic real-time tasks
– Fixed-priority scheduling

HW-Task (on the FPGA fabric)
– HW accelerators implemented as programmable logic
– non-preemptive execution
– work on shared-memory buffers

[Figure: System-on-Chip timeline: the SW-Task on the CPU suspends while the HW-Task runs on the FPGA fabric, then resumes]

FRED Programming Model
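
The prepare / execute / retrieve pattern can be mimicked in plain software to show the control flow. The sketch below is not the FRED API: a single-worker thread pool stands in for the FPGA fabric (so a started HW-task runs to completion before the next one begins), and `my_hw_task`, `my_task` and the doubling "accelerator" are hypothetical names and behavior chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the FPGA fabric: a single worker, so a HW-task that has
# started is never preempted by another request.
fpga = ThreadPoolExecutor(max_workers=1)


def my_hw_task(buffer):
    """Hypothetical accelerator: doubles each element of the shared buffer in place."""
    for i, value in enumerate(buffer):
        buffer[i] = 2 * value


def my_task(samples):
    """SW-task following the prepare / execute / retrieve pattern."""
    shared = list(samples)                  # <prepare input data>
    done = fpga.submit(my_hw_task, shared)  # EXECUTE_HW_TASK(myHWtask)
    done.result()                           # suspend until the HW-task completes
    return shared                           # <retrieve output data>


result = my_task([1, 2, 3])
```

While the SW-task is blocked in `done.result()`, the CPU is free to run other tasks, which is the point of suspending rather than busy-waiting on the accelerator.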

SLIDE 17

CHaiDNN: HLS-based DNN Accelerator Library for Xilinx Ultrascale+
Designed for maximum compute efficiency at 6-bit integer data types (it also supports 8-bit integer data types)

The inference time in isolation exhibits very little fluctuation
The real issue for time predictability is bus/memory contention

Setup: Xilinx ZCU102 (Ultrascale+), Vivado 2018.2, GoogleNet, DMA from Xilinx IP lib

Time-predictable DNN Inference

SLIDE 18

The FRED framework is a combination of several technologies:

Run-time FPGA manager & scheduler for Linux (both C and Python API)
Bus monitors and budget enforcers
Automated FPGA floor-planning
Automatic synthesis of bus interconnections

Inside the FRED Framework

SLIDE 19

Conclusions

AMPERE aims to bridge the gap between MDE and PPM on HHW by

1. Providing a development framework for CPS targeting parallel heterogeneous architectures, for increased productivity, compliant with current MDE practices

2. Providing an execution framework for an efficient exploitation of parallel and heterogeneous architectures, fulfilling functional and non-functional constraints

3. Integrating AMPERE software solutions into relevant industrial use-cases (automotive and railway) with HPC and real-time requirements

SLIDE 20

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871669

Thanks for Listening Any Questions?

https://www.linkedin.com/company/ampere-project
https://twitter.com/ampereproject