The AXIOM-board: bringing programmability, acceleration, scalability into a 64-bit hand-size board




slide-1
SLIDE 1

Agile, eXtensible, fast I/O Module for the cyber-physical era

IWES 2017 – 2nd Italian Workshop on Embedded Systems Rome, 7-8 September 2017

The AXIOM-board: bringing programmability, acceleration, scalability into a 64-bit hand-size board

Roberto Giorgi

University of Siena, Italy (Project Coordinator)

University of Siena (Coordinator Partner)

slide-2
SLIDE 2


slide-3
SLIDE 3

Highlights of this talk

1) Exploring the concept of a "scalable embedded system"
2) Indicating a way to achieve such scalability by supporting special threads called Data-Flow Threads (DF-Threads)
3) Illustrating how these concepts are integrated in the AXIOM project, which is focused on building a scalable Single Board Computer

3 Roberto Giorgi –- AXIOM project --- http://www.axiom-project.eu The AXIOM-board: bringing programmability, acceleration, scalability into a 64-bit hand-size board

slide-4
SLIDE 4

Vehicle Architecture

  • Expected total number of ECUs: 120*
  • 5 - 10 domain controllers will run with adaptive platform
  • Classical and adaptive AUTOSAR will coexist
  • Hardware acceleration is needed

[*Stefan Voget, Continental, TAPPS Workshop keynote, May 2017]


slide-5
SLIDE 5

AXIOM OBJECTIVES

  • OBJ1) Producing a small board that is flexible, energy efficient and modularly scalable

– "A" as AGILITY, i.e. flexibility: FPGA, fast-and-cheap interconnects based on existing connectors like SATA
– Energy efficiency: low-power ARM, FPGA
– Modularity: fast interconnects, distributed shared memory across boards

  • OBJ2) Easy programmability of multi-core, multi-board, FPGA

– Programming model: improved OmpSs ("X" as EXTENSIBILITY)
– Runtime & OS: improved thread management

  • OBJ3) Leveraging Open-Source software to manage the board

– Compiler: BSC Mercurium
– OS: Linux
– Drivers: provided as open-source software by partners

  • OBJ4) Easy Interfacing with the Cyber-Physical World

– Platform: also integrating Arduino support for plenty of pluggable boards (so-called "shields") ("IO" as I/O)
– Platform: building on the UDOO experience from SECO

  • OBJ5) Enabling real time movement of threads

– Runtime: will leverage EVIDENCE's SCHED_DEADLINE scheduler (i.e. EDF), included in Linux since version 3.14, and UNISI low-level thread management techniques

  • OBJ6) Contribution to Standards

– Hardware: SECO is a founding member of the Standardization Group for Embedded Technologies (SGET)
– Software: BSC is a member of the OpenMP consortium

slide-6
SLIDE 6

Smart Living/Home scenario

  • Speaker identification is the identification of a person from the characteristics of their voice (voice biometrics).
  • Iris recognition is the process of recognizing a person by analyzing the random pattern of the iris.


slide-7
SLIDE 7

SLH application demo on the AXIOM Evaluation Platform (AEP)

[Figure labels: Iris Recognition; Speaker identification; Identification done!; Audio trigger]

The T-800's POV from Terminator 2 Carolco Pictures / Tristar


slide-8
SLIDE 8

AXIOM – THE MODULE-v2

  • KEY ELEMENTS

– K1: ZYNQ FPGA (INCLUDES 6 ARM CORES)
– K2: ARM GP CORE(S)
– K3: HIGH-SPEED & INEXPENSIVE INTERCONNECTS
– K4: SW STACK – OMPSS+LINUX BASED
– K5: OTHER I/F (ARDUINO, USB, ETH, WIFI, …)

slide-9
SLIDE 9

AXIOM-v2 Architectural Template

[Figure: block diagram of the AXIOM-v2 board. Labels: Quad Core ARM A53 + Dual Core ARM R5; AXI BUS; MIO; SHARED DRAM; "O/S" DRAM; USB OTG; USB 2.0; UART; Gb Ethernet; SD-CARD; I2C; GPIO; HDMI Controller; AXI-MASTER; AXI-SLAVE; "Glue-Logic"; Arduino Shield Connector; Zynq FPGA; Zynq "Hard" Off-Chip MEM-CTRL; "FPGA Sandbox"; AXIOM-link Connector; B2B (Board to Board); DSM-like engine; FPGA acceleration.]

slide-10
SLIDE 10

WHY OMPSS


slide-11
SLIDE 11

The AXIOM-BOARD (about 10x15 cm)


Disclaimer: subject to changes without notice.


slide-12
SLIDE 12

Testing Environment

  • Problem to analyze

[Figure: testing environment with two boards. Labels: BOARD1, BOARD2, Linux1, Linux2, APP, Nanos++, XSM.]

slide-13
SLIDE 13

Data movement

[Figure: data movement between two boards. Labels: BOARD1, BOARD2, XSM, FRAME1, FRAME 2, THREAD1, THREAD2, THREAD3, MEM1, MEM2.]

slide-14
SLIDE 14

4x 10Gbit/s via USB-C connectors


slide-15
SLIDE 15

Several Topologies are possible

  • E.g. ring or 2D torus

[Figure: a three-board arrangement (BOARD1, BOARD2, BOARD3) and a nine-board arrangement in which every board is labelled BOARD1.]
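To make the ring and 2D torus arrangements concrete, the sketch below computes which boards are adjacent to a given board in each topology. It is only an illustration: the function names, the 0-based board ids and the row-major numbering are assumptions made here, not part of the AXIOM software.

```c
#include <assert.h>

/* Hypothetical helpers: board ids are 0-based, numbering is row-major. */

/* Next board on a ring of n boards. */
int ring_next(int id, int n) {
    assert(n > 0 && id >= 0 && id < n);
    return (id + 1) % n;
}

/* Previous board on a ring of n boards. */
int ring_prev(int id, int n) {
    assert(n > 0 && id >= 0 && id < n);
    return (id + n - 1) % n;
}

/* Four neighbours of a board on a rows x cols 2D torus.
   out[0..3] = north, south, west, east (with wrap-around). */
void torus_neighbours(int id, int rows, int cols, int out[4]) {
    assert(rows > 0 && cols > 0 && id >= 0 && id < rows * cols);
    int r = id / cols;
    int c = id % cols;
    out[0] = ((r + rows - 1) % rows) * cols + c;  /* north */
    out[1] = ((r + 1) % rows) * cols + c;         /* south */
    out[2] = r * cols + (c + cols - 1) % cols;    /* west  */
    out[3] = r * cols + (c + 1) % cols;           /* east  */
}
```

On a 3x3 torus, for example, board 0 gets neighbours 6, 3, 2 and 1, so every board has exactly four neighbours regardless of its position.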


slide-16
SLIDE 16

XSMLL -- XSM Low Level

  • X-thread (new incarnation of DF-thread)

– A function that expects no parameters and returns no value. In other words: typedef void (*xthread_t)(void);

  • The body of this function can refer to any memory location for which it has obtained a pointer through XSM function calls (e.g., xpreload, xpoststor, xsubscribe, ...). An X-thread is identified by an object of type xtid_t (X-thread identifier).

  • INPUT_FRAME, OUTPUT_FRAME

– INPUT_FRAME: A buffer which is allocated in the local memory and contains the input values for the current X-thread.
– OUTPUT_FRAME: A buffer which is allocated in the local memory and contains values to be used by other X-threads (consumer X-threads).

  • SYNCHRONIZATION_COUNT

– A number which is initially set to the number of input values (or events) needed by an X-thread. The SYNCHRONIZATION_COUNT has to be decremented each time the expected data is written in an OUTPUT_FRAME.
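The interplay between frames and the SYNCHRONIZATION_COUNT can be sketched in a few lines of C. This is a single-process model written for illustration only: apart from the xthread_t typedef quoted on the slide, every name (demo_xthread, demo_schedule, demo_write, demo_add) is hypothetical and is not the XSMLL API.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*xthread_t)(void);        /* signature given on the slide */

#define DEMO_FRAME_WORDS 4              /* hypothetical frame size */

/* Hypothetical descriptor: one X-thread together with its input frame. */
typedef struct {
    xthread_t body;                     /* code to run once all inputs arrived */
    long      frame[DEMO_FRAME_WORDS];  /* plays the role of INPUT_FRAME */
    int       sync_count;               /* plays the role of SYNCHRONIZATION_COUNT */
    int       done;                     /* set to 1 after the body has run */
} demo_xthread;

demo_xthread *demo_current;             /* frame visible to the running body */
long demo_result;                       /* where the example body stores its output */

/* Set the count to the number of input values the thread waits for. */
void demo_schedule(demo_xthread *t, xthread_t body, int inputs) {
    assert(inputs > 0 && inputs <= DEMO_FRAME_WORDS);
    t->body = body;
    t->sync_count = inputs;
    t->done = 0;
}

/* Producer side: store one value in the consumer's frame, decrement the
   count, and run the consumer as soon as the count reaches zero. */
void demo_write(demo_xthread *t, size_t slot, long value) {
    assert(slot < DEMO_FRAME_WORDS && t->sync_count > 0);
    t->frame[slot] = value;
    if (--t->sync_count == 0) {
        demo_current = t;
        t->body();
        t->done = 1;
    }
}

/* Example X-thread body: no parameters, no return value. */
void demo_add(void) {
    demo_result = demo_current->frame[0] + demo_current->frame[1];
}
```

Scheduling demo_add with two inputs and then writing 40 and 2 into its frame runs the body only on the second write, which is the data-driven firing rule described above. A real implementation would place frames on different boards and move the data over the interconnect.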


slide-17
SLIDE 17

4-board AXIOM System

[Figure: four SoCs (SoC1, SoC2, SoC3, SoC4), each drawn with the same blocks: Core1 … CoreN (GPU), I/O hub, PL, HIGH SPEED TRANSCEIVERS, MC, XSM, MEM.]

slide-18
SLIDE 18

Modeled SoC


slide-19
SLIDE 19

Matrix-Multiply on COTSon/XSM

  • Some experiments have been performed on COTSon/XSMLL with the following parameters

– Square Matrix size n and block size b:

  • n = 160, 200, 250, 320, 400, 500, 640, 800, 1000, 1280, 1600, 2000; b = 5, 10, 25, 50
  • n = 128, 256, 512; b = 8

– Different programming models

  • OpenMPI, Cilk

– Different execution models

  • XSMLL, Standard

– Different Linux Distributions

  • Ubuntu 9.10 (karmic64), 10.10 (tfxv4), 14.04 (trusty-axmv3), 16.04 (xenv0)
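For readers unfamiliar with the n and b parameters listed above, the following is a minimal sequential sketch of a blocked (tiled) matrix multiply on square n x n matrices with b x b tiles. It is not the benchmark code used in the experiments; the function name and the row-major layout are assumptions made here. Edge tiles are clipped, so b does not have to divide n.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for square n x n row-major matrices, processed tile by tile.
   Each (ii, kk, jj) iteration multiplies one tile of A by one tile of B. */
void matmul_blocked(size_t n, size_t b,
                    const double *A, const double *B, double *C) {
    assert(b > 0);
    for (size_t ii = 0; ii < n; ii += b) {
        size_t i_end = (ii + b < n) ? ii + b : n;
        for (size_t kk = 0; kk < n; kk += b) {
            size_t k_end = (kk + b < n) ? kk + b : n;
            for (size_t jj = 0; jj < n; jj += b) {
                size_t j_end = (jj + b < n) ? jj + b : n;
                /* Multiply the current pair of tiles. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

In a data-flow setting, each tile product is a natural unit of work to hand to a separate thread, which is one reason the block size b matters alongside the matrix size n.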


http://cotson.sourceforge.net


slide-20
SLIDE 20

Strong Scaling for benchmark “Dense Matrix Multiplication”

[Figure: speedup (t1/tN) plotted against No. of SoCs (1, 2, 4), with No. of Cores shown as (4), (8), (16). Series: size=200, size=250, size=320, size=400. Annotation: "User only".]

slide-21
SLIDE 21

The OS dependency

[Figure: two panels plotting cycles (5E+09 to 2.5E+10) against the number of nodes (1, 2, 4). Series are labelled by Linux image and a numeric parameter: xenv0, karmic64, trusty-axmv3 and tfxv4, each with 2, 4, 8, 16, 32 in the first panel and 2, 8, 32 in the second. Both panels carry the annotation "~60% faster".]

slide-22
SLIDE 22

XSMLL vs OpenMPI vs Cilk

           1 NODE (1 CORE)   4 NODES/CORES   XSMLL SPEEDUP (1 NODE)   XSMLL SPEEDUP (4 NODES)
OPENMPI    54281301097       13223633943     3.63                     3.49
CILK       45645234077       16738179585     3.05                     4.41
XSMLL      14941500251       3792215176

* For CILK we are using 4 cores instead of 4 nodes
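The speedup figures in the table above are plain ratios of the reported counts, which the short sketch below reproduces. The struct and function names are illustrative assumptions; only the six counts are taken from the slide.

```c
#include <assert.h>

/* Counts copied from the slide: 1 node (1 core) and 4 nodes/cores. */
typedef struct {
    const char *name;
    unsigned long long one_node;
    unsigned long long four_nodes;
} run_cycles;

const run_cycles OPENMPI = { "OPENMPI", 54281301097ULL, 13223633943ULL };
const run_cycles CILK    = { "CILK",    45645234077ULL, 16738179585ULL };
const run_cycles XSMLL   = { "XSMLL",   14941500251ULL,  3792215176ULL };

/* Ratio of a slower count to a faster one. */
double cycle_ratio(unsigned long long slow, unsigned long long fast) {
    assert(fast != 0);
    return (double)slow / (double)fast;
}

/* Strong-scaling speedup t1/tN of a single programming model. */
double strong_scaling(const run_cycles *r) {
    return cycle_ratio(r->one_node, r->four_nodes);
}

/* Speedup of XSMLL over another model on 1 node. */
double xsmll_speedup_1(const run_cycles *other) {
    return cycle_ratio(other->one_node, XSMLL.one_node);
}

/* Speedup of XSMLL over another model on 4 nodes/cores. */
double xsmll_speedup_4(const run_cycles *other) {
    return cycle_ratio(other->four_nodes, XSMLL.four_nodes);
}
```

With these inputs, the four xsmll_speedup values round to the 3.63, 3.49, 3.05 and 4.41 shown on the slide, and strong_scaling(&XSMLL) comes out at about 3.94 when going from 1 to 4 nodes.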


slide-23
SLIDE 23

Energy Efficient Processors

ETH ZURICH – EXPLORING RISC-V FOR THE PULP ARCHITECTURE 30/6/2015


slide-24
SLIDE 24

Toward Zero-Power Computing

  • Feynman’s principle
  • R. Feynman (from “Feynman’s Lecture Notes on Computation”):

“Logic or arithmetic could be done with the power that converges to zero, even if the number of operations in a program approaches infinity; the power for communications can never approach to zero, and only to infinity, if the length of communication lines approaches infinity and the number of instructions in a program approaches infinity.”

  • “There is no limit to the minimum of energy required to operate a computer”*

(Luca Gammaitoni, The Future Technology Summit, 24th Sept. 2015)

[D. Paul, ICT-Energy - Strategic Research Agenda]


slide-25
SLIDE 25

Dataflow Approach

  • V. Milutinovic, N. Trifunovic, R. Giorgi:

"The control flow approach is based on unavoidable communications (with logic and arithmetic as primary issues for computing, but secondary contributors to power consumption); the dataflow approach is based on logic and arithmetic (with non-zero communications present only if the dataflow compiler is not smart enough, and consequently is not able to generate the execution graph with only zero-length communications lines between neighboring arithmetic and/or logic units)."

  • V. Milutinovic, J. Salom, N. Trifunovic, R. Giorgi, "Guide to DataFlow Supercomputing", Apr 2015


slide-26
SLIDE 26


THANK YOU

Questions?

Agile, eXtensible, fast I/O Module for the cyber-physical era PROJECT ID: 645496


slide-27
SLIDE 27

Agile, eXtensible, fast I/O Module for the cyber-physical era PROJECT ID: 645496

University of Siena (Coordinator Partner)


slide-28
SLIDE 28

Multi-HD Experiment (2)

AXIOM id. 645496 http://www.axiom-project.eu 28

T311

slide-29
SLIDE 29

Data Access Latency

[Figure: two panels plotting data access latency in cycles (10.5 to 15.5) against the number of nodes (1, 2, 4). Series: xenv0, karmic64, trusty-axmv3 and tfxv4, each with 64, 256 and 1024. Legend groups: TFX, TRU, KAR, XEN.]

T312

slide-30
SLIDE 30

L2 Cache Miss Rate

[Figure: two panels plotting L2 cache miss rate (0.31 to 0.48) against the number of nodes (1, 2, 4). Series: xenv0, karmic64, trusty-axmv3 and tfxv4, each with 64, 256 and 1024.]

T312

slide-31
SLIDE 31

Ratio of Kernel Cycles vs Total Cycles

[Figure: two panels plotting the ratio of kernel cycles to total cycles (0.02 to 0.14) against the number of nodes (1, 2, 4); the second panel's axis is titled "Percent of Kernel Cycles". Series: xenv0, karmic64, trusty-axmv3 and tfxv4, with 32 to 1024 in the first panel (the tfxv4 labels stop at 128) and 64, 256, 1024 in the second.]

T312

slide-32
SLIDE 32

Agile, eXtensible, fast I/O Module for the cyber-physical era PROJECT ID: 645496

University of Siena (Coordinator Partner)

Roberto Giorgi –- AXIOM project --- http://www.axiom-project.eu Scalable Embedded Systems: Towards the Convergence of High-Performance and Embedded Computing