An FPGA-Based Scalable Hardware Scheduler For Data-Flow Models

IWES 2018 Third Italian Workshop on Embedded Systems Siena – 13-14 September 2018 An FPGA-Based Scalable Hardware Scheduler For Data-Flow Models Roberto Giorgi, Marco Procaccini, Farnam Khalili University of Siena, Italy

The end of Dennard scaling…
The engineering community has been forced to find new solutions to improve performance within a limited power budget [1] by:
• Stopping the increase of clock frequency
• Shifting to multicore processors
[Figure: Moore’s law, 2018. Source: Wikipedia]
Programming limitations still prevent exploiting the full performance…

The “DF-Threads” Data-Flow execution model
The “DF-Threads” Data-Flow execution model is capable of taking advantage of the full parallelism offered by multicore systems [2][3][4][5][6][7]:
• Execution relies on data dependencies
• Data-independent paths execute in parallel
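As an illustration of execution driven by data dependencies, here is a minimal Python sketch. It is not the DF-Threads runtime: the graph, the task names and the `run_dataflow` helper are hypothetical. The only idea taken from the slide is that a thread becomes runnable once all of its inputs are available, so data-independent paths can run in parallel.

```python
def run_dataflow(deps):
    """Group tasks into 'waves'; tasks in one wave have no mutual
    dependencies and could therefore run in parallel.

    deps maps each task name to the set of tasks whose results it needs.
    (Hypothetical helper, for illustration only.)
    """
    # Number of inputs each task is still waiting for.
    pending = {task: len(inputs) for task, inputs in deps.items()}
    # Reverse edges: which tasks consume the result of a given task.
    consumers = {task: [] for task in deps}
    for task, inputs in deps.items():
        for producer in inputs:
            consumers[producer].append(task)

    ready = sorted(t for t, n in pending.items() if n == 0)
    waves = []
    while ready:
        waves.append(ready)
        next_ready = []
        for task in ready:
            for consumer in consumers[task]:
                pending[consumer] -= 1
                if pending[consumer] == 0:
                    next_ready.append(consumer)
        ready = sorted(next_ready)
    return waves


graph = {"a": set(), "b": set(), "c": {"a"}, "d": {"b"}, "e": {"c", "d"}}
waves = run_dataflow(graph)
```

In this example graph, `a` and `b` are independent and form the first wave, `c` and `d` the second, and `e` runs last because it needs both.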

Hybrid Data-Flow Model
DF-Threads-based execution does not need to completely replace conventional general purpose processors (GPPs). The hybrid model is based on GPPs and Field Programmable Gate Arrays (FPGAs):
• GPP cores are suitable for legacy code or the OS
• The FPGA can easily provide efficient parallel execution via DF-Threads

System Design
A possible architecture to enable easy distribution of the Data-Flow Threads (DF-Threads) among multiple cores and multiple nodes [8]

The Idea
Improving the execution of Data-Flow Thread scheduling by implementing a Hardware Scheduler (HS) on the FPGA [9][10]
Legend:
• PS: processing system
• HS: hardware scheduler
• GPPs: general purpose processors
• HS-L1: local scheduler
• HDF: hardware Data-Flow Threads
• HS-L2: distributed scheduler
The GPP:
• Asynchronous APIs
• Schedule DF-Threads
• Execute DF-Threads
The HS:
• Retrieves meta-information
• Provides ready HDF-Threads
• Distributes HDF-Threads on the network

Compilation and testing flow
Testing environment:
• COTSon Simulator [11]
• AXIOM Board [12]

System Abstraction in a Perspective
Routing topologies: 2D-mesh or ring
[Figure: block diagram of the nodes. Labels: PS (Processing System) running the application (Fibonacci algorithm), HS API, Axiom Library [2], Axiom IOCTLs, HS device driver, NIC device driver, DDR memory, AXI buses, HS registers, NIC registers, and PL (Programmable Logic) containing the NIC [1] and the proposed Hardware Scheduler (HS).]
[1] Vasileios Amourgianos-Lorentzos. “Efficient network interface design for low cost distributed systems.” Master Thesis, 2017, Technical University of Crete, as part of the FORTH Axiom program.
[2] Evidence Embedding Technology, 2017, “https://github.com/evidence

Hardware Scheduler (HS) Primitives
HS API:
• f_ptr = load_frame(); HDF_schedule(f_ptr, i_ptr, init_sc)
• HDF_decrease(f_ptr, num_sc)
• HDF_subscribe(d_ptr)
• HDF_publish(f_ptr)
[Figure: block diagram. Labels: PS (Processing System), DDR Memory, Memory Controller, HS API, PL (Programmable Logic), HS Module, Register Controller [1] (Opcode Register, Argument 1 Register, Argument 2 Register, Register Controller FSM), Decoder, HS-L1 (Hardware Scheduler Level 1), HS-L2 (Hardware Scheduler Level 2), NIC (Network Interface Card).]
[1] F. Khalili, M. Procaccini and R. Giorgi. “Reconfigurable logic interface architecture for cpu-fpga accelerators.” In HiPEAC ACACES-2018, pp. 1–4. Fiuggi, Italy, July 2018. Poster.

Register Controller [2]
• The write/read access of each register is separately controllable through the ‘Control’ register.
• The Register Controller FSM (1) controls the Master AXI Stream Handler Module (2) and exchanges data between the AXI Stream and AXI memory-mapped domains.
• The Register Controller FSM (1) also polls control_reg (3) and checks the bit field of each register to see whether it is configured for write or read access, which sets the direction of the data.
[2] F. Khalili, M. Procaccini and R. Giorgi. “Reconfigurable logic interface architecture for cpu-fpga accelerators.” In HiPEAC ACACES-2018, pp. 1–4. Fiuggi, Italy, July 2018. Poster.
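The slide does not give the bit layout of the control register, so the following Python sketch only illustrates the idea of a per-register direction bit. It assumes, purely for illustration, one bit per register with 1 meaning write access and 0 meaning read access; `NUM_REGS` and `register_directions` are hypothetical names.

```python
NUM_REGS = 4  # assumed number of registers, not taken from the slides


def register_directions(control_reg, num_regs=NUM_REGS):
    """Decode a control word into a per-register access direction.

    Assumption: bit i of control_reg describes register i,
    with 1 = write access and 0 = read access.
    """
    return ["write" if (control_reg >> i) & 1 else "read"
            for i in range(num_regs)]


# Registers 0 and 2 configured for write access, 1 and 3 for read access.
directions = register_directions(0b0101)
```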

HS-L1 (Hardware Scheduler Level 1)
• Retrieves meta-information of FRAMEs (Schedule FSM)
• Schedules the FRAMEs which are ready to be executed (Decrease FSM)
• Fetches the IP (Instruction Pointer) from the ready FRAMEs (Fetch FSM)
[Figure: HS-L1 block diagram. Labels: Decoder, Schedule FSM, Decrease FSM, Fetch FSM, GM-ctrler, GM-DMA (Global Memory – Direct Memory Access), RFQ-ctrler, RFQ-DMA (Ready Frame Queue – Direct Memory Access), Memory with GM Sector and RFQ Sector, HS-L2. Annotations: frames are stored in the GM Sector; ready frame pointers are stored in the RFQ Sector.]
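The three HS-L1 steps can be modeled in software as follows. This is a behavioral sketch, not the FPGA design: the class, its method names and the frame layout are hypothetical, and the meaning of the synchronization count (a frame becomes ready when its count reaches zero) is an assumption inferred from the argument names of the schedule and decrease primitives.

```python
from collections import deque


class HsL1Model:
    """Behavioral model of a level-1 scheduler (illustrative only)."""

    def __init__(self):
        self.gm = {}        # stands in for the GM sector: frame metadata
        self.rfq = deque()  # stands in for the RFQ sector: ready frame pointers

    def schedule(self, f_ptr, i_ptr, init_sc):
        """Schedule step: record the frame's instruction pointer and count."""
        self.gm[f_ptr] = {"ip": i_ptr, "sc": init_sc}
        if init_sc == 0:
            self.rfq.append(f_ptr)

    def decrease(self, f_ptr, num_sc):
        """Decrease step: lower the count; enqueue the frame at zero."""
        frame = self.gm[f_ptr]
        frame["sc"] -= num_sc
        if frame["sc"] == 0:
            self.rfq.append(f_ptr)

    def fetch(self):
        """Fetch step: pop a ready frame and return its instruction pointer."""
        if not self.rfq:
            return None
        f_ptr = self.rfq.popleft()
        return self.gm.pop(f_ptr)["ip"]


hs = HsL1Model()
hs.schedule(f_ptr=0x100, i_ptr=0x4000, init_sc=2)
first = hs.fetch()        # nothing is ready: two inputs are still missing
hs.decrease(0x100, 1)
hs.decrease(0x100, 1)
second = hs.fetch()       # the frame is ready after both inputs arrived
```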

HS-L2 (Hardware Scheduler Level 2)
• Distributes FRAMEs in order to balance the load throughout the network
• Steals work from remote nodes
• Off-loads work to remote nodes
[Figure: HS-L2 block diagram. Labels: HS-L1, Load Balancing FSM, Msg-Composer, TX-FIFO, Msg-Interpreter, RX-FIFO, NIC (N, W, S, E).]
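The slide names work stealing but does not describe the policy. The Python sketch below shows one common variant under stated assumptions: only an idle node steals, and it takes half of the frames of the most loaded node. The `steal_work` function and this policy are hypothetical and are not the HS-L2 implementation.

```python
from collections import deque


def steal_work(queues, thief):
    """Move frames from the most loaded node to an idle node.

    queues maps a node id to its deque of ready frames. The policy
    (steal half of the victim's frames, rounded down) is an assumption
    made for illustration. Returns the number of frames moved.
    """
    if queues[thief]:
        return 0  # only idle nodes steal in this sketch
    victim = max(queues, key=lambda node: len(queues[node]))
    count = len(queues[victim]) // 2
    for _ in range(count):
        # Take from the tail so the victim keeps its oldest frames.
        queues[thief].append(queues[victim].pop())
    return count


queues = {0: deque(range(6)), 1: deque(), 2: deque([10])}
moved = steal_work(queues, thief=1)
```

Here node 1 is idle and node 0 holds six frames, so three frames move from node 0 to node 1 and node 2 is left untouched.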

Design Snippets
[Figure: design snippets of the following blocks: GM Controller, GM DMA, RFQ Controller, RFQ DMA, Register Controller, Schedule FSM, Decrease FSM, Fetch FSM, Msg-Composer, Msg-Interpreter, TX-FIFO, RX-FIFO.]

Evaluation – Execution Cycles
Number of clock cycles (PL) per operation:
Operation                     Data Width   Worst   Best
FIFO Enqueue/Dequeue          64 bits      2       1
Global Memory Write (DDR4)    16 bytes     48      40
Global Memory Read (DDR4)     16 bytes     38      38
Ready Queue Write             32 bits      48      40
Ready Queue Read              32 bits      44      44
Number of clock cycles (PL) per instruction:
Instruction Name   Delay Contributors   Worst   Best
HDF-Schedule       Total                49      40
                   DMA IP               48      39
                   Decoder FSM          1       1
HDF-Decrease       Total                89      43
                   DMA IP               86      40
                   Decoder FSM          3       3
HDF-Fetch          Total                85      34
                   DMA IP               82      31
                   Fetch FSM            3       3
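Each total in the per-instruction figures is the sum of its delay contributors. The Python sketch below recomputes the totals from the slide's numbers and shows how a cycle count would translate into time. The clock frequency is a hypothetical value chosen for the example, since the slides do not state the PL clock; the helper names are also illustrative.

```python
# Per-instruction delay contributors in PL clock cycles as (worst, best),
# copied from the evaluation figures.
CONTRIBUTORS = {
    "HDF-Schedule": {"DMA IP": (48, 39), "Decoder FSM": (1, 1)},
    "HDF-Decrease": {"DMA IP": (86, 40), "Decoder FSM": (3, 3)},
    "HDF-Fetch": {"DMA IP": (82, 31), "Fetch FSM": (3, 3)},
}


def total_cycles(parts):
    """Sum worst-case and best-case cycles over all contributors."""
    worst = sum(w for w, _ in parts.values())
    best = sum(b for _, b in parts.values())
    return worst, best


def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds for a given clock frequency."""
    return cycles * 1000.0 / clock_mhz


ASSUMED_CLOCK_MHZ = 100.0  # hypothetical PL clock; not stated in the slides

totals = {name: total_cycles(parts) for name, parts in CONTRIBUTORS.items()}
worst_schedule_ns = cycles_to_ns(totals["HDF-Schedule"][0], ASSUMED_CLOCK_MHZ)
```

The recomputed totals match the slide: 49/40 for HDF-Schedule, 89/43 for HDF-Decrease and 85/34 for HDF-Fetch (worst/best). In every case the DMA IP dominates the latency.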

Evaluation – Resource Utilization
• Resource utilization extracted from Vivado Design Suite 2016.4.
• Axiom board, Zynq UltraScale+ XCZU9EG platform.
PL Units   Number of Units   Available   Utilization %
LUT        20357             274080      7.43
LUTRAM     2876              144000      2.00
FF         26116             548160      4.76
BRAM       49.50             912         5.43
IO         27                204         13.24
GT         2                 16          12.50
BUFG       6                 404         1.49
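The utilization column is the number of used units divided by the available units. A short Python check of that arithmetic, using the numbers from the slide (the dictionary and function names are illustrative):

```python
# (used, available) per PL resource, copied from the utilization figures.
RESOURCES = {
    "LUT": (20357, 274080),
    "LUTRAM": (2876, 144000),
    "FF": (26116, 548160),
    "BRAM": (49.5, 912),
    "IO": (27, 204),
    "GT": (2, 16),
    "BUFG": (6, 404),
}


def utilization_percent(used, available):
    """Utilization as a percentage, rounded to two decimals."""
    return round(100.0 * used / available, 2)


utilization = {name: utilization_percent(used, available)
               for name, (used, available) in RESOURCES.items()}
```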

Results: HDF-Threads vs OpenMPI – Matrix Multiply Test 512+8
[Figure: three bar charts comparing OpenMPI and HDF-Threads. Panels: Execution Time (sec) for 1N 1C, 2N 1C, 4N 1C, 4N 2C; Speedup T(1)/T(n) for 2N 1C, 4N 1C, 4N 2C; Efficiency S(p)/p for 2N 1C, 4N 1C, 4N 2C.]
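The two derived metrics on the plot axes are Speedup T(1)/T(n) and Efficiency S(p)/p. A minimal Python sketch of both follows; the timings are hypothetical placeholders and are not the measured results from the slides.

```python
def speedup(t1, tn):
    """Speedup: single-unit execution time divided by parallel time."""
    return t1 / tn


def efficiency(t1, tn, p):
    """Efficiency: speedup divided by the number of processing units p."""
    return speedup(t1, tn) / p


# Hypothetical timings in seconds, NOT the measured results from the slides.
t_one_unit = 12.0
t_four_units = 4.0
s = speedup(t_one_unit, t_four_units)
e = efficiency(t_one_unit, t_four_units, p=4)
```

With these placeholder numbers the speedup is 3.0 on four units, which corresponds to an efficiency of 0.75.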

Results: HDF-Threads vs OpenMPI – Matrix Multiply Test (size = 512, b = 8)
[Figure: two bar charts comparing OpenMPI and HDF-Threads for 1N 1C, 2N 1C, 4N 1C, 4N 2C. Panels: Kernel Cycles and Bus Utilization %.]
Data labels shown in the Kernel Cycles panel: 2.26, 2.17, 2.01, 1.97, 1.4, 1.08, 1.07, 1.07.
Data labels shown in the Bus Utilization % panel: 58.73, 52.03, 42.56, 39.75, 2.6, 2.52, 2.42, 0.79.

References
[1] Frank, D. J., Dennard, R. H., Nowak, E., Solomon, P. M., Taur, Y., & Wong, H. S. P. (2001). Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, 89(3), 259-288.
[2] Mondelli, Andrea, et al. "Dataflow support in x86_64 multicore architectures through small hardware extensions." Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015.
[3] Dennis, J. B. (1980). Data flow supercomputers. Computer, (11), 48-56.
[4] Giorgi, R., & Faraboschi, P. (2014, October). An introduction to DF-Threads and their execution model. In Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on (pp. 60-65). IEEE.
[5] Verdoscia, L., Vaccaro, R., & Giorgi, R. (2014, August). A clockless computing system based on the static dataflow paradigm. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2014 Fourth Workshop on (pp. 30-37). IEEE.
[6] Giorgi, R., Popovic, Z., & Puzovic, N. (2007, October). DTA-C: A decoupled multi-threaded architecture for CMP systems. In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on (pp. 263-270). IEEE.
[7] Kavi, K. M., Giorgi, R., & Arul, J. (2001). Scheduled dataflow: Execution paradigm, architecture, and performance evaluation. IEEE Transactions on Computers, 50(8), 834-846.
[8] Procaccini, M., Giorgi, R. (2017). A Data-Flow Execution Engine for Scalable Embedded Computing. HiPEAC ACACES-2018.
[9] Procaccini, M., Khalili, F., Giorgi, R. (2018). An FPGA-based Scalable Hardware Scheduler for Data-Flow Models. HiPEAC ACACES-2018.
[10] Khalili, F., Procaccini, M., Giorgi, R. (2018). Reconfigurable Logic Interface Architecture for CPU-FPGA Accelerators. HiPEAC ACACES-2018.
[11] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., & Ortega, D. (2009). COTSon: infrastructure for full system simulation. ACM SIGOPS Operating Systems Review, 43(1), 52-61.
[12] Theodoropoulos, D., Mazumdar, S., Ayguade, E., Bettin, N., Bueno, J., Ermini, S., ... & Montefoschi, F. (2017). The AXIOM platform for next-generation cyber physical systems. Microprocessors and Microsystems, 52, 540-555.

THANK YOU FOR YOUR ATTENTION. ANY QUESTIONS?
