  1. IWES 2018 – Third Italian Workshop on Embedded Systems, Siena, 13-14 September 2018
     An FPGA-Based Scalable Hardware Scheduler For Data-Flow Models
     Roberto Giorgi, Marco Procaccini, Farnam Khalili – University of Siena, Italy

  2. The end of Dennard scaling…
     The engineering community has been forced to find new solutions to improve performance with a limited power budget [1] by:
     • stopping the increase of clock frequency
     • shifting to multicore processors
     (Moore’s law, 2018 – Source: Wikipedia)
     Programming limitations to exploiting the full performance still remain…

  3. The “DF-Threads” Data-Flow execution model
     The DF-Threads Data-Flow execution model is capable of exploiting the full parallelism offered by multicore systems [2][3][4][5][6][7]:
     • execution relies on data dependencies
     • parallel execution of data-independent paths

  4. Hybrid Data-Flow Model
     DF-Threads-based execution does not need to completely replace conventional general-purpose processors (GPPs). A hybrid model combines GPPs and Field Programmable Gate Arrays (FPGAs):
     • GPP cores are suitable for legacy code or the OS
     • the FPGA can easily provide efficient parallel execution via DF-Threads

  5. System Design
     A possible architecture to enable an easy distribution of the Data-Flow Threads (DF-Threads) [8] among multiple cores and multiple nodes.

  6. The Idea
     Improving the scheduling of Data-Flow Threads by implementing a Hardware Scheduler (HS) on the FPGA [9][10].
     PS: processing system; GPPs: general-purpose processors; HDF: hardware Data-Flow Threads; HS: hardware scheduler; HS-L1: local scheduler; HS-L2: distributed scheduler.
     The GPP:
     • invokes the HS through asynchronous APIs
     • schedules DF-Threads
     • executes DF-Threads
     The HS:
     • retrieves meta-information
     • provides ready HDF-Threads
     • distributes HDF-Threads over the network

  7. Compilation and testing flow
     Testing environment:
     • COTSon simulator [11]
     • AXIOM board [12]

  8. System Abstraction in a Perspective
     Routing topologies: 2D-mesh or ring.
     Block diagram: on each node, the PS (Processing System) runs the application (Fibonacci algorithm) on top of the HS API, the AXIOM library [2], the AXIOM IOCTLs, and the NIC and HS device drivers; the PL (Programmable Logic) hosts the proposed Hardware Scheduler (HS) and the NIC [1], which are connected to the PS and the DDR memory through AXI buses and the HS registers. Multiple PS+PL nodes are interconnected via their NICs.
     [1] Vasileios Amourgianos-Lorentzos. “Efficient network interface design for low cost distributed systems”. Master Thesis, Technical University of Crete, 2017, as part of the FORTH AXIOM program.
     [2] Evidence Embedding Technology, 2017, https://github.com/evidence
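     As a rough illustration of the software stack in the diagram, the sketch below shows how an HS API call on the PS side could be forwarded to the HS device driver through an ioctl. Everything specific here is an assumption: the /dev/hs device node, the HS_IOC_SCHEDULE command and the hs_schedule_args layout are hypothetical, not the actual AXIOM library or driver interface.

        #include <fcntl.h>
        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        /* Assumed ioctl command and argument layout (illustrative only). */
        struct hs_schedule_args {
            uint64_t f_ptr;     /* frame pointer        */
            uint64_t i_ptr;     /* instruction pointer  */
            uint32_t init_sc;   /* initial synch. count */
        };
        #define HS_IOC_SCHEDULE _IOW('h', 1, struct hs_schedule_args)

        int hdf_schedule_ioctl(uint64_t f_ptr, uint64_t i_ptr, uint32_t init_sc)
        {
            int fd = open("/dev/hs", O_RDWR);             /* hypothetical device node */
            if (fd < 0)
                return -1;

            struct hs_schedule_args args = { f_ptr, i_ptr, init_sc };
            int ret = ioctl(fd, HS_IOC_SCHEDULE, &args);  /* forwarded to the HS registers */
            close(fd);
            return ret;
        }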

  9. Hardware Scheduler (HS) Primitives
     HS API primitives (invoked from the PS):
     • f_ptr = load_frame();
     • HDF_schedule(f_ptr, i_ptr, init_sc)
     • HDF_decrease(f_ptr, num_sc)
     • HDF_subscribe(d_ptr)
     • HDF_publish(f_ptr)
     Block diagram: the HS module in the PL comprises a Register Controller (opcode, argument 1 and argument 2 registers plus a decoder FSM), the Hardware Scheduler Level 1 (HS-L1) and the Hardware Scheduler Level 2 (HS-L2), connected to the DDR memory controller and to the NIC (Network Interface Card); the HS API runs on the PS (Processing System) [1].
     [1] F. Khalili, M. Procaccini and R. Giorgi. “Reconfigurable logic interface architecture for CPU-FPGA accelerators.” In HiPEAC ACACES-2018, pp. 1-4, Fiuggi, Italy, July 2018. Poster.
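     To make the primitives above concrete, here is a minimal sketch of how a producer running on the PS might create and trigger a DF-thread, assuming the usual DF-Threads semantics (one frame per thread plus a synchronization count decremented as inputs arrive). The frame layout, the ADDER_INPUTS constant, the write_frame() helper and the adder() body are hypothetical, added only for illustration; only the primitives themselves come from the slide.

        /* Illustrative use of the HS API primitives listed above (sketch only). */
        #include <stdint.h>

        #define ADDER_INPUTS 2      /* hypothetical: the consumer waits for two inputs */

        extern void *load_frame(void);                                    /* HS API */
        extern void  HDF_schedule(void *f_ptr, void *i_ptr, int init_sc);
        extern void  HDF_decrease(void *f_ptr, int num_sc);

        void adder(void *frame);    /* DF-thread body executed once its frame is ready */

        /* Hypothetical helper: store one input value into a slot of a target frame. */
        static void write_frame(void *f_ptr, int slot, uint64_t value)
        {
            ((uint64_t *)f_ptr)[slot] = value;
        }

        void producer(void)
        {
            /* 1. Allocate a frame for the consumer thread. */
            void *child = load_frame();

            /* 2. Bind the frame to its code and declare how many inputs it needs. */
            HDF_schedule(child, (void *)adder, ADDER_INPUTS);

            /* 3. Deliver the inputs and decrement the synchronization count;
             *    when it reaches zero, HS-L1 moves the frame to the ready queue. */
            write_frame(child, 0, 40);
            write_frame(child, 1, 2);
            HDF_decrease(child, ADDER_INPUTS);
        }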

  10. Register Controller [2]
     • The write/read access of each register is separately controllable through the ‘Control’ register.
     • The Register Controller FSM (1) is responsible for controlling the Master AXI-Stream Handler module (2) and for exchanging data between the AXI-Stream and AXI memory-mapped domains.
     • The Register Controller FSM (1) also polls control_reg (3) and checks the corresponding bit fields of each register to determine whether it is configured for write access or read access, in order to set the direction of the data.
     [2] F. Khalili, M. Procaccini and R. Giorgi. “Reconfigurable logic interface architecture for CPU-FPGA accelerators.” In HiPEAC ACACES-2018, pp. 1-4, Fiuggi, Italy, July 2018. Poster.
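     A minimal software model of the direction check described above, assuming one direction bit per register inside control_reg; the bit layout, the register count and the DIR_* names are hypothetical and only illustrate the idea.

        #include <stdint.h>

        #define NUM_HS_REGS 3          /* opcode, argument 1, argument 2 (see slide 9) */

        enum reg_dir { DIR_READ = 0, DIR_WRITE = 1 };

        /* Return the configured access direction of register `idx` by polling control_reg.
         * Assumption: bit `idx` of control_reg selects write (1) or read (0) access. */
        static enum reg_dir register_direction(volatile uint32_t *control_reg, int idx)
        {
            return ((*control_reg >> idx) & 0x1u) ? DIR_WRITE : DIR_READ;
        }

        /* Example: refresh the data direction of every register according to control_reg. */
        static void update_directions(volatile uint32_t *control_reg, enum reg_dir dirs[NUM_HS_REGS])
        {
            for (int i = 0; i < NUM_HS_REGS; i++)
                dirs[i] = register_direction(control_reg, i);
        }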

  11. HS-L1 (Hardware Scheduler Level 1)
     • Retrieves the meta-information of FRAMEs (Schedule FSM)
     • Schedules the FRAMEs that are ready to be executed (Decrease FSM)
     • Fetches the IP (Instruction Pointer) from the ready FRAMEs (Fetch FSM)
     Frames are stored in the GM sector; ready frame pointers are stored in the RFQ sector.
     Block diagram: the Schedule, Decrease and Fetch FSMs access the Global Memory (GM) sector through the GM controller and the GM DMA, and the Ready Frame Queue (RFQ) sector through the RFQ controller and the RFQ DMA, via a memory decoder; HS-L1 also interfaces with HS-L2.
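     A behavioral sketch (a software model, not the actual FPGA FSM) of the Decrease step described above, assuming that each frame's meta-information in the GM sector holds a synchronization count and an instruction pointer; the struct fields and the rfq_push() helper are hypothetical.

        #include <stdint.h>

        struct frame_meta {
            uint32_t sync_count;    /* inputs still missing                  */
            void    *ip;            /* instruction pointer of the DF-thread  */
        };

        extern void rfq_push(void *f_ptr);   /* hypothetical: enqueue into the RFQ sector */

        /* Decrease step as seen by HS-L1: decrement the count and, when it reaches
         * zero, move the frame pointer into the Ready Frame Queue. */
        void hs_l1_decrease(struct frame_meta *frame, uint32_t num_sc)
        {
            frame->sync_count -= num_sc;
            if (frame->sync_count == 0)
                rfq_push(frame);             /* the Fetch FSM will later read its IP */
        }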

  12. HS-L2 (Hardware Scheduler Level 2)
     • Distributes FRAMEs in order to balance the load throughout the network
     • Work-stealing from remote nodes
     • Off-loading of work to remote nodes
     Block diagram: a Load-Balancing FSM connects HS-L1 to the NIC (north/west/south/east links) through a Msg-Composer that feeds a TX-FIFO and a Msg-Interpreter that drains an RX-FIFO.
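     An illustrative sketch of the load-balancing decision described above, assuming the Load-Balancing FSM compares the local ready-queue occupancy against two thresholds; the thresholds, message types and helper functions are hypothetical and not taken from the actual design.

        #include <stddef.h>
        #include <stdint.h>

        #define HIGH_WATERMARK 32    /* hypothetical: too much local work -> off-load    */
        #define LOW_WATERMARK   2    /* hypothetical: almost idle -> try to steal work   */

        enum msg_type { MSG_OFFLOAD_FRAME, MSG_STEAL_REQUEST };

        extern uint32_t rfq_occupancy(void);                    /* local ready-frame count */
        extern void    *rfq_pop(void);
        extern void     tx_fifo_send(enum msg_type, void *);    /* via the Msg-Composer    */

        void hs_l2_balance_step(void)
        {
            uint32_t ready = rfq_occupancy();

            if (ready > HIGH_WATERMARK)          /* off-load one frame to a neighbor */
                tx_fifo_send(MSG_OFFLOAD_FRAME, rfq_pop());
            else if (ready < LOW_WATERMARK)      /* ask a remote node for work       */
                tx_fifo_send(MSG_STEAL_REQUEST, NULL);
        }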

  13. Design Snippets
     Block-design views of the main modules: Register Controller, Schedule FSM, Decrease FSM, Fetch FSM, GM Controller, GM DMA, RFQ Controller, RFQ DMA, Msg-Composer, Msg-Interpreter, TX-FIFO, RX-FIFO.

  14. Evaluation – Execution Cycles
     Number of clock cycles (PL) per basic operation:
       Operation                     Data width   Worst   Best
       FIFO Enqueue/Dequeue          64 bits        2       1
       Global Memory Write (DDR4)    16 bytes      48      40
       Global Memory Read (DDR4)     16 bytes      38      38
       Ready Queue Write             32 bits       48      40
       Ready Queue Read              32 bits       44      44
     Number of clock cycles (PL) per HS instruction:
       Instruction     Delay contributor   Worst   Best
       HDF-Schedule    Total                49      40
                       DMA IP               48      39
                       Decoder FSM           1       1
       HDF-Decrease    Total                89      43
                       DMA IP               86      40
                       Decoder FSM           3       3
       HDF-Fetch       Total                85      34
                       DMA IP               82      31
                       Fetch FSM             3       3

  15. Evaluation – Resource Utilization
     • Resource utilization extracted from the Vivado Design Suite 2016.4
     • AXIOM board: Zynq UltraScale+ XCZU9EG platform
       PL units   Number of units   Available   Utilization %
       LUT            20357           274080        7.43
       LUTRAM          2876           144000        2.00
       FF             26116           548160        4.76
       BRAM            49.50             912        5.43
       IO                27              204       13.24
       GT                 2               16       12.50
       BUFG               6              404        1.49

  16. Results: HDF-Threads vs. OpenMPI – Matrix Multiply test (size 512, block 8)
     Three charts compare OpenMPI and HDF-Threads across the 1N 1C, 2N 1C, 4N 1C and 4N 2C configurations (N = nodes, C = cores): execution time (sec), speedup T(1)/T(n) and efficiency S(p)/p.

  17. Results: HDF-Threads vs. OpenMPI – Matrix Multiply test (size = 512, b = 8)
     Two charts compare OpenMPI and HDF-Threads across the 1N 1C, 2N 1C, 4N 1C and 4N 2C configurations: kernel cycles and bus utilization (%).

  18. References
     [1] Frank, D. J., Dennard, R. H., Nowak, E., Solomon, P. M., Taur, Y., & Wong, H. S. P. (2001). Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, 89(3), 259-288.
     [2] Mondelli, A., et al. (2015). Dataflow support in x86_64 multicore architectures through small hardware extensions. In Digital System Design (DSD), 2015 Euromicro Conference on. IEEE.
     [3] Dennis, J. B. (1980). Data flow supercomputers. Computer, (11), 48-56.
     [4] Giorgi, R., & Faraboschi, P. (2014, October). An introduction to DF-Threads and their execution model. In Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on (pp. 60-65). IEEE.
     [5] Verdoscia, L., Vaccaro, R., & Giorgi, R. (2014, August). A clockless computing system based on the static dataflow paradigm. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2014 Fourth Workshop on (pp. 30-37). IEEE.
     [6] Giorgi, R., Popovic, Z., & Puzovic, N. (2007, October). DTA-C: A decoupled multi-threaded architecture for CMP systems. In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on (pp. 263-270). IEEE.
     [7] Kavi, K. M., Giorgi, R., & Arul, J. (2001). Scheduled dataflow: Execution paradigm, architecture, and performance evaluation. IEEE Transactions on Computers, 50(8), 834-846.
     [8] Procaccini, M., & Giorgi, R. (2017). A Data-Flow Execution Engine for Scalable Embedded Computing. HiPEAC ACACES-2018.
     [9] Procaccini, M., Khalili, F., & Giorgi, R. (2018). An FPGA-based Scalable Hardware Scheduler for Data-Flow Models. HiPEAC ACACES-2018.
     [10] Khalili, F., Procaccini, M., & Giorgi, R. (2018). Reconfigurable Logic Interface Architecture for CPU-FPGA Accelerators. HiPEAC ACACES-2018.
     [11] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., & Ortega, D. (2009). COTSon: infrastructure for full system simulation. ACM SIGOPS Operating Systems Review, 43(1), 52-61.
     [12] Theodoropoulos, D., Mazumdar, S., Ayguade, E., Bettin, N., Bueno, J., Ermini, S., ... & Montefoschi, F. (2017). The AXIOM platform for next-generation cyber physical systems. Microprocessors and Microsystems, 52, 540-555.

  19. THANK YOU FOR YOUR ATTENTION. ANY QUESTIONS?
