An FPGA-Based Scalable Hardware Scheduler For Data-Flow Models




slide-1
SLIDE 1

IWES 2018 Third Italian Workshop on Embedded Systems Siena – 13-14 September 2018

An FPGA-Based Scalable Hardware Scheduler For Data-Flow Models

Roberto Giorgi, Marco Procaccini, Farnam Khalili University of Siena, Italy

slide-2
SLIDE 2

The end of Dennard scaling…

The engineering community has been forced to find new ways to improve performance within a limited power budget [1]:

• Clock frequency is no longer increased
• Designs are shifting to multicore processors

Programming limitations that prevent exploiting the full performance still remain…

Moore’s law 2018 – Source: Wikipedia

slide-3
SLIDE 3

The “DF-Threads” Data-Flow execution model

The “DF-Threads” Data-Flow execution model can take advantage of the full parallelism offered by multicore systems [2][3][4][5][6][7]:

• Execution relies on data dependencies
• Data-independent paths execute in parallel
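The two points above can be illustrated with a minimal sketch, assuming a task graph given as a plain dictionary. The function name `dataflow_waves` and the example graph are hypothetical and are not part of the DF-Threads model itself. A task becomes ready only when all of its inputs have been produced, and tasks that are ready at the same time are data independent, so they could run in parallel.

```python
from collections import defaultdict


def dataflow_waves(deps):
    """Group tasks into waves of mutually independent, ready tasks.

    deps maps each task name to the list of tasks whose results it consumes.
    """
    # Number of inputs each task is still waiting for.
    pending = {task: len(inputs) for task, inputs in deps.items()}

    # Reverse edges: producer -> tasks that consume its result.
    consumers = defaultdict(list)
    for task, inputs in deps.items():
        for producer in inputs:
            consumers[producer].append(task)

    ready = sorted(task for task, count in pending.items() if count == 0)
    waves = []
    while ready:
        waves.append(ready)
        next_ready = []
        for task in ready:
            # Completing a task satisfies one input of each consumer.
            for consumer in consumers[task]:
                pending[consumer] -= 1
                if pending[consumer] == 0:
                    next_ready.append(consumer)
        ready = sorted(next_ready)
    return waves


# Hypothetical graph: "c" needs "a", "d" needs "b", "e" needs both "c" and "d".
graph = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
waves = dataflow_waves(graph)
print(waves)
```

Here the paths a→c and b→d share no data, so they land in the same waves, while "e" has to wait for both.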

slide-4
SLIDE 4

Hybrid Data-Flow Model

DF-Threads-based execution does not need to totally replace conventional general-purpose processors (GPPs).

Hybrid model based on GPPs and Field Programmable Gate Arrays (FPGAs):

• GPP cores are suitable for legacy code or the OS
• The FPGA can easily provide efficient parallel execution via DF-Threads

slide-5
SLIDE 5

System Design

A possible architecture to enable easy distribution of Data-Flow Threads (DF-Threads) among multiple cores and multiple nodes [8]

slide-6
SLIDE 6

The Idea

Improving the scheduling of Data-Flow Threads by implementing a Hardware Scheduler (HS) on FPGA [9][10]

The GPP:
• Asynchronous APIs
• Schedules DF-Threads
• Executes DF-Threads

The HS:
• Retrieves meta-information
• Provides ready HDF-Threads
• Distributes HDF-Threads on the network

HS: hardware scheduler
HS-L1: local scheduler
HS-L2: distributed scheduler
PS: processing system
GPPs: general purpose processors
HDF: hardware Data-Flow Threads

slide-7
SLIDE 7

Compilation and testing flow

Testing Environment

• COTSon Simulator [11]
• AXIOM Board [12]

slide-8
SLIDE 8

System Abstraction in a Perspective

[Figure: system abstraction of the nodes. Labeled components: PS (Processing System) with the application (Fibonacci algorithm), Axiom Library [2], HS API, Axiom IOCTLs, HS device driver and NIC device driver; AXI buses; HS registers; PL (Programmable Logic) with the proposed Hardware Scheduler (HS) and the NIC [1]; DDR memory. Routing topologies: 2D-mesh or ring.]

[1] Vasileios Amourgianos-Lorentzos. “Efficient network interface design for low cost distributed systems” Master Thesis, 2017 at Technical University of Crete as part of the FORTH Axiom program.
[2] Evidence Embedding Technology, 2017, “https://github.com/evidence

slide-9
SLIDE 9

Hardware Scheduler (HS) Primitives

HS API, called from the PS (Processing System):

f_ptr = load_frame();
HDF_schedule(f_ptr, i_ptr, init_sc)
HDF_decrease(f_ptr, num_sc)
HDF_subscribe(d_ptr)
HDF_publish(f_ptr)

[Figure: HS primitives path from PS to PL (Programmable Logic). Labeled blocks: Register Controller [1] with the Opcode register, Argument 1 register and Argument 2 register; HS Module with Decoder FSM, HS-L1 (Hardware Scheduler Level 1) and HS-L2 (Hardware Scheduler Level 2); NIC (Network Interface Card); Memory Controller; DDR memory.]

[1] F. Khalili, M. Procaccini and R. Giorgi. “Reconfigurable logic interface architecture for cpu-fpga accelerators.” In HiPEAC ACACES-2018, pp. 1-4. Fiuggi, Italy, July 2018. Poster.
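How an API call might travel through the Opcode, Argument 1 and Argument 2 registers can be sketched as follows. This is an illustrative software model only. The opcode values, the `HsRegisters` class, the helper names and the "write the opcode last" ordering are all assumptions, since the slides do not give the real encoding or protocol.

```python
from dataclasses import dataclass

# Hypothetical opcode values; the real encoding is not given in the slides.
OPCODES = {
    "HDF_schedule": 0x1,
    "HDF_decrease": 0x2,
    "HDF_subscribe": 0x3,
    "HDF_publish": 0x4,
}


@dataclass
class HsRegisters:
    """Software stand-in for the three registers named on the slide."""
    opcode: int = 0
    arg1: int = 0
    arg2: int = 0


def write_command(regs, name, arg1=0, arg2=0):
    """Model of the PS side: fill the arguments, then the opcode.

    Writing the opcode last is an assumed convention, so that a decoder
    reacting to the opcode would already see valid arguments.
    """
    regs.arg1 = arg1
    regs.arg2 = arg2
    regs.opcode = OPCODES[name]


def decode(regs):
    """Model of the PL side: map the opcode back to a primitive name."""
    names = {value: key for key, value in OPCODES.items()}
    return names[regs.opcode], regs.arg1, regs.arg2


regs = HsRegisters()
# HDF_decrease(f_ptr, num_sc) with a made-up frame pointer.
write_command(regs, "HDF_decrease", arg1=0x1000, arg2=1)
decoded = decode(regs)
print(decoded)
```

HDF_schedule takes three parameters while only two argument registers are shown, so the real design must pack or sequence them in some way that this sketch does not model.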

slide-10
SLIDE 10

Register Controller [2]

[2] F. Khalili, M. Procaccini and R. Giorgi. “Reconfigurable logic interface architecture for cpu-fpga accelerators.” In HiPEAC ACACES-2018, pp. 1-4. Fiuggi, Italy, July 2018. Poster.

• The write/read access of each register is separately controllable through the ‘Control’ register.
• The Register Controller FSM (1) controls the Master AXI Stream Handler module (2) and exchanges data between the AXI Stream and AXI memory-mapped domains.
• The Register Controller FSM (1) also polls control_reg (3) and checks the bit field corresponding to each register to see whether it is configured for write or read access, which sets the direction of the data.
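The direction check described above can be sketched in software. This is a minimal model that assumes one bit per register in control_reg, with a set bit meaning write access and a clear bit meaning read access. The actual bit layout is not given in the slides, and the function names are hypothetical.

```python
WRITE = "write"
READ = "read"


def register_direction(control_reg, index):
    """Return the access direction of register `index`.

    Assumed layout: bit `index` of control_reg is 1 for write access
    and 0 for read access.
    """
    return WRITE if (control_reg >> index) & 1 else READ


def poll_directions(control_reg, num_registers):
    """Model one polling pass of the FSM over all registers."""
    return [register_direction(control_reg, i) for i in range(num_registers)]


# Registers 0 and 2 configured for write access, register 1 for read access.
directions = poll_directions(0b101, 3)
print(directions)
```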

slide-11
SLIDE 11

HS-L1 (Hardware Scheduler Level 1)

• Retrieves meta-information of FRAMEs (Schedule FSM).
• Schedules the FRAMEs which are ready to be executed (Decrease FSM).
• Fetches the IP (Instruction Pointer) from the ready FRAMEs (Fetch FSM).

[Figure: HS-L1 block diagram. Labeled blocks: Decoder FSM, Schedule FSM, Decrease FSM, Fetch FSM, GM-ctrler, GM-DMA, RFQ-ctrler, RFQ-DMA, memory with GM Sector and RFQ Sector, link to HS-L2.]

GM-DMA: Global Memory, Direct Memory Access
RFQ-DMA: Ready Frame Queue, Direct Memory Access
Frames are stored in the GM Sector.
Ready frame pointers are stored in the RFQ Sector.
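The interplay of the three FSMs can be sketched with a small software model. It assumes that each frame carries a synchronization count that reaches zero when the frame is ready. The class `HsL1Model`, its field names and the pointer values are hypothetical, and the real HS-L1 uses DMA transfers to DDR memory rather than Python containers.

```python
from collections import deque


class HsL1Model:
    """Toy model of HS-L1: a GM sector for frames, an RFQ sector for ready pointers."""

    def __init__(self):
        self.gm = {}           # GM sector: frame pointer -> frame meta-information
        self.rfq = deque()     # RFQ sector: pointers of frames ready to execute
        self.next_ptr = 0x100  # hypothetical first frame address

    def schedule(self, ip, sync_count):
        """Schedule FSM: record a new frame and its synchronization count."""
        f_ptr = self.next_ptr
        self.next_ptr += 0x10  # hypothetical frame size
        self.gm[f_ptr] = {"ip": ip, "sc": sync_count}
        if sync_count == 0:
            self.rfq.append(f_ptr)
        return f_ptr

    def decrease(self, f_ptr, num_sc=1):
        """Decrease FSM: lower the count; a frame reaching zero becomes ready."""
        frame = self.gm[f_ptr]
        frame["sc"] -= num_sc
        if frame["sc"] == 0:
            self.rfq.append(f_ptr)

    def fetch(self):
        """Fetch FSM: pop a ready frame and return (frame pointer, instruction pointer)."""
        if not self.rfq:
            return None
        f_ptr = self.rfq.popleft()
        return f_ptr, self.gm[f_ptr]["ip"]


hs = HsL1Model()
f = hs.schedule(ip=0x4000, sync_count=2)
hs.decrease(f)
first_fetch = hs.fetch()   # frame still waits for one input
hs.decrease(f)
second_fetch = hs.fetch()  # frame is now ready
print(first_fetch, second_fetch)
```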

slide-12
SLIDE 12

HS-L2 (Hardware Scheduler Level 2)

• Distributes FRAMEs in order to balance the load throughout the network.
  • Work-stealing from remote nodes.
  • Off-loading work to remote nodes.

[Figure: HS-L2 block diagram. Labeled blocks: Load Balancing FSM, Msg-Composer, Msg-Interpreter, TX-FIFO, RX-FIFO, HS-L1, and NIC with N, W, S and E links.]
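One way a load-balancing decision of this kind could look is sketched below. The policy is purely illustrative: the thresholds `low` and `high`, the function name and the idea of comparing ready-frame counts with the N, W, S and E neighbours are assumptions, not the Load Balancing FSM described by the authors.

```python
def balance_decision(local_ready, neighbor_ready, low=2, high=8):
    """Pick an action for one node.

    local_ready    -- number of ready frames on this node
    neighbor_ready -- dict: neighbour name -> its number of ready frames
    low, high      -- hypothetical thresholds for stealing / off-loading

    Returns ("steal", node), ("offload", node) or ("none", None).
    """
    if not neighbor_ready:
        return ("none", None)
    if local_ready < low:
        # Underloaded: steal from the most loaded neighbour, if it has more work.
        donor = max(neighbor_ready, key=neighbor_ready.get)
        if neighbor_ready[donor] > local_ready:
            return ("steal", donor)
    elif local_ready > high:
        # Overloaded: off-load to the least loaded neighbour, if it has less work.
        target = min(neighbor_ready, key=neighbor_ready.get)
        if neighbor_ready[target] < local_ready:
            return ("offload", target)
    return ("none", None)


neighbours = {"N": 5, "W": 0, "S": 1, "E": 3}
print(balance_decision(0, neighbours))
print(balance_decision(10, neighbours))
print(balance_decision(4, neighbours))
```

In hardware, the resulting action would be turned into a message by the Msg-Composer and sent through the TX-FIFO; that part is not modelled here.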

slide-13
SLIDE 13

Design Snippets

[Figure: design snippets of the Register Controller, Schedule FSM, Decrease FSM, Fetch FSM, RFQ Controller, GM Controller, GM DMA, RFQ DMA, Msg-Interpreter, Msg-Composer, RX-FIFO and TX-FIFO.]

slide-14
SLIDE 14

Evaluation – Execution Cycles

Number of clock cycles (PL) per operation:

Operation                     Data Width   Worst   Best
FIFO Enqueue/Dequeue          64 bits      2       1
Global Memory Write (DDR4)    16 bytes     48      40
Global Memory Read (DDR4)     16 bytes     38      38
Ready Queue Write             32 bits      48      40
Ready Queue Read              32 bits      44      44

Number of clock cycles (PL) per instruction:

Instruction Name   Delay Contributors   Worst   Best
HDF-Schedule       Total                49      40
                   DMA IP               48      39
                   Decoder FSM          1       1
HDF-Decrease       Total                89      43
                   DMA IP               86      40
                   Decoder FSM          3       3
HDF-Fetch          Total                85      34
                   DMA IP               82      31
                   Fetch FSM            3       3
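As a quick consistency check of the delay breakdown above, each instruction total is the sum of its two contributors. The short sketch below recomputes the totals from the reported DMA IP and FSM cycles; the dictionary layout and function name are just illustrative choices.

```python
# (worst, best) clock cycles per delay contributor, copied from the figures above.
delays = {
    "HDF-Schedule": {"DMA IP": (48, 39), "Decoder FSM": (1, 1)},
    "HDF-Decrease": {"DMA IP": (86, 40), "Decoder FSM": (3, 3)},
    "HDF-Fetch": {"DMA IP": (82, 31), "Fetch FSM": (3, 3)},
}


def total_cycles(contributors):
    """Sum worst-case and best-case cycles over all contributors."""
    worst = sum(w for w, _ in contributors.values())
    best = sum(b for _, b in contributors.values())
    return worst, best


totals = {name: total_cycles(parts) for name, parts in delays.items()}
for name, (worst, best) in totals.items():
    print(f"{name}: worst={worst}, best={best}")
```

In every case the DMA transfer dominates, and the FSM itself contributes at most 3 cycles.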

slide-15
SLIDE 15

Evaluation – Resource Utilization

PL Units   Number of Units   Available   Utilization %
LUT        20357             274080      7.43
LUTRAM     2876              144000      2.00
FF         26116             548160      4.76
BRAM       49.50             912         5.43
IO         27                204         13.24
GT         2                 16          12.50
BUFG       6                 404         1.49

• Resource utilization extracted from Vivado Design Suite 2016.4.
• Axiom board, Zynq UltraScale+ XCZU9EG platform.
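The utilization column follows directly from the other two: utilization % = 100 × units used / units available. The sketch below recomputes it from the reported counts; the variable and function names are illustrative.

```python
# (used, available) per PL resource, copied from the utilization figures above.
resources = {
    "LUT": (20357, 274080),
    "LUTRAM": (2876, 144000),
    "FF": (26116, 548160),
    "BRAM": (49.5, 912),
    "IO": (27, 204),
    "GT": (2, 16),
    "BUFG": (6, 404),
}


def utilization_percent(used, available):
    """Utilization as a percentage, rounded to two decimals."""
    return round(100.0 * used / available, 2)


utilization = {
    name: utilization_percent(used, available)
    for name, (used, available) in resources.items()
}
for name, pct in utilization.items():
    print(f"{name:7s} {pct:6.2f} %")
```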
slide-16
SLIDE 16

Results

HDF-Threads vs OpenMPI: Matrix Multiply Test 512+8

[Figure: Execution Time. Y axis: execution time (sec), 2 to 14. X axis: 1N 1C, 2N 1C, 4N 1C, 4N 2C. Series: OpenMPI, HDF-Threads.]

[Figure: Speedup. Y axis: speedup T(1)/T(n), 1 to 10. X axis: 2N 1C, 4N 1C, 4N 2C. Series: OpenMPI, HDF-Threads.]

[Figure: Efficiency. Y axis: efficiency S(p)/p, 0.1 to 1. X axis: 2N 1C, 4N 1C, 4N 2C. Series: OpenMPI, HDF-Threads.]
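The two derived metrics on this slide are speedup, T(1)/T(n), and efficiency, S(p)/p. A minimal worked example follows. The timing values are hypothetical and are not read from the charts, and taking p as nodes × cores for a configuration such as "4N 2C" is an assumption.

```python
def speedup(t_single, t_parallel):
    """Speedup S = T(1) / T(n)."""
    return t_single / t_parallel


def efficiency(s, p):
    """Efficiency E = S(p) / p, where p is the number of processing elements."""
    return s / p


# Hypothetical timings in seconds, NOT taken from the slides.
t_1n1c = 12.0   # one node, one core
t_4n2c = 2.0    # four nodes, two cores each
p = 4 * 2       # assumed: p = nodes x cores

s = speedup(t_1n1c, t_4n2c)
e = efficiency(s, p)
print(f"speedup={s}, efficiency={e}")
```

An efficiency of 1 would mean perfectly linear scaling; values below 1 show how much of the added hardware is lost to overheads.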

slide-17
SLIDE 17

Results

HDF-Threads vs OpenMPI: Matrix Multiply Test (size=512, b=8)

Kernel Cycles (%)

Configuration   OpenMPI   HDF-Threads
1N 1C           39.75     0.79
2N 1C           42.56     2.6
4N 1C           52.03     2.52
4N 2C           58.73     2.42

Bus Utilization

Configuration   OpenMPI   HDF-Threads
1N 1C           1.97      1.07
2N 1C           2.17      1.07
4N 1C           2.26      1.08
4N 2C           2.01      1.4

slide-18
SLIDE 18

References

[1] Frank, D. J., Dennard, R. H., Nowak, E., Solomon, P. M., Taur, Y., & Wong, H. S. P. (2001). Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, 89(3), 259-288.

[2] Mondelli, Andrea, et al. "Dataflow support in x86_64 multicore architectures through small hardware extensions." Digital System Design (DSD), 2015 Euromicro Conference on. IEEE, 2015.

[3] Dennis, J. B. (1980). Data flow supercomputers. Computer, (11), 48-56.

[4] Giorgi, R., & Faraboschi, P. (2014, October). An introduction to DF-Threads and their execution model. In Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on (pp. 60-65). IEEE.

[5] Verdoscia, L., Vaccaro, R., & Giorgi, R. (2014, August). A clockless computing system based on the static dataflow paradigm. In Data-Flow Execution Models for Extreme Scale Computing (DFM), 2014 Fourth Workshop on (pp. 30-37). IEEE.

[6] Giorgi, R., Popovic, Z., & Puzovic, N. (2007, October). DTA-C: A decoupled multi-threaded architecture for CMP systems. In Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on (pp. 263-270). IEEE.

[7] Kavi, K. M., Giorgi, R., & Arul, J. (2001). Scheduled dataflow: Execution paradigm, architecture, and performance evaluation. IEEE Transactions on Computers, 50(8), 834-846.

[8] Procaccini, M., Giorgi, R. (2017). A Data-Flow Execution Engine for Scalable Embedded Computing. HiPEAC ACACES-2018.

[9] Procaccini, M., Khalili, F., Giorgi, R. (2018). An FPGA-based Scalable Hardware Scheduler for Data-Flow Models. HiPEAC ACACES-2018.

[10] Khalili, F., Procaccini, M., Giorgi, R. (2018). Reconfigurable Logic Interface Architecture for CPU-FPGA Accelerators. HiPEAC ACACES-2018.

[11] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., & Ortega, D. (2009). COTSon: infrastructure for full system simulation. ACM SIGOPS Operating Systems Review, 43(1), 52-61.

[12] Theodoropoulos, D., Mazumdar, S., Ayguade, E., Bettin, N., Bueno, J., Ermini, S., ... & Montefoschi, F. (2017). The AXIOM platform for next-generation cyber physical systems. Microprocessors and Microsystems, 52, 540-555.

slide-19
SLIDE 19

Thank you for your attention. Any questions?