ESP4ML
Platform-Based Design of System-on-Chip for Embedded Machine Learning

Davide Giri, Kuan-Lin Chiu, Giuseppe di Guglielmo, Paolo Mantovani, Luca P. Carloni
DATE 2020
ESP4ML: an open-source design flow to build and program SoCs for ML applications. It combines:
▪ ESP, for SoC design
▪ hls4ml, for generating accelerators from ML models

Main contributions to ESP:
▪ integration of hls4ml-generated accelerators
▪ point-to-point communication among accelerators
hls4ml, by the Fast ML Lab:
▪ translates trained ML models into HLS-able accelerator specifications
▪ targets FPGAs only
▪ suited to small and ultra-low-latency networks

Image from https://fastmachinelearning.org/hls4ml/
From data-center blades to embedded SoCs: modern systems integrate CPUs, GPUs, caches ($), accelerators, I/O and DDR. ESP enables rapid prototyping and SoC integration for both application developers and hardware designers.
[Figure: ESP SoC combining a processor with accelerators integrated through HLS design flows, RTL design flows, and new design flows. Tux image by Larry Ewing (lewing@isc.tamu.edu) and The GIMP; GPU image by Nvidia Corporation.]
The ESP architecture implements a distributed system that is scalable, modular, and heterogeneous, giving processors and accelerators similar weight in the SoC.
Accelerator design flow (a mix of interactive, automated, manual, and optional steps):
▪ Generate accelerator
▪ Test behavior
▪ Generate RTL
▪ Test RTL
▪ Optimize accelerator
▪ Specialize accelerator (not required by the hls4ml flow)

SoC design flow:
▪ Generate sockets
▪ Configure SoC
▪ Compile bare-metal
▪ Simulate system
▪ Implement for FPGA
▪ Compile Linux
▪ Deploy prototype
▪ Design runtime apps
[Figure: ESP SoC with many accelerator tiles, contributed by application developers and hardware designers through HLS and RTL design flows.]

[Figure: design-space exploration with high-level synthesis. Code transformations move through the programmer-view design space, and HLS maps each code version into the RTL design space, trading off performance against area and power.]
Example of the top-level function of an ESP accelerator for Vivado HLS:

void top(dma_t *out, dma_t *in1, unsigned cfg_size,
         dma_info_t *load_ctrl, dma_info_t *store_ctrl)
{
    for (unsigned i = 0; i < cfg_size; i++) {
        word_t _inbuff[IN_BUF_SIZE];
        word_t _outbuff[OUT_BUF_SIZE];
        load(_inbuff, in1, i, load_ctrl, 0);
        compute(_inbuff, _outbuff);
        store(_outbuff, out, i, store_ctrl, cfg_size);
    }
}
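The load/compute/store pattern above can be exercised as plain software before synthesis. Below is a minimal, software-only sketch of the same loop with simplified types; the dma_t/dma_info_t handshakes are elided, and the compute body (doubling each word) is a hypothetical stand-in for any kernel with this interface:

```c
#include <assert.h>

#define BUF_SIZE 4

typedef int word_t;

/* "DMA load": copy one batch from the input stream into the
 * accelerator's private local memory (PLM) */
static void load(word_t *inbuff, const word_t *in, unsigned batch) {
    for (unsigned j = 0; j < BUF_SIZE; j++)
        inbuff[j] = in[batch * BUF_SIZE + j];
}

/* Hypothetical kernel: double each word */
static void compute(const word_t *inbuff, word_t *outbuff) {
    for (unsigned j = 0; j < BUF_SIZE; j++)
        outbuff[j] = 2 * inbuff[j];
}

/* "DMA store": copy the batch back to the output stream */
static void store(const word_t *outbuff, word_t *out, unsigned batch) {
    for (unsigned j = 0; j < BUF_SIZE; j++)
        out[batch * BUF_SIZE + j] = outbuff[j];
}

/* Same loop structure as the HLS top function: one batch per
 * iteration flows through load -> compute -> store */
void top(word_t *out, const word_t *in, unsigned cfg_size) {
    for (unsigned i = 0; i < cfg_size; i++) {
        word_t inbuff[BUF_SIZE], outbuff[BUF_SIZE];
        load(inbuff, in, i);
        compute(inbuff, outbuff);
        store(outbuff, out, i);
    }
}
```

Because the three phases only communicate through the local buffers, HLS can pipeline them, which is what the real ESP skeleton exploits.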
Completely automated integration in ESP:
▪ point-to-point (p2p) channels between accelerators
▪ no processing overheads
▪ p2p routes configured at invocation time
▪ data is pulled by the consuming accelerators, not pushed
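A toy software model of the pull semantics (all names here are illustrative, not ESP APIs): the consumer drives each transfer by requesting a token from its source when it is ready, rather than the producer pushing data downstream:

```c
#include <assert.h>

/* Stand-in for a producer accelerator: it emits its next output
 * only when the consumer asks for it (pull, not push). */
typedef struct {
    int next;   /* index of the next token to produce */
} producer_t;

static int producer_pull(producer_t *p) {
    /* placeholder for the first stage's computation */
    return 3 * p->next++;
}

/* The consumer pulls n tokens from the producer and accumulates
 * them, mirroring how an ESP p2p consumer requests data over the
 * NoC at its own pace. */
static int consumer_run(producer_t *p, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += producer_pull(p);
    return acc;
}
```

Pull semantics keep flow control at the consumer, so a slow stage never gets flooded by a fast upstream stage.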
ESP provides an API for the invocation of accelerators from a user application:
▪ the programmer does not write device drivers: ESP generates the drivers automatically
▪ memory is shared seamlessly between processors and accelerators
▪ existing applications can adopt accelerators with minimal modifications
▪ the ESP library maps tasks to accelerators
[Software stack: Application, ESP Library, and ESP alloc in user mode; ESP accelerator driver, ESP core, and Linux in kernel mode.]
/*
 * Example of an existing C application with ESP
 * accelerators that replace software kernels 2, 3 and 5
 */
{
    int *buffer = esp_alloc(size);

    for (...) {
        kernel_1(buffer, ...);  // existing software
        esp_run(cfg_k2);        // run accelerator(s)
        esp_run(cfg_k3);
        kernel_4(buffer, ...);  // existing software
        esp_run(cfg_k5);
    }

    validate(buffer);           // existing checks
    esp_cleanup();              // memory free
}
/* Example of a double-accelerator configuration */
esp_thread_info_t cfg_k12[] = {
    {
        .devname = "k1.0",
        .type = k1,
        /* accelerator configuration */
        .desc.k1_desc.nbursts = 8,
        /* p2p configuration */
        .desc.k1_desc.esp.p2p_store = true,
        .desc.k1_desc.esp.p2p_nsrcs = 0,
        .desc.k1_desc.esp.p2p_srcs = {"", "", "", ""},
    },
    {
        .devname = "k2.0",
        .type = k2,
        /* accelerator configuration */
        .desc.k2_desc.nbursts = 8,
        /* p2p configuration */
        .desc.k2_desc.esp.p2p_store = false,
        .desc.k2_desc.esp.p2p_nsrcs = 1,
        .desc.k2_desc.esp.p2p_srcs = {"k1.0", "", "", ""},
    },
};
This configuration example chains two accelerators and sets up the p2p communication between them: k1.0 stores its output to a p2p channel instead of memory, and k2.0 declares k1.0 as its single p2p source, pulling its input directly from it.
Experimental setup:
▪ SoCs implemented on FPGA (Xilinx VCU118), exercising accelerator chaining and parallelism
▪ Baselines: Intel i7 8700K and NVIDIA Jetson TX1 (256-core NVIDIA Maxwell GPU, quad-core ARM Cortex A57)
▪ Featured accelerators: an hls4ml classifier trained on a dataset from Google, a denoiser, and a night-vision pipeline with equalization
Chaining accelerators brings energy savings: our SoCs achieve better energy efficiency than the Jetson TX1 and the i7 8700K.

[Chart: Frames/Joule (normalized, log scale from 0.1 to 100) for three workloads: Night-Vision + Classifier (1NV+1Cl, 4NV+1Cl, 4NV+4Cl), Denoiser + Classifier (1De+1Cl), and Multi-tile Classifier (1Cl split), each with memory-based vs p2p communication, against the i7 8700k and Jetson TX1 baselines.]
Performance increases by up to 4.5x thanks to accelerator parallelism and chaining:

[Chart: Frames/sec (normalized, 1 to 5) for the configurations Cl split in 5, 1NV+1Cl, 2NV+1Cl, 4NV+1Cl, 2NV+2Cl, and 4NV+4Cl.]
[Chart: DRAM accesses (normalized, 0% to 100%) for the multi-tile classifier, night-vision + classifier, and denoiser + classifier workloads, comparing memory-based and p2p communication.]
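The DRAM reduction follows from simple accounting (a back-of-the-envelope model, not the measured data): chaining k accelerators through main memory costs a store plus a load at every stage, while p2p forwards intermediate results tile-to-tile, so only the pipeline's first load and final store touch DRAM:

```c
#include <assert.h>

/* Toy model: DRAM transfers per frame for a k-stage pipeline */

/* Via main memory: each stage loads its input from DRAM and
 * stores its output back to DRAM. */
static int dram_transfers_memory(int k) {
    return 2 * k;
}

/* Via p2p: intermediate results travel accelerator-to-accelerator
 * over the NoC; only the first load and the last store hit DRAM. */
static int dram_transfers_p2p(int k) {
    (void)k;    /* independent of pipeline depth */
    return 2;
}
```

For a two-stage pipeline this already halves the DRAM traffic, and the savings grow with pipeline depth.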
Conclusions: ESP4ML builds and programs SoCs for ML with accelerators generated from C/C++ (HLS) and from Keras/PyTorch/ONNX (hls4ml), integrated through efficient point-to-point communication mechanisms.
Davide Giri (www.cs.columbia.edu/~davide_giri), Kuan-Lin Chiu, Giuseppe di Guglielmo, Paolo Mantovani, Luca P. Carloni
DATE 2020

sld.cs.columbia.edu | esp.cs.columbia.edu | GitHub: sld-columbia/esp