ESP4ML
Platform-Based Design of System-on-Chip for Embedded Machine Learning

Davide Giri, Kuan-Lin Chiu, Giuseppe di Guglielmo, Paolo Mantovani, Luca P. Carloni
DATE 2020
ESP4ML: an open-source design flow to build and program SoCs for ML applications. It combines:
▪ ESP, for SoC design
▪ hls4ml, for generating accelerators from ML models

Main contributions to ESP:
▪ integration of hls4ml-generated accelerators
▪ point-to-point communication among accelerators
hls4ml, by the Fast ML Lab:
▪ translates trained ML models into HLS-able accelerator specifications
▪ targets FPGAs only
▪ suited to small and ultra-low-latency networks

Image from https://fastmachinelearning.org/hls4ml/
From data-center blades to embedded SoCs: modern systems integrate CPUs, GPUs, caches ($), accelerators, I/O and DDR. ESP enables rapid prototyping and SoC integration for both application developers and hardware designers.
[Figure: ESP SoC combining a processor with accelerators integrated through HLS design flows, RTL design flows, and new design flows. Tux image by Larry Ewing (lewing@isc.tamu.edu) and The GIMP; GPU image by Nvidia Corporation.]
The ESP architecture implements a distributed system that is scalable, modular, and heterogeneous, giving processors and accelerators similar weight in the SoC.
Accelerator design flow (a mix of interactive, automated, manual, and optional steps):
▪ Generate accelerator
▪ Test behavior
▪ Generate RTL
▪ Test RTL
▪ Optimize accelerator
▪ Specialize accelerator (not required by the hls4ml flow)

SoC design flow:
▪ Generate sockets
▪ Configure SoC
▪ Compile bare-metal
▪ Simulate system
▪ Implement for FPGA
▪ Compile Linux
▪ Deploy prototype
▪ Design runtime apps
[Figure: ESP SoC with many accelerator tiles, contributed by application developers and hardware designers through HLS and RTL design flows.]

[Figure: design-space exploration with high-level synthesis. Code transformations move through the programmer-view design space, and HLS maps each code version into the RTL design space, trading off performance against area and power.]
Example of the top-level function of an ESP accelerator for Vivado HLS:

void top(dma_t *out, dma_t *in1, unsigned cfg_size,
         dma_info_t *load_ctrl, dma_info_t *store_ctrl)
{
    for (unsigned i = 0; i < cfg_size; i++) {
        word_t _inbuff[IN_BUF_SIZE];
        word_t _outbuff[OUT_BUF_SIZE];
        load(_inbuff, in1, i, load_ctrl, 0);
        compute(_inbuff, _outbuff);
        store(_outbuff, out, i, store_ctrl, cfg_size);
    }
}
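The load/compute/store pattern above can be exercised as plain software before synthesis. Below is a minimal, software-only sketch of the same loop with simplified types; the dma_t/dma_info_t handshakes are elided, and the compute body (doubling each word) is a hypothetical stand-in for any kernel with this interface:

```c
#include <assert.h>

#define BUF_SIZE 4

typedef int word_t;

/* "DMA load": copy one batch from the input stream into the
 * accelerator's private local memory (PLM) */
static void load(word_t *inbuff, const word_t *in, unsigned batch) {
    for (unsigned j = 0; j < BUF_SIZE; j++)
        inbuff[j] = in[batch * BUF_SIZE + j];
}

/* Hypothetical kernel: double each word */
static void compute(const word_t *inbuff, word_t *outbuff) {
    for (unsigned j = 0; j < BUF_SIZE; j++)
        outbuff[j] = 2 * inbuff[j];
}

/* "DMA store": copy the batch back to the output stream */
static void store(const word_t *outbuff, word_t *out, unsigned batch) {
    for (unsigned j = 0; j < BUF_SIZE; j++)
        out[batch * BUF_SIZE + j] = outbuff[j];
}

/* Same loop structure as the HLS top function: one batch per
 * iteration flows through load -> compute -> store */
void top(word_t *out, const word_t *in, unsigned cfg_size) {
    for (unsigned i = 0; i < cfg_size; i++) {
        word_t inbuff[BUF_SIZE], outbuff[BUF_SIZE];
        load(inbuff, in, i);
        compute(inbuff, outbuff);
        store(outbuff, out, i);
    }
}
```

Because the three phases only communicate through the local buffers, HLS can pipeline them, which is what the real ESP skeleton exploits.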
Completely automated integration in ESP:
▪ point-to-point (p2p) channels between accelerators
▪ no processing overheads
▪ p2p routes configured at invocation time
▪ data is pulled by the consuming accelerators, not pushed
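A toy software model of the pull semantics (all names here are illustrative, not ESP APIs): the consumer drives each transfer by requesting a token from its source when it is ready, rather than the producer pushing data downstream:

```c
#include <assert.h>

/* Stand-in for a producer accelerator: it emits its next output
 * only when the consumer asks for it (pull, not push). */
typedef struct {
    int next;   /* index of the next token to produce */
} producer_t;

static int producer_pull(producer_t *p) {
    /* placeholder for the first stage's computation */
    return 3 * p->next++;
}

/* The consumer pulls n tokens from the producer and accumulates
 * them, mirroring how an ESP p2p consumer requests data over the
 * NoC at its own pace. */
static int consumer_run(producer_t *p, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += producer_pull(p);
    return acc;
}
```

Pull semantics keep flow control at the consumer, so a slow stage never gets flooded by a fast upstream stage.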
ESP provides an API for the invocation of accelerators from a user application:
▪ the programmer does not write device drivers: ESP generates the drivers automatically
▪ memory is shared seamlessly between processors and accelerators
▪ existing applications can adopt accelerators with minimal modifications
▪ the ESP library maps tasks to accelerators
[Software stack: Application, ESP Library, and ESP alloc in user mode; ESP accelerator driver, ESP core, and Linux in kernel mode.]
/*
 * Example of an existing C application with ESP
 * accelerators that replace software kernels 2, 3 and 5
 */
{
    int *buffer = esp_alloc(size);

    for (...) {
        kernel_1(buffer, ...);  // existing software
        esp_run(cfg_k2);        // run accelerator(s)
        esp_run(cfg_k3);
        kernel_4(buffer, ...);  // existing software
        esp_run(cfg_k5);
    }

    validate(buffer);           // existing checks
    esp_cleanup();              // memory free
}
/* Example of a double-accelerator configuration */
esp_thread_info_t cfg_k12[] = {
    {
        .devname = "k1.0",
        .type = k1,
        /* accelerator configuration */
        .desc.k1_desc.nbursts = 8,
        /* p2p configuration */
        .desc.k1_desc.esp.p2p_store = true,
        .desc.k1_desc.esp.p2p_nsrcs = 0,
        .desc.k1_desc.esp.p2p_srcs = {"", "", "", ""},
    },
    {
        .devname = "k2.0",
        .type = k2,
        /* accelerator configuration */
        .desc.k2_desc.nbursts = 8,
        /* p2p configuration */
        .desc.k2_desc.esp.p2p_store = false,
        .desc.k2_desc.esp.p2p_nsrcs = 1,
        .desc.k2_desc.esp.p2p_srcs = {"k1.0", "", "", ""},
    },
};
This configuration example chains two accelerators and sets up the p2p communication between them: k1.0 stores its output to a p2p channel instead of memory, and k2.0 declares k1.0 as its single p2p source, pulling its input directly from it.
Experimental setup:
▪ SoCs implemented on FPGA (Xilinx VCU118), exercising accelerator chaining and parallelism
▪ Baselines: Intel i7 8700K and NVIDIA Jetson TX1 (256-core NVIDIA Maxwell GPU, quad-core ARM Cortex A57)
▪ Featured accelerators: an hls4ml classifier trained on a dataset from Google, a denoiser, and a night-vision pipeline with equalization
Chaining accelerators brings energy savings: our SoCs achieve better energy efficiency than the Jetson TX1 and the i7 8700K.

[Chart: Frames/Joule (normalized, log scale from 0.1 to 100) for three workloads: Night-Vision + Classifier (1NV+1Cl, 4NV+1Cl, 4NV+4Cl), Denoiser + Classifier (1De+1Cl), and Multi-tile Classifier (1Cl split), each with memory-based vs p2p communication, against the i7 8700k and Jetson TX1 baselines.]
Performance increases by up to 4.5x thanks to accelerator parallelism and chaining:

[Chart: Frames/sec (normalized, 1 to 5) for the configurations Cl split in 5, 1NV+1Cl, 2NV+1Cl, 4NV+1Cl, 2NV+2Cl, and 4NV+4Cl.]
[Chart: DRAM accesses (normalized, 0% to 100%) for the multi-tile classifier, night-vision + classifier, and denoiser + classifier workloads, comparing memory-based and p2p communication.]
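The DRAM reduction follows from simple accounting (a back-of-the-envelope model, not the measured data): chaining k accelerators through main memory costs a store plus a load at every stage, while p2p forwards intermediate results tile-to-tile, so only the pipeline's first load and final store touch DRAM:

```c
#include <assert.h>

/* Toy model: DRAM transfers per frame for a k-stage pipeline */

/* Via main memory: each stage loads its input from DRAM and
 * stores its output back to DRAM. */
static int dram_transfers_memory(int k) {
    return 2 * k;
}

/* Via p2p: intermediate results travel accelerator-to-accelerator
 * over the NoC; only the first load and the last store hit DRAM. */
static int dram_transfers_p2p(int k) {
    (void)k;    /* independent of pipeline depth */
    return 2;
}
```

For a two-stage pipeline this already halves the DRAM traffic, and the savings grow with pipeline depth.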
Conclusions: ESP4ML builds and programs SoCs for ML with accelerators generated from C/C++ (HLS) and from Keras/PyTorch/ONNX (hls4ml), integrated through efficient point-to-point communication mechanisms.
Davide Giri (www.cs.columbia.edu/~davide_giri), Kuan-Lin Chiu, Giuseppe di Guglielmo, Paolo Mantovani, Luca P. Carloni
DATE 2020

sld.cs.columbia.edu | esp.cs.columbia.edu | GitHub: sld-columbia/esp