Broadening the Exploration of the Accelerator Design Space in
Embedded Scalable Platforms
IEEE High Performance Extreme Computing Conference (HPEC), 2017
Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia
[Figure: processor cores (#1 to #4) alongside hardware accelerators]
[Figure: an Embedded Scalable Platform made of a processor core, 12 hardware accelerators, two memory controllers, and an I/O tile (misc. channels, etc.)]
[L. Carloni, “The Case for Embedded Scalable Platforms”, DAC 2016]
[Figure: ESP instance with a LEON3 CPU, two memory controllers, an I/O tile (misc. channels, etc.), and 12 accelerators: GRAYSCALE, GRADIENT, DEBAYER, MATRIX-MUL, MATRIX-SUB, MATRIX-ADD, STEEP-DESC., CHANGE-DET, MATRIX-RES, HESSIAN, SD-UPDATE, WARP]
[Figure: the ESP instance with its LEON3 CPU, memory controllers, I/O tile, and 12 accelerators; every accelerator is labeled HLS]
[Figure: two copies of the ESP instance, each with the LEON3 CPU, two memory controllers, the I/O tile (misc. channels, etc.), and the accelerators WARP, DEBAYER, MATRIX-MUL, STEEP-DESC., GRADIENT, MATRIX-ADD, CHANGE-DET, HESSIAN, GRAYSCALE, MATRIX-SUB, MATRIX-RES, SD-UPDATE]
[Figure: structure of the GRAYSCALE accelerator: the GRAYSCALE logic (load, compute, store), a multi-bank input PLM, a multi-bank output PLM, and the GRAYSCALE interface]
Private Local Memories (PLMs)
[Figure: High-Level Synthesis (HLS) generates RTL for the GRAYSCALE accelerator. Knob configurations #1 to #4 each yield a different RTL implementation, and each implementation is either Pareto optimal or Pareto dominated.]
Knob                 Settings and Effects
Loop manipulations   Unrolls, pipelines or breaks the body of loops
Array mappings       Maps arrays to registers or on-chip memories
Clock period         Sets the target clock period for synthesis
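The Pareto optimal / Pareto dominated distinction can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: it assumes each synthesized design is summarized by an (area, effective latency) pair, that both metrics are to be minimized, and the four sample values are made up.

```python
from typing import List, Tuple

# (area, effective latency); both metrics are minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical results for four knob configurations (illustrative numbers only).
designs = [(1.0, 3.0), (1.5, 2.0), (2.0, 2.5), (2.5, 1.0)]
front = pareto_front(designs)
print(front)
```

Here the third design is dominated by the second, which is both smaller and faster, so only the other three remain on the front.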
[Figure: DEBAYER; axes: Normalized Area and Normalized Effective Latency; series: synthesized with the standard knobs, synthesized with the proposed knobs. Annotation: bounded by on-chip memory bandwidth.]
[Figure: GRAYSCALE; axes: Normalized Area and Normalized Effective Latency; series: synthesized with the standard knobs, synthesized with the proposed knobs. Annotation: bounded by off-chip memory bandwidth.]
XKnob       Settings and Effects
PLM PORTS   Sets the on-chip memory bandwidth
DMA WIDTH   Sets the off-chip memory bandwidth
DMA CHUNK   Sets the size of the input and output PLM
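To make the role of the three XKnobs concrete, the sketch below estimates which resource bounds the processing of one chunk. It is a first-order model introduced here for illustration, not taken from the slides: it assumes DMA CHUNK is counted in words, that load, compute, and store overlap perfectly (so the slowest one sets the latency), and the parameters `word_bits`, `accesses_per_word`, and `compute_cycles` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class XKnobs:
    plm_ports: int  # PLM PORTS: PLM accesses served per cycle (on-chip bandwidth)
    dma_width: int  # DMA WIDTH: bits transferred per cycle (off-chip bandwidth)
    dma_chunk: int  # DMA CHUNK: words per chunk (assumed unit), sizes the PLMs


def chunk_bound(k: XKnobs, word_bits: int, accesses_per_word: int,
                compute_cycles: int) -> tuple:
    """Return (limiting resource, cycles) for one chunk under the assumed model."""
    on_chip = math.ceil(k.dma_chunk * accesses_per_word / k.plm_ports)
    off_chip = math.ceil(k.dma_chunk * word_bits / k.dma_width)
    candidates = [("compute", compute_cycles),
                  ("on-chip", on_chip),
                  ("off-chip", off_chip)]
    # The slowest of the overlapped activities determines the chunk latency.
    return max(candidates, key=lambda c: c[1])


# Hypothetical kernel that reads each word four times from the PLM.
narrow = XKnobs(plm_ports=1, dma_width=64, dma_chunk=1024)
wide = XKnobs(plm_ports=4, dma_width=64, dma_chunk=1024)
print(chunk_bound(narrow, word_bits=32, accesses_per_word=4, compute_cycles=256))
print(chunk_bound(wide, word_bits=32, accesses_per_word=4, compute_cycles=256))
```

Under these assumptions the kernel stays bounded by on-chip bandwidth in both cases, but going from one to four PLM ports cuts the estimated cycles per chunk from 4096 to 1024. A kernel with fewer PLM accesses per word would instead hit the off-chip bound, where DMA WIDTH is the knob that matters.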
[Figure: DEBAYER; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4]
[Figure: GRAYSCALE; axes: Normalized Area and Normalized Effective Latency; series: DMA WIDTH = 64, 128, 256, 512]
[Figure: GRAYSCALE, two panels, both with DMA WIDTH = 256 and PLM PORTS = 4/8; axes: Normalized Area and Normalized Effective Latency; series: DMA CHUNK = 256, 512, 1024, 2048]
[Figure: GRAYSCALE; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4, 8 combined with DMA WIDTH = 32, 64, 128, 256]
[Figure: DEBAYER; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4 combined with DMA WIDTH = 32, 64, 128, 256]
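The combined PLM PORTS and DMA WIDTH sweeps can be enumerated as below. The knob values are the ones that appear in the plot legends; treating each sweep as the full cross product of those values, and the names `SWEEPS` and `configurations`, are assumptions made for this sketch.

```python
from itertools import product

# Knob values as listed in the plot legends for each accelerator.
SWEEPS = {
    "GRAYSCALE": {"plm_ports": [1, 2, 4, 8], "dma_width": [32, 64, 128, 256]},
    "DEBAYER": {"plm_ports": [1, 2, 4], "dma_width": [32, 64, 128, 256]},
}


def configurations(accelerator: str):
    """Yield one dict per (PLM PORTS, DMA WIDTH) combination to synthesize."""
    sweep = SWEEPS[accelerator]
    for ports, width in product(sweep["plm_ports"], sweep["dma_width"]):
        yield {"plm_ports": ports, "dma_width": width}


for name in SWEEPS:
    print(name, len(list(configurations(name))))
```

Each configuration would then be synthesized and measured, and the resulting (area, effective latency) pairs compared to find the Pareto-optimal designs.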
[Figure: GRAYSCALE, DMA WIDTH = 256, with contention; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4, 8 combined with DMA CHUNK = 256, 512, 1024, 2048]
[Figure: DEBAYER, DMA WIDTH = 256, with contention; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4 combined with DMA CHUNK = 256, 512, 1024, 2048]