Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms
Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni (Columbia University)
IEEE High Performance Extreme Computing Conference (HPEC), 2017


SLIDE 1

Broadening the Exploration of the Accelerator Design Space in Embedded Scalable Platforms

IEEE High Performance Extreme Computing Conference (HPEC), 2017

Luca Piccolboni, Paolo Mantovani, Giuseppe Di Guglielmo, Luca Carloni
Columbia University, New York, NY, USA

SLIDE 2

Why Hardware Accelerators?

  • High-performance embedded systems are heterogeneous:
      • they include multiple general-purpose processor cores
      • they include special-function hardware accelerators

[Figure: a system with four processor cores (#1 to #4) and five hardware accelerators; a scale labeled "Efficiency" and "Generality" compares processor cores with hardware accelerators]

SLIDE 3

Embedded Scalable Platforms (ESP)

  • To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms [L. Carloni, "The Case for Embedded Scalable Platforms", DAC 2016]

[Figure: generic ESP instance with two memory controllers, one processor core, I/O and miscellaneous channels, and twelve hardware accelerators]

SLIDE 4

Embedded Scalable Platforms (ESP)

  • To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms

[Figure: ESP instance for WAMI (Wide-Area Motion Imagery) with two memory controllers, one LEON3 CPU processor core, I/O and miscellaneous channels, and twelve accelerators: GRAYSCALE, GRADIENT, DEBAYER, MATRIX-MUL, MATRIX-SUB, MATRIX-ADD, STEEP-DESC., CHANGE-DET, MATRIX-RES, HESSIAN, SD-UPDATE, WARP]

SLIDE 5

Embedded Scalable Platforms (ESP)

  • To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms

[Figure: ESP instance with two memory controllers, one LEON3 CPU processor core, I/O and miscellaneous channels, and twelve accelerators (GRAYSCALE, GRADIENT, DEBAYER, MATRIX-MUL, MATRIX-SUB, MATRIX-ADD, STEEP-DESC., CHANGE-DET, MATRIX-RES, HESSIAN, SD-UPDATE, WARP), each marked "HLS"; caption: System-Level Design with High-Level Synthesis (HLS)]

SLIDE 6

Embedded Scalable Platforms (ESP)

  • To balance the demand for hardware specialization with the need to maintain helpful degrees of regularity and modularity, we proposed Embedded Scalable Platforms

[Figure: ESP instance for WAMI (Wide-Area Motion Imagery) with two memory controllers, one LEON3 CPU processor core, I/O and miscellaneous channels, and twelve accelerators (WARP, DEBAYER, MATRIX-MUL, STEEP-DESC., GRADIENT, MATRIX-ADD, CHANGE-DET, HESSIAN, GRAYSCALE, MATRIX-SUB, MATRIX-RES, SD-UPDATE); captions: System-Level Design with High-Level Synthesis (HLS); rapid integration and prototyping]

SLIDE 7

Hardware Accelerators with HLS

[Figure: the WAMI ESP instance next to a detailed view of the GRAYSCALE accelerator, which is described by a SystemC specification. The view shows the GRAYSCALE logic (load, compute, store), the GRAYSCALE interface, and the Private Local Memories (PLMs): an input PLM and an output PLM, each made of multiple banks]

SLIDE 8

Hardware Accelerators with HLS

[Figure: the GRAYSCALE accelerator (load, compute, store logic; interface; banked input and output PLMs). High-Level Synthesis (HLS) turns the SystemC specification into RTL using knob configuration #1; the result is plotted on a chart of Performance (Latency) against Cost (Area)]

SLIDE 9

Hardware Accelerators with HLS

[Figure: the GRAYSCALE accelerator (load, compute, store logic; interface; banked input and output PLMs). High-Level Synthesis (HLS) turns the SystemC specification into RTL using knob configuration #2; the result is plotted on a chart of Performance (Latency) against Cost (Area)]

SLIDE 10

Hardware Accelerators with HLS

[Figure: the GRAYSCALE accelerator (load, compute, store logic; interface; banked input and output PLMs). High-Level Synthesis (HLS) turns the SystemC specification into RTL using knob configuration #3; the result is plotted on a chart of Performance (Latency) against Cost (Area)]

SLIDE 11

Hardware Accelerators with HLS

[Figure: the GRAYSCALE accelerator (load, compute, store logic; interface; banked input and output PLMs). High-Level Synthesis (HLS) turns the SystemC specification into RTL using knob configuration #4; the result is plotted on a chart of Performance (Latency) against Cost (Area)]

SLIDE 12

Hardware Accelerators with HLS

[Figure: the GRAYSCALE accelerator (load, compute, store logic; interface; banked input and output PLMs). High-Level Synthesis (HLS) turns the SystemC specification into several RTL implementations; on the chart of Performance (Latency) against Cost (Area), each implementation is marked as either Pareto optimal or Pareto dominated]
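
The split between Pareto-optimal and Pareto-dominated implementations can be illustrated with a short Python sketch. It is not the authors' tool: the function names and the sample (latency, area) points are hypothetical, and both metrics are assumed to be minimized.

```python
from typing import List, Tuple

# A design point is (latency, area); both metrics are assumed to be minimized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a is no worse than b on both metrics and better on one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates, sorted by latency."""
    return sorted(p for p in points if not any(dominates(q, p) for q in points))


# Hypothetical normalized results for five knob configurations.
designs = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5), (3.0, 3.0)]
front = pareto_front(designs)
```

In this made-up data set, (2.5, 2.5) and (3.0, 3.0) are dominated, so only three of the five configurations remain on the front.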

SLIDE 13

Standard HLS Knobs

Standard knobs provided by the current HLS tools:

Knob                  Settings and Effects
Loop manipulations    Unrolls, pipelines or breaks the body of loops
Array mappings        Maps arrays to registers or on-chip memories
Clock period          Sets the target clock period for synthesis

  • These knobs already enable a rich design-space exploration
  • However, they are not sufficient for exploring accelerators
  • We need other knobs to broaden the exploration

SLIDE 14

Motivational Example #1

  • Limiting factor: limited bandwidth to the on-chip memory
  • We need knobs to tailor the PLM to the accelerator needs

[Figure: DEBAYER design points; axes: Normalized Area and Normalized Effective Latency; series: synthesized with the standard knobs, synthesized with the proposed knobs; annotation: "Bounded by on-chip memory bandwidth"]

SLIDE 15

Motivational Example #2

  • Limiting factor: limited bandwidth to the off-chip memory
  • We need knobs to operate on the communication interfaces

[Figure: GRAYSCALE design points; axes: Normalized Area and Normalized Effective Latency; series: synthesized with the standard knobs, synthesized with the proposed knobs; annotation: "Bounded by off-chip memory bandwidth"]

SLIDE 16

Contributions: XKnobs

eXtended Knobs for High-Level Synthesis

XKnob        Settings and Effects
PLM PORTS    Sets the on-chip memory bandwidth
DMA WIDTH    Sets the off-chip memory bandwidth
DMA CHUNK    Sets the size of the input and output PLM
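
To give a feel for how three extra knobs enlarge the set of configurations that an exploration can visit, the sketch below enumerates combinations of the three XKnobs. The class name, the field names, and the choice to explore the full Cartesian product are assumptions made for illustration; the value lists are examples taken from plot legends elsewhere in this deck, not a definition of the legal settings.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class XKnobConfig:
    plm_ports: int  # read/write ports of the input/output PLMs
    dma_width: int  # width of the DMA channel, in bits
    dma_chunk: int  # PLM size, in multiples of the stored data type


# Example values as they appear in plot legends of this deck; exploring their
# full Cartesian product is an assumption of this sketch.
PLM_PORTS = [1, 2, 4, 8]
DMA_WIDTH = [32, 64, 128, 256]
DMA_CHUNK = [256, 512, 1024, 2048]


def enumerate_configs() -> List[XKnobConfig]:
    """List every combination of the three XKnob settings."""
    return [
        XKnobConfig(ports, width, chunk)
        for ports, width, chunk in product(PLM_PORTS, DMA_WIDTH, DMA_CHUNK)
    ]


configs = enumerate_configs()
```

Each configuration would then be combined with the standard HLS knobs and synthesized to obtain one (latency, area) point.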

SLIDE 17

XKnob #1: PLM PORTS

  • Sets the number of read/write ports of input/output PLMs
  • Higher values of PLM PORTS → more read/write accesses
  • Higher values of PLM PORTS → higher area (more banks)

[Figure: DEBAYER design points; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4]
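
One common way to provide several ports is to split an array across banks so that neighboring elements sit in different banks. The sketch below assumes cyclic interleaving with one bank per port; the slides do not state how the PLM banks are organized, so the mapping and the function names are hypothetical.

```python
from math import ceil
from typing import Tuple


def bank_and_offset(index: int, ports: int) -> Tuple[int, int]:
    """Map an array index to (bank, offset) under cyclic interleaving.

    Assumption: one bank per port, so consecutive indices fall in distinct banks.
    """
    return index % ports, index // ports


def access_cycles(num_elements: int, ports: int) -> int:
    """Cycles needed to stream num_elements when `ports` accesses fit in a cycle."""
    return ceil(num_elements / ports)


# Four consecutive elements land in four different banks when ports = 4.
banks_hit = {bank_and_offset(i, 4)[0] for i in range(8, 12)}
```

Under this model, going from 1 to 4 ports cuts the cycles for 1024 sequential accesses from 1024 to 256, at the cost of four banks instead of one.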

SLIDE 18

XKnob #2: DMA WIDTH

  • Sets the size in bits of the DMA communication channels
  • Higher values of DMA WIDTH → higher mem. throughput
  • Higher values of DMA WIDTH → higher area (more banks), due to the higher number of write/read ports of input/output PLMs

[Figure: GRAYSCALE design points; axes: Normalized Area and Normalized Effective Latency; series: DMA WIDTH = 64, 128, 256, 512]
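
The relation between DMA WIDTH, transfer length, and PLM ports can be worked through with simple arithmetic. The sketch assumes densely packed elements and one channel-wide beat per transfer step; the 32-bit element size and the 1024-element workload are hypothetical values, not figures from the slides.

```python
def elements_per_beat(data_bits: int, dma_width: int) -> int:
    """Elements delivered by one DMA beat; each needs its own PLM port
    if the whole beat is to be written in a single cycle."""
    return dma_width // data_bits


def dma_beats(num_elements: int, data_bits: int, dma_width: int) -> int:
    """Beats needed to move num_elements, rounding up to whole beats."""
    total_bits = num_elements * data_bits
    return (total_bits + dma_width - 1) // dma_width


# Hypothetical workload: 1024 elements of 32 bits each.
beats = {width: dma_beats(1024, 32, width) for width in (64, 128, 256, 512)}
```

Doubling the width halves the number of beats, and it doubles the number of elements that arrive together, which is why a wider channel also calls for more PLM ports.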

SLIDE 19

XKnob #3: DMA CHUNK

  • Sets the size of the PLM in multiples of the stored data type
  • Higher values of DMA CHUNK → optimized communication
  • Higher values of DMA CHUNK → higher area (for the PLM)

[Figure: two GRAYSCALE plots, one with contention and one without contention, both with DMA WIDTH = 256 and PLM PORTS = 4/8; axes: Normalized Area and Normalized Effective Latency; series: DMA CHUNK = 256, 512, 1024, 2048]
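
A rough cost model shows why a larger DMA CHUNK can help communication while costing area. It assumes that every DMA transaction pays a fixed setup overhead and that each element then takes one cycle; both the model and the numbers are hypothetical and are not taken from the slides.

```python
from math import ceil


def dma_transactions(total_elements: int, dma_chunk: int) -> int:
    """Transactions needed when each one moves at most dma_chunk elements."""
    return ceil(total_elements / dma_chunk)


def transfer_cycles(total_elements: int, dma_chunk: int, setup_cycles: int) -> int:
    """Total cycles: a fixed setup cost per transaction plus one cycle per element."""
    return dma_transactions(total_elements, dma_chunk) * setup_cycles + total_elements


def plm_bits(dma_chunk: int, data_bits: int) -> int:
    """PLM capacity when it holds dma_chunk elements of data_bits each."""
    return dma_chunk * data_bits


# Hypothetical workload: 4096 elements, 20 setup cycles per transaction.
small_chunk = transfer_cycles(4096, 256, 20)
large_chunk = transfer_cycles(4096, 2048, 20)
```

With these numbers the larger chunk needs 2 transactions instead of 16, so less time goes to setup, while the PLM grows eightfold.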

SLIDE 20

Experimental Results

  • We evaluate the combined effects of the XKnobs by using:
      • GRAYSCALE → accelerator limited by communication
      • DEBAYER → accelerator limited by computation
  • The other WAMI accelerators behave similarly to either the GRAYSCALE accelerator or the DEBAYER accelerator

SLIDE 21

Experiment #1

  • We consider two XKnobs: PLM PORTS and DMA WIDTH
  • GRAYSCALE → accelerator limited by communication

[Figure: GRAYSCALE design points; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4, 8 and DMA WIDTH = 32, 64, 128, 256]

SLIDE 22

Experiment #1

  • We consider two XKnobs: PLM PORTS and DMA WIDTH
  • DEBAYER → accelerator limited by computation

[Figure: DEBAYER design points; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4 and DMA WIDTH = 32, 64, 128, 256]

SLIDE 23

Experiment #2

  • We consider two XKnobs: PLM PORTS and DMA CHUNK
  • GRAYSCALE → accelerator limited by communication

[Figure: GRAYSCALE design points with DMA WIDTH = 256, labeled CONTENTION; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4, 8 and DMA CHUNK = 256, 512, 1024, 2048]

SLIDE 24

Experiment #2

  • We consider two XKnobs: PLM PORTS and DMA CHUNK
  • DEBAYER → accelerator limited by computation

[Figure: DEBAYER design points with DMA WIDTH = 256, labeled CONTENTION; axes: Normalized Area and Normalized Effective Latency; series: PLM PORTS = 1, 2, 4 and DMA CHUNK = 256, 512, 1024, 2048]

SLIDE 25

Concluding Remarks

  • We presented the XKnobs: a set of knobs that aims at extending the standard knobs used in current HLS tools
  • The XKnobs can be integrated in any HLS tool and design-space exploration methodology to enrich the set of Pareto-optimal implementations of hardware accelerators
  • For WAMI, the XKnobs broaden the design space by up to 8.5x for performance and 3.5x for cost
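
One plausible reading of "broadening the design space by up to N x" is the ratio between the largest and smallest value of a metric across the explored implementations. The sketch below computes that ratio for latency and area. This interpretation, the function name, and the sample points are assumptions; the points are invented and do not reproduce the reported 8.5x and 3.5x figures.

```python
from typing import List, Tuple


def design_space_span(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (latency span, area span), each the max/min ratio over all points.

    Assumption: "span" means the ratio between the extreme values of a metric.
    """
    latencies = [latency for latency, _ in points]
    areas = [area for _, area in points]
    return max(latencies) / min(latencies), max(areas) / min(areas)


# Hypothetical (latency, area) points for one accelerator.
spans = design_space_span([(1.0, 2.0), (2.0, 1.5), (4.0, 1.0)])
```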

SLIDE 26

Thank you for your attention!

Speaker: Luca Piccolboni, Columbia University, NY