Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing


SLIDE 1

1 Energy Efficient Embedded Systems Laboratory · 2 Integrated Systems Laboratory

Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing

2nd Italian Workshop on Embedded Systems, Università degli Studi di Roma “La Sapienza”

8.9.2017

Francesco Conti 1,2, Davide Rossi 1, Luca Benini 1,2

f.conti@unibo.it

SLIDE 2

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 2

Battery + Harvesting powered → a few mW power envelope

SLIDE 3

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 3

Battery + Harvesting powered → a few mW power envelope

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 4

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 4

Battery + Harvesting powered → a few mW power envelope

Analyze and Classify

µController + IOs: 1 ÷ 25 MOPS, 1 ÷ 10 mW

e.g. Cortex-M

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 5

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 5

Battery + Harvesting powered → a few mW power envelope

Long range, low BW · short range, medium BW · low-rate (periodic) data · SW updates, commands

Transmit

Idle: ~1 µW · Active: ~50 mW

Analyze and Classify

µController + IOs: 1 ÷ 25 MOPS, 1 ÷ 10 mW

e.g. Cortex-M

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 6

The Road to Efficiency

[Figure: maximum frequency (MHz), energy efficiency (Gop/s/W), total power (W) and active leakage power (mW) versus supply voltage (V), 65 nm CMOS at 50 °C; the subthreshold region begins around 320 mV.]

Adapted from Borkar and Chien, The Future of Microprocessors, Communications of the ACM, May 2011

Near-threshold operation under a performance constraint naturally calls for parallel computing: lowering the supply toward the threshold voltage improves energy per operation but reduces the maximum frequency, and running several cores in parallel recovers the lost throughput. Parallel computing is particularly attractive for analytics workloads, which often expose natural parallelism, and is naturally coupled with near-threshold computing.
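As a rough worked example of why the two techniques combine well (illustrative numbers, not measurements from the figure): suppose moving to near-threshold cuts the maximum frequency by about 10x while improving energy per operation by about 5x; parallelism then buys the frequency back.

```latex
% Illustrative first-order model (not measured data): near-threshold operation
% trades frequency for energy efficiency, and parallelism recovers throughput.
\[
\text{Throughput} \approx N \cdot f_{\mathrm{NT}}, \qquad
\frac{E_{\mathrm{op,NT}}}{E_{\mathrm{op,nom}}} \approx \tfrac{1}{5}
\]
\[
\text{e.g. } f_{\mathrm{NT}} = \frac{f_{\mathrm{nom}}}{10},\; N = 10
\;\Rightarrow\;
\text{same throughput at } \sim 5\times \text{ better GOPS/W.}
\]
```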

08/09/17 F.Conti @ IWES 2017 6

SLIDE 7

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 7

Battery + Harvesting powered → a few mW power envelope

Long range, low BW · short range, medium BW · low-rate (periodic) data · SW updates, commands

Transmit

Idle: ~1 µW · Active: ~50 mW

Analyze and Classify

µController + IOs: 1 ÷ 25 MOPS, 1 ÷ 10 mW (e.g. Cortex-M); target with an L2 memory: 1 ÷ 2000 MOPS within the same 1 ÷ 10 mW

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 9

PULP architecture outline

[Cluster diagram: cores #1…#N, each with a private instruction cache, connected through a logarithmic interconnect to multi-banked tightly-coupled data memory (TCDM).]

Parallel access to shared memory → flexibility

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.
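To illustrate the programming model, a minimal sketch of a data-parallel kernel on the cluster is shown below; the core-ID and barrier helpers (core_id, cluster_barrier) and NUM_CORES are hypothetical stand-ins for the PULP runtime, not its actual API:

```c
#include <stdint.h>

#define NUM_CORES 4

/* Hypothetical runtime hooks: any PULP-like runtime provides equivalents. */
extern int  core_id(void);          /* index of the calling core, 0..NUM_CORES-1 */
extern void cluster_barrier(void);  /* barrier backed by the HW synchronizer      */

/* Scale a vector held in the shared L1 TCDM, one interleaved slice per core. */
void scale_parallel(int32_t *buf, int n, int32_t gain)
{
    for (int i = core_id(); i < n; i += NUM_CORES)
        buf[i] = (buf[i] * gain) >> 8;   /* fixed-point multiply, Q8 gain */

    cluster_barrier();                   /* all cores see the result afterwards */
}
```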

08/09/17 F.Conti @ IWES 2017 9

SLIDE 10

PULP architecture outline

[Cluster diagram: the per-core instruction caches are replaced by a shared instruction cache with per-core L0 fetch buffers; cores still reach the TCDM banks through the logarithmic interconnect.]

Shared I$ + L0 fetch buffers → efficiency

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 10

SLIDE 11

PULP architecture outline

[Cluster diagram: as above, with the TCDM banks implemented as hybrid SRAM + standard-cell memory (SCM).]

Hybrid memory (SRAM + SCM) → can work at very low Vdd

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 11

SLIDE 12

PULP architecture outline

[Cluster diagram: as above, with a hardware synchronizer added.]

HW synchronizer → faster core shutdown + parallelism

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 12

SLIDE 13

PULP architecture outline

[Cluster diagram: as above.]

Fine-grain clock gating + body biasing → less power

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 13

SLIDE 14

PULP architecture outline

[SoC diagram: the cluster (cores, shared instruction cache with L0 buffers, SRAM/SCM TCDM banks, HW synchronizer) gains a DMA, an instruction bus and a bus adapter, and connects over the cluster bus to an L2 memory and QSPI master/slave interfaces.]

Add infrastructure to access off-cluster memory

08/09/17 F.Conti @ IWES 2017 14

SLIDE 15

How to get even more efficient?

[Figure: maximum frequency (MHz), energy efficiency (Gop/s/W), total power (W) and active leakage power (mW) versus supply voltage (V), 65 nm CMOS at 50 °C; the subthreshold region begins around 320 mV.]

Adapted from Borkar and Chien, The Future of Microprocessors, Communications of the ACM, May 2011

Beyond parallel computing: heterogeneous computing.

08/09/17 F.Conti @ IWES 2017 15

SLIDE 16

HW Acceleration in Tightly-Coupled Clusters

A host processor outside the cluster

[Diagram: PULP cluster (cores #1…#N with instruction caches, logarithmic interconnect, TCDM memory banks) connected through a cluster interface, DMA, instruction bus and bus adapter to the cluster bus, an L2 memory and the host processor.]

08/09/17 F.Conti @ IWES 2017 16

SLIDE 17

HW Acceleration in Tightly-Coupled Clusters

HW Processing Engines inside the cluster

[Diagram: as above, with a HW Processing Engine added inside the cluster, sharing the TCDM banks with the cores through the logarithmic interconnect.]

08/09/17 F.Conti @ IWES 2017 17

SLIDE 18

HW Acceleration in Tightly-Coupled Clusters

[Diagram: complete picture: cores #1…#N and the HW Processing Engine share the multi-banked TCDM through the logarithmic interconnect; the cluster interface, DMA, instruction bus and bus adapter connect to the cluster bus, L2 memory and host processor.]

08/09/17 F.Conti @ IWES 2017 18

SLIDE 19

HW Processing Engines

  • On the data plane, cores see HWPEs as a set of additional "virtual" cores (Core #N+1, #N+2, #N+3, …) accessing the same shared memory as the SW cores.
  • On the control plane, cores control HWPEs as a memory-mapped peripheral, much like a "virtual" DMA engine.
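To make the control-plane view concrete, a minimal sketch of an offload sequence is shown below; the base address, register offsets and event-wait primitive are hypothetical placeholders, not the actual PULP/Fulmine memory map:

```c
#include <stdint.h>

/* Hypothetical memory map: in a real system these come from the SoC headers. */
#define HWPE_BASE        0x10201000u
#define HWPE_REG_SRC     (*(volatile uint32_t *)(HWPE_BASE + 0x00))
#define HWPE_REG_DST     (*(volatile uint32_t *)(HWPE_BASE + 0x04))
#define HWPE_REG_LEN     (*(volatile uint32_t *)(HWPE_BASE + 0x08))
#define HWPE_REG_TRIGGER (*(volatile uint32_t *)(HWPE_BASE + 0x0C))

extern void wait_for_hwpe_event(void);  /* hypothetical: sleep until the HWPE raises its done event */

/* Offload one job, exactly as a core would program a DMA transfer. */
void hwpe_run_job(uint32_t src_l1, uint32_t dst_l1, uint32_t len)
{
    HWPE_REG_SRC     = src_l1;   /* operands live in the shared L1 TCDM */
    HWPE_REG_DST     = dst_l1;
    HWPE_REG_LEN     = len;
    HWPE_REG_TRIGGER = 1;        /* start the job */

    wait_for_hwpe_event();       /* the core can be clock-gated meanwhile */
}
```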

08/09/17 F.Conti @ IWES 2017 19

SLIDE 20

HW Processing Engines

[Diagram: HWPEs attach to the L1 shared memory through the TCDM logarithmic interconnect (data plane) and are controlled by the cores through the peripheral interconnect (control plane).]

  • On the data plane, cores see HWPEs as a set of additional "virtual" cores.
  • On the control plane, cores control HWPEs as a memory-mapped peripheral (e.g. like a DMA engine).

08/09/17 F.Conti @ IWES 2017 20

SLIDE 21

HW Processing Engines

[Diagram: the HWPE wrapper contains address translation toward the TCDM logarithmic interconnect (data plane) and a register file + control logic reached through the peripheral interconnect (control plane).]

  • On the data plane, cores see HWPEs as a set of additional "virtual" cores.
  • On the control plane, cores control HWPEs as a memory-mapped peripheral (e.g. like a DMA engine).

08/09/17 F.Conti @ IWES 2017 21

SLIDE 22

PULP: a busy silicon schedule 2013-2???

  • ST 28nm FDSOI: PULP1, PULP2, PULP3 (on board)
  • UMC 65nm: Artemis, Hecate, Selene, Diana (FPU); Mia Wallace – full system (on board); Imperio – PULPino chip (on board); Fulmine – secure smart analytics (on board); Patronus – tiny cores (taped out)
  • GF 28nm: Honey Bunny – first RISC-V based (on board)
  • GF 22nm: Ariane – RISC-V 64-bit core (under development); Quentin – second-gen PULPino MCU (under development)
  • UMC 180nm: Sir10us, Or10n
  • SMIC 130nm: VivoSoC, VivoSoC2 (on board)
  • ALP 180nm: Diego, Manny
  • TSMC 40nm: Mr. Wolf (taping out)

08/09/17 F.Conti @ IWES 2017 22

SLIDE 24

The Fulmine System-on-Chip

Fulmine SoC:

  • UMC 65nm technology
  • 6.86 mm2
  • 4 cores, 2 accelerators
  • HWCE for 3D conv layers
  • HWCRYPT for AES
  • DSP-optimized cores
  • 64 kB of L1, 192 kB of L2
  • uDMA for I/O with no SW intervention

  • QSPI master/slave
  • I2C
  • I2S
  • UART

08/09/17 F.Conti @ IWES 2017 24

SLIDE 25

The Fulmine PULP cluster for Secure Smart Analytics

[Fulmine cluster diagram: cores #1…#N with shared instruction cache and per-core L0 buffers, HW synchronizer, DMA and SRAM banks on the cluster interconnect, plus two tightly-coupled accelerators: the Hardware Encryption Engine (HWCRYPT) and the Hardware Convolution Engine (HWCE).]

08/09/17 F.Conti @ IWES 2017 25

SLIDE 26

HWPEs for CNNs?

[Figure: relative execution time of the layers of a scene-labeling CNN [Cavigelli et al., DAC 2015]; the network alternates convolution + activation (C1, C2, …) and max/avg pooling (P1, P2, …) stages applied to the input image.]

A convolutional layer is a set of convolve-accumulate loops:

$y(i, j) = y_0(i, j) + \left( W \ast x \right)(i, j)$

  • 1. Suitable for streaming implementation
  • 2. Can use shared memory for intermediate results (i.e. accumulation)
  • 3. Target one case in HW, but manage all by SW
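In plain C, the convolve-accumulate loop above looks roughly as follows (a sketch only: K, the data layout and the fixed-point formats are illustrative, and the HWCE streams x while keeping the partial sums y in shared memory):

```c
#include <stdint.h>

#define K 5   /* filter size, e.g. 5x5 (illustrative) */

/* y(i,j) = y0(i,j) + (W * x)(i,j): accumulate one filter's contribution
 * into a partial output feature map held in shared memory.             */
void conv_accumulate(const int16_t *x, const int16_t *W, int32_t *y,
                     int width, int height)
{
    int out_w = width - K + 1;
    for (int i = 0; i < height - K + 1; i++)
        for (int j = 0; j < out_w; j++) {
            int32_t acc = y[i * out_w + j];            /* y0: previous partial sum */
            for (int ki = 0; ki < K; ki++)
                for (int kj = 0; kj < K; kj++)
                    acc += (int32_t)W[ki * K + kj] * x[(i + ki) * width + (j + kj)];
            y[i * out_w + j] = acc;                    /* accumulate back */
        }
}
```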

08/09/17 F.Conti @ IWES 2017 26

SLIDE 27

The Fulmine PULP cluster for Secure Smart Analytics

[HWCE block diagram: memory muxing toward the tightly-coupled data memory interconnect feeds MEM2STREAM units (input x_in and partial sums y_in[0..3]) and STREAM2MEM units (outputs y_out[0..3]); a line buffer builds the x window and a weight buffer holds W, both feeding the sum-of-products unit and its reduction tree; a controller attached to the peripheral interconnect configures the engine.]

08/09/17 F.Conti @ IWES 2017 27

SLIDE 28

The Fulmine PULP cluster for Secure Smart Analytics

[HWCE sum-of-products datapath detail: the weights are consumed as 4-bit slices (Wbits[3:0], [7:4], [11:8], [15:12]) multiplied against the x window in four pipelined rows; partial results are recombined through << 4 and << 8 shifts and adders in the reduction tree (internal widths growing from 20 up to 43-44 bits), added to the incoming partial sums y_in (pre-aligned by << QF), and renormalized by >> QF with saturation to produce y_out[0..3].]
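The ">> QF & SAT" and "<< QF" boxes correspond to fixed-point renormalization; a minimal software sketch of the equivalent operations is shown below (QF is the number of fractional bits; 16-bit signed samples are assumed for illustration):

```c
#include <stdint.h>

/* Renormalize a wide accumulator back to 16-bit fixed point:
 * shift right by QF fractional bits, then saturate (">> QF & SAT"). */
static inline int16_t requant_sat(int32_t acc, int qf)
{
    int32_t v = acc >> qf;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}

/* The "<< QF" path re-aligns an incoming partial sum y_in before accumulation. */
static inline int32_t realign(int16_t y_in, int qf)
{
    return (int32_t)y_in << qf;
}
```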

SLIDE 29

The Fulmine PULP cluster for Secure Smart Analytics

[Fulmine cluster diagram, as above.]

08/09/17 F.Conti @ IWES 2017 29

SLIDE 30

The Fulmine PULP cluster for Secure Smart Analytics

[Fulmine cluster diagram, as above, zooming into the HWCRYPT: memory muxing toward the tightly-coupled data memory interconnect (memory master and slave ports), a controller with a command queue on the peripheral interconnect, and two datapaths: an AES engine and a sponge engine.]

08/09/17 F.Conti @ IWES 2017 30

SLIDE 31

Fulmine SoC performance and power envelope

[Plot: SoC power (mW) versus supply voltage (0.75 V to 1.20 V) at operating frequencies of 120, 200, 280, 320 and 400 MHz. Highlighted operating points: ~15 mW at 100 MHz and ~140 mW at 400 MHz.]

08/09/17 F.Conti @ IWES 2017 31

SLIDE 32

State-of-the-Art comparison

CRYPTOGRAPHY

| | OUR WORK | Zhang et al. [2] VLSI'16 | Mathew et al. [1] JSSC'15 @ 0.9V | Mathew et al. [1] JSSC'15 @ 0.43V |
|---|---|---|---|---|
| Technology | UMC 65nm LL 1P8M | TSMC 40nm | Intel 22nm | Intel 22nm |
| Operating Point | 0.8V, 84 MHz | 0.9V, 1.3 GHz | 0.9V, 1.13 GHz | 0.43V, 324 MHz |
| Area | 5.75 mm2 (SoC), 0.56 mm2 (HWCRYPT) | 0.42 mm2 (AES) | 0.19 mm2 (AES) | 0.19 mm2 (AES) |
| Power | 27 mW (SoC) | 4.39 mW (AES) | 13 mW (AES) | 428 µW (AES) |
| Performance | 1.76 Gbit/s | 0.446 Gbit/s | 0.432 Gbit/s | 0.124 Gbit/s |
| Energy Efficiency | 65.2 Gbit/s/W | 113 Gbit/s/W | 33.2 Gbit/s/W | 289 Gbit/s/W |
| Supported Schemes | AES-XTS, AES-ECB, Keccak-f400, LR masking/shuffling | AES-ECB | AES-ECB | AES-ECB |

CNN

| | OUR WORK | Eyeriss [3] ISSCC'16 | Sim et al. [4] ISSCC'16 |
|---|---|---|---|
| Technology | UMC 65nm LL 1P8M | TSMC 65nm LP 1P9M | 65nm 1P8M |
| Operating Point | 0.8V, 84 MHz | 1V, 200 MHz | 1.2V, 128 MHz |
| Area | 5.75 mm2 (SoC), 0.35 mm2 (HWCE) | 12.25 mm2 | 16 mm2 |
| Power | 14 mW (SoC) | 288 mW | 45 mW |
| Performance | 1.85/3.44/4.64 GMac/s | 21.4 GMac/s | 32 GMac/s |
| Energy Efficiency | 132/246/331 GMac/s/W | 74.3 GMac/s/W | 710 GMac/s/W |
| Arithmetic Precision | Fixed point 16/8/4x16 bits | Fixed point 16x16 bits | Fixed point 16x16/24x24 bits |
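The efficiency columns are simply performance divided by power; for example, for the operating points reported above:

```latex
\[
\frac{1.76\ \mathrm{Gbit/s}}{27\ \mathrm{mW}} \approx 65.2\ \mathrm{Gbit/s/W},
\qquad
\frac{3.44\ \mathrm{GMac/s}}{14\ \mathrm{mW}} \approx 246\ \mathrm{GMac/s/W}.
\]
```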

SLIDE 33

Example application: secure aerial surveillance

Fulmine silicon measurements for CONV, AES, DMA + datasheet values for COTS FRAM, Flash

  • ResNet-based CNN secured at the cluster boundary with AES encryption
  • An example application for a smart endnode mounting a Fulmine chip…

08/09/17 F.Conti @ IWES 2017 33

SLIDE 34

Thanks for your attention…
http://www.pulp-platform.org · GitHub: pulp-platform · pulp-info@list.ee.ethz.ch

08/09/17 F.Conti @ IWES 2017 34

SLIDE 35

BACKUP SLIDES

08/09/17 F.Conti @ IWES 2017 35

SLIDE 36

PULP: pJ/op Parallel ULP computing

pJ/op is traditionally the target of ASIC + super-small research µControllers

Parallel + Programmable + Heterogeneous ULP computing · 1–10 mW active power

[Stack diagram: Low-Power Silicon Technology · Processor & Hardware IPs · Virtualization Layer · Programming Model · Compiler Infrastructure]

08/09/17 F.Conti @ IWES 2017 36

SLIDE 37

An example: Secured AlexNet

08/09/17 F.Conti @ IWES 2017 37

SLIDE 38

An example: Secured AlexNet

08/09/17 F.Conti @ IWES 2017 38

SLIDE 39

An example: Secured AlexNet

08/09/17 F.Conti @ IWES 2017 39

SLIDE 40

Power Breakdown

08/09/17 F.Conti @ IWES 2017 40

SLIDE 41

Accelerating CNNs

[Figure: relative execution time of the layers of a scene-labeling CNN [Cavigelli et al., DAC 2015]; the network alternates convolution + activation (C1, C2, …) and max/avg pooling (P1, P2, …) stages applied to the input image.]

A convolutional layer is a set of convolve-accumulate loops:

$y(i, j) = y_0(i, j) + \left( W \ast x \right)(i, j)$

  • 1. Suitable for streaming implementation
  • 2. Can use shared memory for intermediate results (i.e. accumulation)
  • 3. Target one case in HW, but manage all by SW

08/09/17 F.Conti @ IWES 2017 41

SLIDE 42

Hardware Convolution Engine

  • 1. Perform convolve-accumulate in streaming fashion
  • 2. Decouple the streaming domain from the shared-memory domain; convert streams into 3D-strided memory accesses (see the address-generation sketch below)
  • 3. Allow "jobs" to be offloaded to the HWCE by regular SW cores
  • 4. Weights for each convolution filter are stored privately (weight unit)
  • 5. Fine-grain clock gating to minimize dynamic power

[HWCE wrapper diagram: source/sink stream units for x_in, y_in, y_out and a weight unit connect through a MUX/DEMUX to TCDM ports 0–3 toward shared memory; an HWPE slave module with a register file provides the configuration interface on the peripheral side and signals job completion with an event.]
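A minimal sketch of the 3D-strided address generation performed by the source/sink units (field names are illustrative, not the HWCE register layout):

```c
#include <stdint.h>

/* Illustrative 3D-strided address generator, as used by stream source/sink
 * units to turn a linear stream index into shared-memory accesses.        */
typedef struct {
    uint32_t base;                       /* start address in shared memory */
    uint32_t len0, len1;                 /* extent of the two inner dims   */
    uint32_t stride0, stride1, stride2;  /* byte stride of each dimension  */
} addrgen_3d_t;

static inline uint32_t addrgen_3d(const addrgen_3d_t *cfg, uint32_t idx)
{
    uint32_t i0 =  idx % cfg->len0;                 /* innermost dimension */
    uint32_t i1 = (idx / cfg->len0) % cfg->len1;
    uint32_t i2 =  idx / (cfg->len0 * cfg->len1);   /* outermost dimension */
    return cfg->base + i0 * cfg->stride0
                     + i1 * cfg->stride1
                     + i2 * cfg->stride2;
}
```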

08/09/17 F.Conti @ IWES 2017 42

SLIDE 43

But how to map full CNNs on PULP?

[Diagram: CNN pipeline: input image → C1 (convolution + activation) → P1 (max/avg pooling) → C2 → P2 → …]

08/09/17 F.Conti @ IWES 2017 43

SLIDE 44

But how to map full CNNs on PULP?

[Diagram: two cores with a shared L1 memory, connected to a higher-level memory (e.g. L3) holding the input image and the C1 convolution + activation layer.]

Higher-level memory (e.g. L3): relatively low bandwidth, high latency, high access energy, but big or very big. Shared L1 memory: high bandwidth, low latency, low access energy, but very small.

Essentially, a problem of optimizing data exchange

  • 1. Maximize data reuse
  • 2. Avoid unneeded transfers back and forth

08/09/17 F.Conti @ IWES 2017 44

SLIDE 45

Mapping CNNs on PULP

[Diagram: cores with shared L1 memory and a higher-level memory (e.g. L3), with the convolve-accumulate operation shown on the L3 side.]

  • 1. Copy all weights for the current layer from L3 ➞ L1

08/09/17 F.Conti @ IWES 2017 45

SLIDE 46

Mapping CNNs on PULP

[Diagram: as above.]

  • 2. Copy input tile 0 (a stripe of N input features) from L3 ➞ L1

08/09/17 F.Conti @ IWES 2017 46

SLIDE 47

Mapping CNNs on PULP

[Diagram: as above.]

  • 3. Copy input tile 1 from L3 ➞ L1 while computing on tile 0 (double buffering)

08/09/17 F.Conti @ IWES 2017 47

SLIDE 48

Mapping CNNs on PULP

[Diagram: as above.]

  • 4. When computation on a tile is complete for the given layer, write it back to L3
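Putting steps 1-4 together, the per-layer schedule is a classic double-buffered pipeline. A sketch is shown below under the assumption of a blocking compute call and an asynchronous DMA; dma_copy_async, dma_wait and compute_tile are placeholders, not the actual PULP DMA driver:

```c
#include <stdint.h>

/* Hypothetical asynchronous DMA helpers (placeholders for the real driver). */
extern int  dma_copy_async(void *dst, const void *src, uint32_t bytes); /* returns a transfer id */
extern void dma_wait(int id);

extern void compute_tile(const int16_t *in_l1, int16_t *out_l1,
                         const int16_t *weights_l1, int tile);

void run_layer(const int16_t *in_l3, int16_t *out_l3, const int16_t *w_l3,
               int n_tiles, uint32_t tile_bytes, uint32_t w_bytes,
               int16_t *l1_in[2], int16_t *l1_out[2], int16_t *l1_w)
{
    /* 1. weights for the whole layer are copied to L1 once */
    dma_wait(dma_copy_async(l1_w, w_l3, w_bytes));

    /* 2. prefetch input tile 0 */
    int pending = dma_copy_async(l1_in[0], in_l3, tile_bytes);

    for (int t = 0; t < n_tiles; t++) {
        int buf = t & 1;
        dma_wait(pending);                              /* tile t is now in L1 */

        if (t + 1 < n_tiles)                            /* 3. prefetch tile t+1 ... */
            pending = dma_copy_async(l1_in[!buf],
                                     (const uint8_t *)in_l3 + (t + 1) * tile_bytes,
                                     tile_bytes);

        compute_tile(l1_in[buf], l1_out[buf], l1_w, t); /* ... while computing on tile t */

        /* 4. write the finished tile back to L3 (kept blocking for simplicity) */
        dma_wait(dma_copy_async((uint8_t *)out_l3 + t * tile_bytes,
                                l1_out[buf], tile_bytes));
    }
}
```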

08/09/17 F.Conti @ IWES 2017 48

SLIDE 50

Tiling

08/09/17 F.Conti @ IWES 2017 50

SLIDE 51

Tiling

08/09/17 F.Conti @ IWES 2017 51

SLIDE 52

Tiling on Input Features

[Diagram: the input feature maps are split into N_i tiles (1…6 shown). For each input tile: fetch the input tile, fetch the partially computed output tiles and the weights, accumulate, then store the partially computed output tiles back. The sequence is repeated N_i times.]

08/09/17 F.Conti @ IWES 2017 52

SLIDE 53

Tiling on Output Features

[Diagram: the output feature maps are split into N_o tiles. For each output tile: fetch the input tiles and the weights for that tile, accumulate, then store the finished output tile. The sequence is repeated N_o times.]
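Combining the two tilings gives a loop nest over output-feature and input-feature tiles, re-fetching partial outputs only when the input dimension is tiled. A schematic sketch is shown below; the fetch_*, store_* and accumulate_tile helpers are hypothetical placeholders for DMA transfers and the HWCE/SW kernel:

```c
/* Placeholders for DMA transfers between L3 and the shared L1, and for the kernel. */
extern void fetch_input_tile(int ti, int tile_in);
extern void fetch_weights(int to, int ti, int tile_out, int tile_in);
extern void fetch_partial_outputs(int to, int tile_out);
extern void store_partial_outputs(int to, int tile_out);
extern void accumulate_tile(int to, int ti, int tile_out, int tile_in);

/* Schematic tiling loop over input- and output-feature tiles. */
void conv_layer_tiled(int n_in, int n_out, int tile_in, int tile_out)
{
    for (int to = 0; to < n_out; to += tile_out) {          /* output-feature tiles */
        for (int ti = 0; ti < n_in; ti += tile_in) {         /* input-feature tiles  */
            fetch_input_tile(ti, tile_in);                    /* x[ti .. ti+tile_in)  */
            fetch_weights(to, ti, tile_out, tile_in);         /* W block for (to, ti) */
            if (ti > 0)
                fetch_partial_outputs(to, tile_out);          /* resume accumulation  */
            accumulate_tile(to, ti, tile_out, tile_in);       /* y += W * x           */
            store_partial_outputs(to, tile_out);              /* spill partial sums   */
        }
        /* after the last input tile, the stored outputs for this tile are final */
    }
}
```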

08/09/17 F.Conti @ IWES 2017 53