Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing


SLIDE 1

1 Energy Efficient Embedded Systems Laboratory · 2 Integrated Systems Laboratory

Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing

2nd Italian Workshop on Embedded Systems, Università degli Studi di Roma “La Sapienza”

8.9.2017

Francesco Conti 1,2, Davide Rossi 1, Luca Benini 1,2

f.conti@unibo.it

SLIDE 2

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 2

Battery + Harvesting powered → a few mW power envelope

SLIDE 3

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 3

Battery + Harvesting powered → a few mW power envelope

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 4

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 4

Battery + Harvesting powered → a few mW power envelope

Analyze and Classify

µController + IOs: 1 ÷ 25 MOPS, 1 ÷ 10 mW

e.g. Cortex-M

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 5

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 5

Battery + Harvesting powered → a few mW power envelope

Long range, low BW · short range, medium BW · low-rate (periodic) data · SW updates, commands

Transmit

Idle: ~1 µW · Active: ~50 mW

Analyze and Classify

µController + IOs: 1 ÷ 25 MOPS, 1 ÷ 10 mW

e.g. Cortex-M

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 6

The Road to Efficiency

[Figure: maximum frequency (MHz), energy efficiency (Gop/s/W), total power (W) and active leakage power (mW) versus supply voltage (V), 65 nm CMOS at 50 °C; the subthreshold region begins around 320 mV.]

Adapted from Borkar and Chien, The Future of Microprocessors, Communications of the ACM, May 2011

Near-threshold operation under a performance constraint naturally calls for parallel computing: lowering the supply toward the threshold voltage improves energy per operation but reduces the maximum frequency, and running several cores in parallel recovers the lost throughput. Parallel computing is particularly attractive for analytics workloads, which often expose natural parallelism, and is naturally coupled with near-threshold computing.
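As a rough worked example of why the two techniques combine well (illustrative numbers, not measurements from the figure): suppose moving to near-threshold cuts the maximum frequency by about 10x while improving energy per operation by about 5x; parallelism then buys the frequency back.

```latex
% Illustrative first-order model (not measured data): near-threshold operation
% trades frequency for energy efficiency, and parallelism recovers throughput.
\[
\text{Throughput} \approx N \cdot f_{\mathrm{NT}}, \qquad
\frac{E_{\mathrm{op,NT}}}{E_{\mathrm{op,nom}}} \approx \tfrac{1}{5}
\]
\[
\text{e.g. } f_{\mathrm{NT}} = \frac{f_{\mathrm{nom}}}{10},\; N = 10
\;\Rightarrow\;
\text{same throughput at } \sim 5\times \text{ better GOPS/W.}
\]
```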

08/09/17 F.Conti @ IWES 2017 6

SLIDE 7

Computing for the Internet of Things

08/09/17 F.Conti @ IWES 2017 7

Battery + Harvesting powered → a few mW power envelope

Long range, low BW · short range, medium BW · low-rate (periodic) data · SW updates, commands

Transmit

Idle: ~1 µW · Active: ~50 mW

Analyze and Classify

µController + IOs: 1 ÷ 25 MOPS, 1 ÷ 10 mW (e.g. Cortex-M); target with an L2 memory: 1 ÷ 2000 MOPS within the same 1 ÷ 10 mW

Sense

MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT: 100 µW ÷ 2 mW

SLIDE 9

PULP architecture outline

[Cluster diagram: cores #1…#N, each with a private instruction cache, connected through a logarithmic interconnect to multi-banked tightly-coupled data memory (TCDM).]

Parallel access to shared memory → flexibility

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.
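To illustrate the programming model, a minimal sketch of a data-parallel kernel on the cluster is shown below; the core-ID and barrier helpers (core_id, cluster_barrier) and NUM_CORES are hypothetical stand-ins for the PULP runtime, not its actual API:

```c
#include <stdint.h>

#define NUM_CORES 4

/* Hypothetical runtime hooks: any PULP-like runtime provides equivalents. */
extern int  core_id(void);          /* index of the calling core, 0..NUM_CORES-1 */
extern void cluster_barrier(void);  /* barrier backed by the HW synchronizer      */

/* Scale a vector held in the shared L1 TCDM, one interleaved slice per core. */
void scale_parallel(int32_t *buf, int n, int32_t gain)
{
    for (int i = core_id(); i < n; i += NUM_CORES)
        buf[i] = (buf[i] * gain) >> 8;   /* fixed-point multiply, Q8 gain */

    cluster_barrier();                   /* all cores see the result afterwards */
}
```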

08/09/17 F.Conti @ IWES 2017 9

SLIDE 10

PULP architecture outline

[Cluster diagram: the per-core instruction caches are replaced by a shared instruction cache with per-core L0 fetch buffers; cores still reach the TCDM banks through the logarithmic interconnect.]

Shared I$ + L0 fetch buffers → efficiency

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 10

SLIDE 11

PULP architecture outline

[Cluster diagram: as above, with the TCDM banks implemented as hybrid SRAM + standard-cell memory (SCM).]

Hybrid memory (SRAM + SCM) → can work at very low Vdd

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 11

SLIDE 12

PULP architecture outline

[Cluster diagram: as above, with a hardware synchronizer added.]

HW synchronizer → faster core shutdown + parallelism

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 12

SLIDE 13

PULP architecture outline

[Cluster diagram: as above.]

Fine-grain clock gating + body biasing → less power

Parallel Ultra Low Power in a nutshell: energy efficiency for the IoT through

  • near-threshold ULP execution
  • parallel computing
  • architecture targeted at low power

Targeting 100-1000 GOPS/W (>100x the efficiency of current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners.

08/09/17 F.Conti @ IWES 2017 13

SLIDE 14

PULP architecture outline

[SoC diagram: the cluster (cores, shared instruction cache with L0 buffers, SRAM/SCM TCDM banks, HW synchronizer) gains a DMA, an instruction bus and a bus adapter, and connects over the cluster bus to an L2 memory and QSPI master/slave interfaces.]

Add infrastructure to access off-cluster memory

08/09/17 F.Conti @ IWES 2017 14

SLIDE 15

How to get even more efficient?

[Figure: maximum frequency (MHz), energy efficiency (Gop/s/W), total power (W) and active leakage power (mW) versus supply voltage (V), 65 nm CMOS at 50 °C; the subthreshold region begins around 320 mV.]

Adapted from Borkar and Chien, The Future of Microprocessors, Communications of the ACM, May 2011

Beyond parallel computing: heterogeneous computing.

08/09/17 F.Conti @ IWES 2017 15

SLIDE 16

HW Acceleration in Tightly-Coupled Clusters

A host processor outside the cluster

[Diagram: PULP cluster (cores #1…#N with instruction caches, logarithmic interconnect, TCDM memory banks) connected through a cluster interface, DMA, instruction bus and bus adapter to the cluster bus, an L2 memory and the host processor.]

08/09/17 F.Conti @ IWES 2017 16

SLIDE 17

HW Acceleration in Tightly-Coupled Clusters

HW Processing Engines inside the cluster

[Diagram: as above, with a HW Processing Engine added inside the cluster, sharing the TCDM banks with the cores through the logarithmic interconnect.]

08/09/17 F.Conti @ IWES 2017 17

SLIDE 18

HW Acceleration in Tightly-Coupled Clusters

[Diagram: complete picture: cores #1…#N and the HW Processing Engine share the multi-banked TCDM through the logarithmic interconnect; the cluster interface, DMA, instruction bus and bus adapter connect to the cluster bus, L2 memory and host processor.]

08/09/17 F.Conti @ IWES 2017 18

SLIDE 19

HW Processing Engines

  • On the data plane, cores see HWPEs as a set of additional "virtual" cores (Core #N+1, #N+2, #N+3, …) accessing the same shared memory as the SW cores.
  • On the control plane, cores control HWPEs as a memory-mapped peripheral, much like a "virtual" DMA engine.
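To make the control-plane view concrete, a minimal sketch of an offload sequence is shown below; the base address, register offsets and event-wait primitive are hypothetical placeholders, not the actual PULP/Fulmine memory map:

```c
#include <stdint.h>

/* Hypothetical memory map: in a real system these come from the SoC headers. */
#define HWPE_BASE        0x10201000u
#define HWPE_REG_SRC     (*(volatile uint32_t *)(HWPE_BASE + 0x00))
#define HWPE_REG_DST     (*(volatile uint32_t *)(HWPE_BASE + 0x04))
#define HWPE_REG_LEN     (*(volatile uint32_t *)(HWPE_BASE + 0x08))
#define HWPE_REG_TRIGGER (*(volatile uint32_t *)(HWPE_BASE + 0x0C))

extern void wait_for_hwpe_event(void);  /* hypothetical: sleep until the HWPE raises its done event */

/* Offload one job, exactly as a core would program a DMA transfer. */
void hwpe_run_job(uint32_t src_l1, uint32_t dst_l1, uint32_t len)
{
    HWPE_REG_SRC     = src_l1;   /* operands live in the shared L1 TCDM */
    HWPE_REG_DST     = dst_l1;
    HWPE_REG_LEN     = len;
    HWPE_REG_TRIGGER = 1;        /* start the job */

    wait_for_hwpe_event();       /* the core can be clock-gated meanwhile */
}
```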

08/09/17 F.Conti @ IWES 2017 19

SLIDE 20

HW Processing Engines

[Diagram: HWPEs attach to the L1 shared memory through the TCDM logarithmic interconnect (data plane) and are controlled by the cores through the peripheral interconnect (control plane).]

  • On the data plane, cores see HWPEs as a set of additional "virtual" cores.
  • On the control plane, cores control HWPEs as a memory-mapped peripheral (e.g. like a DMA engine).

08/09/17 F.Conti @ IWES 2017 20

SLIDE 21

HW Processing Engines

[Diagram: the HWPE wrapper contains address translation toward the TCDM logarithmic interconnect (data plane) and a register file + control logic reached through the peripheral interconnect (control plane).]

  • On the data plane, cores see HWPEs as a set of additional "virtual" cores.
  • On the control plane, cores control HWPEs as a memory-mapped peripheral (e.g. like a DMA engine).

08/09/17 F.Conti @ IWES 2017 21

SLIDE 22

PULP: a busy silicon schedule 2013-2???

  • ST 28nm FDSOI: PULP1, PULP2, PULP3 (on board)
  • UMC 65nm: Artemis, Hecate, Selene, Diana (FPU); Mia Wallace – full system (on board); Imperio – PULPino chip (on board); Fulmine – secure smart analytics (on board); Patronus – tiny cores (taped out)
  • GF 28nm: Honey Bunny – first RISC-V based (on board)
  • GF 22nm: Ariane – RISC-V 64-bit core (under development); Quentin – second-gen PULPino MCU (under development)
  • UMC 180nm: Sir10us, Or10n
  • SMIC 130nm: VivoSoC, VivoSoC2 (on board)
  • ALP 180nm: Diego, Manny
  • TSMC 40nm: Mr. Wolf (taping out)

08/09/17 F.Conti @ IWES 2017 22

SLIDE 24

The Fulmine System-on-Chip

Fulmine SoC:

  • UMC 65nm technology
  • 6.86 mm2
  • 4 cores, 2 accelerators
  • HWCE for 3D conv layers
  • HWCRYPT for AES
  • DSP-optimized cores
  • 64 kB of L1, 192 kB of L2
  • uDMA for I/O with no SW intervention

  • QSPI master/slave
  • I2C
  • I2S
  • UART

08/09/17 F.Conti @ IWES 2017 24

SLIDE 25

The Fulmine PULP cluster for Secure Smart Analytics

[Fulmine cluster diagram: cores #1…#N with shared instruction cache and per-core L0 buffers, HW synchronizer, DMA and SRAM banks on the cluster interconnect, plus two tightly-coupled accelerators: the Hardware Encryption Engine (HWCRYPT) and the Hardware Convolution Engine (HWCE).]

08/09/17 F.Conti @ IWES 2017 25

SLIDE 26

HWPEs for CNNs?

[Figure: relative execution time of the layers of a scene-labeling CNN [Cavigelli et al., DAC 2015]; the network alternates convolution + activation (C1, C2, …) and max/avg pooling (P1, P2, …) stages applied to the input image.]

A convolutional layer is a set of convolve-accumulate loops:

$y(i, j) = y_0(i, j) + \left( W \ast x \right)(i, j)$

  • 1. Suitable for streaming implementation
  • 2. Can use shared memory for intermediate results (i.e. accumulation)
  • 3. Target one case in HW, but manage all by SW
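In plain C, the convolve-accumulate loop above looks roughly as follows (a sketch only: K, the data layout and the fixed-point formats are illustrative, and the HWCE streams x while keeping the partial sums y in shared memory):

```c
#include <stdint.h>

#define K 5   /* filter size, e.g. 5x5 (illustrative) */

/* y(i,j) = y0(i,j) + (W * x)(i,j): accumulate one filter's contribution
 * into a partial output feature map held in shared memory.             */
void conv_accumulate(const int16_t *x, const int16_t *W, int32_t *y,
                     int width, int height)
{
    int out_w = width - K + 1;
    for (int i = 0; i < height - K + 1; i++)
        for (int j = 0; j < out_w; j++) {
            int32_t acc = y[i * out_w + j];            /* y0: previous partial sum */
            for (int ki = 0; ki < K; ki++)
                for (int kj = 0; kj < K; kj++)
                    acc += (int32_t)W[ki * K + kj] * x[(i + ki) * width + (j + kj)];
            y[i * out_w + j] = acc;                    /* accumulate back */
        }
}
```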

08/09/17 F.Conti @ IWES 2017 26

SLIDE 27

The Fulmine PULP cluster for Secure Smart Analytics

[HWCE block diagram: memory muxing toward the tightly-coupled data memory interconnect feeds MEM2STREAM units (input x_in and partial sums y_in[0..3]) and STREAM2MEM units (outputs y_out[0..3]); a line buffer builds the x window and a weight buffer holds W, both feeding the sum-of-products unit and its reduction tree; a controller attached to the peripheral interconnect configures the engine.]

08/09/17 F.Conti @ IWES 2017 27

SLIDE 28

The Fulmine PULP cluster for Secure Smart Analytics

[HWCE sum-of-products datapath detail: the weights are consumed as 4-bit slices (Wbits[3:0], [7:4], [11:8], [15:12]) multiplied against the x window in four pipelined rows; partial results are recombined through << 4 and << 8 shifts and adders in the reduction tree (internal widths growing from 20 up to 43-44 bits), added to the incoming partial sums y_in (pre-aligned by << QF), and renormalized by >> QF with saturation to produce y_out[0..3].]
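The ">> QF & SAT" and "<< QF" boxes correspond to fixed-point renormalization; a minimal software sketch of the equivalent operations is shown below (QF is the number of fractional bits; 16-bit signed samples are assumed for illustration):

```c
#include <stdint.h>

/* Renormalize a wide accumulator back to 16-bit fixed point:
 * shift right by QF fractional bits, then saturate (">> QF & SAT"). */
static inline int16_t requant_sat(int32_t acc, int qf)
{
    int32_t v = acc >> qf;
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (int16_t)v;
}

/* The "<< QF" path re-aligns an incoming partial sum y_in before accumulation. */
static inline int32_t realign(int16_t y_in, int qf)
{
    return (int32_t)y_in << qf;
}
```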

SLIDE 29

The Fulmine PULP cluster for Secure Smart Analytics

[Fulmine cluster diagram, as above.]

08/09/17 F.Conti @ IWES 2017 29

SLIDE 30

The Fulmine PULP cluster for Secure Smart Analytics

[Fulmine cluster diagram, as above, zooming into the HWCRYPT: memory muxing toward the tightly-coupled data memory interconnect (memory master and slave ports), a controller with a command queue on the peripheral interconnect, and two datapaths: an AES engine and a sponge engine.]

08/09/17 F.Conti @ IWES 2017 30

SLIDE 31

Fulmine SoC performance and power envelope

[Plot: SoC power (mW) versus supply voltage (0.75 V to 1.20 V) at operating frequencies of 120, 200, 280, 320 and 400 MHz. Highlighted operating points: ~15 mW at 100 MHz and ~140 mW at 400 MHz.]

08/09/17 F.Conti @ IWES 2017 31

SLIDE 32

State-of-the-Art comparison

CRYPTOGRAPHY

| | OUR WORK | Zhang et al. [2] VLSI'16 | Mathew et al. [1] JSSC'15 @ 0.9V | Mathew et al. [1] JSSC'15 @ 0.43V |
|---|---|---|---|---|
| Technology | UMC 65nm LL 1P8M | TSMC 40nm | Intel 22nm | Intel 22nm |
| Operating Point | 0.8V, 84 MHz | 0.9V, 1.3 GHz | 0.9V, 1.13 GHz | 0.43V, 324 MHz |
| Area | 5.75 mm2 (SoC), 0.56 mm2 (HWCRYPT) | 0.42 mm2 (AES) | 0.19 mm2 (AES) | 0.19 mm2 (AES) |
| Power | 27 mW (SoC) | 4.39 mW (AES) | 13 mW (AES) | 428 µW (AES) |
| Performance | 1.76 Gbit/s | 0.446 Gbit/s | 0.432 Gbit/s | 0.124 Gbit/s |
| Energy Efficiency | 65.2 Gbit/s/W | 113 Gbit/s/W | 33.2 Gbit/s/W | 289 Gbit/s/W |
| Supported Schemes | AES-XTS, AES-ECB, Keccak-f400, LR masking/shuffling | AES-ECB | AES-ECB | AES-ECB |

CNN

| | OUR WORK | Eyeriss [3] ISSCC'16 | Sim et al. [4] ISSCC'16 |
|---|---|---|---|
| Technology | UMC 65nm LL 1P8M | TSMC 65nm LP 1P9M | 65nm 1P8M |
| Operating Point | 0.8V, 84 MHz | 1V, 200 MHz | 1.2V, 128 MHz |
| Area | 5.75 mm2 (SoC), 0.35 mm2 (HWCE) | 12.25 mm2 | 16 mm2 |
| Power | 14 mW (SoC) | 288 mW | 45 mW |
| Performance | 1.85/3.44/4.64 GMac/s | 21.4 GMac/s | 32 GMac/s |
| Energy Efficiency | 132/246/331 GMac/s/W | 74.3 GMac/s/W | 710 GMac/s/W |
| Arithmetic Precision | Fixed point 16/8/4x16 bits | Fixed point 16x16 bits | Fixed point 16x16/24x24 bits |
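The efficiency columns are simply performance divided by power; for example, for the operating points reported above:

```latex
\[
\frac{1.76\ \mathrm{Gbit/s}}{27\ \mathrm{mW}} \approx 65.2\ \mathrm{Gbit/s/W},
\qquad
\frac{3.44\ \mathrm{GMac/s}}{14\ \mathrm{mW}} \approx 246\ \mathrm{GMac/s/W}.
\]
```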

SLIDE 33

Example application: secure aerial surveillance

Fulmine silicon measurements for CONV, AES, DMA + datasheet values for COTS FRAM, Flash

  • ResNet-based CNN secured at the cluster boundary with AES encryption
  • An example application for a smart endnode mounting a Fulmine chip…

08/09/17 F.Conti @ IWES 2017 33

SLIDE 34

Thanks for your attention…
http://www.pulp-platform.org · GitHub: pulp-platform · pulp-info@list.ee.ethz.ch

08/09/17 F.Conti @ IWES 2017 34

SLIDE 35

BACKUP SLIDES

08/09/17 F.Conti @ IWES 2017 35

SLIDE 36

PULP: pJ/op Parallel ULP computing

pJ/op is traditionally the target of ASIC + super-small research µControllers

Parallel + Programmable + Heterogeneous ULP computing · 1–10 mW active power

[Stack diagram: Low-Power Silicon Technology · Processor & Hardware IPs · Virtualization Layer · Programming Model · Compiler Infrastructure]

08/09/17 F.Conti @ IWES 2017 36

SLIDE 37

An example: Secured AlexNet

08/09/17 F.Conti @ IWES 2017 37

SLIDE 38

An example: Secured AlexNet

08/09/17 F.Conti @ IWES 2017 38

SLIDE 39

An example: Secured AlexNet

08/09/17 F.Conti @ IWES 2017 39

SLIDE 40

Power Breakdown

08/09/17 F.Conti @ IWES 2017 40

SLIDE 41

Accelerating CNNs

[Figure: relative execution time of the layers of a scene-labeling CNN [Cavigelli et al., DAC 2015]; the network alternates convolution + activation (C1, C2, …) and max/avg pooling (P1, P2, …) stages applied to the input image.]

A convolutional layer is a set of convolve-accumulate loops:

$y(i, j) = y_0(i, j) + \left( W \ast x \right)(i, j)$

  • 1. Suitable for streaming implementation
  • 2. Can use shared memory for intermediate results (i.e. accumulation)
  • 3. Target one case in HW, but manage all by SW

08/09/17 F.Conti @ IWES 2017 41

SLIDE 42

Hardware Convolution Engine

  • 1. Perform convolve-accumulate in streaming fashion
  • 2. Decouple the streaming domain from the shared-memory domain; convert streams into 3D-strided memory accesses (see the address-generation sketch below)
  • 3. Allow "jobs" to be offloaded to the HWCE by regular SW cores
  • 4. Weights for each convolution filter are stored privately (weight unit)
  • 5. Fine-grain clock gating to minimize dynamic power

[HWCE wrapper diagram: source/sink stream units for x_in, y_in, y_out and a weight unit connect through a MUX/DEMUX to TCDM ports 0–3 toward shared memory; an HWPE slave module with a register file provides the configuration interface on the peripheral side and signals job completion with an event.]
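A minimal sketch of the 3D-strided address generation performed by the source/sink units (field names are illustrative, not the HWCE register layout):

```c
#include <stdint.h>

/* Illustrative 3D-strided address generator, as used by stream source/sink
 * units to turn a linear stream index into shared-memory accesses.        */
typedef struct {
    uint32_t base;                       /* start address in shared memory */
    uint32_t len0, len1;                 /* extent of the two inner dims   */
    uint32_t stride0, stride1, stride2;  /* byte stride of each dimension  */
} addrgen_3d_t;

static inline uint32_t addrgen_3d(const addrgen_3d_t *cfg, uint32_t idx)
{
    uint32_t i0 =  idx % cfg->len0;                 /* innermost dimension */
    uint32_t i1 = (idx / cfg->len0) % cfg->len1;
    uint32_t i2 =  idx / (cfg->len0 * cfg->len1);   /* outermost dimension */
    return cfg->base + i0 * cfg->stride0
                     + i1 * cfg->stride1
                     + i2 * cfg->stride2;
}
```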

08/09/17 F.Conti @ IWES 2017 42

SLIDE 43

But how to map full CNNs on PULP?

[Diagram: CNN pipeline: input image → C1 (convolution + activation) → P1 (max/avg pooling) → C2 → P2 → …]

08/09/17 F.Conti @ IWES 2017 43

SLIDE 44

But how to map full CNNs on PULP?

[Diagram: two cores with a shared L1 memory, connected to a higher-level memory (e.g. L3) holding the input image and the C1 convolution + activation layer.]

Higher-level memory (e.g. L3): relatively low bandwidth, high latency, high access energy, but big or very big. Shared L1 memory: high bandwidth, low latency, low access energy, but very small.

Essentially, a problem of optimizing data exchange

  • 1. Maximize data reuse
  • 2. Avoid unneeded transfers back and forth

08/09/17 F.Conti @ IWES 2017 44

SLIDE 45

Mapping CNNs on PULP

[Diagram: cores with shared L1 memory and a higher-level memory (e.g. L3), with the convolve-accumulate operation shown on the L3 side.]

  • 1. Copy all weights for the current layer from L3 ➞ L1

08/09/17 F.Conti @ IWES 2017 45

SLIDE 46

Mapping CNNs on PULP

[Diagram: as above.]

  • 2. Copy input tile 0 (a stripe of N input features) from L3 ➞ L1

08/09/17 F.Conti @ IWES 2017 46

SLIDE 47

Mapping CNNs on PULP

[Diagram: as above.]

  • 3. Copy input tile 1 from L3 ➞ L1 while computing on tile 0 (double buffering)

08/09/17 F.Conti @ IWES 2017 47

SLIDE 48

Mapping CNNs on PULP

[Diagram: as above.]

  • 4. When computation on a tile is complete for the given layer, write it back to L3
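Putting steps 1-4 together, the per-layer schedule is a classic double-buffered pipeline. A sketch is shown below under the assumption of a blocking compute call and an asynchronous DMA; dma_copy_async, dma_wait and compute_tile are placeholders, not the actual PULP DMA driver:

```c
#include <stdint.h>

/* Hypothetical asynchronous DMA helpers (placeholders for the real driver). */
extern int  dma_copy_async(void *dst, const void *src, uint32_t bytes); /* returns a transfer id */
extern void dma_wait(int id);

extern void compute_tile(const int16_t *in_l1, int16_t *out_l1,
                         const int16_t *weights_l1, int tile);

void run_layer(const int16_t *in_l3, int16_t *out_l3, const int16_t *w_l3,
               int n_tiles, uint32_t tile_bytes, uint32_t w_bytes,
               int16_t *l1_in[2], int16_t *l1_out[2], int16_t *l1_w)
{
    /* 1. weights for the whole layer are copied to L1 once */
    dma_wait(dma_copy_async(l1_w, w_l3, w_bytes));

    /* 2. prefetch input tile 0 */
    int pending = dma_copy_async(l1_in[0], in_l3, tile_bytes);

    for (int t = 0; t < n_tiles; t++) {
        int buf = t & 1;
        dma_wait(pending);                              /* tile t is now in L1 */

        if (t + 1 < n_tiles)                            /* 3. prefetch tile t+1 ... */
            pending = dma_copy_async(l1_in[!buf],
                                     (const uint8_t *)in_l3 + (t + 1) * tile_bytes,
                                     tile_bytes);

        compute_tile(l1_in[buf], l1_out[buf], l1_w, t); /* ... while computing on tile t */

        /* 4. write the finished tile back to L3 (kept blocking for simplicity) */
        dma_wait(dma_copy_async((uint8_t *)out_l3 + t * tile_bytes,
                                l1_out[buf], tile_bytes));
    }
}
```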

08/09/17 F.Conti @ IWES 2017 48

SLIDE 50

Tiling

08/09/17 F.Conti @ IWES 2017 50

SLIDE 51

Tiling

08/09/17 F.Conti @ IWES 2017 51

SLIDE 52

Tiling on Input Features

[Diagram: the input feature maps are split into N_i tiles (1…6 shown). For each input tile: fetch the input tile, fetch the partially computed output tiles and the weights, accumulate, then store the partially computed output tiles back. The sequence is repeated N_i times.]

08/09/17 F.Conti @ IWES 2017 52

SLIDE 53

Tiling on Output Features

[Diagram: the output feature maps are split into N_o tiles. For each output tile: fetch the input tiles and the weights for that tile, accumulate, then store the finished output tile. The sequence is repeated N_o times.]
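Combining the two tilings gives a loop nest over output-feature and input-feature tiles, re-fetching partial outputs only when the input dimension is tiled. A schematic sketch is shown below; the fetch_*, store_* and accumulate_tile helpers are hypothetical placeholders for DMA transfers and the HWCE/SW kernel:

```c
/* Placeholders for DMA transfers between L3 and the shared L1, and for the kernel. */
extern void fetch_input_tile(int ti, int tile_in);
extern void fetch_weights(int to, int ti, int tile_out, int tile_in);
extern void fetch_partial_outputs(int to, int tile_out);
extern void store_partial_outputs(int to, int tile_out);
extern void accumulate_tile(int to, int ti, int tile_out, int tile_in);

/* Schematic tiling loop over input- and output-feature tiles. */
void conv_layer_tiled(int n_in, int n_out, int tile_in, int tile_out)
{
    for (int to = 0; to < n_out; to += tile_out) {          /* output-feature tiles */
        for (int ti = 0; ti < n_in; ti += tile_in) {         /* input-feature tiles  */
            fetch_input_tile(ti, tile_in);                    /* x[ti .. ti+tile_in)  */
            fetch_weights(to, ti, tile_out, tile_in);         /* W block for (to, ti) */
            if (ti > 0)
                fetch_partial_outputs(to, tile_out);          /* resume accumulation  */
            accumulate_tile(to, ti, tile_out, tile_in);       /* y += W * x           */
            store_partial_outputs(to, tile_out);              /* spill partial sums   */
        }
        /* after the last input tile, the stored outputs for this tile are final */
    }
}
```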

08/09/17 F.Conti @ IWES 2017 53