Energy-Efficient Deep Learning: Challenges and Opportunities
Contact Info email: sze@mit.edu website: www.rle.mit.edu/eems
Vivienne Sze
Massachusetts Institute of Technology. In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang
Example Applications of Deep Learning
Computer Vision Speech Recognition Game Play Medical
Image “Volvo XC90”
Image Source: [Lee et al., Comm. ACM 2011]
Image Source: Stanford
Y_j = activation( Σ_{i=1}^{3} W_ij × X_i )
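In code, that neuron computation is just a weighted sum followed by a nonlinearity. A minimal sketch (ReLU is an illustrative choice of activation; the weight and input values are made up):

```python
def neuron(weights, inputs, activation=lambda z: max(0.0, z)):
    """One neuron: Y_j = activation(sum_i W_ij * X_i). ReLU here is
    just an illustrative choice of nonlinearity."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

# Three inputs, matching the i = 1..3 sum above (values are made up).
print(neuron([0.5, -0.25, 0.1], [1.0, 2.0, 3.0]))
```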
– 350M images uploaded per day
– 2.5 Petabytes of data hourly
– 300 hours of video uploaded every minute
ImageNet: Large Scale Visual Recognition Challenge (ILSVRC)
[Chart: Top-5 error (%) from 2012 to 2015 for AlexNet, Clarifai, OverFeat, VGGNet, GoogLeNet, and ResNet, eventually surpassing human accuracy] [O. Russakovsky et al., IJCV 2015]
| Metrics | LeNet-5 | AlexNet | VGG-16 | GoogLeNet (v1) | ResNet-50 |
|---|---|---|---|---|---|
| Top-5 error | n/a | 16.4 | 7.4 | 6.7 | 5.3 |
| Input Size | 28x28 | 227x227 | 224x224 | 224x224 | 224x224 |
| # of CONV Layers | 2 | 5 | 16 | 21 (depth) | 49 |
| Filter Sizes | 5 | 3, 5, 11 | 3 | 1, 3, 5, 7 | 1, 3, 7 |
| # of Channels | 1, 6 | 3 – 256 | 3 – 512 | 3 – 1024 | 3 – 2048 |
| # of Filters | 6, 16 | 96 – 384 | 64 – 512 | 64 – 384 | 64 – 2048 |
| Stride | 1 | 1, 4 | 1 | 1, 2 | 1, 2 |
| # of CONV Weights | 2.6k | 2.3M | 14.7M | 6.0M | 23.5M |
| # of CONV MACs | 283k | 666M | 15.3G | 1.43G | 3.86G |
| # of FC Layers | 2 | 3 | 3 | 1 | 1 |
| # of FC Weights | 58k | 58.6M | 124M | 1M | 2M |
| # of FC MACs | 58k | 58.6M | 124M | 1M | 2M |
| Total Weights | 60k | 61M | 138M | 7M | 25.5M |
| Total MACs | 341k | 724M | 15.5G | 1.43G | 3.9G |
CONV Layers increasingly important!
Weights Large Datasets
Actuator
Image source: ericsson.com
Sensor Cloud
Image source: www.theregister.co.uk
– Evaluate hardware using the appropriate DNN model and dataset
– Support multiple applications – different weights
– Energy per operation – DRAM bandwidth
– GOPS, frame rate, delay
– Area (size of memory and # of cores)
Chip Computer Vision Speech Recognition
[Sze et al., CICC 2017] ImageNet MNIST
Convolution as matrix multiplication: the filter is flattened into a vector and the input fmap is rearranged into a Toeplitz matrix (w/ redundant data), so the output fmap is produced by a matrix-vector product. An FFT instead reduces the cost from O(No²Nf²) to O(No² log2 No).
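The Toeplitz rearrangement can be sketched in a few lines; this 1-D version (filter taps and input values are made up) shows that the matrix-vector product matches a direct sliding-window convolution:

```python
import numpy as np

def im2col_1d(x, k):
    """Stack sliding windows of x into rows: a Toeplitz-style matrix
    with redundant data, so convolution becomes a matrix product."""
    return np.array([x[i:i + k] for i in range(len(x) - k + 1)])

x = np.arange(1.0, 10.0)             # input fmap values 1..9
w = np.array([1.0, 2.0, 3.0, 4.0])   # a made-up 4-tap filter
y = im2col_1d(x, len(w)) @ w         # matrix-vector product
assert np.allclose(y, np.correlate(x, w, mode="valid"))
print(y)  # → [30. 40. 50. 60. 70. 80.]
```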
4 multiplications + 3 additions → 3 multiplications + 5 additions
Each MAC requires four memory accesses: read filter weight, read image pixel, read partial sum, and write the updated partial sum.
AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required in the worst case.
A DRAM access costs roughly 200× the energy of an ALU operation (1×).
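The 2896M figure follows directly from four accesses per MAC; a quick sanity check:

```python
# Worst case: every MAC reads a filter weight, an fmap pixel, and a
# partial sum, and writes the updated partial sum -> 4 DRAM accesses
# per MAC if there were no on-chip reuse at all.
macs = 724e6            # AlexNet total MACs (from the table above)
accesses_per_mac = 4
dram_accesses = macs * accesses_per_mac
print(f"{dram_accesses / 1e6:.0f}M DRAM accesses")  # → 2896M DRAM accesses
```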
Temporal Architecture (SIMD/SIMT): centralized control; a memory hierarchy and register file feed a grid of ALUs.
Spatial Architecture (Dataflow Processing): a memory hierarchy feeds a grid of ALUs, each with local control and its own register file (0.5 – 1.0 kB).
pixels weights partial sums
Yu-Hsin Chen, Joel Emer, Vivienne Sze, ISCA 2016
Data movement energy cost, normalized to one ALU operation. The accelerator consists of a global buffer and an array of processing engines (PE); each PE contains an ALU and a register file (RF):

| Data Movement | Normalized Energy Cost |
|---|---|
| Off-chip DRAM → ALU | 200× |
| Global Buffer → ALU | 6× |
| PE → ALU | 2× |
| RF → ALU | 1× |
| ALU (reference) | 1× |
Weight Stationary (WS) dataflow:
– Maximize convolutional and filter reuse of weights: each weight (W0 … W7) stays in a PE while pixels and psums flow through from the global buffer
– Examples: [Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]
Output Stationary (OS) dataflow:
– Maximize local accumulation: each partial sum (P0 … P7) stays in a PE while pixels and weights flow through from the global buffer
– Examples: [Gupta, ICML 2015] [ShiDianNao, ISCA 2015] [Peemen, ICCD 2013]
No Local Reuse (NLR) dataflow:
– Reduce DRAM access energy consumption: no local PE storage; pixels, weights, and psums all move through a large global buffer
– Examples: [DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]
Row Stationary (RS) dataflow [Chen, ISCA 2016] – 1-D row convolution within a PE:
– Keep a filter row and an image sliding window in the RF
– The filter row (a, b, c) slides across the input image row (a, b, c, d, e), and each partial sum is accumulated locally before the window advances
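A minimal sketch of the per-PE 1-D row convolution described above (filter and image values are made up):

```python
def row_conv(filter_row, image_row):
    """1-D row convolution as done inside one PE: the filter row stays
    resident while a sliding window of the image row is multiplied and
    accumulated into one partial sum per output position."""
    k = len(filter_row)
    return [sum(f * image_row[i + j] for j, f in enumerate(filter_row))
            for i in range(len(image_row) - k + 1)]

# Filter row (a, b, c) slid over image row (a, b, c, d, e), as above.
out = row_conv([1, 2, 3], [1, 2, 3, 4, 5])
print(out)  # → [14, 20, 26]
```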
Mapping rows across the PE array (3-row filter, 5-row image); each column of PEs accumulates one output row:

| PE | Filter Row | Image Row | Output Row |
|---|---|---|---|
| PE 1 | 1 | 1 | 1 |
| PE 2 | 2 | 2 | 1 |
| PE 3 | 3 | 3 | 1 |
| PE 4 | 1 | 2 | 2 |
| PE 5 | 2 | 3 | 2 |
| PE 6 | 3 | 4 | 2 |
| PE 7 | 1 | 3 | 3 |
| PE 8 | 2 | 4 | 3 |
| PE 9 | 3 | 5 | 3 |
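The row mapping above can be checked in code: each PE performs one 1-D row convolution, and summing a column of PEs yields one output row. A sketch with a 3×3 filter on a 5×5 image (values are made up):

```python
import numpy as np

def conv2d_by_rows(filt, image):
    """2-D convolution as a sum of 1-D row convolutions, mirroring the
    PE-array mapping above: PE (r, c) convolves filter row r with image
    row r + c; summing down each PE column gives output row c."""
    R = filt.shape[0]
    H = image.shape[0]
    out_rows = []
    for c in range(H - R + 1):                    # one PE column per output row
        acc = sum(np.correlate(image[c + r], filt[r], mode="valid")
                  for r in range(R))              # accumulate down the column
        out_rows.append(acc)
    return np.array(out_rows)

filt = np.arange(9.0).reshape(3, 3)
image = np.arange(25.0).reshape(5, 5)
# Cross-check against a direct 2-D sliding-window computation.
direct = np.array([[np.sum(filt * image[i:i+3, j:j+3]) for j in range(3)]
                   for i in range(3)])
assert np.allclose(conv2d_by_rows(filt, image), direct)
```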
– WS: minimize movement of filter weights
– OS: minimize movement of partial sums
– NLR: don't use any local PE storage; maximize global buffer size
Normalized Energy/MAC across dataflows [Chen, ISCA 2016]:
[Chart: normalized energy/MAC (0 – 2) for WS, OSA, OSB, OSC, NLR, and RS, broken down by storage level (ALU, RF, NoC, buffer, DRAM) and by data type (psums, weights, pixels); RS has the lowest total energy]
Eyeriss [Chen et al., ISSCC 2016]
[Diagram: chip top level – link clock / core clock domains, 108 KB global buffer, PE array with ReLU, and Filt / Img / Psum data paths]

| Technology | TSMC 65nm LP 1P9M |
|---|---|
| On-Chip Buffer | 108 KB |
| # of PEs | 168 |
| Scratch Pad / PE | 0.5 KB |
| Core Frequency | 100 – 250 MHz |
| Peak Performance | 33.6 – 84.0 GOPS |
| Word Bit-width | 16-bit Fixed-Point |
| Natively Supported CNN Shapes | Filter Width: 1 – 32, Filter Height: 1 – 12 |
| | Eyeriss | NVIDIA TK1 (Jetson Kit) |
|---|---|---|
| Technology | 65nm | 28nm |
| Clock Rate | 200 MHz | 852 MHz |
| # Multipliers | 168 | 192 |
| On-Chip Storage | Buffer: 108 KB, Spad: 75.3 KB | Shared Mem: 64 KB, Reg File: 256 KB |
| Word Bit-Width | 16b Fixed | 32b Float |
| Throughput¹ | 34.7 fps | 68 fps |
| Measured Power | 278 mW | Idle/Active²: 3.7 W / 10.2 W |
| DRAM Bandwidth | 127 MB/s | 1120 MB/s³ |
Image pipeline: pixels → feature extraction (handcrafted features, e.g. HOG, or learned features, e.g. DNN) → features (x) → classification (wᵀx with trained weights w) → scores per class (select class based on the scores)
[Chart: normalized energy (0 – 2) of HOG object detection vs. DPM object detection]
H.264/AVC Decoder H.264/AVC Encoder H.265/HEVC Decoder H.265/HEVC Encoder
[Suleiman et al., VLSI 2016]
[Chart: energy vs. accuracy, measured in 65nm* on the VOC 2007 dataset; accuracy improves linearly while energy grows exponentially, with video compression shown as a reference point] [Suleiman et al., ISCAS 2017]
* Only feature extraction. Does not include data augmentation, ensemble, and classification energy, etc.
NVIDIA's Pascal (2016), Google's TPU (2016)
– Binary Nets [Courbariaux, NIPS 2015]
– Ternary Weight Nets [Li, arXiv 2016]
– XNOR-Net [Rastegari, ECCV 2016]
– LogNet [Lee, ICASSP 2017]
Binary Filters
Log Domain Quantization
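A toy illustration of log-domain quantization: rounding each weight to the nearest power of two so that multiplications become bit shifts. This is only a sketch, not the exact scheme of any of the cited papers, and the exponent range is a made-up choice:

```python
import math

def log2_quantize(w, exp_min=-8, exp_max=0):
    """Round a weight to the nearest power of two (log domain).
    The clamped exponent range [exp_min, exp_max] is illustrative."""
    if w == 0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    exp = round(math.log2(abs(w)))       # nearest exponent
    exp = max(exp_min, min(exp_max, exp))
    return sign * 2.0 ** exp

print([log2_quantize(w) for w in [0.3, -0.4, 0.7, 0.05]])
# → [0.25, -0.5, 0.5, 0.0625]
```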
– Stripes [Judd et al., MICRO 2016]: bit-serial processing for speed
– KU Leuven [Moons et al., VLSI 2016]: voltage scaling for energy savings
– [BRein, VLSI 2017]
[Chart: # of activations vs. # of non-zero activations (normalized, 0 – 1) across CONV layers 1 – 5; a large fraction of activations are zero]
[Chart: DRAM access (MB) per AlexNet CONV layer (1 – 5), uncompressed fmaps + weights vs. RLE compressed fmaps + weights; compression saves 1.2×, 1.4×, 1.7×, 1.8×, and 1.9× per layer]
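Run-length coding of the zero runs in an activation stream is straightforward; a sketch (the 5-bit maximum run length of 31 is an illustrative assumption):

```python
def rle_zeros(activations, max_run=31):
    """Run-length encode a sparse activation stream as (zero_run, value)
    pairs. A run longer than max_run is flushed as (max_run, 0)."""
    out, run = [], 0
    for a in activations:
        if a == 0 and run < max_run:
            run += 1
        else:
            out.append((run, a))
            run = 0
    if run:
        out.append((run, 0))
    return out

print(rle_zeros([0, 0, 12, 0, 0, 0, 53, 0, 22]))
# → [(2, 12), (3, 53), (1, 22)]
```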
Pruning + retraining [LeCun et al., NIPS 1989]
[Han et al., NIPS 2015]
Example: AlexNet Weight Reduction: CONV layers 2.7x, FC layers 9.9x Overall Reduction: Weights 9x, MACs 3x
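A minimal sketch of the magnitude-based side of this approach (the retraining step, which the reported accuracies depend on, is omitted; the weight values are made up):

```python
import numpy as np

def magnitude_prune(weights, fraction):
    """Zero out the `fraction` of weights with the smallest magnitude.
    Ties at the threshold may prune slightly more than `fraction`."""
    k = int(fraction * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([0.3, 0.0, -0.4, 0.7, 0.0, 0.0, 0.1, 0.05])
print(magnitude_prune(w, 0.5))  # smallest-magnitude half pruned to zero
```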
Output Feature Map 43% Input Feature Map 25% Weights 22% Computa:on 10%
[Yang et al., CVPR 2017]
Inputs: CNN shape configuration (# of channels, # of filters, etc.), CNN weights and input data (e.g., [0.3, 0, -0.4, 0.7, 0, 0, 0.1, …]), and hardware energy costs of each MAC and memory access.
The tool calculates the # of MACs and the # of memory accesses at each level (mem. level 1 … n), combines them into E_comp and E_data via optimization, and outputs the CNN energy consumption per layer (L1, L2, L3, …).
Energy estimation tool available at http://eyeriss.mit.edu [Yang et al., CVPR 2017]
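The energy model reduces to a weighted sum; a sketch using the normalized access costs from the earlier data-movement slide (the per-level access counts here are made up):

```python
# Normalized energy per access at each memory level (from the earlier
# data-movement slide); the access counts used below are made up.
ENERGY_PER_ACCESS = {"RF": 1.0, "NoC": 2.0, "buffer": 6.0, "DRAM": 200.0}

def estimate_energy(n_macs, accesses, e_mac=1.0):
    """E = E_comp + E_data: #MACs x energy/MAC, plus accesses x
    energy/access summed over memory levels."""
    e_comp = n_macs * e_mac
    e_data = sum(n * ENERGY_PER_ACCESS[level] for level, n in accesses.items())
    return e_comp + e_data

# Hypothetical layer: 1M MACs, with most reuse captured in the RF.
print(estimate_energy(1e6, {"RF": 3e6, "buffer": 2e5, "DRAM": 1e4}))
# → 7200000.0
```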
[Yang et al., CVPR 2017]
[Chart: Top-5 accuracy (77% – 93%) vs. normalized energy consumption (5E+08 – 5E+10, log scale) for the original DNNs: AlexNet, SqueezeNet, GoogLeNet, ResNet-50, VGG-16]
[Chart: the same accuracy-vs-energy plot, adding AlexNet and SqueezeNet after magnitude-based pruning [6] [Han et al., NIPS 2015]]
[Yang et al., CVPR 2017]
[Chart: the same accuracy-vs-energy plot, adding AlexNet, SqueezeNet, and GoogLeNet after energy-aware pruning (this work): 1.74× lower energy than magnitude-based pruning]
NetAdapt [Yang et al., arXiv 2018], in collaboration with Google's Mobile Vision Team: adapt a pretrained network to a platform budget using empirical measurements on the target platform (A, B, C, D, … Z) rather than platform models.

| Metric | Budget |
|---|---|
| Latency | 3.8 |
| Energy | 10.5 |

Network proposals are measured empirically:

| Metric | Proposal A | … | Proposal Z |
|---|---|---|---|
| Latency | 15.6 | … | 14.3 |
| Energy | 41 | … | 46 |

Measure → select → iterate until the budget is met, yielding the adapted network.
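The adapt-measure-select loop can be sketched with toy stand-ins for the empirical measurements (the `accuracy_of` and `latency_of` functions below are invented proxies, not real measurements):

```python
def netadapt(widths, latency_budget, accuracy_of, latency_of):
    """Toy version of the loop: repeatedly propose shrinking one layer
    by one channel, score every proposal with the (stand-in)
    measurement functions, keep the best, and stop at the budget."""
    widths = list(widths)
    while latency_of(widths) > latency_budget:
        proposals = []
        for i in range(len(widths)):
            if widths[i] > 1:              # can't shrink a layer to zero
                p = list(widths)
                p[i] -= 1
                proposals.append(p)
        widths = max(proposals, key=accuracy_of)   # keep the best proposal
    return widths

# Invented proxies: latency ~ total channels; accuracy favors later layers.
latency_of = lambda w: sum(w)
accuracy_of = lambda w: sum(c * (i + 1) for i, c in enumerate(w))
print(netadapt([8, 8, 8], 20, accuracy_of, latency_of))  # → [4, 8, 8]
```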
+0.3% accuracy at 1.7× faster, and +0.3% accuracy at 1.6× faster (vs. the MobileNet and MorphNet baselines below)
Reference: MobileNet: Howard et al, “Mobilenets: Efficient convolutional neural networks for mobile vision applications”, arXiv 2017 MorphNet: Gordon et al., “Morphnet: Fast & simple resource-constrained structure learning of deep networks”, CVPR 2018 *Tested on the ImageNet dataset and a Google Pixel 1 CPU
SCNN [Parashar et al., ISCA 2017] – supports convolutional layers only:
– Densely packed storage of weights and activations
– All-to-all multiplication of weights and activations in the PE frontend (MULs): (a, b, c) × (x, y, z) → a·x, a·y, a·z, b·x, b·y, b·z, …
– Scatter network and accumulation in the PE backend: a mechanism to add to scattered partial sums

EIE [Han et al., ISCA 2016] – supports fully connected layers only:
[Figure: a sparse weight matrix (nonzero entries w_i,j, rows distributed across PE0 – PE3) multiplies activation vector ã to give (b0, b1, −b2, b3, −b4, b5, b6, −b7); ReLU zeroes the negative entries, leaving (b0, b1, b3, b5, b6)]
Filter decomposition: a 5x5 filter can be decomposed into two 3x3 filters applied sequentially (as in VGG-16), or into a 5x1 filter and a 1x5 filter applied sequentially (separable filters, as in GoogLeNet/Inception v3).
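The weight savings from decomposition are easy to check by counting filter coefficients (the channel count is a made-up example, and the channel dimensions are kept equal for simplicity):

```python
def conv_weights(r, s, c_in, c_out):
    """Number of weights in a CONV layer with r x s filters (no bias)."""
    return r * s * c_in * c_out

C = 64  # hypothetical channel count, kept equal everywhere for simplicity
full      = conv_weights(5, 5, C, C)                             # one 5x5 layer
stacked   = 2 * conv_weights(3, 3, C, C)                         # two 3x3 layers
separable = conv_weights(5, 1, C, C) + conv_weights(1, 5, C, C)  # 5x1 then 1x5
print(full, stacked, separable)  # → 102400 73728 40960
```

Both decompositions cover the same 5x5 receptive field with fewer weights (and correspondingly fewer MACs).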
compress expand compress
http://eyeriss.mit.edu/tutorial.html
"Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, 2017
– Sparse and Dense
– Large and Compact network architectures
– Different Layers (e.g., CONV and FC)
– Variable Bit-width
– Compact network architecture: decompose an RxSxC filter into separable pieces (e.g., RxS spatial and 1x1xC cross-channel convolutions)
– Sparsity: many zero weights/activations (bitmap: 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0)
– Reduce precision: 32-bit float → 8-bit fixed → binary
[Chen et al., SysML 2018]
Performance analysis: MAC/cycle vs. MAC/data (operational intensity)
– Step 1: maximum workload parallelism
– Step 2: maximum dataflow parallelism
– Step 3: # of active PEs under a finite PE array size
– Step 4: # of active PEs under fixed PE array dimensions (peak performance)
– Step 5: # of active PEs under fixed storage capacity (workload operational intensity)
– Step 6: lower active PE utilization due to insufficient average BW
– Step 7: lower active PE utilization due to insufficient instantaneous BW (slope = BW to only active PEs)
Each step tightens the roofline model (theoretical peak performance).
[Chen et al., In Submission]
Many new memories and devices explored to reduce data movement
I1 = V1 × G1, I2 = V2 × G2, I = I1 + I2 = V1×G1 + V2×G2
(multiplication via conductance, accumulation via summed currents)
– Stacked DRAM / eDRAM: [Chen et al., DaDianNao, MICRO 2014] [Kim et al., NeuroCube, ISCA 2016] [Gao et al., Tetris, ASPLOS 2017] (Eyeriss design)
– Non-volatile resistive memories: [Shafiee et al., ISCA 2016] [Chi et al., PRIME, ISCA 2016] (WS dataflow)
[Circuit: in-SRAM computation – wordlines WL0 – WLn driven from VDD_SRAM, bitcell currents I_BC,0, I_BC,1 summed on bitlines BL/BLB] [Zhang et al., VLSI 2016]
– [S. Gonugondla, ISSCC 2018]: pulse width modulation on WL (activation)
– [A. Biswas, Conv-RAM, ISSCC 2018]: apply Va (activation) to BL rather than WL
“Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, Dec. 2017
– Quality of result for a given task
– Analytics on high volume data – real-time performance (e.g., video at 30 fps)
– For interactive applications (e.g., autonomous navigation)
– Edge and embedded devices have limited battery capacity – data centers have stringent power ceilings due to cooling costs
– $$$
– Difficulty of dataset and/or task should be considered
– Number of cores (include utilization along with peak performance) – runtime for running specific DNN models
– Include batch size used in evaluation
– Power consumption for running specific DNN models – include external memory access
– On-chip storage, number of cores, chip area + process technology
| Metric | Units | Input |
|---|---|---|
| Name of CNN Model | Text | AlexNet |
| Top-5 error on classification | # | 19.8 |
| Supported Layers | | All CONV |
| Bits per weight | # | 16 |
| Bits per input activation | # | 16 |
| Batch Size | # | 4 |
| Runtime | ms | 115.3 |
| Power | mW | 278 |
| Off-chip Access per Image Inference | MBytes | 3.85 |
| Number of Images Tested | # | 100 |

| ASIC Specs | Input |
|---|---|
| Process Technology | 65nm LP TSMC (1.0V) |
| Total Core Area (mm²) | 12.25 |
| Total On-Chip Memory (kB) | 192 |
| Number of Multipliers | 168 |
| Clock Frequency (MHz) | 200 |
| Core area (mm²) / multiplier | 0.073 |
| On-Chip memory (kB) / multiplier | 1.14 |
| Measured or Simulated | Measured |
– Without the accuracy given for a specific dataset and task, one could run a simple DNN and claim low power, high throughput, and low cost – however, the processor might not be usable for a meaningful task
– Without reporting the off-chip bandwidth, one could build a processor with only multipliers and claim low cost, high throughput, high accuracy, and low chip power – however, when evaluating system power, the off-chip memory access would be substantial
For updates on Eyerissv2, Eyexam, NetAdapt, etc.
http://mailman.mit.edu/mailman/listinfo/eems-news
Research conducted in the MIT Energy-Efficient Multimedia Systems Group would not be possible without the support of the following organizations: