  1. Energy-Efficient Deep Learning: Challenges and Opportunities. Vivienne Sze, Massachusetts Institute of Technology. In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang. Contact info: email: sze@mit.edu, website: www.rle.mit.edu/eems

  2. Example Applications of Deep Learning: speech recognition, computer vision, game play, medical.

  3. What is Deep Learning? [Figure: an image-classification example labeling a photo as "Volvo XC90". Image source: Lee et al., Comm. ACM 2011]

  4. Weighted Sums: each output is Y_j = activation( Σ_{i=1..3} W_ij × X_i ). [Figure: a small fully connected layer with inputs X_1..X_3, weights W_11..W_34, and outputs Y_1..Y_4. Image source: Stanford]
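A minimal sketch of this weighted-sum computation, assuming a ReLU activation and the 3-input, 4-output sizes drawn on the slide (both are illustrative choices, not anything the slide prescribes):

```python
import numpy as np

def fully_connected(X, W, activation=lambda z: np.maximum(z, 0)):
    # Y_j = activation( sum_i W_ij * X_i ) for every output neuron j
    return activation(W.T @ X)

X = np.random.rand(3)      # inputs X_1..X_3
W = np.random.rand(3, 4)   # weights W_11..W_34
Y = fully_connected(X, W)  # outputs Y_1..Y_4
```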

  5. Why is Deep Learning Hot Now? Three drivers: big data availability (350M images uploaded per day, 2.5 petabytes of customer data collected hourly, 300 hours of video uploaded every minute), GPU acceleration, and new ML techniques.

  6. Deep Convolutional Neural Networks: modern deep CNNs have up to 1000 CONV layers. [Figure: a chain of CONV layers extracting low-level to high-level features, followed by FC layers producing the output classes]

  7. Deep Convolutional Neural Networks: the FC layers at the end are typically only 1 – 3 layers. [Same figure as slide 6]

  8. Deep Convolutional Neural Networks: convolutions account for more than 90% of the overall computation, dominating runtime and energy consumption.

  9. High-Dimensional CNN Convolution. [Figure: an R×R filter applied to an H×H input image (feature map)]

  10. High-Dimensional CNN Convolution: the filter overlaps a patch of the input and an element-wise multiplication is performed. [Same figure, filter placed on the input]

  11. High-Dimensional CNN Convolution: the element-wise products are accumulated as a partial sum (psum), producing one pixel of the E×E output image. [Figure: R×R filter, H×H input, E×E output]

  12. High-Dimensional CNN Convolution: the filter slides across the input in a sliding-window fashion to produce the remaining output pixels.

  13. High-Dimensional CNN Convolution: both the filter and the input have many input channels (C); AlexNet uses 3 – 192 channels. [Figure: a C-channel filter applied to a C-channel input, producing one output channel]

  14. High-Dimensional CNN Convolution: applying many filters (M) produces many output channels (M); AlexNet uses 96 – 384 filters. [Figure: M filters of size R×R×C producing an E×E×M output]

  15. High-Dimensional CNN Convolution: processing a batch of N input images produces N output images; typical batch sizes are 1 – 256. A reference loop nest for the full computation is sketched below. [Figure: N inputs of size H×H×C, M filters of size R×R×C, N outputs of size E×E×M]
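A minimal reference loop nest for the high-dimensional convolution built up in slides 9 – 15, assuming stride 1 and no padding (so E = H - R + 1). The variable names (N, M, C, H, R, E) follow the slides; the implementation is only an illustrative sketch, not the dataflow of any particular accelerator:

```python
import numpy as np

def conv_layer(inputs, filters):
    N, C, H, _ = inputs.shape          # N images, C input channels, HxH pixels each
    M, _, R, _ = filters.shape         # M filters, each C x R x R
    E = H - R + 1                      # output feature map size (stride 1, no padding)
    outputs = np.zeros((N, M, E, E))
    for n in range(N):                 # image batch
        for m in range(M):             # output channels (one per filter)
            for y in range(E):
                for x in range(E):     # sliding window over output pixels
                    for c in range(C): # accumulate over input channels
                        for i in range(R):
                            for j in range(R):
                                outputs[n, m, y, x] += (
                                    inputs[n, c, y + i, x + j] * filters[m, c, i, j]
                                )
    return outputs
```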

  16. Large Sizes with Varying Shapes: AlexNet [Krizhevsky, NIPS 2012] convolutional layer configurations.

      Layer | Filter Size (R) | # Filters (M) | # Channels (C) | Stride
        1   |     11x11       |      96       |       3        |   4
        2   |      5x5        |     256       |      48        |   1
        3   |      3x3        |     384       |     256        |   1
        4   |      3x3        |     384       |     192        |   1
        5   |      3x3        |     256       |     192        |   1

      Layer 1: 34k params, 105M MACs. Layer 2: 307k params, 150M MACs. Layer 3: 885k params, 224M MACs.
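A back-of-the-envelope check of the layer-1 numbers above. The 55×55 output size is an assumption based on AlexNet's 227×227 input and the stride of 4 listed in the table:

```python
R, M, C, E = 11, 96, 3, 55          # layer-1 filter size, # filters, # channels, output size
params = R * R * C * M              # 34,848        -> ~34k weights
macs = E * E * M * R * R * C        # 105,415,200   -> ~105M MACs per image
```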

  17. Popular CNNs, benchmarked on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC): LeNet (1998), AlexNet (2012), OverFeat (2013), VGGNet (2014), GoogLeNet (2014), ResNet (2015). [Figure: top-5 error on ILSVRC falling from AlexNet and Clarifai in 2012 and 2013 to VGGNet and GoogLeNet in 2014 and ResNet in 2015, which drops below the human reference level. O. Russakovsky et al., IJCV 2015]

  18. Summary of Popular CNNs:

      Metrics            | LeNet-5 | AlexNet  | VGG-16   | GoogLeNet (v1) | ResNet-50
      Top-5 error        |  n/a    |  16.4    |  7.4     |  6.7           |  5.3
      Input size         |  28x28  |  227x227 |  224x224 |  224x224       |  224x224
      # of CONV layers   |  2      |  5       |  16      |  21 (depth)    |  49
      Filter sizes       |  5      |  3, 5, 11|  3       |  1, 3, 5, 7    |  1, 3, 7
      # of channels      |  1, 6   |  3 - 256 |  3 - 512 |  3 - 1024      |  3 - 2048
      # of filters       |  6, 16  |  96 - 384|  64 - 512|  64 - 384      |  64 - 2048
      Stride             |  1      |  1, 4    |  1       |  1, 2          |  1, 2
      CONV weights       |  2.6k   |  2.3M    |  14.7M   |  6.0M          |  23.5M
      CONV MACs          |  283k   |  666M    |  15.3G   |  1.43G         |  3.86G
      # of FC layers     |  2      |  3       |  3       |  1             |  1
      FC weights         |  58k    |  58.6M   |  124M    |  1M            |  2M
      FC MACs            |  58k    |  58.6M   |  124M    |  1M            |  2M
      Total weights      |  60k    |  61M     |  138M    |  7M            |  25.5M
      Total MACs         |  341k   |  724M    |  15.5G   |  1.43G         |  3.9G

      CONV layers are increasingly important!

  19. Training vs. Inference: training determines the weights using large datasets; inference uses the trained weights to make predictions.

  20. Processing at the "Edge" instead of the "Cloud": motivated by communication, privacy, and latency. [Figure: sensor and actuator at the edge connected to the cloud. Image sources: www.theregister.co.uk, ericsson.com]

  21. Challenges

  22. Key Metrics: • Accuracy – evaluate hardware using the appropriate DNN model and dataset (e.g., MNIST vs. ImageNet) • Programmability – support multiple applications (e.g., computer vision, speech recognition) with different weights • Energy/Power – energy per operation and DRAM bandwidth (chip vs. off-chip DRAM) • Throughput/Latency – GOPS, frame rate, delay • Cost – area (size of memory and # of cores) [Sze et al., CICC 2017]

  23. Opportunities in Architecture

  24. GPUs and CPUs Targeting Deep Learning: Intel Knights Landing (2016) and Nvidia Pascal GP100 (2016); Knights Mill is the next-generation Xeon Phi, "optimized for deep learning". Both CPUs and GPUs rely on matrix multiplication libraries.

  25. Map DNN to a Matrix Multiplication: the convolution of a 3x3 input fmap [1 2 3; 4 5 6; 7 8 9] with a 2x2 filter [1 2; 3 4] becomes a matrix multiplication of the flattened filter [1 2 3 4] with a Toeplitz matrix whose columns are the input patches [1 2 4 5], [2 3 5 6], [4 5 7 8], [5 6 8 9]. Data is repeated (redundant) in the Toeplitz matrix. Goal: reduce the number of operations to increase throughput.
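A minimal im2col-style sketch of this mapping, assuming stride 1 and a single channel; the numbers match the 3x3 fmap / 2x2 filter example on the slide:

```python
import numpy as np

fmap = np.arange(1, 10).reshape(3, 3)    # input feature map 1..9
filt = np.arange(1, 5).reshape(2, 2)     # filter 1..4
R, H = 2, 3
E = H - R + 1                            # output size (stride 1) -> 2

# im2col: each column is one flattened input patch; pixels are repeated (redundant data)
cols = np.stack([fmap[y:y + R, x:x + R].ravel()
                 for y in range(E) for x in range(E)], axis=1)

out = (filt.ravel() @ cols).reshape(E, E)  # the convolution is now one matrix multiplication
```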

  26. Reduce Operations in Matrix Multiplication: • Fast Fourier Transform [Mathieu, ICLR 2014] – Pro: reduces direct convolution from O(N_o^2 N_f^2) to O(N_o^2 log2 N_o); Con: increased storage requirements • Strassen [Cong, ICANN 2014] – Pro: O(N^3) to O(N^2.807); Con: numerical stability • Winograd [Lavin, CVPR 2016] – Pro: 2.25x speedup for a 3x3 filter; Con: specialized processing depending on filter size
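A hedged sketch of the FFT idea from [Mathieu, ICLR 2014]: multiply in the frequency domain rather than sliding the filter spatially. SciPy's fftconvolve is used here purely for brevity; the sizes are arbitrary:

```python
import numpy as np
from scipy.signal import fftconvolve

fmap = np.random.rand(32, 32)   # input feature map (single channel, illustrative size)
filt = np.random.rand(3, 3)     # filter

# CNN "convolution" is typically cross-correlation, i.e., convolution with a flipped filter;
# 'valid' keeps only the fully overlapping outputs.
out = fftconvolve(fmap, filt[::-1, ::-1], mode="valid")
```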

  27. Analogy: Gauss's Multiplication Algorithm for complex numbers: the naive method uses 4 multiplications + 3 additions, while Gauss's method uses 3 multiplications + 5 additions. The common idea, shared by Strassen and Winograd, is to reduce the number of multiplications at the cost of more additions.
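A short illustration of Gauss's trick for the complex product (a + bi)(c + di), trading one multiplication for extra additions:

```python
def complex_mul_naive(a, b, c, d):
    # 4 multiplications
    return a * c - b * d, a * d + b * c

def complex_mul_gauss(a, b, c, d):
    # 3 multiplications, 5 additions/subtractions
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2        # (real, imaginary)

assert complex_mul_naive(1, 2, 3, 4) == complex_mul_gauss(1, 2, 3, 4)
```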

  28. Specialized Hardware (Accelerators)

  29. Properties We Can Leverage: • Operations exhibit high parallelism → high throughput possible • Memory access is the bottleneck: each MAC needs memory reads for the filter weight, the image pixel, and the partial sum, plus a memory write for the updated partial sum. In the worst case all of these are DRAM accesses, and a DRAM access costs roughly 200x the energy of the MAC itself (200x vs. 1x on the slide). Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required. (See the arithmetic sketch below.)
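Where the 2896M figure comes from, as a one-line worked calculation:

```python
alexnet_macs = 724e6                  # total MACs in AlexNet (slides 16 and 18)
accesses_per_mac = 4                  # read weight, read pixel, read psum, write psum
worst_case_dram = alexnet_macs * accesses_per_mac   # 2.896e9 -> ~2896M DRAM accesses
```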

  30. Properties We Can Leverage: • Operations exhibit high parallelism → high throughput possible • Input data reuse opportunities (up to 500x) → exploit low-cost local memory: convolutional reuse (pixels and weights, within one image/filter pair), image reuse (pixels shared across filters), and filter reuse (weights shared across images in a batch).

  31. Highly-Parallel Compute Paradigms: Temporal Architecture (SIMD/SIMT) vs. Spatial Architecture (Dataflow Processing). [Figure: both are arrays of ALUs fed by a memory hierarchy; the temporal architecture has a centralized register file and control, while the spatial architecture lets ALUs pass data directly to one another]

  32. Advantages of Spatial Architecture (Dataflow Processing) over Temporal Architecture (SIMD/SIMT): efficient data reuse through distributed local storage (a 0.5 – 1.0 kB register file inside each processing element (PE)) and inter-PE communication (sharing among regions of PEs).

  33. How to Map the Dataflow? The CNN convolution must be mapped onto the spatial architecture: pixels, weights, and partial sums flow between the memory hierarchy and the ALUs. Goal: increase reuse of input data (weights and pixels) and local accumulation of partial sums.

  34. Energy-Efficient Dataflow (Yu-Hsin Chen, Joel Emer, Vivienne Sze, ISCA 2016): maximize data reuse and accumulation at the RF.

  35. Data Movement is Expensive. The accelerator consists of off-chip DRAM, a global buffer, and an array of processing engines (PEs), each with a local register file (RF). Normalized energy cost of delivering an operand to the ALU:
      DRAM → ALU: 200x | Global buffer → ALU: 6x | PE → ALU: 2x | RF → ALU: 1x | ALU: 1x (reference)
      Maximize data reuse at the lower levels of the hierarchy.
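A rough sketch of how these relative costs combine into a total-energy estimate. The energy ratios come from the slide; the access counts below are invented placeholders, used only to show that moving accesses from DRAM down to the RF is what dominates the savings:

```python
RELATIVE_ENERGY = {"DRAM": 200, "global_buffer": 6, "inter_PE": 2, "RF": 1, "MAC": 1}

def total_energy(access_counts):
    # access_counts: hierarchy level -> number of accesses (or MAC operations) at that level
    return sum(RELATIVE_ENERGY[level] * n for level, n in access_counts.items())

all_from_dram = total_energy({"DRAM": 3e9, "MAC": 1e9})              # worst case
mostly_from_rf = total_energy({"DRAM": 3e7, "RF": 3e9, "MAC": 1e9})  # reuse at the RF
```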

  36. Weight Stationary (WS): weights stay resident in the PE register files (W0 – W7 across the PE array) while pixels and partial sums flow through the global buffer. Minimizes weight-read energy by maximizing convolutional and filter reuse of weights. Examples: [Chakradhar, ISCA 2010], [nn-X (NeuFlow), CVPRW 2014], [Park, ISSCC 2015], [Origami, GLSVLSI 2015].

  37. Output Stationary (OS): partial sums stay resident in the PE register files (P0 – P7 across the PE array) while pixels and weights flow through the global buffer. Minimizes partial-sum read/write energy by maximizing local accumulation. Examples: [Gupta, ICML 2015], [ShiDianNao, ISCA 2015], [Peemen, ICCD 2013].

  38. No Local Reuse (NLR): no data is kept locally in the PEs; a large global buffer is used as shared storage to reduce DRAM access energy. Examples: [DianNao, ASPLOS 2014], [DaDianNao, MICRO 2014], [Zhang, FPGA 2015].

  39. Row Stationary: an energy-efficient dataflow [Chen, ISCA 2016]. [Figure: input image * filter = output image]

  40. 1D Row Convolution in a PE: a filter row [a b c] is convolved with an input image row [a b c d e] to produce a row of partial sums. The filter row and the input row are loaded into the PE's local register file.

  41. 1D Row Convolution in a PE: the PE computes the first partial sum (a) using only operands held in its register file.

  42. 1D Row Convolution in a PE: the window slides by one and the PE computes the second partial sum (b), again using only register-file operands. A sketch of this per-PE computation is shown below.
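A minimal sketch of the 1D row convolution a single PE performs in the row-stationary dataflow: the filter row and the input row stay in the local register file and the partial sums for one output row are produced one per step. This is an illustrative model only, not the cycle-accurate behavior of the hardware:

```python
def pe_1d_row_conv(input_row, filter_row):
    R = len(filter_row)                # e.g., 3-tap filter row [a, b, c]
    E = len(input_row) - R + 1         # e.g., 5-pixel input row -> 3 partial sums
    psums = []
    for x in range(E):                 # one output partial sum per step
        acc = 0
        for i in range(R):             # MACs use only register-file operands
            acc += input_row[x + i] * filter_row[i]
        psums.append(acc)
    return psums

print(pe_1d_row_conv([1, 2, 3, 4, 5], [1, 1, 1]))   # -> [6, 9, 12]
```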
