1. CMSC5743 L02: CNN Accurate Speedup I. Bei Yu (Latest update: September 28, 2020), Fall 2020

2. These slides contain/adapt materials developed by:
◮ Minsik Cho and Daniel Brand (2017). "MEC: memory-efficient convolution for deep neural network". In: Proc. ICML.
◮ Asit K. Mishra et al. (2017). "Fine-grained accelerators for sparse machine learning workloads". In: Proc. ASPDAC, pp. 635–640.
◮ Jongsoo Park et al. (2017). "Faster CNNs with direct sparse convolutions and guided pruning". In: Proc. ICLR.
◮ UC Berkeley EE290: "Hardware for Machine Learning". https://inst.eecs.berkeley.edu/~ee290-2/sp20/

3. Overview: Convolution 101, GEMM, Sparse Convolution, Direct Convolution, Further Discussions

4. Overview: Convolution 101, GEMM, Sparse Convolution, Direct Convolution, Further Discussions

5. 2D-Convolution
[Figure: a 5x5 input activation (entries a-y), a 3x3 weight (entries 1-9), and the 3x3 output activation (entries A-I).]
H: Height of Input Activation; W: Width of Input Activation
R: Height of Weight; S: Width of Weight
P: Height of Output Activation; Q: Width of Output Activation
B = b*1 + c*2 + d*3 + g*4 + h*5 + i*6 + l*7 + m*8 + n*9
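To make the indexing concrete, here is a minimal NumPy sketch (not from the slides; the array values are placeholders standing in for the letters) that computes the output element B exactly as the formula above:

```python
import numpy as np

# 5x5 input activation; values 0..24 stand in for the entries a..y.
x = np.arange(25, dtype=np.float32).reshape(5, 5)
# 3x3 weight with entries 1..9, as in the figure.
w = np.arange(1, 10, dtype=np.float32).reshape(3, 3)

# B is the output at row 0, column 1: the dot product of the 3x3 window
# covering b c d / g h i / l m n (i.e. x[0:3, 1:4]) with the weight.
B = np.sum(x[0:3, 1:4] * w)
```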

6. 2D-Convolution
[Same figure; the 3x3 weight window slides over the input.]
stride: # of rows/columns traversed per step

7. 2D-Convolution
[Animation step of slide 6: the window has advanced by the stride.]

8. 2D-Convolution
[Same figure; the window now sits at the bottom-right corner of the input.]
I = m*1 + n*2 + o*3 + r*4 + s*5 + t*6 + w*7 + x*8 + y*9

9. 2D-Convolution
Output size as a function of input size, weight size, and stride:
P = (H - R) / stride + 1
Q = (W - S) / stride + 1
For example, H = W = 5, R = S = 3, stride = 1 gives P = Q = 3, the 3x3 output A-I above.

10. 2D-Convolution
[Figure: the 5x5 input zero-padded to 7x7, so the weight window also covers border positions.]
padding: # of zero rows/columns added
P = (H - R + 2*pad) / stride + 1
Q = (W - S + 2*pad) / stride + 1
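Putting slides 5-10 together, the following is a minimal direct 2D convolution sketch in NumPy (the function name and test values are illustrative, not from the slides) that follows the output-size formulas above:

```python
import numpy as np

def conv2d(x, w, stride=1, pad=0):
    """Naive direct 2D convolution for a single channel, following
    P = (H - R + 2*pad)//stride + 1 and Q = (W - S + 2*pad)//stride + 1."""
    H, W = x.shape
    R, S = w.shape
    xp = np.pad(x, pad)  # add `pad` zero rows/columns on every side
    P = (H - R + 2 * pad) // stride + 1
    Q = (W - S + 2 * pad) // stride + 1
    y = np.zeros((P, Q), dtype=x.dtype)
    for p in range(P):
        for q in range(Q):
            # each output element is an R*S-element dot product
            y[p, q] = np.sum(xp[p*stride:p*stride+R, q*stride:q*stride+S] * w)
    return y

x = np.arange(25, dtype=np.float32).reshape(5, 5)    # the 5x5 input (a..y)
w = np.arange(1, 10, dtype=np.float32).reshape(3, 3) # the 3x3 weight (1..9)
assert conv2d(x, w).shape == (3, 3)         # stride 1, no padding
assert conv2d(x, w, pad=1).shape == (5, 5)  # one ring of zeros keeps the size
```

With pad = 1 and stride = 1, a 3x3 weight produces an output the same size as the input, which is the common configuration in CNN layers.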

11. 3D-Convolution
[Figure: the input activation now has C channels (C x H x W) and the weight is C x R x S; each P x Q output element sums over all C input channels.]
C: # of Input Channels

12. 3D-Convolution
[Figure: K weight tensors, each C x R x S, produce K output channels; the output activation is K x P x Q.]
K: # of Output Channels

13. 3D-Convolution
[Figure: a batch of N input activations (N x C x H x W) convolved with K weights (K x C x R x S) yields N x K x P x Q output activations.]
N: Batch size
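Slides 11-13 together describe the full loop nest of a batched direct convolution. Below is a sketch assuming NCHW activations and KCRS weights (the function name and test shapes are illustrative, not from the slides):

```python
import numpy as np

def conv_direct(x, w, stride=1, pad=0):
    """Direct batched convolution: x is (N, C, H, W), w is (K, C, R, S),
    result is (N, K, P, Q). A sketch of the loop nest, not optimized."""
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))  # pad H and W only
    P = (H - R + 2 * pad) // stride + 1
    Q = (W - S + 2 * pad) // stride + 1
    y = np.zeros((N, K, P, Q), dtype=x.dtype)
    for n in range(N):          # batch
        for k in range(K):      # output channels
            for p in range(P):  # output rows
                for q in range(Q):  # output columns
                    # sum over all C input channels and the R x S window
                    window = xp[n, :, p*stride:p*stride+R, q*stride:q*stride+S]
                    y[n, k, p, q] = np.sum(window * w[k])
    return y

x = np.random.randn(2, 3, 5, 5).astype(np.float32)  # N=2, C=3, H=W=5
w = np.random.randn(4, 3, 3, 3).astype(np.float32)  # K=4, R=S=3
assert conv_direct(x, w, pad=1).shape == (2, 4, 5, 5)
```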

14. Convolution 101
[Figure: a numeric example of direct convolution; a 5x5 input zero-padded to 7x7 is convolved with a 3x3 kernel to produce the 5x5 output.]
Direct convolution: no extra memory overhead, but:
◮ Low performance
◮ Poor memory access pattern due to geometry-specific constraints
◮ Relatively short dot products (each output element is only an R*S-element dot product)

15. Background: Memory System
[Figure: the memory hierarchy: processor, L1$, L2$, main memory, secondary memory. Transfer granularity grows with distance from the processor: 4-8 bytes (word) between processor and L1$, 8-32 bytes (block) between L1$ and L2$, 1 to 4 blocks between L2$ and main memory, 1,024+ bytes (disk sector = page) between main memory and secondary memory. The hierarchy is inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory. Both access time and (relative) size increase at each level.]
◮ Spatial locality
◮ Temporal locality
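As a rough illustration of why spatial locality matters (a benchmark sketch, not from the slides; timings vary by machine), traversing a row-major NumPy array along rows touches consecutive addresses, while traversing it along columns wastes most of each fetched cache block:

```python
import time
import numpy as np

a = np.ones((4096, 4096), dtype=np.float32)  # row-major (C-order) layout

start = time.perf_counter()
total = sum(float(a[i, :].sum()) for i in range(a.shape[0]))  # contiguous rows
row_time = time.perf_counter() - start

start = time.perf_counter()
total = sum(float(a[:, j].sum()) for j in range(a.shape[1]))  # strided columns
col_time = time.perf_counter() - start

# The column sweep is typically several times slower: each cache block
# brought in holds only one useful 4-byte element.
print(f"rows: {row_time:.3f}s  columns: {col_time:.3f}s")
```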

16. Overview: Convolution 101, GEMM, Sparse Convolution, Direct Convolution, Further Discussions

17. Im2col (Image2Column) Convolution
[Figure: the zero-padded 5x5 input is unrolled into a 25 x 9 matrix (one row per output position, one column per kernel element), which multiplies the 9 x 1 flattened kernel to give the 25 output elements.]
◮ Large extra memory overhead
◮ Good performance
◮ BLAS-friendly memory layout to enjoy SIMD/locality/parallelism
◮ Applicable for any convolution configuration on any platform
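Below is a minimal im2col sketch for the single-channel case in the figure (the function name is illustrative; real libraries also fold the C channel dimension into the columns):

```python
import numpy as np

def im2col(x, R, S, stride=1, pad=0):
    """Unroll a single-channel image into a matrix with one row per output
    position and one column per kernel element. x is padded first, so the
    size formula below uses the already-padded H and W."""
    xp = np.pad(x, pad)
    H, W = xp.shape
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    cols = np.empty((P * Q, R * S), dtype=xp.dtype)
    for p in range(P):
        for q in range(Q):
            cols[p * Q + q] = xp[p*stride:p*stride+R, q*stride:q*stride+S].ravel()
    return cols

x = np.arange(25, dtype=np.float32).reshape(5, 5)
w = np.arange(1, 10, dtype=np.float32).reshape(3, 3)

cols = im2col(x, 3, 3, pad=1)         # 25 x 9, as in the figure
y = (cols @ w.ravel()).reshape(5, 5)  # one GEMM: (25x9) @ (9x1) -> 5x5 output
```

Each input pixel appears in up to R*S = 9 rows of the 25 x 9 matrix; that duplication is the "large extra memory overhead" in the first bullet, and the payoff is that the whole convolution becomes one dense matrix multiply that BLAS can vectorize and parallelize.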
