  1. CMSC5743 L03: CNN Accurate Speedup II. Bei Yu (Latest update: September 29, 2020). Fall 2020

  2. Overview: MNN Architecture; MNN Backend and Runtime

  3. Overview: MNN Architecture; MNN Backend and Runtime

  4. Architecture. [1] Xiaotang Jiang et al. (2020). "MNN: A Universal and Efficient Inference Engine". In: Proc. MLSys.

  5. Frontends
  ◮ Caffe Deep Learning Framework
  ◮ TensorFlow Deep Learning Framework
  ◮ PyTorch Deep Learning Framework

  6. PyTorch
  ◮ PyTorch is a Python package that provides two high-level features:
    ◮ Tensor computation (like NumPy) with strong GPU acceleration
    ◮ Deep neural networks built on a tape-based autograd system
  ◮ Model deployment: for high-performance inference deployment of trained models, export to ONNX format, then optimize and deploy with NVIDIA TensorRT or the MNN inference engine

  7. PyTorch Code Sample
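
  The code shown on this slide is an image and is not preserved in the transcript. Below is a minimal sketch, assuming a small torchvision model and an illustrative output path, of the workflow the previous slide describes: tensor computation with autograd, then ONNX export for an inference engine such as TensorRT or MNN.

    # Minimal PyTorch example: tensor computation, autograd, and ONNX export.
    # Illustrative only; the model choice and output path are assumptions.
    import torch
    import torchvision

    # Tensor computation with autograd (tape-based): gradients are recorded
    # for tensors created with requires_grad=True.
    x = torch.randn(4, 3, requires_grad=True)
    y = (x * x).sum()
    y.backward()          # populates x.grad with dy/dx = 2x

    # Export a trained model to ONNX so it can be converted for an
    # inference engine (e.g. TensorRT or MNN) afterwards.
    model = torchvision.models.resnet18().eval()
    dummy = torch.randn(1, 3, 224, 224)  # NCHW input
    torch.onnx.export(model, dummy, "resnet18.onnx", opset_version=11)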

  8. TensorFlow
  ◮ TensorFlow is an open-source software library for numerical computation using data flow graphs
  ◮ Model deployment: for high-performance inference deployment of trained models, use the TensorFlow-MNN integration to optimize models within TensorFlow and deploy with the MNN inference engine

  9. TensorFlow Code Sample
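
  As above, the slide's code image is lost. This is a minimal sketch, assuming a small Keras model and an illustrative save path, of defining a model in TensorFlow and exporting it as a SavedModel for later conversion and deployment.

    # Minimal TensorFlow example: build a small Keras model and save it as a
    # SavedModel for downstream conversion/deployment. Paths are illustrative.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=(224, 224, 3)),  # NHWC, TF's default layout
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])

    # Saved graph + weights; an offline converter (e.g. MNN's model converter)
    # would typically take a SavedModel or frozen graph as its input.
    tf.saved_model.save(model, "saved_model_dir")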

  10. Caffe
  ◮ Caffe is a deep learning framework made with expression, speed, and modularity in mind:
    ◮ Expressive architecture encourages application and innovation
    ◮ Extensible code fosters active development
    ◮ Speed makes Caffe perfect for research experiments and industry deployment
  ◮ Model deployment: for high-performance inference deployment of trained models, use the Caffe-MNN integration to optimize models within Caffe and deploy with the MNN inference engine

  11. Caffe Code Sample

  12. Data Layout Formats [2]
  ◮ N is the batch size
  ◮ C is the number of feature maps
  ◮ H is the image height
  ◮ W is the image width
  [2] https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html

  13. NCHW Memory Layout
  ◮ Begin with the first channel (c = 0); its elements are arranged contiguously in row-major order
  ◮ Continue with the second and subsequent channels until all channels are laid out

  14. NHWC Memory Layout
  ◮ Begin with the first element of channel 0, then the first element of channel 1, and so on, until the first elements of all C channels are laid out
  ◮ Next, take the second element of channel 0, then the second element of channel 1, and so on, until the second elements of all channels are laid out
  ◮ Follow the row-major order of channel 0 and complete all the elements
  ◮ Proceed to the next batch (if N > 1)
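
  A small sketch (not from the slides) of how a logical index (n, c, h, w) maps to a linear memory offset under the two layouts; the dimensions are illustrative.

    # Linear offsets for NCHW vs NHWC layouts (illustrative sketch).
    import numpy as np

    N, C, H, W = 2, 3, 4, 5

    def offset_nchw(n, c, h, w):
        # innermost dimension is W, then H, then C, then N
        return ((n * C + c) * H + h) * W + w

    def offset_nhwc(n, c, h, w):
        # innermost dimension is C, then W, then H, then N
        return ((n * H + h) * W + w) * C + c

    x = np.arange(N * C * H * W)
    nchw = x.reshape(N, C, H, W)                  # NCHW view
    nhwc = nchw.transpose(0, 2, 3, 1).copy()      # same data in NHWC order

    n, c, h, w = 1, 2, 3, 4
    assert nchw.reshape(-1)[offset_nchw(n, c, h, w)] == nchw[n, c, h, w]
    assert nhwc.reshape(-1)[offset_nhwc(n, c, h, w)] == nchw[n, c, h, w]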

  15. NHWC Memory Layout (figure)

  16. Overview: MNN Architecture; MNN Backend and Runtime

  17. Overview of the proposed Mobile Neural Network (figure). Upstream: model diversity (CNN, RNN, LSTM, GAN, Transformer), framework diversity (converted via ONNX), and operator diversity in kernel/stride/dilation (e.g. 1x1s1d1, 3x3s1d1, 3x3s1d2, 3x3s2d1 convolutions and deconvolution). The MNN inference engine combines model information and device information to search candidate computation schemes (Scheme A, B, C, ...) and select the fastest. Downstream: device diversity in hardware/OS/standard (CPU, Metal, OpenGL, OpenCL, Vulkan) and limited resources (computation/memory) on IoT devices.

  18. On-device inference workflow of MNN (figure). Offline conversion: a trained (not optimized) model in TensorFlow/Caffe/ONNX format passes through the format converter, the offline graph optimizer, and the model compressor to produce an optimized MNN model and its computational graph. On-device inference: pre-inference combines backend info and model info, loads the valid backends behind the backend abstraction (ARM / ARM82 on CPU, OpenCL / Vulkan on GPU), matches the optimal computation scheme, allocates memory, and pre-computes constants; the session then runs fill input, inference, and output. The figure includes a backend-validity table for Pixel2 / MI6 / Mate20: the ARM backend is valid on all three, OpenCL is invalid on Pixel2 but valid on MI6 and Mate20, Vulkan is valid on all three, and ARM82 is valid on only one of the three devices.

  19. What is Convolution? The calculation process of the convolutional layer:
  ◮ No padding
  ◮ Unit strides
  ◮ 3 × 3 kernel size
  ◮ 4 × 4 input feature map
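
  A minimal sketch of the slide's setting: a direct (sliding-window) convolution of a 4 × 4 input with a 3 × 3 kernel, no padding, unit stride, giving a 2 × 2 output; the input values are illustrative.

    # Direct convolution matching the slide's setting: 4x4 input, 3x3 kernel,
    # no padding, unit stride -> 2x2 output. Illustrative sketch.
    import numpy as np

    def conv2d(x, k):
        kh, kw = k.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        y = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                y[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
        return y

    x = np.arange(16, dtype=float).reshape(4, 4)
    k = np.ones((3, 3))
    print(conv2d(x, k))   # 2x2 output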

  20. What is Deconvolution (transposed convolution)? [3] The calculation process of the deconvolutional layer:
  ◮ 2 × 2 padding with a border of zeros
  ◮ Unit strides
  ◮ 3 × 3 kernel size
  ◮ 4 × 4 input feature map
  [3] Vincent Dumoulin and Francesco Visin (2016). "A guide to convolution arithmetic for deep learning". In: arXiv preprint arXiv:1603.07285.
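
  A minimal sketch of one common way to realize this stride-1 transposed convolution: zero-pad the 4 × 4 input by kernel_size - 1 = 2 on each border and convolve with the 180-degree-rotated kernel, giving a 6 × 6 output. It reuses conv2d() from the previous sketch; the values are illustrative.

    # Stride-1 transposed convolution realized as an ordinary convolution over
    # a zero-padded input (pad = kernel_size - 1 = 2), using the kernel rotated
    # by 180 degrees. Reuses conv2d() from the sketch above.
    import numpy as np

    def deconv2d(x, k):
        pad = k.shape[0] - 1                  # 2 for a 3x3 kernel
        xp = np.pad(x, pad)                   # 4x4 -> 8x8 with a zero border
        return conv2d(xp, np.rot90(k, 2))     # 8x8 -> 6x6 output

    x = np.arange(16, dtype=float).reshape(4, 4)
    k = np.ones((3, 3))
    print(deconv2d(x, k).shape)   # (6, 6)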

  21. Strassen Algorithm [4]. [4] Jason Cong and Bingjun Xiao (2014). "Minimizing computation in convolutional neural networks". In: Proc. ICANN, pp. 281–290.

  22. Strassen Algorithm
  Matrix size            w/o Strassen    w/ Strassen
  (256, 256, 256)              23             23
  (512, 512, 512)             191            176 (↓ 7.9%)
  (512, 512, 1024)            388            359 (↓ 7.5%)
  (1024, 1024, 1024)         1501           1299 (↓ 13.5%)

  The MNN backend abstraction (XPUBackend class):

    class XPUBackend final : public Backend {
        XPUBackend(MNNForwardType type, MemoryMode mode);
        virtual ~XPUBackend();
        virtual Execution* onCreate(const vector<Tensor*>& inputs,
                                    const vector<Tensor*>& outputs,
                                    const MNN::Op* op);
        virtual void onExecuteBegin() const;
        virtual void onExecuteEnd() const;
        virtual bool onAcquireBuffer(const Tensor* tensor, StorageType storageType);
        virtual bool onReleaseBuffer(const Tensor* tensor, StorageType storageType);
        virtual bool onClearBuffer();
        virtual void onCopyBuffer(const Tensor* srcTensor, const Tensor* dstTensor) const;
    };
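
  A sketch of one level of Strassen's recursion for square matrices, which trades 8 block multiplications for 7 at the cost of extra additions; the leaf size and test sizes are illustrative, and real engines stop recursing once the extra additions outweigh the savings.

    # One level of Strassen's algorithm: 7 sub-multiplications instead of 8
    # for a product split into 2x2 blocks. Illustrative sketch for even-sized
    # square matrices; falls back to an ordinary product below the leaf size.
    import numpy as np

    def strassen(A, B, leaf=64):
        n = A.shape[0]
        if n <= leaf or n % 2:          # small or odd size: ordinary product
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        M1 = strassen(A11 + A22, B11 + B22, leaf)
        M2 = strassen(A21 + A22, B11, leaf)
        M3 = strassen(A11, B12 - B22, leaf)
        M4 = strassen(A22, B21 - B11, leaf)
        M5 = strassen(A11 + A12, B22, leaf)
        M6 = strassen(A21 - A11, B11 + B12, leaf)
        M7 = strassen(A12 - A22, B21 + B22, leaf)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256)
    B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)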

  23. Winograd Algorithm [5]. [5] Andrew Lavin and Scott Gray (2016). "Fast Algorithms for Convolutional Neural Networks". In: Proc. CVPR, pp. 4013–4021.


  29. Optimized Winograd algorithm in MNN (figure; notation follows the standard F(2 × 2, 3 × 3) Winograd transforms). Each 4 × 4 input tile X is transformed into X' = B^T X B (shape (4, 4, tiles, i_c)); each 3 × 3 kernel W is transformed into W' = G W G^T (shape (4, 4, i_c, o_c)); for each of the 16 tile positions (i, j), the reduction over input channels Z'_(i,j) = sum_k W'_(i,j,k) · X'_(i,j,k) is merged into a single (tiles, i_c) × (i_c, o_c) matrix multiplication; the result (4, 4, tiles, o_c) is transformed back to 2 × 2 output tiles via Z = A^T Z' A.
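
  A minimal single-tile sketch of the F(2 × 2, 3 × 3) Winograd transform, using the transform matrices from Lavin and Gray and checked against direct convolution; it omits the batching over channels and tiles that MNN merges into matrix multiplications.

    # Winograd F(2x2, 3x3) on a single 4x4 tile: 16 multiplies instead of 36.
    # Transform matrices follow Lavin & Gray (2016). Illustrative sketch only.
    import numpy as np

    B_T = np.array([[1, 0, -1,  0],
                    [0, 1,  1,  0],
                    [0, -1, 1,  0],
                    [0, 1,  0, -1]], dtype=float)
    G = np.array([[1,    0,   0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0,    0,   1]], dtype=float)
    A_T = np.array([[1, 1,  1,  0],
                    [0, 1, -1, -1]], dtype=float)

    d = np.random.rand(4, 4)     # input tile
    g = np.random.rand(3, 3)     # 3x3 kernel

    U = G @ g @ G.T              # transformed kernel, 4x4
    V = B_T @ d @ B_T.T          # transformed input tile, 4x4
    Y = A_T @ (U * V) @ A_T.T    # inverse transform -> 2x2 output tile

    # Reference: direct 'valid' correlation of the 4x4 tile with the 3x3 kernel.
    ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                    for i in range(2)])
    assert np.allclose(Y, ref)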

  30. Memory optimization of MNN (figure: for a computational graph 0 -> Conv -> 1 -> Pool -> 2, the usual execution mixes compute and memory management (Alloc 0, Convolution, Alloc 1, Free 0, Alloc 2, Pool, Free 1, Free 2); MNN instead performs all memory management during pre-inference, leaving pure compute (Convolution, Pool) for inference)
  ◮ MNN can infer the exact required memory for the entire graph by:
    ◮ virtually walking through all operations
    ◮ summing up all allocations and frees
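
  A toy sketch (not MNN's code) of the pre-inference idea: walk the operators once, simulate allocating each output and freeing tensors after their last use, and record the peak requirement so memory can be allocated up front; the graph and tensor sizes are illustrative.

    # Toy memory planner: virtually walk the ops, "Alloc" outputs, "Free"
    # tensors after their last use, and track the peak so a single arena can
    # be pre-allocated before inference.
    ops = [                      # (op name, input tensors, output tensor, output bytes)
        ("Input", [],     "t0", 4 * 224 * 224),
        ("Conv",  ["t0"], "t1", 4 * 112 * 112),
        ("Pool",  ["t1"], "t2", 4 * 56 * 56),
    ]

    # last step at which each tensor is still needed
    last_use = {}
    for step, (_, ins, out, _) in enumerate(ops):
        for t in ins:
            last_use[t] = step

    live, peak, sizes = 0, 0, {}
    for step, (name, ins, out, nbytes) in enumerate(ops):
        sizes[out] = nbytes
        live += nbytes                       # "Alloc" the output before computing
        peak = max(peak, live)
        for t in ins:                        # "Free" inputs no longer needed
            if last_use.get(t) == step:
                live -= sizes[t]

    print("peak bytes needed for the whole graph:", peak)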

  31. Inference in FP16
  ◮ Training in FP32 and running inference in FP16 is expected to give the same accuracy as FP32 most of the time
  ◮ Add batch normalization to activations
  ◮ If the input is integer RGB (0-255), normalize it to float (0-1)
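
  A small sketch of the recipe using PyTorch tensors (the framework choice and sizes are assumptions; the slide itself is framework-agnostic): normalize integer RGB to [0, 1], cast to FP16, and note where values are widened back to FP32.

    # FP16 inference sketch: normalize uint8 RGB input to [0, 1] float and
    # cast to half precision. PyTorch is used here only as an example.
    import numpy as np
    import torch

    rgb = np.random.randint(0, 256, size=(1, 3, 224, 224), dtype=np.uint8)
    x = torch.from_numpy(rgb).float() / 255.0   # integer RGB (0-255) -> float (0-1)

    x_fp16 = x.half()                           # cast activations to FP16
    print(x.element_size(), "bytes/elem vs", x_fp16.element_size(), "bytes/elem")  # 4 vs 2

    # FP16 values are typically widened back to FP32 around accumulations to
    # limit precision loss (one of the costs noted on the next slide).
    acc = x_fp16.float().sum()
    print(acc.dtype)                            # torch.float32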

  32. Analysis of FP16 inference
  ◮ Advantages of FP16:
    ◮ FP16 improves speed (TFLOPS) and performance
    ◮ FP16 reduces the memory usage of a neural network
    ◮ FP16 data transfers are faster than FP32
  ◮ Disadvantages of FP16:
    ◮ FP16 values must be converted to or from 32-bit floats before they are operated on

  33. Neon optimization
  ◮ As a programmer, there are several ways you can use Neon technology:
    ◮ Neon intrinsics
    ◮ Neon-enabled libraries
    ◮ Auto-vectorization by your compiler
    ◮ Hand-coded Neon assembler

  34. Why use Neon
  ◮ Support for both integer and floating-point operations ensures adaptability to a broad range of applications, from codecs to high-performance computing to 3D graphics
  ◮ Tight coupling to the Arm processor provides a single instruction stream and a unified view of memory, presenting a single development-platform target with a simpler tool flow

  35. Manual Search (figure: each convolution configuration, e.g. 3x3s1d1, 3x3s2d1, 3x3s1d2, is mapped to a dedicated kernel func_3x3s1d1 / func_3x3s2d1 / func_3x3s1d2 that chooses between Winograd and sliding-window implementations, each hand-optimized with SIMD, multi-threading, pipelining, and other techniques)
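
  A toy sketch of the selection idea behind such a manual search: time each candidate implementation of a convolution configuration and keep the fastest; the candidates (a sliding-window kernel and an im2col-based variant) and the timing harness are illustrative, not MNN's.

    # Toy kernel selection: benchmark candidate implementations for a given
    # convolution configuration and pick the fastest.
    import time
    import numpy as np

    def sliding_window_conv(x, k):
        kh, kw = k.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        return np.array([[np.sum(x[i:i+kh, j:j+kw] * k) for j in range(ow)]
                         for i in range(oh)])

    def im2col_conv(x, k):
        # alternative formulation: gather patches, then one matrix product
        kh, kw = k.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        cols = np.stack([x[i:i+oh, j:j+ow].ravel()
                         for i in range(kh) for j in range(kw)])
        return (k.ravel() @ cols).reshape(oh, ow)

    def pick_fastest(candidates, x, k, repeats=5):
        best, best_t = None, float("inf")
        for f in candidates:
            t0 = time.perf_counter()
            for _ in range(repeats):
                f(x, k)
            t = time.perf_counter() - t0
            if t < best_t:
                best, best_t = f, t
        return best

    x, k = np.random.rand(64, 64), np.random.rand(3, 3)
    print(pick_fastest([sliding_window_conv, im2col_conv], x, k).__name__)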
