CMSC5743 L03: CNN Accurate Speedup II
Bei Yu
(Latest update: September 29, 2020)
Fall 2020
Overview
- MNN Architecture
- MNN Backend and Runtime

Architecture
1Xiaotang Jiang et al. (2020). "MNN: A Universal and Efficient Inference Engine". In: Proc. MLSys.
Deployment pipelines:
- Train a model, then optimize and deploy it with NVIDIA TensorRT or the MNN inference engine.
- TensorFlow integration: optimize models within TensorFlow and deploy with the MNN inference engine.
- Caffe integration: optimize models within Caffe and deploy with the MNN inference engine.
2https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html
Figure: pre-inference search. Candidate execution schemes (Scheme A, Scheme B, Scheme C, ...) are searched, and the best (fastest) scheme is selected.
Figure: challenges of on-device inference.
- Model information (upstream): operator diversity in kernel/stride/dilation (e.g. Conv 1x1s1d1, Conv 3x3s1d1, Conv 3x3s1d2, Deconv 3x3s2d1, ...) and framework/model diversity (ONNX; CNN, RNN, LSTM, GAN, Transformer).
- IoT device information (downstream): diversity of hardware, OS, and standards (CPU, Metal, OpenGL, OpenCL, Vulkan) and limited resources (computation/memory).
Figure: MNN architecture.
- Offline conversion: Format Converter (e.g. from ONNX) → MNN model (not optimized) → Model Compressor → Offline Graph Optimizer → MNN model (optimized).
- On-device inference: pre-inference loads the valid backends and uses model info and backend info to search the computational graph's operators (A1–A4, B1–B4) over backend assignments (ARM, ARM82, OpenCL, Vulkan); some combinations are invalid on a given device (e.g. Pixel2, MI6, Mate20). A session is then created with allocated memory and pre-computed constants, and each run is fill input → inference → output.
- Backend abstraction spans the ARM and ARM82 backends (CPU) and the OpenCL and Vulkan backends (GPU).
The calculation process of a convolutional layer
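As a concrete reference for that process, here is a naive single-channel, stride-1, no-padding sliding-window convolution; a sketch for illustration, not MNN's implementation:

```cpp
#include <cstddef>
#include <vector>

// Naive 2-D convolution (single channel, stride 1, no padding):
// input H x W, kernel K x K, output (H-K+1) x (W-K+1).
std::vector<std::vector<float>> conv2d(const std::vector<std::vector<float>>& input,
                                       const std::vector<std::vector<float>>& kernel) {
    const std::size_t H = input.size(), W = input[0].size(), K = kernel.size();
    const std::size_t OH = H - K + 1, OW = W - K + 1;
    std::vector<std::vector<float>> out(OH, std::vector<float>(OW, 0.0f));
    for (std::size_t i = 0; i < OH; ++i)
        for (std::size_t j = 0; j < OW; ++j)
            for (std::size_t u = 0; u < K; ++u)
                for (std::size_t v = 0; v < K; ++v)
                    out[i][j] += input[i + u][j + v] * kernel[u][v];  // multiply-accumulate
    return out;
}
```

The four nested loops make the cost O(OH · OW · K · K) multiplications, which is the baseline that Winograd-style algorithms later reduce.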
The calculation process of a deconvolutional (transposed-convolution) layer
3Vincent Dumoulin and Francesco Visin (2016). "A guide to convolution arithmetic for deep learning". In: arXiv preprint arXiv:1603.07285.
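A 1-D sketch makes the deconvolution process concrete: each input element scatters a scaled copy of the kernel into the output at a stride-spaced offset, which is why the layer upsamples (single channel assumed; the `deconv1d` name is ours):

```cpp
#include <cstddef>
#include <vector>

// 1-D transposed convolution ("deconvolution") with stride s:
// output length (n-1)*s + k for n inputs and a k-tap kernel.
std::vector<float> deconv1d(const std::vector<float>& input,
                            const std::vector<float>& kernel,
                            std::size_t stride) {
    const std::size_t n = input.size(), k = kernel.size();
    std::vector<float> out((n - 1) * stride + k, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < k; ++j)
            out[i * stride + j] += input[i] * kernel[j];  // scatter-accumulate
    return out;
}
```

Overlapping scatter regions sum, matching the gradient-of-convolution view in Dumoulin and Visin's guide.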
4Jason Cong and Bingjun Xiao (2014). "Minimizing computation in convolutional neural networks". In: Proc. ICANN.
class XPUBackend final : public Backend {
public:
    XPUBackend(MNNForwardType type, MemoryMode mode);
    virtual ~XPUBackend();
    // Create an execution instance for an operator on this backend.
    virtual Execution* onCreate(const vector<Tensor*>& inputs,
                                const vector<Tensor*>& outputs, const MNN::Op* op);
    // Callbacks around each inference pass.
    virtual void onExecuteBegin() const;
    virtual void onExecuteEnd() const;
    // Buffer management for tensors owned by this backend.
    virtual bool onAcquireBuffer(const Tensor* tensor, StorageType storageType);
    virtual bool onReleaseBuffer(const Tensor* tensor, StorageType storageType);
    virtual bool onClearBuffer();
    // Copy tensor data between backends (e.g. host and device).
    virtual void onCopyBuffer(const Tensor* srcTensor, const Tensor* dstTensor) const;
};
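To see the pattern in action, here is a much-reduced, self-contained stand-in for that interface (hypothetical names and signatures, not MNN's real API): a toy CPU backend that only tracks buffer ownership, which is the part that pre-inference memory planning relies on:

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Minimal stand-in for a tensor: only its byte size matters here.
struct Tensor { std::size_t bytes; };

// Reduced backend abstraction: acquire/release buffers for tensors.
class Backend {
public:
    virtual ~Backend() = default;
    virtual bool onAcquireBuffer(const Tensor* t) = 0;
    virtual bool onReleaseBuffer(const Tensor* t) = 0;
    virtual std::size_t allocatedBytes() const = 0;
};

// Toy CPU backend: tracks allocations in a map keyed by tensor pointer.
class CpuBackend final : public Backend {
public:
    bool onAcquireBuffer(const Tensor* t) override {
        buffers_[t] = std::vector<char>(t->bytes);  // host allocation
        return true;
    }
    bool onReleaseBuffer(const Tensor* t) override {
        return buffers_.erase(t) == 1;
    }
    std::size_t allocatedBytes() const override {
        std::size_t total = 0;
        for (const auto& kv : buffers_) total += kv.second.size();
        return total;
    }
private:
    std::map<const Tensor*, std::vector<char>> buffers_;
};
```

A GPU backend would implement the same calls with device allocations, which is exactly what lets the runtime swap backends without touching operator code.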
5Andrew Lavin and Scott Gray (2016). "Fast Algorithms for Convolutional Neural Networks". In: Proc. CVPR.
Figure: merged Winograd convolution F(2x2, 3x3), in the notation of Lavin and Gray (T = number of tiles, ic/oc = input/output channel counts):
- Input tiles Y, shape (4, 4, T, ic), are transformed: Y' = B^T Y B, shape (4, 4, T, ic).
- The kernel X, shape (3, 3, ic, oc), is transformed: X' = G X G^T, shape (4, 4, ic, oc).
- Merge: at each of the 4x4 transform positions (j, k), one matrix multiplication over channels, Z'_{jk} = sum_l Y'_{jkl} . X'_{jkl}, i.e. (T, ic) x (ic, oc) -> (T, oc), giving Z' of shape (4, 4, T, oc).
- Inverse transform: Z = A^T Z' A, output tiles of shape (2, 2, T, oc).
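The merged 2-D transform nests the 1-D Winograd algorithm F(2,3) from Lavin and Gray in both dimensions; a minimal sketch of F(2,3), which computes two outputs of a 3-tap 1-D convolution with 4 multiplications instead of 6:

```cpp
#include <array>

// Winograd F(2,3): y0 = d0*g0 + d1*g1 + d2*g2, y1 = d1*g0 + d2*g1 + d3*g2,
// computed with 4 multiplications (m1..m4) instead of 6.
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,
                                  const std::array<float, 3>& g) {
    const float m1 = (d[0] - d[2]) * g[0];
    const float m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) * 0.5f;
    const float m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) * 0.5f;
    const float m4 = (d[1] - d[3]) * g[2];
    return { m1 + m2 + m3, m2 - m3 - m4 };
}
```

In practice the filter-side factors (g0 + g1 + g2)/2 etc. are precomputed once per kernel, so only the data-side additions and the 4 multiplications remain per tile.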
Figure: memory management via pre-inference. For a computational graph Conv → Pool, naive execution mixes compute and memory operations (Alloc 0, Alloc 1, Convolution, Free 0, Alloc 2, Pool, Free 1, Free 2). MNN splits this into memory management (Alloc 0, Alloc 1, Free 0, Alloc 2, Free 1, Free 2, executed once during pre-inference) and pure compute (Convolution, Pool, executed on every inference).
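This split works because every intermediate tensor's size and lifetime are known after pre-inference. A minimal sketch (hypothetical `TensorLife`/`peakArenaBytes` names, not MNN's code): compute the peak of simultaneously live bytes over the topologically ordered ops, which is the arena size a single up-front allocation must cover:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A tensor's byte size and its live interval [first_use, last_use)
// over the topologically ordered operator list.
struct TensorLife { std::size_t bytes, first_use, last_use; };

// Peak of simultaneously live bytes across all ops: the size of one
// arena that can be allocated once at pre-inference and reused per run.
std::size_t peakArenaBytes(const std::vector<TensorLife>& tensors,
                           std::size_t num_ops) {
    std::size_t peak = 0;
    for (std::size_t op = 0; op < num_ops; ++op) {
        std::size_t live = 0;
        for (const auto& t : tensors)
            if (t.first_use <= op && op < t.last_use) live += t.bytes;
        peak = std::max(peak, live);
    }
    return peak;
}
```

With the arena sized this way, each inference run performs zero Alloc/Free calls; tensors whose lifetimes do not overlap can share the same offsets.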
Figure: case-by-case optimization. Each convolution variant, Conv (3x3s1d1), Conv (3x3s2d1), and Conv (3x3s1d2), gets its own hand-written kernel (func_3x3s1d1, func_3x3s2d1, func_3x3s1d2, each with Winograd / sliding-window implementations), and the optimization techniques (SIMD, multi-threading, pipelining, ...) must be re-applied to every kernel separately.
Figure: semi-automated search. All convolution variants (3x3s1d1, 3x3s2d1, 3x3s1d2) are lowered onto basic matrix multiplication; at pre-inference, candidate algorithms (sliding window, Winograd, Strassen, 1x1 conv, ...) are costed (cost1, cost2, cost3, ...) and the least-cost scheme is selected, so the optimization techniques (SIMD, multi-threading, data re-ordering, pipelining, ...) need only be applied to the shared matrix-multiplication kernel.
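The search step can be sketched with toy cost models (illustrative multiplication counts under simplifying assumptions, not MNN's actual cost formulas): estimate the multiplications each scheme needs for a given convolution shape and pick the cheaper one:

```cpp
#include <cstddef>
#include <string>

// Convolution shape: output height/width, input/output channels, kernel size.
struct ConvShape { std::size_t oh, ow, ic, oc, k; };

// Sliding window: one multiply per kernel tap per output element.
std::size_t slidingWindowMuls(const ConvShape& s) {
    return s.oh * s.ow * s.oc * s.ic * s.k * s.k;
}

// Winograd F(2x2,3x3): each 4x4 transformed tile needs 16 multiplies per
// (ic, oc) pair and covers a 2x2 output patch; transform overhead ignored.
std::size_t winogradMuls(const ConvShape& s) {
    const std::size_t tiles = ((s.oh + 1) / 2) * ((s.ow + 1) / 2);
    return tiles * 16 * s.ic * s.oc;
}

// Pre-inference picks the scheme with the lower estimated cost.
std::string pickScheme(const ConvShape& s) {
    return winogradMuls(s) < slidingWindowMuls(s) ? "winograd" : "sliding_window";
}
```

For a 3x3 kernel Winograd wins (16 muls per tile vs. 36 for the four outputs it covers), while for a 1x1 kernel plain matrix multiplication is already cheaper, matching why the search must run per shape rather than being fixed at compile time.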
Figure: CPU inference latency (ms) of NCNN, MACE, TF-Lite, CoreML, and MNN on iPhoneX, iPhone8, Mate20, and MI6 across six benchmark networks; MNN achieves the lowest or near-lowest latency in most settings.
Figure: GPU inference latency (ms) on iOS (CoreML, TF-Lite, MNN) and on Android (NCNN with Vulkan, MACE with OpenCL, TF-Lite with OpenGL, and MNN with Vulkan, OpenCL, and OpenGL) on iPhoneX, iPhone8, Mate20, and MI6.