SLIDE 1

CMSC5743 L03: CNN Accurate Speedup II

Bei Yu

(Latest update: September 29, 2020)

Fall 2020

1 / 33

SLIDE 2

Overview

MNN Architecture
MNN Backend and Runtime

2 / 33

SLIDE 3

Overview

MNN Architecture
MNN Backend and Runtime

3 / 33

SLIDE 4

Architecture¹

¹ Xiaotang Jiang et al. (2020). “MNN: A Universal and Efficient Inference Engine”. In: Proc. MLSys.

3 / 33

SLIDE 5

Frontends

◮ Caffe Deep Learning Framework
◮ TensorFlow Deep Learning Framework
◮ PyTorch Deep Learning Framework

4 / 33

SLIDE 6

PyTorch

◮ PyTorch is a Python package that provides two high-level features:
  ◮ Tensor computation (like NumPy) with strong GPU acceleration
  ◮ Deep neural networks built on a tape-based autograd system
◮ Model deployment:
  ◮ For high-performance inference deployment of a trained model, export it to ONNX format, then optimize and deploy it with NVIDIA TensorRT or the MNN inference engine

5 / 33

SLIDE 7

PyTorch Code Sample
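The original code sample is an image that did not survive extraction. Below is a minimal sketch of the deployment path described on the previous slide: exporting a trained PyTorch model to ONNX so it can be converted for TensorRT or MNN. The model choice (torchvision resnet18), input size, and file name are illustrative, not from the slides.

import torch
import torchvision

# a trained model would normally be loaded here; resnet18 is just a placeholder
model = torchvision.models.resnet18(pretrained=False).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # NCHW example input

# export to ONNX; the .onnx file can then be fed to TensorRT or MNN's converter
torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["input"], output_names=["output"])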

6 / 33

SLIDE 8

TensorFlow

◮ TensorFlow is an open-source software library for numerical computation using data flow graphs
◮ Model deployment:
  ◮ For high-performance inference deployment of a trained model, use the TensorFlow-MNN integration to optimize the model within TensorFlow and deploy it with the MNN inference engine

7 / 33

SLIDE 9

TensorFlow Code Sample
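The TensorFlow sample is likewise missing from the extraction. As a stand-in, here is a minimal sketch of the same idea on the TensorFlow side: build or load a trained model, run it, and save it in a format (a SavedModel here) that MNN's offline converter can consume. The MobileNetV2 model and directory name are placeholders, not from the slides.

import numpy as np
import tensorflow as tf

# placeholder for a trained model
model = tf.keras.applications.MobileNetV2(weights=None)

x = np.random.rand(1, 224, 224, 3).astype("float32")   # NHWC, TensorFlow's default layout
y = model(x)
print(y.shape)

# save for deployment; the SavedModel (or a frozen .pb) is then converted
# offline into an .mnn model with MNN's converter tool
model.save("mobilenet_v2_savedmodel")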

8 / 33

SLIDE 10

Caffe

◮ Caffe is a deep learning framework made with expression, speed, and modularity in mind:
  ◮ Expressive architecture encourages application and innovation
  ◮ Extensible code fosters active development
  ◮ Speed makes Caffe perfect for research experiments and industry deployment
◮ Model deployment:
  ◮ For high-performance inference deployment of a trained model, use the Caffe-MNN integration to optimize the model within Caffe and deploy it with the MNN inference engine

9 / 33

SLIDE 11

Caffe Code Sample
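The Caffe sample is also missing. Shown instead is a minimal pycaffe sketch of loading a trained model and running one forward pass; the prototxt/caffemodel file names and the "data" blob name are assumptions that depend on the actual network definition.

import numpy as np
import caffe

caffe.set_mode_cpu()

# deploy.prototxt / weights.caffemodel are placeholders for a trained network
net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)

# fill the input blob (NCHW) and run one forward pass
net.blobs["data"].data[...] = np.random.rand(1, 3, 224, 224).astype("float32")
out = net.forward()
print({k: v.shape for k, v in out.items()})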

10 / 33

SLIDE 12

Data Layout Formats²

◮ N is the batch size
◮ C is the number of feature maps
◮ H is the image height
◮ W is the image width

² https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html

11 / 33

SLIDE 13

NCHW Memory Layout

◮ Begin with the first channel (c = 0), elements arranged contiguously in row-major order
◮ Continue with the second and subsequent channels until all channels are laid out
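A quick way to see this ordering is to flatten a small NCHW tensor with NumPy; a C-ordered array in NCHW shape is stored exactly as described above. The tiny 1x2x3x2 shape is only for illustration.

import numpy as np

x = np.arange(12).reshape(1, 2, 3, 2)   # N=1, C=2, H=3, W=2, stored in NCHW order
print(x.ravel())
# [ 0  1  2  3  4  5  6  7  8  9 10 11]
# elements 0-5 are channel 0 in row-major order, elements 6-11 are channel 1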

12 / 33

SLIDE 14

NHWC Memory Layout

◮ Begin with the first element of channel 0, then proceed to the first element of channel 1, and so on, until the first elements of all the C channels are laid out
◮ Next, select the second element of channel 0, then proceed to the second element of channel 1, and so on, until the second elements of all the channels are laid out
◮ Follow the row-major order of channel 0 and complete all the elements
◮ Proceed to the next batch (if N > 1)
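The same tiny tensor from the previous slide makes the NHWC ordering visible: transposing the NCHW array to NHWC and flattening interleaves the channels exactly as described above.

import numpy as np

x_nchw = np.arange(12).reshape(1, 2, 3, 2)                    # N=1, C=2, H=3, W=2
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))   # N, H, W, C
print(x_nhwc.ravel())
# [ 0  6  1  7  2  8  3  9  4 10  5 11]
# element 0 of channel 0, element 0 of channel 1, element 1 of channel 0, ...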

13 / 33

SLIDE 15

NHWC Memory Layout

14 / 33

SLIDE 16

Overview

MNN Architecture
MNN Backend and Runtime

15 / 33

SLIDE 17

Overview of the proposed Mobile Neural Network

[Figure: MNN overview. Upstream model information: framework diversity (e.g., ONNX) and operator diversity (kernel/stride/dilation, e.g., Conv 1x1s1d1, Conv 3x3s1d1, Conv 3x3s1d2, Deconv 3x3s2d1, ...) across network types (CNN, RNN, LSTM, GAN, Transformer). Downstream IoT device information: diversity in hardware/OS/standard (CPU, Metal, OpenGL, OpenCL, Vulkan) and limited resources (computation/memory). Given both, MNN searches over candidate computation schemes (Scheme A, B, C, ...) and the inference engine runs the best, i.e. fastest, scheme.]

15 / 33

SLIDE 18

On-device inference

[Figure: MNN workflow. Offline conversion: a format converter turns the source model (e.g., ONNX) into an unoptimized MNN model, which the model compressor and offline graph optimizer turn into an optimized MNN model. On-device inference: at pre-inference time, MNN loads the model info and backend info, checks which backends (ARM, ARM82, OpenCL, Vulkan) are valid on the device (e.g., Pixel 2, MI6, Mate 20), searches the computational graph for the optimal computation scheme, allocates the matching memory, and pre-computes constants; the session then only fills the input, runs inference, and writes the output. A backend abstraction separates CPU backends (ARM, ARM82) from GPU backends (OpenCL, Vulkan).]

16 / 33

SLIDE 19

What is Convolution?

The calculation process of a convolutional layer:

◮ No padding
◮ Unit strides
◮ 3 × 3 kernel size
◮ 4 × 4 input feature map
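A direct NumPy sketch of exactly this case (4x4 input, 3x3 kernel, no padding, stride 1) shows why the output is 2x2; this is illustrative reference code, not how an optimized engine computes convolution.

import numpy as np

def conv2d(x, k):
    # direct convolution (cross-correlation), no padding, unit stride
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input feature map
k = np.ones((3, 3))                            # 3x3 kernel
print(conv2d(x, k).shape)                      # (2, 2): (4 - 3 + 1) in each dimension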

17 / 33

SLIDE 20

What is Deconvolution (transposed convolution)?³

The calculation process of a deconvolutional layer:

◮ 2 × 2 padding with a border of zeros
◮ Unit strides
◮ 3 × 3 kernel size
◮ 4 × 4 input feature map

³ Vincent Dumoulin and Francesco Visin (2016). “A guide to convolution arithmetic for deep learning”. In: arXiv preprint arXiv:1603.07285.
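For stride 1, the transposed convolution described above can be computed as an ordinary convolution over the input padded with a 2-pixel border of zeros, using the flipped kernel. The sketch below is only meant to show that equivalence; the random data is a placeholder.

import numpy as np

def conv2d(x, k):
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def transposed_conv2d(x, k):
    # stride-1 transposed convolution == direct convolution over the input
    # padded with (kernel_size - 1) zeros on every side, with the flipped kernel
    p = k.shape[0] - 1                      # 2-pixel zero border for a 3x3 kernel
    return conv2d(np.pad(x, p), k[::-1, ::-1])

x = np.random.rand(4, 4)                    # 4x4 input feature map
k = np.random.rand(3, 3)                    # 3x3 kernel
print(transposed_conv2d(x, k).shape)        # (6, 6)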

18 / 33

SLIDE 21

Strassen Algorithm⁴

⁴ Jason Cong and Bingjun Xiao (2014). “Minimizing computation in convolutional neural networks”. In: Proc. ICANN, pp. 281–290.
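Since the Strassen slide itself is only a figure, here is a minimal sketch of the classic recursion: seven recursive multiplications of half-size blocks instead of eight, falling back to ordinary multiplication below a leaf size. It assumes square matrices whose size is a power of two; the leaf value 64 is arbitrary.

import numpy as np

def strassen(A, B, leaf=64):
    n = A.shape[0]
    if n <= leaf:
        return A @ B                      # small blocks: ordinary multiplication
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # seven products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.allclose(strassen(A, B), A @ B))   # True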

19 / 33

SLIDE 22

Strassen Algorithm

Matrix size          w/o Strassen   w/ Strassen
(256, 256, 256)      23             23
(512, 512, 512)      191            176 (↓ 7.9%)
(512, 512, 1024)     388            359 (↓ 7.5%)
(1024, 1024, 1024)   1501           1299 (↓ 13.5%)

class XPUBackend final : public Backend {
public:
    XPUBackend(MNNForwardType type, MemoryMode mode);
    virtual ~XPUBackend();

    // create an executor for one operator on this backend
    virtual Execution* onCreate(const vector<Tensor*>& inputs,
                                const vector<Tensor*>& outputs, const MNN::Op* op);

    virtual void onExecuteBegin() const;
    virtual void onExecuteEnd() const;

    // memory management hooks
    virtual bool onAcquireBuffer(const Tensor* tensor, StorageType storageType);
    virtual bool onReleaseBuffer(const Tensor* tensor, StorageType storageType);
    virtual bool onClearBuffer();
    virtual void onCopyBuffer(const Tensor* srcTensor, const Tensor* dstTensor) const;
};

20 / 33

SLIDE 23

Winograd Algorithm⁵

⁵ Andrew Lavin and Scott Gray (2016). “Fast Algorithms for Convolutional Neural Networks”. In: Proc. CVPR, pp. 4013–4021.
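The Winograd slides are figures that did not extract, so the sketch below shows the core identity for F(2x2, 3x3) with the standard Lavin-Gray transform matrices: a 4x4 input tile and a 3x3 kernel produce a 2x2 output tile with 16 element-wise multiplications instead of 36.

import numpy as np

# standard F(2x2, 3x3) transform matrices (Lavin & Gray)
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    U = G @ g @ G.T             # transformed 3x3 kernel      -> 4x4
    V = BT @ d @ BT.T           # transformed 4x4 input tile  -> 4x4
    return AT @ (U * V) @ AT.T  # 2x2 output tile, 16 multiplies instead of 36

d, g = np.random.rand(4, 4), np.random.rand(3, 3)
# reference: direct 3x3 correlation over the 4x4 tile
ref = np.array([[(d[i:i + 3, j:j + 3] * g).sum() for j in range(2)] for i in range(2)])
print(np.allclose(winograd_f2x2_3x3(d, g), ref))   # True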

21 / 33

SLIDE 29

Optimized Winograd algorithm in MNN

[Figure: the optimized Winograd algorithm in MNN for F(2×2, 3×3), using the standard transform matrices G, B, A.]

◮ Weight transform: X' = G X Gᵀ, turning the kernel X of shape (3, 3, ic, oc) into X' of shape (4, 4, ic, oc)
◮ Input transform: Y' = Bᵀ Y B, turning the input tiles Y of shape (4, 4, V, ic) into Y' of shape (4, 4, V, ic)
◮ Merge: the element-wise products over input channels become one matrix multiplication per transform position (j, k),

    Z'_{jk} = Σ_l Y'_{jk}[l] · X'_{jk}[l],

  i.e., a (V, ic) × (ic, oc) product for each of the 16 positions, giving Z' of shape (4, 4, V, oc)
◮ Output transform: Z = Aᵀ Z' A, giving output tiles of shape (2, 2, V, oc)
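To make the merge step concrete, here is a small NumPy sketch; the tensor shapes follow the slide, while the V, ic, oc values are arbitrary. After the input and weight transforms, the per-channel products at every transform position (j, k) collapse into one (V, ic) × (ic, oc) matrix multiplication.

import numpy as np

V, ic, oc = 16, 8, 32                 # arbitrary illustrative sizes
Yp = np.random.rand(4, 4, V, ic)      # transformed input tiles  Y'
Xp = np.random.rand(4, 4, ic, oc)     # transformed weights      X'

# one matrix multiplication per transform position (j, k): Z'_{jk} = Y'_{jk} @ X'_{jk}
Zp = np.einsum("jkvi,jkio->jkvo", Yp, Xp)
print(Zp.shape)                       # (4, 4, V, oc), mapped back to 2x2 tiles by A^T Z' A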

22 / 33

SLIDE 30

Memory optimization of MNN

[Figure: memory optimization in MNN. Instead of mixing compute and memory management at inference time (Alloc 0, Alloc 1, Convolution, Free 0, Alloc 2, Pool, Free 1, Free 2), MNN separates them: all memory management for the computational graph (Alloc 0, Alloc 1, Free 0, Alloc 2, Free 1, Free 2) is done once at pre-inference, leaving pure compute (Convolution, Pool) at inference time.]

◮ MNN can infer the exact memory required for the entire graph (a minimal sketch follows) by:
  ◮ virtually walking through all operations
  ◮ summing up all allocations and frees
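A toy sketch of that idea: walk the graph once, pretend to allocate each output and free each tensor after its last consumer, and record the peak. The three-operator graph and byte counts are made up for illustration.

def plan_peak_memory(ops):
    # ops: (name, bytes_allocated_for_outputs, bytes_freed_after_this_op)
    live = peak = 0
    for name, alloc_bytes, free_bytes in ops:
        live += alloc_bytes           # "Alloc" outputs of this op
        peak = max(peak, live)
        live -= free_bytes            # "Free" tensors whose last consumer is this op
    return peak

# hypothetical graph: Conv1 -> Conv2 -> Pool
ops = [("Conv1", 4096, 0), ("Conv2", 4096, 4096), ("Pool", 1024, 4096)]
print(plan_peak_memory(ops))          # 8192: the exact peak, known before inference runs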

23 / 33

SLIDE 31

Inference in FP16

◮ Training in FP32 and inference in FP16 is expected to give the same accuracy as FP32 most of the time
◮ Add batch normalization to activations
◮ If the input is integer RGB (0–255), normalize it to float (0–1)
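A minimal PyTorch sketch of this "train in FP32, infer in FP16" recipe (my example, not from the slides): cast the FP32 model and input to half precision for inference; a GPU is assumed, since half-precision ops on CPU are limited.

import torch
import torch.nn as nn

# a small FP32 model standing in for a trained network (note the batch norm)
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).eval()
x = torch.rand(1, 3, 32, 32)           # RGB input already normalized to [0, 1] floats

if torch.cuda.is_available():          # FP16 inference path
    model, x = model.half().cuda(), x.half().cuda()

with torch.no_grad():
    y = model(x)
print(y.dtype)                         # torch.float16 on GPU, torch.float32 on CPU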

24 / 33

SLIDE 32

Analysis of FP16 inference

◮ Advantages of FP16:
  ◮ FP16 improves speed (TFLOPS) and performance
  ◮ FP16 reduces the memory usage of a neural network
  ◮ FP16 data transfers are faster than FP32 transfers
◮ Disadvantages of FP16:
  ◮ FP16 values must be converted to or from 32-bit floats before they are operated on

25 / 33

SLIDE 33

Neon optimization

◮ As a programmer, there are several ways you can use Neon technology:

  ◮ Neon intrinsics
  ◮ Neon-enabled libraries
  ◮ Auto-vectorization by your compiler
  ◮ Hand-coded Neon assembler

26 / 33

SLIDE 34

Why use Neon

◮ Support for both integer and floating-point operations ensures adaptability to a broad range of applications, from codecs to high-performance computing to 3D graphics
◮ Tight coupling to the Arm processor provides a single instruction stream and a unified view of memory, presenting a single development-platform target with a simpler tool flow

27 / 33

SLIDE 35

Manual Search

[Figure: manual search. Each convolution case — Conv (3x3s1d1), Conv (3x3s2d1), Conv (3x3s1d2) — gets its own hand-written kernel (func_3x3s1d1, func_3x3s2d1, func_3x3s1d2, each Winograd or sliding window), and every kernel is manually tuned with the same optimization techniques: SIMD, multi-threading, pipelining, ...]

28 / 33

SLIDE 36

Semi-automated Search

[Figure: semi-automated search. Conv (3x3s1d1), Conv (3x3s2d1), and Conv (3x3s1d2) are all lowered onto a scheme pool built from basic matrix multiplication: Strassen, Winograd, sliding window, 1x1 conv, ... At pre-inference time, MNN searches the pool, evaluates the cost of each scheme (cost1, cost2, 3, 4, ...), and picks the optimal scheme with the least cost; the underlying matrix multiplication is optimized once with SIMD, multi-threading, data re-ordering, pipelining, ...]
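The selection itself is simple once every valid scheme has a cost; the sketch below just picks the least-cost entry. The scheme names and cost numbers are placeholders, whether the costs come from a model (e.g., estimated MACs) or from timing each scheme once at pre-inference.

def choose_scheme(costs):
    # pick the computation scheme with the least estimated cost
    return min(costs, key=costs.get)

costs = {                     # hypothetical per-scheme costs for one convolution shape
    "sliding_window": 5.2e8,
    "winograd": 2.4e8,
    "strassen_matmul": 3.1e8,
}
print(choose_scheme(costs))   # -> "winograd"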

29 / 33

SLIDE 37

Automated Search

[Figure: automated search. Each convolution case — Conv (3x3s1d1), Conv (3x3s2d1), Conv (3x3s1d2) — goes through IR-level optimization and auto-tuning (search), followed by LLVM code generation, yielding a separate generated library per case.]

30 / 33

SLIDE 38

Performance on different smartphones and networks

◮ Generally, MNN outperforms the other inference engines under almost all settings by about 20%–40%, regardless of the smartphones, backends, and networks
◮ For CPU, on average, 4-thread inference with MNN is about 30% faster than the others on iOS platforms, and about 34% faster on Android platforms
◮ For the Metal GPU backend on iPhones, MNN is much faster than TF-Lite, and a little slower than CoreML but still comparable

31 / 33

SLIDE 39

Performance on different smartphones and networks

[Bar charts: CPU inference performance of NCNN, MACE, TF-Lite, CoreML, and MNN on iPhoneX, iPhone8, Mate20, and MI6 across several networks; MNN has the lowest latency in most settings.]

32 / 33

SLIDE 40

Performance on different smartphones and networks

[Bar charts: GPU inference performance on iOS (MNN Metal vs. TF-Lite vs. CoreML on iPhoneX and iPhone8) and Android (NCNN Vulkan, MACE OpenCL, TF-Lite OpenGL, MNN Vulkan/OpenCL/OpenGL on Mate20 and MI6).]

33 / 33