SLIDE 1

CMSC5743 L03: CNN Accurate Speedup II

Bei Yu

(Latest update: September 29, 2020)

Fall 2020

1 / 33

SLIDE 2

Overview

MNN Architecture
MNN Backend and Runtime

2 / 33

SLIDE 3

Overview

MNN Architecture
MNN Backend and Runtime

3 / 33

SLIDE 4

Architecture¹

¹ Xiaotang Jiang et al. (2020). “MNN: A Universal and Efficient Inference Engine”. In: Proc. MLSys.

3 / 33

SLIDE 5

Frontends

◮ Caffe Deep Learning Framework
◮ TensorFlow Deep Learning Framework
◮ PyTorch Deep Learning Framework

4 / 33

SLIDE 6

PyTorch

◮ PyTorch is a Python package that provides two high-level features:
  ◮ Tensor computation (like NumPy) with strong GPU acceleration
  ◮ Deep neural networks built on a tape-based autograd system
◮ Model deployment:
  ◮ For high-performance inference deployment of a trained model, export it to ONNX format, then optimize and deploy it with NVIDIA TensorRT or the MNN inference engine

5 / 33

SLIDE 7

PyTorch Code Sample
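The original code sample is an image that did not survive extraction. Below is a minimal sketch of the deployment path described on the previous slide: exporting a trained PyTorch model to ONNX so it can be converted for TensorRT or MNN. The model choice (torchvision resnet18), input size, and file name are illustrative, not from the slides.

import torch
import torchvision

# a trained model would normally be loaded here; resnet18 is just a placeholder
model = torchvision.models.resnet18(pretrained=False).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # NCHW example input

# export to ONNX; the .onnx file can then be fed to TensorRT or MNN's converter
torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["input"], output_names=["output"])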

6 / 33

SLIDE 8

TensorFlow

◮ TensorFlow is an open-source software library for numerical computation using data flow graphs
◮ Model deployment:
  ◮ For high-performance inference deployment of a trained model, use the TensorFlow-MNN integration to optimize the model within TensorFlow and deploy it with the MNN inference engine

7 / 33

SLIDE 9

TensorFlow Code Sample
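The TensorFlow sample is likewise missing from the extraction. As a stand-in, here is a minimal sketch of the same idea on the TensorFlow side: build or load a trained model, run it, and save it in a format (a SavedModel here) that MNN's offline converter can consume. The MobileNetV2 model and directory name are placeholders, not from the slides.

import numpy as np
import tensorflow as tf

# placeholder for a trained model
model = tf.keras.applications.MobileNetV2(weights=None)

x = np.random.rand(1, 224, 224, 3).astype("float32")   # NHWC, TensorFlow's default layout
y = model(x)
print(y.shape)

# save for deployment; the SavedModel (or a frozen .pb) is then converted
# offline into an .mnn model with MNN's converter tool
model.save("mobilenet_v2_savedmodel")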

8 / 33

SLIDE 10

Caffe

◮ Caffe is a deep learning framework made with expression, speed, and modularity in mind:
  ◮ Expressive architecture encourages application and innovation
  ◮ Extensible code fosters active development
  ◮ Speed makes Caffe perfect for research experiments and industry deployment
◮ Model deployment:
  ◮ For high-performance inference deployment of a trained model, use the Caffe-MNN integration to optimize the model within Caffe and deploy it with the MNN inference engine

9 / 33

SLIDE 11

Caffe Code Sample
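The Caffe sample is also missing. Shown instead is a minimal pycaffe sketch of loading a trained model and running one forward pass; the prototxt/caffemodel file names and the "data" blob name are assumptions that depend on the actual network definition.

import numpy as np
import caffe

caffe.set_mode_cpu()

# deploy.prototxt / weights.caffemodel are placeholders for a trained network
net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)

# fill the input blob (NCHW) and run one forward pass
net.blobs["data"].data[...] = np.random.rand(1, 3, 224, 224).astype("float32")
out = net.forward()
print({k: v.shape for k, v in out.items()})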

10 / 33

SLIDE 12

Data Layout Formats²

◮ N is the batch size
◮ C is the number of feature maps
◮ H is the image height
◮ W is the image width

² https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html

11 / 33

SLIDE 13

NCHW Memory Layout

◮ Begin with the first channel (c = 0), elements arranged contiguously in row-major order
◮ Continue with the second and subsequent channels until all channels are laid out
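A quick way to see this ordering is to flatten a small NCHW tensor with NumPy; a C-ordered array in NCHW shape is stored exactly as described above. The tiny 1x2x3x2 shape is only for illustration.

import numpy as np

x = np.arange(12).reshape(1, 2, 3, 2)   # N=1, C=2, H=3, W=2, stored in NCHW order
print(x.ravel())
# [ 0  1  2  3  4  5  6  7  8  9 10 11]
# elements 0-5 are channel 0 in row-major order, elements 6-11 are channel 1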

12 / 33

SLIDE 14

NHWC Memory Layout

◮ Begin with the first element of channel 0, then proceed to the first element of channel 1, and so on, until the first elements of all the C channels are laid out
◮ Next, select the second element of channel 0, then proceed to the second element of channel 1, and so on, until the second elements of all the channels are laid out
◮ Follow the row-major order of channel 0 and complete all the elements
◮ Proceed to the next batch (if N > 1)
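The same tiny tensor from the previous slide makes the NHWC ordering visible: transposing the NCHW array to NHWC and flattening interleaves the channels exactly as described above.

import numpy as np

x_nchw = np.arange(12).reshape(1, 2, 3, 2)                    # N=1, C=2, H=3, W=2
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))   # N, H, W, C
print(x_nhwc.ravel())
# [ 0  6  1  7  2  8  3  9  4 10  5 11]
# element 0 of channel 0, element 0 of channel 1, element 1 of channel 0, ...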

13 / 33

SLIDE 15

NHWC Memory Layout

14 / 33

SLIDE 16

Overview

MNN Architecture
MNN Backend and Runtime

15 / 33

SLIDE 17

Overview of the proposed Mobile Neural Network

[Figure: MNN overview. Upstream model information: framework diversity (e.g., ONNX) and operator diversity (kernel/stride/dilation, e.g., Conv 1x1s1d1, Conv 3x3s1d1, Conv 3x3s1d2, Deconv 3x3s2d1, ...) across network types (CNN, RNN, LSTM, GAN, Transformer). Downstream IoT device information: diversity in hardware/OS/standard (CPU, Metal, OpenGL, OpenCL, Vulkan) and limited resources (computation/memory). Given both, MNN searches over candidate computation schemes (Scheme A, B, C, ...) and the inference engine runs the best, i.e. fastest, scheme.]

15 / 33

SLIDE 18

On-device inference

[Figure: MNN workflow. Offline conversion: a format converter turns the source model (e.g., ONNX) into an unoptimized MNN model, which the model compressor and offline graph optimizer turn into an optimized MNN model. On-device inference: at pre-inference time, MNN loads the model info and backend info, checks which backends (ARM, ARM82, OpenCL, Vulkan) are valid on the device (e.g., Pixel 2, MI6, Mate 20), searches the computational graph for the optimal computation scheme, allocates the matching memory, and pre-computes constants; the session then only fills the input, runs inference, and writes the output. A backend abstraction separates CPU backends (ARM, ARM82) from GPU backends (OpenCL, Vulkan).]

16 / 33

SLIDE 19

What is Convolution?

The calculation process of a convolutional layer:

◮ No padding
◮ Unit strides
◮ 3 × 3 kernel size
◮ 4 × 4 input feature map
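A direct NumPy sketch of exactly this case (4x4 input, 3x3 kernel, no padding, stride 1) shows why the output is 2x2; this is illustrative reference code, not how an optimized engine computes convolution.

import numpy as np

def conv2d(x, k):
    # direct convolution (cross-correlation), no padding, unit stride
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input feature map
k = np.ones((3, 3))                            # 3x3 kernel
print(conv2d(x, k).shape)                      # (2, 2): (4 - 3 + 1) in each dimension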

17 / 33

SLIDE 20

What is Deconvolution (transposed convolution)?³

The calculation process of a deconvolutional layer:

◮ 2 × 2 padding with a border of zeros
◮ Unit strides
◮ 3 × 3 kernel size
◮ 4 × 4 input feature map

³ Vincent Dumoulin and Francesco Visin (2016). “A guide to convolution arithmetic for deep learning”. In: arXiv preprint arXiv:1603.07285.
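For stride 1, the transposed convolution described above can be computed as an ordinary convolution over the input padded with a 2-pixel border of zeros, using the flipped kernel. The sketch below is only meant to show that equivalence; the random data is a placeholder.

import numpy as np

def conv2d(x, k):
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def transposed_conv2d(x, k):
    # stride-1 transposed convolution == direct convolution over the input
    # padded with (kernel_size - 1) zeros on every side, with the flipped kernel
    p = k.shape[0] - 1                      # 2-pixel zero border for a 3x3 kernel
    return conv2d(np.pad(x, p), k[::-1, ::-1])

x = np.random.rand(4, 4)                    # 4x4 input feature map
k = np.random.rand(3, 3)                    # 3x3 kernel
print(transposed_conv2d(x, k).shape)        # (6, 6)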

18 / 33

SLIDE 21

Strassen Algorithm⁴

⁴ Jason Cong and Bingjun Xiao (2014). “Minimizing computation in convolutional neural networks”. In: Proc. ICANN, pp. 281–290.
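Since the Strassen slide itself is only a figure, here is a minimal sketch of the classic recursion: seven recursive multiplications of half-size blocks instead of eight, falling back to ordinary multiplication below a leaf size. It assumes square matrices whose size is a power of two; the leaf value 64 is arbitrary.

import numpy as np

def strassen(A, B, leaf=64):
    n = A.shape[0]
    if n <= leaf:
        return A @ B                      # small blocks: ordinary multiplication
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # seven products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
print(np.allclose(strassen(A, B), A @ B))   # True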

19 / 33

SLIDE 22

Strassen Algorithm

Matrix size          w/o Strassen   w/ Strassen
(256, 256, 256)      23             23
(512, 512, 512)      191            176 (↓ 7.9%)
(512, 512, 1024)     388            359 (↓ 7.5%)
(1024, 1024, 1024)   1501           1299 (↓ 13.5%)

class XPUBackend final : public Backend {
public:
    XPUBackend(MNNForwardType type, MemoryMode mode);
    virtual ~XPUBackend();

    // create an executor for one operator on this backend
    virtual Execution* onCreate(const vector<Tensor*>& inputs,
                                const vector<Tensor*>& outputs, const MNN::Op* op);

    virtual void onExecuteBegin() const;
    virtual void onExecuteEnd() const;

    // memory management hooks
    virtual bool onAcquireBuffer(const Tensor* tensor, StorageType storageType);
    virtual bool onReleaseBuffer(const Tensor* tensor, StorageType storageType);
    virtual bool onClearBuffer();
    virtual void onCopyBuffer(const Tensor* srcTensor, const Tensor* dstTensor) const;
};

20 / 33

SLIDE 23

Winograd Algorithm⁵

⁵ Andrew Lavin and Scott Gray (2016). “Fast Algorithms for Convolutional Neural Networks”. In: Proc. CVPR, pp. 4013–4021.
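The Winograd slides are figures that did not extract, so the sketch below shows the core identity for F(2x2, 3x3) with the standard Lavin-Gray transform matrices: a 4x4 input tile and a 3x3 kernel produce a 2x2 output tile with 16 element-wise multiplications instead of 36.

import numpy as np

# standard F(2x2, 3x3) transform matrices (Lavin & Gray)
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    U = G @ g @ G.T             # transformed 3x3 kernel      -> 4x4
    V = BT @ d @ BT.T           # transformed 4x4 input tile  -> 4x4
    return AT @ (U * V) @ AT.T  # 2x2 output tile, 16 multiplies instead of 36

d, g = np.random.rand(4, 4), np.random.rand(3, 3)
# reference: direct 3x3 correlation over the 4x4 tile
ref = np.array([[(d[i:i + 3, j:j + 3] * g).sum() for j in range(2)] for i in range(2)])
print(np.allclose(winograd_f2x2_3x3(d, g), ref))   # True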

21 / 33

SLIDE 29

Optimized Winograd algorithm in MNN

[Figure: the optimized Winograd algorithm in MNN for F(2×2, 3×3), using the standard transform matrices G, B, A.]

◮ Weight transform: X' = G X Gᵀ, turning the kernel X of shape (3, 3, ic, oc) into X' of shape (4, 4, ic, oc)
◮ Input transform: Y' = Bᵀ Y B, turning the input tiles Y of shape (4, 4, V, ic) into Y' of shape (4, 4, V, ic)
◮ Merge: the element-wise products over input channels become one matrix multiplication per transform position (j, k),

    Z'_{jk} = Σ_l Y'_{jk}[l] · X'_{jk}[l],

  i.e., a (V, ic) × (ic, oc) product for each of the 16 positions, giving Z' of shape (4, 4, V, oc)
◮ Output transform: Z = Aᵀ Z' A, giving output tiles of shape (2, 2, V, oc)
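To make the merge step concrete, here is a small NumPy sketch; the tensor shapes follow the slide, while the V, ic, oc values are arbitrary. After the input and weight transforms, the per-channel products at every transform position (j, k) collapse into one (V, ic) × (ic, oc) matrix multiplication.

import numpy as np

V, ic, oc = 16, 8, 32                 # arbitrary illustrative sizes
Yp = np.random.rand(4, 4, V, ic)      # transformed input tiles  Y'
Xp = np.random.rand(4, 4, ic, oc)     # transformed weights      X'

# one matrix multiplication per transform position (j, k): Z'_{jk} = Y'_{jk} @ X'_{jk}
Zp = np.einsum("jkvi,jkio->jkvo", Yp, Xp)
print(Zp.shape)                       # (4, 4, V, oc), mapped back to 2x2 tiles by A^T Z' A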

22 / 33

SLIDE 30

Memory optimization of MNN

[Figure: memory optimization in MNN. Instead of mixing compute and memory management at inference time (Alloc 0, Alloc 1, Convolution, Free 0, Alloc 2, Pool, Free 1, Free 2), MNN separates them: all memory management for the computational graph (Alloc 0, Alloc 1, Free 0, Alloc 2, Free 1, Free 2) is done once at pre-inference, leaving pure compute (Convolution, Pool) at inference time.]

◮ MNN can infer the exact memory required for the entire graph (a minimal sketch follows) by:
  ◮ virtually walking through all operations
  ◮ summing up all allocations and frees
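A toy sketch of that idea: walk the graph once, pretend to allocate each output and free each tensor after its last consumer, and record the peak. The three-operator graph and byte counts are made up for illustration.

def plan_peak_memory(ops):
    # ops: (name, bytes_allocated_for_outputs, bytes_freed_after_this_op)
    live = peak = 0
    for name, alloc_bytes, free_bytes in ops:
        live += alloc_bytes           # "Alloc" outputs of this op
        peak = max(peak, live)
        live -= free_bytes            # "Free" tensors whose last consumer is this op
    return peak

# hypothetical graph: Conv1 -> Conv2 -> Pool
ops = [("Conv1", 4096, 0), ("Conv2", 4096, 4096), ("Pool", 1024, 4096)]
print(plan_peak_memory(ops))          # 8192: the exact peak, known before inference runs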

23 / 33

SLIDE 31

Inference in FP16

◮ Training in FP32 and inference in FP16 is expected to give the same accuracy as FP32 most of the time
◮ Add batch normalization to activations
◮ If the input is integer RGB (0–255), normalize it to float (0–1)
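A minimal PyTorch sketch of this "train in FP32, infer in FP16" recipe (my example, not from the slides): cast the FP32 model and input to half precision for inference; a GPU is assumed, since half-precision ops on CPU are limited.

import torch
import torch.nn as nn

# a small FP32 model standing in for a trained network (note the batch norm)
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).eval()
x = torch.rand(1, 3, 32, 32)           # RGB input already normalized to [0, 1] floats

if torch.cuda.is_available():          # FP16 inference path
    model, x = model.half().cuda(), x.half().cuda()

with torch.no_grad():
    y = model(x)
print(y.dtype)                         # torch.float16 on GPU, torch.float32 on CPU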

24 / 33

SLIDE 32

Analysis of FP16 inference

◮ Advantages of FP16:
  ◮ FP16 improves speed (TFLOPS) and performance
  ◮ FP16 reduces the memory usage of a neural network
  ◮ FP16 data transfers are faster than FP32 transfers
◮ Disadvantages of FP16:
  ◮ FP16 values must be converted to or from 32-bit floats before they are operated on

25 / 33

SLIDE 33

Neon optimization

◮ As a programmer, there are several ways you can use Neon technology:

  ◮ Neon intrinsics
  ◮ Neon-enabled libraries
  ◮ Auto-vectorization by your compiler
  ◮ Hand-coded Neon assembler

26 / 33

SLIDE 34

Why use Neon

◮ Support for both integer and floating-point operations ensures adaptability to a broad range of applications, from codecs to high-performance computing to 3D graphics
◮ Tight coupling to the Arm processor provides a single instruction stream and a unified view of memory, presenting a single development-platform target with a simpler tool flow

27 / 33

SLIDE 35

Manual Search

[Figure: manual search. Each convolution case — Conv (3x3s1d1), Conv (3x3s2d1), Conv (3x3s1d2) — gets its own hand-written kernel (func_3x3s1d1, func_3x3s2d1, func_3x3s1d2, each Winograd or sliding window), and every kernel is manually tuned with the same optimization techniques: SIMD, multi-threading, pipelining, ...]

28 / 33

SLIDE 36

Semi-automated Search

[Figure: semi-automated search. Conv (3x3s1d1), Conv (3x3s2d1), and Conv (3x3s1d2) are all lowered onto a scheme pool built from basic matrix multiplication: Strassen, Winograd, sliding window, 1x1 conv, ... At pre-inference time, MNN searches the pool, evaluates the cost of each scheme (cost1, cost2, 3, 4, ...), and picks the optimal scheme with the least cost; the underlying matrix multiplication is optimized once with SIMD, multi-threading, data re-ordering, pipelining, ...]
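The selection itself is simple once every valid scheme has a cost; the sketch below just picks the least-cost entry. The scheme names and cost numbers are placeholders, whether the costs come from a model (e.g., estimated MACs) or from timing each scheme once at pre-inference.

def choose_scheme(costs):
    # pick the computation scheme with the least estimated cost
    return min(costs, key=costs.get)

costs = {                     # hypothetical per-scheme costs for one convolution shape
    "sliding_window": 5.2e8,
    "winograd": 2.4e8,
    "strassen_matmul": 3.1e8,
}
print(choose_scheme(costs))   # -> "winograd"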

29 / 33

SLIDE 37

Automated Search

[Figure: automated search. Each convolution case — Conv (3x3s1d1), Conv (3x3s2d1), Conv (3x3s1d2) — goes through IR-level optimization and auto-tuning (search), followed by LLVM code generation, yielding a separate generated library per case.]

30 / 33

SLIDE 38

Performance on different smartphones and networks

◮ Generally, MNN outperforms the other inference engines under almost all settings by about 20%–40%, regardless of the smartphones, backends, and networks
◮ For CPU, on average, 4-thread inference with MNN is about 30% faster than the others on iOS platforms, and about 34% faster on Android platforms
◮ For the Metal GPU backend on iPhones, MNN is much faster than TF-Lite, and a little slower than CoreML but still comparable

31 / 33

SLIDE 39

Performance on different smartphones and networks

[Bar charts: CPU inference performance of NCNN, MACE, TF-Lite, CoreML, and MNN on iPhoneX, iPhone8, Mate20, and MI6 across several networks; MNN has the lowest latency in most settings.]

32 / 33

SLIDE 40

Performance on different smartphones and networks

[Bar charts: GPU inference performance on iOS (MNN Metal vs. TF-Lite vs. CoreML on iPhoneX and iPhone8) and Android (NCNN Vulkan, MACE OpenCL, TF-Lite OpenGL, MNN Vulkan/OpenCL/OpenGL on Mate20 and MI6).]

33 / 33