Deep Learning on Mobile Phones - A Practitioner's Guide - PowerPoint PPT Presentation

Anirudh Koul, Siddha Ganju, Meher Kasam


slide-1
SLIDE 1

Deep Learning on Mobile Phones

A Practitioner's Guide

Anirudh Koul, Siddha Ganju, Meher Kasam

slide-2
SLIDE 2
slide-3
SLIDE 3

Deep Learning on Mobile Phones

A Practitioner's Guide

Anirudh Koul, Siddha Ganju, Meher Kasam

slide-4
SLIDE 4

Anirudh Koul @AnirudhKoul

Head of AI & Research, Aira [Lastname]@aira.io

Siddha Ganju @SiddhaGanju

Architect, Self-Driving Vehicles, NVIDIA [FirstnameLastname]@gmail.com

Meher Anand Kasam @MeherKasam

Software Engineer, Square [FirstnameMiddlenameK]@gmail.com

slide-5
SLIDE 5

Why Deep Learning On Mobile?

  • Latency
  • Privacy

slide-6
SLIDE 6

Response Time Limits – Powers of 10

0.1 second: Reacting instantly
1.0 second: User's flow of thought
10 seconds: Keeping the user's attention

[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]

slide-7
SLIDE 7

Mobile Deep Learning Recipe

Mobile Inference Engine (Efficient) + Pretrained Model (Efficient) = DL App

slide-8
SLIDE 8

Building a DL App in _ time

slide-9
SLIDE 9

Building a DL App in 1 hour

slide-10
SLIDE 10

Use Cloud APIs for General Recognition Needs

  • Microsoft Cognitive Services
  • Clarifai
  • Google Cloud Vision
  • IBM Watson Services
  • Amazon Rekognition
slide-11
SLIDE 11

How to Choose a Computer Vision Based API?

Benchmark & compare them.

COCO-Text v2.0 for text reading in the wild:

  • ~2k random images
  • Candidate text has at least 2 characters together
  • Direct word match

COCO-Val 2017 for image tagging in the wild:

  • ~4k random images
  • Tag similarity match instead of word match
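The direct word match criterion above can be sketched as a small scoring function. This is a hypothetical minimal harness, not the one used for the numbers in this deck; `word_match_accuracy` and the sample data are illustrative.

```python
def word_match_accuracy(predictions, ground_truths):
    """Direct word match: an image counts as correct if any predicted
    candidate word (length >= 2) exactly matches a ground-truth word."""
    hits = 0
    for pred_words, true_words in zip(predictions, ground_truths):
        candidates = {w.lower() for w in pred_words if len(w) >= 2}
        truth = {w.lower() for w in true_words}
        if candidates & truth:
            hits += 1
    return hits / len(predictions)

# Example: 2 of 3 images have at least one exact word match.
preds = [["STOP"], ["Caf"], ["exit", "EXIT"]]
truth = [["stop", "sign"], ["cafe"], ["exit"]]
print(word_match_accuracy(preds, truth))  # ~0.667
```

The tag-similarity variant for image tagging would replace the exact set intersection with a concept-similarity lookup (e.g. 'military-officer' matching 'person').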

slide-12
SLIDE 12

Pricing

slide-13
SLIDE 13

Recognize Text Benchmarks

Text API                      Accuracy
Amazon Rekognition            45.4%
Google Cloud Vision           33.4%
Microsoft Cognitive Services  55.4%

Evaluation criteria:

  • Photos have candidate words with length >= 2
  • Direct word match with ground truth
slide-14
SLIDE 14

Image Tagging Benchmarks

Evaluation criteria:

  • Concept similarity match instead of word match
  • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’

API                           Accuracy
Amazon Rekognition            65%
Google Cloud Vision           47.6%
Microsoft Cognitive Services  50.0%

slide-15
SLIDE 15

Image Tagging Benchmarks

Evaluation criteria:

  • Concept similarity match instead of word match
  • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’

API                           Accuracy  Avg #Tags
Amazon Rekognition            65%       14
Google Cloud Vision           47.6%     14
Microsoft Cognitive Services  50.0%     8

slide-16
SLIDE 16

Image Tagging Benchmarks

Hard to do precision-recall since COCO ground truth tags are not exhaustive. A lower number of tags for a given accuracy indicates a higher F-measure.

API                           Accuracy  Avg #Tags
Amazon Rekognition            65%       14
Google Cloud Vision           47.6%     14
Microsoft Cognitive Services  50.0%     8

slide-17
SLIDE 17

Tips for reducing network latency

For text recognition:

  • A compression setting of up to 90% has little effect on accuracy, but gives drastic savings in size
  • Resizing is dangerous; text recognition needs a minimum size to work

For image recognition:

  • Resize so that min(height, width) = 224, at 50% compression with bilinear interpolation
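The image-recognition sizing rule above (smaller side scaled to 224, aspect ratio preserved) can be sketched without any imaging library; the actual bilinear resize and JPEG compression would be done with an image library before upload. `target_size` is an illustrative helper, not part of any API:

```python
def target_size(width, height, min_side=224):
    """Scale so the smaller dimension becomes min_side, keeping aspect
    ratio. The result is what you'd pass to a bilinear resize before
    compressing and uploading to a cloud vision API."""
    scale = min_side / min(width, height)
    return round(width * scale), round(height * scale)

# A 12 MP phone photo shrinks to a fraction of its pixel count.
print(target_size(4032, 3024))  # (299, 224)
```

Sending ~67k pixels instead of ~12M, plus 50% JPEG compression, is where the latency savings come from.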

slide-18
SLIDE 18

Building a DL App in 1 day

slide-19
SLIDE 19

http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

Figures: Energy to train a Convolutional Neural Network; Energy to use a Convolutional Neural Network

slide-20
SLIDE 20

Base Pretrained Model

ImageNet – 1000-class Object Categorizer: VGG16, Inception-v3, ResNet-50, MobileNet, SqueezeNet

slide-21
SLIDE 21

Running pre-trained models on mobile

Core ML, TensorFlow Lite, Caffe2

slide-22
SLIDE 22

Apple’s Ecosystem

Timeline: Metal (2014) → BNNS + MPS (2016) → Core ML (2017) → Core ML 2 (2018)

slide-23
SLIDE 23

Apple’s Ecosystem

Metal

  • Low-level, low-overhead hardware-accelerated 3D graphics and compute shader application programming interface (API)
  • Available since iOS 8


slide-24
SLIDE 24

Apple’s Ecosystem

Fast low-level primitives:

  • BNNS – Basic Neural Network Subroutines; ideal case: fully connected NNs
  • MPS – Metal Performance Shaders; ideal case: convolutions

Inconvenient for large networks:

  • An Inception-v3 inference implementation consisted of ~1.5K lines of hard-coded model definition
  • Libraries like Forge by Matthijs Hollemans provide abstraction


slide-25
SLIDE 25

Apple’s Ecosystem

Convert a Caffe/TensorFlow model to a Core ML model in 3 lines:

import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')

Add the model to an iOS project and call it for prediction. Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM. Automatically minimizes memory footprint and power consumption.


slide-26
SLIDE 26

Apple’s Ecosystem

  • Model quantization support down to 1 bit
  • Batch API for improved performance
  • Conversion support for MXNet, ONNX
  • ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
  • Create ML for quick training
  • tf-coreml for direct conversion from TensorFlow


slide-27
SLIDE 27

Core ML Benchmark – Pick a DNN for your mobile architecture

Model         Top-1 Acc.  Size (MB)  iPhone 5S (ms)  iPhone 6 (ms)  iPhone 6S/SE (ms)  iPhone 7 (ms)  iPhone 8/X (ms)
VGG 16        71          553        7408            4556           235                181            146
Inception v3  78          95         727             637            114                90             78
ResNet 50     75          103        538             557            77                 74             71
MobileNet     71          17         129             109            44                 35             33
SqueezeNet    57          5          75              78             36                 30             29

(Devices span 2013–2017; huge improvement in GPU hardware in 2015.)

slide-28
SLIDE 28

Putting out more frames than an art gallery

slide-29
SLIDE 29

TensorFlow Ecosystem

Timeline: TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)

slide-30
SLIDE 30

TensorFlow Ecosystem

TensorFlow: the full, bulky deal

slide-31
SLIDE 31

TensorFlow Ecosystem

TensorFlow Mobile:

  • Easy pipeline to bring TensorFlow models to mobile
  • Excellent documentation
  • Optimizations to bring models to mobile

slide-32
SLIDE 32

TensorFlow Ecosystem

TensorFlow Lite:

  • Smaller
  • Faster
  • Minimal dependencies
  • Easier to package & deploy
  • Allows running custom operators

1-line conversion from Keras to TensorFlow Lite:

  • tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite

slide-33
SLIDE 33

TensorFlow Lite is small

  • ~75 KB for the core interpreter
  • ~400 KB for the core interpreter + supported operations
  • Compared to 1.5 MB for TensorFlow Mobile
slide-34
SLIDE 34

TensorFlow Lite is fast

  • Takes advantage of on-device hardware acceleration
  • Uses FlatBuffers
  • Reduces code footprint and memory usage
  • Reduces CPU cycles spent on serialization and deserialization
  • Improves startup time
  • Pre-fused activations
  • Combines the batch normalization layer with the previous convolution
  • Interpreter uses static memory and a static execution plan
  • Decreases load time
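The pre-fused batch-norm point above works because batch normalization at inference is an affine transform, so it folds into the preceding convolution's weights and bias. A scalar sketch of the algebra (per output channel; `fold_batchnorm` is an illustrative helper, not a TensorFlow Lite API):

```python
import math

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta
    into fused weights w', b' such that y = w'*x + b'."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Scalar example: the fused layer reproduces conv + BN exactly.
w, b = 0.5, 0.1
gamma, beta, mean, var = 2.0, 0.3, 0.4, 0.25
w_fused, b_fused = fold_batchnorm(w, b, gamma, beta, mean, var)

x = 3.0
conv_out = w * x + b
bn_out = gamma * (conv_out - mean) / math.sqrt(var + 1e-5) + beta
fused_out = w_fused * x + b_fused
print(abs(bn_out - fused_out) < 1e-9)  # True
```

After folding, the BN layer disappears entirely from the inference graph, which is where the speedup comes from.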
slide-35
SLIDE 35

TensorFlow Lite Architecture

slide-36
SLIDE 36

TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/

slide-37
SLIDE 37

TensorFlow Lite Benchmarks - http://ai-benchmark.com/

  • Crowdsourced benchmarking with the AI Benchmark Android app
  • By Andrey Ignatov from ETH Zurich
  • 9 tests, e.g. semantic segmentation, image super-resolution, face recognition
slide-38
SLIDE 38

TensorFlow Lite acceleration – GPU delegate (dev preview)

slide-39
SLIDE 39

Caffe2

  • From Facebook
  • Under 1 MB of binary size
  • Built for speed:
  • For ARM CPU: uses NEON kernels, NNPACK
  • For iPhone GPU: uses Metal Performance Shaders and Metal
  • For Android GPU: uses Qualcomm Snapdragon NPE (4-5x speedup)
  • ONNX format support to import models from CNTK/PyTorch

slide-40
SLIDE 40

Caffe2

slide-41
SLIDE 41

ML Kit

  • Simple, easy to use
  • Abstraction over TensorFlow Lite
  • Built-in image labeling, OCR, face detection, barcode scanning, landmark detection, smart reply
  • Model management with Firebase
  • Upload a model on the web interface to distribute
  • A/B testing

slide-42
SLIDE 42

ML Kit – Face Contours

By leveraging the GPU delegate: ~4x speedup on Pixel 3, ~6x speedup on iPhone 7

slide-43
SLIDE 43

Recommendation for production development

  • 1. Train a model using Keras
  • 2. Convert to TensorFlow Lite format (tflite_convert: Keras → .tflite file)
  • 3. Upload to Firebase
  • 4. Deploy to iOS/Android apps with ML Kit

slide-44
SLIDE 44

Common Questions

“My app has become too big to download. What do I do?”

  • iOS doesn't allow apps over 150 MB to be downloaded over cellular
  • Solution: download the model on demand and compile it on device
  • 0 MB change to app size on first install
slide-45
SLIDE 45

Common Questions

“Do I need to ship a new app update with every model improvement?”

  • Making app updates is a decent amount of overhead, plus ~2 days of review wait time
  • Solution: check for model updates, then download and compile on device
  • Easier solution: use a framework for model management, e.g.
  • Google ML Kit
  • Fritz
  • Numericcal
slide-46
SLIDE 46

Common Questions

“Why does my app not recognize objects at the top/bottom of the screen?”

  • Solution: check the cropping used; by default, it's a center crop ☺
slide-47
SLIDE 47

Building a DL App in 1 week

slide-48
SLIDE 48

Learning to play the accordion: 3 months

slide-49
SLIDE 49

Learning to play the accordion: 3 months. Already knows the piano?

Fine-Tune Skills

1 week

slide-50
SLIDE 50

I Got a Dataset, Now What?

Step 1: Find a pre-trained model
Step 2: Fine-tune the pre-trained model
Step 3: Run using existing frameworks

“Don't Be A Hero” - Andrej Karpathy
slide-51
SLIDE 51

How to find pretrained models for my task?

Model Zoo: https://modelzoo.co (300+ models)

Papers with Code: https://paperswithcode.com/sota

slide-52
SLIDE 52

AlexNet, 2012 (simplified)

[Krizhevsky, Sutskever, Hinton 2012]; Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”; n-dimensional feature representation

slide-53
SLIDE 53

Deciding how to fine-tune

Size of New Dataset | Similarity to Original Dataset | What to do?
Large | High | Fine-tune.
Small | High | Don't fine-tune; it will overfit. Train a linear classifier on CNN features.
Small | Low | Train a classifier from activations in lower layers; higher layers are specific to the older dataset.
Large | Low | Train the CNN from scratch.

http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html

slide-54
SLIDE 54


slide-55
SLIDE 55


slide-56
SLIDE 56


slide-57
SLIDE 57

Could you train your own classifier ... without coding?

  • Microsoft CustomVision.ai
  • Unique: under-a-minute training, custom object detection (100x speedup)
  • Google AutoML
  • Unique: full CNN training, crowdsourced workers
  • IBM Watson Visual Recognition
  • Baidu EZDL
  • Unique: custom sound recognition
slide-58
SLIDE 58

Custom Vision Service (customvision.ai) – Drag and drop training

Tip: Upload 30 photos per class for a prototype model; upload 200 photos per class for a more robust production model. The more distinct the shape/type of object, the fewer images required.

slide-59
SLIDE 59

Custom Vision Service (customvision.ai) – Drag and drop training

Tip: Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights.

slide-60
SLIDE 60

Core ML exporter from customvision.ai – Drag and drop training

A 5-minute shortcut to training, fine-tuning, and getting a model ready in Core ML format. Drag and drop interface.

slide-61
SLIDE 61

Building a Crowdsourced Data Collector in 1 month

slide-62
SLIDE 62

Barcode recognition from Seeing AI

Aim: Help blind users identify products using barcodes
Issue: Blind users don't know where the barcode is

Live: Guide the user in finding a barcode with audio cues
With server: Decode the barcode to identify the product
Tech: MPSCNN running on mobile GPU + barcode library
Metrics: 40 FPS (~25 ms) on iPhone 7

slide-63
SLIDE 63

Currency recognition from Seeing AI

Aim: Identify currency

Live: Identify the denomination of paper currency instantly
With server
Tech: Task-specific CNN running on mobile GPU
Metrics: 40 FPS (~25 ms) on iPhone 7

slide-64
SLIDE 64

Training Data Collection App

  • Request volunteers to take photos of objects in non-obvious settings
  • Send photos to the cloud; train the model nightly
  • A newsletter shows the best photos from volunteers
  • Let them compete for fame

slide-65
SLIDE 65

Daily challenge – Collected by volunteers
slide-66
SLIDE 66

Daily challenge – Collected by volunteers
slide-67
SLIDE 67

Building a production DL App in 3 months

slide-68
SLIDE 68

What you want: $200,000
What you can afford: $2000

https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

slide-69
SLIDE 69

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-70
SLIDE 70

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-71
SLIDE 71

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015) – ultra deep

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-72
SLIDE 72

Revolution of Depth

ResNet, 152 layers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-73
SLIDE 73

Revolution of Depth vs Classification Accuracy

ImageNet classification top-5 error (%):

ILSVRC'10: 28.2 (shallow)
ILSVRC'11: 25.8 (shallow)
ILSVRC'12 AlexNet: 16.4 (8 layers)
ILSVRC'13: 11.7
ILSVRC'14 VGG: 7.3 (19 layers)
ILSVRC'14 GoogleNet: 6.7 (22 layers)
ILSVRC'15 ResNet: 3.6 (152 layers)
ILSVRC'16: 2.9 (ensemble of ResNet, Inception-ResNet, Inception, and Wide Residual Network)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-74
SLIDE 74

Accuracy vs Operations Per Image Inference

(Chart: bubble size is proportional to number of parameters; the 552 MB and 240 MB models sit far from what we want.)

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016

slide-75
SLIDE 75

Your Budget - Smartphone Floating Point Operations Per Second (2015)

http://pages.experts-exchange.com/processing-power-compared/

slide-76
SLIDE 76

iPhone X is more powerful than a MacBook Pro

https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/

slide-77
SLIDE 77

Strategies to get maximum efficiency from your CNN

Before training:

  • Pick an efficient architecture for your task
  • Design efficient layers

After training:

  • Pruning
  • Quantization
  • Network binarization
slide-78
SLIDE 78

Core ML Benchmark – Pick a DNN for your mobile architecture

Model         Top-1 Acc.  Size (MB)  Million Mult-Adds  iPhone 5S (ms)  iPhone 6 (ms)  iPhone 6S/SE (ms)  iPhone 7 (ms)  iPhone 8/X (ms)
VGG 16        71          553        15300              7408            4556           235                181            146
Inception v3  78          95         5000               727             637            114                90             78
ResNet 50     75          103        3900               538             557            77                 74             71
MobileNet     71          17         569                129             109            44                 35             33
SqueezeNet    57          5          800                75              78             36                 30             29

(Devices span 2013–2017; huge improvement in GPU hardware in 2015.)

slide-79
SLIDE 79

MobileNet family

Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv. Tune with two parameters: width multiplier and resolution multiplier.

Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
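The depthwise/pointwise split above cuts multiply-adds roughly by a factor of 1/N + 1/k² for a kxk kernel with N output channels. A quick sketch of the arithmetic (operation counts only, no actual convolution; the helper names are illustrative):

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiply-adds for a kxk standard convolution over an HxW map."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_cost(k, c_in, c_out, h, w):
    """kxk depthwise conv (one filter per input channel)
    followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

# A MobileNet-style layer: 3x3, 256 -> 256 channels on a 14x14 map.
std = standard_conv_cost(3, 256, 256, 14, 14)
sep = depthwise_separable_cost(3, 256, 256, 14, 14)
print(round(std / sep, 1))  # roughly 8-9x fewer multiply-adds
```

For 3x3 kernels the theoretical saving approaches 9x as channel count grows, which is why the MobileNet family is so much cheaper than standard CNNs at similar accuracy.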

slide-80
SLIDE 80

Efficient Classification Architectures

https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html

MobileNetV2 is the current favorite

slide-81
SLIDE 81

Efficient Detection Architectures

Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017

slide-82
SLIDE 82

Efficient Detection Architectures

Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017

slide-83
SLIDE 83

Efficient Segmentation Architectures

ICNet – Image Cascade Network

slide-84
SLIDE 84

Tricks while designing your own network

  • Dilated convolutions
  • Great for segmentation / when the target object covers a large area of the image
  • Replace NxN convolutions with Nx1 followed by 1xN
  • Depthwise separable convolutions (e.g. MobileNet)
  • Inverted residual blocks (e.g. MobileNetV2)
  • Replace large filters with multiple small filters
  • 5x5 is slower than 3x3 followed by 3x3
slide-85
SLIDE 85

Design considerations for custom architectures – Small Filters

Three layers of 3x3 convolutions >> one layer of 7x7 convolution. Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions. Replace NxN convolutions with a stack of 1xN and Nx1.

Fewer parameters ☺ Less compute ☺ More non-linearity ☺

Better, Faster, Stronger

Andrej Karpathy, CS-231n Notes, Lecture 11
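The small-filter claim above can be checked by counting: a stack of three 3x3 convolutions covers the same 7x7 receptive field with fewer weights. A sketch for C input and C output channels, bias terms ignored (helper names are illustrative):

```python
def conv_params(k, c_in, c_out):
    """Weights in a single kxk convolution layer (no bias)."""
    return k * k * c_in * c_out

def stack_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

C = 64
single_7x7 = conv_params(7, C, C)      # 49 * C^2 weights
three_3x3 = 3 * conv_params(3, C, C)   # 27 * C^2 weights
print(stack_receptive_field([3, 3, 3]))  # 7, same as one 7x7 layer
print(round(single_7x7 / three_3x3, 2))  # 1.81x more parameters for 7x7
```

The stacked version also applies a non-linearity after each 3x3 layer, which is the "more non-linearity" point above.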

slide-86
SLIDE 86

Selective training to keep networks shallow

Idea: Augment data only to the extent your network will actually be used.
Example: If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; your phone will rotate anyway. This approach was followed by Word Lens / Google Translate.
Example: Add blur if analyzing mobile phone camera frames.

slide-87
SLIDE 87

Pruning

Aim : Remove all connections with absolute weights below a threshold

Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
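The thresholding idea above is simple enough to sketch directly. A toy magnitude-pruning pass over a flat weight list (the paper's actual pipeline also retrains after pruning and stores the result sparsely; the helper names are illustrative):

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every connection whose absolute weight is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero after pruning."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.4, 0.09]
pruned = prune_by_magnitude(weights, threshold=0.1)
print(pruned)            # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, -0.4, 0.0]
print(sparsity(pruned))  # 0.5
```

The compression win comes from storing only the surviving weights (plus indices) in a sparse format, then fine-tuning to recover accuracy.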

slide-88
SLIDE 88

Observation: Most parameters are in the fully connected layers

AlexNet (240 MB): 96% of all parameters
VGG-16 (552 MB): 90% of all parameters

slide-89
SLIDE 89

Pruning gets the quickest model compression without accuracy loss

AlexNet: 240 MB; VGG-16: 552 MB
The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy.

slide-90
SLIDE 90

Prune in Keras (Before)

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

slide-91
SLIDE 91

Prune in Keras (After)

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    prune.Prune(tf.keras.layers.Dense(512, activation=tf.nn.relu)),
    tf.keras.layers.Dropout(0.2),
    prune.Prune(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

slide-92
SLIDE 92

Weight Sharing

Idea: Cluster weights with similar values together and store them in a dictionary.
Techniques: Codebook, Huffman coding, HashedNets
Cons: Needs a special inference engine; doesn't work for most applications.
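A minimal sketch of the codebook idea above: snap each weight to its nearest codebook entry and store only small indices plus the codebook. Real systems learn the codebook (e.g. with k-means) and fine-tune afterwards; the helper names here are illustrative:

```python
def assign_to_codebook(weights, codebook):
    """Replace each weight by the index of its nearest codebook value."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))
            for w in weights]

def reconstruct(indices, codebook):
    """Recover the (lossy) shared-weight tensor from indices."""
    return [codebook[i] for i in indices]

codebook = [-0.5, 0.0, 0.5]   # 3 shared values -> 2-bit indices per weight
weights = [0.45, -0.48, 0.02, 0.51, -0.1]
idx = assign_to_codebook(weights, codebook)
print(idx)                         # [2, 0, 1, 2, 1]
print(reconstruct(idx, codebook))  # [0.5, -0.5, 0.0, 0.5, 0.0]
```

Storing 2-bit indices instead of 32-bit floats is where the compression comes from, but as the slide notes, the inference engine must understand the index-plus-codebook format.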

slide-93
SLIDE 93

Filter Pruning - ThiNet

Idea: Discard a whole filter if it is not important to predictions.
Advantages:

  • No change in architecture, other than thinning of filters per layer
  • Can be further compressed with other methods

Just like feature selection, select filters to discard. Possible greedy methods:

  • Absolute weight sum of the entire filter closest to 0
  • Average percentage of zeros in the outputs
  • ThiNet – collect statistics on the output of the next layer
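The first greedy criterion above (absolute weight sum closest to 0) can be sketched as a ranking over filters. Filters are flattened to 1-D lists here for brevity; real filters are kxkxC tensors, and the function name is illustrative:

```python
def filters_to_prune(filters, num_to_drop):
    """Rank filters by sum of absolute weights; return the indices of
    the weakest num_to_drop filters (the ones closest to all-zero)."""
    scores = [(sum(abs(w) for w in f), i) for i, f in enumerate(filters)]
    scores.sort()
    return sorted(i for _, i in scores[:num_to_drop])

filters = [
    [0.9, -0.8, 0.7],     # strong filter
    [0.01, 0.02, -0.01],  # near-zero filter -> prune
    [0.5, 0.4, -0.3],
    [0.03, -0.02, 0.0],   # near-zero filter -> prune
]
print(filters_to_prune(filters, 2))  # [1, 3]
```

Dropping filter i in layer L also removes channel i from every filter in layer L+1, which is how the whole network "thins" without changing its architecture.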
slide-94
SLIDE 94

Quantization

Reduce precision from 32 bits to 16 bits or fewer. Use stochastic rounding for best results.

In practice:

  • Ristretto + Caffe
  • Automatic network quantization
  • Finds a balance between compression rate and accuracy
  • Apple Metal Performance Shaders automatically quantize to 16 bits
  • TensorFlow has 8-bit quantization support
  • gemmlowp – low-precision matrix multiplication library
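The 8-bit schemes listed above map floats to integers with a scale factor. A toy symmetric per-tensor version (real toolchains also calibrate activation ranges and may use asymmetric zero points; the helper names are illustrative):

```python
def quantize_linear(values, num_bits=8):
    """Symmetric linear quantization: map [-max|v|, +max|v|]
    onto signed num_bits integers."""
    qmax = 2 ** (num_bits - 1) - 1   # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized integers."""
    return [qi * scale for qi in q]

vals = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_linear(vals)
restored = dequantize(q, scale)
# Round-trip error is bounded by one quantization step.
print(max(abs(a - b) for a, b in zip(vals, restored)) < scale)  # True
```

The "lut" modes in coremltools replace this uniform grid with a learned look-up table, trading a small table for better placement of the representable values.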
slide-95
SLIDE 95

Quantizing CNNs in Practice

Reducing Core ML models to half size:

# Load a model, lower its precision, and then save the smaller model.
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')

slide-96
SLIDE 96

Quantizing CNNs in Practice

Reducing Core ML models to an even smaller size. Choose bits and quantization mode:

  • Bits from [1, 2, 4, 8]
  • Quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"] (lut = look-up table)

from coremltools.models.neural_network.quantization_utils import *
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
compare_model(model, quantized_model, './sample_data/')

slide-97
SLIDE 97

Binary Weight Networks

Idea: Reduce the weights to -1, +1
Speedup: The convolution operation can be approximated by only summation and subtraction

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”

slide-98
SLIDE 98


slide-99
SLIDE 99


slide-100
SLIDE 100

XNOR-Net

Idea: Reduce both weights and inputs to -1, +1
Speedup: The convolution operation can be approximated by XNOR and bitcount operations

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
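The XNOR + bitcount trick above works because, for vectors with entries in {-1, +1} packed as bits, the dot product equals 2·popcount(XNOR(a, b)) − n. A toy sketch in plain Python (helper names are illustrative; real implementations do this on 64-bit words with hardware popcount):

```python
def pack_bits(signs):
    """Pack a list of -1/+1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for s in signs:
        bits = (bits << 1) | (1 if s > 0 else 0)
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors of length n via XNOR + bitcount."""
    mask = (1 << n) - 1
    matches = bin((~(a_bits ^ b_bits)) & mask).count("1")
    return 2 * matches - n

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
exact = sum(x * y for x, y in zip(a, b))        # 1 - 1 - 1 + 1 = 0
print(xnor_dot(pack_bits(a), pack_bits(b), 4))  # 0, same result
```

Each 64-bit XNOR + popcount replaces 64 multiply-adds, which is the source of the large speedups (and the accuracy loss) of fully binarized networks.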

slide-101
SLIDE 101


slide-102
SLIDE 102


slide-103
SLIDE 103

XNOR-Net on Mobile

slide-104
SLIDE 104

Battery-free, solar-powered AI device from XNOR.AI

slide-105
SLIDE 105

Challenges

Off-the-shelf CNNs are not robust for video. Solutions:

  • Collective confidence over several frames
  • CortexNet
slide-106
SLIDE 106

Building a DL App and getting $10 million in funding (or a PhD)

slide-107
SLIDE 107

Competitions to follow

Winners = high accuracy + low energy consumption

  • LPIRC - Low-Power Image Recognition Challenge
  • EDLDC - Embedded Deep Learning Design Contest
  • System Design Contest at the Design Automation Conference (DAC)

slide-108
SLIDE 108

AutoML – Let AI design an efficient AI architecture

MnasNet: Platform-Aware Neural Architecture Search for Mobile

  • An automated neural architecture search approach for designing mobile models using reinforcement learning
  • Incorporates latency information into the multi-objective reward function
  • Measures real-world inference latency by executing the model on a particular platform

(Diagram: the controller samples models from the search space; the trainer measures accuracy, real mobile phones measure latency, and both feed the multi-objective reward.)

slide-109
SLIDE 109

AutoML – Let AI design an efficient AI architecture

For the same accuracy:

  • 1.5x faster than MobileNetV2
  • ResNet-50 accuracy with 19x fewer parameters
  • SSD300 mAP with 35x fewer FLOPs
slide-110
SLIDE 110
  • Mr. Data Scientist, PhD

slide-111
SLIDE 111

One Last Question

slide-112
SLIDE 112

How to access the slides in 1 second

http://bit.ly/ml-slides
@anirudhkoul

slide-113
SLIDE 113