Deep Learning on Mobile Phones - A Practitioner's Guide - PowerPoint PPT Presentation

Anirudh Koul, Siddha Ganju, Meher Kasam


slide-1
SLIDE 1

Deep Learning on Mobile Phones

A Practitioner's Guide

Anirudh Koul, Siddha Ganju, Meher Kasam

slide-2
SLIDE 2
slide-3
SLIDE 3

Deep Learning on Mobile Phones

A Practitioner's Guide

Anirudh Koul, Siddha Ganju, Meher Kasam

slide-4
SLIDE 4

Anirudh Koul @AnirudhKoul

Head of AI & Research, Aira [Lastname]@aira.io

Siddha Ganju @SiddhaGanju

Architect, Self-Driving Vehicles, NVIDIA [FirstnameLastname]@gmail.com

Meher Anand Kasam @MeherKasam

Software Engineer, Square [FirstnameMiddlenameK]@gmail.com

slide-5
SLIDE 5

Why Deep Learning On Mobile?

  • Latency
  • Privacy

slide-6
SLIDE 6

Response Time Limits – Powers of 10

0.1 second: Reacting instantly
1.0 second: User's flow of thought
10 seconds: Keeping the user's attention

[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]

slide-7
SLIDE 7

Mobile Deep Learning Recipe

Mobile Inference Engine (Efficient) + Pretrained Model (Efficient) = DL App

slide-8
SLIDE 8

Building a DL App in _ time

slide-9
SLIDE 9

Building a DL App in 1 hour

slide-10
SLIDE 10

Use Cloud APIs for General Recognition Needs

  • Microsoft Cognitive Services
  • Clarifai
  • Google Cloud Vision
  • IBM Watson Services
  • Amazon Rekognition
slide-11
SLIDE 11

How to Choose a Computer Vision Based API?

Benchmark & compare them.

COCO-Text v2.0 for text reading in the wild:

  • ~2k random images
  • Candidate text has at least 2 characters together
  • Direct word match

COCO-Val 2017 for image tagging in the wild:

  • ~4k random images
  • Tag similarity match instead of word match
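The direct word match criterion above can be sketched as a small scoring function. This is a hypothetical minimal harness, not the one used for the numbers in this deck; `word_match_accuracy` and the sample data are illustrative.

```python
def word_match_accuracy(predictions, ground_truths):
    """Direct word match: an image counts as correct if any predicted
    candidate word (length >= 2) exactly matches a ground-truth word."""
    hits = 0
    for pred_words, true_words in zip(predictions, ground_truths):
        candidates = {w.lower() for w in pred_words if len(w) >= 2}
        truth = {w.lower() for w in true_words}
        if candidates & truth:
            hits += 1
    return hits / len(predictions)

# Example: 2 of 3 images have at least one exact word match.
preds = [["STOP"], ["Caf"], ["exit", "EXIT"]]
truth = [["stop", "sign"], ["cafe"], ["exit"]]
print(word_match_accuracy(preds, truth))  # ~0.667
```

The tag-similarity variant for image tagging would replace the exact set intersection with a concept-similarity lookup (e.g. 'military-officer' matching 'person').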

slide-12
SLIDE 12

Pricing

slide-13
SLIDE 13

Recognize Text Benchmarks

Text API                      Accuracy
Amazon Rekognition            45.4%
Google Cloud Vision           33.4%
Microsoft Cognitive Services  55.4%

Evaluation criteria:

  • Photos have candidate words with length >= 2
  • Direct word match with ground truth
slide-14
SLIDE 14

Image Tagging Benchmarks

Evaluation criteria:

  • Concept similarity match instead of word match
  • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’

API                           Accuracy
Amazon Rekognition            65%
Google Cloud Vision           47.6%
Microsoft Cognitive Services  50.0%

slide-15
SLIDE 15

Image Tagging Benchmarks

Evaluation criteria:

  • Concept similarity match instead of word match
  • E.g. ‘military-officer’ tag matched with ground truth tag ‘person’

API                           Accuracy  Avg #Tags
Amazon Rekognition            65%       14
Google Cloud Vision           47.6%     14
Microsoft Cognitive Services  50.0%     8

slide-16
SLIDE 16

Image Tagging Benchmarks

Hard to do precision-recall since COCO ground truth tags are not exhaustive. A lower number of tags for a given accuracy indicates a higher F-measure.

API                           Accuracy  Avg #Tags
Amazon Rekognition            65%       14
Google Cloud Vision           47.6%     14
Microsoft Cognitive Services  50.0%     8

slide-17
SLIDE 17

Tips for reducing network latency

For text recognition:

  • A compression setting of up to 90% has little effect on accuracy, but gives drastic savings in size
  • Resizing is dangerous; text recognition needs a minimum size to work

For image recognition:

  • Resize so that min(height, width) = 224, at 50% compression with bilinear interpolation
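The image-recognition sizing rule above (smaller side scaled to 224, aspect ratio preserved) can be sketched without any imaging library; the actual bilinear resize and JPEG compression would be done with an image library before upload. `target_size` is an illustrative helper, not part of any API:

```python
def target_size(width, height, min_side=224):
    """Scale so the smaller dimension becomes min_side, keeping aspect
    ratio. The result is what you'd pass to a bilinear resize before
    compressing and uploading to a cloud vision API."""
    scale = min_side / min(width, height)
    return round(width * scale), round(height * scale)

# A 12 MP phone photo shrinks to a fraction of its pixel count.
print(target_size(4032, 3024))  # (299, 224)
```

Sending ~67k pixels instead of ~12M, plus 50% JPEG compression, is where the latency savings come from.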

slide-18
SLIDE 18

Building a DL App in 1 day

slide-19
SLIDE 19

http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

Figures: Energy to train a Convolutional Neural Network; Energy to use a Convolutional Neural Network

slide-20
SLIDE 20

Base Pretrained Model

ImageNet – 1000-class Object Categorizer: VGG16, Inception-v3, ResNet-50, MobileNet, SqueezeNet

slide-21
SLIDE 21

Running pre-trained models on mobile

Core ML, TensorFlow Lite, Caffe2

slide-22
SLIDE 22

Apple’s Ecosystem

Timeline: Metal (2014) → BNNS + MPS (2016) → Core ML (2017) → Core ML 2 (2018)

slide-23
SLIDE 23

Apple’s Ecosystem

Metal

  • Low-level, low-overhead hardware-accelerated 3D graphics and compute shader application programming interface (API)
  • Available since iOS 8


slide-24
SLIDE 24

Apple’s Ecosystem

Fast low-level primitives:

  • BNNS – Basic Neural Network Subroutines; ideal case: fully connected NNs
  • MPS – Metal Performance Shaders; ideal case: convolutions

Inconvenient for large networks:

  • An Inception-v3 inference implementation consisted of ~1.5K lines of hard-coded model definition
  • Libraries like Forge by Matthijs Hollemans provide abstraction


slide-25
SLIDE 25

Apple’s Ecosystem

Convert a Caffe/TensorFlow model to a Core ML model in 3 lines:

import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')

Add the model to an iOS project and call it for prediction. Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM. Automatically minimizes memory footprint and power consumption.


slide-26
SLIDE 26

Apple’s Ecosystem

  • Model quantization support down to 1 bit
  • Batch API for improved performance
  • Conversion support for MXNet, ONNX
  • ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
  • Create ML for quick training
  • tf-coreml for direct conversion from TensorFlow


slide-27
SLIDE 27

Core ML Benchmark – Pick a DNN for your mobile architecture

Model         Top-1 Acc.  Size (MB)  iPhone 5S (ms)  iPhone 6 (ms)  iPhone 6S/SE (ms)  iPhone 7 (ms)  iPhone 8/X (ms)
VGG 16        71          553        7408            4556           235                181            146
Inception v3  78          95         727             637            114                90             78
ResNet 50     75          103        538             557            77                 74             71
MobileNet     71          17         129             109            44                 35             33
SqueezeNet    57          5          75              78             36                 30             29

(Devices span 2013–2017; huge improvement in GPU hardware in 2015.)

slide-28
SLIDE 28

Putting out more frames than an art gallery

slide-29
SLIDE 29

TensorFlow Ecosystem

Timeline: TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)

slide-30
SLIDE 30

TensorFlow Ecosystem

TensorFlow: the full, bulky deal

slide-31
SLIDE 31

TensorFlow Ecosystem

TensorFlow Mobile:

  • Easy pipeline to bring TensorFlow models to mobile
  • Excellent documentation
  • Optimizations to bring models to mobile

slide-32
SLIDE 32

TensorFlow Ecosystem

TensorFlow Lite:

  • Smaller
  • Faster
  • Minimal dependencies
  • Easier to package & deploy
  • Allows running custom operators

1-line conversion from Keras to TensorFlow Lite:

  • tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite

slide-33
SLIDE 33

TensorFlow Lite is small

  • ~75 KB for the core interpreter
  • ~400 KB for the core interpreter + supported operations
  • Compared to 1.5 MB for TensorFlow Mobile
slide-34
SLIDE 34

TensorFlow Lite is fast

  • Takes advantage of on-device hardware acceleration
  • Uses FlatBuffers
  • Reduces code footprint and memory usage
  • Reduces CPU cycles spent on serialization and deserialization
  • Improves startup time
  • Pre-fused activations
  • Combines the batch normalization layer with the previous convolution
  • Interpreter uses static memory and a static execution plan
  • Decreases load time
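The pre-fused batch-norm point above works because batch normalization at inference is an affine transform, so it folds into the preceding convolution's weights and bias. A scalar sketch of the algebra (per output channel; `fold_batchnorm` is an illustrative helper, not a TensorFlow Lite API):

```python
import math

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta
    into fused weights w', b' such that y = w'*x + b'."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Scalar example: the fused layer reproduces conv + BN exactly.
w, b = 0.5, 0.1
gamma, beta, mean, var = 2.0, 0.3, 0.4, 0.25
w_fused, b_fused = fold_batchnorm(w, b, gamma, beta, mean, var)

x = 3.0
conv_out = w * x + b
bn_out = gamma * (conv_out - mean) / math.sqrt(var + 1e-5) + beta
fused_out = w_fused * x + b_fused
print(abs(bn_out - fused_out) < 1e-9)  # True
```

After folding, the BN layer disappears entirely from the inference graph, which is where the speedup comes from.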
slide-35
SLIDE 35

TensorFlow Lite Architecture

slide-36
SLIDE 36

TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/

slide-37
SLIDE 37

TensorFlow Lite Benchmarks - http://ai-benchmark.com/

  • Crowdsourced benchmarking with the AI Benchmark Android app
  • By Andrey Ignatov from ETH Zurich
  • 9 tests, e.g. semantic segmentation, image super-resolution, face recognition
slide-38
SLIDE 38

TensorFlow Lite acceleration – GPU delegate (dev preview)

slide-39
SLIDE 39

Caffe2

  • From Facebook
  • Under 1 MB of binary size
  • Built for speed:
  • For ARM CPU: uses NEON kernels, NNPACK
  • For iPhone GPU: uses Metal Performance Shaders and Metal
  • For Android GPU: uses Qualcomm Snapdragon NPE (4-5x speedup)
  • ONNX format support to import models from CNTK/PyTorch

slide-40
SLIDE 40

Caffe2

slide-41
SLIDE 41

ML Kit

  • Simple, easy to use
  • Abstraction over TensorFlow Lite
  • Built-in image labeling, OCR, face detection, barcode scanning, landmark detection, smart reply
  • Model management with Firebase
  • Upload a model on the web interface to distribute
  • A/B testing

slide-42
SLIDE 42

ML Kit – Face Contours

By leveraging the GPU delegate: ~4x speedup on Pixel 3, ~6x speedup on iPhone 7

slide-43
SLIDE 43

Recommendation for production development

  • 1. Train a model using Keras
  • 2. Convert to TensorFlow Lite format (tflite_convert: Keras → .tflite file)
  • 3. Upload to Firebase
  • 4. Deploy to iOS/Android apps with ML Kit

slide-44
SLIDE 44

Common Questions

“My app has become too big to download. What do I do?”

  • iOS doesn't allow apps over 150 MB to be downloaded over cellular
  • Solution: download the model on demand and compile it on device
  • 0 MB change to app size on first install
slide-45
SLIDE 45

Common Questions

“Do I need to ship a new app update with every model improvement?”

  • Making app updates is a decent amount of overhead, plus ~2 days of review wait time
  • Solution: check for model updates, then download and compile on device
  • Easier solution: use a framework for model management, e.g.
  • Google ML Kit
  • Fritz
  • Numericcal
slide-46
SLIDE 46

Common Questions

“Why does my app not recognize objects at the top/bottom of the screen?”

  • Solution: check the cropping used; by default, it's a center crop ☺
slide-47
SLIDE 47

Building a DL App in 1 week

slide-48
SLIDE 48

Learning to play the accordion: 3 months

slide-49
SLIDE 49

Learning to play the accordion: 3 months. Already knows the piano?

Fine-Tune Skills

1 week

slide-50
SLIDE 50

I Got a Dataset, Now What?

Step 1: Find a pre-trained model
Step 2: Fine-tune the pre-trained model
Step 3: Run using existing frameworks

“Don't Be A Hero” - Andrej Karpathy
slide-51
SLIDE 51

How to find pretrained models for my task?

Model Zoo: https://modelzoo.co (300+ models)

Papers with Code: https://paperswithcode.com/sota

slide-52
SLIDE 52

AlexNet, 2012 (simplified)

[Krizhevsky, Sutskever, Hinton 2012]; Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”; n-dimensional feature representation

slide-53
SLIDE 53

Deciding how to fine-tune

Size of New Dataset | Similarity to Original Dataset | What to do?
Large | High | Fine-tune.
Small | High | Don't fine-tune; it will overfit. Train a linear classifier on CNN features.
Small | Low | Train a classifier from activations in lower layers; higher layers are specific to the older dataset.
Large | Low | Train the CNN from scratch.

http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html

slide-54
SLIDE 54


slide-55
SLIDE 55


slide-56
SLIDE 56


slide-57
SLIDE 57

Could you train your own classifier ... without coding?

  • Microsoft CustomVision.ai
  • Unique: under-a-minute training, custom object detection (100x speedup)
  • Google AutoML
  • Unique: full CNN training, crowdsourced workers
  • IBM Watson Visual Recognition
  • Baidu EZDL
  • Unique: custom sound recognition
slide-58
SLIDE 58

Custom Vision Service (customvision.ai) – Drag and drop training

Tip: Upload 30 photos per class for a prototype model; upload 200 photos per class for a more robust production model. The more distinct the shape/type of object, the fewer images required.

slide-59
SLIDE 59

Custom Vision Service (customvision.ai) – Drag and drop training

Tip: Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights.

slide-60
SLIDE 60

Core ML exporter from customvision.ai – Drag and drop training

A 5-minute shortcut to training, fine-tuning, and getting a model ready in Core ML format. Drag and drop interface.

slide-61
SLIDE 61

Building a Crowdsourced Data Collector in 1 month

slide-62
SLIDE 62

Barcode recognition from Seeing AI

Aim: Help blind users identify products using barcodes
Issue: Blind users don't know where the barcode is

Live: Guide the user in finding a barcode with audio cues
With server: Decode the barcode to identify the product
Tech: MPSCNN running on mobile GPU + barcode library
Metrics: 40 FPS (~25 ms) on iPhone 7

slide-63
SLIDE 63

Currency recognition from Seeing AI

Aim: Identify currency

Live: Identify the denomination of paper currency instantly
With server
Tech: Task-specific CNN running on mobile GPU
Metrics: 40 FPS (~25 ms) on iPhone 7

slide-64
SLIDE 64

Training Data Collection App

  • Request volunteers to take photos of objects in non-obvious settings
  • Send photos to the cloud; train the model nightly
  • A newsletter shows the best photos from volunteers
  • Let them compete for fame

slide-65
SLIDE 65

Daily challenge – Collected by volunteers
slide-66
SLIDE 66

Daily challenge – Collected by volunteers
slide-67
SLIDE 67

Building a production DL App in 3 months

slide-68
SLIDE 68

What you want: $200,000
What you can afford: $2000

https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

slide-69
SLIDE 69

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-70
SLIDE 70

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-71
SLIDE 71

Revolution of Depth

AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015) – ultra deep

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-72
SLIDE 72

Revolution of Depth

ResNet, 152 layers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-73
SLIDE 73

Revolution of Depth vs Classification Accuracy

ImageNet classification top-5 error (%):

ILSVRC'10: 28.2 (shallow)
ILSVRC'11: 25.8 (shallow)
ILSVRC'12 AlexNet: 16.4 (8 layers)
ILSVRC'13: 11.7
ILSVRC'14 VGG: 7.3 (19 layers)
ILSVRC'14 GoogleNet: 6.7 (22 layers)
ILSVRC'15 ResNet: 3.6 (152 layers)
ILSVRC'16: 2.9 (ensemble of ResNet, Inception-ResNet, Inception, and Wide Residual Network)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

slide-74
SLIDE 74

Accuracy vs Operations Per Image Inference

(Chart: bubble size is proportional to number of parameters; the 552 MB and 240 MB models sit far from what we want.)

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016

slide-75
SLIDE 75

Your Budget - Smartphone Floating Point Operations Per Second (2015)

http://pages.experts-exchange.com/processing-power-compared/

slide-76
SLIDE 76

iPhone X is more powerful than a MacBook Pro

https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/

slide-77
SLIDE 77

Strategies to get maximum efficiency from your CNN

Before training:

  • Pick an efficient architecture for your task
  • Design efficient layers

After training:

  • Pruning
  • Quantization
  • Network binarization
slide-78
SLIDE 78

Core ML Benchmark – Pick a DNN for your mobile architecture

Model         Top-1 Acc.  Size (MB)  Million Mult-Adds  iPhone 5S (ms)  iPhone 6 (ms)  iPhone 6S/SE (ms)  iPhone 7 (ms)  iPhone 8/X (ms)
VGG 16        71          553        15300              7408            4556           235                181            146
Inception v3  78          95         5000               727             637            114                90             78
ResNet 50     75          103        3900               538             557            77                 74             71
MobileNet     71          17         569                129             109            44                 35             33
SqueezeNet    57          5          800                75              78             36                 30             29

(Devices span 2013–2017; huge improvement in GPU hardware in 2015.)

slide-79
SLIDE 79

MobileNet family

Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv. Tune with two parameters: width multiplier and resolution multiplier.

Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
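The depthwise/pointwise split above cuts multiply-adds roughly by a factor of 1/N + 1/k² for a kxk kernel with N output channels. A quick sketch of the arithmetic (operation counts only, no actual convolution; the helper names are illustrative):

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiply-adds for a kxk standard convolution over an HxW map."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_cost(k, c_in, c_out, h, w):
    """kxk depthwise conv (one filter per input channel)
    followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

# A MobileNet-style layer: 3x3, 256 -> 256 channels on a 14x14 map.
std = standard_conv_cost(3, 256, 256, 14, 14)
sep = depthwise_separable_cost(3, 256, 256, 14, 14)
print(round(std / sep, 1))  # roughly 8-9x fewer multiply-adds
```

For 3x3 kernels the theoretical saving approaches 9x as channel count grows, which is why the MobileNet family is so much cheaper than standard CNNs at similar accuracy.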

slide-80
SLIDE 80

Efficient Classification Architectures

https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html

MobileNetV2 is the current favorite

slide-81
SLIDE 81

Efficient Detection Architectures

Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017

slide-82
SLIDE 82

Efficient Detection Architectures

Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017

slide-83
SLIDE 83

Efficient Segmentation Architectures

ICNet – Image Cascade Network

slide-84
SLIDE 84

Tricks while designing your own network

  • Dilated convolutions
  • Great for segmentation / when the target object covers a large area of the image
  • Replace NxN convolutions with Nx1 followed by 1xN
  • Depthwise separable convolutions (e.g. MobileNet)
  • Inverted residual blocks (e.g. MobileNetV2)
  • Replace large filters with multiple small filters
  • 5x5 is slower than 3x3 followed by 3x3
slide-85
SLIDE 85

Design considerations for custom architectures – Small Filters

Three layers of 3x3 convolutions >> one layer of 7x7 convolution. Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions. Replace NxN convolutions with a stack of 1xN and Nx1.

Fewer parameters ☺ Less compute ☺ More non-linearity ☺

Better, Faster, Stronger

Andrej Karpathy, CS-231n Notes, Lecture 11
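The small-filter claim above can be checked by counting: a stack of three 3x3 convolutions covers the same 7x7 receptive field with fewer weights. A sketch for C input and C output channels, bias terms ignored (helper names are illustrative):

```python
def conv_params(k, c_in, c_out):
    """Weights in a single kxk convolution layer (no bias)."""
    return k * k * c_in * c_out

def stack_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

C = 64
single_7x7 = conv_params(7, C, C)      # 49 * C^2 weights
three_3x3 = 3 * conv_params(3, C, C)   # 27 * C^2 weights
print(stack_receptive_field([3, 3, 3]))  # 7, same as one 7x7 layer
print(round(single_7x7 / three_3x3, 2))  # 1.81x more parameters for 7x7
```

The stacked version also applies a non-linearity after each 3x3 layer, which is the "more non-linearity" point above.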

slide-86
SLIDE 86

Selective training to keep networks shallow

Idea: Augment data only to the extent your network will actually be used.
Example: If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; your phone will rotate anyway. This approach was followed by Word Lens / Google Translate.
Example: Add blur if analyzing mobile phone camera frames.

slide-87
SLIDE 87

Pruning

Aim : Remove all connections with absolute weights below a threshold

Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
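The thresholding idea above is simple enough to sketch directly. A toy magnitude-pruning pass over a flat weight list (the paper's actual pipeline also retrains after pruning and stores the result sparsely; the helper names are illustrative):

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every connection whose absolute weight is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero after pruning."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, -0.4, 0.09]
pruned = prune_by_magnitude(weights, threshold=0.1)
print(pruned)            # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, -0.4, 0.0]
print(sparsity(pruned))  # 0.5
```

The compression win comes from storing only the surviving weights (plus indices) in a sparse format, then fine-tuning to recover accuracy.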

slide-88
SLIDE 88

Observation: Most parameters are in the fully connected layers

AlexNet (240 MB): 96% of all parameters
VGG-16 (552 MB): 90% of all parameters

slide-89
SLIDE 89

Pruning gets the quickest model compression without accuracy loss

AlexNet: 240 MB; VGG-16: 552 MB
The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy.

slide-90
SLIDE 90

Prune in Keras (Before)

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

slide-91
SLIDE 91

Prune in Keras (After)

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    prune.Prune(tf.keras.layers.Dense(512, activation=tf.nn.relu)),
    tf.keras.layers.Dropout(0.2),
    prune.Prune(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

slide-92
SLIDE 92

Weight Sharing

Idea: Cluster weights with similar values together and store them in a dictionary.
Techniques: Codebook, Huffman coding, HashedNets
Cons: Needs a special inference engine; doesn't work for most applications.
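A minimal sketch of the codebook idea above: snap each weight to its nearest codebook entry and store only small indices plus the codebook. Real systems learn the codebook (e.g. with k-means) and fine-tune afterwards; the helper names here are illustrative:

```python
def assign_to_codebook(weights, codebook):
    """Replace each weight by the index of its nearest codebook value."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))
            for w in weights]

def reconstruct(indices, codebook):
    """Recover the (lossy) shared-weight tensor from indices."""
    return [codebook[i] for i in indices]

codebook = [-0.5, 0.0, 0.5]   # 3 shared values -> 2-bit indices per weight
weights = [0.45, -0.48, 0.02, 0.51, -0.1]
idx = assign_to_codebook(weights, codebook)
print(idx)                         # [2, 0, 1, 2, 1]
print(reconstruct(idx, codebook))  # [0.5, -0.5, 0.0, 0.5, 0.0]
```

Storing 2-bit indices instead of 32-bit floats is where the compression comes from, but as the slide notes, the inference engine must understand the index-plus-codebook format.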

slide-93
SLIDE 93

Filter Pruning - ThiNet

Idea: Discard a whole filter if it is not important to predictions.
Advantages:

  • No change in architecture, other than thinning of filters per layer
  • Can be further compressed with other methods

Just like feature selection, select filters to discard. Possible greedy methods:

  • Absolute weight sum of the entire filter closest to 0
  • Average percentage of zeros in the outputs
  • ThiNet – collect statistics on the output of the next layer
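The first greedy criterion above (absolute weight sum closest to 0) can be sketched as a ranking over filters. Filters are flattened to 1-D lists here for brevity; real filters are kxkxC tensors, and the function name is illustrative:

```python
def filters_to_prune(filters, num_to_drop):
    """Rank filters by sum of absolute weights; return the indices of
    the weakest num_to_drop filters (the ones closest to all-zero)."""
    scores = [(sum(abs(w) for w in f), i) for i, f in enumerate(filters)]
    scores.sort()
    return sorted(i for _, i in scores[:num_to_drop])

filters = [
    [0.9, -0.8, 0.7],     # strong filter
    [0.01, 0.02, -0.01],  # near-zero filter -> prune
    [0.5, 0.4, -0.3],
    [0.03, -0.02, 0.0],   # near-zero filter -> prune
]
print(filters_to_prune(filters, 2))  # [1, 3]
```

Dropping filter i in layer L also removes channel i from every filter in layer L+1, which is how the whole network "thins" without changing its architecture.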
slide-94
SLIDE 94

Quantization

Reduce precision from 32 bits to 16 bits or fewer. Use stochastic rounding for best results.

In practice:

  • Ristretto + Caffe
  • Automatic network quantization
  • Finds a balance between compression rate and accuracy
  • Apple Metal Performance Shaders automatically quantize to 16 bits
  • TensorFlow has 8-bit quantization support
  • gemmlowp – low-precision matrix multiplication library
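The 8-bit schemes listed above map floats to integers with a scale factor. A toy symmetric per-tensor version (real toolchains also calibrate activation ranges and may use asymmetric zero points; the helper names are illustrative):

```python
def quantize_linear(values, num_bits=8):
    """Symmetric linear quantization: map [-max|v|, +max|v|]
    onto signed num_bits integers."""
    qmax = 2 ** (num_bits - 1) - 1   # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from quantized integers."""
    return [qi * scale for qi in q]

vals = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_linear(vals)
restored = dequantize(q, scale)
# Round-trip error is bounded by one quantization step.
print(max(abs(a - b) for a, b in zip(vals, restored)) < scale)  # True
```

The "lut" modes in coremltools replace this uniform grid with a learned look-up table, trading a small table for better placement of the representable values.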
slide-95
SLIDE 95

Quantizing CNNs in Practice

Reducing Core ML models to half size:

# Load a model, lower its precision, and then save the smaller model.
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')

slide-96
SLIDE 96

Quantizing CNNs in Practice

Reducing Core ML models to an even smaller size. Choose bits and quantization mode:

  • Bits from [1, 2, 4, 8]
  • Quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"] (lut = look-up table)

from coremltools.models.neural_network.quantization_utils import *
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
compare_model(model, quantized_model, './sample_data/')

slide-97
SLIDE 97

Binary Weight Networks

Idea: Reduce the weights to -1, +1
Speedup: The convolution operation can be approximated by only summation and subtraction

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”

slide-98
SLIDE 98


slide-99
SLIDE 99


slide-100
SLIDE 100

XNOR-Net

Idea: Reduce both weights and inputs to -1, +1
Speedup: The convolution operation can be approximated by XNOR and bitcount operations

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
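The XNOR + bitcount trick above works because, for vectors with entries in {-1, +1} packed as bits, the dot product equals 2·popcount(XNOR(a, b)) − n. A toy sketch in plain Python (helper names are illustrative; real implementations do this on 64-bit words with hardware popcount):

```python
def pack_bits(signs):
    """Pack a list of -1/+1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for s in signs:
        bits = (bits << 1) | (1 if s > 0 else 0)
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two {-1,+1} vectors of length n via XNOR + bitcount."""
    mask = (1 << n) - 1
    matches = bin((~(a_bits ^ b_bits)) & mask).count("1")
    return 2 * matches - n

a = [+1, -1, +1, +1]
b = [+1, +1, -1, +1]
exact = sum(x * y for x, y in zip(a, b))        # 1 - 1 - 1 + 1 = 0
print(xnor_dot(pack_bits(a), pack_bits(b), 4))  # 0, same result
```

Each 64-bit XNOR + popcount replaces 64 multiply-adds, which is the source of the large speedups (and the accuracy loss) of fully binarized networks.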

slide-101
SLIDE 101


slide-102
SLIDE 102


slide-103
SLIDE 103

XNOR-Net on Mobile

slide-104
SLIDE 104

Battery-free, solar-powered AI device from XNOR.AI

slide-105
SLIDE 105

Challenges

Off-the-shelf CNNs are not robust for video. Solutions:

  • Collective confidence over several frames
  • CortexNet
slide-106
SLIDE 106

Building a DL App and getting $10 million in funding (or a PhD)

slide-107
SLIDE 107

Competitions to follow

Winners = high accuracy + low energy consumption

  • LPIRC - Low-Power Image Recognition Challenge
  • EDLDC - Embedded Deep Learning Design Contest
  • System Design Contest at the Design Automation Conference (DAC)

slide-108
SLIDE 108

AutoML – Let AI design an efficient AI architecture

MnasNet: Platform-Aware Neural Architecture Search for Mobile

  • An automated neural architecture search approach for designing mobile models using reinforcement learning
  • Incorporates latency information into the multi-objective reward function
  • Measures real-world inference latency by executing the model on a particular platform

(Diagram: the controller samples models from the search space; the trainer measures accuracy, real mobile phones measure latency, and both feed the multi-objective reward.)

slide-109
SLIDE 109

AutoML – Let AI design an efficient AI architecture

For the same accuracy:

  • 1.5x faster than MobileNetV2
  • ResNet-50 accuracy with 19x fewer parameters
  • SSD300 mAP with 35x fewer FLOPs
slide-110
SLIDE 110
  • Mr. Data Scientist, PhD

slide-111
SLIDE 111

One Last Question

slide-112
SLIDE 112

How to access the slides in 1 second

http://bit.ly/ml-slides
@anirudhkoul

slide-113
SLIDE 113