Deep Learning for Mobile Part II. Instructor - Simon Lucey. 16-623



slide-1
SLIDE 1

Deep Learning for Mobile Part II

Instructor - Simon Lucey

16-623 - Designing Computer Vision Apps

slide-2
SLIDE 2
slide-3
SLIDE 3

Today

  • Motivation
  • XNOR Networks
  • YOLO
slide-4
SLIDE 4

State of the Art Recognition Methods


  • Very expensive:
  • Memory
  • Computation
  • Power
slide-5
SLIDE 5

Convolutional Neural Networks


slide-6
SLIDE 6

Common Deep Learning Packages

  • Deep learning packages in common use include:
  • Caffe (out of Berkeley; the first popular package).
  • MatConvNet (MATLAB interface; very easy to use).
  • Torch (based on Lua; used by Facebook).
  • TensorFlow (based on Python; used by Google).
slide-7
SLIDE 7

TensorFlow

slide-8
SLIDE 8

TensorFlow

slide-9
SLIDE 9

Number of operations:

  • AlexNet → 1.5B FLOPs
  • VGG → 19.6B FLOPs

Inference time on CPU:

  • AlexNet → ~3 fps
  • VGG → ~0.25 fps

[Figure: convolution of a real-valued input I with real-valued weights (R ∗ R), using +, −, × operations; the slide also carries a GPU label]

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016

slide-10
SLIDE 10

TensorFlow iOS

slide-11
SLIDE 11

Accelerate Framework

“image operations” “matrix operations” “signal processing” “misc math”

slide-12
SLIDE 12

Accelerate Framework

“image operations” “matrix operations” “signal processing” “misc math” “basic neural network subroutines”

BNNS

(2016)

slide-13
SLIDE 13

Accelerate Framework

“image operations” “matrix operations” “signal processing” “misc math” “basic neural network subroutines”

BNNS

(2016)

(Taken from https://www.bignerdranch.com/blog/neural-networks-in-ios-10-and-macos/ )

slide-14
SLIDE 14

Deep Learning Kit

http://deeplearningkit.org/

slide-15
SLIDE 15

Today

  • Motivation
  • XNOR Networks
  • YOLO
slide-16
SLIDE 16

Lower Precision

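Reducing precision:

  • Saving memory
  • Saving computation

Precision levels shown: real values (R, 32-bit), 8-bit, and binary values (B, 1-bit), with B ∈ {−1, +1}.

Operations on {−1, +1} and their equivalents on {0, 1}:

  • MUL → XNOR
  • ADD, SUB → bit-count (popcount)

[Figure: histogram of weight values, x-axis from −0.05 to 0.05, y-axis counts from 1600 to 6400; Han et al. 2016]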

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016
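
The MUL → XNOR and ADD/SUB → popcount mapping can be illustrated with a small sketch. It assumes a bit encoding of 1 for +1 and 0 for −1; the helper names `pack_bits` and `binary_dot` are made up for this example and do not come from the slides.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int (bit = 1 for +1, bit = 0 for -1)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors of length n stored as bit masks."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 wherever the two signs agree
    matches = bin(xnor).count("1")     # popcount
    return 2 * matches - n             # agreements minus disagreements


a = [+1, -1, -1, +1, +1, -1, +1, +1]
b = [+1, +1, -1, -1, +1, -1, -1, +1]

reference = sum(x * y for x, y in zip(a, b))            # multiply-accumulate
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))   # XNOR + popcount
print(reference, fast)
```

Both paths give the same value because each agreeing pair contributes +1 to the product sum and each disagreeing pair contributes −1.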

slide-17
SLIDE 17

Why Binary?

  • Binary instructions
  • AND, OR, XOR, XNOR, popcount (bit-count)
  • Low-power devices

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016

slide-18
SLIDE 18

Why Binary?

Network                   Input ∗ Weights   Operations         Memory saving   Computation saving
Standard convolution      R ∗ R             +, −, ×            1x              1x
Binary Weight Networks    R ∗ B             +, −               ~32x            ~2x
XNOR-Networks             B ∗ B             XNOR, bit-count    ~32x            ~58x

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016

slide-19
SLIDE 19

Why Binary?


Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016

slide-20
SLIDE 20

Reminder: XNOR

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016

slide-21
SLIDE 21

Why Binary?

[Figure: real-valued convolution I ∗ W (R ∗ R) compared with convolution using a binary filter I ∗ W^B (R ∗ B)]

$$W^B = \operatorname{sign}(W)$$

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV 2016

slide-22
SLIDE 22

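Quantization Methods

$$W^B = \operatorname{sign}(W)$$

[Figure: a real-valued filter W (R) and its binarized version W^B (B); the value 0.75 appears in the figure]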

slide-23
SLIDE 23

Optimal Scaling Factor

$$\alpha^*, W^{B*} = \arg\min_{\alpha, W^B} J(\alpha, W^B)$$

$$J(\alpha, W^B) = \|W - \alpha W^B\|_2^2$$

slide-24
SLIDE 24

Optimal Scaling Factor

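$$\alpha^*, W^{B*} = \arg\min_{\alpha, W^B} J(\alpha, W^B)$$

$$J(\alpha, W^B) = \|W - \alpha W^B\|_2^2$$

Writing B for W^B:

$$J(\alpha, B) = \operatorname{tr}\left(\alpha^2 B^T B - 2\alpha B^T W + W^T W\right)$$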

slide-25
SLIDE 25

Optimal Scaling Factor

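$$\alpha^*, W^{B*} = \arg\min_{\alpha, W^B} J(\alpha, W^B)$$

$$J(\alpha, W^B) = \|W - \alpha W^B\|_2^2$$

Writing B for W^B:

$$J(\alpha, B) = \operatorname{tr}\left(\alpha^2 B^T B - 2\alpha B^T W + W^T W\right) = \alpha^2 n - 2\alpha \cdot \operatorname{tr}\left(B^T W\right) + \text{constant}$$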
slide-26
SLIDE 26

Simple Example

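$$\arg\min_{b} \; \alpha^2 - 2\alpha \cdot b \cdot w + \text{constant} \quad \text{s.t. } b \in \{+1, -1\}$$

  • Since we know that α is always positive then,

$$b = +1 \text{ if } w > 0, \qquad b = -1 \text{ if } w < 0$$

  • Or more simply,

$$b = \operatorname{sign}(w)$$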

slide-27
SLIDE 27

Optimal Scaling Factor

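  • Since $W^B = \operatorname{sign}(W)$ then,

$$\|W\|_{\ell_1} = \operatorname{tr}\left(W^T \operatorname{sign}(W)\right)$$

$$J(\alpha) = \alpha^2 n - 2\alpha \cdot \|W\|_{\ell_1} + \text{constant}$$

  • Therefore,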
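
The step from this quadratic in α to the closed-form scaling factor can be sketched as follows. It assumes the binary filter is already fixed to sign(W), so α is the only free variable and n is the number of elements in W:

```latex
\frac{\partial J(\alpha)}{\partial \alpha}
  = 2\alpha n - 2\,\|W\|_{\ell_1} = 0
\quad\Longrightarrow\quad
\alpha^{*} = \frac{1}{n}\,\|W\|_{\ell_1}
```

The second derivative is 2n > 0, so this stationary point is a minimum.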
slide-28
SLIDE 28

Optimal Scaling Factor

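$$\alpha^*, W^{B*} = \arg\min_{W^B, \alpha} \|W - \alpha W^B\|^2$$

$$W^{B*} = \operatorname{sign}(W)$$

$$\alpha^* = \frac{1}{n}\|W\|_{\ell_1}$$

[Figure: computing α; the real-valued filter W (R) is approximated by α times the binary filter W^B (B)]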
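
As a sanity check on the closed-form result, the sketch below compares it with a brute-force search over every sign pattern of a small weight vector. The example weights, the function names, and the choice to map sign(0) to +1 are assumptions made for illustration only.

```python
import itertools

import numpy as np


def binarize(W):
    """Closed form from the slides: W^B = sign(W), alpha = ||W||_1 / n."""
    WB = np.where(W >= 0, 1.0, -1.0)   # assumption: sign(0) is mapped to +1
    alpha = np.abs(W).sum() / W.size
    return alpha, WB


def brute_force(W):
    """Try every sign pattern B; for each one use the best non-negative alpha."""
    best = None
    for signs in itertools.product([-1.0, 1.0], repeat=W.size):
        B = np.array(signs)
        alpha = max(B @ W / W.size, 0.0)   # minimizer of alpha^2 n - 2 alpha B.W
        err = np.sum((W - alpha * B) ** 2)
        if best is None or err < best[0]:
            best = (err, alpha, B)
    return best


W = np.array([0.75, -0.2, 0.1, -1.3, 0.4])   # made-up example weights
alpha, WB = binarize(W)
err_closed = np.sum((W - alpha * WB) ** 2)
err_brute, alpha_brute, B_brute = brute_force(W)
print(alpha, WB, err_closed, err_brute)
```

The search only covers 2^n patterns, so it is practical for tiny vectors; it is meant as a check of the derivation, not as a way to binarize real filters.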

slide-29
SLIDE 29

How to train a CNN with binary filters?

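[Figure: the real-valued convolution I ∗ W (R) is replaced by a convolution with the binary filter W^B (B), scaled by the real factor α obtained when computing α]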

slide-30
SLIDE 30

Naive Solution

  • 1. Train a network with real parameters.
  • 2. Binarize the weight filters.
slide-31
SLIDE 31

Naive Solution

[Figure: a network with real-valued weight filters W (R)]

slide-32
SLIDE 32

Naive Solution

[Figure: the real-valued weight filters W (R) are binarized to W^B (B)]

slide-33
SLIDE 33

Naive Solution

[Figure: bar chart of AlexNet Top-1 accuracy (%) on ILSVRC2012: Full Precision 56.7, Naïve 0.2]

slide-34
SLIDE 34

Binary Weight Network

[Figure: a network with real-valued weight filters (R)]

Train for binary weights:

  • 1. Randomly initialize W
  • 2. For iter = 1 to N
      3. Load a random input image X
      4. $W^B = \operatorname{sign}(W)$
      5. $\alpha = \frac{\|W\|_{\ell_1}}{n}$
      6. Forward pass with α, W^B
      7. Compute loss function C
      8. $\frac{\partial C}{\partial W}$ = Backward pass with α, W^B
      9. Update W ($W = W - \frac{\partial C}{\partial W}$)
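
The nine steps can be sketched on a toy problem. Everything specific here is an assumption made for illustration: a single linear layer stands in for the CNN, a random mini-batch stands in for "a random input image", a learning rate `lr` is added to the update, and the gradient with respect to the estimated weights α·W^B is passed straight through to the real-valued W.

```python
import numpy as np


def binarize(W):
    WB = np.where(W >= 0, 1.0, -1.0)    # step 4: W^B = sign(W); sign(0) := +1 (assumption)
    alpha = np.abs(W).sum() / W.size    # step 5: alpha = ||W||_1 / n
    return alpha, WB


def loss_and_grad(X, y, w_est):
    """Squared-error loss of a linear model and its gradient w.r.t. w_est."""
    err = X @ w_est - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)


def train(X, y, n_iters=300, lr=0.1, batch=64, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=X.shape[1])                  # step 1
    losses = []
    for _ in range(n_iters):                                    # step 2
        idx = rng.choice(len(y), size=batch, replace=False)     # step 3
        alpha, WB = binarize(W)                                 # steps 4-5
        losses.append(loss_and_grad(X, y, alpha * WB)[0])       # full-data loss, for monitoring only
        _, grad = loss_and_grad(X[idx], y[idx], alpha * WB)     # steps 6-8
        W -= lr * grad                                          # step 9: update the real-valued W
    return W, losses


data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 8))
w_true = 0.5 * np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=float)
y = X @ w_true

W, losses = train(X, y)
alpha, WB = binarize(W)
print(losses[0], losses[-1], alpha, WB)
```

The real-valued W is kept throughout training and only its binarized copy is used in the forward and backward passes, which is the point of steps 4 to 9.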

slide-35
SLIDE 35

Binary Weight Network


slide-36
SLIDE 36

Binary Weight Network

[Figure: the real-valued weight filters W (R) are binarized to W^B (B)]

slide-37
SLIDE 37

Binary Weight Network

[Figure: the real-valued weight filters W (R) are binarized to W^B (B)]

slide-38
SLIDE 38

Binary Weight Network

[Figure: forward pass with the binarized weight filters W^B (B), ending in the loss]

slide-39
SLIDE 39

Binary Weight Network

[Figure: the loss is computed with the binarized weights W^B (B); the backward pass yields a real-valued gradient G_w (R)]

slide-40
SLIDE 40

Gradients of Binary Weights

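$$g(w) = f(\operatorname{sign}\{w\}) = f(w_b)$$

$$\frac{\partial g(w)}{\partial w^T} = \frac{\partial f(w_b)}{\partial [w_b]^T} \, \frac{\partial \operatorname{sign}(w)}{\partial w^T}$$

[Figure: plots of sign(x) and of its gradient ∂sign(x)/∂x (G_x), with axis ticks at −1 and +1]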
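
The true derivative of sign is zero almost everywhere, so the chain rule above needs a surrogate for the sign term. One common choice, assumed here because the slide text itself does not spell it out, is a straight-through estimator that passes the gradient where |x| ≤ 1 and blocks it elsewhere. The function names are hypothetical.

```python
import numpy as np


def sign_forward(x):
    """Forward pass: binarize to {-1, +1} (sign(0) mapped to +1 by assumption)."""
    return np.where(x >= 0, 1.0, -1.0)


def sign_backward(x, grad_output):
    """Backward pass: pass the upstream gradient where |x| <= 1, zero elsewhere."""
    return grad_output * (np.abs(x) <= 1.0)


x = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
upstream = np.ones_like(x)   # stand-in for the gradient of f w.r.t. w_b

print(sign_forward(x))
print(sign_backward(x, upstream))
```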

slide-41
SLIDE 41

Binary Weight Network

[Figure: the real-valued weights (R) are updated with the gradient, W = W − G_w]

slide-42
SLIDE 42

Binary Weight Network

[Figure: bar chart of AlexNet Top-1 accuracy (%) on ILSVRC2012: Full Precision 56.7, Naïve 0.2, Binary Weight 56.8]

slide-43
SLIDE 43

Reminder

Network                   Input ∗ Weights   Operations         Memory saving   Computation saving
Standard convolution      R ∗ R             +, −, ×            1x              1x
Binary weights            R ∗ B             +, −               ~32x            ~2x
XNOR-Networks             B ∗ B             XNOR, bit-count    ~32x            ~58x

slide-44
SLIDE 44

Binary Weight - Binary Input Network

[Figure: the real-valued input X and weights W (R) are approximated by binary tensors X^B and W^B (B) with real scaling factors β and α]

slide-45
SLIDE 45

Binary Weight - Binary Input Network

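[Figure: the real-valued X, W and Y (R) are approximated by binary tensors X^B, W^B and Y^B (B) with real scaling factors β, α and γ]

$$Y^{B*}, \gamma^* = \arg\min_{Y^B, \gamma} \|Y - \gamma Y^B\|^2$$

$$Y^{B*} = \operatorname{sign}(Y), \qquad \gamma^* = \frac{1}{n}\|Y\|_{\ell_1}$$

$$W^{B*} = \operatorname{sign}(W), \qquad \alpha^* = \frac{1}{n}\|W\|_{\ell_1}$$

$$X^{B*} = \operatorname{sign}(X), \qquad \beta^* = \frac{1}{n}\|X\|_{\ell_1}$$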
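
A small sketch of how these factors combine for a single dot product: the real-valued product of X and W is approximated by β·α times the dot product of the two sign vectors. The function name and the example values are assumptions; the values are chosen with constant magnitudes so that the approximation happens to be exact.

```python
import numpy as np


def binary_approx_dot(x, w):
    """Approximate x . w by beta * alpha * (sign(x) . sign(w))."""
    beta = np.abs(x).mean()                # beta* = ||X||_1 / n
    alpha = np.abs(w).mean()               # alpha* = ||W||_1 / n
    xb = np.where(x >= 0, 1.0, -1.0)       # X^B = sign(X)
    wb = np.where(w >= 0, 1.0, -1.0)       # W^B = sign(W)
    return beta * alpha * (xb @ wb)


x = np.array([0.5, -0.5, 0.5])
w = np.array([2.0, -2.0, -2.0])

exact = x @ w
approx = binary_approx_dot(x, w)
print(exact, approx)
```

With magnitudes that vary across elements the two values generally differ, which is the approximation error the scaling factors are chosen to minimize.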

slide-46
SLIDE 46

Binary Weight - Binary Input Network

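  • (1) Binarizing weights: the real-valued filter (R) is replaced by a binary filter (B).
  • (2) Binarizing input (inefficient): sign(X) is computed per window, with redundant computation in overlapping areas.
  • (2) Binarizing input (efficient): sign(X) is computed once; the channel-wise average $\frac{\sum_i |X_{:,:,i}|}{c}$ is passed through an average filter.
  • (3) Convolution with XNOR-Bitcount.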
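
The efficient input binarization can be sketched as below: average |X| over the c channels once, then slide an average filter over the result, rather than recomputing a scaling factor for every overlapping window. The function name `scaling_map`, the (height, width, channels) layout, and the use of "valid" windows without padding are assumptions of this sketch.

```python
import numpy as np


def scaling_map(X, kh, kw):
    """Per-window input scaling factors for X of shape (H, W, c) and a kh x kw filter."""
    A = np.abs(X).mean(axis=2)               # channel-wise average of |X|
    out_h = A.shape[0] - kh + 1
    out_w = A.shape[1] - kw + 1
    K = np.empty((out_h, out_w))
    for r in range(out_h):
        for s in range(out_w):
            K[r, s] = A[r:r + kh, s:s + kw].mean()   # average filter over A
    return K


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6, 3))
K = scaling_map(X, 3, 3)

# Inefficient reference: recompute the mean of |X| for every window over all channels.
naive = np.array([[np.abs(X[r:r + 3, s:s + 3, :]).mean() for s in range(4)]
                  for r in range(3)])
print(np.allclose(K, naive))
```

Both routes give the same map because every window holds the same number of elements per channel; the efficient route reuses the channel average A across overlapping windows.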

slide-47
SLIDE 47

Binary Convolution

  • Convolution typically has $cN_WN_I$ operations, where

c = no. of channels

$N_W$ = no. of elements in W, $N_I$ = no. of elements in I

  • Replacing regular convolution with binary convolution:

$$I * W \approx (\operatorname{sign}(I) \circledast \operatorname{sign}(W)) \odot K\alpha$$

  • One can obtain a noticeable speedup since 64 binary operations can be performed in one clock cycle:

$$S = \frac{64\,cN_W}{cN_W + 64}$$

  • The speedup depends on the channel size and filter size, but not the input size.
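
The speedup formula is easy to evaluate for concrete layer shapes. The sketch below assumes 64 binary operations per clock cycle, as on the slide; the function name and the example layer (256 channels, 3x3 filter) are illustrative choices.

```python
def xnor_speedup(c, n_w, ops_per_cycle=64):
    """Theoretical speedup S = 64 c N_W / (c N_W + 64) of binary over regular convolution."""
    return ops_per_cycle * c * n_w / (c * n_w + ops_per_cycle)


# Example: 256 input channels and a 3x3 filter (N_W = 9).
s = xnor_speedup(256, 9)
print(round(s, 2))
```

The value approaches 64 as c·N_W grows and falls off quickly for small channel counts, which matches the formula's dependence on channel and filter size only.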

slide-48
SLIDE 48

Binary Weight - Binary Input Network

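[Figure: binary weights and binary input sign(X) (B), with real-valued scaling factors α∗, β∗ (R)]

  • 1. Randomly initialize W
  • 2. For iter = 1 to N
      3. Load a random input image X
      4. $W^B = \operatorname{sign}(W)$
      5. $\alpha = \frac{\|W\|_{\ell_1}}{n}$
      6. Forward pass with α, W^B
      7. Compute loss function C
      8. $\frac{\partial C}{\partial W}$ = Backward pass with α, W^B
      9. Update W ($W = W - \frac{\partial C}{\partial W}$)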

slide-49
SLIDE 49

XNOR Networks

[Figure: bar chart of AlexNet Top-1 accuracy (%) on ILSVRC2012: Full Precision 56.7, Naïve 0.2, Binary Weight 56.8, and a fourth bar at 30.5]

slide-50
SLIDE 50

Problem with Pooling

A typical block in a CNN:

Conv → BNorm → Activ → Pool

Max-pooling:

  • ✗ Information loss
  • ✓ Multiple maximums

slide-51
SLIDE 51

Problem with Pooling

A block in XNOR-Net:

BNorm → BinActiv → BinConv → Pool

slide-52
SLIDE 52

XNOR Network

[Figure: a network block with BNorm, Activ, Conv and Pool layers, using binary input sign(X) (B) and real-valued scaling factors α∗, β∗ (R)]

slide-53
SLIDE 53

XNOR Network

[Figure: a network block with BNorm, Activ, Conv and Pool layers, using binary input sign(X) (B) and real-valued scaling factors α∗, β∗ (R)]

Refer to this as an “XNOR” Network

slide-54
SLIDE 54

Results

[Figure: bar chart of AlexNet Top-1 accuracy (%) on ILSVRC2012. Bar values: 56.7, 0.2, 56.8, 30.5, 44.2]

✓ 32x smaller model

Model size   AlexNet   VGG      ResNet-18
Float        245 MB    500 MB   100 MB
Binary       7.4 MB    16 MB    1.5 MB

✓ 58x less computation

[Figure: speedup by varying channel size; x-axis: number of channels (1 to 1024), y-axis: speedup (0x to 80x)]

[Figure: speedup by varying filter size; x-axis: filter size (0x0 to 20x20), y-axis: speedup (50x to 65x)]

slide-55
SLIDE 55

Results

[Figure: bar chart of AlexNet Top-1 and Top-5 accuracy (%) on ILSVRC2012, y-axis from 0 to 90]

slide-56
SLIDE 56

Today

  • Motivation
  • XNOR Networks
  • YOLO
slide-57
SLIDE 57

You Only Look Once (YOLO)

  • 1. Resize image.
  • 2. Run convolutional network.
  • 3. Non-max suppression.

[Figure: example detections with confidence scores: Dog 0.30, Person 0.64, Horse 0.28]

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.
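
Step 3 above, non-max suppression, can be sketched as a greedy procedure: keep the highest-scoring box, then drop any remaining box that overlaps a kept one too much. The corner box format (x1, y1, x2, y2), the 0.5 IoU threshold, and the example boxes are assumptions for illustration, not values from the slides; only the three scores are borrowed from the example detections.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]   # made-up boxes
scores = [0.64, 0.30, 0.28]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The first two boxes overlap heavily, so only the higher-scoring one survives, while the distant third box is kept.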

slide-58
SLIDE 58

You Only Look Once (YOLO)

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.

conditional probability map

slide-59
SLIDE 59

You Only Look Once (YOLO)

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.

slide-60
SLIDE 60

You Only Look Once (YOLO)

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.

slide-61
SLIDE 61

YOLO on Nature

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.

slide-62
SLIDE 62

YOLO on Nature

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR 2016.