Deep Learning for Mobile Part I Instructor - Simon Lucey 16-623 - - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Deep Learning for Mobile Part I

Instructor - Simon Lucey

16-623 - Designing Computer Vision Apps

slide-2
SLIDE 2
slide-3
SLIDE 3

Today

  • Single Layer Perceptron
  • Multi-Layer Perceptron
  • Convolutional Neural Network
slide-4
SLIDE 4

Linear Binary Classification

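Input vector:

$x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T, \quad x \in \mathbb{R}^D$

Decision rule:

$w^T x + w_0 \geq 0 \Rightarrow x \in C_1, \qquad w^T x + w_0 < 0 \Rightarrow x \in C_2$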

slide-5
SLIDE 5

Linear Binary Classification

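$x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T, \quad x \in \mathbb{R}^D$

$w^T x + w_0 \geq 0 \Rightarrow x \in C_1, \qquad w^T x + w_0 < 0 \Rightarrow x \in C_2$

This classifier is known as the “Perceptron”.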

slide-6
SLIDE 6

Linear Binary Classification

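$x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T, \quad x \in \mathbb{R}^D$

$w^T x + w_0 \geq 0 \Rightarrow x \in C_1, \qquad w^T x + w_0 < 0 \Rightarrow x \in C_2$

The function $w^T x + w_0$ is known as a “Linear Discriminant”.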

slide-7
SLIDE 7

Why Linear?

  • Linear discriminant functions are useful because the number of required training samples grows linearly with the dimensionality D.

[Figure: number of samples n versus dimensionality D.]

slide-9
SLIDE 9

Perceptron

  • Rosenblatt simulated the perceptron on an IBM 704 computer at Cornell in 1957.
  • The input scene (i.e. a printed character) was illuminated by powerful lights and captured by a 20x20 array of cadmium sulphide photocells.
  • The weights of the perceptron were applied using variable rotary resistors.
  • Often referred to as the very first neural network.

“Frank Rosenblatt”

slide-10
SLIDE 10

Perceptron

slide-11
SLIDE 11

Linear Discriminant Functions

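[Figure: geometry of a linear discriminant in two dimensions (axes $x_1$, $x_2$). The decision surface $y = 0$ is perpendicular to $w$ and separates region $R_1$ ($y > 0$, class $C_1$) from region $R_2$ ($y < 0$, class $C_2$). Its signed distance from the origin is $-w_0 / \|w\|$, and the signed distance of a point $x$ from the surface is $y(x) / \|w\|$, where $x_\perp$ is the projection of $x$ onto the surface.]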

slide-12
SLIDE 12

Linear Binary Classification

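$x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T, \quad x \in \mathbb{R}^D$

$\begin{bmatrix} w \\ w_0 \end{bmatrix}^T \begin{bmatrix} x \\ 1 \end{bmatrix} \geq 0 \Rightarrow x \in C_1, \qquad \begin{bmatrix} w \\ w_0 \end{bmatrix}^T \begin{bmatrix} x \\ 1 \end{bmatrix} < 0 \Rightarrow x \in C_2$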

slide-13
SLIDE 13

Linear Binary Classification

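$x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T, \quad x \in \mathbb{R}^D$

With the bias $w_0$ absorbed into $w$ (and a constant 1 appended to $x$), the rule simplifies to:

$w^T x \geq 0 \Rightarrow x \in C_1, \qquad w^T x < 0 \Rightarrow x \in C_2$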
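The decision rule with an absorbed bias can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the helper names `augment`, `discriminant` and `classify`, and the weight values, are assumptions made for the example.

```python
def augment(x):
    """Append a constant 1 so the bias w0 can be folded into the weight vector."""
    return list(x) + [1.0]


def discriminant(w_aug, x):
    """Evaluate [w; w0]^T [x; 1], which equals w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w_aug, augment(x)))


def classify(w_aug, x):
    """Return 'C1' when the discriminant is >= 0, otherwise 'C2'."""
    return "C1" if discriminant(w_aug, x) >= 0 else "C2"


# Hypothetical weights for illustration: w = [2, -1], w0 = 0.5.
w_aug = [2.0, -1.0, 0.5]
label_a = classify(w_aug, [1.0, 1.0])  # discriminant = 2 - 1 + 0.5 = 1.5
label_b = classify(w_aug, [0.0, 1.0])  # discriminant = -1 + 0.5 = -0.5
```

Folding the bias into the weight vector keeps every later formula in the form $w^T x$, which is why the remaining slides drop $w_0$.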

slide-14
SLIDE 14

Perceptron Linear Discriminant

binary labels: $t_i = -1$ or $t_i = +1$
$x_i$ = i-th training example
$w$ = weight vector

$\arg\min_w \sum_{n=1}^{N} \max(0, \; t_n \cdot x_n^T w)$

slide-16
SLIDE 16

Perceptron Linear Discriminant

binary labels: $t_i = -1$ or $t_i = +1$
$x_i$ = i-th training example
$w$ = weight vector

$\arg\min_w \sum_{n=1}^{N} E(t_n \cdot x_n^T w)$

slide-17
SLIDE 17

Perceptron Linear Discriminant

margin $\propto (w^T w)^{-1}$

$\arg\min_w \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$

slide-18
SLIDE 18

Other Objectives

  • Other objectives are possible,

least-squares ← $\|z - 1\|_2^2$
sigmoid ← $\frac{1}{1 + \exp(-z)}$
hinge ← $\max(0, 1 - z)$

[Figure: plot of E(z) against z, for z from −2 to 2.]
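The three objectives listed on this slide can be written directly as functions of the scalar $z$. The sketch below transcribes the formulas exactly as they appear on the slide; the function names are illustrative choices, not part of the lecture.

```python
import math


def least_squares(z):
    """Least-squares objective as listed on the slide: (z - 1)^2 for scalar z."""
    return (z - 1.0) ** 2


def sigmoid(z):
    """Sigmoid expression as listed on the slide: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hinge(z):
    """Hinge objective as listed on the slide: max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


# Evaluate each objective over the range plotted on the slide.
zs = [-2.0, -1.0, 0.0, 1.0, 2.0]
table = [(z, least_squares(z), sigmoid(z), hinge(z)) for z in zs]
```

The hinge objective is exactly zero once $z \geq 1$, whereas least-squares also penalizes values of $z$ larger than 1.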

slide-19
SLIDE 19

Optimizing Weights

  • Expressing the final objective as,

$f(w) = \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$

  • Simplest strategy is to employ gradient-descent optimization,

$w \rightarrow w - \eta \frac{\partial f(w)}{\partial w}$

slide-20
SLIDE 20

Optimizing Weights

  • Expressing the final objective as,

$f(w) = \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$

  • Simplest strategy is to employ gradient-descent optimization,

$w \rightarrow w - \eta \frac{\partial f(w)}{\partial w}$

where $\eta$ is the “Learning Rate”.

slide-21
SLIDE 21

Gradient-Descent Optimization

  • Works for any function whose gradient can be estimated.
  • Guaranteed to converge towards a local minimum.
  • Scales well to extremely large amounts of data.
  • Notoriously slow (linear convergence).
  • Tuning the learning rate often involves guesswork.
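The sensitivity to the learning rate can be seen on a toy problem. The sketch below applies the update $w \leftarrow w - \eta \, \partial f / \partial w$ to the one-dimensional objective $f(w) = (w - 3)^2$; the objective, the function name `gradient_descent` and the two learning rates are assumptions chosen purely for illustration.

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeat the update w <- w - eta * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


def grad_f(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


# A small learning rate approaches the minimum at w = 3.
w_good = gradient_descent(grad_f, w0=0.0, eta=0.1, steps=100)
# A learning rate that is too large makes the iterates diverge.
w_bad = gradient_descent(grad_f, w0=0.0, eta=1.1, steps=50)
```

With $\eta = 0.1$ the error shrinks by a constant factor of 0.8 per step, which is the linear convergence mentioned above; with $\eta = 1.1$ the error grows by a factor of 1.2 per step.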
slide-23
SLIDE 23

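Optimizing Weights

$\begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_K} \end{bmatrix}$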

slide-25
SLIDE 25

Optimizing Weights - Per Sample

  • The objective is nearly always a summation over N samples,

$f(w) = \sum_{n=1}^{N} f_n(w)$

  • So one can update the weights per sample,

$w \rightarrow w - \frac{\eta}{N} \frac{\partial f_n(w)}{\partial w}$

where $\eta$ is the “Learning Rate”.

slide-26
SLIDE 26

Single Layer - Example

$f_n(w) = \frac{1}{2} \|1 - t_n \cdot x_n^T w\|_2^2 + \frac{\lambda}{2N} \|w\|_2^2$

slide-27
SLIDE 27

Single Layer - Example

$f_n(w) = \frac{1}{2} \|1 - t_n \cdot x_n^T w\|_2^2 + \frac{\lambda}{2N} \|w\|_2^2$

$\frac{\partial f_n(w)}{\partial w} = (x_n^T w - t_n) x_n + \frac{\lambda}{N} w$
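The per-sample gradient above, combined with the per-sample update $w \rightarrow w - \frac{\eta}{N} \frac{\partial f_n(w)}{\partial w}$, gives a complete training loop. The following is a minimal sketch under stated assumptions: the toy data set, the hyperparameter values and the function names `grad_fn` and `sgd` are hypothetical, and each input already has a trailing 1 appended so that the bias is part of $w$.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def grad_fn(w, x, t, lam, n_samples):
    """Per-sample gradient (x^T w - t) x + (lam / N) w, valid for t in {-1, +1}."""
    residual = dot(w, x) - t
    return [residual * xi + lam / n_samples * wi for wi, xi in zip(w, x)]


def sgd(data, dim, eta, lam, epochs, seed=0):
    """Visit the samples in random order and apply w <- w - (eta / N) * grad f_n(w)."""
    rng = random.Random(seed)
    n = len(data)
    w = [0.0] * dim
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            x, t = data[i]
            g = grad_fn(w, x, t, lam, n)
            w = [wi - eta / n * gi for wi, gi in zip(w, g)]
    return w


# Hypothetical 1-D toy data: (augmented input, label).
data = [([2.0, 1.0], +1), ([1.5, 1.0], +1), ([-2.0, 1.0], -1), ([-1.0, 1.0], -1)]
w = sgd(data, dim=2, eta=0.4, lam=0.01, epochs=200)
correct = all(t * dot(w, x) > 0 for x, t in data)
```

The simplification from $-t_n (1 - t_n x_n^T w) x_n$ to $(x_n^T w - t_n) x_n$ relies on $t_n^2 = 1$, so the code is only valid for labels in $\{-1, +1\}$.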

slide-28
SLIDE 28

Today

  • Single-Layer Perceptron
  • Multi-Layer Perceptron
  • Convolutional Neural Network
slide-29
SLIDE 29

Shallow Networks

  • Theorem: Gaussian kernel machines need at least k examples to learn a function that has 2k zero-crossings along some line.
  • Theorem: For a Gaussian kernel machine to learn some maximally varying functions over d inputs requires $O(2^d)$ examples.
  • Y. Bengio, O. Delalleau, and N. Le Roux, “The Curse of Highly Variable Functions for Local Kernel Machines”, NIPS 2006
slide-30
SLIDE 30

Hierarchical Learning

[Figure (Bob Crimi): simple cells, complex cells and view-tuned cells.]

slide-31
SLIDE 31

Hierarchical Learning

[Figure (Bob Crimi): the ventral visual stream, V1, V2/V4 and IT, with simple cells, complex cells and view-tuned cells.]

slide-32
SLIDE 32

Hierarchical Learning

Successive model layers learn deeper intermediate representations.

[Figure: Layer 1, Layer 2, Layer 3. Parts combine to form objects. High-level linguistic representations. (Lee, Grosse, Ranganath & Ng, ICML 2009)]

Prior: underlying factors & concepts compactly expressed w/ multiple levels of abstraction.

slide-33
SLIDE 33

Why Deep?

  • A deep network can be considered an MLP with several or more hidden layers.
  • Deeper nets are exponentially more expressive than shallow ones.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.

[Figure: Shallow Network versus Deep Network.]

slide-34
SLIDE 34

Shallow Computer Program

main
  subroutine1 includes subsub1 code and subsub2 code and subsubsub1 code
  subroutine2 includes subsub2 code and subsub3 code and subsubsub3 code and …

slide-35
SLIDE 35

Deep Computer Program

main
  sub1, sub2, sub3
    subsub1, subsub2, subsub3
      subsubsub1, subsubsub2, subsubsub3

slide-36
SLIDE 36

Multi-Layer Perceptron

slide-37
SLIDE 37

Multi-Layer Perceptron

$W^{(1)} x$, where $W^{(1)}$ is $(M \times D)$

slide-38
SLIDE 38

Multi-Layer Perceptron

$x \rightarrow W^{(1)} x \rightarrow h(W^{(1)} x)$, where $W^{(1)}$ is $(M \times D)$

[Figure: plot of the non-linearity $h(x)$ against $x$.]

slide-40
SLIDE 40

Multi-Layer Perceptron

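$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(M \times D)$

$[w^{(2)}]^T z \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T z < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times M)$.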

slide-41
SLIDE 41

Multi-Layer Perceptron

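[Figure: network diagram of a two-layer MLP with inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$ and outputs $y_1, \ldots, y_K$. First-layer weights are labelled $w^{(1)}_{MD}$; second-layer weights are labelled $w^{(2)}_{KM}$ and $w^{(2)}_{10}$.]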
slide-42
SLIDE 42

Layer 1 - MLP

$h()$ = non-linear function

$z = \begin{bmatrix} z_1 \\ \vdots \\ z_M \end{bmatrix} \leftarrow \begin{bmatrix} h[x^T w_1^{(1)}] \\ \vdots \\ h[x^T w_M^{(1)}] \end{bmatrix}$

$[w_1^{(1)}, \ldots, w_M^{(1)}]$ = 1st layer’s D × M weights

$x$ = D × 1 raw input

slide-43
SLIDE 43

Layer 2 - MLP

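$x = [65, 09, 67, \ldots, 78, 66, 76, 215]^T, \quad x \in \mathbb{R}^D, \quad z \in \mathbb{R}^M$

$z^T w^{(2)} \geq 0 \Rightarrow z \in C_1, \qquad z^T w^{(2)} < 0 \Rightarrow z \in C_2$

$z$ = M × 1 output of layer 1

$w^{(2)}$ = 2nd layer’s M × 1 weight vector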
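The two layers can be chained into a forward pass. The sketch below is illustrative only: it assumes ReLU for the non-linearity $h()$ (the slides leave $h$ unspecified at this point), and the function names and weight values are hypothetical.

```python
def relu(a):
    """One possible choice for the non-linearity h(); an assumption for this sketch."""
    return max(0.0, a)


def layer1(x, W1, h=relu):
    """Compute z_m = h(x^T w_m) for each of the M first-layer weight vectors."""
    return [h(sum(xi * wi for xi, wi in zip(x, w_m))) for w_m in W1]


def layer2(z, w2):
    """Threshold z^T w2 at zero: 'C1' if >= 0, otherwise 'C2'."""
    score = sum(zi * wi for zi, wi in zip(z, w2))
    return "C1" if score >= 0 else "C2"


# Hypothetical network with D = 2 inputs and M = 3 hidden units.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stored as M weight vectors of length D
w2 = [1.0, -1.0, -1.0]
x = [1.0, -2.0]

z = layer1(x, W1)      # pre-activations are 1, -2, -1
label = layer2(z, w2)
```

Without the non-linearity the two layers would collapse into a single linear map, so the choice of $h()$ is what makes the second layer useful.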

slide-44
SLIDE 44

Obvious Questions?

  • How many layers?
  • Is the solution globally optimal?
  • What non-linearity should you use?
  • What learning rate?
  • How should I estimate my gradients?
slide-46
SLIDE 46

How Deep?

  • Recent work has suggested that network depth is crucial for good performance (e.g. ImageNet).
  • Counterintuitively, naively trained deeper networks tend to have higher training error than shallow networks.
  • The innovation of residual learning has greatly helped with this.

[Figure 2 of He et al., "Residual learning: a building block": the input x passes through two weight layers with a ReLU between them to give F(x); an identity shortcut adds x to give F(x) + x, followed by a second ReLU.]

He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
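The building block in the figure computes $\mathrm{relu}(F(x) + x)$ with $F(x)$ given by two weight layers. A minimal sketch follows, assuming fully connected weight layers without bias terms and square weight matrices so that $F(x)$ and $x$ have the same size; the helper names are illustrative.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]


def relu_vec(v):
    return [max(0.0, a) for a in v]


def residual_block(x, W1, W2):
    """y = relu(F(x) + x) with F(x) = W2 relu(W1 x); biases omitted for brevity."""
    f = matvec(W2, relu_vec(matvec(W1, x)))
    return relu_vec([fi + xi for fi, xi in zip(f, x)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With all weights at zero, F(x) = 0 and the block passes a non-negative x through unchanged.
y_zero = residual_block([1.0, 2.0], zeros, zeros)
y_ident = residual_block([1.0, -1.0], identity, identity)
```

The zero-weight case shows why extra residual blocks need not hurt training error: a block can fall back to the identity mapping simply by driving its weights towards zero.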

slide-47
SLIDE 47

How Deep?

[Figure: error (%) versus iterations (1e4) for ResNet-20, ResNet-32, ResNet-44, ResNet-56 and ResNet-110, with the 20-layer and 110-layer curves labelled. Curves show training error, and bold lines denote testing error.]

He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).

slide-48
SLIDE 48

Obvious Questions?

  • How many layers?
  • Is the solution globally optimal?
  • What non-linearity should you use?
  • What learning rate?
  • How should I estimate my gradients?
slide-50
SLIDE 50

Convexity Not Needed

  • Increasing evidence suggests that convexity is not needed to guarantee the global optimality of deep networks.
  • (Pascanu, Dauphin, Ganguli, Bengio, arXiv May 2014): On the saddle point problem for non-convex optimization
  • (Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio, NIPS 2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
  • (Choromanska, Henaff, Mathieu, Ben Arous & LeCun 2014): The Loss Surface of Multilayer Nets

slide-51
SLIDE 51

Saddle Points

f(w)

Pascanu, Razvan, et al. "On the saddle point problem for non-convex optimization." arXiv preprint arXiv:1405.4604 (2014).

slide-52
SLIDE 52

Saddle Point

$\nabla^2 f(w) = H$   “Hessian matrix”

$H = V \mathrm{diag}(\lambda) V^T$   “Eigen-decomposition”

$\frac{1}{D} \sum_{d=1}^{D} (\lambda_d < 0)$   fraction of negative eigenvalues at a “Critical Point”

Pascanu, Razvan, et al. "On the saddle point problem for non-convex optimization." arXiv preprint arXiv:1405.4604 (2014).
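The fraction of negative Hessian eigenvalues can be computed by hand for small examples. The sketch below uses the closed-form eigenvalues of a symmetric 2 × 2 matrix; the two toy objectives and the function names are assumptions for illustration.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] in closed form."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return [mean - radius, mean + radius]


def critical_point_index(eigenvalues):
    """Fraction of negative eigenvalues: (1/D) * sum over d of [lambda_d < 0]."""
    return sum(1 for lam in eigenvalues if lam < 0) / len(eigenvalues)


# f(w) = w1^2 - w2^2 has Hessian [[2, 0], [0, -2]] at its critical point w = 0.
saddle_index = critical_point_index(eig_sym_2x2(2.0, 0.0, -2.0))
# f(w) = w1^2 + w2^2 has Hessian [[2, 0], [0, 2]] at its critical point w = 0.
minimum_index = critical_point_index(eig_sym_2x2(2.0, 0.0, 2.0))
```

An index of 0 corresponds to a local minimum, an index of 1 to a local maximum, and anything in between to a saddle point.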

slide-53
SLIDE 53

Index of Critical Point

Pascanu, Razvan, et al. "On the saddle point problem for non-convex optimization." arXiv preprint arXiv:1405.4604 (2014).

slide-54
SLIDE 54

Obvious Questions?

  • How many layers?
  • Is the solution globally optimal?
  • What non-linearity should you use?
  • What learning rate?
  • How should I estimate my gradients?
slide-56
SLIDE 56

ReLU

Krizhevsky et al. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.

[Figure: ReLU versus Sigmoid.]

ReLU(x) = max(0, x)

slide-57
SLIDE 57

ReLU

  • ReLU is not only important for improved convergence.
  • A deep network with ReLU (referred to as a rectifier network) produces substantially more linear regions than shallow ones.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.

[Figure: input space $x \in \mathbb{R}^2$ with axes $x_1$, $x_2$.]

slide-58
SLIDE 58

ReLU

  • ReLU is not only important for improved convergence.
  • A deep network with ReLU (referred to as a rectifier network) produces substantially more linear regions than shallow ones.

Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.

[Figure: input space $x \in \mathbb{R}^2$ with axes $x_1$, $x_2$, passed through a ReLU layer applied to a weight matrix (entries 1, 1, −1, −1) times $x$.]

slide-60
SLIDE 60

Obvious Questions?

  • How many layers?
  • Is the solution globally optimal?
  • What non-linearity should you use?
  • What learning rate?
  • How should I estimate my gradients?
slide-62
SLIDE 62

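Back-Propagation

$\begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_K} \end{bmatrix}$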

slide-64
SLIDE 64

Back Propagation

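  • Back propagation refers to the property that components of gradients found at higher layers can be re-used at lower layers.

[Figure: backward propagation of errors. The error $\delta_j$ for hidden unit $z_j$ is computed from the errors $\delta_k$ (down to $\delta_1$) of the units k that unit j feeds, through the weights $w_{kj}$; the weight $w_{ji}$ connects unit $z_i$ to unit $z_j$.]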
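The re-use of higher-layer gradient components can be made concrete for a two-layer network. The sketch below assumes a tanh hidden layer, a single linear output and a squared-error loss $\frac{1}{2}(y - t)^2$; these choices and all names are illustrative assumptions rather than the lecture's own setup.

```python
import math


def forward(x, W1, w2):
    """Hidden activations z_j = tanh(sum_i w_ji x_i) and output y = sum_j w2_j z_j."""
    z = [math.tanh(sum(wji * xi for wji, xi in zip(row, x))) for row in W1]
    y = sum(wj * zj for wj, zj in zip(w2, z))
    return z, y


def loss(x, t, W1, w2):
    _, y = forward(x, W1, w2)
    return 0.5 * (y - t) ** 2


def backprop(x, t, W1, w2):
    """Return gradients of the loss with respect to W1 and w2."""
    z, y = forward(x, W1, w2)
    delta_out = y - t                        # error at the output unit
    grad_w2 = [delta_out * zj for zj in z]
    # The output error is re-used for each hidden unit: delta_j = h'(a_j) * w2_j * delta_out,
    # where h'(a_j) = 1 - z_j^2 for tanh.
    delta_hidden = [(1.0 - zj ** 2) * wj * delta_out for zj, wj in zip(z, w2)]
    grad_W1 = [[dj * xi for xi in x] for dj in delta_hidden]
    return grad_W1, grad_w2


# Hypothetical example values.
x, t = [0.5, -1.0], 1.0
W1 = [[0.1, -0.2], [0.3, 0.4]]
w2 = [0.7, -0.5]
grad_W1, grad_w2 = backprop(x, t, W1, w2)
```

A finite-difference comparison against `loss` is a quick way to check such a hand-written gradient before trusting it in training.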

slide-65
SLIDE 65

Back Propagation

  • Overfitting can occur when training large networks.
  • Common strategies include,
  • “Dropout” - randomly omits hidden units in the network during training (kind of like very efficient bagging).
  • “Maxout” - an activation function that takes the max over a group of linear units, designed to be used together with dropout.
  • I. J. Goodfellow et al. Maxout networks. arXiv:1302.4389, 2013.
  • G. E. Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
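Dropout on a vector of hidden activations can be sketched as follows. Note that this is the "inverted" variant, which rescales the surviving units at training time so that nothing changes at test time; that variant, the function name and the parameter names are assumptions for this example and differ in detail from the test-time weight scaling described by Hinton et al.

```python
import random


def dropout(z, p_drop, rng, train=True):
    """Zero each hidden activation with probability p_drop (inverted dropout).

    Surviving activations are divided by the keep probability so that the
    expected value of each unit is unchanged; at test time z is returned as is.
    """
    if not train or p_drop == 0.0:
        return list(z)
    keep = 1.0 - p_drop
    return [zi / keep if rng.random() < keep else 0.0 for zi in z]


rng = random.Random(0)
z = [1.0, 2.0, 3.0, 4.0]
z_train = dropout(z, p_drop=0.5, rng=rng, train=True)
z_test = dropout(z, p_drop=0.5, rng=rng, train=False)
```

Each training pass samples a different mask, so the network is effectively trained as a large ensemble of thinned networks that share weights, which is the sense in which dropout resembles bagging.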
slide-66
SLIDE 66

Multiple Layers

“2-layer neural net,” or “1-hidden-layer neural net”

“3-layer neural net,” or “2-hidden-layer neural net”

slide-67
SLIDE 67
slide-68
SLIDE 68

I(x, y) I(x + 1, y + 1) I

Simoncelli & Olshausen 2001

slide-69
SLIDE 69

I(x, y) I I(x + 8, y + 8)

Simoncelli & Olshausen 2001

slide-70
SLIDE 70

I(x, y) I I(x + 16, y + 16)

Simoncelli & Olshausen 2001

slide-71
SLIDE 71

I(x, y) I I(x + 50, y + 50)

Simoncelli & Olshausen 2001

slide-73
SLIDE 73

Today

  • Single-Layer Perceptron
  • Multi-Layer Perceptron
  • Convolutional Neural Network
slide-74
SLIDE 74

Convolutional Neural Network

Input image Convolutional layer Sub-sampling layer

LeCun 1980

slide-75
SLIDE 75

Reminder: Convolution

$x = [8, 4, 6, 2, 7]$   “signal”

$h = [1, 2]$   “filter”

$x * h$, where $*$ is the “convolution operator”
slide-76
SLIDE 76

Reminder: Convolution

$x = [8, 4, 6, 2, 7]$   “signal”

$h = [1, 2]$   “filter”

$x * h$, where $*$ is the “convolution operator”

>> conv(x, h, 'valid')
ans = 20 14 14 11
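The same 'valid' convolution can be reproduced without MATLAB. The following pure-Python sketch is an illustration (the function name `conv_valid` is an assumption); it flips the filter and slides it over the signal without padding.

```python
def conv_valid(x, h):
    """'Valid' 1-D convolution: flip the filter h and slide it over x without padding."""
    n, k = len(x), len(h)
    return [sum(h[j] * x[i + k - 1 - j] for j in range(k)) for i in range(n - k + 1)]


x = [8, 4, 6, 2, 7]
h = [1, 2]
y = conv_valid(x, h)  # [20, 14, 14, 11], matching the MATLAB output above
```

The first output is $1 \cdot 4 + 2 \cdot 8 = 20$: because convolution flips the filter, the filter tap 2 multiplies the earlier sample.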

slide-77
SLIDE 77

Reminder: Convolution

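$Hx = \begin{bmatrix} 2 & 1 & 0 & 0 & 0 \\ 0 & 2 & 1 & 0 & 0 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 2 & 1 \end{bmatrix} \begin{bmatrix} 8 \\ 4 \\ 6 \\ 2 \\ 7 \end{bmatrix} = \begin{bmatrix} 20 \\ 14 \\ 14 \\ 11 \end{bmatrix}$

$H$ = “convolutional matrix”, $x$ = “signal”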

slide-78
SLIDE 78

Reminder: Convolution

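$Xh = \begin{bmatrix} 4 & 8 \\ 6 & 4 \\ 2 & 6 \\ 7 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 20 \\ 14 \\ 14 \\ 11 \end{bmatrix}$

$X$ = “convolutional signal”, $h$ = “filter”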
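Both matrix forms can be built programmatically and checked against each other. This is a minimal sketch for the 'valid' case used in the reminder slides; the function names `conv_matrix` and `signal_matrix` are illustrative assumptions.

```python
def conv_matrix(h, n):
    """Build H so that H x is the 'valid' convolution of a length-n signal with h."""
    k = len(h)
    flipped = h[::-1]
    return [[flipped[c - r] if 0 <= c - r < k else 0 for c in range(n)]
            for r in range(n - k + 1)]


def signal_matrix(x, k):
    """Build X so that X h is the same convolution, written as linear in the filter h."""
    return [[x[r + k - 1 - j] for j in range(k)] for r in range(len(x) - k + 1)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


x = [8, 4, 6, 2, 7]
h = [1, 2]
H = conv_matrix(h, len(x))
X = signal_matrix(x, len(h))
Hx = matvec(H, x)
Xh = matvec(X, h)
```

Writing the operation as $Hx$ makes it linear in the signal, and writing it as $Xh$ makes it linear in the filter; the second view is the convenient one when the filter is the quantity being learned.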

slide-79
SLIDE 79

Question?

  • Can you derive?

$\frac{\partial (h * x)}{\partial h^T}$

slide-80
SLIDE 80

Multiple Filters

$\begin{bmatrix} x * h_1 \\ \vdots \\ x * h_M \end{bmatrix}$, of size $(D \cdot M \times 1)$

slide-81
SLIDE 81

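Multiple Filters

$\begin{bmatrix} x * h_1 \\ \vdots \\ x * h_M \end{bmatrix} = \begin{bmatrix} H_1 \\ \vdots \\ H_M \end{bmatrix} x$

with sizes $(D \cdot M \times 1)$, $(D \cdot M \times D)$ and $(D \times 1)$ respectively.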

slide-82
SLIDE 82

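Multiple Filters

$\begin{bmatrix} x * h_1 \\ \vdots \\ x * h_M \end{bmatrix} = \begin{bmatrix} H_1 \\ \vdots \\ H_M \end{bmatrix} x$

with sizes $(D \cdot M \times 1)$, $(D \cdot M \times D)$ and $(D \times 1)$ respectively. The stacked $(D \cdot M \times D)$ matrix is the “convolution matrix”.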

slide-83
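SLIDE 83

Convolutional Neural Network

$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(D \cdot M \times D)$

$[w^{(2)}]^T z \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T z < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times D \cdot M)$.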

slide-84
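SLIDE 84

Convolutional Neural Network

$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(D \cdot M \times D)$

$W^{(1)} x = \begin{bmatrix} W_1^{(1)} \\ \vdots \\ W_M^{(1)} \end{bmatrix} x = \begin{bmatrix} x * w_1^{(1)} \\ \vdots \\ x * w_M^{(1)} \end{bmatrix}$

$[w^{(2)}]^T z \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T z < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times D \cdot M)$.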

slide-85
SLIDE 85

Convolutional Neural Network

$z = h[W^{(1)} x]$, where $W^{(1)}$ is $(D \cdot M \times D)$

$[w^{(2)}]^T z \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T z < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times D \cdot M)$.

slide-86
SLIDE 86

Convolutional Neural Network

$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(D \cdot M \times D)$ and $z$ is $(D \cdot M \times 1)$

slide-87
SLIDE 87

Convolutional Neural Network

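$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(D \cdot M \times D)$ and $z$ is $(D \cdot M \times 1)$

$[w^{(2)}]^T \psi\{z\} \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T \psi\{z\} < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times K)$.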

slide-88
SLIDE 88

Convolutional Neural Network

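$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(D \cdot M \times D)$ and $z$ is $(D \cdot M \times 1)$

$\psi\{z\} = Dz$, where the matrix $D$ is $(K \times D \cdot M)$

$[w^{(2)}]^T \psi\{z\} \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T \psi\{z\} < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times K)$.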

slide-89
SLIDE 89

Convolutional Neural Network

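$x \rightarrow W^{(1)} \rightarrow z$, where $W^{(1)}$ is $(D \cdot M \times D)$ and $z$ is $(D \cdot M \times 1)$

$\psi\{z\} = Dz$   “pooling”, where the matrix $D$ is $(K \times D \cdot M)$

$[w^{(2)}]^T \psi\{z\} \geq 0 \Rightarrow x \in C_1, \qquad [w^{(2)}]^T \psi\{z\} < 0 \Rightarrow x \in C_2$

where $[w^{(2)}]^T$ is $(1 \times K)$.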
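The full pipeline of convolution, non-linearity, pooling and linear classification can be sketched end to end in one dimension. Several simplifications are assumed here and are not taken from the slides: 'valid' convolution (so each filter yields fewer than D outputs), ReLU as the non-linearity, and average pooling over each filter's response (a linear pooling $\psi\{z\} = Dz$ with K = M). All names and values are hypothetical.

```python
def conv_valid(x, h):
    """'Valid' 1-D convolution of signal x with filter h."""
    n, k = len(x), len(h)
    return [sum(h[j] * x[i + k - 1 - j] for j in range(k)) for i in range(n - k + 1)]


def cnn_forward(x, filters, w2):
    """Convolve with each filter, apply ReLU, average-pool per filter, then threshold."""
    responses = [conv_valid(x, h) for h in filters]          # stacked W1 x
    z = [[max(0.0, r) for r in resp] for resp in responses]  # z = h[W1 x] with ReLU
    pooled = [sum(zm) / len(zm) for zm in z]                 # psi{z} = D z (averaging)
    score = sum(w * p for w, p in zip(w2, pooled))           # [w2]^T psi{z}
    return pooled, ("C1" if score >= 0 else "C2")


x = [8, 4, 6, 2, 7]
filters = [[1, 2], [1, -1]]  # hypothetical first-layer filters (M = 2)
w2 = [0.1, -1.0]             # hypothetical second-layer weights (K = 2)
pooled, label = cnn_forward(x, filters, w2)
```

Compared with the MLP, the first layer here has only a handful of free parameters (the filter taps), since every row of the convolution matrix re-uses the same filter.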

slide-90
SLIDE 90

Current State of the Art

image patch 3@ (227x227)

conv1 96@ (55x55) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (K)

  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

“car” “bird” “cat”

. . .

slide-92
SLIDE 92

Current State of the Art

image patch 3@ (227x227)

conv1 96@ (55x55) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (K)

  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Output: a $K \times 1$ vector, $[1, \ldots]^T$

slide-93
SLIDE 93

Current State of the Art - Pose Selection

image patch 3@ (224x224)

fc-8 conv1 64@ (54x54) conv2 256@ (27x27) conv3 384@ (13x13) conv4 384@ (13x13) conv5 256@ (13x13) fc-6 (4096) fc-7 (4096)

“car” “bird” “cat”

. . .

  • K. Chatfield, V. Lempitsky, A. Vedaldi and A. Zisserman. “Return of the Devil in the Details: Delving Deep into Convolutional Networks.” In BMVC, 2014.

  • A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
slide-94
SLIDE 94
slide-95
SLIDE 95

Impact on Speech Recognition

slide-96
SLIDE 96

Impact on Object Recognition

ImageNet Challenge Year

BC

(before ConvNets)

AD

(after deep learning)

6.8%

slide-97
SLIDE 97

Accuracy of prior art versus feature learning. Domains shown on the slide: Audio, Images, Galaxy, Multimodal (audio/video), Video, Text/NLP.

Domain | Task | Prior art | Feature learning
Audio | TIMIT Phone classification | 79.6% (Clarkson et al., 1999) | 80.3%
Audio | TIMIT Speaker identification | 99.7% (Reynolds, 1995) | 100.0%
Images | CIFAR Object classification | 80.5% (Ciresan et al., 2011) | 82.0%
Images | NORB Object classification | 94.4% (Scherer et al., 2010) | 95.0%
Multimodal (audio/video) | AVLetters Lip reading | 58.9% (Zhao et al., 2009) | 65.8% (Stanford Feature learning)
Video | Hollywood2 Classification | 48% (Laptev et al., 2004) | 53%
Video | KTH | 92.1% (Wang et al., 2010) | 93.9%
Video | UCF | 85.6% (Wang et al., 2010) | 86.5%
Video | YouTube | 71.2% (Liu et al., 2009) | 75.8%
Text/NLP | Paraphrase detection | 76.1% (Das & Smith, 2009) | 76.4%
Text/NLP | Sentiment (MR/MPQA data) | 77.3% (Nakagawa et al., 2010) | 77.7%

slide-98
SLIDE 98

Visualizing CNNs

slide-99
SLIDE 99

More to read…

  • Bishop, “Pattern Recognition and Machine Learning”, 2006. Chapter 5.
  • Goodfellow, Bengio & Courville “Deep Learning”.