Deep Learning for Mobile Part I
Instructor - Simon Lucey
16-623 - Designing Computer Vision Apps
Today
Single Layer Perceptron
Multi-Layer Perceptron
Convolutional Neural Network
Linear Binary Classification
An input image (e.g. of the digit "4") is vectorized into its pixel intensities [65,09,67,.......,78,66,76,215], giving $x \in \mathbb{R}^D$.
$$w^T x + w_0 \begin{cases} \geq 0 & \Rightarrow x \in C_1 \\ < 0 & \Rightarrow x \in C_2 \end{cases}$$
This classifier is known as the “Perceptron” or “Linear Discriminant”.
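The decision rule above can be sketched in a few lines of NumPy. This is only an illustration: the function name `classify` and the toy weights and inputs are assumptions, not values from the slides.

```python
import numpy as np

def classify(x, w, w0):
    """Return 1 (class C1) if w^T x + w0 >= 0, otherwise 2 (class C2)."""
    score = float(np.dot(w, x) + w0)
    return 1 if score >= 0.0 else 2

# Hypothetical toy values with D = 3 "pixels" (not taken from the slides).
w = np.array([0.5, -1.0, 0.25])
w0 = -10.0
x_a = np.array([65.0, 9.0, 67.0])    # score = 32.5 - 9 + 16.75 - 10 = 30.25
x_b = np.array([10.0, 40.0, 20.0])   # score = 5 - 40 + 5 - 10 = -40

print(classify(x_a, w, w0), classify(x_b, w, w0))
```

The first input lands on the non-negative side of the hyperplane and the second on the negative side.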
Why Linear?
The number of required samples is linear with respect to the dimensionality.
[Figure: number of required samples n plotted against dimensionality D.]
Perceptron
… an IBM 704 computer at Cornell in 1957.
… illuminated by powerful lights and captured on a 20x20 array of cadmium sulphide photocells.
… using variable rotary resistors.
… neural network.
“Frank Rosenblatt”
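Linear Discriminant Functions
[Figure: geometry of a linear discriminant function in two dimensions, with axes $x_1$ and $x_2$. The decision surface $y = 0$ is perpendicular to $w$ and separates region $R_1$ ($y > 0$) from region $R_2$ ($y < 0$). Its displacement from the origin is $-w_0 / \|w\|$, and the signed distance of a point $x$ from the surface is $y(x) / \|w\|$, where $x_\perp$ is the projection of $x$ onto the surface. Points in $R_1$ are assigned to $C_1$ and points in $R_2$ to $C_2$.]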
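Linear Binary Classification
An input image (e.g. of the digit "9") is vectorized into its pixel intensities [65,09,67,.......,78,66,76,215], giving $x \in \mathbb{R}^D$.
The bias can be absorbed into the weight vector by appending a constant 1 to the input:
$$\begin{bmatrix} w \\ w_0 \end{bmatrix}^T \begin{bmatrix} x \\ 1 \end{bmatrix} \begin{cases} \geq 0 & \Rightarrow x \in C_1 \\ < 0 & \Rightarrow x \in C_2 \end{cases}$$
With $w$ and $x$ redefined as these augmented vectors, the rule is written simply as
$$w^T x \begin{cases} \geq 0 & \Rightarrow x \in C_1 \\ < 0 & \Rightarrow x \in C_2 \end{cases}$$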
Perceptron Linear Discriminant
Binary labels: $t_i = -1$ or $t_i = +1$
$x_i$ = i-th training example
$w$ = weight vector
$$\arg\min_w \sum_{n=1}^{N} \max(0, -t_n \cdot x_n^T w)$$
A correctly classified example has $t_n \cdot x_n^T w > 0$ and contributes zero to the sum.
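Perceptron Linear Discriminant
Binary labels: $t_i = -1$ or $t_i = +1$
$x_i$ = i-th training example
$w$ = weight vector
With a general error function $E$:
$$\arg\min_w \sum_{n=1}^{N} E(t_n \cdot x_n^T w)$$
$$\text{margin} \propto (w^T w)^{-1}$$
Adding a penalty on $\|w\|_2^2$:
$$\arg\min_w \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$$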
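Other Objectives
[Plot: $E(z)$ versus $z$, with $z$ from about −2 to 2, for the objectives below.]
least-squares: $E(z) = \|z - 1\|_2^2$
sigmoid: $E(z) = \dfrac{1}{1 + \exp(-z)}$
hinge: $E(z) = \max(0, 1 - z)$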
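The three objectives listed above can be evaluated directly as functions of $z = t_n \cdot x_n^T w$. The sketch below simply transcribes the formulas as they appear on the slide; the function names and the sample grid of $z$ values are assumptions for illustration.

```python
import numpy as np

def least_squares(z):
    """E(z) = ||z - 1||^2, applied element-wise to scalar z values."""
    return (z - 1.0) ** 2

def sigmoid(z):
    """E(z) = 1 / (1 + exp(-z)), as written on the slide."""
    return 1.0 / (1.0 + np.exp(-z))

def hinge(z):
    """E(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

# Hypothetical sample points spanning the plotted range.
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, E in [("least-squares", least_squares), ("sigmoid", sigmoid), ("hinge", hinge)]:
    print(name, E(z))
```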
Optimizing Weights
$$f(w) = \sum_{n=1}^{N} E(t_n \cdot x_n^T w) + \frac{\lambda}{2} \|w\|_2^2$$
$$w \rightarrow w - \eta \frac{\partial f(w)}{\partial w}$$
where $\eta$ is the “Learning Rate”.
Gradient-Descent Optimization
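Optimizing Weights
$$\begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} - \eta \begin{bmatrix} \dfrac{\partial f(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial f(w)}{\partial w_K} \end{bmatrix}$$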
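A minimal sketch of this update loop, assuming the descent form $w \leftarrow w - \eta \, \partial f(w) / \partial w$. The helper `gradient_descent` and the quadratic test objective are hypothetical and chosen only because the minimizer is known in closed form.

```python
import numpy as np

def gradient_descent(grad_f, w_init, eta=0.1, num_steps=100):
    """Repeatedly step against the gradient: w <- w - eta * grad_f(w)."""
    w = np.asarray(w_init, dtype=float).copy()
    for _ in range(num_steps):
        w = w - eta * grad_f(w)
    return w

# Hypothetical objective f(w) = ||w - a||^2 with gradient 2 (w - a); its minimizer is a.
a = np.array([1.0, -2.0, 0.5])
w_star = gradient_descent(lambda w: 2.0 * (w - a), np.zeros(3))
print(w_star)
```

With $\eta = 0.1$ the distance to the minimizer shrinks by a factor of 0.8 per step, so 100 steps are more than enough here; other objectives would need their own choice of learning rate.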
Optimizing Weights - Per Sample
$$f(w) = \sum_{n=1}^{N} f_n(w)$$
$$w \rightarrow w - \eta N \frac{\partial f_n(w)}{\partial w}$$
where $\eta$ is the “Learning Rate”.
Single Layer - Example
$$f_n(w) = \frac{1}{2} \|1 - t_n \cdot x_n^T w\|_2^2 + \frac{\lambda}{2N} \|w\|_2^2$$
$$\frac{\partial f_n(w)}{\partial w} = (x_n^T w - t_n) x_n + \frac{\lambda}{N} w$$
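The per-sample gradient above is enough to train the single-layer model with stochastic updates. The sketch below is an assumption-laden illustration: the synthetic data, `w_true`, the learning rate and the number of epochs are all hypothetical, and any factor of $N$ in the step is folded into `eta`.

```python
import numpy as np

def f_n(w, x_n, t_n, lam, N):
    """Per-sample objective: 0.5 * (1 - t_n x_n^T w)^2 + lam / (2N) * ||w||^2."""
    return 0.5 * (1.0 - t_n * (x_n @ w)) ** 2 + lam / (2.0 * N) * (w @ w)

def grad_f_n(w, x_n, t_n, lam, N):
    """Per-sample gradient: (x_n^T w - t_n) x_n + (lam / N) w, valid for t_n = +/-1."""
    return ((x_n @ w) - t_n) * x_n + (lam / N) * w

# Hypothetical synthetic data: N points in D dimensions, last entry fixed to 1 (bias).
rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
X[:, -1] = 1.0
w_true = np.array([2.0, -1.0, 0.5])          # assumed generating weights
t = np.where(X @ w_true >= 0, 1.0, -1.0)     # binary labels in {-1, +1}

w = np.zeros(D)
eta, lam = 0.01, 0.1
for epoch in range(20):
    for n in rng.permutation(N):             # visit samples in random order
        w = w - eta * grad_f_n(w, X[n], t[n], lam, N)

accuracy = np.mean(np.where(X @ w >= 0, 1.0, -1.0) == t)
print(w, accuracy)
```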
Today
Shallow Networks
… to learn a function that has $2k$ zero-crossings along some line.
… maximally varying functions over $d$ inputs requires $O(2^d)$ examples.
Hierarchical Learning
[Figure: the ventral visual stream as a hierarchy, V1 → V2/V4 → IT, with simple cells, complex cells and view-tuned cells. Credit: Bob Crimi.]
Hierarchical Learning
Successive model layers learn deeper intermediate representations.
[Figure: features learned at Layer 1, Layer 2 and Layer 3, up to high-level linguistic representations. Parts combine to form objects.]
(Lee, Grosse, Ranganath & Ng, ICML 2009)
Prior: underlying factors & concepts compactly expressed w/ multiple levels of abstraction.
Why Deep?
… several or more hidden layers.
… shallow ones.
[Figure: Shallow Network versus Deep Network.]
Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.
Shallow Computer Program
main
    subroutine1 includes subsub1 code and subsub2 code and subsubsub1 code
    subroutine2 includes subsub2 code and subsub3 code and subsubsub3 code and …
Deep Computer Program
main
    sub1, sub2, sub3
        subsub1, subsub2, subsub3
            subsubsub1, subsubsub2, subsubsub3
Multi-Layer Perceptron
First layer: $W^{(1)} x$, where $W^{(1)}$ is $(M \times D)$.
Non-linearity: $z = h(W^{(1)} x)$. [Plot: $h(x)$ versus $x$.]
Second layer, where $[w^{(2)}]^T$ is $(1 \times M)$:
$$[w^{(2)}]^T z \begin{cases} \geq 0 & \Rightarrow x \in C_1 \\ < 0 & \Rightarrow x \in C_2 \end{cases}$$
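Multi-Layer Perceptron
[Figure: two-layer network diagram. Inputs $x_0, x_1, \ldots, x_D$ feed hidden units $z_0, z_1, \ldots, z_M$ through first-layer weights such as $w^{(1)}_{MD}$; the hidden units feed outputs $y_1, \ldots, y_K$ through second-layer weights such as $w^{(2)}_{KM}$ and $w^{(2)}_{10}$.]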
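Layer 1 - MLP
$$z = \begin{bmatrix} z_1 \\ \vdots \\ z_M \end{bmatrix} \leftarrow \begin{bmatrix} h[x^T w^{(1)}_1] \\ \vdots \\ h[x^T w^{(1)}_M] \end{bmatrix}$$
$h()$ = non-linear function
$[w^{(1)}_1, \ldots, w^{(1)}_M]$ = 1st layer’s $D \times M$ weights
$x$ = $D \times 1$ raw input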
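Layer 2 - MLP
Raw input [65,09,67,.......,78,66,76,215], $x \in \mathbb{R}^D$; layer 1 output $z \in \mathbb{R}^M$.
$$z^T w^{(2)} \begin{cases} \geq 0 & \Rightarrow z \in C_1 \\ < 0 & \Rightarrow z \in C_2 \end{cases}$$
$z$ = $M \times 1$ output of layer 1
$w^{(2)}$ = 2nd layer’s $M \times 1$ weight vector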
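The two layers can be combined into a single forward pass. The sketch below assumes a sigmoid for $h$ (the slides only say "non-linear function") and uses random, hypothetical weights; `mlp_forward` is an illustrative name.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, w2, h=sigmoid):
    """x: (D,) raw input; W1: (D, M) with columns w_1^(1)..w_M^(1); w2: (M,)."""
    z = h(W1.T @ x)                   # layer 1: z_m = h[x^T w_m^(1)], shape (M,)
    score = float(z @ w2)             # layer 2: z^T w^(2)
    label = 1 if score >= 0 else 2    # C1 if score >= 0, else C2
    return z, score, label

# Hypothetical sizes and weights (not from the slides).
rng = np.random.default_rng(0)
D, M = 8, 4
x = rng.normal(size=D)
W1 = 0.5 * rng.normal(size=(D, M))
w2 = rng.normal(size=M)

z, score, label = mlp_forward(x, W1, w2)
print(z.shape, score, label)
```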
Obvious Questions?
How Deep?
… good performance (e.g. ImageNet).
… have higher train error than shallow networks.
[Figure 2. Residual learning: a building block. The input $x$ passes through a weight layer, a relu and a second weight layer to give $F(x)$; an identity shortcut carries $x$ around them, and the sum $F(x) + x$ is followed by a relu.]
He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
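The building block in the figure can be sketched for plain vectors as follows. This is a simplification under stated assumptions: the "weight layers" are taken to be square matrices without biases or normalization, and `residual_block`, `W_a` and `W_b` are hypothetical names.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x, W_a, W_b):
    """Compute relu(F(x) + x) with F(x) = W_b relu(W_a x)."""
    F = W_b @ relu(W_a @ x)    # weight layer -> relu -> weight layer
    return relu(F + x)         # add the identity shortcut, then relu

x = np.array([1.0, -2.0, 3.0])
zero = np.zeros((3, 3))
print(residual_block(x, zero, zero))            # F(x) = 0, so the block reduces to relu(x)
print(residual_block(x, np.eye(3), np.eye(3)))
```

With zero weights the block passes `relu(x)` straight through, which illustrates why adding such blocks need not make the mapping harder to represent.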
How Deep?
[Figure: error (%) curves for ResNet-20, ResNet-32, ResNet-44, ResNet-56 and ResNet-110, with the 20-layer and 110-layer curves marked. … training error, and bold lines denote testing error.]
He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
Obvious Questions?
Convexity Not Needed
… needed to guarantee the global optimality of deep networks.
Convexity is not needed.
… saddle point problem for non-convex optimization
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Loss Surface of Multilayer Nets
Saddle Points
f(w)
Pascanu, Razvan, et al. "On the saddle point problem for non-convex optimization." arXiv preprint arXiv:1405.4604 (2014).
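Saddle Point
“Hessian matrix”:
$$\nabla^2 f(w) = H$$
“Eigen-decomposition”:
$$H = V \operatorname{diag}(\lambda) V^T$$
Fraction of negative eigenvalues at a “Critical Point”:
$$\frac{\sum_{d=1}^{D} (\lambda_d < 0)}{D}$$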
Pascanu, Razvan, et al. "On the saddle point problem for non-convex optimization." arXiv preprint arXiv:1405.4604 (2014).
Index of Critical Point
Pascanu, Razvan, et al. "On the saddle point problem for non-convex optimization." arXiv preprint arXiv:1405.4604 (2014).
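As a small illustration, the fraction of negative Hessian eigenvalues can be computed numerically. The function name `critical_point_index` and the example Hessians are assumptions; the first example is the Hessian of $f(w) = w_1^2 - w_2^2$ at its critical point $w = 0$.

```python
import numpy as np

def critical_point_index(H):
    """Fraction of negative eigenvalues of a symmetric Hessian H."""
    lam = np.linalg.eigvalsh(H)    # eigenvalues of H = V diag(lam) V^T
    return np.mean(lam < 0)

H_saddle = np.diag([2.0, -2.0])    # Hessian of w1^2 - w2^2: one direction up, one down
print(critical_point_index(H_saddle))      # saddle point
print(critical_point_index(np.eye(3)))     # local minimum: no negative eigenvalues
print(critical_point_index(-np.eye(3)))    # local maximum: all eigenvalues negative
```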
Obvious Questions?
ReLU
[Figure: ReLU and Sigmoid.]
Krizhevsky et al. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS 2012.
ReLU(x) = max(0, x)
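A short sketch comparing ReLU with the sigmoid and their derivatives. The sample inputs are hypothetical, and taking the ReLU derivative at exactly 0 to be 0 is a convention assumed here.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, else 0 (value at x = 0 chosen as 0)."""
    return (x > 0).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(relu(x), relu_grad(x))
print(sigmoid(x), sigmoid_grad(x))
```

The sigmoid's derivative peaks at 0.25 and shrinks quickly for large |x|, whereas the ReLU derivative stays at 1 for every positive input.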
ReLU
… produces substantially more linear regions than shallow …
[Figure: input $x \in \mathbb{R}^2$ with coordinates $x_1$, $x_2$ passed through ReLU units; labels 1, 1, −1, −1.]
Montufar, Guido F., et al. "On the number of linear regions of deep neural networks." NIPS 2014.
Obvious Questions?
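Back-Propagation
$$\begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ \vdots \\ w_K \end{bmatrix} - \eta \begin{bmatrix} \dfrac{\partial f(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial f(w)}{\partial w_K} \end{bmatrix}$$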
Back Propagation
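… gradients found at higher layers can be re-used at lower layers.
[Figure: unit $i$ with activation $z_i$ feeds unit $j$ (activation $z_j$) through weight $w_{ji}$, and unit $j$ feeds units $k$ through weights $w_{kj}$. The errors $\delta_k$ (e.g. $\delta_1$) are propagated backwards to obtain $\delta_j$. …]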
Back Propagation
… very efficient bagging).
… functions.
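A minimal sketch of back-propagation for a two-layer network, showing how the gradient computed at the output layer is re-used for the first layer. The choices here are assumptions for illustration: a sigmoid hidden non-linearity, the least-squares loss $\frac{1}{2}(1 - t \cdot y)^2$ on the output score $y$, and random hypothetical weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_and_grads(x, t, W1, w2):
    """x: (D,), t in {-1, +1}, W1: (D, M), w2: (M,). Returns loss and both gradients."""
    # Forward pass.
    a = W1.T @ x                        # hidden pre-activations, shape (M,)
    z = sigmoid(a)                      # hidden activations
    y = z @ w2                          # output score
    loss = 0.5 * (1.0 - t * y) ** 2
    # Backward pass.
    delta_out = -t * (1.0 - t * y)      # d loss / d y
    grad_w2 = delta_out * z             # gradient for the top layer
    delta_hid = z * (1.0 - z) * (w2 * delta_out)   # re-uses delta_out: h'(a_j) * w_j * delta
    grad_W1 = np.outer(x, delta_hid)    # gradient for the first layer, shape (D, M)
    return loss, grad_W1, grad_w2

# Hypothetical sizes and values.
rng = np.random.default_rng(1)
D, M = 5, 3
x = rng.normal(size=D)
t = -1.0
W1 = rng.normal(size=(D, M))
w2 = rng.normal(size=M)

loss, grad_W1, grad_w2 = loss_and_grads(x, t, W1, w2)
print(loss, grad_W1.shape, grad_w2.shape)
```

Comparing these analytic gradients against finite differences of the loss is a common sanity check before trusting an implementation.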
Multiple Layers
“2-layer neural net,” or “1-hidden-layer neural net”
“3-layer neural net,” or “2-hidden-layer neural net”
[Figure: joint statistics of pixel intensity $I(x, y)$ against a shifted pixel $I(x + k, y + k)$, shown for shifts $k$ = 1, 8, 16 and 50.]
Simoncelli & Olshausen 2001
Today
Convolutional Neural Network
Input image Convolutional layer Sub-sampling layer
LeCun 1980
Reminder: Convolution
“signal” x = [8 4 6 2 7]
“filter” h = [1 2]
“convolution” x ∗ h
>> x = [8 4 6 2 7]; h = [1 2];
>> conv(x,h,'valid')
ans = 20 14 14 11
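Reminder: Convolution
“convolutional matrix” $H$ applied to the “signal” $x$:
$$Hx = \begin{bmatrix} 2 & 1 & 0 & 0 & 0 \\ 0 & 2 & 1 & 0 & 0 \\ 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 2 & 1 \end{bmatrix} \begin{bmatrix} 8 \\ 4 \\ 6 \\ 2 \\ 7 \end{bmatrix} = \begin{bmatrix} 20 \\ 14 \\ 14 \\ 11 \end{bmatrix}$$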
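Reminder: Convolution
“convolutional signal” $X$ applied to the “filter” $h$:
$$Xh = \begin{bmatrix} 4 & 8 \\ 6 & 4 \\ 2 & 6 \\ 7 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 20 \\ 14 \\ 14 \\ 11 \end{bmatrix}$$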
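The three views of the same convolution (direct, filter-as-matrix, signal-as-matrix) can be checked numerically. This sketch uses the slide's $x$ and $h$; the construction loops are one possible way to build $H$ and $X$, not necessarily how the slides derived them.

```python
import numpy as np

x = np.array([8, 4, 6, 2, 7])   # "signal"
h = np.array([1, 2])            # "filter"

# Direct "valid" convolution, the NumPy analogue of conv(x, h, 'valid').
y = np.convolve(x, h, mode="valid")

n_out = len(x) - len(h) + 1

# "Convolutional matrix" H: each row holds the flipped filter, shifted by one.
H = np.zeros((n_out, len(x)), dtype=int)
for i in range(n_out):
    H[i, i:i + len(h)] = h[::-1]

# "Convolutional signal" X: each row holds a flipped window of the signal.
X = np.array([x[i:i + len(h)][::-1] for i in range(n_out)])

print(y)
print(H @ x)
print(X @ h)
```

All three printed vectors agree, which is what makes it possible to treat a convolutional layer as an ordinary (structured) matrix product.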
Question?
$$\frac{\partial (h * x)}{\partial h^T}$$
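One way to work this out, assuming the matrix form $h * x = Xh$ from the convolution reminder and the convention that $\partial (\cdot) / \partial h^T$ denotes the Jacobian with one row per output entry, is sketched below. The numerical $X$ is the one built from the example signal $x = [8\ 4\ 6\ 2\ 7]$.

```latex
\[
h * x = X h
\quad\Longrightarrow\quad
\frac{\partial (h * x)}{\partial h^{T}}
= \frac{\partial (X h)}{\partial h^{T}}
= X
= \begin{bmatrix} 4 & 8 \\ 6 & 4 \\ 2 & 6 \\ 7 & 2 \end{bmatrix}
\]
```

Because the convolution is linear in $h$, the derivative does not depend on $h$ itself, only on the signal.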
Multiple Filters
$$\begin{bmatrix} x * h_1 \\ \vdots \\ x * h_M \end{bmatrix} = \begin{bmatrix} H_1 \\ \vdots \\ H_M \end{bmatrix} x$$
Sizes: the stacked outputs are $(D \cdot M \times 1)$, the stacked “convolution matrix” is $(D \cdot M \times D)$, and $x$ is $(D \times 1)$.
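A sketch of stacking several convolution matrices into one $(D \cdot M \times D)$ matrix. For each $H_m$ to be $D \times D$ the output must have the same length as the input, so this example assumes zero-padded "same" convolution; that choice, the helper `conv_matrix` and the second filter are assumptions, not taken from the slides.

```python
import numpy as np

def conv_matrix(h, D):
    """D x D matrix H with H @ x == np.convolve(x, h, mode='same') for len(x) == D."""
    I = np.eye(D)
    # Column j is the response to the j-th unit impulse; linearity gives the rest.
    return np.stack([np.convolve(I[:, j], h, mode="same") for j in range(D)], axis=1)

x = np.array([8.0, 4.0, 6.0, 2.0, 7.0])
filters = [np.array([1.0, 2.0]), np.array([1.0, 0.0, -1.0])]   # second filter is hypothetical
D, M = len(x), len(filters)

W1 = np.vstack([conv_matrix(h, D) for h in filters])   # stacked "convolution matrix"
stacked = W1 @ x                                        # [x * h_1; ...; x * h_M]
print(W1.shape, stacked)
```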
Convolutional Neural Network
First layer (convolution), with $W^{(1)}$ of size $(D \cdot M \times D)$:
$$W^{(1)} x = \begin{bmatrix} W^{(1)}_1 \\ \vdots \\ W^{(1)}_M \end{bmatrix} x = \begin{bmatrix} x * w^{(1)}_1 \\ \vdots \\ x * w^{(1)}_M \end{bmatrix}$$
Non-linearity, giving $z$ of size $(D \cdot M \times 1)$:
$$z = h[W^{(1)} x]$$
Without pooling, $[w^{(2)}]^T$ is $(1 \times D \cdot M)$ and classifies $z$ directly:
$$[w^{(2)}]^T z \begin{cases} \geq 0 & \Rightarrow x \in C_1 \\ < 0 & \Rightarrow x \in C_2 \end{cases}$$
“pooling”, with $\mathbf{D}$ of size $(K \times D \cdot M)$ applied to $z$ of size $(D \cdot M \times 1)$:
$$\psi\{z\} = \mathbf{D} z$$
With pooling, $[w^{(2)}]^T$ is $(1 \times K)$:
$$[w^{(2)}]^T \psi\{z\} \begin{cases} \geq 0 & \Rightarrow x \in C_1 \\ < 0 & \Rightarrow x \in C_2 \end{cases}$$
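The non-linearity, pooling and final classifier can be sketched as three matrix operations. The slides do not say which pooling is used; since $\psi\{z\} = \mathbf{D}z$ is linear, this example assumes average pooling over non-overlapping windows, takes $h$ to be a ReLU, and uses hypothetical names and toy values throughout.

```python
import numpy as np

def avg_pool_matrix(n, p):
    """(n // p) x n matrix that averages non-overlapping windows of length p.

    Any trailing entries that do not fill a whole window are ignored.
    """
    K = n // p
    P = np.zeros((K, n))
    for k in range(K):
        P[k, k * p:(k + 1) * p] = 1.0 / p
    return P

def cnn_head(W1, x, w2, p=2):
    """Apply z = h[W1 x], psi{z} = P z, then threshold w2^T psi{z} at zero."""
    z = np.maximum(0.0, W1 @ x)          # non-linearity h, assumed to be ReLU
    pooled = avg_pool_matrix(len(z), p) @ z
    score = float(w2 @ pooled)
    label = 1 if score >= 0 else 2       # C1 if score >= 0, else C2
    return pooled, score, label

# Toy example: a single identity "filter" (W1 = I), so only ReLU and pooling act.
W1 = np.eye(4)
x = np.array([1.0, -2.0, 3.0, 5.0])
w2 = np.array([1.0, -1.0])
pooled, score, label = cnn_head(W1, x, w2)
print(pooled, score, label)
```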
Current State of the Art
image patch 3@(227x227) → conv1 96@(55x55) → conv2 256@(27x27) → conv3 384@(13x13) → conv4 384@(13x13) → conv5 256@(13x13) → fc-6 (4096) → fc-7 (K)
Output: a $K \times 1$ vector over the classes “car”, “bird”, “cat”, . . .
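The spatial sizes listed above (227 → 55 → 27 → 13) follow from the usual output-size formula for convolution and pooling. The filter sizes, strides, padding and pooling windows below do not appear on the slide; they are assumptions taken from the commonly used configuration of the Krizhevsky et al. network.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer on an n x n input."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad): assumed layer settings, not stated on the slide.
layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
]

n = 227
sizes = []
for name, kernel, stride, pad in layers:
    n = out_size(n, kernel, stride, pad)
    sizes.append(n)
    print(name, n)
```

Under these assumptions conv1 gives (227 − 11) / 4 + 1 = 55, and each 3x3 stride-2 pooling step takes 55 to 27 and 27 to 13.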
Current State of the Art - Pose Selection
image patch 3@(224x224) → conv1 64@(54x54) → conv2 256@(27x27) → conv3 384@(13x13) → conv4 384@(13x13) → conv5 256@(13x13) → fc-6 (4096) → fc-7 (4096) → fc-8
“car” “bird” “cat” . . .
… In BMVC, 2014.
Impact on Speech Recognition
Impact on Object Recognition
[Figure: ImageNet Challenge results by year, divided into “BC” (before ConvNets) and “AD” (after deep learning); 6.8% is marked.]
Accuracy of prior art versus feature learning. Categories on the slide: Audio, Images, Multimodal (audio/video), Galaxy, Video, Text/NLP.
Category | Task | Prior art | Prior art accuracy | Feature learning accuracy
Audio | TIMIT Phone classification | Clarkson et al., 1999 | 79.6% | 80.3%
Audio | TIMIT Speaker identification | Reynolds, 1995 | 99.7% | 100.0%
Images | CIFAR Object classification | Ciresan et al., 2011 | 80.5% | 82.0%
Images | NORB Object classification | Scherer et al., 2010 | 94.4% | 95.0%
Multimodal (audio/video) | AVLetters Lip reading | Zhao et al., 2009 | 58.9% | 65.8% (Stanford feature learning)
Video | Hollywood2 Classification | Laptev et al., 2004 | 48% | 53%
Video | KTH | Wang et al., 2010 | 92.1% | 93.9%
Video | UCF | Wang et al., 2010 | 85.6% | 86.5%
Video | YouTube | Liu et al., 2009 | 71.2% | 75.8%
Text/NLP | Paraphrase detection | Das & Smith, 2009 | 76.1% | 76.4%
Text/NLP | Sentiment (MR/MPQA data) | Nakagawa et al., 2010 | 77.3% | 77.7%
Visualizing CNNs
More to read…