IDA Machine Learning Seminars - Fall 2015
Deep Convolutional Networks and their impact
on solving large scale visual recognition problems
Hossein Azizpour, Computer Vision Group, KTH. Thanks to: J. Sullivan, A. S. Razavian, A. Maki and S. Carlsson
Deep Learning, and ConvNets in particular, have resulted in:
Output (five guesses): scale, T-shirt, steel drum, drumstick, mud turtle. Ground truth: steel drum (correct).
Output (five guesses): scale, T-shirt, giant panda, drumstick, mud turtle. Ground truth: steel drum (incorrect).
Error = (1/100,000) Σ_i 1(incorrect on image i)
Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
Figure: performance of the winning entry in the ILSVRC competitions (2010-14). Classification error (%): 2010: 28.2, 2011: 25.8, 2012: 16.4, 2013: 11.7, 2014: 6.7. Red indicates when deep ConvNets were introduced.
(After some training period, a human classifier competing against a ConvNet on ImageNet is capable of an error rate in the range of 2%-3%.)
Figure: recent progress made by Baidu, MSR and Google. Classification error (%) continues to fall after the 2010-14 results (28.2, 25.8, 16.4, 11.7, 6.7): Jan 2015: 5.33, Feb 2015: 4.94, Mar 2015: 4.82.
Figure: progress of object detection for the Pascal VOC 2007 challenge, 2007-2015. Accuracy (10-80) is shown per class (plant, person, chair, cat, car, aeroplane) and over all classes; the recent jump coincides with the introduction of deep learning.
ConvNets → a much better image representation
Image retrieval. We have a database of images. Task: given a query image, rank the database images by how close they are to the query.
Figure: a query image and the database images ranked closest to it; each result is marked correct or incorrect.
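As a minimal sketch of the ranking step (not from the slides; the feature vectors here are made-up stand-ins for whatever image representation is used):

```python
import numpy as np

def rank_database(query, db):
    """Rank database feature vectors by Euclidean distance to the query.

    query: (d,) feature vector of the query image.
    db:    (n, d) matrix, one feature vector per database image.
    Returns indices of database images, closest first.
    """
    dists = np.linalg.norm(db - query, axis=1)
    return np.argsort(dists)

# Toy example: database image 2 is identical to the query, so it ranks first.
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.3, 0.4]])
query = np.array([0.3, 0.4])
order = rank_database(query, db)
```

With a good representation, "close in feature space" should mean "same object or scene", which is exactly what the later slides claim ConvNet features deliver.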
Example pipeline (part-based recognition):
Image → Part Annotations (Strong DPM) → Learn Normalized Pose → Extract Features (RGB, gradient, LBP) → SVM.
The same pipeline with the hand-crafted features replaced by a CNN representation, trained on a labelled dataset.
Figure (AlexNet): input image 224×224×3 → convolutional layers with response maps 55×55×48, 27×27×128, 13×13×192, 13×13×192, 13×13×128 → fully connected layers of 4096 and 4096 units → output layer of 1000 units.
Figure: best state-of-the-art vs. ConvNet off-the-shelf + linear SVM, per task (performance, 40-100):

Task                          Best state-of-the-art   ConvNet off-the-shelf + SVM
Object Classification         71.1                    77.2
Scene Classification          64                      69
Bird Subcategorization        56.8                    61.8
Flowers Recognition           80.7                    86.8
Human Attribute Detection     69.9                    73
Object Attribute Detection    89.5                    91.4
Paris Buildings Retrieval     74.9                    79.5
Oxford Buildings Retrieval    81.7                    68
Sculptures Retrieval          45.4                    42.3
Scene Image Retrieval         81.9                    84.3
Object Instance Retrieval     89.3                    91.1
Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.
Reason for the jump in performance: it's just supervised learning.
Figure: with features far from ideal, the decision boundary between classes is complicated; with ideal features it is simple.
Supervised deep learning allows you to learn more ideal features.
Traditional Pattern Recognition (fixed/handcrafted feature extraction):
Feature Extractor → Trainable Classifier
Modern Pattern Recognition (unsupervised mid-level features):
Feature Extractor → Mid-level Features → Trainable Classifier
Deep Learning (trained hierarchical representations):
Low-level Features → Mid-level Features → High-level Features → Trainable Classifier
Source: Talk Computer Perception with Deep Learning by Yann LeCun
Deep learning provides a mechanism to learn hierarchical representations, efficiently encoded in a deep structure.
Applications: image recognition, speech recognition, Google's and Baidu's photo taggers.
Benchmarks: ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, Handwriting...
Data types: images, sound, time-frequency representations, video, volumetric images, RGB-Depth images...
ConvNet timeline (1988 → 1998 → 2012): early versions did not train by backpropagation; the 1998 networks, trained end-to-end by gradient descent, popularized and deployed ConvNets for OCR applications etc.; other variants trained layer by layer up to the top layer (k-means at the second stage), mixing unsupervised and supervised learning.
Reasons for breakthrough now:
AlexNet 2012. Figure: input image 224×224×3 → convolutional layers (55×55×48, 27×27×128, 13×13×192, 13×13×192, 13×13×128) → fully connected layers (4096, 4096) → output of 1000 units.
Convolutional Networks for RGB Images: The Basic Operations
Let x_{1:m} = {x_1, ..., x_m} be a set of 2D feature maps, each x_i of size W × W, and k_{1:m} a set of filters, each k_i of size (2w+1) × (2w+1).

Convolutional operator: define conv(·, ·, ·), the convolution of x_{1:m} with k_{1:m}, as

  conv(x_{1:m}, k_{1:m}, b) = Σ_{i=1}^{m} (x_i ∗ k_i) + b

where b is a scalar bias.
Figure: input image 224×224×3 → convolution response maps 224×224×48.
Spatial filtering of an image f at position (x, y): the filter coefficients w(-1,-1), ..., w(1,1) sit over the image pixels f(x-1,y-1), ..., f(x+1,y+1) under the filter, and for a (2a+1) × (2b+1) filter

  g(x, y) = Σ_{s=-a}^{a} Σ_{t=-b}^{b} w(s, t) f(x + s, y + t)
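A minimal NumPy sketch of these two formulas (not from the slides; loop-based for clarity, not efficiency, and restricted to the 'valid' region where the filter fits inside the image):

```python
import numpy as np

def filter2d(f, w):
    """g(x, y) = sum_{s,t} w(s, t) * f(x+s, y+t), 'valid' region only."""
    a = w.shape[0] // 2          # filter half-height
    b = w.shape[1] // 2          # filter half-width
    H, W = f.shape
    g = np.zeros((H - 2 * a, W - 2 * b))
    for x in range(g.shape[0]):
        for y in range(g.shape[1]):
            g[x, y] = np.sum(w * f[x:x + 2 * a + 1, y:y + 2 * b + 1])
    return g

def conv(x, k, bias):
    """conv(x_{1:m}, k_{1:m}, b) = sum_i (x_i * k_i) + b over the m channels."""
    return sum(filter2d(xi, ki) for xi, ki in zip(x, k)) + bias

# 3x3 averaging filter applied to one constant 5x5 channel: output is all ones.
f = np.ones((5, 5))
w = np.ones((3, 3)) / 9.0
out = conv([f], [w], 0.0)
```

Note the sum over input channels: each output response map combines all m input maps, exactly as in the conv(x_{1:m}, k_{1:m}, b) definition above.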
Create a new 2D feature map by applying two more operators:

  x̃ = pool(σ(conv(x_{1:m}, k_{1:m}, b)))

where σ is a pointwise non-linearity and pool is spatial pooling.
Figure: input image 224×224×3 → activation response maps 224×224×48 → max-pooled response maps 55×55×48.
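The two extra operators can be sketched as follows (not from the slides; σ is taken to be the ReLU, and the pooling is non-overlapping 2×2 for simplicity, whereas AlexNet actually uses overlapping 3×3 pooling with stride 2):

```python
import numpy as np

def relu(x):
    """Pointwise non-linearity sigma(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping size x size max pooling of a 2D map."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size]                 # drop any remainder rows/cols
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))

a = np.array([[1.0, -2.0], [-3.0, 4.0]])
pooled = max_pool(relu(a))                        # one 2x2 window -> one value
```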
From layer l to layer l+1: given the layer-l feature maps

  x^(l)_{1:m_l} = {x^(l)_1, ..., x^(l)_{m_l}}

and filter banks k^(l+1)_{j,1:m_l}, j = 1, ..., m_{l+1}, each bank creates a new 2D feature map:

  x^(l+1)_j = pool(σ(conv(x^(l)_{1:m_l}, k^(l+1)_{j,1:m_l}, b^(l+1)_j)))

Figure: layer 1 output 55×55×48 → activation response maps 55×55×128 → max-pooled response maps 27×27×128.
In steps: for j = 1, ..., m_{l+1}, with filter bank k^(l+1)_{j,1:m_l}:

  x̂^(l+1)_j = conv(x^(l)_{1:m_l}, k^(l+1)_{j,1:m_l}, b^(l+1)_j)
  z^(l+1)_j = σ(x̂^(l+1)_j)
  x^(l+1)_j = pool(z^(l+1)_j)
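Putting the three steps together, one whole convolutional layer can be sketched in NumPy (not from the slides; a naive 'valid' correlation loop with ReLU and 2×2 max pooling, and made-up small shapes):

```python
import numpy as np

def conv_layer(x, k, b):
    """One ConvNet layer: x^{l+1}_j = pool(sigma(conv(x, k_j, b_j))).

    x: (m, H, W) input feature maps.
    k: (m_next, m, s, s) filter banks, b: (m_next,) biases.
    """
    m_next, m, s, _ = k.shape
    H, W = x.shape[1] - s + 1, x.shape[2] - s + 1   # 'valid' output size
    out = []
    for j in range(m_next):
        conv_j = np.zeros((H, W))
        for u in range(H):                          # sum over all m input maps
            for v in range(W):
                conv_j[u, v] = np.sum(k[j] * x[:, u:u + s, v:v + s])
        z = np.maximum(conv_j + b[j], 0.0)          # sigma = ReLU
        H2, W2 = H // 2, W // 2                     # 2x2 non-overlapping max pool
        z = z[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))
        out.append(z)
    return np.stack(out)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))                  # 3 input maps of 8x8
k = rng.standard_normal((4, 3, 3, 3))               # 4 banks of 3x3 filters
y = conv_layer(x, k, np.zeros(4))                   # shape (4, 3, 3)
```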
Figure: AlexNet architecture (convolutional layers highlighted).
First fully connected layer (l_c is the last convolutional layer): for j = 1, ..., m_{l_c+1}

  x^(l_c+1)_j = max( Σ_{i=1}^{m_{l_c}} w^(l_c+1)_{j,i} · x^(l_c)_i + b^(l_c+1)_j , 0 )
Figure: AlexNet architecture (first fully connected layer highlighted).
Subsequent fully connected layer: for j = 1, ..., m_{l_c+2}

  x^(l_c+2)_j = max( w^(l_c+2)_j · x^(l_c+1) + b^(l_c+2)_j , 0 )
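A fully connected layer with this max(·, 0) non-linearity is one matrix-vector product (a minimal sketch, not from the slides):

```python
import numpy as np

def fc_relu(x, W, b):
    """Fully connected layer: x_j = max(w_j . x + b_j, 0), row w_j of W."""
    return np.maximum(W @ x + b, 0.0)

x = np.array([1.0, -1.0])
W = np.array([[2.0, 0.0],
              [0.0, 2.0]])
y = fc_relu(x, W, np.zeros(2))      # [2, -2] rectified to [2, 0]
```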
Figure: AlexNet architecture (second fully connected layer highlighted).
Output layer: compute the activations

  o′_j = w^(lc+3)_j · x^(lc+2) + b^(lc+3)_j ,

then the SoftMax outputs

  ŷ_r = exp(o′_r) / Σ_{j=1}^{M} exp(o′_j)
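The SoftMax turns the M activations into a probability distribution over classes; a standard numerically-stabilized sketch (the max subtraction is a common trick, not something the slide states):

```python
import numpy as np

def softmax(o):
    """y_r = exp(o_r) / sum_j exp(o_j); subtracting max(o) avoids overflow."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

p = softmax(np.array([1.0, 1.0, 1.0]))   # equal activations -> uniform
```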
Figure: AlexNet architecture (output layer highlighted).
Parameters of the network:
* Convolutional layers: for l = 1, ..., l_c the filters k^(l)_{j,1:m_{l-1}}, 1 ≤ j ≤ m_l; each k^(l)_{j,i} has size w_l × w_l.
* First fully connected layer: w^(lc+1)_{j,i}, 1 ≤ j ≤ m_{lc+1}, 1 ≤ i ≤ m_{lc}; each w^(lc+1)_{j,i} and x^(lc)_i have equal size.
* Subsequent fully connected layers: for l = l_c+2, ..., l_c+L the vectors w^(l)_j, 1 ≤ j ≤ m_l; each w^(l)_j has length m_{l-1}.
* Input: x^(0)_{1:3} = {x_red channel, x_green channel, x_blue channel}.
Fit the parameters W = {W_convolutional, W_fully connected} to the network's prediction performance on a labelled training set D.

The network defines a function f_ConvNet : [0,1]^{W×W×3} × R^p → [0,1]^M, so for input x the function f_ConvNet predicts its label: f_ConvNet(x; W) = ŷ, the predicted label (class distribution) for input x in D.

Use a loss L(y, ŷ) that increases as the discrepancy between y and ŷ increases; here the cross-entropy

  L(y, f_ConvNet(x; W)) = − Σ_{j=1}^{M} y_j log(ŷ_j)

and the empirical loss over D

  E(D, W) = (1/|D|) Σ_{(x,y)∈D} L(y, f_ConvNet(x; W))

Learning is then the optimization problem min_W E(D, W).
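The two loss formulas, sketched directly in NumPy (not from the slides; the toy predictor and the small epsilon inside the log are illustration choices):

```python
import numpy as np

def cross_entropy(y, y_hat):
    """L(y, y_hat) = -sum_j y_j log(y_hat_j); y one-hot, y_hat a distribution."""
    return -np.sum(y * np.log(y_hat + 1e-12))    # epsilon guards log(0)

def empirical_loss(data, predict):
    """E(D, W) = (1/|D|) sum_{(x, y) in D} L(y, f(x; W))."""
    return np.mean([cross_entropy(y, predict(x)) for x, y in data])

# Toy predictor that ignores x and always outputs the same distribution.
predict = lambda x: np.array([0.7, 0.3])
data = [(None, np.array([1.0, 0.0])),            # true class 0: loss -log 0.7
        (None, np.array([0.0, 1.0]))]            # true class 1: loss -log 0.3
E = empirical_loss(data, predict)
```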
Our optimization problem: min_W E(D, W) = min_W (1/|D|) Σ_{(x,y)∈D} L(y, f_ConvNet(x; W)).
Solved with stochastic gradient descent (SGD):

  W^(t+1) = W^(t) − α^(t) ∇_W E(D^(t), W^(t))

where D^(t) is a mini-batch drawn from D. Is this good enough?
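The SGD update, sketched on a toy scalar problem (not from the slides; the quadratic objective, learning rate and batch size are made-up illustration choices):

```python
import numpy as np

def sgd(w, data, grad, lr=0.1, epochs=100, batch_size=2, seed=0):
    """Minimize E(D, w) by mini-batch SGD: w <- w - lr * grad(batch, w)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    for _ in range(epochs):
        rng.shuffle(data)                         # fresh mini-batches each epoch
        for i in range(0, len(data), batch_size):
            w = w - lr * grad(data[i:i + batch_size], w)
    return w

# Toy problem: E(D, w) = mean over x of (w - x)^2, so the minimizer is the
# data mean; the batch gradient is 2 * mean(w - x_batch).
grad = lambda batch, w: 2.0 * np.mean(w - batch)
w_star = sgd(0.0, [1.0, 2.0, 3.0, 4.0], grad)     # hovers near the mean 2.5
```

With a constant step size the iterate keeps bouncing around the minimizer, which is one reason the slides ask "is this good enough?" and move on to tricks of the trade.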
Next slides: Deep Learning for Vision: Tricks of the Trade, M. Ranzato, BAVM, Oct '13
Common wisdom: training does not work because we “get stuck in local minima”
Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries. Optimization is not the real problem when:
– the dataset is large
– units do not saturate too much
– a normalization layer is used
Avoid overfitting by dropout: only classify with, and update, a random subset of the network at each training iteration.
When training, constantly monitor performance on a validation set.
Modular algorithms for backprop: a layer computes y = f(x; w) from its input x, and the rest of the network computes the scalar output (loss) z = g(y) ∈ R. Forward mode propagates activations through f; backward mode propagates the gradient ∂z/∂y back through the layer, yielding ∂z/∂x for the layer below and ∂z/∂w for the weight update.
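A minimal sketch of this modular view (not from the slides; scalar weights and a squared loss g(y) = y², chosen so the gradients can be checked by hand):

```python
class Linear:
    """Module y = f(x; w) = w * x with a matching backward pass."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x                       # cache input for the backward pass
        return self.w * x
    def backward(self, dz_dy):
        self.dw = dz_dy * self.x         # dz/dw = dz/dy * dy/dw
        return dz_dy * self.w            # dz/dx = dz/dy * dy/dx

# Chain two modules, apply the loss g(y) = y^2, then backpropagate.
f1, f2 = Linear(2.0), Linear(3.0)
y = f2.forward(f1.forward(1.5))          # forward: 3 * (2 * 1.5) = 9
dz_dy = 2.0 * y                          # d(y^2)/dy at y = 9
dz_dx = f1.backward(f2.backward(dz_dy))  # backward through both modules
```

Each module only needs its own forward rule and its own local derivatives; the chain rule composes them, which is what makes deep nets trainable layer by layer.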
Source: http://image-net.org/challenges/LSVRC/2013
Backpack, flute, strawberry, traffic light, bathing cap, matchstick, racket, sea lion.
Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
ILSVRC tasks: DET (detection), CLS+LOC (classification + localization).
ImageNet Classification with Deep Convolutional Neural Networks (NIPS ’12)
Figure: AlexNet architecture.
Trained on a large set of labeled training samples.
Figure: ILSVRC winners' classification error (%), 2010-14: 28.2, 25.8, 16.4, 11.7, 6.7.
ILSVRC 2014 top entries (all used ConvNets):

Name                    Institution
GoogLeNet               Google
VGG                     Oxford University
MSRA Visual Computing   Microsoft Research Asia
Andrew Howard           consultant
DeeperVision            company

For more details check out http://www.image-net.org/challenges/LSVRC/2014/results.php
Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013
Figure: AlexNet architecture.
Figure (repeated): best state-of-the-art vs. ConvNet off-the-shelf + linear SVM across the eleven recognition and retrieval tasks.
Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.
How to optimize ConvNet representations for transfer learning?
ImageNet provides abundant labelled training data for computer vision, but many target tasks are not large-scale classification and have little labelled data. But I still want to use a deep ConvNet representation: how should the source network be used to learn how to solve the target task?
Transfer pipeline: Target image → Source ConvNet → Target ConvNet Representation → SVM → Target label.
Design factors: which layer? spatial pooling? fine-tuning? network architecture? source task? early stopping?
The Source ConvNet comes from backprop with source images & labels: Random ConvNet ⇓ (training of Source ConvNet from scratch) → exploit Source ConvNet for the target task.
Task categories, ordered so that the task's distance from ImageNet increases → : Attribute Detection, Fine-grained Recognition, Compositional, Instance Retrieval.
Tasks: PASCAL VOC Object, H3D human attrib., Cat&Dog breeds, VOC Human Act., Holiday scenes, MIT 67 Indoor Scenes, Object attrib., Bird subordinate, Stanford 40 Act., Paris buildings, SUN 397 Scene, SUN scene attrib., 102 Flowers, Visual Phrases, Sculptures.
Recommended factor settings as the target task moves away from the source task (ImageNet):

Factor           Fine-grained recognition · · ·     · · · Instance retrieval
Source task      ImageNet                            ImageNet
Early stopping   Don't do it                         Don't do it
Fine-tuning      Yes, more improvement with more labelled data
Network depth    As deep as possible¹
Network width    Wider                               Moderately wide
Dimensionality   Original dim                        Reduced dim
Layer            Later layers                        Earlier layers

¹ In general the network should be as deep as possible, but in the final experiments a couple of the instance retrieval tasks defied this advice!
Figure: best non-ConvNet vs. Deep Standard vs. Deep Optimal (performance 30-100) on: VOC, MIT Scenes, SUN Scene Att., Obj. Att., Human Att., Pet breeds, Bird Sub., Flowers, VOC Action, Stanford Action, Vis. Phrases, Holidays, UKB, Oxford, Paris, Sculpture.
Reduce the amount of training data in two ways: remove all examples from a percentage of the classes, or remove a percentage of the examples from each class.
Figure: performance (50-95) vs. % of classes kept (full set: 1000) and vs. % of data kept (full set: ∼1.3M), for VOC07, MIT, H3D, UIUC, Pet, CUB, Flowers, VOC action, Holidays, UKB.
Do we learn a better representation with more layers or with more filters? Networks of different sizes (labelled A-J, with AlexNet as reference) were trained, altered either by increasing depth (depth = 8 vs. depth > 8) or by increasing width.
Figure: performance (50-90) vs. number of parameters (millions, 20-140) for each network, on VOC07, MIT, Pet and CUB.
Image captioning: automatically generate a sentence description for an image (image → sentence description). This is analogous to automatically solving a statistical translation problem: sentence in Swedish → same sentence in English. Encode the image with a ConvNet representation; decode it into a sentence using a Recurrent Neural Network.
- Deep Captioning with Multimodal Recurrent Neural Networks; Mao, Xu, Yang, Wang, Yuille.
- Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models; Kiros, Salakhutdinov, Zemel.
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description; Donahue et al.
- Show and Tell: A Neural Image Caption Generator; Vinyals, Toshev, Bengio, Erhan.
- Deep Visual-Semantic Alignments for Generating Image Descriptions; Karpathy, Fei-Fei.
Model the sentence probability:

  log p(S | I) = Σ_{t=0}^{N} log p(S_t | f_ConvNet(I), S_0, S_1, ..., S_{t−1})

Each term p(S_t | f_ConvNet(I), S_0, ..., S_{t−1}) is modelled with a Recurrent Neural Network, h_{t+1} = f(h_t, x_t), where x_t is the embedding of word S_t.
Initialize the RNN with input from the image: x_{−1} = f_ConvNet(I). Then for t ∈ 0, ..., N:

  x_t = W_e S_t,  h_t = σ(W_h h_{t−1} + W_x x_t),  p_t = SoftMax(V h_t)
Figure: the unrolled network. The image feeds the first LSTM step; at step t the word embedding W_e S_{t−1} is the input, the LSTM emits p_t, and training maximizes log p_1(S_1) + log p_2(S_2) + ... + log p_N(S_N).
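The scoring recurrence above can be sketched in a few lines (not from the slides: tanh stands in for the LSTM cell, and all dimensions, weights and the word sequence are made-up toy values):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def caption_log_prob(img_feat, words, We, Wh, Wx, V):
    """Score a sentence: x_{-1} = ConvNet(image), x_t = We[S_t],
    h_t = tanh(Wh h_{t-1} + Wx x_t), p_{t+1} = softmax(V h_t).
    Returns log p(S | I) = sum_t log p_t(S_t)."""
    h = np.tanh(Wx @ img_feat)           # image initializes the hidden state
    log_p = 0.0
    for word in words:
        p = softmax(V @ h)               # distribution over the vocabulary
        log_p += np.log(p[word])         # credit the observed word
        h = np.tanh(Wh @ h + Wx @ We[word])
    return log_p

# Toy dimensions: vocabulary of 4 words, embedding/state size 3.
rng = np.random.default_rng(0)
We = rng.standard_normal((4, 3))
Wh, Wx = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
V = rng.standard_normal((4, 3))
score = caption_log_prob(rng.standard_normal(3), [0, 2, 1], We, Wh, Wx, V)
```

Generation would run the same loop but sample (or greedily pick) each word from p instead of reading it from a given sentence.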