slide-1
SLIDE 1

IDA Machine Learning Seminars - Fall 2015

Deep Convolutional Networks and their impact on solving large scale visual recognition problems

Hossein Azizpour, Computer Vision Group, KTH. Thanks to: J. Sullivan, A. S. Razavian, A. Maki and S. Carlsson

slide-2
SLIDE 2

What has Deep Learning done for Computer Vision?

Deep Learning has resulted in

  • 1. much better automatic visual image classification and object detection,
  • 2. much more powerful generic image representations.
slide-3
SLIDE 3

What have ConvNets done for Computer Vision?

ConvNets have resulted in

  • 1. much better automatic visual image classification and object detection,
  • 2. much more powerful generic image representations.
slide-4
SLIDE 4

Image Classification Task: ILSVRC

Ground truth: Steel drum

Output 1: Scale, T-shirt, Steel drum, Drumstick, Mud turtle ✔
Output 2: Scale, T-shirt, Giant panda, Drumstick, Mud turtle ✗

Error = (1 / 100,000) Σ_{i=1}^{100,000} 1(incorrect on image i)

  • 100,000 test images; an image counts as correct if the ground-truth label is among the predicted outputs.

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

slide-5
SLIDE 5

ConvNets → much better image classification

Year:      2010  2011  2012  2013  2014
Error (%): 28.2  25.8  16.4  11.7   6.7

Classification error (%) of the winning entry in ILSVRC competitions (2010-14). Deep ConvNets were introduced in 2012.

slide-6
SLIDE 6

How well would a human perform on ImageNet?

  • Andrej Karpathy, Stanford, set himself this challenge.
  • Replicated the 1000-way classification problem for a human.
  • The person is shown a test image on the left,
  • and on the right 13 examples from each of the 1000 classes.
  • They must pick 5 classes as the potential ground-truth label.
slide-7
SLIDE 7

How well would a human perform on ImageNet?

  • Efforts and results reported on his blog: What I learned from competing against a ConvNet on ImageNet.
  • Estimated his own error rate on ImageNet as 5.1%. (After some training period.)
  • Later conjectured (Feb 2015) that a dedicated and motivated human classifier is capable of an error rate in the range of 2%-3%.

slide-8
SLIDE 8

Race is on to beat human level performance

Year:      2010  2011  2012  2013  2014  Jan 2015  Feb 2015  Mar 2015
Error (%): 28.2  25.8  16.4  11.7   6.7      5.33      4.94      4.82

Classification error (%). The 2015 entries show recent progress made by Baidu, MSR and Google.

slide-9
SLIDE 9
slide-10
SLIDE 10

Pascal VOC: Object Detection

Classification: person, motorcycle
Detection: person and motorcycle localised with bounding boxes

slide-11
SLIDE 11

ConvNets → much better object detection

[Chart: detection accuracy vs. year (2007-2015, axis 10-80) for plant, person, chair, cat, car, aeroplane and all classes; deep-learning entries marked.]

Progress of object detection for the Pascal VOC 2007 challenge.

slide-12
SLIDE 12

ConvNets → much better image representation

slide-13
SLIDE 13

Other Common Tasks in Computer Vision

  • Fine-Grained classification

Task:

  • Label the sub-categories within a class.
slide-14
SLIDE 14

Other Common Tasks in Computer Vision

  • Attribute Classification

Task:

  • Predict the attributes describing a scene (person, etc.)
slide-15
SLIDE 15

Other Common Tasks in Computer Vision

  • Image Retrieval

Have a database of images. Task:

  • Given a query image,
  • find the images in the database with the same content as the query image.

Query image

Database images ranked closest to query image.

correct result, incorrect result.

slide-16
SLIDE 16

Solving these tasks often involves a complicated pipeline

  • Example: fine-grained classification

[Pipeline: Image + Part Annotations → Strong DPM → Learn Normalized Pose → Extract Features (RGB, gradient, LBP / CNN Representation) → SVM]

slide-17
SLIDE 17

Solving these tasks often involves a complicated pipeline

  • Example: fine-grained classification

[Pipeline: Image + Part Annotations → Strong DPM → Learn Normalized Pose → Extract Features (RGB, gradient, LBP / CNN Representation) → SVM]

  • Can IMPROVE RESULTS by replacing the complicated pipeline with:

Image → CNN Representation → SVM

  • The ConvNet used must be deep and trained on a large, diverse labelled dataset.

slide-18
SLIDE 18

What we mean by a ConvNet feature

Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)

slide-19
SLIDE 19

ConvNets → much better image representation

Accuracy (%) per task: best prior state-of-the-art vs. ConvNet off-the-shelf + linear SVM.

Task                        | State-of-the-art | ConvNet + SVM
Object Classification       | 71.1             | 77.2
Scene Classification        | 64               | 69
Bird Subcategorization      | 56.8             | 61.8
Flowers Recognition         | 80.7             | 86.8
Human Attribute Detection   | 69.9             | 73
Object Attribute Detection  | 89.5             | 91.4
Paris Buildings Retrieval   | 74.9             | 79.5
Oxford Buildings Retrieval  | 81.7             | 68
Sculptures Retrieval        | 45.4             | 42.3
Scene Image Retrieval       | 81.9             | 84.3
Object Instance Retrieval   | 89.3             | 91.1

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.

slide-20
SLIDE 20

Reason for jump in performance:

Learn feature hierarchies from the data

slide-21
SLIDE 21

Modern Visual Recognition Systems

  • 1. Training Phase
  • Gather labelled training data.
  • Extract a feature representation for each training example.
  • Construct a decision boundary.
  • 2. Test Phase
  • Extract feature representation from the test example.
  • Compare to the learnt decision boundary.
slide-22
SLIDE 22

Modern Visual Recognition Systems

  • 1. Training Phase
  • Gather labelled training data.
  • Extract a feature representation for each training example.
  • Construct a decision boundary.
  • 2. Test Phase
  • Extract feature representation from the test example.
  • Compare to the learnt decision boundary.

It’s just supervised learning.

slide-23
SLIDE 23

Is it a bike or a face?

?

slide-24
SLIDE 24

Construct a decision boundary

Decision Boundary

slide-25
SLIDE 25

The two extremes of feature extraction

Ideal features Far from ideal

slide-26
SLIDE 26

The two extremes of feature extraction

Ideal features Far from ideal

Supervised Deep Learning allows you to learn more ideal features.

slide-27
SLIDE 27

Learning Representations/Features

Traditional Pattern Recognition: fixed/handcrafted feature extraction

Feature Extractor → Trainable Classifier

Modern Pattern Recognition: unsupervised mid-level features

Feature Extractor → Mid-level Features → Trainable Classifier

Deep Learning: trained hierarchical representations

Low-level Features → Mid-level Features → High-level Features → Trainable Classifier

Source: Talk Computer Perception with Deep Learning by Yann LeCun

slide-28
SLIDE 28

Key Properties of Deep Learning

Provides a mechanism to:

  • Learn a highly non-linear function.

(Efficiently encoded in a deep structure.)

  • Learn it from data.
  • Build feature hierarchies
  • Distributed representations
  • Compositionality
  • Perform end-to-end learning.
slide-29
SLIDE 29

How? Convolutional Networks

slide-30
SLIDE 30

Convolutional Networks

  • Are deployed in many practical applications

Image recognition, speech recognition, Google’s and Baidu’s photo taggers

  • Have won several competitions

ImageNet, Kaggle Facial Expression, Kaggle Multimodal Learning, German Traffic Signs, Connectomics, Handwriting...

  • Are applicable to array data where nearby values are correlated

Images, sound, time-frequency representations, video, volumetric images, RGB-Depth images....

Source: Talk Computer Perception with Deep Learning by Yann LeCun

slide-31
SLIDE 31

Convolutional Network

  • Training is supervised, using stochastic gradient descent.
  • LeCun et al. '89, '98

Source: Talk Computer Perception with Deep Learning by Yann LeCun

slide-32
SLIDE 32

ConvNets: History

  • Fukushima 1980: designed a network with the same basic structure but did not train it by backpropagation.
  • LeCun from late 80s: figured out backpropagation for ConvNets, popularized and deployed ConvNets for OCR applications etc.
  • Poggio from 1999: same basic structure but learning is restricted to the top layer (k-means at second stage).
  • LeCun from 2006: unsupervised feature learning.
  • DiCarlo from 2008: large scale experiments, normalization layer.
  • LeCun from 2009: harsher non-linearities, normalization layer, unsupervised and supervised learning.
  • Mallat from 2011: provides a theory behind the architecture.
  • Hinton 2012: use bigger nets, GPUs, more data.
slide-33
SLIDE 33

Timeline: Convolutional Neural Net 1988 → Convolutional Neural Net 1998 → Convolutional Neural Net 2012

Reasons for breakthrough now:

  • Data and GPUs,
  • Networks have been made deeper.
slide-34
SLIDE 34

Modern Convolutional Network

Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)

Alex Net 2012

slide-35
SLIDE 35

Convolutional Networks for RGB Images: The Basic Operations

slide-36
SLIDE 36

Convolution Operation

  • Input: a set of 2d feature maps x_{1:m} = {x_1, …, x_m}, each x_i of size W × W.
  • Convolutional parameters: a set of 2d convolution kernels k_{1:m} = {k_1, …, k_m}, each k_i of size (2w + 1) × (2w + 1), and a bias term b.

[Figure: Input Image 224×224×3 → convolution response maps 224×224×48.]

slide-37
SLIDE 37

Convolution Operation

Convolutional operator: define conv(·, ·, ·), the convolution of x_{1:m} with k_{1:m}, as

conv(x_{1:m}, k_{1:m}, b) = Σ_{i=1}^{m} (x_i ∗ k_i) + b

where

  • x_i ∗ k_i denotes a 2d convolution,
  • conv returns a single 2d response map (equivalent to a 3d convolution).

[Figure: Input Image 224×224×3 → convolution response maps 224×224×48.]
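The conv operator above can be sketched in NumPy/SciPy. This is a minimal illustration, not the seminar's code; the shapes and variable names are my own choices:

```python
import numpy as np
from scipy.signal import convolve2d

def conv(x, k, b):
    """conv(x_{1:m}, k_{1:m}, b) = sum_i (x_i * k_i) + b.

    x: (m, W, W) stack of 2d feature maps
    k: (m, 2w+1, 2w+1) one 2d kernel per input map
    b: scalar bias
    Returns a single 2d response map (equivalent to a 3d convolution).
    """
    return sum(convolve2d(xi, ki, mode="same") for xi, ki in zip(x, k)) + b

rng = np.random.default_rng(0)
maps = rng.standard_normal((3, 8, 8))     # m = 3 input maps, W = 8
kernels = rng.standard_normal((3, 3, 3))  # 3x3 kernels, i.e. w = 1
response = conv(maps, kernels, 0.5)       # one 8x8 response map
```

`mode="same"` zero-pads so the response map keeps the input's spatial size, matching the 224×224 → 224×224 figure.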

slide-38
SLIDE 38

Remember 2D convolution

[Figure: an image f in the spatial domain, origin at (x, y); a 3×3 filter with coefficients w(s, t), s, t ∈ {−1, 0, 1}, placed over the image section with pixels f(x + s, y + t).]

g(x, y) = Σ_{s=−a}^{a} Σ_{t=−b}^{b} w(s, t) f(x + s, y + t)

slide-39
SLIDE 39

Next non-linear activation and then max-pool

Create a new 2d feature map by applying two more operators:

x̃ = pool(σ(conv(x_{1:m}, k_{1:m}, b)))

where

  • σ(·) is a non-linear function, typically σ(x) = max(0, x),
  • pool(·) is a local max-pooling operator.

[Figure: Input Image 224×224×3 → activation response maps 224×224×48 → max-pooled response maps 55×55×48.]
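σ and pool can be sketched directly in NumPy. The 2×2 non-overlapping pooling here is an illustrative choice (AlexNet actually pools 3×3 windows with stride 2):

```python
import numpy as np

def relu(x):
    """sigma(x) = max(0, x), applied elementwise."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Local max-pooling with non-overlapping size x size windows."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]  # trim so windows tile the map exactly
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))
```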

slide-40
SLIDE 40

From one convolutional layer to the next

  • At convolutional layer l we have a set of 2d feature maps x^(l)_{1:m_l} = {x^(l)_1, …, x^(l)_{m_l}}.
  • We have multiple sets of convolution kernels k^(l+1)_{j,1:m_l}, for j = 1, …, m_{l+1}.
  • For each kernel set k^(l+1)_{j,1:m_l} we create a new 2d feature map:

x^(l+1)_j = pool(σ(conv(x^(l)_{1:m_l}, k^(l+1)_{j,1:m_l}, b^(l+1)_j)))
slide-41
SLIDE 41

Convolve → Activation → Max-pool

[Figure: layer 1 output 55×55×48 → activation response maps 55×55×128 → max-pooled response maps 27×27×128.]

For j = 1, …, m_{l+1}:

  • Convolve the current response maps with k^(l+1)_{j,1:m_l}:

x̂^(l+1)_j = conv(x^(l)_{1:m_l}, k^(l+1)_{j,1:m_l}, b^(l+1)_j)

  • Non-linear activation:

z^(l+1)_j = σ(x̂^(l+1)_j)

  • Max-pool:

x^(l+1)_j = pool(z^(l+1)_j)
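Putting convolve, activation and max-pool together, one full convolutional layer can be sketched in NumPy/SciPy. An illustrative sketch with made-up shapes, not the actual AlexNet code:

```python
import numpy as np
from scipy.signal import convolve2d

def conv_layer(x, kernels, biases, pool=2):
    """One layer: for each kernel set j, convolve -> ReLU -> max-pool.

    x:       (m_l, H, W) feature maps from layer l
    kernels: (m_next, m_l, kh, kw) kernel set k_{j,1:m_l} per output map j
    biases:  (m_next,) bias b_j per output map
    Returns the (m_next, H//pool, W//pool) feature maps of layer l+1.
    """
    out = []
    for k_j, b_j in zip(kernels, biases):
        # conv: sum of per-map 2d convolutions plus the bias
        r = sum(convolve2d(xi, ki, mode="same") for xi, ki in zip(x, k_j)) + b_j
        r = np.maximum(r, 0.0)               # sigma = ReLU
        H, W = r.shape
        r = r[:H - H % pool, :W - W % pool]  # trim to the pooling grid
        out.append(r.reshape(H // pool, pool, W // pool, pool).max(axis=(1, 3)))
    return np.stack(out)
```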

slide-42
SLIDE 42

AlexNet 2012

224×224×3 55×55×48 27×27×128 13×13×192 13×13×192 13×13×128 dense 4096 dense 4096 dense 1000 Output Input Image Fully connected layers Convolutional layers

slide-43
SLIDE 43

1st fully connected layer

For j = 1, …, m_{l_c+1}:

x^(l_c+1)_j = max( Σ_{i=1}^{m_{l_c}} w^(l_c+1)_{j,i} · x^(l_c)_i + b^(l_c+1)_j , 0 )

[Figure: Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)]

slide-44
SLIDE 44

2nd fully connected layer

For j = 1, …, m_{l_c+2}:

x^(l_c+2)_j = max( w^(l_c+2)_j · x^(l_c+1) + b^(l_c+2)_j , 0 )

[Figure: Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)]

slide-45
SLIDE 45

Output layer: soft-max operator

  • For j = 1, …, M:

o_j = w^(l_c+3)_j · x^(l_c+2) + b^(l_c+3)_j

  • ŷ_r = exp(o_r) / Σ_{j=1}^{M} exp(o_j)

[Figure: Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)]
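A NumPy sketch of the fully connected and soft-max computations above. Subtracting max(o) inside the soft-max is a standard numerical-stability trick, not something on the slide:

```python
import numpy as np

def fc_relu(x, W, b):
    """Fully connected layer with ReLU: max(W x + b, 0)."""
    return np.maximum(W @ x + b, 0.0)

def softmax(o):
    """y_r = exp(o_r) / sum_j exp(o_j); shifting by max(o) avoids overflow."""
    e = np.exp(o - o.max())
    return e / e.sum()
```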

slide-46
SLIDE 46

Parameters of the model

  • Filter parameters:
  • Convolutional layers: for l = 1, …, l_c: k^(l)_{j,1:m_{l−1}}, 1 ≤ j ≤ m_l; each k^(l)_{j,i} has size w_l × w_l.
  • Fully connected layers:

* First fully connected layer: w^(l_c+1)_{j,i}, 1 ≤ j ≤ m_{l_c+1}, 1 ≤ i ≤ m_{l_c}; each w^(l_c+1)_{j,i} and x^(l_c)_i have equal size.

* Subsequent fully connected layers: for l = l_c+2, …, l_c+L: w^(l)_j, 1 ≤ j ≤ m_l; each w^(l)_j has size m_{l−1} × 1.

slide-47
SLIDE 47

Training Convolutional Networks

  • Set-up: supervised learning.
  • For an RGB image x, set the ConvNet's first set of 2d feature maps:

x^(0)_{1:3} = {x_red channel, x_green channel, x_blue channel}

  • Have a set D of labelled training images, i.e. many pairs (x, y).
  • To learn the network's parameters we must link the value of the parameters W = {W_convolutional, W_fully connected} to the network's prediction performance on D.

slide-48
SLIDE 48

Training ConvNets: Measuring Performance

  • Remember a ConvNet represents a function

f_ConvNet : [0, 1]^{W×W×3} × R^p → [0, 1]^M

so for input x the function predicts its label: f_ConvNet(x; W) = ŷ.

  • Use a loss function to measure the error in f_ConvNet(x; W)'s predicted label for input x in D.
  • The loss function typically has the property that L(y, ŷ) increases as the discrepancy between y and ŷ increases.
  • Cross-entropy loss is frequently used:

L(y, f_ConvNet(x; W)) = −Σ_{j=1}^{M} y_j log(ŷ_j)
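For a one-hot ground-truth vector y this loss reduces to −log of the probability assigned to the true class. A minimal sketch; the eps guard against log(0) is my addition:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """L(y, y_hat) = -sum_j y_j log(y_hat_j)."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])            # true class is index 1
confident = np.array([0.05, 0.9, 0.05])  # most mass on the true class: low loss
unsure = np.array([0.4, 0.2, 0.4])       # little mass on the true class: high loss
```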

slide-49
SLIDE 49

Training ConvNets: The Optimization Problem

  • Define the performance of a network with parameters W on D as

E(D, W) = (1/|D|) Σ_{(x,y)∈D} L(y, f_ConvNet(x; W))

  • The learning problem is to find the W that minimizes E:

min_W E(D, W)

  • How do we do the optimization?
slide-50
SLIDE 50

Training ConvNets: How to optimize

Our optimization problem:

min_W E(D, W) = min_W (1/|D|) Σ_{(x,y)∈D} L(y, f_ConvNet(x; W))

  • Initialize the network's parameters randomly to get W^(0).
  • Update the W's using batch-mode Stochastic Gradient Descent (SGD):
  • At iteration t randomly choose a small subset D^(t) of D,
  • and perform the update with learning rate α^(t):

W^(t+1) = W^(t) − α^(t) ∇_W E(D^(t), W) |_{W = W^(t)}

  • This procedure allows us to find a local minimum of E(D, W).

Is this good enough?
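The same update rule on a toy problem, with least-squares standing in for E(D, W). All constants here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))  # the dataset D ...
w_true = np.arange(1.0, 6.0)
y = X @ w_true                     # ... with labels from a known linear model

W = np.zeros(5)                    # W^(0)
alpha = 0.05                       # learning rate alpha^(t), kept constant here
for t in range(500):
    batch = rng.choice(len(X), size=16, replace=False)  # D^(t), a small subset of D
    Xb, yb = X[batch], y[batch]
    grad = 2.0 * Xb.T @ (Xb @ W - yb) / len(batch)      # grad of E(D^(t), W) at W^(t)
    W -= alpha * grad                                   # W^(t+1) = W^(t) - alpha grad
```

On this noiseless convex problem SGD recovers w_true; the deep ConvNet case only finds a local minimum, which motivates the "is this good enough?" question above.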

slide-51
SLIDE 51

Training ConvNets: How to optimize

Our optimization problem: min

W E(D, W) = min W

1 |D|

  • (x,y)∈D

L(y, fConvNet(x; W))

  • Initialize the network’s parameters randomly to get W(0).
  • Update W’s using batch-mode Stochastic Gradient Descent

(SGD):

  • At iteration t randomly choose a small subset D(t) of D.
  • And perform the update with learning rate α(t)

W(t+1) = W(t) − α(t) ∇WE(D(t), W)

  • W(t)
  • This procedure allows us to find a local minima of E(D, W).

Is this good enough?

slide-52
SLIDE 52

Intuition about hardness of training a deep ConvNet with Backpropagation

Next slides: Deep Learning for Vision: Tricks of the Trade, M. Ranzato, BAVM, Oct '13

slide-53
SLIDE 53


ConvNets: till 2012

Loss parameter

Common wisdom: training does not work because we “get stuck in local minima”

slide-54
SLIDE 54


ConvNets: today

Loss parameter

Local minima are all similar, there are long plateaus, and it can take a long time to break symmetries. Optimization is not the real problem when:

  – the dataset is large,
  – units do not saturate too much,
  – a normalization layer is used.

slide-55
SLIDE 55


Like walking on a ridge between valleys

slide-56
SLIDE 56

Regularization during training is very important

Avoid overfitting by:

  • Training with large labelled datasets.
  • Augment training sets with random jitterings of input.
  • Train for a long time with small learning rates.
  • Dropout (idea from Geoff Hinton)

(only classify with and update a random subset of the network at each training iteration)

  • Be vigilant!

When training constantly monitor performance with a validation set.
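Dropout, as described above, can be sketched as follows. Inverted dropout with the 1/(1−p) rescaling is the common modern variant; the original formulation instead rescaled at test time:

```python
import numpy as np

def dropout(x, p, rng):
    """Zero each unit with probability p; rescale survivors by 1/(1-p)
    so the expected activation is unchanged (inverted dropout)."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```

At each training iteration a fresh random mask is drawn, so only a random subset of the network is classified with and updated; at test time the full network is used unchanged.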

slide-57
SLIDE 57

Compute derivatives efficiently with chain-rule via BackProp

Modular algorithms for backprop

[Diagram: a layer f(x; w) inside the network, followed by the rest of the network g(y) producing a scalar loss z ∈ ℝ. Forward mode computes y = f(x; w) from the layer's input; backward mode propagates the gradient of the scalar loss back through g and then through f.]
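Numerically, the forward/backward picture is just the chain rule. A minimal made-up example with one scalar layer:

```python
# Layer f(x; w) = w * x, followed by "rest of the network" g(y) = y**2 (the loss).
x, w = 3.0, 2.0

y = w * x          # forward mode: y = f(x; w) = 6
z = y ** 2         # scalar loss: z = g(y) = 36

dz_dy = 2.0 * y    # backward mode: gradient arriving from the loss = 12
dz_dw = dz_dy * x  # chain rule: dz/dw = dz/dy * dy/dw = 36
dz_dx = dz_dy * w  # chain rule: dz/dx = dz/dy * dy/dx = 24
```

Backprop applies this layer-by-layer: each module only needs its local derivative and the gradient flowing in from the module after it.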

slide-58
SLIDE 58

Source for many labelled images:

ImageNet

slide-59
SLIDE 59

ImageNet: Large scale visual recognition challenge (2010-13)

  • Classification (+ Localization) Challenge
  • 1000 object classes
  • 1,431,167 images
  • Detection Challenge
  • 200 object classes
  • 456,191 images

Source: http://www.image-net.org/challenges/LSVRC/2013

slide-60
SLIDE 60

Backpack, Flute, Strawberry, Traffic light, Bathing cap, Matchstick, Racket, Sea lion

slide-61
SLIDE 61

Variety of object classes in ILSVRC

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

slide-62
SLIDE 62

Variety of object classes

DET vs. CLS-LOC

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

slide-63
SLIDE 63

A revolution in computer vision

slide-64
SLIDE 64

ConvNet of Krizhevsky, Sutskever, Hinton 2012

ImageNet Classification with Deep Convolutional Neural Networks (NIPS ’12)

Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)

slide-65
SLIDE 65

Image Classification: Dramatic ILSVRC Results since 2012

  • ImageNet Large Scale Visual Recognition Challenge
  • 1000 categories, 1.3 million labeled training samples (×10 with data augmentation)

Year:      2010  2011  2012  2013  2014
Error (%): 28.2  25.8  16.4  11.7   6.7

Classification error (%)

slide-66
SLIDE 66

First layer filters learnt

slide-67
SLIDE 67

AlexNet: Object Recognition Results

slide-68
SLIDE 68

AlexNet: Object Recognition Results

slide-69
SLIDE 69

AlexNet: Object Recognition Results

slide-70
SLIDE 70

Leader Board from ImageNet LSVRC-2014

Name                  | Institution             | ConvNet Error (%)
GoogLeNet             | Google                  | 6.656
VGG                   | Oxford University       | 7.337
MSRA Visual Computing | Microsoft Research Asia | 8.060
Andrew Howard         | consultant              | 8.111
DeeperVision          | company                 | 9.058

For more details check out http://www.image-net.org/challenges/LSVRC/2014/results.php

slide-71
SLIDE 71
slide-72
SLIDE 72

Where do these ConvNets work well and not so well?

Source: Detecting avocados to zucchinis: what have we done, and where are we going? O. Russakovsky et al., ICCV 2013

slide-73
SLIDE 73

ImageNet: Image Classification

slide-74
SLIDE 74

Remember: What we mean by a ConvNet Feature

Input Image 224×224×3 → convolutional layers: 55×55×48 → 27×27×128 → 13×13×192 → 13×13×192 → 13×13×128 → fully connected layers: dense 4096 → dense 4096 → dense 1000 (Output)

slide-75
SLIDE 75

ConvNets → much better image representation

Accuracy (%) per task: best prior state-of-the-art vs. ConvNet off-the-shelf + linear SVM.

Task                        | State-of-the-art | ConvNet + SVM
Object Classification       | 71.1             | 77.2
Scene Classification        | 64               | 69
Bird Subcategorization      | 56.8             | 61.8
Flowers Recognition         | 80.7             | 86.8
Human Attribute Detection   | 69.9             | 73
Object Attribute Detection  | 89.5             | 91.4
Paris Buildings Retrieval   | 74.9             | 79.5
Oxford Buildings Retrieval  | 81.7             | 68
Sculptures Retrieval        | 45.4             | 42.3
Scene Image Retrieval       | 81.9             | 84.3
Object Instance Retrieval   | 89.3             | 91.1

Best state-of-the-art ConvNet off-the-shelf + Linear SVM

Source: CNN Features off-the-shelf: an Astounding Baseline for Recognition, A. Sharif Razavian et al., arXiv, March 2014.

slide-76
SLIDE 76

How to optimize ConvNet representations for transfer learning

slide-77
SLIDE 77

When do I need transfer learning?

  • Currently ImageNet is one of the few large labelled training datasets for computer vision.
  • What do I do if:
  • I have a visual recognition task that differs from image classification,
  • and I have limited labelled training data,
  • but I still want to use a deep ConvNet representation?

slide-78
SLIDE 78

Terminology of transfer learning

  • Source Task: the task for which I have lots of labelled data.
  • Target Task: the task I want to solve;
  • I only have limited labelled training examples,
  • but the task is related to the source task.
  • What is the best way to leverage the source task training data to help learn how to solve the target task?

slide-79
SLIDE 79

Factors that influence a ConvNet’s representation

[Diagram, two stages. Training the Source ConvNet from scratch: backprop with source images & labels turns a Random ConvNet into the Source ConvNet (choices: network architecture? source task? early stopping?). Exploiting the Source ConvNet for the target task: target image → ConvNet → representation → SVM → target label (choices: which layer? dimensionality reduction? spatial pooling? fine-tuning?).]

slide-80
SLIDE 80

Can order visual recognition tasks relative to ImageNet

Task's distance from ImageNet increases →

  • Image Classification: PASCAL VOC Object, MIT 67 Indoor Scenes, SUN 397 Scene
  • Attribute Detection: H3D human attrib., Object attrib., SUN scene attrib.
  • Fine-grained Recognition: Cat&Dog breeds, Bird subordinate, 102 Flowers
  • Compositional: VOC Human Act., Stanford 40 Act., Visual Phrases
  • Instance Retrieval: Holiday scenes, Paris buildings, Sculptures

slide-81
SLIDE 81

Best practices for a ConvNet rep. for transfer learning

Target tasks ordered by distance from the source task (ImageNet): FineGrained recognition → … → Instance retrieval.

Factor         | FineGrained recognition | Instance retrieval
Early stopping | Don't do it             | Don't do it
Fine-tuning    | Yes, more improvement with more labelled data | Yes, more improvement with more labelled data
Network depth  | As deep as possible¹    | As deep as possible¹
Network width  | Wider                   | Moderately wide
Dim. reduction | Original dim            | Reduced dim
Rep. layer     | Later layers            | Earlier layers

¹ In general the network should be as deep as possible, but in the final experiments a couple of the instance retrieval tasks defied this advice!

slide-82
SLIDE 82

Gains to be made by optimizing these parameters for a task

[Bar chart, performance 30-100 per task: VOC, MIT Scene, SUN, Scene Attributes, Object Attributes, Human Attributes, Pet breeds, Bird Subordinate, Flowers, VOC Action, Stanford Action, Visual Phrases, Holidays, UKB, Oxford, Paris, Sculpture. Three bars per task: Best non-ConvNet, Deep Standard, Deep Optimal.]

slide-83
SLIDE 83

Effect of diversity/amount of training data

Reduce the amount of training data in two ways:

[Left plot: performance vs. % of classes kept (full set: 1000 classes), removing all examples from a percentage of the classes.]
[Right plot: performance vs. % of examples kept from each class (full set: ∼1.3M images).]

Datasets: VOC07, MIT, H3D, UIUC, Pet, CUB, Flowers, VOC action, Holidays, UKB.

slide-84
SLIDE 84

How should we increase the size of the network?

[Diagram: networks of different size, labelled A-J, derived from AlexNet by either increasing depth (depth > 8) or increasing width (depth = 8).]

Trained networks of different size.

Do we learn a better representation with more layers or more filters?

slide-85
SLIDE 85

How should we increase the size of the network?

[Plot: performance (50-90) vs. number of parameters (20-140 millions) for networks A-J. Datasets: VOC07, MIT, Pet, CUB. Curves distinguish depth > 8 vs. depth = 8 and whether depth or width was increased.]

slide-86
SLIDE 86

Latest Challenge: Images + Text

slide-87
SLIDE 87

The next challenge

Automatically generate a sentence description for an image

slide-88
SLIDE 88

How? Interpretation of the problem

  • View the problem as an example of a translation problem:

image → sentence description

  • Automatic translation is a "successful" example of automatically solving a statistical translation problem:

sentence in Swedish → same sentence in English.

slide-89
SLIDE 89

How? General form of the solution

  • Encode a variable-length input into a fixed-length representation using a Recurrent Neural Network.
  • Use a Deep ConvNet representation to encode the image.
  • Decode the representation to the desired output sentence using the Recurrent Neural Network.
  • Use the same approach: the RNN is trained from labelled training examples.
  • Explosion of recent papers on this problem:
  • Explain Images with Multimodal Recurrent Neural Networks; Mao, Xu, Yang, Wang, Yuille
  • Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models; Kiros, Salakhutdinov, Zemel
  • Long-term Recurrent Convolutional Networks for Visual Recognition and Description; Donahue et al.
  • Show and Tell: A Neural Image Caption Generator; Vinyals, Toshev, Bengio, Erhan
  • Deep Visual-Semantic Alignments for Generating Image Descriptions; Karpathy, Fei-Fei

slide-90
SLIDE 90

How? General form of the solution


slide-91
SLIDE 91

More formally

  • Let S = {S_0, …, S_N} represent the words in a sentence.
  • Each S_t is a one-hot representation.
  • Model the joint probability of S given the image I as

log p(S | I) = Σ_{t=1}^{N} log p(S_t | f_ConvNet(I), S_0, S_1, …, S_{t−1})

  • Use a Recurrent Neural Network to model each p(S_t | f_ConvNet(I), S_0, S_1, …, S_{t−1}).
  • Encode I ∪ {S_0, …, S_t} with h_t (a fixed-length hidden variable), updated as

h_{t+1} = f(h_t, x_t)

where x_t is the embedding of word S_t.

slide-92
SLIDE 92

Parameters of the model

Initialize the RNN with input from the image: x_{−1} = f_ConvNet(I). Then for t = 0, …, N:

x_t = W_e S_t,   h_t = σ(W_h h_{t−1} + W_x x_t),   p_t = SoftMax(V h_t)
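One step of this recurrence, sketched in NumPy. Here tanh stands in for the slide's generic σ, and all dimensions are made up for illustration:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_step(h_prev, x_t, Wh, Wx, V):
    """h_t = sigma(Wh h_{t-1} + Wx x_t); p_t = SoftMax(V h_t)."""
    h_t = np.tanh(Wh @ h_prev + Wx @ x_t)
    return h_t, softmax(V @ h_t)

rng = np.random.default_rng(0)
d_h, d_x, M = 16, 8, 50                 # hidden size, embedding size, vocab size
Wh = rng.standard_normal((d_h, d_h)) * 0.1
Wx = rng.standard_normal((d_h, d_x)) * 0.1
V = rng.standard_normal((M, d_h)) * 0.1
h, p = rnn_step(np.zeros(d_h), rng.standard_normal(d_x), Wh, Wx, V)
```

p is the distribution over the M vocabulary words for the next word S_t; in practice an LSTM replaces this plain recurrence, as the next slide shows.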

slide-93
SLIDE 93

Pictorially

[Diagram: unrolled LSTM. The image initialises the first step; at step t the embedded previous word W_e S_{t−1} is fed in and the LSTM outputs p_t, contributing log p_t(S_t) to the objective, for t = 1, …, N.]

slide-94
SLIDE 94

Example results

slide-95
SLIDE 95

Software packages for deep learning

  • Theano/Pylearn2
  • University of Montreal
  • scientific computing framework in Python
  • symbolic computation and automatic differentiation
  • Torch7
  • NYU, supported by Facebook
  • scientific computing framework in Lua
  • Caffe
  • UC Berkeley
  • Devoted to analyzing images and learning ConvNets
  • C++/CUDA library
  • Cuda-ConvNet2
  • Alex Krizhevsky
  • very fast, state-of-the-art GPU implementation with multi-GPU parallelism
  • C++/CUDA library
slide-96
SLIDE 96

Software packages for deep learning

  • MatConvNet
  • Oxford University
  • Matlab toolbox for ConvNets
  • Can run Caffe models
  • Native Matlab GPU support