

SLIDE 1

Return of the Devil in the Details:

Delving Deep into Convolutional Nets

Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman (University of Oxford)

SLIDE 2

The Devil is still in the Details

2011 → 2014

SLIDE 3

Comparing Apples to Apples: State-of-the-art back in 2011

Back in 2011, state-of-the-art image classification pipelines were commonly based on the bag-of-visual-words approach, with highly tuned feature encoders.

  • Many feature encodings were being proposed for this, but it was difficult to tell which worked best.

IFV = Improved Fisher Vector, LLC = Locality-Constrained Linear Coding, SV = Super-Vector Encoding

SLIDE 4

Comparing Apples to Apples: State-of-the-art back in 2011

In our previous work (BMVC 2011) we conducted an extensive evaluation of these encodings, comparing them all on a common ground:

  • we’ll call the features from these encodings ‘shallow’ to distinguish them from the CNN-based features which follow

Pipeline: Input Dataset → Fixed Feature Extractor (IFV / LLC / SV) → Fixed Learning → Fixed Evaluation Protocol
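For readers less familiar with these encoders, below is a minimal sketch of what an Improved Fisher Vector computation looks like, using scikit-learn's GaussianMixture as the visual vocabulary. The function name and the random stand-in data are ours, and the real pipeline (dense SIFT, PCA-reduced descriptors, spatial binning) is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def improved_fisher_vector(descs, gmm):
    """Encode local descriptors (N x D) as an Improved Fisher Vector:
    first- and second-order GMM gradient statistics, followed by
    signed square-rooting and L2 normalisation."""
    N, _ = descs.shape
    q = gmm.predict_proba(descs)                # soft assignments, N x K
    sigma = np.sqrt(gmm.covariances_)           # K x D (diagonal GMM)
    parts = []
    for k in range(gmm.n_components):
        diff = (descs - gmm.means_[k]) / sigma[k]   # normalised residuals, N x D
        qk = q[:, k:k + 1]
        g_mu = (qk * diff).sum(0) / (N * np.sqrt(gmm.weights_[k]))
        g_sig = (qk * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * gmm.weights_[k]))
        parts += [g_mu, g_sig]
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))      # the 'improved' normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)

# Visual vocabulary: a diagonal GMM fitted offline to descriptors pooled from
# training images (random data stands in for dense SIFT; K=256 in practice).
gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(np.random.randn(5000, 64))
print(improved_fisher_vector(np.random.randn(500, 64), gmm).shape)  # (2 * 64 * 64,)
```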

SLIDE 5

What’s Changed?

State-of-the-art in 2014

  • Introduction of CNN-based deep visual features to the community, all using pre-trained networks (Krizhevsky et al. 2012, Donahue et al. 2013, Oquab et al. 2014, Sermanet et al. 2014)
  • These have been shown to perform excellently on standard classification and detection benchmarks
  • It is unclear how the recently introduced methods compare to each other, and to shallow methods such as IFV

SLIDE 6

Comparing Apples to Apples:

State-of-the-art in 2014

  • This work is again about comparing the latest methods on a common ground
  • We compare both different pre-trained network architectures and different learning heuristics

Pipeline: Input Dataset → {CNN Arch 1, CNN Arch 2, …, IFV} → Fixed Learning → Fixed Evaluation Protocol

SLIDE 7

Performance Evolution over VOC2007

mAP on VOC2007 (DeCAF onwards are CNN-based methods):

| Year | Method | Dim. | Aug. | mAP |
|------|--------|------|------|-----|
| 2008 | BOW | 32K | – | 54.48 |
| 2010 | IFV-BL | 327K | – | 61.69 |
| | IFV | 84K | – | 64.36 |
| | IFV | 84K | f s | 68.02 |
| 2013 | DeCAF | 4K | t t | 73.41 |
| 2014 | CNN-F | 4K | f s | 77.15 |
| | CNN-M 2K | 2K | f s | 80.13 |
| | CNN-S (TN) | 4K | f s | 82.42 |

  • Our best CNN method achieves state-of-the-art performance over several datasets
  • How do we get there? Through comparison on equal footing, we determine what’s important and what’s not

SLIDE 8

Outline

  1. Different pre-trained networks
  2. Data augmentation (for both CNN and IFV)
  3. Dataset fine-tuning
  4. Reducing CNN final layer output dimensionality
  5. Colour and CNN / IFV

Up next: study introduction and evaluation setup

SLIDE 9

Evaluation Setup

Input Dataset → Pre-trained Net (1,000 ImageNet classes) → CNN Feature Extractor (4096-D feature vector out) → SVM Classifier (train on the training set, test on the test set) → classifier output, evaluated using mAP, accuracy, etc.
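A minimal sketch of this protocol in scikit-learn, assuming the 4096-D features have already been extracted into arrays (the array and function names here are ours): one-vs-rest linear SVMs whose real-valued outputs are scored by average precision per class, then averaged into mAP.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

def evaluate_map(train_feats, train_labels, test_feats, test_labels):
    """One-vs-rest linear SVMs on fixed CNN features, scored by mAP.
    Labels are 0/1 indicator matrices of shape (n_images, n_classes)."""
    aps = []
    for c in range(train_labels.shape[1]):
        svm = LinearSVC(C=1.0)                      # C would be cross-validated
        svm.fit(train_feats, train_labels[:, c])
        scores = svm.decision_function(test_feats)  # real-valued classifier output
        aps.append(average_precision_score(test_labels[:, c], scores))
    return float(np.mean(aps))                      # mean average precision
```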

SLIDE 10

Pre-trained Networks

  • CNN-F: similar to Krizhevsky et al., NIPS 2012 (‘ImageNet classification with deep convolutional neural networks’)
  • CNN-M: similar to Zeiler and Fergus, CoRR 2013 (‘Visualising and understanding convolutional networks’)
  • CNN-S: similar to the OverFeat ‘accurate’ network, ICLR 2014 (‘OverFeat: integrated recognition, localisation and detection using ConvNets’)

| Layer | CNN-F | CNN-M | CNN-S |
|-------|-------|-------|-------|
| conv1 | 64×11×11, stride 4 | 96×7×7, stride 2 | 96×7×7, stride 2 |
| conv2 | 256×5×5, stride 1 | 256×5×5, stride 2 | 256×5×5, stride 1 |
| conv3 | 256×3×3, stride 1 | 512×3×3, stride 1 | 512×3×3, stride 1 |
| conv4 | 256×3×3 | 512×3×3 | 512×3×3 |
| conv5 | 256×3×3 | 512×3×3 | 512×3×3 |
| fc6 | 4096, drop-out | 4096, drop-out | 4096, drop-out |
| fc7 | 4096, drop-out | 4096, drop-out | 4096, drop-out |
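For concreteness, the CNN-M column could be written out in PyTorch roughly as below. This is our sketch rather than the authors' original (Caffe-era) definition: the LRN layers are omitted, and the pooling placement and the adaptive pooling before fc6 are assumptions made to keep the example self-contained.

```python
import torch.nn as nn

# Rough CNN-M skeleton; LRN omitted, pooling placement assumed.
cnn_m = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),              # conv1
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=1), nn.ReLU(inplace=True), # conv2
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv3
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv4
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),          # conv5
    nn.AdaptiveMaxPool2d((6, 6)), nn.Flatten(),
    nn.Linear(512 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(),             # fc6
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),                    # fc7
    nn.Linear(4096, 1000),                            # 1,000-way ImageNet classifier
)
```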

SLIDE 11

Pre-trained Networks

mAP (VOC07):

| Network | mAP |
|---------|-----|
| DeCAF | 73.41 |
| CNN-F | 77.38 |
| CNN-M | 79.89 |
| CNN-S | 79.74 |

SLIDE 12

Outline

  1. Different pre-trained networks
  2. Data augmentation (for both CNN and IFV) ← up next
  3. Dataset fine-tuning
  4. Reducing CNN final layer output dimensionality
  5. Colour and CNN / IFV

SLIDE 13

Data Augmentation

What do we mean by data augmentation? Distinct from the jittering applied during network pre-training, here the augmentation happens around the pre-trained CNN feature extractor at feature-computation time:

  • a. Extract crops from the input image
  • b. Pool the resulting features across crops (average, max)

SLIDE 14

Data Augmentation

  • a. No augmentation (= 1 image)
  • b. Flip augmentation (= 2 images)
  • c. Crop+Flip augmentation (= 10 images: 224×224 crops plus their flips)
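A sketch of option (c) using torchvision's TenCrop transform (the four corner crops plus the centre crop, each with its horizontal flip); `feature_extractor` is a hypothetical module mapping a batch of crops to 4096-D features, and the crop features are pooled by averaging or max as described above.

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),   # 4 corner crops + centre crop, plus their flips
    transforms.Lambda(lambda crops: torch.stack([TF.to_tensor(c) for c in crops])),
])

def augmented_feature(image, feature_extractor, pool="avg"):
    """One descriptor per crop, pooled across the 10 crops."""
    crops = ten_crop(image)                  # (10, 3, 224, 224)
    with torch.no_grad():
        feats = feature_extractor(crops)     # (10, 4096)
    return feats.mean(0) if pool == "avg" else feats.max(0).values
```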

SLIDE 15

Data Augmentation

mAP (VOC07):

| Augmentation | CNN-M | IFV |
|--------------|-------|-----|
| None | 76.97 | 64.36 |
| Flip | 76.99 | 64.35 |
| Crop+Flip (train pooling: sum, test pooling: sum) | 79.44 | 66.68 |
| Crop+Flip (train pooling: none, test pooling: sum) | 79.89 | 67.17 |

SLIDE 16

Outline

  1. Different pre-trained networks
  2. Data augmentation (for both CNN and IFV)
  3. Dataset fine-tuning ← up next
  4. Reducing CNN final layer output dimensionality
  5. Colour and CNN / IFV

SLIDE 17

Fine-tuning

Network pre-training uses images from ILSVRC-2012 and yields a pre-trained network with general-purpose features; network fine-tuning uses images from the target dataset and yields a fine-tuned network with dataset-specific features.

For VOC 2007, the following loss functions were evaluated for the final fully connected layer:

  • TN-CLS – classification loss: max{ 0, 1 − y wᵀφ(I) }
  • TN-RNK – ranking loss: max{ 0, 1 − wᵀ( φ(I_pos) − φ(I_neg) ) }
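In code, the two hinge losses look roughly like this (a sketch in PyTorch notation; `w` stands for one class's weight vector in the final layer, `phi` for the features feeding it, and `y` for ±1 labels):

```python
import torch

def tn_cls_loss(w, phi, y):
    """TN-CLS: one-vs-rest hinge classification loss; y is a (-1/+1) vector."""
    return torch.clamp(1 - y * (phi @ w), min=0).mean()

def tn_rnk_loss(w, phi_pos, phi_neg):
    """TN-RNK: hinge on the score margin between positive and negative images."""
    return torch.clamp(1 - (phi_pos - phi_neg) @ w, min=0).mean()
```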

SLIDE 18

Fine-tuning

mAP (VOC07), CNN-S:

| Fine-tuning | mAP |
|-------------|-----|
| No TN | 79.7 |
| TN-RNK | 82.4 |

SLIDE 19

Outline

  1. Different pre-trained networks
  2. Data augmentation (for both CNN and IFV)
  3. Dataset fine-tuning
  4. Reducing CNN final layer output dimensionality ← up next
  5. Colour and CNN / IFV

SLIDE 20

Low Dimensional CNN Features

  • Baseline networks all have a 4096-D last hidden layer
  • We further trained three modifications of CNN-M with lower-dimensional full7 layers (2048, 1024, or 128 in place of 4096); the rest of the architecture is unchanged (conv1 96×7×7, st. 2; conv2 256×5×5, st. 2, pad 1; conv3 512×3×3, st. 1, pad 1; conv4 512×3×3; conv5 512×3×3; fc6 4096, drop-out). See the sketch after this note.

* Note: as only the original ILSVRC-2012 data was used for re-training, this differs from fine-tuning and is simply a way of reducing the final output dimension
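What this amounts to, as a hedged sketch (the attribute names `full7` and `full8` are hypothetical, standing for the last hidden layer and the classifier that reads from it):

```python
import torch.nn as nn

def shrink_full7(net, new_dim=128):
    """Swap the 4096-D full7 layer for a narrower one, then re-train the
    network on ILSVRC-2012 only (distinct from target-dataset fine-tuning)."""
    net.full7 = nn.Linear(4096, new_dim)   # was 4096 -> 4096
    net.full8 = nn.Linear(new_dim, 1000)   # 1,000-way ImageNet classifier
    return net
```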

SLIDE 21

Low Dimensional CNN Features

CNN-M, mAP (VOC07) by full7 dimensionality:

| full7 dim. | mAP |
|------------|-----|
| 4096 | 79.89 |
| 2048 | 80.1 |
| 1024 | 79.91 |
| 128 | 78.6 |

SLIDE 22

Outline

  1. Different pre-trained networks
  2. Data augmentation (for both CNN and IFV)
  3. Dataset fine-tuning
  4. Reducing CNN final layer output dimensionality
  5. Colour and CNN / IFV ← up next

SLIDE 23

Impact of Colour

mAP (VOC07):

| Input | CNN-M | IFV-512 |
|-------|-------|---------|
| Greyscale | 73.59 | 65.36 |
| Colour | 76.97 | 66.37 |
| Greyscale + aug. | 77 | 68.02 |
| Colour + aug. | 79.89 | 67.93 |

SLIDE 24

Comparison to State-of-the-art

| Method | ILSVRC-2012 (top-5 error) | VOC2007 (mAP) | VOC2012 (mAP) |
|--------|---------------------------|---------------|---------------|
| CNN-M 2048 | 13.5 | 80.1 | 82.4 |
| CNN-S | 13.1 | 79.7 | 82.9 |
| CNN-S TUNE-RNK | 13.1 | 82.4 | 83.2 |
| Zeiler & Fergus | 16.1 | – | 79.0 |
| Oquab et al. | 18.0 | 77.7 | 78.7 (82.8*) |
| Oquab et al. | – | – | 86.3* |
| Wei et al. | – | 81.5 (85.2*) | 81.7 (90.3*) |

* Uses extended training data and/or fusion with other methods

SLIDE 25

Take Home Messages

  • CNN-based methods >> shallow methods
  • We can transfer tricks from deep features to shallow features
  • We can achieve incredibly low-dimensional (~128-D) but performant features with CNN-based methods
  • If you get the details right, it’s possible to get to state-of-the-art with very simple methods

SLIDE 26

There’s more…

  • Presented here was just a subset of the full results from the paper
  • Check out the paper for full results on:
    • VOC 2007
    • VOC 2012
    • Caltech-101
    • Caltech-256
    • ILSVRC-2012

SLIDE 27

One more thing…

  • CNN models and feature computation code can now be downloaded from the project website:
    http://www.robots.ox.ac.uk/~vgg/software/deep_eval/
  • As before, source code to reproduce all experiments will be made available

SLIDE 28

Questions?