Return of the Devil in the Details:
Delving Deep into Convolutional Nets
Ken Chatfield - Karen Simonyan - Andrea Vedaldi - Andrew Zisserman University of Oxford
The Devil is still in the Details (2011, 2014)
Comparing Apples to Apples:
IFV: Improved Fisher Vector
LLC: Locality Constrained Linear Coding
SV: Super-Vector Encoding
Input Dataset → Fixed Feature Extractor → encoding (IFV / LLC / SV) → Fixed Learning → Fixed Evaluation Protocol
(Krizhevsky et al. 2012, Donahue et al. 2013, Oquab et al. 2014, Sermanet et al. 2014)
Input Dataset → feature extractor (CNN Arch 1 / CNN Arch 2 / IFV / …) → Fixed Learning → Fixed Evaluation Protocol
Method       Dim.   Aug.   mAP
BOW          32K    –      54.48
IFV-BL       327K   –      61.69
IFV          84K    –      64.36
IFV          84K    f s    68.02
DeCAF        4K     t t    73.41
CNN-F        4K     f s    77.15
CNN-M 2K     2K     f s    80.13
CNN-S (TN)   4K     f s    82.42
Years marked on the chart: 2008, 2010, 2013, 2014
achieves state-of-the-art performance over several datasets
through comparison on equal footing, we determine what’s important and what’s not
CNN-based methods
Aspects compared: 1. Nets, 2. Augmentation, 3. Fine-tuning, 4. Output Dim, 5. IFV Exts.
Training set → train SVM Classifier
Test set → test SVM Classifier → classifier output → evaluate using mAP, accuracy etc.
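The "evaluate using mAP" step can be sketched as below. This is a minimal illustration, assuming average precision is taken as the mean of the precision values at the rank of each positive; the exact VOC protocol is not stated on the slide, and the function names are hypothetical.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision at the rank of each positive.

    scores: classifier outputs, higher means more confident.
    labels: 1 for positive images, 0 for negative images.
    """
    # Rank images by decreasing classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """mAP: mean of the per-class APs; per_class is a list of (scores, labels)."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)
```

Usage: pass one `(scores, labels)` pair per object class, computed on the test set, to `mean_average_precision`.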
Pre-trained Net
CNN Feature Extractor
(4096-D feature vector out)
1. Nets
Layer   CNN-F                 CNN-M                 CNN-S
conv1   64x11x11, stride 4    96x7x7, stride 2      96x7x7, stride 2
conv2   256x5x5, stride 1     256x5x5, stride 2     256x5x5, stride 1
conv3   256x3x3, stride 1     512x3x3, stride 1     512x3x3, stride 1
conv4   256x3x3               512x3x3               512x3x3
conv5   256x3x3               512x3x3               512x3x3
fc6     4096, drop-out        4096, drop-out        4096, drop-out
fc7     4096, drop-out        4096, drop-out        4096, drop-out
Based on:
CNN-F: ‘ImageNet classification with deep convolutional networks’
CNN-M: ‘Visualising and understanding convolutional networks’
CNN-S: ‘OverFeat: integrated recognition, localisation and detection using ConvNets’
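To make the size difference between the layer listings concrete, the sketch below counts convolution weights per layer. It is an illustration only and assumes a 3-channel (RGB) input, dense connectivity between consecutive layers (no filter grouping), and no bias terms; none of these are stated on the slide. The names `CNN_F`, `CNN_M`, and `conv_weight_counts` are hypothetical.

```python
# (layer name, number of filters, square kernel size), as listed on the slide.
CNN_F = [("conv1", 64, 11), ("conv2", 256, 5), ("conv3", 256, 3),
         ("conv4", 256, 3), ("conv5", 256, 3)]
CNN_M = [("conv1", 96, 7), ("conv2", 256, 5), ("conv3", 512, 3),
         ("conv4", 512, 3), ("conv5", 512, 3)]


def conv_weight_counts(layers, in_channels=3):
    """Return {layer name: weight count}, with weights = filters * inputs * k * k.

    Assumes each layer sees all output channels of the previous layer.
    """
    counts = {}
    for name, filters, kernel in layers:
        counts[name] = filters * in_channels * kernel * kernel
        in_channels = filters  # the next layer takes this layer's output
    return counts


if __name__ == "__main__":
    for label, net in (("CNN-F", CNN_F), ("CNN-M", CNN_M)):
        print(label, conv_weight_counts(net))
```

Strides do not enter this count, so the 96x7x7 network with conv2 stride 1 has the same convolution weight counts as the one with conv2 stride 2 under these assumptions.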
mAP (VOC07) by network:
Decaf   73.41
CNN-F   77.38
CNN-M   79.89
CNN-S   79.74
2. Augmentation
Network Pre-training (with jittering) → Pre-trained Network
(average, max)
[Figure: augmentation with 224x224 crops + flips]
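A minimal sketch of crop + flip augmentation follows. It assumes the crops are the four corners plus the centre of a larger image, giving ten samples with their horizontal flips; the slide only shows "crops + flips", so the crop layout and the function names are assumptions. Images are plain nested lists to keep the example dependency-free.

```python
def crop(image, top, left, size):
    """Return the size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Mirror the image horizontally."""
    return [row[::-1] for row in image]


def crop_flip_augment(image, size=224):
    """Return 5 crops (4 corners + centre, assumed layout) followed by their flips."""
    height, width = len(image), len(image[0])
    offsets = [
        (0, 0),                                       # top-left
        (0, width - size),                            # top-right
        (height - size, 0),                           # bottom-left
        (height - size, width - size),                # bottom-right
        ((height - size) // 2, (width - size) // 2),  # centre
    ]
    crops = [crop(image, top, left, size) for top, left in offsets]
    return crops + [hflip(c) for c in crops]
```

Each of the ten samples would then be passed through the CNN feature extractor separately.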
mAP (VOC07) by augmentation:
Augmentation                                         IFV     CNN-M
None                                                 64.36   76.97
Flip                                                 64.35   76.99
Crop+Flip (train pooling: sum, test pooling: sum)    66.68   79.44
Crop+Flip (train pooling: none, test pooling: sum)   67.17   79.89
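The two training pooling modes can be sketched as follows. This assumes "sum" means the descriptors of one image's augmented samples are added into a single descriptor, and "none" means every augmented sample becomes its own training example with the image's label; that reading of "none" is an assumption, as are the function names.

```python
def sum_pool(descriptors):
    """Element-wise sum over the descriptors of one image's augmented samples."""
    return [sum(values) for values in zip(*descriptors)]


def training_samples(images, pooling):
    """Build (descriptor, label) training pairs.

    images: list of (descriptors_of_augmented_samples, label) pairs.
    pooling: "sum" gives one pooled descriptor per image;
             "none" keeps one descriptor per augmented sample.
    """
    samples = []
    for descriptors, label in images:
        if pooling == "sum":
            samples.append((sum_pool(descriptors), label))
        elif pooling == "none":
            samples.extend((list(d), label) for d in descriptors)
        else:
            raise ValueError(f"unknown pooling: {pooling}")
    return samples
```

With ten augmented samples per image, "none" yields ten times as many training examples as "sum"; at test time both rows of the chart use sum pooling.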
3. Fine-tuning
Network Pre-training (images from ILSVRC-2012) → Pre-trained Network → General-purpose Features
Network Fine-tuning (images from target dataset) → Fine-tuned Network → Dataset-specific Features
For VOC 2007, the following loss functions were evaluated for the final fully connected layer:
CNN-S, mAP (VOC07):
No TN    79.7
TN-RNK   82.4
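The "RNK" in TN-RNK suggests a ranking loss. As an illustration only, the sketch below shows one common pairwise ranking hinge loss for a single class: every positive image should score at least a margin above every negative image. The exact formulation used for fine-tuning is not given on the slides, so the form, the margin value, and the function name are all assumptions.

```python
def ranking_hinge_loss(scores, labels, margin=1.0):
    """Mean over (positive, negative) pairs of max(0, margin - (s_pos - s_neg)).

    scores: network outputs for one class, one per image.
    labels: 1 if the image contains the class, 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        return 0.0  # no pairs to rank
    total = sum(max(0.0, margin - (p - n)) for p in positives for n in negatives)
    return total / (len(positives) * len(negatives))
```

A loss of this kind penalises mis-ordered pairs directly, which lines up with a ranking-based measure such as mAP.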
4. Output Dim
conv1   96x7x7
conv2   256x5x5
conv3   512x3x3
conv4   512x3x3
conv5   512x3x3
fc6     4096, drop-out
fc7     4096 / 2048 / 1024 / 128*, drop-out
* Note: as only the original ILSVRC-2012 data was used for re-training, this differs from fine-tuning and is simply a way of reducing the final output dimension.
CNN-M, mAP (VOC07) by output dimension:
4096   79.89
2048   80.1
1024   79.91
128    78.6
5. IFV Exts.
mAP (VOC07), greyscale vs. colour:
Input           IFV-512   CNN-M
Greyscale       65.36     73.59
Colour          66.37     76.97
Greyscale+aug   68.02     77
Colour+aug      67.93     79.89
Method            ILSVRC-2012     VOC07          VOC12
                  (top-5 error)   (mAP)          (mAP)
CNN-M 2048        13.5            80.1           82.4
CNN-S             13.1            79.7           82.9
CNN-S TUNE-RNK    13.1            82.4           83.2
Zeiler & Fergus   16.1            –              79.0
Oquab et al.      18.0            77.7           78.7 (82.8*)
Oquab et al.      –               –              86.3*
Wei et al.        –               81.5 (85.2*)   81.7 (90.3*)
* Uses extended training data and/or fusion with other methods