  1. Return of the Devil in the Details: Delving Deep into Convolutional Nets Ken Chatfield - Karen Simonyan - Andrea Vedaldi - Andrew Zisserman University of Oxford

  2. The Devil is still in the Details: 2011 → 2014

  3. Comparing Apples to Apples: State-of-the-art back in 2011. Back in 2011, state-of-the-art image classification pipelines were commonly based on the bag-of-visual-words approach, with highly tuned feature encoders: LLC (Locality-constrained Linear Coding), SV (Super-Vector encoding), and IFV (Improved Fisher Vector). Many such feature encodings were being proposed, but it was difficult to tell which worked best.

  4. Comparing Apples to Apples: State-of-the-art back in 2011. In our previous work (BMVC 2011) we conducted an extensive evaluation of these encodings, comparing them all on common ground: each encoding (IFV, LLC, SV) was run with a fixed input dataset, a fixed feature extractor, fixed learning, and a fixed evaluation protocol. (* We call the features from these encodings "shallow" to distinguish them from the CNN-based features which follow.)

  5. What’s Changed? State-of-the-art in 2014
• Introduction of CNN-based deep visual features to the community, all using pre-trained networks (Krizhevsky et al. 2012, Donahue et al. 2013, Oquab et al. 2014, Sermanet et al. 2014)
• These have been shown to perform excellently on standard classification and detection benchmarks
• However, it is unclear how the recently introduced methods compare to each other, and to shallow methods such as IFV

  6. Comparing Apples to Apples: State-of-the-art in 2014
• This work is again about comparing the latest methods on common ground
• We compare both different pre-trained network architectures and different learning heuristics: each method (CNN Arch 1, CNN Arch 2, ..., IFV) is evaluated with a fixed input dataset, fixed learning, and a fixed evaluation protocol

  7. Performance Evolution over VOC2007 (2008, 2010, ..., 2013, 2014)
[Bar chart: mAP on VOC2007 per method]

  Method       Dim.    mAP
  BOW          32K     54.48
  IFV-BL       327K    61.69
  IFV          84K     64.36
  IFV          84K     68.02
  DeCAF        4K      73.41
  CNN-F        4K      77.15
  CNN-M 2K     2K      80.13
  CNN-S (TN)   4K      82.42

(DeCAF through CNN-S are the CNN-based methods.)
• Our best CNN method achieves state-of-the-art performance over several datasets
• How do we get there? Through comparison on equal footing, we determine what’s important and what’s not

  8. Outline
Introduction and Evaluation Setup
1. Different pre-trained networks
2. Data augmentation (for both CNN and IFV)
3. Dataset fine-tuning
4. Reducing CNN final layer output dimensionality
5. Colour and CNN / IFV

  9. Evaluation Setup
A network pre-trained on the 1,000 ImageNet classes is used as a CNN feature extractor (4096-D feature vector out). An SVM classifier is trained on the training set, applied to the test set, and evaluated using mAP, accuracy, etc.
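The setup above can be sketched end to end. The block below is a minimal stand-in, not the authors' code: random Gaussian blobs play the role of the 4096-D CNN features (the data is synthetic), and a Pegasos-style subgradient solver stands in for whatever SVM trainer was actually used.

```python
import numpy as np

# Hypothetical stand-in for CNN features: two synthetic "classes" of
# 4096-D vectors separated along a random direction.
rng = np.random.default_rng(0)
D = 4096
n_per_class = 40
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)
pos = rng.normal(size=(n_per_class, D)) + 2.0 * direction
neg = rng.normal(size=(n_per_class, D)) - 2.0 * direction
X = np.vstack([pos, neg])
y = np.array([1.0] * n_per_class + [-1.0] * n_per_class)

def train_linear_svm(X, y, lam=0.01, epochs=20):
    """Linear SVM via Pegasos-style subgradient descent on the hinge loss."""
    w = np.zeros(X.shape[1])
    t = 0
    order_rng = np.random.default_rng(1)
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)
            # Subgradient step: hinge term only fires inside the margin.
            if y[i] * w.dot(X[i]) < 1:
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

w = train_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
print(f"training accuracy: {accuracy:.2f}")
```

In the real pipeline the feature matrix `X` would come from the fc7 activations of the pre-trained network rather than a random generator.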

  10. Pre-trained Networks
• CNN-F: similar to Krizhevsky et al., NIPS 2012, ‘ImageNet classification with deep convolutional networks’
  conv1: 64x11x11, stride 4 | conv2: 256x5x5, stride 1 | conv3–conv5: 256x3x3, stride 1 | fc6: 4096, drop-out | fc7: 4096, drop-out
• CNN-M: similar to Zeiler and Fergus, CoRR 2013, ‘Visualizing and understanding convolutional networks’
  conv1: 96x7x7, stride 2 | conv2: 256x5x5, stride 2 | conv3–conv5: 512x3x3, stride 1 | fc6: 4096, drop-out | fc7: 4096, drop-out
• CNN-S: similar to the OverFeat ‘accurate’ network, ICLR 2014, ‘OverFeat: integrated recognition, localisation and detection using ConvNets’
  conv1: 96x7x7, stride 2 | conv2: 256x5x5, stride 1 | conv3–conv5: 512x3x3, stride 1 | fc6: 4096, drop-out | fc7: 4096, drop-out
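A quick back-of-the-envelope helper for the architectures above: the spatial size of a conv layer's output follows from its kernel size, stride, and padding. The kernel/stride values below are from the slide; the zero-padding assumption is mine, so the real feature-map sizes in the actual networks (which use padding) may differ slightly.

```python
def conv_out(size, kernel, stride, pad=0):
    """Output spatial size of a convolution (floor convention)."""
    return (size - kernel + 2 * pad) // stride + 1

# CNN-F conv1: 11x11 kernel, stride 4, on a 224x224 input (pad assumed 0).
print(conv_out(224, 11, 4))   # -> 54
# CNN-M / CNN-S conv1: 7x7 kernel, stride 2 (pad assumed 0).
print(conv_out(224, 7, 2))    # -> 109
```

The smaller first-layer stride of CNN-M and CNN-S (2 vs. CNN-F's 4) is what preserves the finer spatial resolution early in the network.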

  11. Pre-trained Networks: results
[Bar chart: mAP on VOC07] Decaf: 73.41 | CNN-F: 77.38 | CNN-M: 79.89 | CNN-S: 79.74

  12. Outline: next, 2. Data augmentation (for both CNN and IFV)

  13. Data Augmentation
What do we mean by data augmentation? Here it is applied at feature-extraction time, on top of the pre-trained network (network pre-training itself uses jittering): a. extract crops from the input image; b. pool the per-crop features (average, max).

  14. Data Augmentation schemes
a. No augmentation (= 1 image): a single 224x224 crop
b. Flip augmentation (= 2 images): 224x224 crop + its horizontal flip
c. Crop+Flip augmentation (= 10 images): 224x224 crops + flips
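Scheme (c) above can be sketched directly: take the four corner crops and the centre crop of a resized image, then add the horizontal flip of each, giving 10 sub-images. This is a minimal numpy sketch; the 256x256 input size is an assumption for the demo.

```python
import numpy as np

def ten_crop(image, crop=224):
    """Four corner crops + centre crop of `image`, plus horizontal flips."""
    h, w, _ = image.shape
    # Top-left (y, x) offsets for the four corners and the centre.
    offsets = [(0, 0), (0, w - crop), (h - crop, 0),
               (h - crop, w - crop), ((h - crop) // 2, (w - crop) // 2)]
    crops = [image[y:y + crop, x:x + crop] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips of each crop
    return np.stack(crops)

img = np.zeros((256, 256, 3), dtype=np.uint8)  # hypothetical resized input
crops = ten_crop(img)
print(crops.shape)  # -> (10, 224, 224, 3)
```

The ten feature vectors extracted from these crops are then pooled (sum/average or max) into a single descriptor, as in the pooling options compared on the next slide.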

  15. Data Augmentation: results
[Bar chart: mAP on VOC07; legend: None | Flip | Crop+Flip (train pooling: sum, test pooling: sum) | Crop+Flip (train pooling: none, test pooling: sum)]
IFV:   64.36 | 64.35 | 67.17 | 66.68
CNN-M: 76.97 | 76.99 | 79.89 | 79.44

  16. Outline: next, 3. Dataset fine-tuning

  17. Fine-tuning
Network pre-training on general-purpose images from ILSVRC-2012 yields a pre-trained network and generic features; network fine-tuning on images from the target dataset yields a fine-tuned network and dataset-specific features.
For VOC 2007, the following loss functions were evaluated for the final fully connected layer:
• TN-CLS, classification loss: max{ 0, 1 - y w^T φ(I) }
• TN-RNK, ranking loss: max{ 0, 1 - w^T ( φ(I_pos) - φ(I_neg) ) }
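The two losses above can be written out for a single linear scoring function w. The feature vectors below are placeholder values chosen for the demo, not outputs of any network.

```python
import numpy as np

def cls_loss(w, phi, y):
    """TN-CLS: classification hinge loss max{0, 1 - y * w^T phi(I)}."""
    return max(0.0, 1.0 - y * w.dot(phi))

def rnk_loss(w, phi_pos, phi_neg):
    """TN-RNK: ranking hinge loss max{0, 1 - w^T (phi(I_pos) - phi(I_neg))}."""
    return max(0.0, 1.0 - w.dot(phi_pos - phi_neg))

w = np.array([1.0, 0.0])
phi_pos = np.array([2.0, 0.0])   # scored 2.0 by w
phi_neg = np.array([0.5, 0.0])   # scored 0.5 by w

print(cls_loss(w, phi_pos, +1.0))      # -> 0.0: positive is beyond the margin
print(rnk_loss(w, phi_pos, phi_neg))   # -> 0.0: pos outranks neg by >= 1
print(cls_loss(w, phi_neg, +1.0))      # -> 0.5: inside the margin, penalised
```

The key difference: TN-CLS penalises each image's absolute score against its label, while TN-RNK only cares that positives score at least a margin above negatives, which matches the ranking nature of the mAP evaluation metric.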

  18. Fine-tuning: results
[Bar chart: mAP on VOC07, CNN-S] No TN: 79.7 | TN-RNK: 82.4

  19. Outline: next, 4. Reducing CNN final layer output dimensionality

  20. Low Dimensional CNN Features
• Baseline networks all have a 4096-D last hidden layer
• We further trained three modifications of CNN-M with lower-dimensional fc7 layers: 2048, 1024, and 128
  (architecture otherwise as CNN-M: conv1 96x7x7, st. 2 | conv2 256x5x5, st. 2, pad 1 | conv3–conv5 512x3x3, st. 1, pad 1 | fc6 4096, drop-out | fc7 4096 / 2048 / 1024 / 128, drop-out)
* Note: as only the original ILSVRC-2012 data was used for re-training, this differs from fine-tuning and is simply a way of reducing the final output dimension

  21. Low Dimensional CNN Features: results
[Bar chart: mAP on VOC07, CNN-M by fc7 dimension] 4096: 79.89 | 2048: 80.1 | 1024: 79.91 | 128: 78.6

  22. Outline: next, 5. Colour and CNN / IFV

  23. Impact of Colour
[Bar chart: mAP on VOC07; legend: Greyscale | Colour | Greyscale+aug | Colour+aug]
IFV-512: 66.37 | 65.36 | 68.02 | 67.93
CNN-M:   73.59 | 76.97 |   –   | 79.89

  24. Comparison to State-of-the-art

  Method            ILSVRC-2012   VOC2007        VOC2012
  CNN-M 2048        13.5          80.1           82.4
  CNN-S             13.1          79.7           82.9
  CNN-S TUNE-RNK    13.1          82.4           83.2
  Zeiler & Fergus   16.1          79.0           –
  Oquab et al.      18.0          77.7           78.7 (82.8*)
  Oquab et al.      –             –              86.3*
  Wei et al.        –             81.5 (85.2*)   81.7 (90.3*)

* Uses extended training data and/or fusion with other methods

  25. Take Home Messages
• CNN-based methods >> shallow methods
• We can transfer tricks from deep features to shallow features
• We can achieve incredibly low dimensional (~128-D) but performant features with CNN-based methods
• If you get the details right, it’s possible to get to state-of-the-art with very simple methods

  26. There’s more… • Presented here was just a subset of the full results from the paper • Check out the paper for full results on: • VOC 2007 • VOC 2012 • Caltech-101 • Caltech-256 • ILSVRC-2012 26

  27. One more thing…
• CNN models and feature computation code can now be downloaded from the project website: http://www.robots.ox.ac.uk/~vgg/software/deep_eval/
• As before, source code to reproduce all experiments will be made available

  28. Questions?
