  1. Very Deep ConvNets for Large-Scale Image Recognition
     Karen Simonyan, Andrew Zisserman
     Visual Geometry Group, University of Oxford
     ILSVRC Workshop, 12 September 2014

  2. Summary of VGG Submission
     • Localisation task: 1st place, 25.3% error
     • Classification task: 2nd place, 7.3% error
     • Key component: very deep ConvNets, with up to 19 weight layers

  3. Effect of Depth
     • How does ConvNet depth affect performance?
     • Comparison of ConvNets:
       • same generic design – fair evaluation
       • increasing depth, from 11 to 19 weight layers

  4. Network Design
     Architecture (the configuration shown on the slide; see the code sketch below):
     image → conv-64, conv-64, maxpool → conv-128, conv-128, maxpool → conv-256, conv-256, maxpool → conv-512, conv-512, maxpool → conv-512, conv-512, maxpool → FC-4096 → FC-4096 → FC-1000 → softmax
     Key design choices:
     • 3x3 conv. kernels – very small
     • conv. stride 1 – no loss of information
     Other details:
     • Rectification (ReLU) non-linearity
     • 5 max-pool layers (x2 reduction)
     • no normalisation
     • 3 fully-connected (FC) layers
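
To make the slide's column of layers concrete, here is a minimal sketch of this 13-weight-layer configuration. PyTorch is an assumption of the sketch (slide 9 notes the original work used a heavily modified Caffe), and `conv_block` is a helper introduced here.

```python
# Sketch of the 13-weight-layer configuration from this slide:
# 3x3 kernels, stride 1, padding 1, ReLU, 5 max-pools, no normalisation.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 stride-1 convolutions followed by a 2x2 max-pool."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # x2 reduction
    return layers

features = nn.Sequential(
    *conv_block(3, 64, 2),     # conv-64, conv-64, maxpool
    *conv_block(64, 128, 2),   # conv-128, conv-128, maxpool
    *conv_block(128, 256, 2),  # conv-256, conv-256, maxpool
    *conv_block(256, 512, 2),  # conv-512, conv-512, maxpool
    *conv_block(512, 512, 2),  # conv-512, conv-512, maxpool
)

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),  # class scores; softmax is applied in the loss
)

x = torch.randn(1, 3, 224, 224)       # fixed-size 224x224 input
print(classifier(features(x)).shape)  # torch.Size([1, 1000])
```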

  5. Discussion
     Why 3x3 layers?
     • Stacked conv. layers have a large receptive field:
       • two 3x3 layers – 5x5 receptive field
       • three 3x3 layers – 7x7 receptive field
     • More non-linearity
     • Fewer parameters to learn: ~140M per net
     (Figure: receptive fields of the 1st and 2nd stacked 3x3 conv. layers)
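
The receptive-field and parameter claims can be checked with a few lines of arithmetic; below is a small sketch, where the channel width C = 512 is an illustrative assumption taken from the widest conv. layers.

```python
# Back-of-the-envelope check: stacked 3x3 layers match the receptive field
# of one larger kernel with fewer parameters and more non-linearities.
def stacked_rf(n_layers, k=3):
    # Each stride-1 kxk layer adds (k - 1) to the receptive field.
    return 1 + n_layers * (k - 1)

def conv_params(k, c):
    # Weights of one kxk conv with c input and c output channels (no bias).
    return k * k * c * c

C = 512
print(stacked_rf(2))          # 5  -> two 3x3 layers see a 5x5 region
print(stacked_rf(3))          # 7  -> three 3x3 layers see a 7x7 region
print(3 * conv_params(3, C))  # 7,077,888 weights for three 3x3 layers
print(conv_params(7, C))      # 12,845,056 weights for a single 7x7 layer
```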

  6. Training
     • Solver (see the sketch below):
       • multinomial logistic regression
       • mini-batch gradient descent with momentum
       • dropout and weight decay regularisation
       • fast convergence (74 training epochs)
     • Initialisation:
       • large number of ReLU layers – prone to stalling
       • the shallowest net (11 layers) uses random Gaussian initialisation
       • deeper nets: top 4 conv. and FC layers initialised from the 11-layer net; other layers – random Gaussian
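
A minimal sketch of this solver in PyTorch follows. The hyperparameter values (learning rate 0.01, momentum 0.9, weight decay 5e-4) come from the accompanying arXiv paper rather than the slide, and the tiny stand-in model and dummy mini-batch are placeholders for illustration.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this is the deep ConvNet from slide 4.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))

criterion = nn.CrossEntropyLoss()  # multinomial logistic regression loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# One dummy mini-batch of 224x224 crops (a real loader would supply these).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)  # forward pass + loss
loss.backward()                          # backprop
optimizer.step()                         # mini-batch SGD with momentum
```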

  7. Training (2)
     • Multi-scale training (see the augmentation sketch below):
       • randomly-cropped, fixed-size 224x224 ConvNet input
       • different training image sizes: 256xN, 384xN, and [256;512]xN (random image size – scale jittering)
     • Standard jittering:
       • random horizontal flips
       • random RGB shift
     (Figure: 224x224 crops sampled from rescaled images with smallest side 256 (N≥256) and 384 (N≥384))
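
A sketch of the scale-jittering pipeline, assuming torchvision transforms as a stand-in for the original implementation; `ScaleJitterCrop` is a helper name introduced here, and the random RGB shift is omitted.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as F

class ScaleJitterCrop:
    """Per image: resize the smallest side to a random S in [s_min, s_max],
    then take a random 224x224 crop (the fixed-size ConvNet input)."""
    def __init__(self, s_min=256, s_max=512, crop=224):
        self.s_min, self.s_max, self.crop = s_min, s_max, crop

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)  # random smallest side S
        img = F.resize(img, s)                      # SxN image, N >= S
        return transforms.RandomCrop(self.crop)(img)

train_transform = transforms.Compose([
    ScaleJitterCrop(),                  # scale jittering + random crop
    transforms.RandomHorizontalFlip(),  # standard jittering
    transforms.ToTensor(),              # (random RGB shift omitted here)
])
```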

  8. Testing
     • Dense application over the whole image (see the sketch below):
       • FC layers converted to conv. layers
       • sum-pooling of class score maps
       • more efficient than applying the net to multiple crops
     • Jittering:
       • multiple image sizes: 256xN, 384xN, etc.
       • horizontal flips
       • class scores averaged
     (Figure: image → conv. layers → class score map → pooling → class scores)
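
A minimal sketch of the FC-to-conv conversion for dense evaluation, using the channel sizes from slide 4: the first FC layer becomes a 7x7 convolution, the rest become 1x1 convolutions, and the resulting class score map is sum-pooled. The 12x12 feature-map size is an illustrative assumption (what a 384x384 test image would produce after 5 max-pools).

```python
import torch
import torch.nn as nn

# FC-4096 acting on a 7x7x512 feature map becomes a 7x7 conv; the later
# FC layers become 1x1 convs, so the head slides over any input size.
fully_conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),   # was FC-4096
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),  # was FC-4096
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),  # was FC-1000
)

feature_map = torch.randn(1, 512, 12, 12)  # conv features of a 384x384 image
score_map = fully_conv_head(feature_map)   # 1 x 1000 x 6 x 6 class score map
class_scores = score_map.sum(dim=(2, 3))   # sum-pool the map -> 1 x 1000
```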

  9. Implementation
     • Heavily-modified Caffe C++ toolbox
     • Multiple-GPU support:
       • 4 x NVIDIA Titan, off-the-shelf workstation
       • data parallelism for training and testing (sketched below)
       • ~3.75x speed-up; 2-3 weeks for training
     (Figure: an image batch split across the GPUs)
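
The data-parallel scheme can be sketched with PyTorch's built-in `nn.DataParallel` (an assumption; the original used the modified Caffe toolbox): each GPU processes a slice of the image batch.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1000))  # stand-in
if torch.cuda.device_count() > 1:
    # Each forward pass splits the image batch across the available GPUs
    # and gathers the outputs; gradients are reduced onto the main device.
    model = nn.DataParallel(model).cuda()
```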

  10. Comparison – Fixed Training Size
      (Bar chart: top-5 classification error on the validation set for 13-, 16-, and 19-layer nets trained with smallest image side 256 or 384; errors range from ~9.4% down to ~8.7%)
      • 16- or 19-layer nets trained on 384xN images are the best

  11. Comparison – Random Training Size
      (Bar chart: the same comparison with random training scale [256;512] added; the scale-jittered net reaches ~8.0% top-5 error)
      • Training scale jittering is better than fixed scales
      • Before submission: single net, FC-layers tuning

  12. Comparison – Random Training Size
      (Bar chart: with random training scale [256;512] and all-layers tuning, top-5 error drops to roughly 8.2% (13 layers), 7.6% (16 layers), and 7.5% (19 layers))
      • Training scale jittering is better than fixed scales
      • After submission: three nets, all-layers tuning

  13. Final Results
      (Bar chart: top-5 classification error on the test set for the leading ILSVRC-2014 submissions, multiple-net vs. single-net results, ranging from ~12.5% down to 6.7%)
      • 2nd place with 7.3% error
      • combination of 7 models: 6 fixed-scale, 1 multi-scale
      • Single model: 8.4% error

  14. Final Results (Post-Competition)
      (Bar chart: as above, with the post-competition VGG result added at 7.0%)
      • 2nd place with 7.0% error
      • combination of two multi-scale models (16- and 19-layer)
      • Single model: 7.3% error

  15. Localisation
      Our localisation method:
      • builds on very deep classification ConvNets
      • is similar to OverFeat:
        1. Localisation ConvNet predicts a set of bounding boxes
        2. Bounding boxes are merged
        3. Resulting boxes are scored by a classification ConvNet

  16. Localisation (2)
      • Last layer predicts a bbox for each class (see the sketch below):
        • bbox parameterisation: (x, y, w, h)
        • 1000 classes x 4-D per class = 4000-D output
      • Training:
        • Euclidean loss
        • initialised with a classification net
        • fine-tuning of all layers
      (Figure: a 224x224 crop with its predicted bbox)
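
A sketch of the per-class bounding-box regression head described above, with assumed tensor shapes; in the actual training only the ground-truth class's box would contribute to the Euclidean loss.

```python
import torch
import torch.nn as nn

bbox_head = nn.Linear(4096, 1000 * 4)  # one 4-D (x, y, w, h) box per class

fc_features = torch.randn(8, 4096)              # penultimate FC activations
pred = bbox_head(fc_features).view(8, 1000, 4)  # per-class boxes
target = torch.randn(8, 1000, 4)                # dummy ground-truth boxes
loss = nn.MSELoss()(pred, target)               # Euclidean (L2) loss
loss.backward()
```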

  17. Final Results
      (Bar chart: top-5 localisation error on the test set, ranging from 31.9% down to 25.3%)
      • 1st place with 25.3% error
      • combination of 2 localisation models

  18. Summary
      • Excellent results using classical ConvNets:
        • small receptive fields
        • but very deep → lots of non-linearity
      • Depth matters!
      • Details in the arXiv pre-print: arxiv.org/pdf/1409.1556/
      (Chart: VGG team ILSVRC progress: roughly 27% error in 2012, 15.2% in 2013, and 7% in 2014)
      We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
