SLIDE 1

Very Deep ConvNets for Large-Scale Image Recognition

Karen Simonyan, Andrew Zisserman Visual Geometry Group, University of Oxford

ILSVRC Workshop 12 September 2014

SLIDE 2

Summary of VGG Submission

  • Localisation task
  • 1st place, 25.3% error
  • Classification task
  • 2nd place, 7.3% error
  • Key component: very deep ConvNets
  • up to 19 weight layers


SLIDE 3

Effect of Depth

  • How does ConvNet depth affect performance?
  • Comparison of ConvNets
  • same generic design – fair evaluation
  • increasing depth
  • from 11 to 19 weight layers


SLIDE 4

Network Design

Key design choices:

  • 3x3 conv. kernels – very small
  • conv. stride 1 – no loss of information

Other details:

  • Rectification (ReLU) non-linearity
  • 5 max-pool layers (x2 reduction)
  • no normalisation
  • 3 fully-connected (FC) layers

Architecture (the 13-layer configuration shown on the slide):

image → conv-64, conv-64 → maxpool → conv-128, conv-128 → maxpool → conv-256, conv-256 → maxpool → conv-512, conv-512 → maxpool → conv-512, conv-512 → maxpool → FC-4096 → FC-4096 → FC-1000 → softmax
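The layer stack above can be checked with a short plain-Python sketch (not the authors' code, and assuming the convs use padding 1): with 3x3 stride-1 convolutions the spatial size only changes at the five 2x2 max-pools, so a 224x224 input reaches the FC layers as a 7x7x512 feature map.

```python
# Sketch of the 13-layer configuration: 3x3/stride-1 convs (padding 1 assumed,
# so spatial size is preserved) and five 2x2 max-pools, each halving resolution.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]

def trace(cfg, size=224):
    """Return (conv weight-layer count, final feature-map size, final channels)."""
    layers, channels = 0, 3
    for v in cfg:
        if v == "M":
            size //= 2          # 2x2 max-pool, stride 2
        else:
            layers += 1         # 3x3 conv, stride 1, pad 1: size unchanged
            channels = v
    return layers, size, channels

conv_layers, size, ch = trace(cfg)
total_weight_layers = conv_layers + 3   # plus FC-4096, FC-4096, FC-1000
print(conv_layers, size, ch, total_weight_layers)  # 10 7 512 13
```

The deeper 16- and 19-layer configurations simply add more conv entries per stage to the same generic design.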

SLIDE 5

Why 3x3 layers?

  • Stacked conv. layers have a large receptive field
  • two 3x3 layers – 5x5 receptive field
  • three 3x3 layers – 7x7 receptive field
  • More non-linearity
  • Fewer parameters to learn
  • ~140M per net


[Figure: a 1st and 2nd stacked 3x3 conv. layer together cover a 5x5 region]
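The receptive-field and parameter claims are easy to verify numerically; a small sketch (C is a channel count chosen for illustration, not a figure from the slides):

```python
def stacked_rf(n, k=3):
    """Receptive field of n stacked kxk stride-1 conv layers: each layer
    grows the field by k-1 pixels on top of the previous one."""
    rf = 1
    for _ in range(n):
        rf += k - 1
    return rf

assert stacked_rf(2) == 5   # two 3x3 layers see a 5x5 region
assert stacked_rf(3) == 7   # three 3x3 layers see a 7x7 region

# Parameter comparison for C input and C output channels (biases ignored):
C = 512
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2 weights, plus two extra ReLUs
one_7x7   = 7 * 7 * C * C         # 49 * C^2 weights, one non-linearity
print(three_3x3 < one_7x7)  # True
```

So the stack has the same receptive field as a single 7x7 layer, but with fewer parameters and more non-linearity.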

SLIDE 6

Training

  • Solver
  • multinomial logistic regression
  • mini-batch gradient descent with momentum
  • dropout and weight decay regularisation
  • fast convergence (74 training epochs)
  • Initialisation
  • deep stacks of ReLU layers are prone to stalling during training
  • the shallowest net (11 layers) uses random Gaussian initialisation
  • deeper nets
  • first 4 conv. layers and the 3 FC layers initialised from the 11-layer net
  • other layers – random Gaussian

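A minimal sketch of the initialisation scheme, with hypothetical layer names and placeholder 3x3 weight shapes (the real parameters are 4-D conv tensors and large FC matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_init(shape, std=1e-2):
    # random zero-mean Gaussian initialisation with a small std
    return rng.normal(0.0, std, size=shape)

# Hypothetical layer names: 8 conv + 3 FC = the 11-layer net.
shallow_names = ["conv1", "conv2", "conv3", "conv4",
                 "conv5", "conv6", "conv7", "conv8",
                 "fc6", "fc7", "fc8"]
shallow = {n: gaussian_init((3, 3)) for n in shallow_names}  # stands in for the trained net

def init_deeper(deep_names, shallow):
    """Copy the first 4 conv layers and the 3 FC layers from the trained
    11-layer net; give every other layer fresh random Gaussian weights."""
    transfer = {"conv1", "conv2", "conv3", "conv4", "fc6", "fc7", "fc8"}
    return {n: shallow[n].copy() if n in transfer else gaussian_init((3, 3))
            for n in deep_names}

# A deeper configuration reuses the shallow net's names plus extra conv layers.
deep = init_deeper(shallow_names + ["conv9", "conv10"], shallow)
```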

SLIDE 7

Training (2)

  • Multi-scale training
  • randomly-cropped ConvNet input
  • fixed-size 224x224
  • different training image size
  • 256xN
  • 384xN
  • [256;512]xN – random image size (scale jittering)

  • Standard jittering
  • random horizontal flips
  • random RGB shift

[Figure: 224x224 crops taken from 256xN (N≥256) and 384xN (N≥384) training images]
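A dependency-free sketch of the sampling procedure (nearest-neighbour indexing stands in for proper rescaling, and the random RGB shift is omitted): rescale so the smallest side is a random S in [256, 512], then take a random 224x224 crop with an optional flip.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_crop(image, s_min=256, s_max=512, crop=224):
    """Scale jittering: rescale so the smallest side is a random S in
    [s_min, s_max], then take a random crop and maybe flip it."""
    h, w = image.shape[:2]
    s = rng.integers(s_min, s_max + 1)          # random smallest side
    scale = s / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbour resize, just to keep the sketch dependency-free
    ys = np.arange(new_h) * h // new_h
    xs = np.arange(new_w) * w // new_w
    resized = image[ys][:, xs]
    top = rng.integers(0, new_h - crop + 1)
    left = rng.integers(0, new_w - crop + 1)
    out = resized[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                      # random horizontal flip
        out = out[:, ::-1]
    return out

patch = sample_training_crop(np.zeros((300, 400, 3)))
print(patch.shape)  # (224, 224, 3)
```

Fixing `s_min == s_max` at 256 or 384 recovers the single-scale training regimes.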

SLIDE 8

Testing

  • Dense application over the whole image
  • FC layers converted to conv. layers
  • sum-pooling of class score maps
  • more efficient than applying the net to multiple crops

  • Jittering
  • multiple image sizes: 256xN, 384xN, etc.
  • horizontal flips
  • class scores averaged

[Figure: image → conv. layers → class score map → pooling → class scores]
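A sketch of the test-time pooling, assuming the fully-convolutional net (FC-4096 over a 7x7x512 map becomes a 7x7 conv with 4096 filters) has already produced a class score map per jittered version; average pooling is used here, which differs from sum-pooling only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in score maps of shape (classes, H', W'), one per test scale / flip.
# Larger input images yield larger spatial score maps.
score_maps = [rng.normal(size=(1000, h, w)) for (h, w) in [(2, 2), (4, 3)]]

def pool_scores(score_map):
    """Pool the spatial class score map down to one score per class."""
    return score_map.mean(axis=(1, 2))

# Average the pooled scores over the jittered versions (scales, flips).
final = np.mean([pool_scores(m) for m in score_maps], axis=0)
print(final.shape)  # (1000,)
```

This evaluates the net once per image size instead of once per crop, which is what makes dense application efficient.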

SLIDE 9

Implementation

  • Heavily-modified Caffe C++ toolbox
  • Multiple GPU support
  • 4 x NVIDIA Titan, off-the-shelf workstation
  • data parallelism for training and testing
  • ~3.75 times speed-up, 2-3 weeks for training

[Figure: an image batch split across the GPUs (data parallelism)]

SLIDE 10

Comparison – Fixed Training Size

  • The 16- and 19-layer nets trained on 384xN images perform best

Top-5 classification error, val. set (lower is better):

training scale (smallest side)   13 layers   16 layers   19 layers
256                              9.4         8.8         9.0
384                              9.3         8.7         8.7

SLIDE 11

Comparison – Random Training Size

  • Training scale jittering is better than fixed scales
  • Before submission: single net, FC-layers tuning

Top-5 classification error, val. set (lower is better):

training scale (smallest side)   13 layers   16 layers   19 layers
256                              9.4         8.8         9.0
384                              9.3         8.7         8.7

Scale-jittered [256;512] training (single net, FC-layers tuning): 8.0

SLIDE 12

Comparison – Random Training Size

  • Training scale jittering is better than fixed scales
  • After submission: three nets, all-layers tuning

Top-5 classification error, val. set (lower is better):

training scale (smallest side)   13 layers   16 layers   19 layers
256                              9.4         8.8         9.0
384                              9.3         8.7         8.7
[256;512]                        8.2         7.6         7.5

SLIDE 13

Final Results

  • 2nd place with 7.3% error
  • combination of 7 models: 6 fixed-scale, 1 multi-scale
  • Single model: 8.4% error

[Chart: top-5 classification error on the test set for the leading entries, multiple nets vs. single net (lower is better); VGG: 7.3 / 8.4]

SLIDE 14

Final Results (Post-Competition)

[Chart: top-5 classification error on the test set, multiple nets vs. single net (lower is better); VGG post-competition: 7.0 / 7.3]

  • 2nd place with 7.0% error
  • combination of two multi-scale models (16- and 19-layer)
  • Single model: 7.3% error


SLIDE 15

Localisation

Our localisation method

  • Builds on very deep classification ConvNets
  • Similar to OverFeat

  1. Localisation ConvNet predicts a set of bounding boxes
  2. Bounding boxes are merged
  3. Resulting boxes are scored by a classification ConvNet


SLIDE 16

Localisation (2)

  • Last layer predicts a bbox for each class
  • Bbox parameterisation: (x,y,w,h)
  • 1000 classes x 4-D / class = 4000-D
  • Training
  • Euclidean loss
  • initialised with a classification net
  • fine-tuning of all layers

[Figure: a 224x224 crop with its predicted bounding box]
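The per-class regression can be sketched as follows (a hypothetical helper, not the authors' code): the net emits a 4000-D vector, and the Euclidean loss is taken on the 4 coordinates belonging to the ground-truth class.

```python
import numpy as np

def bbox_loss(pred_4000d, true_bbox, class_idx):
    """Euclidean (L2) loss on the (x, y, w, h) predicted for the ground-truth
    class; the boxes predicted for the other 999 classes are ignored."""
    pred = pred_4000d[4 * class_idx: 4 * class_idx + 4]
    return float(np.sum((pred - true_bbox) ** 2))

# Toy example: a perfect prediction for class 10 gives zero loss.
pred = np.zeros(4000)
pred[40:44] = [0.5, 0.5, 0.2, 0.3]
print(bbox_loss(pred, np.array([0.5, 0.5, 0.2, 0.3]), class_idx=10))  # 0.0
```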

SLIDE 17

Final Results

  • 1st place with 25.3% error
  • combination of 2 localisation models

[Chart: top-5 localisation error on the test set for the leading entries (lower is better); VGG: 25.3]

SLIDE 18

Summary

  • Excellent results using classical ConvNets
  • small receptive fields
  • but very deep → lots of non-linearity
  • Depth matters!
  • Details in the arXiv pre-print: arxiv.org/pdf/1409.1556/

VGG Team ILSVRC Progress

Top-5 classification error by year: 2012: 27.0, 2013: 15.2, 2014: 7.0

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.