Deep Residual Learning for Image Recognition (ILSVRC 2015 and MS COCO 2015 winner)



SLIDE 1

Deep Residual Learning for Image Recognition
K. He, X. Zhang, S. Ren and J. Sun (Microsoft Research)
Winner of ILSVRC 2015 and MS COCO 2015

Article overview by Ilya Kuzovkin
Computational Neuroscience Seminar, University of Tartu, 2016

SLIDE 2

THE IDEA

SLIDE 3

ImageNet: 1000 classes

SLIDES 4-11

  • 2012: 8 layers, 15.31% error
  • 2013: 9 layers, 2x params, 11.74% error
  • 2014: 19 layers, 7.41% error
  • 2015: ?

Is learning better networks as easy as stacking more layers?

  • Vanishing / exploding gradients: largely addressed by normalized initialization and intermediate normalization
  • Degradation problem

SLIDES 12-13

Degradation problem: “with the network depth increasing, accuracy gets saturated”

Not caused by overfitting: the deeper network also shows higher training error.

SLIDES 14-21

A thought experiment:

  • Train a shallow network (Conv → Conv → Conv → Conv) and test it: accuracy X%.
  • Construct a deeper network from it by stacking identity layers on top (Conv ×4 + Identity ×4) and test it: same performance, X%, by construction.
  • Now train the deeper counterpart (Conv ×8) from scratch and test it: worse!

“Our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)”

“Solvers might have difficulties in approximating identity mappings by multiple nonlinear layers”
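To make the constructed solution concrete, here is a minimal sketch (assuming PyTorch; the toy layer sizes are invented for illustration) showing that a deeper network built from a shallow one plus identity layers computes exactly the same function:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Stand-in for the trained shallow Conv stack.
    shallow = nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
    )

    # Constructed deeper network: the same layers plus identity mappings.
    deeper = nn.Sequential(shallow, nn.Identity(), nn.Identity())

    x = torch.randn(1, 3, 32, 32)
    # By construction the deeper network matches the shallow one exactly,
    # so a solution at least this good exists; yet solvers trained from
    # scratch fail to find it.
    assert torch.equal(shallow(x), deeper(x))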

SLIDES 22-24

Add explicit identity connections and “solvers may simply drive the weights of the multiple nonlinear layers toward zero”.

H(x) is the true function we want to learn. Let's pretend we want to learn F(x) = H(x) - x instead. The original function is then H(x) = F(x) + x.

SLIDES 26-27

Network can decide how deep it needs to be… “The identity connections introduce neither extra parameter nor computation complexity”
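This is easy to check with the ResidualBlock sketched above: the shortcut is a plain addition, so the residual block has exactly as many parameters as its stack F alone:

    # The identity shortcut adds no parameters: the block and its plain
    # stack F have identical parameter counts.
    block = ResidualBlock(64)
    n_block = sum(p.numel() for p in block.parameters())
    n_plain = sum(p.numel() for p in block.f.parameters())
    assert n_block == n_plain  # 73,984 parameters either way for 64 channels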

SLIDES 28-29

  • 2012: 8 layers, 15.31% error
  • 2013: 9 layers, 2x params, 11.74% error
  • 2014: 19 layers, 7.41% error
  • 2015: 152 layers, 3.57% error

SLIDE 30

EXPERIMENTS AND DETAILS

SLIDES 31-33

  • Lots of 3x3 convolutional layers
  • VGG complexity is 19.6 billion FLOPs; the 34-layer ResNet is 3.6 billion FLOPs
  • Batch normalization
  • SGD with batch size 256 (optimizer settings are sketched below)
  • Up to 600,000 iterations
  • Learning rate 0.1, divided by 10 when the error plateaus
  • Momentum 0.9
  • No dropout
  • Weight decay 0.0001
  • 1.28 million training images, 50,000 validation images, 100,000 test images
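A sketch of those optimizer settings (assuming PyTorch and torchvision; ReduceLROnPlateau is one plausible way to implement “divide the learning rate by 10 when the error plateaus”, not necessarily the authors' exact schedule):

    import torch
    from torchvision.models import resnet34

    model = resnet34()  # torchvision's 34-layer ResNet (1000 classes)

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,             # initial learning rate
        momentum=0.9,
        weight_decay=1e-4,  # weight decay 0.0001; no dropout is used
    )

    # Divide the learning rate by 10 when the validation error plateaus.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1
    )
    # Training loop (not shown): after each validation pass,
    # call scheduler.step(val_error).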
SLIDES 34-36

  • The 34-layer ResNet has lower training error than the 18-layer one. This indicates that the degradation problem is well addressed and accuracy gains are obtained from increased depth.
  • The 34-layer ResNet reduces the top-1 error by 3.5% compared to its plain counterpart.
  • The 18-layer ResNet converges faster than its plain counterpart: ResNet eases the optimization by providing faster convergence at the early stage.

SLIDE 37

GOING DEEPER

SLIDE 38

Due to time complexity, the usual building block is replaced by the bottleneck block (sketched below). The 50/101/152-layer ResNets are built from these blocks.
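A sketch of the bottleneck design (assuming PyTorch; Bottleneck and the channel sizes follow the paper's 256 → 64 → 64 → 256 example): a 1x1 convolution reduces the channel count, the 3x3 convolution operates on the reduced representation, and a final 1x1 convolution restores the dimensions, with the identity shortcut around all three:

    import torch
    import torch.nn as nn

    class Bottleneck(nn.Module):
        """Bottleneck residual block: 1x1 reduce, 3x3, 1x1 expand."""

        def __init__(self, channels, reduced):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(channels, reduced, 1, bias=False),  # 1x1 reduce
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
                nn.Conv2d(reduced, reduced, 3, padding=1, bias=False),
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
                nn.Conv2d(reduced, channels, 1, bias=False),  # 1x1 expand
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.f(x) + x)

    # As in the 50/101/152-layer ResNets: 256 -> 64 -> 64 -> 256 channels.
    block = Bottleneck(256, 64)
    y = block(torch.randn(1, 256, 56, 56))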

SLIDE 41

ANALYSIS ON CIFAR-10

SLIDE 44

  • ImageNet Classification 2015: 1st place, 3.57% error
  • ImageNet Object Detection 2015: 1st place, 194 of 200 categories won
  • ImageNet Object Localization 2015: 1st place, 9.02% error
  • COCO Detection 2015: 1st place, 37.3%
  • COCO Segmentation 2015: 1st place, 28.2%

http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf