SLIDE 1

Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-scale Image Classification

Yuting Zhang, Kibok Lee, Honglak Lee

University of Michigan, Ann Arbor

SLIDE 2

Yuting Zhang, Kibok Lee, Honglak Lee

Unsupervised and supervised deep learning

  • Deep feature representations can be learned in supervised and unsupervised manners.

§ Supervised objectives learn from the correspondence between the data and label spaces.
§ Unsupervised objectives learn from the data space itself.

  • Supervised deep learning

§ Deep neural networks, convolutional neural networks, recurrent neural networks, …
§ Task-specific; requires large amounts of supervision

  • Unsupervised deep learning

§ Stacked autoencoders, deep belief networks, deep Boltzmann machines, …
§ Preserves input information and can leverage large amounts of unlabeled data, but may be suboptimal for supervised tasks.
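As a toy illustration of a purely unsupervised objective (our own sketch, not a model from the slides): a tied-weight linear autoencoder trained by gradient descent on its reconstruction loss, using unlabeled data only.

```python
import numpy as np

# Minimal unsupervised objective: reconstruct the input itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))            # unlabeled data: 100 samples, 8 dims

W = rng.normal(scale=0.1, size=(8, 4))   # encoder weights (decoder is W.T, tied)
lr = 0.01
losses = []
for _ in range(1000):
    H = X @ W                            # encode to a 4-dim code
    E = H @ W.T - X                      # reconstruction error
    losses.append((E ** 2).mean())       # L2 reconstruction loss
    # gradient of the mean squared reconstruction loss w.r.t. the tied weights
    W -= lr * 2 * (X.T @ E @ W + E.T @ X @ W) / X.size
```

No labels appear anywhere in the loop; the training signal comes entirely from the data space.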

SLIDE 3

Unsupervised and supervised deep learning

  • Historically, unsupervised learning (e.g., SAE) was used as a pretraining step to improve, and even enable, the supervised training of deep networks.

  • However, such pretraining becomes unnecessary when the deep network is properly initialized and large amounts of labeled data are available.

§ E.g., large-scale convolutional neural networks: AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan and Zisserman, 2015), GoogLeNet (Szegedy et al., 2015), etc.

  • As a result, unsupervised deep learning has been overshadowed by supervised methods.

SLIDE 4

Revisiting the importance of unsupervised learning

  • Pretraining: Unsupervised → Supervised

SLIDE 5

Revisiting the importance of unsupervised learning

  • Combination: Unsupervised (reconstruction) + Supervised (classification)

  • Previous work:

§ Autoencoders: Ranzato & Szummer (2008); Larochelle et al. (2009)
§ (Restricted) Boltzmann machines: Larochelle & Bengio (2008); Goodfellow et al. (2013); Sohn et al. (2013)
§ Dictionary learning: Boureau et al. (2010); Mairal et al. (2010)
§ Ladder network: Rasmus et al. (2015), with layer-wise skip links & pathway combinators
§ Stacked “what-where” AE (SWWAE): Zhao et al. (2015), using unpooling switches (Zeiler and Fergus, 2009)

  • Promising for improving classification performance, but have not been shown to be beneficial for large-scale supervised deep neural nets.


SLIDE 7

Outline

1. The invertibility of large-scale image classification networks
2. Large-scale image classification networks with stronger invertibility

SLIDE 8

Invertibility of deep convolutional neural networks

SLIDE 9

A typical classification network (VGGNet)

[Figure: VGGNet is built from blocks of one or more convolutional layers plus a max-pooling layer (e.g., conv3_1, conv3_2, conv3_3, pool3)]

SLIDE 10

Inducing an autoencoder from a classification network (VGGNet, pool5)

[Figure: three ways to attach a decoding pathway. (a) SAE-first: stacked architecture, reconstruction loss at the first layer; (b) SAE-all: stacked architecture, reconstruction loss at all layers; (c) SAE-layerwise: layer-wise architecture]

SLIDE 11

Training a decoding pathway for a classification network (VGGNet, pool5)

[Figure: the three decoder variants, (a) SAE-first, (b) SAE-all, (c) SAE-layerwise; the classification (encoding) pathway is fixed while the decoding pathway is learnable]

SLIDE 12

Micro-architectures for decoders

  • Use “Unpooling” to approximately invert the pooling operation

[Figure: encoder block conv3_1, conv3_2, conv3_3, pool3 mirrored by decoder block dec:pool3, dec:conv3_3, dec:conv3_2, dec:conv3_1]

SLIDE 13

Micro-architectures for decoders (Unpooling with fixed switches, ordinary SAE)

  • One can use the ordinary stacked autoencoder (SAE).

§ Related work: Dosovitskiy, A. and Brox, T., “Inverting visual representations with convolutional networks”, CVPR 2016.

[Figure: unpooling with fixed switches (upsampling); each pooled value is placed at a fixed position in its unpooled region, so spatial detail is lost]
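The fixed-switch unpooling on this slide can be sketched in NumPy; the function names and the 4×4 example are ours.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling over a 2D feature map (H and W assumed even)."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def unpool_fixed(p):
    """Unpooling with fixed switches: each pooled value goes to the
    top-left corner of its 2x2 region; the rest is filled with zeros."""
    h, w = p.shape
    out = np.zeros((h * 2, w * 2), dtype=p.dtype)
    out[::2, ::2] = p
    return out

x = np.array([[4., 6., 5., 1.],
              [3., 2., 0., 0.],
              [4., 7., 5., 1.],
              [0., 0., 0., 0.]])
p = max_pool_2x2(x)     # the per-block maxima
r = unpool_fixed(p)     # maxima restored, but at fixed (often wrong) positions
```

The values survive, but their positions inside each pooling region do not, which is exactly the spatial information loss the next slide addresses.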

SLIDE 14

Micro-architectures for decoders (Unpooling with known switches, SWWAE)

  • We can also use stacked “what-where” autoencoders (SWWAE).

§ Unpooling with the known switches transferred from the encoder.
§ More accurate inversion, since spatial details are recovered better.

[Figure: unpooling with known switches; the pooling switches recorded in the encoder (SWWAE only) route each value back to its original position]
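A sketch of the “what-where” variant: pooling records the argmax (“switch”) positions, and unpooling routes each value back to where it came from. Function names are ours.

```python
import numpy as np

def max_pool_with_switches(x):
    """2x2 max-pooling that also records the argmax ('switch') positions."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)       # -> (h/2, w/2, 2, 2)
               .reshape(h // 2, w // 2, 4)) # flatten each 2x2 block
    return blocks.max(axis=2), blocks.argmax(axis=2)

def unpool_known(p, switches):
    """Unpooling that places each value back at its recorded position."""
    h, w = p.shape
    out = np.zeros((h * 2, w * 2), dtype=p.dtype)
    rows, cols = np.indices((h, w))
    # switch k encodes the offset (k // 2, k % 2) inside the 2x2 block
    out[rows * 2 + switches // 2, cols * 2 + switches % 2] = p
    return out

x = np.array([[4., 6., 5., 1.],
              [3., 2., 0., 0.],
              [4., 7., 5., 1.],
              [0., 0., 0., 0.]])
p, s = max_pool_with_switches(x)
r = unpool_known(p, s)   # each max value returns to its original position
```

Compared with fixed switches, every non-zero entry of the unpooled map now coincides with the position it occupied in the input, which is why SWWAE reconstructions keep spatial detail.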

SLIDE 15

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from one layer by SAE, Dosovitskiy & Brox (2015), and SWWAE]
SLIDE 16

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from pool1 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 17

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from pool2 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 18

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from pool3 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 19

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from pool4 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 20

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from pool5 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 21

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from fc6 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 22

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from fc7 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 23

Reconstruction from different layers (AlexNet)

[Figure: input image and its reconstruction from fc8 by SAE, Dosovitskiy & Brox (2015), and SWWAE]

SLIDE 24

Reconstruction via SAE decoders

[Figure: reconstructions from image, pool1, pool2, conv3, conv4, pool5, fc6, fc7, and fc8, comparing SAE, SWWAE-first (known unpooling switches), and Dosovitskiy & Brox (2016)]

The network is less invertible for higher layers, so deeper representations preserve less input information.

  • Two possible sources of information loss

§ Convolutional filters and non-linearity (transformation)
§ Max-pooling (spatial invariance)

  • They are mixed in the SAE reconstruction results.
SLIDE 25

Reconstruction via SAE and SWWAE decoders

  • Using the encoder pooling switches for unpooling, the information loss due to max-pooling can be better recovered.

  • The extremely good reconstruction quality of SWWAE indicates that the “convolutional filters + ReLU” cause very minor information loss.

[Figure: reconstructions from image, pool1, pool2, conv3, conv4, pool5, fc6, fc7, and fc8, comparing SAE, SWWAE-first (known unpooling switches), and Dosovitskiy & Brox (2016)]

SLIDE 26

Reconstruction for 16-layer VGGNet

[Figure: SAE and SWWAE reconstructions from image, pool1, pool2, pool3, pool4, and pool5]


SLIDE 29

Observations from reconstruction

Operator                     | Effect                 | Information loss
Convolutional filters + ReLU | Feature transformation | Minor
Max-pooling                  | Spatial invariance     | Significant

SLIDE 30

Hypotheses from reconstruction

Operator                     | Effect                 | Information loss
Convolutional filters + ReLU | Feature transformation | The less, the better

  • The invertibility is important, and potentially helpful, for the convolutional filters in a deep classification network.

SLIDE 31

Hypotheses from reconstruction

Operator                     | Effect                 | Information loss
Convolutional filters + ReLU | Feature transformation | The less, the better

Aim to improve the classification network

SLIDE 32

Classification networks with stronger invertibility

SLIDE 33

Classification networks with stronger invertibility

  • Given a classification network

§ We take the 16-layer VGGNet as the baseline model.

[Figure: decoder variants. (a) SAE-first: stacked architecture, reconstruction loss at the first layer; (b) SAE-all: stacked architecture, reconstruction loss at all layers; (c) SAE-layerwise: layer-wise architecture]

SLIDE 34

Classification networks with stronger invertibility

  • Augmenting the classification network with a decoding pathway

§ starting from the last convolutional layer (pool5 in VGGNet)

  • Multi-task learning using both classification and reconstruction objectives.

[Figure: encoder (VGGNet) with an attached decoding pathway; variants (a) SAE-first, (b) SAE-all, (c) SAE-layerwise]
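The multi-task objective can be sketched as a weighted sum of the two losses. The function names and the balancing weight `lam` are our own illustration, not the paper's notation.

```python
import numpy as np

def softmax_xent(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    z = logits - logits.max()
    return np.log(np.exp(z).sum()) - z[label]

def joint_loss(logits, label, x, x_hat, lam=1.0):
    """Multi-task objective: classification loss plus lam * L2 reconstruction loss."""
    return softmax_xent(logits, label) + lam * ((x_hat - x) ** 2).mean()

logits = np.array([2.0, -1.0, 0.5])  # toy classifier outputs
x = np.ones(4)                       # toy "image"
loss = joint_loss(logits, 0, x, x_hat=x * 0.9)
```

With a perfect reconstruction the objective reduces to the classification loss alone; SAE-first uses a single reconstruction term at the image, while SAE-all sums such terms over several layers.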

SLIDE 35

Training procedure

  • Step 1: Initialize the classification network with pretrained weights.
  • Step 2: Train the decoder while fixing the classification network.

§ A very deep network is hard to train directly from random initialization.

[Figure: decoder variants (a) SAE-first, (b) SAE-all, (c) SAE-layerwise]

SLIDE 36

Model variant: SAE/SWWAE-layerwise

  • Step 2: Train “layerwise” decoding pathways from random initialization.

[Figure: the SAE-layerwise decoder, with a separate layer-wise decoding pathway per encoder block, alongside the stacked variants (a) SAE-first and (b) SAE-all]

SLIDE 37

Model variant: SAE/SWWAE-first

  • Step 2: Train “layerwise” decoding pathways from random initialization.
  • Step 3: Train the top-down decoding pathway, which is initialized in Step 2.

§ The reconstruction loss is only at the “first” layer.

[Figure: decoder variants (a) SAE-first, (b) SAE-all, (c) SAE-layerwise]

SLIDE 38

Training procedure

  • Step 2: Train “layerwise” decoding pathways from random initialization.
  • Step 3: Train the top-down decoding pathway, which is initialized in Step 2.
  • Step 4: Finetune the entire augmented network together.

[Figure: decoder variants (a) SAE-first, (b) SAE-all, (c) SAE-layerwise]

Mini-batch SGD for all steps
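The staged procedure can be sketched on a toy linear encoder/decoder (reconstruction term only; in the paper, Step 4 also keeps the classification loss, and all steps use mini-batch SGD). All names and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))              # toy inputs standing in for images
We = rng.normal(scale=0.5, size=(6, 3))   # Step 1: "pretrained" encoder (toy stand-in)
Wd = rng.normal(scale=0.1, size=(3, 6))   # decoder, randomly initialized

def rec_loss():
    return ((X @ We @ Wd - X) ** 2).mean()

l_init = rec_loss()

# Step 2: train the decoder by gradient steps while the encoder stays fixed
lr = 0.05
for _ in range(300):
    H = X @ We
    E = H @ Wd - X
    Wd -= lr * 2 * H.T @ E / X.size
l_step2 = rec_loss()

# Step 4 (sketch): finetune encoder and decoder together
for _ in range(300):
    H = X @ We
    E = H @ Wd - X
    gWd = 2 * H.T @ E / X.size
    gWe = 2 * X.T @ E @ Wd.T / X.size
    Wd -= lr * gWd
    We -= lr * gWe
l_step4 = rec_loss()
```

Training the decoder against a frozen encoder first gives the joint finetuning stage a sensible starting point, mirroring why the paper avoids training the full augmented network from scratch.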

SLIDE 39

Model variant: SAE/SWWAE-all

  • Every layer can have its own reconstruction loss.

§ Decoder layers can better correspond to encoder layers.
§ Intermediate layers can get more training signals.

[Figure: decoder variants (a) SAE-first, (b) SAE-all, (c) SAE-layerwise]
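The SAE-all objective, a reconstruction loss at every matched layer, can be sketched as a sum over layer pairs; the per-layer `weights` are illustrative balance factors of our choosing.

```python
import numpy as np

def sae_all_loss(encoder_acts, decoder_acts, weights=None):
    """SAE-all style objective: sum an L2 reconstruction loss over every
    matched encoder/decoder layer (image, pool1, pool2, ...)."""
    if weights is None:
        weights = [1.0] * len(encoder_acts)
    return sum(w * ((d - e) ** 2).mean()
               for w, e, d in zip(weights, encoder_acts, decoder_acts))

enc = [np.ones((2, 2)), np.zeros(3)]   # toy activations for two layers
dec = [np.zeros((2, 2)), np.zeros(3)]  # toy decoder outputs
loss = sae_all_loss(enc, dec)
```

SAE-first is the special case where only the image-level term is kept.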

SLIDE 40

Evaluations on ImageNet ILSVRC 2012

  • Baseline model: 16-Layer VGGNet
  • Augmented models: SAE/SWWAE - first/all/layerwise (6 in total)
  • Testing protocol

§ Rescale the shorter edge to 256 px.
§ “Single crop” scheme: a 224×224 patch at the center (clean results without postprocessing).
§ “Convolution” scheme: apply the whole VGGNet as a convolutional operator (more practical results).
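The single-crop protocol (rescale the shorter edge to 256 px, then take the center 224×224 patch) can be sketched as follows; our nearest-neighbor resize is a simple stand-in for the usual bilinear resize.

```python
import numpy as np

def resize_shorter_edge(img, target=256):
    """Nearest-neighbor rescale so the shorter edge equals `target`."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    ri = (np.arange(nh) * h / nh).astype(int)   # source row per output row
    ci = (np.arange(nw) * w / nw).astype(int)   # source col per output col
    return img[ri][:, ci]

def center_crop(img, size=224):
    """Take the central size x size patch."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

img = np.zeros((300, 400, 3), dtype=np.uint8)   # toy HxWx3 image
crop = center_crop(resize_shorter_edge(img))
```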
SLIDE 41

Validation errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model  | Top-1 | Top-5
VGGNet | 29.05 | 10.07

SLIDE 42

Validation errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model           | Top-1 | Top-5
VGGNet          | 29.05 | 10.07
+ SAE-first     | 27.70 |  9.28
+ SAE-all       | 27.54 |  9.17
+ SAE-layerwise | 27.60 |  9.19

The augmented models get lower errors.

SLIDE 43

Validation errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model           | Top-1 | Top-5
VGGNet          | 29.05 | 10.07
+ SAE-first     | 27.70 |  9.28
+ SAE-all       | 27.54 |  9.17
+ SAE-layerwise | 27.60 |  9.19

Layer-wise reconstruction loss is helpful.

SLIDE 44

Validation errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model             | Top-1 | Top-5
VGGNet            | 29.05 | 10.07
+ SAE-first       | 27.70 |  9.28
+ SAE-all         | 27.54 |  9.17
+ SAE-layerwise   | 27.60 |  9.19
+ SWWAE-first     | 27.60 |  9.23
+ SWWAE-all       | 27.39 |  9.06
+ SWWAE-layerwise | 27.53 |  9.10

Even lower errors; layer-wise reconstruction loss is helpful.

SLIDE 45

Validation errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model             | Top-1 | Top-5
VGGNet            | 29.05 | 10.07
+ SAE-first       | 27.70 |  9.28
+ SAE-all         | 27.54 |  9.17
+ SAE-layerwise   | 27.60 |  9.19
+ SWWAE-first     | 27.60 |  9.23
+ SWWAE-all       | 27.39 |  9.06
+ SWWAE-layerwise | 27.53 |  9.10

Layer-wise reconstruction loss is helpful. SWWAE performs slightly better than ordinary SAE.

SLIDE 46

Validation errors on ImageNet ILSVRC 2012

Model             | Single crop Top-1 | Top-5 | Convolution Top-1 | Top-5
VGGNet            | 29.05 | 10.07 | 26.97 | 8.94
+ SAE-first       | 27.70 |  9.28 | 26.09 | 8.30
+ SAE-all         | 27.54 |  9.17 | 26.10 | 8.21
+ SAE-layerwise   | 27.60 |  9.19 | 26.06 | 8.17
+ SWWAE-first     | 27.60 |  9.23 | 25.87 | 8.14
+ SWWAE-all       | 27.39 |  9.06 | 25.79 | 8.13
+ SWWAE-layerwise | 27.53 |  9.10 | 25.97 | 8.20

SLIDE 47

Training errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model             | Top-1 | Top-5
VGGNet            | 17.43 |  4.02
+ SAE-first       | 15.36 |  3.13
+ SAE-all         | 15.64 |  3.23
+ SAE-layerwise   | 16.20 |  3.42
+ SWWAE-first     | 15.10 |  3.08
+ SWWAE-all       | 15.67 |  3.24
+ SWWAE-layerwise | 15.42 |  3.32

SLIDE 48

Training errors on ImageNet ILSVRC 2012 (single-crop sampling)

Model             | Top-1 | Top-5
VGGNet            | 17.43 |  4.02
+ SAE-first       | 15.36 |  3.13
+ SAE-all         | 15.64 |  3.23
+ SAE-layerwise   | 16.20 |  3.42
+ SWWAE-first     | 15.10 |  3.08
+ SWWAE-all       | 15.67 |  3.24
+ SWWAE-layerwise | 15.42 |  3.32

The augmented models get lower training errors: the unsupervised objectives help with the optimization of the supervised objective.

SLIDE 49

Training vs. validation errors on ImageNet ILSVRC 2012

Model         | Training Top-1 | Top-5 | Validation Top-1 | Top-5
+ SAE-first   | 15.36 | 3.13 | 26.09 | 8.30
+ SAE-all     | 15.64 | 3.23 | 26.10 | 8.21
+ SWWAE-first | 15.10 | 3.08 | 25.87 | 8.14
+ SWWAE-all   | 15.67 | 3.24 | 25.79 | 8.13

SLIDE 50

Training vs. validation errors on ImageNet ILSVRC 2012

Model         | Training Top-1 | Top-5 | Validation Top-1 | Top-5
+ SAE-first   | 15.36 | 3.13 | 26.09 | 8.30
+ SAE-all     | 15.64 | 3.23 | 26.10 | 8.21
+ SWWAE-first | 15.10 | 3.08 | 25.87 | 8.14
+ SWWAE-all   | 15.67 | 3.24 | 25.79 | 8.13

Compared to SAE/SWWAE-first, SAE/SWWAE-all has:
  • higher training errors
  • lower validation errors

Layer-wise reconstruction loss has a regularization effect.

SLIDE 51

Conclusions

  • A simple and effective way to incorporate unsupervised objectives into large-scale classification network learning.

  • The resultant autoencoder can reconstruct images with extremely high quality from deep representations.

  • We improved the image classification performance of the 16-layer VGGNet, a strong baseline model, by a noticeable margin.

  • We hope this paper will inspire further investigation into the use of unsupervised learning in large-scale settings.

SLIDE 52

Thank you!

Full version: arxiv.org/abs/1606.06582
Code (GitHub): bit.ly/cnn-dec