
23rd International Conference on MultiMedia Modeling (MMM 2017) On the Exploration of Convolutional Fusion Networks for Visual Recognition Yu Liu, Yanming Guo, and Michael S. Lew Leiden Institute of Advanced Computer Science, Leiden University


SLIDE 1

Discover the world at Leiden University

On the Exploration of Convolutional Fusion

Networks for Visual Recognition

Leiden Institute of Advanced Computer Science, Leiden University Presenter: Yu Liu Yu Liu, Yanming Guo, and Michael S. Lew

23rd International Conference on MultiMedia Modeling (MMM 2017)

SLIDE 2

Outline

  • Introduction
  • Convolutional neural networks (CNN)
  • The usage of intermediate layers
  • Multi-layer fusion
  • Motivation
  • How to develop an efficient multi-layer fusion network
  • Our approach
  • Convolutional fusion networks (CFN)
  • Results
  • Image-level and pixel-level classification
  • Conclusions
SLIDE 3

Outline

  • Introduction
  • Convolutional neural networks (CNN)
  • The usage of intermediate layers
  • Multi-layer fusion
  • Motivation
  • How to develop an efficient multi-layer fusion network
  • Our approach
  • Convolutional fusion networks (CFN)
  • Results
  • Image-level and pixel-level classification
  • Conclusions
SLIDE 4

Introduction: CNNs

  • A plain CNN

[Figure: plain CNN pipeline: Conv 1 → Pooling → Conv 2 → Pooling → … → Conv S-1 → Pooling → Conv S → 1×1 Conv → GAP → FC → Prediction]

Conv: convolutional layer. Pooling: max or average pooling layer. 1×1 Conv: convolution with a 1×1 kernel. GAP: global average pooling. FC: fully-connected layer. This CNN pipeline has become widely used in recent work because it greatly reduces the number of parameters.
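As a minimal sketch of the GAP stage (NumPy, with illustrative sizes that are not from the paper: 64 channels of 8×8 maps, 10 classes), global average pooling collapses each channel map to its mean, so the final FC layer needs far fewer weights than one attached to the flattened maps:

```python
import numpy as np

def global_average_pool(feature_maps):
    """Collapse each H x W channel map to its mean: (C, H, W) -> (C,)."""
    return feature_maps.mean(axis=(1, 2))

# Illustrative sizes (not from the paper): 64 channels of 8x8 maps, 10 classes.
rng = np.random.default_rng(0)
fmap = rng.random((64, 8, 8))
vec = global_average_pool(fmap)         # shape (64,)

# Parameter count of the final FC layer:
fc_params_with_gap = 64 * 10            # FC on the 64-dim GAP vector
fc_params_on_flatten = 64 * 8 * 8 * 10  # FC on the flattened maps instead
print(vec.shape, fc_params_with_gap, fc_params_on_flatten)
```

This is why the GAP-then-FC design reduces parameters: the classifier sees one number per channel instead of one per pixel.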

SLIDE 5

Introduction: CNNs

  • A plain CNN

[Figure: the same plain CNN pipeline: Conv 1 … Conv S with pooling, then 1×1 Conv, GAP, FC, Prediction]

A plain CNN estimates the final prediction from the topmost layer alone. What if useful information in the intermediate layers is lost during forward propagation? Can we develop a fusion architecture that exploits the intermediate layers?

SLIDE 6

Introduction: intermediate layers

  • Apart from fully-connected layers, intermediate convolutional layers can also offer discriminative representations.

[Figure: input image → CNN (feature extractor) → feature encoder (BoW, VLAD, Fisher Vector, etc.) → output vector]

Encoder          Methods
BoW              DeepIndex (ICMR 2015), BLCF (ICMR 2016), MSCE (IJCV 2016)
VLAD             MOP-CNN (ECCV 2014), NetVLAD (CVPR 2016), CCS (MM 2016)
Fisher Vector    DSP (ICCV 2015), MPP (CVPR 2015), FV-CNN (CVPR 2015)
Other encoders   SCFVC (NIPS 2014), SPoC (ICCV 2015), SPLeaP (ECCV 2016)

SLIDE 7

Introduction: multi-layer fusion

  • To integrate the strengths of different layers, aggregate multi-layer activations to generate a richer representation.

References:

1. Lingqiao Liu, Chunhua Shen, Anton van den Hengel. “The treasure beneath convolutional layers: cross convolutional layer pooling for image classification”, CVPR 2015.
2. Ying Li, Xiangwei Kong, Liang Zheng, Qi Tian. “Exploiting Hierarchical Activations of Neural Network for Image Retrieval”, ACM Multimedia 2016.

SLIDE 8

Introduction: multi-layer fusion

  • To integrate the strengths of different layers, aggregate multi-layer activations to generate a richer representation.
  • However, these works use a pre-trained model without improving the training procedure.

SLIDE 9

  • Add new side branches and train them jointly with the full-depth main branch.

Figure from “Yang, S., Ramanan, D.: Multi-scale recognition with DAG-CNNs, ICCV 2015.”

DAG-CNNs

Introduction: multi-layer fusion

SLIDE 10

  • Add new side branches and train them jointly with the full-depth main branch.

Figure from “Yang, S., Ramanan, D.: Multi-scale recognition with DAG-CNNs, ICCV 2015.”

DAG-CNNs

Introduction: multi-layer fusion

  • This approach spends a large number of additional parameters on the side branches (i.e., fully-connected layers).
  • The summation operation ignores the differing importance of the side branches.
SLIDE 11

Outline

  • Introduction
  • Convolutional neural networks (CNN)
  • The usage of intermediate layers
  • Multi-layer fusion
  • Motivation
  • How to develop an efficient multi-layer fusion network
  • Our approach
  • Convolutional fusion networks (CFN)
  • Results
  • Image-level and pixel-level classification
  • Conclusions
SLIDE 12

Motivation

  • Question: how can we build an efficient multi-layer fusion network upon CNNs?
  • Three key issues:
  • Efficiency: add only a few parameters in the side branches.
  • Better fusion module: learn adaptive weights for the different side branches.
  • Accuracy: achieve considerable improvements over a plain CNN.
SLIDE 13

Outline

  • Introduction
  • Convolutional neural networks (CNN)
  • The usage of intermediate layers
  • Multi-layer fusion
  • Motivation
  • How to develop an efficient multi-layer fusion network
  • Our approach
  • Convolutional fusion networks (CFN)
  • Results
  • Image-level and pixel-level classification
  • Conclusion & Discussion
SLIDE 14

Our approach: CFN

Advantage I : Efficient side branches

  • Overall Architecture
SLIDE 15

Efficient side outputs

  • 1. Creating the side branches from the pooling layers.
  • 2. Employing the 1x1 convolution to “receive” the side-branch inputs.
  • 3. Performing an efficient global average pooling (GAP) to “send” the side-branch outputs.

[Figure: CFN side branches: branches 1 to S-1 are created from the pooling layers of the main branch (Conv 1 … Conv S), each passing through a 1×1 Conv and GAP; the topmost output serves as side branch S]

CFN has a minimal increase in parameters for the side branches.
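A single side branch can be sketched as follows (NumPy; the shapes are illustrative assumptions, not the paper's). A 1×1 convolution acts as a per-pixel linear map over the channel axis, and GAP then produces the branch's output vector, so the branch's only parameters are the 1×1 conv weights and biases:

```python
import numpy as np

def side_branch(feature_maps, w, b):
    """One CFN-style side branch: a 1x1 convolution followed by GAP.

    feature_maps: (C_in, H, W) activations taken from a pooling layer.
    w: (C_out, C_in) weights of the 1x1 convolution, b: (C_out,) biases.
    Returns the branch's (C_out,) output vector.
    """
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    out = np.tensordot(w, feature_maps, axes=([1], [0])) + b[:, None, None]
    return out.mean(axis=(1, 2))  # GAP: (C_out, H, W) -> (C_out,)

# Illustrative sizes (not the paper's): 32 input channels, 16x16 maps,
# 64 output channels. The branch adds only 32*64 weights + 64 biases.
rng = np.random.default_rng(0)
fmap = rng.random((32, 16, 16))
w = rng.random((64, 32)) * 0.01
b = np.zeros(64)
vec = side_branch(fmap, w, b)
print(vec.shape)  # (64,)
```

Note how cheap the branch is: its parameter count depends only on the channel counts, not on the spatial size of the maps.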

SLIDE 16

Our approach: CFN

  • Overall Architecture

Advantage II : Early fusion and late prediction Advantage I : Efficient side branches

SLIDE 17

Early fusion and late prediction

[Figure: the CFN architecture: the side-branch outputs enter a fusion module, and the fused feature feeds a single FC layer for the final prediction]

  • 1. Using a fusion module to integrate the side-branch outputs.
  • 2. The fused feature is fed to a fully-connected layer to make a final prediction.
slide-18
SLIDE 18

Discover the world at Leiden University

Comparison

[Figure: early fusion and late prediction (EFLP) vs. early prediction and late fusion (EPLF)]

The advantages of EFLP:

  • performance competitive with EPLF.
  • fewer parameters than EPLF (i.e., it uses only one FC layer).
  • the fused feature can act as a richer image representation.
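A back-of-the-envelope count (sizes are assumptions, not from the paper) illustrates why EFLP needs fewer FC parameters than EPLF: fusing first leaves a single C-dimensional feature for one FC layer, whereas predicting first requires an FC layer per branch:

```python
# Illustrative sizes (assumptions, not the paper's):
# S side branches, each yielding a C-dim vector, K output classes.
S, C, K = 4, 192, 100

# EFLP: fuse the S vectors first, then one FC layer on the fused feature.
eflp_fc_params = C * K

# EPLF: one FC layer (one prediction) per branch, then fuse predictions.
eplf_fc_params = S * C * K

print(eflp_fc_params, eplf_fc_params)
```

Under these assumptions EPLF spends S times as many FC weights as EFLP, which is the parameter advantage the slide refers to.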
SLIDE 19

Our approach: CFN

  • Overall Architecture

Advantage II : Early fusion and late prediction Advantage I : Efficient side branches Advantage III: Locally-connected fusion

SLIDE 20

Locally-connected (LC) fusion

  • The side outputs are first stacked together.
  • A locally-connected layer with a 1×1 kernel is then applied over the stacked maps.

[Figure: fusion module: the side-branch outputs (after 1×1 Conv and GAP) from branches 1 to S are stacked, fused by a locally-connected layer, and passed to the FC layer for the prediction]

The LC layer can learn adaptive weights for the different side outputs.
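A minimal sketch of the LC fusion (NumPy; the sizes are illustrative assumptions, not the paper's). Each spatial location gets its own weight per side branch, and initializing all weights to 1/S, as on the hyper-parameter slide, starts the module out as plain equal-weight (sum-pooling-style) fusion:

```python
import numpy as np

def lc_fuse(stacked, weights):
    """Locally-connected 1x1 fusion over S stacked side outputs.

    stacked: (S, C, H, W) side-output maps.
    weights: (S, H, W), one weight per branch per spatial location,
             i.e. not shared across locations (unlike a 1x1 convolution).
    Returns the fused (C, H, W) map.
    """
    return (weights[:, None, :, :] * stacked).sum(axis=0)

# Illustrative sizes (assumptions): S=3 branches, 8 channels, 4x4 maps.
S, C, H, W = 3, 8, 4, 4
rng = np.random.default_rng(0)
stacked = rng.random((S, C, H, W))

# Initializing every weight to 1/S makes the fusion start as a plain
# average; training can then adapt the weights per branch and location.
weights = np.full((S, H, W), 1.0 / S)
fused = lc_fuse(stacked, weights)
print(fused.shape)  # (8, 4, 4)
```

The same structure explains the parameter counts quoted later (H×W×5 for the edge-detection model): one weight per branch per location.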

SLIDE 21

Comparison

  • Since a locally-connected layer does not share its weights over the spatial dimensions, it can learn a better fusion than other fusion modules.
  • To the best of our knowledge, this is the first attempt to apply a locally-connected layer in a fusion module.

SLIDE 22

Our approach: CFN

  • Overall Architecture

Advantage II : Early fusion and late prediction Advantage I : Efficient side branches Advantage III: Locally-connected fusion

SLIDE 23

Our approach: CFN

  • Overall Architecture
  • CFN can integrate the intermediate layers using additional side branches, and deliver their effects on the final prediction explicitly and directly.

SLIDE 24

Discussion

(1) Difference from DSN

  • Deeply-supervised nets (DSN) add extra supervision to guide the intermediate layers earlier.
  • In contrast, CFN aims to generate a fused, richer feature and uses only one supervision signal at the final prediction.

“Loss fusion” (DSN) vs. “feature fusion” (CFN)

References:

Lee, C., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: AISTATS 2015.

SLIDE 25

Discussion

(2) Difference from ResNet

  • ResNet makes use of “linear” shortcut connections to make much deeper neural networks work well.
  • In contrast, CFN exploits the existing intermediate layers to improve the discriminative capability of CNNs.

References:

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016.

“Depth that matters” (ResNet) vs. “fusion that matters” (CFN)

SLIDE 26

Outline

  • Introduction
  • Convolutional neural networks (CNN)
  • The usage of intermediate layers
  • Multi-layer fusion
  • Motivation
  • How to develop an efficient multi-layer fusion network
  • Our approach
  • Convolutional fusion networks (CFN)
  • Results
  • Image-level and pixel-level classification
  • Conclusions
SLIDE 27

Results: Image-level classification

Dataset         # Categories   # Training images   # Testing images
CIFAR-10        10             50,000              10,000
CIFAR-100       100            50,000              10,000
ImageNet 2012   1,000          1.2 million         50,000 (validation)

Hyper-parameter                      CIFAR     ImageNet 2012
Initial learning rate                0.1       0.01
Weight decay                         0.0001    0.0001
Momentum                             0.9       0.9
Mini-batch size                      100       64
Max iterations                       120,000   200,000
Initialization of LC weights (1/S)   1/3       1/4
Dropout                              Yes       No
Batch norm                           No        Yes

* More details can be seen in the paper.

SLIDE 28

Results on CIFAR

[Figure: a CNN with 8 layers vs. a CFN with three side branches; the LC weights are initialized to 1/3]

* For the convolutional layers, the lower-right number indicates the kernel size and the upper-right number the number of channels.

SLIDE 29

Results on CIFAR

Table: test error (%) on the CIFAR-10/100 datasets (without data augmentation)

  • CFN outperforms a plain CNN.
  • The additional parameters for the extra side branches are fewer than the total number of basic parameters.
  • LC fusion is better than the other fusion modules (i.e., convolution and sum-pooling), yet has only a minimal increase in parameters.

SLIDE 30

Table: Comparison with other models (compared models appeared at ICML 2013, ICLR 2014, AISTATS 2015, ICLR 2015, CVPR 2015, and AAAI 2016). * indicates using data augmentation.

  • We compare CFN with other works that are based on a plain CNN similar to our 8-layer model.
  • CFN is competitive with these works.

Results on CIFAR

SLIDE 31

Visualization of feature maps

  • We extract the feature maps of the 1×1 convolutional layer and visualize the top-4 maps (for both CNN and CFN).

SLIDE 32

  • The side branches in CFN can learn clues complementary to the full-depth main branch. For example, side branch 1 mainly learns the boundaries or shapes around the objects, while side branch 2 focuses on some semantic “parts” near the objects.

SLIDE 33

Results on CIFAR

  • Comparing the weights learned in LC fusion for the side branches.
  • Side branch 3 (the main branch) plays the core role, but the other two side branches are still necessary.

SLIDE 34

Results on ImageNet

[Figure: a CNN with 11 layers vs. a CFN with four side branches; the LC weights are initialized to 1/4]

SLIDE 35

Results on ImageNet

  • Table. Error rates (%) on the ImageNet 2012 validation set (single crop).
SLIDE 36

Results on ImageNet

  • CNN-11 and CFN-11 achieve results competitive with AlexNet while using far fewer parameters (~6.3 million vs. ~60 million).
  • For such a not-overly-deep network, CFN-11 achieves better accuracy than DSN-11 and ResNet-11.
  • CFN can thus act as an alternative way of improving the discriminative capacity of CNNs, by exploiting the existing intermediate layers.

  • Table. Comparison with other models on ImageNet 2012 validation set.
SLIDE 37

Results on ImageNet

SLIDE 38

Results on other tasks

We transfer the trained ImageNet model to three tasks:

  • Scene recognition
  • The Scene-15 dataset (4,485 images)
  • The Indoor-67 dataset (15620 images)
  • Fine-grained recognition
  • The Flower-102 dataset (8189 images)
  • The Bird-200 dataset (11788 images)
  • Image retrieval
  • The Holiday dataset (1491 images)
  • The UKBench dataset (10200 images)
SLIDE 39

Results on other tasks

  • CFN-11 obtains consistent improvements on all datasets.
  • Interestingly, the improvements are even more remarkable than those on the ImageNet dataset itself.
  • Fusing multi-layer deep representations is beneficial for diverse visual tasks.

  • Table. Results on three target tasks (higher is better).
SLIDE 40

  • Fully convolutional networks (FCN)
  • Copy weights from CNN
  • From images to images
  • End-to-end training
  • For example:
  • Edge detection
  • Semantic segmentation

Results: Pixel-level classification

SLIDE 41

Results on Edge detection

HED

Figure from S. Xie and Z. Tu. Holistically-nested edge detection. ICCV, 2015.

SLIDE 42

Results on Edge detection

[Figure: from HED to CFN: Conv1 to Conv5 nets with max-pooling in between; side outputs 1 to 5 are fused by a locally-connected layer into the fusion output]

  • From HED to CFN
  • Use a locally-connected layer in the fusion module.
  • Parameters in the LC fusion: H×W×5

Method   ODS     OIS     AP
HED      0.780   0.802   0.786
CFN      0.784   0.806   0.788

ODS: fixed contour threshold. OIS: per-image best threshold. AP: average precision.

  • Table. Results on the BSDS 500 dataset.
SLIDE 43

Results on Edge detection

[Figure: qualitative edge-detection results: image, ground truth, HED, CFN]

SLIDE 44

Results on Edge detection

SLIDE 45

Results on Semantic segmentation

FCN-8s

Figures from Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015.

SLIDE 46

Results on Semantic segmentation

CFN-8s

[Figure: CFN-8s: FCN-8s with locally-connected fusion at both fusion stages, followed by 8× upsampling]

  • From FCN-8s to CFN-8s
  • Use a locally-connected layer in the two-stage fusion.
  • Parameters in the LC fusion: H×W×21×2

Method   Mean IoU
FCN-8s   58.69
CFN-8s   60.33

* Note that we fine-tune FCN-8s directly from the VGG-16 model, without pre-training FCN-32s and FCN-16s; CFN-8s follows the same training procedure.

  • Table. Results on Pascal VOC.

SLIDE 47

Results on Semantic segmentation

[Figure: qualitative segmentation results: image, ground truth, FCN-8s, CFN-8s]

SLIDE 48

Outline

  • Introduction
  • Convolutional neural networks (CNN)
  • The usage of intermediate layers
  • Multi-layer fusion
  • Motivation
  • How to develop an efficient multi-layer fusion network
  • Our approach
  • Convolutional fusion networks (CFN)
  • Results
  • Image-level and pixel-level classification
  • Conclusions
SLIDE 49

Conclusions

  • Why CFN works well?
  • Integrate the strengths of intermediate layers.
  • Add only a small number of extra parameters.
  • Learn adaptive weights for side branches.
  • CFN can be applied to many visual tasks:
  • Image-level
  • Pixel-level
  • Much deeper models? Hundreds of layers?

Method   Top-1   Top-5
CNN-19   36.99   14.74
CFN-19   35.47   13.93

  • Table. Results on the ImageNet 2012 validation set.

We wish to raise awareness of the potential of multi-layer fusion networks!

SLIDE 50

Code: easy to follow

Github: https://github.com/yuLiu24/CFN

Live Demo: http://goliath.liacs.nl/

SLIDE 51

Thanks for your attention! Questions please?