

SLIDE 1

G2DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition

Qilong Wang¹, Peihua Li¹, Lei Zhang²

¹Dalian University of Technology, ²Hong Kong Polytechnic University

SLIDE 2

Tendency of CNN architectures

LeNet-5 → AlexNet-8 → VGG-VD-19 / GoogLeNet-22 → ResNet-152 / Inception-V4

CNN architectures tend to be deeper & wider → more accurate!

Only convolution, non-linearity (ReLU), and pooling.

SLIDE 3

Trainable structural layers

Images → Conv. layers → trainable structural layer → Loss

  • Bilinear pooling (COV) [B-CNN, ICCV’15]
  • O2P layer (LogCOV) [DeepO2P, ICCV’15]
  • Mean Map Embedding [DMMs, arXiv’15]
  • VLAD Coding [NetVLAD, CVPR’16]

Modeling the outputs of the last convolutional layer with trainable structural layers.

SLIDE 4

Trainable structural layers

Fine-grained visual classification (accuracy on Birds CUB-200-2011 / FGVC-Aircraft / FGVC-Cars):

| Method      | Birds | Aircraft | Cars |
|-------------|-------|----------|------|
| VGG-VD16    | 76.4  | 74.1     | 79.8 |
| B-CNN [D,D] | 84.1  | 84.1     | 91.3 |

B-CNN improves over the VGG-VD16 baseline by ~8%.

T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.

SLIDE 5

Trainable structural layers

Place recognition (Pitts30k): NetVLAD (+AlexNet) (85.6) vs. AlexNet (69.8), an improvement of ~15%.

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.

SLIDE 6

Trainable structural layers

Scene categorization (Place205): DMMs + GoogLeNet (49.00) vs. GoogLeNet (47.5).

J. B. Oliva, D. J. Sutherland, B. Póczos, and J. G. Schneider. Deep mean maps. arXiv, abs/1511.04150, 2015.
SLIDE 7

Trainable structural layers


Integrating trainable structural layers into deep CNNs achieves significant improvements on many challenging vision tasks.

SLIDE 8

Parametric probability distribution modeling

Gaussian distribution, Gaussian mixture model, Gaussian-Laplacian model, ...

① Modeling abundant statistics of features.
② Producing fixed-size representations regardless of varying feature sizes.

Promising modeling performance (better than coding methods):
  • Nakayama et al. CVPR’10
  • Serra et al. CVIU’15
  • Wang et al. CVPR’16

High computational efficiency: closed-form solution for parameter estimation.
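The closed-form parameter estimation mentioned above can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code; the array shapes are arbitrary):

```python
import numpy as np

# Maximum-likelihood estimation of a global Gaussian from N local features
# has a closed form: the sample mean and sample covariance. This is what
# makes parametric distribution modeling computationally efficient.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 features of dimension d = 5

mu = X.mean(axis=0)                      # closed-form mean estimate
Sigma = (X - mu).T @ (X - mu) / len(X)   # closed-form (ML) covariance estimate

print(mu.shape, Sigma.shape)             # (5,) (5, 5)
```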

SLIDE 9

Embedding of global Gaussian in CNN

Images → Conv. layers → Global Gaussian → Loss

SLIDE 10

Global Gaussian distribution embedding network (G2DeNet)

  • A trainable global Gaussian embedding layer for modeling convolutional features.
  • The first attempt to plug a parametric probability distribution into deep CNNs.

Global Gaussian Embedding Layer:

Images → Conv. layers → X → Matrix Partition Sub-layer → Y → Square-rooted SPD Matrix Sub-layer → Z → Loss f(Z)

Global Gaussian: $\mathcal{N}(\mu, \Sigma) \mapsto \begin{pmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{pmatrix}^{\frac{1}{2}}$

Matrix Partition Sub-layer: $Y = f_{\mathrm{MPL}}(X) = \frac{1}{N} A^T X^T X A + \frac{2}{N}\big(A^T X^T \mathbf{1} b^T\big)_{\mathrm{sym}} + B$

Square-rooted SPD Matrix Sub-layer: $Z = f_{\mathrm{ESRL}}(Y) = Y^{\frac{1}{2}}$

Backward pass: $\partial f/\partial Z \rightarrow \partial f/\partial Y \rightarrow \partial f/\partial X$

SLIDE 11

Challenges

Q: How to construct our trainable global Gaussian embedding layer?
A: The key is to give explicit forms of the Gaussian distributions.

Forward propagation: Riemannian geometry structure / algebraic structure.
Backward propagation: differentiable.

SLIDE 12

Gaussian embedding

[TPAMI’17] shows that the space of Gaussians is equipped with a Lie group structure; it is a Riemannian manifold with a special geometric structure.

[TPAMI’17] Peihua Li, Qilong Wang et al. Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI, 2017.

A Gaussian $\mathcal{N}(\mu, \Sigma)$ is identified with the affine matrix $A = \begin{pmatrix} L & \mu \\ 0^T & 1 \end{pmatrix}$, $\Sigma = LL^T$ (Cholesky decomposition), and, via the left polar decomposition $A = PO$ ($P$ SPD, $O$ orthogonal), with the SPD matrix

$P = \big(AA^T\big)^{\frac{1}{2}} = \begin{pmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{pmatrix}^{\frac{1}{2}}.$

Gaussian ↔ positive triangular matrix (Cholesky decomp.) ↔ SPD matrix (left polar decomp.)
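A quick numerical check of this identification (a minimal sketch assuming the decompositions above; dimensions and random values are arbitrary):

```python
import numpy as np

# A Gaussian N(mu, Sigma) is identified with A = [[L, mu], [0, 1]], where
# Sigma = L L^T (Cholesky factor), and A A^T then equals the block matrix
# [[Sigma + mu mu^T, mu], [mu^T, 1]] whose square root is the embedding.
rng = np.random.default_rng(1)
d = 3
mu = rng.normal(size=(d, 1))
M = rng.normal(size=(d, d))
Sigma = M @ M.T + d * np.eye(d)          # a random SPD covariance

L = np.linalg.cholesky(Sigma)            # Sigma = L L^T
A = np.block([[L, mu], [np.zeros((1, d)), np.ones((1, 1))]])

S = A @ A.T                              # SPD matrix associated to the Gaussian
target = np.block([[Sigma + mu @ mu.T, mu], [mu.T, np.ones((1, 1))]])
print(np.allclose(S, target))            # True
```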

SLIDE 13

Global Gaussian embedding layer

Gaussian embedding: $\mathcal{N}(\mu, \Sigma) \mapsto \begin{pmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{pmatrix}^{\frac{1}{2}}$

  • 1. Matrix Partition Sub-layer:
    $Y = f_{\mathrm{MPL}}(X) = \frac{1}{N} A^T X^T X A + \frac{2}{N}\big(A^T X^T \mathbf{1} b^T\big)_{\mathrm{sym}} + B,$
    where $A = [I_d \;\; 0]$, $b = (0, \ldots, 0, 1)^T$, $B = bb^T$, so that $XA + \mathbf{1}b^T = [X \;\; \mathbf{1}]$ and $Y = \frac{1}{N}[X \;\; \mathbf{1}]^T[X \;\; \mathbf{1}]$.
  • 2. Square-rooted SPD Matrix Sub-layer:
    $Z = f_{\mathrm{ESRL}}(Y) = Y^{\frac{1}{2}} = U \Lambda^{\frac{1}{2}} U^T$, with the SVD $Y = U \Lambda U^T$.

Gaussian embedding: Y is a function of the convolutional features X; the square root of Y is computed via SVD.
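The two sub-layers above can be sketched in NumPy as follows (illustrative only, not the authors' code; $A = [I\;0]$ and $b = e_{d+1}$ are the partition matrices assumed above, so $XA + \mathbf{1}b^T = [X\;\mathbf{1}]$):

```python
import numpy as np

def f_mpl(X):
    # Matrix Partition Sub-layer: Y = (1/N) [X, 1]^T [X, 1], which equals
    # the Gaussian embedding [[Sigma + mu mu^T, mu], [mu^T, 1]].
    N = X.shape[0]
    Xt = np.hstack([X, np.ones((N, 1))])   # X A + 1 b^T = [X, 1]
    return Xt.T @ Xt / N

def f_esrl(Y):
    # Square-rooted SPD Matrix Sub-layer: Z = Y^{1/2} via eigendecomposition.
    lam, U = np.linalg.eigh(Y)             # Y = U diag(lam) U^T
    return (U * np.sqrt(np.maximum(lam, 0))) @ U.T

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))               # N = 50 conv. features, d = 4
Y = f_mpl(X)
Z = f_esrl(Y)

mu = X.mean(axis=0, keepdims=True).T
Sigma = np.cov(X.T, bias=True)
print(np.allclose(Y[:4, :4], Sigma + mu @ mu.T))   # True: top-left block
print(np.allclose(Z @ Z, Y))                       # True: Z is sqrt of Y
```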

SLIDE 14

BP for global Gaussian embedding layer

(Diagram of the Global Gaussian Embedding Layer as on Slide 10: X → Matrix Partition Sub-layer → Y → Square-rooted SPD Matrix Sub-layer → Z → Loss f(Z).)

The goal is to compute $\frac{\partial f}{\partial X}$ given $\frac{\partial f}{\partial Z}$. The first step is to compute $\frac{\partial f}{\partial Y}$ given $\frac{\partial f}{\partial Z}$.

SLIDE 15

BP for square-rooted SPD matrix sub-layer

Compute $\frac{\partial f}{\partial Y}$ given $\frac{\partial f}{\partial U}$ and $\frac{\partial f}{\partial \Lambda}$, using the chain rule in inner-product form:

$\frac{\partial f}{\partial Y} : dY = \frac{\partial f}{\partial U} : dU + \frac{\partial f}{\partial \Lambda} : d\Lambda$

With the variations of the SVD $Y = U \Lambda U^T$,

$dU = 2U\big(K^T \circ (U^T\, dY\, U)\big)_{\mathrm{sym}}, \qquad d\Lambda = (U^T\, dY\, U)_{\mathrm{diag}},$

this gives, following the matrix backpropagation of [DeepO2P, ICCV’15],

$\frac{\partial f}{\partial Y} = U\left( \Big(K^T \circ \big(U^T \tfrac{\partial f}{\partial U}\big)\Big)_{\mathrm{sym}} + \Big(\tfrac{\partial f}{\partial \Lambda}\Big)_{\mathrm{diag}} \right) U^T, \qquad K_{ij} = \begin{cases} \frac{1}{\sigma_i^2 - \sigma_j^2}, & i \neq j \\ 0, & i = j. \end{cases}$

[DeepO2P, ICCV’15]: Catalin Ionescu et al. Matrix Backpropagation for Deep Networks with Structured Layers. ICCV, 2015.

SLIDE 16

BP for square-rooted SPD matrix sub-layer

Compute $\frac{\partial f}{\partial U}$ and $\frac{\partial f}{\partial \Lambda}$ given $\frac{\partial f}{\partial Z}$, where $Z = f_{\mathrm{ESRL}}(Y) = Y^{\frac{1}{2}} = U \Lambda^{\frac{1}{2}} U^T$. From the chain rule

$\frac{\partial f}{\partial Z} : dZ = \frac{\partial f}{\partial U} : dU + \frac{\partial f}{\partial \Lambda} : d\Lambda$

and the variation $dZ = 2\big(dU\, \Lambda^{\frac{1}{2}} U^T\big)_{\mathrm{sym}} + \frac{1}{2}\, U \Lambda^{-\frac{1}{2}}\, d\Lambda\, U^T$, we obtain

$\frac{\partial f}{\partial U} = 2\Big(\frac{\partial f}{\partial Z}\Big)_{\mathrm{sym}} U \Lambda^{\frac{1}{2}}, \qquad \frac{\partial f}{\partial \Lambda} = \frac{1}{2}\, \Lambda^{-\frac{1}{2}}\, U^T \frac{\partial f}{\partial Z} U.$
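Backpropagation through the square-root sub-layer can be checked numerically. The sketch below works directly in the eigenbasis of $Y$, where the two-step chain through $(dU, d\Lambda)$ collapses to a closed form for $\partial f/\partial Y$, and compares it against finite differences (an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def matrix_sqrt(Y):
    # Z = Y^{1/2} via the eigendecomposition Y = U diag(lam) U^T
    lam, U = np.linalg.eigh(Y)
    return (U * np.sqrt(lam)) @ U.T

def sqrt_backward(Y, dZ):
    # df/dY for Z = Y^{1/2}: in the eigenbasis the chain reduces to an
    # elementwise division by s_i + s_j, where s_i = sqrt(lam_i).
    lam, U = np.linalg.eigh(Y)
    s = np.sqrt(lam)
    G = U.T @ ((dZ + dZ.T) / 2) @ U        # symmetrized upstream gradient
    W = 1.0 / (s[:, None] + s[None, :])    # 1 / (s_i + s_j)
    return U @ (G * W) @ U.T

rng = np.random.default_rng(3)
d = 4
M = rng.normal(size=(d, d))
Y = M @ M.T + d * np.eye(d)                # a random SPD input
C = rng.normal(size=(d, d))                # f(Z) = <C, Z>, so df/dZ = C

grad = sqrt_backward(Y, C)

# Directional finite-difference check along a random symmetric direction
D = rng.normal(size=(d, d)); D = D + D.T
eps = 1e-6
f = lambda Y: np.sum(C * matrix_sqrt(Y))
num = (f(Y + eps * D) - f(Y - eps * D)) / (2 * eps)
print(abs(np.sum(grad * D) - num) < 1e-5)  # True
```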

SLIDE 17

BP for global Gaussian embedding layer

The goal is to compute $\frac{\partial f}{\partial X}$ from the chain rule $\frac{\partial f}{\partial X} : dX = \frac{\partial f}{\partial Y} : dY$. Given $\frac{\partial f}{\partial Y}$ and the forward map $Y = f_{\mathrm{MPL}}(X) = \frac{1}{N} A^T X^T X A + \frac{2}{N}\big(A^T X^T \mathbf{1} b^T\big)_{\mathrm{sym}} + B$,

$\frac{\partial f}{\partial X} = \frac{2}{N}\big(XA + \mathbf{1}b^T\big)\Big(\frac{\partial f}{\partial Y}\Big)_{\mathrm{sym}} A^T.$
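The matrix-partition backward step can likewise be verified numerically. With $M = XA + \mathbf{1}b^T = [X\;\mathbf{1}]$ and $Y = \frac{1}{N}M^T M$, the gradient $\frac{2}{N} M (\partial f/\partial Y)_{\mathrm{sym}} A^T$ is just the first $d$ columns of $\frac{2}{N} M (\partial f/\partial Y)_{\mathrm{sym}}$ (an illustrative sketch under the partition matrices assumed earlier, not the authors' code):

```python
import numpy as np

def f_mpl(X):
    # Forward: Y = (1/N) [X, 1]^T [X, 1]
    N = X.shape[0]
    M = np.hstack([X, np.ones((N, 1))])
    return M.T @ M / N

def mpl_backward(X, dY):
    # Backward: df/dX = (2/N) M (df/dY)_sym A^T, i.e. drop the last column.
    N, d = X.shape
    M = np.hstack([X, np.ones((N, 1))])
    dY_sym = (dY + dY.T) / 2
    return (2.0 / N) * (M @ dY_sym)[:, :d]

rng = np.random.default_rng(4)
N, d = 20, 3
X = rng.normal(size=(N, d))
C = rng.normal(size=(d + 1, d + 1))        # f(Y) = <C, Y>, so df/dY = C

grad = mpl_backward(X, C)

# Directional finite-difference check
D = rng.normal(size=(N, d))
eps = 1e-6
num = (np.sum(C * f_mpl(X + eps * D)) - np.sum(C * f_mpl(X - eps * D))) / (2 * eps)
print(np.allclose(np.sum(grad * D), num))  # True
```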

SLIDE 18

Global Gaussian distribution embedding network (G2DeNet)

(Diagram of the Global Gaussian Embedding Layer as on Slide 10: X → Matrix Partition Sub-layer → Y → Square-rooted SPD Matrix Sub-layer → Z → Loss f(Z).)

  • Gaussian embedding.
  • Structural backpropagation $\partial f/\partial X$ and $\partial f/\partial Y$.


SLIDE 20

Experiments on MS-COCO

(Figure: convergence curve of our G2DeNet-FC with AlexNet on MS-COCO.)

Comparison of classification errors on MS-COCO:

| Method                    | Err. |
|---------------------------|------|
| AlexNet (baseline)        | 25.3 |
| DeepO2P [ICCV’15]         | 28.6 |
| DeepO2P-FC (S) [ICCV’15]  | 28.9 |
| DeepO2P-FC [ICCV’15]      | 25.2 |
| DMMs-FC [arXiv’15]        | 24.6 |
| G2DeNet (Ours)            | 24.4 |
| G2DeNet-FC (S) (Ours)     | 22.6 |
| G2DeNet-FC (Ours)         | 21.5 |

890k segmented instances from the MS-COCO dataset: 80 classes, ~600k training instances, ~290k validation instances [DeepO2P, ICCV’15].

SLIDE 21

Experiments on FGVR - Benchmarks

| Dataset              | Classes | Training / Test |
|----------------------|---------|-----------------|
| Birds (CUB-200-2011) | 200     | 5,994 / 5,794   |
| FGVC-Aircraft        | 100     | 6,667 / 3,333   |
| FGVC-Cars            | 196     | 8,144 / 8,041   |

SLIDE 22

Experiments on FGVR - Results

| Methods            | Birds (CUB-200-2011) | FGVC-Aircraft | FGVC-Cars |
|--------------------|----------------------|---------------|-----------|
| FC-CNN             | 76.4                 | 74.1          | 79.8      |
| FV-CNN             | 77.5                 | 77.6          | 85.7      |
| VLAD-CNN           | 79.0                 | 80.6          | 85.6      |
| NetFV [TPAMI’17]   | 79.9                 | 79.0          | 86.2      |
| NetVLAD [CVPR’16]  | 81.9                 | 81.8          | 88.6      |
| B-CNN [ICCV’15]    | 84.1                 | 84.1          | 91.3      |
| G2DeNet (Ours)     | 87.1                 | 89.0          | 92.5      |

Comparison of different counterparts using VGG-VD16, without bounding box & part annotations, sharing the same settings as B-CNN.

NetFV [TPAMI’17]: Lin et al. Bilinear CNNs for Fine-grained Visual Recognition. TPAMI, 2017.

SLIDE 23

Experiments on FGVR - Results

Comparison with various state-of-the-art methods:

| Methods                 | Birds (CUB-200-2011) | FGVC-Aircraft | FGVC-Cars | Remarks                       |
|-------------------------|----------------------|---------------|-----------|-------------------------------|
| PG-Alignment [CVPR’15]  | 82.0                 | –             | 92.6      | PG-Alignment + BB             |
| PD [CVPR’16]            | 84.5                 | –             | –         | PD + FC + SWFV-CNN            |
| BoT [CVPR’16]           | –                    | 88.4          | 92.5      | Bag of Triplets + BB          |
| SPDA-CNN [CVPR’16]      | 85.1                 | –             | –         | SPDA-CNN + Ensemble           |
| Boosted CNN [BMVC’16]   | 86.2                 | 88.5          | 92.1      | Boosted CNN + B-CNN           |
| RA-CNN [CVPR’17]        | 85.3                 | –             | 92.5      | Recurrent attention CNN       |
| CVL [CVPR’17]           | 85.6                 | –             | –         | Combining Vision and Language |
| KP-CNN [CVPR’17]        | 86.2                 | 86.9          | 92.4      | Kernel Pooling for CNN        |
| G2DeNet (Ours)          | 87.1 (87.5)          | 89.0          | 92.5      | VGG-VD16 w/o BB & Part        |

SLIDE 24

Experiments on ablation – Training methods

  • VD16-NoTr: global Gaussian embedding layer + VGG-VD16 pre-trained on ImageNet.
  • VD16-FT: global Gaussian embedding layer + fine-tuned VGG-VD16.
  • G2DeNet: VGG-VD16 pre-trained on ImageNet + G2DeNet trained end-to-end.

Effects of different training methods on G2DeNet using VGG-VD16 on the Birds dataset.

SLIDE 25

Experiments on ablation - Embedding methods

Comparison of different Gaussian embedding methods for G2DeNet on the Birds dataset:

| Method | Gaussian Embedding | Acc. (%) |
|--------|--------------------|----------|
| Nakayama et al. [CVPR’2010] | $\big(\mu^T, \mathrm{vec}(\Sigma)^T\big)^T$ | 83.5 |
| Calvo et al. or Lovric et al. [JMVA’1990 & JMVA’2000] | $\begin{pmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{pmatrix}$ | 84.1 |
| Calvo et al. or Lovric et al. + Log-Euclidean [ICCV’2013] | $\log\begin{pmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{pmatrix}$ | 83.8 |
| Ours | $\begin{pmatrix} \Sigma + \mu\mu^T & \mu \\ \mu^T & 1 \end{pmatrix}^{\frac{1}{2}}$ | 87.1 |

SLIDE 26

Conclusion

The first attempt to plug a global Gaussian distribution into deep CNNs. Future work: more CNN architectures and computer vision applications. Please refer to our poster [ID #11] for more details.

Images → Conv. layers → Global Gaussian → Loss