 
G²DeNet: Global Gaussian Distribution Embedding Network and Its Application to Visual Recognition
Qilong Wang¹, Peihua Li¹, Lei Zhang²
¹Dalian University of Technology, ²Hong Kong Polytechnic University

Tendency of CNN architectures
LeNet-5 → AlexNet-8 → VGG-VD-19 / GoogLeNet-22 → ResNet-152 / Inception-V4
CNN architectures tend to be deeper and wider, and more accurate!
Building blocks: only convolution, non-linearity (ReLU) and pooling.

Trainable structural layers
Images → Conv. layers → trainable structural layer → Loss
Idea: model the outputs of the last convolutional layer with trainable structural layers, e.g.:
• O²P layer (LogCOV) [DeepO²P, ICCV'15]
• Bilinear pooling (COV) [B-CNN, ICCV'15]
• Mean map embedding [DMMs, arXiv'15]
• VLAD coding [NetVLAD, CVPR'16]

Trainable structural layers: fine-grained visual classification
Accuracy (%) on Birds CUB-200-2011 / FGVC-Aircraft / FGVC-Cars:
B-CNN [D,D]: 84.1 / 84.1 / 91.3 vs. VGG-VD16: 76.4 / 74.1 / 79.8, i.e. gains of 7.7, 10.0 and 11.5 points.
T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.

Trainable structural layers: place recognition (Pitts30k)
NetVLAD (+AlexNet): 85.6 vs. AlexNet: 69.8, a gain of about 15 points.
R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.

Trainable structural layers: scene categorization (Places205)
DMMs + GoogLeNet: 49.00 vs. GoogLeNet: 47.5
Integrating trainable structural layers into deep CNNs achieves significant improvements in many challenging vision tasks.
J. B. Oliva, D. J. Sutherland, B. Póczos, and J. G. Schneider. Deep mean maps. arXiv, abs/1511.04150, 2015.

Parametric probability distribution modeling
① Models rich statistics of features.
② Produces fixed-size representations regardless of the (varying) number of features.
• Promising modeling performance (better than coding methods).
• High computational efficiency: closed-form solutions for parameter estimation.
Distributions: Gaussian, Gaussian mixture model, Gaussian-Laplacian model, …
Examples: Nakayama et al. CVPR'10; Serra et al. CVIU'15; Wang et al. CVPR'16; …

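A minimal NumPy illustration of properties ② and "closed-form estimation" for the single-Gaussian case (the helper name `gaussian_descriptor` and the flattening into one vector are illustrative assumptions, not taken from any of the cited methods): the mean and covariance are computed in closed form, and because the covariance is symmetric only its upper triangle is kept, so the output length is d + d(d+1)/2 no matter how many features N are supplied.

```python
import numpy as np

def gaussian_descriptor(X):
    """Closed-form (maximum-likelihood) Gaussian fit of N d-dim features,
    flattened into a vector whose length depends only on d."""
    N, d = X.shape
    mu = X.mean(axis=0)                      # mean: average of the features
    Xc = X - mu
    Sigma = Xc.T @ Xc / N                    # ML covariance (1/N normalization)
    iu = np.triu_indices(d)                  # Sigma is symmetric: keep the upper triangle
    return np.concatenate([mu, Sigma[iu]])

rng = np.random.default_rng(0)
d = 5
v_small = gaussian_descriptor(rng.standard_normal((30, d)))    # 30 local features
v_large = gaussian_descriptor(rng.standard_normal((500, d)))   # 500 local features
print(v_small.shape, v_large.shape)   # (20,) (20,): d + d(d+1)/2 = 5 + 15
```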
Embedding of global Gaussian in CNN
Images → Conv. layers → Global Gaussian → Loss

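Global Gaussian distribution embedding network (G²DeNet)
Global Gaussian: $\mathcal{N}(\mu,\Sigma) \mapsto \begin{bmatrix}\Sigma+\mu\mu^T & \mu\\ \mu^T & 1\end{bmatrix}^{\frac{1}{2}}$
Images → Conv. layers → global Gaussian embedding layer → Loss, where the embedding layer stacks two sub-layers:
• Matrix partition sub-layer (MPL): $Y = f_{MPL}(X) = \frac{1}{N}AX^TXA^T + \frac{2}{N}\big(AX^T\mathbf{1}b^T\big)_{sym} + B$
• Square-rooted SPD (symmetric positive definite) matrix sub-layer (ESRL): $Z = f_{ESRL}(Y) = Y^{\frac{1}{2}}$
Backward: $\frac{\partial f}{\partial Z} \rightarrow \frac{\partial f}{\partial Y} \rightarrow \frac{\partial f}{\partial X}$
• A trainable global Gaussian embedding layer for modeling convolutional features.
• The first attempt to plug a parametric probability distribution into deep CNNs.
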
Challenges
Q: How do we construct a trainable global Gaussian embedding layer?
A: The key is to give explicit forms of Gaussian distributions that respect both the Riemannian geometry structure and the algebraic structure of the space of Gaussians (forward propagation) and that are differentiable (backward propagation).

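Gaussian embedding
• The space of Gaussians is a Riemannian manifold with a special geometric structure.
• [TPAMI'17] shows that the space of Gaussians is equipped with a Lie group structure.
• Cholesky decomposition $\Sigma = LL^T$ ($L$ upper triangular with positive diagonal) identifies the Gaussian $\mathcal{N}(\mu,\Sigma)$ with the positive upper triangular matrix $A = \begin{bmatrix} L & \mu\\ \mathbf{0}^T & 1\end{bmatrix}$.
• Left polar decomposition $A = PO$ ($O$ orthogonal) then gives the SPD matrix $P = (AA^T)^{\frac{1}{2}} = \begin{bmatrix}\Sigma+\mu\mu^T & \mu\\ \mu^T & 1\end{bmatrix}^{\frac{1}{2}}$.
Gaussian → positive upper triangular matrix → SPD matrix
[TPAMI'17] Peihua Li, Qilong Wang et al. Local Log-Euclidean Multivariate Gaussian Descriptor and Its Application to Image Classification. TPAMI, 2017.
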
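The chain Gaussian → positive upper triangular matrix → SPD matrix can be checked numerically. This is a NumPy sketch under stated assumptions: the helper names `upper_cholesky`, `gaussian_to_affine` and `spd_sqrt` are hypothetical, not code from [TPAMI'17]. Since `np.linalg.cholesky` returns a lower-triangular factor, the sketch obtains an upper-triangular $L$ by reversing the index order before and after factorizing; $P$ is the SPD square root of $AA^T$ computed by eigendecomposition, and $O = P^{-1}A$.

```python
import numpy as np

def upper_cholesky(S):
    """Upper-triangular L with positive diagonal such that S = L @ L.T."""
    J = np.eye(S.shape[0])[::-1]              # index-reversal permutation (J = J^T = J^{-1})
    C = np.linalg.cholesky(J @ S @ J)         # lower triangular: J S J = C C^T
    return J @ C @ J                          # reversing indices makes it upper triangular

def gaussian_to_affine(mu, S):
    """A = [[L, mu], [0^T, 1]] with S = L L^T: a positive upper triangular matrix."""
    d = mu.shape[0]
    A = np.zeros((d + 1, d + 1))
    A[:d, :d] = upper_cholesky(S)
    A[:d, d] = mu
    A[d, d] = 1.0
    return A

def spd_sqrt(M):
    """Principal square root of an SPD matrix via eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return U @ np.diag(np.sqrt(lam)) @ U.T

rng = np.random.default_rng(0)
d = 3
mu = rng.standard_normal(d)
R = rng.standard_normal((d, d))
S = R @ R.T + 0.5 * np.eye(d)                 # toy covariance

A = gaussian_to_affine(mu, S)
P = spd_sqrt(A @ A.T)                         # SPD factor of the left polar decomposition
O = np.linalg.solve(P, A)                     # orthogonal factor: A = P O
target = np.block([[S + np.outer(mu, mu), mu[:, None]],
                   [mu[None, :], np.ones((1, 1))]])
```

Here `A @ A.T` reproduces the block matrix exactly, and `P @ P` equals it as well, so `P` is the embedded Gaussian.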
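Global Gaussian embedding layer
Gaussian embedding: $\mathcal{N}(\mu,\Sigma) \mapsto \begin{bmatrix}\Sigma+\mu\mu^T & \mu\\ \mu^T & 1\end{bmatrix}^{\frac{1}{2}}$
1. Matrix partition sub-layer: $Y = f_{MPL}(X) = \frac{1}{N}AX^TXA^T + \frac{2}{N}\big(AX^T\mathbf{1}b^T\big)_{sym} + B$. $Y$ is a function of the convolutional features $X$.
2. Square-rooted SPD matrix sub-layer: $Z = f_{ESRL}(Y) = Y^{\frac{1}{2}} = U\Lambda^{\frac{1}{2}}U^T$, computing the square root of $Y$ via SVD, $Y = U\Lambda U^T$.
Here $X \in \mathbb{R}^{N\times d}$ stacks the $N$ $d$-dimensional convolutional features, $\mathbf{1}$ is the all-ones $N$-vector, $A = [I_d, \mathbf{0}]^T \in \mathbb{R}^{(d+1)\times d}$, $b = [\mathbf{0}^T, 1]^T \in \mathbb{R}^{d+1}$, $B = bb^T$ and $(M)_{sym} = \frac{1}{2}(M + M^T)$, so that $Y = \begin{bmatrix}\Sigma+\mu\mu^T & \mu\\ \mu^T & 1\end{bmatrix}$ with $\mu = \frac{1}{N}X^T\mathbf{1}$ and $\Sigma = \frac{1}{N}X^TX - \mu\mu^T$.
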
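A forward-pass sketch of the two sub-layers in NumPy, assuming $X \in \mathbb{R}^{N\times d}$, $A = [I_d, \mathbf{0}]^T$, $b = [0,\dots,0,1]^T$ and $B = bb^T$ (these choices make $Y$ equal to the block matrix and are stated here as assumptions). The function names are hypothetical, and `np.linalg.eigh` stands in for the SVD, which coincides with the eigendecomposition for an SPD matrix. The sketch checks $Y$ against the block matrix built directly from $\mu$ and the biased ($1/N$) covariance, and checks that $Z^2 = Y$.

```python
import numpy as np

def mpl_forward(X):
    """Matrix partition sub-layer: Y = (1/N) A X^T X A^T + (2/N)(A X^T 1 b^T)_sym + B."""
    N, d = X.shape
    A = np.vstack([np.eye(d), np.zeros((1, d))])   # A = [I_d, 0]^T, shape (d+1, d)
    b = np.zeros((d + 1, 1))
    b[d, 0] = 1.0                                   # b = [0, ..., 0, 1]^T
    B = b @ b.T                                     # only the bottom-right entry is 1
    M = A @ X.T @ np.ones((N, 1)) @ b.T             # places X^T 1 in the last column
    return (A @ X.T @ X @ A.T) / N + (M + M.T) / N + B   # (2/N)(M)_sym = (M + M^T)/N

def esrl_forward(Y):
    """Square-rooted SPD matrix sub-layer: Z = Y^{1/2} = U Lam^{1/2} U^T."""
    lam, U = np.linalg.eigh(Y)                      # eigh in place of SVD (Y is symmetric)
    lam = np.maximum(lam, 0.0)                      # clip tiny negative values from rounding
    return U @ np.diag(np.sqrt(lam)) @ U.T

rng = np.random.default_rng(0)
N, d = 49, 8                                        # toy sizes: N features of dimension d
X = rng.standard_normal((N, d))
Y = mpl_forward(X)
Z = esrl_forward(Y)

# The same matrix built directly from the Gaussian parameters.
mu = X.mean(axis=0)
Sigma = X.T @ X / N - np.outer(mu, mu)              # biased (1/N) sample covariance
target = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                   [mu[None, :], np.ones((1, 1))]])
```

Writing the partition as matrix products rather than by slicing keeps every step a plain linear or bilinear map of $X$, which is what makes the backward formulas for $f_{MPL}$ short.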
BP for global Gaussian embedding layer
Forward: $X \rightarrow Y = f_{MPL}(X) \rightarrow Z = f_{ESRL}(Y) \rightarrow$ Loss
The goal is to compute $\frac{\partial f}{\partial X}$. The first step is to compute $\frac{\partial f}{\partial Y}$ from $\frac{\partial f}{\partial Z}$ through the square-rooted SPD matrix sub-layer.

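BP for square-rooted SPD matrix sub-layer: compute $\frac{\partial f}{\partial Y}$
With $Y = U\Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_{d+1})$, following [DeepO²P, ICCV'15]:
$\frac{\partial f}{\partial Y} : dY = \frac{\partial f}{\partial U} : dU + \frac{\partial f}{\partial \Lambda} : d\Lambda$, where $dU = 2U\big(K^T \circ (U^T dY\,U)_{sym}\big)$ and $d\Lambda = \big(U^T dY\,U\big)_{diag}$.
Hence $\frac{\partial f}{\partial Y} = U\Big\{2\Big(K^T \circ \big(U^T\frac{\partial f}{\partial U}\big)\Big)_{sym} + \Big(\frac{\partial f}{\partial \Lambda}\Big)_{diag}\Big\}U^T$, with $K_{ij} = \frac{1}{\lambda_i - \lambda_j}$ for $i \neq j$ and $K_{ij} = 0$ otherwise.
Notation: $A : B = \mathrm{tr}(A^TB)$, $\circ$ is the Hadamard product, and $(A)_{diag}$ keeps only the diagonal of $A$.
[DeepO²P, ICCV'15]: Catalin Ionescu et al. Matrix Backpropagation for Deep Networks with Structured Layers. ICCV, 2015.
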
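A NumPy sketch of this eigendecomposition backward step, i.e. mapping $\frac{\partial f}{\partial U}$ and $\frac{\partial f}{\partial \Lambda}$ to $\frac{\partial f}{\partial Y}$ (the function name `sym_eig_backward` is hypothetical and distinct eigenvalues are assumed). The constant in front of $(\cdot)_{sym}$ depends on how the symmetrization and the gradient with respect to a symmetric matrix are defined, and conventions differ between papers; this sketch takes $(A)_{sym} = \frac{1}{2}(A + A^T)$ and uses no extra factor of 2, which is the scaling that agrees with the direct check below. The check uses the test function $f(Y) = \langle G, Y^2\rangle$, whose symmetric gradient is $G_sY + YG_s$ with $G_s = (G)_{sym}$.

```python
import numpy as np

def sym_eig_backward(U, lam, grad_U, grad_lam):
    """Map df/dU and df/dlam to df/dY for Y = U diag(lam) U^T (distinct eigenvalues assumed)."""
    diff = lam[:, None] - lam[None, :]            # diff[i, j] = lam_i - lam_j
    off = ~np.eye(lam.size, dtype=bool)
    K = np.zeros_like(diff)
    K[off] = 1.0 / diff[off]                      # K_ij = 1/(lam_i - lam_j), K_ii = 0
    inner = K.T * (U.T @ grad_U)                  # Hadamard product K^T o (U^T df/dU)
    inner = 0.5 * (inner + inner.T)               # (.)_sym = (A + A^T)/2, no extra factor 2
    return U @ (inner + np.diag(grad_lam)) @ U.T

rng = np.random.default_rng(0)
d = 4
R = rng.standard_normal((d, d))
Y = R @ R.T + d * np.eye(d)                       # toy SPD matrix
G = rng.standard_normal((d, d))
Gs = 0.5 * (G + G.T)

# Test function f(Y) = <G, Y^2>, written through Y^2 = U diag(lam^2) U^T.
lam, U = np.linalg.eigh(Y)
grad_U = 2.0 * Gs @ U @ np.diag(lam ** 2)         # df/dU for fixed lam
grad_lam = 2.0 * lam * np.diag(U.T @ Gs @ U)      # df/dlam for fixed U
grad_Y = sym_eig_backward(U, lam, grad_U, grad_lam)

expected = Gs @ Y + Y @ Gs                        # direct symmetric gradient of <G, Y^2>
```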
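BP for square-rooted SPD matrix sub-layer: compute $\frac{\partial f}{\partial U}$ and $\frac{\partial f}{\partial \Lambda}$
$Z = f_{ESRL}(Y) = Y^{\frac{1}{2}} = U\Lambda^{\frac{1}{2}}U^T$, and $\frac{\partial f}{\partial Z} : dZ = \frac{\partial f}{\partial U} : dU + \frac{\partial f}{\partial \Lambda} : d\Lambda$ with
$dZ = 2\big(dU\,\Lambda^{\frac{1}{2}}U^T\big)_{sym} + \frac{1}{2}U\Lambda^{-\frac{1}{2}}d\Lambda\,U^T$.
Hence $\frac{\partial f}{\partial U} = 2\Big(\frac{\partial f}{\partial Z}\Big)_{sym}U\Lambda^{\frac{1}{2}}$ and $\frac{\partial f}{\partial \Lambda} = \frac{1}{2}\Lambda^{-\frac{1}{2}}U^T\frac{\partial f}{\partial Z}U$.
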
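These two partial derivatives can be checked with finite differences by treating $U$ and the eigenvalues as independent inputs of $Z = U\Lambda^{\frac{1}{2}}U^T$, which is how the chain rule above uses them. The NumPy sketch below assumes the linear test loss $f = \langle G, Z\rangle$ (so $\frac{\partial f}{\partial Z} = G$), uses `np.linalg.eigh` in place of the SVD, and the helper names are hypothetical. Only the diagonal of $\frac{\partial f}{\partial \Lambda}$ matters, so it is returned as a vector.

```python
import numpy as np

def esrl_forward(Y):
    """Z = Y^{1/2} = U diag(sqrt(lam)) U^T (eigh used in place of SVD for SPD Y)."""
    lam, U = np.linalg.eigh(Y)
    Z = U @ np.diag(np.sqrt(lam)) @ U.T
    return Z, U, lam

def esrl_partials(U, lam, grad_Z):
    """df/dU = 2 (df/dZ)_sym U Lam^{1/2};  df/dLam = diag(1/2 Lam^{-1/2} U^T (df/dZ) U)."""
    Gs = 0.5 * (grad_Z + grad_Z.T)                     # (.)_sym = (A + A^T) / 2
    grad_U = 2.0 * Gs @ U @ np.diag(np.sqrt(lam))
    grad_lam = 0.5 * lam ** -0.5 * np.diag(U.T @ grad_Z @ U)
    return grad_U, grad_lam

rng = np.random.default_rng(1)
d = 3
R = rng.standard_normal((d, d))
Y = R @ R.T + np.eye(d)                  # toy SPD input
G = rng.standard_normal((d, d))          # df/dZ for the linear test loss f = <G, Z>
Z, U, lam = esrl_forward(Y)
grad_U, grad_lam = esrl_partials(U, lam, G)

def f(U, lam):
    """Test loss with U and lam treated as independent variables."""
    return np.sum(G * (U @ np.diag(np.sqrt(lam)) @ U.T))

# Central finite differences for comparison.
eps = 1e-6
num_U = np.zeros_like(U)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(U)
        E[i, j] = eps
        num_U[i, j] = (f(U + E, lam) - f(U - E, lam)) / (2 * eps)
num_lam = np.array([(f(U, lam + eps * e) - f(U, lam - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
```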
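BP for global Gaussian embedding layer: matrix partition sub-layer
The goal is to compute $\frac{\partial f}{\partial X}$ given $\frac{\partial f}{\partial Y}$, where $Y = f_{MPL}(X) = \frac{1}{N}AX^TXA^T + \frac{2}{N}\big(AX^T\mathbf{1}b^T\big)_{sym} + B$.
From $\frac{\partial f}{\partial X} : dX = \frac{\partial f}{\partial Y} : dY$:
$\frac{\partial f}{\partial X} = \frac{2}{N}\big(XA^T + \mathbf{1}b^T\big)\Big(\frac{\partial f}{\partial Y}\Big)_{sym}A$
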
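A NumPy sketch of this backward step with a finite-difference check, assuming $X \in \mathbb{R}^{N\times d}$, $A = [I_d, \mathbf{0}]^T$, $b = [0,\dots,0,1]^T$, $B = bb^T$ and $(M)_{sym} = \frac{1}{2}(M + M^T)$ (stated here as assumptions; the helper names are hypothetical). The test loss $f = \langle G, Y\rangle$ with a random, non-symmetric $G$ exercises the symmetrization. Because $Y$ is quadratic in $X$, central differences are exact up to rounding.

```python
import numpy as np

def mpl_parts(d):
    """A = [I_d, 0]^T of shape (d+1, d) and b = [0, ..., 0, 1]^T of shape (d+1, 1)."""
    A = np.vstack([np.eye(d), np.zeros((1, d))])
    b = np.zeros((d + 1, 1))
    b[d, 0] = 1.0
    return A, b

def mpl_forward(X):
    """Y = (1/N) A X^T X A^T + (2/N)(A X^T 1 b^T)_sym + b b^T."""
    N, d = X.shape
    A, b = mpl_parts(d)
    M = A @ X.T @ np.ones((N, 1)) @ b.T
    return (A @ X.T @ X @ A.T) / N + (M + M.T) / N + b @ b.T

def mpl_backward(X, grad_Y):
    """df/dX = (2/N) (X A^T + 1 b^T) (df/dY)_sym A."""
    N, d = X.shape
    A, b = mpl_parts(d)
    Gs = 0.5 * (grad_Y + grad_Y.T)
    return (2.0 / N) * (X @ A.T + np.ones((N, 1)) @ b.T) @ Gs @ A

rng = np.random.default_rng(2)
N, d = 6, 3
X = rng.standard_normal((N, d))
G = rng.standard_normal((d + 1, d + 1))   # df/dY for the linear test loss f = <G, Y>
grad_X = mpl_backward(X, G)

eps = 1e-6
num_X = np.zeros_like(X)
for i in range(N):
    for j in range(d):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_X[i, j] = (np.sum(G * mpl_forward(X + E))
                       - np.sum(G * mpl_forward(X - E))) / (2 * eps)
```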
Global Gaussian distribution embedding network (G²DeNet): summary
Images → Conv. layers → global Gaussian embedding layer ($X \rightarrow Y = f_{MPL}(X) \rightarrow Z = f_{ESRL}(Y)$) → Loss
• Gaussian embedding: $\mathcal{N}(\mu,\Sigma) \mapsto \begin{bmatrix}\Sigma+\mu\mu^T & \mu\\ \mu^T & 1\end{bmatrix}^{\frac{1}{2}}$.
• Structural backpropagation: $\frac{\partial f}{\partial Y}$ and $\frac{\partial f}{\partial X}$.

Experiments on MS-COCO
890k segmented instances from the MS-COCO dataset, as in [DeepO²P, ICCV'15]: 80 classes, ~600k training instances and ~290k validation instances.
Comparison of classification errors on MS-COCO:
Method                       Err. (%)
AlexNet (baseline)           25.3
DeepO²P [ICCV'15]            28.6
DeepO²P-FC (S) [ICCV'15]     28.9
DeepO²P-FC [ICCV'15]         25.2
DMMs-FC [arXiv'15]           24.6
G²DeNet (Ours)               24.4
G²DeNet-FC (S) (Ours)        22.6
G²DeNet-FC (Ours)            21.5
[Figure: convergence curve of G²DeNet-FC with AlexNet on MS-COCO.]

Experiments on fine-grained visual recognition (FGVR): benchmarks
Dataset               Classes   Training   Test
Birds CUB-200-2011    200       5,994      5,794
FGVC-Aircraft         100       6,667      3,333
FGVC-Cars             196       8,144      8,041

Experiments on FGVR: results
Accuracy (%):
Method               Birds CUB-200-2011   FGVC-Aircraft   FGVC-Cars
FC-CNN               76.4                 74.1            79.8
FV-CNN               77.5                 77.6            85.7
VLAD-CNN             79.0                 80.6            85.6
NetFV [TPAMI'17]     79.9                 79.0            86.2
NetVLAD [CVPR'16]    81.9                 81.8            88.6
B-CNN [ICCV'15]      84.1                 84.1            91.3
G²DeNet (Ours)       87.1                 89.0            92.5
Comparison with counterparts using VGG-VD16 without bounding-box or part annotations, all sharing the same settings as B-CNN.
NetFV [TPAMI'17]: Lin et al. Bilinear CNNs for Fine-grained Visual Recognition. TPAMI, 2017.