

SLIDE 1

Deep Visual Learning on Hypersphere

Weiyang Liu*, Zhen Liu*
College of Computing, Georgia Institute of Technology

SLIDE 2
Outline

  • Why Learning on Hypersphere
  • Loss Design - Large-Margin Learning on Hypersphere
  • Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
  • Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
  • Conclusion

SLIDE 3
Outline

  • Why Learning on Hypersphere
  • Loss Design - Large-Margin Learning on Hypersphere
  • Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
  • Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
  • Conclusion

SLIDE 4

Why Learning on Hypersphere

  • An empirical observation
  • Set the output feature dimension to 2 in a CNN
  • Visualize the features directly, without using t-SNE

Deep features are naturally distributed over a sphere!

SLIDE 5

Why Learning on Hypersphere

  • Euclidean distance is not suitable for high-dimensional data.

More specifically: in high-dimensional space, vectors tend to be nearly orthogonal to each other, so the cross term in the Euclidean distance vanishes and the distance reduces to the two norms alone, which carries little discriminative information.
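The reduction the bullet refers to can be made explicit with the standard identity (the slide's original equation is missing, so this is a reconstruction):

```latex
\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle
\;\approx\; \|x\|^2 + \|y\|^2
\quad\text{when}\quad \langle x, y\rangle \approx 0
```

That is, the distance becomes dominated by the two magnitudes and no longer reflects the angle, which is where the semantic information lives.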

SLIDE 6

Why Learning on Hypersphere

  • Learning features on the hypersphere effectively regularizes the feature space.

In deep metric learning, features have to be normalized before entering the loss function.

Schroff et al. FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015
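For concreteness, a minimal PyTorch illustration of this normalization step (the tensor `feats` stands in for raw CNN embeddings; the batch and feature sizes are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

feats = torch.randn(32, 128)            # raw CNN embeddings (placeholder)
emb = F.normalize(feats, p=2, dim=1)    # project each vector onto the unit hypersphere
assert torch.allclose(emb.norm(dim=1), torch.ones(32), atol=1e-6)
```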

SLIDE 7
Outline

  • Why Learning on Hypersphere
  • Loss Design - Large-Margin Learning on Hypersphere
  • Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
  • Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
  • Conclusion

SLIDE 8

Large-Margin Learning on Hypersphere

  • A standard CNN usually uses the softmax loss as its learning objective.

How can we incorporate a margin on the hypersphere?
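For reference, the standard softmax (cross-entropy) loss the slide builds on, written for feature x_i with label y_i (a textbook form, not copied from the slide's missing equation):

```latex
L_{\text{softmax}} = -\log \frac{e^{W_{y_i}^{\top} x_i + b_{y_i}}}{\sum_{j} e^{W_{j}^{\top} x_i + b_j}}
```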

SLIDE 9

Large-Margin Learning on Hypersphere

  • The intuition (from binary classification)

If x belongs to class 1, the original softmax only requires the class-1 score to exceed the class-2 score. To make classification more rigorous and produce a decision margin, we require class 1 to win even after its angle is multiplied by a margin factor m.
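A hedged reconstruction of the slide's missing formulas, in the L-Softmax notation (biases omitted; θ_i is the angle between x and classifier weight W_i; m is an integer margin, with m = 1 recovering softmax):

```latex
\text{Softmax: } \|W_1\|\,\|x\|\cos\theta_1 > \|W_2\|\,\|x\|\cos\theta_2
\qquad
\text{Margin: } \|W_1\|\,\|x\|\cos(m\theta_1) > \|W_2\|\,\|x\|\cos\theta_2
```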

SLIDE 10

Large-Margin Learning on Hypersphere

Original Softmax Loss → (imposing a large margin) → Large-Margin Softmax Loss → (normalizing the classifier weights) → Angular Softmax Loss
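Written out, the progression looks as follows (a hedged reconstruction; biases omitted, θ_j the angle between feature x and weight W_j; the papers actually use a piecewise-monotonic extension ψ(θ) of cos(mθ), omitted here):

```latex
L_{\text{softmax}} = -\log\frac{e^{\|W_y\|\|x\|\cos\theta_y}}{\sum_j e^{\|W_j\|\|x\|\cos\theta_j}}
\qquad
L_{\text{L-Softmax}} = -\log\frac{e^{\|W_y\|\|x\|\cos(m\theta_y)}}{e^{\|W_y\|\|x\|\cos(m\theta_y)} + \sum_{j\neq y} e^{\|W_j\|\|x\|\cos\theta_j}}
```

Normalizing the classifier weights (‖W_j‖ = 1) then yields the angular softmax (A-Softmax) loss:

```latex
L_{\text{A-Softmax}} = -\log\frac{e^{\|x\|\cos(m\theta_y)}}{e^{\|x\|\cos(m\theta_y)} + \sum_{j\neq y} e^{\|x\|\cos\theta_j}}
```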

SLIDE 11

Learned Feature Visualization

  • 2D Feature Visualization on MNIST
  • 3D Feature Visualization on CASIA Face Dataset

Feature distributions shown for margins m = 1, 2, 3, 4.

SLIDE 12

Experimental Results

  • Face verification on the LFW and YTF datasets

SphereFace uses the angular large-margin softmax loss and achieves state-of-the-art performance with only 0.5M training images.

SLIDE 13

Experimental Results

  • Million-scale face recognition: the MegaFace Challenge

SphereFace ranked No. 1 from December 2016 to April 2017, and the current No. 1 entry is also built on SphereFace.

SLIDE 14

Demo

SLIDE 15
Outline

  • Why Learning on Hypersphere
  • Loss Design - Large-Margin Learning on Hypersphere
  • Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
  • Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
  • Conclusion

SLIDE 16

SphereNet

  • Traditional Convolution
  • Hyperspherical Convolution (SphereConv)

SphereConv normalizes each local patch of the feature map and each weight vector, so the output depends only on the angle between them (a sketch follows).
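A minimal PyTorch sketch of the cosine variant of SphereConv; the paper defines several angular variants, so treat this as an illustration of the normalization idea rather than the exact implementation:

```python
import torch
import torch.nn.functional as F

def sphere_conv(x, weight, eps=1e-8):
    """Cosine SphereConv sketch: the response is the cosine of the angle
    between each local patch and each filter, so both magnitudes are
    discarded. x: (N, C, H, W) feature map; weight: (K, C, kh, kw)."""
    K, C, kh, kw = weight.shape
    # im2col: every kh x kw patch becomes a column of length C*kh*kw.
    patches = F.unfold(x, kernel_size=(kh, kw))           # (N, C*kh*kw, L)
    # Project patches and filters onto the unit hypersphere.
    patches = F.normalize(patches, dim=1, eps=eps)
    w = F.normalize(weight.reshape(K, -1), dim=1, eps=eps)
    out = torch.einsum('kd,ndl->nkl', w, patches)         # cos(theta) in [-1, 1]
    N, _, H, W = x.shape
    return out.reshape(N, K, H - kh + 1, W - kw + 1)

# Usage: y = sphere_conv(torch.randn(2, 3, 8, 8), torch.randn(16, 3, 3, 3))
```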

SLIDE 17

SphereNet - Intuition from Fourier Transform

  • Semantic information is largely preserved when the Fourier magnitude is corrupted, but not when the phase (the angular information) is corrupted.
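This classic observation is easy to reproduce; a small PyTorch sketch (my illustration, not from the slides) that discards the magnitude entirely and keeps only the phase:

```python
import torch

def phase_only(img):
    """Keep the Fourier phase, force the magnitude to 1: the
    reconstruction still preserves the image structure, whereas a
    magnitude-only reconstruction does not."""
    spec = torch.fft.fft2(img)
    unit_mag = torch.exp(1j * torch.angle(spec))  # unit magnitude, original phase
    return torch.fft.ifft2(unit_mag).real

# Usage: out = phase_only(torch.rand(64, 64))
```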

SLIDE 18

Decoupled Convolution

Observation: the final feature is naturally decoupled, where the magnitude represents the intra-class variation.

SLIDE 19

General Framework - Decoupled Convolution

  • Decoupling the angle and magnitude of feature vectors
  • Allowing different designs of convolution operators for different tasks

Decoupled convolution splits the response into a magnitude term (intra-class variation) and an angle term (semantic difference).
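The diagram's decomposition can be written compactly (notation mine; h is the magnitude function and g the angular activation):

```latex
f_d(w, x) \;=\; h\big(\|w\|, \|x\|\big)\cdot g\big(\theta_{(w,x)}\big)
```

Standard convolution is the special case h(‖w‖, ‖x‖) = ‖w‖‖x‖ with g(θ) = cos θ, since w·x = ‖w‖‖x‖cos θ.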

SLIDE 20
Example Choices of Magnitude

  • SphereConv
  • BallConv
  • TanhConv
  • LinearConv
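One parameterization of these magnitude functions h(‖x‖), following my reading of the Decoupled Networks paper (α is a scale and ρ an operating radius; treat the exact constants as assumptions):

```latex
\text{SphereConv: } h = \alpha \qquad
\text{BallConv: } h = \alpha\,\frac{\min(\|x\|, \rho)}{\rho} \qquad
\text{TanhConv: } h = \alpha\rho\tanh\frac{\|x\|}{\rho} \qquad
\text{LinearConv: } h = \alpha\|x\|
```

SphereConv, BallConv, and TanhConv are bounded in ‖x‖, which matters for the optimization and robustness results later in the talk.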

SLIDE 21
Example Choices of Angle

  • Linear
  • Cosine
  • Squared Cosine
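And the corresponding angular activations g(θ), again as a hedged reconstruction from the Decoupled Networks paper:

```latex
\text{Linear: } g(\theta) = -\frac{2\theta}{\pi} + 1 \qquad
\text{Cosine: } g(\theta) = \cos\theta \qquad
\text{Squared cosine: } g(\theta) = \operatorname{sign}(\cos\theta)\cos^{2}\theta
```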

SLIDE 22

Generalization

With SphereConv, the top-1 accuracy of CNNs on ImageNet can be improved by ~1%.

Top-1 accuracy (center crop) of the baseline and SphereNet on ImageNet:

            Plain-CNN-9   Plain-CNN-12   ResNet-27
Baseline       58.31          61.42         65.54
SphereNet      59.23          62.27         66.49

* Differences from the original NeurIPS paper: 1) in ResNet, we use a fully connected layer instead of average pooling to obtain the final feature, which we found to be crucial for SphereNet; 2) we add L2 weight decay, which slows down optimization but yields better performance.

SLIDE 23
Adversarial Robustness and Optimization

* Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Mądry. How Does Batch Normalization Help Optimization? NeurIPS 2018.

SLIDE 24
Optimization Without BatchNorm

  • Without BatchNorm, decoupled convolutions outperform the baseline.
  • The bounded TanhConv can still be optimized, while the unbounded operators fail to converge.

Accuracies of different convolution operators on Plain-CNN-9 without BatchNorm. N/C indicates "not converged".

SLIDE 25

Adversarial Robustness

Bounded convolution operators are more robust against both the fast gradient sign method (FGSM) attack and its multi-step variant.

Results under natural training and adversarial training.

SLIDE 26

Adversarial Robustness

Attacking a decoupled convolution with bounded magnitude requires a perturbation of larger norm.

L2 and L∞ norms needed to attack the models on test-set samples.

SLIDE 27
Outline

  • Why Learning on Hypersphere
  • Loss Design - Large-Margin Learning on Hypersphere
  • Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
  • Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
  • Conclusion

SLIDE 28

Minimum Hyperspherical Energy

Intuition: more diversity among neurons → less redundancy → better generalization. Paper [1] shows that, in a one-hidden-layer network, maximizing diversity can eliminate spurious local minima. If two weight vectors in one layer are close to each other, they are probably redundant.

28

[1] Bo Xie, Yingyu Liang, and Le Song. Diverse Neural Network Learns True Target Functions. arXiv preprint arXiv:1611.03131, 2016.
SLIDE 29

Minimum Hyperspherical Energy

Proposed regularization: add a repulsion force between every pair of weight vectors within a layer. This connects to the Thomson problem: finding the minimum-energy configuration of mutually repelling electrons on a sphere.

SLIDE 30

Minimum Hyperspherical Energy

Loss function: the hyperspherical energy below is added as a regularizer. Minimizing it is generally non-trivial; for s = 2 the problem is actually NP-hard.
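A reconstruction of the hyperspherical energy from the MHE paper, in my notation (ŵ_i = w_i / ‖w_i‖ are the unit-normalized neurons of one layer; f_s is the Riesz kernel):

```latex
E_s(\hat{w}_1, \dots, \hat{w}_N)
= \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} f_s\big(\|\hat{w}_i - \hat{w}_j\|\big),
\qquad
f_s(z) =
\begin{cases}
  z^{-s}, & s > 0,\\
  \log z^{-1}, & s = 0.
\end{cases}
```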

SLIDE 31

Minimum Hyperspherical Energy

Although the orthonormality loss looks similar, it does not yield an ideal weight configuration even in the 3D case.

SLIDE 32

Minimum Hyperspherical Energy

The MHE loss is compatible with weight decay:

  • MHE regularizes the angles between weights
  • Weight decay regularizes the magnitudes of weights (a sketch combining the two follows)
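A minimal PyTorch-style sketch of how the two compose; the Riesz-energy form follows the paper, but the eps handling, the normalization constant, and the lambda hyperparameters are my assumptions:

```python
import torch

def mhe_loss(W, s=2, eps=1e-6):
    """MHE sketch: Riesz s-energy between the unit-normalized weight
    vectors of one layer (W: (N, D), one neuron per row). The energy
    blows up when two neurons point in nearly the same direction."""
    Wn = W / (W.norm(dim=1, keepdim=True) + eps)   # project onto the unit sphere
    dist = torch.cdist(Wn, Wn)                     # pairwise Euclidean distances
    n = W.shape[0]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=W.device)
    return (dist[off_diag] + eps).pow(-s).sum() / (n * (n - 1))

# Hypothetical total objective: MHE shapes the angles, weight decay the magnitudes.
# loss = task_loss + lam_mhe * mhe_loss(fc.weight) + lam_wd * fc.weight.pow(2).sum()
```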

SLIDE 33

Minimum Hyperspherical Energy

Collinearity issue: in this toy example, optimizing the original MHE produces collinear weight vectors. Half-space MHE: optimize the pairwise angles between lines (instead of vectors).
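One way to realize the line-based objective, assuming the virtual-neuron construction (repel each neuron from the negated copies too, so that w and -w count as the same line); `mhe_loss` is the sketch from the previous slide:

```python
import torch

def half_space_mhe_loss(W, s=2):
    # If w_i is antipodal to w_j, it nearly coincides with -w_j, so the
    # appended negated copies make collinear configurations expensive.
    return mhe_loss(torch.cat([W, -W], dim=0), s=s)
```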

SLIDE 34

MHE - Ablation Study

MHE on a 9-layer plain CNN on the CIFAR-10/100 datasets.

SLIDE 35
MHE - Ablation Study

  • MHE consistently improves the performance of networks.
  • When a network is hard to optimize due to neuron redundancy (small width or large depth), MHE helps more.

MHE with different network depths on CIFAR-100.

SLIDE 36
MHE - Ablation Study

  • MHE consistently improves the performance of networks.
  • When a network is hard to optimize due to neuron redundancy (small width or large depth), MHE helps more.

MHE with different network widths on CIFAR-100.

SLIDE 37

MHE Application - Image Recognition

MHE improves the performance of networks on ImageNet.

Top-1 error (center crop) of models on ImageNet.

SLIDE 38

MHE Application - Face Recognition

We add the MHE loss to the angular softmax loss of SphereFace and call the resulting model SphereFace+. The two terms are synergistic:

  • Angular softmax loss - intra-class compactness
  • MHE loss - inter-class separability

SLIDE 39

MHE Application - Face Recognition

Comparison with state-of-the-art results, and between SphereFace and SphereFace+.

SLIDE 40

MHE Application - Class-Imbalanced Recognition

Applying MHE to the final classifier enforces the prior that all categories are equally important, and thus improves performance.

Results on class-imbalanced recognition on CIFAR-10.

* Single - reduce the number of samples in only one category by 90%. Multiple - reduce the number of samples in multiple categories with different weights. Details are given in the paper.

SLIDE 41

MHE Application - Class Imbalanced Recognition

The category with less data tends to be ignored.

Visualization of the final CNN features.

SLIDE 42

MHE Application - GAN

With MHE added to the discriminator, the Inception Score of the spectral GAN improves from 7.42 to 7.68.

SLIDE 43
Outline

  • Why Learning on Hypersphere
  • Loss Design - Large-Margin Learning on Hypersphere
  • Convolution Operator - Deep Hyperspherical Learning and Decoupled Networks
  • Weight Regularization - Minimum Hyperspherical Energy for Regularizing Neural Networks
  • Conclusion

SLIDE 44
Conclusion

  • We introduce a hyperspherical learning framework for deep visual learning, where all neurons and classifiers are learned over a hypersphere.
  • Large-margin learning on the hypersphere is highly beneficial for tasks such as biometric verification and person re-identification, where features are expected to have large inter-class variation.
  • Hyperspherical networks and decoupled networks are the natural generalization obtained by applying hyperspherical learning to every layer of the network.
  • Minimum hyperspherical energy is a generic regularization that diversifies the neurons on a hypersphere and can improve generalization.

SLIDE 45

Source Code

SphereFace: https://github.com/wy1iu/sphereface
SphereNet: https://github.com/wy1iu/SphereNet
DCNet: https://github.com/wy1iu/DCNets
MHE: https://github.com/wy1iu/MHE
SphereFace+: https://github.com/wy1iu/sphereface-plus

SLIDE 46

Appendix

Architectures for the SphereNet ImageNet experiments:

Plain-CNN-9: 7x7 conv - maxpool - 3*(3x3 conv, 64) - 3*(3x3 conv, 128) - 3*(3x3 conv, 256) - fc(512) - classifier
Plain-CNN-12: 7x7 conv - maxpool - 3*(3x3 conv, 64) - 3*(3x3 conv, 128) - 3*(3x3 conv, 256) - 3*(3x3 conv, 512) - fc(512) - classifier
ResNet-27: 7x7 conv - maxpool - 3*(3x3 ResBlock, 64) - 3*(3x3 ResBlock, 128) - 3*(3x3 ResBlock, 256) - 3*(3x3 ResBlock, 512) - fc(512) - classifier

SLIDE 47

Appendix

Architecture for the MHE ablation study on CIFAR-10/100.
