We Don’t Need No Annotation (Efficient Training for Image Retrieval)

SLIDE 1

We Don’t Need No Annotation

(Efficient Training for Image Retrieval)

Ondra Chum

Visual Recognition Group, Department of Cybernetics, Faculty of Electrical Engineering, CTU in Prague

SLIDE 2

Outline

Algorithmic supervision for CNN training (local-feature-based methods)

  • CNN fine-tuning for efficient image retrieval
  • Sketch-based image retrieval with CNN descriptors

Unsupervised metric learning from data manifolds

SLIDE 3

CNN fine-tuning for image retrieval

Filip Radenović, Giorgos Tolias

  • F. Radenovic, G. Tolias and O. Chum, CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples, In ECCV 2016

SLIDE 4

Image Retrieval Challenges

  • Significant viewpoint and/or scale change
  • Significant illumination change
  • Severe occlusions
  • Visually similar but different objects

Old school: local features, photometric normalization, geometric constraints

CNNs: lots of training data, provide an image embedding, nearest neighbor search

SLIDE 5

Large Internet photo collection

Convolutional Neural Network (CNN) Image annotations Training

Lots of Training Examples

SLIDE 6

Large Internet photo collection

Convolutional Neural Network (CNN) Not accurate Not free $

Manual cleaning of the training data done by Researchers

Very expensive $$$$

Automated extraction of training data

Accurate Free $

Lots of Training Examples

SLIDE 7

CNN Image Retrieval

  • Image representation created from CNN activations of a network pre-trained for a classification task

[Gong et al. ECCV’14, Razavian et al. arXiv’14, Babenko et al. ICCV’15, Kalantidis et al. arXiv’15, Tolias et al. ICLR’16]

+ Retrieval accuracy suggests generalization of CNNs
- Trained for image classification, NOT for the retrieval task

Images from ImageNet.org

SLIDE 8

CNN Image Retrieval


Same Class

SLIDE 9

CNN Image Retrieval

  • CNN re-trained using a dataset that contains landmarks and buildings as object classes.

[Babenko et al. ECCV’14]

+ Training dataset closer to the target task
- Final metric different from the one actually optimized
- Constructing training datasets requires manual effort

SLIDE 10

CNN Image Retrieval


Same Class

Image from [Babenko et al. ECCV’14]

SLIDE 11

CNN Image Retrieval

  • NetVLAD: end-to-end fine-tuning for image retrieval. Geo-tagged dataset for weakly supervised fine-tuning.

[Arandjelovic et al. CVPR’16]

+ Training dataset corresponds to the target task
+ Final metric corresponds to the one actually optimized
- Training dataset requires geo-tags

SLIDE 12

CNN Image Retrieval


query

Camera Orientation Unknown

unknown

SLIDE 13

CNN learns from BoW – Training Data

Input: Large unannotated dataset

  1. Initial clusters created by grouping spatially related images [Chum & Matas PAMI’10]
  2. Clustered images used as queries for a retrieval-SfM pipeline [Schonberger et al. CVPR’15]

Output: non-overlapping 3D models, 551 (134k) for training and 162 (30k) for validation

Camera orientation: known
Number of inliers: known

SLIDE 14

Hard Negative Examples

Figure labels: anchor; the most similar CNN descriptor; naive hard negatives (top k by CNN); diverse hard negatives (top k, one per 3D model)

Axis label: increasing CNN descriptor distance to the anchor

  • Negative examples: images from different 3D models than the anchor
  • Hard negatives: the negative examples closest to the anchor
  • Using only hard negatives is as good as using all negatives, but faster
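The "diverse hard negatives" selection can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `diverse_hard_negatives`, the `(image_id, model_id, descriptor)` tuple layout, and the toy descriptors are all assumptions made for the example.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two descriptors given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diverse_hard_negatives(anchor_desc, anchor_model, pool, k):
    """Pick the k closest negatives, allowing at most one image per 3D model.

    pool is a list of (image_id, model_id, descriptor) tuples (hypothetical layout).
    A negative is any image whose 3D model differs from the anchor's model.
    """
    negatives = [
        (l2_distance(anchor_desc, desc), image_id, model_id)
        for image_id, model_id, desc in pool
        if model_id != anchor_model
    ]
    negatives.sort()  # closest (hardest) negatives first

    chosen, used_models = [], set()
    for _, image_id, model_id in negatives:
        if model_id in used_models:
            continue  # this 3D model already contributed a negative
        chosen.append(image_id)
        used_models.add(model_id)
        if len(chosen) == k:
            break
    return chosen

# Toy data: the anchor belongs to model "A"; models "B" and "C" supply negatives.
pool = [
    ("a1", "A", [0.9, 0.1]),
    ("b1", "B", [0.8, 0.2]),
    ("b2", "B", [0.7, 0.3]),
    ("c1", "C", [0.0, 1.0]),
]
selected = diverse_hard_negatives([1.0, 0.0], "A", pool, k=2)
```

Dropping the `used_models` check would give the naive variant, which here would return both images of model "B".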

SLIDE 15

Hard Positive Examples

Figure labels: anchor; top 1 by CNN (used in NetVLAD); top 1 by BoW; random from top k by BoW; arrow label: harder positives

  • Positive examples: images that share 3D points with the anchor
  • Hard positives: positive examples not close enough to the anchor

SLIDE 16

CNN Siamese Learning

Diagram: two branches, one for the query and one for the positive image. Each branch: convolutional layers → pooling (MAC & L2-norm) → D x 1 CNN descriptor. Both descriptors and the pair label (1: positive, 0: negative) feed the contrastive loss.

MATCHING PAIR

SLIDE 17

CNN Siamese Learning

Diagram: two branches, one for the query and one for the second image. Each branch: convolutional layers → pooling (MAC & L2-norm) → D x 1 CNN descriptor. Both descriptors and the pair label (1: positive, 0: negative) feed the contrastive loss.

NON-MATCHING PAIR
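The slides do not spell out the loss, so the following is a sketch of the standard contrastive loss for a pair of descriptors with the 1/0 pair label shown in the diagram. The function name and the margin value are illustrative assumptions, not values taken from the talk.

```python
import math

def contrastive_loss(desc_a, desc_b, label, margin=0.7):
    """Standard contrastive loss for one descriptor pair.

    label = 1 (matching pair): penalize any distance between the descriptors.
    label = 0 (non-matching pair): penalize only pairs closer than the margin.
    The margin default is an arbitrary illustrative value.
    """
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(desc_a, desc_b)))
    if label == 1:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

# A matching pair one unit apart is pulled together.
loss_match = contrastive_loss([1.0, 0.0], [0.0, 0.0], label=1)
# A non-matching pair already farther apart than the margin costs nothing.
loss_far_negative = contrastive_loss([1.0, 0.0], [0.0, 1.0], label=0)
# A non-matching pair with identical descriptors is pushed apart.
loss_hard_negative = contrastive_loss([1.0, 0.0], [1.0, 0.0], label=0)
```

Because non-matching pairs beyond the margin contribute zero loss, only hard negatives produce a gradient, which is consistent with the emphasis on hard example selection on the earlier slides.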

SLIDE 18

Component Contributions (AlexNet)

Method                                        Oxford 5k   Paris 6k
MAC: off-the-shelf                              44.2        51.6
MAC: top 1 CNN + top k CNN                      56.2        63.1
MAC: top 1 CNN + top 1 / model CNN              56.7        63.9
MAC: top 1 BoW + top 1 / model CNN              59.7        67.1
MAC: random(top k BoW) + top 1 / model CNN      60.2        67.5
MAC: learned whitening                          62.2        68.9

Careful choice of positive and negative training images makes a difference

Pipeline: global max pooling & L2-norm → D x 1 CNN descriptor (end-to-end learning) → whitening and optional dim reduction (post-processing)
SLIDE 19

Global Pooling

Pipeline: global pooling & L2-norm → D x 1 CNN descriptor (end-to-end learning) → whitening and optional dim reduction (post-processing)

  • MAC (max pooling): Maximum Activations of Convolutions [Tolias et al. ICLR’16]
  • SPoC (sum pooling): Sum-Pooled Convolutional [Babenko et al. ICCV’15]
  • GeM (generalized mean pooling): Generalized Mean [Radenovic, Tolias, Chum: TPAMI 2018]; p = 1 gives average pooling, p = inf gives max pooling
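To make the relation between the three pooling schemes concrete, here is a small sketch of generalized mean pooling over one feature map. Each channel is pooled as the p-th root of the mean of its activations raised to the power p. The function names, the flat-list layout of a channel, the default p = 3 and the clamping constant `eps` are assumptions for the example.

```python
import math

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized mean pooling.

    feature_map: list of channels, each a flat list of non-negative activations
    (for example, post-ReLU values over all spatial positions).
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_map:
        # Clamp to eps so that zero activations stay well defined for any p.
        mean_pow = sum(max(v, eps) ** p for v in channel) / len(channel)
        pooled.append(mean_pow ** (1.0 / p))
    return pooled

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

channel = [1.0, 2.0, 3.0]
average_like = gem_pool([channel], p=1.0)[0]   # p = 1: plain average
max_like = gem_pool([channel], p=100.0)[0]     # large p: approaches the maximum
descriptor = l2_normalize(gem_pool([channel, [0.5, 0.5, 4.0]]))
```

In a trainable layer p would be a learnable parameter; here it is a plain argument so the two limiting cases are easy to check.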

SLIDE 20

Component Contributions (AlexNet)

Careful choice of positive and negative training images makes a difference

Method                                        Oxford 5k   Paris 6k
MAC: off-the-shelf                              44.2        51.6
MAC: top 1 CNN + top k CNN                      56.2        63.1
MAC: top 1 CNN + top 1 / model CNN              56.7        63.9
MAC: top 1 BoW + top 1 / model CNN              59.7        67.1
MAC: random(top k BoW) + top 1 / model CNN      60.2        67.5
MAC: learned whitening                          62.2        68.9
GeM: random(top k BoW) + top 1 / model CNN      60.1        68.6
GeM: learned whitening                          67.7        75.5

SLIDE 21

Teacher vs. Student (VGG)

Method              Oxf5k   Oxf105k   Par6k   Par106k
BoW(16M)+R+QE        84.9    79.5      82.4    77.3
CNN-MAC(512D)        79.7    73.9      82.4    74.6

SLIDE 22

Teacher vs. Student (VGG)

Method              Oxf5k   Oxf105k   Par6k   Par106k
BoW(16M)+R+QE        84.9    79.5      82.4    77.3
CNN-MAC(512D)        79.7    73.9      82.4    74.6
CNN-GeM(512D)        86.4    81.3      88.1    81.7
CNN-GeM(512D)+QE     90.7    88.6      92.2    88.0

Our CNN with GeM layer surpasses its teacher on all datasets!!! BUT…

SLIDE 23

Teacher vs. Student for small objects

Figure: query regions and results, CNN vs. BoW+geometry

SLIDE 24

CNN fine-tuning for sketch-based image retrieval

Filip Radenović, Giorgos Tolias

SLIDE 25

Sketch-based Image Retrieval

SLIDE 26

Sketch-based Image Retrieval

SLIDE 27

Training Data

SLIDE 28

Matching Sketches to Images

  • Classical approach: shape matching. Image → edge map, aligned with the sketch; no training.
  • Modern approach: end-to-end deep learning. Training data (very expensive): training data + category + similarity; man-years of annotation; very difficult to train.
  • Ours: deep shape matching. Image → edge map and sketch; training data (relatively cheap); shape information only; simple cost & training.

SLIDE 29

Category Retrieval

Figure labels: query; result; pig

Shape-based retrieval cannot do that

SLIDE 30

Category Retrieval

Figure label: result

Standard image search has been able to do that for years

SLIDE 31

Edge-maps vs Sketches

SLIDE 32

Training without a Single Sketch

CNN Siamese learning with contrastive loss

SLIDE 33

EdgeMAC Architecture

Pipeline: edge detector [Dollár & Zitnick ICCV’13] → edge filtering layer → VGG (1st layer: RGB averaged to intensity) → global max pooling & L2-norm → D x 1 CNN descriptor (end-to-end learning) → whitening and optional dim reduction (post-processing)

Figure labels: edges; filtered

SLIDE 34

Results on Flickr 15k

Radenovic, Tolias, Chum: Generic Sketch-Based Retrieval Learned without Drawing a Single Sketch, arXiv 2017

[21] Hu & Collomosse: A performance evaluation of gradient field hog descriptor for sketch based image retrieval. CVIU’13

SLIDE 35

Results on Shoes, Chairs and Handbags

Image from https://www.eecs.qmul.ac.uk/~qian/Project_cvpr16.html

Fine-grained recognition of shoes / chairs [53] Q. Yu et al.: Sketch me that shoe. CVPR’16.

SLIDE 36

Results on Shoes, Chairs and Handbags

SLIDE 37

Beyond sketches

Image-based Edge-based

SLIDE 38

Shape matching for domain generalization

SLIDE 39

Domain generalization

SLIDE 40

Domain generalization via shape matching

Pipeline: edge filtering → global max pooling & L2-norm → D x 1 CNN descriptor

Linear classifier on EdgeMAC descriptors

SLIDE 41

Results on domain generalization

A: Artwork, C: Cartoon, P: Photo, S: Sketch

SLIDE 42

Metric Learning Without Labels

Ahmet Iscen, Yannis Avrithis, Giorgos Tolias, Teddy Furon

SLIDE 43

Iscen, Tolias, Avrithis, Furon, Chum, Efficient Diffusion on Region Manifolds: Recovering Small Objects with Compact CNN Representations, CVPR’17

Euclidean & manifold distance

  • Mapping: images to R^n descriptors
  • The Euclidean distance is locally a good similarity measure
  • Related images lie on non-linear manifolds

SLIDE 44

Euclidean & manifold distance

SLIDE 45

Diffusion

k-Nearest Neighbour graph

Equation labels: vector of similarities to the query; normalized (sparse) affinity matrix; query indicator vector (1 at the query's position)

Random walk implicitly considers all paths (visual proof)

SLIDE 46

Diffusion

k-Nearest Neighbour graph

  • Closed form: large, non-sparse
  • Iterative: large, sparse
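The equations on this slide did not survive extraction. As a hedged illustration, the sketch below uses the standard diffusion formulation, which is an assumption here: with a symmetrically normalized affinity matrix S, a query indicator vector y and a damping factor alpha, the iteration f ← alpha·S·f + (1 − alpha)·y converges to the closed form f = (1 − alpha)(I − alpha·S)^(-1) y. The symbol names, the toy chain graph and alpha = 0.9 are illustrative choices.

```python
import math

def normalize_affinity(W):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2) of a zero-diagonal affinity matrix."""
    deg = [sum(row) for row in W]
    n = len(W)
    return [
        [W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

def matvec(M, v):
    """Dense matrix-vector product (a real system would use a sparse matrix)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def diffuse(S, y, alpha=0.9, iters=500):
    """Iterate f <- alpha * S f + (1 - alpha) * y for a fixed number of steps."""
    f = list(y)
    for _ in range(iters):
        Sf = matvec(S, f)
        f = [alpha * s + (1.0 - alpha) * yi for s, yi in zip(Sf, y)]
    return f

# Toy kNN graph: a chain 0 - 1 - 2 - 3, with node 0 as the query.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
S = normalize_affinity(W)
y = [1.0, 0.0, 0.0, 0.0]
f = diffuse(S, y)  # similarity to the query decays along the chain
```

The iteration only ever multiplies by the sparse S, whereas the closed form needs the inverse of (I − alpha·S), which is dense in general. That matches the "large, sparse" versus "large, non-sparse" labels on the slide.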

SLIDE 47

Contributions on Diffusion for Retrieval

  • System of linear equations, Conjugate Gradients
  • Generalization to novel queries (not part of the dataset)
  • Diffusion can be efficiently applied to image parts
  • Low-rank approximation (small, non-sparse)
  • Significant impact on CNN-based retrieval of small objects
  • Two orders of magnitude faster online diffusion

Iterative: Jacobi solver
Closed form: intractable

[CVPR 2017] [CVPR 2018] [ACCV 2018]
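The "system of linear equations, Conjugate Gradients" idea can be sketched as follows, assuming the standard formulation in which the ranking vector f solves (I − alpha·S) f = y. The matrix is never inverted or even formed; only products with the sparse S are needed. All names, the toy graph and alpha = 0.9 are assumptions for the example, and this is a textbook conjugate gradient routine rather than the authors' implementation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_operator(S, alpha):
    """Return a function computing (I - alpha * S) v without forming the matrix."""
    def apply_A(v):
        return [vi - alpha * dot(row, v) for row, vi in zip(S, v)]
    return apply_A

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite operator A."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the starting point x = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = apply_A(p)
        step = rs / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * ai for ri, ai in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Symmetrically normalized affinity of the chain graph 0 - 1 - 2 - 3 (toy data).
h = 1.0 / math.sqrt(2.0)
S = [
    [0.0, h, 0.0, 0.0],
    [h, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, h],
    [0.0, 0.0, h, 0.0],
]
alpha = 0.9
y = [1.0, 0.0, 0.0, 0.0]   # query indicator: node 0 is the query
apply_A = make_operator(S, alpha)
f = conjugate_gradient(apply_A, y)
```

Since the eigenvalues of S lie in [−1, 1] and alpha < 1, the operator I − alpha·S is positive definite, which is the condition conjugate gradients needs.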

SLIDE 48

Euclidean vs Manifold Distance

Diffusion guides the sampling of hard negatives and positives

  • Avoid computationally expensive SfM models
  • A. Iscen, G. Tolias , Y. Avrithis, O. Chum, Mining on Manifolds: Metric Learning without Labels, In CVPR 2018

SLIDE 49

Mining of training samples

Figure panels: anchors; mined positives; Euclidean kNN; mined negatives; Euclidean non-kNN
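One way to read the panel labels is as a comparison of two neighbour lists per anchor. The sketch below assumes the following rule, which is not spelled out on the slide: mined positives are manifold (diffusion) neighbours that are not among the Euclidean k nearest neighbours, and mined negatives are Euclidean neighbours that are not manifold neighbours. The function name and the image ids are hypothetical.

```python
def mine_pairs(manifold_nn, euclidean_nn):
    """Compare two neighbour lists of one anchor (lists of image ids).

    Assumed rule (a reading of the slide, not a confirmed specification):
      positives: close on the manifold but not among the Euclidean neighbours
      negatives: among the Euclidean neighbours but not close on the manifold
    """
    manifold_set = set(manifold_nn)
    euclidean_set = set(euclidean_nn)
    positives = [i for i in manifold_nn if i not in euclidean_set]
    negatives = [i for i in euclidean_nn if i not in manifold_set]
    return positives, negatives

# Toy neighbour lists for a single anchor.
manifold_neighbours = [3, 7, 9, 12]
euclidean_neighbours = [3, 5, 7, 8]
positives, negatives = mine_pairs(manifold_neighbours, euclidean_neighbours)
# positives == [9, 12], negatives == [5, 8]
```

Both lists are exactly the cases where the two distances disagree, which is where a descriptor trained on Euclidean distance has the most to learn.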

SLIDE 50

Experiments on instance search

SLIDE 51

Experiments on instance search

vs

SLIDE 52

Mining of training samples

Figure panels: anchors; mined positives; Euclidean kNN; mined negatives; Euclidean non-kNN

SLIDE 53

Experiments on fine-grained recognition

SLIDE 54

Online code and data

Siamese training code and training data: http://cmp.felk.cvut.cz/cnnimageretrieval/

  • Image retrieval (ECCV 2016)
      • Matlab package using MatConvNet
      • Python package using PyTorch
  • Sketch-based image retrieval (ECCV 2018)
      • Matlab package using MatConvNet

Region manifold search (CVPR 2017): https://github.com/ahmetius/diffusion-retrieval

  • Matlab package

SLIDE 55

Conclusions

  • No human annotation needed for CNN image retrieval
      • CNN outperforms its teacher on standard benchmarks
      • BoW still better for certain tasks
  • No human annotation needed for CNN sketch-based retrieval
      • Generic CNN shape retrieval performs well
      • Standard and fine-grained sketch-based retrieval
      • Significant appearance changes, domain generalization

BoW combined with SfM is a good teacher

Mining on Manifolds

  • Fine-tuning CNNs without supervision
  • Using diffusion to compute manifold distance

SLIDE 56

Thank you.