
SLIDE 1

MIL-UT at ILSVRC2014

Senthil Purushwalkam, Yuichiro Tsuchiya, Atsushi Kanehira, Asako Kanezaki and *Tatsuya Harada The University of Tokyo

IIT Guwahati (undergrad) -> Virginia Tech (intern)

SLIDE 2

Pipeline of CLS-LOC task

1-1 Scoring each bounding box by R-CNN
Input image -> Extract region proposals -> Extract CNN features (fc7) -> Scoring regions by multiclass PA
Classifier: Averaged multiclass Passive Aggressive with hard negative mining

1-2 Scoring whole image by FV as contextual scores
Whole image -> Extract FV with spatial information -> Scoring whole image by multiclass PA
Classifier: Averaged multiclass Passive Aggressive

Late fusion -> Score

Multiclass Object Detection with hard negative classes

SLIDE 3

Region Proposals and Feature Extraction

R-CNN
R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.

Region proposals
Selective Search
J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.

CNN features
Single CNN model (5 conv layers, 2 fully connected layers)
Pre-computed ILSVRC13 model
http://www.cs.berkeley.edu/~rbg/r-cnn-release1-data-ilsvrc2013-caffe-proto-v0.tgz
No fine-tuning
4096 dim fc7 features


SLIDE 4

Multiclass Object Detection

Hard negative classes
Idea: Create a ‘negative’ class for each of the K positive classes and train on 2K classes
A. Kanezaki, S. Inaba, Y. Ushiku, Y. Yamashita, H. Muraoka, Y. Kuniyoshi and T. Harada. Hard Negative Classes for Multiple Object Detection. ICRA, 2014.

Minimize detection errors as well as classification errors
Passive Aggressive algorithm with hard negative mining


SLIDE 5

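Multiclass object detection (training with negative classes)

We use Passive Aggressive (PA) [Crammer et al., 2006] to learn multi-class linear classifiers.

$W\mathbf{x}_t = (\mathbf{w}_1^\top\mathbf{x}_t,\ \mathbf{w}_2^\top\mathbf{x}_t,\ \dots,\ \mathbf{w}_K^\top\mathbf{x}_t)^\top$ = (Score of class 1, Score of class 2, …, Score of class K)

$r$: Positive class
$s$: Negative class with the highest score

$W^{(t+1)} = \arg\min_{W} \frac{1}{2}\lVert W - W^{(t)}\rVert^2 + C\zeta \quad \text{s.t.} \quad \ell(\mathbf{x}_i^{t}, y_i^{t}; W) \le \zeta,\ \zeta \ge 0$

where

$\mathbf{w}_r^{(t+1)} = \mathbf{w}_r^{(t)} + \tau_t\mathbf{x}_t$

$\mathbf{w}_s^{(t+1)} = \mathbf{w}_s^{(t)} - \tau_t\mathbf{x}_t$

$\tau_t = \min\left\{ C,\ \dfrac{1 - \left(\mathbf{w}_r^{(t)\top}\mathbf{x}_t - \mathbf{w}_s^{(t)\top}\mathbf{x}_t\right)}{2\lVert\mathbf{x}_t\rVert^2} \right\}$

[Figure: the score vector $W\mathbf{x}_t$, with the ERROR measured between the scores of $r$ and $s$]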
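The update above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the team's implementation: the function name `pa_update`, the row-per-class layout of `W`, and the clipping of the loss at zero (the usual hinge form, so that a sample with margin of at least 1 causes no update) are assumptions added here.

```python
import numpy as np

def pa_update(W, x, r, s, C=1.0):
    """One Passive Aggressive step for a multiclass linear classifier.

    W : (num_classes, dim) array, one weight vector per row (assumed layout).
    x : (dim,) non-zero feature vector of the current training sample.
    r : index of the correct class; s : index of the highest-scoring wrong class.
    """
    margin = W[r] @ x - W[s] @ x
    # Hinge clipping is an assumption: no update once the margin reaches 1.
    loss = max(0.0, 1.0 - margin)
    # Step size as on the slide: min{C, loss / (2 * ||x||^2)}.
    tau = min(C, loss / (2.0 * (x @ x)))
    W = W.copy()
    W[r] += tau * x   # raise the score of the correct class
    W[s] -= tau * x   # lower the score of the competing class
    return W, tau

W0 = np.zeros((3, 2))
x0 = np.array([1.0, 0.0])
W1, tau1 = pa_update(W0, x0, r=0, s=1)   # aggressive step: margin was 0
W2, tau2 = pa_update(W1, x0, r=0, s=1)   # passive step: margin is now 1
```

With `C` large enough, a single step moves the two weight vectors just far enough that the margin on the current sample becomes exactly 1, which is why the second call leaves `W` unchanged.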

SLIDE 6

Multiclass object detection (training with negative classes)

Core idea: Hard negative classes

$W\mathbf{x}_t$ = (Score of class 1, Score of class 2, …, Score of class K, Score of negative class 1, Score of negative class 2, …, Score of negative class K)

with weight vectors $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$ for the positive classes and $\mathbf{w}'_1, \mathbf{w}'_2, \dots, \mathbf{w}'_K$ for the negative classes.

ERROR: $\ell(\mathbf{x}_i^{t}, y_i^{t}; W)$

Cf.) A single background class $\mathbf{w}_{bg}$ does not work.

SLIDE 7

Multiclass object detection (training with negative classes)

Ex.) If a training sample $\mathbf{x}_t$ is a positive sample of class 2:

$r$ = class 2
$s$: Negative class with the highest score
Candidates of $s$: class 1, 3, …, K, or negative class 2

ERROR $\ell(\mathbf{x}_i^{t}, y_i^{t}; W)$: Classification error

SLIDE 8

Multiclass object detection (training with negative classes)

Ex.) If a training sample $\mathbf{x}_t$ is a negative sample of class 2:

$r$ = negative class 2
$s$ = class 2

ERROR $\ell(\mathbf{x}_i^{t}, y_i^{t}; W)$: Detection error
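The two examples (positive sample of class 2, negative sample of class 2) suggest how $r$ and $s$ are picked among the 2K classes. The sketch below is one reading of the slides, not the authors' code: the function name `select_r_s`, the zero-based class indices, and the layout "indices 0..K-1 are positive classes, K..2K-1 are their negative classes" are all assumptions.

```python
def select_r_s(scores, label, is_positive, K):
    """Pick the correct class r and the competing class s among 2K classes.

    scores      : sequence of 2K scores; index c is positive class c,
                  index K + c is the negative class of c (assumed layout).
    label       : zero-based class index the sample belongs to.
    is_positive : True for a positive sample, False for a hard negative.
    """
    if is_positive:
        r = label
        # Competitors: every other positive class, plus this class's negative class.
        candidates = [c for c in range(K) if c != label] + [K + label]
    else:
        r = K + label
        # A hard negative of a class competes only with that positive class.
        candidates = [label]
    s = max(candidates, key=lambda c: scores[c])
    return r, s

K = 3
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.0]
pos_pair = select_r_s(scores, label=1, is_positive=True, K=K)
neg_pair = select_r_s(scores, label=1, is_positive=False, K=K)
```

In the first call the strongest competitor is the negative class of the labelled class, and an update there reduces classification error; in the second call the roles are swapped, and the update reduces detection error.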

SLIDE 9

Features for Contextual Scores

Improved Fisher Vector

– F. Perronnin, J. Sanchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. ECCV, 2010.
– INRIA's Fisher vector implementation: http://lear.inrialpes.fr/src/inria_fisher/
– L2 normalization, Power normalization, Spatial pyramid

Parameters of IFV for all local features in our system

– Dimension reduction of local feature (D): 64 dim
– # of components in GMM (K): 256
– 5 scales of local patches
– Spatial pyramid (P): 1x1 + 2x2 + 3x1 = 8
– Dimension of IFV: 2PKD = 262,144 dim
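As a quick check of the arithmetic behind the 2PKD figure, the snippet below recomputes the number of spatial-pyramid cells and the final descriptor length from the parameters listed above. The variable names are chosen here for illustration only.

```python
D = 64                      # local descriptor dimension after reduction
K = 256                     # number of GMM components
P = 1 * 1 + 2 * 2 + 3 * 1   # spatial pyramid cells: 1x1 + 2x2 + 3x1
dim = 2 * P * K * D         # 2 = gradients w.r.t. GMM means and variances
print(P, dim)               # prints: 8 262144
```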

Local Descriptors

– SIFT


SLIDE 10

Classifiers for Contextual Scores


SLIDE 11

Online Learning for Large-Scale Visual Recognition

Three guidelines
Y. Ushiku, M. Hidaka, T. Harada. Three Guidelines of Online Learning for Large-Scale Visual Recognition. CVPR, 2014.

1. Perceptron can compete against the latest methods.
– Provided that the second guideline is observed.

2. Averaging is necessary for any algorithm.
– First-order algorithms w/o averaging cannot compete against second-order algorithms.
– When averaging is used, the accuracies of all algorithms become very close to each other.

3. Investigate multiclass learning first.
– Both one-versus-the-rest learning and multiclass learning achieve similar accuracy.
– However, one-versus-the-rest takes much longer CPU time to converge than multiclass does.

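Averaging

$\bar{\boldsymbol{\mu}} = \dfrac{\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2 + \cdots + \boldsymbol{\mu}_T}{T}$

$\arg\max_{y \in Y \setminus y_i} \boldsymbol{\mu}_y^\top \mathbf{x}_i$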
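The second guideline, averaging, amounts to predicting with the mean of the weights seen over all online updates instead of the final weights. The sketch below shows two equivalent ways to compute that mean. The helper names and the incremental form are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def average_weights(snapshots):
    """Batch form: mean of the weight arrays recorded after each of T updates."""
    return np.mean(np.stack(snapshots), axis=0)

def running_average(avg, w, t):
    """Incremental form: fold the t-th snapshot (t starts at 1) into the mean."""
    return avg + (w - avg) / t

snapshots = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([5.0, 4.0])]
batch = average_weights(snapshots)

running = np.zeros(2)
for t, w in enumerate(snapshots, start=1):
    running = running_average(running, w, t)
```

The incremental form avoids storing every snapshot, which matters when each weight matrix has one row per class and hundreds of thousands of feature dimensions.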

SLIDE 12

Late Fusion

1-1 Scoring each bounding box by R-CNN
Input image -> Extract region proposals -> Compute CNN features (fc7) -> Scoring regions by Multiclass PA for each class
Multiclass PA for class 1, …, class $j$, …, class 1000 -> $S_{i,1}^{CNN}, \dots, S_{i,j}^{CNN}, \dots, S_{i,1000}^{CNN}$

1-2 Scoring whole image by FV as contextual scores
Whole image -> Extract FV with spatial information -> Scoring by linear classifier trained by PA for each class
Multiclass PA for class 1, …, class $j$, …, class 1000 -> $S_{1}^{FV}, \dots, S_{j}^{FV}, \dots, S_{1000}^{FV}$

2. Rescoring by combining R-CNN feature and FV

For bounding box $i$, class $j$,

$S_{i,j}^{new} = S_{i,j}^{CNN}\, S_{j}^{FV}$
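A minimal sketch of the rescoring step follows, assuming the fused score is the product of the per-box R-CNN score and the per-class whole-image FV score (the two terms appear side by side on the slide with no operator between them). The array shapes and the name `late_fusion` are assumptions for illustration.

```python
import numpy as np

def late_fusion(s_cnn, s_fv):
    """Rescore region proposals with whole-image context.

    s_cnn : (num_boxes, num_classes) region scores from the R-CNN branch.
    s_fv  : (num_classes,) whole-image scores from the Fisher vector branch.
    Returns a (num_boxes, num_classes) array of fused scores.
    """
    # Broadcast the per-class contextual score over all boxes.
    return s_cnn * s_fv[np.newaxis, :]

s_cnn = np.array([[0.2, 0.5],
                  [0.4, 0.1]])
s_fv = np.array([0.5, 2.0])
fused = late_fusion(s_cnn, s_fv)

# Best (box, class) pair after fusion.
best_box, best_class = np.unravel_index(np.argmax(fused), fused.shape)
```

Because the contextual score depends only on the class, fusion never reorders boxes within a class. It changes which classes win for the image as a whole.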

SLIDE 13

Results

Validation dataset

Method                                                               Localization error   Classification error
R-CNN feature + one-vs-all SVMs                                      0.631743             0.460080
R-CNN feature + multi-class PA                                       0.446121             0.285720
R-CNN feature + multi-class PA using hard negative classes           0.387516             0.227200
R-CNN feature + multi-class PA using hard negative classes, and FV   0.341743             0.18768

Test dataset

Team name        Localization error   Classification error
VGG              0.253231             0.07405
GoogLeNet        0.264414             0.14828
SYSU_Vision      0.31899              0.14446
MIL (our team)   0.337414             0.20734

SLIDE 14

Conclusion

Our pipeline

– R-CNN based region proposals and features with multi-class object detectors, which create a hard negative class for each positive class
– Global features (FVs) with multi-class online learning
– Late fusion of region score and global score

Combining R-CNN with the contextual information improves the localization performance.

Multi-class object detector trained with ‘hard negative classes’ outperforms one-vs.-the-rest SVMs.
