MIL-UT at ILSVRC2014



slide-1
SLIDE 1

MIL-UT at ILSVRC2014

Senthil Purushwalkam, Yuichiro Tsuchiya, Atsushi Kanehira, Asako Kanezaki and *Tatsuya Harada The University of Tokyo

IIT Guwahati (undergrad) -> Virginia Tech (intern)
slide-2
SLIDE 2

Pipeline of CLS-LOC task

1-1 Scoring each bounding box by RCNN

  • Input image -> Extract region proposals -> Extract CNN features (fc7) -> Scoring regions by multiclass PA
  • Averaged multiclass Passive Aggressive with hard negative mining

1-2 Scoring whole image by FV as contextual scores

  • Whole image -> Extract FV with spatial information -> Scoring whole image by multiclass PA
  • Averaged multiclass Passive Aggressive

Late fusion -> Score

Multiclass Object Detection with hard negative classes

slide-3
SLIDE 3

Region Proposals and Feature Extraction

  • R-CNN
  • R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR, 2014.
  • Region proposals
  • Selective Search
  • J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders. Selective Search for Object Recognition. IJCV, 2013.

  • CNN features
  • Single CNN model (5 conv layers, 2 fully connected layers)
  • Pre-computed ILSVRC13 model
  • http://www.cs.berkeley.edu/~rbg/r-cnn-release1-data-ilsvrc2013-caffe-proto-v0.tgz
  • No fine-tuning
  • 4096 dim fc7 features


slide-4
SLIDE 4

Multiclass Object Detection

  • Hard negative classes
  • Idea: Create 'negative' classes and train on 2K classes (K positive classes plus K negative classes)
  • A. Kanezaki, S. Inaba, Y. Ushiku, Y. Yamashita, H. Muraoka, Y. Kuniyoshi and T. Harada. Hard Negative Classes for Multiple Object Detection. ICRA, 2014.
  • Minimize detection errors as well as classification errors
  • Passive Aggressive algorithm with hard negative mining


slide-5
SLIDE 5

Multiclass object detection (training with negative classes)

We use Passive Aggressive (PA) [Crammer et al., 2006] to learn multi-class linear classifiers.

$W^{(t+1)} = \arg\min_{W} \frac{1}{2}\|W - W^{(t)}\|^2 + C\xi \quad \text{s.t.} \quad l(x_i^{(t)}, y_i^{(t)}; W) \le \xi,\ \xi \ge 0$

The rows of $W$ are the class weight vectors $w_1, \ldots, w_K$, so $W x_t$ stacks the class scores:

$W x_t = (\text{Score of class 1},\ \text{Score of class 2},\ \ldots,\ \text{Score of class } K)^T$

$w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t$

$w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$

where

$\tau_t = \min\left\{C,\ \frac{1 - \left(w_r^{(t)T} x_t - w_s^{(t)T} x_t\right)}{2\|x_t\|^2}\right\}$

  • r: Positive class
  • s: Negative class with the highest score
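The update on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the team's code: the function name `pa_update`, the array layout (one row of `W` per class), and the toy input are assumptions, and the hinge loss `max(0, 1 - margin)` is the standard multiclass PA loss from Crammer et al. rather than something spelled out on the slide.

```python
import numpy as np

def pa_update(W, x, r, C=1.0):
    """One multiclass PA step (hypothetical helper).

    W: (K, d) weight matrix, one row per class (assumed layout).
    x: (d,) feature vector, assumed non-zero.
    r: index of the correct class.
    """
    scores = W @ x
    wrong = scores.copy()
    wrong[r] = -np.inf            # exclude the correct class
    s = int(np.argmax(wrong))     # wrong class with the highest score
    loss = max(0.0, 1.0 - (scores[r] - scores[s]))
    if loss > 0.0:
        tau = min(C, loss / (2.0 * (x @ x)))
        W[r] += tau * x           # w_r <- w_r + tau * x
        W[s] -= tau * x           # w_s <- w_s - tau * x
    return W, s, loss

# Toy usage: 3 classes, 2-dimensional features, all-zero initial weights.
W = np.zeros((3, 2))
x = np.array([1.0, 0.0])
W, s, loss = pa_update(W, x, r=0, C=1.0)
margin = W[0] @ x - W[1] @ x
```

In the toy call the step size is 0.5, which is just enough to bring the margin between the correct class and the competing class up to 1.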

slide-6
SLIDE 6

Multiclass object detection (training with negative classes)

Core Idea: Hard negative classes

In addition to the weight vectors $w_1, w_2, \ldots, w_K$ of the positive classes, $W$ holds weight vectors $w'_1, w'_2, \ldots, w'_K$ of the negative classes, so $W x_t$ stacks 2K scores:

$W x_t = (\text{Score of class 1},\ \ldots,\ \text{Score of class } K,\ \text{Score of negative class 1},\ \ldots,\ \text{Score of negative class } K)^T$

The loss $l(x_i^{(t)}, y_i^{(t)}; W)$ and the PA update are unchanged:

$w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t$

$w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$

where

$\tau_t = \min\left\{C,\ \frac{1 - \left(w_r^{(t)T} x_t - w_s^{(t)T} x_t\right)}{2\|x_t\|^2}\right\}$

Cf.) A single background class $w_{bg}$ does not work.

slide-7
SLIDE 7

Multiclass object detection (training with negative classes)

Classification error

Ex.) If a training sample $x_t$ is a positive sample of class 2,

  • r = class 2
  • s: Negative class with the highest score
  • Candidates of s: class 1, 3, ..., or K, or negative class 2

The loss $l(x_i^{(t)}, y_i^{(t)}; W)$ is then reduced by the PA update $w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t$, $w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$.

slide-8
SLIDE 8

Multiclass object detection (training with negative classes)

Detection error

Ex.) If a training sample $x_t$ is a negative sample of class 2,

  • r = negative class 2
  • s = class 2

The loss $l(x_i^{(t)}, y_i^{(t)}; W)$ is then reduced by the PA update $w_r^{(t+1)} = w_r^{(t)} + \tau_t x_t$, $w_s^{(t+1)} = w_s^{(t)} - \tau_t x_t$.
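The two examples (classification error for a positive sample, detection error for a negative sample) differ only in how r and s are chosen. The sketch below shows that selection under assumptions that are not in the slides: the function name `choose_r_s` is hypothetical, classes are 0-indexed, and the 2K scores are laid out with the K positive classes first and the K negative classes after them.

```python
import numpy as np

def choose_r_s(scores, k, is_positive):
    """Pick (r, s) for a sample of class k (0-indexed).

    scores: length-2K vector; entries 0..K-1 are positive classes and
            entries K..2K-1 are their negative classes (assumed layout).
    """
    K = scores.shape[0] // 2
    if is_positive:
        # Positive sample of class k: r is class k; s is the best-scoring
        # candidate among the other positive classes and negative class k.
        r = k
        candidates = [c for c in range(K) if c != k] + [K + k]
        s = max(candidates, key=lambda c: scores[c])
    else:
        # Negative sample of class k: r is negative class k, s is class k.
        r = K + k
        s = k
    return r, s

# Toy usage with K = 3; index 1 plays the role of "class 2" on the slides.
scores = np.array([0.2, 0.9, 0.1, 0.0, 0.5, 0.3])
pos = choose_r_s(scores, k=1, is_positive=True)
neg = choose_r_s(scores, k=1, is_positive=False)
```

The returned pair can then be fed to the same PA update for both kinds of sample.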

slide-9
SLIDE 9

Features for Contextual Scores

  • Improved Fisher Vector
    – F. Perronnin, J. Sanchez, and T. Mensink. Improving the fisher kernel for large-scale image classification. ECCV, 2010.
    – INRIA's Fisher vector implementation
      • http://lear.inrialpes.fr/src/inria_fisher/
    – L2 normalization, Power normalization, Spatial pyramid
  • Parameters of IFV for all local features in our system
    – Dimension reduction of local feature (D): 64 dim
    – # of components in GMM (K): 256
    – 5 scales of local patches
    – Spatial pyramid (P): 1x1 + 2x2 + 3x1 = 8
    – Dimension of IFK: 2PKD = 262,144 dim
  • Local Descriptors
    – SIFT
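The dimension quoted on this slide follows directly from the listed parameters, and the two normalizations are simple to write down. The sketch below checks the arithmetic and shows one common form of the normalization; the function name `normalize_fv` and the power-normalization exponent 0.5 are assumptions, since the slide gives neither.

```python
import numpy as np

D = 64                       # local feature dimension after reduction
K = 256                      # number of GMM components
P = 1 * 1 + 2 * 2 + 3 * 1    # spatial pyramid cells: 1x1 + 2x2 + 3x1
dim = 2 * P * K * D          # 2PKD from the slide

def normalize_fv(fv):
    """Power normalization followed by L2 normalization (hypothetical helper)."""
    fv = np.sign(fv) * np.sqrt(np.abs(fv))   # exponent 0.5 is an assumed choice
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

v = normalize_fv(np.array([4.0, -9.0]))
```

With these values `P` is 8 and `dim` is 262144, matching the figure on the slide.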


slide-10
SLIDE 10

Classifiers for Contextual Scores


slide-11
SLIDE 11

Online Learning for Large-Scale Visual Recognition

  • Three guidelines
  • Y. Ushiku, M. Hidaka, T. Harada. Three Guidelines of Online Learning for Large-Scale Visual Recognition. CVPR, 2014.

1. Perceptron can compete against the latest methods.

  • Provided that the second guideline is observed.

2. Averaging is necessary for any algorithm.

  • First-order algorithms w/o averaging cannot compete against second-order algorithms.
  • When averaging is used, the accuracies of all algorithms become very close to each other.

3. Investigate multiclass learning first.

  • Both one-versus-the-rest learning and multiclass learning achieve similar accuracy.
  • However, one-versus-the-rest takes much longer CPU time to converge than multiclass does.

Averaging

$\bar{\mu} = \frac{1}{T}\left(\mu_1 + \mu_2 + \cdots + \mu_T\right)$

$y'_i = \arg\max_{y \in Y \setminus y_i} \mu_y \cdot x_i$
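Averaging means returning the mean of the weights seen over all T update steps instead of the final weights. The sketch below uses a plain multiclass perceptron as the underlying learner because it is the shortest to write; it is a stand-in for illustration, not the averaged PA used in the pipeline, and the function name, toy data, and epoch count are assumptions.

```python
import numpy as np

def averaged_perceptron(X, y, num_classes, epochs=5):
    """Multiclass perceptron that returns the average of its weights over all steps."""
    n, d = X.shape
    mu = np.zeros((num_classes, d))
    mu_sum = np.zeros_like(mu)
    T = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            scores = mu @ x_i
            scores[y_i] = -np.inf                # exclude the correct class
            y_wrong = int(np.argmax(scores))     # best-scoring wrong class
            if mu[y_i] @ x_i <= mu[y_wrong] @ x_i:
                mu[y_i] += x_i
                mu[y_wrong] -= x_i
            mu_sum += mu                         # accumulate after every step
            T += 1
    return mu_sum / T                            # mean of mu_1 ... mu_T

# Toy usage: three linearly separable points, one per class.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([0, 1, 2])
mu_bar = averaged_perceptron(X, y, num_classes=3)
pred = np.argmax(X @ mu_bar.T, axis=1)
```

Keeping a running sum costs one extra copy of the weights, which is why averaging is cheap to add to any first-order online learner.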

slide-12
SLIDE 12

Late Fusion

1-1 Scoring each bounding box by RCNN

  • Input image -> Extract region proposals -> Compute CNN features (fc7) -> Scoring regions by Multiclass PA for each class
  • Multiclass PA for class 1, ..., class $j$, ..., class 1000 gives the region scores $S_{i,1}^{CNN}, \ldots, S_{i,j}^{CNN}, \ldots, S_{i,1000}^{CNN}$

1-2 Scoring whole image by FV as contextual scores

  • Whole image -> Extract FV with spatial information -> Scoring by linear classifier trained by PA for each class
  • Multiclass PA for class 1, ..., class $j$, ..., class 1000 gives the whole-image scores $S_{1}^{FV}, \ldots, S_{j}^{FV}, \ldots, S_{1000}^{FV}$

2. Rescoring by combining the RCNN feature and FV

For bounding box $i$, class $j$,

$S_{i,j}^{new} = S_{i,j}^{CNN} S_{j}^{FV}$
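Read as a product, the rescoring rule multiplies each region score for a class by the whole-image score for that same class. The sketch below shows this with NumPy broadcasting; the function name `late_fusion`, the array shapes, and the toy scores are assumptions, and the slides do not say how the two scores are scaled, so non-negative scores are assumed here.

```python
import numpy as np

def late_fusion(s_cnn, s_fv):
    """Fuse region and whole-image scores by elementwise product (hypothetical helper).

    s_cnn: (num_boxes, num_classes) region scores from the RCNN branch.
    s_fv:  (num_classes,) whole-image scores from the FV branch.
    """
    return s_cnn * s_fv[np.newaxis, :]

# Toy usage: 2 boxes, 2 classes.
s_cnn = np.array([[0.9, 0.2],
                  [0.4, 0.8]])
s_fv = np.array([0.5, 1.0])
s_new = late_fusion(s_cnn, s_fv)
best_box = np.argmax(s_new, axis=0)   # best-scoring box for each class
```

Because the whole-image score is shared by all boxes of a class, it rescales classes against each other without changing the ranking of boxes within one class.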

slide-13
SLIDE 13

Results

Validation dataset

Method                                                               Localization error   Classification error
R-CNN feature + one-vs-all SVMs                                      0.631743             0.460080
R-CNN feature + multi-class PA                                       0.446121             0.285720
R-CNN feature + multi-class PA using hard negative classes           0.387516             0.227200
R-CNN feature + multi-class PA using hard negative classes, and FV   0.341743             0.18768

Test dataset

Team name        Localization error   Classification error
VGG              0.253231             0.07405
GoogLeNet        0.264414             0.14828
SYSU_Vision      0.31899              0.14446
MIL (our team)   0.337414 ☺           0.20734 ☹

slide-14
SLIDE 14

Conclusion

  • Our pipeline
    – R-CNN based region proposals and features with multi-class object detectors which create a hard negative class for each positive class
    – Global features (FVs) with multi-class online-learning
    – Late fusion of region score and global score
  • Combining R-CNN with the contextual information improves the localization performance.
  • Multi-class object detector trained with 'hard negative classes' outperforms one-vs.-the-rest SVMs.
