Localization with Spatio-Temporal Selective Search and SPPnet



SLIDE 1

TRECVID 2015 TokyoTech 1

Localization with Spatio-Temporal Selective Search and SPPnet

Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda Tokyo Institute of Technology

SLIDE 2

Outline

  • Previous works
    – Selective Search
    – Spatial Pyramid Pooling (SPP) net

  • Our Methods
    1. Spatio-Temporal Selective Search
    2. Multi-Frame Score Fusion
    3. Neighbor-Frame Score Boosting

  • Experiments, Results and Conclusion
SLIDE 3

Selective Search

  • Selective Search produces a large number of object region proposals from an image
    – Uses several strategies, including useless ones

  • J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders, Selective search for object recognition. In IJCV, vol. 104, pp. 154-171, 2013. (The image is from the paper.)
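The grouping idea behind Selective Search can be sketched as follows. This is a minimal toy version under stated assumptions, not the implementation of Uijlings et al.: start from small initial regions, repeatedly merge the two most similar neighboring regions, and keep the bounding box of every intermediate region as a proposal. The region fields (`box`, `color`, `size`), the single color-based similarity, and all function names are hypothetical; the original method combines several similarity measures.

```python
from itertools import combinations


def merge(a, b):
    """Merge two regions: union of the boxes, size-weighted mean color."""
    size = a["size"] + b["size"]
    box = (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
           max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3]))
    color = (a["color"] * a["size"] + b["color"] * b["size"]) / size
    return {"box": box, "color": color, "size": size}


def boxes_touch(a, b):
    """Toy adjacency test: half-open (x0, y0, x1, y1) boxes overlap or share a border."""
    ax0, ay0, ax1, ay1 = a["box"]
    bx0, by0, bx1, by1 = b["box"]
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def hierarchical_grouping(regions):
    """Greedily merge the most similar touching pair; return all boxes seen."""
    regions = list(regions)
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        pairs = [(a, b) for a, b in combinations(regions, 2) if boxes_touch(a, b)]
        if not pairs:
            break
        # Assumed similarity: the smaller the color difference, the more similar.
        a, b = min(pairs, key=lambda p: abs(p[0]["color"] - p[1]["color"]))
        merged = merge(a, b)
        regions = [r for r in regions if r is not a and r is not b] + [merged]
        proposals.append(merged["box"])
    return proposals


# Three initial regions side by side; the two similar ones merge first.
initial = [
    {"box": (0, 0, 2, 2), "color": 0.1, "size": 4},
    {"box": (2, 0, 4, 2), "color": 0.2, "size": 4},
    {"box": (4, 0, 6, 2), "color": 0.9, "size": 4},
]
proposals = hierarchical_grouping(initial)
```

Because every merge step adds a proposal, the output mixes useful object-sized boxes with many that match no object, which is why a classifier has to score each one afterwards.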

SLIDE 4

Spatial Pyramid Pooling (SPP) net

  • An efficient method to extract CNN scores from a large number of object regions of an image
    – CNN layers are shared among all regions
    – SVMs are computed for each region
    – Selective Search is used for region proposals

  • K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1904-1916, 2015.

[Figure: SPPnet pipeline. A shared CNN processes the image once; for each region proposal by Selective Search [2], an SPP layer is followed by FC, ReLU, FC, ReLU, and an SVM score.]
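The reason the CNN layers can be shared is that the SPP layer turns a window of any size on the feature map into a vector of fixed length. The sketch below illustrates this for a single-channel feature map stored as nested lists. The function name `spp_pool`, the pyramid levels, and the exact bin boundaries are assumptions made for illustration and are not taken from the slides.

```python
def spp_pool(feature_map, box, levels=(1, 2)):
    """Max-pool a feature-map window into sum(n * n for n in levels) values.

    box is a half-open (x0, y0, x1, y1) window on the feature map and must
    be at least one cell wide and one cell high.
    """
    x0, y0, x1, y1 = box
    height, width = y1 - y0, x1 - x0
    pooled = []
    for n in levels:  # an n x n grid of bins per pyramid level
        for i in range(n):
            r0 = y0 + (i * height) // n           # floor of the bin start
            r1 = y0 - (-(i + 1) * height // n)    # ceil of the bin end
            for j in range(n):
                c0 = x0 + (j * width) // n
                c1 = x0 - (-(j + 1) * width // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# One shared feature map, computed once per image.
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]

# Two region proposals of different sizes give vectors of the same length.
whole = spp_pool(feature_map, (0, 0, 4, 4))
crop = spp_pool(feature_map, (1, 0, 4, 3))
```

Both vectors have five entries (one bin at level 1 plus four at level 2), so the same fully connected layers and SVMs can score every proposal without rerunning the convolutional layers.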

SLIDE 5

Spatio-Temporal Region Proposals

  • Selective Search with region proposals extended along the temporal dimension
    – Produces temporally continuous regions
    – The output contains a large number of meaningless regions
    – Each video is split at each I-frame and each part is segmented separately, since computation time is limited

[Figure: image pixels extended along the time axis into video voxels; graph edges are weighted with similarity]

1. (1)
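The voxel graph can be sketched as below. This is a simplified illustration, not the authors' code: it assumes a grayscale video stored as `video[t][y][x]`, links each voxel to its neighbors in time, row, and column, uses the absolute intensity difference as the edge weight (low weight meaning similar), and merges voxels with a plain threshold in place of a real segmentation criterion. All function names are hypothetical.

```python
def voxel_edges(video):
    """Yield (weight, a, b) for neighboring voxels; a and b are (t, y, x)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    for t in range(T):
        for y in range(H):
            for x in range(W):
                # Link each voxel to its next neighbor in time, row, and column.
                for dt, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    t2, y2, x2 = t + dt, y + dy, x + dx
                    if t2 < T and y2 < H and x2 < W:
                        weight = abs(video[t][y][x] - video[t2][y2][x2])
                        yield weight, (t, y, x), (t2, y2, x2)


def segment_voxels(video, threshold):
    """Union voxels joined by an edge of weight <= threshold (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for weight, a, b in voxel_edges(video):
        ra, rb = find(a), find(b)  # registers every voxel that has a neighbor
        if weight <= threshold:
            parent[ra] = rb
    return {v: find(v) for v in list(parent)}


def region_tubes(video, threshold):
    """Per region, map frame index -> inclusive (x0, y0, x1, y1) bounding box."""
    tubes = {}
    for (t, y, x), region in segment_voxels(video, threshold).items():
        frames = tubes.setdefault(region, {})
        x0, y0, x1, y1 = frames.get(t, (x, y, x, y))
        frames[t] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return list(tubes.values())


# Two 2x3 frames: a bright object shifts one pixel to the right.
video = [
    [[9, 9, 0], [0, 0, 0]],
    [[0, 9, 9], [0, 0, 0]],
]
tubes = region_tubes(video, threshold=1)
```

Each resulting region carries one box per frame, so a single proposal follows the object through time. This is the temporal continuity that the later fusion step relies on.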

SLIDE 6

Spatio-Temporal Region Proposals

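[Figure: region proposals arranged along a Time axis and a Hierarchy axis; regions are nested (⊃) from one hierarchy level to the next]

Regions are hierarchical and temporally continuous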

1. (2)

SLIDE 7

Multi-Frame Score Fusion

  • Basic idea
    – Some frames contain noise or object deformation, making detection harder
    – The results of ST-Region Proposals contain many meaningless region proposals
    ➔ Information from neighboring frames provides robustness

  • Fuse feature maps among several frames
    – This requires temporally continuous region proposals
    – ST-Region Proposals are adopted

2. (1)

SLIDE 8

Multi-Frame Score Fusion

– In our experiments, late fusion performed best

[Figure: I-frames and P-frames each pass through the CNN and an SPP layer, then FC, ReLU, FC, ReLU, and SVM to produce a score; candidate fusion points are marked along the pipeline.]

FC: Fully connected layer
SPP: Spatial Pyramid Pooling layer
ReLU: Rectified linear unit

2. (2)
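Late fusion can be sketched as follows: every frame of a spatio-temporal region is scored on its own, and only the final SVM scores are combined. The slides do not state the fusion operator, so the arithmetic mean used here is an assumption, as are the function names and the example scores.

```python
def late_fusion(frame_scores):
    """Fuse the per-frame SVM scores of one region (assumed operator: mean)."""
    return sum(frame_scores) / len(frame_scores)


def fuse_regions(scores_by_region):
    """Map each region id to its fused score."""
    return {rid: late_fusion(scores) for rid, scores in scores_by_region.items()}


# Hypothetical scores for two region proposals over four frames each.
# In "tube_a" one noisy frame scores low, but its neighbors compensate.
scores_by_region = {
    "tube_a": [0.9, -0.4, 0.8, 0.7],
    "tube_b": [-0.2, 0.3, -0.5, -0.4],
}
fused = fuse_regions(scores_by_region)
```

Here `tube_a` keeps a clearly positive fused score despite its one bad frame, while the single positive frame of `tube_b` is outvoted. This only works because an ST-Region Proposal identifies the same region in every frame.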

SLIDE 9

Neighbor-Frame Score Boosting

  • Basic idea
    – Based on the same observation as the preceding score fusion
    – Objects appear in several consecutive frames
    ➔ Information from neighboring frames provides robustness

  • Boost the scores of I-frames that lie between positives by increasing them by a constant

[Figure: a sequence of I-frames along the time axis; I-frames between positives are marked "Boosted"]

3.
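A minimal sketch of the boosting rule, under assumptions the slides leave open: an I-frame counts as "between positives" when both of its immediate neighbors score at or above the decision threshold, and only frames that are themselves below the threshold are boosted. The function name and the threshold and boost values are hypothetical.

```python
def boost_between_positives(scores, threshold, boost):
    """Add a constant to sub-threshold I-frame scores flanked by positives."""
    boosted = list(scores)
    for i in range(1, len(scores) - 1):
        left_positive = scores[i - 1] >= threshold
        right_positive = scores[i + 1] >= threshold
        if scores[i] < threshold and left_positive and right_positive:
            boosted[i] = scores[i] + boost
    return boosted


# Scores of five consecutive I-frames for one concept (made-up values).
scores = [0.8, 0.4, 0.7, 0.1, 0.2]
boosted = boost_between_positives(scores, threshold=0.5, boost=0.2)
```

Only the second frame changes: it sits between two positives and moves from 0.4 to about 0.6, which lifts it over the threshold. The fourth frame has a negative neighbor on its right and is left alone.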

SLIDE 10

Experiments – Manual Annotations

  • Airplane, Boat_Ship, Bridges, Bus, Motorcycle, Telephones, Flags, Quadruped – annotations provided
  • Anchorperson – we annotated 12k I-frames
  • Computers – we annotated 7k I-frames
SLIDE 11

Experiments – Training

  • Deciding the threshold and the fusion method
    – Used last year's dataset and concepts
    – Train: IACC_2_A
    – Val: IACC_2_B

  • Submitted runs
    – Train: IACC_2_A including additional annotations, IACC_2_B
    – Test: IACC_2_C

SLIDE 12

Results

  • Harmonic mean of F-scores

Run ID         Method                                            Val     Test
(Base)         Selective Search + SPPnet                         0.4481  0.5656
Multiple       + ST-Region Proposals, Multi-Frame Score Fusion   0.4518  0.5716
Multiple_Aug3  + Neighbour-Frame Score Boost                     0.4569  0.5750

  • Multi-Frame Score Fusion and Neighbor-Frame Score Boosting improved the score
  • We achieved 3rd place among all teams in harmonic mean of F-scores
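For reference, the generic formulas behind these numbers can be sketched as below. The sketch assumes the ranking value is the harmonic mean of an I-frame F-score and a mean pixel F-score, each an F1 score over true positives, false positives, and false negatives. The official evaluation may differ in detail, and the counts in the example are made up.

```python
def f_score(tp, fp, fn):
    """F1 score from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def harmonic_mean(a, b):
    """Harmonic mean of two scores; 0.0 if both are zero."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


# Made-up counts: 6 correct I-frames, 2 false alarms, 4 misses.
iframe_f = f_score(tp=6, fp=2, fn=4)

# The harmonic mean is pulled toward the lower of the two scores.
combined = harmonic_mean(0.5, 1.0)
```

The harmonic mean is dominated by the weaker of its two inputs, so a run cannot rank well by optimizing only one of the two F-scores.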

SLIDE 13

Results


[Figure: bar chart of I-frame F-score, mean pixel F-score, and harmonic mean for the best run of each team: MediaMill, CCNY, ours, Trimps, PicSOM, DCU]

SLIDE 14

Results – Examples

  • The system output is sometimes better than the ground truth (GT)

[Figure: example frames comparing system output with ground truth]

SLIDE 15

Results – Spatial Score

  • We achieved 1st place in Mean Pixel F-score by throttling the number of positives to reduce false positives (FPs)
    – As a consequence, the I-frame F-score is not good

  • Mean Pixel F-score is calculated from true positive and false positive I-frames, which is not intuitive

[Figure: mean pixel F-score and I-frame F-score for the runs Multiple_Spat, Multiple, Multiple_Aug3, and Single2]
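Why throttling helps this metric can be illustrated with a rough sketch. It assumes, beyond what the slides state, that each returned I-frame gets a pixel-level F-score from the overlap of the system box and the ground-truth box, that a false positive I-frame (no ground truth) contributes 0, and that the mean is taken over returned frames only. The function names and boxes are hypothetical.

```python
def box_area(box):
    """Area of a half-open (x0, y0, x1, y1) box; 0 if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def pixel_f_score(system_box, truth_box):
    """Pixel-level F1 of one returned frame; 0.0 for a false positive frame."""
    if truth_box is None:
        return 0.0
    overlap = box_area((max(system_box[0], truth_box[0]),
                        max(system_box[1], truth_box[1]),
                        min(system_box[2], truth_box[2]),
                        min(system_box[3], truth_box[3])))
    if overlap == 0:
        return 0.0
    precision = overlap / box_area(system_box)
    recall = overlap / box_area(truth_box)
    return 2 * precision * recall / (precision + recall)


def mean_pixel_f(returned_frames):
    """Mean over returned frames only; missed frames never enter the mean."""
    scores = [pixel_f_score(s, t) for s, t in returned_frames]
    return sum(scores) / len(scores) if scores else 0.0


# (system box, ground-truth box or None), ordered by decreasing confidence.
returned = [
    ((0, 0, 10, 10), (0, 0, 10, 10)),  # perfect localization
    ((0, 0, 10, 10), (5, 0, 15, 10)),  # half overlap
    ((0, 0, 10, 10), None),            # false positive I-frame
]
all_frames = mean_pixel_f(returned)
throttled = mean_pixel_f(returned[:2])  # keep only the two most confident
```

Dropping the least confident frame raises the mean from 0.5 to 0.75 in this example. Under these assumptions, frames that are never returned cost nothing on this metric, while they still lower recall and therefore the I-frame F-score.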

SLIDE 16

Conclusion

  • We developed a localization system using ST-Region Proposals and CNN with SPP-net

  • Multi-Frame Score Fusion with ST-Region Proposals and Neighbor-Frame Score Boosting improved the score

  • Problem: The detection results strongly depend on the quality of ST-Region Proposals
    – Improve ST-Region Proposals quality
    – Localization without region candidates