TRECVID 2015 TokyoTech 1
Localization with Spatio-Temporal Selective Search and SPPnet
Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda
Tokyo Institute of Technology
Outline
- Previous works
– Selective Search
– Spatial Pyramid Pooling (SPP) net
- Our Methods
1. Spatio-Temporal Selective Search
2. Multi-Frame Score Fusion
3. Neighbor-Frame Score Boosting
- Experiments, Results and Conclusion
Selective Search
- Selective Search produces a large number of object region proposals from an image
– Use several strategies including useless ones
- J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders, Selective search for object recognition. In IJCV, vol. 104, pp. 154-171, 2013
(The image is from the paper.)
Spatial Pyramid Pooling (SPP) net
- An efficient method to extract CNN scores from a large number of object regions of an image
– CNN layers shared among all regions
– SVMs computed for each region
– Selective Search is used for region proposals
- K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition. In IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1904-1916, 2015
[Figure: SPPnet pipeline. Region proposals by Selective Search [2] and a shared CNN, followed for each region by SPP layer → FC → ReLU → FC → ReLU → SVM → Score]
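The pooling step that lets one shared feature map serve regions of any size can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `spp_pool`, the pyramid levels (1, 2, 4), and the use of max pooling are assumptions chosen for the example.

```python
import numpy as np

def spp_pool(feature_map, box, levels=(1, 2, 4)):
    """Pool one region of a shared conv feature map into a fixed-length vector.

    feature_map: array of shape (C, H, W), computed once per image.
    box: (x0, y0, x1, y1) in feature-map coordinates, x1 and y1 exclusive.
    levels: grid sizes of the pyramid (hypothetical choice for this sketch).
    """
    x0, y0, x1, y1 = box
    region = feature_map[:, y0:y1, x0:x1]
    _, height, width = region.shape

    def bins(size, n):
        # Floor the start and ceil the end so that no bin is ever empty.
        return [(int(np.floor(i * size / n)), int(np.ceil((i + 1) * size / n)))
                for i in range(n)]

    pooled = []
    for n in levels:
        for ya, yb in bins(height, n):
            for xa, xb in bins(width, n):
                # Max over the spatial extent of the bin, one value per channel.
                pooled.append(region[:, ya:yb, xa:xb].max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Whatever the size of the region, the output has C × (1 + 4 + 16) values, so the fully connected layers can be applied to every proposal without rerunning the convolutional layers.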
Spatio-Temporal Region Proposals
- Selective Search extended along the temporal dimension to produce region proposals
– Produces temporally continuous regions
– Contains a large number of meaningless regions
– Each video is split at each I-frame and each part is segmented separately, since computational time is limited
[Figure: image pixels become video voxels along the time axis; edges are weighted with similarity]
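The voxel graph can be sketched as below. The slides state only that voxels are connected by edges weighted with similarity; the 6-connectivity, the absolute intensity difference as the weight, and the fixed-threshold merge are assumptions made for this example, and the names `voxel_edges` and `segment_voxels` are hypothetical. It yields only an initial segmentation, on top of which a hierarchical grouping would run.

```python
import numpy as np

def voxel_edges(video):
    """Build a 6-connected voxel graph for a grayscale clip of shape (T, H, W).

    Returns (weights, src, dst), where src and dst are flat voxel indices and
    the weight is the absolute intensity difference (assumed dissimilarity).
    """
    v = video.astype(float)
    idx = np.arange(v.size).reshape(v.shape)
    weights, src, dst = [], [], []
    for axis in range(3):  # 0: time, 1: vertical, 2: horizontal
        lo = [slice(None)] * 3
        hi = [slice(None)] * 3
        lo[axis] = slice(None, -1)
        hi[axis] = slice(1, None)
        lo, hi = tuple(lo), tuple(hi)
        src.append(idx[lo].ravel())
        dst.append(idx[hi].ravel())
        weights.append(np.abs(v[lo] - v[hi]).ravel())
    return np.concatenate(weights), np.concatenate(src), np.concatenate(dst)

def segment_voxels(video, threshold):
    """Merge voxels joined by an edge of weight <= threshold (union-find)."""
    weights, src, dst = voxel_edges(video)
    parent = np.arange(video.size)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for k in np.argsort(weights):
        if weights[k] > threshold:
            break
        a, b = find(src[k]), find(dst[k])
        if a != b:
            parent[b] = a
    labels = np.array([find(i) for i in range(video.size)])
    return labels.reshape(video.shape)
```

Because edges also run along the time axis, a region that keeps its appearance from frame to frame receives the same label in every frame, which is the temporal continuity the later fusion step relies on.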
Spatio-Temporal Region Proposals
[Figure: region hierarchy over time; each region contains (⊃) the regions below it]
- Regions are hierarchical and temporally continuous
Multi-Frame Score Fusion
- Basic idea
– Some frames contain noise or object deformation, making detection harder
– Results of ST-Region Proposals contain many meaningless region proposals
➔ Information from neighbor frames provides robustness
- Fuse feature maps among several frames
– This requires temporally continuous region proposals
– ST-Region Proposals are adopted
Multi-Frame Score Fusion
– In experiments, we concluded that late fusion is the best
[Figure: CNN → SPP → FC → ReLU → FC → ReLU → SVM → Score pipeline applied to I-frames and P-frames, with the fusion points marked]
FC: Fully connected layer; SPP: Spatial Pyramid Pooling layer; ReLU: Rectified linear unit
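Late fusion, that is, fusion at the score level, can be sketched as follows. The slides do not state the fusion operator, so the averaging used here is an assumption, as are the function name `late_fusion` and the optional `valid` mask.

```python
import numpy as np

def late_fusion(frame_scores, valid=None):
    """Fuse per-frame SVM scores of spatio-temporal regions into one score each.

    frame_scores: array of shape (F, R), the score of region r on frame f.
    valid: optional boolean array of shape (F, R), True where the region
           actually exists on that frame (hypothetical helper input).
    """
    scores = np.asarray(frame_scores, dtype=float)
    if valid is None:
        valid = np.ones_like(scores, dtype=bool)
    valid = np.asarray(valid, dtype=bool)
    counts = valid.sum(axis=0)
    total = np.where(valid, scores, 0.0).sum(axis=0)
    # Average over the frames in which the region is present.
    return total / np.maximum(counts, 1)
```

For example, `late_fusion([[1.0, -2.0], [3.0, 0.0]])` averages the two frames for each region. The step only makes sense when the same region can be identified across frames, which is why temporally continuous proposals are needed.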
Neighbor-Frame Score Boosting
- Basic idea
– Based on the same idea as the previous score fusion
– Objects will appear in several consecutive frames
➔ Information from neighbor frames provides robustness
- Boost the scores of I-frames between positives by increasing their scores by a constant
[Figure: timeline of I-frames; the I-frames lying between positive I-frames are marked "Boosted"]
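The boosting rule can be sketched as below. The slides do not say how far apart the surrounding positives may be, so the `window` parameter, the default `threshold` and `boost` values, and the function name are all assumptions for illustration.

```python
def boost_between_positives(scores, threshold=0.0, boost=0.5, window=1):
    """Add a constant to non-positive I-frames that lie between positives.

    scores: per-I-frame detection scores in temporal order.
    threshold: a frame counts as positive if its score exceeds this value.
    boost: constant added to the score (hypothetical value).
    window: how many I-frames to look back and ahead for a positive.
    """
    positive = [s > threshold for s in scores]
    boosted = list(scores)
    for i, score in enumerate(scores):
        if positive[i]:
            continue
        # Positives are taken from the original scores, so boosts do not cascade.
        before = any(positive[max(0, i - window):i])
        after = any(positive[i + 1:i + 1 + window])
        if before and after:
            boosted[i] = score + boost
    return boosted
```

With `window=1`, only an I-frame whose two immediate neighbors are both positive is boosted; a larger window also covers short runs of missed frames.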
Experiments – Manual Annotations
- Airplane, Boat_Ship, Bridges, Bus, Motorcycle, Telephones, Flags, Quadruped – provided
- Anchorperson – annotated 12k I-frames
- Computers – annotated 7k I-frames
Experiments – Training
- Deciding the threshold and the fusion method
– Used last year's dataset and concepts
– Train: IACC_2_A
– Val: IACC_2_B
- Submitted runs
– Train: IACC_2_A including additional annotations, IACC_2_B
– Test: IACC_2_C
Results
Harmonic mean of F-scores:
Run ID          Method                                              Val      Test
(Base)          Selective Search + SPPnet                           0.4481   0.5656
Multiple        + ST-Region Proposals, Multi-Frame Score Fusion     0.4518   0.5716
Multiple_Aug3   + Neighbour-Frame Score Boost                       0.4569   0.5750
- Multi-Frame Score Fusion and Neighbor-Frame Score Boosting improved the score
- We achieved 3rd place among all teams in harmonic mean of F-scores
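For reference, the combined measure can be computed as in the sketch below. It assumes that the harmonic mean combines the I-frame F-score and the mean pixel F-score, as the labels of the comparison chart suggest; the function names are hypothetical.

```python
def f_score(tp, fp, fn):
    """F1 score from counts of true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores; 0 if both are 0."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)
```

The harmonic mean is pulled toward the lower of its two inputs, so a run must do reasonably well on both measures to rank high.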
[Figure: I-frame F-score, mean pixel F-score, and harmonic mean for the best run of each team: MediaMill, CCNY, ours, Trimps, PicSOM, DCU]
Results – Examples
- The system output is sometimes better than the ground truth (GT)
[Figure: system output compared with ground truth]
Results – Spatial Score
- We achieved 1st place in mean pixel F-score by limiting the number of positives to reduce false positives (FPs)
– As a consequence, the I-frame F-score is not good
- Mean pixel F-score is calculated from true positive and false positive I-frames, which is not intuitive
[Figure: mean pixel F-score and I-frame F-score for the runs Multiple_Spat, Multiple, Multiple_Aug3, and Single2]
Conclusion
- We developed a localization system using ST-Region Proposals and CNN with SPP-net
- Multi-Frame Score Fusion with ST-Region Proposals and Neighbor-Frame Score Boosting improved the score
- Problem: The detection results strongly depend on the quality of ST-Region Proposals
– Improve ST-Region Proposals quality
– Localization without region candidates