Localization using Faster R-CNN and Multi-Frame Fusion



SLIDE 1

Localization using Faster R-CNN and Multi-Frame Fusion

Ryosuke Yamamoto, Nakamasa Inoue, Koichi Shinoda
Tokyo Institute of Technology

SLIDE 2

Outline

  • Motivation: detect an action concept, “SittingDown”
  • Our method: Faster R-CNN + LSTM + Re-scoring
  • Annotation: frame-wise annotation for SittingDown, key-frame annotation for other concepts
  • Results: 2nd among 3 teams, best result for SittingDown

[Chart: F-score (axis 0.1 to 0.5) for iframe_fscore and mean_pixel_fscore]

SLIDE 3

Motivation

・ The localization task focuses not only on static objects but also on action concepts
・ We focus on SittingDown, one of the action concepts
・ How to distinguish between Sitting and SittingDown?
  → Dynamic information is important for precise detection

[Example images: Sitting, SittingDown]

SLIDE 4

Our Method

・Faster R-CNN (Ren 2015)

  • Efficient object localization

・LSTM (Donahue 2015)

  • Precise action localization
  • Applied to SittingDown

・Re-scoring (Yamamoto 2015)

  • Multi-Frame Score Fusion
  • Multi-Shot Score Boosting

[Diagram labels: Faster R-CNN, LSTM, Prediction (one per frame along the time sequence), Fusion, Boost]

SLIDE 5

Faster R-CNN (Ren 2015)

Efficient end-to-end object localization

  1. Generate region proposals by a network
  2. Predict scores for each region by using CNN features

Example CNNs:

  • ZF Net (Zeiler 2014) (the one we use)
  • VGG-16 (Simonyan 2014)
  • GoogLeNet (Szegedy 2015)
  • ResNet (He 2016)

[Diagram labels: CNN, Region proposals, ROI Pooling, DNN]
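The ROI Pooling step named in the diagram can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `roi_max_pool`, the 2×2 output grid, and the toy 6×6 feature map are assumptions chosen for illustration. The idea is that each proposed region, whatever its size, is max-pooled into a fixed-size grid so that the following layers always receive an input of the same shape.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a 2-D feature map into an out_size x out_size grid.

    feature_map: array of shape (H, W)
    roi: (y0, x0, y1, x1), end-exclusive, in feature-map coordinates (non-empty)
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges split the region into out_size roughly equal parts per axis.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Make sure every bin covers at least one cell.
            ye = max(ys[i + 1], ys[i] + 1)
            xe = max(xs[j + 1], xs[j] + 1)
            pooled[i, j] = region[ys[i]:ye, xs[j]:xe].max()
    return pooled

# Toy feature map (hypothetical values) and one 4x4 region proposal.
fmap = np.arange(36, dtype=np.float32).reshape(6, 6)
pooled = roi_max_pool(fmap, (1, 1, 5, 5))
```

In a real detector the same pooling is applied per channel and per proposal; the single-channel version here only shows the binning.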

SLIDE 6

Long Short-Term Memory (LSTM)

An LSTM layer is introduced to Faster R-CNN

  • Memorizes long- and short-term information
  • Applied only to SittingDown

[Diagram: for each frame in the time sequence, Faster R-CNN → LSTM → Prediction]
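To make "memorizes long- and short-term information" concrete, here is a minimal NumPy sketch of a standard LSTM cell run over a sequence of per-frame feature vectors. The slides do not specify how the layer is wired into Faster R-CNN, so everything below is an assumption for illustration: the function names, the random weights, and the toy sizes (8-dimensional features, 4 hidden units) are hypothetical and much smaller than the 4096 or 64 units mentioned in the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One standard LSTM step.

    x: input features for the current frame, shape (d,)
    h, c: previous hidden and cell states, shape (n,)
    W: weights of shape (4 * n, d + n); b: bias of shape (4 * n,)
    """
    n = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate
    f = sigmoid(z[1 * n:2 * n])   # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:4 * n])   # candidate cell update
    c_new = f * c + i * g         # cell state carries information across frames
    h_new = o * np.tanh(c_new)    # hidden state is the per-frame output
    return h_new, c_new

def run_lstm(frames, n_hidden, seed=0):
    """Run the cell over a (T, d) sequence of per-frame features."""
    rng = np.random.default_rng(seed)
    d = frames.shape[1]
    W = rng.normal(scale=0.1, size=(4 * n_hidden, d + n_hidden))
    b = np.zeros(4 * n_hidden)
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x in frames:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)

# 5 frames of 8-dimensional toy features (hypothetical data).
frames = np.random.default_rng(1).normal(size=(5, 8))
hidden = run_lstm(frames, n_hidden=4)
```

Each row of `hidden` depends on the current frame and on all earlier frames through `h` and `c`, which is the property that lets a classifier on top separate a static pose from a motion such as sitting down.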

SLIDE 7

Multi-Frame and Multi-Shot (Yamamoto 2015)

・Multi-Frame Score Fusion

  • Average pooling of scores over 5 frames in a shot

・Multi-Shot Score Boosting

  • Add adjacent shot scores

[Diagram labels: Key-frame (I-frame), Average]
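The two re-scoring steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' code: it treats each frame as having a single confidence score for one concept, and the function names and the `weight` parameter are hypothetical (the slides only say that scores are averaged over 5 frames and that adjacent shot scores are added).

```python
import numpy as np

def fuse_frames(frame_scores):
    """Multi-frame score fusion: average the scores of the frames in one shot."""
    return float(np.mean(frame_scores))

def boost_shots(shot_scores, weight=1.0):
    """Multi-shot score boosting: add the scores of the adjacent shots.

    `weight` is a hypothetical knob; with weight=1.0 neighbours are simply added.
    """
    s = np.asarray(shot_scores, dtype=float)
    boosted = s.copy()
    boosted[1:] += weight * s[:-1]   # contribution of the previous shot
    boosted[:-1] += weight * s[1:]   # contribution of the next shot
    return boosted

# Three consecutive shots, 5 frame scores each (hypothetical values).
shots = [
    [0.2, 0.4, 0.6, 0.4, 0.4],
    [0.8, 0.8, 1.0, 0.6, 0.8],
    [0.0, 0.2, 0.0, 0.2, 0.1],
]
fused = [fuse_frames(f) for f in shots]   # approximately [0.4, 0.8, 0.1]
boosted = boost_shots(fused)              # approximately [1.2, 1.3, 0.9]
```

Averaging smooths out single-frame detector noise, while boosting raises the score of a shot whose neighbours also show the concept.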

SLIDE 8

Key-Frame Annotations

Bounding-box annotation on the representative key-frame for each shot labeled as positive in collaborative annotation

Concept          # frames   # boxes
Animal             11,545     9,155
Bicycling             599     1,355
Boy                 1,848     2,492
Dancing             2,118     5,199
ExplosionFire       2,483     2,402
Inst.Musician       4,923     7,229
Running               945     1,394
SittingDown             -         -
Baby                  898       895
Skier                 320       521

SLIDE 9

I-Frame Annotations for SittingDown

・I-Frame annotation for SittingDown to train LSTM

・Annotation results

  • # shots = 92
  • # frames = 481
  • # bounding-boxes = 515

* We found SittingDown in only 92 of the 3K shots labeled as positive in collaborative annotation

SLIDE 10

Results

[Chart: F-score (axis 0.1 to 0.5) for iframe_fscore and mean_pixel_fscore]

TokyoTech Runs

ID   Method                                     RunID
1*   Faster R-CNN + Multi-Frame Score Fusion    fusion
2*   1 + Multi-Shot Score Boosting              boost
3*   1 + LSTM (4096 units) for SittingDown      fusion.lstm
4*   2 + LSTM (4096 units) for SittingDown      boost.lstm
5    2 + LSTM (64 units) for SittingDown        (post exp.)

・2nd among 3 teams

SLIDE 11

Results for SittingDown

ID   Method                   I-Frame F-score   Pixel F-score
2*   Fusion + Boosting                   0.63            0.22
4*   2 + LSTM (4096 units)               0.00            0.00
5    2 + LSTM (64 units)                11.96            4.51

Best result for SittingDown with run #2
LSTM with 4096 units (run #4) did not work
→ LSTM with 64 units (run #5) avoided over-fitting and worked in a post-submission experiment
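Both result columns are F-scores, that is, the harmonic mean of precision and recall. The slides do not give the exact evaluation protocol, so the following is only a sketch of how a pixel-level F-score could be computed for one frame, assuming a single predicted box and a single ground-truth box given as (x0, y0, x1, y1) with integer pixel coordinates. The function names are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def pixel_f_score(pred, truth):
    """F-score over pixels for one predicted box and one ground-truth box."""
    ix0, iy0 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix1, iy1 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = box_area((ix0, iy0, ix1, iy1))
    if inter == 0:
        return 0.0
    precision = inter / box_area(pred)    # share of predicted pixels that are correct
    recall = inter / box_area(truth)      # share of true pixels that are found
    return 2 * precision * recall / (precision + recall)

# Two 10x10 boxes overlapping by half (hypothetical example).
score = pixel_f_score((0, 0, 10, 10), (5, 0, 15, 10))
```

An I-frame F-score would apply the same harmonic mean to frame-level decisions (whether a frame is reported as containing the concept) rather than to pixels.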

SLIDE 12

SittingDown

[Example images: system output vs. ground truth, good cases and bad cases]
[Image captions: Moving but not sitting down; Moving around a chair; Sitting down]

Re-trained network with LSTM 64 units

SLIDE 13

Animal, Good Results

[Example images with labels: System output, Ground truth, Faster R-CNN, Score Fusion, Score Boosting]
[Image captions: Cat (no movement); Dog (walking)]

SLIDE 14

Animal, Bad Results

[Example images with labels: System output, Ground truth, Faster R-CNN, Score Fusion, Score Boosting]
[Image captions: Many animals; Bird (flying fast)]

SLIDE 15

Others

[Example images with labels: Faster R-CNN, Score Fusion, Score Boosting, System output, Ground truth]
[Concepts shown: Bicycling; Boy]

SLIDE 16

Others

[Example images with labels: Faster R-CNN, Score Fusion, Score Boosting, System output, Ground truth]
[Concepts shown: Dancing; ExplosionFire]

SLIDE 17

Others

[Example images with labels: Faster R-CNN, Score Fusion, Score Boosting, System output, Ground truth]
[Concepts shown: InstrumentalMusician; Running]

SLIDE 18

Others

[Example images with labels: Faster R-CNN, Score Fusion, Score Boosting, System output, Ground truth]
[Concepts shown: Baby; Skier]

SLIDE 19

Conclusion & Future Work

・We proposed a localization system

  • Faster R-CNN + LSTM + Re-scoring

・Manual annotation

  • 31K bounding boxes

・Results

  • 2nd among 3 teams, best result for SittingDown
  • LSTM with 64 units was effective for SittingDown

・Future work

  • Find a better way to localize actions