End-to-end Learning of Action Detection from Frame Glimpses in Videos




SLIDE 1

End-to-end Learning of Action Detection from Frame Glimpses in Videos

CVPR 2016. Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei. Presenter: Wei-Jen Ko

SLIDE 2

Action detection

  • Predict which action occurs in the video, and when.

SLIDE 3

Related Work

  • Motion features: dense trajectories
  • Appearance features: CNN + SIFT + color
  • Audio features: MFCC + ASR
  • Classified by an SVM over exhaustive segments of varying scale and temporal position.


  • D. Oneata, J. Verbeek, and C. Schmid. The LEAR Submission at THUMOS 2014.
  • L. Wang, Y. Qiao, and X. Tang. Action Recognition and Detection by Combining Motion and Appearance Features.
  • J. Yuan, Y. Pei, B. Ni, P. Moulin, and A. Kassim. ADSC Submission at THUMOS Challenge 2015.
SLIDE 4

Related Work

  • Dynamic feature prioritization
  • Predictive-corrective networks

  • A. Dave, O. Russakovsky, and D. Ramanan. Predictive-Corrective Networks for Action Detection, CVPR 2017.
  • Y.-C. Su and K. Grauman. Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video, ECCV 2016.

SLIDE 5

Proposed method

  • Recurrent neural network-based end-to-end model
  • Decides which frame to observe next and when to emit a prediction.

SLIDE 6

Observation Network

  • Video frame vln → VGG → image feature
  • Image feature + frame location ln → FC → observation vector on
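A minimal sketch of this step, assuming simple concatenation for fusing the image feature with the frame location; the toy `cnn_features` random projection merely stands in for the VGG features, and the fusion scheme is an illustration, not the paper's exact observation network:

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_features(frame, dim=1024):
    """Stand-in for the VGG image feature of the slide: a fixed-seed random
    projection of the flattened frame (a real model would run a CNN here)."""
    w = np.random.default_rng(42).standard_normal((dim, frame.size))
    return (w @ frame.ravel()) / np.sqrt(frame.size)

def observation(frame, l_n, dim=1024):
    """Build the observation vector o_n from the frame v_{l_n} and its
    normalized location l_n in [0, 1]; concatenation (plus the FC layer that
    would follow) is an assumption for illustration."""
    feat = cnn_features(frame, dim)
    return np.concatenate([feat, [l_n]])  # 1024-d feature + 1 location entry

frame = rng.standard_normal((8, 8, 3))  # tiny stand-in for a video frame
o_n = observation(frame, l_n=0.25)      # o_n has shape (1025,)
```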

SLIDE 7

Recurrent Network

  • sn: start location of the action
  • en: end location of the action
  • ln+1: location of the video frame to observe next
  • cn: confidence level of the prediction
  • pn: prediction indicator
  • sn, en, and ln+1 are normalized to [0, 1]
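One recurrence of the agent can be sketched as follows; the single tanh cell and the shared sigmoid head are simplifications (the paper uses an LSTM-based recurrent network), and all sizes are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, D = 16, 32                           # toy hidden and observation sizes
Wh = 0.1 * rng.standard_normal((H, H))  # recurrent weights
Wo = 0.1 * rng.standard_normal((H, D))  # observation weights
Wy = 0.1 * rng.standard_normal((5, H))  # heads: sn, en, cn, pn, l_{n+1}

def agent_step(h, o_n):
    """Update the hidden state from observation o_n, then emit the candidate
    detection (sn, en), confidence cn, prediction indicator pn, and the next
    frame location l_{n+1}; sigmoid keeps sn, en, l_{n+1} in [0, 1]."""
    h = np.tanh(Wh @ h + Wo @ o_n)
    s_n, e_n, c_n, p_logit, l_next = sigmoid(Wy @ h)
    p_n = int(p_logit > 0.5)            # binary: emit a prediction or not
    return h, (s_n, e_n, c_n, p_n, l_next)

h = np.zeros(H)
h, (s_n, e_n, c_n, p_n, l_next) = agent_step(h, rng.standard_normal(D))
```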

SLIDE 8

SLIDE 9

Loss function

  • Lcls(dn): cross-entropy loss on the confidence cn
  • Lloc(dn, gm): L2 regression loss minimizing the distance between the predicted (sn, en) and the matched ground-truth segment gm
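The supervised part of the objective can be sketched like this, assuming dn = (sn, en, cn) is matched to a ground-truth segment gm with label y, and that the localization term only applies to positives; the 1:1 weighting of the two terms is an assumption:

```python
import numpy as np

def detection_loss(d_n, g_m, y):
    """L = Lcls + Lloc: cross-entropy on the confidence c_n plus, for
    positive matches (y = 1), an L2 regression pulling the predicted
    (s_n, e_n) toward the matched ground-truth segment g_m (equal
    weighting of the two terms is assumed here)."""
    s_n, e_n, c_n = d_n
    s_gt, e_gt = g_m
    l_cls = -(y * np.log(c_n) + (1 - y) * np.log(1.0 - c_n))
    l_loc = (s_n - s_gt) ** 2 + (e_n - e_gt) ** 2 if y == 1 else 0.0
    return l_cls + l_loc

loss = detection_loss((0.2, 0.6, 0.9), (0.25, 0.55), y=1)
```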

SLIDE 10

Reward function

  • pn and ln+1 are non-differentiable decisions, so they are trained with REINFORCE.
  • A negative reward is given if the model does not emit predictions for videos containing action instances.
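A toy version of the REINFORCE update for the location decision might look like this; the Gaussian policy over l_{n+1}, its fixed standard deviation, and the rewarded segment [0.4, 0.6] are all illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.1  # assumed fixed std of the Gaussian policy over l_{n+1}

def logp_grad_mu(l, mu):
    """Score function d/dmu log N(l; mu, SIGMA^2): the direction REINFORCE
    scales by the reward to train the non-differentiable location choice."""
    return (l - mu) / SIGMA ** 2

def reinforce_update(mu, episodes=500, lr=0.005):
    """Sample a location, reward +1 if it lands inside a hypothetical action
    segment [0.4, 0.6] and -1 otherwise (echoing the negative reward for
    missed instances), then ascend the reward-scaled score; locations are
    clipped so they stay normalized in [0, 1]."""
    for _ in range(episodes):
        l = rng.normal(mu, SIGMA)
        r = 1.0 if 0.4 <= l <= 0.6 else -1.0
        mu = float(np.clip(mu + lr * r * logp_grad_mu(l, mu), 0.0, 1.0))
    return mu

mu = reinforce_update(0.2)
```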

SLIDE 11

THUMOS’14 Results

SLIDE 12

If the observed frames are not determined dynamically, fixed sampling does not provide sufficient resolution to localize action boundaries.

SLIDE 13

ActivityNet Results

SLIDE 14

Strengths:

  • First end-to-end training approach for action detection
  • Selects important frames to observe; no exhaustive search over segments
  • Better results on THUMOS'14 and ActivityNet

