LEAR @ TrecVid MED 2012

  1. LEAR @ TrecVid MED 2012
Dan Oneață¹, Matthijs Douze¹, Jérôme Revaud¹, Jochen Schwenninger², Heng Wang¹, Danila Potapov¹, Zaïd Harchaoui¹, Jakob Verbeek¹, Cordelia Schmid¹
¹ LEAR team, INRIA Grenoble, France
² Fraunhofer, Sankt Augustin, Germany

  2. Outline
1 Low-level features: appearance, motion, audio
2 Feature encoding: Fisher vectors
3 High-level features: text
4 Fusion strategies
5 Experiments and results

  5. Appearance and audio features
Scale-invariant feature transform (SIFT, Lowe 2004):
- 21 × 21 patches at 4-pixel steps on 5 scales
- every 60th frame.
Mel-frequency cepstral coefficients (MFCC, Rabiner and Schafer 2007):
- window of 25 ms with a step size of 10 ms
- 39 coefficients: 12 MFCCs and the signal energy, plus their first and second derivatives
- optionally: speech/non-speech separation.
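The MFCC layout above (12 coefficients + energy, with first and second temporal derivatives, giving 39 dimensions) can be sketched with plain numpy. This is only an illustration of the framing arithmetic and delta stacking, assuming 16 kHz audio and dummy base coefficients; the actual extractor used by the system may differ.

```python
import numpy as np

sr = 16000
win = int(0.025 * sr)   # 25 ms analysis window -> 400 samples
hop = int(0.010 * sr)   # 10 ms step           -> 160 samples

def n_frames(n_samples):
    # Number of full analysis windows that fit with this step size.
    return 1 + (n_samples - win) // hop

def add_deltas(base):
    # base: (T, 13) array of 12 MFCCs + log-energy per frame.
    # First and second temporal derivatives give 13 * 3 = 39 dimensions.
    d1 = np.gradient(base, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.hstack([base, d1, d2])

base = np.random.randn(n_frames(sr), 13)   # one second of dummy frames
feat = add_deltas(base)
print(feat.shape)   # (98, 39)
```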

  9. Motion features
Dense trajectories (Wang et al., 2011): strong performance on many action recognition datasets (Hollywood2, YouTube, UCF Sports).
Idea: MBH descriptors computed along short, densely sampled trajectories.
- Dense sampling at each spatial scale
- Tracking at each spatial scale separately
- Trajectory description: HOG, HOF, MBH
Wang, H., Kläser, A., Schmid, C., and Cheng-Lin, L. (2011). Action recognition by dense trajectories. In IEEE Conference on Computer Vision & Pattern Recognition, pages 3169–3176, Colorado Springs, United States.
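The "dense sampling at each spatial scale" step above can be sketched as follows. This is not the authors' code: the grid step, scale factor, and minimum size are assumed parameters chosen for illustration.

```python
import numpy as np

def dense_sample_points(width, height, step=5, scale=2 ** -0.5, min_size=32):
    # Sample feature points on a regular grid, then shrink the image by the
    # scale factor and repeat, until the smaller side falls below min_size.
    points = []  # (scale_index, x, y) triples
    s = 0
    w, h = width, height
    while min(w, h) >= min_size:
        for y in np.arange(step // 2, h, step):
            for x in np.arange(step // 2, w, step):
                points.append((s, int(x), int(y)))
        w, h = int(w * scale), int(h * scale)
        s += 1
    return points

pts = dense_sample_points(200, 120)
print(len(pts), 1 + max(p[0] for p in pts))   # total points, number of scales
```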

  10. Motion features [figure: dense-trajectory extraction illustration]

  16. Video rescaling for dense trajectories
Computationally expensive: cost scales linearly with the size of the video (time × resolution).
Speed-ups:
- rescale videos to a width of at most 200 px
- skip every second frame
- process descriptors on the fly.
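A back-of-the-envelope sketch of the first two speed-ups: capping the width at 200 px and skipping every second frame both reduce the time × resolution cost linearly. The input dimensions below are illustrative, not from the paper.

```python
def processing_cost(width, height, n_frames, max_width=200, frame_step=2):
    # Rescale so that width <= max_width, preserving the aspect ratio.
    scale = min(1.0, max_width / width)
    w, h = int(width * scale), int(height * scale)
    kept_frames = (n_frames + frame_step - 1) // frame_step
    return w * h * kept_frames  # proportional cost: pixels x processed frames

full = processing_cost(640, 480, 1000, max_width=640, frame_step=1)
fast = processing_cost(640, 480, 1000)
print(full / fast)   # overall speed-up factor
```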


  22. Feature encoding: Fisher vectors (Perronnin et al., 2010)
Top feature encoding technique for:
- object recognition (Chatfield et al., 2011)
- action recognition (Wang et al., 2012).
Fisher vectors (FV) for a GMM:
- soft bag-of-words: Σ_x p(k|x)
- first moment: Σ_x p(k|x)(x − μ_k)
- second moment: Σ_x p(k|x)(x − μ_k)².
FV size: K + 2KD, where K is the number of Gaussians and D the descriptor dimension.
Normalization: zero mean and unit variance, signed square-rooting, ℓ2 normalization.
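The three statistics and the normalization steps listed above can be sketched for a diagonal-covariance GMM with numpy. This is a minimal illustration with random data and unit variances, not the system's actual FV implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 4, 8, 100                       # Gaussians, descriptor dim, descriptors
mu = rng.normal(size=(K, D))              # GMM means
pi = np.full(K, 1.0 / K)                  # mixture weights (unit variances assumed)
X = rng.normal(size=(N, D))               # local descriptors

# Posterior p(k|x): with equal unit variances the normalizing constants cancel.
log_p = -0.5 * ((X[:, None, :] - mu) ** 2).sum(-1) + np.log(pi)
log_p -= log_p.max(1, keepdims=True)
post = np.exp(log_p)
post /= post.sum(1, keepdims=True)        # (N, K)

diff = X[:, None, :] - mu                                 # (N, K, D)
s0 = post.sum(0)                                          # soft bag-of-words: K
s1 = np.einsum('nk,nkd->kd', post, diff)                  # first moment: K x D
s2 = np.einsum('nk,nkd->kd', post, diff ** 2)             # second moment: K x D

fv = np.concatenate([s0, s1.ravel(), s2.ravel()])  # size K + 2KD
fv = np.sign(fv) * np.sqrt(np.abs(fv))             # signed square-rooting
fv /= np.linalg.norm(fv)                           # l2 normalization
print(fv.shape)   # (K + 2*K*D,) = (68,)
```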


  26. High-level features: optical character recognition
Feature extraction, based on maximally stable extremal regions (MSER; Matas et al. 2004). Pipeline:
video frame → all MSERs → gradient filtering → color and stroke-width filtering → pairs filtering → forming words.
Gradient filtering: filtering based on boundary gradients and aspect ratio.
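One filtering step in the pipeline above, discarding candidate regions by aspect ratio, can be sketched as below. The thresholds are made up for illustration; the real system also filters on boundary gradients, color, and stroke width.

```python
def aspect_ratio_ok(w, h, lo=0.1, hi=10.0):
    # Character regions are rarely extremely elongated; keep moderate ratios.
    return h > 0 and lo <= w / h <= hi

# (width, height) of hypothetical MSER bounding boxes
regions = [(12, 20), (300, 4), (8, 8), (2, 90)]
kept = [r for r in regions if aspect_ratio_ok(*r)]
print(kept)   # [(12, 20), (8, 8)]
```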
