CMU TRECVID Event Detection

CMU @ TRECVID Event Detection by Ming-yu Chen & Alex Hauptmann - PowerPoint PPT Presentation



  1. CMU @ TRECVID Event Detection. Ming-yu Chen & Alex Hauptmann, School of Computer Science, Carnegie Mellon University.

  2. CMU @ TRECVID 2009 Event Detection
     • CMU submitted all 10 event detection tasks
     • Part-based generic approach
       - Local features extracted from videos describe both appearance and motion
       - Bag-of-words features represent video content
       - Robust to action deformation, occlusion and illumination
     • Sliding window detection approach
       - Extends the part-based method to detection tasks
       - False alarm reduction is a critical task

  3. System overview

  4. MoSIFT - feature detection
     • MoSIFT detects spatial interest points at multiple scales
       - Local maxima of Difference of Gaussian (DoG)
     • MoSIFT computes optical flow to detect moving areas
     • MoSIFT selects video interest areas that are both local maxima of DoG and carry sufficient optical flow
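As a rough illustration of this detection step, the sketch below pairs OpenCV's SIFT (DoG) keypoint detector with Farneback dense optical flow and keeps only keypoints that lie in sufficiently moving regions. It is a simplified stand-in for MoSIFT, not the authors' implementation; the flow threshold is an assumed value.

```python
import cv2
import numpy as np

def detect_mosift_like_points(prev_gray, curr_gray, flow_thresh=1.0):
    """Detect DoG (SIFT) interest points and keep only those in moving
    regions, approximating the MoSIFT detection idea (simplified sketch)."""
    sift = cv2.SIFT_create()                       # DoG extrema over multiple scales
    keypoints = sift.detect(curr_gray, None)

    # Dense optical flow between consecutive frames (Farneback as a stand-in)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)       # per-pixel motion magnitude

    h, w = curr_gray.shape
    moving = []
    for kp in keypoints:
        x = int(np.clip(round(kp.pt[0]), 0, w - 1))
        y = int(np.clip(round(kp.pt[1]), 0, h - 1))
        if magnitude[y, x] > flow_thresh:          # keep only "moving" interest points
            moving.append(kp)
    return moving, flow
```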

  5. MoSIFT - feature description
     • Descriptor of shape
       - Histogram of Gradient (HoG)
       - Aggregates neighboring areas into a 4x4 grid; each grid cell is described by 8 orientations
       - 4x4x8 = 128-dimensional vector describing the shape of interest areas
     • Descriptor of motion
       - Histogram of Optical Flow (HoF), in the same format as HoG
       - 128-dimensional vector describing the motion of interest areas
     • The two are concatenated into 256-dimensional feature descriptors
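A minimal sketch of that descriptor layout: a 4x4 grid of 8-bin orientation histograms computed once over image gradients (HoG, 128-d) and once over optical-flow components (HoF, 128-d), concatenated into 256 dimensions. The grid splitting, binning and normalization details below are assumptions for illustration, not the exact MoSIFT recipe.

```python
import numpy as np

def grid_orientation_histogram(dx, dy, grid=4, bins=8):
    """4x4 grid of 8-bin orientation histograms -> 128-d vector.
    dx/dy are patches of x/y components (image gradients for HoG,
    optical-flow components for HoF)."""
    h, w = dx.shape
    angles = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi) * bins   # map to [0, bins]
    mags = np.hypot(dx, dy)
    desc = np.zeros((grid, grid, bins))
    for gy in range(grid):
        for gx in range(grid):
            ys = slice(gy * h // grid, (gy + 1) * h // grid)
            xs = slice(gx * w // grid, (gx + 1) * w // grid)
            a = angles[ys, xs].astype(int) % bins
            np.add.at(desc[gy, gx], a.ravel(), mags[ys, xs].ravel())
    v = desc.ravel()
    return v / (np.linalg.norm(v) + 1e-8)          # 128-d, L2 normalized

def mosift_descriptor(patch_gray, patch_flow):
    """Concatenate HoG (shape) and HoF (motion) into a 256-d descriptor."""
    gx = np.gradient(patch_gray.astype(float), axis=1)
    gy = np.gradient(patch_gray.astype(float), axis=0)
    hog = grid_orientation_histogram(gx, gy)                                   # 128-d shape
    hof = grid_orientation_histogram(patch_flow[..., 0], patch_flow[..., 1])   # 128-d motion
    return np.concatenate([hog, hof])                                          # 256-d
```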

  6. Event detection
     • The k-means clustering algorithm is applied to quantize the feature points extracted from videos
       - K is chosen by cross-validation
     • A video codebook is built from the clustering result
       - A visual code is a category of similar video interest points
     • A bag-of-words (BoW) feature is constructed for each video sequence (see the sketch below)
       - Soft weighting is used to construct the BoW feature
     • Event models are trained with a Support Vector Machine (SVM)
       - A chi-square kernel is applied
     • A sliding window approach creates video sequences in both the training and testing sets
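The pipeline above could be sketched with scikit-learn roughly as follows. The geometric 1/2^rank soft-weighting scheme and the MiniBatchKMeans / chi2_kernel choices are assumptions made for the illustration, not details taken from the slides.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def build_codebook(all_descriptors, k=1000):
    """Quantize MoSIFT descriptors into k visual codes (k chosen by cross-validation)."""
    return MiniBatchKMeans(n_clusters=k, random_state=0).fit(all_descriptors)

def soft_bow(descriptors, codebook, n_neighbors=4):
    """Soft-weighted bag-of-words: each descriptor votes for its 4 nearest
    visual codes with geometrically decaying weights (assumed weighting scheme)."""
    hist = np.zeros(codebook.n_clusters)
    dists = codebook.transform(descriptors)             # (n_desc, k) distances to centers
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    for rank in range(n_neighbors):
        np.add.at(hist, nearest[:, rank], 0.5 ** rank)  # weight 1 / 2^rank
    return hist / (hist.sum() + 1e-8)

def train_event_model(train_bows, labels):
    """SVM with a chi-square kernel, passed as a precomputed Gram matrix."""
    gram = chi2_kernel(train_bows)
    clf = SVC(kernel="precomputed").fit(gram, labels)
    return clf

def score_windows(clf, train_bows, test_bows):
    """Score new sliding-window BoW features against the trained event model."""
    return clf.decision_function(chi2_kernel(test_bows, train_bows))
```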

  7. Evaluation metric - DCR
     • The Normalized Detection Cost Rate (NDCR) is used to evaluate performance:
       DetectionCost(S, E) = Cost_Miss · P_Miss(S, E) + Cost_FA · R_FA(S, E)
       where P_Miss(S, E) ∈ [0, 1] is the miss probability and R_FA(S, E) ∈ [0, ∞) is the false alarm rate
     • The metric strongly penalizes false alarms
       - NDCR rewards reducing false alarms far more than it rewards detecting additional positive examples
       - Reducing false alarms is therefore extremely important for improving NDCR scores
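For concreteness, a small helper implementing the cost formula above. The default cost and target-rate constants are assumed values, not taken from the slide, and should be checked against the official TRECVID evaluation plan.

```python
def detection_cost(n_miss, n_true, n_fa, hours,
                   cost_miss=10.0, cost_fa=1.0, r_target=20.0):
    """Detection cost rate and its normalized form (NDCR).

    P_Miss = n_miss / n_true     (in [0, 1])
    R_FA   = n_fa / hours        (in [0, inf), false alarms per hour)
    DCR    = cost_miss * P_Miss + cost_fa * R_FA
    NDCR   = P_Miss + (cost_fa / (cost_miss * r_target)) * R_FA

    The constants above are assumptions for illustration.
    """
    p_miss = n_miss / n_true
    r_fa = n_fa / hours
    dcr = cost_miss * p_miss + cost_fa * r_fa
    ndcr = p_miss + (cost_fa / (cost_miss * r_target)) * r_fa
    return dcr, ndcr
```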

  8. False alarm reduction
     • Cascade architectures are widely used to reduce false alarms in detection tasks
     • We applied the cascade idea in the test phase to reduce false alarms (a sketch follows below)
       - Two positive-biased classifiers are built (due to computation cost; this can be extended to more layers)
       - Windows that pass both classifiers are predicted as positive
     • [Diagram: all windows -> M1 -> (T) -> M2 -> (T) -> detected windows; windows failing either stage (F) are rejected]
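A minimal sketch of the two-stage test-time cascade: a window survives only if both positive-biased stages accept it. Here `stage1`/`stage2` stand for any scikit-learn-style classifiers with a decision_function over window BoW features (with the precomputed chi-square kernel above you would pass kernel rows instead), and the negative thresholds are illustrative, not tuned values.

```python
import numpy as np

def cascade_filter(window_feats, stage1, stage2, thresh1=-0.5, thresh2=-0.5):
    """Test-time cascade: keep only windows accepted by both positive-biased
    classifiers. Negative thresholds bias each stage toward the positive class."""
    s1 = stage1.decision_function(window_feats) > thresh1
    s2 = stage2.decision_function(window_feats) > thresh2
    keep = s1 & s2                                 # pass both stages -> detected
    return np.where(keep)[0], np.where(~keep)[0]   # detected and rejected window indices
```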

  9. False alarm reduction (cont.)
     • Lesson from last year: the multi-scale sliding window approach produces a lot of false alarms
     • We do not apply multi-scale windows this year
     • Instead of several short positive predictions, we aggregate consecutive positive predictions into one long positive segment (see the sketch below)
       - Reduces the number of positive predictions
     • Performance improves 80% with the cascade algorithm
     • Performance improves a further 40% by concatenating short predictions into long predictions
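The aggregation step could look like the sketch below, which merges runs of consecutive positive window predictions into single segments. The 25-frame window length matches the system set-up slide; the rest is an assumption for illustration.

```python
def merge_consecutive_detections(window_starts, positives, window_len=25):
    """Aggregate runs of consecutive positive sliding-window predictions into
    single long segments, reducing the number of emitted detections.
    `window_starts` are frame indices of windows in temporal order;
    `positives` are the corresponding boolean predictions."""
    segments, current = [], None
    for start, is_pos in zip(window_starts, positives):
        if is_pos:
            if current is None:
                current = [start, start + window_len]   # open a new segment
            else:
                current[1] = start + window_len          # extend the open segment
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments
```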

  10. System set-up
     • MoSIFT features are extracted at 3 different scales every 5 frames
       - Approximately 2160 hours for a single core to extract MoSIFT features
     • A sliding window of 25 frames slides every 5 frames (enumerated in the sketch below)
     • 1000 video codes
     • Soft-weighted BoW feature representation (4 nearest clusters)
     • One-against-all SVM model for each action and each camera view
       - 50 models are built (10 actions × 5 camera views)
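A tiny helper enumerating the sliding windows described in this set-up (25-frame windows advanced every 5 frames); parameter names are ours.

```python
def sliding_windows(n_frames, window_len=25, step=5):
    """List (start, end) frame ranges for 25-frame windows stepped every 5 frames."""
    return [(s, s + window_len)
            for s in range(0, max(n_frames - window_len, 0) + 1, step)]
```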

  11. Performance comparison
     • [Chart: DCR per action (y-axis 0 to 1.4), comparing CMU against the minimum, median, and best submissions across all actions]

  12. Correct detection comparison
     • [Chart: number of correct detections per action (y-axis 0 to 700), comparing CMU against the median and maximum submissions]

  13. Performance (2008 vs. 2009)

  14. High-level feature extraction
     • Motion-related high-level features
       - 7 motion-related concepts: Airplane flying, Person playing soccer, Hand, Person playing a musical instrument, Person riding a bicycle, Person eating, People dancing
     • Mean average precision (MAP) by team: MM 0.24, PKU 0.21, TITG 0.20, CMU 0.18, FTRD 0.18, VIREO 0.18, Eurecom 0.18

  15. Conclusion & future work
     • Conclusion:
       - A generic approach to detecting events
       - MoSIFT features capture both shape and motion information
       - Performs robustly across all tasks
       - False alarm reduction is critical to improving DCR
     • Future work:
       - The approach cannot localize where the action is
       - The approach can be further fused with people tracking and global features
       - The bag-of-words representation lacks spatial constraints
