
Multimedia Event Detection using GS-SVMs and Audio-HMMs

TRECVID 2011 TokyoTech+Canon Multimedia Event Detection using GS-SVMs and Audio-HMMs Shunsuke Sato Nakamasa Inoue, Yusuke Kamishima, Canon Inc. Koichi Shinoda, Department of Computer Science, Tokyo Institute of Technology TRECVID 2011


  1. TRECVID 2011 TokyoTech+Canon Multimedia Event Detection using GS-SVMs and Audio-HMMs Shunsuke Sato Nakamasa Inoue, Yusuke Kamishima, Canon Inc. Koichi Shinoda, Department of Computer Science, Tokyo Institute of Technology

  2. Outline
     • Motivation
     • System Overview
     • Method
       - Feature extraction
       - GS-SVM
       - Audio HMMs
     • Results
       - Best result: Minimum NDC = 0.525

  3. Motivation
     • Two event feature categories:
       - Features that appear in every frame
       - Features that appear only in some frames
     • Their combination can improve the detection performance.
     Example: Flash Mob Gathering clips
     • Every frame: Outdoor, Dancers, Road, Crowd
     • Some frames: Crowd buzz, Dancing …, Dance music, Cheering voice

  4. Method Overview
     • For every-frame features: GS-SVM (GMM-Supervector Support Vector Machine)
       - Uses several visual and audio features
       - Soft clustering, which is robust against quantization errors
       - Based on our system for the TRECVID 2010 SIN task
     • For some-frame features: HMM (Hidden Markov Model)
       - Models temporal features in sound
       - Applies word spotting, as used in speech recognition
       - Uses only audio, not video

  5. System Overview (pipeline diagram)
     Input: a clip of test data
     1. Feature Extraction: SIFT-Har, SIFT-Hes, HOG, STIP, MFCC
     2. GS-SVM: one GS-SVM per feature
     3. Audio-HMM: MFCC-HMM
     Score Fusion -> Detection Result

  7. Feature Extraction
     • 5 types of features, from 3 kinds of sources
       - Still images (frames every 2 seconds): SIFT (Harris), SIFT (Hessian), HOG
       - Spatio-temporal image (clip along time t): STIP
       - Audio: MFCC

  8. List of Features
     source | feature | description
     Still images | SIFT (Harris), SIFT (Hessian) | Scale-Invariant Feature Transform with Harris-affine regions and Hessian-affine regions [Mikolajczyk, 2004]
     Still images | HOG | 32-dimensional HOG; dense sampling (every 4 pixels)
     Spatio-temporal images | STIP | Space-Time Interest Points; HOG and HOF features extracted [Laptev, 2005]
     Audio | MFCC | Mel-frequency cepstral coefficients; audio features for speech recognition

  10. GMM Supervector SVM (GS-SVM)
      • Represent the distribution of each feature
        - Each clip is modeled by a GMM (Gaussian Mixture Model)
        - Derive a supervector from the GMM parameters
        - Train an SVM (Support Vector Machine) on the supervectors
      Diagram: Features -> Gaussian Mixture Model -> Supervector -> SVM -> Score

  11. GMM Estimation
      • Each clip's GMM is estimated by maximum a posteriori (MAP) adaptation of the mean vectors: the UBM's mean of each Gaussian is updated to an adapted mean.
      • Universal background model (UBM): a prior GMM which is estimated by using all video data.
      Diagram: UBM -> MAP adaptation
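The equation on this slide is not present in the transcript. As a hedged reference only, MAP adaptation of GMM means is commonly written as below. The relevance factor \tau, the posteriors c_{ik}, the descriptors x_1, ..., x_n of one clip, and the UBM parameters (w_k, \mu_k^{UBM}, \Sigma_k) are notation assumed here, not symbols taken from the slide, and the authors' exact form may differ.

```latex
\hat{\mu}_k
  = \frac{\tau\,\mu_k^{\mathrm{UBM}} + \sum_{i=1}^{n} c_{ik}\,x_i}
         {\tau + \sum_{i=1}^{n} c_{ik}},
\qquad
c_{ik}
  = \frac{w_k\,\mathcal{N}(x_i \mid \mu_k^{\mathrm{UBM}}, \Sigma_k)}
         {\sum_{j=1}^{K} w_j\,\mathcal{N}(x_i \mid \mu_j^{\mathrm{UBM}}, \Sigma_j)}
```

With few descriptors assigned to Gaussian k, the adapted mean stays close to the UBM mean; with many, it moves toward the clip's own data.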

  12. GMM Supervector
      • GMM supervector: the combination of the normalized mean vectors of the adapted GMM.
      Diagram: UBM -> MAP adaptation -> supervector
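A minimal sketch of how a clip could be turned into a GMM supervector, assuming diagonal covariances, a relevance factor `tau`, and the commonly used normalization sqrt(w_k) * mu_k / sigma_k. The function names, `tau`, and the normalization are assumptions for illustration; the slides do not give them.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, tau=20.0):
    """MAP-adapt UBM means to the descriptors X (n, d) of one clip.

    weights: (K,), means: (K, d), variances: (K, d) diagonal covariances.
    tau is an assumed relevance factor.
    """
    diff = X[:, None, :] - means[None, :, :]                      # (n, K, d)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None]
    )                                                             # (n, K)
    log_post = np.log(weights)[None] + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    c = np.exp(log_post)                                          # posteriors c_ik
    C = c.sum(axis=0)                                             # soft counts per Gaussian
    return (tau * means + c.T @ X) / (tau + C)[:, None]

def gmm_supervector(weights, adapted_means, variances):
    """Stack the normalized adapted means into one vector of length K * d."""
    normalized = np.sqrt(weights)[:, None] * adapted_means / np.sqrt(variances)
    return normalized.ravel()

# Toy usage: a 2-Gaussian UBM in 2 dimensions and one clip near the second Gaussian.
ubm_w = np.array([0.5, 0.5])
ubm_mu = np.array([[0.0, 0.0], [10.0, 10.0]])
ubm_var = np.ones((2, 2))
clip = np.full((100, 2), 11.0)
adapted = map_adapt_means(clip, ubm_w, ubm_mu, ubm_var)
supervector = gmm_supervector(ubm_w, adapted, ubm_var)
```

Because every descriptor is softly assigned to all Gaussians, no hard codebook quantization is involved, which matches the "soft clustering" point made for GS-SVM.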

  13. Score Fusion in GS-SVM
      • GS-SVMs use RBF kernels.
      • Score: weighted average of the SVM outputs over the feature set {SIFT-Har, SIFT-Hes, HOG, STIP, MFCC}.
      • The weights are decided by 2-fold cross validation based on:
        - Minimum Normalized Detection Cost: Run 1 and Run 2
        - Average Precision: Run 3
      • In Run 4, the weight is equal for all features.
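The kernel and the fusion rule can be sketched as follows. This is an illustration under assumptions: `gamma`, the dictionary layout, and the normalization by the sum of the weights are hypothetical choices, and the per-feature SVM outputs are taken as given numbers rather than produced by a trained SVM.

```python
import numpy as np

def rbf_kernel(phi_a, phi_b, gamma):
    """RBF kernel between two supervectors; gamma is an assumed hyperparameter."""
    return float(np.exp(-gamma * np.sum((phi_a - phi_b) ** 2)))

def fuse_scores(svm_scores, weights):
    """Weighted average of per-feature SVM outputs (both dicts keyed by feature name)."""
    total = sum(weights[f] for f in svm_scores)
    return sum(weights[f] * svm_scores[f] for f in svm_scores) / total

# Toy usage with made-up SVM outputs for one clip.
features = ["SIFT-Har", "SIFT-Hes", "HOG", "STIP", "MFCC"]
outputs = dict(zip(features, [0.2, 0.1, 0.6, 0.7, 0.4]))
equal_weights = {f: 1.0 for f in features}   # Run 4 style: the same weight for every feature
fused = fuse_scores(outputs, equal_weights)
```

For the other runs, the weights would instead be chosen by cross validation against the stated criterion (Minimum NDC or Average Precision).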

  15. Audio HMM
      Training:
      1. Label an event period manually for each event clip
      2. Train an event HMM using MFCC
      Test:
      1. Find the likelihood L_E of the event period by word spotting
      2. Find the likelihood L_G of the event period for a garbage model estimated from all video data
      3. Calculate the likelihood ratio L_E / L_G as the detection score
      Diagram: period labels are used to train the event HMM; the event period is detected with the event HMM and scored against the garbage model.
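The scoring step can be illustrated with a small log-domain forward algorithm. This sketch only scores one already chosen period; the word-spotting search for the event period is not shown. The function names and the representation of the models (per-frame state log-likelihoods, initial and transition log-probabilities) are assumptions, and the ratio L_E / L_G is computed as a difference of log-likelihoods.

```python
import numpy as np

def log_forward(log_b, log_pi, log_A):
    """Log-likelihood of a frame sequence under an HMM.

    log_b:  (T, S) log-likelihood of each frame under each state
    log_pi: (S,)   initial state log-probabilities
    log_A:  (S, S) transition log-probabilities, row = from, column = to
    """
    alpha = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[t]
    return float(np.logaddexp.reduce(alpha))

def detection_score(log_L_E, log_L_G):
    """Log of the likelihood ratio L_E / L_G."""
    return log_L_E - log_L_G

# Toy usage: a 2-state event HMM and a 1-state garbage model over 2 frames.
half = np.log(0.5)
log_L_E = log_forward(np.full((2, 2), half), np.full(2, half), np.full((2, 2), half))
log_L_G = log_forward(np.array([[-1.0], [-2.0]]), np.array([0.0]), np.array([[0.0]]))
score = detection_score(log_L_E, log_L_G)
```

Working in the log domain avoids numerical underflow on long audio segments.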

  16. Preliminary result of Audio HMMs
      • Fuse the HMM score with the GS-SVM score by weighted average.
      • Audio HMMs are effective in 3 events, so they are used in Run 1.
      Figure: bar chart comparing "GS-SVM only" and "GS-SVM + HMM" per event (horizontal axis from 0.0 to 0.7). Events: Birthday party, Changing a vehicle tire, Flash mob gathering (*), Getting a vehicle unstuck, Grooming an animal, Making a sandwich, Parade (*), Parkour, Repairing an appliance (*), Working on a sewing project.

  18. Experiments
      • Run 3 was the best; GS-SVM was effective.
      • Run 1 (Audio-HMM) did not show good performance.
      • Run 2, with weights decided by Minimum NDC, did not perform well.
        - Simple cross validation may have failed.
      • 3rd among participating teams.
      Figure: Mean Minimum NDC of the TRECVID 2011 MED runs (vertical axis from 0 to 1.5), with our runs ranked:
      • Run 3 (Average Precision weighting): 7th
      • Run 4 (no weighting): 8th
      • Run 2 (Minimum NDC weighting): 10th
      • Run 1 (Run 2 & Audio-HMM, primary): 12th
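For readers unfamiliar with the metric, here is a hedged sketch of Minimum NDC for one event. The cost parameters (C_Miss = 80, C_FA = 1, P_Target = 0.001) are recalled from the MED evaluation setup and should be treated as assumptions to verify against the official evaluation plan; the function name and the threshold sweep are illustrative.

```python
import numpy as np

def min_ndc(scores, labels, c_miss=80.0, c_fa=1.0, p_target=0.001):
    """Minimum Normalized Detection Cost over all decision thresholds.

    scores: detection scores, higher means more likely to be the event
    labels: 1 for clips that contain the event, 0 otherwise
    The default cost parameters are assumed values, not taken from the slides.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    best = np.inf
    # Try every observed score as a threshold, plus +inf (reject everything).
    for threshold in np.append(np.unique(scores), np.inf):
        detected = scores >= threshold
        p_miss = np.sum(labels & ~detected) / n_pos
        p_fa = np.sum(~labels & detected) / n_neg
        ndc = (c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)) / norm
        best = min(best, ndc)
    return float(best)

# Toy usage: positives ranked above negatives give a cost of zero.
perfect = min_ndc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
# A positive ranked below the only negative: rejecting everything is the best choice.
inverted = min_ndc([0.1, 0.9], [1, 0])
```

Lower is better: 0 is a perfect detector, and 1 corresponds to a system that rejects every clip.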

  19. Effect of each feature in GS-SVM
      • STIP and HOG had better performance.
      • MFCC was effective when combined with STIP and HOG.
      Figure: Mean Minimum NDC (vertical axis from 0.2 to 1) for feature combinations, grouped by the number of feature types used (1 type, 2 types, 3 types, 4 types, all). Labeled combinations: STIP (1 type), STIP+HOG (2 types), STIP+HOG+MFCC (3 types). A grid below the chart marks which of SIFT-Har, SIFT-Hes, MFCC, STIP, and HOG each combination uses (checked: used; black: not used).

  20. Why did Audio HMM not work?
      • It failed to capture temporal features.
        - Each state represents a specific sound, such as a drum or cheering, which may appear in non-event clips and/or at random.
      • Test data include many sounds that do not appear in the training and development data.
      Figure: difference in Minimum NDC with and without Audio HMMs (axis from 0.1 to -0.1) for Flash mob gathering, Parade, and Repairing an appliance, comparing the Preliminary Experiment with the Official Evaluation.

  21. Conclusion
      • We combine GS-SVM and Audio HMM.
      • GS-SVMs are effective for MED.
        - STIP, HOG, and MFCC are important.
      • Audio HMMs are not effective.
        - They cannot capture temporal features.
        - The variety of sounds is larger than expected.
      • Future work
        - Include other features, such as Dense SIFT
        - Improve the HMM-based sound detection
        - Model event subclasses and their relationship
