Multimedia Event Detection using GS-SVMs and Audio-HMMs




SLIDE 1

TRECVID 2011 TokyoTech+Canon

Multimedia Event Detection using GS-SVMs and Audio-HMMs

Shunsuke Sato, Canon Inc.
Nakamasa Inoue, Yusuke Kamishima, Koichi Shinoda, Department of Computer Science, Tokyo Institute of Technology

SLIDE 2

Outline

  • Motivation
  • System Overview
  • Method
      • Feature extraction
      • GS-SVM
      • Audio HMMs
  • Results
      • Best result: Minimum NDC = 0.525

SLIDE 3

Motivation

  • Two categories of event features:
      • Features that appear in every frame
      • Features that appear only in some frames
  • Combining them can improve detection performance.

Example: Flash Mob Gathering clips

Every frame:

  • Outdoor
  • Dancers
  • Road
  • Crowd
  • Crowd buzz
  • …

Some frames:

  • Dancing
  • Dance music
  • Cheering voice

SLIDE 4

Method Overview

  • For every-frame features: GS-SVM (GMM-Supervector Support Vector Machine)
      • Uses several visual and audio features
      • Soft clustering: robust against quantization errors
      • Based on our system for the TRECVID 2010 SIN task
  • For some-frame features: HMM (hidden Markov model)
      • Models temporal features in sound
      • Applies word-spotting from speech recognition
      • Uses only audio, not video

SLIDE 5

System Overview

Pipeline for a clip of test data:

  • 1. Feature Extraction: SIFT-Har, SIFT-Hes, STIP, HOG, MFCC
  • 2. GS-SVM: one GS-SVM per feature (SIFT-Har, SIFT-Hes, STIP, HOG, MFCC)
  • 3. Audio-HMM: MFCC-HMM
  • Score Fusion
  • Detection Result


SLIDE 7

Feature Extraction

  • 5 types of features, from 3 kinds of sources in each clip:
      • Still images (frames every 2 seconds): SIFT (Harris), SIFT (Hessian), HOG
      • Spatio-temporal image: STIP
      • Audio: MFCC
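The still-image sampling step ("frames every 2 seconds") can be illustrated with a short sketch. It assumes the clip duration and frame rate are known; the function name `sample_frame_indices` and its parameters are hypothetical and do not come from the slides.

```python
def sample_frame_indices(duration_s, fps, interval_s=2.0):
    """Return indices of the frames taken every `interval_s` seconds.

    duration_s: clip length in seconds (assumed known)
    fps: frames per second of the clip (assumed constant)
    """
    n_frames = int(duration_s * fps)   # total number of frames in the clip
    step = interval_s * fps            # frames between two sampled stills
    indices = []
    k = 0
    while round(k * step) < n_frames:
        indices.append(round(k * step))
        k += 1
    return indices
```

Each returned index would then be decoded to a still image before extracting SIFT and HOG descriptors.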

SLIDE 8

List of Features

Source                 | Feature        | Description
Still images           | SIFT (Harris)  | Scale-Invariant Feature Transform with Harris-affine regions [Mikolajczyk, 2004]
Still images           | SIFT (Hessian) | Scale-Invariant Feature Transform with Hessian-affine regions [Mikolajczyk, 2004]
Still images           | HOG            | 32-dimensional HOG, dense sampling (every 4 pixels)
Spatio-temporal images | STIP           | Space-Time Interest Points, HOG and HOF features extracted [Laptev, 2005]
Audio                  | MFCC           | Mel-frequency cepstral coefficients, audio features for speech recognition


SLIDE 10

GMM Supervector SVM (GS-SVM)

  • Represent the distribution of each feature:
      • Each clip is modeled by a GMM (Gaussian mixture model)
      • A supervector is derived from the GMM parameters
      • An SVM (support vector machine) is trained on the supervectors

Pipeline: Features → Gaussian Mixture Model → Supervector → SVM → Score

SLIDE 11

GMM Estimation

  • The GMM is estimated by maximum a posteriori (MAP) adaptation of the mean vectors.
  • Universal background model (UBM): a prior GMM estimated from all video data.

[Equation: MAP adaptation of the mean vectors]
[Figure: UBM → MAP adaptation; each UBM mean is moved to its adapted mean]
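MAP adaptation of the mean vectors can be sketched as follows. The sketch assumes the commonly used relevance-factor form of the update, in which each adapted mean is (tau × UBM mean + posterior-weighted sum of the clip's features) / (tau + sum of posteriors), and it assumes diagonal covariances. The function names, the parameter `tau`, and its default value are illustrative assumptions, not values from the slides.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def map_adapt_means(features, weights, means, variances, tau=20.0):
    """Return MAP-adapted mean vectors for one clip.

    features: list of D-dimensional feature vectors from the clip
    weights, means, variances: UBM parameters for K diagonal Gaussians
    tau: relevance factor (hypothetical default, must be > 0)
    """
    n_comp = len(weights)
    dim = len(means[0])
    counts = [0.0] * n_comp                          # soft counts per component
    sums = [[0.0] * dim for _ in range(n_comp)]      # posterior-weighted sums
    for x in features:
        lik = [weights[k] * gaussian_pdf_diag(x, means[k], variances[k])
               for k in range(n_comp)]
        total = sum(lik)
        if total == 0.0:
            continue  # point has zero density under every component
        for k in range(n_comp):
            post = lik[k] / total                    # posterior of component k
            counts[k] += post
            for d in range(dim):
                sums[k][d] += post * x[d]
    return [
        [(tau * means[k][d] + sums[k][d]) / (tau + counts[k]) for d in range(dim)]
        for k in range(n_comp)
    ]
```

With few features assigned to a component, its adapted mean stays close to the UBM mean; with many, it moves toward the clip's own data.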

SLIDE 12

GMM Supervector

  • GMM supervector: the concatenation of the normalized mean vectors of the adapted GMM.

[Equation: supervector definition and normalized mean]
[Figure: UBM → MAP adaptation → supervector]
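Building a supervector from adapted means can be sketched as below. The slides mention a "normalized mean" but the formula is not preserved, so this sketch assumes a widely used normalization: each adapted mean is scaled by the square root of the UBM mixture weight and divided by the UBM standard deviation (diagonal covariances). The function name and argument layout are hypothetical.

```python
import math


def gmm_supervector(adapted_means, ubm_weights, ubm_variances):
    """Concatenate normalized adapted means into a single vector.

    Assumed normalization (not confirmed by the slides):
    sqrt(weight_k) * mean_k / sqrt(variance_k), element-wise.
    """
    supervector = []
    for mean, weight, var in zip(adapted_means, ubm_weights, ubm_variances):
        scale = math.sqrt(weight)
        supervector.extend(scale * m / math.sqrt(v) for m, v in zip(mean, var))
    return supervector
```

For K mixture components and D-dimensional features, the result has K × D elements, one fixed-length vector per clip regardless of clip length.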

SLIDE 13

Score Fusion in GS-SVM

  • GS-SVMs use RBF kernels. [Equation: RBF kernel]
  • Score: weighted average of the SVM outputs over the features {SIFT-Har, SIFT-Hes, HOG, STIP, MFCC}. [Equation: weighted average]
  • The fusion weights are decided by 2-fold cross validation based on:
      • Minimum Normalized Detection Cost (Run 1 & Run 2)
      • Average Precision (Run 3)
  • In Run 4, the weight is equal for all features.
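The RBF kernel and the weighted-average fusion can be sketched as below. The exact kernel form and weight values are not preserved in the slides, so this assumes the standard RBF kernel exp(-gamma × squared distance) on supervectors and a weighted average normalized by the sum of the weights. The function names, `gamma`, and the dictionary layout are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Standard RBF kernel between two supervectors (gamma is an assumed parameter)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def fuse_scores(scores, weights=None):
    """Weighted average of per-feature SVM scores.

    scores: dict mapping feature name (e.g. "STIP") to that SVM's output
    weights: dict with the same keys; None means equal weights, as in Run 4
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total
```

For example, `fuse_scores({"STIP": 1.0, "HOG": 0.0}, {"STIP": 3.0, "HOG": 1.0})` weights STIP three times as heavily as HOG. Choosing the weights by cross validation would mean evaluating such fused scores on held-out folds for each candidate weight setting.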


SLIDE 15

Audio HMM

Training:

  • 1. Label an event period manually for each event clip
  • 2. Train an event HMM using MFCC

Test:

  • 1. Find the likelihood LE of the event period by word-spotting
  • 2. Find the likelihood LG of the event period under a garbage model estimated from all video data
  • 3. Calculate the likelihood ratio LE/LG as the detection score

[Figure: clips with event-period labels train the event HMM; at test time the event HMM detects an event period, and the score is computed from its likelihood and the likelihood under the garbage model]
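The test-time score LE/LG can be sketched in the log domain, where the ratio becomes a difference of log-likelihoods. This is a simplified sketch under stated assumptions: the event period is already given (word-spotting itself is not implemented), observations are one-dimensional rather than MFCC vectors, emissions are single Gaussians per state, and the garbage model is represented as another HMM. All function names and the tuple layout of a model are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    peak = max(values)
    if peak == float("-inf"):
        return peak
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def hmm_log_likelihood(obs, log_init, log_trans, means, variances):
    """Forward algorithm in the log domain; returns log P(obs | HMM)."""
    n_states = len(log_init)
    alpha = [log_init[s] + log_gauss(obs[0], means[s], variances[s])
             for s in range(n_states)]
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[r] + log_trans[r][s] for r in range(n_states)])
            + log_gauss(x, means[s], variances[s])
            for s in range(n_states)
        ]
    return logsumexp(alpha)


def detection_score(obs, event_hmm, garbage_hmm):
    """log(LE / LG) for one candidate event period.

    Each model is a tuple (log_init, log_trans, means, variances).
    """
    return hmm_log_likelihood(obs, *event_hmm) - hmm_log_likelihood(obs, *garbage_hmm)
```

A positive score means the event HMM explains the period better than the garbage model. Working with log-likelihoods avoids the numerical underflow that raw likelihoods would cause on long audio segments.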

SLIDE 16

Preliminary result of Audio HMMs

  • The HMM score is fused with the GS-SVM score by weighted average.
  • Audio HMMs are effective in 3 events, so they are used in Run 1.

[Figure: bar chart comparing "GS-SVM only" and "GS-SVM + HMM" for each event, axis from 0.0 to 0.7. Events: Birthday party, Changing a vehicle tire (*), Flash mob gathering, Getting a vehicle unstuck, Grooming an animal, Making a sandwich (*), Parade, Parkour (*), Repairing an appliance, Working on a sewing project]


SLIDE 18

Experiments

  • Run 3 was the best; GS-SVM was effective.
  • Run 1 (Audio-HMM) did not perform well.
  • Run 2, with weights decided by Minimum NDC, did not perform well.
      • Simple cross validation may have failed.

[Figure: Mean Minimum NDC of the TRECVID 2011 MED runs, axis from 0.5 to 1.5]

  • Run 1 (Run 2 & Audio-HMM), primary: 12th
  • Run 2 (Minimum NDC weighting): 10th
  • Run 4 (no weighting): 8th
  • Run 3 (Average Precision weighting): 7th
  • 3rd among participating teams
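The Minimum NDC metric used to rank the runs can be sketched as follows. The sketch assumes the usual normalized detection cost definition, a weighted sum of the miss and false-alarm rates divided by the cost of the better trivial system, minimized over all decision thresholds. The default constants (`cost_miss=80`, `cost_fa=1`, `p_target=0.001`) are assumed to match the MED 2011 evaluation plan and should be checked against it; the function name is hypothetical.

```python
def min_ndc(scores, labels, cost_miss=80.0, cost_fa=1.0, p_target=0.001):
    """Minimum normalized detection cost over all thresholds.

    scores: detection scores (higher means more likely to contain the event)
    labels: True for clips that contain the event, False otherwise
    """
    n_target = sum(1 for y in labels if y)
    n_non = len(labels) - n_target
    if n_target == 0 or n_non == 0:
        raise ValueError("need at least one target and one non-target clip")
    norm = min(cost_miss * p_target, cost_fa * (1.0 - p_target))
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    false_alarms = 0
    # Threshold above every score: every target is missed, no false alarms.
    best = cost_miss * p_target / norm
    i = 0
    while i < len(ranked):
        # Accept all clips tied at this score before evaluating the cost.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                hits += 1
            else:
                false_alarms += 1
            j += 1
        p_miss = 1.0 - hits / n_target
        p_fa = false_alarms / n_non
        ndc = (cost_miss * p_miss * p_target
               + cost_fa * p_fa * (1.0 - p_target)) / norm
        best = min(best, ndc)
        i = j
    return best
```

Lower is better: a perfect ranking gives 0, and a system no better than rejecting every clip gives 1 under the assumed constants.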

SLIDE 19

Effect of each feature in GS-SVM

  • STIP and HOG had better performance.
  • MFCC was effective when combined with STIP and HOG.

[Figure: Mean Minimum NDC (axis from 0.2 to 1) for combinations of 1, 2, 3, and 4 feature types and for all features, with a grid marking which of SIFT-Har, SIFT-Hes, MFCC, STIP, and HOG each combination uses (checked: used; black: not used). Labeled combinations: STIP, STIP+HOG, STIP+HOG+MFCC, all]

SLIDE 20

Why did the Audio HMM not work?

  • It failed to capture temporal features.
      • Each state represents a specific sound, such as a drum or cheering, which may appear in non-events and/or at random.
  • The test data include many sounds that do not appear in the training and development data.

[Figure: difference in Minimum NDC with and without Audio HMMs (axis from -0.1 to 0.1) for Flash mob gathering, Parade, and Repairing an appliance, comparing the preliminary experiment with the official evaluation]

SLIDE 21

Conclusion

  • We combine GS-SVMs and Audio HMMs.
  • GS-SVMs are effective for MED.
      • STIP, HOG, and MFCC are important.
  • Audio HMMs are not effective.
      • They cannot capture temporal features.
      • The variety of sounds is larger than expected.
  • Future work
      • Include other features, such as Dense SIFT
      • Improve the HMM-based sound detection
      • Model event subclasses and their relationships