High-Level Feature Extraction Using SIFT GMMs, Audio Models, and MFoM



SLIDE 1

COLLABORATIVE TEAM for TRECVID 2009

High-Level Feature Extraction Using SIFT GMMs, Audio Models, and MFoM

Ilseo Kim, Chin-Hui Lee
Department of Computer Science, Georgia Institute of Technology

Nakamasa Inoue, Shanshan Hao, Tatsuhiko Saito, Koichi Shinoda
Department of Computer Science, Tokyo Institute of Technology

SLIDE 2

Outline

  • 1. SIFT Gaussian mixture models (GMMs) and audio models
  • 2. Text representation of images
  • 3. Multi-Class Maximal Figure-of-Merit (MC MFoM) classifier to combine 1 & 2

Best result: Mean InfAP = 0.168

SLIDE 3

  • 1. SIFT GMMs and Audio Models

SLIDE 4

SIFT Feature Extraction

Extract SIFT features from all the image frames with Harris-Affine / Hessian-Affine regions.

Apply PCA to reduce the dimension [128 dim → 32 dim].

[Figure: each shot is processed with Harris-Affine and Hessian-Affine regions, each followed by PCA]
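
The PCA step above can be sketched as follows. This is a minimal NumPy illustration, not the team's implementation: the function names `fit_pca` and `apply_pca` are hypothetical, and random vectors stand in for real SIFT descriptors.

```python
import numpy as np

def fit_pca(descriptors, n_components=32):
    """Learn a PCA projection from an (n, 128) array of descriptors."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(descriptors, mean, components):
    """Project descriptors onto the retained principal directions."""
    return (descriptors - mean) @ components.T

rng = np.random.default_rng(0)
sift = rng.random((1000, 128))          # stand-in for real SIFT descriptors
mean, components = fit_pca(sift)
reduced = apply_pca(sift, mean, components)   # shape (1000, 32)
```

In practice the projection would be learned once on training descriptors and then reused for every shot.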

SLIDE 5

SIFT Gaussian Mixture Models

Model SIFT features by a Gaussian Mixture Model (GMM).

Robustness against quantization errors that occur in hard-assignment clustering in the BoW approach is expected.

Probability density function (pdf) of SIFT GMM:

$$p(x \mid \theta) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

  • $K$: num. of mixtures (512)
  • $w_k$: mixing coefficient
  • $\mathcal{N}$: pdf of Gaussian
  • $\mu_k$: mean vector
  • $\Sigma_k$: variance matrix
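
As a rough illustration of evaluating such a mixture density, the sketch below computes the log pdf in NumPy. It assumes diagonal variance matrices (the slide does not say whether they are diagonal or full) and uses a toy 2-component model instead of 512 mixtures; `gmm_log_pdf` is a hypothetical name.

```python
import numpy as np

def gmm_log_pdf(x, weights, means, variances):
    """log p(x) under a GMM with diagonal covariances (an assumption).

    x: (n, d); weights: (K,); means, variances: (K, d).
    """
    diff = x[:, None, :] - means[None, :, :]                          # (n, K, d)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)   # (K,)
    log_gauss = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)
    weighted = np.log(weights) + log_gauss                            # (n, K)
    # Log-sum-exp over components for numerical stability.
    top = weighted.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(weighted - top).sum(axis=1, keepdims=True)))[:, 0]

weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
log_p = gmm_log_pdf(np.array([[0.0, 0.0], [3.0, 3.0]]), weights, means, variances)
```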

SLIDE 6

SIFT Gaussian Mixture Models

Maximum A Posteriori (MAP) adaptation

[Figure: a SIFT GMM UBM (Universal Background Model) is trained on all videos; the SIFT GMM for each shot is obtained from the UBM by MAP adaptation]
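
The slide does not give the adaptation formula. The sketch below shows one common form of MAP adaptation, in which only the component means are moved toward the shot's data. The mean-only update, the diagonal variances, and the relevance factor of 16 are all assumptions made for illustration.

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """Posterior probability of each UBM component for each feature vector."""
    diff = x[:, None, :] - means[None, :, :]
    log_p = (np.log(weights)
             - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
             - 0.5 * np.sum(diff ** 2 / variances, axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Shift UBM means toward the data of one shot (mean-only MAP)."""
    gamma = responsibilities(x, weights, means, variances)    # (n, K)
    n_k = gamma.sum(axis=0)                                   # soft counts
    data_mean = (gamma.T @ x) / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]                # adaptation rate
    return alpha * data_mean + (1.0 - alpha) * means

ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_vars = np.ones((2, 2))
rng = np.random.default_rng(0)
shot = rng.normal(1.0, 0.1, size=(200, 2))    # toy features of one shot
adapted = map_adapt_means(shot, ubm_weights, ubm_means, ubm_vars)
```

Components that see little data from the shot stay close to the UBM, which is what makes the adapted models comparable across shots.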

SLIDE 7

Classification

Distance between SIFT GMMs: weighted sum of Mahalanobis distances

[Equation: distance between the SIFT GMMs of the s-th and t-th shots, defined with the UBM]

SVM classification with probability outputs

[Equation: kernel function]

Finally, we obtain the posterior probability.
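
One plausible reading of the distance and kernel is sketched below. It assumes that the shot GMMs differ from the UBM only in their means, that the UBM supplies the mixing weights and diagonal variances, and that the kernel is an exponential of the negative distance; none of these details survive in the slide text, and `gamma` is a hypothetical scale parameter.

```python
import numpy as np

def gmm_distance(means_s, means_t, ubm_weights, ubm_variances):
    """Weighted sum of squared Mahalanobis distances between component means."""
    diff = means_s - means_t                                  # (K, d)
    per_component = np.sum(diff ** 2 / ubm_variances, axis=1)
    return float(np.sum(ubm_weights * per_component))

def gmm_kernel(means_s, means_t, ubm_weights, ubm_variances, gamma=1.0):
    """RBF-style kernel built on the GMM distance (assumed form)."""
    return float(np.exp(-gamma * gmm_distance(means_s, means_t,
                                              ubm_weights, ubm_variances)))

ubm_weights = np.array([0.5, 0.5])
ubm_vars = np.ones((2, 2))
shot_s = np.array([[0.0, 0.0], [5.0, 5.0]])
shot_t = np.array([[1.0, 0.0], [5.0, 4.0]])
k_st = gmm_kernel(shot_s, shot_t, ubm_weights, ubm_vars)
```

A Gram matrix built this way could be passed to an SVM that accepts precomputed kernels and produces probability outputs, for example scikit-learn's `SVC(kernel="precomputed", probability=True)`.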

SLIDE 8

Audio Models

Features: Mel-Frequency Cepstral Coefficients (MFCCs)
Models: Hidden Markov Models (HMMs)

[Figure: MFCC extraction pipeline (FFT, spectrum, filter bank, Log, DCT, MFCCs)]

Feature extraction process

  • 1. Frame extraction
  • 2. Windowing [Hamming window]
  • 3. Fast Fourier transform (FFT)
  • 4. Mel scale filter bank
  • 5. Logarithmic transform
  • 6. Discrete cosine transform (DCT)
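
The six steps above can be sketched in plain NumPy as follows. The sample rate, frame length, hop size, number of filters, and number of cepstral coefficients are illustrative assumptions; the slide does not state the values the team used.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            bank[i - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=24, n_ceps=13):
    # 1. Frame extraction
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx]
    # 2. Windowing with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 3. FFT (power spectrum)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # 4. Mel scale filter bank
    energies = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    # 5. Logarithmic transform
    log_energies = np.log(energies + 1e-10)
    # 6. DCT (type II), keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * np.arange(n_ceps)[:, None] / n_filters)
    return log_energies @ basis.T

t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t))   # one second of a 440 Hz tone
```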

SLIDE 9

Hidden Markov Models

Ergodic HMMs (2 states, GMMs with 512 mixtures)

Log of likelihood ratio

[Figure: an HMM UBM is trained on all videos; the HMM for the target HLF is trained on videos of the target HLF]

SLIDE 10

Hidden Markov Models

Ergodic HMMs (2 states, GMMs with 512 mixtures)

Log of likelihood ratio

[Figure: for each shot, the likelihoods under the UBM and the target HMM are computed and combined into the log of likelihood ratio]
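
A minimal sketch of the log of likelihood ratio for one shot is given below. To stay short it uses a single diagonal Gaussian per state rather than a 512-mixture GMM, and all parameter values are toy assumptions; only the 2-state ergodic structure and the ratio itself come from the slide.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a diagonal Gaussian, summed over feature dimensions."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def hmm_log_likelihood(x, log_start, log_trans, means, variances):
    """Forward algorithm in the log domain for a sequence x of shape (T, d)."""
    # log_emit[t, j] = log p(x_t | state j)
    log_emit = np.stack([log_gaussian(x, m, v)
                         for m, v in zip(means, variances)], axis=1)
    alpha = log_start + log_emit[0]
    for t in range(1, len(x)):
        # Sum over previous states (rows) for each next state (columns).
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_emit[t]
    return float(np.logaddexp.reduce(alpha))

log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])   # ergodic: every transition allowed
var = np.ones((2, 1))
shot = np.full((50, 1), 2.5)                   # toy 1-D feature sequence

ll_target = hmm_log_likelihood(shot, log_start, log_trans,
                               np.array([[2.0], [3.0]]), var)
ll_ubm = hmm_log_likelihood(shot, log_start, log_trans,
                            np.array([[0.0], [1.0]]), var)
llr = ll_target - ll_ubm    # log of likelihood ratio, target vs. UBM
```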

SLIDE 11

Combination of SIFT GMMs and Audio Models

Outputs from

  • audio models
  • SIFT GMMs with Harris-Affine regions
  • SIFT GMMs with Hessian-Affine regions

Log of likelihood ratio and posterior probability

Combined log of likelihood ratio

[Equation: combined log of likelihood ratio with weight parameters]

Optimize the weight parameters by 2-fold cross validation.
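
A sketch of how fusion weights might be selected is shown below. It assumes the combined score is a weighted sum with non-negative weights that sum to one, searches a coarse grid, and picks the weights with the best mean average precision over two folds. The constraint, the grid, and the use of plain average precision in place of inferred AP are all assumptions, and the data are synthetic.

```python
import numpy as np

def average_precision(scores, labels):
    """Plain (non-inferred) average precision of a ranked list."""
    order = np.argsort(-scores)
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / max(hits.sum(), 1))

def weight_grid(step=0.1):
    """Candidate weight triples on the simplex (non-negative, summing to 1)."""
    ticks = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    return [(a, b, max(0.0, 1.0 - a - b))
            for a in ticks for b in ticks if a + b <= 1.0 + 1e-9]

def fuse(outputs, weights):
    """Weighted sum of the per-system outputs; outputs has shape (shots, 3)."""
    return outputs @ np.asarray(weights)

def select_weights(outputs, labels):
    """Pick the weights with the best mean AP over two folds."""
    half = len(labels) // 2
    folds = [slice(0, half), slice(half, None)]
    def mean_ap(w):
        return np.mean([average_precision(fuse(outputs[f], w), labels[f])
                        for f in folds])
    return max(weight_grid(), key=mean_ap)

rng = np.random.default_rng(0)
labels = (rng.random(200) < 0.3).astype(int)
outputs = np.column_stack([labels + rng.normal(0, 0.5, 200),   # informative system
                           rng.normal(size=200),               # noise
                           rng.normal(size=200)])              # noise
weights = select_weights(outputs, labels)
```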

SLIDE 12

Combination of SIFT GMMs and Audio Models

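Outputs from

  • audio models
  • SIFT GMMs with Harris-Affine regions
  • SIFT GMMs with Hessian-Affine regions

Log of likelihood ratio and posterior probability

[Equation: relation between the log of likelihood ratio and the posterior probability, with a constant term]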
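
One standard way to put a posterior probability on the same scale as a log of likelihood ratio is Bayes' rule, under which the two differ only by a constant set by the class prior. Whether the equation lost from this slide was exactly this relation is an assumption; `posterior_to_llr` and its `prior` argument are hypothetical names.

```python
import numpy as np

def posterior_to_llr(posterior, prior, eps=1e-12):
    """log p(x|HLF)/p(x|not HLF) = logit(posterior) - logit(prior).

    The second term does not depend on x, so it acts as a constant offset.
    """
    p = np.clip(posterior, eps, 1.0 - eps)
    return np.log(p / (1.0 - p)) - np.log(prior / (1.0 - prior))

llr = posterior_to_llr(np.array([0.05, 0.2, 0.9]), prior=0.2)
```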

SLIDE 14

  • 2. Text Representation of Images and MC MFoM Classifier

SLIDE 15

Text Representation of Images

  • Segmentation
  • Extract low-level features: Object, Color, Texture, Shape
  • Clustering
  • Image representation with visual alphabets
  • Feature vector: counts of visual terms (unigrams and bigrams or more)
  • Dimensionality reduction: apply LSA
  • MC-ML learning: Concept 1, Concept 2, ..., Concept n

[Figure: an image segmented into regions, each labeled with a visual-alphabet index]
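
The counting and LSA steps can be illustrated as below. The sketch treats each image as a one-dimensional sequence of visual-alphabet indices, which simplifies away the two-dimensional layout shown on the slide, and performs LSA as a truncated SVD of the count matrix. The function names and the toy token sequences are assumptions.

```python
import numpy as np
from collections import Counter

def term_counts(tokens):
    """Count unigrams and bigrams in a sequence of visual-alphabet indices."""
    counts = Counter((t,) for t in tokens)
    counts.update(zip(tokens, tokens[1:]))
    return counts

def build_matrix(documents):
    """Image-by-term count matrix over the terms seen in all images."""
    counts = [term_counts(doc) for doc in documents]
    vocab = sorted(set().union(*counts))
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(documents), len(vocab)))
    for row, c in enumerate(counts):
        for term, n in c.items():
            matrix[row, index[term]] = n
    return matrix, vocab

def lsa(matrix, n_dims):
    """Project each image onto the top n_dims latent dimensions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]

images = [[1, 1, 1, 4, 4, 1], [4, 9, 9, 4, 40, 38], [40, 40, 21, 21, 21, 38]]
matrix, vocab = build_matrix(images)
reduced = lsa(matrix, n_dims=2)
```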

SLIDE 16

MC MFoM Classifier

Multi-Class (MC) learning approach

  • The MC learning approach can learn a classifier even when there are too few positive samples, as in the HLF extraction task of TRECVID 2009.

Maximal Figure-of-Merit (MFoM) classifier

  • The MFoM classifier can directly optimize an objective performance metric such as m-F1 or MAP by approximating discrete functions with continuous ones and applying the generalized probabilistic descent (GPD) algorithm.

SLIDE 17

MC MFoM Learning Scheme

  • The parameter set is estimated by directly optimizing an objective performance metric with a linear classifier.
  • Given N concepts and a D-dimensional image representation, the decision rule for concept j uses a geometric average of the scores of all concepts competing with concept j.

[Equation: linear classifier and decision rule]
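
Since the formulas did not survive extraction, the sketch below shows one way such a decision rule could look. It assumes linear scores per concept and a log-domain power mean of the exponentiated competing scores, which tends to the geometric average as `eta` approaches zero. The exact form on the slide is unknown, and `eta` and all function names are hypothetical.

```python
import numpy as np

def class_scores(x, weights, biases):
    """Linear score w_j . x + b_j for each of the N concepts."""
    return weights @ x + biases

def competing_score(scores, j, eta=1.0):
    """Log-domain power mean of exp(score) over all concepts other than j.

    As eta -> 0 this tends to the log of the geometric average.
    """
    others = np.delete(scores, j)
    return float(np.log(np.mean(np.exp(eta * others))) / eta)

def accept(scores, j, eta=1.0):
    """Accept concept j when its score beats the competing score."""
    return bool(scores[j] - competing_score(scores, j, eta) > 0)

weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # N = 3, D = 2
biases = np.zeros(3)
scores = class_scores(np.array([2.0, 0.0]), weights, biases)  # [2, 0, -2]
decisions = [accept(scores, j) for j in range(3)]
```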

SLIDE 18

MC MFoM Learning Scheme

  • A misclassification function is defined, together with the condition under which a correct decision is made.
  • Discrete functions are approximated by continuous functions by introducing a sigmoid function.
  • Most commonly used metrics can then be represented with these approximations and directly optimized with the GPD algorithm.

[Equations: misclassification function; sigmoid approximation]
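
To make the idea concrete, the sketch below smooths the F1 measure of a single binary concept with a sigmoid and improves it by gradient descent. It is a simplified stand-in rather than the MC MFoM algorithm: it uses one concept with a fixed competing score of zero, batch updates, and finite-difference gradients in place of the analytic GPD updates. The values of `alpha`, `lr`, and `steps` are assumptions.

```python
import numpy as np

def sigmoid(d, alpha=5.0):
    return 1.0 / (1.0 + np.exp(-alpha * d))

def smoothed_f1(params, x, y, alpha=5.0):
    """F1 with the accept/reject indicator replaced by a sigmoid."""
    w, b = params[:-1], params[-1]
    d = -(x @ w + b)                     # misclassification value; d < 0 accepts
    accepted = 1.0 - sigmoid(d, alpha)   # smooth stand-in for 1(d < 0)
    tp = np.sum(accepted * y)
    fp = np.sum(accepted * (1 - y))
    fn = np.sum((1.0 - accepted) * y)
    return 2 * tp / (2 * tp + fp + fn)

def numerical_gradient(f, params, h=1e-5):
    """Central finite differences, used here only to keep the sketch short."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = h
        grad[i] = (f(params + step) - f(params - step)) / (2 * h)
    return grad

def train(x, y, steps=200, lr=0.5):
    """Gradient descent on the loss 1 - smoothed F1."""
    params = np.zeros(x.shape[1] + 1)
    loss = lambda p: 1.0 - smoothed_f1(p, x, y)
    for _ in range(steps):
        params -= lr * numerical_gradient(loss, params)
    return params

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (80, 2))])
y = np.concatenate([np.ones(20), np.zeros(80)])
initial_f1 = smoothed_f1(np.zeros(3), x, y)
final_f1 = smoothed_f1(train(x, y), x, y)
```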

SLIDE 19

  • 3. MFoM Fusion

SLIDE 20

Discriminant Fusion Scheme

Model Based Transformation (MBT) fusion

Given N concepts, N score functions are learned by an MC MFoM classifier. Taking the N score functions as the basis for the transformation, we obtain a new N-dimensional feature. A new MC MFoM classifier can then be trained using M×N-dimensional features.
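
The transformation can be sketched as follows, reading M as the number of systems being fused (the slide does not define M). Each system maps its own input representation to N concept scores, and the scores are concatenated into one M×N-dimensional vector. The linear score functions and random inputs are placeholders.

```python
import numpy as np

def mbt_features(score_functions, inputs):
    """Concatenate the N concept scores produced by each of the M systems.

    score_functions[m] maps system m's own feature vector to N scores.
    """
    return np.concatenate([f(x) for f, x in zip(score_functions, inputs)])

rng = np.random.default_rng(0)
n_concepts = 3
w_a = rng.normal(size=(n_concepts, 4))    # system A uses 4-dim inputs
w_b = rng.normal(size=(n_concepts, 5))    # system B uses 5-dim inputs
systems = [lambda x: w_a @ x, lambda x: w_b @ x]
features = mbt_features(systems, [rng.normal(size=4), rng.normal(size=5)])
```

The resulting vectors would then serve as training input for the second-stage classifier.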

SLIDE 21

Rank fusion

Reference experiment for comparison with MFoM fusion.

The rank numbers from different systems are combined to get a new rank number:

$$R(x) = \sum_{i} w_i \, r_i(x)$$

  • $r_i(x)$: the rank number of shot x in the ranked output of classification system i
  • $w_i$: the weight assignment to system i

2-fold cross validation is used to determine the weight parameters.
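
A small sketch of rank fusion follows, assuming the new rank number is a weighted sum of the per-system rank numbers. The scores and weights are toy values; in the experiment the weights would come from 2-fold cross validation.

```python
import numpy as np

def rank_numbers(scores):
    """Rank number of each shot within one system (1 = highest score)."""
    order = np.argsort(-scores)
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def fuse_ranks(score_lists, weights):
    """Return shot indices sorted by the weighted sum of rank numbers."""
    ranks = np.stack([rank_numbers(s) for s in score_lists])   # (systems, shots)
    combined = np.asarray(weights) @ ranks
    return np.argsort(combined)

system_a = np.array([0.9, 0.1, 0.5])
system_b = np.array([0.8, 0.3, 0.2])
fused_order = fuse_ranks([system_a, system_b], weights=[0.7, 0.3])
```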

SLIDE 22

  • 4. Experiment

SLIDE 23

Result

Run name                    System                                  MInfAP
A_TITGT-Titech-1_4          SIFT GMMs + Audio models (no fusion)    0.168
A_TITGT-Fusion-score-2_3    MFoM (MBT fusion) 1                     0.152
A_TITGT-Fusion-score-1_2    MFoM (MBT fusion) 2                     0.149
A_TITGT-Fusion-rank_1       Rank fusion                             0.147
A_TITGT-Gatech-Ftr_5        Visual word + MFoM (no fusion)          0.108
A_TITGT-Titech-1_6          Local + Global features (no fusion)     0.023

  • Mean InfAP of SIFT GMMs + Audio models was 0.168, which ranked 11th among all A-type runs and 4th among all participating teams.

  • The MFoM fusion works better than the rank fusion.

SLIDE 25

Result cont.

[Figure: results for SIFT GMMs + Audio (A_TITGT-Titech-1_4), Visual word + MFoM (A_TITGT-Gatech-Ftr_5), and Fusion best (A_TITGT-Fusion-score-2_3), with Max and Median]

Combination with audio is effective for the HLF extraction.

  • Good: Singing (0.229), People-dancing (0.319), People-playing-a-musical-instruments (0.155), Female-human-face-closeup (0.266).

SIFT GMMs represent HLFs together with their background.

  • Good: Airplane_flying (0.138), Boat_Ship (0.250).

SLIDE 26

Conclusion

Combination of SIFT GMMs and audio models is effective for the HLF extraction (Mean InfAP = 0.168).

  • SIFT GMMs work well for various HLFs.
  • Audio models can detect HLFs in a complementary way.

It is difficult to fuse different systems.

Future work

  • Further improved collaborative work
  • Using time/spatial region information