Semantic Indexing Using GMM Supervectors and Tree-structured GMMs


SLIDE 1

TRECVID 2011 TokyoTech+Canon

Semantic Indexing Using GMM Supervectors and Tree-structured GMMs

Nakamasa Inoue, Koichi Shinoda, Department of Computer Science, Tokyo Institute of Technology

SLIDE 2

Outline

System overview: a fast, high-performance semantic indexing system

  • 6 types of audio and visual features
  • Gaussian mixture model (GMM) supervectors
  • Tree-structured GMMs

Best result: Mean InfAP = 17.3%

SLIDE 3

System Overview

Fast and high-performance semantic indexing system

[Figure: system pipeline. video (shot) → local features 1) SIFT-Har, 2) SIFT-Hes, 3) SIFTH-Dense, 4) HOG-Dense, 5) HOG-Sub, 6) MFCC → tree-structured GMMs → GMM supervectors → SVM score → score fusion]

SLIDE 4

System Overview

[Figure: the system pipeline from Slide 3, repeated with the local feature extraction stage highlighted.]

SLIDE 5

Local Feature Extraction

1) SIFT-Har

  • Harris-affine detector: an extension of the Harris corner detector [Mikolajczyk, 2004]

  • Multi-frame (every other frame)

2) SIFT-Hes

  • Hessian-affine detector
  • Multi-frame (every other frame)
Average number of features:

Feature type    avg. #features per frame    avg. #features per shot
SIFT-Har        247                         19,536
SIFT-Hes        240                         18,986

SLIDE 6

Local Feature Extraction

3) SIFTH-Dense

  • SIFT + Hue histogram
  • 30,000 samples from a key-frame

4) HOG-Dense

  • 32-dimensional HOG
  • 10,000 samples from a key-frame

5) HOG-Sub

  • Dense HOG features extracted from temporal subtraction images
  • Captures movement
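The temporal subtraction image behind HOG-Sub can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists and takes the absolute pixel-wise difference of two frames. The slide does not say which frames are subtracted or whether the difference is signed, so both choices are assumptions, and `temporal_subtraction` is a hypothetical helper name.

```python
from typing import List

Frame = List[List[int]]  # grayscale image as rows of pixel intensities


def temporal_subtraction(prev: Frame, curr: Frame) -> Frame:
    """Absolute pixel-wise difference of two equally sized frames (assumed form)."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same shape")
    return [[abs(c - p) for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(prev, curr)]


# A bright pixel moves one step to the right between the two frames.
frame_a = [[0, 255, 0],
           [0, 0, 0]]
frame_b = [[0, 0, 255],
           [0, 0, 0]]
diff = temporal_subtraction(frame_a, frame_b)
print(diff)  # non-zero only where the content changed
```

Dense HOG descriptors would then be computed on `diff` instead of on the key-frame itself, so static background contributes little to the feature.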

SLIDE 7

Local Feature Extraction

6) MFCC

  • Mel-frequency cepstral coefficients (MFCC)
  • Audio features commonly used for speech recognition
  • Targets: Speaking, Singing, etc.

Feature vector: MFCC(12), ΔMFCC(12), ΔΔMFCC(12), Δlog-power(1), ΔΔlog-power(1)

SLIDE 8

System Overview

[Figure: the system pipeline from Slide 3, repeated with the tree-structured GMM and GMM supervector stage highlighted.]

SLIDE 9

Gaussian Mixture Models (GMMs)

Each shot is modeled by a GMM.

[Figure: UBM* → fast MAP adaptation → GMM of the shot; legend: local features, GMM parameters]

GMM parameters are estimated by fast maximum a posteriori (MAP) adaptation.

*Universal background model (UBM): a prior GMM estimated from all video data.

SLIDE 10

Gaussian Mixture Models (GMMs)

(Basic) MAP adaptation for mean vectors:

[Equation: MAP update of each mean vector, expressed through the responsibility of each Gaussian component for each local feature]

Computational cost: high

[Figure: UBM* → fast MAP adaptation]

*Universal background model (UBM): a prior GMM estimated from all video data.
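As a rough illustration of why the basic update is costly, here is a minimal sketch of the widely used relevance-MAP form for mean vectors, mu_hat_k = (tau * mu_k + sum_i c_ik * x_i) / (tau + sum_i c_ik), where c_ik is the responsibility of component k for feature x_i. The exact formula on the slide is not reproduced here, so this form, the diagonal covariances, the relevance factor `tau` and all function names are assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def log_gauss_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def responsibilities(x, weights, means, variances) -> Vector:
    """c_k = w_k N(x | mu_k, var_k) / sum_j w_j N(x | mu_j, var_j), via log-sum-exp."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(features, weights, means, variances, tau: float = 20.0) -> List[Vector]:
    """Relevance-MAP update of the UBM means (tau is an assumed hyperparameter)."""
    K, D = len(means), len(means[0])
    soft_count = [0.0] * K
    soft_sum = [[0.0] * D for _ in range(K)]
    for x in features:  # every feature is scored against every component: O(N * K)
        for k, c in enumerate(responsibilities(x, weights, means, variances)):
            soft_count[k] += c
            for d in range(D):
                soft_sum[k][d] += c * x[d]
    return [[(tau * means[k][d] + soft_sum[k][d]) / (tau + soft_count[k])
             for d in range(D)] for k in range(K)]


# Toy UBM with two 1-D components; all shot features lie near the second one.
ubm_w = [0.5, 0.5]
ubm_mu = [[0.0], [10.0]]
ubm_var = [[1.0], [1.0]]
shot = [[11.0], [12.0], [11.5]]
adapted = map_adapt_means(shot, ubm_w, ubm_mu, ubm_var, tau=1.0)
print(adapted)
```

The inner loop evaluates all K Gaussians for each of the N local features. With tens of thousands of features per shot, that full evaluation is the cost the tree-structured GMMs are meant to reduce.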

SLIDE 11

Gaussian Mixture Models (GMMs)

[Figure: Gaussian components; legend: responsibility of a component for a local feature]

  • Tree-structured GMMs calculate the responsibilities quickly!


SLIDE 13

Tree-structured GMMs

Calculate responsibilities quickly.

[Nakamasa Inoue, Koichi Shinoda, “A Fast MAP Adaptation Technique for GMM-supervector-based Video Semantic Indexing Systems,” in Proc. of ACM Multimedia (short paper), 2011]

[Figure: tree of Gaussian components; legend: responsibility of a component for a local feature]

SLIDE 14

Tree-structured GMMs

Leaf layer: each leaf node holds one Gaussian of the UBM (prior GMM).

[Figure: tree of Gaussian components, leaf layer]

SLIDE 15

Tree-structured GMMs

Non-leaf layers: each non-leaf node holds a Gaussian that approximates its descendant Gaussians.

[Figure: tree of Gaussian components, non-leaf layers]
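The slides do not say how a non-leaf Gaussian is fitted to its descendants. One common choice, shown here only as an assumed illustration, is moment matching: the parent takes the total weight, the weighted mean and the matching variance of its children. `merge_gaussians` and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Gaussian = Tuple[float, List[float], List[float]]  # (weight, mean, diagonal variance)


def merge_gaussians(children: List[Gaussian]) -> Gaussian:
    """Moment-matched parent with the same weight, mean and variance as the child mixture."""
    total_w = sum(w for w, _, _ in children)
    dim = len(children[0][1])
    mean = [sum(w * m[d] for w, m, _ in children) / total_w for d in range(dim)]
    # E[x^2] of the child mixture minus the squared parent mean gives the parent variance.
    var = [sum(w * (v[d] + m[d] ** 2) for w, m, v in children) / total_w - mean[d] ** 2
           for d in range(dim)]
    return total_w, mean, var


leaves = [(0.25, [0.0], [1.0]), (0.25, [2.0], [1.0])]
parent = merge_gaussians(leaves)
print(parent)
```

The parent is broader than either child because its variance also absorbs the spread between the child means, which is what lets one evaluation at the parent stand in for the whole subtree.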



SLIDE 19

Fast MAP Adaptation

Calculate responsibilities quickly.

  • 1. Initialize the set of active nodes.

[Figure: tree of Gaussian components; legend: set of active nodes, active node]

SLIDE 20

Fast MAP Adaptation

Calculate responsibilities quickly.

  • 2. Make the children of the active nodes active.
  • 3. Keep only the nodes that satisfy the activation condition.

[Figure: tree of Gaussian components; legend: active node]


SLIDE 23

Fast MAP Adaptation

Summary of the algorithm

  • 1. Initialize the set of active nodes to the root node.
  • 2. Make the children of the active nodes active.
  • 3. Calculate the responsibilities of the active nodes and keep only the nodes that satisfy the activation condition.
  • 4. Go to 5 if all active nodes are leaves; otherwise return to 2.
  • 5. Output GMM parameters.
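The five steps can be sketched as a top-down search over the tree. This is a minimal sketch under stated assumptions, not the authors' implementation: the activation condition is taken to be "responsibility among the current candidates exceeds a threshold", pruned leaves are treated as having zero responsibility, and `Node`, `threshold` and the function names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    weight: float
    mean: List[float]
    var: List[float]                     # diagonal covariance
    children: List["Node"] = field(default_factory=list)
    leaf_id: Optional[int] = None        # index of the UBM component at a leaf


def log_score(x: List[float], n: Node) -> float:
    """log(weight * N(x | mean, diag(var)))."""
    return math.log(n.weight) - 0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, n.mean, n.var))


def normalise(x: List[float], nodes: List[Node]) -> List[float]:
    """Responsibilities of the given nodes for x, normalised over those nodes only."""
    logs = [log_score(x, n) for n in nodes]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fast_responsibilities(x: List[float], root: Node, threshold: float = 0.001) -> dict:
    active = [root]                                     # 1. initialise with the root
    while any(n.children for n in active):              # 4. stop once all are leaves
        # 2. replace every non-leaf active node by its children
        candidates = [c for n in active for c in (n.children or [n])]
        # 3. score the candidates; keep those above the (assumed) threshold
        resp = normalise(x, candidates)
        kept = [n for n, r in zip(candidates, resp) if r > threshold]
        active = kept or [candidates[resp.index(max(resp))]]
    # 5. responsibilities over the surviving leaves only
    return {n.leaf_id: r for n, r in zip(active, normalise(x, active))}


# Toy tree: four 1-D UBM leaves grouped under two moment-matched parents.
leaves = [Node(0.25, [m], [1.0], leaf_id=i) for i, m in enumerate([0.0, 1.0, 10.0, 11.0])]
left = Node(0.5, [0.5], [1.25], children=leaves[:2])
right = Node(0.5, [10.5], [1.25], children=leaves[2:])
root = Node(1.0, [5.5], [26.25], children=[left, right])
result = fast_responsibilities([10.2], root)
print(result)
```

In this toy run the left subtree is pruned after a single evaluation of its parent, so only two of the four leaf Gaussians are ever scored. The saving grows with the number of components that sit under a pruned node.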

SLIDE 24

Fast MAP Adaptation

Summary of the algorithm

  • 5. Output GMM parameters

[Equation: adapted GMM parameters, defined over the active nodes]

SLIDE 25

Fast MAP Adaptation

Calculation time for MAP adaptation

  • 4.2 times faster than without tree-structured GMMs
  • No decrease in accuracy

Optimized tree: the tree with the shortest calculation time on the training data. Trees of depth at most 5 with at most 5 children per node were tested.

[Table: Mean InfAP (%) on the TRECVID 2010 dataset]

SLIDE 26

GMM Supervector

Combine normalized mean vectors.

[Figure: UBM → fast MAP adaptation → GMM supervector]

[Equation: the GMM supervector is the concatenation of the normalized mean vectors]
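A minimal sketch of building a supervector from adapted means follows. The normalization used here, sqrt(w_k) * mu_hat_k / sqrt(var_k) per dimension with UBM weights and diagonal variances, is the form commonly used for GMM supervectors. The slide's own formula is not reproduced, so treat the normalization and the name `gmm_supervector` as assumptions.

```python
import math
from typing import List


def gmm_supervector(weights: List[float],
                    adapted_means: List[List[float]],
                    variances: List[List[float]]) -> List[float]:
    """Concatenate sqrt(w_k) * mean_k / sqrt(var_k) over all components (assumed normalization)."""
    supervector: List[float] = []
    for w, mean, var in zip(weights, adapted_means, variances):
        supervector.extend(math.sqrt(w) * m / math.sqrt(v) for m, v in zip(mean, var))
    return supervector


sv = gmm_supervector(weights=[0.25, 0.75],
                     adapted_means=[[2.0, 4.0], [1.0, 3.0]],
                     variances=[[4.0, 1.0], [1.0, 9.0]])
print(len(sv), sv)
```

The result has K * D entries for K components of dimension D, so every shot maps to a fixed-length vector regardless of how many local features it contains. That fixed length is what allows a standard SVM to be trained on it.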

SLIDE 27

System Overview

[Figure: the system pipeline from Slide 3, repeated with the SVM and score fusion stage highlighted.]

SLIDE 28

Score Fusion

SVMs are trained with RBF kernels.

Score fusion

Linear combination of SVM scores:

[Equation: fused score as a weighted sum of the SVM scores]

Combination coefficients are optimized on a validation set (IACC_1_tv10_training for training, and IACC_1_A for validation).

SLIDE 29

Experimental Condition

TokyoTech_Canon_1

6 features, 3 parameters for RBF-kernel

(18 SVMs for one semantic concept)

TokyoTech_Canon_2

6 features, the parameter h is fixed to 1.0

(6 SVMs for one semantic concept)

SLIDE 30

Experimental Condition

TokyoTech_Canon_3

Scores for all semantic concepts are combined:

[Equation: combined score, in which one term is the score for concept S’]

(i.e. 6 * 346 SVMs for one semantic concept)

TokyoTech_Canon_4

Additional training data from ImageNET

(i.e. 6+1 SVMs for one semantic concept; the additional SVM is trained on the TRECVID+ImageNET dataset with HOG-Dense features)
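The SVM counts quoted for the four runs follow from simple arithmetic. The sketch below only restates the numbers on the slides; the constant names are hypothetical, and reading 346 as the number of semantic concepts is an interpretation of "6 * 346".

```python
NUM_FEATURES = 6        # feature types listed on the slides
NUM_KERNEL_PARAMS = 3   # RBF-kernel parameter settings used in run 1
NUM_CONCEPTS = 346      # interpreted as the number of semantic concepts (run 3)

svms_per_concept = {
    "TokyoTech_Canon_1": NUM_FEATURES * NUM_KERNEL_PARAMS,  # one SVM per feature and setting
    "TokyoTech_Canon_2": NUM_FEATURES,                      # kernel parameter fixed
    "TokyoTech_Canon_3": NUM_FEATURES * NUM_CONCEPTS,       # scores of all concepts combined
    "TokyoTech_Canon_4": NUM_FEATURES + 1,                  # one extra SVM on added data
}
for run, count in svms_per_concept.items():
    print(run, count)
```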

SLIDE 31

Results

Run ID               Method                                                           Mean InfAP
TokyoTech_Canon_1    6 audio/visual GMM supervectors, 3 parameters for RBF-kernels    17.3%
TokyoTech_Canon_2    Fixed parameter for RBF-kernels                                  17.3%
TokyoTech_Canon_3    2nd run + concept score fusion                                   17.2%
TokyoTech_Canon_4    2nd run + ImageNET images (Type D)                               16.4%

SLIDE 32

InfAP by Semantic Concepts

SLIDE 33

Which features are important?

SLIDE 34

Conclusion

6 types of audio and visual GMM supervectors

Mean InfAP: 17.3%

  • Single feature: 12.4% (SIFT-Har (multi-frame))
  • 3 features: 16.6% (SIFT-Har, SIFTH-Dense, MFCC)
  • No audio: 16.4% (5 visual features)

Fast MAP adaptation

Tree-structured GMMs cut MAP adaptation costs. 4.2 times faster than without tree-structured GMMs.

Future work

Human actions and event detection. Spatial and temporal localization.

SLIDE 35

Thank you!