
SLIDE 1

MediaMill TRECVID 2009 17‐11‐2009 http://www.MediaMill.nl

Multi-Frame, Multi-Modal, and Multi-Kernel Concept Detection in Video

Cees G.M. Snoek¹, Koen E.A. van de Sande¹, Jasper R.R. Uijlings¹, Miguel Bugalho², Isabel Trancoso², Fei Yan³, Muhammed A. Tahir³, Krystian Mikolajczyk³, Josef Kittler³, Theo Gevers¹, Dennis C. Koelma¹, Arnold W.M. Smeulders¹

Conclusions TRECVID 2008

Good settings for Bag-of-Words

– SIFT + colorSIFT improves ~ 8%
– Soft codebook assignment improves ~ 7%
– Multi-frame analysis improves ~ 20%
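Soft codebook assignment can be illustrated with a minimal sketch. It assumes a Gaussian weighting of the Euclidean distance between each descriptor and each codeword, in the spirit of the Visual Word Ambiguity reference. The function name, the `sigma` parameter and the toy data are hypothetical and are not taken from the MediaMill system.

```python
import numpy as np

def soft_assign_histogram(descriptors, codebook, sigma=1.0):
    """Bag-of-words histogram with soft (Gaussian-weighted) codebook assignment.

    descriptors: (n, d) array, codebook: (k, d) array.
    sigma is an assumed smoothing parameter, not a value from the slides.
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    diff = descriptors[:, None, :] - codebook[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)              # shape (n, k)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Each descriptor spreads a total mass of 1 over all codewords.
    weights /= weights.sum(axis=1, keepdims=True)
    hist = weights.sum(axis=0)
    return hist / hist.sum()

# Toy data: 50 random 8-dimensional descriptors and a 4-word codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))
descriptors = rng.normal(size=(50, 8))
hist = soft_assign_histogram(descriptors, codebook)
```

With hard assignment each descriptor would vote for its nearest codeword only; here every codeword receives a share of the vote, so no bin is left empty.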

SLIDE 2


Myth: TRECVID incremental only

> 100% improvement in just 3 years

State-of-the-Art

Snoek et al, TRECVID 2008; Van de Sande et al, PAMI 2010; Van Gemert et al, PAMI 2010

SLIDE 3


State-of-the-Art

Snoek et al, TRECVID 2008; Van de Sande et al, PAMI 2010; Van Gemert et al, PAMI 2010

Software available for download at http://colordescriptors.com

Our TRECVID 2009 focus

– Spatio‐temporal sampling
– Visual feature extraction
– Codebook transform
– Kernel‐based learning
– Audio concept detection

SLIDE 4


Our TRECVID 2009 focus

– Spatio‐temporal sampling
– Visual feature extraction
– Codebook transform
– Kernel‐based learning
– Multi-kernel learning
– Audio concept detection

Roadmap

– Spatio‐temporal sampling
– Visual feature extraction
– Codebook transform
– Kernel‐based learning
– Audio concept detection

SLIDE 5


1,000,000 frames analyzed

Snoek et al, ICME 2005

Multi-frame biggest improvement in 2008

– Extend further by analyzing up to 10 extra i-frames/shot
– Yields 1M frames to analyze for the test set collection

Need to speed-up by being “smart and strong”

– Speed-up feature extraction
– Speed-up quantization
– Speed-up kernel-based learning
– Speed-up by computing


SLIDE 6


Fast dense descriptors

Uijlings et al, CIVR 2009

[Figure: an image patch is turned into the final descriptor from its pixel-wise responses R by linear interpolation, shown as the product A × R × Bᵀ. Annotations: reuse subregions; 16x speed-up; 2x speed-up.]
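The matrix form on this slide can be sketched as follows, reading it as the product A × R × Bᵀ, where R holds the pixel-wise responses of a patch. This is only an illustration under stated assumptions: A and B here are plain 0/1 summation matrices over a regular grid of subregions, whereas the slide mentions linear interpolation, whose weights would replace the ones. The function name and the toy input are hypothetical.

```python
import numpy as np

def subregion_sums(responses, n_rows, n_cols):
    """Sum pixel-wise responses over an n_rows x n_cols grid of subregions.

    responses: (H, W) array with H divisible by n_rows and W by n_cols.
    Uses 0/1 summation matrices; interpolation weights could replace the ones.
    """
    h, w = responses.shape
    cell_h, cell_w = h // n_rows, w // n_cols
    # A[i, y] = 1 when pixel row y falls in subregion row i.
    A = np.kron(np.eye(n_rows), np.ones((1, cell_h)))   # (n_rows, H)
    # B[j, x] = 1 when pixel column x falls in subregion column j.
    B = np.kron(np.eye(n_cols), np.ones((1, cell_w)))   # (n_cols, W)
    return A @ responses @ B.T                           # (n_rows, n_cols)

# Toy 4x4 response map split into a 2x2 grid of subregions.
R = np.arange(16, dtype=float).reshape(4, 4)
sums = subregion_sums(R, 2, 2)
```

Because A and B depend only on the patch layout, they can be built once and reused for every patch and every response map, which is one way to read "reuse subregions".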


SLIDE 7


Fast quantization

Moosmann, PAMI 2008; Uijlings et al, CIVR 2009

Random forests

– Randomized process makes it very fast to build
– Tree structure allows fast vector quantization
– Logarithmic rather than linear projection time

Real-time BoW

– When used with fast dense sampling
– SURF 2x2 descriptor instead of 4x4
– RBF kernel

GPU-empowered quantization

Van de Sande et al, ASCI 2009

Achieve data-parallelism by writing Euclidean distance in vector form

[Figure: time per image (s) versus number of SIFT descriptors per image (5,000 to 20,000) for CPU Xeon (3.4 GHz), CPU Opteron 250 (2.4 GHz), CPU Core 2 Duo 6400 (2.13 GHz), CPU Core i7 (2.66 GHz), GPU GeForce 8800GTX (128 cores) and GPU GeForce GTX260 (216 cores). Annotation: 17x speed-up, GPU versus CPU.]
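Writing the Euclidean distance in vector form means expanding ||x − c||² as ||x||² − 2 x·c + ||c||², so that all descriptor-to-codeword distances come out of a single matrix product. The sketch below shows the idea with NumPy on the CPU; it is not the GPU implementation from the cited work, and the function names and toy sizes are assumptions made for illustration.

```python
import numpy as np

def squared_distances(X, C):
    """All pairwise squared Euclidean distances between rows of X and rows of C."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, evaluated for every pair at once.
    x_norm = np.sum(X ** 2, axis=1)[:, None]   # (n, 1)
    c_norm = np.sum(C ** 2, axis=1)[None, :]   # (1, k)
    d2 = x_norm - 2.0 * (X @ C.T) + c_norm     # (n, k)
    # Rounding can produce tiny negative values; clip them to zero.
    return np.maximum(d2, 0.0)

def quantize(X, C):
    """Index of the nearest codeword for each descriptor (hard assignment)."""
    return np.argmin(squared_distances(X, C), axis=1)

# Toy data: 100 descriptors of length 128 and a 16-word codebook.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 128))
C = rng.normal(size=(16, 128))
words = quantize(X, C)
```

The matrix product `X @ C.T` dominates the cost and is the kind of dense, regular operation that maps well onto data-parallel hardware.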

SLIDE 8


SVM pre-computed kernel trick

Use distance between feature vectors

– Feature length easily > 100,000

Increase efficiency significantly

– Pre-compute the SVM kernel matrix
– Long vectors possible as we only need 2 in memory
– Parameter optimization re-uses pre-computed matrix

SLIDE 9


GPU-empowered pre-computed kernel

Van de Sande et al, ASCI 2009

1. Compute average distances per N² kernel sub-block
2. Compute kernel function values

[Figure: time (s) versus total feature vector length (20,000 to 140,000) for 1x Core i7 (2.66 GHz), 1x, 4x, 16x and 25x Opteron 250 (2.4 GHz), and a GPU. Annotations: 65x speed-up, 3x speed-up; markers for 1 CPU, 4 CPU and GPU.]
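The two steps of the pre-computed kernel can be sketched as below. Several details are assumptions rather than facts from the slides: the distance is taken to be the chi-square distance between histograms, the normalisation uses the mean over the whole distance matrix rather than per sub-block, and the function names and block size are hypothetical.

```python
import numpy as np

def chi2_distance_matrix(features, block=64):
    """Pairwise chi-square distances, filled one sub-block at a time."""
    n = features.shape[0]
    distances = np.zeros((n, n))
    for i in range(0, n, block):
        for j in range(0, n, block):
            a = features[i:i + block, None, :]
            b = features[None, j:j + block, :]
            num = (a - b) ** 2
            den = a + b
            # Bins that are empty in both histograms contribute zero.
            terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
            distances[i:i + block, j:j + block] = 0.5 * terms.sum(axis=2)
    return distances

def precomputed_kernel(distances):
    """Step 2: turn distances into kernel values, scaled by the mean distance."""
    return np.exp(-distances / distances.mean())

# Toy data: 10 normalised histograms with 32 bins each.
rng = np.random.default_rng(0)
F = rng.random((10, 32))
F /= F.sum(axis=1, keepdims=True)
D = chi2_distance_matrix(F, block=4)
K = precomputed_kernel(D)
```

Only the current pair of blocks needs to be in memory while the matrix is filled, and the finished matrix `K` can be handed unchanged to any SVM solver that accepts a pre-computed kernel, for every parameter setting tried.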

Computing

2009 system much more efficient than 2008 system

– 6x more visual data analyzed using less compute power

Some best estimates:

– Visual feature extraction: 8400 Processor-Node-Hours (PNH)
– Training concept detectors: 4000 PNH
– Applying concept detectors: ~ 1 week GPU

SLIDE 10


Audio concept detection

Bugalho et al, InterSpeech 2009; Trancoso et al, ICME 2009

External sound corpus: ~ 100 concepts

[Figure: audio processing pipeline. Labels: Feature extraction; SVM classification; Speech / Non Speech; Music Detector; Telephone detector; Speaker ID; Audio Segmentation; Reasoning. Example outputs: speech, female voice, ...; monologue, dialogue, ...; music; events; low frequency; sirens, water, ...]

Early fusion of features

– MFCCs (+ deltas), PLPs (+ deltas), Brightness, Bandwidth, ZCR, Pitch, Harmonicity, Shifted delta cepstra, Audio spectrum envelope and flatness
– 0.50s window length, with 0.25s spacing
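The windowing quoted above (0.50s windows spaced 0.25s apart) implies that consecutive analysis windows overlap by half. A minimal sketch of the resulting window boundaries follows; the function name and the choice to keep only windows that fit entirely inside the signal are assumptions, not details from the slides.

```python
def window_bounds(duration_s, window_s=0.50, hop_s=0.25):
    """Return (start, end) times in seconds of all fully contained windows."""
    bounds = []
    index = 0
    # The small tolerance guards against floating-point rounding at the end.
    while index * hop_s + window_s <= duration_s + 1e-9:
        start = index * hop_s
        bounds.append((start, start + window_s))
        index += 1
    return bounds

# Windows for a 2-second audio segment.
windows = window_bounds(2.0)
```

Each window would yield one early-fusion feature vector, so a longer shot simply produces more vectors to classify.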

SLIDE 11


Multi-kernel learning

Tahir et al, ICCV-Subspace 2009; Yan et al, ICDM 2009

Kernel Discriminant Analysis combined with spectral regression [Tahir09]

– We use SR-KDA with 6 visual kernels
– Weighted output combined using SUM rule

Multi-Kernel Fisher Discriminant Analysis

– We use non-sparse L2 MK-FDA [Yan09]
– Fusion of 1 audio and 6 visual kernels
– 20 audio concept detector scores used as input for RBF kernel
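The fusion of several kernels can be pictured as a weighted sum of kernel matrices. The sketch below only shows that combination step and assumes that "non-sparse L2" refers to kernel weights constrained to unit L2 norm, so that every kernel keeps a non-zero share. In MK-FDA the weights are learned; here they are fixed by hand, and the function name and toy kernels are hypothetical.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices with weights rescaled to unit L2 norm."""
    beta = np.asarray(weights, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights must be non-negative")
    beta = beta / np.linalg.norm(beta)
    combined = sum(b * K for b, K in zip(beta, kernels))
    return combined, beta

# Toy setup: 6 "visual" kernels and 1 "audio" kernel over 5 samples,
# each built as a linear kernel on random features (placeholder data).
rng = np.random.default_rng(0)
kernels = []
for _ in range(7):
    feats = rng.normal(size=(5, 3))
    kernels.append(feats @ feats.T)

K_fused, beta = combine_kernels(kernels, [1.0] * 7)
```

A non-negative weighted sum of valid kernels is again a valid kernel, which is why the fused matrix can be used wherever a single kernel would be.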

SLIDE 12


Experiments (all type A)

Baseline: single-frame SFS on all visual kernels

Experiment 1: multi-modal & multi-kernel

– SR-KDA (visual only)
– MK-FDA (audiovisual fusion)

Experiment 2: multi-frame

– Visual fusion: 5 extra i-frames + MAX fusion [donated]
– Best-of: 1 to 10 extra i-frames + MAX/AVG fusion
– SFS: all multi-frame visual kernel combinations

Results: experiment 1

Multi-kernel improves upon baseline: ~ 9%

Multi-modal kernel outperforms uni-modal kernel only slightly: ~ 2%

– …but for specific (audiovisual) concepts more impressive improvement, up to 50%

Audio not decisive here?

SLIDE 13


Results: experiment 2

Multi-frame is true performance booster, improvement over baseline: ~ 30%

Best to select optimal number of extra frames, per kernel, per concept

– On average 6 additional i-frames with MAX or AVG fusion is a solid choice
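MAX and AVG fusion reduce the per-frame concept scores of a shot to a single shot-level score. A minimal sketch follows, assuming one detector score per analyzed frame; the function name and the example scores are made up for illustration.

```python
def fuse_shot_score(frame_scores, rule="max"):
    """Combine per-frame detector scores of one shot into a single score."""
    if not frame_scores:
        raise ValueError("a shot needs at least one frame score")
    if rule == "max":
        return max(frame_scores)
    if rule == "avg":
        return sum(frame_scores) / len(frame_scores)
    raise ValueError("rule must be 'max' or 'avg'")

# One key frame plus 6 additional i-frames (made-up scores).
scores = [0.2, 0.9, 0.4, 0.1, 0.3, 0.5, 0.4]
max_score = fuse_shot_score(scores, "max")
avg_score = fuse_shot_score(scores, "avg")
```

MAX rewards a concept that is clearly visible in any single frame, while AVG favours concepts that persist through the shot, which may explain why the best rule differs per concept.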

Visualizing multi-frame impact

http://www.MediaMill.nl

SLIDE 14


Conclusions TRECVID 2009

Multi-modal using multi-kernel seems promising

– More experiments needed to be conclusive

Multi-frame is true performance booster

– 30% improvement over single-frame baseline
– Time for the community to move on to video analysis

References I

http://www.vidivideo.eu

The MediaMill TRECVID 2008 Semantic Video Search Engine. C.G.M. Snoek et al. Proceedings of the TRECVID Workshop, 2008.

Evaluating Color Descriptors for Object and Scene Recognition. K.E.A. van de Sande, Th. Gevers, C.G.M. Snoek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2010.

Visual Word Ambiguity. Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2009.

On the Surplus Value of Semantic Video Analysis Beyond the Key Frame. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra. Proc. IEEE Int’l Conference on Multimedia & Expo, 2005.

Real-Time Bag of Words, Approximately. Jasper R. R. Uijlings, Arnold W. M. Smeulders, R. J. H. Scha. ACM Int'l Conference on Image and Video Retrieval, 2009.

Empowering Visual Categorization with the GPU. K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. In Proc. Annual Conference of the Advanced School for Computing and Imaging, 2009.

SLIDE 15


References II

http://www.vidivideo.eu

Detecting Audio Events for Semantic Video Search. M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini, and A. Abad. In InterSpeech, 2009.

Audio Contributions to Semantic Video Search. I. Trancoso, T. Pellegrini, J. Portelo, H. Meinedo, M. Bugalho, A. Abad, and J. Neto. Proc. IEEE Int’l Conference on Multimedia & Expo, 2009.

Visual Category Recognition using Spectral Regression and Kernel Discriminant Analysis. M. A. Tahir, J. Kittler, K. Mikolajczyk, F. Yan, K. van de Sande, and T. Gevers. In Proc. 2nd Int’l Workshop on Subspace, In Conjunction with ICCV, 2009.

Nonsparse Multiple Kernel Learning for Fisher Discriminant Analysis. F. Yan, J. Kittler, K. Mikolajczyk, and A. Tahir. In IEEE Int’l Conf. Data Mining, 2009.

The MediaMill TRECVID 2009 Semantic Video Search Engine. C.G.M. Snoek et al. Proceedings of the TRECVID Workshop, 2009.

Concept-Based Video Retrieval. C.G.M. Snoek, M. Worring. Foundations and Trends in Information Retrieval, Vol. 4 (2), pages 215-322, 2009.

Contact info

Cees Snoek