TRECVID-2005 High-Level Feature task: Overview. Wessel Kraaij, TNO. PowerPoint PPT Presentation.



SLIDE 1

TRECVID-2005 High-Level Feature task: Overview

Wessel Kraaij, TNO & Paul Over, NIST

SLIDE 2

TRECVID 2005

High-level feature task

  • Goal: build a benchmark collection for detection methods
  • Secondary goal: feature indexing could help search/browsing
  • Feature set selected from the feature set used for annotation of the development data (LSCOM-lite)
  • Examples of thing/activity/person/location
  • Collaborative development data annotation effort
    • Tools from CMU and IBM (new tool)
    • 39 features and about 100 annotators
    • Multiple annotations of each feature for a given shot
  • Range of frequencies in the common development data annotation

SLIDE 3

True examples in the common training data

[Bar chart: number of true examples per feature (38-47) in the common training data, on a scale from 1000 to 7000; the frequency of true examples ranges from about 2.3% to 13%.]

SLIDE 4

High-level feature evaluation

  • Each feature assumed to be binary: absent or present for each master reference shot
  • Task: find shots that contain a certain feature, rank them according to a confidence measure, and submit the top 2000
  • NIST pooled submissions to depth 250
  • Performance quality evaluated by measuring the average precision etc. of each feature detection method
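The measure above can be sketched in a few lines. Below is a minimal non-interpolated average precision computation; the shot IDs and relevance judgments are hypothetical, and the official scoring (trec_eval style, over the pooled judgments) handles details this sketch ignores.

```python
def average_precision(ranked_shots, relevant):
    """Non-interpolated average precision for one feature.

    ranked_shots: shot IDs in decreasing confidence order (up to 2000 in TRECVID).
    relevant: set of shot IDs judged to contain the feature.
    """
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant) if relevant else 0.0

# Toy run: relevant shots appear at ranks 1 and 3.
ap = average_precision(["s1", "s2", "s3", "s4"], {"s1", "s3"})
# AP = (1/1 + 2/3) / 2 = 5/6
```

Averaging this value over all ten features gives the mean average precision (MAP) used to compare runs on later slides.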

SLIDE 5

10 Features

  38. People walking/running: segment contains video of more than one person walking or running (tv4: 35)
  39. Explosion or fire: segment contains video of an explosion or fire
  40. Map: segment contains video of a map
  41. US flag: segment contains video of a US flag
  42. Building exterior: segment contains video of the exterior of a building (tv3: 14)
  43. Waterscape/waterfront: segment contains video of a waterscape or waterfront
  44. Mountain: segment contains video of a mountain or mountain range with slope(s) visible
  45. Prisoner: segment contains video of a captive person, e.g., imprisoned, behind bars, in jail, in handcuffs, etc.
  46. Sports: segment contains video of any sport in action (tv3: 23)

SLIDE 6

Participants (22/42) (up from 12/33 in 2004)

  Bilkent University (Turkey): -- LL HL SE
  Carnegie Mellon University (USA): -- -- HL SE
  CLIPS-IMAG, LSR-IMAG, Laboratoire LIS (France): SB -- HL --
  Columbia University (USA): -- -- HL SE
  Fudan University (China): SB LL HL SE
  FX Palo Alto Laboratory (USA): SB -- HL SE
  Helsinki University of Technology (Finland): -- -- HL SE
  IBM (USA): SB -- HL SE
  Imperial College London (UK): SB -- HL SE
  Institut Eurecom (France): -- -- HL --
  Johns Hopkins University (USA): -- -- HL --
  Language Computer Corporation (LCC) (USA): -- -- HL SE
  LIP6-Laboratoire d'Informatique de Paris 6 (France): -- -- HL --
  Lowlands Team (CWI, Twente, U. of Amsterdam) (Netherlands): -- -- HL SE
  Mediamill Team (Univ. of Amsterdam) (Netherlands): -- LL HL SE
  National ICT Australia (Australia): SB LL HL --
  National University of Singapore (NUS) (Singapore): -- -- HL SE
  SCHEMA-Univ. Bremen Team (EU): -- -- HL SE
  Tsinghua University (China): SB LL HL SE
  University of Central Florida / University of Modena (USA, Italy): SB LL HL SE
  University of Electro-Communications (Japan): -- -- HL --
  University of Washington (USA): -- -- HL --

  (Task codes: SB = shot boundary, LL = low-level feature, HL = high-level feature, SE = search)
SLIDE 7

Who worked on which features

  Bilkent University: 38
  Carnegie Mellon University: 38-47
  CLIPS-IMAG, LSR-IMAG, Laboratoire LIS: 38-47
  Columbia University: 38-47
  Fudan University: 38-47
  FX Palo Alto Laboratory: 38-47
  Helsinki University of Technology: 38-47
  IBM: 38-47
  Imperial College London: 38-47
  Institut Eurecom: 38-47
  Johns Hopkins University: 38-47
  Language Computer Corporation (LCC): 38-47
  LIP6-Laboratoire d'Informatique de Paris 6: 40
  Lowlands Team (CWI, Twente, U. of Amsterdam): 38-47
  Mediamill Team (Univ. of Amsterdam): 38-47
  National ICT Australia: 38-47
  National University of Singapore (NUS): 38-47
  SCHEMA-Univ. Bremen Team: 40, 41, 43
  Tsinghua University: 38-47
  University of Central Florida / Univ. of Modena: 39-47
  University of Electro-Communications: 43, 44
  University of Washington: 38-47

SLIDE 8

Number of runs per training type

  Tr-Type     2003         2004         2005
  A           22 (36.7%)   45 (54.2%)   79 (71.8%)
  B           20 (33.3%)   27 (32.5%)   24 (21.8%)
  C           18 (30.0%)   11 (13.3%)    7 (6.3%)
  Total runs  60           83           110

System training type:
  A - trained only on the common dev. collection and the common annotation
  B - trained only on the common dev. collection but not on (just) the common annotation
  C - not of type A or B

SLIDE 9

AvgP by feature (all runs)

[Boxplot: average precision by feature number across all runs, showing the median and the middle half of the data for each feature.]

SLIDE 10

2005: AvgP by feature (top 10 runs)

[Chart: average precision of the top 10 runs and the median for each feature, with markers for the previous best result on CNN/ABC.]

  • 38. People walking/running
  • 39. Explosion/fire
  • 40. Map
  • 41. US flag
  • 42. Building exterior
  • 43. Waterscape/waterfront
  • 44. Mountain
  • 45. Prisoner
  • 46. Sports
  • 47. Car

SLIDE 11

2004: AvgP by feature (top 10 runs)

[Chart: average precision of the top 10 runs and the median for each feature.]

  • 28. Boats/ships
  • 29. M. Albright
  • 30. B. Clinton
  • 31. Trains
  • 32. Beach
  • 33. Basket scored
  • 34. Airplane takeoff
  • 35. People walk/run
  • 36. Phys. violence
  • 37. Road
SLIDE 12

2003: AvgP by feature (top 10 runs)

[Chart: average precision of the top 10 runs and the median for each feature.]

  • 11. Indoors
  • 12. News subject face
  • 13. People
  • 14. Building
  • 15. Road
  • 16. Vegetation
  • 17. Animal
  • 18. Female speech
  • 19. Car/truck/bus
  • 20. Aircraft
  • 21. News subject monologue
  • 22. Non-studio setting
  • 23. Sporting event
  • 24. Weather news
  • 25. Zoom in
  • 26. Physical violence
  • 27. Madeleine Albright

SLIDE 13

AvgP by feature (top 3 runs per feature)

[Chart: average precision of the top 3 runs for each feature.]

  • 38. People walking/running
  • 39. Explosion/fire
  • 40. Map
  • 41. US flag
  • 42. Building exterior
  • 43. Waterscape/waterfront
  • 44. Mountain
  • 45. Prisoner
  • 46. Sports
  • 47. Car
SLIDE 14

Max AvgP by number of annotated training examples

[Scatter plot: maximum average precision per feature versus the number of annotated training examples.]

SLIDE 15

Median AvgP by number of annotated training examples

[Scatter plot: median average precision per feature versus the number of annotated training examples.]

SLIDE 16

Max AvgP by number of true shots found

[Scatter plot: maximum average precision per feature versus the number of true shots found.]
SLIDE 17

Median AvgP by number of true shots found

[Scatter plot: median average precision per feature versus the number of true shots found.]
SLIDE 18

% of true shots by source language for each feature

[Stacked bar chart: for each feature number (38-47), the percentage of all true test shots by source language. Features shown: People walking/running, Explosion/fire, Waterscape/waterfront, US flag, Building exterior, Prisoner, Mountain, Sports, Car, Map.]

SLIDE 19

True shots contributed uniquely by team for each feature

[Chart: number of unique true shots contributed by each team (Bilkent, CLIPS/LIS/LSR, CMU, Eurecom, Fudan, FXPAL, Imperial, JHU, LCC, UParis-LIP6, Lowlands, NICTA, NUS, HUT, Tsinghua, UBremen, UvA, UWash) for each feature 38-47; most counts are small, with a few teams contributing dozens of unique true shots.]

SLIDE 20

Observations

  • Participation almost doubled over 2004 (12 -> 22)
  • Focus on category A runs (increased comparability)
  • Scores are generally higher than in 2004, despite:
    • new sources
    • errorful text from speech (via MT)
  • What does it mean? Did anybody run last year's system on this year's task?
  • Features were generally found in all language sources
  • Top scores come from fewer systems/groups
SLIDE 21

To follow: overview of the systems with MAP > 0.16 (the median)

  • Only systems that were tested on all 10 features
  • Only category A runs
  • Runs were compared on MAP across the 10 features
SLIDE 22

Overview of approaches

  • HLF systems draw from a very wide range of signal processing and machine learning techniques:
    • Generic vs. feature-specific detectors
    • How to do feature selection for visual modalities such as color and texture
    • Visual representation: grid or salient feature clusters
    • Various fusion methods, normalization methods
    • Range of classifiers
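As an illustration of the fusion and normalization methods mentioned above, here is a minimal late-fusion sketch: each classifier's scores are min-max normalized to [0, 1], then combined with a weighted average. The scores and weights are made up for the example; real systems tuned such weights on held-out data.

```python
def min_max_normalize(scores):
    """Rescale one classifier's scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant scores carry no ranking information
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def late_fusion(score_lists, weights):
    """Weighted average of normalized per-shot scores from several classifiers."""
    normed = [min_max_normalize(scores) for scores in score_lists]
    n_shots = len(score_lists[0])
    return [sum(w * col[i] for w, col in zip(weights, normed))
            for i in range(n_shots)]

# Hypothetical per-shot scores from a visual and a text classifier
# (note the very different raw scales, which normalization removes):
fused = late_fusion([[0.2, 0.9, 0.5], [10.0, 30.0, 20.0]], [0.5, 0.5])
```

Fused scores are then used to rank shots for submission; the many variants on later slides differ mainly in what is normalized, how, and how the combination weights are learned.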

SLIDE 23

Carnegie Mellon University

  • Approach:
    • Unimodal / multimodal (as in 2004)
    • Learn dependencies between semantic features (using various graphical model representations): inconclusive
    • Global fusion < local fusion
    • Multilingual > monolingual
    • Multiple text sources > single text source
    • Best run: local fusion
  • Results: [bar chart of per-feature average precision]

SLIDE 24

Columbia University

  • (presentation follows)
  • Approach:
    • Parts-based object representation (ARG)
    • Captures:
      • Topological structure (spatial relationships among parts)
      • Local attributes of parts
    • Model learns the parameter distribution properties due to differences in photometric conditions and geometry
    • Runs vary across classifier combination schemes (fusion/selection)
  • Results:
    • Significantly better than the global (i.e. grid-based) approach
    • Especially good for visual concepts where topology and local attributes are important (e.g. US flag)
    • Text features play only a marginal role (contrastive experiment)
  • [Bar chart of per-feature average precision]

SLIDE 25

Fudan University

  • Approach:
    • Several runs:
      • Specific feature detectors
      • ASR-based
      • Fusion of several unimodal SVM classifiers
      • Contrastive experiments with different dimension reduction techniques (PCA, locality preserving projection)
  • Results:
    • Best run: 0.19
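To illustrate the kind of dimension reduction experimented with here, this is a toy PCA sketch for 2-D points: center the data, build the 2x2 covariance matrix, and project onto the first principal axis. The closed-form angle works only in two dimensions, and the data are made up; the actual runs reduced high-dimensional visual feature vectors.

```python
import math

def pca_1d_projection(points):
    """Project 2-D points onto their first principal axis (toy PCA).

    Uses the closed-form principal-axis angle of a 2x2 covariance
    matrix, so this sketch is limited to two dimensions.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # principal-axis angle
    ux, uy = math.cos(theta), math.sin(theta)
    return [x * ux + y * uy for x, y in centered]

# Collinear toy data: all variance lies along the direction (1, 2),
# so the 1-D projection preserves it exactly.
proj = pca_1d_projection([(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)])
```

In the real systems the reduced features feed the unimodal SVM classifiers whose outputs are then fused.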

SLIDE 26

FXPAL

  • Approach:
    • SVM trained on low-level features donated by CMU
    • Classifier combination schemes based on various forms of regression
    • 1st-time participation
  • Results:
    • Best result: MAP = 0.18

SLIDE 27

Helsinki University of Technology

  • Approach:
    • Self-Organizing Maps trained on multimodal features and LSCOM-lite annotations
  • Results:
    • 1 run: MAP 0.2
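For readers unfamiliar with the technique, here is a generic 1-D Self-Organizing Map sketch in pure Python: toy scalar data, a fixed learning rate, and a linearly shrinking neighborhood. It only demonstrates the SOM update rule, not HUT's actual multimodal system.

```python
import random

def train_som_1d(data, n_units, epochs=100, lr=0.3, seed=0):
    """Train a 1-D Self-Organizing Map on scalar inputs.

    Each input pulls its best-matching unit (BMU) and the BMU's
    neighbours toward it; the neighbourhood radius shrinks over time.
    """
    rng = random.Random(seed)
    units = [rng.random() for _ in range(n_units)]  # random initial codebook
    for epoch in range(epochs):
        # Linearly shrink the neighbourhood radius down to 1
        radius = max(1, round((n_units / 2) * (1 - epoch / epochs)))
        for x in data:
            bmu = min(range(n_units), key=lambda i: abs(units[i] - x))
            for i in range(max(0, bmu - radius), min(n_units, bmu + radius + 1)):
                units[i] += lr * (x - units[i])  # convex pull toward x
    return units

# Two toy clusters near 0.1 and 0.9; the codebook organizes around them.
units = train_som_1d([0.1, 0.15, 0.85, 0.9], n_units=4)
```

In a detection setting, shots mapping to units whose neighborhoods are dominated by annotated positive examples receive high confidence scores.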

SLIDE 28

IBM

  • Approach:
    • Features:
      • Visual: extensive experiments for selecting the best feature type and granularity for individual modalities (color, texture, etc.)
      • Motion, text, LSCOM-lite concepts
      • Meta-information such as time of broadcast, channel, etc.
    • SVM > (ME, KNN, GMM)
    • Flat and hierarchical feature fusion
    • Variations in classifier fusion methods
    • Feature-specific approaches (selection based on held-out data)
  • Results: [bar chart of per-feature average precision]

SLIDE 29

Imperial College London

  • Approach:
    • 1. "Naïve model":
      • Locate salient clusters in feature space
      • Learn HLF <-> cluster models
    • 2. Nonparametric density estimation (kernel smoothing)
  • Results:
    • Naïve model: performance problems
    • NPDE >> naïve model
  • [Bar chart of per-feature average precision]
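As a sketch of the kernel smoothing behind the stronger run, here is a 1-D Gaussian kernel density estimate, compared across two hypothetical classes; the actual system estimates densities over high-dimensional visual features, and the bandwidth here is arbitrary rather than tuned.

```python
import math

def gaussian_kde(sample, x, bandwidth):
    """Kernel-smoothed density estimate at x: the average of Gaussian
    bumps centered on each training point."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * bandwidth * len(sample))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in sample)

# Hypothetical 1-D feature values for shots with / without the concept:
positives = [0.8, 0.9, 1.0]
negatives = [0.0, 0.1, 0.2]

# Score a new shot by comparing class-conditional density estimates;
# positive score means the positive class is the better fit.
x_new = 0.85
score = gaussian_kde(positives, x_new, 0.2) - gaussian_kde(negatives, x_new, 0.2)
```

Ranking shots by such a density-based score yields the confidence ordering submitted for evaluation.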

SLIDE 30

Mediamill team (Univ. of Amsterdam)

  • (presentation follows)
  • Approach:
    • Authoring metaphor
    • Feature-specific combination of content, style and context analysis
    • 101-concept lexicon
  • Results:
    • Textual features contribute only a small performance gain
  • [Bar chart of per-feature average precision]

SLIDE 31

National University of Singapore (NUS)

  • Approach:
    • 1. Ranked maximal figure of merit: ASR only, texture only, 2 fused runs
    • 2. HMM for visual dependency (4x4 grid): ASR only, +visual, +audio, genre, OCR; RankBoost fusion
  • Results:
    • 2nd approach >> 1st approach
  • [Bar chart of per-feature average precision]

SLIDE 32

University of Washington

  • Approach: ?
    • (notebook paper not available yet)
  • Results: [bar chart of per-feature average precision]