TRECVID-2006 High-Level Feature task: Overview
Wessel Kraaij (TNO) & Paul Over (NIST)
TRECVID 2006 2
Outline
Task summary
Evaluation details
Inferred Average precision vs. mean average precision
Participants
Evaluation results
Pool analysis
Results per category
Results per feature
Significance tests (category A)
Comparison with TV2005
Global Observations
Issues
High-level feature task
Goal: Build benchmark collection for visual concept detection methods
Secondary goals:
encourage generic (scalable) methods for detector development
feature-indexing could help search/browsing
Participants submitted runs for all 39 LSCOM-lite features
Used results of 2005 collaborative training data annotation
Tools from CMU and IBM (new tool)
39 features and about 100 annotators
multiple annotations of each feature for a given shot
Range of frequencies in the common development data annotation
NIST evaluated 20 (medium frequency) features from the 39 using a 50% random sample of the submission pools (Inferred AP)
HLF is challenging for machine learning
Small imbalanced training collection
Large variation in examples
Noisy annotations
Decisions to be made:
find suitable representations
find optimal fusion strategies
20 LSCOM-lite features evaluated
1 sports
3 weather
5 office
6 meeting
10 desert
12 mountain
17 waterscape/waterfront
22 corporate leader
23 police security
24 military personnel
26 animal
27 computer tv screen
28 us flag
29 airplane
30 car
32 truck
35 people marching
36 explosion fire
38 maps
39 charts
Note: this is a departure from the numbering scheme used at previous TVs
High-level feature evaluation
Each feature assumed to be binary: absent or present for each master reference shot
Task: Find shots that contain a certain feature, rank them according to confidence measure, submit the top 2000
NIST pooled and judged top results from all submissions
Evaluated performance effectiveness by calculating the inferred average precision of each feature result
Compared runs in terms of mean inferred average precision across the 20 feature results
to be used for comparison between TV2006 HLF runs
not comparable with TV2005, TV2004… figures
Inferred average precision (infAP)
Just* developed by Emine Yilmaz and Javed A. Aslam at Northeastern University
Estimates average precision surprisingly well using a surprisingly small sample of judgments from the usual submission pools
Experiments on TRECVID 2005 feature submissions confirmed quality of the estimate in terms of actual scores and system ranking
* J.A. Aslam, V. Pavlu and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
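The estimator itself can be sketched in a few lines. The following is a simplified Python version in the spirit of the Yilmaz & Aslam formula (not the trec_eval implementation; the shot-id inputs are hypothetical): for each sampled relevant shot at rank k, it adds an expected precision at k estimated from the judged part of the pool above k, then averages over the sampled relevant shots.

```python
def inf_ap(ranking, pool, judged, eps=1e-5):
    """Simplified inferred AP for one feature result.

    ranking: shot ids in ranked order (best first)
    pool:    set of shot ids that entered the judgment pool
    judged:  dict shot_id -> True (relevant) / False (non-relevant)
             for the sampled, judged subset of the pool
    """
    total = 0.0
    sampled_rel = 0             # relevant shots found in the sample
    pooled_above = 0            # pool members above the current rank
    rel_above = nonrel_above = 0
    for k, shot in enumerate(ranking, start=1):
        if shot in judged and judged[shot]:
            sampled_rel += 1
            if k == 1:
                total += 1.0
            else:
                # expected precision at k: of the k-1 shots above,
                # pooled_above are in the pool; their relevance rate is
                # estimated from the judged sample (eps avoids 0/0)
                within = pooled_above / (k - 1)
                p_rel = (rel_above + eps) / (rel_above + nonrel_above + 2 * eps)
                total += 1.0 / k + ((k - 1) / k) * within * p_rel
        if shot in judged:
            if judged[shot]:
                rel_above += 1
            else:
                nonrel_above += 1
        if shot in pool:
            pooled_above += 1
    return total / sampled_rel if sampled_rel else 0.0
```

With a 100% sample (every pooled shot judged) this reduces, up to the eps smoothing, to ordinary average precision — the property exploited in the 2005 experiments below.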
Inferred average precision (infAP) Experiments with 2005 data
Pool submitted results down to at least a depth of 200 items
Manually judge pools - forming a base set of judgments (100% judged)
Create 4 sampled sets of judgments by randomly marking some results “unjudged”
20% unjudged -> 80% sample
40% unjudged -> 60% sample
60% unjudged -> 40% sample
80% unjudged -> 20% sample
Evaluate all systems that submitted results for all features in 2005 using the base and each of the 4 sampled judgment sets using infAP
By definition, infAP of a 100% sample of the base judgment set is identical to average precision (AP).
Compare measurements of infAP using various sampled judgment sets to standard AP.
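Creating the sampled judgment sets is straightforward; a minimal sketch, assuming a dict-based qrels representation (hypothetical, not the actual NIST tooling):

```python
import random

def sample_judgments(qrels, fraction, seed=0):
    """Randomly mark (1 - fraction) of the judged pool as "unjudged".

    qrels: dict shot_id -> relevance judgment
    Returns a new dict keeping roughly `fraction` of the judgments.
    """
    rng = random.Random(seed)
    return {shot: rel for shot, rel in qrels.items() if rng.random() < fraction}
```

Scoring every 2005 run against, e.g., sample_judgments(qrels, 0.8) and comparing with scores from the full qrels gives the MAP-vs-infAP comparison on the following slides.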
2005 Mean infAP scoring approximates MAP scoring very closely
[Chart: MAP vs. mean infAP per system, under 80%, 60%, 40%, and 20% sampled judgments]
2005 system rankings change very little when determined based on infAP versus AP.
Kendall's tau (normalizes pairwise swaps):
80% sample: 0.9862658
60% sample: 0.9871663
40% sample: 0.9700546
20% sample: 0.951566
Number of significant rank changes (randomization test, p < 0.01):
Sample  Add  Keep  Lose
20%      73  1883   170
40%      45  1949   104
60%      36  1996    57
80%      37  2018    35
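Kendall's tau between two score lists can be computed directly; a minimal sketch (not how trec_eval reports it, and ignoring tie corrections):

```python
def kendall_tau(scores_a, scores_b):
    """Kendall's tau between the system orderings induced by two score
    lists (e.g. AP vs. infAP for the same systems).

    +1 means identical ordering, -1 a fully reversed one; each pair of
    systems counts as concordant (same order) or discordant (swapped).
    """
    n = len(scores_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```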
2006: Inferred average precision (infAP)
Submissions for each of 20 features were pooled down to about 120 items (so that each feature pool contained ~ 6500 shots)
varying pool depth per feature
A 50% random sample of each pool was then judged:
66,769 total judgements (~ 125 hr of video)
Judgement process: one assessor per feature, who watched the complete shot while listening to the audio.
infAP was calculated by trec_eval from the judged and unjudged pool.
Frequency of hits varies by feature
[Chart: number of true hits in the test data for each of the 20 evaluated features (1, 3, 5, 6, 10, 12, 17, 22, 23, 24, 26, 27, 28, 29, 30, 32, 35, 36, 38, 39); counts range from roughly 200 to 1600 hits, i.e. about 1-2% of the test shots]
Systems can find hits in video from programs not in the training data
[Chart: known programs account for 68% of test hours, 65% of pooled shots, and 68% of hits; new programs for 32%, 35%, and 32% respectively]
2006: 30/54 Participants (2005: 22/42, 2004: 12/33)
Task codes: SB = shot boundary, FE = high-level features, SE = search, RU = rushes
Bilkent U. -- FE SE --
Carnegie Mellon U. -- FE SE --
City University of Hong Kong (CityUHK) SB FE SE --
CLIPS-IMAG SB FE SE --
Columbia U. -- FE SE --
COST292 (www.cost292.org) SB FE SE RU
Fudan U. -- FE SE --
FX Palo Alto Laboratory Inc SB FE SE --
Helsinki U. of Technology SB FE SE --
IBM T. J. Watson Research Center -- FE SE RU
Imperial College London / Johns Hopkins U. -- FE SE --
NUS / I2R -- FE SE --
Institut EURECOM -- FE -- RU
KDDI / Tokushima U. / Tokyo U. of Technology SB FE -- --
K-Space (kspace.qmul.net) -- FE SE --
2006: 30 Participants (continued)
LIP6 - Laboratoire d'Informatique de Paris 6 -- FE -- --
Mediamill / U. of Amsterdam -- FE SE --
Microsoft Research Asia -- FE -- --
National Taiwan U. -- FE -- --
NII/ISM -- FE -- --
Tokyo Institute of Technology SB FE -- --
Tsinghua U. SB FE SE RU
U. of Bremen TZI -- FE -- --
U. of California at Berkeley -- FE -- --
U. of Central Florida -- FE SE --
U. of Electro-Communications -- FE -- --
U. of Glasgow / U. of Sheffield -- FE SE --
U. of Iowa -- FE SE --
U. of Oxford -- FE SE --
Zhejiang U. SB FE SE --
HLF keeps attracting more participants, and most of them come back the next year.
Number of runs of each training type
Tr-Type     2003        2004        2005        2006
A           22 (36.7%)  45 (54.2%)  79 (71.8%)  86 (68.8%)
B           20 (33.3%)  27 (32.5%)  24 (21.8%)  32 (25.6%)
C           18 (30.0%)  11 (13.3%)   7 (6.3%)    7 (5.6%)
Total runs  60          83          110         125
System training type:
A - only on the common dev. collection and the common annotation
B - only on the common dev. collection but not on (just) the common annotation
C - not of type A or B
% of true shots by source language for each feature
[Chart: for each evaluated feature (1, 3, 5, 6, 10, 12, 17, 22, 23, 24, 26, 27, 28, 29, 30, 32, 35, 36, 38, 39) and for all test shots, the percentage of true shots from Arabic, Chinese, and English sources]
True shots contributed uniquely by team for each feature
[Chart: per-feature counts of true shots found by only one team]
Category A results (top half)
[Chart: mean inferred AP (0-0.25) per run, in decreasing order:]
tsinghua, IBM.MBWN, IBM.MRF, IBM.MAAR, IBM.MBW, CMU.Return_of_The_Jedi, IBM.UB, CMU.The_Empire_Strikes_back, CMU.A_New_Hope, CMU.Attack_of_The_Clones, COL3, UCF.CEC.PROD, IBM.VB, COL1, UCF.CE.PROD, COL2, COL4, ucb_1best, COL5, UCF.CE.PROB, COL6, CMU.Revenge_of_The_Sith, ucb_vision, KSpace-base, CityUHK1, ucb_concat, MSRA_TRECVID, MSRA_TRECVID, icl.jhu_Sys2, ucb_fusion, UCF.MIX, CMU.The_Phantom_Menace, KSpace-SC, CityUHK2, icl.jhu_Sys1, CityUHK5, UCF.CM, MSRA_TRECVID, KSpace-DS1, CityUHK3, ucb_text, KSpace-DS2, NTU
Category A results (bottom half)
[Chart: mean inferred AP (0-0.25) per run, in decreasing order:]
MSRA_TRECVID, UCF.EDEG, MSRA_TRECVID, CityUHK6, PicSOM_F6, PicSOM_F5, PicSOM_F4, ucb_sound, MSRA_TRECVID, PicSOM_F3, i2Rnus, KSpace-bb, KSpace-highSvm, NII_ISM_R3, i2Rnus, i2Rnus, NII_ISM_R2, i2Rnus, i2Rnus, i2Rnus, NII_ISM_R1, TokyoTech1, ZJU, TokyoTech1, kddi.SiriusCy2, Bilkent1, TZI_Text, CityUHK4, FD_SVM_BN, kddi.SiriusCy1, TZI_RelaxText, UEC_Common, icl.jhu_NPDE, Glasgow.Sheffield01, LIP6.FuzzyDT, FD_SVM_MTL, EUR01-SVM, icl.jhu_5, icl.jhu_4, FD_SCM_MTL, FD_SCM_BN, COST292R2, COST292R1
Category B results
[Chart: mean inferred AP (0-0.25) per run, in decreasing order:]
tsinghua, tsinghua, tsinghua, tsinghua, MM.top, MM.bottom, MM.strange, MM.charm, OXVGG_AOJ, clips.local-reuters-scale, clips.local-text-scale, clips.local-reuters-kernel-sum, PicSOM_F9, OXVGG_OJ, MM.up, clips.local-reuters-late-context, FXPAL-06Beta, tsinghua, clips.optimized-fusion-all, PicSOM_F7, FXPAL-06Beta, FXPAL-06Beta, FXPAL-06Beta, FXPAL-06Beta, FXPAL-06Beta, MM.down, OXVGG_A, clips.local-reuters-kernel-prod, TZI_Image, TZI_Avg, TZI_RelaxImage, TZI_RlxAll
Category C results
[Chart: mean inferred AP (0-0.25) per run, in decreasing order:]
kddi.SiriusCy6, kddi.SiriusCy5, kddi.SiriusCy4, kddi.SiriusCy3, UIowa06FE02, UIowa06FE01
Inferred Avg. Precision by feature (all runs)
[Boxplot: inferred average precision per feature; boxes show the middle half of the data, with the median marked]
Features: 1 sports, 3 weather, 5 office, 6 meeting, 10 desert, 12 mountain, 17 waterscape/waterfront, 22 corporate leader, 23 police security, 24 military personnel, 26 animal, 27 computer tv screen, 28 us flag, 29 airplane, 30 car, 32 truck, 35 people marching, 36 explosion fire, 38 maps, 39 charts
Inferred avg. precision by feature (top 10 runs)
[Chart: inferred average precision (0-0.9) of the top 10 runs and the median, per feature]
Which, if any, differences are significant, i.e. not due to chance?
Features: 1 sports, 3 weather, 5 office, 6 meeting, 10 desert, 12 mountain, 17 waterscape/waterfront, 22 corporate leader, 23 police security, 24 military personnel, 26 animal, 27 computer tv screen, 28 us flag, 29 airplane, 30 car, 32 truck, 35 people marching, 36 explosion fire, 38 maps, 39 charts
Method of testing for significant pairwise differences between runs
Developed c. 1935 by R.A. Fisher as a thought experiment
Gained new usefulness with advent of computer intensive methods in statistics
Avoids dependence on (usually untrue) assumptions that samples are truly random, normally distributed, have equal variances, etc.
But makes no claims about populations
Randomization testing
Randomization test procedure
1. Given observed scores for two systems on the same 20 features, calculate the mean score for each system and the observed difference between the means.
2. We would like to know whether the difference is due to the systems or to chance.
3. Generate a distribution of differences between the means under the null hypothesis that the difference is due to chance: for any feature, the score from one system could equally likely have come from the other.
- Calculate the within-feature pairwise differences and the difference in means, once
- For ~10,000 iterations or more:
- For each pair of scores, randomly change the sign of the difference
- Sum the differences, calculate the new mean, and add it to the H0 distribution
4. Count how many differences in H0 are equal to or more extreme than the observed difference.
5. Take [count / total number of generated differences] as the probability (p) that the observed difference in means is due to chance.
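The sign-flip procedure is only a few lines of code. A minimal Python sketch (not NIST's implementation; this version uses the two-sided, absolute-value form of "equal to or more extreme"):

```python
import random

def randomization_test(scores1, scores2, iters=10000, seed=0):
    """Paired randomization (sign-flip) test on per-feature scores.

    Returns the observed difference of the means and the estimated
    probability that a difference at least this large arises when each
    per-feature difference is equally likely to carry either sign.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores1, scores2)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(iters):
        # flip the sign of each per-feature difference with prob. 1/2
        mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(mean) >= abs(observed):
            extreme += 1
    return observed, extreme / iters
```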
Randomization test procedure (example)
1. Given observed scores for two systems on the same 20 features, calculate the mean score for each system and the observed difference between the means:
R1: 0.467 0.434 0.013 0.314 0.041 0.188 0.242 ...
R2: 0.367 0.515 0.004 0.236 0.057 0.087 0.054 ...
SUM(R1-R2)/20: SUM(+0.1 -0.081 +0.009 +0.078 -0.016 +0.101 +0.188 ...)/20 = 0.033
2. Generate a distribution of differences between the means under the null hypothesis that the difference is due to chance - for any feature, the score from one system could equally likely have come from the other:
1. SUM(-0.1 -0.081 -0.009 +0.078 -0.016 +0.101 +0.188 ...)/20 = -0.008
2. SUM(+0.1 -0.081 +0.009 -0.078 +0.016 +0.101 -0.188 ...)/20 = 0.019
3. SUM(-0.1 -0.081 -0.009 +0.078 -0.016 -0.101 +0.188 ...)/20 = 0.046
...
SUM(+0.1 +0.081 +0.009 -0.078 +0.016 +0.101 +0.188 ...)/20 = -0.224
3. 3145 of 95344 generated differences >= 0.033
4. Probability observed difference is due to chance (p) = 0.03299
Significant differences among top 10 A-category runs (using randomization test, p < 0.05)
Run name (mean infAP):
A_tsinghua_6 (0.192)
A_IBM.MBWN_5 (0.177)
A_IBM.MRF_2 (0.176)
A_IBM.MAAR_3 (0.170)
A_IBM.MBW_1 (0.169)
A_CMU.Return_of_The_Jedi_6 (0.159)
A_IBM.UB_4 (0.155)
A_CMU.The_Empire_Strikes_back_5 (0.153)
A_CMU.A_New_Hope_4 (0.148)
A_CMU.Attack_of_The_Clones_2 (0.146)
Significantly better (>) pairs:
A_tsinghua_6 > A_IBM.UB_4, A_CMU.Return_of_The_Jedi_6, A_CMU.A_New_Hope_4, A_CMU.The_Empire_Strikes_back_5, A_CMU.Attack_of_The_Clones_2
A_IBM.MRF_2 > A_CMU.Attack_of_The_Clones_2, A_IBM.UB_4
A_IBM.MBWN_5 > A_CMU.Attack_of_The_Clones_2, A_IBM.UB_4
A_IBM.MAAR_3 > A_IBM.UB_4
Comparison with TV2005
Some features were also evaluated last year
Comparison yields a mixed bag:
2 features decreased
2 features increased
1 feature stable
most of these features have just 100-200 true hits in the sampled pool
Caveat: comparison is just indicative…
it compares MAP (TV2005) with infAP (TV2006)
but the test set was drawn from a dataset similar to TV2005's
Did anyone re-run last year’s system?
[Chart: top and median infAP in TV2005 vs. TV2006 (0-0.6) for Explosion or fire, US flag, Waterscape, Mountain, and Sports]
infAP vs. # true shots in test data
[Scatter plot: median and maximum infAP (0-0.7) per feature vs. number of true shots in the test data (200-1800), with a 1% reference line]
General observations (1)
Participation is still increasing
Maintained focus on cat A
Most groups built a generic feature detector
Top scores come from the usual suspects plus a few new groups
[Chart: number of participants per year, 2003-2006]
General observations (2)
Many interesting new techniques are tried
Some consolidation: SVM is the dominant classifier with robust results
Good systems combine representations at multiple granularities
Salient-point representations are gaining ground
Good systems combine different feature types (c,t,e/s,a,T,f)
8/30 teams look at more than just the shot keyframe
Many interesting multimodal/concept fusion experiments, room for more exploration here
multi-concept fusion still of limited use (due to small lexicon?)
CMU: not many concepts support each other
Columbia: 3 out of 4 predicted concepts have 30% increase
Can concept fusion learn from IR co-occurrence techniques?
Overview of approaches across sites
feature types
c: color, t: texture, s:shape, e:edges, a:acoustic, f:face, T: text
granularity (local, region, global)
classifier techniques
fusion
generic vs. feature specific
focus of each site's experiments marked in blue, speaking slots in yellow
(Columns of the original table: generic?, multi-concept fusion, multimodal fusion, classifier, temporal analysis, features, repr. granularity, best run, Cat. Best run first.)
tsinghua (A, 0.192): generic; stacked SVM + rules concept fusion; weight-select / rankboost / stacked-SVM fusion; SVM; camera motion, motion activity; c,t,T,f; global, grid, segm. point
IBM.MAAR (A, 0.170): SVM; other details ?
CMU.A_New_Hope (A, 0.148): multi-discr. RF (chi2 selection) concept fusion; logistic regression, early/late/Borda fusion; SVM; c,t,T; grid (5x5) + points
COL1 (A, 0.142): boosting CRF (PMI selection) concept fusion; average fusion; SVM, EMD; c,t,T; SIFT points/grid
ucb_1best (A, 0.122): SVM; shot context; c,e,T; points
UCF.CE.PROB (A, 0.119): average/product/KDE fusion; SVM; c,e
MM.bottom (B, 0.117): early/late fusion; SVM / log. regression / LDA; global, grid, point
KSpace-base (A, 0.110): generic + specific; Bayesian (DS) fusion; SVM; camera motion; c,t,e,T; grid
CityUHK1 (A, 0.106): average fusion; SVM, EMD; c,t; points + grid
MSRA_TRECVID (A, 0.086): weighted fusion, also looked at unlabeled data; SVM, KDE, manifold ranking, t-graph; c,t,s,f,T; global, grid
NTU (A, 0.073)
PicSOM_F7 (B, 0.064): hand-picked negative concepts; linear combination; SOM; motion activity, average c,t per shot; c,t,T; grid
FXPAL-06Beta (B, 0.059): DRF / chi2 SVM
OXVGG_A (B, 0.053): Borda count fusion; SVM; c,e,f; points (sparse/dense)
(Same columns as the previous slide; best run first.)
i2Rnus (A, 0.040): cond. prob. concept fusion; SVM, LDF, GMM; frame clustering, bigrams; c,t,T; grid
NII_ISM_R1 (A, 0.033): SVM; local binary patterns; overlapping grid
clips.local-reuters-kernel-prod (B, 0.031): SVM; c,t,T; local + global
TokyoTech1 (A, 0.030)
ZJU (A, 0.029): multimodal subspace correlation propagation; SVM; c,t,e,T,a; global
kddi.SiriusCy3 (C, 0.026): not all features; Haar/KNN; s; grid + points
Bilkent1 (A, 0.021): NN; c,t,e,T; grid
TZI_Avg (B, 0.021): generic + specific; cond. prob.; weighted average, prob. relaxation labelling; SVM; every 20th frame; c,T,e,f,a
UEC_Common (A, 0.006)
Glasgow.Sheffield01 (A, 0.005): tf.idf; T
LIP6.FuzzyDT (A, 0.004): fuzzy decision trees; p,c; grid
EUR01-SVM (A, 0.002): NN; SVM; c,t; points
FD_SCM_BN (A, 0.001): cond. P; GMM/SVM; c,t; points
icl.jhu_4 (A, 0.001): source adaptation; likelihood ratio (HMM); c,t,T; grid
Iowa06FE01 (C, 0.001)
COST292R1 (A, 0.000): not all features; NN/Bayes; c,t / T; points/grid/LSA
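Several of the runs above fuse ranked outputs from multiple classifiers. Borda-count fusion, listed for the OXVGG run, can be sketched as follows (a generic sketch with hypothetical shot ids, not any site's actual code):

```python
def borda_fuse(rankings):
    """Fuse ranked shot lists by Borda count: each list awards
    len(list) points to its top shot, len(list) - 1 to the next,
    and so on; shots are re-ranked by total points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, shot in enumerate(ranking):
            scores[shot] = scores.get(shot, 0) + n - pos
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks (not confidence values) enter the count, the method needs no score normalization across detectors, which is one reason rank-based fusion shows up in these tables alongside average and weighted score fusion.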
Issues
How to make the most of a fixed, limited amount of assessor time
Sampling method
Equal pool size for each feature?
Repetition of advertisement clips was less of an issue than in TV2005
Systematic study of interaction between search and HLF
How to proceed after 5 years of HLF?
massive scaling requires massive amounts of annotation and assessment time
Discussion input
How to make the most of a fixed, limited amount of assessor time
Sampling method refinement
top->sample->unique vs. top->unique->sample?
mark as ignore vs. mark as non-relevant
MAP vs. precision@N
Equal pool size for each feature?
How to proceed after 5 years of HLF?
massive scaling requires massive amounts of annotation and possibly assessment time