SLIDE 1

TRECVID-2006 High-Level Feature task: Overview

Wessel Kraaij TNO & Paul Over NIST

SLIDE 2

Outline

Task summary

Evaluation details

Inferred Average precision vs. mean average precision

Participants

Evaluation results

Pool analysis

Results per category

Results per feature

Significance tests category A

Comparison with TV2005

Global Observations

Issues

SLIDE 3

High-level feature task

Goal: Build benchmark collection for visual concept detection methods

Secondary goals:

encourage generic (scalable) methods for detector development

feature-indexing could help search/browsing

Participants submitted runs for all 39 LSCOM-lite features

Used results of 2005 collaborative training data annotation

Tools from CMU and IBM (new tool)

39 features and about 100 annotators

multiple annotations of each feature for a given shot

Range of frequencies in the common development data annotation

NIST evaluated 20 (medium frequency) features from the 39 using a 50% random sample of the submission pools (Inferred AP)

SLIDE 4

HLF is challenging for machine learning

Small imbalanced training collection

Large variation in examples

Noisy annotations

Decisions to be made:

find suitable representations

find optimal fusion strategies

SLIDE 5

20 LSCOM-lite features evaluated

1 sports, 3 weather, 5 office, 6 meeting, 10 desert, 12 mountain, 17 waterscape/waterfront, 22 corporate leader, 23 police security, 24 military personnel, 26 animal, 27 computer tv screen, 28 us flag, 29 airplane, 30 car, 32 truck, 35 people marching, 36 explosion fire, 38 maps, 39 charts

Note: this is a departure from the numbering scheme used at previous TVs

SLIDE 6

High-level feature evaluation

Each feature assumed to be binary: absent or present for each master reference shot

Task: Find shots that contain a certain feature, rank them according to a confidence measure, submit the top 2000

NIST pooled and judged top results from all submissions (a pooling sketch follows this list)

Evaluated effectiveness by calculating the inferred average precision of each feature result

Compared runs in terms of mean inferred average precision across the 20 feature results

to be used for comparison between TV2006 HLF runs

not comparable with TV2005, TV2004… figures
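For concreteness, the pooling step mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not NIST's actual code; the function name and inputs are hypothetical, and (as slide 11 notes) the real pool depth was varied per feature to reach a target pool size.

```python
def build_pool(submissions, depth):
    """Form the judgment pool for one feature: the union of the
    top-`depth` shots over all submitted runs (each run is a list
    of master-shot ids, best first)."""
    pool = set()
    for run in submissions:
        pool.update(run[:depth])   # duplicates collapse in the set union
    return pool
```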

SLIDE 7

Inferred average precision (infAP)

Just* developed by Emine Yilmaz and Javed A. Aslam at Northeastern University

Estimates average precision surprisingly well using only a small sample of judgments from the usual submission pools

Experiments on TRECVID 2005 feature submissions confirmed quality of the estimate in terms of actual scores and system ranking

* J.A. Aslam, V. Pavlu and E. Yilmaz, "A Statistical Method for System Evaluation Using Incomplete Judgments," Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
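The estimator is compact enough to sketch. The following Python sketch implements the infAP formula from the paper cited above, with our own (hypothetical) function and variable names; the official TRECVID scores were computed by trec_eval. For a judged-relevant shot at rank k, precision at k is estimated from the judged fraction of the pooled shots ranked above k, and these estimates are averaged over the judged-relevant shots.

```python
def inf_ap(ranked_shots, judgments, pool, eps=1e-8):
    """Inferred average precision for one feature (illustrative sketch).

    ranked_shots: one run's output, best first
    judgments:    shot id -> True/False, for the judged (sampled) pool only
    pool:         ids of all pooled shots, judged or not
    """
    num_rel = sum(judgments.values())        # relevant shots in the sample
    if num_rel == 0:
        return 0.0
    total = 0.0
    for k, shot in enumerate(ranked_shots, start=1):
        if judgments.get(shot) is not True:  # only judged-relevant ranks count
            continue
        if k == 1:
            total += 1.0
            continue
        above = ranked_shots[:k - 1]
        d = sum(s in pool for s in above)                  # pooled shots above k
        r = sum(judgments.get(s) is True for s in above)   # judged relevant above
        n = sum(judgments.get(s) is False for s in above)  # judged nonrelevant above
        # expected precision at k under random sampling of the pool;
        # eps smooths the case where nothing above k was judged
        total += 1 / k + ((k - 1) / k) * (d / (k - 1)) * ((r + eps) / (r + n + 2 * eps))
    return total / num_rel
```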

SLIDE 8

Inferred average precision (infAP) Experiments with 2005 data

Pool submitted results down to at least a depth of 200 items

Manually judge pools - forming a base set of judgments (100% judged)

Create 4 sampled sets of judgments by randomly marking some results “unjudged” (sketched after this list)

20% unjudged -> 80% sample

40% unjudged -> 60% sample

60% unjudged -> 40% sample

80% unjudged -> 20% sample

Evaluate all systems that submitted results for all features in 2005 with infAP, using the base set and each of the 4 sampled judgment sets

By definition, infAP of a 100% sample of the base judgment set is identical to average precision (AP).

Compare measurements of infAP using various sampled judgment sets to standard AP.
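A minimal sketch of the sampling step above, assuming the base judgment set has been loaded into a dict base_qrels mapping (feature, shot) to a boolean judgment (all names here are ours, not NIST's):

```python
import random

def sample_judgments(qrels, keep_fraction, seed=2005):
    """Randomly mark (1 - keep_fraction) of the judged shots 'unjudged'
    by dropping them from the dict; they stay pooled, just unjudged."""
    rng = random.Random(seed)
    return {key: rel for key, rel in qrels.items()
            if rng.random() < keep_fraction}

# the four sampled judgment sets used in the 2005 experiment
samples = {f"{pct}%": sample_judgments(base_qrels, pct / 100)
           for pct in (80, 60, 40, 20)}
```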

SLIDE 9

2005 Mean InfAP scoring approximates MAP scoring very closely

[Chart: mean infAP vs. MAP for each run, computed with the 80%, 60%, 40%, and 20% sampled judgment sets]

SLIDE 10

2005 system rankings change very little when determined based on infAP versus AP.

Kendall's tau (a normalized count of pairwise swaps; a computation sketch follows the table below)

80% sample 0.9862658

60% sample 0.9871663

40% sample 0.9700546

20% sample 0.951566

Number of significant rank changes (randomization test, p < 0.01):

Sample   Add   Keep   Lose   Swap
20%       73   1883    170
40%       45   1949    104
60%       36   1996     57
80%       37   2018     35
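Kendall's tau here is the rank correlation between the system ordering under AP (base judgments) and under infAP (a sampled set). A minimal sketch using SciPy, with hypothetical score dicts mapping run id to score:

```python
from scipy.stats import kendalltau

runs = sorted(ap_scores)                 # fix a common run order
tau, p = kendalltau([ap_scores[r] for r in runs],
                    [infap_scores[r] for r in runs])
print(f"Kendall's tau = {tau:.4f}")
```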

SLIDE 11

2006: Inferred average precision (infAP)

Submissions for each of the 20 features were pooled down to a depth of about 120 items (so that each feature pool contained ~6500 shots)

varying pool depth per feature

A 50% random sample of each pool was then judged:

66,769 total judgments (~125 hours of video)

Judgment process: one assessor per feature, who watched the complete shot while listening to the audio

infAP was calculated over the judged and unjudged pool by trec_eval

SLIDE 12

Frequency of hits varies by feature

[Chart: number of true hits in the test data for each evaluated feature (1, 3, 5, 6, 10, 12, 17, 22, 23, 24, 26, 27, 28, 29, 30, 32, 35, 36, 38, 39), ranging from under 200 to about 1600 shots]

SLIDE 13

Systems can find hits in video from programs not in the training data

[Chart: percentage split between programs known from the training data and new programs: test hours 68/32, pooled shots 65/35, hits 68/32]

SLIDE 14

2006: 30/54 Participants (2005: 22/42, 2004: 12/33)

(task codes: SB = shot boundary, FE = high-level features, SE = search, RU = rushes)

Bilkent U. -- FE SE --
Carnegie Mellon U. -- FE SE --
City University of Hong Kong (CityUHK) SB FE SE --
CLIPS-IMAG SB FE SE --
Columbia U. -- FE SE --
COST292 (www.cost292.org) SB FE SE RU
Fudan U. -- FE SE --
FX Palo Alto Laboratory Inc SB FE SE --
Helsinki U. of Technology SB FE SE --
IBM T. J. Watson Research Center -- FE SE RU
Imperial College London / Johns Hopkins U. -- FE SE --
NUS / I2R -- FE SE --
Institut EURECOM -- FE -- RU
KDDI / Tokushima U. / Tokyo U. of Technology SB FE -- --
K-Space (kspace.qmul.net) -- FE SE --

SLIDE 15

2006: 30 Participants (continued)

LIP6 - Laboratoire d'Informatique de Paris 6 -- FE -- --
Mediamill / U. of Amsterdam -- FE SE --
Microsoft Research Asia -- FE -- --
National Taiwan U. -- FE -- --
NII/ISM -- FE -- --
Tokyo Institute of Technology SB FE -- --
Tsinghua U. SB FE SE RU
U. of Bremen TZI -- FE -- --
U. of California at Berkeley -- FE -- --
U. of Central Florida -- FE SE --
U. of Electro-Communications -- FE -- --
U. of Glasgow / U. of Sheffield -- FE SE --
U. of Iowa -- FE SE --
U. of Oxford -- FE SE --
Zhejiang U. SB FE SE --

HLF keeps attracting more participants; most of them come back the next year.

SLIDE 16

Number of runs of each training type

Tr-Type      2003         2004         2005         2006
A            22 (36.7%)   45 (54.2%)   79 (71.8%)   86 (68.8%)
B            20 (33.3%)   27 (32.5%)   24 (21.8%)   32 (25.6%)
C            18 (30.0%)   11 (13.3%)    7 (6.3%)     7 (5.6%)
Total runs   60           83           110          125

System training type:
A - trained only on the common development collection and the common annotation
B - trained only on the common development collection but not (just) on the common annotation
C - not of type A or B

SLIDE 17

% of true shots by source language for each feature

[Chart: for all test shots and for each evaluated feature (1, 3, 5, 6, 10, 12, 17, 22, 23, 24, 26, 27, 28, 29, 30, 32, 35, 36, 38, 39), the percentage of true shots coming from Arabic, Chinese, and English sources]

SLIDE 18

True shots contributed uniquely by team for each feature

[Chart: number of true shots contributed uniquely by each team, per feature]
SLIDE 19

Category A results (top half)

[Chart: mean inferred AP (scale 0-0.25) for the top half of category A runs, best first: tsinghua, IBM.MBWN, IBM.MRF, IBM.MAAR, IBM.MBW, CMU.Return_of_The_Jedi, IBM.UB, CMU.The_Empire_Strikes_back, CMU.A_New_Hope, CMU.Attack_of_The_Clones, COL3, UCF.CEC.PROD, IBM.VB, COL1, UCF.CE.PROD, COL2, COL4, ucb_1best, COL5, UCF.CE.PROB, COL6, CMU.Revenge_of_The_Sith, ucb_vision, KSpace-base, CityUHK1, ucb_concat, MSRA_TRECVID, MSRA_TRECVID, icl.jhu_Sys2, ucb_fusion, UCF.MIX, CMU.The_Phantom_Menace, KSpace-SC, CityUHK2, icl.jhu_Sys1, CityUHK5, UCF.CM, MSRA_TRECVID, KSpace-DS1, CityUHK3, ucb_text, KSpace-DS2, NTU]

SLIDE 20

Category A (bottom half)

[Chart: mean inferred AP (scale 0-0.25) for the bottom half of category A runs, best first: MSRA_TRECVID, UCF.EDEG, MSRA_TRECVID, CityUHK6, PicSOM_F6, PicSOM_F5, PicSOM_F4, ucb_sound, MSRA_TRECVID, PicSOM_F3, i2Rnus, KSpace-bb, KSpace-highSvm, NII_ISM_R3, i2Rnus, i2Rnus, NII_ISM_R2, i2Rnus, i2Rnus, i2Rnus, NII_ISM_R1, TokyoTech1, ZJU, TokyoTech1, kddi.SiriusCy2, Bilkent1, TZI_Text, CityUHK4, FD_SVM_BN, kddi.SiriusCy1, TZI_RelaxText, UEC_Common, icl.jhu_NPDE, Glasgow.Sheffield01, LIP6.FuzzyDT, FD_SVM_MTL, EUR01-SVM, icl.jhu_5, icl.jhu_4, FD_SCM_MTL, FD_SCM_BN, COST292R2, COST292R1]

SLIDE 21

Category B results

[Chart: mean inferred AP (scale 0-0.25) for category B runs, best first: tsinghua, tsinghua, tsinghua, tsinghua, MM.top, MM.bottom, MM.strange, MM.charm, OXVGG_AOJ, clips.local-reuters-scale, clips.local-text-scale, clips.local-reuters-kernel-sum, PicSOM_F9, OXVGG_OJ, MM.up, clips.local-reuters-late-context, FXPAL-06Beta, tsinghua, clips.optimized-fusion-all, PicSOM_F7, FXPAL-06Beta, FXPAL-06Beta, FXPAL-06Beta, FXPAL-06Beta, FXPAL-06Beta, MM.down, OXVGG_A, clips.local-reuters-kernel-prod, TZI_Image, TZI_Avg, TZI_RelaxImage, TZI_RlxAll]

SLIDE 22

Category C results

[Chart: mean inferred AP (scale 0-0.25) for category C runs, best first: kddi.SiriusCy6, kddi.SiriusCy5, kddi.SiriusCy4, kddi.SiriusCy3, UIowa06FE02, UIowa06FE01]

SLIDE 23

Inferred Avg. Precision by feature (all runs)

[Boxplot: inferred average precision by feature over all runs, showing the median and the middle half of the data. Features: 1 sports, 3 weather, 5 office, 6 meeting, 10 desert, 12 mountain, 17 waterscape/waterfront, 22 corporate leader, 23 police security, 24 military personnel, 26 animal, 27 computer tv screen, 28 us flag, 29 airplane, 30 car, 32 truck, 35 people marching, 36 explosion fire, 38 maps, 39 charts]

SLIDE 24

Inferred avg. precision by feature (top 10 runs)

[Chart: inferred average precision (scale 0.1-0.9) by feature (1-39) for each of the top 10 runs, with the median over all runs]

Which, if any, differences are significant, i.e. not due to chance?


SLIDE 25

Randomization testing

Method of testing for significant pairwise differences between runs

Developed c. 1935 by R.A. Fisher as a thought experiment

Gained new usefulness with the advent of computer-intensive methods in statistics

Avoids dependence on (usually untrue) assumptions that samples are truly random, normally distributed, have equal variances, etc.

But makes no claims about populations

SLIDE 26

Randomization test procedure

1. Given observed scores for two systems on the same 20 features, calculate the mean score for each system and the observed difference between the means.

2. We would like to know whether the difference is due to the systems or to chance.

3. Generate a distribution of differences between the means under the null hypothesis that the difference is due to chance: for any feature, the score from one system could equally likely have come from the other.

  • Calculate the within-feature pairwise differences and the difference in means, once
  • For ~10,000 iterations or more:
    • For each pair of scores, randomly change the sign of the difference
    • Sum the differences, calculate the new mean, and add it to the H0 distribution

4. Count how many differences in H0 are equal to or more extreme than the observed difference.

5. Take [count / total number of generated differences] as the probability (p) that the observed difference in means is due to chance. (A code sketch of this procedure follows the worked example on the next slide.)

SLIDE 27

Randomization test procedure: worked example

1. Given observed scores for two systems on the same 20 features, calculate the mean score for each system and the observed difference between the means.

R1: 0.467 0.434 0.013 0.314 0.041 0.188 0.242 ...
R2: 0.367 0.515 0.004 0.236 0.057 0.087 0.054 ...
(R1-R2)/20: SUM(+0.1 -0.081 +0.009 +0.078 -0.016 +0.101 +0.188 ...)/20 = 0.033

2. Generate a distribution of differences between the means under the null hypothesis that the difference is due to chance: for any feature, the score from one system could equally likely have come from the other.

1. SUM(-0.1 -0.081 -0.009 +0.078 -0.016 +0.101 +0.188 ...)/20 = -0.008
2. SUM(+0.1 -0.081 +0.009 -0.078 +0.016 +0.101 -0.188 ...)/20 = 0.019
3. SUM(-0.1 -0.081 -0.009 +0.078 -0.016 -0.101 +0.188 ...)/20 = 0.046
...
5. SUM(+0.1 +0.081 +0.009 -0.078 +0.016 +0.101 +0.188 ...)/20 = -0.224

3145 of 95344 generated differences >= 0.033
Probability observed difference is due to chance (p) = 0.03299
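The procedure of the last two slides fits in a few lines. A minimal Python sketch, assuming (as in the worked example) that the observed difference is positive; a two-sided variant would compare absolute values:

```python
import random

def randomization_test(scores_a, scores_b, iterations=100_000, seed=2006):
    """Paired sign-flip randomization test over per-feature scores.

    scores_a, scores_b: infAP per feature for two runs, aligned by feature.
    Returns p, the estimated probability of seeing a mean difference at
    least as large as the observed one under the null hypothesis.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)   # e.g. 0.033 in the worked example
    hits = 0
    for _ in range(iterations):
        # under H0, each per-feature difference is equally likely to carry
        # either sign, so flip signs at random and recompute the mean
        mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if mean >= observed:
            hits += 1
    # the slide counts 3145 of 95344 samples >= 0.033, giving p = 0.03299
    return hits / iterations
```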

SLIDE 28

Significant differences among top 10 A-category runs (using randomization test, p < 0.05)

Run name (mean infAP):
A_tsinghua_6 (0.192)
A_IBM.MBWN_5 (0.177)
A_IBM.MRF_2 (0.176)
A_IBM.MAAR_3 (0.170)
A_IBM.MBW_1 (0.169)
A_CMU.Return_of_The_Jedi_6 (0.159)
A_IBM.UB_4 (0.155)
A_CMU.The_Empire_Strikes_back_5 (0.153)
A_CMU.A_New_Hope_4 (0.148)
A_CMU.Attack_of_The_Clones_2 (0.146)

Significant pairwise differences:
A_tsinghua_6 > A_IBM.UB_4, A_CMU.Return_of_The_Jedi_6, A_CMU.A_New_Hope_4, A_CMU.The_Empire_Strikes_back_5, A_CMU.Attack_of_The_Clones_2
A_IBM.MRF_2 > A_CMU.Attack_of_The_Clones_2, A_IBM.UB_4
A_IBM.MBWN_5 > A_CMU.Attack_of_The_Clones_2, A_IBM.UB_4
A_IBM.MAAR_3 > A_IBM.UB_4

SLIDE 29

Comparison with TV2005

Some features were also evaluated last year

Comparison yields a mixed bag:

2 features decreased

2 features increased

1 feature stable

most of these features have just 100-200 true hits in the sampled pool

Caveat: the comparison is just indicative…

it compares MAP and infAP

but the test set was drawn from a dataset similar to TV2005's

Did anyone re-run last year's system?

[Chart: top and median scores (scale 0.1-0.6) for TV2005 vs. TV2006 on the shared features: explosion or fire, US flag, waterscape, mountain, sports]

SLIDE 30

infAP vs. # true shots in test data

[Scatter plot: median and maximum infAP per feature vs. number of true shots in the test data (200-1800)]

SLIDE 31

General observations (1)

Participation is still increasing

Maintained focus on cat A

Most groups built a generic feature detector

Top scores come from the usual suspects plus a few new groups

[Chart: number of HLF task participants per year, 2003-2006]

SLIDE 32

General observations (2)

Many interesting new techniques are tried

Some consolidation: SVM is the dominant classifier with robust results

Good systems combine representations at multiple granularities

Salient point representation gaining more ground

Good systems combine different feature types (c,t,e/s,a,T,f)

8/30 teams look at more than just the shot keyframe

Many interesting multimodal/concept fusion experiments, room for more exploration here

multi-concept fusion still of limited use (due to small lexicon?)

CMU: not many concepts support each other

Columbia: 3 out of 4 predicted concepts show a 30% increase

Can concept fusion learn from IR co-occurrence techniques?

SLIDE 33

Overview of approaches across sites

feature types

c: color, t: texture, s: shape, e: edges, a: acoustic, f: face, T: text

granularity (local, region, global)

classifier techniques

fusion

generic vs. feature specific

focus of each site's experiments marked in blue, speaking slots in yellow

SLIDE 34

Best runs, descending (category, best infAP; then feature types, repr. granularity, temporal analysis, classifier, fusion, as recoverable):

tsinghua (A, 0.192): features c,t,T,f; global, grid, segm. point; temporal: camera motion, motion act.; classifier: SVM; multimodal fusion: weight-select, rankboost, stackedSVM; multi-concept fusion: stackedSVM, rules
IBM.MAAR (A, 0.170): classifier: SVM, ?; other details: ?
CMU.A_New_Hope (A, 0.148): features c,t,T; grid (5x5) + points; classifier: SVM; multimodal fusion: logistic regression, early, late, borda; multi-concept fusion: multi discr RF (chi2 selection)
COL1 (A, 0.142): features c,t,T; SIFT points/grid; classifier: SVM, EMD; multimodal fusion: average fusion; multi-concept fusion: boosting CRF (PMI selection)
ucb_1best (A, 0.122): features e,T; points; temporal: shot context; classifier: SVM; fusion: SVM
UCF.CE.PROB (A, 0.119): features c,e; classifier: SVM; fusion: average/product/KDE
MM.bottom (B, 0.117): global, grid, point; classifier: SVM / log. reg. / LD; fusion: SVM early/late fusion
KSpace-base (A, 0.110): features c,t,e,T; grid; temporal: camera motion; classifier: SVM; fusion: bayesian (DS); generic + specific
CityUHK1 (A, 0.106): features c,t; points + grid; classifier: SVM, EMD; fusion: average fusion
MSRA_TRECVID (A, 0.086): features c,t,s,f,T; global, grid; classifiers: SVM, KDE, manifold ranking, t-graph; fusion: weighted fusion, also looked at unlabeled data
NTU (A, 0.073)
PicSOM_F7 (B, 0.064): features c,t,T; grid; temporal: motion act., average c,t for shot; classifier: SOM; fusion: linear combination; multi-concept fusion: handpicked negative concepts
FXPAL-06Beta (B, 0.059): classifier: SVM; fusion: DRF / chi2
OXVGG_A (B, 0.053): features c,e,f; points (sparse/dense); classifier: SVM; fusion: Borda count

SLIDE 35

(same columns as the previous slide; runs in ascending order of best infAP)

COST292R1 (A, 0.000): features c,t / T; points/grid/LSA; classifier: NN/Bayes; not all features generic
Iowa06FE01 (C, 0.001)
icl.jhu_4 (A, 0.001): features t,T; grid; classifier: likelihood ratio (HMM); source adaptation
FD_SCM_BN (A, 0.001): features c,t; points; classifier: GMM/SVM; fusion: cond. prob.
EUR01-SVM (A, 0.002): features c,t; points; classifier: SVM; fusion: NN
LIP6.FuzzyDT (A, 0.004): features p,c; grid; classifier: fuzzy decision trees
Glasgow.Sheffield01 (A, 0.005): features T; classifier: tfidf
UEC_Common (A, 0.006)
TZI_Avg (B, 0.021): features c,T,e,f,a; temporal: every 20th frame; classifier: SVM; fusion: weighted average, prob. relax. labelling; cond. prob.; generic + specific
Bilkent1 (A, 0.021): features t,e,T; grid; classifier: NN
kddi.SiriusCy3 (C, 0.026): features s; grid + points; classifier: Haar/KNN; not all features generic
ZJU (A, 0.029): features c,t,e,T,a; global; classifier: SVM; fusion: multimodal subspace correlation propagation
TokyoTech1 (A, 0.030)
clips.local-reuters-kernel-prod (B, 0.031): features c,t,T; local + global; classifier: SVM
NII_ISM_R1 (A, 0.033): loc. bin. pat.; overlapping grid; classifier: SVM
i2Rnus (A, 0.040): features c,t,T; grid; temporal: frame clustering, bigrams; classifiers: SVM, LDF, GMM; fusion: cond. prob.

SLIDE 36

Issues

How to make the most of a fixed, limited amount of assessor time

Sampling method

Equal pool size for each feature?

Repetition of advertisement clips was less of an issue than in TV2005

Systematic study of the interaction between search and HLF

How to proceed after 5 years of HLF?

massive scaling requires massive amounts of annotation and assessment time

SLIDE 37

Discussion input

How to make the most of a fixed, limited amount of assessor time

Sampling method refinement

top->sample->unique vs. top->unique->sample?

mark as ignored vs. mark as non-relevant

MAP vs. precision@N

Equal pool size for each feature?

How to proceed after 5 years of HLF?

massive scaling requires massive amounts of annotation and possibly assessment time

Explore social tagging, annotation as a game?