

SLIDE 1

TRECVID-2015 Semantic Indexing task: Overview

Georges Quénot, Laboratoire d'Informatique de Grenoble
George Awad, Dakota Consulting / NIST

SLIDE 2

Outline

  • Task summary (Goals, Data, Run types, Concepts, Metrics)
  • Evaluation details
  • Inferred average precision
  • Participants
  • Evaluation results
  • Hits per concept
  • Results per run
  • Results per concept
  • Significance tests
  • Progress task results
  • Global Observations


SLIDE 3

Semantic Indexing task

  • Goal: automatic assignment of semantic tags to video segments (shots).
  • Secondary goals:
  • Encourage generic (scalable) methods for detector development.
  • Semantic annotation is important for filtering, categorization, searching, and browsing.
  • Task: find shots that contain a certain concept, rank them according to a confidence measure, and submit the top 2000 (see the sketch after this list).
  • Participants submitted one type of run:
  • Main run: includes results for 60 concepts, of which NIST evaluated 30.
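
To make the required output concrete, here is a minimal sketch (with hypothetical shot IDs and scores) of ranking shots by confidence and keeping the top 2000 for one concept; the actual TRECVID submission format includes additional fields not shown here.

```python
# Minimal sketch of the task output: rank shots for one concept by
# confidence and keep the top 2000. Shot IDs and scores are hypothetical.

def top_shots(scores, limit=2000):
    """Return (shot_id, confidence) pairs sorted by decreasing confidence."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]

if __name__ == "__main__":
    scores = {"shot1_23": 0.91, "shot1_7": 0.42, "shot2_101": 0.77}
    for rank, (shot, conf) in enumerate(top_shots(scores), start=1):
        print(rank, shot, f"{conf:.4f}")
```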


SLIDE 4

Semantic Indexing task (data)

  • SIN testing dataset:
  • Main test set (IACC.2.C): 200 hours, with video durations between 10 seconds and 6 minutes.
  • SIN development dataset:
  • IACC.1.A, IACC.1.B, IACC.1.C & IACC.1.tv10.training: 800 hours, used from 2010 to 2012, with durations between 10 seconds and just over 3.5 minutes.
  • Total shots:
  • Development: 549,434
  • Test: IACC.2.C (113,046 shots)
  • Common annotation for 346 concepts, coordinated by LIG/LIF/Quaero from 2007 to 2013, was made available.


SLIDE 5

Semantic Indexing task (Concepts)

  • The 60 target concepts were drawn from a set of 500 concepts: the TRECVID “high level features” used from 2005 to 2010 (chosen to favor cross-collection experiments), plus a selection of LSCOM concepts.
  • Generic-specific relations among the concepts promote research on methods for indexing many concepts and for using ontology relations between them.
  • The concepts cover a number of potential subtasks, e.g. “persons” or “actions” (not really formalized).
  • These concepts are expected to be useful for the content-based (instance) search task.
  • Set of relations provided (see the sketch after this list):
  • 427 “implies” relations, e.g. “Actor implies Person”
  • 559 “excludes” relations, e.g. “Daytime_Outdoor excludes Nighttime”
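
The slides do not prescribe how systems should exploit these relations; purely as an illustration, the following sketch shows one plausible post-processing step in which an “implies” relation propagates scores upward and an “excludes” relation damps the weaker of two conflicting scores. All score values here are hypothetical.

```python
# Sketch: post-processing per-shot concept scores with ontology relations.
# "A implies B"  -> B's score should be at least A's score.
# "A excludes B" -> both should not be high; damp the weaker of the two.

def apply_relations(scores, implies, excludes):
    scores = dict(scores)
    for a, b in implies:              # e.g. ("Actor", "Person")
        if a in scores and b in scores:
            scores[b] = max(scores[b], scores[a])
    for a, b in excludes:             # e.g. ("Daytime_Outdoor", "Nighttime")
        if a in scores and b in scores:
            weaker, stronger = (a, b) if scores[a] <= scores[b] else (b, a)
            scores[weaker] = min(scores[weaker], 1.0 - scores[stronger])
    return scores

print(apply_relations({"Actor": 0.8, "Person": 0.5}, [("Actor", "Person")], []))
# {'Actor': 0.8, 'Person': 0.8}
```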


SLIDE 6

Semantic Indexing task (training types)

  • Six training types were allowed:
  • A – used only IACC training data (30 runs)
  • B – used only non-IACC training data (0 runs)
  • C – used both IACC and non-IACC TRECVID (S&V and/or Broadcast news) training data (2 runs)
  • D – used both IACC and non-IACC non-TRECVID training data (54 runs)
  • E – used only training data collected automatically using only the concepts’ name and definition (0 runs)
  • F – used only training data collected automatically using a query built manually from the concepts’ name and definition (0 runs)


SLIDE 7

30 single concepts evaluated

3 Airplane*                     72 Kitchen
5 Anchorperson                  80 Motorcycle*
9 Basketball*                   85 Office
13 Bicycling*                   86 Old_people
15 Boat_Ship*                   95 Press_conference
17 Bridges*                     100 Running*
19 Bus*                         117 Telephones*
22 Car_Racing                   120 Throwing
27 Cheering*                    261 Flags*
31 Computers*                   297 Hill
38 Dancing                      321 Lakes
41 Demonstration_Or_Protest     392 Quadruped*
49 Explosion_fire               440 Soldiers
56 Government_leaders           454 Studio_With_Anchorperson
71 Instrumental_Musician*       478 Traffic

  • The 14 marked with “*” are a subset of those tested in 2014.

SLIDE 8

Evaluation

  • The 30 evaluated single concepts were chosen by examining the scores of the 60 concepts evaluated in TRECVID 2013 across all runs and choosing the top 45 concepts with maximum score variation.
  • Each feature is assumed to be binary: absent or present for each master reference shot.
  • NIST sampled the ranked pools and judged the top results from all submissions.
  • Metric: inferred average precision per concept.
  • Main runs were compared in terms of mean inferred average precision across the 30 concept results.


SLIDE 9

2015: mean extended inferred average precision (xinfAP)

  • Two pools were created for each concept and sampled as follows:
  • Top pool (ranks 1-200): sampled at 100%
  • Bottom pool (ranks 201-2000): sampled at 11.1%
  • Judgment process: one assessor per concept; the assessor watched the complete shot while listening to the audio.
  • infAP was calculated over the judged and unjudged pools by the sample_eval tool (illustrated in the sketch below).

30 concepts; 195,500 total judgments; 11,636 total hits:
  • 7,489 hits at ranks 1-100
  • 2,970 hits at ranks 101-200
  • 1,177 hits at ranks 201-2000
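
The official scores were produced by NIST's sample_eval tool; purely to illustrate the stratified design above, here is a toy sketch of how the two sampling rates yield an estimate of a concept's total number of relevant shots. The judgment lists are hypothetical inputs.

```python
# Toy sketch of the stratified estimate behind the two-pool design:
# ranks 1-200 judged exhaustively, ranks 201-2000 sampled at 11.1%.
# Official infAP values come from NIST's sample_eval, not from this code.

def estimated_total_hits(top_judgments, bottom_sample_judgments, bottom_rate=0.111):
    """Each list holds booleans: True where a judged shot was relevant."""
    hits_top = sum(top_judgments)                             # exhaustive stratum
    hits_bottom = sum(bottom_sample_judgments) / bottom_rate  # scaled-up stratum
    return hits_top + hits_bottom
```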


SLIDE 10

2015: 15 Finishers

  • PicSOM: Aalto U., U. of Helsinki
  • ITI_CERTH: Information Technologies Institute, Centre for Research and Technology Hellas
  • CMU: Carnegie Mellon U.; CMU-Affiliates
  • Insightdcu: Dublin City U.; U. Polytechnica Barcelona
  • EURECOM: EURECOM
  • FIU_UM: Florida International U., U. of Miami
  • IRIM: CEA-LIST, ETIS, EURECOM, INRIA-TEXMEX, LABRI, LIF, LIG, LIMSI-TLP, LIP6, LIRIS, LISTIC
  • LIG: Laboratoire d'Informatique de Grenoble
  • NII_Hitachi_UIT: Natl. Inst. of Informatics; Hitachi Ltd; U. of Inf. Tech. (HCM-UIT)
  • TokyoTech: Tokyo Institute of Technology
  • MediaMill: U. of Amsterdam; Qualcomm
  • siegen_kobe_nict: U. of Siegen; Kobe U.; Natl. Inst. of Info. and Comm. Tech.
  • UCF_CRCV: U. of Central Florida
  • UEC: U. of Electro-Communications
  • Waseda: Waseda U.

SLIDE 11

Inferred frequency of hits varies by concept

[Bar chart: inferred hits (“Inf. Hits”) for each of the 30 evaluated concepts, Airplane through Traffic; x-axis 500-3500 hits; a reference line marks 1% of the total test shots.]

SLIDE 12

Total true shots contributed uniquely by team

Team               No. of shots
Insightdcu         27
NII                19
UEC                17
siegen_kobe_nict   13
EURECOM            10
FIU                10
UCF                10
Mediamill           8
NHKSTRL             7
ITI_CERTH           6
HFUT                4
CMU                 3
LIG                 2
IRIM                1

Fewer unique shots compared to TV2014, TV2013 & TV2012


SLIDE 13

Main runs scores – 2015 submissions

[Bar chart: mean InfAP of each submitted run, sorted from D_MediaMill.15 (highest) down to A_FIU_UM.15; y-axis 0.05-0.4.]

Median = 0.239

Run types: D = both IACC and non-IACC non-TRECVID training data; A = only IACC training data; C = both IACC and non-IACC TRECVID training data.

Higher median and max scores than 2014.


SLIDE 14

Main runs scores – Including progress

[Bar chart: mean InfAP of the 2015 submissions interleaved with the 2013 and 2014 progress runs scored against the 2015 test data, plus the NIST median baseline run (D_nist.baseline.15); y-axis 0.05-0.4.]

Median = 0.188

* Progress runs: runs submitted in 2013 and in 2014, scored against the 2015 testing data.

SLIDE 15

Top 10 InfAP scores by concept

[Box plot: top 10 InfAP scores and the median for each of the 30 evaluated concepts, from Airplane* through Traffic; y-axis 0.1-1.0.]

* Common concept with TV2014

Most common concepts have higher max scores than in TV2014.

SLIDE 16

Statistically significant differences among top 10 main runs (using randomization test, p < 0.05)

Run name            Mean infAP
D_MediaMill.15_4    0.362
D_MediaMill.15_2    0.359
D_MediaMill.15_1    0.359
D_MediaMill.15_3    0.349
D_Waseda.15_1       0.309
D_Waseda.15_4       0.307
D_Waseda.15_3       0.307
D_Waseda.15_2       0.307
D_TokyoTech.15_1    0.299
D_TokyoTech.15_2    0.298

[Significance chart: each of D_MediaMill.15_4, D_MediaMill.15_1, and D_MediaMill.15_2 significantly outperforms D_MediaMill.15_3 and all Waseda and TokyoTech runs.]
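
For reference, the kind of paired randomization test named in the slide title can be sketched as follows; the per-concept infAP lists are hypothetical inputs, and the official evaluation used its own implementation.

```python
import random

# Sketch: paired randomization (permutation) test over per-concept infAP.
# Under the null hypothesis the two runs are exchangeable, so randomly
# swapping each concept's pair of scores should often produce a difference
# at least as large as the observed one; the p-value is that frequency.

def randomization_test(run_a, run_b, trials=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(run_a) - sum(run_b))
    count = 0
    for _ in range(trials):
        diff = sum((y - x) if rng.random() < 0.5 else (x - y)
                   for x, y in zip(run_a, run_b))
        if abs(diff) >= observed:
            count += 1
    return count / trials  # two-sided p-value; significant if < 0.05
```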


SLIDE 17

Progress subtask

  • Measuring the progress of 2013, 2014, & 2015 systems on the IACC.2.C dataset.
  • 2015 systems used the same training data and annotations as in 2013 & 2014.
  • In total, 6 teams submitted progress runs against the IACC.2.C dataset.


SLIDE 18

Progress subtask: Comparing best runs in 2013, 2014 & 2015 by team

[Bar chart: mean InfAP of each team's best 2013, 2014, and 2015 systems, for EURECOM, IRIM, ITI_CERTH, LIG, UEC, and insightdcu; y-axis 0.05-0.35.]

Randomization tests show that the 2015 systems are better than the 2013 & 2014 systems (except for UEC, whose 2014 system is better).


SLIDE 19

Progress subtask: Concepts improved vs. weakened by team


[Bar chart: per-team counts of concepts improved vs. weakened relative to the team's 2013 and 2014 systems; e.g. concepts better than 2013: EURECOM 30, IRIM 25, insightdcu 14, LIG 25, UEC 30, ITI_CERTH 21.]

Most 2015 concepts improved.

SLIDE 20

2015 Observations

  • The 2015 main task was harder than the 2014 main task, which was itself harder than the 2013 main task (different data and a different set of target concepts).
  • Raw system scores have a higher max and median compared to TV2014 and TV2013; they are still relatively low but regularly improving.
  • Most concepts in common with TV2014 have higher median scores.
  • Most progress systems improved significantly from 2014 to 2015, as was also the case from 2013 to 2014.
  • Stable participation (15 teams) between 2014 and 2015 (but 26 teams for TV2013).


SLIDE 21

2015 Observations - methods

  • Further moves toward deep learning
  • More “deep-only” submissions
  • Retraining of networks trained on ImageNet
  • Use of many deep networks in parallel
  • Data augmentation for training
  • Use of multiple frames per shot for prediction
  • Feeding of DCNNs with gradient and motion features
  • Use of “deep features” (either final or hidden) with “classical” learning (see the sketch after this list)
  • Hybrid DCNN-based/classical systems
  • Engineered features still used as a complement (mostly Fisher Vectors, SuperVectors, improved BoW, and similar) but with no new development
  • Use of re-ranking or equivalent methods
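
As one concrete (and hypothetical) reading of the “deep features with classical learning” item above: extract a hidden- or final-layer activation from a network pretrained on ImageNet and train a linear SVM per concept on top of it. The extract_dcnn_feature function is a placeholder, not any participant's documented pipeline.

```python
from sklearn.svm import LinearSVC  # "classical" learner on top of deep features

def extract_dcnn_feature(keyframe):
    """Placeholder: return a hidden- or final-layer activation vector for one
    shot keyframe, e.g. from a DCNN pretrained on ImageNet."""
    raise NotImplementedError

def train_concept_classifier(keyframes, labels):
    """Fit a per-concept linear SVM on deep features (labels: presence 1/0)."""
    features = [extract_dcnn_feature(f) for f in keyframes]
    return LinearSVC().fit(features, labels)
```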


SLIDE 22

SIN 2016?

  • No SIN task is planned for 2016.
  • Resuming the ad hoc video retrieval task is being considered instead.
