SLIDE 1

TRECVID 2017
AD-HOC VIDEO SEARCH TASK: OVERVIEW

Georges Quénot, Laboratoire d'Informatique de Grenoble
George Awad, Dakota Consulting, Inc. / National Institute of Standards and Technology

Disclaimer

The identification of any commercial product or trade name does not imply endorsement or recommendation by the National Institute of Standards and Technology.

SLIDE 2

Table of contents

  • Task Definition
  • Video Data
  • Topics (Queries)
  • Participating teams
  • Evaluation & results
  • General observations

SLIDE 3

Ad-hoc Video Search Task Definition

  • Goal: promote progress in content-based retrieval based on end-user ad-hoc queries that include persons, objects, locations, activities, and their combinations.
  • Task: Given a test collection, a query, and a master shot boundary reference, return a ranked list of at most 1000 shots (out of 335,944) which best satisfy the need (a minimal sketch of producing such a ranked list follows this list).
  • Testing data: 4593 Internet Archive videos (IACC.3), 600 total hours, with video durations between 6.5 and 9.5 minutes.
  • Development data: ≈1400 hours of previous IACC data used between 2010 and 2015, with concept annotations.
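To make the ranked-list requirement concrete, here is a minimal Python sketch. The per-shot scores and shot ids are invented, and the official submission format is the one defined in the TRECVID guidelines, not shown here.

```python
def top_shots(scores, limit=1000):
    """Rank candidate shots for one query and keep at most `limit` of them,
    as the task requires (at most 1000 of the 335,944 master shots)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]

# Hypothetical relevance scores produced by a retrieval system:
print(top_shots({"shot1_1": 0.42, "shot1_2": 0.97, "shot2_5": 0.10}))
# [('shot1_2', 0.97), ('shot1_1', 0.42), ('shot2_5', 0.1)]
```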

SLIDE 4

Query Development

  • Test videos were viewed by 10 human assessors hired by the National Institute of Standards and Technology (NIST).
  • A 4-facet description of the different scenes was used (where applicable); a toy example of flattening the facets into a query follows this list:
    • Who: concrete objects and beings (kinds of persons, animals, things)
    • What: what are the objects and/or beings doing? (generic actions, conditions/states)
    • Where: locale, site, place, geographic or architectural setting
    • When: time of day, season
  • In total, assessors watched ≈35% of the IACC.3 videos.
  • 90 candidate queries were chosen from the human-written descriptions, to be used between 2016 and 2018.
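As a toy illustration only (assessors wrote free-form descriptions; the dictionary layout and the flattening function below are invented, while the facet values come from one of the actual TV2017 queries):

```python
scene = {
    "who": "one or more people",
    "what": "swimming",
    "where": "in a swimming pool",
    "when": None,  # facet omitted when not applicable
}

def to_query(facets):
    # Concatenate the applicable facets into a "Find shots of ..." query.
    parts = [facets[k] for k in ("who", "what", "where", "when") if facets[k]]
    return "Find shots of " + " ".join(parts)

print(to_query(scene))
# Find shots of one or more people swimming in a swimming pool
```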

SLIDE 5

TV2017 Queries by complexity

  • Person + Action + Object + Location
    • Find shots of one or more people eating food at a table indoors
    • Find shots of one or more people driving snowmobiles in the snow
    • Find shots of a man sitting down on a couch in a room
    • Find shots of a person talking behind a podium wearing a suit outdoors during daytime
    • Find shots of a person standing in front of a brick building or wall
  • Person + Action + Location
    • Find shots of children playing in a playground
    • Find shots of one or more people swimming in a swimming pool
    • Find shots of a crowd of people attending a football game in a stadium
    • Find shots of an adult person running in a city street

SLIDE 6

TV2017 Queries by complexity

  • Person + Action/state + Object
    • Find shots of a person riding a horse including horse-drawn carts
    • Find shots of a person wearing any kind of hat
    • Find shots of a person talking on a cell phone
    • Find shots of a person holding or operating a TV or movie camera
    • Find shots of a person holding or opening a briefcase
    • Find shots of a person wearing a blue shirt
    • Find shots of a person holding, throwing or playing with a balloon
    • Find shots of a person wearing a scarf
    • Find shots of a person holding, opening, closing or handing over a box
  • Person + Action
    • Find shots of a person communicating using sign language
    • Find shots of a child or group of children dancing
    • Find shots of people marching in a parade
    • Find shots of a male person falling down

SLIDE 7

TV2017 Queries by complexity

  • Person + Object + Location
    • Find shots of a man and woman inside a car
  • Person + Location
    • Find shots of a chef or cook in a kitchen
    • Find shots of a blond female indoors
  • Person + Object
    • Find shots of a person with a gun visible
  • Object + Location
    • Find shots of a map indoors
  • Object
    • Find shots of vegetables and/or fruits
    • Find shots of a newspaper
    • Find shots of at least two planes, both visible

SLIDE 8


Training and run types

Four training data types:

✓ A – used only IACC training data (0 runs)
✓ D – used any other training data (40 runs)
✓ E – used only training data collected automatically using only the query text (12 runs)
✓ F – used only training data collected automatically using a query built manually from the given query text (0 runs)

Two run submission types:

Manually-assisted (M): query built manually (19 runs)
Fully automatic (F): system uses the official query directly (33 runs)

SLIDE 9


Finishers: 10 out of 20

Team              M  F  Organization
INF               -  4  Renmin University; Shandong Normal University; Chongqing University of Posts and Telecommunications; Carnegie Mellon University
kobe_nict_siegen  3  -  Kobe University, Japan; Center for Information and Neural Networks, National Institute of Information and Communications Technology (NICT), Japan; Pattern Recognition Group, University of Siegen, Germany
ITI_CERTH         -  4  Information Technologies Institute, Centre for Research and Technology Hellas
ITEC_UNIKLU       4  4  Klagenfurt University
NII_Hitachi_UIT   -  5  National Institute of Informatics, Japan (NII); Hitachi, Ltd.; University of Information Technology, VNU-HCM, Vietnam (HCM-UIT)
MediaMill         -  4  University of Amsterdam
Waseda_Meisei     4  4  Waseda University; Meisei University
VIREO             4  4  City University of Hong Kong
EURECOM           -  4  EURECOM
FIU_UM            4  -  Florida International University, University of Miami

SLIDE 10


Evaluation

Each query is assumed to be binary: the target is either absent or present in each master reference shot. NIST sampled the ranked pools and judged the top results from all submissions. Metric: inferred average precision per query; runs are compared in terms of mean inferred average precision across the 30 queries.

SLIDE 11


Mean Extended Inferred Average Precision (XInfAP)

Two pools were created for each query and sampled as follows:

✓ Top pool (ranks 1 to 150) sampled at 100%
✓ Bottom pool (ranks 151 to 1000) sampled at 2.5%
✓ Percentage of sampled and judged clips from ranks 151 to 1000 across all runs and topics: min = 2%, max = 64.4%, mean = 29%

Judgment process: one assessor per query; the assessor watched the complete shot while listening to the audio. infAP was calculated over the judged and unjudged parts of the pool using the sample_eval tool (a toy illustration of the underlying idea follows the counts below).

30 queries; 89,435 total judgments; 9611 total hits: 7209 at ranks 1-100, 2013 at ranks 101-150, and 389 at ranks 151-1000 (hit counts above, and partly far above, TV2016).
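The official scores come from NIST's sample_eval tool. As a rough illustration only of the idea behind inferred AP (estimating precision from a partially judged, sampled pool), here is a simplified sketch; it is not the xinfAP estimator, which additionally reweights by the per-stratum sampling rates described above.

```python
def inferred_ap_sketch(ranked, judgments, est_total_relevant):
    """Toy inferred-AP estimate for one run on one query.

    ranked: shot ids in system rank order.
    judgments: shot_id -> True/False for the sampled, judged shots;
        unjudged shots are simply absent.
    est_total_relevant: estimate of R, e.g. judged hits scaled up by
        the inverse sampling rate of each pool.

    Precision at each judged relevant shot is estimated from the judged
    shots ranked at or above it; with 100% judging this reduces to
    exact average precision.
    """
    total = 0.0
    judged_seen = rel_seen = 0
    for shot in ranked:
        verdict = judgments.get(shot)  # None = not sampled for judging
        if verdict is None:
            continue
        judged_seen += 1
        if verdict:
            rel_seen += 1
            total += rel_seen / judged_seen  # estimated precision here
    return total / max(est_total_relevant, 1)

# Tiny example: five ranked shots, two of them unjudged.
print(inferred_ap_sketch(
    ["s1", "s2", "s3", "s4", "s5"],
    {"s1": True, "s3": False, "s5": True},
    est_total_relevant=2))  # ~0.833
```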

SLIDE 12


Inferred frequency of hits varies by query

[Chart: inferred hits per query (topics 531-559), with a reference line at 1% of the test shots.]

SLIDE 13


Total true shots contributed uniquely by team

[Bar chart: number of true shots contributed uniquely by each team.]

SLIDE 14


2017 run submission scores (19 manually-assisted runs)

[Chart: mean Inf. AP per manually-assisted run.]

Median = 0.12 (>> TV2016: 0.04); Max = 0.216 (>> TV2016: 0.177).

SLIDE 15


2017 run submission scores (33 fully automatic runs)

[Chart: mean Inf. AP per fully automatic run, in plotted order: MediaMill.17_1, MediaMill.17_2, MediaMill.17_4, Waseda_Meisei.17_1, MediaMill.17_3, Waseda_Meisei.17_4, Waseda_Meisei.17_3, Waseda_Meisei.17_2, VIREO.17_2, VIREO.17_4, VIREO.17_3, ITI_CERTH.17_3, EURECOM.17_3, VIREO.17_1, ITI_CERTH.17_4, ITI_CERTH.17_1, EURECOM.17_1, EURECOM.17_2, ITI_CERTH.17_2, NII_Hitachi_UIT.17_1, NII_Hitachi_UIT.17_2, ITEC_UNIKLU.17_4, ITEC_UNIKLU.17_3, INF.17_2, ITEC_UNIKLU.17_2, NII_Hitachi_UIT.17_5, INF.17_1, NII_Hitachi_UIT.17_3, ITEC_UNIKLU.17_1, INF.17_3, EURECOM.17_4, INF.17_4, NII_Hitachi_UIT.17_4.]

Median = 0.092 (> TV2016: 0.024); Max = 0.206 (>> TV2016: 0.054).

SLIDE 16


Top 10 infAP scores by query (Fully automatic)

[Chart: top 10 Inf. AP scores and the median per topic (531-559), fully automatic runs. Annotated topics: people driving snowmobiles in snow; chef or cook in kitchen; person wearing any kind of hat; person standing in front of brick building or wall; adult running in city street; person holding, opening, closing or handing over a box; male person falling down.]

SLIDE 17


Top 10 infAP scores by query (Manually-assisted)

[Chart: top 10 Inf. AP scores and the median per query (531-559), manually-assisted runs, with an example of a query where the manual run improved over the automatic one.]

SLIDE 18

Which topics were easy or difficult overall?

Top 10 Easy (sorted by count of runs with InfAP >= 0.7):
  • a person wearing any kind of hat
  • a chef or cook in a kitchen
  • one or more people driving snowmobiles in the snow
  • one or more people swimming in a swimming pool
  • a man and woman inside a car
  • a crowd of people attending a football game in a stadium
  • a newspaper
  • a person communicating using sign language
  • a person wearing a scarf
  • a person riding a horse including horse-drawn carts

Top 10 Hard (sorted by count of runs with InfAP < 0.7):
  • an adult person running in a city street
  • person standing in front of a brick building or wall
  • person holding, opening, closing or handing over a box
  • a male person falling down
  • child or group of children dancing
  • children playing in a playground
  • person talking on a cell phone
  • person holding or opening a briefcase
  • one or more people eating food at a table indoors
  • person talking behind a podium wearing a suit outdoors during daytime


More action and dynamics in the hard queries.

SLIDE 19


Statistically significant differences among the top 10 “M” runs (randomization test, p < 0.05)

Run                    Mean Inf. AP
D_Waseda_Meisei.17_1   0.216  +
D_Waseda_Meisei.17_3   0.207  +
D_Waseda_Meisei.17_2   0.204  +
D_Waseda_Meisei.17_4   0.189  +
D_VIREO.17_4           0.164  !
D_VIREO.17_2           0.164  !
D_FIU_UM.17_2          0.147  #
D_FIU_UM.17_4          0.145  #
D_VIREO.17_1           0.124  *
D_VIREO.17_3           0.120  *

D_Waseda_Meisei.17_1
  ➢ D_VIREO.17_4
      ➢ D_VIREO.17_1
      ➢ D_VIREO.17_3
  ➢ D_VIREO.17_2
      ➢ D_VIREO.17_1
      ➢ D_VIREO.17_3
  ➢ D_FIU_UM.17_2
  ➢ D_FIU_UM.17_4

D_Waseda_Meisei.17_3
  ➢ D_VIREO.17_4
      ➢ D_VIREO.17_1
      ➢ D_VIREO.17_3
  ➢ D_VIREO.17_2
      ➢ D_VIREO.17_1
      ➢ D_VIREO.17_3
  ➢ D_FIU_UM.17_2
  ➢ D_FIU_UM.17_4

D_Waseda_Meisei.17_2
  ➢ D_VIREO.17_1
  ➢ D_VIREO.17_3
  ➢ D_FIU_UM.17_2
  ➢ D_FIU_UM.17_4

D_Waseda_Meisei.17_4
  ➢ D_VIREO.17_1
  ➢ D_VIREO.17_3
  ➢ D_FIU_UM.17_4

+ ! # * : no significant difference among the runs sharing the same symbol.
➢ Runs higher in the hierarchy are significantly better than the runs indented below them.
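For reference, here is a minimal sketch of a paired randomization test on per-topic scores. The per-topic Inf. AP values below are placeholders, and the exact NIST procedure may differ in details such as the number of permutations.

```python
import random

def randomization_test(ap_a, ap_b, trials=10000, seed=0):
    """Two-sided paired randomization test on per-topic scores.

    Under the null hypothesis the two runs are exchangeable, so each
    per-topic difference keeps or flips its sign with probability 1/2.
    Returns the fraction of sign-flipped worlds whose mean difference
    is at least as extreme as the observed one (the p-value).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ap_a, ap_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            extreme += 1
    return extreme / trials

# Placeholder per-topic Inf. AP values for two hypothetical runs:
run_a = [0.30, 0.25, 0.40, 0.10, 0.35]
run_b = [0.20, 0.22, 0.28, 0.05, 0.30]
print(randomization_test(run_a, run_b))  # p-value; < 0.05 => significant
```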

SLIDE 20


Statistically significant differences among the top 10 “F” runs (randomization test, p < 0.05)

Run                    Mean Inf. AP
D_MediaMill.17_1       0.206  +
D_MediaMill.17_2       0.205  +
D_MediaMill.17_4       0.177
D_Waseda_Meisei.17_1   0.159
D_MediaMill.17_3       0.150
D_Waseda_Meisei.17_4   0.143  #
D_Waseda_Meisei.17_3   0.141  #
D_Waseda_Meisei.17_2   0.125
D_VIREO.17_2           0.120  *
D_VIREO.17_4           0.116  *
D_VIREO.17_3           0.116  *

D_MediaMill.17_1
  ➢ D_MediaMill.17_4
  ➢ D_VIREO.17_2
  ➢ D_VIREO.17_3
  ➢ D_VIREO.17_4
  ➢ D_Waseda_Meisei.17_1
  ➢ D_Waseda_Meisei.17_2
  ➢ D_Waseda_Meisei.17_3
  ➢ D_Waseda_Meisei.17_4

D_MediaMill.17_2
  ➢ D_MediaMill.17_4
  ➢ D_VIREO.17_2
  ➢ D_VIREO.17_3
  ➢ D_VIREO.17_4
  ➢ D_Waseda_Meisei.17_1
  ➢ D_Waseda_Meisei.17_2
  ➢ D_Waseda_Meisei.17_3
  ➢ D_Waseda_Meisei.17_4

+ # * : no significant difference among the runs sharing the same symbol.
➢ Runs higher in the hierarchy are significantly better than the runs indented below them.

SLIDE 21

Good and fast


Processing time vs. Inf. AP (“M” runs), across all topics and runs

[Scatter plot: processing time in seconds (1-100, log scale) vs. Inf. AP, with Waseda_Meisei and kobe_nict_siegen runs labeled.]

SLIDE 22

Good and fast


Processing time vs. Inf. AP (“F” runs), across all topics and runs

[Scatter plot: processing time in seconds (1-1000, log scale) vs. Inf. AP, with VIREO and NII_Hitachi_UIT runs labeled.]

SLIDE 23


2017 Main Approaches

  • Concept bank with automatic or manual mapping of query terms to concepts (a minimal sketch follows this list)
  • Combination of concept scores via Boolean operators
  • Work on query understanding
  • Rectified Linear Score Normalization
  • Use of video-to-text techniques on shots
  • Query expansion / term matching techniques
  • Use of a unified text-image vector space
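A minimal sketch of the first approach (concept bank plus soft Boolean fusion). The concept names, scores, and mapping below are invented; real systems map query terms via embeddings or manual rules, and schemes such as FIU_UM's Rectified Linear Score Normalization are defined in the respective papers, not reproduced here.

```python
import numpy as np

# Invented concept bank: detector scores per shot (rows) and concept (cols).
concepts = ["person", "hat", "kitchen", "snow"]
rng = np.random.default_rng(0)
scores = rng.random((6, len(concepts)))  # 6 shots x 4 concept detectors

def rank_shots(query_terms, scores, concepts, top_k=1000):
    """Map query terms to concepts by exact string match (a stand-in for
    embedding- or rule-based mapping) and fuse the selected detector
    scores with a product, i.e. a soft Boolean AND."""
    idx = [concepts.index(t) for t in query_terms if t in concepts]
    if not idx:
        raise ValueError("no query term maps to a concept in the bank")
    fused = scores[:, idx].prod(axis=1)  # soft AND over selected concepts
    order = np.argsort(-fused)[:top_k]   # best shots first
    return [(int(i), float(fused[i])) for i in order]

# "Find shots of a person wearing any kind of hat" -> {person, hat}
print(rank_shots(["person", "hat"], scores, concepts))
```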
SLIDE 24


2017 Observations

  • Ad-hoc search is more difficult than simple concept-based tagging.
  • Max and median scores are better than TV2016 for both M and F runs.
  • Manually-assisted runs performed slightly better than fully automatic ones.
  • Most systems are not real-time (and slower systems were not necessarily more effective).
  • Some systems reported a processing time of 0 (or didn't measure it!).
  • No runs of types A or F were submitted, compared with 40 D runs and 12 E runs.
SLIDE 25


Continued at MMM2018

  • 10 Ad-Hoc Video Search (AVS) tasks: 5 are a random subset of the 30 AVS tasks of TRECVID 2017, and 5 will be chosen directly by human judges as a surprise. Each AVS task has several (possibly many) target shots that should be found.
  • 10 Known-Item Search (KIS) tasks, selected completely at random on site. Each KIS task has only one single 20-second target segment.
  • Registration for the task is now closed.
SLIDE 26


9:20 - 11:40: Ad-hoc Video Search

  • 9:40 - 10:00, Query understanding is key for zero-example video search (MediaMill - University of Amsterdam)
  • 10:00 - 10:20, Waseda_Meisei at TRECVID 2017: Ad-hoc video search (Waseda_Meisei - Waseda University; Meisei University)
  • 10:20 - 10:40, Break with refreshments
  • 10:40 - 11:00, FIU-UM@TRECVID 2017: Rectified Linear Score Normalization and Weighted Integration for Ad-hoc Video Search (FIU_UM - Florida International University, University of Miami)
  • 11:00 - 11:20, Interactive Video Search at VBS (ITEC_UNIKLU - Institute of Information Technology, Klagenfurt University)
  • 11:20 - 11:40, AVS discussion
SLIDE 27


2017 Questions

  • Were the task and queries realistic enough?
  • Do we need to change/add/remove anything from the task in 2018?
  • Is there any specific reason why systems did not submit any “F” runs (training data collected automatically using a query built manually from the given query text)?
  • Did any team run their 2017 system on the TV2016 topics, or their 2016 system on this year's topics?
  • Should we consider a new dataset in 2019 to continue working on Ad-hoc search (e.g., YouTube, Vimeo, etc.)?