SLIDE 1

TRECVID 2018

Video to Text Description

Asad A. Butt, NIST; George Awad, NIST / Dakota Consulting, Inc.; Alan Smeaton, Dublin City University


Disclaimer: Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.
SLIDE 2

Goals and Motivations

✓ Measure how well an automatic system can describe a video in natural language.

✓ Measure how well an automatic system can match high-level textual descriptions to low-level computer vision features.

✓ Transfer successful image captioning technology to the video domain.

Real-world Applications

✓ Video summarization
✓ Supporting search and browsing
✓ Accessibility - video description to the blind
✓ Video event prediction


SLIDE 3
TASKS

  • Systems are asked to submit results for two subtasks:
  • 1. Matching & Ranking: return for each URL a ranked list of the most likely text descriptions from each of the five sets.
  • 2. Description Generation: automatically generate a text description for each URL.

SLIDE 4

Video Dataset

  • Crawled 50k+ Twitter Vine video URLs.
  • Maximum video duration: 6 seconds.
  • A subset of 2000 URLs was (quasi) randomly selected and divided amongst 10 assessors.
  • Significant preprocessing to remove unsuitable videos.
  • Final dataset included 1903 URLs due to removal of videos from Vine.


SLIDE 5

Steps to Remove Redundancy

▪ Before selecting the dataset, we clustered videos based on visual similarity.

▪ Used a tool called SOTU [1], which uses Visual Bag of Words to cluster videos with 60% similarity over at least 3 frames (an illustrative sketch follows the reference below).

▪ This removed duplicate videos as well as those that were very visually similar (e.g. soccer games), resulting in a more diverse set of videos.


[1] Zhao, Wan-Lei and Ngo Chong-Wah. "SOTU in Action." (2012).
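As an illustration of this near-duplicate filtering step, here is a hedged Python sketch. It is not the SOTU tool itself: it flags two videos as near-duplicates when at least 3 sampled frames exceed a 60% bag-of-visual-words similarity. The ORB features, the histogram-intersection similarity, and all function names are assumptions made for the example.

```python
# Hedged sketch, not SOTU: two videos are near-duplicates when at least
# `min_frames` sampled frames exceed a visual-word similarity of `sim_thresh`.
# `codebook` is assumed to be e.g. a fitted sklearn KMeans over ORB descriptors.
import cv2
import numpy as np

def frame_histograms(video_path, codebook, n_frames=10):
    """Sample frames uniformly; return one L1-normalised visual-word histogram per frame."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    orb = cv2.ORB_create()
    hists = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        _, desc = orb.detectAndCompute(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), None)
        if desc is None:
            continue
        words = codebook.predict(desc.astype(np.float32))
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        hists.append(hist / hist.sum())
    cap.release()
    return hists

def near_duplicates(hists_a, hists_b, sim_thresh=0.6, min_frames=3):
    """True if at least `min_frames` frames of video A closely match some frame of video B."""
    matches = 0
    for ha in hists_a:
        # histogram-intersection similarity of L1-normalised histograms is in [0, 1]
        if any(np.minimum(ha, hb).sum() >= sim_thresh for hb in hists_b):
            matches += 1
    return matches >= min_frames
```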

SLIDE 6

Dataset Cleaning

▪ Dataset creation process: manually went through a large collection of videos.

▪ Used the list of commonly appearing videos from last year to select a diverse set of videos.

▪ Removed videos with multiple, unrelated segments that are hard to describe.

▪ Removed any animated (or otherwise unsuitable) videos.

▪ Resulted in a much cleaner dataset.


SLIDE 7

Annotation Process

  • Each video was annotated by 5 assessors.
  • Annotation guidelines by NIST:
  • For each video, annotators were asked to combine 4 facets, if applicable:
  • Who is the video describing (objects, persons, animals, etc.)?
  • What are the objects and beings doing (actions, states, events, etc.)?
  • Where (locale, site, place, geographic region, etc.)?
  • When (time of day, season, etc.)?


SLIDE 8

Annotation Process – Observations

  • 1. Different assessors provide varying amounts of detail when describing videos. Some assessors wrote very long sentences to incorporate all information, while others gave a brief description.
  • 2. Assessors interpret scenes according to cultural or pop-cultural references that are not universally recognized.
  • 3. Specifying the time of day was often not possible for indoor videos.
  • 4. Given the removal of videos with multiple disjointed scenes, assessors were better able to provide descriptions.


SLIDE 9

Sample Captions of 5 Assessors


Video 1:
1. Orange car #1 on gray day drives around curve in road race test.
2. Orange car drives on wet road curve with observers.
3. An orange car with black roof, is driving around a curve on the road, while a person, wearing grey is observing it.
4. The orange car is driving on the road and going around a curve.
5. Advertisement for automobile mountain race showing the orange number one car navigating a curve on the mountain during the race in the evening; an individual is observing the vehicle dressed in jeans and cold weather coat.

Video 2:
1. A woman lets go of a brown ball attached to an overhead wire that comes back and hits her in the face.
2. In a room, a bowling ball on a string swings and hits a woman with a white shirt on in the face.
3. During a demonstration, a white woman with black hair wearing a white top holds a ball tethered to a line from above; the demonstrator tells her to let go of the ball, which returns on its tether and hits the woman in the face.
4. A man in blue holds a ball on a cord and lets it swing, and it comes back and hits a woman in white in the face.
5. A young girl, before an audience of students, allows a pendulum to swing from her face and all are surprised when it returns to strike her.

SLIDE 10

2018 Participants (12 teams finished)

Team                  Matching & Ranking (26 runs)   Description Generation (24 runs)
INF                   ✓                              ✓
KSLAB                 ✓                              ✓
KU_ISPL               ✓                              ✓
MMSys_CCMIP           ✓                              ✓
NTU_ROSE              ✓                              ✓
PicSOM                                               ✓
UPCer                                                ✓
UTS_CETC_D2DCRC_CAI   ✓                              ✓
EURECOM               ✓
ORAND                 ✓
RUCMM                 ✓
UCR_VCG               ✓


SLIDE 11

Sub-task 1: Matching & Ranking


Example descriptions: “Person reading newspaper outdoors at daytime”, “Three men running in the street at daytime”, “Person playing golf outdoors in the field”, “Two men looking at laptop in an office”.

  • Up to 4 runs per site were allowed in the Matching & Ranking subtask.
  • Mean inverted rank used for evaluation (an illustrative computation sketch follows after this list).
  • Five sets of descriptions used.
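The sketch referenced above is a minimal illustration of mean inverted rank, assuming each video has exactly one correct description per set and that the inverted rank is the reciprocal of the position at which that description appears. Function and argument names are illustrative, not NIST's scoring code.

```python
def mean_inverted_rank(ranked_lists, ground_truth):
    """ranked_lists: {video_id: [description_id, ...]} in ranked order.
    ground_truth: {video_id: correct_description_id}.
    Returns the average of 1/rank of the correct description (0 if absent)."""
    scores = []
    for video_id, ranking in ranked_lists.items():
        correct = ground_truth[video_id]
        if correct in ranking:
            scores.append(1.0 / (ranking.index(correct) + 1))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```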
SLIDE 12

Matching & Ranking Results – Set A

[Bar chart: mean inverted rank for Runs 1-4 of each system]

SLIDE 13

Matching & Ranking Results – Set B

[Bar chart: mean inverted rank for Runs 1-4 of each system]

SLIDE 14

Matching & Ranking Results – Set C

[Bar chart: mean inverted rank for Runs 1-4 of each system]

SLIDE 15

Matching & Ranking Results – Set D

[Bar chart: mean inverted rank for Runs 1-4 of each system]

SLIDE 16

Matching & Ranking Results – Set E

[Bar chart: mean inverted rank for Runs 1-4 of each system]

SLIDE 17

Systems Rankings for each Set

Ranking (best to worst) per description set:

Set A: RUCMM, INF, EURECOM, UCR_VCG, NTU_ROSE, KU_ISPL, ORAND, KSLAB, UTS_CETC_D2DCRC_CAI, MMSys_CCMIP
Set B: RUCMM, INF, EURECOM, UCR_VCG, KU_ISPL, ORAND, NTU_ROSE, UTS_CETC_D2DCRC_CAI, KSLAB, MMSys_CCMIP
Set C: RUCMM, INF, EURECOM, UCR_VCG, ORAND, KU_ISPL, NTU_ROSE, KSLAB, UTS_CETC_D2DCRC_CAI, MMSys_CCMIP
Set D: RUCMM, INF, EURECOM, UCR_VCG, KU_ISPL, ORAND, KSLAB, NTU_ROSE, UTS_CETC_D2DCRC_CAI, MMSys_CCMIP
Set E: RUCMM, INF, EURECOM, UCR_VCG, KU_ISPL, ORAND, KSLAB, UTS_CETC_D2DCRC_CAI, NTU_ROSE, MMSys_CCMIP

Not much difference between these runs.

SLIDE 18

Top 3 Results

Video IDs: #1874, #1681, #598

SLIDE 19

Bottom 3 Results

Video IDs: #1029, #958, #1825

SLIDE 20

Sub-task 2: Description Generation


Given a video, generate a textual description (e.g. “a dog is licking its nose”).

  • Up to 4 runs per site were allowed in the Description Generation subtask.
  • Metrics used for evaluation (a small scoring sketch follows below):
  • BLEU (BiLingual Evaluation Understudy)
  • METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  • CIDEr (Consensus-based Image Description Evaluation)
  • STS (Semantic Textual Similarity)
  • DA (Direct Assessment), a crowdsourced rating of captions using Amazon Mechanical Turk (AMT)
  • Run types:
  • V (Vine videos used for training)
  • N (only non-Vine videos used for training)

Who? What? Where? When?
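The scoring sketch referenced above: a hedged example of computing BLEU and METEOR for one generated caption against five assessor captions using NLTK. It only illustrates the metrics, not the official evaluation pipeline; the captions shown are made up and the smoothing choice is an assumption.

```python
# Hedged metric example using NLTK (pip install nltk; METEOR also needs the
# wordnet data, e.g. nltk.download("wordnet") once).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

references = [
    "an orange car drives around a curve on a wet road".split(),
    "orange car number one navigates a mountain curve".split(),
    # ... remaining assessor captions
]
candidate = "an orange car is driving around a curve".split()

bleu = sentence_bleu(references, candidate,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score(references, candidate)  # recent NLTK expects tokenized input
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```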

SLIDE 21

CIDEr Results

[Bar chart: CIDEr scores for Runs 1-4 of each system]

SLIDE 22

CIDEr-D Results

[Bar chart: CIDEr-D scores for Runs 1-4 of each system]

SLIDE 23

METEOR Results

[Bar chart: METEOR scores for Runs 1-4 of each system]

SLIDE 24

BLEU Results

[Bar chart: BLEU scores for Runs 1-4 of each system]

SLIDE 25

STS Results

[Bar chart: STS scores for Runs 1-4 of each system]

SLIDE 26

CIDEr Results – Run Type

[Bar chart: CIDEr scores by run type (V vs. N) for each system]

SLIDE 27

Direct Assessment (DA)

  • Measures:
  • RAW: average DA score [0..100] for each system (non-standardised), micro-averaged per caption and then averaged overall.
  • Z: average DA score per system after standardisation by each individual AMT worker's mean and standard deviation.
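A minimal sketch of the Z standardisation described above, assuming judgements arrive as (worker, system, raw score) tuples; it is illustrative only, not the official DA analysis code.

```python
# Hedged sketch: z-score each raw 0-100 judgement against the mean and std. dev.
# of the AMT worker who produced it, then average the z-scores per system.
from collections import defaultdict
import statistics

def z_scores_per_system(judgements):
    """judgements: list of (worker_id, system_id, raw_score) tuples."""
    by_worker = defaultdict(list)
    for worker, _, score in judgements:
        by_worker[worker].append(score)
    worker_stats = {w: (statistics.mean(s), statistics.pstdev(s) or 1.0)
                    for w, s in by_worker.items()}

    by_system = defaultdict(list)
    for worker, system, score in judgements:
        mean, std = worker_stats[worker]
        by_system[system].append((score - mean) / std)
    return {system: statistics.mean(z) for system, z in by_system.items()}
```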


SLIDE 28

DA results - Raw

[Bar chart: raw DA scores (0-100) per system]

SLIDE 29

DA results - Z

[Bar chart: standardised (Z) DA scores per system]

SLIDE 30

What DA Results Tell Us ..


1. Green squares indicate a significant “win” for the row over the column.
2. No system yet reaches human performance.
3. Humans B and E statistically perform better than Human D.
4. Amongst systems, INF outperforms the rest.

SLIDE 31

Systems Rankings for each Metric

Ranking (best to worst) per metric:

CIDEr:   INF, UTS_CETC_D2DCRC_CAI, NTU_ROSE, PicSOM, UPCer, KSLAB, KU_ISPL, MMSys_CCMIP
CIDEr-D: INF, UTS_CETC_D2DCRC_CAI, UPCer, KSLAB, PicSOM, NTU_ROSE, KU_ISPL, MMSys_CCMIP
METEOR:  INF, UTS_CETC_D2DCRC_CAI, UPCer, PicSOM, KU_ISPL, KSLAB, NTU_ROSE, MMSys_CCMIP
BLEU:    INF, UTS_CETC_D2DCRC_CAI, UPCer, PicSOM, KSLAB, KU_ISPL, NTU_ROSE, MMSys_CCMIP
STS:     INF, UTS_CETC_D2DCRC_CAI, PicSOM, NTU_ROSE, UPCer, KU_ISPL, KSLAB, MMSys_CCMIP
DA:      INF, UTS_CETC_D2DCRC_CAI, UPCer, PicSOM, KU_ISPL, KSLAB, NTU_ROSE, MMSys_CCMIP

SLIDE 32

Observations

  • The task continues to evolve: the number of annotations per video was standardized to 5 (compared to last year's task).
  • We tried to remove redundancy and create a diverse set with little or no ambiguity for the matching subtask.
  • Steps were taken to ensure that a cleaner dataset was used for the task.


SLIDE 33

Participants

  • Teams that will present today:
  • RUCMM
  • KU_ISPL
  • INF
  • Very high-level bullets on the approaches of the other teams follow.


SLIDE 34

UTS_CETC_D2DCRC

  • Widely used LSTM-based sequence-to-sequence model.
  • Focus on improving the generalization ability of the model.
  • Different training strategies used.
  • Several combinations of spatial and temporal features are ensembled together.
  • A simple model structure was preferred to help generalization ability.
  • Training data: MSVD, MSR-VTT 2016, TGIF, VTT 2016, VTT 2017.


SLIDE 35

PicSOM

Description Generation

  • LSTM recurrent neural networks used to generate descriptions from multi-modal features.
  • Visual features include image, video, and trajectory features.
  • Audio features also used.
  • Training datasets used: MS COCO, MSR-VTT, TGIF, MSVD.
  • Significant improvement from expanding the MSR-VTT training dataset with MS COCO.


SLIDE 36

KSLAB

  • The main idea is to extract representations from only key frames.
  • Key frames are detected for different types of events.
  • The method uses a CNN encoder and an LSTM decoder (a minimal sketch follows after this list).
  • Model trained using MS COCO dataset.
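The sketch referenced above is a hedged PyTorch outline of a generic CNN-encoder / LSTM-decoder captioner operating on a single key frame per video. It is not KSLAB's published model; the ResNet-50 backbone, layer sizes, and class name are assumptions.

```python
# Hedged sketch of a CNN-encoder / LSTM-decoder captioner (illustrative only).
# Requires torch and torchvision >= 0.13 for the "DEFAULT" weights string.
import torch
import torch.nn as nn
import torchvision.models as models

class KeyframeCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights="DEFAULT")
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier head
        self.visual_proj = nn.Linear(2048, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, keyframes, captions):
        # keyframes: (B, 3, 224, 224) one key frame per video; captions: (B, T) token ids
        with torch.no_grad():                                  # keep the CNN frozen here
            feats = self.encoder(keyframes).flatten(1)         # (B, 2048)
        visual = self.visual_proj(feats).unsqueeze(1)          # (B, 1, E)
        words = self.embed(captions)                           # (B, T, E)
        inputs = torch.cat([visual, words], dim=1)             # prepend the visual token
        hidden, _ = self.lstm(inputs)
        return self.out(hidden[:, 1:])                         # (B, T, vocab) word logits
```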


SLIDE 37

NTU_ROSE

  • Matching & Ranking
  • Trained 2 different models on the MS COCO dataset.
  • Image-based retrieval methods were found suitable.
  • Description Generation
  • Training datasets: MSR-VTT and MSVD.
  • CST-captioning (Consensus-based Sequence Training) used as baseline and adapted.
  • Both visual and audio features used.
  • The model trained on MSR-VTT performed better, probably because it generates longer sentences than one trained on MSVD.


SLIDE 38

MMSys

  • Matching & Ranking
  • Wikipedia and Pascal Sentence datasets used for training.
  • Used pre-trained cross-modal retrieval method for matching task.
  • Description Generation
  • MSR-VTT dataset used for training.
  • Extracted frames at 1 fps from each video and used a pre-trained Inception-ResNetV2 to extract features (a small extraction sketch follows after this list).

  • Used sen2vec for text features.
  • Model trained on frame and text features.
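The extraction sketch referenced above is a hedged example of sampling frames at 1 fps and encoding them with a pre-trained Inception-ResNetV2 (global average pooling gives 1536-dimensional features). OpenCV and tf.keras are assumed; this is not MMSys's actual pipeline, and all names are illustrative.

```python
# Hedged sketch: 1-fps frame sampling + Inception-ResNetV2 feature extraction.
import cv2
import numpy as np
import tensorflow as tf

extractor = tf.keras.applications.InceptionResNetV2(include_top=False, pooling="avg")

def one_fps_features(video_path):
    """Sample roughly one frame per second and return a (num_frames, 1536) matrix."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps)))            # frames to skip between samples
    feats, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            img = cv2.resize(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), (299, 299))
            img = tf.keras.applications.inception_resnet_v2.preprocess_input(
                img.astype(np.float32))
            feats.append(extractor(img[None])[0].numpy())
        frame_idx += 1
    cap.release()
    return np.stack(feats) if feats else np.zeros((0, 1536))
```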


SLIDE 39

EURECOM

Matching & Ranking

  • Improved the approach of the best team of 2017 (DL-61-86).
  • Feature vectors derived from frames extracted at 2 fps using the final layer of ResNet-152.
  • Contextualized features obtained and combined through a soft attention mechanism (an illustrative sketch follows after this list).
  • The resulting vector v is fed into two fully connected layers with ReLU activation.
  • Vector v is concatenated with the vector from the last layer of an RGB-I3D.
  • Instead of using ResNet-152 trained only on ImageNet, it is also fine-tuned on MS COCO.
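The sketch referenced above is a hedged PyTorch outline of soft-attention pooling over per-frame ResNet-152 features, two ReLU fully connected layers, and concatenation with an I3D clip vector. Dimensions, module names, and the exact attention form are assumptions, not EURECOM's code.

```python
# Hedged sketch of soft-attention pooling + two FC/ReLU layers + I3D concatenation.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, frame_dim=2048, i3d_dim=1024, hidden=1024, out_dim=512):
        super().__init__()
        self.attn = nn.Linear(frame_dim, 1)              # one attention score per frame
        self.fc = nn.Sequential(
            nn.Linear(frame_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.ReLU(),
        )

    def forward(self, frame_feats, i3d_feat):
        # frame_feats: (B, N, 2048) ResNet-152 features; i3d_feat: (B, 1024) clip vector
        weights = torch.softmax(self.attn(frame_feats), dim=1)   # (B, N, 1)
        v = (weights * frame_feats).sum(dim=1)                   # soft-attention pooling
        v = self.fc(v)                                           # two FC + ReLU layers
        return torch.cat([v, i3d_feat], dim=1)                   # final video representation
```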


SLIDE 40

UCR_VCG

Matching & Ranking

  • MS-COCO dataset used for training.
  • Keyframes (representative frames) extracted from videos.
  • A joint image-text embedding approach used to match videos to descriptions (a triplet-loss sketch follows after this list).
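The sketch referenced above is a hedged example of the hinge-based triplet ranking loss commonly used to train joint image-text embeddings for this kind of matching. It may differ from UCR_VCG's exact formulation; the margin, normalisation, and function name are assumptions.

```python
# Hedged sketch of a hinge-based triplet ranking loss for a joint embedding space.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) embeddings of matching image/caption pairs."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                 # cosine similarity matrix (B, B)
    positives = scores.diag().unsqueeze(1)         # similarities of the matching pairs
    # hinge: every non-matching caption/image should score at least `margin` lower
    cost_txt = (margin + scores - positives).clamp(min=0)
    cost_img = (margin + scores - positives.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()
```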


SLIDE 41

Conclusion

  • Good level of participation; the task will be renewed.
  • This year we had more annotations per video.
  • A cleaner dataset was created.
  • Direct Assessment was used for a second year running. This year we included multiple human responses; the results are interesting.
  • Lots of available training sets, with some overlap: MSR-VTT, MS-COCO, ImageNet, YouTube2Text, MSVD, TRECVID 2016-2017 VTT, TGIF.
  • Some teams used audio features in addition to visual features.


SLIDE 42

Discussion

  • Is there value in the caption ranking subtask? Should it be continued, especially with some teams participating only in this subtask?
  • Is the inclusion of run type (N or V) valuable?
  • Other possible run types? Video datasets only vs. video + image captioning training datasets.
  • Possibilities for a new dataset?
  • Are more teams planning to use audio features? What about motion from video?

  • What did individual teams learn?
