Image Data, Video Data and Both in VTT Model Training Video-to-Text - PowerPoint PPT Presentation

Image Data, Video Data and Both in VTT Model Training
Video-to-Text Task in TRECVID 2019
Jorma Laaksonen, PicSOM Team
Department of Computer Science, Aalto University School of Science, Espoo, Finland
November 13th, 2019

Contents
• Background
• Motivation
• Approach
• Results
• Analysis
• Conclusions

People
• Jorma Laaksonen
• Héctor Laria Mantecón
• (Danny Francis & Benoit Huet of EURECOM)

Lessons from TRECVID 2018
• We used only cross-entropy training; others did better with reinforcement learning
• Validation with VTT 2016 data was not able to select the best models
• Training with the COCO image dataset gave results as good as training with video datasets
• We could move from the old Theano-based code to new PyTorch-based code

Development of scores
[Figure: METEOR scores by submission, TRECVID 2018. Legend: PicSOM pre experiments, PicSOM submissions, PicSOM post experiments, other submissions. Labelled PicSOM runs, in the order shown: a4, a3, b4, a2, a1, s2, b2, b1, s1, s3, s4, b3.]

Work between TRECVID 2018 and 2019
• Implemented self-critical reinforcement learning
• Studied methods to combine image and video datasets and features
• Also wanted to study the optimal combination of different video datasets

TGIF and COCO datasets
Statistics:
• TGIF: 125,713 videos with 125,713 captions
• COCO: 123,287 images with 616,767 captions
Which approach would be the best:
• 125,713 video feature vectors and 125,713 captions
• 123,287 image feature vectors and 616,767 captions
• 249,000 image feature vectors and 742,480 captions
• 249,000 image and video feature vectors and 742,480 captions
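The combined totals in the last two options follow directly from the TGIF and COCO statistics. A minimal check in Python; the variable names are illustrative and not taken from the original work:

```python
# Dataset sizes as stated on the slide.
tgif_videos, tgif_captions = 125_713, 125_713
coco_images, coco_captions = 123_287, 616_767

# Pooling both datasets adds up the items and the captions.
combined_items = tgif_videos + coco_images
combined_captions = tgif_captions + coco_captions

print(combined_items)     # 249000
print(combined_captions)  # 742480
```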

Videos to image features and vice versa
• Image features can be extracted from videos in multiple ways, e.g.
  • use only the middle frame
  • max or mean pool features of multiple or all frames
• Genuine video features such as I3D cannot be extracted from still images
  • we used fake video features for COCO images
  • the average of all video features in TGIF was assigned to all COCO images
• The final feature vector was a concatenation of
  • TGIF videos: I3D video feature, ResNet image feature of middle frame
  • COCO images: constant average I3D feature, ResNet image feature
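The fake-feature scheme described above can be sketched as follows. This is a minimal illustration in plain Python, not the PicSOM implementation: the function names, the toy two- and three-dimensional vectors, and the concatenation order (I3D part first) are assumptions made for the example.

```python
from statistics import fmean

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    return [fmean(column) for column in zip(*vectors)]

def video_feature(i3d, resnet_middle_frame):
    """TGIF video: genuine I3D feature + ResNet feature of the middle frame."""
    return list(i3d) + list(resnet_middle_frame)

def image_feature(resnet, fake_i3d):
    """COCO image: constant average I3D feature + ResNet feature of the image."""
    return list(fake_i3d) + list(resnet)

# Toy stand-ins for real features (hypothetical values and sizes).
tgif_i3d = [[1.0, 3.0], [3.0, 5.0]]
fake_i3d = mean_vector(tgif_i3d)  # the same vector is reused for every image

vid = video_feature(tgif_i3d[0], [0.5, 0.5, 0.5])
img = image_feature([0.1, 0.2, 0.3], fake_i3d)

# Both kinds of items end up with the same dimensionality,
# so one caption model can be trained on the pooled data.
print(len(vid), len(img))
```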

Methodology
• COCO image and TGIF video datasets in training
• model validation and early stopping with VTT 2018 dataset
• ResNet-152 CNN image and I3D video features
• fake I3D video features for COCO images
• “DeepCaption” LSTM language model decoder in PyTorch
• cross-entropy loss training in the beginning
• self-critical reinforcement learning in the end
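For readers unfamiliar with self-critical training, the general form of the loss can be sketched as below. This is the standard self-critical sequence training idea (a sampled caption is rewarded relative to the greedily decoded caption), written in plain Python for a single caption; the function name, the toy probabilities and the reward values are assumptions for illustration, and the slides do not state which caption metric was used as the reward.

```python
import math

def self_critical_loss(sample_token_probs, sample_reward, greedy_reward):
    """REINFORCE-style loss using the greedy caption's reward as baseline.

    sample_token_probs: model probabilities of the tokens in the sampled caption
    sample_reward:      caption metric score of the sampled caption
    greedy_reward:      caption metric score of the greedily decoded caption
    """
    advantage = sample_reward - greedy_reward
    log_prob = sum(math.log(p) for p in sample_token_probs)
    return -advantage * log_prob

# Sampled caption scores higher than the greedy one: minimising the loss
# increases the log-probability of the sampled caption.
loss_better = self_critical_loss([0.5, 0.5], sample_reward=0.8, greedy_reward=0.6)

# Sampled caption scores lower: minimising the loss decreases its log-probability.
loss_worse = self_critical_loss([0.5, 0.5], sample_reward=0.4, greedy_reward=0.6)

print(loss_better, loss_worse)
```

In a real setup the log-probabilities would come from the LSTM decoder and gradients would flow only through them, with the rewards treated as constants.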

Submissions
We submitted four runs:
• PicSOM.1-MeMAD.PRIMARY: uses ResNet and I3D features for initialising the LSTM generator, and is trained on MS COCO + TGIF using self-critical loss,
• PicSOM.2-MeMAD: uses I3D features as initialisation, and is trained on TGIF using self-critical loss,
• PicSOM.3: uses ResNet features as initialisation, and is trained on MS COCO + TGIF using self-critical loss,
• PicSOM.4: is the same as PicSOM.1-MeMAD.PRIMARY except that the loss function used is cross-entropy.

Results
setup                              2018                            2019
id       t  loss  feat    data     METEOR  CIDEr   CIDErD  BLEU    METEOR  CIDEr   CIDErD  BLEU    STS
p-18-s2  I  ce    rn+fr   C+M      0.1541  0.1657  0.0476  0.0091  0.1773  0.1858  0.0722  0.0207  –
p-18-a3  I  ce    rn      C+T      0.1776  0.1948  0.0700  0.0197  0.1993  0.2174  0.1004  0.0288  –
p-19-s1  B  sc    rn+i3d  C+T      0.2055  0.3025  0.1157  0.0294  0.2285  0.3277  0.1615  0.0385  0.4168
p-19-s2  V  sc    i3d     T        0.1958  0.2718  0.0949  0.0348  0.2139  0.2773  0.1245  0.0379  0.4169
p-19-s3  I  sc    rn      C+T      0.2007  0.2777  0.1074  0.0301  0.2254  0.3130  0.1569  0.0345  0.4282
p-19-s4  B  ce    rn+i3d  C+T      0.1850  0.2190  0.0822  0.0213  0.2049  0.2348  0.1147  0.0319  0.4057
• p-18-s2 is our best submission in TRECVID 2018
• p-18-a3 is our best TRECVID 2018 post-conference result
• p-19-s* are our TRECVID 2019 submissions

Comparison: METEOR 2018
[Figure: METEOR scores by submission, TRECVID 2018. Legend: PicSOM pre experiments, PicSOM submissions, PicSOM post experiments, other submissions. Labelled PicSOM runs, in the order shown: a4, a3, b4, a2, a1, s2, b2, b1, s1, s3, s4, b3.]

Comparison: METEOR
[Figure: METEOR scores by submission. Legend: PicSOM 2018 models, PicSOM submissions, other submissions. Labelled PicSOM runs, in the order shown: s1, s3, s2, s4, 18-a3, 18-s2.]

Comparison: CIDEr
[Figure: CIDEr scores by submission. Legend: PicSOM 2018 models, PicSOM submissions, other submissions. Labelled PicSOM runs, in the order shown: s1, s3, s2, s4, 18-a3, 18-s2.]

Comparison: CIDEr-D
[Figure: CIDErD scores by submission. Legend: PicSOM 2018 models, PicSOM submissions, other submissions. Labelled PicSOM runs, in the order shown: s1, s3, s2, s4, 18-a3, 18-s2.]

Comparison: BLEU-4
[Figure: BLEU scores by submission. Legend: PicSOM 2018 models, PicSOM submissions, other submissions. Labelled PicSOM runs, in the order shown: s1, s2, s3, s4, 18-a3, 18-s2.]

Comparison: STS
[Figure: STS scores by submission. Legend: PicSOM submissions, other submissions. Labelled PicSOM runs, in the order shown: s3, s2, s1, s4.]

Comparison
• s4 run is always the worst: reinforcement learning is beneficial
• s1 run is almost always the best: combining image and video features is good
• s3 run beats s2 by 4–1 over the five metrics: COCO image features are better than TGIF video features
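These three statements can be checked against the 2019 columns of the Results slide. A small sketch in Python; the dictionary layout and helper name are illustrative only:

```python
# 2019 test scores copied from the Results slide,
# in the order METEOR, CIDEr, CIDErD, BLEU, STS.
scores_2019 = {
    "s1": (0.2285, 0.3277, 0.1615, 0.0385, 0.4168),
    "s2": (0.2139, 0.2773, 0.1245, 0.0379, 0.4169),
    "s3": (0.2254, 0.3130, 0.1569, 0.0345, 0.4282),
    "s4": (0.2049, 0.2348, 0.1147, 0.0319, 0.4057),
}

def wins(a, b):
    """Number of metrics on which run a scores strictly higher than run b."""
    return sum(x > y for x, y in zip(scores_2019[a], scores_2019[b]))

# s3 against s2: four metrics to one.
s3_vs_s2 = (wins("s3", "s2"), wins("s2", "s3"))

# s4 never beats any other run on any metric.
s4_always_worst = all(wins("s4", other) == 0 for other in ("s1", "s2", "s3"))

# Number of metrics on which s1 is the single best run.
s1_best_count = sum(
    all(scores_2019["s1"][m] > scores_2019[other][m] for other in ("s2", "s3", "s4"))
    for m in range(5)
)

print(s3_vs_s2, s4_always_worst, s1_best_count)
```

The only metric on which s1 is not the best run is STS, where s3 scores highest.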

Run types
In TRECVID VTT 2019 all submissions had to be tagged with their run type:
• Run type 'I': Only image captioning datasets were used for training
• Run type 'V': Only video captioning datasets were used for training
• Run type 'B': Both image and video captioning datasets were used for training

Run types per team
team image video both
EURECOM 1
FDU 2
IMFD_IMPRESEE 3
Insight_DCU 1
KU_ISPL 3
KsLab 4
PicSOM 1 1 2
RUCMM 4
RUC_AIM3 4
UTS_ISA 4
10 teams 1 26 3

Training datasets used per team
team COCO TGIF MSR-VTT MSVD VTT VATEX
EURECOM X X X 0+3
FDU X 0+1
IMFD_IMPRESEE X 0+1
Insight_DCU X 0+1
KsLab X X 0+2
PicSOM X X 1+1
RUCMM X X X 0+3
RUC_AIM3 X X X X 0+4
UTS_ISA X X X X 0+4
9 teams 1 8 5 3 3 1 0+0

Statistics of the training datasets
dataset  items        captions
COCO     123,287 img  616,767
TGIF     125,713 vid  125,713
MSR-VTT    6,513 vid  130,260
MSVD       1,969 vid   80,800
VTT        3,753 vid    9,020
VATEX     41,300 vid  826,000
LSMDC    108,536 vid  108,536
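The datasets differ a lot in how many captions they provide per item, which matters when they are pooled for training. A short Python sketch that derives the ratio from the figures above; the dictionary layout is illustrative:

```python
# (items, captions) per dataset, as listed in the statistics.
datasets = {
    "COCO": (123_287, 616_767),
    "TGIF": (125_713, 125_713),
    "MSR-VTT": (6_513, 130_260),
    "MSVD": (1_969, 80_800),
    "VTT": (3_753, 9_020),
    "VATEX": (41_300, 826_000),
    "LSMDC": (108_536, 108_536),
}

# Average number of captions available for each image or video.
captions_per_item = {
    name: captions / items for name, (items, captions) in datasets.items()
}

for name, ratio in captions_per_item.items():
    print(f"{name:8s} {ratio:6.2f}")
```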

Video features used per team
team I3D C3D CNN+pool CNN+seq audio
EURECOM X
FDU X
IMFD_IMPRESEE X X
Insight_DCU X
KsLab X
PicSOM X
RUCMM X X
RUC_AIM3 X X X
UTS_ISA X X
9 teams 5 3 2 3 1

Conclusions
• In the PicSOM experiments, also using the COCO dataset proved to be beneficial
• Naïve use of fake video features for images was better than not using images at all
• This conclusion might be different if
  • our overall result level were higher
  • we used more video data than just TGIF
  • we used better video features than I3D
  • we used pooling or RNN-based aggregation of framewise features
  • our implementation of self-critical training were better
• Model performance was very stable from validation with 2018 data to 2019 test data
• No other team used the COCO dataset anymore
• Our results were clearly behind those of the best teams
• Specifying the run types in the way it was done now might be discontinued