SLIDE 1

Team RUC AI·M3 at Video Pentathlon Challenge 2020

Shizhe Chen, Yida Zhao, Qin Jin

Renmin University of China

SLIDE 2

Video Pentathlon Challenge

  • Task
    • Text-to-video cross-modal retrieval
    • Using the provided multimodal features
  • Evaluation
    • A pentathlon of five video-text benchmarks
    • MSRVTT, MSVD, DiDeMo, ActivityNet (ANet), YouCook2 (YC2)
  • Metric
    • Geometric mean of Recall@K (K = {1, 5, 10})
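The challenge metric can be made concrete in a few lines of NumPy. This is a sketch, not the official evaluation script: `ranks` is assumed to hold, for each text query, the 0-based rank of its ground-truth video in the retrieved list.

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose ground-truth video is ranked in the top k."""
    return float(np.mean(np.asarray(ranks) < k))

def geometric_mean_recall(ranks, ks=(1, 5, 10)):
    """Challenge metric: geometric mean of Recall@1, Recall@5, Recall@10."""
    recalls = [recall_at_k(ranks, k) for k in ks]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# ranks[i] = 0-based rank of the correct video for query i (toy values)
ranks = [0, 3, 7, 1, 20]
print(geometric_mean_recall(ranks))
```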

SLIDE 3

Our Contributions

  • Hierarchical Video-Text Matching
    • Hierarchical graph reasoning model
  • Enhanced Inference Methods
    • Query expansion
    • Hubness mitigation
  • Knowledge Transfer from Additional Datasets
    • Multi-task training

SLIDE 4

Our Contributions

  • Hierarchical Video-Text Matching
    • Hierarchical graph reasoning model
  • Enhanced Inference Methods
    • Query expansion
    • Hubness mitigation
  • Knowledge Transfer from Additional Datasets
    • Multi-task training

SLIDE 5

Hierarchical Video-Text Matching

  • Simple embeddings are insufficient to represent complicated video and text details
  • Hierarchical Graph Reasoning model
    • Multi-level cross-modal matching: events, actions, entities
    • Hierarchical textual encoding
    • Hierarchical video encoding


Chen, Shizhe, et al. "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning." CVPR, 2020.
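A minimal sketch of the multi-level matching idea: the video and the text are each encoded at the event, action, and entity levels, and the final score combines per-level similarities. The real HGR model learns these encoders with attention-based graph reasoning; the random embeddings, cosine similarity, and equal level weights below are stand-ins for illustration only.

```python
import numpy as np

LEVELS = ("event", "actions", "entities")

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_similarity(video_emb, text_emb, weights=(1/3, 1/3, 1/3)):
    """Weighted sum of per-level cosine similarities (illustrative fusion)."""
    return sum(w * cosine(video_emb[l], text_emb[l])
               for w, l in zip(weights, LEVELS))

# toy embeddings standing in for the learned level encoders
rng = np.random.default_rng(0)
video = {l: rng.normal(size=8) for l in LEVELS}
text = {l: rng.normal(size=8) for l in LEVELS}
print(hierarchical_similarity(video, text))
```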

SLIDE 6

Hierarchical Video-Text Matching

  • Experimental results
    • The HGR model achieves the best performance on all datasets
    • Especially on DiDeMo and ANet, whose descriptions are longer


  Dataset               MSRVTT  MSVD   DiDeMo  ANet   YC2
  Absolute gain         +1.25   +0.77  +4.18   +2.98  +1.84
  Avg. sentence length  9       7      33      54     9

SLIDE 7

Our Contributions

  • Hierarchical Video-Text Matching
    • Hierarchical graph reasoning model
  • Enhanced Inference Methods
    • Query expansion
    • Hubness mitigation
  • Knowledge Transfer from Additional Datasets
    • Multi-task training

SLIDE 8

Enhanced Inference Methods

  • Query Expansion
    • Reformulate a given query and ensemble the results from all expanded queries
    • Use the multiple query texts per video in the MSRVTT and MSVD datasets


  • Experimental results
    • Improves retrieval performance with ground-truth expanded queries
    • Future work: other techniques such as automatic paraphrasing
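The ensembling step above can be sketched as fusing the similarity scores the retrieval model assigns under each reformulation of the same query. The exact fusion rule used in the submission is not given on the slide; simple averaging is one plausible choice, and the scores below are toy values.

```python
import numpy as np

def expanded_query_scores(sims_per_query):
    """Fuse retrieval scores from several reformulations of one query
    by averaging the query-to-video similarity vectors.
    sims_per_query: array of shape (n_expansions, n_videos)."""
    return np.mean(np.asarray(sims_per_query), axis=0)

# three hypothetical rewordings of one query, scored against 4 videos
sims = [[0.9, 0.2, 0.1, 0.4],
        [0.7, 0.3, 0.2, 0.5],
        [0.8, 0.1, 0.3, 0.6]]
fused = expanded_query_scores(sims)
print(fused.argmax())  # → 0: video 0 is ranked first after fusion
```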
SLIDE 9

Enhanced Inference Methods

  • Hubness Mitigation
    • Some points have a high probability of being nearest neighbors of many other points
    • Inverted softmax


  • Experimental results
    • Improves retrieval performance
    • Future work: mitigate the hubness problem during training

Smith, Samuel L., et al. “Offline bilingual word vectors, orthogonal transformations and the inverted softmax.” ICLR, 2017.
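Following the cited paper, the inverted softmax renormalizes each video's similarity scores over all queries, so a "hub" video that is close to many queries has its scores divided down. A sketch; the temperature `beta` and the example scores are illustrative, not the submission's settings.

```python
import numpy as np

def inverted_softmax(sim, beta=10.0):
    """Inverted softmax rescoring (Smith et al., 2017).
    sim: query-by-video similarity matrix.  Each video's column is
    softmax-normalized over all queries, penalizing hub videos.
    beta is an inverse-temperature hyperparameter."""
    e = np.exp(beta * np.asarray(sim))
    return e / e.sum(axis=0, keepdims=True)

# rows = queries, columns = videos; video 0 is a hub (close to both queries)
sim = np.array([[0.9, 0.3],
                [0.9, 0.8]])
rescored = inverted_softmax(sim)
# for query 1, raw similarity ranks the hub video 0 first,
# but after rescoring video 1 is ranked first
```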

SLIDE 10

Our Contributions

  • Hierarchical Video-Text Matching
    • Hierarchical graph reasoning model
  • Enhanced Inference Methods
    • Query expansion
    • Hubness mitigation
  • Knowledge Transfer from Additional Datasets
    • Multi-task balanced training

SLIDE 11

Knowledge Transfer

  • Training with all datasets together does not perform well
    • Different dataset scales and cross-domain discrepancies
  • Cross-dataset performance


  Dataset      MSRVTT   MSVD    DiDeMo  ANet   YC2
  # trn pairs  117,220  43,892  7,552   8,007  7,745

SLIDE 12

Knowledge Transfer

  • Multi-task balanced training
    • Combine the target dataset and MSRVTT in training
    • Balance the training examples from the different datasets


  • Experimental results
    • Employing additional datasets is beneficial
    • Future work: more effective transfer learning approaches
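A sketch of the balanced-training idea: training batches alternate between the small target dataset and the large auxiliary dataset (MSRVTT in the talk) so the target set is not drowned out. The exact balancing scheme is not detailed on the slide; equal-probability source sampling is one simple instantiation, and the item names are made up (sizes mirror the training-pair counts quoted earlier).

```python
import random

def balanced_batches(target, auxiliary, batch_size=4, steps=6, seed=0):
    """Yield training batches, picking the target or auxiliary dataset
    with equal probability at each step regardless of dataset size."""
    rng = random.Random(seed)
    for _ in range(steps):
        source = target if rng.random() < 0.5 else auxiliary
        yield [rng.choice(source) for _ in range(batch_size)]

target = [f"didemo_{i}" for i in range(7_552)]       # ~7.5k pairs
auxiliary = [f"msrvtt_{i}" for i in range(117_220)]  # ~117k pairs
for batch in balanced_batches(target, auxiliary):
    print(batch[0])
```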
SLIDE 13

Testing Submission

  • Pipeline


  HGR model with multi-task balanced training
    → Average ensembling (3-5 models)
    → Query expansion (optional)
    → Inference with hubness mitigation

  • Experimental results
    • Second place in the challenge
SLIDE 14

Take Home Message

  • The multi-level matching model (HGR) is more effective than global/local matching models for text-video retrieval
  • The hubness problem needs to be addressed in both training and inference
  • Knowledge transfer is promising


Contact email: cszhe1@ruc.edu.cn