Team RUC AI·M3 at Video Pentathlon Challenge 2020 - PowerPoint PPT Presentation


  1. Team RUC AI·M3 at Video Pentathlon Challenge 2020 • Shizhe Chen, Yida Zhao, Qin Jin • Renmin University of China

  2. Video Pentathlon Challenge • Task: text-to-video cross-modal retrieval using provided multimodal features • Evaluation: a pentathlon of five video-text benchmarks: MSRVTT, MSVD, DiDeMo, ActivityNet (ANet), YouCook2 (YC2) • Metric: geometric mean of Recall@K (K = {1, 5, 10})
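
The metric on this slide can be illustrated with a minimal sketch. It assumes a query-by-video similarity matrix and one ground-truth video per query; the function names and the tie-handling rule are illustrative assumptions, not the challenge's official evaluation code.

```python
import math

def recall_at_k(sim, gt, k):
    """Fraction of queries whose ground-truth video is ranked in the top k.

    sim[q][v] is the similarity of query q to video v (higher is better);
    gt[q] is the index of the correct video for query q (assumed unique).
    """
    hits = 0
    for q, row in enumerate(sim):
        # Rank = number of videos scoring strictly higher than the ground truth.
        rank = sum(1 for s in row if s > row[gt[q]])
        if rank < k:
            hits += 1
    return hits / len(sim)

def geometric_mean_recall(sim, gt, ks=(1, 5, 10)):
    """Geometric mean of Recall@K over the given cut-offs."""
    recalls = [recall_at_k(sim, gt, k) for k in ks]
    return math.prod(recalls) ** (1.0 / len(recalls))
```

Because the mean is geometric, a low score at any single cut-off pulls the overall score down more than an arithmetic mean would.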

  3. Our Contributions • Hierarchical Video-Text Matching: hierarchical graph reasoning model • Enhanced Inference Methods: query expansion, hubness mitigation • Knowledge Transfer from Additional Datasets: multi-task training

  5. Hierarchical Video-Text Matching • Simple embeddings are insufficient to represent complicated video and text details • Hierarchical Graph Reasoning (HGR) model: multi-level cross-modal matching, from global to local (events, actions, entities) • Hierarchical textual encoding • Hierarchical video encoding • Reference: Chen, Shizhe, et al. "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning." CVPR, 2020.
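
As a rough illustration of multi-level matching, the sketch below scores a text-video pair at the event, action, and entity levels and combines the three scores. It is a deliberate simplification: it assumes one embedding vector per level and a plain average, neither of which is stated on the slide. The HGR paper cited above describes the actual model.

```python
import math

LEVELS = ("event", "action", "entity")

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def multilevel_similarity(text_emb, video_emb):
    """Average the per-level cosine similarities (assumed aggregation rule).

    text_emb and video_emb map each level name to one embedding vector.
    """
    return sum(cosine(text_emb[l], video_emb[l]) for l in LEVELS) / len(LEVELS)
```

A pair that matches at the event level but disagrees at the action level is penalized here, which a single global embedding cannot express.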

  6. Hierarchical Video-Text Matching • Experimental results: the HGR model achieves the best performance on all datasets, especially on DiDeMo and ANet, whose descriptions are long • Absolute gains: MSRVTT +1.25, MSVD +0.77, DiDeMo +4.18, ANet +2.98, YC2 +1.84 • Average sentence length: MSRVTT 9, MSVD 7, DiDeMo 33, ANet 54, YC2 9

  8. Enhanced Inference Methods • Query Expansion: reformulate a given query and ensemble results from all expanded queries • Use multiple query texts for a video in the MSRVTT and MSVD datasets • Experimental results: improves retrieval performance with groundtruth expanded queries • Future work: other techniques such as automatic paraphrasing
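
A minimal sketch of ensembling over expanded queries follows. The slide says only that results are ensembled, so the score averaging and the helper names here are assumptions.

```python
def expand_and_ensemble(sims_per_query):
    """Average video scores across all expanded versions of one query.

    sims_per_query[i][v] is the similarity of the i-th query text to video v.
    Averaging is an assumed ensembling rule.
    """
    n = len(sims_per_query)
    num_videos = len(sims_per_query[0])
    return [sum(row[v] for row in sims_per_query) / n for v in range(num_videos)]

def rank_videos(scores):
    """Return video indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)

# Two texts describing the same target video (video 0).
sims = [[0.2, 0.6, 0.1],   # original query alone prefers video 1
        [0.9, 0.4, 0.3]]   # expanded query prefers video 0
ranking = rank_videos(expand_and_ensemble(sims))
```

In this toy case the original query alone ranks video 1 first, while the averaged scores rank video 0 first.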

  9. Enhanced Inference Methods • Hubness Mitigation: some points have a high probability of being nearest neighbors of many other points • Inverted Softmax • Experimental results: improves retrieval performance with groundtruth expanded queries • Future work: mitigate the hubness problem during training • Reference: Smith, Samuel L., et al. "Offline bilingual word vectors, orthogonal transformations and the inverted softmax." ICLR, 2017.
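
The slide's formula is not preserved in this text, so the following is a sketch of the inverted softmax as described by Smith et al. (2017), adapted to retrieval. Each score is normalized over the queries for a given video rather than over the videos for a given query, which down-weights videos that are close to every query. The temperature `beta` and the use of the test queries as the normalization set are assumptions.

```python
import math

def inverted_softmax(sim, beta=10.0):
    """Rescore sim[q][v] as exp(beta*sim[q][v]) / sum over q' of exp(beta*sim[q'][v]).

    The normalization runs over queries (matrix columns), not over videos.
    """
    num_q, num_v = len(sim), len(sim[0])
    rescored = [[0.0] * num_v for _ in range(num_q)]
    for v in range(num_v):
        col_max = max(sim[q][v] for q in range(num_q))  # subtract max for stability
        exps = [math.exp(beta * (sim[q][v] - col_max)) for q in range(num_q)]
        total = sum(exps)
        for q in range(num_q):
            rescored[q][v] = exps[q] / total
    return rescored

# Video 0 acts as a hub: it is the top raw match for both queries.
sim = [[0.9, 0.5],
       [0.8, 0.7]]
rescored = inverted_softmax(sim)
```

After rescoring, query 1 prefers video 1, because video 0's high score for query 1 is discounted by its even higher score for query 0.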

  10. Our Contributions • Hierarchical Video-Text Matching: hierarchical graph reasoning model • Enhanced Inference Methods: query expansion, hubness mitigation • Knowledge Transfer from Additional Datasets: multi-task balanced training

  11. Knowledge Transfer • Training with all datasets does not perform well: different dataset scales and cross-domain discrepancies • Number of training pairs: MSRVTT 117,220; MSVD 43,892; DiDeMo 7,552; ANet 8,007; YC2 7,745 • Cross-dataset performance

  12. Knowledge Transfer • Multi-task balanced training: combine the target dataset and MSRVTT in training, and balance the training examples from the different datasets • Experimental results: beneficial to employ additional datasets • Future work: more effective transfer learning approaches
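
One way to balance examples from two datasets of very different sizes is to fix each dataset's share of every batch. The sketch below assumes a 50/50 split and sampling with replacement. The slide does not specify the balancing scheme, so both choices and the function name are hypothetical.

```python
import random

def balanced_batches(target_pairs, source_pairs, batch_size, num_batches, seed=0):
    """Yield batches with half the examples from each dataset, whatever their sizes.

    target_pairs: training pairs from the target dataset (e.g. a small one).
    source_pairs: training pairs from the additional dataset (e.g. MSRVTT).
    """
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        # Sample with replacement so the small dataset is never exhausted.
        batch = [("target", rng.choice(target_pairs)) for _ in range(half)]
        batch += [("source", rng.choice(source_pairs)) for _ in range(batch_size - half)]
        rng.shuffle(batch)
        yield batch

batches = list(balanced_batches(list(range(10)), list(range(1000)), 8, 5))
```

Without such balancing, uniform sampling over the pooled data would draw most examples from the larger dataset.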

  13. Testing Submissions • Pipeline: HGR model with multi-task balanced training → average ensembling (3-5 models) → query expansion (optional) → hubness mitigation at inference • Experimental results: second place in the challenge
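
The average-ensembling step of the pipeline can be sketched as an element-wise mean of the similarity matrices produced by several models. The function name is an assumption, and any score normalization applied before averaging is not described on the slide and is omitted here.

```python
def average_ensemble(sim_matrices):
    """Element-wise mean of same-shaped query-by-video similarity matrices."""
    n = len(sim_matrices)
    rows, cols = len(sim_matrices[0]), len(sim_matrices[0][0])
    return [
        [sum(m[q][v] for m in sim_matrices) / n for v in range(cols)]
        for q in range(rows)
    ]

# Three hypothetical models scoring two queries against two videos.
model_a = [[1.0, 0.0], [0.0, 1.0]]
model_b = [[0.0, 1.0], [1.0, 0.0]]
model_c = [[1.0, 0.0], [0.0, 1.0]]
ensembled = average_ensemble([model_a, model_b, model_c])
```

The ensembled matrix can then be passed on to the later pipeline stages, such as query expansion and hubness mitigation.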

  14. Take Home Message • The multi-level matching model (HGR) is more effective than global/local matching models for text-video retrieval • The hubness problem needs to be addressed in both training and inference • Knowledge transfer is promising • Contact email: cszhe1@ruc.edu.cn
