
Fine-grained Video-Text Retrieval - PowerPoint PPT Presentation

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. Shizhe Chen¹, Yida Zhao¹, Qin Jin¹, Qi Wu². ¹Renmin University of China, ²University of Adelaide


  1. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning • Shizhe Chen¹, Yida Zhao¹, Qin Jin¹, Qi Wu² • ¹Renmin University of China, ²University of Adelaide

  2. Video-Text Cross-modal Retrieval • Task: using sentences to retrieve videos • Sentences contain richer and more structured details than keywords

  3. Text-Video Cross-modal Retrieval: Motivation • Understanding fine-grained semantics in the query sentence • Hierarchical sentence structures: event, actions, entities • Action-action relationships • Action-entity relationships • Fine-grained local components and how they compose into the event

  4. Text-Video Cross-modal Retrieval: Motivation • Understanding fine-grained semantics in the query sentence • Hierarchical sentence structures: event, actions, entities • Action-action relationships • Action-entity relationships • Fine-grained local components and how they compose into the event • Limitations of previous works • Global matching (one vector): hard to capture fine-grained details • Local matching (word level): cannot express complex relationships among words

  5. Text-Video Cross-modal Retrieval: Model (The Proposed Method) • Hierarchical Graph Reasoning Model (HGR) • Hierarchical Textual Encoding • Hierarchical Video Encoding • Multi-level Video-Text Matching

  6. Text-Video Cross-modal Retrieval: Model (Hierarchical Textual Encoding) • Semantic role graph: event, action and entity nodes • Node initialization: contextual word embeddings, max pooling per node • Attention-based graph reasoning: capture interactive context via an attentive relational GCN; factorize the relational matrix
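
The graph reasoning step on slide 6 can be illustrated with a minimal sketch. It is not the authors' implementation (the released code is the reference for that). It assumes embeddings are plain Python lists, uses per-role elementwise scaling as a stand-in for the factorized relational matrix, and omits all learned projection matrices. The function names `role_aware_init` and `attention_step`, the role labels, and the toy vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def role_aware_init(node_vecs, node_roles, role_vecs):
    """Scale each node embedding elementwise by the vector of its semantic role.

    Assumption: this elementwise form stands in for the factorized
    relational matrix mentioned on the slide.
    """
    return [[x * r for x, r in zip(vec, role_vecs[role])]
            for vec, role in zip(node_vecs, node_roles)]


def attention_step(nodes, neighbors):
    """One attention-weighted message-passing step with a residual update."""
    dim = len(nodes[0])
    updated = []
    for i, vec in enumerate(nodes):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            # Isolated nodes keep their embedding unchanged.
            updated.append(list(vec))
            continue
        # Scaled dot-product attention over the node's neighbours.
        weights = softmax([dot(vec, nodes[j]) / math.sqrt(dim) for j in nbrs])
        context = [sum(w * nodes[j][d] for w, j in zip(weights, nbrs))
                   for d in range(dim)]
        updated.append([a + c for a, c in zip(vec, context)])
    return updated


# Toy graph for "a dog hits a man's hands":
# node 0 = event, node 1 = action ("hits"), nodes 2 and 3 = entities.
nodes = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
roles = ["EVENT", "ACTION", "ARG0", "ARG1"]
role_vecs = {"EVENT": [1.0, 1.0], "ACTION": [1.0, 0.5],
             "ARG0": [0.5, 1.0], "ARG1": [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
hidden = attention_step(role_aware_init(nodes, roles, role_vecs), neighbors)
```

Stacking several such steps would let entity information reach the event node through the action node, which is the kind of interactive context the slide refers to.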

  7. Hierarchical Video Encoding • Videos contain multiple aspects: objects, actions, events • Challenging to parse directly the way texts are parsed, since this requires object detection, tracking, action segmentation, etc. • Different weights for each level • Use different levels of text as guidance to learn diverse video representations

  8. Multi-level Cross-modal Matching • Multi-level fusion • Event level: global matching with cosine similarity • Action & entity levels: local matching with weakly supervised attentive alignment • Training objective: contrastive ranking loss
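
The matching components on slide 8 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the local matching uses a plain softmax over clipped cosine similarities, the ranking loss uses the hardest in-batch negative in both retrieval directions, and the margin of 0.2 and temperature of 4.0 are assumed values. The names `local_similarity` and `ranking_loss` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def local_similarity(text_nodes, frames, temperature=4.0):
    """Attentive alignment: each text node attends over the video frames.

    Assumption: attention weights are a softmax over clipped cosine
    similarities; the node score is the attention-weighted similarity.
    """
    if not text_nodes:
        return 0.0
    scores = []
    for node in text_nodes:
        sims = [cosine(node, frame) for frame in frames]
        exps = [math.exp(temperature * max(s, 0.0)) for s in sims]
        total = sum(exps)
        scores.append(sum(e / total * s for e, s in zip(exps, sims)))
    return sum(scores) / len(scores)


def ranking_loss(sim, margin=0.2):
    """Max-margin ranking loss with the hardest in-batch negatives.

    sim[i][j] is the similarity of video i and sentence j; matched
    pairs sit on the diagonal.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        neg_texts = [sim[i][j] for j in range(n) if j != i]
        neg_videos = [sim[j][i] for j in range(n) if j != i]
        if neg_texts:
            total += max(0.0, margin + max(neg_texts) - pos)
            total += max(0.0, margin + max(neg_videos) - pos)
    return total / n


# Global (event-level) similarities for a toy batch of two video-sentence pairs.
videos = [[1.0, 0.0], [0.2, 1.0]]
sentences = [[0.9, 0.1], [0.0, 1.0]]
sim = [[cosine(v, s) for s in sentences] for v in videos]
loss = ranking_loss(sim)
```

Because alignment between text nodes and frames is only learned through the pair-level loss, no frame-level annotation is needed, which is what "weakly supervised" means here.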

  9. Text-Video Cross-modal Retrieval: Experiments (Experimental Settings) • Datasets:

     Dataset        Train   Validation   Test    # sent/video
     MSR-VTT        6573    497          2990    20
     TGIF           79451   10651        11310   1
     VATEX          25991   1500         1500    10
     Youtube2Text   -       -            670     41.5

     • Evaluation metrics • R@K: K = {1, 5, 10} • MedR (median rank) & MnR (mean rank)
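
The evaluation metrics listed on slide 9 can be computed from the rank of the ground-truth item for each query. The sketch below assumes 1-based ranks and reports R@K as a percentage; the helper names `rank_of_match` and `retrieval_metrics` are hypothetical, and the example ranks are made-up inputs, not results from the paper.

```python
from statistics import mean, median


def rank_of_match(scores, target):
    """1-based rank of the ground-truth candidate under descending score order."""
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Recall at K (in percent), median rank and mean rank over all queries."""
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks)
               for k in ks}
    metrics["MedR"] = median(ranks)
    metrics["MnR"] = mean(ranks)
    return metrics


# Illustrative ranks for four queries (made-up inputs).
example = retrieval_metrics([1, 3, 7, 20])
```

Higher R@K is better, while lower MedR and MnR are better.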

  10. Text-Video Cross-modal Retrieval: Experiments (Experimental Results) • In-domain cross-modal retrieval • HGR model achieves consistent improvements on three datasets • (results shown on the MSR-VTT dataset)

  11. Text-Video Cross-modal Retrieval: Experiments (Experimental Results) • In-domain cross-modal retrieval: ablation study • Textual encoding: graph attention & semantic role awareness • Video encoding: different video weights at each level • (results shown on the MSR-VTT dataset)

  12. Text-Video Cross-modal Retrieval: Experiments (Experimental Results) • Cross-dataset video-text retrieval • Train on MSR-VTT and test on Youtube2Text • HGR model also has better generalization performance • (results shown for the in-domain and cross-dataset settings)

  13. Text-Video Cross-modal Retrieval: Experiments (Experimental Results) • Fine-grained binary selection • Evaluates models' fine-grained textual discrimination abilities • Better performance, especially on incomplete events • Examples: "a man is cutting pizza." (positive) vs. "pizza is cutting a man." (negative); "a dog hits a man's hands with its paws while standing." (positive) vs. "a dog hits a man's hands." (negative)
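
The binary selection protocol on slide 13 amounts to checking, for each video, whether the model scores the positive sentence above the negative one. A minimal sketch, assuming a generic `score(video, sentence)` callable; the function name `binary_selection_accuracy`, the video identifiers and the similarity values are hypothetical placeholders, not results from the paper.

```python
def binary_selection_accuracy(triples, score):
    """Fraction of triples where the positive sentence outscores the negative.

    triples: iterable of (video, positive_sentence, negative_sentence)
    score:   callable (video, sentence) -> float
    """
    triples = list(triples)
    if not triples:
        return 0.0
    correct = sum(1 for video, pos, neg in triples
                  if score(video, pos) > score(video, neg))
    return correct / len(triples)


# Placeholder similarities; a real run would call the trained model instead.
toy_scores = {
    ("video_1", "a man is cutting pizza."): 0.71,
    ("video_1", "pizza is cutting a man."): 0.42,
    ("video_2", "a dog hits a man's hands with its paws while standing."): 0.55,
    ("video_2", "a dog hits a man's hands."): 0.58,
}
triples = [
    ("video_1", "a man is cutting pizza.", "pizza is cutting a man."),
    ("video_2", "a dog hits a man's hands with its paws while standing.",
     "a dog hits a man's hands."),
]
accuracy = binary_selection_accuracy(triples, lambda v, s: toy_scores[(v, s)])
```

In this toy setup the second pair is deliberately misranked, mirroring the incomplete-event case that the slide singles out as hard.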

  14. Conclusion • Contributions • Decompose video and text at event, action and entity levels for multi-level cross-modal matching • Utilize attention-based graph reasoning on the textual semantic role graph to generate hierarchical embeddings • Results on in-domain, cross-dataset and fine-grained binary selection demonstrate the advantages of our model • Future work • Improve video encoding with multi-modalities and fine-grained spatial-temporal information • Code is released at: https://github.com/cshizhe/hgr_v2t
