Fine-grained Video-Text Retrieval

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning. Shizhe Chen 1, Yida Zhao 1, Qin Jin 1, Qi Wu 2. 1 Renmin University of China, 2 University of Adelaide


slide-1
SLIDE 1

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

Shizhe Chen1, Yida Zhao1, Qin Jin1, Qi Wu2

1Renmin University of China, 2University of Adelaide


slide-2
SLIDE 2

Video-Text Cross-modal Retrieval

  • Task: using sentences to retrieve videos
  • Sentences contain richer and more structured details than keywords


slide-3
SLIDE 3
Text-Video Cross-modal Retrieval: Motivation

  • Understanding fine-grained semantics in the query sentence
  • Hierarchical sentence structures: event, actions, entities
  • Fine-grained local components & how they compose to the event

Action-action relationships; action-entity relationships

slide-4
SLIDE 4
Motivation

  • Limitations of previous works
  • Global matching (one vector): hard to capture fine-grained details
  • Local matching (word level): cannot express complex relationships among words

slide-5
SLIDE 5

Text-Video Cross-modal Retrieval: Model

The Proposed Method

  • Hierarchical Graph Reasoning Model (HGR)
  • Hierarchical Textual Encoding
  • Hierarchical Video Encoding
  • Multi-level Video-Text Matching
slide-6
SLIDE 6

Hierarchical Textual Encoding

  • Capture interactive context via attentive relational GCN
  • Factorize relational matrix

Figure (pipeline): Semantic Role Graph, Node Initialization, Attention-based Graph Reasoning. Figure labels: event node, action, entity, max pooling, contextual word embedding.
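The two bullets on this slide can be illustrated with a small sketch of one reasoning step over a semantic role graph. This is not the released implementation. It assumes that "factorizing the relational matrix" means splitting the per-role weights into one transformation shared by all roles (W_t here) and one embedding vector per role (role_emb here), and that attention is a scaled dot product over graph neighbors. All names, role ids and values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def graph_reasoning_layer(nodes, roles, neighbors, role_emb, W_t):
    """One attention-based reasoning step (illustrative sketch only).

    nodes:     one feature vector per graph node
    roles:     hypothetical role id per node (0=event, 1=action, 2=entity)
    neighbors: neighbors[i] lists the node indices connected to node i
    role_emb:  one vector per role (assumed role-specific factor)
    W_t:       matrix shared by all roles (assumed shared factor)
    """
    dim = len(nodes[0])
    # Role-aware initialization: elementwise product with the role embedding.
    g = [[x * r for x, r in zip(node, role_emb[role])]
         for node, role in zip(nodes, roles)]
    out = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:  # isolated node: nothing to aggregate
            out.append(list(g[i]))
            continue
        # Scaled dot-product attention over the neighbors of node i.
        scores = [sum(a * b for a, b in zip(g[i], g[j])) / math.sqrt(dim)
                  for j in nbrs]
        weights = softmax(scores)
        context = [sum(w * g[j][d] for w, j in zip(weights, nbrs))
                   for d in range(dim)]
        # Residual update through the shared transformation.
        update = matvec(W_t, context)
        out.append([a + b for a, b in zip(g[i], update)])
    return out

# Toy graph: event node 0, action node 1, entity node 2.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
roles = [0, 1, 2]
neighbors = [[1], [0, 2], [1]]
role_emb = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
W_t = [[1.0, 0.0], [0.0, 1.0]]
out = graph_reasoning_layer(nodes, roles, neighbors, role_emb, W_t)
```

With this factorization the number of parameters grows by one vector per role rather than one full matrix per role, which is the usual motivation for such a split.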

slide-7
SLIDE 7
Hierarchical Video Encoding

  • Videos contain multiple aspects: objects, actions, events
  • Challenging to parse directly as with text: doing so requires object detection, tracking, action segmentation, etc.
  • Different weights for each level
  • Use different levels of text as guidance to learn diverse video representations
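"Different weights for each level" can be read as projecting the same frame features through three separate weight matrices, one each for the event, action and entity levels. The sketch below shows that reading only. The matrix names are hypothetical, and the mean pooling used for the event-level vector is an assumption, not something the slides state.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encode_video(frames, W_event, W_action, W_entity):
    """Embed frame features with a separate (hypothetical) matrix per level."""
    event_frames = [matvec(W_event, f) for f in frames]
    # Assumption: the event-level video vector is the mean over frames.
    event = [sum(col) / len(frames) for col in zip(*event_frames)]
    actions = [matvec(W_action, f) for f in frames]    # one vector per frame
    entities = [matvec(W_entity, f) for f in frames]   # one vector per frame
    return {"event": event, "actions": actions, "entities": entities}

# Toy example with two 2-dimensional frame features.
frames = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
video = encode_video(frames, identity, double, identity)
```

No explicit object detection or action segmentation is involved. The three projections only become level-specific once they are trained against the matching text levels.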

slide-8
SLIDE 8
Multi-level Cross-modal Matching

  • Multi-level fusion
  • Event level: global matching with cosine similarity
  • Action & entity levels: local matching with weakly supervised attentive alignment
  • Training objective: contrastive ranking loss
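The matching scheme on this slide can be sketched as follows. Several details are assumptions made only for illustration: fusion is taken to be a plain average of the three level similarities, the attentive alignment is a softmax over frames with a hypothetical temperature, and the ranking loss uses the hardest negative in each direction with a hypothetical margin of 0.2.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def local_similarity(text_nodes, frames, temperature=4.0):
    """Attentive alignment: each text node attends over the video frames."""
    total = 0.0
    for node in text_nodes:
        sims = [max(cosine(node, f), 0.0) for f in frames]
        exps = [math.exp(temperature * s) for s in sims]
        z = sum(exps)
        # Attention-weighted similarity of this node to the video.
        total += sum(e / z * s for e, s in zip(exps, sims))
    return total / len(text_nodes)

def fused_similarity(text, video):
    """Assumed fusion: average of event, action and entity similarities."""
    event = cosine(text["event"], video["event"])
    action = local_similarity(text["actions"], video["actions"])
    entity = local_similarity(text["entities"], video["entities"])
    return (event + action + entity) / 3.0

def ranking_loss(sim, margin=0.2):
    """Hinge ranking loss with the hardest negative in each direction.

    sim[i][j] is the similarity of video i and sentence j; diagonal
    entries are the matching pairs. Requires at least two pairs.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_video = max(sim[j][i] for j in range(n) if j != i)
        loss += max(0.0, margin + hardest_text - pos)
        loss += max(0.0, margin + hardest_video - pos)
    return loss / n

text = {"event": [1.0, 0.0], "actions": [[1.0, 0.0]], "entities": [[0.0, 1.0]]}
video = {"event": [1.0, 0.0], "actions": [[1.0, 0.0]], "entities": [[0.0, 1.0]]}
score = fused_similarity(text, video)
```

The alignment is weakly supervised in the sense that only sentence-video pairs are labeled. No node-to-frame correspondence is given, so the attention weights have to be learned through the ranking loss.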

slide-9
SLIDE 9

Text-Video Cross-modal Retrieval: Experiments

Experimental Settings

  • Datasets

Dataset        Train   Validation   Test    # sent/video
MSR-VTT        6573    497          2990    20
TGIF           79451   10651        11310   1
VATEX          25991   1500         1500    10
Youtube2Text   -       -            670     41.5

  • Evaluation metrics
  • R@K: K={1, 5, 10}
  • MedR (median rank) & MnR (mean rank)
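The evaluation metrics can be computed from a query-by-candidate similarity matrix, as in the sketch below. It assumes one correct candidate per query, located on the diagonal, and breaks ties optimistically. Function names and the toy matrix are illustrative only.

```python
from statistics import mean, median

def ranks_from_similarity(sim):
    """sim[i][j]: similarity of query i and candidate j; candidate i is the match."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Return R@K (in percent), median rank and mean rank."""
    ranks = ranks_from_similarity(sim)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks)
               for k in ks}
    metrics["MedR"] = median(ranks)
    metrics["MnR"] = mean(ranks)
    return metrics

# Toy example: three queries, three candidates.
sim = [
    [0.9, 0.1, 0.2],   # match ranked 1st
    [0.8, 0.3, 0.1],   # match ranked 2nd
    [0.2, 0.9, 0.5],   # match ranked 2nd
]
ranks = ranks_from_similarity(sim)
metrics = retrieval_metrics(sim)
```

Higher R@K is better, while lower MedR and MnR are better.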

slide-10
SLIDE 10

Experimental Results

  • In-domain cross-modal retrieval
  • HGR model achieves consistent improvements on three datasets

Table: MSR-VTT dataset

slide-11
SLIDE 11

Experimental Results

  • In-domain cross-modal retrieval: ablation study
  • Textual encoding: graph attention & semantic role awareness
  • Video encoding: different video weights at each level

Table: MSR-VTT dataset

slide-12
SLIDE 12

Experimental Results

  • Cross-dataset video-text retrieval
  • Train on MSR-VTT and test on Youtube2Text
  • HGR model also has better generalization performance

Table columns: in-domain, cross-dataset

slide-13
SLIDE 13

Experimental Results

  • Fine-grained binary selection
  • Evaluates models’ fine-grained textual discrimination abilities
  • Better performance, especially on incomplete events

Examples:
positive: a man is cutting pizza.
negative: pizza is cutting a man.
positive: a dog hits a man’s hands with its paws while standing.
negative: a dog hits a man’s hands.
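A binary selection score can be computed as the fraction of videos for which the model rates the positive sentence above the negative one. The sketch below assumes exactly that protocol and counts ties as errors. The scores are made-up placeholders, not results from the talk.

```python
def binary_selection_accuracy(score_pairs):
    """score_pairs: (positive_score, negative_score) for each video."""
    correct = sum(1 for pos, neg in score_pairs if pos > neg)
    return correct / len(score_pairs)

# Hypothetical similarity scores for four videos.
pairs = [(0.8, 0.3), (0.4, 0.6), (0.7, 0.7), (0.9, 0.1)]
accuracy = binary_selection_accuracy(pairs)
```

A model that ignores word order or drops details scores the two sentences almost equally, so its accuracy stays near chance on pairs such as the ones above.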

slide-14
SLIDE 14

Conclusion

  • Contributions
  • Decompose video and text at event, action and entity levels for multi-level cross-modal matching
  • Utilize attention-based graph reasoning on textual semantic role graph to generate hierarchical embeddings
  • Results on in-domain, cross-dataset and fine-grained binary selection demonstrate the advantages of our model
  • Future work
  • Improve video encoding with multi-modalities and fine-grained spatial-temporal information

Code is released at: https://github.com/cshizhe/hgr_v2t