Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Shizhe Chen1, Yida Zhao1, Qin Jin1, Qi Wu2
1Renmin University of China, 2University of Adelaide
Action-action relationships
Action-entity relationships
how they compose to the event
Action-action relationships
Action-entity relationships
how they compose to the event
among words
Event node:
max pooling
attentive relational GCN
Semantic Role Graph
Node Initialization
Attention-based Graph Reasoning
Action
Entity
contextual word embedding
learn diverse video representation
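The node initialization and attention-based graph reasoning steps named above can be illustrated with a minimal NumPy sketch. It is a generic attention-weighted message-passing step, not the authors' exact attentive relational GCN. The function names, the toy graph, the dot-product attention, the residual update, and the random embeddings are all assumptions made for illustration.

```python
import numpy as np

def init_node(word_embeddings):
    """Max-pool the contextual embeddings of a node's words into one vector."""
    return word_embeddings.max(axis=0)

def attention_graph_step(nodes, adjacency, weight):
    """One attention-weighted message-passing step (illustrative only).

    nodes:     (n, d) node features
    adjacency: (n, n) 0/1 matrix; self-loops are assumed so every row has a neighbor
    weight:    (d, d) projection applied to the aggregated messages
    """
    d = nodes.shape[1]
    scores = nodes @ nodes.T / np.sqrt(d)               # pairwise dot-product scores
    scores = np.where(adjacency > 0, scores, -np.inf)   # mask out non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True) # stabilize the softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)       # softmax over neighbors
    updated = nodes + (attn @ nodes) @ weight           # residual update
    return updated, attn

rng = np.random.default_rng(0)
dim = 4
# Hypothetical toy graph: one event node, one action node, one entity node.
event = init_node(rng.normal(size=(5, dim)))   # pooled over 5 word embeddings
action = init_node(rng.normal(size=(1, dim)))  # a single verb
entity = init_node(rng.normal(size=(2, dim)))  # a two-word noun phrase
nodes = np.stack([event, action, entity])

# Edges: event-action and action-entity, plus self-loops.
adjacency = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
weight = rng.normal(size=(dim, dim)) * 0.1
updated, attn = attention_graph_step(nodes, adjacency, weight)
```

In this sketch the event and entity nodes exchange no information directly because they share no edge; a second step would let information flow between them through the action node.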
require object detection, tracking, action segmentation, etc.
Dataset        Train    Validation   Test     # sent/video
MSR-VTT        6573     497          2990     20
TGIF           79451    10651        11310    1
VATEX          25991    1500         1500     10
Youtube2Text
41.5
MSR-VTT dataset
In-domain Cross-dataset
positive: a man is cutting pizza.
negative: pizza is cutting a man.
positive: a dog hits a man’s hands with its paws while standing.
negative: a dog hits a man’s hands.
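The two kinds of negatives in these examples (swapped roles and a dropped modifier) can be reproduced with a small sketch. The slides do not say how the pairs were built, so the helper names and the template-based construction below are assumptions for illustration only.

```python
def swap_roles(agent, predicate, patient):
    """Build a (positive, negative) pair by exchanging agent and patient."""
    positive = f"{agent} {predicate} {patient}."
    negative = f"{patient} {predicate} {agent}."
    return positive, negative

def drop_modifier(core, modifier):
    """Build a (positive, negative) pair by removing a trailing modifier."""
    positive = f"{core} {modifier}."
    negative = f"{core}."
    return positive, negative

pair_swap = swap_roles("a man", "is cutting", "pizza")
pair_drop = drop_modifier("a dog hits a man’s hands",
                          "with its paws while standing")
```

A retrieval model that only matches bags of words would score both sentences in `pair_swap` identically, which is why such pairs probe fine-grained understanding.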
cross-modal matching
generate hierarchical embeddings
demonstrate the advantages of our model
temporal information
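As a rough illustration of cross-modal matching with hierarchical embeddings, the sketch below scores a video against a sentence by averaging cosine similarities over three levels. The level names, the use of one pooled vector per level, the plain averaging, and the toy vectors are assumptions; the released code may aggregate the levels differently.

```python
import numpy as np

LEVELS = ("event", "action", "entity")  # assumed level names

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_similarity(video, text):
    """Average the per-level cosine similarities (a simplifying assumption)."""
    return sum(cosine(video[l], text[l]) for l in LEVELS) / len(LEVELS)

def rank_videos(videos, text):
    """Return video indices sorted from best to worst match for the text."""
    scores = [hierarchical_similarity(v, text) for v in videos]
    return sorted(range(len(videos)), key=lambda i: scores[i], reverse=True)

# Toy embeddings, made up for the example.
text = {"event": np.array([1.0, 0.0]),
        "action": np.array([0.0, 1.0]),
        "entity": np.array([1.0, 1.0])}
video_match = {k: 2.0 * v for k, v in text.items()}   # same directions as the text
video_other = {"event": np.array([0.0, 1.0]),
               "action": np.array([1.0, 0.0]),
               "entity": np.array([1.0, -1.0])}       # orthogonal at every level

ranking = rank_videos([video_other, video_match], text)
```

Here `ranking` places the matching video (index 1) ahead of the orthogonal one (index 0).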
Code is released at: https://github.com/cshizhe/hgr_v2t