Attentive Moment Retrieval in Videos
Meng Liu1, Xiang Wang2, Liqiang Nie1, Xiangnan He2, Baoquan Chen1, and Tat-Seng Chua2
1Shandong University, China 2National University of Singapore, Singapore
Query: FIFA World Cup
Query: Pig Peggy
Retrieving a segment from the untrimmed videos, which contain complex scenes and involve a large number of
Messi's penalty/Football shot
and suspects
Query: A girl in orange first walks by the camera.
Query: Baby’s face gets very close to the camera.
Boring and time consuming
Expensive computational costs and the exponential search space
Query: a girl in orange walks by the camera. (24s to 30s)
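[Figure: pipeline. The video passes through the Memory Attention Network to produce a moment feature, and the query (e.g., "Girl with blue shirt drives past on bike.") produces a query feature. Cross-Modal Fusion followed by an MLP model outputs an alignment score (e.g., 0.9) and a localization offset (e.g., [1.2s, 2.5s]).]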
Segment the video into moments with a sliding window; each moment $c$ has a time location $[t_s, t_e]$
$[\delta_s, \delta_e] = [\tau_s, \tau_e] - [t_s, t_e]$, where $[\tau_s, \tau_e]$ is the temporal interval
C3D feature for each moment
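The sliding-window segmentation and the offset between a target interval and a candidate moment can be sketched as follows. This is only an illustration: the function names, the 6-second window, and the 3-second stride are assumptions, not values taken from the slides, and C3D feature extraction is left out.

```python
def sliding_window_moments(duration, window, stride):
    """Return candidate moments (start, end) in seconds for one video.

    `window` and `stride` are hypothetical parameters chosen for illustration.
    """
    moments = []
    start = 0.0
    while start + window <= duration:
        moments.append((start, start + window))
        start += stride
    return moments


def localization_offset(target, moment):
    """Element-wise difference: target interval minus candidate moment."""
    return (target[0] - moment[0], target[1] - moment[1])


if __name__ == "__main__":
    candidates = sliding_window_moments(duration=30.0, window=6.0, stride=3.0)
    print(len(candidates), candidates[-1])
    print(localization_offset((24.0, 30.0), (21.0, 27.0)))
```

Overlapping windows (stride smaller than window) give each target interval several nearby candidates, and the offset tells the model how far a candidate's boundaries must move to match the target.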
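Memory cell
$\hat{x}_{c_j} = W_c x_{c_j} + b_c$
$m_c = \sum_{j \in [-n_c, n_c]} \alpha_{c_j} \hat{x}_{c_j}$
$e(c_j, q) = \sigma\left(\sum_{i=-n_c}^{j} W_c x_{c_i} + b_c\right)^{\top} \sigma\left(W_q q + b_q\right)$
$\alpha_{c_j} = \frac{e(c_j, q)}{\sum_{k=-n_c}^{n_c} e(c_k, q)}, \quad j \in [-n_c, n_c]$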
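A minimal pure-Python sketch of the memory attention step may help make the computation concrete. It assumes the nonlinearity is the logistic sigmoid (which keeps every score positive, so the plain normalization is well defined); the helper names and the toy inputs are hypothetical, and the real model would learn the weight matrices.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def affine(W, x, b):
    # W is a list of rows; returns W @ x + b.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def memory_attention(context, query, W_c, b_c, W_q, b_q):
    """context: features of the moments at offsets -n_c..n_c, in temporal order."""
    q_emb = sigmoid(affine(W_q, query, b_q))
    zero = [0.0] * len(b_c)
    running = list(zero)  # running sum of W_c x over the moments seen so far
    scores, projected = [], []
    for x in context:
        wx = affine(W_c, x, zero)  # W_c x, bias added separately below
        running = [r + v for r, v in zip(running, wx)]
        cell = sigmoid([r + b for r, b in zip(running, b_c)])  # memory cell
        scores.append(sum(c * q for c, q in zip(cell, q_emb)))
        projected.append([v + b for v, b in zip(wx, b_c)])  # W_c x + b_c
    total = sum(scores)
    weights = [s / total for s in scores]  # attention weights, sum to 1
    memory = [sum(w * p[d] for w, p in zip(weights, projected))
              for d in range(len(b_c))]
    return weights, memory


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    bias = [0.0, 0.0]
    context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy features, n_c = 1
    print(memory_attention(context, [1.0, 0.0], identity, bias, identity, bias))
```

Because the memory cell accumulates all moments up to position j, later context moments are scored with knowledge of what preceded them, which is what lets the query weight temporal cues such as "first".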
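[Figure: cross-modal fusion. The moment feature and the query feature ($\tilde{q}$) are each mean-pooled and then combined with a tensor product ($\otimes$), producing inter-modal, intra-visual, and intra-query parts.]
$f_{cq} = \begin{bmatrix} \tilde{m}_c \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \tilde{q} \\ 1 \end{bmatrix} = [\tilde{m}_c, \; \tilde{m}_c \otimes \tilde{q}, \; \tilde{q}, \; 1]$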
The output of this fusion procedure captures both the intra-modal and the inter-modal feature interactions, yielding the moment-query representation.
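The fusion can be illustrated with a small sketch: appending a constant 1 to each pooled vector and taking the outer product yields a matrix that contains the inter-modal products together with both original vectors. The function name and the toy vectors are hypothetical, and the mean pooling that precedes this step is assumed to have been applied already.

```python
def cross_modal_fusion(moment_vec, query_vec):
    """Outer product of [moment; 1] and [query; 1], as a list of rows."""
    m1 = list(moment_vec) + [1.0]
    q1 = list(query_vec) + [1.0]
    return [[a * b for b in q1] for a in m1]


if __name__ == "__main__":
    fused = cross_modal_fusion([2.0, 3.0], [5.0, 7.0])
    inter_modal = [row[:-1] for row in fused[:-1]]   # moment (x) query block
    intra_visual = [row[-1] for row in fused[:-1]]   # copy of the moment vector
    intra_query = fused[-1][:-1]                     # copy of the query vector
    print(inter_modal, intra_visual, intra_query, fused[-1][-1])
```

Keeping the unimodal copies next to the product block means the downstream MLP sees each modality on its own as well as their pairwise interactions.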
The output of the fusion model is fed into a two-layer MLP, whose output is a three-dimensional vector $o_L = [s_{cq}, \delta_s, \delta_e]$.
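Alignment loss: $\sum_{(c,q) \in \mathcal{P}} \log\left(1 + \exp(-s_{cq})\right) + \alpha \sum_{(c,q) \in \mathcal{N}} \log\left(1 + \exp(s_{cq})\right)$
Localization loss: $\sum_{(c,q) \in \mathcal{P}} \left[ R(\delta_s^{*} - \delta_s) + R(\delta_e^{*} - \delta_e) \right]$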
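The two training terms, a log-loss on alignment scores and a regression loss on localization offsets, can be sketched as below. Several details are assumptions made for illustration: `neg_weight` is a hypothetical weighting parameter, R is taken to be the absolute value (an L1 penalty), and the function names are not from the slides.

```python
import math


def alignment_loss(pos_scores, neg_scores, neg_weight=1.0):
    """Push scores of aligned pairs up and scores of misaligned pairs down."""
    pos = sum(math.log1p(math.exp(-s)) for s in pos_scores)
    neg = sum(math.log1p(math.exp(s)) for s in neg_scores)
    return pos + neg_weight * neg


def localization_loss(pred_offsets, true_offsets):
    """L1 regression on (start, end) offsets, computed on aligned pairs only."""
    return sum(abs(ts - ps) + abs(te - pe)
               for (ps, pe), (ts, te) in zip(pred_offsets, true_offsets))


if __name__ == "__main__":
    print(alignment_loss([2.0], [-1.0]))
    print(localization_loss([(1.0, 2.0)], [(1.2, 2.5)]))
```

The localization term is only evaluated on aligned moment-query pairs, since an offset toward the target is not meaningful for a moment that does not match the query.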
R(n,m)=“R@n,IoU=m”
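As commonly defined, "R@n, IoU=m" is the fraction of queries for which at least one of the top-n retrieved moments overlaps the ground-truth interval with a temporal IoU of at least m. A minimal sketch of that computation follows; the function names and the toy rankings are hypothetical, and whether the threshold is inclusive is an assumption.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_per_query, ground_truths, n, m):
    """Fraction of queries with a top-n moment whose IoU with the target is >= m."""
    hits = 0
    for ranked, target in zip(ranked_per_query, ground_truths):
        if any(temporal_iou(p, target) >= m for p in ranked[:n]):
            hits += 1
    return hits / len(ground_truths)


if __name__ == "__main__":
    ranked = [[(0.0, 6.0), (24.0, 30.0)], [(0.0, 5.0)]]  # toy ranked lists
    targets = [(24.0, 30.0), (10.0, 20.0)]
    print(recall_at_n(ranked, targets, n=1, m=0.5))
    print(recall_at_n(ranked, targets, n=2, m=0.5))
```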
which jointly characterizes the attentive contextual visual feature and the cross-modal feature representation.
memorize the contextual information for each moment, and treat the natural language query as the input of an attention network to adaptively assign weights to the memory representation.
to demonstrate the performance improvement.