SLIDE 1

Attentive Moment Retrieval in Videos

Meng Liu1, Xiang Wang2, Liqiang Nie1, Xiangnan He2, Baoquan Chen1, and Tat-Seng Chua2

1Shandong University, China 2National University of Singapore, Singapore

SLIDE 2

Pipeline

  • Background
  • Learning Model
  • Experiment
  • Conclusion
SLIDE 4

Background

  • Inter-video Retrieval

Query: “FIFA World Cup”    Query: “Pig Peggy”

[retrieved video thumbnails omitted]

SLIDE 5

Background

  • Intra-video Retrieval

Retrieving a segment from an untrimmed video, which contains complex scenes and involves a large number of objects, attributes, actions, and interactions.

Example query: “Messi's penalty” / “football shot”

SLIDE 6

Background

  • Surveillance videos: finding missing children, pets, or suspects

Query: “A girl in orange first walks by the camera.”

  • Home videos: recalling a desired moment

Query: “Baby’s face gets very close to the camera.”

  • Online videos: quickly jumping to a specific moment
SLIDE 7

Background

Reality: dragging the progress bar to locate the desired moment, which is boring and time-consuming.

Research: densely segmenting the long video into moments at multiple scales and then matching each moment against the query, which incurs expensive computation and an exponential search space.

SLIDE 8

Problem Formulation

  • Temporal Moment Localization

Query: “A girl in orange walks by the camera.”

Input: a video and a natural language query

Output: the temporal moment corresponding to the given query (green box), with time points [24s, 30s]

SLIDE 9

Pipeline

  • Background
  • Learning Model
  • Experiment
  • Conclusion
SLIDE 10

Learning Model-Pipeline

Video → Memory Attention Network → Moment Feature

Query (“Girl with blue shirt drives past on bike.”) → Query Feature

Moment Feature + Query Feature → Cross-Modal Fusion → MLP Model → Alignment Score (e.g., 0.9) and Localization Offset (e.g., [1.2s, 2.5s])

SLIDE 11

Learning Model-Feature Extraction

  • Video
  • 1. Segmentation: segment the video into moments with sliding windows; each moment c_t has a time location [τ_s, τ_e]
  • 2. Computing location offsets: [δ_s, δ_e] = [t_s, t_e] - [τ_s, τ_e], where [t_s, t_e] is the temporal interval of the given query
  • 3. Computing the spatio-temporal feature v_t: a C3D feature for each moment
  • Query

q: Skip-thoughts feature
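The segmentation and offset steps above can be sketched as follows. This is a minimal illustration; the window lengths and stride are hypothetical choices, not the paper's exact settings.

```python
def sliding_window_moments(duration, window_lengths=(64, 128, 256), stride=32):
    """Enumerate candidate moments [tau_s, tau_e] via multi-scale sliding windows."""
    moments = []
    for length in window_lengths:
        start = 0
        while start + length <= duration:
            moments.append((start, start + length))
            start += stride
    return moments

def location_offsets(moment, query_interval):
    """[delta_s, delta_e] = [t_s, t_e] - [tau_s, tau_e]."""
    tau_s, tau_e = moment
    t_s, t_e = query_interval
    return (t_s - tau_s, t_e - tau_e)
```

During training, each candidate moment is paired with the offsets that would shift its boundaries onto the ground-truth interval of the query.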

SLIDE 12

Learning Model-Memory Attention Network

  • There are many temporal constraint words in the given query, such as “first”, “second”, and “closer to”; therefore the temporal context is useful for localization.
  • Not all contexts have the same influence on the localization: the near contexts are more important than the far ones.

SLIDE 13

Learning Model-Memory Attention Network

! " #$% = '

$#$% + )$

*$ = +

,∈[/01,01]

4$% " #$%

5 6,, 7 = 8(+

:;/01 ,

'

$#$< + )$)>? 8(' @7 + )@)

4$% = 5(6,, 7) ∑B;/01

01

5(6B, 7) , C ∈ [−E$, E$] Memory cell
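A minimal NumPy sketch of the query-guided memory attention above. Shapes and parameter initialization are illustrative assumptions; only the flow (memory cells, sigmoid-gated scores, normalized weights, weighted sum) follows the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_attention(contexts, query, W_m, b_m, W_q, b_q):
    """contexts: (2*n_c+1, d_c) context moment features; query: (d_q,) vector."""
    memory = contexts @ W_m.T + b_m       # memory cells m_{t+j}, shape (2*n_c+1, d)
    q_key = sigmoid(W_q @ query + b_q)    # query key, shape (d,)
    scores = sigmoid(memory) @ q_key      # attention scores e(c_{t+j}, q)
    alpha = scores / scores.sum()         # normalized attention weights
    return alpha @ memory                 # attended moment feature v_t
```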

SLIDE 14

Learning Model-Cross-Modal Fusion

Moment feature v (mean pooling over the moment and its contexts) and query feature q (mean pooling)

Fusion: [v; 1] ⨂ [q; 1] = [v, v ⨂ q, q, 1], where ⨂ denotes the outer product

The output of this fusion procedure explores the intra- modal and the inter-modal feature interactions to generate the moment-query representations.
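The fusion can be written in a few lines: appending a constant 1 to each vector before the outer product makes the result contain the inter-modal block v ⨂ q together with the intra-modal terms v, q, and a bias. A minimal sketch:

```python
import numpy as np

def cross_modal_fusion(v, q):
    """Outer product of [v; 1] and [q; 1], flattened: contains v*q^T, v, q, 1."""
    v1 = np.append(v, 1.0)   # moment feature with appended constant
    q1 = np.append(q, 1.0)   # query feature with appended constant
    return np.outer(v1, q1).ravel()
```

For d_v-dimensional v and d_q-dimensional q, the output has (d_v + 1)(d_q + 1) entries.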

SLIDE 15

Learning Model-Loss Function

The output of the fusion model is fed into a two-layer MLP, whose output is a three-dimensional vector o = [s_{vq}, δ_s, δ_e] (alignment score and location offsets).

  • L = L_align + λ L_loc
  • L_align = (1/N_1) Σ_{(v,q)∈P} log(1 + exp(-s_{vq})) + (1/N_2) Σ_{(v,q)∈N} log(1 + exp(s_{vq})), where P and N are the sets of positive and negative moment-query pairs
  • L_loc = Σ_{(v,q)∈P} [R(δ_s* - δ_s) + R(δ_e* - δ_e)], where R is the smooth L1 loss and [δ_s*, δ_e*] are the ground-truth offsets
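A minimal NumPy sketch of this loss. The data layout (positive pairs carrying their ground-truth offsets, λ = 1) is an illustrative assumption; only the logistic alignment terms and the smooth L1 regression term follow the slide.

```python
import numpy as np

def smooth_l1(x):
    """R(x): smooth L1, quadratic near zero, linear beyond |x| = 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def acrn_loss(pos, neg, lam=1.0):
    """pos: list of (o, gt_offsets) for positive pairs; neg: list of o.
    Each o = [s_vq, delta_s, delta_e] is an MLP output."""
    s_pos = np.array([o[0] for o, _ in pos])
    s_neg = np.array([o[0] for o in neg])
    l_align = (np.mean(np.log1p(np.exp(-s_pos)))     # pull positives up
               + np.mean(np.log1p(np.exp(s_neg))))   # push negatives down
    l_loc = sum(smooth_l1(gt[0] - o[1]) + smooth_l1(gt[1] - o[2])
                for o, gt in pos)                    # offset regression
    return l_align + lam * l_loc
```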

SLIDE 16

Pipeline

  • Background
  • Learning Model
  • Experiment
  • Conclusion
SLIDE 17

Experiment - Dataset

  • Datasets: TACoS and DiDeMo
  • Evaluation metric:

R(n, m) = “R@n, IoU=m”: the percentage of queries for which at least one of the top-n returned moments has a temporal IoU of at least m with the ground truth.
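The metric above can be computed directly from interval overlaps; a minimal sketch (function and argument names are my own):

```python
def temporal_iou(a, b):
    """IoU of two temporal intervals a = (s, e), b = (s, e)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(ranked_moments, ground_truths, n, m):
    """R(n, m): fraction of queries whose top-n moments contain one with IoU >= m."""
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(ranked_moments, ground_truths)
    )
    return hits / len(ground_truths)
```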

SLIDE 18

Experiment – Performance Comparison

SLIDE 19

Experiment – Model Variants

  • ACRN-a (pink): mean-pooled context features as the moment feature
  • ACRN-m (purple): the attention model without the memory component
  • ACRN-c (blue): simple concatenation of the multi-modal features
SLIDE 20

Experiment – Qualitative Result

SLIDE 21

Pipeline

  • Background
  • Learning Model
  • Experiment
  • Conclusion
SLIDE 22

Conclusion

  • We present a novel Attentive Cross-Modal Retrieval Network, which jointly characterizes the attentive contextual visual features and the cross-modal feature representation.
  • We introduce a temporal memory attention network to memorize the contextual information for each moment, and treat the natural language query as the input of an attention network to adaptively assign weights to the memory representations.
  • We perform extensive experiments on two benchmark datasets to demonstrate the performance improvement.

SLIDE 23

Thank you Q&A