Multimodal Memory Modelling for Video Captioning
Liang Wang & Yan Huang, Institute of Automation, Chinese Academy of Sciences, 2018


SLIDE 1

Center for Research on Intelligent Perception and Computing (CRIPAC) National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese Academy of Sciences (CASIA)

Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)

Multimodal Memory Modelling for Video Captioning

Liang Wang & Yan Huang

Mar 28, 2018

slide-2
SLIDE 2

NVAIL

Artificial Intelligence Laboratory: research on artificial intelligence and deep learning

SLIDE 3

Outline

  • Introduction
  • Model Description
  • Experimental Results
  • Conclusion
SLIDE 4

Outline

  • Introduction
  • Model Description
  • Experimental Results
  • Conclusion
SLIDE 5

Video Captioning

  • Generate natural sentences to describe video content
  • Potential applications
  • Example descriptions:
      1. A man and a woman performing a musical.
      2. A teenage couple perform in an amateur musical.
      3. Dancers are playing a routine.
      4. People are dancing in a musical.
  • Challenges

 Learning an effective mapping from visual sequence space to language space
 Long-term visual-textual dependency modelling
SLIDE 6

Related Work

  • Language template-based approach

Guadarrama et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. ICCV 2013.
Krishnamoorthy et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. AAAI 2013.

SLIDE 7

Related Work

  • Search-based approach

Xu et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI 2015.

SLIDE 8

Related Work

  • Sequence-to-Sequence learning-based approach

Yao et al. Describing Videos by Exploiting Temporal Structure. ICCV 2015.
Pan et al. Jointly Modeling Embedding and Translation to Bridge Video and Language. CVPR 2016.

SLIDE 9

Outline

  • Introduction
  • Model Description
  • Experimental Results
  • Conclusion
SLIDE 10

Motivation

  • Recent work has pointed out that LSTM does not work well when the sequence is long.
  • Neural memory models have shown great potential for long-term dependency modelling, e.g., QA in NLP.
  • Visual working memory is one of the key factors that guide eye movements.
  • A. Graves, et al. Neural Turing Machines. arXiv:1410.5401.
  • W. Wang, et al. Simulating Human Saccadic Scanpaths on Natural Images. CVPR 2011.
SLIDE 11

Recent Related Work

Fakoor et al. Memory-augmented Attention Modelling for Videos. arXiv 2016.

SLIDE 12

Recent Related Work

SLIDE 13

Recent Related Work

Agrawal et al. Recurrent Memory Addressing for Describing Videos. arXiv 2016.

SLIDE 14

Captioning Framework

[Framework diagram: a 2D/3D CNN-based video encoder extracts per-frame features v_1 … v_n; a multimodal memory (Mem_t) is updated and queried through attention read/write operations (read_att, write_att, read_dec, write_dec); an LSTM-based text decoder (LSTM_t) generates the sentence word by word, e.g., "#start A man is …".]

SLIDE 15

CNN-Based Video Encoder

C3D, VGG-19, GoogLeNet, Residual Net, Inception-v3

SLIDE 16

Multimodal Memory Modelling

[Diagram: numbered memory operations ①–⑥ connecting the temporal attention module and the decoder to the memory via read_att / write_att and read_dec / write_dec.]

  • Multimodal Memory

 N×M matrix

① Writing hidden representations to update memory
② Reading updated memory for temporal attention
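The slide describes the write and read operations only at a high level; as a rough sketch (our assumption, following the common NTM-style erase/add formulation rather than the authors' exact equations), updating and reading an N×M memory matrix looks like:

```python
def memory_write(memory, w, erase, add):
    """Erase/add update of an N x M memory (NTM-style):
    M[i][k] <- M[i][k] * (1 - w[i] * erase[k]) + w[i] * add[k],
    where w is the N-dim write weighting over rows and erase/add
    are M-dim vectors emitted by the writer."""
    return [[m * (1 - wi * e) + wi * a
             for m, e, a in zip(row, erase, add)]
            for row, wi in zip(memory, w)]

def memory_read(memory, w):
    """Read a weighted combination of memory rows: r = sum_i w[i] * M[i]."""
    cols = len(memory[0])
    return [sum(wi * row[k] for wi, row in zip(w, memory))
            for k in range(cols)]
```

Rows with zero write weight are left untouched, which is what lets the memory retain long-term visual-textual information across decoding steps.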

SLIDE 17

Multimodal Memory Modelling

Temporal attention selection for video representation

[Diagram: at each decoding step t+1, t+2, …, t+l, attention weights α_j over the n frame features v_1 … v_n are computed ("Attend"), and the weighted sum Σ_{j=1..n} α_j v_j forms the video representation, with read_att / write_att operations on the memory.]
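Temporal attention selection can be sketched under the usual soft-attention formulation (a generic sketch: the alignment scores would come from a learned model conditioned on the memory, but here they are taken as given):

```python
import math

def temporal_attention(frame_feats, scores):
    """Softmax-normalise per-frame alignment scores into attention
    weights alpha_j and return them together with the context vector
    sum_j alpha_j * v_j over the frame features v_j."""
    m = max(scores)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(frame_feats[0])
    context = [sum(a * f[k] for a, f in zip(alphas, frame_feats))
               for k in range(dim)]
    return alphas, context
```

The context vector is recomputed at every decoding step, so different frames can dominate at different points in the sentence.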

SLIDE 18

Multimodal Memory Modelling

[Diagram: numbered memory operations connecting the temporal attention module and the decoder to the memory via read_att / write_att and read_dec / write_dec.]

  • Multimodal Memory

 N×M matrix

③ Writing selected visual information to update memory
④ Reading the updated memory for the LSTM-based language model

SLIDE 19

LSTM-Based Text Decoder

[Diagram: LSTM decoder units (LSTM_t, LSTM_{t+1}, …, LSTM_{t+l}) unrolled over time, generating the sentence "#start A man is … #end".]
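For reference, one step of a standard LSTM cell in plain Python (a generic textbook cell, not the paper's exact parameterisation; the weight names and the omission of biases are ours):

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def _matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h, c, p):
    """One step of an LSTM cell over the concatenated input [x; h];
    p maps gate names ('Wi', 'Wf', 'Wo', 'Wg') to weight matrices."""
    z = x + h                                        # [x; h]
    i = [_sigmoid(a) for a in _matvec(p['Wi'], z)]   # input gate
    f = [_sigmoid(a) for a in _matvec(p['Wf'], z)]   # forget gate
    o = [_sigmoid(a) for a in _matvec(p['Wo'], z)]   # output gate
    g = [math.tanh(a) for a in _matvec(p['Wg'], z)]  # candidate cell
    c_new = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
    h_new = [oi * math.tanh(ci) for oi, ci in zip(o, c_new)]
    return h_new, c_new
```

In the captioning decoder, x would be the previous word embedding concatenated with the memory read, and h_new feeds the softmax over the vocabulary.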

SLIDE 20

Memory Addressing & Regularized Loss

  • Content-Based Memory Addressing
  • Regularized Loss
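Content-based memory addressing can be sketched as follows (a generic NTM-style formulation, since the slide gives no equations; the sharpening parameter beta is an assumption):

```python
import math

def content_addressing(memory, key, beta=1.0):
    """Content-based addressing: softmax over the beta-sharpened
    cosine similarity between a key vector and each memory row,
    as popularised by Neural Turing Machines."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv + 1e-8)   # epsilon avoids div-by-zero
    sims = [beta * cosine(row, key) for row in memory]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]
```

Larger beta concentrates the resulting weighting on the best-matching rows; beta = 0 degenerates to uniform addressing.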
SLIDE 21

Outline

  • Introduction
  • Model Description
  • Experimental Results
  • Conclusion
SLIDE 22

Implementation Details

  • Variable-length sentences

 A start tag and an end tag

  • Beam search

 beam size: 5

  • LSTM-based decoder

 visual hidden units: 1024, word embedding size: 468

  • Memory matrix

 memory size: (128, 512), initialized with Glorot uniform
 read and write weights initialized as one-hot vectors

  • Others

 minibatch size: 64, optimization algorithm: ADADELTA
 dropout with rate of 0.5, gradient norm clipped to (-10, 10)
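The beam search used for decoding (beam size 5) can be sketched generically; `step_fn` is our placeholder for the decoder's next-word log-probabilities and is not part of the original slides:

```python
import heapq

def beam_search(step_fn, start, end, beam_size=5, max_len=20):
    """Keep the beam_size highest log-probability partial sentences
    at every step.  step_fn(prefix) returns (token, log_prob) pairs
    for the next word; start/end are the sentence tags."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end:               # sentence already complete
                finished.append((logp, seq))
                continue
            for tok, tlp in step_fn(seq):
                candidates.append((logp + tlp, seq + [tok]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
    finished.extend(b for b in beams if b[1][-1] == end)
    best = max(finished or beams, key=lambda b: b[0])
    return best[1]
```

Greedy decoding is the beam_size = 1 special case; a wider beam trades compute for better sentence-level scores.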

SLIDE 23

Experimental Results

  • Microsoft Video Description Dataset

 1970 YouTube videos: training set (1200), validation set (100), and test set (670)
 each clip is 10 to 25 seconds long
 each video has about 40 sentences

SLIDE 24

Experimental Results

[Comparison with methods published at ICLR 2017 and CVPR 2017.]

SLIDE 25

Experimental Results

  • Microsoft Research-Video to Text Dataset

 the largest dataset in terms of sentences and vocabulary: 10,000 video clips and 200,000 sentences
 each video is labelled with about 20 sentences
 training set (6513), validation set (497), and test set (2990)

SLIDE 26

Experimental Results

  • Description Generation

SA: Yao et al., ICCV 2015; M3: our model

SLIDE 27

Outline

  • Introduction
  • Model Description
  • Experimental Results
  • Conclusion
SLIDE 28
Conclusion & Future Work

  • Textual/Visual/Attribute Memory

[Diagram of the working memory model: visuospatial sketchpad, central executive, phonological loop]

Working Memory, Baddeley et al.

SLIDE 29

Acknowledgement

NVAIL Artificial Intelligence Laboratory, for sponsoring excellent hardware resources

SLIDE 30

Thank you!

(Q/A)

wangliang@nlpr.ia.ac.cn