Multimodal Memory Modelling for Video Captioning


  1. Multimodal Memory Modelling for Video Captioning. Liang Wang & Yan Huang, Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). Mar 28, 2018.

  2. NVAIL Artificial Intelligence Laboratory: research on artificial intelligence and deep learning

  3. Outline
     - Introduction
     - Model Description
     - Experimental Results
     - Conclusion

  5. Video Captioning
     - Generate natural sentences to describe video content. Example captions for one clip:
       1. A man and a woman performing a musical.
       2. A teenage couple perform in an amateur musical.
       3. Dancers are playing a routine.
       4. People are dancing in a musical.
     - Potential applications
     - Challenges
       - Learning an effective mapping from visual sequence space to language space
       - Modelling the long-term visual-textual dependency

  6. Related Work: language template-based approach
     - Krishnamoorthy et al. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. AAAI 2013.
     - Guadarrama et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition. ICCV 2013.

  7. Related Work: search-based approach
     - Xu et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework. AAAI 2015.

  8. Related Work: sequence-to-sequence learning-based approach
     - Yao et al. Describing Videos by Exploiting Temporal Structure. ICCV 2015.
     - Pan et al. Jointly Modeling Embedding and Translation to Bridge Video and Language. CVPR 2016.

  10. Motivation
      - Recent work has pointed out that LSTM does not work well on long sequences.
      - Neural memory models have shown great potential for long-term dependency modelling, e.g., question answering (QA) in NLP.
      - Visual working memory is one of the key factors that guide eye movements.
      - References: A. Graves et al. Neural Turing Machines. arXiv:1410.5401. W. Wang et al. Simulating Human Saccadic Scanpaths on Natural Images. CVPR 2011.

  11. Recent Related Work: Fakoor et al. Memory-augmented Attention Modelling for Videos. arXiv 2016.

  12. Recent Related Work

  13. Recent Related Work: Agrawal et al. Recurrent Memory Addressing for Describing Videos. arXiv 2016.

  14. Captioning Framework
      - [Figure: overall architecture with three components.]
      - CNN-based video encoder: a 2D/3D CNN per frame yields features $v_1, v_2, v_3, \dots, v_n$.
      - Multimodal memory: states $Mem_t, Mem_{t+1}, Mem_{t+2}, \dots$, accessed in order by ① write_dec, ② read_att, ③ write_att, ④ read_dec. Temporal attention ($Attend_{t+1}$, $Attend_{t+2}$) uses weights $a_1^{t+1}, a_2^{t+1}, \dots, a_n^{t+1}$ to form $\sum_{i=1}^{n} a_i^{t+1} v_i$ (and likewise for $t+2$).
      - LSTM-based text decoder: $LSTM_t, LSTM_{t+1}, LSTM_{t+2}, \dots$, generating "#start A man is ..."

  15. CNN-Based Video Encoder: C3D, Residual network, GoogLeNet, VGG-19, Inception-v3

  16. Multimodal Memory Modelling
      - Multimodal memory: an N × M matrix
      - [Figure: the memory accessed by the write_dec, read_att, write_att and read_dec operations, numbered 1 to 6.]
      - ① Writing hidden representations to update memory
      - ② Reading updated memory for temporal attention

  17. Multimodal Memory Modelling: temporal attention selection for video representation
      - [Figure: frame features $v_1, v_2, v_3, v_4, v_5, \dots, v_n$ feed $Attend_{t+1}, Attend_{t+2}, \dots, Attend_{t+k}$, each driven by read_att and followed by write_att.]
      - At step $t+k$, attention weights $a_1^{t+k}, a_2^{t+k}, \dots, a_n^{t+k}$ give the video representation $\sum_{i=1}^{n} a_i^{t+k} v_i$.
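The weighted sum on this slide can be illustrated with a minimal Python sketch. It assumes the attention weights come from a softmax over per-frame relevance scores; in the slides those scores depend on the content read from memory, whereas here they are simply passed in. The function names and toy values are hypothetical, not taken from the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(frame_features, scores):
    """Return (weights, context) where context = sum_i a_i * v_i."""
    if len(frame_features) != len(scores):
        raise ValueError("need one score per frame feature")
    weights = softmax(scores)
    dim = len(frame_features[0])
    context = [
        sum(a * v[d] for a, v in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context

# Toy example: n = 3 frame features of dimension 2 (values are made up).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = temporal_attention(frames, scores=[0.0, 0.0, 0.0])
```

With equal scores every frame receives the same weight, so the context vector is the plain average of the frame features.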

  18. Multimodal Memory Modelling
      - Multimodal memory: an N × M matrix
      - [Figure: the memory accessed by the write_dec, read_att, write_att and read_dec operations, numbered 1 to 6.]
      - ③ Writing selected visual information to update memory
      - ④ Reading the updated memory for LSTM-based language model
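Steps ① to ④ alternate writes and reads on the N × M memory. The slides do not give the update equations, so the sketch below borrows the erase/add write and weighted read from Neural Turing Machines (cited on the Motivation slide) as an assumption. All names and values are hypothetical.

```python
def read_memory(memory, weights):
    """Read vector r = sum_i w_i * M[i] for an N x M memory (list of rows)."""
    width = len(memory[0])
    return [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(width)]

def write_memory(memory, weights, erase, add):
    """NTM-style write: M[i] <- M[i] * (1 - w_i * e) + w_i * a (elementwise)."""
    return [
        [m * (1.0 - w * e) + w * a for m, e, a in zip(row, erase, add)]
        for w, row in zip(weights, memory)
    ]

# Toy 3 x 2 memory; one-hot weights focus on slot 0, echoing the one-hot
# initialisation of read/write weights listed under Implementation Details.
memory = [[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]]
one_hot = [1.0, 0.0, 0.0]
updated = write_memory(memory, one_hot, erase=[1.0, 1.0], add=[0.9, -0.9])
read_vec = read_memory(updated, one_hot)
```

In this toy case the write fully overwrites slot 0 and leaves the other slots untouched, and the read with the same one-hot weights returns the vector just written.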

  19. LSTM-Based Text Decoder
      - [Figure: unrolled decoder $LSTM_t, LSTM_{t+1}, LSTM_{t+2}, \dots, LSTM_{t+k}$, generating "#start A man is ... #end".]

  20. Memory Addressing & Regularized Loss
      - Content-based memory addressing
      - Regularized loss
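The addressing equations did not survive on this slide. As a hedged illustration, the sketch below follows the content-based addressing of Neural Turing Machines: cosine similarity between a key and each memory row, sharpened by a strength `beta` and normalised with a softmax. Whether the talk uses exactly this form is an assumption, and the names and values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def content_addressing(memory, key, beta=1.0):
    """Weights w_i = softmax_i(beta * cos(key, M[i])) over the memory rows."""
    sims = [beta * cosine_similarity(key, row) for row in memory]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3 x 2 memory; the key matches row 0 most closely.
memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
address = content_addressing(memory, key=[1.0, 0.0], beta=5.0)
```

A larger `beta` concentrates the weights on the best-matching row, while `beta` near zero spreads them almost uniformly.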

  22. Implementation Details
      - Variable-length sentences: a start tag and an end tag
      - Beam search: beam size 5
      - LSTM-based decoder: visual hidden units 1024, word embedding size 468
      - Memory matrix: memory size (128, 512), GlorotUniform initialization; read and write weights initialized as one-hot vectors
      - Others: minibatch size 64; optimization algorithm ADADELTA; dropout with rate 0.5; gradient norm clipped to (-10, 10)
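The decoding setup above (start tag, end tag, beam size 5) can be sketched as a generic beam search. This is a minimal illustration rather than the authors' implementation: `next_log_probs` is a hypothetical stand-in for the LSTM decoder, the toy bigram table is made up, and no length normalisation is applied.

```python
import math

def beam_search(next_log_probs, start="#start", end="#end", beam_size=5, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(prefix) -> dict mapping each candidate next token to its
    log-probability given the prefix (a hypothetical stand-in for the decoder).
    """
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached the end tag.
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]

# Toy bigram table standing in for the LSTM decoder (probabilities are made up).
TOY_MODEL = {
    "#start": {"a": 0.9, "the": 0.1},
    "a": {"man": 0.6, "dog": 0.4},
    "the": {"man": 1.0},
    "man": {"dances": 0.7, "#end": 0.3},
    "dog": {"#end": 1.0},
    "dances": {"#end": 1.0},
}

def toy_next_log_probs(tokens):
    return {tok: math.log(p) for tok, p in TOY_MODEL[tokens[-1]].items()}

caption = beam_search(toy_next_log_probs)
```

Because scores are summed log-probabilities, longer captions are penalised; a practical decoder might normalise by length, which this sketch leaves out.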

  23. Experimental Results: Microsoft Video Description Dataset (MSVD)
      - 1970 YouTube videos: training set (1200), validation set (100), and test set (670)
      - Each clip lasts 10 to 25 seconds
      - Each video has about 40 sentences

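  24. Experimental Results
      - [Table: comparison with other methods, including work from CVPR 2017 and ICLR 2017; the table contents were not preserved.]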

  25. Experimental Results: Microsoft Research Video to Text Dataset (MSR-VTT)
      - The largest dataset in terms of sentences and vocabulary: 10,000 video clips and 200,000 sentences
      - Each video is labelled with about 20 sentences
      - Training set (6513), validation set (497), and test set (2990)

  26. Experimental Results: description generation
      - [Figure: example generated descriptions. M3: our model; SA: Yao et al., ICCV 2015.]

  28. Conclusion & Future Work
      - Textual/Visual/Attribute Memory
      - [Figure: Working Memory model, Baddeley et al., with three components: visuospatial sketchpad, central executive, phonological loop.]

  29. Acknowledgement: NVAIL Artificial Intelligence Laboratory, for sponsoring excellent hardware resources

  30. Thank you! (Q/A) wangliang@nlpr.ia.ac.cn
