Sequence to Sequence – Video to Text
A PowerPoint presentation on Venugopalan et al., by Garrett Bingham.


  1. Title slide: Sequence to Sequence – Video to Text (Venugopalan et al.), presented by Garrett Bingham.

  2. Problem: Given a variable-length sequence of video frames, generate a variable-length natural language description of the video.

  3. Motivation: Video description in general has applications in human-robot interaction, video indexing, and describing movies for the blind. Sequence to sequence in particular: video descriptions should be sensitive to temporal structure and should allow input and output of variable length. Previous work handled variable-length input with holistic video representations, pooling over frames, or sub-sampling a fixed number of input frames.

  4. Approach (architecture diagram; no slide text).

  5. Encoding & Decoding. Encoding: LSTMs encode the frame sequence into a hidden state; the hidden representation is concatenated with null input words; no loss is incurred while encoding. Decoding: a <BOS> tag prompts the LSTM to decode the hidden state into a sequence of words; the model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frames and the previous words. Loss is propagated back in time, allowing the LSTM to learn an appropriate hidden state representation while encoding.
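The objective on this slide can be sketched in a few lines: encoding steps contribute no loss, and each decoding step contributes the negative log-probability of the ground-truth word. This is a toy illustration with made-up probabilities, not the authors' code:

```python
import math

# Toy sketch of the S2VT-style training loss: the encoding steps
# contribute nothing; the loss is the negative log-likelihood of the
# ground-truth words summed over the decoding steps only.

def sequence_nll(step_probs, target_ids):
    """step_probs: for each decode step, a dict word_id -> probability.
    target_ids: ground-truth word ids. Returns -log p(sentence)."""
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

# Encoding steps are simply skipped; only these two decode steps count.
decode_probs = [{0: 0.7, 1: 0.3}, {0: 0.2, 1: 0.8}]
loss = sequence_nll(decode_probs, [0, 1])   # -(log 0.7 + log 0.8)
```

Minimizing this sum over decode steps is equivalent to maximizing the log-likelihood of the sentence given the encoded frames.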

  6. Input Data. RGB frames: pre-trained CNNs process the video frames; the fully-connected classification layer is replaced with a linear embedding to a 500-dimensional space. Optical flow: the output of a CNN pre-trained on the UCF101 video dataset is mapped to a 500-dimensional space. RGB + Flow: shallow fusion. Text: one-hot vector encoding.
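The "linear embedding to a 500-dimensional space" amounts to multiplying the CNN's penultimate feature by a learned matrix. A minimal sketch, assuming a 4096-dimensional fc7-style feature (the exact feature size is an assumption here, not taken from the slide):

```python
import numpy as np

# Sketch of the frame-input pipeline: a pre-trained CNN's penultimate
# activation (assumed 4096-d here) is projected by a learned linear
# embedding into the LSTM's 500-d input space, replacing the
# classification layer. Weights are random stand-ins.

rng = np.random.default_rng(0)
fc7 = rng.standard_normal(4096)                     # one frame's CNN feature
W_embed = rng.standard_normal((4096, 500)) * 0.01   # learned embedding matrix
frame_input = fc7 @ W_embed                         # 500-d vector fed to the LSTM
```

The same projection idea applies to the flow CNN's output; words enter the model as one-hot vectors times their own embedding matrix.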

  7. Datasets. MSVD: Mechanical Turk workers collected short clips depicting a single activity and described each video with a single sentence; a multilingual corpus, but only the English descriptions are used. MPII-MD: contains video clips extracted from Hollywood movies, along with movie scripts/audio description data; challenging due to diverse visual/textual content. M-VAD: similar to MPII-MD. "Together they form the largest parallel corpora with open domain video and natural language descriptions."

  8. METEOR: "METEOR compares exact token matches, stemmed tokens, paraphrase matches, as well as semantically similar matches using WordNet synonyms."
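For intuition, here is only the exact-token-match stage of METEOR, with its recall-weighted F-mean; the stemming, paraphrase, and WordNet-synonym stages and the fragmentation penalty from the full metric are omitted:

```python
from collections import Counter

# Simplified sketch of METEOR's exact-match stage only: clipped unigram
# matches, then the recall-weighted harmonic mean 10PR / (R + 9P).
# Real METEOR adds stemmed/paraphrase/synonym matching and a
# fragmentation penalty.

def exact_match_fmean(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    matches = sum((Counter(hyp) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    p, r = matches / len(hyp), matches / len(ref)
    return 10 * p * r / (r + 9 * p)

score = exact_match_fmean("a man is playing a guitar",
                          "a man plays a guitar")
```

Even this reduced form shows why METEOR is forgiving of word-order and length differences compared with pure n-gram precision metrics.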

  9. Results: Random frame order hurts performance, implying the full model learns temporal features. Flow images alone do poorly, but outperform previous work when combined with RGB. Movie datasets are hard.

  10. Examples: "A subject is verbing an object."

  11. M-VAD is much more difficult: the descriptions are complex and have a unique style. This would be difficult for humans too!

  12. Conclusion: First sequence-to-sequence approach to video description. Learns the temporal structure of the data. State-of-the-art performance on MSVD. Outperforms related work on MPII-MD and M-VAD. A simple approach outperforms more complicated ones (e.g. GoogLeNet + 3D-CNN).

  13. Critique. Only one metric: the authors justify using METEOR over other metrics, but adding other metrics would have been straightforward and potentially insightful (e.g. what fraction of descriptions are relevant-ish?). Rudimentary RGB + Flow fusion: we can do more than just tune a single α parameter. Significance of results: the improvements on each dataset are small (29.6 → 29.8, 7.0 → 7.1, 6.3 → 6.7), raising the question: are we really benefiting from temporal information? Statistical significance tests would be helpful.
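The "single α parameter" fusion being critiqued is just a convex combination of the two models' per-word scores. A minimal sketch with made-up probabilities (the word lists and values are illustrative, not from the paper):

```python
# Sketch of shallow fusion: the RGB and flow models' per-word scores
# are combined with one tuned weight alpha. All values are made up.

def fuse(rgb_probs, flow_probs, alpha=0.5):
    """Convex combination of two per-word score dictionaries."""
    return {w: alpha * rgb_probs.get(w, 0.0)
               + (1 - alpha) * flow_probs.get(w, 0.0)
            for w in set(rgb_probs) | set(flow_probs)}

rgb  = {"man": 0.6, "woman": 0.4}
flow = {"man": 0.3, "woman": 0.7}
fused = fuse(rgb, flow, alpha=0.7)   # "man": 0.7*0.6 + 0.3*0.3 = 0.51
```

The critique is that richer fusion (e.g. learned joint features rather than one scalar weight over output scores) is readily available.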

  14. Critique (continued). Lack of creativity: "... 42.9% of the predictions are identical to some training sentence, and another 38.3% can be obtained by inserting, deleting or substituting one word from some sentence in the training corpus."
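The quoted 38.3% statistic corresponds to a word-level edit distance of one. A sketch of that check (my own reconstruction of the criterion, not the authors' script):

```python
# Reconstruction of the criterion behind the quoted statistic: is a
# predicted sentence within ONE word-level edit (insert, delete, or
# substitute one word) of a given training sentence?

def within_one_word_edit(a, b):
    a, b = a.split(), b.split()
    if a == b:
        return True                           # identical (the 42.9% case)
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):                      # exactly one substitution?
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):                       # make b the longer sentence
        a, b = b, a
    # one insertion into a == one deletion from b
    return any(b[:i] + b[i + 1:] == a for i in range(len(b)))

within_one_word_edit("a man is cooking", "a woman is cooking")   # True
```

Running this against every training sentence for each prediction would reproduce the kind of novelty analysis the slide quotes.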

  15. Questions?
