Sequence to Sequence – Video to Text
Venugopalan et al.
Garrett Bingham
Problem
Given a variable-length sequence of video frames, generate a variable-length natural language description of the video.
Motivation
Video description in general has applications in human-robot interaction, video indexing, and describing video for the visually impaired.
Sequence to sequence in particular: video descriptions should vary in length with the content and reflect the temporal order of events in the video.
Previous work resolved variable-length input by collapsing it to a fixed-length representation, e.g. mean pooling CNN features over all frames, which discards temporal ordering.
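Mean pooling is exactly where the ordering is lost: any permutation of the frames pools to the same vector. A toy numpy sketch (the 3-dimensional "frame features" are purely illustrative):

```python
import numpy as np

# Hypothetical frame features for a short clip: 4 frames, 3-d features each.
frames = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
])

# Mean pooling collapses the sequence into one fixed-size vector...
pooled = frames.mean(axis=0)

# ...so the same clip played in reverse pools to the exact same vector:
pooled_reversed = frames[::-1].mean(axis=0)

print(np.allclose(pooled, pooled_reversed))  # True: temporal order is lost
```

A sequence model that reads frames one step at a time, by contrast, produces a different hidden state for different frame orders.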
Encoding & Decoding
Encoding: a stacked LSTM reads the sequence of frame features one at a time, building a hidden-state representation of the video. During this stage the frame input is concatenated with null input words, and no loss is computed.
Decoding: the LSTM decodes the hidden state into a sequence of words, predicting the output sentence given the hidden representation of the visual frames and the previous words. Loss is propagated back in time, allowing the LSTM to learn an appropriate hidden-state representation while encoding.
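A minimal sketch of the encode-then-decode flow, with a plain tanh recurrent cell standing in for the paper's two-layer LSTM. Dimensions, vocabulary, and weights are all toy and untrained (so the decoded words are arbitrary); the point is only the padding scheme — null words while encoding, null frames while decoding:

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM, WORD_DIM, HID = 500, 8, 16    # toy sizes; paper uses 500-d inputs
VOCAB = ["<BOS>", "a", "man", "is", "cooking", "<EOS>"]

# Toy parameters of a single recurrent cell (stand-in for the LSTM stack).
W_in = rng.normal(0, 0.1, (HID, FRAME_DIM + WORD_DIM))
W_h = rng.normal(0, 0.1, (HID, HID))
W_out = rng.normal(0, 0.1, (len(VOCAB), HID))

def step(h, frame_vec, word_vec):
    """One recurrent step on the concatenated (frame, word) input."""
    x = np.concatenate([frame_vec, word_vec])
    return np.tanh(W_in @ x + W_h @ h)

def embed_word(i):
    v = np.zeros(WORD_DIM)
    v[i % WORD_DIM] = 1.0
    return v

null_word = np.zeros(WORD_DIM)    # padding word used while encoding
null_frame = np.zeros(FRAME_DIM)  # padding frame used while decoding

# --- Encoding: read frames one at a time; the word input is null ---
frames = rng.normal(size=(10, FRAME_DIM))   # 10 frame feature vectors
h = np.zeros(HID)
for f in frames:
    h = step(h, f, null_word)

# --- Decoding: frame input is null; feed the previous word, emit the next ---
sentence, word = [], VOCAB.index("<BOS>")
for _ in range(6):
    h = step(h, null_frame, embed_word(word))
    word = int(np.argmax(W_out @ h))        # greedy choice (untrained: arbitrary)
    if VOCAB[word] == "<EOS>":
        break
    sentence.append(VOCAB[word])

print(sentence)
```

In the actual model, the loss on the decoded words is backpropagated through both phases, which is what lets the encoder learn a useful hidden state.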
Input Data
RGB frames: a pre-trained CNN processes each video frame; its final classification layer is replaced with a linear embedding to a 500-dimensional space.
Optical flow: the output of a CNN pre-trained on the UCF101 video dataset is mapped to the same 500-dimensional space.
RGB + Flow: shallow fusion of the two models.
Text: words are represented as one-hot vectors.
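The two visual inputs reduce to the same operation: a learned linear map from a CNN feature vector to 500 dimensions. A sketch with hypothetical sizes (the 4096-d feature and the vocabulary are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical CNN feature (e.g. a 4096-d activation) mapped to 500-d by a
# linear embedding that is learned jointly with the LSTM.
cnn_feature = rng.normal(size=4096)
W_embed = rng.normal(0, 0.01, (500, 4096))
frame_input = W_embed @ cnn_feature
print(frame_input.shape)   # (500,)

# Text input: one-hot vector over a (toy) vocabulary.
vocab = {"a": 0, "man": 1, "is": 2, "cooking": 3}
one_hot = np.zeros(len(vocab))
one_hot[vocab["cooking"]] = 1.0
print(one_hot)   # [0. 0. 0. 1.]
```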
Datasets
MSVD: Mechanical Turk workers collected short clips depicting a single activity and described each video with a single sentence. The corpus is multilingual, but only the English descriptions are used.
MPII-MD: contains video clips extracted from Hollywood movies, along with movie scripts and audio description data. Challenging due to diverse visual and textual content.
M-VAD: similar to MPII-MD. "Together they form the largest parallel corpora with open domain video and natural language descriptions."
METEOR
“METEOR compares exact token matches, stemmed tokens, paraphrase matches, as well as semantically similar matches using WordNet synonyms.”
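A much-simplified sketch of the first two matching stages (exact tokens and stemmed tokens). Real METEOR also matches paraphrases and WordNet synonyms, aligns tokens, and applies precision/recall weighting with a fragmentation penalty; none of that is captured here, and `crude_stem` is only a stand-in for a real stemmer:

```python
def crude_stem(w):
    # Naive suffix stripping; a stand-in for METEOR's actual stemmer.
    for suf in ("ing", "ed", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def match_fraction(candidate, reference):
    """Fraction of candidate tokens matched exactly or by crude stem."""
    cand, ref_left = candidate.split(), reference.split()
    matched = 0
    for w in cand:
        for r in ref_left:
            if w == r or crude_stem(w) == crude_stem(r):
                matched += 1
                ref_left.remove(r)   # each reference token matches at most once
                break
    return matched / len(cand)

# "cooking" matches "cooks" via stemming; "a" and "man" match exactly.
print(match_fraction("a man is cooking", "the man cooks a meal"))  # 0.75
```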
Results
The optical-flow input shows that the model learns temporal features, and flow improves performance when combined with RGB.
Examples
Generated MSVD descriptions tend to fit a simple template: "A subject is verbing an object."
M-VAD is much more difficult: the descriptions are complex and have a unique style. This would be difficult for humans too!
Conclusion
S2VT maps a sequence of video frames directly to a natural-language description with a single sequence-to-sequence LSTM model.
Critique
Only one metric: the authors justify using METEOR over BLEU, but reporting additional metrics would be straightforward and potentially insightful (e.g. what fraction of generated sentences are novel relative to the training set).
Rudimentary RGB + Flow fusion: we can do more than just tune a single α parameter.
Significance of results: the improvements on each dataset are small (29.6 → 29.8, 7.0 → 7.1, 6.3 → 6.7), raising the question: are we really benefiting from temporal information? Statistical significance tests would be helpful.
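The fusion the critique refers to is a single convex weight α over the two decoders' per-word probabilities. A sketch with hypothetical distributions over a 4-word vocabulary at one decoding step:

```python
import numpy as np

# Hypothetical next-word distributions from the RGB and flow decoders.
p_rgb = np.array([0.70, 0.10, 0.10, 0.10])
p_flow = np.array([0.40, 0.40, 0.10, 0.10])

alpha = 0.66                               # the single tuned parameter
p_fused = alpha * p_rgb + (1 - alpha) * p_flow

print(p_fused.round(3))                    # still a valid distribution
```

One α applied uniformly at every step is the whole model of interaction; a richer fusion could, for instance, learn context-dependent weights, which is the critique's point.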
Lack of creativity: “... 42.9% of the predictions are identical to some training sentence, and another 38.3% can be obtained by inserting, deleting or substituting one word from some sentence in the training corpus.”
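The quoted statistics are easy to reproduce in spirit: count predictions that appear verbatim in the training set, and those within one word edit of some training sentence. A toy sketch with made-up data (note that unlike the quote's additive breakdown, `near` here includes the exact matches):

```python
def within_one_edit(sent, other):
    """True if sent equals other or differs by one word insert/delete/substitute."""
    a, b = sent.split(), other.split()
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):                      # one substitution
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    for i in range(len(long_)):               # one insertion/deletion
        if long_[:i] + long_[i + 1:] == short:
            return True
    return False

# Made-up training sentences and predictions, for illustration only.
train = ["a man is cooking", "a dog is running", "a woman is slicing an onion"]
preds = ["a man is cooking", "a cat is running", "a woman is slicing bread quickly"]

identical = sum(p in train for p in preds) / len(preds)
near = sum(any(within_one_edit(p, t) for t in train) for p in preds) / len(preds)
print(identical, near)
```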