Sequence to Sequence – Video to Text, Venugopalan et al. (PowerPoint presentation)



SLIDE 1

Sequence to Sequence – Video to Text

Venugopalan et al.

Garrett Bingham

SLIDE 2

Problem

Given a variable-length sequence of video frames, generate a variable-length natural language description of the video.


SLIDE 3

Motivation

Video description in general has applications in

  • Human-robot interaction
  • Video indexing
  • Describing movies for the blind

Sequence to sequence in particular: video descriptions should

  • Be sensitive to temporal structure
  • Allow input and output of variable length

Previous work resolved variable length input with

  • Holistic video representations
  • Pooling over frames
  • Sub-sampling on a fixed number of input frames


SLIDE 4

Approach


SLIDE 5

Encoding & Decoding

Encoding:

  • LSTMs encode the frame sequence
  • The hidden representation is concatenated with null input words
  • No loss is computed while encoding


Decoding:

  • A <BOS> tag prompts the LSTM to decode the hidden state into a sequence of words
  • The model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frames and the previous words
  • Loss is backpropagated through time, allowing the LSTM to learn an appropriate hidden state representation while encoding
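The training objective above can be sketched in a few lines of numpy. This is a minimal illustration of summing log p(word_t | hidden state, previous words) over a decoded sentence; the vocabulary size, logits, and word indices are made up for illustration, not taken from the paper.

```python
import numpy as np

def sentence_log_likelihood(step_logits, target_ids):
    """Sum of log p(word_t | h, words_<t) over an output sentence.

    step_logits: (T, V) array, one row of vocabulary logits per decode step
    target_ids:  length-T sequence of target word indices
    """
    total = 0.0
    for logits, word_id in zip(step_logits, target_ids):
        # numerically stable log-softmax: logits - logsumexp(logits)
        m = logits.max()
        log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
        total += log_probs[word_id]
    return total

# Toy example: 3 decode steps over a 5-word vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
ll = sentence_log_likelihood(logits, [2, 0, 4])
assert ll < 0.0  # log-probabilities are always negative
```

In training, the negative of this quantity is the loss that is backpropagated through both the decoding and encoding steps of the LSTM stack.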

SLIDE 6

Input Data

  • RGB frames: pre-trained CNNs process the video frames. The fully-connected classification layer is replaced with a linear embedding into a 500-dimensional space.
  • Optical flow: the output of a CNN pre-trained on the UCF101 video dataset is mapped into the same 500-dimensional space.
  • RGB + Flow: shallow fusion
  • Text: one-hot vector encoding
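Two of the input-handling pieces above are simple enough to sketch directly: shallow fusion with a single tunable α, and one-hot word encoding. This is a sketch assuming the fusion is applied to the two models' per-word probability distributions; the α value and array sizes here are illustrative, not the paper's.

```python
import numpy as np

def shallow_fusion(p_rgb, p_flow, alpha=0.5):
    """Combine the RGB model's and the flow model's per-word
    probabilities with a single tunable weight alpha."""
    return alpha * p_rgb + (1.0 - alpha) * p_flow

def one_hot(index, vocab_size):
    """One-hot encoding of a word index."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Toy 3-word vocabulary: fusing two valid distributions yields a
# valid distribution for any alpha in [0, 1].
p_rgb = np.array([0.7, 0.2, 0.1])
p_flow = np.array([0.5, 0.3, 0.2])
fused = shallow_fusion(p_rgb, p_flow, alpha=0.6)
assert abs(fused.sum() - 1.0) < 1e-9
assert one_hot(1, 3).tolist() == [0.0, 1.0, 0.0]
```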


SLIDE 7

Datasets

  • MSVD: Mechanical Turk workers collected short clips depicting a single activity and described each video with a single sentence. Multilingual corpus, but only the English descriptions are used.
  • MPII-MD: video clips extracted from Hollywood movies, along with movie scripts and audio description data. Challenging due to the diverse visual and textual content.
  • M-VAD: similar to MPII-MD.

"Together they form the largest parallel corpora with open domain video and natural language descriptions."


SLIDE 8

METEOR

“METEOR compares exact token matches, stemmed tokens, paraphrase matches, as well as semantically similar matches using WordNet synonyms.”
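The core of the metric can be sketched with the exact-token-match case only. Real METEOR also matches stems, paraphrases, and WordNet synonyms, and applies a fragmentation penalty for out-of-order matches; this simplified sketch keeps just unigram matching and METEOR's recall-weighted harmonic mean (10PR / (R + 9P)), with made-up example sentences.

```python
from collections import Counter

def unigram_fscore(candidate, reference, recall_weight=9.0):
    """Exact-match unigram precision/recall combined with a
    METEOR-style recall-weighted harmonic mean. (Real METEOR adds
    stem/paraphrase/synonym matching and a fragmentation penalty.)"""
    cand, ref = candidate.split(), reference.split()
    # Clipped unigram matches between candidate and reference
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return ((1 + recall_weight) * precision * recall
            / (recall_weight * precision + recall))

score = unigram_fscore("a man is playing a guitar", "a man plays the guitar")
assert 0.0 < score < 1.0  # partial overlap, partial credit
```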


SLIDE 9

Results

  • Random frame order hurts performance, implying the full model learns temporal features
  • Flow images alone do poorly, but outperform previous work when combined with RGB
  • Movie datasets are hard


SLIDE 10

Examples

A subject is verbing an object.


SLIDE 11

Examples

M-VAD is much more difficult: the descriptions are complex and have a unique style. This would be difficult for humans too!


SLIDE 12

Conclusion

  • First sequence to sequence approach to video description
  • Learns the temporal structure of the data
  • State-of-the-art performance on MSVD
  • Outperforms related work on MPII-MD and M-VAD
  • A simple approach outperforms more complicated ones (e.g. GoogLeNet + 3D-CNN)


SLIDE 13

Critique

Only one metric: the authors justify using METEOR over other metrics, but adding other metrics would have been straightforward and potentially insightful (e.g. what fraction of descriptions are relevant-ish?).

Rudimentary RGB + Flow fusion: we can do more than just tune a single α parameter.

Significance of results: the improvements on each dataset are small (29.6 → 29.8, 7.0 → 7.1, 6.3 → 6.7), raising the question: are we really benefiting from temporal information? Statistical significance tests would be helpful.
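One such significance test is a paired bootstrap over per-sentence scores, which needs nothing beyond the per-sentence METEOR values the authors already computed. The sketch below uses synthetic scores purely for illustration; none of the numbers correspond to the paper's data.

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=2000, seed=0):
    """Approximate p-value for the null hypothesis that system B's
    mean per-sentence score is no better than system A's, via paired
    bootstrap resampling of test sentences."""
    rng = random.Random(seed)
    n = len(scores_a)
    failures = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the pairing
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if delta <= 0:  # resampled difference fails to favor B
            failures += 1
    return failures / n_resamples

# Synthetic per-sentence scores: a baseline and a slightly improved system
rng = random.Random(1)
base = [rng.uniform(0.2, 0.4) for _ in range(200)]
improved = [s + rng.uniform(-0.02, 0.04) for s in base]
p = paired_bootstrap_pvalue(base, improved)
assert 0.0 <= p <= 1.0
```

A small p-value would indicate the METEOR gain is unlikely to be a sampling artifact of the particular test sentences.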


SLIDE 14

Critique

Lack of creativity: “... 42.9% of the predictions are identical to some training sentence, and another 38.3% can be obtained by inserting, deleting or substituting one word from some sentence in the training corpus.”


SLIDE 15

Questions?