Sequence to Sequence – Video to Text, Venugopalan et al. (PowerPoint presentation)



SLIDE 1

Sequence to Sequence – Video to Text

Venugopalan et al.

Garrett Bingham

SLIDE 2

Problem

Given a variable-length sequence of video frames, generate a variable-length natural language description of the video.


SLIDE 3

Motivation

Video description in general has applications in

  • Human-robot interaction
  • Video indexing
  • Describing movies for the blind

Sequence to sequence in particular: video descriptions should

  • Be sensitive to temporal structure
  • Allow input and output of variable length

Previous work resolved variable length input with

  • Holistic video representations
  • Pooling over frames
  • Sub-sampling on a fixed number of input frames


SLIDE 4

Approach


SLIDE 5

Encoding & Decoding

Encoding:

  • LSTMs encode the frame sequence
  • The hidden representation is concatenated with null input words
  • No loss is computed while encoding


Decoding:

  • A <BOS> tag prompts the LSTM to decode the hidden state into a sequence of words
  • The model maximizes the log-likelihood of the predicted output sentence given the hidden representation of the visual frames and the previous words
  • Loss is backpropagated through time, allowing the LSTM to learn an appropriate hidden state representation while encoding
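The training objective above can be sketched in a few lines of numpy. This is a minimal illustration of summing log p(word_t | hidden state, previous words) over a decoded sentence; the vocabulary size, logits, and word indices are made up for illustration, not taken from the paper.

```python
import numpy as np

def sentence_log_likelihood(step_logits, target_ids):
    """Sum of log p(word_t | h, words_<t) over an output sentence.

    step_logits: (T, V) array, one row of vocabulary logits per decode step
    target_ids:  length-T sequence of target word indices
    """
    total = 0.0
    for logits, word_id in zip(step_logits, target_ids):
        # numerically stable log-softmax: logits - logsumexp(logits)
        m = logits.max()
        log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
        total += log_probs[word_id]
    return total

# Toy example: 3 decode steps over a 5-word vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
ll = sentence_log_likelihood(logits, [2, 0, 4])
assert ll < 0.0  # log-probabilities are always negative
```

In training, the negative of this quantity is the loss that is backpropagated through both the decoding and encoding steps of the LSTM stack.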

SLIDE 6

Input Data

  • RGB frames: pre-trained CNNs process the video frames. The fully-connected classification layer is replaced with a linear embedding into a 500-dimensional space.
  • Optical flow: the output of a CNN pre-trained on the UCF101 video dataset is mapped into the same 500-dimensional space.
  • RGB + Flow: shallow fusion
  • Text: one-hot vector encoding
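Two of the input-handling pieces above are simple enough to sketch directly: shallow fusion with a single tunable α, and one-hot word encoding. This is a sketch assuming the fusion is applied to the two models' per-word probability distributions; the α value and array sizes here are illustrative, not the paper's.

```python
import numpy as np

def shallow_fusion(p_rgb, p_flow, alpha=0.5):
    """Combine the RGB model's and the flow model's per-word
    probabilities with a single tunable weight alpha."""
    return alpha * p_rgb + (1.0 - alpha) * p_flow

def one_hot(index, vocab_size):
    """One-hot encoding of a word index."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Toy 3-word vocabulary: fusing two valid distributions yields a
# valid distribution for any alpha in [0, 1].
p_rgb = np.array([0.7, 0.2, 0.1])
p_flow = np.array([0.5, 0.3, 0.2])
fused = shallow_fusion(p_rgb, p_flow, alpha=0.6)
assert abs(fused.sum() - 1.0) < 1e-9
assert one_hot(1, 3).tolist() == [0.0, 1.0, 0.0]
```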


SLIDE 7

Datasets

  • MSVD: Mechanical Turk workers collected short clips depicting a single activity and described each video with a single sentence. Multilingual corpus, but only the English descriptions are used.
  • MPII-MD: video clips extracted from Hollywood movies, along with movie scripts and audio description data. Challenging due to the diverse visual and textual content.
  • M-VAD: similar to MPII-MD.

"Together they form the largest parallel corpora with open domain video and natural language descriptions."


SLIDE 8

METEOR

“METEOR compares exact token matches, stemmed tokens, paraphrase matches, as well as semantically similar matches using WordNet synonyms.”
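The core of the metric can be sketched with the exact-token-match case only. Real METEOR also matches stems, paraphrases, and WordNet synonyms, and applies a fragmentation penalty for out-of-order matches; this simplified sketch keeps just unigram matching and METEOR's recall-weighted harmonic mean (10PR / (R + 9P)), with made-up example sentences.

```python
from collections import Counter

def unigram_fscore(candidate, reference, recall_weight=9.0):
    """Exact-match unigram precision/recall combined with a
    METEOR-style recall-weighted harmonic mean. (Real METEOR adds
    stem/paraphrase/synonym matching and a fragmentation penalty.)"""
    cand, ref = candidate.split(), reference.split()
    # Clipped unigram matches between candidate and reference
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return ((1 + recall_weight) * precision * recall
            / (recall_weight * precision + recall))

score = unigram_fscore("a man is playing a guitar", "a man plays the guitar")
assert 0.0 < score < 1.0  # partial overlap, partial credit
```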


SLIDE 9

Results

  • Random frame order hurts performance, implying the full model learns temporal features
  • Flow images alone do poorly, but outperform previous work when combined with RGB
  • Movie datasets are hard


SLIDE 10

Examples

A subject is verbing an object.


SLIDE 11

Examples

M-VAD is much more difficult: the descriptions are complex and have a unique style. This would be difficult for humans too!


SLIDE 12

Conclusion

  • First sequence to sequence approach to video description
  • Learns the temporal structure of the data
  • State-of-the-art performance on MSVD
  • Outperforms related work on MPII-MD and M-VAD
  • A simple approach outperforms more complicated ones (e.g. GoogLeNet + 3D-CNN)


SLIDE 13

Critique

Only one metric: the authors justify using METEOR over other metrics, but adding other metrics would have been straightforward and potentially insightful (e.g. what fraction of descriptions are relevant-ish?).

Rudimentary RGB + Flow fusion: we can do more than just tune a single α parameter.

Significance of results: the improvements on each dataset are small (29.6 → 29.8, 7.0 → 7.1, 6.3 → 6.7), raising the question: are we really benefiting from temporal information? Statistical significance tests would be helpful.
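One such significance test is a paired bootstrap over per-sentence scores, which needs nothing beyond the per-sentence METEOR values the authors already computed. The sketch below uses synthetic scores purely for illustration; none of the numbers correspond to the paper's data.

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=2000, seed=0):
    """Approximate p-value for the null hypothesis that system B's
    mean per-sentence score is no better than system A's, via paired
    bootstrap resampling of test sentences."""
    rng = random.Random(seed)
    n = len(scores_a)
    failures = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the pairing
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_b[i] - scores_a[i] for i in idx) / n
        if delta <= 0:  # resampled difference fails to favor B
            failures += 1
    return failures / n_resamples

# Synthetic per-sentence scores: a baseline and a slightly improved system
rng = random.Random(1)
base = [rng.uniform(0.2, 0.4) for _ in range(200)]
improved = [s + rng.uniform(-0.02, 0.04) for s in base]
p = paired_bootstrap_pvalue(base, improved)
assert 0.0 <= p <= 1.0
```

A small p-value would indicate the METEOR gain is unlikely to be a sampling artifact of the particular test sentences.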


SLIDE 14

Critique

Lack of creativity: “... 42.9% of the predictions are identical to some training sentence, and another 38.3% can be obtained by inserting, deleting or substituting one word from some sentence in the training corpus.”


SLIDE 15

Questions?