Natural-Language Video Description with Deep Recurrent Neural Networks


  1. Natural-Language Video Description with Deep Recurrent Neural Networks. June 2017. Subhashini Venugopalan, University of Texas at Austin.

  2. Problem Statement: Generate descriptions for events depicted in video clips. Example: "A monkey pulls a dog's tail and is chased by the dog."

  3. Applications
     ● Image and video retrieval by content
     ● Video description service (e.g., "Children are wearing green shirts. They are dancing as they sing the carol.")
     ● Human-robot interaction
     ● Video surveillance

  4. Outline
     ● Review (proposal)
       ○ Background
       ○ Encoder-Decoder approaches to video description
     ● External knowledge to improve video description
     ● External knowledge for novel object captioning
     ● Temporal segmentation and description for long videos
     ● Future Directions

  5. Early Work in Video Description
     ● Extract features; classify objects, actions, and scenes.
     ● Visual confidences over entities (Subject, Verb, Object, Scene):

       Subjects         Verbs         Objects        Scenes
       person  0.95     slice  0.19   egg     0.31   kitchen  0.64
       monkey  0.01     chop   0.11   onion   0.21   sky      0.17
       animal  0.01     play   0.09   potato  0.20   house    0.07
       ...              ...           ...            ...
       parrot  0        speak  0      piano   0      snow     0

     ● Bias with statistics from language.
     ● A factor graph estimates the most likely entities (S, V, O, P).
     ● Template-based sentence generation: "A person is slicing an onion in the kitchen."
     J. Thomason*, S. Venugopalan*, S. Guadarrama, K. Saenko, R. Mooney. COLING'14
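A toy sketch of the last two steps (entity selection and template filling) may help. The confidences are the ones on the slide; note that a purely visual argmax would pick "egg", and it is the language-statistics bias in the factor graph that shifts the choice to "onion" in the slide's sentence.

```python
# Toy sketch of entity selection + template-based generation.
# Confidences are taken from the slide. A pure visual argmax picks
# "egg"; the real system combines these scores with language
# statistics in a factor graph, which is how it reaches "onion".
subjects = {"person": 0.95, "monkey": 0.01, "animal": 0.01}
verbs = {"slicing": 0.19, "chopping": 0.11, "playing": 0.09}
objects = {"egg": 0.31, "onion": 0.21, "potato": 0.20}
scenes = {"kitchen": 0.64, "sky": 0.17, "house": 0.07}

pick = lambda d: max(d, key=d.get)  # highest-confidence entity
print("A {} is {} an {} in the {}.".format(
    pick(subjects), pick(verbs), pick(objects), pick(scenes)))
# -> "A person is slicing an egg in the kitchen."
```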

  6. Early Work in Video Description. Limitations:
     ● Narrow domains
     ● Small grammars
     ● Template-based sentences
     ● Several features and classifiers: which objects/actions/scenes should we build classifiers for?
     [Guadarrama et al. ICCV'13] [Yu and Siskind ACL'13] [Rohrbach et al. ICCV'13] [Thomason et al. COLING'14]

  7. Can we learn directly from video-sentence pairs, without having to explicitly identify objects/actions/scenes to build classifiers? S. Venugopalan, H. Xu, M. Rohrbach, J. Donahue, R. Mooney, K. Saenko. NAACL'15

  8. Outline
     ● Review (proposal)
       ○ Background
       ○ Encoder-Decoder approaches to video description
     ● External knowledge to improve video description
     ● External knowledge for novel object captioning
     ● Temporal segmentation and description for long videos
     ● Future Directions

  9. Deep Neural Networks
     Convolutional Neural Networks:
     ● Features and classifiers are jointly learned.
     ● Maps directly from raw pixels and labels.
     Recurrent Neural Networks:
     ● RNNs can model sequences.
     ● Successful in translation, speech.
     ● We use LSTMs.

  10. Recurrent Neural Networks (RNNs) can map a vector to a sequence.
      ● English sentence → RNN encoder → RNN decoder → French sentence [Sutskever et al. NIPS'14]
      ● Image → encoder → RNN decoder → sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]
      ● Video → encoder → RNN decoder → sentence [Venugopalan et al. NAACL'15]
      Key insight: generate a feature representation of the video and "decode" it to a sentence.

  11. Inference: Feature Extraction
      [Diagram: each video frame is forward-propagated through a CNN; a two-layer LSTM then emits "A boy is playing golf <EOS>"]
      ● Forward propagate each frame through the CNN.
      ● Output: "fc7" features, the 4096-dimensional activations before the classification layer (the frame's "feature vector").
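As a concrete illustration, here is a minimal sketch of this per-frame feature extraction in PyTorch. torchvision's ImageNet-trained VGG16 stands in for the CNN in the talk, and frames are assumed to be already resized and normalized.

```python
# Sketch: extract 4096-d "fc7"-style features (activations before the
# final classification layer) for a stack of video frames.
import torch
import torchvision.models as models

cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
cnn.eval()
# Drop the final Linear layer so the network outputs the 4096-d
# penultimate activation instead of 1000 class scores.
cnn.classifier = torch.nn.Sequential(*list(cnn.classifier.children())[:-1])

@torch.no_grad()
def extract_fc7(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 224, 224), already normalized.
    Returns (num_frames, 4096) feature vectors."""
    return cnn(frames)
```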

  12. Inference: Mean Pool & Generation
      [Diagram: input video → convolutional net (per-frame features, averaged) → recurrent net (two-layer LSTM) → output "A boy is playing golf <EOS>"]
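The mean-pool model reduces the whole clip to one vector before decoding. A minimal sketch with greedy decoding; the hidden size, vocabulary size, and <BOS>/<EOS> indices are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the mean-pool captioner: average per-frame CNN features
# into one video vector, then decode words with an LSTM until <EOS>.
import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)  # video vector -> LSTM input
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def generate(self, frame_feats, bos=1, eos=2, max_len=20):
        # frame_feats: (num_frames, feat_dim); mean-pool over time.
        video_vec = self.proj(frame_feats.mean(dim=0)).view(1, 1, -1)
        _, state = self.lstm(video_vec)        # condition on the video
        token, words = torch.tensor([[bos]]), []
        for _ in range(max_len):               # greedy decoding
            h, state = self.lstm(self.embed(token), state)
            token = self.out(h[:, -1]).argmax(-1, keepdim=True)
            if token.item() == eos:
                break
            words.append(token.item())
        return words
```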

  13. Translating Videos to Natural Language
      [Diagram: mean-pooled CNN features feed a two-layer LSTM that emits "A boy is playing golf <EOS>"]
      Limitation: does not consider the temporal sequence of frames.

  14. Recurrent Neural Networks (RNNs) can map a vector to a sequence.
      ● English sentence → RNN encoder → RNN decoder → French sentence [Sutskever et al. NIPS'14]
      ● Image → encoder → RNN decoder → sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]
      ● Video → encoder → RNN decoder → sentence [Venugopalan et al. NAACL'15]
      ● Video → RNN encoder → RNN decoder → sentence [Venugopalan et al. ICCV'15]

  15. S2VT: Sequence to Sequence, Video to Text
      [Diagram: per-frame CNN features are read by a two-layer LSTM stack (encoding stage); the same stack then emits "A man is talking ..." (decoding stage). Now decode it to a sentence!]
      S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko. ICCV'15
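A minimal sketch of the S2VT idea follows. The paper runs both stages through one LSTM stack, padding the word channel while encoding and the frame channel while decoding; for clarity this sketch simply runs the two stages back to back through the shared LSTM. Sizes and special-token indices are illustrative.

```python
# Sketch of S2VT: one shared LSTM first reads frame features
# (encoding stage), then emits words from the same recurrent state
# (decoding stage).
import torch
import torch.nn as nn

class S2VT(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def generate(self, frame_feats, bos=1, eos=2, max_len=20):
        # Encoding stage: run the LSTM over the whole frame sequence.
        enc_in = self.feat_proj(frame_feats).unsqueeze(0)  # (1, T, hidden)
        _, state = self.lstm(enc_in)
        # Decoding stage: same LSTM, now fed word embeddings.
        token, words = torch.tensor([[bos]]), []
        for _ in range(max_len):
            h, state = self.lstm(self.embed(token), state)
            token = self.out(h[:, -1]).argmax(-1, keepdim=True)
            if token.item() == eos:
                break
            words.append(token.item())
        return words
```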

  16. Frames: RGB, Flow
      1. RGB frames: CNN trained on the 1000 ImageNet categories [Simonyan and Zisserman ICLR'15].
      2. Use optical flow to extract flow images [T. Brox et al. ECCV'04].
      3. Train a CNN (modified AlexNet) on the 101 activity classes of UCF-101 [Donahue et al. CVPR'15].
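For step 2, here is a sketch of turning consecutive frames into a flow image a CNN can consume. The talk uses Brox et al.'s flow; OpenCV's Farneback method is substituted here as a readily available stand-in, and the 3-channel packing (x-flow, y-flow, magnitude) follows common practice rather than the paper's exact recipe.

```python
# Sketch: compute dense optical flow between two grayscale frames and
# pack it into a 3-channel uint8 "flow image".
import cv2
import numpy as np

def flow_image(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    mag = np.linalg.norm(flow, axis=2)          # flow magnitude
    img = np.dstack([flow[..., 0], flow[..., 1], mag])
    # Rescale to 0..255 so the CNN can treat it like an RGB frame.
    img = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)
    return img.astype(np.uint8)
```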

  17. Experiments: Dataset
      Microsoft Research Video Description dataset [Chen & Dolan, ACL'11]
      Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
      • 1970 YouTube video snippets, 10-30s each, typically a single activity
      • 1200 training, 100 validation, 670 test
      • Annotations: descriptions in multiple languages; ~40 English descriptions per video; descriptions and videos collected on AMT

  18. Sample videos and gold descriptions
      Video 1 (oxen plowing):
      ● A man appears to be plowing a rice field with a plow being pulled by two oxen.
      ● A team of water buffalo pull a plow through a rice paddy.
      ● Domesticated livestock are helping a man plow.
      ● A man leads a team of oxen down a muddy path.
      ● Two oxen walk through some mud.
      ● A man is tilling his land with an ox pulled plow.
      ● Bulls are pulling an object.
      ● Two oxen are plowing a field.
      ● The farmer is tilling the soil.
      ● A man in ploughing the field.
      Video 2 (tightrope walking):
      ● A man is walking on a rope.
      ● A man is walking across a rope.
      ● A man is balancing on a rope.
      ● A man is balancing on a rope at the beach.
      ● A man walks on a tightrope at the beach.
      ● A man is balancing on a volleyball net.
      ● A man is walking on a rope held by poles.
      ● A man balanced on a wire.
      ● The man is balancing on the wire.
      ● A man is walking on a rope.
      ● A man is standing in the sea shore.

  19. Movie Corpus: DVS
      DVS (Descriptive Video Service): a separate audio track for the visually impaired.
      Processed examples: "Someone rushes into the courtyard. She then puts a head scarf on ..." / "Looking troubled, someone descends the stairs."

  20. Evaluation: Movie Corpora
      M-VAD [Torabi et al. arXiv'15]
      ● Univ. of Montreal
      ● DVS alignment: semi-automated and crowdsourced
      ● 92 movies, 46,009 clips
      ● Avg. length: 6.2s per clip
      ● 1-2 sentences per clip; 56,634 sentences
      MPII-MD [Rohrbach et al. CVPR'15]
      ● MPII, Germany
      ● DVS alignment: semi-automated and crowdsourced
      ● 94 movies, 68,000 clips
      ● Avg. length: 3.9s per clip
      ● ~1 sentence per clip; 68,375 sentences

  21. Evaluation Metrics
      • Machine translation metric: METEOR (word similarity and phrasing)
      • Human evaluation: relevance and grammar
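For a quick sense of the metric, METEOR can be computed per sentence with NLTK's implementation (the reported numbers use the official METEOR tool, so scores will differ slightly; the sentences below are made up):

```python
# Sketch: METEOR for one hypothesis against multiple references.
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR matches WordNet synonyms

references = ["a man is slicing an onion".split(),
              "someone chops an onion in the kitchen".split()]
hypothesis = "a man is cutting an onion".split()
print(meteor_score(references, hypothesis))
```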

  22. Results (YouTube), METEOR:
      ● Prior work (FGM): 23.9
      ● Mean-Pool [1]: 27.7
      ● S2VT (RGB) [2]: 29.2
      ● S2VT (RGB+Flow) [2]: 29.8
      [1] S. Venugopalan, H. Xu, M. Rohrbach, J. Donahue, R. Mooney, K. Saenko. NAACL'15
      [2] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko. ICCV'15

  23. Proposed Work
      ● Short term: incorporate linguistic knowledge to improve descriptions.
      ● Long term: descriptions for longer videos.

  24. Outline
      ● Review (proposal)
        ○ Background
        ○ Encoder-Decoder approaches to video description
      ● External knowledge to improve video description
      ● External knowledge for novel object captioning
      ● Temporal segmentation and description for long videos
      ● Future Directions

  25. Can external linguistic knowledge improve descriptive quality? Unsupervised training on external text. S. Venugopalan, L.A. Hendricks, R. Mooney, K. Saenko. EMNLP'16

  26. Integrating Statistical Linguistic Knowledge. S. Venugopalan, L.A. Hendricks, R. Mooney, K. Saenko. EMNLP'16

  27. Unsupervised Training on External Text
      Fusing an LSTM language model trained on text:
      ● Early fusion
      ● Late fusion
      ● Deep fusion
      Distributional embeddings:
      ● Replace one-hot encoding with GloVe

  28. LSTM Language Model
      We learn a language model using LSTMs.
      ● Learns to predict the next word given the previous words in the sequence.
        [Diagram: inputs <BOS> A man is talking → LSTM chain → outputs A man is talking <EOS>]
      ● Data
        ○ Web corpus: Wikipedia, UkWac, BNC, Gigaword
        ○ In-domain: MSCOCO image-caption sentences
        ○ Vocabulary: 72,700 (most frequent words)
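A minimal sketch of such a next-word LSTM language model; the hidden size and the random token batch are illustrative stand-ins (the slide's model uses a 72,700-word vocabulary trained on web text):

```python
# Sketch: LSTM language model trained to predict the next word.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=72700, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)                  # next-word logits per position

# One training step: position t predicts token t+1.
model = LSTMLanguageModel()
tokens = torch.randint(0, 72700, (4, 12))  # stand-in for real text batches
logits = model(tokens[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 72700),
                             tokens[:, 1:].reshape(-1))
```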

  29. Distributional Embedding
      "You shall know a word by the company it keeps" (J. R. Firth, 1957)
      Dense vector representation of words:
      ● Semantically similar words are closer (e.g., "dolphin" near "porpoise" and "Seaworld", far from "Paris").
      We use GloVe [Pennington et al. EMNLP'14]:
      ● Trained on Wikipedia and Gigaword (6B tokens).
      ● Replace the one-hot encoded input (e.g., [1 0 0 0 0], [0 1 0 0 0], ...) with GloVe vectors.
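Swapping the one-hot input for GloVe amounts to loading the pretrained vectors into the embedding matrix. A sketch, assuming the standard "glove.6B.300d.txt" release file and a `vocab` dict mapping words to indices:

```python
# Sketch: initialize an embedding layer from pretrained GloVe vectors.
import numpy as np
import torch
import torch.nn as nn

def load_glove(path, vocab, dim=300):
    # Words missing from GloVe keep a small random initialization.
    weights = np.random.uniform(-0.05, 0.05, (len(vocab), dim))
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vals = line.rstrip().split(" ")
            if word in vocab:
                weights[vocab[word]] = np.asarray(vals, dtype=np.float32)
    return torch.tensor(weights, dtype=torch.float32)

vocab = {"<unk>": 0, "man": 1, "talking": 2}       # toy vocabulary
emb = nn.Embedding.from_pretrained(
    load_glove("glove.6B.300d.txt", vocab), freeze=False)
```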

  30. Early Fusion
      • Initialize the weights of the caption model from the LSTM language model.
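In code, early fusion is a weight copy before fine-tuning. A sketch reusing the MeanPoolCaptioner and LSTMLanguageModel classes from the earlier sketches; their embedding/LSTM/output shapes must match for the copy to succeed.

```python
# Sketch: early fusion = initialize the caption model from the LM.
lm = LSTMLanguageModel(vocab_size=10000, hidden=512)
caption_model = MeanPoolCaptioner(vocab_size=10000, hidden=512)
# ... train `lm` on external text first ...
caption_model.embed.load_state_dict(lm.embed.state_dict())
caption_model.lstm.load_state_dict(lm.lstm.state_dict())
caption_model.out.load_state_dict(lm.out.state_dict())
# ... then fine-tune `caption_model` on video-sentence pairs ...
```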

  31. Late Fusion
      Re-score the video LSTM's output with the language model; set the mixing coefficient on a validation set.
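One simple way to realize late fusion is to mix the two next-word distributions at each decoding step. The weighted-sum form and the value of `alpha` below are illustrative assumptions, with `alpha` chosen on a validation set as the slide prescribes.

```python
# Sketch: late fusion as a per-step mixture of next-word distributions.
import torch

def late_fusion_probs(caption_logits, lm_logits, alpha=0.3):
    p_cap = torch.softmax(caption_logits, dim=-1)  # caption model
    p_lm = torch.softmax(lm_logits, dim=-1)        # language model
    return (1 - alpha) * p_cap + alpha * p_lm      # re-scored distribution
```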
