Translating Videos to Natural Language Using Deep Recurrent Neural Networks



SLIDE 1

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Subhashini Venugopalan University of Texas at Austin

Subhashini Venugopalan (UT Austin), Huijuan Xu (UMass Lowell), Jeff Donahue (UC Berkeley), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Kate Saenko (UMass Lowell)

SLIDE 2

Problem Statement

Generate descriptions for events depicted in video clips.

Example: "A monkey pulls a dog's tail and is chased by the dog."

SLIDE 3

Prior Work (Pipelined approach)

  • Detect objects
  • Classify actions and scenes

[Figure: FGM pipeline (Thomason et al. COLING'14) — visual confidences over subjects, verbs, objects, and scenes:
Verbs: slice 0.19, chop 0.11, play 0.09, ... speak
Objects: egg 0.31, onion 0.21, potato 0.20, ... piano
Scenes: kitchen 0.64, sky 0.17, house 0.07, ... snow
Subjects: person 0.95, monkey 0.01, animal 0.01, ... parrot]

  • Visual confidences over entities and actions
  • Bias with language statistics
  • Factor Graph Model (FGM) estimates the most likely entities
  • Template-based sentence generation, e.g. "A person is slicing an onion in the kitchen."

SLIDE 4

Prior Work

  • Yu and Siskind, ACL'13: detect and track objects; learn HMMs for actions.
  • Rohrbach et al., ICCV'13: cooking videos; CRFs.
  • Xu et al., AAAI'15: embed video and words in the same space; retrieval; CRFs for generation.

Lots of work on image-to-text but relatively little on video-to-text.

Downside: which objects/actions/scenes should I build classifiers for?

SLIDE 5

Can we learn directly from video sentence pairs?


Without having to explicitly learn object/action/scene classifiers for our dataset.
SLIDE 6

Key Insight: Generate feature representation of the video and “decode” it to a sentence

Recurrent Neural Networks (RNNs) can map a vector to a sequence.

[Figure: encoder-decoder architectures — English sentence → RNN encoder → RNN decoder → French sentence [Sutskever et al. NIPS'14]; image → CNN encode → RNN decoder → sentence [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]; video → encode → RNN decoder → sentence]


[V. NAACL’15] (this work)

SLIDE 7

Recurrent Neural Networks (RNNs)

Insight: Each time step has a layer with the same weights.

[Figure: RNN unrolled through time t=0..3 — at each step the same hidden layer receives the input and the previous hidden state, and produces the output out_t]

Problems:
1. Hard to capture long-term dependencies
2. Vanishing gradients (gradients shrink through many layers)

Solution: Long Short-Term Memory (LSTM) unit


Models Pr(out_tn | input, out_t0, ..., out_tn-1)

SLIDE 8

[Figure: LSTM unit — a memory cell gated by input, forget, output, and input modulation gates; each gate reads x_t and h_t-1, and the unit outputs h_t (= z_t)]

LSTM Unit

LSTM

[Hochreiter and Schmidhuber ‘97] [Graves ‘13]
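The gate equations behind the diagram can be sketched in a few lines of NumPy. This is an illustrative single-step sketch, not the authors' Caffe implementation: the stacked weight matrix `W` holding all four gates is an assumed layout chosen for compactness, and peephole connections are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H), b has shape (4*H,),
    where D is the input size and H the number of hidden units."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0 * H:1 * H])   # input gate
    f = sigmoid(z[1 * H:2 * H])   # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:4 * H])   # input modulation gate
    c_t = f * c_prev + i * g      # memory cell update
    h_t = o * np.tanh(c_t)        # hidden state / output
    return h_t, c_t
```

Because the cell state is carried additively (`f * c_prev + i * g`), gradients can flow across many time steps without vanishing, which is exactly the fix the previous slide calls for.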

SLIDE 9

LSTM Sequence decoders

[Figure: unrolled decoder, t=0..3, with a layer of LSTM units (1000) between input and output at each step]

The full gradient is computed by backpropagating through time.

Matches state of the art on:
  • Speech recognition [Graves & Jaitly ICML'14]
  • Machine translation (Eng-Fr) [Sutskever et al. NIPS'14]
  • Image description [Donahue et al. CVPR'15] [Vinyals et al. CVPR'15]

SLIDE 10

LSTM Sequence decoders

[Figure: the same unrolled decoder, t=0..3, now with two stacked LSTM layers at each step]

SLIDE 11

Translating videos to natural language

[Figure: a two-layer LSTM network, unrolled over time, takes CNN features of the video and emits the sentence "A boy is playing golf <EOS>" one word per step]

SLIDE 12

Test time: Step 1

Input video: sample frames at 1/10 (every 10th frame) and scale/crop each frame to 227x227.

[Figure: step 1 (frame sampling and scaling) highlighted in the full CNN-LSTM pipeline]
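Step 1 can be sketched as follows, assuming the video is already decoded into a frame array. The center crop and the function name are illustrative assumptions; the slides only specify sampling every 10th frame and a 227x227 network input.

```python
import numpy as np

def sample_and_crop(frames, step=10, size=227):
    """Keep every `step`-th frame and center-crop each to size x size.
    `frames`: array of shape (T, H, W, 3) with H, W >= size."""
    sampled = frames[::step]                     # sample @ 1/step
    _, H, W, _ = sampled.shape
    top, left = (H - size) // 2, (W - size) // 2
    return sampled[:, top:top + size, left:left + size, :]
```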

SLIDE 13

Convolutional Neural Networks (CNNs) for feature learning

Credits: R. Girshick

  • Fukushima 1980: Neocognitron
  • Rumelhart, Hinton, Williams 1986: "T" vs "C"
  • LeCun et al. 1989-1998: handwritten digit recognition
  • Krizhevsky, Sutskever, Hinton 2012: ImageNet classification breakthrough

SLIDE 14

Test time: Step 2 Feature extraction

Forward propagate each sampled frame through the CNN and take the "fc7" activations (the layer before the classification layer) as that frame's features.

fc7: 4096-dimensional "feature vector"

[Figure: step 2 (feature extraction) highlighted in the full CNN-LSTM pipeline]

SLIDE 15

Test time: Step 3 Mean pooling

[Figure: per-frame CNN features are mean-pooled across all frames into a single vector] Arxiv: http://arxiv.org/abs/1505.00487
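Mean pooling the per-frame fc7 vectors into one video-level feature is a one-liner; a minimal sketch:

```python
import numpy as np

def mean_pool_features(fc7_per_frame):
    """Mean-pool per-frame fc7 features into one video-level vector.
    `fc7_per_frame`: array of shape (num_frames, 4096)."""
    return fc7_per_frame.mean(axis=0)
```

The resulting single 4096-d vector is what the LSTM decoder conditions on, regardless of how many frames the clip contains.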

SLIDE 16

[Figure: full pipeline — input video → convolutional net → recurrent net → output "A boy is playing golf <EOS>"]

Test time: Step 4 Generation
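Generation emits one word per LSTM step until `<EOS>`. A greedy-decoding sketch, where `decode_step` is a hypothetical stand-in for one pass through the trained LSTM decoder (the slides do not specify the decoding strategy):

```python
import numpy as np

def generate_greedy(video_feat, decode_step, vocab,
                    bos="<BOS>", eos="<EOS>", max_len=20):
    """Greedy sentence generation: at each step, feed the video feature
    and the previous word, then pick the most probable next word.
    `decode_step(feat, word, state) -> (probs over vocab, new state)`."""
    words, word, state = [], bos, None
    for _ in range(max_len):
        probs, state = decode_step(video_feat, word, state)
        word = vocab[int(np.argmax(probs))]
        if word == eos:
            break
        words.append(word)
    return " ".join(words)
```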

SLIDE 17

Training

Annotated video data is scarce.

Key Insight: Use supervised pre-training on data-rich auxiliary tasks and transfer.

SLIDE 18

Step 1: CNN pre-training

  • Caffe Reference Net, a variation of AlexNet [Krizhevsky et al. NIPS'12]
  • 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]
  • Initialize the weights of our network.

CNN fc7: 4096 dimension “feature vector”

SLIDE 19

Step 2: Image-caption training

[Figure: CNN features of an image feed the two-layer LSTM, which learns to emit the caption "A man is scaling a cliff"]

SLIDE 20

Step 3: Fine-tuning

[Figure: the full CNN-LSTM network, fine-tuned on video data]

1. Video dataset
2. Mean-pooled feature
3. Lower learning rate

SLIDE 21

Experiments: Dataset

Microsoft Research Video Description dataset [Chen & Dolan, ACL'11]
Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/

  • 1970 YouTube video snippets
    ○ 10-30s each
    ○ typically a single activity
    ○ no dialogues
    ○ 1200 training, 100 validation, 670 test
  • Annotations
    ○ descriptions in multiple languages
    ○ ~40 English descriptions per video
    ○ descriptions and videos collected on AMT

SLIDE 22

Sample video and descriptions

  • A man appears to be plowing a rice field with a plow being pulled by two oxen.

  • A man is plowing a mud field.
  • Domesticated livestock are helping a man plow.
  • A man leads a team of oxen down a muddy path.
  • A man is plowing with some oxen.
  • A man is tilling his land with an ox pulled plow.
  • Bulls are pulling an object.
  • Two oxen are plowing a field.
  • The farmer is tilling the soil.
  • A man in ploughing the field.
  • A man is walking on a rope.
  • A man is walking across a rope.
  • A man is balancing on a rope.
  • A man is balancing on a rope at the beach.
  • A man walks on a tightrope at the beach.
  • A man is balancing on a volleyball net.
  • A man is walking on a rope held by poles
  • A man balanced on a wire.
  • The man is balancing on the wire.
  • A man is walking on a rope.
  • A man is standing in the sea shore.

SLIDE 23

Augment Image datasets

Number of training videos: 1300

Flickr30k: 30,000 images, 150,000 descriptions
MSCOCO: 120,000 images, 600,000 descriptions

SLIDE 24

Evaluation

  • Subject, Verb, Object accuracy (extracted from generated sentences)

  • BLEU
  • METEOR
  • Human evaluation
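The core of BLEU is clipped n-gram precision; full BLEU combines precisions up to 4-grams with a brevity penalty (as in NLTK's `sentence_bleu` or the official script). A toy sketch of the unigram case, to make "clipped" concrete:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Clipped unigram precision: each candidate word counts only up to
    its maximum count in any single reference, so repeating a common
    word cannot inflate the score."""
    cand_counts = Counter(candidate.split())
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    return clipped / sum(cand_counts.values())
```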

SLIDE 25

Evaluation: Extracting SVO

Consider the dependency parse of a sentence and extract the Subject, Verb, and Object, e.g. (person, ride, motorbike).

Accuracy: a prediction counts as correct if it matches any valid ground-truth S, V, or O.
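The slides don't name a parser, so here is a sketch that assumes a dependency parse is already available as (word, dep_label, head_index) triples, the kind of output a parser such as spaCy or the Stanford parser would produce (lemmatization, e.g. "rides" → "ride", is omitted):

```python
def extract_svo(parse):
    """Pull (subject, verb, object) from a dependency parse given as
    (word, dep_label, head_index) triples; the root has head_index -1."""
    subj = verb = obj = None
    root = None
    for i, (word, dep, head) in enumerate(parse):
        if dep == "ROOT":
            verb, root = word, i          # main verb of the sentence
    for word, dep, head in parse:
        if dep == "nsubj" and head == root:
            subj = word                   # nominal subject of the root
        elif dep == "dobj" and head == root:
            obj = word                    # direct object of the root
    return subj, verb, obj
```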

SLIDE 26

SVO - Subject accuracy

Model: Subject accuracy (%)
Best Prior Work (FGM) [Thomason et al. COLING'14]: 88.27
Only Images: 79.95
Only Videos: 79.40
Images+Videos: 87.27

SLIDE 27

SVO - Verb accuracy

Model: Verb accuracy (%)
Best Prior Work (FGM) [Thomason et al. COLING'14]: 38.66
Only Images: 15.47
Only Videos: 35.52
Images+Videos: 42.79

SLIDE 28

SVO - Object accuracy

Model: Object accuracy (%)
Best Prior Work (FGM) [Thomason et al. COLING'14]: 24.63
Only Images: 14.86
Only Videos: 20.59
Images+Videos: 26.69

SLIDE 29

Results - Generation

Model: BLEU / METEOR
Best Prior Work (FGM) [Thomason et al. COLING'14]: 13.68 / 23.90
Only Images: 12.66 / 20.96
Only Video: 31.19 / 26.87
Images+Video: 33.29 / 29.07

MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground-truth references.

SLIDE 30

Human Evaluation

Relevance: rank the sentences by how accurately they describe the event depicted in the video; no two sentences can have the same rank.

Grammar: rate the grammatical correctness of each sentence; multiple sentences can have the same rating.

SLIDE 31

Results - Human Evaluation

Model: Relevance / Grammar
Best Prior Work [Thomason et al. COLING'14]: 2.26 / 3.99
Only Video: 2.74 / 3.84
Images+Video: 2.93 / 3.64
Ground Truth: 4.65 / 4.61

SLIDE 32

Examples

FGM: A person is dancing with the person on the stage.
YT: A group of men are riding the forest.
I+V: A group of people are dancing.
GT: Many men and women are dancing in the street.

FGM: A person is cutting a potato in the kitchen.
YT: A man is slicing a tomato.
I+V: A man is slicing a carrot.
GT: A man is slicing carrots.

FGM: A person is walking with a person in the forest.
YT: A monkey is walking.
I+V: A bear is eating a tree.
GT: Two bear cubs are digging into dirt and plant matter at the base of a tree.

FGM: A person is riding a horse on the stage.
YT: A group of playing are playing in the ball.
I+V: A basketball player is playing.
GT: Dwayne wade does a fancy layup in an allstar game.

SLIDE 33

Examples: Relevant but not always correct

FGM: A person is cutting the water in a pool.
YT: A man is pouring some sauce.
I+V: A person is cutting a pizza.
GT: Someone opens a pizza box containing pepperoni pizza.

FGM: A person is playing a person in the sky.
YT: A dog is playing in the snow.
I+V: A dog is walking on a ball.
GT: Two polar bears are wrestling in the snow.

FGM: A person is walking with a person in the kitchen.
YT: A monkey is walking.
I+V: A elephant is walking.
GT: A baby elephant is walking and wraps his trunk around a leafy green plant.

FGM: A person is playing the guitar on the stage.
YT: A person is flying.
I+V: A man is doing the air.
GT: A female gymnast does a flip.

SLIDE 34

More Examples

SLIDE 35

Conclusion

[Figure: the CNN-LSTM network generating "A boy is playing golf <EOS>", with training augmented by MSCOCO image captions]

1. CNN+LSTM network to generate sentences for videos.
2. Augment training with image-caption datasets.

Future Work: Incorporate temporal sequence information: http://arxiv.org/abs/1505.00487

SLIDE 36

Thank You


Code: https://github.com/vsubhashini/caffe/tree/recurrent/examples/youtube

We use Caffe! http://caffe.berkeleyvision.org

  • Clean & fast CNN library in C++ with Python and MATLAB interfaces
  • Will soon include LSTMs (PR #1873)

Future Work: Incorporate temporal sequence information: http://arxiv.org/abs/1505.00487