 
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
Subhashini Venugopalan (UT Austin), Huijuan Xu (UMass. Lowell), Jeff Donahue (UC Berkeley), Marcus Rohrbach (UC Berkeley), Raymond Mooney (UT Austin), Kate Saenko (UMass. Lowell)
Subhashini Venugopalan, University of Texas at Austin

Problem Statement
Generate descriptions for events depicted in video clips.
Example: “A monkey pulls a dog’s tail and is chased by the dog.”

Prior Work (Pipelined approach) [Thomason et al. COLING’14]
● Detect objects
● Classify actions and scenes
● Visual confidences over entities and actions:

    Subjects       Verbs         Objects        Scenes
    person 0.95    slice 0.19    egg    0.31    kitchen 0.64
    monkey 0.01    chop  0.11    onion  0.21    sky     0.17
    animal 0.01    play  0.09    potato 0.20    house   0.07
    ...            ...           ...            ...
    parrot 0       speak 0       piano  0       snow    0

● Bias with language statistics
● Factor Graph Model (FGM) estimates most likely entities
● Template-based sentence generation: “A person is slicing an onion in the kitchen.”

Prior Work
● Yu and Siskind, ACL’13: Detect and track objects. Learning HMMs for actions.
● Rohrbach et al., ICCV’13: Cooking videos. CRFs.
● Xu et al., AAAI’15: Embed video and words in the same space. Retrieval. CRFs for generation.
Lots of work on image to text but relatively little on video to text.
Downside: which objects/actions/scenes should I build classifiers for?

Can we learn directly from video-sentence pairs?
Without having to explicitly learn object/action/scene classifiers for our dataset.

Recurrent Neural Networks (RNNs) can map a vector to a sequence.
● English sentence → RNN encoder → RNN decoder → French sentence [Sutskever et al. NIPS’14]
● Image → Encode → RNN decoder → Sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]
● Video → Encode → RNN decoder → Sentence [V. NAACL’15] (this work)
Key Insight: Generate a feature representation of the video and “decode” it to a sentence.

Recurrent Neural Networks (RNNs)
[Figure: an RNN unrolled over time steps t=0 to t=3; each step maps input → hid → out]
Insight: Each time step has a layer with the same weights.
$\Pr(\mathrm{out}_{t_n} \mid \mathrm{input}, \mathrm{out}_{t_0}, \ldots, \mathrm{out}_{t_{n-1}})$
Problems:
1. Hard to capture long-term dependencies
2. Vanishing gradients (shrink through many layers)
Solution: Long Short Term Memory (LSTM) unit

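The weight sharing across time steps can be sketched in a few lines. This is a minimal scalar illustration, not the network from the talk: the names `rnn_step` and `run_rnn` and the weight values are assumptions chosen for the example. It also hints at the long-term dependency problem, since the effect of the first input fades at later steps.

```python
import math

def rnn_step(x, h_prev, w_in, w_hid, w_out):
    """One time step of a scalar RNN (illustrative, hypothetical weights)."""
    h = math.tanh(w_in * x + w_hid * h_prev)  # new hidden state
    out = w_out * h                           # output at this step
    return h, out

def run_rnn(inputs, w_in=0.5, w_hid=0.8, w_out=1.0):
    """Unroll the RNN over the inputs, reusing the same weights at every step."""
    h = 0.0
    outputs = []
    for x in inputs:
        h, out = rnn_step(x, h, w_in, w_hid, w_out)
        outputs.append(out)
    return outputs

# A single impulse at t=0 followed by zeros: its trace shrinks step by step.
outputs = run_rnn([1.0, 0.0, 0.0, 0.0])
```

Each call to `rnn_step` uses the same three weights, which is what "a layer with the same weights" at every time step amounts to.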
LSTM [Hochreiter and Schmidhuber ‘97] [Graves ‘13]
[Figure: LSTM unit. The input gate, forget gate, output gate, and input modulation gate each receive x_t and h_t-1 and control a memory cell; the unit outputs h_t (= z_t).]

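One LSTM step can be sketched as follows, assuming a common formulation without peephole connections in which every gate sees x_t and h_t-1. Scalars stand in for vectors, and the parameter layout and the names `lstm_step` and `params` are hypothetical, not taken from the slides.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # input gate
    f = sigmoid(pre("forget"))        # forget gate
    o = sigmoid(pre("output"))        # output gate
    g = math.tanh(pre("modulation"))  # input modulation gate
    c = f * c_prev + i * g            # memory cell update
    h = o * math.tanh(c)              # unit output h_t
    return h, c

# Assumed parameters: forget gate nearly open, input gate nearly closed,
# so the memory cell carries its previous value forward almost unchanged.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 20.0),
    "modulation": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)
```

With the forget gate close to 1, the cell value passes through the step nearly unchanged, which is how the memory cell can preserve information over many time steps.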
LSTM Sequence decoders
[Figure: a layer with LSTM units (1000), unrolled over time steps t=0 to t=3; each step maps input → hid → out]
Full gradient is computed by backpropagating through time.
Matches state-of-the-art on:
● Speech Recognition [Graves & Jaitly ICML’14]
● Machine Translation (Eng-Fr) [Sutskever et al. NIPS’14]
● Image-Description [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]

LSTM Sequence decoders
Two LSTM layers
[Figure: unrolled over time steps t=0 to t=3; each step maps input → hid → hid → out]

Translating videos to natural language
[Figure: a CNN feeds a stack of LSTMs that outputs one word per time step: “A boy is playing golf <EOS>”]

Test time: Step 1
Input video (a) → Sample frames @1/10 → Frame scale 227x227 (b)
[Figure: CNN and LSTM stack outputting “A boy is playing golf <EOS>”]

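The frame sampling in this step can be sketched as index selection. This assumes "@1/10" means keeping one frame in every ten; the function name `sample_frame_indices` is hypothetical, and decoding frames and resizing them to 227x227 would be left to an image library of your choice.

```python
def sample_frame_indices(num_frames, step=10):
    """Return the indices of the frames to keep: every `step`-th frame."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, num_frames, step))

# Hypothetical 35-frame clip: four frames are kept.
kept = sample_frame_indices(35)
```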
Convolutional Neural Networks (CNNs) for feature learning
● Fukushima, 1980: Neocognitron
● Rumelhart, Hinton, Williams, 1986: “T” vs “C”
● LeCun et al., 1989-1998: Handwritten digit recognition
● Krizhevsky, Sutskever, Hinton, 2012: ImageNet classification breakthrough
Credits: R. Girshick

Test time: Step 2 Feature extraction
Forward propagate through the CNN.
Output: “fc7” features (activations before classification layer)
fc7: 4096 dimension CNN “feature vector”

Test time: Step 3 Mean pooling
Mean of the CNN features across all frames.
Arxiv: http://arxiv.org/abs/1505.00487

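Mean pooling reduces the per-frame feature vectors to a single vector for the whole clip. A minimal sketch in plain Python follows; the function name `mean_pool` and the tiny 2-dimensional vectors are assumptions for illustration, standing in for the 4096-dimension fc7 features.

```python
def mean_pool(frame_features):
    """Average a list of equal-length feature vectors element-wise."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

# Two hypothetical frames with 2-dimensional features.
video_feature = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

The result has the same dimension as a single frame's feature vector regardless of how many frames were sampled, so clips of any length map to a fixed-size input for the recurrent net.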
Test time: Step 4 Generation
Input Video → Convolutional Net → Recurrent Net → Output
[Figure: the LSTM stack outputs one word per time step: “A boy is playing golf <EOS>”]

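Generation can be sketched as a loop that emits one word per time step until `<EOS>` appears. The sketch assumes greedy (argmax) word selection and a step function that receives the video feature at every step; `greedy_decode`, the `<BOS>` start token, and `toy_step` (a scripted stand-in for the trained LSTM) are all hypothetical.

```python
def greedy_decode(step_fn, video_feature, max_len=20, eos="<EOS>"):
    """Pick the highest-scoring word at each step until <EOS> or max_len."""
    words = []
    state = None
    prev = "<BOS>"  # assumed start-of-sentence token
    for _ in range(max_len):
        scores, state = step_fn(video_feature, prev, state)
        prev = max(scores, key=scores.get)
        if prev == eos:
            break
        words.append(prev)
    return " ".join(words)

def toy_step(video_feature, prev_word, state):
    """Stand-in for the recurrent net: replays a fixed sentence."""
    script = ["A", "boy", "is", "playing", "golf", "<EOS>"]
    t = 0 if state is None else state
    scores = {w: 0.0 for w in script}
    scores[script[t]] = 1.0
    return scores, t + 1

sentence = greedy_decode(toy_step, video_feature=[0.0])
```

In a real system `step_fn` would run the trained LSTM stack and return a score for every vocabulary word; the loop itself would stay the same.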
Training
Annotated video data is scarce.
Key Insight: Use supervised pre-training on data-rich auxiliary tasks and transfer.

Step 1: CNN pre-training
● Caffe Reference Net - variation of AlexNet [Krizhevsky et al. NIPS’12]
● 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]
● Initialize weights of our network.
fc7: 4096 dimension CNN “feature vector”

Step 2: Image-Caption training
[Figure: CNN and LSTM stack trained on an image with the caption “A man is scaling a cliff”]

Step 3: Fine-tuning
1. Video Dataset
2. Mean pooled feature
3. Lower learning rate
[Figure: CNN and LSTM stack outputting “A boy is playing golf <EOS>”]

Experiments: Dataset
Microsoft Research Video Description dataset [Chen & Dolan, ACL’11]
Link: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/
● 1970 YouTube video snippets
  ○ 10-30s each
  ○ typically single activity
  ○ no dialogues
  ○ 1200 training, 100 validation, 670 test
● Annotations
  ○ Descriptions in multiple languages
  ○ ~40 English descriptions per video
  ○ descriptions and videos collected on AMT

Sample video and descriptions
Video 1:
● A man appears to be plowing a rice field with a plow being pulled by two oxen.
● A man is plowing a mud field.
● Domesticated livestock are helping a man plow.
● A man leads a team of oxen down a muddy path.
● A man is plowing with some oxen.
● A man is tilling his land with an ox pulled plow.
● Bulls are pulling an object.
● Two oxen are plowing a field.
● The farmer is tilling the soil.
● A man in ploughing the field.
Video 2:
● A man is walking on a rope.
● A man is walking across a rope.
● A man is balancing on a rope.
● A man is balancing on a rope at the beach.
● A man walks on a tightrope at the beach.
● A man is balancing on a volleyball net.
● A man is walking on a rope held by poles
● A man balanced on a wire.
● The man is balancing on the wire.
● A man is walking on a rope.
● A man is standing in the sea shore.

Augment Image datasets
● # Training videos - 1300
● Flickr30k - 30,000 images, 150,000 descriptions
● MSCOCO - 120,000 images, 600,000 descriptions

Evaluation
● Subject, Verb, Object accuracy (extracted from generated sentences)
● BLEU
● METEOR
● Human evaluation

Evaluation: Extracting SVO
Extracting Subject-Verb-Object (SVO) from sentences.
Consider the dependency parse of a sentence. Extract Subject, Verb, Object, e.g. (person, ride, motorbike).
Accuracy: a predicted S, V, or O counts as correct if it matches any valid ground truth S, V, or O.

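The "any valid ground truth" accuracy can be sketched as below, under the assumption that each video comes with a set of acceptable words for the slot being scored (subject, verb, or object). The function name `svo_accuracy` and the sample words are hypothetical, and the dependency parsing that would produce the words is not shown.

```python
def svo_accuracy(predictions, valid_answers):
    """Percentage of predictions that appear in the matching set of valid answers.

    predictions:   one predicted word per video
    valid_answers: one set of acceptable ground truth words per video
    """
    if not predictions or len(predictions) != len(valid_answers):
        raise ValueError("need one set of valid answers per prediction")
    correct = sum(1 for p, valid in zip(predictions, valid_answers) if p in valid)
    return 100.0 * correct / len(predictions)

# Hypothetical subject predictions for two videos.
acc = svo_accuracy(["person", "monkey"], [{"person", "man"}, {"dog"}])
```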
SVO - Subject accuracy

    Model                                          Accuracy
    Best Prior Work [Thomason et al. COLING’14]    88.27
    Only Images                                    79.95
    Only Videos                                    79.40
    Images+Videos                                  87.27

SVO - Verb accuracy

    Model                                          Accuracy
    Best Prior Work [Thomason et al. COLING’14]    38.66
    Only Images                                    15.47
    Only Videos                                    35.52
    Images+Videos                                  42.79

SVO - Object accuracy

    Model                                          Accuracy
    Best Prior Work [Thomason et al. COLING’14]    24.63
    Only Images                                    14.86
    Only Videos                                    20.59
    Images+Videos                                  26.69

Results - Generation
MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground truth references.

    Model                                          BLEU     METEOR
    Best Prior Work [Thomason et al. COLING’14]    13.68    23.90
    Only Images                                    12.66    20.96
    Only Video                                     31.19    26.87
    Images+Video                                   33.29    29.07

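To make "compare against all ground truth references" concrete, here is a sketch of clipped n-gram precision, the core ingredient of BLEU. It is not full BLEU (no brevity penalty, no geometric mean over n-gram orders) and not METEOR, and it is not the scoring script behind the numbers above. The function names and the candidate sentence are assumptions; the two references are taken from the sample descriptions of the rope video.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=1):
    """Share of candidate n-grams found in the references, with each n-gram's
    count clipped to the maximum count seen in any single reference."""
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

refs = ["A man is walking on a rope", "A man is balancing on a rope at the beach"]
precision = clipped_precision("A man is walking on a wire", refs)
```

Here six of the seven candidate words are covered by the references; only "wire" is unmatched, so the unigram precision is 6/7.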
Human Evaluation
● Relevance: Rank sentences based on how accurately they describe the event depicted in the video. No two sentences can have the same rank.
● Grammar: Rate the grammatical correctness of the following sentences. Multiple sentences can have the same rating.

Results - Human Evaluation

    Model                                          Relevance    Grammar
    Best Prior Work [Thomason et al. COLING’14]    2.26         3.99
    Only Video                                     2.74         3.84
    Images+Video                                   2.93         3.64
    Ground Truth                                   4.65         4.61

Examples
(FGM: prior work; YT: only video; I+V: images+video; GT: ground truth)
Example 1:
● FGM: A person is dancing with the person on the stage.
● YT: A group of men are riding the forest.
● I+V: A group of people are dancing.
● GT: Many men and women are dancing in the street.
Example 2:
● FGM: A person is cutting a potato in the kitchen.
● YT: A man is slicing a tomato.
● I+V: A man is slicing a carrot.
● GT: A man is slicing carrots.
Example 3:
● FGM: A person is walking with a person in the forest.
● YT: A monkey is walking.
● I+V: A bear is eating a tree.
● GT: Two bear cubs are digging into dirt and plant matter at the base of a tree.
Example 4:
● FGM: A person is riding a horse on the stage.
● YT: A group of playing are playing in the ball.
● I+V: A basketball player is playing.
● GT: Dwayne wade does a fancy layup in an allstar game.