

  1. Natural Language Video Description using Deep Recurrent Neural Networks. Thesis Proposal, 23 Nov. 2015. Subhashini Venugopalan, University of Texas at Austin.

  2. Problem Statement: Generate descriptions for events depicted in video clips, e.g., "A monkey pulls a dog's tail and is chased by the dog."

  3. Applications: image and video retrieval by content; video description services (e.g., "Children are wearing green shirts. They are dancing as they sing the carol."); human-robot interaction; video surveillance.

  4. Outline

  5. Related Work

  6. Related Work - 1: Language & Vision. Language: increasingly focused on grounding meaning in perception. Vision: exploits linguistic ontologies to "tell a story" from images. Many early works on image description (Farhadi et al. ECCV'10; Kulkarni et al. CVPR'11; Mitchell et al. EACL'12; Kuznetsova et al. ACL'12 & ACL'13) identify objects and attributes and combine them with linguistic knowledge to "tell a story", e.g. turning detections like (animal, stand, ground) into "There is one cow and one sky. The golden cow is by the blue sky." Interest has increased dramatically in the past year (8 papers in CVPR'15), e.g. [Donahue et al. CVPR'15]: "A group of young men playing a game of soccer." There is relatively little work on video description, which needs videos to capture the semantics of a wider range of actions.

  7. Related Work - 2: Video Description. Typical pipeline: extract object and action descriptors; learn object, action, and scene classifiers; use language to bias the visual interpretation; estimate the most likely agents and actions [Krishnamurthy et al. AAAI'13]; fill a template to generate the sentence [Rohrbach et al. ICCV'13]. Others: Guadarrama et al. ICCV'13, Thomason et al. COLING'14. Limitations: narrow domains, small grammars [Yu and Siskind ACL'13], template-based sentences, and many hand-crafted features and classifiers. Which objects/actions/scenes should we build classifiers for?

  8. Can we learn directly from video-sentence pairs, without having to explicitly learn object/action/scene classifiers for our dataset? [Venugopalan et al. NAACL'15]

  9. Recurrent Neural Networks (RNNs) can map a vector to a sequence. RNN encoder → RNN decoder: English sentence to French sentence [Sutskever et al. NIPS'14]. Encode image → RNN decoder: sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]. Encode video → RNN decoder: sentence [Venugopalan et al. NAACL'15] (this work). Key insight: generate a feature representation of the video and "decode" it to a sentence.

  10. In this section: background on Recurrent Neural Networks, then two deep methods for video description. The first learns from image description (ignoring the temporal sequence of frames in videos); the second is temporally sensitive to the input.

  11. [Background] Recurrent Neural Networks. Successful in translation and speech. RNNs can map an input sequence to an output sequence: at each time step the unit takes input x_t and the previous hidden state h_{t-1}, produces the hidden state h_t, and emits the output y_t, modeling Pr(y_t | input, y_0 ... y_{t-1}). Insight: each time step applies a layer with the same weights. Problems: (1) long-term dependencies are hard to capture; (2) vanishing gradients (gradients shrink through many layers). Solution: the Long Short-Term Memory (LSTM) unit.
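
Where the slide sketches the recurrence, a minimal NumPy illustration may help; all names and sizes below are illustrative assumptions, not values from the thesis:

```python
# A minimal vanilla-RNN step: the same weights are reused at every time step.
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 8, 16, 5                    # input/hidden/output sizes (arbitrary)
W_xh = rng.standard_normal((D_h, D_in)) * 0.1  # input-to-hidden weights
W_hh = rng.standard_normal((D_h, D_h)) * 0.1   # hidden-to-hidden (recurrent) weights
W_hy = rng.standard_normal((D_out, D_h)) * 0.1 # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One time step: h_t = tanh(W_xh x_t + W_hh h_{t-1}); y_t = W_hy h_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(D_h)
for x in rng.standard_normal((10, D_in)):      # a length-10 input sequence
    h, y = rnn_step(x, h)                      # same weights applied each step
```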

  12. [Background] LSTM [Hochreiter and Schmidhuber '97; Graves '13]. The LSTM unit reads x_t and h_{t-1} through four gates (input gate, forget gate, output gate, and input modulation gate), which together control a memory cell that produces the new hidden state h_t (= z_t).
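
A minimal sketch of the gate diagram on this slide, assuming the standard LSTM formulation; the sizes and random initialization are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h = 8, 16                                  # illustrative sizes
W = rng.standard_normal((4 * D_h, D_in + D_h)) * 0.1  # all 4 gates stacked
b = np.zeros(4 * D_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM step: input (i), forget (f), output (o) gates and input
    modulation (g) all read [x_t; h_{t-1}] and control the memory cell c_t."""
    z = W @ np.concatenate([x_t, h_prev]) + b      # 4 gate pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates squashed to (0, 1)
    g = np.tanh(g)                                 # input modulation
    c_t = f * c_prev + i * g                       # forget old, write new
    h_t = o * np.tanh(c_t)                         # expose gated cell state
    return h_t, c_t

h = c = np.zeros(D_h)
for x in rng.standard_normal((10, D_in)):          # run over a short sequence
    h, c = lstm_step(x, h, c)
```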

  13. [Background] LSTM sequence decoders. All functions are differentiable, so the full gradient is computed by backpropagating through time, and the weights are updated using stochastic gradient descent. At each time step t, the LSTM consumes an input and emits an output out_t. Matches the state of the art in speech recognition [Graves & Jaitly ICML'14], machine translation (English-French) [Sutskever et al. NIPS'14], and image description [Donahue et al. CVPR'15; Vinyals et al. CVPR'15].
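
The training recipe on this slide (differentiable decoder, backpropagation through time, SGD) can be illustrated with a toy snippet; PyTorch stands in for the Caffe setup actually used, and the model, sizes, and data are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 512
model = nn.LSTM(hidden, hidden)           # stands in for the full decoder
head = nn.Linear(hidden, vocab_size)      # per-step vocabulary scores
opt = torch.optim.SGD(list(model.parameters()) + list(head.parameters()), lr=0.01)

inputs = torch.randn(12, 1, hidden)            # a 12-step input sequence
targets = torch.randint(vocab_size, (12,))     # gold word id at each step

out, _ = model(inputs)                         # unrolled through time
loss = nn.functional.cross_entropy(head(out[:, 0]), targets)
opt.zero_grad()
loss.backward()                                # backprop through time
opt.step()                                     # SGD weight update
```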

  14. LSTM sequence decoders. Two LSTM layers: the second layer adds depth in temporal processing. A softmax over the vocabulary predicts the output word at each time step.
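
A hedged PyTorch sketch of this architecture: two stacked LSTM layers with a softmax over the vocabulary at each step. The sizes, token ids, greedy decoding, and zero-initialized state are assumptions for illustration:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 256, 512     # illustrative sizes
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2)    # 2nd layer: depth in time
proj = nn.Linear(hidden_dim, vocab_size)

def decode_greedy(state, bos_id=0, eos_id=1, max_len=20):
    """Greedily emit one word per time step until <eos>."""
    words, tok = [], torch.tensor([bos_id])
    for _ in range(max_len):
        x = embed(tok).unsqueeze(0)            # (1, batch=1, embed_dim)
        out, state = lstm(x, state)
        probs = proj(out[0]).softmax(dim=-1)   # softmax over the vocabulary
        tok = probs.argmax(dim=-1)
        if tok.item() == eos_id:
            break
        words.append(tok.item())
    return words

# In the full model the initial state would encode the video; zeros here.
state = (torch.zeros(2, 1, hidden_dim), torch.zeros(2, 1, hidden_dim))
word_ids = decode_greedy(state)
```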

  15. Translating Videos to Natural Language [Venugopalan et al. NAACL'15].

  16. Test time, Step 1: Sample frames from the input video at a rate of 1 in 10 and scale each frame to 227x227.
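
One possible implementation of this step, assuming OpenCV (the slide does not specify the tooling):

```python
import cv2

def sample_frames(video_path, step=10, size=(227, 227)):
    """Keep every `step`-th frame and resize it to the CNN's input size."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                               # end of video
            break
        if idx % step == 0:                      # sample at 1/10 rate
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return frames
```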

  17. [Background] Convolutional Neural Networks (CNNs). Successful in semantic visual recognition tasks. A layer applies linear filters followed by a nonlinear function; stacking layers learns a hierarchy of features of increasing semantic richness. ImageNet classification breakthrough: Krizhevsky, Sutskever, and Hinton 2012. (Image credit: Maurice Peeman.)
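
A toy snippet of the slide's definition (a layer is linear filters plus a nonlinearity, and layers are stacked); the channel counts are arbitrary and PyTorch is used only for illustration:

```python
import torch
import torch.nn as nn

layer1 = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1),   # linear filters
                       nn.ReLU())                                    # nonlinear function
layer2 = nn.Sequential(nn.Conv2d(16, 32, kernel_size=3, padding=1),
                       nn.ReLU())

x = torch.randn(1, 3, 227, 227)   # a dummy 227x227 RGB image
features = layer2(layer1(x))      # stacked layers: increasingly rich features
```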

  18. Test time, Step 2: Feature extraction. Forward-propagate each frame through the CNN and output the "fc7" features (the activations before the classification layer), a 4096-dimensional feature vector per frame.
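
A sketch of this extraction using torchvision's AlexNet as a stand-in for the Caffe model used in the thesis; the weight enum and normalization constants are torchvision's, not the thesis's:

```python
import torch
from torchvision import models, transforms

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()
# Everything up to (but not including) the final classification layer = "fc7".
fc7 = torch.nn.Sequential(alexnet.features, alexnet.avgpool, torch.nn.Flatten(),
                          *list(alexnet.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc7_features(frame_rgb):                 # frame: 227x227x3 uint8, RGB order
    x = preprocess(frame_rgb).unsqueeze(0)   # (1, 3, 227, 227)
    with torch.no_grad():
        return fc7(x).squeeze(0)             # 4096-d feature vector
```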

  19. Test time, Step 3: Mean pooling. Average the CNN feature vectors across all frames to get a single fixed-length representation of the video. (arXiv: http://arxiv.org/abs/1505.00487)
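
Mean pooling itself is a one-liner over the per-frame fc7 vectors; a minimal sketch:

```python
import numpy as np

def mean_pool(frame_features):
    """frame_features: list of 4096-d per-frame vectors -> one 4096-d vector."""
    return np.mean(np.stack(frame_features), axis=0)
```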

  20. Test time, Step 4: Generation. Input video → convolutional net → recurrent net → output sentence.

  21. Training. Annotated video data is scarce. Key insight: use supervised pre-training on data-rich auxiliary tasks and transfer.

  22. Step 1: CNN pre-training (fc7: 4096-dimensional CNN "feature vector"). Based on AlexNet [Krizhevsky et al. NIPS'12]; trained on 1.2M+ images from ImageNet ILSVRC-12 [Russakovsky et al.]; used to initialize the weights of our network.

  23. Step 2: Image-caption training.

  24. Step 3: Fine-tuning. (1) Video dataset; (2) mean-pooled features; (3) lower learning rate.
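
A hedged sketch of the idea behind item (3): keep the pre-trained weights and continue training at a smaller learning rate so they are only gently adjusted. The model object and both rates are placeholders, not the thesis's actual values:

```python
import torch

model = torch.nn.Linear(4096, 512)   # stands in for the pre-trained caption network
pretrain_opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # pre-training rate
finetune_opt = torch.optim.SGD(model.parameters(), lr=1e-4)  # fine-tuning: lower rate
```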

  25. Experiments: Dataset. Microsoft Research Video Description dataset [Chen & Dolan, ACL'11] (http://www.cs.utexas.edu/users/ml/clamp/videoDescription/): 1,970 YouTube video snippets, 10-30s each, typically a single activity, no dialogue; 1,200 training, 100 validation, 670 test. Annotations: descriptions in multiple languages, ~40 English descriptions per video; descriptions and videos collected on AMT.

  26. Augment with image datasets. # training videos: 1,300. Flickr30k: 30,000 images, 150,000 descriptions. MSCOCO: 120,000 images, 600,000 descriptions.

  27. Sample videos and gold descriptions.
  Video 1 (plowing): "A man appears to be plowing a rice field with a plow being pulled by two oxen." "A team of water buffalo pull a plow through a rice paddy." "Domesticated livestock are helping a man plow." "A man leads a team of oxen down a muddy path." "Two oxen walk through some mud." "A man is tilling his land with an ox pulled plow." "Bulls are pulling an object." "Two oxen are plowing a field." "The farmer is tilling the soil." "A man in ploughing the field."
  Video 2 (tightrope): "A man is walking on a rope." "A man is walking across a rope." "A man is balancing on a rope." "A man is balancing on a rope at the beach." "A man walks on a tightrope at the beach." "A man is balancing on a volleyball net." "A man is walking on a rope held by poles." "A man balanced on a wire." "The man is balancing on the wire." "A man is walking on a rope." "A man is standing in the sea shore."

  28. Evaluation. Machine translation metrics: BLEU, METEOR. Human evaluation.
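
A hedged example of scoring a generated sentence against multiple references, using NLTK's metric implementations as stand-ins for the official BLEU/METEOR tools (the hypothetical sentences below are not from the dataset, and METEOR here requires NLTK's WordNet data):

```python
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score  # needs nltk's wordnet corpus

# Pre-tokenized references (all ground-truth sentences) and one hypothesis.
references = [["a", "man", "is", "playing", "a", "guitar"],
              ["someone", "plays", "the", "guitar"]]
hypothesis = ["a", "man", "plays", "a", "guitar"]

bleu = sentence_bleu(references, hypothesis)    # n-gram precision + brevity penalty
meteor = meteor_score(references, hypothesis)   # unigram matches w/ stems, synonyms
```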

  29. Results: Generation. MT metrics (BLEU, METEOR) compare the system-generated sentences against (all) ground-truth references; baseline from [Thomason et al. COLING'14].

  30. Human Evaluation. Relevance: rank the sentences by how accurately they describe the event depicted in the video; no two sentences can have the same rank. Grammar: rate the grammatical correctness of the sentences; multiple sentences can have the same rating.

  31. Results: Human Evaluation.
  Model | Relevance | Grammar
  [Thomason et al. COLING'14] | 2.26 | 3.99
  (unlabeled) | 2.74 | 3.84
  (unlabeled) | 2.93 | 3.64
  (unlabeled) | 4.65 | 4.61

  32. More Examples

  33. Translating Videos to Natural Language [Venugopalan et al. NAACL'15]: the mean-pooling approach does not consider the temporal sequence of frames.

  34. Can our model be sensitive to temporal structure, allowing both the input (a sequence of frames) and the output (a sequence of words) to be of variable length? [Venugopalan et al. ICCV'15]

  35. Recurrent Neural Networks (RNNs) can map a vector to a sequence. RNN encoder → RNN decoder: English sentence to French sentence [Sutskever et al. NIPS'14]. Encode image → RNN decoder: sentence [Donahue et al. CVPR'15; Vinyals et al. CVPR'15]. Encode video → RNN decoder: sentence [Venugopalan et al. NAACL'15]. RNN encoder → RNN decoder: video to sentence [Venugopalan et al. ICCV'15] (this work).

  36. S2VT: Sequence to Sequence Video to Text [Venugopalan et al. ICCV'15]. A stack of LSTMs reads one CNN frame feature per time step (encoding stage), then decodes the resulting state into a sentence one word per time step, e.g. "A man is talking ..." (decoding stage).
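
A rough PyTorch sketch of the S2VT idea. It simplifies the paper's exact pad-and-concatenate two-layer layout into one shared stacked LSTM, and the sizes, token ids, and greedy decoding are all assumptions for illustration:

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim, vocab_size = 4096, 512, 1000   # illustrative sizes

frame_proj = nn.Linear(feat_dim, hidden_dim)         # embed fc7 frame features
word_embed = nn.Embedding(vocab_size, hidden_dim)    # embed previous word
lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2) # shared stacked LSTM
proj = nn.Linear(hidden_dim, vocab_size)             # softmax-ready word scores

def s2vt(frame_feats, bos_id=0, eos_id=1, max_len=20):
    # Encoding stage: read all frames; keep the final LSTM state.
    x = frame_proj(frame_feats).unsqueeze(1)   # (T, batch=1, hidden)
    _, state = lstm(x)
    # Decoding stage: emit one word per time step from that state.
    words, tok = [], torch.tensor([bos_id])
    for _ in range(max_len):
        out, state = lstm(word_embed(tok).unsqueeze(0), state)
        tok = proj(out[0]).argmax(dim=-1)      # greedy choice over vocabulary
        if tok.item() == eos_id:
            break
        words.append(tok.item())
    return words

frame_feats = torch.randn(28, feat_dim)   # e.g. fc7 features of 28 sampled frames
word_ids = s2vt(frame_feats)
```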

  37. (1) Train on the 1000 ImageNet categories [Krizhevsky et al. NIPS'12]. (2) Take activations from the layer before classification: forward-propagate each RGB frame and output its "fc7" features, a 4096-dimensional CNN feature vector.
