Video Paragraph Captioning using Hierarchical Recurrent Neural Networks - PowerPoint PPT Presentation


  1. Video Paragraph Captioning using Hierarchical Recurrent Neural Networks. Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu

  2. Problem Given a video, generate a paragraph (multiple sentences). 01/13

  3. Problem: Given a video, generate a paragraph (multiple sentences). The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife.

  4. Problem: Given a video, generate a paragraph (multiple sentences). The person entered the kitchen. The person opened the drawer. The person took out a knife and a sharpener. The person sharpened the knife. The person cleaned the knife. vs. The person sharpened the knife in the kitchen.

  5. Motivation: Inter-sentence dependency (semantic context). 02/13

  6. Motivation: Inter-sentence dependency (semantic context). The person took out some potatoes.

  7. Motivation: Inter-sentence dependency (semantic context). The person took out some potatoes. The person peeled the potatoes. The person turned on the stove.

  8. Motivation: Inter-sentence dependency (semantic context). The person took out some potatoes. The person peeled the potatoes. The person turned on the stove. We want to model this dependency.

  9. Hierarchy: A paragraph is inherently hierarchical. 03/13

  10. Hierarchy: A paragraph is inherently hierarchical. The person took out some potatoes.

  11. Hierarchy: A paragraph is inherently hierarchical. … The person took out some potatoes. The person peeled the potatoes.

  12. Hierarchy: A paragraph is inherently hierarchical. … The person took out some potatoes. The person peeled the potatoes. (one RNN per sentence)

  13. Hierarchy: A paragraph is inherently hierarchical. … The person took out some potatoes. The person peeled the potatoes. (one RNN per sentence, plus a higher-level RNN over the sentences)

  14. Framework: (a) Sentence Generator (RNN); (b) Paragraph Generator (RNN). 04/13

  15. Framework – language model. (a) Sentence Generator: Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words. (b) Paragraph Generator.

  16. Framework – attention model for video features. (a) Sentence Generator: the Video Feature Pool feeds Attention I → Attention II → Sequential Softmax → Weighted Average into the Multimodal layer; the language-model path (Input Words → Embedding (512) → Recurrent I (512) → Multimodal (1024) → Hidden (512) → Softmax → MaxID → Predicted Words) is unchanged. (b) Paragraph Generator.

  17. Framework – paragraph model. (a) Sentence Generator as above. (b) Paragraph Generator: a Sentence Embedding (512, formed from the last instance and the average of Recurrent I states), Recurrent II (512), and a Paragraph State (512) embedding that is fed back to the sentence generator.
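The two-level design above can be sketched in code. The following is a minimal sketch, assuming plain tanh RNN cells, toy dimensions, and a direct hand-off of the paragraph state into the sentence-level RNN; the actual model uses gated recurrent units, the 512/1024-d layers shown on the slides, and learned word embeddings.

```python
# Minimal sketch of the two-level (hierarchical) generator. Layer names
# follow the slides (Recurrent I = sentence level, Recurrent II = paragraph
# level); cell type, sizes, and the state hand-off are assumptions here.
import numpy as np

rng = np.random.default_rng(0)
D_EMB, D_H, D_PAR = 8, 16, 16          # toy sizes (slides use 512/512/512)

def rnn_step(x, h, Wx, Wh):
    """One tanh-RNN step: new hidden state from input x and previous h."""
    return np.tanh(x @ Wx + h @ Wh)

Wx1 = rng.normal(size=(D_EMB, D_H)) * 0.1   # sentence level (Recurrent I)
Wh1 = rng.normal(size=(D_H, D_H)) * 0.1
Wx2 = rng.normal(size=(D_H, D_PAR)) * 0.1   # paragraph level (Recurrent II)
Wh2 = rng.normal(size=(D_PAR, D_PAR)) * 0.1

def generate_paragraph(sentences):
    """sentences: list of [T_i, D_EMB] arrays of word embeddings."""
    par_state = np.zeros(D_PAR)        # paragraph state, carried across sentences
    sent_embs = []
    for words in sentences:
        h = par_state[:D_H].copy()     # paragraph state initializes the sentence RNN
        for w in words:                # sentence generator: run Recurrent I over words
            h = rnn_step(w, h, Wx1, Wh1)
        sent_emb = h                   # last hidden state as the sentence embedding
        par_state = rnn_step(sent_emb, par_state, Wx2, Wh2)  # Recurrent II update
        sent_embs.append(sent_emb)
    return sent_embs, par_state

sents = [rng.normal(size=(5, D_EMB)), rng.normal(size=(4, D_EMB))]
embs, final_state = generate_paragraph(sents)
print(len(embs), final_state.shape)    # one embedding per sentence, one paragraph state
```

The key property this illustrates is the inter-sentence dependency from slides 5-8: the second sentence's RNN starts from a state that already summarizes the first sentence.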

  18. Visual Features (appearance, action, and video feature pools). Object appearance: VGG-16 (fc7) [Simonyan et al., 2015], pre-trained on the ImageNet dataset. Action: C3D (fc6) [Tran et al., 2015], pre-trained on the Sports-1M dataset, and Dense Trajectories + Fisher Vector [Wang et al., 2011]. 05/13

  19. Attention Model: learning spatial & temporal attention simultaneously. Video Feature Pool → Attention I → Attention II → Sequential Softmax → Weighted Average, conditioned on Recurrent I (512). 06/13

  22. Attention Model: the video feature pool holds one feature per frame/segment (… i-1, i, i+1 …).

  23. Attention Model: attention is conditioned on the previous recurrent state (t-1).

  24. Attention Model: a softmax over the pool yields the attention weights.

  25. Attention Model: the dot product of the attention weights with the pooled features gives the weighted-average feature, the input to the multimodal layer.
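The attention steps on slides 22-25 can be sketched as follows, under the assumption that the unnormalized score is a dot product between a linear projection of the previous recurrent state and each pooled feature; the exact scoring function and the Attention I / Attention II split are simplified here.

```python
# Sketch of soft attention over a video feature pool: score each feature
# against the previous recurrent state, softmax into weights, then take
# the weighted average as the input to the multimodal layer.
import numpy as np

rng = np.random.default_rng(1)
D_FEAT, D_H, N = 12, 16, 5                    # feature dim, hidden dim, pool size

feature_pool = rng.normal(size=(N, D_FEAT))   # features for frames ... i-1, i, i+1 ...
h_prev = rng.normal(size=D_H)                 # previous recurrent state (t-1)
W = rng.normal(size=(D_H, D_FEAT)) * 0.1      # assumed projection into feature space

scores = feature_pool @ (h_prev @ W)          # one score per pooled feature
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax -> attention weights

context = weights @ feature_pool              # weighted-average feature

print(weights.sum(), context.shape)
```

Because the weights are recomputed at every word step from the current recurrent state, the model can attend to different frames while generating different words.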

  26. Paragraph Generator (unrolled over sentences n-1 and n): each sentence generator (embedding 512 → recurrent 512 → multimodal 1024 → hidden 512 → softmax/maxid over 7,192 words) reads the visual features and the current word and predicts the next word; the paragraph generator (512) maps the sentence embedding into the input to the next sentence's generator. 07/13

  27. Sentence Embedding: Input Words → Embedding (512) → Recurrent I (512); the last instance and the average of the Recurrent I states form the Sentence Embedding (512). 08/13
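Slide 27's sentence embedding can be sketched directly: combine the last Recurrent I state with the average over all states. How the two are combined (concatenation here, versus a learned projection) is an assumption of this sketch.

```python
# Sketch of the sentence embedding: "Last Instance" + "Average" of the
# sentence-level recurrent states; concatenation is an assumed combiner.
import numpy as np

# T=6 word steps, 4-d toy hidden states (512-d in the slides)
hidden_states = np.random.default_rng(2).normal(size=(6, 4))

last = hidden_states[-1]             # "Last Instance"
mean = hidden_states.mean(axis=0)    # "Average"
sentence_embedding = np.concatenate([last, mean])

print(sentence_embedding.shape)
```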

  28. Experiments - Setup. Two datasets: YouTube2Text > open-domain > 1,970 videos, ~80k video-sentence pairs, 12k unique words > only one sentence per video (a special case); TACoS-MultiLevel > closed-domain: cooking > 173 videos, 16,145 intervals, ~40k interval-sentence pairs, 2k unique words > several dependent sentences per video. Three evaluation metrics: BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005], CIDEr [Vedantam et al., 2015]. The higher, the better. 09/13
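As a toy illustration of what these metrics measure, the following computes a BLEU-style modified n-gram precision for one candidate/reference pair (no brevity penalty, no multi-reference handling); the paper's evaluation uses the full BLEU, METEOR, and CIDEr toolkits, not this sketch.

```python
# Modified n-gram precision: count candidate n-grams, clip each count by
# its count in the reference, and divide by the candidate n-gram total.
from collections import Counter

def ngram_precision(candidate, reference, n):
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / max(sum(cand_ngrams.values()), 1)

ref = "the person peeled the potatoes"
print(ngram_precision("the person peeled the potatoes", ref, 2))  # 1.0
print(ngram_precision("a person cut potatoes", ref, 2))           # 0.0
```

A perfect match scores 1.0 and a candidate sharing no bigrams scores 0.0, which is the sense in which "the higher, the better" holds for all three metrics.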

  29. Experiments - YouTube2Text: bar chart of BLEU@4, METEOR, and CIDEr scores (y-axis 0.2 to 0.7). 10/13

  30. Experiments - TACoS-MultiLevel: bar charts of BLEU@4 and METEOR (y-axis 0.24 to 0.31) and CIDEr (y-axis 1.2 to 1.65).

  32. Experiments - TACoS-MultiLevel: evaluation metric scores are not always reliable; we need further comparison. (Same BLEU@4/METEOR and CIDEr charts as the previous slide.)

  33. RNN-cat vs. h-RNN 11/13

  34. RNN-cat vs. h-RNN. RNN-cat: a flat structure that concatenates the sentences directly and models them with one RNN (… The person took out some potatoes. The person peeled the potatoes. …).

  35. RNN-cat vs. h-RNN. RNN-cat: a flat structure that concatenates the sentences directly and models them with one RNN. Amazon Mechanical Turk (AMT): side-by-side comparison. Which of the two sentences better describes the video? 1. The first. 2. The second. 3. Equally good or bad.

  37. RNN-sent vs. h-RNN examples 12/13

  38. Conclusions & Discussions Hierarchical RNN improves paragraph generation 13/13

  39. Conclusions & Discussions: the hierarchical RNN improves paragraph generation. Issues: 1. Most errors occur when generating nouns; small objects are hard to recognize (on TACoS-MultiLevel). 2. Information flows only one way. 3. The language model helps, but sometimes wrongly overrides the computer-vision result.

  40. Thanks! Poster #4
