Multimodal Abstractive Summarization for How2 Videos (ACL 2019)


  1. Multimodal Abstractive Summarization for How2 Videos, ACL 2019. Shruti Palaskar, Jindřich Libovický, Spandana Gella, Florian Metze. School of Computer Science, Carnegie Mellon University; Faculty of Mathematics and Physics, Charles University; Amazon AI. Presented by Xiachong Feng.

  2. Outline • Author • Background • Task • Dataset • Metric • Experiment

  3. Author • Shruti Palaskar: PhD student at the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University. • Research interests: multimodal machine learning, speech recognition, and natural language processing.

  4. Background • Computer vision (CV), natural language processing (NLP), automatic speech recognition (ASR). • Human information processing is inherently multimodal, and language is best understood in a situated context.

  5. Task • Multimodal summarization • Video summarization • Text summarization

  6. Search and Retrieve Relevant Videos

  7. Dataset: How2

  8. Dataset • 2,000 hours of short instructional videos, spanning different domains such as cooking, sports, indoor/outdoor activities, music, etc. • Each video is accompanied by a human-generated transcript and a 2-to-3-sentence summary. • Splits: training 73,993 / validation 2,965 / testing 2,156 videos. • Input: avg. 291 words; summary: avg. 33 words.

  9. Model • Video-based Summarization • Speech-based Summarization

  10. Video-based Summarization • Pre-trained action recognition model: a ResNeXt-101 3D Convolutional Neural Network. • Recognizes 400 different human actions.

  11. Actions

  12. Video-based Summarization • 2048-dimensional features, extracted for every 16 non-overlapping frames.
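
A minimal sketch of this per-clip feature extraction, assuming frames are already decoded into a tensor. torchvision's `r3d_18` (Kinetics-400) stands in for the paper's ResNeXt-101 3D CNN, so the feature dimension here is 512 rather than 2048:

```python
# Sketch: one feature vector per non-overlapping 16-frame clip.
# Stand-in model: torchvision r3d_18 (512-dim features); the paper uses a
# ResNeXt-101 3D CNN trained on Kinetics-400 (2048-dim features).
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="KINETICS400_V1")
model.fc = torch.nn.Identity()  # drop the 400-way action classifier head
model.eval()

def clip_features(frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    """frames: (T, C, H, W) float tensor -> (T // clip_len, feat_dim)."""
    n_clips = frames.shape[0] // clip_len
    clips = (
        frames[: n_clips * clip_len]
        .reshape(n_clips, clip_len, *frames.shape[1:])
        .permute(0, 2, 1, 3, 4)  # (N, C, T, H, W), the layout 3D CNNs expect
    )
    with torch.no_grad():
        return model(clips)  # one vector per 16-frame clip
```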

  13. Speech-based Summarization • Pretrained speech recognizer: uses the state-of-the-art models for distant-microphone conversational speech recognition, ASpIRE and EESEN, to turn audio into text.

  14. Summarization Models

  15. Content F1 • 1. Use the METEOR toolkit to obtain the alignment between the reference and the generated summary. 2. Remove function words and task-specific stop words. 3. Compute the F1 score over the alignment.
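
A simplified sketch of the metric, with exact token matching standing in for the METEOR alignment; the stop-word list is a placeholder for the function words and task-specific stop words the paper removes:

```python
# Simplified sketch of Content F1: exact token matching stands in for the
# METEOR alignment; STOP_WORDS is a placeholder list.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "in", "this", "video", "learn"}

def content_f1(reference: str, generated: str) -> float:
    ref = Counter(w for w in reference.lower().split() if w not in STOP_WORDS)
    gen = Counter(w for w in generated.lower().split() if w not in STOP_WORDS)
    overlap = sum((ref & gen).values())  # matched content words
    if not overlap:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```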

  16. Experiment • Random baseline: train an RNN language model on all the summaries and randomly sample tokens from it. • The output obtained is fluent English, leading to a high ROUGE score, but the content is unrelated, which leads to a low Content F1 score.
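
A sketch of how such a baseline could sample a summary; `lm_step` is a hypothetical function wrapping a trained RNN LM that returns next-token probabilities and an updated hidden state:

```python
# Sketch of the random baseline: sample tokens from a language model trained
# on the summaries. `lm_step` is hypothetical; any trained RNN LM would do.
import torch

def sample_summary(lm_step, vocab, max_len=33, bos=0, eos=1):
    """max_len matches the ~33-word average summary length of How2."""
    token, state, out = bos, None, []
    for _ in range(max_len):
        probs, state = lm_step(token, state)    # (V,) next-token distribution
        token = torch.multinomial(probs, 1).item()
        if token == eos:
            break
        out.append(vocab[token])
    return " ".join(out)
```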

  17. Experiment • Rule-based baseline: extract the sentence containing the words "how to" with the predicates learn, tell, show, discuss, or explain; this is usually the second sentence in the transcript.
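
A short sketch of that rule, assuming the transcript is already split into sentences:

```python
# Sketch of the rule-based baseline: pick the transcript sentence that
# contains "how to" together with one of the listed predicates.
PREDICATES = ("learn", "tell", "show", "discuss", "explain")

def rule_based_summary(transcript_sentences: list[str]) -> str | None:
    for sent in transcript_sentences:
        lowered = sent.lower()
        if "how to" in lowered and any(p in lowered for p in PREDICATES):
            return sent  # usually the second sentence in the transcript
    return None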

  18. Experiment • Nearest-neighbor baseline: trained with the summary of the nearest neighbor of each video in the Latent Dirichlet Allocation (LDA) based topic space as the target.
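
A hedged sketch of this baseline using scikit-learn, assuming parallel lists of transcripts and summaries; the topic count is illustrative:

```python
# Sketch of the LDA nearest-neighbor baseline: embed transcripts in an LDA
# topic space and take the summary of each video's nearest neighbor as the
# training target.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import NearestNeighbors

def nearest_neighbor_targets(transcripts, summaries, n_topics=100):
    counts = CountVectorizer(stop_words="english").fit_transform(transcripts)
    topics = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    # ask for two neighbors: the closest is the video itself, so take the second
    _, idx = NearestNeighbors(n_neighbors=2).fit(topics).kneighbors(topics)
    return [summaries[i[1]] for i in idx]
```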

  19. Experiment • The text-only model performs best when using the complete transcript (650 tokens) as input. This is in contrast to prior work on news-domain summarization.

  20. Experiment • Pointer-generator (PG) networks do not perform better than S2S models on this data, which could be attributed to the abstractive nature of the summaries and to the lack of common n-gram overlap between input and output, which is the important feature of PG networks. • With ASR transcripts as input, performance degrades noticeably.

  21. Experiment • Video-only models achieve almost competitive ROUGE and Content F1 scores compared to the text-only model, showing the importance of both modalities in this task. • Two input variants: a single mean-pooled feature vector vs. a sequence of feature vectors.
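
In code terms, the two input variants differ only in whether the per-clip features are averaged first; a sketch with illustrative shapes:

```python
# Sketch: the two ways the 2048-dim action features can enter the model.
import torch

features = torch.randn(40, 2048)   # e.g. 40 non-overlapping 16-frame clips
pooled = features.mean(dim=0)      # (2048,): single mean-pooled vector
sequence = features                # (40, 2048): full sequence, usable with attention
```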

  22. Experiment • Hierarchical attention model that combines both modalities obtains the highest score.
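
A minimal sketch of the hierarchical attention idea, assuming a decoder query vector and per-modality encoder states; the projections and shapes are illustrative rather than the paper's exact parametrization:

```python
# Sketch: attend within each modality, then attend over the two modality
# contexts. proj_t / proj_v are per-modality linear layers supplied by the
# caller, e.g. torch.nn.Linear(512, 512).
import torch
import torch.nn.functional as F

def hierarchical_attention(query, text_states, video_states, proj_t, proj_v):
    """query: (B, D); *_states: (B, T, D) -> fused context (B, D)."""
    contexts = []
    for states, proj in ((text_states, proj_t), (video_states, proj_v)):
        states = proj(states)                           # modality-specific projection
        scores = torch.bmm(states, query.unsqueeze(2))  # (B, T, 1) dot-product scores
        alpha = F.softmax(scores, dim=1)                # attention within one modality
        contexts.append((alpha * states).sum(dim=1))    # (B, D) modality context
    ctx = torch.stack(contexts, dim=1)                  # (B, 2, D)
    beta = F.softmax(torch.bmm(ctx, query.unsqueeze(2)), dim=1)
    return (beta * ctx).sum(dim=1)                      # second-level attention fuses both
```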

  23. Human Evaluation • Informativeness, relevance, coherence, and fluency

  24. Word distributions • Most model outputs are shorter than human annotations but very similar to each other in length, showing that the improvements in ROUGE-L and Content F1 scores stem from the difference in content rather than length.

  25. Attention Analysis: painting video • Axes: input time-steps (from the transcript) vs. the output summary of the model. • The model pays less attention to the first part of the video, where the speaker is introducing the task and preparing the brush. • When the camera focuses on a close-up of brush strokes with the hand, the model pays higher attention over consecutive frames. • When the close-up contains only the paper and brush but not the hand, attention is lower, which could be due to unrecognized actions in the close-up.
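
A sketch of how such an attention map can be visualized; `attn` is assumed to be a (summary_len, input_len) array of attention weights:

```python
# Sketch: plot a decoder attention matrix as a heatmap, with transcript
# time-steps on the x-axis and generated summary tokens on the y-axis.
import matplotlib.pyplot as plt

def plot_attention(attn, summary_tokens):
    fig, ax = plt.subplots()
    ax.imshow(attn, aspect="auto", cmap="viridis")
    ax.set_yticks(range(len(summary_tokens)))
    ax.set_yticklabels(summary_tokens)
    ax.set_xlabel("input time-steps (transcript)")
    ax.set_ylabel("output summary tokens")
    fig.tight_layout()
    plt.show()
```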

  26. Case Study

  27. Thanks!
