SLIDE 1

Multimodal Abstractive Summarization for How2 Videos

ACL 2019
Shruti Palaskar, Jindřich Libovický, Spandana Gella, Florian Metze
School of Computer Science, Carnegie Mellon University; Faculty of Mathematics and Physics, Charles University; Amazon AI

Presenter: Xiachong Feng

SLIDE 2

Outline

  • Author
  • Background
  • Task
  • Dataset
  • Metric
  • Experiment
SLIDE 3

Author

  • PhD student at the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University.
  • Research interests: multimodal machine learning, speech recognition, and natural language processing.

SLIDE 4

Background

Human information processing is inherently multimodal, and language is best understood in a situated context.

Automatic speech recognition (ASR), computer vision (CV), and natural language processing (NLP)

SLIDE 5

Task

  • Multimodal summarization
  • Video summarization
  • Text summarization
SLIDE 6

Search and Retrieve Relevant Videos

SLIDE 7

Dataset: How2

SLIDE 8

Dataset

Training                73,993 videos
Validation               2,965 videos
Testing                  2,156 videos
Average input length       291 words
Average summary length      33 words

  • 2,000 hours of short instructional videos, spanning different domains such as cooking, sports, indoor/outdoor activities, music, etc.
  • Each video is accompanied by a human-generated transcript and a 2 to 3 sentence summary.

SLIDE 9

Model

  • Video-based Summarization
  • Speech-based Summarization
SLIDE 10

Video-based Summarization

  • Pre-trained action recognition model: a ResNeXt-101 3D Convolutional Neural Network
  • Recognizes 400 different human actions
SLIDE 11

Actions

SLIDE 12

Video-based Summarization

  • 2048-dimensional feature vectors, extracted for every 16 non-overlapping frames (a feature-extraction sketch follows)
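A minimal sketch of this clip-level feature extraction, assuming the video is already decoded into a frame array and `action_model` stands in (hypothetically) for the penultimate layer of the pretrained ResNeXt-101 3D CNN:

```python
# Chunk a video into non-overlapping 16-frame clips and extract one
# 2048-dim feature vector per clip, as described on the slide.
import numpy as np

CLIP_LEN = 16      # non-overlapping 16-frame clips
FEAT_DIM = 2048    # penultimate-layer dimensionality

def extract_video_features(video: np.ndarray, action_model) -> np.ndarray:
    """video: (num_frames, H, W, 3). Returns (num_clips, 2048) features."""
    num_clips = len(video) // CLIP_LEN
    feats = np.empty((num_clips, FEAT_DIM), dtype=np.float32)
    for i in range(num_clips):
        clip = video[i * CLIP_LEN:(i + 1) * CLIP_LEN]   # (16, H, W, 3)
        feats[i] = action_model(clip)                   # hypothetical call
    return feats
```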
SLIDE 13

Speech-based Summarization

  • Pretrained speech recognizer
  • Uses the state-of-the-art models for distant-microphone conversational speech recognition, ASpIRE and EESEN (an illustrative sketch follows)

Audio → Text
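ASpIRE and EESEN are Kaldi-based recognizers driven by shell recipes rather than a Python API. Purely to illustrate the audio-to-text step, here is a sketch using torchaudio's pretrained Wav2Vec2 pipeline, a stand-in and not the paper's recognizer; `talk.wav` is a hypothetical input file:

```python
# Transcribe a mono wav file with a pretrained Wav2Vec2 CTC model.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()                 # CTC symbols; index 0 is the blank

waveform, sr = torchaudio.load("talk.wav")   # (1, T) for mono audio
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)           # (batch, time, num_labels)

# Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
ids = emissions[0].argmax(dim=-1).tolist()
tokens = [labels[i] for i, prev in zip(ids, [None] + ids) if i != prev and i != 0]
transcript = "".join(tokens).replace("|", " ").strip()  # '|' marks word breaks
print(transcript)
```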

SLIDE 14

Summarization Models

SLIDE 15

Content F1

  • 1. Use the METEOR toolkit to obtain the alignment between the reference and the generated summary.
  • 2. Remove function words and task-specific stop words.
  • 3. Compute the F1 score over the alignment (sketched below).
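A minimal sketch of the metric, assuming the METEOR alignment is already available as a list of matched (ref_token, gen_token) pairs and `stopwords` holds the function words and task-specific stop words:

```python
# Content F1: F1 over METEOR-aligned content words, per the steps above.
def content_f1(ref_tokens, gen_tokens, alignment, stopwords):
    ref = [t for t in ref_tokens if t not in stopwords]
    gen = [t for t in gen_tokens if t not in stopwords]
    matches = [(r, g) for r, g in alignment
               if r not in stopwords and g not in stopwords]
    if not ref or not gen or not matches:
        return 0.0
    precision = len(matches) / len(gen)  # matched share of generated content words
    recall = len(matches) / len(ref)     # matched share of reference content words
    return 2 * precision * recall / (precision + recall)
```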
SLIDE 16

Experiment

  • Train an RNN language model on all the summaries and randomly sample tokens from it (a sampling sketch follows).
  • The output obtained is fluent English, leading to a high ROUGE score, but the content is unrelated to the video, which leads to a low Content F1 score.
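A sketch of the sampling step, assuming a hypothetical `lm` interface that maps the previous token id (plus recurrent state) to next-token logits:

```python
# Ancestral sampling from a trained summary language model.
import torch

def sample_summary(lm, bos_id, eos_id, max_len=33, temperature=1.0):
    """Draw one summary by sampling token-by-token from the LM."""
    tokens, state = [bos_id], None
    for _ in range(max_len):
        logits, state = lm(torch.tensor([tokens[-1]]), state)  # hypothetical call
        probs = torch.softmax(logits.squeeze() / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS marker
```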

SLIDE 17

Experiment

  • Rule-based baseline: extract the sentence containing the words "how to" together with one of the predicates learn, tell, show, discuss, or explain; this is usually the second sentence in the transcript (see the sketch below).
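A minimal sketch of this rule; falling back to the second sentence when no match is found is an assumption based on the observation above:

```python
# Extract the "how to" sentence with a cue predicate from a transcript.
PREDICATES = ("learn", "tell", "show", "discuss", "explain")

def rule_based_summary(sentences):
    """Return the first sentence mentioning "how to" plus a cue predicate."""
    for sent in sentences:
        lowered = sent.lower()
        if "how to" in lowered and any(p in lowered for p in PREDICATES):
            return sent
    # Assumed fallback: the slide notes the match is usually sentence two.
    return sentences[1] if len(sentences) > 1 else sentences[0]
```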
SLIDE 18

Experiment

  • Nearest-neighbor baseline: trained with the summary of each video's nearest neighbor in a Latent Dirichlet Allocation (LDA) based topic space as the target (sketch below).
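A sketch of the neighbor lookup with scikit-learn; the vectorization details and topic count are assumptions, not the paper's exact setup:

```python
# Find each transcript's nearest neighbor in LDA topic space and use
# that neighbor's summary as the target.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

def nearest_neighbor_targets(transcripts, summaries, n_topics=100):
    counts = CountVectorizer(stop_words="english").fit_transform(transcripts)
    topics = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    # Ask for two neighbors: the closest is the video itself, so take the second.
    nn = NearestNeighbors(n_neighbors=2).fit(topics)
    _, idx = nn.kneighbors(topics)
    return [summaries[row[1]] for row in idx]
```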

SLIDE 19

Experiment

  • The text-only model performs best when using the complete transcript as input (650 tokens).
  • This contrasts with prior work on news-domain summarization, where models typically use only the first few hundred tokens of the article.
SLIDE 20

Experiment

  • Pointer-generator (PG) networks do not perform better than S2S models on this data, which could be attributed to the abstractive nature of the summaries and the lack of common n-gram overlap between input and output, the feature that PG networks exploit.
  • With ASR transcripts as input, performance degrades noticeably.
SLIDE 21

Experiment

  • Video-only model input: either a single mean-pooled feature vector or the full sequence of feature vectors (pooling sketch below)
  • Achieves almost competitive ROUGE and Content F1 scores compared to the text-only model, showing the importance of both modalities in this task.
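A two-line sketch of the two input representations, using random stand-in clip features:

```python
# The sequence representation keeps one vector per clip; mean pooling
# collapses the whole video into a single 2048-dim vector.
import numpy as np

feats = np.random.rand(120, 2048).astype(np.float32)  # stand-in clip features

sequence_input = feats             # (num_clips, 2048): one vector per clip
pooled_input = feats.mean(axis=0)  # (2048,): single mean-pooled vector
```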

SLIDE 22

Experiment

  • The hierarchical attention model that combines both modalities obtains the highest score (a sketch follows).
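A sketch of the hierarchical idea: attend within each modality, then attend over the per-modality contexts. The dot-product scoring and the projection matrices are assumptions, not the paper's exact parameterization:

```python
# Two-level attention fusing a text context and a video context.
import torch
import torch.nn.functional as F

def hierarchical_attention(query, text_states, video_states, proj_t, proj_v):
    """query: (d,); *_states: (T, d); proj_*: (d, d) modality projections."""
    contexts = []
    for states in (text_states, video_states):
        scores = states @ query                 # (T,) dot-product attention
        weights = F.softmax(scores, dim=0)
        contexts.append(weights @ states)       # (d,) per-modality context
    # Second level: attend over the two projected modality contexts.
    ctx = torch.stack([contexts[0] @ proj_t, contexts[1] @ proj_v])  # (2, d)
    top = F.softmax(ctx @ query, dim=0)         # (2,) modality weights
    return top @ ctx                            # (d,) fused context vector
```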

SLIDE 23

Human Evaluation

  • Informativeness, relevance, coherence, and fluency
SLIDE 24

Word distributions

  • Most model outputs are shorter than the human annotations.
  • The different models' outputs are very similar in length, showing that the improvements in ROUGE-L and Content F1 scores stem from differences in content rather than length (a quick length check is sketched below).
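A small sketch of the kind of length statistics behind this comparison:

```python
# Compare token-length statistics of model outputs and references.
import statistics

def length_stats(summaries):
    lengths = [len(s.split()) for s in summaries]
    return statistics.mean(lengths), statistics.stdev(lengths)

# e.g. compare length_stats(model_outputs) vs. length_stats(human_references)
```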

SLIDE 25

Attention Analysis: Painting

  • Attention over the model's input time-steps (from the transcript) for the output summary.
  • Less attention in the first part of the video, where the speaker is introducing the task and preparing the brush.
  • When the camera focuses on a close-up of brush strokes with the hand, the model pays higher attention over consecutive frames.
  • When the close-up shows only the paper and brush without the hand, attention drops, which could be due to unrecognized actions in the close-up.

SLIDE 26

Case Study

SLIDE 27

Thanks!