SLIDE 1

Multimodal Abstractive Summarization for How2 Videos

ACL 2019
Shruti Palaskar, Jindřich Libovický, Spandana Gella, Florian Metze
School of Computer Science, Carnegie Mellon University; Faculty of Mathematics and Physics, Charles University; Amazon AI

Presenter: Xiachong Feng

SLIDE 2

Outline

  • Author
  • Background
  • Task
  • Dataset
  • Metric
  • Experiment
SLIDE 3

Author

  • PhD student at the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University.
  • Research interests: multimodal machine learning, speech recognition, and natural language processing.

SLIDE 4

Background

Human information processing is inherently multimodal, and language is best understood in a situated context.

Automatic speech recognition (ASR), computer vision (CV), and natural language processing (NLP)

SLIDE 5

Task

  • Multimodal summarization
  • Video summarization
  • Text summarization
SLIDE 6

Search and Retrieve Relevant Videos

SLIDE 7

Dataset: How2

SLIDE 8

Dataset

Training                73,993 videos
Validation               2,965 videos
Testing                  2,156 videos
Average input length       291 words
Average summary length      33 words

  • 2,000 hours of short instructional videos, spanning different domains such as cooking, sports, indoor/outdoor activities, music, etc.
  • Each video is accompanied by a human-generated transcript and a 2 to 3 sentence summary.

SLIDE 9

Model

  • Video-based Summarization
  • Speech-based Summarization
SLIDE 10

Video-based Summarization

  • Pre-trained action recognition model: a ResNeXt-101 3D Convolutional Neural Network
  • Recognizes 400 different human actions
SLIDE 11

Actions

SLIDE 12

Video-based Summarization

  • 2048-dimensional feature vectors, extracted for every 16 non-overlapping frames (a feature-extraction sketch follows)
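A minimal sketch of this clip-level feature extraction, assuming the video is already decoded into a frame array and `action_model` stands in (hypothetically) for the penultimate layer of the pretrained ResNeXt-101 3D CNN:

```python
# Chunk a video into non-overlapping 16-frame clips and extract one
# 2048-dim feature vector per clip, as described on the slide.
import numpy as np

CLIP_LEN = 16      # non-overlapping 16-frame clips
FEAT_DIM = 2048    # penultimate-layer dimensionality

def extract_video_features(video: np.ndarray, action_model) -> np.ndarray:
    """video: (num_frames, H, W, 3). Returns (num_clips, 2048) features."""
    num_clips = len(video) // CLIP_LEN
    feats = np.empty((num_clips, FEAT_DIM), dtype=np.float32)
    for i in range(num_clips):
        clip = video[i * CLIP_LEN:(i + 1) * CLIP_LEN]   # (16, H, W, 3)
        feats[i] = action_model(clip)                   # hypothetical call
    return feats
```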
SLIDE 13

Speech-based Summarization

  • Pretrained speech recognizer
  • Uses the state-of-the-art models for distant-microphone conversational speech recognition, ASpIRE and EESEN (an illustrative sketch follows)

Audio → Text
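ASpIRE and EESEN are Kaldi-based recognizers driven by shell recipes rather than a Python API. Purely to illustrate the audio-to-text step, here is a sketch using torchaudio's pretrained Wav2Vec2 pipeline, a stand-in and not the paper's recognizer; `talk.wav` is a hypothetical input file:

```python
# Transcribe a mono wav file with a pretrained Wav2Vec2 CTC model.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()                 # CTC symbols; index 0 is the blank

waveform, sr = torchaudio.load("talk.wav")   # (1, T) for mono audio
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)           # (batch, time, num_labels)

# Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
ids = emissions[0].argmax(dim=-1).tolist()
tokens = [labels[i] for i, prev in zip(ids, [None] + ids) if i != prev and i != 0]
transcript = "".join(tokens).replace("|", " ").strip()  # '|' marks word breaks
print(transcript)
```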

SLIDE 14

Summarization Models

SLIDE 15

Content F1

  • 1. Use the METEOR toolkit to obtain the alignment between the reference and the generated summary.
  • 2. Remove function words and task-specific stop words.
  • 3. Compute the F1 score over the alignment (sketched below).
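A minimal sketch of the metric, assuming the METEOR alignment is already available as a list of matched (ref_token, gen_token) pairs and `stopwords` holds the function words and task-specific stop words:

```python
# Content F1: F1 over METEOR-aligned content words, per the steps above.
def content_f1(ref_tokens, gen_tokens, alignment, stopwords):
    ref = [t for t in ref_tokens if t not in stopwords]
    gen = [t for t in gen_tokens if t not in stopwords]
    matches = [(r, g) for r, g in alignment
               if r not in stopwords and g not in stopwords]
    if not ref or not gen or not matches:
        return 0.0
    precision = len(matches) / len(gen)  # matched share of generated content words
    recall = len(matches) / len(ref)     # matched share of reference content words
    return 2 * precision * recall / (precision + recall)
```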
SLIDE 16

Experiment

  • Train an RNN language model on all the summaries and randomly sample tokens from it (a sampling sketch follows).
  • The output obtained is fluent English, leading to a high ROUGE score, but the content is unrelated to the video, which leads to a low Content F1 score.
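A sketch of the sampling step, assuming a hypothetical `lm` interface that maps the previous token id (plus recurrent state) to next-token logits:

```python
# Ancestral sampling from a trained summary language model.
import torch

def sample_summary(lm, bos_id, eos_id, max_len=33, temperature=1.0):
    """Draw one summary by sampling token-by-token from the LM."""
    tokens, state = [bos_id], None
    for _ in range(max_len):
        logits, state = lm(torch.tensor([tokens[-1]]), state)  # hypothetical call
        probs = torch.softmax(logits.squeeze() / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1).item()
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS marker
```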

SLIDE 17

Experiment

  • Rule-based baseline: extract the sentence containing the words "how to" together with one of the predicates learn, tell, show, discuss, or explain; this is usually the second sentence in the transcript (see the sketch below).
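A minimal sketch of this rule; falling back to the second sentence when no match is found is an assumption based on the observation above:

```python
# Extract the "how to" sentence with a cue predicate from a transcript.
PREDICATES = ("learn", "tell", "show", "discuss", "explain")

def rule_based_summary(sentences):
    """Return the first sentence mentioning "how to" plus a cue predicate."""
    for sent in sentences:
        lowered = sent.lower()
        if "how to" in lowered and any(p in lowered for p in PREDICATES):
            return sent
    # Assumed fallback: the slide notes the match is usually sentence two.
    return sentences[1] if len(sentences) > 1 else sentences[0]
```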
SLIDE 18

Experiment

  • Nearest-neighbor baseline: trained with the summary of each video's nearest neighbor in a Latent Dirichlet Allocation (LDA) based topic space as the target (sketch below).
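A sketch of the neighbor lookup with scikit-learn; the vectorization details and topic count are assumptions, not the paper's exact setup:

```python
# Find each transcript's nearest neighbor in LDA topic space and use
# that neighbor's summary as the target.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

def nearest_neighbor_targets(transcripts, summaries, n_topics=100):
    counts = CountVectorizer(stop_words="english").fit_transform(transcripts)
    topics = LatentDirichletAllocation(n_components=n_topics).fit_transform(counts)
    # Ask for two neighbors: the closest is the video itself, so take the second.
    nn = NearestNeighbors(n_neighbors=2).fit(topics)
    _, idx = nn.kneighbors(topics)
    return [summaries[row[1]] for row in idx]
```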

SLIDE 19

Experiment

  • The text-only model performs best when using the complete transcript as input (650 tokens).
  • This contrasts with prior work on news-domain summarization, where models typically use only the first few hundred tokens of the article.
SLIDE 20

Experiment

  • Pointer-generator (PG) networks do not perform better than S2S models on this data, which could be attributed to the abstractive nature of the summaries and the lack of common n-gram overlap between input and output, the feature that PG networks exploit.
  • With ASR transcripts as input, performance degrades noticeably.
SLIDE 21

Experiment

  • Video-only model input: either a single mean-pooled feature vector or the full sequence of feature vectors (pooling sketch below)
  • Achieves almost competitive ROUGE and Content F1 scores compared to the text-only model, showing the importance of both modalities in this task.
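A two-line sketch of the two input representations, using random stand-in clip features:

```python
# The sequence representation keeps one vector per clip; mean pooling
# collapses the whole video into a single 2048-dim vector.
import numpy as np

feats = np.random.rand(120, 2048).astype(np.float32)  # stand-in clip features

sequence_input = feats             # (num_clips, 2048): one vector per clip
pooled_input = feats.mean(axis=0)  # (2048,): single mean-pooled vector
```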

SLIDE 22

Experiment

  • The hierarchical attention model that combines both modalities obtains the highest score (a sketch follows).
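A sketch of the hierarchical idea: attend within each modality, then attend over the per-modality contexts. The dot-product scoring and the projection matrices are assumptions, not the paper's exact parameterization:

```python
# Two-level attention fusing a text context and a video context.
import torch
import torch.nn.functional as F

def hierarchical_attention(query, text_states, video_states, proj_t, proj_v):
    """query: (d,); *_states: (T, d); proj_*: (d, d) modality projections."""
    contexts = []
    for states in (text_states, video_states):
        scores = states @ query                 # (T,) dot-product attention
        weights = F.softmax(scores, dim=0)
        contexts.append(weights @ states)       # (d,) per-modality context
    # Second level: attend over the two projected modality contexts.
    ctx = torch.stack([contexts[0] @ proj_t, contexts[1] @ proj_v])  # (2, d)
    top = F.softmax(ctx @ query, dim=0)         # (2,) modality weights
    return top @ ctx                            # (d,) fused context vector
```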

SLIDE 23

Human Evaluation

  • Informativeness, relevance, coherence, and fluency
SLIDE 24

Word distributions

  • Most model outputs are shorter than the human annotations.
  • The different models' outputs are very similar in length, showing that the improvements in ROUGE-L and Content F1 scores stem from differences in content rather than length (a quick length check is sketched below).
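A small sketch of the kind of length statistics behind this comparison:

```python
# Compare token-length statistics of model outputs and references.
import statistics

def length_stats(summaries):
    lengths = [len(s.split()) for s in summaries]
    return statistics.mean(lengths), statistics.stdev(lengths)

# e.g. compare length_stats(model_outputs) vs. length_stats(human_references)
```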

SLIDE 25

Attention Analysis: Painting

  • Attention over the model's input time-steps (from the transcript) for the output summary.
  • Less attention in the first part of the video, where the speaker is introducing the task and preparing the brush.
  • When the camera focuses on a close-up of brush strokes with the hand, the model pays higher attention over consecutive frames.
  • When the close-up shows only the paper and brush without the hand, attention drops, which could be due to unrecognized actions in the close-up.

SLIDE 26

Case Study

SLIDE 27

Thanks!