[PPT] - Towards generating stories about video Anna Rohrbach The PowerPoint Presentation

SLIDE 1

Towards generating stories about video

Anna Rohrbach

The End-of-End-to-End: A Video Understanding Pentathlon, CVPR 2020

https://anna-rohrbach.net

SLIDE 2

Let’s look at a human generated video description

A young singer with moppy dark brown hair strums a guitar at the mic. Debbie brings Pete a bottle of beer. Setting her own drink down, she faces the stage and takes his hand. Pete shrugs and gives a delighted smile. Debbie smiles encouragingly.

SLIDE 3

Let’s look at a human generated video description

Human descriptions …
are relevant to the video
are coherent and non-redundant
mention distinct person identities and make use of co-references (e.g. she)
Besides, they …
may contain references to objects and places significant to the story, including named entities (e.g. they meet at the Denny’s)
may require common sense for deeper understanding of events (e.g. they make up after quarreling)
And much more (connect to audio, dialog, etc.)

A young singer with moppy dark brown hair strums a guitar at the mic. Debbie brings Pete a bottle of beer. Setting her own drink down, she faces the stage and takes his hand. Pete shrugs and gives a delighted smile. Debbie smiles encouragingly.

SLIDE 4

This talk

Connecting video description to person identities
His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

Coherent and diverse multi-sentence video description
Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

SLIDE 6

Multi-Sentence Video Description

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Visually Relevant
Linguistically Fluent
Diverse & Coherent Across Sentences

Park et al. Adversarial Inference for Multi-Sentence Video Description. CVPR 2019

SLIDE 7

Multi-Sentence Video Description

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Masked Transformer (Zhou et al.)

A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.)

A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.

SLIDE 10

Multi-Sentence Video Description

Content Error
Incoherent Sentence
Repetition Across Sentences

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Masked Transformer (Zhou et al.)

A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.)

A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.

SLIDE 12

Multi-Sentence Video Description

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Masked Transformer (Zhou et al.)

A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.)

A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.

Adversarial Inference (Ours)

A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

SLIDE 13

Conventional Video Captioning Model

SLIDE 14

Conventional Video Captioning Model

MLE Training: Generator, Maximum Likelihood Estimation (MLE). Example: People are riding down …
Inference: Generator, Greedy Max / Beam Search. Example: A man is seen speaking …

Favors frequent n-grams in training set
Explores limited vocabulary space

SLIDE 15

GANs for More Humanlike Descriptions

Towards Diverse and Natural Image Descriptions via a Conditional GAN [Dai et al. ICCV17]
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training [Shetty et al. ICCV17]
Improving Image Captioning with Conditional Generative Adversarial Nets [Chen et al. AAAI 2019]

SLIDE 16

GANs for More Humanlike Descriptions

[Diagram: MLE Pre-Training, Adversarial Training, Inference, with Generator and Discriminator (Real/Fake) blocks. Example sentences: A man is seen speaking …; People are riding down …]

SLIDE 17

GANs for More Humanlike Descriptions

[Diagram: MLE Pre-Training, Adversarial Training, Inference, with Generator and Discriminator (Real/Fake) blocks. Example sentences: A man is seen speaking …; People are riding down …]

Stable Training for Text Generation is Difficult

SLIDE 18

Our Adversarial Inference Approach

[Diagram: MLE Pre-Training, Adversarial Training (marked SKIP), Inference, with Generator and Discriminator (Real/Fake) blocks. Example sentences: A man is seen speaking …; People are riding down …]

SLIDE 19

Our Adversarial Inference Approach

[Diagram: MLE Pre-Training, then Adversarial Inference: Sampling from the Generator, with a Discriminator (Real/Fake). Example sentence: People are riding down …]

Sampled candidates:
People are in a raft …
People are kayaking …
A man is seen speaking …
People are shown in the …
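A minimal sketch of the sample-then-select idea on this slide, assuming the discriminator is used only to re-rank sentences sampled from a fixed, MLE-trained generator. The names `adversarial_inference`, `toy_generator`, `toy_discriminator` and all scores are hypothetical stand-ins, not the released code.

```python
import random
from typing import Callable


def adversarial_inference(
    sample_sentence: Callable[[random.Random], str],
    discriminator_score: Callable[[str], float],
    num_samples: int = 5,
    seed: int = 0,
) -> str:
    """Sample candidates from a fixed generator and keep the highest-scoring one."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    rng = random.Random(seed)
    candidates = [sample_sentence(rng) for _ in range(num_samples)]
    # The generator is never updated here: the discriminator only re-ranks samples.
    return max(candidates, key=discriminator_score)


# Hypothetical stand-ins for the trained models (scores are made up).
CANDIDATE_POOL = [
    "People are in a raft in a large raft in the water",
    "People are kayaking down the river",
    "A man is seen speaking to the camera",
    "People are shown in the water riding a boat.",
]
MADE_UP_SCORES = dict(zip(CANDIDATE_POOL, [0.2, 0.5, 0.1, 0.9]))


def toy_generator(rng: random.Random) -> str:
    # Stand-in for sampling a sentence from the captioning model.
    return rng.choice(CANDIDATE_POOL)


def toy_discriminator(sentence: str) -> float:
    # Stand-in for a learned real/fake score.
    return MADE_UP_SCORES[sentence]


if __name__ == "__main__":
    print(adversarial_inference(toy_generator, toy_discriminator, num_samples=20))
```

Because selection happens only at inference time, the generator's training stays plain MLE, which is what makes the approach easy and stable compared with joint adversarial training.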

SLIDE 20

Adversarial Inference - Hybrid Discriminator

Previous: A man is seen speaking to the camera …

Sample:
People are kayaking down the river…
People are in a raft in a large raft in the water…
A man is seen speaking to the camera …
People are shown in the water riding a boat.

Hybrid Discriminator

SLIDE 21

Adversarial Inference - Hybrid Discriminator

Previous: A man is seen speaking to the camera …

Sample:
People are kayaking down the river…
People are in a raft in a large raft in the water…
A man is seen speaking to the camera …
People are shown in the water riding a boat.

Hybrid Discriminator
Visual Discriminator: Visually Relevant
Language Discriminator: Linguistically Fluent
Pairwise Discriminator: Diverse & Coherent Across Sentences

Selected: People are shown in the water riding a boat.
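One way to picture the hybrid discriminator is as a combination of three per-sentence scores. The sketch below assumes a simple weighted sum; the weights, the `DiscriminatorScores` container and every number are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DiscriminatorScores:
    visual: float    # is the sentence relevant to the video?
    language: float  # is the sentence fluent?
    pairwise: float  # is it coherent and non-redundant given the previous sentence?


def hybrid_score(
    scores: DiscriminatorScores,
    w_visual: float = 1.0,
    w_language: float = 1.0,
    w_pairwise: float = 1.0,
) -> float:
    """Weighted sum of the three scores (the weighting scheme is an assumption)."""
    return (
        w_visual * scores.visual
        + w_language * scores.language
        + w_pairwise * scores.pairwise
    )


def select_sentence(candidates: Dict[str, DiscriminatorScores]) -> str:
    """Pick the candidate with the highest hybrid score."""
    return max(candidates, key=lambda s: hybrid_score(candidates[s]))


# Made-up scores for the four sampled sentences on the slide.
TOY_CANDIDATES = {
    "People are kayaking down the river": DiscriminatorScores(0.4, 0.9, 0.8),
    "People are in a raft in a large raft in the water": DiscriminatorScores(0.7, 0.3, 0.8),
    # Fluent, but repeats the previous sentence, so the pairwise score is low.
    "A man is seen speaking to the camera": DiscriminatorScores(0.6, 0.9, 0.1),
    "People are shown in the water riding a boat.": DiscriminatorScores(0.8, 0.9, 0.8),
}

best = select_sentence(TOY_CANDIDATES)

if __name__ == "__main__":
    print(best)
```

With these toy numbers the repeated sentence loses on the pairwise term and the disfluent one loses on the language term, which is the behaviour the three discriminators are meant to encourage.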

SLIDE 22

Comparison to baselines at paragraph-level

Dataset: ActivityNet Captions
Baselines: MLE, GAN, MLE+SingleDis
Ours: MLE+HybridDis

[Bar chart: METEOR (axis ticks 10 to 18) for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

SLIDE 23

Improved Vocabulary Size

[Bar chart: Vocabulary Size (axis ticks 1600 to 2400) for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

SLIDE 24

Increased Language Diversity

[Bar chart: 2-GRAM Diversity (axis ticks 0.7 to 0.78) for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]
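For intuition, here is a small sketch of an n-gram diversity score. It assumes diversity is the number of unique n-grams divided by the total number of n-grams in a paragraph; the exact definition behind the chart may differ, and the helper names are hypothetical.

```python
import re
from typing import List, Sequence, Tuple


def tokenize(sentence: str) -> List[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_diversity(sentences: Sequence[str], n: int = 2) -> float:
    """Unique n-grams / total n-grams, pooled over the sentences of one paragraph."""
    grams = [g for s in sentences for g in ngrams(tokenize(s), n)]
    return len(set(grams)) / len(grams) if grams else 0.0


# The two generated paragraphs for the river video shown earlier in the deck.
MOVE_FORWARD_AND_TELL = [
    "A man is seen speaking to the camera and leads into a man speaking to the camera.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
]
ADVERSARIAL_INFERENCE = [
    "A man is seen speaking to the camera and leads into several people riding down a rough river.",
    "People are shown in the water riding a boat.",
    "Several people are shown in slow motion as well as people riding along.",
]

if __name__ == "__main__":
    print(ngram_diversity(MOVE_FORWARD_AND_TELL), ngram_diversity(ADVERSARIAL_INFERENCE))
```

Under this definition the repetitive paragraph scores lower than the varied one, matching the direction of the chart.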

SLIDE 25

Decreased Repetition across Sentences

[Bar chart: 4-GRAM Repetition (axis ticks 0.05 to 0.09) for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]
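A rough sketch of a cross-sentence repetition score, assuming it is the fraction of 4-grams in a paragraph that already occurred in an earlier sentence of the same paragraph. This definition and the function names are assumptions for illustration; the chart's metric may be computed differently.

```python
import re
from typing import List, Sequence, Set, Tuple


def tokenize(sentence: str) -> List[str]:
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cross_sentence_repetition(sentences: Sequence[str], n: int = 4) -> float:
    """Fraction of n-grams that were already seen in an earlier sentence."""
    seen: Set[Tuple[str, ...]] = set()
    repeated = 0
    total = 0
    for sentence in sentences:
        grams = ngrams(tokenize(sentence), n)
        repeated += sum(1 for g in grams if g in seen)
        total += len(grams)
        # Update only after the sentence, so repeats inside one sentence are not counted.
        seen.update(grams)
    return repeated / total if total else 0.0


REPETITIVE = [
    "A man is seen speaking to the camera and leads into a man speaking to the camera.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
]
VARIED = [
    "A man is seen speaking to the camera and leads into several people riding down a rough river.",
    "People are shown in the water riding a boat.",
    "Several people are shown in slow motion as well as people riding along.",
]

if __name__ == "__main__":
    print(cross_sentence_repetition(REPETITIVE), cross_sentence_repetition(VARIED))
```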

SLIDE 26

Better human ratings for multi-sentence video descriptions

[Bar chart: Delta (%) between “Better than MLE” and “Worse than MLE” for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

SLIDE 27

Ground Truth

A number of women exercise together using a stepping type of implement. The camera pans slightly to the right. The camera pans slightly to the left.

Masked Transformer (Zhou et al.)

We see people in a room. They are dancing in a room. The people continue dancing around the room.

Move Forward and Tell (Xiong et al.)

A group of people are seen standing in a room with a man speaking to the camera. A group of people are inside a gym. A group of people are seen standing in a room with a man speaking to the camera.

Adversarial Inference (Ours)

A group of women are in a gym doing a synchronized move up and down on a stair stepper. They are doing the same dance in a synchronized manner. They are using a synchronized steppers to move.

Ours vs. SoTA

Captures Visual Content More Precisely

SLIDE 28

Key takeaways

Adversarial Inference outperforms joint training (GAN)
  Easy and stable
Hybrid Discriminator design is effective
  Captures visual, linguistic and pairwise consistency
Evaluation can be challenging
  Automatic metrics do not necessarily reflect an improvement
  Rely on diversity metrics and human evaluation

SLIDE 29

This talk

Connecting video description to person identities
His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

Coherent and diverse multi-sentence video description
Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

SLIDE 30

Our earlier work

Conventional task/model: Someone strides to the window.

Rohrbach et al. Generating Descriptions with Grounded and Co-Referenced People. CVPR 2017

SLIDE 31

Our earlier work

Conventional task/model: Someone strides to the window.
Our task/model:
Previous clip: Sophia gags as she pushes past him and walks out.

SLIDE 32

Our earlier work

Conventional task/model: Someone strides to the window.
Our task/model:
Previous clip: Sophia gags as she pushes past him and walks out.
Current clip: She and Jacob walk down the corridor.

SLIDE 33

LSMDC v2

Tackle a set of clips at once
End task: Identity-Aware Video Description
Auxiliary task: Fill-in the Characters

SLIDE 34

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

[PERSON1], [PERSON2], [PERSON3], … ???

SLIDE 35

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

[PERSON1] (male)

SLIDE 36

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

[PERSON1] (male), [PERSON2] (female)

SLIDE 37

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

[PERSON1] (male), [PERSON2] (female), ???

Need vision unless more context is available!

SLIDE 38

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off. […] folds her arms.

[PERSON1] (male), [PERSON2] (female), ???, ???

Need vision unless more context is available!

SLIDE 39

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off. […] folds her arms.

[PERSON1] (male), [PERSON2] (female), ???, [PERSON2] (female)

Need vision unless more context is available!

SLIDE 41

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off. […] folds her arms.

[PERSON1] (male), [PERSON2] (female), [PERSON1], [PERSON2] (female)

SLIDE 42

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off. […] folds her arms. […] approaches […], who leans against the wall of the house.

[PERSON1] (male), [PERSON2] (female), [PERSON1], [PERSON2] (female), ???

SLIDE 44

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off. […] folds her arms. […] approaches […], who leans against the wall of the house.

[PERSON1] (male), [PERSON2] (female), [PERSON1], [PERSON2] (female), [PERSON1], [PERSON3]

SLIDE 45

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off. […] folds her arms. […] approaches […], who leans against the wall of the house.

[PERSON1], [PERSON2], [PERSON1], [PERSON2], [PERSON1], [PERSON3]

SLIDE 46

LSMDC v2: Fill-in the Characters

Complex reasoning over a set of descriptions/clips
Language can be quite powerful, but video is still necessary!

SLIDE 47

LSMDC v2: What is new?

New annotations for LSMDC:
Every “person” instance (name and “he”/“she”) is labeled with an ID
Gender is labeled for each character
Each movie split into sets of 5 clips (or less)
Test ground-truth IDs are kept blind

Dataset         # Movies  # Sentences  # Sets  # Blanks
Training set    153       101,079      20,283  87,604
Validation set  12        7,408        1,486   6,457
Test set        17        10,053       2,018   8,431
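A quick sanity check of the table against the "sets of 5 clips (or less)" rule, assuming one sentence per clip (an assumption; the slide does not state it). The dictionary layout and function name are illustrative.

```python
from typing import Dict, Tuple

# Numbers copied from the table above.
LSMDC_V2_SPLITS: Dict[str, Dict[str, int]] = {
    "Training set": {"movies": 153, "sentences": 101_079, "sets": 20_283, "blanks": 87_604},
    "Validation set": {"movies": 12, "sentences": 7_408, "sets": 1_486, "blanks": 6_457},
    "Test set": {"movies": 17, "sentences": 10_053, "sets": 2_018, "blanks": 8_431},
}


def split_ratios(row: Dict[str, int]) -> Tuple[float, float]:
    """Return (sentences per set, blanks per sentence) for one split."""
    return row["sentences"] / row["sets"], row["blanks"] / row["sentences"]


if __name__ == "__main__":
    for name, row in LSMDC_V2_SPLITS.items():
        per_set, per_sentence = split_ratios(row)
        print(f"{name}: {per_set:.2f} sentences/set, {per_sentence:.2f} blanks/sentence")
```

Each split comes out just under 5 sentences per set, consistent with most sets holding 5 clips and a few holding fewer.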

SLIDE 48

LSMDC v2: Fill-in the Characters

[Bar chart: Distribution over local person IDs, % (axis ticks 20 to 60), for PERSON1 through PERSON11]

SLIDE 49

Our approach

[Figure: model overview. Example input across Clip 1 … Clip i, Clip i+1: "His brow furrowed, [...] looks down at the ground. [...] approaches [...], who leans against the wall." and "[...] folds her arms." Labeled components: 3D Conv, Face Cluster, Align Clip Seg to Faces, Blank Embed (Blank Text Embedding), Face attention with a Weighted sum (Blank-To-Face Linking), Transformer over Blank 1 … Blank b, Blank b+1, Blank b+2 (Mapping Blanks-To-IDs), Gender Prediction Loss (e.g. [FEMALE]). Example outputs: [PERSON 1], [PERSON 2], [PERSON 3].]
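The "Face attention" and "Weighted sum" boxes can be pictured as soft attention of a blank's text embedding over face-cluster features. The sketch below assumes dot-product scoring followed by a softmax; the scoring function, the vector sizes and the name `attend_faces` are assumptions, since the slide only names the components.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend_faces(
    blank_embedding: Sequence[float],
    face_clusters: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return attention weights over face clusters and their weighted sum."""
    # Dot-product relevance of each face cluster to the blank (assumed scoring).
    scores = [sum(b * f for b, f in zip(blank_embedding, face)) for face in face_clusters]
    weights = softmax(scores)
    dim = len(face_clusters[0])
    pooled = [sum(w * face[k] for w, face in zip(weights, face_clusters)) for k in range(dim)]
    return weights, pooled


if __name__ == "__main__":
    # Toy 2-d features: the blank is closer to the first face cluster.
    weights, pooled = attend_faces([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(weights, pooled)
```

The pooled vector is a visual representation of the blank, which a sequence model such as the Transformer in the figure could then consume alongside the text embedding.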

SLIDE 53

Automatic evaluation metric

We make pairwise comparisons within the ground truth and predicted IDs
For each pair, we assign “correct” if ground truth and predicted IDs are BOTH different or BOTH the same
Each set is scored as a ratio of “correct” pairs (e.g. 4/6)

Ground Truth: [PERSON1] [PERSON2] [PERSON1] [PERSON3]
Predictions: [PERSON1] [PERSON2] [PERSON3] [PERSON1]
Accuracy: 4/6
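The metric can be sketched in a few lines. The function name `pairwise_id_accuracy` and the choice to reject sets with fewer than two blanks are assumptions; the pair rule itself follows the slide.

```python
from fractions import Fraction
from itertools import combinations
from typing import Sequence


def pairwise_id_accuracy(ground_truth: Sequence[str], predicted: Sequence[str]) -> Fraction:
    """Fraction of blank pairs whose same/different relation matches the ground truth."""
    if len(ground_truth) != len(predicted):
        raise ValueError("ground truth and predictions must have the same length")
    pairs = list(combinations(range(len(ground_truth)), 2))
    if not pairs:
        # Assumption: a set with fewer than two blanks has no pairs to score.
        raise ValueError("need at least two blanks to form a pair")
    correct = sum(
        (ground_truth[i] == ground_truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return Fraction(correct, len(pairs))


# The example from the slide: 4 blanks, hence 6 pairs.
GROUND_TRUTH = ["PERSON1", "PERSON2", "PERSON1", "PERSON3"]
PREDICTIONS = ["PERSON1", "PERSON2", "PERSON3", "PERSON1"]

if __name__ == "__main__":
    # Fraction reduces 4/6 to 2/3.
    print(pairwise_id_accuracy(GROUND_TRUTH, PREDICTIONS))
```

Because only same/different relations are compared, the score does not depend on which label names a model uses: renaming every predicted ID consistently leaves it unchanged.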

SLIDE 54

Quantitative results

[Bar chart: Test Accuracy (axis ticks 20 to 100). Bars: Different IDs, Same ID]

SLIDE 55

Quantitative results

[Two bar charts: Test Accuracy (axis ticks 20 to 100). Bars: Different IDs, Same ID]

In fact:

SLIDE 56

Quantitative results

[Bar chart: Test Accuracy (axis ticks 58 to 72). Bars: Different IDs, Text only, Full model, Yu et al., Brown et al., Human w/o video]

SLIDE 57

Quantitative results

[Bar chart: Test Accuracy (axis ticks 58 to 72). Bars: Different IDs, Text only, Full model, Yu et al., Brown et al., Human w/o video. Annotation: SoTA 2019]

SLIDE 59

Quantitative results

[Bar chart: Test Accuracy (axis ticks 58 to 88). Bars: Different IDs, Text only, Full model, Yu et al., Brown et al., Human w/o video, Human w/ video]

SLIDE 60

Prediction statistics

[Histogram over ID frequencies (axis ticks 1000 to 5000) per local person ID (PERSON1, PERSON2, …). Series: Reference, Ours, Yu et al., Brown et al.]

SLIDE 61

Qualitative results: Comparison to SoTA

[...] smiles. [...] hangs up then stares at her reflection in the mirror. [...] lights candles around the room and pops a CD into a player. [...] peeks out from the bathroom. His back to [...], [...] hangs up.

Method        Sent. 1  Sent. 2  Sent. 3  Sent. 4  Sent. 5
GT            P1       P1       P2       P1       P1, P2
Ours          P1       P1       P2       P1       P1, P2
Yu et al.     P1       P3       P2       P3       P4, P5
Brown et al.  P1       P1       P2       P1       P3, P4

SLIDE 62

Application: Identity-Aware Video Description

Conventional model:
SOMEONE looks at the girl in the middle of the street. SOMEONE walks through the lobby. SOMEONE walks into the room and turns to other students. SOMEONE takes a sip. SOMEONE looks at him.

SLIDE 63

Key takeaways

Fill-in the Characters: new task with automatic evaluation
  Enables Identity-Aware Video Description
Encouraging initial results
  Transformer is an effective approach to the Fill-in the Characters task
  Room for improvement towards human performance
Some open issues
  Need spatially-temporally localized visual representations
  How to make more effective use of the visual modality?
  Humans perform complex step-by-step reasoning, can these models do it?

SLIDE 64

To sum up

We have proposed
  A method to obtain multi-sentence descriptions that are relevant to the video, coherent and diverse
  An auxiliary Fill-in-the-Characters task that allows us to incorporate person identities in descriptions
Many issues remain open and need creative solutions!
  E.g. references to significant objects and places, use of named entities, common sense for deeper understanding of events, connections to audio, dialog.

Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

SLIDE 65

Connecting video description to person identities
His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

Coherent and diverse multi-sentence video description
Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

https://sites.google.com/site/describingmovies
https://github.com/jamespark3922/adv-inf

Thank you! Questions?

https://anna-rohrbach.net