
slide-1
SLIDE 1

Towards generating stories about video

Anna Rohrbach

The End-of-End-to-End: A Video Understanding Pentathlon, CVPR 2020

https://anna-rohrbach.net

slide-2
SLIDE 2

Let’s look at a human-generated video description

A young singer with moppy dark brown hair strums a guitar at the mic. Debbie brings Pete a bottle of beer. Setting her own drink down, she faces the stage and takes his hand. Pete shrugs and gives a delighted smile. Debbie smiles encouragingly.

slide-3
SLIDE 3

Let’s look at a human-generated video description

  • Human descriptions …
      • are relevant to the video
      • are coherent and non-redundant
      • mention distinct person identities and make use of co-references (e.g. she)
  • Besides, they …
      • may contain references to objects and places significant to the story, including named entities (e.g. they meet at the Denny’s)
      • may require common sense for deeper understanding of events (e.g. they make up after quarreling)
  • And much more (connect to audio, dialog, etc.)

A young singer with moppy dark brown hair strums a guitar at the mic. Debbie brings Pete a bottle of beer. Setting her own drink down, she faces the stage and takes his hand. Pete shrugs and gives a delighted smile. Debbie smiles encouragingly.

slide-4
SLIDE 4

This talk

Coherent and diverse multi-sentence video description

Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

Connecting video description to person identities

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

slide-6
SLIDE 6

Multi-Sentence Video Description

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

  • Visually Relevant
  • Linguistically Fluent
  • Diverse & Coherent Across Sentences

Park et al. Adversarial Inference for Multi-Sentence Video Description. CVPR 2019

slide-7
SLIDE 7

Multi-Sentence Video Description

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Masked Transformer (Zhou et al.)

A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.)

A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.

slide-10
SLIDE 10

Multi-Sentence Video Description

Issues highlighted:

  • Content Error
  • Incoherent Sentence
  • Repetition Across Sentences

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Masked Transformer (Zhou et al.)

A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.)

A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.

slide-12
SLIDE 12

Multi-Sentence Video Description

Ground Truth

A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out on the side.

Masked Transformer (Zhou et al.)

A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.

Move Forward and Tell (Xiong et al.)

A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.

Adversarial Inference (Ours)

A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

slide-13
SLIDE 13

Conventional Video Captioning Model

slide-14
SLIDE 14

Conventional Video Captioning Model

  • MLE Training: the Generator is trained with Maximum Likelihood Estimation (MLE), e.g. on "People are riding down …"
  • Inference: the Generator decodes with Greedy Max / Beam Search, e.g. "A man is seen speaking …"
  • Favors frequent n-grams in training set
  • Explores limited vocabulary space

slide-15
SLIDE 15

GANs for More Humanlike Descriptions

  • Towards Diverse and Natural Image Descriptions via a Conditional GAN [Dai et al. ICCV17]
  • Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training [Shetty et al. ICCV17]
  • Improving Image Captioning with Conditional Generative Adversarial Nets [Chen et al. AAAI 2019]

slide-16
SLIDE 16

GANs for More Humanlike Descriptions

[Diagram: three stages. MLE Pre-Training (Generator); Adversarial Training (Generator vs. Discriminator, Real/Fake); Inference. Example sentences shown: "People are riding down …", "A man is seen speaking …"]

slide-17
SLIDE 17

GANs for More Humanlike Descriptions

[Diagram: MLE Pre-Training (Generator); Adversarial Training (Generator vs. Discriminator, Real/Fake); Inference. Example sentences shown: "People are riding down …", "A man is seen speaking …"]

Stable Training for Text Generation is Difficult

slide-18
SLIDE 18

Our Adversarial Inference Approach

[Diagram: MLE Pre-Training (Generator); Adversarial Training marked SKIP; Inference with Generator and Discriminator (Real/Fake). Example sentences shown: "People are riding down …", "A man is seen speaking …"]

slide-19
SLIDE 19

Our Adversarial Inference Approach

[Diagram: MLE Pre-Training (Generator, "People are riding down …"), then Adversarial Inference: Sampling from the Generator, with the Discriminator (Real/Fake) applied to the samples]

Sampled sentences:

  • People are in a raft …
  • People are kayaking …
  • A man is seen speaking …
  • People are shown in the …

slide-20
SLIDE 20

Adversarial Inference - Hybrid Discriminator

Previous: A man is seen speaking to the camera …

Sample:

  • People are kayaking down the river…
  • People are in a raft in a large raft in the water…
  • A man is seen speaking to the camera …
  • People are shown in the water riding a boat.

Hybrid Discriminator

slide-21
SLIDE 21

Adversarial Inference - Hybrid Discriminator

Previous: A man is seen speaking to the camera …

Sample:

  • People are kayaking down the river…
  • People are in a raft in a large raft in the water…
  • A man is seen speaking to the camera …
  • People are shown in the water riding a boat.

Hybrid Discriminator

  • Visual Discriminator: Visually Relevant
  • Language Discriminator: Linguistically Fluent
  • Pairwise Discriminator: Diverse & Coherent Across Sentences

Selected: People are shown in the water riding a boat.
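
The selection step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the three toy scoring functions, the equal weighting, and the names `hybrid_score` and `adversarial_inference` are hypothetical stand-ins for trained visual, language and pairwise discriminators applied to sentences sampled from an MLE-trained generator.

```python
# Hypothetical sketch: re-rank sampled sentences with a hybrid discriminator.

def toy_visual(candidate, previous):
    # Stand-in for the visual discriminator: pretend the clip shows a boat on water.
    return 1.0 if ("boat" in candidate or "raft" in candidate) else 0.3

def toy_language(candidate, previous):
    # Stand-in for the language discriminator: penalise repeated words in a sentence.
    words = candidate.lower().rstrip(".").split()
    return len(set(words)) / len(words)

def toy_pairwise(candidate, previous):
    # Stand-in for the pairwise discriminator: penalise overlap with the previous sentence.
    cand = set(candidate.lower().rstrip(".").split())
    prev = set(previous.lower().rstrip(".").split())
    return 1.0 - len(cand & prev) / len(cand)

def hybrid_score(candidate, previous, scorers):
    # Equal weighting of the discriminators is an assumption of this sketch.
    return sum(s(candidate, previous) for s in scorers) / len(scorers)

def adversarial_inference(candidates, previous, scorers):
    # Keep the sampled sentence that the hybrid discriminator scores highest.
    return max(candidates, key=lambda c: hybrid_score(c, previous, scorers))

previous = "A man is seen speaking to the camera"
candidates = [
    "People are kayaking down the river",
    "People are in a raft in a large raft in the water",
    "A man is seen speaking to the camera",
    "People are shown in the water riding a boat",
]
best = adversarial_inference(
    candidates, previous, [toy_visual, toy_language, toy_pairwise]
)
print(best)
```

With these toy scorers, the repeated sentence is penalised by the pairwise term and the redundant "raft" sentence by the language term, so the last candidate is kept. The generator itself is never updated in this scheme; only the choice among its samples changes.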

slide-22
SLIDE 22

Comparison to baselines at paragraph-level

Dataset: ActivityNet Captions

[Bar chart: METEOR for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

slide-23
SLIDE 23

Improved Vocabulary Size

[Bar chart: Vocabulary Size for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

slide-24
SLIDE 24

Increased Language Diversity

[Bar chart: 2-gram Diversity for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

slide-25
SLIDE 25

Decreased Repetition across Sentences

[Bar chart: 4-gram Repetition for MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]
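
The diversity and repetition scores on these slides are n-gram statistics over a generated paragraph. The sketch below shows one simple way to define them; the exact definitions used in the paper may differ, and the function names are made up for illustration. Here diversity is the share of distinct n-grams among all n-grams, and repetition is the share of n-gram occurrences that repeat an earlier one.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def paragraph_ngrams(sentences, n):
    # n-grams are collected per sentence, so they never span a sentence boundary.
    return [g for s in sentences for g in ngrams(s.lower().split(), n)]

def ngram_diversity(sentences, n=2):
    # Distinct n-grams divided by all n-grams (assumed definition).
    grams = paragraph_ngrams(sentences, n)
    return len(set(grams)) / len(grams) if grams else 0.0

def ngram_repetition(sentences, n=4):
    # Share of n-gram occurrences that repeat an earlier n-gram (assumed definition).
    grams = paragraph_ngrams(sentences, n)
    repeated = sum(count - 1 for count in Counter(grams).values())
    return repeated / len(grams) if grams else 0.0

# Paragraphs quoted from the earlier slides.
baseline = [
    "A man is seen speaking to the camera and leads into a man speaking to the camera.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
]
ours = [
    "A man is seen speaking to the camera and leads into several people riding down a rough river.",
    "People are shown in the water riding a boat.",
    "Several people are shown in slow motion as well as people riding along.",
]
print(ngram_diversity(baseline), ngram_diversity(ours))
print(ngram_repetition(baseline), ngram_repetition(ours))
```

Under these definitions, the baseline paragraph, whose last two sentences are identical, scores lower on 2-gram diversity and higher on 4-gram repetition than the adversarial inference output.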

slide-26
SLIDE 26
Better human ratings for multi-sentence video descriptions

[Bar chart: Delta (%) between “Better than MLE” and “Worse than MLE”; legend: MLE, GAN, MLE+SingleDis, MLE+HybridDis (Ours)]

slide-27
SLIDE 27

Ground Truth

A number of women exercise together using a stepping type of implement. The camera pans slightly to the right. The camera pans slightly to the left.

Masked Transformer (Zhou et al.)

We see people in a room. They are dancing in a room. The people continue dancing around the room.

Move Forward and Tell (Xiong et al.)

A group of people are seen standing in a room with a man speaking to the camera. A group of people are inside a gym. A group of people are seen standing in a room with a man speaking to the camera.

Adversarial Inference (Ours)

A group of women are in a gym doing a synchronized move up and down on a stair stepper. They are doing the same dance in a synchronized manner. They are using a synchronized steppers to move.

Ours vs. SoTA

Captures Visual Content More Precisely

slide-28
SLIDE 28
Key takeaways

  • Adversarial Inference outperforms joint training (GAN)
      • Easy and stable
  • Hybrid Discriminator design is effective
      • Captures visual, linguistic and pairwise consistency
  • Evaluation can be challenging
      • Automatic metrics do not necessarily reflect an improvement
      • Rely on diversity metrics and human evaluation

slide-29
SLIDE 29

This talk

Coherent and diverse multi-sentence video description

Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

Connecting video description to person identities

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

slide-30
SLIDE 30

Our earlier work

Conventional task/model: Someone strides to the window.

Rohrbach et al. Generating Descriptions with Grounded and Co-Referenced People. CVPR 2017

slide-31
SLIDE 31

Our earlier work

Conventional task/model: Someone strides to the window.

Our task/model:

  • Previous clip: Sophia gags as she pushes past him and walks out.

slide-32
SLIDE 32

Our earlier work

Conventional task/model: Someone strides to the window.

Our task/model:

  • Previous clip: Sophia gags as she pushes past him and walks out.
  • Current clip: She and Jacob walk down the corridor.

slide-33
SLIDE 33

LSMDC v2

  • Tackle a set of clips at once
  • End task: Identity-Aware Video Description
  • Auxiliary task: Fill-in the Characters

slide-34
SLIDE 34

LSMDC v2: Fill-in the Characters

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

[PERSON1], [PERSON2], [PERSON3], … ???

slide-35
SLIDE 35

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched.
  • […] heads off.

slide-36
SLIDE 36

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off.

slide-37
SLIDE 37

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off. → ???

Need vision unless more context is available!

slide-38
SLIDE 38

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off. → ???
  • […] folds her arms. → ???

Need vision unless more context is available!

slide-39
SLIDE 39

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off. → ???
  • […] folds her arms. → [PERSON2] (female)

Need vision unless more context is available!

slide-41
SLIDE 41

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off. → [PERSON1]
  • […] folds her arms. → [PERSON2] (female)

slide-42
SLIDE 42

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off. → [PERSON1]
  • […] folds her arms. → [PERSON2] (female)
  • […] approaches […], who leans against the wall of the house. → ???

slide-44
SLIDE 44

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
  • […] eyes him angrily, her jaw clenched. → [PERSON2] (female)
  • […] heads off. → [PERSON1]
  • […] folds her arms. → [PERSON2] (female)
  • […] approaches […], who leans against the wall of the house. → [PERSON1], [PERSON3]

slide-45
SLIDE 45

LSMDC v2: Fill-in the Characters

  • His brow furrowed, […] looks down at the ground. → [PERSON1]
  • […] eyes him angrily, her jaw clenched. → [PERSON2]
  • […] heads off. → [PERSON1]
  • […] folds her arms. → [PERSON2]
  • […] approaches […], who leans against the wall of the house. → [PERSON1], [PERSON3]

slide-46
SLIDE 46

LSMDC v2: Fill-in the Characters

  • Complex reasoning over a set of descriptions/clips
  • Language can be quite powerful, but video is still necessary!

slide-47
SLIDE 47

LSMDC v2: What is new?

  • New annotations for LSMDC:
  • Every “person” instance (name and “he”/“she”) is labeled with an ID
  • Gender is labeled for each character
  • Each movie split into sets of 5 clips (or fewer)
  • Test ground-truth IDs are kept blind

Dataset          # Movies   # Sentences   # Sets   # Blanks
Training set          153       101,079   20,283     87,604
Validation set         12         7,408    1,486      6,457
Test set               17        10,053    2,018      8,431

slide-48
SLIDE 48

LSMDC v2: Fill-in the Characters

[Bar chart: Distribution over local person IDs (%), for PERSON1 through PERSON11]
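
Local person IDs are defined within a set of clips rather than across a whole movie. One plausible convention, consistent with the worked example on the earlier slides but an assumption here rather than a documented rule of the dataset, is to number characters in order of first appearance within the set. The function name and the character names below are hypothetical.

```python
def to_local_ids(characters):
    """Relabel character names as PERSON1, PERSON2, ... by order of first appearance."""
    mapping = {}
    local_ids = []
    for name in characters:
        if name not in mapping:
            # The next unused local ID goes to each newly seen character.
            mapping[name] = f"PERSON{len(mapping) + 1}"
        local_ids.append(mapping[name])
    return local_ids

# Hypothetical character names for the six blanks of one set.
blanks = ["Tom", "Anna", "Tom", "Anna", "Tom", "Rita"]
local_ids = to_local_ids(blanks)
print(local_ids)
```

Under this convention the first blank of every set is PERSON1, which would explain why low-numbered IDs dominate the distribution.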

slide-49
SLIDE 49

Our approach

[Architecture diagram. Labels recoverable from the figure:
  • Input: a set of clips (Clip 1, …, Clip i, Clip i+1, …) with descriptions containing blanks, e.g. "His brow furrowed, [...] looks down at the ground.", "[...] folds her arms.", "[...] approaches [...], who leans against the wall."
  • Visual side: 3D Conv, Face Cluster, Align Clip Seg to Faces
  • Text side: Blank Embed (Blank Text Embedding)
  • Blank-To-Face Linking: Face attention, Weighted sum
  • Mapping Blanks-To-IDs: Transformer over Blank 1, …, Blank b, Blank b+1, Blank b+2, with outputs such as [PERSON 2] [PERSON 1] [PERSON 1] [PERSON 3]
  • Gender Prediction Loss, e.g. [FEMALE]
The mathematical symbols in the figure did not survive extraction.]

slide-53
SLIDE 53

Automatic evaluation metric

  • We make pairwise comparisons within the ground truth and predicted IDs
  • For each pair, we assign “correct” if ground truth and predicted IDs are BOTH different or BOTH the same
  • Each set is scored as a ratio of “correct” pairs (e.g. 4/6)

Ground Truth: [PERSON1] [PERSON2] [PERSON1] [PERSON3]
Predictions:  [PERSON1] [PERSON2] [PERSON3] [PERSON1]
Accuracy: 4/6
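
The pairwise metric described above can be written out directly. The following is a minimal sketch; the function name `pairwise_accuracy` and the handling of sets with fewer than two blanks are assumptions, not taken from the official evaluation code.

```python
from itertools import combinations

def pairwise_accuracy(ground_truth, predicted):
    """Share of blank pairs whose same/different relation matches the ground truth."""
    if len(ground_truth) != len(predicted):
        raise ValueError("ground truth and predictions must cover the same blanks")
    pairs = list(combinations(range(len(ground_truth)), 2))
    if not pairs:
        # Assumption: with fewer than two blanks there is no pair to get wrong.
        return 1.0
    correct = sum(
        (ground_truth[i] == ground_truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return correct / len(pairs)

# The example from the slide.
gt = ["PERSON1", "PERSON2", "PERSON1", "PERSON3"]
pred = ["PERSON1", "PERSON2", "PERSON3", "PERSON1"]
score = pairwise_accuracy(gt, pred)
print(score)  # 4 of the 6 pairs agree
```

Because only same/different relations are compared, the score does not depend on which label a character receives: a prediction that renames every ID consistently still scores 1.0.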

slide-54
SLIDE 54

Quantitative results

[Bar chart: Test accuracy; bars: Different IDs, Same ID]

slide-55
SLIDE 55

Quantitative results

[Bar chart: Test accuracy; bars: Different IDs, Same ID]

In fact:

[Bar chart: Test accuracy; bars: Different IDs, Same ID]

slide-56
SLIDE 56

Quantitative results

[Bar chart: Test accuracy; bars: Different IDs, Text only, Full model, Yu et al., Brown et al., Human w/o video]

slide-57
SLIDE 57

Quantitative results

[Bar chart: Test accuracy; bars: Different IDs, Text only, Full model, Yu et al., Brown et al., Human w/o video; annotation: SoTA 2019]

slide-59
SLIDE 59

Quantitative results

[Bar chart: Test accuracy; bars: Different IDs, Text only, Full model, Yu et al., Brown et al., Human w/o video, Human w/ video]

slide-60
SLIDE 60

Prediction statistics

[Histogram over ID frequencies for PERSON1, PERSON2, … (axis labels truncated in the source); series: Reference, Ours, Yu et al., Brown et al.]

slide-61
SLIDE 61

Qualitative results: Comparison to SoTA

[...] smiles. [...] hangs up then stares at her reflection in the mirror. [...] lights candles around the room and pops a CD into a player. [...] peeks out from the bathroom. His back to [...], [...] hangs up.

Method         Sent. 1   Sent. 2   Sent. 3   Sent. 4   Sent. 5
GT             P1        P1        P2        P1        P1, P2
Ours           P1        P1        P2        P1        P1, P2
Yu et al.      P1        P3        P2        P3        P4, P5
Brown et al.   P1        P1        P2        P1        P3, P4

slide-62
SLIDE 62

Application: Identity-Aware Video Description

Conventional model:

SOMEONE looks at the girl in the middle of the street. SOMEONE walks through the lobby. SOMEONE walks into the room and turns to other students. SOMEONE takes a sip. SOMEONE looks at him.

slide-63
SLIDE 63

Key takeaways

  • Fill-in the Characters: new task with automatic evaluation
      • Enables Identity-Aware Video Description
  • Encouraging initial results
      • Transformer is an effective approach to the Fill-in the Characters task
      • Room for improvement towards human performance
  • Some open issues
      • Need spatially-temporally localized visual representations
      • How to make more effective use of the visual modality?
      • Humans perform complex step-by-step reasoning, can these models do it?

slide-64
SLIDE 64

To sum up

  • We have proposed
      • A method to obtain multi-sentence descriptions that are relevant to the video, coherent and diverse
      • An auxiliary Fill-in-the-Characters task that allows us to incorporate person identities in descriptions
  • Many issues remain open and need creative solutions!
      • E.g. references to significant objects and places, use of named entities, common sense for deeper understanding of events, connections to audio, dialog.

Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.

(1) (2) (3)

His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.

slide-65
SLIDE 65

Thank you! Questions?

  • Coherent and diverse multi-sentence video description: https://github.com/jamespark3922/adv-inf
  • Connecting video description to person identities: https://sites.google.com/site/describingmovies

https://anna-rohrbach.net