Towards generating stories about video
Anna Rohrbach
The End-of-End-to-End: A Video Understanding Pentathlon, CVPR 2020
https://anna-rohrbach.net
Let's look at a human-generated video description
A young singer with moppy dark brown hair strums a guitar at the mic. Debbie brings Pete a bottle of beer. Setting her own drink down, she faces the stage and takes his hand. Pete shrugs and gives a delighted smile. Debbie smiles encouragingly.
… significant to the story, including named entities (e.g. they meet at the Denny’s)
… understanding of events (e.g. they make up after quarreling)
His brow furrowed, […] looks down at the ground. […] eyes him angrily, her jaw clenched. […] heads off.
Our work: A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.
(1) (2) (3)
Ground Truth
A man is seen hosting a news segment that shows clips of various floats moving down a rapid with people. One group of people fall out and pull each other off to the side and one man speaks to the camera. More shots are shown of people riding down the river and falling out
Visually Relevant, Linguistically Fluent, Diverse & Coherent Across Sentences
Park et al. Adversarial Inference for Multi-Sentence Video Description. CVPR 2019
Masked Transformer (Zhou et al.)
A man is seen speaking to the camera and leads into clips of people riding in the water. The man continues to speak to the camera while more clips of people riding. The man continues talking to the camera.
Move Forward and Tell (Xiong et al.)
A man is seen speaking to the camera and leads into a man speaking to the camera. A man is seen speaking to the camera and leads into him riding down a river. A man is seen speaking to the camera and leads into him riding down a river.
Content Error, Incoherent Sentence, Repetition Across Sentences
Adversarial Inference (Ours)
A man is seen speaking to the camera and leads into several people riding down a rough river. People are shown in the water riding a boat. Several people are shown in slow motion as well as people riding along.
Maximum Likelihood Estimation (MLE)
MLE Training: Generator
Inference: Generator with Greedy Max / Beam Search, e.g. "A man is seen speaking …", "People are riding down …"
Favors frequent n-grams in training set
Explores limited vocabulary space
Adversarial Training [Dai et al. ICCV17] [Shetty et al. ICCV17] [Chen et al. AAAI 2019]
MLE Pre-Training: Generator
Adversarial Training: Generator, Discriminator (Real/Fake)
Stable Training for Text Generation is Difficult
Adversarial Inference
MLE Pre-Training: Generator, Discriminator (Real/Fake)
Inference: Sampling with Generator and Discriminator
Example sentences: "People are in a raft …", "People are kayaking …", "A man is seen speaking …", "People are shown in the …", "People are riding down …"
Previous: A man is seen speaking to the camera …
Sample: People are kayaking down the river…
Sample: People are in a raft in a large raft in the water…
Sample: A man is seen speaking to the camera …
Sample: People are shown in the water riding a boat.
Hybrid Discriminator: Visual Discriminator, Language Discriminator, Pairwise Discriminator
Visual Discriminator: Visually Relevant
Language Discriminator: Linguistically Fluent
Pairwise Discriminator: Diverse & Coherent Across Sentences
People are shown in the water riding a boat.
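The sample-then-select idea on these slides can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the three scoring functions are toy, hand-written stand-ins for the learned visual, language and pairwise discriminators, `DETECTED_CONCEPTS` is a hypothetical substitute for real video features, and the equal weighting of the three scores is an assumption. The candidate sentences are taken from the slide, with trailing ellipses dropped.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, Sequence[str]], float]

# Hypothetical stand-in for visual evidence: concepts assumed to be seen in the clip.
DETECTED_CONCEPTS = {"people", "water", "boat", "river"}


def tokens(sentence: str) -> list:
    return sentence.lower().replace(".", " ").split()


def visual_score(sentence: str, previous: Sequence[str]) -> float:
    """Toy 'visual' score: share of detected concepts the sentence mentions."""
    return len(set(tokens(sentence)) & DETECTED_CONCEPTS) / len(DETECTED_CONCEPTS)


def language_score(sentence: str, previous: Sequence[str]) -> float:
    """Toy 'language' score: penalizes repeated words within the sentence."""
    words = tokens(sentence)
    return len(set(words)) / len(words) if words else 0.0


def pairwise_score(sentence: str, previous: Sequence[str]) -> float:
    """Toy 'pairwise' score: penalizes word overlap with earlier sentences."""
    words = set(tokens(sentence))
    overlaps = [
        len(words & set(tokens(p))) / len(words | set(tokens(p))) for p in previous
    ]
    return 1.0 - max(overlaps, default=0.0)


def select_sentence(
    candidates: Sequence[str], previous: Sequence[str], scorers: Sequence[Scorer]
) -> str:
    """Return the sampled candidate with the highest summed score."""
    return max(candidates, key=lambda c: sum(score(c, previous) for score in scorers))


previous = ["A man is seen speaking to the camera"]
candidates = [
    "People are kayaking down the river",
    "People are in a raft in a large raft in the water",
    "A man is seen speaking to the camera",
    "People are shown in the water riding a boat",
]
best = select_sentence(
    candidates, previous, [visual_score, language_score, pairwise_score]
)
print(best)
```

With these toy scores, the candidate that repeats the previous sentence and the one that repeats "raft" are both penalized, which mirrors the error types listed earlier in the talk.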
Dataset: ActivityNet Captions
Baselines: MLE, GAN, MLE+SingleDis
MLE+HybridDis (Ours)
[Figures comparing MLE, GAN, MLE+SingleDis and MLE+HybridDis (Ours): METEOR; Vocabulary Size; 2-gram Diversity; 4-gram Repetition; Delta (%) between “Better than MLE” and “Worse than MLE”]
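The charts report 2-gram diversity and 4-gram repetition, but the slides do not define them. A minimal sketch under an assumed definition: diversity is the ratio of unique n-grams to all n-grams in a multi-sentence description, and repetition is the share of n-grams that repeat an earlier one. The function names and the whitespace tokenization are illustrative choices, and the exact formulas used in the paper may differ.

```python
from typing import List, Sequence, Tuple


def ngrams(words: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(sentences: Sequence[str], n: int) -> Tuple[int, int]:
    """Return (unique, total) n-gram counts; n-grams do not cross sentences."""
    collected: List[Tuple[str, ...]] = []
    for sentence in sentences:
        collected.extend(ngrams(sentence.lower().replace(".", " ").split(), n))
    return len(set(collected)), len(collected)


def diversity(sentences: Sequence[str], n: int) -> float:
    unique, total = ngram_counts(sentences, n)
    return unique / total if total else 0.0


def repetition(sentences: Sequence[str], n: int) -> float:
    unique, total = ngram_counts(sentences, n)
    return (total - unique) / total if total else 0.0


# Outputs quoted earlier in the talk for the river-rafting video.
move_forward_and_tell = [
    "A man is seen speaking to the camera and leads into a man speaking to the camera.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
    "A man is seen speaking to the camera and leads into him riding down a river.",
]
adversarial_inference = [
    "A man is seen speaking to the camera and leads into several people riding down a rough river.",
    "People are shown in the water riding a boat.",
    "Several people are shown in slow motion as well as people riding along.",
]
print(repetition(move_forward_and_tell, 4), repetition(adversarial_inference, 4))
print(diversity(move_forward_and_tell, 2), diversity(adversarial_inference, 2))
```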
Ground Truth
A number of women exercise together using a stepping type of implement. The camera pans slightly to the right. The camera pans slightly to the left.
Masked Transformer (Zhou et al.)
We see people in a room. They are dancing in a room. The people continue dancing around the room.
Move Forward and Tell (Xiong et al.)
A group of people are seen standing in a room with a man speaking to the camera. A group of people are inside a gym. A group of people are seen standing in a room with a man speaking to the camera.
Adversarial Inference (Ours)
A group of women are in a gym doing a synchronized move up and down on a stair stepper. They are doing the same dance in a synchronized manner. They are using a synchronized steppers to move.
Captures Visual Content More Precisely
Rohrbach et al. Generating Descriptions with Grounded and Co-Referenced People. CVPR 2017
Conventional task/model: Someone strides to the window.
Our task/model:
Previous clip: Sophia gags as she pushes past him and walks out.
Current clip: She and Jacob walk down the corridor.
[PERSON1], [PERSON2], [PERSON3], … ???
His brow furrowed, […] looks down at the ground. → [PERSON1] (male)
[…] eyes him angrily, her jaw clenched. → [PERSON2] (female)
[…] heads off. → ??? (Need vision unless more context is available!), then [PERSON1]
[…] folds her arms. → [PERSON2] (female)
[…] approaches […], who leans against the wall of the house. → [PERSON1], [PERSON3]
Dataset          # Movies   # Sentences   # Sets   # Blanks
Training set     153        101,079       20,283   87,604
Validation set   12         7,408         1,486    6,457
Test set         17         10,053        2,018    8,431
[Figure: bar chart over person IDs PERSON1 to PERSON11; axis ticks 20, 40, 60]
[Figure: model overview. Labels: Clip 1 … Clip i, Clip i+1; 3D Conv; Face Cluster; Align Clip Seg to Faces; Blank Embed; Blank Text Embedding; Face attention; Weighted sum; Blank-To-Face Linking; Transformer over Blank 1 … Blank b, Blank b+1, Blank b+2; Mapping Blanks-To-IDs ([PERSON 1], [PERSON 2], [PERSON 3]); Gender Prediction Loss ([FEMALE]). Example input: "His brow furrowed, [...] looks down at the ground. [...] approaches [...], who leans against the wall." "[...] folds her arms."]
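The diagram labels "Face attention" and "Weighted sum" suggest that each blank attends over face clusters and pools their features. A minimal sketch of that generic mechanism follows. The dot-product scoring, the softmax normalization, and the names `attend`, `blank_embedding` and `face_features` are all assumptions made for illustration; the slides do not specify how the attention weights are computed.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    blank_embedding: Sequence[float], face_features: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Score each face cluster against the blank, then pool by weighted sum."""
    scores = [
        sum(q * k for q, k in zip(blank_embedding, face)) for face in face_features
    ]
    weights = softmax(scores)
    dim = len(face_features[0])
    pooled = [
        sum(w * face[d] for w, face in zip(weights, face_features))
        for d in range(dim)
    ]
    return weights, pooled


# Toy example: one blank embedding and two face-cluster features (hypothetical values).
blank = [1.0, 0.0]
faces = [[2.0, 0.0], [0.0, 2.0]]
weights, pooled = attend(blank, faces)
print(weights, pooled)
```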
Evaluation: pairwise comparisons within the ground truth and predicted IDs
A pair is “correct” if the ground truth and predicted IDs are BOTH different or BOTH the same
Ground Truth: [PERSON1] [PERSON2] [PERSON1] [PERSON3]
Predictions: [PERSON1] [PERSON2] [PERSON3] [PERSON1]
Accuracy: 4/6
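The pairwise rule above can be checked on the slide's own example. The sketch below compares every pair of blanks and counts a pair as correct when the ground truth and the prediction agree on whether the two blanks share an ID. The function name `pairwise_id_accuracy` is illustrative, and any aggregation over sets or movies used in the actual evaluation is not shown on the slide.

```python
from itertools import combinations
from typing import Sequence


def pairwise_id_accuracy(gt: Sequence[str], pred: Sequence[str]) -> float:
    """Share of blank pairs where gt and pred agree on same-ID vs. different-ID."""
    if len(gt) != len(pred):
        raise ValueError("ground truth and predictions must have the same length")
    pairs = list(combinations(range(len(gt)), 2))
    if not pairs:
        raise ValueError("need at least two blanks to form a pair")
    correct = sum((gt[i] == gt[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return correct / len(pairs)


ground_truth = ["PERSON1", "PERSON2", "PERSON1", "PERSON3"]
predictions = ["PERSON1", "PERSON2", "PERSON3", "PERSON1"]
accuracy = pairwise_id_accuracy(ground_truth, predictions)
print(f"{accuracy:.3f}")  # 4 of the 6 pairs agree
```

Because only same-versus-different relations are compared, the score does not depend on which label names the model happens to use.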
[Figure: Test accuracy for Different IDs and Same ID; axis ticks 20 to 100]
[Figure: Test accuracy comparing Text only, Full model, Yu et al., Brown et al., Human w/o video and Human w/ video; annotation: SoTA 2019]
[Figure: bar chart over person IDs (PERSON1, PERSON2, …) comparing Reference, Ours, Yu et al., Brown et al.; axis ticks 1000 to 5000]
[...] smiles. [...] hangs up then stares at her reflection in the mirror. [...] lights candles around the room and pops a CD into a player. [...] peeks out from the bathroom. His back to [...], [...] hangs up.
GT:            P1   P1   P2   P1   P1, P2
Ours:          P1   P1   P2   P1   P1, P2
Yu et al.:     P1   P3   P2   P3   P4, P5
Brown et al.:  P1   P1   P2   P1   P3, P4
Conventional model: SOMEONE looks at the girl in the middle of the street. SOMEONE walks through the lobby. SOMEONE walks into the room and turns to other students. SOMEONE takes a sip. SOMEONE looks at him.
https://sites.google.com/site/describingmovies
https://github.com/jamespark3922/adv-inf
https://anna-rohrbach.net