Weakly-supervised learning from videos and scripts, Ivan Laptev


ERC ALLEGRO workshop, INRIA Grenoble, July 23, 2014. Weakly-supervised learning from videos and scripts. Ivan Laptev, ivan.laptev@inria.fr, WILLOW, INRIA/ENS/CNRS, Paris. Joint work with: Piotr Bojanowski, Rémi Lajugie, Francis Bach, Jean …

SLIDE 1

Ivan Laptev

ivan.laptev@inria.fr WILLOW, INRIA/ENS/CNRS, Paris

Weakly-supervised learning from videos and scripts

ERC ALLEGRO workshop INRIA Grenoble July 23, 2014

Joint work with: Piotr Bojanowski – Rémi Lajugie – Francis Bach – Jean Ponce – Cordelia Schmid – Josef Sivic

SLIDE 2

SLIDE 3

Where to get training data?

Shoot actions in the lab

KTH dataset, Weizmann dataset, …

Limited variability
Unrealistic

Manually annotate existing content

HMDB, Olympic Sports, UCF50, UCF101, …

Very time-consuming

Use readily-available video scripts

www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
Scripts are available for thousands of hours of movies and TV series
Scripts describe dynamic and static content of videos

SLIDE 4

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...

SLIDE 8

Scripts as weak supervision

Challenges:

Imprecise temporal localization (uncertainty interval, e.g. 24:25 to 24:51)
No explicit spatial localization
NLP problems, scripts ≠ training labels:
“… Will gets out of the Chevrolet. …” and “… Erin exits her new truck…” vs. Get-out-car
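
The gap between script wording and training labels can be illustrated with a minimal sketch. The pattern table and the function `match_action` below are hypothetical, written only for this example; they are not the method used in the talk. Both slide sentences describe Get-out-car, yet neither contains the label text, so even a simple matcher has to enumerate paraphrases.

```python
import re

# Hypothetical hand-written paraphrase patterns for one action label.
# The list is illustrative only and far from complete.
ACTION_PATTERNS = {
    "Get-out-car": [
        r"\bgets? out of (the|his|her|a) \w+",
        r"\bexits? (the|his|her|a) (\w+ )?(car|truck|chevrolet)\b",
    ],
}


def match_action(sentence):
    """Return the first action label whose pattern matches, else None."""
    text = sentence.lower()
    for label, patterns in ACTION_PATTERNS.items():
        if any(re.search(p, text) for p in patterns):
            return label
    return None


print(match_action("… Will gets out of the Chevrolet. …"))
print(match_action("… Erin exits her new truck…"))
```

The first pattern would also fire on a sentence such as "Rick gets out of the habit", which shows why script text cannot be treated directly as a training label.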

SLIDE 9

Previous work

Sivic, Everingham, and Zisserman, "'Who are you?' - Learning Person Specific Classifiers from Video", In CVPR 2009.
Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009.
Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009.

…wanted to know about the history of the trees

SLIDE 10

Joint Learning of Actors and Actions

[Bojanowski et al. ICCV 2013]

Script: Rick walks up behind Ilsa

Candidate labels on the detected people: Rick? Walks?

SLIDE 11

Joint Learning of Actors and Actions

[Bojanowski et al. ICCV 2013]

Script: Rick walks up behind Ilsa

Resolved labels: Rick, Walks

SLIDE 12

Formulation: Cost function

Figure labels: actor labels (Rick, Ilsa, Sam), actor image features, actor classifier
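
One common way to tie labels, features and a classifier into a single cost is a square-loss discriminative clustering objective. The sketch below assumes that form. The names `X` (track features), `Z` (one-hot actor labels), `W` (linear classifier weights) and `lam` (regularization weight) are notation chosen for this example and are not taken from the slides.

```python
def clustering_cost(X, Z, W, lam):
    """Square-loss cost (1/N) * ||Z - X W||^2 + lam * ||W||^2.

    X: N x D list of feature rows (one row per person track).
    Z: N x P list of one-hot label rows (one column per actor).
    W: D x P list of classifier weights.
    All names and shapes are assumptions made for illustration.
    """
    n = len(X)
    fit = 0.0
    for x_row, z_row in zip(X, Z):
        for p, z in enumerate(z_row):
            # Linear prediction of label p for this track.
            pred = sum(x * W[d][p] for d, x in enumerate(x_row))
            fit += (z - pred) ** 2
    reg = sum(w * w for row in W for w in row)
    return fit / n + lam * reg


X = [[1.0, 0.0], [0.0, 1.0]]          # two tracks, two feature dimensions
W = [[1.0, 0.0], [0.0, 1.0]]          # a fixed toy classifier
agree = [[1, 0], [0, 1]]              # labelling that matches the classifier
swapped = [[0, 1], [1, 0]]            # labelling with the two actors swapped
print(clustering_cost(X, agree, W, 0.0), clustering_cost(X, swapped, W, 0.0))
```

With the classifier held fixed, a labelling that agrees with its predictions scores lower than one that swaps the actors, which is what lets such a cost rank candidate labellings.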

SLIDE 13

Formulation: Cost function

Weak supervision from scripts:

Person p appears at least once in clip N (e.g. p = Rick)

SLIDE 14

Formulation: Cost function

Weak supervision from scripts:

Action a appears at least once in clip N (e.g. a = Walk)

SLIDE 15

Formulation: Cost function

Weak supervision from scripts:

Person p appears in clip N
Action a appears in clip N
Person p and Action a appear in clip N
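
The three kinds of script constraints can be checked against a candidate labelling as sketched below. The sketch assumes binary assignment matrices `Z` (tracks × persons) and `T` (tracks × actions) and represents a clip as a list of track indices. These names and this encoding are assumptions for illustration, and "appears" is read as "at least once", as on the earlier formulation slides.

```python
def person_in_clip(Z, clip, p):
    """Person p is assigned to at least one track of the clip."""
    return sum(Z[n][p] for n in clip) >= 1


def action_in_clip(T, clip, a):
    """Action a is assigned to at least one track of the clip."""
    return sum(T[n][a] for n in clip) >= 1


def person_does_action_in_clip(Z, T, clip, p, a):
    """At least one single track carries both person p and action a."""
    return sum(Z[n][p] * T[n][a] for n in clip) >= 1


# Toy example: persons 0 = Rick, 1 = Ilsa; actions 0 = Walk, 1 = other.
Z = [[1, 0], [0, 1], [0, 1]]   # track 0 is Rick, tracks 1 and 2 are Ilsa
T = [[0, 1], [1, 0], [0, 1]]   # only track 1 walks
clip = [0, 1]                  # tracks that fall inside the clip

print(person_in_clip(Z, clip, 0))                    # Rick is in the clip
print(action_in_clip(T, clip, 0))                    # someone walks in the clip
print(person_does_action_in_clip(Z, T, clip, 0, 0))  # but it is not Rick
```

The joint constraint is stricter than the two separate ones: in the toy labelling, Rick appears and walking appears, yet no single track is both Rick and walking.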

SLIDE 16

Image and video features

Face features: facial features [Everingham’06], HOG descriptor on normalized face image
Action features: Dense Trajectory features in person bounding box [Wang et al.,’11]

SLIDE 17

Results for Person Labelling

American Beauty (11 character names)
Casablanca (17 character names)

SLIDE 18

Results for Person + Action Labelling

Casablanca, Walking

SLIDE 19

Finding Actions and Actors in Movies

[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

SLIDE 20

Action Learning with Ordering Constraints

[Bojanowski et al. ECCV 2014]

SLIDE 22

Cost Function

Weak supervision from ordering constraints on Z:

Figure labels: action label, action index (2 4 1 2 3 2), video time intervals
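
One way to read the ordering constraint: every time interval is assigned to one entry of the clip's annotated action sequence, and the assigned position can only stay the same or advance by one as time moves forward. The sketch below checks that property. The function name and the list encoding are assumptions, and so is the requirement that every annotated action occupies at least one interval.

```python
def respects_ordering(assignment, num_actions):
    """Check a per-interval assignment against the ordering constraint.

    assignment[t] is the 0-based position in the annotated action sequence
    given to time interval t. A valid assignment starts at the first action,
    ends at the last one, and moves forward by at most one position per
    interval, so no action is skipped and time never runs backwards.
    """
    if not assignment:
        return False
    if assignment[0] != 0 or assignment[-1] != num_actions - 1:
        return False
    return all(0 <= b - a <= 1 for a, b in zip(assignment, assignment[1:]))


print(respects_ordering([0, 0, 1, 2, 2], 3))
print(respects_ordering([0, 1, 0, 2], 3))
print(respects_ordering([0, 2, 2], 3))
```

For an annotated sequence of three actions, `[0, 0, 1, 2, 2]` is valid, while `[0, 1, 0, 2]` goes back to an earlier action and `[0, 2, 2]` skips the second one.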

SLIDE 25

Is the optimization tractable?

Path constraints are implicit
Cannot use off-the-shelf solvers
Frank-Wolfe optimization algorithm
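
Frank-Wolfe only needs a linear minimization over the feasible set at each step, and for order-preserving assignments that step can be solved by dynamic programming. The sketch below shows such an oracle under assumed conventions: `cost[t][k]` is the cost (for example a gradient entry) of giving interval t to the k-th annotated action, and feasible paths start at the first action, end at the last, and advance by at most one position per interval. The function name and the encoding are illustrative and are not the authors' code.

```python
def best_monotone_path(cost):
    """Minimize sum_t cost[t][path[t]] over order-preserving paths.

    cost is a T x K list of lists with T >= K. Returns (total, path).
    """
    T, K = len(cost), len(cost[0])
    INF = float("inf")
    # best[t][k]: cheapest valid prefix that ends at interval t on action k.
    best = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    best[0][0] = cost[0][0]
    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]
            advance = best[t - 1][k - 1] if k > 0 else INF
            if advance < stay:
                best[t][k], back[t][k] = advance + cost[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + cost[t][k], k
    # Trace back from the last action at the last interval.
    path = [K - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[T - 1][K - 1], path


total, path = best_monotone_path([[0, 5], [4, 1], [3, 0]])
print(total, path)
```

In a Frank-Wolfe loop, the returned path would be converted to a 0/1 assignment matrix S and the current iterate updated as Z ← (1 − γ) Z + γ S for a step size γ in [0, 1], so the path constraints never have to be written out explicitly.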

SLIDE 26

Results

937 video clips from 60 Hollywood movies

16 action classes
Each clip is annotated by a sequence of n actions (2≤n≤11)

SLIDE 27

SLIDE 28

Summary

Joint Learning of Actors and Actions
Weakly-supervised learning of actions and names.
Reason about individual people.

Action learning with ordering constraints
Weakly-supervised learning using time ordering constraints.
Reason about action sequences.

SLIDE 29

Limitations / Future work

Joint Learning of Actors and Actions
No temporal localization of actions within person tracks.

Action learning with ordering constraints
No spatial localization. Want to answer questions:
Who is doing what?
Who interacts with whom?
Actions are modeled at short time intervals (15 frames).
Sequences of action labels are given manually. Want to jointly cluster videos and scripts.

Finding people in movies is still a big challenge.
Extracting action labels from scripts is a major (NLP+vision?) challenge.