Weakly-supervised learning from videos and scripts, Ivan Laptev


ERC ALLEGRO workshop, INRIA Grenoble, July 23, 2014. Weakly-supervised learning from videos and scripts. Ivan Laptev, ivan.laptev@inria.fr, WILLOW, INRIA/ENS/CNRS, Paris. Joint work with: Piotr Bojanowski, Rémi Lajugie, Francis Bach, Jean Ponce, Cordelia Schmid, Josef Sivic.


SLIDE 1

Ivan Laptev

ivan.laptev@inria.fr WILLOW, INRIA/ENS/CNRS, Paris

Weakly-supervised learning from videos and scripts

ERC ALLEGRO workshop INRIA Grenoble July 23, 2014

Joint work with: Piotr Bojanowski – Rémi Lajugie – Francis Bach – Jean Ponce – Cordelia Schmid – Josef Sivic

SLIDE 2
SLIDE 3

Where to get training data?

Shoot actions in the lab

  • KTH dataset, Weizmann dataset, …

  • Limited variability
  • Unrealistic

Manually annotate existing content

  • HMDB, Olympic Sports, UCF50, UCF101, …

  • Very time-consuming

Use readily-available video scripts

  • www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
  • Scripts are available for 1000’s of hours of movies and TV-series
  • Scripts describe dynamic and static content of videos
SLIDE 4

As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...


SLIDE 8

Scripts as weak supervision

Challenges:

  • Imprecise temporal localization (in the example, an uncertainty interval from 24:25 to 24:51)
  • No explicit spatial localization
  • NLP problems, scripts ≠ training labels: “… Will gets out of the Chevrolet. …” and “… Erin exits her new truck…” vs. the label Get-out-car
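The mismatch between script text and training labels can be illustrated with a small sketch. The rule below is a hypothetical, deliberately naive keyword matcher for the label Get-out-car (it is not part of the presented work); it shows that both script sentences quoted on the slide are missed by literal matching.

```python
import re

# Hypothetical naive rule for the label Get-out-car: literal phrase matching.
# This is an illustration only, not the method used in the talk.
NAIVE_RULE = re.compile(r"\bgets? out of (the|a|his|her) car\b", re.IGNORECASE)

sentences = [
    "Will gets out of the Chevrolet.",   # quoted on the slide
    "Erin exits her new truck",          # quoted on the slide
    "She gets out of the car.",          # made-up sentence that the rule does match
]

# True where the naive rule fires, False where the action is missed.
matches = [bool(NAIVE_RULE.search(s)) for s in sentences]
print(matches)
```

Both slide sentences describe the action, yet neither contains the literal phrase, which is why script text cannot be used directly as a label.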

SLIDE 9

Previous work

  • Sivic, Everingham, and Zisserman, "'Who are you?' -- Learning Person Specific Classifiers from Video", In CVPR 2009.
  • Buehler, Everingham, and Zisserman, "Learning sign language by watching TV (using weakly aligned subtitles)", In CVPR 2009.
  • Duchenne, Laptev, Sivic, Bach and Ponce, "Automatic Annotation of Human Actions in Video", In ICCV 2009.

…wanted to know about the history of the trees

SLIDE 10

Joint Learning of Actors and Actions

[Bojanowski et al. ICCV 2013]

Script: "Rick walks up behind Ilsa"

Candidate labels for the people in the frame: Rick? Walks?

SLIDE 11

Joint Learning of Actors and Actions

[Bojanowski et al. ICCV 2013]

Script: "Rick walks up behind Ilsa"

Assigned labels: Rick, Walks

SLIDE 12

Formulation: Cost function

Figure labels: actor labels (Rick, Ilsa, Sam), actor image features, actor classifier.
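The formula itself did not survive in this transcript. As a rough illustration of how a cost can tie actor labels, image features, and a classifier together, here is a minimal sketch that assumes a square-loss (ridge regression) classifier fitted in closed form. The names `X`, `Z`, `W`, `lam` and the choice of loss are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def ridge_clustering_cost(X, Z, lam=1e-2):
    """Cost of a label matrix Z after fitting the best linear classifier.

    X: (N, d) image features, Z: (N, P) one-hot actor labels.
    Minimizes (1/N) * ||Z - X W - 1 b^T||_F^2 + lam * ||W||_F^2 over (W, b).
    """
    N, d = X.shape
    Xc = X - X.mean(axis=0)   # centering X and Z absorbs the optimal bias b
    Zc = Z - Z.mean(axis=0)
    # Closed-form ridge solution: (Xc^T Xc + N*lam*I) W = Xc^T Zc
    W = np.linalg.solve(Xc.T @ Xc + N * lam * np.eye(d), Xc.T @ Zc)
    residual = Zc - Xc @ W
    return (residual ** 2).sum() / N + lam * (W ** 2).sum()

# Toy data: two well-separated groups of "face features".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, (20, 2)), rng.normal(2.0, 0.3, (20, 2))])
Z_consistent = np.repeat(np.eye(2), 20, axis=0)  # labels follow the groups
Z_scrambled = np.tile(np.eye(2), (20, 1))        # labels alternate, ignoring the groups

cost_consistent = ridge_clustering_cost(X, Z_consistent)
cost_scrambled = ridge_clustering_cost(X, Z_scrambled)
print(cost_consistent, cost_scrambled)
```

Under this assumption, a labelling that agrees with the feature structure gets a lower cost than one that ignores it, which is what makes the cost usable for choosing among candidate labellings.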

SLIDE 13

Formulation: Cost function

Weak supervision from scripts:

Person p appears at least once in clip N (example: p = Rick).

SLIDE 14

Formulation: Cost function

Weak supervision from scripts:

Action a appears at least once in clip N (example: a = Walk).

SLIDE 15

Formulation: Cost function

Weak supervision from scripts:

  • Person p appears in clip N :
  • Action a appears in clip N :
  • Person p and Action a appear in clip N :
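The constraint formulas were lost in this transcript. A minimal sketch of how such "appears in clip" constraints could be checked follows, assuming binary assignment matrices `Z` (tracks × persons) and `T` (tracks × actions) and reading "appears" as "at least one track in the clip". All names and the encoding are assumptions for illustration.

```python
import numpy as np

def satisfies_clip_constraint(Z, T, clip, person=None, action=None):
    """Check an 'appears at least once in the clip' constraint.

    Z: (N, P) binary person assignments, T: (N, A) binary action assignments,
    clip: indices of the person tracks that belong to the clip.
    """
    Zc, Tc = Z[clip], T[clip]
    if person is not None and action is not None:
        # Joint constraint: the SAME track must carry both the person and the action.
        return bool((Zc[:, person] * Tc[:, action]).sum() >= 1)
    if person is not None:
        return bool(Zc[:, person].sum() >= 1)
    if action is not None:
        return bool(Tc[:, action].sum() >= 1)
    return True

# Toy clip with 3 tracks; persons = [Rick, Ilsa], actions = [Walk, Sit].
Z = np.array([[1, 0], [0, 1], [0, 1]])   # track 0 is Rick, tracks 1 and 2 are Ilsa
T = np.array([[0, 1], [1, 0], [0, 1]])   # track 1 walks, tracks 0 and 2 sit
clip = [0, 1, 2]

rick_appears = satisfies_clip_constraint(Z, T, clip, person=0)
walk_appears = satisfies_clip_constraint(Z, T, clip, action=0)
rick_walks = satisfies_clip_constraint(Z, T, clip, person=0, action=0)
print(rick_appears, walk_appears, rick_walks)
```

In the toy clip, Rick appears and someone walks, but no single track is both Rick and walking, so the joint constraint is stricter than the two separate ones.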

SLIDE 16


Image and video features

Face features

  • Facial features [Everingham’06]
  • HOG descriptor on normalized face image

Action features

  • Dense Trajectory features in person bounding box [Wang et al.,’11]

SLIDE 17


Results for Person Labelling

  • American Beauty (11 character names)
  • Casablanca (17 character names)

SLIDE 18


Results for Person + Action Labelling

Casablanca, Walking

SLIDE 19

Finding Actions and Actors in Movies

[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]

SLIDE 20


Action Learning with Ordering Constraints

[Bojanowski et al. ECCV 2014]


SLIDE 22

Cost Function

Weak supervision from ordering constraints on Z:

[Figure labels: action label; action index; example sequence 2 4 1 2 3 2; video time intervals]
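An ordering constraint of this kind can be sketched as a dynamic program: given a cost for assigning each time interval to each position in the annotated action sequence, find the cheapest assignment that follows the sequence order. The sketch below assumes every annotated action must cover at least one interval and that the cost matrix `C` is given. Both are assumptions for illustration, not details stated on the slide.

```python
import numpy as np

def best_ordered_assignment(C):
    """Cheapest order-preserving assignment of T intervals to K ordered actions.

    C[t, k] is the cost of giving interval t the k-th action of the sequence.
    The assignment starts at action 0, ends at action K-1, and at each step
    either stays on the same action or moves to the next one.
    """
    T, K = C.shape
    if T < K:
        raise ValueError("need at least as many intervals as actions")
    D = np.full((T, K), np.inf)          # D[t, k]: best cost ending at (t, k)
    back = np.zeros((T, K), dtype=int)   # predecessor action index
    D[0, 0] = C[0, 0]
    for t in range(1, T):
        for k in range(min(t + 1, K)):   # action k is reachable only if k <= t
            stay = D[t - 1, k]
            move = D[t - 1, k - 1] if k > 0 else np.inf
            if move < stay:
                D[t, k], back[t, k] = C[t, k] + move, k - 1
            else:
                D[t, k], back[t, k] = C[t, k] + stay, k
    path = np.empty(T, dtype=int)
    path[-1] = K - 1
    for t in range(T - 1, 0, -1):        # backtrack from the last interval
        path[t - 1] = back[t, path[t]]
    return path, float(D[-1, -1])

# Toy costs: 5 intervals, 3 actions in order; zeros mark the preferred action.
C = np.array([[0, 1, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 0]], dtype=float)
path, cost = best_ordered_assignment(C)
print(path.tolist(), cost)
```

The returned `path` gives, for each interval, the index of its action in the annotated sequence; a binary assignment matrix can be built by setting entry `(t, path[t])` to 1.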


SLIDE 25

Is the optimization tractable?

  • Path constraints are implicit
  • Cannot use off-the-shelf solvers
  • Frank-Wolfe optimization algorithm
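Frank-Wolfe fits this setting because it never needs an explicit description of the constraint set: each iteration only asks for the feasible point that minimizes a linear function. Below is a minimal generic sketch on a toy problem (a quadratic over the probability simplex). The function names, the step-size rule, and the toy objective are illustrative assumptions; with path constraints, the linear minimization step could instead be a dynamic program over ordered assignments.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, n_iters=2000):
    """Minimize a smooth convex function over the convex hull of a vertex set.

    grad(x) returns the gradient; lmo(g) returns the vertex s minimizing <g, s>.
    """
    x = np.asarray(x0, dtype=float)
    for it in range(n_iters):
        s = lmo(grad(x))                   # linear minimization oracle
        gamma = 2.0 / (it + 2.0)           # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s  # convex combination stays feasible
    return x

# Toy problem: minimize ||x - target||^2 over the probability simplex.
target = np.array([0.2, 0.5, 0.3])

def grad(x):
    return 2.0 * (x - target)

def simplex_lmo(g):
    # The vertices of the simplex are the one-hot vectors; pick the best one.
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

x_star = frank_wolfe(grad, simplex_lmo, x0=np.array([1.0, 0.0, 0.0]))
print(x_star)
```

Every iterate is a convex combination of vertices returned by the oracle, so feasibility is maintained without ever writing the constraints down.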
SLIDE 26

Results

937 video clips from 60 Hollywood movies

  • 16 action classes
  • Each clip is annotated by a sequence of n actions (2≤n≤11)
SLIDE 27
SLIDE 28

Summary

Joint Learning of Actors and Actions

  • Weakly-supervised learning of actions and names.
  • Reason about individual people.

Action learning with ordering constraints

  • Weakly-supervised learning using time ordering constraints.
  • Reason about action sequences.

SLIDE 29

Limitations / Future work

Joint Learning of Actors and Actions

  • Finding people in movies is still a big challenge.
  • Extracting action labels from scripts is a major (NLP+vision?) challenge.
  • No temporal localization of actions within person tracks.

Action learning with ordering constraints

  • Sequences of action labels are given manually. Want to jointly cluster videos and scripts.
  • Actions are modeled at short time intervals (15 frames).
  • No spatial localization. Want to answer questions: Who is doing what? Who interacts with whom?