Weakly-supervised Video Recognition

slide-1
SLIDE 1

Weakly-supervised Video Recognition

Pascal Mettes, University of Amsterdam

slide-2
SLIDE 2

Learn video labels from spatio-temporal inputs. Double scale challenge: order(s) of magnitude more inputs, order(s) of magnitude fewer training samples.

04-04-2019 Weakly-supervised action recognition 1

Video recognition pipeline

[Feichtenhofer et al. NeurIPS 2016]

slide-3
SLIDE 3

Discover what, when, and where actions occur in videos.

06-02-2019 Understanding Actions with Minimal Supervision 2

Spatio-Temporal Action Localization

Diving Skateboarding

slide-4
SLIDE 4

Same double burden as video recognition in general. Additional burden from box annotation in each video frame. How can we learn action locations without all these burdens?

The annotation burden of localization

slide-5
SLIDE 5

Part I: Towards action localization from video labels

slide-6
SLIDE 6

Spatio-Temporal Proposals

At test time, actions can be anywhere in a video. Split videos into action tubes, such that at least one tube matches with each action. Apply model to all proposals and select the best ones during testing.

Illustration courtesy of Jan van Gemert

slide-7
SLIDE 7

Goal: Alleviate the need for expensive bounding box annotations per frame.

Hypothesis: Spatio-temporal proposals can be used as training examples, if guided properly with minimal extra supervision.

Pointly-Supervised Action Localization

  • P. Mettes, J.C. van Gemert, and C.G.M. Snoek, “Spot On: Action Localization from Pointly-Supervised Proposals”, ECCV, 2016.
  • P. Mettes and C.G.M. Snoek, “Pointly-Supervised Action Localization”, IJCV, 2018 (in press).
slide-8
SLIDE 8

Multiple Instance Learning

[Figure: traditional supervised learning with positive and negative instances versus multiple-instance learning with positive and negative bags. Dietterich et al. 1997]

slide-9
SLIDE 9

Multiple Instance Learning

[Figure: positive and negative bags. Compute the optimal hyperplane with sparse MIL, then re-weight the positive instances.]

[Slide by Vijayanarasimhan and Grauman]

slide-10
SLIDE 10

Proposed approach

Point supervision Proposal scoring Proposal mining

Annotate actions simply by pointing on action centers. Use point supervision to help find the best proposals to train on.

slide-11
SLIDE 11

Multiple Instance Learning with priors

Iteratively learn to discriminate actions (likelihood) and learn the spatio-temporal extent of actions (point priors).
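
The mining step of this iterative scheme can be sketched as follows. This is a minimal illustration, assuming each training video already has a classifier score (likelihood) and a point-based prior per proposal. The weighted sum, the function name `select_proposals`, and the `prior_weight` parameter are assumptions made for the example and are not taken from the slides.

```python
def select_proposals(likelihoods, priors, prior_weight=0.5):
    """Pick one proposal per positive video (one "bag" in MIL terms).

    likelihoods, priors: one list per video, with one value per proposal.
    prior_weight: hypothetical trade-off between classifier score and prior.
    Returns the index of the selected proposal for every video.
    """
    selected = []
    for lik, pri in zip(likelihoods, priors):
        # Assumed combination rule: a convex combination of both cues.
        combined = [(1.0 - prior_weight) * l + prior_weight * p
                    for l, p in zip(lik, pri)]
        selected.append(max(range(len(combined)), key=combined.__getitem__))
    return selected


# Toy example: two videos with three proposals each.
likelihoods = [[0.2, 0.9, 0.4], [0.5, 0.1, 0.3]]
priors = [[0.8, 0.1, 0.3], [0.2, 0.1, 0.9]]

print(select_proposals(likelihoods, priors, prior_weight=0.0))  # likelihood only
print(select_proposals(likelihoods, priors, prior_weight=1.0))  # prior only
```

In a full loop, the classifier would be retrained on the selected proposals and the likelihoods recomputed before the next round of selection.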

slide-12
SLIDE 12

Point and proposal overlap

We introduce a new overlap measure between points and proposals. Proposals should be small and their centers should match with points.
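
The two requirements above (small proposals, centers close to the annotated points) can be illustrated with a toy score. This is only a sketch under stated assumptions and not the measure from the papers: the center match, the size penalty, and the names `point_box_match` and `point_tube_overlap` are all hypothetical.

```python
import math


def point_box_match(point, box):
    """1.0 when the point is at the box center, falling to 0.0 at corner distance."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_diag = math.hypot(x2 - x1, y2 - y1) / 2.0
    dist = math.hypot(point[0] - cx, point[1] - cy)
    if half_diag == 0.0:
        return 1.0 if dist == 0.0 else 0.0
    return max(0.0, 1.0 - dist / half_diag)


def point_tube_overlap(points, tube, frame_area):
    """points: {frame: (x, y)}; tube: {frame: (x1, y1, x2, y2)}.

    Assumed form: average center match on annotated frames,
    minus the average box area relative to the frame area.
    """
    frames = [f for f in points if f in tube]
    if not frames:
        return 0.0
    match = sum(point_box_match(points[f], tube[f]) for f in frames) / len(frames)
    mean_area = sum((b[2] - b[0]) * (b[3] - b[1]) for b in tube.values()) / len(tube)
    return match - mean_area / frame_area


points = {0: (5, 5), 1: (6, 5)}
tube_small = {0: (0, 0, 10, 10), 1: (1, 0, 11, 10)}
tube_large = {0: (0, 0, 50, 50), 1: (0, 0, 50, 50)}

print(point_tube_overlap(points, tube_small, frame_area=100 * 100))
print(point_tube_overlap(points, tube_large, frame_area=100 * 100))
```

The small tube centered on the points scores higher than the large tube that merely contains them, which is the behavior the slide asks for.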

slide-13
SLIDE 13

Experiments

Datasets

UCF Sports, UCF-101, J-HMDB, Hollywood2Tubes

Evaluation

Rank detections based on classifier score. Positive if correct class and enough overlap. Results reported in (mean) Average Precision.
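
The evaluation protocol can be sketched in a few lines. The slides do not state which Average Precision variant is used, so the non-interpolated form below, the duplicate-detection rule, and the name `average_precision` are assumptions for illustration.

```python
def average_precision(detections, num_ground_truth, min_overlap=0.5):
    """Non-interpolated AP for one action class.

    detections: list of (score, gt_id, overlap) tuples. gt_id identifies the
    ground-truth action of the correct class that the detection overlaps most,
    or is None when there is no such action.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = set()
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, gt_id, overlap) in enumerate(ranked, start=1):
        # Positive only if correct class, enough overlap, and not a duplicate.
        if gt_id is not None and overlap >= min_overlap and gt_id not in matched:
            matched.add(gt_id)
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


detections = [
    (0.9, "a", 0.7),   # correct, enough overlap
    (0.8, None, 0.0),  # no matching action
    (0.7, "b", 0.6),   # correct, enough overlap at 0.5
    (0.6, "a", 0.8),   # duplicate detection of action "a"
]
print(average_precision(detections, num_ground_truth=2, min_overlap=0.5))
print(average_precision(detections, num_ground_truth=2, min_overlap=0.65))
```

Raising `min_overlap` turns the third detection into a false positive, which is why localization curves drop at stricter overlap thresholds. The mean over all classes gives the mean Average Precision.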

slide-14
SLIDE 14

Can we train on spatio-temporal proposals?

Training with our approach on points is as effective as training on boxes.

UCF Sports

[Figure legend: trained on box annotations, trained on best proposal, trained on point annotations]

slide-15
SLIDE 15

How much point supervision is required?

Points are 15 times faster to annotate than boxes. Similar performance at 50-100x speed-up with fewer annotations.

UCF Sports

slide-16
SLIDE 16

Similar performance, different behavior.

Discovered actions

Actions from box supervision Actions from point supervision

slide-17
SLIDE 17

New hypothesis: Point supervision can be replaced with automatic visual cues that correlate with the action location.

Replacing manual point supervision

  • P. Mettes, C.G.M. Snoek, and S.F. Chang, “Localizing Actions from Video Labels and Pseudo-Annotations”, BMVC, 2017.
slide-18
SLIDE 18

Pseudo-annotations

Person detection

Action location correlates with the actor location.

Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NeurIPS, 2015.

Independent motion

Actions occur where motion deviates from global motion.

Jain et al. “Action localization with tubelets from motion”, CVPR, 2014.

Action proposals

Actions occur where proposals agree.

Van Gemert et al. “APT: Action localization proposals from dense trajectories”, BMVC, 2015.

Center bias

Actions are central.

Tseng et al. “Quantifying bias of observers in free viewing of dynamic natural scenes”, JOV, 2004.

Objects

Actions occur where objects occur.

Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV, 2014.

slide-19
SLIDE 19

Supervision bounds: the upper bound is full box supervision; the lower bound is video labels without priors.
Axes: the X-axis shows the minimal overlap threshold for positives; the Y-axis shows localization performance.

Individual pseudo-annotation performance

UCF Sports

slide-20
SLIDE 20

All pseudo-annotations are informative. Most helpful: person detection and motion. Least helpful: center bias and objects.

Individual pseudo-annotation performance

UCF Sports

slide-21
SLIDE 21

Combining pseudo-annotations

We rank pseudo-annotations based on correlation to person detection. Correlation matches with individual performance. Performance akin to full box supervision using the top 3 pseudo-annotations.

slide-22
SLIDE 22

The good: In two papers, from box supervision to video label supervision only.
The bad: Performance gap to state-of-the-art is increasing; spatio-temporal proposals are a limiting factor.

On point- and pseudo-annotations

Singh et al. ICCV, 2017. Kalogeiton et al. ICCV, 2017.

State-of-the-art approach: Train at box level, link boxes over time.

slide-23
SLIDE 23

Localize actions spatio-temporally from video labels. Now, directly from boxes, instead of spatio-temporal proposals.

New goal in weakly-supervised localization

  • P. Mettes and C.G.M. Snoek, “Spatio-Temporal Instance Learning: Action Tubes from Class Supervision”, arXiv, 2019.
slide-24
SLIDE 24

An action consists of a set of boxes, so MIL is no longer directly applicable. We propose a new instance learning formulation for actions:

Spatio-Temporal Instance Learning

Condition 1: Each positive video contains at least one positive action instance, which can occur in precisely one tube.
Condition 2: For each positive video V, the positive action instance is a set of connected boxes of minimal length 1 and maximal length F_V, where F_V denotes the total video length.
Condition 3: For each negative video, all tubes and boxes are negative.

slide-25
SLIDE 25

Spatio-Temporal Instance Learning

New objective and optimization

Max margin optimization with latent variables:
  • No more positive box proposals than number of frames in the video
  • All boxes from one tube
  • Boxes are connected over time
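
The requirement that selected boxes are connected over time can be illustrated with a generic linking step. The sketch below is not the STIL optimization itself. It assumes per-frame candidate boxes with scores and uses dynamic programming to pick one box per frame, trading box score against overlap between consecutive boxes. The names `link_boxes` and `link_weight` are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_boxes(frames, link_weight=1.0):
    """frames: list over time; each entry is a non-empty list of (box, score).

    Returns one box index per frame, maximizing the sum of box scores plus
    link_weight times the IoU between consecutive selected boxes.
    """
    best = [[score for _, score in frames[0]]]
    back = []
    for t in range(1, len(frames)):
        cur_best, cur_back = [], []
        for box, score in frames[t]:
            cands = [best[-1][j] + link_weight * iou(prev_box, box)
                     for j, (prev_box, _) in enumerate(frames[t - 1])]
            j_star = max(range(len(cands)), key=cands.__getitem__)
            cur_best.append(cands[j_star] + score)
            cur_back.append(j_star)
        best.append(cur_best)
        back.append(cur_back)
    # Backtrack from the best final box to recover the full path.
    idx = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [idx]
    for cur_back in reversed(back):
        idx = cur_back[idx]
        path.append(idx)
    return path[::-1]


frames = [
    [((0, 0, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)],
    [((1, 1, 11, 11), 0.5), ((50, 50, 60, 60), 0.6)],
    [((2, 2, 12, 12), 0.9), ((80, 80, 90, 90), 0.95)],
]
print(link_boxes(frames, link_weight=1.0))  # smooth tube through the first boxes
print(link_boxes(frames, link_weight=0.0))  # per-frame best box, no connectivity
```

With `link_weight=0.0` the selection jumps between distant boxes, while a positive weight yields a spatially coherent tube.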

slide-26
SLIDE 26

STIL on boxes versus MIL on tubes

State-of-the-art results, especially at high overlap thresholds.

slide-27
SLIDE 27

Capturing spatio-temporal action extent

Action extent during training uncovered for long actions, problematic for short ones.

At test time, success in single-action videos, confusion in multi-action videos.

slide-28
SLIDE 28

Action localization with box annotations does not scale to large settings. We can adapt Multiple Instance Learning to help solve this problem. The gap to fully supervised approaches is still large and remains an open problem.

Conclusions on action localization

slide-29
SLIDE 29

Part II: Action localization and recognition without examples

slide-30
SLIDE 30

Zero-shot recognition

[Slide by Zeynep Akata]

slide-31
SLIDE 31

An action involves an actor trying to act on its environment. Objects serve as tools to make changes.

On objects and actions

To what extent can we recognize and localize actions only from objects?

slide-32
SLIDE 32

Spatial-aware object embeddings

  • P. Mettes and C.G.M. Snoek, “Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions”, ICCV, 2017.

Action decomposition: Action defined by actors, objects, and their spatial relations.
Result: Action classification and localization given solely action names.
Component 1: Actors. The action likelihood in a box is proportional to the actor likelihood.
Component 2: Objects. The action likelihood is proportional to the likelihood of relevant objects nearby.

Jain et al. ICCV, 2015.

Component 3: Spatial relations. Actors and objects have a preferred spatial relation when interacting.
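
A toy version of how the three components could combine into a box score is given below. It is a sketch under stated assumptions, not the formulation from the paper: the additive combination, the coarse relation labels, and the names `spatial_relation` and `action_box_score` are hypothetical, and the relevance weights and preferred relations are assumed to be given.

```python
def box_center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def spatial_relation(actor_box, object_box):
    """Coarse relation of the object relative to the actor, from box centers."""
    ax, ay = box_center(actor_box)
    ox, oy = box_center(object_box)
    x1, y1, x2, y2 = actor_box
    if x1 <= ox <= x2 and y1 <= oy <= y2:
        return "on"
    dx, dy = ox - ax, oy - ay
    if abs(dx) >= abs(dy):
        return "right_of" if dx > 0 else "left_of"
    return "below" if dy > 0 else "above"  # image y grows downward


def action_box_score(actor_box, actor_likelihood, objects, relevance, preferred):
    """objects: list of (name, box, likelihood) detections in the same frame.

    relevance: {object name: weight for this action} (assumed given).
    preferred: {object name: preferred relation to the actor} (assumed given).
    """
    score = actor_likelihood  # component 1: actors
    for name, box, likelihood in objects:
        weight = relevance.get(name, 0.0)  # component 2: relevant objects
        # Component 3: count the object only in its preferred relation.
        if weight and spatial_relation(actor_box, box) == preferred.get(name):
            score += weight * likelihood
    return score


actor = (40, 20, 60, 80)
objects = [("skateboard", (42, 78, 58, 90), 0.9),
           ("traffic light", (0, 0, 5, 15), 0.95)]
print(action_box_score(actor, 0.7, objects,
                       relevance={"skateboard": 0.8},
                       preferred={"skateboard": "below"}))
```

In this example the skateboard sits below the actor and raises the score for skateboarding, while the irrelevant traffic light is ignored.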

Bicycle Traffic light Skateboard

slide-33
SLIDE 33

Action decomposition from actors, objects, and spatial relations.

Scoring action boxes

Video frame Actor likelihood Relevant objects Spatial awareness

High scoring boxes are linked over time to obtain a localization.

slide-34
SLIDE 34

A second object relation for actions

Concrete road Polka dot jersey More bicycles Crash barrier

slide-35
SLIDE 35

Success and failure cases

Skateboarding (skateboard) Riding horse (horse) Kicking (tie) Bar swinging (table)

Actions localized based on object match and spatial awareness. Actions missed due to wrong actor focus or semantic ambiguity.

slide-36
SLIDE 36

Zero-shot localized action retrieval

Backpack ON actor Sports ball (small) RIGHT OF actor Sports ball (medium) RIGHT OF actor

On-the-fly retrieval for any object, relation, and size.

slide-37
SLIDE 37

Action localization comparison

slide-38
SLIDE 38

When action examples are scarce or absent, objects are vital. Zero-shot action recognition and localization is possible when decomposing actions into actors, objects, and their spatial relations.

Conclusions on actions from objects

slide-39
SLIDE 39

Towards representing motion

Lagrangian motion specification: Follow elements as they move through space and time. For objects?
Eulerian motion specification: Fix spatial location and observe temporal changes. For textures?