Weakly-supervised Video Recognition
Pascal Mettes, University of Amsterdam
Learn video labels from spatio-temporal inputs. Double scale challenge: order(s) of magnitude more inputs, order(s) of magnitude fewer training samples.
04-04-2019 Weakly-supervised action recognition 1
Video recognition pipeline
[Feichtenhofer et al. NeurIPS 2016]
Discover what, when, and where actions occur in videos.
06-02-2019 Understanding Actions with Minimal Supervision 2
Spatio-Temporal Action Localization
Diving Skateboarding
Same double burden as video recognition in general. Additional burden from box annotation in each video frame. How can we learn action locations without all these burdens?
The annotation burden of localization
Part I: Towards action localization from video labels
Spatio-Temporal Proposals
At test time, actions can be anywhere in a video. Split videos into action tubes, such that at least one tube matches with each action. Apply model to all proposals and select the best ones during testing.
Illustration courtesy of Jan van Gemert
Goal: Alleviate the need for expensive bounding box annotations per frame.
Hypothesis: Spatio-temporal proposals can be used as training examples, if guided properly with minimal extra supervision.
Pointly-Supervised Action Localization
P. Mettes, J.C. van Gemert, and C.G.M. Snoek, “Spot On: Action Localization from Pointly-Supervised Proposals”, ECCV, 2016.
P. Mettes and C.G.M. Snoek, “Pointly-Supervised Action Localization”, IJCV, 2018 (in press).
Multiple Instance Learning
[Figure: traditional supervised learning (positive and negative instances) versus multiple-instance learning (positive and negative bags). Dietterich et al. 1997]
Multiple Instance Learning
[Figure: positive and negative bags in two panels. Panel 1: compute the optimal hyperplane with sparse MIL. Panel 2: re-weight the positive instances.]
[Slide by Vijayanarasimhan and Grauman]
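As a rough illustration of the alternation behind multiple-instance learning, the sketch below repeatedly fits a scorer and then re-selects the highest-scoring instance in each positive bag. It is a toy under stated assumptions: a difference-of-means linear scorer stands in for the max-margin hyperplane, hard selection stands in for re-weighting, and the names `fit_scorer` and `mil_select` are hypothetical.

```python
def mean(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def fit_scorer(positives, negatives):
    """Assumed stand-in for a max-margin classifier: difference of class means."""
    mean_pos, mean_neg = mean(positives), mean(negatives)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def score(w, x):
    """Linear score of instance x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mil_select(positive_bags, negative_bags, max_iters=10):
    """Alternate between fitting the scorer and picking one instance per positive bag."""
    negatives = [x for bag in negative_bags for x in bag]
    # Start by treating every instance in a positive bag as positive.
    positives = [x for bag in positive_bags for x in bag]
    selected = None
    for _ in range(max_iters):
        w = fit_scorer(positives, negatives)
        new_selected = [
            max(range(len(bag)), key=lambda i: score(w, bag[i]))
            for bag in positive_bags
        ]
        if new_selected == selected:
            break  # the selection no longer changes
        selected = new_selected
        positives = [bag[i] for bag, i in zip(positive_bags, selected)]
    return selected, w
```

In the localization setting, a bag would be a video, an instance a spatio-temporal proposal, and the selected instance the proposal used for training.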
Proposed approach
Point supervision Proposal scoring Proposal mining
Annotate actions simply by pointing on action centers. Use point supervision to help find the best proposals to train on.
Multiple Instance Learning with priors
Iteratively learn to discriminate actions (likelihood) and learn the spatio-temporal extent of actions (point priors).
Point and proposal overlap
We introduce a new overlap measure between points and proposals. Proposals should be small and their centers should match with points.
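One way to picture such a measure is sketched below. It is not the measure from the talk, whose exact form is not given here: it assumes a center-match term that decays with the distance between the point and the box center, minus a size penalty equal to the mean box area relative to the frame. The function name, the box format `(x1, y1, x2, y2)`, and the normalization by half the box diagonal are all illustrative assumptions.

```python
import math


def point_proposal_overlap(tube, points, frame_size):
    """Hypothetical overlap between point annotations and a proposal tube.

    tube:       {frame_index: (x1, y1, x2, y2)}
    points:     {frame_index: (x, y)} for the annotated frames
    frame_size: (width, height)
    """
    width, height = frame_size
    # Center match: 1 when the point sits on the box center, 0 once the point
    # is at least half a box diagonal away (assumed normalization).
    matches = []
    for frame, (px, py) in points.items():
        if frame not in tube:
            matches.append(0.0)  # the tube does not cover this annotated frame
            continue
        x1, y1, x2, y2 = tube[frame]
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        half_diag = math.hypot(x2 - x1, y2 - y1) / 2.0
        dist = math.hypot(px - cx, py - cy)
        matches.append(max(0.0, 1.0 - dist / half_diag) if half_diag > 0 else 0.0)
    center_match = sum(matches) / len(matches) if matches else 0.0
    # Size penalty: mean box area as a fraction of the frame, favoring small tubes.
    areas = [
        (x2 - x1) * (y2 - y1) / float(width * height)
        for x1, y1, x2, y2 in tube.values()
    ]
    size_penalty = sum(areas) / len(areas) if areas else 0.0
    return center_match - size_penalty
```

Under these assumptions, a small box centered on the point scores near 1, a frame-sized box scores 0, and a small box far from the point scores below 0.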
Experiments
Datasets: UCF Sports, UCF-101, J-HMDB, Hollywood2Tubes.
Evaluation: Rank detections based on classifier score. A detection is positive if it has the correct class and enough overlap. Results are reported in (mean) Average Precision.
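The evaluation protocol can be sketched as follows. The exact overlap definition and Average Precision variant used in the experiments are not stated here, so this sketch assumes per-frame box IoU averaged over the union of frames of both tubes, and non-interpolated AP. The function names and the tube format `{frame: (x1, y1, x2, y2)}` are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_iou(tube_a, tube_b):
    """Assumed spatio-temporal overlap: mean box IoU over all frames of either tube."""
    frames = set(tube_a) | set(tube_b)
    if not frames:
        return 0.0
    total = sum(
        box_iou(tube_a[f], tube_b[f])
        for f in frames
        if f in tube_a and f in tube_b  # frames covered by one tube only count as 0
    )
    return total / len(frames)


def average_precision(detections, num_ground_truth):
    """Non-interpolated AP for a list of (score, is_positive) detections.

    The caller decides is_positive (correct class, tube_iou above the threshold,
    ground truth not already matched by a higher-ranked detection).
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0
```

Mean Average Precision is then the mean of `average_precision` over the action classes.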
Can we train on spatio-temporal proposals?
Training with our approach on point annotations is as effective as training on boxes.
[Figure: UCF Sports results for models trained on box annotations, on the best proposal, and on point annotations.]
How much point supervision is required?
Points are 15 times faster to annotate than boxes. Similar performance at 50-100x speed-up with fewer annotations.
UCF Sports
Similar performance, different behavior.
Discovered actions
Actions from box supervision Actions from point supervision
New hypothesis: Point supervision can be replaced with automatic visual cues that correlate with the action location.
Replacing manual point supervision
P. Mettes, C.G.M. Snoek, and S.F. Chang, “Localizing Actions from Video Labels and Pseudo-Annotations”, BMVC, 2017.
Pseudo-annotations
Person detection
Action location correlates with the actor location.
Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NeurIPS, 2015.
Independent motion
Actions occur where motion deviates from global motion.
Jain et al. “Action localization with tubelets from motion”, CVPR, 2014.
Action proposals
Actions occur where proposals agree.
Van Gemert et al. “APT: Action localization proposals from dense trajectories”, BMVC, 2015.
Center bias
Actions are central.
Tseng et al. “Quantifying bias of observers in free viewing of dynamic natural scenes”, JOV, 2004.
Objects
Actions occur where objects occur.
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV, 2014.
Supervision bounds: the upper bound is full box supervision; the lower bound is video labels with no priors.
Axes: the x-axis is the minimal overlap threshold for positives; the y-axis is localization performance.
Individual pseudo-annotation performance
UCF Sports
All pseudo-annotations are informative. Most helpful: person detection and motion. Least helpful: center bias and objects.
Individual pseudo-annotation performance
UCF Sports
Combining pseudo-annotations
We rank pseudo-annotations based on correlation to person detection. Correlation matches with individual performance. Performance akin to full box supervision using the top 3 pseudo-annotations.
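The ranking step might look like the sketch below. How the correlation is computed is not spelled out here, so the sketch assumes Pearson correlation between the scores that each pseudo-annotation and the person detector assign to the same set of proposals. The function names and the score layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long score lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(reference_scores, candidate_scores):
    """Order pseudo-annotation names by correlation with the reference, highest first.

    reference_scores: per-proposal scores from person detection (assumed layout)
    candidate_scores: {pseudo_annotation_name: per-proposal scores}
    """
    return sorted(
        candidate_scores,
        key=lambda name: pearson(reference_scores, candidate_scores[name]),
        reverse=True,
    )
```

Keeping the first few names of the returned list corresponds to combining the top-ranked pseudo-annotations.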
The good: In two papers, from box supervision to video label supervision only.
The bad: The performance gap to the state of the art is increasing; spatio-temporal proposals are a limiting factor.
On point- and pseudo-annotations
Singh et al. ICCV, 2017. Kalogeiton et al. ICCV, 2017.
State-of-the-art approach: Train at box-level, link boxes over time.
Localize actions spatio-temporally from video labels. Now, directly from boxes, instead of spatio-temporal proposals.
New goal in weakly-supervised localization
P. Mettes and C.G.M. Snoek, “Spatio-Temporal Instance Learning: Action Tubes from Class Supervision”, arXiv, 2019.
An action consists of a set of boxes, so MIL is no longer directly applicable. We propose a new form of instance learning for actions:
Spatio-Temporal Instance Learning
Condition 1: Each positive video contains at least one positive action instance, which can occur in precisely one tube.
Condition 2: For each positive video V, the positive action instance is a set of connected boxes of minimal length 1 and maximal length F_V, where F_V denotes the total video length.
Condition 3: For each negative video, all tubes and boxes are negative.
Spatio-Temporal Instance Learning
New objective and optimization
Max margin optimization with latent variables
No more positive box proposals than number of frames in the video
All boxes from one tube
Boxes are connected over time
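To make the constraints concrete, the sketch below picks the positive instance for one video as the connected run of boxes, within a single tube, with the highest summed classifier score. This is only an illustration of the latent selection step under assumed per-box scores, not the optimization from the talk; the function names are hypothetical and each tube's score list must be non-empty.

```python
def best_connected_segment(box_scores):
    """Highest-scoring contiguous run of boxes in one tube.

    box_scores: non-empty list with one classifier score per frame of the tube.
    Returns (start, end, total) with end exclusive, so the run has length >= 1
    and at most len(box_scores).
    """
    best = (0, 1, box_scores[0])
    run_start, run_total = 0, 0.0
    for i, s in enumerate(box_scores):
        if run_total <= 0:
            # A non-positive running sum cannot help: start a new run here.
            run_start, run_total = i, s
        else:
            run_total += s
        if run_total > best[2]:
            best = (run_start, i + 1, run_total)
    return best


def select_instance(tubes):
    """Pick the best connected segment across all tubes of a positive video.

    tubes: {tube_id: [score per frame]}; all selected boxes come from one tube.
    """
    best_id, best_segment = None, None
    for tube_id, scores in tubes.items():
        segment = best_connected_segment(scores)
        if best_segment is None or segment[2] > best_segment[2]:
            best_id, best_segment = tube_id, segment
    return best_id, best_segment
```

In a latent max-margin loop, a selection like this would alternate with retraining the box classifier on the selected boxes.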
STIL on boxes versus MIL on tubes
State-of-the-art results, especially at high overlap thresholds.
Capturing spatio-temporal action extent
Action extent during training uncovered for long actions, problematic for short ones.
At test time, success in single-action videos, confusion in multi-action videos.
Action localization with box annotations does not scale to large settings. We can adapt Multiple Instance Learning to help solve this problem. The gap to fully supervised approaches is still large and remains an open problem.
Conclusions on action localization
Part II: Action localization and recognition without examples
Zero-shot recognition
[Slide by Zeynep Akata]
An action involves an actor trying to act on its environment. Objects serve as tools to make changes.
On objects and actions
To what extent can we recognize and localize actions only from objects?
Spatial-aware object embeddings
P. Mettes and C.G.M. Snoek, “Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions”, ICCV, 2017.
Action decomposition: Action defined by actors, objects, and their spatial relations.
Result: Action classification and localization given solely action names.
Component 1: Actors. The action likelihood in a box is proportional to the actor likelihood.
Component 2: Objects. The action likelihood is proportional to the likelihood of relevant objects nearby.
Jain et al. ICCV, 2015.
Component 3: Spatial relations. Actors and objects have a preferred spatial relation when interacting.
Bicycle Traffic light Skateboard
Action decomposition from actors, objects, and spatial relations.
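A minimal sketch of how the three components could be combined into a single box score is given below. The additive form, the relevance weights, and the relation labels are assumptions for illustration; the talk only states that the action likelihood grows with the actor likelihood, with nearby relevant objects, and with a matching spatial relation. All names are hypothetical.

```python
def action_box_score(actor_prob, object_detections, relevance, spatial_match):
    """Hypothetical zero-shot score of one candidate actor box for one action.

    actor_prob:        actor likelihood of the box (component 1)
    object_detections: {object_name: [(object_prob, relation_to_actor), ...]}
                       for objects detected near the box (component 2)
    relevance:         {object_name: weight} relating objects to the action,
                       e.g. from word-embedding similarity (assumed source)
    spatial_match:     function (object_name, relation) -> value in [0, 1]
                       rewarding the preferred relation (component 3)
    """
    total = actor_prob
    for obj, weight in relevance.items():
        # Use the best detection of this object, discounted by spatial mismatch.
        best = max(
            (prob * spatial_match(obj, relation)
             for prob, relation in object_detections.get(obj, [])),
            default=0.0,
        )
        total += weight * best
    return total
```

For example, with a skateboard preferred below the actor, a skateboard detected below the box contributes far more than an equally confident one to its left.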
Scoring action boxes
Video frame Actor likelihood Relevant objects Spatial awareness
High scoring boxes are linked over time to obtain a localization.
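Linking can be sketched as a dynamic program that picks one box per frame, trading off box scores against overlap between consecutive boxes. This is a common way to link boxes into tubes and is offered here as an assumption, not as the procedure used in the talk. The function names and the `overlap_weight` parameter are hypothetical, and every frame must contain at least one box.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_boxes(frames, overlap_weight=1.0):
    """Viterbi-style linking: choose one box index per frame.

    frames: list over time; each entry is a non-empty list of (box, score).
    Maximizes the sum of box scores plus overlap_weight times the IoU
    between consecutive chosen boxes.
    """
    if not frames:
        return []
    best = [score for _, score in frames[0]]
    backpointers = []
    for prev, cur in zip(frames, frames[1:]):
        new_best, pointers = [], []
        for box, score in cur:
            options = [
                best[j] + overlap_weight * box_iou(prev_box, box)
                for j, (prev_box, _) in enumerate(prev)
            ]
            j_star = max(range(len(options)), key=options.__getitem__)
            new_best.append(options[j_star] + score)
            pointers.append(j_star)
        best = new_best
        backpointers.append(pointers)
    # Trace the best path back from the last frame.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(backpointers):
        idx = pointers[idx]
        path.append(idx)
    return path[::-1]
```

The returned indices select the boxes that together form the localized action tube.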
A second object relation for actions
Concrete road Polka dot jersey More bicycles Crash barrier
Success and failure cases
Skateboarding (skateboard) Riding horse (horse) Kicking (tie) Bar swinging (table)
Actions localized based on object match and spatial awareness. Actions missed due to wrong actor focus or semantic ambiguity.
Zero-shot localized action retrieval
Backpack ON actor Sports ball (small) RIGHT OF actor Sports ball (medium) RIGHT OF actor
On-the-fly retrieval for any object, relation, and size.
Action localization comparison
When action examples are scarce or absent, objects are vital. Zero-shot action recognition and localization are possible when decomposing actions into actors, objects, and spatial awareness.
Conclusions on actions from objects
Towards representing motion
Lagrangian motion specification: Follow elements as they move through space and time. For objects?
Eulerian motion specification: Fix spatial location and observe temporal changes. For textures?
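The two viewpoints can be contrasted on a toy one-dimensional "video", where each frame is a list of intensities. The sketch assumes a single bright element and follows it with a simple argmax, which is far cruder than real trajectory or flow estimation; the function names are hypothetical.

```python
def lagrangian_trajectory(frames):
    """Lagrangian view: follow the brightest element and record its position per frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in frames]


def eulerian_signal(frames, location):
    """Eulerian view: stay at one fixed location and record its intensity over time."""
    return [frame[location] for frame in frames]


# A bright element moving one position to the right per frame.
toy_frames = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
trajectory = lagrangian_trajectory(toy_frames)     # positions of the element
signal = eulerian_signal(toy_frames, location=2)   # intensity at position 2
```

The Lagrangian output describes where the element goes; the Eulerian output describes what happens at one place.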