SLIDE 1

Action recognition in videos II

Cordelia Schmid, INRIA Grenoble

SLIDE 2

Action recognition - goal

  • Short actions, e.g. answer phone, shake hands

Examples: hand shake, answer phone

SLIDE 3

Action recognition - goal

  • Activities/events, e.g. making a sandwich, doing homework

Making sandwich, doing homework (TrecVid Multimedia Event Detection dataset)

SLIDE 4

Action recognition - goal

  • Activities/events, e.g. birthday party, parade

Birthday party, parade

TrecVid Multimedia Event Detection dataset

SLIDE 5

Action recognition - tasks

  • Action classification: assigning an action label to a video clip

Making sandwich: present; feeding animal: not present; …

SLIDE 6

Action recognition - tasks

  • Action classification: assigning an action label to a video clip

Making sandwich: present; feeding animal: not present; …

  • Action localization: search locations of an action in a video

SLIDE 7

Outline

  • Improved video description

– Dense trajectories and motion-boundary descriptors

  • Adding temporal information to the bag of features

– Actom sequence model for efficient action detection

  • Modeling human-object interaction

SLIDE 8

Dense trajectories - motivation

  • Dense sampling improves results over sparse interest points for image classification [Fei-Fei'05, Nowak'06]

  • Recent progress by using feature trajectories for action recognition [Messing'09, Sun'09]

  • The 2D space domain and 1D time domain in videos have very different characteristics

  → Dense trajectories: a combination of dense sampling with feature trajectories [Wang, Klaeser, Schmid & Liu, CVPR'11]

SLIDE 9

Approach

  • Dense multi-scale sampling
  • Feature tracking over L frames with optical flow
  • Trajectory-aligned descriptors with a spatio-temporal grid
SLIDE 10

Approach

Dense sampling

– remove untrackable points, based on the eigenvalues of the auto-correlation matrix

Feature tracking

– by median filtering in a dense optical flow field
– length is limited to avoid drifting
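Both steps can be sketched in a few lines of Python. This is a minimal illustration with OpenCV and numpy; the grid step, eigenvalue threshold, and helper names are illustrative assumptions, not the paper's exact parameters.

```python
import cv2
import numpy as np

def dense_sample(gray, step=5, quality=0.001):
    """Sample points on a regular grid, dropping untrackable ones.

    A point is kept if the smaller eigenvalue of its local
    auto-correlation (structure) matrix is above a threshold.
    """
    eig = cv2.cornerMinEigenVal(gray, blockSize=3)
    thresh = quality * eig.max()
    pts = [(x, y)
           for y in range(step // 2, gray.shape[0], step)
           for x in range(step // 2, gray.shape[1], step)
           if eig[y, x] > thresh]
    return np.float32(pts)

def track_points(points, flow):
    """Advance each point by the median-filtered dense flow at its location."""
    fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 3)  # median filter
    fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 3)  # smooths outliers
    h, w = fx.shape
    out = []
    for x, y in points:
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        out.append((x + fx[yi, xi], y + fy[yi, xi]))
    return np.float32(out)

# flow between consecutive frames, e.g.:
# flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
# Tracks are cut after a fixed number of frames (L in the slide) to avoid drift.
```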

SLIDE 11

Feature tracking

Comparison: KLT tracks, SIFT tracks, dense tracks

SLIDE 12

Trajectory descriptors

  • Motion boundary descriptor (MBH)

– spatial derivatives are calculated separately for optical flow in x and y, quantized into a histogram
– captures the relative dynamics of different regions
– suppresses constant motion, as appears for example due to background camera motion
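A hedged sketch of the motion-boundary idea: histogram the gradients of each optical-flow component. The bin count and the Sobel derivative scheme are illustrative assumptions, not the exact implementation.

```python
import cv2
import numpy as np

def mbh_histogram(flow_component, n_bins=8):
    """Histogram of oriented gradients of ONE flow component (x or y).

    Differentiating the flow field removes any constant component,
    which is why MBH suppresses (locally constant) camera motion.
    """
    f = np.ascontiguousarray(flow_component, dtype=np.float32)
    gx = cv2.Sobel(f, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(f, cv2.CV_32F, 0, 1)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)

# MBHx and MBHy are computed separately and concatenated:
# mbh = np.concatenate([mbh_histogram(flow[..., 0]), mbh_histogram(flow[..., 1])])
```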

SLIDE 13

Trajectory descriptors

  • Trajectory shape described by normalized relative point coordinates

  • HOG, HOF and MBH are encoded along each trajectory
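The shape descriptor can be written out directly: the sequence of frame-to-frame displacement vectors, normalized by the sum of their magnitudes. A minimal numpy sketch, assuming `points` holds one tracked trajectory.

```python
import numpy as np

def trajectory_shape(points):
    """points: (L+1, 2) array of tracked (x, y) positions over L+1 frames."""
    disp = np.diff(points, axis=0)                    # (L, 2) displacement vectors
    scale = np.linalg.norm(disp, axis=1).sum() + 1e-8
    return (disp / scale).ravel()                     # 2L-dimensional descriptor
```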
SLIDE 14

Experimental setup

  • Bag-of-features with 4000 clusters obtained by k-means; classification by non-linear SVM with an RBF chi-square kernel

– confirmed by recent results with Fisher vector + linear SVM

  • Descriptors are combined by addition of distances

  • Evaluation on two datasets: UCF Sports (classification accuracy) and Hollywood2 (mean average precision)

  • Two baseline trajectories: KLT and SIFT
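The classification pipeline above can be sketched with scikit-learn. This is a minimal sketch: `train_desc`, `test_desc`, and `y_train` are assumed placeholders loaded elsewhere, and MiniBatchKMeans stands in for plain k-means for speed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def bof_histograms(descriptor_sets, codebook):
    """Quantize each video's local descriptors into a normalized word histogram."""
    hists = []
    for desc in descriptor_sets:
        words = codebook.predict(desc)
        h = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        hists.append(h / (h.sum() + 1e-8))
    return np.array(hists)

# train_desc / test_desc: lists of (n_i, d) local descriptor arrays;
# y_train: the corresponding action labels (both assumed given).
codebook = MiniBatchKMeans(n_clusters=4000).fit(np.vstack(train_desc))
X_train = bof_histograms(train_desc, codebook)
X_test = bof_histograms(test_desc, codebook)

# SVM with a precomputed exponential chi-square kernel
K_train = chi2_kernel(X_train, gamma=0.5)
svm = SVC(kernel="precomputed").fit(K_train, y_train)
pred = svm.predict(chi2_kernel(X_test, X_train, gamma=0.5))
```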


SLIDE 15

Comparison of descriptors

Descriptor   Hollywood2   UCF Sports
Trajectory   47.8%        75.4%
HOG          41.2%        84.3%
HOF          50.3%        76.8%
MBH          55.1%        84.2%
Combined     58.2%        88.0%

  • Trajectory descriptor performs well
  • HOF >> HOG for Hollywood2, dynamic information is relevant
  • HOG >> HOF for the sports dataset, spatial context is relevant
  • MBH consistently outperforms HOF, robust to camera motion
SLIDE 16

Comparison of trajectories

Method                   Hollywood2   UCF Sports
Dense trajectory + MBH   55.1%        84.2%
KLT trajectory + MBH     48.6%        78.4%
SIFT trajectory + MBH    40.6%        72.1%

  • Dense >> KLT >> SIFT trajectories
SLIDE 17

Improved trajectories [Wang & Schmid, ICCV'13]

  • Dense trajectories are impacted by camera motion
  • Stabilize camera motion before computing optical flow

– Extract feature matches (SURF and dense optical flow)
– Compute a robust homography
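A sketch of the stabilization step with OpenCV. ORB stands in here for SURF (which lives in OpenCV's non-free module), and only sparse matches are shown; the paper additionally uses dense-flow matches.

```python
import cv2
import numpy as np

def camera_homography(prev_gray, curr_gray):
    """Robust frame-to-frame homography from sparse feature matches."""
    detector = cv2.ORB_create(2000)             # stand-in for SURF (non-free)
    k1, d1 = detector.detectAndCompute(prev_gray, None)
    k2, d2 = detector.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches])
    dst = np.float32([k2[m.trainIdx].pt for m in matches])
    # RANSAC rejects matches on independently moving foreground objects
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H

# Warping with H cancels camera motion before the flow is recomputed:
# h, w = prev_gray.shape
# stabilized = cv2.warpPerspective(prev_gray, H, (w, h))
```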

SLIDE 18

Improved trajectories

SLIDE 19

Improved trajectories

SLIDE 20

Improved trajectories

SLIDE 21

Experimental setting

SLIDE 22

Results

SLIDE 23

Results

SLIDE 24

Results

SLIDE 25

Excellent results in TrecVid MED'13

  • Combination of MBH, SIFT, audio, text & speech recognition
  • First in the known-event challenge, first in the ad-hoc event challenge

Making sandwich - results

Rank 1 (pos), Rank 20 (pos), Rank 21 (neg)

SLIDE 26

Excellent results in TrecVid MED'13

FlashMob gathering - results

Rank 1 (pos), Rank 18 (pos), Rank 19 (neg)

SLIDE 27

Impact of different channels

SLIDE 28

Conclusion

  • Dense trajectory representation for action recognition outperforms existing approaches

  • Motion stabilization improves performance of the motion-based descriptors MBH and HOF

  • Efficient algorithm, available online at https://lear.inrialpes.fr/software

  • Recent excellent results in the TrecVid MED 2013 challenge

SLIDE 29

Outline

  • Improved video description

– Dense trajectories and motion-boundary descriptors

  • Adding temporal information to the bag of features

– Actom sequence model for efficient action detection

  • Modeling human-object interaction

SLIDE 30

Approach for action modeling

  • Model the temporal structure of an action with a sequence of "action atoms" (actoms)

  • Action atoms are action-specific short key events, whose sequence is characteristic of the action

SLIDE 31

Related work

  • Temporal structuring of video data

– Bag-of-features with spatio-temporal pyramids [Laptev'08]
– Loose hierarchical structure of latent motion parts [Niebles'10]
– Facial action recognition with action unit detection and structured learning of temporal segments [Simon'10]

SLIDE 32

Approach for action modeling

  • Actom Sequence Model (ASM): histogram of time-anchored visual features

SLIDE 33

Actom annotation

  • Actoms for training actions are obtained manually (3 actoms per action here)

  • Alternative supervision to clip annotation (beginning and end frames), with similar cost and smaller annotation variability

  • Automatic detection of actoms at test time
SLIDE 34

Actom descriptor

  • An actom is parameterized by:

– central frame location
– time-span
– temporally weighted feature assignment mechanism

  • Actom descriptor:

– histogram of quantized visual words in the actom's range
– contribution depends on temporal distance to the actom center (using temporal Gaussian weighting)

SLIDE 35

Actom sequence model (ASM)

  • ASM: concatenation of actom histograms
  • Temporally structured extension of BOF
  • Action represented by a single sparse sequential model
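A minimal sketch of the ASM representation under these definitions, using numpy; the sigma parameterization of the Gaussian weighting and the per-actom normalization are illustrative assumptions.

```python
import numpy as np

def asm_descriptor(word_ids, frame_ids, actom_centers, sigma, n_words):
    """word_ids: quantized visual words; frame_ids: the frame of each feature.

    Each feature votes into every actom's histogram with a weight that
    decays with its temporal distance to the actom's central frame.
    """
    hists = []
    for center in actom_centers:
        w = np.exp(-0.5 * ((frame_ids - center) / sigma) ** 2)
        h = np.bincount(word_ids, weights=w, minlength=n_words)
        hists.append(h / (h.sum() + 1e-8))
    return np.concatenate(hists)   # one sparse sequential model per action
```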
SLIDE 36

Actom Sequence Model (ASM)

Parameters

  • The ASM model has two parameters: the overlap between actoms (controls the radius) and the soft-voting "peakiness" (controls the profile)

Extreme settings range from keyframe-like to BOF-like.

SLIDE 37

Automatic temporal detection - training

  • ASM classifier:

– non-linear SVM on ASM representations with intersection kernel, random training negatives, probability outputs
– estimates the posterior probability of an action given the temporal location of its actoms

  • Actoms unknown at test time:

– use training examples to learn a prior on the temporal structure of actom candidates

SLIDE 38

Training

Action classifier

  • ASM classifier: non-linear SVM on ASM representations

– intersection kernel
– random training negatives
– class-balancing
– probability outputs

Estimates the posterior probability of an action given the temporal location of its actoms.

  • Actoms unknown at test time:

– use training examples to learn actom candidates

SLIDE 39

Prior on temporal structure

  • Temporal structure: inter-actom spacings

  • Non-parametric model of the temporal structure

– kernel density estimation over inter-actom spacings from training action examples
– discretize it (small support in practice: K ≈ 10)
– use as prior on temporal structure during detection
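A sketch of this prior with scipy's `gaussian_kde`; the discretization to the K most probable spacings is an illustrative reading of the slide, not necessarily the paper's exact procedure.

```python
import numpy as np
from scipy.stats import gaussian_kde

def learn_spacing_candidates(train_spacings, K=10):
    """train_spacings: 1-D array of inter-actom gaps (in frames) from training."""
    kde = gaussian_kde(train_spacings)
    grid = np.arange(train_spacings.min(), train_spacings.max() + 1)
    density = kde(grid)
    top = np.argsort(density)[-K:]             # K most probable spacings
    prior = density[top] / density[top].sum()  # renormalized discrete prior
    return grid[top], prior
```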

SLIDE 40

Training

Example of learned candidates

  • Actom models corresponding to the candidates learned for "smoking" (with the ASM parameters used in our experiments)

SLIDE 41

Automatic Temporal Detection

  • Probability of an action at frame t_m obtained by marginalizing over all learned candidate actom sequences

  • Sliding central frame: detection in a long video stream by evaluating the probability every N frames (N = 5)

  • Non-maxima suppression as a post-processing step
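The detection loop can be sketched as follows, assuming an `asm_prob` scorer (the ASM classifier's probability output) and the candidate spacings and priors learned above; all names are placeholders.

```python
import numpy as np

def detect(features, candidates, priors, asm_prob, n_frames, step=5):
    """Score every `step`-th central frame by marginalizing over candidates.

    candidates: list of inter-actom spacing vectors; priors: their weights.
    """
    scores = []
    for t in range(0, n_frames, step):
        p = 0.0
        for spacings, prior in zip(candidates, priors):
            centers = t + np.cumsum(np.concatenate(([0], spacings)))
            p += prior * asm_prob(features, centers)
        scores.append((t, p))
    return scores

def non_maxima_suppression(scores, radius):
    """Keep a detection only if it is the local maximum within `radius` frames."""
    return [(t, p) for t, p in scores
            if all(p >= q for s, q in scores if abs(s - t) <= radius)]
```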
SLIDE 42

Experiments - Datasets

  • Coffee & Cigarettes: localize drinking, smoking in 36k frames [Laptev'07]
  • DLSBP: localize opening a door, sitting down in 443k frames [Duchenne'09]

  • Evaluation: average precision (AP) computed with respect to 20% overlap with ground-truth test actions

SLIDE 43

Quantitative Results

Results on Coffee & Cigarettes and DLSBP:

  • The ASM method outperforms BOF

  • ASM improves over a rigid temporal structure, BOF T3 (concatenation of 3 BOFs over the beginning, middle and end of the action)

  • More accurate detections with ASM compared to the state of the art

SLIDE 44

Qualitative Results

Central frames

Frames of the top 5 actions detected with ASM for drinking and opening a door

(only #2 of opening a door is a false positive)

SLIDE 45

Qualitative Results

Actoms

Frames of automatically detected actom sequences for 4 actions

Open Door, Drinking, Smoking, Sitting Down

SLIDE 46

Localization results for action drinking

SLIDE 47

Localization results for action smoking

SLIDE 48

Conclusion

  • ASM: efficient model of actions with a flexible sequence of key semantic sub-actions (actoms)

  • Principled multi-scale action detection using a learned prior on temporal structure

  • ASM outperforms bag-of-features, rigid temporal structures and the state of the art

SLIDE 49

Outline

  • Improved video description

– Dense trajectories and motion-boundary descriptors

  • Adding temporal information to the bag of features

– Actom sequence model for efficient action detection

  • Modeling human-object interaction

SLIDE 50

Action recognition

  • Description of the human pose

– Silhouette description [Sullivan & Carlsson 2002]
– Histogram of gradients (HOG) [Dalal & Triggs 2005]
– Human body part estimation [Felzenszwalb & Huttenlocher 2005]

SLIDE 51

Importance of action objects

  • Human pose often not sufficient by itself
  • Objects define the actions
SLIDE 52

Action recognition from still images

  • Supervised modeling of the interaction between human & object [Gupta et al. 2009, Yao & Fei-Fei 2009]

  • Weakly-supervised learning of objects [Prest, Schmid & Ferrari 2011]

Results on the PASCAL VOC 2010 human action classification dataset

SLIDE 53

Weakly-supervised learning of objects - Overview

Key idea: automatically localize action objects in training images

[Prest et al., PAMI’12]

SLIDE 54

Automatic localization of action objects

  • Find the object recurring across images at similar positions with respect to the human
  • Human‐centric: human detection serves as a reference frame

Input:

images with automatically detected humans (extension of Felzenszwalb et al. PAMI 2009)

Output:

localized action object

SLIDE 55

The Human‐Object Model

  • Objectness → candidate windows
  • Find one window per image minimizing an energy
  • Approximate inference with TRW-S (Kolmogorov PAMI 2006)

Energy terms: human-object spatial relation similarity + object appearance similarity (bag-of-words) + unary terms (objectness + human overlap)

Human overlap: penalize overlap with the human, as it is the most frequent pattern
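The selection problem can be written as an energy over one window choice per image. The paper uses TRW-S for inference; the iterated-conditional-modes pass below is a simpler stand-in, purely for illustration under assumed precomputed cost terms.

```python
import numpy as np

def icm_select(unary, pairwise, n_iters=10):
    """Pick one candidate window per image, approximately minimizing the energy.

    unary[i][j]: unary cost of window j in image i (objectness + human overlap).
    pairwise[i][l]: (n_i, n_l) matrix of pairwise costs between images i and l
    (appearance dissimilarity + spatial-relation dissimilarity).
    """
    n = len(unary)
    pick = [int(np.argmin(u)) for u in unary]        # initialize from unaries
    for _ in range(n_iters):
        for i in range(n):
            cost = np.asarray(unary[i], dtype=float).copy()
            for l in range(n):
                if l != i:
                    cost += pairwise[i][l][:, pick[l]]
            pick[i] = int(np.argmin(cost))
    return pick   # one selected action-object window per training image
```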

SLIDE 56

Object appearance similarity

For a pair of candidate windows (j, m) in training images (i, l) measure:

  • distance between color histograms
  • distance between bag-of-words with a 3-level spatial pyramid of SURF

[Lazebnik et al. CVPR 2006; Bay et al. CVIU 2008]
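A sketch of this pairwise appearance term, assuming precomputed normalized histograms; the chi-square distance and the equal weighting of the two cues are assumptions.

```python
import numpy as np

def chi2_dist(h1, h2):
    """Chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-8))

def appearance_dissimilarity(color1, color2, bow1, bow2, alpha=0.5):
    """color*: color histograms; bow*: spatial-pyramid SURF bag-of-words."""
    return alpha * chi2_dist(color1, color2) + (1 - alpha) * chi2_dist(bow1, bow2)
```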

SLIDE 57

Human‐object spatial relation similarity

Similarity between the spatial relations of two candidate windows with respect to the human in their respective images

SLIDE 58

Human‐object spatial relation similarity

Dissimilar pairs → high energy; similar pairs → low energy

Cues: relative scale, relative distance, relative overlap, relative location
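The four cues can be packed into a human-centric relation vector, for example as below; the exact normalizations are assumptions, with boxes given as (x, y, w, h).

```python
import numpy as np

def spatial_relation(obj, human):
    """Human-centric relation vector for boxes given as (x, y, w, h)."""
    ox, oy, ow, oh = obj
    hx, hy, hw, hh = human
    scale = (ow * oh) / (hw * hh)                       # relative scale
    dist = np.hypot((ox + ow / 2) - (hx + hw / 2),      # relative distance
                    (oy + oh / 2) - (hy + hh / 2)) / np.hypot(hw, hh)
    ix = max(0.0, min(ox + ow, hx + hw) - max(ox, hx))  # relative overlap
    iy = max(0.0, min(oy + oh, hy + hh) - max(oy, hy))
    overlap = (ix * iy) / (ow * oh)
    loc = ((ox - hx) / hw, (oy - hy) / hh)              # relative location
    return np.array([scale, dist, overlap, loc[0], loc[1]])

# Pairwise energy term, e.g. the distance between two relation vectors:
# np.linalg.norm(spatial_relation(o1, h1) - spatial_relation(o2, h2))
```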

SLIDE 59

Sports dataset

  • 6 action classes
  • 30 training images per class
  • 20 test images per class

Annotations:

  • human silhouettes; used for training by [1] (limb annotations by [2])
  • object bounding-boxes; used by [1,2] to train object detectors

Our model is trained only from image labels.

[1] Gupta, Kembhavi, Davis, PAMI 2009; [2] Yao, Fei-Fei, CVPR 2010

SLIDE 60

Example localized action objects

Minimize the energy → localize objects

Sports dataset (Gupta PAMI'09), TBH dataset (Prest PAMI'12)

SLIDE 61

Learning the human‐object interaction model

Localized objects → learn the human-object spatial distribution

Panels: example image, ground truth, our result (relative x,y position; relative scale)

Results are qualitatively close to the ground truth

SLIDE 62

Overall action classifier

Human-object relative position: as in previous slides
Whole-scene: GIST (Oliva and Torralba IJCV 2001)
Object appearance: bag-of-words of SURF
Pose-from-gradients: GIST around the human

SLIDE 63

Action classification results: Sports dataset

Method                Whole-scene   + Pose-from-gradients + Object appearance   Full model (+ human-object relations)
Our model             67            76                                          81
Gupta et al. [1]      66            -                                           79
Yao and Fei-Fei [2]   -             -                                           83

Average classification rate on the test set

+ performs similarly to [1,2] while using substantially less supervision
+ human-object spatial relations contribute visibly to performance

[1] Gupta, Kembhavi, Davis, PAMI 2009; [2] Yao, Fei-Fei, CVPR 2010

SLIDE 64

Action classification results: PASCAL Action 2010

Class set              Whole-scene   + Pose-from-gradients + Object appearance   Full model   Koniusz et al. [3]
all classes            28            59                                          62           62
human-object classes   28            59                                          63           58

Average classification accuracy on the test set

Compared to [3], the highest-mAP entry in the 2010 challenge; all methods take the human location as input
+ on all classes: performs on par with [3] with only weak supervision
+ on human-object classes: outperforms [3]

[3] Everingham et al., The PASCAL VOC 2010 results

SLIDE 65

Importance of temporal information

  • Video/temporal information is necessary to disambiguate actions

  • Temporal context describes the action/activity
  • Key frames provide significantly less information
SLIDE 66

Our approach

Modeling temporal human-object interactions

Describing human and object tracks and their relative motion

SLIDE 67

Tracking humans and objects

Fully automatic human tracks: state-of-the-art detector + Brox tracks
Object tracks: detector learnt from annotated training examples + Brox tracks

Extraction of a large number of human-object track pairs

SLIDE 68

Action descriptors

  • Interaction descriptor: relative location, area and motion between human and object tracks

  • Human track descriptor: 3DHOG-track [Klaeser et al.'10]
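A sketch of such an interaction descriptor over paired tracks, assuming (T, 4) box arrays per track; the temporal pooling (means and standard deviations) is an illustrative choice, not the paper's exact encoding.

```python
import numpy as np

def interaction_descriptor(human_boxes, obj_boxes):
    """human_boxes, obj_boxes: (T, 4) arrays of per-frame (x, y, w, h) boxes."""
    hc = human_boxes[:, :2] + human_boxes[:, 2:] / 2    # track centers
    oc = obj_boxes[:, :2] + obj_boxes[:, 2:] / 2
    rel_loc = (oc - hc) / human_boxes[:, 2:]            # location wrt human size
    rel_area = (obj_boxes[:, 2] * obj_boxes[:, 3]) / (
        human_boxes[:, 2] * human_boxes[:, 3])          # relative area
    rel_motion = np.diff(rel_loc, axis=0)               # relative motion
    return np.concatenate([rel_loc.mean(axis=0),
                           [rel_area.mean()],
                           rel_motion.mean(axis=0),
                           rel_motion.std(axis=0)])
```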
SLIDE 69

Experimental results on C&C

Drinking

SLIDE 70

Experimental results on C&C

Smoking

SLIDE 71

Experimental results on C&C

SLIDE 72

Comparison to the state of the art

SLIDE 73

Experimental results on Gupta dataset

Actions: answering the phone, making a phone call, drinking, using a light torch, pouring water from a cup, using a spray bottle

SLIDE 74

Experimental results on Gupta dataset

  • Interactions achieve the best performance alone
  • Combination improves results further: only 2 misclassified samples
  • Compared to the state of the art: Gupta et al. use significantly more training information
SLIDE 75

Experimental results on Rochester dataset

  • Rochester daily activities dataset

– 150 videos of 5 persons
– leave-one-person-out test scenario

SLIDE 76

Experimental results on Rochester dataset

SLIDE 77

Experimental results on Rochester dataset

SLIDE 78

Conclusion

  • Human-object interaction descriptor obtains state-of-the-art performance

  • Complementary to 3DHOG-track descriptor
  • Combination obtains excellent performance
  • Automatic extraction of objects
SLIDE 79

Automatic extraction of objects

Prest, Leistner, Civera, Schmid, Ferrari, Learning object detectors from weakly annotated video, CVPR 2012

SLIDE 80

Automatic extraction of objects

SLIDE 81

Candidate tubes

Dense point tracks

[N. Sundaram et al., Dense point trajectories by GPU-accelerated large displacement optical flow, ECCV 2010]

SLIDE 82

Candidate tubes

Motion segmentation

SLIDE 83

Candidate tubes

Motion segmentation

SLIDE 84

Selecting candidate tubes

SLIDE 85

Selecting candidate tubes

SLIDE 86

Training + testing object detectors

Settings: video, still images, combination (still images from PASCAL VOC 2007)