Video-based Action Recognition
Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
Outline
◮ Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly-used Action Datasets
◮ Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
◮ Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
◮ Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition
What is an Action?
◮ Action: Atomic motion(s) that can be unambiguously
distinguished (e.g. sitting down, running).
◮ An activity is composed of several actions performed in
succession (e.g. dining, meeting a person).
◮ Event is a combination of activities (e.g. football match,
traffic accident).
What is Action Recognition?
◮ What is Recognition?
  ◮ Verification: Is the walking man Obama?
  ◮ Identification: Who is the walking man?
  ◮ Recognition: What is the man doing?
◮ The recognition of an action is to match an observation (e.g. a video) with previously defined patterns and then assign it a label, i.e. an action type.
◮ Input: an action video;
◮ Output: an action label;
Why Need Action Recognition?
◮ Expensive human effort is needed to handle the rapidly increasing amount of video recordings;
◮ Large number of potential applications:
  ◮ visual surveillance
  ◮ crowd behavior analysis
  ◮ human-machine interfaces
  ◮ sports video analysis
  ◮ video retrieval
  ◮ etc.
Main Challenges in Action Recognition
◮ Different scales
  ◮ People may appear at different scales in different videos, yet perform the same action.
◮ Movement of the camera
◮ Background “clutter”
  ◮ Other objects/humans present in the video frame.
◮ Partial occlusions
◮ Human/action variation (large intra-class variation)
  ◮ Walking movements can differ in speed and stride length.
◮ Etc.
Categories of Action Recognition Methods
Appearance Representation
◮ Focus on extracting “better” appearance representation
from action video;
◮ hand-crafted features: HOG [7], HOF [4], MBH [18] or
combinations [18];
◮ learned features: deep neural network [20, 5, 16]
Categories of Action Recognition Methods
Dynamic Modeling
◮ Focus on modeling the dynamics and motions in action videos;
◮ Deterministic models: dynamic time warping [24],
maximum margin temporal warping [19], actom sequence model [6], graphs [3] and deep neural architectures [14, 17];
◮ Generative models: HMM [10], coupled HMM [2],
CRF [21] and dynamic Bayes nets [23].
Small Size Datasets
◮ The KTH Dataset [13]
◮ 6 actions (walking, jogging, running, boxing, hand
waving and hand clapping)
◮ The Weizmann Dataset [1]
◮ 10 actions (walk, run, jump, gallop sideways, bend, one-hand wave, two-hands wave, jump in place, jumping jack and skip)
◮ The UCF Sports Action Dataset
◮ 9 actions (diving, golf swinging, kicking, weightlifting,
horseback riding, running, skating, swinging a baseball bat and walking)
Large Size Datasets
◮ The IXMAS Dataset [22]
◮ 14 actions (check watch, cross arms, scratch head, sit
down, get up, turn around, walk, wave, punch, kick, point, pick up, throw over head and throw from bottom up)
◮ Hollywood Human Action Dataset [11]
◮ 12 actions (answer phone, get out of car, handshake,
hug, kiss, sit down, sit up, stand up, drive car, eat, fight and run)
◮ The UCF50 Dataset [12]: 50 different actions/activities
◮ The HMDB51 Dataset [8]: 51 different actions/activities
◮ The UCF101 Dataset [15]: 101 different actions/activities
Data Samples
Figure : (a) KTH Dataset; (b) Hollywood Dataset
Action Recognition by Appearance Representation - I (On Space-Time Interest Points)
Overview
◮ Title: On Space-Time Interest Points (2005) 1
◮ Motivated by the Harris and Förstner spatial interest point operators, extended into the spatio-temporal domain;
◮ Aims to find “good” spatio-temporal positions in a sequence for feature extraction;
◮ Distinct and stable descriptors are extracted from the obtained interest points;
◮ Author: Ivan Laptev
1 I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005
Spatio-Temporal Interest Points
◮ The points that have large variations along both the spatial and the temporal directions in local spatio-temporal volumes.
Figure : Detecting the strongest spatio-temporal interest points in a football sequence with a player heading the ball.
Spatio-Temporal Interest Point Detection
◮ In the spatial domain, we can model an image f^sp by its linear scale-space representation L^sp:

    L^sp(x, y; σ_l²) = g^sp(x, y; σ_l²) ∗ f^sp(x, y)

◮ Like the operation for images, we can model the sequence f by a linear spatio-temporal scale-space representation L:

    L(·; σ_l², τ_l²) = g(·; σ_l², τ_l²) ∗ f(·)

  where the spatio-temporal Gaussian kernel is

    g(x, y, t; σ_l², τ_l²) = exp(−(x² + y²)/(2σ_l²) − t²/(2τ_l²)) / √((2π)³ σ_l⁴ τ_l²)
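As a concrete sketch of this smoothing step (an illustrative NumPy/SciPy implementation, not the paper's code: the function name `scale_space`, the (T, H, W) layout, and the toy input are my own), the separable Gaussian convolution with independent spatial and temporal scales can be written as:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(video, sigma_l, tau_l):
    """Spatio-temporal scale-space L(.; sigma_l^2, tau_l^2) = g * f, computed
    as a separable Gaussian convolution: tau_l smooths the temporal axis,
    sigma_l the two spatial axes. `video` has shape (T, H, W)."""
    return gaussian_filter(video.astype(float), sigma=(tau_l, sigma_l, sigma_l))

# toy input: a single bright dot moving one pixel per frame
f = np.zeros((16, 32, 32))
for t in range(16):
    f[t, 16, 8 + t] = 1.0
L = scale_space(f, sigma_l=2.0, tau_l=1.0)
```

Smoothing spreads the moving dot into a blurred space-time streak while (with the default reflective boundary handling) preserving its total mass.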
Spatio-Temporal Interest Point Detection
◮ Construct a 3 × 3 spatio-temporal second-moment matrix, averaged with a Gaussian window at integration scales (σ_i², τ_i²):

    μ = g(·; σ_i², τ_i²) ∗ [ L_x²    L_xL_y  L_xL_t ;
                             L_xL_y  L_y²    L_yL_t ;
                             L_xL_t  L_yL_t  L_t²   ]
◮ The first-order derivatives are defined as (ξ ∈ {x, y, t}):

    L_ξ(·; σ_l², τ_l²) = ∂_ξ(g ∗ f)

◮ With λ1, λ2 and λ3 the three eigenvalues of μ, the spatio-temporal Harris corner function is defined as:

    H = det(μ) − k · trace³(μ) = λ1 λ2 λ3 − k (λ1 + λ2 + λ3)³

◮ Detect the interest points as the positive local maxima of H;
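The detection pipeline (smooth, differentiate, average the second-moment matrix, evaluate H) can be sketched as follows. This is an illustrative implementation, not Laptev's original code: the name `harris3d`, the parameter defaults (local scales, the integration-to-local scale ratio `s`, and `k`), and the toy clip are all assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d(video, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris function H = det(mu) - k * trace(mu)^3 for a
    (T, H, W) video. sigma/tau are the local scales; the Gaussian averaging
    window uses the integration scales s*sigma and s*tau."""
    L = gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)  # first derivatives along t, y, x
    w = (s * tau, s * sigma, s * sigma)
    # entries of the second-moment matrix, averaged at the integration scale
    m = {name: gaussian_filter(a * b, sigma=w)
         for name, (a, b) in {"xx": (Lx, Lx), "yy": (Ly, Ly), "tt": (Lt, Lt),
                              "xy": (Lx, Ly), "xt": (Lx, Lt),
                              "yt": (Ly, Lt)}.items()}
    det = (m["xx"] * (m["yy"] * m["tt"] - m["yt"] ** 2)
           - m["xy"] * (m["xy"] * m["tt"] - m["xt"] * m["yt"])
           + m["xt"] * (m["xy"] * m["yt"] - m["yy"] * m["xt"]))
    tr = m["xx"] + m["yy"] + m["tt"]
    return det - k * tr ** 3

# toy usage: a step edge that appears mid-sequence forms a space-time corner
clip = np.zeros((12, 24, 24))
clip[6:, 12:, 12:] = 1.0
H = harris3d(clip)  # interest points = positive local maxima of H
```

In a full detector one would follow this with non-maximum suppression over 3D neighborhoods to keep only the strongest positive local maxima.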
Space-Time Interest Points: Examples
(a) Action: clapping hands; (b) The detected interest points
Spatio-Temporal Scale Adaptation
◮ Recall the scale-space representation L(·; σ_l², τ_l²): the two scale factors σ_l² and τ_l² strongly influence the result;
◮ The larger τ_l² is, the more easily space-time structures with long temporal extents are detected;
◮ The larger σ_l² is, the more easily space-time structures with large spatial extents are detected;
Spatio-Temporal Scale Adaptation
◮ By finding the extrema of the scale-normalized Laplacian ∇²_norm L over both spatial and temporal scales, we can automatically determine the scale factors.
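A minimal sketch of this scale-selection idea (illustrative only; the normalization exponents σ²τ^(1/2) for the spatial part and στ^(3/2) for the temporal part follow Laptev's choice, while the function name, the candidate scale grid, and the synthetic blob are my own assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def norm_laplacian(video, sigma, tau):
    """Scale-normalized spatio-temporal Laplacian:
    sigma^2 * tau^(1/2) * (Lxx + Lyy) + sigma * tau^(3/2) * Ltt."""
    L = gaussian_filter(video.astype(float), sigma=(tau, sigma, sigma))
    Ltt = np.gradient(np.gradient(L, axis=0), axis=0)
    Lyy = np.gradient(np.gradient(L, axis=1), axis=1)
    Lxx = np.gradient(np.gradient(L, axis=2), axis=2)
    return sigma**2 * np.sqrt(tau) * (Lxx + Lyy) + sigma * tau**1.5 * Ltt

# evaluate a small grid of candidate scales on a blob of known extent,
# then keep the (sigma, tau) pair giving the strongest extremal response
blob = np.zeros((16, 32, 32))
blob[8, 16, 16] = 1.0
blob = gaussian_filter(blob, sigma=(2.0, 3.0, 3.0))
scales = [(s, t) for s in (1.0, 2.0, 4.0) for t in (1.0, 2.0)]
responses = {st: np.abs(norm_laplacian(blob, *st)).max() for st in scales}
best_sigma, best_tau = max(responses, key=responses.get)
```

The selected (σ, τ) pair tracks the spatial and temporal extent of the underlying structure, which is exactly why the operator can adapt the detector's scales automatically.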
Result
Figure : Results of spatial/spatio-temporal interest point detection for a zoom-in sequence of a walking person.
Result
Figure : (top): Correct matches in sequences with leg actions; (bottom): Correct matches in sequences with arm actions;
Action Recognition by Appearance Representation - II (Recognizing Human Actions: A Local SVM Approach)
Overview
◮ Title: Recognizing Human Actions: A Local SVM
Approach (2004) 2
◮ Use local space-time features to represent video
sequences that contain actions.
◮ Classification is done via an SVM.
◮ Authors: Christian Schuldt, Ivan Laptev and Barbara Caputo
2 C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004
Local Space-time Features
Figure : Local space-time features detected for a walking pattern 3
3 C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004
Representation of Features
◮ Spatio-temporal “jets” (up to 4th order) are computed at each feature center:

    j = (L_x, L_y, L_t, L_xx, ..., L_tttt) |_{σ² = σ̃_i², τ² = τ̃_i²}

  where

    L_{x^m y^n t^k} = σ^{m+n} τ^k (∂_{x^m y^n t^k} g) ∗ f

◮ Using k-means clustering over the jets j, a vocabulary of words h_i is created from the jet descriptors;
◮ Finally, a given video is represented by a histogram of occurrence counts of the features corresponding to each h_i in that video: H = (h1, ..., hn)
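The vocabulary-and-histogram step can be sketched with scikit-learn's k-means (an illustrative stand-in: random vectors replace real jet descriptors, and `video_histogram` and the vocabulary size are my own choices; 34 is the number of partial derivatives of orders 1-4 in the three variables x, y, t):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# stand-in for 4th-order jet descriptors collected from training videos
training_jets = rng.normal(size=(500, 34))
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(training_jets)

def video_histogram(jets, kmeans):
    """Represent a video by counts of its jets' nearest vocabulary words."""
    words = kmeans.predict(jets)
    H = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return H / max(H.sum(), 1.0)  # normalize so videos of any length compare

# a "video" with 60 detected features becomes one fixed-length histogram
H = video_histogram(rng.normal(size=(60, 34)), kmeans)
```

Normalizing the histogram makes videos with different numbers of detected interest points directly comparable as inputs to a classifier.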
Recognition by Support Vector Machines
◮ For action recognition, the obtained local space-time features are combined with an SVM;
◮ Given a set of training data from different action classes {(H_i, y_i)}_{i=1}^n, an SVM classifier for each action class is learned:

    f(H) = sgn( Σ_{i=1}^n α_i y_i ⟨H_i, H⟩ + b )

◮ Easy to extend to a kernelized version by replacing the inner product ⟨H_i, H⟩ with a kernel K(H_i, H);
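The classification stage might look like the following sketch (synthetic data: the two toy "action classes" are Dirichlet-distributed histograms with different dominant vocabulary words, a construction of my own, not the paper's data):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# class 0: mass concentrated on words 0-3; class 1: on words 4-7
X0 = rng.dirichlet(np.r_[5.0 * np.ones(4), np.ones(4)], size=40)
X1 = rng.dirichlet(np.r_[np.ones(4), 5.0 * np.ones(4)], size=40)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(40), np.ones(40)]

# linear SVM on the histograms; swap kernel= for a kernelized version
clf = SVC(kernel="linear").fit(X, y)
acc = clf.score(X, y)
```

For multiple action classes one would train one such classifier per class (one-vs-rest) and pick the class with the largest decision value.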
Results
Figure : Results of action recognition for different methods and scenarios on KTH dataset.
Action Recognition by Dynamic Modeling (Coupled Hidden Markov Models for Complex Action Recognition)
Overview
◮ Title: Coupled Hidden Markov Models for Complex
Action Recognition (1997) 4
◮ Hidden Markov models (HMMs) are well suited to modeling and classifying dynamic behaviors.
◮ But an HMM is not suitable for multiple interacting processes, which have structure in both time and space.
◮ Coupled hidden Markov models can model multiple
interacting processes without running afoul of the Markov condition.
◮ Author: Matthew Brand, Nuria Oliver and Alex Pentland
4 M. Brand, N. Oliver, and A. Pentland. Coupled hidden markov models for complex action recognition. In Computer vision and pattern recognition, 1997. proceedings., 1997 ieee computer society conference on, pages 994–999. IEEE, 1997
Restriction of HMM
◮ HMMs are favored because they implicitly handle time-varying signals that satisfy the Markov condition.
◮ Consequently, HMMs are ill-suited to systems with compositional state, e.g., multiple interacting processes that have structure in both time and space.
◮ Think about how to model “A gave B the C”?
Coupling and Factoring HMMs
◮ In order to handle multiple interacting processes (to couple HMMs), we need to obtain a joint HMM C from the two coupled HMMs A and B;
◮ Given the states a_i and b_j and transition parameters P_{a_i|a_j} and P_{b_k|b_l}, the joint states are c_ij = {a_i, b_j} and the transitions are:

    P_{c_ik|c_jl} = Ψ(P_{a_i|a_j}, P_{b_k|b_l}, P_{a_i|b_l}, P_{b_k|a_j})

◮ P_{a_i|b_l} and P_{b_k|a_j} are the coupling parameters;
Coupling and Factoring HMMs
◮ We can also project the joint HMM back onto its components:

    P_{a_i|a_j} ∝ Σ_l Σ_k P_{c_ik|c_jl}        P_{a_i|b_l} ∝ Σ_j Σ_k P_{c_ik|c_jl}

◮ So a joint HMM can be trained via standard HMM methods;
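The coupling and factoring operations can be sketched in NumPy. This is only an illustration: the simple renormalized product used here for Ψ is one plausible choice, not necessarily Brand et al.'s exact formulation, and all function names and the indexing convention are my own.

```python
import numpy as np

rng = np.random.default_rng(2)

def row_stochastic(n, m):
    """A random transition table whose rows sum to one."""
    P = rng.random((n, m))
    return P / P.sum(axis=1, keepdims=True)

def couple(P_aa, P_bb, P_ab, P_ba):
    """Joint transition table for states c_ij = (a_i, b_j), using a simple
    product form for Psi, renormalized per row.  Indexing convention:
    P_aa[j, i] = P(a_i|a_j), P_ab[l, i] = P(a_i|b_l),
    P_bb[l, k] = P(b_k|b_l), P_ba[j, k] = P(b_k|a_j)."""
    nA, nB = P_aa.shape[0], P_bb.shape[0]
    J = np.zeros((nA * nB, nA * nB))
    for j in range(nA):
        for l in range(nB):
            for i in range(nA):
                for k in range(nB):
                    J[j * nB + l, i * nB + k] = (P_aa[j, i] * P_ab[l, i]
                                                 * P_bb[l, k] * P_ba[j, k])
    return J / J.sum(axis=1, keepdims=True)

def factor_a(J, nA, nB):
    """Project back onto chain A: P(a_i|a_j) prop. to sum over l, k."""
    P = J.reshape(nA, nB, nA, nB).sum(axis=(1, 3))  # marginalize l and k
    return P / P.sum(axis=1, keepdims=True)

P_aa, P_bb = row_stochastic(3, 3), row_stochastic(2, 2)
P_ab, P_ba = row_stochastic(2, 3), row_stochastic(3, 2)
J = couple(P_aa, P_bb, P_ab, P_ba)       # 6 x 6 joint transition table
P_aa_back = factor_a(J, 3, 2)            # recovered 3 x 3 component table
```

Because the joint table J is an ordinary row-stochastic transition matrix over the product state space, the coupled model can indeed be trained with standard HMM machinery and then factored back into its components.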
One Case of Application
Figure : Here we can use the proposed model to represent the action performed by two hands.
References I
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri.
Actions as space-time shapes. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 1395–1402. IEEE, 2005.
- M. Brand, N. Oliver, and A. Pentland.
Coupled hidden markov models for complex action recognition. In Computer vision and pattern recognition, 1997. proceedings., 1997 ieee computer society conference on, pages 994–999. IEEE, 1997.
- W. Brendel and S. Todorovic.
Learning spatiotemporal graphs of human activities. In 2011 International Conference on Computer Vision, pages 778–785. IEEE, 2011.
- R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal.
Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1932–1939. IEEE, 2009.
- G. Chéron, I. Laptev, and C. Schmid.
P-cnn: Pose-based cnn features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3218–3226, 2015.
References II
- A. Gaidon, Z. Harchaoui, and C. Schmid.
Actom sequence models for efficient action detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3201–3208. IEEE, 2011.
- A. Kläser, M. Marszałek, and C. Schmid.
A spatio-temporal descriptor based on 3d-gradients. In BMVC 2008-19th British Machine Vision Conference, pages 275–1. British Machine Vision Association, 2008.
- H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre.
Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
- I. Laptev.
On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
- K. Li, J. Hu, and Y. Fu.
Modeling complex temporal composition of actionlets for activity prediction. In European Conference on Computer Vision, pages 286–299. Springer, 2012.
References III
- M. Marszalek, I. Laptev, and C. Schmid.
Actions in context. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2929–2936. IEEE, 2009.
- K. K. Reddy and M. Shah.
Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013.
- C. Schuldt, I. Laptev, and B. Caputo.
Recognizing human actions: a local svm approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.
- K. Simonyan and A. Zisserman.
Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
- K. Soomro, A. R. Zamir, and M. Shah.
Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- N. Srivastava, E. Mansimov, and R. Salakhutdinov.
Unsupervised learning of video representations using lstms. CoRR, abs/1502.04681, 2, 2015.
References IV
- V. Veeriah, N. Zhuang, and G.-J. Qi.
Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 4041–4049, 2015.
- H. Wang, A. Kläser, C. Schmid, and C.-L. Liu.
Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE, 2011.
- J. Wang and Y. Wu.
Learning maximum margin temporal warping for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2688–2695, 2013.
- L. Wang, Y. Qiao, and X. Tang.
Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.
- Y. Wang and G. Mori.
Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1310–1323, 2011.
References V
- D. Weinland, E. Boyer, and R. Ronfard.
Action recognition from arbitrary views using 3d exemplars. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.
- T. Xiang and S. Gong.
Beyond tracking: Modelling activity and understanding behaviour. International Journal of Computer Vision, 67(1):21–51, 2006.
- B. Yao and S.-C. Zhu.
Learning deformable action templates from cluttered videos. In 2009 IEEE 12th International Conference on Computer Vision, pages 1507–1514. IEEE, 2009.