

SLIDE 1

Video-based Action Recognition

Ying Wu Electrical Engineering and Computer Science Northwestern University, Evanston, IL 60208

SLIDE 2

Outline

Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly Used Action Datasets
Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition

SLIDE 3

What is an Action?

◮ Action: atomic motion(s) that can be unambiguously distinguished (e.g. sitting down, running).
◮ Activity: composed of several actions performed in succession (e.g. dining, meeting a person).
◮ Event: a combination of activities (e.g. a football match, a traffic accident).

SLIDE 4

What is Action Recognition?

◮ What is Recognition?
  ◮ Verification: Is the walking man Obama?
  ◮ Identification: Who is the walking man?
  ◮ Recognition: What is the man doing?
◮ To recognize an action is to match the observation (e.g. a video) against previously defined patterns and assign it a label, i.e. an action type.
◮ Input: an action video;
◮ Output: an action label.

SLIDE 5

Why Do We Need Action Recognition?

◮ Handling the rapidly increasing volume of video recordings by hand requires expensive human effort;
◮ Large number of potential applications:
  ◮ visual surveillance
  ◮ crowd behavior analysis
  ◮ human-machine interfaces
  ◮ sports video analysis
  ◮ video retrieval
  ◮ etc.

SLIDE 6

Main Challenges in Action Recognition

◮ Different scales
  ◮ People may appear at different scales in different videos, yet perform the same action.
◮ Movement of the camera
◮ Background “clutter”
  ◮ Other objects/humans present in the video frame.
◮ Partial occlusions
◮ Human/action variation (large intra-class variation)
  ◮ Walking movements can differ in speed and stride length.
◮ Etc.

SLIDE 7

Categories of Action Recognition Methods

Appearance Representation

◮ Focus on extracting “better” appearance representations from the action video;
◮ Hand-crafted features: HOG [7], HOF [4], MBH [18], or combinations [18];
◮ Learned features: deep neural networks [20, 5, 16].

SLIDE 8

Categories of Action Recognition Methods

Dynamic Modeling

◮ Focus on modeling the dynamics and motions in the action video;
◮ Deterministic models: dynamic time warping [24], maximum margin temporal warping [19], actom sequence models [6], graphs [3], and deep neural architectures [14, 17];
◮ Generative models: HMMs [10], coupled HMMs [2], CRFs [21], and dynamic Bayes nets [23].

SLIDE 9

Small Size Datasets

◮ The KTH Dataset [13]
  ◮ 6 actions (walking, jogging, running, boxing, hand waving and hand clapping)
◮ The Weizmann Dataset [1]
  ◮ 10 actions (walk, run, jump, gallop sideways, bend, one-hand wave, two-hands wave, jump in place, jumping jack and skip)
◮ The UCF Sports Action Dataset
  ◮ 9 actions (diving, golf swinging, kicking, weightlifting, horseback riding, running, skating, swinging a baseball bat and walking)

SLIDE 10

Large Size Datasets

◮ The IXMAS Dataset [22]
  ◮ 14 actions (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, point, pick up, throw over head and throw from bottom up)
◮ The Hollywood Human Action Dataset [11]
  ◮ 12 actions (answer phone, get out of car, handshake, hug, kiss, sit down, sit up, stand up, drive car, eat, fight and run)
◮ The UCF50 Dataset [12]: 50 different actions/activities
◮ The HMDB51 Dataset [8]: 51 different actions/activities
◮ The UCF101 Dataset [15]: 101 different actions/activities

SLIDE 11

Data Samples

Figure : (a) KTH Dataset; (b) Hollywood Dataset

SLIDE 12

Outline

Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly Used Action Datasets
Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition

SLIDE 13

Overview

◮ Title: On Space-Time Interest Points (2005)¹
◮ Motivated by the Harris and Förstner spatial interest point operators, extended into the spatio-temporal domain;
◮ Aims to find the “good” spatio-temporal positions in a sequence for feature extraction;
◮ Distinct and stable descriptors are extracted at the obtained interest points;
◮ Author: Ivan Laptev

¹ I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

SLIDE 14

Spatio-Temporal Interest Points

◮ The points that have large variations along both the spatial and the temporal directions in local spatio-temporal volumes.

Figure : Detecting the strongest spatio-temporal interest points in a football sequence with a player heading the ball.

SLIDE 15

Spatio-Temporal Interest Point Detection

◮ In the spatial domain, we can model an image f^sp by its linear scale-space representation L^sp:

  L^sp(x, y; σ_l²) = g^sp(x, y; σ_l²) ∗ f^sp(x, y)

◮ As for images, we can model the sequence f by a linear scale-space representation L:

  L(·; σ_l², τ_l²) = g(·; σ_l², τ_l²) ∗ f(·)

  with the spatio-temporal Gaussian kernel

  g(x, y, t; σ_l², τ_l²) = exp(−(x² + y²)/2σ_l² − t²/2τ_l²) / √((2π)³ σ_l⁴ τ_l²)
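The smoothing above is just separable Gaussian filtering of the video volume, with one standard deviation for the two spatial axes and another for time. A minimal sketch with scipy (the function name is mine, not from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(f, sigma_l, tau_l):
    """Spatio-temporal scale-space L(.; sigma_l^2, tau_l^2) = g * f.

    f is a video volume indexed (t, y, x); the Gaussian kernel is
    separable, with spatial std sigma_l and temporal std tau_l.
    """
    return gaussian_filter(f, sigma=(tau_l, sigma_l, sigma_l))

# Smoothing white noise at a coarser scale reduces its variance.
rng = np.random.default_rng(0)
f = rng.standard_normal((30, 40, 40))
L = scale_space(f, sigma_l=2.0, tau_l=1.5)
assert L.shape == f.shape
assert L.var() < f.var()
```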

SLIDE 16

Spatio-Temporal Interest Point Detection

◮ Construct a 3 × 3 spatio-temporal second-moment matrix:

  μ = g(·; σ_i², τ_i²) ∗ [ L_x²    L_xL_y  L_xL_t
                           L_xL_y  L_y²    L_yL_t
                           L_xL_t  L_yL_t  L_t²  ]

◮ The first-order derivatives are defined as (with ξ ∈ {x, y, t}):

  L_ξ(·; σ_l², τ_l²) = ∂_ξ(g ∗ f)

◮ Computing the three eigenvalues λ₁, λ₂ and λ₃ of μ, the spatio-temporal Harris corner function is then defined as:

  H = det(μ) − k·trace³(μ) = λ₁λ₂λ₃ − k(λ₁ + λ₂ + λ₃)³

◮ Interest points are detected as the positive local maxima of H.
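The detector can be sketched in a few lines of numpy/scipy; this is an illustrative implementation of the formulas above (the function name, parameter defaults, and the choice of integration scale s·σ_l are mine, and the paper's scale selection is omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d(f, sigma_l=1.5, tau_l=1.5, s=2.0, k=0.005):
    """Spatio-temporal Harris function H = det(mu) - k * trace(mu)^3.

    f is a video volume indexed (t, y, x). Derivatives are taken on the
    scale-space representation L, and the entries of the second-moment
    matrix mu are averaged with a Gaussian at the integration scale
    (s*sigma_l, s*tau_l).
    """
    L = gaussian_filter(f, sigma=(tau_l, sigma_l, sigma_l))
    Lt, Ly, Lx = np.gradient(L)
    g_i = (s * tau_l, s * sigma_l, s * sigma_l)
    # Smoothed entries of the symmetric 3x3 second-moment matrix.
    xx = gaussian_filter(Lx * Lx, g_i); yy = gaussian_filter(Ly * Ly, g_i)
    tt = gaussian_filter(Lt * Lt, g_i); xy = gaussian_filter(Lx * Ly, g_i)
    xt = gaussian_filter(Lx * Lt, g_i); yt = gaussian_filter(Ly * Lt, g_i)
    # det(mu) expanded along the first row, then H = det - k * trace^3.
    det = (xx * (yy * tt - yt**2) - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace**3

# A moving bright point yields a finite response map of the same shape.
f = np.zeros((16, 24, 24))
for t in range(16):
    f[t, 4 + t // 2, 4 + t // 2] = 1.0
H = harris3d(f)
assert H.shape == f.shape and np.isfinite(H).all()
```

Interest points would then be taken as positive local maxima of the returned map.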

SLIDE 17

Space-Time Interest Points: Examples

(a) Action: clapping hands (b) The detected interest points

SLIDE 18

Spatio-Temporal Scale Adaptation

◮ Recall the scale-space representation L(·; σ_l², τ_l²); the two scale factors σ_l² and τ_l² strongly influence the result;
◮ The larger τ_l² is, the more easily space-time structures with long temporal extents are detected;
◮ The larger σ_l² is, the more easily space-time structures with large spatial extents are detected.

SLIDE 19

Spatio-Temporal Scale Adaptation

◮ By finding the extrema of the scale-normalized Laplacian ∇²_norm L over both spatial and temporal scales, we can automatically determine the scale factors.
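For reference, the scale-normalized Laplacian combines normalized spatial and temporal second derivatives; the normalization exponents below are my reconstruction from Laptev's 2005 paper and should be checked against it:

```latex
\nabla^2_{\mathrm{norm}} L
  = L_{xx,\mathrm{norm}} + L_{yy,\mathrm{norm}} + L_{tt,\mathrm{norm}}
  = \sigma^2 \tau^{1/2} \, (L_{xx} + L_{yy}) + \sigma \, \tau^{3/2} \, L_{tt}
```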

SLIDE 20

Result

Figure : Results of spatial/spatio-temporal interest point detection for a zoom-in sequence of a walking person.

SLIDE 21

Result

Figure : (top): Correct matches in sequences with leg actions; (bottom): Correct matches in sequences with arm actions;

SLIDE 22

Outline

Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly Used Action Datasets
Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition

SLIDE 23

Overview

◮ Title: Recognizing Human Actions: A Local SVM Approach (2004)²
◮ Uses local space-time features to represent video sequences that contain actions;
◮ Classification is done with an SVM;
◮ Authors: Christian Schuldt, Ivan Laptev and Barbara Caputo

² C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.

SLIDE 24

Local Space-time Features

Figure : Local space-time features detected for a walking pattern³

³ C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.

SLIDE 25

Representation of Features

◮ Spatio-temporal “jets” (4th order) are computed at each feature center:

  j = (L_x, L_y, L_t, L_xx, ⋯, L_tttt) |_{σ² = σ̃_i², τ² = τ̃_i²}

  L_{x^m y^n t^k} = σ^{m+n} τ^k (∂_{x^m y^n t^k} g) ∗ f

◮ Using k-means clustering over j, a vocabulary consisting of words h_i is created from the jet descriptors;
◮ Finally, a given video is represented by a histogram of counts of occurrences of features corresponding to h_i in that video: H = (h₁, …, h_n)
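The quantize-and-count step can be sketched as follows, assuming the k-means vocabulary has already been learned (names and the toy data are illustrative, not from the paper):

```python
import numpy as np

def quantize(jets, vocab):
    """Assign each jet descriptor to its nearest visual word (Euclidean)."""
    d = ((jets[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def bow_histogram(jets, vocab):
    """Represent a video as a normalized histogram H = (h_1, ..., h_n)
    of visual-word occurrences over its jet descriptors."""
    words = quantize(jets, vocab)
    h = np.bincount(words, minlength=len(vocab)).astype(float)
    return h / h.sum()

# Toy example: a 2-word vocabulary and four jets clustered around it.
vocab = np.array([[0.0, 0.0], [10.0, 10.0]])
jets = np.array([[0.1, -0.2], [9.8, 10.1], [0.3, 0.1], [10.2, 9.9]])
H = bow_histogram(jets, vocab)
assert np.allclose(H, [0.5, 0.5])
```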

SLIDE 26

Recognition by Support Vector Machines

◮ For action recognition, the obtained local space-time features are combined with an SVM;
◮ Given a set of training data from different action classes {(H_i, y_i)}_{i=1}^n, an SVM classifier is learned for each action class:

  f(H) = sgn( Σ_{i=1}^n α_i y_i ⟨H_i, H⟩ + b )

◮ Easy to extend to a kernelized version;
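A sketch of the (kernelized) decision rule above; the χ² kernel here is a common stand-in for histogram features, not necessarily the local kernel used in the paper, and all names are illustrative:

```python
import numpy as np

def svm_decide(H, support_H, alpha, y, b, kernel):
    """f(H) = sgn( sum_i alpha_i * y_i * K(H_i, H) + b )."""
    s = sum(a * yi * kernel(Hi, H) for a, yi, Hi in zip(alpha, y, support_H))
    return int(np.sign(s + b))

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-squared kernel over histograms."""
    eps = 1e-12  # avoid division by zero on empty bins
    d = ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()
    return np.exp(-gamma * d)

# Toy check: two support histograms with opposite labels; a query close
# to the positive one should be classified as positive.
H_pos = np.array([0.8, 0.2]); H_neg = np.array([0.2, 0.8])
pred = svm_decide(np.array([0.7, 0.3]), [H_pos, H_neg],
                  alpha=[1.0, 1.0], y=[+1, -1], b=0.0, kernel=chi2_kernel)
assert pred == +1
```

Swapping `chi2_kernel` for a plain inner product recovers the linear form on the slide.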
SLIDE 27

Results

Figure : Results of action recognition for different methods and scenarios on the KTH dataset.
SLIDE 28

Outline

Introduction
  ◮ The Task of Action Recognition
  ◮ Main Challenges in Action Recognition
  ◮ Categorization of Existing Methods
  ◮ Commonly Used Action Datasets
Action Recognition by Appearance Representation - I
  ◮ On Space-Time Interest Points
Action Recognition by Appearance Representation - II
  ◮ Recognizing Human Actions: A Local SVM Approach
Action Recognition by Dynamic Modeling
  ◮ Coupled Hidden Markov Models for Complex Action Recognition

SLIDE 29

Overview

◮ Title: Coupled Hidden Markov Models for Complex Action Recognition (1997)⁴
◮ The hidden Markov model (HMM) is well suited to modeling and classifying dynamic behaviors;
◮ But an HMM is not suitable for multiple interacting processes (which have structure in both time and space);
◮ Coupled hidden Markov models can model multiple interacting processes without running afoul of the Markov condition;
◮ Authors: Matthew Brand, Nuria Oliver and Alex Pentland

⁴ M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 994–999. IEEE, 1997.

SLIDE 30

Restriction of HMM

◮ HMMs are favored for implicitly handling time-varying signals that satisfy the Markov condition;
◮ Consequently, HMMs are ill-suited to systems that have compositional state, e.g. multiple interacting processes with structure in both time and space;
◮ Think about how to model “A gave B the C”.

SLIDE 31

Coupling and Factoring HMMs

◮ In order to handle multiple interacting processes (to couple HMMs), we need to obtain a joint HMM C from two coupled HMMs A and B;
◮ Given the states a_i and b_j and transition parameters P_{a_i|a_j} and P_{b_k|b_l}, the joint states are c_{ij} = {a_i, b_j}, and the joint transitions are:

  P_{c_ik|c_jl} = Ψ( P_{a_i|a_j}, P_{b_k|b_l}, P_{a_i|b_l}, P_{b_k|a_j} )

◮ P_{a_i|b_l} and P_{b_k|a_j} are the coupling parameters;
SLIDE 32

Coupling and Factoring HMMs

◮ We can also project the joint HMM back onto its components:

  P_{a_i|a_j} ∝ Σ_l Σ_k P_{c_ik|c_jl}        P_{a_i|b_l} ∝ Σ_j Σ_k P_{c_ik|c_jl}

◮ So a joint HMM can be trained via standard HMM methods;
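The projection can be sketched with the joint transitions stored as a 4-D tensor; the indexing convention T[i, k, j, l] = P(c_ik | c_jl) and all names below are mine, chosen to match the slide's formulas:

```python
import numpy as np

def factor_joint(T):
    """Project a joint CHMM transition tensor back onto its components.

    T[i, k, j, l] = P(a_i, b_k at t+1 | a_j, b_l at t).
    Returns P(a_i | a_j) and the coupling P(a_i | b_l), normalized so
    that each column sums to 1 (the slide's proportionality).
    """
    P_aa = T.sum(axis=(1, 3))  # marginalize out b: sum over k and l
    P_ab = T.sum(axis=(1, 2))  # keep only the b -> a coupling
    return P_aa / P_aa.sum(0), P_ab / P_ab.sum(0)

# Toy example: two 2-state chains with a random joint transition.
rng = np.random.default_rng(2)
T = rng.random((2, 2, 2, 2))
T /= T.sum(axis=(0, 1), keepdims=True)  # normalize over the next joint state
P_aa, P_ab = factor_joint(T)
assert np.allclose(P_aa.sum(0), 1.0) and np.allclose(P_ab.sum(0), 1.0)
```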

SLIDE 33

One Case of Application

Figure : The proposed model representing an action performed by two hands.
SLIDE 34

References I

  • M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri.

Actions as space-time shapes. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, volume 2, pages 1395–1402. IEEE, 2005.

  • M. Brand, N. Oliver, and A. Pentland.

Coupled hidden Markov models for complex action recognition. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pages 994–999. IEEE, 1997.

  • W. Brendel and S. Todorovic.

Learning spatiotemporal graphs of human activities. In 2011 International Conference on Computer Vision, pages 778–785. IEEE, 2011.

  • R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal.

Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1932–1939. IEEE, 2009.

  • G. Chéron, I. Laptev, and C. Schmid.

P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3218–3226, 2015.

SLIDE 35

References II

  • A. Gaidon, Z. Harchaoui, and C. Schmid.

Actom sequence models for efficient action detection. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3201–3208. IEEE, 2011.

  • A. Kläser, M. Marszałek, and C. Schmid.

A spatio-temporal descriptor based on 3D-gradients. In BMVC 2008 - 19th British Machine Vision Conference, pages 275:1–10. British Machine Vision Association, 2008.

  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre.

HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.

  • I. Laptev.

On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

  • K. Li, J. Hu, and Y. Fu.

Modeling complex temporal composition of actionlets for activity prediction. In European Conference on Computer Vision, pages 286–299. Springer, 2012.

SLIDE 36

References III

  • M. Marszałek, I. Laptev, and C. Schmid.

Actions in context. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2929–2936. IEEE, 2009.

  • K. K. Reddy and M. Shah.

Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981, 2013.

  • C. Schuldt, I. Laptev, and B. Caputo.

Recognizing human actions: A local SVM approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.

  • K. Simonyan and A. Zisserman.

Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

  • K. Soomro, A. R. Zamir, and M. Shah.

UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

  • N. Srivastava, E. Mansimov, and R. Salakhutdinov.

Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2, 2015.

SLIDE 37

References IV

  • V. Veeriah, N. Zhuang, and G.-J. Qi.

Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 4041–4049, 2015.

  • H. Wang, A. Kläser, C. Schmid, and C.-L. Liu.

Action recognition by dense trajectories. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3169–3176. IEEE, 2011.

  • J. Wang and Y. Wu.

Learning maximum margin temporal warping for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2688–2695, 2013.

  • L. Wang, Y. Qiao, and X. Tang.

Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.

  • Y. Wang and G. Mori.

Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1310–1323, 2011.

SLIDE 38

References V

  • D. Weinland, E. Boyer, and R. Ronfard.

Action recognition from arbitrary views using 3d exemplars. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.

  • T. Xiang and S. Gong.

Beyond tracking: Modelling activity and understanding behaviour. International Journal of Computer Vision, 67(1):21–51, 2006.

  • B. Yao and S.-C. Zhu.

Learning deformable action templates from cluttered videos. In 2009 IEEE 12th International Conference on Computer Vision, pages 1507–1514. IEEE, 2009.