SLIDE 1
Efficient 2D Viewpoint Combination for Human Action Recognition
SLIDE 2 Multi-view Action Recognition
- Video captures a 2-dimensional projection, while actions truly occur in
3-dimensional world space.
- The subject may be occluded by an object or by itself (self-occlusion).
- Researchers have therefore used multiple cameras to obtain a 3D
representation of the subject (a visual hull).
SLIDE 3 Drawbacks of Visual Hulls
- A sufficient number of views is required to build a
reliable visual hull.
- Carving a visual hull loses information: the visual
hull is only an approximation of the true 3D model.
SLIDE 4 Proposed method (1)
- We propose to extract features from each viewpoint separately
and combine them efficiently, so that useful information is reinforced and redundant features are attenuated.
- We extract local features from each view; these are easy to compute
and require no segmentation.
- Unlike Peng and Qian, who used HMMs, we use a simple
BoW model, which is orderless, much easier to train and test, and usable with classifiers such as SVM.
- Instead of extracting many heterogeneous features, we focus on
computing different models with different codebooks and kernel functions, and combining them efficiently.
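As a concrete illustration of the orderless BoW model described above, the sketch below quantizes a video's local feature descriptors against a codebook and builds a normalized word histogram. This is a minimal sketch, not the authors' code; the function name and toy data are ours.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors to their nearest codeword and return
    an L1-normalized bag-of-words histogram (frame order is discarded)."""
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # nearest codeword index per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: 50 random 9-D descriptors against a 4-word codebook.
rng = np.random.default_rng(0)
desc = rng.normal(size=(50, 9))
cb = rng.normal(size=(4, 9))
h = bow_histogram(desc, cb)
```

Because the histogram is orderless, the same representation works for videos of any length, and it plugs directly into kernel classifiers such as SVM.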
SLIDE 5 Proposed method (2)
- Multi-class recognition uses a 1-vs-1 scheme rather than
1-vs-all, for better precision and the ability to add a category without retraining the whole system.
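The 1-vs-1 scheme above can be sketched as follows: one binary SVM per pair of classes, with prediction by majority vote. This is an illustrative sketch (function names and toy data are ours), not the paper's implementation.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, y):
    """Train one binary SVM per pair of classes: n*(n-1)/2 models.
    Adding a new category only requires training models for the new
    pairs, not retraining the whole system."""
    models = {}
    for a, b in itertools.combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="linear").fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, x):
    """Majority vote over all pairwise classifiers."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models.values()]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[counts.argmax()]

# Toy data: three well-separated 2-D classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2))
               for c in [(0, 0), (5, 5), (10, 0)]])
y = np.repeat([0, 1, 2], 20)
models = train_one_vs_one(X, y)
```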
- We model the same video with different histograms obtained from
two local features and two vocabularies.
- The distance between histograms is measured using HIK
(Histogram Intersection Kernel) as well as an RBF (Radial Basis Function) kernel on the Chi-square distance.
- We use an efficient interleaved optimization strategy to learn the
optimum weights for the multiple kernels. The learned weights score each kernel by its ability to discriminate between two categories.
SLIDE 6
Some viewpoints are more discriminative than others for particular pairs of actions
SLIDE 7
SLIDE 8 Feature Types
- Apply a Gaussian filter to the spatial domain and
a quadratic pair of Gabor filters to the temporal
- dimension. Proposed by Dollar et al.
Separable Linear Filters
- Extension of the Harris corner detector proposed
by Laptev and Lindeberg.
Space- Time Corner Detector
SLIDE 9 Codebook sizes
- After extracting features, we use them to obtain a codebook for
each view by applying k-means using Euclidean distance.
- We use two codebooks, one of size V and the other 2V.
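The two-vocabulary construction above can be sketched with Euclidean k-means; the function name and toy data below are ours, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebooks(descriptors, V, seed=0):
    """Cluster the pooled local descriptors of one view with Euclidean
    k-means, producing a coarse vocabulary of size V and a finer one of
    size 2V (two scales of detail for the same action)."""
    small = KMeans(n_clusters=V, n_init=10, random_state=seed).fit(descriptors)
    large = KMeans(n_clusters=2 * V, n_init=10, random_state=seed).fit(descriptors)
    return small.cluster_centers_, large.cluster_centers_

# Toy example: 400 random 16-D descriptors, V = 8.
rng = np.random.default_rng(0)
descs = rng.normal(size=(400, 16))
cb_small, cb_large = build_codebooks(descs, V=8)
```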
- According to Gehler and Nowozin, adding any kernel, even an
uninformative, non-discriminative one, to the kernel-weight
optimization will not reduce classification performance. In
particular, when the added feature (kernel) is discriminative,
classification performance increases.
- Using two different vocabulary sizes lets us model the
actions at two different scales of detail.
SLIDE 10
Kernel Types
Histogram Intersection Kernel (HIK)
Radial Basis Function (RBF) Kernel
with Chi-Square Distance
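The two kernel types can be written directly from their definitions; this is a minimal sketch with our own function names, using a small epsilon to avoid division by zero in empty bins.

```python
import numpy as np

def hik(h1, h2):
    """Histogram Intersection Kernel: sum of bin-wise minima.
    Equals 1 for identical L1-normalized histograms."""
    return float(np.minimum(h1, h2).sum())

def chi2_rbf(h1, h2, gamma=1.0, eps=1e-10):
    """RBF kernel on the Chi-square distance between histograms."""
    chi2 = float(((h1 - h2) ** 2 / (h1 + h2 + eps)).sum())
    return float(np.exp(-gamma * chi2))

# Two L1-normalized 3-bin histograms.
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.4, 0.4, 0.2])
```

Both kernels are bounded and similarity-like: identical histograms score 1, and the score decreases as the histograms diverge.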
SLIDE 11 Learning an efficient combination of kernels (1)
- The HIK and RBF kernels computed from the different histograms must be
combined efficiently to obtain an optimized final kernel.
- The final kernel is used with an SVM to classify the actions. The
binary SVM classifier takes the standard multiple-kernel form f(x) = sign( sum_m beta_m sum_i alpha_i y_i k_m(x_i, x) + b ).
- beta_m is the kernel weight, which scales the influence of
the kernel space associated with k_m and, subsequently, of the corresponding histogram space.
SLIDE 12 Learning an efficient combination of kernels (2)
- We use a 1-vs-1 classification scheme and choose different weights
for each binary classifier, since a histogram (feature space) may be discriminative for one pair of classes yet uninformative for another.
- For every combination of local feature and codebook size, we
incorporate one HIK kernel and four RBF kernels with different bandwidths.
- Experimental results show that sparse methods do not perform
much better than baseline methods using average weights. We therefore use a non-sparse general Lp-norm multiple kernel learning algorithm in which no feature is removed: all features participate, with different contributions. The value of p is selected empirically. Newton descent is used for optimization, since it is faster than cutting planes.
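For intuition on the non-sparse Lp-norm weighting, the closed-form weight update commonly written in the Lp-norm MKL literature (Kloft et al.) maps the per-kernel SVM weight-vector norms to kernel weights with unit Lp-norm. This is an illustrative sketch under that formulation; the variable names are ours.

```python
import numpy as np

def lp_mkl_weights(w_norms, p=2.0):
    """Closed-form Lp-norm MKL weight update: given per-kernel
    weight-vector norms ||w_m||, return kernel weights beta with
    ||beta||_p = 1.  No weight is driven exactly to zero, so every
    kernel keeps contributing (non-sparse), unlike L1-norm MKL."""
    w = np.asarray(w_norms, dtype=float)
    num = w ** (2.0 / (p + 1.0))
    den = (w ** (2.0 * p / (p + 1.0))).sum() ** (1.0 / p)
    return num / den

# Three kernels with decreasing usefulness: all get nonzero weight.
beta = lp_mkl_weights([1.0, 0.5, 0.1], p=2.0)
```

Kernels with larger ||w_m|| (more discriminative for the pair of classes at hand) receive larger weights, but even the weakest kernel retains a small positive contribution.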
SLIDE 13 MKL (sparse and non-sparse)
- Lp-norm refers to the norm used by the regularizer of the
learning objective.
SLIDE 14
IXMAS dataset
11 actions, 10 subjects, 5 views.
SLIDE 15
Views in IXMAS dataset
SLIDE 16
Confusion matrix for the best result achieved on IXMAS
Recognition accuracy: 95.8%
SLIDE 17
Accuracy for each view (camera) of IXMAS
SLIDE 18
Best accuracy for combination of views in IXMAS
SLIDE 19
Performance of each feature type
SLIDE 20
Performance of using more codebooks
SLIDE 21
Performance of each kernel type and combination of them
SLIDE 22
Comparison of different fusion methods
SLIDE 23
Comparison of Recognition Accuracy on IXMAS dataset
Methods are grouped as multi-view, single-view, and visual-hull approaches.