Efficient 2D Viewpoint Combination for Human Action Recognition - PowerPoint PPT Presentation



SLIDE 1

Efficient 2D Viewpoint Combination for Human Action Recognition
 


SLIDE 2

Multi-view Action Recognition

  • Video captures a 2-dimensional projection, while actions truly occur in 3-dimensional world space.
  • The subject may be occluded by an object or by itself (self-occlusions).
  • Researchers have used multiple cameras to obtain a 3D representation of the subject (visual hull).

SLIDE 3

Drawbacks of Visual Hulls

  • A sufficient number of views is required to build a reliable visual hull.
  • In carving a visual hull, some information is lost; the visual hull is an approximation of the true 3D model.

SLIDE 4

Proposed method (1)

  • We propose to extract features from each viewpoint separately and combine them efficiently, such that useful information is reinforced and redundant features are attenuated.
  • We extract local features from each view; these are easy to extract and do not require segmentation.
  • As opposed to Peng and Qian, who used HMMs, we use a simple BOW model, which is orderless, much easier to train/test, and can be used with classifiers such as SVM.
  • Instead of extracting many heterogeneous features, we focus on computing different models using different codebooks and kernel functions and combining them efficiently.
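As an illustration of the orderless BOW model mentioned above, here is a minimal sketch of how local descriptors can be quantized into a histogram (NumPy; the function name and the L1 normalization are our assumptions, not taken from the slides):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local feature descriptors against a codebook and
    return an L1-normalized bag-of-words histogram.

    descriptors: (n, d) array of local features from one video/view
    codebook:    (V, d) array of k-means cluster centers
    """
    # Euclidean distance from every descriptor to every codeword
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)  # nearest codeword per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # orderless representation
```

Because the histogram discards all spatial and temporal ordering of the features, it can be fed directly to kernel classifiers such as SVM.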

SLIDE 5

Proposed method (2)

  • Multi-class recognition is done using a 1-vs-1 scheme instead of 1-vs-all, to achieve more precision and the ability to add a category without the need to re-train the whole system.
  • We model the same video with different histograms obtained from two local features and two vocabularies.
  • The distances between histograms are measured using HIK (Histogram Intersection Kernel) as well as an RBF (Radial Basis Function) kernel with Chi-square distance.
  • We use an efficient interleaved optimization strategy to learn the optimum weights for the multiple kernels. The obtained optimum weights score each kernel based on its ability to discriminate between two different categories.
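The 1-vs-1 scheme above can be sketched as follows: with n classes it trains n(n-1)/2 binary classifiers, and adding a new class only requires n new classifiers while leaving the existing ones untouched. The function names and the simple majority-vote tie-breaking are our assumptions:

```python
from itertools import combinations
from collections import Counter

def one_vs_one_predict(classes, pairwise_predict, x):
    """Majority-vote 1-vs-1 multi-class prediction.

    classes:          list of class labels (e.g. the 11 IXMAS actions)
    pairwise_predict: callable (a, b, x) -> a or b, the binary classifier
                      trained for the pair (a, b)
    x:                the sample to classify
    """
    votes = Counter(pairwise_predict(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]  # label with the most pairwise wins
```

Each pairwise classifier can carry its own kernel weights, which is what allows a histogram to count heavily for one pair of actions and be down-weighted for another.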

SLIDE 6

Some viewpoints are more discriminative between some pairs of actions

SLIDE 7
SLIDE 8

Feature Types

Separable Linear Filters
  • Apply a Gaussian filter to the spatial domain and a quadrature pair of Gabor filters to the temporal dimension. Proposed by Dollar et al.

Space-Time Corner Detector
  • Extension of the Harris corner detector, proposed by Laptev and Lindeberg.
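A rough sketch of the separable-linear-filter response of Dollar et al., assuming the usual formulation (spatial Gaussian, quadrature pair of 1D temporal Gabor filters, with the coupling omega = 4/tau); the parameter defaults and helper names here are illustrative, not taken from the slides:

```python
import numpy as np

def _smooth(arr, kernel, axis):
    # 1D convolution of every line of `arr` along `axis`
    return np.apply_along_axis(np.convolve, axis, arr, kernel, mode='same')

def dollar_response(video, sigma=2.0, tau=1.5):
    """Response of the separable-linear-filter detector: a spatial
    Gaussian followed by a quadrature pair of temporal Gabor filters.
    Interest points are taken at local maxima of R.

    video: (T, H, W) array of grayscale frames.
    R = (video * g * h_ev)^2 + (video * g * h_od)^2
    """
    video = np.asarray(video, dtype=float)
    # spatial Gaussian (separable: once along H, once along W)
    r = int(3 * sigma)
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    smoothed = _smooth(_smooth(video, g, 1), g, 2)
    # quadrature pair of temporal Gabor filters
    omega = 4.0 / tau
    t = np.arange(-int(2 * tau), int(2 * tau) + 1, dtype=float)
    env = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2 * np.pi * t * omega) * env  # even component
    h_od = -np.sin(2 * np.pi * t * omega) * env  # odd component
    ev = _smooth(smoothed, h_ev, 0)
    od = _smooth(smoothed, h_od, 0)
    return ev**2 + od**2
```

The squared quadrature pair makes the response insensitive to the phase of periodic motion, which is why this detector fires densely on repetitive actions.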

SLIDE 9

Codebook sizes

  • After extracting features, we use them to obtain a codebook for each view by applying k-means with the Euclidean distance.
  • We use two codebooks, one of size V and the other of size 2V.
  • According to Gehler and Nowozin, adding any kernel, even an uninformative and non-discriminative one, to the kernel-weight optimization will not reduce the classification performance. In particular, when the added feature (kernel) is discriminative, the classification performance will increase.
  • Using two different vocabulary sizes enables us to model the actions at two different scales of detail.
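A minimal codebook construction in the spirit of this slide (Lloyd's k-means under Euclidean distance); the farthest-point initialization is our choice for determinism, not something stated in the slides:

```python
import numpy as np

def build_codebook(features, V, iters=20):
    """Build a size-V codebook from pooled local features with Lloyd's
    k-means under Euclidean distance."""
    features = np.asarray(features, dtype=float)
    # farthest-point initialization
    centers = [features[0]]
    for _ in range(1, V):
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(features[d.argmax()])
    centers = np.array(centers)
    # Lloyd iterations: assign, then re-center
    for _ in range(iters):
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(V):
            pts = features[assign == j]
            if len(pts):  # keep the old center if a cluster empties
                centers[j] = pts.mean(axis=0)
    return centers

# Two vocabularies at different scales of detail, per view:
# coarse = build_codebook(feats, V); fine = build_codebook(feats, 2 * V)
```

The coarse codebook yields more robust, heavily-shared words, while the size-2V codebook captures finer distinctions; both histograms enter the kernel combination.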

SLIDE 10

Kernel Types

Histogram Intersection Kernel (HIK)

Radial Basis Function (RBF) Kernel with Chi-Square Distance
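The formulas on this slide are not preserved in the transcript; for L1-normalized histograms they take the standard forms below (the gamma parameter and the small epsilon guard against division by zero are our choices):

```python
import numpy as np

def hik(h1, h2):
    """Histogram Intersection Kernel: sum of elementwise minima."""
    return np.minimum(h1, h2).sum()

def chi2_rbf(h1, h2, gamma=1.0, eps=1e-10):
    """RBF kernel with the Chi-square distance:
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    d = ((h1 - h2) ** 2 / (h1 + h2 + eps)).sum()
    return np.exp(-gamma * d)
```

For identical normalized histograms both kernels evaluate to 1; the Chi-square distance down-weights differences in heavily-populated bins, which suits bag-of-words histograms.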

SLIDE 11

Learning an efficient combination of kernels (1)

  • The HIK and RBF kernels from the different histograms need to be combined in an efficient way to obtain an optimized final kernel.
  • The final kernel is used with an SVM to classify the actions; the binary SVM classifier takes the weighted multiple-kernel form shown on the slide.
  • Each kernel weight changes (scales) the influence of its associated kernel space and, subsequently, the corresponding histogram space.
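The formula on this slide did not survive the transcript. A standard weighted multiple-kernel binary SVM decision function of the kind described (the symbols $\beta_m$ for the kernel weights and $k_m$ for the individual HIK/RBF kernels are our notation, not taken from the slide) is:

```latex
f(\mathbf{x}) \;=\; \operatorname{sign}\!\left(
    \sum_{m=1}^{M} \beta_m \sum_{i=1}^{N} \alpha_i\, y_i\,
    k_m(\mathbf{x}_i, \mathbf{x}) \;+\; b
\right), \qquad \beta_m \ge 0 .
```

Here $\beta_m$ scales the influence of kernel $k_m$ and hence of its histogram space; the interleaved optimization learns the $\beta_m$ jointly with the SVM parameters $\alpha$ and $b$.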

SLIDE 12

Learning an efficient combination of kernels (2)

  • We use a 1-vs-1 classification scheme and choose different weights for each binary classifier, since some of the histograms (feature spaces) may be discriminative in differentiating between one pair of classes but non-informative for another pair.
  • For every combination of local feature and codebook size, we incorporate only one instance of the HIK kernel and four instances of the RBF kernel with different bandwidths.
  • Experimental results show that sparse methods do not perform much better than baseline methods using average weights; we therefore use a non-sparse, general Lp-norm multiple kernel learning algorithm in which no feature is removed: all features participate, with different contributions. The norm is selected empirically. Newton descent is used for optimization due to its faster performance compared to cutting planes.
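One common closed-form weight update used in non-sparse Lp-norm MKL (following Kloft et al.; whether the slides' interleaved strategy uses exactly this update is our assumption) rescales per-kernel margin norms onto the unit Lp ball, so every kernel keeps a strictly positive weight:

```python
import numpy as np

def lp_mkl_weights(kernel_norms, p=2.0):
    """Closed-form beta update for non-sparse Lp-norm MKL:
    beta_m proportional to ||w_m||^(2/(p+1)), rescaled to unit Lp-norm.

    kernel_norms: per-kernel margin norms ||w_m|| from the current
                  SVM solution (larger norm = more discriminative).
    """
    beta = np.asarray(kernel_norms, dtype=float) ** (2.0 / (p + 1.0))
    return beta / np.linalg.norm(beta, ord=p)
```

Interleaving this update with SVM training on the combined kernel sum(beta_m * K_m) is what the slides call the interleaved optimization: discriminative kernels are up-weighted, but none is zeroed out.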

SLIDE 13

MKL (sparse and non-sparse)

  • The Lp-norm refers to the norm used by the regularizer of the learning objective function.

SLIDE 14

IXMAS dataset

11 actions, 10 subjects, 5 views.

SLIDE 15

Views in IXMAS dataset

SLIDE 16

Confusion matrix for the best result achieved on IXMAS

Recognition accuracy: 95.8%

SLIDE 17

Accuracy for each view (camera) of IXMAS

SLIDE 18

Best accuracy for combination of views in IXMAS

SLIDE 19

Performance of each feature type

SLIDE 20

Performance of using more codebooks

SLIDE 21

Performance of each kernel type and combination of them

SLIDE 22

Comparison of different fusion methods

SLIDE 23

Comparison of Recognition Accuracy on IXMAS dataset

Chart legend: Multi-view, Single view, Visual Hull.