Learning Semantic Visual Codebook for Action Recognition by Embedding into Concept Space
Behrouz Saghafi
Using Spatio-temporal Features
- Action recognition using silhouettes or optical flow encounters difficulties with non-uniform backgrounds, severe camera jitter, and noise.
- Local spatio-temporal features are fast and easy to extract, and reliable.
Bag of Words model
- The raw features are clustered based on their appearance rather than their semantic relations. By utilizing the semantics, the recognition accuracy can be improved.
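To make the pipeline concrete, here is a minimal bag-of-words sketch: local spatio-temporal descriptors are quantized with k-means and each video becomes a histogram of visual words. This is an illustrative sketch, not the authors' implementation; descriptor extraction is assumed to happen elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=1000):
    """Cluster pooled local spatio-temporal descriptors into visual words.
    descriptors: (num_features, dim) array pooled from all training videos."""
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)

def video_histogram(kmeans, video_descriptors):
    """Represent one video as a normalized histogram over the visual words."""
    words = kmeans.predict(video_descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```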
Incorporating Semantics into BoW model (Related work)
Generative methods
- Build a model for each category and fit the query to one of the models in an unsupervised framework, like Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation.
- Their unsupervised nature limits their performance.
- The number of topics equals the number of categories, which limits their efficiency.
Discriminative methods
- Try to construct a semantic vocabulary and use it with a classifier.
- Liu and Shah (CVPR 2008): maximization of mutual information between visual words and videos → the formed clusters do not necessarily represent topics or synonymous words.
- Liu et al. (CVPR 2009): use Diffusion Maps (DM) to construct a semantic visual vocabulary → considering connectivity in measuring the semantic distance is not appropriate in the presence of polysemy.
Embedding into Concept Space (Proposed)
- We propose a framework for constructing a semantic visual vocabulary via computing a rich semantic space (concept space). The concept space is computed by latent semantic models or Canonical Correlation Analysis.
- The visual words are embedded into the concept space to form meaningful clusters representing semantic topics; consequently, the formed histograms are more discriminative.
- As opposed to generative methods, which do not use category labels, our method uses a classifier trained on the training histograms.
- The number of topics can exceed the number of categories, unlike in the unsupervised framework, which allows a more detailed analysis.
- By using pLSA to construct the concept space, the problem of polysemy is handled.
Overview of the proposed framework
[Figure: constructing the semantic visual vocabulary; training steps of the proposed method]
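The original figure is not recoverable; the following comment outline sketches the training steps as they can be inferred from the surrounding slides.

```python
# Training steps (hedged outline inferred from the slides):
# 1. Extract local spatio-temporal features from each training video.
# 2. Cluster all descriptors (k-means) to form the initial visual vocabulary.
# 3. Build the N x M word-video co-occurrence matrix.
# 4. Embed each visual word into the L-dimensional concept space
#    (via LSA, pLSA, or CCA).
# 5. Re-cluster the embedded words to obtain the semantic visual vocabulary.
# 6. Represent each video as a histogram over the semantic vocabulary.
# 7. Train a classifier on the training histograms.
```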
Latent Semantic Analysis (LSA) (1)
- Latent Semantic Analysis (LSA), originally used in text-mining applications, is the factorization of the word-video co-occurrence matrix into linear subspaces of words and videos.
- The word vectors reveal the semantic relations of words, since semantically synonymous words occur in similar documents.
Let $A$ be the $N \times M$ word-video co-occurrence matrix ($N$ words, $M$ videos): each row of $A$ is a word vector and each column is a video vector.
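A small sketch of building $A$, reusing the k-means vocabulary from the earlier sketch; entry $A_{ij} = n(w_i, d_j)$ counts occurrences of word $i$ in video $j$.

```python
import numpy as np

def cooccurrence_matrix(video_descriptor_list, kmeans):
    """Build the N x M word-video co-occurrence matrix A,
    A[i, j] = number of times visual word i occurs in video j."""
    A = np.zeros((kmeans.n_clusters, len(video_descriptor_list)))
    for j, descriptors in enumerate(video_descriptor_list):
        for w in kmeans.predict(descriptors):
            A[w, j] += 1
    return A
```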
Latent Semantic Analysis (LSA) (2)
- The word vectors are sparse, so their correlation may not be representative of their semantic relations. Therefore, we need to find a reduced-dimensional space. The rank-$L$ optimal representation is the truncated SVD:
$$A_{N \times M} \approx U_{N \times L}\, \Sigma_{L \times L}\, V^{T}_{L \times M}$$
(rows of $A$ index the $N$ words, columns the $M$ videos; $L$ is the number of topics).
- The correlation of words based on word vectors: $A A^{T} \approx (U\Sigma)(U\Sigma)^{T}$.
- Rows of $U\Sigma$ are a good representation of the rows of $A$ (words), in the sense that they approximate the correlation between words.
Embedding into concept space using LSA
$\hat{A} = U\Sigma \in \mathbb{R}^{N \times L}$
Row $i$ of $\hat{A}$ is the representation of word $i$ in the $L$-dimensional concept space.
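A minimal numpy sketch of this embedding, assuming the co-occurrence matrix A from the earlier sketch.

```python
import numpy as np

def lsa_word_embeddings(A, L):
    """Rank-L truncated SVD A ~ U_L S_L V_L^T; row i of U_L S_L is the
    representation of word i in the L-dimensional concept space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :L] * s[:L]  # shape (N, L)
```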
Probabilistic Latent Semantic Analysis (pLSA) (1)
$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$: the observed word distributions per document are decomposed into word distributions per topic, $P(w \mid z)$, and topic distributions per document, $P(z \mid d)$.
[Graphical model: $d \rightarrow z \rightarrow w$]
Probabilistic Latent Semantic Analysis (pLSA) (2)
In the model $d \rightarrow z \rightarrow w$, the co-occurrence counts $n(w, d)$ are known, while the topics $z$ and the distributions $P(w \mid z)$, $P(z \mid d)$ are unknown.
Likelihood:
$$\mathcal{L} = \prod_{d}\prod_{w} P(w, d)^{\,n(w, d)}, \qquad P(w, d) = P(d)\sum_{z} P(w \mid z)\, P(z \mid d)$$
Probabilistic Latent Semantic Analysis (pLSA) (3)
Maximum likelihood by EM:
E-step: $P(z \mid w, d) = \dfrac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}$
M-step: $P(w \mid z) \propto \sum_{d} n(w, d)\, P(z \mid w, d), \qquad P(z \mid d) \propto \sum_{w} n(w, d)\, P(z \mid w, d)$
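A compact (and deliberately memory-hungry) numpy sketch of these EM updates on the count matrix A; it is an illustration, not an optimized implementation.

```python
import numpy as np

def plsa(A, L, n_iter=100, seed=0):
    """Fit pLSA to the N x M word-video count matrix A by EM.
    Returns P(w|z) with shape (N, L) and P(z|d) with shape (L, M)."""
    rng = np.random.default_rng(seed)
    N, M = A.shape
    p_w_z = rng.random((N, L))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((L, M))
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|w,d), stored densely with shape (L, N, M)
        post = p_w_z.T[:, :, None] * p_z_d[:, None, :]        # P(w|z) P(z|d)
        post /= np.maximum(post.sum(axis=0, keepdims=True), 1e-12)
        # M-step: re-estimate both factors from the expected counts n(w,d) P(z|w,d)
        expected = A[None, :, :] * post                       # (L, N, M)
        p_w_z = expected.sum(axis=2).T                        # sum over d -> (N, L)
        p_w_z /= np.maximum(p_w_z.sum(axis=0, keepdims=True), 1e-12)
        p_z_d = expected.sum(axis=1)                          # sum over w -> (L, M)
        p_z_d /= np.maximum(p_z_d.sum(axis=0, keepdims=True), 1e-12)
    return p_w_z, p_z_d
```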
Embedding into concept space using pLSA
$\big(P(z_1 \mid w_i), \dots, P(z_L \mid w_i)\big)$: representation of word $i$ in the $L$-dimensional concept space.
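This representation can be recovered from the EM outputs above via Bayes' rule, $P(z \mid w) \propto P(w \mid z)\, P(z)$; the uniform prior over videos used to estimate $P(z)$ is an assumption of this sketch.

```python
def plsa_word_embeddings(p_w_z, p_z_d):
    """Row i: (P(z_1|w_i), ..., P(z_L|w_i)), the concept-space vector of word i."""
    p_z = p_z_d.mean(axis=1)               # P(z) under a uniform P(d) (assumption)
    unnorm = p_w_z * p_z[None, :]          # P(w|z) P(z), shape (N, L)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```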
Using LSA vs pLSA
- pLSA can handle polysemy.
– Polysemes are words that have more than one meaning.
Using LSA vs pLSA
- LSA can perform faster:

                                                         LSA        pLSA
  Mean training time (having the initial vocabulary)     62 sec     4261 sec
  Mean testing time (having learned the concept space)   0.54 sec   0.71 sec
Canonical Correlation Analysis (CCA)
- Given a pair of vector sets, CCA finds a direction for each set such that the projections of the vectors onto these directions have maximal correlation.
Canonical Correlation Analysis (CCA) (2)
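The body of this slide did not survive extraction; the standard CCA objective, consistent with the description above, is:

```latex
\rho = \max_{w_x, w_y}
\frac{w_x^{T} C_{xy} w_y}
     {\sqrt{w_x^{T} C_{xx} w_x}\,\sqrt{w_y^{T} C_{yy} w_y}}
```

where $C_{xx}$ and $C_{yy}$ are the within-set covariance matrices of the two vector sets and $C_{xy}$ is their between-set covariance.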
Embedding into concept space using CCA
The noisy raw feature representation is mapped to a semantic representation in which the noise covariance is reduced.
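A minimal sketch of this projection using scikit-learn's CCA. The slides do not specify the paired second view, so the pairing below (word co-occurrence vectors X against some second view Y of the same words) is an assumption of the sketch.

```python
from sklearn.cross_decomposition import CCA

def cca_word_embeddings(X, Y, L):
    """Project the first view into the L-dimensional shared (concept) space.

    X: (N, p) one row per visual word (e.g., its video co-occurrence vector).
    Y: (N, q) paired second view for the same words (assumed pairing).
    """
    cca = CCA(n_components=L)
    cca.fit(X, Y)
    X_c, _ = cca.transform(X, Y)
    return X_c  # (N, L): semantic representation of each word
```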
Constructing the semantic visual vocabulary using CCA
[Diagram: pipeline beginning with the local feature extractor]
Performance of the proposed method (latent semantic space) on the KTH dataset with different numbers of topics
[Plot: recognition accuracy vs. number of topics, for LSA and pLSA]