Dictionaries, Manifolds and Domain Adaptation for Image and Video-based Recognition
Rama Chellappa, University of Maryland
Student and the teachers
Major points
- Training and testing data come from different distributions.
– Distributions are complex due to variations in patterns
– Domain adaptation
- Robust representations and distance measures
– Vector space vs manifolds
– Euclidean vs geodesics
- Will develop these points for two representations of images and videos.
– Dictionaries
– Manifolds
Outline of the talk
- Dictionaries
– Learning and applications to image and video-based recognition.
- Manifolds
– Representation, inference and applications to image and video-based recognition.
– Analytical and empirical
- Domain adaptation
– How to adapt representations to new domains
– Domain shifts could be due to pose, illumination, rate, time lapse, views, ...
– Semi-supervised, unsupervised
- Relies on works of Prof. Amari and Chikuse.
Motivation - 1
Motivation – 2
- Task: Given a probe video of one or more subjects, retrieve their IDs from a gallery of still face images or face videos.
- Challenges: getting a face image is more than half the problem
– Low resolution
– Blur
– Pose variation
– Uncontrolled illumination
– Camera motion
Dictionaries for signal and image analysis
- Matching pursuit algorithms (Mallat, early 1990s)
- Orthogonal matching pursuit (Pati et al., 1993; Tropp, 2004)
- Saito and Coifman, 1997
- Etemad and Chellappa, 1997
- Represent signals using wavelets, wavelet packets, ...
- Learning dictionaries from data instead of using off-the-shelf bases (Olshausen and Field, 1997), ...
Modern day dictionaries
- Represent signals and images using signals and images.
- Sparse coding has neural underpinnings.
- Allows compositional representations.
- Dictionary updates
– Batch (Method of Optimal Directions)
– K-SVD
- Dictionaries for images are more complicated
– Need to account for pose, illumination and resolution variations.
Basic formulation
- Assume L classes and n images per class in the gallery.
- The training images of the k-th class are represented as X_k = [x_k1, ..., x_kn].
- The dictionary D is obtained by concatenating all the training images: D = [X_1, ..., X_L].
- The unknown test vector y can be represented as a linear combination of the training images as y = Dα.
- The coefficient vector α is sparse.
Wright et al., 2009; Wagner et al., 2011
Dictionary-based face recognition
Represent the test image using the coefficients of each class separately and compute the reconstruction error; select the class giving the minimum reconstruction error.
α can be recovered by Basis Pursuit as α̂ = argmin ||α||_1 subject to ||y − Dα||_2 ≤ ε.
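This classification rule can be sketched as follows. A minimal sketch with synthetic data: scikit-learn's `Lasso` is used as an l1-penalized stand-in for Basis Pursuit, and the class layout (3 classes, 5 images each) is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
L, n, d = 3, 5, 20                       # classes, images per class, feature dim
D = rng.normal(size=(d, L * n))          # columns: vectorized training images
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms

y = D[:, 7] + 0.01 * rng.normal(size=d)  # test image drawn from class 1 (columns 5-9)

# l1-regularized recovery of the sparse coefficient vector (Basis Pursuit proxy)
alpha = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50000).fit(D, y).coef_

# reconstruct with each class's coefficients separately; pick the smallest residual
residuals = [np.linalg.norm(y - D[:, k*n:(k+1)*n] @ alpha[k*n:(k+1)*n])
             for k in range(L)]
pred = int(np.argmin(residuals))
```

The per-class residual uses only that class's block of coefficients, so the class whose training images explain the test image best wins.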
Learning dictionaries – K-SVD
(Figure: training faces → K-SVD → learned dictionary)
- M. Aharon, M. Elad, and A. M. Bruckstein, 2006
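K-SVD itself is not available in scikit-learn, but the same learn-a-dictionary-from-data idea can be sketched with `MiniBatchDictionaryLearning`, which also alternates sparse coding and dictionary updates (using block-coordinate updates rather than K-SVD's per-atom SVD). The data here are random stand-ins for vectorized face patches.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
faces = rng.normal(size=(100, 64))   # 100 hypothetical vectorized training patches

# Alternates sparse coding (OMP, at most 3 atoms per sample) and dictionary updates
learner = MiniBatchDictionaryLearning(n_components=16,
                                      transform_algorithm="omp",
                                      transform_n_nonzero_coefs=3,
                                      random_state=0)
codes = learner.fit_transform(faces)   # sparse codes, shape (100, 16)
D = learner.components_                # learned dictionary atoms, shape (16, 64)
```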
Outlier rejection
The illumination problem
- Robust albedo estimation (Biswas et al., PAMI 2009)
– Estimate albedo
– Relight images with different light source directions
– Use relighted images for training
Robust estimation of albedo
(Figure: light source + albedo + surface normals → intensity image; from a single intensity image, recover albedo + shape. Biswas et al., ICCV 2007, PAMI 2009)
Inverse problem
Albedo estimation
- Lambertian assumption: intensity I = ρ (n · s), with albedo ρ, surface normal n, and light source direction s.
- With the estimated light source ŝ and an initial surface normal n̂, an initial albedo estimate is obtained; the error in this initial estimate is treated as noise.
Albedo estimation
- The initial albedo estimate is modeled as the true albedo corrupted by signal-dependent additive noise.
- A non-stationary mean, non-stationary variance (NMNV) model is assumed for the true unknown albedo, with an unbiased source assumption and uncorrelated noise.
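As a rough illustration of this kind of correction, here is a minimal LMMSE-style sketch under an NMNV-like model: each pixel is shrunk toward its local mean in proportion to the local signal variance. The synthetic albedo, the 7×7 window, the noise level, and the `uniform_filter` smoother are all assumptions for illustration, not the exact estimator of Biswas et al.

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)
rho = np.clip(rng.normal(0.6, 0.15, (64, 64)), 0.0, 1.0)  # true albedo (unknown)
rho0 = rho + 0.1 * rng.normal(size=rho.shape)             # noisy initial estimate

noise_var = 0.1 ** 2
mu = uniform_filter(rho0, size=7)                  # local (non-stationary) mean
var = uniform_filter(rho0 ** 2, size=7) - mu ** 2  # local variance of the estimate
sig_var = np.maximum(var - noise_var, 0.0)         # signal part of the variance

# LMMSE correction: shrink toward the local mean where the signal variance is low
rho_hat = mu + sig_var / (sig_var + noise_var) * (rho0 - mu)

err0 = np.mean((rho0 - rho) ** 2)   # MSE of the raw initial estimate
err1 = np.mean((rho_hat - rho) ** 2)  # MSE after LMMSE correction
```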
Estimated albedo – PIE dataset
Relighting using the estimated albedo
Experimental results
- DFR – 99.17 %
- SRC – 98.1 %
- CDPCA – 98.83 %
- V. M. Patel, T. Wu, S. Biswas, P. J. Phillips, and R. Chellappa, "Dictionary-based face recognition under variable lighting and pose," IEEE Trans. Information Forensics and Security, 2011.
Yale B data set
Outdoor face dataset
- An outdoor dataset with 18 subjects, 5 gallery images each, and 90 low-resolution probe images.
Method         Recognition
SLRFR          67%
Reg. LDA+SVM   60%
CLPM           16.1%
(Gallery: 120 × 120; probe: 20 × 20)
BTAS 2011
Video dictionaries for face recognition
Pipeline (ECCV 2012):
- Preprocessing: extract frames and detect/crop face regions
- Use a summarization algorithm [1] to partition the cropped face images
- Dictionary learning for each partition; find sequence-specific dictionaries
- Construct distance/similarity matrices
- Recognition / verification
[1] N. Shroff, P. Turaga, and R. Chellappa, "Video précis: Highlighting diverse aspects of videos," IEEE Transactions on Multimedia, 2010; NIPS 2011.
Dictionary learning
(build sequence-specific dictionaries)
- Let Y_{i,j,k} be the gallery matrix of the k-th partition of the j-th video sequence of subject i.
- Given Y_{i,j,k}, use the K-SVD [2] algorithm to build a (partition-level) sub-dictionary D_{i,j,k} such that Y_{i,j,k} ≈ D_{i,j,k} Γ, with Γ sparse.
- Concatenate the (partition-level) sub-dictionaries to form a sequence-specific dictionary D_{i,j} = [D_{i,j,1}, ..., D_{i,j,K}].
[2] M. Aharon, M. Elad, and A. M. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006.
Recognition/Identification
- Given the m-th query video sequence, generate its partitions.
- The distance d(m, p) between the query and D_p (i.e., the dictionary of the p-th video sequence) is calculated as the reconstruction residual of the query partitions under D_p.
- Select the best match as the sequence p minimizing d(m, p).
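The matching rule above can be sketched as follows with synthetic data: each gallery video is represented by a sequence-specific dictionary, and the query is assigned to the dictionary with the smallest reconstruction residual. Plain least-squares coding stands in here for the sparse coding used in the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)
d, atoms = 30, 8

# sequence-specific dictionaries, one per gallery video (random stand-ins)
gallery = [rng.normal(size=(d, atoms)) for _ in range(4)]

# query frames drawn near gallery sequence 2's column space, plus small noise
Y = gallery[2] @ rng.normal(size=(atoms, 10)) + 0.01 * rng.normal(size=(d, 10))

def residual(D, Y):
    # distance = reconstruction error of the query frames under dictionary D
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return float(np.linalg.norm(Y - D @ coef))

dists = [residual(D, Y) for D in gallery]
match = int(np.argmin(dists))
```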
MBGC recognition results
MBGC dataset:
- 397 walking (frontal-face) videos: 198 SD + 199 HD
- 371 activity (profile-face) videos: 185 SD + 186 HD
- Facial expression analysis using AUs and high-level knowledge available in FACS regarding AU composition and expression decomposition.
- AUs have ambiguous semantic descriptions, so it is difficult to model them accurately.
AU-Dictionary
– We use local features to model each AU.
- We learn separate dictionaries for each AU.
- The AU-dictionary is then formed using all the individual AU dictionaries:
  D = [D_AU-1 | D_AU-2 | D_AU-5 | D_AU-10 | D_AU-12 | D_AU-23]
- Objective function to be minimized:
- Goal:
– To simultaneously learn structures on the expressive face and corresponding subspace representations
– We want the final subspaces to be as separated as possible
- Objective:
– Structures: disjoint subsets of local patch descriptors
– Learned dictionaries for the structures
- Learned structures for the universal expressions from the CK+ dataset
- Minimum residual error
Some additional results
- Competitive results for iris recognition; enables cancelability (PAMI 2011).
- Non-linear dictionaries through kernelization produce improvements of 5-10% depending on the problem (ICASSP 2012).
– Illustrated using the USPS, Caltech 101 and Caltech 256 datasets.
- Building dictionaries in the Radon transform domain yields robustness to in-plane rotation and scale in CBIR applications (IEEE TIP).
- Characteristic views (Chakravarty and Freeman) can be built using sparse representation theory (ICIP 2012).
- Joint-sparsity-driven dictionary learning produces improvements in multi-modal biometrics applications (under review).
- Reconstruction from sparse gradients (IEEE TIP 2012), in collaboration with Anna Gilbert.
Domain adaptation: Motivation
Image credit: Saenko et al., ECCV 2010, Bergamo et al., NIPS 2010
Transfer learning [1]: P(Y|X) ≠ P(Y'|X'), P(X) ≈ P(X')
Domain adaptation: P(X) ≠ P(X'), P(Y|X) ≈ P(Y'|X')
Source domain (data X, labels Y); target domain (data X', labels Y')
[1] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowledge and Data Engineering, 22:1345-1359, October 2010.
Domain adaptation - Related work
Semi-supervised
- Learns the domain change through correspondences
– Daumé and Marcu, JAIR 2006
– Duan et al., ICML 2009
– Xing et al., KDD 2007
– Saenko et al., ECCV 2010; Kulis et al., CVPR 2011
– Bergamo and Torresani, NIPS 2010
– Lai and Fox, IJRR 2010
Unsupervised
- No correspondence, no knowledge of the domain change
– Ben-David et al., AISTATS 2010
– Blitzer et al., NIPS 2008
– Wang and Mahadevan, IJCAI 2009
– Gopalan, Li and Chellappa, ICCV 2011
– Gong et al., CVPR 2012
– Zheng and Chellappa, ICPR 2012
– D. Xu's group, 2012
Unsupervised domain adaptation*
(Figure: the labeled source domain X (domain 1) and the unlabeled target domain X~ (domain 2) yield generative subspaces S1 and S2 on the Grassmannian G_{N,d}; intermediate subspaces S1.3, S1.6, ... are sampled along the path between them for incremental learning.)
*R. Gopalan, R. Li, R. Chellappa, "Domain adaptation for object recognition: An unsupervised approach," International Conference on Computer Vision, ICCV 2011 (Oral).
Domain adaptation of dictionaries
- Assume there exist K intermediate domains {S_k}_{k=1}^K which smoothly bridge the information gap between the source and target domains. A domain-dependent dictionary D_k is learned for each intermediate domain S_k.
- We learn intermediate data to approximate the observations in the corresponding intermediate domains. The intermediate data are then utilized to build classifiers.
- Intermediate domains can be derived as solutions to an optimization problem on a Grassmannian (ongoing work).
(Figure: the top half shows intermediate images synthesized from a given source image in frontal view (red box); the bottom half shows intermediate images generated from a given target image in side view (green box).)
Learn the intermediate domains
- The reconstruction residue of the target data decomposed with the source-domain dictionary is utilized as an estimate of the information gap between the two domains.
- The dictionaries for the intermediate domains are learned sequentially until the information gap falls below a threshold.
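This stopping rule can be sketched as follows with synthetic data. The SVD-based blend from the source dictionary toward a target-fitted basis is an illustrative stand-in for the paper's actual dictionary-update step; `Ds`, `B`, `Ut` and the threshold `tau` are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, atoms, n = 20, 10, 50
Ds = np.linalg.qr(rng.normal(size=(d, atoms)))[0]    # source-domain dictionary
B = np.linalg.qr(rng.normal(size=(d, atoms)))[0]     # target subspace (unknown)
Xt = B @ rng.normal(size=(atoms, n)) + 0.001 * rng.normal(size=(d, n))

def residue(D, X):
    # relative reconstruction residue of the target data under dictionary D
    C, *_ = np.linalg.lstsq(D, X, rcond=None)
    return float(np.linalg.norm(X - D @ C) / np.linalg.norm(X))

# target-fitted basis (illustrative stand-in for a learned update direction)
Ut = np.linalg.svd(Xt, full_matrices=False)[0][:, :atoms]

# learn intermediate dictionaries sequentially until the residue drops below tau
dicts, tau, K = [Ds], 0.05, 10
for k in range(1, K + 1):
    if residue(dicts[-1], Xt) <= tau:
        break                                        # information gap closed
    Dk = np.linalg.qr((1 - k / K) * Ds + (k / K) * Ut)[0]
    dicts.append(Dk)
```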
Learn the intermediate data
- The intermediate data are generated along the transition path formed by the learned intermediate domains.
– Generate intermediate data from the source data
– Generate intermediate data from the target data
- DA Classifier Invariant Codes (DAC-IC)
– Sparse codes are demonstrated to be invariant across the source, the target, and the intermediate domains.
- DA Classifier Transition Path (DAC-TP)
– The intermediate data are exploited to define the distance between the labeled source data and the target data.
- Extension to semi-supervised adaptation (with a small amount of labels in the target domain)
– Given unlabeled target data, compute its distance to both the labeled source data and the labeled target data.
Recognition under domain shift
- Distance between a source sample x_s,i and a target sample x_t,i:
  d = Σ_{k=0}^{K+1} || x_s,i^(k) − x_t,i^(k) ||_2^2
  where x^(k) denotes the sample represented in the k-th domain (k = 0: source; k = 1, ..., K: intermediate; k = K+1: target).
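The per-sample distance under domain shift, i.e. the sum over all K+2 subspaces of the squared difference between projections, can be sketched as follows. The subspaces and samples are random stand-ins, and `xt_far` is a hypothetical mismatched sample added only to show the distance behaves as expected.

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_sub, K = 50, 5, 4

# K+2 subspaces: source, K intermediate, target (random orthobases here)
subspaces = [np.linalg.qr(rng.normal(size=(d_feat, d_sub)))[0]
             for _ in range(K + 2)]

xs = rng.normal(size=d_feat)              # a labeled source sample
xt = xs + 0.1 * rng.normal(size=d_feat)   # a nearby target sample
xt_far = rng.normal(size=d_feat)          # a hypothetical mismatched sample

def shift_distance(a, b):
    # d = sum_k || S_k^T a - S_k^T b ||_2^2 over all K+2 subspaces
    return float(sum(np.sum((S.T @ a - S.T @ b) ** 2) for S in subspaces))

dist = shift_distance(xs, xt)
dist_far = shift_distance(xs, xt_far)
```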
Experiments on face recognition
- FR across pose variation on PIE (source: frontal images; target: non-frontal images)
- FR across blur and illumination variation on PIE (source: sharp images illuminated by one set of light sources; target: blurred images illuminated by a different set of light sources)
Experiments on object recognition
(Figure: comparison of the top five matches of our method (smaller images in the top row) and Grassmann-manifold-based DA (smaller images in the bottom row).)
Calculate the geodesic path from the source domain to the target domain. Project each sample from both domains onto all the intermediate subspaces. Design classifiers using the projected data.
(Figure: the source subspace S1 (via PCA on source samples X_s) and the target subspace S2 (via PCA on target samples X_t) are points on the Grassmann manifold, connected by a geodesic.)
Grassmann manifold-based domain adaptation
Creating intermediate subspaces on manifolds
Algorithm1: Numerical computation of the velocity matrix – The inverse exponential map [1]
[1] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, Efficient algorithms for inferences on Grassmann manifolds, IEEE Statistical Signal Processing Workshop, pp. 315-318, St. Louis, MO USA, Sep 2003
Algorithm2: Computing the exponential map and sampling points along the geodesic [1]
Ψ(t) = Q exp(tB) J
The sub-matrix A of B specifies the direction and the speed of the geodesic flow.
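Both maps can be sketched with the standard SVD-based formulas for the Grassmann geodesic between two subspaces (the random subspaces here are stand-ins; `theta` holds the principal angles). The inverse exponential map recovers the velocity from S1 toward S2, and evaluating the exponential map at intermediate t samples points along the geodesic.

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_sub = 10, 3
S1 = np.linalg.qr(rng.normal(size=(d_feat, d_sub)))[0]   # source subspace basis
S2 = np.linalg.qr(rng.normal(size=(d_feat, d_sub)))[0]   # target subspace basis

# Inverse exponential map: M = (I - S1 S1^T) S2 (S1^T S2)^{-1} = U tan(theta) V^T
M = (np.eye(d_feat) - S1 @ S1.T) @ S2 @ np.linalg.inv(S1.T @ S2)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
theta = np.arctan(s)                                     # principal angles

def geodesic(t):
    # Exponential map: orthobasis of the subspace at parameter t along the path
    return S1 @ Vt.T @ np.diag(np.cos(t * theta)) + U @ np.diag(np.sin(t * theta))

S_mid = geodesic(0.5)   # one intermediate subspace between source and target
```

At t = 0 the map returns (a rotation of) S1 and at t = 1 it spans S2, so intermediate t values yield the intermediate subspaces used for adaptation.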
- Project each sample from both domains onto all the intermediate subspaces
(Figure: the source subspace S1 and the target subspace S2 on the Grassmann manifold, connected by a geodesic; the collection of infinitely many intermediate subspaces lies along the path.)
Legend:
- Case A: Linear domain representation with standardized features (recently made available with the dataset)
- Cases B, B1: Linear and non-linear domain representations with standardized (B) and protocol-based (B1) features (that we generate using the protocol)
- Cases C, C1: Physically relevant adaptation by simulating intermediate domains by varying proportions of source and target data (C), or by geodesic sampling between the physical domains (C1); C and C1 use both linear and non-linear domain representations
- Case D: Boosted adaptation, combining all the above cases in a multi-class boosting setting
Unsupervised adaptation on Office dataset using finite intermediate subspaces
Semi-supervised adaptation using finite intermediate subspaces
Legend:
- The unprimed cases (A, B, B1, C, C1, D) are the same as in the unsupervised setting except that target labels are used in learning the classifier (i.e., semi-supervised: some target labels are available).
- The primed cases (B', B1', C', C1') are the same as their unprimed counterparts but use target labels in both the intermediate-domain generation stage and the classification stage.
Performing recognition under domain shift
- Modified data representation:
– Project all labeled source data and all unlabeled target data onto all the domains (subspaces), and concatenate each sample's projections into one long vector.
- Classification:
– Train a discriminative learner on the projected source data (we used partial least squares).
– Use the PLS latent space to estimate the labels of the projected target data.
Office dataset (Saenko et al., ECCV 2010)
- 3 domains (webcam, dslr, amazon); 31 object categories
(Figure: query images and their retrievals.)
Multi-domain adaptation on Office dataset
Legend: Cases A and D are the same as for the unsupervised and semi-supervised settings.
Unsupervised and semi-supervised adaptation on Bing dataset
Some additional results on manifolds
- Dynamic models for actions and faces in videos, leading to Stiefel manifolds and appropriate inference mechanisms (PAMI 2011)
- Recognition of group activities (IJCV 2013)
- Video summarization (NIPS 2011)
- Alignment manifold (PAMI 2012)
- Age estimation and face recognition across aging on a Grassmann manifold (TIFS 2013)
- Fast approximate nearest-neighbor search on a manifold (ICVGIP 2010)
Closing remarks
- Dictionaries and manifolds are useful for image and video-based recognition.
– Appearance and geometry can be integrated in a way that fully exploits the data explosion.
– Should address challenges due to pose, illumination, occlusion, resolution, etc.
- Dictionary-based methods have a neural basis.
- Domain adaptation methods can address training/testing data scalability.
- Domain adaptation methods nicely bridge computer vision and pattern recognition.
- More math, data and computing.
- Recently completed the evaluation of an iris sensor adaptation algorithm on 100 GB of UND data!
– Exciting times are ahead.