Dictionaries, Manifolds and Domain Adaptation for Image and Video-based Recognition
Rama Chellappa, University of Maryland
Student and the teachers
Major points
- Training and testing data come from different distributions.
– Distributions are complex due to variations in patterns
– Domain adaptation
- Robust representations and distance measures
– Vector space vs manifolds
– Euclidean vs geodesics
- Will develop these points for two representations of images and videos.
– Dictionaries
– Manifolds
Outline of the talk
- Dictionaries
– Learning and applications to image and video-based recognition.
- Manifolds
– Representation, inference and applications to image and video-based recognition.
– Analytical and empirical
- Domain adaptation
– How to adapt representations to new domains
– Domain shifts could be due to pose, illumination, rate, time lapse, views, ...
– Semi-supervised, unsupervised
- Relies on works of Prof. Amari and Chikuse.
Motivation - 1
Motivation – 2
- Task: Given a probe video of one or more subjects, retrieve their IDs from a gallery of still face images or face videos.
- Challenges: getting a face image is more than half the problem
– Low resolution
– Blur
– Pose variation
– Uncontrolled illumination
– Camera motion
Dictionaries for signal and image analysis
- Matching pursuit algorithms (Mallat, early 1990s)
- Orthogonal matching pursuit (Pati et al., 1993; Tropp, 2004)
- Saito and Coifman, 1997
- Etemad and Chellappa, 1997
- Represent signals using wavelets, wavelet packets, ...
- Learning dictionaries from data instead of using off-the-shelf bases (Olshausen and Field, 1997), ...
Modern day dictionaries
- Represent signals and images using signals and images.
- Sparse coding has neural underpinnings.
- Allows compositional representations.
- Dictionary updates
– Batch (Method of Optimal Directions)
– K-SVD
- Dictionaries for images are more complicated
– Need to account for pose, illumination and resolution variations.
Basic formulation
- Assume L classes and n images per class in the gallery.
- The training images of the k-th class are represented as X_k = [x_k1, ..., x_kn].
- The dictionary D is obtained by concatenating all the training images: D = [X_1, ..., X_L].
- The unknown test vector y can be represented as a linear combination of the training images as y = Dα.
- The coefficient vector α is sparse.
Wright et al., 2009; Wagner et al., 2011
Dictionary-based face recognition
Represent the test image using the coefficients of each class separately and compute the reconstruction error; select the class giving the minimum reconstruction error.
α can be recovered by Basis Pursuit as α̂ = argmin ||α||_1 subject to ||y − Dα||_2 ≤ ε.
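This classification rule can be sketched as follows. A minimal sketch with synthetic data: scikit-learn's `Lasso` is used as an l1-penalized stand-in for Basis Pursuit, and the class layout (3 classes, 5 images each) is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
L, n, d = 3, 5, 20                       # classes, images per class, feature dim
D = rng.normal(size=(d, L * n))          # columns: vectorized training images
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms

y = D[:, 7] + 0.01 * rng.normal(size=d)  # test image drawn from class 1 (columns 5-9)

# l1-regularized recovery of the sparse coefficient vector (Basis Pursuit proxy)
alpha = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50000).fit(D, y).coef_

# reconstruct with each class's coefficients separately; pick the smallest residual
residuals = [np.linalg.norm(y - D[:, k*n:(k+1)*n] @ alpha[k*n:(k+1)*n])
             for k in range(L)]
pred = int(np.argmin(residuals))
```

The per-class residual uses only that class's block of coefficients, so the class whose training images explain the test image best wins.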
Learning dictionaries – K-SVD
(Figure: training faces → K-SVD → learned dictionary)
- M. Aharon, M. Elad, and A. M. Bruckstein, 2006
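K-SVD itself is not available in scikit-learn, but the same learn-a-dictionary-from-data idea can be sketched with `MiniBatchDictionaryLearning`, which also alternates sparse coding and dictionary updates (using block-coordinate updates rather than K-SVD's per-atom SVD). The data here are random stand-ins for vectorized face patches.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
faces = rng.normal(size=(100, 64))   # 100 hypothetical vectorized training patches

# Alternates sparse coding (OMP, at most 3 atoms per sample) and dictionary updates
learner = MiniBatchDictionaryLearning(n_components=16,
                                      transform_algorithm="omp",
                                      transform_n_nonzero_coefs=3,
                                      random_state=0)
codes = learner.fit_transform(faces)   # sparse codes, shape (100, 16)
D = learner.components_                # learned dictionary atoms, shape (16, 64)
```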
Outlier rejection
The illumination problem
- Robust albedo estimation (Biswas et al., PAMI 2009)
– Estimate albedo
– Relight images with different light source directions
– Use relighted images for training
Robust estimation of albedo
(Figure: light source + albedo + surface normals → intensity image; from a single intensity image, recover albedo + shape. Biswas et al., ICCV 2007, PAMI 2009)
Inverse problem
Albedo estimation
- Lambertian assumption: intensity I = ρ (n · s), with albedo ρ, surface normal n, and light source direction s.
- With the estimated light source ŝ and an initial surface normal n̂, an initial albedo estimate is obtained; the error in this initial estimate is treated as noise.
Albedo estimation
- The initial albedo estimate is modeled as the true albedo corrupted by signal-dependent additive noise.
- A non-stationary mean, non-stationary variance (NMNV) model is assumed for the true unknown albedo, with an unbiased source assumption and uncorrelated noise.
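As a rough illustration of this kind of correction, here is a minimal LMMSE-style sketch under an NMNV-like model: each pixel is shrunk toward its local mean in proportion to the local signal variance. The synthetic albedo, the 7×7 window, the noise level, and the `uniform_filter` smoother are all assumptions for illustration, not the exact estimator of Biswas et al.

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)
rho = np.clip(rng.normal(0.6, 0.15, (64, 64)), 0.0, 1.0)  # true albedo (unknown)
rho0 = rho + 0.1 * rng.normal(size=rho.shape)             # noisy initial estimate

noise_var = 0.1 ** 2
mu = uniform_filter(rho0, size=7)                  # local (non-stationary) mean
var = uniform_filter(rho0 ** 2, size=7) - mu ** 2  # local variance of the estimate
sig_var = np.maximum(var - noise_var, 0.0)         # signal part of the variance

# LMMSE correction: shrink toward the local mean where the signal variance is low
rho_hat = mu + sig_var / (sig_var + noise_var) * (rho0 - mu)

err0 = np.mean((rho0 - rho) ** 2)   # MSE of the raw initial estimate
err1 = np.mean((rho_hat - rho) ** 2)  # MSE after LMMSE correction
```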
Estimated albedo – PIE dataset
Relighting using the estimated albedo
Experimental results
- DFR – 99.17 %
- SRC – 98.1 %
- CDPCA – 98.83 %
- V. M. Patel, T. Wu, S. Biswas, P. J. Phillips, and R. Chellappa, "Dictionary-based face recognition under variable lighting and pose," IEEE Trans. Information Forensics and Security, 2011.
Yale B data set
Outdoor face dataset
- An outdoor dataset with 18 subjects, 5 gallery images each, and 90 low-resolution probe images.
Method         Recognition
SLRFR          67%
Reg. LDA+SVM   60%
CLPM           16.1%
(Gallery: 120 × 120; probe: 20 × 20)
BTAS 2011
Video dictionaries for face recognition
Pipeline (ECCV 2012):
- Preprocessing: extract frames and detect/crop face regions
- Use a summarization algorithm [1] to partition the cropped face images
- Dictionary learning for each partition; find sequence-specific dictionaries
- Construct distance/similarity matrices
- Recognition / verification
[1] N. Shroff, P. Turaga, and R. Chellappa, "Video précis: Highlighting diverse aspects of videos," IEEE Transactions on Multimedia, 2010; NIPS 2011.
Dictionary learning
(build sequence-specific dictionaries)
- Let Y_{i,j,k} be the gallery matrix of the k-th partition of the j-th video sequence of subject i.
- Given Y_{i,j,k}, use the K-SVD [2] algorithm to build a (partition-level) sub-dictionary D_{i,j,k} such that Y_{i,j,k} ≈ D_{i,j,k} Γ, with Γ sparse.
- Concatenate the (partition-level) sub-dictionaries to form a sequence-specific dictionary D_{i,j} = [D_{i,j,1}, ..., D_{i,j,K}].
[2] M. Aharon, M. Elad, and A. M. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006.
Recognition/Identification
- Given the m-th query video sequence, generate its partitions.
- The distance d(m, p) between the query and D_p (i.e., the dictionary of the p-th video sequence) is calculated as the reconstruction residual of the query partitions under D_p.
- Select the best match as the sequence p minimizing d(m, p).
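The matching rule above can be sketched as follows with synthetic data: each gallery video is represented by a sequence-specific dictionary, and the query is assigned to the dictionary with the smallest reconstruction residual. Plain least-squares coding stands in here for the sparse coding used in the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)
d, atoms = 30, 8

# sequence-specific dictionaries, one per gallery video (random stand-ins)
gallery = [rng.normal(size=(d, atoms)) for _ in range(4)]

# query frames drawn near gallery sequence 2's column space, plus small noise
Y = gallery[2] @ rng.normal(size=(atoms, 10)) + 0.01 * rng.normal(size=(d, 10))

def residual(D, Y):
    # distance = reconstruction error of the query frames under dictionary D
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return float(np.linalg.norm(Y - D @ coef))

dists = [residual(D, Y) for D in gallery]
match = int(np.argmin(dists))
```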
MBGC recognition results
MBGC dataset:
- 397 walking (frontal-face) videos: 198 SD + 199 HD
- 371 activity (profile-face) videos: 185 SD + 186 HD
- Facial expression analysis using AUs and high-level knowledge available in FACS regarding AU composition and expression decomposition.
- AUs have ambiguous semantic descriptions, so it is difficult to model them accurately.
AU-Dictionary
– We use local features to model each AU.
- We learn separate dictionaries for each AU.
- The AU-dictionary is then formed using all the individual AU dictionaries:
  D = [D_AU-1 | D_AU-2 | D_AU-5 | D_AU-10 | D_AU-12 | D_AU-23]
- Objective function to be minimized:
- Goal:
– To simultaneously learn structures on the expressive face and corresponding subspace representations
– We want the final subspaces to be as separated as possible
- Objective:
– Structures: disjoint subsets of local patch descriptors
– Learned dictionaries for the structures
- Learned structures for the universal expressions from the CK+ dataset
- Minimum residual error
Some additional results
- Competitive results for iris recognition; enables cancelability (PAMI 2011).
- Non-linear dictionaries through kernelization produce improvements of 5-10% depending on the problem (ICASSP 2012).
– Illustrated using the USPS, Caltech 101 and Caltech 256 datasets.
- Building dictionaries in the Radon transform domain yields robustness to in-plane rotation and scale in CBIR applications (IEEE TIP).
- Characteristic views (Chakravarty and Freeman) can be built using sparse representation theory (ICIP 2012).
- Joint-sparsity-driven dictionary learning produces improvements in multi-modal biometrics applications (under review).
- Reconstruction from sparse gradients (IEEE TIP 2012), in collaboration with Anna Gilbert.
Domain adaptation: Motivation
Image credit: Saenko et al., ECCV 2010, Bergamo et al., NIPS 2010
Transfer learning [1]: P(Y|X) ≠ P(Y'|X'), P(X) ≈ P(X')
Domain adaptation: P(X) ≠ P(X'), P(Y|X) ≈ P(Y'|X')
Source domain (data X, labels Y); target domain (data X', labels Y')
[1] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowledge and Data Engineering, 22:1345-1359, October 2010.
Domain adaptation - Related work
Semi-supervised
- Learns the domain change through correspondences
– Daumé and Marcu, JAIR 2006
– Duan et al., ICML 2009
– Xing et al., KDD 2007
– Saenko et al., ECCV 2010; Kulis et al., CVPR 2011
– Bergamo and Torresani, NIPS 2010
– Lai and Fox, IJRR 2010
Unsupervised
- No correspondence, no knowledge of the domain change
– Ben-David et al., AISTATS 2010
– Blitzer et al., NIPS 2008
– Wang and Mahadevan, IJCAI 2009
– Gopalan, Li and Chellappa, ICCV 2011
– Gong et al., CVPR 2012
– Zheng and Chellappa, ICPR 2012
– D. Xu's group, 2012
Unsupervised domain adaptation*
(Figure: the labeled source domain X (domain 1) and the unlabeled target domain X~ (domain 2) yield generative subspaces S1 and S2 on the Grassmannian G_{N,d}; intermediate subspaces S1.3, S1.6, ... are sampled along the path between them for incremental learning.)
*R. Gopalan, R. Li, R. Chellappa, "Domain adaptation for object recognition: An unsupervised approach," International Conference on Computer Vision, ICCV 2011 (Oral).
Domain adaptation of dictionaries
- Assume there exist K intermediate domains {S_k}_{k=1}^K which smoothly bridge the information gap between the source and target domains. A domain-dependent dictionary D_k is learned for each intermediate domain S_k.
- We learn intermediate data to approximate the observations in the corresponding intermediate domains. The intermediate data are then utilized to build classifiers.
- Intermediate domains can be derived as solutions to an optimization problem on a Grassmannian (ongoing work).
(Figure: the top half shows intermediate images synthesized from a given source image in frontal view (red box); the bottom half shows intermediate images generated from a given target image in side view (green box).)
Learn the intermediate domains
- The reconstruction residue of the target data decomposed with the source-domain dictionary is utilized as an estimate of the information gap between the two domains.
- The dictionaries for the intermediate domains are learned sequentially until the information gap falls below a threshold.
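This stopping rule can be sketched as follows with synthetic data. The SVD-based blend from the source dictionary toward a target-fitted basis is an illustrative stand-in for the paper's actual dictionary-update step; `Ds`, `B`, `Ut` and the threshold `tau` are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, atoms, n = 20, 10, 50
Ds = np.linalg.qr(rng.normal(size=(d, atoms)))[0]    # source-domain dictionary
B = np.linalg.qr(rng.normal(size=(d, atoms)))[0]     # target subspace (unknown)
Xt = B @ rng.normal(size=(atoms, n)) + 0.001 * rng.normal(size=(d, n))

def residue(D, X):
    # relative reconstruction residue of the target data under dictionary D
    C, *_ = np.linalg.lstsq(D, X, rcond=None)
    return float(np.linalg.norm(X - D @ C) / np.linalg.norm(X))

# target-fitted basis (illustrative stand-in for a learned update direction)
Ut = np.linalg.svd(Xt, full_matrices=False)[0][:, :atoms]

# learn intermediate dictionaries sequentially until the residue drops below tau
dicts, tau, K = [Ds], 0.05, 10
for k in range(1, K + 1):
    if residue(dicts[-1], Xt) <= tau:
        break                                        # information gap closed
    Dk = np.linalg.qr((1 - k / K) * Ds + (k / K) * Ut)[0]
    dicts.append(Dk)
```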
Learn the intermediate data
- The intermediate data are generated along the transition path formed by the learned intermediate domains.
– Generate intermediate data from the source data
– Generate intermediate data from the target data
- DA Classifier Invariant Codes (DAC-IC)
– Sparse codes are demonstrated to be invariant across the source, the target, and the intermediate domains.
- DA Classifier Transition Path (DAC-TP)
– The intermediate data are exploited to define the distance between the labeled source data and the target data.
- Extension to semi-supervised adaptation (with a small amount of labels in the target domain)
– Given unlabeled target data, compute its distance to both the labeled source data and the labeled target data.
Recognition under domain shift
- Distance between a source sample x_s,i and a target sample x_t,i:
  d = Σ_{k=0}^{K+1} || x_s,i^(k) − x_t,i^(k) ||_2^2
  where x^(k) denotes the sample represented in the k-th domain (k = 0: source; k = 1, ..., K: intermediate; k = K+1: target).
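The per-sample distance under domain shift, i.e. the sum over all K+2 subspaces of the squared difference between projections, can be sketched as follows. The subspaces and samples are random stand-ins, and `xt_far` is a hypothetical mismatched sample added only to show the distance behaves as expected.

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_sub, K = 50, 5, 4

# K+2 subspaces: source, K intermediate, target (random orthobases here)
subspaces = [np.linalg.qr(rng.normal(size=(d_feat, d_sub)))[0]
             for _ in range(K + 2)]

xs = rng.normal(size=d_feat)              # a labeled source sample
xt = xs + 0.1 * rng.normal(size=d_feat)   # a nearby target sample
xt_far = rng.normal(size=d_feat)          # a hypothetical mismatched sample

def shift_distance(a, b):
    # d = sum_k || S_k^T a - S_k^T b ||_2^2 over all K+2 subspaces
    return float(sum(np.sum((S.T @ a - S.T @ b) ** 2) for S in subspaces))

dist = shift_distance(xs, xt)
dist_far = shift_distance(xs, xt_far)
```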
Experiments on face recognition
- FR across pose variation on PIE (source: frontal images; target: non-frontal images)
- FR across blur and illumination variation on PIE (source: sharp images illuminated by one set of light sources; target: blurred images illuminated by a different set of light sources)
Experiments on object recognition
(Figure: comparison of the top five matches of our method (smaller images in the top row) and Grassmann-manifold-based DA (smaller images in the bottom row).)
Calculate the geodesic path from the source domain to the target domain. Project each sample from both domains onto all the intermediate subspaces. Design classifiers using the projected data.
(Figure: the source subspace S1 (via PCA on source samples X_s) and the target subspace S2 (via PCA on target samples X_t) are points on the Grassmann manifold, connected by a geodesic.)
Grassmann manifold-based domain adaptation
Creating intermediate subspaces on manifolds
Algorithm1: Numerical computation of the velocity matrix – The inverse exponential map [1]
[1] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, Efficient algorithms for inferences on Grassmann manifolds, IEEE Statistical Signal Processing Workshop, pp. 315-318, St. Louis, MO USA, Sep 2003
Algorithm2: Computing the exponential map and sampling points along the geodesic [1]
Ψ(t) = Q exp(tB) J
The sub-matrix A of B specifies the direction and the speed of the geodesic flow.
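Both maps can be sketched with the standard SVD-based formulas for the Grassmann geodesic between two subspaces (the random subspaces here are stand-ins; `theta` holds the principal angles). The inverse exponential map recovers the velocity from S1 toward S2, and evaluating the exponential map at intermediate t samples points along the geodesic.

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_sub = 10, 3
S1 = np.linalg.qr(rng.normal(size=(d_feat, d_sub)))[0]   # source subspace basis
S2 = np.linalg.qr(rng.normal(size=(d_feat, d_sub)))[0]   # target subspace basis

# Inverse exponential map: M = (I - S1 S1^T) S2 (S1^T S2)^{-1} = U tan(theta) V^T
M = (np.eye(d_feat) - S1 @ S1.T) @ S2 @ np.linalg.inv(S1.T @ S2)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
theta = np.arctan(s)                                     # principal angles

def geodesic(t):
    # Exponential map: orthobasis of the subspace at parameter t along the path
    return S1 @ Vt.T @ np.diag(np.cos(t * theta)) + U @ np.diag(np.sin(t * theta))

S_mid = geodesic(0.5)   # one intermediate subspace between source and target
```

At t = 0 the map returns (a rotation of) S1 and at t = 1 it spans S2, so intermediate t values yield the intermediate subspaces used for adaptation.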
- Project each sample from both domains onto all the intermediate subspaces
(Figure: the source subspace S1 and the target subspace S2 on the Grassmann manifold, connected by a geodesic; the collection of infinitely many intermediate subspaces lies along the path.)
Legend:
- Case A: Linear domain representation with standardized features (recently made available with the dataset)
- Cases B, B1: Linear and non-linear domain representations with standardized (B) and protocol-based (B1) features (that we generate using the protocol)
- Cases C, C1: Physically relevant adaptation by simulating intermediate domains by varying proportions of source and target data (C), or by geodesic sampling between the physical domains (C1); C and C1 use both linear and non-linear domain representations
- Case D: Boosted adaptation, combining all the above cases in a multi-class boosting setting
Unsupervised adaptation on Office dataset using finite intermediate subspaces
Semi-supervised adaptation using finite intermediate subspaces
Legend:
- The unprimed cases (A, B, B1, C, C1, D) are the same as in the unsupervised setting except that target labels are used in learning the classifier (i.e., semi-supervised: some target labels are available).
- The primed cases (B', B1', C', C1') are the same as their unprimed counterparts but use target labels in both the intermediate-domain generation stage and the classification stage.
Performing recognition under domain shift
- Modified data representation:
– Project all labeled source data and all unlabeled target data onto all the domains (subspaces), and concatenate each sample's projections into one long vector.
- Classification:
– Train a discriminative learner on the projected source data (we used partial least squares).
– Use the PLS latent space to estimate the labels of the projected target data.
Office dataset (Saenko et al., ECCV 2010)
- 3 domains (webcam, dslr, amazon); 31 object categories
(Figure: query images and their retrievals.)
Multi-domain adaptation on Office dataset
Legend: Cases A and D are the same as for the unsupervised and semi-supervised settings.
Unsupervised and semi-supervised adaptation on Bing dataset
Some additional results on manifolds
- Dynamic models for actions and faces in videos, leading to Stiefel manifolds and appropriate inference mechanisms (PAMI 2011)
- Recognition of group activities (IJCV 2013)
- Video summarization (NIPS 2011)
- Alignment manifold (PAMI 2012)
- Age estimation and face recognition across aging on a Grassmann manifold (TIFS 2013)
- Fast approximate nearest-neighbor search on a manifold (ICVGIP 2010)
Closing remarks
- Dictionaries and manifolds are useful for image and video-based recognition.
– Appearance and geometry can be integrated in a way that fully exploits the data explosion.
– Should address challenges due to pose, illumination, occlusion, resolution, etc.
- Dictionary-based methods have a neural basis.
- Domain adaptation methods can address training/testing data scalability.
- Domain adaptation methods nicely bridge computer vision and pattern recognition.
- More math, data and computing.
- Recently completed the evaluation of an iris sensor adaptation algorithm on 100 GB of UND data!
– Exciting times are ahead.