
EE 6882 Statistical Methods for Video Indexing and Analysis (Review)

Fall 2004

  • Prof. Shih-Fu Chang

http://www.ee.columbia.edu/~sfchang 11/15/2004


Problems in Video Indexing and Analysis

  • Indexing, search, and retrieval for images and videos

Image/Video search engine: “find video clips of basketball going through the hoop”, “find images containing the shape shown in the sketch”

  • Automatic annotation of visual content

(e.g., recognition of text, face, scene, vehicle, location, etc)

  • Automatic parsing of video programs into structures

(e.g., break videos into shots, scenes, and stories)

  • Event detection

(e.g., sports events, human activities, meetings, medical, and other spatio-temporal patterns)

  • Summary

e.g., topic clustering, highlight generation; see Columbia's sports highlight and news topic clustering demos


A Very High-Level Stat. Pattern Recog. Architecture

(From Jain, Duin, and Mao, SPR Review, ’99)


Important issues

  • Image/video pre-processing – quality, resolution etc
  • Feature extraction

Color, texture, motion, shape, layout, regions, parts, etc

  • Feature representation

Discrete vs. continuous, vectorization, dimensionality
Invariance to scale, rotation, translation, …

  • Feature selection

PCA, Max. Entropy, Kernel method, etc

  • Classification models

Generative vs. discriminative
Multi-modal fusion, early fusion vs. late fusion

  • Validation and evaluation processes


Image Features


Basic Image/Video Features

Color:
  (a) SCD (scalable color descriptor)
  (b) CSD (color structure descriptor)
  (c) Dominant Color
  (d) CLD (color layout descriptor)
Texture:
  (a) Texture Browsing Descriptor
  (b) HTD (homogeneous texture descriptor)
  (c) Edge Histogram Descriptor
Motion:
  (a) Motion Activity
  (b) Motion Trajectory
Shape:
  (a) Curvature Scale Space
  (b) Region moments


Color Space

brightness varies along the vertical axis
hue varies along the circumference
saturation varies along the radius


Color Histogram

Feature extraction from color images:
  Choose a color space
  Quantize the color space to reduce the number of colors
  Represent the image color content using a color histogram

The feature vector IS the color histogram:

  h_{RGB}[r, g, b] = \sum_{m} \sum_{n} \begin{cases} 1 & \text{if } I_R[m,n] = r,\ I_G[m,n] = g,\ I_B[m,n] = b \\ 0 & \text{otherwise} \end{cases}

A color histogram represents the distribution of colors, where each histogram bin corresponds to a color in the quantized color space.
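As a concrete illustration, here is a minimal NumPy sketch of the formula above; the 8-bins-per-channel quantization and the function name are illustrative choices, not part of the original slides.

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    """Quantized RGB color histogram, following the summation above.

    image: H x W x 3 uint8 array. Each channel is quantized to
    bins_per_channel levels; hist[r, g, b] counts the pixels whose
    quantized (R, G, B) triple falls into that bin.
    """
    # Quantize each 0..255 channel value to a bin index 0..bins-1.
    q = (image.astype(np.uint16) * bins_per_channel) // 256
    hist = np.zeros((bins_per_channel,) * 3, dtype=np.int64)
    # Accumulate counts: the double sum over pixel coordinates (m, n).
    np.add.at(hist, (q[..., 0].ravel(), q[..., 1].ravel(), q[..., 2].ravel()), 1)
    return hist

# The histogram of any image sums to its pixel count.
img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
assert color_histogram(img).sum() == 64 * 64
```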


Histogram Metrics

L1 distance:

  D_{L1}(i, i+1) = \sum_j \left| H_i(j) - H_{i+1}(j) \right|

L2 distance:

  D_{L2}(i, i+1) = \left( \sum_j \left( H_i(j) - H_{i+1}(j) \right)^2 \right)^{1/2}

Histogram Intersection:

  D_I = 1 - \frac{\sum_j \min\left( H_i(j), H_{i+1}(j) \right)}{\min\left( \sum_j H_i(j), \sum_j H_{i+1}(j) \right)}

Quadratic Distance:

  D_Q = \sum_{j_1} \sum_{j_2} \left( H_i(j_1) - H_{i+1}(j_1) \right) a(j_1, j_2) \left( H_i(j_2) - H_{i+1}(j_2) \right)

  a(j_1, j_2): correlation between colors j_1 and j_2 (e.g., 1 - d(j_1, j_2) for a normalized 1-D color distance d).

Mahalanobis Metric:

  D_{mah}(x_1, x_2) = \left( (x_1 - x_2)^T C^{-1} (x_1 - x_2) \right)^{1/2}

with the covariance matrix

  C = \begin{bmatrix} c(1,1) & c(1,2) & \cdots & c(1,d) \\ \vdots & \vdots & & \vdots \\ c(d,1) & c(d,2) & \cdots & c(d,d) \end{bmatrix}, \qquad c(i,j) = \frac{1}{N-1} \sum_{k=1}^{N} \left( x_k(i) - m(i) \right)\left( x_k(j) - m(j) \right)

where N is the number of samples. Using the eigen-decomposition

  C = [e_1 \cdots e_d]\, \mathrm{diag}(\lambda_1, \dots, \lambda_d)\, [e_1 \cdots e_d]^T, \qquad C^{-1} = [e_1 \cdots e_d]\, \mathrm{diag}(\lambda_1^{-1}, \dots, \lambda_d^{-1})\, [e_1 \cdots e_d]^T

the metric projects the data onto the eigenvectors, divides by the standard deviation of each eigen-dimension, and computes the Euclidean distance.
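The metrics above translate directly into code. A minimal NumPy sketch, assuming the histograms are given as flat float arrays (the similarity matrix A for the quadratic distance is supplied by the caller):

```python
import numpy as np

def l1_dist(h1, h2):
    # L1 distance: sum of absolute bin differences.
    return np.abs(h1 - h2).sum()

def l2_dist(h1, h2):
    # L2 distance: Euclidean norm of the bin differences.
    return np.sqrt(((h1 - h2) ** 2).sum())

def intersection_dist(h1, h2):
    # 1 - normalized histogram intersection.
    return 1.0 - np.minimum(h1, h2).sum() / min(h1.sum(), h2.sum())

def quadratic_dist(h1, h2, A):
    # A[j1, j2]: correlation/similarity between colors j1 and j2.
    d = h1 - h2
    return d @ A @ d

def mahalanobis_dist(x1, x2, C):
    # C: covariance matrix estimated from training samples.
    d = x1 - x2
    return np.sqrt(d @ np.linalg.inv(C) @ d)
```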


More Features to be considered

Color Coherence Vector

Not just the count of colors; also check adjacency. Pixels are color-quantized, the image is segmented into connected regions, and the regions are labeled:

  Color Quantization → Region Segmentation → Labeling

Example regions:

  Region: A   B   C   D   E
  Color:  1   2   1   3   1
  Size:  12  15   3   1   5

For each color, α counts the pixels in large (coherent) regions and β the pixels in small (incoherent) regions. Color coherence vector for the example:

  Color: 1   2   3
  α:    17  15   0
  β:     3   0   1

For two images with G_I = \langle (\alpha_1, \beta_1), \dots, (\alpha_n, \beta_n) \rangle and G_{I'} = \langle (\alpha'_1, \beta'_1), \dots, (\alpha'_n, \beta'_n) \rangle:

  \Delta_G = \sum_{i=1}^{n} \left( |\alpha_i - \alpha'_i| + |\beta_i - \beta'_i| \right), \qquad \Delta_H = \sum_{i=1}^{n} \left| (\alpha_i + \beta_i) - (\alpha'_i + \beta'_i) \right|

  \Delta_G \ge \Delta_H by the triangle inequality, so the CCV distance is at least as discriminative as the plain histogram distance.
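A small SciPy sketch of the idea, assuming the image has already been color-quantized; the coherence threshold tau and 4-connectivity (the scipy.ndimage.label default) are free choices here:

```python
import numpy as np
from scipy import ndimage

def color_coherence_vector(quantized, n_colors, tau=5):
    """alpha[c]: pixels of color c in connected regions of size >= tau;
    beta[c]: pixels of color c in smaller (incoherent) regions."""
    alpha = np.zeros(n_colors, dtype=np.int64)
    beta = np.zeros(n_colors, dtype=np.int64)
    for color in range(n_colors):
        # Label the connected regions of this color (4-connectivity).
        labels, n_regions = ndimage.label(quantized == color)
        for region in range(1, n_regions + 1):
            size = int((labels == region).sum())
            if size >= tau:
                alpha[color] += size
            else:
                beta[color] += size
    return alpha, beta
```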


MPEG-7 Scalable Color Descriptor

The 256 histogram values are nonlinearly quantized and transformed with a Haar transform: each stage produces a lowpass coefficient (the sum of two adjacent bin values) and a highpass coefficient (their difference), followed by linear quantization of the coefficients. Scaling: the number of retained coefficients can be 256, 128, 64, 32, or 16.

MPEG-7 Texture Edge Histogram Descriptor

The original image is divided into 16 sub-images. Each sub-image is divided into a fixed number of blocks. Each image block is then partitioned into a 2x2 arrangement of pixel groups; the edge detector operators are applied to these 2x2 arrangements, treating each group as a pixel with the average intensity as the corresponding block intensity value.

5 bins x 16 sub-images = 80 bins, 3 bits/bin, total = 240 bits.

Filters for edge detection: 2x2 masks for vertical, horizontal, 45-degree, 135-degree, and non-directional edges.
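A rough NumPy sketch of the descriptor; the filter coefficients follow the standard MPEG-7 definition, but the block geometry (image sides divisible by 32) and the edge threshold are simplifying assumptions of this sketch:

```python
import numpy as np

# 2x2 edge filters: vertical, horizontal, 45-degree, 135-degree, non-directional.
FILTERS = [
    np.array([[1.0, -1.0], [1.0, -1.0]]),               # vertical
    np.array([[1.0, 1.0], [-1.0, -1.0]]),               # horizontal
    np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),  # 45 degrees
    np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),  # 135 degrees
    np.array([[2.0, -2.0], [-2.0, 2.0]]),               # non-directional
]

def edge_histogram(gray, threshold=11.0):
    """5 edge bins x 16 sub-images = 80 bins (counts, not yet 3-bit quantized).
    gray: 2-D array with sides divisible by 32 (4 sub-images x 4 blocks x
    2 pixel groups per side)."""
    H, W = gray.shape
    hist = np.zeros((4, 4, 5))
    sh, sw = H // 4, W // 4                       # sub-image size
    for i in range(4):
        for j in range(4):
            sub = gray[i*sh:(i+1)*sh, j*sw:(j+1)*sw]
            bh, bw = sh // 4, sw // 4             # 16 blocks per sub-image
            for bi in range(4):
                for bj in range(4):
                    block = sub[bi*bh:(bi+1)*bh, bj*bw:(bj+1)*bw]
                    # Average intensity of each 2x2 pixel-group "pixel".
                    hh, ww = block.shape[0] // 2, block.shape[1] // 2
                    means = np.array([
                        [block[:hh, :ww].mean(), block[:hh, ww:].mean()],
                        [block[hh:, :ww].mean(), block[hh:, ww:].mean()],
                    ])
                    resp = [abs((f * means).sum()) for f in FILTERS]
                    k = int(np.argmax(resp))
                    if resp[k] >= threshold:      # count only strong edges
                        hist[i, j, k] += 1
    return hist.reshape(-1)                       # 80-bin descriptor
```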


Texture features

Fourier Domain Energy Distribution

Angular features (directionality):

  V(\theta_1, \theta_2) = \iint |F(u, v)|^2 \, du \, dv, \quad \text{where } \theta_1 \le \tan^{-1}(v/u) < \theta_2

Radial features (coarseness):

  V(r_1, r_2) = \iint |F(u, v)|^2 \, du \, dv, \quad \text{where } r_1^2 \le u^2 + v^2 < r_2^2

(Figure: wedge-shaped angular sectors at angle \varphi and ring-shaped radial bands at radius r in the (\omega_x, \omega_y) frequency plane.)
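These integrals become sums over the discrete 2-D spectrum. A minimal NumPy sketch with evenly split angular wedges and radial rings (the number of bins is a free parameter of this sketch):

```python
import numpy as np

def fourier_texture_features(gray, n_angular=4, n_radial=4):
    """Angular (directionality) and radial (coarseness) energy features
    of |F(u, v)|^2, per the V(theta1, theta2) and V(r1, r2) sums above."""
    F = np.fft.fftshift(np.fft.fft2(gray))
    power = np.abs(F) ** 2
    H, W = gray.shape
    v, u = np.mgrid[-(H // 2):H - H // 2, -(W // 2):W - W // 2]
    theta = np.arctan2(v, u) % np.pi                  # direction in [0, pi)
    radius = np.hypot(u, v) / np.hypot(H / 2, W / 2)  # normalized radius
    angular = [power[(theta >= k * np.pi / n_angular) &
                     (theta < (k + 1) * np.pi / n_angular)].sum()
               for k in range(n_angular)]
    radial = [power[(radius >= k / n_radial) &
                    (radius < (k + 1) / n_radial)].sum()
              for k in range(n_radial)]
    return np.array(angular), np.array(radial)
```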


MPEG-7 Homogeneous Texture descriptor

In a normalized frequency space (0 \le \omega \le 1), the center frequencies of the feature channels are spread equally in the angular direction such that \theta_r = 30° \times r, where r is the angular index with r \in \{0, 1, 2, 3, 4, 5\}. In the radial direction, the center frequencies of neighboring feature channels are spaced one octave apart such that \omega_s = \omega_0 \cdot 2^{-s}, where s is the radial index with s \in \{0, 1, 2, 3, 4\} and \omega_0 = 3/4 is the highest center frequency. The channel index i can be expressed as i = 6 \times s + r + 1.



Curvature Scale Space

  Finds curvature zero-crossing points of the shape's contour (key points)
  Reduces the number of key points step by step, by applying Gaussian smoothing
  The positions of the key points are expressed relative to the length of the contour curve


How to measure the performance


Evaluation

Detection, False Alarms, Misses, Correct Dismissals

N images in the database, K ranked returned results; V_n = 1 if item n is "Relevant", 0 if "Irrelevant". (Venn diagram: "Returned" vs. "Relevant Ground Truth", with regions A, B, C, D.)

  A = \sum_{n=1}^{K} V_n            (relevant and returned: detections)
  B = \sum_{n=1}^{K} (1 - V_n)      (returned but irrelevant: false alarms)
  C = \sum_{n=1}^{N} V_n - A        (relevant but not returned: misses)
  D = \sum_{n=1}^{N} (1 - V_n) - B  (correct dismissals)

  Recall    R = A / (A + C)
  Precision P = A / (A + B)
  Fallout   F = B / (B + D)
  Combined  F_1 = \frac{P \cdot R}{(P + R)/2} = \frac{2PR}{P + R}
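These counts and ratios are one-liners in code. A sketch assuming a full relevance labeling of the database and non-degenerate counts (no division-by-zero handling):

```python
import numpy as np

def retrieval_scores(V, K):
    """V: 0/1 relevance of every database item in ranked order; K: #returned."""
    V = np.asarray(V)
    A = V[:K].sum()                 # relevant and returned (detections)
    B = K - A                       # returned but irrelevant (false alarms)
    C = V.sum() - A                 # relevant but missed
    D = (len(V) - V.sum()) - B      # correct dismissals
    R, P, fallout = A / (A + C), A / (A + B), B / (B + D)
    F1 = 2 * P * R / (P + R)
    return R, P, fallout, F1

V = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # 4 relevant items in a DB of 10
print(retrieval_scores(V, K=5))      # R=0.75, P=0.6, fallout=1/3, F1=2/3
```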


Evaluation Measures

  • 1. Precision Recall Curve (P vs. R)
  • 2. Receiver Operating Characteristic (ROC Curve): A (hit) vs. B (false)
  • 3. Relative Operating Characteristic: A vs. F at a cutoff
  • 4. P value: precision at a cutoff k, P_k = \frac{1}{k} \sum_{n=1}^{k} V_n
  • 5. 3-point P value: average P at R = 0.2, 0.5, 0.8


Evaluation Metric: Average Precision

Ranked list of data in response to a query; in the example below, relevant ground-truth items (D) occur at ranks 1, 3, and 4 of the 7 returned items:

  rank j:     1    2    3    4    5    6    7
  precision: 1/1  1/2  2/3  3/4  3/5  3/6  3/7

Average precision:

  AP = \frac{1}{R} \sum_{j} \frac{R_j}{j} I_j

where I_j = 1 if the item at rank j is relevant (0 otherwise), R_j is the number of relevant items within the top j, and R is the total number of relevant data. AP measures the average of the precision values at the R relevant data points.
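A minimal sketch of the AP formula using cumulative counts; the example reproduces the ranked list above (relevant at ranks 1, 3, 4):

```python
import numpy as np

def average_precision(relevant, R_total=None):
    """Mean of the precision values at the ranks of the relevant items."""
    I = np.asarray(relevant, dtype=float)        # I[j] = 1 if rank j+1 relevant
    R_j = np.cumsum(I)                           # relevant found in the top j
    precision = R_j / np.arange(1, len(I) + 1)   # precision at each rank
    R = int(I.sum()) if R_total is None else R_total
    return (precision * I).sum() / R

# Relevant items at ranks 1, 3, 4: AP = (1/1 + 2/3 + 3/4) / 3 ~= 0.806
print(average_precision([1, 0, 1, 1, 0, 0, 0]))
```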


Evaluation Metric: Average Precision (cont.)

Alternative Measure

  Ranked results are manually inspected to a depth of N1.
  E.g., in TREC VIDEO 2003, N1 = 100; in TREC VIDEO 2004, N1 = 1000.

Observations (AP)

AP depends on the rankings of the relevant data and on the size of the relevant data set. E.g., with R = 10:

  Case I:   + + + + + + + + + + - - - …
    Precision at the relevant ranks: 1, 1, …, 1  →  AP = 1
  Case II:  + - + - + - + - + - + - + - + - + - + -
    Precision at the relevant ranks ≈ 1/2 each  →  AP ≈ 1/2
  Case III: - - … (10 irrelevant) … + + + + + + + + + +
    Precision at the relevant ranks: 1/11, 2/12, …, 10/20  →  AP ≈ 0.3


Image Classification vs. Content-Based Retrieval


Bayesian Image Classification

(Vailaya et al. '98 and '01)

  • How to select the categories and the tree?
  • How to estimate the distributions of features for each class?


Classification Paradigms

Probabilistic (likelihood-based):
  Model the class-conditional likelihoods of the feature x (e.g., height, income, …) for Class 1 and Class 2, and compare P(x|C=1) > or < P(x|C=2) to decide the label C(x_0) of a test point x_0.

Discriminative:
  Learn a decision boundary directly via a discriminant function f(x), and classify by its sign: f(x) > 0 vs. f(x) < 0.

(Figure: 1-D likelihood curves of the two classes around x_0; a 2-D scatter of + and - samples in (x_1, x_2) separated by a decision boundary.)


Bayesian Image Classification

  Feature independence
  MAP classification
  VQ as distribution estimator


Gaussian Distribution

  • Gaussian distribution:

  p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

  \Pr[\,|x - \mu| \le \sigma\,] \cong 0.68, \quad \Pr[\,|x - \mu| \le 2\sigma\,] \cong 0.95, \quad \Pr[\,|x - \mu| \le 3\sigma\,] \cong 0.997

  r = |x - \mu| / \sigma is the Mahalanobis distance from x to \mu.

  • Multivariate Gaussian:

  p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)

  where \mathbf{x}, \boldsymbol{\mu} are D-dimensional vectors and \boldsymbol{\Sigma} is a D \times D matrix.

  (Figure: contours in (x_1, x_2) for a diagonal \boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2) vs. a general \boldsymbol{\Sigma}.)


Gaussian Used In Classification

Bayes rule:

  p(w_j \mid x) = \frac{p(x \mid w_j)\, p(w_j)}{p(x)} \qquad \text{(likelihood × prior / evidence)}

MAP classifier:

  x \rightarrow w_i \ \text{ if } \ p(w_i \mid x) \ge p(w_j \mid x), \ \forall j \ne i

ML classification: when the priors p(w_j) are uniform, choose \arg\max_{w_j} p(x \mid w_j); the likelihoods p(x \mid w_j) can be modeled by Gaussians.
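A compact sketch of the MAP rule with Gaussian class models; the synthetic data and the helper names are made up for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classes(X, y):
    """One Gaussian (mean, covariance) plus a prior per class."""
    return {c: (X[y == c].mean(axis=0),
                np.cov(X[y == c], rowvar=False),
                (y == c).mean())                 # prior p(w_c)
            for c in np.unique(y)}

def map_classify(x, params):
    """MAP rule: argmax_j log p(x|w_j) + log p(w_j)."""
    return max(params, key=lambda c: multivariate_normal.logpdf(
        x, mean=params[c][0], cov=params[c][1]) + np.log(params[c][2]))

# Two synthetic 2-D Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(map_classify(np.array([2.5, 2.5]), fit_gaussian_classes(X, y)))  # -> 1
```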


Mixture Of Gaussians

Real distributions seldom follow a single Gaussian → use a mixture of Gaussians:

  p(x) = \sum_z p(z)\, p(x \mid z), \qquad p(x \mid z) = N(x; \mu_z, \Sigma_z) = \frac{1}{(2\pi)^{D/2} |\Sigma_z|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_z)^T \Sigma_z^{-1} (x - \mu_z)}

with mixing weights p(z) = \pi_z. (Figure: a 1-D mixture density p(x) with component weights \pi_0, \pi_1.)

Given data x_1, \dots, x_N, define the log-likelihood (two-component case):

  l(\theta) = \sum_{n=1}^{N} \log\left( \pi_0\, N(x_n; \mu_0, \Sigma_0) + \pi_1\, N(x_n; \mu_1, \Sigma_1) \right)

Define z as a hidden random variable indicating the likelihood of x being generated by component i; the posteriors ("responsibilities") are

  \tau_i = p(z = i \mid x, \theta), \qquad \theta = \{ \pi_i, \mu_i, \Sigma_i \}

Use EM to estimate the model parameters and iteratively improve l(\theta). Define the auxiliary function (the expectation, over the hidden z under the current \theta^t, of the joint log-likelihood of observed and hidden variables):

  Q(\theta \mid \theta^t) = \sum_{n=1}^{N} \sum_z p(z \mid x_n, \theta^t) \log p(x_n, z \mid \theta) + \text{const}

With \theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t):

  \Delta l = l(\theta^{t+1}) - l(\theta^t) \ge Q(\theta^{t+1} \mid \theta^t) - Q(\theta^t \mid \theta^t) \ge 0, \quad \therefore\ \Delta l \ge 0

so each EM iteration cannot decrease the likelihood.

General steps of EM (a sketch follows below):
  1. Define the likelihood model with parameters θ
  2. Identify the hidden variables z
  3. Derive the auxiliary function and the E and M equations
  4. In each iteration, estimate the posteriors of the hidden variables
  5. Re-estimate the model parameters; repeat until a stopping criterion is met
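A minimal NumPy/SciPy sketch of these steps for a K-component Gaussian mixture; the initialization from random data points, the fixed iteration count, and the absence of covariance regularization are simplifications of this sketch:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing weights pi_i
    mu = X[rng.choice(N, K, replace=False)]     # means initialized from data
    Sigma = np.array([np.cov(X, rowvar=False)] * K)
    for _ in range(n_iter):
        # E-step: responsibilities tau[n, i] = p(z = i | x_n, theta^t).
        lik = np.column_stack(
            [pi[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
             for i in range(K)])
        tau = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta = {pi_i, mu_i, Sigma_i}.
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for i in range(K):
            d = X - mu[i]
            Sigma[i] = (tau[:, i, None] * d).T @ d / Nk[i]
    return pi, mu, Sigma
```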


Example Application of EM


Object Recognition as Machine Translation

  • Duygulu et al., ECCV 2002
  • Annotated images + image segmentation
  • Translation between image regions and annotation words
  • Given the regions in an image, predict the likely words
  • But the actual connection is 'hidden': try to estimate the soft probability
  • Image regions are clustered to form discrete tokens (a visual vocabulary)
  • A direct count of co-occurrences is inaccurate

(Figure: an "observed" co-occurrence table of visual tokens b1, b2 and text tokens w1, w2 over image IDs 1-3, vs. the "true" alignment b1 ↔ w1.)


Notations


Words with good prediction


Examples of word prediction

Satisfactory Unsatisfactory


Another Generative Model


News Story Segmentation

  • Objective: is there a story boundary at time \tau_k?
  • \tau_k \in { shot boundaries or significant pauses }, with neighboring candidates \tau_{k-1}, \tau_{k+1}
  • Observations over time around \tau_k: {video, audio} cues, e.g., a static face? motion energy changes? change from music to speech? a speech segment? {cue words}_i appear, {cue words}_j appear


Some ideas about data: different story types

Story type legend (CNN examples and a foreign example):
  (a) regular anchor segment
  (b) different anchor
  (c) multi-story in an anchor segment
  (d) continuous sports briefings
  (e) continuous short briefings
  (f) separated by music or animation
  (g) weather report
  (h) anchor lead-in before commercial
  (i) commercial after sports
(Figure symbols: visual anchors, story, commercial, misc./animation; content percentages shown for the leading types: 32.0%, 21.3%, 15.0%, 8.8%.)

* Visual anchors alone account for only 51% and 67% of the stories of ABC/CNN:

  Set | Modalities  | Precision | Recall | F1
  CNN | Anchor face |   0.80    |  0.38  | 0.51
  ABC | Anchor face |   0.67    |  0.67  | 0.67


Feature selection is important!

33 raw features → 195 binary predicates. (Figure: a timeline of candidate points with sig. pause, music, commercial, text seg. score, face, shot, and motion features.)

  Modality | Raw feature          | Data type | Value
  Video    | shot boundary        | point     | binary
  Video    | motion               | segment   | continuous
  Video    | face                 | segment   | continuous
  Video    | commercial           | segment   | binary
  Audio    | pause                | point     | continuous
  Audio    | pitch jump           | point     | continuous
  Audio    | significant pause    | point     | continuous
  Audio    | music/speech disc.   | segment   | binary
  Audio    | speech seg./rapidity | segment   | continuous
  Text     | ASR cue terms        | point     | binary
  Text     | V-OCR cue terms      | point     | binary
  Text     | text seg. score      | point     | continuous
  Misc.    | sports               | segment   | binary
  Misc.    | combinatorial        | point     | binary

195 binary features (A+V only) are derived from the raw features through feature wrappers with different thresholds & observation windows.


Exponential Model Fusing Binary Features

Maximum entropy model:

  q_\lambda(b \mid x) = \frac{1}{Z_\lambda(x)} \exp\left( \sum_i \lambda_i f_i(x, b) \right), \qquad f_i(x, b),\, b \in \{0, 1\}

Each binary predicate f_i tests one feature around the candidate point: face, motion, significant pause, speech segment, commercial, text segmentation score, ASR cue terms, … (Figure: one training case shown as a binary matrix, one row per predicate.)

Example: with predicates f_1 = 'anchor face' and f_2 = 'significant pause', if the current observation has face = YES and pause = NO:

  q(b = \text{YES} \mid x) = \frac{e^{\lambda_1}}{e^{\lambda_1} + e^{\lambda_2}}, \qquad q(b = \text{NO} \mid x) = \frac{e^{\lambda_2}}{e^{\lambda_1} + e^{\lambda_2}}
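The example above can be checked numerically. A tiny sketch with hypothetical weights (the λ values are made up; only the functional form comes from the slide):

```python
import numpy as np

def maxent_prob(lambdas, feats):
    """q(b|x) = exp(sum_i lambda_i f_i(x, b)) / Z(x) over b in {0, 1}.
    feats[b]: the binary feature vector f(x, b) for candidate label b."""
    scores = np.array([lambdas @ feats[b] for b in (0, 1)])
    q = np.exp(scores)
    return q / q.sum()                     # [q(b=NO|x), q(b=YES|x)]

# f1 ('anchor face') fires only with b = YES; f2 ('significant pause')
# fires only with b = NO; observation: face = YES, pause = NO.
lambdas = np.array([0.5, 0.7])             # hypothetical lambda_1, lambda_2
feats = {1: np.array([1, 0]), 0: np.array([0, 1])}
q_no, q_yes = maxent_prob(lambdas, feats)
print(q_yes)   # = e^{lambda_1} / (e^{lambda_1} + e^{lambda_2}) ~= 0.45
```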


Parameter Estimation for ME

  • Estimate \lambda_i from the training data T = \{ (x_k, b_k) \} by minimizing the Kullback-Leibler divergence between the empirical distribution \tilde{p} (from the data) and the estimated model q_\lambda:

  D(\tilde{p} \,\|\, q_\lambda) = \sum_x \sum_b \tilde{p}(x, b) \log \tilde{p}(b \mid x) - \sum_x \sum_b \tilde{p}(x, b) \log q_\lambda(b \mid x) = -L(q_\lambda) + \text{constant}(\tilde{p})

  • Equivalent to maximizing the 'entropy' (log-likelihood) objective:

  L(q_\lambda) \equiv \sum_x \sum_b \tilde{p}(x, b) \log q_\lambda(b \mid x)

  • Iteratively find \lambda'_i = \lambda_i + \Delta\lambda_i with the update

  \Delta\lambda_i = \frac{1}{M} \log \left( \frac{ \sum_{x,b} \tilde{p}(x, b)\, f_i(x, b) }{ \sum_{x,b} \tilde{p}(x)\, q_\lambda(b \mid x)\, f_i(x, b) } \right)

  • Because of the convexity of the objective function, the iterative process is guaranteed to reach the optimum.
  • Efficiency: Matlab implementations show convergence in ~30 mins when using 30 features and 11,705 training samples.
  • q
Automatic Feature Selection

  • Input: a collection of candidate features, training samples, and the desired model size
  • Output: the optimal subset of features and their corresponding exponential weights
  • The current model q is augmented with a feature h with weight \alpha:

  q_{\alpha, h}(b \mid x) = \frac{ q(b \mid x)\, e^{\alpha h(x, b)} }{ Z_\alpha(x) }

  • Select the candidate which improves the current model the most in each iteration:

  h^* = \arg\max_{h \in C} \sup_\alpha \left[ D(\tilde{p} \,\|\, q) - D(\tilde{p} \,\|\, q_{\alpha,h}) \right] = \arg\max_{h \in C} \sup_\alpha \left[ L(q_{\alpha,h}) - L(q) \right]

  (reduction of divergence = increase of log-likelihood)


Induced Features (from CNN)

  no | raw feature set                    | λ       | gain   | interpretation
   1 | Anchor face                        |  0.4771 | 0.3879 | An anchor face segment just starts after the candidate point
   2 | Significant pause & non-commercial |  0.7471 | 0.0160 | A significant pause within the non-commercial section appears in the surrounding observation window
   3 | Pause                              |  0.2434 | 0.0058 | An audio pause with a duration larger than 2.0 seconds appears after the boundary point
   4 | Significant pause                  |  0.7947 | 0.0024 | The surrounding observation window has a significant pause with a pitch jump intensity larger than the normalized pitch threshold 1.0 and a pause duration larger than 0.5 second
   5 | Speech segment                     | -0.3566 | 0.0019 | A speech segment before the candidate point
   6 | Speech segment                     |  0.3734 | 0.0015 | A speech segment starts in the surrounding observation window
   7 | Commercial                         |  1.0782 | 0.0015 | A commercial starts 15 to 20 seconds after the candidate point
   8 | Speech segment                     | -0.4127 | 0.0022 | A speech segment ends after the candidate point
   9 | Anchor face                        |  0.7251 | 0.0016 | An anchor face segment occupies at least 10% of the next window
  10 | Pause                              |  0.0939 | 0.0008 | The surrounding observation window has a pause with a duration larger than 0.25 second

* The first 10 "A+V" features automatically discovered for the CNN channel.
Every modality helps: anchor face, prosody, and speech segment.


Comparison with Discriminative Approaches like SVM


Other tricks: boosting weak classifiers

  • Intuitive aggregation of weak classifiers (slightly better than random guess) with rigorous error bounds (Freund & Schapire '97)
  • Efficient implementations in content-based image retrieval, e.g., face detection (Tieu and Viola, 2000)
  • May be applied to boost different classifiers (HMM, NN, ME)

(Figure: t = 1 training sample → t = 2 weighted sample → t = 3 weighted sample → … → t = T weighted sample, with an estimation step at each round; each round selects the weak classifier that improves the objective function (weighted 0/1 loss, exponential loss, etc.) the most at iteration t.)


What about the magic tool? SVM

  • SVM maps the data to a space of much higher dimension: input space → feature (high-dim.) space (figure: support vectors marked on the margin)
  • Maximize the separation margin in the high-dimensional space
  • Can solve with simple QP (quadratic programming)


Compare Different Models

  • ME with 35 induced features is close to SVM with 33 raw features
  • SVM with induced features is better than with raw features
  • SVM with 195 binary features performs the best
    SVM has excellent feature fusion capability
    Predicate binarization shields noise in the features
  • Boosting not so effective

(Figure: precision vs. recall on CNN news for the ME, Boosting (BST), and SVM-based approaches.)


Strengths & Weaknesses of Various Models

  SVM better than ME, ME better than Boosting
    Does the discriminative model match the classification task better?
    Not enough training data to learn the distribution in the generative model? (we forgot the dimensionality curse!)
    Highly imbalanced data (positive data only about 10%)
    Each feature too weak for the boosting model to work?
  Feature abstraction (e.g., binarization) seems to be very useful
    due to noise shielding?


Other applications of SVM


Case Study: SVM for Large Scale Concept Detection

  • 46 concepts selected from the TRECVID-2003 annotation lexicon
  • Model selection criteria:
    • Frequency f: support of each concept in the training set
    • Validity v: concept detector performance (average precision) on the validation set

(IBM: Smith, Naphade, Natsev, Lin et al.)


A fusion approach to Concept Detection

  • SVM for fusing multiple features
  • Explore multiple configurations of the same classifier
  • Late fusion of classifiers of different features, algorithms, and configurations
    • Normalization: rank, Gaussian, linear
    • Combination: average, product, min, max
  • Works well for uni-modal concepts with few training examples
  • Computationally low-cost method of combining multiple classifiers

(Figure: features f1, f2, …, fK from the training set feed SVMs with grid search, producing models m1, m2, …, mP; fused models mP+1, mP+2, … are selected on the validation set.)


Graphical Model as Generative Model


A new promising generative model: Graphical Models

  • Graphical representation of probabilistic relationships between a set of random variables

(Figure: an example network over flu, sinus infection, fever, headache, running nose, and temperature, with annotations "parent of x_4" and "child of x_2".)

Joint probability:

  p(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{parents}(x_i))

x_i is independent of the others given its parent(s). For the example network:

  p(x_1, \dots, x_6) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)

Each conditional distribution p(x_i \mid \mathrm{parents}(x_i)) is a model parameter.


Generative Model

A graphical model can be considered as a generative model for which queries regarding the values of sets of random variables can be answered, e.g.:

  p(\text{headache} = \text{yes} \mid \text{flu} = \text{yes}, \text{running nose} = \text{yes}) = ?

Learning model parameters:
  Complete data: MAP or ML estimation
  Incomplete data (with hidden variables): non-linear parametric optimization such as gradient descent or the EM algorithm


GM for Video Analysis

Conventional block processing:
  + Modular design
  - Each block specializes but ignores the larger context
  - Inadequacy of one block affects the entire chain of processing
  - Feedback increases complexity

Graphical Model:
  + Models jointly multiple causes of variability in the scene
  + Minimal number of preset parameters
  + Unsupervised training
  + Generic model
  - Model can be complex and computationally intractable

Nebojsa Jojic, Brendan Frey, 'A Generative Model for 2.5D Vision: Estimating appearance, transformation, illumination, transparency and occlusion', IJCV 2002. (Figure legend: hidden vs. observed nodes.)


Graphical Model for 2.5D layer separation

(Figure: the graphical model, with hidden and observed nodes and their links.)


Transformed Hidden Markov Models (THMM)

  c = class index (represents different head poses)
  z = latent image
  x = video frame; a sequence of x's = the video sequence
  l = transformation index (2D translation); \Gamma_l = transformation matrix

  p(z \mid c) = N(z; \mu_c, \Phi_c)
  p(x \mid l, z) = N(x; \Gamma_l z, \Psi)

Nebojsa Jojic, Nemanja Petrovic, Brendan Frey, Thomas Huang, 'Transformed Hidden Markov Models: Estimating Mixture Models of Images and Inferring Spatial Transformations in Video Sequences', CVPR 2000. (Figure legend: hidden vs. observed nodes.)


Applications of THMM

With p(z \mid c) = N(z; \mu_c, \Phi_c) and p(x \mid l, z) = N(x; \Gamma_l z, \Psi), the inferred quantities support several applications:

  z provides image stabilization (observed vs. stabilized)
  \mu_c provides a video summary
  \Phi_c provides image segmentation
  the inferred latent image provides a de-noised image (degraded → restored)


Learning Generative Model

  • Problems for learning a complex generative model:
    Too many hidden variables and model parameters, or no closed-form integration solution when marginalizing the hidden variables
    Computationally intractable; needs a large amount of data
  • Solutions:
    Reduce the size of the hidden state space
      (e.g., consider discrete-step translations rather than continuous translation)
    Employ efficient but approximate learning algorithms


Efficient Learning Algorithm

Learning iterates between 2 EM steps:
  Compute p(h|x), the posterior distribution of the hidden variables h given the observables x
  Estimate the model parameters

Efficient but approximate learning algorithms can be used:
  e.g., iterated conditional modes, Gibbs sampling, variational methods, loopy belief propagation

These techniques compute simplified distributions q(h) that best approximate p(h|x).


Learning Appearance Manifold

  • A set of object appearances across a video sequence can be mapped to a low-dimensional manifold, representing semantically meaningful variations.
  • Example: images of a person walking can be mapped to a 1-D manifold, representing the phase of the person's gait.
  • Multiple object appearance manifolds can be learnt separately and jointly using a generative model of layered objects.

Brendan Frey, Nebojsa Jojic, Anitha Kannan, 'Learning Appearance and Transparency Manifolds of Occluded Objects in Layers', CVPR 2003

Subspace model:

  x \approx \mu + \sum_{j=1}^{C} c_j v_j

  v = subspace basis, c = subspace coordinates


Input sequences

Substantial occlusion between the two walking persons


Subspace Appearance, Layered Model

(Figure: the layers are combined from right to left; s = scene; a subspace representation models each layer; arrows show the contributions of layer l to the final observation; hidden nodes marked.)


Subspace Variation Analysis

Layer 3 (blue-jean person), subspace coordinate:
  Minima = the man in blue jeans has his arms at his sides and his legs together
  Maxima = the man in blue jeans has his arms extended and his legs apart


Relevance Feedback


Content-Based Image Retrieval (CBIR systems)

(Figure: a query image with the user query "Please find similar pictures", feature weights Structure = 0.05, Color = 0.05, Texture = 0.9, and the retrieval results.)


Relevance feedback Basic Procedure

  • The retrieval process is interactive between users and computers.
  • 1. The user makes a tentative query as a trial run
  • 2. For a given query, the system selects a set of ranked images based on the predefined similarity measure of features
  • 3. The user marks those images as relevant and/or non-relevant

(Figure: initial results with weights S = 0.33, C = 0.33, T = 0.33.)

Classic Relevance Feedback Scheme

  • Vector Processing Method
  • Rocchio's Formula:

  Q_1 = \alpha \cdot Q_0 + \beta \sum_{\text{relevant items}} D_i - \gamma \sum_{\text{non-relevant items}} D_i

Worked example in a 2-D feature space (Feature 1, Feature 2), with D1 and D2 marked relevant by the user:

  Q0 = retrieval of information (0.7, 0.3)
  D1 = information science (0.2, 0.8)
  D2 = retrieval system (0.9, 0.1)
  Q' = 1/2 Q0 + 1/2 D1 = (0.45, 0.55)
  Q'' = 1/2 Q0 + 1/2 D2 = (0.80, 0.20)
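Rocchio's update is a few lines of code. A sketch reproducing the slide's worked example (α = β = 1/2, no negative feedback):

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=0.0):
    """Q1 = alpha*Q0 + beta*sum(relevant D_i) - gamma*sum(non-relevant D_i)."""
    q = alpha * np.asarray(q0, dtype=float)
    q += beta * sum(np.asarray(d, dtype=float) for d in relevant)
    q -= gamma * sum(np.asarray(d, dtype=float) for d in nonrelevant)
    return q

Q0 = [0.7, 0.3]   # "retrieval of information"
D1 = [0.2, 0.8]   # "information science"
D2 = [0.9, 0.1]   # "retrieval system"
print(rocchio(Q0, [D1], [], alpha=0.5, beta=0.5))   # Q'  = (0.45, 0.55)
print(rocchio(Q0, [D2], [], alpha=0.5, beta=0.5))   # Q'' = (0.80, 0.20)
```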


SVM - Active Learning

  • Every new labeled data point corresponds to a hyperplane in parameter space and reduces the version space
  • To get the final w for SVM quickly, i.e., to reduce the version space as fast as possible: choose the next query image that can halve the size of the version space (lemma by Tong & Koller, 2000)
  • Not practical to compute the size of the version space every time:
    • Simply assume that in each round w_i is located at the center
    • Choose the next query image closest to the hyperplane in the feature space F

(Figure: the version space W in parameter space, with candidate points a, b, c, d and their corresponding hyperplanes.)
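The "closest to the hyperplane" heuristic is easy to sketch with scikit-learn; the toy data and the labeled seed set below are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

def next_query(clf, X_pool):
    """Pick the unlabeled point with the smallest |decision value|,
    i.e., the one closest to the current SVM hyperplane."""
    return int(np.argmin(np.abs(clf.decision_function(X_pool))))

rng = np.random.default_rng(0)
X_pool = rng.normal(0, 1, (200, 2))                  # unlabeled pool
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0], [-1.5, 0.5], [1.5, -0.5]])
y_lab = np.array([0, 1, 0, 1])                       # user feedback so far
clf = SVC(kernel="linear").fit(X_lab, y_lab)
i = next_query(clf, X_pool)
print("ask the user to label sample", i, X_pool[i])  # one feedback round
```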


SVMActive Example

  • Initialize:
  • Round 1:
  • Round 2:


SVMActive Example (cont.)

  • Round 3:
  • Results:

96.3% correct in the top 54 results; the first error is the 42nd result.


Evaluation of Supervised Classification


Training / Validation / Testing

Assume the same distribution in the different sets; otherwise the optimal solution from validation may not be optimal on the test data.

  Training: find the optimal features, models, parameters
  Validation: select the optimal hypothesis through validation
  Testing: evaluate performance over the test data

(Figure: three scatter plots of + and - samples in (x^{(1)}, x^{(2)}), one each for the training, validation, and test sets.)


Training / Validation / Testing (cont.)

Multiple validation sets can be used for different optimization steps:
  Val-1: optimal classifier using feature 1
  Val-1: optimal classifier using feature 2
  Val-2: optimal classifier fusing multiple features
  …

Cross validation, leave-one-out:
  Partition the data into K parts (1, 2, …, K); rotate the choice of the test set and average the performance over the runs. (A sketch follows below.)
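A short scikit-learn sketch of K-fold cross validation on synthetic data (the classifier and the data are placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=100) > 0).astype(int)

scores = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC().fit(X[tr], y[tr])           # train on K-1 folds
    scores.append(clf.score(X[te], y[te]))  # test on the held-out fold
print("mean accuracy over folds:", np.mean(scores))
```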


Curse of Dimensionality and Overtraining

Rule of thumb – (# of training patterns per class) / (# of features) > 10

(Figure: a case of overtraining; an overly complex decision boundary in (x^{(1)}, x^{(2)}) fits every + and - training sample.)


Important issues (revisited)

  • Image/video pre-processing – quality, resolution, etc.
  • Feature extraction
    Color, texture, motion, shape, layout, regions, parts, etc.
  • Feature representation
    Discrete vs. continuous, vectorization, dimensionality
    Invariance to scale, rotation, translation, …
  • Feature selection
    PCA, MDS, Kernel PCA, etc.
  • Classification models
    Generative vs. discriminative
    Multi-modal fusion, early fusion vs. late fusion
  • Validation and evaluation processes
  • Complexity

Lessons Learned

  • Principles: given a problem domain
    Clearly identify the target classes
    Propose suitable models
    Supervised or unsupervised analysis to find 'good' features and model configurations
    Rigorous training/validation/testing
    Use benchmark data to make results repeatable and citable
  • Multi-modal joint exploration is critical
    Many video analysis problems involve multimedia cues
    Fusion or comparative studies of multiple models are often fruitful, e.g., ensemble fusion, SVM, ME, etc.
  • Generative models like GM are promising
    e.g., video appearance, video transformation, near-duplicate detection, etc.
  • Rigorous statistical modeling offers great potential for advances


Open research issues

  • Classification of temporal events is less explored than spatial concepts
    • e.g., human actions, basketball going through the hoop, etc.
    • Good features are less known: object or frame level? Object tracking is hard
    • How to capture temporal dynamics?
    • Generative models like HMM, HHMM, THMM are promising but still limited
  • Unsupervised discovery of complex video patterns is still unexplored
    • Most current works address dense dynamics modeled by HMM or DBN. Other patterns exist
      • E.g., temporal patterns with long-range sparse dependence
      • E.g., associative patterns at higher levels
    • Detection and interpretation of patterns, alerts, novel events
  • Role of humans?
    • Annotate training data (offline or online interactive)
    • Formulate queries
    • Interpret discovered patterns
    • Collaboration within social or geographic groups
  • Killer applications?
    • Search, browsing, visualization, filtering, summarization