EE 6882 Overview of Statistical Models for Video Indexing

  • Prof. Shih-Fu Chang, Columbia University
  • TA: Eric Zavesky
  • Fall 2007, Lecture 4
  • Course web site: http://www.ee.columbia.edu/~sfchang/course/svia

Statistical Paradigm

  • Many problems can be posed as pattern recognition
Image classification: indoor vs. outdoor? Face? Shot boundary detection, story segmentation: is the current point a boundary?
  • Statistical models to handle uncertainty and provide flexibility
  • Image processing tools available
E.g., homework #1
  • Rich tools for learning and prediction
See course web site
  • Increasing data available
NIST TRECVID: 300+ hours; consumer and YouTube videos


A Very High-Level Stat. Pattern Recog. Architecture

(From Jain, Duin, and Mao, SPR Review, ’99)

Important issues (1)

Image/video processing
What's the adequate quality, resolution, etc.?

Feature extraction
Color, texture, motion, region, shape, interest points, audio, speech, text, etc.

Feature representation
Histogram, bag, graph, etc. Invariance to scale, rotation, translation, view, illumination, ...
How to reduce dimensions?


Important issues (2)

  • Distance measurement
How to measure similarity between images/videos? L1, L2, Mahalanobis, Earth Mover's Distance, vector/graph matching
  • Classification models
Generative vs. discriminative; multi-modal fusion, early fusion vs. late fusion
E.g., how to use joint audio-visual features to detect events (dancing, wedding, ...)
  • Efficiency issues
How to speed up training and testing processes? How to rapidly build a model for new domains?
  • Validation and evaluation
How to measure performance? Are models generalizable to new domains?

Three related problems

Retrieval, ranking
Given a query image, find relevant ones; may apply a rank threshold to decide relevance

Classification, categorization, detection
Given an image x, predict class label y

Clustering, grouping
Group images/videos into clusters of distinct attributes


An example

News story segmentation using multi-modal, multi-scale features

First Understand Data Types and Explore Unique Characteristics

[Figure: percentage of content by story type — (a) regular anchor segment, (b) different anchor, (c) multi-story in an anchor segment, (d) continuous sports briefings, (e) continuous short briefings, (f) stories separated by music or animation, (g) weather report, (h) anchor lead-in before commercial, (i) commercial after sports; grouped into visual anchors, story, weather, commercial, and misc./animation, with the largest groups at 32.0%, 21.3%, 15.0%, and 8.8% of the content.]


News Story Segmentation

  • Objective: is there a story boundary at candidate time $\tau_k$?
  • Candidate points: $\tau_k \in$ {shot boundaries or significant pauses}
  • Observation around $\tau_k$ (between $\tau_{k-1}$ and $\tau_{k+1}$), drawn from {video, audio}: an anchor face? motion changes? a change from music to speech? a speech segment? do {cue words}$_i$ or {cue words}$_j$ appear?

Need to decide how to formulate features. Raw features around each candidate point, by modality:

Modality | Raw feature | Data type | Value
Video | shot boundary | point | binary
Video | face | segment | continuous
Video | motion | segment | continuous
Video | commercial | segment | binary
Audio | pause | point | continuous
Audio | significant pause | point | continuous
Audio | pitch jump | point | continuous
Audio | music/speech discrimination | segment | binary
Audio | speech segment / rapidity | segment | continuous
Text | text segmentation score | point | continuous
Text | ASR cue terms | point | binary
Text | V-OCR cue terms | point | binary
Misc. | sports | segment | binary

One way is to use a binary predicate: if x > threshold, then predict a segment boundary (b = 1); a small sketch of this follows below.

Challenge: diverse features (binary vs. continuous values, point vs. segment extents, combinatorial forms).
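To make the binary-predicate formulation concrete, here is a minimal Python sketch; the feature names, thresholds, and values are hypothetical, chosen only to illustrate the thresholding idea, not taken from the course materials:

```python
import numpy as np

# Hypothetical raw observations around one candidate point (illustrative values).
observation = {
    "pause_duration": 0.8,         # seconds, continuous point feature
    "anchor_face_ratio": 0.15,     # fraction of the next window, continuous segment feature
    "commercial": 0,               # binary segment feature
    "asr_cue_term": 1,             # binary point feature
}

# Each predicate is a name plus a rule mapping the observation to 0/1.
predicates = [
    ("pause_longer_than_0.25s",    lambda o: int(o["pause_duration"] > 0.25)),
    ("anchor_face_at_least_10pct", lambda o: int(o["anchor_face_ratio"] >= 0.10)),
    ("commercial_present",         lambda o: int(o["commercial"] == 1)),
    ("asr_cue_term_present",       lambda o: int(o["asr_cue_term"] == 1)),
]

# Binary feature vector for this candidate point.
f = np.array([rule(observation) for _, rule in predicates])
print(dict(zip([name for name, _ in predicates], f)))
```

The same thresholding turns every modality, whether point or segment, binary or continuous, into a common 0/1 representation for the models that follow.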


Example Predicates

no. | raw feature set | predicate
1 | Anchor face | An anchor face segment just starts after the candidate point
2 | Significant pause & non-commercial | A significant pause within the non-commercial section appears in the surrounding observation window
3 | Pause | An audio pause with duration larger than 2.0 seconds appears after the boundary point
4 | Significant pause | The surrounding observation window has a significant pause with pitch jump intensity larger than the normalized pitch threshold 1.0 and pause duration larger than 0.5 second
5 | Speech segment | A speech segment before the candidate point
6 | Speech segment | A speech segment starts in the surrounding observation window
7 | Commercial | A commercial starts 15 to 20 seconds after the candidate point
8 | Speech segment | A speech segment ends after the candidate point
9 | Anchor face | An anchor face segment occupies at least 10% of the next window
10 | Pause | The surrounding observation window has a pause with duration larger than 0.25 second

Collect Features from Training Samples

[Table: binary predicate activations collected from the training samples — each row is one predicate (anchor face, motion, significant pause, speech segment, commercial, text segmentation score, ASR cue terms, ...), each column is one training sample; an entry of 1 means predicate $f_i$ fires for that sample, and the label row b marks whether the sample is a true story boundary.]


Choose Model

Maximum entropy model:
$q_\lambda(b\,|\,x) = \frac{1}{Z_\lambda(x)}\exp\Big(\sum_i \lambda_i f_i(x,b)\Big), \qquad b \in \{0,1\}$

For example, with predicate $f_1$ = 'anchor face' and $f_2$ = 'significant pause', if the current observation gives face = YES and pause = NO:
$q(b=\mathrm{YES}\,|\,x) = e^{\lambda_1}/(e^{\lambda_1}+e^{\lambda_2}), \qquad q(b=\mathrm{NO}\,|\,x) = e^{\lambda_2}/(e^{\lambda_1}+e^{\lambda_2})$

Classification: if q(b=YES|x) > 0.5, then predict YES.
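A small numerical sketch of this exponential model (plain numpy; the feature convention and the weights are illustrative assumptions, not the trained values from the slides):

```python
import numpy as np

def maxent_prob(feats, lambdas):
    """q_lambda(b|x) = exp(sum_i lambda_i f_i(x,b)) / Z(x), for b in {0, 1}.

    feats[b] is the binary feature vector f(x, b) evaluated for label b.
    """
    scores = np.array([np.dot(lambdas, feats[b]) for b in (0, 1)])
    exps = np.exp(scores - scores.max())    # subtract the max for numerical stability
    return exps / exps.sum()                # [q(b=0|x), q(b=1|x)]

# Toy example with two predicates and made-up weights.
lambdas = np.array([0.7, 0.3])
feats = {1: np.array([1, 0]),   # f_1 fires when an anchor face is present and b = 1
         0: np.array([0, 1])}   # f_2 fires for the b = 0 branch of the same observation
q = maxent_prob(feats, lambdas)
print(q, "boundary" if q[1] > 0.5 else "no boundary")   # here q(b=1|x) ~ 0.60
```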

Background: Entropy

  • Entropy (bits): $H = -\sum_{i=1}^{m} p_i \log_2 p_i$
  • Kullback-Leibler (K-L) distance: a measure of 'distance' between two distributions,
$D_{KL}(q,p) = \sum_x q(x)\log\frac{q(x)}{p(x)} = \int_{-\infty}^{\infty} q(x)\log\frac{q(x)}{p(x)}\,dx$
  • Not necessarily symmetric, and may not satisfy the triangle inequality
  • $D_{KL}(q,p) \ge 0$, with equality iff $q(\cdot) = p(\cdot)$
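Both quantities are one-liners in numpy; the sketch below is a plain, general-purpose implementation for discrete distributions (not tied to any particular dataset):

```python
import numpy as np

def entropy_bits(p):
    """H = -sum_i p_i log2 p_i; terms with p_i = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(q, p):
    """D_KL(q, p) = sum_x q(x) log(q(x) / p(x)); assumes p(x) > 0 wherever q(x) > 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / p[nz]))

q = [0.7, 0.2, 0.1]
p = [0.5, 0.3, 0.2]
print(entropy_bits(q))                              # ~1.16 bits
print(kl_divergence(q, p), kl_divergence(p, q))     # both >= 0, generally not equal (asymmetric)
```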


How to Determine the Weights in the Model?

  • Training data $T = \{(x_k, b_k)\}$ defines an empirical distribution $\tilde p(x,b)$; $q_\lambda(b|x)$ is the estimated model
  • Estimate $\lambda$ from the training data by minimizing the Kullback-Leibler divergence, defined as
$D(\tilde p\,\|\,q_\lambda) = \sum_{x}\sum_{b}\tilde p(x,b)\log\frac{\tilde p(b|x)}{q_\lambda(b|x)} = -\sum_{x}\sum_{b}\tilde p(x,b)\log q_\lambda(b|x) + \mathrm{constant}(\tilde p)$
  • Find $\lambda$ to maximize the 'entropy' — equivalently, maximize the log-likelihood
$L(q_\lambda) \equiv \sum_{x}\sum_{b}\tilde p(x,b)\log q_\lambda(b|x)$
  • Iteratively find $\lambda_i' = \lambda_i + \Delta\lambda_i$, with
$\Delta\lambda_i = \frac{1}{M}\log\!\left(\frac{\sum_{x,b}\tilde p(x,b)\,f_i(x,b)}{\sum_{x,b}\tilde p(x)\,q_\lambda(b|x)\,f_i(x,b)}\right)$
  • The objective function is convex, so the iterative process can reach the optimum
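As a rough alternative to the iterative-scaling update above, the sketch below fits the weights by plain gradient ascent on the same log-likelihood; it assumes predicates that fire only for the label b = 1, in which case the exponential model reduces to a logistic form (toy data and step size are made up):

```python
import numpy as np

def fit_maxent_weights(F, b, lr=0.5, iters=500):
    """Gradient ascent on the log-likelihood of q(b=1|x) = sigmoid(lambda . f(x)).

    F: (n_samples, n_predicates) 0/1 matrix of predicate activations.
    b: (n_samples,) 0/1 boundary labels.
    """
    lam = np.zeros(F.shape[1])
    for _ in range(iters):
        q1 = 1.0 / (1.0 + np.exp(-F @ lam))    # q(b=1|x) for every training sample
        grad = F.T @ (b - q1)                  # gradient of the log-likelihood w.r.t. lambda
        lam += lr * grad / len(b)
    return lam

# Toy training matrix: rows = candidate points, columns = predicates.
F = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]])
b = np.array([1, 1, 0, 0, 1, 0])
print(np.round(fit_maxent_weights(F, b), 2))   # larger weights on predicates that co-occur with boundaries
```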

The same model is used to select features:
  • Input: collection of candidate features, training samples, and the desired model size
  • Output: optimal subset of features and their corresponding exponential weights
  • The current model $q$ augmented with candidate feature $h$ with weight $\alpha$:
$q_{\alpha,h}(b\,|\,x) = \frac{q(b\,|\,x)\,e^{\alpha h(x,b)}}{Z_\alpha(x)}$
  • In each iteration, select the candidate which improves the current model the most:
$h^{*} = \arg\max_{h\in C}\ \sup_{\alpha}\big[D(\tilde p\,\|\,q) - D(\tilde p\,\|\,q_{\alpha,h})\big] = \arg\max_{h\in C}\ \sup_{\alpha}\big[L(q_{\alpha,h}) - L(q)\big]$
(reduction of divergence = increase of log-likelihood)


Optimal Features (from CNN news video)

no. | raw feature set | $\lambda$ | interpretation
1 | Anchor face | 0.4771 | An anchor face segment just starts after the candidate point
2 | Significant pause & non-commercial | 0.7471 | A significant pause within the non-commercial section appears in the surrounding observation window
3 | Pause | 0.2434 | An audio pause with duration larger than 2.0 seconds appears after the boundary point
4 | Significant pause | 0.7947 | The surrounding observation window has a significant pause with pitch jump intensity larger than the normalized pitch threshold 1.0 and pause duration larger than 0.5 second
5 | Speech segment | 0.3566 | A speech segment before the candidate point
6 | Speech segment | 0.3734 | A speech segment starts in the surrounding observation window
7 | Commercial | 1.0782 | A commercial starts 15 to 20 seconds after the candidate point
8 | Speech segment | 0.4127 | A speech segment ends after the candidate point
9 | Anchor face | 0.7251 | An anchor face segment occupies at least 10% of the next window
10 | Pause | 0.0939 | The surrounding observation window has a pause with duration larger than 0.25 second

Per-feature selection gains: 0.3879, 0.0160, 0.0058, 0.0024, 0.0022, 0.0019, 0.0016, 0.0015, 0.0015, 0.0008.

* The first 10 "A+V" features automatically discovered for the CNN channel.
Every modality helps: anchor face, prosody, and speech segment.

Issues of this model (Discussion)

  • Features: are binary predicates reasonable? Do they capture the unique characteristics?
  • Models: are exponential models with linear weights adequate? How about the learning algorithm?
  • Is there enough data to learn the probability models? Speed and complexity?

Recall the model and feature formulation under discussion:
$q_\lambda(b\,|\,x) = \frac{1}{Z_\lambda(x)}\exp\Big(\sum_i \lambda_i f_i(x,b)\Big)$
binary predicate: if x > threshold, then predict segment boundary (b = 1)


A Broader Perspective: Classification Paradigms

Discriminative: learn a discriminant function f(x) and a decision boundary directly in the feature space (x1, x2); classify by sign, with f(x) > 0 on one side of the boundary and f(x) < 0 on the other.
Generative: model the class-conditional likelihoods of the features and compare them against a threshold, e.g. is P(x|C=1) > P(x|C=2)?
Which one does the previous model fall into?

Generative Models

Gaussian Mixture Model


One common issue is to learn probability models

  • Gaussian distribution: given the same mean and variance, the Gaussian has the maximum entropy
  • The sum of a large number of small, independent random variables approaches a Gaussian

$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
  • $r = |x-\mu|/\sigma$ is the Mahalanobis distance from $x$ to $\mu$:
$\Pr[\,|x-\mu|\le\sigma\,]\approx 0.68,\quad \Pr[\,|x-\mu|\le 2\sigma\,]\approx 0.95,\quad \Pr[\,|x-\mu|\le 3\sigma\,]\approx 0.997$
  • Entropy of a Gaussian: $H_{gau} = 0.5 + \log(\sqrt{2\pi}\,\sigma)$ (compare with the uniform distribution)

Multivariate Gaussian $N(\boldsymbol\mu, \Sigma)$:
  • Covariance entries: $\Sigma_{ij} = \sigma_{ij} = \mathrm{cov}(x(i), x(j)) = E[(x(i)-\mu(i))(x(j)-\mu(j))]$
  • Density:
$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right)$
where $\mathbf{x}, \boldsymbol\mu$ are D-dimensional vectors, $\Sigma$ is a $D\times D$ matrix, and $|\Sigma|$ is the determinant of $\Sigma$
  • A diagonal covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ gives axis-aligned contours in the $(x_1, x_2)$ plane; a general $\Sigma$ gives rotated elliptical contours
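Evaluating this density is a few lines of numpy; the sketch below implements the formula directly, with no assumptions beyond the definition above:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """p(x) = exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu)) / ((2 pi)^{D/2} |Sigma|^{1/2})."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = x - mu
    D = len(mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d) / norm)

mu = np.array([0.0, 0.0])
sigma_diag = np.diag([1.0, 4.0])                    # diagonal covariance: axis-aligned contours
sigma_full = np.array([[1.0, 0.8], [0.8, 4.0]])     # general covariance: rotated contours
print(gaussian_pdf([1.0, 1.0], mu, sigma_diag))
print(gaussian_pdf([1.0, 1.0], mu, sigma_full))
```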


Effect of Linear Transformation

  • Linear transformation of a Gaussian: $\mathbf{y} = A^{T}\mathbf{x}$, with $\mathbf{x}$: $d\times 1$, $A$: $d\times k$, $\mathbf{y}$: $k\times 1$, gives
$\mathbf{y} \sim N(A^{T}\boldsymbol\mu,\ A^{T}\Sigma A)$
  • Eigen-decomposition (SVD): $\Sigma = \Phi\Lambda\Phi^{T}$, where $\Phi = [\phi_1|\phi_2|\cdots|\phi_d]$ has orthogonal eigenvector columns and $\Lambda$ is the diagonal matrix of eigenvalues ($\Phi$ is also the PCA transform)
  • Whitening transform: $A_w = \Phi\Lambda^{-1/2}$, so $\mathbf{y} = A_w^{T}\mathbf{x} \sim N(A_w^{T}\boldsymbol\mu,\ I)$, since $A_w^{T}\Sigma A_w = \Lambda^{-1/2}\Phi^{T}(\Phi\Lambda\Phi^{T})\Phi\Lambda^{-1/2} = I$

Mahalanobis Distance

  • Mahalanobis distance in 1-D: for $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp(-\frac{(x-\mu)^2}{2\sigma^2})$, the distance from $x$ to $\mu$ is $r = |x-\mu|/\sigma$, and
$\Pr[\,|x-\mu|\le\sigma\,] = \Pr[r\le 1]\approx 0.68,\quad \Pr[\,|x-\mu|\le 2\sigma\,] = \Pr[r\le 2]\approx 0.95,\quad \Pr[\,|x-\mu|\le 3\sigma\,] = \Pr[r\le 3]\approx 0.997$
  • Multi-dimensional case:
$p(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\right) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu)^{T}\Phi\Lambda^{-1}\Phi^{T}(\mathbf{x}-\boldsymbol\mu)\right) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}r^{2}\right)$
with $r^{2} = (\mathbf{x}-\boldsymbol\mu)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu) = \big(A_w^{T}(\mathbf{x}-\boldsymbol\mu)\big)^{T}\big(A_w^{T}(\mathbf{x}-\boldsymbol\mu)\big)$
  • $r$ is the Mahalanobis distance; it is also the Euclidean distance in the whitened (PCA) space


[Figure: the Mahalanobis distance from a point $\mathbf{x}$ to the mean $\boldsymbol\mu$ of a Gaussian.]
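A compact numpy sketch of the whitening transform and the Mahalanobis distance above (the covariance matrix is made up for illustration):

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])            # made-up covariance

# Sigma = Phi Lambda Phi^T; whitening transform A_w = Phi Lambda^{-1/2}.
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)
print(np.round(A_w.T @ Sigma @ A_w, 6))   # ~ identity: unit covariance after whitening

# Mahalanobis distance r: r^2 = (x-mu)^T Sigma^{-1} (x-mu),
# equal to the squared Euclidean norm of the whitened vector A_w^T (x - mu).
x = np.array([2.5, 2.5])
d = x - mu
r2 = d @ np.linalg.inv(Sigma) @ d
r2_whitened = np.sum((A_w.T @ d) ** 2)
print(r2, r2_whitened)                    # the two values agree
```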

Gaussian Used In Classification

  • MAP classifier: assign $\mathbf{x} \to C_i$ if $p(C_i\,|\,\mathbf{x}) \ge p(C_j\,|\,\mathbf{x})$ for all $j \ne i$, where
$p(C_j\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,C_j)\,p(C_j)}{p(\mathbf{x})}$  (posterior = likelihood $\times$ prior / evidence)
  • ML classification: with a uniform prior $p(C_j)$, this reduces to $C = \arg\max_{C_j} p(\mathbf{x}\,|\,C_j)$
  • The class-conditional likelihood $p(\mathbf{x}\,|\,C_j)$ can be modeled by Gaussians
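A minimal sketch of the MAP and ML rules with Gaussian class conditionals; the 1-D means, variances, and priors are toy values chosen only to show how a prior can change the decision:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Two classes with Gaussian likelihoods p(x|C) and priors p(C) (hypothetical values).
classes = {
    "C1": {"mu": 0.0, "var": 1.0, "prior": 0.7},
    "C2": {"mu": 2.0, "var": 1.0, "prior": 0.3},
}

def map_classify(x):
    # Posterior is proportional to likelihood * prior; the evidence p(x) cancels in the argmax.
    scores = {c: gauss(x, p["mu"], p["var"]) * p["prior"] for c, p in classes.items()}
    return max(scores, key=scores.get)

def ml_classify(x):
    # With a uniform prior, the rule reduces to the maximum-likelihood decision.
    scores = {c: gauss(x, p["mu"], p["var"]) for c, p in classes.items()}
    return max(scores, key=scores.get)

print(map_classify(1.2), ml_classify(1.2))   # near the boundary the prior flips the decision
```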


How to Estimate Gaussian Model Parameters?

  • Parameters $\theta = (\mu, \sigma^2)$; for one sample $x_k$ the log-likelihood is
$l_k = \ln P(x_k\,|\,\theta) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_k-\mu)^2}{2\sigma^2}$
  • Set the gradient of the total log-likelihood to zero:
$\nabla_\theta \ln P(x_k\,|\,\theta) = \begin{pmatrix} \dfrac{x_k-\mu}{\sigma^2} \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{(x_k-\mu)^2}{2(\sigma^2)^2} \end{pmatrix}, \qquad \sum_k \nabla_\theta \ln P(x_k\,|\,\theta) = 0$
$\Rightarrow\ \hat\mu = \frac{1}{n}\sum_k x_k, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat\mu)^2$
  • Multi-dimensional case, $\theta = (\boldsymbol\mu, \Sigma)$:
$\hat{\boldsymbol\mu} = \frac{1}{n}\sum_k \mathbf{x}_k, \qquad \hat\Sigma = \frac{1}{n}\sum_k (\mathbf{x}_k-\hat{\boldsymbol\mu})(\mathbf{x}_k-\hat{\boldsymbol\mu})^{T}$

  • ML estimator: mean -> sample mean, variance -> biased sample variance
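The estimates are just the sample mean and the biased sample covariance; a quick numpy check on synthetic data, assuming nothing beyond the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)   # n samples, D = 2

mu_hat = X.mean(axis=0)                       # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = (diff.T @ diff) / len(X)          # (1/n) sum_k (x_k - mu)(x_k - mu)^T  (biased)
print(np.round(mu_hat, 2))                    # close to the true mean
print(np.round(Sigma_hat, 2))                 # close to the true covariance
# np.cov(X.T, bias=True) gives the same biased estimate.
```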

Mixture Of Gaussians

  • Real distributions seldom follow a single Gaussian → use a mixture of Gaussians
  • $p(x) = \sum_z p(x,z) = \sum_z p(z)\,p(x\,|\,z) = \sum_z \pi_z\,N(x;\mu_z,\Sigma_z)$, where
$N(x;\mu_z,\Sigma_z) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_z|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x-\mu_z)^{T}\Sigma_z^{-1}(x-\mu_z)\right)$
  • Given data $x_1,\dots,x_N$, define the log-likelihood (two components with weights $\pi_0, \pi_1$):
$l = \log\prod_{n=1}^{N} p(x_n) = \sum_{n=1}^{N}\log\big(\pi_0\,N(x_n;\mu_0,\Sigma_0) + \pi_1\,N(x_n;\mu_1,\Sigma_1)\big)$
  • The posterior probability ('responsibility') of $x$ being generated by a specific component $i$:
$\tau_i = p(z=i\,|\,x,\theta)$, with $\theta = \{\pi, \mu_0, \mu_1, \Sigma_0, \Sigma_1\}$


E-M Optimization Method

  • Maximizing $l(\theta)$ directly is hard because of the log of a sum:
$l(\theta) = \sum_{n=1}^{N}\log\sum_z p(x_n, z\,|\,\theta)$
  • Instead, look at the improvement over the current estimate $\theta^{t}$, $\Delta l = l(\theta) - l(\theta^{t})$, and use Jensen's inequality
  • Jensen's inequality: if $f$ is concave, $f(E[x]) \ge E[f(x)]$; e.g. $\log\big(\sum_i p_i\,g_i\big) \ge \sum_i p_i\log g_i$ when $\sum_i p_i = 1$ (if $f$ is convex, $f(E[x]) \le E[f(x)]$)

Auxiliary Function in E-M
$\Delta l(\theta) = l(\theta) - l(\theta^{t}) = \sum_{n=1}^{N}\log p(x_n\,|\,\theta) - \sum_{n=1}^{N}\log p(x_n\,|\,\theta^{t}) = \sum_{n=1}^{N}\log\frac{p(x_n\,|\,\theta)}{p(x_n\,|\,\theta^{t})}$
$= \sum_{n=1}^{N}\log\sum_z \frac{p(x_n,z\,|\,\theta)}{p(x_n\,|\,\theta^{t})}$  (marginalization)
$= \sum_{n=1}^{N}\log\sum_z p(z\,|\,x_n,\theta^{t})\,\frac{p(x_n,z\,|\,\theta)}{p(x_n\,|\,\theta^{t})\,p(z\,|\,x_n,\theta^{t})}$
$\ge \sum_{n=1}^{N}\sum_z p(z\,|\,x_n,\theta^{t})\log\frac{p(x_n,z\,|\,\theta)}{p(x_n\,|\,\theta^{t})\,p(z\,|\,x_n,\theta^{t})} \equiv Q(\theta\,|\,\theta^{t})$  (Jensen's inequality)
  • Note there is no log-of-sum in $Q$, so taking derivatives is easier


E-M improves likelihood

  • The auxiliary function was derived from Jensen's inequality; now estimate $\theta^{t+1}$ by maximizing $Q$:
$\theta^{t+1} = \arg\max_\theta Q(\theta\,|\,\theta^{t})$
$Q(\theta\,|\,\theta^{t}) = \sum_{n=1}^{N}\sum_z p(z\,|\,x_n,\theta^{t})\,\log p(x_n,z\,|\,\theta) + \mathrm{const}$
(the expectation over the hidden variable $z$, with the current $\theta^{t}$, of the joint log-likelihood of the observed and hidden variables)
  • In the expectation step, compute $\tau$, the 'responsibility' of component $z$ for sample $x_n$
  • In the maximization step, take the derivative of $Q$ over $\theta$ and find the new estimate for $\theta$ (note only a sum of logs is involved)

EM Always Improves Likelihood
  • Why does EM always improve $l(\theta)$?
$\Delta l(\theta^{t+1}) = l(\theta^{t+1}) - l(\theta^{t}) \ge Q(\theta^{t+1}\,|\,\theta^{t})$
$Q(\theta^{t+1}\,|\,\theta^{t}) = \max_\theta Q(\theta\,|\,\theta^{t}) \ge Q(\theta^{t}\,|\,\theta^{t}) = 0$
$\therefore\ \Delta l(\theta^{t+1}) \ge 0$
  • General steps of EM: define the likelihood model with parameters $\theta$; identify the hidden variables $z$; derive the auxiliary function and the E and M equations; in each iteration, estimate the posteriors of the hidden variables and re-estimate the model parameters; repeat until the stopping criterion is met

Expectation-Maximization (E-M) Solution of GMM

  • EM for estimating $\theta$ and the responsibilities $\tau_i$; follow the 'divide and conquer' principle. In iteration step $t$:
  • Expectation — weight from component $i$ for sample $x_n$:
$\tau_n^{(t)}(i) = \frac{\pi_i^{(t)}\,N(x_n;\mu_i^{(t)},\Sigma_i^{(t)})}{\sum_j \pi_j^{(t)}\,N(x_n;\mu_j^{(t)},\Sigma_j^{(t)})}$
  • Maximization — divide the data among the groups, then compute each group's mean, covariance, and weight:
$\mu_i^{(t+1)} = \frac{\sum_n \tau_n^{(t)}(i)\,x_n}{\sum_n \tau_n^{(t)}(i)}, \qquad \Sigma_i^{(t+1)} = \frac{\sum_n \tau_n^{(t)}(i)\,(x_n-\mu_i^{(t+1)})(x_n-\mu_i^{(t+1)})^{T}}{\sum_n \tau_n^{(t)}(i)}, \qquad \pi_i^{(t+1)} = \frac{1}{N}\sum_n \tau_n^{(t)}(i)$
  • These are the E and M equations obtained from the auxiliary function
$Q(\theta\,|\,\theta^{t}) = \sum_{n=1}^{N}\sum_z p(z\,|\,x_n,\theta^{t})\,\log p(x_n,z\,|\,\theta) + \mathrm{const}$
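A compact numpy implementation of these E and M updates for a two-component, 1-D mixture (a sketch written for this summary with deliberately simple initialization and a fixed iteration count, not the course's code):

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns (pi, mu, var)."""
    pi = np.array([0.5, 0.5])
    mu = np.percentile(x, [25, 75]).astype(float)   # simple initialization at the quartiles
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: responsibilities tau[n, i] = pi_i N(x_n) / sum_j pi_j N(x_n).
        weighted = np.stack([pi[i] * gauss(x, mu[i], var[i]) for i in range(2)], axis=1)
        tau = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted mean, variance, and mixture weight of each component.
        Nk = tau.sum(axis=0)
        mu = (tau * x[:, None]).sum(axis=0) / Nk
        var = (tau * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 0.5, 700)])
print(em_gmm_1d(x))   # weights near (0.3, 0.7), means near (-2, 3)
```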

Discriminative Models


Simple Discriminative Classifier

[Figure: + and o samples in the (x(1), x(2)) plane, classified by a small decision tree that thresholds x(2) against TH1 and x(1) against TH2 to assign classes C+ and Co.]
  • Find the most opportunistic dimension in each step
  • Selection criterion: entropy or variance before / after the split
  • Stop criterion: avoid overfitting

Parametric Discriminant Analysis

  • Example discriminant functions, linear and quadratic:
$g(\mathbf{x}) = a\,x_1 + b\,x_2$
$g(\mathbf{x}) = a\,x_1^2 + b\,x_2^2 + c\,x_1 x_2$
[Figure: decision boundaries in the (x(1), x(2)) plane separating + and o samples, with $g(\mathbf{x}) > 0$ on one side and $g(\mathbf{x}) < 0$ on the other; the linear discriminant gives a straight boundary, the quadratic one a curved boundary.]


Linear Discriminant Classifiers

$g(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + w_0 \ \Rightarrow$ find the weight $\mathbf{w}$ and bias $w_0$
  • Augmented vectors:
$\mathbf{y} = \begin{bmatrix}1\\x_1\\\vdots\\x_d\end{bmatrix} = \begin{bmatrix}1\\\mathbf{x}\end{bmatrix}, \qquad \mathbf{a} = \begin{bmatrix}w_0\\w_1\\\vdots\\w_d\end{bmatrix} = \begin{bmatrix}w_0\\\mathbf{w}\end{bmatrix}, \qquad g(\mathbf{x}) = g(\mathbf{y}) = \mathbf{a}^{T}\mathbf{y}$
map $\mathbf{y}$ to class $\omega_1$ if $g(\mathbf{y}) > 0$, otherwise to class $\omega_2$
  • Design objective: $\forall\,\mathbf{y}_i$, $\mathbf{a}^{T}\mathbf{y}_i > b$ if class $\omega_1$ and $\mathbf{a}^{T}\mathbf{y}_i < -b$ if class $\omega_2$, with $b > 0$
  • Each $\mathbf{y}_i$ defines a half plane in the weight space (the $\mathbf{a}$-space)
  • Note we search for weight solutions in the $\mathbf{a}$-space

Support Vector Machine (tutorial by Burges ‘98)

  • Decision boundary $H$: $\mathbf{w}^{T}\mathbf{x} + b = 0$
  • Look for the separating plane with the highest margin
  • Linearly separable case:
$\mathbf{w}^{T}\mathbf{x}_i + b \ge +1$ for $\mathbf{x}_i$ in class 1 (i.e. label $y_i = +1$)
$\mathbf{w}^{T}\mathbf{x}_i + b \le -1$ for $\mathbf{x}_i$ in class 2 (i.e. label $y_i = -1$)
Inequality constraints: $y_i\,(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \ge 0,\ \forall i$
  • Two parallel hyperplanes define the margin: $H_1$: $\mathbf{w}^{T}\mathbf{x}_i + b = +1$ and $H_2$: $\mathbf{w}^{T}\mathbf{x}_i + b = -1$
  • Margin: the sum of the distances of the closest points to the separating plane, margin $= 2/\|\mathbf{w}\|$
  • The best plane is defined by $\mathbf{w}$ and $b$

Finding the maximal margin

  • Primal problem: minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ subject to the inequality constraints $y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \ge 0$, $i = 1,\dots,l$ ($y_i$ is the label)
  • Use the Lagrange multiplier technique for the constrained optimization problem:
$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{l}\alpha_i\big[y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1\big], \qquad \alpha_i \ge 0$
minimize $L_p$ w.r.t. $\mathbf{w}$ and $b$:
$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_{i=1}^{l}\alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \Rightarrow \sum_{i=1}^{l}\alpha_i y_i = 0$
  • Dual problem: maximize $L_D$ w.r.t. the $\alpha_i$ (a quadratic programming problem):
$L_D = \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j\,y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j \quad \text{with conditions } \sum_{i=1}^{l}\alpha_i y_i = 0,\ \alpha_i \ge 0$
  • Primal and dual have the same solutions of $\mathbf{w}$ and $b$
  • KKT conditions: if $\alpha_i > 0$, then $\mathbf{x}_i$ is on $H_1$ or $H_2$ and is a support vector
  • $\mathbf{w}^{*} = \sum_{i=1}^{l}\alpha_i y_i \mathbf{x}_i$: the weight sum from the positive class equals the weight sum from the negative class, and the direction of $\mathbf{w}$ is roughly from the negative support vectors to the positive ones
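A short scikit-learn sketch on toy data, reading the support vectors, the $\alpha_i y_i$ coefficients, and the $2/\|\mathbf{w}\|$ margin back out of a fitted linear SVM (assumes scikit-learn is available; the data are made up):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 0.5],     # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)            # very large C ~ hard margin

print(clf.support_vectors_)     # the points with alpha_i > 0
print(clf.dual_coef_)           # alpha_i * y_i for each support vector

# w = sum_i alpha_i y_i x_i, and the margin width 2 / ||w||.
w = clf.dual_coef_ @ clf.support_vectors_
print(w, clf.coef_)             # the two agree for a linear kernel
print(2.0 / np.linalg.norm(w))  # geometric margin
```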


Non-separable: not every sample can be correctly classified

  • Add slack variables $\xi_i \ge 0$ and relax the constraints to $y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1 - \xi_i$; if $\xi_i > 1$, then $\mathbf{x}_i$ is misclassified (i.e. a training error)
  • New objective function (Lagrange multiplier formulation): minimize $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$ subject to the relaxed constraints, while ensuring positivity of the $\xi_i$
  • KKT conditions for the non-separable solution:
If $\alpha_i = C$, then either $\xi_i > 0$: $\mathbf{x}_i$ is inside the margin or on the wrong side, or $\xi_i = 0$: $\mathbf{x}_i$ is on $H_1$ or $H_2$
If $0 < \alpha_i < C$, then $\xi_i = 0$: $\mathbf{x}_i$ is on $H_1$ or $H_2$


  • All the points located in the margin gap or on the wrong side get $\alpha_i = C$; points exactly on the margin have $0 < \alpha_i < C$
  • What if C increases? Both $\xi$ and $b$ decrease
  • When C increases, incorrect samples get more weight → the optimization tries harder to remove training errors → better training accuracy, but a smaller margin → less generalization performance

Generalized Linear Discriminant Functions

  • In general: $g(\mathbf{x}) = \sum_{i=1}^{\hat d} a_i\,\phi_i(\mathbf{x}) = \mathbf{a}^{T}\Phi(\mathbf{x})$
  • Include more than just the linear terms:
$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\,x_i x_j = w_0 + \mathbf{w}^{T}\mathbf{x} + \mathbf{x}^{T}W\mathbf{x}$
  • Examples:
$g(x) = a_1 + a_2 x + a_3 x^2 = [a_1\ a_2\ a_3]\,[1\ \ x\ \ x^2]^{T}$
$g(\mathbf{x}) = a_1 x_1 + a_2 x_2 + a_3 x_1 x_2 = [a_1\ a_2\ a_3]\,[x_1\ \ x_2\ \ x_1 x_2]^{T}$
  • Shape of the decision boundary: ellipsoid, hyperhyperboloid, lines, etc.
  • Data become separable in the higher-dimensional space, but learning parameters in high dimension is hard (curse of dimensionality); instead, try to maximize margins → SVM


Non-Linear Space

  • Map the data to a high dimensional space $\Phi(\mathbf{x})$ to make them separable
  • Find the SVM in the high-dimensional space; the dual becomes
$L_D = \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j\,y_i y_j\,\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j) = \sum_{i=1}^{l}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j\,y_i y_j\,K(\mathbf{x}_i,\mathbf{x}_j)$
  • Luckily, we don't have to compute $\mathbf{w} = \sum_i \alpha_i y_i\,\Phi(\mathbf{s}_i)$ nor $\Phi(\cdot)$ itself; with support vectors $\mathbf{s}_i$,
$g(\mathbf{x}) = \sum_{i=1}^{N_s}\alpha_i y_i\,\Phi(\mathbf{s}_i)\cdot\Phi(\mathbf{x}) + b$
  • Instead, we define the kernel $K(\mathbf{s},\mathbf{x}) = \Phi(\mathbf{s})\cdot\Phi(\mathbf{x})$, so
$g(\mathbf{x}) = \sum_{i=1}^{N_s}\alpha_i y_i\,K(\mathbf{s}_i,\mathbf{x}) + b$
  • We can use the same method to maximize $L_D$ to find the $\alpha_i$
  • Some popular kernels: polynomial, Gaussian Radial Basis Function (RBF), sigmoidal neural network
[Figure: a cubic polynomial example in which data that are not linearly separable become separable after the mapping.]
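A small sketch of the kernelized decision function $g(\mathbf{x}) = \sum_i \alpha_i y_i K(\mathbf{s}_i,\mathbf{x}) + b$ with a Gaussian RBF kernel; the support vectors, multipliers, labels, and bias below are made up purely to show the mechanics:

```python
import numpy as np

def rbf_kernel(s, x, gamma=0.5):
    """Gaussian RBF kernel K(s, x) = exp(-gamma * ||s - x||^2)."""
    return np.exp(-gamma * np.sum((s - x) ** 2))

def decision_function(x, support_vectors, alphas, labels, b, gamma=0.5):
    """g(x) = sum_i alpha_i y_i K(s_i, x) + b, never computing Phi explicitly."""
    return sum(a * y * rbf_kernel(s, x, gamma)
               for a, y, s in zip(alphas, labels, support_vectors)) + b

# Made-up support vectors, multipliers, labels, and bias (for illustration only).
S = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
alphas = np.array([0.8, 0.4, 1.2])
labels = np.array([-1, -1, +1])
b = 0.1

x = np.array([2.5, 2.5])
g = decision_function(x, S, alphas, labels, b)
print(g, "class +1" if g > 0 else "class -1")
```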


See the SVM demos (Eric)

Evaluation

  • Detection outcomes, comparing the returned results against the relevant ground truth:
A = relevant and returned (hits), B = irrelevant and returned (false alarms), C = relevant but not returned (misses), D = irrelevant and not returned (correct dismissals)
  • With $N$ test images, $K$ detected results, and $V_n = 1$ for a relevant image ($V_n = 0$ for an irrelevant one):
$A = \sum_{n=1}^{K} V_n, \quad B = \sum_{n=1}^{K}(1-V_n), \quad C = \sum_{n=K+1}^{N} V_n, \quad D = \sum_{n=K+1}^{N}(1-V_n)$
  • Recall: $R = A/(A+C)$
  • Precision: $P = A/(A+B)$
  • Fallout: $F = B/(B+D)$
  • Combined: $F_1 = \dfrac{P\cdot R}{(P+R)/2} = \dfrac{2PR}{P+R}$
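These counts and measures translate directly into numpy; the relevance list and cutoff below are toy values:

```python
import numpy as np

# V[n] = 1 if test image n is relevant, 0 otherwise; the first K items are the returned results.
V = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])   # N = 10 test images (toy ground truth)
K = 5                                           # number of detected (returned) results

A = V[:K].sum()            # relevant and returned (hits)
B = (1 - V[:K]).sum()      # irrelevant and returned (false alarms)
C = V[K:].sum()            # relevant but not returned (misses)
D = (1 - V[K:]).sum()      # irrelevant and not returned (correct dismissals)

recall = A / (A + C)
precision = A / (A + B)
fallout = B / (B + D)
f1 = 2 * precision * recall / (precision + recall)
print(A, B, C, D, recall, precision, fallout, f1)
```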


Evaluation Measures

  • 1. Precision-Recall curve (P vs. R)
  • 2. Receiver Operating Characteristic (ROC) curve: A (hit) vs. B (false alarm)

Evaluation Metric: Average Precision
  • Given a ranked list of data returned in response to a query, mark each rank $j$ with $I_j = 1$ if the item is in the relevant ground truth, and let $P_j$ be the precision of the top $j$ results; e.g., with relevant items at ranks 1, 3, and 4, the precisions at ranks 1–7 are 1/1, 1/2, 2/3, 3/4, 3/5, 3/6, 3/7
  • Average precision:
$AP = \frac{1}{R}\sum_{j} I_j\,P_j$, where $R$ is the total number of relevant data
  • AP measures the average of the precision values at the $R$ relevant data points


Evaluation Metric: Average Precision (2)

  • AP depends on the rankings of the relevant data and on the size of the relevant data set. E.g., with R = 10:
  • Case I: all relevant items ranked at the top (+ + + + + + + + + + - - - ...); the precision at every relevant rank is 1, so AP = 1
  • Case II: relevant and irrelevant items alternate (+ - + - + - ...); the precision at each relevant rank is about 1/2, so AP = 1/2
  • Case III: the relevant items are pushed below the irrelevant ones (- - ... + + ...); the precisions are 1/11, 2/12, ..., 10/20, so AP ≈ 0.3
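The three cases can be checked with a short helper that computes AP exactly as defined above; the relevance lists are constructed to match the cases (with the alternating case starting on an irrelevant item so that each relevant rank has precision exactly 1/2):

```python
import numpy as np

def average_precision(relevance, R):
    """AP = (1/R) * sum_j I_j * P_j, where P_j is the precision of the top j results."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.cumsum(relevance)
    precision_at_j = hits / np.arange(1, len(relevance) + 1)
    return float(np.sum(relevance * precision_at_j) / R)

R = 10
case1 = [1] * 10 + [0] * 10      # all relevant items at the top
case2 = [0, 1] * 10              # relevant and irrelevant alternate
case3 = [0] * 10 + [1] * 10      # all relevant items pushed down
print(average_precision(case1, R))   # 1.0
print(average_precision(case2, R))   # 0.5
print(average_precision(case3, R))   # ~0.33
```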

Example: SVM for News Story Segmentation
  • SVM with 195 binary features performs the best
  • SVM has excellent feature fusion capability
  • Predicate binarization shields noise in the features
[Figure: precision vs. recall curves comparing the SVM-based model, the Maximum Entropy model, and BST.]


Training / Validation / Testing

  • Appropriate if the same distributions are followed over the different sets
  • Training set: use this to optimize parameters
  • Validation set: select optimal models through validation
  • Testing set: evaluate performance over the test data
[Figure: the same (x(1), x(2)) feature space shown with separate training, validation, and testing samples.]

Training / Validation / Testing (cont.)

  • Cross validation, leave-one-out: split the data into folds 1, 2, ..., K; train on K-1 folds and test on the held-out fold
  • Rotate the choice of the test set and average the performance over the runs
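A brief scikit-learn sketch of this K-fold rotation; the data set and classifier here are placeholders, and any model with fit/score works the same way:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)   # synthetic placeholder data

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])   # train on K-1 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))        # test on the held-out fold

print(scores, np.mean(scores))   # per-fold accuracy and the average over the rotated test sets
```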


Curse of Dimensionality and Overtraining

  • Very rough rule of thumb: (# of training samples per class) / (feature dimension) > 10
  • A case of overfitting: training performance keeps improving while test performance degrades
[Figure: + and - samples in the (x(1), x(2)) plane with an overly complex decision boundary, and curves of training vs. test performance.]