


Linear Discriminant Functions

Jacob Hays, Amit Pillay, James DeFelice. Sections 5.8, 5.9, 5.11. 10/2/2008

Minimum Squared Error

Previous methods only worked on linearly separable cases, by looking at misclassified samples to correct the error.

MSE looks at all samples, using linear equations to find an estimate.


Minimum Squared Error

The x space is mapped to y space: for every sample xi in dimension d there is a yi of dimension d̂.

Find a vector a making all atyi > 0. Stack all samples yi into a matrix Y of dimension n x d̂ and solve Ya = b (b is a vector of positive constants).

b is our margin for error.

                =                            

n d nd n n d d

b b b a a a y y y y y y y y y ... ... . ... ... ... ... ... ...

2 1 1 1 2 21 20 1 11 10

Minimum Squared Error

Y is rectangular (n x d̂), so it does not have a direct inverse with which to solve Ya = b.

Ya − b = e gives the error; minimize it.

Square the error, ||e||2, take the gradient, and set the gradient to zero.

$$J_s(a) = \|Ya - b\|^2 = \sum_{i=1}^{n} \left(a^t y_i - b_i\right)^2$$

$$\nabla J_s = \sum_{i=1}^{n} 2\,\left(a^t y_i - b_i\right) y_i = 2\,Y^t (Ya - b) = 0$$

$$Y^t Y a = Y^t b$$


Minimum Squared Error

YtYa = Ytb gives a = (YtY)-1Ytb.

(YtY)-1Yt is the pseudo-inverse of Y, of dimension d̂ x n; it can be written as Y†.

Y†Y = I, but YY† ≠ I in general.

a = Y†b gives us a solution, with b acting as a margin.
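A minimal numpy sketch of this pseudo-inverse solution. The helper name mse_discriminant, the augmented mapping y = (1, x), and the negation of class-2 samples are assumptions consistent with the slides, not code taken from them.

```python
import numpy as np

def mse_discriminant(X1, X2, b=None):
    """MSE sketch: augment samples, negate class 2, solve a = Y† b."""
    n1, n2 = len(X1), len(X2)
    # Map x -> y = (1, x) and negate class-2 samples so that a^t y_i > 0 is wanted for all i.
    Y = np.vstack([
        np.hstack([np.ones((n1, 1)), X1]),
        -np.hstack([np.ones((n2, 1)), X2]),
    ])
    if b is None:
        b = np.ones(n1 + n2)       # all margins set to 1
    return np.linalg.pinv(Y) @ b   # a = Y† b, the MSE solution

# A new point x is then classified by the sign of a^t (1, x).
```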


Fisher’s Linear Discriminant

Based on projecting d-dimensional data onto a line.

The projection loses a lot of information, but some orientation of the line might give a good split.

y = wtx, ||w|| = 1

yi is the projection of xi onto the line w. Goal: find the best w to separate the classes. Highly overlapping data performs poorly.

Fisher’s Linear Discriminant

Mean of each class Di:

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$

A first choice of direction: w = (m1 − m2) / ||m1 − m2||


Fisher’s Linear Discriminant

Scatter matrices: SW = S1 + S2

$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t$$

$$w = S_W^{-1}(m_1 - m_2)$$
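A minimal numpy sketch of this direction; the function name fisher_direction and the final normalization to ||w|| = 1 are assumptions for illustration.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w = S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W = S_1 + S_2, with S_i = sum (x - m_i)(x - m_i)^t
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S1 + S2, m1 - m2)
    return w / np.linalg.norm(w)   # normalize so ||w|| = 1

# Projections onto the line: y_i = w^t x_i, e.g. y1 = X1 @ fisher_direction(X1, X2)
```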

Fisher’s Relation to MSE

MSE and Fisher are equivalent for a specific choice of b:

  • ni = number of samples in Di
  • 1i is a column vector of ni ones

$$b = \begin{pmatrix} \frac{n}{n_1}\,1_1 \\[4pt] \frac{n}{n_2}\,1_2 \end{pmatrix}, \qquad
Y = \begin{pmatrix} 1_1 & X_1 \\ -1_2 & -X_2 \end{pmatrix}, \qquad
a = \begin{pmatrix} w_0 \\ w \end{pmatrix}$$

Plug into YtYa = Ytb:

$$\begin{pmatrix} 1_1^t & -1_2^t \\ X_1^t & -X_2^t \end{pmatrix}
\begin{pmatrix} 1_1 & X_1 \\ -1_2 & -X_2 \end{pmatrix}
\begin{pmatrix} w_0 \\ w \end{pmatrix}
=
\begin{pmatrix} 1_1^t & -1_2^t \\ X_1^t & -X_2^t \end{pmatrix}
\begin{pmatrix} \frac{n}{n_1}\,1_1 \\[4pt] \frac{n}{n_2}\,1_2 \end{pmatrix}$$

Solving for w gives

$$w = \alpha\, n\, S_W^{-1}(m_1 - m_2)$$

so the MSE weight vector points in the Fisher direction.
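A small numeric check of this equivalence, sketched on assumed toy data; it builds Y, b and a exactly as above and compares the direction of the weight part of a with the Fisher direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(30, 2))   # toy samples for class D1
X2 = rng.normal([3.0, 2.0], 1.0, size=(40, 2))   # toy samples for class D2
n1, n2 = len(X1), len(X2)
n = n1 + n2

# Fisher direction w_F = S_W^{-1} (m1 - m2)
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fisher = np.linalg.solve(Sw, m1 - m2)

# MSE with the special b: n/n1 on class-1 rows, n/n2 on the (negated) class-2 rows
Y = np.vstack([np.hstack([np.ones((n1, 1)), X1]),
               -np.hstack([np.ones((n2, 1)), X2])])
b = np.concatenate([np.full(n1, n / n1), np.full(n2, n / n2)])
a = np.linalg.pinv(Y) @ b

# The weight part a[1:] is proportional to the Fisher direction (|cosine| ≈ 1)
cos = a[1:] @ w_fisher / (np.linalg.norm(a[1:]) * np.linalg.norm(w_fisher))
print(abs(cos))
```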


Relation to Optimal Discriminant

If you set b = 1n, the MSE solution approaches the optimal Bayes discriminant g0 as the number of samples approaches infinity (see Section 5.8.3).

$$g_0(x) = P(\omega_1 \mid x) - P(\omega_2 \mid x)$$

g(x) is the MSE estimate of g0(x).

Widrow-Hoff / LMS

LMS (Least Mean Squares) still yields a solution when YtY is singular.

initialize a, b, threshold θ, step η(·), k = 0
begin
  do k ← (k + 1) mod n
     a ← a + η(k)(bk − atyk) yk
  until |η(k)(bk − atyk) yk| < θ
  return a
end
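A runnable sketch of this single-sample LMS rule. The decreasing step η(k) = η0/k, the starting point a = 0, and the iteration cap are assumptions; the slide leaves η(·) and the initialization unspecified.

```python
import numpy as np

def lms(Y, b, eta0=0.1, theta=1e-4, max_iter=10000):
    """Widrow-Hoff / LMS: repeat a += η(k) (b_k - a^t y_k) y_k over the samples."""
    n, d = Y.shape
    a = np.zeros(d)
    for k in range(1, max_iter + 1):
        i = k % n                                # cycle through the samples
        eta = eta0 / k                           # assumed decreasing step size
        update = eta * (b[i] - a @ Y[i]) * Y[i]
        a = a + update
        if np.linalg.norm(update) < theta:       # corrections have become negligible
            return a
    return a
```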


Widrow-Hoff / LMS

LMS is not guaranteed to converge to a separating hyperplane, even if one exists.

Procedural differences

Perceptron, relaxation

  • If samples linearly separable, we can find a solution
  • Otherwise, we do not converge to a solution

MSE

  • Always yields a weight vector
  • May not be the best solution
  • Not guaranteed to be a separating vector


Choosing b

For an arbitrary b, MSE minimizes ||Ya − b||2. If the samples are linearly separable, we can choose b more cleverly:

  • Define â and b̂ such that Yâ = b̂ > 0
  • Every component of b̂ is positive

Modified MSE

Js(a, b) = ||Ya − b||2, with both a and b allowed to vary, subject to b > 0.

The minimum of Js is zero, and the a that achieves it is a separating vector.


Ho-Kashyap / Descent Procedure

For any b:

  – Must avoid b = 0
  – Must avoid b < 0

$$\nabla_a J_s = 2\,Y^t (Ya - b)$$

$$\nabla_b J_s = -2\,(Ya - b)$$

Setting a = Y†b makes ∇aJs = 0. So, we're done? No...

Ho-Kashyap / Descent Procedure

Pick a positive b. Don't allow reduction of b's components: set all positive components of ∇bJs to zero.

  • b(k+1) = b(k) − ηc, where

$$c = \tfrac{1}{2}\left(\nabla_b J_s - |\nabla_b J_s|\right)
= \begin{cases} \nabla_b J_s & \text{if } \nabla_b J_s \le 0 \text{ (componentwise)} \\ 0 & \text{otherwise} \end{cases}$$


Ho-Kashyap / Descent Procedure

$$\nabla_b J_s = -2\,(Ya - b), \qquad e = Ya - b$$

$$b(k+1) = b(k) - \eta\,\tfrac{1}{2}\left[\nabla_b J_s - |\nabla_b J_s|\right]$$

$$e^+(k) = \tfrac{1}{2}\left(e(k) + |e(k)|\right)$$

$$b(k+1) = b(k) + 2\eta\, e^+(k), \qquad a(k) = Y^\dagger b(k)$$

Ho-Kashyap

Algorithm 11

  • begin initialize a, b > 0, η(·) < 1, threshold bmin, kmax
       do k ← (k + 1) mod n
          e ← Ya − b
          e+ ← ½(e + abs(e))
          b ← b + 2η(k) e+
          a ← Y†b
          if abs(e) ≤ bmin then return a, b and exit
       until k = kmax
       print "NO SOLUTION"
  • end

When e(k) = 0 we have a solution. When e(k) ≤ 0 (with some component nonzero), the samples are not linearly separable.
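A minimal numpy sketch of this procedure. The initial b of ones, the fixed η = 0.5, and the thresholds are assumptions; the slide only requires b > 0 and η(·) < 1.

```python
import numpy as np

def ho_kashyap(Y, eta=0.5, b_min=1e-3, k_max=10000):
    """Ho-Kashyap sketch: grow b by the positive part of the error, refit a = Y† b."""
    n, _ = Y.shape
    Y_pinv = np.linalg.pinv(Y)           # Y† computed once
    b = np.ones(n)                       # positive starting margins
    a = Y_pinv @ b
    for k in range(1, k_max + 1):
        e = Y @ a - b                    # error vector e = Ya - b
        e_plus = 0.5 * (e + np.abs(e))   # keep only the positive components
        b = b + 2 * eta * e_plus         # never decreases any component of b
        a = Y_pinv @ b
        if np.all(np.abs(e) <= b_min):   # e ≈ 0: separating vector found
            return a, b
        if np.all(e <= 0) and np.any(e < 0):
            break                        # e ≤ 0 and nonzero: not linearly separable
    print("NO SOLUTION")
    return a, b
```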


Convergence (separable case)

If 0 < η < 1 and the samples are linearly separable:

  • A solution vector exists
  • We will find it in a finite number of steps k

Two possibilities:

  • e(k) = 0 for some finite k0
  • e(k) is never zero

If e(k0) = 0:

  • a(k), b(k), e(k) stop changing
  • Ya(k) = b(k) > 0 for all k > k0
  • If we reach such a k0, the algorithm terminates with a solution vector

Convergence (separable)

Now suppose e(k) is never zero for any finite k. If the samples are linearly separable:

  • Ya = b for some a and some b > 0

Because b is positive, either

  • e(k) is zero, or
  • e(k) has a positive component

Since e(k) cannot be zero (first bullet), it must have a positive component.


Convergence (separable)

$$\tfrac{1}{4}\left(\|e_k\|^2 - \|e_{k+1}\|^2\right) = \eta(1-\eta)\,\|e_k^+\|^2 + \eta^2\, e_k^{+t}\, Y Y^\dagger e_k^+$$

YY† is symmetric and positive semi-definite, and 0 < η < 1.

  • Therefore ||ek||2 > ||ek+1||2 if 0 < η < 1

||e|| will eventually converge to zero, and a will eventually converge to a solution vector.

Convergence (non-separable)

If the samples are not linearly separable, we may obtain a nonzero error vector without positive components.

We still have

$$\tfrac{1}{4}\left(\|e_k\|^2 - \|e_{k+1}\|^2\right) = \eta(1-\eta)\,\|e_k^+\|^2 + \eta^2\, e_k^{+t}\, Y Y^\dagger e_k^+$$

so the limiting ||e|| cannot be zero; it will converge to a non-zero value.

Convergence says that

  • ek+ = 0 for some finite k (separable case)
  • ek+ converges to zero while ||e|| stays bounded away from zero (non-separable case)


Support Vector Machines (SVMs)

Representing the data in a higher-dimensional space, an SVM constructs a separating hyperplane in that space, one which maximizes the margin between the two data sets.


Applications

  • Face detection, verification, and recognition
  • Object detection and recognition
  • Handwritten character and digit recognition
  • Text detection and categorization
  • Speech and speaker verification and recognition
  • Information and image retrieval

Formalization

We are given some training data, a set of points of the form (xi, yi), where yi ∈ {−1, +1} indicates the class of the point xi.

Equation of the separating hyperplane: w · x − b = 0

The vector w is a normal vector to the hyperplane. The parameter b/||w|| determines the offset of the hyperplane from the origin along the normal vector.


Formalization cont…

Define two hyperplanes given by the equations:

H1: w · x − b = 1
H2: w · x − b = −1

These hyperplanes are defined in such a way that no points lie between them.

To prevent data points from falling between these hyperplanes, the following two constraints are imposed:

w · xi − b ≥ 1 for xi of the first class
w · xi − b ≤ −1 for xi of the second class

Formulation cont…

This can be rewritten as: yi (w · xi − b) ≥ 1 for all 1 ≤ i ≤ n.

So the formulation of the optimization problem is:

  • Choose w, b to minimize ||w||
  • subject to yi (w · xi − b) ≥ 1 for all i
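For reference, a standard equivalent statement of this primal problem (the squared norm and the factor ½ are an assumed, equivalent rewriting that matches the Lagrangian on the next slide):

$$\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i - b) \ge 1,\quad i = 1, \dots, n$$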


SVM Hyperplane Example

SVM Training

Lagrange optimization problem. With multipliers αi ≥ 0, the reformulated optimization problem is given as:

$$L_P = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i\,(w \cdot x_i - b) - 1 \right]$$

Thus the new optimization problem is to minimize LP with respect to w and b, subject to αi ≥ 0.


SVM Training cont…

Dual of the Lagrange formulation. In the dual, the gradient of LP with respect to w and b vanishes, giving w = Σi αi yi xi and Σi αi yi = 0. Substituting back, the dual is:

$$L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)$$

The dual optimization problem is to maximize LD subject to αi ≥ 0 and Σi αi yi = 0.

From the above we have w = Σi αi yi xi, and LD depends on the data only through inner products of the input points.

Most of the points have αi equal to zero; the points for which αi is not zero are the closest points to the separating hyperplane. These points are called support vectors.
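A small illustrative sketch using scikit-learn's linear SVC (an assumed tool choice, not part of the slides) to fit a maximum-margin hyperplane and inspect the support vectors. Note that scikit-learn writes the hyperplane as w · x + b = 0.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)),    # toy class -1
               rng.normal(4.0, 1.0, (20, 2))])   # toy class +1
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e3)   # large C approximates the hard-margin problem
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("margin width:", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)  # the points with nonzero alpha
```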


Advantages & Disadvantages of SVM

Advantages

  • Gives high generalization performance
  • Complexity of the SVM classifier is characterized by the number of support vectors rather than the dimensionality of the transformed space.

Disadvantages

  • The training time scales somewhere between quadratic and cubic with respect to the number of training samples

Recognition of 3D Objects

The experiment involved recognition of 3D objects from the COIL database.

Each COIL image is transformed into an eight-bit vector of 32 × 32 = 1024 components.


References

Pontil, M. and Verri, A., “Support vector machines for 3D object recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 6, June 1998, pp. 637–646.

Burges, C. J. C., “A tutorial on support vector machines for pattern recognition,” 1998.

Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification (2nd Edition), Wiley, 2001.

Schalkoff, R. J., Pattern Recognition: Statistical, Structural, and Neural Approaches, Wiley, 1992.