The Covariance Matrix (insertion)

  1. Slide 1: Feature Selection: Linear Transformations
     $y_{new} = M\, x_{old}$
     Slide 3: Constraint Optimization (insertion)
     Problem: Given an objective function $f(x)$ to be optimized, with constraints $h_k(x) = c_k$. Moving the constants to the left gives $g_k(x) := h_k(x) - c_k = 0$. $f(x)$ and the $g_k(x)$ must have continuous first partial derivatives.
     A solution: Lagrange multipliers
     $0 = \nabla_x f(x) + \sum_k \lambda_k \nabla_x g_k(x)$,
     or, starting from the Lagrangian $L(x, \lambda) = f(x) + \sum_k \lambda_k g_k(x)$, requiring $\nabla_x L(x, \lambda) = 0$.
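
A small worked example of the multiplier method (not from the slides, added for illustration): maximize $f(x) = x_1 x_2$ subject to the single constraint $h(x) = x_1 + x_2 = 1$, i.e. $g(x) = x_1 + x_2 - 1$.

```latex
% Lagrangian for f(x) = x_1 x_2 with the constraint g(x) = x_1 + x_2 - 1 = 0:
L(x, \lambda) = x_1 x_2 + \lambda (x_1 + x_2 - 1)

% Setting the partial derivatives to zero:
\partial L / \partial x_1 = x_2 + \lambda = 0
\partial L / \partial x_2 = x_1 + \lambda = 0
\partial L / \partial \lambda = x_1 + x_2 - 1 = 0

% => x_1 = x_2 = -\lambda; the constraint then gives x_1 = x_2 = 1/2, \lambda = -1/2,
% so the constrained maximum is f(1/2, 1/2) = 1/4.
```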

  2. Slide 4: The Covariance Matrix (insertion)
     Definition: Let $x = (x_1, ..., x_N) \in \mathbb{R}^N$ be a real-valued random variable (data vector) with expected mean $E[x] = \mu$. We define the covariance matrix $\Sigma_x$ of the random variable $x$ as
     $\Sigma_x := E[(x - \mu)(x - \mu)^T]$, with matrix elements $\Sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$.
     Application: estimating $E[x]$ and $E[(x - E[x])(x - E[x])^T]$ from data. We assume $m$ samples of the random variable $x \in \mathbb{R}^N$, that is, a set of $m$ vectors $\{x_1, ..., x_m\} \subset \mathbb{R}^N$, or, put into a data matrix, $X \in \mathbb{R}^{N \times m}$. The maximum likelihood estimators for $\mu$ and $\Sigma_x$ are
     $\mu_{ML} = \frac{1}{m} \sum_{k=1}^{m} x_k$ and $\Sigma_{ML} = \frac{1}{m} \sum_{k=1}^{m} (x_k - \mu_{ML})(x_k - \mu_{ML})^T = \frac{1}{m} X X^T$ (the last equality for mean-free data).
     Slide 5: KLT/PCA Motivation
     • Find meaningful "directions" in correlated data
     • Linear dimensionality reduction
     • Visualization of higher-dimensional data
     • Compression / noise reduction
     • PDF estimation
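
A minimal NumPy sketch of the two ML estimators above; the dimensions, the random seed, and the true mean and covariance used to generate the sample are made up for illustration:

```python
import numpy as np

# Hypothetical example: m = 500 samples of an N = 3 dimensional random vector,
# stored column-wise in a data matrix X of shape (N, m), as on the slide.
rng = np.random.default_rng(0)
N, m = 3, 500
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=[[2.0, 0.6, 0.0],
                                 [0.6, 1.0, 0.3],
                                 [0.0, 0.3, 0.5]],
                            size=m).T                # shape (N, m)

# Maximum likelihood estimators from the slide:
mu_ml = X.mean(axis=1, keepdims=True)                # (1/m) * sum_k x_k
Xc = X - mu_ml                                       # mean-free data
Sigma_ml = (Xc @ Xc.T) / m                           # (1/m) * sum_k (x_k - mu)(x_k - mu)^T

print(mu_ml.ravel())
print(Sigma_ml)
```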

  3. Slide 7: Karhunen-Loève Transform: 1st Derivation
     Problem: Let $x = (x_1, ..., x_N) \in \mathbb{R}^N$ be a feature vector of zero-mean, real-valued random variables. We seek the direction $a_1$ of maximum variance, i.e. $y_1 = a_1^T x$ such that $E[y_1^2]$ is maximum, with the constraint $a_1^T a_1 = 1$.
     This is a constrained optimization, so we use the Lagrangian:
     $L(a_1, \lambda_1) = E[a_1^T x x^T a_1] - \lambda_1 (a_1^T a_1 - 1) = a_1^T \Sigma_x a_1 - \lambda_1 (a_1^T a_1 - 1)$, where $\lambda_1$ is the Lagrange multiplier.
     Slide 8: Karhunen-Loève Transform
     For $E[y_1^2]$ to be maximum, set $\partial L(a_1, \lambda_1) / \partial a_1 = 0$
     $\Rightarrow \Sigma_x a_1 - \lambda_1 a_1 = 0$
     $\Rightarrow a_1$ must be an eigenvector of $\Sigma_x$ with eigenvalue $\lambda_1$.
     Then $E[y_1^2] = a_1^T \Sigma_x a_1 = \lambda_1$, so for $E[y_1^2]$ to be maximum, $\lambda_1$ must be the largest eigenvalue.
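
The claim that the maximizer is the leading eigenvector can also be checked numerically; a small sketch (the example covariance matrix is chosen arbitrarily):

```python
import numpy as np

# Check numerically that the unit vector maximizing E[y_1^2] = a^T Sigma_x a
# is the eigenvector of Sigma_x with the largest eigenvalue.
rng = np.random.default_rng(1)
Sigma_x = np.array([[3.0, 1.0], [1.0, 2.0]])     # an example covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma_x)       # eigh: eigenvalues in ascending order
a1 = eigvecs[:, -1]                              # eigenvector of the largest eigenvalue
var_a1 = a1 @ Sigma_x @ a1                       # equals lambda_1

# Compare against many random unit directions: none should exceed lambda_1.
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
variances = np.einsum('ij,jk,ik->i', directions, Sigma_x, directions)

print(var_a1, eigvals[-1], variances.max())      # var_a1 == lambda_1 >= all others
```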

  4. Slide 9: Karhunen-Loève Transform
     Now we search for a second direction $a_2$ such that $y_2 = a_2^T x$ with $E[y_2^2]$ maximum, $a_2^T a_1 = 0$ and $a_2^T a_2 = 1$.
     Similar derivation (one way to fill in the omitted steps is sketched after this item): $L(a_2, \lambda_2) = a_2^T \Sigma_x a_2 - \lambda_2 (a_2^T a_2 - 1)$ with $a_2^T a_1 = 0$
     $\Rightarrow a_2$ must be the eigenvector of $\Sigma_x$ associated with the second largest eigenvalue $\lambda_2$.
     We can derive $N$ orthonormal directions that maximize the variance: $A = [a_1, a_2, ..., a_N]$ and $y = A^T x$, i.e. $x = \sum_{i=1}^{N} y_i a_i$. The resulting matrix $A$ is known as Principal Component Analysis (PCA) or the Karhunen-Loève transform (KLT).
     Slide 10: Karhunen-Loève Transform: 2nd Derivation
     Problem: Let $x = (x_1, ..., x_N) \in \mathbb{R}^N$ be a feature vector of zero-mean, real-valued random variables. We seek a transformation $A$ of $x$ that yields a new set of variables (feature vectors) $y = A^T x$ which are uncorrelated, i.e. $E[y_i y_j] = 0$ for $i \neq j$.
     • Let $y = A^T x$. By the definition of the correlation matrix, $R_y = E[y y^T] = E[A^T x x^T A] = A^T R_x A$.
     • $R_x$ is symmetric, so its eigenvectors are mutually orthogonal.
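
The "similar derivation" above is only sketched on the slide; one way to fill it in (not on the slides) is to add the orthogonality constraint with a second multiplier $\beta$:

```latex
% Lagrangian with both constraints a_2^T a_2 = 1 and a_2^T a_1 = 0:
L(a_2, \lambda_2, \beta) = a_2^T \Sigma_x a_2 - \lambda_2 (a_2^T a_2 - 1) - \beta\, a_2^T a_1

% Setting the gradient with respect to a_2 to zero:
2 \Sigma_x a_2 - 2 \lambda_2 a_2 - \beta a_1 = 0

% Multiplying from the left by a_1^T and using a_1^T a_2 = 0 and
% a_1^T \Sigma_x a_2 = (\Sigma_x a_1)^T a_2 = \lambda_1 a_1^T a_2 = 0 gives \beta = 0,
% hence \Sigma_x a_2 = \lambda_2 a_2: a_2 is again an eigenvector, and maximizing
% E[y_2^2] = \lambda_2 under the orthogonality constraint selects the second largest eigenvalue.
```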

  5. Slide 11: Karhunen-Loève Transform
     If we choose $A$ such that its columns $a_i$ are the orthonormal eigenvectors of $R_x$, we get
     $R_y = A^T R_x A = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_N)$.
     • If we further assume $R_x$ to be positive definite, the eigenvalues $\lambda_i$ will all be positive.
     The resulting matrix $A$ is known as the Karhunen-Loève transform (KLT): $y = A^T x$, $x = \sum_{i=1}^{N} y_i a_i$.
     Slide 12: Karhunen-Loève Transform
     The Karhunen-Loève transform (KLT): $y = A^T x$, $x = \sum_{i=1}^{N} y_i a_i$.
     For mean-free vectors (e.g. replace $x$ by $x - E[x]$) this process diagonalizes the covariance, i.e. $\Sigma_y$ is diagonal (a small sketch of this check follows this item).
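
A short NumPy sketch of this diagonalization; the mixing matrix and sample size are arbitrary illustration values:

```python
import numpy as np

# Build the full KLT/PCA matrix A from the eigenvectors of the estimated
# correlation matrix and check that R_y = A^T R_x A is (approximately) diagonal.
rng = np.random.default_rng(2)
N, m = 4, 2000
M = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 1.0, 0.7],
              [0.0, 0.0, 0.0, 1.0]])
X = M @ rng.normal(size=(N, m))                   # correlated data
X -= X.mean(axis=1, keepdims=True)                # make it zero mean

R_x = (X @ X.T) / m                               # estimated correlation/covariance matrix
eigvals, A = np.linalg.eigh(R_x)                  # columns of A: orthonormal eigenvectors
A = A[:, ::-1]                                    # sort by descending eigenvalue

Y = A.T @ X                                       # transformed features y = A^T x
R_y = (Y @ Y.T) / m                               # should be (close to) diagonal
print(np.round(R_y, 3))
```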

  6. Slide 13: KLT Properties: MSE Approximation
     We define a new vector $\hat{x}$ in an $m$-dimensional subspace ($m < N$) using only $m$ basis vectors: $\hat{x} = \sum_{i=1}^{m} y_i a_i$, i.e. the projection of $x$ onto the subspace spanned by the $m$ used (orthonormal) eigenvectors.
     Now, what is the expected mean square error between $x$ and its projection $\hat{x}$?
     $E[\|x - \hat{x}\|^2] = E\big[\big\|\sum_{i=m+1}^{N} y_i a_i\big\|^2\big] = E\big[\sum_{i>m} \sum_{j>m} (y_i a_i)^T (y_j a_j)\big]$
     Slide 14: KLT Properties: MSE Approximation
     $E[\|x - \hat{x}\|^2] = \dots = \sum_{i=m+1}^{N} E[y_i^2] = \sum_{i=m+1}^{N} \lambda_i$
     The error is minimized if we choose as basis the eigenvectors corresponding to the $m$ largest eigenvalues of the correlation matrix.
     • Among all possible orthogonal transforms, the KLT is the one leading to the minimum MSE.
     This form of the KLT (as presented here) is also referred to as Principal Component Analysis (PCA). The principal components are the eigenvectors ordered (descending) by their respective eigenvalue magnitudes $\lambda_i$.
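
A sketch of this MSE property; the component variances below are arbitrary illustration values. We reconstruct with the $m$ leading components and compare the empirical MSE with the sum of the discarded eigenvalues.

```python
import numpy as np

# Empirical check: the mean square error of an m-term KLT reconstruction
# equals the sum of the discarded eigenvalues.
rng = np.random.default_rng(3)
N, n_samples, m = 5, 5000, 2
X = np.diag(np.sqrt([5.0, 3.0, 1.0, 0.5, 0.1])) @ rng.normal(size=(N, n_samples))
X -= X.mean(axis=1, keepdims=True)

R_x = (X @ X.T) / n_samples
eigvals, A = np.linalg.eigh(R_x)
A, eigvals = A[:, ::-1], eigvals[::-1]            # descending eigenvalue order

Y = A.T @ X
X_hat = A[:, :m] @ Y[:m, :]                       # keep only the m largest components

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))   # E[ ||x - x_hat||^2 ]
print(mse, eigvals[m:].sum())                     # the two numbers agree up to round-off
```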

  7. Slide 15: KLT Properties: Total Variance
     Let, w.l.o.g., $E[x] = 0$ and $y = A^T x$ be the KLT (PCA) of $x$.
     • From the previous definitions we get $\sigma_{y_i}^2 = E[y_i^2] = \lambda_i$,
     • i.e. the eigenvalues of the input covariance matrix are equal to the variances of the transformed coordinates.
     Selecting the features corresponding to the $m$ largest eigenvalues retains the maximal possible total variance (sum of component variances) associated with the original random variables $x_i$.
     Slide 16: KLT Properties: Entropy
     For a random vector $y$, the entropy $H_y = -E[\ln p_y(y)]$ is a measure of the randomness of the underlying process.
     Example: for a zero-mean $m$-dimensional Gaussian,
     $H_y = -E\big[\ln\big((2\pi)^{-m/2} |\Sigma_y|^{-1/2} \exp(-\tfrac{1}{2} y^T \Sigma_y^{-1} y)\big)\big] = \tfrac{m}{2} \ln(2\pi) + \tfrac{1}{2} \ln|\Sigma_y| + \tfrac{1}{2} E[y^T \Sigma_y^{-1} y]$,
     and since $E[y^T \Sigma_y^{-1} y] = E[\mathrm{trace}(\Sigma_y^{-1} y y^T)] = \mathrm{trace}(\Sigma_y^{-1} E[y y^T]) = \mathrm{trace}(I_m) = m$,
     $H_y = \tfrac{m}{2} \ln(2\pi) + \tfrac{m}{2} + \tfrac{1}{2} \sum_{i=1}^{m} \ln \lambda_i$.
     Selecting the features corresponding to the $m$ largest eigenvalues maximizes the entropy in the remaining features. No wonder: variance and randomness are directly related!
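
A tiny numeric illustration of the total-variance statement (the covariance matrix is made up): the trace of $\Sigma_x$ equals the sum of its eigenvalues, so keeping the $m$ largest $\lambda_i$ keeps the largest possible share of the total variance.

```python
import numpy as np

# Total variance = trace(Sigma_x) = sum of the eigenvalues.
Sigma_x = np.array([[4.0, 1.0, 0.0],
                    [1.0, 2.0, 0.5],
                    [0.0, 0.5, 1.0]])
eigvals = np.linalg.eigvalsh(Sigma_x)[::-1]       # descending order
print(np.trace(Sigma_x), eigvals.sum())           # identical up to round-off
print(eigvals[:2].sum() / eigvals.sum())          # fraction of variance kept with m = 2
```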

  8. Slide 17: Computing a PCA
     Problem: Given mean-free data $X$, a set of $n$ feature vectors $x_i \in \mathbb{R}^m$, compute the orthonormal eigenvectors $a_i$ of the correlation matrix $R_x$.
     • There are many algorithms that can compute the eigenvectors of a matrix very efficiently. However, most of these methods can be very unstable in certain special cases.
     • Here we present SVD, a method that is in general not the most efficient one. However, it can be made numerically stable very easily (a small sketch of PCA via SVD follows this item).
     Slide 18: Computing a PCA
     Singular Value Decomposition: an excursus to linear algebra (without proofs).
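
A sketch of the PCA-via-SVD route suggested here, assuming the slide's layout of $X$ with feature vectors as columns: if the mean-free $X = U S V^T$, then $R_x = \frac{1}{n} X X^T = U \frac{S^2}{n} U^T$, so the columns of $U$ are the sought eigenvectors $a_i$. The shapes and sample data are illustrative only.

```python
import numpy as np

# PCA via SVD of the mean-free data matrix X (shape: features x samples).
rng = np.random.default_rng(4)
m_dim, n = 6, 300                                  # feature dimension and sample count
X = rng.normal(size=(m_dim, n))
X -= X.mean(axis=1, keepdims=True)                 # make the data mean free

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # reduced SVD
A = U                                              # principal directions a_i
eigvals = s ** 2 / n                               # variances lambda_i

# Cross-check against a direct eigendecomposition of R_x.
lam = np.linalg.eigvalsh((X @ X.T) / n)
print(np.allclose(np.sort(lam), np.sort(eigvals)))
```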

  9. Slide 19: Singular Value Decomposition
     SVD (reduced version): For matrices $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, there exist matrices $U \in \mathbb{R}^{m \times n}$ with orthonormal columns ($U^T U = I$), $V \in \mathbb{R}^{n \times n}$ orthogonal ($V^T V = I$), and $\Sigma \in \mathbb{R}^{n \times n}$ diagonal, such that $A = U \Sigma V^T$.
     • The diagonal values of $\Sigma$ ($\sigma_1, \sigma_2, ..., \sigma_n$) are called the singular values.
     • It is customary to sort them: $\sigma_1 \geq \sigma_2 \geq ... \geq \sigma_n$.
     Slide 20: SVD Applications
     SVD is an all-rounder! Once you have $U$, $\Sigma$, $V$, you can use them to:
     • Solve linear systems $A x = b$:
       a) if $A^{-1}$ exists: compute the matrix inverse
       b) for fewer equations than unknowns
       c) for more equations than unknowns
       d) if there is no solution: compute the $x$ such that $\|A x - b\|$ is minimal
       e) compute the rank (numerical rank) of a matrix
     • ...
     • Compute PCA / KLT
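
A minimal NumPy illustration of the reduced SVD as defined above (the example matrix is arbitrary):

```python
import numpy as np

# Reduced SVD with NumPy: for A in R^{m x n} with m >= n, full_matrices=False
# returns U (m x n), the singular values in descending order, and V^T (n x n),
# with A = U @ diag(sigma) @ V^T.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0],
              [1.0, 2.0]])                          # m = 4, n = 2
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

print(sigma)                                        # sigma_1 >= sigma_2 >= ... >= sigma_n
print(np.allclose(U.T @ U, np.eye(2)))              # orthonormal columns of U
print(np.allclose(A, U @ np.diag(sigma) @ Vt))      # reconstruction of A
```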

  10. Slide 21: SVD: Matrix Inverse
      $A x = b$: $U$, $\Sigma$, $V$ exist for every $A$ ($A = U \Sigma V^T$). If $A$ is square ($n \times n$) and not singular, then $A^{-1}$ exists and
      $A^{-1} = (U \Sigma V^T)^{-1} = V \Sigma^{-1} U^T$ with $\Sigma^{-1} = \mathrm{diag}(1/\sigma_1, ..., 1/\sigma_n)$.
      Computing $A^{-1}$ for a singular $A$!? Since $U$, $\Sigma$, $V$ always exist, the only problem can originate if some $\sigma_i = 0$ or is numerically close to zero. The singular values therefore indicate whether $A$ is singular or not!
      Slide 22: SVD: Rank of a Matrix
      • The rank of $A$ is the number of non-zero singular values.
      • If there are very small singular values $\sigma_i$, then $A$ is close to being singular. We can set a threshold $t$ and set $\sigma_i = 0$ if $\sigma_i \leq t$; then $\mathrm{numeric\_rank}(A) = \#\{\sigma_i \mid \sigma_i > t\}$.
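
A sketch combining the two slides above: thresholded inversion of the singular values and the numerical rank. The helper name and the threshold value are made up; for singular or rectangular $A$ this yields a pseudo-inverse rather than $A^{-1}$.

```python
import numpy as np

# Thresholded inverse (pseudo-inverse) and numerical rank via the SVD.
# Small singular values are zeroed instead of inverted, which is what keeps
# the computation stable for (near-)singular matrices.
def svd_pinv_and_rank(A, t=1e-10):
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(sigma > t))                   # numeric_rank(A) = #{ sigma_i | sigma_i > t }
    sigma_inv = np.divide(1.0, sigma, out=np.zeros_like(sigma), where=sigma > t)
    A_pinv = Vt.T @ np.diag(sigma_inv) @ U.T        # V Sigma^{-1} U^T (pseudo-inverse if singular)
    return A_pinv, rank

A = np.array([[1.0, 2.0],
              [2.0, 4.000000001]])                  # numerically almost singular
A_pinv, rank = svd_pinv_and_rank(A, t=1e-6)
print(rank)                                         # 1: the tiny second singular value is dropped
print(np.allclose(A_pinv, np.linalg.pinv(A, rcond=1e-6)))  # agrees here (pinv's rcond is a relative cutoff)
```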
