Feature Selection: Linear Transformations

$$y_{\text{new}} = M\, x_{\text{old}}$$

Constraint Optimization (insertion)

Problem: Given an objective function $f(x)$ to be optimized, let the constraints be given by $h_k(x) = c_k$. Moving the constants to the left gives $g_k(x) := h_k(x) - c_k = 0$. Both $f(x)$ and the $g_k(x)$ must have continuous first partial derivatives.

A solution: Lagrange multipliers. A constrained optimum satisfies

$$0 = \nabla_x f(x) + \sum_k \lambda_k \nabla_x g_k(x),$$

or, starting from the Lagrangian $L(x, \lambda) = f(x) + \sum_k \lambda_k g_k(x)$, the condition $\nabla_x L(x, \lambda) = 0$.

The Covariance Matrix (insertion)

Definition: Let $x = \{x_1, ..., x_N\} \in \mathbb{R}^N$ be a real-valued random variable (data vector), with mean $E[x] = \mu$. We define the covariance matrix $\Sigma_x$ of the random variable x as

$$\Sigma_x := E[(x - \mu)(x - \mu)^T]$$

with matrix elements $\Sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$.

Application: estimating $E[x]$ and $E[(x - E[x])(x - E[x])^T]$ from data. We assume m samples of the random variable $x \in \mathbb{R}^N$, that is, we have a set of m vectors $\{x_1, ..., x_m\} \subset \mathbb{R}^N$, or, when put into a data matrix, $X \in \mathbb{R}^{N \times m}$.

The maximum-likelihood estimators for $\mu$ and $\Sigma_x$ are:

$$\mu_{ML} = \frac{1}{m} \sum_{k=1}^{m} x_k$$

$$\Sigma_{ML} = \frac{1}{m} \sum_{k=1}^{m} (x_k - \mu_{ML})(x_k - \mu_{ML})^T = \frac{1}{m} X X^T$$

(the last equality for mean-free data).
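A minimal numpy sketch of these estimators (the data, seed, and distribution parameters are assumed for illustration):

```python
# ML estimators for mean and covariance of m sample vectors x_k in R^N.
import numpy as np

rng = np.random.default_rng(0)
N, m = 3, 1000
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=np.diag([1.0, 4.0, 0.25]), size=m).T  # N x m

mu_ml = X.mean(axis=1, keepdims=True)      # (1/m) sum_k x_k
Xc = X - mu_ml                             # mean-free data
Sigma_ml = (Xc @ Xc.T) / m                 # (1/m) sum_k (x_k - mu)(x_k - mu)^T

# np.cov with bias=True uses the same 1/m normalization:
assert np.allclose(Sigma_ml, np.cov(X, bias=True))
```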

KLT/PCA Motivation

  • Find meaningful “directions” in correlated data
  • Linear dimensionality reduction
  • Visualization of higher dimensional data
  • Compression / Noise reduction
  • PDF estimation

Karhunen-Loève Transform: 1st Derivation

Problem: Let $x = \{x_1, ..., x_N\} \in \mathbb{R}^N$ be a feature vector of zero-mean, real-valued random variables. We seek the direction $a_1$ of maximum variance:

$$y_1 = a_1^T x \quad \text{such that } E[y_1^2] \text{ is maximal, with the constraint } a_1^T a_1 = 1.$$

This is a constrained optimization, so we use the Lagrangian (with Lagrange multiplier $\lambda_1$):

$$L(a_1, \lambda_1) = E[a_1^T x\, x^T a_1] - \lambda_1 (a_1^T a_1 - 1) = a_1^T \Sigma_x a_1 - \lambda_1 (a_1^T a_1 - 1)$$


Karhunen-Loève Transform

For $E[y_1^2]$ to be maximal:

$$\frac{\partial L(a_1, \lambda_1)}{\partial a_1} = 0 \;\Rightarrow\; \Sigma_x a_1 - \lambda_1 a_1 = 0$$

⇒ $a_1$ must be an eigenvector of $\Sigma_x$ with eigenvalue $\lambda_1$. Then

$$E[y_1^2] = a_1^T \Sigma_x a_1 = \lambda_1$$

⇒ for $E[y_1^2]$ to be maximal, $\lambda_1$ must be the largest eigenvalue.
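A minimal numpy check of this result (data assumed for illustration): the top eigenvector of $\Sigma_x$ is the direction of maximum variance, and the variance of the projection equals $\lambda_1$:

```python
# The direction of maximum variance is the leading eigenvector of Sigma_x,
# and E[y1^2] = a1^T Sigma_x a1 = lambda_1.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=50000).T

Sigma = (X @ X.T) / X.shape[1]           # zero mean assumed, as on the slide
eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh: eigenvalues in ascending order
a1, lam1 = eigvecs[:, -1], eigvals[-1]

y1 = a1 @ X
print(np.var(y1), lam1)                  # approximately equal
```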

Karhunen-Loève Transform

Now let's search for a second direction $a_2$ such that $y_2 = a_2^T x$ with $E[y_2^2]$ maximal, and

$$a_2^T a_1 = 0 \quad \text{and} \quad a_2^T a_2 = 1.$$

A similar derivation, $L(a_2, \lambda_2) = a_2^T \Sigma_x a_2 - \lambda_2 (a_2^T a_2 - 1)$ with $a_2^T a_1 = 0$,

⇒ $a_2$ must be the eigenvector of $\Sigma_x$ associated with the second-largest eigenvalue $\lambda_2$.

We can derive N orthonormal directions that maximize the variance: $A = [a_1, a_2, ..., a_N]$ and $y = A^T x$. The resulting matrix A is known as the Principal Component Analysis (PCA) or Karhunen-Loève transform (KLT):

$$y = A^T x, \qquad x = \sum_{i=1}^{N} y_i a_i$$

Karhunen-Loève Transform: 2nd Derivation

Problem: Let $x = \{x_1, ..., x_N\} \in \mathbb{R}^N$ be a feature vector of zero-mean, real-valued random variables. We seek a transformation A of x that results in a new set of variables $y = A^T x$ (feature vectors) which are uncorrelated, i.e. $E[y_i y_j] = 0$ for $i \neq j$.

  • Let $y = A^T x$; then by the definition of the correlation matrix:

$$R_y = E[y\, y^T] = E[A^T x\, x^T A] = A^T E[x\, x^T] A = A^T R_x A$$

  • $R_x$ is symmetric ⇒ its eigenvectors are mutually orthogonal.

Karhunen-Loève Transform

  • i.e. if we choose A such that its columns $a_i$ are the orthonormal eigenvectors of $R_x$, we get:

$$R_y = A^T R_x A = \mathrm{diag}(\lambda_1, ..., \lambda_N)$$

  • If we further assume $R_x$ to be positive definite, the eigenvalues $\lambda_i$ will be positive.

The resulting matrix A is known as the Karhunen-Loève transform (KLT):

$$y = A^T x, \qquad x = \sum_{i=1}^{N} y_i a_i$$

Karhunen-Loève Transform

The Karhunen-Loève transform (KLT):

$$y = A^T x, \qquad x = \sum_{i=1}^{N} y_i a_i$$

For mean-free vectors (e.g. replace x by $x - E[x]$) this process diagonalizes the covariance matrix $\Sigma_y$.
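A minimal numpy check (covariance matrix assumed for illustration) that the eigenvector basis diagonalizes $\Sigma_x$:

```python
# With A's columns set to the orthonormal eigenvectors of Sigma_x,
# the transform y = A^T x has diagonal covariance Sigma_y = A^T Sigma_x A.
import numpy as np

Sigma_x = np.array([[4.0, 1.5, 0.5],
                    [1.5, 2.0, 0.3],
                    [0.5, 0.3, 1.0]])
_, A = np.linalg.eigh(Sigma_x)     # columns of A: orthonormal eigenvectors

Sigma_y = A.T @ Sigma_x @ A
print(np.round(Sigma_y, 10))       # diagonal, entries = eigenvalues of Sigma_x
```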

KLT Properties: MSE-Approximation

We define a new vector in an m-dimensional subspace (m < N), using only m basis vectors:

$$\hat{x} = \sum_{i=1}^{m} y_i a_i$$

  • This is the projection of x into the subspace spanned by the m used (orthonormal) eigenvectors.

Now, what is the expected mean square error between x and its projection $\hat{x}$?

$$E[\|x - \hat{x}\|^2] = E\left[\Big\|\sum_{i=m+1}^{N} y_i a_i\Big\|^2\right] = \sum_{i=m+1}^{N} \sum_{j=m+1}^{N} E[y_i y_j]\, a_i^T a_j$$


KLT Properties: MSE-Approximation

Since $a_i^T a_j = \delta_{ij}$ and $E[y_i y_j] = 0$ for $i \neq j$:

$$E[\|x - \hat{x}\|^2] = \sum_{i=m+1}^{N} \sum_{j=m+1}^{N} E[y_i y_j]\, a_i^T a_j = \sum_{i=m+1}^{N} E[y_i^2] = \sum_{i=m+1}^{N} \lambda_i$$

The error is minimized if we choose as basis those eigenvectors corresponding to the m largest eigenvalues of the correlation matrix.

  • Amongst all possible orthogonal transforms, the KLT is the one leading to minimum MSE.

This form of the KLT (as presented here) is also referred to as Principal Component Analysis (PCA). The principal components are the eigenvectors, ordered (descending) by their respective eigenvalue magnitudes $\lambda_i$.
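A minimal numpy check (data assumed for illustration) that the reconstruction error from the m leading eigenvectors equals the sum of the discarded eigenvalues:

```python
# MSE of the m-dimensional PCA approximation = sum of discarded eigenvalues.
import numpy as np

rng = np.random.default_rng(3)
N, n_samples, m = 5, 2000, 2
C = np.diag([5.0, 3.0, 1.0, 0.5, 0.1])
X = rng.multivariate_normal(np.zeros(N), C, size=n_samples).T   # N x n_samples

eigvals, A = np.linalg.eigh((X @ X.T) / n_samples)
idx = np.argsort(eigvals)[::-1]                # sort descending
eigvals, A = eigvals[idx], A[:, idx]

Am = A[:, :m]                                  # keep the m leading eigenvectors
X_hat = Am @ (Am.T @ X)                        # project and reconstruct
mse = np.mean(np.sum((X - X_hat)**2, axis=0))
print(mse, eigvals[m:].sum())                  # equal (up to floating point),
# since the eigenvectors come from the same sample moment matrix
```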

KLT Properties

Total variance

  • Let w.l.o.g. $E[x] = 0$ and let $y = A^T x$ be the KLT (PCA) of x. From the previous definitions we get:

$$\sigma_{y_i}^2 = E[y_i^2] = \lambda_i$$

  • i.e. the eigenvalues of the input covariance matrix are equal to the variances of the transformed coordinates.
  • Selecting those features corresponding to the m largest eigenvalues retains the maximal possible total variance (sum of component variances) associated with the original random variables $x_i$.

KLT Properties: Entropy

For a random vector y, the entropy

$$H_y = -E[\ln p_y(y)]$$

is a measure of the randomness of the underlying process.

Example: for a zero-mean ($\mu = 0$) m-dimensional Gaussian:

$$H_y = -E\left[\ln\left((2\pi)^{-m/2}\, |\Sigma_y|^{-1/2} \exp\!\big(-\tfrac{1}{2} y^T \Sigma_y^{-1} y\big)\right)\right] = \frac{m}{2}\ln(2\pi) + \frac{1}{2}\ln|\Sigma_y| + \frac{1}{2} E[y^T \Sigma_y^{-1} y]$$

Using

$$E[y^T \Sigma_y^{-1} y] = E[\mathrm{trace}(\Sigma_y^{-1} y\, y^T)] = \mathrm{trace}(\Sigma_y^{-1} \Sigma_y) = \mathrm{trace}(I) = m,$$

this gives

$$H_y = \frac{m}{2}\ln(2\pi) + \frac{m}{2} + \frac{1}{2}\sum_{i=1}^{m} \ln \lambda_i$$

  • Selecting those features corresponding to the m largest eigenvalues maximizes the entropy in the remaining features.
  • No wonder: variance and randomness are directly related!
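A minimal sketch (eigenvalues assumed for illustration) evaluating this entropy formula and comparing it against scipy's Gaussian entropy:

```python
# Entropy of a zero-mean Gaussian from the eigenvalues of its covariance.
import numpy as np
from scipy.stats import multivariate_normal

lam = np.array([4.0, 2.0, 0.5])        # eigenvalues of Sigma_y (assumed)
m = lam.size
H = 0.5 * m * np.log(2 * np.pi) + 0.5 * m + 0.5 * np.log(lam).sum()

print(H, multivariate_normal(cov=np.diag(lam)).entropy())  # equal (up to fp)
```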

Computing a PCA:

Problem: Given mean-free data X, a set of n feature vectors $x_i \in \mathbb{R}^m$, compute the orthonormal eigenvectors $a_i$ of the correlation matrix $R_x$.

  • There are many algorithms that can compute eigenvectors of a matrix very efficiently. However, most of these methods can be very unstable in certain special cases.
  • Here we present the SVD, a method that is in general not the most efficient one. However, it can be made numerically stable very easily!

Singular Value Decomposition: an Excursus to Linear Algebra (without Proofs)

SVD (reduced version): For matrices $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, there exist matrices $U \in \mathbb{R}^{m \times n}$ with orthonormal columns ($U^T U = I$), $V \in \mathbb{R}^{n \times n}$ orthogonal ($V^T V = I$), and $\Sigma \in \mathbb{R}^{n \times n}$ diagonal, with

$$A = U \Sigma V^T$$

  • The diagonal values of $\Sigma$ ($\sigma_1, \sigma_2, ..., \sigma_n$) are called the singular values.
  • It is customary to sort them: $\sigma_1 \geq \sigma_2 \geq ... \geq \sigma_n$

[Diagram: block shapes of A = U Σ V^T for an m × n matrix]
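A minimal numpy sketch of the reduced SVD (matrix dimensions assumed for illustration); numpy returns the singular values already sorted in descending order:

```python
# Reduced SVD of an m x n matrix (m >= n).
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))                       # m=6, n=3

U, s, Vt = np.linalg.svd(A, full_matrices=False)      # reduced version
print(U.shape, s.shape, Vt.shape)                     # (6, 3) (3,) (3, 3)
print(s)                                              # sigma_1 >= sigma_2 >= sigma_3
assert np.allclose(U.T @ U, np.eye(3))                # orthonormal columns
assert np.allclose(A, U @ np.diag(s) @ Vt)            # A = U Sigma V^T
```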

SVD Applications:

SVD is an all-rounder! Once you have U, Σ, V, you can use them to:

  • Solve linear systems A x = b:
    a) if $A^{-1}$ exists: compute the matrix inverse
    b) for fewer equations than unknowns
    c) for more equations than unknowns
    d) if there is no solution: compute the x with $\|A x - b\| = \min$
    e) compute the rank (numerical rank) of a matrix
  • …
  • Compute PCA / KLT

SVD: Matrix Inverse $A^{-1}$

$A = U \Sigma V^T$: U, Σ, V exist for every A.

If A is square (n×n) and not singular, then $A^{-1}$ exists:

$$A^{-1} = (U \Sigma V^T)^{-1} = V \Sigma^{-1} U^T, \qquad \Sigma^{-1} = \mathrm{diag}\!\left(\frac{1}{\sigma_1}, ..., \frac{1}{\sigma_n}\right)$$

Computing $A^{-1}$ for a singular A!? Since U, Σ, V all exist, the only problem can originate if some $\sigma_i = 0$ or is numerically close to zero.

  • → The singular values indicate whether A is singular or not!

SVD: Rank of a Matrix

  • The rank of A is the number of non-zero singular values.
  • If there are very small singular values $\sigma_i$, then A is close to being singular.

We can set a threshold t and set $\sigma_i = 0$ if $\sigma_i \leq t$; then

$$\text{numeric\_rank}(A) = \#\{\, \sigma_i \mid \sigma_i > t \,\}$$

[Diagram: block shapes of A = U Σ V^T with singular values σ1, σ2, ..., σn]
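A minimal numpy sketch of the numeric rank (the matrix and the threshold are assumed choices for illustration; the threshold used here is one common heuristic, not prescribed by the slides):

```python
# Numeric rank = number of singular values above a threshold t.
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 3))
A = B @ np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])                 # rank-deficient by construction

s = np.linalg.svd(A, compute_uv=False)
t = s.max() * max(A.shape) * np.finfo(A.dtype).eps  # common threshold choice
print(s, (s > t).sum())                             # numeric_rank(A) = 2
```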

SVD: Rank of a Matrix (2)

  • numeric_rank(A) = #{ $\sigma_i \mid \sigma_i > t$ }; the rank of A is equal to dim(Img(A))
  • n = dim(Img(A)) + dim(Ker(A))
  • The columns of U corresponding to $\sigma_i \neq 0$ span the range of A.
  • The columns of V corresponding to $\sigma_i = 0$ span the nullspace of A.

[Diagram: block shapes of A = U Σ V^T with singular values σ1, σ2, ..., σs]

Remember linear mappings A x = b, with A mapping between $\mathbb{R}^n$ and $\mathbb{R}^m$:

1) Case $A^{-1}$ exists.
2) A is singular: dim(Ker(A)) ≠ 0.

[Diagram: the mapping A between the two spaces and the position of b in both cases]

SVD: Solving A x = b

2) A is singular: dim(Ker(A)) ≠ 0.

There is an infinite number of different x that solve A x = b! Which one should we choose? E.g. we can choose the x with $\|x\| = \min$.

→ Then we have to search in the space orthogonal to the nullspace.

SVD: Solving $\|A x - c\| = \min$

3) c is not in the range of A:

1) Projecting c into the range of A results in c*.
2) From all solutions of A x = c*, we choose the x with $\|x\| = \min$.

SVD: Solving $\|A x - c\| = \min$

Remember what we need: for any A there exist U, Σ, V with $A = U \Sigma V^T$ and $\sigma_1 \geq \sigma_2 \geq ... \geq \sigma_n$. Then

$$U \Sigma V^T x = c \;\Rightarrow\; x = V \Sigma^{-1} U^T c, \qquad \Sigma^{-1} = \mathrm{diag}\!\left(\frac{1}{\sigma_1}, ..., \frac{1}{\sigma_n}\right)$$

Computing this for a singular A!?

  • Some $\sigma_i = 0$ (or are set to 0 if $\sigma_i \leq t$).
  • → What to do in $\Sigma^{-1}$ with 1/0 = ∞?

SVD: Solving $\|A x - c\| = \min$

We need to:

1) Project c into the range of A to obtain c*.
2) From all solutions of A x = c*, choose the one with $\|x\| = \min$, that is, the x in the space orthogonal to the nullspace.

Recall:

  • The columns of U corresponding to $\sigma_i \neq 0$ span the range of A.
  • The columns of V corresponding to $\sigma_i = 0$ span the nullspace of A.

In $x = V \Sigma^{-1} U^T c$, basically all rows or columns multiplied by 1/0 are irrelevant: they correspond to directions outside the range (dropped by the projection) or inside the nullspace (excluded by the minimum-norm condition).

  • → So even setting 1/0 := 0 will lead to the correct result.
SVD at Work

For linear systems A x = b:

Case fewer equations than unknowns:
  → fill the rows of A with zeros so that n = m.

Perform the SVD on A (with n ≤ m):
  → compute U, Σ, V with $A = U \Sigma V^T$;
  → compute a threshold t; in Σ set $\sigma_i = 0$ for all $\sigma_i \leq t$, and in $\Sigma^{-1}$ set $1/\sigma_i = 0$ for all $\sigma_i \leq t$.

For linear systems: compute the pseudoinverse $A^+ = V \Sigma^{-1} U^T$ and compute $x = A^+ b$.
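A minimal numpy sketch of this recipe (the threshold value is an assumed choice): a thresholded pseudoinverse giving the minimum-norm least-squares solution, checked against numpy's lstsq:

```python
# Pseudoinverse A+ = V Sigma^-1 U^T with 1/sigma_i := 0 for sigma_i <= t.
import numpy as np

def svd_solve(A, b, t=1e-10):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > t, 1.0 / s, 0.0)        # set 1/sigma_i = 0 for sigma_i <= t
    A_plus = Vt.T @ np.diag(s_inv) @ U.T         # pseudoinverse A+
    return A_plus @ b

A = np.array([[1.0, 2.0],
              [2.0, 4.0],                        # singular: rank 1
              [0.0, 0.0]])
b = np.array([1.0, 2.0, 3.0])                    # b not in the range of A

x = svd_solve(A, b)
print(x, np.linalg.lstsq(A, b, rcond=None)[0])   # same minimum-norm solution
```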

Application: Compute PCA via SVD

Problem: Given mean-free data X, a set of n feature vectors $x_i \in \mathbb{R}^m$, compute the orthonormal eigenvectors $a_i$ of the correlation matrix $R_x$.

Now we use the SVD:

1. Move the center of mass to the origin: $x_i' = x_i - \mu$.
2. Build the data matrix from the mean-free data: $X = U \Sigma V^T$.
3. The principal axes are the eigenvectors of the covariance matrix $C = \frac{1}{n} X X^T$:

$$\frac{1}{n} X X^T = U \,\mathrm{diag}(\lambda_1, ..., \lambda_d)\, U^T$$

Application: Compute PCA via SVD (2)

With the SVD:

$$X X^T = U \Sigma V^T (U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T = U \Sigma^2 U^T$$

Since $C = \frac{1}{n} X X^T$, the eigenvalues compute to $\lambda_i = \frac{1}{n}\sigma_i^2$, with the $\sigma_i$ from the SVD and $\lambda_i$ the variance $E[y_i^2]$ of the i-th transformed coordinate.
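A minimal numpy sketch of the procedure (data assumed for illustration), checking $\lambda_i = \sigma_i^2 / n$ against a direct eigendecomposition of C:

```python
# PCA of a d x n data matrix via the SVD of the mean-free X.
import numpy as np

rng = np.random.default_rng(6)
d, n = 4, 1000
X = rng.multivariate_normal(np.zeros(d), np.diag([6.0, 3.0, 1.0, 0.2]), size=n).T

X = X - X.mean(axis=1, keepdims=True)            # 1. move center of mass to origin
U, s, Vt = np.linalg.svd(X, full_matrices=False) # 2. SVD of the data matrix
lam = s**2 / n                                   # 3. eigenvalues of C = (1/n) X X^T

C = (X @ X.T) / n
print(lam)                                       # principal axes = columns of U
print(np.sort(np.linalg.eigvalsh(C))[::-1])      # identical (up to fp)
```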

Example: PCA on Images

  • Assume we have a set of k images (of size N×N).
  • Each image can be seen as an $N^2$-dimensional point $p_i$ (lexicographically ordered); the whole set can be stored as the matrix

$$X = [\, p_1 \; p_2 \; \cdots \; p_k \,]$$

  • Computing the PCA the "naïve" way:
    • Build the correlation matrix $X X^T$ ($N^4$ elements).
    • Compute the eigenvectors from this matrix: $O((N^2)^3)$.
  • Already for small images (e.g. N = 100) this is far too expensive!
PCA on Images

Now we use the SVD:

1. Move the center of mass to the origin: $p_i' = p_i - \mu$.
2. Build the data matrix from the mean-free data: $X = [\, p_1' \; p_2' \; \cdots \; p_n' \,]$.
3. The principal axes are the eigenvectors of

$$\frac{1}{n} X X^T = U \,\mathrm{diag}(\lambda_1, ..., \lambda_d)\, U^T$$
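A minimal sketch of why this route is feasible (random pixel data stands in for real images here): we decompose the N² × k data matrix directly and never form the N² × N² matrix X Xᵀ:

```python
# Eigenface-style PCA: SVD of the N^2 x k data matrix, not of X X^T.
import numpy as np

rng = np.random.default_rng(7)
N, k = 64, 100                                    # k images of N x N pixels (assumed)
P = rng.standard_normal((N * N, k))               # columns p_i: flattened images

mean_face = P.mean(axis=1, keepdims=True)
X = P - mean_face                                 # mean-free data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # U: N^2 x k, never N^2 x N^2
eigenfaces = U.T.reshape(-1, N, N)                # up to k principal axes as images
lam = s**2 / k                                    # eigenvalues of (1/k) X X^T
print(eigenfaces.shape, lam[:5])
```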

PCA on Images

Principal components can be visualized by adding to the mean vector an eigenvector multiplied by a factor (e.g. λ).

[Figure: mean face, example faces, and eigenfaces]

PCA applied to face images

Choosing the subspace dimension r:

  • Look at the decay of the eigenvalues as a function of r.
  • Larger r means a lower expected error in the subspace approximation of the data.

[Figure: eigenvalue spectrum λ_r for r = 1, ..., k; mean face and eigenfaces]

Here the faces were normalized in eye distance and eye position.


Eigenfaces for Face Recognition

Turk, M. and Pentland, A. (1991). Face recognition using eigenfaces. In Proceedings of Computer Vision and Pattern Recognition, pages 586-591. IEEE.

In the 1990s this was the best-performing face recognition system!

PCA for Face Recognition


PCA & Discrimination

  • PCA/KLT do not use any class labels in the construction of the transform.
  • The resulting features may obscure the existence of separate groups.

PCA Summary

  • Unsupervised: no assumption about the existence or nature of groupings within the data.
  • PCA is similar to learning a Gaussian distribution for the data.
  • Optimal basis for compression (if measured via MSE).
  • As far as dimensionality reduction is concerned, this process is distribution-free, i.e. it is a mathematical method without an underlying statistical model.
  • Extracted features (PCs) often lack 'intuition'.

PCA and Neural Networks

A three-layer NN with linear hidden units, trained as an auto-encoder, develops an internal representation that corresponds to the principal components of the full data set. The transformation F1 is a linear projection onto a k-dimensional subspace (Duda, Hart and Stork: chapter 10.13.1).
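A minimal gradient-descent sketch of this statement (data, initialization, and hyperparameters are all assumed for illustration): a linear auto-encoder trained on reconstruction MSE learns the subspace spanned by the leading principal components (the principal subspace, not necessarily the eigenvectors themselves):

```python
# Linear auto-encoder x -> W2 (W1 x): its learned subspace matches the
# top-k PCA subspace of the data.
import numpy as np

rng = np.random.default_rng(8)
d, k, n = 5, 2, 2000
X = rng.multivariate_normal(np.zeros(d),
                            np.diag([9.0, 4.0, 0.3, 0.2, 0.1]), size=n).T

W1 = 0.1 * rng.standard_normal((k, d))           # encoder F1: linear projection
W2 = 0.1 * rng.standard_normal((d, k))           # decoder
lr = 0.01
for _ in range(5000):                            # plain gradient descent on MSE
    H = W1 @ X                                   # hidden code (linear units)
    E = W2 @ H - X                               # reconstruction error
    W2 -= lr * (E @ H.T) / n
    W1 -= lr * (W2.T @ E @ X.T) / n

# Compare the learned subspace (column space of W2) with the top-k PCA subspace:
U, _, _ = np.linalg.svd(X - X.mean(axis=1, keepdims=True), full_matrices=False)
Q_pca = U[:, :k]
Q_ae, _ = np.linalg.qr(W2)                       # orthonormal basis of learned subspace
overlap = np.linalg.svd(Q_pca.T @ Q_ae, compute_uv=False)
print(overlap)                                   # cosines of principal angles: ~1
```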