SLIDE 1

Principal component analysis

Course of Machine Learning Master Degree in Computer Science University of Rome “Tor Vergata” Giorgio Gambosi a.a. 2018-2019

SLIDE 2

Curse of dimensionality

In general, using many features means working in high-dimensional spaces. This implies:

  • sparseness of data
  • increase in the number of coefficients: for example, for dimension D and order 3 of the polynomial,

    y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i + \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij} x_i x_j + \sum_{i=1}^{D}\sum_{j=1}^{D}\sum_{k=1}^{D} w_{ijk} x_i x_j x_k

    and, for order M, the number of coefficients is O(D^M)

High dimensions lead to difficulties in machine learning algorithms (lower reliability or the need for a large number of coefficients): this is denoted as the curse of dimensionality.

SLIDE 3

Dimensionality reduction

  • for any given classifier, the training set size required to obtain a certain accuracy grows exponentially wrt the number of features (curse of dimensionality)
  • it is important to bound the number of features, identifying the least discriminant ones

SLIDE 4

Discriminant features

  • Discriminant feature: makes it possible to distinguish between two classes
  • Non-discriminant feature: does not allow classes to be distinguished

SLIDE 5

Searching hyperplanes for the dataset

  • verify whether training set elements lie on a hyperplane (a space of lower dimensionality), apart from a limited variability (which could be seen as noise)
  • principal component analysis looks for a d′-dimensional subspace (d′ < d) such that the projection of elements onto such a subspace is a “faithful” representation of the original dataset
  • by “faithful” representation we mean that distances between elements and their projections are small, even minimal

SLIDE 6

PCA for d′ = 0

  • Objective: represent all d-dimensional vectors x_1, . . . , x_n by means of a unique vector x_0, in the most faithful way, that is, so that

    J(x_0) = \sum_{i=1}^{n} \|x_0 - x_i\|^2

    is minimum
  • it is easy to show that

    x_0 = m = \frac{1}{n} \sum_{i=1}^{n} x_i
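A quick numerical check of this claim (my own sketch, not part of the slides; the data and names are illustrative): the mean minimizes the sum of squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # n = 100 vectors x_i in d = 5 dimensions
m = X.mean(axis=0)                 # candidate x_0 = m

def J(x0, X):
    """Sum of squared distances between x0 and every data vector."""
    return np.sum(np.linalg.norm(X - x0, axis=1) ** 2)

print(J(m, X))                     # J at the mean
print(J(m + 0.1, X))               # any other x_0 gives a strictly larger value
```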

SLIDE 7

PCA for d′ = 0

  • In fact,

    J(x_0) = \sum_{i=1}^{n} \|(x_0 - m) - (x_i - m)\|^2
           = \sum_{i=1}^{n} \|x_0 - m\|^2 - 2 \sum_{i=1}^{n} (x_0 - m)^T (x_i - m) + \sum_{i=1}^{n} \|x_i - m\|^2
           = \sum_{i=1}^{n} \|x_0 - m\|^2 - 2 (x_0 - m)^T \sum_{i=1}^{n} (x_i - m) + \sum_{i=1}^{n} \|x_i - m\|^2
           = \sum_{i=1}^{n} \|x_0 - m\|^2 + \sum_{i=1}^{n} \|x_i - m\|^2

  • since

    \sum_{i=1}^{n} (x_i - m) = \sum_{i=1}^{n} x_i - n \cdot m = n \cdot m - n \cdot m = 0

  • the second term is independent of x_0, while the first one is equal to zero for x_0 = m

SLIDE 8

PCA for d′ = 1

  • a single vector is too concise a representation of the dataset: anything related to data variability gets lost
  • a more interesting case is the one where vectors are projected onto a line passing through m

SLIDE 9

PCA for d′ = 1

  • let u_1 be a unit vector (\|u_1\| = 1) in the line direction: the line equation is then x = \alpha u_1 + m, where \alpha is the distance of x from m along the line
  • let \tilde{x}_i = \alpha_i u_1 + m be the projection of x_i (i = 1, . . . , n) onto the line: given x_1, . . . , x_n, we wish to find the set of projections minimizing the quadratic error

SLIDE 10

PCA for d′ = 1

The quadratic error is defined as

J(\alpha_1, \dots, \alpha_n, u_1) = \sum_{i=1}^{n} \|\tilde{x}_i - x_i\|^2
  = \sum_{i=1}^{n} \|(m + \alpha_i u_1) - x_i\|^2
  = \sum_{i=1}^{n} \|\alpha_i u_1 - (x_i - m)\|^2
  = \sum_{i=1}^{n} \alpha_i^2 \|u_1\|^2 + \sum_{i=1}^{n} \|x_i - m\|^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m)
  = \sum_{i=1}^{n} \alpha_i^2 + \sum_{i=1}^{n} \|x_i - m\|^2 - 2 \sum_{i=1}^{n} \alpha_i u_1^T (x_i - m)

SLIDE 11

PCA for d′ = 1

Its derivative wrt \alpha_k is

\frac{\partial}{\partial \alpha_k} J(\alpha_1, \dots, \alpha_n, u_1) = 2\alpha_k - 2 u_1^T (x_k - m)

which is zero when \alpha_k = u_1^T (x_k - m) (the orthogonal projection of x_k onto the line). The second derivative turns out to be positive,

\frac{\partial^2}{\partial \alpha_k^2} J(\alpha_1, \dots, \alpha_n, u_1) = 2

showing that what we have found is indeed a minimum.

SLIDE 12

PCA for d′ = 1

To derive the best direction u_1 of the line, we consider the covariance matrix of the dataset

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T

By plugging the values computed for \alpha_i into the definition of J(\alpha_1, \dots, \alpha_n, u_1), we get

J(u_1) = \sum_{i=1}^{n} \alpha_i^2 + \sum_{i=1}^{n} \|x_i - m\|^2 - 2 \sum_{i=1}^{n} \alpha_i^2
  = -\sum_{i=1}^{n} [u_1^T (x_i - m)]^2 + \sum_{i=1}^{n} \|x_i - m\|^2
  = -\sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 + \sum_{i=1}^{n} \|x_i - m\|^2
  = -n\, u_1^T S u_1 + \sum_{i=1}^{n} \|x_i - m\|^2

SLIDE 13

PCA for d′ = 1

  • u_1^T (x_i - m) is the projection of x_i onto the line
  • the product u_1^T (x_i - m)(x_i - m)^T u_1 is then the variance of the projection of x_i wrt the mean m
  • the sum

    \sum_{i=1}^{n} u_1^T (x_i - m)(x_i - m)^T u_1 = n\, u_1^T S u_1

    is the overall variance of the projections of the vectors x_i wrt the mean m

SLIDE 14

PCA for d′ = 1

Minimizing J(u_1) is equivalent to maximizing u_1^T S u_1. That is, J(u_1) is minimum if u_1 is the direction which keeps the maximum amount of variance in the dataset. Hence, we wish to maximize u_1^T S u_1 (wrt u_1), under the constraint \|u_1\| = 1.

By applying Lagrange multipliers, this turns out to be equivalent to maximizing

u = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)

This can be done by setting the first derivative wrt u_1,

\frac{\partial u}{\partial u_1} = 2 S u_1 - 2 \lambda_1 u_1

to 0, obtaining S u_1 = \lambda_1 u_1.

SLIDE 15

PCA for d′ = 1

Note that:

  • u is maximized if u_1 is an eigenvector of S
  • the overall variance of the projections is then equal to the corresponding eigenvalue:

    u_1^T S u_1 = u_1^T \lambda_1 u_1 = \lambda_1 u_1^T u_1 = \lambda_1

  • the variance of the projections is then maximized (and the error minimized) if u_1 is the eigenvector of S corresponding to the maximum eigenvalue \lambda_1
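A minimal NumPy sketch of this result (my own illustration, not from the slides; data and names are made up): the eigenvector of S with the largest eigenvalue carries the largest projected variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data, n x d
m = X.mean(axis=0)
S = (X - m).T @ (X - m) / X.shape[0]        # covariance matrix S

eigvals, eigvecs = np.linalg.eigh(S)        # eigh: S is symmetric
u1 = eigvecs[:, np.argmax(eigvals)]         # eigenvector with the largest eigenvalue

alphas = (X - m) @ u1                       # projections alpha_i = u1^T (x_i - m)
print(np.var(alphas), eigvals.max())        # projected variance equals lambda_1
```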

SLIDE 16

PCA for d′ > 1

  • The quadratic error is minimized by projecting vectors onto a hyperplane defined by the directions associated to the d′ eigenvectors corresponding to the d′ largest eigenvalues of S
  • If we assume data are modeled by a d-dimensional gaussian distribution with mean µ and covariance matrix Σ, PCA returns a d′-dimensional subspace corresponding to the hyperplane defined by the eigenvectors associated to the d′ largest eigenvalues of Σ
  • The projections of vectors onto that hyperplane are distributed as a d′-dimensional distribution which keeps the maximum possible amount of data variability

SLIDE 17

An example of PCA

  • Digit recognition (D = 28 × 28 = 784)

SLIDE 18

Choosing d′

The distribution of eigenvalue sizes is usually characterized by a fast initial decrease followed by a much slower one. This makes it possible to identify the number of eigenvalues to keep, and thus the dimensionality of the projections.

SLIDE 19

Choosing d′

Eigenvalues measure the amount of distribution variance kept in the projection. Let us consider, for each k < d, the value

r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{d} \lambda_i^2}

which provides a measure of the variance fraction associated to the k largest eigenvalues. Since r_1 < \dots < r_d, a given fraction p of the variance can be kept by setting

d' = \min \{\, i \in \{1, \dots, d\} : r_i > p \,\}
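A small sketch of this selection rule (mine, not from the slides), using the slide's definition of r_k:

```python
import numpy as np

def choose_dim(eigvals, p=0.95):
    """Pick the smallest d' whose leading eigenvalues keep a fraction p of the variance.

    Follows the slide's definition r_k = sum_{i<=k} lambda_i^2 / sum_i lambda_i^2.
    """
    lam = np.sort(eigvals)[::-1]              # largest eigenvalues first
    r = np.cumsum(lam ** 2) / np.sum(lam ** 2)
    return int(np.argmax(r > p)) + 1          # first index where r_k exceeds p (1-based)

print(choose_dim(np.array([5.0, 2.0, 0.5, 0.1]), p=0.9))   # -> 2
```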

SLIDE 20

Singular value decomposition

SLIDE 21

Singular Value Decomposition

Let W ∈ ℝ^{n×m} be a matrix of rank r ≤ min(n, m), and let n > m. Then, there exist

  • U ∈ ℝ^{n×r} orthonormal (that is, U^T U = I_r)
  • V ∈ ℝ^{m×r} orthonormal (that is, V^T V = I_r)
  • Σ ∈ ℝ^{r×r} diagonal

such that

W = U Σ V^T
(n × m)   (n × r) (r × r) (r × m)
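A quick NumPy illustration (my own, not from the slides) of the decomposition and the orthonormality properties, using the thin SVD of a random full-column-rank matrix (so r = m):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                        # n = 6 > m = 4

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # thin SVD: U (n x r), s (r,), Vt (r x m)
print(U.shape, s.shape, Vt.shape)
print(np.allclose(W, U @ np.diag(s) @ Vt))         # W = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)),             # U^T U = I_r
      np.allclose(Vt @ Vt.T, np.eye(4)))           # V^T V = I_r
```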

SLIDE 22

SVD in greater detail

Let us consider the matrix A = W^T W ∈ ℝ^{m×m}. Observe that

  • by definition, A has the same rank as W, that is r
  • A is symmetric: in fact, a_{ij} = w_i^T w_j by definition, where w_k is the k-th column of W; by the commutativity of the dot product, a_{ij} = w_i^T w_j = w_j^T w_i = a_{ji}
  • A is positive semidefinite, that is, x^T A x ≥ 0 for all non-null x ∈ ℝ^m: this derives from

    x^T A x = x^T (W^T W) x = (Wx)^T (Wx) = \|Wx\|^2 ≥ 0

SLIDE 23

SVD in greater detail

All eigenvalues of A are real. In fact,

  • let \lambda ∈ ℂ be an eigenvalue of A, and let v ∈ ℂ^m be a corresponding eigenvector: then, Av = \lambda v and \bar{v}^T A v = \bar{v}^T \lambda v = \lambda \bar{v}^T v
  • observe that the complex conjugates \bar{\lambda} and \bar{v} are themselves an eigenvalue-eigenvector pair for A: then, A\bar{v} = \bar{\lambda}\bar{v}. Since \bar{\lambda}\bar{v}^T = (\bar{\lambda}\bar{v})^T = (A\bar{v})^T = \bar{v}^T A^T = \bar{v}^T A by the symmetry of A, it follows that \bar{v}^T A v = \bar{\lambda}\bar{v}^T v
  • as a consequence, \lambda \bar{v}^T v = \bar{\lambda}\bar{v}^T v, that is, \lambda \|v\|^2 = \bar{\lambda}\|v\|^2
  • since v ≠ 0 (being an eigenvector), it must be \lambda = \bar{\lambda}, hence \lambda ∈ ℝ

SLIDE 24

SVD in greater detail

The eigenvectors of A corresponding to different eigenvalues are orthogonal:

  • let v_1, v_2 ∈ ℝ^m be two eigenvectors, with corresponding distinct eigenvalues \lambda_1, \lambda_2
  • then, by the symmetry of A, \lambda_1 (v_1^T v_2) = (\lambda_1 v_1)^T v_2 = (A v_1)^T v_2 = v_1^T A^T v_2 = v_1^T A v_2 = v_1^T \lambda_2 v_2 = \lambda_2 (v_1^T v_2)
  • as a consequence, (\lambda_1 - \lambda_2) v_1^T v_2 = 0
  • since \lambda_1 ≠ \lambda_2, it must be v_1^T v_2 = 0, that is, v_1, v_2 must be orthogonal

If an eigenvalue \lambda' has multiplicity m' > 1, it is always possible to find a set of m' orthonormal eigenvectors associated to \lambda'.

As a result, there exists a set of eigenvectors of A which provides an orthonormal basis.

SLIDE 25

SVD in greater detail

All eigenvalues of A are non-negative.

  • A is real and symmetric, hence for each eigenvalue \lambda it must be \lambda ∈ ℝ, and there must exist an eigenvector v ∈ ℝ^m such that Av = \lambda v
  • as a consequence, v^T (Av) = \lambda v^T v and

    \lambda = \frac{v^T A v}{v^T v} = \frac{v^T A v}{\|v\|^2}

  • \|v\|^2 > 0 since v is an eigenvector and, since A is positive semidefinite, v^T A v ≥ 0
  • as a consequence, \lambda ≥ 0

SLIDE 26

SVD in greater detail

Overall,

  • A = W^T W has r real and positive eigenvalues \lambda_1, . . . , \lambda_r
  • the corresponding eigenvectors v_1, . . . , v_r are orthonormal
  • A v_i = (W^T W) v_i = \lambda_i v_i,  i = 1, . . . , r

Let us define the r singular values

\sigma_i = \sqrt{\lambda_i},  i = 1, . . . , r

and let us also consider the set of vectors

u_i = \frac{1}{\sigma_i} W v_i,  i = 1, . . . , r
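A small NumPy sketch (my own illustration) of this construction: the singular values and the vectors u_i are built from the eigendecomposition of A = WᵀW, and they reassemble W.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                  # n = 6, m = 4, full column rank, so r = 4

A = W.T @ W                                  # A = W^T W, symmetric positive semidefinite
lam, V = np.linalg.eigh(A)                   # eigenvalues (ascending) and orthonormal eigenvectors
lam, V = lam[::-1], V[:, ::-1]               # reorder eigenvalues in decreasing order

sigma = np.sqrt(lam)                         # singular values sigma_i = sqrt(lambda_i)
U = W @ V / sigma                            # columns u_i = (1 / sigma_i) W v_i

print(np.allclose(U.T @ U, np.eye(4)))            # the u_i are orthonormal
print(np.allclose(W, U @ np.diag(sigma) @ V.T))   # W = U Sigma V^T
```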

SLIDE 27

SVD in greater detail

  • Observe that u_1, . . . , u_r are orthogonal; in fact:

    u_i^T u_j = \left(\frac{1}{\sigma_i} W v_i\right)^T \left(\frac{1}{\sigma_j} W v_j\right) = \frac{1}{\sigma_i \sigma_j} v_i^T W^T W v_j = \frac{1}{\sigma_i \sigma_j} v_i^T (\lambda_j v_j) = \frac{\sigma_j}{\sigma_i} v_i^T v_j

    Hence, u_i^T u_j ≠ 0 iff v_i^T v_j ≠ 0, that is, iff i = j.

  • Moreover, u_1, . . . , u_r have unit norm; in fact:

    \|u_i\|^2 = \left\|\frac{1}{\sigma_i} W v_i\right\|^2 = \frac{1}{\lambda_i} (W v_i)^T (W v_i) = \frac{1}{\lambda_i} v_i^T (W^T W v_i) = \frac{1}{\lambda_i} v_i^T (\lambda_i v_i) = \frac{1}{\lambda_i} \lambda_i (v_i^T v_i) = 1

SLIDE 28

SVD in greater detail

Let us also consider the following matrices:

  • V ∈ ℝ^{m×r}, having the vectors v_1, . . . , v_r as columns
  • U ∈ ℝ^{n×r}, having the vectors u_1, . . . , u_r as columns
  • Σ ∈ ℝ^{r×r}, having the singular values \sigma_1, . . . , \sigma_r on the diagonal

SLIDE 29

SVD in greater detail

It is easy to verify that WV = UΣ. Moreover, right-multiplying by V^T and observing that WVV^T = W (VV^T is the orthogonal projection onto the row space of W, and V^{-1} = V^T in the square case r = m), we obtain

W = U Σ V^T

with the vectors u_i as columns of U, the singular values \sigma_i on the diagonal of Σ, and the vectors v_i^T as rows of V^T.

SLIDE 30

PCA and SVD

SLIDE 31

PCA and SVD

  • Given the matrix X whose columns are the vectors x_1, x_2, . . . , x_n,
  • the mean of the vectors x_1, . . . , x_n is

    m = \frac{1}{n} X \mathbf{1}

    where \mathbf{1} is the all-ones vector of length n
  • let \tilde{X} be the matrix of such vectors translated to have zero mean:

    \tilde{X} = X - m\mathbf{1}^T = X - \frac{1}{n} X \mathbf{1}\mathbf{1}^T = X \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right)

SLIDE 32

PCA and SVD

The correlation matrix of x_1, . . . , x_n is defined as

S = \sum_{i=1}^{n} (x_i - m)(x_i - m)^T = \sum_{i=1}^{n} \tilde{x}_i \tilde{x}_i^T

where \tilde{x}_i is the i-th column of \tilde{X}. That is, S = \tilde{X}\tilde{X}^T.

\tilde{X}^T has dimension n × d: assuming n > d, we may consider its SVD

\tilde{X}^T = U \Sigma V^T

where U^T U = V^T V = I and \Sigma is a diagonal matrix.

SLIDE 33

PCA and SVD

By the properties of SVD, the squares of the diagonal elements of \Sigma are the eigenvalues of S (since S = \tilde{X}\tilde{X}^T = V \Sigma^2 V^T), and the columns of V are the corresponding eigenvectors. In summary:

  • To perform a PCA on X, it is sufficient to compute the SVD

    \tilde{X}^T = \left( X \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) \right)^T = U \Sigma V^T

  • The principal components of X are the columns of V, with corresponding eigenvalues given by the diagonal elements of \Sigma^2.
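A short NumPy sketch of this equivalence (my own illustration, not from the slides): PCA via the SVD of the centered data matrix with samples as rows (the X̃ᵀ above) agrees with the eigendecomposition of S.

```python
import numpy as np

rng = np.random.default_rng(0)
Xrows = rng.normal(size=(200, 5))             # data as rows: n x d, plays the role of X̃ᵀ
Xc = Xrows - Xrows.mean(axis=0)               # center the data

# PCA via SVD of the centered n x d matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                               # rows of Vt = columns of V = principal directions
eigvals_svd = s ** 2                          # eigenvalues of S are the squared singular values

# Same eigenvalues from the eigendecomposition of S = Xc^T Xc
S = Xc.T @ Xc
eigvals_eig = np.sort(np.linalg.eigvalsh(S))[::-1]
print(np.allclose(eigvals_svd, eigvals_eig))  # True
```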

SLIDE 34

Latent semantic analysis

SLIDE 35

Introduction

Definitions

Many models in text processing refer to co-occurrence data. Given two sets V, D (for example, a set of terms and a collection of documents), a sequence of observations W = {(w_1, d_1), . . . , (w_N, d_N)} is considered, with w_i ∈ V, d_i ∈ D (for example, these are occurrences of terms in documents).

SLIDE 36

Latent semantic analysis

Fundamental hypotheses

The Latent Semantic Analysis (LSA) approach is based on the following three hypotheses:

  • it is possible to derive semantic information from the matrix of occurrences of terms in documents
  • the reduction of dimensionality is a key aspect of this derivation
  • terms and documents can be modeled as points (vectors) in a euclidean space

Context

  1. Dictionary V of V terms t_1, t_2, . . . , t_V
  2. Collection D of D documents d_1, d_2, . . . , d_D
  3. Each document d_i is a sequence of N_i occurrences of terms in V

SLIDE 37

Model

Idea

  1. A document d_i can be seen as a multiset of N_i terms in V (bag-of-words hypothesis)
  2. There exists a correspondence between V and D, and a vector space S: each term t_i has an associated vector u_i and, likewise, to each document d_j a vector v_j in S is associated

Occurrence matrix

Let us define the matrix W ∈ ℝ^{V×D}, where w_{i,j} is associated to the occurrences of term t_i in document d_j. The value w_{i,j} derives from some measure of the number of occurrences of t_i in d_j (binary, count, tf, tf-idf, entropy, etc.).

  • Terms correspond to row vectors (of size D)
  • Documents correspond to column vectors (of size V)
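A minimal sketch of building such an occurrence matrix from raw counts (my own toy example; the documents and vocabulary are made up):

```python
import numpy as np

docs = [["cat", "dog", "cat"], ["dog", "bird"], ["cat", "bird", "bird"]]   # toy documents
vocab = sorted({t for d in docs for t in d})                               # dictionary V
index = {t: i for i, t in enumerate(vocab)}

W = np.zeros((len(vocab), len(docs)))        # V x D occurrence matrix (raw counts)
for j, d in enumerate(docs):
    for t in d:
        W[index[t], j] += 1

print(vocab)
print(W)
```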

SLIDE 38

Model

Problem

  1. The values V, D are usually quite large
  2. Vectors corresponding to t_i and d_j are very sparse
  3. Terms and documents are modeled as vectors defined on different spaces (ℝ^D and ℝ^V, respectively)

Exploit singular value decomposition.

SLIDE 39

In short

  • The occurrence matrix W is decomposed into the product of three matrices:
  • a term matrix U, with rows corresponding to terms: each term spans over r dimensions
  • a document matrix V^T, with columns corresponding to documents: each document spans over r dimensions
  • the matrix of singular values Σ, whose diagonal elements provide a measure of the relevance of the corresponding dimensions

SLIDE 40

Use of SVD

W = U Σ V^T
(V × D)   (V × r) (r × r) (r × D)

Effect

Rows of W (terms) are projected onto an r-dimensional subspace of ℝ^D. The rows of V^T provide a basis of such a subspace, hence each term is associated to a linear combination of these basis vectors. In particular, each term t_i is represented, wrt that basis, by the i-th row of UΣ: the value u_{ik}\sigma_k provides a measure of the relevance of term t_i in the k-th topic.

SLIDE 41

Use of SVD

W^T = V Σ U^T
(D × V)   (D × r) (r × r) (r × V)

Effect

Rows of W^T (documents) are projected onto an r-dimensional subspace of ℝ^V. The rows of U^T provide a basis of such a subspace, hence each document is associated to a linear combination of these basis vectors. In particular, each document d_j is represented, wrt that basis, by the j-th row of VΣ: the value v_{jk}\sigma_k provides a measure of the presence of the k-th topic in document d_j.
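A small NumPy sketch of these two projections (my own toy example, not from the slides): the rows of UΣ give term coordinates and the rows of VΣ give document coordinates over the r latent dimensions.

```python
import numpy as np

W = np.array([[2., 0., 1.],      # toy V x D term-document count matrix
              [1., 1., 0.],      # rows = terms, columns = documents
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)
Sigma = np.diag(s)

term_coords = U @ Sigma          # i-th row: coordinates of term t_i over the r topics
doc_coords = Vt.T @ Sigma        # j-th row: coordinates of document d_j over the r topics
print(term_coords)
print(doc_coords)
```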

SLIDE 42

LSA

Dimensionality reduction

The dimension d of the projection subspace can be predefined to be less than the rank of W. In this case W ≈ \bar{W} = U \Sigma V^T, where only the d largest singular values (and the corresponding columns of U and V) are retained.

Approximation

The following property holds:

\min_{A : \mathrm{rank}(A) = d} \|W - A\|^2 = \|W - \bar{W}\|^2

That is, \bar{W} is the best approximation of W among all matrices of rank d wrt the Frobenius norm

\|A\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2
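A quick check of this optimality property (my own sketch, not from the slides): for the rank-d truncation, the squared Frobenius error equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
d = 2                                            # target rank

U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_bar = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]    # rank-d truncated SVD

err = np.linalg.norm(W - W_bar, "fro") ** 2      # squared Frobenius error
print(err, np.sum(s[d:] ** 2))                   # equal: the sum of the discarded sigma_i^2
```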

SLIDE 43

LSA

Effect

SVD provides a transformation of the two discrete vector spaces (terms in ℤ^D and documents in ℤ^V) into a unique continuous vector space T ∈ ℝ^d of lower dimension. The dimension of T is at most equal to the (unknown) rank of W, and is determined by the acceptable amount of distortion induced by the projection.

Interpretation

\bar{W} keeps most of the associations between terms and documents in W: it only drops the least significant relations.

  • Each term is now seen as a linear combination of unknown “topics”: terms with similar projections tend to appear in the same documents (or in semantically similar documents, in which similar terms appear)
  • Each document is also seen as a linear combination of the same unknown topics: documents with similar projections tend to contain the same terms (or semantically similar terms, which appear in similar documents)

SLIDE 44

LSA and clustering

Co-occurrences

  • WW^T ∈ ℤ^{V×V} provides co-occurrences of terms in V (number of documents in which both terms appear)
  • W^T W ∈ ℤ^{D×D} provides co-occurrences of documents in D (number of terms appearing in both documents)

SVD and co-occurrence matrix

By applying the SVD,

WW^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T
W^T W = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T

SLIDE 45

Term clustering

WW^T = (UΣ)(UΣ)^T: element (i, j) of WW^T is the product of the i-th row of UΣ and the j-th column of (UΣ)^T.

Proximity of terms

A reasonable measure of the proximity between two terms t_i, t_j is the number of documents in which they co-occur, that is, the value of element (i, j) in WW^T. This corresponds to the dot product of the vectors u_iΣ (the i-th row of UΣ) and u_jΣ (the j-th row of UΣ), where u_i denotes the i-th row of U. In particular, we may define

D(t_i, t_j) = \frac{1}{\cos(u_i, u_j)} = \frac{\|u_i\| \cdot \|u_j\|}{u_i u_j^T}
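A small sketch of term clustering by this kind of proximity (my own toy example, not from the slides): cosine similarities between the rows of UΣ.

```python
import numpy as np

W = np.array([[2., 0., 1.],      # toy V x D occurrence matrix (terms x documents)
              [1., 1., 0.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)
T = U @ np.diag(s)                          # term coordinates: i-th row represents term t_i

norms = np.linalg.norm(T, axis=1, keepdims=True)
cos_sim = (T / norms) @ (T / norms).T       # cosine similarity between every pair of terms
print(np.round(cos_sim, 3))
```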

SLIDE 46

Document clustering

W^T W = (VΣ)(VΣ)^T: element (i, j) of W^T W is the product of the i-th row of VΣ and the j-th column of (VΣ)^T.

A reasonable measure of the proximity between two documents d_i, d_j is the number of terms co-occurring in them, that is, the value of element (i, j) in W^T W. This corresponds to the dot product of the vectors v_iΣ (the i-th row of VΣ) and v_jΣ (the j-th row of VΣ), where v_i denotes the i-th row of V. In particular, we may define

D(d_i, d_j) = \frac{1}{\cos(v_i, v_j)} = \frac{\|v_i\| \cdot \|v_j\|}{v_i v_j^T}

SLIDE 47

Proximity of a document to a topic

Objective

Determine, given a document, the topic (in a predefined collection) which is most related to its content.

Approach

Construct a vector of weights associated to the topic: it can be seen as a further document d (a topic template). W can then be extended by attaching d as its (D + 1)-th column, thus obtaining \bar{W} ∈ ℤ^{V×(D+1)}.

SLIDE 48

Proximity of a document to a topic

\bar{W} = U Σ \bar{V}^T
(V × (D + 1))   (V × d) (d × d) (d × (D + 1))

Effect

The SVD of \bar{W} provides a vector v ∈ ℝ^d as the (D + 1)-th row of \bar{V}, such that d = UΣv^T.
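A sketch of this folding-in step (my own illustration, not from the slides): instead of recomputing the SVD, the new column d can be projected as v = dᵀUΣ⁻¹, which is equivalent to the relation d = UΣvᵀ above whenever d lies in the span of U.

```python
import numpy as np

W = np.array([[2., 0., 1.],      # toy V x D occurrence matrix
              [1., 1., 0.],
              [0., 1., 2.]])
U, s, Vt = np.linalg.svd(W, full_matrices=False)

d_new = np.array([1., 0., 2.])             # topic template / new document (length V)
v = d_new @ U @ np.diag(1.0 / s)           # fold-in: v = d^T U Sigma^{-1}, length r
print(v)
print(np.allclose(U @ np.diag(s) @ v, d_new))   # d = U Sigma v^T when d lies in span(U)
```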

SLIDE 49

Proximity of a document to a topic

\bar{W}^T \bar{W} = (\bar{V}Σ)(\bar{V}Σ)^T: element (i, D + 1) of \bar{W}^T \bar{W} is the product of the i-th row of \bar{V}Σ and the (D + 1)-th column of (\bar{V}Σ)^T.

A reasonable measure of the proximity between a document d_i and the topic d corresponds to the dot product of the vectors v_iΣ (the i-th row of \bar{V}Σ) and vΣ (the (D + 1)-th row of \bar{V}Σ). In particular, we may define

D(d_i, d) = \frac{1}{\cos(v_i, v)} = \frac{\|v_i\| \cdot \|v\|}{v_i v^T}
