SLIDE 1

Principal component analysis

DS-GA 1013 / MATH-GA 2824 Mathematical Tools for Data Science

https://cims.nyu.edu/~cfgranda/pages/MTDS_spring20/index.html
Carlos Fernandez-Granda

SLIDE 2

Discussion

SLIDE 3

Covariance matrix
The spectral theorem
Principal component analysis
Dimensionality reduction via PCA
Gaussian random vectors

SLIDE 4

Motivation: Multidimensional data

[Figure: scatter plot of the data, longitude vs. latitude]

SLIDE 5

Center of dataset

Probabilistic perspective: data sampled from a random vector x̃
What is the center of the dataset?
Possible definition: minimum difference to all the points on average

Center := arg min_{w ∈ R^d} E( ||x̃ − w||_2^2 )
        = arg min_{w ∈ R^d} ∑_{j=1}^d E( (x̃[j] − w[j])^2 )
        = [ E(x̃[1]) · · · E(x̃[d]) ]^T
        = E(x̃)

SLIDE 6

Center of dataset

In practice, we have a dataset of n d-dimensional vectors X := {x_1, . . . , x_n}
What is the center of the dataset?
Reasonable choice: the sample mean

av(X) := (1/n) ∑_{i=1}^n x_i
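A minimal NumPy sketch of the sample mean and of the centering used below (the array name X_mat, its shape, and the synthetic data are illustrative, not from the slides): each row of an n × d array is one data point, the sample mean is the column-wise average, and centering subtracts it from every point.

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(100, 2))        # n = 100 points in d = 2 dimensions

av = X_mat.mean(axis=0)                  # sample mean av(X), shape (d,)
X_centered = X_mat - av                  # c(x_i) = x_i - av(X) for every point

print(av, X_centered.mean(axis=0))       # the centered data has (numerically) zero mean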

SLIDE 7

Geometric interpretation

Geometric center := arg min_{w ∈ R^d} ∑_{i=1}^n ||x_i − w||_2^2
                  = arg min_{w ∈ R^d} ∑_{j=1}^d ∑_{i=1}^n (x_i[j] − w[j])^2
                  = [ (1/n) ∑_i x_i[1] · · · (1/n) ∑_i x_i[d] ]^T
                  = av(X)

SLIDE 8

Centering

c(x_i) := x_i − av(X)

[Figure: scatter plot of centered longitude vs. centered latitude]

SLIDE 9

Projection onto a fixed direction

[Figure: centered data, centered longitude vs. centered latitude]

SLIDE 10

Projection onto a fixed direction

[Figure: density of the component of the data in the selected direction]

SLIDE 11

Variance in direction of a fixed vector v

Var(v^T x̃) = E( (v^T x̃ − E(v^T x̃))^2 )
           = E( (v^T c(x̃))^2 )
           = v^T E( c(x̃) c(x̃)^T ) v

SLIDE 12

Covariance matrix

The covariance matrix of a random vector x̃ is defined as

Σ_x̃ := E( c(x̃) c(x̃)^T )
     = [ Var(x̃[1])         Cov(x̃[1], x̃[2])   · · ·   Cov(x̃[1], x̃[d])
         Cov(x̃[1], x̃[2])   Var(x̃[2])         · · ·   Cov(x̃[2], x̃[d])
            ...                ...              ...       ...
         Cov(x̃[1], x̃[d])   Cov(x̃[2], x̃[d])   · · ·   Var(x̃[d]) ]

SLIDE 13

Variance in direction of a fixed vector v

Var(v^T x̃) = E( (v^T x̃ − E(v^T x̃))^2 )
           = E( (v^T c(x̃))^2 )
           = v^T E( c(x̃) c(x̃)^T ) v
           = v^T Σ_x̃ v

SLIDE 14

Sample covariance matrix

For a dataset X = {x_1, . . . , x_n}

Σ_X := (1/n) ∑_{i=1}^n c(x_i) c(x_i)^T
     = [ var(X[1])          cov(X[1], X[2])   · · ·   cov(X[1], X[d])
         cov(X[1], X[2])    var(X[2])         · · ·   cov(X[2], X[d])
            ...                ...              ...       ...
         cov(X[1], X[d])    cov(X[2], X[d])   · · ·   var(X[d]) ]

where X[i] := {x_1[i], . . . , x_n[i]}
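A sketch of this computation (same illustrative n × d array convention as above): the sample covariance matrix is the average of the outer products of the centered points. Note that np.cov normalizes by 1/(n−1) by default, so bias=True is needed to match the 1/n definition on the slide.

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(100, 2))

X_centered = X_mat - X_mat.mean(axis=0)
Sigma_X = X_centered.T @ X_centered / X_mat.shape[0]   # (1/n) sum_i c(x_i) c(x_i)^T

# Same matrix via np.cov (rows = variables, hence the transpose; bias=True gives 1/n)
assert np.allclose(Sigma_X, np.cov(X_mat.T, bias=True))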

SLIDE 15

Sample variance in direction of a fixed vector v

var(P_v X) := (1/n) ∑_{i=1}^n ( v^T x_i − av(P_v X) )^2
            = (1/n) ∑_{i=1}^n ( v^T (x_i − av(X)) )^2
            = v^T ( (1/n) ∑_{i=1}^n c(x_i) c(x_i)^T ) v
            = v^T Σ_X v
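A quick numerical check of this identity (illustrative data, any unit vector v): the sample variance of the projected components v^T x_i equals v^T Σ_X v.

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(100, 2))
X_centered = X_mat - X_mat.mean(axis=0)
Sigma_X = X_centered.T @ X_centered / X_mat.shape[0]

v = np.array([3.0, 1.0])
v = v / np.linalg.norm(v)                        # unit vector defining the direction

proj = X_mat @ v                                 # components v^T x_i
var_proj = np.mean((proj - proj.mean()) ** 2)    # sample variance (1/n convention)

assert np.isclose(var_proj, v @ Sigma_X @ v)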

SLIDE 16

Sample variance = 229 (sample std = 15.1)

[Figure: centered data, centered longitude vs. centered latitude]

SLIDE 17

Sample variance = 229 (sample std = 15.1)

[Figure: density of the component in the selected direction]

SLIDE 18

f(v) := v^T Σ_X v for ||v||_2 = 1

[Figure: contour plot of the quadratic form over v[1] and v[2]]

SLIDE 19

f(v) := v^T Σ_X v for ||v||_2 = 1

[Figure: value of the quadratic form as a function of the angle (radians)]
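A plot like this can be reproduced by sweeping the angle, forming the corresponding unit vector, and evaluating the quadratic form (a sketch with synthetic data; the covariance matrix used in the slides is not given here).

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])
X_centered = X_mat - X_mat.mean(axis=0)
Sigma_X = X_centered.T @ X_centered / X_mat.shape[0]

angles = np.linspace(-np.pi, np.pi, 200)
values = [np.array([np.cos(t), np.sin(t)]) @ Sigma_X @ np.array([np.cos(t), np.sin(t)])
          for t in angles]                # f(v) = v^T Sigma_X v on the unit circle

print(max(values), min(values))           # extreme values of the quadratic form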

SLIDE 20

Covariance matrix
The spectral theorem
Principal component analysis
Dimensionality reduction via PCA
Gaussian random vectors

SLIDE 21

Quadratic form

Function f : R^d → R defined by f(x) := x^T A x, where A is a d × d symmetric matrix
Generalization of quadratic functions to multiple dimensions
Goal: Study quadratic forms when ||x||_2 = 1
Motivation: If A is a covariance matrix, f encodes directional variance

SLIDE 22

Does the function necessarily reach a maximum?

[Figure: value of the quadratic form as a function of the angle (radians)]

SLIDE 23

Does the function necessarily reach a maximum? Yes

◮ The function is continuous (second-order polynomial)
◮ Unit sphere is closed and bounded (contains all limit points)
◮ Image of unit sphere is also closed and bounded
◮ Image cannot grow towards limit it does not contain

SLIDE 24

Does the function necessarily reach a maximum? Yes

For any symmetric matrix A ∈ R^{d×d}, there exists u_1 ∈ R^d such that

u_1 = arg max_{||x||_2 = 1} x^T A x

SLIDE 25

Directional derivative

For any differentiable f : R^d → R and any v ∈ R^d such that ||v||_2 = 1

f'_v(x) := lim_{h→0} ( f(x + hv) − f(x) ) / h = ⟨ ∇f(x), v ⟩

If f'_v(x) > 0, then f(x + εv) > f(x) for sufficiently small ε > 0

SLIDE 26

Characterizing maximum of quadratic form

At the maximum u_1, we cannot have

f'_v(u_1) = ⟨ ∇f(u_1), v ⟩ ≠ 0

for any v such that u_1 + εv is in the constraint set
Wait a minute, can u_1 + εv be in our constraint set?

SLIDE 27

Tangent hyperplane

Unit sphere is the level surface of g(x) := x^T x
x + v is in the tangent plane of g at x if ∇g(x)^T v = 0
If v is in the tangent plane, then g'_v(x) = 0, so g(x + εv) ≈ g(x),
i.e. x + εv is arbitrarily close to the level surface

SLIDE 28

Can this point be a maximum of the quadratic form?

Red arrow = gradient of quadratic form
Green line = gradient of g(x) := x^T x

[Figure: contour plot of the quadratic form over v[1] and v[2], with the indicated gradients on the unit circle]

SLIDE 29

Characterizing maximum of quadratic form

If ⟨ ∇f(u_1), v ⟩ ≠ 0 for some v in the tangent plane, then f(u_1 + εv) > f(u_1) for a point that is almost on the unit sphere
Since f is continuous, there exists a y on the sphere such that f(y) ≈ f(u_1 + εv) > f(u_1)

SLIDE 30

Where is the maximum?

Red arrow = gradient of quadratic form

[Figure: contour plot of the quadratic form over v[1] and v[2], with gradient arrows on the unit circle]

SLIDE 31

Characterizing maximum of quadratic form

We need ⟨ ∇f(u_1), v ⟩ = 0 for all v in the tangent plane
Equivalent to ∇f(u_1) = λ_1 ∇g(u_1) for some λ_1 ∈ R. Then

⟨ ∇f(u_1), v ⟩ = λ_1 ⟨ ∇g(u_1), v ⟩ = 0

SLIDE 32

Maxima and minima satisfy ∇f(u_1) = λ_1 ∇g(u_1)

Red arrow = gradient of quadratic form
Green line = gradient of g(x) := x^T x

[Figure: contour plot of the quadratic form over v[1] and v[2], with the indicated gradients on the unit circle]

SLIDE 33

Conclusion

Maximum satisfies ∇f(u_1) = λ_1 ∇g(u_1)

∇f(x) = ∇(x^T A x) = 2Ax
∇g(x) = ∇(x^T x) = 2x

so Au_1 = λ_1 u_1, i.e. u_1 is an eigenvector!

SLIDE 34

Conclusion

For any symmetric A ∈ R^{d×d},

u_1 := arg max_{||x||_2 = 1} x^T A x

is an eigenvector of A. There exists λ_1 ∈ R such that Au_1 = λ_1 u_1

SLIDE 35

Value of the maximum

We have

max_{||x||_2 = 1} x^T A x = u_1^T A u_1 = λ_1

SLIDE 36

Are there more eigenvectors?

Think about A ∈ R^{3×3}
We know u_1 attains the maximum
What happens on the plane orthogonal to u_1?
Without loss of generality assume u_1 = e_3
Constraint set? Circle
Quadratic function?

x^T A x = [ x[1] x[2] ] [ A[1,1] A[1,2]
                          A[2,1] A[2,2] ] [ x[1]
                                            x[2] ]

So there exists an eigenvector u_2...
SLIDE 37

Spectral theorem

If A ∈ R^{d×d} is symmetric, then it has an eigendecomposition

A = [ u_1 u_2 · · · u_d ] diag(λ_1, λ_2, . . . , λ_d) [ u_1 u_2 · · · u_d ]^T

Eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_d are real
Eigenvectors u_1, u_2, . . . , u_d are real and orthogonal

SLIDE 38

Spectral theorem

λ_1 = max_{||x||_2 = 1} x^T A x
u_1 = arg max_{||x||_2 = 1} x^T A x

λ_k = max_{||x||_2 = 1, x ⊥ u_1, . . . , u_{k−1}} x^T A x,   2 ≤ k ≤ d
u_k = arg max_{||x||_2 = 1, x ⊥ u_1, . . . , u_{k−1}} x^T A x,   2 ≤ k ≤ d
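For symmetric matrices, numpy.linalg.eigh computes exactly this decomposition (eigenvalues in ascending order, flipped below to match λ_1 ≥ · · · ≥ λ_d). A small sketch, with an arbitrary symmetric matrix, checking the variational characterization against random unit vectors:

import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                                   # a symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)                # ascending eigenvalues, orthonormal columns
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # reorder so lambda_1 >= ... >= lambda_d

# No random unit vector beats the top eigenvector u_1
u1, lam1 = eigvecs[:, 0], eigvals[0]
for _ in range(1000):
    x = rng.normal(size=4)
    x /= np.linalg.norm(x)
    assert x @ A @ x <= lam1 + 1e-9

assert np.isclose(u1 @ A @ u1, lam1)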

SLIDE 39

How do we prove this?

Formalize intuition from 3 × 3 case through induction

SLIDE 40

Mathematical induction

If a statement S_d dependent on d satisfies:
◮ S_1 holds (basis)
◮ If S_{d−1} holds then S_d holds (step)
Then S_d is true for all natural numbers d = 1, 2, . . .

SLIDE 41

Basis

For d = 1 what is u1 and λ1?

SLIDE 42

Step

We know u_1 exists and satisfies Au_1 = λ_1 u_1
Let us consider the action of A on the orthogonal complement of u_1
We want a matrix A' such that

A' u_1 = 0
A' x = Ax if x ⊥ u_1

A − λ_1 u_1 u_1^T works

SLIDE 43

Step

We want to apply the assumption about (d − 1) × (d − 1) matrices
We need to "compress" A − λ_1 u_1 u_1^T

Let V_⊥ ∈ R^{d×(d−1)} contain an orthonormal basis of span(u_1)^⊥
V_⊥ V_⊥^T is a projection matrix

V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T = A − λ_1 u_1 u_1^T

We define the symmetric matrix B := V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ ∈ R^{(d−1)×(d−1)}

SLIDE 44

Step

By the induction assumption there exist γ_1, . . . , γ_{d−1} and w_1, . . . , w_{d−1} such that

γ_1 = max_{||y||_2 = 1} y^T B y
w_1 = arg max_{||y||_2 = 1} y^T B y

γ_k = max_{||y||_2 = 1, y ⊥ w_1, . . . , w_{k−1}} y^T B y,   2 ≤ k ≤ d − 1
w_k = arg max_{||y||_2 = 1, y ⊥ w_1, . . . , w_{k−1}} y^T B y,   2 ≤ k ≤ d − 1

SLIDE 45

Step

For any x ∈ span(u_1)^⊥, x = V_⊥ y for some y ∈ R^{d−1}

max_{||x||_2 = 1, x ⊥ u_1} x^T A x = max_{||x||_2 = 1, x ⊥ u_1} x^T (A − λ_1 u_1 u_1^T) x
                                   = max_{||x||_2 = 1, x ⊥ u_1} x^T V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T x
                                   = max_{||y||_2 = 1} y^T B y
                                   = γ_1

Inspired by this: u_k := V_⊥ w_{k−1} for k = 2, . . . , d
u_1, . . . , u_d are an orthonormal basis

SLIDE 46

Step: eigenvectors

A u_k = V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T V_⊥ w_{k−1}
      = V_⊥ B w_{k−1}
      = γ_{k−1} V_⊥ w_{k−1}
      = λ_k u_k

u_k is an eigenvector of A with eigenvalue λ_k := γ_{k−1}

SLIDE 47

Step

Let x ∈ span(u_1)^⊥ be orthogonal to u_{k'}, where 2 ≤ k' ≤ d
There is y ∈ R^{d−1} such that x = V_⊥ y and

w_{k'−1}^T y = w_{k'−1}^T V_⊥^T V_⊥ y = u_{k'}^T x = 0

SLIDE 48

Step: eigenvalues

Let x ∈ span(u_1)^⊥ be orthogonal to u_{k'}, where 2 ≤ k' ≤ d
There is y ∈ R^{d−1} such that x = V_⊥ y and w_{k'−1}^T y = 0

max_{||x||_2 = 1, x ⊥ u_1, . . . , u_{k−1}} x^T A x = max_{||x||_2 = 1, x ⊥ u_1, . . . , u_{k−1}} x^T V_⊥ V_⊥^T (A − λ_1 u_1 u_1^T) V_⊥ V_⊥^T x
                                                    = max_{||y||_2 = 1, y ⊥ w_1, . . . , w_{k−2}} y^T B y
                                                    = γ_{k−1}
                                                    = λ_k

SLIDE 49

Covariance matrix
The spectral theorem
Principal component analysis
Dimensionality reduction via PCA
Gaussian random vectors

SLIDE 50

Spectral theorem

If A ∈ R^{d×d} is symmetric, then it has an eigendecomposition

A = [ u_1 u_2 · · · u_d ] diag(λ_1, λ_2, . . . , λ_d) [ u_1 u_2 · · · u_d ]^T

Eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_d are real
Eigenvectors u_1, u_2, . . . , u_d are real and orthogonal

SLIDE 51

Variance in direction of a fixed vector v

If a random vector x̃ has covariance matrix Σ_x̃

Var(v^T x̃) = v^T Σ_x̃ v

SLIDE 52

Principal directions

Let u_1, . . . , u_d and λ_1 > . . . > λ_d be the eigenvectors/eigenvalues of Σ_x̃

λ_1 = max_{||v||_2 = 1} Var(v^T x̃)
u_1 = arg max_{||v||_2 = 1} Var(v^T x̃)

λ_k = max_{||v||_2 = 1, v ⊥ u_1, . . . , u_{k−1}} Var(v^T x̃),   2 ≤ k ≤ d
u_k = arg max_{||v||_2 = 1, v ⊥ u_1, . . . , u_{k−1}} Var(v^T x̃),   2 ≤ k ≤ d

SLIDE 53

Principal components

Let c(x̃) := x̃ − E(x̃)

pc[i] := u_i^T c(x̃), 1 ≤ i ≤ d, is the ith principal component

Var(pc[i]) = λ_i, 1 ≤ i ≤ d

SLIDE 54

Principal components are uncorrelated

For i ≠ j,

E( pc[i] pc[j] ) = E( u_i^T c(x̃) u_j^T c(x̃) )
                 = u_i^T E( c(x̃) c(x̃)^T ) u_j
                 = u_i^T Σ_x̃ u_j
                 = λ_i u_i^T u_j
                 = 0

SLIDE 55

Principal components

For a dataset X containing x_1, x_2, . . . , x_n ∈ R^d

1. Compute the sample covariance matrix Σ_X
2. Eigendecomposition of Σ_X yields the principal directions u_1, . . . , u_d
3. Center the data and compute the principal components

   pc_i[j] := u_j^T c(x_i),   1 ≤ i ≤ n, 1 ≤ j ≤ d,

   where c(x_i) := x_i − av(X)
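These three steps translate directly into NumPy (a sketch; the dataset and variable names are illustrative). Rows of X_mat are the points x_i, columns of U are the principal directions, and row i of PC contains the principal components pc_i[1], . . . , pc_i[d].

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(100, 3)) @ np.array([[2.0, 0, 0], [0, 1.0, 0], [0.5, 0, 0.2]])

# 1. Sample covariance matrix
X_centered = X_mat - X_mat.mean(axis=0)
Sigma_X = X_centered.T @ X_centered / X_mat.shape[0]

# 2. Eigendecomposition: columns of U are the principal directions u_1, ..., u_d
eigvals, U = np.linalg.eigh(Sigma_X)
eigvals, U = eigvals[::-1], U[:, ::-1]     # sort so lambda_1 >= ... >= lambda_d

# 3. Principal components pc_i[j] = u_j^T c(x_i)
PC = X_centered @ U

# Sample variance of the j-th principal component equals lambda_j
assert np.allclose(PC.var(axis=0), eigvals)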

SLIDE 56

First principal direction

[Figure: centered data (centered longitude vs. centered latitude) with the first principal direction]

SLIDE 57

First principal component

[Figure: density of the first principal component]

SLIDE 58

Second principal direction

[Figure: centered data (centered longitude vs. centered latitude) with the second principal direction]

SLIDE 59

Second principal component

[Figure: density of the second principal component]

SLIDE 60

Sample variance in direction of a fixed vector v

var(P_v X) = v^T Σ_X v

SLIDE 61

Principal directions

Let u_1, . . . , u_d and λ_1 > . . . > λ_d be the eigenvectors/eigenvalues of Σ_X

λ_1 = max_{||v||_2 = 1} var(P_v X)
u_1 = arg max_{||v||_2 = 1} var(P_v X)

λ_k = max_{||v||_2 = 1, v ⊥ u_1, . . . , u_{k−1}} var(P_v X),   2 ≤ k ≤ d
u_k = arg max_{||v||_2 = 1, v ⊥ u_1, . . . , u_{k−1}} var(P_v X),   2 ≤ k ≤ d

SLIDE 62

Sample variance = 229 (sample std = 15.1)

[Figure: centered data (centered longitude vs. centered latitude) with the selected direction]

SLIDE 63

Sample variance = 229 (sample std = 15.1)

[Figure: density of the component in the selected direction]

SLIDE 64

Sample variance = 531 (sample std = 23.1)

[Figure: centered data (centered longitude vs. centered latitude) with the first principal direction]

SLIDE 65

Sample variance = 531 (sample std = 23.1)

[Figure: density of the first principal component]

SLIDE 66

Sample variance = 46.2 (sample std = 6.80)

[Figure: centered data (centered longitude vs. centered latitude) with the second principal direction]

SLIDE 67

Sample variance = 46.2 (sample std = 6.80)

[Figure: density of the second principal component]

SLIDE 68

PCA of faces

Data set of 400 64 × 64 images from 40 subjects (10 per subject)
Each face is vectorized and interpreted as a vector in R^4096

SLIDE 69

PCA of faces

[Figure: center of the dataset and principal directions 1 to 5, with values 330, 251, 192, 152, 130]

SLIDE 70

PCA of faces

[Figure: principal directions 10, 15, 20, 30, 40, 50, with values 90.2, 70.8, 58.7, 45.1, 36.0, 30.8]

SLIDE 71

PCA of faces

[Figure: principal directions 100, 150, 200, 250, 300, 359, with values 19.0, 13.7, 10.3, 8.01, 6.14, 3.06]

SLIDE 72

Covariance matrix
The spectral theorem
Principal component analysis
Dimensionality reduction via PCA
Gaussian random vectors

SLIDE 73

Dimensionality reduction

Data with a large number of features can be difficult to analyze or process
Dimensionality reduction is a useful preprocessing step
If data are modeled as vectors in R^p we can reduce the dimension by projecting onto R^k, where k < p
For orthogonal projections, the new representation is ⟨v_1, x⟩, ⟨v_2, x⟩, . . . , ⟨v_k, x⟩ for a basis v_1, . . . , v_k of the subspace that we project on
Problem: How do we choose the subspace?
Possible criterion: Capture as much sample variance as possible

SLIDE 74

Captured variance

For any orthonormal v_1, . . . , v_k

∑_{i=1}^k var(P_{v_i} X) = ∑_{i=1}^k (1/n) ∑_{j=1}^n v_i^T c(x_j) c(x_j)^T v_i
                         = ∑_{i=1}^k v_i^T Σ_X v_i

By the spectral theorem, eigenvectors optimize each individual term

SLIDE 75

Eigenvectors also optimize sum

For any symmetric A ∈ R^{d×d} with eigenvectors u_1, . . . , u_k

∑_{i=1}^k u_i^T A u_i ≥ ∑_{i=1}^k v_i^T A v_i

for any k orthonormal vectors v_1, . . . , v_k

SLIDE 76

Proof by induction on k

Base (k = 1)? Follows from spectral theorem

SLIDE 77

Step

Let S := span(v_1, . . . , v_k)
For any orthonormal basis b_1, . . . , b_k of S, VV^T = BB^T
(V has columns v_1, . . . , v_k and B has columns b_1, . . . , b_k)
Choice of basis does not change the cost function:

∑_{i=1}^k v_i^T A v_i = trace( V^T A V ) = trace( A V V^T ) = trace( A B B^T ) = ∑_{i=1}^k b_i^T A b_i

Let's choose wisely

SLIDE 78

Step

We choose b orthogonal to u_1, . . . , u_{k−1}
By the spectral theorem

u_k^T A u_k ≥ b^T A b

Now choose an orthonormal basis b_1, b_2, . . . , b_k for S so that b_k := b
By the induction assumption

∑_{i=1}^{k−1} u_i^T A u_i ≥ ∑_{i=1}^{k−1} b_i^T A b_i

SLIDE 79

Conclusion

For any k orthonormal vectors v_1, . . . , v_k

∑_{i=1}^k var(pc[i]) ≥ ∑_{i=1}^k var(P_{v_i} X),

where pc[i] := {pc_1[i], . . . , pc_n[i]} = P_{u_i} X
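A numerical illustration of this inequality (synthetic data; k and the random orthonormal directions are arbitrary): the first k principal directions capture at least as much sample variance as any other k orthonormal directions.

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_centered = X_mat - X_mat.mean(axis=0)
Sigma_X = X_centered.T @ X_centered / X_mat.shape[0]

eigvals, U = np.linalg.eigh(Sigma_X)
eigvals, U = eigvals[::-1], U[:, ::-1]

k = 2
captured_pca = eigvals[:k].sum()           # sum_i var(pc[i]) = lambda_1 + ... + lambda_k

# Random k orthonormal directions via a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(5, k)))
captured_other = sum(Q[:, i] @ Sigma_X @ Q[:, i] for i in range(k))

assert captured_pca >= captured_other - 1e-9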

SLIDE 80

Faces

x_i^reduced := av(X) + ∑_{j=1}^7 pc_i[j] u_j
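In code, the truncated reconstruction is the sample mean plus the first k principal components times the corresponding directions (a sketch on synthetic data standing in for the face vectors; k = 7 mirrors the slide).

import numpy as np

rng = np.random.default_rng(0)
X_mat = rng.normal(size=(50, 10)) @ rng.normal(size=(10, 10))
av = X_mat.mean(axis=0)
X_centered = X_mat - av

Sigma_X = X_centered.T @ X_centered / X_mat.shape[0]
eigvals, U = np.linalg.eigh(Sigma_X)
U = U[:, ::-1]                             # principal directions, most variance first
PC = X_centered @ U                        # principal components pc_i[j]

k = 7
X_reduced = av + PC[:, :k] @ U[:, :k].T    # x_i^reduced = av(X) + sum_{j<=k} pc_i[j] u_j

print(np.linalg.norm(X_mat - X_reduced))   # error due to the discarded directions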

SLIDE 81

Projection onto first 7 principal directions

[Figure: a face image expressed as a combination of the center and principal directions 1 to 7, with coefficients 8613, 2459, 665, 180, 301, 566, 638, 403]

SLIDE 82

Projection onto first k principal directions

[Figure: original signal and its projections onto the first 5, 10, 20, 30, 50, 100, 150, 200, 250, 300, and 359 principal directions]

SLIDE 83

Nearest-neighbor classification

Training set of points and labels {x_1, l_1}, . . . , {x_n, l_n}
To classify a new data point y, find

i* := arg min_{1 ≤ i ≤ n} ||y − x_i||_2,

and assign l_{i*} to y
Cost: O(nd) to classify a new point
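A direct NumPy sketch of this rule (the training data and labels are illustrative): compute all n distances and return the label of the closest training point.

import numpy as np

def nearest_neighbor_label(y, X_train, labels):
    # i* = argmin_i ||y - x_i||_2, cost O(n d)
    dists = np.linalg.norm(X_train - y, axis=1)
    return labels[np.argmin(dists)]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
labels = rng.integers(0, 4, size=20)
y = rng.normal(size=3)

print(nearest_neighbor_label(y, X_train, labels))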

SLIDE 84

Nearest neighbors in principal-component space

Idea: Project onto the first k principal directions beforehand
Cost reduced to O(nk)
Computing the eigendecomposition is costly, but only needs to be done once

SLIDE 85

Face recognition

Training set: 360 64 × 64 images from 40 different subjects (9 each)
Test set: 1 new image from each subject
We model each image as a vector in R^4096 (d = 4096)
To classify we:

1. Project onto the first k principal directions
2. Apply nearest-neighbor classification using the ℓ2-norm distance in R^k
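A sketch of this pipeline on synthetic data (a stand-in for the face images; the dimensions and k are illustrative): the principal directions are computed from the training set only, both training and test points are projected onto the first k directions, and nearest neighbor is applied in R^k.

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(360, 100))      # stand-in for 360 vectorized training images
labels = np.repeat(np.arange(40), 9)       # 40 subjects, 9 images each
X_test = X_train[::9] + 0.1 * rng.normal(size=(40, 100))   # noisy new image per subject

av = X_train.mean(axis=0)
Sigma_X = (X_train - av).T @ (X_train - av) / X_train.shape[0]
_, U = np.linalg.eigh(Sigma_X)
U = U[:, ::-1]                             # principal directions, most variance first

k = 41
train_pc = (X_train - av) @ U[:, :k]       # training set projected onto first k directions
test_pc = (X_test - av) @ U[:, :k]

pred = [labels[np.argmin(np.linalg.norm(train_pc - t, axis=1))] for t in test_pc]
print(np.mean(np.array(pred) == np.arange(40)))   # fraction of correctly recognized subjects
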
SLIDE 86

Performance

[Figure: number of errors as a function of the number of principal components]

SLIDE 87

Nearest neighbor in R^41

[Figure: test image, its projection, the closest projection, and the corresponding training image]

SLIDE 88

Dimensionality reduction for visualization

Motivation: Visualize high-dimensional features projected onto 2D or 3D
Example: Seeds from three different varieties of wheat: Kama, Rosa and Canadian
Features:
◮ Area
◮ Perimeter
◮ Compactness
◮ Length of kernel
◮ Width of kernel
◮ Asymmetry coefficient
◮ Length of kernel groove

SLIDE 89

Projection onto two first PDs

[Figure: scatter plot of the first principal component vs. the second principal component]

SLIDE 90

Projection onto two last PDs

[Figure: scatter plot of the (d−1)th principal component vs. the dth principal component]

SLIDE 91

Covariance matrix
The spectral theorem
Principal component analysis
Dimensionality reduction via PCA
Gaussian random vectors

SLIDE 92

Gaussian random variables

The pdf of a Gaussian or normal random variable ã with mean µ and standard deviation σ is given by

f_ã(a) = (1 / (√(2π) σ)) exp( −(a − µ)^2 / (2σ^2) )

SLIDE 93

Gaussian random variables

[Figure: Gaussian pdfs f_ã(a) for (µ = 2, σ = 1), (µ = 0, σ = 2), and (µ = 0, σ = 4)]

SLIDE 94

Gaussian random variables

µ = ∫_{−∞}^{∞} a f_ã(a) da

σ^2 = ∫_{−∞}^{∞} (a − µ)^2 f_ã(a) da

SLIDE 95

Linear transformation of Gaussian

If ã is a Gaussian random variable with mean µ and standard deviation σ, then for any α, β ∈ R

b̃ := αã + β

is a Gaussian random variable with mean αµ + β and standard deviation |α| σ

SLIDE 96

Proof

Let α > 0 (the proof for α < 0 is very similar)

F_b̃(b) = P( b̃ ≤ b )
        = P( αã + β ≤ b )
        = P( ã ≤ (b − β)/α )
        = ∫_{−∞}^{(b−β)/α} (1 / (√(2π) σ)) exp( −(a − µ)^2 / (2σ^2) ) da
        = ∫_{−∞}^{b} (1 / (√(2π) ασ)) exp( −(w − αµ − β)^2 / (2α^2 σ^2) ) dw     (change of variables w := αa + β)

Differentiating with respect to b:

f_b̃(b) = (1 / (√(2π) ασ)) exp( −(b − αµ − β)^2 / (2α^2 σ^2) )

SLIDE 97

Gaussian random vector

A Gaussian random vector x̃ is a random vector with joint pdf

f_x̃(x) = (1 / √( (2π)^d |Σ| )) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )

where µ ∈ R^d is the mean and Σ ∈ R^{d×d} the covariance matrix
Σ ∈ R^{d×d} is positive definite (positive eigenvalues)

SLIDE 98

Contour surfaces

Set of points at which the pdf is constant (assuming µ = 0):

c = x^T Σ^{−1} x = x^T U Λ^{−1} U^T x = ∑_{i=1}^d (u_i^T x)^2 / λ_i

Ellipsoid with axes proportional to √λ_i

SLIDE 99

2D example

µ = 0

Σ = [  0.5  −0.3
      −0.3   0.5 ]

λ_1 = 0.8    λ_2 = 0.2

u_1 = [ 1/√2, −1/√2 ]^T    u_2 = [ 1/√2, 1/√2 ]^T

What does the ellipse look like?
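These values can be checked directly with a short sketch: the eigendecomposition of Σ recovers λ_1 = 0.8, λ_2 = 0.2 and eigenvectors proportional to (1, −1) and (1, 1) (possibly with flipped signs).

import numpy as np

Sigma = np.array([[0.5, -0.3],
                  [-0.3, 0.5]])

eigvals, U = np.linalg.eigh(Sigma)
eigvals, U = eigvals[::-1], U[:, ::-1]     # lambda_1 = 0.8, lambda_2 = 0.2

print(eigvals)                             # [0.8 0.2]
print(U)                                   # columns proportional to (1, -1) and (1, 1)
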
SLIDE 100

Contour surfaces

[Figure: contour plot of the pdf over x[1] and x[2]]

SLIDE 101

Contour surfaces

[Figure: contour plot of the pdf over x[1] and x[2], with the ellipse axes √λ_1 u_1 and √λ_2 u_2]

SLIDE 102

Uncorrelation implies independence

If the covariance matrix is diagonal,

Σ_x̃ = diag( σ_1^2, σ_2^2, . . . , σ_d^2 )

the entries of a Gaussian random vector are independent

SLIDE 103

Proof

Σ_x̃^{−1} = diag( 1/σ_1^2, 1/σ_2^2, . . . , 1/σ_d^2 )

|Σ| = ∏_{i=1}^d σ_i^2

SLIDE 104

Proof

f_x̃(x) = (1 / √( (2π)^d |Σ| )) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )
        = ∏_{i=1}^d (1 / (√(2π) σ_i)) exp( −(x[i] − µ[i])^2 / (2σ_i^2) )
        = ∏_{i=1}^d f_x̃[i]( x[i] )

SLIDE 105

Linear transformations

Let x̃ be a Gaussian random vector of dimension d with mean µ and covariance matrix Σ
For any matrix A ∈ R^{m×d} and b ∈ R^m

ỹ = Ax̃ + b

is Gaussian with mean Aµ + b and covariance matrix AΣA^T (as long as it is full rank)
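A sampling sanity check of this property (the dimensions, matrices, and sample size are arbitrary): applying Ax̃ + b to Gaussian samples produces samples whose empirical mean and covariance approach Aµ + b and AΣA^T.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
A = rng.normal(size=(2, 3))
b = np.array([1.0, 1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of x~
y = x @ A.T + b                                        # samples of y~ = A x~ + b

print(y.mean(axis=0), A @ mu + b)                      # empirical vs. exact mean
print(np.cov(y.T, bias=True), A @ Sigma @ A.T)         # empirical vs. exact covariance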

SLIDE 106

PCA on Gaussian random vectors

Let x̃ be a Gaussian random vector with covariance matrix Σ := UΛU^T
The principal components

pc := U^T x̃

are Gaussian and have covariance matrix U^T Σ U = Λ, so they are independent
Often not the case in practice!
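A small simulation of this fact (the covariance matrix is the one from the 2D example above, used here only as an illustration): the principal components of Gaussian samples have an (empirically) diagonal covariance matrix.

import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[0.5, -0.3],
                  [-0.3, 0.5]])
_, U = np.linalg.eigh(Sigma)

x = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
pc = x @ U                                 # principal components U^T x~ for each sample

print(np.cov(pc.T, bias=True))             # approximately diagonal, eigenvalues on the diagonal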

SLIDE 107

Maximum likelihood for Gaussian vectors

Log-likelihood of the Gaussian parameters:

(µ_ML, Σ_ML) := arg max_{µ ∈ R^d, Σ ∈ R^{d×d}} log ∏_{i=1}^n (1 / √( (2π)^d |Σ| )) exp( −(1/2) (x_i − µ)^T Σ^{−1} (x_i − µ) )
             = arg min_{µ ∈ R^d, Σ ∈ R^{d×d}} (1/2) ∑_{i=1}^n (x_i − µ)^T Σ^{−1} (x_i − µ) + (n/2) log |Σ|

Solution is the sample mean and sample covariance matrix
Additional justification, but PCA is useful without the Gaussian assumption!